### **TF-IDF: Exercises**
 
- Humans 👦 express different emotions depending on the situation, either through facial expressions or in words.

- On social media platforms such as Twitter and Instagram, many people comment on particular events, and these comments can convey sadness, happiness, joy, sarcasm, fear, and many other feelings.

- Given a comment, we will use classical NLP techniques to classify which emotion it expresses.

- We will use text representations such as Bag of Words, n-grams, and TF-IDF, and apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns:
        - Comment (named `text` in the code below)
        - Emotion (named `emotion` in the code below)
- A comment is a statement or message about a particular event or situation.

- The emotion label tells which of six emotions the comment expresses: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are more than two classes, this is a **multi-class classification** problem.

In [12]:
#import pandas library
import pandas as pd

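#read the semicolon-separated file "train.txt" (one "text;emotion" pair per line) and store it in a variable df
#note: the 15999 rows printed below suggest the file has no header line, so its first row is consumed as column names;
#passing header=None to read_csv would keep that row
df = pd.read_csv(r'C:\Users\teore\OneDrive\Documents\GitHub\NLP_tutorial\tf_idf\train.txt', sep=';')

#print the shape of dataframe
print(df.shape)

#give the two columns descriptive names
df.columns = ['text', 'emotion']

#print top 5 rows
print(df.head())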

(15999, 2)
                                                text  emotion
0  i can go from feeling so hopeless to so damned...  sadness
1   im grabbing a minute to post i feel greedy wrong    anger
2  i am ever feeling nostalgic about the fireplac...     love
3                               i am feeling grouchy    anger
4  ive been feeling a little burdened lately wasn...  sadness


In [13]:
#check the distribution of Emotion
df['emotion'].value_counts()

emotion
joy         5362
sadness     4665
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64
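The counts above are easier to judge as proportions. The following is a small sketch in plain Python that reuses the numbers printed above. On the DataFrame itself, `df['emotion'].value_counts(normalize=True)` should give the same shares. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "joy": 5362, "sadness": 4665, "anger": 2159,
    "fear": 1937, "love": 1304, "surprise": 572,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:6.1%}")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

The classes are clearly imbalanced, which is one reason to use stratified sampling for the split and to look at per-class scores rather than accuracy alone.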

In [14]:
#Add the new column "emotion_num" which gives a unique number to each of these emotions
#joy --> 0, sadness --> 1, anger --> 2, fear --> 3, love --> 4, surprise --> 5
df["emotion_num"] = df['emotion'].map({'joy': 0, 'sadness': 1 , 'anger': 2 , 'fear': 3 , 'love': 4 , 'surprise': 5})

#checking the results by printing top 5 rows
print(df.head())

                                                text  emotion  emotion_num
0  i can go from feeling so hopeless to so damned...  sadness            1
1   im grabbing a minute to post i feel greedy wrong    anger            2
2  i am ever feeling nostalgic about the fireplac...     love            4
3                               i am feeling grouchy    anger            2
4  ive been feeling a little burdened lately wasn...  sadness            1
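Because the classification reports print numeric labels, it can help to keep the inverse mapping around. The sketch below inverts the dictionary used above. The names `label_to_id`, `id_to_label` and `target_names` are hypothetical helpers, not part of the original notebook.

```python
# Same mapping as in the cell above.
label_to_id = {"joy": 0, "sadness": 1, "anger": 2, "fear": 3, "love": 4, "surprise": 5}

# Invert it so numeric predictions can be turned back into emotion names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

# Names ordered by their numeric id.
target_names = [id_to_label[idx] for idx in sorted(id_to_label)]

print(target_names)
# Ids of the first three rows printed above (sadness, anger, love).
print([id_to_label[i] for i in [1, 2, 4]])
```

A list ordered like this can be passed as the `target_names` argument of `classification_report`, so that the report shows emotion names instead of 0 to 5.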


### **Modelling without Pre-processing Text data**

In [17]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' split with a test size of 20%, random state 2022 and stratified sampling
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['emotion_num'], test_size = 0.2, random_state = 2022, stratify=df['emotion_num'])

In [18]:
#print the shapes of X_train and X_test
print(X_train.shape)
print(X_test.shape)

(12799,)
(3200,)
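One way to confirm that `stratify` did its job is to compare class shares in the full data with those in the test split. The sketch below uses the class counts printed earlier and the test-set counts that appear in the `support` column of the classification reports further down. It is only a sanity check on those reported numbers, not a re-run of the split.

```python
# Counts per emotion_num in the full dataset (from value_counts above).
full_counts = {0: 5362, 1: 4665, 2: 2159, 3: 1937, 4: 1304, 5: 572}
# Counts per emotion_num in the test split (support column of the reports).
test_counts = {0: 1073, 1: 933, 2: 432, 3: 387, 4: 261, 5: 114}

full_total = sum(full_counts.values())
test_total = sum(test_counts.values())

gaps = {}
for label in full_counts:
    full_share = full_counts[label] / full_total
    test_share = test_counts[label] / test_total
    gaps[label] = abs(full_share - test_share)
    print(f"class {label}: full={full_share:.4f}  test={test_share:.4f}")

max_gap = max(gaps.values())
print(f"largest difference in share: {max_gap:.5f}")
```

The shares agree to within a fraction of a percentage point, which is what a stratified split is expected to produce.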



**Attempt 1** :

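1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with only trigrams, i.e. `ngram_range = (3, 3)`. The cell below leaves CountVectorizer at its default (unigrams only), so the report that follows reflects unigram features rather than trigrams.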
- use **RandomForest** as the classifier.
- print the classification report.
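To see what the `ngram_range` setting changes, here is a minimal sketch in plain Python that builds word n-grams for one comment from the dataset. The helper `word_ngrams` is hypothetical. It only imitates CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters) and is not the library's implementation.

```python
import re

def word_ngrams(text, ngram_range=(1, 1)):
    """Return word n-grams for every n in the inclusive ngram_range."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

sentence = "i am feeling grouchy"
print(word_ngrams(sentence, (1, 1)))  # unigrams only (the default)
print(word_ngrams(sentence, (1, 2)))  # unigrams and bigrams
print(word_ngrams(sentence, (3, 3)))  # trigrams only
```

Note that the single-character token "i" is dropped by this token pattern, and that short comments yield very few trigrams, so a trigram-only vocabulary tends to be sparse.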


In [25]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
rf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.93      0.89      1073
           1       0.91      0.90      0.91       933
           2       0.87      0.83      0.85       432
           3       0.86      0.82      0.84       387
           4       0.87      0.73      0.79       261
           5       0.79      0.72      0.75       114

    accuracy                           0.87      3200
   macro avg       0.86      0.82      0.84      3200
weighted avg       0.87      0.87      0.87      3200




**Attempt 2** :

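1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams, i.e. `ngram_range = (1, 2)`. The cell below leaves CountVectorizer at its default (unigrams only), so the report that follows reflects unigram features.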
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.


In [27]:
#import CountVectorizer, MultinomialNB, pipeline, classification_report from sklearn
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

#1. create a pipeline object
nb = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])


#2. fit with X_train and y_train
nb.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = nb.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.73      0.95      0.82      1073
           1       0.72      0.94      0.81       933
           2       0.92      0.55      0.69       432
           3       0.90      0.56      0.69       387
           4       0.91      0.22      0.36       261
           5       0.86      0.05      0.10       114

    accuracy                           0.75      3200
   macro avg       0.84      0.54      0.58      3200
weighted avg       0.79      0.75      0.72      3200
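The gap between the `macro avg` and `weighted avg` rows comes from how the per-class scores are averaged. The sketch below recomputes both averages of recall from the rounded per-class values in the report above, so the results can differ from the printed ones in the last digit.

```python
# Per-class recall and support, copied from the report above (classes 0..5).
recall = [0.95, 0.94, 0.55, 0.56, 0.22, 0.05]
support = [1073, 933, 432, 387, 261, 114]

# Macro average: every class counts equally, however rare it is.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The support-weighted recall is the same quantity as overall accuracy, which is why it matches the `accuracy` row. The much lower macro value shows that the rare classes (4 and 5, i.e. love and surprise) are mostly missed by this model.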




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [26]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
rf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.96      0.88      1073
           1       0.91      0.90      0.91       933
           2       0.91      0.79      0.85       432
           3       0.89      0.77      0.83       387
           4       0.91      0.68      0.78       261
           5       0.86      0.66      0.75       114

    accuracy                           0.87      3200
   macro avg       0.88      0.79      0.83      3200
weighted avg       0.87      0.87      0.86      3200




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to turn the text into features.
- use **RandomForest** as the classifier.
- print the classification report.
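As a reminder of what the TF-IDF vectorizer computes, here is a minimal from-scratch sketch on a toy corpus (the three sentences are made up, not taken from the dataset). It assumes scikit-learn's default settings: raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It is an illustration of those formulas, not the library's code.

```python
import math
import re
from collections import Counter

docs = ["i feel happy", "i feel sad", "i feel so angry"]  # toy corpus

# Tokenize like the default vectorizer: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Document frequency: in how many documents each term occurs.
n_docs = len(tokenized)
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df_counts.items()}

def tfidf(tokens):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    tf = Counter(tokens)
    raw = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

The word "feel" occurs in every document and therefore gets the smallest weight in each one, while words unique to a document, such as "happy", are weighted more heavily.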


In [29]:
#import TfidfVectorizer, RandomForest, pipeline, classification_report from sklearn
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
rf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.93      0.88      1073
           1       0.91      0.91      0.91       933
           2       0.88      0.79      0.83       432
           3       0.84      0.82      0.83       387
           4       0.85      0.70      0.77       261
           5       0.88      0.66      0.75       114

    accuracy                           0.86      3200
   macro avg       0.87      0.80      0.83      3200
weighted avg       0.86      0.86      0.86      3200



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [30]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [31]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
df["preprocessed_comment"] = df["text"].apply(preprocess)
# this will take some time, please be patient

**Build a model with pre-processed text**

In [32]:
#Do the 'train-test' split with a test size of 20%, random state 2022 and stratified sampling
#Note: use the preprocessed_comment column
X_train, X_test, y_train, y_test = train_test_split(df['preprocessed_comment'], df['emotion_num'], test_size = 0.2, random_state = 2022, stratify=df['emotion_num'])

**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [33]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
rf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.90      0.89      1073
           1       0.86      0.94      0.90       933
           2       0.87      0.83      0.85       432
           3       0.92      0.82      0.86       387
           4       0.80      0.74      0.77       261
           5       0.82      0.75      0.78       114

    accuracy                           0.87      3200
   macro avg       0.86      0.83      0.84      3200
weighted avg       0.87      0.87      0.87      3200




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to turn the text into features.
- use **RandomForest** as the classifier.
- print the classification report.


In [35]:
#import TfidfVectorizer, RandomForest, pipeline, classification_report from sklearn
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

#1. create a pipeline object
rf = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])


#2. fit with X_train and y_train
rf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = rf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.91      0.87      1073
           1       0.89      0.89      0.89       933
           2       0.87      0.81      0.84       432
           3       0.83      0.83      0.83       387
           4       0.82      0.65      0.73       261
           5       0.79      0.67      0.72       114

    accuracy                           0.85      3200
   macro avg       0.84      0.79      0.81      3200
weighted avg       0.85      0.85      0.85      3200



## **Please write down your final observations**
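As a starting point, the headline numbers from the six reports above can be collected side by side. The sketch below copies accuracy and macro-averaged F1 from those reports and labels each run by the vectorizer actually used in its code cell. The tuple layout and variable names are illustrative only.

```python
# (text, features, classifier, accuracy, macro-F1) copied from the reports above.
results = [
    ("raw", "CountVectorizer()", "RandomForest", 0.87, 0.84),
    ("raw", "CountVectorizer()", "MultinomialNB", 0.75, 0.58),
    ("raw", "CountVectorizer(ngram_range=(1,2))", "RandomForest", 0.87, 0.83),
    ("raw", "TfidfVectorizer()", "RandomForest", 0.86, 0.83),
    ("preprocessed", "CountVectorizer(ngram_range=(1,2))", "RandomForest", 0.87, 0.84),
    ("preprocessed", "TfidfVectorizer()", "RandomForest", 0.85, 0.81),
]

# Sort by macro-F1, then accuracy, best first.
ranked = sorted(results, key=lambda r: (r[4], r[3]), reverse=True)
for text, features, clf, acc, f1 in ranked:
    print(f"{text:<13}{features:<36}{clf:<15}acc={acc:.2f}  macro-F1={f1:.2f}")
```

When reading the ranking, keep in mind that `RandomForestClassifier()` is created without a fixed `random_state`, so differences of one or two points between the Random Forest runs may be run-to-run variation rather than a real effect of the features.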


## [**Solution**](./tf_idf_exercise_solutions.ipynb)