<a href="https://colab.research.google.com/github/sibteali786/500-AI-Machine-learning-Deep-learning-Computer-vision-NLP-Projects-with-code/blob/main/12_tf_idf/tf_idf_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **TF-IDF: Exercises**

- Humans 👦 express different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people share their views on a particular event or scenario through comments, and these comments may convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment/text, we are going to use classical NLP techniques to classify which emotion that comment expresses.

- We are going to use techniques like bag of n-grams, TF-IDF, etc. for text representation and apply different classification algorithms.
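
To make the two text representations named above concrete, here is a minimal pure-Python sketch of n-gram extraction and TF-IDF weighting. The helper names `ngrams` and `tf_idf` and the toy comments are hypothetical and only for illustration. The IDF formula is the smoothed variant `ln((1 + N) / (1 + df)) + 1`, which is assumed here because it matches scikit-learn's default; scikit-learn additionally L2-normalizes each document vector, which this sketch skips.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute unnormalized TF-IDF weights for a list of tokenized documents.

    TF is the raw term count; IDF is the smoothed ln((1 + N) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


# hypothetical toy comments, not taken from the dataset
comments = ["i feel happy today", "i feel scared", "i am so angry today"]
tokenized = [comment.split() for comment in comments]

bigrams = ngrams(tokenized[0], 2)
weights = tf_idf(tokenized)

print(bigrams)
print(weights[0])
```

A word that occurs in every comment (here "i") gets the lowest weight, while a word unique to one comment (here "happy") gets the highest, which is the behavior that makes TF-IDF useful for separating classes.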

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns:
    - Comment
    - Emotion
- Comments are statements or messages about a particular event/situation.

- The Emotion column tells whether the given comment expresses fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this problem falls under **multi-class classification**.

In [3]:
#import pandas and the helpers needed to download the CSV
import pandas as pd
import requests
from io import StringIO

#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
url = "https://github.com/codebasics/nlp-tutorials/raw/refs/heads/main/12_tf_idf/Emotion_classify_Data.csv"
response = requests.get(url)
response.raise_for_status()  #stop early if the download failed
df = pd.read_csv(StringIO(response.text))

#print the shape of dataframe (an explicit print is needed because only the last expression of a cell is displayed)
print(df.shape)

#print top 5 rows
df.head()

| | Comment | Emotion |
|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear |
| 1 | im so full of life i feel appalled | anger |
| 2 | i sit here to write i start to dig out my feel... | fear |
| 3 | ive been really angry with r and i feel like a... | joy |
| 4 | i feel suspicious if there is no one outside l... | fear |


In [5]:
#check the distribution of Emotion
df['Emotion'].value_counts()

| Emotion | count |
|---|---|
| anger | 2000 |
| joy | 2000 |
| fear | 1937 |


In [6]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df['Emotion_num'] = df['Emotion'].map({
    'joy': 0,
    'fear': 1,
    'anger': 2
})

#checking the results by printing top 5 rows
df.head()


| | Comment | Emotion | Emotion_num |
|---|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear | 1 |
| 1 | im so full of life i feel appalled | anger | 2 |
| 2 | i sit here to write i start to dig out my feel... | fear | 1 |
| 3 | ive been really angry with r and i feel like a... | joy | 0 |
| 4 | i feel suspicious if there is no one outside l... | fear | 1 |


### **Modelling without Pre-processing Text data**

In [None]:
#import train-test split


#Do the 'train-test' split with a test size of 20%
#Note: Use random_state=2022 and stratified sampling (stratify on the target column)



In [None]:
#print the shapes of X_train and X_test
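
The stratified split asked for above can be sketched on toy data as follows. The `texts` and `labels` lists are hypothetical stand-ins for the comment and target columns; only `train_test_split` and its `test_size`, `random_state`, and `stratify` parameters come from scikit-learn.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# hypothetical toy data: 10 comments per class, labels 0/1/2
texts = [f"comment {i}" for i in range(30)]
labels = [i % 3 for i in range(30)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # hold out 20% of the rows
    random_state=2022,  # fixed seed so the split is reproducible
    stratify=labels,    # keep the class proportions equal in both subsets
)

print(len(X_train), len(X_test))
print(Counter(y_train), Counter(y_test))
```

Stratifying matters most when classes are imbalanced; here the classes are nearly balanced, so it mainly guards against an unlucky shuffle.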




**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
#import CountVectorizer, RandomForestClassifier, Pipeline, classification_report from sklearn


#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.


In [None]:
#import MultinomialNB from sklearn



#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred


#4. print the classification report



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report



**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to vectorize the text.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
#import TfidfVectorizer from sklearn



#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred


#4. print the classification report


### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [None]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [None]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient



**Build a model with pre-processed text**

In [None]:
#Do the 'train-test' split with a test size of 20%, random_state=2022, and stratified sampling
#Note: Use the preprocessed_comment column


**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
#1. create a pipeline object




#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to vectorize the pre-processed text.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [None]:
#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_test and store it in y_pred



#4. print the classification report

## **Please write down your final observations**


## [**Solution**](./tf_idf_exercise_solutions.ipynb)