
### **TF-IDF: Exercises**

- Humans 👦 show different emotions/feelings depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people comment on a particular event/scenario, and these comments can express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- Given a comment/text, we are going to use classical NLP techniques to classify which emotion that comment belongs to.

- We are going to use techniques like Bag of Words, n-grams, TF-IDF, etc. for text representation and apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns:
        - Comment
        - Emotion
- Comments are statements or messages about a particular event/situation (the code below loads this column as `Sentence`).

- The Emotion column labels each comment with one of six emotions: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are six classes, this problem comes under **Multi-Class Classification.**

In [None]:
!kaggle datasets download -d praveengovi/emotions-dataset-for-nlp

!unzip emotions-dataset-for-nlp.zip



Dataset URL: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp
License(s): CC-BY-SA-4.0
Downloading emotions-dataset-for-nlp.zip to /content
  0% 0.00/721k [00:00<?, ?B/s]
100% 721k/721k [00:00<00:00, 105MB/s]
Archive:  emotions-dataset-for-nlp.zip
  inflating: test.txt                
  inflating: train.txt               
  inflating: val.txt                 


In [None]:
!mv train.txt train.csv
!mv val.txt val.csv
!mv test.txt test.csv

In [None]:
#import pandas library
import pandas as pd

#read the train, validation and test files (semicolon-separated, no header row)
columns = ['Sentence', 'Emotion']
train = pd.read_csv("train.csv", sep = ";", names=columns)
val = pd.read_csv("val.csv", sep = ";", names=columns)
test = pd.read_csv("test.csv", sep = ";", names=columns)

#print the shape of dataframe
print(train.shape)

#print top 5 rows
print(train.head())

(16000, 2)
                                            Sentence  Emotion
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger


In [None]:
#check the distribution of Emotion
train["Emotion"].value_counts()



Emotion
joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64

In [None]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2, sadness --> 3, love --> 4, surprise --> 5
emotion_to_num = {"joy": 0, "fear": 1, "anger": 2, "sadness": 3, "love": 4, "surprise": 5}

# Add the new column "Emotion_num" to the DataFrame
train["Emotion_num"] = train["Emotion"].map(emotion_to_num)

#checking the results by printing top 5 rows
print (train.head())

                                            Sentence  Emotion  Emotion_num
0                            i didnt feel humiliated  sadness            3
1  i can go from feeling so hopeless to so damned...  sadness            3
2   im grabbing a minute to post i feel greedy wrong    anger            2
3  i am ever feeling nostalgic about the fireplac...     love            4
4                               i am feeling grouchy    anger            2


### **Modelling without Pre-processing Text data**

In [None]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
X_train, X_test = train_test_split(train, test_size=0.2, random_state=2022, stratify=train["Emotion"])

#Note: Give Random state 2022 and also do the stratify sampling



In [None]:
#print the shapes of X_train and X_test
print (X_train.shape)
print (X_test.shape)


(12800, 3)
(3200, 3)
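
To show what the `stratify` argument does, here is a minimal sketch on a made-up, imbalanced toy frame (the 60/30/10 class split and the column contents are assumptions for illustration only). With stratification, the test part keeps the same class proportions as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 60 joy, 30 sadness, 10 surprise
toy = pd.DataFrame({
    "Sentence": [f"sentence {i}" for i in range(100)],
    "Emotion": ["joy"] * 60 + ["sadness"] * 30 + ["surprise"] * 10,
})

# Same settings as above: 20% test size, fixed random state, stratified on the label
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=2022, stratify=toy["Emotion"]
)

# Class counts in the 20-row test part mirror the 60/30/10 proportions
test_counts = toy_test["Emotion"].value_counts().to_dict()
print(test_counts)
```

Without `stratify`, a rare class such as surprise could end up over- or under-represented in the test part purely by chance.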



**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
pipe_RFC = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(3, 3))),
    ('classifier', RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_RFC.fit(X_train["Sentence"], X_train["Emotion"])




In [None]:
#3. get the predictions for X_test and store it in y_pred
y_pred = pipe_RFC.predict(X_test["Sentence"])


#4. print the classification report
print(classification_report(X_test["Emotion"], y_pred))

              precision    recall  f1-score   support

       anger       0.53      0.18      0.26       432
        fear       0.54      0.28      0.37       387
         joy       0.41      0.87      0.56      1072
        love       0.62      0.11      0.19       261
     sadness       0.58      0.32      0.41       933
    surprise       0.54      0.13      0.21       115

    accuracy                           0.46      3200
   macro avg       0.54      0.31      0.33      3200
weighted avg       0.51      0.46      0.41      3200




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.
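
Before running this on the full dataset, it can help to see the same pipeline shape on a tiny example. The sketch below trains a unigram + bigram `CountVectorizer` with `MultinomialNB` on four made-up sentences (assumed for illustration, not taken from the dataset) and predicts two unseen ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training sentences and labels
toy_texts = [
    "i feel so happy and joyful",
    "this makes me so happy",
    "i am scared and afraid",
    "i feel afraid of the dark",
]
toy_labels = ["joy", "joy", "fear", "fear"]

# Same pipeline structure as in the exercise: vectorizer followed by classifier
toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("classifier", MultinomialNB()),
])
toy_pipe.fit(toy_texts, toy_labels)

# Store the predictions in a variable so later steps score the right model
toy_pred = list(toy_pipe.predict(["so happy today", "afraid and scared"]))
print(toy_pred)
```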


In [None]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
pipe_multi = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])


#2. fit with X_train and y_train
pipe_multi.fit(X_train["Sentence"], X_train["Emotion"])

#3. get the predictions for X_test and store it in y_pred
#   (without this assignment, the report would score the stale y_pred from Attempt 1)
y_pred = pipe_multi.predict(X_test["Sentence"])

#4. print the classification report
print(classification_report(X_test["Emotion"], y_pred))

              precision    recall  f1-score   support

       anger       0.54      0.18      0.26       432
        fear       0.57      0.26      0.36       387
         joy       0.41      0.88      0.56      1072
        love       0.63      0.11      0.19       261
     sadness       0.58      0.32      0.41       933
    surprise       0.50      0.13      0.21       115

    accuracy                           0.46      3200
   macro avg       0.54      0.31      0.33      3200
weighted avg       0.52      0.46      0.41      3200




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#1. create a pipeline object
pipe_uni_bi = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', RandomForestClassifier())
])

#2. fit with X_train and y_train
pipe_uni_bi.fit(X_train["Sentence"], X_train["Emotion"])


#3. get the predictions for X_test and store it in y_pred
#   (without this assignment, the report would reuse an earlier y_pred)
y_pred = pipe_uni_bi.predict(X_test["Sentence"])


#4. print the classification report
print(classification_report(X_test["Emotion"], y_pred))

              precision    recall  f1-score   support

       anger       0.53      0.18      0.26       432
        fear       0.54      0.28      0.37       387
         joy       0.41      0.87      0.56      1072
        love       0.62      0.11      0.19       261
     sadness       0.58      0.32      0.41       933
    surprise       0.54      0.13      0.21       115

    accuracy                           0.46      3200
   macro avg       0.54      0.31      0.33      3200
weighted avg       0.51      0.46      0.41      3200




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** for pre-processing the text.
- use **RandomForest** as the classifier.
- print the classification report.
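
For reference, scikit-learn's `TfidfVectorizer` with default settings weights a term count by a smoothed inverse document frequency and then scales each document vector to unit length:

$$\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)$$

where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. The sketch below recomputes these weights by hand on three made-up documents and checks them against the vectorizer. It assumes the default parameters (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents, used only for illustration
docs = ["good movie", "bad movie", "good good plot"]

# Raw term counts tf(t, d)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency df(t): number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as used by scikit-learn's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then L2-normalize each row
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result
library_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual_tfidf, library_tfidf)
print(matches)
```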


> data: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

### **IMDB Review Classification**

In [None]:
#!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

!unzip imdb-dataset-of-50k-movie-reviews.zip


Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  imdb-dataset-of-50k-movie-reviews.zip
replace IMDB Dataset.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
imdb = pd.read_csv("/content/IMDB Dataset.csv")

X_train, X_test = train_test_split(imdb, test_size=0.05, random_state=2022, stratify=imdb["sentiment"])

In [None]:
print(X_train.head())
X_train.shape


                                                  review sentiment
38751  'Panic in the Streets (1950)' owes more to Bri...  positive
18095  I got to see an early preview of this movie an...  negative
26076  The performances were superb, the costumes del...  positive
20799  Wesley Snipes is perfectly cast as Blade, a ha...  positive
6663   Ernst Lubitsch gave us wonderful films like De...  positive


(2500, 2)

In [None]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

## generate cosine similarity

def cosine_similarity(vector1, vector2):
    """
    Calculate the cosine similarity between two vectors.

    Args:
        vector1 (array-like): The first vector.
        vector2 (array-like): The second vector.

    Returns:
        float: The cosine similarity between the two vectors
            (0.0 if either vector has zero norm).
    """
    dot_product = np.dot(vector1, vector2)
    norm1 = np.linalg.norm(vector1)
    norm2 = np.linalg.norm(vector2)
    if norm1 == 0 or norm2 == 0:
        # cosine similarity is undefined for a zero vector; return 0.0 by convention
        return 0.0
    return dot_product / (norm1 * norm2)

## implement get_recommendations()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]

    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores for 10 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies (positional lookup, hence .iloc)
    return imdb['review'].iloc[movie_indices]

In [None]:
### Try to implement cosine similarity from scratch
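
As a hedged sketch of how a from-scratch cosine similarity can be checked, the example below computes it on TF-IDF vectors of three made-up reviews and compares the result with scikit-learn's `cosine_similarity`. The function name `cosine_from_scratch` and the toy reviews are assumptions for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

def cosine_from_scratch(u, v):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Made-up reviews, used only for illustration
toy_reviews = [
    "a great heist movie with a great cast",
    "great cast in a clever heist movie",
    "boring documentary about garden snails",
]

tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()

# Pairwise similarity matrix from the library, one value by hand
sim_matrix = sk_cosine_similarity(tfidf)
manual_01 = cosine_from_scratch(tfidf[0], tfidf[1])
print(manual_01, sim_matrix[0, 1])

# Rank the other reviews by similarity to review 0, skipping review 0 itself
ranked = [int(i) for i in np.argsort(-sim_matrix[0]) if i != 0]
print(ranked)
```

A matrix like `sim_matrix` is the kind of `cosine_sim` argument that `get_recommendations` expects, with `indices` mapping a title to its row position.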

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

#1. create a pipeline object
pipe_imdb = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_imdb.fit(X_train["review"], X_train["sentiment"])




In [None]:
#3. get the predictions for X_test and store it in y_pred

y_pred_imdb = pipe_imdb.predict(X_test["review"])

#4. print the classification report

print (classification_report(X_test["sentiment"], y_pred_imdb))

              precision    recall  f1-score   support

    negative       0.80      0.81      0.80     23750
    positive       0.81      0.80      0.80     23750

    accuracy                           0.80     47500
   macro avg       0.80      0.80      0.80     47500
weighted avg       0.80      0.80      0.80     47500



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [None]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)
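
For a quick look at what stop-word and punctuation removal does without loading a spaCy model, the sketch below uses scikit-learn's built-in English stop-word list instead. This is a swapped-in approximation, not the spaCy approach used above: it lowercases the text and does not lemmatize.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer lowercases, splits on word characters (which drops punctuation),
# and filters scikit-learn's built-in English stop words. No lemmatization.
analyzer = CountVectorizer(stop_words="english").build_analyzer()

sample = "The performances were superb, the costumes delightful!"
tokens = analyzer(sample)
print(" ".join(tokens))
```

Note that "performances" stays plural here, whereas the spaCy-based `preprocess` function reduces it to its lemma.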



In [None]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient

X_train["preprocessed_comment"] = X_train["review"].apply(preprocess)

print(X_train.head())




                                                  review sentiment  \
38751  'Panic in the Streets (1950)' owes more to Bri...  positive   
18095  I got to see an early preview of this movie an...  negative   
26076  The performances were superb, the costumes del...  positive   
20799  Wesley Snipes is perfectly cast as Blade, a ha...  positive   
6663   Ernst Lubitsch gave us wonderful films like De...  positive   

                                    preprocessed_comment  
38751  Panic Streets 1950 owe british noir american c...  
18095  get early preview movie hope time edit way imp...  
26076  performance superb costume deliver unique feel...  
20799  Wesley Snipes perfectly cast Blade half human ...  
6663   Ernst Lubitsch give wonderful film like Design...  


In [None]:
X_train_preprocessed = X_train.copy()

X_train_preprocessed.head()

                                                  review sentiment  \
38751  'Panic in the Streets (1950)' owes more to Bri...  positive   
18095  I got to see an early preview of this movie an...  negative   
26076  The performances were superb, the costumes del...  positive   
20799  Wesley Snipes is perfectly cast as Blade, a ha...  positive   
6663   Ernst Lubitsch gave us wonderful films like De...  positive   

                                    preprocessed_comment  
38751  Panic Streets 1950 owe british noir american c...  
18095  get early preview movie hope time edit way imp...  
26076  performance superb costume deliver unique feel...  
20799  Wesley Snipes perfectly cast Blade half human ...  
6663   Ernst Lubitsch give wonderful film like Design...  


In [None]:
X_test["preprocessed_comment"] = X_test["review"].apply(preprocess)
print(X_test.head())
X_test_preprocessed = X_test.copy()

                                                  review sentiment  \
23059  This movie was in a sci-fi 50-pack a friend of...  negative   
6765   I bought Jack-O a number of months ago at a Bl...  negative   
21325  While it contains facts that are not widely re...  negative   
32551  I have never seen so much talent and money use...  negative   
7937   I simply cannot believe the number of people c...  negative   

                                    preprocessed_comment  
23059  movie sci fi 50 pack friend get Christmas simi...  
6765   buy Jack o number month ago Blockbuster video ...  
21325  contain fact widely report exactly truth take ...  
32551  see talent money produce bad entire life state...  
7937   simply believe number people compare favourabl...  


In [None]:
print(X_test_preprocessed.head())

                                                  review sentiment  \
23059  This movie was in a sci-fi 50-pack a friend of...  negative   
6765   I bought Jack-O a number of months ago at a Bl...  negative   
21325  While it contains facts that are not widely re...  negative   
32551  I have never seen so much talent and money use...  negative   
7937   I simply cannot believe the number of people c...  negative   

                                    preprocessed_comment  
23059  movie sci fi 50 pack friend get Christmas simi...  
6765   buy Jack o number month ago Blockbuster video ...  
21325  contain fact widely report exactly truth take ...  
32551  see talent money produce bad entire life state...  
7937   simply believe number people compare favourabl...  


**Build a model with pre processed text**

In [None]:
#Do the 'train-test' splitting with test size of 20% with random state of 2022 and stratify sampling too
#Note: Use the preprocessed_comment column

X_train_nlp, X_test_nlp, y_train_nlp, y_test_nlp = train_test_split(
    X_train_preprocessed["preprocessed_comment"],
    X_train_preprocessed["sentiment"],
    test_size=0.2,
    random_state=2022,
    stratify=X_train_preprocessed["sentiment"]
)

# Display the first few rows of X_train_nlp
X_train_nlp.head()





18396    heart Darkness Movie Review Joseph Conrad Hear...
33553    like Summerslam look arena curtain look overal...
15662    actually pretty good like show tv good episode...
7432     energetic entertain minute film > see long tim...
23768    title sum heap crap take hint see Fred olen Ra...
Name: preprocessed_comment, dtype: object

**Let's check the scores with our best model so far**
- Random Forest


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
#1. create a pipeline object
pipe_nlp = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', RandomForestClassifier())
])



#2. fit with X_train and y_train
pipe_nlp.fit(X_train_preprocessed["preprocessed_comment"], X_train_preprocessed["sentiment"])


#3. get the predictions for X_test and store it in y_pred
y_pred_nlp = pipe_nlp.predict(X_test_preprocessed["preprocessed_comment"])


#4. print the classification report
print(classification_report(X_test_preprocessed["sentiment"], y_pred_nlp))


              precision    recall  f1-score   support

    negative       0.83      0.76      0.79      1250
    positive       0.78      0.85      0.81      1250

    accuracy                           0.80      2500
   macro avg       0.81      0.80      0.80      2500
weighted avg       0.81      0.80      0.80      2500




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** for pre-processing the text.
- use **RandomForest** as the classifier.
- print the classification report.


In [None]:
#1. create a pipeline object
pipe_tf_idf_nlp = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_tf_idf_nlp.fit(X_train_preprocessed["preprocessed_comment"], X_train_preprocessed["sentiment"])


#3. get the predictions for X_test and store it in y_pred
y_pred_tf_idf_nlp = pipe_tf_idf_nlp.predict(X_test_preprocessed["preprocessed_comment"])


#4. print the classification report
print (classification_report(X_test_preprocessed["sentiment"], y_pred_tf_idf_nlp))

              precision    recall  f1-score   support

    negative       0.80      0.81      0.80      1250
    positive       0.81      0.80      0.80      1250

    accuracy                           0.80      2500
   macro avg       0.80      0.80      0.80      2500
weighted avg       0.80      0.80      0.80      2500



## **Please write down Final Observations**
