## ***TF-IDF (Term Frequency - Inverse Document Frequency)***
- weighting scheme that measures **how important a word is to a document relative to the entire collection (corpus)**

- `TF = (occurrences of term t in the document) / (total words in the document)`
- `IDF = log {(total number of documents) / (number of documents containing term t)}`

- **TF-IDF(t,d)=TF(t,d)×IDF(t)**
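
To make the three formulas above concrete, here is a minimal pure-Python sketch that applies them to a toy corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration; they are not part of the notebook or of scikit-learn.

```python
import math

# Hypothetical toy corpus, already split into lowercase tokens.
docs = [
    ["i", "feel", "happy"],
    ["i", "feel", "angry"],
    ["i", "am", "scared"],
]

def tf(term, doc):
    # Share of the document's words that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes `term` occurs in at least one document (otherwise division by zero).
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_i = tf_idf("i", docs[0], docs)          # "i" is in every document, so IDF = log(1) = 0
score_feel = tf_idf("feel", docs[0], docs)    # in 2 of 3 documents
score_happy = tf_idf("happy", docs[0], docs)  # in 1 of 3 documents
```

A word that appears in every document scores zero, while the rarest word in the corpus gets the highest weight within its document.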
***

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
data = pd.read_csv('emotion.csv')
data

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear
...,...,...
5932,i begun to feel distressed for you,fear
5933,i left feeling annoyed and angry thinking that...,anger
5934,i were to ever get married i d have everything...,joy
5935,i feel reluctant in applying there because i w...,fear


In [4]:
def map_emo(emo):
    if emo == 'fear':
        return 0
    elif emo == 'anger':
        return 1
    else:
        return 2

In [5]:
data["Emotion"] = data["Emotion"].apply(map_emo)
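
Note that the `else` branch of `map_emo` sends every label other than `fear` and `anger` to 2, so a misspelled or unexpected label would silently count as the third class. A stricter variant could look like the sketch below, assuming the only three labels are the ones visible in the preview (`fear`, `anger`, `joy`); `LABEL_MAP` and `map_emo_strict` are hypothetical names.

```python
# Assumed label set, taken from the values visible in the data preview.
LABEL_MAP = {"fear": 0, "anger": 1, "joy": 2}

def map_emo_strict(emo):
    # Fail loudly on an unknown label instead of folding it into class 2.
    try:
        return LABEL_MAP[emo]
    except KeyError:
        raise ValueError(f"unexpected emotion label: {emo!r}")

encoded = [map_emo_strict(e) for e in ["fear", "anger", "joy"]]
```

It can be applied the same way as the original, with `data["Emotion"].apply(map_emo_strict)`.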

In [6]:
data.head()

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,0
1,im so full of life i feel appalled,1
2,i sit here to write i start to dig out my feel...,0
3,ive been really angry with r and i feel like a...,2
4,i feel suspicious if there is no one outside l...,0


In [8]:
data["Emotion"].value_counts()

Emotion
1    2000
2    2000
0    1937
Name: count, dtype: int64

***text modification***

In [9]:
import spacy

In [11]:
nlp = spacy.load('en_core_web_sm')

In [12]:
def modify_text(text):
    doc = nlp(text)
    # lemmatize each token, dropping stop words and punctuation
    modified_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return modified_tokens

In [18]:
# data["Mod_comm"] = data["Comment"].apply(modify_text) 
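
If the commented-out step is enabled, keep in mind that `modify_text` returns a list of tokens, whereas `TfidfVectorizer` with its default settings expects each document to be a plain string. One way to bridge the two is to join the tokens back together, as in this sketch. The helper `join_tokens` and the sample token list are hypothetical, so the example runs without loading the spaCy model.

```python
def join_tokens(tokens):
    # Turn a token list back into one whitespace-separated string.
    return " ".join(tokens)

# Hypothetical token list of the kind modify_text might return.
sample_tokens = ["feel", "reluctant", "apply"]
sample_text = join_tokens(sample_tokens)
```

In the notebook this would amount to something like `data["Comment"].apply(lambda t: join_tokens(modify_text(t)))`.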

***test train split***

In [19]:
from sklearn.model_selection import train_test_split

train_comm, test_comm, train_emm, test_emm = train_test_split(data["Comment"], data["Emotion"], test_size=0.2, random_state=42)

In [20]:
train_comm, train_emm

(4945    i remember the day i was on the phone with my ...
 5428    i had to go to the gym so many times this last...
 1344    i feel irritated that he either interrupts my ...
 1888    i really hate this feeling when you really giv...
 2480                   i was left feeling a little shaken
                               ...                        
 3772    im feeling stressed about upcoming events drow...
 5191    i will adress those issues and attempt to reas...
 5226    i can remember when cammie was a couple of mon...
 5390    i know is what i feel and i feel absolutely te...
 860                   i answered feeling rather skeptical
 Name: Comment, Length: 4749, dtype: object,
 4945    2
 5428    0
 1344    1
 1888    1
 2480    0
        ..
 3772    1
 5191    0
 5226    1
 5390    0
 860     0
 Name: Emotion, Length: 4749, dtype: int64)

In [21]:
test_comm, test_emm

(1867    i do give up at times when i feel there s no p...
 3988    im a firm believer that nothing makes a woman ...
 4516    i was feeling very vulnerable and down no one ...
 1397    i closed her eyes in anger and feeling disgust...
 1669                         i feel like being distracted
                               ...                        
 642     i was feeling even less splendid and had nothi...
 1253                           i feel confused after that
 3094    i may pour out the half empty cup here i will ...
 3733    i really want this challenge to be a fun way f...
 3523    i do buy synthetic pearls when i feel the need...
 Name: Comment, Length: 1188, dtype: object,
 1867    1
 3988    2
 4516    0
 1397    1
 1669    1
        ..
 642     2
 1253    0
 3094    2
 3733    0
 3523    2
 Name: Emotion, Length: 1188, dtype: int64)

***tf-idf vectorization***

In [34]:
vectorizer = TfidfVectorizer(smooth_idf=True)

In [35]:
vectorizer.fit_transform(train_comm)  # labels are not needed: the vectorizer ignores y

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 74244 stored elements and shape (4749, 7900)>
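
The weights in this matrix will not match the plain formula at the top of the notebook. With `smooth_idf=True`, scikit-learn documents the IDF as `ln((1 + n) / (1 + df(t))) + 1`, uses raw term counts as TF, and by default L2-normalizes each row. The pure-Python sketch below reproduces that recipe on a hypothetical pre-tokenized toy corpus. It skips tokenization entirely, so it keeps the token `i`, which the vectorizer's default token pattern (two or more characters) would drop.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["i", "feel", "happy"],
    ["i", "feel", "angry"],
    ["i", "am", "scared"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    # Smoothed IDF as documented for smooth_idf=True.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    raw = {t: c * smooth_idf(t) for t, c in counts.items()}
    # L2-normalize so the row has unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row0 = tfidf_row(docs[0])
```

Because of the `+ 1`, a term that occurs in every document still gets a non-zero weight, unlike in the unsmoothed formula.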

In [36]:
vectorizer.vocabulary_

{'remember': 5662,
 'the': 6961,
 'day': 1660,
 'was': 7572,
 'on': 4758,
 'phone': 5038,
 'with': 7747,
 'my': 4495,
 'be': 577,
 'fri': 2763,
 'shannon': 6107,
 'telling': 6901,
 'her': 3204,
 'how': 3319,
 'cried': 1549,
 'because': 593,
 'feeling': 2540,
 'truly': 7204,
 'happy': 3110,
 'again': 153,
 'had': 3063,
 'to': 7064,
 'go': 2933,
 'gym': 3056,
 'so': 6352,
 'many': 4183,
 'times': 7044,
 'this': 6996,
 'last': 3872,
 'spring': 6494,
 'that': 6958,
 'just': 3759,
 'kind': 3804,
 'of': 4723,
 'got': 2956,
 'used': 7398,
 'neurotic': 4585,
 'and': 256,
 'then': 6970,
 'went': 7633,
 'away': 494,
 'feel': 2537,
 'irritated': 3642,
 'he': 3152,
 'either': 2163,
 'interrupts': 3593,
 'quiet': 5444,
 'time': 7042,
 'or': 4795,
 'wakes': 7531,
 'me': 4240,
 'up': 7376,
 'really': 5536,
 'hate': 3134,
 'when': 7651,
 'you': 7870,
 'give': 2897,
 'much': 4462,
 'damn': 1629,
 'about': 16,
 'someone': 6390,
 'but': 919,
 'all': 198,
 'person': 5010,
 'show': 6181,
 'is': 3644,
 'sim

In [37]:
test_preds = vectorizer.transform(test_comm).toarray()

In [38]:
np.where(test_preds[:5] != 0)

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 4, 4, 4, 4]),
 array([ 429,  577,  795,  967, 1979, 2537, 2769, 2897, 3446, 4621, 4762,
        5136, 6975, 7044, 7376, 7651,  691, 2537, 2993, 3204, 3401, 4153,
        4413, 4462, 4658, 6935, 6950, 6958, 6961, 7064, 7190, 7760,  256,
         605, 1207, 1854, 2019, 2163, 2334, 2540, 3065, 3127, 3831, 4240,
        4621, 4762, 4795, 5536, 5662, 6965, 7064, 7089, 7452, 7522, 7572,
        7631, 7873,  256,  264,  929, 1208, 1922, 2440, 2540, 3204, 3446,
        6996, 7107,  622, 1959, 2537, 3993]))

In [39]:
test_preds[0][429]

np.float64(0.1760004762550248)
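
A column index such as 429 is hard to read on its own. Since `vocabulary_` maps term to column, inverting it gives the term behind any column. The sketch below uses only a handful of entries copied from the (truncated) `vocabulary_` output above, so it cannot say which term column 429 is; `term_for_column` is a hypothetical helper.

```python
# A few entries copied from the vocabulary_ output; the fitted dict has 7900 entries.
vocabulary = {
    "remember": 5662,
    "feel": 2537,
    "feeling": 2540,
    "the": 6961,
    "to": 7064,
}

# Invert term -> column into column -> term.
index_to_term = {idx: term for term, idx in vocabulary.items()}

def term_for_column(col):
    # Returns None when the column is not in this partial dictionary.
    return index_to_term.get(col)

known = term_for_column(2537)
unknown = term_for_column(429)
```

On the fitted vectorizer itself, `vectorizer.get_feature_names_out()[429]` should give the same lookup over the full vocabulary.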