# Sentiment Analysis with Term Frequency - Inverse Document Frequency (TF-IDF)


Based on a **Stats Wire** video: https://www.youtube.com/watch?v=TMzaK2-K5C4&list=PLBSCvBlTOLa_wS8iy84DfyizdSs7ps7L5&

In [20]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
df1 = pd.read_csv("data/amazon_1.csv")
df2 = pd.read_csv("data/amazon_2.csv")

# Concatenate the two DataFrames
df = pd.concat([df1, df2], ignore_index=True)

In [3]:
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [4]:
df.shape

(413840, 6)

In [5]:
df.isnull().sum().sort_values(ascending=False)

Brand Name      65171
Review Votes    12296
Price            5933
Reviews            62
Product Name        0
Rating              0
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.isnull().sum().sort_values(ascending=False)

Product Name    0
Brand Name      0
Price           0
Rating          0
Reviews         0
Review Votes    0
dtype: int64

In [9]:
df['Rating'].value_counts()

5    180253
1     57535
4     50421
3     26058
2     20068
Name: Rating, dtype: int64

In [10]:
df[df["Rating"] != 3].head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [11]:
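# Remove neutral 3-star ratings; .copy() avoids a SettingWithCopyWarning
# when a new column is added to the filtered frame later
df = df[df["Rating"] != 3].copy()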

In [12]:
df["Rating"].value_counts()

5    180253
1     57535
4     50421
2     20068
Name: Rating, dtype: int64

In [13]:
# Create binary labels: 1 for ratings above 3 (4 or 5 stars), 0 otherwise
df["Positively Rated"] = np.where(df["Rating"] > 3, 1, 0)

In [14]:
df["Positively Rated"].value_counts()

1    230674
0     77603
Name: Positively Rated, dtype: int64

In [15]:
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1


In [None]:
# specify the order of the labels
order = [0, 1]

# plot the countplot
sns.countplot(data=df, x="Positively Rated", order=order)

## Train test split

In [17]:
# Hold out 25% of the reviews for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df["Reviews"], df["Positively Rated"], random_state=50, test_size=0.25
)
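
The labels above are imbalanced (roughly three positive reviews for every negative one). The split in this notebook is purely random, which is usually close enough on a dataset this large. As a minimal sketch, assuming you wanted the class ratio preserved exactly, `train_test_split` accepts a `stratify` argument. The toy `texts` and `labels` below are made up for illustration and are not the Amazon data.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 positive and 4 negative "reviews"
texts = [f"review {i}" for i in range(16)]
labels = [1] * 12 + [0] * 4

# stratify=labels keeps the 3:1 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=50, stratify=labels
)

print(f"train positives: {sum(y_tr)} of {len(y_tr)}")
print(f"test positives:  {sum(y_te)} of {len(y_te)}")
```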

In [18]:
print(X_train[:5])

333828                                  It works very well.
104940                                             useless!
45920     It "looked" new, as it was in a box packaged i...
360788    Working perfectly internationally... cant wait...
51954                 Bad battery does not keep a charge...
Name: Reviews, dtype: object


In [19]:
print(y_train[:5])

333828    1
104940    0
45920     0
360788    1
51954     0
Name: Positively Rated, dtype: int64


## Term Frequency - Inverse Document Frequency (TF-IDF)

**Term Frequency - Inverse Document Frequency (TF-IDF)** is a numerical statistic commonly used in information retrieval and text mining to measure how important a term is to a document within a collection (corpus).

TF-IDF takes into account two factors: term frequency and inverse document frequency.

1. **Term Frequency (TF)**: The term frequency of a term (usually a word) measures how often the term appears in a document. It indicates the importance of the term within that particular document. TF is typically calculated as the ratio of the number of occurrences of the term to the total number of terms in the document, which keeps longer documents from being favored. Variations exist, such as logarithmic scaling so that many repetitions of one term do not dominate.

2. **Inverse Document Frequency (IDF)**: The inverse document frequency of a term measures how rare the term is across the entire corpus. It helps to differentiate commonly occurring terms from those that are more distinctive and potentially more important. IDF is usually calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term. The logarithmic scaling dampens the effect of IDF and prevents extremely rare terms from dominating.

The TF-IDF score is calculated by multiplying the term frequency (TF) by the inverse document frequency (IDF) for each term in a document. This score represents the relative importance of the term within the document and the corpus. A higher TF-IDF score indicates that a term is more relevant to that specific document.
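
The definitions above can be sketched in a few lines of plain Python. This is only an illustration of the textbook formulas (TF as a count ratio, IDF as the natural log of total documents over documents containing the term). The three-sentence `corpus` and the helper names are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the Amazon reviews
corpus = [
    "great phone great battery",
    "bad battery",
    "great screen",
]
tokenized = [doc.split() for doc in corpus]


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of terms in the document."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, docs):
    """log(total documents / documents containing the term).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing)


def tf_idf(term, tokens, docs):
    return term_frequency(term, tokens) * inverse_document_frequency(term, docs)


# Score three terms of the first document
for term in ["great", "phone", "battery"]:
    print(term, round(tf_idf(term, tokenized[0], tokenized), 3))
```

In the first document "great" occurs twice and "phone" only once, yet "phone" gets the higher score because it appears in no other document, while "great" also appears in the third.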

In [21]:
# Learn the vocabulary and IDF weights from the training reviews only
vect = TfidfVectorizer().fit(X_train)

In [22]:
len(vect.get_feature_names_out())

53403

In [23]:
X_train_vectorized = vect.transform(X_train)

## Logistic Regression

In [24]:
model = LogisticRegression(max_iter=1000)

In [25]:
model.fit(X_train_vectorized, y_train)

In [26]:
predictions = model.predict(vect.transform(X_test))

In [27]:
print(predictions[:5])

[0 1 0 1 0]


## ROC AUC

In [28]:
# Note: `predictions` holds hard 0/1 labels, so this is the AUC of the thresholded classifier
print(f"AUC: {roc_auc_score(y_test, predictions)}")

AUC: 0.9286180926051923
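
ROC AUC is normally computed from continuous scores rather than from 0/1 predictions, because thresholding throws away the ranking information the metric is built on. The small sketch below uses made-up labels and scores (not the model above) to show how the two can differ.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for four samples
y_true = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.7, 0.9]

# Thresholding at 0.5 turns the scores into hard labels: [0, 1, 1, 1]
hard_labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)       # every positive outranks every negative
auc_from_labels = roc_auc_score(y_true, hard_labels)  # ties after thresholding lower the value

print(f"AUC from scores: {auc_from_scores}")
print(f"AUC from labels: {auc_from_labels}")
```

For the model in this notebook, the equivalent would be to pass `model.predict_proba(vect.transform(X_test))[:, 1]` (or `model.decision_function(...)`) as the second argument. The resulting value would differ from the one printed above and is not reported here.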


In [29]:
feature_names = np.array(vect.get_feature_names_out())

In [30]:
sorted_coef_index = model.coef_[0].argsort()

In [31]:
print(f"Negative words {feature_names[sorted_coef_index[:10]]}")

Negative words ['not' 'worst' 'useless' 'waste' 'disappointed' 'return' 'terrible'
 'horrible' 'poor' 'stopped']


In [32]:
print(f"Positive words: {feature_names[sorted_coef_index[:-11:-1]]}")

Positive words: ['love' 'great' 'excellent' 'amazing' 'perfect' 'awesome' 'easy' 'best'
 'perfectly' 'loves']
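
The two slices used above can be easier to follow on a tiny example. `argsort()` returns the indices that would sort the coefficients in ascending order, so the first entries point at the most negative coefficients, and a reversed slice from the end gives the most positive ones. The names and coefficient values below are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and coefficients, not taken from the fitted model
demo_names = np.array(["awful", "fine", "great", "okay", "poor"])
demo_coef = np.array([-2.1, 0.3, 1.8, 0.1, -1.2])

# Indices ordered from the smallest to the largest coefficient
demo_sorted_index = demo_coef.argsort()

# First two entries: the most negative coefficients
most_negative = demo_names[demo_sorted_index[:2]]

# [:-3:-1] walks backwards from the end and takes two entries: the most positive
most_positive = demo_names[demo_sorted_index[:-3:-1]]

print(most_negative)
print(most_positive)
```

In the notebook, `[:10]` and `[:-11:-1]` do the same thing for the ten smallest and ten largest coefficients.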
