# Sentiment Analysis with TF-IDF
We will extract TF-IDF features from the text data and use them to perform sentiment analysis.

## Intro to TF-IDF
As its name suggests, TF-IDF is the product of TF (term frequency) and IDF (inverse document frequency).

TF is the frequency with which a word (token) occurs in a given document. It measures how important the word is to that document.

IDF is the log of the inverse of the fraction of documents in which a word occurs. It measures how much information the word carries for distinguishing documents.

For more details, see [Wikipedia page of TF-IDF](https://en.wikipedia.org/wiki/Tf–idf)
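
To make the two definitions concrete, here is a minimal pure-Python sketch on a tiny made-up corpus. The corpus and the helper names `tf`, `idf`, and `tfidf` are illustrative assumptions, and the formula is the plain textbook variant. Scikit-learn's `TfidfVectorizer`, used below, applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical, not taken from the Twitter dataset)
docs = [
    ["good", "game", "good", "fun"],
    ["bad", "game"],
    ["good", "story"],
]

def tf(term, doc):
    # Term frequency: share of the tokens in `doc` that equal `term`
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of documents containing `term`)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # TF-IDF is the product of the two quantities
    return tf(term, doc) * idf(term, docs)

for term in ["good", "game", "fun"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

In the first document, "fun" appears only once but occurs in no other document, so it ends up with a higher weight than "good", which appears twice but also shows up elsewhere in the corpus.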

# Prepare data

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_data_path = "datasets/twitter_sentiment_analysis/twitter_training.csv"
train_data = pd.read_csv(train_data_path,header=None)
train_data.columns = ["Tweet_ID","entity","sentiment","Tweet_content"]

test_data_path = "datasets/twitter_sentiment_analysis/twitter_validation.csv"
test_data = pd.read_csv(test_data_path,header=None)
test_data.columns = ["Tweet_ID","entity","sentiment","Tweet_content"]

In [3]:
## Include only "Positive" and "Negative" tweets to form a binary classification problem
## Label Positive as 1 and Negative as 0
## .copy() avoids pandas' SettingWithCopyWarning when adding the label column to a filtered frame
train_data = train_data[train_data.sentiment.isin(["Positive","Negative"])].copy()
train_data["label"] = train_data.sentiment.map({"Positive":1, "Negative":0})
test_data = test_data[test_data.sentiment.isin(["Positive","Negative"])].copy()
test_data["label"] = test_data.sentiment.map({"Positive":1, "Negative":0})

# Calculating TF-IDF with Scikit-Learn

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

## With preprocessing

In [17]:
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(train_data.Tweet_content.apply(str).tolist()+test_data.Tweet_content.apply(str).tolist())

In [18]:
print(f"number of columns in the feature matrix: {len(vectorizer.get_feature_names_out())}")
print(vectorizer.get_feature_names_out())
## The number of columns is really large, and we see some tokens that carry no meaning.

number of columns in the feature matrix: 19245
['00' '000' '00011' ... '٥υ' 'ℐℓ٥' '𝐍𝐄𝐖𝐒𝐔𝐏𝐃𝐀𝐓𝐄𝐒']


In [20]:
train_tfidf = vectorizer.transform(train_data.Tweet_content.apply(str))
## The returned object is a sparse matrix

In [22]:
print(train_tfidf[0])

  (0, 11372)	0.6624843348550716
  (0, 8699)	0.4758230552743775
  (0, 7362)	0.4162697838351611
  (0, 2657)	0.40177903530027353


## With customized tokenizer

In [25]:
import re
import spacy

In [43]:
def generate_tokenizer(nlp, lemmatize = True):
    def process_tweet_spacy(tweet):
        tweet = str(tweet)
        # remove old-style retweet text "RT" at the start of the tweet
        tweet2 = re.sub(r'^RT[\s]+','', tweet)
        # remove hyperlinks (everything from the URL to the end of the line)
        tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)
        # keep hashtag words, removing only the hash sign #
        tweet2 = re.sub(r'#', '', tweet2)

        doc = nlp(tweet2)
        # remove stopwords and punctuation, then lowercase lemmas or raw tokens
        if lemmatize:
            return [token.lemma_.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]
        else:
            return [token.text.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]
    return process_tweet_spacy
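
To see what the three regular expressions in the tokenizer do before spaCy gets involved, the sketch below applies them on their own to two made-up tweets. The helper name `clean_tweet` and the example strings are assumptions for illustration; the patterns are copied from the cell above.

```python
import re

def clean_tweet(tweet):
    # Same three substitutions as in the tokenizer, without the spaCy step
    tweet = str(tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)               # leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)   # URL through end of line
    tweet = re.sub(r'#', '', tweet)                      # hash sign only
    return tweet

print(repr(clean_tweet("RT Loving the new #Borderlands update https://example.com/patch")))
# 'Loving the new Borderlands update '
print(repr(clean_tweet("Patch notes https://example.com are great")))
# 'Patch notes '
```

The second example shows a side effect worth knowing about: because the URL pattern uses `.*`, any words that follow a link on the same line are removed along with it.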

In [44]:
nlp = spacy.load("en_core_web_sm")
mytokenizer = generate_tokenizer(nlp)

In [45]:
vectorizer = TfidfVectorizer(tokenizer=mytokenizer)
vectorizer.fit(train_data.Tweet_content.apply(str).tolist()+test_data.Tweet_content.apply(str).tolist())



In [47]:
print(f"number of columns in the feature matrix: {len(vectorizer.get_feature_names_out())}")
print(vectorizer.get_feature_names_out())

number of columns in the feature matrix: 19919
['\n' '\n\n' '\n\n ' ... '🧢' '🧻' '🪓.']


In [46]:
train_tfidf = vectorizer.transform(train_data.Tweet_content.apply(str))
## The returned object is a sparse matrix

In [48]:
test_tfidf = vectorizer.transform(test_data.Tweet_content.apply(str))

# Sentiment classification with L1 Logistic Regression and TF-IDF
We train a logistic regression model on the features extracted above and use it to perform sentiment classification.

In [49]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [50]:
ytrain, ytest = train_data.label.values, test_data.label.values

In [55]:
lr = LogisticRegression(penalty='l1', solver = 'liblinear', max_iter=1000)
lr.fit(train_tfidf, ytrain)

In [56]:
ypred = lr.predict(test_tfidf)

In [57]:
print(f"Accuracy over test data is {accuracy_score(ytest, ypred)}")
print(f"Precision over test data is {precision_score(ytest, ypred)}")
print(f"Recall over test data is {recall_score(ytest, ypred)}")
print(f"F1 score over test data is {f1_score(ytest, ypred)}")

Accuracy over test data is 0.9281767955801105
Precision over test data is 0.9343065693430657
Recall over test data is 0.924187725631769
F1 score over test data is 0.9292196007259528
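
For readers who want to see how the four numbers relate, here is a minimal pure-Python sketch that computes them from the counts of true/false positives and negatives. The helper name `binary_metrics` and the toy labels are assumptions for illustration, and the sketch assumes at least one actual positive and one predicted positive so that no division by zero occurs.

```python
def binary_metrics(y_true, y_pred):
    # Count the four outcomes of a binary classifier (positive class = 1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy labels (hypothetical), not the notebook's predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

With 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, all four metrics come out to 0.75 on this toy input.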


Looks like we achieve pretty high scores (between 0.92 and 0.94 on every metric) even without tuning the L1 penalty parameter.