# Sentiment Analysis with Logistic Regression
We apply Logitstic Regression on the Twitter Sentiment Analysis data. To ease the problem, we will filter the dataset to include only positive and negative twitts. Then we will perform text preprocessing using the functions we built previously and extract features based on processed tweets. 

In [1]:
import numpy as np
import pandas as pd

# Prepare Data

In [2]:
train_data_path = "datasets/twitter_sentiment_analysis/twitter_training.csv"
train_data = pd.read_csv(train_data_path,header=None)
train_data.columns = ["Tweet_ID","entity","sentiment","Tweet_content"]

test_data_path = "datasets/twitter_sentiment_analysis/twitter_validation.csv"
test_data = pd.read_csv(test_data_path,header=None)
test_data.columns = ["Tweet_ID","entity","sentiment","Tweet_content"]

In [3]:
## Inlcude Only "Positive" and "Negatvie" twitts to form a binary classification problem
## Label Positve as 1 and Negative as 0
train_data = train_data[train_data.sentiment.isin(["Positive","Negative"])]
train_data["label"] = train_data.sentiment.map({"Positive":1, "Negative":0})
test_data = test_data[test_data.sentiment.isin(["Positive","Negative"])]
test_data["label"] = test_data.sentiment.map({"Positive":1, "Negative":0})

In [4]:
train_data.head()

Unnamed: 0,Tweet_ID,entity,sentiment,Tweet_content,label
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...,1
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...,1
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...,1
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...,1
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...,1


In [5]:
test_data.head()

Unnamed: 0,Tweet_ID,entity,sentiment,Tweet_content,label
2,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it funct...,0
3,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking,...",0
5,6273,FIFA,Negative,Hi @EAHelp I’ve had Madeleine McCann in my cel...,0
6,7925,MaddenNFL,Positive,Thank you @EAMaddenNFL!! \n\nNew TE Austin Hoo...,1
7,11332,TomClancysRainbowSix,Positive,"Rocket League, Sea of Thieves or Rainbow Six: ...",1


# Feature Extraction
We extract 3 features from each tweets. The first is just 1 which will represent the intercept in logistic regression model. The second feature is sum of word frequencies in postive samples and the third feature is sum of word frequencies in negative samples. 

In [6]:
import spacy
import re

We define a class that can perform extraction of the above feateures for us.

In [47]:
class TwitterFreqFeatureExtractor:
    def __init__(self, nlp):
        self.freqs={}
        self.nlp = nlp
    
    def process_tweet_spacy(self, tweet, lemmetize=True):
        # remove old sytle retweet text "RT"
        tweet = str(tweet)
        tweet2 = re.sub(r'^RT[\s]+','', tweet)
        # remove hyperlinks
        tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)
        # remove hashtags
        # only removing the hash # sign from the word
        tweet2 = re.sub(r'#', '', tweet2)

        doc = self.nlp(tweet2)
        # remove stopworks and punctuation
        if lemmetize:
            return [token.lemma_.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]
        else:
            return [token.text.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]

    def fit(self, X, y):
        for text, label in zip(X,y):
            tokenized_text = self.process_tweet_spacy(text)
            for token in tokenized_text:
                if (label, token) in self.freqs:
                    self.freqs[(label,token)] += 1
                else:
                    self.freqs[(label,token)] = 1 

    def transform(self, X):
        feature_mat = np.zeros((len(X),3))
        for i, text in enumerate(X):
            tokenized_text = self.process_tweet_spacy(text)
            pos_freq = 0
            neg_freq = 0
            for token in set(tokenized_text):
                pos_freq += self.freqs.get((1,token),0)
                neg_freq += self.freqs.get((0, token), 0)
            feature_mat[i,0] = 1
            feature_mat[i,1] = pos_freq
            feature_mat[i,2] = neg_freq
        return feature_mat


Now we apply `TwitterFeqFeatureExtractor` to extractor feature for both training data and test data

In [48]:
nlp = spacy.load("en_core_web_sm")
extractor = TwitterFreqFeatureExtractor(nlp)

In [49]:
extractor.fit(train_data.Tweet_content, train_data.label)

In [50]:
Xtrain = extractor.transform(train_data.Tweet_content)

In [51]:
Xtest = extractor.transform(test_data.Tweet_content)

In [52]:
ytrain = train_data.label.values
ytest = test_data.label.values

# Sentiment classification with Logistic Regression
We train a logitist regression with features extracted above and use this model to perform sentiment classification

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [54]:
lr = LogisticRegression(penalty='none')
lr.fit(Xtest, ytest)

In [56]:
ypred = lr.predict(Xtest)

In [58]:
print(f"Accuracy over test data is {accuracy_score(ytest, ypred)}")
print(f"Precision over test data is {precision_score(ytest, ypred)}")
print(f"Recall over test data is {recall_score(ytest, ypred)}")
print(f"F1 score over test data is {f1_score(ytest, ypred)}")


Accuracy over test data is 0.7900552486187845
Precision over test data is 0.8007380073800738
Recall over test data is 0.7833935018050542
F1 score over test data is 0.7919708029197081
