# Obama Tweets

To start off, we install `gensim` and import everything that we'll need for this notebook:

In [158]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [159]:
import nltk
import pandas as pd
import re, string
import os
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, precision_score, recall_score
import gensim
from gensim.models import Word2Vec


We now load the data from the input Excel workbook into a dataframe, reading the `Obama` sheet:

In [160]:
#data loading
data = pd.ExcelFile('C:/Users/utsav/OneDrive/UIC/Fall_2023/CS_583/Project/training-Obama-Romney-tweets.xlsx')
obama = pd.read_excel(data, 'Obama')
obama.head(5)

Unnamed: 0.1,Unnamed: 0,date,time,Anootated tweet,Unnamed: 4,Unnamed: 5
0,,,,"1: positive, -1: negative, 0: neutral, 2: mixed",Class,Your class
1,,2012-10-16 00:00:00,10:28:53-05:00,"Kirkpatrick, who wore a baseball cap embroider...",0,
2,,2016-12-10 00:00:00,10:09:00-05:00,Question: If <e>Romney</e> and <e>Obama</e> ha...,2,
3,,2012-10-16 00:00:00,10:04:30-05:00,#<e>obama</e> debates that Cracker Ass Cracker...,1,
4,,2012-10-16 00:00:00,10:00:36-05:00,RT @davewiner Slate: Blame <e>Obama</e> for fo...,2,


## Data Cleaning

We can now start cleaning the data.<br>
The first row holds the label legend (`1: positive, -1: negative, 0: neutral, 2: mixed`) rather than a tweet, so we drop it:

In [161]:
obama = obama[1:]
obama.head(5)

Unnamed: 0.1,Unnamed: 0,date,time,Anootated tweet,Unnamed: 4,Unnamed: 5
1,,2012-10-16 00:00:00,10:28:53-05:00,"Kirkpatrick, who wore a baseball cap embroider...",0,
2,,2016-12-10 00:00:00,10:09:00-05:00,Question: If <e>Romney</e> and <e>Obama</e> ha...,2,
3,,2012-10-16 00:00:00,10:04:30-05:00,#<e>obama</e> debates that Cracker Ass Cracker...,1,
4,,2012-10-16 00:00:00,10:00:36-05:00,RT @davewiner Slate: Blame <e>Obama</e> for fo...,2,
5,,2012-10-16 00:00:00,09:50:08-05:00,@Hollivan @hereistheanswer Youre missing the ...,0,


Next, we drop the columns that we do not need: `Unnamed: 0`, `date`, `time` and `Unnamed: 5`.<br>
We also rename `Unnamed: 4` to `class` and `Anootated tweet` to `tweet`.

In [162]:
obama = obama.drop(['Unnamed: 0', 'date', 'time', 'Unnamed: 5'], axis=1)
obama = obama.rename(columns={'Unnamed: 4': 'class', 'Anootated tweet': 'tweet'})
obama.head(5)

Unnamed: 0,tweet,class
1,"Kirkpatrick, who wore a baseball cap embroider...",0
2,Question: If <e>Romney</e> and <e>Obama</e> ha...,2
3,#<e>obama</e> debates that Cracker Ass Cracker...,1
4,RT @davewiner Slate: Blame <e>Obama</e> for fo...,2
5,@Hollivan @hereistheanswer Youre missing the ...,0


We can check how many tweets carry each class label:

In [163]:
print(obama['class'].value_counts())

-1            1922
0             1896
1             1653
2             1474
0               82
2               70
-1              46
1               26
irrevelant      23
irrelevant       1
Name: class, dtype: int64


For this notebook, we are only interested in the classes `-1`, `0` and `1`, so we drop every row with any other label (`2`, `irrevelant`, `irrelevant`).<br>
The labels are stored as a mix of strings and integers, which is why each class appears twice in the counts above, so we also cast the column to integers.

In [164]:
obama_df = obama[obama['class'].isin(['-1', '0', '1',-1,0,1])].copy(deep=True)
obama_df['class']=obama_df['class'].astype(int)
print(obama_df['class'].value_counts())
obama_df.head(5)

 0    1978
-1    1968
 1    1679
Name: class, dtype: int64


Unnamed: 0,tweet,class
1,"Kirkpatrick, who wore a baseball cap embroider...",0
3,#<e>obama</e> debates that Cracker Ass Cracker...,1
5,@Hollivan @hereistheanswer Youre missing the ...,0
7,I was raised as a Democrat left the party yea...,-1
8,The <e>Obama camp</e> can't afford to lower ex...,0
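The duplicated rows in the `value_counts()` output come from labels stored with two different types: the integer `0` and the string `'0'` are distinct values until they are cast. The following is a minimal, dependency-free sketch of the same filter-then-cast step. The `raw_labels` list is made up for illustration and is not taken from the real column.

```python
from collections import Counter

# Hypothetical raw labels mixing ints and strings, mimicking the spreadsheet column.
raw_labels = [-1, 0, 1, 2, '0', '-1', '1', '2', 'irrevelant', 0]

# Before casting, 0 and '0' are counted as separate classes.
print(Counter(raw_labels))

# Keep only the three classes of interest, in either representation, then cast to int.
wanted = {-1, 0, 1, '-1', '0', '1'}
labels = [int(label) for label in raw_labels if label in wanted]

# After casting, each class appears exactly once in the counts.
print(Counter(labels))
```

This is why the `isin` filter in the cell above lists both the string and the integer form of every label.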


Now we can start working on the tweets themselves. The first steps are cleaning the text and tokenizing it.<br>
For that, we use two functions:

In [165]:
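def clean(text):
    text = text.lower()
    text = re.sub(r'@[A-Za-z0-9]+', '', text)            # remove @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)            # remove #hashtags
    text = re.sub(r'https?://[A-Za-z0-9./]+', '', text)  # remove URLs
    text = re.sub(r'www\.[^ ]+', '', text)               # remove bare www links (dot escaped)
    text = re.sub(r'[^a-z]', ' ', text)                  # replace anything that is not a-z with a space
    text = re.sub(r' +', ' ', text)                      # collapse runs of spaces
    return text

# raw string avoids an invalid-escape warning for \w
regexp = RegexpTokenizer(r'\w+')

nltk.download('stopwords')
# build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

def tokenize(text):
    text = clean(text)
    text = regexp.tokenize(text)
    text = [w for w in text if w not in stop_words]
    return text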

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\utsav\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
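To see what each substitution in `clean` contributes, the sketch below applies the same patterns one at a time and prints every intermediate string. The `clean_trace` helper and the sample tweet are hypothetical, written only for this illustration, and the dot in the `www` pattern is escaped here.

```python
import re

# The substitutions from clean(), listed in the order they are applied.
STEPS = [
    (r'@[A-Za-z0-9]+', ''),            # mentions
    (r'#[A-Za-z0-9]+', ''),            # hashtags
    (r'https?://[A-Za-z0-9./]+', ''),  # URLs
    (r'www\.[^ ]+', ''),               # bare www links (dot escaped here)
    (r'[^a-z]', ' '),                  # anything that is not a lowercase letter
    (r' +', ' '),                      # runs of spaces
]

def clean_trace(text):
    """Apply the steps one by one and return every intermediate string."""
    text = text.lower()
    trace = [text]
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
        trace.append(text)
    return trace

# Made-up tweet in the style of the dataset.
sample = 'RT @davewiner Blame <e>Obama</e> for #debate http://t.co/abc123 NOW!'
for stage in clean_trace(sample):
    print(repr(stage))
```

The trace shows that the `<e>` and `</e>` annotation tags are not stripped as a unit: the angle brackets and slash become spaces, which leaves stray `e` tokens behind. These are the `e` entries visible in the `tweet_token` column later in the notebook.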


The code cell below performs the task of going through our dataset, cleaning and tokenizing every tweet.

In [166]:
obama_df['tweet_token'] = obama_df['tweet'].apply(lambda stext: tokenize(str(stext)))

We then remove words that are two characters or shorter, as well as words that appear only once in the whole corpus, since those are most likely noise and contribute little to the dataset.

In [167]:
# keep only words longer than 2 characters
obama_df['tweet_string'] = obama_df['tweet_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))
# build a frequency distribution over the remaining words
all_words = ' '.join(obama_df['tweet_string'])
tokenized_obama_df = nltk.tokenize.word_tokenize(all_words)
fdist = FreqDist(tokenized_obama_df)
# keep only words that occur more than once; the short words dropped above have a count of 0 and are removed as well
obama_df['tweet_string_fdist'] = obama_df['tweet_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] > 1]))
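The two filters interact: the frequency distribution is built only from words longer than two characters, so shorter words have a count of zero and fail the frequency test too. A minimal sketch of that logic, using `collections.Counter` in place of `FreqDist` and a made-up list of tokenized tweets:

```python
from collections import Counter

# Toy tokenized tweets (hypothetical), standing in for the tweet_token column.
tweets = [
    ['e', 'obama', 'e', 'debates', 'tonight'],
    ['obama', 'camp', 'tonight', 'tuned'],
    ['im', 'afraid', 'obama', 'debates'],
]

# Count only tokens longer than two characters.
counts = Counter(tok for tweet in tweets for tok in tweet if len(tok) > 2)

# Keep tokens seen at least twice; shorter tokens have a count of 0 and drop out too.
filtered = [[tok for tok in tweet if counts[tok] > 1] for tweet in tweets]
print(filtered)
```

Like `FreqDist`, a `Counter` returns 0 for a key it has never seen, so no explicit length check is needed in the second step.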

Next comes the important task of lemmatizing our dataset. To do this, we use `WordNetLemmatizer` together with part-of-speech (POS) tags, so that each word is reduced to the lemma that matches its grammatical role.

In [168]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
# WordNetLemmatizer was already imported at the top of the notebook
lemmatizer = WordNetLemmatizer()

def pos_tagger(nltk_tag):
    # map a Penn Treebank tag to the matching WordNet POS constant
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatiser(text):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(text))
    lemmatized_sentence = []
    for word, nltk_tag in nltk_tagged:
        tag = pos_tagger(nltk_tag)
        if tag is None:
            # no WordNet equivalent for this tag: keep the word unchanged
            lemmatized_sentence.append(word)
        else:
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\utsav\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Applying this to our dataset:

In [169]:
obama_df['tweet'] = obama_df['tweet_string_fdist'].apply(lambda x: lemmatiser(x))
obama_df.head(5)

Unnamed: 0,tweet,class,tweet_token,tweet_string,tweet_string_fdist
1,wore cap barack obama signature look jason jou...,0,"[kirkpatrick, wore, baseball, cap, embroidered...",kirkpatrick wore baseball cap embroidered bara...,wore cap barack obama signature look jason jou...
3,obama debate cracker as cracker tonight,1,"[e, obama, e, debates, cracker, ass, cracker, ...",obama debates cracker ass cracker tonight tuned,obama debates cracker ass cracker tonight
5,miss point afraid understand big picture dont ...,0,"[youre, missing, point, im, afraid, understand...",youre missing point afraid understand bigger p...,missing point afraid understand bigger picture...
7,raise democrat leave party year ago never see ...,-1,"[raised, democrat, left, party, years, ago, li...",raised democrat left party years ago lifetime ...,raised democrat left party years ago never see...
8,obama camp afford low expectation tonight deba...,0,"[e, obama, camp, e, afford, lower, expectation...",obama camp afford lower expectations tonight d...,obama camp afford lower expectations tonight d...


We can now drop the helper columns `tweet_token`, `tweet_string` and `tweet_string_fdist`.<br>
We also call `dropna` to discard any rows with null values. Note that this does not remove tweets that were reduced to an empty string by the previous steps, and the row count stays at 5,625.

In [170]:
obama_df = obama_df.drop(['tweet_token', 'tweet_string', 'tweet_string_fdist'], axis=1)
obama_df.dropna(inplace=True)
print(obama_df.shape)
obama_df.head(5)

(5625, 2)


Unnamed: 0,tweet,class
1,wore cap barack obama signature look jason jou...,0
3,obama debate cracker as cracker tonight,1
5,miss point afraid understand big picture dont ...,0
7,raise democrat leave party year ago never see ...,-1
8,obama camp afford low expectation tonight deba...,0
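If tweets that ended up empty after cleaning should be discarded as well, the check has to look at the text itself rather than at null values. A dependency-free sketch of that filter, on made-up `(tweet, label)` rows:

```python
# Hypothetical (tweet, label) rows after cleaning; two tweets lost all their words.
rows = [
    ('obama debate tonight', 1),
    ('', 0),
    ('   ', -1),
    ('romney obama debate', -1),
]

# Keep only rows whose text still contains at least one non-space character.
non_empty = [(tweet, label) for tweet, label in rows if tweet.strip()]
print(len(rows), '->', len(non_empty))
```

On the dataframe, the same idea could be expressed as a boolean mask on the `tweet` column, for example one built with `Series.str.strip`. Whether any tweets in this dataset are actually empty is not shown in the notebook.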


We can now take a look at the distribution of the classes in our dataset:

In [171]:
print(obama_df['class'].value_counts())

 0    1978
-1    1968
 1    1679
Name: class, dtype: int64


## Creating train and test data splits

Our data is now ready for modelling. To train and test our models, we use an 80-20 train-test split with a fixed `random_state`, so that the split is reproducible.

In [172]:
df_X = obama_df['tweet']
df_Y = obama_df['class']
X_train, X_test, y_train, y_test = train_test_split(df_X,df_Y,test_size=0.2,random_state = 1551)

## Vectorization

### TF-IDF Vectorization

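We will use `TfidfVectorizer` from `sklearn`, with both unigrams and bigrams as features (`ngram_range=(1,2)`). The vectorizer is fitted on the training set only and then applied to the test set.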

In [173]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1,2))
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
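To make the weighting concrete, here is a small pure-Python sketch of unigram and bigram TF-IDF. It assumes the smoothed idf formula given in the scikit-learn documentation, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It splits on whitespace and ignores the vectorizer's own tokenization rules, and the three-document corpus is invented.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Toy corpus (hypothetical), already cleaned and lemmatized.
docs = ['obama debate tonight', 'romney debate tonight', 'obama win']
doc_terms = [ngrams(doc.split()) for doc in docs]

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for terms in doc_terms for term in set(terms))
n_docs = len(docs)

# Smoothed idf, following the formula in the scikit-learn documentation.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in doc_freq}

def tfidf(terms):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    weights = {t: c * idf[t] for t, c in Counter(terms).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(doc_terms[0])
print(tfidf(doc_terms[0]))
```

In the first document, the bigram `obama debate` occurs in only one document and therefore receives a higher weight than the unigram `debate`, which occurs in two.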

### Word2Vec

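We will also try `Word2Vec` from `gensim`, representing each tweet as the mean of its word vectors. Note that the Word2Vec model below is trained on all tweets in `obama_df`, including those that end up in the test split.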

In [174]:
X_train_tok = [nltk.word_tokenize(i) for i in X_train]
X_test_tok = [nltk.word_tokenize(i) for i in X_test]

In [175]:
# represents each tweet as the mean of the vectors of its words
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))
    def fit(self, X, y):
        return self
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# train Word2Vec on the tokenized tweets and build a word -> vector lookup
obama_df['clean_text_tok'] = [nltk.word_tokenize(i) for i in obama_df['tweet']]
model = Word2Vec(obama_df['clean_text_tok'], min_count=1)
w2v = dict(zip(model.wv.index_to_key, model.wv.vectors))
modelw = MeanEmbeddingVectorizer(w2v)

# X_val_vectors_w2v holds the vectors for the test split
X_train_vectors_w2v = modelw.transform(X_train_tok)
X_val_vectors_w2v = modelw.transform(X_test_tok)
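The averaging performed by `MeanEmbeddingVectorizer.transform` can be sketched without NumPy or gensim. The three-dimensional vectors and the `mean_embedding` helper below are hypothetical and exist only to show the arithmetic, including the fallback to a zero vector when none of a tweet's words are known.

```python
# Hypothetical 3-dimensional word vectors; real Word2Vec vectors are longer.
toy_vectors = {
    'obama': [1.0, 0.0, 2.0],
    'debate': [3.0, 2.0, 0.0],
}
dim = 3

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

print(mean_embedding(['obama', 'debate', 'unseen'], toy_vectors, dim))  # [2.0, 1.0, 1.0]
print(mean_embedding(['unseen'], toy_vectors, dim))                     # [0.0, 0.0, 0.0]
```

Unknown words are skipped rather than counted as zeros, so they do not pull the average towards the origin.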

We are now ready to train and evaluate our models.

## Models

Before we start running our models, we create a dataframe to keep track of our performance metrics.

In [176]:
performance = pd.DataFrame(columns=['Model','Vectorization','Accuracy', 'Precision', 'Recall', 'F1 Score'])

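To make things even easier, we define a small function that computes the metrics for a set of predictions and stores them in the dataframe. Precision, recall and F1 are averaged across classes, weighted by class support.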

In [177]:
def write_metrics(model_name, vector_name, test, predictions):
    # compute the metrics, using support-weighted averages across the three classes
    new_data = {'Model': model_name,
                'Vectorization': vector_name,
                'Accuracy': round(accuracy_score(test, predictions), 2),
                'Precision': round(precision_score(test, predictions, average='weighted'), 2),
                'Recall': round(recall_score(test, predictions, average='weighted'), 2),
                'F1 Score': round(f1_score(test, predictions, average='weighted'), 2)}
    # DataFrame.append was removed in pandas 2.0, so add the row in place instead
    performance.loc[len(performance)] = [new_data[col] for col in performance.columns]

### Model 1: Logistic Regression

In [178]:
from sklearn.linear_model import LogisticRegression

##### TF-IDF

In [179]:
lr_model_tfidf = LogisticRegression(solver='saga',C=5,penalty='l2',random_state=44) #4=57%
lr_model_tfidf.fit(X_train_vectors_tfidf, y_train)
lr_tfidf_y_pred = lr_model_tfidf.predict(X_test_vectors_tfidf)
print(classification_report(y_test,lr_tfidf_y_pred))
write_metrics('Logistic Regression','TF-IDF',y_test,lr_tfidf_y_pred)

              precision    recall  f1-score   support

          -1       0.60      0.61      0.61       428
           0       0.49      0.52      0.51       365
           1       0.62      0.57      0.59       332

    accuracy                           0.57      1125
   macro avg       0.57      0.57      0.57      1125
weighted avg       0.57      0.57      0.57      1125
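The `weighted avg` row can be reproduced from the per-class rows: each class's score is weighted by its support. The sketch below does this for precision, using the rounded values printed in the report, so the result is only approximate.

```python
# Per-class precision and support, copied from the rounded report above.
report = {-1: (0.60, 428), 0: (0.49, 365), 1: (0.62, 332)}

total = sum(support for _, support in report.values())
weighted_precision = sum(p * support for p, support in report.values()) / total
macro_precision = sum(p for p, _ in report.values()) / len(report)

print(total, round(weighted_precision, 2), round(macro_precision, 2))
```

The total support of 1,125 is the 20% test split of the 5,625 tweets. The macro and weighted averages agree here because the three classes are of similar size.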





##### Word2Vec

In [180]:
lr_model_w2v = LogisticRegression(solver='liblinear',C=10,penalty='l2',random_state=4)
lr_model_w2v.fit(X_train_vectors_w2v, y_train)
lr_w2v_y_pred = lr_model_w2v.predict(X_val_vectors_w2v)
print(classification_report(y_test,lr_w2v_y_pred))
write_metrics('Logistic Regression','Word2Vec',y_test,lr_w2v_y_pred)

              precision    recall  f1-score   support

          -1       0.54      0.56      0.55       428
           0       0.42      0.47      0.44       365
           1       0.50      0.40      0.45       332

    accuracy                           0.48      1125
   macro avg       0.49      0.48      0.48      1125
weighted avg       0.49      0.48      0.48      1125





In [181]:
performance.head(5)

Unnamed: 0,Model,Vectorization,Accuracy,Precision,Recall,F1 Score
0,Logistic Regression,TF-IDF,0.57,0.57,0.57,0.57
1,Logistic Regression,Word2Vec,0.48,0.49,0.48,0.48
