<a href="https://colab.research.google.com/github/sjay8/Project/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neural Networks: Obama/Trump Tweet Classifier

In this workshop, we will train a Recurrent Neural Network to classify between Obama's tweets and Trump's tweets. See our CAIS++ blog post [here](http://caisplusplus.usc.edu/blog/curriculum/lesson8) for more information about RNNs/LSTMs.

## Load Data and Preprocessing

Relevant Keras documentation [here](https://keras.io/preprocessing/text/).

1. **Tokenize** the tweets: convert each tweet into a sequence of word indices.
2. **Pad** the input sequences with blank spots at the end to make sure they're all the same length
3. Load in pre-trained word embeddings so that we can convert each word index into a unique **word embedding vector**

**Word embeddings** are a way of encoding words into n-dimensional vectors with continuous values so that the vector contains some information about the words' meanings. **Embedded word vectors** are often more useful than **one-hot vectors** in natural language processing applications, because the vectors themselves contain some embedded information about the word's meaning, instead of just identifying which word is there. 

For example, similar words (e.g. "him" and "her") tend to have similar embedded vectors, whereas if we were just using one-hot vectors (i.e. only one "1" for "him" and one "1" for "her"), no notion of the words' actual meanings would be conveyed. Although this may sound crazy at first, word embeddings can even convey relationships between analogous words: for example: `king-man+woman≈queen`.

For more info on word embeddings, check out this introductory [blog post](https://www.springboard.com/blog/introduction-word-embeddings/). Two of the most common word embeddings methods are [word2vec](https://www.tensorflow.org/tutorials/representation/word2vec) and [GloVe](https://nlp.stanford.edu/projects/glove/).

In [2]:
#after the data set was uploaded into the files tab 

import io
import numpy as np
import pandas as pd
import re
# plotting
import seaborn as sns
from wordcloud import WordCloud
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report



In [81]:
data = pd.read_csv("data.csv")

In [4]:
!pip install keras



In [82]:
### building data intution 
print('length of data is', len(data))
print('Count of columns in the data is:  ', len(data.columns))
print('Count of rows in the data is:  ', len(data))

# how many unique values are there in valence?
print(data['valence'].unique())
data['valence'].nunique()

#working with only relevant parts of data
data = data.drop('author', 1)
#data=data[['valence','tweet']]
data.tail(5) #checking if above worked 


length of data is 1600000
Count of columns in the data is:   3
Count of rows in the data is:   1600000
[0 4]


Unnamed: 0,valence,tweet
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,happy #charitytuesday @theNSPCC @SparksCharity...


In [83]:
data_pos = data[data['valence'] == 4]
data_neg = data[data['valence'] == 0]

data_pos = data_pos.iloc[:int(20000)]
data_neg = data_neg.iloc[:int(20000)]
data= pd.concat([data_pos, data_neg])
data['tweet']=data['tweet'].str.lower()
data ['tweet'].head()

800000         i love @health4uandpets u guys r the best!! 
800001    im meeting up with one of my besties tonight! ...
800002    @darealsunisakim thanks for the twitter add, s...
800003    being sick can be really cheap when it hurts t...
800004      @lovesbrooklyn2 he has that effect on everyone 
Name: tweet, dtype: object

In [113]:
label_to_sentiment = {0:"Negative", 4:"Positive"}
def mapper(label):
     return label_to_sentiment[label]
data.valence = data.valence.apply(lambda x: mapper(x))

In [84]:
stopwords = ['a', 'about','like', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can',"can't", 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i','im', 'if', 'in',
             'into','is', 'it', 'its',"it's","i'm",  'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're','s', 'same', 'she', "she's", 'should', "should've",'so', 'some', 'such',
             't', 'than', 'that', "that'll","that's", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "you'd","you'll", "you're",
             "you've", 'your', 'yours', 'yourself', 'yourselves']

In [85]:
data['tweet']= data['tweet'].str.lower()
def clean_symbols(data):
    data= re.sub('https?://[A-Za-z0-9./]+',' ',data)
    data= re.sub('@[A-Za-z0-9]+',' ', data)
    data= re.sub('#[A-za-z0-9]',' ', data) # remove #s
    data = re.sub('[()!%.;,?]', ' ', data)
    data = re.sub('[0-9]',' ', data)
    data = data.split()
    data = [w for w in data if not w in stopwords]
    data = " ".join(word for word in data)
    return data
data['tweet'] = data['tweet'].apply(lambda x: clean_symbols(x))




In [114]:
data.head()

Unnamed: 0,valence,tweet
800000,Positive,love u guys r best
800001,Positive,meeting one besties tonight cant wait - girl talk
800002,Positive,thanks twitter add sunisa got meet hin show dc...
800003,Positive,sick really cheap hurts much eat real food plu...
800004,Positive,effect everyone


In [116]:
from keras.preprocessing.text import Tokenizer
#Tokenizing 
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data.tweet)
word_index = tokenizer.word_index
print(word_index)
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size :", vocab_size)

#Padding 
from keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(tokenizer.texts_to_sequences(data.tweet),maxlen = 30)
X_test = pad_sequences(tokenizer.texts_to_sequences(data.tweet),maxlen = 30)

Vocabulary Size : 31467


In [120]:
labels = ['Negative', 'Positive']
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(data.valence.to_list())
y_train = encoder.transform(data.valence.to_list())
y_test = encoder.transform(data.valence.to_list())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)


In [149]:
from tensorflow.keras.layers import Conv1D, Bidirectional, LSTM, Dense, Input, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.callbacks import ModelCheckpoint

model=Sequential()
model.add(Embedding(vocab_size,300,input_length=30,trainable=False))
# Input layer 

model.add(Input(shape=(30,), dtype='int32'))

# Passed on to the LSTM layer
model.add(Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))

# Output Layer
model.add(Dense(1, activation='sigmoid'))




In [148]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model.summary()

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_20 (Embedding)    (None, 30, 300)           9440100   
                                                                 
 input_5 (InputLayer)        multiple                  0         
                                                                 
 bidirectional_3 (Bidirectio  (None, 128)              186880    
 nal)                                                            
                                                                 
 dense_26 (Dense)            (None, 512)               66048     
                                                                 
 dropout_20 (Dropout)        (None, 512)               0         
                                                                 
 dense_27 (Dense)            (None, 512)               262656    
                                                     

In [130]:
training = model.fit(X_train, y_train, batch_size=64, epochs=8,
                    validation_data=(X_test, y_test))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [131]:
def predict_tweet_sentiment(score):
    return "Positive" if score>0.5 else "Negative"
scores = model.predict(X_test, verbose=1, batch_size=10000)
model_predictions = [predict_tweet_sentiment(score) for score in scores]



In [133]:
from sklearn.metrics import classification_report
print(classification_report(list(data.valence), model_predictions))

              precision    recall  f1-score   support

    Negative       0.95      0.91      0.93     20000
    Positive       0.92      0.95      0.93     20000

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000



## Creating a RNN
Here are some useful resources to help you:

#### Keras Documentation
* [Embedding layer](https://keras.io/layers/embeddings/) (our word embeddings are pre-trained, so you won't have to worry too much about this)
* [**Recurrent layers (e.g. RNN, LSTM)**](https://keras.io/layers/recurrent/) (Feel free to play around with different layers.)
* [Dense layer](https://keras.io/layers/core/#dense) (Use this for your final classification "vote")
* [Activation functions](https://keras.io/activations/), [Dropout](https://keras.io/layers/core/#dropout), [Batch Normalization](https://keras.io/layers/normalization/), [Optimizers](https://keras.io/optimizers/)

#### Examples
* [Keras sequential model guide](https://keras.io/getting-started/sequential-model-guide/) (Scroll down to "Sequence classification with LSTM")
* [Keras LSTM for IMDB review sentiment analysis](https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py)

**Hint**: if you have multiple recurrent layers, remember to use `return_sequences=True` if and only if you're adding another recurrent layer after the current one. This makes it so that the recurrent layer spits out an output after each timestep (or element in the sequence), instead of just at the very end of the sequence.

### Train/Evaluate the Model
Train the network. Look to make sure that the loss is decreasing and the accuracy is decreasing, but be on the lookout for overfitting. 

Aim for **90%** final validation accuracy! (Pretty much the same as using a separate test set)

## Finished early? There's another way to extract features from words without word embeddings.


Recall that **Word embeddings** are a way of encoding words into n-dimensional vectors with continuous values so that the vector contains some information about the words' meanings. Intuitively speaking, the embedding is trained so that words with semantically similar meanings are close to each other in the n-dimensional space.

Here we will use another method called **TF-IDF vectorization** (term frequency, inverse document frequency). Note that you can also run TF-IDF using n-word phrases rather than individual words.

See [this blog post](https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7) for more details!