# Sentiment Analysis and Prediction

Index:

-   [Sources](#Sources)
-   [PreProcessing](#PreProcessing)
-   [Model](#Model)
-   [Results](#Results)
-   [Conclusion](#Conclusion)
-   [Usage](#Usage)
-   [Bonus](#Bonus)

## Sources

As the grade isn't about how we build the dataset, we used a [dataset](https://www.kaggle.com/maxjon/complete-tweet-sentiment-extraction-data) originally made for a competition.
Thanks to the author of this dataset.
To interact with the data we use [numpy](https://numpy.org/), [pandas](https://pandas.pydata.org/), [nltk](https://www.nltk.org/), [pickle](https://docs.python.org/3/library/pickle.html) and [sklearn](https://scikit-learn.org/stable/).
We also used [tweepy](https://www.tweepy.org/) to try out our trained model on real tweets.

## PreProcessing

### Setup

In [29]:
import numpy as np
import pandas as pd
import sklearn as sk
import nltk
import re

# Tokenization
nltk.download('punkt')

# Normalization
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Cleaning
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/romain/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/romain/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/romain/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/romain/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Importation

We import the training data in order to preprocess it.

In [30]:
train = pd.read_csv("data/train.csv")
# Remove null data
train = train.dropna()

print(train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27480 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27480 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27480 non-null  object
dtypes: object(4)
memory usage: 1.0+ MB
None


### Tokenization

In order to analyze the string data we 'tokenize' it.
This means that we split every sentence into an array of words and punctuation marks.

In [31]:
train["text"] = [nltk.tokenize.word_tokenize(i) for i in train["text"]]
print(train.head())


       textID                                               text  \
0  cb774db0d1  [I, `, d, have, responded, ,, if, I, were, going]   
1  549e992a42  [Sooo, SAD, I, will, miss, you, here, in, San,...   
2  088c60f138                  [my, boss, is, bullying, me, ...]   
3  9642c003ef             [what, interview, !, leave, me, alone]   
4  358bd9e861  [Sons, of, *, *, *, *, ,, why, couldn, `, t, t...   

                         selected_text sentiment  
0  I`d have responded, if I were going   neutral  
1                             Sooo SAD  negative  
2                          bullying me  negative  
3                       leave me alone  negative  
4                        Sons of ****,  negative  
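
To make the idea concrete, here is a minimal sketch of tokenization with a single regular expression. It is not how `nltk.tokenize.word_tokenize` works internally (NLTK applies more elaborate rules), and the helper name `simple_tokenize` is ours, but on the first tweet it produces the same split as the output above.

```python
import re

def simple_tokenize(text: str) -> list:
    # A token is either a run of word characters or one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I`d have responded, if I were going")
print(tokens)
# ['I', '`', 'd', 'have', 'responded', ',', 'if', 'I', 'were', 'going']
```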


### Remove Noise from our dataSet

Noise means things like empty words, spaces, single letters or special characters.
What we need to do now is remove all this unwanted data, for example with a RegEx.
We also convert the words to lowercase.

In [32]:
def cleanData(word : str):
    return re.sub(r'[^A-Za-z0-9_]','',word).lower()
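
As a quick check of what this function does, the sketch below repeats its definition and applies it to a few sample tokens (the token list is made up for illustration). Tokens made only of special characters come back as empty strings, which is why empty words have to be filtered out afterwards.

```python
import re

def cleanData(word: str):
    # Same definition as above: keep letters, digits and underscores, then lowercase
    return re.sub(r'[^A-Za-z0-9_]', '', word).lower()

samples = ["Sooo", "SAD", "couldn`t", "****", "!"]
cleaned = [cleanData(token) for token in samples]
print(cleaned)
# ['sooo', 'sad', 'couldnt', '', '']
```

Note that the pattern only keeps ASCII letters, so accented characters are removed as well.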

### Normalization

We now have all our sentences sliced up into words, but we face a problem.
The same word can have multiple forms depending on the context.
Our goal is to reduce all those forms to a single base form (the lemma) in order to get a smaller dictionary.
The process of grouping together the forms of a word is called **Lemmatization**.
First we tag the words to identify their type (verb, noun, adjective...), then we use a dictionary to transform them into their simplest form.

### StopWords

At the same time we want to remove useless words.
Some words do not help us understand the sentence as a whole; these are called stopwords.
For example, "the", "an" and "in" don't help and would weaken the model, as they are very common in all types of sentences.
They don't help us make a decision.
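
The filtering itself is simple. Here is a minimal sketch with a tiny, made-up stopword set (the list NLTK provides for English is much longer); the names `STOP_WORDS` and `remove_stopwords` are only for illustration.

```python
# Hypothetical, tiny stopword list; NLTK's English list is much longer
STOP_WORDS = {"the", "an", "in", "i", "to", "my"}

def remove_stopwords(tokens: list) -> list:
    # Compare in lowercase so that "I" and "i" are treated the same way
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stopwords(["I", "miss", "the", "beach", "in", "San", "Diego"])
print(filtered)
# ['miss', 'beach', 'San', 'Diego']
```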

In [33]:
tag_map = dict()
tag_map['V'] = nltk.corpus.wordnet.VERB
tag_map['J'] = nltk.corpus.wordnet.ADJ
tag_map['N'] = nltk.corpus.wordnet.NOUN
tag_map['R'] = nltk.corpus.wordnet.ADV
tag_map['S'] = nltk.corpus.wordnet.ADJ_SAT

# Map the first letter of a Penn Treebank POS tag to the matching WordNet POS tag (noun by default)
def convertPostagToLemmitizationTag(pos_tag):
    return tag_map.get(pos_tag, nltk.corpus.wordnet.NOUN)
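
`nltk.pos_tag` returns Penn Treebank tags such as `VBD` or `NNS`, while the lemmatizer expects WordNet's one-letter codes. The sketch below shows the mapping on a few tags. To run without the NLTK data it writes the WordNet constants out as the literal letters `'v'`, `'a'`, `'n'` and `'r'`, and the names `TAG_MAP` and `to_wordnet_pos` are ours.

```python
# Literal values of wordnet.VERB, ADJ, NOUN and ADV, written out so the sketch runs without NLTK
TAG_MAP = {"V": "v", "J": "a", "N": "n", "R": "r"}

def to_wordnet_pos(penn_tag: str) -> str:
    # Only the first letter of the Penn Treebank tag matters; unknown tags default to noun
    return TAG_MAP.get(penn_tag[0], "n")

mapped = {tag: to_wordnet_pos(tag) for tag in ["VBD", "JJ", "NNS", "RB", "DT"]}
print(mapped)
# {'VBD': 'v', 'JJ': 'a', 'NNS': 'n', 'RB': 'r', 'DT': 'n'}
```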

In [34]:
# Build these once instead of once per sentence or per word
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))

def lemmatize_sentence(sent : list):
    lemmatized_sentence = []
    for word, tag in nltk.pos_tag(sent):
        # Lemmatize with the word's part of speech, then clean with the function we made earlier
        cleanedWord = cleanData(lemmatizer.lemmatize(word, convertPostagToLemmitizationTag(tag[0])))
        # Keep the word only if it is non-empty and not a stopword
        if cleanedWord and cleanedWord not in stop_words:
            lemmatized_sentence.append(cleanedWord)
    return lemmatized_sentence


In [35]:
# Lemmatize and clean every sentence
train["text"] = [lemmatize_sentence(i) for i in train["text"]]
print(train.head())


       textID                                text  \
0  cb774db0d1                       [respond, go]   
1  549e992a42       [sooo, sad, miss, san, diego]   
2  088c60f138                        [bos, bully]   
3  9642c003ef           [interview, leave, alone]   
4  358bd9e861  [sons, put, release, already, buy]   

                         selected_text sentiment  
0  I`d have responded, if I were going   neutral  
1                             Sooo SAD  negative  
2                          bullying me  negative  
3                       leave me alone  negative  
4                        Sons of ****,  negative  


Now, to get a quick idea of the word distribution, we iterate through the whole dataset and take the most used words for each sentiment.
This gives us a pretty good idea of which words are used the most.

In [36]:
commonDic = dict()

for i in ["neutral", "positive", "negative"]:
    # Count every word of every sentence labelled with this sentiment
    commonDic[i] = nltk.FreqDist(
        word for sentence in train[train["sentiment"] == i]["text"] for word in sentence
    )
    print(i + ": " + str(commonDic[i].most_common(10)))

neutral: [('get', 1264), ('go', 1166), ('day', 645), ('work', 640), ('http', 596), ('lol', 494), ('u', 479), ('like', 474), ('time', 471), ('know', 456)]
positive: [('day', 1345), ('good', 1191), ('love', 1012), ('happy', 866), ('get', 774), ('go', 636), ('thanks', 559), ('mother', 533), ('great', 495), ('like', 471)]
negative: [('get', 922), ('go', 836), ('miss', 636), ('work', 494), ('like', 490), ('day', 420), ('feel', 408), ('sad', 405), ('im', 370), ('bad', 369)]
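
The same counting can be sketched with the standard library only, since `nltk.FreqDist` builds on `collections.Counter`. The rows below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical preprocessed rows: (tokens, sentiment)
rows = [
    (["sad", "miss", "san", "diego"], "negative"),
    (["miss", "day"], "negative"),
    (["happy", "day"], "positive"),
    (["good", "day", "love"], "positive"),
]

# One counter per sentiment, updated with the tokens of each row
counts = {}
for tokens, sentiment in rows:
    counts.setdefault(sentiment, Counter()).update(tokens)

for sentiment, counter in counts.items():
    print(sentiment, counter.most_common(1))
```

Words such as "get", "go" and "day" rank high for every sentiment in the real output, so raw frequency alone says little about which words separate the classes.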


### Encoding

To make the data easier to process we encode it. First we encode the Y data (the sentiment).
To do that we use the sklearn LabelEncoder, which maps each sentiment label to an integer.

Then we encode the text.
This time we do not use the LabelEncoder; instead we use the CountVectorizer.
Basically, it builds a dictionary of all the known words, then turns the dataset into a matrix where each row is a tweet and each column counts the occurrences of one known word.
This makes the data much easier to process in our future model.


In [37]:
sentimentEncoder = sk.preprocessing.LabelEncoder()

encodedSentiment = sentimentEncoder.fit_transform(train["sentiment"])

In [38]:
textEncoder = sk.feature_extraction.text.CountVectorizer()
encodedText = textEncoder.fit_transform([ ' '.join(i) for i in train["text"]])
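
To show what the two encoders compute, here is a hand-rolled sketch in plain Python on a few made-up documents. It is not scikit-learn's implementation: the real classes do more (for instance, `CountVectorizer` returns a sparse matrix and, with its default settings, ignores one-letter tokens).

```python
labels = ["neutral", "negative", "positive", "negative"]
docs = ["respond go", "sooo sad miss san diego", "happy day", "sad day"]

# Label encoding: each distinct label gets an integer, in sorted order
classes = sorted(set(labels))
encoded_labels = [classes.index(label) for label in labels]

# Bag of words: one column per known word, one row per document
vocabulary = sorted({word for doc in docs for word in doc.split()})
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(classes, encoded_labels)
print(vocabulary)
for row in matrix:
    print(row)
```

The last document, "sad day", becomes a row with a 1 in the "day" and "sad" columns and 0 everywhere else.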

## Model

Now that our data is cleaned up, we want to train our model.
As we want to put a label on a given sentence, we need a classification algorithm.

But first we split the data we previously preprocessed into a training set and a testing set.

In [39]:
Xtrain, Xtest, Ytrain, Ytest = sk.model_selection.train_test_split(encodedText, encodedSentiment, test_size=0.15)
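
`train_test_split` shuffles the rows randomly, so unless a `random_state` is passed, each call produces a different split. The sketch below illustrates the idea of a seeded, reproducible split with the standard library only; `split_rows` and its default seed are hypothetical and this is not scikit-learn's implementation.

```python
import random

def split_rows(rows: list, test_size: float = 0.15, seed: int = 42):
    # Shuffle a copy with a seeded generator so the same split comes back every time
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))
train_a, test_a = split_rows(rows)
train_b, test_b = split_rows(rows)
print(len(train_a), len(test_a), test_a == test_b)
```

A reproducible split matters when a model is saved and evaluated later: the evaluation is only meaningful if the test rows are rows the model never saw during training.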

To find the best matching algorithm, we decide to test a few classification algorithms on a small subset of the data (the first 1000 rows).

In [12]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

In [32]:
for model in [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), RandomForestClassifier(), svm.SVC(), KNeighborsClassifier()]:
    model.fit(Xtrain.toarray()[:1000],Ytrain[:1000])
    prediction = model.predict(Xtest.toarray()[:1000])
    accuracy = sk.metrics.accuracy_score(Ytest[:1000], prediction)
    with open(type(model).__name__ + ".result", "w") as f:
        f.write("\n\n" + type(model).__name__ + ": " + str(accuracy))
        f.write("\nPrediction:\n")
        f.write(str(prediction))
        f.write("\nReal:\n")
        f.write(str(Ytest[:1000]))
    print(type(model).__name__ + ": " + str(accuracy))

GaussianNB: 0.397
MultinomialNB: 0.56
DecisionTreeClassifier: 0.608
RandomForestClassifier: 0.611
SVC: 0.585
KNeighborsClassifier: 0.473


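We now have a clearer idea of which types of algorithm work on our data.
To find out which one fits best, we train three candidates on the full dataset: the two best performers, RandomForestClassifier and DecisionTreeClassifier, plus MultinomialNB (SVC scored slightly higher than MultinomialNB on the subset but is not part of this run).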

In [33]:
for model in [MultinomialNB(), DecisionTreeClassifier(), RandomForestClassifier()]:
    model.fit(Xtrain.toarray(),Ytrain)
    prediction = model.predict(Xtest.toarray())
    accuracy = sk.metrics.accuracy_score(Ytest, prediction)
    with open(type(model).__name__ + ".bigResult", "w") as f:
        f.write("\n\n" + type(model).__name__ + ": " + str(accuracy))
    print(type(model).__name__ + ": " + str(accuracy))

MultinomialNB: 0.6491994177583698
DecisionTreeClassifier: 0.6669092673459486
RandomForestClassifier: 0.6933527413876759


We now have to decide which algorithm to use in our final version. We choose the RandomForestClassifier, which has the best accuracy on the full dataset (about 69%).
Random forests are a common choice for text classification, so this result is not surprising.

### Reusability

As we don't want to train our model each time, we train a model and save it with pickle.
That way we can load it quickly and make predictions without retraining the model every time.

In [13]:
import pickle

model = RandomForestClassifier()
model.fit(Xtrain.toarray(), Ytrain)
prediction = model.predict(Xtest.toarray())
accuracy = sk.metrics.accuracy_score(Ytest, prediction)
with open(type(model).__name__ + ".finalResult", "w") as f:
    f.write("\n\n" + type(model).__name__ + ": " + str(accuracy))
print(type(model).__name__ + ": " + str(accuracy))

# Use a context manager so the file is closed once the model is written
with open('trainedModel.sav', 'wb') as f:
    pickle.dump(model, f)

RandomForestClassifier: 0.6926249393498302
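
Predicting on new text in a fresh session also needs the fitted `textEncoder` and `sentimentEncoder`, not only the classifier. One option is to pickle them together in a single file. The sketch below shows the round trip with stand-in objects and a hypothetical file name, so it runs without the trained model.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects; in the notebook these would be
# model, textEncoder and sentimentEncoder
bundle = {
    "model": {"name": "stand-in model"},
    "text_encoder": {"vocabulary": ["day", "sad"]},
    "sentiment_encoder": ["negative", "neutral", "positive"],
}

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), "sentimentBundle.sav")

with open(path, "wb") as f:
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bundle)
# True
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.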


We now have a pre-trained model that we can load and use on the fly like so:

In [13]:
with open('trainedModel.sav', 'rb') as f:
    model = pickle.load(f)
result = model.score(Xtest.toarray(), Ytest)
print(result)


0.9539058709364386


## Results

In [41]:
import pickle

with open('trainedModel.sav', 'rb') as f:
    model = pickle.load(f)
result = model.score(Xtest.toarray(), Ytest)
print(result)

0.9539058709364386


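As we can see, the reloaded model reaches an accuracy score of about 95% on this test set, which looks very good.
This is much higher than the 69% measured right after training, though. Judging by the cell numbers, the split was redone without a fixed `random_state` between the two runs, so this test set most likely contains tweets the saved model was trained on, and the 69% figure is probably closer to the accuracy on unseen tweets.
We are satisfied with this model and keep it as it is.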
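
The score printed here (0.954) is far from the one printed right after training (0.693). One hypothesis consistent with the cell numbers is that the model was scored on a new random 15% split, most of whose rows were in the training set when the model was saved. The arithmetic below checks whether the numbers fit that hypothesis; the 85% share of already seen rows is an assumption, not something the notebook records.

```python
score_after_training = 0.6926249393498302  # printed right after fitting
score_after_reload = 0.9539058709364386    # printed after loading the pickle

# Assumption: the reloaded model is scored on a fresh random 15% split,
# so about 85% of those rows were in its original training set
seen_share = 0.85
unseen_share = 1 - seen_share

# Solve: reload = seen_share * acc_seen + unseen_share * acc_unseen
implied_accuracy_on_seen = (score_after_reload - unseen_share * score_after_training) / seen_share
print(round(implied_accuracy_on_seen, 3))
```

The implied accuracy on already seen rows comes out at about 1.0, which is what a random forest typically reaches on its own training data. Under this assumption, the 0.693 figure is the better estimate for tweets the model has never seen.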

## Conclusion

To conclude, we now have a model trained to identify the sentiment of a tweet.
The most important part is how we processed the data to make the model more precise and efficient.
We had to make a few decisions about this processing: whether to remove stopwords, whether to keep special characters, and which encoding to use.
The choice of model is also really important, and we think that the RandomForestClassifier fits our needs.

## Usage

We can use this tweet sentiment recognizer for multiple purposes if we combine it with a tweet scraper.
For example, it could be interesting to determine the overall sentiment of a Twitter account, or the sentiment per month, to follow the mindset of a given person.
But most scrapers have stopped working lately, so we have to use the official API.


In [42]:
def get_sentiment(tweet : str):
    # Apply the same preprocessing as for training: tokenize, lemmatize and clean, then vectorize
    tokens = lemmatize_sentence(nltk.tokenize.word_tokenize(tweet))
    encoded_text = textEncoder.transform([' '.join(tokens)])
    # Decode the predicted class index back into its sentiment label
    return sentimentEncoder.inverse_transform(model.predict(encoded_text.toarray()))[0]

print(get_sentiment("i'm really sad those day"))
print(get_sentiment("do french fries have a soul ?"))
print(get_sentiment("so happy to see my friend today"))

negative
neutral
positive
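
The per-month idea mentioned above could be sketched as follows. The dates and labels are made up; in practice each label would come from `get_sentiment` applied to a tweet, and the date from the tweet's metadata.

```python
from collections import Counter, defaultdict

# Hypothetical (date, sentiment) pairs standing in for labelled tweets
labelled = [
    ("2021-01-03", "negative"),
    ("2021-01-15", "neutral"),
    ("2021-01-20", "negative"),
    ("2021-02-02", "positive"),
    ("2021-02-10", "neutral"),
]

per_month = defaultdict(Counter)
for date, sentiment in labelled:
    per_month[date[:7]][sentiment] += 1  # "YYYY-MM" is the month key

for month in sorted(per_month):
    print(month, dict(per_month[month]))
```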


## Bonus

Here is some sample code that reads a user's profile and computes the positivity score of their tweets.

In [63]:

import tweepy

def read_secret(path : str):
    # Read a credential file, close it, and drop any trailing newline
    with open(path, "r") as f:
        return f.read().strip()

def get_profile_positivity(username : str, nb_tweet : int = 100, detailled : bool = False):
    auth = tweepy.OAuthHandler(
        read_secret("./info/api.key"),
        read_secret("./info/apikey.secret")
    )
    auth.set_access_token(
        read_secret("./info/access.token"),
        read_secret("./info/accesssecret.token")
    )

    api = tweepy.API(auth)
    # Either count the tweets per sentiment, or collect their text when detailled is True
    if detailled:
        positivity = {"negative": [], "neutral": [], "positive": []}
    else:
        positivity = {"negative": 0, "neutral": 0, "positive": 0}

    for t in api.user_timeline(screen_name=username, count=nb_tweet, include_rts=False):
        if not detailled:
            positivity[get_sentiment(t.text)] += 1
        else:
            positivity[get_sentiment(t.text)].append(t.text)

    return positivity

In [64]:
print(get_profile_positivity("realDonaldTrump", 5000, detailled=True))

{'negative': ['Sleepy Eyes Chuck Todd is so happy with the fake voter tabulation process that he can’t even get the words out straight. Sad to watch!', 'The Vice President has the power to reject fraudulently chosen electors.', 'How can you certify an election when the numbers being certified are verifiably WRONG. You will see the real number… https://t.co/jfBOEEVjX7', 'Sorry, but the number of votes in the Swing States that we are talking about is VERY LARGE and totally OUTCOME DETE… https://t.co/KZKiATT1lB', 'Why haven’t they done signature verification in Fulton County, Georgia. Why haven’t they deducted all of the dead p… https://t.co/Bkz8kFz41u', 'Some States are very slow to inoculate recipients despite successful and very large scale distribution of vaccines… https://t.co/uCpPoqzWuA', 'Our Republican Senate just missed the opportunity to get rid of Section 230, which gives unlimited power to Big Tec… https://t.co/bevkn4zsNf', 'Watching @FoxNews is almost as bad as watching Fake 

In [60]:
print(get_profile_positivity("the_weird_weeb", 5000))




{'negative': 5, 'neutral': 65, 'positive': 1}
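
The counts can be condensed into a single number. The score below (share of positive tweets minus share of negative tweets) is one possible definition chosen for illustration; it is not part of the code above.

```python
def positivity_score(counts: dict) -> float:
    # Hypothetical score in [-1, 1]: share of positive minus share of negative tweets
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["positive"] - counts["negative"]) / total

counts = {"negative": 5, "neutral": 65, "positive": 1}
score = positivity_score(counts)
print(round(score, 3))
# -0.056
```

A score close to 0 with a large neutral count, as here, describes a mostly neutral profile rather than a balanced mix of positive and negative tweets.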
