# Twitter Sentiment Analysis

> This kernel is a solution to the challenge launched by School of AI - Algiers, which consists of building a system that can classify tweets as Sad or Happy.

# Solution

> We will start by reading some tweets so we can understand our data better. We will then transform our tweets into something usable by different ML models, from which we will choose the most effective. Finally, we will fine-tune our model and test it to see how well it performs on new data.

# Update

After getting some comments on the School of AI - Algiers group, especially from Belkacem, I updated the following:

* I used lemmatization instead of stemming.
* I also noticed that I was mistaken when I capped the max_features parameter at 20000 in the GridSearch. If the search picks the largest value offered, a bigger one may do even better, so I added None (no limit) to the candidates.

# Content
> Loading the data
> 
> Visualize the tweets
> 
> Emoticons
> 
> Most used words
> 
> Stop words
> 
> Stemming
> 
> Prepare the data
> 
> Bag of Words
> 
> Building the pipeline
> 
> Select a model
> 
> Fine tune the model
> 
> Testing the model
> 
> Test your tweet

# Load the data

In [1]:
import numpy as np
import pandas as pd

# Widen the column display so that long tweets are shown
pd.options.display.max_colwidth = 100

# I ran into an encoding issue and didn't know which encoding to use.
# This post suggested an encoding that worked:
# https://stackoverflow.com/questions/19699367/unicodedecodeerror-utf-8-codec-cant-decode-byte
train_data = pd.read_csv("../input/twitter-dataset/train_twitter.csv")

In [2]:
train_data

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #iger...
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias…...
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connec...
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr....
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19....
...,...,...,...
7915,7916,0,Live out loud #lol #liveoutloud #selfie #smile #sony #music #headphones https://instagram.com/p/...
7916,7917,0,We would like to wish you an amazing day! Make every minute count #tls #today #iphone #accessori...
7917,7918,0,Helping my lovely 90 year old neighbor with her iPad this morning has just made me realise that ...
7918,7919,0,"Finally got my #smart #pocket #wifi stay connected anytime,anywhere! #ipad and #samsung #s3 #gad..."


# Visualize the tweets

From the tweets above, we can already make some remarks about the data:

> We can see that there is some garbage like '&amp', '&lt' (HTML entities) that is not going to help us in our classification.

> On Twitter, people mention their friends with tags like @username, and there are a lot of them in our data. I discussed the usefulness of these tags with a friend: for him, people tend to mention more friends when they are happy, but I think people may also mention others because they did something bad. When we face this kind of uncertainty, it's better to try the different options and evaluate which one does better, which is what we are going to do.

In [3]:
# We will now take a look at random tweets
# to gain more insights

# randint's upper bound is exclusive, so start at 0 to include the first tweet
rand_indexs = np.random.randint(0, len(train_data), 50).tolist()
train_data["tweet"][rand_indexs]

2414    Gain Followers RT This MUST FOLLOW ME I FOLLOW BACK Follow everyone who rts Gain #iphone #sougof...
1409    My new headphone #instacool #me #sony #headphone #phone #sound #music http://instagram.com/p/dDN...
2677    The #Prophet as a #Husband . #iphone : http://ift.tt/2boR0kb #android : http://ift.tt/2aSURcJ #k...
6376    Hey Guys! Look Motorola-MOTO-G5-Plus-01108NARTL-64GB http://zpr.io/nv8Rv #money #today #life #tw...
4405    I beat a personal record today on the bike! #vsco #vscocam #greatoutdoors #beautiful #statenisla...
3170     Welcome ! #birthday #gift #likeit it #surprise you #samsung #s4 http://instagram.com/p/fribPpILv9/
529     I must go out some job done & 2 spread happiness, in order to reveal #FF #quote #twinagoya #Japa...
158                                   My phone just deleted every single one of my contacts. #random #apple
5077    Got my new baby!!!! #samsung #samsungs6 #s6 #galaxy #new #android https://instagram.com/p/9op9KQ...
5023                       P

# Note
> You will not get the same results at each execution because of the randomization. After a few executions, I noticed the following:

> There are tweets with a URL (like tweet 35546): we must think about a way to handle URLs. I thought about deleting them, because a domain name or the protocol used will not make someone happy or sad, unless the domain name is something like 'food.com'.

> The use of hashtags: we should keep only the words without the '#', so that the word python and the hashtag '#python' are seen as the same word, which of course they are.

> Words like 'as', 'to' and 'so' should be deleted, because they only serve to link phrases and words.
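
The three remarks above can be sketched on a single string with the standard `re` module. This is only an illustration: the function name `rough_clean` and the tiny stop-word set are made up for the example, and the actual preprocessing is done later in the pipeline.

```python
import re

# Tiny illustrative stop-word set (an assumption, not NLTK's full list)
STOP_WORDS = {"as", "to", "so"}

def rough_clean(tweet):
    # Drop URLs: the address itself carries little sentiment
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Keep the word of a hashtag so '#python' and 'python' match
    tweet = tweet.replace("#", "")
    # Remove linking words
    words = [w for w in tweet.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)

print(rough_clean("So happy to learn #python http://example.com/x"))
# prints: happy learn python
```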

# Emoticons
> Internet language includes many emoticons, and people also tend to create their own. We will first analyze the emoticons included in our dataset, try to classify them as happy or sad, and make sure that our model knows about them.

In [4]:
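# We are going to find which emoticons are used in our dataset
import re
tweets_text = train_data.tweet.str.cat()
# Candidate emoticons: eyes, optional nose, any mouth, surrounded by spaces
emos = set(re.findall(r" ([xX:;][-']?.) ",tweets_text))
emos_count = []
for emo in emos:
    # str.count counts every occurrence of the substring, so ':/' also
    # counts the '://' of URLs, which inflates its count
    emos_count.append((tweets_text.count(emo), emo))
sorted(emos_count,reverse=True)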

[(4493, ':/'),
 (200, ':)'),
 (77, ':D'),
 (43, 'x.'),
 (35, 'xx'),
 (33, ':3'),
 (32, ':('),
 (31, ';)'),
 (26, 'xD'),
 (22, 'XZ'),
 (20, ':-)'),
 (17, 'XD'),
 (14, ';-)'),
 (9, 'X.'),
 (9, ':P'),
 (8, ':*'),
 (6, ':-D'),
 (5, ':O'),
 (4, ':-('),
 (4, ":')"),
 (3, ';p'),
 (3, ':|'),
 (3, ':p'),
 (3, ':]'),
 (3, '::'),
 (2, ';D'),
 (2, ':o'),
 (2, ':@'),
 (1, 'x)'),
 (1, 'X)'),
 (1, ';I'),
 (1, ':v'),
 (1, ':-p'),
 (1, ':-/'),
 (1, ':-*'),
 (1, ":'(")]

> We now know which emoticons are used (and how often), so we can build two regexes, one for the happy ones and another for the sad ones. We will then use them during preprocessing to mark tweets as containing happy or sad emoticons.

In [5]:
HAPPY_EMO = r" ([xX;:]-?[dD)]|:-?[\)]|[;:][pP]) "
SAD_EMO = r" (:'?[/|\(]) "
print("Happy emoticons:", set(re.findall(HAPPY_EMO, tweets_text)))
print("Sad emoticons:", set(re.findall(SAD_EMO, tweets_text)))

Happy emoticons: {':-D', ';)', 'XD', ':P', 'x)', ':p', ':)', 'xD', ';D', 'X)', ';p', ':-)', ':D', ';-)'}
Sad emoticons: {":'(", ':/', ':|', ':('}
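
As a rough sketch of how these two patterns can mark emoticons in a text, the example below applies them with `re.sub` on plain strings. The patterns are copied from the cell above; the helper name `mark_emoticons` is hypothetical. Note that both patterns require a space on each side, so an emoticon at the very end of a string is left untouched.

```python
import re

# Same patterns as in the cell above
HAPPY_EMO = r" ([xX;:]-?[dD)]|:-?[\)]|[;:][pP]) "
SAD_EMO = r" (:'?[/|\(]) "

def mark_emoticons(text):
    # Replace each emoticon (with its surrounding spaces) by a marker word
    text = re.sub(HAPPY_EMO, " happyemoticons ", text)
    text = re.sub(SAD_EMO, " sademoticons ", text)
    return text

print(mark_emoticons("new phone :D so cool"))   # new phone happyemoticons so cool
print(mark_emoticons("battery died :( again"))  # battery died sademoticons again
print(mark_emoticons("see you :)"))             # unchanged: no trailing space
```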


# Most used words
> What we are going to do next is to define a function that will show us top words, so we may fix things before running our learning algorithm. This function takes as input a text and output words sorted according to their frequency, starting with the most used word.

In [6]:
import nltk

nltk.download('punkt')

In [7]:
import nltk
from nltk.tokenize import word_tokenize

# Uncomment the line below if you haven't downloaded punkt before
# (or if tokenizing raises a LookupError).
#nltk.download('punkt')
def most_used_words(text):
    tokens = word_tokenize(text)
    frequency_dist = nltk.FreqDist(tokens)
    print("There is %d different words" % len(set(tokens)))
    return sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)

In [8]:
most_used_words(train_data.tweet.str.cat())[:100]

There is 28800 different words


['#',
 ':',
 '!',
 'http',
 '.',
 'iphone',
 '@',
 ',',
 'my',
 'to',
 'the',
 'I',
 'apple',
 'a',
 'and',
 'iPhone',
 '&',
 '...',
 'for',
 'https',
 'it',
 '?',
 'samsung',
 'Apple',
 'is',
 'phone',
 'you',
 'new',
 'me',
 'of',
 'on',
 'in',
 '$',
 'with',
 "n't",
 '…',
 "'s",
 'sony',
 '*',
 ')',
 'Samsung',
 'this',
 'have',
 'life',
 'like',
 'at',
 '-',
 'an',
 'that',
 'your',
 'so',
 'FOLLOW',
 'now',
 'cute',
 'day',
 'all',
 'just',
 'today',
 'photography',
 'ipad',
 'RT',
 'not',
 'android',
 'instagram',
 'love',
 'i',
 'from',
 "'m",
 'fun',
 'get',
 'Sony',
 '<',
 '(',
 'out',
 'be',
 'do',
 'instagood',
 'are',
 'music',
 'got',
 'beautiful',
 'news',
 'funny',
 'fashion',
 'case',
 'Follow',
 'who',
 'but',
 'tech',
 'This',
 'time',
 'work',
 'galaxy',
 'photooftheday',
 'smile',
 'everyone',
 'up',
 'iPad',
 'app',
 'ME']
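
The list above counts 'iphone' and 'iPhone' (and 'apple' and 'Apple') as different words. As a rough sketch of how lowercasing merges such variants, here is a frequency count with `collections.Counter` on a made-up string. It uses plain whitespace splitting rather than `word_tokenize`, so it only approximates the function above.

```python
from collections import Counter

# Made-up sample text, for illustration only
text = "iPhone iphone Apple apple apple phone"

raw_counts = Counter(text.split())
lower_counts = Counter(text.lower().split())

print(len(raw_counts), "distinct tokens before lowercasing")   # 5
print(len(lower_counts), "distinct tokens after lowercasing")  # 3
print(lower_counts.most_common(2))  # [('apple', 3), ('iphone', 2)]
```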

# Stop words

> What we can see is that stop words are among the most used words. They don't help us determine whether a tweet is happy or sad, yet they consume memory and slow down the learning process, so we really need to get rid of them.

In [9]:
from nltk.corpus import stopwords

#nltk.download("stopwords")

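# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words("english"))

mw = most_used_words(train_data.tweet.str.cat())
most_words = []
for w in mw:
    # Stop once we have the 1000 most used non stop words
    if len(most_words) == 1000:
        break
    if w not in stop_words:
        most_words.append(w)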

There is 28800 different words


In [10]:
# We kept only the words that are not stop words.
# We will now take a look at the top 1000 words
sorted(most_words)

['!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "''",
 "'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '--',
 '.',
 '..',
 '...',
 '/',
 '//bit.ly/rhymeapp',
 '//ebay.to/2yI9MR7',
 '//ift.tt/2aSURcJ',
 '//ift.tt/2boR0kb',
 '//itunes.apple.com/us/app/love360/id809353957',
 '//reallyreal.com/',
 '//steemit.com/photography/',
 '//www.youtube.com/watch',
 '1',
 '10',
 '16',
 '2',
 '20',
 '2011',
 '2012',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '3',
 '30',
 '4',
 '4G',
 '4s',
 '5',
 '50',
 '5s',
 '6',
 '6s',
 '7',
 '8',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 'A',
 'ALL',
 'APPLE',
 'AT',
 'Air',
 'All',
 'And',
 'Android',
 'App',
 'AppStore',
 'Apple',
 'At',
 'BACK',
 'Baby',
 'Be',
 'Beauty',
 'Best',
 'Birthday',
 'Black',
 'Book',
 'But',
 'Buy',
 'Ca',
 'Case',
 'Cases',
 'Charm',
 'Check',
 'Christmas',
 'Click',
 'D',
 'Dating',
 'Day',
 'Decor',
 'Do',
 'Download',
 'Exquisite',
 'FOLLOW',
 'FOLLOWBACK',
 'FREE',
 'FUCKING',
 'Facebook',
 'Family',
 

# Stemming

> You should have noticed something, right? Some words have the same meaning but are written differently, sometimes in the plural and sometimes with a suffix (ing, es ...). This makes our model think that they are different words and also makes our vocabulary bigger (a waste of memory and time for the learning process). The solution is to reduce those words to their common root; this is called stemming.

In [11]:
# I'm defining these functions to use them in the
# Data Preparation Phase
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

#nltk.download('wordnet')
def stem_tokenize(text):
    # Reduce each token to its stem with the Snowball stemmer
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(token) for token in word_tokenize(text)]

def lemmatize_tokenize(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]
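
To see why reducing words to a common root shrinks the vocabulary, here is a deliberately naive suffix-stripping sketch. It is neither the Snowball algorithm nor WordNet lemmatization (both handle far more cases); `naive_stem` and its suffix list are assumptions made up for this illustration.

```python
def naive_stem(word):
    # Strip a couple of common English suffixes; real stemmers apply many more rules
    for suffix in ("ing", "s"):
        # Only strip when a reasonably long root remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["phone", "phones", "case", "cases", "work", "working", "works"]
stems = [naive_stem(t) for t in tokens]

print(len(set(tokens)), "distinct words before")  # 7
print(len(set(stems)), "distinct roots after")    # 3
```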

> I will stop here, but you can keep visualizing tweets to gain more insights and decide how to transform your data.

# Prepare the data

> In this phase, we will transform our tweets into data that is more usable by our ML models.

# Bag of Words

We are going to use the Bag of Words algorithm, which basically takes a text as input and extracts the words from it (this is our vocabulary) to use them in the vectorization process. When a tweet comes in, it is vectorized by counting the number of occurrences of each vocabulary word in it.

For example, take these two tweets: "I learned a lot today" and "hahaha I got you".

| tweet / words | I | learned | a | lot | today | hahaha | got | you |
|---|---|---|---|---|---|---|---|---|
| first | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| second | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
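
The counting in this example can be reproduced in a few lines of plain Python. This is a minimal sketch of the idea, not what scikit-learn does internally: its vectorizers also lowercase the text, and their default token pattern drops one-letter tokens such as "I" and "a".

```python
tweets = ["I learned a lot today", "hahaha I got you"]

# Vocabulary: every distinct word, in order of first appearance
vocabulary = []
for tweet in tweets:
    for word in tweet.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per tweet, one position per vocabulary word
vectors = [[tweet.split().count(word) for word in vocabulary] for tweet in tweets]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['I', 'learned', 'a', 'lot', 'today', 'hahaha', 'got', 'you']
# [1, 1, 1, 1, 1, 0, 0, 0]
# [1, 0, 0, 0, 0, 1, 1, 1]
```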

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Building the pipeline

> It's always good practice to build a pipeline of transformations for your data; it makes the transformation process easy and reusable. We will implement a pipeline that transforms our tweets into something that our ML models can digest (vectors).

In [13]:
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

In [14]:
# We need to do some preprocessing of the tweets.
# We will delete useless strings (like @, # ...)
# because we think that they will not help
# in determining if the person is Happy/Sad

class TextPreProc(BaseEstimator,TransformerMixin):
    def __init__(self, use_mention=False):
        self.use_mention = use_mention
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # We can choose between keeping the mentions
        # or deleting them
        # (regex=True is passed explicitly: recent pandas versions
        # treat the pattern as a literal string by default)
        if self.use_mention:
            X = X.str.replace(r"@[a-zA-Z0-9_]* ", " @tags ", regex=True)
        else:
            X = X.str.replace(r"@[a-zA-Z0-9_]* ", "", regex=True)
            
        # Keeping only the word after the #
        X = X.str.replace("#", "", regex=False)
        X = X.str.replace(r"[-\.\n]", "", regex=True)
        # Removing HTML garbage
        X = X.str.replace(r"&\w+;", "", regex=True)
        # Removing links
        X = X.str.replace(r"https?://\S*", "", regex=True)
        # replace repeated letters with only two occurrences
        # heeeelllloooo => heelloo
        X = X.str.replace(r"(.)\1+", r"\1\1", regex=True)
        # mark emoticons as happy or sad
        X = X.str.replace(HAPPY_EMO, " happyemoticons ", regex=True)
        X = X.str.replace(SAD_EMO, " sademoticons ", regex=True)
        X = X.str.lower()
        return X
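
To check what these substitutions do, it can help to run them on a single string. The sketch below mirrors the steps of `transform` with the standard `re` module instead of pandas, and leaves out the emoticon step; the function name `preprocess` and the sample tweet are made up for the illustration.

```python
import re

def preprocess(text, use_mention=False):
    # Mentions: replace with a placeholder or drop them
    text = re.sub(r"@[a-zA-Z0-9_]* ", " @tags " if use_mention else "", text)
    text = text.replace("#", "")               # keep the hashtag word
    text = re.sub(r"[-\.\n]", "", text)        # drop dashes, dots, newlines
    text = re.sub(r"&\w+;", "", text)          # HTML entities such as &amp;
    text = re.sub(r"https?://\S*", "", text)   # links
    text = re.sub(r"(.)\1+", r"\1\1", text)    # heeeelllloooo => heelloo
    return text.lower()

sample = "@bob Heeeelllloooo #World http://t.co/abc"
print(repr(preprocess(sample)))                    # 'heelloo world '
print(repr(preprocess(sample, use_mention=True)))  # ' @tags heelloo world '
```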

In [15]:
# This is the pipeline that will transform our tweets into something digestible.
# You can see that we are using our previously defined lemmatizing tokenizer;
# it takes care of reducing words to a common form.
# For stop words, we let the inverse document frequency do the job
from sklearn.model_selection import train_test_split

sentiments = train_data['label']
tweets = train_data['tweet']

# I got these parameters from the 'Fine tune the model' part
vectorizer = TfidfVectorizer(tokenizer=lemmatize_tokenize, ngram_range=(1,2))
pipeline = Pipeline([
    ('text_pre_processing', TextPreProc(use_mention=True)),
    ('vectorizer', vectorizer),
])

# Let's split our data into a learning set and a testing set.
# This is done to test the performance of our model at the end.
# You should look at the test data only after choosing the final model
learn_data, test_data, sentiments_learning, sentiments_test = train_test_split(tweets, sentiments, test_size=0.3)

# This will transform our learning data from simple text to vectors
# by going through the preprocessing transformer and the vectorizer.
learning_data = pipeline.fit_transform(learn_data)
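
The comment in the cell above leaves stop words to the inverse document frequency. As a rough illustration of why that works, the sketch below computes a smoothed IDF, ln((1 + n) / (1 + df)) + 1, which is the formula scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`. The three sample documents are made up.

```python
import math

# Made-up documents, for illustration only
docs = [
    "the phone is great",
    "the screen is broken",
    "the battery died",
]
n_docs = len(docs)

def idf(word):
    # Number of documents containing the word
    df = sum(1 for doc in docs if word in doc.split())
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df)) + 1

for word in ("the", "is", "battery"):
    print(word, round(idf(word), 3))
# the 1.0
# is 1.288
# battery 1.693
```

A word that appears in every document gets the lowest weight, so stop words are down-weighted rather than removed.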

# Select a model

> When we have our data ready to be processed by ML models, the question we should ask is: which model should we use?
> 
> The answer varies depending on the problem and the data. For example, Naive Bayes is known to work well on text-based problems.
> 
> A good way to choose a model is to try different candidates, evaluate them using cross-validation, then choose the best one, which will later be tested against our test data.
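
For readers new to cross-validation, here is a minimal sketch of how the data is split into k folds, each fold serving once as the validation set. The helper `k_fold_indices` is hypothetical; scikit-learn's `cross_val_score` handles the splitting itself and, for classifiers, additionally stratifies the folds by class.

```python
def k_fold_indices(n_samples, k):
    # Split indices 0..n_samples-1 into k folds of (nearly) equal size
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

for training, validation in k_fold_indices(6, 3):
    print("train on", training, "validate on", validation)
# train on [2, 3, 4, 5] validate on [0, 1]
# train on [0, 1, 4, 5] validate on [2, 3]
# train on [0, 1, 2, 3] validate on [4, 5]
```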

In [16]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

lr = LogisticRegression()
bnb = BernoulliNB()
mnb = MultinomialNB()

models = {
    'logistic regression': lr,
    'bernoulliNB': bnb,
    'multinomialNB': mnb,
}

for name, model in models.items():
    # 10-fold cross-validation, scored with F1 on the positive class
    scores = cross_val_score(model, learning_data, sentiments_learning, scoring="f1", cv=10)
    print("===", name, "===")
    print("scores = ", scores)
    print("mean = ", scores.mean())
    print("variance = ", scores.var())
    # Refit on the whole learning set and report the training accuracy
    model.fit(learning_data, sentiments_learning)
    print("score on the learning data (accuracy) = ", accuracy_score(sentiments_learning, model.predict(learning_data)))
    print("")

=== logistic regression ===
scores =  [0.6557377  0.67213115 0.6090535  0.66945607 0.63865546 0.69166667
 0.61666667 0.73895582 0.70119522 0.59504132]
mean =  0.6588559577595945
variance =  0.0018324049614875221
score on the learning data (accuracy) =  0.9256854256854257

=== bernoulliNB ===
scores =  [0.4744186  0.51818182 0.5470852  0.51351351 0.62222222 0.60633484
 0.51101322 0.62337662 0.52017937 0.55111111]
mean =  0.5487436524535472
variance =  0.002423318604056238
score on the learning data (accuracy) =  0.9332611832611832

=== multinomialNB ===
scores =  [0.50485437 0.47       0.51674641 0.48241206 0.48730964 0.49756098
 0.48039216 0.54368932 0.46766169 0.53773585]
mean =  0.4988362478846593
variance =  0.0006429778313069493
score on the learning data (accuracy) =  0.9444444444444444



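> The cross-validation scores above are F1 scores while the score on the learning data is an accuracy, so the two cannot be compared directly to decide whether a model is overfitting. Logistic regression has the best mean F1 here, but I will go with the multinomialNB.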
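
The results above report F1 for the cross-validation and accuracy for the learning data. The two metrics can differ a lot when one class is rare, as this small sketch on made-up labels shows (the labels are an assumption for illustration, not taken from the dataset).

```python
# Made-up labels: 8 tweets of class 0 and 2 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that finds only one of the two class-1 tweets
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy =", accuracy)  # 0.9
print("f1 =", round(f1, 3))    # 0.667
```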

# Fine tune the model

> I'm going to use GridSearchCV to choose the best parameters.
> 
> GridSearchCV tries different sets of parameters and, for each one, runs a cross-validation and estimates the score. At the end, we can see which parameters are the best and use them to build a better classifier.

In [18]:
from sklearn.model_selection import GridSearchCV

grid_search_pipeline = Pipeline([
    ('text_pre_processing', TextPreProc()),
    ('vectorizer', TfidfVectorizer()),
    ('model', MultinomialNB()),
])

params = [
    {
        'text_pre_processing__use_mention': [True, False],
        'vectorizer__max_features': [1000, 2000, 5000, 10000, 20000, None],
        'vectorizer__ngram_range': [(1,1), (1,2)],
    },
]
grid_search = GridSearchCV(grid_search_pipeline, params, cv=5, scoring='f1')
grid_search.fit(learn_data, sentiments_learning)
print(grid_search.best_params_)

{'text_pre_processing__use_mention': True, 'vectorizer__max_features': 5000, 'vectorizer__ngram_range': (1, 2)}
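
To get a feel for how much work this grid implies, the sketch below enumerates the same parameter combinations with `itertools.product`. It only counts them and does not train anything; GridSearchCV does the equivalent enumeration internally and, by default, refits the best combination once more on the whole learning set.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'text_pre_processing__use_mention': [True, False],
    'vectorizer__max_features': [1000, 2000, 5000, 10000, 20000, None],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
}

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv = 5
print(len(combinations), "parameter combinations")      # 24
print(len(combinations) * cv, "cross-validation fits")  # 120
```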


# Testing the model

> Testing our model against data other than the data used for training will show how well the model generalises to new data.

# Note

We shouldn't use the test set to choose the model; it only lets us confirm that the chosen model is doing well.

In [19]:
mnb.fit(learning_data, sentiments_learning)

MultinomialNB()

In [20]:
testing_data = pipeline.transform(test_data)
mnb.score(testing_data, sentiments_test)

0.8409090909090909

In [22]:
sub_data = pd.read_csv("../input/twitter-dataset/test_twitter.csv", encoding='ISO-8859-1')

In [23]:
sub_data

Unnamed: 0,id,tweet
0,7921,I hate the new #iphone upgrade. Won't let me download apps. #ugh #apple sucks
1,7922,currently shitting my fucking pants. #apple #iMac #cashmoney #raddest #swagswagswag http://insta...
2,7923,"I'd like to puts some CD-ROMS on my iPad, is that possible?' â Yes, but wouldn't that block th..."
3,7924,"My ipod is officially dead. I lost all my pictures and videos from the 1D and 5sos concert,and f..."
4,7925,Been fighting iTunes all night! I only want the music I $&@*# paid for
...,...,...
1948,9869,"#SamsungGalaxyNote7 Explodes, Burns 6-Year-Old. Thanks for rushing your products to market #Sams..."
1949,9870,Now Available - Hoodie. Check it out here - http://zetasupplies.co.uk/products/hoodie-2?utm_camp...
1950,9871,"There goes a crack right across the screen. If you could actually provide a more durable screen,..."
1951,9872,@codeofinterest as i said #Adobe big time we may well as include #apple to


In [33]:
# Predicting on test_twitter.csv

sub_learning = pipeline.transform(sub_data.tweet)
# One row per test tweet: its id and the predicted label
sub = pd.DataFrame({"id": sub_data.id, "label": mnb.predict(sub_learning)})
print(sub)

        id  label
0     7921      1
1     7922      0
2     7923      0
3     7924      0
4     7925      0
...    ...    ...
1948  9869      0
1949  9870      0
1950  9871      0
1951  9872      0
1952  9873      0

[1953 rows x 2 columns]


# Test your tweet

> The most exciting part! Don't be too hard on my classifier...

In [28]:
# Just run it and type a tweet
model = MultinomialNB()
model.fit(learning_data, sentiments_learning)
tweet = pd.Series([input(),])
tweet = pipeline.transform(tweet)
# predict_proba columns follow model.classes_, i.e. [0, 1].
# In the training data shown above, label 0 marks the happy tweets
# and label 1 the sad (complaining) ones.
proba = model.predict_proba(tweet)[0]
print("The probability that this tweet is happy is:", proba[0])
print("The probability that this tweet is sad is:", proba[1])

 Happy


The probability that this tweet is happy is: 0.9719787681735573
The probability that this tweet is sad is: 0.028021231826441933


In [37]:
# import the modules we'll need
from IPython.display import HTML
import pandas as pd
import numpy as np
import base64
# function that takes in a dataframe and creates a text link to  
# download it (will only work for files < 2MB or so)
def create_download_link(df, title = "Download CSV file", filename = "sample_submission_text_analysis.csv"):  
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

# wrap the submission predictions in a dataframe
df = pd.DataFrame(sub)

# create a link to download the dataframe
create_download_link(df)
