# About the notebook

This notebook performs exploratory data analysis (EDA) and TextBlob-based sentiment analysis on a set of tweets.

# About the data

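The dataset consists of 9,690 tweets from the @realDonaldTrump account. According to the file name, they cover 20 January 2017 to 31 December 2019.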


# Contents

- <a href =#1> 1. Import packages and libraries  </a>
- <a href =#2> 2. Read the data </a>
- <a href =#3> 3. Handle null values </a>
- <a href =#4> 4. Exploratory data analysis </a>
- <a href =#5> 5. Clean the data </a>
- <a href =#6> 6. Apply data cleaning steps to data</a>
- <a href =#7> 7. Visualize the most Common words in the tweets </a>
- <a href =#8> 8.  Get the sentiment analysis of the tweets </a>
- <a href =#9> 9.  Save the data </a>

<br>
Credit goes to:
https://www.kaggle.com/shailaja4247/sentiment-analysis-of-tweets-wordclouds-textblob#About-the-data

<div id=1> <h2>  1. Import packages and libraries  </h2> </div> 

In [None]:
%pip install plotly

In [1]:
import os
import random
import re
import string
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
import plotly.figure_factory as ff
from plotly import graph_objs as go

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

import spacy
from spacy.util import compounding, minibatch
from tqdm import tqdm

warnings.filterwarnings("ignore")


**Below is a helper function that generates random colors, which you can use to give your plots different colors. Feel free to use it.**
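The helper itself does not appear in this notebook. Here is a minimal sketch of what such a function could look like, assuming hex colour strings are wanted. The name `random_colours` and its parameter are illustrative, not taken from the notebook.

```python
import random

def random_colours(number_of_colours):
    """Return a list of random hex colour strings such as '#3FA2C1'.

    Hypothetical helper: the name and signature are assumptions.
    """
    hex_digits = '0123456789ABCDEF'
    return ['#' + ''.join(random.choice(hex_digits) for _ in range(6))
            for _ in range(number_of_colours)]

colours = random_colours(5)
print(colours)
```

A list like this can be passed to plotting calls that accept hex colours, for example as `color_discrete_sequence` in `plotly.express`.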

<div id=2> <h2> 2. Read the data  </h2></div> 

In [2]:
import pandas as pd
import numpy as np
train=pd.read_csv('./data/input/realdonaldtrump_20170120-20191231.csv')

In [3]:
print(train.shape)

(9690, 8)


In [4]:
train.describe()

| | id | retweets | favorites |
|---|---|---|---|
| count | 9690.0 | 9690.0 | 9690.0 |
| mean | 1.055627e+18 | 17968.644066 | 79928.356347 |
| std | 1.103727e+17 | 10248.538428 | 39531.033392 |
| min | 8.22421e+17 | 535.0 | 2317.0 |
| 25% | 9.675642e+17 | 12113.0 | 55761.5 |
| 50% | 1.06973e+18 | 16155.5 | 72963.5 |
| 75% | 1.155308e+18 | 21364.0 | 95168.25 |
| max | 1.21221e+18 | 302269.0 | 835575.0 |


In [5]:
train.isna().sum()

id              0
link            0
content         0
date            0
retweets        0
favorites       0
mentions     7850
hashtags     8911
dtype: int64

Both `hashtags` (8,911 of 9,690 rows) and `mentions` (7,850 of 9,690 rows) are mostly null.

<div id=3> <h2> 3. Handle null values </h2> </div>

In [6]:
train['hashtags'].unique()[:10]

array([nan, '#InaugurationDay', '#AmericaFirst', '#InaugurationDay,#MAGA',
       '#MAGA', '#MarchForLife', '#NeverForgetpic', '#SuperBowl', '#1',
       '#AmericaFirstpic'], dtype=object)

In [7]:
train['mentions'].unique()[:10]

array([nan, '@FoxNews', '@WhiteHouse', '@FoxNews,@CNN', '@DavidMuir,@ABC',
       '@DHSgov', '@romoabcnews,@DavidMuir,@POTUS,@ABCWorldNews,@ABC2020',
       '@ABC', '@GOP,@VP,@SenateMajLdr,@SpeakerRyan',
       '@SeanHannity,@FoxNews'], dtype=object)

1. `hashtags`: this column holds little that is useful for our analysis. Many values are tags such as #ixzz4 that carry no meaning, so we cannot analyse their sentiment and can drop the column.
2. `mentions`: a mention usually just names another account, so analysing it yields no sentiment either. We can drop this column too.

Let us drop the columns that carry no significantly useful information: `link`, `mentions` and `hashtags`.

In [8]:
train=train.drop(['link','mentions','hashtags'], axis=1)

<div id=4> <h2> 4. EDA (Exploratory data analysis)  </h2></div>

In [9]:
train.head()

| | id | content | date | retweets | favorites |
|---|---|---|---|---|---|
| 0 | 8.22421e+17 | It all begins today! I will see you at 11:00 A... | 1/20/2017 6:31 | 59617 | 238180 |
| 1 | 8.22502e+17 | Today we are not merely transferring power fro... | 1/20/2017 11:51 | 17022 | 97230 |
| 2 | 8.22502e+17 | power from Washington, D.C. and giving it back... | 1/20/2017 11:51 | 15441 | 80281 |
| 3 | 8.22502e+17 | What truly matters is not which party controls... | 1/20/2017 11:52 | 36987 | 161112 |
| 4 | 8.22502e+17 | January 20th 2017, will be remembered as the d... | 1/20/2017 11:53 | 56418 | 201434 |
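The `id` column prints as `8.22502e+17` for rows 1 to 4, which suggests it is stored as a float. A small sketch of why that matters: near 8e17, consecutive 64-bit floats are 128 apart, so neighbouring tweet ids collapse to the same value. The two ids below are made up for illustration.

```python
# Two hypothetical tweet ids that differ by one.
id_a = 822502135233384448
id_b = id_a + 1

# As Python ints they are distinct ...
ints_differ = id_a != id_b

# ... but converted to 64-bit floats they become the same number.
floats_equal = float(id_a) == float(id_b)

print(ints_differ, floats_equal)   # True True
```

If the CSV file itself stores the full ids, passing `dtype={'id': str}` to `pd.read_csv` would keep them intact. If the file already holds rounded values, the lost digits cannot be recovered, and `id` should not be used as a unique key.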


Let's look at the distribution of tweets in the train set.

<div id=5> <h2>5. Data Preprocessing </h2> </div>

First we define helper functions for cleaning the data: removing stopwords, normalising words and other basic pre-processing. They are applied in section 6.

**Removing weird spaces**

In [10]:
def remove_spaces(text):
    text=text.strip()
    text=text.split()
    return ' '.join(text)

<b> Spelling Correction </b>

In [11]:
def edits1(word):
    # All strings one edit away from `word`: a single deletion,
    # adjacent transposition, replacement or insertion.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # Generator over all strings two edits away from `word`.
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
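`edits1` and `edits2` only generate candidate strings; nothing above picks the best one. The sketch below shows one way they could be combined into a corrector, in the style of Peter Norvig's well-known spelling corrector. The tiny `WORD_COUNTS` vocabulary and the names `known`, `candidates` and `correction` are assumptions for illustration. A usable corrector needs word frequencies from a large corpus.

```python
from collections import Counter

# Toy word-frequency table (made up); replace with counts from a real corpus.
WORD_COUNTS = Counter({'great': 50, 'people': 40, 'country': 30, 'president': 20})

def edits1(word):
    """All strings one edit away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away from `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def known(words):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in WORD_COUNTS}

def candidates(word):
    """Prefer the word itself, then one-edit, then two-edit neighbours."""
    return known([word]) or known(edits1(word)) or known(edits2(word)) or {word}

def correction(word):
    """Pick the most frequent candidate."""
    return max(candidates(word), key=lambda w: WORD_COUNTS[w])

print(correction('peple'))    # people
print(correction('contry'))   # country
```

Words with no known neighbour within two edits are returned unchanged, which keeps names and hashtags from being rewritten arbitrarily.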

<b> Contraction </b>

In [12]:
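# Note: the 'cause' entry also rewrites the ordinary word "cause".
contraction = {'cause': 'because',
               'aint': 'am not',
               "aren't": 'are not'}

def mapping_replacer(x, dic):
    # Replace each key with its expansion when it appears as a whole word.
    # The lookarounds also match at the start and end of the text, which a
    # space-padded check misses.
    for word, expansion in dic.items():
        x = re.sub(r'(?<!\w)' + re.escape(word) + r'(?!\w)', expansion, x)
    return x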

<b> Stemming, lemmatisation and tokenisation </b>

In [13]:
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer

ls = LancasterStemmer()
lem = WordNetLemmatizer()

def lexicon_normalization(text):
    # Returns a list of tokens, not a string.
    words = word_tokenize(text)

    # 1- Stemming (Lancaster is aggressive: "power" becomes "pow")
    words_stem = [ls.stem(w) for w in words]

    # 2- Lemmatization, applied to the stems
    words_lem = [lem.lemmatize(w) for w in words_stem]
    return words_lem

<b> Handling emojis </b>

In [14]:
import emoji

def convert_emojis(text):
    # Replace each emoji with its text name, e.g. a thumbs-up becomes "thumbs_up".
    # emoji.UNICODE_EMOJI is not available in recent releases of the emoji
    # package, so demojize() is used for the lookup instead.
    return emoji.demojize(text, delimiters=("", ""))

<b>Removing links, brackets, numbers, punctuations etc. </b>


In [15]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove punctuation
    and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation, including apostrophes
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    return text

<b> Handling stopwords </b>

In [16]:
# Build the stopword set once instead of on every call.
stop_words = set(stopwords.words('english'))

def remove_stopword(text):
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

<b> Tokenisation </b>

In [17]:
def tokenise(text):
    words = word_tokenize(text) 
    return words


<div id = 6> <h2> 6. Apply data cleaning steps to data </h2> </div>

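Replacing non-word characters with spaces

In [18]:
# Replace every run of non-word characters (punctuation, symbols, whitespace)
# with a single space. This also splits URLs ("https://t.co/x" -> "https t co x")
# and contractions ("aren't" -> "aren t") before the later cleaning steps see them.
train['content'] = train['content'].replace(r'\W+', ' ', regex=True)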
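The order of the cleaning steps matters: URL removal and contraction expansion only work while the punctuation is still present. The sketch below compares the two orders on a made-up tweet. `strip_links` and `strip_non_word` are hypothetical helpers; the link pattern is the one used in `clean_text`.

```python
import re

LINK_PATTERN = r'https?://\S+|www\.\S+'   # same pattern as in clean_text

def strip_links(text):
    """Remove URLs, then collapse leftover whitespace."""
    return ' '.join(re.sub(LINK_PATTERN, '', text).split())

def strip_non_word(text):
    """Replace runs of non-word characters with one space."""
    return re.sub(r'\W+', ' ', text).strip()

tweet = "They aren't listening! https://t.co/abc123"   # made-up example

# Order used in this notebook: non-word characters first, links second.
# The URL is already broken into words, so the link pattern finds nothing.
notebook_order = strip_links(strip_non_word(tweet))

# Alternative order: links first, non-word characters second.
links_first = strip_non_word(strip_links(tweet))

print(notebook_order)   # They aren t listening https t co abc123
print(links_first)      # They aren t listening
```

This is consistent with `com`, `twit` and `http` topping the word counts in section 7. It also means that a contraction such as `aren't` would have to be expanded before this step for `mapping_replacer` to match it.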


In [19]:
train.head()

| | id | content | date | retweets | favorites |
|---|---|---|---|---|---|
| 0 | 8.22421e+17 | It all begins today I will see you at 11 00 A ... | 1/20/2017 6:31 | 59617 | 238180 |
| 1 | 8.22502e+17 | Today we are not merely transferring power fro... | 1/20/2017 11:51 | 17022 | 97230 |
| 2 | 8.22502e+17 | power from Washington D C and giving it back t... | 1/20/2017 11:51 | 15441 | 80281 |
| 3 | 8.22502e+17 | What truly matters is not which party controls... | 1/20/2017 11:52 | 36987 | 161112 |
| 4 | 8.22502e+17 | January 20th 2017 will be remembered as the da... | 1/20/2017 11:53 | 56418 | 201434 |


In [20]:
train['content']=train['content'].apply(lambda x: mapping_replacer(x, contraction))

In [21]:
train['content'] = train['content'].apply(lambda x:clean_text(x))

In [22]:
train['content']=train['content'].apply(lambda x: remove_stopword(x))


In [23]:
train['content']=train['content'].apply(lambda x: lexicon_normalization(x))

In [24]:
train.head()

| | id | content | date | retweets | favorites |
|---|---|---|---|---|---|
| 0 | 8.22421e+17 | [begin, today, see, swear, mov, continu, work,... | 1/20/2017 6:31 | 59617 | 238180 |
| 1 | 8.22502e+17 | [today, mer, transfer, pow, on, admin, anoth, ... | 1/20/2017 11:51 | 17022 | 97230 |
| 2 | 8.22502e+17 | [pow, washington, c, giv, back, am, peopl, ina... | 1/20/2017 11:51 | 15441 | 80281 |
| 3 | 8.22502e+17 | [tru, mat, party, control, govern, wheth, gove... | 1/20/2017 11:52 | 36987 | 161112 |
| 4 | 8.22502e+17 | [janu, rememb, day, peopl, becam, rul, nat] | 1/20/2017 11:53 | 56418 | 201434 |


<div id=7><h2>7. Visualize the most Common words in the tweets </h2></div>

In [25]:
top = Counter([item for sublist in train['content'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

| | Common_words | count |
|---|---|---|
| 0 | com | 3154 |
| 1 | twit | 2456 |
| 2 | gre | 2332 |
| 3 | stat | 1760 |
| 4 | new | 1404 |
| 5 | u | 1404 |
| 6 | http | 1344 |
| 7 | pic | 1327 |
| 8 | presid | 1231 |
| 9 | peopl | 1143 |


In [26]:
blacklist = ['http','https','www','com', 'ev','u','ly','pic','would']

def remove_words(text):
    text = [i for i in text if (i not in blacklist)]
    return text

In [27]:
train['content'] = train['content'].apply(remove_words)

In [28]:
# temp still holds the counts from cell 25, computed before the blacklist was applied.
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Tweets', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

In [29]:
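top = Counter([item for sublist in train['content'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
# Skip the first row: after filtering, the top token is still 'twit',
# most likely a fragment of twitter.com links.
temp = temp.iloc[1:,:]
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Purples')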

| | Common_words | count |
|---|---|---|
| 1 | gre | 2332 |
| 2 | stat | 1760 |
| 3 | new | 1404 |
| 4 | presid | 1231 |
| 5 | peopl | 1143 |
| 6 | democr | 1134 |
| 7 | country | 1050 |
| 8 | trump | 858 |
| 9 | get | 846 |
| 10 | thank | 844 |
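Counting and filtering can also be done in one pass, so that the table never contains blacklisted tokens and no row has to be skipped by position. This is a minimal sketch on made-up token lists. `top_words` and the extra `'twit'` entry in the blacklist are assumptions, not part of the notebook.

```python
from collections import Counter

def top_words(token_lists, blacklist, n=20):
    """Count tokens across all lists, ignoring blacklisted ones."""
    blocked = set(blacklist)
    counts = Counter(tok for tokens in token_lists for tok in tokens
                     if tok not in blocked)
    return counts.most_common(n)

# Made-up stand-in for train['content'].
sample = [['gre', 'stat', 'twit', 'com'],
          ['gre', 'peopl', 'http'],
          ['gre', 'stat', 'pic', 'twit']]

result = top_words(sample, ['http', 'com', 'pic', 'twit'], n=2)
print(result)   # [('gre', 3), ('stat', 2)]
```

The result of `most_common` can be passed to `pd.DataFrame` exactly as in the cells above.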


<div id = 8><h2>8. Get the sentiment analysis of the tweets</h2></div>

We label each tweet as positive, neutral or negative according to its TextBlob polarity score.

In [31]:
from textblob import TextBlob

def get_tweet_sentiment(tweet):
    '''
    Utility function to classify sentiment of passed tweet
    using textblob's sentiment method.
    Returns '+1' (positive), '0' (neutral) or '-1' (negative).
    '''
    # create TextBlob object of passed tweet text and read its polarity once
    polarity = TextBlob(tweet).sentiment.polarity

    # set sentiment
    if polarity > 0:
        return '+1'
    elif polarity == 0:
        return '0'
    else:
        return '-1'
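TextBlob's polarity is a float between -1.0 and 1.0, and the function above treats only an exact 0 as neutral. The thresholding can be separated from TextBlob and checked on its own. In this sketch, `label_polarity` and its `neutral_band` parameter are hypothetical. A band of 0 reproduces the rule above, while a small positive band would also count near-zero scores as neutral.

```python
def label_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to '+1', '0' or '-1'."""
    if polarity > neutral_band:
        return '+1'
    if polarity < -neutral_band:
        return '-1'
    return '0'

labels = [label_polarity(p) for p in (0.8, 0.0, -0.25)]
print(labels)                                   # ['+1', '0', '-1']
print(label_polarity(0.05, neutral_band=0.1))   # 0
```

Whether a wider neutral band suits this data is a judgment call; it would change the class counts reported below.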

In [32]:
train['sentiment']=train['content'].apply(lambda x: get_tweet_sentiment(' '.join(x)))

In [33]:
train.head()

| | id | content | date | retweets | favorites | sentiment |
|---|---|---|---|---|---|---|
| 0 | 8.22421e+17 | [begin, today, see, swear, mov, continu, work,... | 1/20/2017 6:31 | 59617 | 238180 | 0 |
| 1 | 8.22502e+17 | [today, mer, transfer, pow, on, admin, anoth, ... | 1/20/2017 11:51 | 17022 | 97230 | 0 |
| 2 | 8.22502e+17 | [pow, washington, c, giv, back, am, peopl, ina... | 1/20/2017 11:51 | 15441 | 80281 | 0 |
| 3 | 8.22502e+17 | [tru, mat, party, control, govern, wheth, gove... | 1/20/2017 11:52 | 36987 | 161112 | 0 |
| 4 | 8.22502e+17 | [janu, rememb, day, peopl, becam, rul, nat] | 1/20/2017 11:53 | 56418 | 201434 | 0 |


In [34]:
Positive_sent = train[train['sentiment']=='+1']
Negative_sent = train[train['sentiment']=='-1']
Neutral_sent = train[train['sentiment']=='0']

<b> Overall tweet sentiment </b>

In [35]:
print('Number of tweets with positive sentiment', Positive_sent['sentiment'].count())
print('Number of tweets with negative sentiment', Negative_sent['sentiment'].count())
print('Number of tweets with neutral sentiment', Neutral_sent['sentiment'].count())


Number of tweets with positive sentiment 4038
Number of tweets with negative sentiment 2277
Number of tweets with neutral sentiment 3375


<b><i>So we see that, over the period analysed, most tweets are positive (4,038), followed by neutral (3,375) and negative (2,277).</i></b>
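The raw counts are easier to compare as shares of the 9,690 tweets. Here is a short worked calculation, keyed by the label codes that `get_tweet_sentiment` returns: 4,038 rows are labelled `'+1'`, 3,375 are labelled `'0'` and 2,277 are labelled `'-1'`.

```python
# Rows per label code, as counted in the notebook output.
counts = {'+1': 4038, '0': 3375, '-1': 2277}
total = sum(counts.values())   # 9690, the number of rows in the data

# Share of each label in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)   # 9690 {'+1': 41.7, '0': 34.8, '-1': 23.5}
```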

<div id = 9><h2>9. Save the data </h2></div>

In [None]:
train.to_csv('./output/results/textblob_prediction_data.csv')
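`to_csv` does not create missing folders, so the write fails if `./output/results` does not exist yet. Here is a minimal sketch of creating the folder first, using only the standard library. `ensure_parent_dir` is a hypothetical helper name.

```python
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if needed and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# In the notebook this would run just before the save, for example:
# out_path = ensure_parent_dir('./output/results/textblob_prediction_data.csv')
# and out_path is then passed to the to_csv call.
```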