# Twitter Sentiment Analysis
First of all, load the data

In [1]:
import pandas as pd
data = pd.read_csv('TwitterSentimentAnalysis/train_tweets.csv')
data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
id       31962 non-null int64
label    31962 non-null int64
tweet    31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


### 1. Basic Feature Extraction

#### i. Number of words
We extract a number of features from the text data, beginning with the word count. The intuition is that positive statements tend to be longer than negative ones.

In [3]:
# For the word count, add a new column wordCount to the dataframe and use the pandas apply method
# to run the split function on every row of the dataframe
data['wordCount'] = data['tweet'].apply(lambda x: len(str(x).split()))
data[['tweet','wordCount']].head()

Unnamed: 0,tweet,wordCount
0,@user when a father is dysfunctional and is s...,18
1,@user @user thanks for #lyft credit i can't us...,19
2,bihday your majesty,3
3,#model i love u take with u all the time in ...,14
4,factsguide: society now #motivation,4


#### ii. Number of characters
This builds on the same intuition as the previous feature. Here we calculate the number of characters in each tweet.

In [4]:
data['charCount'] = data['tweet'].apply(lambda x:len(str(x))) #Spaces included as well
data[['tweet','charCount']].head()

Unnamed: 0,tweet,charCount
0,@user when a father is dysfunctional and is s...,102
1,@user @user thanks for #lyft credit i can't us...,122
2,bihday your majesty,21
3,#model i love u take with u all the time in ...,86
4,factsguide: society now #motivation,39


#### iii. Average Word Length
Here we extract the average word length of each tweet. This is done by summing the
lengths of all the words and dividing by the total number of words.

In [5]:
#First of all we create a function which does all the math we need to do on the text and then we apply lambda
def averageCount(word):
    words = word.split() #Produces a list of words
    return sum([len(i) for i in words])/len(words)

data['avgCount'] = data['tweet'].apply(lambda x:averageCount(x))
data[['tweet','avgCount']].head()

Unnamed: 0,tweet,avgCount
0,@user when a father is dysfunctional and is s...,4.555556
1,@user @user thanks for #lyft credit i can't us...,5.315789
2,bihday your majesty,5.666667
3,#model i love u take with u all the time in ...,4.928571
4,factsguide: society now #motivation,8.0
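
One caveat with `averageCount`: a tweet that is empty or contains only whitespace produces an empty word list and therefore a division by zero. A minimal sketch of a guarded variant is shown here; the name `safe_average_word_length` and the choice to return 0.0 for empty text are assumptions, not part of the original notebook.

```python
def safe_average_word_length(text):
    """Return the mean word length of `text`, or 0.0 when it has no words."""
    words = str(text).split()
    if not words:  # empty or whitespace-only tweet
        return 0.0
    return sum(len(w) for w in words) / len(words)


print(safe_average_word_length("bihday your majesty"))  # (6 + 4 + 7) / 3
print(safe_average_word_length("   "))                  # 0.0
```

It could be applied in the same way as the original function, e.g. `data['tweet'].apply(safe_average_word_length)`.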


#### iv. Number of Stopwords
When solving an NLP problem we generally remove the stopwords, but counting them first can give
valuable information that would otherwise be lost.

In [6]:
from nltk.corpus import stopwords
stop = stopwords.words('english') #Gives a list of english stopwords

data['stopwords'] = data['tweet'].apply(lambda x:len([i for i in x.split() if i in stop]))
data[['tweet', 'stopwords']].head()

Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1
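
`stopwords.words('english')` returns a list, so every `i in stop` check scans the list from the start. A small sketch of the same count using a set follows; the five-word `stop_list` is a hand-written stand-in for the NLTK list, used only so that the example runs without downloading the corpus.

```python
# A tiny hand-written stand-in for stopwords.words('english').
stop_list = ["a", "is", "and", "the", "when"]
stop_set = set(stop_list)  # set membership is O(1) on average

tweet = "when a father is dysfunctional and is selfish"
n_stop = sum(1 for word in tweet.split() if word in stop_set)
print(n_stop)  # 5
```

With the real list, `stop = set(stopwords.words('english'))` would be the equivalent one-line change.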


In [7]:
stopwords.words('spanish')[:5]

['de', 'la', 'que', 'el', 'en']

#### v. Number of special characters
Special characters are worth a feature because they tell us the number of hashtags and mentions
present in the text.
Use startswith() because mentions and hashtags start with a special symbol ('@' and '#').

In [8]:
# Number of hashtags: split into words first, otherwise we iterate over single characters
data['hashtags'] = data['tweet'].apply(lambda x:len([i for i in x.split() if i.startswith('#')]))

# Number of mentions
data['mentions'] = data['tweet'].apply(lambda x:len([i for i in x.split() if i.startswith('@')]))

data[['tweet', 'hashtags','mentions']].head()

Unnamed: 0,tweet,hashtags,mentions
0,@user when a father is dysfunctional and is s...,1,1
1,@user @user thanks for #lyft credit i can't us...,3,2
2,bihday your majesty,0,0
3,#model i love u take with u all the time in ...,1,0
4,factsguide: society now #motivation,1,0


#### vi. Number of numerics
Just as we calculated the number of words, we can calculate the number of numeric tokens present in the
tweet. It may not matter much for this dataset, but it is a useful feature to know how to extract.

The .isdigit() method can be used; it returns True when every character of the string is a digit.

In [9]:
data['numerics'] = data['tweet'].apply(lambda x:len([i for i in x.split() if i.isdigit()]))  # test each word i, not the whole tweet x
data[['tweet', 'numerics']].head()

Unnamed: 0,tweet,numerics
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0
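
To make the behaviour of `.isdigit()` concrete, here is a small sketch on a made-up sentence (the sentence is an assumption for illustration, not taken from the dataset). Note that decimals and signed numbers are not counted, because the dot and the minus sign are not digits.

```python
tokens = "i have 2 cats and 3.5 kg of food since 2019".split()

# str.isdigit() is True only when every character in the token is a digit,
# so "3.5" (contains a dot) is not counted.
numeric_tokens = [t for t in tokens if t.isdigit()]
print(numeric_tokens)       # ['2', '2019']
print(len(numeric_tokens))  # 2
```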


#### vii. Number of uppercase words
This is a very useful feature because anger or rage is often expressed by
writing in uppercase, so a count of uppercase words can help capture it.

The .isupper() method can be used to check whether a word is in uppercase. It returns a boolean value.

In [10]:
#Ex:
name = "PRAMOD"
print("For all char upper: ",name.isupper())
name = "Pramod"
print("For initial char upper: ",name.isupper())
name = "pramod"
print("For no char upper: ",name.isupper())

For all char upper:  True
For initial char upper:  False
For no char upper:  False


In [11]:
data['upper'] = data['tweet'].apply(lambda x:len([i for i in x.split() if i.isupper()]))
data[['tweet', 'upper']].head()

Unnamed: 0,tweet,upper
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


### 2. Basic Pre-processing
So far we have looked at the basic features that can be extracted. Before going further into text and feature extraction, our main goal should be to clean the data so that we can obtain better features.
We do so by pre-processing the data.

#### i. LowerCase
The first pre-processing step is to convert all the tweets to lowercase. This avoids having multiple copies of the same word. Ex: 'Basic' and 'basic' would otherwise be treated as different and would be included in a word cloud as two different words.

In [12]:
data['tweet'] = data['tweet'].apply(lambda x:x.lower())
data[['tweet']].head()

Unnamed: 0,tweet
0,@user when a father is dysfunctional and is s...
1,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty
3,#model i love u take with u all the time in ...
4,factsguide: society now #motivation


#### ii. Remove punctuation
The next step is to remove punctuation since it does not add any useful information to the data. Therefore
removing it not only reduces clutter but also reduces the size of the training data.

In [13]:
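# [^\w\s] - negated class: any character that is neither a word character nor whitespace
# \w stands for word character - letters, digits and underscore (Unicode-aware for str patterns in Python 3)
# \s stands for white spaces - newline, space, tab, form feed

# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
data['tweet'] = data['tweet'].str.replace(r'[^\w\s]', '', regex=True)
data[['tweet']].head()

# This piece of code removes every character that is not a word character or whitespace from the tweets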

Unnamed: 0,tweet
0,user when a father is dysfunctional and is so...
1,user user thanks for lyft credit i cant use ca...
2,bihday your majesty
3,model i love u take with u all the time in u...
4,factsguide society now motivation
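
The same pattern can be tried on a single string with the standard `re` module. This is a small sketch on a made-up tweet, not a replacement for the pandas call.

```python
import re

sample = "@user thanks for #lyft credit, i can't use it!"
# [^\w\s] matches any single character that is neither a word character
# (letter, digit, underscore) nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", sample)
print(cleaned)  # user thanks for lyft credit i cant use it
```

Because `\w` is Unicode-aware for `str` patterns, non-ASCII letters survive this step, which is why tokens such as `ð` still appear in later outputs.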


#### iii. Removal of stopwords
Stopwords are very common words that carry little meaning on their own, such as it, is, a, etc. You can either define your own set of stopwords or use an existing library.

<code>from nltk.corpus import stopwords
words = stopwords.words('english')  # pass the name of the language
</code>

In [14]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

data['tweet'] = data['tweet'].apply(lambda x:" ".join([i for i in x.split() if i not in stop]))
data[['tweet']].head()

Unnamed: 0,tweet
0,user father dysfunctional selfish drags kids d...
1,user user thanks lyft credit cant use cause do...
2,bihday majesty
3,model love u take u time urð ðððð ððð
4,factsguide society motivation


#### iv. Common words removal
Previously, we removed words that occur commonly in the language in general. We can also remove the words that occur most commonly in our own text data.

To do so, we need to count the occurrences of each word in the entire text data.<br>
First join all the tweets into one string and split it into a list of words.<br>
Then convert the list into a pandas Series.<br>
Finally, use the value_counts() method to get the count of each word.

In [15]:
# Count the occurrences of each word in the entire text data and keep the 10 most frequent.
word_freq = pd.Series((' '.join(data['tweet'])).split()).value_counts()[:10]
word_freq

user     17473
love      2647
ð         2511
day       2199
â         1797
happy     1663
amp       1582
im        1139
u         1136
time      1110
dtype: int64

In [16]:
# Use the index of word_freq to remove the 10 most frequent and not so useful words
index = list(word_freq.index)
index

['user', 'love', 'ð', 'day', 'â', 'happy', 'amp', 'im', 'u', 'time']

In [17]:
data['tweet'] = data['tweet'].apply(lambda x: " ".join([i for i in x.split() if i not in index]))
data[['tweet']].head()

Unnamed: 0,tweet
0,father dysfunctional selfish drags kids dysfun...
1,thanks lyft credit cant use cause dont offer w...
2,bihday majesty
3,model take urð ðððð ððð
4,factsguide society motivation
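
The count-then-filter idea can also be sketched with the standard library alone. The three-tweet `tweets` list is a hypothetical stand-in for `data['tweet']`, and removing the top 2 words (rather than the top 10) is only to keep the example small.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for data['tweet'].
tweets = ["user love day", "user love", "user happy time"]

# Count every word across all tweets.
counts = Counter(word for tweet in tweets for word in tweet.split())
top_words = {word for word, _ in counts.most_common(2)}  # two most frequent words

# Rebuild each tweet without the most frequent words.
filtered = [" ".join(w for w in tweet.split() if w not in top_words)
            for tweet in tweets]
print(counts.most_common(2))  # [('user', 3), ('love', 2)]
print(filtered)               # ['day', '', 'happy time']
```

The second tweet becomes an empty string, which is worth keeping in mind: aggressive frequency filtering can leave some rows with no text at all.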


#### v. Rare words removal
Remove the least frequently occurring words, since they also add little value to the analysis.

In [18]:
# word_freq is a pandas Series holding the counts of the 10 least frequent words
word_freq = pd.Series(" ".join(data['tweet']).split()).value_counts()[-10:]
word_freq

lateral                1
geneathawright         1
haiapnadiltohawaara    1
maxxxlife              1
translation            1
gasps                  1
hayirlicumalar         1
noof                   1
âïâïâïwhat             1
theriveourberlin       1
dtype: int64

In [19]:
# Removing the rarely occurring words (the words are the index of word_freq)
data['tweet'] = data['tweet'].apply(lambda x: " ".join([i for i in x.split() if i not in word_freq.index]))
data[['tweet']].head()

Unnamed: 0,tweet
0,father dysfunctional selfish drags kids dysfun...
1,thanks lyft credit cant use cause dont offer w...
2,bihday majesty
3,model take urð ðððð ððð
4,factsguide society motivation


#### vi. Spelling Correction
We often make spelling mistakes when posting tweets, whether in haste or for any other reason. Spelling mistakes make two occurrences of the same word appear different, so they should be corrected.

Spelling correction can be done with 'textblob'.

Care should be taken with spelling correction, because tweets are full of abbreviations, and TextBlob replaces a word it does not know with a similar known word that it considers most likely. This often leads to mistakes.

For ex: We abbreviate 'your' as 'ur', which on correction would yield 'or', which we don't want.

In [20]:
from textblob import TextBlob
TextBlob("analytcs as always").correct()

TextBlob("analysis as always")

In [21]:
from textblob import TextBlob
data['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3                               model take or ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

#### vii. Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences.

In [22]:
print(data['tweet'][1])
from textblob import TextBlob
TextBlob(data['tweet'][1]).words

thanks lyft credit cant use cause dont offer wheelchair vans pdx disapointed getthanked


WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

In [23]:
TextBlob("So let's hear it out from me. This is a tale from long long back in past").words

WordList(['So', 'let', "'s", 'hear', 'it', 'out', 'from', 'me', 'This', 'is', 'a', 'tale', 'from', 'long', 'long', 'back', 'in', 'past'])

#### viii. Stemming
Stemming refers to the removal of suffixes like "ing", "ly", "s", etc. by a simple rule-based approach.

For this purpose, we'll use <b>PorterStemmer</b> from the nltk library.

In [24]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

data['tweet'][:5].apply(lambda x: " ".join([stemmer.stem(i) for i in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object

#### ix. Lemmatization
Lemmatization is often more useful than stemming because it converts a word into its root word (lemma) rather than just stripping suffixes. It uses a vocabulary and morphological analysis to obtain the root word. Therefore, lemmatization is usually preferred over stemming.

In [25]:
from textblob import Word
data['tweet'] = data['tweet'].apply(lambda x: " ".join([Word(i).lemmatize() for i in x.split()]))
data[['tweet']].head()

Unnamed: 0,tweet
0,father dysfunctional selfish drag kid dysfunct...
1,thanks lyft credit cant use cause dont offer w...
2,bihday majesty
3,model take urð ðððð ððð
4,factsguide society motivation


In [26]:
Word("suffixes").lemmatize()

'suffix'

In [27]:
data.head()

Unnamed: 0,id,label,tweet,wordCount,charCount,avgCount,stopwords,hashtags,mentions,numerics,upper
0,1,0,father dysfunctional selfish drag kid dysfunct...,18,102,4.555556,10,1,1,0,0
1,2,0,thanks lyft credit cant use cause dont offer w...,19,122,5.315789,5,3,2,0,0
2,3,0,bihday majesty,3,21,5.666667,1,0,0,0,0
3,4,0,model take urð ðððð ððð,14,86,4.928571,5,1,0,0,0
4,5,0,factsguide society motivation,4,39,8.0,1,1,0,0,0


### 3. Advanced text processing
Up until now we have done all the basic pre-processing steps in order to clean our data. Now we can move on to the NLP part to gain insightful features.

#### i. N-grams
N-grams are combinations of N consecutive words. N-grams with N=1 are unigrams, N=2 are bigrams, N=3 are trigrams and so on.

Unigrams do not contain as much information as bigrams or trigrams. The basic principle behind n-grams is that they capture language structure, such as which letter or word is likely to follow a given one. The longer the n-gram, the more context you have to work with. If the n-gram is too short, we fail to capture important differences. On the other hand, if the n-gram is too long, we fail to capture general patterns and only match particular cases.

In [28]:
from textblob import TextBlob
import time
print(TextBlob(data['tweet'][1]).ngrams(2),"\n")

start = time.time()
a = data['tweet'][:5].apply(lambda x: TextBlob(x).ngrams(2))
end = time.time()
print(end-start)

[WordList(['thanks', 'lyft']), WordList(['lyft', 'credit']), WordList(['credit', 'cant']), WordList(['cant', 'use']), WordList(['use', 'cause']), WordList(['cause', 'dont']), WordList(['dont', 'offer']), WordList(['offer', 'wheelchair']), WordList(['wheelchair', 'van']), WordList(['van', 'pdx']), WordList(['pdx', 'disapointed']), WordList(['disapointed', 'getthanked'])] 

0.008085012435913086
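
To show what `ngrams` does under the hood, here is a minimal pure-Python sketch. The helper name `ngrams` below is our own and it returns tuples rather than TextBlob's `WordList` objects.

```python
def ngrams(text, n):
    """Return the list of n-grams (as tuples of words) in `text`."""
    words = text.split()
    # zip over n staggered views of the word list: words[0:], words[1:], ...
    return list(zip(*(words[i:] for i in range(n))))


sentence = "thanks lyft credit cant use"
print(ngrams(sentence, 2))
# [('thanks', 'lyft'), ('lyft', 'credit'), ('credit', 'cant'), ('cant', 'use')]
print(len(ngrams(sentence, 3)))  # 3
```

A text with fewer than `n` words simply yields an empty list.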


#### ii. Term Frequency
Term frequency is the ratio of the number of times a word appears in a sentence to the length of the sentence.

TF = (No. of times word T appears in a row)/(No. of words in that row)

In [29]:
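# Raw word counts for the tweet at index 1 (not yet divided by the number of words in the row)
term_freq_0_row = (data['tweet'][1:2]).apply(lambda x: pd.Series(x.split()).value_counts()).sum(axis=0).reset_index()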
term_freq_0_row.columns = ["words", "freq"]
term_freq_0_row

Unnamed: 0,words,freq
0,dont,1
1,pdx,1
2,van,1
3,credit,1
4,disapointed,1
5,offer,1
6,use,1
7,wheelchair,1
8,thanks,1
9,getthanked,1
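
The `freq` column above holds raw counts, whereas the TF formula divides each count by the number of words in the row. A minimal sketch of the normalized version, using only the standard library (the function name `term_frequency` and the sample sentence are assumptions for illustration):

```python
from collections import Counter

def term_frequency(text):
    """Map each word to (occurrences in text) / (total words in text)."""
    words = text.split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


tf = term_frequency("good day good mood")
print(tf)  # {'good': 0.5, 'day': 0.25, 'mood': 0.25}
```

Within a single tweet, dividing by the length does not change the ranking of words, but it makes values comparable across tweets of different lengths.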


#### iii. Inverse Document Frequency
The intuition behind IDF is that a word that appears in too many documents (here, tweets) is of little use to us.

Therefore, the IDF is the log of the ratio of the total number of rows to the number of rows in which the word is present.

IDF = log(N/n), <br>where <b>N</b>=Total number of rows, <br>and <b>n</b>=number of rows in which the word is present.

The higher the IDF value, the more unique the word is.

In [31]:
import numpy as np
# Note: str.contains does a substring (regex) match, so e.g. 'use' also matches 'cause'
for i, word in enumerate(term_freq_0_row['words']):
    term_freq_0_row.loc[i,'idf'] = np.log(len(data)/len(data[data['tweet'].str.contains(word)]))
term_freq_0_row

Unnamed: 0,words,freq,idf
0,dont,1,3.745585
1,pdx,1,8.762865
2,van,1,5.236505
3,credit,1,7.327781
4,disapointed,1,10.372303
5,offer,1,6.522155
6,use,1,3.552287
7,wheelchair,1,9.273691
8,thanks,1,4.597751
9,getthanked,1,9.679156
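
Because `str.contains` matches substrings, a short word such as `use` is also counted in rows that contain `cause`, which lowers its IDF. The following sketch computes IDF with whole-word matching instead. The four-document `docs` list is hypothetical, and returning 0.0 for a word that appears nowhere is an assumption made to avoid a division by zero.

```python
import math

# Hypothetical mini-corpus standing in for data['tweet'].
docs = ["cant use lyft", "cant sleep", "use code", "happy cause"]

def idf(word, documents):
    """log(N / n) where n counts documents containing `word` as a whole token."""
    n = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / n) if n else 0.0


print(idf("lyft", docs))  # log(4 / 1)
print(idf("use", docs))   # log(4 / 2); a substring match would count 3 documents
```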


#### iv. Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is the product of the TF and IDF values calculated above.

In [33]:
term_freq_0_row['tfidf'] = term_freq_0_row['idf']*term_freq_0_row['freq']
term_freq_0_row

Unnamed: 0,words,freq,idf,tfidf
0,dont,1,3.745585,3.745585
1,pdx,1,8.762865,8.762865
2,van,1,5.236505,5.236505
3,credit,1,7.327781,7.327781
4,disapointed,1,10.372303,10.372303
5,offer,1,6.522155,6.522155
6,use,1,3.552287,3.552287
7,wheelchair,1,9.273691,9.273691
8,thanks,1,4.597751,4.597751
9,getthanked,1,9.679156,9.679156


In [35]:
term_freq_0_row[term_freq_0_row['tfidf']<4]

Unnamed: 0,words,freq,idf,tfidf
0,dont,1,3.745585,3.745585
6,use,1,3.552287,3.552287
11,cant,1,3.538194,3.538194


We can see that words like dont, cant and use receive low scores because they occur in many tweets, whereas a word like disapointed receives a high score because it appears in very few tweets and is therefore more helpful in analyzing the sentiment.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
                       stop_words='english',ngram_range=(1,1))
data_vect = tfid.fit_transform(data['tweet'])
data_vect

<31962x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 114029 stored elements in Compressed Sparse Row format>
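
The values stored by `TfidfVectorizer` will not match the manually computed `tfidf` column. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn uses a smoothed IDF, ln((1 + N) / (1 + n)) + 1, and then L2-normalizes each row. A minimal sketch comparing the two IDF formulas follows; the function names are our own.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def plain_idf(n_docs, doc_freq):
    """IDF as defined earlier in this notebook: log(N / n)."""
    return math.log(n_docs / doc_freq)


# A word that appears in every row: plain IDF is 0, smoothed IDF is 1.
print(plain_idf(31962, 31962), smoothed_idf(31962, 31962))  # 0.0 1.0
```

The added 1 means that a word present in every document is down-weighted but not ignored entirely.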

#### v. Bag of Words
Bag of Words (BoW) is a representation of text that describes the presence of words within the text, ignoring their order. The intuition behind it is that two similar pieces of text will contain similar words and will therefore have similar bags of words.

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True,analyzer='word',ngram_range=(1,1))
data_bow = bow.fit_transform(data['tweet'])
data_bow

<31962x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 128389 stored elements in Compressed Sparse Row format>
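
To see what the sparse matrix contains, here is a hand-rolled sketch of a bag-of-words matrix on two made-up texts. It illustrates the idea only; `CountVectorizer` additionally applies its own tokenization and the `max_features` limit.

```python
from collections import Counter

texts = ["good day good mood", "bad day"]  # hypothetical example texts

# Vocabulary: every distinct word, sorted so column order is deterministic.
vocabulary = sorted({word for text in texts for word in text.split()})

# One row per text, one column per vocabulary word, holding raw counts.
bow_matrix = []
for text in texts:
    counts = Counter(text.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'day', 'good', 'mood']
print(bow_matrix)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```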

#### vi. Sentiment Analysis
Before applying any Machine Learning or Deep Learning models, we can extract a feature called sentiment using the textblob library.

In [42]:
from textblob import TextBlob
data['tweet'][:5].apply(lambda x:TextBlob(x).sentiment)

0    (-0.3, 0.5354166666666667)
1                    (0.2, 0.2)
2                    (0.0, 0.0)
3                    (0.0, 0.0)
4                    (0.0, 0.0)
Name: tweet, dtype: object

As we can see above, it returns a tuple which represents the polarity and subjectivity of each tweet. Here we only extract the polarity, as it indicates the sentiment. A value nearer to 1 means a positive sentiment and a value nearer to -1 means a negative sentiment. This can also work as a feature for building a machine learning model.

In [45]:
data['sentiment'] = data['tweet'].apply(lambda x:TextBlob(x).sentiment[0])
data[['tweet','sentiment']].head()

Unnamed: 0,tweet,sentiment
0,father dysfunctional selfish drag kid dysfunct...,-0.3
1,thanks lyft credit cant use cause dont offer w...,0.2
2,bihday majesty,0.0
3,model take urð ðððð ððð,0.0
4,factsguide society motivation,0.0


#### vii. Word Embeddings
Word embedding is the representation of words in the form of vectors. The underlying idea is that similar words will have vectors that lie close to each other.
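
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only; real embeddings
# (e.g. GloVe) typically have 50 to 300 dimensions.
happy = [0.9, 0.1, 0.3]
glad = [0.8, 0.2, 0.3]
angry = [-0.7, 0.6, 0.1]

print(cosine_similarity(happy, glad) > cosine_similarity(happy, angry))  # True
```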

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 

In [47]:
data.head()

Unnamed: 0,id,label,tweet,wordCount,charCount,avgCount,stopwords,hashtags,mentions,numerics,upper,sentiment
0,1,0,father dysfunctional selfish drag kid dysfunct...,18,102,4.555556,10,1,1,0,0,-0.3
1,2,0,thanks lyft credit cant use cause dont offer w...,19,122,5.315789,5,3,2,0,0,0.2
2,3,0,bihday majesty,3,21,5.666667,1,0,0,0,0,0.0
3,4,0,model take urð ðððð ððð,14,86,4.928571,5,1,0,0,0,0.0
4,5,0,factsguide society motivation,4,39,8.0,1,1,0,0,0,0.0
