## Intro to NLP with Twitter Sentyment Analysis
##### By Ruben Seoane

** Original Resource: https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/


##### Importing Libraries:

In [4]:
import pandas as pd

#### 1. Basic Feature Extraction

Let's get the dataset first:

In [5]:
train = pd.read_csv('train_E6oV3lV.csv')

##### 1.1 Number of Words

Let's start by counting teh number of words per tweet, to get see whether there is any correlation between text length and sentyment, we will use the Python function _split_

In [6]:
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train[['tweet','word_count']].head()

Unnamed: 0,tweet,word_count
0,@user when a father is dysfunctional and is s...,21
1,@user @user thanks for #lyft credit i can't us...,22
2,bihday your majesty,5
3,#model i love u take with u all the time in ...,17
4,factsguide: society now #motivation,8


##### 1.2 Number of characters

Based on the previous feature intuition, we calculate the length of the twitter to get the # of charachters:
* The calculation includes the # of spaces, they can be removed later.

In [7]:
train['char_count'] = train['tweet'].str.len() ## this also includes spaces
train[['tweet','char_count']].head()

Unnamed: 0,tweet,char_count
0,@user when a father is dysfunctional and is s...,102
1,@user @user thanks for #lyft credit i can't us...,122
2,bihday your majesty,21
3,#model i love u take with u all the time in ...,86
4,factsguide: society now #motivation,39


##### 1.3 Average Word Length

This feature might potentialy help us get to a more precise model, we calculate it on a per tweet basis (sum of the length of all words divided by the tweet legth):

In [8]:
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words)/len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train[['tweet','avg_word']].head()


Unnamed: 0,tweet,avg_word
0,@user when a father is dysfunctional and is s...,4.555556
1,@user @user thanks for #lyft credit i can't us...,5.315789
2,bihday your majesty,5.666667
3,#model i love u take with u all the time in ...,4.928571
4,factsguide: society now #motivation,8.0


##### 1.4 Number of Stopwords

Usually during NLP analysis, stop words are eliminated. But knowing how many of then there are might be a useful feature to add to the model, so we'll count them and store it in a variable:

In [9]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train[['tweet','stopwords']].head()

Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1


##### 1.5 Number of special charachters

Other potentially interesting feature is calculating the number of hashtags or mentions in a tweet, which gives an extra dimension for analysis.
We apply the _starts with_ function as # or mentions always appear at the beginning of a word:

In [10]:
train['hastags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train[['tweet','hastags']].head()

Unnamed: 0,tweet,hastags
0,@user when a father is dysfunctional and is s...,1
1,@user @user thanks for #lyft credit i can't us...,3
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,1
4,factsguide: society now #motivation,1


In [11]:
train['mentions'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('@')]))
train[['tweet','mentions']].head()

Unnamed: 0,tweet,mentions
0,@user when a father is dysfunctional and is s...,1
1,@user @user thanks for #lyft credit i can't us...,2
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


##### 1.6 Number of numeric characters

As with the number of words we can calculate the number of numerics in the tweets. It's not specially usefull in this use case but is commonly used and run in other cases:

In [12]:
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train[['tweet','numerics']].head()

Unnamed: 0,tweet,numerics
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


##### 1.7 Number of uppercase words

Negative sentyment like anger, rages is usually expressed in UPPERCASE words, which makes it another dimension (feature) that we should extract:

In [13]:
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train[['tweet','upper']].head()

Unnamed: 0,tweet,upper
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


#### 2. Basic Preprocessing

After extracting some basic features, let's start cleaning the data before further analysis, we will employ some basic pre-processing methods.

##### 2.1 Lower case

To avoid having multiple copies of the same words ("book" and "Book" will be counted as 2) we will transform capitalized words into lower case:

In [14]:
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train['tweet'].head()

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object

##### 2.2 Removing Puntuation

As punctuation doesn't add relevant information, removing it will help us reduce the size of the training data.

In [15]:
train['tweet'] = train['tweet'].str.replace('[^\w\s]','')
train['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object

##### 2.3 Removing Stop Words

Stop words or any commonly ocurring words should be removed from the text data. It can be done either by a manualy created list or through predefined libraries:


In [16]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train['tweet'].head()

0    user father dysfunctional selfish drags kids d...
1    user user thanks lyft credit cant use cause do...
2                                       bihday majesty
3                model love u take u time urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

##### 2.4 Common Word Removal

In the previous steps we removed general common words. Now we will look for the 10 most frequently ocurring words within the text data.

In [17]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

user     17473
love      2647
ð         2511
day       2199
â         1797
happy     1663
amp       1582
im        1139
u         1136
time      1110
dtype: int64

Let's remove this words as they don't add any use for classification purposes:

In [18]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

##### 2.5 Rare words removal

Now lets remove rarely ocurring words within the text. Due to their reduced frequency, the association between them and another words is dominated by noise. The trick is to remove and replace them with more commonly occuring sinonyms that tend to appear more frequently.

In [19]:
freq = pd.Series("".join(train["tweet"]).split()).value_counts()[-10:]
freq

jason            1
begging          1
rememberyoull    1
lateeven         1
iphone45         1
womenonly        1
devolved         1
healingpeople    1
tot              1
nickelodeon      1
dtype: int64

In [20]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

##### 2.6 Spelling Correction

As it is very common to encounter spelling mistakes in text from social media, applying a correction process is a necessary pre-processing step as it will reduce the number of variations and copies from the same word.

We will use the _textblob_ library to implement this step:

In [24]:
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3                               model take or ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

#### 2.7 Tokenization
Tokenization is the process of dividing the text into a sequence of words or sentences. We have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

In [25]:
TextBlob(train['tweet'][1]).words

WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

#### 2.8 Stemming
Stemming is the process of removing suffices from a word, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. We will use PorterStemmer from the NLTK library.

In [27]:
from nltk import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object

#### 2.9 Lematization
Lemmatization is a more effective option than stemming as it converts the word into its root, rather than stripping the suffices. It makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, **we usually prefer using lemmatization over stemming.**

In [31]:
from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

### 3 Advanced Text Processing

The basic preprocessing for cleaning the data it's done now. We will start extracting features through NLP techniques.

#### 3.1 N-grams
N-grams are the combination of multiple words used together. Ngrams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

Unigrams do not usually contain as much information as compared to bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

Let’s extract bigrams from our tweets using the ngrams function of the textblob library.

In [32]:
TextBlob(train['tweet'][0]).ngrams(2)

[WordList(['father', 'dysfunctional']),
 WordList(['dysfunctional', 'selfish']),
 WordList(['selfish', 'drag']),
 WordList(['drag', 'kid']),
 WordList(['kid', 'dysfunction']),
 WordList(['dysfunction', 'run'])]

#### 3.2 Term Frequency
Term frequency is the ratio of the count of a word present in a sentence, relative to the length of the sentence.
We can generalize term frequency as:
**TF = (Number of times term T appears in the particular row) / (number of terms in that row)**

This is the term frequency table of a tweet:

In [33]:
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1

Unnamed: 0,words,tf
0,cant,1
1,cause,1
2,getthanked,1
3,lyft,1
4,pdx,1
5,van,1
6,thanks,1
7,credit,1
8,disapointed,1
9,wheelchair,1


#### 3.3 Inverse Document Frequency
Inverse Document Frequency (IDF) comes to mean that the more frequently a word appears in all documents, the less valuable is to us, like "the", "is", "a", etc.

Therefore, the IDF of each word is the log of the ratio of the total number of rows to the number of rows in which that word is present. The highed the IDF value, the more unique the word is.

IDF = log(N/n), where, N is the total number of rows and n is the number of rows in which the word was present.

Let’s calculate IDF for the same tweets for which we calculated the TF:

In [39]:
for i,word in enumerate(tf1['words']):
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))

tf1

NameError: name 'numpy' is not defined

#### 3.4 Term Frequency - Inverse Document Frequency

#### 3.5 Bag of Words

#### 3.6 Sentiment Analysis
#### 3.7 Word Embeddings