# Introduction

One of the biggest breakthroughs required for achieving any level of artificial intelligence is machines that can process text data.

It has become imperative for an organization to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with text data has never been more important.

In this notebook we will discuss different feature extraction methods, starting with some basic techniques which will lead into advanced Natural Language Processing techniques. We will also learn about pre-processing of the text data in order to extract better features from clean data.

# Contents:

* ### Basic feature extraction
    - Number of words
    - Number of characters
    - Average word length
    - Number of stopwords
    - Number of special characters
    - Number of numerics
    - Number of uppercase words
    - Number of lowercase words
* ### Basic Text Pre-processing
    - Lower casing
    - Punctuation removal
    - Stopwords removal
    - Frequent words removal
    - Rare words removal
    - Spelling correction
    - Tokenization
    - Stemming
    - Lemmatization
* ### Advanced Text Processing
    - N-grams
    - Term Frequency
    - Inverse Document Frequency
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - Bag of Words
    - Sentiment Analysis
    - Word Embedding


# Basic Feature Extraction

We can use text data to extract a number of features even if we don’t have sufficient knowledge of Natural Language Processing.

Before starting, let’s read the training file so that we can perform different tasks on it. Throughout this notebook, we will use the Twitter sentiment [dataset](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/) from the Analytics Vidhya DataHack platform.

In [2]:
import numpy as np
import pandas as pd
train_data = pd.read_csv('tweet_data_av/train_tweets.csv')

In [3]:
train_data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


### Number of Words

In [5]:
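# split(" ") treats every single space as a separator, so leading or repeated spaces produce empty tokens that are counted as words
train_data['word_count'] = train_data['tweet'].apply(lambda x: len(str(x).split(" ")))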
train_data[['tweet','word_count']].head()

Unnamed: 0,tweet,word_count
0,@user when a father is dysfunctional and is s...,21
1,@user @user thanks for #lyft credit i can't us...,22
2,bihday your majesty,5
3,#model i love u take with u all the time in ...,17
4,factsguide: society now #motivation,8
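
The count of 5 for the three-word tweet in row 2 comes from `split(" ")`, which yields an empty token for every extra space. A minimal sketch of the difference, using a hypothetical sample string (the two leading spaces are an assumption chosen to reproduce that count, and the function names are illustrative only):

```python
def count_words_by_space(text):
    # Splitting on a single space keeps empty strings for leading or repeated spaces
    return len(str(text).split(" "))


def count_words(text):
    # split() with no argument splits on runs of whitespace and drops empty tokens
    return len(str(text).split())


sample = "  bihday your majesty"  # hypothetical: two leading spaces
print(count_words_by_space(sample), count_words(sample))
```

Which variant is preferable depends on whether irregular spacing should count as signal or as noise.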


### Number of characters

In [7]:
train_data['char_count'] = train_data['tweet'].str.len() ## this also includes spaces
train_data[['tweet','char_count']].head()

Unnamed: 0,tweet,char_count
0,@user when a father is dysfunctional and is s...,102
1,@user @user thanks for #lyft credit i can't us...,122
2,bihday your majesty,21
3,#model i love u take with u all the time in ...,86
4,factsguide: society now #motivation,39


### Average Word Length

In [9]:
def avg_word(sentence):
    # Mean length of the whitespace-separated tokens; 0.0 for an empty tweet
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

train_data['avg_word'] = train_data['tweet'].apply(avg_word)
train_data[['tweet','avg_word']].head()

Unnamed: 0,tweet,avg_word
0,@user when a father is dysfunctional and is s...,4.555556
1,@user @user thanks for #lyft credit i can't us...,5.315789
2,bihday your majesty,5.666667
3,#model i love u take with u all the time in ...,4.928571
4,factsguide: society now #motivation,8.0


### Number of stopwords

In [10]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test fast

train_data['stopwords'] = train_data['tweet'].apply(lambda x: len([w for w in x.split() if w in stop]))
train_data[['tweet','stopwords']].head()

Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1


### Number of special characters

Another interesting feature we can extract from a tweet is the number of hashtags or mentions it contains. These counts can carry extra information about the text.

In [11]:
train_data['hashtags'] = train_data['tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))
train_data[['tweet','hashtags']].head()

Unnamed: 0,tweet,hashtags
0,@user when a father is dysfunctional and is s...,1
1,@user @user thanks for #lyft credit i can't us...,3
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,1
4,factsguide: society now #motivation,1


In [12]:
train_data['at_the_rate_tags'] = train_data['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('@')]))
train_data[['tweet','at_the_rate_tags']].head()

Unnamed: 0,tweet,at_the_rate_tags
0,@user when a father is dysfunctional and is s...,1
1,@user @user thanks for #lyft credit i can't us...,2
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0
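
Counting tokens that start with `#` or `@` misses tags that are glued to the previous token. A possible alternative, sketched here with the standard `re` module, counts matches anywhere in the text. The sample string and the name `count_tags` are hypothetical, and the patterns are a simplification (for example, an e-mail address would be counted as a mention):

```python
import re

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def count_tags(text):
    # Returns (number of hashtags, number of mentions) found anywhere in the text
    return len(HASHTAG.findall(text)), len(MENTION.findall(text))


# Hypothetical tweet in which two hashtags are written without a space between them
sample = "@user @user thanks for #lyft credit #disapointed#getthanked"
print(count_tags(sample))
```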


### Number of numerics

In [13]:
train_data['numerics'] = train_data['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train_data[['tweet','numerics']].head()

Unnamed: 0,tweet,numerics
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


In [14]:
train_data[['tweet','numerics']].tail()

Unnamed: 0,tweet,numerics
31957,ate @user isz that youuu?ðððððð...,0
31958,to see nina turner on the airwaves trying to...,0
31959,listening to sad songs on a monday morning otw...,0
31960,"@user #sikh #temple vandalised in in #calgary,...",0
31961,thank you @user for you follow,0
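
Note that `str.isdigit()` only accepts tokens made up entirely of digits, so values such as `19.99` or `1,000` are not counted. If those should count as numerics, one option is a regular expression. This is a rough sketch under that assumption; the pattern, the sample text and the name `count_numerics` are illustrative, not part of the original notebook:

```python
import re

# Optional sign, digits, then any number of ".digits" or ",digits" groups
NUMBER = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def count_numerics(text):
    # Count the tokens that consist entirely of a number-like pattern
    return sum(1 for token in text.split() if NUMBER.fullmatch(token))


sample = "won 2 tickets for 19.99 and 1,000 points"  # hypothetical text
print(count_numerics(sample))
```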


### Number of Uppercase words

In [15]:
train_data['upper'] = train_data['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train_data[['tweet','upper']].head()

Unnamed: 0,tweet,upper
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


### Number of Lowercase words

In [17]:
train_data['lower'] = train_data['tweet'].apply(lambda x: len([x for x in x.split() if x.islower()]))
train_data[['tweet','lower']].head()

Unnamed: 0,tweet,lower
0,@user when a father is dysfunctional and is s...,18
1,@user @user thanks for #lyft credit i can't us...,19
2,bihday your majesty,3
3,#model i love u take with u all the time in ...,14
4,factsguide: society now #motivation,4


# Basic Pre-processing

### Lower case

In [19]:
train_data['tweet'] = train_data['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
train_data['tweet'].head()

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object

###  Removing Punctuation

In [20]:
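# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
train_data['tweet'] = train_data['tweet'].str.replace(r'[^\w\s]', '', regex=True)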
train_data['tweet'].head()

0    user when a father is dysfunctional and is so ...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model i love u take with u all the time in urð...
4                    factsguide society now motivation
Name: tweet, dtype: object
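
To see what the pattern `[^\w\s]` does, it can be tried on a single string with the standard `re` module. This is only an illustration; `remove_punctuation` is a hypothetical helper name:

```python
import re


def remove_punctuation(text):
    # Drop every character that is neither a word character (\w) nor whitespace (\s)
    return re.sub(r"[^\w\s]", "", text)


print(remove_punctuation("@user thanks for #lyft credit i can't use"))
```

Because `\w` includes the underscore, underscores are kept by this pattern.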

### Removal of Stop Words

Stop words (or commonly occurring words) should be removed from the text data.

In [21]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
train_data['tweet'] = train_data['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
train_data['tweet'].head()

0    user father dysfunctional selfish drags kids d...
1    user user thanks lyft credit cant use cause do...
2                                       bihday majesty
3                model love u take u time urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

### Common word removal

Previously, we removed words that are common in the language in general. We can also remove words that occur frequently in our own text data. First, let’s check the 10 most frequently occurring words and then decide whether to remove or retain them.

In [22]:
freq = pd.Series(' '.join(train_data['tweet']).split()).value_counts()[:10]
freq

user     17473
love      2647
ð         2511
day       2199
â         1797
happy     1663
amp       1582
im        1139
u         1136
time      1110
dtype: int64

Let’s remove these words, as their presence will not be of any use in classifying our text data.

In [28]:
freq = list(freq.index)  # the words are the index of the Series; its values are the counts
train_data['tweet'] = train_data['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in freq))
train_data['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object
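
The same idea can be sketched without pandas using `collections.Counter`, which makes the distinction between the words and their counts explicit. The toy documents and the name `remove_most_common` are assumptions for illustration:

```python
from collections import Counter


def remove_most_common(docs, k):
    # Count every token across all documents
    counts = Counter(word for doc in docs for word in doc.split())
    # Keep only the words (not the counts) of the k most frequent tokens
    common = {word for word, _ in counts.most_common(k)}
    cleaned = [" ".join(w for w in doc.split() if w not in common) for doc in docs]
    return cleaned, common


docs = ["user love day", "user love", "user happy time"]  # hypothetical mini corpus
cleaned, common = remove_most_common(docs, 2)
print(common, cleaned)
```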

### Rare words removal

In [30]:
freq = pd.Series(' '.join(train_data['tweet']).split()).value_counts()[-10:]
freq

knownððâfathersday    1
lanâ                  1
punediaries           1
siouxfalls            1
hierarchy             1
taitung               1
twinsies              1
amok                  1
russianscum           1
gelber                1
dtype: int64

In [31]:
freq = list(freq.index)  # the words are the index of the Series; its values are the counts
train_data['tweet'] = train_data['tweet'].apply(lambda x: " ".join(w for w in x.split() if w not in freq))
train_data['tweet'].head()

0    father dysfunctional selfish drags kids dysfun...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

### Spelling correction

We’ve all seen tweets with a plethora of spelling mistakes. Our timelines are often filled with hastily sent tweets that are barely legible at times.

Spelling correction is therefore a useful pre-processing step, because it reduces the number of different spellings of the same word.

To achieve this we will use the textblob library.

In [33]:
from textblob import TextBlob
train_data['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3                               model take or ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

Note that these corrections take a long time to run, so for the purposes of learning I have applied the technique to only the first 5 rows. We also cannot expect it to always be accurate: in the output above, ‘kids’ became ‘kiss’ and ‘lyft’ became ‘left’. Some care should be taken before applying it.

We should also keep in mind that words are often used in their abbreviated form. For instance, ‘your’ is used as ‘ur’. We should treat this before the spelling correction step, otherwise these words might be transformed into a different word, as happened above where ‘urð’ became ‘or’.
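
One simple way to treat abbreviations before spelling correction is a lookup table. The sketch below is only an assumption about how this could be done: the table `ABBREVIATIONS` is hypothetical and would need to be built for the data at hand.

```python
# Hypothetical lookup table; real data would need a much longer, curated list
ABBREVIATIONS = {"ur": "your", "u": "you"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    # Replace each token that appears in the table, leave all others untouched
    return " ".join(table.get(word, word) for word in text.split())


print(expand_abbreviations("model i love u take with u all the time in ur"))
```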

### Tokenization

Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we have used the textblob library to first transform our tweets into a blob and then converted them into a series of words.

In [34]:
TextBlob(train_data['tweet'][1]).words

WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

### Stemming

Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. For this purpose, we will use PorterStemmer from the NLTK library.

In [35]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
train_data['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3                              model take urð ðððð ððð
4                              factsguid societi motiv
Name: tweet, dtype: object

### Lemmatization

Lemmatization is a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffixes. It makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, we usually prefer using lemmatization over stemming.

In [36]:
from textblob import Word
train_data['tweet'] = train_data['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train_data['tweet'].head()

0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3                              model take urð ðððð ððð
4                        factsguide society motivation
Name: tweet, dtype: object

# Advanced Text Processing

### N-grams

N-grams are combinations of multiple words used together. N-grams with N=1 are called unigrams. Similarly, bigrams (N=2), trigrams (N=3) and so on can also be used.

Unigrams usually carry less information than bigrams and trigrams. The basic principle behind n-grams is that they capture the language structure, such as which letter or word is likely to follow a given one. The longer the n-gram (the higher the n), the more context you have to work with. The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.

So, let’s quickly extract bigrams from our tweets using the ngrams function of the textblob library.

In [37]:
TextBlob(train_data['tweet'][0]).ngrams(2)

[WordList(['father', 'dysfunctional']),
 WordList(['dysfunctional', 'selfish']),
 WordList(['selfish', 'drag']),
 WordList(['drag', 'kid']),
 WordList(['kid', 'dysfunction']),
 WordList(['dysfunction', 'run'])]
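
For intuition, n-grams can also be produced with a sliding window in plain Python. This is a minimal sketch rather than what textblob does internally; the function name `ngrams` is chosen only for illustration:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "father dysfunctional selfish drag kid dysfunction run".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

A text with fewer than n tokens simply yields an empty list.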

### Term frequency

Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.

Therefore, we can generalize term frequency as:

` TF = (Number of times term T appears in the particular row) / (number of terms in that row) `

In [38]:
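# This gives the raw count of each word in the tweet; divide by the number of words in the tweet to get the ratio defined above
tf1 = (train_data['tweet'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()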
tf1.columns = ['words','tf']
tf1

Unnamed: 0,words,tf
0,cause,1
1,lyft,1
2,credit,1
3,dont,1
4,disapointed,1
5,use,1
6,cant,1
7,thanks,1
8,wheelchair,1
9,pdx,1
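
The table above holds raw counts. To obtain the ratio from the TF formula, each count is divided by the number of words in the row. A small sketch with `collections.Counter`, where the name `term_frequency` is illustrative and the tokens are those of the tweet tokenized earlier:

```python
from collections import Counter


def term_frequency(text):
    # Count of each word divided by the total number of words in the text
    words = text.split()
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


tweet = "thanks lyft credit cant use cause dont offer wheelchair vans pdx disapointed getthanked"
tf = term_frequency(tweet)
print(tf["lyft"])
```

Each of the 13 distinct words appears once, so every term frequency here is 1/13.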


### Inverse Document Frequency

The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it’s appearing in all the documents.

Therefore, the IDF of each word is the log of the ratio of the total number of rows to the number of rows in which that word is present.

`IDF = log(N/n)`, where N is the total number of rows and n is the number of rows in which the word is present.


In [40]:
for i,word in enumerate(tf1['words']):
    # str.contains matches substrings, so 'use' is also counted in rows that contain 'cause'
    tf1.loc[i, 'idf'] = np.log(train_data.shape[0]/(len(train_data[train_data['tweet'].str.contains(word)])))

tf1

Unnamed: 0,words,tf,idf
0,cause,1,5.690172
1,lyft,1,8.762865
2,credit,1,7.327781
3,dont,1,3.745585
4,disapointed,1,10.372303
5,use,1,3.552287
6,cant,1,3.538194
7,thanks,1,4.597751
8,wheelchair,1,9.273691
9,pdx,1,8.762865
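
Because `str.contains` matches substrings, the row counts above can be higher than the number of rows that contain the word as a separate token. A sketch of the IDF formula with whole-token matching, on a hypothetical four-document corpus (the function name is also an assumption):

```python
import math


def inverse_document_frequency(word, docs):
    # n: number of documents that contain the word as a whole token
    n = sum(1 for doc in docs if word in doc.split())
    if n == 0:
        raise ValueError("word does not occur in any document")
    return math.log(len(docs) / n)


docs = ["cant use cause", "use it", "just because", "thanks"]  # hypothetical corpus
print(inverse_document_frequency("use", docs))
```

Here ‘use’ is a token in two of the four documents, whereas substring matching would also count ‘because’ and give three.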


### Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF is the multiplication of the TF and IDF which we calculated above. 

In [41]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

Unnamed: 0,words,tf,idf,tfidf
0,cause,1,5.690172,5.690172
1,lyft,1,8.762865,8.762865
2,credit,1,7.327781,7.327781
3,dont,1,3.745585,3.745585
4,disapointed,1,10.372303,10.372303
5,use,1,3.552287,3.552287
6,cant,1,3.538194,3.538194
7,thanks,1,4.597751,4.597751
8,wheelchair,1,9.273691,9.273691
9,pdx,1,8.762865,8.762865


We don’t have to calculate TF and IDF separately and then multiply them every time to obtain TF-IDF. Instead, sklearn provides the TfidfVectorizer class to obtain it directly:

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(train_data['tweet'])

train_vect

<31962x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 114042 stored elements in Compressed Sparse Row format>
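
The values stored by `TfidfVectorizer` will not match the hand-computed table, because with its default settings (`smooth_idf=True`, `norm='l2'`) it uses a smoothed IDF, `ln((1 + N) / (1 + n)) + 1`, and scales each row to unit length. The sketch below reproduces only that weighting in plain Python. It ignores sklearn's tokenization, stop-word filtering and `max_features`, and the name `tfidf_rows` is hypothetical:

```python
import math
from collections import Counter


def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF: ln((1 + N) / (1 + n)) + 1
    idf = {word: math.log((1 + n_docs) / (1 + n)) + 1 for word, n in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {word: count * idf[word] for word, count in Counter(tokens).items()}
        # L2-normalise the row so that its squared weights sum to 1
        norm = math.sqrt(sum(value * value for value in weights.values()))
        rows.append({word: value / norm for word, value in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["bihday majesty", "majesty"])
print(rows)
```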

### Bag of Words

Bag of Words (BoW) is a representation of text that describes the occurrence of words within the text data. The intuition behind this is that two similar text fields will contain similar words, and will therefore have a similar bag of words. Further, from the words alone we can learn something about the meaning of the document.

For implementation, sklearn provides the CountVectorizer class, as shown below:

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(train_data['tweet'])
train_bow

<31962x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 128387 stored elements in Compressed Sparse Row format>
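
What the sparse matrix contains can be illustrated on a tiny scale: one row per document, one column per vocabulary word, and each cell holding a count. This is a plain-Python sketch, not sklearn's implementation; the second document is hypothetical and `bag_of_words` is an illustrative name:

```python
from collections import Counter


def bag_of_words(docs):
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # A Counter returns 0 for words that do not occur in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


vocab, matrix = bag_of_words(["bihday majesty", "factsguide society motivation society"])
print(vocab)
print(matrix)
```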

### Sentiment Analysis

Recall that our problem is to detect the sentiment of each tweet. So, before applying any ML/DL models (which can use a sentiment score from the textblob library as a separate feature), let’s check the sentiment of the first few tweets.

In [45]:
train_data['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

0    (-0.3, 0.5354166666666667)
1                    (0.2, 0.2)
2                    (0.0, 0.0)
3                    (0.0, 0.0)
4                    (0.0, 0.0)
Name: tweet, dtype: object

Above, you can see that it returns a tuple representing the polarity and subjectivity of each tweet. Here, we extract only the polarity, as it indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment. This can also work as a feature for building a machine learning model.

In [46]:
train_data['sentiment'] = train_data['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train_data[['tweet','sentiment']].head()

Unnamed: 0,tweet,sentiment
0,father dysfunctional selfish drag kid dysfunct...,-0.3
1,thanks lyft credit cant use cause dont offer w...,0.2
2,bihday majesty,0.0
3,model take urð ðððð ððð,0.0
4,factsguide society motivation,0.0


### Word Embeddings

Word embeddings represent words in the form of vectors. The underlying idea here is that similar words will have vectors that lie close to each other.

Word2Vec models require a lot of text, so we can either train one on our own training data or use pre-trained word vectors, such as those released by Google or trained on Wikipedia.

Here, we will use pre-trained word vectors which can be downloaded from the [GloVe](https://nlp.stanford.edu/projects/glove/) website. Vectors of different dimensions (50, 100, 200, 300) trained on Wikipedia data are available. For this example, I have downloaded the 100-dimensional version of the model.

The first step here is to convert the GloVe file into the word2vec format.

In [47]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

Now, we can load the above word2vec file as a model.

In [48]:
from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

Let’s say our tweet contains the text ‘go away’. We can easily obtain the vector of each word using the above model:

In [49]:
model['go']

In [50]:
model['away']

We then take the average of the two word vectors to represent the string ‘go away’ as a single 100-dimensional vector.

In [51]:
(model['go'] + model['away'])/2

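
The averaging step generalizes to tweets of any length. The sketch below uses hypothetical 3-dimensional toy vectors in place of the 100-dimensional GloVe vectors, so it runs without gensim or the downloaded file. The names `average_vectors` and `embed` are illustrative only:

```python
def average_vectors(vectors):
    # Element-wise mean of equally sized vectors
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


def embed(text, word_vectors):
    # Average the vectors of the words that have one; None if no word is known
    found = [word_vectors[word] for word in text.split() if word in word_vectors]
    return average_vectors(found) if found else None


# Hypothetical toy vectors standing in for model['go'] and model['away']
toy_vectors = {"go": [0.5, -1.0, 0.25], "away": [1.5, 0.0, -0.75]}
print(embed("go away", toy_vectors))
```

Words without a vector are skipped here, which is one of several possible choices for handling out-of-vocabulary words.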