# Introduction to Natural Language Processing

## Source
- [A Gentle Introduction to NLP](https://towardsdatascience.com/a-gentle-introduction-to-natural-language-processing-e716ed3c0863)
- [NLP Getting Started Kaggle Exercise](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial)

## What is NLP
Natural Language Processing (NLP) is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally. Its goal is to let people interface with computers, in both written and spoken contexts, using natural human languages instead of computer languages.

## Applications
- Chat Bots
- Language Translation
- Summarization
- Sentiment Analysis
- Transcription
... and many more

## Challenges in NLP
- Languages change constantly: new words, acronyms, new rules, ...
- Dialects
- Domain-specific and context-dependent meaning
- Interleaving of words from other languages within a language
... and many more

## Pre-processing for NLP
1. Data Cleansing - remove special characters, symbols, punctuation, HTML tags (`<>`), etc. from the raw data. These carry no information for the model to learn and are simply noise in our data.
2. Case Conversion - convert all characters to lower case for uniformity.
3. Tokenization - the process of breaking up a text document into individual words called tokens.
4. Stop Words Removal - removal of common words that contribute little information to a text document. Words like ‘the’, ‘is’, ‘a’ have less value and add noise to the text data.
5. Stemming - the process of reducing a word to its stem/root. It reduces inflected forms of a word (e.g. ‘help’, ‘helping’, ‘helped’, ‘helpful’) to their root form (e.g. ‘help’).
6. Lemmatization - like stemming, it converts a word to its root form, with one difference: the root is always a valid word in the language. For example, the word ‘caring’ would map to ‘care’, and not to ‘car’ as a crude suffix-stripping stemmer might produce.
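
To make the stemming step concrete, here is a minimal sketch of a deliberately crude suffix-stripping stemmer. The `naive_stem` function and its `SUFFIXES` list are hypothetical and only for illustration; real stemmers such as the Porter stemmer apply more careful rules and may give different results.

```python
# Hypothetical, deliberately crude stemmer: strips at most one known suffix.
SUFFIXES = ("ing", "ful", "ed")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["help", "helping", "helped", "helpful", "caring"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['help', 'help', 'help', 'help', 'car']
```

The last result shows the weakness that lemmatization addresses: blind suffix removal turns ‘caring’ into ‘car’, which is a different word altogether.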

## N-Grams
An N-gram is a sequence of N words used together. N-grams with N=1 are called unigrams; similarly, bigrams (N=2), trigrams (N=3), and so on can also be used.
N-grams are useful when we want to preserve sequence information in the document, such as which word is likely to follow a given one. Unigrams don’t contain any sequence information because each word is taken individually.
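
As a rough illustration, N-grams can be built by sliding a window of size N over the token list. The `ngrams` helper below is a hypothetical name introduced for this sketch, and the whitespace split stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "forest fire near la ronge".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('forest', 'fire'), ('fire', 'near'), ('near', 'la'), ('la', 'ronge')]
```

Each bigram records which word follows which, which is exactly the sequence information that unigrams lose.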

## Text Data Vectorization
The process of converting text into numbers is called text data vectorization. After text preprocessing, we need to represent the text numerically, i.e., encode it as numbers that algorithms can work with.

### Bag of Words
It is one of the simplest text vectorization techniques. The intuition behind BOW is that two sentences are similar if they contain a similar set of words. BOW constructs a dictionary of d unique words in the corpus (the collection of all the tokens in the data) and represents each document as a vector of d word counts.

BOW operates on tokenized words.
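
A minimal sketch of the idea, using only the standard library. The two example sentences, the whitespace tokenization, and the `bow_vector` helper are assumptions made for illustration.

```python
from collections import Counter

docs = ["forest fire near town", "fire crews fight forest fire"]
tokenized = [doc.split() for doc in docs]

# The dictionary of d unique words, sorted so every vector uses the same order.
vocabulary = sorted({token for doc in tokenized for token in doc})


def bow_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [bow_vector(doc) for doc in tokenized]

print(vocabulary)  # ['crews', 'fight', 'fire', 'forest', 'near', 'town']
print(vectors)     # [[0, 0, 1, 1, 1, 1], [1, 1, 2, 1, 0, 0]]
```

Word order is discarded: only the counts per vocabulary entry survive, which is why the technique is called a "bag" of words.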

### TF-IDF
TF-IDF stands for Term Frequency (TF) - Inverse Document Frequency (IDF).
Term Frequency measures how often a word occurs in a document, which can be read as the probability of finding that word in the document.
- TF(wi, dj) = (number of times wi occurs in dj) / (total number of words in dj)
    - wi is Word i
    - dj is Document j

Inverse Document Frequency measures how rare the word is across the whole corpus.
- IDF(wi, Dc) = log(N/ni)
    - wi is Word i
    - Dc is all documents in the corpus
    - N is the total number of documents
    - ni is the number of documents that contain word i
    - If wi occurs in many documents of the corpus, ni increases and the IDF value decreases.
    - If wi occurs in few documents, ni decreases and the IDF value increases.

TF-IDF is the product of the TF and IDF values. It gives more weight to words that occur often in a document but rarely in the rest of the corpus.
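
The formulas above can be sketched directly in plain Python. The toy corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math

# Toy corpus: each document is already tokenized.
corpus = [
    ["fire", "in", "the", "forest"],
    ["the", "match", "was", "on", "fire"],
    ["the", "weather", "is", "nice"],
]


def tf(word, doc):
    """Number of times the word occurs in the document / total words in it."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """log(N / ni); assumes the word occurs in at least one document."""
    n_i = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_i)


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


for word in ["the", "fire", "forest"]:
    print(word, tf_idf(word, corpus[0], corpus))
```

In the first document all three words have the same term frequency, but ‘the’ appears in every document and so scores zero, while ‘forest’ appears in only one document and scores highest.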

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
# Dataset of tweets about disasters
train_df = pd.read_csv("data/train.csv")

In [3]:
train_df.sample(10)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 6136 | 8754 | siren | New York | WHELEN MODEL 295SS-100 SIREN AMPLIFIER POLICE ... | 0 |
| 1442 | 2078 | casualty | Trinity, Bailiwick of Jersey | @ScriptetteSar @katiecool447 btw the 30th is a... | 1 |
| 6964 | 9989 | tsunami | but i love kaylen ?? | I hope this tsunami clears b4 i have to walk o... | 1 |
| 836 | 1214 | blizzard | The ?? below ??? | @BubblyCuteOne ?????????? ok ok okayyyyyy Ima ... | 1 |
| 169 | 244 | airplane%20accident | nyc | The shooting or the airplane accident https:/... | 1 |
| 5840 | 8346 | ruin | | I ruin everything ???? | 0 |
| 6630 | 9495 | terrorist | peshawar pakistan | @OfficialMqm you are terrorist | 0 |
| 3248 | 4669 | engulfed | Bahrain | He came to a land which was engulfed in tribal... | 1 |
| 7600 | 10855 | | | Evacuation order lifted for town of Roosevelt:... | 1 |
| 6391 | 9134 | suicide%20bomb | Nigeria | #GRupdates Pic of 16yr old PKK suicide bomber ... | 1 |


In [4]:
train_df.shape

(7613, 5)

In [16]:
# Not a disaster
import random
train_df[train_df["target"] == 0]["text"].values[random.randint(0, len(train_df))]

"@JustineJayyy OHGOD XD I didn't mean it so =P But you have that fire truck in the back of you to make up for it so you good xD"

In [17]:
# A Disaster
train_df[train_df["target"] == 1]["text"].values[random.randint(0, len(train_df))]

"Wreckage 'Conclusively Confirmed' as From MH370: Malaysia PM: Investigators and the families of those who were... http://t.co/MSsq0sVnBM"

*Find the issue in the above code*
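
One possible answer to the exercise, sketched with a plain list standing in for the DataFrame (the `tweets` data below is made up for illustration). The random index is drawn from the range `0..len(train_df)`, but it is used to index the filtered array, which holds only one class and is therefore shorter. In addition, `random.randint` includes its upper bound. Either way the lookup can raise an `IndexError`. Picking directly from the filtered values avoids both problems.

```python
import random

# Made-up stand-in for train_df: (text, target) pairs.
tweets = [
    ("Forest fire near the town", 1),
    ("I love this song", 0),
    ("Flood warning issued for the valley", 1),
    ("What a great day", 0),
]

# Filter first, then sample within the bounds of the filtered list itself.
not_disaster = [text for text, target in tweets if target == 0]
sample = random.choice(not_disaster)
print(sample)
```

With pandas, the same idea would be to call `.sample(1)` on the filtered `text` column instead of building an index by hand.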

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether they're about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.

In [20]:
train_df["text"][0:5]

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
Name: text, dtype: object

In [18]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [22]:
example_train_vectors.todense().shape

(5, 54)

In [25]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[3].todense().shape)
print(example_train_vectors[3].todense())

(1, 54)
[[1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1
  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1]]


The above tells us that:

- There are 54 unique words (or "tokens") in the first five tweets.
- The fourth tweet (index 3, the one printed above) contains only some of those unique tokens: each non-zero count marks a token that DOES occur in that tweet.

Now let's create vectors for all of our tweets.

In [26]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

In [28]:
print(train_vectors.todense().shape)

(7613, 21637)


In [29]:
train_vectors[0]

<1x21637 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [31]:

print(train_vectors[0].todense())

[[0 0 0 ... 0 0 0]]


In [40]:
x = train_vectors[0].todense()
x[x != 0]

matrix([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [35]:
tfidf_vectorizer = feature_extraction.text.TfidfVectorizer()

train_vectors_tfidf = tfidf_vectorizer.fit_transform(train_df["text"])

In [36]:
print(train_vectors_tfidf[0].todense().shape)

(1, 21637)


In [37]:
train_vectors_tfidf[0]

<1x21637 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [38]:
train_vectors_tfidf[0].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.]])

In [39]:
x = train_vectors_tfidf[0].todense()
x[x != 0]

matrix([[0.20827049, 0.36169348, 0.1893734 , 0.4187282 , 0.29044218,
         0.4187282 , 0.25921234, 0.11990035, 0.2530913 , 0.32654636,
         0.10169057, 0.18001477, 0.24477443]])

## Building a Classification Model

In [41]:
# The words contained in each tweet are a good indicator of whether they're about a real disaster or not.
# The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

# Assuming a linear relationship, let's build a linear model
clf = linear_model.RidgeClassifier()


In [44]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=5, scoring="f1")
scores

array([0.6025641 , 0.50168919, 0.56940063, 0.50781969, 0.67275495])

In [45]:
scores_tfidf = model_selection.cross_val_score(clf, train_vectors_tfidf, train_df["target"], cv=5, scoring="f1")
scores_tfidf

array([0.62962963, 0.55507372, 0.64457332, 0.59444444, 0.72337043])
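
The two arrays are easier to compare through their means. The sketch below simply copies the fold scores printed above into plain lists; the variable names are introduced here for illustration.

```python
from statistics import mean

# F1 scores per fold, copied from the two cross-validation outputs above.
count_scores = [0.6025641, 0.50168919, 0.56940063, 0.50781969, 0.67275495]
tfidf_scores = [0.62962963, 0.55507372, 0.64457332, 0.59444444, 0.72337043]

mean_count = mean(count_scores)
mean_tfidf = mean(tfidf_scores)

print(f"CountVectorizer mean F1: {mean_count:.3f}")
print(f"TfidfVectorizer mean F1: {mean_tfidf:.3f}")
```

On these five folds, the TF-IDF features give the higher average F1 with the same classifier.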

## Pre-processing

In [46]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /Users/drs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/drs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/drs/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/drs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [47]:
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [48]:
len(stop_words)

179

In [49]:
lemmatizer = WordNetLemmatizer()

In [53]:
lemmatizer.lemmatize("careless")

'careless'

In [54]:
def data_preprocessing(tweet):
    """Clean, tokenize, and lemmatize a single tweet; return it as one string."""

    # Cleansing
    tweet = re.sub(r'<.*?>', '', tweet)  # remove html tags
    tweet = re.sub(r'[^A-Za-z0-9]+', ' ', tweet)  # replace runs of non-alphanumeric characters with a space

    # lowercase
    tweet = tweet.lower()

    # tokenization
    tokens = word_tokenize(tweet)

    # stop words removal
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization (the default part of speech is noun, so e.g. 'us' becomes 'u')
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # join words back
    return ' '.join(tokens)

In [55]:
train_df[train_df["target"] == 1]["text"].values[0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [56]:
data_preprocessing(train_df[train_df["target"] == 1]["text"].values[0])

'deed reason earthquake may allah forgive u'

In [57]:
train_df['processed_text'] = train_df['text'].apply(data_preprocessing)

In [58]:
train_df.sample(5)

| | id | keyword | location | text | target | processed_text |
|---|---|---|---|---|---|---|
| 3848 | 5476 | flames | | I'll cry until my pity party's in flames ???? | 0 | cry pity party flame |
| 3647 | 5196 | fatalities | | 'Among other main factors behind pedestrian fa... | 1 | among main factor behind pedestrian fatality p... |
| 3995 | 5674 | floods | | Have you ever remembered an old song something... | 0 | ever remembered old song something heared year... |
| 1233 | 1774 | buildings%20on%20fire | New Hampshire | Video: Fire burns two apartment buildings and... | 1 | video fire burn two apartment building blow ca... |
| 4117 | 5850 | hailstorm | Washington State | We The Free Hailstorm Maxi http://t.co/ERWs6IELdG | 1 | free hailstorm maxi http co erws6ieldg |


In [59]:
train_vectors_tfidf_processed = tfidf_vectorizer.fit_transform(train_df["processed_text"])

In [60]:
print(train_vectors_tfidf_processed[0].todense().shape)

(1, 20153)


In [62]:
scores = model_selection.cross_val_score(clf, train_vectors_tfidf_processed, train_df["target"], cv=5, scoring="f1")
scores

array([0.58666667, 0.52413793, 0.59369818, 0.5418251 , 0.72143975])

In [63]:
# Using Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [64]:
clf_nb = MultinomialNB()

In [66]:
scores = model_selection.cross_val_score(clf_nb, train_vectors_tfidf_processed, train_df["target"], cv=5, scoring="f1")
scores

array([0.5891182 , 0.60367893, 0.64479081, 0.61056401, 0.75154321])

In [67]:
clf_nb.fit(train_vectors_tfidf_processed, train_df["target"])

MultinomialNB()

In [68]:
y_pred = clf_nb.predict(train_vectors_tfidf_processed)
print('Train Accuracy:', accuracy_score(train_df["target"], y_pred))

Train Accuracy: 0.9018783659529752


In [69]:
from sklearn.linear_model import LogisticRegression

In [70]:
clf_lr = LogisticRegression()
scores = model_selection.cross_val_score(clf_lr, train_vectors_tfidf_processed, train_df["target"], cv=3, scoring="f1")
scores

array([0.58553792, 0.54888508, 0.62795941])