# **Introduction to Natural Language Processing**

**Today's Agenda**
* What is NLP?
* Applications of NLP
* NLP around us
* Essentials
    * Corpus
    * Tokens
    * Sentences and words
    * Stopwords
    * Stemming
    * Lemmatizing
* Machine Learning in NLP
    * What is ML?
    * What are features?
    * Sentences to Vectors
        * BOW
        * TF-IDF
        * Deep Learning based ideas
* Project

# Essential modules

In [None]:
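import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead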

# **Extracting Sentences & Words from Sentences**

https://www.nltk.org/api/nltk.tokenize.html

In [None]:
text = "The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook. NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. There are 32 universities in the US and 25 countries using NLTK in their courses. NLTK supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities"

In [None]:
text

In [None]:
sentences = nltk.sent_tokenize(text)

In [None]:
sentences

In [None]:
words = nltk.word_tokenize(text)

In [None]:
words
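
To see roughly how a word tokenizer differs from `str.split()`, here is a minimal sketch based on a regular expression. It is only an approximation for illustration, not the algorithm `nltk.word_tokenize` uses; the function name `simple_word_tokenize` and the sample string are assumptions.

```python
import re

def simple_word_tokenize(s):
    """Split a string into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", s)

sample = "NLTK, the toolkit, is written in Python."
split_tokens = sample.split()
regex_tokens = simple_word_tokenize(sample)

print(split_tokens)   # punctuation stays glued to words, e.g. 'NLTK,'
print(regex_tokens)   # punctuation becomes its own token
```

This matters later on: a token such as `'NLTK,'` will never match an entry in a stop word list or a dictionary.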

# **Removing Stop Words**

Stop words are a set of commonly used words in a language. Examples in English are “a”, “the”, “is” and “are”. In text mining and Natural Language Processing (NLP), stop words are usually removed because they occur so often that they carry very little useful information.

For Further Reading - https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/#.YQJDdI4zY2w

https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words

In [None]:
for i in range(len(sentences)):
    print('Original --> ', [word for word in sentences[i].split()])
    print('New -->      ', [word for word in sentences[i].split() if word.lower() not in stop_words])
    sentences[i] = ' '.join([word for word in sentences[i].split() if word.lower() not in stop_words])

In [None]:
sentences

## Impact of Removing Stop Words


In [None]:
onebillionspecial = """Rickrolling, alternatively Rick-rolling or Rickroll, is a prank and an Internet meme involving an unexpected appearance of the music video for the 1987 Rick Astley song "Never Gonna Give You Up". The meme is a type of bait and switch using a disguised hyperlink that leads to the music video. When victims click on a seemingly unrelated link, the site with the music video loads instead of what was expected, and in doing so they are said to have been 'Rickrolled'. The meme has also extended to using the song's lyrics in unexpected places. The meme grew out of a similar bait-and-switch trick called "duckrolling" that was popular on the 4chan website in 2006. The video bait-and-switch trick grew popular on 4chan by the 2007 April Fools' Day, and spread to other Internet sites later that year. The meme gained mainstream attention in 2008 through several publicized events, particularly when YouTube used it on its 2008 April Fools' Day event. Initially, Astley, who had only recently returned to performing after a ten-year hiatus, was hesitant about using his newfound popularity from the meme to further his career, but accepted the fame when he Rickrolled the 2008 Macy's Thanksgiving Day Parade with a surprise performance of the song. Since then, Astley has seen his performance career revitalized by the meme's popularity. Astley himself has also been Rickrolled several times."""

In [None]:
l1 = len(onebillionspecial)
l1

In [None]:
sentences_rick = nltk.sent_tokenize(onebillionspecial)

In [None]:
for i in range(len(sentences_rick)):
  print('Original --> ', [word for word in sentences_rick[i].split()])
  print('New -->      ', [word for word in sentences_rick[i].split() if word.lower() not in stop_words])
  sentences_rick[i] = ' '.join([word for word in sentences_rick[i].split() if word.lower() not in stop_words])

In [None]:
l2 = 0
for i in sentences_rick:
  l2+=len(i)
l2

In [None]:
l1-l2

In [None]:
f"Roughly {(l1 - l2) * 100 / l1:.1f}% of the characters belong to stop words. They add little context, but a model would still process them if we did not remove them."
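
The character counts above also lose the spaces between sentences, so the percentage is only approximate. A token-level count is a little easier to interpret. The sketch below uses a small hand-picked stop list (`tiny_stop_words`, an assumption made so the example runs without downloads) rather than NLTK's full list, and the helper name `split_stop_words` is made up.

```python
import string

# Hypothetical, hand-picked stop list; NLTK's English list is much longer.
tiny_stop_words = {"a", "an", "and", "the", "is", "of", "to", "that", "for", "or"}

def split_stop_words(s, stop_words):
    """Return (kept, removed) token lists after stripping surrounding punctuation."""
    kept, removed = [], []
    for raw in s.split():
        token = raw.strip(string.punctuation)
        if not token:
            continue
        if token.lower() in stop_words:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed

sample = "The meme is a type of bait and switch that leads to the music video."
kept, removed = split_stop_words(sample, tiny_stop_words)
share_removed = len(removed) / (len(kept) + len(removed))

print(kept)
print(f"{share_removed:.0%} of tokens were stop words")
```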

# **Stemming**

"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."

A stem (root) is the part of a word to which inflectional or derivational affixes such as -ed, -ize, -s, de- and mis- are added. Stems are created by removing the suffixes or prefixes attached to a word, so stemming a word or sentence may produce tokens that are not actual words.

Source - https://www.datacamp.com/community/tutorials/stemming-lemmatization-python


In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
for i in range(len(sentences)):
  sentences[i] = ' '.join([stemmer.stem(word) for word in nltk.word_tokenize(sentences[i]) if word.lower() not in stop_words])

In [None]:
sentences

PorterStemmer uses suffix stripping to produce stems. For "cats", it simply removes the plural suffix 's' to give "cat". But 'trouble', 'troubling' and 'troubled' are all stemmed to 'troubl', because **the Porter algorithm does not follow linguistic rules; instead it applies a fixed set of suffix-rewriting rules in five sequential phases to generate stems**. This is why PorterStemmer often generates stems that are not actual English words. It does not keep a lookup table of real stems; it only uses its rules to decide whether it is safe to strip a suffix.
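
To see what the Porter stemmer actually returns for these words, the short check below can be run on its own. `PorterStemmer` needs no extra NLTK downloads; the word list is just an illustrative choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
examples = ["cats", "trouble", "troubling", "troubled"]
stems = {word: stemmer.stem(word) for word in examples}

for word, stem in stems.items():
    print(f"{word:10} -> {stem}")
# 'cats' becomes 'cat', while the three 'trouble' forms share the stem 'troubl',
# which is not an English word.
```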

# **Lemmatization**

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
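
The difference from stemming can be illustrated without WordNet. The sketch below contrasts a naive suffix stripper with a dictionary lookup; both `naive_stem` and the three-entry `lemma_table` are hypothetical stand-ins, far simpler than `PorterStemmer` or `WordNetLemmatizer`.

```python
# Hypothetical lookup table standing in for a real dictionary such as WordNet.
lemma_table = {"runs": "run", "running": "run", "ran": "run"}

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a rule-based stemmer."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return lemma_table.get(word, word)

for word in ["runs", "running", "ran"]:
    print(word, "-> stem:", naive_stem(word), "| lemma:", lookup_lemma(word))
```

Suffix rules can handle "runs" and "running", but the irregular form "ran" only maps to "run" when a dictionary is consulted.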

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

In [None]:
sentences = nltk.sent_tokenize(text)
sentences

In [None]:
for i in range(len(sentences)):
  # lemmatize() treats every word as a noun unless a part-of-speech tag is passed
  sentences[i] = ' '.join([lemmatizer.lemmatize(word) for word in sentences[i].split() if word.lower() not in stop_words])

In [None]:
sentences

In [None]:
lemmatizer.lemmatize('demonstration', 'v')

# **Bag of Words (BoW)**

In [None]:
import re
sentences = nltk.sent_tokenize(text)
corpus = []
for sentence in sentences:
    # Keep letters only, lowercase, and split into words
    words_in_sentence = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    # Drop stop words, then stem what is left
    stemmed = [stemmer.stem(word) for word in words_in_sentence if word not in stop_words]
    corpus.append(' '.join(stemmed))

In [None]:
corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()

In [None]:
X

In [None]:
for x in X:
    print(*x)

In [None]:
df1 = pd.DataFrame(X)
df1
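
What `CountVectorizer` computes can be reproduced by hand on a toy corpus. This is a minimal sketch in plain Python; `toy_corpus` and `bag_of_words` are made-up names, and the real vectorizer also handles tokenization, lowercasing and sparse storage.

```python
from collections import Counter

toy_corpus = ["nltk support token", "nltk support stem", "stem stem word"]

def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # One column per vocabulary word, holding its count in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(toy_corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary word, which is why the DataFrame above has as many columns as there are distinct stems in the corpus.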

# **TF-IDF**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
Y = cv.fit_transform(corpus).toarray()

In [None]:
Y

In [None]:
for y in Y:
    print(*y)

In [None]:
df = pd.DataFrame(Y)

In [None]:
df
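
The weights above follow scikit-learn's default TF-IDF variant: raw term counts multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with each row then scaled to unit Euclidean length. Below is a sketch of that calculation in plain Python on a toy corpus; `toy_corpus` and `tfidf_rows` are illustrative names, not part of any library.

```python
import math
from collections import Counter

toy_corpus = ["nltk support token", "nltk support stem", "stem stem word"]

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocabulary, idf, rows

vocabulary, idf, rows = tfidf_rows(toy_corpus)
for word in vocabulary:
    print(f"{word:8} idf={idf[word]:.3f}")
```

Words that occur in fewer documents ("token", "word") receive a higher idf than words shared across documents ("nltk", "stem").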

# Spam or Ham?

In [None]:
df_spam_ham = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')  # the file is not valid UTF-8

In [None]:
df_spam_ham

In [None]:
df_spam_ham = df_spam_ham.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'], axis = 1)

In [None]:
df_spam_ham

# EDA on Data

In [None]:
import seaborn as sns
sns.set_style("whitegrid")
ax = sns.countplot(x="v1",data=df_spam_ham)

In [None]:
content = df_spam_ham[['v2']].to_numpy()

In [None]:
content

In [None]:
sizes = []
for content_val in content:
    sizes.append(len(content_val[0]))

In [None]:
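import matplotlib.pyplot as plt
# Use plt.figure directly so the name `figure` is not overwritten by the histogram output
plt.figure(figsize=(16, 10), dpi=80)
plt.hist(sizes, bins=40)
plt.xlabel('Message length (characters)')
plt.ylabel('Number of messages')
plt.show()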

In [None]:
import re
english_stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for message in df_spam_ham['v2']:
    sms = re.sub('[^a-zA-Z]', ' ', message)
    sms = sms.lower()
    sms = sms.split()
    sms = [stemmer.stem(word) for word in sms if word not in english_stop_words]
    sms = ' '.join(sms)
    corpus.append(sms)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 3000)
feature_vectors = cv.fit_transform(corpus).toarray()

In [None]:
target_vectors = pd.get_dummies(df_spam_ham['v1']).iloc[:,1].values
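
`pd.get_dummies(...).iloc[:, 1]` relies on the dummy columns being ordered `ham`, `spam`. An explicit comparison states the intent more directly. The sketch below uses a tiny made-up frame (`toy`) with the same `v1` column name to show that both give the same labels.

```python
import pandas as pd

# Tiny illustrative frame; the real labels live in df_spam_ham['v1'].
toy = pd.DataFrame({"v1": ["ham", "spam", "ham", "spam", "ham"]})

# 1 for spam, 0 for ham, independent of column ordering
labels_explicit = (toy["v1"] == "spam").astype(int).to_numpy()

# The get_dummies approach, converted to int for comparison
labels_dummies = pd.get_dummies(toy["v1"]).iloc[:, 1].astype(int).to_numpy()

print(labels_explicit)
print((labels_explicit == labels_dummies).all())
```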

In [None]:
from sklearn.model_selection import train_test_split
feature_vectors_train, feature_vectors_test, target_vectors_train, target_vectors_test = train_test_split(feature_vectors, target_vectors, test_size = 0.20, random_state = 0)

In [None]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(feature_vectors_train, target_vectors_train)

predictions = spam_detect_model.predict(feature_vectors_test)

In [None]:
predictions
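
One caveat with the cells above: the vectorizer is fitted on the whole corpus before the split, so the vocabulary is partly learned from test messages. A `Pipeline` keeps the vectorizer inside the model so that it is fitted on training data only. The following is a sketch on a made-up four-message training set, not the SMS data; all variable names in it are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training messages: 1 = spam, 0 = ham
train_texts = ["win free prize now", "free cash claim now",
               "are we meeting today", "see you at lunch"]
train_labels = [1, 1, 0, 0]

spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),   # vocabulary learned from training texts only
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(train_texts, train_labels)

toy_predictions = spam_pipeline.predict(["free prize", "lunch today"])
print(toy_predictions)
```

With the real data, the same idea would mean splitting the raw `corpus` strings and labels first, then fitting the pipeline on the training portion.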

![](https://miro.medium.com/max/1400/1*VSchph99Wiv6tQpNIvMJbw.png)

In [None]:
import sklearn.metrics
from sklearn.metrics import confusion_matrix
df_cm_cv = pd.DataFrame(sklearn.metrics.confusion_matrix(target_vectors_test, predictions))
sns.set(font_scale=1.4)
sns.heatmap(df_cm_cv, annot=True, annot_kws={"size": 16})

In [None]:
from sklearn.metrics import f1_score
sklearn.metrics.f1_score(target_vectors_test, predictions)
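
The F1 score is the harmonic mean of precision and recall, both of which can be read off the confusion matrix. Here is a small sketch in plain Python with made-up labels (`y_true`, `y_pred`), assuming 1 marks spam:

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam predicted as spam
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham predicted as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam predicted as ham
tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham predicted as ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Same layout as sklearn's confusion_matrix: rows are true labels, columns are predictions
print([[tn, fp], [fn, tp]])
print(precision, recall, f1)
```

Because spam is the minority class in this dataset, F1 on the spam label is more informative than plain accuracy.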

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv_unigram_bigram = CountVectorizer(max_features=3000, ngram_range=(1,2))
feature_vectors_unigram_bigram = cv_unigram_bigram.fit_transform(corpus).toarray()
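
`ngram_range=(1, 2)` asks the vectorizer to count single words and pairs of adjacent words. Here is a sketch of the feature generation in plain Python; `word_ngrams` is a hypothetical helper and the sample message is made up.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("free entri win prize".split())
print(features)
```

Bigrams such as "win prize" keep a little word order that single-word counts throw away, at the cost of a much larger candidate vocabulary.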

In [None]:
target_vectors_unigram_bigram=pd.get_dummies(df_spam_ham['v1']).iloc[:,1].values

In [None]:
from sklearn.model_selection import train_test_split
feature_vectors_train_ub, feature_vectors_test_ub,target_vectors_train_ub, target_vectors_test_ub = train_test_split(feature_vectors_unigram_bigram, target_vectors_unigram_bigram, test_size = 0.20, random_state = 0)

In [None]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(feature_vectors_train_ub, target_vectors_train_ub)

predictions_ub=spam_detect_model.predict(feature_vectors_test_ub)

In [None]:
import sklearn.metrics
from sklearn.metrics import confusion_matrix
df_cm_ub = pd.DataFrame(sklearn.metrics.confusion_matrix(target_vectors_test_ub, predictions_ub))
sns.set(font_scale=1.4)
sns.heatmap(df_cm_ub, annot=True, annot_kws={"size": 16})

In [None]:
from sklearn.metrics import f1_score
sklearn.metrics.f1_score(target_vectors_test_ub, predictions_ub)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=2500)
feature_vectors_tfidf = cv.fit_transform(corpus).toarray()

In [None]:
target_vectors_tfidf = pd.get_dummies(df_spam_ham['v1']).iloc[:,1].values

In [None]:
from sklearn.model_selection import train_test_split
feature_vectors_train_tfidf, feature_vectors_test_tfidf, target_vectors_train_tfidf, target_vectors_test_tfidf = train_test_split(feature_vectors_tfidf, target_vectors_tfidf, test_size = 0.20, random_state = 0)

In [None]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(feature_vectors_train_tfidf, target_vectors_train_tfidf)

predictions_tfidf = spam_detect_model.predict(feature_vectors_test_tfidf)

In [None]:
predictions_tfidf

In [None]:
import sklearn.metrics
from sklearn.metrics import confusion_matrix
df_cm_tfidf = pd.DataFrame(sklearn.metrics.confusion_matrix(target_vectors_test_tfidf, predictions_tfidf))
sns.set(font_scale=1.4)
sns.heatmap(df_cm_tfidf, annot=True, annot_kws={"size": 16})

In [None]:
from sklearn.metrics import f1_score
sklearn.metrics.f1_score(target_vectors_test_tfidf, predictions_tfidf)

Datasets to try after this session:
* https://www.kaggle.com/snap/amazon-fine-food-reviews
* https://www.kaggle.com/crowdflower/twitter-airline-sentiment

Resources for NLP:
* [NLP Book](https://web.stanford.edu/~jurafsky/slp3/)
* [NLTK](https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL)
* [Stanford NLP Course](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)