# Natural Language Processing

Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interaction between computers and human (natural) languages. In practice, NLP is used to apply machine learning models to text and language.

Examples:
- Dictating something into your iPhone, which is then converted to text
- Sentiment analysis: identifying the mood or subjective opinions within large amounts of text
- Chat bots

Main NLP libraries:
- NLTK
- spaCy
- Stanford NLP
- Apache OpenNLP

### NLP - Bag of Words

Bag of words is a very popular NLP model. It is used to turn texts into numeric features before a classification algorithm is fitted on the observations that contain those texts.

It involves two things:
1. A vocabulary of known words
2. A measure of the presence of known words
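
As a rough illustration of those two ingredients, here is a minimal pure-Python sketch. The three toy documents and the helper `vectorize` are made up for this example; they are not part of the restaurant dataset or of any library.

```python
from collections import Counter

# Toy documents, already cleaned (hypothetical, for illustration only)
docs = ["wow love place", "crust good", "love crust"]

# 1. A vocabulary of known words: every distinct word, in a fixed order
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2. A measure of the presence of known words: here, a simple count
def vectorize(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [vectorize(doc) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document becomes a row of numbers with one position per vocabulary word, which is the form a classifier can work with.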

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting = 3)
dataset.head()

|   | Review | Liked |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


Here we will try to predict whether a restaurant review is positive (1) or negative (0).

In [3]:
# cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

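# build the stop word set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
for i in range(len(dataset)):
    # keep letters only, then lowercase and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # stemming reduces a word to its root, e.g. loved/loves -> love
    # note: 'not' is a stop word, so negations are dropped at this step
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)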

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/timothyflack/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
corpus[0:10]

['wow love place',
 'crust good',
 'tasti textur nasti',
 'stop late may bank holiday rick steve recommend love',
 'select menu great price',
 'get angri want damn pho',
 'honeslti tast fresh',
 'potato like rubber could tell made ahead time kept warmer',
 'fri great',
 'great touch']

In [5]:
# create a bag of words model
# builds a matrix with one row per review and one column per word;
# each entry counts how often that word occurs in that review (0 = absent)
# splitting text into individual words (tokens) is called tokenisation

from sklearn.feature_extraction.text import CountVectorizer

# much of the cleaning loop in the previous cell could instead be done
# with options of the CountVectorizer class
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
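
The comment above mentions that `CountVectorizer` options can take over part of the cleaning. The following is a sketch of that idea on two raw reviews, assuming scikit-learn's built-in English stop word list is acceptable (it differs from the NLTK list) and that stemming can be skipped (it would need a custom tokenizer). The names `raw_reviews`, `cv_raw` and `X_raw` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two raw, uncleaned reviews taken from the dataset preview
raw_reviews = ["Wow... Loved this place.", "Crust is not good."]

cv_raw = CountVectorizer(
    lowercase=True,              # same role as review.lower()
    stop_words="english",        # scikit-learn's own list, not NLTK's
    token_pattern=r"[a-zA-Z]+",  # keep runs of letters only
    max_features=1500,
)
X_raw = cv_raw.fit_transform(raw_reviews).toarray()

# Words that survived lowercasing and stop word removal (no stemming)
vocab = sorted(cv_raw.vocabulary_)
print(vocab)
print(X_raw)
```

Because no stemming happens here, "loved" stays "loved" rather than becoming "love", so the vocabulary will not match the one built from `corpus`.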

In [6]:
X.shape

(1000, 1500)

In [7]:
X[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

As you can see, X is very sparse: almost all of its entries are zero.
If needed, the number of columns can be cut down with a dimensionality reduction method.
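
One possible way to do this is truncated SVD, sketched below on a randomly generated toy matrix rather than on the real `X`. The matrix `X_toy`, the 5% fill rate and the choice of 10 components are arbitrary assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for a bag of words matrix: 100 documents, 50 words,
# with roughly 5% of the entries non-zero (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.binomial(1, 0.05, size=(100, 50)).astype(float)

# Fraction of entries that are exactly zero
sparsity = float(np.mean(X_toy == 0))
print(f"sparsity: {sparsity:.2f}")

# Project the 50 word columns down to 10 dense components
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_toy)
print(X_reduced.shape)
```

The reduced matrix is dense and its columns no longer correspond to individual words, so this trades interpretability for a smaller feature space.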

In [8]:
# include dependent variable 
y = dataset['Liked'].values
y.shape

(1000,)

Now we can fit a classification model. The models most commonly used for this are:
- Naive Bayes
- Decision Tree
- Random Forest

We'll use Naive Bayes for this example.

In [10]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [11]:
# Fitting Naive Bayes to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [12]:
# Predicting the test set results
y_pred = classifier.predict(X_test)
# Making the confusion matrix to evaluate performance
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[55, 42],
       [12, 91]])
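
To read the confusion matrix, recall that scikit-learn puts the true classes in rows and the predicted classes in columns, with label 0 first. The sketch below copies the four numbers from the output above by hand and derives the usual summary metrics; the variable names are chosen for this example.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = [[55, 42],
      [12, 91]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all reviews classified correctly
precision = tp / (tp + fp)        # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)           # of truly positive reviews, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

The model is correct on 146 of the 200 test reviews (73%). Most of its errors are the 42 negative reviews predicted as positive, which is why precision is noticeably lower than recall.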