# Topic 39: Natural Language Processing

### Agenda today:
- Text analytics and NLP
- Pre-Processing for NLP 
    - Tokenization
    - Stopwords removal
    - Lexicon normalization: lemmatization and stemming
- Feature Engineering for NLP
    - Bag-of-Words
    - Term frequency-Inverse Document Frequency (tf-idf)
- Text Classification: Satire Detection

## Part I. Text Analytics and NLP
NLP allows computers to interact with text data in a structured and sensible way. In other words, with NLP, computers are taught to process human language, including its meaning and sentiment. Some applications of natural language processing are:
- Chatbots 
- Classifying documents 
- Speech recognition and audio processing 

In this section, we will introduce you to the preprocessing steps, feature engineering, and other steps you need to take in order to format text data for machine learning tasks. 

#### Overview of NLP process 
<img src="resources/nlpprocess.png" style="width:500px;">

## Part II. Pre-Processing for NLP

In [None]:
import nltk
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.metrics import confusion_matrix
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from matplotlib import cm
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
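# plot_confusion_matrix was removed in scikit-learn 1.2; fall back to its replacement
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator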

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

### Tokenization 
Tokenization is the process of splitting documents into units of observation called tokens. We usually represent the tokens as __n-grams__, where n is the number of consecutive words in each token. For example, the sentence "David works here" can be tokenized into unigrams (one-word tokens) or bigrams (two-word tokens):

- Unigrams: "David", "works", "here"
- Bigrams: "David works", "works here"

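The n-gram idea can be sketched in a few lines of plain Python. The helper name `ngrams` is made up for this illustration, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplifying assumption for this example
tokens = "David works here".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

A sentence with fewer than n words simply yields no n-grams.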
In [None]:
review = 'From the beginning of the movie, it gives the feeling the director is trying to portray something, what I mean to say that instead of the story dictating the style in which the movie should be made, he has gone in the opposite way, he had a type of move that he wanted to make, and wrote a story to suite it. And he has failed in it very badly. I guess he was trying to make a stylish movie. Any way I think this movie is a total waste of time and effort. In the credit of the director, he knows the media that he is working with, what I am trying to say is I have seen worst movies than this. Here at least the director knows to maintain the continuity in the movie. And the actors also have given a decent performance.'

In [None]:
review

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

tokenized_review = tokenizer.tokenize(review)
print(tokenized_review)

The RegexpTokenizer is a tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
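
To see the difference between a pattern that matches tokens and one that matches separators, here is a small sketch using Python's standard `re` module as a stand-in. It is not NLTK's implementation, and the sample string is made up. In NLTK, the `gaps=True` argument of `RegexpTokenizer` switches the pattern from describing tokens to describing the gaps between them.

```python
import re

sample = "It's a movie, isn't it?"  # made-up example string

# Pattern describes the tokens themselves (same pattern as the cell above)
tokens_by_match = re.findall(r"[a-zA-Z0-9]+", sample)

# Pattern describes the separators between tokens (here: runs of whitespace)
tokens_by_split = [t for t in re.split(r"\s+", sample) if t]

print(tokens_by_match)
print(tokens_by_split)
```

Note how the token pattern drops punctuation and splits contractions into pieces, while splitting on whitespace keeps punctuation attached to the words.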

In [None]:
len(tokenized_review)

In [None]:
fdist = FreqDist(tokenized_review)
print(fdist)

In [None]:
plt.figure(figsize=(10,10))
fdist.plot(30)

Are the words very informative? Can we extract useful information based on this frequency distribution of the most common words? 

### Stopwords Removal

In [None]:
stop_words=set(stopwords.words("english"))
print(stop_words)

In [None]:
filtered_review=[]
for w in tokenized_review:
    if w.lower() not in stop_words:
        filtered_review.append(w.lower())
print("Filtered Sentence:",filtered_review)

In [None]:
print(len(tokenized_review))
print(len(filtered_review))

In [None]:
fdist = FreqDist(filtered_review)
plt.figure(figsize=(10,10))
fdist.plot(30)

Now we have removed the stopwords, common words that carry **little semantic information** on their own.
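
The effect of stopword removal on a frequency count can be sketched with the standard library alone. The tiny stopword set and the sentence below are made up for illustration; NLTK's English list is much longer.

```python
from collections import Counter

# Hypothetical mini stopword list; NLTK's English list is far longer
mini_stop_words = {"the", "is", "a", "of", "and", "this"}

tokens = "the movie is a waste of time and the director knows this".split()

# Frequency counts before and after filtering out stopwords
before = Counter(tokens)
after = Counter(t for t in tokens if t.lower() not in mini_stop_words)

print(before.most_common(3))
print(after.most_common(3))
```

Before filtering, the most frequent token is a stopword; afterwards only content words remain in the counts.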

### Lexicon Normalization 
Aside from stopwords, a different type of noise can arise in NLP. For example, collect, collection, collected, and collecting are all closely related words. Stemming and lemmatization reduce such variations of a word to a single root form. 

#### Stemming 
Stemming allows us to collapse different variations of the same word. For example, collect, collection and collecting will all be reduced to the same single word collect.
- Stemming is the process of reducing inflected words to a common root form (the stem), even if the stem itself is not a valid word in the language.

- Stems are created by removing the suffixes or prefixes used with a word.
<img src="attachment:Screen%20Shot%202019-08-13%20at%2010.45.02%20AM.png" width=400;>
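
To make the suffix-stripping idea concrete, here is a deliberately tiny sketch. The rule list and the `toy_stem` helper are hypothetical and far cruder than the Porter algorithm used in this notebook; the point is only to show why a stem need not be a real word.

```python
# Hypothetical suffix rules, checked in order; NOT the Porter algorithm
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ion", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word

for w in ["collect", "collects", "collected", "collecting", "collection", "movies"]:
    print(w, "->", toy_stem(w))
```

All the "collect" variants collapse to one stem, while "movies" becomes "movi", which is not an English word.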

In [None]:
# Stemming
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

stemmed_review=[]
for w in filtered_review:
    stemmed_review.append(ps.stem(w))

len(set(stemmed_review))

In [None]:
fdist = FreqDist(stemmed_review)
fdist.plot(30)

#### Lemmatization
The key difference between lemmatization and stemming is that lemmatization returns real words: it maps each word to its dictionary form rather than chopping off affixes. For example, instead of returning "movi" as the Porter stemmer would, the lemmatizer returns "movie".

- Unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. 

- In lemmatization, the root word is called a lemma. 

- A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

<img src="resources/lemmatization.png" width=400;>
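
Lemmatizers typically consult a dictionary and may need a word's part of speech to choose the right lemma. The lookup table below is a hypothetical, hand-written stand-in for such a dictionary (it is not how WordNet stores its data) and only sketches the idea.

```python
# Hypothetical lemma dictionary keyed by (word, part of speech)
TOY_LEMMAS = {
    ("movies", "n"): "movie",
    ("collections", "n"): "collection",
    ("collecting", "v"): "collect",
    ("collected", "v"): "collect",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos); unknown pairs come back unchanged."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("movies"))
print(toy_lemmatize("collecting"))       # treated as a noun by default
print(toy_lemmatize("collecting", "v"))  # treated as a verb
```

NLTK's `WordNetLemmatizer.lemmatize` likewise takes a `pos` argument that defaults to noun, which is why a word such as "collecting" can come back unchanged unless it is lemmatized as a verb.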

In [None]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 

In [None]:
print("movies:", lemmatizer.lemmatize("movies")) 
print("collecting:", lemmatizer.lemmatize("collecting")) 
print("collection:", lemmatizer.lemmatize("collection")) 
print("collections:", lemmatizer.lemmatize("collections")) 

In [None]:
# comparing it with stemming 
print("movies:", ps.stem("movies")) 
print("collecting :", ps.stem("collecting")) 

In [None]:
# we can also lemmatize our original reviews
lemmatized_review=[]
for w in filtered_review:
    lemmatized_review.append(lemmatizer.lemmatize(w))

len(set(lemmatized_review))

## Part III. Feature Engineering for NLP 
The machine learning algorithms we have encountered so far represent features as variables that take on different values for each observation. For example, we represent individuals with distinct education levels, income, and such. 

This is done differently in NLP. In order to pass text data to machine learning algorithms, we need to represent each text observation numerically. One such method is called **Bag-of-words (BoW)**. 

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

- A vocabulary of known words.
- A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. 

The intuition behind BoW is that two documents are similar if they have similar contents. Bag-of-words data can be represented as a **Document Term Matrix**, in which each row is a document and each column is a unique word (its transpose is called a Term Document Matrix). For example:

- Document 1: "I love dogs"
- Document 2: "I love cats"
- Document 3: "I love all animals"
- Document 4: "I hate dogs"


Can be represented as:
<img src="attachment:Screen%20Shot%202019-03-22%20at%208.16.32%20AM.png" style="width:600px;">
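
Before turning to scikit-learn, the same matrix can be built by hand with the standard library. This is a sketch, not `CountVectorizer`'s implementation; the helper names are made up, and the tokenizer keeps only tokens of two or more characters to mirror `CountVectorizer`'s default token pattern (so "I" is dropped).

```python
import re
from collections import Counter

docs = ["I love dogs", "I love cats", "I love all animals", "I hate dogs"]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters (assumption mirroring
    # CountVectorizer's default token pattern, which drops "I")
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, in sorted order (one column per word)
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

# Document term matrix: one row per document
matrix = [to_vector(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)

# Word order is discarded: both orderings give the same vector
print(to_vector("dogs love cats") == to_vector("cats love dogs"))
```

The last line illustrates the "bag" property: two texts with the same words in a different order are indistinguishable under this representation.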

In [None]:
# implementing it in python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Convert a collection of text documents to a matrix of token counts

docs = ['I love dogs','I love cats','I love all animals', 'I hate dogs']
vec = CountVectorizer()
X = vec.fit_transform(docs)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())
df

In [None]:
vec.get_feature_names_out()

In [None]:
# 0 - animal lovers, 1 - animal haters
y = np.array([0, 0, 0, 1])

In [None]:
nb = MultinomialNB()
nb.fit(df, y)

In [None]:
nb.predict(df)

In [None]:
X

### TF-IDF 
There are many schemes for determining the value of each entry in a document term matrix, and one of the most common is TF-IDF: term frequency-inverse document frequency. Essentially, tf-idf *rescales* the raw counts of the document term matrix so that each entry represents how important a word is to a given document relative to the rest of the corpus. 

- TF (term frequency): the number of times a term shows up in a document. 

- IDF (inverse document frequency): a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\mathrm{idf}(w) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing } w}\right)$$

tf-idf is the product of term frequency and inverse document frequency, or tf * idf. 
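
The definitions above can be turned into a few lines of Python. This sketch uses raw counts for tf and the unsmoothed idf from the formula above, on a made-up four-document corpus. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration)
docs = [
    ["love", "dogs"],
    ["love", "cats"],
    ["love", "all", "animals"],
    ["hate", "dogs"],
]

n_docs = len(docs)

# Document frequency: number of documents containing each word
doc_freq = Counter(word for doc in docs for word in set(doc))

# idf(w) = log(N / df(w)), using the natural logarithm
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def tfidf(doc):
    """tf-idf weight of every word in one tokenized document."""
    tf = Counter(doc)
    return {word: count * idf[word] for word, count in tf.items()}

for doc in docs:
    print({w: round(s, 3) for w, s in tfidf(doc).items()})
```

Here "love" appears in three of the four documents, so its idf is log(4/3), roughly 0.288, while "hate" appears in only one and gets log(4), roughly 1.386: rarer words receive larger weights.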

In [None]:
# let's implement it 
import pandas as pd
review_1 = "6/10 Acting, not great but some good acting.<br /><br />4/10 Director, makes some stupid decisions for this film.<br /><br />2/10 Writer, story makes no sense at all and has huge amount of flaws.<br /><br />4/10 Overall score for this movie.<br /><br />Don't waste your time with this film, it's not worth it. I gave 4 for this movie and it may be too much. Characters are so over exaggerated than they can ever be in real life and some pretty unexplainable stuff happens 'storywise', not in good way. Because of the style this film has been filmed you get bored after 30 minutes (too many special effects: slow motions and camera shakes and fast forwards). It's always good that movie uses music to make the story go smooth but there's too many tracks in this one. In the first hour there is almost 50/50 dialogs and musics"
review_2 = "Devil Hunter gained notoriety for the fact that it's on the DPP 'Video Nasty' list, but it really needn't have been. Many films on the list where there for God (and DPP) only known reasons, and while this isn't the tamest of the bunch; there isn't a lot here that warrants banning...which is a shame because I never would have sat through it where it not for the fact that it's on 'the shopping list'. The plot actually gives the film a decent base - or at least more of a decent base than most cannibal films - and it follows an actress who is kidnapped and dragged off into the Amazon jungle. A hunter is then hired to find her, but along the way he has to brave the natives, lead by a man who calls himself 'The Devil' (hence the title). The film basically just plods along for eighty five minutes and there really aren't many scenes of interest. It's a real shame that Jess Franco ended up making films like this because the man clearly has talent; as seen by films such as The Diabolical Dr Z, Venus in Furs, Faceless and She Kills in Ecstasy, but unfortunately his good films are just gems amongst heaps of crap and Devil Hunter is very much a part of the crap. I saw this film purely because I want to be able to say I've seen everything on the DPP's list (just two more to go!), and I'm guessing that's why most other people who have seen it, saw it. But if you're not on the lookout for Nasties; there really is no reason to bother with this one."
review_3 = "`Stanley and Iris' is a heart warming film about two people who find each other and help one another overcome their problems in life. Stanley's life is difficult, because he never learned to read or write. Iris is a widower with two teenage children working in a bakery where she meets Stanley. She decides to teach Stanley how to read at her home in her spare time. Over time they become romantically involved. After Stanley learns to read, he goes off to a good job in Chicago, only to return to Iris and ask her to marry him.<br /><br />It's a really good film without nudity, violence, or profanity, that which is rare in today's films. A good film all round. <br /><br />"
review_4 = "This may not be a memorable classic, but it is a touching romance with an important theme that stresses the importance of literacy in modern society and the devastating career and life consequences for any unfortunate individual lacking this vital skill.<br /><br />The story revolves around Iris, a widow who becomes acquainted with a fellow employee at her factory job, an illiterate cafeteria worker named Stanley. Iris discovers that Stanley is unable to read, and after he loses his job, she gives him reading lessons at home in her kitchen. Of course, as you might predict, the two, although initially wary of involvement, develop feelings for each other...<br /><br />Jane Fonda competently plays Iris, a woman with problems of her own, coping with a job lacking prospects, two teenage children (one pregnant), an unemployed sister and her abusive husband. However, Robert DeNiro is of course brilliant in his endearing portrayal of the intelligent and resourceful, but illiterate, Stanley, bringing a dignity to the role that commands respect. They aren't your typical charming young yuppie couple, as generally depicted in on screen romances, but an ordinary working class, middle aged pair with pretty down to earth struggles.<br /><br />I won't give the ending away, but it's a lovely, heartwarming romance and a personal look into the troubling issue of adult illiteracy, albeit from the perspective of a fictional character."
df = pd.DataFrame([review_1,review_2,review_3, review_4],columns = ['review'])
df

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

token = RegexpTokenizer(r'[a-zA-Z]+')
cv = CountVectorizer(lowercase=True, stop_words='english', tokenizer=token.tokenize)
text_counts= cv.fit_transform(df['review'])


In [None]:
# what datatype is the output of a CountVectorizer?

pd.DataFrame(text_counts)

In [None]:
df = pd.DataFrame(text_counts.toarray(),columns = cv.get_feature_names_out())
df

In [None]:
# use tfidf vectorizer instead
# `df` was overwritten with the count matrix above, so rebuild it from the
# review strings (review_1 ... review_4) defined earlier
df = pd.DataFrame([review_1,review_2,review_3, review_4],columns = ['review'])
df

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(df['review'])

pd.DataFrame(text_tf.toarray(),columns = tf.get_feature_names_out())

## Part IV. Text Classification: Satire Detection
Now that you have a basic understanding of feature engineering and preprocessing in NLP, we can move on to text classification using Naive Bayes and other classification algorithms. We can treat the engineered dataframes like any other dataframes that you have worked with before. 

Now, how would Naive Bayes treat the classification problem? What is the prior, posterior, and evidence, in the calculation?
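
As one way to think about that question: Bayes' rule says the posterior is proportional to prior times likelihood, where the prior is the share of documents in each class, the likelihood is the probability of the document's words given the class, and the evidence is the normalizing constant shared by all classes. The sketch below works this out by hand for the toy animal-lover data from Part III, assuming a multinomial model with add-one (Laplace) smoothing. The helper names are hypothetical and this is not scikit-learn's code.

```python
import math
from collections import Counter

# Toy training data: 0 = animal lover, 1 = animal hater
docs = [["love", "dogs"], ["love", "cats"], ["love", "all", "animals"], ["hate", "dogs"]]
labels = [0, 0, 0, 1]

vocab = sorted({w for doc in docs for w in doc})
classes = sorted(set(labels))

# Prior: fraction of training documents in each class
prior = {c: labels.count(c) / len(labels) for c in classes}

# Word counts per class, used for the smoothed likelihoods
word_counts = {c: Counter() for c in classes}
for doc, label in zip(docs, labels):
    word_counts[label].update(doc)

def log_likelihood(word, c, alpha=1.0):
    """log P(word | class) with add-alpha smoothing."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def posterior(doc):
    """Normalized P(class | doc) for a tokenized document."""
    joint = {
        c: math.exp(math.log(prior[c]) + sum(log_likelihood(w, c) for w in doc))
        for c in classes
    }
    evidence = sum(joint.values())  # P(doc), the same for every class
    return {c: p / evidence for c, p in joint.items()}

print(prior)
print(posterior(["hate", "dogs"]))
print(posterior(["love", "cats"]))
```

Even though the prior favors class 0 (three of four documents), the word "hate" shifts the posterior for "hate dogs" toward class 1 (about 0.64), while "love cats" stays firmly in class 0 (about 0.90).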

### 1. Preprocessing & Cleaning 

In [None]:
df = pd.read_csv('resources/nlp_classification.csv')

In [None]:
df.shape

In [None]:
df.target.value_counts()

In [None]:
df.head()

In [None]:
df.body[0]

In [None]:
data = df['body']
target = df['target']

In [None]:
processed_data = [d.split() for d in data.to_list()]
print(processed_data[:2])

In [None]:
total_vocab = set()
for comment in processed_data:
    total_vocab.update([c.lower() for c in comment])
len(total_vocab)

In [None]:
# creating a list with all lemmatized outputs
lemmatized_output = []

for listy in processed_data:
    lemmed = ' '.join([lemmatizer.lemmatize(w) for w in listy])
    lemmatized_output.append(lemmed)

In [None]:
lemmatized_output[0]

In [None]:
X_lem = lemmatized_output

y_lem = target

### 2. Corpus Statistics and Exploratory Data Analysis

In [None]:
## setting stopwords and punctuations
import string

sw_list = stopwords.words('english')
sw_list += list(string.punctuation)
sw_list += ["''", '""', '...', '``', '’', '“', '’', '”', '‘', '‘', '©',
            'said', 'one', 'com', 'satirewire', '-', '–', '—', 'satirewire.com']
sw_set = set(sw_list)

#### Most Frequent Words

In [None]:
df_freq_satire = df[df['target']==1]
df_freq_not_satire = df[df['target']==0]

In [None]:
data_sat = df_freq_satire['body']
data_not_sat = df_freq_not_satire['body']

In [None]:
data_sat

In [None]:
pros_satire = [d.split() for d in data_sat.to_list()]
pros_not_satire = [d.split() for d in data_not_sat.to_list()]

In [None]:
total_vocab_sat = set()
for comment in pros_satire:
    total_vocab_sat.update([c.lower() for c in comment])
len(total_vocab_sat)

In [None]:
total_vocab_NOT_sat = set()
for comment in pros_not_satire:
    total_vocab_NOT_sat.update([c.lower() for c in comment])
len(total_vocab_NOT_sat)

In [None]:
# For EDA
# lowercase before the stopword check so capitalized stopwords (e.g. "The") are removed too

flat_satire = [item.lower() for sublist in pros_satire for item in sublist if item.lower() not in sw_set]
flat_not_satire = [item.lower() for sublist in pros_not_satire for item in sublist if item.lower() not in sw_set]

In [None]:
satire_freq = FreqDist(flat_satire)
not_satire_freq = FreqDist(flat_not_satire)

In [None]:
# Top 20 satire words:

satire_freq.most_common(20)

In [None]:
# Top 20 non-satire words:

not_satire_freq.most_common(20)

#### Normalized word frequencies:

In [None]:
satire_total_word_count = sum(satire_freq.values())
satire_top_25 = satire_freq.most_common(25)
print("Word \t\t Normalized Frequency")
print()
for word in satire_top_25:
    normalized_frequency = word[1]/satire_total_word_count
    print("{} \t\t {:.4}".format(word[0], normalized_frequency))

In [None]:
not_satire_total_word_count = sum(not_satire_freq.values())
not_satire_top_25 = not_satire_freq.most_common(25)
print("Word \t\t Normalized Frequency")
print()
for word in not_satire_top_25:
    normalized_frequency = word[1]/not_satire_total_word_count
    print("{} \t\t {:.4}".format(word[0], normalized_frequency))

#### Let's visualize it!

In [None]:
# create counts of satire and not satire with values and words
satire_bar_counts = [x[1] for x in satire_freq.most_common(25)]
satire_bar_words = [x[0] for x in satire_freq.most_common(25)]

not_satire_bar_counts = [x[1] for x in not_satire_freq.most_common(25)]
not_satire_bar_words = [x[0] for x in not_satire_freq.most_common(25)]

In [None]:
# set the color of our bar graphs
color = cm.viridis_r(np.linspace(.4,.8, 30))

In [None]:
new_figure = plt.figure(figsize=(16,4))

ax = new_figure.add_subplot(121)
ax2 = new_figure.add_subplot(122)

# Bar chart of the top satire words on the first axes
ax.bar(satire_bar_words, satire_bar_counts, color=color)
# ax.plot(colormap='PRGn')

# Bar chart of the top non-satire words on the second axes
ax2.bar(not_satire_bar_words, not_satire_bar_counts, color=color )

ax.title.set_text('Satire')
ax2.title.set_text('Not Satire')

for ax in new_figure.axes:
    plt.sca(ax)
    plt.xticks(rotation=60)

plt.tight_layout(pad=0)

# plt.savefig('word count bar graphs.png')

plt.show()

#### Word Clouds

In [None]:
# Getting our data into a dictionary
# FORMAT:  dictionary = dict(zip(keys, values))
# !pip install wordcloud
from wordcloud import WordCloud
satire_dictionary = dict(zip(satire_bar_words, satire_bar_counts))
not_satire_dictionary = dict(zip(not_satire_bar_words, not_satire_bar_counts))

In [None]:
# Create the word cloud:

wordcloud = WordCloud(colormap='Spectral').generate_from_frequencies(satire_dictionary)

# Display the generated image w/ matplotlib:

plt.figure(figsize=(10,10), facecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)

# Uncomment the next line if you want to save your image:
# plt.savefig('satire_wordcloud.png')

plt.show()

In [None]:
wordcloud = WordCloud(colormap='Spectral').generate_from_frequencies(not_satire_dictionary)

plt.figure(figsize=(10,10), facecolor='k')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
# plt.savefig('not_satire_wordcloud.png')

plt.show()

### 3. Let's classify!

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
X_train_lem, X_test_lem, y_train_lem, y_test_lem = train_test_split(X_lem, y_lem, test_size=0.20, random_state=1)

tfidf = TfidfVectorizer(stop_words=sw_set)

tfidf_data_train_lem = tfidf.fit_transform(X_train_lem)
tfidf_data_test_lem = tfidf.transform(X_test_lem)

tfidf_data_train_lem

In [None]:
# these matrices are usually sparse!!

# average number of non-zero entries per row (i.e. per article)
avg_non_zero = tfidf_data_train_lem.nnz / float(tfidf_data_train_lem.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(avg_non_zero))

# share of entries in the matrix that are zero
fraction_sparse = 1 - (avg_non_zero / float(tfidf_data_train_lem.shape[1]))
print('Percentage of entries that are ZERO: {:.2%}'.format(fraction_sparse))

In [None]:
rf_classifier_lem = RandomForestClassifier(n_estimators=100, random_state=0)

In [None]:
rf_classifier_lem.fit(tfidf_data_train_lem, y_train_lem)

rf_train_preds_lem = rf_classifier_lem.predict(tfidf_data_train_lem)
rf_test_preds_lem = rf_classifier_lem.predict(tfidf_data_test_lem)

In [None]:
print(classification_report(y_train_lem, rf_train_preds_lem))
print(classification_report(y_test_lem, rf_test_preds_lem))

In [None]:
plot_confusion_matrix(rf_classifier_lem, tfidf_data_train_lem, y_train_lem);

In [None]:
plot_confusion_matrix(rf_classifier_lem, tfidf_data_test_lem, y_test_lem);

In [None]:
importances = sorted(list(zip(rf_classifier_lem.feature_importances_, tfidf.get_feature_names_out())))[-20:]
impts = pd.DataFrame(importances, columns=['impt', 'feat'])
plt.barh(impts.feat, impts.impt);

## Conclusions and Next Steps
- Learning the foundations of NLP allows us to represent language in a way that computers can process
- We can use the machine learning algorithms that we already learned to classify text documents
- However, there are still disadvantages to representing language this way: bag-of-words features discard word order and structure
- Next step: topic modeling 
- Next step: word embeddings (word2vec and doc2vec)
