# NLP Series - Class 1

***

Welcome to the NLP (Natural Language Processing) Series, developed by Visagio and Digital House! In this series, you will be introduced to the NLP world: what exactly is NLP and why is it important for a Data Scientist to know it? What are the current main applications? How can we solve a real problem with NLP? <br>
All these questions will be answered in this three-class series. In this first class, we will see:
 - what NLP is
 - why NLP presents a different challenge compared to 'normal' datasets (like the Titanic or IMDb Score)
 - some fundamental approaches to NLP
 - the best practices when dealing with an NLP problem (cleaning, visualizing etc.)
 - how to tackle problems with a basic NLP algorithm
 - some ways of diving deeper into NLP
 
Ready? Let's go! <br>

The series 'How to Data Science' was written and developed by Abelardo Fukasawa. I'm a Data Scientist at Visagio and a Machine Learning Researcher at USP's Grupo Turing. Feel free to contact me on my __[LinkedIn](https://www.linkedin.com/in/abelardofukasawa/)__ or my __[GitHub](https://github.com/abefukasawa)__ :) <br>
***

## So, what is Natural Language Processing?

In science fiction (a genre that I love!) it is pretty common to see robots or other artificial agents communicating seamlessly with humans. Not through command lines in a terminal or specific reserved commands, but by talking the way we talk. Take Blade Runner's JOI or Star Wars' C-3PO, for example. They manage to interact with other characters simply by speaking their language. In other words, these intelligent artificial agents, at least the ones depicted in these stories, can comprehend and reproduce the way we use language to communicate - the natural language. <br>
This is a common way of thinking about the goal of NLP: to develop agents that can communicate with us, naturally, in order to have a better relation/interface with humans. Therefore, **NLP is a field that brings together Machine Learning and Linguistics in order to create agents that can understand and reproduce our 'human way of communicating'**. And that's pretty much it! It may sound simple, but NLP presents a totally different challenge from 'raw data' Data Science and Machine Learning. <br>
Some famous applications of NLP are spell checkers, Google Duplex, personal assistants (Apple Siri, Microsoft Cortana, Amazon Alexa) and chatbots!
<br>

PS: NLP can be broken down into NLU (Natural Language Understanding) and NLG (Natural Language Generation). They are intrinsically connected, but they are two very different problems.

## Why is NLP a separate and challenging field?

To answer this question, let's remember a very traditional challenge/dataset from Kaggle - the IMDb Score. The goal of this challenge is to predict the score a given movie would receive. There are plenty of features: movie budget, revenue, cast, director, Facebook likes etc. So, after cleaning, reducing dimensionality and doing a little bit of feature engineering, you may have applied something like an ensemble algorithm, or a neural net for the fancy ones, and boom - you get a result. You had a multidimensional input and your algorithm discovered the mapping function between it and the desired output. Great, simple, fast. You may have used a little bit of knowledge (cough, guessing?) about movies in order to do your feature engineering, but you didn't have to deeply understand how movies are made, how directors use the color palette to convey emotions, or which art references the piece makes. In other words, for IMDb score, you don't have to be a movie expert - you use math to discover the best features and you're done.
NLP does not work this way. As languages have their own words, morphology, semantics, syntax and syntagmas, there are linguistic rules embedded in your dataset. Let's take a simple example:
 - the brown dog ate the red apple
 - the red dog ate the brown apple
 <br>
The 2 sentences above depict two different scenes, even though they use exactly the same words. In the first one, I see a beautiful dog eating a nice apple. In the second one, I see a quite odd dog (would it be a Pokémon? Never saw a red dog) eating a rotten apple - ew. Language is not only about content, it's about structure, time and context. Because of this, NLP is considered an area of its own inside machine learning - a quite interesting and challenging one. <br>
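
To make the point concrete, here is a minimal sketch that counts the words of the two example sentences and compares them. It only uses Python's standard library, and the variable names are mine, picked just for illustration.

```python
from collections import Counter

sentence_a = "the brown dog ate the red apple"
sentence_b = "the red dog ate the brown apple"

# Count how often each word occurs, ignoring word order entirely.
counts_a = Counter(sentence_a.split())
counts_b = Counter(sentence_b.split())

# Same words with the same counts...
same_counts = counts_a == counts_b
# ...but in a different order, hence a different scene.
same_order = sentence_a.split() == sentence_b.split()

print(same_counts, same_order)  # True False
```

Any approach that only looks at word counts cannot tell these two sentences apart; the difference lives entirely in the order of the words.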

## Text Classification using Naive Bayes in News

Ok, we talked a lot about NLP, let's get our hands dirty and start coding! For now, we will be using the __[20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)__. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Scikit-learn ships a loader for it (`fetch_20newsgroups`), so there's no need to download it by hand.
We will begin with a pretty common algorithm: *Naive Bayes*! Although simple, *Naive Bayes* can be quite useful, not only performance-wise, but also to set a baseline for your models. It is a simple and fast algorithm to begin your experiments with.

### Step 1: Load and Explore the Dataset

In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [4]:
print(twenty_train.target_names) #prints all the categories
print('_____')
print("\n".join(twenty_train.data[0].split("\n")[:20]))

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
_____
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you


As you can see, it seems like a message someone posted to a newsgroup asking about an interesting car they couldn't identify. Let's see another one:

In [5]:
print("\n".join(twenty_train.data[1].split("\n")[:20]))

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>


### Step 2: Convert words into features - the Bag of Words model

As we just saw, our dataset is a collection of texts. The first and most fundamental approach is to understand language by its contents. I know, my first example shows this can be a trap, but we can still extract powerful information about a text's theme from its contents. Even though those 2 sentences carried different meanings, they both referenced a dog eating an apple. If we want to classify texts by their themes, this may be very useful! <br>
Ok, but we can't feed our dataset as it is into a *Naive Bayes* algorithm. It deals with numeric features, not strings. Well, texts are just ordered sequences of strings, so we can convert them into vectors. We will do this by using the *Bag of Words* model (I'm lazy, imma call it BoW, okay?)!
Briefly, we segment each text file into words (for English, splitting on whitespace is a good start), assign each unique word an integer id, and count the number of times each word occurs in each document. Each unique word in our dictionary corresponds to a feature. Therefore, the BoW columns are all the words in our dataset (not just those of one document, but of all documents), and each row of our BoW matrix is a document. Each element of the matrix shows how many times a given word (column) appears in a given document (row). Scikit-learn has a high-level component that creates these feature vectors for us: `CountVectorizer`. <br>
A nice insight: the more diverse our dataset is, the sparser our matrix will be. Can you figure out why?
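
Before handing the job to scikit-learn, it may help to see the idea in plain Python. This is only a sketch of the BoW steps described above, not how `CountVectorizer` is implemented: the three toy documents are made up, and the tokenizer is a bare lowercase-and-split.

```python
docs = [
    "the brown dog ate the red apple",
    "the cat sat on the mat",
    "stocks fell as markets closed",
]

# Tokenize: lowercase and split on whitespace (a simplification).
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every unique word across ALL documents gets an integer id.
unique_words = sorted({word for doc in tokenized for word in doc})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# BoW matrix: one row per document, one column per vocabulary word.
bow = [[0] * len(vocabulary) for _ in docs]
for row, doc in enumerate(tokenized):
    for word in doc:
        bow[row][vocabulary[word]] += 1

# Share of cells that are zero, i.e. how sparse the matrix is.
zeros = sum(cell == 0 for row in bow for cell in row)
sparsity = zeros / (len(docs) * len(vocabulary))

print(len(vocabulary), zeros, round(sparsity, 2))  # 15 29 0.64
```

Notice that the third document shares no words with the other two, so it adds five new columns that are zero for every other row. That is a hint for the question above.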

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

As we can see, we have ~11k individual documents and ~130k unique words in our BoW! It's natural to have an order of magnitude more words than documents, as the documents cover many different themes. But, let's wait a minute. Is this what we really want?

### TF-IDF and noise in our language

Let's do a simple experiment. The paragraph above talks about the shape of our BoW. If we counted every word in a rough copy of it, would we be able to predict its theme from its contents?

In [7]:
import numpy as np
import operator
paragraph = "As we can see, we have ~11k individual documents and ~130k words in our BoW! It's very natural to have more words than documents in orders of power, as we have different themes. But, let's wait a minute. Is this what we rly want?"
unique, counts = np.unique(paragraph.split(" "), return_counts=True)
paragraph_dict = dict(zip(unique, counts))
print(sorted(paragraph_dict.items(), key=operator.itemgetter(1)))

[('As', 1), ('BoW!', 1), ('But,', 1), ('Is', 1), ("It's", 1), ('a', 1), ('and', 1), ('as', 1), ('can', 1), ('different', 1), ('individual', 1), ("let's", 1), ('minute.', 1), ('more', 1), ('natural', 1), ('of', 1), ('orders', 1), ('our', 1), ('power,', 1), ('rly', 1), ('see,', 1), ('than', 1), ('themes.', 1), ('this', 1), ('to', 1), ('very', 1), ('wait', 1), ('want?', 1), ('what', 1), ('~11k', 1), ('~130k', 1), ('documents', 2), ('in', 2), ('words', 2), ('have', 3), ('we', 4)]


Well, the most common words are 'we' and 'have'. They are too generic to describe the content of our paragraph! The truly important words appear only once or twice. If we replicated this experiment on the twenty_train dataset, we would be inconclusive about the documents' themes: a lot of 'a', 'the', 'as' etc. would show up as the most important words. <br>
These stopwords (articles, connectors and other very common words) bring noise to our conclusions, so we need a weighting scheme that can deal with them. That's where TF-IDF, or Term Frequency-Inverse Document Frequency, comes in. I'd like to show the formula and then explain it: <br>
![TF-IDF](../data/tfidf_formula.png)
Term by term:
 - W is the weight each word/feature will have in a given document when passing through the classifier
 - tf is the number of times the word appears in that document
 - N is the number of documents in our dataset. In our case, 11314
 - df is the number of documents that contain the word/feature
 
In conclusion, the important words of a given theme will appear many times in texts about that theme, and mostly only in those texts. This is what TF-IDF captures: it gives weight to the really important words and discounts the noise from stopwords.
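
Here is a small worked example of that weighting. I'm assuming the classic form W = tf × log(N / df), which matches the terms listed above; the toy documents and the `tf_idf` helper are hypothetical. Keep in mind that scikit-learn's `TfidfTransformer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match these exactly.

```python
import math

docs = [
    "the engine of the car is loud",
    "the car has a new engine",
    "the team won the hockey game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)  # N in the formula


def tf_idf(word, doc_tokens):
    """Classic TF-IDF weight, assuming W = tf * log(N / df).

    The word must occur in at least one document, otherwise df is 0.
    """
    tf = doc_tokens.count(word)                        # times the word appears in this document
    df = sum(word in tokens for tokens in tokenized)   # documents containing the word
    return tf * math.log(n_docs / df)


w_the = tf_idf("the", tokenized[0])        # in every document: log(3/3) = 0
w_engine = tf_idf("engine", tokenized[0])  # in 2 of 3 documents
w_hockey = tf_idf("hockey", tokenized[2])  # in 1 of 3 documents

print(w_the, round(w_engine, 3), round(w_hockey, 3))  # 0.0 0.405 1.099
```

Even though 'the' appears twice in the first document, its weight is zero because it shows up everywhere, while 'hockey' gets the highest weight because it is specific to a single document.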

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

### Step 3: Building our NB Classifier Pipeline

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [10]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

Ok, so we have a 77.39% accuracy baseline! Pretty neat, huh?

### Can we raise this accuracy score with a better algorithm?

In [11]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
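    # max_iter replaces the old n_iter argument, which recent scikit-learn
    # versions no longer accept; tol=None keeps the fixed 5 passes over the data
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)),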
])
_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)



0.8238183749336165

Nice! We raised the score just by switching to a linear SVM (which is what `SGDClassifier` trains with `loss='hinge'`)! But keep in mind that we only achieved this because of the feature information TF-IDF brought to us.

### Step 4: Optimize with Grid Search (and what are N-Grams?)

In [12]:
from sklearn.model_selection import GridSearchCV
parameters_svm = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf-svm__alpha': (1e-2, 1e-3),
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)



0.8979140887396146
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
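
To get a feel for what the search did, here is a sketch that expands the same grid by hand with the standard library. It only enumerates the candidates; `GridSearchCV` additionally fits and cross-validates the whole pipeline for each one. The `candidates` name is mine.

```python
from itertools import product

parameters_svm = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf-svm__alpha': (1e-2, 1e-3),
}

# Every combination of one value per parameter is one candidate setting.
names = list(parameters_svm)
candidates = [
    dict(zip(names, values))
    for values in product(*parameters_svm.values())
]

print(len(candidates))  # 2 * 2 * 2 = 8
for candidate in candidates:
    print(candidate)
```

Each extra parameter value multiplies the number of pipelines to train, which is why grids are usually kept small and why `n_jobs=-1` (use all CPU cores) helps.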


Nice! By doing a simple Grid Search, we raised our score to almost 90%! Keep in mind that `best_score_` is the mean cross-validated accuracy on the training set, so it isn't directly comparable to the test accuracies above. Explaining the parameters in the grid:
 - clf-svm__alpha: the regularization strength of the SVM (higher values mean stronger regularization)
 - tfidf__use_idf: whether or not we should use the IDF reweighting. The best result kept it on, which is a nice proof of concept of how powerful and important it is!
 - vect__ngram_range: *n-grams* refer to how many consecutive words are considered a single token. With (1, 1), features consist only of single words. With a (1, 2) range, features are constructed using both single words and pairs of words. 2-grams, or bigrams, add information about sequencing and the relation of words within our model! This proved to be quite powerful too.
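
Here is a quick sketch of what an n-gram range produces, using the dog and apple sentences from the beginning of the class. The `ngrams` and `ngram_features` helpers are hypothetical and use a plain whitespace split, so this only illustrates the idea rather than reproducing `CountVectorizer`'s tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, ngram_range=(1, 1)):
    """Collect all n-grams for every n in the inclusive range (low, high)."""
    tokens = text.split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features


sentence_a = "the brown dog ate the red apple"
sentence_b = "the red dog ate the brown apple"

# With single words only, both sentences yield the same features.
unigrams_match = sorted(ngram_features(sentence_a)) == sorted(ngram_features(sentence_b))

# With bigrams added, the features start to tell the sentences apart.
features_a = ngram_features(sentence_a, (1, 2))
features_b = ngram_features(sentence_b, (1, 2))

print(unigrams_match)                                   # True
print("brown dog" in features_a, "brown dog" in features_b)  # True False
```

With bigrams, 'brown dog' and 'red dog' become different features, so part of the word order that plain BoW throws away is recovered.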