![title](img/wordcloud.jpg)

# Jupyter notebook - cheatsheet

A few useful commands for the notebook:
1. Shift+Enter - run the code in a cell
2. Enter - open edit mode for a cell

Try Enter on this cell
Try shift + enter on this cell


# Questions
If you have any questions about NLP or Machine Learning, please ask me:
1. Face to face
2. [On Piazza](http://piazza.com/nlprafalpronko/spring2018/nlp101) access code **nlp101**
3. [On LinkedIn](https://www.linkedin.com/in/rafalpronko/)

If you want to practice at home, I created a Dockerfile with Python and Spark; it can be downloaded from [here](https://github.com/rafalpronko/python-pyspark).

# ==============================================================
# Let's create new start-up: Global Listings


>Our goal is to fully democratize international e-Commerce,
making it possible for any company, small or large, to offer their products worldwide
while providing superior customer service.

One branch of our activity is translating auctions between the eBay and Amazon marketplaces. For this purpose we need to tackle the following challenges:
1. How to choose the proper category for a product on the marketplace? - **our main problem**
2. How to enrich a product when moving it from Amazon to eBay?
3. How to prepare a shorter title during conversion from Amazon to eBay?
4. How to create bullet points when moving from eBay to Amazon?
5. Machine translation

Because our clients can sell / publish only well-categorized items, we need to solve the first problem ASAP.

The first thing you need to do is... [research](http://lmgtfy.com/?q=e-commerce+items+categorization) the state-of-the-art solutions.


After a week of research you will find a few nice solutions:
1. [Large-Scale Item Categorization in e-Commerce Using Multiple Recurrent Neural Networks](http://www.kdd.org/kdd2016/subtopic/view/large-scale-item-categorization-in-e-commerce-using-multiple-recurrent-neur/)
2. [Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant](https://www.aclweb.org/anthology/C/C16/C16-1051.pdf)
3. [Classifying e-commerce products based on images and text](http://cbonnett.github.io/Insight.html)

Now you know - everything you need is **Natural Language Processing** and **text mining**. 

# Natural Language Processing / Text mining - short introduction

>Natural Language Processing (NLP) is “ability of machines to understand and interpret human language the way it is written or spoken.”

## Key Application Areas of Natural Language Processing

### Automatic summarizer
For a given text we need to create a summary / extract the key points.
### Sentiment analysis
For a given text we need to predict the sentiment it expresses (for example positive, negative or neutral).
### <span style="color:red">Text classification</span> - our problem ;)
It is designed to categorize documents such as journals or news stories according to their domain. Multi-document classification is also possible. The most popular example of text classification is spam detection in emails.
### Information extraction
Extract the key information from a text, e.g. a date from an email - to create calendar events.

![](img/future-applications-of-nlp.png) 
source: https://www.xenonstack.com/blog/data-science/overview-of-artificial-intelligence-and-role-of-natural-language-processing-in-big-data

In this short introduction to Natural Language Processing (NLP) we will cover topics like:
1. Reading data into a Pandas dataframe
2. Filling null data, removing null rows, selecting interesting rows from a dataframe
3. Cleaning data: removing stopwords, removing noise, lemmatization, stemming, word tokenization, n-gram generation
4. Vector representations of text: bag of words, tf-idf
5. Classification using Naive Bayes

### How to put an item on Amazon in [English](http://amazon.com) into the proper root category?

![](img/BrowseNodes.png)

Amazon browse nodes (categories) can be found here: http://www.findbrowsenodes.com/. Please look at the first few categories and their children - you can see the tree is quite big.

Example of category path: 'Clothing, Shoes & Jewelry'->'Novelty, Costumes & More'->'Costumes & Accessories'->'More Accessories'->'Kids & Baby'

To simplify our job we will try to assign products (items) to a root category.

What's more, we want to use only **title** to achieve our goal. 

### How do we want to achieve this goal?

1. Scraping data from Amazon
2. Analyzing collected data
3. Building a model

The initial data is provided by an external company specialized in scraping sites.

So it's the time to start!

Next steps:
1. We need to choose only the important data
2. We need to clean / normalize the text
3. We need to build the classifier


# Read and manipulate data - Pandas DataFrame
A data frame is a way to store data in a rectangular grid that can be easily overviewed. You can imagine a data frame as a table in a database, with rows and columns. In this workshop we will use dataframes to store data from a file (in bare Python and in PySpark). In Python we have a nice library for dataframes: Pandas.


More information about DataFrame and Pandas: https://pandas.pydata.org/

To start working with Pandas we need to import the library:

In [None]:
import pandas as pd # to use a library in Python we first need to import it

In [None]:
pd.__version__ # we can check the version of our library

## Read data to Pandas

Pandas can load many types of data sources:
1. Excel
2. CSV
3. SQL
4. Google BigQuery
5. ...

In this workshop we will use a CSV file.

To read data from a CSV file into pandas we use [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Our file separates columns with `^`, hence `sep="^"`.

In [None]:
df_data = pd.read_csv('small_data.csv', sep="^")

To look at the data in a dataframe we can use a few functions - the most popular one is [head()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html), which shows the first n rows of the dataframe (there is also a tail function, which shows the last n rows).

[More information](https://pandas.pydata.org/pandas-docs/stable/basics.html)

In [None]:
df_data.head()

In [None]:
df_data.columns # show names of the columns

Data description:
* **asin** - unique identifier
* **salesRank** - seller's rank in particular category
* **imUrl** - image url
* **categories** - categories of the product
* **title** - title of the product
* **description** - description
* **price** - price
* **related** - which product is related to this one
* **brand** - product brand

Now we should take a closer look at the data. Let's summarize each column.

We can see how many rows and columns we have in the dataframe:

In [None]:
df_data.shape

To see the number of non-empty values in each column:

In [None]:
df_data.count()

Count shows us how many empty values we have - we know each column should have 240000 rows, so if count shows less, some values are empty and we need to tackle this issue.
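
The comparison between the row count and `count()` can be tried on a toy file first. This is a minimal sketch: the three rows and the name `toy` are made up for illustration, only the `^` separator mirrors the workshop file.

```python
import io
import pandas as pd

# Hypothetical three-row file with the same "^" separator; the second row has no title.
raw = io.StringIO(
    "asin^title^categories\n"
    "A1^Red dress^[['Clothing, Shoes & Jewelry', 'Girls']]\n"
    "A2^^[['Books']]\n"
    "A3^Blue kettle^[['Home & Kitchen']]\n"
)
toy = pd.read_csv(raw, sep="^")

print(toy.shape)         # (number of rows, number of columns)
print(toy.count())       # non-empty values per column
print(toy.isna().sum())  # empty values per column
```

The `title` column reports one value fewer than the number of rows, which is exactly the signal to look for in the real data.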

Because we need only two columns - categories and title - let's focus only on these.

Categories has 240000 values, so no empty values, but title has 239991. The number of empty values is really small (240000 - 239991 = 9), so we can simply remove these rows.

More information on how to work with empty values can be found [here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).

**Remove** 

In pandas we have a simple method to remove empty values: we just need to call `.dropna(subset, inplace=True/False)` on the dataframe.

* `subset` - a list of columns in which the dataframe will look for empty values
* `inplace=True` - the operation modifies this dataframe instead of returning a new one
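
The effect of `inplace` is easiest to see on a small frame. A minimal sketch, assuming a made-up three-row frame (`toy` and `cleaned` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Red dress", None, "Blue kettle"],
    "categories": ["Clothing", "Books", "Home & Kitchen"],
})

# Without inplace, dropna returns a new frame and leaves the original untouched.
cleaned = toy.dropna(subset=["title"])
print(len(toy), len(cleaned))

# With inplace=True the original frame itself loses the row with the empty title.
toy.dropna(subset=["title"], inplace=True)
print(len(toy))
```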

In [None]:
df_data.dropna(subset=['title'], inplace=True) 

In [None]:
# TODO - show how many items left

For our exercise we need only two columns, `title` and `categories`, so let's remove the other columns - for this we will use the [drop(labels, axis=1/0, inplace=True)](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) function.

* `labels` - the list of rows or columns to remove; `inplace` works the same as in dropna()
* `axis=1` - remove columns (`axis=0` removes rows)

In [None]:
df_data.drop(['asin', 'salesRank', 'imUrl', 'description', 'price', 'related', 'brand'], axis=1, inplace=True)

We used `inplace=True` because we want to change the current DataFrame, and `axis=1` to drop columns.

In [None]:
#TODO - show how many rows and columns is still in DataFrame

In [None]:
#TODO show the columns in our dataframe

After removing null values and keeping only two columns, we can look at our target column (categories). 

As we saw above, by default only a limited number of characters is shown in each column; fortunately we can change it.

In [None]:
pd.set_option('display.max_colwidth', None) # None means no limit; recent pandas versions no longer accept -1

In [None]:
#TODO show the first 10 rows (you can do it with .head(10))

**Questions**

1. How many categories does the first/second/third row have?
2. What can you observe? What is your first impression?

As we can see, we have mixed types of categories:
* just a root category (the root category is the first one in the whole tree)
* a root category and its children
* items that belong to more than one category tree (3rd row).

To simplify our job we will just try to build the classifier for the root category.

We need to extract root category from our structure. 

This structure ```[['Clothing, Shoes & Jewelry', 'Girls'], ['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More', 'Costumes & Accessories', 'More Accessories', 'Kids & Baby']]``` is called a list of lists - anything between ```[]``` is just a list - https://docs.python.org/2/tutorial/datastructures.html.

But in a dataframe cell this is just text - we need to evaluate this string into a list (https://docs.python.org/2/library/functions.html#eval).

We also need to apply this eval function to all rows - in pandas we can do it via apply (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html)

In [None]:
def choose_only_first_root_category(element):
    return eval(element)[0][0]

In this specific case eval turns the string `"[['Clothing, Shoes & Jewelry', 'Girls'],..."` into the list `[['Clothing, Shoes & Jewelry', 'Girls'],...]`, and we take only the first element of the first sublist.
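
Because `eval` executes any Python expression it is given, a more cautious variant for scraped data is `ast.literal_eval` from the standard library, which only accepts literals such as lists and strings. The sketch below is an alternative, not the notebook's own method; `first_root_category` and the sample `cell` are illustrative names.

```python
import ast

# A shortened example of what one dataframe cell holds: a string, not a list.
cell = ("[['Clothing, Shoes & Jewelry', 'Girls'], "
        "['Clothing, Shoes & Jewelry', 'Novelty, Costumes & More']]")

def first_root_category(text):
    """Parse the string into a list of lists and return the first root category."""
    paths = ast.literal_eval(text)  # parses literals only, never runs code
    return paths[0][0]

print(first_root_category(cell))
```

It can be passed to `apply` in the same way as a function built on `eval`.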

In [None]:
df_data['categories'] = df_data.categories.apply(choose_only_first_root_category)

In [None]:
#TODO show the 15 first elements from dataframe

# Steps to build our classifier

1. ~~We need to choose only interesting data~~
2. We need clean / normalize the text
3. Build the classifier

# Cleaning the text

In our classifier we want to use a simple representation of text data - **bag of words**.

### How bag of words works

As an example let's take two sentences: ```The quick brown fox jumps over the lazy dog``` and ```Never jump over the lazy dog quickly```. Together these two sentences are called a corpus. For this corpus we need to create a vocabulary:
```
{
The: 1,
quick: 2,
brown: 3,
fox: 4,
jumps: 5,
over: 6,
the: 7, 
lazy: 8,
dog: 9,
Never: 10,
jump: 11,
quickly: 12
}
```
Then we create a vector for each sentence.

First sentence:

```[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]```

```[The, quick, brown, fox, jumps, over, the, lazy, dog, , , ]```

Second sentence:

```[0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 1, 1]```

```[ , , , , , over, the, lazy, dog, Never, jump, quickly]```
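
The vocabulary and the two vectors above can be reproduced in a few lines of plain Python. This is a minimal sketch of the raw, uncleaned variant; positions are counted from 0 instead of 1, and `to_vector` is an illustrative helper name.

```python
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

# Every distinct word gets the next free position, in order of first appearance.
vocabulary = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

def to_vector(sentence):
    """Return a 0/1 vector marking which vocabulary words occur in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        vector[vocabulary[word]] = 1
    return vector

vectors = [to_vector(sentence) for sentence in corpus]
for vector in vectors:
    print(vector)
```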

What is not important in this representation:
1. order
2. capital letters
3. `quickly` can be equal to `quick`,... 
4. `the` is just a stop word (not informative)

so our vocabulary should look like:

```
{
quick: 1,
brown: 2,
fox: 3,
jump: 4,
over: 5,
lazy: 6,
dog: 7,
never: 8,
}
```
Try to think about what our vectors should look like...


To summarize these observations: to clean / normalize the text we need to:

1. Make sentences lowercase
2. Split it by the words
3. Remove stop words
4. cast quickly to quick, jumps to jump and so on

Let's start cleaning the text

In [None]:
# create a sentence
sentence = "The quick brown fox jumps over the lazy dog"

In [None]:
sentence

#### Lowercasing

In [None]:
sentence = sentence.lower()

In [None]:
sentence

#### Splitting by the words

In [None]:
sentence_split = sentence.split()

In [None]:
sentence_split

Alternatively, we can use the [nltk](http://www.nltk.org/) library and its word tokenization method.

In [None]:
from nltk import word_tokenize # import ready function from nltk

In [None]:
sentence_tokenize = word_tokenize(sentence)

In [None]:
sentence_tokenize

What is the difference? 
Let's take another example. 

In [None]:
"Two books, in my shop".split()

In [None]:
word_tokenize("Two books, in my shop")

As you can see, in the first case we get `books,` while in the second case `books` and `,` are split into separate tokens.
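
The same contrast can be seen without NLTK. The regular expression below is only a rough stand-in for a real tokenizer (an assumption made for illustration, not NLTK's algorithm): it keeps runs of word characters and treats each punctuation mark as its own token.

```python
import re

text = "Two books, in my shop"

# Plain split only cuts on whitespace, so the comma stays attached to "books".
whitespace_tokens = text.split()
print(whitespace_tokens)

# Runs of word characters, or single non-space punctuation characters.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```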

Clean text: 
1. ~~Make sentences lowercase~~
2. ~~Split it by the words~~
3. Remove stop words
4. cast quickly to quick, jumps to jump and so on

Now it is time to remove stop words - for this we can use nltk and its built-in list of [stop words](https://en.wikipedia.org/wiki/Stop_words)

In [None]:
from nltk.corpus import stopwords

In [None]:
# Define stop words language - our titles are in English
stop_words = stopwords.words('english')

In [None]:
for word in sentence_tokenize:
    if word not in stop_words:
        print(word)

Clean text: 
1. ~~Make sentences lowercase~~
2. ~~Split it by the words~~
3. ~~Remove stop words~~
4. cast quickly to quick, jumps to jump and so on

#### Cast quickly to quick
**[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) and [stemming](https://en.wikipedia.org/wiki/Stemming)**

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer # we need import lemmatizer

In [None]:
lem = WordNetLemmatizer() # create a lemmatizer

In [None]:
print(lem.lemmatize('jumps')) 

it works

In [None]:
from nltk.stem.porter import PorterStemmer # import stemmer

In [None]:
stem = PorterStemmer()
print(stem.stem('jumps')) 

What is the difference? 

In [None]:
print(lem.lemmatize('quickly')) 

In [None]:
print(stem.stem('quickly')) 

Clean text: 
1. ~~Make sentences lowercase~~
2. ~~Split it by the words~~
3. ~~Remove stop words~~
4. ~~cast quickly to quick, jumps to jump and so on~~

OK, we've finished cleaning a single sentence - now it is time to create a dictionary and vectors. That is quite a lot to do, but we can use built-in functions for this.
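
Put together, the four cleaning steps fit into one small function. The sketch below deliberately avoids NLTK so it runs anywhere: `STOP_WORDS` and `NORMAL_FORMS` are tiny hand-written stand-ins (assumptions for illustration) for a real stop-word list and a real lemmatizer or stemmer.

```python
import re

# Hypothetical stand-ins, not the NLTK resources used in the cells above.
STOP_WORDS = {"the", "a", "an", "in", "of"}
NORMAL_FORMS = {"jumps": "jump", "quickly": "quick"}

def clean(sentence):
    """Lowercase, split into words, drop stop words, map words to a normal form."""
    tokens = re.findall(r"\w+", sentence.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [NORMAL_FORMS.get(token, token) for token in kept]

first = clean("The quick brown fox jumps over the lazy dog")
second = clean("Never jump over the lazy dog quickly")
print(first)
print(second)
```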

### Faster method to clean, normalize and build the bag of words

For these tasks we have a library in Python - sklearn.

Ok, let's return to our main goal and our main dataset (as a reminder - we need to build a model to assign the title of item to proper root category)

We will do it in a few steps:
1. Split the data to two parts - train data and test data - this is important to test the model on different data than we train
2. Clean/normalize/build bag of word
3. Prepare labels / category
4. Build / train classifier
5. Evaluate classifier

#### Split train / test sets

To split the data we will use the sklearn method [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
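
A toy call can make the most useful optional arguments concrete. This is a sketch on made-up titles and labels; `test_size`, `random_state` and `stratify` are real parameters of `train_test_split`, while the variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

titles = ["red dress", "blue dress", "green skirt", "black coat",
          "python book", "cook book", "history book", "poetry book"]
labels = ["Clothing"] * 4 + ["Books"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    titles, labels,
    test_size=0.25,    # keep a quarter of the rows for testing
    random_state=42,   # fixed seed, so the split is reproducible
    stratify=labels,   # keep the class proportions the same in both parts
)
print(len(X_tr), len(X_te))
print(sorted(y_te))
```

Stratifying is worth considering when some root categories are much rarer than others, because a purely random split can leave them under-represented in the test set.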

In [None]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in newer sklearn versions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_data['title'], df_data['categories'])

Create a model:
1. ~~Split the data to two parts - train data and test data - this is important to test the model on different data than we train~~
2. Clean/normalize/build bag of word
3. Prepare labels / category
4. Build / train classifier
5. Evaluate classifier

#### Clean/normalize/build bag of word
For this step we will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [None]:
# count vectorizer will be responsible for creating a dictionary and word vector
from sklearn.feature_extraction.text import CountVectorizer 

In [None]:
bag_of_word_vec = CountVectorizer(analyzer='word', # analyze only full word
                                  lowercase=True, # lower case
                                  ngram_range=(1, 1), # split by single word
                                  stop_words=stop_words, # use our stop words
                                  tokenizer=word_tokenize # use our nltk word_tokenizer
                                 ) # create a simple count vectorizer

Because CountVectorizer is a model in sklearn, we need to train it - in sklearn models are trained with `.fit`.

For the bag of words model, fit just creates the vocabulary.

In [None]:
bag_of_word_vec.fit(X_train)

After training the model we need to transform the data - create the vectors - for both the train and the test set.
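
On a toy corpus the division of labour between `fit` and `transform` looks roughly like this. The sketch uses `CountVectorizer` with its default settings (not the NLTK tokenizer and stop words configured above), and the two one-sentence "datasets" are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["The quick brown fox jumps over the lazy dog"]
test_docs = ["Never jump over the lazy dog quickly"]

vec = CountVectorizer()   # defaults: lowercasing and the built-in tokenizer
vec.fit(train_docs)       # learns the vocabulary from the training text only

# List the learned terms in column order.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
train_counts = vec.transform(train_docs).toarray().tolist()
test_counts = vec.transform(test_docs).toarray().tolist()

print(terms)
print(train_counts)
print(test_counts)  # words that were not seen during fit are ignored
```

Words such as `never` or `quickly` get no column because they were absent when the vocabulary was built, which is why the vectorizer is fitted on the training set and only applied to the test set.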

In [None]:
X_train_vec = bag_of_word_vec.transform(X_train)

In [None]:
X_test_vec = bag_of_word_vec.transform(X_test) # the test set has to be transformed too

Create a model:
1. ~~Split the data to two parts - train data and test data - this is important to test the model on different data than we train~~
2. ~~Clean/normalize/build bag of word~~
3. Prepare labels / category
4. Build / train classifier
5. Evaluate classifier

#### Preparing labels / category

Our target classes (the categories) are text, but the classifier needs numbers. This means we need to map each category name to a simple number.

**Example**
1. Book should be changed to 0
2. TV -> 1

etc.
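
The idea behind this mapping can be sketched in plain Python before reaching for a library. The labels and variable names below are made up for illustration; sorting the distinct names keeps the numbering deterministic.

```python
labels = ["TV", "Book", "TV", "Book", "Book"]

# One number per distinct category name, assigned in alphabetical order.
classes = sorted(set(labels))
to_number = {name: number for number, name in enumerate(classes)}

encoded = [to_number[name] for name in labels]       # text -> numbers
decoded = [classes[number] for number in encoded]    # numbers -> text again

print(to_number)
print(encoded)
```

Keeping `classes` around matters: it is the only way to turn the classifier's numeric predictions back into readable category names.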

For that we need to import [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [None]:
# TODO import LabelEncoder

In [None]:
#TODO based on the code for CountVectorizer - create a LabelEncoder

In [None]:
#TODO train the label encoder

In [None]:
#TODO transform y_train

In [None]:
#TODO transform y_test

Create a model:
1. ~~Split the data to two parts - train data and test data - this is important to test the model on different data than we train~~
2. ~~Clean/normalize/build bag of word~~
3. ~~Prepare labels / category~~
4. Build / train classifier
5. Evaluate classifier

#### Build the models


As a classifier we will use [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [None]:
#TODO import classifier

In [None]:
#TODO train the classifier

In [None]:
preds = None #TODO replace None with the predictions for the test set

In [None]:
#TODO show 10 first prediction

In [None]:
#TODO show 10 first label

Create a model:
1. ~~Split the data to two parts - train data and test data - this is important to test the model on different data than we train~~
2. ~~Clean/normalize/build bag of word~~
3. ~~Prepare labels / category~~
4. ~~Build / train classifier~~
5. Evaluate classifier

#### Evaluate the model


In sklearn we can create a classification report with metrics for each class - for this we will use [classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [None]:
from sklearn.metrics import classification_report # metrics for classification

In [None]:
print(classification_report(y_true=y_test, y_pred=preds))

Create a model:
1. ~~Split the data to two parts - train data and test data - this is important to test the model on different data than we train~~
2. ~~Clean/normalize/build bag of word~~
3. ~~Prepare labels / category~~
4. ~~Build / train classifier~~
5. ~~Evaluate classifier~~

FINISH - you have created your first model for text classification

# How to improve our model?

1. Instead of a single-word vocabulary we can use a multi-word vocabulary, called n-grams. How to do it? In CountVectorizer we need to change ```ngram_range=(1, 1)``` to ```ngram_range=(1, 2)``` - it means use single words and pairs of words as the vocabulary. We can of course also use only two-word or three-word n-grams.
2. Use a different tokenization method
3. Use a different stop word list
4. Instead of a simple bag of words we can use a [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) model
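
To make the n-gram idea from the first point concrete, here is a minimal sketch of how consecutive words are combined. The helper `ngrams` is illustrative and not part of sklearn.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "quick brown fox jumps".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Roughly the terms that ngram_range=(1, 2) would put into the vocabulary.
print(unigrams + bigrams)
```

Longer n-grams capture phrases, but the vocabulary grows quickly, so the range is usually tuned instead of simply maximized.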

![](img/bonus.png)



# TF-IDF

TF-IDF is composed of two terms:
1. TF - term frequency
2. IDF - Inverse document frequency

**How to compute**

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = (Total number of documents) / (Number of documents with term t in it)

TF-IDF(t) = TF(t) * IDF(t)

This is a simplified form of IDF; the common definition takes the logarithm of this ratio.
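
These simplified formulas translate directly into code. The sketch below uses exact fractions and the two example sentences from this notebook; the function names are illustrative, and `idf` assumes the term occurs in at least one document.

```python
from fractions import Fraction

corpus = [
    "The quick brown fox jumps over the lazy dog".split(),
    "Never jump over the lazy dog quickly".split(),
]

def tf(term, document):
    """Share of the document's words that are equal to the term."""
    return Fraction(document.count(term), len(document))

def idf(term, documents):
    """Number of documents divided by the number of documents containing the term."""
    containing = sum(1 for document in documents if term in document)
    return Fraction(len(documents), containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

print(tf_idf("fox", corpus[0], corpus))   # appears in one of the two documents
print(tf_idf("lazy", corpus[0], corpus))  # appears in both documents
```

A word found in every document, such as `lazy`, ends up with a lower score than a word found in only one. sklearn's `TfidfVectorizer` uses a smoothed, logarithmic IDF and normalizes each vector by default, so its numbers will differ from this hand calculation.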

**Example**
We have two sentences ```The quick brown fox jumps over the lazy dog``` and ```Never jump over the lazy dog quickly``` - for simplicity we **do not** clean our text, so `quick` is different from `quickly`.

The word `quick` appears once in the first sentence (which has 9 words) and in only one of the two documents, so

TF(quick) = 1 / 9

IDF(quick) = 2 / 1

TF-IDF(quick) = (1 / 9) * 2 = 2/9

Do the same with `dog` - but use Python code:

In [None]:
sentence = 'The quick brown fox jumps over the lazy dog'
dog_in_first_document = 1
number_of_term_in_first_document = len(sentence.split())
number_document_with_dog = 2
number_of_document = 2

TF = dog_in_first_document/number_of_term_in_first_document
IDF = number_of_document / number_document_with_dog

TF_IDF = TF*IDF
print(TF_IDF)

Now we can do it using sklearn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # import tfidf
tf_idf = TfidfVectorizer(
      analyzer='word', # analyze only full word
      lowercase=True, # lower case
      ngram_range=(1, 2), # use single words and pairs of words
      stop_words=stop_words, # use our stop words
      tokenizer=word_tokenize # use our nltk word_tokenizer
)

#TODO - fit the Vectorizer (only on train), transform train/test set, build NaiveBayes classifier,
# show the classification report

# [PIPELINE](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In sklearn we do not have to apply each transform separately - we can use a pipeline to simplify the model. Below is a simple example of how to do it.

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
clf = make_pipeline(TfidfVectorizer(ngram_range=(1,2)), MultinomialNB())

In [None]:
#TODO fit the model

In [None]:
#TODO predict new data

In [None]:
#TODO show classification report

In [None]:
#TODO compare data

# FIND THE BEST MODEL

Many times when creating a new model we need to test many different parameters - like the number of ngrams, the learning rate and so on.

Sklearn has methods for this - it is as simple as creating a new classifier

In [None]:
clf_best = make_pipeline(TfidfVectorizer(ngram_range=(1,2)), MultinomialNB())

In [None]:
clf_best

In [None]:
from sklearn.model_selection import GridSearchCV # sklearn.grid_search was removed in newer sklearn versions

In [None]:
parameters = {'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
              'tfidfvectorizer__use_idf': (True, False),
              'tfidfvectorizer__norm': ('l1', 'l2'),
              'multinomialnb__alpha': (1.0, 0.9, 0.8, 0.7),
 } #parameters to test

In [None]:
gs_clf = GridSearchCV(clf_best, parameters, n_jobs=-1)
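
The keys of the parameter dictionary follow the pattern `<step name>__<parameter name>`, where `make_pipeline` names each step after its lowercased class name. The sketch below checks this and counts how many combinations the grid contains; `pipe` and `grid` are illustrative names, and `ParameterGrid` is used here only for counting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
step_names = list(pipe.named_steps)
print(step_names)

grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
    "tfidfvectorizer__use_idf": (True, False),
    "tfidfvectorizer__norm": ("l1", "l2"),
    "multinomialnb__alpha": (1.0, 0.9, 0.8, 0.7),
}

# Every key has to be a settable parameter of the pipeline.
unknown_keys = [key for key in grid if key not in pipe.get_params()]
n_combinations = len(ParameterGrid(grid))
print(unknown_keys, n_combinations)
```

Each combination is trained once per cross-validation fold, so a grid of this size can take a while on the full dataset; trimming the value lists is a reasonable first step if the search is too slow.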

In [None]:
#TODO fit the classifier

In [None]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

In [None]:
#TODO predict the new labels

In [None]:
#TODO create a classification report

In [None]:
#TODO Compare this model with previous one

![](img/links.png)

1. [Speech and Language Processing (3rd ed. draft)](https://web.stanford.edu/~jurafsky/slp3/)
2. [Natural Language Processing with Python](http://www.nltk.org/book/)
3. [Deep Learning For NLP](https://github.com/andrewt3000/dl4nlp)
4. [NLP step by step](https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_natural_language_processing.htm)