## Introduction to machine learning.
## Natural language processing.
___
   
*Radoslav Petkov*

### About me
___

* Contact: https://www.linkedin.com/in/radoslav-petkov-8a4a53144/
* Sofia University, Computer Science, BSc (2nd year)
* Has been working for Sirma since the end of 2014

### Summary
___

* Supervised vs Unsupervised machine learning
* Representation of words and sentences
* Autoencoders
* Sequence to sequence models
* Memory Networks


## Supervised vs Unsupervised

![](un_supervised.png)


# Supervised
___

###  $ y= a*X + b$
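
As a minimal sketch of what "learning" $a$ and $b$ means, the snippet below fits a line to a handful of made-up points with ordinary least squares. The data and the helper name `fit_line` are illustrative assumptions, not part of the original talk.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b that minimise the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # The fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b


# Made-up training data that lies exactly on y = 2*x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b)

# Use the learned parameters to predict y for an unseen x
prediction = a * 4.0 + b
print("prediction for x = 4:", prediction)
```

Supervised learning follows the same pattern with more features and noisier data: known pairs $(X, y)$ are used to estimate the parameters, which are then applied to new inputs.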

## Regression
___

### $y \in \mathbb{R}$

In [25]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
boston = load_boston()

dataset = pd.DataFrame(data=boston["data"], columns=boston["feature_names"])
dataset["Target"] = boston["target"]


In [30]:
print(boston["DESCR"])

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [53]:
dataset[dataset["Target"] < 10][:2]

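| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 20.0849 | 0.0 | 18.1 | 0.0 | 0.7 | 4.368 | 91.2 | 1.4395 | 24.0 | 666.0 | 20.2 | 285.83 | 30.63 | 8.8 |
| 385 | 16.8118 | 0.0 | 18.1 | 0.0 | 0.7 | 5.277 | 98.1 | 1.4261 | 24.0 | 666.0 | 20.2 | 396.9 | 30.81 | 7.2 |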


In [60]:
dataset[(dataset["Target"] > 10) & (dataset["Target"] < 30)][:2]

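| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |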


In [65]:
dataset[dataset["Target"] > 45][:2]

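| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 161 | 1.46336 | 0.0 | 19.58 | 0.0 | 0.605 | 7.489 | 90.8 | 1.9709 | 5.0 | 403.0 | 14.7 | 374.43 | 1.73 | 50.0 |
| 162 | 1.83377 | 0.0 | 19.58 | 1.0 | 0.605 | 7.802 | 98.2 | 2.0407 | 5.0 | 403.0 | 14.7 | 389.61 | 1.92 | 50.0 |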


## Classification
___

### $y \in \{0, 1, \dots, c\}$

In [41]:
from sklearn.datasets import load_iris

iris = load_iris()
pd_iris = pd.DataFrame(data=iris["data"], columns=iris["feature_names"])
pd_iris["Type"] = iris["target"]

In [42]:
print(iris["DESCR"])

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [49]:
pd_iris[pd_iris["Type"] == 0][:2]

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |


In [50]:
pd_iris[pd_iris["Type"] == 1][:2]

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Type |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | 1 |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | 1 |


In [52]:
pd_iris[pd_iris["Type"] == 2][:2]

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Type |
|---|---|---|---|---|---|
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | 2 |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | 2 |


# What does the computer see?

*  ### What is one-hot encoding of words?

Simply put, it is a representation in which each word is a vector whose size equals the vocabulary size, with a 1 at the index of that word and 0 everywhere else.

Imagine you have the following vocabulary: *Machine*, *learning*, *rocks*.

***One possible representation would be:***

Machine -> [1 0 0],  learning -> [0 1 0], rocks -> [0 0 1]
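
The mapping above can be sketched in a few lines of plain Python. The names `word_to_index` and `one_hot` are illustrative helpers, not part of any library.

```python
vocabulary = ["Machine", "learning", "rocks"]

# Map every word to its position in the vocabulary
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


for word in vocabulary:
    print(word, "->", one_hot(word))
```

Note that every pair of distinct words is equally far apart in this representation, so it carries no information about meaning.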


* ### What about Bag of Words?

Each sentence is a vector whose size equals the vocabulary size, storing the number of occurrences of each word in the sentence.
There are several variations, such as:
* **TF-IDF**

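    We assign a weight to each word instead of its raw number of occurrences. The weights tend to filter out common terms: with the plain (unsmoothed) formula, a weight of 0 for a word that occurs in the sentence means the word is present in every sentence.
    TF (term frequency) is the count of the word in the current sentence.
    IDF (inverse document frequency) is the logarithm of the total number of sentences divided by the number of sentences that contain the word.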
* **Hashing**

    Instead of looking words up in a dictionary (vocabulary) to vectorize them, a hash function maps each word directly to an index in the vector.
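
To make the TF-IDF weighting concrete, here is a minimal sketch that computes it by hand for three short sentences. It assumes the plain textbook definition, weight = TF × log(N / number of sentences containing the word). The helper names `idf` and `tf_idf` are illustrative, and library implementations (for example scikit-learn's `TfidfVectorizer`) use smoothed variants, so their numbers differ.

```python
import math

sentences = [
    "machine learning rocks",
    "this machine learning rocks",
    "this robot rocks",
]
tokenized = [sentence.split() for sentence in sentences]
vocabulary = sorted({word for words in tokenized for word in words})


def idf(word):
    """log(total number of sentences / number of sentences containing the word)."""
    containing = sum(1 for words in tokenized if word in words)
    return math.log(len(tokenized) / containing)


def tf_idf(words):
    """One weight per vocabulary word: raw count in the sentence times IDF."""
    return [words.count(word) * idf(word) for word in vocabulary]


print(vocabulary)
for sentence, words in zip(sentences, tokenized):
    print(sentence, [round(weight, 3) for weight in tf_idf(words)])
```

In this sketch *rocks* occurs in all three sentences, so its IDF is log(3/3) = 0 and it is filtered out, while *robot* occurs in only one sentence and gets the largest weight.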

### Examples
___

Let's take the following vocabulary: *machine*, *learning*, *rocks*, *this*, *robot*

And the following sentences:

* *machine learning rocks*
* *this machine learning rocks*
* *this robot rocks*

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["machine learning rocks", "this machine learning rocks", "this robot rocks"]
corpus_transformed = CountVectorizer().fit_transform(corpus)

In [18]:
for (sentence, transf) in zip(corpus, corpus_transformed.toarray()):
    print(sentence, transf)

machine learning rocks [1 1 0 1 0]
this machine learning rocks [1 1 0 1 1]
this robot rocks [0 0 1 1 1]
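
For comparison, the hashing variant of bag of words can be sketched without any stored vocabulary. This is an illustration under assumptions: the vector size `N_BUCKETS` is chosen arbitrarily, and SHA-256 is used only because it gives the same result on every run (Python's built-in `hash` is randomized per process for strings).

```python
import hashlib

N_BUCKETS = 8  # illustrative vector size, chosen arbitrarily


def bucket(word):
    """Map a word to a column index using a stable hash of its bytes."""
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS


def hashed_bag_of_words(sentence):
    """Count words per bucket; different words may collide in the same bucket."""
    vector = [0] * N_BUCKETS
    for word in sentence.split():
        vector[bucket(word)] += 1
    return vector


corpus = ["machine learning rocks", "this machine learning rocks", "this robot rocks"]
for sentence in corpus:
    print(sentence, hashed_bag_of_words(sentence))
```

The trade-off is that the vector size is fixed up front and unseen words need no special handling, but two words can land in the same bucket and the mapping cannot be inverted. scikit-learn provides a ready-made version of this idea as `HashingVectorizer`.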


*  ### Hmm, now the fancy word vectors

These are vectors from a latent space learned by training an unsupervised model on a corpus.
They can capture the 'semantic meaning' of a word.

![](word_embedding2.jpg)

![](word_embedding.png)

In [5]:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('/media/radoslav/6906F83679A14133/Download/glove/GoogleNews-vectors-negative300.bin', binary=True)

In [6]:
word_vectors.most_similar(positive=['walking', 'swam'], negative=['swimming'])[0]

('walked', 0.751861572265625)

In [14]:
word_vectors.most_similar(positive=['France', 'Sofia'], negative=['Paris'])[0]

('Bulgaria', 0.7505655288696289)

In [40]:
word_vectors.most_similar(positive=['queen', 'man'], negative=['woman'])[0]

('king', 0.6958590149879456)
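
Roughly speaking, these analogy queries add the `positive` vectors, subtract the `negative` ones, and return the word whose vector has the highest cosine similarity to the result. The sketch below reproduces that idea on hand-made 2-D toy vectors. The numbers are invented for illustration and are not real embeddings, and gensim's actual implementation differs in details such as normalisation.

```python
import math

# Hand-made toy vectors (axis 0 ~ "royalty", axis 1 ~ "maleness"); not real embeddings
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [-0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the closest other word."""
    dims = len(next(iter(toy_vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, toy_vectors[word])]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in toy_vectors if w not in positive and w not in negative]
    scored = ((w, cosine(target, toy_vectors[w])) for w in candidates)
    return max(scored, key=lambda pair: pair[1])


best_word, score = analogy(positive=["queen", "man"], negative=["woman"])
print(best_word, round(score, 3))
```

With these toy numbers, *queen* + *man* - *woman* lands on the *king* vector, which mirrors the behaviour of the real model above.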

### Now let's see what a model can do on one of those IQ quizzes

In [16]:
word_vectors.doesnt_match(["apple", "banana", "orange", "bread"])

'bread'

## Autoencoders
___

![](autoencoder.png)

## Sequence to sequence models
___

![](thank-you-for-your-attention-now-its-time-for-questions.jpg)

### Sources of some of the images:

* https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/machine_learning.html
* https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-20e0c85301bd
* https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

### Useful references:

* https://github.com/fmi/machine-learning-lectures
* https://www.kaggle.com/
* https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/machine_learning.html
* https://en.wikipedia.org/wiki/Feature_hashing
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf


## <center>All links above are interesting and useful if you want to dive deep into the world of machine learning.</center>

 ### <center>The end, thanks!</center>