# Lab 2. The vowel dataset: NB, LDA, QDA, kNN, Decision tree

<img src="http://media.giphy.com/media/citBl9yPwnUOs/giphy.gif"  width="300">

## Outline:

   [1) Description of the vowel dataset](#1)
   
   [2) A bit of statistics](#2)
   
   [3) Naive Bayes / LDA / QDA](#3)
   
   [4) k-NN](#4)
   
   [5) Decision tree](#5)

**A) Importing the basic modules**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

**B) Where to download the data:**

https://web.stanford.edu/~hastie/ElemStatLearn/data.html

Convert the files to CSV format and place them in the same folder as your Python file.


*WARNING: I manually removed the first column, which is just the row number.*

In [None]:
train_load = pd.read_csv('vowel.train.csv', sep=',')
test_load = pd.read_csv('vowel.test.csv', sep=',')

**C) Basic checks**

In [None]:
(type(train_load),type(test_load))

In [None]:
train_load.shape

In [None]:
test_load.shape

**D) Splitting train and test into (input, output)** (input/output = features/response)

In [None]:
(x_train, y_train) = (train_load.iloc[:,1:11], train_load.iloc[:,0])
(x_test, y_test) = (test_load.iloc[:,1:11], test_load.iloc[:,0])

<img src="http://media.giphy.com/media/ASd0Ukj0y3qMM/giphy.gif" width = 300>
<a id="1"></a> 
 
# 1. Description of the vowel dataset


**A) x_train**

In [None]:
x_train.describe()

In [None]:
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
x_train.plot.box(color=color, sym='r+');

In [None]:
fig = plt.figure(figsize=(15,5))
for k in range(0,10):
    f = fig.add_subplot(2,5,k+1)
    f.hist(x_train.iloc[:,k], bins = 25);

In [None]:
f, ax = plt.subplots(figsize=(8, 8))
corr = x_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap="coolwarm",
            square=True, annot=True);

**B) y_train**

In [None]:
y_train.value_counts()

<img src="https://media.giphy.com/media/qqtvGYCjDNwac/giphy.gif" width = 300>
<a id="2"></a> 
 
# 2. A bit of statistics

**A) Normality test. Examples**

In [None]:
import scipy.stats as stats

Two examples: uniform data, normal data

In [None]:
u = stats.uniform.rvs(size = 100)
stats.normaltest(u)

In [None]:
v = stats.norm.rvs(size = 100)
stats.normaltest(v)

**B) Normality test on x_train**

In [None]:
resTest = stats.normaltest(x_train)

In [None]:
for k in range(0,10):
    print("variable ", k, "%.3f" %  resTest.pvalue[k])

In [None]:
resTest.pvalue > 0.01

**C) QQ plots on x_train**

In [None]:
import statsmodels.api as sm

fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(10,10))
ax = axes.flatten()
# One QQ plot per feature; line='s' draws the standardized reference line
for i in range(10):
    sm.qqplot(np.array(x_train.iloc[:,i]), line='s', ax=ax[i], xlabel=str(i), ylabel="")

**D) Multinomial logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression()
y_pred_logreg = logreg.fit(x_train, y_train).predict(x_test)

In [None]:
# Count the misclassified test points once, then report the rate with one decimal
n_errors = (y_test != y_pred_logreg).sum()
print("Number of mislabeled points out of a total %d points : %d. Error rate : %.1f %%"
       % (x_test.shape[0], n_errors, 100 * n_errors / x_test.shape[0]))

<img src="http://media0.giphy.com/media/3BRDkVjKikYW4/giphy.gif" width = 300>
<a id="3"></a> 
 
# 3. Naive Bayes / LDA / QDA

**A) The sklearn machine learning module**

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
GNB = GaussianNB()
LDA = LinearDiscriminantAnalysis()
QDA = QuadraticDiscriminantAnalysis()

**B) GNB**

**C) LDA**

**D) QDA**
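
Parts B) to D) follow the same pattern: fit the model on the training set, predict on the test set, count the errors. Below is a minimal sketch, not a reference solution. The helper `error_rate` is hypothetical, and the data is a synthetic stand-in (11 Gaussian classes, 10 features, both assumed) so that the cell runs on its own. In the notebook you would pass `GNB`, `LDA`, `QDA` together with `x_train`, `y_train`, `x_test`, `y_test` instead.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


def error_rate(model, x_train, y_train, x_test, y_test):
    """Fit `model` on the training set and return the test misclassification rate."""
    y_pred = model.fit(x_train, y_train).predict(x_test)
    return float(np.mean(np.asarray(y_test) != y_pred))


# Synthetic stand-in data (assumption): one Gaussian blob per class
rng = np.random.default_rng(0)
n_classes, n_features = 11, 10
centers = rng.normal(scale=2.0, size=(n_classes, n_features))


def sample(n_per_class):
    """Draw n_per_class points around each class center."""
    y = np.repeat(np.arange(n_classes), n_per_class)
    x = centers[y] + rng.normal(size=(y.size, n_features))
    return x, y


x_tr, y_tr = sample(48)
x_te, y_te = sample(42)

models = {
    "GNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
errors = {name: error_rate(m, x_tr, y_tr, x_te, y_te) for name, m in models.items()}
for name, err in errors.items():
    print(f"{name}: test error rate = {100 * err:.1f} %")
```

The synthetic blobs are well separated, so the error rates printed here say nothing about the vowel data; only the workflow carries over.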

**E) Vowel-by-vowel analysis**
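
One possible way to look at the results class by class is the confusion matrix: row *i* counts the samples whose true class is *i*, and the diagonal holds the correctly classified ones. The sketch below assumes a hypothetical helper `per_class_error` and uses toy labels; in the notebook you would pass `y_test` and the predictions of one of the classifiers defined in A).

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def per_class_error(y_true, y_pred, labels):
    """Return the misclassification rate of each class listed in `labels`.

    Every label must occur at least once in y_true, otherwise the row sum is 0.
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    # Correct predictions sit on the diagonal; row sums are the class sizes
    return 1.0 - np.diag(cm) / cm.sum(axis=1)


# Toy labels (assumption), three classes with two samples each
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
labels = [1, 2, 3]

rates = per_class_error(y_true, y_pred, labels)
for label, rate in zip(labels, rates):
    print(f"class {label}: error rate = {100 * rate:.0f} %")
```

Here classes 1 and 3 each lose one sample out of two, while class 2 is fully recovered. The off-diagonal entries of the matrix additionally show which classes get confused with each other.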

<img src="http://media3.giphy.com/media/eNTxLwTGW7E64/giphy.gif" width = 300>
<a id="4"></a> 
 
# 4. k-NN

**A) The `neighbors` module**

In [None]:
from sklearn import neighbors

**B) neighbors vs QDA**

**C) KNN for k from 2 to 15**
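
A sweep over k can be sketched with `neighbors.KNeighborsClassifier`, fitting one model per value of k and recording its test error. The helper `knn_errors` is hypothetical, and the data is again a synthetic stand-in (11 Gaussian classes, 10 features, both assumed) so that the cell runs on its own; in the notebook you would pass `x_train`, `y_train`, `x_test`, `y_test`.

```python
import numpy as np
from sklearn import neighbors


def knn_errors(x_train, y_train, x_test, y_test, k_values=range(2, 16)):
    """Return {k: test misclassification rate} for each k in k_values."""
    errors = {}
    for k in k_values:
        knn = neighbors.KNeighborsClassifier(n_neighbors=k)
        y_pred = knn.fit(x_train, y_train).predict(x_test)
        errors[k] = float(np.mean(np.asarray(y_test) != y_pred))
    return errors


# Synthetic stand-in data (assumption): one Gaussian blob per class
rng = np.random.default_rng(0)
n_classes, n_features = 11, 10
centers = rng.normal(scale=2.0, size=(n_classes, n_features))
y_tr = np.repeat(np.arange(n_classes), 48)
y_te = np.repeat(np.arange(n_classes), 42)
x_tr = centers[y_tr] + rng.normal(size=(y_tr.size, n_features))
x_te = centers[y_te] + rng.normal(size=(y_te.size, n_features))

errors = knn_errors(x_tr, y_tr, x_te, y_te)
best_k = min(errors, key=errors.get)  # first k with the smallest error
for k, err in errors.items():
    print(f"k = {k:2d}: test error rate = {100 * err:.1f} %")
print("best k:", best_k)
```

Picking k by its test error gives an optimistic estimate, since the test set is then used for model selection; cross-validation on the training set is the usual alternative.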

**D) KNN3 VS best KNN**

**E) Writing your own KNN function**
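
One possible implementation, given as a sketch rather than a reference solution: compute the Euclidean distance from each new point to every training point, keep the k closest, and take a majority vote. The function name `knn_predict` and the toy data are assumptions. Ties are broken in favour of the label met first among the nearest neighbours, which may differ from what sklearn does.

```python
import numpy as np
from collections import Counter


def knn_predict(x_train, y_train, x_new, k=3):
    """Predict the label of each row of x_new by majority vote among its k nearest neighbours."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.atleast_2d(np.asarray(x_new, dtype=float))
    # Squared Euclidean distances, shape (n_new, n_train); the square root
    # is not needed because it does not change the ordering
    d2 = ((x_new[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points, nearest first
    nearest = np.argsort(d2, axis=1, kind="stable")[:, :k]
    preds = []
    for row in nearest:
        votes = Counter(y_train[row].tolist())
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)


# Toy data (assumption): two well-separated groups in the plane
x_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(x_toy, y_toy, [[0.2, 0.1], [5.5, 5.5]], k=3))
```

The full distance matrix is built in memory at once, which is fine for a dataset of a few hundred rows but would need chunking for large ones.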

<img src="http://media1.giphy.com/media/hbd8nlok7kqnS/giphy.gif" width = 300>
<a id="5"></a> 
 
# 5. Decision tree

The following documentation may help:

http://scikit-learn.org/stable/modules/tree.html

In [2]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
# Features: [height, hair-length, voice-pitch]
X = [ [180, 15,0],
      [167, 42,1],
      [136, 35,1],
      [174, 15,0],
      [141, 28,1]]
Y = ['man', 'woman', 'woman', 'man', 'woman']
clf = clf.fit(X, Y)
# Predict the label of a new sample
prediction = clf.predict([[133, 37,1]])
print(prediction)

['woman']
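
To see which rule the tree learned on the toy data above, `tree.export_text` prints the splits as plain text. This is a sketch under assumptions: the feature names and `random_state=0` are not part of the original example. Each of the three features separates the two classes on its own, so the seed only makes the choice among equally good splits reproducible.

```python
from sklearn import tree

# Same toy data as above: [height, hair-length, voice-pitch]
X = [[180, 15, 0], [167, 42, 1], [136, 35, 1], [174, 15, 0], [141, 28, 1]]
Y = ['man', 'woman', 'woman', 'man', 'woman']

# random_state is an assumption, used only for reproducibility
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, Y)

# Feature names are hypothetical labels used for display only
rules = tree.export_text(clf, feature_names=['height', 'hair_length', 'voice_pitch'])
print(rules)
print(clf.predict([[133, 37, 1]]))
```

Because a single threshold on any one feature is enough here, the fitted tree has depth 1. On the vowel data the tree will be much deeper unless it is constrained, for instance with the `max_depth` parameter.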
