# Learning scikit-learn 

## An Introduction to Machine Learning in Python

### at PyData Chicago 2016

**Enhancing** _Sebastian Raschka_'s code: https://github.com/rasbt/pydata-chicago2016-ml-tutorial

# Table of Contents


* [3 Introduction to Classification](#3-Introduction-to-Classification)
    * [The Iris dataset](#The-Iris-dataset)
    * [Class label encoding](#Class-label-encoding)
    * [Scikit-learn's built-in datasets](#Scikit-learn's-in-build-datasets)
    * [Test/train splits](#Test/train-splits)
    * [Logistic Regression](#Logistic-Regression)
    * [K-Nearest Neighbors](#K-Nearest-Neighbors)
    * [3 - Exercises](#3---Exercises)


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 3 Introduction to Classification

### The Iris dataset

In [None]:
df = pd.read_csv('dataset_iris.txt', 
                 encoding='utf-8', 
                 comment='#',
                 sep=',')
df.tail()

In [None]:
X = df.iloc[:, :4].values 
y = df['class'].values
np.unique(y)

### Class label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

l_encoder = LabelEncoder()
l_encoder.fit(y)
l_encoder.classes_

In [None]:
y_enc = l_encoder.transform(y)
np.unique(y_enc)

In [None]:
np.unique(l_encoder.inverse_transform(y_enc))
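
As a rough illustration of the idea behind label encoding, the sketch below maps string class labels to integers by hand and back again. It is not scikit-learn's implementation; the label strings are toy values assumed for the example, and `class_to_int` is a hypothetical helper name.

```python
# Toy labels (assumed for illustration; not read from dataset_iris.txt)
labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

# Sorted unique class names play the role of `classes_`
classes = sorted(set(labels))

# Map each class name to its position in the sorted list
class_to_int = {name: idx for idx, name in enumerate(classes)}

# "transform": strings -> integers
encoded = [class_to_int[name] for name in labels]

# "inverse_transform": integers -> strings
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
print(decoded == labels)
```

The round trip returning the original labels is the property that the `transform` / `inverse_transform` cells above rely on.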

### Scikit-learn's in-build datasets

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
print(iris['DESCR'])

### Test/train splits

In [None]:
X, y = iris.data[:, :2], iris.target
# Note: we use only the first 2 features (sepal length and width) so the decision regions can be plotted in 2D

print('Class labels:', np.unique(y))
print('Class proportions:', np.bincount(y))

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123,
        stratify=y)

print('Class labels:', np.unique(y_train))
print('Class proportions:', np.bincount(y_train))
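
To make the effect of `stratify=y` more concrete, here is a minimal pure-Python sketch of a stratified split: indices are grouped by class and the same fraction is held out from each group. This is an illustrative assumption about the general technique, not how scikit-learn implements it, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_split(y, test_size=0.3, seed=123):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        # Hold out the same fraction of every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Iris has 50 samples in each of its 3 classes
y_toy = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(y_toy)

train_counts = Counter(y_toy[i] for i in train_idx)
test_counts = Counter(y_toy[i] for i in test_idx)
print('Train counts:', sorted(train_counts.items()))
print('Test counts:', sorted(test_counts.items()))
```

With a balanced dataset, every class contributes the same number of samples to each subset, whereas a purely random split may drift away from the original proportions.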

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='newton-cg', 
                        multi_class='multinomial', 
                        random_state=1)

lr.fit(X_train, y_train)
print('Test accuracy %.2f' % lr.score(X_test, y_test))
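
A multinomial logistic regression model turns one linear score per class into class probabilities with the softmax function. The sketch below shows only that final step; the scores are made-up numbers, not values taken from the fitted `lr` model.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear scores for the three Iris classes
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted = probs.index(max(probs))
print(probs)
print('Predicted class:', predicted)
```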

In [None]:
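# plot_decision_regions is provided by mlxtend.plotting in current mlxtend releases
from mlxtend.plotting import plot_decision_regions

# Plot the decision regions and highlight the test samples
plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');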

### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier(n_neighbors=4)

kn.fit(X_train, y_train)
print('Test accuracy %.2f' % kn.score(X_test, y_test))

In [None]:
plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)
plt.xlabel('sepal length [cm]')
plt.ylabel('sepal width [cm]');

### 3 - Exercises

- Which of the two models above would you prefer if you had to choose? Why?
- What would be possible ways to resolve ties in KNN when `n_neighbors` is an even number?
- Can you find the right spot in the scikit-learn documentation to read about how scikit-learn handles this?
- Train and evaluate the Logistic Regression and KNN algorithms on the 4-dimensional Iris dataset.
  - What performance do you observe? 
  - Why is it different vs. using only 2 dimensions? 
  - Would adding more dimensions help?
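
For the tie-breaking question in the exercises above, one possible strategy (among several) is to let the closest neighbor among the tied classes decide. The pure-Python sketch below assumes this rule for illustration only; it does not describe what scikit-learn does, and `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=4):
    """Majority vote over the k nearest points; ties go to the nearest tied class."""
    # Sort training points by Euclidean distance to the query point
    by_distance = sorted(
        (math.dist(point, x), label) for point, label in zip(X_train, y_train)
    )
    neighbors = by_distance[:k]

    # Count votes per class among the k nearest neighbors
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}

    # Assumed tie-break rule: the closest neighbor belonging to a tied class wins
    for _, label in neighbors:
        if label in tied:
            return label


# Toy 2D data with two classes, so k=4 yields a 2-2 tie
X_toy = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
y_toy = [0, 0, 1, 1]

print(knn_predict(X_toy, y_toy, (1.4, 1.4)))
print(knn_predict(X_toy, y_toy, (2.0, 2.2)))
```

Other reasonable choices include weighting votes by inverse distance or simply using an odd `n_neighbors` for two-class problems.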
