# Data Modeling with Scikit-Learn

**Outline:**

* [Intro to Scikit-Learn](#Intro-to-Scikit-Learn)
* [Loading Iris Data](#Loading-Iris-Data)
* [Creating a Model](#Creating-a-Model)
* [Training and Testing a Model](#Training-and-Testing-a-Model)
  * [Performing Cross-Validation](#Performing-Cross-Validation)
* [Evaluating a Model](#Evaluating-a-Model)
* [Scikit-Learn Algorithm Cheat Sheet](#Scikit-Learn-Algorithm-Cheat-Sheet)

![](supervised-classification.png)
<div style="text-align: center;">
<strong>Credit:</strong> http://www.nltk.org/book/ch06.html
</div>

## Intro to Scikit-Learn

In [None]:
from IPython.display import HTML
HTML('<iframe src="https://scikit-learn.org/" width="800" height="350"></iframe>')

## Loading Iris Data

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')

In [None]:
df.head()

In [None]:
X = df.drop(['species'], axis=1)
y = df['species']

## Creating a Model

**Note:** Scikit-Learn estimators share a 4-step modeling pattern: import, instantiate, fit, predict.

### K-nearest neighbors (KNN) classification

**Step 1:** Import the model (import)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** Instantiate an estimator (instantiate)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

**Step 3:** Fit the model (fit)

In [None]:
knn.fit(X, y)

**Step 4:** Make a prediction (predict)

In [None]:
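# one new observation, with the same feature columns used for fitting
X_new = pd.DataFrame([[3, 5, 4, 2]], columns=X.columns)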
knn.predict(X_new)

In [None]:
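# two new observations at once, with the same feature columns used for fitting
X_new = pd.DataFrame([
    [3, 5, 4, 2],
    [5, 4, 3, 2],
], columns=X.columns)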
knn.predict(X_new)

### Try a different value for K

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=10)

# fit
knn.fit(X, y)

# predict
knn.predict(X_new)

### Use a different classification model

In [None]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression(max_iter=1000)  # extra iterations help the default solver converge

# fit
logreg.fit(X, y)

# predict
logreg.predict(X_new)

In [None]:
# import
from sklearn import svm

# instantiate
clf = svm.SVC()

# fit
clf.fit(X, y)

# predict
clf.predict(X_new)

## Training and Testing a Model

### Procedure 1: Train and test on the (same) entire dataset

In [None]:
from sklearn import metrics

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
X = df.drop(['species'], axis=1)
y = df['species']

### Logistic Regression

In [None]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression(max_iter=1000)  # extra iterations help the default solver converge

# fit
logreg.fit(X, y)

# predict
y_pred = logreg.predict(X)

metrics.accuracy_score(y, y_pred)

### KNN (K = 5)

In [None]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

### KNN (K = 1)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)
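
Testing on the training data tends to reward models that simply memorize it. With K = 1, every training point is its own nearest neighbor (distance zero), so the model reproduces the training labels. The sketch below illustrates this with a hand-written 1-nearest-neighbor function in plain Python; `predict_1nn` and the toy points are hypothetical and are not part of Scikit-Learn or of the iris data.

```python
import math


def predict_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    distances = [math.dist(point, query) for point in train_points]
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]


# Hypothetical toy data: four distinct 2-D points whose labels are
# deliberately noisy (close neighbors belong to different classes).
points = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.1), (2.9, 3.3)]
labels = ["a", "b", "a", "b"]

# "Test" on the very same points the model has memorized.
predictions = [predict_1nn(points, labels, p) for p in points]
training_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(training_accuracy)  # 1.0
```

The perfect score says nothing about how the model handles unseen observations, which is the motivation for holding out a test set.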

### Procedure 2: Train and test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
print(y_train.shape)
print(y_test.shape)
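
With 150 rows and `test_size=0.2`, the split yields 120 training rows and 30 testing rows. A random split does not guarantee that each species is equally represented in the test set. The sketch below shows one way to enforce that with the `stratify` argument of `train_test_split`. It is an illustration under assumptions: it uses the copy of the iris data bundled with Scikit-Learn (`load_iris`) instead of the CSV above, and the `_demo`, `_tr`, and `_te` names are made up for this example.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the iris data: 150 rows, 4 features, 3 balanced classes.
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (120, 4) (30, 4)
print(Counter(y_te.tolist()))  # ten samples of each class
```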

### Logistic Regression

In [None]:
# instantiate
logreg = LogisticRegression(max_iter=1000)  # extra iterations help the default solver converge

# fit
logreg.fit(X_train, y_train)

# predict
y_pred = logreg.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### KNN (K = 5)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X_train, y_train)

# predict
y_pred = knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### KNN (K = 1)

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X_train, y_train)

# predict
y_pred = knn.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

### Find a better value for K

In [None]:
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

In [None]:
scores

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
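
Reading the best K off a plot can be imprecise when several values tie. The sketch below lists every K that reaches the maximum accuracy. The helper `best_k_values` is hypothetical, and the scores are made up for illustration; they are not results from the iris data.

```python
def best_k_values(k_values, accuracies):
    """Return every K whose accuracy ties for the maximum."""
    best = max(accuracies)
    return [k for k, acc in zip(k_values, accuracies) if acc == best]


# Hypothetical accuracies for K = 1..6, made up for illustration only.
demo_k_range = range(1, 7)
demo_scores = [0.93, 0.95, 0.97, 0.97, 0.97, 0.95]

candidates = best_k_values(demo_k_range, demo_scores)
print(candidates)  # [3, 4, 5]
```

In the notebook, the equivalent call would be `best_k_values(k_range, scores)`. When several values tie, choosing one from the middle of the plateau is a common heuristic rather than a rule.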

### Select the best value for K

In [None]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=11)

# fit
knn.fit(X, y)

# predict
X_new = pd.DataFrame([[3, 5, 4, 2]], columns=X.columns)
knn.predict(X_new)

### Performing Cross-Validation

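Cross-validation averages the score over several train/test splits, which gives a more stable estimate of out-of-sample accuracy than a single split. It can be used for:

* Parameter tuning
* Model selection
* Feature selection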

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)

print('{} {:^65} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf.split(range(0, 25)), start=1):
    print('{:^10} {} {:^30}'.format(iteration, str(data[0]), str(data[1])))
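
To make the table above less of a black box, here is a plain-Python sketch of how unshuffled K-fold splitting can assign indices: each fold takes the next contiguous block of samples as its testing set, and everything else is the training set. The function `contiguous_folds` is a hypothetical helper written for this illustration; it mirrors the behavior of `KFold(shuffle=False)` but is not Scikit-Learn's implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for unshuffled K-fold splitting."""
    # When n_samples is not divisible by n_splits, the first
    # n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(contiguous_folds(25, 5))
for iteration, (train, test) in enumerate(folds, start=1):
    print(iteration, test)
```

Every observation appears in exactly one testing set, so each one is used for testing once and for training in the remaining folds.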

In [None]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
print(scores)

In [None]:
print(scores.mean())

Next, vary K and compare the mean cross-validated accuracy for each value.

In [None]:
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

print(k_scores)

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
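
The same procedure supports model selection: score each candidate with identical folds and compare the means. The sketch below compares KNN with logistic regression. It is an illustration under assumptions: it uses the iris copy bundled with Scikit-Learn (`load_iris`) instead of the CSV above, and `max_iter=1000` and the `_demo` names are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Candidate models to compare under the same 10-fold cross-validation.
candidates = {
    "knn (K=20)": KNeighborsClassifier(n_neighbors=20),
    "logistic regression": LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

Small differences between means may not be meaningful on a dataset of only 150 rows, so it can help to look at the per-fold scores as well.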

## Evaluating a Model

UCI Machine Learning Repository: [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/Spambase)

In [None]:
import pandas as pd

In [None]:
spam = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data', header=None)

In [None]:
spam.head()

In [None]:
X = spam.drop(57, axis=1)
y = spam[57]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
from sklearn.linear_model import LogisticRegression

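# a higher max_iter reduces the chance of a convergence warning on these unscaled features
logreg = LogisticRegression(max_iter=1000)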

logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [None]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))

**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class

In [None]:
pd.Series(y_test).value_counts()

In [None]:
pd.Series(y_test).value_counts().head(1) / len(y_test)
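
As a cross-check on the definition, null accuracy can also be computed without pandas. The sketch below uses a hypothetical helper, `null_accuracy`, and made-up labels (not the Spambase data) to show the arithmetic.

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy of a baseline that always predicts the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 6 ham (0) and 4 spam (1), for illustration only.
demo_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(null_accuracy(demo_labels))  # 0.6
```

A classifier is only useful if its accuracy clearly exceeds this baseline.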

### Confusion Matrix

In [None]:
print(metrics.confusion_matrix(y_test, y_pred_class))

In [None]:
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
print(metrics.recall_score(y_test, y_pred_class))

In [None]:
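confusion = metrics.confusion_matrix(y_test, y_pred_class)
# rows are actual classes, columns are predicted classes
TP = confusion[1, 1]  # actual 1, predicted 1
TN = confusion[0, 0]  # actual 0, predicted 0
FP = confusion[0, 1]  # actual 0, predicted 1
FN = confusion[1, 0]  # actual 1, predicted 0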

In [None]:
# precision: of the messages predicted as spam, the fraction that really are spam
print(TP / (TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
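
Accuracy, recall, and precision can all be derived from the same four cells of the confusion matrix. The sketch below collects the formulas in one hypothetical helper, `summarize_confusion`; the counts are made up for illustration and are not Spambase results.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "recall": tp / (tp + fn),        # share of actual positives that are found
        "specificity": tn / (tn + fp),   # share of actual negatives that are found
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
    }


# Hypothetical counts, made up for illustration only.
demo = summarize_confusion(tn=50, fp=10, fn=5, tp=35)
for name, value in demo.items():
    print(f"{name}: {value:.3f}")
```

With the variables defined above, the equivalent call would be `summarize_confusion(TN, FP, FN, TP)`.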

### Classification Report

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_class, target_names=['ham', 'spam']))

## Scikit-Learn Algorithm Cheat Sheet

![](scikit-learn-algorithm-cheat-sheet.png)
<div style="text-align: center;">
<strong>Credit:</strong> http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
</div>