# Brief Intro to Machine Learning: Data Modeling with Scikit-Learn

**Outline:**

* [Loading Iris Data](#Loading-Iris-Data)
* [Creating a Model](#Creating-a-Model)
* [Training and Testing a Model](#Training-and-Testing-a-Model)
* [Evaluating a Model](#Evaluating-a-Model)

## Loading Iris Data

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('data/iris.csv')

In [4]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [5]:
X = df.drop(['species'], axis=1)
y = df['species']

## Creating a Model

**Note:** scikit-learn estimators follow a 4-step modeling pattern: import, instantiate, fit, predict.

### K-nearest neighbors (KNN) classification

**Step 1:** Import the model (import)

In [6]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** Instantiate an estimator (instantiate)

In [11]:
knn = KNeighborsClassifier(n_neighbors=3)

**Step 3:** Fit the model (fit)

In [12]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

**Step 4:** Make a prediction (predict)

In [13]:
# one new sample: sepal_length, sepal_width, petal_length, petal_width
X_new = [[3, 5, 4, 2]]
knn.predict(X_new)

array(['versicolor'], dtype=object)

In [14]:
X_new = [
    [3, 5, 4, 2], 
    [5, 4, 3, 2]
]
knn.predict(X_new)

array(['versicolor', 'versicolor'], dtype=object)

### Try a different value for K

In [15]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=10)

# fit
knn.fit(X, y)

# predict
knn.predict(X_new)

array(['versicolor', 'versicolor'], dtype=object)
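To make the role of K concrete, here is a minimal pure-Python sketch of the idea behind KNN classification: find the K closest training points and take a majority vote over their labels. This is not scikit-learn's implementation; the function name `knn_predict` and the one-feature toy data are hypothetical and chosen only so that the vote can be checked by hand.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Collect the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one feature, two classes.
points = [[1.0], [1.2], [3.0], [3.1], [3.2]]
labels = ["a", "a", "b", "b", "b"]

print(knn_predict(points, labels, [1.5], k=1))  # nearest point is [1.2] -> "a"
print(knn_predict(points, labels, [1.5], k=5))  # all five vote: 2 "a" vs 3 "b" -> "b"
```

A small K follows the closest points, while a large K drifts toward the majority class of the whole training set, so the same query can receive different labels.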

### Use a different classification model

In [16]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X, y)

# predict
logreg.predict(X_new)

array(['virginica', 'setosa'], dtype=object)

In [17]:
# import
from sklearn import svm

# instantiate
clf = svm.SVC()

# fit
clf.fit(X, y)

# predict
clf.predict(X_new)

array(['virginica', 'versicolor'], dtype=object)

## Training and Testing a Model

### Procedure 1: Train and test on the (same) entire dataset

In [18]:
from sklearn import metrics

# Reload the CSV so this section can run on its own.
# Alternative: sklearn.datasets.load_iris() returns the same measurements
# as iris.data, with integer class codes in iris.target.
df = pd.read_csv('data/iris.csv')

X = df.drop(['species'], axis=1)
y = df['species']

### Logistic Regression

In [19]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X, y)

# predict
y_pred = logreg.predict(X)

metrics.accuracy_score(y, y_pred)

0.96

### KNN (K = 5)

In [20]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

0.9666666666666667

### KNN (K = 1)

In [21]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

1.0
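A perfect score is expected when K = 1 and the model is tested on its own training data: every point's nearest neighbor is itself, at distance 0, so the model simply returns the label it memorized. The sketch below shows this on hypothetical toy data in pure Python. It assumes no two training points share the same coordinates, and it is an illustration rather than scikit-learn's implementation.

```python
import math

# Hypothetical toy data; the third point sits among class "a" but is labeled "b".
points = [[1.0], [1.1], [1.2], [3.0], [3.1]]
labels = ["a", "a", "b", "b", "b"]

def one_nn_predict(query):
    """Return the label of the single closest training point (K = 1)."""
    nearest = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return labels[nearest]

# Evaluate on the same points the model has memorized.
predictions = [one_nn_predict(p) for p in points]
training_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(training_accuracy)  # 1.0: every point is its own nearest neighbor
```

Training accuracy therefore rewards memorization, including memorization of noisy labels, and says little about how the model handles unseen samples.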

### Procedure 2: Train and test split

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [23]:
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)


In [24]:
print(y_train.shape)
print(y_test.shape)

(120,)
(30,)
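The shapes above follow from `test_size=0.2`: 20% of the 150 samples (30) are held out and the remaining 120 are used for training. As a rough sketch of the idea, the split can be imitated with a seeded shuffle of the row indices. The helper `simple_train_test_split` is hypothetical, and it will not reproduce the exact rows that scikit-learn selects for `random_state=4`.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=4):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = round(len(rows) * test_size)     # simplification of the size calculation
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(150))  # stand-in for the 150 iris samples
train, test = simple_train_test_split(rows)
print(len(train), len(test))  # 120 30
```

Because no row appears in both parts, accuracy measured on the held-out part estimates performance on data the model has not seen.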


### Logistic Regression

In [25]:
# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X_train, y_train)

# predict
y_pred = logreg.predict(X_test)

metrics.accuracy_score(y_test, y_pred)

0.9333333333333333

## Evaluating a Model

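Data: [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/Spambase) from the UCI Machine Learning Repository. The file has no header row; columns 0 to 56 are features and column 57 is the class label (1 = spam, 0 = not spam).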

In [68]:
import pandas as pd

In [69]:
spam = pd.read_csv('data/spambase.csv', header=None)

In [70]:
spam.head()

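| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |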


In [71]:
X = spam.drop(57, axis=1)
y = spam[57]

In [72]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [73]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [74]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))

0.9087749782797567


**Null accuracy:** the accuracy achieved by always predicting the most frequent class. It is the baseline a useful classifier has to beat.

In [75]:
pd.Series(y_test).value_counts()

0    691
1    460
Name: 57, dtype: int64

In [76]:
pd.Series(y_test).value_counts().head(1) / len(y_test)

0    0.600348
Name: 57, dtype: float64
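The same number can be reproduced by hand from the class counts printed above (691 zeros, 460 ones). The sketch below rebuilds a label list with those counts, which is an assumption made only for illustration, and divides the size of the largest class by the total.

```python
from collections import Counter

# Class counts reported above for y_test: 691 zeros and 460 ones.
y_test_labels = [0] * 691 + [1] * 460

counts = Counter(y_test_labels)
most_frequent_class, most_frequent_count = counts.most_common(1)[0]
null_accuracy = most_frequent_count / len(y_test_labels)

print(most_frequent_class)       # 0
print(round(null_accuracy, 6))   # 0.600348
```

A model that labels every message as non-spam would already be right about 60% of the time, so the logistic regression accuracy of about 0.909 should be read against that baseline rather than against zero.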

In [77]:
print(metrics.accuracy_score(y_test, y_pred_class))

0.9087749782797567


In [79]:
print(metrics.recall_score(y_test, y_pred_class))

0.8652173913043478


In [80]:
metrics.f1_score(y_test, y_pred_class)

0.8834628190899
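The three scores are tied together by the confusion counts. The counts below are not printed in the notebook; they are inferred here from the reported accuracy, recall, and class counts (460 spam and 691 non-spam test messages), treating class 1 as the positive class. Take this as a worked check rather than the notebook's own output.

```python
# Confusion counts inferred from the scores and class counts reported above
# (an assumption of this sketch; the notebook never prints them).
tp, fn = 398, 62   # actual spam (460 total): caught vs. missed
tn, fp = 648, 43   # actual non-spam (691 total): passed vs. wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
recall = tp / (tp + fn)                      # share of actual spam that was caught
precision = tp / (tp + fp)                   # share of flagged messages that are spam
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4))   # 0.9088
print(round(recall, 4))     # 0.8652
print(round(precision, 4))  # 0.9025
print(round(f1, 4))         # 0.8835
```

Under these counts the model misses 62 spam messages and wrongly flags 43 legitimate ones, a distinction that the single accuracy figure hides.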