# Brief Intro to Machine Learning: Data Modeling with Scikit-Learn

**Outline:**

* [Loading Iris Data](#Loading-Iris-Data)
* [Creating a Model](#Creating-a-Model)
* [Training and Testing a Model](#Training-and-Testing-a-Model)
* [Evaluating a Model](#Evaluating-a-Model)

## Loading Iris Data

UCI Machine Learning Repository: [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris)

In [1]:
import pandas as pd

In [2]:
iris_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv(iris_data_url, names=columns)

In [3]:
df.head()

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
X = df.drop(['class'], axis='columns')
y = df['class']
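
The same measurements also ship with scikit-learn, which helps when the UCI URL is unreachable. A minimal sketch, assuming scikit-learn 0.23 or later for `as_frame=True`; the names `iris_df`, `X_alt`, and `y_alt` are illustrative. The bundled copy uses different column names and integer class codes, and according to the scikit-learn documentation it corrects two rows relative to the UCI file, so results may differ slightly.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
iris_df = iris.frame  # four feature columns plus a 'target' column

# Same feature/label split as above, on the bundled copy
X_alt = iris_df.drop(columns=['target'])
y_alt = iris_df['target']  # integer codes 0, 1, 2 rather than 'Iris-setosa', ...

print(X_alt.shape, y_alt.shape)
print(list(iris.target_names))
```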

## Creating a Model

**Note:** scikit-learn estimators share a 4-step modeling pattern: import, instantiate, fit, predict.

### K-nearest neighbors (KNN) classification

**Step 1:** Import the model (import)

In [6]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** Instantiate an estimator (instantiate)

In [7]:
knn = KNeighborsClassifier(n_neighbors=3)

**Step 3:** Fit the model (fit)

In [8]:
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

**Step 4:** Make a prediction (predict)

In [9]:
X_new = [[3, 5, 4, 2]]
knn.predict(X_new)

array(['Iris-versicolor'], dtype=object)

In [10]:
X_new = [
    [3, 5, 4, 2], 
    [5, 4, 3, 2]
]
knn.predict(X_new)

array(['Iris-versicolor', 'Iris-versicolor'], dtype=object)
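
To see where a KNN prediction comes from, the fitted estimator can return the K nearest training points, whose labels are then put to a majority vote. A rough sketch, assuming the Iris copy bundled with scikit-learn (its class names lack the `Iris-` prefix); `X_demo`, `y_demo`, `knn_demo`, and `X_query` are illustrative names.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_demo = iris.data
y_demo = iris.target_names[iris.target]  # string labels such as 'setosa'

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

X_query = [[3, 5, 4, 2]]

# kneighbors returns distances to, and row indices of, the 3 closest training points
distances, indices = knn_demo.kneighbors(X_query)
neighbor_labels = y_demo[indices[0]]

# With uniform weights, the prediction is the most common label among the neighbors
vote = Counter(neighbor_labels).most_common(1)[0][0]

print(neighbor_labels, distances[0].round(2))
print(vote, knn_demo.predict(X_query)[0])
```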

### Try a different value for K

In [11]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=10)

# fit
knn.fit(X, y)

# predict
knn.predict(X_new)

array(['Iris-versicolor', 'Iris-versicolor'], dtype=object)

### Use a different classification model

In [12]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X, y)

# predict
logreg.predict(X_new)



array(['Iris-virginica', 'Iris-setosa'], dtype=object)

In [13]:
from sklearn import svm

clf = svm.SVC()

clf.fit(X, y)

clf.predict(X_new)



array(['Iris-virginica', 'Iris-versicolor'], dtype=object)
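
Because every estimator exposes the same `fit` and `predict` methods, swapping models only changes the import and instantiate steps. A sketch that loops over the three classifiers used so far, assuming the Iris copy bundled with scikit-learn; `max_iter=1000` is an added assumption to avoid convergence warnings and is not part of the notebook, so the predictions may not match the outputs above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_query = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Instantiate: only this step differs between models
models = {
    'knn': KNeighborsClassifier(n_neighbors=3),
    'logreg': LogisticRegression(max_iter=1000),  # max_iter is an assumption
    'svc': SVC(),
}

predictions = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)                   # fit
    predictions[name] = model.predict(X_query)  # predict
    print(name, predictions[name])
```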

## Training and Testing a Model

### Procedure 1: Train and test on the (same) entire dataset

In [14]:
# metrics provides accuracy_score and the other scoring functions used below
from sklearn import metrics

iris_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv(iris_data_url, names=columns)

X = df.drop(['class'], axis='columns')
y = df['class']

### Logistic Regression

In [15]:
# import
from sklearn.linear_model import LogisticRegression

# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X, y)

# predict
y_pred = logreg.predict(X)

metrics.accuracy_score(y, y_pred)



0.96

### KNN (K = 5)

In [16]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=5)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

0.9666666666666667

### KNN (K = 1)

In [17]:
# instantiate
knn = KNeighborsClassifier(n_neighbors=1)

# fit
knn.fit(X, y)

# predict
y_pred = knn.predict(X)

metrics.accuracy_score(y, y_pred)

1.0
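
A perfect score with K = 1 is expected rather than impressive: when the model is tested on its own training data, each point's nearest neighbor is the point itself, so the stored label is simply returned. A small check of that claim, assuming the Iris copy bundled with scikit-learn and illustrative names (`X_demo`, `knn_1`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)

# Distance from every training point to its single nearest training point
distances, _ = knn_1.kneighbors(X_demo)

# All distances are zero: each point matches itself (or an exact duplicate)
all_zero = np.allclose(distances, 0.0, atol=1e-6)
print(all_zero)
```

Training accuracy therefore says little about how the model handles unseen data.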

### Procedure 2: Train and test split

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [19]:
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)


In [20]:
print(y_train.shape)
print(y_test.shape)

(120,)
(30,)


### Logistic Regression

In [21]:
# instantiate
logreg = LogisticRegression()

# fit
logreg.fit(X_train, y_train)

# predict
y_pred = logreg.predict(X_test)

metrics.accuracy_score(y_test, y_pred)



0.9333333333333333
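
The same split can be used to compare KNN settings on data the model has not seen. A sketch using the split parameters from above (`test_size=0.2`, `random_state=4`), assuming the Iris copy bundled with scikit-learn; `scores` and the other variable names are illustrative, and the printed values may differ slightly from those obtained with the UCI file.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

scores = {}
for k in (1, 5, 10):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Training accuracy: measured on the rows the model was fit on
    train_acc = metrics.accuracy_score(y_tr, knn_k.predict(X_tr))
    # Testing accuracy: measured on the held-out rows
    test_acc = metrics.accuracy_score(y_te, knn_k.predict(X_te))
    scores[k] = (train_acc, test_acc)
    print(k, round(train_acc, 3), round(test_acc, 3))
```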

## Evaluating a Model

UCI Machine Learning Repository: [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/Spambase)

In [68]:
import pandas as pd

In [22]:
spam_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
spam = pd.read_csv(spam_data_url, header=None)

In [23]:
spam.head()

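| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |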


In [24]:
X = spam.drop(57, axis='columns')
y = spam[57]

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [26]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)



In [27]:
from sklearn import metrics

print(metrics.accuracy_score(y_test, y_pred_class))

0.9035621198957429


**Null accuracy:** the accuracy achieved by always predicting the most frequent class. It is the baseline a classifier has to beat before its accuracy means anything.

In [28]:
pd.Series(y_test).value_counts()

0    691
1    460
Name: 57, dtype: int64

In [29]:
pd.Series(y_test).value_counts().head(1) / len(y_test)

0    0.600348
Name: 57, dtype: float64
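
The null accuracy can also be computed in one step with `value_counts(normalize=True)`. A small sketch that rebuilds a label series from the class counts shown above (691 and 460) rather than reloading Spambase; `y_example` is an illustrative stand-in for `y_test`.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
y_example = pd.Series([0] * 691 + [1] * 460)

# Share of the most frequent class = accuracy of always predicting that class
null_accuracy = y_example.value_counts(normalize=True).max()

print(round(null_accuracy, 6))  # 0.600348
```

Against this baseline of about 0.60, the logistic regression accuracy of about 0.90 is a substantial improvement.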

In [30]:
print(metrics.accuracy_score(y_test, y_pred_class))

0.9035621198957429


In [31]:
print(metrics.recall_score(y_test, y_pred_class))

0.8586956521739131


In [32]:
metrics.f1_score(y_test, y_pred_class)

0.876803551609323
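
Recall and F1 follow directly from confusion-matrix counts, with class 1 treated as the positive class. The counts below are not printed by the notebook; they are a reconstruction inferred from the reported scores and the 691/460 class split, so treat them as an assumption.

```python
# Inferred counts (assumption): positives are class 1, negatives are class 0
tp, fn = 395, 65   # 460 actual positives
tn, fp = 645, 46   # 691 actual negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)         # share of actual positives that were found
precision = tp / (tp + fp)      # share of predicted positives that were correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, recall, precision, f1)
```

These counts reproduce the accuracy, recall, and F1 values reported above; precision is not printed by the notebook and is derived here only to compute F1.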