# Brief Intro to Machine Learning: Data Modeling with Scikit-Learn

**Outline:**

* [Loading Iris Data](#Loading-Iris-Data)
* [Creating a Model](#Creating-a-Model)
* [Training, Testing, and Evaluating a Model](#Training,-Testing,-and-Evaluating-a-Model)

## Loading Iris Data

UCI Machine Learning Repository: [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris)

In [None]:
import pandas as pd

In [None]:
iris_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv(iris_data_url, names=columns)

In [None]:
df.head()

In [None]:
X = df.drop(['class'], axis='columns')
y = df['class']
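
If the UCI URL is unreachable, the copy of Iris bundled with scikit-learn can stand in for the download. This is a sketch under that assumption, not part of the original notebook: the bundled column names include units (for example `sepal length (cm)`), a few values may differ slightly from the UCI file, and the names `df_local`, `X_local`, and `y_local` are illustrative.

```python
from sklearn.datasets import load_iris

# Load the bundled dataset as pandas objects (assumption: replaces the UCI CSV)
iris = load_iris(as_frame=True)
df_local = iris.frame.copy()  # four feature columns plus an integer 'target' column

# Map integer targets (0, 1, 2) to species names so there is a 'class' column
df_local['class'] = df_local['target'].map(dict(enumerate(iris.target_names)))

# Same feature/label split as above
X_local = df_local.drop(['target', 'class'], axis='columns')
y_local = df_local['class']

print(X_local.shape)
print(y_local.nunique())
```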

---

## Creating a Model

**Note:** scikit-learn estimators share a 4-step modeling pattern: import, instantiate, fit, predict.

### K-nearest neighbors (KNN) classification

**Step 1:** Import the model (import)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** Instantiate an estimator (instantiate)

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

**Step 3:** Fit the model (fit)

In [None]:
knn.fit(X, y)

**Step 4:** Make a prediction (predict)

In [None]:
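# One unseen sample, using the same column names the model was trained on
X_new = pd.DataFrame([[3, 5, 4, 2]], columns=X.columns)
knn.predict(X_new)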

In [None]:
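# Several samples at once; keeping the training column names avoids
# scikit-learn's feature-name warning
X_new = pd.DataFrame(
    [[3, 5, 4, 2], [5, 4, 3, 2]],
    columns=X.columns,
)
knn.predict(X_new)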
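
The four steps above can be folded into a small helper that works with any scikit-learn classifier. This is a minimal sketch: `fit_and_predict`, `X_demo`, and `y_demo` are hypothetical names, and the bundled `load_iris` data is assumed in place of the UCI download so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier  # step 1: import


def fit_and_predict(estimator, X, y, X_new):
    """Fit the estimator on (X, y) and return its predictions for X_new."""
    estimator.fit(X, y)              # step 3: fit
    return estimator.predict(X_new)  # step 4: predict


# Bundled data as NumPy arrays; targets are integers 0, 1, 2
X_demo, y_demo = load_iris(return_X_y=True)

knn_demo = KNeighborsClassifier(n_neighbors=3)  # step 2: instantiate
preds = fit_and_predict(knn_demo, X_demo, y_demo, [[3, 5, 4, 2], [5, 4, 3, 2]])
print(preds)
```

Because every estimator exposes the same `fit` and `predict` methods, swapping in another model only changes the import and instantiate steps.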

### Try a different value for K

In [None]:
# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
knn = KNeighborsClassifier(n_neighbors=10)

# fit
knn.fit(X, y)

# predict
knn.predict(X_new)
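
To see how the choice of K affects the result, the same two samples can be predicted under several values in one loop. This is a sketch only: the K values, the name `predictions_by_k`, and the bundled `load_iris` data (integer class labels) are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_new_demo = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Refit one model per K and record its predictions for the same samples
predictions_by_k = {}
for k in (1, 3, 10, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    predictions_by_k[k] = model.predict(X_new_demo).tolist()

print(predictions_by_k)
```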

### Use a different classification model

In [None]:
# import
from sklearn.linear_model import LogisticRegression

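# instantiate (max_iter raised above the default of 100, which may be too few
# iterations for the default lbfgs solver on unscaled features)
logreg = LogisticRegression(max_iter=1000)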

# fit
logreg.fit(X, y)

# predict
logreg.predict(X_new)

In [None]:
# import
from sklearn import svm

# instantiate
clf = svm.SVC()

# fit
clf.fit(X, y)

# predict
clf.predict(X_new)
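
Since the three models follow the same pattern, they can be compared side by side on the same samples. The sketch below assumes the bundled `load_iris` data (integer class labels); the dictionary names and `max_iter=1000` are illustrative choices, not settings from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_new_demo = [[3, 5, 4, 2], [5, 4, 3, 2]]

# One entry per model; each is fit on the full dataset
models = {
    'knn': KNeighborsClassifier(n_neighbors=3),
    'logreg': LogisticRegression(max_iter=1000),
    'svc': SVC(),
}

predictions_by_model = {
    name: model.fit(X_demo, y_demo).predict(X_new_demo).tolist()
    for name, model in models.items()
}
print(predictions_by_model)
```

Agreement between models on a sample says nothing about accuracy by itself; that requires held-out data.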

---

## Training, Testing, and Evaluating a Model

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [None]:
iris_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
df = pd.read_csv(iris_data_url, names=columns)

In [None]:
X = df.drop(['class'], axis='columns')
y = df['class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [None]:
print(X_train.shape)
print(X_test.shape)

In [None]:
print(y_train.shape)
print(y_test.shape)
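
With 150 rows and `test_size=0.2`, the split leaves 120 rows for training and 30 for testing. A plain random split does not guarantee balanced classes in the test set; passing `stratify` is one way to enforce that. The sketch below uses the bundled `load_iris` data and illustrative variable names.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# stratify keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
test_class_counts = Counter(y_te.tolist())
print(test_class_counts)
```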

In [None]:
# import
from sklearn.linear_model import LogisticRegression

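# instantiate (max_iter raised above the default of 100, which may be too few
# iterations for the default lbfgs solver on unscaled features)
logreg = LogisticRegression(max_iter=1000)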

# fit
logreg.fit(X_train, y_train)

# predict
y_pred = logreg.predict(X_test)

In [None]:
print(metrics.accuracy_score(y_test, y_pred))
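
Accuracy is the fraction of test samples whose predicted class matches the true class, so it can be checked by hand. This sketch assumes the bundled `load_iris` data and `max_iter=1000`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Fraction of correct predictions, computed two ways
manual_accuracy = np.mean(y_hat == y_te)
library_accuracy = accuracy_score(y_te, y_hat)
print(manual_accuracy, library_accuracy)
```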

**Null accuracy:** the accuracy achieved by always predicting the most frequent class in the test set. It is the baseline that a useful model should beat.

In [None]:
# y_test is already a pandas Series, so no wrapping is needed
y_test.value_counts()

In [None]:
# Share of the most frequent class in the test set
y_test.value_counts(normalize=True).head(1)
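
The null accuracy can also be obtained by building the "always predict the most frequent class" predictions explicitly and scoring them like any other model. This is a sketch using the bundled `load_iris` data (integer class labels) with illustrative names. The most frequent class is taken from the test labels here, matching the calculation above; a baseline fit on the training labels could pick a different class.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)
_, _, _, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Predict the most frequent test-set class for every sample
most_frequent = y_te.mode()[0]
baseline_pred = [most_frequent] * len(y_te)

null_accuracy = accuracy_score(y_te, baseline_pred)
largest_share = y_te.value_counts(normalize=True).max()
print(null_accuracy, largest_share)
```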