In [2]:
import pandas as pd

# Training a Machine Learning Classification Model with scikit-learn

**Scikit-learn** is a Python library that provides efficient implementations of a large number of common Machine Learning algorithms.

See the [scikit-learn documentation](https://scikit-learn.org/stable/).

**Classification:** Identifying which category an object belongs to. Example applications:
- spam detection,
- image recognition (cat vs dog),
- credit card fraud detection.

**Model:** A Machine Learning algorithm. Examples:

- Linear Models
- k-nearest neighbors Model
- Decision Trees
- Neural Networks
- etc

Let us consider the Iris dataset.

## Loading the iris dataset

In [3]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/iris.csv'
iris = pd.read_csv(url, names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species'])
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Each row of the data refers to a single observed flower. 

In [4]:
len(iris) # number of flowers (aka samples or observations)

150

Each column holds one piece of information describing each sample.
The first four columns are the features (sepal length, sepal width, petal length, petal width).
The last column, the iris species, is the target variable.

In [5]:
iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [6]:
iris.species.value_counts()

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: species, dtype: int64

**Goal:** Use the four features (sepal length, sepal width, petal length, petal width) to predict the species of a flower.

## Classification Example: The k-nearest neighbors (knn) algorithm

The steps for using a scikit-learn model are as follows:

**Step 0:** Arrange the data into a features matrix and a target vector

In [7]:
# features matrix
X = iris.iloc[:,[0,1,2,3]]
X.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [8]:
# target vector
y = iris.iloc[:,-1]
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: species, dtype: object
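Selecting columns by name instead of by position is a common alternative that keeps working if the column order ever changes. The sketch below uses a tiny hand-built stand-in for the iris DataFrame (the first three rows shown above); the names `iris_small` and `feature_cols` are introduced here purely for illustration.

```python
import pandas as pd

# Tiny stand-in for the iris DataFrame (first three rows shown above)
iris_small = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'petal_length': [1.4, 1.4, 1.3],
    'petal_width': [0.2, 0.2, 0.2],
    'species': ['Iris-setosa'] * 3,
})

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

X_by_name = iris_small[feature_cols]  # features matrix: 2-D, one row per flower
y_by_name = iris_small['species']     # target vector: 1-D, one label per flower

# Positional selection, as in the cells above, gives the same objects
X_by_position = iris_small.iloc[:, [0, 1, 2, 3]]
y_by_position = iris_small.iloc[:, -1]

print(X_by_name.shape, y_by_name.shape)
```

Either form works with scikit-learn; what matters is that X is two-dimensional (samples by features) and y has one entry per sample.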

**Step 1:** Choose a class of model by importing the appropriate estimator class from scikit-learn

In [15]:
from sklearn.neighbors import KNeighborsClassifier

Documentation page for [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

**Step 2:** Initialize the model (create an instance of the estimator class)

In [20]:
knn_clf = KNeighborsClassifier(n_neighbors=3, weights='uniform')

Three notes:

1. The name of the object (here knn_clf) does not matter.

2. Tuning parameters (hyperparameters), such as n_neighbors and weights, can be specified during this step.

3. All parameters not specified are set to their default values.
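To see which defaults were filled in, every scikit-learn estimator exposes a `get_params()` method. A minimal sketch (the variable names `clf` and `params` are chosen here for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Only n_neighbors is set explicitly; every other parameter keeps its default
clf = KNeighborsClassifier(n_neighbors=3)
params = clf.get_params()

# Print all parameters, including the ones that were not specified
for name, value in sorted(params.items()):
    print(f"{name} = {value!r}")
```

Since 'uniform' is the default value of weights, passing weights='uniform' explicitly creates the same model as leaving it out.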

**Step 3:** Fit the model to the data (aka "model training") with the fit() method

In [21]:
knn_clf.fit(X,y)

KNeighborsClassifier(n_neighbors=3)
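The cell output above is the estimator itself: `fit()` returns the object it was called on, and after fitting the estimator carries learned attributes whose names end in an underscore. The sketch below assumes scikit-learn's bundled copy of iris (`load_iris`) so that it runs without downloading the CSV; its species labels are 'setosa', 'versicolor', 'virginica' rather than the 'Iris-...' strings used in this notebook. The names `X_demo`, `y_demo` and `clf` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use the copy of iris bundled with scikit-learn (no download needed)
bunch = load_iris()
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_demo = pd.DataFrame(bunch.data, columns=feature_cols)
y_demo = pd.Series(bunch.target_names[bunch.target], name='species')

clf = KNeighborsClassifier(n_neighbors=3)
fitted = clf.fit(X_demo, y_demo)

print(fitted is clf)       # fit() returns the estimator itself
print(clf.classes_)        # class labels seen during training
print(clf.n_features_in_)  # number of feature columns seen during training
```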

**Step 4:** Apply the model to new data with the predict() method

In [24]:
new_flowers = [[3,5,4,2], # new flower 1: sepal_length=3, sepal_width=5, petal_length=4, petal_width = 2 
               [5,4,3,1]] # new flower 2: sepal_length=5, sepal_width=4, petal_length=3, petal_width = 1 
knn_clf.predict(new_flowers)

array(['Iris-versicolor', 'Iris-setosa'], dtype=object)
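Two optional refinements of the prediction step are sketched below. First, recent scikit-learn versions may emit a feature-name warning when a model fitted on a DataFrame is asked to predict on a plain list, so the new flowers are wrapped in a DataFrame with the same column names. Second, `predict_proba()` reports the share of neighbors voting for each class. As an assumption, the sketch uses scikit-learn's bundled iris data (`load_iris`) instead of the CSV, so the labels lack the 'Iris-' prefix and the predictions are not guaranteed to match the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data stands in for the CSV used in this notebook
bunch = load_iris()
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_demo = pd.DataFrame(bunch.data, columns=feature_cols)
y_demo = pd.Series(bunch.target_names[bunch.target], name='species')

clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# New flowers as a DataFrame with the same column names used for training
new_flowers_df = pd.DataFrame([[3, 5, 4, 2],
                               [5, 4, 3, 1]], columns=feature_cols)

pred = clf.predict(new_flowers_df)         # one predicted label per row
proba = clf.predict_proba(new_flowers_df)  # one row per flower, one column per class

print(pred)
print(pd.DataFrame(proba, columns=clf.classes_))
```

With n_neighbors=3 and uniform weights, each probability is a multiple of 1/3, because it is simply the fraction of the three nearest neighbors belonging to that class.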

**You should have a lot of questions!**
- How does knn work?
- How do I choose the hyperparameters (n_neighbors and weights)?
- Will my classifier perform well on new data?
- What other classification models are there? Which one should I use?
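As a first pass at the first question, the core idea of k-nearest neighbors fits in a few lines of plain Python: measure the distance from the new point to every training point, keep the k closest, and let them vote. This is a simplified sketch for intuition, not scikit-learn's implementation; the function `knn_predict` and the toy data (two petal measurements per flower) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest training points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote: every neighbor counts equally (the idea behind weights='uniform')
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: (petal_length, petal_width) pairs, for illustration only
train_points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.2),
                (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
train_labels = ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 3

prediction_a = knn_predict(train_points, train_labels, (1.6, 0.3))
prediction_b = knn_predict(train_points, train_labels, (4.6, 1.3))
print(prediction_a, prediction_b)
```

Note that there is no real "training" in this sketch: the model simply stores the training data and does all of its work at prediction time.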