In [2]:
import pandas as pd

# Training a Classification Model with scikit-learn

Contents:

- [Introduction to Classification](#1.-Introduction-to-Classification)
- [The Iris dataset](#2.-The-Iris-dataset)
- [Classification algorithms](#3.-Classification-algorithms)

## 1. Introduction to Classification

In machine learning, problems are usually divided into two types: supervised and unsupervised learning.
In supervised learning, we use labeled data to train models.
Unsupervised learning, on the other hand, works with data that has no labels, finding patterns or groupings on its own. 
Classification is a supervised learning task in which the model predicts a discrete category (a class label) for each example.

Some common classification problems include 
- spam detection,
- image classification (like cat vs dog), and
- fraud detection in financial transactions.

In these problems, labeled data means we already know the correct answer for each example. 
For spam detection, each email is labeled as "spam" or "not spam".
In a cat vs dog problem, each image is labeled as either "cat" or "dog".
For fraud detection, each transaction is labeled as either "fraudulent" or "legitimate".

There are many algorithms used for classification tasks. Some of the most common ones include

- logistic regression,
- k-nearest neighbors (KNN),
- decision trees, and
- neural networks.

Each algorithm has its own strengths and weaknesses, and the right choice depends on the problem you're solving. This list is not exhaustive; many other classification algorithms exist.

The scikit-learn library offers simple, flexible implementations of many classification algorithms. 
It’s widely used for machine learning in Python.
Learn more about scikit-learn [here](https://scikit-learn.org/stable/).

## 2. The Iris dataset

Our first example uses the Iris dataset, a classic dataset in machine learning. 
It contains measurements of 150 iris flowers from three species: Setosa, Versicolor, and Virginica.
The goal is to classify the species based on four features: sepal length, sepal width, petal length, and petal width.

<img src="images/iris_flowers.jpg" alt="Iris Flowers" width="600"/>

In [4]:
path = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/iris.csv'
df = pd.read_csv(path)
df.head()

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


Each row in the dataset represents an individual flower observation.

In [6]:
# number of flowers (aka samples or observations)
len(df) 

150

Each column represents specific information about each flower.
The dataset has four features: sepal length, sepal width, petal length, and petal width. 
The target variable is the iris species.

In [7]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

The dataset includes 50 samples from each species.

In [9]:
df.species.value_counts()

species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

**Goal**: Predict the species using the four features (sepal length, sepal width, petal length, petal width).
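To get a feel for why the four features might be enough to tell the species apart, the sketch below compares the average measurements per species. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV above. That copy uses different column names (for example `petal length (cm)`) and integer-coded labels, and the names `iris_df` and `species_means` are only illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled Iris copy stands in for the CSV loaded above.
iris = load_iris(as_frame=True)
iris_df = iris.frame  # four feature columns plus an integer 'target' column

# Map the integer codes (0, 1, 2) to species names for readability.
iris_df['species'] = iris_df['target'].map(dict(enumerate(iris.target_names)))

# Average of each feature within each species.
species_means = iris_df.drop(columns='target').groupby('species').mean()
print(species_means)
```

If the averages differ clearly between species (petal measurements tend to), that suggests a classifier has useful signal to work with.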

## 3. Classification algorithms

For a classification problem, the steps to use a scikit-learn model are as follows:

**Step 1**: Organize the data into a feature matrix and a target vector.
The feature matrix contains the input data, where each row represents an observation and each column represents a feature (e.g., sepal length, sepal width).
The target vector contains the labels or categories you're trying to predict (e.g., the species of the iris).

In [13]:
# feature matrix
X = df.drop(columns='species')
X

     sepal_length  sepal_width  petal_length  petal_width
0             5.1          3.5           1.4          0.2
1             4.9          3.0           1.4          0.2
2             4.7          3.2           1.3          0.2
3             4.6          3.1           1.5          0.2
4             5.0          3.6           1.4          0.2
..            ...          ...           ...          ...
145           6.7          3.0           5.2          2.3
146           6.3          2.5           5.0          1.9
147           6.5          3.0           5.2          2.0
148           6.2          3.4           5.4          2.3


In [12]:
# target vector
y = df.species
y

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: species, Length: 150, dtype: object
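A quick sanity check at this stage is to confirm that the feature matrix and the target vector line up: one row of features per label. The sketch below assumes scikit-learn's bundled Iris copy in place of the CSV, and the names `X_demo` and `y_demo` are illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: the bundled Iris copy stands in for the CSV used above.
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

print(X_demo.shape)  # (number of samples, number of features)
print(y_demo.shape)  # (number of samples,)

# The feature matrix and target vector must have the same number of rows.
same_length = len(X_demo) == len(y_demo)
print(same_length)
```

scikit-learn estimators expect the feature matrix to be two-dimensional and the target to have one entry per row of the feature matrix.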

**Step 2**: Select a model by importing the appropriate estimator from scikit-learn. 
Today, we’ll experiment with two models: k-nearest neighbors and logistic regression.
We’ll cover these models in more detail in a future lesson.

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

**Step 3**: Create instances of the models.
This means initializing the models with any necessary hyperparameters. 
For example, k-nearest neighbors lets you set the number of neighbors (`n_neighbors`), while logistic regression has a regularization strength (`C`).
We’ll cover the specific hyperparameters for each model later, at the appropriate time. 

At this stage, we also assign a name to the model (e.g., `knn_clf` or `logreg_clf`), although the name itself doesn’t affect the model's behavior. 

In [17]:
knn_clf = KNeighborsClassifier(n_neighbors=10)
logreg_clf = LogisticRegression()

The model is now ready to be trained but hasn’t seen any data yet.
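As a rough illustration of what "hasn't seen any data yet" means, the sketch below creates a fresh classifier, inspects its hyperparameters with `get_params()`, and then tries to predict before calling `fit()`. The name `demo_clf` and the sample measurements are made up for this example.

```python
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

# A fresh, untrained model with one hyperparameter set explicitly.
demo_clf = KNeighborsClassifier(n_neighbors=10)

# Hyperparameters are stored on the instance as soon as it is created.
print(demo_clf.get_params()['n_neighbors'])

# Predicting before training should fail, because nothing has been learned yet.
try:
    demo_clf.predict([[5.0, 3.5, 1.4, 0.2]])
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True

print(raised_not_fitted)
```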

**Step 4**: Train the models on the data using the `fit()` method.
This step, often called "model training," involves feeding the model the feature matrix and the target vector.

In [19]:
# train the knn model
knn_clf.fit(X, y)

In [20]:
# train the logistic regression model
logreg_clf.fit(X, y)
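One way to see that training changed the model is to look at the attributes scikit-learn adds during `fit()`; by convention their names end with an underscore. This is a minimal sketch assuming the bundled Iris copy, whose labels are the integers 0, 1, 2 rather than species names. `demo_clf` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris copy (integer labels) stands in for the CSV.
X_demo, y_demo = load_iris(return_X_y=True)

# fit() returns the estimator itself, so creation and training can be chained.
demo_clf = KNeighborsClassifier(n_neighbors=10).fit(X_demo, y_demo)

print(demo_clf.classes_)        # distinct labels seen during training
print(demo_clf.n_features_in_)  # number of feature columns seen during training
```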

**Step 5**: Make predictions on new data using the `predict()` method.
The model uses what it learned during training to predict outcomes for unseen data.

In this example, we’ll make up two new flowers with specific measurements and see how each model classifies them:

In [22]:
new_flowers = pd.DataFrame([[3, 5, 4, 2],  # new flower 1
                            [5, 4, 3, 1]],  # new flower 2
                           columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
new_flowers

   sepal_length  sepal_width  petal_length  petal_width
0             3            5             4            2
1             5            4             3            1


Here's how you can use the two models (k-nearest neighbors and logistic regression) to make predictions.

In [24]:
# Predictions using k-nearest neighbors model
knn_clf.predict(new_flowers)

array(['Iris-versicolor', 'Iris-setosa'], dtype=object)

In [25]:
# Predictions using logistic regression model
logreg_clf.predict(new_flowers)

array(['Iris-setosa', 'Iris-setosa'], dtype=object)

The k-nearest neighbors model predicts the first flower as Versicolor and the second as Setosa. In contrast, the logistic regression model predicts both flowers as Setosa.
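When two models disagree, it can help to look at how confident each one is. Both estimators offer a `predict_proba()` method that returns one estimated probability per class. The sketch below is only an illustration: it assumes the bundled Iris copy instead of the CSV, sets `max_iter=1000` (not used in the cells above) so the logistic regression solver has room to converge, and uses made-up names such as `knn_proba`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris copy stands in for the CSV used above.
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=10).fit(X_demo, y_demo)
logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# The same two made-up flowers as above.
new = [[3, 5, 4, 2], [5, 4, 3, 1]]

# One row per flower, one column per class; each row sums to 1.
knn_proba = knn.predict_proba(new)
logreg_proba = logreg.predict_proba(new)

print(pd.DataFrame(knn_proba, columns=iris.target_names))
print(pd.DataFrame(logreg_proba, columns=iris.target_names))
```

For k-nearest neighbors with default settings, each probability is the fraction of the 10 nearest training flowers that belong to that class.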

Looking at these predictions, you might wonder which of them are "right". The two models disagree on the first flower, and because we made both flowers up, we have no true labels to check against.

This raises several questions:

- How do we know if the model is making accurate predictions?
- How do k-nearest neighbors and logistic regression actually work behind the scenes?
- How do we choose the right hyperparameters for each model?
- Will our classifier perform well on new, unseen data?
- What other classification models are available, and how do I know which one to choose?
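As a preview of how the first and fourth questions are commonly approached, one option is to hold back part of the labeled data, train on the rest, and measure accuracy on the held-back part. The sketch below is one possible setup, not a full evaluation method. It assumes the bundled Iris copy, and the 25% test size and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris copy stands in for the CSV used above.
X_demo, y_demo = load_iris(return_X_y=True)

# Hold back 25% of the flowers; stratify keeps the species balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Train only on the training part.
clf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)

# Fraction of held-back flowers whose species is predicted correctly.
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(test_accuracy)
```

Because the held-back flowers have known labels and were not used for training, this accuracy gives a rough estimate of how the model might do on unseen data.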