<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 10 - Day 1 </h1> </center>

<center> <h2> Part 1: Supervised Machine Learning: Classification </h2></center>

## Outline
1. <a href='#1'>Scikit-Learn</a>
2. <a href='#2'>Preparing the Data for Use with Scikit-Learn</a>
3. <a href='#3'>Splitting the Data for Training and Testing</a>
4. <a href='#4'>Creating the Model</a>
5. <a href='#5'>Training the Model</a>
6. <a href='#6'>Predicting Classes</a>
7. <a href='#7'>Prediction Accuracy</a>
8. <a href='#8'>Summary: k-NN Classification</a>

<a id="1"></a>

## 1. Scikit-Learn 
* Scikit-learn, also called **sklearn**, conveniently packages many widely used machine-learning algorithms as **estimators**. 
* **Each is encapsulated, so you don’t see the intricate details and heavy mathematics of how these algorithms work.**
* You’ll use **scikit-learn** to **train each model** on a subset of your data, then **test each model** on the rest to see how well your model works. 
* Once your models are trained, you’ll put them to work making **predictions** based on **data they have not seen**. 
* https://scikit-learn.org/stable/

### 1.1. k-Nearest Neighbors Algorithm (k-NN) 
* Predict a sample’s class by looking at the **_k_ training samples** **nearest in "distance"** to the **sample** 
* Filled dots represent four distinct classes—A (blue), B (green), C (red) and D (purple) 
* **Class with the most “votes” wins**
    * An **odd _k_ value** **avoids ties** when there are only **two classes**; with more classes (such as the four shown here), ties are still possible and need a tie-breaking rule
    
<img src="res/nearest.png" alt="Diagram for the discussion of the k-nearest neighbors algorithm" width=300/>
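The voting idea can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict`, the toy points, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # count the votes per class and return the class with the most votes
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D data, made up for illustration
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))
```

In this sketch a tie goes to whichever tied class appears first among the nearest neighbors. That is an arbitrary choice for the example and not necessarily how a library breaks ties.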

<a id="2"></a>

## 2. Preparing the Data for Use with Scikit-Learn
* Scikit-learn estimators require samples to be stored in a **two-dimensional array of floating-point values** (or **list of lists** or **pandas `DataFrame`**): 
	* Each **row** represents one **sample** 
	* Each **column** in a given row represents one **feature** for that sample
* For **categorical features** (e.g., **strings** like `'spam'` or `'not-spam'`), you’d have to **preprocess** those features into **numerical values**; **one-hot encoding** is one common technique for this
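As a rough illustration of turning categories into numbers, the sketch below encodes a made-up `label` column two ways with pandas: one-hot encoding via `pd.get_dummies`, and a plain integer mapping via `Series.map`. The `emails` DataFrame and its values are hypothetical and are not part of the fruit dataset.

```python
import pandas as pd

# hypothetical toy data, not the fruit dataset
emails = pd.DataFrame({
    "length": [120, 45, 300, 80],
    "label": ["spam", "not-spam", "spam", "not-spam"],
})

# one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(emails["label"], dtype=int)
print(one_hot)

# integer encoding: map each category to a numeric code
codes = emails["label"].map({"not-spam": 0, "spam": 1})
print(codes.tolist())
```

One-hot encoding avoids implying an order between categories, which is why it is often preferred for features. A simple integer code is usually enough for a target column.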

In [None]:
import pandas as pd
df = pd.read_csv("res/fruits_data.csv")
df.head()

In [None]:
df["fruit"].value_counts()

In [None]:
# map numeric codes back to fruit names (used later to decode predictions)
fruits_dict = {1: "apple", 2: "orange", 3: "pear", 4: "clementine", 5: "banana", 6: "fig", 7:"lemon"}
fruits_dict

In [None]:
fruits_nominal = {"apple":1, "orange":2, "pear":3, "clementine": 4, "banana":5, "fig": 6, "lemon": 7}
fruits_nominal

* Add a new column to hold numeric values for fruit names (this notebook works with numeric class labels, although scikit-learn classifiers also accept string labels)

In [None]:
def transform_fruit_name(name):
    # look up the numeric code for a single fruit name
    return fruits_nominal[name]

df["target"] = df["fruit"].apply(transform_fruit_name)
df.head()

### 2.1. Visualizing the Dataset

In [None]:
import plotly.express as px
fig = px.scatter_matrix(df, dimensions = ["weight", "width", "height"], color = "fruit")
fig.show()

### 2.2. Retrieving Features and Target Columns

In [None]:
features = df[["weight", "width", "height"]]
features.head()

In [None]:
target = df["target"]
target.head()

<a id="3"></a>

## 3. Splitting the Data for Training and Testing
* Typically train a model with a subset of a dataset
* Save a portion for testing, so you can evaluate a model’s performance using unseen data
* Function **`train_test_split`** shuffles the data to randomize it, then splits the samples (here, the `features` DataFrame) and the target values (here, the `target` Series) into training and testing sets
    * Shuffling helps ensure that the training and testing sets have similar characteristics

### 3.1. train_test_split() Function
* Returns a **tuple of four elements** in which the **first two** are the **samples** split into **training** and **testing sets**, and the **last two** are the **corresponding target values** split into **training** and **testing sets**
* Convention: 
    * **Uppercase `X`** represents **samples**
    * **Lowercase `y`** represents **target values**
* `random_state` allows us to specify a random seed for reproducibility
* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)  

### 3.2. Training and Testing Set Sizes 
* **By default**, `train_test_split` reserves **75%** of the data for **training** and **25%** for **testing**
* Can specify the ratio using the **`train_size`** or **`test_size`** keyword argument
    * `train_size=0.80` (80% for training and 20% for testing)
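To see how the split sizes work out, here is a small sketch on made-up data: 20 samples with 3 features each. The arrays `X` and `y` and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 3 features each, plus 20 made-up class labels
X = np.arange(60).reshape(20, 3)
y = np.arange(20) % 2

# default split: 75% training, 25% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# explicit split: 80% training, 20% testing
X_train80, X_test20, y_train80, y_test20 = train_test_split(
    X, y, train_size=0.80, random_state=0
)
print(X_train80.shape, X_test20.shape)
```

With 20 samples, the default split yields 15 training and 5 testing rows, and `train_size=0.80` yields 16 and 4.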
    

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
X_train

In [None]:
X_test

<a id="4"></a>

## 4. Creating the Model 
* In **scikit-learn**, **models** are called **estimators** 
* Once the dataset is split into training and testing sets, we can create a model that uses an ML algorithm to learn from the data; the model can then make predictions and classify new samples
* **`KNeighborsClassifier`** estimator implements the **k-nearest neighbors algorithm**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier()

<a id="5"></a>

## 5. Training the Model
* Can train the model with the `KNeighborsClassifier` object’s **`fit` method**
* Involves loading **sample training set (`X_train`)** and **target training set (`y_train`)** into the estimator

In [None]:
knn.fit(X=X_train, y=y_train)

* By default, **`KNeighborsClassifier`** uses `n_neighbors=5`
    * 5-nearest neighbors
    * Can change it by passing the `n_neighbors` keyword argument when creating the estimator, e.g., `KNeighborsClassifier(n_neighbors=3)`

* **`KNeighborsClassifier`’s `fit` method** **just loads the data** 
    * **No initial learning process** 
    * The **estimator** is **lazy** &mdash; work is performed only when you use it to make predictions
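A small sketch of choosing a different _k_: `n_neighbors` is passed when the estimator is constructed, and `fit` then stores the training data. The one-feature samples below are made up for illustration, and the name `knn3` is chosen so it does not clash with the notebook's `knn`.

```python
from sklearn.neighbors import KNeighborsClassifier

# made-up one-feature samples: small values are class 0, large values are class 1
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# n_neighbors is set when the estimator is created, not in fit
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X, y)

# confirm the setting, then classify two new samples
print(knn3.get_params()["n_neighbors"])
print(knn3.predict([[1.5], [10.5]]))
```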

<a id="6"></a>

## 6. Predicting Classes
* Can make predictions using the `KNeighborsClassifier`’s  **`predict` method**
* Returns an array containing the **predicted class of each test sample**: 

In [None]:
predicted = knn.predict(X=X_test)

In [None]:
expected = y_test

In [None]:
results = pd.DataFrame(predicted, columns = ["Predicted"])

In [None]:
results["Expected"] = expected.values

In [None]:
results

* Locate all incorrect predictions for the entire test set:

In [None]:
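# keep the full results table intact and store the misclassified rows separately
wrong = results[results["Predicted"] != results["Expected"]]
wrong

In [None]:
# fraction of test samples that were misclassified (the error rate)
len(wrong)/len(X_test)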

* Incorrectly predicted only ? of the ? test samples

### 6.1. Predicting a Single Fruit's Name

In [None]:
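# wrap the new sample in a DataFrame so its column names match the training features
new_fruit = pd.DataFrame([[4.3, 6.2, 7.2]], columns=features.columns)
fruit_prediction = knn.predict(new_fruit)
fruits_dict[fruit_prediction[0]]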

<a id="7"></a>

## 7. Prediction Accuracy
Estimator Method `score`
* Returns an **indication of how well the estimator performs** on **test data** 
* For **classification estimators**, returns the **prediction accuracy** for the test data:

In [None]:
accuracy = knn.score(X_test, y_test)
accuracy

In [None]:
accuracy*100
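For a classifier, accuracy is the fraction of predictions that match the expected labels. The sketch below computes it by hand with NumPy on made-up label arrays; the names `y_pred` and `y_true` and their values are assumptions for illustration.

```python
import numpy as np

# made-up predicted and expected labels for 8 test samples
y_pred = np.array([1, 2, 2, 3, 1, 4, 4, 2])
y_true = np.array([1, 2, 3, 3, 1, 4, 2, 2])

# accuracy = number of correct predictions / number of predictions
manual_accuracy = np.mean(y_pred == y_true)
print(manual_accuracy)

# the error rate is the complement of the accuracy
print(1 - manual_accuracy)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75 and the error rate is 0.25.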

* `KNeighborsClassifier` achieved this prediction accuracy using only the estimator’s default parameters, including the default _k_ of 5
    * Can use hyperparameter tuning to try to determine the optimal value for k
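One possible way to explore values of _k_ is to score several candidates with cross-validation and compare their mean accuracy. This is a sketch under assumptions: cross-validation via `cross_val_score` is one common approach rather than the method used in this notebook, the bundled Iris dataset stands in for the fruit data, and the candidate range of odd values from 1 to 15 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# the bundled Iris dataset stands in for the fruit data here
X, y = load_iris(return_X_y=True)

mean_scores = {}
for k in range(1, 16, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(knn_k, X, y, cv=5).mean()

# pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, round(mean_scores[best_k], 3))
```

Scoring candidates with cross-validation on the training data, rather than on the test set, keeps the test set unseen for the final evaluation.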

<a id="8"></a>

## 8. Summary: k-NN Classification

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

knn = KNeighborsClassifier()

knn.fit(X=X_train, y=y_train)

predicted = knn.predict(X=X_test)

expected = y_test

accuracy = knn.score(X_test, y_test)
accuracy