# Lab \#15: K-folds, One Hot Encoding, & the K-Means Clustering Model
---

### **Description:**
In this lab, you will practice implementing a K-Means model using standardization. You will also practice new concepts: K-folds and feature encoding. 

### **Goals:**
By the end of this lab, you will:
* Create a k-means model on your own using standardization
* One hot encode categorical features of your data to make them usable in your models
* Practice using k-folds when creating models

### **Cheat Sheets:**
[K-Means Cheat Sheet](https://docs.google.com/document/d/1qjofYW__LJs2-ajXYipA6oiF00ky4ebXdCYQAFyPtSg/edit?usp=sharing) 

[Standardization and K-Folds Cheat Sheet](https://docs.google.com/document/d/1Cd6NtCp73i_yL40uGSjE66-JWxvLB01u8BAF1xq765I/edit?usp=sharing)








**Run the code below before continuing:**

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import datasets, model_selection
from sklearn.cluster import KMeans

## K-Means Independent Practice
---
The following data was collected by examining wheat kernels, the seeds for growing wheat. The kernels examined belong to different varieties of wheat. We will use this data to cluster similar types of wheat together. Walk through the k-means implementation steps on the following dataset.

**Dataset Description:**
This data set includes 210 samples, but we will drop 11 to avoid NaN values. The attributes are: 
1. area (A)
2. perimeter (P)
3. compactness (C = 4*pi*A/P^2)
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove


<br>


**Source:** [data.world](https://data.world/uci/seeds/workspace/project-summary?agentid=uci&datasetid=seeds)


#### **Step #1: Load the Dataset**
Run the following cell to load the data. 

In [None]:
url = "https://raw.githubusercontent.com/eliseharvey/TRAIN-plants/TRAIN-seeds/seeds_dataset.csv"
seed_df = pd.read_csv(url, sep = "\t", names = ["area", "perimeter", "compactness", "length", "width", "asymmestry_coeff", "length_groove", "target"])
seed_df = seed_df.dropna()
seed_df = seed_df.drop(columns = ["target"])
seed_df

Unnamed: 0,area,perimeter,compactness,length,width,asymmestry_coeff,length_groove
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175
...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044


#### **Step #2: Create X**
Create X out of all the columns given in `seed_df`.

In [None]:
X = seed_df.values
X.shape

(199, 7)

#### **Step #3: Split the Data**
We do not need this step for K-means! K-means is unsupervised, so there are no labels to hold out and we fit the model on all of `X`.

#### **Step #4: Import your model and your StandardScaler!**
Code for importing StandardScaler:
```
from sklearn.preprocessing import StandardScaler
```

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

#### **Step #5: Initialize your model and set hyperparameters**

Assign `n_clusters = 3` and add an optional parameter `random_state = 42`.


Initialize your scaler here as well:
```
scaler = StandardScaler()
```


In [None]:
kmeans = KMeans(n_clusters = 3, random_state=42)
scaler = StandardScaler()

#### **Step #6: Scale your data and fit your model**
We will skip creating a visual for now. Use the following to scale your data:
```
X_scaled = scaler.fit_transform(X)
```

In [None]:
X_scaled = scaler.fit_transform(X)
y = kmeans.fit_predict(X_scaled)
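
To see what standardization does before the data reaches K-means, here is a minimal sketch on hypothetical toy data (the `toy_` names and values are made up for illustration and are not part of the lab dataset). It checks that `StandardScaler` gives every column a mean of about 0 and a standard deviation of about 1, which keeps a large-valued feature from dominating the distance calculations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy_X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_X)

# After scaling, each column has mean ~0 and standard deviation ~1.
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print("column means:", col_means)
print("column stds: ", col_stds)

# The same result by hand: subtract the column mean, divide by the column std.
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)
print("matches manual calculation:", np.allclose(manual, toy_scaled))
```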

#### **Step \#7: Calculate the Silhouette Score to evaluate the quality of your clusters**

In [None]:
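from sklearn.metrics import silhouette_score
# Note: this scores the labels against the unscaled X. Because the model was fit on
# X_scaled, silhouette_score(X_scaled, kmeans.labels_) would measure the clusters in
# the same space where they were computed.
score = silhouette_score(X, kmeans.labels_, metric='euclidean')
print('silhouette score: ', score)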

silhouette score:  0.4299708319743058
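
The silhouette score ranges from -1 to 1, and higher values mean points sit closer to their own cluster than to neighboring clusters. One common use is comparing several values of `n_clusters`. The sketch below does this on hypothetical synthetic data from `make_blobs` (the `blob_` names and the chosen centers are assumptions for illustration, not the wheat data).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: three tight, well-separated groups.
blob_X, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=0.5,
                       random_state=0)
blob_scaled = StandardScaler().fit_transform(blob_X)

# Fit K-means for several values of k and score each clustering
# on the same scaled data the model was fit on.
sil_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(blob_scaled)
    sil_by_k[k] = silhouette_score(blob_scaled, labels)

for k, sil in sil_by_k.items():
    print(f"k={k}: silhouette score = {sil:.3f}")

# The k with the highest score is a reasonable candidate
# (for these three separated blobs, that should be k=3).
best_k = max(sil_by_k, key=sil_by_k.get)
print("best k:", best_k)
```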


#### **Step \#8: Use the model**

Given the following values, predict in which cluster these kernels would fall.
1. `area = 14`, `perimeter = 14`, `compactness = 0.9`, `length = 5.5`, `width = 3.2`, `asymmestry_coeff = 2.6`, and `length_groove = 5.2`
2. `area = 12`, `perimeter = 13`, `compactness = 0.8`, `length = 5.3`, `width = 2.8`, `asymmestry_coeff = 4.3`, and `length_groove = 5.3`


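Use `kmeans.predict()` to complete this problem. Start by putting both kernels into one 2D array, one row per kernel:

```
new_seeds = np.array([[14, 14, 0.9, 5.5, 3.2, 2.6, 5.2],
                      [12, 13, 0.8, 5.3, 2.8, 4.3, 5.3]])
```

#### ***Remember to standardize the data with the scaler you fit in Step #6***

Call `transform`, not `fit_transform`: refitting on two samples would replace the mean and standard deviation learned from the full dataset.

```
new_scld = scaler.transform(new_seeds)
kmeans.predict(new_scld)
```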
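
New samples should be scaled with the mean and standard deviation learned from the original data, which is what `transform` does on an already fitted scaler. The sketch below uses hypothetical one-feature data (the names `train`, `new`, and `demo_scaler` are made up for illustration) to show how the result differs if the scaler is refit on the new samples instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data (one feature) and two new samples.
train = np.array([[10.0], [12.0], [14.0], [16.0]])
new = np.array([[12.0], [16.0]])

# Fit once on the training data: learns mean = 13 and std = sqrt(5).
demo_scaler = StandardScaler().fit(train)

# Reuse the training statistics for the new samples.
reused = demo_scaler.transform(new)

# Refitting on only the two new samples uses THEIR mean and std,
# so two distinct values always come out as -1 and +1.
refit = StandardScaler().fit_transform(new)

print("transform (training statistics):", reused.ravel())
print("fit_transform (refit on new):   ", refit.ravel())
```

With only two distinct samples, a refit scaler always outputs -1 and +1, so the information about where the new kernels sit relative to the training data is lost.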


## Practice Together #1
---

#### **Problem #1: Using the Pokemon dataset, one-hot encode the `type1` column, which is currently a categorical variable in `pokemon_df`.**
---

**Dataset Description:**
This dataset covers 898 Pokemon (1072 including alternate forms). For each one it records the number, name, first and second type, stat total, base stats (HP, Attack, Defense, Special Attack, Special Defense, and Speed), generation, and legendary status. The attributes of each Pokemon are as follows:

* `number`: The ID for each Pokemon

* `name`: The name of each Pokemon

* `type1`: Each Pokemon has a type, which determines its weakness/resistance to attacks

* `type2`: Some Pokemon are dual type and have a second type

* `total`: Sum of all the stats that follow, a general guide to how strong a Pokemon is

* `hp`: Hit points, or health, which defines how much damage a Pokemon can withstand before fainting

* `attack`: The base modifier for normal attacks (e.g. Scratch, Punch)

* `defense`: The base damage resistance against normal attacks

* `sp_attack`: Special attack, the base modifier for special attacks (e.g. Fire Blast, Bubble Beam)

* `sp_defense`: Special defense, the base damage resistance against special attacks

* `speed`: Determines which Pokemon attacks first each round

* `generation`: The generation of games where the Pokemon was first introduced

* `legendary`: Some Pokemon are much rarer than others and are dubbed "legendary"

<br>


**Source:** [data.world](https://data.world/data-society/pokemon-with-stats)


**Run the code below before continuing:**

In [None]:
url ="https://query.data.world/s/p4tnasnlximnov7fpjlu2msnmegyrb"
pokemon_df = pd.read_csv(url,  sep = ",")
pokemon_df

Unnamed: 0,number,name,type1,type2,total,hp,attack,defense,sp_attack,sp_defense,speed,generation,legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,Mega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,3,Gigantamax Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1067,896,Glastrier,Ice,,580,100,145,130,65,110,30,8,True
1068,897,Spectrier,Ghost,,580,100,65,60,145,80,130,8,True
1069,898,Calyrex,Psychic,Grass,500,100,80,80,80,80,80,8,True
1070,898,Ice Rider Calyrex,Psychic,Ice,680,100,165,150,85,130,50,8,True


**One hot encode:**

In [None]:
# ONE COLUMN encoded
# import one hot encoder
from sklearn.preprocessing import OneHotEncoder

# initialize and fit
ohe = OneHotEncoder()
transformed = ohe.fit_transform(pokemon_df[['type1']]) # note that this is one column

# ADDING MORE: Add back to the dataframe
temp_df = pokemon_df.copy() # creating a copy of the dataframe so we don't edit the original
temp_df[ohe.categories_[0]] = transformed.toarray() # this adds the new columns

# delete type1 column
temp_df = temp_df.drop(columns = ["type1"])
# view our new dataframe
temp_df

Unnamed: 0,number,name,type2,total,hp,attack,defense,sp_attack,sp_defense,speed,...,Graass,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
0,1,Bulbasaur,Poison,318,45,49,49,65,65,45,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Ivysaur,Poison,405,60,62,63,80,80,60,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Venusaur,Poison,525,80,82,83,100,100,80,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Mega Venusaur,Poison,625,80,100,123,122,120,80,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3,Gigantamax Venusaur,Poison,525,80,82,83,100,100,80,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1067,896,Glastrier,,580,100,145,130,65,110,30,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1068,897,Spectrier,,580,100,65,60,145,80,130,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1069,898,Calyrex,Grass,500,100,80,80,80,80,80,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1070,898,Ice Rider Calyrex,Ice,680,100,165,150,85,130,50,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


#### **Problem #2: Given the following dataset on trees, one hot encode the column `Tree_Types`.**

In [None]:
# creating initial dataframe
tree_types = ('Maple','Willow','Pine','Apple','Teak','Acacia','Neem')
tree_df = pd.DataFrame(tree_types, columns=['Tree_Types'])
tree_df

Unnamed: 0,Tree_Types
0,Maple
1,Willow
2,Pine
3,Apple
4,Teak
5,Acacia
6,Neem


In [None]:
# One hot encode here
# initialize and fit
ohe = OneHotEncoder()
transformedTrees = ohe.fit_transform(tree_df[['Tree_Types']]) # note that this is one column

# ADDING MORE: Add back to the dataframe
tempTree_df = tree_df.copy() # creating a copy of the dataframe so we don't edit the original
tempTree_df[ohe.categories_[0]] = transformedTrees.toarray() # this adds the new columns

tempTree_df = tempTree_df.drop(columns = ["Tree_Types"])

tempTree_df

Unnamed: 0,Acacia,Apple,Maple,Neem,Pine,Teak,Willow
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0,0.0,0.0
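
For quick exploration, pandas offers `pd.get_dummies` as an alternative route to the same kind of encoding. The sketch below compares the two on a small hypothetical frame (`demo_df` and the other names are made up for illustration). `OneHotEncoder` has the advantage that the fitted encoder can be reused on new data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical small frame with a repeated category.
demo_df = pd.DataFrame({"Tree_Types": ["Maple", "Willow", "Pine", "Maple"]})

# pandas route: one call returns a DataFrame with one column per category.
dummies = pd.get_dummies(demo_df["Tree_Types"], dtype=float)

# scikit-learn route: fit an encoder, then convert the sparse result to an array.
demo_ohe = OneHotEncoder()
encoded = demo_ohe.fit_transform(demo_df[["Tree_Types"]]).toarray()

print(dummies)
print(demo_ohe.categories_[0])
print(encoded)
```

Both routes order the new columns alphabetically by category, and every row contains exactly one 1.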


#### **Problem #3 [Optional]: Similar to Problem #1, one-hot encode `type2` of the Pokemon dataset (as done with `type1`)**.

In [None]:
# One hot encode here
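
Unlike `type1`, the `type2` column has missing values for single-type Pokemon. One possible approach, sketched here on a hypothetical three-row frame rather than the real dataset, is to fill the gaps with a placeholder label before encoding. The label `"No_type2"` and all variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature version of the Pokemon frame.
demo = pd.DataFrame({"name": ["A", "B", "C"],
                     "type2": ["Poison", np.nan, "Ice"]})

# Replace missing values with an explicit placeholder category.
filled = demo[["type2"]].fillna("No_type2")

# Encode the single column and wrap the result in a labeled DataFrame.
enc = OneHotEncoder()
arr = enc.fit_transform(filled).toarray()
encoded_df = pd.DataFrame(arr, columns=enc.categories_[0], index=demo.index)

# Swap the original column for the encoded columns.
out = demo.drop(columns=["type2"]).join(encoded_df)
print(out)
```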

---
## Back To Lecture
---

## Practice Together #2
---

#### **Problem #1: Implement K-Folds on the Iris dataset. Steps 1-5 for creating the KNN model have been provided so you will only practice implementing K-folds.**

In [None]:
from sklearn.datasets import load_iris
from sklearn import metrics
# 1 - load data
iris = load_iris()

# 2 get dependent and independent data
X=iris.data
y=iris.target

# 3 - skip since we will do it in KFolds

# 4 - 5 import and initialize KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
# import KFolds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)

# store each score in the list "evaluations"
evaluations = []
index = 1
# loop through folds and store the accuracy score
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit model and make a prediction
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    # get accuracy score
    score = metrics.accuracy_score(y_test, pred)
    evaluations.append((index, score))
    index += 1

# print results
for num, acc in evaluations:
    print(f'Fold #{num} has an accuracy score of {acc}!')


Fold #1 has an accuracy score of 1.0!
Fold #2 has an accuracy score of 1.0!
Fold #3 has an accuracy score of 0.8333333333333334!
Fold #4 has an accuracy score of 0.9333333333333333!
Fold #5 has an accuracy score of 0.8!
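
The rows returned by `load_iris` are grouped by species, and `KFold` without shuffling takes contiguous blocks of rows, so a test fold can be dominated by a single species. The sketch below is one possible variation, not part of the original lab: it shuffles the rows before splitting and lets `cross_val_score` run the loop. The names `iris_X`, `knn_demo`, and `shuffled_kf` are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_X, iris_y = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=5)

# shuffle=True mixes the rows before they are divided into folds;
# random_state makes the shuffle repeatable.
shuffled_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits and scores the model once per fold.
cv_scores = cross_val_score(knn_demo, iris_X, iris_y,
                            cv=shuffled_kf, scoring="accuracy")

for fold, acc in enumerate(cv_scores, start=1):
    print(f"Fold #{fold}: accuracy = {acc:.3f}")
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```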


**What if we had just done a normal train-test split and completed our model normally? Split the data, fit the model, make a prediction, evaluate, and compare with the k-fold implementation.**

In [None]:
# 6 - 7
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

knn.fit(X_train, y_train)
pred = knn.predict(X_test)

score = metrics.accuracy_score(y_test, pred)

print(score)

***Discussion question:*** Are there differences between the scores using k-folds and train-test split? 

#### **Problem #2: Implement K-folds on the following linear regression model on the Diabetes Dataset**

This dataset contains measurements from diabetic patients, such as BMI, age, blood pressure, and glucose levels, that are useful for predicting disease progression. We will use age, BMI, and blood pressure as features to predict disease progression. **Use mean squared error when evaluating each fold.**

In [None]:
# steps 1 - 5 given
# 1 
diabetes = datasets.load_diabetes()
diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['TARGET'] = diabetes.target

# 2 - using age, bmi and bp as indep. and target as dep.
X = diabetes_df[['age', 'bmi', 'bp']].values
y = diabetes_df['TARGET'].values
# 3 - skip 

# 4
from sklearn.linear_model import LinearRegression

# 5
reg = LinearRegression()

In [None]:
# Implement K-Folds here
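
One possible shape for this exercise is sketched below. It follows the same loop as the Iris example but swaps accuracy for mean squared error, since this is a regression problem. The variable names (`reg_X`, `reg_demo`, `fold_mse`, and so on) are chosen for illustration so they do not overwrite the lab's own variables, and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Load the data and pick the age, bmi, and bp columns by name.
diab = datasets.load_diabetes()
cols = [diab.feature_names.index(name) for name in ("age", "bmi", "bp")]
reg_X = diab.data[:, cols]
reg_y = diab.target

reg_demo = LinearRegression()
kf_demo = KFold(n_splits=5)

# Fit on each training split and record the MSE on the held-out fold.
fold_mse = []
for train_idx, test_idx in kf_demo.split(reg_X):
    reg_demo.fit(reg_X[train_idx], reg_y[train_idx])
    fold_pred = reg_demo.predict(reg_X[test_idx])
    fold_mse.append(mean_squared_error(reg_y[test_idx], fold_pred))

for fold, mse in enumerate(fold_mse, start=1):
    print(f"Fold #{fold} has a mean squared error of {mse:.1f}")
print(f"Average MSE across folds: {np.mean(fold_mse):.1f}")
```

For MSE, lower is better, so folds are compared in the opposite direction from accuracy.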

#### **Problem #3 [Optional]: Implement K-folds on the following KNN model on the Breast Cancer Dataset**

The following dataset is taken from the [UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)). The dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast masses, along with whether each mass was malignant or benign.

**Use accuracy score when evaluating each fold**

In [None]:
# steps 1 - 5 given
# 1
cancer_dataset = datasets.load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer_dataset.data, columns=cancer_dataset.feature_names)
cancer_df['TARGET'] = cancer_dataset.target

# 2
X = cancer_df[["mean radius","mean texture"]].values
y = cancer_df["TARGET"].values  # 1D array; a 2D column vector triggers a warning when fitting

# 3 - skip

# 4 
from sklearn.neighbors import KNeighborsClassifier

# 5
model = KNeighborsClassifier(n_neighbors = 4)

In [None]:
# Implement K-Folds here

---
© 2023 The Coding School, All rights reserved