# **Data Encoding and Hyperparameter Tuning**
In this lab, we will cover feature scaling through standardization and normalization, different feature encoding techniques, ensemble learning, and hyperparameter tuning. We will also see firsthand how k-fold cross-validation helps estimate the skill of ML models.

#### **Part 1:** [Data Scaling](#p1)
#### **Part 2:** [Data Encoding](#p2)
#### **Part 3:** [Ensemble Learning](#p3)
#### **Part 4:** [Hyperparameter Tuning](#p4)
#### **Part 5:** [Exercise](#p5)





### Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive

### Import Needed Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets, model_selection, metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import *

### Load Datasets

#### **Diabetes Dataset (`df_diabetes`)**

In [None]:
df_diabetes = pd.read_csv("datasets/diabetes.csv")

In [None]:
df_diabetes.head()

#### **Breast Cancer Dataset (`df_cancer`)**

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

df_cancer = pd.DataFrame(data.data, columns=data.feature_names)
df_cancer['target'] = data.target

In [None]:
df_cancer.head()

#### **Crop Dataset (`df_corp`)**

In [None]:
url = 'https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/main/crop_recommendation/crop%20recommendation%20clean.csv'
df_corp = pd.read_csv(url)

In [None]:
df_corp.head()

#### **Student Dataset (`df_student`)**
This dataset is used to predict a high school student's final grade, a discrete value from 0 to 20. The dataset contains many features, as described below:

* Medu : Mother's education status. 'none' for no education, 'primary' for through 4th grade, 'middleschool' for through 9th grade, 'highschool' for through 12th grade, 'higher' for anything over 12th grade.
* Fedu: Father's education status. Same categories as Medu.
* failures: How many classes the student has failed in the past.
* absences: The number of days the student has been absent.
* traveltime: How many minutes the student has to travel to get to school.
* goout: How the student has rated how often they go out with friends on a scale from 1 (very low) to 5 (very high).
* school: GP or MS are two different schools in Portugal where the data was collected.
* higher: Whether the student has expressed interest in taking higher education after graduating high school.
* famsize: GT3 for greater than 3 family members and LE3 for less than or equal to 3 family members.
* G3: The overall grade of the student at the end of the year.

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/main/student_portugal/student-por.csv'
df_student = pd.read_csv(url)

edu_map = {0: 'none', 1: 'primary', 2: 'middleschool', 3: 'highschool', 4: 'higher'}
df_student['Medu'] = df_student['Medu'].map(edu_map)
df_student['Fedu'] = df_student['Fedu'].map(edu_map)

df_student['traveltime'] *= 15
df_student['studytime'] *= 2.5

selected_features = [
    'Medu',            # Mother's Education
    'Fedu',            # Father's Education
    'failures',        # Number of class failures
    'absences',        # Number of absences
    'traveltime',      # Travel time to school
    'goout',           # Going out with friends
    'school',       # School identifier
    'higher',      # Interest in higher education
    'famsize',      # Family size ≤ 3 members
    'G3'
]

df_student = df_student[selected_features]


In [None]:
df_student.head()

<a name="p1"></a>
# **Part 1: Data Scaling**
#### The purpose of scaling is to put all features on a comparable range, so that no feature dominates the learning process simply because it is measured in larger units.

## Crop Data (`df_corp`)

In [None]:
df_corp.head()

In [None]:
df_corp.shape

## Retrieve data and labels
By tradition, we use `X` for the data and `y` for the labels. The column `crop` stores the labels.

In [None]:
# Get the data and labels


## Split the dataset into training and test datasets
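The two steps above (separating data from labels, then splitting) could look roughly like the sketch below. It runs on a small hypothetical frame named `toy` so that it is self-contained; the column values are made up, and only the `crop` label column mirrors `df_corp`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_corp: two numeric features plus a `crop` label column
toy = pd.DataFrame({
    "N": [90, 85, 60, 74, 20, 25, 30, 22],
    "rainfall": [202.9, 226.6, 263.9, 242.8, 80.1, 75.4, 90.2, 60.7],
    "crop": ["rice", "rice", "rice", "rice", "maize", "maize", "maize", "maize"],
})

X = toy.drop(columns=["crop"])   # data: every column except the label
y = toy["crop"]                  # labels

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

With the real data, the same calls apply with `df_corp` in place of `toy`; the `test_size` and `stratify` choices here are illustrative, not prescribed by the lab.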

In [None]:
# split data into a training dataset and test dataset


## Data Scaling (Normalization)
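Two popular data scaling methods are:
* Z-score normalization (standardization): returns the standard score of a sample x as z = (x - u) / s, where u is the mean and s is the standard deviation of the feature in the training data
* Min-max normalization: rescales a feature to the range between 0 and 1 as x' = (x - min) / (max - min)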
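As a quick check of the two formulas, the sketch below applies them by hand to one toy feature and compares the result with scikit-learn's `StandardScaler` and `MinMaxScaler`. The array `x` is made-up data used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one toy feature, five samples

# z-score: subtract the mean u, divide by the (population) standard deviation s
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
# min-max: shift by the minimum, divide by the range
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

# Both comparisons should print True
print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
print(mm_manual.ravel())
```

After z-score scaling the feature has mean 0 and standard deviation 1; after min-max scaling its smallest value maps to 0 and its largest to 1.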


### **StandardScaler**  (z-score normalization)
Here we create a z-score normalized version of the training and testing data using sklearn's `StandardScaler()`. We will use this to compare to the original training and test sets.

In [None]:
std_scaler = StandardScaler()
# fit standardscaler based on X_train, and then apply standardscaler to X_train and X_test.
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

### **MinMaxScaler** (min-max normalization)
Here we create a min-max normalized version of the training and testing data using sklearn's `MinMaxScaler()`. We will use this to compare to the original training and test sets.

In [None]:
# fit the min-max scaler based on X_train, and then apply it to X_train and X_test
norm_scaler = MinMaxScaler()
X_train_norm = norm_scaler.fit_transform(X_train)
X_test_norm = norm_scaler.transform(X_test)

## Visualize Data
Before moving on to modeling, let's see if we can visually detect any differences between these types of scaling. Specifically, create three scatter plots as follows:

* One comparing the original training data's 0th column to the standardized data's 0th column.
* Another comparing the original training data's 0th column to the normalized data's 0th column.
* A third comparing the standardized training data's 0th column to the normalized data's 0th column.

### **1. Create a scatter plot comparing the original training data's 0th column to the standardized data's 0th column**.

### **2. Create a scatter plot comparing the original training data's 0th column to the *normalized* data's 0th column.**

### **3. Create a scatter plot comparing the *standardized* data's 0th column to the *normalized* data's 0th column.**
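One possible way to draw the three comparisons is sketched below. It uses a randomly generated array named `demo_train` as a hypothetical stand-in for `X_train`, so the numbers are not from the crop data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50, scale=10, size=(100, 2))  # hypothetical stand-in for X_train

demo_std = StandardScaler().fit_transform(demo_train)
demo_norm = MinMaxScaler().fit_transform(demo_train)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
pairs = [
    (demo_train[:, 0], demo_std[:, 0], "original", "standardized"),
    (demo_train[:, 0], demo_norm[:, 0], "original", "normalized"),
    (demo_std[:, 0], demo_norm[:, 0], "standardized", "normalized"),
]
for ax, (a, b, a_name, b_name) in zip(axes, pairs):
    ax.scatter(a, b, s=10)
    ax.set_xlabel(f"{a_name} column 0")
    ax.set_ylabel(f"{b_name} column 0")
plt.tight_layout()
plt.show()
```

Because both scalers are linear transformations of each column, the points in every panel fall on a straight line; only the axis ranges differ. When `X_train` is a DataFrame, its 0th column is selected with `X_train.iloc[:, 0]`, while the scaled arrays are indexed with `[:, 0]`.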

## Select Classification Model -- SVM Classifier
We build SVM models on both the original dataset and the scaled datasets.
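A minimal sketch of this comparison is shown below. To stay self-contained it uses `load_wine`, a dataset bundled with scikit-learn, as a stand-in for the crop data; all variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# load_wine is a stand-in dataset; the lab itself uses df_corp
X_demo, y_demo = load_wine(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
Xd_train_std = scaler.fit_transform(Xd_train)  # fit on the training split only
Xd_test_std = scaler.transform(Xd_test)        # reuse the training statistics

svm_orig = SVC().fit(Xd_train, yd_train)
svm_std = SVC().fit(Xd_train_std, yd_train)

acc_orig = accuracy_score(yd_test, svm_orig.predict(Xd_test))
acc_std = accuracy_score(yd_test, svm_std.predict(Xd_test_std))
print(f"original: {acc_orig:.3f}  standardized: {acc_std:.3f}")
```

Each model must be tested on data prepared the same way as its training data: the unscaled model on `Xd_test`, the scaled model on `Xd_test_std`.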

In [None]:
# Initialize an SVM model for the original dataset


In [None]:
# Initialize an SVM model for the z-score normalized dataset



In [None]:
# Train SVM models


In [None]:
# Test SVM models and evaluate their performance


## Exercise

#### 1. Create an SVM classifier on the min-max normalized dataset, test it, and evaluate its performance.

#### 2. Compare the performance of all three SVM models. Be sure to use k-fold cross-validation.
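One way to set up such a comparison is sketched below, again on the bundled `load_wine` data as a stand-in. Wrapping each scaler and the SVM in a pipeline is a design choice made here (not something the lab prescribes) so that the scaler is refit on the training folds of every split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)  # stand-in for the crop data
kf = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "original": SVC(),
    "z-score": make_pipeline(StandardScaler(), SVC()),
    "min-max": make_pipeline(MinMaxScaler(), SVC()),
}

cv_means = {}
for name, estimator in candidates.items():
    # one accuracy value per fold; the mean summarizes the model's skill
    fold_scores = cross_val_score(estimator, X_demo, y_demo, cv=kf)
    cv_means[name] = fold_scores.mean()
    print(f"{name:>8}: mean accuracy {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Reporting the mean together with the spread across folds gives a steadier estimate than a single train/test split.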

<a name="p2"></a>
# **Part 2: Feature Encoding**
#### Dealing with attributes with string values: scikit-learn's decision trees and most other estimators do not accept non-numeric attributes, so these must be encoded as numbers first.
#### Nominal and ordinal attributes are handled differently.
*   **Ordinal attributes**: convert each attribute value to a number that preserves the order. Use sklearn `OrdinalEncoder()`
*   **Nominal attributes**: for each distinct value, create a binary attribute to store it. To reduce redundancy, create N-1 binary attributes for an attribute with N values. Use sklearn `OneHotEncoder()`



In this section, we will investigate the effect that different forms of encoding have on a model's performance. Specifically, we will use KNN to predict a high school student's final grade, a discrete value from 0 to 20. The dataset contains many features, as described below:

* `Medu` : Mother's education status. 'none' for no education, 'primary' for through 4th grade, 'middleschool' for through 9th grade, 'highschool' for through 12th grade, 'higher' for anything over 12th grade.
* `Fedu`: Father's education status. Same categories as `Medu`.
* `failures`: How many classes the student has failed in the past.
* `absences`: The number of days the student has been absent.
* `traveltime`: How many minutes the student has to travel to get to school.
* `goout`: How the student has rated how often they go out with friends on a scale from 1 (very low) to 5 (very high).
* `school`: `GP` or `MS` are two different schools in Portugal where the data was collected.
* `higher`: Whether the student has expressed interest in taking higher education after graduating high school.
* `famsize`: `GT3` for greater than 3 family members and `LE3` for less than or equal to 3 family members.
* `G3`: The overall grade of the student at the end of the year, it will be the class label.

## Student dataset (`df_student`)

### Split the data into the training and test data

In [None]:
features = df_student.drop(columns = 'G3')
label = df_student['G3']

X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=42)

### Pick the classification model

#### To understand why we need encodings, try to construct a KNN model on the raw training set below.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 5)

model.fit(X_train, y_train)

### Data Encoding

Now, let's encode our categorical variables based on the type of variable they are:

1. First, create a version of `X_train` and `X_test` that only contains the numerical variables from our dataset.

2. Then, determine which categorical variables are ordered and which are unordered.

3. Encode the ordered categorical variables using the ordinal encoder.

4. Encode the unordered categorical variables using one hot encoding (dummy variable).

<br>

In [None]:
# Run this cell to see all categorical (`object` type) columns in this data frame.
columns_to_encode = df_student.select_dtypes(include = object).columns

print(columns_to_encode)

#### **1. Create a version of `X_train` and `X_test` where the unencoded categorical variables are just dropped.**

In [None]:
X_train_drop = X_train.drop(columns = columns_to_encode)
X_test_drop = X_test.drop(columns = columns_to_encode)

X_train_drop.head()

#### **2. Determine which categorical variables are ordered and which are unordered and create two lists that contain the feature names.**

In [None]:
ordered_features = ['Medu', 'Fedu']
unordered_features = ['school', 'higher', 'famsize']

#### **3. Encode the ordered categorical variables using the ordinal encoder `OrdinalEncoder()`.**
**NOTE**: We must make copies of the dropped training and test sets using `.copy()` so that we can add the encoded columns without having to drop the original columns.

In [None]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
X_train_enc = X_train_drop.copy()
X_test_enc = X_test_drop.copy()


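# List the categories in their true order; by default OrdinalEncoder sorts them alphabetically
edu_order = ['none', 'primary', 'middleschool', 'highschool', 'higher']
ord_enc = OrdinalEncoder(categories=[edu_order] * len(ordered_features))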

X_train_enc[ordered_features] = ord_enc.fit_transform(X_train[ordered_features])
X_test_enc[ordered_features] = ord_enc.transform(X_test[ordered_features])

X_train_enc.head()

#### **4. Create a version of `X_train` and `X_test` where the unordered categorical variables are dummy variable encoded.**

Since one hot encoding creates a new feature for every possible value of categorical features, the number of columns will grow dramatically. To account for this, we will break this process into two steps:

1. Fit the one hot encoder to the training data and determine the new features.

2. Transform (encode) the training and test sets accordingly.

**Note** We must pass in the argument `drop = 'first'` to our OneHotEncoder to drop one of our encoded columns to make it dummy variable encoding.

In [None]:
dv_enc = OneHotEncoder(sparse_output = False, drop = 'first')
dv_enc.set_output(transform = 'pandas')

dv_enc.fit(X_train[unordered_features])

dv_columns = dv_enc.get_feature_names_out()
print(dv_columns)

In [None]:
X_train_enc[dv_columns] = dv_enc.transform(X_train[unordered_features])
X_test_enc[dv_columns] = dv_enc.transform(X_test[unordered_features])

X_train_enc.head()

In [None]:
X_test_enc.head()

#### **5. Build KNN (with k=5) on all attributes with categorical attributes encoded, and compare its performance with a KNN built on the numerical attributes only (`X_train_drop`).**


In [None]:
# create confusion matrix and display it


#### **6. Build a Decision Tree on all attributes with categorical attributes encoded.**

## Exercise -- Bank Dataset
In this exercise, you will need to
* Encode categorical attributes
* Pick a classification model to predict target variable y

[More information on columns](https://archive.ics.uci.edu/dataset/222/bank+marketing)



In [None]:
# load the dataset
import pandas as pd
df_bank=pd.read_csv('datasets/bank.csv', sep=';')

<a name="p3"> </a>
# **Part 3: Ensemble Learning**
---
**Ensemble Methods:** combining multiple machine learning models to create a stronger overall model. An ensemble often performs better than any single constituent model for several reasons:
* Averaging multiple models reduces variance, which helps avoid overfitting
* Different models can capture different patterns/relationships in the data
* Combining weak learner models can produce a strong overall model

Some common ensemble methods include:

* Bagging – Training each model on a random bootstrap sample of the data (drawn with replacement)
* Boosting – Training models sequentially, with each model focusing on the errors of the previous model
* Stacking – Combining multiple models by using the predictions of base models as inputs to a meta model
* Voting Classifiers – Combining models through averaging/majority voting on their predictions
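Of the four methods listed, voting is the simplest to show in a few lines. The sketch below combines three different classifiers with scikit-learn's `VotingClassifier` on the breast cancer data; the choice of members and all variable names are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Xc, yc = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=42
)

members = [
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# voting='hard' predicts the class chosen by the majority of the members
vote = VotingClassifier(estimators=members, voting="hard")
vote.fit(Xc_train, yc_train)

# accuracy of each fitted member, then of the combined vote
for name, fitted in vote.named_estimators_.items():
    print(name, round(fitted.score(Xc_test, yc_test), 3))
vote_acc = vote.score(Xc_test, yc_test)
print("vote", round(vote_acc, 3))
```

Printing the members next to the ensemble makes it easy to check whether combining them actually helps on a given split.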

## **Random Forest -- Bagging**

### Import library

In [None]:
from sklearn.ensemble import RandomForestClassifier

### **1.  Breast Cancer Dataset (`df_cancer`)**

In [None]:
df_cancer.head()

#### Split the data

In [None]:
X=df_cancer.drop(columns=['target'])
y=df_cancer['target']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)

#### Train the model

In [None]:
rf=RandomForestClassifier()
rf.fit(X_train, y_train)
pred_rf=rf.predict(X_test)

#### Evaluate the Model Performance

In [None]:
print(classification_report(y_test, pred_rf))

### **2.  Diabetes Dataset (`df_diabetes`)**

In [None]:
df_diabetes.head()

#### Split the dataset

In [None]:
X=df_diabetes.drop(columns=['Outcome'])
y=df_diabetes.Outcome

In [None]:
# split the dataset into training and testing
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)

#### Train the Model

In [None]:
# build the model
from sklearn.ensemble import RandomForestClassifier
rf_diabetes=RandomForestClassifier()
rf_diabetes.fit(X_train, y_train)
pred_d=rf_diabetes.predict(X_test)

#### Evaluate the Model Performance

In [None]:
# model evaluation
print(classification_report(y_test, pred_d))

## **Stacking**
Stacking works in two stages:

*  Base Models: Train multiple models on the training data.
*  Meta-Model: Train a secondary model (meta-learner) on the predictions of the base models to make the final prediction.

In [None]:
#import libraries
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

### Build base models

In [None]:
base_models=[('rf', RandomForestClassifier()),
             ('nb', GaussianNB()),
             ('dt', DecisionTreeClassifier()),
             ('svm', SVC())]

### Define meta model

In [None]:
meta_model=LogisticRegression()

### Construct stacking classifier

In [None]:
stack_1 = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)


### Train the stacking classifier and evaluate its performance
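Training and evaluating a stacking classifier follows the usual `fit` / `predict` pattern. The sketch below is self-contained, so it rebuilds a smaller stack (named `demo_stack`, an illustrative name) on the breast cancer data instead of reusing `stack_1`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

Xs, ys = load_breast_cancer(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=42
)

demo_stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # meta model trained on base-model predictions
    cv=5,  # base-model predictions for the meta model come from 5-fold cross-validation
)
demo_stack.fit(Xs_train, ys_train)
stack_pred = demo_stack.predict(Xs_test)

stack_acc = accuracy_score(ys_test, stack_pred)
print(classification_report(ys_test, stack_pred))
```

The same two calls, `stack_1.fit(X_train, y_train)` and `stack_1.predict(X_test)`, apply to the classifier constructed above.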

### Exercise

Build a different stacking strategy as follows:
* Use Naive Bayes and Decision Tree as the base models
* **option 1:** create a stacking classifier using SVC as the meta model
* **option 2:** create a stacking classifier using logistic regression as the meta model
* Compare the performance of option 1 and option 2 using k-fold cross validation.

#### Create base models

In [None]:
# build the base model


#### option 1: Use SVC for meta model, train the stacking classifier and evaluate it

In [None]:
# build meta model


In [None]:
#build stacking classifier, and evaluate it


#### option 2: Use Logistic regression for meta model, train the stacking classifier and evaluate it


#### Compare the performance of the two stacking classifiers

## **XGBoost -- boosting**

### Import library

In [None]:
from xgboost import XGBClassifier

### Diabetes Dataset

In [None]:
df_diabetes.head()

#### Split the dataset

#### Build model

#### Evaluate Model
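The split / build / evaluate steps for a boosted model could look like the sketch below. So that it runs without the `xgboost` package, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in, and it uses the breast cancer data rather than `df_diabetes`; both substitutions are assumptions made for this example. `XGBClassifier` follows the same `fit` / `predict` interface and also accepts `n_estimators`, `learning_rate`, and `max_depth`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data; the lab uses df_diabetes in this section
Xb, yb = load_breast_cancer(return_X_y=True)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.2, random_state=42
)

# Stand-in for XGBClassifier: trees are added one at a time,
# each one fitted to the errors of the ensemble built so far
booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
booster.fit(Xb_train, yb_train)
boost_pred = booster.predict(Xb_test)

boost_acc = accuracy_score(yb_test, boost_pred)
print(classification_report(yb_test, boost_pred))
```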

### Exercise
Apply XGBClassifier to the breast cancer dataset (`df_cancer`)

<a name="p4"> </a>
# **Part 4: Hyperparameter Tuning**
Hyperparameters are parameters that are not learned directly from the data by an estimator. They are set before the model is trained. In scikit-learn they are passed as arguments to the constructor of the estimator classes.
* [Motivational Example](#p31)
* [Hyperparameter Tuning for KNN](#p32)
* [Hyperparameter Tuning for Decision Tree](#p33)
* [Hyperparameter Tuning for Naive Bayes Classifier](#p34)
* [Hyperparameter Tuning for Support Vector Machine (SVM)](#p35)
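The distinction between hyperparameters and learned parameters can be seen on a single estimator. The sketch below uses a decision tree on the breast cancer data; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

Xh, yh = load_breast_cancer(return_X_y=True)

# Hyperparameters: chosen by us and passed to the constructor
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(tree.get_params()["max_depth"])

# Learned from the data during fit: the actual splits of the tree
tree.fit(Xh, yh)
fitted_depth = tree.get_depth()
print(fitted_depth, tree.get_n_leaves())

# Hyperparameters can be changed afterwards, but take effect only on the next fit
tree.set_params(max_depth=5)
```

`get_params()` and `set_params()` are the same mechanism that the search classes in scikit-learn use to try different settings.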


<a name="p31"> </a>
## Motivational Example: How to find the best n_neighbors for a KNN

### **1. Load the dataset: Spotify Dataset**
---
Spotify is one of the most popular digital music streaming services with over 515 million monthly users. The following dataset from Spotify data looks at different qualities of songs like energy, key, loudness, and tempo to see if a song is a top or bottom hit.

The features are as follows:
* `artist`: song artist(s)
* `song`: song title
* `duration_ms`: the track length in milliseconds (ms)
* `year`: the year the song was released
* `top half`: whether or not the song is in the top half of hits
* `danceability`: how suitable a track is for dancing (0.0: least danceable, 1.0: most danceable)
* `energy`: perceptual measure of intensity and activity (0.0 - 1.0)
* `key`: the key the track is in; integers map to pitches using standard Pitch Class notation (0: C, 1: C♯/D♭, 2:D, ..., 11: B)
* `loudness`: the overall loudness of a track in decibels (dB)
* `mode`: the modality of a track, or the type of scale from which its melodic content is derived (0: minor, 1: major)
* `speechiness`: a measure of the presence of spoken words in the track (0-0.33: music and other non-speech-like tracks, 0.33-0.66: contain both music and speech, 0.66-1.0: most likely made entirely of spoken words (e.g. talk show, audio book, poetry))
* `acousticness`: a confidence measure of whether or not the track is acoustic (0.0: low confidence, 1.0: high confidence)
* `instrumentalness`: predicts whether or not a track contains vocals (0.0: vocal content, 1.0: no vocal content)
* `liveness`: detects the presence of an audience in the recording ( > 0.8: strong likelihood the track was performed live)
* `valence`: musical positiveness conveyed by the track (lower valence: more negative, higher valence: more positive)
* `tempo`: the overall estimated tempo in beats per minute (BPM)
* `genre`: the genre in which the track belongs
* `explicit`: whether or not the song is explicit
* `explicity binary`: whether or not the song is explicit (0: no, 1: yes)

#### **Your Task**
Using the Spotify dataset, you will do the following:
* Create an SVM model that can predict whether a song will be a hit or a bust;
* Predict whether songs with various keys and energies will be hits or busts.

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQJ9UIsI2j8vPnefdBj6GIrUGiDMsF5HRVAg4rsfaZqX5fAoTGLGydLvPXPQvE5ZSo9_aet1SC5UQji/pub?gid=1132556054&single=true&output=csv"
df_spotify = pd.read_csv(url)

df_spotify.head()

### **2. Decide independent and dependent variables**

In [None]:
features = df_spotify[['key', 'energy']]
label = df_spotify["top half"]

plt.figure(figsize=(10,6))
plt.scatter(features['key'], features['energy'], c = label)

# yellow: top hit, purple: bottom hit
plt.title("Energy vs. Key of Hit Songs Colored by Whether they were a Top or Bottom Hit")
plt.xlabel("Key")
plt.ylabel("Energy")

plt.show()

### **3: Split data into training and testing data**


In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, label, test_size=0.2, random_state=42)

### **4. Build a SVM model and test it**

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
pred=model.predict(X_test)

In [None]:
print(classification_report(y_test, pred))

In [None]:
cm=confusion_matrix(y_test, pred)
# confusion_matrix sorts the classes, so 0 (bottom half) comes before 1 (top half)
disp=ConfusionMatrixDisplay(cm, display_labels=['bottom half','top half'])
disp.plot()

### **5. Build a KNN model (with k=3) and test it**

In [None]:
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred_knn=knn.predict(X_test)

In [None]:
print(classification_report(y_test, pred_knn))

### **6. Use the model**
Use the model to predict whether the following songs are in the top hits.

1. A song with `key = 3` and `energy = 0.8`. According to our KNN model, will this song be in the top half of hits?
2. A song with `key = 4.5` and `energy = 0.45`. Will this song be a bust or a hit?
3. A song with `key = 1` and `energy = 0.5`. Will this song be a bust or a hit?

In [None]:
songs = pd.DataFrame([[3, 0.8], [4.5, 0.45], [1, 0.5]], columns = ["key", "energy"])
# use the KNN model from step 5 (`model` is the SVM from step 4)
prediction = knn.predict(songs)
print(prediction)

### **7. Reflection Question**
What is the best n_neighbors value? We can try a set of n_neighbors and find the best one.

In [None]:
# Hyperparameter tuning

scores = {}
for n in range(1,50,2):
    full_model = KNeighborsClassifier(n_neighbors = n)
    full_model.fit(X_train, y_train)
    pred = full_model.predict(X_test)
    # accuracy on the test set, as a percentage
    score = sum(pred == y_test)/len(pred)* 100
    scores[n] = score


plt.title("Accuracy on Test set across Hyperparameter values")
print(scores)
plt.plot(list(scores.keys()), list(scores.values()), label = 'Scores for all K')

# ADDING THE PERFORMANCE FOR K = SQRT SIZE FOR REFERENCE
k = int(len(X_train)**(1/2)/2)*2 - 1
full_model = KNeighborsClassifier(n_neighbors = k)
full_model.fit(X_train, y_train)
pred = full_model.predict(X_test)
score = sum(pred == y_test)/len(pred)* 100
plt.scatter([k], [score], color = 'r', marker = '*', s = 200, label = 'Square Root of Training Data Size')


top_score = max(scores.values())
best_k = list(scores.keys())[list(scores.values()).index(top_score)]
plt.scatter([best_k], [top_score], color = 'g', marker = '*', s = 200, label = 'Best Performance')

plt.legend()
plt.show()



# PRINTING THE RESULTS
print("Top score of optimal classifier: " + str(top_score))
print("Best Value of K to use " + str(best_k))

### **8. Try a decision tree and see how it performs**

<a name="p32"></a>
## Hyperparameter Tuning for KNN
* A model cannot learn from the training data which hyperparameter values give the highest accuracy on a given dataset. Instead, we train it with different combinations of hyperparameters, score each combination with cross-validation, and make predictions with the best one.
* As it is time-consuming and sometimes impossible to try all possible combinations of hyperparameters, two strategies are commonly used to find a good (though not necessarily the globally optimal) combination.
  * Grid Search: exhaustively searches all combinations within a predefined grid
  * Random Search: randomly samples a fixed number of parameter settings from specified distributions
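To make the difference concrete, the sketch below only generates the candidate settings, without training anything, using scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The parameter names and ranges are illustrative choices for a decision tree.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values
grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 10, 20, 40]}
grid_candidates = list(ParameterGrid(grid))
print(len(grid_candidates))      # 3 * 4 combinations

# Random search: a fixed number of draws from distributions
distributions = {"max_depth": randint(3, 20), "min_samples_split": randint(2, 50)}
random_candidates = list(ParameterSampler(distributions, n_iter=5, random_state=42))
print(len(random_candidates))    # exactly n_iter settings
for candidate in random_candidates:
    print(candidate)
```

The grid grows multiplicatively with every added hyperparameter, while the cost of random search is fixed by `n_iter` regardless of how many hyperparameters are searched.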


### **Grid Search**

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold

kf=KFold(n_splits=10, shuffle=True, random_state=42)

# define the range of hyper parameters
parameter={'n_neighbors': np.arange(2, 30, 1)}

knn=KNeighborsClassifier()

# define grid search strategies and scope
knn_cv=GridSearchCV(knn, param_grid=parameter, cv=kf, verbose=1)

knn_cv.fit(X_train, y_train)

# print the best hyper parameter

print('\nthe best hyper parameter n_neighbors: ', knn_cv.best_params_['n_neighbors'])

pred=knn_cv.predict(X_test)
print(classification_report(y_test, pred))

### **Random Search**

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
kf=KFold(n_splits=10, shuffle=True, random_state=42)

parameter={'n_neighbors': np.arange(2, 30, 1)}

knn=KNeighborsClassifier()

# define random search strategies and scope
knn_cv=RandomizedSearchCV(knn, param_distributions=parameter, cv=kf, verbose=1)

knn_cv.fit(X_train, y_train)

# print the best hyper parameter
print('\nthe best hyper parameter n_neighbors: ', knn_cv.best_params_['n_neighbors'])


pred=knn_cv.predict(X_test)
print(classification_report(y_test, pred))

<a name="p33"> </a>
## Hyperparameter Tuning for Decision Tree
A decision tree classifier has many parameters including
* *criterion*: functions to measure the quality of a split include *gini*, *entropy*, etc
* *max_depth*: the maximum depth of the tree
* *min_samples_split*: the minimum number of samples to split an internal node, default=2
* *min_samples_leaf*: the minimum number of samples at a leaf node, default=*1*

### **Grid Search**

In [None]:
from sklearn.tree import DecisionTreeClassifier
kf=KFold(n_splits=10, shuffle=True, random_state=42)
parameter={'max_depth':np.arange(3, 20, 1),'min_samples_split': np.arange(5, 50, 2)}
dt=DecisionTreeClassifier()
dt_gs=GridSearchCV(dt, param_grid=parameter, cv=kf, verbose=1)
dt_gs.fit(X_train, y_train)

print(dt_gs.best_estimator_)

pred=dt_gs.predict(X_test)
print(classification_report(y_test, pred))

### **Random Search**

In [None]:
kf=KFold(n_splits=10, shuffle=True, random_state=42)
parameter={'max_depth':np.arange(2, 20, 1),'min_samples_split': np.arange(4, 50, 2)}
dt=DecisionTreeClassifier()
dt_rs=RandomizedSearchCV(dt, param_distributions=parameter, cv=kf, verbose=1)
dt_rs.fit(X_train, y_train)

print(dt_rs.best_estimator_)

pred=dt_rs.predict(X_test)
print(classification_report(y_test, pred))

<a name="p34"> </a>
## Hyperparameter Tuning for Naive Bayes Classifier
[Naive Bayes](https://www.datacamp.com/tutorial/naive-bayes-scikit-learn)

### **Grid Search**

In [None]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
NB_gs = GridSearchCV(estimator=nb,
                 param_grid=params_NB,
                 cv=kf,   # use any cross validation technique
                 verbose=1)
NB_gs.fit(X_train, y_train)

print(NB_gs.best_estimator_)
pred=NB_gs.predict(X_test)
print(classification_report(y_test,pred ))

### **Random Search**

In [None]:
from scipy.stats import loguniform
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
# loguniform(a, b) requires a < b, so the lower bound comes first
params_NB = {'var_smoothing': loguniform(1e-9, 1e0)}
NB_rs = RandomizedSearchCV(estimator=nb,
                 param_distributions=params_NB,
                 cv=kf,   # use any cross validation technique
                 verbose=1,
                 scoring='accuracy')
NB_rs.fit(X_train, y_train)

pred=NB_rs.predict(X_test)

print(NB_rs.best_estimator_)
print(classification_report(y_test, pred))

<a name="p35"> </a>
## Hyperparameter Tuning for Support Vector Machine (SVM)

### **Grid Search**

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform
from sklearn.svm import SVC
svm=SVC()
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
} # total 3*3*2 = 18 combinations
# grid search evaluates every combination in param_grid
SVM_gs = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5)

SVM_gs.fit(X_train, y_train)


pred=SVM_gs.predict(X_test)

print(SVM_gs.best_estimator_)
print(classification_report(y_test, pred))

### **Random Search**

In [None]:
from scipy.stats import uniform
from sklearn.svm import SVC
svm=SVC()
param_dist = {
    'C': uniform(loc=0.1, scale=9.9),  # Uniform distribution between 0.1 and 10 (loc to loc + scale)
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'] + list(np.logspace(-3, 3, 50))
}
SVM_rs = RandomizedSearchCV(estimator=svm, param_distributions=param_dist, n_iter=20, cv=5)

SVM_rs.fit(X_train, y_train)


pred=SVM_rs.predict(X_test)

print(SVM_rs.best_estimator_)
print(classification_report(y_test, pred))

## Follow up
Only two columns (`key` and `energy`) were used to predict whether a song will be a hit. This may be one reason why the classifiers did not perform well.

Try different strategies and see if you can build a classifier with better accuracy.
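One strategy worth trying is to use more numeric features, scale them, and tune the model inside a single pipeline. The sketch below shows the mechanics on synthetic data from `make_classification` (a hypothetical stand-in for a wider selection of Spotify columns); the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Spotify features (hypothetical data, 8 numeric columns)
Xf, yf = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=42)
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    Xf, yf, test_size=0.2, random_state=42
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(Xf_train, yf_train)

test_acc = search.score(Xf_test, yf_test)
print(search.best_params_)
print(round(test_acc, 3))
```

Keeping the scaler inside the pipeline means it is refit on the training folds of each cross-validation split, so no information from the validation fold leaks into the scaling.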

<a name="p5"> </a>
# **Part 5: Exercise**
Use the diabetes dataset (`df_diabetes`) in this exercise. Build the following classifiers with hyperparameter tuning:
*   Decision Tree
*   SVM
*   KNN
