
# Day 12 Notes: Introduction to [Scikit-Learn](https://scikit-learn.org/stable/modules/classes.html)

The overall workflow for building a model with scikit-learn (sklearn) is:
0. Load libraries
1. Load data
2. Split data: use `train_test_split()`
3. Create a classifier/regressor object
4. Call `fit()` to train the model
5. Call `predict()` to get the predictions
6. Call a metric function to measure performance

The cells below mount Google Drive and then carry out steps 0-2:

In [None]:
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')



In [None]:
#classic scikit-learn algorithm

#0. import libraries
import sklearn
import pandas
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors

#1. load data
iris_data = pandas.read_csv('/content/drive/MyDrive/CS167Fall22/Datasets/irisData.csv')

#2. split data
predictors = ['sepal length', 'sepal width','petal length', 'petal width']
target = "species"
train_data, test_data, train_sln, test_sln = \
        train_test_split(iris_data[predictors], iris_data[target], test_size = 0.2, random_state=41)

In [None]:
#3. Create classifier/regressor object (change these parameters for Exercise #1)
dt = tree.DecisionTreeClassifier(random_state = 0)

#4. Call fit (to train the classification/regression model)
dt.fit(train_data,train_sln)

#5. Call predict to generate predictions
iris_predictions = dt.predict(test_data)

#6. Call a metric function to measure performance
print("Accuracy:", metrics.accuracy_score(test_sln,iris_predictions))

print("-------------------------------------------------------")
#print out a confusion matrix
iris_labels= ["Iris-setosa", "Iris-versicolor","Iris-virginica"]
conf_mat = metrics.confusion_matrix(test_sln, iris_predictions, labels=iris_labels)
print(pandas.DataFrame(conf_mat,index = iris_labels, columns = iris_labels))

Accuracy: 0.9
-------------------------------------------------------
                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa                9                0               0
Iris-versicolor            0               10               1
Iris-virginica             0                2               8
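
The accuracy and the confusion matrix above describe the same predictions, so one can be recovered from the other. As a minimal sketch (the matrix values are copied from the printed output; the variable names are only illustrative), the accuracy is the number of predictions on the diagonal divided by the total number of test rows:

```python
# Confusion matrix copied from the output above:
# rows are the true species, columns are the predicted species.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
conf_mat = [
    [9, 0, 0],
    [0, 10, 1],
    [0, 2, 8],
]

# Correct predictions sit on the diagonal (true label == predicted label).
correct = sum(conf_mat[i][i] for i in range(len(labels)))

# Every test example lands in exactly one cell, so the cells sum to the test size.
total = sum(sum(row) for row in conf_mat)

accuracy = correct / total
print("Correct:", correct, "of", total)
print("Accuracy:", accuracy)
```

The off-diagonal cells show where the errors are: one versicolor flower was predicted as virginica, and two virginica flowers were predicted as versicolor.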


## Exercise #1A
Check out the scikit-learn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=neighbors#sklearn.neighbors.KNeighborsClassifier):

Find the documentation for the kNN classifier (i.e., the classifier, not an unsupervised algorithm). Answer the following questions:
- What is the default value of k it uses?
- Does it do weighted or unweighted kNN by default?



[docs for knn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

## Exercise #1B
Use scikit-learn's kNN classifier on the iris data:
- Run with k=100 (which parameter does scikit-learn use for k?). What is the accuracy?
- Run with k=100, weighted vs. unweighted. What is the accuracy of each?
- Run with k=5. What is the accuracy?
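
Here is a rough sketch of how constructor parameters are passed to the kNN classifier. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) instead of the course CSV, and the value `n_neighbors=7` is an arbitrary choice for illustration, not one of the exercise settings:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Assumption: use the iris data bundled with scikit-learn so the sketch
# runs without the course CSV on Google Drive.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41)

accuracies = {}
for weights in ("uniform", "distance"):
    # n_neighbors is k; weights selects unweighted ("uniform") or
    # distance-weighted ("distance") voting among the neighbors.
    knn = KNeighborsClassifier(n_neighbors=7, weights=weights)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    accuracies[weights] = metrics.accuracy_score(y_test, predictions)
    print(weights, "accuracy:", accuracies[weights])
```

The same pattern should carry over to `train_data` and `train_sln` from the cells above; only the parameter values need to change.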


In [None]:
# here's a hint:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier() # may need to pass in some parameters here..
neigh.fit(train_data,train_sln)


## Let's try regression now:


In [None]:
import pandas
import numpy

# load the wine quality data (winequality-white.csv)
wine_data = pandas.read_csv('/content/drive/MyDrive/CS167Fall22/Datasets/winequality-white.csv')

# set the predictor variables and target variable
target= 'quality'
predictors = wine_data.columns.drop(target) # use all of the columns except for quality

# use train_test_split() to split the data
train_data, test_data, train_sln, test_sln = \
        train_test_split(wine_data[predictors], wine_data[target], test_size = 0.2, random_state=41)
train_data

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 6.9 | 0.40 | 0.14 | 2.4 | 0.085 | 21.0 | 40.0 | 0.99680 | 3.43 | 0.63 | 9.70 |
| 728 | 6.4 | 0.57 | 0.02 | 1.8 | 0.067 | 4.0 | 11.0 | 0.99700 | 3.46 | 0.68 | 9.50 |
| 1366 | 7.3 | 0.74 | 0.08 | 1.7 | 0.094 | 10.0 | 45.0 | 0.99576 | 3.24 | 0.50 | 9.80 |
| 1413 | 9.9 | 0.57 | 0.25 | 2.0 | 0.104 | 12.0 | 89.0 | 0.99630 | 3.04 | 0.90 | 10.10 |
| 1456 | 6.0 | 0.54 | 0.06 | 1.8 | 0.050 | 38.0 | 89.0 | 0.99236 | 3.30 | 0.50 | 10.55 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 407 | 12.0 | 0.39 | 0.66 | 3.0 | 0.093 | 12.0 | 30.0 | 0.99960 | 3.18 | 0.63 | 10.80 |
| 243 | 15.0 | 0.21 | 0.44 | 2.2 | 0.075 | 10.0 | 24.0 | 1.00005 | 3.07 | 0.84 | 9.20 |
| 321 | 9.3 | 0.61 | 0.26 | 3.4 | 0.090 | 25.0 | 87.0 | 0.99975 | 3.24 | 0.62 | 9.70 |
| 1104 | 8.0 | 0.48 | 0.34 | 2.2 | 0.073 | 16.0 | 25.0 | 0.99360 | 3.28 | 0.66 | 12.40 |


In [None]:
from sklearn import neighbors
from sklearn import metrics

# create our model
neigh = neighbors.KNeighborsRegressor() ### Don't miss this! Doing Regression here!!

# fit (train) the model to the data
neigh.fit(train_data, train_sln)

# use the trained model to get predictions from our test_data
predictions = neigh.predict(test_data)

# use a metric to see how good our predictions are; Don't miss this! Using Regression metrics here!!
print('MSE: ', metrics.mean_squared_error(test_sln, predictions))
print('r2: ', metrics.r2_score(test_sln, predictions))

MSE:  0.5618749999999999
r2:  0.10189810189810189
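
To make the two regression metrics less of a black box, here is a small sketch that computes them by hand. The numbers are made up for illustration (they are not taken from the wine data), and the formulas are the standard definitions of mean squared error and R^2:

```python
# Hypothetical true quality scores and model predictions (made-up numbers).
actual = [5, 6, 7, 5, 6]
predicted = [5.2, 5.8, 6.4, 5.6, 6.0]

n = len(actual)

# MSE: the average squared difference between prediction and truth.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / n

# R^2 = 1 - SS_res / SS_tot.
# SS_res is the model's total squared error; SS_tot is the total squared
# error of always predicting the mean of the true values.
mean_actual = sum(actual) / n
ss_res = sum(squared_errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("r2: ", r2)
```

Read this way, an R^2 near 0.10 like the one above means the model's squared error is only about 10% lower than it would be if it always predicted the mean quality.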


## Normalizing Data
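Whoa! This is so easy! scikit-learn is awesome!! Normalizing is just as easy: `StandardScaler` learns each column's mean and standard deviation from the training data with `fit()`, and `transform()` then rescales both the training and test data using those same values.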

In [None]:
# Normalization code using StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_data)
train_data_normalized = scaler.transform(train_data)
test_data_normalized = scaler.transform(test_data)

# now I can use train_data_normalized and test_data_normalized
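
As a sketch of what `StandardScaler` is doing under the hood, the toy example below (hypothetical numbers, not the wine data) repeats the transformation by hand with NumPy. The key point it illustrates is that the mean and standard deviation come from the training data only and are then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales.
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 200.0], [4.0, 700.0]])

scaler = StandardScaler()
scaler.fit(train)                     # learns per-column mean and std from train only
test_scaled = scaler.transform(test)  # reuses the training statistics

# The same transformation written out by hand.
col_mean = train.mean(axis=0)
col_std = train.std(axis=0)           # population std (ddof=0), which StandardScaler uses
test_by_hand = (test - col_mean) / col_std

print(test_scaled)
print("Matches by-hand version:", np.allclose(test_scaled, test_by_hand))
```

After scaling, every training column has mean 0 and standard deviation 1, so no single column dominates the distance calculations in kNN just because its units are larger.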

## Exercise #2
- Run the kNN regressor with k (`n_neighbors`) = 15 on the non-normalized data
- Run it again on the normalized data
- Which provides a better R^2?


## Exercise #2B
Create a graph with k values on the x-axis and R^2 values on the y-axis...
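
One possible scaffold for this kind of graph is sketched below. It assumes synthetic data from `make_regression` as a stand-in for the wine data, and both the range of k values and the helper name `plot_r2` are hypothetical choices:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

# Assumption: synthetic data stands in for the wine data so the sketch
# runs anywhere; swap in train_data/test_data from the cells above.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41)

k_values = list(range(1, 31, 2))  # hypothetical choice of k values to try
r2_scores = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    r2_scores.append(metrics.r2_score(y_test, knn.predict(X_test)))

def plot_r2(k_values, r2_scores):
    """Draw k on the x-axis and R^2 on the y-axis (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.plot(k_values, r2_scores, marker="o")
    plt.xlabel("k (n_neighbors)")
    plt.ylabel("R^2")
    plt.title("kNN regression: R^2 vs. k")
    plt.show()

# Print the values that the graph would show.
for k, score in zip(k_values, r2_scores):
    print(k, round(score, 3))
```

In a notebook, calling `plot_r2(k_values, r2_scores)` after the loop should draw the graph. Collecting the scores in a list first keeps the model loop separate from the plotting code.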

## Exercise #3

* Go to the scikit-learn site (https://scikit-learn.org/stable/)
* Search for “Decision Tree Regression”
* Implement a decision tree regressor in scikit-learn for the wine data.
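
As a starting point, the sketch below shows the general shape of a decision tree regressor in scikit-learn. It assumes synthetic data from `make_regression` rather than the wine data, and the two `max_depth` values are arbitrary picks for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

# Assumption: synthetic data stands in for the wine data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41)

results = {}
for depth in (2, None):  # None places no limit on the depth of the tree
    dt = DecisionTreeRegressor(max_depth=depth, random_state=0)
    dt.fit(X_train, y_train)
    train_r2 = metrics.r2_score(y_train, dt.predict(X_train))
    test_r2 = metrics.r2_score(y_test, dt.predict(X_test))
    results[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train r2={train_r2:.3f}, test r2={test_r2:.3f}")
```

Scoring on both the training and test sets is a design choice here: a tree with no depth limit can memorize the training rows, so the test R^2 is the number that says how well it generalizes.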

Hypothesis #1: Low values for the max depth of a decision tree will cause low R^2 values. Increasing the max depth will increase the R^2 values, but at a certain point, increasing it further will no longer have an effect on the R^2 values.

In [None]:
# code for hypothesis #1 testing here
#idea: Create a graph with max_depth on the x-axis, and R^2 values on the y axis.


Hypothesis #2: Normalization does not affect a decision tree's metrics.


In [None]:
# code for hypothesis #2 testing here
# idea: create decision trees with all of the same parameters (random_state=0) on normalized and non-normalized data; vary possible parameters; compare results
