<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day12_Scikit_Learn_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day12
## Scikit Learn Practice

#### CS167: Machine Learning, Fall 2025


# Overview of the Scikit Learn 'Algorithm':

When working in Scikit Learn (`sklearn`), there is a general pattern that we can follow to implement any supported machine learning algorithm.

It goes like this:
1. Load your data using `pd.read_csv()`
2. Split your data using `train_test_split()`
3. Create your classifier/regressor object
4. Call `fit()` to train your model
5. Call `predict()` to get predictions
6. Call a metric function to measure the performance of your model.

In [None]:
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#classic scikit-learn algorithm

#0. import libraries
import sklearn
import pandas
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors

#1. load data
iris_df = pandas.read_csv("/content/drive/MyDrive/CS167/datasets/irisData.csv")

#2. split data
predictors = ['sepal length', 'sepal width','petal length', 'petal width']
target = "species"
train_data, test_data, train_sln, test_sln = \
        train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)

#3. Create classifier/regressor object
dt = tree.DecisionTreeClassifier(random_state=2)

#4. Call fit (to train the classification/regression model)
dt.fit(train_data,train_sln)

#5. Call predict to generate predictions
iris_predictions = dt.predict(test_data)

#6. Call a metric function to measure performance
print("Accuracy:", metrics.accuracy_score(test_sln,iris_predictions))

# Show the actual and predicted values (this isn't necessary, but may help catch bugs)
# print("___PREDICTED___ \t  ___ACTUAL___")
# for i in range(len(test_sln)):
#     print(iris_predictions[i],"\t\t", test_sln.iloc[i])

print("-------------------------------------------------------")
#print out a confusion matrix
iris_labels= ["Iris-setosa", "Iris-versicolor","Iris-virginica"]
conf_mat = metrics.confusion_matrix(test_sln, iris_predictions, labels=iris_labels)
print(pandas.DataFrame(conf_mat,index = iris_labels, columns = iris_labels))



---

# Plotting Decision Trees

You can plot a trained decision tree with the `sklearn.tree.plot_tree` function, which draws the tree using `matplotlib`.

In [None]:
# visualizing decision tree using tree.plot_tree()
import matplotlib.pyplot as plt

plt.figure(figsize=(10,10)) # Makes it so the graph isn't tiny
tree.plot_tree(dt); #if you remove the ;, you'll get more information about the tree

In [None]:
#tweak parameters to make it pretty
import matplotlib.pyplot as plt
fn=['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
cn=['setosa', 'versicolor', 'virginica']
plt.figure(figsize=(10,10))
tree.plot_tree(dt, feature_names=fn, class_names=cn, filled=True);

# Normalizing using `StandardScaler`

**Documentation**: [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
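
As a rough sketch of what the scaler does: `fit()` learns each column's mean and standard deviation from the training data, and `transform()` subtracts that mean and divides by that standard deviation. The toy arrays and the names `toy_train`, `toy_test`, and `toy_scaler` below are made up for illustration and are not part of the Iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): 4 training rows, 2 features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0]])

# fit() learns the per-column mean and standard deviation from the training rows
toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)

# transform() applies those same training statistics to any data
scaled_train = toy_scaler.transform(toy_train)
scaled_test = toy_scaler.transform(toy_test)

# The same calculation by hand: (x - column mean) / column standard deviation
manual_train = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(scaled_train, manual_train))
print(scaled_test)
```

The test row here sits exactly at the training means, so it is transformed to zeros. Note that the test data is scaled with the statistics learned from the training data, not with its own.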


In [None]:
train_data.head()

In [None]:
# Fit the scaler on the training data, then use it to normalize both splits
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(train_data)

train_data_normalized = scaler.transform(train_data)
test_data_normalized = scaler.transform(test_data)

# Example of Scikit Learn with normalized data

In [None]:
#1. load data
iris_df = pandas.read_csv("/content/drive/MyDrive/CS167/datasets/irisData.csv")

#2A. split data
predictors = ['sepal length', 'sepal width','petal length', 'petal width']
target = "species"
train_data, test_data, train_sln, test_sln = \
        train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)

#2B. Normalize the data (fit the scaler on the training data only)
scaler = StandardScaler()
scaler.fit(train_data)
train_data_normalized = scaler.transform(train_data)
test_data_normalized = scaler.transform(test_data)

#3. Create classifier/regressor object
dt = tree.DecisionTreeClassifier(random_state=2)

#4. Call fit (to train the classification/regression model)
dt.fit(train_data_normalized,train_sln)

#5. Call predict to generate predictions
iris_predictions = dt.predict(test_data_normalized)

#6. Call a metric function to measure performance
print("Accuracy:", metrics.accuracy_score(test_sln,iris_predictions))

# Discussion Question:
Why didn't normalizing the data improve the accuracy in the previous example?


# Intro to Dummy Variables

In [None]:
titanic_df = pandas.read_csv("/content/drive/MyDrive/CS167/datasets/titanic.csv")
titanic_subset = titanic_df[["survived", "age", "fare", "embarked"]].copy()
titanic_subset.head(6)

In [None]:
# This isn't a good idea...
titanic_subset["embarked"] = titanic_subset["embarked"].map({"S": 0, "Q": 1, "C": 2})
titanic_subset.head(6)

In [None]:
#instead, use dummy variables
titanic_subset2= titanic_df[["survived", "age", "fare", "embarked"]].copy()
titanic_subset_with_dummies = pandas.get_dummies(titanic_subset2, columns=["embarked"])
titanic_subset_with_dummies.head(6)
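
Mapping the ports to 0, 1, and 2 implies an ordering and a distance between them (for example, that "C" is twice as far from "S" as "Q" is), which has no real meaning. Dummy variables avoid this by giving each category its own column. Here is a minimal sketch on a tiny made-up table; `toy_df` and its values are invented for illustration and are not rows from `titanic.csv`.

```python
import pandas

# Tiny made-up table (not real Titanic rows)
toy_df = pandas.DataFrame({
    "fare": [7.25, 71.28, 8.05],
    "embarked": ["S", "C", "Q"],
})

# One new column per category; the original 'embarked' column is dropped
toy_dummies = pandas.get_dummies(toy_df, columns=["embarked"])

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Each row has exactly one of the `embarked_*` columns switched on, so no category is treated as larger or smaller than another.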

# 💬 Exercise #1:

1. Build a kNN classifier (try using [`sklearn.neighbors.KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)) and apply it to the Iris dataset
    - in the call to `train_test_split` use `random_state=41`
    - don't normalize the data (for this exercise)
    - step #3 should look like, `neigh = neighbors.KNeighborsClassifier()`
    - is there a difference in performance between using a **weighted** or **unweighted** knn?
      - *hint: consider the parameter `weights` in the documentation.*
    - what if you change the number of nearest neighbors to 21?
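
To show how the `weights` and `n_neighbors` parameters are passed, here is a small sketch on made-up one-dimensional data (the arrays and the names `toy_X`, `toy_y`, and `query` are hypothetical, and this is not the Iris solution). With `weights="uniform"` each neighbor gets one vote; with `weights="distance"` closer neighbors count more.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: one 'a' point very close to the query, two 'b' points farther away
toy_X = np.array([[0.0], [1.0], [1.1]])
toy_y = np.array(["a", "b", "b"])
query = np.array([[0.1]])

# Unweighted kNN: each of the 3 neighbors gets an equal vote
unweighted = KNeighborsClassifier(n_neighbors=3, weights="uniform")
unweighted.fit(toy_X, toy_y)

# Weighted kNN: each neighbor's vote is scaled by 1 / distance
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance")
weighted.fit(toy_X, toy_y)

uniform_pred = unweighted.predict(query)[0]
distance_pred = weighted.predict(query)[0]

print("unweighted:", uniform_pred)
print("weighted:  ", distance_pred)
```

In this toy case the two models disagree: the majority vote picks `b`, while the distance-weighted vote picks `a` because the single `a` point is much closer. Whether the choice matters on the Iris data is what the exercise asks you to find out.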

# Exercise #2:
## Let's try regression now:

Using the `vehicles.csv` dataset, let's try out sklearn with regression:
- load the data, get the right subset
- set predictors and target variables
- use `train_test_split()` to split the data

In [None]:
import pandas
import numpy
from sklearn.model_selection import train_test_split

#1. load data; get the right subset
vehicles_df = pandas.read_csv("/content/drive/MyDrive/CS167/datasets/vehicles.csv")
gas_vehicles = vehicles_df[vehicles_df['fuelType']=='Regular'][['year', 'cylinders', 'displ', 'comb08']]
gas_vehicles.dropna(inplace=True)

#2. split the data
# set the predictor variables and target variable
predictors= ['year', 'cylinders', 'displ']
target= 'comb08'
train_data, test_data, train_sln, test_sln = train_test_split(gas_vehicles[predictors], gas_vehicles[target], test_size = 0.2, random_state=41)

And then do the next steps:

- build our model using [`sklearn.neighbors.KNeighborsRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
- fit our model using `fit()` and passing in `train_data` and `train_sln`
- get our predictions by calling `predict()`
- evaluate our predictions using `metrics.mean_squared_error()` and `metrics.r2_score()`
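
Before running the metrics on the vehicle predictions, it can help to see them on numbers small enough to check by hand. The values in `toy_actual` and `toy_preds` below are made up for illustration and are not from `vehicles.csv`.

```python
from sklearn import metrics

# Made-up values for illustration only (not from vehicles.csv)
toy_actual = [3.0, 5.0, 7.0]
toy_preds = [2.0, 5.0, 9.0]

# Errors are -1, 0, and 2
toy_mse = metrics.mean_squared_error(toy_actual, toy_preds)   # mean of the squared errors
toy_mae = metrics.mean_absolute_error(toy_actual, toy_preds)  # mean of the absolute errors
toy_r2 = metrics.r2_score(toy_actual, toy_preds)

print("MSE:", toy_mse)
print("MAE:", toy_mae)
print("R2: ", toy_r2)
```

By hand, the squared errors are 1, 0, and 4, so the MSE is 5/3 (about 1.67), the MAE is 1, and $R^2$ is 1 - 5/8 = 0.375. All three functions take the true values first and the predictions second.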

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics  # needed for the metric calls in step 6

#3. Create classifier/regressor object

#4. Call fit (to train the classification/regression model)

#5. Call predict to generate predictions

#6. Call a metric function to measure performance
# use a metric to see how good our predictions are
#print("R2: ", metrics.r2_score(test_sln, preds))
#print("MAE: ", metrics.mean_absolute_error(test_sln, preds))
#print("MSE: ", metrics.mean_squared_error(test_sln, preds))

# Exercise #3:
Use [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to **first** normalize the data, and then apply kNN Regression.
- How does this affect the results?
- Can you explain these results?

# 💬 Exercise #4:
Look up an appropriate Decision Tree algorithm and apply it to the vehicles data:
- https://scikit-learn.org/stable/api/sklearn.tree.html
- Using the default parameter values of the decision tree, what is the $R^2$ metric?
- Interpret the $R^2$ value... is it good or bad?
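
As a reference for interpreting the value, $R^2$ is commonly defined as below. The symbols are introduced here for illustration: $y_i$ is an actual value, $\hat{y}_i$ is the matching prediction, and $\bar{y}$ is the mean of the actual values.

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A value of 1 means the predictions are perfect, 0 means the model does no better than always predicting the mean, and a negative value means it does worse than that.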

# Exercise #5:
Change your decision tree to have a `max_depth` of 3.
- does this help or hurt the decision tree performance?

Compare your decision tree to a kNN algorithm:
- what values of k seem to help the performance?
- What else can you do to help the performance?

Can you get a higher $R^2$ value using a kNN algorithm or a decision tree?
- what does this indicate about the data?