<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day11_Introduction_to_Scikit_Learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day11
## Introduction to Scikit Learn

#### CS167: Machine Learning, Fall 2025




---



# Review of Information Gain

In [None]:
import math
def entropy(percentage_list):
    #input: percentage_list consists of float values that sum to 1.0
    #return: calculation of entropy for input percentages
    result = 0
    for percentage in percentage_list:
        if percentage != 0:
            result += -percentage*math.log2(percentage)
    return result

### Example #1

In [None]:
# 1. Starting Entropy
starting_entropy = entropy([7/13,6/13])
starting_entropy

In [None]:
# 2. Expected Entropy
expected_entropy = (7/13)*entropy([4/7,3/7]) + (6/13)*entropy([2/6,4/6])
expected_entropy

In [None]:
# 3. Information Gain
information_gain = starting_entropy - expected_entropy
information_gain

### Example #2
You are given the following dataset where:
- Day is the day of the week
- Chicken Wrap is whether or not a chicken wrap was available
- Hungry is whether you were hungry or not
- Choice is the choice you made as to where to eat

### What is the information gain of the `Hungry` column?

| **Day**   | **Chicken Wrap** | **Hungry** | **Choice** |
|-----------|------------------|------------|------------|
| Monday    | yes              | no         | Hubbell    |
| Wednesday | no               | yes        | Starbucks  |
| Monday    | yes              | yes        | Hubbell    |
| Wednesday | no               | no         | Hubbell    |
| Monday    | no               | yes        | Starbucks  |

In [None]:
# 1. Starting Entropy
starting_entropy = entropy([3/5,2/5])
starting_entropy

In [None]:
# 2. Expected Entropy
expected_entropy = (3/5)*entropy([1/3,2/3]) + (2/5)*entropy([2/2,0/2])
expected_entropy

In [None]:
# 3. Information Gain
information_gain = starting_entropy - expected_entropy
information_gain
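
The three hand-computed steps above can be cross-checked by letting pandas do the counting. The block below is a minimal sketch, not part of the original exercise: the helper names `entropy_of` and `information_gain` and the `meals` DataFrame are made up for illustration, and the rows are typed in from the table in Example #2.

```python
import math
import pandas

def entropy_of(column):
    # Entropy (base 2) of the class proportions found in one column
    proportions = column.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def information_gain(df, feature, target):
    # Starting entropy minus the size-weighted entropy of each feature group
    starting = entropy_of(df[target])
    expected = 0.0
    for _, group in df.groupby(feature):
        expected += (len(group) / len(df)) * entropy_of(group[target])
    return starting - expected

# Rows copied from the Example #2 table
meals = pandas.DataFrame({
    "Day": ["Monday", "Wednesday", "Monday", "Wednesday", "Monday"],
    "Chicken Wrap": ["yes", "no", "yes", "no", "no"],
    "Hungry": ["no", "yes", "yes", "no", "yes"],
    "Choice": ["Hubbell", "Starbucks", "Hubbell", "Hubbell", "Starbucks"],
})

hungry_gain = information_gain(meals, "Hungry", "Choice")
print(hungry_gain)
```

The printed value should agree with `information_gain` from the three-step calculation; passing `"Day"` or `"Chicken Wrap"` as the feature gives the gain for the other columns.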



---






# Introduction to Scikit Learn

# Overview of the Scikit Learn 'Algorithm':

When working in Scikit Learn (`sklearn`), there is a general pattern that we can follow to implement any supported machine learning algorithm.

It goes like this:
1. Load your data using `pandas.read_csv()`
2. Split your data using `train_test_split()`
3. Create your classifier/regressor object
4. Call `fit()` to train your model
5. Call `predict()` to get predictions
6. Call a metric function to measure the performance of your model.

In [None]:
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#classic scikit-learn algorithm

#0. import libraries
import sklearn
import pandas
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors

#1. load data
iris_df = pandas.read_csv("/content/drive/MyDrive/CS167/datasets/irisData.csv")

#2. split data
predictors = ['sepal length', 'sepal width','petal length', 'petal width']
target = "species"
train_data, test_data, train_sln, test_sln = \
        train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)

#3. Create classifier/regressor object (change these parameters for In-Class Exercise)
dt = tree.DecisionTreeClassifier(random_state=41)

#4. Call fit (to train the classification/regression model)
dt.fit(train_data,train_sln)

#5. Call predict to generate predictions
iris_predictions = dt.predict(test_data)

#6. Call a metric function to measure performance
print("Accuracy:", metrics.accuracy_score(test_sln,iris_predictions))

# Show the actual and predicted (this isn't necessary, but may help catch bugs)
print("___PREDICTED___ \t  ___ACTUAL___")
for i in range(len(test_sln)):
    print(iris_predictions[i],"\t\t", test_sln.iloc[i])

print("-------------------------------------------------------")
#print out a confusion matrix
iris_labels= ["Iris-setosa", "Iris-versicolor","Iris-virginica"]
conf_mat = metrics.confusion_matrix(test_sln, iris_predictions, labels=iris_labels)
print(pandas.DataFrame(conf_mat,index = iris_labels, columns = iris_labels))
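
If the course CSV on Google Drive is not available (for example, when running outside Colab), the same six steps can be tried on the copy of the iris data that ships with scikit-learn. This is a sketch under that assumption: `load_iris` uses different column names and encodes the species as the integers 0, 1, 2, so the rows may be split differently and the accuracy need not match the cell above.

```python
from sklearn import metrics, tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 1. load data (bundled copy, NOT the course's irisData.csv)
iris = load_iris(as_frame=True)
features, labels = iris.data, iris.target  # labels are the integers 0, 1, 2

# 2. split data
train_data, test_data, train_sln, test_sln = train_test_split(
    features, labels, test_size=0.2, random_state=41)

# 3. create the classifier object
dt = tree.DecisionTreeClassifier(random_state=41)

# 4. call fit to train the model
dt.fit(train_data, train_sln)

# 5. call predict to generate predictions
iris_predictions = dt.predict(test_data)

# 6. call a metric function to measure performance
accuracy = metrics.accuracy_score(test_sln, iris_predictions)
print("Accuracy:", accuracy)
```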

# Now, let's go through step-by-step:

## Step 1: Import libraries and load your data

We should be pretty familiar with this one.
- mount your drive
- import any relevant libraries
- use `pandas.read_csv()` to load in your dataset

In [None]:
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#0. import libraries
import sklearn
import pandas
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import neighbors

#1. load data
path = '/content/drive/MyDrive/CS167/datasets/irisData.csv'
iris_df = pandas.read_csv(path)

## Step 2: Split Data

Cross-validation is an important step in machine learning: it lets us evaluate a model on data it did not see during training. To do this, we need to split our data into `train_data` and `test_data`.
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_cross_validation.png" width=600/>
</div>

Scikit Learn takes this a step further and splits the data up into 4 pieces:

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day06_traintestsplit.png" width=600/>
</div>

- `train_data`: holds the predictor variables of the training set
- `train_sln`: holds the target variable of the training set
- `test_data`: holds the predictor variables of the testing set
- `test_sln`: holds the target variable of the testing set
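
To see the four pieces on something small enough to inspect by eye, here is a sketch on a made-up 10-row table (the `toy_df` DataFrame and its column names are hypothetical, not part of the iris data). With `test_size=0.2`, 2 of the 10 rows go to the testing set and the other 8 to the training set.

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical toy table: two predictor columns and one target column
toy_df = pandas.DataFrame({
    "height": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "width": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "label": ["a", "b"] * 5,
})

train_data, test_data, train_sln, test_sln = train_test_split(
    toy_df[["height", "width"]], toy_df["label"],
    test_size=0.2, random_state=41)

print(train_data.shape, test_data.shape)  # predictors: two columns each
print(train_sln.shape, test_sln.shape)    # targets: one value per row

# Rows stay paired: each predictor row keeps the same index as its label
print(train_data.index.equals(train_sln.index))
print(test_data.index.equals(test_sln.index))
```

Note the order of the four returned values: both predictor pieces come first, then both target pieces.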

In [None]:
#2. split data
predictors = ['sepal length', 'sepal width','petal length', 'petal width']
#predictors = iris_df.columns.drop('species')
target = "species"
train_data, test_data, train_sln, test_sln = train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)

In [None]:
# take a look at the data... make sure you understand what split of data is stored in each
print('train_data shape: ',train_data.shape)
print('test_data shape: ',test_data.shape)
print('train_sln shape: ',train_sln.shape)
print('test_sln shape: ',test_sln.shape)

train_data.head()

## Step 3: Create classifier/regressor object

The syntax/wording for this is going to come directly from the `sklearn` documentation.
- [Scikit Learn Decision Tree documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

The class name changes depending on whether you are doing __classification__ or __regression__.
- the task is generally part of the name:
    - `tree.DecisionTreeClassifier()`
    - `tree.DecisionTreeRegressor()`
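
As a small illustration of the difference, the sketch below trains both kinds of tree on the same four made-up rows (the data and variable names are invented for this example). The classifier is given category labels and predicts labels; the regressor is given numbers and predicts numbers.

```python
from sklearn import tree

# Hypothetical one-column predictor data
X = [[1], [2], [3], [4]]

# Classification: the target is a category
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, ["small", "small", "large", "large"])
class_prediction = clf.predict([[1], [4]])

# Regression: the target is a number
reg = tree.DecisionTreeRegressor(random_state=0)
reg.fit(X, [10.0, 20.0, 30.0, 40.0])
number_prediction = reg.predict([[1], [4]])

print(class_prediction)
print(number_prediction)
```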

In [None]:
#3. Create classifier/regressor object (change these parameters for In-Class Exercise #1)
dt = tree.DecisionTreeClassifier(random_state=2)

## Step 4: Call `fit()` to train the model

Each machine learning model has a training process associated with it. Scikit Learn makes it easy to train whatever model you choose by simply calling `fit()` on that model.

We generally pass two things into `fit()`:
- `train_data`: the predictor variables we want to train our model on
- `train_sln`: the label for each training example
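
One way to see what `fit()` changes is to compare a model before and after training. The sketch below uses a tiny made-up dataset (`toy_data`, `toy_sln`, and `toy_dt` are hypothetical names): calling `predict()` on an untrained tree raises `NotFittedError`, and after `fit()` the tree exposes what it learned.

```python
from sklearn import tree
from sklearn.exceptions import NotFittedError

# Hypothetical training set: one predictor column, two classes
toy_data = [[0], [1], [2], [3]]
toy_sln = ["a", "a", "b", "b"]

toy_dt = tree.DecisionTreeClassifier(random_state=2)

# Before fit(): the tree has not learned anything, so predict() fails
try:
    toy_dt.predict([[0]])
    fitted_before = True
except NotFittedError:
    fitted_before = False

toy_dt.fit(toy_data, toy_sln)

# After fit(): the learned structure is available on the model
print(toy_dt.classes_)         # the labels seen during training
print(toy_dt.get_depth())      # number of levels of splits
print(toy_dt.get_n_leaves())   # number of leaf nodes
```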


In [None]:
#4. Call fit (to train the classification/regression model)
dt.fit(train_data, train_sln)

## Step 5: Call `predict()` to get predictions

After our model is trained, it's time to run our testing data through our model and see what the model predicts.

Scikit Learn lets us do this in one line:
- we're saving what the function returns as `predictions`
- we're passing in `test_data`, which is the data without labels that was not included in training

In [None]:
#5. Call predict to generate predictions
predictions = dt.predict(test_data)

## Step 6: Evaluate the Model

Now that we have some predictions, we need to check to see how close we were by passing our predictions and the actual correct answers into a metric function.

| **Type of ML** | **Metric**                | **Description**                                                                                       | Scikit Learn                                                                                                                                                            |
|----------------|---------------------------|:-------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Classification | Accuracy                  | Number of correctly classified examples divided by the total number of examples                       | [`sklearn.metrics.accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)                                               |
| Classification | Confusion Matrix          | Detailed table of where our model got confused.                                                       | [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)          |
| Regression     | Mean Absolute Error (MAE) | The average absolute difference between the predictions and the actual target values                  | [`sklearn.metrics.mean_absolute_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) |
| Regression     | Mean Squared Error (MSE)  | The average squared difference between the predictions and the actual target values                   | [`sklearn.metrics.mean_squared_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)    |
| Regression     | $R^2$                     | 1: perfectly fit data; 0: same performance as always guessing the mean of the target variable         | [`sklearn.metrics.r2_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score)                                  |

Available metrics can be found in the sklearn documentation [[sklearn metrics]](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
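
The iris example only exercises the classification metrics, so here is a short sketch of the three regression metrics from the table on four made-up values (the `actual` and `predicted` lists are invented for illustration). The errors are -1, 0, 1, and 3, which makes the arithmetic easy to check by hand.

```python
from sklearn import metrics

# Hypothetical target values and model predictions
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 12]

# MAE: (1 + 0 + 1 + 3) / 4
mae = metrics.mean_absolute_error(actual, predicted)

# MSE: (1 + 0 + 1 + 9) / 4
mse = metrics.mean_squared_error(actual, predicted)

# R^2: 1 - (sum of squared errors) / (sum of squared distances from the mean)
#      = 1 - 11 / 20, since the mean of actual is 6
r2 = metrics.r2_score(actual, predicted)

print("MAE:", mae)
print("MSE:", mse)
print("R2:", r2)
```

As with `accuracy_score`, each function takes the actual values first and the predictions second.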

In [None]:
from sklearn import metrics

#6. call a metric function to evaluate the model
print("Accuracy:", metrics.accuracy_score(test_sln, predictions))

Here's an example of displaying a confusion matrix ([`ConfusionMatrixDisplay` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay)):

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
#print out a confusion matrix
iris_labels= ["Iris-setosa", "Iris-versicolor","Iris-virginica"]
conf_mat = metrics.confusion_matrix(test_sln, predictions, labels=iris_labels)
print(pandas.DataFrame(conf_mat,index = iris_labels, columns = iris_labels))

In [None]:
# Confusion Matrix Option #2
displ = ConfusionMatrixDisplay(confusion_matrix=conf_mat, display_labels=iris_labels)
displ.plot(cmap=plt.cm.Blues)
plt.show()
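
When reading the output of `confusion_matrix`, rows correspond to the actual labels and columns to the predicted labels, both in the order given by `labels`. The sketch below uses a made-up two-class example (the cat/dog data is hypothetical) where the errors only go in one direction, so the two axes are easy to tell apart.

```python
from sklearn import metrics

# Hypothetical labels: two actual cats are mistaken for dogs, no dog for a cat
pet_labels = ["cat", "dog"]
actual = ["cat", "cat", "cat", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "dog"]

pet_conf_mat = metrics.confusion_matrix(actual, predicted, labels=pet_labels)
print(pet_conf_mat)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total
accuracy_from_matrix = pet_conf_mat.trace() / pet_conf_mat.sum()
print(accuracy_from_matrix)
```

The first row reads "of the 3 actual cats, 1 was predicted cat and 2 were predicted dog"; the second row shows both dogs predicted correctly.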

## (Optional) Step 7: Print out the results to debug

Sometimes it's helpful to take a closer look at your predictions. Here's some code to do just that:

In [None]:
# Show the actual and predicted (this isn't necessary, but may help catch bugs)
print("___PREDICTED___ \t  ___ACTUAL___")
for i in range(len(test_sln)):
    print(predictions[i],"\t\t", test_sln.iloc[i])
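
The loop above indexes the two sides differently: `predictions[i]` but `test_sln.iloc[i]`. `predict()` returns a NumPy array, which is indexed by position, while `test_sln` is a pandas Series that still carries the shuffled row labels from the original DataFrame. A minimal sketch of the difference, using a made-up Series:

```python
import pandas

# Hypothetical Series whose index labels are shuffled, like test_sln after a split
shuffled = pandas.Series(["x", "y", "z"], index=[7, 2, 5])

first_by_position = shuffled.iloc[0]  # the first stored value
value_by_label = shuffled.loc[2]      # the value whose index label is 2

# Looking up the label 0 fails, because no row carries that label
try:
    shuffled.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position, value_by_label, label_zero_exists)
```

Using `.iloc[i]` keeps the i-th actual value lined up with the i-th prediction regardless of how the rows were shuffled.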

# In-Class Exercise #1:


Take a look at the [Decision Tree Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html):
- Explore different parameters you could use in the `DecisionTreeClassifier`
  - *hint: in Step #3, consider different values for `max_depth`*
- Can you improve accuracy to something **greater than 85%** without touching the `random_state` parameters?
  - Start from these two lines, keeping both `random_state` values fixed:
    - `train_data, test_data, train_sln, test_sln = train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)`
    - `dt = tree.DecisionTreeClassifier(random_state=2)`
- How did you accomplish this?