<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day08_Metrics_and_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day08
## Metrics and Testing

#### CS167: Machine Learning, Fall 2025


## Before we get started, let's load in our datasets:
Make sure you change the path to match your Google Drive.


In [None]:
import pandas as pd
import numpy as np

# The first step is to mount your Google Drive in Colab.
# You will be asked to authorize Colab to access your Google Drive; follow the prompts.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
iris_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/irisData.csv')
vehicles_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/vehicles.csv')



---



# Graphs!

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#define our data
xvals = [1,2,3,4,5]
series1 = [0.66,0.61,0.69,0.73,0.77]
series2 = [0.8,0.83,0.77,0.81,0.79]
series3 = [0.55,0.67,0.5,0.73,0.66]

#add titles to the axes and the graph
plt.suptitle('my rockin plot', fontsize=18)
plt.xlabel('a very cool x axis')
plt.ylabel('awesome y axis')

#plot the data
plt.plot(xvals, series1, 'ro--', label='1st series')
plt.plot(xvals, series2, 'bs-.', label='2nd series')
plt.plot(xvals, series3, 'g^-', label='3rd series')
plt.axis([0,6,0,1]) #[x_min, x_max, y_min, y_max]
plt.legend(loc='lower right', shadow=True) #without this, the label= text above is never displayed
plt.show()

In [None]:
gas_vehicles = vehicles_df[vehicles_df['fuelType']=='Regular']

# a silly function that returns the average MPG for the first k cars in the df
def getAverageMPG(data, k):
    return data["comb08"].iloc[0:k].mean()

number_of_points = 500

#populate the series list
series = []
for i in range(1, number_of_points):
    val = getAverageMPG(gas_vehicles, i)
    series.append(val)

#plot it!
xvals = range(1, number_of_points)
plt.suptitle('Average MPG', fontsize=18)
plt.xlabel('cars used in average')
plt.ylabel('average MPG')
plt.plot(xvals, series, 'r,-', label='MPG')
plt.legend(loc='lower right', shadow=True)
plt.axis([1, number_of_points, 10,25])
plt.show()

In [None]:
# Exercise here
# change the number of points to 20
# change the line to green triangles
# also plot the median (red dots)


## Cross-Validation Code:

A good rule of thumb is to train the model on 80% of the examples and test it on the remaining 20%, so that it is evaluated on examples it has never seen.

Splitting datasets into training and testing sets with a Pandas DataFrame:

In [None]:
# Shuffle the data
shuffled_data = iris_df.sample(frac=1, random_state=41)

# Compute the split index (20% for test)
test_size = int(0.2 * len(shuffled_data))

# Set up training and testing sets
test_data = shuffled_data.iloc[:test_size]     # first 20%
train_data = shuffled_data.iloc[test_size:]    # remaining 80%

train_data.shape
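
The same shuffle-then-slice idea can be wrapped in a small helper, which makes it easy to check that the two pieces are disjoint and together cover the whole dataset. This is only a sketch: `split_train_test` and the toy DataFrame are made up for illustration and are not part of the course code.

```python
import pandas as pd

def split_train_test(df, test_frac=0.2, seed=41):
    """Hypothetical helper: shuffle df, then slice off the first test_frac of rows as the test set."""
    shuffled = df.sample(frac=1, random_state=seed)
    test_size = int(test_frac * len(shuffled))
    # returns (train, test), in that order
    return shuffled.iloc[test_size:], shuffled.iloc[:test_size]

# toy data standing in for iris_df (values are invented)
toy_df = pd.DataFrame({'x': range(10), 'label': ['a', 'b'] * 5})

toy_train, toy_test = split_train_test(toy_df)
print(toy_train.shape, toy_test.shape)

# no row should appear in both sets
overlap = set(toy_train.index) & set(toy_test.index)
print("rows in both sets:", len(overlap))
```

With 10 rows and `test_frac=0.2`, the test set gets 2 rows and the training set gets the other 8.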



---



## Let's see how accurate our kNN model is:


Let's bring in our `kNN()` function. Here I'm calling it `classify_kNN()` because it uses `mode()` to return the prediction, which only works for classification.

In [None]:
def classify_kNN(new_example,train_data,k):
    #making a copy of the training set just so we don't mess up the original
    train_data_copy = train_data.copy()

    # 1. calculate distances
    train_data_copy['distance_to_new'] = np.sqrt(
     (new_example['petal length'] - train_data_copy['petal length'])**2
    +(new_example['sepal length'] - train_data_copy['sepal length'])**2
    +(new_example['petal width'] - train_data_copy['petal width'])**2
    +(new_example['sepal width'] - train_data_copy['sepal width'])**2)

    # 2. sort
    sorted_data = train_data_copy.sort_values(['distance_to_new'])

    # 3. predict
    prediction = sorted_data.iloc[0:k]['species'].mode()[0]

    return prediction
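
To see what step 1 computes, here is the same distance formula on a tiny hand-made table. The three rows and the new example are invented for illustration (they are not taken from `irisData.csv`), and only two columns are used so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

# made-up training rows, not real iris measurements
toy_train = pd.DataFrame({
    'petal length': [1.0, 4.0, 5.0],
    'petal width':  [0.0, 1.0, 2.0],
    'species':      ['A', 'B', 'B'],
})

# made-up example to classify
toy_new = pd.Series({'petal length': 4.0, 'petal width': 4.0})

# Euclidean distance from toy_new to every training row at once
toy_train['distance_to_new'] = np.sqrt(
    (toy_new['petal length'] - toy_train['petal length'])**2
    + (toy_new['petal width'] - toy_train['petal width'])**2)

# closest rows come first after sorting
toy_sorted = toy_train.sort_values('distance_to_new')
print(toy_sorted)

# the prediction for k=1 is the label of the single closest row
toy_prediction = toy_sorted.iloc[0:1]['species'].mode()[0]
print("prediction with k=1:", toy_prediction)
```

The distances work out to 5, 3, and √5 ≈ 2.24, so the third row is the nearest neighbor and the k=1 prediction is `'B'`.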

Now, let's write a function `classify_all_kNN(test_data, train_data, k)` that:
- goes through each example in `test_data` and gets the prediction using our `classify_kNN()` function
- returns a pandas `Series` that has the predictions for each row in `test_data`.

It should look something like this:

In [None]:
def classify_all_kNN(test_data, train_data, k) -> pd.Series:
    """
    Apply kNN classification to each row in the test data.

    Parameters:
        test_data (pd.DataFrame): Data to classify.
        train_data (pd.DataFrame): Training set with labels.
        k (int): Number of neighbors.

    Returns:
        pd.Series: Predicted labels for each row in test_data.
    """
    results = []

    for i in range(len(test_data)):
        prediction = classify_kNN(test_data.iloc[i], train_data, k)
        results.append(prediction)

    return pd.Series(results)

Now, let's pull it all together and see how our kNN does:

In [None]:
from sklearn.metrics import accuracy_score

predictions5NN = classify_all_kNN(test_data,train_data,5)

#this will print out our predictions so we can see:
print('ACTUAL            PREDICTIONS')
for i in range(len(test_data)):
    print(test_data['species'].iloc[i], " ", predictions5NN.iloc[i] )

acc = accuracy_score(test_data['species'], predictions5NN)
print("accuracy:", acc)
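
Accuracy is just the fraction of predictions that match the true labels, so it can be computed by hand as a sanity check. The labels below are invented for illustration. `pd.crosstab` is an addition here, not part of the original notebook; it is one way to see which classes get confused with each other.

```python
import pandas as pd

# invented labels, standing in for test_data['species'] and predictions5NN
actual = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica', 'virginica'])
predicted = pd.Series(['setosa', 'setosa', 'virginica', 'virginica', 'virginica'])

# compare by position (as NumPy arrays) so differing indexes don't matter
correct = (actual.to_numpy() == predicted.to_numpy())
manual_accuracy = correct.mean()   # fraction of True values
print("accuracy:", manual_accuracy)

# rows = actual class, columns = predicted class
confusion = pd.crosstab(actual, predicted, rownames=['actual'], colnames=['predicted'])
print(confusion)
```

Four of the five toy predictions match, so the accuracy is 0.8; the one miss shows up in the table as a versicolor predicted as virginica.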

Now, let's explore what the accuracy is for a variety of different values of k:

In [None]:
k_vals = [1,3,5,9,15,21,31,51,101,119]
kNN_accuracies = []

for k in k_vals:
    predictions = classify_all_kNN(test_data,train_data,k)
    current_accuracy = accuracy_score(test_data['species'],predictions)
    kNN_accuracies.append(current_accuracy)


plt.suptitle('Iris Data k-NN Experiment',fontsize=18)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.plot(k_vals,kNN_accuracies,'ro-',label='k-NN')
plt.legend(loc='lower left', shadow=True)
plt.axis([0,120,0,1])

plt.show()
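
Besides reading the best k off the plot, it can be looked up directly from the two lists. The numbers below are placeholders chosen only to show the mechanics; they are not results from the iris experiment, and `example_k_vals` / `example_accuracies` are hypothetical stand-ins for `k_vals` / `kNN_accuracies`.

```python
import numpy as np

# placeholder numbers for illustration only, NOT results from the iris experiment
example_k_vals = [1, 3, 5, 9]
example_accuracies = [0.90, 0.95, 0.95, 0.85]

# np.argmax returns the position of the first maximum, so ties go to the smaller k
best_position = int(np.argmax(example_accuracies))
best_k = example_k_vals[best_position]
print("best k:", best_k, "accuracy:", example_accuracies[best_position])
```

Here k=3 and k=5 tie, and the lookup reports k=3 because it appears first in the list.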