<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day14_Random_Forests_and_Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day14
## Random Forests & Dimensionality Reduction Techniques

#### CS167: Machine Learning, Fall 2025


In [None]:
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#0. import libraries
import sklearn
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

data= pd.read_csv("/content/drive/MyDrive/CS167/datasets/breast-cancer-data.csv")
data.head()

In [None]:
#split data
target = "diagnosis"
predictors = data.columns.drop(target) #gets all of the columns except the target
train_data, test_data, train_sln, test_sln = train_test_split(data[predictors], data[target], test_size = 0.2, random_state=41)

# Random Forest Code

In [None]:
# a Random Forest Classifier
forest = RandomForestClassifier(random_state = 0, max_features="log2")
forest.fit(train_data,train_sln)
predictions = forest.predict(test_data)
print("accuracy score: ", metrics.accuracy_score(test_sln,predictions))

vals = data[target].unique() ## possible classification values (M = malignant; B = benign)
conf_mat = metrics.confusion_matrix(test_sln, predictions, labels=vals)
print(pd.DataFrame(conf_mat, index = "True " + vals, columns = "Predicted " + vals))

# Feature Importances

In [None]:
# It looks like our random forest model achieved pretty good accuracy.
# Now let's check how important each of the features was in the ensemble of models we built.

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

#argsort returns the indices that sort the importances in ascending order;
#barh draws from the bottom up, so the most important feature ends up at the top
index = range(len(predictors))
importances = forest.feature_importances_
sorted_indices = np.argsort(importances)

plt.figure(figsize=(8,10)) #making the figure a bit bigger so the text is readable
plt.title('Breast Cancer Feature Importances')
plt.barh(index,importances[sorted_indices],height=0.8) #horizontal bar chart
plt.ylabel('Feature')
plt.yticks(index,predictors[sorted_indices]) #label each bar with its own feature name (same sorted order as the bars)
plt.xlabel("Random Forest Feature Importance")
plt.show()
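
The same importances can also be read as a sorted table instead of a plot. The sketch below is a minimal illustration, assuming scikit-learn's built-in copy of the breast cancer data (`load_breast_cancer`) stands in for the course CSV; the names `demo_forest` and `importance_table` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: the built-in dataset is used here so the example runs without the course CSV
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

demo_forest = RandomForestClassifier(random_state=0)
demo_forest.fit(X, y)

# Pair each importance with its column name, then sort so the largest comes first
importance_table = pd.Series(demo_forest.feature_importances_, index=X.columns)
importance_table = importance_table.sort_values(ascending=False)
print(importance_table.head(5))
```

Keeping the names and the values in one `Series` means they are always sorted together, so a label can never end up next to the wrong number.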

# Tuning the Forest

*   How can we tell how many trees to use?
*   What about how many features to include in our trees?

We can tune our random forest to find good values for these hyperparameters:



In [None]:
#This function just loops through a series of n_estimator (number of trees) values, builds a different model
#for each, and then plots their respective accuracies. By making it a function, it's easier
#to try out different ranges of numbers
import matplotlib.pyplot as plt

def tune_number_of_trees(n_estimator_values):
    rf_accuracies = []

    # loop through all of the possible number of trees
    for n in n_estimator_values:

        curr_rf = RandomForestClassifier(n_estimators=n, random_state=0)     # create classifier object
        curr_rf.fit(train_data,train_sln)                                    # fit model to training data
        curr_predictions = curr_rf.predict(test_data)                        # use model to make predictions
        curr_accuracy = metrics.accuracy_score(test_sln,curr_predictions)    # compare predictions to test solutions to determine accuracy
        rf_accuracies.append(curr_accuracy)                                  # add accuracy to list

    # now let's plot the accuracies
    plt.suptitle('Random Forest accuracy vs. number of trees',fontsize=18)
    plt.xlabel('# trees')
    plt.ylabel('accuracy')
    plt.plot(n_estimator_values,rf_accuracies,'ro-')
    plt.axis([0,n_estimator_values[-1]+1,.9,1.01])

    plt.show()

tune_number_of_trees(range(1,31))

It looks like the accuracy stays about the same whether we use a small number of trees or a large one. At least on some datasets, a Random Forest doesn't need much tuning of the number of trees.
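
A single train/test split can make the curve look bumpier or flatter than it really is, so one way to double-check is to average over several splits with cross-validation. This is only a sketch: it assumes scikit-learn's built-in `load_breast_cancer` data in place of the course CSV, and the tree counts and the name `cv_means` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: built-in dataset used as a stand-in for the course CSV
X, y = load_breast_cancer(return_X_y=True)

cv_means = {}
for n in [1, 5, 10, 30]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5)   # accuracy on each of 5 folds
    cv_means[n] = scores.mean()                # average the 5 fold accuracies
    print(n, "trees: mean accuracy", round(cv_means[n], 3))
```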

# Tuning Number of Features

In [None]:
#This function just loops through a series of max_features values (with the number of trees fixed at 10), builds a different model
#for each, and then plots their respective accuracies. By making it a function, it's easier
#to try out different ranges of numbers
def tune_max_features(max_features_values):
    rf_accuracies = []

    # loop through the number of max features
    for m in max_features_values:

        curr_rf = RandomForestClassifier(n_estimators=10,max_features=m, random_state=0) # create classifier object
        curr_rf.fit(train_data,train_sln)                                                # fit model to training data
        curr_predictions = curr_rf.predict(test_data)                                    # use model to make predictions
        curr_accuracy = metrics.accuracy_score(test_sln,curr_predictions)                # compare predictions to test solutions to determine accuracy
        rf_accuracies.append(curr_accuracy)                                              # add accuracy to list

    # now let's plot the accuracies
    plt.suptitle('Random Forest accuracy vs. max features',fontsize=18)
    plt.xlabel('max features')
    plt.ylabel('accuracy')
    plt.plot(max_features_values,rf_accuracies,'ro-')
    plt.axis([0,max_features_values[-1]+1,.9,1.01])

    plt.show()

tune_max_features(range(1,11))
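
The two loops above tune one setting at a time. scikit-learn's `GridSearchCV` can try combinations of both and score each one by cross-validation on the training data only. The sketch below is an illustration rather than the course's method: it assumes the built-in `load_breast_cancer` data, and the grid values are picked arbitrarily to keep it fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: built-in dataset used as a stand-in for the course CSV
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)

# Every combination of these values is tried (2 x 3 = 6 candidate forests)
param_grid = {"n_estimators": [10, 30], "max_features": [2, 5, "sqrt"]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)   # cross-validates inside the training set only

print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```

Because the test set is not used to choose the settings, the final accuracy is a fairer estimate of how the tuned forest performs on new data.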

## Feature Selection Code

Documentation: [`sklearn.feature_selection.SelectKBest()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
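
When no `score_func` is given, `SelectKBest` scores each feature with `f_classif` (the ANOVA F-statistic) and keeps the `k` highest-scoring columns. The sketch below shows this on scikit-learn's built-in `load_iris` data, which is an assumption standing in for the course CSV (its column names include units, e.g. `petal length (cm)`). It fits on all rows only to illustrate the scoring; in a modeling workflow the selector should be fit on the training split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: built-in iris data used as a stand-in for the course CSV
X, y = load_iris(return_X_y=True, as_frame=True)

# Passing f_classif explicitly is the same as leaving score_func at its default
selector_demo = SelectKBest(score_func=f_classif, k=2)
selector_demo.fit(X, y)

# The selector's scores_ are the F-statistics computed by f_classif
f_scores, p_values = f_classif(X, y)
print(np.allclose(selector_demo.scores_, f_scores))

# get_support() is a boolean mask marking the columns that were kept
kept_columns = list(X.columns[selector_demo.get_support()])
print(kept_columns)
```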

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score

In [None]:
# load the data
iris_df = pd.read_csv("/content/drive/MyDrive/CS167/datasets/irisData.csv")
predictors = ['sepal length', 'sepal width', 'petal length', 'petal width']
target = "species"


In [None]:
# split the data
train_data, test_data, train_sln, test_sln = \
    train_test_split(iris_df[predictors], iris_df[target], test_size = 0.2, random_state=41)

## First, let's establish a baseline -- How well does KNN do with all of the predictors?

In [None]:
clf = KNeighborsClassifier()        # create classifier object
clf.fit(train_data,train_sln)       # fit the training data
predictions = clf.predict(test_data) # use the model to make predictions
print('Accuracy:',accuracy_score(test_sln,predictions)) # what is the accuracy?

## Now, let's just select the best 2 predictors

In [None]:
# fit your selector just like you do when training a classifier/regressor
# only do this after splitting into train and test sets, so that information
# from the test set doesn't leak into the feature selection
selector = SelectKBest(k=2)
selector.fit(train_data,train_sln)

# bigger number means the feature is more important
print('Here are the scores of each feature:')
print(selector.scores_)
print(predictors)

In [None]:
#transforming the predictor columns of the training set
train_transformed = selector.transform(train_data)

print("Here's what the training predictors look like after the transformation. \
Notice that it's just the last two columns from the original data.")
train_transformed[0:6]

In [None]:
#take a look at the training data
train_data[0:6]

In [None]:
#Now we transform the predictor columns in the test set as well.
#Notice that we're using the selector that we trained using the training set.
#Do not re-fit it to the test data.
test_transformed = selector.transform(test_data)

#Now we can use our transformed data with a classifier just like always:
clf = KNeighborsClassifier()
clf.fit(train_transformed,train_sln)
predictions = clf.predict(test_transformed)
print('Accuracy:',accuracy_score(test_sln,predictions))
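
The fit-on-train, transform-both pattern above can also be bundled into a scikit-learn pipeline, which makes it harder to accidentally re-fit the selector on the test data. This is a sketch under the assumption that the built-in `load_iris` data stands in for the course CSV; `knn_pipeline` and `pipeline_accuracy` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: built-in iris data used as a stand-in for the course CSV
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)

# fit() fits the selector and the classifier on the training data only;
# score() reuses the already-fitted selector to transform the test data
knn_pipeline = make_pipeline(SelectKBest(k=2), KNeighborsClassifier())
knn_pipeline.fit(X_train, y_train)
pipeline_accuracy = knn_pipeline.score(X_test, y_test)
print("Accuracy:", pipeline_accuracy)
```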

# 💬 Group Exercise:

Let's give it a shot:
- Below, I went ahead and loaded (and cleaned) the penguin dataset 🐧
- Using `species` as the target variable, use `SelectKBest` to determine the best 3 attributes.
- Build a default Random Forest using only the 3 best attributes. How does the performance compare to a default Random Forest that uses all of the predictor variables?
- Keep running the code, incrementing the number of predictors each time. What is the minimum number of predictors needed to get 100% accuracy?


In [None]:

## the following code will load and clean the penguin dataset and will result in 8 predictors
penguin_df = pd.read_csv("/content/drive/MyDrive/CS167/datasets/penguins.csv")
penguin_df.head()
penguin_df.dropna(inplace=True) # drop null values
penguin_df["gender"] = penguin_df["gender"].map({"MALE": 0, "FEMALE": 1})
penguin_df = pd.get_dummies(penguin_df, columns=["island"]) # one-hot encode the data
penguin_df.head()

In [None]:
target = "species"
predictors = penguin_df.columns.drop(target)

train_data, test_data, train_sln, test_sln = \
    train_test_split(penguin_df[predictors], penguin_df[target], test_size = 0.2, random_state=41)

## Establish the base case, using all of the predictors in a Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0)
rf.fit(train_data,train_sln)
predictions = rf.predict(test_data)
print('Accuracy:',accuracy_score(test_sln,predictions))

In [None]:
## Your code here
## Use SelectKBest and start with 3 predictors
## Keep running your code, incrementing the number of predictors. What is the minimum number of predictors needed to get 100% accuracy?