# AP Score Prediction Model
> Author: Tanisha Patil. AP score prediction using a Random Forest regression model
- toc: true
- categories: []
- type: pbl
- week: 18

## Introduction to Random Forest 

Random Forest is a machine learning model that combines the predictions of many decision trees. Let's break it down:

Imagine you have a big decision to make, like choosing the best video game. Since the "best" game is subjective, you ask your friends for advice, but (as expected) everyone has a different opinion. Some say game A is the best, while others vouch for game B. It's difficult to determine the true best game from just a few opinions. Now imagine you could ask not just a few friends but hundreds. Each friend has played different games and has their own preferences. By collecting all of their opinions, you get a much better idea of which game is truly the best.

Now, let's apply this idea to something different: predicting your performance on the AP CS test. Instead of asking friends about video games, we can use a similar concept, called a random forest, to make predictions based on different factors.

In an AP test score predictor, a random forest is like a group of friends, called trees, that come together to make predictions. Each tree represents an individual friend with their own opinions (learned decision rules). Just as each friend may have their own opinions about video games, each tree considers different factors that could influence your AP test score. For example, one tree might consider variables like attendance, work habits, behavior, and your overall grade, while another might focus on clarity of understanding, collaboration skills, leadership qualities, and integrity. Each tree independently makes its own prediction based on the factors it considers.

However, rather than relying on the prediction of just one tree, the random forest model combines the predictions of all the trees. For a regression model like the one used here, that means averaging them (a classification forest would take a majority vote instead). Just as collecting opinions from many friends helps you make a better decision about the best video game, the random forest combines the insights of multiple trees to provide a more accurate prediction of your AP test score.
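
To make the "combine the trees" step concrete, here is a minimal sketch on made-up data (the three binary factors and the scores are hypothetical, not the AP dataset). It checks that a scikit-learn `RandomForestRegressor` prediction is the average of its individual trees' predictions, which are exposed through the fitted model's `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: 60 students, 3 binary factors, scores from 1 to 5
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 3))
y = 1 + X.sum(axis=1) + rng.integers(0, 2, size=60)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# Ask every tree (every "friend") for its own prediction for one student
student = X[:1]
tree_predictions = np.array([tree.predict(student)[0] for tree in forest.estimators_])

# The forest's answer is the mean of the individual answers
forest_prediction = forest.predict(student)[0]
print("Individual trees:", tree_predictions)
print("Mean of trees:", tree_predictions.mean())
print("Forest prediction:", forest_prediction)
```

The individual trees usually disagree slightly because each one is trained on a different random resample of the data; averaging smooths out those differences.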

By considering various factors and combining the knowledge of many trees, the random forest can help predict how well you are likely to perform on your AP test. This can provide valuable insights for Mr. Mort, Mr. Yeung, and students to identify areas of improvement that contribute to success on the test.
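
One possible way to see which factors a trained forest leans on is scikit-learn's `feature_importances_` attribute. The sketch below uses made-up data in which only two of three factors affect the score; the column names are borrowed from this post, but the values and the relationship are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: Collab is deliberately unrelated to the score
rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.integers(0, 2, size=(200, 3)),
    columns=['Attendance', 'Work Habits', 'Collab'],
)
y = 1 + 2 * X['Attendance'] + 2 * X['Work Habits']  # scores of 1, 3, or 5

forest = RandomForestRegressor(n_estimators=50, random_state=1)
forest.fit(X, y)

# Importances are normalized so that they sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Importances describe what the model relies on, not what causes a good score, so they are best read as hints about where to look.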

In [15]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the data into pandas DataFrame
data = pd.read_csv('files/ap_predict_data.csv')

# Convert the letter grades to numerical scores, since numerical data is easier to work with than categorical labels (A, B, C, etc.)
grade_mapping = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'F': 1}
data['Grade'] = data['Grade'].map(grade_mapping) # pandas Series.map looks up every letter grade in the dictionary

# Separate the target variable
features = data.drop(['AP Score'], axis=1)
target = data['AP Score']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train
model.fit(X_train, y_train)

# Test
y_pred = model.predict(X_test)

# Round the predictions to the nearest integer (1 to 5) since we want an AP Score of 1, 2, 3, 4, or 5
y_pred = y_pred.round().astype(int)

# MSE Evaluation
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)

Mean Squared Error: 0.15384615384615385


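This Mean Squared Error looks quite small. Does that mean the model is SUPER accurate? Not by itself. AP scores only range from 1 to 5, so errors can never be very large, and almost any MSE on this scale looks small in absolute terms. Read against that range, an MSE of about 0.15 (a root mean squared error of roughly 0.39 points) indicates that the model is making reasonably accurate predictions, not perfect ones.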

- Main Takeaway: MSE is relative to the range of values of the target variable. 
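
To put the number in context, here is a small sketch with hypothetical scores (the real test set is not shown in this post). The reported MSE matches 2/13, which is what, for instance, two off-by-one errors among 13 test students would produce, so the arrays below are built that way. The sketch also compares the MSE with a naive baseline that always predicts the mean score.

```python
import numpy as np

# Hypothetical true AP scores for 13 test students
y_true = np.array([3, 4, 5, 2, 3, 4, 1, 5, 3, 4, 3, 2, 4])

# Hypothetical predictions: correct except for two off-by-one errors
y_pred = y_true.copy()
y_pred[0] += 1  # predicted 4 instead of 3
y_pred[5] -= 1  # predicted 3 instead of 4

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)  # back in "AP score points"

# Baseline: always predict the average score
baseline_mse = np.mean((y_true - y_true.mean()) ** 2)

print("MSE:", mse)
print("RMSE:", rmse)
print("Baseline MSE:", baseline_mse)
```

Comparing against a baseline is one way to judge an MSE on a narrow scale: the model is only useful to the extent that it beats the naive guess.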

In [20]:
import pandas as pd

def predict_score(input_data):
    # Work on a copy so the caller's DataFrame is left unchanged
    input_data = input_data.copy()
    # Convert the letter grade to a numerical score with the same mapping used for training
    input_data['Grade'] = input_data['Grade'].map(grade_mapping) # Series.map looks up each grade in the dictionary
    predicted_score = model.predict(input_data)
    predicted_score = int(round(predicted_score[0])) # round to the nearest integer AP score
    return predicted_score

while True:
    print("Enter the following information:")
    # This could probably be written more concisely 
    attendance = int(input("Attendance (0 or 1): "))
    work_habits = int(input("Work Habits (0 or 1): "))
    behavior = int(input("Behavior (0 or 1): "))
    timely = int(input("Timely (0 or 1): "))
    clarity = int(input("Clarity (0 or 1): "))
    qualified = int(input("Qualified (0 or 1): "))
    tech_sense = int(input("Tech Sense (0 or 1): "))
    tech_rated = int(input("Tech Rated (0 or 1): "))
    collab = int(input("Collab (0 or 1): "))
    leadership = int(input("Leadership (0 or 1): "))
    integrity = int(input("Integrity (0 or 1): "))
    # Add up the total score, since Total is one of the features the model was trained on
    total = attendance + work_habits + behavior + timely + clarity + qualified + tech_sense + tech_rated + collab + leadership + integrity
    grade = input("Grade (A, B, C, D, F): ") # This will be quantified in the function predict_score 

    # Dictionary of user input
    input_data = {
        'Attendance': [attendance],
        'Work Habits': [work_habits],
        'Behavior': [behavior],
        'Timely': [timely],
        'Clarity': [clarity],
        'Qualified': [qualified],
        'Tech Sense': [tech_sense],
        'Tech Rated': [tech_rated],
        'Collab': [collab],
        'Leadership': [leadership],
        'Integrity': [integrity],
        'Total': [total],
        'Grade': [grade]
    }

    input_df = pd.DataFrame(input_data)

    # Call predict_score function to apply input data to model
    predicted_score = predict_score(input_df)

    print('Total:', total)
    print('Grade:', grade)
    print("Predicted AP Score:", predicted_score)

    continue_testing = input("Do you want to continue testing? (y/n): ")
    if continue_testing.lower() != 'y': # .lower() accepts both 'Y' and 'y'
        break

Enter the following information:


Feature names unseen at fit time:
- Leadership
- Tech Rated
Feature names seen at fit time, yet now missing:
-  Leadership 
-  Tech Rated



Total: 4
Grade: B
Predicted Score: 2
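
The feature-name messages in the output above suggest that some headers in the CSV carry stray spaces (for example `' Leadership '` instead of `'Leadership'`), so the names used at prediction time do not match the names seen during training. Below is a minimal sketch of normalizing the headers right after loading; the CSV text is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text whose headers carry stray spaces
csv_text = "Attendance, Leadership , Tech Rated,AP Score\n1,0,1,3\n0,1,1,4\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))    # some names keep their surrounding spaces

# Strip leading and trailing whitespace from every column name
clean = raw.copy()
clean.columns = clean.columns.str.strip()
print(list(clean.columns))  # names now match the keys used for user input
```

Applying the same `str.strip()` step to the training data before fitting should make the feature names line up. Newer scikit-learn releases may raise an error instead of a warning for this kind of mismatch.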


Try these (or your own combination) to test the model. Each line lists the eleven 0/1 answers in the order of the prompts, followed by the letter grade:

1,1,1,0,0,1,1,1,1,0,0,A
- AP Score: 3

1,0,1,1,0,1,1,0,1,0,0,A
- AP Score: 3
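
Typing eleven answers by hand gets tedious, so here is a sketch of a helper that parses one of the comma-separated lines above into the fields the model expects. The name `parse_row` is hypothetical, and the sketch assumes the values appear in the same order as the input prompts.

```python
# Feature names in the order the prompts ask for them (assumed order)
FEATURES = ['Attendance', 'Work Habits', 'Behavior', 'Timely', 'Clarity', 'Qualified',
            'Tech Sense', 'Tech Rated', 'Collab', 'Leadership', 'Integrity']

def parse_row(line):
    """Turn '1,1,...,A' into a dict with the eleven 0/1 inputs, Total, and Grade."""
    parts = [p.strip() for p in line.split(',')]
    if len(parts) != len(FEATURES) + 1:
        raise ValueError(f"expected {len(FEATURES) + 1} values, got {len(parts)}")
    values = [int(p) for p in parts[:-1]]
    if any(v not in (0, 1) for v in values):
        raise ValueError("each of the first eleven values must be 0 or 1")
    row = dict(zip(FEATURES, values))
    row['Total'] = sum(values)        # Total is derived, just like in the input loop
    row['Grade'] = parts[-1].upper()  # letter grade, still unmapped
    return row

row = parse_row("1,1,1,0,0,1,1,1,1,0,0,A")
print(row['Total'], row['Grade'])  # 7 A
```

Wrapping the result as `pd.DataFrame([row])` gives a one-row DataFrame that could be handed to `predict_score`.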



# Hacks and Further Exploration 

...