<a href="https://colab.research.google.com/github/talamo13/Student-Success-Dashboard/blob/main/Grade_Prediction_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load Dataset**

Use the pandas library to load the CSV file from our public [GitHub repository](https://github.com/talamo13/Student-Success-Dashboard).

We also dropped the following columns to help clean up the data during the model training phase:

***'STAT 90 Text Grade <Text>' , 'GRADE' , 'SEMESTER' , 'Student ID'***

In [2]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/talamo13/Student-Success-Dashboard/main/Stat108-F20-S21")
df = df.drop(columns=['STAT 90 Text Grade <Text>','GRADE','SEMESTER','Student ID'])
# Fill in any missing values with 0's
df = df.fillna(0)
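
Filling every gap with 0 treats a missing score the same as an earned zero, so it can help to see how many values are affected first. The following is a minimal sketch on a made-up gradebook; the `toy` dataframe and its column names are assumptions for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy gradebook; the column names are illustrative only.
toy = pd.DataFrame({
    "HW 1": [8.0, None, 6.5],
    "Quiz 1": [30.0, 28.0, None],
    "TOTAL": [95.0, 60.0, 72.0],
})

# Count missing entries per column before they are replaced.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same fill strategy as the notebook: every missing score becomes 0.
filled = toy.fillna(0)
print(filled)
```

If a column turns out to be mostly missing, dropping it may be a better choice than filling it, but that depends on what the gap means in the gradebook.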

# **Data Preparation**

## Separating Into x And y Variables

To create a multivariate prediction model, we must define which columns will be used to predict each student's 'TOTAL' (overall grade).

Every other column in the 'df' dataframe will be used to predict 'TOTAL'.

This is done by splitting 'df' into two separate objects: 'x' (a dataframe of the input columns) and 'y' (the 'TOTAL' column).

**y = data we are making predictions on**

**x = data we will use to make predictions**

In [None]:
x = df.drop(columns=['TOTAL'])
y = df['TOTAL']

## **Data Split**

***Using the scikit-learn library***

Split data into training set and testing set

Training Set = 80% of data

Testing Set = 20% of data

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

We will now have 4 new variables that will define how we train and test our model

**x_train = 80% of the data in x**

**x_test = 20% of the data in x**

**y_train = 80% of the data in y**

**y_test = 20% of the data in y**
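As a quick sanity check of the 80/20 split, the sketch below runs `train_test_split` on a small made-up dataset and prints the resulting sizes. The `toy_x` and `toy_y` arrays and the `tx_`/`ty_` variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with 2 features each.
# Row i holds [2*i, 2*i + 1] and its target is i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=100
)

# 8 rows go to training and 2 rows go to testing.
print(tx_train.shape, tx_test.shape)
print(len(ty_train), len(ty_test))
```

The rows of x and y are shuffled together, so each feature row stays paired with its own target value.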

# **Model Building**

## Linear Regression

### **Train The Model**

***Using the scikit-learn library***

Build a linear regression ML model using the train variables that we have created

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)

### **Application**

By using our model to make a prediction on the data it was trained on (x_train), we can verify that the model is working as intended.

y_lr_train_pred ***should closely match*** y_train

In [None]:
# Make a prediction on the original dataset that the ML model was trained on
y_lr_train_pred = lr.predict(x_train)
# Make a prediction on the test data
y_lr_test_pred = lr.predict(x_test)

#y_lr_train_pred
#y_train

### **Model Performance**

***Using scikit-learn library we will evaluate how accurate the Linear Regression ML model is***

**Mean Squared Error:** The average squared difference between the actual value and the predicted value

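**R2 Score:** The proportion of the variance in the target variable that the model explains. A score of 1 is a perfect fit, and the score can be negative when the model does worse than always predicting the mean.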
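To make the two definitions concrete, here is a small NumPy sketch that computes both metrics by hand. The helper names `mse` and `r2` and the sample numbers are assumptions for illustration; the notebook itself uses scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up grades, not taken from the course data.
actual = [70.0, 80.0, 90.0]
close_pred = [72.0, 78.0, 91.0]
bad_pred = [90.0, 80.0, 70.0]

print(mse(actual, close_pred), r2(actual, close_pred))
print(mse(actual, bad_pred), r2(actual, bad_pred))
```

The second pair of predictions is worse than predicting the mean grade for everyone, which is why its R2 comes out negative.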

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Training Set
lr_train_mse = mean_squared_error(y_train, y_lr_train_pred)
lr_train_r2 = r2_score(y_train, y_lr_train_pred)

# Testing Set
lr_test_mse = mean_squared_error(y_test, y_lr_test_pred)
lr_test_r2 = r2_score(y_test, y_lr_test_pred)

Create a dataframe that will hold the performance metrics of our ML models

In [None]:
lr_performance_data = [['Linear Regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]]
models_performance = pd.DataFrame(lr_performance_data, columns=['Model', 'Training MSE', 'Training R2 Score', 'Testing MSE', 'Testing R2 Score'])
models_performance

| | Model | Training MSE | Training R2 Score | Testing MSE | Testing R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 2.930088e-26 | 1.0 | 4.200527e-26 | 1.0 |
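
An R2 of 1.0 on both sets with an MSE near 1e-26 is what a linear model produces when the target is an exact weighted sum of its inputs. Whether 'TOTAL' is computed that way in this gradebook is an assumption, not something the notebook states. The sketch below reproduces the pattern on synthetic data; the `scores`, `weights`, and `total` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Hypothetical component scores for 50 students and hypothetical grading weights.
scores = rng.uniform(0, 100, size=(50, 3))
weights = np.array([0.2, 0.3, 0.5])

# The target is an exact weighted sum of the inputs (no noise).
total = scores @ weights

model = LinearRegression().fit(scores, total)
pred = model.predict(scores)

# The error is at floating-point rounding level and R2 is effectively 1.
print(mean_squared_error(total, pred), r2_score(total, pred))
print(model.coef_)
```

In that situation the model is recovering the grading formula, so the score says little about how well it would predict a grade before all the components are known.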


## **Random Forest**

### **Train The Model**

***Using scikit-learn library***

Train a Random Forest regression model using the training variables we have already created

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(max_depth=2, random_state=100)
rf.fit(x_train, y_train)

### **Application**

As before, we are going to make a prediction on the original training set and on the testing set

In [None]:
# Make a prediction on the original dataset that the ML model was trained on
y_rf_train_pred = rf.predict(x_train)
# Make a prediction on the test data
y_rf_test_pred = rf.predict(x_test)

### **Model Performance**

Examine the performance of Random Forest using MSE and R2 Scores

Add this data to the models_performance dataframe

In [None]:
# Training Set
rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_r2 = r2_score(y_train, y_rf_train_pred)

# Testing Set
rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_r2 = r2_score(y_test, y_rf_test_pred)

# Add the Random Forest row to the performance dataframe
rf_performance_data = ['Random Forest', rf_train_mse, rf_train_r2, rf_test_mse, rf_test_r2]
models_performance.loc[len(models_performance)] = rf_performance_data
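
Assigning to `.loc[len(frame)]` is one way to append a row to a dataframe that has the default 0, 1, 2, ... index. A minimal sketch follows, using a made-up `perf` table whose model names and numbers are placeholders and not results from this notebook.

```python
import pandas as pd

# Hypothetical performance table with one existing row.
perf = pd.DataFrame(
    [["Model A", 1.0, 0.90]],
    columns=["Model", "Testing MSE", "Testing R2 Score"],
)

# len(perf) is 1, so this writes a new row at index label 1.
perf.loc[len(perf)] = ["Model B", 2.0, 0.80]
print(perf)
```

Running such a cell twice appends the same row twice, so it is worth re-creating the table first when re-executing the notebook.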

## **Overall Performance**

In [None]:
models_performance

| | Model | Training MSE | Training R2 Score | Testing MSE | Testing R2 Score |
|---|---|---|---|---|---|
| 0 | Linear Regression | 2.930088e-26 | 1.0 | 4.200527e-26 | 1.0 |
| 1 | Random Forest | 1701.666 | 0.959992 | 1735.275 | 0.924623 |


# 1st Exam Score

To predict any given student's Midterm 1 grade, we train only on the work that is due before the exam.

According to the STAT108 syllabus, these are all the assignments due before Midterm 1 (10 classes):
- 10 Attendance Grades
- 5 Activities
- ~6 Homeworks (hard to tell from the syllabus alone)
- 1 Quiz
- 1 Quiz Maintenance



In [8]:
midterm1_x = df.loc[:,['HW 1 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'HW 2 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'HW 3 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'HW 4 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'HW 5 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'HW 6 Points Grade <Numeric MaxPoints:8 Category:Homework>',
       'Week 1 Aug 24 to28 Points Grade <Numeric MaxPoints:15 Category:Activity>',
       'Week 2 Aug 31 to Sep 4 Points Grade <Numeric MaxPoints:15 Category:Activity>',
       'Week 3 Sep 7 to 11 Points Grade <Numeric MaxPoints:15 Category:Activity>',
       'Week 4 Sep 14 to 18 Points Grade <Numeric MaxPoints:15 Category:Activity>',
       'Week 5 Sep 21 to 25 Points Grade <Numeric MaxPoints:15 Category:Activity>',
       'Week 1 Day 1 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 1 Day 2 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 2 Day 1 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 2 Day 2 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 3 Day 1 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 3 Day 2 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 4 Day 1 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 4 Day 2 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 5 Day 1 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Week 5 Day 2 Lecture Points Grade <Numeric MaxPoints:4 Category:Lecture Attendance-Participation>',
       'Quiz 1 Points Grade <Numeric MaxPoints:13 Category:Maintenance & Improvement>',
       'Quiz 1 Points Grade <Numeric MaxPoints:33 Category:Quizzes>']]

midterm1_y = df['Midterm 1 Points Grade <Numeric MaxPoints:100 Category:Midterms>']

midterm1_y

0      70.28
1      86.15
2       0.00
3      96.00
4      72.50
       ...  
243    86.50
244    93.50
245    98.00
246    98.50
247    92.66
Name: Midterm 1 Points Grade <Numeric MaxPoints:100 Category:Midterms>, Length: 248, dtype: float64
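
The homework and lecture column names in the selection above follow a regular pattern, so they could also be generated instead of typed out. This is only a sketch of that idea: the template constants and list names are assumptions, and the activity and quiz columns are left out because their names do not follow a single pattern.

```python
# Column-name templates copied from the selection above.
HW_TEMPLATE = "HW {n} Points Grade <Numeric MaxPoints:8 Category:Homework>"
LECTURE_TEMPLATE = (
    "Week {week} Day {day} Lecture Points Grade "
    "<Numeric MaxPoints:4 Category:Lecture Attendance-Participation>"
)

# HW 1 through HW 6.
homework_cols = [HW_TEMPLATE.format(n=n) for n in range(1, 7)]

# Weeks 1 through 5, two lecture days per week, in the same order as the list above.
lecture_cols = [
    LECTURE_TEMPLATE.format(week=week, day=day)
    for week in range(1, 6)
    for day in (1, 2)
]

print(len(homework_cols), len(lecture_cols))
print(homework_cols[0])
print(lecture_cols[-1])
```

The generated lists can then be concatenated with the remaining literal column names before indexing the dataframe.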

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(midterm1_x, midterm1_y, test_size=0.2, random_state=100)

In [10]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(max_depth=2, random_state=100)
rf.fit(x_train, y_train)

In [11]:
# Make a prediction on the original dataset that the ML model was trained on
y_rf_train_pred = rf.predict(x_train)
# Make a prediction on the test data
y_rf_test_pred = rf.predict(x_test)

In [13]:
from sklearn.metrics import mean_squared_error, r2_score

# Training Set
rf_train_mse = mean_squared_error(y_train, y_rf_train_pred)
rf_train_r2 = r2_score(y_train, y_rf_train_pred)

# Testing Set
rf_test_mse = mean_squared_error(y_test, y_rf_test_pred)
rf_test_r2 = r2_score(y_test, y_rf_test_pred)


In [14]:
rf_performance_data = [['Midterm Random Forest', rf_train_mse, rf_train_r2, rf_test_mse, rf_test_r2]]
models_performance = pd.DataFrame(rf_performance_data, columns=['Model', 'Training MSE', 'Training R2 Score', 'Testing MSE', 'Testing R2 Score'])
models_performance

| | Model | Training MSE | Training R2 Score | Testing MSE | Testing R2 Score |
|---|---|---|---|---|---|
| 0 | Midterm Random Forest | 130.580769 | 0.73001 | 111.866216 | 0.457321 |
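
The testing R2 (about 0.46) is well below the training R2 (about 0.73), and it comes from a single 80/20 split of 248 rows, so it rests on roughly 50 students. One common way to get a steadier estimate is k-fold cross-validation. The sketch below shows the idea on synthetic data; `toy_x`, `toy_y`, and the choice of 5 folds are assumptions, and the printed scores say nothing about the course data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(100)

# Synthetic stand-in for the gradebook: 120 rows, 4 features, noisy target.
toy_x = rng.uniform(0, 10, size=(120, 4))
toy_y = toy_x.sum(axis=1) * 2 + rng.normal(0, 3, size=120)

# Same model settings as the notebook's Random Forest.
model = RandomForestRegressor(max_depth=2, random_state=100)

# One R2 score per fold; each fold is held out once while the rest train the model.
cv_r2 = cross_val_score(model, toy_x, toy_y, cv=5, scoring="r2")
print(cv_r2)
print(cv_r2.mean(), cv_r2.std())
```

The spread across folds gives a rough sense of how much a single test score can move with the choice of split.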
