# Part 1: Data Extraction Notebook


# Setup and Imports

For the file path used below to resolve, add this folder to your drive at the indicated location.

In [1]:
# Put all imports here
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [2]:
fname="../data/uiuc-gpa-dataset.csv"
df=pd.read_csv(fname)
df.head()

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,B-,C+,C,C-,D+,D,D-,F,W,Primary Instructor
0,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,11,5,...,0,0,0,0,0,0,0,1,0,"Shin, Jeongsu"
1,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,17,2,...,1,0,0,0,0,0,0,0,1,"Shin, Jeongsu"
2,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,13,2,...,2,0,0,1,0,0,0,1,0,"Lee, Sabrina Y"
3,2023,Spring,2023-sp,AAS,200,U.S. Race and Empire,LCD,6,15,5,...,0,0,0,0,0,1,0,1,0,"Sawada, Emilia"
4,2023,Spring,2023-sp,AAS,215,US Citizenship Comparatively,LCD,16,12,2,...,1,0,0,0,0,0,0,0,0,"Kwon, Soo Ah"


# Create Debug and Working Datasets

In [3]:
debug_df = df.sample(n=100)  # Smaller sample for debugging
working_df = df

In [4]:
# Print the first few rows of debug_df
print(debug_df.head())

# Print the first few rows of working_df
print(working_df.head())

       Year    Term YearTerm Subject  Number                  Course Title  \
52199  2013  Spring  2013-sp    NRES     421  Quantitative Methods in NRES   
27488  2018  Summer  2018-su    SOCW     502    Brief Mot Interventions SU   
62609  2011  Summer  2011-su     LAS     101              Freshman Seminar   
67747  2010  Spring  2010-sp    HIST     120      East Asian Civilizations   
64924  2010    Fall  2010-fa     FIN     434        Employee Benefit Plans   

      Sched Type  A+   A  A-  ...  B-  C+  C  C-  D+  D  D-  F  W  \
52199        LCD   5  17  13  ...   1   5  0   2   1  0   0  0  0   
27488        ONL  18   1   0  ...   0   0  0   1   0  0   0  0  0   
62609        DIS   4  15   3  ...   0   0  0   0   0  0   1  1  0   
67747        DIS   1   9   3  ...   3   0  1   0   0  0   0  1  0   
64924        LCD   0   1   4  ...   2   1  0   0   0  0   0  0  0   

       Primary Instructor  
52199  Yannarell, Anthony  
27488   Campbell, Corey C  
62609     Hoffman, Ruth A  
6774

Pickle the datasets
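
The pickling cell itself is not shown in this notebook. As a minimal sketch of one way it could look, the example below uses `DataFrame.to_pickle` and `pd.read_pickle`. The file names, the temporary output directory, and the tiny stand-in frames are assumptions for illustration, not the notebook's actual paths or data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frames; in the notebook these would be debug_df and working_df.
working_df = pd.DataFrame(
    {"Subject": ["AAS", "AAS", "FIN"], "Number": [100, 200, 434], "A": [11, 15, 1]}
)
debug_df = working_df.sample(n=2, random_state=0)

# Hypothetical output location; a temporary folder keeps the sketch self-contained.
out_dir = Path(tempfile.mkdtemp())
debug_path = out_dir / "debug_df.pkl"
working_path = out_dir / "working_df.pkl"

# Write both frames to disk.
debug_df.to_pickle(debug_path)
working_df.to_pickle(working_path)

# Read one back and confirm the round trip preserved the data.
restored = pd.read_pickle(working_path)
print(restored.equals(working_df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.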

# Getting training data

In this section, we prepare the data used to train a Logistic Regression model. We first convert the letter-grade counts in each row into a single score for the class. Next, we convert the class names into one-hot-encoded features (more details below).

Lastly, we split this data up into 3 sections: Training (70%), Validation (15%), Testing (15%).

In [5]:
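# Grade columns ordered from best to worst; W (withdrawal) is scored as the lowest grade
grade_columns = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'D-', 'F', 'W']
# Evenly spaced scores: A+ = 100 down to W = 0
grade_percent = np.linspace(100, 0, len(grade_columns))
print(grade_percent)

temp = df[grade_columns].values
Z = []
for i in range(len(temp)):
    # Count-weighted average score for this row
    score = np.sum(temp[i] * grade_percent) / np.sum(temp[i])
    # Label = index of the grade whose score is closest to the average
    idx = (np.abs(grade_percent - score)).argmin()
    Z.append(idx)
Z = np.array(Z)
print(Z[0])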

[100.          92.30769231  84.61538462  76.92307692  69.23076923
  61.53846154  53.84615385  46.15384615  38.46153846  30.76923077
  23.07692308  15.38461538   7.69230769   0.        ]
2
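
The same row-by-row computation can also be written without a Python loop. The following is a sketch on made-up grade counts (the three rows are illustrative, not taken from the dataset), and it assumes every row has at least one recorded grade so the division is well defined.

```python
import numpy as np

grade_columns = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'D-', 'F', 'W']
grade_percent = np.linspace(100, 0, len(grade_columns))

# Made-up grade counts for three sections (columns follow grade_columns).
counts = np.array([
    [0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # everyone received an A
    [5, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # half A+, half A-
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0],   # everyone received an F
])

# Count-weighted average score per row.
totals = counts.sum(axis=1)
scores = (counts @ grade_percent) / totals

# For each row, pick the index of the grade whose score is nearest the average.
idx = np.abs(scores[:, None] - grade_percent[None, :]).argmin(axis=1)
print(idx)
```

The second row shows what the averaging discards: a section split evenly between A+ and A- receives the same label as a section where everyone earned an A.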


In [6]:
parameter_columns = ['Subject', 'Number']
temp = df[parameter_columns].values
X = []
for i in range(len(temp)):
    # Concatenate subject and course number into one identifier, e.g. 'AAS100'
    X.append(str(temp[i, 0]) + str(temp[i, 1]))
X = np.array(X)
X[0]

'AAS100'

In [7]:
enc = OneHotEncoder(sparse_output=False)
X_onehot = enc.fit_transform(X.reshape(-1,1))

In [8]:
X_train, X_temp, Z_train, Z_temp = train_test_split(X_onehot, Z, test_size=0.3, random_state=42)
X_val, X_test, Z_val, Z_test = train_test_split(X_temp, Z_temp, test_size=0.5, random_state=42)
print("Training set size:", X_train.shape)
print("Training labels size:", Z_train.shape)
print("Validation set size:", X_val.shape)
print("Testing set size:", X_test.shape)

Training set size: (48348, 4576)
Training labels size: (48348,)
Validation set size: (10360, 4576)
Testing set size: (10361, 4576)


# Baseline Logistic Regression Model

For our baseline model, we use only the class as a feature. Every class (subject plus course number, e.g. AAS100) has been one-hot-encoded into its own feature column. Our output is a letter grade (A+, A, A-, etc.). To turn the grade counts into labels, we give each grade a numerical value, calculate the average score received by the students in that class that semester, and assign the letter grade closest to that average (represented as an index 0-13).

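Thus each datapoint consists of a one-hot-encoded class, with the corresponding output being the index of its average grade. We train our logistic regression model on this data; note that it is a classifier, so it treats the 14 grade indices as unordered categories.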

In [None]:
logreg = LogisticRegression(random_state=16, max_iter=1000)
logreg.fit(X_train, Z_train)

In [None]:
z_pred = logreg.predict(X_val)
print("Mean squared error:", mean_squared_error(Z_val, z_pred))
print("Mean absolute error:", mean_absolute_error(Z_val, z_pred))

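On the validation set, we obtain a mean absolute error of 0.676 and a mean squared error of 1.00. Both are measured on the 0-13 grade index, so predictions are off by roughly two-thirds of a grade step on average. This is a reasonable starting point for our model. Next, we will experiment with additional features and, eventually, more complex models.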
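
As a reading aid for these metrics, the following sketch computes both errors by hand on made-up label and prediction arrays and maps predicted indices back to letter grades. The arrays are illustrative only and are unrelated to the results reported above.

```python
import numpy as np

grade_columns = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'D-', 'F', 'W']

# Made-up true and predicted grade indices for four sections.
z_true = np.array([1, 2, 4, 0])
z_hat = np.array([1, 3, 2, 0])

# An error of 1 means the prediction is one grade step away (e.g. A- vs B+).
errors = z_hat - z_true
mae = np.mean(np.abs(errors))  # (0 + 1 + 2 + 0) / 4
mse = np.mean(errors ** 2)     # (0 + 1 + 4 + 0) / 4

# Translate predicted indices back into letter grades for readability.
pred_letters = [grade_columns[i] for i in z_hat]
print(pred_letters)
print(mae, mse)
```

The squared error weights the two-step miss four times as heavily as the one-step miss, which is why the two metrics can rank models differently.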