# Final Project - Modeling
##### Name: Shane Staret  
##### Class: CSCI 349 - Intro to Data Mining   
##### Semester: 2021SP   
##### Instructor: Brian King

### **Accomplishments, Challenges, & What to Expect Moving Forward**

Well, I was able to find a model that is a pretty good fit! It was not the model I was expecting, but I'm happy that I was able to find a model that made fairly accurate and precise predictions. The most challenging part of the modeling process (by far) was choosing the hyperparameters for the Keras NN. I tried different activation functions, different numbers of epochs, and different NN structures (e.g. with hidden layers, without hidden layers, multiple layers with different activation functions, etc.), and unfortunately I was never able to get a Keras NN model that worked as well as I had hoped.  
  
Moving forward, everything from this notebook along with the Data Prep/EDA notebook needs to be tied together to form a cohesive narrative about the problem I have addressed and how I have solved it. A deeper dive into the interpretation of the results and their ramifications will also be included in the final report.

### **Data Preprocessing**

In [1]:
# ALL DATA PREPROCESSING

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import shuffle
import tensorflow as tf
import keras
from keras.models import Input, Model, Sequential
from keras.layers.core import Dense, Activation
from keras.wrappers.scikit_learn import KerasClassifier

# reading in the dataset for the math course
df_mat = pd.read_csv('data/student-mat.csv', delimiter=';')

# reading in the dataset for the Portuguese course
df_por = pd.read_csv('data/student-por.csv', delimiter=';')

# reassigning dtype for each attribute
df_mat['school'] = pd.Categorical(df_mat['school'])
df_por['school'] = pd.Categorical(df_por['school'])

df_mat['sex'] = pd.Categorical(df_mat['sex'])
df_por['sex'] = pd.Categorical(df_por['sex'])

df_mat['age'] = pd.to_numeric(df_mat['age'], downcast='unsigned')
df_por['age'] = pd.to_numeric(df_por['age'], downcast='unsigned')

df_mat['address'] = pd.Categorical(df_mat['address'])
df_por['address'] = pd.Categorical(df_por['address'])

df_mat['famsize'] = pd.Categorical(df_mat['famsize'])
df_por['famsize'] = pd.Categorical(df_por['famsize'])

df_mat['Pstatus'] = pd.Categorical(df_mat['Pstatus'])
df_por['Pstatus'] = pd.Categorical(df_por['Pstatus'])

df_mat['Medu'] = pd.Categorical(df_mat['Medu']).as_ordered()
df_por['Medu'] = pd.Categorical(df_por['Medu']).as_ordered()

df_mat['Fedu'] = pd.Categorical(df_mat['Fedu']).as_ordered()
df_por['Fedu'] = pd.Categorical(df_por['Fedu']).as_ordered()

df_mat['Mjob'] = pd.Categorical(df_mat['Mjob'])
df_por['Mjob'] = pd.Categorical(df_por['Mjob'])

df_mat['Fjob'] = pd.Categorical(df_mat['Fjob'])
df_por['Fjob'] = pd.Categorical(df_por['Fjob'])

df_mat['reason'] = pd.Categorical(df_mat['reason'])
df_por['reason'] = pd.Categorical(df_por['reason'])

df_mat['guardian'] = pd.Categorical(df_mat['guardian'])
df_por['guardian'] = pd.Categorical(df_por['guardian'])

df_mat['traveltime'] = pd.Categorical(df_mat['traveltime']).as_ordered()
df_por['traveltime'] = pd.Categorical(df_por['traveltime']).as_ordered()

df_mat['studytime'] = pd.Categorical(df_mat['studytime']).as_ordered()
df_por['studytime'] = pd.Categorical(df_por['studytime']).as_ordered()

df_mat['failures'] = pd.Categorical(df_mat['failures']).as_ordered()
df_por['failures'] = pd.Categorical(df_por['failures']).as_ordered()

df_mat['schoolsup'] = pd.Categorical(df_mat['schoolsup'])
df_por['schoolsup'] = pd.Categorical(df_por['schoolsup'])

df_mat['famsup'] = pd.Categorical(df_mat['famsup'])
df_por['famsup'] = pd.Categorical(df_por['famsup'])

df_mat['paid'] = pd.Categorical(df_mat['paid'])
df_por['paid'] = pd.Categorical(df_por['paid'])

df_mat['activities'] = pd.Categorical(df_mat['activities'])
df_por['activities'] = pd.Categorical(df_por['activities'])

df_mat['nursery'] = pd.Categorical(df_mat['nursery'])
df_por['nursery'] = pd.Categorical(df_por['nursery'])

df_mat['higher'] = pd.Categorical(df_mat['higher'])
df_por['higher'] = pd.Categorical(df_por['higher'])

df_mat['internet'] = pd.Categorical(df_mat['internet'])
df_por['internet'] = pd.Categorical(df_por['internet'])

df_mat['romantic'] = pd.Categorical(df_mat['romantic'])
df_por['romantic'] = pd.Categorical(df_por['romantic'])

df_mat['famrel'] = pd.to_numeric(df_mat['famrel'], downcast='unsigned')
df_por['famrel'] = pd.to_numeric(df_por['famrel'], downcast='unsigned')

df_mat['freetime'] = pd.to_numeric(df_mat['freetime'], downcast='unsigned')
df_por['freetime'] = pd.to_numeric(df_por['freetime'], downcast='unsigned')

df_mat['goout'] = pd.to_numeric(df_mat['goout'], downcast='unsigned')
df_por['goout'] = pd.to_numeric(df_por['goout'], downcast='unsigned')

df_mat['Dalc'] = pd.to_numeric(df_mat['Dalc'], downcast='unsigned')
df_por['Dalc'] = pd.to_numeric(df_por['Dalc'], downcast='unsigned')

df_mat['Walc'] = pd.to_numeric(df_mat['Walc'], downcast='unsigned')
df_por['Walc'] = pd.to_numeric(df_por['Walc'], downcast='unsigned')

df_mat['health'] = pd.to_numeric(df_mat['health'], downcast='unsigned')
df_por['health'] = pd.to_numeric(df_por['health'], downcast='unsigned')

df_mat['absences'] = pd.to_numeric(df_mat['absences'], downcast='unsigned')
df_por['absences'] = pd.to_numeric(df_por['absences'], downcast='unsigned')

df_mat['G1'] = pd.to_numeric(df_mat['G1'], downcast='unsigned')
df_por['G1'] = pd.to_numeric(df_por['G1'], downcast='unsigned')

df_mat['G2'] = pd.to_numeric(df_mat['G2'], downcast='unsigned')
df_por['G2'] = pd.to_numeric(df_por['G2'], downcast='unsigned')

df_mat['G3'] = pd.to_numeric(df_mat['G3'], downcast='unsigned')
df_por['G3'] = pd.to_numeric(df_por['G3'], downcast='unsigned')

# adding column to each df to designate the course that the students are in
df_mat['course'] = 'mat'
df_mat['course'] = pd.Categorical(df_mat['course'])
df_por['course'] = 'por'
df_por['course'] = pd.Categorical(df_por['course'])

# combine data frames
df_com = pd.concat([df_mat, df_por], ignore_index=True)
df_com['course'] = pd.Categorical(df_com['course'])

# convert all categorical attributes in combined dataframe to numeric
cat_columns = df_com.select_dtypes(['category']).columns
df_com[cat_columns] = df_com[cat_columns].apply(lambda x: x.cat.codes)

# min-max scaling all data to the range [0, 1] (except for grade variables)
min_max_scaler = preprocessing.MinMaxScaler()
df_com.loc[:, (df_com.columns != 'G1') & (df_com.columns != 'G2') & (df_com.columns != 'G3')] = min_max_scaler.fit_transform(df_com.loc[:, (df_com.columns != 'G1') & (df_com.columns != 'G2') & (df_com.columns != 'G3')])

# downcasting all data
df_com = df_com.apply(pd.to_numeric, downcast='float')

# dropping first and second period grades in an additional dataset so we can use models on both the dataset that contains these grades and the one that doesn't
df_com_alt = df_com.drop(columns=['G1', 'G2'])

display(df_com.info())
display(df_com_alt.info())

Using TensorFlow backend.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1044 entries, 0 to 1043
Data columns (total 34 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      1044 non-null   float32
 1   sex         1044 non-null   float32
 2   age         1044 non-null   float32
 3   address     1044 non-null   float32
 4   famsize     1044 non-null   float32
 5   Pstatus     1044 non-null   float32
 6   Medu        1044 non-null   float32
 7   Fedu        1044 non-null   float32
 8   Mjob        1044 non-null   float32
 9   Fjob        1044 non-null   float32
 10  reason      1044 non-null   float32
 11  guardian    1044 non-null   float32
 12  traveltime  1044 non-null   float32
 13  studytime   1044 non-null   float32
 14  failures    1044 non-null   float32
 15  schoolsup   1044 non-null   float32
 16  famsup      1044 non-null   float32
 17  paid        1044 non-null   float32
 18  activities  1044 non-null   float32
 19  nursery     1044 non-null  

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1044 entries, 0 to 1043
Data columns (total 32 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      1044 non-null   float32
 1   sex         1044 non-null   float32
 2   age         1044 non-null   float32
 3   address     1044 non-null   float32
 4   famsize     1044 non-null   float32
 5   Pstatus     1044 non-null   float32
 6   Medu        1044 non-null   float32
 7   Fedu        1044 non-null   float32
 8   Mjob        1044 non-null   float32
 9   Fjob        1044 non-null   float32
 10  reason      1044 non-null   float32
 11  guardian    1044 non-null   float32
 12  traveltime  1044 non-null   float32
 13  studytime   1044 non-null   float32
 14  failures    1044 non-null   float32
 15  schoolsup   1044 non-null   float32
 16  famsup      1044 non-null   float32
 17  paid        1044 non-null   float32
 18  activities  1044 non-null   float32
 19  nursery     1044 non-null  

None

### **Modeling**

In [2]:
# separating target variable and all other variables to prepare for modeling
target_alt = df_com_alt['G3']
target_alt_ohe = pd.get_dummies(target_alt)

variables_alt = df_com_alt.drop('G3', axis=1)

# splitting training and testing data
X_train, X_test, y_train, y_test = train_test_split(variables_alt, target_alt, test_size=0.2, random_state=0)

# MULTIPLE LINEAR REGRESSION MODEL (1st & 2nd period grades NOT INCLUDED)
# help: https://stackabuse.com/linear-regression-in-python-with-scikit-learn/ and https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/ and https://scikit-learn.org/stable/modules/model_evaluation.html
linear_regression = LinearRegression()
linear_regression_clf = linear_regression.fit(X_train, y_train)
linear_regression_pred = linear_regression_clf.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, linear_regression_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, linear_regression_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, linear_regression_pred)))
print('R2:', metrics.r2_score(y_test, linear_regression_pred))

Mean Absolute Error: 2.1597762
Mean Squared Error: 7.7431164
Root Mean Squared Error: 2.7826455
R2: 0.17523701152448756


There are a few interesting results from the multiple linear regression model. First, the RMSE is just under 3, while scores range from 0 to 20. This RMSE is about 14% of the total score range, so a typical prediction is off by nearly 3 points: not particularly accurate, though not wildly off either. An R2 score of 0.175, however, means the model explains only about 17.5% of the variance in final grades, whereas a well-performing model would have an R2 much closer to 1.0. Because of this, the multiple linear regression model is likely not the best model to use on this dataset.
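The range comparison above can be reproduced in a few lines. This is only a minimal sketch: the helper name `rmse_as_fraction_of_range` is hypothetical, the 0 and 20 bounds are the grade limits stated in the text, and the toy arrays are made up for illustration rather than taken from the dataset.

```python
import numpy as np

def rmse_as_fraction_of_range(y_true, y_pred, low=0.0, high=20.0):
    """Return (rmse, rmse / (high - low)) for a target bounded by [low, high]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return rmse, rmse / (high - low)

# toy values for illustration only (not taken from the dataset)
rmse, frac = rmse_as_fraction_of_range([10, 12, 8, 15], [11, 10, 9, 13])
print(f"toy RMSE = {rmse:.3f}, {frac:.1%} of the 0-20 range")

# the RMSE reported above, expressed the same way
reported_frac = 2.7826455 / 20
print(f"reported RMSE is {reported_frac:.1%} of the 0-20 range")
```

Expressing the error relative to the score range makes it easier to compare models whose targets share the same scale, but it says nothing about how much of the variance is explained, which is why R2 is reported alongside it.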

In [3]:
# printing the coefficients for each variable and how they affect the target
df_co = pd.DataFrame(linear_regression_clf.coef_, variables_alt.columns, columns=['Coefficient'])
df_co

Unnamed: 0,Coefficient
school,-0.908398
sex,-0.023556
age,0.260008
address,0.481881
famsize,0.512326
Pstatus,-0.18955
Medu,1.068629
Fedu,0.179861
Mjob,-0.180635
Fjob,0.07107


While the multiple linear regression model is likely not the best model to use (at least when the first and second period grades are not included), the coefficients for each variable are very interesting, as they show the individual effect each input variable had on the prediction of the target value. Based on this model, it appears that sex, activities, traveltime, and Fjob are largely irrelevant input variables (since their coefficients are very close to 0). Interestingly, it appears that Medu, studytime, higher, famrel, the course taken, and the number of absences all contribute positively to the prediction of the final score. The most surprising variable there is absences, as I would not expect a higher number of absences to lead to a prediction of a higher final grade. Finally, it appears that failures and schoolsup have significant negative contributions to the prediction of the final score. An increased number of failures in particular appears to be strongly associated with a low final grade.
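Since every input was min-max scaled to [0, 1], each coefficient can be read as the predicted change in G3 when that input moves from its minimum to its maximum, which makes the magnitudes roughly comparable. As a rough sketch of how to rank them, the snippet below sorts by absolute value. It uses only the ten coefficients visible in the truncated table above; the variable names `coef` and `ranked` are illustrative.

```python
import pandas as pd

# coefficients copied from the (truncated) table above; remaining rows are not shown
coef = pd.Series({
    'school': -0.908398, 'sex': -0.023556, 'age': 0.260008, 'address': 0.481881,
    'famsize': 0.512326, 'Pstatus': -0.18955, 'Medu': 1.068629, 'Fedu': 0.179861,
    'Mjob': -0.180635, 'Fjob': 0.07107,
}, name='Coefficient')

# order by absolute size so the strongest effects (of either sign) come first
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

One caveat worth keeping in mind: nominal variables such as Mjob and Fjob were converted to integer category codes, so the sign and size of their coefficients depend on an arbitrary ordering of the categories.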

In [5]:
# separating target variable and all other variables to prepare for modeling
target = df_com['G3']
target_ohe = pd.get_dummies(target)

variables = df_com.drop('G3', axis=1)

# splitting training and testing data
X_train, X_test, y_train, y_test = train_test_split(variables, target, test_size=0.2, random_state=0)

# MULTIPLE LINEAR REGRESSION MODEL (1st & 2nd period grades INCLUDED)
# help: https://stackabuse.com/linear-regression-in-python-with-scikit-learn/ and https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/ and https://scikit-learn.org/stable/modules/model_evaluation.html
linear_regression = LinearRegression()
linear_regression_clf = linear_regression.fit(X_train, y_train)
linear_regression_pred = linear_regression_clf.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, linear_regression_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, linear_regression_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, linear_regression_pred)))
print('R2:', metrics.r2_score(y_test, linear_regression_pred))

Mean Absolute Error: 0.8423875
Mean Squared Error: 1.5384248
Root Mean Squared Error: 1.2403326
R2: 0.8361336894879354


There is a very clear improvement in the multiple linear regression model when the first and second period grades are included. All of the error metrics decreased, with the RMSE being only 1.24 for this model. The R2 also increased significantly from 0.175 to 0.836, indicating that this model is much better at predicting final grade scores when the first and second period grades are included.

In [6]:
# printing the coefficients for each variable and how they affect the target
df_co = pd.DataFrame(linear_regression_clf.coef_, variables.columns, columns=['Coefficient'])
df_co

Unnamed: 0,Coefficient
school,0.048222
sex,-0.032164
age,-0.365574
address,0.116958
famsize,-0.023849
Pstatus,-0.162575
Medu,0.002303
Fedu,-0.03717
Mjob,0.06657
Fjob,-0.558514


When the first and second period grades are included in the multiple linear regression model, the variables that appear to influence the final grade prediction change. A high number of past failures still contributes to a lower predicted final score. Weekday drinking also appears to negatively impact the final grade prediction. Second period grades, absences, a good family relationship, and travel time all appear to contribute positively to final scores. This is interesting, as I would not have predicted that high absences or high travel time would positively impact the prediction of the final scores. The course being taken also appears to influence the predicted grade quite a bit. For the most part, no other variables are relevant.

In [8]:
# KERAS NN MODEL WITH KFOLD CROSS VALIDATION (1st & 2nd period grades NOT INCLUDED)
# help: https://www.pyimagesearch.com/2019/01/21/regression-with-keras/

# number of inputs and outputs
NUM_INPUTS = 31
NUM_OUTPUTS = 19

true_list = list(target_alt)

# function to properly create Keras model
def create_keras_model(optimizer='adam', num_hidden=10, activation='relu'):
    inputs = Input(shape=(NUM_INPUTS,))
    layer = Dense(num_hidden, activation=activation)(inputs)
    outputs = Dense(NUM_OUTPUTS, activation='softmax')(layer)
    model = Model(inputs=inputs, outputs=outputs, name="keras model")
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# generating predictions
keras_clf = KerasClassifier(build_fn=create_keras_model, verbose=0, epochs=150, batch_size=4)
keras_pred = cross_val_predict(keras_clf, variables_alt, target_alt_ohe, cv=10)

df_results = pd.DataFrame(list(zip(true_list, keras_pred)), columns =['dt_true', 'dt_def'])
display(df_results)
print(classification_report(df_results['dt_true'], df_results['dt_def'], digits=3))
print('Mean Absolute Error:', metrics.mean_absolute_error(true_list, keras_pred))
print('Mean Squared Error:', metrics.mean_squared_error(true_list, keras_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(true_list, keras_pred)))
print('R2:', metrics.r2_score(true_list, keras_pred))

Unnamed: 0,dt_true,dt_def
0,6,11
1,6,9
2,10,6
3,15,11
4,10,11
...,...,...
1039,10,10
1040,16,12
1041,9,10
1042,10,9


              precision    recall  f1-score   support

           0      0.152     0.189     0.168        53
           1      0.000     0.000     0.000         1
           2      0.000     0.000     0.000         0
           3      0.000     0.000     0.000         0
           4      0.000     0.000     0.000         1
           5      0.000     0.000     0.000         8
           6      0.026     0.111     0.042        18
           7      0.056     0.053     0.054        19
           8      0.088     0.239     0.129        67
           9      0.067     0.222     0.103        63
          10      0.133     0.078     0.099       153
          11      0.129     0.119     0.124       151
          12      0.078     0.078     0.078       103
          13      0.100     0.080     0.089       113
          14      0.059     0.011     0.019        90
          15      0.211     0.098     0.133        82
          16      0.000     0.000     0.000        52
          17      0.000    



The output of the Keras NN without the first or second period grades is extremely interesting. Not only does it perform worse than the multiple linear regression model, but its R2 value is -0.440, meaning it predicts the final grade from the 31 input variables worse than simply predicting the mean grade would. I expected the model to perform poorly, but this is worse than I anticipated. **The one research paper that looked at this dataset concluded (in its "Conclusion" section) that it does not appear to be possible to accurately predict final grades without knowing the first and second period grades.** Since the dataset this model was trained on did not include those grades, this result may not be as unexpected as it first appears.  
  
Of course, another explanation is that this model was set up inappropriately; however, I cannot see where I may have gone wrong. The target variable was converted from numeric to categorical via one-hot encoding (that's why we are using "target_alt_ohe"), and the creation of the Keras model follows the structure of previous Keras models that I have created for this class and for other projects. All of the input variables were also min-max scaled. Maybe there is something I'm missing (like one of the activation functions throwing something way off), or perhaps this model performing extremely poorly is not all that surprising.
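One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: when a classifier is fit on a one-hot target, its predictions may come back as the position of the winning column rather than as the grade that column stands for. Because NUM_OUTPUTS is 19 while grades run from 0 to 20, some grades never occur, so column positions and grade values would not line up. The toy sketch below illustrates the mismatch and one way to map positions back to grades; `pred_idx` is a hypothetical stand-in for classifier output, not output from the model above.

```python
import numpy as np
import pandas as pd

# toy grades with gaps (no 2 or 3), mimicking a target where some scores never occur
grades = pd.Series([0, 1, 4, 5, 6, 4, 0])
grades_ohe = pd.get_dummies(grades)

# column labels are the sorted distinct grades: 0, 1, 4, 5, 6
class_labels = grades_ohe.columns.to_numpy()

# hypothetical classifier output: positions of the winning one-hot columns
pred_idx = np.array([0, 1, 2, 3, 4, 2, 0])

# translate column positions back to grade values before scoring
pred_grades = class_labels[pred_idx]

# positions and grades disagree wherever a gap precedes the grade
n_mismatched = int((pred_idx != grades.to_numpy()).sum())
print(list(zip(grades, pred_idx, pred_grades)), n_mismatched)
```

In this toy case every prediction is correct once mapped back, yet four of the seven raw positions differ from the true grade. If the same thing is happening in the notebook, the error metrics and classification report would be comparing grades against column positions.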

In [9]:
# KERAS NN MODEL WITH KFOLD CROSS VALIDATION (1st & 2nd period grades INCLUDED)
# help: https://www.pyimagesearch.com/2019/01/21/regression-with-keras/

# number of inputs and outputs
NUM_INPUTS = 33
NUM_OUTPUTS = 19

true_list = list(target)

# function to properly create Keras model
def create_keras_model(optimizer='adam', num_hidden=10, activation='relu'):
    inputs = Input(shape=(NUM_INPUTS,))
    layer = Dense(num_hidden, activation=activation)(inputs)
    outputs = Dense(NUM_OUTPUTS, activation='softmax')(layer)
    model = Model(inputs=inputs, outputs=outputs, name="keras model")
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# generating predictions
keras_clf = KerasClassifier(build_fn=create_keras_model, verbose=0, epochs=150, batch_size=4)
keras_pred = cross_val_predict(keras_clf, variables, target_ohe, cv=10)

df_results = pd.DataFrame(list(zip(true_list, keras_pred)), columns =['dt_true', 'dt_def'])
display(df_results)
print(classification_report(df_results['dt_true'], df_results['dt_def'], digits=3))
print('Mean Absolute Error:', metrics.mean_absolute_error(true_list, keras_pred))
print('Mean Squared Error:', metrics.mean_squared_error(true_list, keras_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(true_list, keras_pred)))
print('R2:', metrics.r2_score(true_list, keras_pred))

Unnamed: 0,dt_true,dt_def
0,6,6
1,6,4
2,10,6
3,15,12
4,10,8
...,...,...
1039,10,8
1040,16,14
1041,9,11
1042,10,9


              precision    recall  f1-score   support

           0      0.474     0.509     0.491        53
           1      0.000     0.000     0.000         1
           3      0.000     0.000     0.000         0
           4      0.000     0.000     0.000         1
           5      0.000     0.000     0.000         8
           6      0.110     0.556     0.183        18
           7      0.000     0.000     0.000        19
           8      0.075     0.224     0.113        67
           9      0.067     0.190     0.099        63
          10      0.075     0.026     0.039       153
          11      0.113     0.106     0.110       151
          12      0.089     0.068     0.077       103
          13      0.062     0.071     0.066       113
          14      0.125     0.022     0.038        90
          15      0.300     0.110     0.161        82
          16      0.094     0.058     0.071        52
          17      0.000     0.000     0.000        35
          18      0.000    



Surprisingly, the Keras NN trained on the dataset that includes the first and second period scores did not perform very well. However, an RMSE of 2.82 (~14% of the 0 to 20 score range) is not too bad, and an R2 of 0.467 is a great improvement over the negative R2 score achieved on the dataset that excludes the first and second period scores. I am a bit surprised to see that the Keras NN is not achieving results as good as those of the multiple linear regression model. Perhaps the NN needs more fine-tuning, or other activation functions could be used.