# Predicting Student GPA using Survey Responses

This notebook is an experiment to see how successful we can be at predicting GPA using survey responses alone. That is to say, how well can we tell the grade point of a student based on how satisfied they rate themselves with their education in the survey questionnaire? Are more satisfied students more likely to be good students?

I'm curious to see how bad (or surprisingly good) the result will be. The better the result, the more differentiable the survey-takers and hence (probably) the better the survey. Personally, I am not confident that this will hold, as I think most users [basically all answer the same way](https://xkcd.com/1098/), regardless of how poor things *really* are for them.

In [None]:
import pandas as pd
students = pd.read_csv('../input/STUDENT-SURVEY.csv', encoding='latin-1')
students.head(3)

To start with, here are our GPAs.

In [None]:
import seaborn as sns
sns.kdeplot(students['S.S.C (GPA)'])
sns.kdeplot(students['H.S.C (GPA)'])

They're fairly, but surprisingly not totally, correlated. H.S.C. looks easier to model, so let's stick to that one.

In [None]:
students.loc[:, ['S.S.C (GPA)', 'H.S.C (GPA)']].corr()

In [None]:
target_var = 'H.S.C (GPA)'

Next we do a ton of feature selection. In particular, we throw out the fields that give GPA-ish information. We want to stick to the survey questions: things like how satisfied are you with X, how good is Y, etcetera.

In [None]:
base = (pd.get_dummies(students.Faculty)
     .rename(columns={'Arts': 'English Degree',
                      'Law': 'Law Degree'})
     .drop('Business', axis='columns')).join(
 pd.get_dummies(students['Business Program']).add_suffix(' Business Degree')
)
base.head(3)

In [None]:
students_under_consideration = students.loc[students['Masters Academic Year in EU'].isnull()]
# Select by index label rather than position, so this still works if the index is not 0..n-1.
base = base.loc[students_under_consideration.index]

In [None]:
base = base.assign(
    Year=students_under_consideration.iloc[:, 8].map(lambda v: v.split(" ")[0][:1] if pd.notnull(v) else v).astype(float)
)

In [None]:
students['Classes are mostly'].value_counts()

Interestingly enough, irregular students have a lower GPA on average. But it's not that significant an effect, due to the small sample size.

In [None]:
students.groupby('Regular/Irregular')['H.S.C (GPA)'].agg(['mean', 'std'])
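
To put a rough number on "not that significant", one option is Welch's t-test, which compares two group means without assuming equal variances or equal group sizes. The sketch below is only an illustration: `compare_group_means` is a hypothetical helper, and the data is synthetic (the `Regular` and `Irregular` labels and the group sizes are assumptions, not values taken from the survey).

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_group_means(df, group_col, value_col):
    """Welch's t-test between the two groups found in `group_col` (hypothetical helper)."""
    groups = {name: g[value_col].dropna() for name, g in df.groupby(group_col)}
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    first, second = groups.values()
    # equal_var=False selects Welch's variant of the two-sample t-test.
    t_stat, p_value = stats.ttest_ind(first, second, equal_var=False)
    return {
        "sizes": {name: len(values) for name, values in groups.items()},
        "t": float(t_stat),
        "p": float(p_value),
    }


# Synthetic stand-in for the survey data: one large group and one small group.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Regular/Irregular": ["Regular"] * 40 + ["Irregular"] * 5,
    "H.S.C (GPA)": np.concatenate([rng.normal(4.0, 0.5, 40),
                                   rng.normal(3.8, 0.5, 5)]),
})
result = compare_group_means(demo, "Regular/Irregular", "H.S.C (GPA)")
print(result)
```

With the real data, the call would presumably be `compare_group_means(students, 'Regular/Irregular', 'H.S.C (GPA)')`. A large p-value would back up the hunch that the gap could easily be noise.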

In [None]:
base = base.assign(
    Coaching=students['Did you ever attend a Coaching center?'].map(lambda v: v == "Yes"),
    # Caution: astype(bool) maps every non-empty string (and NaN) to True, so if this column
    # holds text labels the feature ends up constant; compare against the label in that case.
    Regularity=students['Regular/Irregular'].astype(bool),
    Quality_Has_Improved=students['Do you feel that the quality of education improved at EU over the last year?'].map(lambda v: v == "Yes"),
    Image_Has_Improved=students['Do you feel that the image of the University improved over the last year?'].map(lambda v: v == "Yes")
)

In [None]:
survey_results = base.join(students_under_consideration.iloc[:, 30:80])

In [None]:
survey_results = survey_results.dropna()

In [None]:
survey_results.shape

We'll use ridge regression, because why not?

In [None]:
from sklearn.linear_model import Ridge
import numpy as np

clf = Ridge(alpha=1.0)
clf.fit(survey_results, students_under_consideration.loc[survey_results.index.values][target_var])

In [None]:
Y = clf.predict(survey_results)

And we get...

In [None]:
# sns.kdeplot(students['S.S.C (GPA)'])
sns.kdeplot(students['H.S.C (GPA)'].rename('GPA'))
sns.kdeplot(pd.Series(Y).rename('GPA (Predicted)'))

...it's not very good!

You can tell from this plot that the model mostly failed, because of how densely the predictions cluster around 4. This indicates that the model did not do much better at capturing the shape of the data than simply settling on the average of the distribution at large.

Obviously we want a model that does better than that, but this one mostly doesn't!
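
The "no better than the average" hunch can be checked directly by scoring the model against a baseline that always predicts the mean. Here is a minimal sketch using scikit-learn's `DummyRegressor` and cross-validation. `compare_to_mean_baseline` is a hypothetical helper, and the demo data is synthetic (built so that one feature carries real signal), not the survey data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def compare_to_mean_baseline(X, y, cv=5):
    """Cross-validated RMSE of ridge regression and of a predict-the-mean baseline."""
    models = {
        "ridge": Ridge(alpha=1.0),
        "mean_baseline": DummyRegressor(strategy="mean"),
    }
    scores = {}
    for name, model in models.items():
        # scikit-learn reports negated MSE so that higher is always better.
        neg_mse = cross_val_score(model, X, y, cv=cv,
                                  scoring="neg_mean_squared_error")
        scores[name] = float(np.sqrt(-neg_mse.mean()))
    return scores


# Synthetic stand-in: ten features, only the first one is related to the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 4.0 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.2, size=200)

rmse = compare_to_mean_baseline(X_demo, y_demo)
print(rmse)
```

On the synthetic data the ridge model beats the baseline because the signal was put there on purpose. If the two numbers come out nearly equal on the survey features, that would confirm what the density plot suggests.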

Another view:

In [None]:
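# Pair each prediction with the true GPA of the same student (the rows that survived dropna).
actual = students_under_consideration.loc[survey_results.index, target_var].rename('GPA')
predicted = pd.Series(Y, index=survey_results.index, name='GPA (Predicted)')
sns.jointplot(x=actual, y=predicted)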

## To-Do

Fit statistics.
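
One possible way to fill this in, sketched under assumptions: compute R², RMSE and MAE from the true and predicted values. `fit_statistics` is a hypothetical helper and the numbers below are a toy example, not results from the survey.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def fit_statistics(y_true, y_pred):
    """Return R^2, RMSE and MAE for a set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "r2": float(r2_score(y_true, y_pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
    }


# Toy example: a "model" that always predicts the mean of the true values.
y_true_demo = [3.5, 4.0, 4.5, 5.0]
y_pred_demo = [4.25, 4.25, 4.25, 4.25]
stats_demo = fit_statistics(y_true_demo, y_pred_demo)
print(stats_demo)
```

A model that always predicts the mean scores an R² of zero, so an R² close to zero on the real predictions would say the same thing as the plots. In this notebook the call would presumably be `fit_statistics(students_under_consideration.loc[survey_results.index, target_var], Y)`. Keep in mind that `Y` is predicted on the training rows, so those numbers would be optimistic.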

## Conclusion

Student GPA does not appear to be meaningfully correlated with the level of satisfaction that students indicate in the survey, at least as far as this ridge regression can tell.