# Student Exam Results

For years, students have been using their predicted grades in an effort to monitor their progress throughout their chosen subjects. More recently, as a result of the Covid-19 pandemic, predicted grades have been used by exam boards, particularly within the UK, as a student's final grade. As a result, the accurate prediction of these grades has grown significantly more important. 

The dataset used in this project was found on kaggle.com. A big thank you to Kaggle user Jakki for providing it for public use.

Throughout this kernel, we shall undertake the following tasks.

0. Package and Data Imports. In this section we shall import the basic required packages as well as the dataset. Note: The machine learning algorithms will be imported as and when they are required.
1. Exploratory Data Analysis and Visualisation. In this section we shall attempt to identify the important factors used in the predictions of student results.
2. Feature Engineering. In this section we shall try to extract as much information as possible from the dataset via the creation of new columns.
3. Data Preprocessing. In this section we will prepare the data for use within a wide range of machine learning algorithms, which shall include the identification of outliers and adjusting the format of certain data columns.
4. Model Creation. In this section we will create a wide range of machine learning algorithms for use in predicting exam results.
5. Model Analysis. In this section we shall analyse the models created in the previous section and attempt to determine which model predicted grades most accurately.

## 0. Package and Data Imports

Let us begin by importing the necessary Python packages for our exploratory data analysis and data visualisation. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let us now import the dataset and check its head, info and describe methods.

In [None]:
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

We can see that our dataset contains the results of 3 exams in maths, reading and writing taken by 1000 students. Our dataset also contains information on the gender, race, parental education, lunch type and whether or not a preparation course was taken for each student. We shall attempt to determine which of these factors is most influential in predicting the exam results for each student. Let us begin our analysis.

## 1. Exploratory Data Analysis and Visualisation

In this section we shall analyse the dataset and attempt to determine which variables are the most important in predicting students' exam results, as well as the relationships between our independent variables. Let us begin by analysing the relationship between the students' results in each of the three exams.

### 1.1 Math, Reading and Writing Analysis

Let us begin by producing histograms for each of the tests to determine whether the scores are normally distributed.

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['math score'], kde=True)
plt.title('Distribution of Math scores')

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['reading score'], kde=True)
plt.title('Distribution of Reading scores')

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df['writing score'], kde=True)
plt.title('Distribution of Writing scores')

The plots above show that the results for each of the exams are approximately normally distributed, with no clear and obvious signs of skewness.
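
The visual impression of symmetry can also be checked with a number. The following is a minimal sketch, assuming the three score columns of the dataset; the small `scores` frame is made up purely so that the snippet runs on its own. The pandas method `DataFrame.skew()` returns the sample skewness of each column, and values close to 0 suggest a roughly symmetric distribution.

```python
import pandas as pd

# Made-up scores, for illustration only; in the notebook this would be
# df[['math score', 'reading score', 'writing score']].
scores = pd.DataFrame({
    'math score':    [40, 55, 60, 65, 70, 75, 90],
    'reading score': [45, 58, 64, 69, 74, 80, 95],
    'writing score': [42, 56, 63, 68, 73, 79, 94],
})

# Sample skewness per column: negative means a longer left tail,
# positive means a longer right tail, near zero means roughly symmetric.
skewness = scores.skew()
print(skewness)
```

The made-up math scores are symmetric about their mean, so their skewness comes out at zero.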

Let us now investigate in more detail the breakdown of student scores for each of the 3 exams by calculating the percentage of students that scored marks, x, within the intervals:

- x ≤ 50
- 51 ≤ x ≤ 60
- 61 ≤ x ≤ 70
- 71 ≤ x ≤ 80
- 81 ≤ x ≤ 90
- 91 ≤ x ≤ 100

In [None]:
exam_list = ['math score', 'reading score', 'writing score']
for exam in exam_list:
    print(exam + ':')
    print('Percentage of students scoring between 0 & 50: {}%'.format(100 * len(df[df[exam] <= 50]) / len(df)))
    print('Percentage of students scoring between 51 & 60: {}%'.format(100 * len(df[(df[exam] >= 51) & (df[exam] <= 60)]) / len(df)))
    print('Percentage of students scoring between 61 & 70: {}%'.format(100 * len(df[(df[exam] >= 61) & (df[exam] <= 70)]) / len(df)))
    print('Percentage of students scoring between 71 & 80: {}%'.format(100 * len(df[(df[exam] >= 71) & (df[exam] <= 80)]) / len(df)))
    print('Percentage of students scoring between 81 & 90: {}%'.format(100 * len(df[(df[exam] >= 81) & (df[exam] <= 90)]) / len(df)))
    print('Percentage of students scoring between 91 & 100: {}%'.format(100 * len(df[(df[exam] >= 91)]) / len(df)))
    print('-' * 40)
  

We can see that for all 3 tests, approximately 50% of students scored between 61 and 80 marks. The math test had the lowest proportion of students scoring 81 marks or more, with roughly 17% of students managing this. Approximately 20% and 23% of students achieved this threshold in the writing and reading tests, respectively. Furthermore, the math test had the highest percentage of students scoring 50 marks or fewer, at 15%. Approximately 10% and 13% of students failed to score more than 50 marks in the reading and writing tests, respectively.
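
The same banded breakdown can be computed more compactly with `pd.cut`. This is a minimal sketch rather than the method used in this kernel: the `scores` series is made up for illustration, and the bin edges assume that scores are whole numbers between 0 and 100.

```python
import pandas as pd

# Made-up scores for illustration; in the notebook this would be df['math score'].
scores = pd.Series([35, 50, 51, 60, 61, 70, 75, 80, 88, 100])

# pd.cut uses right-closed intervals by default, so these edges give
# (-1, 50], (50, 60], (60, 70], (70, 80], (80, 90], (90, 100],
# which match the bands above for whole-number scores.
edges = [-1, 50, 60, 70, 80, 90, 100]
labels = ['0-50', '51-60', '61-70', '71-80', '81-90', '91-100']

bands = pd.cut(scores, bins=edges, labels=labels)

# Percentage of students in each band, listed in band order
percentages = bands.value_counts(normalize=True).sort_index() * 100
print(percentages)
```

Wrapping the last two statements in a loop over `exam_list` would reproduce the full table for all three exams.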

Let us investigate the relationship between each of the test scores, by producing a pairplot and a heatmap of the correlation between the variables.

In [None]:
sns.pairplot(df[['math score','writing score','reading score']])

In [None]:
sns.heatmap(df[['math score','reading score','writing score']].corr(), annot=True)

We can see that there is a strong positive linear relationship between each pair of test results. This is to be expected, since students who are academically strong are likely to perform well across a wide range of subjects, whilst students who struggle with focus or motivation are likely to perform poorly across all of their subjects. The relationship between the reading and writing scores is the strongest, likely because both tests draw on closely related language skills.

As a result of the strong linear relationships between the three test results, we are able to produce an average result scored for each student across the three tests. This will enable us to create a simpler model in which we are only required to predict one value rather than 3.

### 1.2 Independent v Dependent Variables

In this section, we shall analyse the effect our independent variables have on the test scores achieved by the students.

#### 1.2.1 Gender

We shall now begin to analyse how the other data features within our dataset affect the marks achieved by the students, starting with gender.

Since we are considering three different test scores, gender may have an effect. Typically, female students tend to enjoy reading and writing more than their male counterparts, while more males than females enjoy the subject of mathematics. Let us investigate this by producing box plots.

In [None]:
df[df['gender'] == 'male'].describe()

In [None]:
df[df['gender'] == 'female'].describe()

In [None]:
sns.boxplot(x='gender',y='math score',data=df)

In [None]:
sns.boxplot(x='gender',y='reading score',data=df)

In [None]:
sns.boxplot(x='gender',y='writing score',data=df)

The three plots shown above, one for each test result, support our hypothesis as to how gender affects the test scores. Males, on average, scored higher than females in the maths test, with a slightly smaller standard deviation: 14.5 for males compared with 15.5 for females. Furthermore, all males scored at least 27 marks, whereas at least one female failed to score any marks in this test.

In the reading and writing tests, as predicted, females scored higher marks than males on average, by 6 and 9 marks respectively. In all three tests, we notice considerable dispersion in the range of marks achieved. Also, the box plots produced seem to highlight points which are classed as outliers. We shall investigate potential outliers in the data preprocessing section. 
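
The per-gender means and standard deviations quoted above can also be read from a single grouped summary instead of two separate `describe()` calls. The following is a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration; its column names match the dataset).

```python
import pandas as pd

# Made-up rows for illustration only; the notebook's df has the same column names.
toy = pd.DataFrame({
    'gender':        ['female', 'female', 'male', 'male'],
    'math score':    [60, 70, 68, 78],
    'reading score': [75, 85, 62, 72],
    'writing score': [76, 86, 58, 68],
})

score_cols = ['math score', 'reading score', 'writing score']

# Mean and standard deviation of each score, one row per gender
summary = toy.groupby('gender')[score_cols].agg(['mean', 'std'])
print(summary)

# Difference in mean scores (female minus male) for each test
means = summary.xs('mean', axis=1, level=1)
gap = means.loc['female'] - means.loc['male']
print(gap)
```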

#### 1.2.2 Test Preparation

During the build-up to the exams, students were able to complete a test preparation course. We expect to see that students who completed the course scored higher marks on average than those students who chose not to complete it. Let us see whether the completion or non-completion of this course had an effect on the results of the tests.

In [None]:
df[df['test preparation course'] == 'completed'].describe()

In [None]:
df[df['test preparation course'] == 'none'].describe()

In [None]:
plt.figure(figsize=(12,8))
plt.subplot(1,3,1)
sns.boxplot(x='test preparation course', y='math score', data=df)

plt.subplot(1,3,2)
sns.boxplot(x='test preparation course', y='reading score', data=df)

plt.subplot(1,3,3)
sns.boxplot(x='test preparation course', y='writing score', data=df)

plt.suptitle('How does the Test Preparation Course affect Test Scores?')

We can see clearly that, as anticipated, students that completed the test preparation course scored higher marks on average than those that did not. Average marks increased by 5, 7 and 10 for the maths, reading and writing tests, respectively. We can also observe that, for all three tests, the dispersion of marks was narrower where the test preparation course was completed. However, the interquartile range of the marks achieved does not change significantly as a result of the test preparation course. It appears that the preparation course mainly increased the mean and median marks for each test, rather than changing the width of the range of marks that were scored. Despite this, it is clear that the completion of the course has a significant impact on the test results achieved by the students.

#### 1.2.3 Ethnicity

Let us now investigate the effect that a student's ethnicity has on the results they achieved in the three tests. We predict that race should not be an influencing factor, since all students should be treated and taught equally regardless of their ethnic backgrounds and origins.

In [None]:
order = ['group A', 'group B', 'group C', 'group D', 'group E']

plt.figure(figsize=(15,8))
plt.subplot(1,3,1)
sns.boxplot(x='race/ethnicity', y='math score', data=df,order=order)

plt.subplot(1,3,2)
sns.boxplot(x='race/ethnicity', y='reading score', data=df, order=order)

plt.subplot(1,3,3)
sns.boxplot(x='race/ethnicity', y='writing score', data=df,order=order)

plt.suptitle('How does race/ethnicity affect Test Scores?')

Surprisingly, it appears that the race/ethnicity of a student affects the results they obtain. For the maths test in particular, we can see a steady increase in the average score as we work our way from group A to group E. This is also the case in both the reading and writing tests, although the increase is not as pronounced. Let us investigate how many students belong to each ethnic group.

In [None]:
sns.countplot(x='race/ethnicity',data=df, order=order)

We can see that group C is the most common ethnic group within this dataset and we also notice that students within this group achieve scores relatively close to the average scores achieved across the entire dataset. Group E students achieve the highest marks on average. The minority group within this dataset, group A, achieves the lowest scores on average across all 3 tests, which may hint at possible discrimination against, or neglect of, students within this ethnic group.

Since race/ethnicity does seem to have an impact on test results, we shall consider using this variable in the models that we use to predict students' results. The current string format of this variable is unusable for machine learning algorithms. As a result, in the data preprocessing section, we shall alter the format of the variable so that we can use it within our models.

#### 1.2.4 Parental Education

Let us now investigate the effect that parents' level of education has on their children's test scores. We anticipate that the higher the level of parental education, the higher the results achieved by the students, since it is commonly believed that intelligence is inherited and passed on from generation to generation.

We shall first consider the differing levels of parental education we have within this dataset.

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='parental level of education',data=df)

Firstly, it is important to note that the column contains two values that are extremely similar, namely "high school" and "some high school". Let us begin by merging these two values into a single group.

In [None]:
df['parental level of education'] = df['parental level of education'].apply(lambda x: 'high school' if 'high school' in x else x)

Let us repeat the plot from above to check that our merge worked correctly.

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='parental level of education',data=df)

We can clearly see that our merge has worked successfully, as only one "high school" bar is showing. Our dataset now contains 5 different levels of education, ranging from high school through to a master's degree. Let us see how parental education affects the scores achieved by the students in the three tests.

In [None]:
education_order = ["high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]
plt.figure(figsize=(10,6))
sns.boxplot(x='parental level of education', y='math score', data=df, order=education_order)

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='parental level of education', y='reading score', data=df, order=education_order)

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='parental level of education', y='writing score', data=df,order=education_order)

For all three tests we can clearly see that as the level of parental education increases the average score achieved by the students increases, with the increase more prominent in the writing test. This more significant increase may be due to the fact that master's degrees commonly contain lots of high level report writing and as a result a child may be exposed more naturally to advanced writing techniques and a more expansive vocabulary range. 

As with the race/ethnicity consideration above, since we have concluded that parental education has a notable impact on the test scores achieved, we will need to adjust the format of the variables so that they can be used within our machine learning algorithms. This will be done within the data preprocessing section. 

#### 1.2.5 Lunch

From the data contained within our dataset, we can see that some students are provided with a free or reduced lunch. Let us investigate whether or not this has an influence on the test scores achieved. 

In [None]:
df[df['lunch'] == 'standard'].describe()

In [None]:
df[df['lunch'] == 'free/reduced'].describe()

In [None]:
plt.figure(figsize=(15,8))
plt.subplot(1,3,1)
sns.boxplot(x='lunch', y='math score', data=df)

plt.subplot(1,3,2)
sns.boxplot(x='lunch', y='reading score', data=df)

plt.subplot(1,3,3)
sns.boxplot(x='lunch', y='writing score', data=df)

plt.suptitle('How does the lunch package each student receives affect Test Scores?')

We can clearly see that the lunch package received by the student has a significant impact on the test results they were able to obtain, with those students who receive a free/reduced lunch scoring approximately 9 marks lower on average than those who receive standard lunches. Once again, this variable has been shown to be an important factor in determining the results obtained by the students and, as a result, it will need formatting and adjusting for use within our machine learning algorithms. This will be done within the data preprocessing section.

Throughout this section, we have seen that all independent variables have a significant effect on students' test scores. As a result, all columns and information within the dataset shall be used within our machine learning algorithms.

### 1.3 How do our independent variables relate to each other?

In this section, we shall look at the relationships between our independent variables. Although not strictly necessary for completing our goal of predicting students' exam results, this section will enable us to thoroughly understand our dataset and may provide useful information about why our variables impact the test scores in the way they do.

#### 1.3.1 Race/Ethnicity vs Parental Level of Education

It is often the case that people belonging to minority ethnic groups have reduced access to higher levels of education, whether through lack of funding or lack of opportunity. It is therefore interesting to see whether this pattern appears within our dataset.

In [None]:
race_v_pared = df.groupby(['parental level of education','race/ethnicity']).size().reset_index(name="Count").pivot(index='parental level of education',columns='race/ethnicity',values='Count')
race_v_pared.index = pd.CategoricalIndex(race_v_pared.index, categories=["high school", "some college", "associate's degree", "bachelor's degree", "master's degree"])
race_v_pared.sort_index(level=0, inplace=True)
sns.heatmap(race_v_pared,annot=True,fmt='d')

In [None]:
for group in ['group A', 'group B', 'group C', 'group D', 'group E']:
    print(group)
    for edu in education_order:
        print("Percentage with {} education: {}%".format(edu, 100 * race_v_pared[group][edu]/race_v_pared[group].sum()))
    print('-' * 25)

We can see that in the minority group, group A, the parents of nearly 50% of students only went to high school and never went on to college or university. This percentage reduces drastically as we progress from group A to group E. Furthermore, the percentage of parents achieving a master's degree in group A is only approximately 3.5%, while this percentage increases to nearly 9% in group D.

This highlights the potential lack of opportunities for people within minority ethnic groups.
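
The within-group percentages printed above can also be obtained directly with `pd.crosstab`. This is a minimal sketch on made-up rows, not the code used in this kernel; the `toy` frame is an assumption for illustration, with column names matching the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; column names match the notebook's df.
toy = pd.DataFrame({
    'race/ethnicity': ['group A', 'group A',
                       'group B', 'group B', 'group B', 'group B'],
    'parental level of education': ['high school', "master's degree",
                                    'high school', 'some college',
                                    'some college', "bachelor's degree"],
})

# normalize='columns' makes each ethnic group's column sum to 1,
# so multiplying by 100 gives the percentage of each education level
# within that group.
pct = pd.crosstab(toy['parental level of education'],
                  toy['race/ethnicity'],
                  normalize='columns') * 100
print(pct.round(1))
```

The same call with `lunch` or `test preparation course` as the first argument would cover the comparisons in the following subsections.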

#### 1.3.2 Race/Ethnicity v Lunch

Let us determine whether a student's race has an effect on the likelihood that they receive free or reduced-price lunch.

In [None]:
race_v_lunch = df.groupby(['race/ethnicity','lunch']).size().reset_index(name="Count").pivot(index='lunch',columns='race/ethnicity',values='Count')
race_v_lunch.index = pd.CategoricalIndex(race_v_lunch.index, categories=['free/reduced','standard'])
race_v_lunch.sort_index(level=0, inplace=True)
sns.heatmap(race_v_lunch,annot=True, fmt='d')

In [None]:
for group in ['group A', 'group B', 'group C', 'group D', 'group E']:
    print(group)
    for lunch in ['free/reduced','standard']:
        print("Percentage with {} lunch: {}%".format(lunch,100 * race_v_lunch[group][lunch] / race_v_lunch[group].sum()))
    print('-' * 40)

Once again, we notice that within the minority group a higher percentage of students receive discounted lunches. This further highlights the potential lack of funds available for minority groups.

#### 1.3.3 Race/Ethnicity v Test Preparation Course

Let us investigate the relationship between a student's race and whether or not they completed the test preparation course.

In [None]:
race_v_prep = df.groupby(['race/ethnicity','test preparation course']).size().reset_index(name="Count").pivot(index='test preparation course',columns='race/ethnicity',values='Count')
race_v_prep.index = pd.CategoricalIndex(race_v_prep.index, categories=['none','completed'])
race_v_prep.sort_index(level=0, inplace=True)
sns.heatmap(race_v_prep,annot=True,fmt='d')

In [None]:
for group in ['group A', 'group B', 'group C', 'group D', 'group E']:
    print("Percentage of Students in ethnic {} who completed the test preparation course: {}%".format(group, 100 * race_v_prep[group]['completed'] / race_v_prep[group].sum()))

We can immediately notice that the race/ethnicity of a student seems to have little effect on whether they completed the test preparation course. However, those students in ethnic group E were slightly more likely than students in the remaining groups to complete the course. 

#### 1.3.4 Parental Level of Education vs Test Preparation Course

Let us investigate whether the education level achieved by a student's parent affected the likelihood that they completed the test preparation course. 

In [None]:
parvprep = df.groupby(['parental level of education', 'test preparation course']).size().reset_index(name='Count').pivot(index='parental level of education',columns='test preparation course',values='Count')
parvprep.index = pd.CategoricalIndex(parvprep.index, categories=["high school", "some college", "associate's degree", "bachelor's degree", "master's degree"])
parvprep.sort_index(level=0, inplace=True)
sns.heatmap(parvprep,annot=True,fmt='d')

In [None]:
for edu in ["high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]:
    print("Percentage of students whose parents achieved {} level education that completed the test preparation course: {}%".format(edu, (100 * parvprep['completed'][edu] / (parvprep['completed'][edu] + parvprep['none'][edu]) )))

The percentages shown above demonstrate no clear relationship between these two variables.

## 2. Feature Engineering

In the analysis section above, we found that all independent variables in the dataset already had a significant impact on the test scores achieved by students. As a result, feature engineering will not be necessary, as it will prove difficult to extract extra useful information from the data.

## 3. Data Preprocessing

In this section, we shall process our data so that it is ready for use within our machine learning algorithms. We shall identify any potential outlying data points and determine whether they should be removed from or kept in the dataset. We shall also create dummy variables for our categorical variables. 

### 3.1 Outlier Detection

During this section, we shall attempt to determine if any of the entries within our dataset seem to be outliers. 

#### 3.1.1 Visual Estimations

In this section, we shall attempt to find outliers by thinking of scenarios that seem unlikely. We shall first begin by finding all students who scored 100 in each of the three tests.

In [None]:
full_marks = df[(df['math score'] == 100) & (df['reading score'] == 100) & (df['writing score'] == 100)]
full_marks

From the above, we can see that there were 3 students who achieved full marks in all 3 tests. The two female students raise slight suspicion, however. This is due to the fact that neither of them completed the test preparation course. As a result, these two students possibly cheated or are both extremely intelligent. Since we are unable to determine which of these assumptions is true, we shall leave both data points in the dataset. 

Let us now see if there were any students who failed to score any marks across all three tests.

In [None]:
zero_marks = df[(df['math score'] == 0) & (df['reading score'] == 0) & (df['writing score'] == 0)]
zero_marks

No student scored zero marks in all 3 tests. Does the same hold true for scoring fewer than 20 marks in every test?

In [None]:
lessthan20 = df[(df['math score'] < 20) & (df['reading score'] < 20) & (df['writing score'] < 20)]
lessthan20

We can see that one student scored fewer than 20 marks in each of the three tests. However, based on the analysis of our independent variables above, this student's results are consistent with our findings. Since the student is female, we would expect her to perform better in reading and writing than in maths, which is true in this case. Also, her parents only achieved a high school level of education, which is associated with lower scores on average. The same holds for the facts that she receives free/reduced lunch and did not complete the test preparation course. For these reasons, we shall keep this record in our dataset.

Let us now attempt to find students who were in the top 25% of students in one test, whilst simultaneously being in the bottom 25% for another. In order to do this, we will use the values obtained from the ".describe()" method applied to the dataframe and consider the exams in pairs.

In [None]:
df.describe().T

In [None]:
m_and_r = df[((df['math score'] > 77) & (df['reading score'] < 59)) | ((df['reading score'] > 79) & (df['math score'] < 57))]
m_and_r

In [None]:
m_and_w = df[((df['math score'] > 77) & (df['writing score'] < 58)) | ((df['writing score'] > 79) & (df['math score'] < 57))]
m_and_w

In [None]:
r_and_w = df[((df['reading score'] > 79) & (df['writing score'] < 58)) | ((df['writing score'] > 79) & (df['reading score'] < 59))]
r_and_w

The empty dataframes above show that no student was in the top quarter for one test whilst simultaneously being in the bottom quarter for another, which is consistent with the strong linear relationship between the test scores.
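
The quartile thresholds above were typed in by hand from the ".describe()" output. As an alternative, they can be computed with `quantile()`, so that the check adapts if the data changes. The following is a minimal sketch on made-up scores; `quartile_mismatch` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

# Made-up scores for illustration only; the notebook's df has the same column names.
toy = pd.DataFrame({
    'math score':    [40, 50, 60, 70, 80, 90, 95, 45],
    'reading score': [42, 55, 62, 71, 83, 88, 41, 96],
})

def quartile_mismatch(data, col_a, col_b):
    """Rows in the top quarter of one column and the bottom quarter of the other."""
    a_low, a_high = data[col_a].quantile([0.25, 0.75])
    b_low, b_high = data[col_b].quantile([0.25, 0.75])
    top_a_bottom_b = (data[col_a] > a_high) & (data[col_b] < b_low)
    top_b_bottom_a = (data[col_b] > b_high) & (data[col_a] < a_low)
    return data[top_a_bottom_b | top_b_bottom_a]

mismatched = quartile_mismatch(toy, 'math score', 'reading score')
print(mismatched)
```

In this made-up data the last two rows are deliberately mismatched, so they are the ones returned.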

#### 3.1.2 Interquartile Range Method

Let us use the interquartile range method to find scores for each test that fall outside of the region given by [LQ - 1.5 x IQR, UQ + 1.5 x IQR], where IQR is the interquartile range of test scores, LQ is the lower quartile and UQ is the upper quartile.

We can create a function to find the outliers using this method.

In [None]:
def outside_range(df, column):
    global lower,upper
    q25, q75 = np.quantile(df[column], 0.25), np.quantile(df[column], 0.75)
    
    # calculate the IQR
    iqr = q75 - q25
    
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    
    # calculate the lower and upper bound value of the range
    lower, upper = q25 - cut_off, q75 + cut_off
    print('The IQR for {} is {}'.format(column,iqr))
    print('The lower bound value is', lower)
    print('The upper bound value is', upper)
    
    
    # Find the index labels of records below the lower bound or above the upper bound
    outlier_index = df.index[(df[column] > upper) | (df[column] < lower)]
    
    print("The number of outliers for {} is {}".format(column, len(outlier_index)))
    
    # Return the rows whose values fall outside the range (label-based lookup)
    return df.loc[outlier_index]

In [None]:
outside_range(df,'math score')

In [None]:
outside_range(df,'reading score')

In [None]:
outside_range(df, 'writing score')

This method has flagged a total of 19 potential outliers over the three tests. However, since we are considering test scores, low results may simply be the result of poor preparation. This seems to be true, since in 18 of the 19 cases we have found, the test preparation course was not completed. Let us look deeper into the one case found above in which the test preparation course was completed.

In [None]:
df.iloc[842]

We notice that this student's parents only achieved a high school level of education and that the student receives free/reduced lunch. From our analysis, this seems a plausible cause of this student's poor results, since these two factors in combination may outweigh the benefits that completing the preparation course brings. We shall therefore leave this record within our dataset.

In conclusion, we have decided that there are no outlying cases within this dataset and will therefore use all entries in the production and evaluation of our models. 

### 3.2 Dummy Variables

Since all of our independent variables are in 'object' format, they will not be processed by any machine learning algorithms in their current state. In order to make use of them within our models, we must convert these categorical features into numerical features through the use of dummy variables and label encoding. 

#### 3.2.1 Label Encoding

Label encoding is used when the categorical feature is ordinal. As a result of our analysis above, we have found that the four variables "race/ethnicity", "parental level of education", "lunch" and "test preparation course" have a significant order and we shall therefore apply label encoding to each of them. We must first generate an instance of a label encoder.

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

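We shall now use the label encoder to transform the data within the "parental level of education", "lunch" and "test preparation course" columns. Note that the "LabelEncoder" assigns integer codes in sorted (alphabetical) order of the labels, so the codes for "parental level of education" do not follow the educational ranking.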

In [None]:
df['parental level of education'] = label_encoder.fit_transform(df['parental level of education'])
df['lunch'] = label_encoder.fit_transform(df['lunch'])
df['test preparation course'] = label_encoder.fit_transform(df['test preparation course'])
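
Because `LabelEncoder` numbers classes in sorted order, "associate's degree" receives 0 and "high school" receives 2. If the codes are meant to follow the educational ranking, one option is an explicit mapping. The following is a minimal sketch and not the encoding used in the rest of this kernel: `education_rank` is a name introduced here for illustration, and the ranking simply reuses the order of `education_order` from section 1.2.4.

```python
import pandas as pd

# Ranking taken from the education_order list used for the box plots earlier.
education_order = ["high school", "some college", "associate's degree",
                   "bachelor's degree", "master's degree"]

# Hypothetical mapping for illustration: label -> rank (0 = lowest level)
education_rank = {level: rank for rank, level in enumerate(education_order)}

# Made-up column for illustration; in the notebook this would be
# df['parental level of education'] before it is label encoded.
education = pd.Series(["master's degree", "high school", "some college"])

encoded = education.map(education_rank)
print(encoded.tolist())
```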

We shall apply our own ranking system to the "race/ethnicity" column.

In [None]:
# Map each group label to its rank in a single step
race_rank = {'group A': 1, 'group B': 2, 'group C': 3, 'group D': 4, 'group E': 5}
df['race/ethnicity'] = df['race/ethnicity'].map(race_rank)

#### 3.2.2 Dummy Variables

Let us now use the pandas "get_dummies" function to convert the gender column. We must set the "drop_first" option to be true in order to reduce multicollinearity.

In [None]:
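# dtype=int keeps the dummy column as 0/1 (recent pandas versions default to True/False)
gender = pd.get_dummies(df['gender'], drop_first=True, dtype=int)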
df = pd.concat([df,gender],axis=1)
df.head()

We can see that we have created a new column called "male" which contains a value of 0 if the student is female and 1 if the student is male. As a result, we can now drop the "gender" column, since all of its information is contained within the new "male" column.

In [None]:
df = df.drop('gender',axis=1)

In [None]:
df.head()

All of our variables are now in a numerical format and are ready for use in our machine learning models. 

### 3.3 Generation of new dependent variable

In our model creation section, we shall create models that predict the average score obtained by the student, as well as creating models that predict the scores obtained in each individual exam. As a result, we must create a new column which contains the average score obtained for use in the model fitting and testing processes.

In [None]:
df['average score'] = (df['math score'] + df['reading score'] + df['writing score']) / 3

In [None]:
df.head()

## 4. Model Creation

In this section we will create models to predict the scores achieved by each student in all 3 exams, as well as creating models to predict the average score achieved by each student.

### 4.1 Predicting Average Scores

We shall create two models, one linear regression model and one deep learning model to predict average scores.

#### 4.1.1 Linear Regression Model

Since we will also use this data in the deep learning models created below, it is recommended that we scale it. First, let us perform a train/test split and then we shall scale our inputs.

In [None]:
X = df[['race/ethnicity','parental level of education','lunch','test preparation course','male']]
y = df['average score']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test_ave = train_test_split(X, y, test_size=0.3, random_state=101)

We shall now use the MinMax scaler to scale our input values.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


We are now able to build our linear regression model and use it to create predictions for our test set.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)
lr_ave_pred = lr.predict(X_test)

We shall analyse the predictions obtained by the linear regression model in the analysis section below.

#### 4.1.2 Deep Learning Model using Keras

We shall now build and train a deep learning model using the Keras library. First, we import the necessary packages.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

We will create a multi-layered deep network which includes dropout layers. Throughout the network we shall use the rectified linear unit activation function and set a dropout probability of 0.25. When compiling the model, we shall use the 'adam' optimiser and the mean squared error as the loss function.

In [None]:
ave_deep_model = Sequential()

# Input Layer
ave_deep_model.add(Dense(5,activation='relu'))
ave_deep_model.add(Dropout(0.25))

# Hidden Layer 1
ave_deep_model.add(Dense(10,activation='relu'))
ave_deep_model.add(Dropout(0.25))

# Hidden Layer 2
ave_deep_model.add(Dense(20,activation='relu'))
ave_deep_model.add(Dropout(0.25))

# Hidden Layer 3
ave_deep_model.add(Dense(10,activation='relu'))
ave_deep_model.add(Dropout(0.25))

# Output Layer
ave_deep_model.add(Dense(1,activation='relu'))

# Compile Model
ave_deep_model.compile(optimizer='adam',loss='mse')
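
To illustrate what a `Dropout(0.25)` layer does during training, the sketch below applies so-called inverted dropout to an array with NumPy: each activation is zeroed with probability 0.25 and the survivors are scaled by 1/(1 - 0.25) so that the expected value is unchanged. This is a simplified illustration rather than the Keras implementation, and the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
activations = np.ones(100_000)
dropped = dropout(activations, rate=0.25, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean after dropout:", dropped.mean())
```

Dropout is only active while training; when the model is used for prediction, the activations pass through unchanged.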

We can create early stopping criteria in an attempt to prevent over-fitting.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=250)
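
The patience mechanism can be sketched in plain Python. The function below is a hypothetical helper, not part of Keras: it scans a list of per-epoch validation losses and reports the epoch at which training would halt once the loss has failed to improve for `patience` consecutive epochs.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up losses: the best value occurs at epoch 2
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(stopping_epoch(losses, patience=3))
```

With `patience=250` and at most 1000 epochs, training can continue long after the best epoch; passing `restore_best_weights=True` to `EarlyStopping` is one option for keeping the best weights rather than the final ones.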

We can now fit the model to our training data.

In [None]:
ave_deep_model.fit(x=X_train, 
          y=y_train.values, 
          epochs=1000,
          validation_data=(X_test, y_test_ave), verbose=1,
          batch_size=64,
          callbacks=[early_stop]
          )

Let us now create predictions using our deep learning model; we shall analyse them in the "Model Analysis" section.

In [None]:
ave_deep_model_pred = ave_deep_model.predict(X_test)

### 4.2 Predicting Individual Scores

In this section we shall create models to predict the scores that the students will obtain in each of the three tests individually.

#### 4.2.1 Linear Regression Model

Let us recreate our X and y variables and rescale using the MinMax scaler as before.


In [None]:
X = df[['race/ethnicity','parental level of education','lunch','test preparation course','male']]
y = df[['math score','reading score','writing score']]

In [None]:
X_train, X_test, y_train, y_test_indiv = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We can now create a linear regression model to predict the results of the three tests individually.

In [None]:
lr3 = LinearRegression()
lr3.fit(X_train,y_train)
lr3_pred = lr3.predict(X_test)

#### 4.2.2 Deep Learning Model

In order to compare the results directly, we shall use the same model construction as before when predicting average scores. We shall simply change the output layer to predict 3 values.

In [None]:
multi_deep_model = Sequential()

# Input Layer
multi_deep_model.add(Dense(5,activation='relu'))
multi_deep_model.add(Dropout(0.25))

# Hidden Layer 1
multi_deep_model.add(Dense(10,activation='relu'))
multi_deep_model.add(Dropout(0.25))

# Hidden Layer 2
multi_deep_model.add(Dense(20,activation='relu'))
multi_deep_model.add(Dropout(0.25))

# Hidden Layer 3
multi_deep_model.add(Dense(10,activation='relu'))
multi_deep_model.add(Dropout(0.25))

# Output Layer
multi_deep_model.add(Dense(3,activation='relu'))

# Compile Model
multi_deep_model.compile(optimizer='adam',loss='mse')

In [None]:
multi_deep_model.fit(x=X_train, 
          y=y_train.values, 
          epochs=1000,
          validation_data=(X_test, y_test_indiv), verbose=1,
          batch_size=64,
          callbacks=[early_stop]
          )

In [None]:
multi_deep_preds = multi_deep_model.predict(X_test)

Our models have now all been trained and used to predict either average scores or each test score individually. 

## 5. Model Analysis

In this section, we shall analyse the two models created for each type of regression problem. 

### 5.1 Average Scores

Let us import the mean squared and mean absolute error functions from Scikit-learn.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

Let us now investigate these values for each of the two models.

In [None]:
print("Linear Regression MAE: {}".format(mean_absolute_error(y_test_ave,lr_ave_pred)))
print("Deep Learning Model MAE: {}".format(mean_absolute_error(y_test_ave,ave_deep_model_pred)))
print("-" * 40)
print("Linear Regression MSE: {}".format(mean_squared_error(y_test_ave,lr_ave_pred)))
print("Deep Learning Model MSE: {}".format(mean_squared_error(y_test_ave,ave_deep_model_pred)))

We can see that both models implemented to predict the average score obtained by the student have a mean absolute error of approximately 10 marks. In some exams, the boundaries between grades can be as little as 7 marks. As a result, our models may predict entirely incorrect grades for some students, which could cause them problems when applying for university.
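
For reference, both metrics are simple to compute by hand, and comparing against a baseline that always predicts the mean helps put an error of around 10 marks into context. The scores below are invented for illustration only.

```python
import numpy as np

# Invented scores, not taken from the dataset
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([55.0, 58.0, 75.0, 70.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average size of the error, in marks
mse = np.mean(errors ** 2)      # penalises large errors more heavily

# Baseline: always predict the mean of the true scores
# (in practice the mean of the training targets would be used)
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(y_true - baseline))

print(mae, mse, baseline_mae)
```

A model whose MAE is close to that of the baseline has learned little from the input features.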

### 5.2 Individual Scores

In [None]:
print("Multi Target Linear Regression MAE: {}".format(mean_absolute_error(y_test_indiv,lr3_pred)))
print("Multi Target Deep Network MAE: {}".format(mean_absolute_error(y_test_indiv,multi_deep_preds)))
print("-" * 50)
print("Multi Target Linear Regression MSE: {}".format(mean_squared_error(y_test_indiv,lr3_pred)))
print("Multi Target Deep Network MSE: {}".format(mean_squared_error(y_test_indiv,multi_deep_preds)))
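
With several targets, `mean_absolute_error` averages over the three exams by default, so a single number can hide differences between them. A sketch of the per-exam breakdown with NumPy follows, using invented values; scikit-learn offers the same breakdown through the `multioutput='raw_values'` argument.

```python
import numpy as np

# Invented scores; columns are math, reading and writing
y_true = np.array([[60.0, 70.0, 65.0],
                   [80.0, 85.0, 90.0]])
y_pred = np.array([[50.0, 72.0, 65.0],
                   [70.0, 81.0, 84.0]])

per_exam_mae = np.abs(y_true - y_pred).mean(axis=0)  # one value per column
overall_mae = per_exam_mae.mean()                    # uniform average

print(per_exam_mae, overall_mae)
```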

Once again, the models appear to achieve a similar degree of accuracy when predicting the individual test scores as they did when predicting the average score.

This now leads us to ask whether the original problem was the correct one to investigate. Are we able to predict grades more accurately if we consider the problem as a classification problem rather than a regression problem?

## 6. Changing the Type of Problem

Let us make the assumption that grades are awarded based on the average score, x, for the three tests according to the following scale:

- 0 <= x <= 40: FAIL,
- 40 < x <= 50: F,
- 50 < x <= 60: E,
- 60 < x <= 70: D,
- 70 < x <= 80: C,
- 80 < x <= 90: B,
- 90 < x <= 100: A.

If we assign categories according to the grades in the following way,

- FAIL = 0,
- F = 1,
- E = 2,
- D = 3,
- C = 4,
- B = 5,
- A = 6,

we can convert our regression problem into a classification problem. We shall create a new column called 'grade' to store these new values.


In [None]:
def average_to_grade(x):
    
    if 0 <= x <= 40:
        return 0
    elif 40 < x <= 50:
        return 1
    elif 50 < x <= 60:
        return 2
    elif 60 < x <= 70:
        return 3
    elif 70 < x <= 80:
        return 4
    elif 80 < x <= 90:
        return 5
    else:
        return 6

In [None]:
df['grade'] = df['average score'].apply(average_to_grade)
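
The same banding can be expressed in vectorised form. The sketch below uses `np.digitize` with `right=True`, which places a value x in bin i when `bins[i-1] < x <= bins[i]`, matching the scale assumed above; the sample scores are made up.

```python
import numpy as np

# Upper boundaries of the bands FAIL, F, E, D, C and B
boundaries = [40, 50, 60, 70, 80, 90]

sample_scores = np.array([0.0, 40.0, 40.5, 50.0, 69.9, 90.0, 90.1, 100.0])
sample_grades = np.digitize(sample_scores, boundaries, right=True)

print(sample_grades)
```

Unlike `average_to_grade`, this version would map a negative input to 0 rather than 6, although scores below zero should not occur in this dataset.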

Let us now investigate how many students have achieved each different grade.

In [None]:
df['grade'].value_counts()

We can see that the classes are extremely unbalanced. As a result, we shall use SMOTE to produce more samples of each of the under-represented classes; specifically, we use the SMOTENC variant, which is designed for data containing categorical features. Let us first drop the now unnecessary target columns from our dataset.

In [None]:
df = df.drop(['math score','reading score','writing score','average score'],axis=1)

In [None]:
df.head()

Let us now begin using the SMOTE algorithm to create balanced classes for use in our machine learning algorithms.

In [None]:
from imblearn.over_sampling import SMOTENC

In [None]:
data = df.values
X = data[:, :-1]
y = data[:, -1]
X_columns = df.columns[:-1]
y_columns = df.columns[-1]

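# Treat the first four feature columns as categorical
oversample = SMOTENC(categorical_features=[0,1,2,3])
# fit_resample replaces fit_sample, which was removed from recent imblearn releases
X, y = oversample.fit_resample(X, y)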
X_sampled = pd.DataFrame(X, columns=X_columns)
y_sampled = pd.DataFrame(y, columns=[y_columns])

df = pd.concat([X_sampled,y_sampled],axis=1)
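
The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest neighbours from the same class. The code below is a simplified illustration for continuous features only, with made-up points and a hypothetical function name; it is not the imblearn implementation, and SMOTENC additionally assigns each categorical feature the most frequent category among the neighbours.

```python
import numpy as np

def smote_like_sample(minority, k, rng):
    """Create one synthetic point from an array of minority-class rows."""
    # Pick a random minority sample
    i = rng.integers(len(minority))
    x = minority[i]
    # Find its k nearest neighbours; with distinct points, position 0
    # of the sorted distances is the sample itself, so it is skipped
    distances = np.linalg.norm(minority - x, axis=1)
    neighbour_idx = np.argsort(distances)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_idx)]
    # Interpolate between the sample and the chosen neighbour
    gap = rng.random()
    return x + gap * (neighbour - x)

rng = np.random.default_rng(0)
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
synthetic = np.array([smote_like_sample(minority_points, k=2, rng=rng)
                      for _ in range(5)])
print(synthetic)
```

Every synthetic point lies between two existing points of the same class, so no values outside the observed range are created.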

Let us check to ensure that we have perfectly balanced "grade" classes.

In [None]:
df['grade'].value_counts()

We can see that we now have 260 instances in each of the 7 different grade classes, for a total of 1820 data points. We shall now split the data into a training and testing set and then begin to implement and analyse different machine learning algorithms for the classification of students' grades.

In [None]:
X = df.drop('grade',axis=1)
y = df['grade']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101)

### 6.1 Logistic Regression

In this section, we shall implement and analyse a logistic regression model for the problem of predicting students' grades.

In [None]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(multi_class='multinomial',max_iter=2000)

We are now ready to train our model using the training datasets.

In [None]:
log_model.fit(X_train,y_train)

Let us create predictions and use a confusion matrix and classification report to analyse the performance.

In [None]:
log_model_preds = log_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,log_model_preds))
print("\n")
print(classification_report(y_test,log_model_preds))

Our logistic regression model achieved approximately 31% accuracy. 
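
The accuracy quoted in a classification report can be recovered from the confusion matrix, since the correct predictions lie on its diagonal. The small matrix below is invented for illustration. With 7 roughly balanced classes, random guessing would be right about 1/7 of the time, roughly 14%, which is a useful reference point.

```python
import numpy as np

# Invented 3-class confusion matrix: rows are true classes,
# columns are predicted classes
cm = np.array([[5, 2, 1],
               [1, 6, 1],
               [2, 2, 4]])

accuracy = np.trace(cm) / cm.sum()               # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / true count
chance_level_7_classes = 1 / 7

print(accuracy, recall_per_class, chance_level_7_classes)
```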

### 6.2 K-Nearest Neighbors

Let us first try to find the optimal number of neighbors by training KNN models with a range of different neighbor values and recording the error in a list.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

Let us investigate how the error rate changes with the number of neighbors used.

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
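
Rather than reading the minimum off a plot, the best value of K can also be looked up from the recorded errors. The list below is a made-up stand-in for `error_rate`; because the loop starts at K = 1, the position of the minimum is offset by one.

```python
import numpy as np

# Made-up error rates for K = 1, 2, ..., 6 (stand-in for `error_rate`)
toy_error_rate = [0.70, 0.66, 0.63, 0.61, 0.64, 0.65]

best_k = int(np.argmin(toy_error_rate)) + 1  # +1 because K starts at 1
print(best_k, toy_error_rate[best_k - 1])
```

Note that choosing K by its error on the test set makes the reported test accuracy slightly optimistic; a separate validation split or cross-validation would avoid this.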

The plot above shows that the error rate is lowest when the number of neighbors used is 17. We shall retrain a KNN model using this value and then analyse the performance on our test set.

In [None]:
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train,y_train)
knn_preds = knn.predict(X_test)
print(confusion_matrix(y_test,knn_preds))
print("\n")
print(classification_report(y_test,knn_preds))

We can see that this KNN model achieves an accuracy of approximately 40%, which is a significant improvement on the Logistic Regression model trained above.

### 6.3 Random Forest

In this section we shall implement and evaluate a random forest classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)

Create predictions and produce a confusion matrix and classification report.

In [None]:
rfc_preds = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_preds))
print("\n")
print(classification_report(y_test,rfc_preds))

Our random forest classifier achieved an accuracy of approximately 40%, similar to that achieved by the KNN model.

## 7. Conclusions

In this project we have undertaken exploratory data analysis and used our findings to create a range of models to predict student grades. Unfortunately, our best models attained an accuracy of only around 40%. This may be a result of attempting to predict grades at too fine a level of detail: rather than predicting each student's exact grade, we might have been more accurate in determining whether a student is expected to pass or fail their exams. Furthermore, there was little feature engineering in this project, which may also be an underlying reason for the poor performance of our models.

As an introductory project, I hope that I have been able to demonstrate understanding of the underlying data science techniques and ideas.