
# Introduction to the Investigation
### Purpose:
#### In this project, I investigate which machine learning model best predicts students' exam scores.
#### The data used in this investigation comprises three test scores: "math", "reading" and "writing". It also has five feature columns: "gender", "race/ethnicity", "parental level of education", "lunch" and "test preparation course".
#### I hope to get the most accurate predictions I can from this project.
#### I chose this data set because it has a variety of features. Most of the values are non-numerical, but they can easily be used for prediction after a small modification.

### Hypothesis:
#### I predict that the best model for this data will be linear regression.
#### I predict that the features with the strongest effect on the predictions will be "test preparation course", "lunch" and "parental level of education", as they generally affect students' performance more in real life.
#### I aim to minimise the Mean Absolute Error (MAE) as much as possible.
#### Given the size of the data, I will aim to keep the MAE below 15.

# Preparation
#### Import all the libraries that are going to be used.
#### List the files in the input data directory.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualisation purposes
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

print('Set up complete')

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load the data
#### Load the CSV file and show a summary of every column.

In [None]:
trainfile='../input/students-performance-in-exams/StudentsPerformance.csv'
data = pd.read_csv(trainfile)
data.describe(include='all')
#print(data)

# Gather and Explore the data
#### Define the features that the model will be trained with.
#### Drop all rows with missing values to get a more accurate outcome.
#### Define the students' exam scores as the prediction targets.
#### Visualise all the scores the students got.

In [None]:
selected_columns=['gender','lunch','test preparation course']
X=data[selected_columns]
X=X.dropna(axis=0)
X.describe()
math_y=data['math score']
read_y=data['reading score']
write_y=data['writing score']
sns.pairplot(data[['math score', 'reading score', 'writing score']], height = 5)
sns.relplot(x='math score', y='reading score', hue='lunch', data=data[['math score', 'reading score', 'lunch']])
sns.relplot(x='math score', y='reading score', hue='test preparation course', data=data[['math score', 'reading score', 'test preparation course']])
sns.relplot(x='math score', y='reading score', hue='parental level of education', data=data[['math score', 'reading score', 'parental level of education']])

# Model selection
#### According to the graphs above, the scores roughly follow a straight line, and the features I chose for prediction all appear to have a large influence on the outcome. Therefore, I expect a linear regression model to perform better at predicting the scores.
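
#### As a rough numerical check of how closely two sets of scores follow a line, the Pearson correlation can be computed with pandas. This is a minimal sketch on made-up numbers, not the real data: the `toy_scores` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical scores for five students (not taken from the real data set)
toy_scores = pd.DataFrame({
    'math score': [50, 60, 70, 80, 90],
    'reading score': [55, 62, 75, 78, 95],
})

# Pearson correlation matrix: values close to 1 mean the points lie close to a rising line
toy_corr = toy_scores.corr()
math_reading_r = toy_corr.loc['math score', 'reading score']
print(toy_corr)
```

#### The same `corr()` call could be applied to the three score columns of `data` to put a number on what the scatter plots show.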

In [None]:
sns.relplot(x='math score', y='reading score', hue='race/ethnicity', data=data[['math score', 'reading score', 'race/ethnicity']])
sns.relplot(x='math score', y='reading score', hue='gender', data=data[['math score', 'reading score', 'gender']])

#### I plot two more graphs, for the features "gender" and "race/ethnicity", to look at their effect on the predictions. According to these two graphs, race/ethnicity does not have much effect, as the scores seem to be evenly distributed across the groups, while gender seems to have a big impact on the scores. I will therefore add gender to the list of features used to train the model.

In [None]:
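# Add gender only if it is not already selected; appending it twice would create duplicate columns
if 'gender' not in selected_columns:
    selected_columns.append('gender')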
X=data[selected_columns]
X=X.dropna(axis=0)

# Prepare the data
#### Convert the non-numerical data to numerical data using one-hot encoding.
#### Split the data into training and validation sets three times, once for each target, as we are going to predict students' math, reading and writing scores.
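
#### The two steps above can be sketched on a tiny made-up table. The `toy` frame, the `toy_target` values and the variable names are assumptions for illustration; only `pd.get_dummies` and `train_test_split` are the calls used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and scores for four students
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'lunch': ['standard', 'free/reduced', 'standard', 'standard'],
})
toy_target = pd.Series([72, 65, 90, 58], name='math score')

# One-hot encoding: every category becomes its own indicator column
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))

# With the default settings, a quarter of the rows is held out for validation
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_encoded, toy_target, random_state=1)
print(len(toy_train_X), len(toy_val_X))
```

#### Because the features and the target are split in one call, each row of features stays paired with its own score.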

In [None]:
one_hot_X=pd.get_dummies(X)
one_hot_X.head()
math_train_X,math_val_X,math_train_y,math_val_y=train_test_split(one_hot_X, math_y, random_state=1)
read_train_X,read_val_X,read_train_y,read_val_y=train_test_split(one_hot_X, read_y, random_state=1)
write_train_X,write_val_X,write_train_y,write_val_y=train_test_split(one_hot_X, write_y, random_state=1)
write_train_X.head(10)

# Fit the model
### Linear Regression
#### Following the hypothesis made above, I am going to use linear regression to make predictions.
#### I will train three separate linear regression models, one each for the math, reading and writing scores.

In [None]:
linear_math=linear_model.LinearRegression()
linear_math.fit(math_train_X,math_train_y)
linear_read=linear_model.LinearRegression()
linear_read.fit(read_train_X,read_train_y)
linear_write=linear_model.LinearRegression()
linear_write.fit(write_train_X,write_train_y)

### Decision Tree Regressor
#### To make sure I get the most accurate predictions, I will train three more models using a Decision Tree Regressor.
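
#### When every feature is a one-hot encoded category, a fully grown regression tree can only tell apart the distinct combinations of categories, so it ends up predicting the average score of each combination. The sketch below illustrates this on made-up data; the `toy` frame, its scores and the column name `group_mean` are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: two categorical features and a made-up score
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'free/reduced',
              'free/reduced', 'standard', 'free/reduced'],
    'test preparation course': ['none', 'completed', 'none',
                                'completed', 'none', 'completed'],
    'score': [70, 80, 55, 68, 74, 62],
})
feature_cols = ['lunch', 'test preparation course']

# Same encoding and tree settings as in this notebook
toy_X = pd.get_dummies(toy[feature_cols])
toy_tree = DecisionTreeRegressor(max_depth=10, random_state=1)
toy_tree.fit(toy_X, toy['score'])

# Compare the tree's predictions with the plain average of each feature combination
toy['tree_prediction'] = toy_tree.predict(toy_X)
toy['group_mean'] = toy.groupby(feature_cols)['score'].transform('mean')
print(toy)
```

#### On this toy data the two columns agree, which suggests why a tree trained on only a few categorical features cannot separate students who share the same feature values.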

In [None]:
math_score_predictor=DecisionTreeRegressor(max_depth=10,random_state=1)
math_score_predictor.fit(math_train_X,math_train_y)
read_score_predictor=DecisionTreeRegressor(max_depth=10,random_state=1)
read_score_predictor.fit(read_train_X,read_train_y)
write_score_predictor=DecisionTreeRegressor(max_depth=10,random_state=1)
write_score_predictor.fit(write_train_X,write_train_y)

# Making predictions
#### Use both sets of trained models to predict the scores of every student in the data set, and store the predictions next to the actual scores.

In [None]:
linear_math_pred=linear_math.predict(one_hot_X)
X['linear_math score']=math_y
X['lm_Predicted']=linear_math_pred
linear_read_pred=linear_read.predict(one_hot_X)
X['linear_reading score']=read_y
X['lr_Predicted']=linear_read_pred
linear_write_pred=linear_write.predict(one_hot_X)
X['linear_writing score']=write_y
X['lw_Predicted']=linear_write_pred

In [None]:
math_pred=math_score_predictor.predict(one_hot_X)
X['math score']=math_y
X['m_Predicted']=math_pred
read_pred=read_score_predictor.predict(one_hot_X)
X['reading score']=read_y
X['r_Predicted']=read_pred
write_pred=write_score_predictor.predict(one_hot_X)
X['writing score']=write_y
X['w_Predicted']=write_pred
print(X)

# Evaluate the model
#### Check the mean absolute errors of the predictions on the validation data in order to evaluate the accuracy of our models.
#### One big improvement in the Mean Absolute Error (MAE) came when I changed the model from a Decision Tree Classifier to a Decision Tree Regressor: the MAE dropped from 20 to 10. I then changed the features used to train the model, but the MAE stayed at around 10.
#### This improvement can be attributed to the kind of model I am using. A Decision Tree Classifier is designed to predict discrete class labels, so it treats every distinct score as a separate category, while a Decision Tree Regressor is designed to predict numerical values. Therefore, the regressor makes better predictions on numerical data.
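
#### The difference between the two tree models can be sketched on made-up data. In each group of students the classifier returns the most frequent score, while the regressor returns the average score. The arrays `toy_X` and `toy_y` below are assumptions for illustration, not the real data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical data: one binary feature and made-up integer scores
toy_X = np.array([[0]] * 5 + [[1]] * 5)
toy_y = np.array([40, 40, 90, 95, 100, 50, 50, 70, 80, 85])

toy_clf = DecisionTreeClassifier(random_state=1).fit(toy_X, toy_y)
toy_reg = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)

# Classifier: most frequent score per group; regressor: average score per group
clf_pred = toy_clf.predict(toy_X)
reg_pred = toy_reg.predict(toy_X)

# mean_absolute_error takes the true values first, then the predictions
clf_mae = mean_absolute_error(toy_y, clf_pred)
reg_mae = mean_absolute_error(toy_y, reg_pred)
print(f'classifier MAE: {clf_mae}, regressor MAE: {reg_mae}')
```

#### On this toy data the classifier has the larger error, because the most frequent score in a group can lie far from most of the other scores in that group.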

In [None]:
lmc=linear_math.predict(math_val_X)
lrc=linear_read.predict(read_val_X)
lwc=linear_write.predict(write_val_X)
# mean_absolute_error expects (y_true, y_pred); the MAE is symmetric, so the values are unchanged
lm_mae=mean_absolute_error(math_val_y,lmc)
lr_mae=mean_absolute_error(read_val_y,lrc)
lw_mae=mean_absolute_error(write_val_y,lwc)
print(f'Mean absolute error of linear regressor on math score prediction: {lm_mae}')
print(f'Mean absolute error of linear regressor on reading score prediction: {lr_mae}')
print(f'Mean absolute error of linear regressor on writing score prediction: {lw_mae}')

In [None]:
# mean_absolute_error expects (y_true, y_pred); the MAE is symmetric, so the values are unchanged
mc=math_score_predictor.predict(math_val_X)
m_mae=mean_absolute_error(math_val_y,mc)
rc=read_score_predictor.predict(read_val_X)
r_mae=mean_absolute_error(read_val_y,rc)
wc=write_score_predictor.predict(write_val_X)
w_mae=mean_absolute_error(write_val_y,wc)
print(f'Mean absolute error of decision tree regressor on math score prediction: {m_mae}')
print(f'Mean absolute error of decision tree regressor on reading score prediction: {r_mae}')
print(f'Mean absolute error of decision tree regressor on writing score prediction: {w_mae}')

# Conclusion
### The purposes of the investigation:
#### In this investigation, I successfully implemented the Decision Tree Regressor and linear regression to make predictions based on students' pre-exam features.
#### The quality of the predictions is not very good, but it is the best I can achieve with what I have learnt so far. There may be other methods that can help reduce the Mean Absolute Error of my predictions. I also implemented a random forest regressor and got a mean absolute error similar to that of the Decision Tree Regressor.
#### However, I do have some ideas that may help lower the Mean Absolute Error. I can draw more graphs and analyse further which features carry more weight than others, and change the weighting of those features when training the model. Alternatively, I can use some reinforcement learning techniques, or get more data on students' exam performance, to get a better outcome.
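
#### One possible way to analyse which features carry more weight is to read the impurity-based `feature_importances_` of a fitted random forest. This is only a sketch on synthetic data: the feature names `took_prep`, `standard_lunch` and `noise_feature`, and the formula that generates the scores, are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two features that influence the score and one that does not
rng = np.random.default_rng(0)
n_students = 400
toy = pd.DataFrame({
    'took_prep': rng.integers(0, 2, n_students),
    'standard_lunch': rng.integers(0, 2, n_students),
    'noise_feature': rng.integers(0, 2, n_students),
})
toy_score = (60 + 10 * toy['took_prep'] + 3 * toy['standard_lunch']
             + rng.normal(0, 2, n_students))

toy_forest = RandomForestRegressor(n_estimators=50, random_state=1)
toy_forest.fit(toy, toy_score)

# Importances sum to 1; a larger value means the feature explains more of the variation
importances = pd.Series(toy_forest.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

#### Applied to the one-hot encoded features of this notebook, a ranking like this could help decide which features to keep or to explore further.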