# Introduction
In this notebook, we will try to predict the gender of a student based on their socioeconomic status, family background, and scores in math, writing and reading. We will perform the following steps:

1. Exploratory Data Analytics
    * Checking for missing values
    * Visualization of variables & correlations
2. Data Engineering
    * Converting qualitative variables to dummy variables
    * Standardizing quantitative variables
3. Data Modelling with:
    * Logistic Regression
    * K Nearest Neighbours
    * Random Forest
    * SVM
4. Additional Findings
5. Comparison and Conclusion



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV 

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
students = pd.read_csv("/kaggle/input/students-performance-in-exams/StudentsPerformance.csv")

# Exploratory Data Analytics

We first take a look at the data

In [None]:
students.head()

The dataframe columns are renamed for easier accessibility

In [None]:
students.columns = "gender","race","parental_edu","lunch","test_prep","math","reading","writing"

We also check if there are any missing data in the dataset

In [None]:
students.isna().sum()
# No missing data in this dataset

We then plot bar plots and histograms to visualize the distribution of the data for each variable

In [None]:
f, axs = plt.subplots(3,3,figsize=(15,15))
students['gender'].value_counts().plot(kind='bar', ax=axs[0,0])
axs[0,0].title.set_text('Gender')
students['race'].value_counts().plot(kind='bar', ax=axs[0,1])
axs[0,1].title.set_text('Race')
students['parental_edu'].value_counts().plot(kind='bar', ax=axs[0,2])
axs[0,2].title.set_text('Parental Education')
students['lunch'].value_counts().plot(kind='bar', ax=axs[1,0])
axs[1,0].title.set_text('Lunch')
students['test_prep'].value_counts().plot(kind='bar', ax=axs[1,1])
axs[1,1].title.set_text('Test Prep')
axs[1,2].hist(students['math'])
axs[1,2].title.set_text('Math')
axs[2,0].hist(students['reading'])
axs[2,0].title.set_text('Reading')
axs[2,1].hist(students['writing'])
axs[2,1].title.set_text('Writing')

f.delaxes(axs[2][2])
f.tight_layout()
plt.show()

From this, we observe the following regarding the data:
* Qualitative variables are distributed rather evenly between the classes, with no sparse classes.
* Quantitative variables 'Math', 'Reading', and 'Writing' are roughly normally distributed, and all their values fall within the expected range of 0 to 100.
* However, the variable "Parental Education" contains the values "some college" and "some high school", which may overlap with other values of the variable. We will keep them for now, but depending on the model outcome, we may merge some values to see if we get better results.
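If we later decide to merge the overlapping education levels, a minimal sketch of the idea could look like the following. The toy series stands in for `students['parental_edu']`, and the mapping in `merge_map` is a hypothetical choice made only for illustration, not something the data dictates.

```python
import pandas as pd

# Toy column standing in for students['parental_edu'].
parental_edu = pd.Series([
    "some college", "associate's degree", "some high school",
    "high school", "bachelor's degree", "master's degree",
])

# Hypothetical mapping (an assumption): fold each "some ..." level
# into a neighbouring level.
merge_map = {
    "some high school": "high school",
    "some college": "associate's degree",
}

merged = parental_edu.replace(merge_map)
print(merged.value_counts())
```

Whether such a merge helps would still have to be checked against the model accuracies.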

Besides that, we also study the relationship between the quantitative variables

In [None]:
sns.pairplot(students)
students.corr(numeric_only=True)  # restrict the correlation matrix to the numeric score columns

From this, it can be observed that the 3 score variables are highly correlated, with the highest correlation between reading & writing. As this may affect the model output, we may remove or combine some of these variables during the data modelling stage.
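To read the strongest pair directly off a correlation matrix rather than by eye, we could rank the pairs as in the sketch below. The scores here are synthetic (an assumption for illustration); with the real data, `scores` would be the three score columns of `students`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic scores sharing a common "ability" component (illustrative only).
rng = np.random.default_rng(0)
base = rng.normal(65, 15, size=200)
scores = pd.DataFrame({
    "math": base + rng.normal(0, 8, size=200),
    "reading": base + rng.normal(0, 4, size=200),
    "writing": base + rng.normal(0, 4, size=200),
})

corr = scores.corr()

# List each pair of columns once and sort by correlation, strongest first.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(ascending=False)
print(pairs)
```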

# Data & Feature Engineering
As the columns "gender", "race", "parental_edu", "lunch" and "test_prep" are qualitative variables, we will create dummy variables for them, removing 1 dummy variable per qualitative variable to avoid the dummy variable trap (perfect multicollinearity).

We then concatenate the dummy variables with the original dataset and remove the original qualitative columns. The result is stored as a new dataframe, students_d.

In [None]:
dum_gender = pd.get_dummies(students.gender, prefix='gender', prefix_sep='_')
dum_gender.drop('gender_female', inplace=True, axis=1)

dum_race = pd.get_dummies(students.race, prefix='race', prefix_sep='_')
dum_race.columns = "race_A", "race_B", "race_C", "race_D", "race_E"
dum_race.drop('race_E', inplace=True, axis=1)

dum_parental_edu = pd.get_dummies(students.parental_edu, prefix='parental_edu', prefix_sep='_')
dum_parental_edu.columns = "parental_edu_associate", "parental_edu_bachelor", "parental_edu_hs", "parental_edu_masters", "parental_edu_somecollege", "parental_edu_somehs"
dum_parental_edu.drop('parental_edu_somehs', inplace=True, axis=1)

dum_lunch = pd.get_dummies(students.lunch, prefix='lunch', prefix_sep='_')
dum_lunch.drop('lunch_free/reduced', inplace=True, axis=1)

dum_test_prep = pd.get_dummies(students.test_prep, prefix='test_prep', prefix_sep='_')
dum_test_prep.drop('test_prep_none', inplace=True, axis=1)

students_d = pd.concat([students, dum_gender, dum_race, dum_parental_edu, dum_lunch, dum_test_prep], axis=1)
students_d.drop(['gender', 'race', 'parental_edu', 'lunch', 'test_prep'], inplace=True, axis=1)
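A more compact way to get the same kind of encoding is `pd.get_dummies` with `drop_first=True`, sketched below on a toy dataframe (the toy rows are made up for illustration). Note that `drop_first` always drops the first level of each variable, so the baseline category can differ from the one dropped manually above.

```python
import pandas as pd

# Toy rows, for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math": [72, 69, 90],
})

# Encode the qualitative columns and drop one level per variable in one call.
encoded = pd.get_dummies(toy, columns=["gender", "lunch"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```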


Since we will be using distance-based algorithms to model this problem (e.g. K nearest neighbours), we will perform min-max feature scaling so that the quantitative variables lie in the range 0 to 1. The dummy variables already take the values 0 and 1, so they are unchanged by this step.

In [None]:
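def norm_func(i):
    # Min-max scaling: maps each column to the range [0, 1]
    x = (i - i.min()) / (i.max() - i.min())
    return x

# Cast to float first so that boolean dummy columns (the default dtype in recent pandas) can be scaled
students_d = norm_func(students_d.astype(float))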
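One caveat with scaling the full dataset is that the minimum and maximum are computed from rows that will later end up in the test set. A common alternative, sketched here on synthetic scores (an assumption for illustration), is to fit scikit-learn's `MinMaxScaler` on the training rows only and reuse it for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic scores in [0, 100], standing in for math/reading/writing.
rng = np.random.default_rng(0)
X = rng.integers(0, 101, size=(100, 3)).astype(float)

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Learn min and max from the training rows only, then apply to both sets.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
```

With this approach, test values can fall slightly outside [0, 1] when the test set contains a score more extreme than any seen in training.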

# Gender Prediction

For gender prediction, we will try and compare several models, namely:
1. Logistic Regression
2. K-Nearest Neighbours
3. Random Forest
4. SVM

# Logistic Regression Model

For model validation, we will use the validation set approach. For this, we first perform a train-test split on the data in the ratio of 70:30. Because the split is random, the accuracies reported below vary slightly from run to run.

In [None]:
train_data,test_data = train_test_split(students_d, test_size = 0.3) # 30% test data
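To make the split repeatable and keep the gender proportions the same in both parts, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses a toy dataframe, and the seed value 42 is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a balanced binary target (illustrative only).
toy = pd.DataFrame({"gender_male": [0, 1] * 50, "math": range(100)})

train_toy, test_toy = train_test_split(
    toy,
    test_size=0.3,
    random_state=42,              # fixed seed: the same split on every run
    stratify=toy["gender_male"],  # same class proportions in train and test
)
print(len(train_toy), len(test_toy), test_toy["gender_male"].mean())
```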

We then perform logistic regression using all the initial variables from the dataset

In [None]:
logit_model = sm.logit('gender_male ~ math+reading+writing+race_A+race_B+race_C+race_D+parental_edu_associate+parental_edu_bachelor+parental_edu_hs+parental_edu_masters+parental_edu_somecollege+lunch_standard+test_prep_completed', data = train_data).fit()

logit_model.summary()

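From this, we can see that some of the p-values are rather high, meaning the corresponding coefficients are not statistically significant. The pseudo R-squared reported in the summary is also below 0.85, although for logistic regression this statistic is not directly comparable to the R-squared of a linear model and tends to be much lower.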
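For reference, the R-squared figure in a statsmodels logit summary is McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model with that of an intercept-only (null) model:

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
```

It equals 0 when the predictors add nothing over the intercept and approaches 1 only for a near-perfect fit, so it should be read as a relative measure between models fitted to the same data.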

Either way, we will use the model for validation on the train & test set to have an idea of the model performance

In [None]:
predict_train = logit_model.predict(pd.DataFrame(train_data))
predict_test = logit_model.predict(pd.DataFrame(test_data))

cnf_test_matrix = confusion_matrix(test_data['gender_male'], predict_test > 0.5 )
print("test set confusion matrix: \n", cnf_test_matrix)

print("test set accuracy: ", accuracy_score(test_data.gender_male, predict_test > 0.5), "\n")

# Error on train data
cnf_train_matrix = confusion_matrix(train_data['gender_male'], predict_train > 0.5 )
print("train set confusion matrix: \n", cnf_train_matrix)

print("train set accuracy: ", accuracy_score(train_data.gender_male, predict_train > 0.5))

From the output, it can be seen that the model performs well on the test data, with accuracy of ~0.9. 

Additionally, based on the confusion matrix, we observe that the model's predictions are quite balanced, with approximately equal numbers of predictions for both classes.
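The same evaluation code is reused for each logistic regression model in this section. As a sketch, it could be wrapped in a small helper; the function name `evaluate` and its arguments are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(y_true, prob, threshold=0.5, label=""):
    """Threshold predicted probabilities, then print and return the
    confusion matrix and accuracy."""
    pred = (np.asarray(prob) > threshold).astype(int)
    cm = confusion_matrix(y_true, pred)
    acc = accuracy_score(y_true, pred)
    print(f"{label} confusion matrix:\n{cm}")
    print(f"{label} accuracy: {acc:.3f}")
    return cm, acc

# Tiny made-up example: four true labels and four predicted probabilities.
cm_demo, acc_demo = evaluate([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.1], label="demo")
```

With the notebook's objects, a call might look like `evaluate(test_data['gender_male'], predict_test, label="test set")`.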

We will try to further improve the logistic regression model. We do this by first computing the variance inflation factor (VIF) of each predictor to detect any multi-collinearity within the data
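As a reminder of what is being measured: the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The NumPy sketch below implements that definition with an intercept in each auxiliary regression; the helper `vif` and the synthetic columns are assumptions for illustration, and its numbers need not match those of the statsmodels function, which, as far as we understand, uses the design matrix exactly as given.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, using an intercept in every auxiliary regression."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()  # R^2 of column j on the other columns
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic check: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get large VIFs, while the unrelated column stays close to 1.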

In [None]:
vif = pd.DataFrame()

X = students_d.drop('gender_male', axis=1)
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif

As expected from the pairplot earlier, reading and writing are highly correlated. Thus, we will remove reading and calculate the VIF scores again

In [None]:
vif = pd.DataFrame()

X = students_d.drop(['gender_male','reading'], axis=1)
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

vif

Even after removing the variable 'reading', the VIF scores of math and writing are still high. Thus, rather than removing writing and relying only on math scores, it may be a better approach to average the 3 scores for model building.

We will attempt to build a model with this approach

In [None]:
# Parentheses are needed so that the sum of all three scores is divided by 3, not just writing
logit_model = sm.logit('gender_male ~ I((math+reading+writing)/3)+race_A+race_B+race_C+race_D+parental_edu_associate+parental_edu_bachelor+parental_edu_hs+parental_edu_masters+parental_edu_somecollege+lunch_standard+test_prep_completed', data = train_data).fit()

logit_model.summary()

In [None]:
predict_train = logit_model.predict(pd.DataFrame(train_data))
predict_test = logit_model.predict(pd.DataFrame(test_data))

cnf_test_matrix = confusion_matrix(test_data['gender_male'], predict_test > 0.5 )
print("test set confusion matrix: \n", cnf_test_matrix)

print("test set accuracy: ", accuracy_score(test_data.gender_male, predict_test > 0.5), "\n")

# Error on train data
cnf_train_matrix = confusion_matrix(train_data['gender_male'], predict_train > 0.5 )
print("train set confusion matrix: \n", cnf_train_matrix)

print("train set accuracy: ", accuracy_score(train_data.gender_male, predict_train > 0.5))

Based on the extremely low pseudo R-squared and the low test & train accuracies, it appears that averaging the scores discards key predictors of gender and oversimplifies the relationship between the 3 score variables.

Thus, we hypothesize that there may be an interaction between the score variables, and we add an interaction term between reading and writing.

In [None]:
logit_model = sm.logit('gender_male ~ I(reading*writing)+math+reading+writing+race_A+race_B+race_C+race_D+parental_edu_associate+parental_edu_bachelor+parental_edu_hs+parental_edu_masters+parental_edu_somecollege+lunch_standard+test_prep_completed', data = train_data).fit()

logit_model.summary()

In [None]:
predict_train = logit_model.predict(pd.DataFrame(train_data))
predict_test = logit_model.predict(pd.DataFrame(test_data))

cnf_test_matrix = confusion_matrix(test_data['gender_male'], predict_test > 0.5 )
print("test set confusion matrix: \n", cnf_test_matrix)

print("test set accuracy: ", accuracy_score(test_data.gender_male, predict_test > 0.5), "\n")

# Error on train data
cnf_train_matrix = confusion_matrix(train_data['gender_male'], predict_train > 0.5 )
print("train set confusion matrix: \n", cnf_train_matrix)

print("train set accuracy: ", accuracy_score(train_data.gender_male, predict_train > 0.5))

Based on the model summary, we can see that the newly added interaction term has a low p-value, meaning that it is statistically significant. The pseudo R-squared of the model has also increased slightly compared with the first model.

Based on the confusion matrix & accuracies, it is clear that this model is more accurate than the previous ones

Note that the variable 'reading' is retained even though it has a high P-value. This is due to the hierarchical principle, stating that if we include interaction terms in the model, we should also include main effects even if their P-value is not significant

We will try to further improve the model by removing some variables

In [None]:
logit_model = sm.logit('gender_male ~ I(reading*writing)+math+reading+writing+race_A+race_B+race_C+race_D+lunch_standard+test_prep_completed', data = train_data).fit()

logit_model.summary()

In [None]:
predict_train = logit_model.predict(pd.DataFrame(train_data))
predict_test = logit_model.predict(pd.DataFrame(test_data))

cnf_test_matrix = confusion_matrix(test_data['gender_male'], predict_test > 0.5 )
print("test set confusion matrix: \n", cnf_test_matrix)

print("test set accuracy: ", accuracy_score(test_data.gender_male, predict_test > 0.5), "\n")

# Error on train data
cnf_train_matrix = confusion_matrix(train_data['gender_male'], predict_train > 0.5 )
print("train set confusion matrix: \n", cnf_train_matrix)

print("train set accuracy: ", accuracy_score(train_data.gender_male, predict_train > 0.5))

After trying out removal of various variables, it is found that removing the variable 'parental education' results in a slight increase in accuracy, with a decrease in complexity. Thus, we will remove this variable.

Several transformations were also tried, with no further increase in accuracy. Thus, this will be the final logistic regression model.

We will store the final accuracy values for final tabulation and comparison

In [None]:
log_test_acc = accuracy_score(test_data.gender_male, predict_test > 0.5)
log_train_acc = accuracy_score(train_data.gender_male, predict_train > 0.5)

# K Nearest Neighbours

We will utilize the same train & test split data to model for K nearest neighbours. Since all the variables have already been scaled to be between 0 & 1, we can begin modelling immediately

Note that hyperparameter tuning for the value of k has already been done, and only the best value of k is displayed
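The tuning step itself is not shown in this notebook. One way it could be done without touching the test set is a cross-validated grid search over k on the training data, as in the sketch below; the synthetic data and the candidate grid of odd k values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, standing in for the student features.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# 5-fold cross-validation over odd values of k, using the training rows only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 32, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k)
print("cross-validated accuracy:", search.best_score_)
print("test accuracy:", search.score(X_te, y_te))
```

Choosing k this way keeps the test accuracy an honest estimate, since the test rows play no part in the selection.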

We first split the training & test datasets into predictors and target

In [None]:
train_X = train_data.drop(['gender_male'], axis=1)
train_Y = train_data.loc[:,'gender_male']
test_X = test_data.drop(['gender_male'], axis=1)
test_Y = test_data.loc[:,'gender_male']

In [None]:
knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(train_X,train_Y)

print("For test data: \n")
pred = knn.predict(test_X)
print(pd.crosstab(test_Y, pred, rownames=['Actual'],colnames= ['Predictions']))
print("Test accuracy:", accuracy_score(test_Y, pred))

print("\nFor training data: \n")
pred_train = knn.predict(train_X)
print(pd.crosstab(train_Y, pred_train, rownames=['Actual'],colnames= ['Predictions']))
print("Training accuracy:", accuracy_score(train_Y, pred_train))

From this first run, the accuracies of the KNN model are quite low. We will remove some variables to try to improve the accuracy of the model. We start by removing parental education, as it was also deemed statistically insignificant in the logistic regression model

In [None]:
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(train_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1), train_Y)

print("For test data: \n")
pred = knn.predict(test_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(test_Y, pred, rownames=['Actual'],colnames= ['Predictions']))
print("Test accuracy:", accuracy_score(test_Y, pred))

print("\nFor training data: \n")
pred_train = knn.predict(train_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(train_Y, pred_train, rownames=['Actual'],colnames= ['Predictions']))
print("Training accuracy:", accuracy_score(train_Y, pred_train))


The accuracy has increased quite significantly. In fact, we find that using only the scores for the variables 'math', 'reading', and 'writing', we are able to obtain the best prediction accuracies, as shown below:

In [None]:
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(train_X.drop(['race_A', 'race_B', 'race_C', 'race_D','test_prep_completed','lunch_standard','parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1), train_Y)

print("For test data: \n")
pred = knn.predict(test_X.drop(['race_A', 'race_B', 'race_C', 'race_D', 'test_prep_completed','lunch_standard','parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(test_Y, pred, rownames=['Actual'],colnames= ['Predictions']))
print("Test accuracy:", accuracy_score(test_Y, pred))

print("\nFor training data: \n")
pred_train = knn.predict(train_X.drop(['race_A', 'race_B', 'race_C', 'race_D','test_prep_completed','lunch_standard','parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(train_Y, pred_train, rownames=['Actual'],colnames= ['Predictions']))
print("Training accuracy:", accuracy_score(train_Y, pred_train))

We will store the final accuracy values for final tabulation and comparison

In [None]:
knn_test_acc = accuracy_score(test_Y, pred)
knn_train_acc = accuracy_score(train_Y,pred_train)

# Random Forest
We will utilize the same train & test split data to model for random forest

Note that hyperparameter tuning for the number of trees in forest has already been done, and only the best value is displayed

In [None]:
rf = RandomForestClassifier(n_jobs=2, n_estimators=35, criterion="entropy")

rf.fit(train_X, train_Y)

print("For test data: \n")
pred = rf.predict(test_X)
print(pd.crosstab(test_Y, pred, rownames = ['Actual'], colnames = ['Predictions']))
print("Test accuracy:", accuracy_score(test_Y,pred))

print("\nFor training data: \n")
pred_train = rf.predict(train_X)
print(pd.crosstab(train_Y, pred_train, rownames = ['Actual'], colnames = ['Predictions']))
print("Training accuracy:", accuracy_score(train_Y,pred_train))


Next, we will try removing variables to improve model accuracy. We will start by removing the variables 'parental education'

In [None]:
rf = RandomForestClassifier(n_jobs=2, n_estimators=200, criterion="entropy")

rf.fit(train_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1), train_Y)

print("For test data: \n")
pred = rf.predict(test_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(test_Y, pred, rownames = ['Actual'], colnames = ['Predictions']))
print("Test accuracy:", accuracy_score(test_Y,pred))

print("\nFor training data: \n")
pred_train = rf.predict(train_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'], axis=1))
print(pd.crosstab(train_Y, pred_train, rownames = ['Actual'], colnames = ['Predictions']))
print("Training accuracy:", accuracy_score(train_Y,pred_train))


We can see that the accuracy changes little when the variable is removed. I have also tried pruning the decision trees to further improve the performance, with not much difference in accuracy.
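A random forest also reports how much each feature contributed to its splits, which can guide which variables to try removing. The sketch below shows the idea on synthetic data where, by construction, only the first two columns carry signal; the data and column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False, the 2 informative features come first.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
cols = [f"feature_{i}" for i in range(6)]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; larger means more useful to the forest.
importances = pd.Series(rf_demo.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

On the notebook's data, the equivalent call would be `pd.Series(rf.feature_importances_, index=...)` with the column names of whatever frame the forest was fitted on.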

We will store the final accuracy values for final tabulation and comparison

In [None]:
rf_test_acc = accuracy_score(test_Y, pred)
rf_train_acc = accuracy_score(train_Y,pred_train)

# SVM

We will utilize the same train & test split data to model for SVM

We will model the data for the following kernels to select the best one: 
* Linear
* Polynomial
* Sigmoid
* Gaussian (rbf)

In [None]:
# kernel = linear
model_linear = SVC(kernel="linear")
model_linear.fit(train_X, train_Y)
pred_test_linear = model_linear.predict(test_X)
pred_train_linear = model_linear.predict(train_X)

# kernel = poly
model_poly = SVC(kernel="poly")
model_poly.fit(train_X, train_Y)
pred_test_poly = model_poly.predict(test_X)
pred_train_poly = model_poly.predict(train_X)

# kernel = sigmoid
model_sigmoid = SVC(kernel="sigmoid")
model_sigmoid.fit(train_X, train_Y)
pred_test_sigmoid = model_sigmoid.predict(test_X)
pred_train_sigmoid = model_sigmoid.predict(train_X)

# kernel = rbf
model_rbf = SVC(kernel="rbf")
model_rbf.fit(train_X, train_Y)
pred_test_rbf = model_rbf.predict(test_X)
pred_train_rbf = model_rbf.predict(train_X)

data = {"kernel":pd.Series(["linear","polynomial","sigmoid","rbf"]),"Test Accuracy":pd.Series([accuracy_score(test_Y, pred_test_linear),accuracy_score(test_Y, pred_test_poly),accuracy_score(test_Y, pred_test_sigmoid),accuracy_score(test_Y, pred_test_rbf)])}
table_acc=pd.DataFrame(data)
table_acc


From this, it can be seen that the linear kernel gives the best test accuracy.
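The four near-identical blocks above can also be written as a loop that collects both test and train accuracy per kernel. This is only a sketch of the same comparison on synthetic data (an assumption for illustration), not a change to the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for train_X / test_X / train_Y / test_Y.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

rows = []
for kernel in ["linear", "poly", "sigmoid", "rbf"]:
    model = SVC(kernel=kernel).fit(X_tr, y_tr)
    rows.append({
        "kernel": kernel,
        "Test Accuracy": accuracy_score(y_te, model.predict(X_te)),
        "Train Accuracy": accuracy_score(y_tr, model.predict(X_tr)),
    })

kernel_table = pd.DataFrame(rows)
print(kernel_table)
```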

Next, we will try removing variables to further improve model accuracy. We will start by removing the variables 'parental education'

In [None]:
train_X = train_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'],axis=1)
test_X = test_X.drop(['parental_edu_associate', 'parental_edu_bachelor', 'parental_edu_hs', 'parental_edu_masters', 'parental_edu_somecollege'],axis=1)

# kernel = linear
model_linear = SVC(kernel="linear")
model_linear.fit(train_X, train_Y)
pred_test_linear_dropped = model_linear.predict(test_X)
pred_train_linear_dropped = model_linear.predict(train_X)

# kernel = poly
model_poly = SVC(kernel="poly")
model_poly.fit(train_X, train_Y)
pred_test_poly_dropped = model_poly.predict(test_X)
pred_train_poly_dropped = model_poly.predict(train_X)

# kernel = sigmoid
model_sigmoid = SVC(kernel="sigmoid")
model_sigmoid.fit(train_X, train_Y)
pred_test_sigmoid_dropped = model_sigmoid.predict(test_X)
pred_train_sigmoid_dropped = model_sigmoid.predict(train_X)

# kernel = rbf
model_rbf = SVC(kernel="rbf")
model_rbf.fit(train_X, train_Y)
pred_test_rbf_dropped = model_rbf.predict(test_X)
pred_train_rbf_dropped = model_rbf.predict(train_X)

data = {"kernel":pd.Series(["linear","polynomial","sigmoid","rbf"]),"Test Accuracy":pd.Series([accuracy_score(test_Y, pred_test_linear_dropped),accuracy_score(test_Y, pred_test_poly_dropped),accuracy_score(test_Y, pred_test_sigmoid_dropped),accuracy_score(test_Y, pred_test_rbf_dropped)]),"Train Accuracy":pd.Series([accuracy_score(train_Y, pred_train_linear_dropped),accuracy_score(train_Y, pred_train_poly_dropped),accuracy_score(train_Y, pred_train_sigmoid_dropped),accuracy_score(train_Y, pred_train_rbf_dropped)])}
table_acc=pd.DataFrame(data)
table_acc


It can be observed that removing variables does not result in an increase in accuracy. Removal of other variables has also been tested, similarly with no increase in accuracy

We will try tuning the hyperparameters for the SVM. We will use a randomized search to find the best estimator

Note that a randomized search is used to have a good estimate of the hyperparameters, while reducing computation time

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

svm_clf=SVC()
param_distributions = {"kernel":('linear','poly','rbf'), "gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(train_X, train_Y)
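It may help to see what the two SciPy distributions in `param_distributions` actually sample. `uniform(loc, scale)` covers the interval [loc, loc + scale], so `uniform(1, 10)` draws C from [1, 11], while `reciprocal(a, b)` is log-uniform on [a, b], which spreads the draws evenly across orders of magnitude. The sample size and seed below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import reciprocal, uniform

# C: uniform with loc=1 and scale=10, i.e. values in [1, 11].
c_samples = uniform(1, 10).rvs(size=1000, random_state=0)

# gamma: log-uniform on [0.001, 0.1], so about half the draws fall below 0.01.
gamma_samples = reciprocal(0.001, 0.1).rvs(size=1000, random_state=0)

print("C range:", c_samples.min(), c_samples.max())
print("gamma range:", gamma_samples.min(), gamma_samples.max())
print("gamma median:", np.median(gamma_samples))
```

Note also that `gamma` is only used by the 'rbf', 'poly' and 'sigmoid' kernels, so for candidates with the linear kernel the sampled gamma has no effect.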

In [None]:
print("Best estimator: ",rnd_search_cv.best_estimator_)

predicted = rnd_search_cv.predict(test_X)
rnd_test_acc = accuracy_score(test_Y,predicted)
print("test accuracy: ", rnd_test_acc)

predicted_train = rnd_search_cv.predict(train_X)
rnd_train_acc = accuracy_score(train_Y,predicted_train)
print("train accuracy: ", rnd_train_acc)

From this, we see that there is not much difference from the initial SVM predictions.

We will store the highest final test accuracy value for final tabulation and comparison

In [None]:
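# Compare the best kernel accuracy in the table (not its row index) with the randomized search result
if table_acc["Test Accuracy"].max() > rnd_test_acc:
    svm_best = table_acc["Test Accuracy"].idxmax()
    svm_test_acc = table_acc.loc[svm_best, "Test Accuracy"]
    svm_train_acc = table_acc.loc[svm_best, "Train Accuracy"]
else:
    svm_test_acc = rnd_test_acc
    svm_train_acc = rnd_train_acc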

# Additional Findings

Based on the models built, it is observed that student scores in math, writing and reading are significant predictors of their gender. We would therefore like to verify this by directly comparing the scores of the two groups.

We will compare the average scores of male and female students in math, reading and writing, as well as their overall average score.

In [None]:
students["average_score"] = students.loc[:,['math','reading','writing']].mean(axis=1).round(1)
students

students.loc[:,['gender','math','writing','reading','average_score']].groupby(['gender']).mean().transpose().plot.bar()
plt.title('Comparison of Student Scores Between Genders')
plt.xlabel('Subject')
plt.ylabel('Score')
plt.legend(loc='lower right')
plt.show()


It can be observed from the plot that, on average, male students score higher in math, whereas female students score higher in writing and reading and have higher overall scores

# Comparison and Conclusion

We will tabulate the test & train accuracies for the 4 algorithms

In [None]:
data = {"Model":pd.Series(["Logistic Regression","K Nearest Neighbour","Random Forest","SVM"]),"Test Accuracy":pd.Series([log_test_acc,knn_test_acc,rf_test_acc,svm_test_acc]),"Train Accuracy":pd.Series([log_train_acc,knn_train_acc,rf_train_acc,svm_train_acc])}
table_final=pd.DataFrame(data)
table_final


From this, it can be seen that logistic regression and SVM have the highest accuracies.

It is proposed that logistic regression be used as the final model, due to advantages in terms of:
* Interpretability, as logistic regression provides a formula with the relative weight of each variable. The probability of each class is also provided
* Simplicity, as the logistic regression model built uses fewer variables
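To illustrate the interpretability point, the sketch below shows how a fitted logistic regression turns coefficients into a probability and into odds ratios. The intercept and coefficient values are hypothetical numbers chosen for illustration; they are not taken from the model fitted in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients (assumption, for illustration only).
intercept = -0.5
coefs = {"math": 4.0, "writing": -3.0}

# One student with min-max scaled scores.
student = {"math": 0.8, "writing": 0.6}

# Log-odds are a weighted sum of the predictors; the sigmoid gives P(male).
log_odds = intercept + sum(coefs[k] * student[k] for k in coefs)
prob_male = sigmoid(log_odds)

# exp(coefficient) is the multiplicative change in odds per unit increase.
odds_ratios = {k: float(np.exp(v)) for k, v in coefs.items()}

print("log-odds:", log_odds)
print("P(male):", prob_male)
print("odds ratios:", odds_ratios)
```

In this toy setting, a positive coefficient on math raises the predicted odds of the student being male, and a negative coefficient on writing lowers them.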