# <h1 style="font-family:Garamond; color: purple; font-size:30px"><center> Can we predict a student's gender based on their maths, reading and writing scores? </center></h1>

Throughout my entire school experience, there was always a stereotype that girls are better at subjects that are more creative with an emphasis on reading and writing, whereas boys are better at the more logical subjects. Personally, I never fit this stereotype, but is it generally true?

In this notebook I will be using students' test scores in maths, reading and writing to determine if their genders can be correctly identified without knowing any other characteristics.

# 1. Imports

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
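from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2, so keep the old name
# as an alias for its replacement, ConfusionMatrixDisplay.from_estimator.
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator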

# 2. Loading data

In [None]:
students = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
students

For this task, we only need the students' genders and their maths, reading and writing scores.

In [None]:
scores = students[['gender', 'math score', 'reading score', 'writing score']]
scores

In [None]:
scores.info()

# 3. Checking gender frequencies

In classification problems, it is important to check that the target classes aren't significantly unbalanced, because a model can reach a misleadingly high accuracy simply by favouring the majority class. For this dataset, that means checking the male and female frequencies.

In [None]:
sns.set_style("darkgrid")
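# Pass the DataFrame and column name as keywords; recent seaborn versions
# no longer accept the data as a positional argument.
sns.countplot(data=scores, x='gender', palette="Set1")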

The two classes are only slightly unbalanced, so this shouldn't noticeably affect the results.
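The bar chart gives a visual impression, but the balance can also be checked as numbers. Here is a minimal sketch, assuming a DataFrame with a `gender` column like `scores`; the `toy_scores` frame and its counts are made up for illustration and are not the real data.

```python
import pandas as pd

# Made-up stand-in for the notebook's `scores` DataFrame (not the real data).
toy_scores = pd.DataFrame({"gender": ["female"] * 11 + ["male"] * 9})

# Absolute number of students in each class.
counts = toy_scores["gender"].value_counts()

# Share of each class, as a fraction of all rows.
proportions = toy_scores["gender"].value_counts(normalize=True)

print(counts)
print(proportions.round(2))
```

Running the same two `value_counts` calls on `scores['gender']` would give the actual class shares for this dataset.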

Now let's visualise the distributions of scores for each gender.

# 4. Visualising gender scores

In [None]:
sns.displot(scores, x='math score', hue='gender', palette="Set1")

Males seem to get slightly higher maths scores than females, but there isn't a massive difference.

In [None]:
sns.displot(scores, x='reading score', hue='gender', palette="Set1")

The difference here is more obvious, with female reading scores being noticeably higher than male reading scores overall.

In [None]:
sns.displot(scores, x='writing score', hue='gender', palette="Set1")

This distribution is similar to reading, with female writing scores being noticeably higher than male writing scores overall.

Judging by these plots, we should be able to create a model capable of predicting most of the genders correctly. However, I wouldn't expect 100% accuracy because there will always be anomalies - for instance, I am a female who has always been better at maths than reading and writing!
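The differences seen in the plots can also be summarised as per-gender mean scores. The sketch below is only an illustration: `toy_scores` is a hand-written frame that borrows the column names of `scores`, and its values are invented, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented rows that only mimic the column names of the notebook's `scores`.
toy_scores = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male"],
    "math score": [60, 72, 65, 70, 78, 66],
    "reading score": [75, 80, 70, 62, 70, 58],
    "writing score": [78, 82, 72, 60, 68, 55],
})

# Mean of each score column within each gender.
group_means = toy_scores.groupby("gender").mean()

# Gap between the class means (female minus male) for each subject.
mean_gap = group_means.loc["female"] - group_means.loc["male"]

print(group_means.round(1))
print(mean_gap.round(1))
```

A negative gap means the male mean is higher for that subject, and a positive gap means the female mean is higher. Applying the same `groupby` to `scores` would put numbers on what the histograms suggest.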

In [None]:
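# Features: the three test scores. Target: the student's gender.
x = scores[['math score', 'reading score', 'writing score']]
y = scores['gender']

# Hold out 30% of the students as a test set (the split is random on each run).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)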
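Because the split above is random, the accuracies change from run to run. One possible variation, sketched here on synthetic data, fixes the seed with `random_state` and uses `stratify` so that both sets keep the same gender proportions. The seed value 42 and the `toy_x` / `toy_y` data are arbitrary assumptions, not something taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's features and target (not the real data).
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(
    rng.integers(0, 101, size=(200, 3)),
    columns=["math score", "reading score", "writing score"],
)
toy_y = pd.Series(["female"] * 110 + ["male"] * 90, name="gender")

# random_state makes the split repeatable; stratify keeps the
# female/male ratio the same in the training and test sets.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print(y_tr.value_counts(normalize=True).round(2))
print(y_te.value_counts(normalize=True).round(2))
```

With a fixed seed, re-running the notebook reproduces the same train and test rows, which makes the reported accuracies easier to compare.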

# 5. Logistic Regression

In [None]:
# Named logreg rather than linreg: this is logistic, not linear, regression.
logreg = LogisticRegression()

logreg.fit(x_train, y_train)
print('Logistic Regression Accuracy:', logreg.score(x_test, y_test)*100, '%')

In [None]:
sns.set_style("white")
disp = plot_confusion_matrix(logreg, x_test, y_test, cmap=plt.cm.PuBu)
disp.ax_.set_title('Confusion Matrix for Logistic Regression')
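The confusion matrix plot can also be read as plain numbers, which makes the link to accuracy explicit: accuracy is the sum of the diagonal divided by the total. The labels below are a small hand-written example, not predictions from any model in this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written true and predicted labels, for illustration only.
y_true = ["female", "female", "female", "male", "male", "male", "male", "female"]
y_pred = ["female", "male", "female", "male", "male", "female", "male", "female"]
labels = ["female", "male"]

# Rows are the true classes, columns the predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Accuracy is the share of predictions that fall on the diagonal.
accuracy_from_cm = cm.trace() / cm.sum()

# Recall per class: correct predictions divided by the true count of that class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
print(dict(zip(labels, recall_per_class)))
```

Looking at the per-class recall alongside the overall accuracy shows whether a model is noticeably better at identifying one gender than the other.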

# 6. K-Nearest Neighbours

In [None]:
knn = KNeighborsClassifier()

knn.fit(x_train, y_train)
print('KNN Accuracy:', knn.score(x_test, y_test)*100, '%')

In [None]:
sns.set_style("white")
disp2 = plot_confusion_matrix(knn, x_test, y_test, cmap=plt.cm.PuBu)
disp2.ax_.set_title('Confusion Matrix for KNN')

# 7. Gaussian Naive Bayes

In [None]:
nb = GaussianNB()

nb.fit(x_train, y_train)
print('Gaussian Naive Bayes Accuracy:', nb.score(x_test, y_test)*100, '%')

In [None]:
sns.set_style("white")
disp3 = plot_confusion_matrix(nb, x_test, y_test, cmap=plt.cm.PuBu)
disp3.ax_.set_title('Confusion Matrix for Gaussian Naive Bayes')

# 8. Decision Tree Classifier

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
print('Decision Tree Accuracy:', dtc.score(x_test, y_test)*100, '%')

In [None]:
sns.set_style("white")
disp4 = plot_confusion_matrix(dtc, x_test, y_test, cmap=plt.cm.PuBu)
disp4.ax_.set_title('Confusion Matrix for Decision Tree Classifier')

# 9. SVC (Support Vector Classification)

In [None]:
svc = SVC()
svc.fit(x_train, y_train)
print('SVC Accuracy:', svc.score(x_test, y_test)*100, '%')

In [None]:
sns.set_style("white")
disp5 = plot_confusion_matrix(svc, x_test, y_test, cmap=plt.cm.PuBu)
disp5.ax_.set_title('Confusion Matrix for SVC')
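Each model above is scored on a single train/test split, so small differences between them may just reflect that particular split. As a rough sketch of an alternative, the same five classifiers could be compared with 5-fold cross-validation. The data here is synthetic, drawn from made-up normal distributions, so the printed scores say nothing about the real dataset; `max_iter=1000` and `random_state=0` are my own choices, not settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic scores (maths, reading, writing) with invented class means.
rng = np.random.default_rng(0)
n = 150
toy_female = rng.normal(loc=[62, 72, 72], scale=10, size=(n, 3))
toy_male = rng.normal(loc=[68, 65, 63], scale=10, size=(n, 3))
toy_x = np.vstack([toy_female, toy_male])
toy_y = np.array(["female"] * n + ["male"] * n)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
}

# Mean accuracy over 5 stratified folds for each model.
cv_results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, toy_x, toy_y, cv=5)
    cv_results[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Swapping `toy_x` and `toy_y` for the notebook's `x` and `y` would give a mean and spread per model instead of a single accuracy figure.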

# 10. Conclusion

The answer is YES, in most cases we can predict the gender of a student based on their test scores. With an accuracy of 89.3% using SVC in my run (the exact figure depends on the random train/test split), the model actually performed better than I was expecting. Generally, it seems that in this dataset males tend to score higher in maths, whereas females tend to score higher in reading and writing.
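To put an accuracy figure in context, it helps to compare it with a baseline that ignores the scores entirely and always predicts the most common gender. The sketch below shows how such a baseline could be computed with scikit-learn's `DummyClassifier`. The data is synthetic and the 110/90 class split is invented, so the resulting number is not a baseline for the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic features and an invented 110/90 class split (not the real data).
rng = np.random.default_rng(0)
toy_x = rng.integers(0, 101, size=(200, 3))
toy_y = np.array(["female"] * 110 + ["male"] * 90)

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

# Always predicts the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_x_train, toy_y_train)
baseline_accuracy = baseline.score(toy_x_test, toy_y_test)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Because the classes are nearly balanced, a baseline like this lands close to the majority share, so a model is only informative to the extent that it beats that figure.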