First we import the libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn import tree,svm

Then we load the data and take a look inside.

In [None]:
df = pd.read_csv('../input/StudentsPerformance.csv')
df.head()

Let's get some basic statistics of the data.

In [None]:
df.describe()

Now let's extend the dataset. At my university you pass an exam with a score of at least 50. So for each of the three tests we will add a column to the dataset that records whether the student passed or not.

In [None]:
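# "At least 50" means a score of exactly 50 also counts as a pass, hence >=
df['math passed'] = df['math score'] >= 50
df['reading passed'] = df['reading score'] >= 50
df['writing passed'] = df['writing score'] >= 50
# A student passes overall only if all three exams are passed
df['all passed'] = df['math passed'] & df['reading passed'] & df['writing passed']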
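
A quick sanity check on the new columns can be useful here: because they are boolean, their mean is the share of students who passed. The sketch below shows the idea on a tiny made-up frame (the scores and the name `toy` are illustrative assumptions, not values from the dataset), using the "at least 50" rule described above.

```python
import pandas as pd

# Illustrative scores only; in the notebook, `df` comes from StudentsPerformance.csv
toy = pd.DataFrame({
    'math score': [50, 49, 72, 88],
    'reading score': [61, 55, 40, 90],
    'writing score': [58, 50, 45, 93],
})

score_columns = ['math score', 'reading score', 'writing score']
passed_columns = [c.replace('score', 'passed') for c in score_columns]

# A score of exactly 50 counts as a pass
for score_column, passed_column in zip(score_columns, passed_columns):
    toy[passed_column] = toy[score_column] >= 50
toy['all passed'] = toy[passed_columns].all(axis=1)

# Mean of a boolean column = fraction of rows that are True
pass_rates = toy[passed_columns + ['all passed']].mean()
print(pass_rates)
```

Running the same `.mean()` on the real `df` would give the overall pass rate per exam before breaking it down by group.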

Time for some visualizations.

In [None]:
RESULT_COLUMNS = ['math passed', 'reading passed', 'writing passed', 'all passed']

def plotPassedByColumn(column, df):
    # Count plots: passed vs. failed students per category, one panel per result
    plt.figure(figsize=(10, 4))
    for position, result in enumerate(RESULT_COLUMNS, start=1):
        plt.subplot(2, 2, position)
        sns.countplot(x=column, hue=result, data=df)

def barplotPercentage(column, df):
    # Bar plots: value of each result column per category, one panel per result
    plt.figure(figsize=(10, 4))
    for position, result in enumerate(RESULT_COLUMNS, start=1):
        plt.subplot(2, 2, position)
        sns.barplot(x=column, y=result, data=df)

In [None]:
plotPassedByColumn('gender', df)

We can see that female students performed better than male students in writing and reading, while male students performed better in math. No clear trend is visible in the overall results.

In [None]:
plotPassedByColumn('race/ethnicity', df)

These bars are difficult to compare across groups, so let's express the results as pass percentages per group.

In [None]:
result_types = ['math passed', 'reading passed', 'writing passed', 'all passed']
groups = ['group A', 'group B', 'group C', 'group D', 'group E']
result_type_performance = []
for group in groups:
    in_group = df['race/ethnicity'] == group
    group_performance = [group]
    for result_type in result_types:
        # Mean of a boolean column = share of the group that passed
        pass_rate = df.loc[in_group, result_type].mean()
        # round() before int() avoids truncation errors such as 0.57 * 100 -> 56
        group_performance.append(int(round(pass_rate * 100)))
    result_type_performance.append(group_performance)
res_df = pd.DataFrame(result_type_performance, columns=['group'] + result_types)
barplotPercentage('group', res_df)

It seems that group A has the most difficulty passing the exams, while group E performs best.
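
The same per-group percentages can also be computed without an explicit loop. The following is a minimal sketch using `groupby`, run on a small made-up frame (`toy` and its rows are assumptions for illustration; only the column names match the notebook).

```python
import pandas as pd

# Made-up rows for illustration; the notebook's `df` has the same column names
toy = pd.DataFrame({
    'race/ethnicity': ['group A', 'group A', 'group B', 'group B', 'group B', 'group B'],
    'math passed':    [True, False, True, True, True, False],
    'all passed':     [False, False, True, True, False, False],
})

pass_percentages = (
    toy.groupby('race/ethnicity')[['math passed', 'all passed']]
       .mean()       # mean of booleans = share that passed
       .mul(100)     # convert the share to a percentage
       .round()
       .astype(int)
)
print(pass_percentages)
```

On the real data, replacing `toy` with `df` and listing all four result columns should give a table comparable to `res_df`, with the group as the index.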

In [None]:
plotPassedByColumn('lunch', df)

It looks like students with free or reduced lunch have more trouble passing the exams than those who pay the standard price.

In [None]:
plotPassedByColumn('test preparation course', df)

Here we see that students who completed the test preparation course pass the exams noticeably more often than those who did not.

Now let's set up a simple k-nearest-neighbours (KNN) classifier to see how well the results can be predicted from the given data.

In [None]:
X = df[['gender', 'race/ethnicity', 'lunch', 'test preparation course']]
y = df['all passed']
# One-hot encode the categorical features
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.15, random_state=0)
scores = []
iterations = 100
# Try every neighbourhood size from 1 to 100 and record the test accuracy
for i in range(1, iterations + 1):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
neighbor_counts = np.arange(1, iterations + 1)
results = pd.DataFrame({'n_neighbors': neighbor_counts, 'scores': scores})
# np.argmax returns a 0-based position, so look up the matching n_neighbors value
best = int(np.argmax(scores))
(int(neighbor_counts[best]), scores[best])

So using 18 neighbours gave the best result, with an accuracy of 79.33%. Let's try some other methods as well.
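
One caveat: the loop above selects `n_neighbors` by its score on the test set, so the best accuracy it reports is likely to be somewhat optimistic. A common alternative is to choose the value by cross-validation on the training data and touch the test set only once. The sketch below illustrates this with `GridSearchCV` on synthetic data; the frame `toy`, the label rule and the search range of 1 to 30 are assumptions made for the example, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the categorical features; not the real dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'lunch': rng.choice(['standard', 'free/reduced'], size=300),
    'test preparation course': rng.choice(['none', 'completed'], size=300),
})
# Hypothetical label loosely tied to the features, plus noise
signal = ((toy['lunch'] == 'standard').astype(int)
          + (toy['test preparation course'] == 'completed').astype(int))
y = (signal + rng.normal(0, 0.8, size=300)) > 0.5

X = pd.get_dummies(toy, dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# 5-fold cross-validation on the training split picks n_neighbors
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 31))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

best_k = search.best_params_['n_neighbors']
# The held-out test set is used once, after the choice has been made
test_accuracy = search.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

With this setup the test accuracy is an estimate for a model whose hyperparameter was fixed without looking at the test data.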

In [None]:
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(X_train, y_train)
decision_tree.score(X_test, y_test)

In [None]:
# Support vector classifier with a polynomial kernel
poly_svm = svm.SVC(kernel='poly', gamma='scale')
poly_svm.fit(X_train, y_train)
poly_svm.score(X_test, y_test)

It looks like we are not getting any better, so let's recap where we are right now. We can use information about gender, race/ethnicity, lunch fees and attendance in a test preparation course for a mediocre prediction of whether students pass all their exams or not.
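
One way to put such accuracies in context is to compare them with a baseline that ignores the features and always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels; the arrays `X` and `y` and the 75% positive rate are assumptions chosen for illustration, not figures from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 3 in 4 positives; purely illustrative
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 4))   # stand-in for one-hot features
y = rng.random(400) < 0.75              # hypothetical 'all passed' column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Always predicts the majority class seen during training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# The baseline's accuracy equals the test-set share of that majority class
majority_class = baseline.predict(X_test[:1])[0]
print(majority_class, round(baseline_accuracy, 3))
```

If a classifier trained on the real features scores only slightly above this kind of baseline, the features add little predictive information.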
    
This kind of prediction could lead to positive outcomes, such as giving learning support to students who have a higher risk of failing. But harmful conclusions could also be drawn, such as "female students are not good at math" or "people of race X won't pass anyway".

As data scientists, it is our responsibility to work with data carefully, because even such a small dataset of student performance can lead to discrimination. So even if the data exists, we should ask ourselves whether using it might discriminate against people and, if it would, what methods can be used to prevent it.

