# Students' Performance in Exams
What impact does someone's background have on their performance in exams?

The data comes from Kaggle:
[https://www.kaggle.com/spscientist/students-performance-in-exams?select=StudentsPerformance.csv](https://www.kaggle.com/spscientist/students-performance-in-exams?select=StudentsPerformance.csv)

**Note**: This is fictitious data.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import matplotlib

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

### Open file

In [None]:
records = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv', sep=',')
records.head()

### Get basic stats

In [None]:
print(records['math score'].describe())
print(records.gender.count())

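This is interesting.  One thing to watch when reading this output: the `50%` row is the median math score (half of the students scored at or below it), not the share of students who scored below 50%.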
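
A claim about how many students fall below a given score is easier to check by computing that share directly than by reading it off `describe()`. Here is a minimal sketch; the `toy` frame is made-up data, not the Kaggle file:

```python
import pandas as pd

# Hypothetical toy scores, not the Kaggle data
toy = pd.DataFrame({'math score': [35, 48, 52, 66, 71, 80, 90, 45]})

# Averaging a boolean mask gives the fraction of rows where it is True
share_below_50 = (toy['math score'] < 50).mean()

# The median is what describe() reports in its "50%" row
median_score = toy['math score'].median()

print(f"Share below 50: {share_below_50:.1%}")
print(f"Median score: {median_score}")
```

The same two lines should work on `records['math score']` once the real file is loaded.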

### How many unique values does each column have?

In [None]:
records.select_dtypes('object').nunique()

### Are we missing any data?

In [None]:
records.isnull().any()

### Draw histograms of scores
Let's see the spread of scores for Math, Reading, and Writing

In [None]:
scores = records.copy()
del scores['gender']
del scores['race/ethnicity']
del scores['parental level of education']
del scores['lunch']
del scores['test preparation course']



def queueHistograms(df):
    for col in df.columns:
        drawHistogram(df[col], col)


def drawHistogram(records, col):
    fig, ax = plt.subplots()
    fig = plt.gcf()
    fig.set_size_inches(10,7)
    plt.hist(records, 10, density=False, color=(0.2, 0.4, 0.6, 0.6))
    plt.xlabel('Scores')
    plt.ylabel('# of students')
    plt.title('Histogram of ' + col)
    plt.grid(True)
    plt.show()
    
queueHistograms(scores)

### Graph spread of the different columns
How even is the spread across the different categories?

In [None]:
def graphSpread(name, cols, vals):
    y_pos = np.arange(len(cols))
    fig, ax = plt.subplots()
    fig = plt.gcf()
    fig.set_size_inches(10,7)

    rects1 = ax.bar(cols, vals, align='center', color=(0.2, 0.4, 0.6, 0.6))
    plt.xticks(y_pos, cols)
    ax.set_title(name + ' spread', fontsize=22)
    plt.ylabel('Count')
    autolabel(rects1, ax)
    plt.show()
    return

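def getData(df, col):
    # Take labels and counts from the same Series so they stay aligned:
    # unique() is ordered by first appearance, value_counts() by frequency
    counts = df[col].value_counts()
    return counts.index, counts.values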


def runGraphs(df):
    for col in df.columns:
        c, v = getData(df, col)
        graphSpread(col, c, v)

def autolabel(rects, ax):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        
        ax.text(rect.get_x() + rect.get_width()/2., 1.01*height,
                '%d' % float(height),
        ha='center', va='bottom')


chartDf = records.copy()
del chartDf['math score']
del chartDf['reading score']
del chartDf['writing score']
runGraphs(chartDf)
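
When plotting category counts, the labels and the bar heights need to come from the same object. `unique()` returns values in order of first appearance, while `value_counts()` sorts by frequency, so pairing the two can attach a count to the wrong label. A small sketch on hypothetical data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy column, not the Kaggle data
toy = pd.DataFrame({
    'lunch': ['free/reduced', 'standard', 'standard', 'standard', 'free/reduced']
})

# Order of first appearance in the column
first_seen = list(toy['lunch'].unique())

# Sorted by count, most frequent first
counts = toy['lunch'].value_counts()

# Reading labels and heights from the same Series keeps them aligned
labels = list(counts.index)
heights = [int(h) for h in counts.values]

print(first_seen)
print(labels, heights)
```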


### Get total students that scored at least 80% in math

In [None]:
remove_scores = records[records['math score'] >= 80].copy()
del remove_scores['math score']
del remove_scores['reading score']
del remove_scores['writing score']
total = remove_scores.gender.count()
print(total)

In [None]:
remove_scores.head(10)

### Create reusable functions
Format the data as needed and create a pie chart.

In [None]:
def loopData(ds, subject, grade):
    """
    Loop through all of the columns, format the data, and create a graph
    """
    for col in ds.columns:
        df = createData(ds, col, ds.gender.count())
        createPie(df, col, subject, grade)

def createData(ds, header, total):
    """
    Format the data how we want it for the chart
    """
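    # size() on a plain groupby returns a Series, which reset_index(name=...) accepts;
    # with as_index=False newer pandas returns a DataFrame and this call fails
    df = ds.groupby(header).size().reset_index(name='count')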
    df['count'] = df['count'].astype(float)
    df['count'] = (df['count'] / total * 100)
    df['count'] = df['count'].round(decimals=0)
    return df

def createPie(data, header, subject, grade):
    """
    Create pie chart.  I am not using this anymore, but it may be helpful for others.
    """
    # Use the columns directly: unique() would drop repeated percentages
    # and misalign the sizes with the labels
    labels = data[header]
    sizes = data['count']
    fig1, ax1 = plt.subplots()
    ax1.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90, radius=1900)
    ax1.axis('equal')
    ax1.set_title(grade + '%+ ' + subject + ' grade by ' + header, bbox={'facecolor':'0.8', 'pad':3}, fontsize=22)
    fig = plt.gcf()
    fig.set_size_inches(10,10)
    plt.show()
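
As an aside, the percentage share per category can also be computed in one step with `value_counts(normalize=True)`. This is only a sketch of an alternative to `createData`, shown on a hypothetical `toy` frame; the output column is named `count` to match the functions above:

```python
import pandas as pd

# Hypothetical toy column, not the Kaggle data
toy = pd.DataFrame({'gender': ['female', 'male', 'female', 'female']})

# normalize=True returns fractions that sum to 1; scale to whole percentages
shares = (toy['gender'].value_counts(normalize=True) * 100).round(0)

# Turn the Series into a two-column frame: category and percentage
share_df = shares.rename_axis('gender').reset_index(name='count')
print(share_df)
```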

In [None]:
def createDfs(df, col):
    """
    Create data frames so that we can create the graphs
    """
    ds = df[df['math score'] >= 80].copy()
    df1 = createData(ds, col, ds.gender.count())

    ds2 = df[df['reading score'] >= 80].copy()
    df2 = createData(ds2, col, ds2.gender.count())

    ds3 = df[df['writing score'] >= 80].copy()
    df3 = createData(ds3, col, ds3.gender.count())

    ds4 = df[(df['writing score'] >= 80) & (df['reading score'] >= 80) & (df['math score'] >= 80)].copy()
    df4 = createData(ds4, col, ds4.gender.count())
    return df1, df2, df3, df4



def runGraphs(df, col):
    """
    Create the graphs to show how the spread was for the top of the classes for each subject.
    """
    
    df1, df2, df3, df4 = createDfs(df, col)
    category_names = list(df1[col])

    results = {
        'Math Score 80%+': list(df1['count']),
        # df2 holds the reading results and df3 the writing results
        'Reading Score 80%+' : list(df2['count']),
        'Writing Score 80%+' : list(df3['count']),
        'All Scores 80%+' : list(df4['count']),
    }
    
    survey(results, category_names)
    fig = plt.gcf()
    plt.suptitle(col + ' impact on Exam Scores',x=.5)
    fig.set_size_inches(15,7.5) 

    plt.show()



def survey(results, category_names):
    """
    Parameters
    ----------
    results : dict
        A mapping from question labels to a list of answers per category.
        It is assumed all lists contain the same number of entries and that
        it matches the length of *category_names*.
    category_names : list of str
        The category labels.
    """
    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('twilight')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(9.2, 5))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c)), ha='center', va='center',
                    color=text_color)
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0, 1),
              loc='lower left', fontsize='small')

    return fig, ax

### Get graphs for impact for Math, Writing, Reading, and All
The student must get at least an 80% in math, writing, reading, or in all of them.

In [None]:
# Create all of the graphs
for col in remove_scores.columns:
    runGraphs(records, col)

### Initial impressions
Initially, it looks as if Group A and free/reduced lunch students are far less likely to get at least an 80% on their tests.  However, this does not show the entire picture.  These charts show each group's share of the high scorers, so the groups with the smallest shares could simply be the groups with the fewest students.
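
To separate "this group is small" from "this group rarely scores high", it helps to compute the rate inside each group. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: six standard-lunch students, two free/reduced
toy = pd.DataFrame({
    'lunch': ['standard'] * 6 + ['free/reduced'] * 2,
    'math score': [85, 90, 60, 70, 55, 65, 82, 40],
})
toy['math high mark'] = toy['math score'] >= 80

# For each group: number of high marks, group size, and the within-group rate
summary = toy.groupby('lunch')['math high mark'].agg(
    high='sum', total='count', rate='mean'
)
print(summary)
```

In this toy example the standard group supplies two of the three high scorers, yet its within-group rate (2 of 6) is lower than the free/reduced rate (1 of 2), which is exactly the kind of effect that a share-of-high-scorers chart hides.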

Let's create a new type of graph.

In [None]:
# Get a count of everything
records['math high mark'] = records['math score'] >= 80
records['reading high mark'] = records['reading score'] >= 80
records['writing high mark'] = records['writing score'] >= 80
records['all high mark'] = (records['math score'] >= 80) &(records['reading score'] >= 80) &(records['writing score'] >= 80)

print(records['math high mark'].value_counts())
print(records['reading high mark'].value_counts())
print(records['writing high mark'].value_counts())
print(records['all high mark'].value_counts())

In [None]:
# A plain groupby keeps size() as a Series, so reset_index(name=...) works
df = records.groupby('gender').size().reset_index(name='count')

In [None]:
print(df)

In [None]:
def createDBData(df, label):
    """
    Create a new data frame for the bars to say how many got high marks, 
    and how many did not.
    """
    rows = []
    for val in df[label].unique():
        # Filter the frame that was passed in rather than the global records
        tds = df[df[label] == val]
        # Summing the boolean column counts the high marks, and still
        # gives 0 when a group has no high marks at all
        high = int(tds['math high mark'].sum())
        low = len(tds) - high
        rows.append({label: val, 'high mark': high, 'not high mark': low})

    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    return pd.DataFrame(rows, columns=[label, 'high mark', 'not high mark'])

In [None]:
def autolabel2(rects, ax):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 1),  # 1 point vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom', color='black')


def drawBars(df, col):
    labels = df[col].unique()
    high_mark = df['high mark']
    low_mark = df['not high mark']

    x = np.arange(len(labels))  # the label locations
    width = 0.35  # the width of the bars

    fig, ax = plt.subplots()
    rects1 = ax.bar(x - width/2, high_mark, width, label='Score >= 80', align='center', color=(0.2, 0.4, 0.6, 0.6))
    rects2 = ax.bar(x + width/2, low_mark, width, label='Score < 80', align='center', color = (0.6, 0.4, 0.2, 0.6))

    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax.set_ylabel('Count')
    ax.set_title('Math Count by score and ' + col)
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.legend()
    autolabel2(rects1, ax)
    autolabel2(rects2, ax)

    fig.tight_layout()

    fig.set_size_inches(15,10) # or (4,4) or (5,5) or whatever

    plt.show()

In [None]:
for col in remove_scores.columns:
    newDF = createDBData(records, col)
    drawBars(newDF, col)

# Results
Lunch type is strongly associated with math grades.  36% of the students who had standard lunch got an 80%+ on their math, compared with only 6.6% of free/reduced lunch students.  What a big difference!

I also thought test preparation would have a bigger impact on the scores.  19.5% of students with no prep course got 80%+ on math.  For those that did complete the course, 32.6% of them got an 80%+.  Definitely better, but not by as much as I would have thought.

Gender played a smaller role in performance.  16% of female students got high marks in math, while 23% of male students did.

In [None]:
def convertToGrade(percent):
    """
    Convert the % grade to a letter grade.
    """
    if(percent >= 90):
        return 'A'
    if(percent >=80):
        return 'B'
    if(percent >=70):
        return 'C'
    if(percent >=60):
        return 'D'
    return 'F'
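
The same cut-offs can also be applied to a whole column at once with `pd.cut`. This is a sketch of an alternative, not what the notebook runs; the `toy_scores` values are made up, and the top bin edge of 101 is an assumption so that a score of 100 still lands in the `A` bin:

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the grade boundaries
toy_scores = pd.Series([95, 90, 89, 80, 79.5, 70, 60, 59, 0])

# right=False makes each bin closed on the left: [0, 60), [60, 70), ...
toy_grades = pd.cut(
    toy_scores,
    bins=[0, 60, 70, 80, 90, 101],
    labels=['F', 'D', 'C', 'B', 'A'],
    right=False,
)
print(list(toy_grades))
```

Because the result is an ordered categorical, sorting or plotting it keeps the grades in F-to-A order instead of alphabetical order.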

In [None]:
records['math grade'] = records.apply(lambda x: convertToGrade(x['math score']), axis = 1)
records['reading grade'] = records.apply(lambda x: convertToGrade(x['reading score']), axis = 1)
records['writing grade'] = records.apply(lambda x: convertToGrade(x['writing score']), axis = 1)


print(records['math grade'].value_counts())
print(records['reading grade'].value_counts())
print(records['writing grade'].value_counts())


In [None]:
histDF = records.copy()
histDF.head()

In [None]:
# Delete everything but the grades for charting
del histDF['gender']
del histDF['race/ethnicity']
del histDF['parental level of education']
del histDF['test preparation course']
del histDF['math score']
del histDF['reading score']
del histDF['writing score']
del histDF['math high mark']
del histDF['reading high mark']
del histDF['writing high mark']
del histDF['all high mark']
del histDF['lunch']

histDF.head()

In [None]:
histDFSorted = histDF.sort_values(by=['math grade'], ascending=False)
histDFSorted.head()
queueHistograms(histDF)

In [None]:
histDFSorted['math grade'].value_counts(sort=False)

In [None]:
records.isnull().any()

In [None]:
modelDF = records.copy()

In [None]:
def passOrFail(percent):
    """
    Create a pass/fail column
    """
    if(percent >= 60):
        return 1
    return 0

In [None]:
modelDF['average score'] = (modelDF['math score'] + modelDF['reading score'] + modelDF['writing score'])  / 3
modelDF['average score'] = modelDF['average score'].round(decimals=0)
modelDF['average grade'] = modelDF.apply(lambda x : convertToGrade(x['average score']), axis=1)
records['average grade'] = modelDF['average grade']

modelDF['passed'] = modelDF.apply(lambda x: passOrFail(x['average score']), axis = 1)
modelDF.head()

# Machine Learning task
We are going to use one of the most basic machine learning models, a DecisionTreeClassifier.  This is a classification task.  I am worried that there isn't enough data.
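
With a small data set, a single train/test split can give a noisy accuracy number. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data (`X_toy` and `y_toy` are hypothetical stand-ins for integer-coded background columns and a pass/fail label), so the scores it prints say nothing about the real data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Four integer-coded categorical features, 200 synthetic students
X_toy = rng.integers(0, 3, size=(200, 4))

# Synthetic label that depends on the first feature plus some noise
y_toy = (X_toy[:, 0] + rng.integers(0, 2, size=200) >= 2).astype(int)

clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)

# Five folds: each fold is held out once while the rest is used for training
cv_scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())
```

The spread across folds gives a rough idea of how much a single split could mislead.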

In [None]:
def convertColumn(df, col):
    """
    We can't use Strings on some of these columns, so let's do a simple conversion to numeric
    """
    # Map each distinct value to an integer code, in order of first appearance
    mapping = {val: i for i, val in enumerate(df[col].unique())}
    return df[col].map(mapping)


modelDF['gender'] = convertColumn(modelDF,'gender')
modelDF['race/ethnicity'] = convertColumn(modelDF,'race/ethnicity')
modelDF['parental level of education'] = convertColumn(modelDF,'parental level of education')
modelDF['lunch'] = convertColumn(modelDF,'lunch')
modelDF['test preparation course'] = convertColumn(modelDF,'test preparation course')



modelDF.head()



### What we are trying to guess
We are going to try to predict what the average grade will be.  This seems very difficult.

In [None]:
y = modelDF[['average grade']].copy()
y.head()

### Features
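We are going to go with their social and economic background.  We will not be using scores from their other classes, as I feel that is way too easy.  You can uncomment one of the other feature lists to try other combinations, though.  Note that the first list includes the letter-grade columns, which are strings and would need to be converted to numbers first.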

In [None]:
#features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score', 'math high mark', 'reading high mark', 'writing high mark', 'all high mark', 'math grade', 'reading grade', 'writing grade', 'average score']
#features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score', 'math high mark', 'reading high mark', 'writing high mark', 'all high mark']
#features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score']
#features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score', 'math high mark', 'reading high mark', 'writing high mark', 'all high mark', 'average score']


features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course']


X = modelDF[features].copy()

In [None]:
X.columns

In [None]:
y.columns

### Training and test data split
We are going to split our data since we do not have a separate set of data to test things on.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
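
When some grades are rare, a random split can leave the test set with very few examples of them. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A sketch on hypothetical arrays (`X_toy` and `y_toy` are made up; the notebook itself does not stratify):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic rows: 16 labelled 'pass' and 4 labelled 'fail'
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(['pass'] * 16 + ['fail'] * 4)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=12, stratify=y_toy
)
print((y_tr == 'fail').sum(), (y_te == 'fail').sum())
```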

### Train the model

In [None]:
math_classifier = DecisionTreeClassifier(max_leaf_nodes=35, random_state=337)
math_classifier.fit(X_train, y_train)

### Make our predictions
How well did the model guess on the average grade someone got just based on social and economic factors?

In [None]:
predictions = math_classifier.predict(X_test)

In [None]:
accuracy_score(y_true = y_test, y_pred = predictions)
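
An accuracy number is easier to judge next to a baseline. A simple one is to always predict the most common grade, which scikit-learn provides as `DummyClassifier(strategy='most_frequent')`. The sketch below uses made-up labels (all `*_toy` names are hypothetical), so the value is only illustrative:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical grade labels; 'C' is the most common grade in training
y_train_toy = np.array(['C'] * 6 + ['B'] * 3 + ['F'] * 1)
y_test_toy = np.array(['C', 'C', 'B', 'F', 'C'])

# The dummy model ignores the features, so placeholders are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

# Accuracy of always predicting the most frequent training label
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_acc)
```

If the decision tree does not beat this kind of baseline on the real split, the background features are adding little.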

### The model could not figure out what grade a student got
Maybe it can figure out a simple pass/fail based on these factors?

In [None]:
y2 = modelDF['passed']

X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.25, random_state=12)

math_classifier = DecisionTreeClassifier(max_leaf_nodes=3, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

### Better
This is better than guessing their grade, but not great.  Earlier I said that I didn't want to use the other scores, as I felt it was cheating.  But what if we knew the reading and writing scores, and wanted to know what their math grade would be?

In [None]:
features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'reading score', 'writing score']


y3 = modelDF['math grade']
X = modelDF[features].copy()


X_train, X_test, y_train, y_test = train_test_split(X, y3, test_size=0.25, random_state=12)

math_classifier = DecisionTreeClassifier(max_leaf_nodes=40, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

### Better
This is a lot more accurate than guessing the average grade from background alone, but much less accurate than the pass/fail prediction.  What about just guessing if they will pass math?

In [None]:
y4 = modelDF.apply(lambda x: passOrFail(x['math score']), axis = 1)

features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'reading score', 'writing score']
X = modelDF[features].copy()


X_train, X_test, y_train, y_test = train_test_split(X, y4, test_size=0.25, random_state=12)

math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)


### Another approach
Let's try one-hot encoding.

In [None]:
secondDF = records.copy()
secondDF.head()

In [None]:
secondDF = pd.get_dummies(secondDF)

In [None]:
secondDF.head()

In [None]:
secondDF.columns
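
For reference, `pd.get_dummies` replaces each text column with one indicator column per category, named `<column>_<value>`, and leaves numeric columns alone. A tiny sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],
    'math score': [72, 47, 90],
})

# Only 'lunch' is expanded; 'math score' passes through unchanged
encoded = pd.get_dummies(toy, columns=['lunch'])
print(list(encoded.columns))
```

This naming pattern is where feature names such as `lunch_free/reduced` come from.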

In [None]:
using = secondDF.copy()

In [None]:
features = ['gender_female', 'gender_male', 'race/ethnicity_group A',
       'race/ethnicity_group B', 'race/ethnicity_group C',
       'race/ethnicity_group D', 'race/ethnicity_group E',
       "parental level of education_associate's degree",
       "parental level of education_bachelor's degree",
       'parental level of education_high school',
       "parental level of education_master's degree",
       'parental level of education_some college',
       'parental level of education_some high school', 'lunch_free/reduced',
       'lunch_standard', 'test preparation course_completed',
       'test preparation course_none']

X = using[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

### Less accurate
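It's interesting that this made it worse.  Keep in mind that this run also uses `max_leaf_nodes=10` instead of the 35 used earlier and adds the parental level of education columns, so it is not a strictly like-for-like comparison.  Maybe guessing a pass/fail will work better with this method?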

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

### Also less accurate
But at least now we know it didn't work as well (again with a different setting: `max_leaf_nodes=10` here versus 3 before).  What if we want to know their average grade when we already have all the other data?  This should be very accurate, right?

### One-Hot Encoding

In [None]:
features = ['gender_female', 'gender_male', 'race/ethnicity_group A',
       'race/ethnicity_group B', 'race/ethnicity_group C',
       'race/ethnicity_group D', 'race/ethnicity_group E',
       "parental level of education_associate's degree",
       "parental level of education_bachelor's degree",
       'parental level of education_high school',
       "parental level of education_master's degree",
       'parental level of education_some college',
       'parental level of education_some high school', 'lunch_free/reduced',
       'lunch_standard', 'test preparation course_completed',
       'test preparation course_none', 'reading grade_A',
       'reading grade_B', 'reading grade_C', 'reading grade_D',
       'reading grade_F', 'writing grade_A', 'writing grade_B',
       'writing grade_C', 'writing grade_D', 'writing grade_F','math grade_A', 'math grade_B',
       'math grade_C', 'math grade_D', 'math grade_F' ]

X = using[features]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

### Standard

In [None]:
features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score', 'math high mark', 'reading high mark', 'writing high mark', 'all high mark']

X = modelDF[features].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)
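
When the target is derived from columns that are also features, the tree can be expected to lean almost entirely on those columns. `feature_importances_` is one way to check that. The sketch below uses synthetic data (`noise`, `score` and `passed` are hypothetical stand-ins, not the notebook's columns):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

noise = rng.integers(0, 2, size=300)      # stand-in for a background column
score = rng.integers(0, 101, size=300)    # stand-in for an exam score
passed = (score >= 60).astype(int)        # label derived directly from the score

X_toy = np.column_stack([noise, score])
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
tree.fit(X_toy, passed)

# Share of the impurity reduction attributed to each feature
importances = dict(zip(['noise', 'score'], tree.feature_importances_))
print(importances)
```

Running the same check on `math_classifier` with the feature list above would show how much of the accuracy comes from the score columns.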


### Lastly
Let's guess if they passed overall if we already knew the grades.

In [None]:
features = ['gender_female', 'gender_male', 'race/ethnicity_group A',
       'race/ethnicity_group B', 'race/ethnicity_group C',
       'race/ethnicity_group D', 'race/ethnicity_group E',
       "parental level of education_associate's degree",
       "parental level of education_bachelor's degree",
       'parental level of education_high school',
       "parental level of education_master's degree",
       'parental level of education_some college',
       'parental level of education_some high school', 'lunch_free/reduced',
       'lunch_standard', 'test preparation course_completed',
       'test preparation course_none', 'reading grade_A',
       'reading grade_B', 'reading grade_C', 'reading grade_D',
       'reading grade_F', 'writing grade_A', 'writing grade_B',
       'writing grade_C', 'writing grade_D', 'writing grade_F','math grade_A', 'math grade_B',
       'math grade_C', 'math grade_D', 'math grade_F' ]

X = using[features]

X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)

In [None]:
features = ['gender', 'race/ethnicity', 'lunch', 'test preparation course', 'math score','reading score', 'writing score', 'math high mark', 'reading high mark', 'writing high mark', 'all high mark']

X = modelDF[features].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.25, random_state=12)
math_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=337)
math_classifier.fit(X_train, y_train)

predictions = math_classifier.predict(X_test)
accuracy_score(y_true = y_test, y_pred = predictions)


# Final Notes
We could predict with moderate accuracy whether a student would pass or fail based on social and economic factors alone, but we could not predict their letter grade with any reasonable accuracy.  If we knew two of their grades, we could reasonably guess what their overall grade would be.  This could potentially help teachers decide which students need extra attention.

Someone's economic background may have a big impact on their grades.  What can we do as a community to help those in need?

# Thank you
Thank you to the following:

- [Kaggle](https://kaggle.com) for such a great platform.

- [Jakki](https://www.kaggle.com/spscientist/students-performance-in-exams?select=StudentsPerformance.csv) for the data he created.

- [Jack Vial](http://jackvial.com) for looking through my notebook and helping me look at it from other angles.

- [Dan Becker](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding) for his tutorial on One-Hot Encoding.

- [UC San Diego](https://www.edx.org/micromasters/uc-san-diegox-data-science) for their MicroMasters program in Data Science.