In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing Libraries 

In [None]:
%matplotlib inline
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings
warnings.filterwarnings("ignore")
sns.set_style('whitegrid')
np.random.seed(42)

# Introduction

This dataset is about the factors that affect students' scores in math, reading, and writing.

You can download the data from either of these sites: https://www.kaggle.com/spscientist/students-performance-in-exams or http://roycekimmons.com/tools/generated_data/exams

Features:
- "gender": Gender of the student (male or female)
- "race/ethnicity": Race/ethnicity group of the student
- "parental level of education": Highest level of education of the student's parents
- "lunch": The kind of lunch the student received
- "test preparation course": Whether the student completed a test preparation course
- "math score": Scores in Math 
- "reading score": Scores in Reading 
- "writing score": Scores in Writing

In this notebook we explore which factors affect students' scores in each subject.

## Loading the Data

In [None]:
df = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
df.head()

In [None]:
df.info()

There are no missing values, so we can proceed with the analysis.
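As a complement to reading the `df.info()` output, the missing-value check can be made explicit. The sketch below runs on a small made-up frame (`toy` is a stand-in, not the real data); the same two calls should work on `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, 47, 90],
    "reading score": [72, 57, 95],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single boolean summary for the whole frame.
has_missing = bool(toy.isna().any().any())
print("Any missing values:", has_missing)
```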

Let's check the unique values in parental level of education and in race/ethnicity.

In [None]:
df["parental level of education"].unique()

In [None]:
df["race/ethnicity"].unique()

There are six unique values in parental level of education. You could group these into broader categories, but for this analysis we leave the feature as it is. The race/ethnicity column has five unique values.
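If you do want to group the education levels, one possible approach is a plain dictionary mapping. This is only a sketch: the three broad categories and the `education_groups` mapping are hypothetical choices made for illustration, and the six labels are assumed to match what `unique()` returned above.

```python
import pandas as pd

# The six labels, assumed to match the output of unique() on the column.
education = pd.Series([
    "bachelor's degree", "some college", "master's degree",
    "associate's degree", "high school", "some high school",
])

# Hypothetical grouping, chosen only for illustration.
education_groups = {
    "some high school": "high school",
    "high school": "high school",
    "some college": "college",
    "associate's degree": "college",
    "bachelor's degree": "university",
    "master's degree": "university",
}

# map() returns NaN for any label missing from the dictionary,
# so a quick isna() check catches typos in the mapping.
grouped = education.map(education_groups)
print(grouped.value_counts())
```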

In [None]:
df[["math score", "reading score", "writing score"]].describe()

# Exploratory Data Analysis

In [None]:
passing_score = 75
df["MathScorePass"] = np.where(df["math score"] < passing_score,
                              "Fail",
                              "Pass")
df["ReadingScorePass"] = np.where(df["reading score"] < passing_score,
                                 "Fail",
                                 "Pass")
df["WritingScorePass"] = np.where(df["writing score"] < passing_score,
                                "Fail", 
                                "Pass")


I set the passing threshold to 75, based on the grading standard in our educational system. You can change `passing_score` to match your own passing grade.
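To make it easier to try other thresholds, the three `np.where` calls could be wrapped in a small function. The helper name `add_pass_columns` is hypothetical, and the scores in `toy` are made up; the labelling rule itself is the same as in the cell above.

```python
import numpy as np
import pandas as pd

def add_pass_columns(data, passing_score=75):
    """Return a copy of `data` with a Pass/Fail column for each subject."""
    out = data.copy()
    columns = {
        "math score": "MathScorePass",
        "reading score": "ReadingScorePass",
        "writing score": "WritingScorePass",
    }
    for score_col, pass_col in columns.items():
        # Scores strictly below the threshold fail; the threshold itself passes.
        out[pass_col] = np.where(out[score_col] < passing_score, "Fail", "Pass")
    return out

# Made-up scores for illustration only.
toy = pd.DataFrame({
    "math score": [74, 75, 60],
    "reading score": [80, 59, 61],
    "writing score": [75, 90, 40],
})

labelled_75 = add_pass_columns(toy)
labelled_60 = add_pass_columns(toy, passing_score=60)
print(labelled_75)
print(labelled_60)
```

Because the function works on a copy, the original frame is left untouched and several thresholds can be compared side by side.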

In [None]:
df.head()

In [None]:
def get_count_plot(x,
                  hue,
                  data,
                  palette,
                  title): 
    fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
    ax.tick_params(labelsize=16)
    ax.set_title(title)
    sns.countplot(x=x, hue=hue, data=data, palette=palette, ax=ax);    
    
def get_scatter_plot(x, 
                     y,
                    hue,
                    data,
                    palette,
                    title): 
    fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
    ax.tick_params(labelsize=16)
    ax.set_title(title)
    sns.stripplot(x=x, y=y, hue=hue, data=data, palette=palette, ax=ax, alpha=0.75)
    
def get_point_plot(x,
                  y,
                  hue,
                  data,
                  palette,
                  title,
                  markers=["o", "x"],
                  linestyle=["-", "--"],
                  ):
    fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
    ax.tick_params(labelsize=16)
    ax.set_title(title)
    # seaborn's pointplot expects the plural keyword `linestyles`.
    sns.pointplot(x=x, y=y, hue=hue, data=data, palette=palette, markers=markers, linestyles=linestyle, ax=ax)

We defined three plotting helpers so that we can visualize and better understand the data.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
ax.tick_params(labelsize=16)
ax.set_title("Gender Counts by Parental Level of Education")
sns.countplot(x="parental level of education",
             hue="gender",
             data=df,
             palette="pastel",
             ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45);

The number of samples for bachelor's degree and master's degree is low, which can affect the analysis later when we relate parental level of education to the exam scores.

One possible explanation is socioeconomic status: only a handful of people can afford a bachelor's or master's degree.

There are more females than males in most categories, except high school.


In [None]:
df[df['test preparation course'] == 'completed']

There were 358 students who completed the test preparation course.
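Instead of reading the count off the filtered frame, the number can be computed directly. The sketch below uses a five-row made-up column, so its counts are not the real ones; on `df` the same calls should report the count for each value of the column.

```python
import pandas as pd

# Made-up values for illustration; the real column has many more rows.
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "none", "completed"],
})

# Absolute counts per category.
counts = toy["test preparation course"].value_counts()

# The same breakdown as shares of the total.
shares = toy["test preparation course"].value_counts(normalize=True)

# A single number, if only the "completed" count is needed.
n_completed = int((toy["test preparation course"] == "completed").sum())

print(counts)
print(shares)
print("Completed:", n_completed)
```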

## Grade Based on Gender

In [None]:
get_count_plot(x="MathScorePass",
              hue="gender",
              data=df,
              palette="flare",
              title="Math Grade Based on Gender")
    

In [None]:
get_count_plot(x="ReadingScorePass",
              hue="gender",
              data=df,
              palette="ch:s=.25,rot=-.25",
              title="Reading Grade Based on Gender")

In [None]:
get_count_plot(x="WritingScorePass",
              hue="gender",
              data=df,
              palette="hls",
              title="Writing Grade Based on Gender")

In Math, most of the students who failed were female, which suggests that in this data set males performed better in Math than females.

In Reading and Writing, however, females outperformed males, which suggests that in this data set males did less well in those subjects.
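Count plots compare raw counts, so they can mislead when the two genders are not equally represented. A pass rate per gender is one way to double-check the reading of the plots. This is a sketch on made-up rows; only the column names are taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male"],
    "MathScorePass": ["Pass", "Fail", "Pass", "Pass", "Fail"],
})

# Turn Pass/Fail into booleans; the mean of a boolean column is the pass rate.
passed = toy["MathScorePass"] == "Pass"
pass_rate = passed.groupby(toy["gender"]).mean()
print(pass_rate)
```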

## Grade based on Race/Ethnicity

In [None]:
get_count_plot(x="MathScorePass",
              hue=df["race/ethnicity"].sort_values(ascending=True),
              data=df,
              palette="hls",
              title="Math Grade Based on Race")

In [None]:
get_count_plot(x="ReadingScorePass",
              hue=df["race/ethnicity"].sort_values(ascending=True),
              data=df,
              palette="ch:s=.25,rot=-.25",
              title="Reading Grade Based on Race")

In [None]:
get_count_plot(x="WritingScorePass",
              hue=df["race/ethnicity"].sort_values(ascending=True),
              data=df,
              palette="pastel",
              title="Writing Grade Based on Race")

In all of the subjects, Group C had the most students who failed, but it also had the most students who passed in Reading and Writing. My guess is that Group C contains more female students, since fewer Group C students passed in Math.

I also suspect that students whose parents hold a bachelor's or master's degree fall mostly into Group A and Group E, since those groups also have few samples.
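Both guesses above can be checked rather than inferred from the plots. A cross-tabulation is one way to do it; the sketch below uses made-up rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group C", "group C", "group E", "group C"],
    "gender": ["male", "female", "female", "male", "male"],
    "parental level of education": [
        "bachelor's degree", "some college", "high school",
        "master's degree", "some college",
    ],
})

# How many students of each gender are in each group?
gender_by_group = pd.crosstab(toy["race/ethnicity"], toy["gender"])

# How are parental education levels spread across the groups?
education_by_group = pd.crosstab(
    toy["race/ethnicity"], toy["parental level of education"]
)

print(gender_by_group)
print(education_by_group)
```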

## Grade Based on Parental Level of Education

In [None]:
get_count_plot(x="MathScorePass",
              hue=df["parental level of education"].sort_values(ascending=True),
              data=df,
              palette="hls",
              title="Math Grade Based on Parental Level of Education")

In [None]:
get_count_plot(x="ReadingScorePass",
              hue=df["parental level of education"].sort_values(ascending=True),
              data=df,
              palette="ch:s=.25,rot=-.25",
              title="Reading Grade Based on Parental Level of Education")

In [None]:
get_count_plot(x="WritingScorePass",
              hue=df["parental level of education"].sort_values(ascending=True),
              data=df,
              palette="pastel",
              title="Writing Grade Based on Parental Level of Education")

Students whose parents have an associate's degree or some college education account for the largest number of passes in every subject. We still can't conclude that these students do better than those whose parents have a bachelor's or master's degree, because those two categories have few samples.
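One way to work around the unequal category sizes is to compare pass rates within each category instead of raw counts. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows; the toy data is built so that two categories have the same number of passes but different pass rates.

```python
import pandas as pd

# Made-up rows: a small category (2 students) and a larger one (4 students).
toy = pd.DataFrame({
    "parental level of education": [
        "master's degree", "master's degree",
        "some college", "some college", "some college", "some college",
    ],
    "MathScorePass": ["Pass", "Fail", "Pass", "Fail", "Fail", "Fail"],
})

# Raw counts: both categories have exactly one pass.
counts = pd.crosstab(toy["parental level of education"], toy["MathScorePass"])

# Row-normalized shares: each row sums to 1, so categories become comparable.
rates = pd.crosstab(
    toy["parental level of education"],
    toy["MathScorePass"],
    normalize="index",
)

print(counts)
print(rates)
```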

## Grade Based on Lunch 

In [None]:
get_scatter_plot(x='MathScorePass', 
                 y='math score', 
                 hue='lunch',
                data=df,
                palette='flare',
                title='Math Grade Based on Lunch')

In [None]:
get_point_plot(x='MathScorePass',
              y='math score',
              hue='lunch',
              data=df,
              palette='flare',
              title='Math Grade Based on Lunch')

In [None]:
get_scatter_plot(x='ReadingScorePass', 
                 y='reading score', 
                 hue='lunch',
                data=df,
                palette='ch:s=.25,rot=-.25',
                title='Reading Grade Based on Lunch')

In [None]:
get_point_plot(x='ReadingScorePass',
              y='reading score',
              hue='lunch',
              data=df,
              palette='ch:s=.25,rot=-.25',
              title='Reading Grade Based on Lunch')

In [None]:
get_scatter_plot(x='WritingScorePass', 
                 y='writing score', 
                 hue='lunch',
                data=df,
                palette='hls',
                title='Writing Grade Based on Lunch')

In [None]:
get_point_plot(x='WritingScorePass',
              y='writing score',
              hue='lunch',
              data=df,
              palette='hls',
              title='Writing Grade Based on Lunch')

Students who received the standard lunch were more likely to pass in every subject.
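The plots can be backed up with a numeric summary per lunch type. This is a sketch on made-up scores; the lunch labels are assumed to match the values in the column.

```python
import pandas as pd

# Made-up scores for illustration only.
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "math score": [80, 70, 60, 50],
    "reading score": [85, 75, 65, 70],
    "writing score": [82, 78, 60, 66],
})

score_columns = ["math score", "reading score", "writing score"]

# Mean, median, and group size for every subject, split by lunch type.
summary = toy.groupby("lunch")[score_columns].agg(["mean", "median", "count"])
print(summary)
```

Including the count next to the mean makes it easy to see whether a difference rests on a reasonable number of students.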

## Grade Based on Test Preparation

In [None]:
get_count_plot(x='MathScorePass', 
               hue='test preparation course',
               data=df,
               palette='flare',
               title='Math Grade Based on Test Preparation')

In [None]:
get_point_plot(x='MathScorePass',
              y='math score',
              hue='test preparation course',
              data=df,
              palette='flare',
              title='Math Grade Based on Test Preparation')

In [None]:
get_count_plot(x='ReadingScorePass', 
               hue='test preparation course',
               data=df,
               palette='ch:s=.25,rot=-.25',
               title='Reading Grade Based on Test Preparation')

In [None]:
get_point_plot(x='ReadingScorePass',
              y='reading score',
              hue='test preparation course',
              data=df,
              palette='ch:s=.25,rot=-.25',
              title='Reading Grade Based on Test Preparation')

In [None]:
get_count_plot(x='WritingScorePass', 
               hue='test preparation course',
               data=df,
               palette='hls',
               title='Writing Grade Based on Test Preparation')

In [None]:
get_point_plot(x='WritingScorePass',
              y='writing score',
              hue='test preparation course',
              data=df,
              palette='hls',
              title='Writing Grade Based on Test Preparation')

The test preparation course appears to be effective only in Writing and Reading. In Math there is no significant effect, since most of the students who passed did not take the test preparation course.
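The word "significant" above comes from reading the plots, not from a statistical test. One simple way to check it, assuming nothing beyond NumPy, is a permutation test on the difference in mean scores. This technique is not part of the original analysis; the function `permutation_p_value` is a hypothetical helper and the scores are made up.

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=42):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up scores for illustration only.
completed = [78, 82, 75, 90, 85, 88, 79, 84]
none = [60, 65, 58, 70, 62, 66, 59, 64]

p_separated = permutation_p_value(completed, none)
p_identical = permutation_p_value(completed, completed)
print("Clearly separated groups:", p_separated)
print("Identical groups:", p_identical)
```

On the real data, the two inputs would be the score column filtered by each value of "test preparation course". A small p-value suggests the difference in means is unlikely to come from chance alone.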

# Summary of Analysis

- Males did better in Math, and females did better in Reading and Writing
- Group C in the race/ethnicity column has the most students who failed in every subject, but also the most students who passed in Reading and Writing
- The kind of lunch a student receives has an effect on exam performance
- The test preparation course appears effective only in Writing and Reading

Still, many other factors affect a student's performance in an exam: sleep quality, hours of study, study method, and more.

We need more information to better analyze a student's performance in an exam.