# Introduction

This data set contains scores from three exams together with a variety of personal, social, and economic factors that may influence them. My aim is to examine how these factors relate to students' grades and to answer the questions below.


* Are girls or boys more successful? Who performs better in which subject?
* Which race/ethnicity group scores highest in the exams?
* How strongly does the parents' level of education affect student performance?
* How effective is the test preparation course?
* Is lunch important for student performance?
* Which major factors contribute to test outcomes?
* What would be the best way to improve student scores on each test?

<font color = 'blue'>
Content: 

1. [Load and Check Data](#1)
1. [Variable Description](#2)
1. [Basic Data Analysis](#3)
    * [Value amounts in the features](#4)
    * [Grouping by some features](#5)
    * [Pivot Tables](#6)
    * [Filtering](#7)
1. [Data Visualization](#8)
    * [Correlation Between Math Scores -- Reading Scores -- Writing Scores](#9)
    * [Numerical Variables](#10)
    * [Categorical Variables](#11)
    * [Gender -- All Scores](#12)
    * [Gender -- Race -- All Scores](#13)
    * [Parental Level of Education -- All Scores](#14)
    * [Test Preparation Course -- All Scores](#15)
    * [Lunch -- All Scores](#16)
    * [All Factors -- All Scores in 3D Scatter Plot](#17)
1. [Conclusion](#18)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
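# The seaborn style sheets were renamed in matplotlib 3.6; fall back to the old name on earlier versions
try:
    plt.style.use("seaborn-v0_8-whitegrid")
except OSError:
    plt.style.use("seaborn-whitegrid")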
import seaborn as sns 
#import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import plot

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a><br>
# Load and Check Data

This data set:
- Is 2-dimensional
- Contains 1000 rows (one per student) and 8 columns, i.e. 8000 values in total

In [None]:
#Load data
spDataRaw = pd.read_csv("/kaggle/input/students-performance-in-exams/StudentsPerformance.csv")
spData = spDataRaw.copy()

In [None]:
spData.ndim

In [None]:
spData.shape

In [None]:
spData.size

In [None]:
spData.columns

In [None]:
spData.columns = spData.columns.str.replace(" ","_")
spData.rename(columns={'race/ethnicity':'race'}, inplace=True)
spData.columns

In [None]:
#Observations in the first 5 rows of the data set
spData.head()

In [None]:
#Observations in the last 5 rows of the data set
spData.tail()

In [None]:
#5 randomly sampled rows from the data set
spData.sample(5)

If we examine some basic statistics of the numerical values in the dataset, we can say:
* In math, the lowest score is 0, the highest score is 100, and the average is 66.089
* In reading, the lowest score is 17, the highest score is 100, and the average is 69.169
* In writing, the lowest score is 10, the highest score is 100, and the average is 68.054

In [None]:
spData.describe().T

* There are no missing values in this dataset.

In [None]:
spData.isnull().sum().sum()

<a id = "2"></a><br>
# Variable Description
* gender: Students' gender (male / female)
* race/ethnicity: Ethnic groups of students (group A / group B / group C / group D / group E)
* parental level of education: Highest education level of the students' parents (master's degree / bachelor's degree / some high school / high school / associate's degree / some college)
* lunch: Type of lunch the student receives (standard / free/reduced)
* test preparation course: Whether the student completed the test preparation course (completed / none)
* math score: Score the student obtained in the math exam
* reading score: Score the student obtained in the reading exam
* writing score: Score the student obtained in the writing exam

* int64(3): math score, reading score, writing score
* object(5): gender, race/ethnicity, parental level of education, lunch, test preparation course
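
The split between numeric and text columns listed above can also be derived from the data types instead of being typed by hand. The sketch below is a minimal illustration, assuming a helper named `split_columns` (hypothetical, not part of the original notebook) and a small made-up frame standing in for `spData`.

```python
import pandas as pd


def split_columns(df: pd.DataFrame):
    """Return (numeric column names, non-numeric column names)."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


# Made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math_score": [72, 47],
})

numeric_cols, categorical_cols = split_columns(toy)
print(numeric_cols)      # ['math_score']
print(categorical_cols)  # ['gender', 'lunch']
```

On the notebook's frame this would be called as `split_columns(spData)`.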

In [None]:
spData.info()

<a id = "3"></a><br>
# Basic Data Analysis
1. Value amounts in the features
2. Grouping by some features 
3. Pivot Tables
4. Filtering

<a id = "4"></a><br>
## Value amounts in the features

In [None]:
spData.gender.value_counts()

In [None]:
spData.race.value_counts()

In [None]:
spData.parental_level_of_education.value_counts()

In [None]:
spData.lunch.value_counts()

In [None]:
spData.test_preparation_course.value_counts()

In [None]:
spData.math_score.value_counts()

In [None]:
spData.reading_score.value_counts()

In [None]:
spData.writing_score.value_counts()

<a id = "5"></a><br>
## Grouping by some features

In [None]:
#Average of each score by gender (numeric_only skips the text columns)
spData.groupby("gender").mean(numeric_only=True)

In [None]:
#Average of each score by gender and race (numeric_only skips the text columns)
spData.groupby(["gender","race"]).mean(numeric_only=True)

In [None]:
#Average of each score by parental level of education, sorted by math score
spData.groupby(["parental_level_of_education"]).mean(numeric_only=True).sort_values(by ="math_score", ascending = False)

In [None]:
#Average of each score by gender and lunch (numeric_only skips the text columns)
spData.groupby(["gender","lunch"]).mean(numeric_only=True)

In [None]:
#Average of each score by gender and test preparation course (numeric_only skips the text columns)
spData.groupby(["gender","test_preparation_course"]).mean(numeric_only=True)

According to the above results:
* Students who completed the test preparation course score higher on average.
* Students with a standard lunch score noticeably higher on average than students with a free/reduced lunch.
* Children of more highly educated parents tend to score higher.
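
To put a number on statements like these, the difference in group means can be computed directly. The following is a minimal sketch, assuming a hypothetical helper `mean_gap` and made-up rows in place of `spData`; it only measures the gap between two group averages and says nothing about cause.

```python
import pandas as pd

SCORE_COLS = ["math_score", "reading_score", "writing_score"]


def mean_gap(df: pd.DataFrame, factor: str, high: str, low: str) -> pd.Series:
    """Mean score of group `high` minus mean score of group `low`, per exam."""
    means = df.groupby(factor)[SCORE_COLS].mean()
    return means.loc[high] - means.loc[low]


# Made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "test_preparation_course": ["completed", "completed", "none", "none"],
    "math_score": [70, 80, 60, 70],
    "reading_score": [75, 85, 65, 70],
    "writing_score": [80, 80, 60, 70],
})

gap = mean_gap(toy, "test_preparation_course", "completed", "none")
print(gap)  # math 10.0, reading 12.5, writing 15.0 for the toy rows
```

With the notebook's data the same call would be, for example, `mean_gap(spData, "lunch", "standard", "free/reduced")`.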

<a id = "6"></a><br>
## Pivot Tables

In [None]:
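#Average mathematics, reading and writing scores by gender
#The score columns are selected explicitly so the text columns are not averaged
pd.pivot_table(spData, index = "gender", values = ["math_score", "reading_score", "writing_score"])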

In [None]:
#Average math scores by gender and race
spData.pivot_table("math_score", index = "gender", columns = "race")

In [None]:
#Average reading scores by gender and race
spData.pivot_table("reading_score", index = "gender", columns = "race")

In [None]:
#Average writing scores by gender and race
spData.pivot_table("writing_score", index = "gender", columns = "race")

According to the above results:
* In mathematics, boys score higher than girls on average, while in reading and writing girls score higher than boys.
* For both genders and in all three exams, the ethnic groups rank by average score as follows: group E > group D > group C > group B > group A
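
A ranking claim like the one above can be checked mechanically instead of by reading the pivot tables. The sketch below is one way to do that, assuming two hypothetical helpers (`rank_groups`, `same_order_everywhere`) and a made-up frame in place of `spData`.

```python
import pandas as pd


def rank_groups(df: pd.DataFrame, group_col: str, score_col: str) -> list:
    """Group labels ordered from highest to lowest mean score."""
    means = df.groupby(group_col)[score_col].mean()
    return means.sort_values(ascending=False).index.tolist()


def same_order_everywhere(df, group_col, split_col, score_cols) -> bool:
    """True if the ranking is identical for every split value and every exam."""
    orders = set()
    for _, part in df.groupby(split_col):
        for col in score_cols:
            orders.add(tuple(rank_groups(part, group_col, col)))
    return len(orders) == 1


# Made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "race": ["group A", "group B", "group A", "group B"],
    "math_score": [50, 60, 55, 65],
    "reading_score": [60, 70, 50, 58],
})

print(rank_groups(toy, "race", "math_score"))  # ['group B', 'group A']
print(same_order_everywhere(toy, "race", "gender", ["math_score", "reading_score"]))  # True
```

On the real data, `same_order_everywhere(spData, "race", "gender", ["math_score", "reading_score", "writing_score"])` would return `False` if any gender and exam combination breaks the stated order.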

<a id = "7"></a><br>
## Filtering

In [None]:
#Students whose math score is higher than the average of math score are filtered below.
spData[spData.math_score > spData.math_score.mean()]

In [None]:
#Students whose reading score is higher than the average of reading score are filtered below.
spData[spData.reading_score > spData.reading_score.mean()]

In [None]:
#Students whose writing score is higher than the average of writing score are filtered below.
spData[spData.writing_score > spData.writing_score.mean()]

According to the above results:
* While 49.3% of students achieved an above average score in mathematics, 50.7% remained below this average.
* While 51.3% of students achieved an above average score in reading, 48.7% remained below this average.
* While 51.2% of students achieved an above average score in writing, 48.8% remained below this average.
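
The filtering cells above return the matching rows but do not print the percentages themselves. A minimal sketch of that calculation follows, assuming a hypothetical helper `share_above_mean` and a made-up column in place of the real scores.

```python
import pandas as pd


def share_above_mean(scores: pd.Series) -> float:
    """Percentage of values strictly above the mean of the series."""
    # The comparison yields booleans; their mean is the fraction of True values
    return float((scores > scores.mean()).mean() * 100)


# Made-up scores for illustration only; not taken from the dataset
toy = pd.DataFrame({"math_score": [40, 60, 70, 90]})

# The mean is 65, and two of the four values exceed it
print(share_above_mean(toy["math_score"]))  # 50.0
```

For the notebook's data the call would be `share_above_mean(spData.math_score)`, and likewise for the reading and writing scores.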

<a id = "8"></a><br>
# Data Visualization

<a id = "9"></a><br>
## Correlation Between Math Scores -- Reading Scores -- Writing Scores

* The heatmap below shows a high positive correlation between all three scores.

In [None]:
#Correlation Map (only the numeric score columns can be correlated)
f,ax = plt.subplots(figsize = (9,5))
sns.heatmap(spData.select_dtypes("number").corr(), annot = True, linewidths = .5, fmt = '.1f', ax = ax)
plt.show()
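
Each cell of the heatmap is a Pearson correlation coefficient, which is the pandas default for `corr`. As a rough illustration of what that number measures, the sketch below computes it by hand and compares it with pandas; the `pearson` helper and the toy rows are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def pearson(x, y) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables on their means
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "reading_score": [50, 60, 70, 80],
    "writing_score": [48, 61, 69, 82],
})

r_manual = pearson(toy["reading_score"], toy["writing_score"])
r_pandas = toy["reading_score"].corr(toy["writing_score"])
print(round(r_manual, 3), round(r_pandas, 3))  # the two values should agree
```

A value near 1 means the two scores rise together almost linearly, and a value near -1 means one falls as the other rises.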

<a id = "10"></a><br>
## Numerical Variables
The histograms below show how the observations are distributed within each numerical variable.

In [None]:
#Histogram
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(spData[variable])
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()
    
numericVars = ["math_score", "reading_score", "writing_score"]

for i in numericVars:
    plot_hist(i)

<a id = "11"></a><br>
## Categorical Variables
* The bar plots below show the number of observations in each category of the categorical variables.

In [None]:
#Barplot
def bar_plot(variable):
    """
        input: variable ex: "gender"
        output: bar plot & value count
    """
    #get feature
    var = spData[variable]
    #count the observations in each category
    varValue = var.value_counts()
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}:\n{}".format(variable,varValue))

categoricalVars = ["gender","race", "parental_level_of_education",
             "lunch", "test_preparation_course"]

for j in categoricalVars:
    bar_plot(j)

<a id = "12"></a><br>
## Gender -- All Scores

In [None]:
spRG = spData.groupby("gender", as_index = False)["reading_score"].mean().sort_values(by = "reading_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "gender", 
            y = "reading_score",  
            data = spRG)
plt.xlabel('Gender')
plt.ylabel('Average Reading Score')
plt.title('Average Reading Scores by Gender')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="gender", 
                 y="reading_score", 
                 data=spData)
plt.xlabel('Gender')
plt.ylabel('Reading Score')
plt.title('Reading Scores by Gender')
plt.show()

In [None]:
spWG = spData.groupby("gender", as_index = False)["writing_score"].mean().sort_values(by = "writing_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "gender", 
            y = "writing_score",  
            data = spWG)
plt.xlabel('Gender')
plt.ylabel('Average Writing Score')
plt.title('Average Writing Scores by Gender')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="gender", 
                 y="writing_score", 
                 data=spData)
plt.xlabel('Gender')
plt.ylabel('Writing Score')
plt.title('Writing Scores by Gender')
plt.show()

In [None]:
spMathG = spData.groupby("gender", as_index = False)["math_score"].mean().sort_values(by = "math_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "gender", 
            y = "math_score",  
            data = spMathG)
plt.xlabel('Gender')
plt.ylabel('Average Math Score')
plt.title('Average Math Scores by Gender')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="gender", 
                 y="math_score", 
                 data=spData)
plt.xlabel('Gender')
plt.ylabel('Math Score')
plt.title('Math Scores by Gender')
plt.show()

In [None]:
genderPivotData = pd.pivot_table(spData, index = "gender", values = ["math_score", "reading_score", "writing_score"])
genderPivotData.plot(kind = 'bar')
plt.xlabel('Gender')
plt.ylabel('Average Grades')
plt.title('Average Exam Grades by Gender')
plt.show()

<a id = "13"></a><br>
## Gender -- Race -- All Scores

In [None]:
avgScoresGRR = spData.groupby(["gender","race"], as_index = False)["reading_score"].mean().sort_values(by = "reading_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "reading_score", 
            y = "gender", 
            hue="race", 
            data = avgScoresGRR)
plt.xlabel('Average Reading Score')
plt.ylabel('Gender')
plt.title('Average Reading Scores by Gender and Race')
plt.show()

In [None]:
ax = sns.boxplot(x="reading_score", 
                 y="gender", 
                 hue="race",
                 data=spData)
plt.xlabel('Reading Score')
plt.ylabel('Gender')
plt.title('Reading Scores by Gender and Race')
plt.show()

In [None]:
avgScoresGRW = spData.groupby(["gender","race"], as_index = False)["writing_score"].mean().sort_values(by = "writing_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "writing_score", 
            y = "gender", 
            hue="race", 
            data = avgScoresGRW)
plt.xlabel('Average Writing Score')
plt.ylabel('Gender')
plt.title('Average Writing Scores by Gender and Race')
plt.show()

In [None]:
ax = sns.boxplot(x="writing_score", 
                 y="gender", 
                 hue="race",
                 data=spData)
plt.xlabel('Writing Score')
plt.ylabel('Gender')
plt.title('Writing Scores by Gender and Race')
plt.show()

In [None]:
avgScoresGRM = spData.groupby(["gender","race"], as_index = False)["math_score"].mean().sort_values(by = "math_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "math_score", 
            y = "gender", 
            hue="race", 
            data = avgScoresGRM)
plt.xlabel('Average Math Score')
plt.ylabel('Gender')
plt.title('Average Math Scores by Gender and Race')
plt.show()

In [None]:
ax = sns.boxplot(x="math_score", 
                 y="gender", 
                 hue="race",
                 data=spData)
plt.xlabel('Math Score')
plt.ylabel('Gender')
plt.title('Math Scores by Gender and Race')
plt.show()

In [None]:
racePivotData = pd.pivot_table(spData, index = "race", values = ["math_score", "reading_score", "writing_score"])
racePivotData.plot.barh()
plt.xlabel('Average Exam Score')
plt.ylabel('Race')
plt.title('Average Exam Scores by Race')
plt.show()

<a id = "14"></a><br>
## Parental Level of Education -- All Scores

In [None]:
parentMath = spData.groupby("parental_level_of_education", as_index = False)["math_score"].mean().sort_values(by = "math_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "math_score", 
            y = "parental_level_of_education",  
            data = parentMath)
plt.xlabel('Average Math Score')
plt.ylabel('Parental Level of Education')
plt.title('Average Math Scores by Parental Level of Education')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="math_score", 
                 y="parental_level_of_education", 
                 data=spData)
plt.xlabel('Math Score')
plt.ylabel('Parental Level of Education')
plt.title('Math Scores by Parental Level of Education')
plt.show()

In [None]:
parentWRT = spData.groupby("parental_level_of_education", as_index = False)["writing_score"].mean().sort_values(by = "writing_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "writing_score", 
            y = "parental_level_of_education",  
            data = parentWRT)
plt.xlabel('Average Writing Score')
plt.ylabel('Parental Level of Education')
plt.title('Average Writing Scores by Parental Level of Education')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="writing_score", 
                 y="parental_level_of_education", 
                 data=spData)
plt.xlabel('Writing Score')
plt.ylabel('Parental Level of Education')
plt.title('Writing Scores by Parental Level of Education')
plt.show()

In [None]:
parentRD = spData.groupby("parental_level_of_education", as_index = False)["reading_score"].mean().sort_values(by = "reading_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "reading_score", 
            y = "parental_level_of_education",  
            data = parentRD)
plt.xlabel('Average Reading Score')
plt.ylabel('Parental Level of Education')
plt.title('Average Reading Scores by Parental Level of Education')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="reading_score", 
                 y="parental_level_of_education", 
                 data=spData)
plt.xlabel('Reading Score')
plt.ylabel('Parental Level of Education')
plt.title('Reading Scores by Parental Level of Education')
plt.show()

In [None]:
parentalPivotData = pd.pivot_table(spData, index = "parental_level_of_education", values = ["math_score", "reading_score", "writing_score"])
parentalPivotData.plot.barh()
plt.xlabel('Exam Score')
plt.ylabel('Parental Level of Education')
plt.title('Average Exam Scores by Parental Level of Education')
plt.show()

<a id = "15"></a><br>
## Test Preparation Course -- All Scores

In [None]:
tpcRD = spData.groupby("test_preparation_course", as_index = False)["reading_score"].mean().sort_values(by = "reading_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "test_preparation_course", 
            y = "reading_score",  
            data = tpcRD)
plt.xlabel('Test Preparation Course')
plt.ylabel('Average Reading Score')
plt.title('Average Reading Scores by Test Preparation Course')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="test_preparation_course", 
                 y="reading_score", 
                 data=spData)
plt.xlabel('Test Preparation Course')
plt.ylabel('Reading Score')
plt.title('Reading Scores by Test Preparation Course')
plt.show()

In [None]:
tpcWRT = spData.groupby("test_preparation_course", as_index = False)["writing_score"].mean().sort_values(by = "writing_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "test_preparation_course", 
            y = "writing_score",  
            data = tpcWRT)
plt.xlabel('Test Preparation Course')
plt.ylabel('Average Writing Score')
plt.title('Average Writing Scores by Test Preparation Course')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="test_preparation_course", 
                 y="writing_score", 
                 data=spData)
plt.xlabel('Test Preparation Course')
plt.ylabel('Writing Score')
plt.title('Writing Scores by Test Preparation Course')
plt.show()

In [None]:
tpcMath = spData.groupby("test_preparation_course", as_index = False)["math_score"].mean().sort_values(by = "math_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "test_preparation_course", 
            y = "math_score",  
            data = tpcMath)
plt.xlabel('Test Preparation Course')
plt.ylabel('Average Math Score')
plt.title('Average Math Scores by Test Preparation Course')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="test_preparation_course", 
                 y="math_score", 
                 data=spData)
plt.xlabel('Test Preparation Course')
plt.ylabel('Math Score')
plt.title('Math Scores by Test Preparation Course')
plt.show()

In [None]:
testPivotData = pd.pivot_table(spData, index = "test_preparation_course", values = ["math_score", "reading_score", "writing_score"])
testPivotData.plot.bar()
plt.xlabel('Test Preparation Course')
plt.ylabel('Exam Score')
plt.title('Average Exam Scores by Test Preparation Course')
plt.show()

<a id = "16"></a><br>
## Lunch -- All Scores

In [None]:
lunchRD = spData.groupby("lunch", as_index = False)["reading_score"].mean().sort_values(by = "reading_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "lunch", 
            y = "reading_score",  
            data = lunchRD)
plt.xlabel('Lunch')
plt.ylabel('Average Reading Score')
plt.title('Average Reading Scores by Lunch')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="lunch", 
                 y="reading_score", 
                 data=spData)
plt.xlabel('Lunch')
plt.ylabel('Reading Score')
plt.title('Reading Scores by Lunch')
plt.show()

In [None]:
lunchWRT = spData.groupby("lunch", as_index = False)["writing_score"].mean().sort_values(by = "writing_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "lunch", 
            y = "writing_score",  
            data = lunchWRT)
plt.xlabel('Lunch')
plt.ylabel('Average Writing Score')
plt.title('Average Writing Scores by Lunch')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="lunch", 
                 y="writing_score", 
                 data=spData)
plt.xlabel('Lunch')
plt.ylabel('Writing Score')
plt.title('Writing Scores by Lunch')
plt.show()

In [None]:
lunchMath = spData.groupby("lunch", as_index = False)["math_score"].mean().sort_values(by = "math_score", ascending = False)
plt.figure(figsize=(8,5))
sns.barplot(x = "lunch", 
            y = "math_score",  
            data = lunchMath)
plt.xlabel('Lunch')
plt.ylabel('Average Math Score')
plt.title('Average Math Scores by Lunch')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
ax = sns.boxplot(x="lunch", 
                 y="math_score", 
                 data=spData)
plt.xlabel('Lunch')
plt.ylabel('Math Score')
plt.title('Math Scores by Lunch')
plt.show()

In [None]:
lunchPivotData = pd.pivot_table(spData, index = "lunch", values = ["math_score", "reading_score", "writing_score"])
lunchPivotData.plot.bar()
plt.xlabel('Lunch')
plt.ylabel('Exam Score')
plt.title('Average Exam Scores by Lunch')
plt.show()

<a id = "17"></a><br>
## All Factors -- All Scores in 3D Scatter Plot

In [None]:
fig = px.scatter_3d(spData, 
                    x='reading_score', 
                    y='writing_score', 
                    z='math_score',
                    color='gender')
fig.show()

In [None]:
fig = px.scatter_3d(spData, 
                    x='reading_score', 
                    y='writing_score', 
                    z='math_score',
                    color='race')              
fig.show()

In [None]:
fig = px.scatter_3d(spData, 
                    x='reading_score', 
                    y='writing_score', 
                    z='math_score',
                    color='parental_level_of_education')
fig.show()

In [None]:
fig = px.scatter_3d(spData, 
                    x='reading_score', 
                    y='writing_score', 
                    z='math_score',
                    color='lunch')
fig.show()
    

In [None]:
fig = px.scatter_3d(spData, 
                    x='reading_score', 
                    y='writing_score', 
                    z='math_score',
                    color='test_preparation_course')
fig.show()

<a id = "18"></a><br>
# Conclusion

In summary, having a standard lunch and completing the test preparation course are linked to larger differences in test results than the other factors. In addition, children whose parents hold a master's or bachelor's degree scored higher in the exams than children of other families. Students in ethnic group E had the highest average scores in every comparison, so it is possible that many of the students with the favorable conditions above also belong to group E. Male students scored higher in mathematics, while female students scored higher in reading and writing; however, the data do not explain why. In short, the best-performing profile in this dataset is a student whose parents hold a master's degree, who belongs to group E, who has a standard lunch, and who completed the test preparation course.
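
The "best combination" described above can be inspected by filtering on several conditions at once and looking at how many students match and how they score. The sketch below is a minimal illustration, assuming a hypothetical helper `profile_summary` and made-up rows in place of `spData`.

```python
import pandas as pd


def profile_summary(df: pd.DataFrame, conditions: dict, score_cols: list):
    """Return (number of matching rows, mean scores of those rows)."""
    mask = pd.Series(True, index=df.index)
    for col, value in conditions.items():
        mask &= (df[col] == value)  # keep rows that satisfy every condition
    subset = df[mask]
    return len(subset), subset[score_cols].mean()


# Made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["completed", "none", "completed", "completed"],
    "math_score": [80, 70, 65, 90],
})

count, means = profile_summary(
    toy,
    {"lunch": "standard", "test_preparation_course": "completed"},
    ["math_score"],
)
print(count)                # 2
print(means["math_score"])  # 85.0
```

On the real data, a small `count` would be a reminder that the average for a very specific profile rests on only a few students.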