# Introduction

### **Content:**
<font color = 'black'>
    

1. [Loading Libraries and Data](#1)    
    * [Information About Features](#3)   
1. [Checking Missing Values](#4)
1. [Data Preparation](#5)  
    * [Grading System](#6)
    * [Data](#7)   
1. [Data Visualization](#8)  
    * [Gender Distribution](#9)
    * [Gender vs Passing Grade](#10)
    * [Gender vs Exam Grade](#11)
    * [Gender vs Exam Grade With Test Preparation Course](#12)
    * [Lunch vs Exam Grade With Test Preparation Course](#13)
    * [Passing Score vs Parental Level Of Education](#14)
    * [Race/Ethnicity Distribution](#15)
    * [Race/Ethnicity vs Passing Score Distribution](#16)

<a id='1'></a><br>
# Loading Libraries and Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

sns.set(style='whitegrid')

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df

In [None]:
df.describe().T

<a id='3'></a><br>
## Information About Features

Column Name                  | Description
-----------------------------|--------------------------
gender                       | Male/Female 
race/ethnicity               | group A, group B, group C... 
parental level of education  | parental education details from high school to master's degree 
lunch                        | selected type of lunch
test preparation course      | test preparation course was completed by the student or not
math score                   | specifies score in math 
reading score                | specifies score in reading 
writing score                | specifies score in writing

---

<a id='4'></a><br>
# Checking Missing Values

In [None]:
df.isnull().sum()

> As seen above, there are no missing (NaN) values in this data.

---

<a id='5'></a><br>
# Data Preparation

> First, let's average the three exam scores of each student into a new column named `passing score`.

In [None]:
df['passing score'] = ((df['math score'] + df['reading score'] + df['writing score']) / 3).round(4)

In [None]:
df.head(3)
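
The same average can also be written with `DataFrame.mean(axis=1)`, which avoids spelling out the sum and the divisor. This is a minimal sketch on a hypothetical two-row frame; the `toy` data and the `score_cols` name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
toy = pd.DataFrame({
    "math score":    [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

score_cols = ["math score", "reading score", "writing score"]

# Row-wise mean of the three exams, rounded like the original column.
toy["passing score"] = toy[score_cols].mean(axis=1).round(4)
print(toy["passing score"].tolist())
```

Listing the columns once in `score_cols` keeps the divisor in sync if another exam column is ever added.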

> Now let's create a grading system corresponding to these scores.

<a id='6'></a><br>
### Grading System
        
    >= 95	S	Excellent
    >= 80	A	Very Good
    >= 70	B	Good
    >= 60	C	Average
    >= 50	D	Sufficient
    >= 40	E	Passable
     < 40	F	Fail

In [None]:
grades = ['S', 'A', 'B', 'C', 'D', 'E', 'F']

def Grade(grade):
    """Map a numeric score to its letter grade using the thresholds above."""
    if grade >= 95: return 'S'
    if grade >= 80: return 'A'
    if grade >= 70: return 'B'
    if grade >= 60: return 'C'
    if grade >= 50: return 'D'
    if grade >= 40: return 'E'
    return 'F'

# Grade each exam score and the averaged passing score.
df['math grade']    = df['math score'].apply(Grade)
df['reading grade'] = df['reading score'].apply(Grade)
df['writing grade'] = df['writing score'].apply(Grade)
df['passing grade'] = df['passing score'].apply(Grade)
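
As an alternative to the threshold function, the grading table can be expressed as bins and passed to `pandas.cut`. The sketch below is one possible vectorised version, not the method used in this notebook; `bins`, `labels`, `grade_series` and `sample` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirror the grading table; right=False makes each bin
# closed on the left, so a score of exactly 80 falls in [80, 95) and gets 'A'.
bins = [-np.inf, 40, 50, 60, 70, 80, 95, np.inf]
labels = ["F", "E", "D", "C", "B", "A", "S"]

def grade_series(scores):
    """Assign a letter grade to every score in a Series in one call."""
    return pd.cut(scores, bins=bins, labels=labels, right=False).astype(str)

# Hypothetical scores chosen to sit on and around the thresholds.
sample = pd.Series([39.9, 40, 59.5, 70, 80, 94.9, 95, 100])
print(grade_series(sample).tolist())
```

Keeping the edges and labels in two lists makes the grading scheme easy to change in one place.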

<a id='7'></a><br>
### Let's look at the result.

In [None]:
df.head(10)

In [None]:
df["passing grade"].value_counts()

---

<a id='8'></a><br>
# Data Visualization

<a id='9'></a><br>
### Gender Distribution

In [None]:
values = df["gender"].value_counts()
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%\n{v:d}'.format(p=pct, v=val)
    return my_autopct

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 7))
ax.pie(values, labels=values.index.str.capitalize(), colors=['#F3CEE8','#B7D1F8'], autopct=make_autopct(values), startangle=90, explode=[0.05,0.05])
plt.title('Gender')
plt.axis('equal')
plt.show()

> * When we look at the gender distribution, the number of females is slightly higher than the number of males.
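
The wedge labels in the pie chart come from the `autopct` callback: Matplotlib passes each wedge's percentage to the function, and the closure scales it back to a count using the total. A small sketch of that round trip, using hypothetical counts (`[3, 1]`) rather than the real gender counts:

```python
def make_autopct(values):
    """Return a formatter that shows both the percentage and the raw count."""
    total = sum(values)

    def my_autopct(pct):
        # Matplotlib hands the callback a percentage; scale it back to a count.
        count = int(round(pct * total / 100.0))
        return "{p:.2f}%\n{v:d}".format(p=pct, v=count)

    return my_autopct

# Hypothetical counts: 3 of 4 items fall in the first wedge.
formatter = make_autopct([3, 1])
print(formatter(75.0))
```

Rounding before the `int` conversion matters here, because the percentage arrives as a float and may be a hair below the exact value.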

<a id='10'></a><br>
### Gender vs Passing Grade

In [None]:
sizes = df['passing grade'].value_counts().sort_index() / df['passing grade'].value_counts().sum() * 100
values = df['passing grade'].value_counts().sort_index().values
grades = ['A', 'B', 'C', 'D', 'E', 'F', 'S']

def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%\n{v:d}'.format(p=pct, v=val)
    return my_autopct

inner_values = [99, 77, 150, 111, 130, 126, 97, 85, 50, 23, 16, 14, 19, 3]
outer_colors = ['#C7CEEA', '#B5EAD7', '#E2F0CB', '#FFDAC1', '#FFB7B2', '#FF9AA2', '#9884C7']
inner_colors = ['#F3CEE8','#B7D1F8'] * 7
outer_explode = np.ones(7)*0.01
inner_explode = np.ones(14)*0.01

# Plot
fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(sizes, labels=grades, colors=outer_colors,
        startangle=97.5, radius=1.7, autopct=make_autopct(values), pctdistance=0.88, explode=outer_explode,
        wedgeprops=dict(width=0.5, edgecolor='w'))
ax.pie(inner_values, startangle=97.5, radius=1.2, colors=inner_colors, explode=inner_explode, wedgeprops=dict(width=0.5, edgecolor='w'))

ax.set(aspect="equal")
# Legend entries follow the wedge order, so reuse the sorted `grades` list.
fig.legend(bbox_to_anchor=(1.25, 0.7), labels=grades, loc="upper center", ncol=1, title="Grades")
plt.show()
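
The counts for the inner ring are typed in by hand in the cell above. Assuming each pair is meant to be the per-gender count within one grade, the same kind of list can be derived from the data with `pandas.crosstab`. This is a sketch on hypothetical rows (`toy`), not a reproduction of the numbers used in the plot.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "passing grade": ["A", "A", "B", "S", "B", "A"],
    "gender": ["female", "male", "female", "female", "male", "female"],
})

# Rows: grades in sorted order; columns: genders in sorted order.
counts = pd.crosstab(toy["passing grade"], toy["gender"])

# Flatten row by row: [A female, A male, B female, B male, S female, S male],
# which lines up with an outer ring drawn in sorted grade order.
inner_values = counts.to_numpy().ravel().tolist()
print(inner_values)
```

Deriving the list this way keeps the inner ring consistent with the outer ring if the grading thresholds change.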

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
sns.countplot(y="passing grade", hue="gender", data=df, order=["S","A","B","C","D","E","F"], palette=['#F3CEE8','#B7D1F8'])
ax.legend(loc='upper right',frameon=True)
plt.show()

> * In the S class, which we classify as the most successful, females heavily outnumber males.
> * In classes A and B, females are in the majority.
> * In class E, there are more males than females.

<a id='11'></a><br>
### Gender vs Exam Grade

In [None]:
colors = ['#9884C7', '#C7CEEA', '#B5EAD7', '#E2F0CB', '#FFDAC1', '#FFB7B2', '#FF9AA2']
exam_grade = ['math grade', 'writing grade', 'reading grade']
grades = ['S', 'A', 'B', 'C', 'D', 'E', 'F']

def plot_func(df, i):
    sns.countplot(ax=axes[i], x=df['gender'], hue=df[exam_grade[i]], hue_order=grades, palette=colors)
    axes[i].set_xlabel("{} by gender".format(exam_grade[i]))
    axes[i].legend([],[], frameon=False)
    axes[i].set_ylabel(" ")
    axes[i].set_ylim(0, 160)
# Plot
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 8), sharey=True)
for i in range(3):
    plot_func(df, i)
axes[0].set_ylabel("Count")
# Add legend
fig.legend(bbox_to_anchor=(0.5, 1.0), labels=grades, loc="upper center", ncol=7, title="Grades")
plt.xticks(rotation=0)
plt.show()

> * We can see that females are more successful in writing and reading, and males in mathematics.
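
A numeric check can back up what the count plots suggest. The sketch below computes the mean score per gender with `groupby`; the `toy` frame is hypothetical and only illustrates the shape of the calculation, so the numbers are not results from the dataset.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [60, 70, 75, 65],
    "reading score": [80, 90, 70, 60],
    "writing score": [85, 75, 60, 70],
})

score_cols = ["math score", "reading score", "writing score"]

# One row per gender, one column per exam.
mean_by_gender = toy.groupby("gender")[score_cols].mean()
print(mean_by_gender)
```

Running the same `groupby` on `df` gives a compact table to read alongside the three panels.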

<a id='12'></a><br>
### Gender vs Exam Grade With Test Preparation Course

In [None]:
colors = ['#9884C7', '#C7CEEA', '#B5EAD7', '#E2F0CB', '#FFDAC1', '#FFB7B2', '#FF9AA2']
exam_grade = ['math grade', 'writing grade', 'reading grade']
grades = ['S', 'A', 'B', 'C', 'D', 'E', 'F']

# Filtering Data
df_course_completed = df[df['test preparation course'] == 'completed']
df_course_none      = df[df['test preparation course'] == 'none'] 

def plot_func(df_comp, df_none, i):
    sns.countplot(ax=axes[i,0], x=df_comp['gender'], hue=df_comp[exam_grade[i]], hue_order=grades, palette=colors)
    sns.countplot(ax=axes[i,1], x=df_none['gender'], hue=df_none[exam_grade[i]], hue_order=grades, palette=colors)
    axes[i,0].set_xlabel("{} with course".format(exam_grade[i]))
    axes[i,1].set_xlabel("{} with no course".format(exam_grade[i]))
# Plot
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 15), sharey=False)
for i in range(3):
    plot_func(df_course_completed, df_course_none, i)
# Remove legends
for ax in axes:
    for i in range(2):
        ax[i].legend([],[], frameon=False)
        ax[i].set_ylabel(" ")
        ax[i].set_ylim(0, 100)  
# Add legend
fig.legend(bbox_to_anchor=(0.5, 0.95), labels=grades, loc="upper center", ncol=7, title="Grades")
plt.show()

>* Looking at the plots, we see that many students have not completed the preparation course.
>* Let's drop the gender split to see the results more clearly.

In [None]:
colors = ['#9884C7', '#C7CEEA', '#B5EAD7', '#E2F0CB', '#FFDAC1', '#FFB7B2', '#FF9AA2']
exam_type = ['math score', 'writing score', 'reading score']
exam_grade = ['math grade', 'writing grade', 'reading grade']
course_ = ['none', 'completed']

def plot_func(df, i, course_):
    # hue_order keeps the bar colors aligned with the grade legend added below.
    sns.countplot(ax=axes[i,0], x=df['test preparation course'], hue=df[exam_grade[i]], hue_order=grades, palette=colors, order=course_)
    sns.distplot(df[df['test preparation course'] == course_[0]][exam_type[i]].values, color='#5E96AE', label='Course not Completed', ax=axes[i,1])
    sns.distplot(df[df['test preparation course'] == course_[1]][exam_type[i]].values, color='#E08963', label='Course Completed', ax=axes[i,1])

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 18), sharey=False, sharex=False)
for i in range(3):
    plot_func(df, i, course_)
    axes[i,0].set_xlabel("{} and course".format(exam_grade[i]))
    axes[i,1].set_xlabel("{}".format(exam_type[i]))
# Remove legends and set x,y limits
for ax in axes:
    for i in range(2):
        ax[i].legend([],[], frameon=False)
        ax[0].set_ylim(0, 180)  
        ax[1].set_ylim(0, 0.035)
        ax[1].set_xlim(0, 120)

# Add legends
axes[0,0].legend(bbox_to_anchor=(0.5, 1.25), labels=grades, loc="upper center", ncol=7, title="Grades")
axes[0,1].legend(bbox_to_anchor=(0.5, 1.25), labels=['none', 'completed'] , loc="upper center", ncol=2, title="Test Preparation Course")
plt.show()

> * As can be seen, the **success rate** of those who complete the preparation course is higher.
> * To see this more clearly, compare the two passing score distributions directly:

In [None]:
fig = plt.figure(figsize=(6,8))
sns.distplot(df[df['test preparation course'] == 'none']['passing score'].values, color='#5E96AE', label='Course not Completed')
sns.distplot(df[df['test preparation course'] == 'completed']['passing score'].values, color='#E08963', label='Course Completed')
fig.legend(bbox_to_anchor=(0.5, 1), labels=['Course not Completed','Course Completed'] , loc="upper center", ncol=2, title="Test Preparation Course")
plt.xlabel("passing score")
plt.ylim(0, 0.035)
plt.xlim(0, 120)
plt.show()

> * The plot shows the effect of the preparation course: the distribution for students who completed it is shifted toward higher passing scores.

<a id='13'></a><br>
### Lunch vs Exam Grade With Test Preparation Course

> * Next, we compare the average score in each exam by **lunch** type, split by test preparation course.

In [None]:
colors = ['#5E96AE', '#B2EBE0']
exam_type = ['math score', 'reading score', 'writing score', 'passing score']

def bar_plot(data, i):
    # Plot the frame passed in as `data` rather than the global df.
    sns.barplot(data=data, x='test preparation course', y=exam_type[i], hue='lunch', palette=colors, ax=axes[i])
    axes[i].set_title(exam_type[i])
    
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(15,6))
for i in range(4):
    bar_plot(df, i)
    axes[i].legend([],[], frameon=False)
    axes[i].set_ylabel(" ")
    axes[i].set_ylim(0, 90)
    
axes[1].legend(bbox_to_anchor=(1.7, 1.2), loc="center right", ncol=2, title="Lunch")
plt.show()

> * As you can see, the type of lunch is strongly associated with the exam score.
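
By default `sns.barplot` draws the mean of each group, so the bar heights can be reproduced as a table with `pivot_table`. A minimal sketch on hypothetical rows (`toy`), shown for one exam column only:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "test preparation course": ["none", "none", "completed", "completed", "none", "completed"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced", "standard", "free/reduced"],
    "math score": [70, 50, 80, 60, 74, 66],
})

# Rows: course status; columns: lunch type; cells: mean math score.
means = toy.pivot_table(
    index="test preparation course",
    columns="lunch",
    values="math score",
    aggfunc="mean",
)
print(means)
```

Reading the table row by row gives the same comparison as each pair of bars in one panel.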

<a id='14'></a><br>
### Passing Score vs Parental Level Of Education

In [None]:
fig, axes = plt.subplots(figsize=(12,6))
sns.countplot(data=df, x="parental level of education", hue="gender", palette=['#F3CEE8','#B7D1F8'])
axes.legend(loc='upper right',frameon=True)
axes.set_ylim(0, 140)
plt.show()

> * We can say that the gender distribution is almost equal except for the master's degree.

In [None]:
educ_type = ["high school", "some high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]
colors = ['#38908F', '#A02C2D', '#5E96AE', '#BC85A3', '#CA7E8D', '#9E6B55']

def dist_plot(data, i):
    sns.distplot(data, color=colors[i], ax=axes[i])

fig, axes = plt.subplots(nrows=1, ncols=6, figsize=(21,8), sharey=True)
for i in range(6):
    dist_plot(df[df['parental level of education'] == educ_type[i]]['passing score'].values, i)
    axes[i].set_xlabel(educ_type[i])
    axes[i].set_xticks([0,25,50,75,100,125])
    data_x, data_y = axes[i].lines[0].get_data()
    max_value = data_x[np.argmax(data_y)]
    axes[i].axvline(x = max_value, ls='--', alpha=0.75, color='#1B2631')
    axes[i].annotate(text=round(max_value, 2), xy=(max_value+5, 0.00025), xytext=(max_value+30, 0.0025), arrowprops=dict(facecolor='#1B2631', width=1.7, headwidth=6, headlength=7))

# Apply the same limits to every panel, not only the last one.
plt.setp(axes, xlim=(0, 120), ylim=(0, 0.035))
plt.show()

> **As you can see:**

> * Students whose parents hold a master's degree tend to have higher passing scores.
> * Students whose parents' education level is "high school" tend to have lower passing scores.
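
The dashed line in each panel marks the peak of the density curve, found by taking the x value where the curve is highest. The sketch below shows that lookup on a synthetic bell-shaped curve; `x`, `density` and `peak_x` are illustrative names, and the curve is not derived from the dataset.

```python
import numpy as np

# Synthetic bell-shaped curve standing in for a KDE line (illustration only).
x = np.linspace(0, 120, 241)                      # grid with step 0.5
density = np.exp(-0.5 * ((x - 70.0) / 12.0) ** 2)  # peak placed at x = 70

# Same lookup as in the plotting loop: x position of the highest point.
peak_x = x[np.argmax(density)]
print(peak_x)
```

The result is only as precise as the grid the curve was evaluated on, so the annotated value is an approximation of the mode.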

<a id='15'></a><br>
### Race/Ethnicity Distribution

In [None]:
sizes = df['race/ethnicity'].value_counts().sort_index() / df['race/ethnicity'].value_counts().sum() * 100
colors = ['#9884C7', '#B5EAD7', '#E2F0CB', '#FFDAC1', '#FFB7B2', '#FF9AA2']
ethnicity = ['group A', 'group B', 'group C', 'group D', 'group E']
values = df['race/ethnicity'].value_counts().sort_index().values
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%\n{v:d}'.format(p=pct, v=val)
    return my_autopct

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 7))
ax = plt.pie(sizes, labels=ethnicity, colors=colors, autopct= make_autopct(values), startangle=90)
plt.title('Race/Ethnicity Distribution', fontsize=15, fontweight='bold', y=1.1)
fig.legend(bbox_to_anchor=(1.1, 0.6), labels=ethnicity, loc="upper center", ncol=1, title="Race/Ethnicity")
plt.axis('equal')
plt.show()

> * The groups are far from evenly sized: group C is the largest (319 students) and group A the smallest (89).
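
Both the counts and the percentage shares behind a pie chart like this one can be taken from `value_counts`. A short sketch on a hypothetical Series (`groups`), kept in sorted group order so the two results line up:

```python
import pandas as pd

# Hypothetical group labels, for illustration only.
groups = pd.Series(
    ["group A", "group C", "group B", "group C", "group C"],
    name="race/ethnicity",
)

# Raw counts per group, in alphabetical group order.
counts = groups.value_counts().sort_index()

# normalize=True returns fractions; multiply by 100 for percentages.
shares = groups.value_counts(normalize=True).sort_index() * 100
print(counts.tolist(), shares.round(2).tolist())
```

Sorting both results by index keeps counts, shares and labels aligned when they are passed to the plot.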

<a id='16'></a><br>
### Race/Ethnicity vs Passing Score Distribution

In [None]:
sns.catplot(x="race/ethnicity", y="passing score", data=df, order=ethnicity, kind='violin', height=6, aspect=1.25, color="#B2EBE0", inner=None)
sns.boxenplot(x=df["race/ethnicity"], y=df["passing score"], color="#FFBFA3", width=0.1, order=ethnicity)
plt.show()

> **We can see:**

> * Group E scores higher on average than the other groups.
> * Group A scores lower on average than the other groups.

---

### Thanks for taking the time to review my work!