# Analyzing Student Performance
**This project analyzes students' performance and tries to identify the factors that may affect it. The analysis addresses the following questions:**
- Does gender affect the average score?
- Does the parents' level of education affect the students' performance?
- Is race related to students' performance?
- Do students who complete the test preparation course perform better?
- Does one gender outperform the other in particular subjects?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
data.head()

In [None]:
# Shorten the long column names for easier access
data = data.rename(columns={
    'race/ethnicity': 'race',
    'parental level of education': 'peducation',
    'test preparation course': 'tpc',
})

In [None]:
data.info()

In [None]:
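# Overall score: mean of the three subject scores, rounded to two decimals
data['percentage'] = round((data['math score'] + data['reading score'] + data['writing score']) / 3, 2)

def pass_or_fail(percentage):
    """Label a percentage as 'passed' (50 or above) or 'failed'."""
    return 'passed' if percentage >= 50 else 'failed'

data['pass/failed'] = data['percentage'].apply(pass_or_fail)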
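
The row-wise `apply` above can also be written as a vectorized expression. The following is a minimal sketch on a small made-up frame (the `toy` values are hypothetical and not taken from the dataset), assuming the same 50-point threshold:

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores, only to illustrate the computation
toy = pd.DataFrame({
    'math score':    [40, 90, 50],
    'reading score': [45, 80, 50],
    'writing score': [50, 70, 50],
})

score_cols = ['math score', 'reading score', 'writing score']

# Mean of the three subject scores, rounded to two decimals
toy['percentage'] = toy[score_cols].mean(axis=1).round(2)

# Vectorized equivalent of the row-wise pass/fail function:
# 50 or more counts as a pass
toy['pass/failed'] = np.where(toy['percentage'] >= 50, 'passed', 'failed')
```

On 1000 rows the difference in speed is negligible, so this is mainly a matter of style.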

# Data Summary

In [None]:
plt.figure(figsize=(18, 9))

# Outer ring: number of students per gender
gender = data.groupby('gender').size()
# Inner ring: passed/failed counts within each gender
pass_failed = data.pivot_table(index='gender', columns='pass/failed', values='percentage', aggfunc='count').stack()
cmap = plt.get_cmap("tab20")

plt.pie(gender, radius=1, colors=cmap([2, 0]),
        labels=gender.index, wedgeprops=dict(width=0.3, edgecolor='w'),
        autopct='%1.1f%%', pctdistance=0.83)

# Total number of students in the center of the donut
plt.annotate(str(gender.sum()), xy=(0, 0), xytext=(-.16, -.04), fontsize=25)

plt.pie(pass_failed, radius=1 - .3, colors=cmap([7, 5, 7, 5]),
        labels=pass_failed.index.get_level_values(1),
        wedgeprops=dict(width=0.3, edgecolor='w'), labeldistance=0.7,
        autopct='%1.1f%%', pctdistance=0.4)
plt.show()

**The chart above gives a quick summary of the data: the outer ring shows the share of male and female students, and the inner ring shows how many students of each gender passed or failed. The 1000 in the center is the total number of students (male and female).**
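
The numbers behind such a chart can also be tabulated directly. Below is a minimal sketch using `pd.crosstab` on a made-up frame (the `toy` records are hypothetical). Note that `normalize='index'` gives rates within each gender, whereas the inner ring of the chart shows percentages relative to all students.

```python
import pandas as pd

# Hypothetical toy records, not the real dataset
toy = pd.DataFrame({
    'gender':      ['female', 'female', 'female', 'male', 'male'],
    'pass/failed': ['passed', 'passed', 'failed', 'passed', 'failed'],
})

# Absolute counts per gender and outcome
counts = pd.crosstab(toy['gender'], toy['pass/failed'])

# Share of each outcome within each gender (every row sums to 1)
rates = pd.crosstab(toy['gender'], toy['pass/failed'], normalize='index')
```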

In [None]:
sns.displot(data['percentage'],kde=True,bins=range(0,101,5))

**This histogram shows the intervals in which the students' overall percentages are most concentrated.**

In [None]:
obj_cols = data.loc[:,data.dtypes==object].columns
dummies = pd.get_dummies(data[obj_cols])
data = pd.concat([data.drop(obj_cols,axis=1),dummies],axis=1)
data.head()

In [None]:
plt.figure(figsize=(10,7))
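# Absolute correlation of every column with the overall percentage
percentage_corr = data.corr().abs()['percentage']
# Keep only the columns whose correlation with percentage exceeds 0.1
most_relevant_cols = percentage_corr[percentage_corr > .1].index
corr = data[most_relevant_cols].corr().abs()

# Zero the diagonal so self-correlations do not dominate the color scale
for i in corr.columns:
    corr.loc[i, i] = 0

sns.heatmap(corr)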

**We looked for correlations between the columns of the data. As the heatmap above shows, apart from columns that are derived from one another (the three subject scores, `percentage`, and the pass/fail flag), the correlations are weak or absent.**
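
Because `percentage` is computed from the three subject scores, those columns correlate with it by construction. One way to look only at the remaining features is to drop the derived columns before ranking. This is a minimal sketch on a made-up frame shaped like the one-hot encoded data (all `toy` values are hypothetical; in the real frame the pass/fail dummy columns would need to be dropped as well):

```python
import pandas as pd

# Hypothetical toy frame in the shape produced by pd.get_dummies
toy = pd.DataFrame({
    'math score':    [40, 55, 60, 70, 85, 90],
    'reading score': [50, 50, 65, 75, 80, 95],
    'writing score': [45, 60, 60, 70, 90, 90],
    'gender_female': [0, 1, 0, 1, 0, 1],
    'tpc_completed': [0, 0, 1, 0, 1, 1],
})
score_cols = ['math score', 'reading score', 'writing score']
toy['percentage'] = toy[score_cols].mean(axis=1)

# Absolute correlation of every column with the overall percentage
corr_with_pct = toy.corr()['percentage'].abs()

# Drop the target itself and the columns it is computed from,
# then rank the remaining features
feature_corr = (corr_with_pct
                .drop(score_cols + ['percentage'])
                .sort_values(ascending=False))
```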

# Data Visualization

In [None]:
plt.figure(figsize=(15,7))
ax1=sns.histplot(data.loc[data['gender_female']==1,'percentage'],color='orange',label='female',bins=range(0,101,5))
ax2=sns.histplot(data.loc[data['gender_male']==1,'percentage'],label='male',bins=range(0,101,5))
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.legend()

**As the graph above shows, the average score for females is higher than the average score for males.**

In [None]:
plt.figure(figsize=(15,7))
parent_cols = data.columns[data.columns.str.contains('peducation')]
colors = ['green','blue','red','orange','yellow','black']
for i,c in zip(parent_cols,colors) :
    ax=sns.kdeplot(data.loc[data[i]==1,'percentage'],color=c,label=i)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    
ax.legend()

**In this graph, we looked for a relationship between the students' grades and the parents' educational level. It suggests that the higher the parents' educational level, the higher the students' grades tend to be.**
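
The trend can also be checked numerically by averaging `percentage` per education level. The sketch below is run on made-up values (the `toy` frame is hypothetical) and assumes the original `peducation` column before one-hot encoding. The ordering of the levels from lowest to highest is an assumption made here; the dataset itself does not define one.

```python
import pandas as pd

# Assumed ordering from lowest to highest attainment
# (an assumption for illustration, not defined by the dataset)
order = ['some high school', 'high school', 'some college',
         "associate's degree", "bachelor's degree", "master's degree"]

# Hypothetical toy records
toy = pd.DataFrame({
    'peducation': ['high school', "master's degree", 'some college',
                   'high school', "master's degree"],
    'percentage': [60.0, 80.0, 70.0, 64.0, 76.0],
})

# An ordered categorical makes groupby return the levels in the assumed order
toy['peducation'] = pd.Categorical(toy['peducation'], categories=order, ordered=True)

# Mean percentage per level, restricted to levels that actually occur
means = toy.groupby('peducation', observed=True)['percentage'].mean()
```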

In [None]:
plt.figure(figsize=(15,7))
race_cols = data.columns[data.columns.str.contains('race')]
colors = ['green','blue','red','orange','black']
# One density curve per race group
for i,c in zip(race_cols,colors) :
    ax=sns.kdeplot(data.loc[data[i]==1,'percentage'],color=c,label=i)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
ax.legend()

**The students are divided by race into groups A to E. The dataset does not explain what each group refers to, but some groups score higher than others. This does not, of course, mean that any race is better or smarter than another; it more likely reflects differences in circumstances and environment, such as standard of living and the parents' level of education.**

In [None]:
plt.figure(figsize=(15,7))
tpc_cols = data.columns[data.columns.str.contains('tpc')]
colors = ['green','red']
# One density curve per test preparation status
for i,c in zip(tpc_cols,colors) :
    ax=sns.kdeplot(data.loc[data[i]==1,'percentage'],color=c,label=i)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
ax.legend()

**This chart shows the relationship between the test preparation course and students' grades: students who completed the course got higher grades than those who did not.**
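
To put a number on how far apart two such groups are, one option is a standardized effect size. The sketch below computes Cohen's d with a pooled standard deviation. This measure is not part of the analysis above; the helper `cohens_d` and the sample values are hypothetical and shown only for illustration.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled std)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance, weighting each sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical percentages for the two groups
completed = [70, 80, 90]
not_completed = [60, 70, 80]

d = cohens_d(completed, not_completed)
```

With the real data, the two samples would be the `percentage` values of the rows where each `tpc` dummy column equals 1.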

In [None]:
fig = plt.figure(figsize=(17,7))
score_cols = ['math score', 'reading score', 'writing score']
# One box plot per subject, split by gender
for position, col in enumerate(score_cols, start=1):
    fig.add_subplot(1, 3, position)
    ax = sns.boxplot(data=data, x='gender_female', y=col)
    ax.set_xlabel('gender')
    # gender_female is 0 (False) for male and 1 (True) for female
    ax.set_xticklabels(['male','female'])
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

**The previous graphs showed that females have higher pass rates and average scores. Finally, we wanted to know whether females outperform males in every subject or whether males do better in some. The box plots show that males outperform females in math, while females do better in reading and writing.**
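
The same comparison can be expressed as a small table of per-subject means. This is a minimal sketch on made-up scores (the `toy` frame is hypothetical), assuming the one-hot `gender_female` column used above:

```python
import pandas as pd

# Hypothetical toy scores, not the real dataset
toy = pd.DataFrame({
    'gender_female': [1, 1, 0, 0],
    'math score':    [60, 70, 70, 80],
    'reading score': [80, 70, 60, 70],
    'writing score': [80, 76, 60, 64],
})
score_cols = ['math score', 'reading score', 'writing score']

# Mean score per subject for each gender
subject_means = toy.groupby('gender_female')[score_cols].mean()
subject_means.index = subject_means.index.map({0: 'male', 1: 'female'})

# Positive values mean females score higher in that subject
gap = subject_means.loc['female'] - subject_means.loc['male']
```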

# Conclusion
**The correlations between the columns are weak, with some indications that parental level of education, race, and the test preparation course may affect students' performance and grades. We also noticed that females have higher pass rates and average scores than males, while males do better in some subjects, such as math.**