# Do socio-cultural factors influence students' performance?
<img src="https://s35691.pcdn.co/wp-content/uploads/2020/08/embracing-culturally-responsive-teaching.jpg" alt="UnityinDiversity" style="width: 800px; height: 400px;" class="center"><br>
Social, economic, cultural, and psychological factors such as gender, race, caste, family background, and environment have been observed to affect the academic performance of students worldwide. Many students either fail at an early stage or face numerous hurdles on the way to their goals.
Let's look at an example dataset to understand the scenario.

In [None]:
import pandas as pd
import numpy as np
import collections
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Loading the dataset
> The dataset records the performance of 1000 students in *Math*, *Reading*, and *Writing*. It also documents their *gender*, *race/ethnicity*, *parental level of education*, *lunch* type, and *test preparation course* status.
In this notebook, we take a closer look at how these factors relate to the students' performance.

In [None]:
studentsPerformTable = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
display(studentsPerformTable.shape)
display(studentsPerformTable.head())

## Adding total and average score of every student as separate columns
> We shall use this information to assess the effect of the factors on the score of students

In [None]:
studentsPerformTable['total score'] = studentsPerformTable[['math score','reading score','writing score']].sum(axis=1)
studentsPerformTable['avg score'] = studentsPerformTable[['math score','reading score','writing score']].mean(axis=1)
studentsPerformTable.head(3)
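
As a quick sanity check on the two derived columns, the sketch below rebuilds them on a tiny hand-made table and confirms that the average is simply the total divided by the number of subjects. The `toy` table and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

score_cols = ['math score', 'reading score', 'writing score']

# Hypothetical rows, for illustration only (not from StudentsPerformance.csv)
toy = pd.DataFrame({
    'math score': [50, 55, 60],
    'reading score': [60, 65, 70],
    'writing score': [55, 75, 95],
})

# Same row-wise aggregation as used for studentsPerformTable
toy['total score'] = toy[score_cols].sum(axis=1)
toy['avg score'] = toy[score_cols].mean(axis=1)

# The average should equal the total divided by the number of subjects
max_gap = (toy['avg score'] - toy['total score'] / len(score_cols)).abs().max()
print(toy)
print('largest discrepancy:', max_gap)
```

The same check can be run on `studentsPerformTable` itself once the two columns have been added.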

## Looking at the percentage distribution of gender across the students
> Female students slightly outnumber male students, at *51.8%* versus *48.2%*.

In [None]:
plt.figure(figsize=(8,5))
sns.set(style="darkgrid", font_scale=1)
plt.tight_layout()
total = len(studentsPerformTable['gender'])
ax = sns.countplot(y="gender", data=studentsPerformTable,palette="Set2")
ax.set(ylabel='Gender', xlabel='Number of Students')
# Annotate each horizontal bar with its share of all students
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_width() / total)
    x = p.get_x() + p.get_width() + 0.02
    y = p.get_y() + p.get_height() / 2
    ax.annotate(percentage, (x, y))

sns.despine()
plt.title('Percentage distribution of genders',fontweight="bold",fontsize = 15)    
plt.show()
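
The percentages annotated on the bars can also be computed directly, without a plot. The following is a minimal sketch using `value_counts(normalize=True)`. The helper name `category_share` and the five-row sample are assumptions made for illustration.

```python
import pandas as pd

def category_share(series: pd.Series) -> pd.Series:
    """Return the percentage share of each category, rounded to one decimal."""
    return (series.value_counts(normalize=True) * 100).round(1)

# Hypothetical sample, not the real gender column
toy_gender = pd.Series(['female', 'male', 'female', 'female', 'male'], name='gender')

shares = category_share(toy_gender)
print(shares)
```

On the real data this would be called as `category_share(studentsPerformTable['gender'])`.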

## In which exam do we observe a fairly good performance?
> On average, students performed better in the *Reading* exam than in the other two exams.

In [None]:
sns.set(style="darkgrid", font_scale=1)
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharey=True)
fig.set_size_inches(10, 4)

plt.tight_layout()

sns.histplot(studentsPerformTable["math score"],ax=ax1)
ax1.set_xlabel('Math Score')

sns.histplot(studentsPerformTable["reading score"],ax=ax2)
ax2.set_xlabel('Reading Score')

sns.histplot(studentsPerformTable["writing score"],ax=ax3)
ax3.set_xlabel('Writing Score')

fig.subplots_adjust(wspace = 0.5)
plt.suptitle('Distribution of marks scored by students in different subjects',fontweight="bold",fontsize = 15,y=1.05)    
plt.show()
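
Histograms show the shape of each distribution, but a claim about which exam went best is easier to back with summary statistics. The sketch below is one possible way to tabulate them. The function name `subject_summary` and the toy scores are hypothetical.

```python
import pandas as pd

score_cols = ['math score', 'reading score', 'writing score']

def subject_summary(df: pd.DataFrame, cols) -> pd.DataFrame:
    """One row per subject with its mean, median and standard deviation."""
    return df[cols].agg(['mean', 'median', 'std']).T

# Hypothetical scores, for illustration only
toy = pd.DataFrame({
    'math score': [50, 60, 70],
    'reading score': [60, 70, 80],
    'writing score': [55, 65, 75],
})

summary = subject_summary(toy, score_cols)
best_subject = summary['mean'].idxmax()  # subject with the highest mean
print(summary)
print('highest mean:', best_subject)
```

Calling `subject_summary(studentsPerformTable, score_cols)` would produce the same table for the actual exam scores.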

## Reading and Writing go hand in hand: do we see any correlation in their scores?
> We do observe a strong positive correlation between the two scores.

In [None]:
sns.set_style('darkgrid')
ax = sns.jointplot(data=studentsPerformTable,x="writing score", y="reading score")  
plt.show()
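
The joint plot shows the relationship visually. To put a number on it, the Pearson correlation coefficient can be computed with `Series.corr`. This is a sketch on made-up values, so the coefficient printed here says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical paired scores, for illustration only
toy = pd.DataFrame({
    'reading score': [50, 60, 70, 80, 90],
    'writing score': [48, 61, 69, 83, 88],
})

# Pearson correlation is the default method of Series.corr
r = toy['reading score'].corr(toy['writing score'])
print(f'Pearson r = {r:.3f}')
```

For the notebook's data the equivalent call is `studentsPerformTable['reading score'].corr(studentsPerformTable['writing score'])`.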

## Who is in the lead - Girls or Boys?
> We observe that the girls outperform the boys in Reading and Writing, while they are a little behind in Math.

In [None]:
sns.set(style="darkgrid", font_scale=1)
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharey=True)
fig.set_size_inches(10, 6)

plt.tight_layout()

# Pass x and y as keywords: recent seaborn releases reject them as positional arguments
sns.violinplot(x="gender", y="math score", data=studentsPerformTable,
               palette='Set2', ax=ax1)
ax1.set_xlabel('Gender')
ax1.set_ylabel('Math Score')
# Rotate the tick labels seaborn already set instead of overwriting them
ax1.tick_params(axis='x', labelrotation=90)

sns.violinplot(x="gender", y="reading score", data=studentsPerformTable,
               palette='Set2', ax=ax2)
ax2.set_xlabel('Gender')
ax2.set_ylabel('Reading Score')
ax2.tick_params(axis='x', labelrotation=90)

sns.violinplot(x="gender", y="writing score", data=studentsPerformTable,
               palette='Set2', ax=ax3)
ax3.set_xlabel('Gender')
ax3.set_ylabel('Writing Score')
ax3.tick_params(axis='x', labelrotation=90)

fig.subplots_adjust(wspace = 0.5)
plt.suptitle('Distribution of marks secured by girls and boys in different subjects',fontweight="bold",fontsize = 15,y=1.05)
plt.show()
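
Violin plots make the comparison visible, and a grouped mean makes it explicit. The sketch below computes per-gender means and the female-minus-male gap for each subject. The four rows are invented for illustration and do not reproduce the dataset's figures.

```python
import pandas as pd

score_cols = ['math score', 'reading score', 'writing score']

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male'],
    'math score': [60, 70, 70, 80],
    'reading score': [80, 90, 60, 70],
    'writing score': [82, 88, 58, 66],
})

# Mean score per subject for each gender
by_gender = toy.groupby('gender')[score_cols].mean()

# Positive values mean girls score higher on average, negative means boys do
gap = by_gender.loc['female'] - by_gender.loc['male']
print(by_gender)
print(gap)
```

Replacing `toy` with `studentsPerformTable` gives the corresponding table for the real scores.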

## Having a quick glance at the different races or ethnicities given in the data
> The data contains five race/ethnicity groups. The largest share of students belongs to *group C* (*31.9%*), while *group A* is the smallest (*8.9%*).

In [None]:
plt.figure(figsize=(10,8))
sns.set_style('darkgrid')
sns.despine(offset=10, trim=True)
plt.tight_layout()
total = studentsPerformTable['race/ethnicity'].value_counts().sum()
ax = sns.countplot(x="race/ethnicity", hue='gender',data=studentsPerformTable,palette="Set2",
                  order = studentsPerformTable['race/ethnicity'].value_counts().index)
ax.set(xlabel='Race/Ethnicity', ylabel='Number of Students')
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.1f}%\n'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center')

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize=10,title = 'Gender')
plt.title('Percentage distribution of ethnicities across genders',fontweight="bold",fontsize = 15)
plt.show()

## How does parents' education level vary across the different ethnicities?

In [None]:
sns.set_palette("Set3")
raceToEduLeveldf = studentsPerformTable.groupby(['race/ethnicity','parental level of education']).size().unstack().apply(lambda r: r/r.sum()*100, axis=1)
ax = raceToEduLeveldf.plot(kind='barh', stacked=True, figsize=(10, 6))
ax.set_ylabel('Race/Ethnicity',fontsize=15)
ax.set_xlabel('Percentage distribution in terms of education level',fontsize=15)
plt.legend(title='Parental Level of Education', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title("Percentage distribution of parents' education level across ethnicities",fontweight="bold",fontsize = 15,y=1.05)
plt.show()
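
The row-wise percentages behind the stacked bars come from a `groupby` / `size` / `unstack` / `apply` chain. An alternative that may be easier to read is `pd.crosstab` with `normalize='index'`, sketched below on hypothetical rows. One difference to keep in mind: `crosstab` fills missing combinations with 0, whereas the `unstack` approach leaves them as `NaN`.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'race/ethnicity': ['group A', 'group A',
                       'group B', 'group B', 'group B', 'group B'],
    'parental level of education': ['high school', 'some college',
                                    'high school', 'high school',
                                    'some college', "bachelor's degree"],
})

# normalize='index' divides every row by its own total, so each row sums to 1
share = pd.crosstab(toy['race/ethnicity'],
                    toy['parental level of education'],
                    normalize='index') * 100
print(share)
```

The resulting frame can be plotted the same way, for example with `share.plot(kind='barh', stacked=True)`.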

## Does parents' education level have a significant influence on their children's performance?
> Parents' level of education seems to have no major effect on the average score across the subjects evaluated.

In [None]:
sns.set_style('darkgrid')
ax = sns.catplot(y='parental level of education',x='avg score', data=studentsPerformTable,
                kind='box', orient='h', palette='Set2')
ax.set(xlabel='Average Score', ylabel='Parental Level of Education')
plt.title('Average score by parental level of education',fontweight="bold",fontsize = 15,y=1.05)
plt.show()

## Does lunch affect the average score of the students?
> ### As it's rightly said: *You gotta nourish to flourish*.
We do see that students who have access to *standard* lunch have a noticeably higher average score than those in the *free/reduced* category.

In [None]:
sns.set(style="darkgrid", font_scale=1)
ax1 = sns.catplot(x="lunch", y="avg score", hue="gender", kind="point", data=studentsPerformTable.sort_values('lunch'),legend=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize=10,title = 'Gender')
ax1.set(xlabel='Lunch', ylabel='Average Score')
plt.title('Impact of lunch on the \naverage score of students',fontweight="bold",fontsize = 15)    
plt.show()

## Do we see the Test Preparation Course helping the students score well?
> ### *Practice makes perfect*
Students who completed the course achieved a higher total score than those who did not.

In [None]:
sns.set(style="darkgrid", font_scale=1)
ax1 = sns.catplot(x="test preparation course", y="total score", hue="gender", kind="bar", data=studentsPerformTable.sort_values('test preparation course',ascending=False),legend=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize=10,title = 'Gender')
ax1.set(xlabel='Test Preparation Course Status', ylabel='Total Score')
plt.title('Impact of Test Preparation Course \non the Total Score of students',fontweight="bold",fontsize = 15)    
plt.show()

## Do lunch access and test preparation course status have a combined effect on the total score?
> ### *Food for thought?*
The total score is higher for students who completed the Test Preparation Course. Within each course status, students with access to standard lunch scored higher than those with free/reduced lunch.

In [None]:
ax = sns.catplot(x="test preparation course", y="total score", hue = "lunch", kind="box", data=studentsPerformTable.sort_values('test preparation course',ascending=False),legend=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',fontsize=10,title = 'Lunch')
ax.set(xlabel='Test Preparation Course Status', ylabel='Total Score')
plt.title('Impact of Test Preparation Course and \nLunch on the Total Score of students',fontweight="bold",fontsize = 15,y=1.05)    
plt.show()
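
To read the combined effect off as numbers rather than boxes, the two factors can be cross-tabulated with `pivot_table`. This is a minimal sketch on invented totals. The values are assumptions chosen to mirror the pattern described above, not results from the dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'free/reduced', 'free/reduced',
              'standard', 'standard', 'free/reduced', 'free/reduced'],
    'test preparation course': ['completed', 'completed', 'completed', 'completed',
                                'none', 'none', 'none', 'none'],
    'total score': [230, 250, 200, 220, 210, 230, 180, 200],
})

# Mean total score for every combination of course status and lunch type
table = toy.pivot_table(index='test preparation course',
                        columns='lunch',
                        values='total score',
                        aggfunc='mean')
print(table)
```

Running the same call on `studentsPerformTable` yields one mean per combination, which makes it easy to see whether the two effects add up.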

## This work is in progress. Feel free to Upvote and give Feedback.