# Abstract

Education is the foundation of equality of opportunity. The quality of education one receives in Brazil is deeply correlated with family income, the geographic region where one lives, and so forth. In this notebook, we explore how income affects student performance on the most important exam high school students take in Brazil. ENEM is the front door to university, much like the SAT in the USA, and is taken by almost all students who intend to go to college. In this notebook we answer the following questions:

* How is the distribution of income across school types?
* How is the correlation between grades and income on each school type?
* How much can student performance be explained by their school quality?
* Is income a factor inside a school?
* How is student income related to their distribution among schools?



In [None]:
!pip install seaborn -U --quiet

In [None]:
import math

import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from tqdm.notebook import tqdm

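# The 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')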

# Loading



In [None]:
id_columns = [
    'NU_INSCRICAO', #student unique id
]
status_columns=[
    'TP_ST_CONCLUSAO', #high school situation (already over, finishing, ongoing, not on high school) 
]
grade_columns_raw = [
    'NU_NOTA_CN', #grade natural sciences
    'NU_NOTA_CH', #grade human sciences
    'NU_NOTA_LC', #grade languages and codes
    'NU_NOTA_MT', #grade math
    'NU_NOTA_REDACAO', #grade essay
]
social_economic_columns = [
    'Q001', #how far did the mother go with her studies
    'Q002', #how far did the father go with his studies
    'Q005', #number of people living in the house
    'Q006', #family income
]
school_columns_raw=[
    'CO_ESCOLA', #School code
    'NO_MUNICIPIO_ESC',#School city name
    'SG_UF_ESC', #School State code
    'TP_DEPENDENCIA_ADM_ESC', #school dependency(federal gov., state gov., city gov., private)
]

df = pd.read_csv('/kaggle/input/enem-2019/DADOS/MICRODADOS_ENEM_2019.csv', 
                 delimiter=';',
                 encoding='ISO-8859-1', 
                 usecols=id_columns + status_columns + grade_columns_raw + social_economic_columns + school_columns_raw)

df = df.rename(columns={
    'NU_NOTA_CN': 'Natural Sciences',
    'NU_NOTA_CH': 'Humanities',
    'NU_NOTA_LC': 'Languages', 
    'NU_NOTA_MT': 'Mathematics',
    'NU_NOTA_REDACAO': 'Essay', 
    'TP_DEPENDENCIA_ADM_ESC': 'School type'
})

df = df.replace({
    'School type':{
        1.0: 'Federal',
        2.0: 'State',
        3.0: 'City',
        4.0: 'Private'
    },
    'TP_ST_CONCLUSAO':{
        1.0: 'Graduated',
        2.0: '3rd year',
        3.0: '2nd year or before',
        4.0: 'Not graduated nor ongoing'
    }
})

grade_columns = [
    'Natural Sciences',
    'Humanities',
    'Languages', 
    'Mathematics',
    'Essay', 
]

school_columns=[
    'CO_ESCOLA', #School code
    'NO_MUNICIPIO_ESC',#School city name
    'SG_UF_ESC', #School State code
    'School type', #school dependency(federal gov., state gov., city gov., private)
]

In [None]:
df.head()

In [None]:
# `null_counts` was deprecated in pandas 1.2 and later removed; `show_counts` is its replacement
df.info(show_counts=True)

From the above, we see that all students answered the socioeconomic questions. Some students did not show up for the tests and therefore have no grades. Some students have no school associated with them; we hypothesize that those students graduated from high school before 2019.

We are interested in the students graduating from high school who took all the tests. So we can drop all NaN values in the grade columns and select only graduating students. School will play an important role in our further analysis, so we remove students that do not have a school associated with them.

In [None]:
#drop non assigned grades and non assigned schools
df = df.dropna(subset=grade_columns + school_columns)
#drop students that are not finishing this year
df = df[df['TP_ST_CONCLUSAO']=='3rd year']
#drop grades 0 in CN, CH, LC, MT, because it should not be possible to receive these grades if you took the test (probably it means that the student missed the test)
df = df[df['Natural Sciences']>0]
df = df[df['Humanities']>0]
df = df[df['Languages']>0]
df = df[df['Mathematics']>0]
print('High school graduating students that took all the tests:', len(df))

In [None]:
print('Nan Values in columns:\n')
print(df.isna().sum())

# Processing

From our knowledge of Brazil's education system, we know that school type (federal, state, city or private) is a major factor in student performance. Let's see how this information is reflected in the data.

In [None]:
df['School type'].value_counts()/len(df)

In [None]:
id_vars = ['School type', 'NU_INSCRICAO']
lin_df = pd.melt(df[grade_columns + id_vars], 
                 id_vars=id_vars, 
                 value_vars=grade_columns,
                 var_name='Test',
                 value_name='Grade')

g = sns.displot(data=lin_df, 
            x='Grade', 
            hue='School type', 
            row='Test',
            kind='kde', 
            facet_kws={
                'sharex': True, 
                'sharey': False, 
                'xlim': (0, 1000)
            },
            common_norm=False, 
            height=3, 
            aspect=5)

g.set_titles('KDE for student grades on `{row_name}` per school type')

print('Number of students in each type of school:\n')
print(df['School type'].value_counts())

The plots above show a clear difference in performance between State and City schools on one side and Federal and Private schools on the other. Despite being public, Federal schools perform similarly to Private schools, except in essays, where the gap in performance grows beyond a score of about 800.

Regarding the number of students in each school type, the printed values show that the vast majority of students come from State schools. The following plot displays the histogram of student grades. We have already seen the inequality in relative performance between school types; does the same inequality hold when looking at absolute numbers?

In [None]:
g = sns.displot(data=lin_df, 
            x='Grade', 
            hue='School type', 
            row='Test',
            facet_kws={
                'sharex': True, 
                'sharey': False, 
                'xlim': (0, 1000)
            },
            binwidth=10,
            height=3, 
            aspect=5)

g.set_titles('Histogram for student grades on `{row_name}` per school type')

The plots above confirm the inequalities in absolute values as well. Despite being only 16.36% of the total, Private school students are the predominant group beyond thresholds that sit just above the mode of their grade histograms.

### How is the distribution of income across school types?

We have two pieces of information: family income and the number of people per family. Income per person is a better measure of financial health than family income alone. A family of two with a 3k family income has a lot more per person than a family of six.

Family income is categorical, so we take the midpoint between the thresholds of each bin as its value. For the top income bin (more than 19,960 BRL) we take the midpoint between that value and its double.
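
As a small worked example of the midpoint rule, the sketch below recomputes the bin values from the thresholds used in this notebook and applies them to the family-of-two versus family-of-six comparison. The names `upper_bounds`, `midpoints` and `income_per_person` are hypothetical helpers introduced only for illustration.

```python
# Upper threshold of each income class A..Q (BRL), as used in this notebook.
# The last value is the assumed cap for the open-ended top class: 2 * 19,960.
upper_bounds = [0, 998, 1497, 1996, 2495, 2994, 3992, 4990, 5988,
                6986, 7984, 8982, 9980, 11976, 14970, 19960, 19960 * 2]
lower_bounds = [0] + upper_bounds[:-1]

# Represent each class by the midpoint of its range.
midpoints = {
    label: (low + high) / 2
    for label, low, high in zip('ABCDEFGHIJKLMNOPQ', lower_bounds, upper_bounds)
}


def income_per_person(income_class, household_size):
    """Hypothetical helper: class midpoint divided by the household size."""
    return midpoints[income_class] / household_size


# A family income of about 3k falls in class G (2,994 to 3,992 BRL under the thresholds above).
print(midpoints['G'])               # midpoint of class G
print(income_per_person('G', 2))    # family of two
print(income_per_person('G', 6))    # family of six
```

Under these assumptions the same class G income is worth three times as much per person in the family of two as in the family of six, which is why the analysis uses income per person.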

In [None]:
incomeTh =  [
    0, #padding
    0,
    998,
    1497,
    1996,
    2495,
    2994,
    3992,
    4990,
    5988,
    6986,
    7984,
    8982,
    9980,
    11976,
    14970,
    19960,
    19960*2
]
incomeClasses = {}
for idx, c in enumerate(list('ABCDEFGHIJKLMNOPQ'), start=1):
    incomeClasses[c]=(incomeTh[idx-1] + incomeTh[idx])/2

df2 = df.copy() # do not change main df after data preparation    
df2 = df2.replace({'Q006':incomeClasses}) 
df2['Income per person'] = df2['Q006']/df2['Q005']

# remove some outliers on income per person
quantiles = df2[['Income per person', 'School type']].groupby('School type')['Income per person'].quantile(0.99)
# df2 = df2[df2['Income per person'] < 1000 * math.ceil(df2['Income per person'].quantile(0.99)/1000)]
for school_type, quantile in quantiles.items():  # Series.iteritems() was removed in pandas 2.0
    df2 = df2[(df2['School type'] != school_type) | 
              (df2['Income per person'] <= quantile)]

g = sns.displot(data=df2, 
                x='Income per person', 
                hue='School type',
                row='School type', 
                common_norm=False, 
                binwidth=100,
                facet_kws={
                    'sharey': False,
                },
                height=3, 
                aspect=5)
g.set_titles('')

plt.subplots_adjust(top=0.95)
plt.figlegend(title='School type')
_ = plt.suptitle('Students per income per person', fontweight='bold', fontsize='xx-large')

We can see that the distributions of students in public schools are very similar and skewed toward low income. Students in private schools are more spread out across income ranges. It is worth noting, though, that there is a fair number of low-income students in private schools.

Federal schools have the longest tail among public schools. That might reflect parents' recognition that federal schools are as good as (or better than) private schools.

We saw that a reasonable proportion of students in private schools have low income.

### How is the correlation between grades and income on each school type?

In [None]:
df2.head()

In [None]:
fig, axs = plt.subplots(6, 4, sharex='col', sharey='row', figsize=(15, 12), gridspec_kw={'right': 0.85, 'wspace':0.05, 'hspace': 0.1})
       
df2['Binned income'] = df2['Income per person'].apply(lambda x: 100*int(x/100)).astype(str)
        
default_palette = sns.color_palette()
school_types = df2['School type'].unique()
for col, school_type in enumerate(school_types):
    data = df2[df2['School type']==school_type]
    for row, test_type in enumerate(grade_columns):
        ax = axs[row][col]
        sns.histplot(
            data = data,
            x='Income per person', 
            y=test_type,
            binwidth=(100,10),
            ax=ax,
            legend=False,
            color=default_palette[col])
        
        mean = data[['Binned income', test_type]].groupby('Binned income').mean()
        mean = mean.reset_index()
        mean['Binned income'] = mean['Binned income'].astype(int)
        sns.lineplot(x=mean['Binned income'], 
                     y=mean[test_type], 
                     ax=ax, 
                     color='#111111')
        
        if row==0 or col==0:
            ax.xaxis.set_major_locator(ticker.MultipleLocator(2000))
            ax.yaxis.set_major_locator(ticker.MultipleLocator(200))
    
        if row == 0:
            ax.set_title(school_type)
        
        ax.spines['right'].set_visible(False)
        ax.spines['top'].set_visible(False)
        
        
    ax = axs[len(grade_columns)][col]
    sns.histplot(
            data = data,
            x='Income per person',
            binwidth=100,
            stat='density',
            ax=ax,
            legend=False,
            color=default_palette[col])
   
    if col == 0:
        ax.set(ylabel='density')  # the histogram above uses stat='density'
        

patches = [mpatches.Patch(color=default_palette[col], 
                          label=school_type) 
           for col, school_type in enumerate(school_types)]   
patches.append(plt.plot([],[], ms=10, label='Average', color='#111111')[0]) #https://stackoverflow.com/a/44113141/7502752

fig.legend(handles=patches, loc='center right', title='School type', frameon=False)
_ = fig.suptitle('Grade distribution per income per person\n(Plots do not carry the same scale, but ticks are equally spaced in absolute values)',
                 fontweight='bold', fontsize='xx-large')

From the charts above we can see that student performance is correlated with income. This correlation is especially pronounced in the income range up to 1000 BRL. The gains in performance diminish as income grows. This can be explained by considering student needs. Students need school supplies, a good place to study, books and maybe some tutoring. These needs are met in the low-income range. Study expenses at the high-income end (like better computers and tablets) bring only marginal gains in performance compared to the basic ones. Also, parents' education, for which income is a good proxy, may play an important role in their children's education.
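
One possible way to put a number on the diminishing gains described above is to compare a rank correlation below and above the 1000 BRL mark. The sketch below does this on synthetic data only: the saturating income-to-grade curve, the noise level and the helper `spearman_in_range` are all assumptions made for illustration, not results from the ENEM data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (assumption): skewed incomes and a grade curve that saturates with income.
rng = np.random.default_rng(0)
n = 5000
income = rng.lognormal(mean=6.5, sigma=0.8, size=n)
grade = 400 + 150 * (1 - np.exp(-income / 500)) + rng.normal(0, 40, size=n)
toy = pd.DataFrame({'Income per person': income, 'Grade': grade})


def spearman_in_range(frame, low, high):
    """Hypothetical helper: Spearman correlation for low < income <= high."""
    subset = frame[(frame['Income per person'] > low) &
                   (frame['Income per person'] <= high)]
    rho, _ = stats.spearmanr(subset['Income per person'], subset['Grade'])
    return rho


rho_low = spearman_in_range(toy, 0, 1000)
rho_high = spearman_in_range(toy, 1000, np.inf)
print('Spearman correlation up to 1000 BRL: {:.2f}'.format(rho_low))
print('Spearman correlation above 1000 BRL: {:.2f}'.format(rho_high))
```

On the real data, the same helper could be applied to `df2` per school type and per test. Note that restricting the income range changes the correlation by itself, so the two values should be read side by side with the plots rather than as a standalone test.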

We have seen that income influences the type of school a student attends. It also influences their performance within a given type of school. Therefore, we can ask:

* How much can student performance be explained by their school quality?
* Is income a factor inside each school?

### How much can student performance be explained by their school quality?

We use the average grade of the students in a school as a measure of its quality. Schools are required to have at least 10 students, in order to filter out schools that have too few students.

In [None]:
df3 = df2.copy()

school_avg = (df3[school_columns + grade_columns + id_columns]
              .groupby(school_columns)
              .agg(func={
                **{col: 'mean' for col in grade_columns}, 
                **{col: 'count' for col in id_columns}
              })
              .rename(columns={id_columns[0]:'Count'}))
school_avg = school_avg[school_avg['Count'] >= 10]

df3 = df3.join(school_avg, how='inner', on=school_avg.index.names, rsuffix=' (school average)')
print('Removing school with less than 10 students dropped {:.2f}% of students'.format(100*(1-len(df3)/len(df2))))

grade_columns_mean = [col+' (school average)' for col in grade_columns]
grade_columns_mean_normalized = [col+' (normalized by school average)' for col in grade_columns]

df3[grade_columns_mean_normalized] = df3[grade_columns] - df3[grade_columns_mean].values

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(df3[grade_columns + grade_columns_mean].corr().iloc[:,5:], 
            annot=True, fmt=".2f", linewidths=.5, ax=ax)
_ = plt.title('Correlation between student grades and their school averages', fontweight='bold')

From the plot above we can see that school average grades are highly correlated across subjects. This means that good schools probably teach all subjects well, as expected.

In [None]:
fig, axs = plt.subplots(2, 5, sharex='col', figsize=(15, 6), gridspec_kw={'right': 0.85, 'wspace':0.25, 'hspace': 0.15})

grade_columns_mean_binned = [col+'_mean_binned' for col in grade_columns]

df3[grade_columns_mean_binned] = df3[grade_columns_mean].applymap(lambda x: 10*int(x/10)).astype(str)
        
default_palette = sns.color_palette()
for col, student_var in enumerate(grade_columns):
    ###### Plot 1 ######
    school_var = grade_columns_mean[col]
    school_var_binned = grade_columns_mean_binned[col]
        
    ax = axs[0][col]
    sns.histplot(
        data = df3,
        x=school_var, 
        y=student_var,
        binwidth=(10,10),
        ax=ax,
        legend=False,)

    # named grade_stats to avoid shadowing the `scipy.stats as stats` import
    grade_stats = df3[[school_var_binned, student_var]].groupby(school_var_binned)[student_var].agg(['mean', 'std'])
    grade_stats = grade_stats.reset_index()
    grade_stats[school_var_binned] = grade_stats[school_var_binned].astype(int)
    sns.lineplot(x=grade_stats[school_var_binned], 
                 y=grade_stats['mean'], 
                 ax=ax, 
                 label='Average',
                 legend=False)
    
    sns.lineplot(x=grade_stats[school_var_binned], 
                 y=3*grade_stats['std'], 
                 ax=ax, 
                 label='std * 3',
                 legend=False)
    
    ax.xaxis.set_major_locator(ticker.MultipleLocator(200))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(200))
    ax.set(title=student_var, ylim=(0, None), ylabel='Student grade' if col==0 else None)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
   
    ###### Plot 2 ######   
    ax = axs[1][col]
    sns.histplot(
        data = df3,
        x=school_var,
        binwidth=10,
        stat='density',
        ax=ax,
        legend=False)
       
    ax.set(
        xlabel='School average grade' if col==2 else None,
        ylabel='Student density' if col==0 else None, 
        yticks=[]
    )
# patches = [mpatches.Patch(color=default_palette[col], 
#                           label=school_type) 
#            for col, school_type in enumerate(school_types)]   
# patches.append(plt.plot([],[], ms=10, label='Average', color='#111111')[0]) #https://stackoverflow.com/a/44113141/7502752

# fig.legend(handles=patches, loc='center right', title='School type', frameon=False)
handles, labels = axs[0][0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center right')
plt.subplots_adjust(top=0.8)
_ = fig.suptitle('Student Grades vs school grades\n(Plots do not carry the same scale, but ticks are equally spaced in absolute values)', 
                 fontweight='bold', fontsize='xx-large')

So student grade is correlated with school grade, but not strongly. Part of this correlation is there by construction, since we evaluate schools by their student average, so there is not much to conclude here. We can observe that Mathematics and Essay are the most 'democratic' disciplines, since we find students in mid-range schools obtaining almost full grades.

We find that as we move toward the high end, the main effect is a reduction in the standard deviation of student grades while performance stays high. In Humanities and Natural Sciences, the best grades were not even achieved in top-ranking schools, but their worst students performed much better than the worst ones from middle-to-high-ranking schools.
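
The by-construction correlation noted above could be reduced with a leave-one-out school mean, where each student is compared with the average of their schoolmates only. This is a technique we did not use in the analysis; the sketch below shows it on a made-up table, and the column names `School mean` and `Leave-one-out school mean` are hypothetical.

```python
import pandas as pd

# Made-up grades for two schools (illustration only).
toy = pd.DataFrame({
    'CO_ESCOLA': [1, 1, 1, 2, 2, 2],
    'Mathematics': [500.0, 600.0, 700.0, 400.0, 450.0, 500.0],
})

grouped = toy.groupby('CO_ESCOLA')['Mathematics']
school_sum = grouped.transform('sum')
school_count = grouped.transform('count')

# Plain school mean: includes the student's own grade.
toy['School mean'] = school_sum / school_count
# Leave-one-out mean: average of the other students in the same school.
toy['Leave-one-out school mean'] = (school_sum - toy['Mathematics']) / (school_count - 1)

print(toy)
```

The leave-one-out mean needs at least two students per school, which the 10-student minimum used here already guarantees. With many students per school the two means are close, so the effect of this correction is expected to be small for large schools.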

### Is income a factor inside a school?
In order to answer that question we study how the difference between student grades and school average grades behaves across income classes. To do so, we turn income values into discrete income ranges by binning them. Since the income distribution is highly skewed toward low incomes, we use bins with an equal number of points.
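
To illustrate why equal-count bins suit a skewed variable, the sketch below compares `pd.cut` (equal-width bins) with `pd.qcut` (equal-count bins) on synthetic log-normal incomes. The data and its parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed incomes (assumption, not ENEM data).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=9000))

# Equal-width bins: most points pile up in the first bin.
equal_width = pd.cut(income, bins=9)
# Equal-count bins: each bin holds roughly the same number of points.
equal_count = pd.qcut(income, q=9)

width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)

print('Equal-width bin sizes:', width_counts.tolist())
print('Equal-count bin sizes:', count_counts.tolist())
```

With equal-width bins, the upper bins hold very few students and their averages become noisy, while equal-count bins keep the sample size comparable across income classes.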

The first thing we analyse is the correlation.

In [None]:
print('Correlation between income per person and student deviation from school average:')
df3[['Income per person'] + grade_columns_mean_normalized].corr().iloc[0:1, 1:]

These numbers tell us that the two quantities are not correlated. Let's investigate further.

In [None]:
df4 = df3.copy()
classes = list('ABCDEFGHIJ')
df4['Income class'], income_classes = pd.qcut(
    df4.loc[df4['Income per person'] > 0, 'Income per person'],
    q=9,
    labels=classes[1:],
    retbins=True,
    duplicates='drop'
)
df4 = df4.astype({'Income class': str})

df4.loc[df4['Income per person'] == 0, 'Income class'] = classes[0]
assert df4['Income class'].isna().sum() == 0

ax = sns.countplot(data=df4, x='Income class', order=classes)
ax.set_title('Number of students in each income class', fontweight='bold')
print('Income class binning result:\n')
print('\n'.join(
    ['Income Class\tRange']+
    ['{}\t\tno income'.format(classes[0])]+
    ['{}\t\t{:.2f} - {:.2f}'.format(key, income_classes[idx], income_classes[idx+1]) for idx, key in enumerate(classes[1:])]))

In [None]:
grades_by_income = df4[grade_columns_mean_normalized + ['School type','Income class']].groupby(['School type','Income class']).agg(['mean', 'std', 'count'])
# the original trailing .rename(...) targeted a non-existent index label and had no effect, so it is dropped
mean_income = df4[['Income per person', 'Income class']].groupby('Income class').mean()

In [None]:
fig, axs = plt.subplots(5, 1, sharex='col', figsize=(15, 15), gridspec_kw={'right': 0.85, 'wspace':0.25, 'hspace': 0.15})
for idx_discipline, discipline in enumerate(grade_columns_mean_normalized):
    for idx_school_type, school_type in enumerate(['State', 'Private']):
        grades = (grades_by_income
                  .loc[school_type, discipline]
                  .join(mean_income))
        ax = axs[idx_discipline]
        ax.plot(grades['Income per person'], 
                grades['mean'], 
                color=default_palette[idx_school_type],
                label=school_type)
        ax.fill_between(
            grades['Income per person'], 
            grades['mean']-grades['std'], 
            grades['mean']+grades['std'],
            color=default_palette[idx_school_type],
            alpha=.3)
    
        ax.set(title=discipline, xticks=grades['Income per person'])
        
axs[-1].set(xticklabels=classes)
plt.xlabel('Income class')
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels, loc='center right', title='School type')
fig.show()

The graphs above show a trend toward better performance at higher incomes. The standard deviation of those values is large, though, and the trend never leaves the standard deviation range.
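
The shaded band in the plots above is the standard deviation of individual students, which describes their spread rather than the uncertainty of each class mean. A possible complement is the standard error of the mean, std / sqrt(count), which can be derived from the `mean`, `std` and `count` columns already computed in `grades_by_income`. The sketch below uses made-up numbers for two income classes; they are not results from the data.

```python
import numpy as np
import pandas as pd

# Made-up summary statistics for two income classes (illustration only).
summary = pd.DataFrame(
    {'mean': [-5.0, 4.0], 'std': [70.0, 70.0], 'count': [10000, 10000]},
    index=['B', 'J'],
)

# Standard error of each class mean.
summary['sem'] = summary['std'] / np.sqrt(summary['count'])

# Difference between the two class means and its standard error.
diff = summary.loc['J', 'mean'] - summary.loc['B', 'mean']
diff_se = np.sqrt(summary.loc['J', 'sem'] ** 2 + summary.loc['B', 'sem'] ** 2)
z_score = diff / diff_se

print(summary)
print('Difference: {:.1f}, standard error: {:.2f}, z: {:.1f}'.format(diff, diff_se, z_score))
```

In this made-up case a 9-point difference is far smaller than the 70-point student spread, yet it is many standard errors away from zero. A small average effect can therefore be well established even when individual students vary a lot.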

Finally, let's ask one more question:
### How is student income related to their distribution among schools?
To answer that question, we visualize the distribution of students per income class. Schools are sorted from low to high grades. To simplify, we take the average grade across tests as the school grade.

In [None]:
df5 = df4.copy()
df5['school_grade'] = df5[grade_columns_mean].mean(axis=1)
renameMap = {
    **{key:'A-C' for key in list('ABC')},
    **{key:'D-F' for key in list('DEF')},
    **{key:'G-H' for key in list('GH')},
    **{key:'I-J' for key in list('IJ')},
}

df5 = df5.astype({'school_grade': int}).replace({  # np.int was removed in NumPy 1.24
    'Income class':renameMap
})

student_per_income_class = df5.pivot_table(
    index='school_grade',
    columns='Income class', 
    values='NU_INSCRICAO',
    aggfunc='count')
student_per_income_class = student_per_income_class.div(student_per_income_class.sum(axis=1), axis=0)

student_per_income_class.plot.area()
plt.ylim(top=1, bottom=0)
plt.xlim(left=student_per_income_class.index.min(), right=student_per_income_class.index.max())
plt.xlabel('School Grade')
plt.ylabel('Student Income distribution')
plt.legend(title="Income Class", fancybox=True, framealpha = .7, frameon=True)
plt.gca().yaxis.set_major_formatter(ticker.PercentFormatter(1.0))
_ = plt.title('Student income distribution per school sorted by school grade', fontweight='bold')

# Conclusion
In this notebook we analysed how income plays a role in education in Brazil. We concluded that students are highly affected by the quality of the school they attend. We showed that students in the low income range tend to attend the worst schools. We also showed that improvements in family income have the biggest impact in the 0-1000 BRL income per person range. The effect of income within a given school appears to be limited, but present. It is more pronounced in the very low income range, this time up to about 600 BRL per person.

Federal schools are public schools but, unlike city and state schools, perform very close to private schools. The student income distribution in federal schools is slightly less skewed toward low income than in state and city schools. We suspect that federal schools can be a role model for other public schools, although further investigation is necessary to see if this is actually feasible.

Our analysis shows that students in mid-range schools can hope to achieve top grades in some disciplines. We also notice that high-end schools are able not only to improve their average but also to reduce their standard deviation, yielding more consistent grades among students. This conclusion needs further investigation. We need to analyse whether these schools reproduce this behaviour over the years, or whether this phenomenon is just an artifact of the math (after all, grades are capped at the top).
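
The possibility that the lower spread comes from the grade cap can be illustrated with a small simulation. In the sketch below every school has the same underlying spread, and the only difference between them is the mean; the cap of 1000, the spread of 100 and the normal distribution are assumptions chosen for illustration, not properties of the ENEM scores.

```python
import numpy as np

# Assumptions for illustration: identical latent spread, grades capped at 1000.
rng = np.random.default_rng(0)
max_grade = 1000
latent_spread = 100
noise = rng.normal(0, 1, size=100_000)

observed_std = {}
for school_mean in (500, 700, 900):
    latent = school_mean + latent_spread * noise
    # Grades cannot leave the [0, max_grade] interval.
    observed = np.clip(latent, 0, max_grade)
    observed_std[school_mean] = observed.std()

for school_mean, std in observed_std.items():
    print('School mean {}: observed std {:.1f}'.format(school_mean, std))
```

In this toy setting the school closest to the cap shows a visibly smaller standard deviation even though nothing else changed. Whether this mechanism accounts for the pattern in the real data remains an open question.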

The analysis done here does not include school budget information. We leave the role of school budget in student performance for future work.