In [1]:
import pandas as pd
from statsmodels.formula.api import ols

For International Women's Day, I was inspired by tweets that zeroed in on women, more so than usual anyway. I am a day late because, in the run-up, I was busy analyzing this same dataset for regional variation. My interest was mostly in the nexus of place and education in Tanzania, and in how such data can support a more targeted organizational approach to tackling the toughest educational inequalities.

One such inequality is between the sexes. No patience for punchlines here: educational outcomes for Tanzanian girls are unequal to those of their male counterparts. We have some indication across the board of what levers the government (local and national) can begin to tinker with to close these gaps. In this quick linear regression, I use the 2013 Primary School Leaving Examination (PSLE) outcomes to try to capture the magnitude of the inequality. In my past research, I have mostly seen this magnitude expressed in terms of absenteeism, pass-fail differences, etc. To be clear, those measures are likely more impactful and more clear-cut. Quite frankly, this analysis is closely tied to them, as I'll be looking at the Calculated Average PSLE score and the effect of gender on it. If you can close the pass-fail gaps or address the various causes of absenteeism, my assumption is that we would see the effect of gender narrow. Let's get to it.

First, we read in our data using pandas and drop the rows with N/A values. For this quick analysis, I look at two categorical variables: Sex and Region. I already dummy coded Sex (Female = 1, Male = 0) when cleaning up the scraped data, and pandas can also dummy code all 25 regions. Statsmodels' ols function does this step for you when fitting the model, so I opt for that instead. (Many thanks to scipy-lectures for their tutorial on setting this up using ols.)
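To make the dummy coding concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data and its values are hypothetical, not taken from the PSLE files). It assumes the `Region` column holds short codes such as `ARU` and `DAR`, as in the regression output. Passing `drop_first=True` leaves out the first level so that it acts as the baseline, which is similar in spirit to how the formula interface treats a categorical column.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has 25 regions and far more rows.
toy = pd.DataFrame({
    "SEX": [1, 0, 1, 0],
    "Region": ["ARU", "DAR", "DOD", "DAR"],
    "CalcAverage": [2.4, 3.0, 2.2, 2.8],
})

# One indicator column per region; drop_first=True omits the first level (ARU)
# so that it serves as the reference category.
dummies = pd.get_dummies(toy, columns=["Region"], drop_first=True)
print(dummies.columns.tolist())
```

Each remaining `Region_*` column is then read relative to the omitted region.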

In [2]:
#Read in CSV
psle2013 = pd.read_csv("~/Documents/GitHub/ImportingNECTA/CompleteDatasets/necta_psle_2013.csv")

#Drop NAs, get Dummies if desired, call .head to check dataframe if desired
psle2013_noNA = psle2013.dropna(axis=0, how='any')
psle2013_noNA2 = pd.get_dummies(psle2013_noNA, columns=['Region'])

#Assign variables for building the model
CalcAverage = psle2013_noNA.CalcAverage
sex = psle2013_noNA.SEX
regions = psle2013_noNA.Region

#For just the DAR-ES-SALAAM Dummy Variable
dar = psle2013_noNA2.Region_DAR

#Build the model, print the model summary
model = ols("CalcAverage ~ sex + regions", psle2013_noNA).fit()
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:            CalcAverage   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.075
Method:                 Least Squares   F-statistic:                     2748.
Date:                Sun, 11 Mar 2018   Prob (F-statistic):               0.00
Time:                        22:19:11   Log-Likelihood:            -8.9027e+05
No. Observations:              844297   AIC:                         1.781e+06
Df Residuals:                  844271   BIC:                         1.781e+06
Df Model:                          25                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.7540      0.004    728.

In [3]:
print("Girls' Mean Calculated Average Score: "+str(psle2013_noNA[psle2013_noNA.SEX == 1]['CalcAverage'].mean()))
print("Girls' Median Calculated Average Score: "+ str(psle2013_noNA[psle2013_noNA.SEX == 1]['CalcAverage'].median()))
print("Boys' Mean Calculated Average Score: "+ str(psle2013_noNA[psle2013_noNA.SEX == 0]['CalcAverage'].mean()))
print("Boys' Median Calculated Average Score: "+ str(psle2013_noNA[psle2013_noNA.SEX == 0]['CalcAverage'].median()))


Girls' Mean Calculated Average Score: 2.467210504122669
Girls' Median Calculated Average Score: 2.4
Boys' Mean Calculated Average Score: 2.6070221581969726
Boys' Median Calculated Average Score: 2.6
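
The same four numbers can be produced in one call with `groupby`. The sketch below uses a small made-up frame (values hypothetical) with the same `SEX` and `CalcAverage` column names, and also computes the raw girls-minus-boys gap in means.

```python
import pandas as pd

# Hypothetical toy scores; SEX follows the coding used here (Female = 1, Male = 0).
toy = pd.DataFrame({
    "SEX": [1, 1, 1, 0, 0, 0],
    "CalcAverage": [2.2, 2.4, 2.6, 2.4, 2.6, 3.1],
})

# Mean and median of the calculated average, one row per sex.
stats = toy.groupby("SEX")["CalcAverage"].agg(["mean", "median"])

# Raw (unadjusted) gap: girls' mean minus boys' mean.
gap = stats.loc[1, "mean"] - stats.loc[0, "mean"]
print(stats)
print(f"Raw gap: {gap:.3f}")
```

With the means printed above, the raw gap is about 2.467 - 2.607 ≈ -0.14, before any adjustment for region.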


The regression controls for region. All else equal, girls scored on average 0.13 points lower on their calculated average score. Assuming a score of 3 (2.5+) is passing, a gap of that size could very well be the difference between passing and failing the exam. I feel that this needs some reiteration: because of the system as currently implemented, being born a girl could have been the difference between passing and failing the PSLE in 2013. Since then, the PSLE has ceased to be a barrier to attending secondary education. However, the fight is not won, as other discriminatory practices have taken its place (e.g. banning girls who have been pregnant from continuing their education).

Unfortunately, this discussion is still very high level. I'm going to continue to piece together data that might yield more actionable items for closing these gaps. In the meantime, check out the work of good people like Dropwall (eagleanalytics.co.tz), who are working on a system that predicts potential dropouts. The literature shows that the issues they are finding are of acute importance to girls.

## Repeated Analysis for Other Years

What if we repeated this analysis for the other years? How would the results differ or be similar?

In [4]:
#Read in CSV
psle2014 = pd.read_csv("~/Documents/GitHub/ImportingNECTA/CompleteDatasets/necta_psle_2014.csv")
psle2015 = pd.read_csv("~/Documents/GitHub/ImportingNECTA/CompleteDatasets/necta_psle_2015.csv")

def womens_day_ols(data):
    
    #Drop NAs, get Dummies if desired, call .head to check dataframe if desired
    psle_noNA = data.dropna(axis=0, how='any')
    psle_noNA2 = pd.get_dummies(psle_noNA, columns=['Region'])

    #Assign variables for building the model
    CalcAverage = psle_noNA.CalcAverage
    sex = psle_noNA.SEX
    regions = psle_noNA.Region

    #For just the DAR-ES-SALAAM Dummy Variable
    dar = psle_noNA2.Region_DAR

    #Build the model, return the model summary
    model = ols("CalcAverage ~ sex + regions", psle_noNA).fit()
    return model.summary()

womens_day_ols(psle2014)

Dep. Variable:            CalcAverage   R-squared:                        0.08
Model:                            OLS   Adj. R-squared:                   0.08
Method:                 Least Squares   F-statistic:                    2743.0
Date:                Sun, 11 Mar 2018   Prob (F-statistic):                0.0
Time:                        22:19:22   Log-Likelihood:              -843310.0
No. Observations:              791674   AIC:                         1687000.0
Df Residuals:                  791648   BIC:                         1687000.0
Df Model:                          25
Covariance Type:            nonrobust

                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.8920      0.004    745.180      0.000       2.884       2.900
regions[T.DAR]     0.2120      0.005     44.562      0.000       0.203       0.221
regions[T.DOD]    -0.3813      0.005    -72.385      0.000      -0.392      -0.371
regions[T.GEI]    -0.1030      0.006    -18.183      0.000      -0.114      -0.092
regions[T.IRI]    -0.0842      0.006    -13.759      0.000      -0.096      -0.072
regions[T.KAG]    -0.0359      0.005     -6.828      0.000      -0.046      -0.026
regions[T.KAT]    -0.1720      0.009    -18.939      0.000      -0.190      -0.154
regions[T.KIG]    -0.4240      0.006    -75.844      0.000      -0.435      -0.413
regions[T.KIL]    -0.0555      0.005    -10.601      0.000      -0.066      -0.045

Omnibus:                    19684.167   Durbin-Watson:                   1.047
Prob(Omnibus):                    0.0   Jarque-Bera (JB):            21195.091
Skew:                           0.401   Prob(JB):                          0.0
Kurtosis:                       2.984   Cond. No.                         29.1


In [5]:
womens_day_ols(psle2015)

Dep. Variable:            CalcAverage   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.075
Method:                 Least Squares   F-statistic:                    2748.0
Date:                Sun, 11 Mar 2018   Prob (F-statistic):                0.0
Time:                        22:19:31   Log-Likelihood:              -890270.0
No. Observations:              844297   AIC:                         1781000.0
Df Residuals:                  844271   BIC:                         1781000.0
Df Model:                          25
Covariance Type:            nonrobust

                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.7540      0.004    728.468      0.000       2.747       2.761
regions[T.DAR]     0.2880      0.005     62.197      0.000       0.279       0.297
regions[T.DOD]    -0.3458      0.005    -67.920      0.000      -0.356      -0.336
regions[T.GEI]    -0.1413      0.005    -25.767      0.000      -0.152      -0.131
regions[T.IRI]     0.0417      0.006      7.085      0.000       0.030       0.053
regions[T.KAG]    -0.0118      0.005     -2.362      0.018      -0.022      -0.002
regions[T.KAT]    -0.2414      0.009    -28.252      0.000      -0.258      -0.225
regions[T.KIG]    -0.2975      0.005    -55.388      0.000      -0.308      -0.287
regions[T.KIL]     0.0402      0.005      7.879      0.000       0.030       0.050

Omnibus:                    34235.368   Durbin-Watson:                   1.126
Prob(Omnibus):                    0.0   Jarque-Bera (JB):            38535.362
Skew:                           0.516   Prob(JB):                          0.0
Kurtosis:                       3.174   Cond. No.                         29.5


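The gender gap did not close between 2013 and 2015. One caveat: the 2015 summary matches the 2013 summary exactly (the same 844,297 observations, the same intercept and F-statistic), so the 2015 file should be checked against the 2013 file before reading this as a three-year trend.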
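
Because the full summaries are long, it can help to pull out only the coefficient on sex for each year. The following is a sketch on synthetic data, not the PSLE files: `make_fake_year` and `sex_effect` are hypothetical helpers, the generated scores have a built-in gap, and the formula names the columns directly (`SEX`, `C(Region)`) rather than relying on the local variables used in the function above. `params` and `conf_int()` belong to the fitted statsmodels results object.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def make_fake_year(gap, n=2000, seed=0):
    """Hypothetical helper: synthetic scores with a known girls-minus-boys gap."""
    rng = np.random.default_rng(seed)
    sex = rng.integers(0, 2, size=n)  # 1 = female, 0 = male
    region = rng.choice(["ARU", "DAR", "DOD"], size=n)
    shift = pd.Series(region).map({"ARU": 0.0, "DAR": 0.2, "DOD": -0.3}).to_numpy()
    score = 2.7 + gap * sex + shift + rng.normal(0, 0.5, size=n)
    return pd.DataFrame({"SEX": sex, "Region": region, "CalcAverage": score})

def sex_effect(data):
    """Return the SEX coefficient and its 95% confidence interval."""
    clean = data.dropna(axis=0, how="any")
    fit = ols("CalcAverage ~ SEX + C(Region)", data=clean).fit()
    low, high = fit.conf_int().loc["SEX"]
    return {"coef": fit.params["SEX"], "ci_low": low, "ci_high": high}

# Two made-up "years" with different true gaps.
years = {"year_a": make_fake_year(-0.13, seed=1),
         "year_b": make_fake_year(-0.10, seed=2)}
table = pd.DataFrame({name: sex_effect(df) for name, df in years.items()}).T
print(table)
```

A table like this, one row per year, makes it easier to judge whether the confidence intervals overlap from year to year.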

In [6]:
#Removing NaNs
psle2014_noNA = psle2014.dropna(axis=0, how='any')
psle2015_noNA = psle2015.dropna(axis=0, how='any')

def compare_gender_regions(dataset):
    """
    Find the difference between girls' and boys' mean calculated average score for a given year's PSLE data.
    Returns a dictionary with the regions as keys and the difference in means (girls minus boys) as the values.
    """
    regions = dataset['Region'].unique()
    regions_diff = {}
    for r in regions:
        subdataset = dataset[dataset['Region']==r]
        diff = subdataset[subdataset['SEX']==1]['CalcAverage'].mean() - subdataset[subdataset['SEX']==0]['CalcAverage'].mean()
        regions_diff[r]=diff
    return regions_diff

compare_gender_regions(psle2013_noNA)

{'ARU': -0.023141147489854408,
 'DAR': -0.09362856240003659,
 'DOD': -0.06634530408093031,
 'GEI': -0.3451214977636692,
 'IRI': -0.06522874402395828,
 'KAG': -0.0876283415509942,
 'KAT': -0.31380848080974566,
 'KIG': -0.3064879885989309,
 'KIL': 0.0981385920588056,
 'LIN': -0.18273635431305912,
 'MAN': -0.04800686226493589,
 'MAR': -0.3020155064176664,
 'MBE': -0.07536204126413892,
 'MOR': -0.06866359881960493,
 'MTW': -0.13172323601629898,
 'MWA': -0.3417855354631709,
 'NJO': -0.01599636056831688,
 'PWA': -0.10229727665507848,
 'RUK': -0.22958881314975477,
 'RUV': -0.05063593018311252,
 'SHI': -0.2487913446562553,
 'SIM': -0.3388181979412077,
 'SIN': -0.06692392836480598,
 'TAB': -0.15588035339163886,
 'TAN': -0.026987524373916383}

In [7]:
compare_gender_regions(psle2014_noNA)

{'ARU': -0.04304838875601735,
 'DAR': -0.10728923916743893,
 'DOD': -0.06877898497006907,
 'GEI': -0.33780788869552625,
 'IRI': -0.10406749319160635,
 'KAG': -0.11943932717320349,
 'KAT': -0.3127103731389007,
 'KIG': -0.30657126192412276,
 'KIL': 0.08959842160658038,
 'LIN': -0.14658345312464416,
 'MAN': -0.02600321728709032,
 'MAR': -0.2823405091414042,
 'MBE': -0.07713467764184223,
 'MOR': -0.057185080899470275,
 'MTW': -0.1002963885393946,
 'MWA': -0.2871856001804014,
 'NJO': -0.006775464237100515,
 'PWA': -0.08346493584668524,
 'RUK': -0.2433839255256487,
 'RUV': -0.06754111862853973,
 'SHI': -0.23781325536459752,
 'SIM': -0.34318408504683573,
 'SIN': -0.07621054512827596,
 'TAB': -0.1267053585955633,
 'TAN': 0.012060385122616868}

In [8]:
compare_gender_regions(psle2015_noNA)

{'ARU': -0.023141147489854408,
 'DAR': -0.09362856240003659,
 'DOD': -0.06634530408093031,
 'GEI': -0.34512149776366785,
 'IRI': -0.06522874402395784,
 'KAG': -0.08762834155099242,
 'KAT': -0.31380848080974566,
 'KIG': -0.30648798859893,
 'KIL': 0.0981385920588056,
 'LIN': -0.18273635431306268,
 'MAN': -0.04800686226493589,
 'MAR': -0.3020155064176664,
 'MBE': -0.07536204126414026,
 'MOR': -0.06866359881960493,
 'MTW': -0.13172323601629632,
 'MWA': -0.3417855354631709,
 'NJO': -0.015996360568318657,
 'PWA': -0.10229727665507848,
 'RUK': -0.22958881314975477,
 'RUV': -0.05063593018311785,
 'SHI': -0.24879134465625707,
 'SIM': -0.3388181979412077,
 'SIN': -0.06692392836480598,
 'TAB': -0.15588035339163886,
 'TAN': -0.026987524373916383}
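
The per-region gaps can also be computed without an explicit loop, and placing the years side by side makes it easy to check whether two years are suspiciously similar (the 2013 and 2015 dictionaries above agree to more than ten decimal places). This is a sketch, assuming each frame has `Region`, `SEX` and `CalcAverage` columns; the frames below are tiny made-up stand-ins, and `region_gaps` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def region_gaps(data):
    """Girls' mean minus boys' mean CalcAverage, one value per region."""
    means = data.groupby(["Region", "SEX"])["CalcAverage"].mean().unstack("SEX")
    return means[1] - means[0]

# Made-up stand-ins for two yearly files; year_b is a deliberate copy of year_a.
year_a = pd.DataFrame({
    "Region": ["ARU", "ARU", "DAR", "DAR"],
    "SEX": [1, 0, 1, 0],
    "CalcAverage": [2.5, 2.6, 2.9, 3.0],
})
year_b = year_a.copy()

# One column per year, one row per region.
gaps = pd.DataFrame({"year_a": region_gaps(year_a), "year_b": region_gaps(year_b)})
print(gaps)

# True here because year_b is a copy; on real files, True would suggest duplicated data.
print(np.allclose(gaps["year_a"], gaps["year_b"]))
```

Sorting the result with `gaps.sort_values("year_a")` would list the regions with the widest gaps first.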