# "Wulu Data Analysis"
> "Blog post summarising the findings of our study using the Wulu chatbot for the Stanford Longevity Design Challenge 2021."

- toc: false
- branch: master
- badges: false
- comments: true
- hide: false
- author: Bhavya Jha



In [1]:
#hide
import pandas as pd
import numpy as np
import altair as alt
from altair.expr import datum
import json
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
!pip install altair --upgrade



## Wulu Data Analysis

This post presents the data analysis of the attitude changes that we perceived in our project for the Stanford Longevity Design Challenge. For a more detailed account of how the project was planned and executed, and what we learned from it, read our [report](https://www.hks.harvard.edu/sites/default/files/centers/cid/files/publications/CID_Wiener_Inequality%20Award%20Research/Pallavi%20Khare%20(1-A).pdf).
This project won 3rd place out of more than 220 entries from 40 countries at the Stanford Longevity Design Challenge 2021.

In [2]:
#hide
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
!ls '/content/drive/MyDrive/WUL/ProjectH'

Mounted at /content/drive/
basic.csv  trend.csv  TRTAnalysis.csv


In [3]:
#hide
df = pd.read_csv('/content/drive/MyDrive/WUL/ProjectH/basic.csv')
df_trend = pd.read_csv('/content/drive/MyDrive/WUL/ProjectH/trend.csv')
df.drop(['Unnamed: 0'],axis=1, inplace=True)
df_trend.drop(['Unnamed: 0'],axis=1, inplace=True)


In [4]:
#hide
df.head()

Unnamed: 0,SNO,Age,Gender,Father edu,Mother edu,baseline 7,baseline 8,baseline 9,baseline 10,baseline 11,baseline 12,baseline 13,baseline 14,baseline 15,baseline 16,baseline 17,baseline 18,baseline 19,baseline 20,end 7,end 8,end 9,end 10,end 11,end 12,end 13,end 14,end 15,end 16,end 17,end 18,end 19,end 20,baseline score,endline score
0,1,17,F,1,1,1,0,0,1,0,0,0,0,1,0,1,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,1,1,6,8
1,2,16,F,1,3,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,5,4
2,3,18,M,2,2,1,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,4,7
3,4,18,M,3,2,1,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,5,1
4,5,17,F,1,1,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,11,13


In [5]:
#hide
df1 = df[['Age','Gender','baseline score','endline score']]
df11 = pd.melt(df1, id_vars=['Gender','Age'], value_vars=[ 'baseline score', 'endline score'])
df11.head()

Unnamed: 0,Gender,Age,variable,value
0,F,17,baseline score,6
1,F,16,baseline score,5
2,M,18,baseline score,4
3,M,18,baseline score,5
4,F,17,baseline score,11


In [6]:
#hide

# % positive change in attitude
#the change in attitude for girls was ___% and for boys was ___% with girls having a higher change by ____%.

#count_less_than_hs = (df11_mother[(df11_mother['Mother edu']=='Less than High School') & (df11_mother['variable']=='baseline score')]['value']).count()
#print(count_less_than_hs)
#x = (sum_less_than_hs_end - sum_less_than_hs_base)/sum_less_than_hs
#print('x = ', x)
#sum_more_than_hs_base = sum(df11_mother[(df11_mother['Mother edu']!='Less than High School') & (df11_mother['Mother edu']!='Not mentioned') & (df11_mother['variable']=='baseline score')]['value'])
#print(sum_more_than_hs_base)
mean_base = (df11[(df11['variable']=='baseline score')]['value']).mean()
print('mean of baseline score = ', mean_base)
mean_end = (df11[(df11['variable']=='endline score')]['value']).mean()
print('mean of endline score = ', mean_end)
print('% positive change in attitude = ', 100*(mean_end - mean_base)/mean_base )
sum_base = (df11[(df11['variable']=='baseline score')]['value']).sum()
sum_end = (df11[(df11['variable']=='endline score')]['value']).sum()
count_base = (df11[(df11['variable']=='baseline score')]['value']).count()
count_end = (df11[(df11['variable']=='endline score')]['value']).count()
print('check: sum base = ', sum_base, ' | sum end = ', sum_end, ' count = ', count_base, count_end)
print('% change = sum end - sum base / sum base = ', 100*(sum_end - sum_base)/sum_base)

mean_base_female = (df11[(df11['Gender']=='F') & (df11['variable']=='baseline score')]['value']).mean()
print('mean of baseline score female = ', mean_base_female)
mean_end_female = (df11[(df11['Gender']=='F') & (df11['variable']=='endline score')]['value']).mean()
print('mean of endline score female= ', mean_end_female)
mean_base_male = (df11[(df11['Gender']=='M') & (df11['variable']=='baseline score')]['value']).mean()
print('mean of baseline score male = ', mean_base_male)
mean_end_male = (df11[(df11['Gender']=='M') & (df11['variable']=='endline score')]['value']).mean()
print('mean of endline score male= ', mean_end_male)
print('% positive change in attitude females = ', 100*(mean_end_female - mean_base_female)/mean_base_female )
print('% positive change in attitude males = ', 100*(mean_end_male - mean_base_male)/mean_base_male )


mean of baseline score =  5.614130434782608
mean of endline score =  6.320652173913044
% positive change in attitude =  12.584704743465648
check: sum base =  2066  | sum end =  2326  count =  368 368
% change = sum end - sum base / sum base =  12.584704743465634
mean of baseline score female =  6.241025641025641
mean of endline score female=  7.164102564102564
mean of baseline score male =  4.907514450867052
mean of endline score male=  5.369942196531792
% positive change in attitude females =  14.790468364831558
% positive change in attitude males =  9.422850412249701
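
The percentages printed above are relative changes in mean score, that is, 100 × (endline mean − baseline mean) / baseline mean. A minimal sketch of the same calculation with a single `groupby`, assuming a frame shaped like `df1`; the `toy` data and the `pct_change_in_mean` helper are hypothetical and introduced only for illustration:

```python
import pandas as pd

# Hypothetical toy data with the same columns as df1 (not the study data).
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M"],
    "baseline score": [4, 6, 5, 5],
    "endline score": [6, 6, 5, 6],
})

def pct_change_in_mean(frame, by=None):
    """Relative change in mean score from baseline to endline, in percent."""
    cols = ["baseline score", "endline score"]
    # With a grouping column we get one row of means per group,
    # otherwise a single pair of means for the whole sample.
    means = frame.groupby(by)[cols].mean() if by else frame[cols].mean()
    return 100 * (means["endline score"] - means["baseline score"]) / means["baseline score"]

overall = pct_change_in_mean(toy)              # one number for the whole sample
by_gender = pct_change_in_mean(toy, "Gender")  # one number per gender
print(overall)
print(by_gender)
```

Because every participant has both a baseline and an endline score, the ratio of sums and the ratio of means give the same percentage, which is what the `check:` line in the output confirms.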


## Analysis
### Comparison between baseline and endline questionnaire scores using only the best option as the answer

The following chart plots questionnaire score against age, with gender as the differentiating parameter.
Susceptibility to a positive change in attitude appears to increase with age, and this change is more pronounced among the girls.
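
One way to check the trend described above numerically, rather than only visually, is to tabulate the mean score change per age and gender. This is a minimal sketch, assuming a frame with the same `Age`, `Gender`, `baseline score` and `endline score` columns as `df1`; the `toy` data is hypothetical and is not the study data:

```python
import pandas as pd

# Hypothetical toy data with the same columns as df1 (not the study data).
toy = pd.DataFrame({
    "Age": [15, 15, 17, 17],
    "Gender": ["F", "M", "F", "M"],
    "baseline score": [5, 4, 6, 5],
    "endline score": [6, 4, 9, 6],
})

# Per-participant change, then the mean change for each age/gender cell.
change_by_age = (
    toy.assign(change=toy["endline score"] - toy["baseline score"])
       .pivot_table(index="Age", columns="Gender", values="change", aggfunc="mean")
)
print(change_by_age)
```

Rows are ages and columns are genders, so a change that grows with age shows up as values increasing down each column.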

In [7]:
#hide_input
input_dropdown = alt.binding_radio(options=['baseline score','endline score'])
selection = alt.selection_single(fields=['variable'], bind=input_dropdown, name='Score')
color = alt.condition(selection,
                    alt.Color('Gender:N', legend=None),
                    alt.value('lightgray'))
domain = [12,22]
range = [-5,25]
alt.Chart(df11).mark_circle().encode(
    x=alt.X('Age:Q',scale=alt.Scale(domain=domain)),
    y=alt.Y('value:Q',scale=alt.Scale(domain=range)),
    size = 'count()',
    color='Gender:N',
    tooltip=['Age','Gender','variable','value','count()']
).add_selection(
    selection
).transform_filter(
    selection
).interactive().properties(height=400, width=400, title='Combined charts for baseline and endline scores')

In [8]:
#hide
domain = [14,20]
range = [0,15]
best_choice_base = alt.Chart(df1).mark_circle().encode(
    x=alt.X('Age:Q',scale=alt.Scale(domain=domain)),
    y=alt.Y('baseline score:Q',scale=alt.Scale(domain=range)),
    size = 'count()',
    color='Gender',
    tooltip=['Age','Gender','baseline score','count()']
).interactive().properties(height=400, width=400, title='Chart for best choice baseline scores')

In [9]:
#hide_input
domain = [14,20]
range = [0,15]
best_choice_end = alt.Chart(df1).mark_circle().encode(
    x=alt.X('Age:Q',scale=alt.Scale(domain=domain)),
    y=alt.Y('endline score:Q',scale=alt.Scale(domain=range)),
    size = 'count()',
    color='Gender',
    tooltip=['Age','Gender','endline score','count()']
).interactive().properties(height=400, width=400, title='Chart for best choice endline scores')

best_choice_base | best_choice_end

In [10]:
#hide_input
domain = [12,22]
range = [-5,25]
alt.Chart(df_trend).mark_circle().encode(
    x=alt.X('Age:Q',scale=alt.Scale(domain=domain)),
    y=alt.Y('baseline score:Q',scale=alt.Scale(domain=range)),
    size = 'count()',
    color='Gender',
    tooltip=['Age','Gender','baseline score','count()']
).interactive().properties(height=400, width=400, title='Chart for graded baseline scores')

In [11]:
#hide_input
domain = [12,22]
range = [-5,25]
alt.Chart(df_trend).mark_circle().encode(
    x=alt.X('Age:Q',scale=alt.Scale(domain=domain)),
    y=alt.Y('endline score:Q',scale=alt.Scale(domain=range)),
    size = 'count()',
    color='Gender',
    tooltip=['Age','Gender','endline score','count()']
).interactive().properties(height=400, width=400, title='Chart for graded endline scores')

### Trends with respect to parents' education levels
#### Based on
*   Gender
*   Age
*   Education
*   Count of children



In [12]:
#hide
df_trend.head()

Unnamed: 0,SNO,Age,Gender,Father edu,Mother edu,baseline 7,baseline 8,baseline 9,baseline 10,baseline 11,baseline 12,baseline 13,baseline 14,baseline 15,baseline 16,baseline 17,baseline 18,baseline 19,baseline 20,end 7,end 8,end 9,end 10,end 11,end 12,end 13,end 14,end 15,end 16,end 17,end 18,end 19,end 20,baseline score,endline score
0,1,17,F,1,1,1,1,0,2,-1,-1,-1,1,2,-1,1,0,1,1,1,2,1,1,-1,1,1,2,2,2,1,0,1,1,6,15
1,2,16,F,1,3,1,0,1,1,0,-1,1,1,2,2,1,0,0,1,1,0,1,1,0,1,-1,0,2,1,0,0,1,1,10,8
2,3,18,M,2,2,1,-1,1,1,-1,-1,0,1,2,1,0,0,1,1,1,-1,2,2,0,0,1,-1,2,1,1,0,1,1,6,10
3,4,18,M,3,2,1,2,2,1,0,-1,-1,0,2,-1,0,0,0,1,1,0,1,1,-1,0,-1,1,0,0,0,0,0,0,6,2
4,5,17,F,1,1,1,0,2,2,1,2,2,2,2,2,1,0,1,1,1,2,2,2,2,2,2,2,2,2,1,0,1,1,19,22


In [13]:
#hide
dft = df_trend[['Age','Gender','Father edu', 'Mother edu' ,'baseline score','endline score']]
df11_mother = pd.melt(dft, id_vars=['Mother edu','Gender','Age'], value_vars=[ 'baseline score','endline score'])
df11_father = pd.melt(dft, id_vars=['Father edu','Gender','Age'], value_vars=[  'baseline score','endline score'])

edu_level = { 1: 'Less than High School', 2:'High School', 3:'Graduate / Diploma', 4:'Post-graduate', 0: 'Not mentioned' }
df11_mother['Mother edu'] = pd.Categorical(df11_mother['Mother edu'].map(edu_level).fillna('No match'))
df11_father['Father edu'] = pd.Categorical(df11_father['Father edu'].map(edu_level).fillna('No match'))

df11_mother.head()
df11_father.head()

Unnamed: 0,Father edu,Gender,Age,variable,value
0,Less than High School,F,17,baseline score,6
1,Less than High School,F,16,baseline score,10
2,High School,M,18,baseline score,6
3,Graduate / Diploma,M,18,baseline score,6
4,Less than High School,F,17,baseline score,19


In [14]:
#hide_input
#layered histogram - mother's education levels are the colours - x axis is age - y axis is score
alt.Chart(df11_mother).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    alt.X('Age:O', bin=True, title='Age of children'),
    alt.Y('count()', stack=None, title='Number of children'),
    alt.Color('Mother edu:N')
).properties(height=400, width=600, title='Mother\'s education level VS age of child')

In [15]:
#hide_input
alt.Chart(df11_mother).mark_area().encode(
    x='Age:O',
    y='count()',
    color='Mother edu:N',
    row=alt.Row('Mother edu:N')
).properties(height=100, width=400, title='Number of children by age, for each education level of the mother')

In [16]:
#hide_input
selector = alt.selection_single(empty='all', fields=['Mother edu'])

color_scale = alt.Scale(scheme='dark2')

base = alt.Chart(df11_mother).properties(
    width=400,
    height=400
).add_selection(selector)

# points = base.mark_point(filled=True, size=200).encode(
#    x=alt.X('Gender:Q',
#            scale=alt.Scale(domain=['M','F'])),
#    y=alt.Y('count():Q',
#            scale=alt.Scale(domain=[0,250])),
#    color=alt.condition(selector,
#                        'Gender:N',
#                        alt.value('lightgray'),
#                        scale=color_scale),
#)

hists = base.mark_area(opacity=0.5, thickness=100, size=30).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    x=alt.X('Age',
            bin=True, # step keeps bin size the same
            scale=alt.Scale(domain=[10,25])),
    y=alt.Y('count()',
            stack=None,
            scale=alt.Scale(domain=[0,200])),
    color=alt.Color('Mother edu:N',
                    scale=color_scale)
).transform_filter(
    selector
)


hists

In [17]:
#hide_input
selector = alt.selection_single(empty='all', fields=['Mother edu'])

color_scale = alt.Scale(scheme='dark2')

base = alt.Chart(df11_mother).properties(
    width=400,
    height=400
).add_selection(selector)

points = base.mark_point(filled=True, size=200).encode(
    x=alt.X('value:Q',
            scale=alt.Scale(domain=[-10,30]), title='Questionnaire Scores'),
    y=alt.Y('count():Q',
            scale=alt.Scale(domain=[0,40]),title='Number of children'),
    color=alt.condition(selector,
                        'Gender:N',
                        alt.value('lightgray'),
                        scale=color_scale),
)

hists = base.mark_area(opacity=0.5, thickness=100, size=30).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    x=alt.X('Age',
            bin=True, # step keeps bin size the same
            scale=alt.Scale(domain=[10,25])),
    y=alt.Y('count()',
            stack=None,
            scale=alt.Scale(domain=[0,200])),
    color=alt.Color('Mother edu:N',
                    scale=color_scale)
).transform_filter(
    selector
)


points | hists

In [18]:
#hide_input
selector = alt.selection_single(empty='all', fields=['Mother edu'])

color_scale = alt.Scale(scheme='dark2')

base = alt.Chart(df11_mother).properties(
    width=400,
    height=400
).add_selection(selector)

points = base.mark_bar().encode(
    x=alt.X('value:Q',
            scale=alt.Scale(domain=[-5,25]), title='Questionnaire Scores'),
    y=alt.Y('count():Q',
            scale=alt.Scale(domain=[0,60]),title='Number of children'),
    color=alt.condition(selector,
                        'Mother edu:N',
                        alt.value('lightgray'),
                        scale=color_scale),
)

hists = base.mark_area(opacity=0.5, thickness=100, size=30).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    x=alt.X('Age:Q',
            bin=True, # step keeps bin size the same
            scale=alt.Scale(domain=[10,25])),
    y=alt.Y('count()',
            stack=None,
            scale=alt.Scale(domain=[0,200])),
    color=alt.Color('Mother edu:N',
                    scale=color_scale)
).transform_filter(
    selector
)


points | hists

In [19]:
#hide_input
selector = alt.selection_single(empty='all', fields=['Mother edu'])

color_scale = alt.Scale(scheme='dark2')

base = alt.Chart(df11_mother).properties(
    width=400,
    height=400
).add_selection(selector)

points = base.mark_bar().encode(
    x=alt.X('value:Q',
            scale=alt.Scale(domain=[-5,25]), title='Questionnaire Scores'),
    y=alt.Y('count():Q',
            scale=alt.Scale(domain=[0,60]),title='Number of children'),
    color=alt.condition(selector,
                        'Mother edu:N',
                        alt.value('lightgray'),
                        scale=color_scale),
)

hists = base.mark_bar().encode(
    x=alt.X('value',
            scale=alt.Scale(domain=[-5,25]),title='Questionnaire Scores'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,60]),title='Number of children'),
    color=alt.Color('Mother edu:N',
                    scale=color_scale)
).transform_filter(
    selector
)


hists

In [20]:
#hide_input

selector = alt.selection_single(empty='all', fields=['Mother edu'])

color_scale = alt.Scale(scheme='dark2')

base = alt.Chart(df11_mother).properties(
    width=400,
    height=400
).add_selection(selector)

brush = alt.selection_single(empty='all', fields=['Mother edu'])

bars = base.mark_bar().encode(
    x=alt.X('value:Q',
            scale=alt.Scale(domain=[-5,25]), title='Graded Questionnaire Scores'),
    y=alt.Y('count():Q',
            scale=alt.Scale(domain=[0,60]),title='Number of children'),
).transform_filter(
    selector
)

hists = base.mark_bar(size=30).encode(
    x=alt.X('Age',
            scale=alt.Scale(domain=[13,20]),title='Age'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,300]),title='Number of children'),
)


bars | hists

In [21]:
#hide
dft['Mother edu'] = pd.Categorical(dft['Mother edu'].map(edu_level).fillna('No match'))
dft['Father edu'] = pd.Categorical(dft['Father edu'].map(edu_level).fillna('No match'))


In [22]:
#hide

#mother's education and score on one graph and see the impact

basesums_mother = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Mother\'s education level'),
    y=alt.Y('mean(baseline score)',
            scale=alt.Scale(domain=[0,15]),title='Mean of baseline score')
)
endsums_mother = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Mother\'s education level'),
    y=alt.Y('mean(endline score)',
            scale=alt.Scale(domain=[0,15]),title='Mean of endline score')
)
counts_mother = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Mother\'s education level'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,220]),title='Number of children')
)

#father's education and score on one graph and see the impact
#edu_level = { 1: 'Less than High School', 2:'High School', 3:'Graduate / Diploma', 4:'Post-graduate', 0: 'Not mentioned' }

basesums_father = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Father\'s education level'),
    y=alt.Y('mean(baseline score)',
            scale=alt.Scale(domain=[0,15]),title='Mean of baseline score')
)
endsums_father = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Father\'s education level'),
    y=alt.Y('mean(endline score)',
            scale=alt.Scale(domain=[0,15]),title='Mean of endline score')
)
counts_father = alt.Chart(dft).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate','Not mentioned']),title='Father\'s education level'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,220]),title='Number of children')    
)
(counts_father | counts_mother )&( basesums_father | endsums_father | basesums_mother | endsums_mother )
#dft['Father edu'].head()

In [23]:
#hide

dft.head()
print('\nMother\'s edu levels\n')
for i in ['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate']:
  print('Mean baseline score for ', i, ' = ', (dft[(dft['Mother edu']==i)]['baseline score']).mean())
  print('Mean endline score for ', i, ' = ', (dft[(dft['Mother edu']==i)]['endline score']).mean())
print('\nFather\'s edu levels\n')
for i in ['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate']:
  print('Mean baseline score for ', i, ' = ', (dft[(dft['Father edu']==i)]['baseline score']).mean())
  print('Mean endline score for ', i, ' = ', (dft[(dft['Father edu']==i)]['endline score']).mean())


Mother's edu levels

Mean baseline score for  Less than High School  =  11.088785046728972
Mean endline score for  Less than High School  =  12.5
Mean baseline score for  High School  =  10.44186046511628
Mean endline score for  High School  =  11.94186046511628
Mean baseline score for  Graduate / Diploma  =  12.241379310344827
Mean endline score for  Graduate / Diploma  =  13.827586206896552
Mean baseline score for  Post-graduate  =  12.166666666666666
Mean endline score for  Post-graduate  =  13.333333333333334

Father's edu levels

Mean baseline score for  Less than High School  =  11.023121387283236
Mean endline score for  Less than High School  =  12.508670520231213
Mean baseline score for  High School  =  10.517241379310345
Mean endline score for  High School  =  12.267241379310345
Mean baseline score for  Graduate / Diploma  =  11.666666666666666
Mean endline score for  Graduate / Diploma  =  12.529411764705882
Mean baseline score for  Post-graduate  =  12.153846153846153
Mean 

In [24]:
#hide
#Chart to compare the baseline and endline scores under each education level of parent.
compare_mother = alt.Chart(df11_mother).mark_bar().encode(
    x='variable',
    y='mean(value)',
    color='variable',
    column=alt.Column('Mother edu', sort=['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate', 'Not mentioned'], title='Mother\'s education level')
).properties(width=80)
compare_father = alt.Chart(df11_father).mark_bar().encode(
    x='variable',
    y='mean(value)',
    color='variable',
    column=alt.Column('Father edu', sort=['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate', 'Not mentioned'], title='Father\'s education level')
).properties(width=80)
compare_mother & compare_father

In [25]:
#hide

#Adolescents whose mothers have had education as high school and above showed a x% change while 
#adolescents whose mothers did not have high school education showed y% change. 
#The difference was z% more change for high school and above educated mothers.
sum_less_than_hs_base = sum(df11_mother[(df11_mother['Mother edu']=='Less than High School') & (df11_mother['variable']=='baseline score')]['value'])
print(sum_less_than_hs_base)
sum_less_than_hs_end = sum(df11_mother[(df11_mother['Mother edu']=='Less than High School') & (df11_mother['variable']=='endline score')]['value'])
print(sum_less_than_hs_end)
count_less_than_hs = (df11_mother[(df11_mother['Mother edu']=='Less than High School') & (df11_mother['variable']=='baseline score')]['value']).count()
print(count_less_than_hs)
x = (sum_less_than_hs_end - sum_less_than_hs_base)/sum_less_than_hs_base
print('x = ', x)
sum_more_than_hs_base = sum(df11_mother[(df11_mother['Mother edu']!='Less than High School') & (df11_mother['Mother edu']!='Not mentioned') & (df11_mother['variable']=='baseline score')]['value'])
print(sum_more_than_hs_base)
sum_more_than_hs_end = sum(df11_mother[(df11_mother['Mother edu']!='Less than High School') & (df11_mother['Mother edu']!='Not mentioned') & (df11_mother['variable']=='endline score')]['value'])
print(sum_more_than_hs_end)
count_more_than_hs = (df11_mother[(df11_mother['Mother edu']!='Less than High School') & (df11_mother['Mother edu']!='Not mentioned') & (df11_mother['variable']=='baseline score')]['value']).count()
print(count_more_than_hs)
y = (sum_more_than_hs_end - sum_more_than_hs_base)/sum_more_than_hs_base
print('y = ', y)
z = y - x
print('z = ', z)


2373
2675
214
x =  0.12726506531816267
1326
1508
121
y =  0.13725490196078433
z =  0.009989836642621652


In [26]:
#hide
df.head()

Unnamed: 0,SNO,Age,Gender,Father edu,Mother edu,baseline 7,baseline 8,baseline 9,baseline 10,baseline 11,baseline 12,baseline 13,baseline 14,baseline 15,baseline 16,baseline 17,baseline 18,baseline 19,baseline 20,end 7,end 8,end 9,end 10,end 11,end 12,end 13,end 14,end 15,end 16,end 17,end 18,end 19,end 20,baseline score,endline score
0,1,17,F,1,1,1,0,0,1,0,0,0,0,1,0,1,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,1,1,6,8
1,2,16,F,1,3,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,5,4
2,3,18,M,2,2,1,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,1,1,0,0,0,0,1,0,1,0,1,1,4,7
3,4,18,M,3,2,1,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,5,1
4,5,17,F,1,1,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,11,13


In [27]:
#hide
df_best = df[['Age','Gender','Father edu', 'Mother edu' ,'baseline score','endline score']]
df_best_mother = pd.melt(df_best, id_vars=['Mother edu','Gender','Age'], value_vars=[ 'baseline score','endline score'])
df_best_father = pd.melt(df_best, id_vars=['Father edu','Gender','Age'], value_vars=[  'baseline score','endline score'])

edu_level = { 1: 'Less than High School', 2:'High School', 3:'Graduate / Diploma', 4:'Post-graduate', 0: 'Not mentioned' }
df_best['Mother edu'] = pd.Categorical(df_best['Mother edu'].map(edu_level).fillna('No match'))
df_best['Father edu'] = pd.Categorical(df_best['Father edu'].map(edu_level).fillna('No match'))

df_best_mother['Mother edu'] = pd.Categorical(df_best_mother['Mother edu'].map(edu_level).fillna('No match'))
df_best_father['Father edu'] = pd.Categorical(df_best_father['Father edu'].map(edu_level).fillna('No match'))

df_best.head()
#df_best_mother.head()
#df_best_father.head()

Unnamed: 0,Age,Gender,Father edu,Mother edu,baseline score,endline score
0,17,F,Less than High School,Less than High School,6,8
1,16,F,Less than High School,Graduate / Diploma,5,4
2,18,M,High School,High School,4,7
3,18,M,Graduate / Diploma,High School,5,1
4,17,F,Less than High School,Less than High School,11,13


In [28]:
#hide
dft.head()
print('\nMother\'s edu levels\n')
for i in ['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate']:
  print('Mean baseline score for ', i, ' = ', (df_best[(df_best['Mother edu']==i)]['baseline score']).mean())
  print('Mean endline score for ', i, ' = ', (df_best[(df_best['Mother edu']==i)]['endline score']).mean())
print('\nFather\'s edu levels\n')
for i in ['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate']:
  print('Mean baseline score for ', i, ' = ', (df_best[(df_best['Father edu']==i)]['baseline score']).mean())
  print('Mean endline score for ', i, ' = ', (df_best[(df_best['Father edu']==i)]['endline score']).mean())


Mother's edu levels

Mean baseline score for  Less than High School  =  5.747663551401869
Mean endline score for  Less than High School  =  6.383177570093458
Mean baseline score for  High School  =  5.186046511627907
Mean endline score for  High School  =  6.011627906976744
Mean baseline score for  Graduate / Diploma  =  6.137931034482759
Mean endline score for  Graduate / Diploma  =  7.068965517241379
Mean baseline score for  Post-graduate  =  7.166666666666667
Mean endline score for  Post-graduate  =  7.666666666666667

Father's edu levels

Mean baseline score for  Less than High School  =  5.635838150289017
Mean endline score for  Less than High School  =  6.3468208092485545
Mean baseline score for  High School  =  5.362068965517241
Mean endline score for  High School  =  6.275862068965517
Mean baseline score for  Graduate / Diploma  =  5.901960784313726
Mean endline score for  Graduate / Diploma  =  6.333333333333333
Mean baseline score for  Post-graduate  =  6.769230769230769
Mea

In [29]:
#hide

#and compared less high school with high school + graduate + post graduate?
#can you also compare less high school + high school to graduate + post grad?
#best scores + leave out the not mentioned

### Effects of parents' education on mean questionnaire scores

* Adolescents whose mothers were educated beyond high school showed an 85.71% change in their total scores between baseline and endline, while adolescents whose mothers were not educated beyond high school showed a 69% change.

* Adolescents whose fathers were educated beyond high school showed a 32.81% change in their total scores between baseline and endline, while adolescents whose fathers were not educated beyond high school showed a 79.24% change.

Completing secondary education is strongly correlated with delayed age of marriage and pregnancy in girls (United Nations), which suggests that education also leads to more gender-equitable attitudes. One possible explanation is therefore that educated mothers hold more gender-equitable attitudes, and that the intergenerational transmission of gender attitudes (Moen et al., 1997) results in their daughters holding more equitable attitudes as well.
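
A group comparison of this kind can be sketched as follows. The sketch assumes that "above high school" corresponds to the `edu_level` codes 3 and 4 and that code 0 ("Not mentioned") is left out; that grouping is an assumption, and the `toy` data is hypothetical, so the output does not reproduce the percentages quoted above:

```python
import pandas as pd

# Hypothetical toy data; codes follow the notebook's edu_level mapping
# (0 = not mentioned, 1 = less than high school, 2 = high school,
#  3 = graduate / diploma, 4 = post-graduate).
toy = pd.DataFrame({
    "Mother edu": [0, 1, 2, 3, 4],
    "baseline score": [5, 4, 6, 5, 5],
    "endline score": [5, 5, 7, 7, 6],
})

# Leave out participants who did not report the mother's education.
known = toy[toy["Mother edu"] != 0]

# Assumed split: codes 3 and 4 count as "above high school".
group = known["Mother edu"].gt(2).map(
    {True: "Above high school", False: "High school or below"}
)

# Total scores per group, then the relative change in percent.
sums = known.groupby(group)[["baseline score", "endline score"]].sum()
pct_change = 100 * (sums["endline score"] - sums["baseline score"]) / sums["baseline score"]
print(pct_change)
```

Swapping `"Mother edu"` for `"Father edu"` gives the corresponding comparison for fathers.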

In [30]:
#hide_input
#mother's education and score on one graph and see the impact

basesums_mother = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Mother\'s education level'),
    y=alt.Y('mean(baseline score)',
            scale=alt.Scale(domain=[0,8]),title='Mean of baseline score')
)
endsums_mother = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Mother\'s education level'),
    y=alt.Y('mean(endline score)',
            scale=alt.Scale(domain=[0,8]),title='Mean of endline score')
)
counts_mother = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Mother edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Mother\'s education level'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,220]),title='Number of children')
)

#father's education and score on one graph and see the impact
#edu_level = { 1: 'Less than High School', 2:'High School', 3:'Graduate / Diploma', 4:'Post-graduate', 0: 'Not mentioned' }

basesums_father = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Father\'s education level'),
    y=alt.Y('mean(baseline score)',
            scale=alt.Scale(domain=[0,8]),title='Mean of baseline score')
)
endsums_father = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Father\'s education level'),
    y=alt.Y('mean(endline score)',
            scale=alt.Scale(domain=[0,8]),title='Mean of endline score')
)
counts_father = alt.Chart(df_best).mark_bar().encode(
    x=alt.X('Father edu',
            scale=alt.Scale(domain=['Less than High School','High School','Graduate / Diploma','Post-graduate']),title='Father\'s education level'),
    y=alt.Y('count()',
            scale=alt.Scale(domain=[0,220]),title='Number of children')    
)

(counts_father | counts_mother )&( basesums_father | endsums_father | basesums_mother | endsums_mother )
#df_best['Father edu'].head()

In [31]:
#hide_input
compare_mother = alt.Chart(df_best_mother).mark_bar().encode(
    x='variable',
    y='mean(value)',
    color='variable',
    column=alt.Column('Mother edu', sort=['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate', 'Not mentioned'], title='Mother\'s education level')
).properties(width=80)

compare_father = alt.Chart(df_best_father).mark_bar().encode(
    x='variable',
    y='mean(value)',
    color='variable',
    column=alt.Column('Father edu', sort=['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate', 'Not mentioned'], title='Father\'s education level')
).properties(width=80)


text = bars.mark_text(
    align='left',
    baseline='middle',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='mean(value)'
)
compare_mother & compare_father
#(compare_mother + text).facet(column=alt.Column('Mother edu', sort=['Less than High School','High School', 'Graduate / Diploma', 'Post-graduate', 'Not mentioned'], title='Mother\'s education level'))
#(compare_mother + text).properties(height=900) 

In [32]:
#hide 

#Adolescents whose mothers have had education as above high school showed a x% change while 
#adolescents whose mothers did not have above high school education showed y% change. 
#The difference was z% more change above high school educated mothers.
sum_less_than_hs_base = sum(df_best_mother[(df_best_mother['Mother edu']=='Less than High School') & (df_best_mother['variable']=='baseline score')]['value'])
sum_less_than_hs_end = sum(df_best_mother[(df_best_mother['Mother edu']=='Less than High School') & (df_best_mother['variable']=='endline score')]['value'])
count_less_than_hs = (df_best_mother[(df_best_mother['Mother edu']=='Less than High School') & (df_best_mother['variable']=='baseline score')]['value']).count()
#x = (sum_less_than_hs_end - sum_less_than_hs_base)/sum_less_than_hs_base
print('Category = sum of base score, sum of end score, count\nLess than high school = ', sum_less_than_hs_base, sum_less_than_hs_end, count_less_than_hs)

sum_hs_base = sum(df_best_mother[(df_best_mother['Mother edu']=='High School') & (df_best_mother['variable']=='baseline score')]['value'])
sum_hs_end = sum(df_best_mother[(df_best_mother['Mother edu']=='High School') & (df_best_mother['variable']=='endline score')]['value'])
count_hs = (df_best_mother[(df_best_mother['Mother edu']=='High School') & (df_best_mother['variable']=='baseline score')]['value']).count()
print('high school = ', sum_hs_base, sum_hs_end, count_hs)

sum_grad_base = sum(df_best_mother[(df_best_mother['Mother edu']=='Graduate / Diploma') & (df_best_mother['variable']=='baseline score')]['value'])
sum_grad_end = sum(df_best_mother[(df_best_mother['Mother edu']=='Graduate / Diploma') & (df_best_mother['variable']=='endline score')]['value'])
count_grad = (df_best_mother[(df_best_mother['Mother edu']=='Graduate / Diploma') & (df_best_mother['variable']=='baseline score')]['value']).count()
print('grad = ', sum_grad_base, sum_grad_end, count_grad)

sum_pgrad_base = sum(df_best_mother[(df_best_mother['Mother edu']=='Post-graduate') & (df_best_mother['variable']=='baseline score')]['value'])
sum_pgrad_end = sum(df_best_mother[(df_best_mother['Mother edu']=='Post-graduate') & (df_best_mother['variable']=='endline score')]['value'])
count_pgrad = (df_best_mother[(df_best_mother['Mother edu']=='Post-graduate') & (df_best_mother['variable']=='baseline score')]['value']).count()
print('pgrad = ', sum_pgrad_base, sum_pgrad_end, count_pgrad)

sum_more_than_hs_base = sum(df_best_mother[(df_best_mother['Mother edu']!='Less than High School') & (df_best_mother['Mother edu']!='Not mentioned') & (df_best_mother['variable']=='baseline score')]['value'])
sum_more_than_hs_end = sum(df_best_mother[(df_best_mother['Mother edu']!='Less than High School') & (df_best_mother['Mother edu']!='Not mentioned') & (df_best_mother['variable']=='endline score')]['value'])
count_more_than_hs = (df_best_mother[(df_best_mother['Mother edu']!='Less than High School') & (df_best_mother['Mother edu']!='Not mentioned') & (df_best_mother['variable']=='baseline score')]['value']).count()
#y = (sum_more_than_hs_end - sum_more_than_hs_base)/sum_more_than_hs_base
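#Note: this group excludes only 'Less than High School' and 'Not mentioned', so it covers high school and above.
print('More than high school = ', sum_more_than_hs_base, sum_more_than_hs_end, count_more_than_hs)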
#z = y - x
#print('z = ', z)
print('\nSum of baseline scores for less than HS, HS = ', sum_less_than_hs_base+sum_hs_base)
print('Sum of endline scores for less than HS, HS = ', sum_less_than_hs_end + sum_hs_end)
print('Count for less than HS, HS = ', count_less_than_hs + count_hs)
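#Note: despite the label, this is (endline sum - baseline sum) / count, i.e. the mean score change per participant in points, not a percentage.
print('% change for <=HS = ', ( (sum_less_than_hs_end + sum_hs_end)-(sum_less_than_hs_base+sum_hs_base) )/(count_less_than_hs + count_hs) )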

print('Sum of baseline scores for more than HS = ', sum_grad_base + sum_pgrad_base)
print('Sum of endline scores for more than HS = ', sum_grad_end + sum_pgrad_end)
print('Count for more than HS = ', count_grad + count_pgrad )
#Note: as above, this is the mean score change per participant in points, not a percentage.
print('% change for >HS = ', ((sum_grad_end + sum_pgrad_end)-(sum_grad_base + sum_pgrad_base))/(count_grad + count_pgrad) )


Category = sum of base score, sum of end score, count
Less than high school =  1230 1366 214
high school =  446 517 86
grad =  178 205 29
pgrad =  43 46 6
More than high school =  667 768 121

Sum of baseline scores for less than HS, HS =  1676
Sum of endline scores for less than HS, HS =  1883
Count for less than HS, HS =  300
% change for <=HS =  0.69
Sum of baseline scores for more than HS =  221
Sum of endline scores for more than HS =  251
Count for more than HS =  35
% change for >HS =  0.8571428571428571
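
The repeated filter-and-sum lines in the cell above could be condensed with a grouped aggregation. The following is a minimal sketch, not the code used for the study: `summarise_by_education` is a hypothetical helper, the toy frame contains made-up values, and it assumes the long format (`variable` and `value` columns) that `df_best_mother` appears to have.

```python
import pandas as pd

def summarise_by_education(df_long, edu_col):
    """Sum baseline/endline scores and count participants per education level.

    Assumes df_long has one row per participant per score type, with the
    columns edu_col, 'variable' and 'value' (an assumption about the data).
    """
    # One row per education level, one column per score type
    sums = df_long.pivot_table(index=edu_col, columns='variable',
                               values='value', aggfunc='sum')
    # Each participant contributes exactly one baseline row, so count those
    counts = (df_long[df_long['variable'] == 'baseline score']
              .groupby(edu_col)['value'].count())
    out = sums.assign(count=counts)
    # Mean change in score per participant (points, not a percentage)
    out['mean change'] = (out['endline score'] - out['baseline score']) / out['count']
    return out

# Toy data standing in for df_best_mother (values are invented for illustration)
toy = pd.DataFrame({
    'Mother edu': ['High School', 'High School', 'Post-graduate', 'Post-graduate'] * 2,
    'variable': ['baseline score'] * 4 + ['endline score'] * 4,
    'value': [5, 7, 6, 8, 6, 9, 6, 10],
})
summary = summarise_by_education(toy, 'Mother edu')
print(summary)
```

Passing `'Father edu'` as `edu_col` would give the same table for fathers, which avoids duplicating the cell.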


In [33]:
#hide

#Goal: adolescents whose fathers were educated beyond high school showed an x% change, while
#adolescents whose fathers were not educated beyond high school showed a y% change.
#The difference z is the additional change seen for fathers educated beyond high school.
sum_less_than_hs_base = sum(df_best_father[(df_best_father['Father edu']=='Less than High School') & (df_best_father['variable']=='baseline score')]['value'])
sum_less_than_hs_end = sum(df_best_father[(df_best_father['Father edu']=='Less than High School') & (df_best_father['variable']=='endline score')]['value'])
count_less_than_hs = (df_best_father[(df_best_father['Father edu']=='Less than High School') & (df_best_father['variable']=='baseline score')]['value']).count()
#x = (sum_less_than_hs_end - sum_less_than_hs_base)/sum_less_than_hs_base
print('Category = sum of base score, sum of end score, count\nLess than high school = ', sum_less_than_hs_base, sum_less_than_hs_end, count_less_than_hs)

sum_hs_base = sum(df_best_father[(df_best_father['Father edu']=='High School') & (df_best_father['variable']=='baseline score')]['value'])
sum_hs_end = sum(df_best_father[(df_best_father['Father edu']=='High School') & (df_best_father['variable']=='endline score')]['value'])
count_hs = (df_best_father[(df_best_father['Father edu']=='High School') & (df_best_father['variable']=='baseline score')]['value']).count()
print('high school = ', sum_hs_base, sum_hs_end, count_hs)

sum_grad_base = sum(df_best_father[(df_best_father['Father edu']=='Graduate / Diploma') & (df_best_father['variable']=='baseline score')]['value'])
sum_grad_end = sum(df_best_father[(df_best_father['Father edu']=='Graduate / Diploma') & (df_best_father['variable']=='endline score')]['value'])
count_grad = (df_best_father[(df_best_father['Father edu']=='Graduate / Diploma') & (df_best_father['variable']=='baseline score')]['value']).count()
print('grad = ', sum_grad_base, sum_grad_end, count_grad)

sum_pgrad_base = sum(df_best_father[(df_best_father['Father edu']=='Post-graduate') & (df_best_father['variable']=='baseline score')]['value'])
sum_pgrad_end = sum(df_best_father[(df_best_father['Father edu']=='Post-graduate') & (df_best_father['variable']=='endline score')]['value'])
count_pgrad = (df_best_father[(df_best_father['Father edu']=='Post-graduate') & (df_best_father['variable']=='baseline score')]['value']).count()
print('pgrad = ', sum_pgrad_base, sum_pgrad_end, count_pgrad)

sum_more_than_hs_base = sum(df_best_father[(df_best_father['Father edu']!='Less than High School') & (df_best_father['Father edu']!='Not mentioned') & (df_best_father['variable']=='baseline score')]['value'])
sum_more_than_hs_end = sum(df_best_father[(df_best_father['Father edu']!='Less than High School') & (df_best_father['Father edu']!='Not mentioned') & (df_best_father['variable']=='endline score')]['value'])
count_more_than_hs = (df_best_father[(df_best_father['Father edu']!='Less than High School') & (df_best_father['Father edu']!='Not mentioned') & (df_best_father['variable']=='baseline score')]['value']).count()
#y = (sum_more_than_hs_end - sum_more_than_hs_base)/sum_more_than_hs_base
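#Note: this group excludes only 'Less than High School' and 'Not mentioned', so it covers high school and above.
print('More than high school = ', sum_more_than_hs_base, sum_more_than_hs_end, count_more_than_hs)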
#z = y - x
#print('z = ', z)
print('\nSum of baseline scores for less than HS, HS = ', sum_less_than_hs_base+sum_hs_base)
print('Sum of endline scores for less than HS, HS = ', sum_less_than_hs_end + sum_hs_end)
print('Count for less than HS, HS = ', count_less_than_hs + count_hs)
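#Note: despite the label, this is (endline sum - baseline sum) / count, i.e. the mean score change per participant in points, not a percentage.
print('% change for <=HS = ', ( (sum_less_than_hs_end + sum_hs_end)-(sum_less_than_hs_base+sum_hs_base) )/(count_less_than_hs + count_hs) )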

print('Sum of baseline scores for more than HS = ', sum_grad_base + sum_pgrad_base)
print('Sum of endline scores for more than HS = ', sum_grad_end + sum_pgrad_end)
print('Count for more than HS = ', count_grad + count_pgrad )
#Note: as above, this is the mean score change per participant in points, not a percentage.
print('% change for >HS = ', ((sum_grad_end + sum_pgrad_end)-(sum_grad_base + sum_pgrad_base))/(count_grad + count_pgrad) )


Category = sum of base score, sum of end score, count
Less than high school =  975 1098 173
high school =  622 728 116
grad =  301 323 51
pgrad =  88 87 13
More than high school =  1011 1138 180

Sum of baseline scores for less than HS, HS =  1597
Sum of endline scores for less than HS, HS =  1826
Count for less than HS, HS =  289
% change for <=HS =  0.7923875432525952
Sum of baseline scores for more than HS =  389
Sum of endline scores for more than HS =  410
Count for more than HS =  64
% change for >HS =  0.328125
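
The values labelled "% change" in the two outputs above are the score gain divided by the number of participants, which is a mean change in points and not a percentage. As a worked check, the sketch below recomputes both quantities from the printed totals. `change_stats` is a hypothetical helper introduced only for illustration. The totals are copied from the outputs above, and the group printed as "More than high school" is labelled "HS and above" here because its filter includes the High School rows.

```python
def change_stats(base_sum, end_sum, count):
    """Return mean per-participant change (points) and percent change relative to baseline."""
    gain = end_sum - base_sum
    return {
        'mean_change_points': gain / count,
        'percent_change': 100 * gain / base_sum,
    }

# (baseline sum, endline sum, participant count), copied from the printed outputs
totals = {
    ('Mother', 'below HS'): (1230, 1366, 214),
    ('Mother', 'HS and above'): (667, 768, 121),
    ('Mother', '<=HS'): (1676, 1883, 300),
    ('Mother', '>HS'): (221, 251, 35),
    ('Father', 'below HS'): (975, 1098, 173),
    ('Father', 'HS and above'): (1011, 1138, 180),
    ('Father', '<=HS'): (1597, 1826, 289),
    ('Father', '>HS'): (389, 410, 64),
}

results = {group: change_stats(*values) for group, values in totals.items()}

for (parent, level), stats in results.items():
    print(f"{parent} {level}: "
          f"{stats['mean_change_points']:.2f} points per participant, "
          f"{stats['percent_change']:.1f}% of baseline")
```

With the "below high school" versus "high school and above" split, the mean change comes out at roughly 0.64 versus 0.83 points for mothers and about 0.71 versus 0.71 points for fathers. These are plain descriptive figures with no significance test applied.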


### Conclusion

#### Demographic
In general, more than half the participants across both groups had parents with education levels 
below high school. The attitude change observed in adolescents varied with the education level of 
their mothers but not their fathers. A greater change was observed in adolescents whose mothers 
had more education (high school and above). The change in attitude was higher in girls than in boys 
for all questions, especially on questions regarding male involvement at the household level.

#### Gender stereotypes on education
There was a positive shift in attitudes regarding the importance of girls' education and the stereotypes 
attached to educated women. This change was observed irrespective of 
the gender of the participant.

#### Gender roles at home
While there was a significant positive change in the traditional view of a man being the head of the family, 
attitudes regarding boys doing household chores or women moving beyond caregiving roles 
did not see much positive change.

#### Bullying and body shaming
There was a positive change in attitudes towards dealing with bullying and body shaming. 
While the incremental change was highest for bullying, most adolescents across treatment and 
control had a positive attitude towards dealing with body shaming as well. 
Gender stereotypes regarding appearance did not show much positive change.