## import modules

In [1]:
import pandas as pd
from scipy import stats
import numpy as np

## census data processing

In [2]:
# for population data
cols=['State', 'Level', 'Name', 'TRU', 'No_HH', 'TOT_P', 'TOT_M', 'TOT_F']
census=pd.read_excel('datasets/census.xlsx',engine='openpyxl',usecols=cols)
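censusIndia=census.iloc[[0]]  # all-India row, kept as a one-row DataFrame
census=census.loc[(census.Level=='STATE') & (census.TRU=='Total')]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
census=pd.concat([census,censusIndia],ignore_index=True)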
census.sort_values(by=['State'],axis=0,inplace=True)
census.reset_index(drop=True,inplace=True)

## read census language dataset[C-18]

In [3]:
c18=pd.read_excel('datasets/C-18.xlsx',engine='openpyxl',skiprows=6,header=None)

## calculate p-values for three parts

**`What does the code do? Overall description:`**
- First we store the state names corresponding to each state code, so that we can see which states have a significant difference in ratios.
- Then we run through each state, using its code as the identifier, to get the relevant information and store it in lists:
    - to get the total male and female population I have used the 2011 census data
    - to get the population speaking a particular {part} of languages I have used the `C-18` file
    - to get the exactly-two and only-one language populations I have used the same concept as in Q1 [described in the `getRatio` func comments]
    - male percent = 100*(male population of the particular {part: 3+, exactly-2, only-1} in the state)/(total male pop of that state)
    - similarly for the female part
- Note: every item of each list is a dict containing the relevant info.
- Finally, I simply convert each list into a pandas df and save it to a csv file.
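
The subtraction used for the exactly-two and only-one groups can be sketched in isolation. This is only an illustration: the helper names `split_language_groups` and `percent` and all of the counts are hypothetical and are not taken from the census files.

```python
def split_language_groups(total_pop, two_or_more, three_or_more):
    """Split a population into only-one, exactly-two and three-or-more groups.

    Hypothetical helper: mirrors the subtraction described above.
    """
    tri = three_or_more                # speakers of three or more languages
    bi = two_or_more - three_or_more   # exactly two = (two or more) - (three or more)
    uni = total_pop - bi - tri         # everyone else speaks only one language
    return uni, bi, tri


def percent(part, whole):
    """Percentage of `whole` made up by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


# Made-up counts for one gender in one state
uni, bi, tri = split_language_groups(total_pop=1000, two_or_more=300, three_or_more=80)
print(uni, bi, tri)
print(percent(tri, 1000))
```

The three groups always add back up to the total population, which is a quick sanity check on the column choices.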

In [4]:
STATE_NAMES=[]
for state in c18.iloc[:,2].values:
    if not (state in STATE_NAMES):
        STATE_NAMES.append(state)

In [5]:
# useful_data=[]
tri_list=[]
bi_list=[]
uni_list=[]
for i,state in enumerate(STATE_NAMES):
    
    # here i is the state code
    census_row=census[(census['State']==i) & (census['TRU']=='Total')]
    male_pop=census_row['TOT_M'].values[0]
    female_pop=census_row['TOT_F'].values[0]
    
    # C-18 rows for this state with 'Total' in both columns 3 and 4 (filtered once, reused below)
    c18_total=c18[(c18.iloc[:,0]==i) & (c18.iloc[:,4]=='Total') & (c18.iloc[:,3]=='Total')]

    # tri
    tri_male=c18_total.iloc[0,9]
    tri_female=c18_total.iloc[0,10]

    #bi
    bi_male=c18_total.iloc[0,6] - tri_male
    bi_female=c18_total.iloc[0,7] - tri_female

    #uni
    uni_male=male_pop-bi_male-tri_male
    uni_female=female_pop-bi_female-tri_female
    
    p_value=stats.ttest_1samp([tri_male/tri_female,bi_male/bi_female,uni_male/uni_female],[male_pop/female_pop]).pvalue[0]

    # or we could perform simple ttest_ind with variance being unequal like:
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
    # p_value1=stats.ttest_ind([male_pop/female_pop]*3,[tri_male/tri_female,bi_male/bi_female,uni_male/uni_female],equal_var=False).pvalue
    # item={
    #     'state-code':i,
    #     'state-name':state,
    #     'p-value-1samp':p_value,
    #     'male-to-female':male_pop/female_pop,
    #     'tri-male-to-female-ratio':tri_male/tri_female,
    #     'bi-male-to-female-ratio':bi_male/bi_female,
    #     'uni-male-to-female-ratio':uni_male/uni_female
    # }
    tri_item={
        'state-code':i,
        'male-percentage':round(100*tri_male/male_pop,2),
        'female-percentage':round(100*tri_female/female_pop,2),
        'p-value':p_value
    }

    bi_item={
        'state-code':i,
        'male-percentage':round(100*bi_male/male_pop,2),
        'female-percentage':round(100*bi_female/female_pop,2),
        'p-value':p_value
    }

    uni_item={
        'state-code':i,
        'male-percentage':round(100*uni_male/male_pop,2),
        'female-percentage':round(100*uni_female/female_pop,2),
        'p-value':p_value
    }
    tri_list.append(tri_item)
    bi_list.append(bi_item)
    uni_list.append(uni_item)
    # useful_data.append(item)

In [6]:
tri_df=pd.DataFrame(tri_list)
bi_df=pd.DataFrame(bi_list)
uni_df=pd.DataFrame(uni_list)

In [7]:
tri_df.to_csv('outputs/gender-india-c.csv',index=False)
bi_df.to_csv('outputs/gender-india-b.csv',index=False)
uni_df.to_csv('outputs/gender-india-a.csv',index=False)

## how am I obtaining p-value?

- Here `the null hypothesis` states that the means of the two populations are the same (meaning the ratios are not significantly different between males and females).
- `The alternate hypothesis` states that the means of the two populations are not the same (meaning the ratios are significantly different between males and females).
- So a test is needed to decide between them!
- I am doing a simple t-test. In particular I am using Welch's t-test, which is used when the samples have unequal variances, rather than Student's t-test, which is used when the variances of the samples are equal.
- The basic point is that we have continuous features (here `ratios`) and their variances are unequal.
    - Here's the reason why:
        - Let X = vector containing the trilingual ratio, bilingual ratio, and monolingual ratio
        - Y = vector containing the ratio of male:female population or urban:rural population; Y contains one value repeated thrice
        - Since all the values of Y are the same, Var[Y]=0
        - But the 3 values in X will be different and hence Var[X] won't be zero
        - Thus, Var[X] != Var[Y]
- To perform Welch's test we can use the `scipy.stats` module; in particular we can use either of the following options [I have used the first; both give the same answer because, with Var[Y]=0, Welch's test reduces to a one-sample t-test of X against the constant value in Y]
    - use the `ttest_1samp` function with `popmean` being the background ratio of male to female
    - or use the `ttest_ind` func with `equal_var` set to `False` and the `X` and `Y` vectors
In [8]:
tri_df

| | state-code | male-percentage | female-percentage | p-value |
|---|---|---|---|---|
| 0 | 0 | 8.11 | 6.04 | 0.340433 |
| 1 | 1 | 18.96 | 14.19 | 0.537034 |
| 2 | 2 | 5.76 | 4.34 | 0.30633 |
| 3 | 3 | 29.95 | 26.3 | 0.568812 |
| 4 | 4 | 30.75 | 30.21 | 0.733277 |
| 5 | 5 | 2.13 | 1.58 | 0.270833 |
| 6 | 6 | 5.05 | 4.13 | 0.293845 |
| 7 | 7 | 8.25 | 7.91 | 0.587541 |
| 8 | 8 | 1.81 | 1.1 | 0.26823 |
| 9 | 9 | 1.48 | 1.09 | 0.234998 |


## observations
- For all three cases:
    - for no state or UT is the ratio significantly (at the 0.05 level) different, since the p-value is greater than 0.05 for every state/UT. Even at the 0.1 level there is no state/UT for which the ratio is significantly different
    - so for all states/UTs we fail to reject the null hypothesis, that is, we find no evidence that the ratios differ between males and females
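
The significance check can be sketched as a simple filter on the `p-value` column. The frame below is rebuilt by hand from only the ten rows of `tri_df` displayed above, so it is an illustration and not a check of the full output file; the name `shown` is hypothetical.

```python
import pandas as pd

# The ten rows of tri_df displayed above (not the full table)
shown = pd.DataFrame({
    'state-code': list(range(10)),
    'p-value': [0.340433, 0.537034, 0.30633, 0.568812, 0.733277,
                0.270833, 0.293845, 0.587541, 0.26823, 0.234998],
})

# Count the rows whose p-value falls below each significance level
for alpha in (0.05, 0.1):
    significant = shown[shown['p-value'] < alpha]
    print(alpha, len(significant))
```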