# Costa Rican Household Poverty

## Define the Problem
How can we more accurately classify the poverty levels of Costa Rican households from observable attributes (e.g., education level, monthly rent, building materials, or assets) in order to predict their level of need? Any predictive model we build will be evaluated with an F1 score.
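As a rough illustration of the metric, the sketch below computes a per-class F1 score and averages it over the four poverty levels. The macro averaging and the toy labels are assumptions made for illustration; the text above only says that an F1 score will be used.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumed choice)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, not taken from the dataset
y_true = [1, 2, 3, 4, 4, 4]
y_pred = [1, 2, 4, 4, 4, 3]
score = macro_f1(y_true, y_pred, labels=[1, 2, 3, 4])
```

Macro averaging weights every poverty level equally, so a model that ignores the rarer levels is penalized even when its overall accuracy looks high.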

## Identify Client
The client is the Inter-American Development Bank, which wants to assess income qualification for families in need in Costa Rica.

## Describe Dataset and How it was Cleaned/Wrangled

The dataset has 143 columns. Each record describes an individual living in Costa Rica, with attributes mostly relating to their household, education level, and location.

We filled in the missing values, combined binary columns into their respective categorical columns, recoded values for readability, and performed exploratory analysis and inferential statistics.

We want to decide which variables to use and which to remove in order to build a classification model.

In [None]:
#import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
train = pd.read_csv('../input/train.csv')

In [None]:
train.head()

In [None]:
train.columns.to_frame()

## Data Cleaning

Here we create a function to clean the data. We can apply this function to clean both the training set and the test set. 

We filled in missing values for `rez_esc`, `v18q1`, `v2a1`, `meaneduc`, and `SQBmeaned`.

Then we transformed the binary columns into categorical columns related to: 
* housing situation
* education levels
* regions
* relations
* marital
* rubbish location 
* energy source
* toilets
* floor materials
* wall materials
* roof materials
* floor quality
* wall quality
* roof quality
* water provision location
* electricity source

Then we recoded the values in each column for readability.
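The binary-to-categorical step can be sketched on a toy frame. The column names follow the dataset's wall-quality flags, but the rows are hypothetical and the sketch assumes exactly one flag per row is set to 1.

```python
import pandas as pd

# Toy frame with hypothetical rows; one wall-quality flag is set per row
toy = pd.DataFrame({
    'epared1': [1, 0, 0],
    'epared2': [0, 1, 0],
    'epared3': [0, 0, 1],
})
flags = ['epared1', 'epared2', 'epared3']
labels = {'epared1': 'Bad', 'epared2': 'Regular', 'epared3': 'Good'}

# idxmax(axis=1) returns, for each row, the name of the column holding the largest value
toy['wallqual'] = toy[flags].idxmax(axis=1).map(labels)
toy = toy.drop(columns=flags)
```

A row whose flags are all 0 would be assigned the first column in the group, so it is worth checking that each group sums to 1 per row before collapsing it.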

In [None]:
def data_clean(data):
    #fill in missing values
    data['rez_esc']=data['rez_esc'].fillna(0)
    data['v18q1'] = data['v18q1'].fillna(0)
    #owned homes pay no rent; precarious and other situations get the median rent
    med = data['v2a1'].median()
    data.loc[(data['tipovivi1']==1), 'v2a1'] = 0
    data.loc[(data['tipovivi4']==1), 'v2a1'] = med
    data.loc[(data['tipovivi5']==1), 'v2a1'] = med
    #fill missing meaneduc with the mean escolari of the affected household's rows
    meaneduc_nan=data[data['meaneduc'].isnull()][['Id','idhogar','escolari']]
    me=meaneduc_nan.groupby('idhogar')['escolari'].mean().reset_index()
    for row in meaneduc_nan.iterrows():
        idx=row[0]
        idhogar=row[1]['idhogar']
        m=me[me['idhogar']==idhogar]['escolari'].tolist()[0]
        data.at[idx, 'meaneduc']=m
        data.at[idx, 'SQBmeaned']=m*m
        
    #binary columns
    housesitu = ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']
    educlevels = ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7',
             'instlevel8', 'instlevel9']
    regions = ['lugar1', 'lugar2', 'lugar3', 'lugar4', 'lugar5', 'lugar6']
    relations = ['parentesco1', 'parentesco2', 'parentesco3', 'parentesco4', 'parentesco5', 'parentesco6',
            'parentesco7', 'parentesco8', 'parentesco9', 'parentesco10', 'parentesco11', 'parentesco12']
    marital = ['estadocivil1', 'estadocivil2', 'estadocivil3', 'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7']
    rubbish = ['elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 'elimbasu5', 'elimbasu6']
    energy = ['energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4']
    toilets = ['sanitario1', 'sanitario2', 'sanitario3', 'sanitario5', 'sanitario6']
    floormat = ['pisomoscer', 'pisocemento', 'pisoother', 'pisonatur', 'pisonotiene', 'pisomadera']
    wallmat = ['paredblolad', 'paredzocalo', 'paredpreb', 'pareddes', 'paredmad', 'paredzinc', 'paredfibras', 'paredother']
    roofmat = ['techozinc', 'techoentrepiso', 'techocane', 'techootro']
    floorqual = ['eviv1', 'eviv2', 'eviv3']
    wallqual = ['epared1', 'epared2', 'epared3']
    roofqual = ['etecho1', 'etecho2', 'etecho3']
    waterprov = ['abastaguadentro', 'abastaguafuera', 'abastaguano']
    electric = ['public', 'planpri', 'noelec', 'coopele']
    
    
    #make a dictionary
    binaries = {'housesitu':housesitu,
                'educlevels':educlevels,
                'regions':regions,
                'relations':relations,
                'marital':marital,
                'rubbish':rubbish,
                'energy':energy,
                'toilets':toilets,
                'floormat':floormat,
                'wallmat':wallmat,
                'roofmat':roofmat,
                'floorqual':floorqual,
                'wallqual':wallqual,
                'roofqual':roofqual,
                'waterprov':waterprov,
                'electric':electric
               }
    
    #Replacing the binaries with categorical
    for i in binaries.keys():
        data[i] = data[binaries[i]].idxmax(axis=1)
        data.drop(columns=binaries[i], inplace=True)
    
    #recoding values
    hs = {'tipovivi1':'Own', 
      'tipovivi2':'Own/Paying Instllmnts', 
      'tipovivi3':'Rented', 
      'tipovivi4':'Precarious', 
      'tipovivi5':'Other'}
    el = {'instlevel1':'None', 
      'instlevel2':'Incomplete Primary', 
      'instlevel3':'Complete Primary', 
      'instlevel4':'Incomplete Acad. Secondary', 
      'instlevel5':'Complete Acad. Secondary', 
      'instlevel6':'Incomplete Techn. Secondary', 
      'instlevel7':'Complete Techn. Secondary',
      'instlevel8':'Undergrad.', 
      'instlevel9':'Postgrad.'}
    rgn = {'lugar1':'Central', 
       'lugar2':'Chorotega', 
       'lugar3':'Pacifico Central', 
       'lugar4':'Brunca', 
       'lugar5':'Huetar Atlantica', 
       'lugar6':'Huetar Norte'}
    rltn = {'parentesco1':'Household Head', 
        'parentesco2':'Spouse/Partner', 
        'parentesco3':'Son/Daughter', 
        'parentesco4':'Stepson/Daughter', 
        'parentesco5':'Son/daughter in law', 
        'parentesco6':'Grandson/daughter',
        'parentesco7':'Mother/Father', 
        'parentesco8':'Mother/father in law', 
        'parentesco9':'Brother/sister', 
        'parentesco10':'Brother/sister in law', 
        'parentesco11':'Other family member', 
        'parentesco12':'Other non-family member'}
    mrtl = {'estadocivil1':'< 10 y/o', 
        'estadocivil2':'Free or coupled union', 
        'estadocivil3':'Married', 
        'estadocivil4':'Divorced', 
        'estadocivil5':'Separated', 
        'estadocivil6':'Widow/er', 
        'estadocivil7':'Single'}
    rb = {'elimbasu1':'Tanker Truck', 
      'elimbasu2':'Botan Hollow or Buried', 
      'elimbasu3':'Burning', 
      'elimbasu4':'Thrown in unoccupied space', 
      'elimbasu5':'Thrown in river, creek, or sea', 
      'elimbasu6':'Other'}
    eng = {'energcocinar1':'None', 
       'energcocinar2':'Electricity', 
       'energcocinar3':'Gas', 
       'energcocinar4':'Wood Charcoal'}
    tlt = {'sanitario1':'None', 
       'sanitario2':'Sewer or Cesspool', 
       'sanitario3':'Septic Tank', 
       'sanitario5':'Black hole or latrine', 
       'sanitario6':'Other'}
    flmt = {'pisomoscer':'Mosaic, Ceramic', 
        'pisocemento':'Cement', 
        'pisoother':'Other', 
        'pisonatur':'Natural', 
        'pisonotiene':'None', 
        'pisomadera':'Wood'}
    wlmt = {'paredblolad':'Block/Brick', 
        'paredzocalo':'Socket (wood, zinc, asbestos)', 
        'paredpreb':'Prefabricated/cement', 
        'pareddes':'Waste', 
        'paredmad':'Wood', 
        'paredzinc':'Zinc', 
        'paredfibras':'Natural Fibers', 
        'paredother':'Other'}
    rfmt = {'techozinc':'Metal foil/Zinc', 
        'techoentrepiso':'Fiber cement', 
        'techocane':'Natural fibers', 
        'techootro':'Other'}
    flql = {'eviv1':'Bad', 
        'eviv2':'Regular', 
        'eviv3':'Good'}
    wlql = {'epared1':'Bad',
        'epared2':'Regular', 
        'epared3':'Good'}
    rfqu = {'etecho1':'Bad', 
        'etecho2':'Regular', 
        'etecho3':'Good'}
    wtrpr = {'abastaguadentro':'Inside', 
         'abastaguafuera':'Outside', 
         'abastaguano':'None'}
    elct = {'public':'Public', 
        'planpri':'Private Plant', 
        'noelec':'None', 
        'coopele':'Cooperative'}
    
    #replacing
    data.replace(dict(housesitu=hs, 
                  educlevels=el,
                  regions=rgn,
                  relations=rltn,
                  marital=mrtl,
                  rubbish=rb,
                  energy=eng,
                  toilets=tlt,
                  floormat=flmt,
                  wallmat=wlmt,
                  roofmat=rfmt,
                  floorqual=flql,
                  wallqual=wlql,
                  roofqual=rfqu,
                  waterprov=wtrpr,
                  electric=elct), inplace=True)

### *Clean Training Dataset*

In [None]:
train = pd.read_csv('../input/train.csv')
data_clean(train)
train.to_csv('trainclean.csv')
train.head()

Cleaning reduces the dataset from 143 to 71 columns.
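The column count can be checked against the binary column lists in `data_clean`: each group of binary columns is dropped and replaced by a single categorical column. The group sizes below are counted from those lists.

```python
# Number of binary columns in each group collapsed by data_clean
group_sizes = {
    'housesitu': 5, 'educlevels': 9, 'regions': 6, 'relations': 12,
    'marital': 7, 'rubbish': 6, 'energy': 4, 'toilets': 5,
    'floormat': 6, 'wallmat': 8, 'roofmat': 4, 'floorqual': 3,
    'wallqual': 3, 'roofqual': 3, 'waterprov': 3, 'electric': 4,
}
removed = sum(group_sizes.values())   # binary columns dropped
added = len(group_sizes)              # one categorical column per group
remaining = 143 - removed + added
print(removed, added, remaining)      # 88 16 71
```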

In [None]:
train.columns

### *Clean Test Dataset*

In [None]:
test = pd.read_csv('../input/test.csv')
data_clean(test)
test.to_csv('testclean.csv')
test.head()

In [None]:
#correlations are only defined for the numeric columns
corr = train.select_dtypes(include='number').corr()
corr.style.background_gradient()

In [None]:
train[['r4h1','r4h2','r4h3','r4m1','r4m2','r4m3','r4t1','r4t2','r4t3']].describe()

### Poverty Levels

The `Target` column encodes the poverty level:
* 1 = extreme poverty
* 2 = moderate poverty
* 3 = vulnerable households
* 4 = non vulnerable households

In [None]:
sns.countplot(x='Target', data=train)
plt.xlabel('Poverty Level')
plt.ylabel('Frequency')
plt.title('Household Poverty Levels')
plt.show()

In [None]:
train.floorqual.value_counts()

### Monthly Rent

In [None]:
sns.boxplot(x='Target', y='v2a1', data=train)
plt.xlabel('Poverty Level')
plt.ylabel('Monthly Rent Payment ($)')
plt.show()

We see two extreme outliers in the Non Vulnerable group, as well as many records where the housing situation is 'Own', 'Precarious', or 'Other'. Let's remove these records to get a better look at the distribution.

In [None]:
train = train[train['v2a1'] < 400000]
trainrented = train[train['housesitu']=='Rented']

In [None]:
sns.boxplot(x='Target', y='v2a1', data=trainrented)
plt.xlabel('Poverty Level')
plt.ylabel('Monthly Rent Payment ($)')
plt.show()

In [None]:
#Monthly rent summary for each poverty level
for i in train['Target'].unique():
    print(i)
    print(trainrented[(trainrented['Target'] == i)]['v2a1'].describe())
    print()

In [None]:
levels = [1,2,3,4]
rentmeans = []
for x in levels:
    mean = np.mean(trainrented[trainrented['Target']==x]['v2a1'])
    rentmeans.append(mean)

plt.plot(levels, rentmeans, marker='o')
plt.xlabel('Poverty Level')
plt.title('Mean Monthly Rent by Poverty Level')
plt.xticks(levels,rotation=30)
plt.ylabel('Mean Monthly Rent ($)')

Mean monthly rent rises with the `Target` value, that is, as households become less vulnerable.

### *Inferential Statistics - Difference of means*
The mean for non vulnerable households appears much larger than the other means. But what about the differences among vulnerable, moderate poverty, and extreme poverty?

We will compare each pair of adjacent poverty levels with a two-sample z-test.

**Null Hypothesis**: There is no significant difference between the means.

Alpha = 0.05
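For reference, the statistic behind a two-sample z-test for means can be written out by hand. This is an illustrative sketch that assumes a pooled variance estimate; it is not the statsmodels implementation, and the sample values are hypothetical.

```python
import math

def two_sample_z(a, b):
    """Two-sided z-test for a difference in means, using a pooled sample variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    z = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical samples, not taken from the dataset
z, p = two_sample_z([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

The null hypothesis is rejected when `p` falls below alpha, which is 0.05 here.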

In [None]:
#non vulnerable
meanNV = np.mean(trainrented[trainrented['Target'] == 4]['v2a1'])
print('Non Vulnerable Mean Rent: ', meanNV)

#vulnerable
meanV = np.mean(trainrented[trainrented['Target'] == 3]['v2a1'])
print('Vulnerable Mean Rent: ', meanV)

#moderate
meanM = np.mean(trainrented[trainrented['Target'] == 2]['v2a1'])
print('Moderate Mean Rent: ', meanM)

#extreme
meanE = np.mean(trainrented[trainrented['Target'] == 1]['v2a1'])
print('Extreme Mean Rent: ', meanE)

#total
meanTot = np.mean(trainrented['v2a1'])
print('Mean Rent of Total: ', meanTot)

In [None]:
#non vulnerable and vulnerable
from statsmodels.stats.weightstats import ztest
zstat, p = ztest(trainrented[trainrented['Target'] == 4]['v2a1'],
                           trainrented[trainrented['Target'] == 3]['v2a1'])
print('Z Stat: ', zstat)
print('P-Value: ', p)

Here we reject the null hypothesis, as the p-value is lower than alpha. There is a significant difference between the mean monthly rent of the non vulnerable and vulnerable levels.

In [None]:
#vulnerable and moderate
zstat, p = ztest(trainrented[trainrented['Target'] == 3]['v2a1'],
                           trainrented[trainrented['Target'] == 2]['v2a1'])
print('Z Stat: ', zstat)
print('P-Value: ', p)

We fail to reject the null hypothesis, as the p-value here is greater than alpha. There is not a significant difference between the mean monthly rent of the vulnerable level and the moderate level.

In [None]:
#moderate and extreme
zstat, p = ztest(trainrented[trainrented['Target'] == 2]['v2a1'],
                           trainrented[trainrented['Target'] == 1]['v2a1'])
print('Z Stat: ', zstat)
print('P-Value: ', p)

We fail to reject the null hypothesis, as the p-value here is greater than alpha. There is not a significant difference between the mean monthly rent of the moderate level and the extreme level.

Non vulnerable has much more variance in monthly payments and also the highest mean. Monthly rent seems to increase as households become less vulnerable, but only the difference between the non vulnerable and vulnerable levels is statistically significant. Monthly rent therefore seems most useful for separating non vulnerable households from the rest.

### Roof, floor, and wall materials

In [None]:
#proportion chart to compare normalized data among target levels for each feature.
def percent_table(x):
    return x/float(x.iloc[-1])

def prop_chart(column, title):
    df = pd.crosstab(train['Target'], train[column], margins=True).apply(percent_table, axis=1)
    df.iloc[:-1,:-1].plot(kind='bar')
    plt.legend(loc=0, fontsize='x-small')
    plt.title(title)

In [None]:
prop_chart('wallmat', 'Wall Materials')

prop_chart('floormat', 'Floor Materials')

prop_chart('roofmat', 'Roof Materials')

* Most walls are made of block/brick, especially in Non Vulnerable households.
* Most floors are mosaic/ceramic, followed by cement and then wood.
* Between 90% and 97% of roofs are metal foil/zinc.

The poverty levels have similar distributions of wall, floor, and roof materials.
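The row proportions behind these charts can also be computed directly with the `normalize` argument of `pd.crosstab`, which is an alternative to dividing by the margin column. The rows below are hypothetical stand-ins for the cleaned training data.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned training data
toy = pd.DataFrame({
    'Target':  [1, 1, 4, 4, 4, 4],
    'wallmat': ['Wood', 'Block/Brick', 'Block/Brick',
                'Block/Brick', 'Block/Brick', 'Wood'],
})

# normalize='index' divides each row by its row total, so every poverty level sums to 1
props = pd.crosstab(toy['Target'], toy['wallmat'], normalize='index')
```

Because each row sums to 1, the bars for different poverty levels are comparable even though the levels contain very different numbers of records.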

In [None]:
#quality
#create crosstab dataframes 
prop_chart('wallqual', 'Wall Quality')
prop_chart('floorqual', 'Floor Quality')
prop_chart('roofqual', 'Roof Quality')

* In Non Vulnerable and Vulnerable households, a greater proportion have Good quality than Regular or Bad.
    * Vulnerable households consistently have 45-55% with good quality materials, and Non Vulnerable households consistently have over 65% with good quality materials.
    * Between 8% and 20% of these households have materials considered Bad quality.
* In Extreme and Moderate Poverty households, a smaller proportion have Good quality than Regular or Bad.
    * Between 20% and 30% of these households have materials considered Bad quality.
    * Between 35% and 50% of these households have materials considered Good quality.
    
Wall, floor, and roof quality seem to be strong indicators of poverty level.

### *Inferential Statistics - Difference in proportions*
Is there a difference between the proportions among the poverty levels for Good, Regular, and Bad quality materials (wall, floor, roof)? We will perform two sample Z tests to test significance.

**Null hypothesis**: There is not a significant difference in proportion for quality.

Alpha = 0.05
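As a sketch of what a two-sample test of proportions computes, the function below uses the pooled proportion under the null hypothesis. It is written by hand for illustration, the function name is hypothetical, and the counts are made up.

```python
import math

def two_proportion_z(success1, n1, success2, n2):
    """Two-sided z-test for equal proportions, using the pooled proportion under the null."""
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 60 of 100 records versus 45 of 100 records
z, p_value = two_proportion_z(60, 100, 45, 100)
```

With these counts the p-value falls below 0.05, so the two proportions would be reported as significantly different.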

In [None]:
#ztest proportion
from statsmodels.stats.proportion import proportions_ztest
import warnings
warnings.filterwarnings("ignore")

def propztest_poverty(data, column, val): 
    
    nonvuln = data[data.Target==4]
    vuln = data[data.Target==3]
    moder = data[data.Target==2]
    extreme = data[data.Target==1]
    
    n1 = len(extreme)
    n2 = len(moder)
    n3 = len(vuln)
    n4 = len(nonvuln)
    #count the records in each group where the column equals val
    s1 = (extreme[column]==val).sum()
    s2 = (moder[column]==val).sum()
    s3 = (vuln[column]==val).sum()
    s4 = (nonvuln[column]==val).sum()
    
    #nonvuln and vuln
    z1, pval1 = proportions_ztest([s4, s3], [n4, n3])
    print('Nonvuln proportion:', s4/n4)
    print('Vuln proportion:', s3/n3)
    print('Non Vulnerable and Vulnerable: [zscore, P-Value]', 
          ['{:.12f}'.format(b) for b in (z1, pval1)])
    if pval1 < 0.05:
        print('Significant')
    else:
        print('Not significant')
    
    #vuln and moder
    z2, pval2 = proportions_ztest([s3, s2], [n3, n2])
    print('Vuln proportion:', s3/n3)
    print('Moderate proportion:', s2/n2)
    print('Vulnerable and Moderate: [zscore, P-Value]', 
          ['{:.12f}'.format(b) for b in (z2, pval2)])
    if pval2 < 0.05:
        print('Significant')
    else:
        print('Not significant')
        
    #moder and extreme
    z3, pval3 = proportions_ztest([s2, s1], [n2, n1])
    print('Moderate proportion', s2/n2)
    print('Extreme proportion', s1/n1)
    print('Moderate and Extreme: [zscore, P-Value]', 
          ['{:.12f}'.format(b) for b in (z3, pval3)])
    if pval3 < 0.05:
        print('Significant')
    else:
        print('Not significant')

#Floors
print('Floor Quality')
for x in train['floorqual'].unique():
    
    print(x)
    propztest_poverty(train, 'floorqual', x)
    print()

6/9 comparisons are significant. Although some of the compared proportions do not differ significantly, most do, which suggests that floor quality differs among the poverty levels.

In [None]:
#Wall Quality
print('Wall Quality')
for x in train['wallqual'].unique():
    print(x)
    propztest_poverty(train, 'wallqual', x)
    print()

8/9 comparisons are significant. There is enough evidence to conclude that wall quality differs significantly among the poverty levels.

In [None]:
#Roof Quality
print('Roof Quality')
for x in train['roofqual'].unique():
    print(x)
    propztest_poverty(train, 'roofqual', x)
    print()

7/9 comparisons are significant. There is enough evidence to conclude that roof quality differs significantly among the poverty levels.

**Wall, floor, and roof quality seem to be strong indicators of poverty level.**

### Education

`educlevels`

The different education levels among poverty levels. 

In [None]:
educdf = pd.crosstab(index=train['Target'], columns=train['educlevels'], margins=True).apply(percent_table,axis=1)
educdf

In [None]:
primary = educdf[['Complete Primary', 'Incomplete Primary']]
secondary = educdf[['Complete Acad. Secondary', 'Incomplete Acad. Secondary',
                   'Complete Techn. Secondary', 'Incomplete Techn. Secondary']]
none = educdf['None']
undergrad = educdf['Undergrad.']
postgrad = educdf['Postgrad.']

primary.iloc[:-1].plot(kind='bar')
plt.title('Primary School')
plt.show()

secondary.iloc[:-1].plot(kind='bar')
plt.title('Secondary School')
plt.legend(fontsize='x-small')
plt.show()

none.iloc[:-1].plot(kind='bar')
plt.title('No Schooling')
plt.show()

undergrad.iloc[:-1].plot(kind='bar')
plt.title('Undergrad')
plt.show()

postgrad.iloc[:-1].plot(kind='bar')
plt.title('Postgrad')
plt.show()

Findings:

* Non Vulnerable and Vulnerable have higher proportions of completing primary school, while Moderate and Extreme have higher proportions of not completing primary school.
* All poverty levels have higher proportions of not completing academic secondary school than of completing it. The proportion completing academic secondary school goes up as poverty level becomes less vulnerable. Only Non Vulnerable has a higher proportion of completing technical secondary school than of not completing it.
* Proportion of having no schooling goes down as poverty level becomes less vulnerable.
* Proportion of completing undergrad and postgrad goes up as poverty level becomes less vulnerable.
* Non Vulnerable seems to have significant increase in proportion of completing undergrad and postgrad compared to other levels.

### *Inferential Statistics - Difference in proportions*
Is there a difference in proportions among the poverty levels for the different education levels? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis:** There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
print('Education Levels')
for x in train.educlevels.unique():
    print(x)
    propztest_poverty(train, 'educlevels', x)
    print()

13/27 comparisons are significant. This suggests that the proportions differ among the poverty levels for some education levels, though not for all of them.

**Education level seems to be a strong indicator of poverty level.**

### Overcrowding

`hacapo` = Overcrowding by rooms.
`hacdor` = Overcrowding by bedrooms.
* 1 = Yes
* 0 = No

In [None]:
train.hacapo.describe()

In [None]:
overcrowdf = pd.crosstab(train['Target'], train['hacapo'], margins=True).apply(percent_table, axis=1)
overcrowdf.iloc[:-1, 1].plot(kind='bar', stacked=True)
plt.xticks(rotation=30)
plt.title('Proportion of Overcrowding per Poverty Level')
plt.ylabel('Proportion')

In [None]:
overcrowdf

The proportion of overcrowding by room decreases as the poverty level becomes less vulnerable.

In [None]:
#overcrowding by room 
train.hacapo.head()

`hacapo` is probably not worth looking at, as nearly all of its values are 0.

In [None]:
#overcrowding by bedroom
train.hacdor.head()

In [None]:
overcrowdf = pd.crosstab(train['Target'], train['hacdor'], margins=True).apply(percent_table, axis=1)
overcrowdf.iloc[:-1, 1].plot(kind='bar', stacked=True)
plt.xticks(rotation=30)
plt.title('Proportion of Overcrowding by Bedroom per Poverty Level')
plt.ylabel('Proportion')

Proportion of overcrowding by bedroom decreases as poverty level becomes less vulnerable. This proportion is almost twice as much as overcrowding by room. 

### *Inferential Statistics - Difference in proportions*
Is there a difference in proportions among the poverty levels for overcrowding? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis:** There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
print('overcrowding by room')
propztest_poverty(train, 'hacapo', 1)
print()

print('overcrowding by bedroom')
propztest_poverty(train, 'hacdor', 1)

6/6 comparisons are significant. There is enough evidence to conclude that the proportion of overcrowding, both by room and by bedroom, differs among the poverty levels. Is there a strong correlation between these two variables? If so, we can remove one.


In [None]:
from scipy.stats import pearsonr
corr1 = pearsonr(train.hacapo, train.hacdor)
print('hacapo x hacdor: ', corr1)

corr2 = pearsonr(train.Target, train.hacapo)
print('Target x hacapo: ', corr2)

corr3 = pearsonr(train.Target, train.hacdor)
print('Target x hacdor: ', corr3)

With a correlation coefficient of 0.6524, `hacapo` and `hacdor` have a moderately strong relationship with each other.
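For two 0/1 variables such as these, the Pearson coefficient is the same quantity as the phi coefficient. The sketch below computes it by hand on hypothetical flags to show what the number measures.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 0/1 overcrowding flags for eight records
rooms_flag    = [1, 1, 0, 0, 0, 0, 0, 0]
bedrooms_flag = [1, 1, 1, 0, 0, 0, 0, 0]
r = pearson_r(rooms_flag, bedrooms_flag)
```

A value close to 1 would mean the two flags almost always agree, in which case keeping both adds little information.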

### Water Provision

`waterprov`

If water provisions are inside, outside, or not present at the household.

In [None]:
#inside
waterprovdf = pd.crosstab(train['Target'], train['waterprov'], margins=True).apply(percent_table, axis=1)
waterprovdf.iloc[:-1, 0].plot(kind='bar', stacked=True)
plt.legend()
plt.ylabel('Proportion')

In [None]:
#none
waterprovdf.iloc[:-1, 1].plot(kind='bar', stacked=True)
plt.legend()
plt.ylabel('Proportion')

In [None]:
#outside
waterprovdf.iloc[:-1, 2].plot(kind='bar', stacked=True)
plt.legend()
plt.ylabel('Proportion')

In [None]:
waterprovdf

In [None]:
for x in train['waterprov'].unique():
    print(x)
    propztest_poverty(train, 'waterprov', x)
    print()

There does not seem to be a significant difference among poverty levels in the proportion of dwellings with water provision inside or outside. 'None' might indicate moderate poverty, or it might simply mean that no information was recorded.


### Regions

`regions`

Different regions that the households reside in. 

In [None]:
sns.countplot(x='regions', data=train)
plt.xticks(rotation=45)

In [None]:
train['regions'].value_counts()

Majority of households are located in the Central region. 

In [None]:
regionsdf = pd.crosstab(train['regions'], train['Target'])
regionsdf

In [None]:
prop_chart('regions', 'Regions')

### *Inferential Statistics - Difference in Proportions*

Is there a difference in proportions among the poverty levels for each region? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
for x in train['regions'].unique():
    print(x)
    propztest_poverty(train, 'regions', x)
    print()

Differences in proportions between Non Vulnerable and Vulnerable households are consistently significant. However, the two other comparisons, vulnerable/moderate and moderate/extreme, are consistently not significant. Region may not be a significant factor in determining poverty level.

### Relations

In [None]:
print(train['relations'].value_counts())
sns.countplot(x='relations', data=train)
plt.xticks(rotation=60)

In [None]:
prop_chart('relations', 'Relations')

* Brother/sister in law and stepson/daughter are noticeably more present in Vulnerable households than other relations.
* Grandson/daughter and stepson/daughter are less common in Non Vulnerable households, but more common in Moderate Poverty households.

### Inferential Statistics - Difference in Proportions
Is there a significant difference in proportions among the poverty levels for the different relations? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
for x in train['relations'].unique():
    print(x)
    propztest_poverty(train, 'relations', x)
    print()

### Toilets
What the toilet is connected to.

In [None]:
print(train['toilets'].value_counts())
sns.countplot(x='toilets', data=train)
plt.xticks(rotation=45)

In [None]:
prop_chart('toilets','Toilet is Connected To')

The distribution is similar among poverty levels. The majority of toilets are connected to septic tanks.

### Inferential Statistics - Difference in Proportions
Is there a significant difference in proportions among the poverty levels for what the toilets are connected to? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
for x in train['toilets'].unique():
    print(x)
    propztest_poverty(train, 'toilets', x)
    print()

8/15 comparisons are significant. There is enough evidence to conclude that the proportions for what the toilets are connected to differ among the poverty levels. This variable is a good indicator of poverty level.

### Housing Situation

The housing situation of the household.

In [None]:
print(train['housesitu'].value_counts())
sns.countplot(x='housesitu', data=train)
plt.xticks(rotation=45)

Most houses are owned; the second most common situation is renting. It is unclear what 'Other' covers.

In [None]:
prop_chart('housesitu', 'Housing Situation')

### Inferential Statistics - Difference in Proportions
Is there a significant difference in proportions among the poverty levels for the different housing situations? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
for x in train['housesitu'].unique():
    print(x)
    propztest_poverty(train, 'housesitu', x)
    print()

6/15 comparisons are significant. There is some evidence that housing situation differs among the poverty levels.

### Energy Source
Main source of energy used for cooking.

In [None]:
print(train['energy'].value_counts())
sns.countplot(x='energy', data=train)
plt.xticks(rotation=45)

In [None]:
prop_chart('energy', 'Sources of Energy for Cooking')

### Inferential Statistics - Difference in Proportions
Is there a significant difference in proportions among the poverty levels for the sources of energy used for cooking? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
for x in train['energy'].unique():
    print(x)
    propztest_poverty(train, 'energy', x)
    print()

8/12 comparisons are significant. There is enough evidence to conclude that the cooking energy source differs among the poverty levels. `energy` is a strong indicator of poverty level.

### Appliances

`v18q` owns a tablet

`v18q1` number of tablets household owns

`computer` =1 if the household has notebook or desktop computer

`television` =1 if the household has TV

`mobilephone` =1 if mobile phone

`qmobilephone` # of mobile phones


In [None]:
appliances = train[['v18q', 'v18q1', 'computer', 'television', 'mobilephone', 'qmobilephone']]
appliances.sample(10)

A household can own two or more different appliances at the same time. 

In [None]:
appliances.describe()

In [None]:
for x in appliances.columns:
    print(appliances[x].value_counts())
    sns.countplot(x=x, data=train)
    plt.show()

In [None]:
for x in appliances.columns:
    prop_chart(x, x)

### Inferential Statistics - Difference in Proportions
Is there a significant difference in proportions among the poverty levels for appliances? We will perform two-sample proportion Z tests to test significance.

**Null hypothesis**: There is no significant difference between the proportions.

Alpha = 0.05

In [None]:
print('Tablets')
propztest_poverty(train, 'v18q', 1)
print()

print('Computer')
propztest_poverty(train, 'computer', 1)
print()

print('Television')
propztest_poverty(train, 'television', 1)
print()

print('Mobile Phones')
propztest_poverty(train, 'mobilephone', 1)
print()

* Whether a household owns a tablet or not seems like a significant indicator of poverty level.
* Whether a household owns a computer or not seems like a significant indicator of poverty level.
* Whether a household owns a television or not seems like a significant indicator of poverty level.
* Whether a household owns a mobile phone or not does not seem like a significant indicator of poverty. A large majority of households own a mobile phone. 


### *Differences in means* 

We will compare the differences in means for # of tablets and # of mobile phones, for those that own these two appliances. We will use a two sample z test to compare the means.

**Null Hypothesis**: There is no significant difference among the means of the poverty levels.

Alpha = 0.05

In [None]:
# Number of Tablets
def ztestmean_poverty(data, column):
    #keep only records that own at least one, as described above
    owners = data[data[column] > 0]
    groups = [('Nonvulnerable', owners[owners['Target']==4][column]),
              ('Vulnerable', owners[owners['Target']==3][column]),
              ('Moderate', owners[owners['Target']==2][column]),
              ('Extreme', owners[owners['Target']==1][column])]

    for name, values in groups:
        print(name + ' mean: ', np.mean(values))
    print('Total Mean: ', np.mean(owners[column]))
    print()

    #compare each pair of adjacent poverty levels
    for (name1, a), (name2, b) in zip(groups[:-1], groups[1:]):
        zstat, p = ztest(a, b)
        print(name1 + ' and ' + name2 + ' p-val: ', p)
        if p < 0.05:
            print('Significant')
        else:
            print('Non Significant')

print('Number of tablets')
ztestmean_poverty(train, 'v18q1')

In [None]:
# Number of Phones
print('Number of phones')
ztestmean_poverty(train, 'qmobilephone')

### Remaining Numerical Variables

rooms,  number of all rooms in the house

r4h1, Males younger than 12 years of age

r4h2, Males 12 years of age and older

r4h3, Total males in the household

r4m1, Females younger than 12 years of age

r4m2, Females 12 years of age and older

r4m3, Total females in the household

r4t1, persons younger than 12 years of age

r4t2, persons 12 years of age and older

r4t3, Total persons in the household

tamhog, size of the household

tamviv, number of persons living in the household

hhsize, household size

hogar_nin, Number of children 0 to 19 in household

hogar_adul, Number of adults in household

hogar_mayor, # of individuals 65+ in the household

hogar_total, # of total individuals in the household

dependency, Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of members of the household between 19 and 64)

meaneduc, average years of education for adults (18+)

bedrooms, number of bedrooms

overcrowding, # persons per room

age, Age in years

SQBescolari, escolari squared

SQBage, age squared

SQBhogar_total, hogar_total squared

SQBedjefe, edjefe squared

SQBhogar_nin, hogar_nin squared

SQBovercrowding, overcrowding squared

SQBdependency, dependency squared

SQBmeaned, square of the mean years of education of adults (>=18) in the household

agesq, Age squared
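
The `dependency` definition in the list above can be written as a small function. The function name and the handling of a household with no members aged 19 to 64 are assumptions; the dataset may encode that case differently.

```python
def dependency_rate(ages):
    """Dependents (younger than 19 or older than 64) divided by members aged 19 to 64."""
    dependents = sum(1 for age in ages if age < 19 or age > 64)
    working_age = sum(1 for age in ages if 19 <= age <= 64)
    if working_age == 0:
        # assumption: treat the ratio as infinite when there is no one aged 19 to 64
        return float('inf')
    return dependents / working_age

# Hypothetical household: 3 dependents and 2 members aged 19 to 64
rate = dependency_rate([4, 10, 35, 38, 70])
```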


In [None]:
numeric = {'rooms':['rooms'],
           'males': ['r4h1', 'r4h2', 'r4h3'],
          'females': ['r4m1', 'r4m2', 'r4m3'],
          'persons': ['r4t1', 'r4t2', 'r4t3'],
          'sizeohhold':['tamhog'],
          '#ofpersons':['tamviv'],
          'hholdsize':['hhsize'],
          }
#-unfinished.

## Feature Engineering
-to do

Still need to finish going through the rest of the numerical variables, then do some more feature engineering. After that, build some models, including Random Forest, K-Nearest Neighbors (KNN), and Hierarchical Clustering. I would love any suggestions on what I can improve, as this is my first data analysis project ever.