## Poverty Assessment: Pamplona Alta
### Caja del Amor Survey Results
### Sean O'Malley

#### Project

Every year, in partnership with Solidaridad en Marcha, Catholic churches throughout the city of Lima, Peru, deliver thousands of Christmas gift boxes to the poorest residents in the city. The campaign, called Caja del Amor, has been in operation for years and has built long-term, well-sustained relationships with many community leaders in these areas.


The list of gift recipients is built in coordination with the local community leaders, most of whom oversee around 150 families. These leaders choose the 5 to 10 families most in need of assistance in their community to receive the gift boxes.


These networks were a source of strength when we began to explore creating an in-depth survey to better understand the poorest urban populations in Peru. However, our focus also needed to be on actionable insight and on the concrete opportunity to enact positive change in the lives of those we surveyed.


Therefore, we decided to focus on a region of high need where Solidaridad en Marcha had a significant footprint. This process of elimination led us to the region of Pamplona Alta near the San Juan de Miraflores municipality of Lima.

#### Data

Understanding the many facets of life in Pamplona Alta was integral to how we built our survey, the questions we asked, and the way we asked them. For some questions we already had an intuitive idea of the answer but needed to understand its severity. Others we asked in order to gain insight into the tools that might be available to us within our solution set.

Lastly, we asked about economic indicators, religious factors and family structure, all intended to paint a picture of the lives of those in Pamplona Alta and, possibly, to identify causal relationships among the various characteristics.

We were able to fully survey over 500 families. After removing all personally identifiable data, we built a dataset with great potential for understanding the lives of those in Pamplona Alta and the possible routes available to help them.

The completed dataset built from the original survey contains 21 variables and 507 observations to explore, visualize and analyze. The complete dataset can be found on my GitHub account, here. Also note, for binary variables, 1 is yes and 0 is no.

1.	__FAM_N__ – _numeric factor_ – Unique identifier for each family.
2.	__internet__ – _binary_ – Does your phone have internet?
3.	__agua__ –	_binary_ – Can the water truck get to your house?
4.	__banco__ – _binary_ – Do you have a bank account?
5.	__iglesia__ – _binary_ – Do you go to church at least once a month?
6.	__dejar_hijos__ –  _binary_ – Do you leave your children home alone (when you go to work)?
7.	__cuantas_personas__ – _numeric_ – How many people live in your house?
8.	__tiempo_casa__ – _numeric_ – How long have you lived in your house?
9.	__primer_hijo__ – _numeric_ – At what age did you have your first child?
10.	__cuantas_trabajan__ – _numeric_ – How many people in your house work?
11.	__tiempo_trabajan__ – _numeric_ – How long does it take to get to your job?	
12.	__pierden_colegio__ – _numeric_ – How many days a month do your children miss school?
13.	__ingreso__ – _numeric_ – What is your monthly household income?
14.	__bautizadas__ – _numeric_ – How many people in your family are baptized?
15.	__DIRECCION__ – _character factor_ – Name of neighborhood.
16.	__PADRE__ – _binary_ – Does the father of the children live in the home?
17.	__MADRE__ – _binary_ – Does the mother of the children live in the home?
18.	__F__ – _numeric_ – Count of females in the home.
19.	__M__ – _numeric_ – Count of males in the home.
20.	__NINOS__ – _numeric_ – Count of children 18 and younger in the home.
21.	__MAYORES__ – _numeric_ – Count of adults 65 and older in the home.
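
Because binary variables are coded 1 for yes and 0 for no, the mean of a binary column can be read as the share of families that answered yes. The sketch below shows this on a few made-up rows; the column names come from the list above, but the values are illustrative and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; these are not rows from the survey.
toy = pd.DataFrame({
    "internet": [1, 0, 0, 0, np.nan],
    "banco": [0, 0, 1, 0, 0],
})

# mean() skips NaN by default, so each result is the share of "yes"
# among the families that actually answered that question.
share_yes = toy.mean()

# count() reports how many families answered each question.
answered = toy.count()

print(share_yes)
print(answered)
```

Reading the mean together with the count keeps a high or low share from being over-interpreted when many families skipped the question.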

Months of conversations, meetings, reading and collaboration with community members went into building this survey. That qualitative work helped us produce a dataset that can support multiple relevant and informed quantitative analyses; and it is from this point that we will now follow the flow of a data science analysis.

#### Process

Reminder: I will at times use technical language, but I encourage you to keep reading through, because I will also accompany every scientific insight with an explanation in simple language, relevant to the question at hand.

We will begin by exploring the variables, their average values and basic correlations, visualizing how characteristics behave with one another. Following our exploratory phase, we will inspect cause-and-effect relationships between pairs of variables, as well as predict specific variables using all available data. I will use multiple techniques in this analysis, in hopes of identifying which variables matter most in predicting key indicators of poverty. The result will be a set of priorities for aid workers to pursue in improving specific economic or societal indicators.
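
As a rough idea of what the "basic correlations" step could look like, here is a minimal sketch using pandas' built-in Pearson correlation. The column names are from the variable list, but the numbers are invented for illustration, and the actual exploratory code may take a different approach.

```python
import pandas as pd

# Invented values for illustration only; not survey data.
toy = pd.DataFrame({
    "cuantas_personas": [3, 4, 5, 6, 8],
    "cuantas_trabajan": [1, 1, 2, 2, 3],
    "ingreso": [400, 500, 700, 850, 1200],
})

# DataFrame.corr() computes pairwise Pearson correlations by default
# and ignores missing values pair by pair.
corr = toy.corr()

print(corr.round(2))
```

A correlation matrix only describes how strongly two variables move together; it does not by itself establish which one, if either, drives the other.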


The next analysis will seek to understand the natural segments that exist within the poorest of the poor. Again, using multiple techniques, I will attempt to determine the groups of people that exist within those surveyed. What commonalities do certain segments have? How can we target aid campaigns to help certain groups? These are a few of the many questions a segmentation analysis will help us answer.

Our quantitative and qualitative analyses will come to fruition in the final recommendation portion of this process. We will present questions and provide actionable insight into those questions, as determined by our analysis. We will build a road map for aid, a list of how we can help, who we can help and the logistical suggestions to do so. Our intent is to tie every insight to action and offer suggestions as to the best action available given what we have learned from the analysis. So, let’s get started!

Import the necessary packages

In [1]:
import pandas as pd
import numpy as np
import os

Ingest the raw data from its GitHub location

In [52]:
github_file_location = "https://raw.githubusercontent.com/showmalley/SeanOMalleyCodePortfolio/master/Development%20Economics/PovertySurveys/CDA_FULL_2018.csv"

df = pd.read_csv(github_file_location, encoding="ISO-8859-1").iloc[:,1:]
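
The `.iloc[:,1:]` in the cell above drops the first column of the file. My assumption is that this column is the row index that `to_csv` writes by default, which pandas reads back as `Unnamed: 0`. The sketch below reproduces that round trip on a tiny in-memory frame and shows `index_col=0` as an alternative way to handle it; the toy frame is made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the real dataset.
toy = pd.DataFrame({"FAM_N": [792, 791], "internet": [1.0, 0.0]})

# With no path, to_csv() returns the CSV text; the index is written
# as an unnamed first column.
csv_text = toy.to_csv()

# Reading it back naively yields an extra "Unnamed: 0" column.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1 (as in the cell above): drop the first column by position.
by_position = raw.iloc[:, 1:]

# Option 2: tell read_csv that the first column is the index.
by_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(by_position.columns))
print(list(by_index_col.columns))
```

Both options leave the same data columns; `index_col=0` simply states the intent more explicitly than a positional slice.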

We need to explore the datatypes to ensure correct formatting. It looks like we only need to alter the FAM_N and DIRECCION variables to be categorical.

In [53]:
df.FAM_N = df.FAM_N.astype('category')
df.DIRECCION = df.DIRECCION.astype('category')
print(df.dtypes)

FAM_N               category
internet             float64
agua                 float64
banco                float64
iglesia              float64
dejar_hijos          float64
cuantas_personas     float64
tiempo_casa          float64
primer_hijo          float64
cuantas_trabajan     float64
tiempo_trabajan      float64
pierden_colegio      float64
ingreso              float64
bautizadas           float64
DIRECCION           category
PADRE                float64
MADRE                float64
F                    float64
M                    float64
NINOS                float64
MAYORES              float64
dtype: object
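
Note that the binary variables show up as `float64` rather than integers. A likely reason is that a column containing missing values (NaN) cannot be stored in a plain NumPy integer dtype, so pandas falls back to float. The sketch below, on made-up rows, shows the categorical conversion together with a quick check that a binary column holds only 0, 1 or NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not survey data.
toy = pd.DataFrame({
    "FAM_N": [792, 791, 790],
    "internet": [1, 0, np.nan],
    "DIRECCION": ["OTRO CASO", "OTRO CASO", "OTRO CASO"],
})

# Identifier and neighborhood are labels, not quantities.
toy["FAM_N"] = toy["FAM_N"].astype("category")
toy["DIRECCION"] = toy["DIRECCION"].astype("category")

# Sanity check: every non-missing answer is either 0 or 1.
binary_ok = toy["internet"].dropna().isin([0, 1]).all()

print(toy.dtypes)
print(binary_ok)
```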


All looks well, so let's move on to explore the data itself, starting with summary statistics.

In [59]:
df.describe()

Unnamed: 0,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,tiempo_trabajan,pierden_colegio,ingreso,bautizadas,PADRE,MADRE,F,M,NINOS,MAYORES
count,498.0,493.0,503.0,500.0,486.0,505.0,495.0,472.0,479.0,409.0,344.0,441.0,440.0,502.0,502.0,499.0,499.0,499.0,499.0
mean,0.106426,0.705882,0.035785,0.874,0.403292,5.043564,14.225859,21.063559,1.281837,1.721516,1.962791,682.380952,2.65,0.551793,0.97012,2.543086,2.036072,2.603206,0.058116
std,0.308692,0.456108,0.185939,0.332182,0.491064,1.958418,8.601098,4.544681,0.763925,4.086312,2.318429,576.150332,1.838651,0.497806,0.170427,1.24008,1.2701,1.521885,0.258644
min,0.0,0.0,0.0,0.0,0.0,0.0,0.2,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,0.0,4.0,6.0,18.0,1.0,0.75,0.0,430.0,1.0,0.0,1.0,2.0,1.0,2.0,0.0
50%,0.0,1.0,0.0,1.0,0.0,5.0,15.0,20.0,1.0,1.0,2.0,700.0,2.0,1.0,1.0,2.0,2.0,3.0,0.0
75%,0.0,1.0,0.0,1.0,1.0,6.0,18.0,23.0,1.0,2.0,3.0,850.0,4.0,1.0,1.0,3.0,3.0,3.0,0.0
max,1.0,1.0,1.0,1.0,1.0,15.0,50.0,45.0,7.0,60.0,22.0,8500.0,10.0,1.0,1.0,8.0,6.0,8.0,2.0


Looking above, we have a few checks to take care of. Look at the min and max to identify unreasonable outliers. Once these are controlled for, the mean values give us our first brief insight into the lives of the families in the data we have ingested.
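
One way to make the min/max check repeatable is to count, for each column, how many values fall outside a plausible range. The sketch below is only an illustration: the helper `count_out_of_range` and the bounds are hypothetical choices of mine, not thresholds used in this analysis, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds, chosen for illustration only.
bounds = {"primer_hijo": (12, 60), "cuantas_personas": (1, 20)}

# Made-up rows; not survey data.
toy = pd.DataFrame({
    "primer_hijo": [0, 19, 21, np.nan],
    "cuantas_personas": [8, 0, 3, 6],
})


def count_out_of_range(frame, limits):
    """Count, per column, the non-missing values outside [low, high]."""
    counts = {}
    for col, (low, high) in limits.items():
        values = frame[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts)


flags = count_out_of_range(toy, bounds)
print(flags)
```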

The only change I see that is needed concerns the age at which respondents had their first child. We know no one can have a child at age 0, so let's control for this by setting any value under 12 to NaN.

In [58]:
df.loc[df.primer_hijo < 12, 'primer_hijo'] = np.nan
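
To see what this masking does to the summary statistics, here is a small sketch on made-up values. Replacing an implausible entry with NaN removes it from `count`, `mean` and `min`, which matches the pattern in the tables in this document, where the `primer_hijo` count is 476 with a minimum of 0 in one summary and 472 with a minimum of 13 in the cleaned one.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; not survey data.
toy = pd.DataFrame({"primer_hijo": [0.0, 19.0, 21.0, 20.0]})

before = toy["primer_hijo"].describe()

# Same pattern as the cell above: select rows by condition,
# then overwrite only the chosen column.
toy.loc[toy["primer_hijo"] < 12, "primer_hijo"] = np.nan

after = toy["primer_hijo"].describe()

print(before[["count", "mean", "min"]])
print(after[["count", "mean", "min"]])
```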

In [49]:
df_encuesta.head()

Unnamed: 0,id,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,tiempo_trabajan,pierden_colegio,ingreso,bautizadas
0,792,1.0,0.0,1.0,0.0,0.0,8,50.0,20.0,3.0,0.5,,2000.0,6.0
1,791,0.0,1.0,1.0,0.0,1.0,10,20.0,19.0,2.0,1.0,,2000.0,10.0
2,790,1.0,1.0,1.0,1.0,,3,15.0,,1.0,1.5,,1800.0,3.0
3,793,0.0,1.0,1.0,1.0,1.0,6,0.3,21.0,3.0,1.5,,1200.0,3.0
4,234,0.0,1.0,0.0,1.0,1.0,3,18.0,21.0,1.0,1.0,,400.0,2.0


In [50]:
df_encuesta.iloc[:,1:].describe()

Unnamed: 0,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,tiempo_trabajan,pierden_colegio,ingreso,bautizadas
count,496.0,491.0,501.0,498.0,484.0,503.0,493.0,474.0,477.0,407.0,343.0,439.0,438.0
mean,0.106855,0.704684,0.035928,0.873494,0.404959,5.045726,14.24503,20.890295,1.283019,1.727518,1.933528,682.984055,2.648402
std,0.30924,0.45665,0.186297,0.332753,0.491392,1.96176,8.599988,4.772697,0.765309,4.095462,2.257298,577.305663,1.841459
min,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,0.0,4.0,6.0,18.0,1.0,0.75,0.0,440.0,1.0
50%,0.0,1.0,0.0,1.0,0.0,5.0,15.0,20.0,1.0,1.0,2.0,700.0,2.0
75%,0.0,1.0,0.0,1.0,1.0,6.0,18.0,23.0,1.0,2.0,3.0,850.0,4.0
max,1.0,1.0,1.0,1.0,1.0,15.0,50.0,45.0,7.0,60.0,22.0,8500.0,10.0


In [51]:
df_master = df_encuesta.merge(df_familia, left_on='id', right_on='FAM_N', how='outer')
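
An outer merge keeps every family that appears in either table, filling the missing side with NaN. The sketch below shows this on two small made-up frames and uses the `indicator=True` option of `merge` to label where each row came from. The frame names `encuesta` and `familia` and their contents are stand-ins, not the real `df_encuesta` and `df_familia`.

```python
import pandas as pd

# Made-up stand-ins for the survey and family tables.
encuesta = pd.DataFrame({
    "id": [792, 791, 234],
    "ingreso": [2000.0, 2000.0, 400.0],
})
familia = pd.DataFrame({
    "FAM_N": [792, 234, 999],
    "NINOS": [3.0, 2.0, 1.0],
})

# indicator=True adds a "_merge" column with the values
# "both", "left_only" or "right_only".
merged = encuesta.merge(
    familia, left_on="id", right_on="FAM_N", how="outer", indicator=True
)

print(merged["_merge"].value_counts())
```

Checking the `_merge` counts before going further is a cheap way to see how many families have survey answers but no household record, and vice versa.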

In [58]:
df_master = df_master.iloc[:,1:]
df_master.head()

Unnamed: 0,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,tiempo_trabajan,...,ingreso,bautizadas,FAM_N,DIRECCION,PADRE,MADRE,F,M,NINOS,MAYORES
0,1.0,0.0,1.0,0.0,0.0,8.0,50.0,20.0,3.0,0.5,...,2000.0,6.0,792,OTRO CASO,0.0,1.0,5.0,4.0,3.0,0.0
1,0.0,1.0,1.0,0.0,1.0,10.0,20.0,19.0,2.0,1.0,...,2000.0,10.0,791,OTRO CASO,0.0,1.0,2.0,4.0,5.0,0.0
2,1.0,1.0,1.0,1.0,,3.0,15.0,,1.0,1.5,...,1800.0,3.0,790,OTRO CASO,1.0,0.0,2.0,1.0,0.0,0.0
3,0.0,1.0,1.0,1.0,1.0,6.0,0.3,21.0,3.0,1.5,...,1200.0,3.0,793,OTRO CASO,1.0,1.0,3.0,3.0,2.0,0.0
4,0.0,1.0,0.0,1.0,1.0,3.0,18.0,21.0,1.0,1.0,...,400.0,2.0,234,OTRO CASO,0.0,1.0,3.0,0.0,2.0,0.0


In [63]:
print("We have " +str(len(df_master.FAM_N)) + " total families and " + str(len(df_master.FAM_N.unique()))+ " unique values.")

We have 570 total families and 565 unique values.


That means we need to deduplicate on family number.

In [70]:
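# Count the rows that remain after dropping repeated family numbers.
# The result is not assigned, so df_master itself is unchanged here.
len(df_master.drop_duplicates(subset='FAM_N', keep='first', inplace=False))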

565
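
With `inplace=False`, `drop_duplicates` returns a new dataframe and leaves the original untouched, so the result has to be assigned to take effect. The sketch below, on made-up rows, first inspects the repeated family numbers and then keeps the first occurrence of each.

```python
import pandas as pd

# Made-up rows with one repeated family number; not survey data.
toy = pd.DataFrame({
    "FAM_N": [792, 791, 791, 234],
    "ingreso": [2000.0, 2000.0, 1900.0, 400.0],
})

# keep=False marks every member of a duplicated group, which makes
# it easy to inspect the conflicting rows before dropping any.
dupes = toy[toy["FAM_N"].duplicated(keep=False)]

# Keep the first row per family and assign the result.
deduped = toy.drop_duplicates(subset="FAM_N", keep="first").reset_index(drop=True)

print(dupes)
print(len(toy), len(deduped))
```

Keeping the first occurrence is a simple default; when duplicated rows disagree, as the two `ingreso` values do here, it is worth looking at them before deciding which to keep.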

Move FAM_N to the front of the dataframe columns.

In [71]:
df_master = df_master.set_index('FAM_N').reset_index()
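
The `set_index(...).reset_index()` pair works because `reset_index` inserts the former index as the first column. A more explicit alternative is to build the column order directly. The sketch below compares the two on a made-up frame.

```python
import pandas as pd

# Made-up frame with the identifier in the middle.
toy = pd.DataFrame({
    "internet": [1.0, 0.0],
    "FAM_N": [792, 791],
    "agua": [0.0, 1.0],
})

# Approach used above: round-trip the column through the index.
via_index = toy.set_index("FAM_N").reset_index()

# Alternative: spell out the desired column order.
ordered = ["FAM_N"] + [c for c in toy.columns if c != "FAM_N"]
via_reorder = toy[ordered]

print(list(via_index.columns))
print(list(via_reorder.columns))
```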

In [72]:
df_master.head()

Unnamed: 0,FAM_N,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,...,pierden_colegio,ingreso,bautizadas,DIRECCION,PADRE,MADRE,F,M,NINOS,MAYORES
0,792,1.0,0.0,1.0,0.0,0.0,8.0,50.0,20.0,3.0,...,,2000.0,6.0,OTRO CASO,0.0,1.0,5.0,4.0,3.0,0.0
1,791,0.0,1.0,1.0,0.0,1.0,10.0,20.0,19.0,2.0,...,,2000.0,10.0,OTRO CASO,0.0,1.0,2.0,4.0,5.0,0.0
2,790,1.0,1.0,1.0,1.0,,3.0,15.0,,1.0,...,,1800.0,3.0,OTRO CASO,1.0,0.0,2.0,1.0,0.0,0.0
3,793,0.0,1.0,1.0,1.0,1.0,6.0,0.3,21.0,3.0,...,,1200.0,3.0,OTRO CASO,1.0,1.0,3.0,3.0,2.0,0.0
4,234,0.0,1.0,0.0,1.0,1.0,3.0,18.0,21.0,1.0,...,,400.0,2.0,OTRO CASO,0.0,1.0,3.0,0.0,2.0,0.0


In [75]:
df_master.iloc[:,1:].describe()

Unnamed: 0,internet,agua,banco,iglesia,dejar_hijos,cuantas_personas,tiempo_casa,primer_hijo,cuantas_trabajan,tiempo_trabajan,pierden_colegio,ingreso,bautizadas,PADRE,MADRE,F,M,NINOS,MAYORES
count,498.0,493.0,503.0,500.0,486.0,505.0,495.0,476.0,479.0,409.0,344.0,441.0,440.0,567.0,567.0,564.0,564.0,564.0,564.0
mean,0.106426,0.705882,0.035785,0.874,0.403292,5.043564,14.225859,20.918067,1.281837,1.721516,1.962791,682.380952,2.65,0.548501,0.96649,2.574468,2.069149,2.675532,0.058511
std,0.308692,0.456108,0.185939,0.332182,0.491064,1.958418,8.601098,4.806532,0.763925,4.086312,2.318429,576.150332,1.838651,0.498082,0.180122,1.244435,1.311449,1.543693,0.263429
min,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,1.0,0.0,4.0,6.0,18.0,1.0,0.75,0.0,430.0,1.0,0.0,1.0,2.0,1.0,2.0,0.0
50%,0.0,1.0,0.0,1.0,0.0,5.0,15.0,20.0,1.0,1.0,2.0,700.0,2.0,1.0,1.0,2.0,2.0,3.0,0.0
75%,0.0,1.0,0.0,1.0,1.0,6.0,18.0,23.0,1.0,2.0,3.0,850.0,4.0,1.0,1.0,3.0,3.0,4.0,0.0
max,1.0,1.0,1.0,1.0,1.0,15.0,50.0,45.0,7.0,60.0,22.0,8500.0,10.0,1.0,1.0,8.0,6.0,8.0,2.0


In [79]:
df_master = df_master.iloc[:505,:]
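
The slice above keeps the first 505 rows by position. My reading, which the notebook does not state, is that these are the families with survey responses, since the survey columns in the summary have at most 505 non-missing values. A positional slice depends on the row order the merge happens to produce, so a filter on the data itself may be more robust. The sketch below is one possible version; the frame and the `survey_cols` subset are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up merged frame: the last family has no survey answers.
toy = pd.DataFrame({
    "FAM_N": [792, 791, 999],
    "internet": [1.0, np.nan, np.nan],
    "ingreso": [2000.0, 2000.0, np.nan],
    "NINOS": [3.0, 5.0, 1.0],
})

# Assumed subset of columns that come from the survey.
survey_cols = ["internet", "ingreso"]

# Drop only the rows where every survey column is missing.
surveyed = toy.dropna(subset=survey_cols, how="all")

print(surveyed["FAM_N"].tolist())
```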

In [80]:
df_master.to_csv('CDA_FULL_2018.csv')