In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("../input/train.csv")

# A first look at the target and its relation to some other fields
___
The target value indicates the groups of income levels.

- 1 = extreme poverty 
- 2 = moderate poverty 
- 3 = vulnerable households 
- 4 = non vulnerable households
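
As a small aside, the codes above can be turned into readable labels before tabulating. This is only a sketch: `toy_target` is a made-up stand-in for `train.Target`, and `target_labels` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Target column of train.csv (made-up values).
toy_target = pd.Series([4, 4, 2, 3, 4, 1, 2, 4], name="Target")

# Labels taken from the list above.
target_labels = {
    1: "extreme poverty",
    2: "moderate poverty",
    3: "vulnerable",
    4: "non vulnerable",
}

# Share of rows in each group, ordered by target code, with readable labels.
shares = (
    toy_target.value_counts(normalize=True)
    .sort_index()
    .rename(index=target_labels)
)
print(shares)
```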

So let's make a quick visual of how income is distributed among the population.

In [None]:
sns.countplot(x=train.Target);

Most households in Costa Rica are not in a vulnerable situation. However, the second most frequent group of households is that of moderate poverty, followed by vulnerable and extreme poverty households. Now we have to evaluate the other features. But... what are those features anyway?

In [None]:
for feature in train.columns:
    print(feature)

There are lots of features (143 to be specific), so there is a lot of exploration to do. I don't really know where to start, so after looking at the data description, I decided to first check whether the place where the household is located has some kind of influence on its vulnerability.

In [None]:
# Create a single column that combines the place (lugar) dummies.
train['Region'] = None  # object column, so string labels can be assigned below
train.loc[train.lugar1 == 1, 'Region'] = 'Central'
train.loc[train.lugar2 == 1, 'Region'] = 'Chorotega'
train.loc[train.lugar3 == 1, 'Region'] = 'Central Pacific'
train.loc[train.lugar4 == 1, 'Region'] = 'Brunca'
train.loc[train.lugar5 == 1, 'Region'] = 'Atlantic Huetar'
train.loc[train.lugar6 == 1, 'Region'] = 'North Huetar'

pd.crosstab(train.Region, train.Target, normalize=0)

From the table above, which is normalized by row, we can see that in the Central region only 5% of the households are in extreme poverty, while virtually 70% of households in this region are non-vulnerable. This seems to be the best region when considering household conditions. On the other hand, Central Pacific has the lowest share of non-vulnerable households and, probably not coincidentally, the highest percentage of extremely poor households, with 14% of its households in this condition. Not only that, it also has the highest percentage of vulnerable households among all regions - almost 19%. This indicates that the region in which a household is located might be an important feature to predict the Target.
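
The same kind of row-normalized table can be built without writing one `.loc` assignment per dummy. The sketch below is an alternative, not the approach used in this notebook, and it runs on a made-up toy frame with only three `lugar` columns. It assumes exactly one dummy is set per row.

```python
import pandas as pd

# Hypothetical toy frame with three one-hot "lugar" columns (made-up values).
toy = pd.DataFrame({
    "lugar1": [1, 0, 0, 1],
    "lugar2": [0, 1, 0, 0],
    "lugar3": [0, 0, 1, 0],
    "Target": [4, 2, 1, 4],
})

region_names = {
    "lugar1": "Central",
    "lugar2": "Chorotega",
    "lugar3": "Central Pacific",
}

# idxmax(axis=1) returns the name of the column holding the 1 in each row.
# Caveat: a row with no dummy set would silently map to the first column.
toy["Region"] = toy[list(region_names)].idxmax(axis=1).map(region_names)

# normalize="index" is equivalent to normalize=0: each row sums to 1.
table = pd.crosstab(toy.Region, toy.Target, normalize="index")
print(table)
```

With row normalization, each cell reads as "the share of this region's rows that fall in this target group".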

Similarly, let's see the area where those households are. I'd guess that most vulnerable/poor households are in rural areas. But, being the good data scientists that we are, it is good to check what the data have to say about this.

In [None]:
train['Area'] = None  # object column, so string labels can be assigned below
train.loc[train.area1 == 1, 'Area'] = 'Urban'
train.loc[train.area2 == 1, 'Area'] = 'Rural'

pd.crosstab(train.Area, train.Target, normalize=0)

Seems like my guess was right, even though it is not crystal clear. We can see that rural areas have about 11 p.p. (percentage points) fewer non-vulnerable households than urban areas. As a consequence, each of the other [bad] conditions is considerably more frequent in rural areas.
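
To make the percentage-point comparison concrete, here is a minimal sketch. The shares in it are invented for illustration and are NOT the values from the table above. Only the arithmetic is the point.

```python
import pandas as pd

# Hypothetical row-normalized shares (each row sums to 1); not real results.
shares = pd.DataFrame(
    {1: [0.10, 0.05], 2: [0.20, 0.15], 3: [0.15, 0.10], 4: [0.55, 0.70]},
    index=["Rural", "Urban"],
)

# Difference between rural and urban shares, expressed in percentage points.
gap_pp = (shares.loc["Rural"] - shares.loc["Urban"]) * 100
print(gap_pp.round(1))
```

Because each row sums to 1, the gaps sum to zero: whatever rural areas lose in one group has to show up in the others.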

Now that we have seen how each region and area relates to household condition, let's explore a bit more about the households themselves. First, the size of the households.

Note: 'tamhog' and 'hhsize' are the same variables
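
A claim like this is cheap to check. The sketch below uses a made-up toy frame. On the real data, the same comparison would be run on `train` instead.

```python
import pandas as pd

# Hypothetical toy frame; the real check would use the columns of train.
toy = pd.DataFrame({"tamhog": [4, 3, 5, 4], "hhsize": [4, 3, 5, 4]})

# True only if the two columns agree on every row.
same = (toy["tamhog"] == toy["hhsize"]).all()
print(same)
```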

In [None]:
pd.crosstab(train.tamhog, train.Target, normalize = 1)

I couldn't find the meaning of the tamhog/hhsize values, but I'd guess it is the number of people in the household (please correct me if I'm wrong). Most households have 4 people living together, regardless of their condition.

Age could be an important predictor as well. What is its distribution?

In [None]:
sns.kdeplot(train.age, legend=False)
plt.xlabel("Age");

And how does the age distribution look for each household condition?

In [None]:
p = sns.FacetGrid(data=train, hue='Target', height=4, legend_out=True)  # 'size' was renamed to 'height' in seaborn 0.9
p = p.map(sns.kdeplot, 'age')
plt.legend()
plt.title("Age distribution colored by household condition(target)")
p;

Since most households belong to condition 4, its distribution basically shapes the "general" age distribution, as we can see by comparing the red curve and the "general" curve on the plot above. However, as we move down the vulnerability scale, the age distributions become more and more skewed. That is, the worse a household's condition, the higher the chances that younger people live there.
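
The visual impression of skew can be backed by a number. This is a sketch on invented ages for two groups only. On the real data, the same `groupby` would be applied to `train`.

```python
import pandas as pd

# Hypothetical ages for two target groups (made-up values).
toy = pd.DataFrame({
    "Target": [1] * 6 + [4] * 6,
    "age": [2, 5, 8, 12, 30, 60, 10, 25, 35, 45, 55, 70],
})

# Median age and sample skewness per group; a larger positive skew means
# the mass of the distribution sits at younger ages with a long right tail.
summary = toy.groupby("Target")["age"].agg(["median", "skew"])
print(summary)
```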

Now we are starting to get a picture. Poor/vulnerable households are more densely found in rural areas and especially in the Central Pacific region. Their members are also younger than their non-vulnerable counterparts. But we still don't know their structure. For instance, what's their main source of energy used for cooking?

In [None]:
train['CookSource'] = None  # object column, so string labels can be assigned below
train.loc[train.energcocinar1 == 1, 'CookSource'] = 'NoKitchen'
train.loc[train.energcocinar2 == 1, 'CookSource'] = 'Electricity'
train.loc[train.energcocinar3 == 1, 'CookSource'] = 'Gas'
train.loc[train.energcocinar4 == 1, 'CookSource'] = 'WoodCharcoal'

pd.crosstab(train.CookSource, train.Target, normalize = 1)

Here the difference between non-vulnerable and poor/vulnerable households becomes clearer. About 55% of non-vulnerable households use electricity and another 43% use gas. Less than 3% either have no kitchen or use charcoal. At the opposite end of the target values, 12%(!) of extreme poverty households still use charcoal to cook - a huge difference from the other conditions (9%, 6% and 2% for moderate poverty, vulnerable and non-vulnerable conditions, respectively). A few of them - 32% - cook with electricity, but they mostly use gas, just like every other group except non-vulnerable households.
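
Note that this table is normalized by column (`normalize=1`), so each percentage is a share within one target group, not within one cooking source. A minimal sketch on made-up rows shows the difference in reading:

```python
import pandas as pd

# Hypothetical rows (made-up values), three per target group.
toy = pd.DataFrame({
    "CookSource": ["Gas", "Gas", "WoodCharcoal",
                   "Electricity", "Electricity", "Gas"],
    "Target": [1, 1, 1, 4, 4, 4],
})

# normalize="columns" is equivalent to normalize=1: each column sums to 1.
by_target = pd.crosstab(toy.CookSource, toy.Target, normalize="columns")
print(by_target)
```

Here the `Gas` row reads as "two thirds of group 1 and one third of group 4 cook with gas", which is the way the percentages above are meant.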