**7 Leading Facts About Costa Rica’s Poverty Rate**

1. Costa Rica’s inequality rate has increased since 2000, a division that disproportionately affects indigenous and minority groups. Today, the country’s richest 20 percent receive an income [19 times higher](https://www.telesurtv.net/english/news/Costa-Ricas-Poor-Left-Behind-Despite-Economic-Growth--20170407-0013.html) than that of the poorest 20 percent.
2. While, overall, Costa Rica’s poverty rate has dropped from 22.4 percent to 21.7 percent from 2014 to 2015, the country’s extreme poverty rate rose from 5.8 percent to 7.2 percent, [the highest recorded rate](https://thecostaricanews.com/national-household-survey-to-measure-poverty-and-inequality/) in the last 60 years.
3. While 19 percent of urban households live in poverty and 5.2 percent live in extreme poverty, 30.3 percent of rural households live in poverty and 10.6 percent in extreme poverty.
4. Poor Costa Ricans have, on average, three years less schooling than their economically stable peers.
5. In Costa Rica, 43.5 percent of poor households are headed by women.
6. Since an inflation [crisis in the ’80s and ’90s](https://www.cato.org/publications/economic-development-bulletin/growth-without-poverty-reduction-case-costa-rica), the Costa Rican government has managed to boost the economy through international tourism and exports. These sectors benefit qualified workers, while unskilled workers, over-represented by indigenous and minority groups, see no change or a decrease in their salaries.

7. Public assistance to poor families increased by 9.3 percent per household and 6.9 percent per person from 2014 to 2015.

Costa Rica’s poverty rate seems to be sewed up neatly on the surface, but the growth of a country doesn’t always reflect the growth of its people. The disparity of incomes and opportunities between uneducated people in rural areas versus educated people in urban areas threatens to rob Costa Rica of its good economic reputation.
– Sophie Nunnally
[Source](https://borgenproject.org/about-costa-ricas-poverty-rate/) 

**Objective:**

The Inter-American Development Bank is asking the Kaggle community for help with income qualification for some of the world's poorest families, so as to make sure the right people are given enough aid.


Let us first import the necessary modules.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'



Let us check which files are available in the input path for the competition.

In [None]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

Now let us read the train and test file and check the number of rows and columns.

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
print("Train shape : ", train.shape)
print("Test shape : ", test.shape)

So we have 9557 rows in the train set and 23859 rows in the test set. We also have 143 columns in total, including the target and ID columns. The test set is about 2.5 times the size of the train set.
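The size ratio can be checked directly from the two row counts. This is a minimal sketch using the numbers reported above; the variable names are only illustrative.

```python
# Row counts as reported by the shape printout above.
train_rows = 9557
test_rows = 23859

# How many times larger the test set is than the train set.
ratio = test_rows / train_rows
print(f"Test set is {ratio:.2f}x the size of the train set")
```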

In [None]:
train.head()

In [None]:
test.describe()

**Target variable**

In [None]:
plt.figure(figsize=(20,8))
sns.countplot(x=train.Target)  # pass the column by keyword; recent seaborn releases expect this
plt.title("Value Counts of Target Variable")

Most of the families fall under category 4, the non-vulnerable households.
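To put a number on "most", the class shares can be computed alongside the count plot. The sketch below uses a small made-up `toy_target` series purely for illustration; on the real data the same call would be applied to `train.Target`.

```python
import pandas as pd

# Made-up target values for illustration only; the real values come from train.csv.
toy_target = pd.Series([4, 4, 4, 2, 3, 4, 1, 2, 4, 4], name="Target")

# Fraction of rows in each target category, ordered by category label.
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)
```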

**Missing values:**

Now let us check if there are missing values in the dataset.

In [None]:
missing_train = train.isnull().sum(axis=0).reset_index()
missing_train.columns = ['column_name', 'missing_count']
missing_train = missing_train[missing_train['missing_count']>0]
missing_train['d_type']='train'

missing_test = test.isnull().sum(axis=0).reset_index()
missing_test.columns = ['column_name', 'missing_count']
missing_test = missing_test[missing_test['missing_count']>0]
missing_test['d_type']='test'

ind = np.arange(missing_train.shape[0])
width = 0.9
fig, ax = plt.subplots(figsize=(12,5))
rects = ax.barh(ind, missing_train.missing_count.values, color='y')
ax.set_yticks(ind)
ax.set_yticklabels(missing_train.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column for train data set")
plt.show()

ind = np.arange(missing_test.shape[0])
width = 0.9
fig, ax = plt.subplots(figsize=(12,5))
rects = ax.barh(ind, missing_test.missing_count.values, color='y')
ax.set_yticks(ind)
ax.set_yticklabels(missing_test.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column for test data set")
plt.show()


This shows that both the train and test data sets have the same ratio of missing values.
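Since the two sets differ in size, comparing fractions rather than raw counts makes the comparison easier to read. Here is a minimal sketch on toy frames; `missing_fraction`, `toy_train` and `toy_test` are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Fraction of missing values per column, keeping only columns that have gaps."""
    frac = df.isnull().mean()
    return frac[frac > 0].sort_values(ascending=False)

# Toy frames of different sizes but with the same share of missing values in "a".
toy_train = pd.DataFrame({"a": [1, np.nan, 3, np.nan], "b": [1, 2, 3, 4]})
toy_test = pd.DataFrame({"a": [np.nan, 2, np.nan, 4, 5, np.nan, 7, np.nan],
                         "b": range(8)})

# Side-by-side table of missing fractions, aligned on column name.
comparison = pd.DataFrame({
    "train": missing_fraction(toy_train),
    "test": missing_fraction(toy_test),
})
print(comparison)
```

On the real data, `missing_fraction(train)` and `missing_fraction(test)` could be placed side by side in the same way.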

**Data Type of Columns:**

Now let us also check the data type of the columns.

In [None]:
dtype_df = train.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()

So majority of them are numerical variables with 5 factor variables.
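To see which columns are the factor (object) ones, the column names can be listed by dtype. This is a small sketch on a toy frame; the column values are made up and only the column names echo the data set.

```python
import pandas as pd

# Toy frame: one object column and two numeric columns (values are made up).
toy = pd.DataFrame({
    "Id": pd.Series(["ID_1", "ID_2"], dtype="object"),
    "rooms": [3, 5],
    "meaneduc": [8.5, 11.0],
})

# Number of columns per dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Names of the non-numeric (object) columns.
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```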

**Correlation Heatmap of Top  correlated features with Target**
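Spearman correlation, which this section relies on, is the Pearson correlation computed on ranks rather than raw values. Here is a minimal sketch of that idea on made-up numbers; the toy frame and its values are assumptions for illustration, not taken from the data set.

```python
import pandas as pd

# Made-up data: more rooms and less overcrowding go with a higher target.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 5, 6],
    "overcrowding": [4.0, 3.0, 2.5, 1.5, 1.0],
    "Target": [1, 2, 2, 3, 4],
})

# Rank every column (ties get the average rank), then take the
# Pearson correlation of each feature's ranks with the target's ranks.
ranks = toy.rank()
spearman = ranks.drop(columns="Target").corrwith(ranks["Target"])
print(spearman.sort_values())
```

Because only ranks matter, the measure picks up any monotonic relationship with the target, not just a linear one.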

In [None]:
from scipy.stats import spearmanr
import warnings
warnings.filterwarnings("ignore")

labels = []
values = []
for col in train.columns:
    if col not in ["Id", "Target"]:
        labels.append(col)
        values.append(spearmanr(train[col].values, train["Target"].values)[0])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})
corr_df = corr_df.sort_values(by='corr_values')
 

In [None]:
cols_to_use = corr_df[(corr_df['corr_values']>0.21) | (corr_df['corr_values']<-0.21)].col_labels.tolist()

temp_df = train[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(20, 20))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True, cmap="YlGnBu", annot=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()

**Finding Feature Importance with Light GBM**

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import mean_squared_error
from tqdm import tqdm

In [None]:
target = train['Target'].astype('int')
train.drop(['Id','Target'], axis=1, inplace=True)


In [None]:
obj_columns = [f_ for f_ in train.columns if train[f_].dtype == 'object']
for col in tqdm(obj_columns):
    le = LabelEncoder()
    le.fit(train[col].astype(str))
    train[col] = le.transform(train[col].astype(str))

In [None]:
lgbm = LGBMClassifier()
xgbm = XGBClassifier()
train = train.astype('float32') # For faster computation
lgbm.fit(train, target , verbose=False)
xgbm.fit(train, target ,verbose=False)

In [None]:
LGBM_FEAT_IMP = pd.DataFrame({'Features':train.columns, "IMP": lgbm.feature_importances_}).sort_values(by='IMP', ascending=False)

XGBM_FEAT_IMP = pd.DataFrame({'Features':train.columns, "IMP": xgbm.feature_importances_}
                            ).sort_values(
                              by='IMP', ascending=False)

**Top 10 features as seen by LightGBM model**

In [None]:
LGBM_FEAT_IMP.head(10).transpose()

**Top 10 features as seen by XGBoost model**

In [None]:
XGBM_FEAT_IMP.head(10).transpose()

**Into charts**

In [None]:
data = [go.Bar(
            x= LGBM_FEAT_IMP.head(50).Features,
            y= LGBM_FEAT_IMP.head(50).IMP, 
            marker=dict(color='green',))
       ]
layout = go.Layout(title = "LGBM Top 50 Feature Importances")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
data = [go.Bar(
            x= XGBM_FEAT_IMP.head(50).Features,
            y= XGBM_FEAT_IMP.head(50).IMP, 
            marker=dict(color='blue',))
       ]
layout = go.Layout(title = "XGBM Top 50 Feature Importances")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

**So the top 10 variables and their description from the data dictionary are:**
1. overcrowding - # persons per room
2. meaneduc - average years of education for adults (18+)
3. qmobilephone - # of mobile phones
4. SQBedjefe - edjefe squared
5. rooms - number of all rooms in the house
6. SQBdependency - dependency squared
7. dependency - Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of members of the household between 19 and 64)
8. edjefa - years of education of female head of household, based on the interaction of escolari (years of education), head of household and gender, yes=1 and no=0
9. r4h2 - Males 12 years of age and older
10. r4t2 - persons 12 years of age and older

**Number of Rooms in house**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="rooms", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Number of rooms', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

This looks like a roughly normal distribution of the number of rooms per house, with a mean of about 5 rooms. Most houses have 4-6 rooms.

**Number of mobile phones used in a family**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="qmobilephone", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('No of mobile phones', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

The number of mobile phones looks roughly normally distributed with a slight skew. Most families have 1-3 mobile phones.

**Number of persons per room**


In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="overcrowding", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Number persons per room', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

We can see that there are mostly 1 to 2 persons per room. There are a few cases with 4-6 persons per room.

**Average years of education for adults (18+)**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="meaneduc", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('average years of education for adults (18+)', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

The average years of education for adults (18+) is mostly concentrated in the range of 5 to 12 years.

**Years of education of female head of household.**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="edjefa", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Years of education of female head of household', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

Surprising!! I see 20 years of education of female head of household.

**Numbers of Persons 12 years of age and older**

What is with 12 years??

- It is considered as **Minimum age of consent (sex with persons under this age is always illegal)**. In most of the families there are 1-2 males of 12 years of age and older.

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="r4t2", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Persons 12 years of age and older', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

Most families have 2-5 persons who are 12 years of age and older.

**Males 12 years of age and older**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="r4h2", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Males 12 years of age and older', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

When it comes to males 12 years of age and older:

**Observations**
1. Most families have 1-3 males aged 12 years or older.
2. There are more than 500 families with no males 12 years of age and older.

**Square of years of education of male head of household** 

**What is Head of Household??**

In U.S. tax terms, Head of Household is a filing status for single or unmarried taxpayers who keep up a home for a Qualifying Person. The Head of Household filing status has some important tax advantages over the Single filing status. ... Also, Heads of Household must have a higher income than Single filers before they owe income tax. In this data set, the term more plausibly just refers to the person identified as the head of the surveyed household.





In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="SQBedjefe", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Square of years of education of male head of household', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

This shows that nearly 50% of people are uneducated, with some comparatively large bars at 36 and 121 (that is, 6 and 11 years of education, squared).

**Dependency**

How is the dependency rate calculated?

Dependency rate = (number of members of the household younger than 19 or older than 64) / (number of members of the household between 19 and 64)
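As a worked example of this formula, here is a minimal sketch that computes the rate from a list of household ages. The function name `dependency_rate` is hypothetical, and returning `None` when nobody is aged 19-64 is an assumption of this sketch, since the formula itself is undefined in that case.

```python
def dependency_rate(ages):
    """Dependents (younger than 19 or older than 64) divided by members aged 19-64."""
    dependents = sum(1 for age in ages if age < 19 or age > 64)
    working_age = sum(1 for age in ages if 19 <= age <= 64)
    if working_age == 0:
        # Assumption of this sketch: the ratio is undefined, so return None.
        return None
    return dependents / working_age

# Two children and one grandparent supported by two working-age adults.
example_rate = dependency_rate([5, 12, 35, 38, 70])
print(example_rate)
```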


In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="dependency", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Dependency', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

This shows a multimodal distribution with distinct peaks (local maxima) at 7 and 30.

**Squared dependency rate**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="SQBdependency", data=train)
plt.ylabel('Count', fontsize=12)
plt.xlabel('dependency squared', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

For every dependency value of **True**, the squared dependency rate is taken as **1**, and for **False** as **0**. So there are large bars at 0 and 1.
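The relationship between the flag values and the squared column can be sketched as follows. This assumes the raw `dependency` column stores the flags as the strings "yes" and "no" next to numeric strings; that encoding, the toy values and the name `flag_map` are assumptions made for illustration.

```python
import pandas as pd

# Toy raw column mixing flag strings with numeric strings (assumed encoding).
toy_dependency = pd.Series(["yes", "no", "0.5", "2", "yes"], dtype="object")

# Map the flags to 1 and 0, leave other entries as they are, then convert to float.
flag_map = {"yes": 1.0, "no": 0.0}
numeric = toy_dependency.map(lambda v: flag_map.get(v, v)).astype(float)

# Squaring keeps the flags at 1 and 0, which is why those bars stand out.
squared = numeric ** 2
print(squared.tolist())
```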

Many things to come..Stay tuned!