# Kiva - Data analysis and Poverty estimation
***

**Mhamed Jabri — 02/27/2018**

Machine Learning is **disruptive**. That's no news, everyone knows that by now. Every industry is being affected by AI, and AI/ML startups are booming ... No doubt, now sounds like the perfect time to be a data scientist !  
That being said, two industries stand out for me when it comes to applying machine learning : **Healthcare and Economic Development**. Not that other applications aren't useful or interesting, but those two are, in my opinion, showing how we can really use technology to make the world a better place. Some impressive projects in those fields are being conducted right now by several teams; *Stanford* has an ongoing project about predicting poverty rates with satellite imagery, how impressive is that ?!

Here on Kaggle, we already have experience with the first one (healthcare). For example, every year there's the *Data Science Bowl* challenge where competitors do their very best to achieve something unprecedented; in 2017, the competition's goal was to **improve cancer screening care and prevention**.  
I was very excited and pleased when I got the email informing me about the Kiva Crowdfunding challenge, and it's nice to know that this is only the beginning, with many more competitions to come in the Data Science for Good program.

Being very interested in those issues myself and taking courses such as *Microeconomics* and *Data Analysis for Social Scientists* (if interested, you can find both courses [here](https://micromasters.mit.edu/dedp/), excellent content offered by MIT and the Abdul Latif Jameel Poverty Action Lab), I decided to publish a notebook in this challenge and take the opportunity to use everything I've learned so far.     
**Through this notebook**, I hope that you will not only learn some Data Analysis / Machine Learning, but also (and maybe mostly) learn a lot about economics (I'll do my best), about poverty challenges in the countries where Kiva is heavily involved, and about how you can collect data that's useful for those problems. Hopefully it will also inspire you to apply your data science skills to build a better world in the future !

**P.S : This will be a work in progress for at least a month. I will constantly try to improve the content, add new stuff and make use of any interesting new dataset that gets published for this competition.**

Enjoy !

# Table of contents
***

* [About Kiva and the challenge](#introduction)

* [1. Exploratory Data Analysis](#EDA)
   * [1.1. Data description](#description)
   * [1.2. Use of Kiva around the world](#users)
   * [1.3. Loans, how much and what for ?](#projects)
   * [1.4. How much time until you get funded ?](#dates)
   * [1.5. Amount of loan VS Repayment time ?](#ratio)
   * [1.6. Lenders : who are they and what drives them ?](#lenders)

* [2. Poverty estimation model](#prediction)
   * [2.1. What's poverty ?](#definition)
   * [2.2. Multidimensional Poverty Index](#mpi)
   * [2.3 Proxy Means Test](#pmt)     
   * [2.4 Building the model](#model)  
        * [2.4.1 Acquiring more data : DHS](#data)   
        * [2.4.2 Assigning loans to clusters](#knn)   
        * [2.4.3 Custom Kiva metric](#customkiva)   
   * [2.5 Summary and pipeline](#summary)  
* [Conclusion](#conclusion)

#  About Kiva and the challenge
***

Kiva is a non-profit organization that allows anyone to lend money to people in need in over 80 countries. When you go to kiva.org, you can choose a theme (Refugees, Shelter, Health ...) or a country and you'll get a list of all the loans you can fund, with a description of the borrower, their needs and the time they'll need for repayment. So far, Kiva has funded more than 1 billion dollars in loans to 2 million borrowers and is considered a major actor in the fight against poverty, especially in many African countries.

In this challenge, the ultimate goal is to obtain information that is as precise as possible about the poverty level of each borrower / region, because that would help set investment priorities. Kagglers are invited to use Kiva's data as well as any external public datasets to build their poverty estimation model.  
As for Kiva's data, here's what we've got : 
* **kiva_loans** : That's the dataset that contains most of the information about the loans (id of borrower, amount of loan, time of repayment, reason for borrowing ...)
* **kiva_mpi_region_locations** : This dataset contains the MPI of many regions (subnational) in the world.
* **loan_theme_ids** : This dataset has the same unique_id as the kiva_loans (id of loan) and contains information about the theme of the loan.
* **loan_themes_by_region** : This dataset contains specific information about the geolocation of the loans.

This notebook will be divided into two parts : 
1. First I will conduct an EDA using mainly the 4 datasets provided by Kiva. 
2. After that, I'll try to use the information I got from the EDA and external public datasets to build a model for poverty level estimation.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import missingno as msno
from datetime import datetime, timedelta


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('seaborn-darkgrid')
palette = plt.get_cmap('Set1')

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os

# Any results you write to the current directory are saved as output.

# 1. Exploratory Data Analysis
<a id="EDA"></a>
*** 
In this part, the goal is to understand the data that was given to us through plots and statistics, draw multiple conclusions and see how we can use those results to build the features that will be needed for our machine learning model. 

Let's first see what this data is about.

## 1.1 Data description
<a id="description"></a>
*** 
Let's load the 4 csv files we have and start by analyzing the biggest one : kiva loans.

In [None]:
df_kiva_loans = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv")
df_loc = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv")
df_themes = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_theme_ids.csv")
df_mpi = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/kiva_mpi_region_locations.csv")

df_kiva_loans.head(5)

Before going any further, let's take a look at the missing values so that we don't encounter any bad surprises along the way.

In [None]:
msno.matrix(df_kiva_loans);

Seems that this dataset is pretty clean ! The *tags* column has a lot of missing values but that's not a big deal. The *funded_time* column has a little less than 10% missing values; that's quite a few, but since we have more than 600 000 rows, we can drop the missing rows if we need to and we'll still get some telling results !  
Let's get some global information about each of our columns.

In [None]:
df_kiva_loans.describe(include = 'all')

Plenty of useful information in this summary :
* There are exactly 87 countries where people borrowed money according to this snapshot.
* There are 11298 distinct values in the gender column ! That's obviously not the number of genders, so we'll see later on why we have this value. 
* The funding mean over the world is 786 dollars while the funding median is 450 dollars.
* More importantly : there are only 1298 different dates on which loans were posted. If we calculate the ratio, **it means that there are more than 500 loans posted per day on Kiva** and that's just a snapshot (a sample of their entire data). This gives you a clear idea about how important this crowdfunding platform is and what impact it has.
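
The loans-per-day figure is simply the number of rows divided by the number of distinct posting dates. Here is a minimal sketch of that calculation, using a hypothetical helper `loans_per_day` and assuming the dates sit in a column named `date` (adjust the name if your snapshot differs):

```python
import pandas as pd

def loans_per_day(df, date_col="date"):
    """Average number of loans per distinct posting date.

    `date_col` is an assumed column name used for illustration.
    """
    return len(df) / df[date_col].nunique()

# Tiny made-up example: 5 loans posted on 2 distinct dates
toy = pd.DataFrame({"date": ["2017-01-01"] * 3 + ["2017-01-02"] * 2})
toy_rate = loans_per_day(toy)

# Same ratio with the two figures quoted in this notebook
snapshot_rate = 671205 / 1298
```

With the figures quoted in this notebook (671205 rows, 1298 dates), the ratio comes out at a bit more than 517 loans per day.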

## 1.2. Kiva users 
<a id="users"></a>
*** 
In this part we will focus on the basic demographic properties of people who use Kiva to ask for loans : Where do they live ? what's their gender ? Their age would be a nice property but we don't have direct access to that for now, we'll get to that later.

Let's first start with their countries : as seen above, the data contains 671205 rows. In order to have the most (statistically) significant results going further, I'll only keep the countries that represent at least 0.5% of Kiva's community. 
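
Rather than hard-coding a row count, the 0.5% cutoff can also be expressed directly as a share of all loans. Below is a small sketch with a hypothetical helper `frequent_values` (not part of the original notebook), shown on made-up data:

```python
import pandas as pd

def frequent_values(series, min_share=0.005):
    """Return the counts of values that make up at least `min_share` of the rows."""
    counts = series.value_counts()
    shares = counts / counts.sum()
    return counts[shares >= min_share]

# Made-up data: 'C' accounts for only 1 row out of 1000 (0.1%)
toy_countries = pd.Series(["A"] * 600 + ["B"] * 399 + ["C"] * 1)
kept = frequent_values(toy_countries)
```

With 671205 rows, a 0.5% share corresponds to roughly 3356 loans, which is close to the rounded value of 3400 used in this notebook.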

In [None]:
country_counts = df_kiva_loans['country'].value_counts()
countries = country_counts[country_counts > 3400]  # 3400 loans is roughly 0.5% of the 671205 rows
list_countries = list(countries.index)  # list of the countries that will be used most often

In [None]:
plt.figure(figsize=(13,8))
sns.barplot(y=countries.index, x=countries.values, alpha=0.6)
plt.title("Number of borrowers per country", fontsize=16)
plt.xlabel("Nb of borrowers", fontsize=16)
plt.ylabel("Countries", fontsize=16)
plt.show();

The Philippines is the country with the most borrowers, with approximately 25% of all users being Filipino. Elliott Collins, from the Kiva team, explained that this is due to the fact that a couple of Philippine field partners tend to make smaller short-term loans (popular low-risk loans + fast turnover rate). 


We also notice that several African countries are in the list such as *Kenya, Mali, Nigeria, Ghana ...* and no European Union country at all !     
For me, the most surprising thing was actually the presence of the US in this list, as it doesn't have the same poverty rate as the other countries, but it turns out it's indeed a specific case, **I'll explain that in 1.4**.

Let's now move on to the genders.

In [None]:
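# Collapse every value other than 'female' / 'male' (e.g. "female, female, male") into 'group'.
# Note that missing values also end up in 'group' with this rule.
df_kiva_loans['borrower_genders'] = [elem if elem in ['female', 'male'] else 'group'
                                     for elem in df_kiva_loans['borrower_genders']]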

borrowers = df_kiva_loans['borrower_genders'].value_counts()
labels = (np.array(borrowers.index))
values = (np.array((borrowers / borrowers.sum())*100))

trace = go.Pie(labels=labels, values=values,
              hoverinfo='label+percent',
               textfont=dict(size=20),
                showlegend=True)

layout = go.Layout(
    title="Borrowers' genders"
)

data_trace = [trace]
fig = go.Figure(data=data_trace, layout=layout)
py.iplot(fig, filename="Borrowers_genders")

In many loans (16.4% as you can see), the borrower is not actually a single person but a group of people who have a project; here's an [example](https://www.kiva.org/lend/1440912). In the dataset, they're listed as 'female, female, female' or 'male, female' ... I decided to use the label *group* for those borrowers on the pie chart above.
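
One caveat with a simple "female / male / everything else" rule is that rows with a missing gender also fall into the group bucket. A possible variant, given here as an illustration only (the helper name `label_borrowers` and the `'unknown'` label are assumptions, not something the dataset defines), keeps them separate:

```python
import pandas as pd

def label_borrowers(value):
    """Map a raw borrower_genders value to 'female', 'male', 'group' or 'unknown'."""
    if pd.isnull(value):
        return "unknown"          # missing entry, kept apart from real groups
    if value in ("female", "male"):
        return value              # single borrower
    return "group"                # e.g. "female, female, male"

# Made-up values in the same style as the borrower_genders column
raw = pd.Series(["female", "male", "female, female, male", None])
labels = raw.map(label_borrowers)
```

Whether this distinction matters depends on how many rows are missing a gender; it would shift the group share on the pie chart only if that number is not negligible.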

You can see that most borrowers are female, I didn't expect that and it was actually a great surprise. This means that **women are using Kiva to get funded and work on their projects in countries (most of them are third world countries) where breaking in as a woman is still extremely difficult.**

## 1.3 Activities, sectors and funding amounts
<a id="projects"></a>
***

Now let's take a peek at what people need loans for and what amounts they're asking for. Let's start with the sectors. There were 15 unique sectors in the summary we've seen above, let's see how each of them fares.

In [None]:
plt.figure(figsize=(13,8))
sectors = df_kiva_loans['sector'].value_counts()
sns.barplot(y=sectors.index, x=sectors.values, alpha=0.6)
plt.xlabel('Number of loans', fontsize=16)
plt.ylabel("Sectors", fontsize=16)
plt.title("Number of loans per sector")
plt.show();

**The most dominant sector is Agriculture**, which is not surprising given the list of countries that heavily use Kiva. A quick search for Kenya, for example, shows that the whole first page is about agriculture loans; here's a sample of what you would find:  *buy quality seeds and fertilizers to use in farm*, *buy seeds to start a horticulture farming business so as a single mom*, *Purchase hybrid maize seed and fertilizer* ... The Food sector occupies an important part too because many people are looking to buy fish, vegetables and stocks to keep their businesses running.  
It's important to note that *Personal Use* occupies a significant part too; this means there are people who don't use Kiva to get a hand with their work but because they are in great need.

Let's look at a more detailed version and do a countplot for **activities** :

In [None]:
plt.figure(figsize=(15,10))
activities = df_kiva_loans['activity'].value_counts().head(50)
sns.barplot(y=activities.index, x=activities.values, alpha=0.6)
plt.ylabel("Activity", fontsize=16)
plt.xlabel('Number of loans', fontsize=16)
plt.title("Number of loans per activity", fontsize=16)
plt.show();

This plot only confirms the previous one: activities related to agriculture come out on top : *Farming, Food production, pigs ...*. All in all, we notice that none of the activities could be called 'sophisticated'. Everything is about basic daily needs or small businesses like buying and reselling clothes ...

How about the money those people need to pursue their goals ?

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(df_kiva_loans['loan_amount'])
plt.ylabel("density estimate", fontsize=16)
plt.xlabel('loan amount', fontsize=16)
plt.title("KDE of loan amount", fontsize=16)
plt.show();

Some outliers are clearly skewing the distribution and the plot doesn't give much information in this form : We need to **truncate the data**, how do we do that ? 

We'll use a basic yet really powerful rule : the **68–95–99.7 rule**. This rule states that for a normal distribution :
* 68.27% of the values $ \in [\mu - \sigma , \mu + \sigma]$
* 95.45% of the values $ \in [\mu - 2\sigma , \mu + 2\sigma]$
* 99.7% of the values $ \in [\mu - 3\sigma , \mu + 3\sigma]$     
where $\mu$ and $\sigma$ are the mean and standard deviation of the normal distribution.

It's true that the distribution here isn't normal, but for a shape like the one we've got, we'll see that applying the third filter (keeping only the values within $3\sigma$ of the mean) will **improve the plot radically**.
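
To make the filter concrete, here is a minimal sketch of a three-sigma trim on made-up amounts. The helper name `within_n_sigma` is hypothetical; the logic is the same comparison against the mean and standard deviation described by the rule.

```python
import pandas as pd

def within_n_sigma(series, n=3):
    """Keep only the values lying within n standard deviations of the mean."""
    deviation = (series - series.mean()).abs()
    return series[deviation <= n * series.std()]

# Made-up amounts: 99 ordinary loans and one extreme value
amounts = pd.Series([100] * 99 + [100000])
trimmed = within_n_sigma(amounts)
```

Keep in mind that the mean and the standard deviation are themselves pulled up by the outliers, so this is a rough visual clean-up rather than a rigorous outlier test.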


In [None]:
temp = df_kiva_loans['loan_amount']

plt.figure(figsize=(12,8))
sns.distplot(temp[~((temp-temp.mean()).abs()>3*temp.std())]);
plt.ylabel("density estimate", fontsize=16)
plt.xlabel('loan amount', fontsize=16)
plt.title("KDE of loan amount (outliers removed)", fontsize=16)
plt.show();

Well, that's clearly a lot better !    
* Most of the loans are between 100\$ and 600\$ with a first peak at 300\$.
* The density naturally decreases as the amount grows, but we notice a clear second peak at 1000\$. This suggests that there may be a specific class of projects that are more 'sophisticated' and get funded from time to time, interesting.

How about more detail ? We have information about the loan amount in general, let's now look at it sector-wise : 

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(x='loan_amount', y="sector", data=df_kiva_loans);
plt.xlabel("Value of loan", fontsize=16)
plt.ylabel("Sector", fontsize=16)
plt.title("Sectors loans' amounts boxplots", fontsize=16)
plt.show();

As you can see, every sector has outlier loans. For example, it seems someone asked for a 100k loan for an agriculture project. There are also many 20k, 50k ... loans. But as seen earlier, the typical amount is only a few hundred dollars, so we have to get rid of those outliers to obtain better boxplots.  
First, let's see the median loan amount for each sector; this will give an idea about the value to use as a threshold.

In [None]:
round(df_kiva_loans.groupby(['sector'])['loan_amount'].median(),2)

The highest median is 950 dollars, for the sector *Wholesale*. Basically, using a threshold that is roughly double this value (so 2000 dollars) is more than safe and we wouldn't be losing much information.

In [None]:
temp = df_kiva_loans[df_kiva_loans['loan_amount']<2000]
plt.figure(figsize=(15,8))
sns.boxplot(x='loan_amount', y="sector", data=temp)
plt.xlabel("Value of loan", fontsize=16)
plt.ylabel("Sector", fontsize=16)
plt.title("Sectors loans' amounts boxplots", fontsize=16)
plt.show();

This is obviously better visually ! We can also draw some important conclusions. The median and the interquartile range for 'Personal Use' are much lower than for any other sector. Education and Health are a bit higher compared to other 'standard sectors'.    
Why can such information be considered important ? Well, keep in mind throughout this notebook that our final goal is to estimate poverty levels.   
Since the loan amounts aren't distributed the same way for all sectors, it may mean, for example, that borrowers with Personal Use loans are poorer and need a small amount for critical needs. This is not necessarily true but it's a hypothesis we can take advantage of going further.

## 1.4. Waiting time for funds
<a id="dates"></a>
*** 

So far we got to see where Kiva is most popular, the nature of activities borrowers need the money for and how much money they usually ask for, great !    

An interesting question now would also be : **how long do they actually have to wait for funding ?** As we've seen before, some people on the platform are asking for loans for critical needs and can't afford to wait for months to buy groceries or have a shelter. Fortunately, we've got two columns that will help us in our investigation : 
* funded_time : corresponds to the date + exact hour **when the funding was completed.**
* posted_time : corresponds to the date + exact hour **when the post appeared on the website.**

We've also seen before that we have some missing values for 'funded_time' so we'll drop those rows (and those missing 'disbursed_time'), get the columns in the correct date format and then calculate the difference between them.
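
As a quick illustration of the conversion to days, here is a sketch on two made-up timestamps. It parses the strings with `pd.to_datetime(..., utc=True)` instead of stripping the `+00:00` suffix by hand; that is an alternative chosen for this example, not the approach used in the cell that follows.

```python
import pandas as pd

# Made-up timestamps in the same style as the Kiva columns
toy = pd.DataFrame({
    "posted_time": ["2017-01-01 00:00:00+00:00"],
    "funded_time": ["2017-01-04 12:00:00+00:00"],
})

posted = pd.to_datetime(toy["posted_time"], utc=True)
funded = pd.to_datetime(toy["funded_time"], utc=True)

# Dividing a timedelta by one day yields the duration in days as a float
waiting_days = (funded - posted) / pd.Timedelta(days=1)
```

Here the loan waits 3 days and 12 hours, so `waiting_days` holds 3.5.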

In [None]:
# Drop rows without a funding or disbursal date; .copy() avoids pandas' SettingWithCopyWarning below
loans_dates = df_kiva_loans.dropna(subset=['disbursed_time', 'funded_time'], how='any').copy()

dates = ['posted_time','disbursed_time','funded_time']
# Strip the timezone suffix (everything after '+') before parsing
loans_dates[dates] = loans_dates[dates].applymap(lambda x : x.split('+')[0])

loans_dates[dates]=loans_dates[dates].apply(pd.to_datetime)
loans_dates['time_funding']=loans_dates['funded_time']-loans_dates['posted_time']
# Dividing by a one-day timedelta gives the waiting time in days as a float,
# for example: 3 days 12 hours = 3.5
loans_dates['time_funding'] = loans_dates['time_funding'] / timedelta(days=1)

Now, first things first, we'll plot this difference that we called *time_funding*. To limit the effect of outliers, we'll apply the same three-sigma rule as before.

In [None]:
temp = loans_dates['time_funding']

plt.figure(figsize=(12,8))
sns.distplot(temp[~((temp-temp.mean()).abs()>3*temp.std())]);

I was really surprised when I got this plot (and happy too), you'll rarely find a histogram where the density fits this smoothly !   
On top of that, getting two peaks was the icing on the cake, it makes perfect sense ! **We've seen above that there are two peaks for loan amounts, at 300\$ and 1000\$; this plot suggests that for the first kind of loan you would be waiting about 7 days and for the second kind a little more than 30 days !**   
This gives us a great intuition about how those loans work going forward.

Let's be more specific and check for both loan amounts and waiting time country-wise :   
We'll build two new DataFrames using the groupby function and we'll aggregate using the median : what we'll get is the median loan amount (respectively waiting time) for each country.

In [None]:
df_ctime = round(loans_dates.groupby(['country'])['time_funding'].median(),2)
df_camount = round(df_kiva_loans.groupby(['country'])['loan_amount'].median(),2)

In [None]:
df_camount = df_camount[df_camount.index.isin(list_countries)].sort_values()
df_ctime = df_ctime[df_ctime.index.isin(list_countries)].sort_values()

f,ax=plt.subplots(1,2,figsize=(20,10))

sns.barplot(y=df_camount.index, x=df_camount.values, alpha=0.6, ax=ax[0])
ax[0].set_title("Medians of funding amounts per loan country wise ")
ax[0].set_xlabel('Amount in dollars')
ax[0].set_ylabel("Country")

sns.barplot(y=df_ctime.index, x=df_ctime.values, alpha=0.6,ax=ax[1])
ax[1].set_title("Medians of waiting days per loan to be funded country wise  ")
ax[1].set_xlabel('Number of days')
ax[1].set_ylabel("")

plt.tight_layout()
plt.show();

**Left plot**    
We notice that in most countries, funded loans don't usually exceed 1000\$. For the Philippines, Kenya and El Salvador (the three most represented countries as seen above), the median funded amounts per loan are respectively : 275.00\$, 325.00\$ and 550.00\$.

The funded amount for US-based loans seems to be a lot higher than for other countries. I dug deeper and looked on Kiva's website. **It appears that there's a special section called 'Kiva U.S.' whose goal is to fund small businesses for *financially excluded and socially impactful borrowers*.**   
Examples of such businesses : expanding a donut shop in Detroit (10k\$), purchasing equipment and paying for services used to professionally train kids in basketball ... You can see more of them [here](https://www.kiva.org/lend/kiva-u-s).    
This explains what we've been seeing earlier : the fact that the US is among the countries, the large loan amounts, the two-peak plots ...

**Right plot**   
The results in this one aren't that intuitive. 
* Paraguay is the second fastest country when it comes to how long a loan waits to be funded, while it was also the country with the second highest amount per loan in the plot above !  
* The US loans take the most time to get funded and that's only natural since their loan amounts are much higher than in the other countries.
* Most African countries are in the first half of the plot.

## 1.5. Amount of loan vs Repayment time
<a id="ratio"></a>
*** 

We have information about the number of months borrowers need to repay their loans. Simply plotting the average / median repayment time per country can give some insights, but what's even more important is **the ratio of the amount of loan to repayment time**. Indeed, let's say in country A, loans are repaid after 12 months on average and in country B after 15 months on average; if you stop here, you can just say *people in country B need more time to repay their loans compared to people in country A*. Now let's say the average amount of loans in country A is 500\$ while it's 800\$ in country B, then it means that *people in country A repay 41.67\$ per month while people in country B repay 53.33\$ per month* !   
This ratio gives you an idea about **how much people in a given country can afford to repay per month**.
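
The arithmetic behind the two-country example can be checked in a couple of lines. This is only a sketch with a hypothetical helper `monthly_repayment`, which assumes the loan is repaid evenly over its term:

```python
def monthly_repayment(loan_amount, term_in_months):
    """Amount repaid per month if the loan is spread evenly over its term."""
    return loan_amount / term_in_months

country_a = monthly_repayment(500, 12)   # about 41.67 dollars per month
country_b = monthly_repayment(800, 15)   # about 53.33 dollars per month
```

So even though borrowers in country B take longer to repay, they repay more per month than borrowers in country A.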

In [None]:
df_repay = round(df_kiva_loans.groupby(['country'])['term_in_months'].median(),2)
df_repay = df_repay[df_repay.index.isin(list_countries)].sort_values()

df_kiva_loans['ratio_amount_duration']= df_kiva_loans['funded_amount']/df_kiva_loans['term_in_months'] 
temp = round(df_kiva_loans.groupby('country')['ratio_amount_duration'].median(),2)
temp = temp[temp.index.isin(list_countries)].sort_values()

f,ax=plt.subplots(1,2,figsize=(20,10))

sns.barplot(y=temp.index, x=temp.values, alpha=0.6, ax=ax[0])
ax[0].set_title("Ratio of amount of loan to repayment period per country", fontsize=16)
ax[0].set_xlabel("Ratio value", fontsize=16)
ax[0].set_ylabel("Country", fontsize=16)

sns.barplot(y=df_repay.index, x=df_repay.values, alpha=0.6,ax=ax[1])
ax[1].set_title("Medians of number of months per repayment, per country",fontsize=16)
ax[1].set_xlabel('Number of months', fontsize=16)
ax[1].set_ylabel("")

plt.tight_layout()
plt.show();

From these 2 plots, we notice that **Nigeria** is the country with the smallest ratio (around 8 dollars per month) while Paraguay has a surprisingly high one. Also, it seems that, typically, loans are repaid after about 1 year, except for India where they take much longer.    
In the second part of this kernel, we'll see whether the fact that a country/region repays a loan rapidly (in other words, the ratio of that country is high) is correlated with poverty or not.

## 1.6. Lenders community
<a id="lenders"></a>
*** 
We said that we would talk about Kiva users, and that includes lenders too ! It's true that our main focus here remains the borrowers and their critical needs, but it's still nice to know more about who uses Kiva in the broadest sense and also get an idea about **what drives people to actually fund projects ?**   
Thanks to additional datasets, we have free-form text data about the lenders and their reasons for funding, let's find out more about that.

In [None]:
lenders = pd.read_csv('../input/additional-kiva-snapshot/lenders.csv')
lenders.head()

Seems like this dataset is filled with missing values :). We'll still be able to retrieve some information, let's start by checking which country has the most lenders.

In [None]:
lender_countries = lenders.groupby(['country_code']).count()[['permanent_name']].reset_index()
lender_countries.columns = ['country_code', 'Number of Lenders']
lender_countries.sort_values(by='Number of Lenders', ascending=False,inplace=True)
lender_countries.head(7)

Two things here :    
* The US is, by far, the country with the most lenders. It has approximately 9 times more lenders than any other country. If we want to plot a map or a barplot with this information, we have two choices : either we leave out the US or we use a logarithmic scale, which means we'll apply $\log_{10}(1+x)$ to each $x$ in the column *Number of Lenders*. The logarithmic scale lets us deal with skewness towards large values when one or more points are much larger than the bulk of the data (here, the US).
* We don't have a column with country names so we'll need to use another dataset to get those and plot a map.
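
To see what the logarithmic scale does to a skewed column, here is a small sketch on made-up lender counts. It uses a base-10 logarithm, which is also what the map cell in this notebook applies.

```python
import numpy as np

# Made-up lender counts with one dominant country
lender_counts = np.array([0, 9, 99, 999, 99999])

# log10(1 + x): the +1 keeps countries with zero lenders defined (log10(1) = 0)
log_counts = np.log10(1 + lender_counts)
```

The transformed values land at roughly 0, 1, 2, 3 and 5, so the dominant country no longer crushes the rest of the color scale.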

Here's another additional dataset that contains poverty information about each country. For the time being, we'll only use the column *country_name* to merge it with our previous dataset.

In [None]:
countries_data = pd.read_csv( '../input/additional-kiva-snapshot/country_stats.csv')
countries_data.head()

In [None]:
countries_data = pd.read_csv( '../input/additional-kiva-snapshot/country_stats.csv')
lender_countries = pd.merge(lender_countries, countries_data[['country_name','country_code']],
                            how='inner', on='country_code')

data = [dict(
        type='choropleth',
        locations=lender_countries['country_name'],
        locationmode='country names',
        z=np.log10(lender_countries['Number of Lenders']+1),
        colorscale='Viridis',
        reversescale=False,
        marker=dict(line=dict(color='rgb(180,180,180)', width=0.5)),
        colorbar=dict(autotick=False, tickprefix='', title='Lenders'),
    )]
layout = dict(
    title = 'Lenders per country in a logarithmic scale ',
    geo = dict(showframe=False, showcoastlines=True, projection=dict(type='Mercator'))
)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='lenders-map')

The US has the largest community of lenders, followed by Canada and Australia. On the other hand, the African continent seems to have the lowest number of lenders, which is to be expected since it's also the region with the highest poverty rates and funding needs.

So now that we know more about the lenders' locations, let's analyze the free-form text column *loan_because* and construct a wordcloud to get an insight into their motives for funding projects on Kiva.

In [None]:
import matplotlib as mpl 
from wordcloud import WordCloud, STOPWORDS
import imageio

heart_mask = imageio.imread('../input/poverty-indicators/heart_msk.jpg') #because displaying this wordcloud as a heart seems just about right :)

mpl.rcParams['figure.figsize']=(12.0,8.0)    #(6.0,4.0)
mpl.rcParams['font.size']=10                #10 

more_stopwords = {'org', 'default', 'aspx', 'stratfordrec','nhttp','Hi','also','now','much','username'}
STOPWORDS = STOPWORDS.union(more_stopwords)

lenders_reason = lenders[~pd.isnull(lenders['loan_because'])][['loan_because']]
lenders_reason_string = " ".join(lenders_reason.loan_because.values)

wordcloud = WordCloud(
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=3200, 
                      height=2000,
                      mask=heart_mask
            ).generate(lenders_reason_string)

plt.imshow(wordcloud)
plt.axis("off")
plt.savefig('./reason_wordcloud.png', dpi=900)
plt.show()

Lenders' answers are heartwarming :) Most reasons contain *help people / others* or *want to help*. We also find that it's the *right thing* (to do), it helps *less fortunate* and makes the world a *better place*.  
Kiva provides a platform for people who need help to fund their projects but it also provides a platform for people who want to make a difference by helping others and maybe changing their lives !

# 2. Welfare estimation
<a id="prediction"></a>
*** 
In this part we'll delve into what this competition is really about : **welfare and poverty estimation.**   
As a lender, you basically have two criteria when you're looking for a loan to fund : the loan description and how much the borrower really needs that loan. For the second, Kiva is trying to get poverty estimates that are as granular as possible through this competition.

In this part, I'll be talking about what poverty really means and how it is measured by economists. I'll also start with a country-level model as an example of what will be said.

Let's start.

    No society can surely be flourishing and happy, of which the far greater part of the members are poor and miserable. - Adam Smith, 1776       


## 2.1 What's poverty ?
<a id="definition"></a>
*** 
The World Bank defines poverty in terms of **income**. The bank defines extreme poverty as living on less than US\$1.90 per day (PPP), and moderate poverty as less than \$3.10 a day.  
P.S: In this part, we'll say (PPP) a lot. It refers to Purchasing Power Parity. I have a notebook that is entirely dedicated to PPP; if you're interested and want to know more about how it works, you can check it [here](https://www.kaggle.com/mhajabri/salary-and-purchasing-power-parity).  
Over the past half century, significant improvements have been made, and yet extreme poverty remains widespread in developing countries. Indeed, an estimated **1.374 billion people live on less than 1.25 \$ per day** (at 2005 U.S. PPP) and around **2.6 billion (which is basically 40% of the world's population !!) live on less than \$ 2 per day**. Those impoverished people suffer from undernutrition and poor health, live in environmentally degraded areas, have low literacy ...

As you can see, poverty seems to be defined exactly by the way it's actually measured, but what's wrong with that definition ? **In developing countries, many of the poor work in the informal sector and lack verifiable income records => income data isn't reliable**. Suppose you're a government running a program that benefits the poorest, or you're Kiva and you want to know who's in the most critical condition. Relying on income-based poverty measures in developing countries will then be misleading: using unreliable information to identify eligible households can result in funds being diverted to richer households, leaving fewer resources for the program's intended beneficiaries. We need another way of measuring poverty. 

## 2.2 Multidimensional Poverty Index
<a id="mpi"></a>
*** 

Well, one day the UNDP (United Nations Development Programme) came and said that *income* is only one **dimension** that can describe poverty levels, and far from the only indicator. Indeed, if you visit someone's house and take a look at what it's like and what it has, you get an intuition of how poor the household is. Based on that, the UNDP came up with the **Multidimensional Poverty Index**, an index that has **3 dimensions and a total of 10 factors** assessing poverty : 
* **Health** : Child Mortality - Nutrition
* **Education** : Years of schooling - School attendance
* **Living Standards** : Cooking fuel - Toilet - Water - Electricity - Floor - Assets

How is the MPI calculated ? Health's and Education's indicators (there are 4 in total) are weighted equally at 1/6. Living standards' indicators are weighted equally at 1/18. The sum of the weights $2*1/6 + 2*1/6 + 6*1/18 = 1$. Going from here, **a person is considered poor if they are deprived in at least a third of the weighted indicators.** 
Example : Given a household with no electricity, bad sanitation, no member with more than 6 years of schooling and no access to safe drinking water, the household's weighted deprivation score would be : 
$$ 1/18 + 1/18 + 1/6 + 1/18 = 1/3$$
So this household is deprived in at least a third of the weighted indicators (score $\geq 1/3$) and is considered MPI-poor.
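
To make the weighting concrete, here is a minimal sketch of that computation for the household in the example. The indicator names and the `deprivation_score` helper are labels I chose for illustration; they are not identifiers from Kiva's or the UNDP's data.

```python
from fractions import Fraction

# Weights as described above: 1/6 for each health / education indicator,
# 1/18 for each living-standards indicator (names are illustrative).
WEIGHTS = {
    "child_mortality": Fraction(1, 6),
    "nutrition": Fraction(1, 6),
    "years_of_schooling": Fraction(1, 6),
    "school_attendance": Fraction(1, 6),
    "cooking_fuel": Fraction(1, 18),
    "toilet": Fraction(1, 18),
    "water": Fraction(1, 18),
    "electricity": Fraction(1, 18),
    "floor": Fraction(1, 18),
    "assets": Fraction(1, 18),
}


def deprivation_score(deprived_in):
    """Sum the weights of the indicators a household is deprived in."""
    return sum((WEIGHTS[name] for name in deprived_in), Fraction(0))


# Household from the example: no electricity, bad sanitation,
# no member with more than 6 years of schooling, no safe drinking water.
household = ["electricity", "toilet", "years_of_schooling", "water"]
score = deprivation_score(household)
is_mpi_poor = score >= Fraction(1, 3)

print(score, is_mpi_poor)  # 1/3 True
```

Using exact fractions avoids the rounding question of whether 0.333... counts as "at least a third".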

Kiva actually included MPI data so let's get a look at it :

In [None]:
df_mpi.head(7)

This dataset gives the MPI of different regions for each country. To get a broad view, let's group by *country*, take the average MPI and plot it on a map.

In [None]:
mpi_country = df_mpi.groupby('country')['MPI'].mean().reset_index()

data = [dict(
        type='choropleth',
        locations=mpi_country['country'],
        locationmode='country names',
        z=mpi_country['MPI'],
        colorscale='Greens',
        reversescale=True,
        marker=dict(line=dict(color='rgb(180,180,180)', width=0.5)),
        colorbar=dict(autotick=False, tickprefix='', title='MPI'),
    )]

layout = dict(
    title = 'Average MPI per country',
    geo = dict(showframe=False, showcoastlines=True, projection=dict(type='Mercator'))
)

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='mpi-map')

As you can notice, the data essentially provides the MPI for the African continent. That shouldn't surprise you : as we said before, for developed countries, income data is reliable and good enough to measure poverty, so there's no need to send researchers into the field to run surveys and collect the data needed for the MPI. That's why you'll find more MPI measurements in developing / poor countries.   

Now that we know what the MPI is about, we can use an Oxford Poverty & Human Development Initiative dataset that's been uploaded here to analyze the **Poverty headcount ratio (% of population listed as poor) in the countries that are using Kiva**.

In [None]:
df_mpi_oxford = pd.read_csv('../input/mpi/MPI_subnational.csv')
temp = df_mpi_oxford.groupby('Country')['Headcount Ratio Regional'].mean().reset_index()

temp = temp[temp.Country.isin(list_countries)].sort_values(by="Headcount Ratio Regional", ascending = False)

plt.figure(figsize=(15,10))
sns.barplot(y=temp.Country, x=temp['Headcount Ratio Regional'], alpha=0.6)
plt.ylabel("Country", fontsize=16)
plt.xlabel('Headcount Ratio National', fontsize=16)
plt.title("Headcount Ratio National per Country", fontsize=16)
plt.show();

First of all, the dataset provides us with regional headcount ratios; as you can see in the code above, I take the national headcount ratio to be the mean of all the regional ratios. That's not perfectly true. A more precise formula would weight the ratios by the % of total population living in each region. For example, if a region in a given country has 40% of the total inhabitants, then it would count 10 times more than a region that has 4% of the total inhabitants.   
So those results are far from perfect (many countries may actually have a higher ratio, others a smaller one) but this gives us a first glance at least !    
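
To show how much the weighting can matter, here is a small sketch on made-up numbers. The `population_share` column is an assumption for illustration : it stands for each region's share of the national population and is not a column I'm claiming exists in the Oxford file.

```python
import pandas as pd

# Hypothetical regions of a single country "A" (values are made up).
regions = pd.DataFrame({
    "Country": ["A", "A", "A"],
    "Headcount Ratio Regional": [20.0, 80.0, 50.0],
    "population_share": [0.40, 0.04, 0.56],  # assumed column, sums to 1
})

# Simple mean of the regional ratios, as in the bar plot above.
simple = regions.groupby("Country")["Headcount Ratio Regional"].mean()

# Population-weighted mean: sum(ratio * share) / sum(share).
regions["weighted"] = regions["Headcount Ratio Regional"] * regions["population_share"]
grouped = regions.groupby("Country")
weighted = grouped["weighted"].sum() / grouped["population_share"].sum()

print(round(simple["A"], 1), round(weighted["A"], 1))
```

Here the small but very poor region pulls the simple mean up to 50%, while the weighted ratio stays near 39%.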

As you can notice, African countries come on top, with the top 3 having more than 70% of their total population listed as poor ! Kenya, which is the most present country in Kiva, has more than 40% of its population listed as poor while Philippines' is 11%.

## 2.3 Proxy Means Test
<a id="pmt"></a>
*** 

### Definition 

Eventually, the World Bank had to come up with its own way of estimating consumption when income data are either unreliable or unavailable. The results are used for **means-testing** as the name suggests, where a **means test is a determination of whether an individual or family is eligible for government assistance, based upon whether the individual or family possesses the means to do without that help.** Sounds familiar ? Yeah, we want to determine if a person is "eligible" for funding here ! So let's learn more about this technique.

For the MPI, we have exactly 10 indicators and those same indicators are used whenever / wherever you're conducting your research. For PMT, there are two key differences : 
1. There is no longer a fixed number of **proxies**. You choose them based on household surveys, and you can pick whichever you think will be the most effective at estimating poverty.
2. You don't have equal weights in PMT. You use statistical methods (mainly regressions) with the dependent variable being either consumption-related or income-related. The regression then gives you the $\beta$s (weights). Example : Ownership of a house could have a weight of 200, number of persons per living room a weight of (-50) ....

One advantage of PMT is that **it is not required that a single model (same proxies or same weights) is used for the entire country**, which gives us a better model overall.
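
Here is a minimal sketch of the idea on synthetic households, just to show where the weights come from. Everything in it is an assumption for illustration : the two proxies, the generated consumption values and the `cutoff` are made up, and plain least squares with NumPy stands in for whatever regression a real PMT would use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic household survey with two illustrative proxies.
owns_house = rng.integers(0, 2, size=n)        # 1 if the household owns its house
persons_per_room = rng.uniform(1, 6, size=n)   # crowding proxy

# Synthetic consumption built with the weights quoted in the text
# (+200 for ownership, -50 per person per room) plus some noise.
consumption = 500 + 200 * owns_house - 50 * persons_per_room + rng.normal(0, 5, size=n)

# Ordinary least squares: the fitted betas play the role of the PMT weights.
X = np.column_stack([np.ones(n), owns_house, persons_per_room])
betas, *_ = np.linalg.lstsq(X, consumption, rcond=None)

# PMT score = predicted consumption; households under the cutoff are eligible.
pmt_score = X @ betas
cutoff = 400  # hypothetical eligibility threshold
eligible = pmt_score < cutoff

print(np.round(betas, 1), int(eligible.sum()))
```

The point is that nobody chooses the weights by hand : they fall out of the regression, and a different region with different survey data would get different ones.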

### Performing a PMT    

For a start, I decided to run a very general Proxy Means Test where I consider the most frequent countries in Kiva !   
#### **Step 1**     
First we have to decide what proxies to use. Mine fall under 4 categories : 
1. **Location** : 
    * % of population living in rural Area. 
2. **Housing** :
    * % of population with access to improved water sources
    * % of population with access to electricity
    * % of population with access to improved sanitation facilities 
3. **Family** : 
    * Average size of the household
    * Average number of children per family
    * Sex of the head of household : % of families with a male head of household    
    * Level of education attained
    * Employment rate : Employment to population (age >15) ratio.
    * Agriculture employment : % of total workers who have an agriculture-related job
4. **Ownership** : 
    * Ownership of a telephone : Mobile cellular per capita
 
#### **Step 2**     
Assembling the data ! Let's get back to coding :)

In [None]:
#load all needed data
df_household =pd.read_csv('../input/poverty-indicators/household_size.csv',sep=';')
df_indicators = pd.read_csv('../input/poverty-indicators/indicators.csv',
                            sep=';', encoding='latin1', decimal=',').rename(columns={'country_name': 'country'})
df_education= pd.read_csv('../input/additional-kiva-snapshot/country_stats.csv')[['country_name','mean_years_of_schooling']].rename(columns={'country_name': 'country'})
df_mobile = pd.read_csv('../input/poverty-indicators/mobile_ownership.csv',sep=';',encoding='latin1',decimal=',')
df_mobile['mobile_per_100capita']=df_mobile['mobile_per_100capita'].astype('float')

#merge data for most frequent countries
temp = pd.merge(df_indicators, df_household, how='right', on='country')
temp = pd.merge(temp, df_mobile, how='left', on='country')
indicators = pd.merge(temp, df_education, how='left', on='country').round(2)

indicators

Okay, so we notice that there is missing data about Palestine and Kyrgyzstan. I searched for that to fill those two rows instead of deleting them.

In [None]:
palestine_data = ['Palestine','PLS', 4550000, 24.52, 91, 89, 60, 73, 7.4, 0.90, 5.9, 3.1, 97.8, 8]
kyrgyzstan_data = ['Kyrgyzstan','KGZ', 6082700, 64.15, 90, 93.3, 99.8, 58.3, 29.20, 0.73, 4.2, 2.1, 123.7, 10.80]

indicators.loc[35]=palestine_data
indicators.loc[36]=kyrgyzstan_data

Now before going further with PMT, let's play a bit with this data.  

Just above, I've calculated average MPIs per country to plot the map. Let's merge the MPI information with those indicators, plot a correlation matrix and see what we'll have.

In [None]:
indicators_mpi = pd.merge(indicators,mpi_country, how='inner', on='country').drop(['country_code','population'],axis=1)

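# Keep numeric columns only so that .corr() ignores the 'country' strings
corre = indicators_mpi.select_dtypes(include='number').corr()

# np.bool was removed from NumPy; the builtin bool is equivalent here
mask1 = np.zeros_like(corre, dtype=bool)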
mask1[np.triu_indices_from(mask1)] = True

f, axs = plt.subplots(figsize=(10, 8))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corre, mask=mask1, cmap=cmap, vmax=.3, center=0, square=False, linewidths=.5, cbar_kws={"shrink": .5});

The results here are **reassuring**. According to the matrix, the MPI is clearly correlated with every indicator except the sex of the household head. Fair enough. More importantly, notice that the indicators most correlated with the MPI, with really high scores (0.8 approximately), are the following : **access to water / sanitation / electricity and years of schooling**. Now if you go back to how the MPI is defined, you'll find among the 10 indicators : Years of schooling - Toilet - Water - Electricity !!

#### **Step 3**     
Back to our Proxy Means Test : let's do some regression now. The dependent variable here will be **Household final consumption expenditure per capita (PPP)**.

In [None]:
consumption = pd.read_csv('../input/poverty-indicators/consumption.csv',sep=";", decimal=',',encoding='latin1')

df = pd.merge(indicators, consumption[['country','consumption_capita']] , how='left', on='country').round(2).dropna()
df.rename(columns={'rural_population_%': 'rural_ratio','access_water_%':'ratio_wateraccess', 'access_electricity_%':'ratio_electricityaccess',
                  'employment_%':'ratio_employment','agriculture_employment_%':'ratio_agriculture',
                  'access_sanitation_%':'ratio_sanitation'},inplace=True)

from statsmodels.formula.api  import ols
model = ols(formula = 'consumption_capita ~ rural_ratio+ratio_wateraccess+ratio_electricityaccess+ratio_sanitation+ratio_employment+ratio_agriculture+\
            male_headship+average_household_size+avg_children_nb+mean_years_of_schooling+mobile_per_100capita',
          data = df).fit()
print(model.summary())

Back to reality, where first models don't actually give you the best results right away :( The p-values for the coefficients are too high (which translates to not statistically significant) and it seems that we also have multicollinearity issues.   
Let's see what's happening here :   
* The data is very **heterogeneous**. Indeed, we're dealing with country-level data here, and poverty in different countries may well have different causes. As explained before, PMT is really intended to be run on the data of a specific country / region.
* We're using **11 features to perform a linear regression on fewer than 40 points**. That can't be good ! We have far too many features for such a small sample.
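
One common way to put a number on multicollinearity is the variance inflation factor (VIF) : regress each feature on all the others and compute $1/(1-R^2)$. The sketch below does that with plain NumPy on synthetic features; the helper name `variance_inflation_factors` and the data are mine, not part of the notebook's pipeline.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1 - residuals.var() / target.var()
        vifs[j] = 1 / (1 - r2)
    return vifs


rng = np.random.default_rng(0)
n = 40  # about as many rows as we have countries
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # unrelated feature
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))

print(np.round(vifs, 1))
```

A rule of thumb often quoted is that a VIF above 5 or 10 signals trouble; in this toy case the two near-duplicate features blow past it while the unrelated one stays close to 1.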

Let's try a simplified model with fewer features.

In [None]:
model2 = ols(formula = 'consumption_capita ~ rural_ratio+ratio_sanitation+mobile_per_100capita+average_household_size ',
          data = df).fit()
print(model2.summary())

This is much better already. You can see that the **p-values are much lower**, the **p-value of the overall F-test (Prob (F-statistic) in the summary) is lower too**, the BIC score is lower (which means better) and multicollinearity is no longer an issue.   
You would think that this also translates into a less accurate model (because we have less information) but that's not true actually ! The adjusted $R^2$, which penalizes every extra feature, is higher for the second model (the plain $R^2$ can only stay equal or drop when features are removed from a model fitted on the same rows, so the adjusted version is the fair comparison). Well as the saying goes, *sometimes less is more* !

So far we've only used external data to assess welfare, but what about Kiva's data actually ? Can't we use some properties of the loans or borrowers and put them into our model ? Well actually we can ! We've already talked up above about a couple things : 
1. Time for a project to get entirely funded (date_funded - date_posted)
2. The median amount of loan per country, median duration to repay the loan and the ratio of the 1st to the 2nd.

In [None]:
df_ratio = round(df_kiva_loans.groupby('country')['ratio_amount_duration'].median(),2)
df_ctime = round(loans_dates.groupby(['country'])['time_funding'].median(),2)
df_camount = round(df_kiva_loans.groupby(['country'])['loan_amount'].median(),2)
df_repay = round(df_kiva_loans.groupby(['country'])['term_in_months'].median(),2)

kiva_indic = pd.concat([df_ratio, df_ctime, df_camount, df_repay], axis=1, join='inner').reset_index()
indicators_mpi_kiva = pd.merge(indicators_mpi,kiva_indic,how='left',on='country')

indicators_mpi_kiva = indicators_mpi_kiva[['MPI','ratio_amount_duration','time_funding','loan_amount','term_in_months']]
indicators_mpi_kiva.corr()

The MPI doesn't seem related at all to the ratio, which is quite surprising to me. Nonetheless, the correlation between the MPI and funding time and repayment time isn't negligible which means those two can prove to be good resources going further.

*** 
#### **Conclusion**     

Let's sum it up for a moment. So far in this second part, we've introduced concepts such as poverty and the MPI to understand what we are talking about. After that, we tried to run a very simple and naive regression model that depends on demographic information and takes countries' data as input. Of course, the model wasn't brilliant, as two countries with a similar MPI can be extremely different (the reason for poverty is not always the same across the world). Now that we have a better grasp of the problem, let's try to make something more interesting.

## 2.4 Building the model
<a id="model"></a>
*** 

As we go deeper in the kernel, we try to be more specific, as that's the goal of this competition : we try to build something as granular and precise as we can.           
But being **more precise also means acquiring more/better data**, and that's where it actually becomes tricky : getting country-wise data is straightforward (you can get a lot thanks to Wikipedia alone as a matter of fact !). Getting city-wise data is a different story, let alone district data.         
As you can see in this challenge, most of the datasets that were uploaded by Kagglers to help with the given task actually provide country-wise information.     
That information usually helps to get some insights but can't do much when it comes to building a granular model for poverty estimation.

Looking for sources of data, the first two that come to mind are **The World Bank and The DHS Program**. The latter proved to be a goldmine, so let's talk about it in more detail.


### 2.4.1 What's the DHS Program ?
<a id='data'></a>

The Demographic and Health Surveys (DHS) Program is responsible for collecting and disseminating accurate, nationally representative data on health and population in developing countries. Since 1984, The DHS Program has provided technical assistance to more than 300 demographic and health surveys in over 90 countries.   
So looking into that, I found that the DHS Program has some extremely detailed data. For example, if you select one country, it can in fact give you the survey results for each household that responded to the survey !! Amazing right ? Well that comes with some costs (completely worth it though) : 
* First of all, you have to log in to the website, create a project and explain what data you want to access and what you want to do with it. Your answers are then verified and, if everything goes well, you get the authorization to download and use the data. That being said, you don't have the right to share that data publicly (for example by uploading it as a public dataset on Kaggle); it's for private use only. You can of course share the results you obtain from it.
* After downloading the data, you'll notice that the raw data is hard to deal with : you get hundreds and hundreds of columns with weird names and you have to process all of that. They have a couple of video tutorials on their website to help first-time users understand the conventions they use and how to find the information they're looking for.

Thankfully, in our case, a Kaggler got his hands dirty and did a wonderful job with this DHS data. As I was struggling to retrieve information from the raw DHS data, [Frédéric Kosmowski](https://www.kaggle.com/fkosmowski) published his awesome [kernel](https://www.kaggle.com/fkosmowski/matching-kiva-s-borrowers-with-dhs-clusters) that produces a new dataset matching Kiva's loans to DHS data.
I want to emphasize that the work he has done is phenomenal. If you go over his R code you'll notice that there's a lot done there. Not only that, I contacted him asking for some guidance regarding my own model and he has been very helpful. Since then, we've been in regular contact and I want to thank him for that. It felt good collaborating with a fellow Kaggler on a competition that's actually a part of the #DataScienceForGood program.

So let's get to work using the DHS data.

With the data Kiva provided, the most precise approach was to consider the MPI at the regional level. Here, we'll attempt something different by building a new metric using geographical clusters (which are more granular than regions).
First, I'll just retrieve information about each cluster and try to build the same regression models as before, but this time each country will have its own model based on its own clusters.
As a consequence, we'll end up with much more homogeneous data, and also more of it ! Previously we used 30ish countries (so 30 training instances) to build a model, which isn't a lot; here, each country has a lot of clusters, so let's see how it goes.


In [None]:
from sklearn.preprocessing import StandardScaler

clusters = pd.read_csv('../input/kivadhsv1/DHS.clusters.csv')

clusters= clusters.drop_duplicates()[['DHSCLUST','DHSCC.x', 'DHS.lat', 'DHS.lon','Country','MPI.median', 'Nb.HH', 'AssetInd.median','URBAN_RURA',
                    'Nb.Electricity', 'Nb.fuel', 'Nb.floor', 'Nb.imp.sanitation', 'Nb.imp.water', 'Median.educ', 'Nb.television', 'Nb.phone']]
clusters.drop(clusters.index[4987],inplace=True)

clusters['Country']=[clusters['Country'].iloc[i] if clusters['DHSCC.x'].iloc[i] !='AM' else 'ARM' for i in range(len(clusters)) ]
clusters[['DHS.lat', 'DHS.lon']] = clusters[['DHS.lat', 'DHS.lon']].astype(float)  # astype has no inplace argument

for indic in ['Nb.Electricity','Nb.fuel','Nb.floor','Nb.imp.sanitation','Nb.imp.water','Nb.television','Nb.phone'] : 
    clusters[indic]=round(100*clusters[indic].astype(int)/clusters['Nb.HH'].astype(int),2)
    
clusters['URBAN_RURA']=clusters['URBAN_RURA'].apply (lambda x : 1 if x=='U' else 0 )

clusters['DHSCLUST']=[str(clusters['DHSCLUST'].iloc[i])+'_'+clusters['Country'].iloc[i] for i in range(len(clusters)) ]

clusters['AssetInd.median']=clusters['AssetInd.median'].astype(float)
max_asset = max(clusters['AssetInd.median'])
min_asset = min(clusters['AssetInd.median'])
clusters['AssetInd.median']=clusters['AssetInd.median'].apply(lambda x : round((100/(max_asset-min_asset)) * (x-min_asset),2))

clusters.rename(columns={"MPI.median":"MPI_cluster", "AssetInd.median" : "wealth_index", "URBAN_RURA": "urbanity" , 'Nb.Electricity':'ratio_electricity',
                        'Nb.imp.sanitation':'ratio_sanitation','Nb.imp.water':'ratio_water', 'Nb.phone':'ratio_phone',
                        'Nb.floor':'ratio_nakedSoil','Nb.fuel':'ratio_cookingFuel'}, inplace=True)

clusters['MPI_cluster']=clusters['MPI_cluster'].astype(float)  # astype has no inplace argument

clusters[['urbanity' ,'ratio_sanitation','ratio_phone', 'ratio_electricity', 'ratio_cookingFuel', 'ratio_nakedSoil','ratio_water']] = StandardScaler().fit_transform(clusters[['urbanity' ,'ratio_sanitation','ratio_phone', 'ratio_electricity', 'ratio_cookingFuel', 'ratio_nakedSoil','ratio_water']])

clusters.sample(20)


Let me briefly explain what I do in the code above and the output that we get :  

* I only select the columns that I need (at least for now), which are all demographic, as in the previous models. 
* Features with a name such as 'Nb.some_demographic_property' indicate how many households in that given cluster have that property. I divide those cells by 'Nb.HH', the number of households in that cluster -> that gives us a ratio (in %) instead.
* I rescale the wealth index between 0 and 100 so it can be more meaningful.
* I standardize the urbanity flag and the ratio features with `StandardScaler`, so that the regression coefficients are comparable.
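
The count-to-ratio and 0-100 rescaling steps can be sketched on a toy table. The column names mirror the DHS file, but the three rows are made up for illustration.

```python
import pandas as pd

# Toy clusters; values are invented, only the column names follow the DHS file.
toy = pd.DataFrame({
    "Nb.HH": [20, 25, 10],
    "Nb.Electricity": [5, 25, 4],
    "AssetInd.median": [-1.0, 0.5, 2.0],
})

# Count of households -> percentage of households in the cluster.
toy["ratio_electricity"] = (100 * toy["Nb.Electricity"] / toy["Nb.HH"]).round(2)

# Min-max rescaling of the wealth index to the 0-100 range.
lo, hi = toy["AssetInd.median"].min(), toy["AssetInd.median"].max()
toy["wealth_index"] = (100 * (toy["AssetInd.median"] - lo) / (hi - lo)).round(2)

print(toy[["ratio_electricity", "wealth_index"]])
```

Note that min-max rescaling depends on the minimum and maximum of whatever set of clusters you feed it, so a wealth index of 50 only means "halfway between the poorest and richest cluster in this table".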

As a result, we end up with a table containing all clusters and relative demographic informations.    
Let's now take one particular country, say Colombia, and build a regression model based on its clusters.

In [None]:
clusters_clmb = clusters[clusters['Country']=='COL']

model_clmb= ols(formula = 'MPI_cluster ~ urbanity + ratio_sanitation +ratio_phone+ ratio_electricity+ ratio_cookingFuel+ratio_nakedSoil+ratio_water',
          data = clusters_clmb).fit()

print(model_clmb.summary())

Our takeaway : **it's way better than what we had previously** !! Not convinced ? Look at the p-values we got on the model. **All features have a p-value that's less than the standard 0.05 significance threshold**. Having more data, and more homogeneous data, helped us get more meaningful and trustworthy results.   
The coefficients we get for each indicator give us an idea about the importance of each indicator in estimating the MPI (all features being standardized). 

### !!! Important !!! 
One may wonder : well you already have the MPI for those clusters, what are the regressions for ? What's their use ?    
As I've explained in the first part of this kernel, when two regions have the same MPI score, we can say that they are equally poor but the reasons leading to this poverty might be very different (remember, MPI is based on 10 different indicators) ! **The coefficients we get from the linear regression can help us understand what this specific cluster is suffering from**. Say $\beta_{\text{nakedSoil}}$ is high : this means the houses' floors in that region are in a really bad state and we should be paying attention to loans coming from that cluster saying that they need to cover their soil. If $\beta_{\text{ratio_water}}$ is high, this cluster's households don't have easy access to water and the poorest people will be looking for loans to buy equipment that provides them with water.     
So those coefficients aren't used to predict the MPI (they still can do that of course) but rather to identify what makes each specific region / cluster poor. They give us a very specific understanding of the problems in every zone, which is not the case when the MPI is the only information we have.

### 2.4.2 Assigning loans to clusters
<a id='knn'></a>

Now that we have geographical clusters, we would like to know how we'll assign each loan / borrower to the appropriate cluster.    
Let's load again the extended kiva dataset uploaded by beluga and associate it with coordinates. For now, we'll only work on 5 countries (some of the most present on Kiva).

In [None]:
loan_coords = pd.read_csv('../input/additional-kiva-snapshot/loan_coords.csv')
loans_extended = pd.read_csv('../input/additional-kiva-snapshot/loans.csv')
loans_with_coords = loans_extended.merge(loan_coords, how='left', on='loan_id')

loans_with_coords=loans_with_coords[['loan_id','country_code','country_name','town_name','latitude','longitude',
                                    'original_language','description','description_translated','tags', 'activity_name','sector_name','loan_use',
                                    'loan_amount','funded_amount',
                                    'posted_time','planned_expiration_time','disburse_time', 'raised_time', 'lender_term', 'num_lenders_total','repayment_interval']]
loans_with_coords = loans_with_coords[np.isfinite(loans_with_coords['latitude'])]

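# .copy() so that adding columns to `loans` later does not write into a slice of loans_with_coords
loans = loans_with_coords[loans_with_coords['country_name'].isin(['Philippines','Colombia','Armenia','Kenya','Haiti'])].copy()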

loans.sample(10)
#loans_with_coords['country_name'].value_counts() -> #Peru #Cambodia #Uganda #Pakistan #Ecuador


As we've seen before, each cluster has two properties [DHS.lat , DHS.lon] that indicate the *neighborhood* of that cluster. In the following, we'll run a k-NN (with k=1) to assign each loan to a cluster.   
Since we'll be computing distances between (latitude , longitude) pairs, the Euclidean distance isn't the most appropriate way to measure which location is closest to another. Instead, we'll be using the **Haversine distance.**

*What's the Haversine distance ?* Because the earth can be considered a sphere, the shortest route between two points on its surface follows a "Great Circle", and the Haversine distance is the length of that great-circle arc. Here's a picture to illustrate that :
![Great Circle](https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/37652/versions/4/screenshot.png)                     


And here's the formula to calculate the Haversine distance on a unit sphere (multiply by the Earth's radius, about 6371 km, to get kilometers) :            
Given $\phi_1$, $\lambda_1$ and $\phi_2$, $\lambda_2$ the geographical latitude and longitude **in radians** of two points A and B,          
$$d_H(A,B) = 2\arcsin \sqrt{\sin^2\left(\frac{\phi_2-\phi_1}{2}\right) +\cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right) }$$
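
As a quick sanity check of the formula, here is a small standalone function. The name `haversine_km` is mine, and it takes degrees as input and converts them to radians itself.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # usual mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, lam1, phi2, lam2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((phi2 - phi1) / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))


# From the equator to the North Pole is a quarter of a great circle.
quarter = haversine_km(0.0, 0.0, 90.0, 0.0)
print(round(float(quarter), 1))  # 10007.5
```

The result matches $6371 \times \pi / 2$, which is what we expect for a quarter of the Earth's circumference.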

In [None]:
'''Performing Knn with k=1 to find the cluster for each loan'''
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1 , metric='haversine')
neigh.fit(np.radians(clusters[['DHS.lat', 'DHS.lon']]), clusters['DHSCLUST']) 
loans['DHSCLUST'] = neigh.predict(np.radians(loans[['latitude','longitude']]))

'''Build a table to show the coordinates of the loan and coordinates of the cluster it is assigned to, then calculate the Haversine distance in kilometers between cluster and loan'''
precision_knn = loans[['loan_id','country_name','latitude','longitude','DHSCLUST']].merge(clusters[['DHSCLUST','DHS.lat','DHS.lon']], how='left', on='DHSCLUST')
lat1 = np.radians(precision_knn['latitude'])
lat2 = np.radians(precision_knn['DHS.lat'])
lon1 = np.radians(precision_knn['longitude'])
lon2 = np.radians(precision_knn['DHS.lon'])
temp = np.power((np.sin((lat2-lat1)/2)),2) + np.cos(lat1) * np.cos(lat2) * np.power((np.sin((lon2-lon1)/2)),2)
precision_knn['distance_km'] = 6371 * (2 * np.arcsin(np.sqrt(temp))) #6371 is the radius of the earth

precision_knn.sample(10)

In [None]:
print("The median distance in kilometers between a loan and the cluster it's assigned to is : " , round(precision_knn['distance_km'].median(),2))

Impressive ! This means that given a loan, we can assign it to a cluster (for which we have a lot of demographic information) that's typically (median distance) only **6.5km away !**

### Conclusion
If we stop right here, we already have a model we can follow :   
1. Given a loan with latitude and longitude (lat , long), assign it to the nearest cluster (using the Haversine distance) and use the MPI or wealth index of that cluster as a poverty measure (instead of using a regional measure).
2. Given a cluster (or a loan), thanks to the $\beta$s we got from the regression, we can understand what's the main problem in that region : bad sanitation ? no access to water ? etc..
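
Step 1 can be sketched without scikit-learn, as a brute-force nearest-neighbor search. The helper `nearest_cluster`, the coordinates and the MPI values below are all made up for illustration; for the real data the k-NN classifier used earlier scales much better.

```python
import numpy as np


def nearest_cluster(loan_latlon, cluster_latlon):
    """Index of the Haversine-nearest cluster for each loan (inputs in degrees)."""
    loans_rad = np.radians(np.asarray(loan_latlon, dtype=float))[:, None, :]
    clusters_rad = np.radians(np.asarray(cluster_latlon, dtype=float))[None, :, :]
    dphi = clusters_rad[..., 0] - loans_rad[..., 0]
    dlam = clusters_rad[..., 1] - loans_rad[..., 1]
    a = (np.sin(dphi / 2) ** 2
         + np.cos(loans_rad[..., 0]) * np.cos(clusters_rad[..., 0]) * np.sin(dlam / 2) ** 2)
    # arcsin(sqrt(a)) is increasing in a, so the smallest a is the nearest cluster.
    return np.argmin(a, axis=1)


# Hypothetical clusters (lat, lon), each with a made-up MPI.
cluster_latlon = [(4.6, -74.1), (6.2, -75.6), (10.4, -75.5)]
cluster_mpi = np.array([0.02, 0.05, 0.11])

# Hypothetical loans located near the first and the third cluster.
loan_latlon = [(4.7, -74.0), (10.3, -75.4)]

idx = nearest_cluster(loan_latlon, cluster_latlon)
loan_mpi = cluster_mpi[idx]  # poverty measure inherited from the nearest cluster
print(idx, loan_mpi)
```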


Another interesting thought is, given those clusters, how can we use kiva's data to extract information about borrowers' poverty.
***

## 2.4.3 Kiva's custom metric
<a id="customkiva"></a>
*** 

So far, we've only been using purely demographic properties. Let's now extract information from the kiva side :
* Amount of loan, time needed to fund it / repay it
* Dominant sectors in each cluster

We'll also see how the textual descriptions could be used : 
* Description of the borrower
* Description of the use

In [None]:
# Clusters that received at least one loan (.index works across pandas versions)
useful_clusters = loans['DHSCLUST'].value_counts().index.tolist()
clusters = clusters[clusters['DHSCLUST'].isin(useful_clusters)]

### Amount funded, time to fund it, time to reimburse it

In [None]:
temp = loans.dropna(subset=['posted_time','disburse_time', 'raised_time'], how='any', inplace=False)

dates = ['posted_time', 'disburse_time', 'raised_time']
temp[dates] = temp[dates].applymap(lambda x : x.split('+')[0])

temp[dates]=temp[dates].apply(pd.to_datetime)
temp['time_funding'] = temp['raised_time']- temp['posted_time']
temp['time_funding'] = temp['time_funding'] / timedelta(days=1) 
temp['ratio_amount_TimeFunding'] = temp['funded_amount']/temp['time_funding']
temp['ratio_amount_TimeRepay'] = temp['funded_amount']/temp['lender_term']

d = round(temp.groupby(['DHSCLUST'])[['funded_amount','lender_term','ratio_amount_TimeFunding','ratio_amount_TimeRepay']].median(),2).reset_index()

clusters = clusters.merge(d, how='left',on='DHSCLUST')

indicators_kiva_col =  clusters[['MPI_cluster','wealth_index','funded_amount','lender_term','ratio_amount_TimeFunding','ratio_amount_TimeRepay']][clusters['Country']=='COL']
indicators_kiva_col.corr()

We notice that funded_amount and lender_term are correlated to a certain extent with both the MPI and the wealth index. Keep in mind that the low values don't mean those two indicators are bad poverty measures : the MPI and the wealth index are computed almost entirely from demographic indicators, so almost any non-demographic indicator could show a low correlation with them.

### Sectors

In [None]:
d = loans.groupby(['DHSCLUST','sector_name'])['funded_amount'].count().reset_index()

d['proportion'] = [d['funded_amount'].iloc[i]/len(loans[loans['DHSCLUST']==d['DHSCLUST'].iloc[i]]) for i in range(len(d))]

# One column per sector; chained merges on a repeated 'proportion' column create clashing suffixes
sectors = ['Agriculture', 'Education', 'Food', 'Health']
tmp1 = (d[d['sector_name'].isin(sectors)]
        .pivot(index='DHSCLUST', columns='sector_name', values='proportion')
        .reindex(columns=sectors))
# As before, keep only the clusters that have at least one Agriculture loan
tmp1 = tmp1.dropna(subset=['Agriculture']).reset_index()
tmp1.columns=['DHSCLUST','ratio_agriculture', 'ratio_education','ratio_food', 'ratio_health']

clusters = clusters.merge(tmp1 , how='left',on='DHSCLUST')
indicators_kiva_col =  clusters[['MPI_cluster','wealth_index','ratio_agriculture', 'ratio_education','ratio_food', 'ratio_health']][clusters['Country']=='COL']
indicators_kiva_col.corr()

We notice that taking a loan for an educational purpose VS taking a loan that would be invested in agriculture does make a notable difference. Indeed, more educational investment seems to reflect a lower MPI while more agriculture investment reflects a higher MPI **(remember, this is true for Colombia, not every country. The analysis has to be conducted on each country on its own)**.

*** 

If you think of the correlations as linear coefficients, you can build a metric using the correlations (after rescaling every indicator), where each correlation is computed across the clusters of the country :       
Given a cluster $C$, 
$$\text{Kiva-metric}(C) = \text{corr}(\text{MPI}, \text{agri}) \times \text{ratio}_{\text{agri}}(C) + \dots + \text{corr}(\text{MPI}, \text{median funded amount}) \times \text{median funded amount}(C)$$
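
Here is one possible reading of that formula as code, on a toy table of five clusters. The numbers are invented, and min-max scaling to [0, 1] is my assumption for the "rescaling" step; other choices (standardization for instance) would work too.

```python
import pandas as pd

# Toy clusters for one country (all values are made up for illustration).
toy = pd.DataFrame({
    "MPI_cluster": [0.05, 0.10, 0.20, 0.30, 0.40],
    "ratio_agriculture": [0.10, 0.20, 0.25, 0.40, 0.50],
    "ratio_education": [0.30, 0.25, 0.15, 0.10, 0.05],
    "funded_amount": [900, 700, 650, 400, 300],
})
features = ["ratio_agriculture", "ratio_education", "funded_amount"]

# Correlation of each Kiva indicator with the MPI, computed across clusters.
weights = toy[features].corrwith(toy["MPI_cluster"])

# Rescale every indicator to [0, 1] so that the weighted terms are comparable.
scaled = (toy[features] - toy[features].min()) / (toy[features].max() - toy[features].min())

# Kiva-metric(C) = sum over indicators of corr(MPI, indicator) * indicator(C).
kiva_metric = (scaled * weights).sum(axis=1)
print(kiva_metric.round(2))
```

In this toy case the metric increases with the MPI by construction, since every indicator was chosen to move monotonically with it; on real clusters the ordering would of course be noisier.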

### ***Going further ...***

### Description of borrower

We have a quite interesting column "description" (and "description_translated" when the original input is not in English). In this column the borrower is actually free to write anything they think is important for lenders to read : their age, situation, needs ... you see the point.

In the following, I group by country and cluster and aggregate all the texts (so all loans for a given cluster are now one big text). Then using the spaCy library, I get vector columns and finally, using t-SNE, I get a nice 2D plot. In this plot, each country is represented by a colour and each point is a cluster (the geographical demographic clusters). 
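
The pipeline can be sketched as follows. To keep the example self-contained I swapped in TF-IDF vectors from scikit-learn in place of spaCy vectors, so this is a stand-in for the approach described above and not the exact method; the loan descriptions and cluster ids are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Made-up loan descriptions, each with a country and a cluster id.
toy = pd.DataFrame({
    "country_name": ["Kenya", "Kenya", "Kenya", "Colombia",
                     "Colombia", "Colombia", "Haiti", "Haiti"],
    "DHSCLUST": ["1_KE", "1_KE", "2_KE", "1_COL",
                 "2_COL", "2_COL", "1_HT", "2_HT"],
    "description_translated": [
        "She is a farmer and needs seeds for her maize field",
        "He wants to buy a dairy cow and fodder",
        "She sells vegetables at the local market",
        "He runs a small grocery store and needs stock",
        "She needs a loan to pay university tuition",
        "He wants to buy school books for his studies",
        "She sells charcoal and needs working capital",
        "He wants to repair the roof of his house",
    ],
})

# One big text per (country, cluster).
docs = (toy.groupby(["country_name", "DHSCLUST"])["description_translated"]
           .agg(" ".join)
           .reset_index())

# Text -> vectors (TF-IDF here, as a stand-in for spaCy vectors).
vectors = TfidfVectorizer().fit_transform(docs["description_translated"]).toarray()

# Vectors -> 2D points; perplexity must stay below the number of documents.
points = TSNE(n_components=2, perplexity=2, init="random",
              random_state=0).fit_transform(vectors)
docs["x"], docs["y"] = points[:, 0], points[:, 1]
print(docs[["country_name", "DHSCLUST", "x", "y"]])
```

With only six tiny documents the 2D layout means nothing, of course; the sketch only shows the shape of the pipeline (one text per cluster, one vector per text, one 2D point per vector).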

# Conclusion
<a id="conclusion"></a>
*** 

After three months of updating this kernel and 60 versions or something like that, it seems that it has finally come to its end :).

I can honestly say that I enjoyed the ride and that it was my best experience on Kaggle so far, mostly because, unlike other notebooks I wrote / competitions I participated in, I got to collaborate with people I didn't know at all, people who have very different backgrounds but share the same interest and passion as me; so I obviously want to thank all the Kagglers (mainly Frédéric) and also the non-Kagglers that helped me throughout this project.
I enjoyed exchanging mails to understand specificities of the household data or to ask for help when I needed it. I also enjoyed reading the kernels and exploring what each Kaggler came up with and, as always, I enjoyed reading research papers and discovering more about such an exciting field.

I hope that my kernel will be of some help to the awesome Kiva platform and that they can use my model to improve their methods in any way, but I also hope that this kernel will help fellow Kagglers that are interested in the application of data science to social good and, in this particular case, poverty.

Thank you for reading, 
Mhamed.