In [42]:
import numpy as np # linear algebra
from numpy import log10, ceil, ones
from numpy.linalg import inv 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # prettier graphs
import matplotlib.pyplot as plt # need dis too
%matplotlib inline 
from IPython.display import HTML # for da youtube memes
import itertools # lets me iterate stuff
from datetime import datetime # to work with dates
import geopandas as gpd
from fuzzywuzzy import process
from shapely.geometry import Point, Polygon
import shapely.speedups
shapely.speedups.enable()
import fiona 
from time import gmtime, strftime
from shapely.ops import cascaded_union
import gc

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

sns.set_style('darkgrid') # looks cool, man
import os

In [43]:
# read some files...

# read in MPI
df_mpi_ntl = pd.read_csv("../input/mpi/MPI_national.csv")
df_mpi_subntl = pd.read_csv("../input/mpi/MPI_subnational.csv")

# read in kiva data
df_kv_loans = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv")
df_kv_theme = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_theme_ids.csv")
df_kv_theme_rgn = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv")

# read in kiva enhanced data
df_kiv_loc = pd.read_csv("../input/kiva-challenge-coordinates/kiva_locations.csv", sep='\t', error_bad_lines=False)

# read in geospatial regions

gdf_phl = gpd.GeoDataFrame.from_file("../input/philippines-geospatial-administrative-regions/ph-regions-2015.shp")
gdf_rwa = gpd.GeoDataFrame.from_file("../input/rwanda-2006-geospatial-administrative-regions/Province_Boundary_2006.shp")

# read in cato data
df_cato = pd.read_csv("../input/cato-2017-human-freedom-index/cato_2017_human_freedom_index.csv")

# read in world bank findex and rural population data
df_wb_findex = pd.read_csv("../input/findex-world-bank/FINDEXData.csv")
df_wb_rural = pd.read_csv("../input/world-bank-rural-population/rural_pop.csv")

# read in kenya poverty metrics
df_pov_ken_food = pd.read_csv("../input/kenya-poverty-metrics-by-district/food_poverty_est.csv")
df_pov_ken_ovrl = pd.read_csv("../input/kenya-poverty-metrics-by-district/overall_poverty_est.csv")
df_pov_ken_hrdc = pd.read_csv("../input/kenya-poverty-metrics-by-district/hardcore_poverty_est.csv")

# read in philippines demographics
df_demo_phl = pd.read_csv("../input/kiva-phillipines-regional-info/phil_regions.csv")


# Intro / Kernels
This is going to be my third Kiva Kernel.
1. [Kiva Exploration by a Kiva Lender and Python Newb](https://www.kaggle.com/doyouevendata/kiva-exploration-by-a-kiva-lender-and-python-newb) - I suppose I'll refer to this now as my initial Exploratory Data Analysis kernel.  I have several calls to action in it and some interesting views and exploration of some outliers in data of interest.  My kernel in edit mode now appears all in one line, and I opened a support ticket on April 1st at 12:08am.  It's 12pm on April 9th and the only contact I've had on my ticket was a question asking which kernel it was (I only had 1) - I am now giving up on rescuing the draft work I had and creating this kernel.
2. [An Exploratory Look at Kiva *Lenders*](https://www.kaggle.com/doyouevendata/an-exploratory-look-at-kiva-lenders) - this is a kernel I made exploring the additional uploaded loan set and kiva lender files, while I was waiting for my original Kernel to be resolved.
3. Kiva Poverty Targeting - this kernel in which I will reuse and expand upon previous work, particularly from the first kernel.

# Table of Contents
* [1. What is Kiva?](#1)
* [2. Notes On Kiva for Relieving Poverty](#2)
* [3. Multi-Dimensional Poverty Index (MPI)](#3)
  * [3.1 National MPI Based Calculations](#3_1)
      * [3.1.1 Critique 1: Null Rural Percentages](#3_1_1)
      * [3.1.2 Critique 2: Rural Percentage Accuracy](#3_1_2)
      * [3.1.3 Critique 3: Overall Changes Over Time](#3_1_3)
  * [3.2 Sub-National MPI Based Calculations](#3_2)
    * [3.2.1 Critique 4: Inherited National MPI and Mixed Methodology](#3_2_1)
    * [3.2.2 Critique 5: Inaccurate Region Assignment](#3_2_2)
    * [3.2.3 Sub-National Regional Reassignment and Scoring --- Loan Theme Partner Regions](#3_2_3)
    * [3.2.4 Sub-National Regional Reassignment and Scoring --- Loans](#3_2_4)
    * [3.2.5 Sub-National Regional Reassignment and Scoring --- Loans --- Weighted Instead of Averaged](#3_2_5)
* [4. Kenya Analysis](#4)
  * [4.1 Kenya MPI Analysis (National MPI 0.187)](#4_1)
  * [4.2 Kenya FGT Analysis](#4_2)

<a id=1></a>
# 1. What is Kiva?
Kiva is an organization which connects risk-tolerant lenders and their capital to people in poor areas with difficulty accessing credit.  Loans are crowdsourced in increments of $25 and disbursed by field partners, expected to generally be at market or below market interest rates.  The Kiva lending user is directly paid back from the associated loan and thus can be thought of as crowdfunding the purchase of a loan a borrower has requested.
<a id=2></a>
# 2. Notes On Kiva for Relieving Poverty
I am a fan and happy user/lender of Kiva; although it should also be noted that Kiva cannot work miracles.  Much of poverty reduction comes from state supported initiatives, such as providing security for people and property so that markets can function and flourish.  Kiva has one lever to pull and that lever does not address things like a corrupt state or resolve violent conflicts.  Kiva's largest impact will be from directing its risk tolerant capital provided by its users into the hands of borrowers who have difficulty accessing basic banking functions to securely save money or to borrow at fair market rates.  The ideal case for a standard Kiva loan is likely to an entrepreneur who would otherwise seek a local black market/mafia type money lender who would charge much higher rates.

Kiva results also can be difficult to measure.  We're not going to observe something at a high level aggregate (like GDP for a region) because there are many other factors going on, but mostly for the simple reason that there is not enough Kiva loan activity going on to observe results at such a high aggregate level.  Per [Section 11 of my first Kernel](https://www.kaggle.com/doyouevendata/kiva-exploration-by-a-kiva-lender-and-python-newb), El Salvador (easily) would be the country with the most observable impact at the nation state level;
![](https://www.doyouevendata.com/wp-content/uploads/2018/04/11.png)

However, note that they have about 6,250 loans per million residents.  That's 6.25 people with Kiva capital provided loans per 1,000 people.  That's not going to goose a GDP number, GINI coefficient, MPI metric, or any other nation state aggregate.  Also, all loans are not equally helpful or impactful; it was amusing to find in [Tim W's Kernel](https://www.kaggle.com/timosdk/exploring-motivations-for-kiva-loans/notebook) that he found multiple people in Peru requesting loans for $600-1200+ stereo systems.  This is certainly more than some of my friends are willing to spend on a kickin' stereo.  It is certainly different than a loan for a store to expand their inventory, or a farmer to purchase some pigs, or for someone to purchase their own toilet.

Since country level aggregates are not great for measuring the problems Kiva is trying to tackle, Kiva is best off going as regional or local as possible.  The GDP effects are not observable, but what is observable is how difficult life in an area is and how Kiva addressing a lack of credit and capital may offer some relief.  The better Kiva understands a locality, the better it can address its needs and work with its field partners to do so.

<a id=3></a>
# 3. Multi-Dimensional Poverty Index (MPI)
MPI is a poverty metric measuring multiple dimensions and intensity of poverty in an area.  It is provided at the National level, and in some countries, at a Sub-National level.  This Sub-National level is at the top most administrative regions level for countries.  A higher value MPI is worse.  Rural areas have higher values than urban ones.

![](http://hdr.undp.org/sites/default/files/mpi.png)

Kiva is leveraging the MPI in evaluating countries, field partners, and themes.  How it does so is outlined in [Annalie's kernel](https://www.kaggle.com/annalie/kivampi) for Field Partner MPI, calculated with both National MPI and Sub-National MPI.  Elliot [has a follow up](https://www.kaggle.com/elliottc/kivampi) building on this which also includes similar logic for loan themes.  Comparing National vs. Sub-National resulted in the following:

![](https://www.doyouevendata.com/wp-content/uploads/2018/04/natreg.gif)

These look pretty similar, with the thought being that the Sub-National is an improvement in granularity in understanding the loan themes at a more local level.  I think this is the case, although this may be in part due to some National values being inherited and we also may be looking at too high level a plot here to recognize the more meaningful differences that result when we look at the individual loan themes or field partners.  Looking at either may yield different results; I focused on field partners in my initial kernel and will repeat some of that work here.
<a id=3_1></a>
## 3.1 National MPI Based Calculations

Here are the results of the Mozambique field partner calculations as currently done.  The reason I find these calculations questionable has to do with the rural_pct.  Rural percentage is a field partner level attribute.  Note the pretty large difference between MPI Rural and MPI Urban here; rural is pretty high at 0.48.  Urban is at 0.189.  These are *National* values which get weighted into the Partner ID MPI Score, based on the rural percentage.

In [44]:
LT = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv") #.set_index([''])
MPI = pd.read_csv("../input/mpi/MPI_national.csv")[['ISO','MPI Urban','MPI Rural']].set_index("ISO")
LT = LT.join(MPI,how='left',on="ISO")[['Partner ID','Field Partner Name','ISO','MPI Rural','MPI Urban','rural_pct','amount', 'mpi_region']].dropna()

LT['rural_pct'] /= 100
#~ Compute the MPI Score for each loan theme
LT['MPI Score'] = LT['rural_pct']*LT['MPI Rural'] + (1-LT['rural_pct'])*LT['MPI Urban']

LT[LT['ISO'] == 'MOZ'][['Partner ID', 'ISO', 'MPI Rural', 'MPI Urban', 'rural_pct', 'MPI Score']].drop_duplicates()

The big difference in the partners above is entirely due to the rural percentage of 85 vs. 20%.  Are there any flaws with this methodology?
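
To make the weighting concrete, here is a small plain-Python sketch of the same formula, using the Mozambique national values quoted above (MPI Rural 0.48, MPI Urban 0.189) and the two rural percentages.  The function name `national_mpi_score` is just mine for illustration; it is not something Kiva provides.

```python
def national_mpi_score(rural_pct, mpi_rural, mpi_urban):
    """Blend national rural and urban MPI by a partner's rural percentage (0-100)."""
    rural_share = rural_pct / 100
    return rural_share * mpi_rural + (1 - rural_share) * mpi_urban

# National values for Mozambique as quoted in the text above
MPI_RURAL, MPI_URBAN = 0.48, 0.189

score_85 = national_mpi_score(85, MPI_RURAL, MPI_URBAN)  # partner reporting 85% rural
score_20 = national_mpi_score(20, MPI_RURAL, MPI_URBAN)  # partner reporting 20% rural
print(round(score_85, 5), round(score_20, 5))
```

Nothing but the self-reported rural percentage separates the two scores, which is why the quality of that one field matters so much.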
<a id=3_1_1></a>
### 3.1.1 Critique 1: Null Rural Percentages
Is this field populated for a field partner?

In [45]:
df_kiva_rural = df_kv_theme_rgn[['Partner ID', 'rural_pct']].drop_duplicates()
df_kiva_rural['populated'] = np.where(df_kiva_rural['rural_pct'].isnull(), 'No', 'Yes')
df_kiva_rural['populated'].value_counts()

120 of 302, or nearly 40%, of the (provided) field partners have a null value for rural percentage.  This results in a bunch of field partners simply dropping out of any kind of peer analysis.  That's not great, it's a huge amount of lost data!  If Kiva has a null value, it is arguably better to simply inherit the National MPI provided in the Sub-National file.  Here it is for Mozambique.

In [46]:
df_mpi_subntl[df_mpi_subntl['ISO country code'] == 'MOZ'][['ISO country code', 'World region', 'MPI National']].head(1)

Does Mozambique have any null rural percentage partners?  Yes, it does.  How many loans and how much money have they lent out?

In [47]:
df_kv_theme_rgn[df_kv_theme_rgn['country'] == 'Mozambique'][['Partner ID', 'rural_pct']].merge(df_kv_theme_rgn[df_kv_theme_rgn['country'] == 'Mozambique'][['Partner ID', 'number', 'amount']].groupby('Partner ID').sum().reset_index(), on='Partner ID').drop_duplicates()

Not every country has a Sub-National entry however, which is where we pulled our MPI National from.  However MPI National itself seems to be similar in methodology to Kiva's rural percentage rating; I have made a new MPI National Calculated here; what I've done is leverage the provided MPI Rural and MPI Urban and pulled a country level rural percentage from World Bank info, for the most recent year I found available (2016).  I've posted a few with the differences below.  It seems a fair methodology to use this MPI National if MPI has not provided one for a country in their MPI Sub-National file.

In [48]:
df_out = df_mpi_ntl[['ISO', 'MPI Rural', 'MPI Urban']].merge(df_wb_rural[['ISO', '2016']], on='ISO')
df_out = df_out.merge(df_mpi_subntl[['ISO country code', 'MPI National']].drop_duplicates(), left_on='ISO', right_on='ISO country code')
df_out['MPI National Calculated'] = df_out['2016'] / 100 * df_out['MPI Rural'] + (100 - df_out['2016']) / 100 * df_out['MPI Urban']
df_out['MPI National Difference'] = df_out['MPI National'] - df_out['MPI National Calculated']
df_out.drop(columns=['ISO country code', '2016'], inplace=True)
#df_out[df_out['ISO'] == 'MOZ']
df_out[df_out['ISO'].isin(['MOZ', 'RWA', 'AFG', 'GNB', 'BDI'])]

In [49]:
HTML('<img style="margin: 0px 20px" align=left src=https://www.doyouevendata.com/wp-content/uploads/2018/03/attn.png>Current Kiva methodology done for National MPI causes partners to drop from comparison.  We can attempt to keep these partners in the evaluation by applying the country\'s general rural percentage to them; this can be done by taking the MPI provided MPI National from the Sub-National file, or if missing, calculating it ourselves for a country based on World Bank data.  This treats field partners as if they loan equally throughout the country; which they likely do not; although this may be preferable to simply dropping them out of analysis entirely.')
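
The fallback order I'm suggesting in this section can be sketched roughly as below.  This is only a sketch under my own assumptions: the function names are hypothetical, missing values are represented as `None` (in the dataframes they would be NaN), and the 0.4 national value and 70% rural share in the usage lines are made-up illustrative numbers, not real data.

```python
def country_mpi(mpi_rural, mpi_urban, rural_pop_pct):
    """Country level MPI from rural/urban MPI and a rural share given as 0-100."""
    return rural_pop_pct / 100 * mpi_rural + (100 - rural_pop_pct) / 100 * mpi_urban

def partner_national_mpi(partner_rural_pct, mpi_rural, mpi_urban,
                         mpi_national=None, wb_rural_pct=None):
    """Pick a National MPI based score for a partner, falling back when data is missing.

    1. Partner rural_pct is populated: weight rural/urban MPI with it (current method).
    2. Otherwise use the MPI National published in the Sub-National file, if any.
    3. Otherwise calculate a national value from the World Bank rural share.
    Returns None when none of the three inputs is available.
    """
    if partner_rural_pct is not None:
        return country_mpi(mpi_rural, mpi_urban, partner_rural_pct)
    if mpi_national is not None:
        return mpi_national
    if wb_rural_pct is not None:
        return country_mpi(mpi_rural, mpi_urban, wb_rural_pct)
    return None

# Illustrative calls; 0.4 and 70 are made-up inputs, not real figures
print(partner_national_mpi(85, 0.48, 0.189))                      # partner value present
print(partner_national_mpi(None, 0.48, 0.189, mpi_national=0.4))  # inherit published value
print(partner_national_mpi(None, 0.48, 0.189, wb_rural_pct=70))   # calculate from rural share
```

The point of the ordering is simply that a partner with a null rural percentage still gets a score instead of silently dropping out of the comparison.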

<a id=3_1_2></a>
### 3.1.2 Critique 2: Rural Percentage Accuracy
This critique is twofold;
1. How is this metric tracked over time?  It appears to be a point-in-time, field partner level attribute in the application.  Say Year 1 has 100 loans at 0% rural and Year 2 has 10 loans at 90% rural.  If this value is simply replaced, it would incorrectly score this as if 90% of all 110 loans were applied to rural areas.
2. The values all appear to be nice whole numbers, in multiples of 5 for the most part...  are field partners actually tracking this?  Or are these very off the cuff guesses or loosely calculated values for a section, or even many of them?  I assume the field partner is providing them, would they have any incentive to goose the number?  There are so many nice round numbers that I question if many of these are actually accurate/real.
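
Here is the arithmetic for the first point, as a minimal sketch using the two hypothetical years from the list above.  The `history` list is an assumed structure for illustration; nothing like it exists in the provided data.

```python
# (number of loans, rural percentage) per year for one hypothetical partner
history = [(100, 0), (10, 90)]

total_loans = sum(n for n, _ in history)
rural_loans = sum(n * pct / 100 for n, pct in history)

latest_pct = history[-1][1]                   # what you get if the value is overwritten
true_pct = 100 * rural_loans / total_loans    # loan-weighted share over both years
print(latest_pct, round(true_pct, 1))
```

Only 9 of the 110 loans are rural, so the overwritten 90% overstates the real share by roughly a factor of ten.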

In [50]:
plt.figure(figsize=(12,6))
sns.distplot(df_kiva_rural[~df_kiva_rural['rural_pct'].isnull()]['rural_pct'], bins=30)
plt.show()

In [51]:
plt.figure(figsize=(15,10))
plotSeries = df_kiva_rural['rural_pct'].value_counts().to_frame().reset_index()
o = plotSeries['index']
ax = sns.barplot(data=plotSeries, x='rural_pct', y='index', color='c', orient='h', order=o)
ax.set_title('Number of Field Partners with Rural Percentage', fontsize=15)
ax.set_ylabel('Rural Percentage')
ax.set_xlabel('Number of Field Partners')
plt.show()

In [52]:
HTML('<img style="margin: 0px 20px" align=left src=https://www.doyouevendata.com/wp-content/uploads/2018/03/attn.png>If Kiva wants to know this value, it would likely be best to require it as a loan level attribute.  This would resolve all issues going forward.  All field partners would have values, the numbers produced would be accurate for all partners, and the proper percentage rural could be calculated for any span of time.')

<a id=3_1_3></a>
### 3.1.3 Critique 3: Overall Changes Over Time
Mentioned in the above section is the question of whether changes to rural_pct are tracked over time; however it's not the only thing changing.  So are the poverty dimensions over time, as state policies and various aid efforts work to improve situations.  Kiva has been around a decade and is likely to be around for some time; a lot can change in the span of decades.  We can't simply replace current MPI numbers with new MPI numbers when they are released and expect to have an accurate picture of the story as it unfolded.

In [53]:
HTML('<img style="margin: 0px 20px" align=left src=https://www.doyouevendata.com/wp-content/uploads/2018/03/attn.png>It feels like the data provided has been produced from their application database; does Kiva have a separate reporting database?  A reporting database/data warehouse could accurately capture the application data as it is at a snapshot in time and track slowly changing dimension values along with it, so that things like the proper MPI value are associated with a set of loans at a given point in time.')
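
As a rough illustration of what "the proper MPI value for a loan at a point in time" could look like, here is a tiny as-of lookup in plain Python.  Everything in it is an assumption for the sake of the example: the `mpi_history` structure, the `mpi_as_of` helper, and the MPI values and dates are all made up.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical MPI history for one region: (effective_from, mpi), sorted by date
mpi_history = [
    (date(2010, 1, 1), 0.50),
    (date(2014, 1, 1), 0.45),
    (date(2018, 1, 1), 0.40),
]

def mpi_as_of(history, when):
    """Return the MPI in effect on `when`, or None if `when` predates all records."""
    starts = [start for start, _ in history]
    pos = bisect_right(starts, when)  # number of records that started on or before `when`
    return history[pos - 1][1] if pos else None

# A loan disbursed in mid-2015 keeps the 2014 value even after the 2018 release
print(mpi_as_of(mpi_history, date(2015, 6, 30)))
```

A data warehouse would do this with effective-dated dimension rows rather than a Python list, but the idea is the same: new MPI releases add rows instead of overwriting old ones.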

<a id=3_2></a>
## 3.2 Sub-National MPI Based Calculations
<a id=3_2_1></a>
### 3.2.1 Critique 4: Inherited National MPI and Mixed Methodology
For Sub-National MPI - where Kiva is able to associate a country's region level specific MPI with lenders, we should see a valuable increase in understanding a lender's situation.  However, not all countries have this region level data.  For those that do not, Kiva inherits the National MPI as noted above.  I think this has two problems as well;
1. As noted above, National MPI as presently calculated does not seem particularly trustworthy.
2. This seems like a significant mixing of methodology; one with hard MPI provided regional numbers, and one with pretty soft appearing field partner rural percentages.  Can we expect fair peer comparisons when mixing such different methodologies?  The safer route, to me, would appear to be calculating the National MPI based off of World Bank data as proposed in [Section 3.1.1](#3_1_1).

<a id=3_2_2></a>
### 3.2.2 Critique 5: Inaccurate Region Assignment
I do not know what methodology went into current Kiva loan mpi_region assignment, but the accuracy of assigning loans to the right regions is important when using the regional based Sub-National MPI values.  I actually picked Mozambique at random (Partner ID 23 was a low number when visually scanning through some data) and found there to be pretty significant regional inaccuracies.  First let's take a look at the Sub-National Field Partner MPI Scoring with the current methodology.


In [54]:
ISO = 'MOZ'
MPIsubnat = pd.read_csv("../input/mpi/MPI_subnational.csv")[['Country', 'Sub-national region', 'World region', 'MPI National', 'MPI Regional']]
# Create new column LocationName that concatenates the columns Country and Sub-national region
MPIsubnat['LocationName'] = MPIsubnat[['Sub-national region', 'Country']].apply(lambda x: ', '.join(x), axis=1)

LTsubnat = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv")[['Partner ID', 'Loan Theme ID', 'region', 'mpi_region', 'ISO', 'number', 'amount', 'LocationName', 'names']]

# Merge dataframes
LTsubnat = LTsubnat.merge(MPIsubnat, left_on='mpi_region', right_on='LocationName', suffixes=('_LTsubnat', '_mpi'))[['Partner ID', 'Loan Theme ID', 'Country', 'ISO', 'mpi_region', 'MPI Regional', 'number', 'amount']]

#~ Get total volume and average MPI Regional Score for each partner loan theme
LS = LTsubnat.groupby(['Partner ID', 'Loan Theme ID', 'Country', 'ISO']).agg({'MPI Regional': np.mean, 'amount': np.sum, 'number': np.sum})
#~ Get a volume-weighted average of partners loanthemes.
weighted_avg_LTsubnat = lambda df: np.average(df['MPI Regional'], weights=df['amount'])
#~ and get weighted average for partners. 
MPI_regional_scores = LS.groupby(level=['Partner ID', 'ISO']).apply(weighted_avg_LTsubnat)
MPI_regional_scores = MPI_regional_scores.to_frame()
MPI_regional_scores.reset_index(level=1, inplace=True)
MPI_regional_scores = MPI_regional_scores.rename(index=str, columns={0: 'MPI Score'})

MPI_regional_scores[MPI_regional_scores['ISO'] == ISO].reset_index()

We now have 5 partners instead of just 2.  Something of serious note here is **how drastically** the MPI Score has now changed.  Partner 23 had a value of 0.43635, and now it's 1/10th of that at 0.043.  The National vs. Sub-National KDE in [Section 3](#3) smooths over and hides the fact of this being an extremely different value.  I'll again note that I didn't look for this extreme a situation - I simply began looking at field partner 23 because it was near the top of the loan_theme_regions file.

The reason for this drastic change can be from two factors, and in this case, is arguably both; 
1. 85% of loans being rural by Partner 23 may be an exaggeration of reality, for whatever reason.  (Differing opinion of what rural is, or simply poorly tracked metric or bad guess)
2. Loans being incorrectly assigned to MPI regions.

We can prove the latter is true with a browse of the available data; first let's take a look at the MPI Regional values, then a look at our loan region and mpi_region assignments.

In [55]:
df_mpi_subntl[df_mpi_subntl['ISO country code'] == 'MOZ'][['ISO country code', 'Sub-national region', 'MPI Regional']].sort_values('Sub-national region')

In [56]:
df_kv_theme_rgn[(df_kv_theme_rgn['ISO'] == 'MOZ') & (df_kv_theme_rgn['Partner ID'] == 23)].groupby(['region', 'mpi_region'])[['number', 'amount']].sum().sort_values('amount', ascending=False)

Here we can see the problem; Maputo City is the capital of Mozambique.  Almost all of the regions on the left became associated with the MPI Region of the capital city; which has the lowest regional MPI of 0.043.  I believe the region of MOZ simply dropped out of calculations.  Where are these really?  Everything ending with ", Maputo" is actually a part of Maputo Province (MPI 0.133).  Machava-15 appears to be a neighborhood in Matola within the province as well.  Partner 23 shouldn't have an MPI of 0.43635 (National) but neither should it have a value of only 0.043 (Sub-National).
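
To get a feel for how much the region assignment alone can move the score, here is a small sketch using the two regional MPI values mentioned above (Maputo City 0.043, Maputo Province 0.133).  The dollar amounts and the 80/20 split are made up purely for illustration; they are not Partner 23's real numbers.

```python
MPI_CITY, MPI_PROVINCE = 0.043, 0.133

def weighted_mpi(rows):
    """Amount-weighted average MPI over (amount, regional_mpi) rows."""
    total = sum(amount for amount, _ in rows)
    return sum(amount * mpi for amount, mpi in rows) / total

# Hypothetical partner: 80% of dollars really lent in the province, 20% in the city
as_assigned = [(80_000, MPI_CITY), (20_000, MPI_CITY)]      # everything mapped to the city
corrected = [(80_000, MPI_PROVINCE), (20_000, MPI_CITY)]    # province loans mapped correctly

print(round(weighted_mpi(as_assigned), 3), round(weighted_mpi(corrected), 3))
```

With every loan mapped to the capital, the score can never rise above 0.043, no matter where the money actually went.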


In [57]:
HTML('<img style="margin: 0px 20px" align=left src=https://www.doyouevendata.com/wp-content/uploads/2018/03/attn.png>Kiva needs to update the accuracy of its current mpi region assignment.  The google API others have leveraged to find points seems to work reasonably well, as does the points in polygon geospatial approach.  It would probably work better at scale with spatial join functionality vs. my crude brute force code.  Kiva could also update their application to require some kind of valid google location point as well.')

<a id=3_2_3></a>
### 3.2.3 Sub-National Regional Reassignment and Scoring --- Loan Theme Partner Regions
I do not know the methodology Kiva has used to associate loans or partners to regions.  I am going to do it with one method using geospatial objects that appears to work reasonably well.  For this I am going to use the data provided in the [All Kiva Challenge Loan Location Coordinates](https://www.kaggle.com/mithrillion/kiva-challenge-coordinates) dataset provided by [Mithrillion](https://www.kaggle.com/mithrillion).  [Beluga](https://www.kaggle.com/gaborfodor) also has an excellent dataset containing [even more loans and their locations](https://www.kaggle.com/gaborfodor/additional-kiva-snapshot), however I was not sure how well it would tie in to the challenge snapshot of field partners and their aggregate metrics.  It would also take much more time to run this method for significantly more loans.

We will determine in which region a location lies by using geopandas and geodataframes.  We will have a location (point) and determine which region (polygon) it is in.  There is a spatial join which likely works much better than the code I have been using, [however it does not appear that it will be available any time soon](https://www.kaggle.com/product-feedback/53008) and I'm unsure how to add it myself.  We'll use this methodology with the aggregate field partner numbers in the loan theme regions input as was done for the existing Sub-National methodology.  The geospatial regions are actually at a lower district level, so we'll have to combine them into larger polygons at our MPI regional level.  In the next section, we'll run these numbers for the actual loans in our challenge set of loans.  Do we see any regional changes?
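
If the points in polygons idea is new to you, here is a toy version in plain Python.  To be clear, this is not the code used in this kernel (that relies on geopandas/shapely `within`), and it is not necessarily how shapely does it internally; it is a standard ray casting test, with two made-up square "regions" standing in for the real province polygons.  The helper names are mine.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: a point is inside if a ray going right crosses an odd number of edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_region(x, y, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None

# Two made-up square regions sharing a border at x = 2
regions = {
    'West': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'East': [(2, 0), (4, 0), (4, 2), (2, 2)],
}

print(assign_region(1, 1, regions))
print(assign_region(3, 1, regions))
print(assign_region(5, 1, regions))  # outside both regions
```

A point that falls outside every polygon comes back with no region, which is the same way a loan can end up with a null region in the real data.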

In [58]:
# MOZ spatial projection
epsg = '42106'

### POINTS ###
# loan theme geodataframe
gdf_loan_theme = df_kv_theme_rgn[['Partner ID', 'region', 'mpi_region', 'ISO', 'number', 'amount']].groupby(['Partner ID', 'region', 'mpi_region', 'ISO']).sum().reset_index().merge(df_kiv_loc, how='left', on='region')
gdf_loan_theme['geometry'] = gdf_loan_theme.apply(lambda row: Point(row['lng'], row['lat']), axis=1)
gdf_loan_theme = gpd.GeoDataFrame(gdf_loan_theme, geometry='geometry')

# seems like this should work per stack overflow but it puts a lower case nan in and isnull doesn't work properly on it.
#gdf_loan_theme['mpi_region_new'] = np.NaN
gdf_loan_theme = gdf_loan_theme.reindex(columns = np.append(gdf_loan_theme.columns.values, ['mpi_region_new']))

gdf_loan_theme.crs = {"init":epsg}
gdf_loan_theme.head()

In [59]:
### POLYGONS ###
# read in geospatial data
gdf_moz = gpd.GeoDataFrame.from_file("../input/mozambique-geospatial-regions/moz_polbnda_adm2_districts_wfp_ine_pop2012_15_ocha.shp")

# massage regional data
gdf_moz['PROVINCE'] = np.where(gdf_moz['PROVINCE'].str.contains('Zamb'), 'Zambézia', gdf_moz['PROVINCE'])

# aggregate districts into regions used for MPI level in MOZ
moz_regions = {}

provinces = gdf_moz['PROVINCE'].drop_duplicates()
for p in provinces:
    polys = gdf_moz[gdf_moz['PROVINCE'] == p]['geometry']
    u = cascaded_union(polys)
    moz_regions[p] = u
    
#make a geodataframe for the regions    
s = pd.Series(moz_regions, name='geometry')
s.index.name = 'mpi_region_new'
s.reset_index()

gdf_moz = gpd.GeoDataFrame(s, geometry='geometry')
#gdf_moz.crs = {"init":'42106'}
gdf_moz.crs = {"init":epsg}
gdf_moz.reset_index(level=0, inplace=True)

#assign regional MPI to regions
gdf_moz = gdf_moz.merge(df_mpi_subntl[df_mpi_subntl['ISO country code'] == ISO][['Sub-national region', 'MPI Regional']], how='left', 
                                      left_on='mpi_region_new', right_on='Sub-national region')
#manual updates due to character or spelling differences
gdf_moz['MPI Regional'] = np.where(gdf_moz['mpi_region_new'] == 'Zambézia', 0.528, gdf_moz['MPI Regional'])
gdf_moz['MPI Regional'] = np.where(gdf_moz['mpi_region_new'] == 'Maputo', 0.133, gdf_moz['MPI Regional'])
gdf_moz['MPI Regional'] = np.where(gdf_moz['mpi_region_new'] == 'Maputo City', 0.043, gdf_moz['MPI Regional'])
gdf_moz = gdf_moz[['mpi_region_new', 'MPI Regional', 'geometry']]
gdf_moz.head()

In [60]:
### subset points, set in regions
gdf_regions = gdf_moz
# copy so the column assignments in the loop below don't write to a slice of gdf_loan_theme
gdf_points = gdf_loan_theme[gdf_loan_theme['ISO'] == ISO].copy()

### POINTS IN POLYGONS
for i in range(0, len(gdf_regions)):
    print('i is: ' + str(i) + ' at ' + strftime("%Y-%m-%d %H:%M:%S", gmtime()))
    gdf_points['r_map'] = gdf_points.within(gdf_regions['geometry'][i])
    gdf_points['mpi_region_new'] = np.where(gdf_points['r_map'], gdf_regions['mpi_region_new'][i], gdf_points['mpi_region_new'])
#gdf_points[['id', 'mpi_region_new']].to_csv(ISO + '_pip_output.csv', index = False)

In [61]:
gdf_points.groupby(['mpi_region', 'mpi_region_new'])[['number', 'amount']].sum().reset_index().sort_values('number', ascending=False)

At the top we can see a significant change - we moved 1700 loans from Maputo City (0.043) to the Maputo Province (0.133) - this will surely affect our scoring.  Unfortunately we lost some loans; 16 from Kiva's method thought to be in the city ended up as null.  Some Nampula loans stayed Nampula; others went to Zambézia or null.  Now what's our field partner calculated Sub-National MPI?

In [62]:
# Merge dataframes
# why did this NaN that worked as null turn into a nan that i'm forced to treat like a string??
LTsubnat = gdf_points[~(gdf_points['mpi_region_new'] == 'nan')]
LTsubnat = LTsubnat.merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')

#~ Get total volume and average MPI Regional Score for each partner loan theme
LS = LTsubnat.groupby(['Partner ID', 'ISO']).agg({'MPI Regional': np.mean, 'amount': np.sum, 'number': np.sum})
#~ Get a volume-weighted average of partners loanthemes.
weighted_avg_LTsubnat = lambda df: np.average(df['MPI Regional'], weights=df['amount'])
#~ and get weighted average for partners. 
MPI_reg_reassign = LS.groupby(level=['Partner ID', 'ISO']).apply(weighted_avg_LTsubnat)
MPI_reg_reassign = MPI_reg_reassign.to_frame()
MPI_reg_reassign.reset_index(level=1, inplace=True)
MPI_reg_reassign = MPI_reg_reassign.rename(index=str, columns={0: 'MPI Score'})

MPI_reg_reassign[MPI_reg_reassign['ISO'] == ISO].reset_index()

This makes sense visually if we plot these partners on a map as well.  Many loans are in the city region, partner 23 has more in the province, and partner 468 is going out into the much more impoverished areas.

In [63]:
Blues = plt.get_cmap('Blues')

regions = gdf_regions['mpi_region_new']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['mpi_region_new'] == r].plot(ax=ax, color=Blues(gdf_regions[gdf_regions['mpi_region_new'] == r]['MPI Regional']*1.3))

gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 23)].plot(ax=ax, markersize=10, color='red', label='23')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 210)].plot(ax=ax, markersize=10, color='lime', label='210')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 366)].plot(ax=ax, markersize=10, color='green', label='366')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 468)].plot(ax=ax, markersize=10, color='yellow', label='468')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 492)].plot(ax=ax, markersize=10, color='purple', label='492')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 261)].plot(ax=ax, markersize=10, color='orange', label='261')


# label each region at its centroid; .items() also works on pandas versions where iteritems() was removed
for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'mpi_region_new']
    ax.text(s=reg_n, x=point.x, y=point.y, fontsize='large')
    

ax.set_title('Loans across Mozambique by Field Partner\nDarker = Higher MPI.  Range from 0.043 (Maputo City) to 0.528 (Zambézia)')
ax.legend(loc='upper left', frameon=True)
leg = ax.get_legend()
new_title = 'Partner ID'
leg.set_title(new_title)

plt.show()

<a id=3_2_4></a>
### 3.2.4 Sub-National Regional Reassignment and Scoring --- Loans
We've calculated the above using the loan_theme_region aggregates provided.  What's it look like if we use the values *from the actual loans*?  This would be the ideal calculation given all available data.  Our polygons are the same, but we'll have to convert our loans into points.  Should we use funded_amount or loan_amount?  I would say the latter, as a field lender may disburse the loan with their own capital, or the loan may already be pre-disbursed.

In [64]:
### POINTS ###
# loan geodataframe
gdf_kv_loans = df_kv_loans.merge(df_kiv_loc, on=['region', 'country'], how='left')
gdf_kv_loans['geometry'] = gdf_kv_loans.apply(lambda row: Point(row['lng'], row['lat']), axis=1)
gdf_kv_loans = gpd.GeoDataFrame(gdf_kv_loans, geometry='geometry')

# seems like this should work per stack overflow but it puts a lower case nan in and isnull doesn't work properly on it.
#gdf_kv_loans['mpi_region_new'] = np.NaN
gdf_kv_loans = gdf_kv_loans.reindex(columns = np.append(gdf_kv_loans.columns.values, ['mpi_region_new']))

gdf_kv_loans.crs = {"init":epsg}
gdf_kv_loans.head()

In [65]:
gdf_points = gdf_kv_loans[gdf_kv_loans['country'] == 'Mozambique']
gdf_points = gdf_points.rename(index=str, columns={'partner_id': 'Partner ID'})

### POINTS IN POLYGONS
for i in range(0, len(gdf_regions)):
    print('i is: ' + str(i) + ' at ' + strftime("%Y-%m-%d %H:%M:%S", gmtime()))
    gdf_points['r_map'] = gdf_points.within(gdf_regions['geometry'][i])
    gdf_points['mpi_region_new'] = np.where(gdf_points['r_map'], gdf_regions['mpi_region_new'][i], gdf_points['mpi_region_new'])
gdf_points[['id', 'mpi_region_new']].to_csv(ISO + '_loans_pip_output.csv', index = False)

In [66]:
# Merge dataframes
# why did this NaN that worked as null turn into a nan that i'm forced to treat like a string??
LTsubnat = gdf_points[~(gdf_points['mpi_region_new'] == 'nan')]
LTsubnat = LTsubnat.merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')

LTsubnat['number'] = 1

#~ Get total volume and average MPI Regional Score for each partner loan theme
LS = LTsubnat.groupby(['Partner ID']).agg({'MPI Regional': np.mean, 'loan_amount': np.sum, 'number': np.sum})
#~ Get a volume-weighted average of partners loanthemes.
weighted_avg_LTsubnat = lambda df: np.average(df['MPI Regional'], weights=df['loan_amount'])
#~ and get weighted average for partners. 
MPI_loan_reassign = LS.groupby(level=['Partner ID']).apply(weighted_avg_LTsubnat)
MPI_loan_reassign = MPI_loan_reassign.to_frame()
MPI_loan_reassign.reset_index(level=0, inplace=True)
MPI_loan_reassign = MPI_loan_reassign.rename(index=str, columns={0: 'MPI Score'})

MPI_loan_reassign

Now we've picked up another lender; 210 has entered the scene.  Most of these scores look similar, although we can see 468 dropped from 0.3484 to 0.2259.  On the map this partner seems to lend in the most impoverished places, so that is a surprisingly big jump down.  Why?
<a id=3_2_5></a>
### 3.2.5 Sub-National Regional Reassignment and Scoring --- Loans --- Weighted Instead of Averaged
I think this has to do with this specific bit of code, inherited from [Annalie's kernel](https://www.kaggle.com/annalie/kivampi):

> LS = LTsubnat.groupby(['Partner ID']).agg({'MPI Regional': **np.mean**, 'loan_amount': np.sum, 'number': np.sum})

Why should we take a plain average of MPI, when we can weight it by dollars (or loan count) instead?  Let's try another method where we do so.  What results does that produce?
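
As a toy illustration of why the two approaches can diverge, consider a hypothetical partner with two small loans in a high-MPI region and one large loan in a low-MPI region. The amounts and MPI values below are made up purely for illustration.

```python
# hypothetical loans for one partner: (loan amount in USD, regional MPI)
loans = [
    (100.0, 0.50),
    (100.0, 0.50),
    (800.0, 0.10),
]

# simple mean: every loan counts equally, regardless of size
simple_mean = sum(mpi for _, mpi in loans) / len(loans)

# dollar-weighted mean: each loan counts in proportion to its amount
total_amount = sum(amount for amount, _ in loans)
dollar_weighted = sum(amount * mpi for amount, mpi in loans) / total_amount

print(round(simple_mean, 4))
print(round(dollar_weighted, 4))
```

Here the simple mean is about 0.37 while the dollar-weighted mean is 0.18, because most of the money went to the low-MPI region.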

In [67]:
gdf_out = gdf_points[~(gdf_points['mpi_region_new'] == 'nan')].merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')
# weight by loan_amount so the numerator matches the loan_amount denominator below
gdf_out['MPI Weight'] = gdf_out['loan_amount'] * gdf_out['MPI Regional']
gdf_out = gdf_out.groupby('Partner ID')[['loan_amount', 'MPI Weight']].sum().reset_index()
gdf_out['MPI Score'] = gdf_out['MPI Weight'] / gdf_out['loan_amount']
gdf_out[['Partner ID', 'MPI Score']]

What's the map look like now, and how do these methods all compare?

In [68]:
Blues = plt.get_cmap('Blues')

regions = gdf_regions['mpi_region_new']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['mpi_region_new'] == r].plot(ax=ax, color=Blues(gdf_regions[gdf_regions['mpi_region_new'] == r]['MPI Regional']*1.3))

gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 23)].plot(ax=ax, markersize=10, color='red', label='23')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 210)].plot(ax=ax, markersize=10, color='lime', label='210')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 366)].plot(ax=ax, markersize=10, color='green', label='366')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 468)].plot(ax=ax, markersize=10, color='yellow', label='468')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 492)].plot(ax=ax, markersize=10, color='purple', label='492')
gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 261)].plot(ax=ax, markersize=10, color='orange', label='261')


for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'mpi_region_new']
    ax.text(s=reg_n, x=point.x, y=point.y, fontsize='large')
    

ax.set_title('Loans across Mozambique by Field Partner\nDarker = Higher MPI.  Range from 0.043 (Maputo City) to 0.528 (Zambézia)')
ax.legend(loc='upper left', frameon=True)
leg = ax.get_legend()
new_title = 'Partner ID'
leg.set_title(new_title)

plt.show()

In [69]:
# national scores
df_curr_ntl_mpi = LT[LT['ISO'] == ISO][['Partner ID', 'MPI Score']].drop_duplicates()
df_curr_ntl_mpi['method'] = 'Existing National (rural_percentage)'

# subnational scores
df_curr_subnat = MPI_regional_scores[MPI_regional_scores['ISO'] == ISO].reset_index()
df_curr_subnat = df_curr_subnat[['Partner ID', 'MPI Score']]
df_curr_subnat['method'] = 'Existing Sub-National (Partner Regions)'

# amended region theme scores
df_amend_fld_rgn = MPI_reg_reassign[MPI_reg_reassign['ISO'] == ISO].reset_index()[['Partner ID', 'MPI Score']]
df_amend_fld_rgn['method'] = 'Amended Sub-National (Partner Regions)'

# amended loan scores - averaged
MPI_loan_reassign['method'] = 'Amended Sub-National (Loans - Mean)'

# amended loan scores - weighted
gdf_out = gdf_out[['Partner ID', 'MPI Score']]
gdf_out['method'] = 'Amended Sub-National (Loans - Weighted)'

# combine for comparison
frames = (df_curr_ntl_mpi, df_curr_subnat, df_amend_fld_rgn, MPI_loan_reassign, gdf_out)
df_compare = pd.concat(frames)
df_compare['Partner ID'] = df_compare['Partner ID'].astype(str).str.split('.', n=1).str[0]

In [70]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.set_palette('muted')
sns.barplot(x='Partner ID', y='MPI Score', data=df_compare, hue='method')

ax.legend(ncol=1, loc='upper right', frameon=True)
ax.set(ylabel='MPI Score',
       xlabel='Partner ID')

leg = ax.get_legend()
new_title = 'Method'
leg.set_title(new_title)

ax.set_title('Existing vs. Amended Field Partner MPI - Mozambique', fontsize=15)
plt.show()

1. The National MPI methodology produces very skewed scores, when it produces one at all.
2. Green to Red is an improvement in mpi_region accuracy at the field partner attribute level.
3. Red to Purple is an improvement in mpi_region accuracy at the loan attribute level.
4. Purple to Mustard stays at the loan attribute level; the change comes from weighting each loan by its dollar amount rather than taking a simple mean of regional MPI.
5. I would argue that each step from left to right in the chart (or top to bottom in the legend) is an improvement in calculating Field Partner MPI.

<a id=4></a>
# 4. Kenya Analysis
Kenya has the 6th most loans per capita, and the 2nd most loans in terms of absolute numbers.
<a id=4_1></a>
## 4.1 Kenya MPI Analysis (National MPI 0.187)
Let's repeat the analysis above for Kenya.  Kenya has had a lot of redistricting, which can be [read about on Wikipedia](https://en.wikipedia.org/wiki/Sub-Counties_of_Kenya).  Since 2013 it has had 47 counties.  MPI reports data for 8 regions; these were Kenya's provinces, which were also dissolved in 2013.  We'll need to stitch the district shapes together into these dissolved provinces to properly place our points in polygons and associate them with MPI.  To save time when running this kernel, the point-in-polygon code for the loans is commented out, and a private helper dataset saved from a prior run is loaded instead.
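
The "helper dataset from a prior run" pattern can be sketched as a small compute-or-load function. This is only an illustration of the idea, not the code used in this kernel: `load_or_compute` and `slow_point_in_polygon` are hypothetical names, the ids and regions are made up, and the file is written to a temporary directory.

```python
import csv
import os
import tempfile

def load_or_compute(path, compute):
    """Return {id: region} from path if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        with open(path, newline='') as f:
            return {row['id']: row['mpi_region_new'] for row in csv.DictReader(f)}
    result = compute()
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'mpi_region_new'])
        writer.writerows(result.items())
    return result

calls = []

def slow_point_in_polygon():
    # stand-in for the slow spatial loop; ids and regions are made up
    calls.append(1)
    return {'1001': 'Nairobi', '1002': 'Coast'}

cache_path = os.path.join(tempfile.mkdtemp(), 'KEN_loans_pip_output.csv')
first = load_or_compute(cache_path, slow_point_in_polygon)   # computes and writes the file
second = load_or_compute(cache_path, slow_point_in_polygon)  # reads the saved file
print(first == second, len(calls))
```

The slow function only runs once; every later call reads the saved CSV, which is the same trade-off made with the helper file in the cells below.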

In [71]:
ISO = 'KEN'
epsg = '4210'

### POLYGONS ###
# read in geospatial data
gdf_ken = gpd.GeoDataFrame.from_file("../input/kenya-geospatial-administrative-regions/ke_district_boundaries.shp")
gdf_ken['DISTNAME'] = gdf_ken['DISTNAME'].str.title()
gdf_ken = gdf_ken.reindex(columns = np.append(gdf_ken.columns.values, ['mpi_region_new']))

# map each former province (the MPI sub-national region) to its districts
province_districts = {
    'Coast': ['Mombasa', 'Kwale', 'Kilifi', 'Tana River', 'Lamu', 'Taita Taveta', 'Malindi'],
    'North Eastern': ['Garissa', 'Wajir', 'Mandera'],
    'Eastern': ['Marsabit', 'Isiolo', 'Meru', 'Tharaka-Nithi', 'Embu', 'Kitui', 'Machakos', 'Makueni', 'Meru South', 'Meru North', 'Meru Central', 'Tharaka', 'Mbeere', 'Moyale', 'Mwingi'],
    'Rift Valley': ['Turkana', 'West Pokot', 'Samburu', 'Trans-Nzoia', 'Uasin Gishu', 'Elgeyo-Marakwet', 'Nandi', 'Trans Nzoia', 'Keiyo', 'Koibatek', 'Marakwet', 'Baringo', 'Laikipia', 'Nakuru', 'Narok', 'Kajiado', 'Kericho', 'Bomet', 'Buret', 'Trans Mara'],
    'Western': ['Kakamega', 'Vihiga', 'Bungoma', 'Busia', 'Butere/Mumias', 'Lugari', 'Mt Elgon', 'Teso'],
    'Nyanza': ['Siaya', 'Kisumu', 'Homa Bay', 'Migori', 'Kisii', 'Nyamira', 'Bondo', 'Central Kisii', 'Gucha', 'Kuria', 'Nyando', 'Rachuonyo', 'Suba'],
    'Nairobi': ['Nairobi'],
    'Central': ['Nyandarua', 'Nyeri', 'Kirinyaga', 'Muranga', 'Kiambu', 'Maragua', 'Thika'],
}
for province, districts in province_districts.items():
    gdf_ken['mpi_region_new'] = np.where(gdf_ken['DISTNAME'].isin(districts), province, gdf_ken['mpi_region_new'])

# aggregate districts into regions used for MPI level in KEN
ken_regions = {}

districts = gdf_ken['mpi_region_new'].drop_duplicates()
for d in districts:
    polys = gdf_ken[gdf_ken['mpi_region_new'] == d]['geometry']
    u = cascaded_union(polys)
    ken_regions[d] = u
    
#make a geodataframe for the regions    
s = pd.Series(ken_regions, name='geometry')
s.index.name = 'mpi_region_new'
s.reset_index()

gdf_ken = gpd.GeoDataFrame(s, geometry='geometry')
gdf_ken.crs = {"init":epsg}
gdf_ken.reset_index(level=0, inplace=True)

#assign regional MPI to regions
gdf_ken = gdf_ken.merge(df_mpi_subntl[df_mpi_subntl['ISO country code'] == ISO][['Sub-national region', 'MPI Regional']], how='left', 
                                      left_on='mpi_region_new', right_on='Sub-national region')


In [77]:
### subset points, set in regions
gdf_regions = gdf_ken
# copy the subset so the column assignments below do not write to a view of gdf_loan_theme
gdf_points = gdf_loan_theme[gdf_loan_theme['ISO'] == ISO].copy()

### POINTS IN POLYGONS
for i in range(0, len(gdf_regions)):
    print('i is: ' + str(i) + ' at ' + strftime("%Y-%m-%d %H:%M:%S", gmtime()))
    gdf_points['r_map'] = gdf_points.within(gdf_regions['geometry'][i])
    gdf_points['mpi_region_new'] = np.where(gdf_points['r_map'], gdf_regions['mpi_region_new'][i], gdf_points['mpi_region_new'])
#gdf_points[['id', 'mpi_region_new']].to_csv(ISO + '_pip_output.csv', index = False)

In [78]:
# amended sub-national partner regions
LTsubnat = gdf_points[~(gdf_points['mpi_region_new'] == 'nan')]
LTsubnat = LTsubnat.merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')

#~ Get total volume and average MPI Regional Score for each partner loan theme
LS = LTsubnat.groupby(['Partner ID', 'ISO']).agg({'MPI Regional': np.mean, 'amount': np.sum, 'number': np.sum})
#~ Get a volume-weighted average of partners loanthemes.
weighted_avg_LTsubnat = lambda df: np.average(df['MPI Regional'], weights=df['amount'])
#~ and get weighted average for partners. 
MPI_reg_reassign = LS.groupby(level=['Partner ID', 'ISO']).apply(weighted_avg_LTsubnat)
MPI_reg_reassign = MPI_reg_reassign.to_frame()
MPI_reg_reassign.reset_index(level=1, inplace=True)
MPI_reg_reassign = MPI_reg_reassign.rename(index=str, columns={0: 'MPI Score'})

MPI_reg_reassign[MPI_reg_reassign['ISO'] == ISO].reset_index()

In [86]:
# now let's do loans
gdf_points = gdf_kv_loans[gdf_kv_loans['country'] == 'Kenya']
gdf_points = gdf_points.rename(index=str, columns={'partner_id': 'Partner ID'})

### POINTS IN POLYGONS
# commented out to speed up reruns of this kernel; the helper file below holds the output of a prior run
#for i in range(0, len(gdf_regions)):
#    print('i is: ' + str(i) + ' at ' + strftime("%Y-%m-%d %H:%M:%S", gmtime()))
#    gdf_points['r_map'] = gdf_points.within(gdf_regions['geometry'][i])
#    gdf_points['mpi_region_new'] = np.where(gdf_points['r_map'], gdf_regions['mpi_region_new'][i], gdf_points['mpi_region_new'])
#gdf_points[['id', 'mpi_region_new']].to_csv(ISO + '_loans_pip_output.csv', index = False)

# leverage helper file
gdf_points.drop(columns=['mpi_region_new'], inplace=True)
df_mpi_helper = pd.read_csv("../input/dydkivahelper/KEN_loans_pip_output.csv")
gdf_points = gdf_points.merge(df_mpi_helper, how='left', on='id')

In [87]:
# amended sub-national loans - average
LTsubnat = gdf_points[~(gdf_points['mpi_region_new'] == 'nan') & ~(gdf_points['mpi_region_new'].isnull())]
LTsubnat = LTsubnat.merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')

LTsubnat['number'] = 1

#~ Get total volume and average MPI Regional Score for each partner loan theme
LS = LTsubnat.groupby(['Partner ID']).agg({'MPI Regional': np.mean, 'loan_amount': np.sum, 'number': np.sum})
#~ Get a volume-weighted average of partners loanthemes.
weighted_avg_LTsubnat = lambda df: np.average(df['MPI Regional'], weights=df['loan_amount'])
#~ and get weighted average for partners. 
MPI_loan_reassign = LS.groupby(level=['Partner ID']).apply(weighted_avg_LTsubnat)
MPI_loan_reassign = MPI_loan_reassign.to_frame()
MPI_loan_reassign.reset_index(level=0, inplace=True)
MPI_loan_reassign = MPI_loan_reassign.rename(index=str, columns={0: 'MPI Score'})

MPI_loan_reassign

In [88]:
# amended sub-national loans - weighted
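gdf_out = gdf_points[~(gdf_points['mpi_region_new'] == 'nan') & ~(gdf_points['mpi_region_new'].isnull())].merge(gdf_regions[['mpi_region_new', 'MPI Regional']], on='mpi_region_new')
# weight by loan_amount so the numerator matches the loan_amount denominator below
gdf_out['MPI Weight'] = gdf_out['loan_amount'] * gdf_out['MPI Regional']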
gdf_out = gdf_out.groupby('Partner ID')[['loan_amount', 'MPI Weight']].sum().reset_index()
gdf_out['MPI Score'] = gdf_out['MPI Weight'] / gdf_out['loan_amount']
gdf_out[['Partner ID', 'MPI Score']]

In [89]:
Blues = plt.get_cmap('Blues')

regions = gdf_regions['mpi_region_new']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['mpi_region_new'] == r].plot(ax=ax, color=Blues(gdf_regions[gdf_regions['mpi_region_new'] == r]['MPI Regional']*1.3))

gdf_points[(gdf_points['lng'] > -60) & (gdf_points['lng'] < 44)].plot(ax=ax, markersize=10, color='red', label='loan')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 436)].plot(ax=ax, markersize=10, color='lime', label='436')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 366)].plot(ax=ax, markersize=10, color='green', label='366')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 468)].plot(ax=ax, markersize=10, color='yellow', label='468')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 492)].plot(ax=ax, markersize=10, color='purple', label='492')
#gdf_points[(gdf_points['lat'] != -999) & (gdf_points['Partner ID'] == 261)].plot(ax=ax, markersize=10, color='orange', label='261')


for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'mpi_region_new']
    ax.text(s=reg_n, x=point.x, y=point.y, fontsize='large')
    

ax.set_title('Loans across Kenya by Field Partner\nDarker = Higher MPI.  Range from 0.020 (Nairobi) to 0.509 (North Eastern)')
#ax.legend(loc='upper left', frameon=True)
#leg = ax.get_legend()
#new_title = 'Partner ID'
#leg.set_title(new_title)

plt.show()

It's a bit surprising to see that what is clearly the poorest part of this country has not received any aid from any field partner.  All field partner loans from the Kiva Challenge dataset are plotted above.  The next graph will include only field partners who lent exclusively within Kenya, so that a proper comparison between the metrics can be made.  For partners that provided loans in multiple countries, I would have to do the point-in-polygon analysis and updates for each country to calculate the numbers correctly.  As I have only done Kenya, those field partners are omitted from this graph.
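
A list of single-country partners like the one used below could be derived from the loans themselves. Here is a minimal sketch under the assumption that each loan carries a partner id and a country; the (partner, country) pairs are made up for illustration and are not taken from the Kiva data.

```python
from collections import defaultdict

# hypothetical (partner_id, country) pairs, one per loan
loans = [
    (133, 'Kenya'), (133, 'Kenya'),
    (164, 'Kenya'),
    (202, 'Kenya'), (202, 'Uganda'),
    (58, 'Tanzania'),
]

# collect the set of countries each partner has lent in
countries_by_partner = defaultdict(set)
for partner_id, country in loans:
    countries_by_partner[partner_id].add(country)

# keep partners whose only country is Kenya, as strings to match df_compare
kenya_only = sorted(
    str(p) for p, countries in countries_by_partner.items()
    if countries == {'Kenya'}
)
print(kenya_only)
```

In this toy data the partner that also lent in Uganda is dropped, as is the one that never lent in Kenya.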

In [90]:
#kenya_only_partners = ['133.0', '138.0', '156.0', '164.0', '218.0', '258.0', '262.0', '276.0', '322.0', '340.0', '386.0', '388.0', '405.0', '436.0', '469.0', '473.0', '491.0', '500.0', '502.0', '505.0', '512.0', '520.0', '526.0', '529.0', '540.0']
#kenya_only_partners = [133, 138, 156, 164, 218, 258, 262, 276, 322, 340, 386, 388, 405, 436, 469, 473, 491, 500, 502, 505, 512, 520, 526, 529, 540]
kenya_only_partners = ['133', '138', '156', '164', '218', '258', '262', '276', '322', '340', '386', '388', '405', '436', '469', '473', '491', '500', '502', '505', '512', '520', '526', '529', '540']

# national scores
df_curr_ntl_mpi = LT[LT['ISO'] == ISO][['Partner ID', 'MPI Score']].drop_duplicates()
df_curr_ntl_mpi['method'] = 'Existing National (rural_percentage)'

# subnational scores
df_curr_subnat = MPI_regional_scores[MPI_regional_scores['ISO'] == ISO].reset_index()
df_curr_subnat = df_curr_subnat[['Partner ID', 'MPI Score']]
df_curr_subnat['method'] = 'Existing Sub-National (Partner Regions)'

# amended region theme scores
df_amend_fld_rgn = MPI_reg_reassign[MPI_reg_reassign['ISO'] == ISO].reset_index()[['Partner ID', 'MPI Score']]
df_amend_fld_rgn['method'] = 'Amended Sub-National (Partner Regions)'

# amended loan scores - averaged
MPI_loan_reassign['method'] = 'Amended Sub-National (Loans - Mean)'

# amended loan scores - weighted
gdf_out = gdf_out[['Partner ID', 'MPI Score']]
gdf_out['method'] = 'Amended Sub-National (Loans - Weighted)'

# combine for comparison
frames = (df_curr_ntl_mpi, df_curr_subnat, df_amend_fld_rgn, MPI_loan_reassign, gdf_out)
df_compare = pd.concat(frames)
df_compare['Partner ID'] = df_compare['Partner ID'].astype(str).str.split('.', n=1).str[0]
df_compare = df_compare[df_compare['Partner ID'].isin(kenya_only_partners)]


In [91]:
fig, ax = plt.subplots(figsize=(20, 10))
sns.set_palette('muted')
sns.barplot(x='Partner ID', y='MPI Score', data=df_compare, hue='method')

ax.legend(ncol=1, loc='upper left', frameon=True)
ax.set(ylabel='MPI Score',
       xlabel='Partner ID')

leg = ax.get_legend()
new_title = 'Method'
leg.set_title(new_title)

ax.set_title('Existing vs. Amended Field Partner MPI - Kenya', fontsize=15)
plt.show()

1. There are some wonky National MPI based (blue) outliers here, particularly for partner 218.
2. There are some wonky Regional MPI based (green) outliers here; 340, 246, 505, 529 - most likely from loans associated with incorrectly assigned MPI regions.
3. Note again that the KDE plot in [Section 3](#3) hides how dramatic the differences can be when both scores are present, and that many partners are simply lost under the National based calculation.  Because those partners are missing from the plot, the two distributions appear more closely correlated than they likely are.
4. Comparing green (existing Sub-National region partner based MPI) to red (amended mpi_region Sub-National region partner based MPI) we can see some partners with little change, and some with quite significant changes.  I believe the amended red is an improvement in accuracy.
5. The last two amended calculations are not done at the region partner level; both use the location of the actual atomic loans in the Kiva challenge set.  Thus they are the most accurate representation of the data provided.  Purple takes the mean of the regional MPI number across loans, while mustard takes a dollar-weighted calculation of regional MPI.  These do produce different numbers, with a particularly large difference in the case of partner 491.


I would argue the Field Partner MPI increases in accuracy going from left to right across the methodologies, and that the most accurately calculated Field Partner MPI is the one in mustard, which uses the base loans and updated mpi_region values, and takes a dollar-weighted calculation across all the loans and regions a field partner has disbursed funds to.

<a id=4_2></a>
## 4.2 Kenya FGT Analysis
As noted above in [Section 2](#2), understanding poverty at a more local level is of interest to Kiva in understanding its borrowers' situations.  The Kenya National Bureau of Statistics provided the [2015/16 Integrated Household Budget Survey (KIHBS)](https://www.knbs.or.ke/launch-201516-kenya-integrated-household-budget-survey-kihbs-reports-2/), from which I have [uploaded a dataset](https://www.kaggle.com/doyouevendata/kenya-poverty-metrics-by-district).  It includes food poverty, overall poverty, and a "hardcore" poverty set of tables, at the 47 county level for Kenya.  Let's place our loans into these counties and take a look at what we can see.

This set uses [Foster-Greer-Thorbecke (FGT) indices](https://en.wikipedia.org/wiki/Foster%E2%80%93Greer%E2%80%93Thorbecke_indices), which include:
1. **Headcount Rate** - the proportion of the population that cannot afford the basic basket of goods as measured by a predetermined threshold.
2. **Poverty Gap** - the poverty gap index/depth of poverty provides information on how much poorer the poor people are relative to the poverty line.
3. **Severity of Poverty** - the squared poverty gap index, which takes both of the above into account and gives more weight to those furthest below the poverty line.
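
For reference, the general FGT form is $FGT_\alpha = \frac{1}{N}\sum_{i:\, y_i < z}\left(\frac{z - y_i}{z}\right)^\alpha$, where $z$ is the poverty line, $y_i$ is person $i$'s consumption, and $N$ is the population size; $\alpha = 0, 1, 2$ give the headcount rate, poverty gap, and severity respectively. A minimal sketch of that formula follows. The consumption values and the poverty line are made up, and the result is a fraction rather than the percentage reported in the dataset.

```python
def fgt(consumption, poverty_line, alpha):
    """Foster-Greer-Thorbecke index for a list of per-person consumption values."""
    n = len(consumption)
    # normalized shortfall for each person below the poverty line
    gaps = [(poverty_line - y) / poverty_line for y in consumption if y < poverty_line]
    return sum(g ** alpha for g in gaps) / n

# hypothetical monthly consumption for four people and a hypothetical line of 1000
consumption = [500, 800, 1200, 2000]
z = 1000

headcount = fgt(consumption, z, 0)  # share of people below the line
gap = fgt(consumption, z, 1)        # average normalized shortfall
severity = fgt(consumption, z, 2)   # average squared normalized shortfall
print(headcount, round(gap, 4), round(severity, 4))
```

Squaring the shortfall is what makes the severity index more sensitive to the very poorest people than the plain poverty gap.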

Furthermore, these indices are provided at three levels:
1. **Food Poverty**: Households & individuals whose monthly adult equivalent food consumption expenditure per person is less than KSh 1,954 in rural and peri-urban areas and less than KSh 2,551 in core-urban areas
2. **Overall Poverty**: Households & individuals whose monthly adult equivalent total consumption expenditure per person is less than KSh 3,252 in rural and peri-urban areas and less than KSh 5,995 in core-urban areas
3. **Hardcore or Extreme Poverty**: Households & individuals whose monthly adult equivalent total consumption expenditure per person is less than KSh 1,954 in rural and peri-urban areas and less than KSh 2,551 in core-urban areas
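
The three definitions above can be read as simple threshold checks. Here is a small sketch using the KSh thresholds quoted in the list; the function name, the `'rural'`/`'urban'` labels, and the example household are assumptions for illustration only.

```python
# thresholds in KSh per adult equivalent per month, as quoted above
FOOD_LINE = {'rural': 1954, 'urban': 2551}
OVERALL_LINE = {'rural': 3252, 'urban': 5995}

def poverty_flags(food_exp, total_exp, area):
    """Return which poverty definitions a household meets.

    area is 'rural' (rural and peri-urban) or 'urban' (core-urban).
    """
    return {
        # food poverty compares food consumption to the food line
        'food': food_exp < FOOD_LINE[area],
        # overall poverty compares total consumption to the overall line
        'overall': total_exp < OVERALL_LINE[area],
        # hardcore poverty compares total consumption to the food line
        'hardcore': total_exp < FOOD_LINE[area],
    }

# hypothetical rural household
flags = poverty_flags(food_exp=1800, total_exp=3000, area='rural')
print(flags)
```

This hypothetical household is food poor and overall poor, but not hardcore poor, since its total consumption is still above the food line.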

I think it's best to use Severity of Poverty; let's first take a look at a regional map by this metric.

In [134]:
ISO = 'KEN'
epsg = '4210'

### POLYGONS ###
# read in geospatial data
gdf_ken = gpd.GeoDataFrame.from_file("../input/kenya-geospatial-administrative-regions/ke_district_boundaries.shp")
gdf_ken['DISTNAME'] = gdf_ken['DISTNAME'].str.title()
gdf_ken['DISTNAME'] = np.where(gdf_ken['DISTNAME'].str.contains('Taita Taveta'), 'Taita/Taveta', gdf_ken['DISTNAME'])
gdf_ken['DISTNAME'] = np.where(gdf_ken['DISTNAME'].str.contains('Nairobi'), 'Nairobi City', gdf_ken['DISTNAME'])

gdf_ken = gdf_ken.merge(df_pov_ken_ovrl, how='left', left_on='DISTNAME', right_on='residence_county')

# additional manual assignment
districts = ['Butere/Mumias', 'Lugari']
for d in districts:
    gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'] == d, 'Kakamega', gdf_ken['residence_county'])
 
districts = ['Central Kisii', 'Gucha']
for d in districts:
    gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'] == d, 'Kisii', gdf_ken['residence_county'])    
    
districts = ['Keiyo', 'Marakwet']
for d in districts:
    gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'] == d, 'Elgeyo/Marakwet', gdf_ken['residence_county'])   
    
districts = ['Meru Central', 'Meru North']
for d in districts:
    gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'] == d, 'Meru', gdf_ken['residence_county'])       

districts = ['Rachuonyo', 'Suba']
for d in districts:
    gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'] == d, 'Homa Bay', gdf_ken['residence_county']) 

# Buret - In 2010, the district was split between Kericho County and Bomet County.  just putting this in Kericho to make my life easier
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Buret'), 'Kericho', gdf_ken['residence_county'])
# The Iteso in Kenya, numbering about 578,000, live mainly in Busia county.
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Teso'), 'Busia', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Bondo'), 'Siaya', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Koibatek'), 'Baringo', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Kuria'), 'Migori', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Malindi'), 'Kilifi', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Maragua'), 'Muranga', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Mbeere'), 'Embu', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Meru South'), 'Tharaka-Nithi', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Moyale'), 'Marsabit', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Mt Elgon'), 'Bungoma', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Mwingi'), 'Kitui', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Nyando'), 'Kisumu', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Tharaka'), 'Tharaka-Nithi', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Thika'), 'Kiambu', gdf_ken['residence_county'])
gdf_ken['residence_county'] = np.where(gdf_ken['DISTNAME'].str.contains('Trans Mara'), 'Narok', gdf_ken['residence_county'])

# aggregate shapes into counties
ken_regions = {}

districts = gdf_ken['residence_county'].drop_duplicates()
for d in districts:
    polys = gdf_ken[gdf_ken['residence_county'] == d]['geometry']
    u = cascaded_union(polys)
    ken_regions[d] = u
    
# make a geodataframe for the regions    
s = pd.Series(ken_regions, name='geometry')
s.index.name = 'residence_county'
s.reset_index()

gdf_ken = gpd.GeoDataFrame(s, geometry='geometry')
gdf_ken.crs = {"init":epsg}
gdf_ken.reset_index(level=0, inplace=True)

In [109]:
### POINTS IN POLYGONS
# (our points are still kenya)
# set regions to kenya counties
gdf_regions = gdf_ken

gdf_points = gdf_points.reindex(columns = np.append(gdf_points.columns.values, ['residence_county']))

### POINTS IN POLYGONS
# commented out to speed up reruns of this kernel; the helper file below holds the output of a prior run
#for i in range(0, len(gdf_regions)):
#    print('i is: ' + str(i) + ' at ' + strftime("%Y-%m-%d %H:%M:%S", gmtime()))
#    gdf_points['r_map'] = gdf_points.within(gdf_regions['geometry'][i])
#    gdf_points['residence_county'] = np.where(gdf_points['r_map'], gdf_regions['residence_county'][i], gdf_points['residence_county'])
#gdf_points[['id', 'residence_county']].to_csv(ISO + '_county_loans_pip_output.csv', index = False)

# leverage helper file
gdf_points.drop(columns=['residence_county'], inplace=True)
df_county_helper = pd.read_csv("../input/dydkivahelper/KEN_county_loans_pip_output.csv")
gdf_points = gdf_points.merge(df_county_helper, how='left', on='id')

In [144]:
# add in kenya food poverty county data
gdf_regions = gdf_ken
gdf_regions = gdf_regions.merge(df_pov_ken_food, on='residence_county')

Colormap = plt.get_cmap('Oranges')

regions = gdf_regions['residence_county']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['residence_county'] == r].plot(ax=ax, color=Colormap(gdf_regions[gdf_regions['residence_county'] == r]['Severity of Poverty (%)']/100*2))

for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'residence_county']
    ax.text(s=reg_n, x=point.x-0.2, y=point.y, fontsize='small')
    
ax.set_title('Severity of Poverty, Food, Counties Across Kenya')
plt.show()

Instead of showing the percentage and multiplying it by an arbitrary value to get some color, I'm going to normalize the value for the next graph.  This will result in the lightest color being used for the region with the lowest value (Nyeri), and the darkest color being used for the region with the highest value (Turkana).  I think this will help us see the differences better.
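
The normalization in question is plain min-max scaling. A minimal sketch on made-up severity values (not the actual county figures):

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical severity percentages for four counties
severity = [2.0, 5.0, 8.0, 14.0]
normalized = min_max_normalize(severity)
print(normalized)
```

Because a matplotlib colormap maps 0 to its lightest shade and 1 to its darkest, the lowest county gets the lightest color and the highest gets the darkest, using the full range in between.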

In [145]:
# add in kenya food poverty county data
gdf_regions = gdf_ken
gdf_regions = gdf_regions.merge(df_pov_ken_food, on='residence_county')

max_value = gdf_regions['Severity of Poverty (%)'].max()
min_value = gdf_regions['Severity of Poverty (%)'].min()
gdf_regions['sev_pov_nrml'] = (gdf_regions['Severity of Poverty (%)'] - min_value) / (max_value - min_value)

Colormap = plt.get_cmap('Oranges')

regions = gdf_regions['residence_county']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['residence_county'] == r].plot(ax=ax, color=Colormap(gdf_regions[gdf_regions['residence_county'] == r]['sev_pov_nrml']))

for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'residence_county']
    ax.text(s=reg_n, x=point.x-0.2, y=point.y, fontsize='small')
    
ax.set_title('Severity of Poverty, Food, Counties Across Kenya')
plt.show()

In [146]:
# add in kenya overall poverty county data
gdf_regions = gdf_ken
gdf_regions = gdf_regions.merge(df_pov_ken_ovrl, on='residence_county')

max_value = gdf_regions['Severity of Poverty (%)'].max()
min_value = gdf_regions['Severity of Poverty (%)'].min()
gdf_regions['sev_pov_nrml'] = (gdf_regions['Severity of Poverty (%)'] - min_value) / (max_value - min_value)

Colormap = plt.get_cmap('Blues')

regions = gdf_regions['residence_county']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['residence_county'] == r].plot(ax=ax, color=Colormap(gdf_regions[gdf_regions['residence_county'] == r]['sev_pov_nrml']))

for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'residence_county']
    ax.text(s=reg_n, x=point.x-0.2, y=point.y, fontsize='small')
    
ax.set_title('Severity of Poverty, Overall, Counties Across Kenya')
plt.show()

In [148]:
# add in kenya hardcore poverty county data
gdf_regions = gdf_ken
gdf_regions = gdf_regions.merge(df_pov_ken_hrdc, on='residence_county')

max_value = gdf_regions['Severity of Poverty (%)'].max()
min_value = gdf_regions['Severity of Poverty (%)'].min()
gdf_regions['sev_pov_nrml'] = (gdf_regions['Severity of Poverty (%)'] - min_value) / (max_value - min_value)

Colormap = plt.get_cmap('Purples')

regions = gdf_regions['residence_county']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['residence_county'] == r].plot(ax=ax, color=Colormap(gdf_regions[gdf_regions['residence_county'] == r]['sev_pov_nrml']))

for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'residence_county']
    ax.text(s=reg_n, x=point.x-0.2, y=point.y, fontsize='small')
    

ax.set_title('Severity of Poverty, Extreme, Counties Across Kenya')
plt.show()

Let's stick with overall for now, and plot our loans on it.

In [149]:
# add in kenya overall poverty county data
gdf_regions = gdf_ken
gdf_regions = gdf_regions.merge(df_pov_ken_ovrl, on='residence_county')

max_value = gdf_regions['Severity of Poverty (%)'].max()
min_value = gdf_regions['Severity of Poverty (%)'].min()
gdf_regions['sev_pov_nrml'] = (gdf_regions['Severity of Poverty (%)'] - min_value) / (max_value - min_value)

Colormap = plt.get_cmap('Blues')

regions = gdf_regions['residence_county']

fig, ax = plt.subplots(1, figsize=(12,12))

for r in regions:
    gdf_regions[gdf_regions['residence_county'] == r].plot(ax=ax, color=Colormap(gdf_regions[gdf_regions['residence_county'] == r]['sev_pov_nrml']))

gdf_points[(gdf_points['lng'] > -60) & (gdf_points['lng'] < 44)].plot(ax=ax, markersize=10, color='red', label='loan')
    
for i, point in gdf_regions.centroid.items():
    reg_n = gdf_regions.loc[i, 'residence_county']
    ax.text(s=reg_n, x=point.x-0.2, y=point.y, fontsize='small')
    
ax.set_title('Severity of Poverty, Overall, Counties Across Kenya - With Loan Locations')
plt.show()

Wow - that's really quite telling!  Overall the Kiva field partners in Kenya don't appear to be delving into the areas experiencing the most poverty.

0. add kenya graphs and highlights from the actual report
1. add philippines consumption
2. add in cato data
3. do something with findex?
4. do something with stanford data?  http://sustain.stanford.edu/predicting-poverty/