## Creating a granular measure of poverty globally

Kiva's objective is to assess the poverty of borrowers at the most granular level possible, so that new funding sources can be targeted at the right borrowers.

### Currently, a localised poverty index does not exist
The Multidimensional Poverty Index (MPI) is currently used for this task, but it only goes down to a sub-national level, and even at that level the scores are hard to distinguish from the national figure.
Elliot Collins states that "...we can see plenty of loan themes where the two diverge, but this seems like an improvement in granularity, rather than solving any systematic bias. It also looks like the relationship to the field partner-level rural percentage is broadly the same, suggesting to me that the distinction between Rural MPI and Urban MPI from above is largely captured by the region-level disaggregation (i.e. when distinguishing between rural and urban Kiva borrowers, we are usually talking about distinct regions with corresponding differences in MPI score)."

The objective of our project is to create a more localised poverty index that also has a more accurate distinction between rural and urban borrowers in order to enhance the current MPI.

### The dataset
The Global Assessment Report on Disaster Risk Reduction 2015 (GAR15) is the external dataset used to create this more granular index.

The data is supplied in geospatial Esri Shapefile format. For every country with a Kiva loan, GAR15 data was downloaded from the Humanitarian Data Exchange (HDX) website (https://data.humdata.org/). HDX is an open platform for sharing data, launched in July 2014, whose goal is to make humanitarian data easy to find and use for analysis.

The GAR15 global exposure database is based on a top-down approach: national-level statistics (socio-economic data, building types and capital stock) are transposed onto a 5x5 km grid, using the geographic distribution of population and gross domestic product (GDP) as proxies.
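
To make the top-down idea concrete, here is a minimal sketch of proportional downscaling, in which a national total is spread across grid cells in proportion to a proxy such as population. The figures and the column names `cell_id`, `population` and `capital_stock` are invented for illustration, and GAR15's actual procedure (which also draws on GDP and building-type information) is more involved than this.

```python
import pandas as pd

# Hypothetical national total (million US$) and three toy grid cells
national_capital_stock = 900.0
cells = pd.DataFrame({
    "cell_id": [1, 2, 3],
    "population": [1000, 3000, 5000],
})

# Each cell receives a share of the national total equal to its share of the proxy
share = cells["population"] / cells["population"].sum()
cells["capital_stock"] = national_capital_stock * share

print(cells)
```

By construction the cell values add back up to the national total, which is the defining property of this kind of disaggregation.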

The image below shows the countries where GAR15 data has been downloaded for this analysis. The green points represent the GAR15 data whilst the red points represent the Kiva loan locations:
![data_extent_small.PNG](https://78.media.tumblr.com/f955b0c142aedeaf94233b6536e2dcc4/tumblr_p8sm8u0hFO1xtbpnno9_1280.png)

Since the data contains socio-economic information (i.e. size of lower/middle/upper classes) and a breakdown of the types of capital stock that would be related to poverty (i.e. schools and hospitals) we have repurposed GAR15 to be used as a poverty index.

The image below shows a zoomed version of the GAR15 data focussing on the city of Bamako in Mali, an area where multiple Kiva loans have been granted. By using the GAR15 data it is possible to assign a new poverty index to the loans based on geospatial nearest neighbour analysis:  
![bamako_small.PNG](https://78.media.tumblr.com/6108115a84668d8705387e3999f0a6f7/tumblr_p8sm8u0hFO1xtbpnno6_400.png)

The dataset contains impressive specificity at a global scale, with 4,573,567 rows of data. Each 5x5 km grid cell contains 54 attributes describing the population and capital stock within it, covering the health, education, employment and residential sectors.

#### Measures within the GAR15 data:
- **Capital Stock** - the value of the asset in million US$. The economic value of each building class in a cell is assessed by disaggregating the (national) Produced Capital to grid level, using sub-national values of economic activity as a proxy. The result is the global distribution of the economic value of urban and rural produced capital by construction class.
- **Population** - the number of residents of the buildings in the grid cell.


#### Sources:
A detailed description of the sources that were used for the GAR15 dataset is available here: https://data.humdata.org/dataset/1c9cf1eb-c20a-4a06-8309-9416464af746/resource/bf90aaad-b438-4570-8550-1cd6314599d5/download/please-read-metadata-exposure2015.pdf


#### Assumptions:
- Areas with a high value of educational and healthcare assets are more prosperous
- Areas where a higher proportion of the population is middle class are more prosperous
- The public/private sector split is not crucial and its ramifications are unclear, so these numbers are combined for the purposes of calculating the index

#### Limitations of the data to be taken into account:
- Asset values only go so far as a proxy, compared with more typical (but higher-level) metrics such as infant mortality or immunisations per 1,000 population
- Just because an area has high health assets doesn't mean all people have access to them


## Methodology

#### 1. Read in GAR15 dataset

In [None]:
import numpy as np 
import pandas as pd
import os
fullgar = pd.read_csv("../input/gar15-full/gar15_full.csv")

#### 2. Calculate the Urban to Rural Population
The Human Development Index's calculation methodology was used to create the localised index (this is detailed in Step 5). 

The methodology for calculating the index uses the min and max values of each attribute to normalise them and to help define what good looks like. 

Since GAR15 contains data from across the globe, a calculation using the min and max values would be too susceptible to outliers. 
Furthermore, an objective of this index is to compensate for the regular MPI's poor differentiation between rural and urban populations. 

We will use population density and the ratio of rural/urban inhabitants to find natural clusters within the data, so that borrowers are compared to people living in similar conditions.

In this step we calculate the urban to rural population ratio for each 5x5km cell in the data to aid the clustering. The ratio is computed as the urban population divided by the total (urban plus rural) population, so it ranges from 0 (fully rural) to 1 (fully urban).

In [None]:
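# share of each 5x5 cell's population that is urban (0 = fully rural, 1 = fully urban)
# cells with no residents give 0/0 = NaN, which k-means cannot handle, so they are set to 0
fullgar['u2rratio'] = (fullgar['TOT_PU']/(fullgar['TOT_PU'] + fullgar['TOT_PR'])).fillna(0)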
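
As a quick illustration of the ratio, the sketch below applies the same formula to three made-up cells. Only the column names `TOT_PU` and `TOT_PR` come from the dataset; the values and the `u2rratio_filled` column are invented. The third cell has no residents at all, so the division is 0/0 and pandas returns NaN, which has to be handled before clustering because k-means cannot work with missing values (treating it as 0 is an assumption).

```python
import pandas as pd

toy = pd.DataFrame({
    "TOT_PU": [800.0, 0.0, 0.0],    # urban residents
    "TOT_PR": [200.0, 500.0, 0.0],  # rural residents
})

# Urban share of the total population in each cell
toy["u2rratio"] = toy["TOT_PU"] / (toy["TOT_PU"] + toy["TOT_PR"])

# Assumption: an empty cell is treated as fully rural (ratio 0)
toy["u2rratio_filled"] = toy["u2rratio"].fillna(0)

print(toy)
```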

#### 3. Cluster the cells based on Urban to Rural ratio and population
We will use k-means clustering to segment the dataset into 5 clusters based on the urban to rural population ratio and the total population of each cell.

In [None]:
from sklearn.cluster import KMeans
# fit k-means with 5 clusters on the urban share and total population of each cell
kmeans = KMeans(n_clusters=5)
kmeans.fit(fullgar[['u2rratio','TOT_POB']])
# labels_ holds the cluster assigned to each row during fitting
fullgar['clusters'] = kmeans.labels_ 
# list the cluster labels that were assigned
fullgar.clusters.unique()
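
One way to check what the clusters represent is to summarise each one. The sketch below runs the same k-means step on a small synthetic table and prints cluster sizes and means; the data, the two-cluster setting, `n_init=10` and `random_state=0` are illustrative assumptions, not the configuration used for the results.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic cells: four sparsely populated rural cells, four dense urban ones
toy = pd.DataFrame({
    "u2rratio": [0.0, 0.1, 0.05, 0.0, 0.9, 1.0, 0.95, 0.85],
    "TOT_POB": [120, 300, 250, 80, 52000, 61000, 58000, 49000],
})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["clusters"] = kmeans.fit_predict(toy[["u2rratio", "TOT_POB"]])

# Size and average profile of each cluster
summary = toy.groupby("clusters").agg(
    cells=("TOT_POB", "size"),
    mean_population=("TOT_POB", "mean"),
    mean_u2rratio=("u2rratio", "mean"),
)
print(summary)
```

Because `TOT_POB` spans a far wider range than `u2rratio` (which lies between 0 and 1), population is likely to dominate the distance calculation; a summary like this makes that visible.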

#### 4. Combine the attributes to be used for the index

A richer society is likely to have more capital stock of the types of assets that improve quality of life and it is these assets that GAR15 is disaggregated by (i.e. health, employment, education). 

In order to normalise the data, capital stock for each sector is calculated per person. Therefore, we can assume that someone is worse off if their amount of educational capital (i.e. schools) is lower than the average for that cluster.
The calculation for the attributes is straightforward: (value of capital stock divided by number of people). 
The 'middleclass' attribute is an exception. It is based on the assumption that the larger the middle class, the more well off a society is. Therefore, this attribute is the percentage of the total population that is within the middle class.


In [None]:
# Column suffixes: _CU/_CR = capital stock (urban/rural), _PU/_PR = population (urban/rural)
# Cells with no population give NaN or inf when dividing; both are set to 0
# NOTE: BED_PRV_CR, BED_PRV_PR, EDU_PRV_CR and EDU_PRV_PR are assumed to exist, following the naming pattern of the other columns
# Health Capital Stock Per Person
fullgar['healthpp'] = ((fullgar['BED_PRV_CU'] + fullgar['BED_PUB_CU'] + fullgar['BED_PRV_CR'] + fullgar['BED_PUB_CR'])/(fullgar['BED_PRV_PU'] + fullgar['BED_PUB_PU'] + fullgar['BED_PRV_PR'] + fullgar['BED_PUB_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# Education Capital Stock Per Person
fullgar['edupp'] = ((fullgar['EDU_PRV_CU'] + fullgar['EDU_PUB_CU'] + fullgar['EDU_PRV_CR'] + fullgar['EDU_PUB_CR'])/(fullgar['EDU_PRV_PU'] + fullgar['EDU_PUB_PU'] + fullgar['EDU_PRV_PR'] + fullgar['EDU_PUB_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# Employment: Agriculture Capital Stock Per Person
fullgar['empagripp'] = ((fullgar['EMP_AGR_CU'] + fullgar['EMP_AGR_CR'])/ (fullgar['EMP_AGR_PU'] + fullgar['EMP_AGR_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# Employment: Government Capital Stock Per Person
fullgar['empgovpp'] = ((fullgar['EMP_GOV_CU'] + fullgar['EMP_GOV_CR'])/ (fullgar['EMP_GOV_PU'] + fullgar['EMP_GOV_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# Employment: Industry Capital Stock Per Person
fullgar['empindpp'] = ((fullgar['EMP_IND_CU'] + fullgar['EMP_IND_CR'])/ (fullgar['EMP_IND_PU'] + fullgar['EMP_IND_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# Employment: Service Capital Stock Per Person
fullgar['empserpp'] = ((fullgar['EMP_SER_CU'] + fullgar['EMP_SER_CR'])/ (fullgar['EMP_SER_PU'] + fullgar['EMP_SER_PR'])).replace([np.inf, -np.inf], np.nan).fillna(0)
# The percentage of the population that is middle class (middle-high plus middle-low income groups)
fullgar['middleclass'] = ((fullgar['IC_MHG_PR'] + fullgar['IC_MHG_PU'] + fullgar['IC_MLW_PR'] + fullgar['IC_MLW_PU']) / (fullgar['IC_MHG_PR'] + fullgar['IC_MHG_PU'] + fullgar['IC_LOW_PU'] + fullgar['IC_LOW_PR'] + fullgar['IC_HIGH_PU'] + fullgar['IC_HIGH_PR'] + fullgar['IC_MLW_PR'] + fullgar['IC_MLW_PU'])).replace(np.nan, 0)
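
Dividing by a population of zero needs some care: in pandas, 0/0 gives NaN, but a positive value divided by 0 gives `inf`, which `replace(np.nan, 0)` on its own does not catch. A minimal sketch of a helper that handles both cases is shown below; `per_person` is a hypothetical name, and mapping both cases to 0 is an assumption rather than the only reasonable choice.

```python
import numpy as np
import pandas as pd

def per_person(capital, population):
    """Capital stock per resident; cells with no residents get 0."""
    ratio = capital / population
    # x/0 -> inf and 0/0 -> NaN; treat both as "no value" and set to 0
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

capital = pd.Series([10.0, 0.0, 5.0])
population = pd.Series([200.0, 0.0, 0.0])

result = per_person(capital, population)
print(result.tolist())
```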

#### 5. Calculate the index for each sector and each cluster

The Human Development Index (HDI) was referenced for creating the calculations for the index (http://hdr.undp.org/sites/default/files/hdr2016_technical_notes.pdf)

"Minimum and maximum values (goalposts) are set in order to transform the indicators expressed in different units into indices on a scale of 0 to 1. These goalposts act as the “natural zeros” and “aspirational targets,” respectively, from which component indicators are standardized"

The calculation to create this is: 
(actual value - min value)/(max value - min value). 

The dimension indices are then aggregated to produce the index by taking their arithmetic mean, i.e. summing the indices and dividing by the number of dimensions. The HDI itself uses the geometric mean, but because missing values in GAR15 are set to 0 here, and a single zero dimension forces a geometric mean to zero, the arithmetic mean is more appropriate.
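
The normalisation and aggregation can be illustrated on a toy example. The sketch below uses invented values and hypothetical column names (`health`, `education`); it scales each column to the 0 to 1 range and then compares the arithmetic mean with the geometric mean, showing how a single zero makes the geometric mean collapse.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "health": [0.0, 5.0, 10.0],
    "education": [6.0, 2.0, 4.0],
})

# (actual value - min value) / (max value - min value), column by column
scaled = (toy - toy.min()) / (toy.max() - toy.min())

# Arithmetic mean across the two dimensions
arithmetic = scaled.mean(axis=1)
# Geometric mean of two dimensions: square root of their product
geometric = np.sqrt(scaled["health"] * scaled["education"])

print(pd.DataFrame({"arithmetic": arithmetic, "geometric": geometric}))
```

The first two rows each have one dimension equal to 0, so their geometric mean is 0 even though the other dimension is not, while the arithmetic mean still reflects it.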

In [None]:
def minmax(series):
    '''
    (actual value - min value)/(max value - min value), with 0/0 set to 0
    '''
    return ((series - series.min())/(series.max() - series.min())).replace(np.nan,0)

def loop(df):
    '''
    looping through the clusters, normalising each attribute within its own
    cluster and averaging the normalised attributes into the index
    '''
    attributes = ['healthpp','edupp','empagripp','empgovpp','empindpp','empserpp','middleclass']
    df_list = []
    # iterate over every cluster label present in the data
    for i in np.sort(df['clusters'].unique()):
        # copy the cluster's rows so that min and max are taken within the cluster
        cluster_df = df.loc[df['clusters'] == i].copy()
        for col in attributes:
            cluster_df[col + 'ind'] = minmax(cluster_df[col])

        #index created off arithmetic mean not geometric mean
        cluster_df['lmpi'] = cluster_df[[col + 'ind' for col in attributes]].mean(axis=1)

        df_list.append(cluster_df)

    # Concatenating the frames
    final_df = pd.concat(df_list)

    return final_df

output = loop(fullgar)
output.head()
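
An alternative way to express per-cluster normalisation is pandas' `groupby(...).transform`, which avoids splitting and re-concatenating the frame. This is a sketch of that swapped-in technique on toy data, not the code used for the results; the function name `minmax_by_cluster` and the values are hypothetical.

```python
import pandas as pd

def minmax_by_cluster(df, column, cluster_col="clusters"):
    """Scale `column` to 0-1 using the min and max of each row's own cluster."""
    grouped = df.groupby(cluster_col)[column]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    # A cluster where every value is identical gives 0/0; treat that as 0
    return ((df[column] - lo) / (hi - lo)).fillna(0)

toy = pd.DataFrame({
    "clusters": [0, 0, 0, 1, 1],
    "edupp": [1.0, 2.0, 3.0, 10.0, 30.0],
})
toy["eduppind"] = minmax_by_cluster(toy, "edupp")
print(toy)
```

Each cluster gets its own goalposts here: the value 3.0 scores 1 in cluster 0 and 10.0 scores 0 in cluster 1, even though 10.0 is the larger raw value.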

#### Output the index for each 5x5km square to produce a poverty index

In [None]:
# write the cell ID, country code and index for every cell to lmpi.csv
output[['ID_5X', 'ISO3','lmpi']].to_csv('lmpi.csv')

#### 6. Geospatial output and analysis

Once the new GAR15 Poverty Index was calculated, the values were joined back to the original Esri Shapefile format. Due to file size restrictions with the Esri Shapefile format, the final spatial data output has been exported as an OGC GeoPackage. 

The OGC GeoPackage has been uploaded as 'kiva_final_output.gpkg' and can be visualised in many standard GIS packages such as QGIS. The GeoPackage contains two output layers:
1. gar15_plus: Source GAR15 dataset with the new GAR15 Poverty Index attribution.
2. loan_gar15_plus: Kiva loans with nearest neighbour geospatial analysis applied to define the closest GAR15 feature. It is then possible to directly compare the Kiva loan attribution to the GAR15 data.  

From these two datasets it will be possible to gain new insights into current and future Kiva loans through the analysis of the new GAR15 Poverty Index.
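
The nearest neighbour join can be sketched in a few lines, independently of whichever GIS tool performs it. The example below matches each loan to the closest grid-cell centroid by great-circle (haversine) distance using a brute-force search. The coordinates, the `lmpi` values and the function name `nearest_cell` are invented for illustration, and a spatial index would be needed at the scale of the full dataset.

```python
import numpy as np

def nearest_cell(loan_lat, loan_lon, cell_lat, cell_lon):
    """Return, for each loan, the index of the closest cell and the distance in km."""
    earth_radius_km = 6371.0
    lat1 = np.radians(np.asarray(loan_lat))[:, None]
    lon1 = np.radians(np.asarray(loan_lon))[:, None]
    lat2 = np.radians(np.asarray(cell_lat))[None, :]
    lon2 = np.radians(np.asarray(cell_lon))[None, :]
    # Haversine formula, evaluated for every loan/cell pair
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * earth_radius_km * np.arcsin(np.sqrt(a))
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]

# Made-up cell centroids near Bamako with made-up index values
cell_lat = np.array([12.60, 12.65, 12.70])
cell_lon = np.array([-8.00, -7.95, -8.05])
cell_lmpi = np.array([0.21, 0.34, 0.18])

# Two made-up loan locations
idx, km = nearest_cell([12.64, 12.71], [-7.96, -8.04], cell_lat, cell_lon)
loan_lmpi = cell_lmpi[idx]
print(idx, loan_lmpi, km.round(2))
```

Returning the distance alongside the match makes it possible to flag loans whose nearest cell is implausibly far away.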

The visualisations below show some initial analysis of the output data; further images can be found in the attached image zip folder. The poverty index values developed so far can be improved with additional research and testing, but this output gives an example of the kind of analysis that can be done using available open source data. Given additional time, it would be possible to extract further insights from the data using poverty indexes such as these.

**Education Index**

The image below shows the GAR15 Education Index heat map distribution over the Kiva loan regions. From this index it is possible to gain insights into the education capital around a specific area.
![eduppind_map.png](https://78.media.tumblr.com/ec930ea0f1a4178b903fcecac43d8b45/tumblr_p8sm8u0hFO1xtbpnno7_1280.png)

**Loan Analysis with GAR15 Poverty Index Values**

The image below shows the attribution of an individual Kiva loan in Bamako, where the Kiva loan data has been joined to the GAR15 Poverty Index data through geospatial nearest neighbour analysis. From this data it is possible to gain a better idea of the environment where the loan has been granted at a much more granular level (5km x 5km).
![bamako_loan_analysis.PNG](https://78.media.tumblr.com/33cbe6febd9aea7222c6cce0e856cb18/tumblr_p8sm8u0hFO1xtbpnno10_1280.png)


#### 7. Conclusion

Our analysis has shown that open source geospatial data is available and that it can give Kiva a better idea of the geodemographics of a population at a much more granular level than is currently used. By bringing together datasets from multiple sources it is possible to gain insights that were not previously available. As a result we have created a more localised poverty index that also has a more accurate distinction between rural and urban borrowers in order to enhance the current MPI.

As mentioned above, it should be highlighted that the poverty indexes we have calculated can be developed further. However, this analysis has shown that there is a definite opportunity to further utilise the power of geospatial data to enable Kiva to better understand loan geographies.

Our objective in this work was to show that:
a) Open source geospatial data is available for microgrant analysis.
b) Combining data science techniques with geospatial analysis can prove to be a very powerful tool and is an area which is often overlooked by the data science/BI community. 

Thanks for your time. We hope you find the ideas raised in our analysis useful.