# Introduction

Hi Kagglers,

This is a supplementary notebook to the Kiva Data Science for Good challenge. The results of this notebook are used in the [Kiva Philippines Poverty Score notebook](https://www.kaggle.com/rossrco/kiva-philippines-poverty-score).

In this notebook, we'll match the free-text `region` field of the `kiva_loans.csv` dataset to an administrative region of the Philippines.

# Data Transformation Description

The `kiva_loans.csv` dataset contains a `region` column that holds a free-text description of the borrower's location. The problem is that its format is inconsistent:

* Sometimes the column represents the name of a city followed by the name of its corresponding province
* Sometimes the column represents the name of a major city in the region and has no reference to the province
* Sometimes the column represents the name of small cities or villages that are not administrative centers and are hard to locate

From the outset, we'll tackle this challenge only for the Philippines, as it is the country of concern in the [main notebook](https://www.kaggle.com/rossrco/kiva-philippines-poverty-score). With that in mind, it is useful to outline the hierarchy of Philippine administrative divisions, from largest to smallest:

1. Island Group
2. Region
3. Province
4. City
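
To make the hierarchy concrete, here is a minimal sketch of a lookup table with one row per city. The `region`, `province` and `city` column names mirror the ones used later in this notebook; the `island_group` column, the `region_of` helper and the two sample rows are illustrative assumptions and are not taken from the actual dataset.

```python
import pandas as pd

# Illustrative rows only; the real ph_regions.csv may differ in spelling and columns.
hierarchy = pd.DataFrame(
    [
        ("visayas", "central visayas", "cebu", "cebu"),
        ("luzon", "cordillera administrative region", "benguet", "baguio"),
    ],
    columns=["island_group", "region", "province", "city"],
)

def region_of(value, level):
    """Return the region for an exact match on one hierarchy level, or None."""
    rows = hierarchy.loc[hierarchy[level] == value, "region"]
    return rows.iloc[0] if not rows.empty else None

cebu_region = region_of("cebu", "province")
baguio_region = region_of("baguio", "city")
missing_region = region_of("atlantis", "city")
```

Because every row carries all levels, a match on either the province or the city is enough to recover the region.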

In the most common case, the `region` column of the `kiva_loans.csv` dataset contains patterns of the type 'city name, province'. In some rare cases, the `region` column contains only 'city name'. We'll use these patterns to match either the city or the province to a region. To do that, we'll utilize a small dataset of island groups, regions, provinces and cities based on [this Wikipedia article](https://en.wikipedia.org/wiki/Administrative_divisions_of_the_Philippines).
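
The two-pass idea can be sketched with `difflib.get_close_matches` from the standard library. The dictionaries, the `fuzzy_lookup` and `locate` helpers, and the stripping and lowercasing of the input are assumptions made for this illustration; the notebook's own code works on the `ph_regions.csv` table instead.

```python
from difflib import get_close_matches

# Hypothetical lookup tables standing in for the administrative-regions dataset.
PROVINCE_TO_REGION = {
    "cebu": "central visayas",
    "bohol": "central visayas",
    "iloilo": "western visayas",
}
CITY_TO_REGION = {
    "tagbilaran": "central visayas",
    "bacolod": "western visayas",
}

def fuzzy_lookup(name, table):
    """Return the value of the closest key in `table`, or None if nothing is close."""
    matches = get_close_matches(name, list(table), n=1)
    return table[matches[0]] if matches else None

def locate(location):
    """Try the text after the last comma as a province, then the whole string as a city."""
    if "," in location:
        province = location.split(",")[-1].strip().lower()
        region = fuzzy_lookup(province, PROVINCE_TO_REGION)
        if region is not None:
            return region
    return fuzzy_lookup(location.strip().lower(), CITY_TO_REGION)

by_province = locate("Tubigon, Bohol")
by_city = locate("Tagbilaran City")
unmatched = locate("Xyz")
```

Fuzzy matching tolerates small spelling differences and suffixes such as "City", at the cost of occasional false positives for short names.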

Finally, we should mention that these heuristics mapped about 150 000 of the 160 000 loan locations (roughly 94%). The remaining 10 000 could be matched in the future through other methods (e.g. Google Maps API).
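
A quick way to check coverage after matching is to count the rows that did not receive the `'no_match'` sentinel. The toy series below is made up for illustration; only the sentinel value and the approximate totals come from this notebook.

```python
import pandas as pd

# Toy data: 'no_match' is the sentinel returned when no region could be found.
regions = pd.Series(
    ["central visayas", "no_match", "western visayas", "central visayas"]
)

matched = regions != "no_match"
match_rate = matched.mean()  # share of rows with a matched region
print(f"{matched.sum()} of {len(regions)} matched ({match_rate:.1%})")

# The same arithmetic applied to the approximate figures quoted above.
reported_rate = 150_000 / 160_000
```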

In [4]:
#numeric
import numpy as np
import pandas as pd

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import folium

plt.style.use('bmh')
%matplotlib inline

#system
import os
import re

#Pandas warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
loans = pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv')
# Take an explicit copy so later column assignments do not act on a view of `loans`.
phil_loans = loans[loans.country == 'Philippines'].copy()
geonames_phil = pd.read_csv('../input/administrative-regions-in-the-philippines/ph_regions.csv')

from difflib import get_close_matches

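def match_region(loc_string, match_entity = 'province', split = True):
    """Fuzzy-match a location string to a region; return 'no_match' on failure."""
    # With split=True, only the text after the last comma (the province) is matched.
    if split:
        region = loc_string.split(',')[-1]
    else:
        region = loc_string
    
    # get_close_matches returns candidates best-first (default similarity cutoff 0.6).
    matches = get_close_matches(region, geonames_phil[match_entity].unique().tolist())
    
    if not matches:
        return 'no_match'
    else:
        # Look up the region of the best-matching province or city.
        return geonames_phil.region[geonames_phil[match_entity] == matches[0]].iloc[0]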
    
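# Missing locations become empty strings so the string operations below do not fail.
phil_loans['region'] = phil_loans['region'].fillna('')
# Keep the raw text as `location`; `region` will hold the matched administrative region.
phil_loans = phil_loans.rename(columns = {'region' : 'location'})
phil_loans['region'] = [match_region(loc_string) for loc_string in phil_loans.location]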

# Second pass: for unmatched rows, drop the word 'city' and match the rest against city names.
city_drop = re.compile(r'(.*)(city)', re.I)
no_match = phil_loans.region == 'no_match'
phil_loans.loc[no_match, 'location'] = [
    city_drop.match(l).group(1).lower() if city_drop.match(l) else l
    for l in phil_loans.loc[no_match, 'location']
]

phil_loans.loc[no_match, 'region'] = [
    match_region(l, 'city', False) for l in phil_loans.loc[no_match, 'location']
]

# Manual fix for one frequent location that neither pass resolves.
phil_loans.loc[phil_loans.location == 'Sogod Cebu', 'region'] = geonames_phil.region[geonames_phil.city == 'cebu'].iloc[0]

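# Keep loans with a known borrower gender and a successfully matched region.
phil_loans_extract = phil_loans[(phil_loans.borrower_genders.notna()) & (phil_loans.region != 'no_match')].copy()

# Encode gender as a binary flag; values outside the mapping become NaN.
phil_loans_extract['borrower_genders'] = phil_loans_extract['borrower_genders']\
.map({'female' : 1,\
      'male' : 0})

phil_loans_extract = phil_loans_extract.rename(columns = {'borrower_genders' : 'house_head_sex_f'})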



In [6]:
phil_loans_extract.to_csv('kiva_loans_ph_transofrmed.csv')