## Data Preprocessing

Uses `loans.csv` from http://s3.kiva.org/snapshots/kiva_ds_csv.zip (linked from http://build.kiva.org).

Only loans posted in 2010 or later are kept.

**Features**

- `loan_amount` 
- `status`: only loans that are `funded` or `expired` are kept
- `sector`
- `country_name`
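- `borrower_genders`: replaced in the output by `gender_group` (single borrowers keep their gender; groups become `female_group`, `male_group`, or `mixed_group`)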

The final dataframe is written to `../loans_mini.csv`.

In [1]:
import pandas as pd
from collections import Counter

In [2]:
LOANS = pd.read_csv('../loans.csv', parse_dates=['posted_time'])

In [3]:
loans = LOANS[LOANS['posted_time'].dt.year >= 2010].copy()
loans.sort_values('posted_time', inplace=True)
print('first loan: {}'.format(loans['posted_time'].min()))
loans = loans[['borrower_genders', 'loan_amount', 'sector', 'country_name', 'status']]  # keep only the columns of interest
loans = loans[loans['status'].isin(['funded', 'expired'])]  # keep only funded or expired loans
loans = loans.dropna(subset=['borrower_genders'])  # not interested in rows where gender is missing
loans = loans.reset_index(drop=True)

first loan: 2010-01-01 00:26:40


In [4]:
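def genders_groups(x):
    """Map a comma-separated gender string to a single-borrower or group label."""
    x = x.split(', ')
    if len(x) == 1:
        # single borrower: keep the gender as-is
        return x[0]
    else:
        genders = Counter(x).keys()  # distinct genders present in the group
        if 'female' in genders and 'male' not in genders:
            return 'female_group'
        elif 'female' in genders and 'male' in genders:
            return 'mixed_group'
        else:
            return 'male_group'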
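
To make the mapping concrete, here is a small standalone sketch of the same rules applied to a few made-up input strings. The function name `gender_group_sketch` and the sample strings are hypothetical and exist only for this illustration; it uses a `set` in place of `Counter`, which should behave the same for this purpose.

```python
def gender_group_sketch(genders):
    """Hypothetical set-based variant of genders_groups, for illustration only."""
    members = genders.split(', ')
    if len(members) == 1:
        # single borrower: the gender string is returned unchanged
        return members[0]
    kinds = set(members)  # distinct genders present in the group
    if 'female' in kinds and 'male' not in kinds:
        return 'female_group'
    if 'female' in kinds and 'male' in kinds:
        return 'mixed_group'
    # anything else (no 'female' present) falls through to 'male_group'
    return 'male_group'


# Made-up sample values in the same "a, b, c" format as borrower_genders.
samples = [
    'female',
    'male',
    'female, female',
    'male, male, male',
    'female, male, female',
]
results = [gender_group_sketch(s) for s in samples]
for sample, label in zip(samples, results):
    print('{!r} -> {}'.format(sample, label))
```

Note that the last branch is a catch-all: any group without a `'female'` entry is labelled `male_group`, even if it contains an unexpected value.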

In [5]:
loans['gender_group'] = loans['borrower_genders'].map(genders_groups)
loans.drop('borrower_genders', axis=1, inplace=True)
loans.to_csv('../loans_mini.csv', index=False)
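
A quick sanity check on the written file might look like the sketch below. It runs on a toy dataframe with made-up rows and an in-memory buffer instead of `../loans_mini.csv`, and it assumes that single borrowers are always recorded as `female` or `male`; the names `toy`, `expected_columns`, and `allowed_groups` are hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the processed dataframe; all values are made up.
toy = pd.DataFrame({
    'loan_amount': [500, 1200, 300],
    'sector': ['Food', 'Retail', 'Agriculture'],
    'country_name': ['Kenya', 'Peru', 'Cambodia'],
    'status': ['funded', 'expired', 'funded'],
    'gender_group': ['female', 'mixed_group', 'male_group'],
})

# Round-trip through CSV, as the notebook does with to_csv(..., index=False).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Column order that results from the selection and drop steps in the notebook.
expected_columns = ['loan_amount', 'sector', 'country_name', 'status', 'gender_group']
# Assumption: single borrowers are only ever 'female' or 'male'.
allowed_groups = {'female', 'male', 'female_group', 'male_group', 'mixed_group'}

columns_ok = list(reloaded.columns) == expected_columns
groups_ok = set(reloaded['gender_group']).issubset(allowed_groups)
status_ok = set(reloaded['status']).issubset({'funded', 'expired'})
print(columns_ok, groups_ok, status_ok)
```

For the real file, the same checks could be run on `pd.read_csv('../loans_mini.csv')` in place of the in-memory round trip.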