**Kiva Loan Applicants Poverty Prediction**

Goal: Kiva's objective is to assess the poverty level of borrowers at the most granular level possible.

Kiva.org is an online crowdfunding platform that extends financial services to poor and financially excluded people around the world. Kiva lenders have provided over \$1 billion in loans to over 2 million people. Kiva has released a dataset with information on loans made in past years. The objective of this project is to build more localized models that estimate the poverty levels of residents in the regions where Kiva has active loans, based on shared economic and demographic characteristics, using the loan data provided by Kiva together with outside data sources.

Microlending is one way of alleviating poverty. Being able to predict an individual applicant’s wealth level at as fine a granularity as possible helps institutions, lenders, and policy makers decide how to alleviate poverty more efficiently. Resources could be directed more actively to applicants at a certain wealth level to encourage entrepreneurship. In addition, knowing an individual’s poverty level increases transparency: lenders can understand a borrower's risk of default without being biased by factors such as gender, sector, and region. More lenders may then be willing to provide loans, aggregating more capital for microlending.

Kiva loan requests are posted for a maximum of 30 days. If a loan is not funded within that period, it expires and the Field Partner does not receive the funds for it. Kiva lenders have shown a clear preference for women and for specific groups: the expiration rate for men's loans is nearly three times higher than the rate for women's loans. While it is not completely surprising that Kiva users prefer lending to women over men, the magnitude of this gap is large. A similar trend emerges when looking at expired loans by continent. Africa and Asia have very low rates of expired loans (less than 3%), whereas nearly 20% of loans in the US do not get fully funded. Europe and North America (excluding the US) also have a high proportion of expired loans. Again, the direction of this preference is not surprising, but the size of the difference is striking. By evaluating each individual’s wealth level, we can reduce the bias in selecting applicants to fund.


**Data**

The loan dataset provided by Kiva contains the following information for each loan application: the dollar value of the loan funded on Kiva.org; the total dollar amount of the loan; the loan activity type; the sector of the loan activity as shown to lenders; the country name; the name of the location within the country; the repayment interval, which is the frequency at which lenders are scheduled to receive installments; and the loan theme, along with Kiva’s estimates of the geolocations in which a loan theme has been offered. We use the geolocation information in the Kiva loan dataset to better understand the geographic spread of the population applying for loans.

Because the loan data is tied to geographic locations, gathering demographic information associated with those locations is essential. We requested Demographic and Health Surveys (DHS) data for the Philippines, together with the related geographic information, from the DHS website (https://dhsprogram.com/data/). The survey data is collected through field work. Each surveyed individual is assigned to a specific DHS cluster, and geolocation information is provided for each cluster. There are 776 clusters in the Philippines in total, and the number of individuals surveyed varies from cluster to cluster. The piece of information in the DHS dataset most relevant to the goal of this project is the wealthscore evaluated for each surveyed individual, which was calculated from asset value and other factors.

We developed the following procedure for combining the two datasets: first, we assign each loan applicant to a cluster by finding the closest DHS cluster based on geolocation; second, we propose two ways of using the wealthscores in the DHS dataset to label Kiva applicants, which are discussed in the next section.
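
The first step, matching each loan to its closest DHS cluster, can be sketched as below. This is a minimal illustration and not the code used to build the input files: the function name `nearest_cluster`, the choice of great-circle (haversine) distance, and the brute-force search are all assumptions made for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def nearest_cluster(loan_latlon, cluster_latlon):
    """For each loan, return the index of the closest cluster and the distance in km.

    Both arguments are sequences of (latitude, longitude) pairs in degrees.
    Hypothetical helper: the distance metric actually used is not stated in the text.
    """
    loans = np.radians(np.asarray(loan_latlon, dtype=float))[:, None, :]        # (n, 1, 2)
    clusters = np.radians(np.asarray(cluster_latlon, dtype=float))[None, :, :]  # (1, k, 2)

    dlat = loans[..., 0] - clusters[..., 0]
    dlon = loans[..., 1] - clusters[..., 1]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(loans[..., 0]) * np.cos(clusters[..., 0]) * np.sin(dlon / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))                       # (n, k)

    # Brute-force search: pick the closest cluster for every loan.
    return dist_km.argmin(axis=1), dist_km.min(axis=1)
```

A brute-force distance matrix is fine for a few hundred clusters; a spatial index would be a reasonable replacement for much larger inputs.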


**Model Formulation**

The major challenge of this project is that the loan dataset we are working with does not provide true labels. To address this, we use the DHS dataset to provide estimated labels for the loan dataset. As introduced in the previous section, the DHS dataset divides the Philippines into 776 clusters geographically and provides an index called wealthscore for each surveyed individual in a cluster. We assign a label to each loan data point based on the DHS cluster it falls in and the wealthscores in that cluster.
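
One way to derive such labels is sketched below with pandas. It assumes a DHS table with the `DHSCLUST` and `wealthscore` columns and a loan table that already carries a `DHSCLUST` column from the nearest-cluster step. The helper names `cluster_stats` and `label_loans` and the output column `wealthscore_label` are hypothetical, and using the cluster mean as the label is only one of the two labeling choices.

```python
import pandas as pd


def cluster_stats(dhs):
    """Per-cluster mean and sample variance of wealthscore, indexed by DHSCLUST."""
    return dhs.groupby("DHSCLUST")["wealthscore"].agg(["mean", "var"])


def label_loans(loans, stats):
    """Attach the mean wealthscore of each loan's cluster as an estimated label."""
    labels = stats["mean"].rename("wealthscore_label")  # hypothetical column name
    return loans.join(labels, on="DHSCLUST")
```

Clusters with a single surveyed individual have an undefined sample variance (`NaN`), so they need a fallback value before the variance is used.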



In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy import optimize


# take input data from Phil_loan_meanLabel.csv
def get_data(csv_file):
    df = pd.read_csv(csv_file)
    # drop columns that are not useful
    df = df.drop(['country_code',
                  'activity','use',
                  'country.x',
                  'region',
                  'currency',
                  'geo',
                  'lat',
                  'lon',
                  'mpi_region',
                  'mpi_geo',
                  'posted_time',
                  'disbursed_time',
                  'funded_time',
                  'geocode',
                  'tags',
                  'borrower_genders',
                  'LocationName',
                  'nn.idx',
                  'nn.dists',
                  'ISO',
                  'number',
                  'Unnamed: 0',
                  'forkiva',
                  'names',
                  'Partner.ID',
                  'Field.Partner.Name',
                  'Loan.Theme.ID',
                  'date',
                  'Loan.Theme.Type',
                  'amount',
                  'id',
                  'DHSCLUST',
                  'rural_pct'       #drop this col since it contains NaN
                  ], axis=1)
    df = pd.get_dummies(df, prefix=['sector.x','repayment_interval'])

    X = df.drop(['wealthscore'],axis=1)
    Y = df['wealthscore']
    X = X.values
    Y = np.array(Y)
    df = df.drop(['wealthscore'],axis=1)
    return [X,Y,df]

# find the mean and variance of each cluster
def find_cluster_mean_var():
    dhs = pd.read_csv("../input/philippines/Phil_DHS_info.csv")

    cluster_mean_var_dict = dict()

    num_cluster = 794

    for cluster_num in range(1, num_cluster+1):
        cur_cluster = dhs.loc[dhs["DHSCLUST"] == cluster_num]
        cur_cluster_mean = cur_cluster["wealthscore"].mean()
        cur_cluster_var = cur_cluster["wealthscore"].var()

        if (np.isnan(cur_cluster_mean)):
            cur_cluster_mean = 0.0
        if (np.isnan(cur_cluster_var)):
            cur_cluster_var = 1.0
        cluster_mean_var_dict[cluster_num] = (cur_cluster_mean, cur_cluster_var)
    return cluster_mean_var_dict






# based on the assumption that labels are assigned uniformly within a cluster
def linear_poverty():
    # training
    [X,Y,df] = get_data("../input/philippines/Phil_loan_meanLabel.csv")

    min_max_scaler = preprocessing.MinMaxScaler()
    X_scaled = min_max_scaler.fit_transform(X)


    regr = linear_model.LinearRegression()
    regr.fit(X_scaled,Y)

    predict_Y = regr.predict(X_scaled)

    df['naive_value'] = predict_Y

    df.to_csv('naive_value.csv')



# based on the assumption that labels are normally distributed within a cluster
def gaussian_poverty():
    cluster_mean_var_dict = find_cluster_mean_var()
    [X, Y, df] = get_data("../input/philippines/Phil_loan_meanLabel.csv")
    K = 794  # number of DHS clusters; must match num_cluster in find_cluster_mean_var

    # gradient of the loss function with respect to the m-th weight;
    # setting it to zero gives the optimal w for feature m
    def derivative_m(w_1, m):
        result = 0
        for k in range(1, K+1):
            (mu_k, sig_k) = cluster_mean_var_dict[k]
            # floor the variance at 1.0 so tiny variances do not dominate the sum
            if sig_k <= 1.0: sig_k = 1.0
            for i in range(0, N):
                result += (X_scaled[i][m]**2 * w_1 - mu_k*X_scaled[i][m]) / sig_k

        return result

    min_max_scaler = preprocessing.MinMaxScaler()
    X_scaled = min_max_scaler.fit_transform(X)


    N = len(X_scaled)
    M = len(X_scaled[0])

    ws = []

    for m in range(0, M):
        root = optimize.newton(derivative_m, 50.0, tol=0.001, args = (m,))
        ws.append(root)

    print(ws)
    # ws = [760254.49989362864, 768486.12283061701, 363324.08014879079, 798248.12640516623, 40352.972485067876, 40352.972484749189, 40352.972484983307, 40352.972484801423, 40352.972484807746, 40352.972484819227, 40352.972483169637, 40352.972484795268, 40352.972484850201, 40352.972484722704, 40352.972484746431, 40352.972479315904, 40352.972484980615, 40352.972484980528, 40352.972484808932, 40352.972484769722, 40352.972485522754, 40352.972484685706]


    predict_Y = []

    # predict on the scaled features, since the weights were fitted on X_scaled
    for i in range(0, N):
        predict_Y.append(np.dot(X_scaled[i], ws))

    df['gaussian_value'] = predict_Y

    df.to_csv('gaussian_value.csv')

'''
if __name__ == '__main__':
    linear_poverty()
    gaussian_poverty()
'''
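
Because `derivative_m` is linear in `w_1`, its root can also be written in closed form, which avoids the nested loops that each Newton step evaluates. The sketch below is an illustration only: it assumes the cluster means and variances have been collected into two arrays, and the names `closed_form_weights`, `mus`, and `sigs` are hypothetical.

```python
import numpy as np


def closed_form_weights(X_scaled, mus, sigs):
    """Root of derivative_m for every feature m, without iteration.

    derivative_m(w, m) = w * sum_i x_im^2 * sum_k 1/sig_k
                         - sum_i x_im * sum_k mu_k/sig_k
    so the root is the ratio of the second term to the coefficient of w.
    """
    X_scaled = np.asarray(X_scaled, dtype=float)
    mus = np.asarray(mus, dtype=float)
    sigs = np.maximum(np.asarray(sigs, dtype=float), 1.0)  # same floor as derivative_m

    a = (X_scaled ** 2).sum(axis=0) * (1.0 / sigs).sum()   # coefficient of w, per feature
    b = X_scaled.sum(axis=0) * (mus / sigs).sum()          # constant term, per feature

    # Features that are all zero after scaling have no defined root; return NaN there.
    return np.divide(b, a, out=np.full_like(b, np.nan), where=a > 0)
```

Under these assumptions the result should agree with the Newton iteration up to its tolerance.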



The code above implements two models. To generate the results, uncomment the `if __name__ == '__main__':` block at the end. We combined the results of both models and compared them with night light intensity in the next kernel: https://www.kaggle.com/chifmackenzie/kiva-loan-applicants-poverty-prediction-result/