# Data Science for Good: Kiva Crowdfunding
## This kernel is designed to do the research and analysis on the data provided by Kiva.org. https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding


# Problem Statement
For the locations in which Kiva has active loans, your objective is to pair Kiva's data with additional data sources to estimate the welfare level of borrowers in specific regions, based on shared economic and demographic characteristics.

A good solution would connect the features of each loan or product to one of several poverty mapping datasets, which indicate the average level of welfare in a region on as granular a level as possible. Many datasets indicate the poverty rate in a given area, with varying levels of granularity. Kiva would like to be able to disaggregate these regional averages by gender, sector, or borrowing behavior in order to estimate a Kiva borrower’s level of welfare using all of the relevant information about them. Strong submissions will attempt to map vaguely described locations to more accurate geocodes.

Kernels submitted will be evaluated based on the following criteria:

1. Localization - How well does a submission account for highly localized borrower situations? Leveraging a variety of external datasets and successfully building them into a single submission will be crucial.

2. Execution - Submissions should be efficiently built and clearly explained so that Kiva’s team can readily employ them in their impact calculations.

3. Ingenuity - While there are many best practices to learn from in the field, there is no one way of using data to assess welfare levels. It’s a challenging, nuanced field and participants should experiment with new methods and diverse datasets.

# Objective Statement
In order to assess the impact of Kiva-sponsored micro-loans at the regional level, we will carry out the following analyses:
1. Extract insights from the historical micro-loans given out by Kiva over time.
2. Assess the change in a region's economic indicators over time (6 months before and after) as influenced by Kiva loans.
3. Check the creditworthiness of borrowers by assessing repayment patterns across demographics.
4. Assess how borrowers return for further loans: recurring loans, bigger loans, or one-off loans.
5. Estimate the probability of near-term repayment of a Kiva loan in a specific region using a machine learning algorithm.



In [None]:
cd /

In [None]:
cd ../kaggle/input/

In [None]:
ls

In [None]:
import numpy as np 
from numpy import array
import pandas as pd 
import matplotlib 
import matplotlib.pyplot as plt
from matplotlib import cm
import plotly.graph_objs as go
import seaborn as sns
color = sns.color_palette()
import plotly.offline as py
# initialise plotly's offline notebook mode once
py.init_notebook_mode(connected=True)
import plotly.tools as tls
import squarify
%matplotlib inline
from sklearn import datasets, linear_model, metrics

In [None]:
kiva_loans_data = pd.read_csv("data-science-for-good-kiva-crowdfunding/kiva_loans.csv")
kiva_mpi_region_location_data = pd.read_csv("data-science-for-good-kiva-crowdfunding/kiva_mpi_region_locations.csv")
loan_theme_ids_data = pd.read_csv("data-science-for-good-kiva-crowdfunding/loan_theme_ids.csv")
loan_themes_by_region_data = pd.read_csv("data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv")

print ("Size of kiva_loans_data", kiva_loans_data.shape)
print ("Size of kiva_mpi_region_location_data",kiva_mpi_region_location_data.shape)
print ("Size of loan_theme_ids_data",loan_theme_ids_data.shape)
print ("Size of loan_themes_by_region_data",loan_themes_by_region_data.shape)

# Renaming dataset for quicker access in further code
d1 = kiva_loans_data
d2 = kiva_mpi_region_location_data 
d3 = loan_theme_ids_data 
d4 = loan_themes_by_region_data


In [None]:
kiva_loans_data.info()


In [None]:
kiva_loans_data.head()

# Initial observations from d1 (loan data)

* Funded amount appears to be the same as loan amount. Is that always the case? If the two are equal for every entry, we can drop one of the columns from further analysis.
* What are the different categories of activities?


In [None]:
kiva_loans_data.describe()

In [None]:
kiva_mpi_region_location_data.info()

In [None]:
kiva_mpi_region_location_data.head()

In [None]:
loan_theme_ids_data.info()


In [None]:
loan_theme_ids_data.head()


In [None]:
loan_themes_by_region_data.info()

In [None]:
loan_themes_by_region_data.head()

In [None]:
loan_themes_by_region_data.describe()

In [None]:
# Helper functions used in the cells below

def check_diff(t1:list,detail_flag:bool):
    
    """
    check_diff takes a sequence of two-element rows and compares the
    first value of each row with the second.
    It counts the rows where the two values differ and prints that total.
    If detail_flag is True, it also prints the first 5 differing rows;
    if it is False, only the total count is printed.
    """
    i = 0
    j = 0
    for x in t1:
        if x[0] != x[1]:
            if detail_flag and j < 5: 
                print (f"{x[0]} != {x[1]} -> Different")
                j += 1
            i += 1
    if j == 5:
        print ("..... only the first 5 records are displayed\n")
    print (f"Count of differing records = {i}")

In [None]:
# To infer if Loan amount is same as funded amount, check the number of loans where they are different
check_diff(d1.values[:,1:3],True)

There are many loan records where the funded amount is less than the loan amount requested. Since funded_amount reflects the money actually raised for a loan, it is the more meaningful of the two, so we should drop the loan_amount column from the table as largely duplicate data.
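
The same comparison can also be done without a Python-level loop. A minimal sketch, assuming a small made-up frame that only borrows the `funded_amount` and `loan_amount` column names (the values and the `shortfall` column are illustrative, not taken from the Kiva data):

```python
import pandas as pd

# Toy stand-in for kiva_loans_data; values are invented for illustration.
toy_loans = pd.DataFrame({
    "funded_amount": [300.0, 575.0, 0.0, 200.0],
    "loan_amount":   [300.0, 600.0, 150.0, 200.0],
})

# Boolean mask: True where the funded amount differs from the requested amount.
differs = toy_loans["funded_amount"] != toy_loans["loan_amount"]

# Shortfall per loan (requested minus funded); hypothetical helper column.
toy_loans["shortfall"] = toy_loans["loan_amount"] - toy_loans["funded_amount"]

n_different = int(differs.sum())
underfunded = toy_loans.loc[differs, ["funded_amount", "loan_amount", "shortfall"]]
print(f"{n_different} of {len(toy_loans)} loans differ")
print(underfunded)
```

On the full table the mask would be built from `d1` in the same way, which avoids iterating over every row in Python.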

In [None]:
d1['repayment_interval'].value_counts().plot(kind="pie",figsize=(12,12))

In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['repayment_interval'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Repayment interval types', fontsize=12)
plt.title('Repayment interval count', fontsize=16)

In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['country'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Country name', fontsize=12)
plt.title('Country-wise distribution of Kiva loans', fontsize=16)


In [None]:
plt.figure(figsize = (12, 8))

sns.distplot(kiva_loans_data['funded_amount'])
plt.show() 
plt.figure(figsize = (12, 8))
plt.scatter(range(kiva_loans_data.shape[0]), np.sort(kiva_loans_data.funded_amount.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('funded_amount', fontsize=12)
plt.title("Funded Amount Distribution")
plt.show()

In [None]:
d2.columns

In [None]:
# Distribution of world regions
plt.figure(figsize=(15,8))
count = d2['world_region'].value_counts()
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('world region name', fontsize=12)
plt.title("Distribution of world regions", fontsize=16)
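
To connect loans to regional poverty figures, as the problem statement asks, one option is a left join on country and region. The sketch below runs on invented toy tables; the `MPI` column name, the place names and every value are assumptions made for illustration, not data from the Kiva files.

```python
import pandas as pd

# Toy loans table; real data would come from kiva_loans.csv.
toy_loans = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "region": ["Region 1", "Region 2", "Region 3"],
    "funded_amount": [250.0, 400.0, 900.0],
})

# Toy poverty table; the `MPI` column name and its values are assumptions.
toy_mpi = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "region": ["Region 1", "Region 3"],
    "MPI": [0.10, 0.25],
})

# A left join keeps every loan; `indicator=True` adds a `_merge` column
# that shows which loans found a matching region.
merged = toy_loans.merge(toy_mpi, on=["country", "region"], how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(f"{len(unmatched)} loan(s) without a regional match")
```

Loans flagged `left_only` are the ones whose location text would need cleaning or geocoding before they can be tied to a poverty estimate.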


In [None]:
#Distribution of lender count(Number of lenders contributing to loan)
print("Number of distinct lender_count values : ", len(kiva_loans_data["lender_count"].unique()))
print(kiva_loans_data["lender_count"].value_counts().head(10))
lender = kiva_loans_data['lender_count'].value_counts().head(40)
plt.figure(figsize=(15,8))
sns.barplot(x=lender.index, y=lender.values, alpha=0.9, color=color[0])
plt.xticks(rotation='vertical')
plt.xlabel('lender count(Number of lenders contributing to loan)', fontsize=12)
plt.ylabel('count', fontsize=12)
plt.title("Distribution of lender count", fontsize=16)
plt.show()


In [None]:
#Distribution of Loan Activity type

plt.figure(figsize=(15,8))
count = kiva_loans_data['activity'].value_counts().head(30)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Activity name', fontsize=12)
plt.title("Top Loan Activity type", fontsize=16)


In [None]:
#Distribution of Number of months over which loan was scheduled to be paid back
print("Number of distinct term_in_months values : ", len(kiva_loans_data["term_in_months"].unique()))
print(kiva_loans_data["term_in_months"].value_counts().head(10))
lender = kiva_loans_data['term_in_months'].value_counts().head(70)
plt.figure(figsize=(15,8))
sns.barplot(x=lender.index, y=lender.values, alpha=0.9, color=color[0])
plt.xticks(rotation='vertical')
plt.xlabel('Number of months over which loan was scheduled to be paid back', fontsize=12)
plt.ylabel('count', fontsize=12)
plt.title("Distribution of Number of months over which loan was scheduled to be paid back", fontsize=16)
plt.show()


In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['sector'].value_counts()
squarify.plot(sizes=count.values,label=count.index, value=count.values)
plt.title('Distribution of sectors')

In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['activity'].value_counts()
squarify.plot(sizes=count.values,label=count.index, value=count.values)
plt.title('Distribution of Activities')

In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['repayment_interval'].value_counts()
squarify.plot(sizes=count.values,label=count.index, value=count.values)
plt.title('Distribution of repayment_interval')

In [None]:
plt.figure(figsize=(15,8))
count = kiva_loans_data['use'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('uses of loans', fontsize=12)
plt.title("Most popular uses of loans", fontsize=16)

In [None]:
gender_list = []
for gender in kiva_loans_data["borrower_genders"].values:
    if str(gender) != "nan":
        gender_list.extend( [lst.strip() for lst in gender.split(",")] )
temp_data = pd.Series(gender_list).value_counts()

labels = (np.array(temp_data.index))
sizes = (np.array((temp_data / temp_data.sum())*100))

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(title='Borrower Gender')
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="BorrowerGender")
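
The tallying step above, where one loan can list several borrowers in a single comma-separated field, can be sketched with only the standard library. The sample strings below are invented and `None` stands in for a missing value:

```python
from collections import Counter

# Invented examples of the comma-separated gender field.
raw_genders = ["female", "male, female", "female, female, female", None, "male"]

counts = Counter()
for entry in raw_genders:
    if entry is None:
        continue  # skip loans with no recorded borrower gender
    # One loan can list several borrowers; count each of them.
    counts.update(part.strip() for part in entry.split(","))

# Convert the tallies to percentage shares of all listed borrowers.
total = sum(counts.values())
shares = {gender: 100 * n / total for gender, n in counts.items()}

print(counts)
print(shares)
```

Counting borrowers rather than loans matters here: a group loan with three women contributes three to the female tally, not one.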

In [None]:
kiva_loans_data.borrower_genders = kiva_loans_data.borrower_genders.astype(str)
gender_data = pd.DataFrame(kiva_loans_data.borrower_genders.str.split(',').tolist())
kiva_loans_data['sex_borrowers'] = gender_data[0]
kiva_loans_data.loc[kiva_loans_data.sex_borrowers == 'nan', 'sex_borrowers'] = np.nan
sex_mean = pd.DataFrame(kiva_loans_data.groupby(['sex_borrowers'])['funded_amount'].mean().sort_values(ascending=False)).reset_index()
print(sex_mean)
g1 = sns.barplot(x='sex_borrowers', y='funded_amount', data=sex_mean)
g1.set_title("Mean funded amount by gender", fontsize=15)
g1.set_xlabel("Gender")
g1.set_ylabel("Average funded amount (USD)", fontsize=12)

In [None]:
f, ax = plt.subplots(figsize=(15, 5))
print("Genders count with repayment interval monthly\n",kiva_loans_data['sex_borrowers'][kiva_loans_data['repayment_interval'] == 'monthly'].value_counts())
print("Genders count with repayment interval weekly\n",kiva_loans_data['sex_borrowers'][kiva_loans_data['repayment_interval'] == 'weekly'].value_counts())
print("Genders count with repayment interval bullet\n",kiva_loans_data['sex_borrowers'][kiva_loans_data['repayment_interval'] == 'bullet'].value_counts())
print("Genders count with repayment interval irregular\n",kiva_loans_data['sex_borrowers'][kiva_loans_data['repayment_interval'] == 'irregular'].value_counts())

sns.countplot(x="sex_borrowers", hue='repayment_interval', data=kiva_loans_data).set_title('sex borrowers with repayment_intervals');

In [None]:
#Distribution of Kiva Field Partner Names with funding count
print("Number of distinct Kiva Field Partner Names : ", len(loan_themes_by_region_data["Field Partner Name"].unique()))
print(loan_themes_by_region_data["Field Partner Name"].value_counts().head(10))
lender = loan_themes_by_region_data['Field Partner Name'].value_counts().head(40)
plt.figure(figsize=(15,8))
sns.barplot(x=lender.index, y=lender.values, alpha=0.9, color=color[0])
plt.xticks(rotation='vertical', fontsize=14)
plt.xlabel('Field Partner Name', fontsize=18)
plt.ylabel('Funding count', fontsize=18)
plt.title("Top Kiva Field Partner Names with funding count", fontsize=25)
plt.show()

In [None]:
# select the numeric column before aggregating so non-numeric columns are never averaged
countries_funded_amount = kiva_loans_data.groupby('country')['funded_amount'].mean().sort_values(ascending = False)
print("Top Countries with funded_amount(Dollar value of loan funded on Kiva.org)(Mean values)\n",countries_funded_amount.head(10))

In [None]:
data = [dict(
        type='choropleth',
        locations= countries_funded_amount.index,
        locationmode='country names',
        z=countries_funded_amount.values,
        text=countries_funded_amount.index,
        colorscale='Reds',
        marker=dict(line=dict(width=0.7)),
        colorbar=dict(tickprefix='', title='Top Countries with funded_amount(Mean value)'),
)]
layout = dict(title = 'Top Countries with funded_amount(Dollar value of loan funded on Kiva.org)',
             geo = dict(
            showframe = False,
            #showcoastlines = False,
            projection = dict(
                type = 'mercator'
            )
        ),)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)
    

In [None]:
# select the numeric column before aggregating so non-numeric columns are never averaged
mpi_region_amount = round(loan_themes_by_region_data.groupby('mpi_region')['amount'].mean().sort_values(ascending = False))
print("Top mpi_region with amount(Dollar value of loans funded in particular LocationName)(Mean values)\n",mpi_region_amount.head(10))

In [None]:
data = [dict(
        type='choropleth',
        locations= mpi_region_amount.index,
        locationmode='country names',
        z=mpi_region_amount.values,
        text=mpi_region_amount.index,
        colorscale='Reds',
        marker=dict(line=dict(width=0.7)),
        colorbar=dict(tickprefix='', title='Top mpi_regions with amount(Mean value)'),
)]
layout = dict(title = 'Top mpi_regions with amount(Dollar value of loans funded in particular LocationName)',
             geo = dict(
            showframe = False,
            #showcoastlines = False,
            projection = dict(
                type = 'mercator'
            )
        ),)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)


In [None]:
plt.figure(figsize=(15,8))
count = round(kiva_loans_data.groupby(['sector'])['loan_amount'].mean().sort_values(ascending=False))
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Average loan amount in Dollar', fontsize=20)
plt.ylabel('Loan sector', fontsize=20)
plt.title('Popular loan sector in terms of loan amount', fontsize=24)

In [None]:

plt.figure(figsize=(15,8))
count = round(kiva_loans_data.groupby(['activity'])['loan_amount'].mean().sort_values(ascending=False).head(20))
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Average loan amount in Dollar', fontsize=20)
plt.ylabel('Loan activity', fontsize=20)
plt.title('Popular loan activity in terms of loan amount', fontsize=24)

In [None]:

plt.figure(figsize=(15,8))
count = round(kiva_loans_data.groupby(['country'])['loan_amount'].mean().sort_values(ascending=False).head(20))
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Average loan amount in Dollar', fontsize=20)
plt.ylabel('Countries', fontsize=20)
plt.title('Popular countries in terms of loan amount', fontsize=24)

In [None]:
plt.figure(figsize=(15,8))
count = round(kiva_loans_data.groupby(['region'])['loan_amount'].mean().sort_values(ascending=False).head(20))
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Average loan amount in Dollar', fontsize=20)
plt.ylabel('regions(locations within countries)', fontsize=20)
plt.title('Popular regions(locations within countries) in terms of loan amount', fontsize=24)

In [None]:
sector_repayment = ['sector', 'repayment_interval']
cm = sns.light_palette("red", as_cmap=True)
pd.crosstab(kiva_loans_data[sector_repayment[0]], kiva_loans_data[sector_repayment[1]]).style.background_gradient(cmap = cm)

In [None]:
sector_repayment = ['country', 'repayment_interval']
cm = sns.light_palette("red", as_cmap=True)
pd.crosstab(kiva_loans_data[sector_repayment[0]], kiva_loans_data[sector_repayment[1]]).style.background_gradient(cmap = cm)

In [None]:
#Correlation Matrix
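# correlation is only defined for numeric columns, so select them explicitly
corr = kiva_loans_data.select_dtypes(include='number').corr()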
plt.figure(figsize=(12,12))
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, annot=True, cmap='cubehelix', square=True)
plt.title('Correlation between different features')
corr

In [None]:
fig = plt.figure(figsize=(15,8))
ax=sns.kdeplot(kiva_loans_data['term_in_months'][kiva_loans_data['repayment_interval'] == 'monthly'] , color='b',shade=True, label='monthly')
ax=sns.kdeplot(kiva_loans_data['term_in_months'][kiva_loans_data['repayment_interval'] == 'weekly'] , color='r',shade=True, label='weekly')
ax=sns.kdeplot(kiva_loans_data['term_in_months'][kiva_loans_data['repayment_interval'] == 'irregular'] , color='g',shade=True, label='irregular')
ax=sns.kdeplot(kiva_loans_data['term_in_months'][kiva_loans_data['repayment_interval'] == 'bullet'] , color='y',shade=True, label='bullet')
plt.title('Term in months(Number of months over which loan was scheduled to be paid back) vs Repayment intervals')
ax.set(xlabel='Terms in months', ylabel='Density')

In [None]:
temp = loan_themes_by_region_data['forkiva'].value_counts()
labels = temp.index
sizes = (temp / temp.sum())*100
trace = go.Pie(labels=labels, values=sizes, hoverinfo='label+percent')
layout = go.Layout(title='Loan theme specifically for Kiva V.S. Loan theme not specifically for Kiva')
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

# Linear Regression on two variables
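
For reference, with $n$ observations $(x_i, y_i)$ and sample means $\bar{x}$ and $\bar{y}$, the ordinary least-squares estimates for the line $y = b_0 + b_1 x$ can be written as follows. The symbols mirror the variable names `SS_xy`, `SS_xx`, `b_0` and `b_1` used in the code:

```latex
\begin{aligned}
SS_{xy} &= \sum_{i=1}^{n} x_i y_i \;-\; n\,\bar{x}\,\bar{y}, \\
SS_{xx} &= \sum_{i=1}^{n} x_i^2 \;-\; n\,\bar{x}^2, \\
b_1 &= \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1\,\bar{x}.
\end{aligned}
```

Note that the terms $n\,\bar{x}\,\bar{y}$ and $n\,\bar{x}^2$ are subtracted once, outside the sums.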


In [None]:
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
 
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
 
    # calculating cross-deviation and deviation about x
    # (the n*mean term is subtracted once, outside the sum)
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
 
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
 
    return(b_0, b_1)
 
def plot_regression_line(x, y, b, xlabel, ylabel):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
               marker = "o", s = 30)
 
    # predicted response vector
    y_pred = b[0] + b[1]*x
 
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
 
    # putting labels
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
 
    # function to show plot
    plt.show()
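
A hand-written estimator of this kind is easy to sanity-check against `np.polyfit`. The sketch below uses a hypothetical helper, `ols_line`, and synthetic points that lie exactly on a known line; it does not touch the Kiva data.

```python
import numpy as np

def ols_line(x, y):
    """Closed-form intercept and slope for y = b0 + b1 * x (illustrative helper)."""
    n = x.size
    m_x, m_y = x.mean(), y.mean()
    ss_xy = np.sum(x * y) - n * m_x * m_y  # subtract once, outside the sum
    ss_xx = np.sum(x * x) - n * m_x * m_x
    b1 = ss_xy / ss_xx
    return m_y - b1 * m_x, b1

# Synthetic points lying exactly on y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

b0, b1 = ols_line(x, y)

# np.polyfit returns coefficients with the highest power first: (slope, intercept).
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print(b0, b1)
print(intercept_ref, slope_ref)
```

If the two pairs of numbers disagree on data with a known answer, the closed-form code is the first place to look.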
 


In [None]:
# observations
x = np.array(d1["funded_amount"])
y = np.array(d1["loan_amount"])

# estimating coefficients
b = estimate_coef(x, y)
print(f"Estimated coefficients:\nb_0 = {b[0]}\nb_1 = {b[1]}")

xlabel = "Funded Amount"
ylabel = "Loan Amount"
# plotting regression line
plot_regression_line(x, y, b, xlabel, ylabel)


In [None]:
# observations
x = np.array(d1["term_in_months"])
y = np.array(d1["loan_amount"])

# estimating coefficients
b = estimate_coef(x, y)
print(f"Estimated coefficients:\nb_0 = {b[0]}\nb_1 = {b[1]}")

xlabel = "Term in Months"
ylabel = "Loan Amount"
# plotting regression line
plot_regression_line(x, y, b, xlabel, ylabel)


In [None]:
# observations
x = np.array(d1["lender_count"])
y = np.array(d1["loan_amount"])

# estimating coefficients
b = estimate_coef(x, y)
print(f"Estimated coefficients:\nb_0 = {b[0]}\nb_1 = {b[1]}")

xlabel = "Lenders count"
ylabel = "Loan Amount"
# plotting regression line
plot_regression_line(x, y, b, xlabel, ylabel)

    

# Multiple linear regression
 

In [None]:
# defining feature matrix(X) and response vector(y)
# repayment_interval is mapped to an integer code so it can serve as a regression target
mymap= {'irregular':1, 'bullet':2, 'monthly':3, 'weekly':4, "male":11, "female":12}
male_only = d1.borrower_genders == "male"
X = d1.loc[male_only, ["funded_amount", "term_in_months"]]
y = d1.loc[male_only, "repayment_interval"].map(mymap)

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=1)

# create linear regression object
reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients: \n', reg.coef_)

# R^2 score (coefficient of determination): 1 means perfect prediction
print('R^2 score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

## setting plot style
plt.style.use('fivethirtyeight')

## plotting residual errors in training data
plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
            color = "green", s = 10, label = 'Train data')

## plotting residual errors in test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
            color = "blue", s = 10, label = 'Test data')

## plotting line for zero residual error
plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)

## plotting legend
plt.legend(loc = 'upper right')

## plot title
plt.title("Residual errors")

## function to show plot
plt.show()
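
The score that `reg.score` reports for a `LinearRegression` model is the coefficient of determination, R². A minimal sketch of that quantity computed by hand with NumPy follows; the arrays are synthetic, and `np.linalg.lstsq` stands in for the scikit-learn estimator purely for illustration.

```python
import numpy as np

# Synthetic data: two features and a response that is exactly linear in them.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 = 1 - SS_res / SS_tot
y_pred = A @ coef
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(coef)
print(r2)
```

Because the synthetic response is exactly linear, R² comes out at essentially 1; on real data a value near 0 means the features explain little of the variation in the target.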