# Data Science for Good: Kiva Crowdfunding  
<img src="https://cdn-images-1.medium.com/max/2000/1*uQpsmDSjNg75xsJ0j9dD0Q.jpeg" height="280" width="700" />
___

[Kiva.org](https://www.kiva.org/) is an online crowdfunding platform to extend financial services to poor and financially excluded people around the world. Kiva lenders have provided over $1 billion in loans to over 2 million people. In order to set investment priorities, help inform lenders, and understand their target communities, knowing the level of poverty of each borrower is critical. However, this requires inference based on a limited set of information for each borrower.

In Kaggle Datasets' inaugural [Data Science for Good](http://blog.kaggle.com/2017/11/16/introducing-data-science-for-good-events-on-kaggle/) challenge, Kiva is inviting the Kaggle community to help them build more localized models to estimate the poverty levels of residents in the regions where Kiva has active loans. Unlike traditional machine learning competitions with rigid evaluation criteria, participants will develop their own creative approaches to addressing the objective. Instead of making a prediction file as in a supervised machine learning problem, submissions in this challenge will take the form of Python and/or R data analyses using Kernels, Kaggle's hosted Jupyter Notebooks-based workbench.

Kiva has provided a dataset of loans issued over the last two years, and participants are invited to use this data as well as source external public datasets to help Kiva build models for assessing borrower welfare levels. Participants will write kernels on this dataset to submit as solutions to this objective and five winners will be selected by Kiva judges at the close of the event. In addition, awards will be made to encourage public code and data sharing. With a stronger understanding of their borrowers and their poverty levels, Kiva will be able to better assess and maximize the impact of their work.

The sections that follow describe in more detail how to participate, win, and use available resources to make a contribution towards helping Kiva better understand and help entrepreneurs around the world.
___  


<img src="http://www-kiva-org.global.ssl.fastly.net/cms/kiva_logo_tag_0.png" height="280" width="700"  />
## Problem Statement
For the locations in which Kiva has active loans, your objective is to pair Kiva's data with additional data sources to estimate the welfare level of borrowers in specific regions, based on shared economic and demographic characteristics.

A good solution would connect the features of each loan or product to one of several poverty mapping datasets, which indicate the average level of welfare in a region on as granular a level as possible. Many datasets indicate the poverty rate in a given area, with varying levels of granularity. Kiva would like to be able to disaggregate these regional averages by gender, sector, or borrowing behavior in order to estimate a Kiva borrower’s level of welfare using all of the relevant information about them. Strong submissions will attempt to map vaguely described locations to more accurate geocodes.

Kernels submitted will be evaluated based on the following criteria:

**1. Localization** - How well does a submission account for highly localized borrower situations? Leveraging a variety of external datasets and successfully building them into a single submission will be crucial.

**2. Execution** - Submissions should be efficiently built and clearly explained so that Kiva’s team can readily employ them in their impact calculations.

**3. Ingenuity** - While there are many best practices to learn from in the field, there is no one way of using data to assess welfare levels. It’s a challenging, nuanced field and participants should experiment with new methods and diverse datasets.
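
The pairing described in the problem statement could be sketched roughly as follows. This is only a minimal illustration on toy data, not Kiva's method: the frames `toy_loans` and `toy_mpi` and their column names (`country`, `region`, `MPI`, `borrower_genders`) are assumptions standing in for the real loan and MPI files, and the country-level fallback is just one possible choice.

```python
import pandas as pd

# Toy stand-ins for the loans file and the MPI locations file.
# The column names are assumed for illustration.
toy_loans = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["Kenya", "Kenya", "Peru", "Peru"],
    "region": ["Nairobi", "Kisumu", "Cusco", None],
    "borrower_genders": ["female", "male", "female", "female"],
})
toy_mpi = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Peru"],
    "region": ["Nairobi", "Kisumu", "Cusco"],
    "MPI": [0.02, 0.25, 0.08],
})

# Left join keeps every loan; loans whose location has no match get MPI = NaN.
loans_mpi = toy_loans.merge(toy_mpi, on=["country", "region"], how="left")

# Fall back to the country-level mean MPI where the region did not match.
country_mpi = toy_mpi.groupby("country")["MPI"].mean()
loans_mpi["MPI"] = loans_mpi["MPI"].fillna(loans_mpi["country"].map(country_mpi))

# Disaggregate: average matched MPI by borrower gender.
mpi_by_gender = loans_mpi.groupby("borrower_genders")["MPI"].mean()
print(mpi_by_gender)
```

On real data the region names rarely match exactly, so the join keys would likely need cleaning or geocoding first.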
___

## Loading Packages  
The libraries below are used to load and explore the Kiva Crowdfunding challenge data.

In [14]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import plotly.graph_objs as go
import plotly.tools as tools
import plotly.offline as ply
ply.init_notebook_mode(connected=True)
import seaborn as sns 
color = sns.color_palette()
import matplotlib.pyplot as plt
%matplotlib inline
# Any results you write to the current directory are saved as output.

## Listing Kiva loan data files
To know which files to load, we first list the files in the input directory.

In [15]:
print(os.listdir("../input/"))

### Let's load the Kiva MPI region locations data first
> Nothing is better than going home to family and eating good food and relaxing.
> by **Irina Shayk**

In [16]:
locations = pd.read_csv('../input/kiva_mpi_region_locations.csv')

### Let's see the first few lines
It is good to know what the data looks like before analyzing it.

In [17]:
locations.head()

## Creating visual graphs by countries
* First, plot the countries from the most entries in the locations data to the least  
* Then show the countries and regions with loans on a map  
* Other visuals to round out the picture

In [18]:
Y=locations.country.value_counts().index[::-1]
X=locations.country.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Countries Around the World That Kiva Funds with Loans',
    width=800,
    height=1200,
    )
figure = go.Figure(data=[data], layout=layout)
ply.iplot(figure, filename="LoansbyKiva")

## Cumulative frequency
* Having fun with the data at hand  

In [19]:
data = [dict(
  type = 'scatter',
  x = X,
  y = Y,
  mode = 'markers',
  transforms = [dict(
    type = 'groupby',
    groups = X
  )]
)]

ply.iplot({'data': data}, validate=False)
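
For reference, an actual cumulative frequency can be computed with pandas before plotting. This is a minimal sketch on toy data, where `toy_country` is a hypothetical stand-in for `locations.country`:

```python
import pandas as pd

# Hypothetical country column standing in for locations.country.
toy_country = pd.Series(["Kenya", "Kenya", "Kenya", "Peru", "Peru", "Mali"])

counts = toy_country.value_counts()            # sorted from most to least frequent
cumulative = counts.cumsum()                   # running total of rows
cumulative_share = cumulative / counts.sum()   # running share between 0 and 1

print(pd.DataFrame({
    "count": counts,
    "cumulative": cumulative,
    "share": cumulative_share,
}))
```

The `cumulative_share` column shows how quickly a handful of countries account for most of the rows.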

## Bar Chart for Regions
* **Kiva** funds loans in six regions around the world

In [20]:
Y=locations.world_region.value_counts().index[::-1]
X=locations.world_region.value_counts().values[::-1]
data = go.Bar(
    x = Y,
    y = X,
    orientation = 'v',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Regions Kiva Fund',
    width=700,
    height=500,
    )
figure = go.Figure(data=[data], layout=layout)
ply.iplot(figure, filename="PerRegionLoans")

## Grouping number of loans by country
* Arrange data by Country and number of loans
* Plotting geographical representation of **Kiva** loans per country 

In [21]:
# Count rows per country, then name the two columns explicitly
map_df = locations['country'].value_counts().reset_index()
map_df.columns = ['country', 'loans']

In [22]:
data = [ dict(
        type = 'choropleth',
        locations = map_df['country'],
        locationmode = 'country names',
        z = map_df['loans'],
        text = map_df['country'],
        colorscale = [[0,"rgb(5, 50, 172)"],[0.85,"rgb(40, 100, 190)"],[0.9,"rgb(70, 140, 245)"],
            [0.94,"rgb(90, 160, 245)"],[0.97,"rgb(106, 177, 247)"],[1,"rgb(220, 250, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'Number of Loans'),
      ) ]

layout = dict(
    title = 'Number of Loans Per Country',
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection = dict(
            type = 'mercator'
        )
    )
)

figure = dict( data=data, layout=layout )
ply.iplot(figure, validate=False, filename='countryandloans')

## Multidimensional Poverty Index (MPI) for each Country
* Box plots of MPI per country to spot the median, the spread, and outliers for each country

In [23]:
trace = []
for name, group in locations.groupby("country"):

    trace.append ( 
        go.Box(
            x=group["MPI"].values,
            name=name
        )
    )
layout = go.Layout(
    title='Multidimensional Poverty Index (MPI) for each Country',
    width = 1000,
    height = 2000
)
figure = go.Figure(data=trace, layout=layout)
ply.iplot(figure, filename="ContryMPIndex")

## Multidimensional Poverty Index (MPI) for each World Region
* Box plots of MPI per region to spot the median, the spread, and outliers for each region

In [24]:
trace = []
for name, group in locations.groupby("world_region"):

    trace.append ( 
        go.Box(
            y=group["MPI"].values,
            name=name
        )
    )
layout = go.Layout(
    title='Multidimensional Poverty Index(MPI) for each Region',
    width = 750,
    height = 800,
)
figure = go.Figure(data=trace, layout=layout)
ply.iplot(figure, filename="WorldRegionMPI")
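
The numbers behind the box plots can also be tabulated directly. The sketch below uses toy data; `toy_locations` and the helper `count_outliers` are hypothetical names, and the 1.5 × IQR rule is the common box plot convention, so the counts may differ slightly from what Plotly draws.

```python
import pandas as pd

# Toy stand-in for the locations data; column names are assumed.
toy_locations = pd.DataFrame({
    "world_region": ["A"] * 6 + ["B"] * 4,
    "MPI": [0.10, 0.12, 0.11, 0.13, 0.12, 0.60, 0.30, 0.32, 0.31, 0.33],
})

def count_outliers(values):
    """Count points more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# One row per region with the median, mean, and number of outliers.
mpi_summary = toy_locations.groupby("world_region")["MPI"].agg(
    median="median", mean="mean", outliers=count_outliers
)
print(mpi_summary)
```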

# Loans dataset
* Exploring the loans dataset, the types of loans, and more
* This part is about visuals as well
___
### First, load the data as usual

In [25]:
loans = pd.read_csv('../input/kiva_loans.csv')

### Let's view the head to see what the data looks like

In [26]:
loans.head()

### Loans by Activity
* Let's see the activities the loans are used for

In [27]:
Y=loans.activity.value_counts().index[::-1]
X=loans.activity.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Loans by Activity',
    width=850,
    height=1000,
    )
figure = go.Figure(data=[data], layout=layout)
ply.iplot(figure, filename="LoansSeries")

### Loans by Sector 
Let's see which sectors the loans fall into.

In [29]:
Y=loans.sector.value_counts().index[::-1]
X=loans.sector.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Loans by Sector',
    width=900,
    height=600,
    )
figure = go.Figure(data=[data], layout=layout)
ply.iplot(figure, filename="SectorLoans")

### Let's see how the loans are used
* The word cloud below shows the most common words in the stated uses of **Kiva** loans

In [30]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1440, height=1080).generate(" ".join(loans.use.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(wordcloud)
plt.axis('off')

# Time series

In [20]:
loans.info()

In [21]:
loans.date = pd.to_datetime(loans.date)
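
Once the date column is a datetime, loans can be aggregated per month. This is a minimal sketch on toy data; `toy_loans_ts` is a hypothetical frame and the `loan_amount` column name is an assumption about the loans file.

```python
import pandas as pd

# Toy stand-in for the loans data with a date and an amount per loan.
toy_loans_ts = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-20", "2017-02-11", "2017-03-05", "2017-03-28"],
    "loan_amount": [300.0, 500.0, 250.0, 1000.0, 400.0],
})
toy_loans_ts["date"] = pd.to_datetime(toy_loans_ts["date"])

# Group by calendar month, then compute the loan count and the total amount.
month = toy_loans_ts["date"].dt.to_period("M")
monthly = toy_loans_ts.groupby(month)["loan_amount"].agg(
    loans="count", total_amount="sum"
)
print(monthly)
```

Grouping on `dt.to_period("M")` avoids depending on a particular resampling alias, which has changed between pandas versions.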

### Loan regions as a word cloud
* Regions represented on a word cloud

In [31]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1440, height=1080).generate(" ".join(loans.region.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(wordcloud)
plt.axis('off')

### Repayment of loans
* Number of loans per repayment interval

In [33]:
# borrower_genders 	repayment_interval
Y=loans.repayment_interval.value_counts().index[::-1]
X=loans.repayment_interval.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Loans by Repayment Interval',
    width=900,
    height=600,
    )
fig = go.Figure(data=[data], layout=layout)
ply.iplot(fig, filename="Gender")

## Cleaning Gender 
* Normalizing the gender column, which can hold a comma-separated list with one entry per borrower

In [34]:
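placeholder = list(loans.borrower_genders)  # keep the raw values for gender_rank below
# Use `in` for the substring test: str.find returns -1 (truthy) when nothing matches.
# Entries mentioning a female borrower become 'female'; all others, including NaN, become 'male'.
loans.borrower_genders = ['female' if 'female' in str(gender) else 'male' for gender in loans.borrower_genders]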

### Checking gender in the dataset

In [35]:
loans.head()

In [36]:
# Overlaid histograms of loan term and lender count for the first 1,000 loans.
# Histograms show raw counts by default, so histnorm is not set.
trace1 = go.Histogram(
    x=loans.term_in_months[:1000],
    opacity=0.75,
    name='Terms in Months'
)
trace2 = go.Histogram(
    x=loans.lender_count[:1000],
    opacity=0.75,
    name='Lender count'
)

data = [trace1, trace2]
layout = go.Layout(barmode='overlay')
figure = go.Figure(data=data, layout=layout)

ply.iplot(figure, filename='histogram')

In [37]:
from collections import Counter
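def gender_rank(text):
    """Return the most common gender in a comma-separated entry, or NaN if missing."""
    if text == 'male':
        return 'male'
    elif text == 'female':
        return 'female'
    else:
        # Pick the gender that appears most often in the list
        text = Counter(str(text).split(',')).most_common()[0][0]
        if text.replace(' ', '') == 'nan':
            return np.nan
        return text.replace(' ', '')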

In [38]:
d = [gender_rank(x) for x in placeholder]
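
An alternative labeling keeps mixed groups visible instead of folding them into one gender. This is only a sketch under assumptions: `toy_genders` imitates the comma-separated format of `borrower_genders`, and `gender_mix` is a hypothetical helper, not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Toy values imitating the comma-separated borrower_genders format.
toy_genders = pd.Series(
    ["female", "male", "female, female, male", "male, male", np.nan]
)

def gender_mix(value):
    """Label an entry as 'female', 'male', 'mixed', or NaN when missing."""
    if pd.isna(value):
        return np.nan
    members = {g.strip() for g in str(value).split(",")}
    if members == {"female"}:
        return "female"
    if members == {"male"}:
        return "male"
    return "mixed"

labels = toy_genders.map(gender_mix)
print(labels.value_counts(dropna=False))
```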

### Borrower Gender Statistics on Loans
* Plot the number of loans by borrower gender, followed by repayment intervals for each gender

In [39]:
# borrower_genders 	repayment_interval
Y=loans.borrower_genders.value_counts().index[::-1]
X=loans.borrower_genders.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Loans by Gender',
    width=700,
    height=300,
    )
figure = go.Figure(data=[data], layout=layout)
ply.iplot(figure, filename="GenderType")

### Repayment intervals for male borrowers

In [40]:
# borrower_genders 	repayment_interval
Y=loans[loans.borrower_genders == 'male'].repayment_interval.value_counts().index[::-1]
X=loans[loans.borrower_genders == 'male'].repayment_interval.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Repayment Intervals of Loans to Male Borrowers',
    width=700,
    height=300,
    )
fig = go.Figure(data=[data], layout=layout)
ply.iplot(fig, filename="GenderMale")

### Repayment intervals for female borrowers

In [41]:
# borrower_genders 	repayment_interval
Y=loans[loans.borrower_genders == 'female'].repayment_interval.value_counts().index[::-1]
X=loans[loans.borrower_genders == 'female'].repayment_interval.value_counts().values[::-1]
data = go.Bar(
    x = X,
    y = Y,
    orientation = 'h',
    marker=dict(
        color=X,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Repayment Intervals of Loans to Female Borrowers',
    width=700,
    height=300,
    )
fig = go.Figure(data=[data], layout=layout)
ply.iplot(fig, filename="GenderFemale")
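
The two repayment charts can be compared in a single table with a normalized cross-tabulation. This is a minimal sketch on toy data; `toy_loans_rep` is a hypothetical frame using the same column names as the loans data.

```python
import pandas as pd

# Toy stand-in for the loans data after gender cleaning.
toy_loans_rep = pd.DataFrame({
    "borrower_genders": ["female", "female", "female", "male", "male", "female"],
    "repayment_interval": ["monthly", "irregular", "monthly", "bullet", "monthly", "irregular"],
})

# Each row sums to 1: the share of each repayment interval within a gender.
share = pd.crosstab(
    toy_loans_rep["borrower_genders"],
    toy_loans_rep["repayment_interval"],
    normalize="index",
)
print(share.round(2))
```

Row-wise shares make the genders comparable even when one group has far more loans than the other.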

In [42]:
X=loans[loans.borrower_genders == 'male'].sector.value_counts()
trace = go.Pie(labels=X.index, values=X.values)
ply.iplot([trace], filename='basic_pie_chart')

In [43]:
X=loans[loans.borrower_genders == 'female'].sector.value_counts()
trace = go.Pie(labels=X.index, values=X.values)
ply.iplot([trace], filename='basic_pie_chart')

In [44]:
loans[['country', 'region', 'currency', 'borrower_genders', 'repayment_interval', 'activity']].groupby('country').head()

# Loan themes by region
* Loan themes broken down by region

In [45]:
region = pd.read_csv('../input/loan_themes_by_region.csv')

### Let's see the data

In [46]:
region.head()

### Regions Analysis 

In [47]:
# Count rows per country, then name the two columns explicitly
region_df = region['country'].value_counts().reset_index()
region_df.columns = ['country', 'loans']

In [48]:
data = [ dict(
        type = 'choropleth',
        locations = region_df['country'],
        locationmode = 'country names',
        z = region_df['loans'],
        text = region_df['country'],
        colorscale = [[0,"rgb(5, 50, 172)"],[0.85,"rgb(40, 100, 190)"],[0.9,"rgb(70, 140, 245)"],
            [0.94,"rgb(90, 160, 245)"],[0.97,"rgb(106, 177, 247)"],[1,"rgb(220, 250, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'Number of Loans'),
      ) ]

layout = dict(
    title = 'Loans by Country',
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection = dict(
            type = 'mercator'
        )
    )
)

figure = dict( data=data, layout=layout )
ply.iplot( figure, validate=False, filename='regionship')

## World of Scatter

In [49]:
# Count rows per country once and reuse the result
country_counts = region['country'].value_counts()
trace0 = go.Scatter(
    x=country_counts.index,
    y=country_counts.values,
    text=country_counts.index,
    mode='markers',
    marker=dict(
        color=country_counts.values,  # color each marker by its count
        colorscale='Jet',
        showscale=True,
        size=[i/5 if i < 550 else i/50 for i in country_counts.values],
    )
)

data = [trace0]  # a plain list of traces works; go.Data is deprecated
ply.iplot(data, filename='mpl-7d-bubble')

## Another dimension

In [50]:
# Same bubble chart with larger markers for the smaller counts
country_counts = region['country'].value_counts()
trace0 = go.Scatter(
    x=country_counts.index,
    y=country_counts.values,
    text=country_counts.index,
    mode='markers',
    marker=dict(
        color=country_counts.values,  # color each marker by its count
        colorscale='Jet',
        showscale=True,
        size=[i/2 if i < 550 else i/50 for i in country_counts.values],
    )
)

data = [trace0]  # a plain list of traces works; go.Data is deprecated
ply.iplot(data, filename='mplbubble')

## Theme ids 

In [51]:
themeids = pd.read_csv('../input/loan_theme_ids.csv')

### Let's see the first few lines

In [52]:
themeids.head()

## Loan Theme Type
* Number of entries per loan theme type

In [53]:
X = themeids['Loan Theme Type'].value_counts().index[::-1]
Y = themeids['Loan Theme Type'].value_counts().values[::-1]
data = go.Bar(
    x = Y,
    y = X,
    orientation = 'h',
    marker=dict(
        color=Y,
        colorscale = 'Jet',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Loans by Loan Theme Type',
    width=900,
    height=1000,
    )
fig = go.Figure(data=[data], layout=layout)
ply.iplot(fig, filename="Loans")
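
A possible next step is attaching theme types to individual loans. The sketch below uses toy data and assumes both files share an `id` key; `toy_loans_small` and `toy_themes` are hypothetical stand-ins, so check the real column names before reusing it.

```python
import pandas as pd

# Toy stand-ins for the loans file and the theme ids file; the shared `id`
# key and the column names are assumptions for illustration.
toy_loans_small = pd.DataFrame({
    "id": [10, 11, 12],
    "sector": ["Agriculture", "Retail", "Food"],
})
toy_themes = pd.DataFrame({
    "id": [10, 11],
    "Loan Theme Type": ["General", "Underserved"],
})

# Left join keeps every loan; loans without a theme record get NaN.
themed = toy_loans_small.merge(toy_themes, on="id", how="left")

# Share of loans that received a theme type after the join.
matched_share = themed["Loan Theme Type"].notna().mean()
print(themed)
print(f"matched: {matched_share:.0%}")
```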

# Don't forget to like the notebook. Cheers! Keep visiting, more analysis is coming.