![Title](https://financialaid.ucsc.edu/images/yourstory_loans.jpg)

# Introduction
***

## About KIVA
[Kiva](https://www.kaggle.com/kiva) is an online crowdfunding platform to extend financial services to poor and financially excluded people around the world. Kiva lenders have provided over 1 billion dollars in loans to over 2 million people. In order to set investment priorities, help inform lenders, and understand their target communities, knowing the level of poverty of each borrower is critical. However, this requires inference based on a limited set of information for each borrower.

## Notebook Goals
***
   > This notebook aims to accomplish the following: 
* Do an extensive exploratory data analysis (EDA)
* Apply well-thought-out visualizations that tell a story through the data. 


***
If there are any recommendations or changes you would like to see in this notebook, please feel free to leave a comment. Any feedback or constructive criticism would be truly appreciated. **This notebook will always be a work in progress, so please stay tuned for more to come.**

**If you like this notebook, you can upvote/leave a comment.**

# Loading Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import plotly
import plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.tools as tls
import plotly.graph_objs as go
import plotly.figure_factory as fig_fact
plotly.tools.set_config_file(world_readable=True, sharing='public')

# import pyplot. 
import matplotlib.pyplot as plt

## in order to show more columns. 
pd.options.display.max_columns = 999

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory


import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Loading Datasets

In [None]:
## From Data Science for good: Kiva funding. 
kiva_loans = pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv')
loan_themes = pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv')
mpi_region_locations = pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/kiva_mpi_region_locations.csv')
theme_id = pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/loan_theme_ids.csv')

## From additional data sources. 
country_stats = pd.read_csv('../input/additional-kiva-snapshot/country_stats.csv')
#all_loans = pd.read_csv('../input/additional-kiva-snapshot/loans.csv')
lenders = pd.read_csv('../input/additional-kiva-snapshot/lenders.csv')
loan_coords = pd.read_csv('../input/additional-kiva-snapshot/loan_coords.csv')
locations = pd.read_csv('../input/additional-kiva-snapshot/locations.csv')

##mpi
mpi_national = pd.read_csv('../input/mpi/MPI_national.csv')
mpi_subnational = pd.read_csv('../input/mpi/MPI_subnational.csv')

#all_data = [kiva_loans, loan_themes, mpi_region_locations, theme_id, country_stats, loans, lenders, loan_coords, locations]

# A Glimpse of the Datasets and their Missing Values
***
## ***Data Science for Good Datasets***

>**KIVA Loans Dataset**

In [None]:
kiva_loans.head()

**> Missing values in *kiva_loans* dataset**

In [None]:
total = kiva_loans.isnull().sum()[kiva_loans.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(kiva_loans)*100,4))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])
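
The same three lines are repeated for every dataset below. As a minimal sketch, the logic could be wrapped in a small helper so the null counts are computed only once per table. The name `missing_summary` and the toy frame are hypothetical, used here only for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df, decimals=2):
    """Return per-column missing counts and percentages, largest first.

    `missing_summary` is a hypothetical helper name, not part of the notebook.
    """
    counts = df.isnull().sum()  # compute the null counts only once
    counts = counts[counts != 0].sort_values(ascending=False)
    percent = (counts / len(df) * 100).round(decimals)
    return pd.concat([counts, percent], axis=1, keys=['total_missing', 'percent'])


# Toy frame standing in for any of the datasets loaded above.
toy = pd.DataFrame({
    'a': [1, np.nan, 3, np.nan],
    'b': [1, 2, 3, 4],
    'c': [np.nan, 'x', 'y', 'z'],
})
summary = missing_summary(toy)
print(summary)
```

With a helper like this, each of the cells below would reduce to a single call such as `missing_summary(loan_themes)`.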

>**Loan Themes Dataset**

In [None]:
loan_themes.head()

**> Missing values in *loan_themes* dataset**

In [None]:
total = loan_themes.isnull().sum()[loan_themes.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(loan_themes)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

>**MPI Region Locations Dataset**

In [None]:
mpi_region_locations.head()

**> Missing values in *mpi_region_locations* dataset**

In [None]:
total = mpi_region_locations.isnull().sum()[mpi_region_locations.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(mpi_region_locations)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

>**Theme ID Dataset**

In [None]:
theme_id.columns = ['id','loan_theme_id','loan_theme_type','partner_id']

In [None]:
theme_id.head()

**> Missing values in *theme_id* dataset**

In [None]:
total = theme_id.isnull().sum()[theme_id.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(theme_id)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

## ***Additional Datasets***


>**Country Stats Dataset**

In [None]:
country_stats.head()

In [None]:
total = country_stats.isnull().sum()[country_stats.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(country_stats)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

>**Lenders Dataset**

In [None]:
lenders.head()

**> Missing values in *lenders* dataset**

In [None]:
total = lenders.isnull().sum()[lenders.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(lenders)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

>**Loan Coordinates Dataset**

In [None]:
loan_coords.head()

**> Missing values in *loan_coords* dataset**

In [None]:
total = loan_coords.isnull().sum()[loan_coords.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(loan_coords)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

**There are no missing values in *loan_coords* dataset**

>**Locations Dataset**

In [None]:
locations.head()

**> Missing values in *locations* dataset**

In [None]:
total = locations.isnull().sum()[locations.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(locations)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

## ***Multidimensional Poverty Measures Datasets***

>**MPI National Dataset**

In [None]:
mpi_national.head()

**> Missing values in *mpi_national* dataset**

In [None]:
total = mpi_national.isnull().sum()[mpi_national.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(mpi_national)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

**There are no missing values in *mpi_national* dataset**

>**MPI Subnational Dataset**

In [None]:
mpi_subnational.columns = ['ISO_country_code','Country','Sub_national_region','world_region','MPI_national','MPI_regional','Headcount_ratio_regional','intensit_of_deprivation_regional']
mpi_subnational.head()

**> Missing values in *mpi_subnational* dataset**

In [None]:
total = mpi_subnational.isnull().sum()[mpi_subnational.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(mpi_subnational)*100,2))
pd.concat([total, percent], axis=1, keys=['total_missing', 'percent'])

**There is only 1 missing value in *mpi_subnational* dataset**

# Exploratory Data Analysis (EDA)
## An Overview of the features
***
**Countries with the Most Loans**

In [None]:
top_countries = kiva_loans.country.value_counts().head(20)

data = [go.Bar(
    x=top_countries.index,
    y=top_countries.values,
    width = [1.1],## customizing the width.
    marker = dict(
        color=['green',]),
    )]
layout = go.Layout(
    title = "Countries with Most Loans",
    xaxis = dict(
        title = "Countries"
    ),
    yaxis = dict(
        title = 'Loans',
        autorange = True,
        autotick = True,
        showgrid = True,
        showticklabels = True,
        tickformat = ',d'### Note for me: took me about an hour just to get this right. 
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

The **Philippines** is the country with the highest number of loans. 
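
The bar charts in this notebook highlight the leading bars by passing a short color list to `marker`. A sketch of one way to make that explicit is to build one color per bar, so the highlighted and non-highlighted bars are both stated. The helper name `highlight_colors` and the `'lightgray'` base color are my own assumptions, not something the charts above use.

```python
def highlight_colors(n_bars, n_highlight=1, highlight='green', base='lightgray'):
    """Return one color per bar: the first `n_highlight` stand out, the rest share `base`.

    Hypothetical helper for illustration only.
    """
    n_highlight = min(n_highlight, n_bars)  # never return more colors than bars
    return [highlight] * n_highlight + [base] * (n_bars - n_highlight)


colors = highlight_colors(5, n_highlight=2)
print(colors)
```

The resulting list can be passed as `marker = dict(color = colors)` to `go.Bar`.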

**Countries with Highest Funded Amounts($)**

In [None]:
top_funded_countries = kiva_loans.groupby(["country"])["funded_amount"].sum().sort_values(ascending = False).head(20)
##top_funded_countries_2 = pd.pivot_table(kiva_loans, index = 'country',values = 'funded_amount', aggfunc='sum').sort_values(by = "funded_amount", ascending = False).head(20) ##Alternative way to get the same info. 

data = [go.Bar(
    x=top_funded_countries.index, ## top_funded_countries_2.index
    y=top_funded_countries.values, ## top_funded_countries_2.funded_amount.values
    width = [1.1,],## customizing the width.
    marker = dict(
        color=['green',]),##makes the first bar green. 
    )]
layout = go.Layout(
    title = "Countries with Highest Funded Amounts",
    margin = go.Margin(b = 140,l = 95),
    xaxis = dict(
        title = "Countries"
    ),
    yaxis = dict(
        title = '$ amount',
        showgrid = True,
        ticks = 'inside'
        
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

The **Philippines** is also the country with the highest amount of loan funding in terms of dollars.

## Regions with Most Loans

In [None]:
top_regions = kiva_loans.region.value_counts().head(20)

data = [go.Bar(
    x=top_regions.index,
    y=top_regions.values,
    #width = [1.0,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7],## customizing the width.
    marker = dict(
        color=['green']),
    )]
layout = go.Layout(
    title = "Top Regions for Kiva Loans",
    margin = go.Margin(b = 150),
    xaxis = dict(
        title = "Regions"
    ),
    yaxis = dict( 
        title = 'Loan Counts',
        tickformat = ',d',
        ticks = 'inside'
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

**Kaduna, Nigeria** is the region with the highest number of loans.

## Top Sectors for Loans

In [None]:
sectors = kiva_loans.sector.value_counts()

data = [go.Bar(
    x=sectors.index,
    y=sectors.values,
    #width = [0.9,0.9,0.9,0.7,0.7,0.7,0.7,0.7,0.7,0.7],## customizing the width.
    marker = dict(
        color=['green', 'green', 'green']),
    )]
layout = go.Layout(
    title = "Sectors with Highest Loan Counts",
    xaxis = dict(
        title = "Sectors"
    ),
    yaxis = dict( 
        title = 'Loans',
        tickformat = ',d'
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

**Agriculture, Food, and Retail** are the sectors with the most loans.

## Top Activities

In [None]:
activities = kiva_loans.activity.value_counts().head(20)

data = [go.Bar(
    x=activities.index,
    y=activities.values,
    #width = [1.1, 1.1],## customizing the width.
    marker = dict(
        color=['green', 'green']),
    )]
layout = go.Layout(
    title = "Top activities of Loans",
    xaxis = dict(
        title = "Activities"
    ),
    yaxis = dict( 
        title = 'Loans', 
        tickformat = ',d'
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

**Farming & General Store** are the top activities done by loan borrowers. 

## Borrower's repayment interval

In [None]:
repayment = kiva_loans.repayment_interval.value_counts()

labels = repayment.index
values = repayment.values
colors = ['#FEBFB3', '#E1396C', '#96D38C', '#D0F9B1']

data = go.Pie(labels=labels, values=values,
               hoverinfo='label+percent', textinfo='percent',
               textfont=dict(size=20),
               marker=dict(colors=colors,
                           line=dict(color='#000000', width=2)))
layout = go.Layout(
    title = "Pie Chart for Repayment Interval",
)

fig = go.Figure(data = [data], layout = layout)
py.iplot(fig, filename='styled_pie_chart')

In [None]:
## Most common loan terms, in months.

terms = kiva_loans.term_in_months.value_counts()

data = [go.Bar(
    x=terms.index,
    y=terms.values,
    #width = [1.1, 1.1],## customizing the width.
    marker = dict(
        color=['green', 'green', 'green']),
    )]
layout = go.Layout(
    title = "Loan Term in Months",
    xaxis = dict(
        title = "Term (months)"
    ),
    yaxis = dict( 
        title = 'Loans', 
        tickformat = ',d'
    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='basic-bar')

## Top Uses

In [None]:
## Some entries are inconsistent and are effectively duplicates that differ only in upper/lower case. 
kiva_loans.use = kiva_loans.use.str.lower()
## Other entries differ only by a trailing ".", so I strip it. 
kiva_loans.use = kiva_loans.use.str.strip('.')
## It's always a good idea to remove extra white space, then strip any "." the white space was hiding. 
kiva_loans.use = kiva_loans.use.str.strip()
kiva_loans.use = kiva_loans.use.str.strip('.')



In [None]:
## There are different versions saying the same thing, so I have decided to merge them all together. 
kiva_loans.replace('to buy a water filter to provide safe drinking water for their family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter to provide safe drinking water for her family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter to provide safe drinking water for his family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter to provide safe drinking water for the family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter, to provide safe drinking water for her family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter, to provide safe drinking water for their family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter to provide safe drinking water for their families', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to purchase a water filter to provide safe drinking water for the family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter to provide safe drinking water', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to purchase a water filter to provide safe drinking water', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)
kiva_loans.replace('to buy a water filter in order to provide safe drinking water for their family', 'to buy a water filter to provide safe drinking water for his/her/their family', inplace = True)

In [None]:
uses = kiva_loans.use.value_counts().head(20)

data = [go.Bar(
    x=uses.index,
    y=uses.values,
    width = [1.0,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7],## customizing the width.
    marker = dict(
        color=['rgb(0, 200, 200)', 'black','black','black','black','black','black','black','black','black','black','black','black']),
    )]
layout = go.Layout(
    title = "Top Uses of the Loans",
    margin=go.Margin(b =270),## this is so we are able to read the labels in xaxis. 'b' stands for bottom, similarly left(l), 
                            ##right(r),top(t) 
    xaxis = dict(
        title = "Uses"
    ),
    yaxis = dict( 
        title = 'Loans',
        tickformat = ',d',

    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='horizontal-bar')

It is remarkable to see that almost 16,000 people took loans just to buy a filter so that they can get access to safe drinking water. I wonder which countries these people are from. 

In [None]:
kiva_loans[kiva_loans.use == "to buy a water filter to provide safe drinking water for his/her/their family"].country.value_counts()

It turns out that most of these loans were taken by people from Cambodia. That's quite interesting. Let's do a little research about this situation in Cambodia.  

In [None]:
loans_cambodia = kiva_loans[kiva_loans.country == 'Cambodia']

In [None]:
uses = loans_cambodia.use.value_counts().head(20)

data = [go.Bar(
    x=uses.index,
    y=uses.values,
    width = [1.0,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7],## customizing the width.
    marker = dict(
        color=['rgb(0, 200, 200)']),
    )]
layout = go.Layout(
    title = "Top Uses of the Loans in Cambodia",
    margin=go.Margin(b =270),## this is so we are able to read the labels in xaxis. 'b' stands for bottom, similarly left(l), 
                            ##right(r),top(t) 
    xaxis = dict(
        title = "Uses"
    ),
    yaxis = dict( 
        title = 'Loans',
        tickformat = ',d',

    )
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename='horizontal-bar')

In [None]:
from wordcloud import WordCloud

names = loans_cambodia["use"].dropna() ## drop missing uses from the Cambodia subset itself
#print(names)
wordcloud = WordCloud(max_font_size=90, width=800, height=300).generate(' '.join(names))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.title("What Drives People to Take Loans in Cambodia", fontsize=35)
plt.axis("off")
plt.show() 

In [None]:
from wordcloud import WordCloud

names = kiva_loans["use"][~pd.isnull(kiva_loans["use"])]
#print(names)
wordcloud = WordCloud(max_font_size=60, width=800, height=300).generate(' '.join(names))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.title("What Drives People to Take Loans", fontsize=35)
plt.axis("off")
plt.show() 

## Timeline of Kiva Funding

In [None]:
## changing the date columns from object to datetime64.  
print ("Given posted_time type: {}".format(kiva_loans.posted_time.dtype))
## modifying the date type so that we can access day, month or year for analysis.
kiva_loans['posted_time'] = pd.to_datetime(kiva_loans['posted_time'], format = "%Y-%m-%d %H:%M:%S", errors='ignore')
kiva_loans['disbursed_time'] = pd.to_datetime(kiva_loans['disbursed_time'], format = "%Y-%m-%d %H:%M:%S", errors='ignore')
kiva_loans['funded_time'] = pd.to_datetime(kiva_loans['funded_time'], format = "%Y-%m-%d %H:%M:%S", errors='ignore')
print ("Modified posted_time type: {}".format( kiva_loans.posted_time.dtype))
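
With `errors='ignore'`, invalid parsing returns the input unchanged, so a column can silently stay as `object` and the `.dt` accessor used in the next cells would then fail. A hedged sketch on toy data shows the alternative of `errors='coerce'`, which turns unparseable entries into `NaT` instead. The sample strings, including the UTC offset, are my assumption about what the raw values might look like.

```python
import pandas as pd

# Toy values; the '+00:00' offset is an assumed format, not confirmed from the files.
raw = pd.Series(['2014-01-01 06:12:39+00:00', '2016-12-31 23:59:59+00:00', 'not a date'])

# Unparseable entries become NaT rather than blocking the whole conversion.
parsed = pd.to_datetime(raw, errors='coerce', utc=True)

print(parsed.dtype)
print("Unparseable rows:", parsed.isnull().sum())
print(parsed.dt.year.value_counts().sort_index())
```

Counting the `NaT` values afterwards gives a quick check on how many rows failed to parse.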

In [None]:

loan_each_day_of_month = kiva_loans.funded_time.dt.day.value_counts().sort_index()

data = [go.Scatter(
    x=loan_each_day_of_month.index,
    y=loan_each_day_of_month.values,
    mode = 'lines+markers'
)]
layout = go.Layout(
    title = "Loans Funded on Each Day of the Month",
    xaxis = dict(
        title = "Day of month",
        autotick = False
    ),
    yaxis = dict(
        title = 'Loans',
        tickformat = ',d'
    )
    
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'scatter_funded_amount')

More loans were funded during the **second half of the month**.
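
A quick way to check this kind of observation is to sum the counts on either side of the 15th. The sketch below uses made-up counts in place of `loan_each_day_of_month`, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy counts indexed by day of month, standing in for loan_each_day_of_month.
day_counts = pd.Series([10, 12, 20, 25], index=[1, 15, 16, 31])

first_half = day_counts[day_counts.index <= 15].sum()
second_half = day_counts[day_counts.index > 15].sum()
print("Days 1-15:", first_half)
print("Days 16-31:", second_half)
```

Since days 29 to 31 do not occur in every month, comparing per-day averages may be a fairer test than comparing raw totals.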

In [None]:
loan_each_month = kiva_loans.funded_time.dt.month.value_counts().sort_index()

data = [go.Scatter(
    x=loan_each_month.index,
    y=loan_each_month.values,
    mode = 'lines+markers'
)]
layout = go.Layout(
    title = "Loans Funded in Each Month",
    xaxis = dict(
        title = "Months",
        autotick = False
    ),
    yaxis = dict(
        title = 'Loans',
        tickformat = ',d'
    )
    
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'scatter_funded_amount')



Kiva is most active during the **end of the year.** 

In [None]:
loan_each_year = kiva_loans.funded_time.dt.year.value_counts().sort_index()

data = [go.Scatter(
    x=loan_each_year.index,
    y=loan_each_year.values,
    mode = 'lines+markers'
)]
layout = go.Layout(
    title = "Loans Funded in Each Year",
    xaxis = dict(
        title = "years",
        autotick = False
    ),
    yaxis = dict(
        title = 'loans',
        tickformat = ',d'
    )
    
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'scatter_funded_amount')

Kiva has been working hard: the highest yearly total is **181,782 loans funded in 2016**. 

## Let's Dive a Little Deeper into the Philippines
***
**This part is a work in progress**
***

In [None]:
## Let's Dive into Philippines
phili_kiva = kiva_loans[kiva_loans.country == "Philippines"]

In [None]:
print ("Total loans: {}".format(len(phili_kiva)))
print("Total funded amount: ${}".format(phili_kiva.funded_amount.sum()))
print("Total loan amount: ${}".format(phili_kiva.loan_amount.sum()))

In [None]:
loan_per_day = phili_kiva.date.value_counts(sort = False)

data = [go.Scatter(
    x=loan_per_day.index,
    y=loan_per_day.values,
    mode = 'markers'
)]
layout = go.Layout(
    title = "Kiva Loans per Day in the Philippines",
    xaxis = dict(
        title = "date"
    ),
    yaxis = dict(
        title = 'Loans' ## value_counts gives the number of loans per date, not a dollar amount
    )
    
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'scatter_funded_amount')

In [None]:
phili_kiva.groupby(['sector'])['funded_amount'].mean().sort_values(ascending = False)
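
A mean on its own can be pulled up by a handful of very large loans. As a sketch, reporting the count and median next to the mean makes that visible. The toy frame below is invented for illustration and only stands in for `phili_kiva`.

```python
import pandas as pd

# Toy data standing in for phili_kiva; one large Retail loan skews the mean.
toy = pd.DataFrame({
    'sector': ['Agriculture', 'Agriculture', 'Retail', 'Retail', 'Retail'],
    'funded_amount': [100.0, 300.0, 50.0, 50.0, 2000.0],
})

stats = (toy.groupby('sector')['funded_amount']
            .agg(['count', 'mean', 'median'])
            .sort_values('mean', ascending=False))
print(stats)
```

In the toy data Retail ranks first by mean even though its typical (median) loan is the smallest, which is the kind of gap worth checking in the real table.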

# Working with Geo Data
***
**This part is a work in progress**




In [None]:
## Merging the two tables on "kiva_loans.id = loan_coords.loan_id" 
kiva_loans = kiva_loans.merge(loan_coords, how = 'left', left_on = 'id' ,right_on ='loan_id' )

In [None]:
## I am really into working with geo data. So, Let's see if we are missing any lon/lats at this point.
missing_geo = kiva_loans[kiva_loans.latitude.isnull()]
print ("Total missing Latitudes and Longitudes are: {}".format(len(missing_geo)))

It looks like we are missing a lot of geo (latitude, longitude) data. We can use other data sources to fill in the missing latitudes and longitudes. A couple of notes:
* Let's use the **id** column as the key to join kiva_loans with the other tables in order to get geo data points. 
* If there is no **id** column, I will go with the **region** column to pinpoint the location. 
* If some coordinates are still missing after that, I will use the **country** column to fill in the rest. 
* We can also try the **loan_themes** dataset, joined on the region column. It might not be accurate, but it is worth trying. 
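
The fallback order in the list above (id, then region, then country) can be sketched on toy tables. Everything below is hypothetical: the lookup frames `region_geo` and `country_geo`, the helper `fill_from`, and the coordinates are made up for illustration. The helper also assumes each key appears only once in its lookup table.

```python
import pandas as pd

loans = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'region': ['Kaduna', 'Lahore', 'Kitale', 'Kaduna'],
    'country': ['Nigeria', 'Pakistan', 'Kenya', 'Nigeria'],
})
coords = pd.DataFrame({'loan_id': [1], 'latitude': [10.52], 'longitude': [7.44]})
region_geo = pd.DataFrame({'region': ['Lahore'], 'latitude': [31.55], 'longitude': [74.34]})
country_geo = pd.DataFrame({'country': ['Kenya', 'Nigeria'],
                            'latitude': [0.02, 9.08], 'longitude': [37.91, 8.68]})

# Step 1: exact match on the loan id.
geo = loans.merge(coords, how='left', left_on='id', right_on='loan_id').drop(columns='loan_id')


def fill_from(df, lookup, key):
    """Fill missing latitude/longitude from `lookup`, matched on `key`.

    Assumes `key` is unique in `lookup`, so the left merge keeps one row per loan.
    """
    matched = df[[key]].merge(lookup, how='left', on=key)
    for col in ['latitude', 'longitude']:
        df[col] = df[col].fillna(pd.Series(matched[col].values, index=df.index))
    return df


# Step 2: fall back to the region, then Step 3: fall back to the country.
geo = fill_from(geo, region_geo, 'region')
geo = fill_from(geo, country_geo, 'country')
print(geo)
```

Each step only touches rows that are still missing coordinates, so a more precise match is never overwritten by a coarser one.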

In [None]:
## Extracting region from formatted_address column. 
locations['region'] = [i.split(',')[0] for i in locations.formatted_address]

## Extracting country names from formatted_address column. 
## A plain list comprehension cannot fall back when an address has no comma, so I have used a good old for loop. 
a = []
for i in locations.formatted_address:
    try:
        a.append((i.split(',')[1]).strip())
    except IndexError:
        a.append((i.split(',')[0]).strip())
        
locations['country'] = a
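
The same extraction can also be written without a loop by using the pandas string accessor. This is only a sketch on made-up addresses. It mirrors the loop above, including the choice to take the second comma-separated piece as the country even when the address has more than two parts.

```python
import pandas as pd

# Toy addresses standing in for locations.formatted_address
addresses = pd.Series(['Kaduna, Nigeria', 'Lahore, Punjab, Pakistan', 'Cambodia'])

parts = addresses.str.split(',')
region = parts.str[0].str.strip()
# .str[1] yields NaN when there is no second piece, so fall back to the first one.
country = parts.str[1].str.strip().fillna(region)

print(pd.DataFrame({'region': region, 'country': country}))
```

Taking the last piece (`parts.str[-1]`) instead might be closer to the actual country for longer addresses, but that would change what the original cell computes.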

***
If you would like to discuss any other projects or just have a chat about data science topics, I'll be more than happy to connect with you on:

**LinkedIn:** https://www.linkedin.com/in/masumrumi/ 

**My Website:** http://masumrumi.strikingly.com/ 

***This kernel will always be a work in progress. I will incorporate new concepts of data science as I comprehend them with each update. If you have any ideas or suggestions about this notebook, please let me know. Any feedback about further improvements would be genuinely appreciated.***
***
### If you have come this far, Congratulations!!

### If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :) 

