**Background: Kiva is an online crowdfunding platform that provides financial services and loans to financially excluded people around the world. Kiva has provided over 1 billion dollars in loans to over 2 million people.**

**Objective: Help Kiva prioritise investments, inform lenders and understand its target communities by estimating the poverty level of each borrower, with the aim of building models that assess borrower welfare levels.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import geopandas as gpd
import squarify
%matplotlib inline
sns.set(style="darkgrid")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
#import dataset
loans = pd.read_csv("../input/kiva_loans.csv")
mpi = pd.read_csv("../input/kiva_mpi_region_locations.csv")
loan_theme = pd.read_csv("../input/loan_theme_ids.csv")
loan_region = pd.read_csv("../input/loan_themes_by_region.csv")

In [None]:
loans.head()

In [None]:
loans.info()

In [None]:
loans['Posted_Date'],loans["Posted_Time"] = zip(*loans["posted_time"].map(lambda x: x.split(' ')))


In [None]:
loans= loans.drop('posted_time',axis=1)

In [None]:
loans = loans[['id','funded_amount','loan_amount','activity','sector','use','country_code',
                    'country','region','currency','partner_id','Posted_Date','Posted_Time','disbursed_time',
                    'funded_time','term_in_months','lender_count','borrower_genders','repayment_interval']]

In [None]:
loans.describe()

In [None]:
#convert id and partner id to object type
loans['id'] = loans['id'].astype('object')
loans['partner_id'] = loans['partner_id'].astype('object')

In [None]:
# Rename columns only; passing index=str would also turn the row labels into strings
loans.rename(columns = {'id' :'loan_id','borrower_genders':'genders'},inplace=True)

In [None]:
f, ax = plt.subplots(ncols = 1)
# Draw both densities on the same axes; `shade` expects a boolean, not the string 'True'
sns.kdeplot(loans['loan_amount'], shade = True, color = "red", ax = ax)
sns.kdeplot(loans['funded_amount'], shade = True, color = "blue", ax = ax)
ax.set_xlim(0,100000)
ax.set_title("Distribution of Loan Amount and Funded Amount",fontsize = 20)
f.set_size_inches(15,10)

**The density plot above shows that most loans fall roughly between 200 and 15,000. Very few loans lie in the 20,000 to 90,000 range, and a small number sit between about 95,000 and 100,000. What could explain these large loans, and which activities are they for? We will find out later in the analysis. The funded amount is also sometimes lower than the loan amount.**
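
**The observation that the funded amount is sometimes lower than the loan amount can be quantified directly. A minimal sketch on made-up rows (the `toy_loans` frame and the `funding_gap` column are assumptions for illustration, not part of the dataset):**

```python
import pandas as pd

# Made-up rows that mimic the two amount columns of kiva_loans.csv
toy_loans = pd.DataFrame({
    "loan_amount":   [300, 1500, 5000, 250, 10000],
    "funded_amount": [300, 1500, 3200, 250, 7500],
})

# Shortfall per loan: positive when the loan was not fully funded
toy_loans["funding_gap"] = toy_loans["loan_amount"] - toy_loans["funded_amount"]
underfunded = toy_loans["funding_gap"] > 0

share_underfunded = underfunded.mean()  # fraction of loans with a gap
total_gap = toy_loans.loc[underfunded, "funding_gap"].sum()
print(f"{share_underfunded:.0%} of loans underfunded, total gap {total_gap}")
```

**The same two lines should work on the real `loans` frame, assuming both columns are numeric.**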

**Let's check which activities people are borrowing money for.**

In [None]:
loans['activity'].unique()

In [None]:
activity = pd.DataFrame(loans.groupby(['activity'])['loan_amount'].sum()).reset_index()

In [None]:
activity.sort_values(by = 'loan_amount',ascending = False,inplace = True)

In [None]:
top_10_activity = activity.head(10)
top_10_activity

In [None]:
bottom_10_activity = activity.tail(10)

In [None]:
# Two subplots
f, (ax1, ax2) = plt.subplots(ncols=2, sharey=False)
sns.barplot(y = 'activity',x = 'loan_amount', data = top_10_activity,ax=ax1)
ax1.set(xlabel = "Loan Amount in Crores")
ax1.set_title("Top 10 activities",fontsize = 20)
sns.barplot(y = 'activity',x = 'loan_amount', data = bottom_10_activity,ax=ax2)
ax2.set(xlabel = "Loan Amount in Thousands")
ax2.set_title("Bottom 10 activities",fontsize = 20)
f.set_size_inches(15, 11)

**From the above graph, we can infer that loans are taken mostly for Farming, Agriculture and Livestock activities, i.e. to sustain a livelihood. Only very small amounts are borrowed for celebrations or personal care.**
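
**The top and bottom activities can also be pulled out without sorting the whole table, and expressed as a share of the total. A small sketch on invented numbers (the `toy` frame is an assumption; only the column names follow the dataset):**

```python
import pandas as pd

# Invented loans; only the column names match kiva_loans.csv
toy = pd.DataFrame({
    "activity": ["Farming", "Farming", "Retail", "Weddings",
                 "Retail", "Personal Care Products"],
    "loan_amount": [500, 700, 400, 150, 300, 50],
})

# Total amount borrowed per activity
totals = toy.groupby("activity")["loan_amount"].sum()

top = totals.nlargest(2)       # activities with the largest totals
bottom = totals.nsmallest(2)   # activities with the smallest totals
share = totals / totals.sum()  # each activity's share of all money lent

print(top)
print(bottom)
print(share.round(3))
```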

**I am curious about the distribution of individual loans taken for farming. Let's see.**

In [None]:
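# Select farming loans in one step; avoids shadowing the built-in `filter`
# and the SettingWithCopyWarning raised by an in-place sort on a slice
farming = loans.loc[loans['activity'] == 'Farming', ['loan_amount','activity']]
farming = farming.sort_values(by='loan_amount',ascending=False)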

In [None]:
f,ax = plt.subplots(ncols=1)
ax = sns.kdeplot(farming['loan_amount'],shade=True)
ax.set_title ('Loan amount taken for Farming',fontsize = 20)
f.set_size_inches(15,10)

**Though Farming has the highest cumulative loan amount, the individual farming loans are small. Most are between 25 and 5,000, and just 2 loans are above 25k (both below 50k).**
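
**Ranges like these are easier to state precisely with quantiles than by reading them off a density plot. A sketch on made-up amounts (`farming_amounts` is a hypothetical stand-in for `farming['loan_amount']`):**

```python
import pandas as pd

# Made-up farming loan amounts, including one large outlier
farming_amounts = pd.Series(
    [25, 200, 350, 500, 800, 1200, 2500, 5000, 30000], name="loan_amount"
)

# Quartiles describe where the bulk of the loans sit
quantiles = farming_amounts.quantile([0.25, 0.5, 0.75])

# Fraction of loans at or below a chosen threshold
share_at_most_5000 = (farming_amounts <= 5000).mean()

print(quantiles)
print(f"{share_at_most_5000:.1%} of loans are 5000 or less")
```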

In [None]:
loans['sector'].unique()

In [None]:
sector = loans.groupby(['sector'])['loan_amount'].sum().reset_index()

In [None]:
sector.sort_values(by = 'loan_amount',ascending = False, inplace = True)

In [None]:
sector

In [None]:
f,ax = plt.subplots(ncols = 1)
sns.barplot(y = 'sector',x = 'loan_amount', data= sector)
ax.set_title ('Loan Amount taken sector-wise',fontsize = 20)
f.set_size_inches(15,10)

**Let us check the loan distribution across the top 3 sectors, i.e. Agriculture, Retail and Food.**

In [None]:
value = ['Agriculture','Food','Retail']
sector_1 = loans[loans['sector'].isin(value)]
sector_1.head()

In [None]:
f,ax = plt.subplots(ncols=1)
# Plot only the top 3 sectors selected in sector_1 above
sns.boxplot(x = 'sector', y = 'loan_amount',data=sector_1, ax=ax)
f.set_size_inches(30, 30)
# There is an outlier, a loan of 100000; I am ignoring it by reducing the y-axis limit
ax.set_ylim(0,50000)
ax.set_title('Sector-wise loan amount distribution',fontsize = 20)

In [None]:
f,ax = plt.subplots(ncols=1)
ax = plt.scatter(x='loan_amount',y='term_in_months',data=loans,marker = 'o',alpha = 0.1,color = 'red')
f.set_size_inches(15, 10)


**I expected loan_amount and the loan term (term_in_months) to be linearly related, but the scatter plot above shows no clear linear relationship between them.**
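
**A correlation coefficient gives a number to go with the visual impression. A sketch on invented values (the `toy` frame is an assumption): Pearson measures linear association, and Spearman is computed here as Pearson on ranks, which captures any monotonic trend.**

```python
import pandas as pd

# Invented loans; column names follow kiva_loans.csv
toy = pd.DataFrame({
    "loan_amount":    [200, 500, 1000, 2500, 5000, 10000],
    "term_in_months": [14, 8, 20, 6, 13, 11],
})

# Pearson correlation: strength of the linear relationship
pearson = toy["loan_amount"].corr(toy["term_in_months"])

# Spearman correlation, written out as Pearson on the ranks
spearman = toy["loan_amount"].rank().corr(toy["term_in_months"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

**Values close to 0 would support the reading of the scatter plot; values near 1 or -1 would contradict it.**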

In [None]:
f,ax = plt.subplots(ncols = 1)
ax = sns.countplot(x ='repayment_interval',data = loans)
ax.set_title('Count of Repayment Interval',fontsize = 20)
f.set_size_inches(15,10)

In [None]:
bullet = loans[loans['repayment_interval'] == 'bullet']
irregular = loans[loans['repayment_interval'] == 'irregular']

In [None]:
irregular_country = pd.DataFrame(irregular['country'].unique())
irregular_country.columns = ["Country"]
irregular_country.head(20)

**Most of the countries with irregular loan repayment intervals are underdeveloped or developing countries.**

**Let us check the total loan amount taken by each country.**

In [None]:
country_loan = loans.groupby(['country'])['loan_amount'].sum().reset_index()
country_loan.sort_values(by = 'loan_amount',ascending = False, inplace = True)

In [None]:
country_loan.head(10)

**Comparing the irregular_country and country_loan dataframes, the countries that have borrowed the largest amounts are also underdeveloped or developing countries with irregular repayment intervals. Kiva should therefore concentrate on these countries.**
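
**Rather than comparing two tables by eye, the total amount and the share of irregular loans can be put side by side per country. A sketch with placeholder countries and invented numbers (the `toy` frame and the `irregular_share` column are assumptions for illustration):**

```python
import pandas as pd

# Placeholder countries and invented loans; column names follow kiva_loans.csv
toy = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B",
                "Country B", "Country B", "Country C"],
    "loan_amount": [300, 200, 500, 400, 100, 900],
    "repayment_interval": ["irregular", "irregular", "monthly",
                           "irregular", "monthly", "monthly"],
})

# Per country: total borrowed and fraction of loans repaid irregularly
summary = toy.groupby("country").agg(
    total_amount=("loan_amount", "sum"),
    irregular_share=("repayment_interval", lambda s: (s == "irregular").mean()),
).sort_values("total_amount", ascending=False)

print(summary)
```

**If the top borrowers really are the irregular-repayment countries, the first rows of this table should show a high `irregular_share`.**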

In [None]:
f,ax = plt.subplots(ncols = 1)
sns.barplot(x = 'loan_amount',y = 'country', data = country_loan.head(10))
f.set_size_inches(15,10)
ax.set_title('Loan amount given Country-Wise',fontsize = 20)

In [None]:
loan_theme.head()

In [None]:
loan_type = loan_theme.groupby(['Loan Theme Type'])['id'].count().reset_index()
loan_type.sort_values(by = 'id',ascending = False,inplace = True)
loan_type.rename(index = str, columns = {'id' : 'Count'},inplace = True)
loan_type_top10 = loan_type.head(10)
loan_type_bottom10 = loan_type.tail(10)

In [None]:
f, (ax1, ax2) = plt.subplots(ncols=2, sharey=False)
ax1 = sns.barplot(y = 'Loan Theme Type', x = 'Count', data = loan_type_top10,ax = ax1)
ax1.set_title ("Top 10 Loan Theme Type",fontsize=20)
ax2 = sns.barplot(y = 'Loan Theme Type', x = 'Count', data = loan_type_bottom10,ax = ax2)
ax2.set_title("Bottom 10 Loan Theme Type",fontsize=20)
f.set_size_inches(20,15)

In [None]:
loans.head(2)

In [None]:
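# Drop missing descriptions before casting: astype(str) would turn NaN into the text 'nan'
loans_use = loans['use'].dropna().astype(str).reset_index(drop=True)
type(loans_use)

In [None]:
loans_use.head()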

In [None]:
from wordcloud import WordCloud,STOPWORDS
import nltk

In [None]:
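# Tokenize every loan description and flatten the result into one list of words
# (nltk.word_tokenize needs the NLTK tokenizer data to be downloaded beforehand)
words = [token for text in loans_use for token in nltk.word_tokenize(text)]
words[:20]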

In [None]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('buy')
stopwords.append('purchase')
stopwords.append('sell')
wordcloud = WordCloud(max_font_size = 30,width = 600, height = 300,stopwords = stopwords).generate(" ".join(words))
plt.figure(figsize =(15,8))
plt.imshow(wordcloud)
plt.title('Wordcloud for loan uses',fontsize = 20)
plt.axis('off')
plt.show()

**The wordcloud suggests that loans have mostly been used for drinking water supplies, fertilizers, tuition, family needs, cooking oil, sewing machines, canned goods, raising pigs and dairy cows.**
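
**A wordcloud only hints at frequencies, so a plain word count is a useful cross-check. A sketch using only the standard library on made-up descriptions (the `uses` list and the short `stop` set are assumptions; the notebook itself uses NLTK's English stopword list):**

```python
from collections import Counter

# Made-up loan descriptions standing in for the `use` column
uses = [
    "to buy fertilizer and seeds",
    "to purchase a sewing machine",
    "to buy fertilizer for the farm",
]

# Hypothetical stop list for this sketch, including the extra words
# the notebook adds ('buy', 'purchase', 'sell')
stop = {"to", "a", "and", "for", "the", "buy", "purchase", "sell"}

# Count every remaining word across all descriptions
counts = Counter(
    word
    for text in uses
    for word in text.lower().split()
    if word not in stop
)

print(counts.most_common(3))
```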

**The Multidimensional Poverty Index (MPI) is a measure of acute poverty that captures the severe deprivations each person faces at the same time with respect to education, health and living standards. The MPI assesses poverty at an individual level.**
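
**As background that is not taken from this notebook: the global MPI is commonly computed as the product of the share of people who are multidimensionally poor (H) and the average deprivation score among them (A), where a person counts as poor when their weighted deprivation score is at least one third. A minimal sketch with made-up scores; the `MPI` column in the dataset is already precomputed, so this only illustrates the idea.**

```python
# Made-up weighted deprivation scores (0 = no deprivation, 1 = deprived in everything)
scores = [0.0, 0.2, 0.5, 0.4, 0.1]
CUTOFF = 1 / 3  # poverty cutoff, assumed here to follow the global MPI convention

# People at or above the cutoff count as multidimensionally poor
poor = [s for s in scores if s >= CUTOFF]

headcount_ratio = len(poor) / len(scores)  # H: share of people who are poor
intensity = sum(poor) / len(poor)          # A: average score among the poor
mpi_value = headcount_ratio * intensity    # MPI = H * A

print(round(mpi_value, 3))
```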

In [None]:
mpi.head()

**In the MPI dataset, we can delete the LocationName and geo columns as they are redundant.**

In [None]:
mpi.drop(['LocationName','geo'],axis = 1, inplace = True)

In [None]:
mpi['world_region'].unique()

In [None]:
mpi.info()

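**Let us see how the records in the MPI dataset are distributed across the world regions. Each row of this dataset is a sub-national region rather than a loan, so the `count_of_loans` column below actually counts regions.**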

In [None]:
count_loan_region = pd.DataFrame(mpi['world_region'].value_counts()).reset_index()
count_loan_region.columns = ['world_region','count_of_loans']
count_loan_region

In [None]:
f,ax = plt.subplots(ncols = 1)
# Create a circle for the center of the plot
my_circle=plt.Circle( (0,0), 0.7, color='white')
ax = plt.pie(count_loan_region['count_of_loans'], labels = count_loan_region['world_region'],wedgeprops = { 'linewidth' : 7, 'edgecolor' : 'white' })
p=plt.gcf()
p.gca().add_artist(my_circle)
f.set_size_inches(10,10)
plt.show()


**Let us see the distribution of the MPI using a KDE plot**

In [None]:
f,ax = plt.subplots(ncols = 1)
sns.kdeplot(mpi['MPI'], kernel ='gau',shade = True,bw ='scott',color ='green')
f.set_size_inches(15,10)
ax.set_title('Multidimensional Poverty Index Distribution',fontsize = 20)

**Let us plot MPI distribution for all the Regions using KDE plot**

In [None]:
g = sns.FacetGrid(data = mpi, col = 'world_region',hue = 'world_region',dropna = True)
g.map(sns.kdeplot, "MPI",shade = True)

**We can see that the Europe and Central Asia region, as expected, has most of its MPI values within 0 - 0.2, which is a good sign, whereas the South Asia and Sub-Saharan Africa regions have a fairly normal distribution. Latin America, East Asia and the Pacific, and the Arab States have right-skewed distributions.**
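
**Skewness read off a KDE plot can be checked numerically per region. A sketch on invented MPI values with placeholder region names (the `toy` frame is an assumption): a clearly positive value indicates a right-skewed distribution, a value near 0 a roughly symmetric one.**

```python
import pandas as pd

# Invented MPI values for two placeholder regions
toy = pd.DataFrame({
    "world_region": ["Region 1"] * 5 + ["Region 2"] * 5,
    "MPI": [0.01, 0.02, 0.03, 0.05, 0.40,   # mostly low with one high value
            0.20, 0.30, 0.35, 0.40, 0.50],  # symmetric around 0.35
})

# Sample skewness of MPI within each region
skew = toy.groupby("world_region")["MPI"].skew()

print(skew.round(3))
```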

In [None]:
loan_region.head()

In [None]:
loan_region['Field Partner Name'].unique()

**Let's analyze which Field Partners provide the largest loan amounts, and which Loan Theme Types receive the most funding.**

In [None]:
loan_field_partner = loan_region.groupby(['Field Partner Name'])['amount'].sum().reset_index().sort_values(by = 'amount',ascending = False)
theme_type = loan_region.groupby(['Loan Theme Type'])['amount'].sum().reset_index().sort_values(by = 'amount',ascending = False)

In [None]:
loan_field_partner.head()

In [None]:
theme_type.head()

In [None]:
loan_field_partner_top20 = loan_field_partner.head(20)
theme_type_top20 = theme_type.head(20)

In [None]:
#create a color palette matching the values 
cmap = matplotlib.cm.viridis
mini=min(loan_field_partner_top20['amount'])
maxi=max(loan_field_partner_top20['amount'])
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in loan_field_partner_top20['amount']]


f,ax = plt.subplots(ncols=1)
ax = squarify.plot(sizes = loan_field_partner_top20['amount'],value = loan_field_partner_top20['amount'], norm_x = 100, norm_y = 100, label = loan_field_partner_top20['Field Partner Name'],color = colors )
f.set_size_inches(20,10)
ax.set_title('Top 20 Loan Field Partners providing loans',fontsize = 20)

In [None]:
#create a color palette matching the values 
cmap = matplotlib.cm.viridis
mini=min(theme_type_top20['amount'])
maxi=max(theme_type_top20['amount'])
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in theme_type_top20['amount']]


f,ax = plt.subplots(ncols=1)
ax = squarify.plot(sizes = theme_type_top20['amount'],value = theme_type_top20['amount'], norm_x = 100, norm_y = 100, label = theme_type_top20['Loan Theme Type'],color = colors )
f.set_size_inches(20,10)
ax.set_title ('Top 20 Loan theme types',fontsize=20)