## Rwanda Kiva Loan Project

#### Load Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes = True)
color = sns.color_palette()
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py

# Enable Plotly rendering inside the notebook (one call is enough)
py.init_notebook_mode(connected=True)


import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)



#### Read the Data and Overview of Data

In [None]:
kiva=pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv')

kiva.head()

In [None]:
kiva1=pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/kiva_mpi_region_locations.csv')

kiva1.head()

In [None]:
kiva2=pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/loan_theme_ids.csv')

kiva2.head()

In [None]:
themes=pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv')
themes.head()

A short description of the numerical features of the kiva_loans data follows the Rwanda subset below.

## Subset Rwanda Data

In [None]:
kiva_rwanda = kiva[kiva['country']=='Rwanda']

In [None]:
kiva_rwanda.head()

In [None]:
rwanda= kiva_rwanda[['region','funded_amount', 'loan_amount', 'activity', 'sector',
                    'term_in_months', 'borrower_genders','lender_count','repayment_interval', 'use']]
rwanda.head().reset_index()

#### Understanding the Data More

In [None]:
rwanda.info()

#### Checking Missing Values

In [None]:
rwanda.isna().any()


The borrower_genders, use, and region columns contain some missing values.

In [None]:
rwanda.isna().all()

No column is entirely missing, which is good for our analysis.

In [None]:
rwanda.isna().sum()

Checking the total number of missing values:

There are 6138 missing values in the region column, 14 in borrower_genders, and 15 in use.
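
The counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values and the `toy` and `missing` names are illustrative assumptions, not taken from the Kiva data); the same two expressions could be applied to `rwanda`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Rwanda subset (values are made up).
toy = pd.DataFrame({
    "region": ["Kigali", np.nan, np.nan, "Rulindo"],
    "borrower_genders": ["female", "male", np.nan, "female"],
    "loan_amount": [500, 250, 1000, 750],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```

A column with a large missing share, as region appears to have here, may be better treated as its own "unknown" category than dropped.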

#### Creating a copy of the data frame and working with the new copy

In [None]:
new_rwanda=rwanda.copy()
new_rwanda.isna().sum()

#### Exploring the numerical values by using describe() function

In [None]:
rwanda.describe()

The largest loan was USD 50,000, the longest loan term was 41 months, and the maximum lender_count was 1,302.

## Data Exploration

#### Top sectors in which more loans were given

In [None]:
plt.figure(figsize=(15,8))
sector_name = rwanda['sector'].value_counts()
sns.barplot(x=sector_name.values, y=sector_name.index)
for i, v in enumerate(sector_name.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xticks(rotation='vertical')
plt.xlabel('Number of loans')
plt.ylabel('Sector Name')
plt.title("Top sectors in which more loans were given")
plt.show()

Food is the most frequent sector by number of loans, followed by Agriculture.
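
To put the sector ranking in proportion, the counts can be shown next to each sector's share of all loans. This is a small sketch on invented sector labels (the `sectors` series and its values are assumptions for illustration); `value_counts(normalize=True)` is the only pandas feature it relies on.

```python
import pandas as pd

# Made-up sector labels standing in for rwanda['sector'].
sectors = pd.Series(
    ["Food", "Food", "Agriculture", "Retail", "Food", "Agriculture"],
    name="sector",
)

counts = sectors.value_counts()                               # number of loans per sector
shares = (sectors.value_counts(normalize=True) * 100).round(1)  # percentage of all loans

summary = pd.DataFrame({"n_loans": counts, "pct_of_loans": shares})
print(summary)
```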

### Types of Repayment Interval

In [None]:
plt.figure(figsize=(12,8))
count = rwanda['repayment_interval'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Types of repayment interval', fontsize=12)
plt.title("Types of repayment intervals with their count", fontsize=16)

In [None]:
rwanda['repayment_interval'].value_counts().plot(kind="pie",figsize=(10,10))

Types of repayment interval, from most to least frequent:

- Irregular (most frequent)
- Monthly
- Bullet (least frequent)

### Histogram

In [None]:
px.histogram(rwanda, x= 'lender_count', range_x=[0,40], color='repayment_interval')

### Facet Column

In [None]:
px.histogram(rwanda, x= 'lender_count', facet_col ='repayment_interval')

### Most frequent regions that got loans

In [None]:
# Plot the most frequent regions
plt.figure(figsize=(15,8))
count = rwanda['region'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('region name', fontsize=12)
plt.title("Most frequent regions for kiva loan", fontsize=16)

Kigali is the region with the most loans, followed by Rulindo.

In [None]:
rwanda.columns

## Distribution

#### Distribution of funded amount

In [None]:
# Distribution of funded amount
plt.figure(figsize = (12, 8))

sns.histplot(rwanda['funded_amount'], kde=True, stat='density')
plt.show()
plt.figure(figsize = (12, 8))
plt.scatter(range(rwanda.shape[0]), np.sort(rwanda.funded_amount.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('funded_amount', fontsize=12)
plt.title("Funded Amount Distribution")
plt.show()

#### Distribution of loan amount

In [None]:
# Distribution of loan amount
plt.figure(figsize = (12, 8))

sns.histplot(rwanda['loan_amount'], kde=True, stat='density')
plt.show()
plt.figure(figsize = (12, 8))

plt.scatter(range(rwanda.shape[0]), np.sort(rwanda.loan_amount.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('loan_amount', fontsize=12)
plt.title("Loan Amount Distribution")
plt.show()

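The sorted scatter plots make outliers visible: a small number of loans are much larger than the rest (the maximum is 50,000).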
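
One common way to make "outlier" concrete is the 1.5 × IQR rule. The sketch below applies it to an invented series of loan amounts; the rule itself and the `loan_amount` toy values are assumptions chosen for illustration, not something the notebook uses.

```python
import pandas as pd

# Made-up loan amounts with one very large value.
loan_amount = pd.Series([100, 150, 200, 250, 300, 350, 400, 5000])

q1 = loan_amount.quantile(0.25)
q3 = loan_amount.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = loan_amount[(loan_amount < lower) | (loan_amount > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```

On skewed data such as loan amounts, this rule tends to flag many legitimately large loans, so flagged rows are candidates for inspection rather than removal.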

#### Distribution of Rwanda regions

In [None]:
# Distribution of Rwanda regions
plt.figure(figsize=(12,8))
count = rwanda['region'].value_counts()
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('rwanda region name', fontsize=12)
plt.title("Distribution of rwanda regions", fontsize=16)

As we can see, Kigali received the largest number of loans.
Bugesera, Karongi, Masaka, Kabarore, Muhanga, Gatenga, and Bugarama are the least frequent regions.

#### Distribution of Loan Activity type

In [None]:
#Distribution of Loan Activity type

plt.figure(figsize=(12,8))
count = rwanda['activity'].value_counts().head(30)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=12)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Activity name', fontsize=12)
plt.title("Top Loan Activity type", fontsize=16)

The two activities with the most loans are Farming and Food Market.

#### Distribution of term_in_months (number of months over which the loan was scheduled to be paid back)

In [None]:
#Distribution of Number of months over which loan was scheduled to be paid back
print("Number of distinct loan terms (in months): ", rwanda["term_in_months"].nunique())
print(rwanda["term_in_months"].value_counts().head(10))
lender = rwanda['term_in_months'].value_counts().head(70)
plt.figure(figsize=(15,8))
sns.barplot(x=lender.index, y=lender.values, alpha=0.9, color=color[0])
plt.xticks(rotation='vertical')
plt.xlabel('Number of months over which loan was scheduled to be paid back', fontsize=12)
plt.ylabel('count', fontsize=12)
plt.title("Distribution of Number of months over which loan was scheduled to be paid back", fontsize=16)
plt.show()

Six-month terms are the most common, followed by 8- and 10-month terms.

#### Distribution of sectors

In [None]:
plt.figure(figsize=(10,5))
plt.title('Loan Amount by Sector')
sns.barplot(x ='sector',y= 'loan_amount', data= rwanda,ci= None, color= 'lightblue', estimator= np.sum  )
plt.xticks(rotation=75)

plt.show()

The Food and Retail sectors received the highest total loan amounts, followed by Agriculture and Clothing.

In [None]:
plt.figure(figsize=(12,5))
plt.title('Loan Amount by Sector')
sns.barplot(x ='sector',y= 'loan_amount', data= rwanda,ci= None, estimator= np.sum, hue=('repayment_interval')  )
plt.xticks(rotation=75)

plt.show()

Agriculture has the highest total for bullet repayments, while Food and Retail have the highest totals for irregular repayments.

#### Scatter Plot

In [None]:
px.scatter(rwanda, x=  'loan_amount', y= 'lender_count', color='sector', size='loan_amount', hover_data= ['funded_amount'])

There is a positive correlation between lender_count and loan_amount.
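
The visual impression from the scatter plot can be backed by a number. The sketch below computes the Pearson coefficient and a rank-based variant (Pearson on ranks) on invented values; the `toy` frame is an assumption for illustration, and on the real data the two columns of `rwanda` would take its place.

```python
import pandas as pd

# Made-up values standing in for rwanda[['loan_amount', 'lender_count']].
toy = pd.DataFrame({
    "loan_amount": [100, 200, 300, 400, 500],
    "lender_count": [4, 7, 12, 15, 21],
})

# Pearson correlation measures linear association.
pearson_r = toy["loan_amount"].corr(toy["lender_count"])

# Correlating the ranks instead is less sensitive to a few very large loans.
rank_r = toy["loan_amount"].rank().corr(toy["lender_count"].rank())

print(round(pearson_r, 3), round(rank_r, 3))
```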

#### Distribution of the most popular uses of loans

In [None]:
plt.figure(figsize=(15,8))
count = rwanda['use'].value_counts().head(10)
sns.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(0.8,i,v,color='k',fontsize=19)
plt.xlabel('Count', fontsize=12)
plt.ylabel('uses of loans', fontsize=12)
plt.title("Most popular uses of loans", fontsize=16)

The most popular use of a loan is to pay workers who harvested maize.

### Borrower Gender: Female V.S. Male

Dropping the missing values on the borrower_genders Column

In [None]:
new_rwanda=rwanda.copy()
new_rwanda.isna().sum()


In [None]:
new_rwanda.dropna(subset=['borrower_genders'], inplace = True)
new_rwanda.isna().sum()


#### Examining the borrower_genders column using unique()

In [None]:
new_rwanda['borrower_genders'].unique()

#### Cleaning the borrower_genders Column
We create a function to fix the gender column.

In [None]:
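# Rows with a missing borrower_genders value were already dropped from new_rwanda above
def gender_lead(gender):
    """Label a loan by its first-listed borrower: 'female' if the value starts with 'f', otherwise 'male'."""
    gender = str(gender)
    if gender.startswith('f'):
        return 'female'
    return 'male'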


#### Applying the function on the borrower_genders column to fix the problem

In [None]:
new_rwanda['borrower_genders']= new_rwanda['borrower_genders'].apply(gender_lead)

#### Checking that the problem is fixed

In [None]:
new_rwanda['borrower_genders'].unique()

In [None]:
# Jackpot!! Now we have the two categories needed for our analysis

new_rwanda['borrower_genders'].nunique()
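
Because `gender_lead` looks only at the first character, a group loan is labelled by whichever borrower is listed first. If that matters for the analysis, a third label can keep such loans apart. The sketch below is a hypothetical alternative, not the notebook's method; it assumes the column holds comma-separated entries such as `'female, male'`, and the `gender_mix` name is invented here.

```python
import pandas as pd

def gender_mix(value):
    """Label a borrower_genders entry as 'female', 'male', or 'mixed'.

    Assumes entries are comma-separated, e.g. 'female, female, male'.
    """
    parts = {p.strip() for p in str(value).split(",") if p.strip()}
    if parts == {"female"}:
        return "female"
    if parts == {"male"}:
        return "male"
    return "mixed"

# Made-up examples of single and group borrowers.
toy = pd.Series(["female", "male, male", "female, male", "female, female"])
labels = toy.apply(gender_mix)
print(labels.tolist())
```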

## Analysis

Exploring the numerical values by using describe() function

In [None]:
new_rwanda.describe()

The largest loan was USD 50,000, the longest loan term was 41 months, and the maximum lender_count was 1,302.

#### Loan amount by borrower gender and sector

In [None]:
new_rwanda.info()

In [None]:
new_rwanda.head()

In [None]:
sector_df = new_rwanda.groupby(['borrower_genders','sector'])[['loan_amount', 'lender_count', 'funded_amount']].sum().sort_values(by='loan_amount', ascending=False).reset_index().head(10)

In [None]:
sector_df

#### Funded Amount

In [None]:
new_rwanda['funded_amount'].hist(bins=50) 

The histogram shows no extreme values, which suggests the data contain no obvious entry errors.

#### Loan Amount

In [None]:
new_rwanda['loan_amount'].hist(bins=50)

The histogram shows no extreme values, which suggests the data contain no obvious entry errors.

#### Defining Variables for Coding

In [None]:
funded_amount = new_rwanda['funded_amount']
loan_amount = new_rwanda['loan_amount']
activity = new_rwanda['activity']
sector = new_rwanda['sector']
term = new_rwanda['term_in_months']
gender = new_rwanda['borrower_genders']
count = new_rwanda['lender_count']
repayment = new_rwanda['repayment_interval']
region = new_rwanda['region']

In [None]:
region.isna().sum() 

In [None]:
new_rwanda.duplicated().sum() 

Many region values are missing, but there are no duplicate rows.

#### Total Kiva loan amount in Rwanda

In [None]:
loan_amount.sum() 

The total Kiva loan amount is USD 16,616,750.


## Grouping The Data

In [None]:
rwanda_region = new_rwanda.groupby(['region','sector','activity','borrower_genders', 'repayment_interval']).sum(numeric_only=True).sort_values(by='loan_amount', ascending=False).reset_index()
rwanda_region.head()

In [None]:
rwanda_region_sector = new_rwanda.groupby(['region','sector','borrower_genders', 'repayment_interval']).sum(numeric_only=True).sort_values(by='loan_amount', ascending=False).reset_index()
rwanda_region_sector.head()

####  Loan amount vs repayment interval by borrower genders

In [None]:
plt.figure(figsize=(11,8))
plt.title('loan amount vs Repayment interval by gender ')
plt.xticks(rotation=75)

sns.barplot(x='repayment_interval', y='loan_amount', data = rwanda_region, ci=None, hue ='borrower_genders')
plt.show()

#### Loan Amount vs Sector

In [None]:
plt.figure(figsize=(15,10))
plt.title('Loan amount in every sector')
plt.xlabel('sector')
plt.ylabel('loan_amount')

plt.xticks(rotation =75)

sns.barplot(x='sector', y='loan_amount', data =rwanda_region_sector, ci =None)

plt.show()

## Pair Plot

In [None]:
sns.pairplot(rwanda_region_sector)

There is a high correlation between funded amount and loan amount.

### Correlation Matrix and Heatmap

#### Sectors and Repayment Intervals correlation

In [None]:
sector_repayment = ['sector', 'repayment_interval']
cm = sns.light_palette("red", as_cmap=True)
pd.crosstab(new_rwanda[sector_repayment[0]], new_rwanda[sector_repayment[1]]).style.background_gradient(cmap = cm)

The Food sector had the highest number of loans with monthly repayment, followed by Education. Food also had the highest number with irregular repayment, followed by Retail.
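
Raw counts favour sectors that simply have more loans. Normalising each row of the crosstab shows, within a sector, what share of loans uses each repayment interval. This is a sketch on invented rows (the `toy` frame is an assumption for illustration); `normalize="index"` is the relevant `pd.crosstab` option.

```python
import pandas as pd

# Made-up sector / repayment pairs standing in for new_rwanda.
toy = pd.DataFrame({
    "sector": ["Food", "Food", "Food", "Food",
               "Retail", "Retail", "Agriculture", "Agriculture"],
    "repayment_interval": ["irregular", "irregular", "irregular", "monthly",
                           "irregular", "monthly", "bullet", "bullet"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(
    toy["sector"], toy["repayment_interval"], normalize="index"
).round(2)
print(shares)
```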

#### Term_In_Months V.S. Repayment_Interval

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
# One density curve per repayment interval that actually occurs in the data
palette = {'monthly': 'b', 'weekly': 'r', 'irregular': 'g', 'bullet': 'y'}
for interval in new_rwanda['repayment_interval'].unique():
    terms = new_rwanda.loc[new_rwanda['repayment_interval'] == interval, 'term_in_months']
    sns.kdeplot(terms, color=palette.get(interval), fill=True, label=interval, ax=ax)
ax.legend()
plt.title('Term in months (number of months over which loan was scheduled to be paid back) vs Repayment intervals')
ax.set(xlabel='Term in months', ylabel='Density')

The monthly repayment interval shows a higher peak density than the other repayment intervals.

#### Correlation Matrix and Heatmap of Rwanda

In [None]:
# Correlation matrix of the numeric columns only
corr = new_rwanda.select_dtypes(include='number').corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, annot=True, cmap='cubehelix', square=True)
plt.title('Correlation between different features')
plt.show()

There is a high correlation between loan_amount and funded_amount.

#### Region and Repayment Intervals correlation

In [None]:
sector_repayment = ['region', 'repayment_interval']
cm = sns.light_palette("red", as_cmap=True)
pd.crosstab(new_rwanda[sector_repayment[0]], new_rwanda[sector_repayment[1]]).style.background_gradient(cmap = cm)

Kigali had more loans with monthly repayment than any other region.

### Scatter Plot

In [None]:
plt.figure(figsize = (15,10))
sns.scatterplot(x = 'loan_amount', y = 'lender_count', data = rwanda_region_sector, hue = 'sector',size = 'loan_amount',sizes = (100,300))
plt.show()

The Agriculture sector had the highest lender count and loan amount.

### Time Series Analysis

#### 1. Trend of loan amount V.S. funded amount

In [None]:
# .copy() so that the column assignments below act on an independent frame
rwanda = kiva_rwanda[['region','funded_amount', 'loan_amount', 'activity', 'sector','posted_time','disbursed_time','funded_time',
                    'term_in_months', 'borrower_genders','lender_count','repayment_interval', 'use']].copy()
rwanda.head().reset_index()

In [None]:
rwanda['posted_time'] = pd.to_datetime(rwanda['posted_time'])
rwanda['disbursed_time'] = pd.to_datetime(rwanda['disbursed_time'])
rwanda['funded_time'] = pd.to_datetime(rwanda['funded_time'])

In [None]:
rwanda.index = pd.to_datetime(rwanda['posted_time'])
plt.figure(figsize = (12, 8))
ax = rwanda['loan_amount'].resample('W').sum().plot()
ax = rwanda['funded_amount'].resample('W').sum().plot()
ax.set_ylabel('Amount ($)')
ax.set_xlabel('month-year')
ax.set_xlim((pd.to_datetime(rwanda['posted_time'].min()), 
             pd.to_datetime(rwanda['posted_time'].max())))
ax.legend(["loan amount", "funded amount"])
plt.title('Trend of loan amount V.S. funded amount')

plt.show()

Loan amount and funded amount both peaked around January 2017.
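
The weekly aggregation behind this plot can be checked on a few rows. The sketch below builds an invented four-loan frame (dates and amounts are assumptions for illustration), sums it by week with `resample('W')`, and adds the funded share per week, which makes the gap between requested and funded amounts easier to read than two overlapping lines.

```python
import pandas as pd

# Made-up loans: two posted in the week ending 2017-01-08, two in the next week.
toy = pd.DataFrame({
    "posted_time": pd.to_datetime(
        ["2017-01-02", "2017-01-04", "2017-01-10", "2017-01-12"]
    ),
    "loan_amount": [500, 300, 1000, 200],
    "funded_amount": [500, 250, 600, 200],
})
toy["unfunded_amount"] = toy["loan_amount"] - toy["funded_amount"]

# 'W' groups rows into weeks ending on Sunday and sums each column.
weekly = (
    toy.set_index("posted_time")[["loan_amount", "funded_amount", "unfunded_amount"]]
    .resample("W")
    .sum()
)
weekly["funded_share"] = weekly["funded_amount"] / weekly["loan_amount"]
print(weekly)
```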

#### 2. Trend of unfunded amount V.S. funded amount

In [None]:
rwanda.index = pd.to_datetime(rwanda['posted_time'])

rwanda['unfunded_amount'] = rwanda['loan_amount'] - rwanda['funded_amount']
plt.figure(figsize = (12, 8))
ax = rwanda['unfunded_amount'].resample('W').sum().plot()
ax = rwanda['funded_amount'].resample('W').sum().plot()
ax.set_ylabel('Amount ($)')
ax.set_xlabel('month-year')
ax.set_xlim((pd.to_datetime(rwanda['posted_time'].min()), 
             pd.to_datetime(rwanda['posted_time'].max())))
ax.legend(["unfunded amount", "funded amount"])
plt.title('Trend of unfunded amount V.S. funded amount')

plt.show()

### Rwanda Mapbox

In [None]:
themes=pd.read_csv('../input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv')
themes.head()

In [None]:
themes_rwanda = themes[themes['country'] == 'Rwanda'].reset_index(drop = True)
themes_rwanda.head()

In [None]:
px.set_mapbox_access_token('pk.eyJ1IjoiamFyZWQ4NyIsImEiOiJja2s1d3BkZHAwdWVvMnZxbnI0Mzh6cmRvIn0.LHpqVCTKMSNRcSt2c5usrg')

In [None]:
px.scatter_mapbox(themes_rwanda, lat = 'lat', lon = 'lon', color = 'region',size = 'amount',center = dict(lat = -1.5791079, lon = 30.0694123), zoom = 10 )


### Summary

1. The total Kiva loan amount in Rwanda was USD 16,616,750.
2. The largest loan was USD 50,000, the longest loan term was 41 months, and the maximum lender_count was 1,302.
3. Food is the most frequent sector by number of loans in Rwanda, followed by Agriculture.
4. Irregular repayment was the most frequent interval (4,317 loans), followed by monthly (1,459); bullet was the least frequent (959).
5. Kigali is the region with the most loans, followed by Rulindo with 111 loans.
6. Bugesera, Karongi, Masaka, Kabarore, Muhanga, Gatenga, and Bugarama are the least frequent regions.
7. The Food and Retail sectors received the highest total loan amounts, followed by Agriculture and Clothing.
8. The two activities with the most loans are Farming and Food Market, with 976 and 736 loans respectively.
9. Six-month terms are the most common, followed by 8- and 10-month terms.
10. The most popular use of a loan was to pay workers who harvested maize.
11. The Food sector had the highest number of loans with monthly repayment, followed by Education.
12. The Food sector also had the highest number of loans with irregular repayment, followed by Retail.
13. The monthly repayment interval showed a higher peak density over loan terms than the other repayment intervals.
14. There is a high correlation between loan_amount and funded_amount.
15. Kigali had more loans with monthly repayment than any other region.
16. The funded amount was higher than the unfunded amount.