### Kiva Dataset Analysis

### Country : Samoa


- Analysis of Kiva Loans in Samoa to gain a better understanding of the distribution of the loans in the country and insights on how the lending program could be improved.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Importing Libraries
import pandas as pd
import numpy as np

from datetime import datetime

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


### Read in the data and get a summary.

In [None]:
kiva = pd.read_csv('/kaggle/input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv')
kiva.head(3)

In [None]:
kiva.info()

In [None]:
samoa = kiva[kiva['country'] == 'Samoa']

In [None]:
# Opted to save the data as csv and access it from there
samoa.to_csv('samoa.csv', index = False)

In [None]:
df = pd.read_csv('samoa.csv')
df.sample(3)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.duplicated().any()

In [None]:
df.isnull().sum()

In [None]:
df.dropna(subset = ['funded_time'], inplace = True)

- I opted to clean the other columns right before I perform an analysis involving them, as opposed to cleaning them all at once before beginning the analysis

In [None]:
df.describe()

## Exploratory Analysis

### 1. Brief look at top regions and top sectors

#### - For the top sector, which activity received the largest total loan amount?

In [None]:
df['sector'].value_counts()

In [None]:
sector_df = df[df['sector'] == 'Food']

In [None]:
loans_by_activity = sector_df.groupby('activity')['loan_amount'].sum()\
.sort_values(ascending = False).reset_index()


In [None]:
plt.figure(figsize = (10,5))

plt.title('Loan Amount by Activity for the Food Sector', fontsize = 15, color = '#00008B')
plt.ylabel('Activity', fontsize = 15)
plt.xlabel('Loan Amount', fontsize = 15)

a_list = list(loans_by_activity['activity'])
b_list = list(loans_by_activity['loan_amount'])

a_list.reverse()
b_list.reverse()

plt.barh(a_list, b_list, color = '#008080')

plt.show()

- Food and Agriculture are the most funded sectors, and possibly the ones with the greatest need.
- The most funded activity for the top sector (Food) is Food production and sales.

#### - Were the overall top sectors by loan amount still the top ones in all the years?

- To answer this question I converted the 'date' column into a datetimeindex using pandas, then saved this as a new column that only indicates the years.

In [None]:
df['year'] = pd.DatetimeIndex(df['date']).year


In [None]:
sector_df = df.groupby(['year','sector'])['loan_amount'].sum().reset_index()
sector_df.head(3)

In [None]:
fig = px.bar(sector_df, x = 'year', y = 'loan_amount', \
             color = 'sector', barmode = 'group', height = 300)
fig.show()

- The top sectors (Agriculture, Food and Retail) held the same positions in every year except 2016, when the third-ranked sector was Arts. Services and Retail also featured among the top sectors in every year.

### 2. Repayment Intervals

#### -  How do the repayment intervals compare among the regions, especially the top region?

- Cleaned the 'repayment_interval' and 'region' columns before beginning.
- Decided to assign 'unavailable' to null occurrences in the 'region' column, so that when I encounter such a region I know the data was missing.

In [None]:
print(df['repayment_interval'].isnull().sum())
df['region'].isnull().sum()

In [None]:
len(df['region'])

In [None]:
# Assign the result back instead of calling fillna in place on a column selection
df['region'] = df['region'].fillna('unavailable')

In [None]:
# get a list of the top regions
df['region'].value_counts().head(5)

In [None]:
# Get the various repayment intervals.
df['repayment_interval'].value_counts()

- Obtain a DataFrame in which the 'repayment_interval' column is either 'monthly' or 'bullet'. Get a list of the regions present in this DataFrame, then compare the list with the 'top regions' list.

In [None]:

regular_pays = df[(df['repayment_interval'] == 'monthly') | (df['repayment_interval'] == 'bullet')]


In [None]:
regions_list = list(regular_pays['region'])
regions_list

- Only one of the top regions (Laulii) is present in the list from the DataFrame, meaning the borrowers from the top regions mostly repay the loans irregularly.
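The comparison between the two lists can also be done in code rather than by eye. The sketch below uses a small made-up DataFrame (the rows, and every region name other than Laulii, are illustrative assumptions and not taken from the Kiva data) and intersects the top regions with the regions that have monthly or bullet repayments.

```python
import pandas as pd

# Toy data standing in for the Samoa loans; the values are illustrative only.
toy = pd.DataFrame({
    'region': ['Laulii', 'Apia', 'Apia', 'Vaitele', 'Laulii', 'Apia'],
    'repayment_interval': ['monthly', 'irregular', 'irregular',
                           'irregular', 'irregular', 'irregular'],
})

# The most frequent regions (at most five; the toy data has only three).
top_regions = set(toy['region'].value_counts().head(5).index)

# Regions with at least one loan on a monthly or bullet schedule.
regular = toy[toy['repayment_interval'].isin(['monthly', 'bullet'])]
regular_regions = set(regular['region'])

# Top regions that also appear among the regular payers.
overlap = sorted(top_regions & regular_regions)
print(overlap)
```

Using sets removes the duplicate region names that a plain list of the column would contain, so the overlap is easy to read off.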

#### - How do the repayment periods compare overall?

In [None]:
# Get a look at the different repayment periods.
df['term_in_months'].value_counts()

- Iterate over the 'term_in_months' column and create a new column that shows the period taken to repay a loan as either 'less than a year', 'between one and two years' or 'more than two years'.

In [None]:
repayment_period = []
for item in df['term_in_months']:
    if item < 12.0:
        repayment_period.append('less than a year')
    elif 12.0 <= item < 24.0:
        repayment_period.append('between one and two years')
    else:
        repayment_period.append('more than two years')

df['repayment_period'] = repayment_period
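A possible vectorized alternative to the loop is `pd.cut`, which assigns each value to a labeled bin. This is only a sketch on made-up loan terms (the `terms` Series is an assumption for illustration), using the same three labels and the same cut points of 12 and 24 months.

```python
import pandas as pd

# Toy loan terms in months; illustrative values only.
terms = pd.Series([8, 12, 14, 23, 24, 30])

# right=False makes each bin closed on the left: [0, 12), [12, 24), [24, inf).
period = pd.cut(
    terms,
    bins=[0, 12, 24, float('inf')],
    right=False,
    labels=['less than a year', 'between one and two years', 'more than two years'],
)
print(period.tolist())
```

One difference to keep in mind: a missing term falls into the `else` branch of the loop and is labeled 'more than two years', whereas `pd.cut` leaves it as missing.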
    


In [None]:
df.sample(1)

In [None]:
x = df['repayment_period'].value_counts()

In [None]:
plt.figure(figsize = (9,9))
# Take the labels from the counts themselves so each slice is labeled correctly
labels = x.index

plt.pie(x, labels = labels, autopct = '%0.2f%%',shadow = True, labeldistance = 0.75, \
        explode = [0,0,0.8], colors = ['#DDA0DD', '#FFFACD', '#0000'])
plt.show()

- A large percentage (99.21%) of the borrowers took between one and two years to repay the loans.
- A much smaller share (0.75%) repaid in less than a year, and just a few outliers took more than two years.
- We see that although most of the people repay the loans irregularly, they finish paying in just a little over a year.

#### - How do the repayment periods compare in the most funded regions ?

In [None]:
region_df = df.groupby(['region', 'repayment_period'])['funded_amount'].sum()\
.sort_values(ascending = False).reset_index().head(10)

region_df.head(7)

In [None]:
region_df['repayment_period'].value_counts()

- Many of the most funded regions are also the most frequently occurring (top) regions. In these regions all the borrowers repay within one to two years.
- The outliers who pay before a year elapses or after two years do not belong in the most funded regions.

#### - How do the repayment periods compare in the most funded use?

In [None]:
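# Group by use and repayment period together so that each row keeps the period
# that belongs to it. Assigning df['repayment_period'] to the grouped frame
# would align on the row index and attach unrelated values.
use_df = df.groupby(['use', 'repayment_period'])['funded_amount'].sum()\
.sort_values(ascending = False).reset_index().head(10)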


In [None]:
use_df['repayment_period'].value_counts()

In [None]:
use_df

- The most funded uses are varied, mostly from the Food, Agriculture and Arts sectors. All the borrowers also took between one and two years to repay the loans.

#### - Did the people with more than two years repayment time pay regularly or irregularly?

In [None]:
long_payments = df[df['repayment_period'] == 'more than two years']

In [None]:
long_payments['repayment_interval'].value_counts()

In [None]:
long_payments['funded_amount']

- This is a group of outliers, since most borrowers take less than two years to repay the loans.
- We see that this group of people was actually making regular installments. They just took a longer time possibly because they could only make small installments at a time, seeing as their funded amount is not very large.


### 3.  Time Series Analysis

- Analysis where one variable is time.

- Started by making a copy of my DataFrame then converting the date column into datetime datatype and setting it as the index for this DataFrame.

In [None]:
copy_df = df.copy()

In [None]:
copy_df['date'] = pd.DatetimeIndex(copy_df['date'])

copy_df.set_index(copy_df['date'], inplace = True )


In [None]:
copy_df.sample(2)

#### - What is the trend in funded amount over the years?

In [None]:
funds = copy_df[['funded_amount']]
funds.head(3)

In [None]:
plt.figure( figsize = (20,7))

plt.title('Funded Amount over the Years', fontsize = 20)
plt.xlabel('Years', fontsize = 20)
plt.ylabel('Funds', fontsize = 20)

plt.xticks(rotation = 75)
plt.plot(funds, color = '#483D8B')

plt.show()

- The trend is unclear from this chart, so I reduced the frequency to make it easier to observe. I downsampled to a yearly frequency, using the sum of the loan amounts over each year.

In [None]:
yearly_loans = pd.DataFrame()
# 'y' resamples to year-end; on pandas 2.2 and later the preferred alias is 'YE'
yearly_loans['loan_amount'] = copy_df['loan_amount'].resample('y').sum()
yearly_loans['loan_amount']


In [None]:
yearly_loans.plot(figsize = (10,5), title = 'Yearly Trend in Total Loan Amount', fontsize = 10)

plt.show()


- From the line chart we can tell that 2017 had the lowest total loan amount of the four years, while 2016 had the highest.
- A pie chart to better represent the proportions is shown below.

In [None]:
plt.figure(figsize = (3,3))
labels =  ['2014', '2015', '2016', '2017']

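# plt.pie expects one-dimensional data, so pass the column rather than the DataFrame
plt.pie(yearly_loans['loan_amount'], labels = labels, radius = 2, autopct = '%0.1f%%', \
        shadow = True, explode = [0, 0, 0.2, 0], startangle = 90)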
plt.show()

#### - How were the loans distributed between the genders over the years?

- Started by cleaning the 'borrower_genders' column using forward fill.

In [None]:
print(len(copy_df['borrower_genders']))
copy_df['borrower_genders'].isnull().sum()

In [None]:
copy_df['borrower_genders'].value_counts()

- There are only two borrower genders, which shows that unlike most of the other countries, there was no group borrowing in Samoa. Kiva could consider introducing this.

In [None]:
# fillna(method = 'ffill') is deprecated in recent pandas; ffill() does the same job
copy_df['borrower_genders'] = copy_df['borrower_genders'].ffill()

- Get the data from each of the years, and compare the two genders. 
- The proportion is shown on a pie chart

In [None]:
# Select each year's rows by partial date string on the DatetimeIndex
yr1 = copy_df.loc['2014']
yr2 = copy_df.loc['2015']
yr3 = copy_df.loc['2016']
yr4 = copy_df.loc['2017']

In [None]:
print(yr1['borrower_genders'].value_counts())
print(yr2['borrower_genders'].value_counts())
print(yr3['borrower_genders'].value_counts())
print(yr4['borrower_genders'].value_counts())
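The per-year counts can also be obtained in one step by grouping on the index year and the gender column, which avoids copying the numbers by hand into the plotting code. The sketch below runs on a made-up frame with a DatetimeIndex (the dates and rows are illustrative assumptions, not the Kiva data).

```python
import pandas as pd

# Toy frame with a DatetimeIndex, mimicking the shape of copy_df.
toy = pd.DataFrame(
    {'borrower_genders': ['female', 'female', 'male', 'female', 'female']},
    index=pd.to_datetime(['2014-03-01', '2014-07-15', '2014-09-30',
                          '2015-01-10', '2015-06-05']),
)

# Count borrowers per (year, gender); missing combinations become 0.
counts = (
    toy.groupby([toy.index.year, 'borrower_genders'])
       .size()
       .unstack(fill_value=0)
)
print(counts)

# Values for one year's pie chart, in a fixed label order.
values_2014 = counts.loc[2014, ['female', 'male']].tolist()
print(values_2014)
```

A list such as `values_2014` could then be passed to `plt.pie` in place of a hard-coded list, so the charts stay in step with the data if the cleaning changes.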

In [None]:
plt.figure(figsize = (12,12))
plt.subplot(2,2,1)
plt.title('2014 Ratio of Male to female Loan Recipients', fontsize = 15, color = '#000000')
labels = ['female', 'male']
values = [1904, 14]
plt.pie(values, labels = labels, autopct = '%0.2f%%', pctdistance = 0.8, shadow = True,\
        explode = [0, 0.8], labeldistance = 0.3, colors = ['#FF8C00','#FFC0CB'])

plt.subplot(2,2,2)
plt.title('2015 Ratio of Male to female Loan Recipients', fontsize = 15, color = '#000000')
labels = ['female', 'male']
values = [1540, 11]
plt.pie(values, labels = labels, autopct = '%0.2f%%', pctdistance = 0.8, shadow = True,\
        explode = [0, 0.8], labeldistance = 0.3, colors = ['#5f13e4','#B0C4DE'])

plt.subplot(2,2,3)
plt.title('2016 Ratio of Male to female Loan Recipients', fontsize = 15, color = '#000000')
labels = ['female', 'male']
values = [2171, 12]
plt.pie(values, labels = labels, autopct = '%0.2f%%', pctdistance = 0.8, shadow = True,\
        explode = [0, 0.8], labeldistance = 0.3, colors = ['#5f13e4','#A9A9A9'])

plt.subplot(2,2,4)
plt.title('2017 Ratio of Male to female Loan Recipients', fontsize = 15, color = '#000000')
labels = ['female', 'male']
values = [998, 29]
plt.pie(values, labels = labels, autopct = '%0.2f%%', pctdistance = 0.8, labeldistance = 0.3,\
        shadow = True, explode = [0, 0.8], colors = ['#DAA520','#D2B48C'])

plt.show()

- In all the years, the number of male borrowers is significantly lower than that of female borrowers.
- The highest proportion of males was observed in 2017, at 2.82%.

### 4. Further Analysis With Visualization

####  - What was the time difference between disbursed time and funded time ?

- I converted the two columns to datetime datatype. I then got the difference, which comes as the number of days and hours, and wrote this into a new column. 
- To get the difference in just number of days I divide this new column by a daily unit, using numpy.

In [None]:
copy_df['disbursed_time'] = pd.to_datetime(copy_df['disbursed_time'])
copy_df['funded_time'] = pd.to_datetime(copy_df['funded_time'])

In [None]:
copy_df['time_difference'] = copy_df['funded_time']- copy_df['disbursed_time']
copy_df['time_difference'] = (copy_df['time_difference'] / np.timedelta64 (1,'D')).astype(int)
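To make the conversion concrete, here is a small sketch on made-up timestamps (the `toy` frame is an assumption for illustration). It shows two ways of turning a timedelta column into days: the `.dt.days` accessor, which keeps whole days, and division by a one-day `pd.Timedelta`, which keeps the fractional part.

```python
import pandas as pd

# Two made-up loans with a disbursed and a funded timestamp each.
toy = pd.DataFrame({
    'disbursed_time': pd.to_datetime(['2016-01-01 08:00', '2016-02-10 00:00']),
    'funded_time': pd.to_datetime(['2016-02-01 20:00', '2016-03-01 12:00']),
})

delta = toy['funded_time'] - toy['disbursed_time']

# .dt.days keeps only the whole-day component of each timedelta.
toy['whole_days'] = delta.dt.days

# Dividing by a one-day Timedelta gives fractional days instead.
toy['fractional_days'] = delta / pd.Timedelta(days=1)

print(toy[['whole_days', 'fractional_days']])
```

For positive gaps, `.dt.days` matches dividing and casting to `int`. For negative gaps the two differ, since `.dt.days` rounds down while the integer cast truncates toward zero.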

In [None]:
copy_df['time_difference'].sample(4)

In [None]:
copy_df['time_difference'].median()

In [None]:
plt.figure(figsize = (10,5))

plt.title('Distribution of Time Difference between Funded and Disbursed Time', fontsize = 15)
plt.xlabel('Time Difference')
plt.ylabel('Loan Recipients')

# Compute the median from the data instead of hard-coding it
median_time = copy_df['time_difference'].median()
plt.axvline(median_time, color = 'yellow', label = 'median number of days', linewidth = 2)

plt.legend()

x = copy_df['time_difference']

bins = [0,10,20,30,40,50,60,70,80,90]
plt.hist(x, edgecolor = 'red', bins = bins)

plt.show()

- The median of the distribution is 31 days. For most loans, the gap between the disbursed time and the funded time is between 20 and 30 days.
- Only rarely is the gap more than 40 days or less than 20 days.

#### - Distribution of repayment term and Funded Amount

In [None]:
sns.set_style('darkgrid')

plt.figure(figsize = (13,4))
plt.subplot(1,2,1)
plt.title('Distribution of Repayment Term', color = 'blue')
plt.xlabel('Term in months')

bins = [0,5,10,15,20,25,30,35,40]
plt.hist(df['term_in_months'], color = 'red',edgecolor = 'black', bins = bins)
plt.grid(False)

plt.subplot(1,2,2)
plt.title('Distribution of Funded Amount', color = 'red')
plt.xlabel('Funded Amount')
bins = [0,500,1000,1500,2000,2500,3000,3500,4000]
plt.hist(df['funded_amount'], bins = bins)

plt.show()

- The repayment term, x, is concentrated in the range 10 < x < 20 months. This matches the repayment periods seen earlier, where most people took between one and two years to repay the loans.
- For the funded amount, the density is higher to the left of 1000. Most people take small loans, below 1000. Only a few people (compared to the total population) take loans greater than 1000.

#### - Top Loan Amounts vs Lender Count

- I only considered the loan amounts that appear most in the data, and selected the top 5.

In [None]:
df['loan_amount'].value_counts(ascending = False)

In [None]:
temp_df = df[df.loan_amount.isin([400,600,450,800,425])]
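The five most frequent amounts could also be read from `value_counts` directly instead of being typed in by hand. The sketch below shows the idea on made-up amounts (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Made-up loan amounts; 1000 occurs only once and should be filtered out.
toy = pd.DataFrame({
    'loan_amount': [400, 400, 400, 600, 600, 450, 800, 425, 425,
                    1000, 400, 600, 450, 800],
})

# The five most frequent amounts, taken from the index of value_counts.
top_amounts = toy['loan_amount'].value_counts().head(5).index.tolist()

# Keep only the rows whose amount is one of those five.
temp = toy[toy['loan_amount'].isin(top_amounts)]
print(sorted(top_amounts), len(temp))
```

This keeps the filter consistent with the data if rows are added or dropped during cleaning.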


In [None]:
sns.set_style('darkgrid')
plt.figure(figsize = (15,6))
plt.title('Top Loan Amounts vs Lender Count', color = '#191970', fontsize = 20)
# The bandwidth keyword is lowercase 'bw'; violinplot has no 'estimator' parameter
sns.violinplot(x = 'loan_amount', y = 'lender_count', hue = 'repayment_period',\
               data = temp_df, linewidth = 0.7, bw = 0.4, inner = 'box')
plt.show()

- From the violin plots we see that most of the loans had between 10 and 30 lenders.
- The higher loan amounts had more lenders.
- For the higher loan amounts, those with fewer lenders were repaid in less than a year, while those with more lenders were repaid over a longer period.

#### - What are the most dominant Loan Theme Types?

- Load the csv containing the Loan Theme Types, and obtain data for Samoa. 

In [None]:
df2 = pd.read_csv('/kaggle/input/data-science-for-good-kiva-crowdfunding/loan_themes_by_region.csv')


In [None]:
samoa2 = df2[df2['country'] == 'Samoa']
samoa2.head(1)

In [None]:
samoa2['Loan Theme Type'].isna().sum()

In [None]:
samoa2['Loan Theme Type'].value_counts()

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize = (10,5))
plt.title('Loan Amount by Loan Theme Type', fontsize = 15)

plt.xticks(rotation = 45)
sns.boxplot(x = 'amount', y = 'Loan Theme Type', data = samoa2, notch = True,\
            order = ['Underserved', 'SME', 'Higher Education', 'Seasonal Worker'])

# Set the axis labels after plotting so seaborn does not overwrite them;
# the amounts are on the x-axis and the themes on the y-axis.
plt.xlabel('Loan Amount', fontsize = 15)
plt.ylabel('Loan Theme Type', fontsize = 15)

plt.show()

- The 'Underserved' theme is the most dominant. Judging from the other themes and the top sector, which was Food, I took this theme to represent insufficient basic needs such as food. Its amounts fall mostly in the range 10,000 < x < 20,000, which makes it both the largest distribution and the one with the highest loan amounts.
- The other themes include SME, which is second in both distribution density and amount.
- Higher Education and Seasonal Worker have the smallest distributions as well as the lowest loan amounts.

In [None]:
samoa2['Field Partner Name'].value_counts()

- Samoa has only one field partner.

### 5. Suggestions

##### 1. Group borrowing
- Unlike in the rest of the Kiva dataset, in Samoa no groups have borrowed as a single unit. All the borrowers are individuals and they only take small loan amounts. Kiva could encourage the borrowers to partner up or form groups that are able to take bigger loans, which can have greater economic significance. The groups can use the loans as capital in SMEs.

##### 2. Women make up 99% of the borrowers.
- Encourage men in Samoa to take advantage of Kiva loans for economic development. SME is the second most funded theme, so more loans could be channeled to this area to empower these people.

##### 3. Field Partners.
- There is only one field partner in Samoa. Kiva could partner up with more organizations to work as field partners in order to reach more regions. 
