# Kiva loans and linear regression
## Fitting the linear regression model by country

### Foreword

This workbook was completed at the beginning of July 2020, more than two years after the original publishing date of this dataset on kaggle.com.
The main objective of this workbook is to effectively apply machine learning algorithms to the dataset in order to determine possible correlations between values.

This dataset also introduced me to Kiva, an incredible crowdfunding community whose microcredits empower borrowers from around the world. You too can become a lender - Kiva will help you find a purposeful endeavor that will inspire change and progress. Find additional information on kiva.org.
****
***
#### Disclaimer

This workbook is fairly short and does not do justice to the dataset in terms of data exploration. I advise you to first discover the nature of the data yourself or with the help of other kernels published on Kaggle.
***
***
The code below is heavily commented in order to improve its readability.
If you have any remarks or suggestions, I would be more than glad to see them in the comments section - I am always a message away.

Thank you for your attention and I hope you will enjoy exploring this workbook.

Sincerely,
<br>
Alexander NINUA


## 1. Forming a hypothesis

In [None]:
import pandas as pd
import numpy as np

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
sns.set_palette('pastel')

In [None]:
#The following line of code enables the automatic graph display in Jupyter notebook

%matplotlib inline

#The rest of modules used in the workbook are going to be loaded when they are needed

In [None]:
df = pd.read_csv("../input/data-science-for-good-kiva-crowdfunding/kiva_loans.csv")

In [None]:
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)

##### General information about the dataset

In [None]:
df.info()  

In [None]:
df.head(3)

Does this dataset present any correlations between its numerical factors?

In [None]:
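#Only numerical columns can be correlated, so the text columns are filtered out before calling corr()
ax = sns.heatmap(df.drop(['id', 'partner_id'], axis = 1).select_dtypes(include = 'number').corr(), cmap = 'coolwarm', annot = True)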
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.xticks(rotation=45)

In [None]:
# As we can see from the heatmap above, some of the factors, such as funded_amount and lender_count present a strong correlation.
# This correlation can be interpreted in the following way: the bigger the sum of the loan, the more lenders it needs to be funded
df[['funded_amount', 'lender_count']].corr()

However, the dataset does not only present numerical values: columns such as 'sector', 'repayment_interval' or 'borrower_genders' present categorical values that can still be interpreted.


## **Hypothesis**

Each and every loan described in the dataset comes with additional values. 
The hypothesis of this research states that the additional values, when analyzed, can predict the funded amount of a loan.
In the following chapters we will try to study this relationship by country using linear regression.



## 2. Preparing datasets by country

In this section we will be dividing the data by country - this will facilitate further analysis. 
My goal is to have a dictionary with country names as keys and their respective datasets as values.
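
As a side note, the same kind of dictionary can be sketched on toy data with `groupby`. The frame `toy` and its values are made up for illustration and are not part of the Kiva dataset; this is only one possible way to do the split.

```python
import pandas as pd

# Toy stand-in for the loans table; the values are made up for illustration
toy = pd.DataFrame({
    "country": ["Kenya", "Peru", "Kenya", "Peru", "Mali"],
    "funded_amount": [300, 500, 250, 700, 100],
})

# groupby yields (key, sub-frame) pairs, so a dict comprehension is enough.
# .copy() makes each sub-frame independent of the parent DataFrame.
toy_by_country = {name: group.copy() for name, group in toy.groupby("country")}

print(sorted(toy_by_country))
print(toy_by_country["Kenya"])
```

Keep in mind that `groupby` leaves out rows whose key is missing by default, so this variant silently skips loans without a country.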

In [None]:
#extracting all the unique country names in a single list
countries = list(df['country'].unique())

In [None]:
#Each country becomes a key of the dictionary, with its own subset of the data as the value
#.copy() makes each subset independent of df, so the in-place edits below do not raise SettingWithCopyWarning
countrydict = {key : df[df.country == key].copy() for key in countries}

In [None]:
#Each dataset drops the columns that are unnecessary for our future work
for key in countrydict:
    countrydict[key].drop(['id','activity', 'use', 'country_code','region','currency','partner_id','posted_time','disbursed_time','funded_time','tags','date','loan_amount','country'], axis = 1, inplace = True)

In [None]:
#This function reworks the gender column. It is a simplification, so it has to be taken with a grain of salt
def genalloc(x):
    #The following line transforms gender strings into lists 'female, female, female, male' -> ['female','female','female','male']
    genders = [g.strip() for g in str(x).split(',')]

    #Single-borrower entries keep their value as a string
    if len(genders) == 1:
        if genders[0] in ('male', 'female'):
            return genders[0]
        #Anything else (e.g. a missing value) gets no category
        return None
    #Longer lists get a new string assigned based on their gender composition
    if 'male' in genders and 'female' in genders:
        return 'mixed'
    elif genders[0] == 'male':
        return 'men'
    elif genders[0] == 'female':
        return 'women'
    return None

In [None]:
#The function is then applied to the 'borrower_genders' column
#Rows with missing values are dropped first, so genalloc only sees real gender strings
for key in countrydict:
    countrydict[key].dropna(inplace = True)
    countrydict[key]['borrower_genders'] = countrydict[key]['borrower_genders'].apply(genalloc)

In [None]:
countrydict['Pakistan']['borrower_genders'].unique()
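
To make the five categories easier to check, here is a small set-based sketch of the same idea. `classify_genders` is a hypothetical helper written for this illustration; it is not the function used in the workbook and it may treat unusual strings differently.

```python
def classify_genders(raw):
    """Hypothetical set-based variant of the gender simplification."""
    parts = [p.strip() for p in str(raw).split(",")]
    kinds = set(parts)
    if kinds == {"male", "female"}:
        return "mixed"
    if kinds == {"male"}:
        return "male" if len(parts) == 1 else "men"
    if kinds == {"female"}:
        return "female" if len(parts) == 1 else "women"
    return None  # anything else, e.g. a missing value


for example in ["male", "female", "male, male", "female, female", "female, male, female"]:
    print(f"{example!r} -> {classify_genders(example)}")
```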

Now we are going to prepare the data for the linear regression. 
Before fitting the regression model, we will need to make sure that:
* Categorical values have dummy values assigned
* All the dataframes have the same columns
* Countries with fewer than 1000 entries are out of the list
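
The first two points can be sketched on toy data. The frames `a` and `b` and the helper `encode` are made up for this illustration; the sketch only assumes that one frame contains a category the other one lacks.

```python
import pandas as pd

# Two toy "country" frames with different sector sets (made-up values)
a = pd.DataFrame({"funded_amount": [100, 200, 300],
                  "sector": ["Agriculture", "Food", "Retail"]})
b = pd.DataFrame({"funded_amount": [150, 250],
                  "sector": ["Agriculture", "Retail"]})


def encode(frame):
    # drop_first=True removes the first category (alphabetical order),
    # which avoids perfectly collinear dummy columns
    dummies = pd.get_dummies(frame["sector"], drop_first=True, dtype=int)
    return pd.concat([frame.drop(columns="sector"), dummies], axis=1)


a_enc, b_enc = encode(a), encode(b)

# reindex gives b the same columns, in the same order, as a;
# columns that b lacks are created and filled with 0
b_aligned = b_enc.reindex(columns=a_enc.columns, fill_value=0)

print(a_enc.columns.tolist())
print(b_aligned)
```

Keeping the column order identical across frames matters later, because the fitted coefficients are returned as a plain array in column order.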

In [None]:
#Creating data dummies for the categorical values of our datasets
for key in countrydict:
    sex = pd.get_dummies(countrydict[key]['borrower_genders'],drop_first= True)
    ints = pd.get_dummies(countrydict[key]['repayment_interval'],drop_first= True)
    sec = pd.get_dummies(countrydict[key]['sector'],drop_first= True)
    countrydict[key].drop(['borrower_genders','repayment_interval','sector'], axis = 1, inplace = True)
    countrydict[key] = pd.concat([countrydict[key], sex, ints,sec],axis = 1)

In [None]:
#Unfortunately, not all datasets have the same column entries. 
#The following dict has countries as keys with their respective dataset shapes as values
dfshapes = {}
for key in countrydict:
    dfshapes[key] = pd.DataFrame(index = countrydict[key].columns.drop('funded_amount')).shape
    
print(max(dfshapes, key=dfshapes.get))
print(dfshapes[max(dfshapes, key=dfshapes.get)])
print('\n')
print(min(dfshapes, key=dfshapes.get))
print(dfshapes[min(dfshapes, key=dfshapes.get)])

#We can see that the gap between the shapes of countries is rather big

In [None]:
#We will use Kenya's columns as the standard for all the countries
#Every dataset will end up with the same columns
full_columns = countrydict['Kenya'].columns

In [None]:
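#If a column is absent from a dataframe, it is added and filled with zeros
#reindex also puts the columns in the same order as Kenya's, so that the coefficients of all countries line up later
for key in countrydict:
    countrydict[key] = countrydict[key].reindex(columns = full_columns, fill_value = 0)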

In [None]:
#The shapes are now normalized
dfshapes = {}
for key in countrydict:
    dfshapes[key] = pd.DataFrame(index = countrydict[key].columns.drop('funded_amount')).shape

print('Kenya \n',dfshapes['Kenya'],'\n','Mauritania\n',dfshapes['Mauritania'])

In [None]:
#We will use the lengths of the dataframes to weed out country datasets that have a small number of entries
#The following loops remove the datasets of countries that have fewer than a thousand entries
short = []
for k, v in countrydict.items():
    if len(v) < 1000:
        short.append(k)
print(f'There are {len(short)} countries that have fewer than 1000 entries')
for x in short:
    countrydict.pop(x)


## 3. Linear Regression by country

In this part we will apply linear regression to every country in countrydict. The end goal is to see whether we can predict the funded amount based on the auxiliary data, such as the sector of the project, gender(s) of the borrower(s) or the proposed repayment interval.

___


We will later see which countries have the most precise regression models and what role the different coefficients play there.
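
Before turning to the real data, the per-group workflow can be sketched on synthetic data. The groups "A" and "B", the helper `make_group` and the chosen slopes are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_group(slope, n=200):
    # Synthetic data: y depends linearly on x, plus a little noise
    x = rng.uniform(0, 10, size=n)
    y = slope * x + rng.normal(0, 0.5, size=n)
    return pd.DataFrame({"x": x, "y": y})


groups = {"A": make_group(2.0), "B": make_group(-1.0)}

results = {}
for name, frame in groups.items():
    X_train, X_test, y_train, y_test = train_test_split(
        frame[["x"]], frame["y"], test_size=0.3, random_state=101)
    model = LinearRegression().fit(X_train, y_train)
    results[name] = {
        "slope": model.coef_[0],
        "R2": r2_score(y_test, model.predict(X_test)),
    }

# One row per group, one column per statistic
print(pd.DataFrame(results).T)
```

Each group gets its own model, so the recovered slopes can differ in sign and size from one group to the next, just as the coefficients may differ from country to country.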

In [None]:
#Importing necessary modules from sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
#Initiating the dataframes that will later hold the coefficients of every model as well as the metrics measuring each model's accuracy
coefficients = pd.DataFrame(index = full_columns[1:])
errors = pd.DataFrame(index = ['Mean Absolute Error', 'Mean Squared Error', 'Root Mean Squared Error','R2 Score'])

In [None]:
#The following loop iterates through each dataset in countrydict
#train_test_split divides the data into train/test sets
#A linear regression model is initiated and fitted on the training set
#Both coefficients and metrics for each country are then stored in the respective dataframes

for key in countrydict:
    X = countrydict[key].drop('funded_amount', axis = 1)
    y = countrydict[key]['funded_amount']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
    lm = LinearRegression()
    lm.fit(X_train,y_train)
    preds = lm.predict(X_test)
    mse = metrics.mean_squared_error(y_test, preds)
    #The order of the metrics matches the index of the errors dataframe
    err_list = [
        metrics.mean_absolute_error(y_test, preds),
        mse,
        np.sqrt(mse),
        metrics.r2_score(y_test, preds),
    ]
    #Labelling the coefficients with X's column names lets pandas align them with the index of coefficients
    coefficients[str(key)] = pd.Series(lm.coef_, index = X.columns)
    errors[str(key)] = err_list
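
For readers who want to see what the four metrics measure, here is a minimal sketch that computes them by hand with NumPy and compares them with sklearn. The arrays `y_true` and `y_pred` are made-up values, not results from the Kiva data.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # average size of the errors
mse = np.mean(residuals ** 2)      # average squared error
rmse = np.sqrt(mse)                # back in the units of the target
# R2 compares the model's squared error with that of always predicting the mean
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
print(metrics.mean_absolute_error(y_true, y_pred),
      metrics.mean_squared_error(y_true, y_pred),
      metrics.r2_score(y_true, y_pred))
```

An R2 Score close to 1 means the model explains most of the variance of the target on the test set, while a score near 0 means it does no better than predicting the mean.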

In [None]:
#Below we can see a dataframe with metrics calculated for each country's regression model
errors = errors.transpose()
errors.nlargest(n = 75 ,columns = 'R2 Score')

In [None]:
#The R2 scores range from around 0.55 to 0.98 with a significant concentration around 0.9
#histplot replaces distplot, which is deprecated in recent versions of seaborn
sns.histplot(errors['R2 Score'], bins = 25)

In [None]:
#This function returns the entry count of a country in the original dataset
#This will enable us to see the relation between the R2 Score and the number of entries available for each country
df_ccount = df.groupby('country').count()
def country_count(x):
    return df_ccount.loc[str(x), 'id']

In [None]:
#Creating the Count column which will display the number of entries for each country
errors.reset_index(level = 0, inplace = True)
errors['Count'] = errors['index'].apply(lambda x: country_count(x))

In [None]:
errors.set_index('index', inplace = True)
errors.nlargest(n = 10 ,columns = 'R2 Score')

In [None]:
#As we can see, there is no apparent correlation between the number of entries and the R2 Score
sns.jointplot(x = 'R2 Score', y = 'Count', data = errors, kind = 'hex')

In [None]:
#We will also search for a correlation between the R2 Score and the standard deviation of the funded amount
#It is possible to hypothesize that, in this case, a good R2 Score correlates with a low standard deviation
def df_std(x):
    return df[df['country'] == str(x)]['funded_amount'].std()

In [None]:
errors.reset_index(level = 0, inplace = True)
errors['Standard Deviation'] = errors['index'].apply(lambda x: df_std(x))
errors.set_index('index', inplace = True)

In [None]:
#As we can see on the plot below, there is no apparent correlation between the R2 Score and the Standard Deviation
sns.jointplot(x = 'R2 Score', y = 'Standard Deviation', data = errors)
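
A visual impression can be backed up with a number. The sketch below computes the Pearson correlation coefficient on a made-up table called `toy_errors`; the values are invented and only show the mechanics, they say nothing about the real countries.

```python
import numpy as np
import pandas as pd

# Made-up metrics table with the same column names as the one above
toy_errors = pd.DataFrame({
    "R2 Score": [0.95, 0.90, 0.85, 0.70, 0.60],
    "Standard Deviation": [400.0, 900.0, 300.0, 1200.0, 500.0],
}, index=["A", "B", "C", "D", "E"])

# Series.corr uses the Pearson coefficient by default
r = toy_errors["R2 Score"].corr(toy_errors["Standard Deviation"])

# Cross-check with NumPy's correlation matrix
r_numpy = np.corrcoef(toy_errors["R2 Score"], toy_errors["Standard Deviation"])[0, 1]

print(round(r, 3), round(r_numpy, 3))
```

A coefficient near 0 would support the reading of the plot, while a value close to -1 would support the hypothesis that a good R2 Score goes with a low standard deviation.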

***

Now we can say that the regression model is fitted to each country with more than a thousand entries. 

At this point you can download the information above and do your own research on the subject of data coefficients by country and their possible interpretations. 

Below you will find a brief visualization of the metrics and coefficients.

***

In [None]:
#Countries with highest and lowest R2 scores
top_R2 = errors.nlargest(n = 75 ,columns = 'R2 Score')['R2 Score'].reset_index(level=0, inplace=False)
R2_compare = pd.concat([top_R2[:5], top_R2[-5:]])
sns.catplot(x = 'index', y = 'R2 Score', data = R2_compare, aspect = 4, kind = 'bar')
plt.title('Countries with highest and lowest R2 Scores',fontsize = 20)
plt.xlabel('Country', fontsize = 15)
plt.ylabel('R2 Score', fontsize = 15)

In [None]:
#The following table shows the coefficients of the regression model fitted for each country
coefficients = coefficients.transpose()
coefficients.head(5)

In [None]:
country_sectors = pd.DataFrame(index = coefficients.columns[9:], columns = ['Country Min', 'Min', 'Country Max','Max'])


In [None]:
#For each sector, find the countries with the lowest and the highest coefficient
#.loc[row, column] writes straight into country_sectors and avoids chained assignment
for x in coefficients.columns[9:]:
    col = coefficients[x]
    country_sectors.loc[x, 'Country Min'] = col.idxmin()
    country_sectors.loc[x, 'Min'] = col.min()
    country_sectors.loc[x, 'Country Max'] = col.idxmax()
    country_sectors.loc[x, 'Max'] = col.max()

In [None]:
country_sectors

Our research shows that linear regression models, when fitted by country, demonstrate an impressive R2 Score. The coefficients are also interesting to analyze; however, it is important to take them with a grain of salt.

The data is there and it can be explored further and further. Feel free to download my code and work with the data on your own terms. 

I hope that my workbook helped you discover interesting details of this dataset. Again, all feedback on my work is highly appreciated.

Thank you for your attention!