# IBRD Statement of Loans Data Exploration

## Table of Contents
<ul>
    <li><a href="#data_dict">Data Dictionary</a></li>
    <li><a href="#data_wrangle">Data Wrangling</a></li>
    <li><a href="#eda">Exploratory Data Analysis</a></li>
    <ul>
        <li><a href="#univariate">Univariate Analysis</a></li>
        <li><a href="#bivariate">Bivariate Analysis</a></li>
        <li><a href="#multivariate">Multivariate Analysis</a></li>
    </ul>    
    <li><a href="#explain">Explanatory Data Analysis</a></li>
    <ul>
        <li><a href="#explain_1">Distribution of Loans By Amounts</a></li>
        <li><a href="#explain_2">Distribution of Loans By Type</a></li>
        <li><a href="#explain_3">Distribution of Loans By Region</a></li>
    </ul>    
</ul>


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install seaborn --upgrade

This notebook explores the Statement of Loans data from the World Bank's International Bank for Reconstruction and Development (IBRD), downloaded from [Kaggle](https://www.kaggle.com/theworldbank/ibrd-statement-of-loans-data?select=ibrd-statement-of-loans-historical-data.csv).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

The main CSV is updated periodically: each update appends a new snapshot of the Statement of Loans data to the existing rows. To make sure we work with the most recent data, we keep only the latest snapshot, identified by the "End of Period" field of this CSV (which was _"2019-10-31T00:00:00.000"_ while this analysis was being conducted).

In [None]:
df = pd.read_csv('/kaggle/input/ibrd-statement-of-loans-data/ibrd-statement-of-loans-historical-data.csv')

df = df[df["End of Period"] == "2019-10-31T00:00:00.000"]
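The filter above hard-codes the snapshot date. As a minimal sketch on a small made-up frame (the rows and the names `snapshots`, `periods`, `latest_period` and `latest` are illustrative, not taken from the dataset), the latest snapshot could instead be selected by parsing the field and taking its maximum:

```python
import pandas as pd

# Made-up rows standing in for the historical CSV (illustrative values only)
snapshots = pd.DataFrame({
    "End of Period": [
        "2019-09-30T00:00:00.000",
        "2019-10-31T00:00:00.000",
        "2019-10-31T00:00:00.000",
    ],
    "Loan Number": ["A1", "A1", "B2"],
})

# Parse the strings so that the maximum is the chronologically latest date
periods = pd.to_datetime(snapshots["End of Period"])
latest_period = periods.max()

# Keep only the rows that belong to the latest snapshot
latest = snapshots[periods == latest_period]
print(latest_period, len(latest))
```

This keeps the notebook working when a newer snapshot is appended, at the cost of the results no longer being pinned to the 31 Oct 2019 data.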

Please note that this notebook was written against the [latest seaborn version](https://medium.com/@michaelwaskom/announcing-the-release-of-seaborn-0-11-3df0341af042) at the time. If you want to run it, you will need seaborn 0.11.0 or later; otherwise some of the function calls (_such as sb.displot_) will fail.

In [None]:
sb.__version__
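Rather than reading the version by eye, the check can be made explicit. The helper below is a hypothetical sketch (the name `version_at_least` is not part of seaborn); it compares only the leading numeric parts of a dotted version string against a minimum:

```python
import re


def version_at_least(version, minimum=(0, 11)):
    """Return True if a dotted version string is at least `minimum` (a tuple of ints)."""
    parts = []
    # Only compare as many components as the minimum specifies
    for piece in version.split(".")[:len(minimum)]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts) >= tuple(minimum)


print(version_at_least("0.11.0"))  # True
print(version_at_least("0.10.1"))  # False
```

In the notebook this would be called as `version_at_least(sb.__version__)`.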

<span id="data_dict"></span>
## Data Dictionary

A more detailed data dictionary for this dataset is available [here](https://finances.worldbank.org/api/assets/F90CF55E-6394-42B4-A7CB-C619A386C736?download=true). The following is a quick overview of some of the notable fields and values:
* Region
    * "OTHER" Region is used for loans to IFC (another World Bank Group organisation)
* Country
    * "World" Country is used for loans to IFC
* Loan Type (_these may not match one-to-one with the values_)    
    * B Loan: includes Contingency and Regular B-loans and guarantees
    * Pool Loan: Currency Pooled Loans
    * Fixed Spread Loans (FSL): includes both fixed spread loans and IBRD flexible loans that have either fixed spread or variable spread terms
    * IFC Loan: loan to the IFC
    * Non Pool: Non Pooled Non-IFC Loans
    * SCL: Single Currency Loans
    * SCP USD: Single Currency Pooled Loans - USD
    * SCP DEM: Single Currency Pooled Loans - EUR
    * SCP JPY: Single Currency Pooled Loans - JPY
* Loan Status (_these may not match one-to-one with the values_)    
    * APPROVED: Loan has been approved by the Bank
    * SIGNED: Loan has been signed by both parties
    * EFFECTIVE: Loan has been made effective in accordance with the terms of the legal agreement
    * DISBURSING: Loan is disbursing
    * DISBURSED: Loan has no undisbursed balance
    * REPAID: Loan has been fully repaid
    * CANCELLED: Entire loan principal has been cancelled
    * TERMINATED: Unsigned loan that has been cancelled in full 

<span id="data_wrangle"></span>
## Data Wrangling

### Quality Assessment

We start with a data quality assessment to identify any issues that need to be fixed or cleaned up before the analysis.

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df[df["Original Principal Amount"] == 0].shape[0]

We can see that some loans have been approved with a principal amount of zero (0). As per the [notes in the data dictionary](https://finances.worldbank.org/api/assets/F90CF55E-6394-42B4-A7CB-C619A386C736?download=true):
> when new loans are set up to facilitate conversions from one product to another (e.g. a conversion from a Single Currency loan to a Fixed Spread loan), the original principal amount for the new loan may be shown as zero, since the new loan does not represent a new Board commitment. Similarly, new credit tranches created for the MDRI portion of an existing credit may have an original principal amount of zero since they do not represent new Board approvals. 

In [None]:
df[df["Interest Rate"] == 0].shape[0]

We can also see that many loans have a 0% interest rate. As per the [notes in the data dictionary](https://finances.worldbank.org/api/assets/F90CF55E-6394-42B4-A7CB-C619A386C736?download=true):
> For loans that could have more than one interest rate (e.g. FSL or SCL fixed rate loans), the interest rate is shown as “0”. 

In [None]:
df['Loan Type'].value_counts()

In [None]:
df['Loan Status'].value_counts()

### Data Issues Identified
1. The following date fields are stored with the `object` data type
    - `End of Period`
    - `First Repayment Date`
    - `Last Repayment Date`
    - `Agreement Signing Date`
    - `Board Approval Date` 
    - `Effective Date (Most Recent)`
    - `Closed Date (Most Recent)`
    - `Last Disbursement Date`
2. The `Currency of Commitment` field has no values
3. The `Borrower's Obligation` column name contains an apostrophe
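The three issues above were spotted by reading the output of `df.info()`. As a rough sketch, they could also be detected programmatically. The frame below is made up for illustration, and the rule "a column whose name contains `Date` or `Period` should be a datetime" is an assumption about this dataset's naming, not something the data dictionary states:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up frame with the same kinds of problems (illustrative values only)
sample = pd.DataFrame({
    "Board Approval Date": ["1947-05-09T00:00:00.000", "2019-06-27T00:00:00.000"],
    "Currency of Commitment": [None, None],
    "Borrower's Obligation": [0.0, 1500000.0],
})

# Issue 1: date-like columns (by name) that are not stored as datetimes
unparsed_date_cols = [
    col for col in sample.columns
    if ("Date" in col or "Period" in col)
    and not is_datetime64_any_dtype(sample[col])
]

# Issue 2: columns without a single value
empty_cols = [col for col in sample.columns if sample[col].isna().all()]

# Issue 3: column names containing an apostrophe
apostrophe_cols = [col for col in sample.columns if "'" in col]

print(unparsed_date_cols, empty_cols, apostrophe_cols)
```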

### Data Cleaning

Cleaning up data before we can use it for our analysis.

#### Issue # 1
_Many date fields are of the `object` data type. These should be converted to the `datetime` data type._

**_Cleaning_**

In [None]:
# looping through the fields and converting those to datetime type
fields = ['End of Period', 'First Repayment Date', 'Last Repayment Date', 'Agreement Signing Date', 'Board Approval Date', 'Effective Date (Most Recent)', 'Closed Date (Most Recent)', 'Last Disbursement Date']

for field in fields:
    df[field] = pd.to_datetime(df[field])

**_Testing_**

All date fields have been converted to the datetime data type.

In [None]:
df.info()
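Reading `df.info()` works, but the same test can be expressed as a check that fails loudly. A minimal sketch on a made-up frame (the two column names are borrowed from the list above, the values are invented):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up frame with string dates and one missing value
sample = pd.DataFrame({
    "Board Approval Date": ["1947-05-09T00:00:00.000", None],
    "Last Disbursement Date": ["1948-01-15T00:00:00.000", "2019-10-01T00:00:00.000"],
})
date_fields = ["Board Approval Date", "Last Disbursement Date"]

for field in date_fields:
    sample[field] = pd.to_datetime(sample[field])

# Any field that is still not a datetime column ends up in this list
not_converted = [f for f in date_fields if not is_datetime64_any_dtype(sample[f])]
assert not not_converted, f"Fields not converted: {not_converted}"

# Missing values survive the conversion as NaT
print(sample["Board Approval Date"].isna().sum())
```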

#### Issue # 2
The `Currency of Commitment` field has no values, so we are going to drop it.

**_Cleaning_**

In [None]:
df.drop(columns=['Currency of Commitment'], inplace=True)

**_Testing_**

`Currency of Commitment` field has been removed.

In [None]:
df.info()

#### Issue # 3

There is an apostrophe in the `Borrower's Obligation` column name. While it is not a big issue, removing it makes the field easier to access later.

**_Cleaning_**

In [None]:
df.columns = df.columns.str.replace("'s", "")

**_Testing_**

In [None]:
df.info()
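The renaming step can also be verified directly on the column names. This sketch uses an empty made-up frame; passing `regex=False` is an addition here so that `"'s"` is treated as a literal string regardless of the pandas version's default:

```python
import pandas as pd

# Empty made-up frame, only the column names matter here
sample = pd.DataFrame(columns=["Borrower's Obligation", "Loans Held"])

# Remove the possessive suffix, treating the pattern as a literal string
sample.columns = sample.columns.str.replace("'s", "", regex=False)

# No column name should contain an apostrophe any more
remaining = [col for col in sample.columns if "'" in col]
print(list(sample.columns), remaining)
```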

### Records with Principal Amount and Interest Rate of Zero

Given that zero values of Principal Amount and Interest Rate are not junk data but actually represent something, we will keep those records and continue with our analysis (instead of dropping those rows).

<span id="eda"></span>
## Exploratory Data Analysis

Now that we have assessed and cleaned up the data, let's start with the exploratory data analysis.

### Intention of the analysis

This is an interesting dataset covering loans that IBRD has given to various countries from 1947 to 2019. I am keen to explore how these IBRD loans have evolved over this time and whether any particular country or geography stands out for any interesting reason.

In this regard, I believe principal amounts and interest rates will be particularly useful features to help me with the exploration.

#### Common Functions and Variables

In [None]:
# Assigning base color
base_color = sb.color_palette()[0] 

# Assigning denominator (1,000,000) to help display big values
mm_var = 1000000

# label for pre-2000
pre_2000 = "Before 2000"

# label for post-2000
post_2000 = "Since 2000"

# variable for displot graphs
plot_height = 6

def setsize(width=8, height=6):
    """Common function to set figure size."""
    plt.figure(figsize=(width, height))

#### Key Dates

Loans in this dataset have board approval dates ranging from 1947 to 2019.

In [None]:
df["Board Approval Date"].min(), df["Board Approval Date"].max()

Exploring some key dates and the relationships between them. In general, it seems that the Board Approval Date is followed by the Agreement Signing Date, which is followed by the Effective Date.

In [None]:
df[["Board Approval Date", "Agreement Signing Date", "Effective Date (Most Recent)"]].sample(10)

It looks like there is only one loan for which the agreement was signed before the board approval.

In [None]:
df[df['Board Approval Date'] > df['Agreement Signing Date']]

There are 12 loans where the agreement was signed after the loan became effective. All except one of these have been fully cancelled, and one of them has been terminated.

In [None]:
df[df['Agreement Signing Date'] > df['Effective Date (Most Recent)']]
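One caveat about these date comparisons: any comparison involving a missing date (`NaT`) evaluates to `False`, so loans without a signing date never show up in either filter. A small sketch on made-up dates, counting the violations and the missing values separately:

```python
import pandas as pd

# Made-up dates: the second loan was signed before approval, the third has no signing date
sample = pd.DataFrame({
    "Board Approval Date": pd.to_datetime(["2001-03-01", "2005-06-15", "2010-01-10"]),
    "Agreement Signing Date": pd.to_datetime(["2001-04-01", "2005-06-01", None]),
})

# True only where both dates exist and the signing date is earlier
signed_before_approval = sample["Agreement Signing Date"] < sample["Board Approval Date"]

# Rows that cannot be compared at all
missing_signing = sample["Agreement Signing Date"].isna()

print(int(signed_before_approval.sum()), int(missing_signing.sum()))
```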

<span id="univariate"></span>
### Univariate Analysis

Analysing variables one by one, starting with Original Principal Amount.

Most of the loans seem to be for less than \\$500 million, with some big outliers.

In [None]:
sb.displot(df["Original Principal Amount"]/mm_var, height=plot_height, bins=50, kde=True);
plt.xlabel("Original Principal Amount ($ mm)");

Zooming in below \\$500 million, most of the loans seem to be less than \\$100 million.

In [None]:
sb.displot(df["Original Principal Amount"]/mm_var, height=plot_height, kde=True);
plt.xlim((0,500))
plt.xlabel("Original Principal Amount ($ mm)");
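The histograms give a visual impression of "most loans". To put a number on it, the share of loans below a threshold can be computed as the mean of a boolean mask. The amounts below are invented for illustration, and `mm` mirrors the notebook's `mm_var`:

```python
import pandas as pd

# Invented principal amounts in dollars (illustrative only)
amounts = pd.Series([0, 20e6, 45e6, 80e6, 250e6, 1_500e6])
mm = 1_000_000

# The mean of a boolean Series is the fraction of True values
share_below_100 = (amounts / mm < 100).mean()
share_below_500 = (amounts / mm < 500).mean()

print(round(share_below_100, 2), round(share_below_500, 2))
```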

Most loans were approved in the 1990s, with another, smaller peak around 2010.

In [None]:
sb.displot(df["Board Approval Date"].dt.year, height=plot_height, kde=True);

The Interest Rate plot also shows bi-modal tendencies, with two peaks: one at 0% and another around 7.5%.

In [None]:
sb.displot(df["Interest Rate"], height=plot_height, kde=True);

Most of the loans have been fully repaid.

In [None]:
setsize()
sb.countplot(y=df["Loan Status"], color=base_color, order=df["Loan Status"].value_counts().index);

Most of the loans are of type CPL.

In [None]:
setsize()
sb.countplot(y=df["Loan Type"], color=base_color, order=df["Loan Type"].value_counts().index);

Based on the analysis so far, we have noticed that the dataset has loans from 1947 to 2019, with the data showing bi-modal tendencies. So it is worth exploring further whether loans approved before and after a certain point differ starkly.

In the analysis above, we noticed two peaks in the years in which board approvals were granted. These peaks lay before and after 2000, so we take the year 2000 as the cut-off for another round of univariate analysis.

In [None]:
# flag loans by board approval year: False = approved before 2000, True = approved from 2000 onwards
df["post_2000"] = (df["Board Approval Date"].dt.year >= 2000)

df["post_2000"].value_counts()
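Note that a missing approval date compares as `False`, so such a loan would silently be counted as pre-2000 by the flag above. The sketch below, on made-up dates, builds a labelled version of the flag that keeps missing dates apart. The `"Unknown"` label is an assumption introduced here, while the other two labels match `pre_2000` and `post_2000` defined earlier:

```python
import numpy as np
import pandas as pd

# Made-up approval dates, including one missing value
approval = pd.Series(pd.to_datetime(
    ["1947-05-09", "1999-12-31", "2000-01-01", "2019-06-27", None]
))

# Label each loan by period; NaT compares as False and would land in "Before 2000"
period = pd.Series(
    np.where(approval.dt.year >= 2000, "Since 2000", "Before 2000"),
    index=approval.index,
)

# Keep the label where a date exists, otherwise mark the row explicitly
period = period.where(approval.notna(), "Unknown")

counts = period.value_counts()
print(counts)
```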

As shown above, a lot more loans were given before 2000 than after.

Looking at loan amounts again, but this time with a pre-2000 and post-2000 view.

In [None]:
sb.displot(data=df, x=df["Original Principal Amount"]/mm_var, hue="post_2000", bins=50, height=plot_height);

Zooming in on sub-$500 million loans...

In [None]:
ticks = np.arange(0,500+1, 50)
sb.displot(data=df, x=df["Original Principal Amount"]/mm_var, bins=ticks, hue="post_2000", kde=True, height=plot_height);

plt.xlim((0,500))
plt.xticks(ticks);
plt.xlabel("Original Principal Amount ($ mm)");

Looking at original principal amount from a before-and-after-2000 perspective, we see a very different picture of the loan amounts approved.
* **pre-2000**: The vast majority of loans are for \\$50 million or less.
* **2000 onwards**: Loan amounts are much more widely distributed.

Taking the same view for interest rates, we can again see that loans before and after 2000 look very different.
* **pre-2000**: The majority of loans have an interest rate between 5% and 9%.
* **2000 onwards**: The majority of loans are recorded with a 0% interest rate.

In [None]:
interest_bins = np.arange(0,df["Interest Rate"].max()+1.0, 1.0)
sb.displot(data=df, x="Interest Rate", bins=interest_bins, hue="post_2000", kde=True, height=6);
plt.xticks(interest_bins);

Analysing loan types from a before-and-after-2000 perspective explains the big spike in 0% interest loans since 2000. As the graph below shows, almost all post-2000 loans are of type FSL or SCL. As mentioned in the [data dictionary](https://finances.worldbank.org/api/assets/F90CF55E-6394-42B4-A7CB-C619A386C736?download=true):
> For loans that could have more than one interest rate (e.g. FSL or SCL fixed rate loans), the interest rate is shown as “0”. 

In [None]:
setsize()
sb.countplot(data=df, x="Loan Type", hue="post_2000");
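The link between loan type and the 0% placeholder can also be checked numerically, by computing the share of zero-rate loans per loan type. A minimal sketch on invented rows (the rates are illustrative, not taken from the dataset):

```python
import pandas as pd

# Invented loans: type and recorded interest rate
sample = pd.DataFrame({
    "Loan Type": ["CPL", "CPL", "FSL", "FSL", "SCL", "SCL"],
    "Interest Rate": [7.5, 6.0, 0.0, 0.0, 0.0, 4.2],
})

# Fraction of loans per type whose rate is recorded as exactly zero
zero_share = sample["Interest Rate"].eq(0).groupby(sample["Loan Type"]).mean()
print(zero_share)
```

A share close to 1 for a loan type suggests that the zero is a recording convention for that product rather than a genuinely interest-free loan.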

Loan status also shows a very different picture for loans approved before and after 2000, although in this case it is less surprising.
* **pre-2000**: Most of the loans have been fully repaid. You would expect this, given that these loans were approved at least 20 years ago.
* **2000 onwards**: The majority of loans are still being disbursed and repaid.

In [None]:
setsize()
sb.countplot(data=df, y="Loan Status", hue="post_2000");

#### Summary of univariate analysis

Our analysis so far has produced some interesting insights.

To begin with, approved loan amounts are skewed, with the majority of approved loans being less than \\$100 million. Interestingly, more loans were approved in the 1990s than in any other decade. The majority of loans have been repaid.

The biggest insight from this analysis is that loans approved before 2000 and loans approved since then have very different characteristics. To begin with, the number of loans approved since 2000 is nearly one-third of the number approved before 2000. A large majority of pre-2000 loans were for \\$50 million or less, whereas post-2000 loans are more evenly distributed in terms of amount. In addition, almost all loans approved post-2000 are Fixed Spread (FSL) or Single Currency (SCL) loans, in sharp contrast to pre-2000 loans, which were mostly Non-Pooled (NPL) and Currency Pooled (CPL) loans. This also means that interest rates post-2000 mostly show up as 0%, as FSL and SCL loans can have more than one interest rate.

<span id="bivariate"></span>
### Bivariate Analysis

Exploring other features to find interesting trends, before we conduct further analysis of pre- and post-2000 loans.

Indonesia has the highest outstanding balance of all countries as of the end-of-period date (31 Oct 2019), even though other countries have borrowed more in total. This suggests that those countries have either repaid more of their loans or have yet to draw down more of them.

In [None]:
setsize()
s_ctry_princpl = df.groupby("Country")["Borrower Obligation"].sum().sort_values(ascending=False).head(20)
sb.barplot(y=s_ctry_princpl.index, x=s_ctry_princpl/mm_var, color=base_color);
plt.xlabel("Total Borrower Obligation ($ mm)");

Looking at the countries with the highest interest rates on active loans, the World Bank charges its own group organisation, IFC, the highest rates.

IFC is closely followed by Zimbabwe, which pays much higher interest rates than any other country. Let's try to find the reason behind this large gap between Zimbabwe and other countries.

In [None]:
# only considering records with active loans
df_loans_held = df[df["Loans Held"] > 0]

# list of top 20 countries that pay the highest average interest rates
list_loans = df_loans_held.groupby("Country")["Interest Rate"].mean().sort_values(ascending=False).head(20)

# looking at active loans for only those top 20 countries
df_loans_held = df_loans_held[df_loans_held["Country"].isin(list_loans.index)]

In [None]:
setsize()

# using pointplot to see the average interest rate and also the range of rates that the country generally gets
sb.pointplot(data=df_loans_held, y="Country", x="Interest Rate", join=False, order=list_loans.index)
plt.xticks(np.arange(0,list_loans.max()+1, 1));

If we set IFC aside (as it is a World Bank Group organisation), Zimbabwe seems to be paying high interest rates in general: its rates are high across all quartiles, i.e. there do not seem to be any outlier loans causing this large gap.

In [None]:
df_loans_held.groupby("Country")["Interest Rate"].describe()

Looking at the number of loans by loan type for each country, Zimbabwe is the only country with all of its loans as CPL (Currency Pooled Loans). The only other country with any CPL loans is Chile, which has just one.

In [None]:
pd.crosstab(df_loans_held["Country"], df_loans_held["Loan Type"])

Looking at Interest Rates by Loan Type clarifies the picture even more. As shown in the boxplot below, CPLs have some of the highest interest rates of all loan types.
* IFCT loans are only for IFC (which pays the highest interest rate anyway, as seen earlier).
* Other loan products with high interest rates (NPL, SCPD and SCPM) are not active loans for any country.
* Note that most countries currently have FSL loans which, as per the notes shared earlier, have interest rates listed as zero (0) because their rates can vary.

This explains why Zimbabwe pays a much higher interest rate than any other country with active loans.

In [None]:
setsize()

sb.boxplot(data=df, y="Interest Rate", x="Loan Type", color=base_color);

Let's further explore the differences in loans approved pre- and post-2000, using bivariate analysis.

Looking at interest rates and original principal amount, we can again see that most post-2000 loans show up with 0% interest because of the FSL and SCL loan types. An interesting trend shows up here: pre-2000 loans, at least, were generally given out in multiples of $10 million.

In [None]:
setsize()
sb.scatterplot(y=df["Interest Rate"], x=df["Original Principal Amount"]/mm_var, alpha=0.1, hue=df.post_2000);
plt.xlim((0, 150)) # most of the loans are of less than $150 million
plt.xlabel("Original Principal Amount ($ mm)");
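The "multiples of 10 million" pattern is read off the vertical bands in the scatterplot. A quick way to quantify it would be the share of amounts that divide evenly by 10 million. The sketch below uses invented whole-dollar amounts:

```python
import pandas as pd

# Invented principal amounts in whole dollars
amounts = pd.Series([10_000_000, 25_000_000, 40_000_000, 12_500_000, 100_000_000])

# True where the amount is an exact multiple of 10 million
is_multiple = amounts % 10_000_000 == 0
share = is_multiple.mean()

print(is_multiple.tolist(), share)
```

On the real data the same mask could be grouped by `post_2000` to compare the two periods.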

Looking at loans by Region, the Latin American region has borrowed the most (in terms of total amount), both pre- and post-2000.

In addition, "East Asia and Pacific" and "Africa" are the only regions that have borrowed a smaller total amount post-2000 than pre-2000.

In [None]:
setsize()
df_region_prinpl = df.groupby(["Region", "post_2000"])["Original Principal Amount"].sum().sort_values(ascending=False).reset_index()
sb.barplot(data=df_region_prinpl, y="Region", x=df_region_prinpl["Original Principal Amount"]/mm_var, hue="post_2000")
plt.xlabel("Total Original Principal Amount ($ mm)");

Average loan amounts post-2000 have increased across all regions compared to pre-2000, and by a large margin. The average loan amount taken by South Asian countries post-2000 far exceeds that of any other region. Combined with the total principal graph above, this signals that fewer loans have been taken since 2000 (as shown in the table below), but for much higher amounts than pre-2000.

In [None]:
setsize()
df_region_prinpl = df.groupby(["Region", "post_2000"])["Original Principal Amount"].mean().sort_values(ascending=False).reset_index()
sb.barplot(data=df_region_prinpl, y="Region", x=df_region_prinpl["Original Principal Amount"]/mm_var, hue="post_2000");
plt.xlabel("Average Original Principal Amount ($ mm)");

The following table confirms that the number of loans taken has decreased across all regions since 2000.

In [None]:
df.groupby(["Region", "post_2000"])["Country"].count().unstack()
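Note that `count()` on the `Country` column counts loan rows, not distinct countries. To tell the two apart, both can be computed side by side with named aggregation. This is a sketch on invented rows, and the region and country values are placeholders:

```python
import pandas as pd

# Invented loans (one row per loan)
sample = pd.DataFrame({
    "Region": ["SOUTH ASIA"] * 4 + ["AFRICA"] * 3,
    "post_2000": [False, False, False, True, False, True, True],
    "Country": ["India", "India", "Pakistan", "India", "Kenya", "Kenya", "Nigeria"],
})

# loans = number of rows, countries = number of distinct borrowers
summary = (
    sample.groupby(["Region", "post_2000"])["Country"]
    .agg(loans="count", countries="nunique")
    .reset_index()
)
print(summary)
```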

While Mexico had borrowed the largest amount before 2000, India has increased its borrowing drastically since 2000 and become the largest borrower by a considerable margin. Thailand, Pakistan, Nigeria and Algeria are among the top 20 borrowers before 2000 but not since 2000. Conversely, Egypt, Iraq, Jordan, Kazakhstan, Serbia, Tunisia and Ukraine appear in the top 20 only since 2000. Note that the comparison joins the two top-20 lists, so a missing bar means the country is outside the top 20 for that period, not necessarily that it took no loans.

In [None]:
df_pre2000 = df[df.post_2000 == False]
df_post2000 = df[df.post_2000 == True]

# countries who have taken most loan before 2000
pre2000 = df_pre2000.groupby("Country")["Original Principal Amount"].sum().sort_values(ascending=False).head(20)

# countries who have taken most loan since 2000
post2000 = df_post2000.groupby("Country")["Original Principal Amount"].sum().sort_values(ascending=False).head(20)

pre2000 = pd.DataFrame(pre2000)
post2000 = pd.DataFrame(post2000)

# joining the two dataframes to have a consolidated view
df_maxloan = pre2000.join(post2000, how="outer", lsuffix="_pre", rsuffix="_post")

# making column names shorter
df_maxloan.columns = ["pre_2000", "post_2000"]

df_maxloan = df_maxloan.reset_index().sort_values("pre_2000", ascending=False)

# melting the pre- and post-2000 principal amount columns so that we can use a barplot and its hue property
df_maxloan = df_maxloan.melt(id_vars=["Country"], var_name="Time", value_name="Original Principal Amount")
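Because the join above only covers the two top-20 lists, it cannot show whether a country took no loans at all in a period. One way to check that, sketched here on invented rows with placeholder country names, is to count loans per country and period over the full data:

```python
import pandas as pd

# Invented loans (one row per loan), placeholder country names
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "C"],
    "post_2000": [False, True, False, True],
})

# Number of loans per country in each period; absent combinations become 0
counts = pd.crosstab(sample["Country"], sample["post_2000"])
counts = counts.rename(columns={False: "pre_2000", True: "post_2000"})

# Countries with no loans at all in one of the two periods
no_loans_since_2000 = counts.index[counts["post_2000"] == 0].tolist()
no_loans_before_2000 = counts.index[counts["pre_2000"] == 0].tolist()

print(no_loans_since_2000, no_loans_before_2000)
```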

In [None]:
setsize(10,8)
sb.barplot(data=df_maxloan, x=df_maxloan["Original Principal Amount"]/mm_var, y="Country", hue="Time")
plt.legend(loc=4);
plt.xlabel("Total Original Principal Amount ($ mm)");

The top 5 countries in terms of overall principal amount pre- and post-2000 stay pretty much the same (although in a different order):
* India
* Brazil
* Mexico
* Indonesia
* China

In [None]:
df.groupby("Country")["Original Principal Amount"].sum().sort_values(ascending=False).head(5)

#### Summary of bivariate analysis

As part of our bivariate analysis, before digging further into the differences between pre- and post-2000 loans, we explored some other aspects of the data to look for interesting trends.

We found that even though other countries have borrowed more, Indonesia had the highest outstanding balance as of 31 Oct 2019, hinting that other countries have either not drawn down their loans or have been paying them off a lot faster. The difference between countries, however, was not very stark, so we did not explore it further.

We also noticed that, of all countries with active loans, Zimbabwe pays the highest interest rate by a large margin. On further investigation, we found that unlike other countries with active loans, all of Zimbabwe's loans were of a loan type with relatively high interest rates. An important point to highlight here is that most countries had FSL loans, for which the interest rate is shown as zero (0) in this dataset.

On further exploring the difference between pre-2000 and post-2000 loans, we found that "East Asia and Pacific" and "Africa" are the only two regions that have borrowed less in total post-2000 than pre-2000. Other regions borrowed higher total amounts post-2000. We also found that average loan amounts increased drastically for all regions post-2000. This difference was a lot starker than the difference in total loan amounts, hinting that fewer loans have been taken since 2000. In addition, while Mexico had borrowed the largest amount before 2000, India has increased its borrowing drastically since 2000 and become the largest borrower by a considerable margin. The top 5 countries in terms of overall principal amount pre- and post-2000 stay pretty much the same.

<span id="multivariate"></span>
### Multivariate Analysis

Looking at loan types in more detail now.

The pre-2000 and post-2000 picture for the top 5 countries by total amount borrowed looks very interesting.
* All 5 countries paid the highest interest on CPL loans.
* Before 2000, most of the loans taken out by all 5 countries were CPL loans.
* Since 2000, FSL has become the most popular loan type.
* SCL loans have been quite common throughout.

In [None]:
list_toploans = df.groupby("Country")["Original Principal Amount"].sum().sort_values(ascending=False).head(5).index
list_toploans

In [None]:
df_toploans = df[df["Country"].isin(list_toploans)]

ax = sb.FacetGrid(data=df_toploans, col="Country", row="post_2000", hue="Loan Type", margin_titles=True);
ax.map(sb.scatterplot, "Original Principal Amount", "Interest Rate");
ax.add_legend();

In [None]:
pd.crosstab(df_toploans["Country"], [df_toploans["post_2000"], df_toploans["Loan Type"]])

More generally, looking at how loan types are spread across the different loan statuses:
* SCPY, BLNR, SCPD and SCPM loans have been fully repaid
* BLNC, IFCM, IFCT and GURB loans are the types with the most cancellations
* FSL loans seem to be the most active, with loans spread across all statuses (except Fully Transferred)

In [None]:
# pivoting data by type and status
df_type_status = df.pivot_table(index="Loan Type", columns="Loan Status", values="Loan Number", aggfunc=len, margins=True, fill_value=0)

# converting counts to percentages
df_type_status = df_type_status.div(df_type_status.iloc[:,-1], axis="index")*100

# resetting index and rounding of decimal places
df_type_status = df_type_status.round(0).reset_index()

# dropping margins created by pivot table (the "All" column and the "All" row)
df_type_status.drop(columns=["All"], inplace=True)
df_type_status = df_type_status[df_type_status["Loan Type"] != "All"]

# melting all loan statuses so that we can use the dataframe to generate a graph
df_type_status = df_type_status.melt(id_vars=["Loan Type"], var_name="Loan Status", value_name="Percentage")
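As an alternative sketch of the same reshaping, `pd.crosstab` with `normalize="index"` produces row percentages directly, so no margins need to be created and dropped. The rows below are invented, and the status labels are placeholders:

```python
import pandas as pd

# Invented loans: type and status
sample = pd.DataFrame({
    "Loan Type": ["FSL", "FSL", "FSL", "FSL", "CPL", "CPL"],
    "Loan Status": ["Disbursing", "Repaid", "Repaid", "Approved", "Repaid", "Repaid"],
})

# Share of each status within a loan type, as rounded percentages
pct = (
    pd.crosstab(sample["Loan Type"], sample["Loan Status"], normalize="index") * 100
).round(0)

# Long format, one row per (loan type, status), ready for a FacetGrid
long_form = pct.reset_index().melt(
    id_vars=["Loan Type"], var_name="Loan Status", value_name="Percentage"
)
print(long_form)
```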

In [None]:
ax = sb.FacetGrid(data=df_type_status, col="Loan Type", col_wrap=6);
ax.map(sb.barplot, "Loan Status", "Percentage");
ax.set_xticklabels(rotation=90);

#### Summary of multivariate analysis

For the multivariate analysis, we focused on loan types.

To begin with, we looked at differences in the loans taken out by the top 5 countries by total amount borrowed, i.e. Indonesia, Brazil, Mexico, India and China. Our analysis aligned with what we found earlier, i.e. that FSL is the most common loan type post-2000. While prior to 2000 most of the loans taken out by all top 5 countries were CPL loans, since 2000 FSL has become the most popular loan type amongst these countries. CPL was the most expensive loan type (highest interest rates) that all 5 of these countries had pre-2000; it has been replaced by SCL as the most expensive loan type post-2000.

Looking at loan types in general, SCPY, BLNR, SCPD and SCPM loans have been fully repaid and, unsurprisingly, FSL loans are currently the most active.

<span id="explain"></span>
## Explanatory Data Analysis

In this investigation, we wanted to look at differences in the nature of loans granted by IBRD before 2000 and since then. The main focus of this presentation will be on loan amounts, loan types and regions.

<span id="explain_1"></span>
### Distribution of Loans By Amounts

There is a clear difference in the amounts of loans approved before and since 2000: the majority of loans approved before 2000 have a principal amount of less than $50 million, while the principal amounts of loans approved since 2000 are more evenly distributed.

In [None]:
ticks = np.arange(0,500+1, 50);
sb.displot(data=df, x=df["Original Principal Amount"]/mm_var, bins=ticks, hue="post_2000", height=plot_height, legend=False);

plt.xlim((0,500));
plt.xticks(ticks);
plt.xlabel("Original Principal Amount ($ mm)");
plt.ylabel("")
plt.title("Distribution of Loans (with Principal Amount less than $500 mm)");
plt.legend([post_2000, pre_2000]);

<span id="explain_2"></span>
### Distribution of Loans By Type

Loans taken out by countries since 2000 have mostly been of the FSL (Fixed Spread Loans) type, which is very different from before 2000, when most loans were of the CPL (Currency Pooled Loans) and NPL (Non Pooled Loans) types.

In [None]:
plt.figure(figsize=(8,6));
sb.countplot(data=df, x="Loan Type", hue="post_2000");
plt.title("Distribution of Loans By Type")
plt.ylabel("")
plt.legend([pre_2000, post_2000]);

<span id="explain_3"></span>
### Distribution of Loans By Region

We see that while some regions have borrowed more in total since 2000 than before 2000, others have actually borrowed less, so there is no consistent pattern. However, average loan amounts have increased drastically across all regions, pointing to fewer loans being taken, but for higher amounts.

In [None]:
# setting legend line colors
from matplotlib.lines import Line2D
custom_lines = [Line2D([0], [0], color="#df8138", lw=4),
                Line2D([0], [0], color="#36749e", lw=4)]

# preparing dataset to analyse Original Principal Amount by region
df_region_prinpl = df.groupby(["Region", "post_2000"])["Original Principal Amount"].agg(['sum', 'mean']).reset_index()

fig, ax = plt.subplots(ncols = 2, figsize = [14,8])
sb.barplot(data=df_region_prinpl, y="Region", x=df_region_prinpl["sum"]/mm_var, hue="post_2000", ax=ax[0])
ax[0].set_xlabel("Total Original Principal Amount ($ mm)");
ax[0].set_ylabel("");
ax[0].get_legend().remove()

sb.barplot(data=df_region_prinpl, y="Region", x=df_region_prinpl["mean"]/mm_var, hue="post_2000", ax=ax[1]);
ax[1].set_xlabel("Average Original Principal Amount ($ mm)");
ax[1].set_yticklabels("");
ax[1].set_ylabel("");
ax[1].get_legend().remove()

fig.legend(custom_lines, [post_2000, pre_2000], loc="upper right")
plt.suptitle("Original Principal Amount by Region", size=13);   