> # Data analysis is better when it's hypothesis-driven

> ## But how do you generate hypotheses with exploratory data analysis and visualization?

**Table of contents**<a id='toc0_'></a>    
- [EDA - Case study](#toc1_)    
  - [Business Context](#toc1_1_)    
    - [Markets for dummies](#toc1_1_1_)    
    - [Our business](#toc1_1_2_)    
  - [Data exploration](#toc1_2_)    
    - [How much data do we have?](#toc1_2_1_)    
    - [What time frame does the dataset span?](#toc1_2_2_)    
    - [How is our data segmented?](#toc1_2_3_)    
    - [Does each loan type correspond to a specific type of client?](#toc1_2_4_)    
    - [What does each of our segments look like?](#toc1_2_5_)    
    - [Is `small_corp_variable` a segment worth exploring?](#toc1_2_6_)    
    - [Why is the `small_corp_variable` segment so different from the rest?](#toc1_2_7_)    
    - [How do we pick the high-value customers?](#toc1_2_8_)    
    - [Why do we have such high variance at low volumes?](#toc1_2_9_)    
    - [What changed over time?](#toc1_2_10_)    


# <a id='toc1_'></a>[EDA - Case study](#toc0_)

Before diving into the data, some of the best people to help with hypothesis setting are the people who run the business or work with the data every day.

In [None]:
# Import the libraries
import pandas as pd

# Show floats with thousands separators and two decimals
pd.options.display.float_format = '{:,.2f}'.format



## <a id='toc1_1_'></a>[Business Context](#toc0_)

Today we're back in the world of finance, except now we work for a bank that wants to find out which of its loans are the most profitable. Why?

### <a id='toc1_1_1_'></a>[Markets for dummies](#toc0_)

> 🐂 **Bull market** = a market that is on the rise and where the conditions of the economy are generally favorable.  

In bull markets, businesses are typically driven by growth, so the conversation mostly focuses on increasing revenue.

> 🐻 **Bear market** = exists in an economy that is receding and where most stocks are declining in value.  

In bear markets, cash is tight so businesses become more driven by profitability.

### <a id='toc1_1_2_'></a>[Our business](#toc0_)

In the case of this bank, we're operating in a bear market where cash is tight, so the bank wants to maximize its profitability. This is why they came to a bunch of data analysts to tell them which companies they should give loans to.

In [None]:
financials = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/financials_loans.csv')


The variables:  
- `client_t` - client type
- `loan_t` - loan type
- `Volume` - loan amount
- `Maturity` - days to pay the loan
- `fixed component` - the amount of money made from fixed interest per dollar
- `variable component` - the amount of money made from variable interest per dollar
- `fixed costs` - the overall processing fees of the bank
- `Unitary profitability` - the amount of money made per dollar. This can come from loan interest rates or bank fees.

## <a id='toc1_2_'></a>[Data exploration](#toc0_)

### <a id='toc1_2_1_'></a>[How much data do we have?](#toc0_)

In [None]:
# Check shape

The bank gave out 266K loans in the time period we have available. Does this also mean that the bank had 266K clients?

In [None]:
# We can of course check this

Almost. There are a few clients that seem to have taken more than one loan.
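
A minimal sketch of that comparison on a toy frame. The `client_id` column is an assumption for illustration only: the variable list above does not name the identifier column, so swap in whatever the real dataset uses.

```python
import pandas as pd

# Toy stand-in for `financials`; `client_id` is a hypothetical column name
loans = pd.DataFrame({
    "client_id": [101, 102, 102, 103, 104],
    "Volume": [5_000, 12_000, 8_000, 30_000, 1_500],
})

n_loans = loans.shape[0]                   # one row per loan
n_clients = loans["client_id"].nunique()   # distinct borrowers

# Clients that appear on more than one loan
loans_per_client = loans["client_id"].value_counts()
repeat_clients = loans_per_client[loans_per_client > 1]

print(n_loans, n_clients)
print(repeat_clients)
```

If the two counts match, every client has exactly one loan; any gap between them is explained by the clients listed in `repeat_clients`.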

### <a id='toc1_2_2_'></a>[What time frame does the dataset span?](#toc0_)

The dataset appears to cover 2012-2018, but we will double-check this later.

### <a id='toc1_2_3_'></a>[How is our data segmented?](#toc0_)

In [None]:
# Review unique values in categorical columns

The bank has two types of clients and two types of loans, each evenly distributed. In a real dataset, one class often dominates; in that case we would focus on the dominant class before analyzing the less common ones.
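
One way to check how balanced the categories are is `value_counts(normalize=True)`. This is only a sketch on invented rows; the category labels below are assumptions, not the real values in the dataset.

```python
import pandas as pd

# Toy stand-in for `financials`; the category labels are made up
loans = pd.DataFrame({
    "client_t": ["individual", "corp", "individual", "corp"],
    "loan_t": ["fixed", "fixed", "variable", "variable"],
})

# Share of rows per category, for each categorical column
shares = {
    col: loans[col].value_counts(normalize=True)
    for col in ["client_t", "loan_t"]
}

for col, share in shares.items():
    print(share, end="\n\n")
```

Shares close to each other suggest balanced classes; one share near 1.0 would tell us where to focus first.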

### <a id='toc1_2_4_'></a>[Does each loan type correspond to a specific type of client?](#toc0_)

It seems not: each type of client (individual, corporation) has access to both fixed and variable interest loans. Our data therefore splits into 4 groups, and we can work out the profitability of each one to advise the bank.
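
A cross-tabulation is one way to verify this kind of claim. The sketch below uses invented rows and labels, so treat it as an illustration of the technique rather than the notebook's actual output.

```python
import pandas as pd

# Toy stand-in for `financials`; labels and counts are made up
loans = pd.DataFrame({
    "client_t": ["individual", "individual", "corp", "corp", "corp"],
    "loan_t": ["fixed", "variable", "fixed", "variable", "variable"],
})

# Rows: client type, columns: loan type, cells: number of loans
combos = pd.crosstab(loans["client_t"], loans["loan_t"])
print(combos)

# True only if every client type holds every loan type
all_combinations_present = bool((combos > 0).all().all())
print(all_combinations_present)
```

A zero in any cell would mean that a loan type is tied to a specific client type.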

**Note:** This is very, very simplified. In reality you'd have hundreds of subgroups to look at to figure out which ones drive most of the profitability. A rule of thumb would be to figure out which segments represent most of your data (>90% of your data or money depending on the segments).

In [None]:
# We'll create a new column for this segmentation
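
As a rough sketch, the segment label can be built by joining the two categorical columns. The column name `segment` and the label values are assumptions chosen to match the `small_corp_variable` naming used later on.

```python
import pandas as pd

# Toy stand-in for `financials`; label values are assumed
loans = pd.DataFrame({
    "client_t": ["small_corp", "individual", "small_corp"],
    "loan_t": ["variable", "fixed", "fixed"],
})

# Join the two labels into a single segment key
loans["segment"] = loans["client_t"] + "_" + loans["loan_t"]

print(loans["segment"].value_counts())
```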

### <a id='toc1_2_5_'></a>[What does each of our segments look like?](#toc0_)

In [None]:
# I'm most interested in profitability, so then that's the variable I'll look at first

Observations:
- There are clear differences in `Unitary profitability` among my segments so the segment does influence the profitability
- The variation in profitability for individuals is much smaller compared to that of corporations. This makes sense, as you'd expect individuals to have less negotiating power compared to corporations.
- The segment with small corporation clients that take out variable-interest loans is completely different from the others.
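
Observations like these can be backed with a per-segment summary. The numbers below are invented and the `segment` column is assumed to exist already; only the `groupby` pattern is the point.

```python
import pandas as pd

# Toy stand-in; values are made up to show a narrow and a wide segment
loans = pd.DataFrame({
    "segment": ["individual_fixed"] * 3 + ["small_corp_variable"] * 3,
    "Unitary profitability": [0.020, 0.021, 0.022, 0.010, 0.050, 0.090],
})

# Count, centre and spread of profitability within each segment
summary = (
    loans.groupby("segment")["Unitary profitability"]
    .agg(["count", "mean", "std"])
)
print(summary)
```

Comparing the `std` column across rows gives a quick numeric read on which segments vary the most before plotting anything.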

### <a id='toc1_2_6_'></a>[Is `small_corp_variable` a segment worth exploring?](#toc0_)

In [None]:
# But first... is it worth exploring?

Definitely. I have 34 billion in this type of loan. If this segment only had 70K in loans, it would not be as important.
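
A sketch of how segment size could be measured: total `Volume` per segment, its share of the whole book, and a running share for the rule of thumb about covering most of the money. Segment names and amounts are placeholders.

```python
import pandas as pd

# Toy stand-in; segment names and amounts are placeholders
loans = pd.DataFrame({
    "segment": ["a", "a", "b", "c"],
    "Volume": [600.0, 300.0, 90.0, 10.0],
})

# Total lent per segment, largest first
volume = loans.groupby("segment")["Volume"].sum().sort_values(ascending=False)
share = volume / volume.sum()

report = pd.DataFrame({
    "volume": volume,
    "share": share,
    "cum_share": share.cumsum(),  # running share of total volume
})
print(report)
```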

### <a id='toc1_2_7_'></a>[Why is the `small_corp_variable` segment so different from the rest?](#toc0_)

In [None]:
# Get small_corp_var segment

In [None]:
# Check profitability distribution

To figure out which variable potentially drives profitability, we can check correlations among variables:

In [None]:
# Non-linear correlation
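
One common choice for relationships that are monotonic but not linear is Spearman's rank correlation, available through `DataFrame.corr(method="spearman")`. Whether the original cell used Spearman is an assumption; the data below is synthetic, with a made-up inverse relationship between volume and profitability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
volume = rng.uniform(1_000, 100_000, size=200)

# Synthetic segment: profitability falls non-linearly as volume grows
segment = pd.DataFrame({
    "Volume": volume,
    "Unitary profitability": 50 / volume + rng.normal(0, 1e-5, size=200),
    "Maturity": rng.integers(30, 3_650, size=200),  # unrelated noise
})

# Rank-based correlation between every pair of numeric columns
corr = segment.corr(method="spearman")
print(corr["Unitary profitability"].sort_values())
```

Because it works on ranks, Spearman picks up the curved volume relationship that a plain Pearson coefficient would understate.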

We see that `Unitary profitability` is highly correlated with both `variable component` and `Volume`. We can confirm this using a pairplot:

In [None]:
# Visual correlation

Observations:
- Anything that looks like a rectangle/blob => no correlation
- When you have rectangles, you typically deal with 2 uniform distributions
- When you have circles, you typically deal with 2 normal distributions
- We can also see the high correlation between `Unitary profitability` and both `Volume` and `variable component`. `Volume` and `variable component` are themselves correlated.

In [None]:
# Vol - profitability correlation

In [None]:
# This looks a bit cluttered so we will add some transparency

Observations:
- As the loan amount increases we notice the unitary profitability decreases. This makes sense for two reasons:
    - We expect a client borrowing a higher amount of money will negotiate a lower interest rate.
    - The fixed fees (processing, credit score rating, etc.) are a larger proportion of smaller loans compared to larger loans.
- We do have a couple of clustered points that are interesting. **Which one should we spend most of our time on?**
- The variance at lower volumes is much higher than at higher volumes, potentially due to the fixed costs affecting the profitability at small loan amounts.
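
The fixed-fee argument is easy to see with a little arithmetic. The fee and margin below are invented numbers, and the formula (margin plus fee divided by volume) is a simplifying assumption, not the bank's actual pricing.

```python
# Illustrative numbers only
flat_fee = 200.0         # hypothetical processing fee charged per loan
interest_margin = 0.03   # hypothetical margin earned per dollar lent

volumes = [2_000, 20_000, 200_000]

# The same fee spread over more dollars adds less profit per dollar
fee_per_dollar = {v: flat_fee / v for v in volumes}

for v in volumes:
    unit_profit = interest_margin + fee_per_dollar[v]
    print(f"{v:>9,}  fee per dollar: {fee_per_dollar[v]:.4f}  unit profit: {unit_profit:.4f}")
```

On the smallest loan the fee dwarfs the interest margin, so any difference in fees between two small loans shows up as a large swing in unitary profitability.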

### <a id='toc1_2_8_'></a>[How do we pick the high-value customers?](#toc0_)

### <a id='toc1_2_9_'></a>[Why do we have such high variance at low volumes?](#toc0_)

We will have a look at loans with very low and very high profitability but the same volume:

In [None]:
print('High profitability small loans')

print('Low profitability small loans')


What is different between these 2 sets of loans?
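
A possible way to answer that question is to put the two groups side by side and compare their averages column by column. The rows, the 5,000 volume threshold, and the median split are all assumptions made for this sketch.

```python
import pandas as pd

# Toy stand-in for the segment; every value is invented
segment = pd.DataFrame({
    "Volume": [1_000, 1_200, 1_100, 900, 50_000, 60_000],
    "Unitary profitability": [0.20, 0.18, 0.02, 0.01, 0.03, 0.03],
    "fixed costs": [150, 140, 20, 15, 150, 160],
    "Maturity": [365, 400, 380, 350, 720, 700],
})

# Keep small loans only (hypothetical threshold)
small = segment[segment["Volume"] < 5_000]

# Label each small loan as above or below the median profitability
cutoff = small["Unitary profitability"].median()
small = small.assign(
    profit_group=(small["Unitary profitability"] > cutoff)
    .map({True: "high", False: "low"})
)

# Average of every numeric column within each group
comparison = small.groupby("profit_group").mean()
print(comparison)
```

Columns whose averages differ sharply between the `high` and `low` rows are the first candidates for explaining the spread.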

### <a id='toc1_2_10_'></a>[What changed over time?](#toc0_)

In [None]:
# First off, we need to change the dtype

In [None]:
# Get the mean profitability over time
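
A minimal sketch of these two steps, assuming the dates live in a string column. The column name `date` is hypothetical and the values are invented.

```python
import pandas as pd

# Toy stand-in; `date` is a hypothetical column name
loans = pd.DataFrame({
    "date": ["2015-01-10", "2015-02-03", "2016-01-12", "2016-03-20"],
    "Unitary profitability": [0.02, 0.04, 0.06, 0.08],
})

# Strings -> datetime64 so that date accessors become available
loans["date"] = pd.to_datetime(loans["date"])

# Confirm the time span covered by the data
print(loans["date"].min(), loans["date"].max())

# Mean profitability per calendar year
yearly = loans.groupby(loans["date"].dt.year)["Unitary profitability"].mean()
print(yearly)
```

Grouping by `dt.to_period("M")` instead of `dt.year` would give a monthly view, which is handy for spotting seasonality.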

In [None]:
# plot data

Observations:
- Loans are typically given out at the beginning of the year
- The profitability drastically increased in 2016

The reason for this is that in 2016 the bank acquired another bank and started using its CRM system, which automatically raised the fixed costs every year based on inflation and other factors. Before that, the bank made no such adjustment, so many clients kept their loans at lower fixed costs than they should have had.