# Credit Company Analysis

This notebook analyzes patterns in a dataset that describes loans, and answers questions that can help businesses in the loan industry avoid customers with a high chance of defaulting on a loan.
<br>I'll follow the CRISP-DM approach, so that the work follows a standardized process throughout its development.


This notebook is divided according to the CRISP-DM phases, as follows:
* Business Understanding
* Data Understanding
* Data Preparation
* Modeling
* Evaluation
* Deployment

# Business Understanding

This section discusses the questions that the analysis of this data must answer.
<br> 
The financial sector is one that always has data to be analyzed. Based on this dataset, we can understand that we're dealing with a company that lends money to people.
<br>
To guide this analysis, I developed three questions that we need to answer in order to benefit from it.

* Which type of contract is defaulted on most often?
* How is default behavior distributed across regions and social contexts?
* Which factors are most strongly related to a customer defaulting on a loan, and how can the analysis help avoid this?

# Data Understanding

This part is where the real work begins. Data Understanding and Data Preparation are the phases where we need to understand, filter, clean, impute, and remove data (and much more), so that the processed data produces more reliable results, both for the analysis and for model ingestion.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# loading data into dataframes
df = pd.read_csv('/kaggle/input/loan-defaulter/application_data.csv')
columns_desc = pd.read_csv('/kaggle/input/loan-defaulter/columns_description.csv')

In [None]:
# First we need to understand the shape of the data and the types of columns (categorical or numerical) we are working with
print(f'Shape of Data: {df.shape}')
cat_columns = df.select_dtypes(include=[object]).columns
num_columns = df.select_dtypes(include=[int, float]).columns
print(f'Number of Categorical Columns: {len(cat_columns)} \nNumerical Columns: {len(num_columns)}')
df.describe()

In [None]:
# Evaluate the amount of null values in our dataset so we can clean it later
most_null_columns = df.isnull().sum().sort_values(ascending=False)
gt_5_percent = sum(most_null_columns > df.shape[0] * 0.05)
print(f'Number of columns with missing data greater than 5% of the entire dataset: {gt_5_percent} \n')
print(list(most_null_columns))
print()
print(list(most_null_columns.keys()))
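
The raw null counts above are easier to compare once they are expressed as a share of the rows. Here is a minimal sketch of that idea on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not rows from `application_data.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2000.0, np.nan, 4000.0],
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 7.0],
    "TARGET": [0, 1, 0, 0],
})

# isnull().mean() gives the fraction of missing values per column
null_share = toy.isnull().mean().sort_values(ascending=False)

# keep only the columns whose missing share exceeds the 5% threshold
over_threshold = null_share[null_share > 0.05]
print(over_threshold)
```

On the real data, the same two lines applied to `df` would list the columns behind the "greater than 5%" count together with their missing share.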

In [None]:
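# plotting the difference between non-performing loans and those that were repaid correctly
# value_counts() on the Series keeps a plain 0/1 index, so the labels can be looked up directly
target_values = df['TARGET'].value_counts()

print(target_values[0] / df.shape[0])  # share of repaid loans (TARGET == 0)
print(target_values[1] / df.shape[0])  # share of defaulted loans (TARGET == 1)
target_values.plot.bar()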

In [None]:
# understanding how loans are distributed across genders
print(df['CODE_GENDER'].value_counts() / df.shape[0])
df['CODE_GENDER'].value_counts().plot.bar()

In [None]:
# understanding how defaulted loans are distributed across genders
print(df[df['TARGET'] == 1]['CODE_GENDER'].value_counts() / df[df['TARGET'] == 1].shape[0])
df[df['TARGET'] == 1]['CODE_GENDER'].value_counts().plot.bar()

In [None]:
# understanding the correlation between numerical variables and target
print(df[num_columns].abs().apply(lambda x: x.corr(df.TARGET)).sort_values().to_string())

In [None]:
# understanding the correlation between categorical variables and target
categorical_variables = pd.get_dummies(df[cat_columns], drop_first=True)
print(categorical_variables.apply(lambda x: x.corr(df.TARGET)).sort_values().to_string())

In [None]:
# analyzing loans based on contract type
contract_counts = df['NAME_CONTRACT_TYPE'].value_counts()
print(contract_counts)
# draw the bar chart once and save the resulting figure
fig = contract_counts.plot.bar(rot=0, title='Total loans by Contract Type').get_figure()
fig.savefig('loan_type_plot.png')

In [None]:
# analyzing defaulted loans based on contract type
defaulted_df = df[df['TARGET'] == 1]
defaulted_contract_counts = defaulted_df['NAME_CONTRACT_TYPE'].value_counts()
print(defaulted_contract_counts)
# draw the bar chart once and save the resulting figure
fig = defaulted_contract_counts.plot.bar(rot=0, title='Total Defaulted loans by Contract Type').get_figure()
fig.savefig('default_loan_type_plot.png')

In [None]:
# plotting the difference between contract types
bar_plot = pd.concat(
    [df.rename(columns={"NAME_CONTRACT_TYPE": "Total Loans",})['Total Loans'].value_counts().to_frame().transpose(),
     df.rename(columns={"NAME_CONTRACT_TYPE": "Total Defaulted Loans",})[df['TARGET'] == 1]['Total Defaulted Loans'].value_counts().to_frame().transpose(),]
).plot.bar(rot=0, title='Total Loans by Default Status and Contract Type', figsize=(12,8))
fig = bar_plot.get_figure()
fig.savefig('total_default_loan_type_plot.png')
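
The plots above compare absolute counts. To compare contract types in relative terms as well, the default rate per type can be computed in a single group-by. This is a minimal sketch on a made-up toy frame (the `toy` values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 6 + ["Revolving loans"] * 4,
    "TARGET": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
})

# TARGET is 0/1, so count = total loans, sum = defaulted loans, mean = default rate
summary = toy.groupby("NAME_CONTRACT_TYPE")["TARGET"].agg(
    total_loans="count",
    defaulted_loans="sum",
    default_rate="mean",
)
print(summary)
```

Replacing `toy` with `df` would give the counts and the default percentage for each contract type side by side.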

In [None]:
# plotting relative region population
bar_plot = df[df['TARGET'] == 1]['REGION_POPULATION_RELATIVE'].plot.hist(title='Relative Region Population (default loans)', figsize=(10,3))
fig = bar_plot.get_figure()
fig.savefig('relative_pop.png')

In [None]:
# plotting the age distribution in defaulted loans
fig, ax = plt.subplots()
(df[df['TARGET'] == 1]['DAYS_BIRTH'].abs()//365).value_counts().to_frame().sort_index().plot(
    kind='line', 
    ax=ax, 
    title='Age Distribution (default loans)'
)

ax.legend([])
# fig is already the Figure returned by plt.subplots(), so it can be saved directly
fig.savefig('total_defaulte_age.png')

In [None]:
# marital status among defaulted loans; [:-1] drops the least frequent category
bar_plot = df[df['TARGET'] == 1]['NAME_FAMILY_STATUS'].replace('Single / not married', 'Single').value_counts()[:-1].plot.bar(
    color='green',
    title='Marital status (default loans)',
    rot=0
)
fig = bar_plot.get_figure()
fig.savefig('marital_default.png')

In [None]:
# gender distribution among defaulted loans
bar_plot = df[df['TARGET'] == 1]['CODE_GENDER'].value_counts().plot.bar(
    title='Gender (default loans)',
    rot=0
)
fig = bar_plot.get_figure()
fig.savefig('gender_default.png')

In [None]:
# education distribution among defaulted loans; [:-1] drops the least frequent category
fig, ax = plt.subplots(figsize=(8, 6))
df[df['TARGET'] == 1].rename(columns={'NAME_EDUCATION_TYPE': "Education",})['Education'].value_counts()[:-1].plot(
    kind='pie',
    ax=ax,
    fontsize=10,
    autopct='%.2f',
    title='Education distribution (default loans)'
)
fig.savefig('education_default.png')

In [None]:
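# number of defaulted loans per income range; selecting TARGET before summing avoids summing non-numeric columns
defaulted_loans = df[df['TARGET'] == 1]
income_bins = pd.cut(defaulted_loans['AMT_INCOME_TOTAL'], np.arange(0, 600_000, 50_000))
b_plot = defaulted_loans.groupby(income_bins, observed=False)['TARGET'].sum().plot.bar(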
    rot=20,
    title="Customer's Income by ranges of 50k (default loans)",
    figsize=(12,6),
    xlabel=''
)
fig = b_plot.get_figure()
fig.savefig('income_graph.png')

Some insights from the Data Understanding part:

* Shape of Data: 307511 rows, 122 columns
* Number of Categorical Columns: 16 
* Numerical Columns: 106
* Number of columns with missing data greater than 5% of the entire dataset: 57 
* Of the 307511 loans in total, 91.9% (282602) were repaid correctly and 8.1% (24631) defaulted.
* Of all the variables, the ones with a slightly stronger correlation with TARGET were EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3, which represent normalized scores from external sources.
* These show a slight negative correlation, which makes sense: the higher the score, the lower the chance that the customer defaults.
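
To make the sign of that relation concrete, here is a minimal sketch with made-up numbers (the `toy` values are assumptions, not rows from the dataset). Because TARGET is 0/1, the Pearson correlation is negative whenever defaulted customers tend to have lower scores:

```python
import pandas as pd

# Toy scores: customers who defaulted (TARGET == 1) have the lower scores
toy = pd.DataFrame({
    "EXT_SOURCE_2": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2],
    "TARGET":       [0,   0,   0,   1,   1,   1],
})

# Pearson correlation between the score and the binary target
corr = toy["EXT_SOURCE_2"].corr(toy["TARGET"])
print(round(corr, 3))
```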

# Data Preparation

Now we get our hands dirty with the dataset and focus on improving the overall data quality.
<br>
Some of the common techniques that will be used are:

* Dropping / imputing NaNs - removing unnecessary data and imputing some missing values with the average or mode
* One-hot encoding - so that categorical values can be used when evaluating and predicting the target variable
* Outliers - finding and removing outliers that can bias the analysis
* Low-variance variables - removing variables that have little impact on the target variable
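
The techniques listed above can be sketched end to end on a tiny made-up frame. Everything in `toy` (column values, the `MOSTLY_MISSING` and `CONSTANT_FLAG` columns) is hypothetical, and the z-score cutoff of 2.5 is chosen only to suit ten toy rows; it is not the threshold used on the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame (made up): one extreme income, one mostly-empty column, one constant column
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, 120.0, np.nan, 110.0, 90.0,
                         105.0, 115.0, 95.0, 100.0, 5000.0],
    "MOSTLY_MISSING": [np.nan] * 8 + [1.0, 2.0],
    "CONSTANT_FLAG": [1.0] * 10,
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"] * 5,
})

# 1. Drop columns with more than 50% missing values
step1 = toy.loc[:, toy.isnull().mean() <= 0.5]

# 2. Impute the remaining numerical NaNs with the column mean
step2 = step1.copy()
num_cols = step2.select_dtypes(include="number").columns
step2[num_cols] = step2[num_cols].fillna(step2[num_cols].mean())

# 3. One-hot encode the categorical columns (float dummies keep everything numeric)
step3 = pd.get_dummies(step2, dtype=float)

# 4. Keep only the columns whose standard deviation exceeds 0.2
step4 = step3.loc[:, step3.std() > 0.2]

# 5. Drop rows that contain an extreme z-score in any column (toy cutoff: 2.5)
abs_z = np.abs(zscore(step4.to_numpy()))
clean = step4[(abs_z < 2.5).all(axis=1)]
print(clean)
```

Each step returns a new frame, which makes it easy to inspect what was removed at every stage.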

In [None]:
# Dropping / Imputing NaNs - removing unnecessary data and imputing numerical columns with the average
# Decided to remove columns with 50%+ NaN values, except for the EXT_SOURCE columns, which are always kept
percentage = 50
min_count =  int(((100 - percentage) / 100) * df.shape[0] + 1)
ext_sources = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
dropped_nans_df = df[df.columns[~df.columns.isin(ext_sources)]].dropna(axis=1, thresh=min_count)
# re-attach the EXT_SOURCE columns; after index alignment, rows where any of the three scores
# is missing hold NaN in all three and are filled by the mean imputation below
dropped_nans_df[ext_sources] = df[ext_sources].dropna()

# for numerical ones imputing average
apply_mean = lambda col: col.fillna(col.mean())
imputed_mean_num_df = dropped_nans_df.select_dtypes(include=[int, float]).abs().apply(apply_mean)
dropped_nans_df[imputed_mean_num_df.columns] = imputed_mean_num_df
dropped_nans_df.head()

In [None]:
# One-hot encoding (dtype=float keeps the dummy columns numeric for the z-score step below)
print(dropped_nans_df.shape)
encoded_df = pd.get_dummies(dropped_nans_df, dtype=float)
encoded_df.head()

In [None]:
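# Low-variance check: display the columns whose standard deviation exceeds 0.2
# (the result is only displayed here; encoded_df itself is left unchanged)
encoded_df.loc[:, encoded_df.std() > .2]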

In [None]:
# Removing outliers
from scipy.stats import zscore
zscores = zscore(encoded_df)
abs_z_scores = np.abs(zscores)
filtered_entries = (abs_z_scores < 11).all(axis=1)
new_df = encoded_df[filtered_entries]
new_df

# Modeling

For this project, I decided to take a simple approach and use the LogisticRegression algorithm.
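
As a rough illustration of the approach, here is a minimal sketch on synthetic data. The data comes from scikit-learn's `make_classification` with an assumed class balance of about 92% / 8%, and wrapping the scaler and the model in a pipeline is my own choice for the sketch, so that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)

# stratify keeps the same share of defaults in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# the pipeline standardizes the features, then fits the logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
model.fit(X_tr, y_tr)

# predicted probability of the positive class (default) for each test row
default_proba = model.predict_proba(X_te)[:, 1]
print(default_proba[:5])
```

The predicted probabilities can be more useful than hard 0/1 labels here, since a business may want to pick its own risk cutoff.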

In [None]:
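# separating the features from the target (SK_ID_CURR is only an identifier)
X = new_df[new_df.columns[~new_df.columns.isin(['TARGET', 'SK_ID_CURR'])]]
y = new_df['TARGET']

# train/test splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

# pre-processing: fit the scaler on the training split only,
# so that test-set statistics do not leak into the model
standardizer = StandardScaler()
X_train = standardizer.fit_transform(X_train)
X_test = standardizer.transform(X_test)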

In [None]:
# setting up and applying logistic regression
lg_model = LogisticRegression(solver='liblinear')
lg_model.fit(X_train, y_train)

#Make predictions using the testing set
y_pred = lg_model.predict(X_test)

# Evaluation

In [None]:
cm = confusion_matrix(y_test, y_pred)
# for binary labels, ravel() returns the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

# accuracy computed from the confusion matrix; equivalent to accuracy_score(y_test, y_pred)
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))
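
Since only around 8% of the loans defaulted, accuracy on its own can look good even for a model that never predicts a default. Here is a minimal sketch with made-up labels (92 repaid, 8 defaulted, and a hypothetical "always repaid" prediction) showing why precision and recall are worth reporting too:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 92 repaid loans (0) and 8 defaulted loans (1)
y_true = [0] * 92 + [1] * 8

# A naive prediction that labels every loan as repaid
y_naive = [0] * 100

acc = accuracy_score(y_true, y_naive)
rec = recall_score(y_true, y_naive)
# zero_division=0 avoids a warning because no positive was predicted
prec = precision_score(y_true, y_naive, zero_division=0)

print(f'accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}')
```

The naive prediction reaches 92% accuracy while catching none of the defaults, so the TP and FN counts printed above deserve as much attention as the accuracy figure.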

In [None]:
weights = lg_model.coef_[0]
abs_weights = np.abs(weights)

#get the sorting indices
sorted_index = np.argsort(abs_weights)[::-1]


#get the index of the top-3 features
top_3 = sorted_index[:3]

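#get the names of the top 3 most important features
#the coefficients follow the column order of the feature matrix (new_df without TARGET and SK_ID_CURR)
feature_names = new_df.columns[~new_df.columns.isin(['TARGET', 'SK_ID_CURR'])]
print(list(feature_names[top_3]))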


# Conclusion

From this analysis, we were able to answer the questions that were raised.
<br>
It was discovered that cash loans are the contract type with the most defaults, both in absolute number and in percentage.
<br>
Default behavior also occurs more often among younger customers, usually female, who are likely to be married, have at most secondary education, and earn at most 200k annually.
<br>
Finally, the variables most strongly related to predicting a loan default are relative region population, credit amount and organization type.