## EDA 
### Section 1: Summary Statistics and Distributions:

#### 1. Summary Statistics:
Calculate basic summary statistics (mean, median, mode, standard deviation) for numerical columns such as loan_amnt and annual_inc.
Create histograms and box plots to visualize the distribution of loan amounts, annual income, and other key variables.
Calculate and visualize the percentage of loans that were fully repaid versus those that defaulted, based on loan_status.
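
The summary step above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: `sample` is a small made-up frame standing in for the real data, `summarize` is a hypothetical helper, and it assumes that `repay_fail` uses 1 for a failed repayment and 0 for a repaid loan.

```python
import pandas as pd

# Hypothetical sample standing in for the real loan data.
sample = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, 7000.0, 2000.0, 5000.0],
    "annual_inc": [20000.0, 40000.0, 65000.0, 30000.0, 40000.0],
    "repay_fail": [0, 1, 0, 0, 1],  # assumption: 1 = failed to repay
})


def summarize(frame, columns):
    """Return mean, median, std and mode for each requested column."""
    stats = frame[columns].agg(["mean", "median", "std"]).T
    # mode() can return several values; keep the first (smallest) one.
    stats["mode"] = [frame[c].mode().iloc[0] for c in columns]
    return stats


summary = summarize(sample, ["loan_amnt", "annual_inc"])

# Percentage of loans per outcome.
repay_pct = sample["repay_fail"].value_counts(normalize=True).mul(100)

print(summary)
print(repay_pct)
```

With matplotlib installed, `sample["loan_amnt"].plot.hist()` and `sample.boxplot(column="loan_amnt")` would draw the corresponding histogram and box plot.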
#### 2. Categorical Variables Analysis:
Explore the distribution of categorical variables like home_ownership, purpose, and verification_status.
Calculate the count and percentage of loans for each category within these variables.
Visualize these distributions using bar plots or pie charts.
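
One way to get counts and percentages per category is `value_counts`, sketched below. The `sample` frame and the `category_breakdown` helper are hypothetical and only illustrate the idea; the real home_ownership column may contain other categories.

```python
import pandas as pd

# Hypothetical sample; the real column may have more categories.
sample = pd.DataFrame({
    "home_ownership": ["RENT", "MORTGAGE", "RENT", "OWN", "RENT"],
})


def category_breakdown(series):
    """Count and percentage of rows for each category in a column."""
    counts = series.value_counts()
    percent = series.value_counts(normalize=True).mul(100).round(2)
    # Both Series share the category index, so they align row by row.
    return pd.DataFrame({"count": counts, "percent": percent})


breakdown = category_breakdown(sample["home_ownership"])
print(breakdown)
```

The same helper could be applied to purpose and verification_status, and `breakdown["count"].plot.bar()` would give the bar plot if matplotlib is available.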

#### 3. Time-Series Analysis:
Convert date columns (issue_d, last_pymnt_d, etc.) to datetime objects.
Create a time-series plot of loan issuance over time to identify trends.
Analyze loan performance over time, such as the percentage of loans that default or are paid off.
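
The date conversion and a per-year view could look roughly like this. It assumes issue_d uses the same `Mon-YY` text format that last_pymnt_d shows in the data preview (for example `Jul-13`); the `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample; assumes issue_d is stored as text like "Jul-13".
sample = pd.DataFrame({
    "issue_d": ["Jan-07", "Jul-10", "Nov-10", "Mar-11", "Feb-11", "Dec-11"],
    "repay_fail": [1, 0, 1, 0, 0, 1],
})

# "%b-%y" = abbreviated month name, dash, two-digit year.
sample["issue_d"] = pd.to_datetime(sample["issue_d"], format="%b-%y")

# Loans issued per year and the share of them that failed to repay.
by_year = sample.groupby(sample["issue_d"].dt.year)["repay_fail"].agg(
    loans="count", fail_rate="mean"
)
print(by_year)
```

Grouping by `sample["issue_d"].dt.to_period("M")` instead would give a monthly series for a finer trend line.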

### Section 2: Advanced and Interesting EDA Activities:
#### 1. Loan Default Analysis by Features:

Perform a deeper analysis of loan defaults by exploring relationships between default rates and features like int_rate, emp_length, dti, and annual_inc.
Use statistical tests or visualizations to identify significant differences in default rates among different groups.
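
As a first look at default rates across groups, a numeric feature can be cut into bands and the failure rate compared per band. The sketch below does this for int_rate; the band edges and the `sample` frame are arbitrary assumptions, and it covers only the group comparison, not a significance test.

```python
import pandas as pd

# Hypothetical sample standing in for the real loan data.
sample = pd.DataFrame({
    "int_rate": [5.42, 9.91, 13.98, 15.95, 7.5, 18.2],
    "repay_fail": [0, 0, 0, 1, 0, 1],
})

# Assumed band edges; intervals are right-inclusive: (0, 10], (10, 15], (15, 25].
bins = [0, 10, 15, 25]
labels = ["<=10%", "10-15%", ">15%"]
sample["rate_band"] = pd.cut(sample["int_rate"], bins=bins, labels=labels)

# Number of loans and failure rate in each interest-rate band.
fail_by_band = sample.groupby("rate_band", observed=False)["repay_fail"].agg(
    ["count", "mean"]
)
print(fail_by_band)
```

The same pattern would apply to dti or annual_inc with suitable edges, while emp_length is already categorical and can be grouped directly.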

#### 2. Feature Engineering and Correlation Analysis:
Create new features or transformations based on domain knowledge, such as a loan-to-income ratio to complement the existing debt-to-income ratio (dti).
Compute correlations between features and the target variable (repay_fail) to identify which features are most influential in loan repayment.
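
A possible sketch of both steps: derive a loan-to-income ratio, then rank numeric features by their correlation with repay_fail. The column name `loan_to_income` and the `sample` values are hypothetical, and Pearson correlation with a binary target is only a rough indicator of influence.

```python
import pandas as pd

# Hypothetical sample standing in for the real loan data.
sample = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, 7000.0, 2000.0, 12000.0],
    "annual_inc": [20000.0, 25000.0, 70000.0, 40000.0, 30000.0],
    "int_rate": [13.98, 15.95, 9.91, 5.42, 18.2],
    "repay_fail": [0, 1, 0, 0, 1],
})

# New feature: requested loan amount relative to annual income.
sample["loan_to_income"] = sample["loan_amnt"] / sample["annual_inc"]

# Correlate numeric columns with the target and rank by absolute strength.
correlations = (
    sample.select_dtypes("number")
    .corr()["repay_fail"]
    .drop("repay_fail")
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Restricting to numeric columns first keeps text columns such as term or emp_length from breaking the correlation call on the full data.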

#### 3. Geospatial Analysis:
Utilize geographical information such as zip_code and addr_state to perform geospatial analysis.
Visualize loan distribution and loan performance on a map to identify geographic trends and hotspots.
Analyze whether the location has an impact on loan defaults or interest rates.
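
Before drawing any map, the per-state figures can be tabulated with a groupby on addr_state, as sketched below. The `sample` frame and the output column names are assumptions for illustration; a mapping library could then color states by `fail_rate`.

```python
import pandas as pd

# Hypothetical sample standing in for the real loan data.
sample = pd.DataFrame({
    "addr_state": ["CA", "CA", "NY", "TX", "NY", "CA"],
    "int_rate": [13.98, 15.95, 9.91, 5.42, 11.0, 7.9],
    "repay_fail": [0, 1, 0, 0, 1, 0],
})

# Loan count, failure rate and average interest rate per state.
by_state = (
    sample.groupby("addr_state")
    .agg(
        loans=("repay_fail", "size"),
        fail_rate=("repay_fail", "mean"),
        avg_int_rate=("int_rate", "mean"),
    )
    .sort_values("fail_rate", ascending=False)
)
print(by_state)
```

States with very few loans produce noisy rates, so filtering on `loans` before comparing states is probably worthwhile on the real data.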

### Import All Necessary Packages

In [69]:
import pandas as pd


In [70]:
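try:
    # Forward slashes avoid the invalid "\d" escape sequence and work on every OS.
    df = pd.read_csv('data/data.csv', encoding='iso-8859-1')
except Exception as e:
    print(f"An error occurred while reading the CSV file: {e}")
    raise  # Stop here: the following cells need df to exist.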


In [71]:
# This cell holds commented-out code for inspecting df.
# Uncomment a line to see some characteristics of df.
# df.head(5)
# df.columns
# df.shape # (38480, 37): 38480 rows, 37 columns
# df.describe()
# df.info() # The columns have different dtypes.
df.head(5)

Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,emp_length,...,total_acc,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,repay_fail
0,2,2,2,0.0,0.0,0.0,36 months,0.0,0.0,< 1 year,...,1.0,0.0,0.0,0.0,0.0,Jan-07,0.0,Jan-07,Jan-07,1
1,3,545583,703644,2500.0,2500.0,2500.0,36 months,13.98,85.42,4 years,...,10.0,3075.291779,3075.29,2500.0,575.29,Jul-13,90.85,Aug-13,Jun-16,0
2,4,532101,687836,5000.0,5000.0,5000.0,36 months,15.95,175.67,4 years,...,15.0,2948.76,2948.76,1909.02,873.81,Nov-11,175.67,,Mar-12,1
3,5,877788,1092507,7000.0,7000.0,7000.0,36 months,9.91,225.58,10+ years,...,20.0,8082.39188,8082.39,7000.0,1082.39,Mar-14,1550.27,,Mar-14,0
4,6,875406,1089981,2000.0,2000.0,2000.0,36 months,5.42,60.32,10+ years,...,15.0,2161.663244,2161.66,2000.0,161.66,Feb-14,53.12,,Jun-16,0


In [72]:
df.isna().sum()

Unnamed: 0                    0
id                            0
member_id                     0
loan_amnt                     1
funded_amnt                   1
funded_amnt_inv               1
term                          0
int_rate                      0
installment                   1
emp_length                  993
home_ownership                0
annual_inc                    2
verification_status           0
issue_d                       0
loan_status                   0
purpose                       0
zip_code                      0
addr_state                    0
dti                           0
delinq_2yrs                   1
earliest_cr_line              0
inq_last_6mths                1
mths_since_last_delinq    24363
open_acc                      1
pub_rec                       1
revol_bal                     4
revol_util                   59
total_acc                     1
total_pymnt                   1
total_pymnt_inv               1
total_rec_prncp               1
total_re

In [73]:
# Drop next_pymnt_d because most of its values are missing (35097 missing values).
# Drop mths_since_last_delinq for the same reason (24363 missing values).
df.drop(columns=['next_pymnt_d', 'mths_since_last_delinq'], inplace=True)
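
The same decision can be expressed as a rule on the share of missing values per column. The sketch below is an alternative under stated assumptions, not what the notebook does: `drop_sparse_columns` is a hypothetical helper and the 0.5 threshold is arbitrary. With 38480 rows, both dropped columns (35097 and 24363 missing values) would exceed it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with different amounts of missing data per column.
sample = pd.DataFrame({
    "loan_amnt": [2500.0, 5000.0, np.nan, 2000.0],
    "next_pymnt_d": [None, "Aug-13", None, None],
    "emp_length": ["4 years", None, "10+ years", "< 1 year"],
})


def drop_sparse_columns(frame, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isna().mean()
    to_drop = missing_share[missing_share > max_missing].index
    # Returns a new frame; the input is left unchanged.
    return frame.drop(columns=to_drop), missing_share


cleaned, missing_share = drop_sparse_columns(sample)
print(missing_share)
print(cleaned.columns.tolist())
```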

In [76]:
df.columns

Index(['Unnamed: 0', 'id', 'member_id', 'loan_amnt', 'funded_amnt',
       'funded_amnt_inv', 'term', 'int_rate', 'installment', 'emp_length',
       'home_ownership', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'purpose', 'zip_code', 'addr_state', 'dti',
       'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'last_pymnt_d',
       'last_pymnt_amnt', 'last_credit_pull_d', 'repay_fail'],
      dtype='object')

In [79]:
# Loan amount statistical analysis
# Mean
# Check for empty or missing values first (df.isna().sum() reports 1 for loan_amnt)
# df.loan_amnt.mean() # = 11094.73
