### In finance, there are several major types of risks that individuals, businesses, and investors need to be aware of and manage. Each type of risk has its own characteristics, causes, and potential consequences. 

**The four major types of financial risk, explained in detail:**

1. **Market Risk:**
   - **Definition:** Market risk, also known as systematic risk or non-diversifiable risk, is the risk associated with the overall market or a specific market segment. It refers to the potential for losses due to fluctuations in market factors such as interest rates, exchange rates, and stock prices.
   - **Causes:** Market risk can result from economic events, geopolitical developments, central bank policy changes, and broader market sentiment.
   - **Consequences:** Market risk can lead to declines in the value of investments and portfolios. It affects all investors to some degree and cannot be eliminated through diversification.

2. **Credit Risk:**
   - **Definition:** Credit risk, also known as default risk, is the risk that a borrower or issuer of debt securities may fail to meet their financial obligations, such as making interest payments or repaying principal.
   - **Causes:** Credit risk arises from the financial instability or creditworthiness of borrowers, which can be influenced by economic conditions, business performance, and management decisions.
   - **Consequences:** Credit risk can result in losses for lenders or investors holding debt securities. It is a primary concern for bondholders and creditors.

3. **Liquidity Risk:**
   - **Definition:** Liquidity risk is the risk that an asset cannot be bought or sold quickly enough in the market without significantly affecting its price. It pertains to the ease of converting an asset into cash.
   - **Causes:** Liquidity risk can be caused by a lack of trading activity, market disruptions, or when an asset is illiquid by nature.
   - **Consequences:** Liquidity risk can lead to difficulties in selling assets when needed, potentially resulting in losses or the inability to meet financial obligations.

4. **Operational Risk:**
   - **Definition:** Operational risk arises from internal failures, including human errors, system malfunctions, fraud, and inadequate processes or controls within an organization.
   - **Causes:** Operational risk is often attributed to human actions or system failures. It can also result from external events, such as natural disasters.
   - **Consequences:** Operational risk can lead to financial losses, damage to reputation, legal issues, and disruptions in business operations.
   
   
### A chart of the other types of financial risks:

In [None]:
from IPython.display import Image

# Specify the path to the image
image_path = "/kaggle/input/credit-risk/TypesOfFinancialRisk.png"

# Display the image at the requested size
Image(image_path, width=800, height=800)

### In this notebook, my focus will be on:

- **credit risk**

- **how loan borrowers are assessed and graded based on their credit history**

- **creating a model to help profile a new borrower**

- **predicting an appropriate interest rate for them, based on their history**

In [None]:
import pandas as pd

data = pd.read_csv('/kaggle/input/credit-risk/loan.csv', low_memory=False)
data

In [None]:
data.describe()

In [None]:
data.isnull().sum().sort_values(ascending=False)

In [None]:
data.dtypes.value_counts()

## Data Exploration

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define the desired order for the 'grade' variable (alphabetical sorting)
grade_order = sorted(data['grade'].unique())

# Define a custom color palette with colors in increasing order
custom_palette = sns.color_palette("coolwarm", len(grade_order))

sample_size = 1000  # Adjust this to your desired sample size
sampled_data = data.sample(sample_size)

# Create a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='loan_amnt', y='int_rate', hue='grade', hue_order=grade_order, 
                data=sampled_data, palette=custom_palette)

# Add labels and title
plt.xlabel('Loan Amount')
plt.ylabel('Interest Rate')
plt.title('Interest Rates vs Loan Amount')

# Show legend and customize the legend title
plt.legend(title='Grade', loc='upper right')

# Display the plot
plt.show()

**You can observe patterns in how interest rates are segmented; they almost look like clusters.**

- At any given loan amount, a higher interest rate goes with a lower grade.

- Borrowers with a lower grade (a lower "credit rating or score") are assigned a higher interest rate. Lending to borrowers with a lower credit score is considered risky, so the lender charges a premium as a hedge against that risk. Lending institutions operate to maximize their profits while managing risks. They use credit scoring models and other underwriting criteria to assess the creditworthiness of borrowers. The interest rate assigned to a borrower reflects the lender's perception of that borrower's risk. Lenders seek to strike a balance between offering competitive rates to attract borrowers and ensuring that the interest rates cover potential losses due to defaults.

- Borrowers can influence their interest rates by improving their credit profiles. This may involve maintaining a good payment history, reducing outstanding debt, and managing their credit responsibly. By doing so, borrowers can qualify for loans with lower interest rates and better terms.

- It's essential to note that interest rates can also be influenced by regulatory changes and broader economic conditions. Government policies, market interest rates, and lender competition can impact the rates offered to borrowers.
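
The first observation can be checked numerically rather than by eye. A minimal sketch, using a small hypothetical DataFrame in place of `data` (the values are made up for illustration, and only the column names `grade` and `int_rate` come from the dataset):

```python
import pandas as pd

# Toy data with hypothetical values; the notebook would use `data` instead.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C"],
    "int_rate": [6.0, 7.0, 10.0, 11.0, 14.0, 15.0],
})

# Average and range of the interest rate within each grade.
rate_by_grade = toy.groupby("grade")["int_rate"].agg(["mean", "min", "max"])
print(rate_by_grade)

# True when the mean rate rises steadily from the best grade to the worst.
rates_increase = bool(rate_by_grade["mean"].is_monotonic_increasing)
print(rates_increase)
```

On the real data, overlapping min/max ranges between neighbouring grades would indicate how sharp the segmentation actually is.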

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Change in interest rates with changes in grades
plt.figure(figsize=(12, 6))
sns.pointplot(x='sub_grade', y='int_rate', data=data, order=sorted(data['sub_grade'].unique()))
plt.title("Interest Rate vs Grades")
plt.xlabel("Subgrade")
plt.ylabel("Interest Rate")
plt.xticks(rotation=45)
plt.show()

In [None]:
# Create a histogram plot of interest rates grouped by grades
plt.figure(figsize=(10, 6))

# Define the desired order for the 'grade' variable (alphabetical sorting)
grade_order = sorted(data['grade'].unique())

# Define a custom color palette with colors in increasing order
custom_palette = sns.color_palette("coolwarm", len(grade_order))

sns.histplot(data=data, x='int_rate', hue='grade', hue_order=grade_order,
             palette=custom_palette, bins=30, kde=True)
plt.title('Interest Rate Distribution by Grades')
plt.xlabel('Interest Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Count the number of occurrences of each grade
grade_counts = data['grade'].value_counts()

# Create a bar plot
plt.figure(figsize=(8, 6))
grade_counts.plot(kind='bar')
plt.title('Number of Loans per Grade')
plt.xlabel('Grade')
plt.ylabel('Number of Loans')
plt.xticks(rotation=0)  # To make sure the grade labels are not rotated

# Show the plot
plt.show()

In [None]:
# Group the data by subgrade and count the number of occurrences
subgrade_counts = data['sub_grade'].value_counts()

# Create a DataFrame from the counts
subgrade_counts_df = pd.DataFrame({'Subgrade': subgrade_counts.index, 'Count': subgrade_counts.values})

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(subgrade_counts_df['Subgrade'], subgrade_counts_df['Count'])
plt.xlabel('Subgrade')
plt.ylabel('Count')
plt.title('Number of Loans by Subgrade')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

In [None]:
# Visualize the distribution of loan amounts
plt.figure(figsize=(10, 6))
sns.histplot(data['loan_amnt'], bins=30, kde=True)
plt.title('Loan Amount Distribution')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Visualize the distribution of interest rates
plt.figure(figsize=(10, 6))
sns.histplot(data['int_rate'], bins=30, kde=True)
plt.title('Interest Rate Distribution')
plt.xlabel('Interest Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Visualize the count of different loan terms
plt.figure(figsize=(8, 5))
sns.countplot(x='term', data=data)
plt.title('Count of Loan Terms')
plt.xlabel('Term')
plt.ylabel('Count')
plt.show()

With a suitable understanding of the dataset from the above exploration, we can go ahead with pre-processing data for the model.

## Data Pre-Processing

In [None]:
# Threshold for null values; columns at or above it get dropped.
# Note: null_percentages below is on a 0-100 scale, so 0.80 means 0.8% here
# (use threshold = 80 for an 80% cutoff)
threshold = 0.80

# Calculate the number of null values in each column
null_counts = data.isnull().sum()
null_percentages = (null_counts / len(data)) * 100

# Identify columns with null percentages greater than or equal to the threshold
columns_to_drop = null_percentages[null_percentages >= threshold].index
data = data.drop(columns=columns_to_drop)

data.shape

**The shape decreased from (887379, 74) to (887379, 46), so 28 columns above the null threshold were dropped**
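
A null cutoff only means what its comment says when the threshold and the null share are on the same scale. The sketch below uses a hypothetical helper, `drop_sparse_columns`, and a toy DataFrame (both are assumptions for illustration, not part of the notebook) to show a fraction-based cutoff, and then what happens when a 0-100 percentage is compared against `0.80`:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_null_fraction=0.80):
    """Drop columns whose share of nulls (0 to 1) is at or above the cutoff."""
    null_fraction = df.isnull().mean()  # fraction of nulls per column
    to_drop = null_fraction[null_fraction >= max_null_fraction].index
    return df.drop(columns=to_drop), list(to_drop)

toy = pd.DataFrame({
    "mostly_null": [np.nan] * 9 + [1.0],       # 90% null
    "some_null": [np.nan] * 2 + [1.0] * 8,     # 20% null
    "complete": list(range(10)),               # 0% null
})

# Fraction scale: only the 90%-null column is removed.
kept, dropped = drop_sparse_columns(toy, max_null_fraction=0.80)
print(dropped, kept.shape)

# Percent scale compared against 0.80: every column with at least 0.8% nulls goes.
null_percent = toy.isnull().mean() * 100
dropped_on_percent_scale = list(null_percent[null_percent >= 0.80].index)
print(dropped_on_percent_scale)
```

Using `isnull().mean()` keeps everything on a 0 to 1 scale, which avoids the mismatch altogether.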

In [None]:
# Separating a table which contains the target features for the model

target_columns = ['id', 'grade', 'sub_grade', 'int_rate']
target = data[target_columns].copy()

# Remove the specified columns from the original DataFrame 'data'
data.drop(['grade', 'sub_grade'], axis=1, inplace=True)

target

In [None]:
# check for any null values
target.isna().sum()

### Text pre-processing

In [None]:
object_columns = data.select_dtypes(include=['object']).columns

for column in object_columns:
    print(column)

In [None]:
## Identifying the different values in categorical columns

for column in object_columns:
    unique_values = data[column].value_counts()
    print(f"\nColumn: {column}")
    print(unique_values)

In [None]:
# dropping some useless columns, which won't contribute to the model

data.drop(['url', 'policy_code', 'member_id'], axis=1, inplace=True)

In [None]:
# categorical encoding for selective features

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

columns = ['home_ownership', 'verification_status', 'initial_list_status', 'application_type', 'pymnt_plan']
for col in columns:
    data[col] = le.fit_transform(data[col])

In [None]:
# separating a table containing features which require text pre-processing
text_columns = ['id', 'issue_d', 'earliest_cr_line', 'last_credit_pull_d', 'term', 'title', 'purpose', 'addr_state', 'zip_code', 'loan_status']

# creating a copy here, so that you can run the below cell to get text df back,
# if you mess up on some processing down below
data_c = data.copy()

# Remove the specified columns, except for 'id', from the original DataFrame 'data'
data.drop(columns=[col for col in text_columns if col != 'id'], inplace=True)

In [None]:
# Create a new DataFrame 'text' containing the specified columns, 
# including 'id', as it would help in merging back later
text = data_c[text_columns].copy()

text

In [None]:
# text.to_csv('loan/loan_text.csv', index=False)

In [None]:
# Calculate the counts for each loan status
loan_status_counts = text['loan_status'].value_counts().reset_index()
loan_status_counts.columns = ['Loan Status', 'Count']

# Create the horizontal bar graph
plt.figure(figsize=(12, 6))
sns.barplot(x='Count', y='Loan Status', data=loan_status_counts)
plt.title('Loan Status Distribution')
plt.xlabel('Count')
plt.ylabel('Loan Status')
plt.show()

### Explaining the Loan Statuses
The loan statuses fall into four groups, each with sub-divisions:

1. **Active Loans**
* Recently issued loans (<6 months)
* Currently active, i.e. still within their tenure


2. **Fully Paid Loans**
* Does not meet the credit policy. Status:Fully Paid
* Fully Paid - All outstanding payments have been made and the loan is closed.


3. **Loans with Late Payments**
* In Grace Period - 1-15 days have passed since the due date
* Late (16-30 days) - 16-30 days have passed since the due date
* Late (31-120 days) - 31-120 days have passed since the due date


4. **Defaulted Loans**
* Default - The borrower has not been able to make outstanding payments for an extended period of time
* Charged Off - A charge-off usually occurs when the creditor has deemed an outstanding debt uncollectible
* Does not meet the credit policy. Status:Charged Off

In [None]:
# Define a custom encoding dictionary
# assign weights to categories based on what impact they should have
custom_encoding = {
    'Fully Paid': 10,
    'Does not meet the credit policy. Status:Fully Paid': 8,
    'Current': 4,
    'Issued': 2,
    'In Grace Period': 0,
    'Late (16-30 days)': -2,
    'Late (31-120 days)': -5,
    'Charged Off': -7,
    'Does not meet the credit policy. Status:Charged Off': -9,
    'Default': -10
}

# Apply the custom encoding
text['loan_status'] = text['loan_status'].map(custom_encoding)
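
One thing worth checking after a dictionary-based encoding like this is whether any status was left unmapped, since `Series.map` returns NaN for values that are missing from the dictionary. A minimal sketch with a shortened, hypothetical encoding and toy statuses (not the notebook's full dictionary):

```python
import pandas as pd

# Hypothetical, shortened encoding used only for this illustration.
toy_encoding = {"Fully Paid": 10, "Current": 4, "Charged Off": -7}

statuses = pd.Series(["Fully Paid", "Current", "Charged Off", "Default"])
encoded = statuses.map(toy_encoding)

# Statuses that are absent from the dictionary come back as NaN.
unmapped = statuses[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list means every category received a weight; anything else would otherwise be passed on silently as a missing value.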

In [None]:
from IPython.display import Image

# Specify the path to the image
image_path = "/kaggle/input/credit-risk/loan-default_633ad27865f83.jpg"

# Display the image at the requested size
Image(image_path, width=800, height=800)

#### Even one missed payment can damage your credit score!

**If a period of 120 days has passed since the default notice, the creditor will send a letter claiming the total amount payable.**

1. Secured loans: For secured loans such as loans against property, home loans and car loans, the legal rights to the property or the car are handed over to the lender after repeated defaults. If assets like gold, shares or other investments, or insurance are pledged, the lender takes possession of these assets and sells them at market value to recover its loss. Before doing so, however, the financial institution is obligated to notify the borrower and give them a specified time limit to pay off the debt.

2. Unsecured loans: If you don't pledge any asset or provide any guarantor, the loan is considered unsecured. Defaulting on such loans could lead to the following:
* An increased interest rate: If you haven't paid your EMIs, the lender will increase the interest rate and/or levy additional fees and charges on your loan.
* A lower credit score: An EMI default lowers the borrower's credit score, which affects their future ability to take on debt.
* Collection agencies: Some lenders turn to collection agencies to get back their money. These agencies could call you, write you letters or make a house visit.
* A lawsuit by the lender: Some lenders who don't receive their money sue the defaulting borrowers. For the borrower, this could mean clearing the outstanding amount and also paying the legal fees and charges.

3. Student loans: Student loans are often considered high risk in terms of default due to the nature of the loan. Students usually struggle to meet their payments right out of college, which leads to increased interest amounts and a bad credit score in the long run, and that can hamper their future ability to borrow.


If you do default on a loan, don't worry. You can bring yourself out of that situation by taking the following steps:

1. Don't panic: Defaulting on a loan payment can cause stress and worry. Begin by calmly reviewing your expenditure and understanding why you were unable to make the payment.

2. Communicate with the lender: Explain the reason for your loan default and work out a solution that benefits both of you. Some institutions are flexible with their policy terms, which can come in handy when negotiating your repayment plan.

3. Consider refinancing: Refinancing gives you the ability to reduce your monthly EMI amount. However, most financial institutions will only consider individuals with good credit scores for refinancing.

In [None]:
from IPython.display import Image

# Specify the path to the image
image_path = "/kaggle/input/credit-risk/us-regions-map.jpg"

# Display the image at the requested size
Image(image_path, width=800, height=800)

### Feature engineering the borrower's location

The first three digits of a U.S. ZIP code represent a sectional center facility (SCF). An SCF is a central mail processing facility that serves as a hub for mail distribution within a specific geographic region. These three digits help route mail more efficiently within the United States.

The first digit of the ZIP code represents a group of U.S. states, and the second and third digits represent a region within that group (or perhaps a large city). Together, these first three digits narrow down the destination area for incoming mail. The remaining two digits in a ZIP code provide even more precise location information.

For example, in the ZIP code "90210," the first digit "9" represents a group of western U.S. states, and the "02" represents a particular area within that region, which happens to correspond to Beverly Hills, California. The final "10" provides additional specificity within Beverly Hills.

In this dataset, we only have the first three digits. However, since we have the state names, we can create a new leading digit by encoding the region in which the person lives, followed by the three-digit ZIP prefix.

The US is divided up into five regions:
- Northeast
- West
- Midwest
- Southwest
- Southeast
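
The idea of a region digit followed by the three known ZIP digits can be sketched on a toy table. The example assumes masked ZIP codes of the form `902xx` and a hand-written `region_codes` mapping; both are assumptions for illustration (the notebook itself encodes the region with `LabelEncoder`). The codes start at 1 so that the leading digit survives the conversion to an integer:

```python
import pandas as pd

# Hypothetical integer codes for the five regions (starting at 1, not 0).
region_codes = {"Northeast": 1, "West": 2, "Midwest": 3, "Southwest": 4, "Southeast": 5}

toy = pd.DataFrame({
    "zip_code": ["902xx", "100xx", "606xx"],
    "region": ["West", "Northeast", "Midwest"],
})

# Region digit followed by the three known ZIP digits, e.g. West + 902 -> 2902.
zip_prefix = toy["zip_code"].str[:3]
toy["location"] = (toy["region"].map(region_codes).astype(str) + zip_prefix).astype(int)
print(toy)
```

With a code of 0 for one region, the integer conversion would drop the leading zero and that region's locations would have three digits instead of four.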

In [None]:
# Define the mapping of state abbreviations to regions

state_to_region = {
    'AL': 'Southeast',
    'AK': 'West',
    'AZ': 'Southwest',
    'AR': 'Southeast',
    'CA': 'West',
    'CO': 'West',
    'CT': 'Northeast',
    'DE': 'Northeast',
    'DC': 'Northeast',
    'FL': 'Southeast',
    'GA': 'Southeast',
    'HI': 'West',
    'ID': 'West',
    'IL': 'Midwest',
    'IN': 'Midwest',
    'IA': 'Midwest',
    'KS': 'Midwest',
    'KY': 'Southeast',
    'LA': 'Southeast',
    'ME': 'Northeast',
    'MD': 'Northeast',
    'MA': 'Northeast',
    'MI': 'Midwest',
    'MN': 'Midwest',
    'MS': 'Southeast',
    'MO': 'Midwest',
    'MT': 'West',
    'NE': 'Midwest',
    'NV': 'West',
    'NH': 'Northeast',
    'NJ': 'Northeast',
    'NM': 'Southwest',
    'NY': 'Northeast',
    'NC': 'Southeast',
    'ND': 'Midwest',
    'OH': 'Midwest',
    'OK': 'Southwest',
    'OR': 'West',
    'PA': 'Northeast',
    'RI': 'Northeast',
    'SC': 'Southeast',
    'SD': 'Midwest',
    'TN': 'Southeast',
    'TX': 'Southwest',
    'UT': 'West',
    'VT': 'Northeast',
    'VA': 'Southeast',
    'WA': 'West',
    'WV': 'Southeast',
    'WI': 'Midwest',
    'WY': 'West'
}

# Clean and map the state abbreviations
text['addr_state'] = text['addr_state'].str.strip()

# Create a new column 'region' based on the mapping
text['region'] = text['addr_state'].map(state_to_region)

In [None]:
plt.figure(figsize=(14, 6))
sns.countplot(x='region', data=text)
plt.title('Borrower Locations by Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

In [None]:
text.drop(['addr_state'], axis=1, inplace=True)

text['region'] = le.fit_transform(text['region'])

# Concatenate the encoded 'region' and 'zip_code' as strings (region digit first)
text['location'] = text['region'].astype(str) + text['zip_code'].astype(str)

# Keep the first four characters (region digit + three ZIP digits) and convert to integer
text['location'] = text['location'].str[:4]
text['location'] = text['location'].astype(int)

text.drop(['region', 'zip_code'], axis=1, inplace=True)

### Handling the Date Columns

The columns "issue_d," "earliest_cr_line," and "last_credit_pull_d" in a loan dataset represent dates related to a borrower's credit history and the loan issuance.

<br> 1. **"issue_d" (Issue Date):**
   - represents the date when the loan was issued or originated.
   - Lenders use this date to track when the borrower received the loan.


<br> 2. **"earliest_cr_line" (Earliest Credit Line):**
   - represents the date when the borrower opened their earliest known credit account.
   - It is a crucial date in assessing a borrower's credit history and length of credit.
   - Lenders consider the length of credit history when evaluating creditworthiness.


<br> 3. **"last_credit_pull_d" (Last Credit Pull Date):**
   - represents the date when the most recent inquiry or update was made to the borrower's credit report by a lender or financial institution.
   - It indicates the date when the lender last checked the borrower's credit information.
   - Lenders may use this date to ensure the borrower's creditworthiness has not changed significantly since the loan application.
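
A common way to make such dates usable by a model is to turn each one into "years before a fixed reference date". A minimal sketch on toy values, assuming month-year strings in the `%b-%Y` format (an assumption about the raw data) and the reference date 2016-01-01:

```python
import pandas as pd

reference_date = pd.Timestamp("2016-01-01")

# Hypothetical month-year strings; the "%b-%Y" format is an assumption.
toy = pd.DataFrame({
    "issue_d": ["Jan-2015", "Jan-2012"],
    "earliest_cr_line": ["Jan-2000", "Jan-1996"],
})

for col in ["issue_d", "earliest_cr_line"]:
    parsed = pd.to_datetime(toy[col], format="%b-%Y")
    # Elapsed time in years; 365.25 days per year is an approximation.
    toy[col + "_years"] = (reference_date - parsed).dt.days / 365.25

print(toy.round(2))
```

Here `earliest_cr_line_years` is the length of the credit history as of the reference date, which is the quantity lenders care about.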

In [None]:
import pandas as pd

date_cols = ['issue_d', 'earliest_cr_line', 'last_credit_pull_d']

# Missing values in these two columns are filled with the reference date,
# so their engineered value becomes 0 years
na_text_cols = ['earliest_cr_line', 'last_credit_pull_d']
reference_date = pd.Timestamp("2016-01-01")

for col in date_cols:
    # Convert the date strings to datetime objects
    parsed = pd.to_datetime(text[col])
    if col in na_text_cols:
        parsed = parsed.fillna(reference_date)
    # Feature engineering: years elapsed before the reference date
    # (days / 365.25 avoids the ambiguous month unit np.timedelta64(1, 'M'))
    text[col] = (reference_date - parsed).dt.days / 365.25

In [None]:
# 'term' looks like " 36 months"; extract the number of months
text['term'] = text['term'].str.extract(r'(\d+)', expand=False)

# Divide the 'term' values by 12 to get the term in years
text['term'] = text['term'].astype(int) / 12

In [None]:
# replace underscores in 'purpose' with spaces, then label-encode it
text['purpose'] = text['purpose'].str.replace('_', ' ')

text['purpose'] = le.fit_transform(text['purpose'])

text.drop(['title'], axis=1, inplace=True)

In [None]:
# text.to_csv('loan/text_cleaned.csv', index=False)

### Now, merging the divided columns to move ahead with further model building

In [None]:
data = pd.merge(data, text, on='id', how='outer')

In [None]:
data

### Handling the rest of the missing values

In [None]:
from sklearn.impute import KNNImputer

# Initialize the KNNImputer with the number of neighbors (k)
knn_imputer = KNNImputer(n_neighbors=50)

# Perform KNN imputation on the DataFrame
data_imputed = knn_imputer.fit_transform(data)

# Convert the result back to a DataFrame
data = pd.DataFrame(data_imputed, columns=data.columns)

data

In [None]:
#Using Pearson Correlation
plt.figure(figsize=(25,10))
cor = data.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
# data.to_csv('loan/loan_cleaned.csv', index=False)

## Credit Risk Model

In [None]:
# creating a copy as a fail safe in case of any errors, and a restart is required
data_m = data.copy()

In [None]:
data_m.drop(['id'], axis=1, inplace=True)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Step 1: Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_m)

In [None]:
pca = PCA(n_components=0.90)  # Choose the explained variance threshold
reduced_data = pca.fit_transform(scaled_data)

In [None]:
reduced_data.shape

### K-Means Clustering:


1. **Model Architecture Explanation:**
   - K-Means is a centroid-based clustering algorithm.
   - It starts with randomly initializing K cluster centroids (points in the feature space).
   - Then, it assigns each data point to the nearest centroid.
   - After that, it recalculates the centroids as the mean of all points assigned to each cluster.
   - These two steps (assignment and update) are repeated iteratively until convergence.
   - The algorithm aims to minimize the sum of squared distances between data points and their respective cluster centroids.

2. **When to Use the Model:**
   - K-Means is used for clustering data into K distinct groups based on similarity.
   - It's useful when you have unlabeled data and want to discover hidden patterns or groupings.
   - Common applications include customer segmentation, image compression, document clustering, and anomaly detection.

3. **Cost Function and Average Time Complexity:**
   - The cost function of K-Means is the sum of squared distances between data points and their assigned cluster centroids.
   - Mathematically, it's expressed as: $J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the centroid of that cluster.
   - K-Means has an average time complexity of O(t * K * N * d), where:
     - t is the number of iterations (convergence usually occurs within a small number of iterations).
     - K is the number of clusters.
     - N is the number of data points.
     - d is the number of dimensions (features).
   - Despite its efficiency, K-Means can struggle with large datasets, high dimensionality, and non-spherical clusters.

4. **Evaluation Metrics:**
   - There are several metrics to evaluate K-Means clustering results, including:
     - **Inertia or Within-Cluster Sum of Squares (WCSS):** It measures how tightly grouped the data points are within each cluster. Lower values indicate better clustering.
     - **Silhouette Score:** This metric quantifies how similar each data point is to its own cluster compared to other clusters. Values range from -1 to 1, where higher values are better.
     - **Davies-Bouldin Index:** It measures the average similarity ratio of each cluster with the cluster that is most similar to it. Lower values suggest better clustering.
     - **Calinski-Harabasz Index (Variance Ratio Criterion):** It measures the ratio of between-cluster variance to within-cluster variance. Higher values indicate better separation of clusters.

   The choice of evaluation metric depends on the nature of your data and your specific goals. Typically, a combination of these metrics is used to assess the quality of K-Means clustering.
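
The assignment and update steps described above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic data, not what the notebook runs (it uses scikit-learn's `KMeans`); the function name `kmeans` and its parameters are assumptions made for this example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    # Final assignment and cost (sum of squared distances to the nearest centroid).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()
    return centroids, labels, inertia

# Two synthetic blobs (hypothetical data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
centroids, labels, inertia = kmeans(X, k=2)
print(centroids.round(2))
print(round(float(inertia), 2))
```

The distance matrix has shape (N, K) per iteration, which is where the O(t * K * N * d) cost comes from.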

In [None]:
from IPython.display import Image

# Specify the path to the image
image_path = "/kaggle/input/credit-risk/MqHvx.png"

# Display the image at the requested size
Image(image_path, width=800, height=800)

### Elbow method to find number of clusters

The Elbow Method is a heuristic technique used to determine the optimal number of clusters in a dataset for K-Means clustering. It is called the "Elbow Method" because the plot of the number of clusters against the within-cluster sum of squares (WCSS) resembles an elbow, and the "elbow point" is the point at which the WCSS starts to decrease at a slower rate. Here's a detailed explanation of how the Elbow Method works:

**1. Within-Cluster Sum of Squares (WCSS):**
   - The WCSS measures the compactness of clusters. It is calculated as the sum of squared distances between each data point in a cluster and the centroid of that cluster. Mathematically, for each cluster `k`, the WCSS is given by:
     ```
     WCSS(k) = Σ(distance(data_point_i, centroid_k)^2) for all data points in cluster k
     ```     
   - The total WCSS for all clusters is calculated as the sum of WCSS for each cluster.

**2. How the Elbow Method Works:**
   - The Elbow Method involves running the K-Means algorithm for a range of values of `k` (the number of clusters), typically from 1 to a predefined maximum value.
   
   - For each value of `k`, calculate the WCSS.
   
   - Plot the number of clusters (`k`) against the corresponding WCSS values. You'll typically see a plot where the WCSS decreases as `k` increases. The Elbow Method helps you identify the point at which this decrease slows down, creating an "elbow" shape in the plot.

**3. Choosing the Optimal Number of Clusters:**
   - The key question is, "Where is the elbow point in the plot?" This point indicates the optimal number of clusters.
   
   - When you plot `k` vs. WCSS, you'll observe that initially, as `k` increases, WCSS decreases sharply. This is because with more clusters, the data points are closer to their respective centroids. However, beyond a certain point, adding more clusters does not significantly reduce the WCSS, and the rate of decrease slows down.
   
   - The elbow point is where the rate of WCSS reduction starts to decrease. It represents a balance between the model's complexity (number of clusters) and its goodness of fit (compactness of clusters). Choosing `k` at the elbow point aims to find a balance between overfitting (too many clusters) and underfitting (too few clusters).

**4. Interpreting the Elbow Point:**
   - There might not always be a clear, distinct elbow point in the plot. In such cases, the choice of `k` may be somewhat subjective. You can use your domain knowledge or other validation techniques to make the final decision.

Since the dataset is large, this process takes a while, so be patient with it.
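
The same procedure can be tried by hand on a small dataset before running it on the full one. A minimal sketch, assuming synthetic data from `make_blobs` with three hand-picked centers (the data and the range of `k` are illustrative assumptions), that collects the WCSS from scikit-learn's `inertia_` attribute:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (hypothetical, for illustration).
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (5, 5), (0, 5)],
                  cluster_std=0.6, random_state=42)

wcss = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = model.inertia_  # total within-cluster sum of squares for this k

# WCSS and the reduction gained by each additional cluster;
# the elbow is where this reduction flattens out.
for k in range(2, 8):
    print(k, round(wcss[k], 1), round(wcss[k - 1] - wcss[k], 1))
```

For data like this, the reduction should be large up to three clusters and small afterwards, which is the pattern the elbow plot shows graphically.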

In [None]:
# Step 3: Identify number of clusters with elbow method
from yellowbrick.cluster import KElbowVisualizer

print('Elbow Method to determine the number of clusters to be formed:')
Elbow_M = KElbowVisualizer(KMeans(), k=15)
Elbow_M.fit(reduced_data)
Elbow_M.show()

#### At k=8 the distortion score (i.e. the WCSS) curve flattens out, hence it is a recommended choice.

In [None]:
# Step 4: K-Means Clustering
optimal_clusters = 8  # Replace with your chosen number
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(reduced_data)

# Add cluster labels to the original dataset
data_m['cluster'] = cluster_labels

I also tried Hierarchical Clustering and DBSCAN on this data.

Due to the very large size of this dataset,
- Hierarchical clustering kills the kernel, thus terminating the process.
- DBSCAN doesn't produce meaningfully distinct clusters.

**Hence, K-Means is the best option here**
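
A comparison like this can be backed with the evaluation metrics listed earlier. The sketch below computes them with scikit-learn on synthetic data (the data, the number of clusters and the `sample_size` are assumptions for illustration); on a dataset of this size, the silhouette score in particular would likely need to be computed on a sample:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the reduced feature matrix (hypothetical data).
X, _ = make_blobs(n_samples=2000, centers=[(0, 0), (6, 6), (0, 6)],
                  cluster_std=0.7, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# sample_size makes silhouette_score use a random subset, which keeps the
# pairwise-distance computation affordable on large datasets.
sil = silhouette_score(X, labels, sample_size=500, random_state=42)
dbi = davies_bouldin_score(X, labels)      # lower is better
chi = calinski_harabasz_score(X, labels)   # higher is better

print(f"Silhouette: {sil:.3f}")
print(f"Davies-Bouldin: {dbi:.3f}")
print(f"Calinski-Harabasz: {chi:.1f}")
```

Running the same three calls on the labels from each candidate algorithm would give a like-for-like comparison.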

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sample_size = 1000  
sampled_data = data_m.sample(sample_size)

# Define the display order for the 'cluster' variable (sorted)
cluster_order = sorted(data_m['cluster'].unique())

# Define a custom color palette with one color per cluster
custom_palette = sns.color_palette("coolwarm", len(cluster_order))

# Create a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='loan_amnt', y='int_rate', hue='cluster', hue_order=cluster_order, 
                data=sampled_data, palette=custom_palette)

# Add labels and title
plt.xlabel('Loan Amount')
plt.ylabel('Interest Rate')
plt.title('K means Interest Rates vs Loan Amount Clusters')

# Show legend
plt.legend(title='Cluster', loc='upper right')

# Display the plot
plt.show()

## Interest Rate Prediction

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# X is the feature matrix and y is the target variable (int_rate)
X = data_m.drop(columns=['int_rate'])  # Exclude the target column
y = data_m['int_rate']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # You can adjust the test_size

# Initialize a Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)  # You can adjust n_estimators

# Fit the model to the training data
rf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")