# Credit Risk Classification

In this project, we'll use various techniques to train and evaluate a model based on loan risk. We'll work with a dataset of historical lending activity from a peer-to-peer lending services company to build a model that can identify the creditworthiness of borrowers.

## Before You Begin

1. Create a new repository for this project called `credit-risk-classification`. Do not add this homework to an existing repository.

2. Clone the new repository to your computer.

3. Inside your `credit-risk-classification` repository, create a folder titled `Credit_Risk`.

4. Inside the `Credit_Risk` folder, add the `credit_risk_classification.ipynb` and `lending_data.csv` files found in the `Starter_Code.zip` file.

5. Push your changes to GitHub.

## Files

Download the following files to help you get started:

[Module 20 Challenge files](https://example.com) (Replace this with the actual link to download the files)

## Instructions

### Split the Data into Training and Testing Sets

1. Open the `credit_risk_classification.ipynb` notebook and use it to complete the following steps:

2. Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

3. Create the labels set (`y`) from the "loan_status" column, and then create the features (`X`) DataFrame from the remaining columns.

   NOTE: A value of 0 in the "loan_status" column means that the loan is healthy. A value of 1 means that the loan has a high risk of defaulting.

4. Split the data into training and testing datasets using `train_test_split`.
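The split step can be sketched on a small synthetic table. This is a minimal illustration, not the assignment's required solution: `toy_df` and the `loan_size` column are hypothetical stand-ins for `lending_data.csv`, and `stratify=y` is an optional choice (not required by the instructions) that keeps the class ratio the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for lending_data.csv:
# 90 healthy loans (0) and 10 high-risk loans (1)
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    "loan_size": rng.normal(10_000, 2_000, size=100),
    "loan_status": [0] * 90 + [1] * 10,
})

# Labels come from loan_status; features are every remaining column
y = toy_df["loan_status"]
X = toy_df.drop(columns="loan_status")

# stratify=y (an optional assumption here) preserves the 90/10 ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.sum(), y_test.sum())
```

Without `stratify`, a rare class can end up under-represented in the test set purely by chance, which makes the evaluation metrics noisier.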

### Create a Logistic Regression Model with the Original Data

1. Use your knowledge of logistic regression to complete the following steps:

2. Fit a logistic regression model using the training data (`X_train` and `y_train`).

3. Save the predictions for the testing data labels by using the testing feature data (`X_test`) and the fitted model.

4. Evaluate the model's performance by doing the following:

   - Generate a confusion matrix.
   - Print the classification report.

5. Answer the following question: How well does the logistic regression model predict both the 0 (healthy loan) and 1 (high-risk loan) labels?
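To answer the question for each label, it helps to know how scikit-learn lays out the confusion matrix. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, not results from the lending data) to show how precision and recall for the high-risk class follow from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = healthy, 1 = high-risk)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
matrix = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = matrix.ravel()

# Precision: of the loans flagged high-risk, how many really were
precision_1 = tp / (tp + fp)
# Recall: of the truly high-risk loans, how many were flagged
recall_1 = tp / (tp + fn)

print(matrix)
print(f"precision (1): {precision_1:.2f}, recall (1): {recall_1:.2f}")
```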

### Write a Credit Risk Analysis Report

Write a brief report that includes a summary and analysis of the performance of the machine learning models used in this project. The report should be written as the `README.md` file included in your GitHub repository.

#### Structure of the Report

1. An overview of the analysis: Explain the purpose of this analysis.

2. The results: Using a bulleted list, describe the accuracy, precision, and recall scores of the machine learning model.

3. A summary: Summarize the results from the machine learning model. Include your justification for recommending the model for use by the company. If you don’t recommend the model, justify your reasoning.
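One possible way to produce the bulleted results for the report is to ask `classification_report` for a dictionary instead of a string. This is a sketch under assumptions: the labels are invented, and the display names passed to `target_names` are chosen here for readability only.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for illustration only (0 = healthy, 1 = high-risk)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

class_names = ["healthy loan", "high-risk loan"]

# output_dict=True returns nested dictionaries keyed by class name
report = classification_report(
    y_true, y_hat, target_names=class_names, output_dict=True
)

# Build Markdown bullets that can be pasted into README.md
summary_lines = [
    f"- {name}: precision {report[name]['precision']:.2f}, "
    f"recall {report[name]['recall']:.2f}"
    for name in class_names
]
summary_lines.append(f"- accuracy: {report['accuracy']:.2f}")

print("\n".join(summary_lines))
```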

## Requirements

### Split the Data into Training and Testing Sets (30 points)

To receive all points, you must:

- Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame. (5 points)
- Create the labels set (`y`) from the "loan_status" column, and then create the features (`X`) DataFrame from the remaining columns. (10 points)
- Split the data into training and testing datasets using `train_test_split`. (15 points)

### Create a Logistic Regression Model (30 points)

To receive all points, you must:

- Fit a logistic regression model using the training data (`X_train` and `y_train`). (10 points)
- Save the predictions for the testing data labels by using the testing feature data (`X_test`) and the fitted model. (5 points)
- Evaluate the model's performance by doing the following:
  - Generate a confusion matrix. (5 points)
  - Generate a classification report. (5 points)
- Answer the following question: How well does the logistic regression model predict both the 0 (healthy loan) and 1 (high-risk loan) labels? (5 points)

### Write a Credit Risk Analysis Report (20 points)

To receive all points, you must:

- Provide an overview that explains the purpose of this analysis. (5 points)
- Using a bulleted list, describe the accuracy, precision, and recall scores of the machine learning model. (5 points)
- Summarize the results from the machine learning model. Include your justification for recommending the model for use by the company. If you don’t recommend the model, justify your reasoning. (10 points)

### Coding Conventions and Formatting (10 points)

To receive all points, you must:

- Place imports at the top of the file, just after any module comments and docstrings and before module globals and constants. (3 points)
- Name functions and variables with lowercase characters, with words separated by underscores. (2 points)
- Follow DRY (Don't Repeat Yourself) principles, creating maintainable and reusable code. (3 points)
- Use concise logic and creative engineering where possible. (2 points)

### Code Comments (10 points)

To receive all points, your code must:

- Be well-commented with concise, relevant notes that other developers can understand. (10 points)

---

### General Outline:

1. **Data Preparation:**
   - Read the dataset `lending_data.csv` into a Pandas DataFrame.
   - Create the features (`X`) and labels (`y`) by separating the target column "loan_status" from the remaining columns.
   - Check for missing values and handle them if necessary.

2. **Split Data into Training and Testing Sets:**
   - Split the data into training and testing sets using `train_test_split` from scikit-learn.

3. **Build and Train a Logistic Regression Model:**
   - Import the necessary libraries, including `LogisticRegression` from scikit-learn.
   - Create an instance of the `LogisticRegression` model.
   - Train the model using the training data.

4. **Make Predictions and Evaluate the Model:**
   - Use the trained model to make predictions on the test data.
   - Evaluate the model's performance using accuracy, precision, recall, and the confusion matrix.

5. **Write the Credit Risk Analysis Report (README.md):**
   - Provide an overview explaining the purpose of the analysis.
   - Summarize the accuracy, precision, and recall scores of the logistic regression model.
   - Include a summary of the results from the machine learning model and your justification for recommending or not recommending the model for use by the company.
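The missing-value check from the data preparation step could look like the following. It is a minimal sketch: `toy_df` and the `borrower_income` column are hypothetical, and dropping incomplete rows is only one option (imputing values is another).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing feature value
toy_df = pd.DataFrame({
    "borrower_income": [52_000.0, np.nan, 48_500.0, 61_000.0],
    "loan_status": [0, 0, 1, 0],
})

# Count missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# One simple way to handle them: drop any row that has a missing value
cleaned_df = toy_df.dropna()
print(len(cleaned_df))
```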

### Code Snippets:

Below are some code snippets that you can use as a starting point. Remember to replace placeholders such as file paths and variable names with your specific information:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

# Read the dataset into a DataFrame
data = pd.read_csv('path_to_data/lending_data.csv')

# Create features (X) and labels (y)
X = data.drop('loan_status', axis=1)
y = data['loan_status']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the evaluation metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(classification_rep)
```

Remember to customize the code based on your specific dataset, file paths, and variable names. Also, ensure that you have installed the required libraries (pandas, scikit-learn) using `pip` before running the code.

Finally, write the Credit Risk Analysis Report in the `README.md` file in your GitHub repository, following the structure described in the "Write a Credit Risk Analysis Report" section above.

In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [2]:
# Read the CSV file from the Resources folder into a Pandas DataFrame
# YOUR CODE HERE!

# Review the DataFrame
# YOUR CODE HERE!

### Step 2: Create the labels set (`y`) from the “loan_status” column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
# YOUR CODE HERE!

# Separate the X variable, the features
# YOUR CODE HERE!

In [4]:
# Review the y variable Series
# YOUR CODE HERE!

In [5]:
# Review the X variable DataFrame
# YOUR CODE HERE!

### Step 3: Check the balance of the labels variable (`y`) by using the `value_counts` function.

In [6]:
# Check the balance of our target values
# YOUR CODE HERE!
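As an illustration of what the balance check reports, here is a sketch on a made-up label Series (the 90/10 split is an assumption, not the real distribution of `lending_data.csv`). Passing `normalize=True` is an optional extra that shows shares instead of raw counts.

```python
import pandas as pd

# Hypothetical labels: 90 healthy loans (0) and 10 high-risk loans (1)
y = pd.Series([0] * 90 + [1] * 10, name="loan_status")

# Raw counts per class
counts = y.value_counts()
# Share of each class, which makes the imbalance easier to read
shares = y.value_counts(normalize=True)

print(counts)
print(shares)
```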

### Step 4: Split the data into training and testing datasets by using `train_test_split`.

In [7]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
# YOUR CODE HERE!

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [8]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
# YOUR CODE HERE!

# Fit the model using training data
# YOUR CODE HERE!

### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [9]:
# Make a prediction using the testing data
# YOUR CODE HERE!

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [10]:
# Print the balanced_accuracy score of the model
# YOUR CODE HERE!
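Balanced accuracy is the average of the per-class recall values, so it penalizes a model that ignores the rare class. The sketch below uses invented labels and a deliberately useless "always predict healthy" model to show how it differs from plain accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 95 healthy loans (0) and 5 high-risk loans (1)
y_true = [0] * 95 + [1] * 5
# A naive model that predicts "healthy" for every loan
y_hat = [0] * 100

plain_accuracy = accuracy_score(y_true, y_hat)
balanced_accuracy = balanced_accuracy_score(y_true, y_hat)

# Same result by hand: the mean of the recall for each class
recall_0 = 95 / 95  # every healthy loan is identified
recall_1 = 0 / 5    # no high-risk loan is identified
manual_balanced = (recall_0 + recall_1) / 2

print(f"accuracy: {plain_accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```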

In [11]:
# Generate a confusion matrix for the model
# YOUR CODE HERE!

In [12]:
# Print the classification report for the model
# YOUR CODE HERE!

### Step 4: Answer the following question.

**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** WRITE YOUR ANSWER HERE!

---

## Fit a Logistic Regression Model with Resampled Training Data

### Step 1: Use the `RandomOverSampler` class from the imbalanced-learn library to resample the data. Be sure to confirm that the labels have an equal number of data points.

In [13]:
# Import the RandomOverSampler class from imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Instantiate the random oversampler model
# Assign a random_state parameter of 1 to the model
# YOUR CODE HERE!

# Fit the original training data to the random_oversampler model
# YOUR CODE HERE!

In [None]:
# Count the distinct values of the resampled labels data
# YOUR CODE HERE!
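To show what random oversampling does to the label counts, here is a hand-rolled pandas sketch. It is not the imbalanced-learn implementation (the assignment expects `RandomOverSampler` and its `fit_resample` method); `train_df` and the `loan_size` column are hypothetical, and the sketch only mirrors the idea of duplicating minority rows until the classes match.

```python
import pandas as pd

# Hypothetical training data: 6 healthy loans (0) and 2 high-risk loans (1)
train_df = pd.DataFrame({
    "loan_size": [9_000, 10_500, 8_700, 11_200, 9_900, 10_100, 18_000, 19_500],
    "loan_status": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train_df[train_df["loan_status"] == 0]
minority = train_df[train_df["loan_status"] == 1]

# Draw extra minority rows with replacement to close the gap between classes
extra_minority = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=1
)

# Keep all original rows and append the duplicated minority rows
resampled_df = pd.concat([majority, minority, extra_minority], ignore_index=True)
X_resampled = resampled_df.drop(columns="loan_status")
y_resampled = resampled_df["loan_status"]

# Both classes should now have the same number of rows
print(y_resampled.value_counts())
```

Only the training data is resampled; the test set keeps its original class balance so that the evaluation reflects real-world proportions.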

### Step 2: Use the `LogisticRegression` classifier and the resampled data to fit the model and make predictions.

In [None]:
# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
# YOUR CODE HERE!

# Fit the model using the resampled training data
# YOUR CODE HERE!

# Make a prediction using the testing data
# YOUR CODE HERE!

### Step 3: Evaluate the model’s performance by doing the following:

* Calculate the balanced accuracy score of the model.

* Generate a confusion matrix.

* Print the classification report.

In [None]:
# Print the balanced_accuracy score of the model
# YOUR CODE HERE!

In [None]:
# Generate a confusion matrix for the model
# YOUR CODE HERE!

In [None]:
# Print the classification report for the model
# YOUR CODE HERE!

### Step 4: Answer the following question.

**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** YOUR ANSWER HERE!