# Assignment 7.1: Pitching Your ML Algorithm

Find an exciting business problem, find data, and solve the problem with machine learning in Python using the CRISP-DM methodology and algorithms covered in the course:
1. Business understanding - What does the business need?
2. Data understanding - What data do we have/need? Is it clean?
3. Data preparation - How do we organize the data for modeling?
4. Modeling - What modeling techniques should we apply?
5. Evaluation - Which model best meets the business objectives?
6. Deployment - How to get the model in production and ensure it works?

Note: Your team must use GitHub to host your code, collaborate, and manage versions. GitHub helps ensure traceability, allows rollback, and prevents unintended overwrites and loss of code. You can use the integration between Google Colab and GitHub to open notebooks from, and save them to, the team repository.

Dataset: **Home Credit Default Dataset**

**Project Scenario:**
Before presenting to the executive decision-making body, there is a code review round and a technical solution review by a technical committee. The committee consists of fellow ML Engineers, Data Engineers, Architects, ML managers, and the Head of Machine Learning.
There are three key deliverables for this final team project:

* Google Colab notebook (ipynb)
* Business presentation slides (pptx/pdf)
* Recorded video presentation (mp4)

The Colab notebook is for the technical committee, whereas the business brief is aimed at a non-technical executive committee. Both committees will view the slides and video. Please see the following explanation of the requirements for each file.


**Google Colab Notebook with Python code:**
The Google Colab notebook must be organized like a report in which code blocks are interspersed with text blocks. The text block before each code block must explain the approach. The text blocks that follow output graphs and tables must contain inferences, actionable insights, and recommendations. The code blocks themselves must be annotated with comments so that they are readable.

The notebook must contain the following sections:

* Problem statement and justification for the proposed approach.
* Data understanding (EDA) - a graphical and non-graphical representation of relationships between the response variable and predictor variables.
* Data preparation.
* Feature engineering - data pre-processing - missing values, outliers, etc.
* Feature Selection - how were the features selected based on the data analysis?
* Modeling - selection, comparison, tuning, and analysis - consider ensembles.
* Evaluation - performance measures, results, and conclusions.
* Discussion and conclusions - address the problem statement and recommendation.
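
As an illustration of the non-graphical side of the data understanding requirement above, the following is a minimal sketch of how the relationship between the response variable and a few predictors might be summarized. The toy DataFrame and its values are made up for illustration, and the column names `TARGET`, `NAME_CONTRACT_TYPE`, and `AMT_INCOME_TOTAL` are assumptions about the training file, not something this notebook has confirmed.

```python
import pandas as pd

# Toy stand-in for the training data (all values are invented for illustration)
df = pd.DataFrame({
    "TARGET": [0, 1, 0, 1, 0, 0, 1, 0],
    "NAME_CONTRACT_TYPE": [
        "Cash loans", "Cash loans", "Revolving loans", "Cash loans",
        "Revolving loans", "Cash loans", "Cash loans", "Revolving loans",
    ],
    "AMT_INCOME_TOTAL": [200000, 90000, 150000, 80000, 180000, 210000, 95000, 160000],
})

# Class balance of the response variable
target_share = df["TARGET"].value_counts(normalize=True)

# Default rate within each level of a categorical predictor
default_rate_by_type = df.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean()

# Linear correlation between each numeric predictor and the response
numeric = df.select_dtypes("number")
corr_with_target = numeric.corr()["TARGET"].drop("TARGET")

print(target_share)
print(default_rate_by_type)
print(corr_with_target)
```

Tables like these would normally be paired with plots (for example, bar charts of default rate per category) to cover the graphical half of the requirement.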

# Problem Statement

In [None]:
# @title Justification


In [None]:
# @title Development Approach


# Data Preparation

In [None]:
# @title Imports

import pandas as pd
from sklearn.preprocessing import LabelEncoder  # used in the Cleaning cell below

In [None]:
# @title Loading Data

# Paths to the CSV files
train_csv_path = "application_train.csv"
test_csv_path = "application_test.csv"
previous_csv_path = "previous_application.csv"

# Load only the relevant columns from previous_application.csv
columns_to_load = ['SK_ID_CURR', 'NAME_CONTRACT_STATUS', 'NAME_CLIENT_TYPE', 'RATE_DOWN_PAYMENT', 'CNT_PAYMENT']

df_train = pd.read_csv(train_csv_path, index_col='SK_ID_CURR') # Dataset used for training the models
df_test = pd.read_csv(test_csv_path, index_col='SK_ID_CURR') # Dataset used to test after model creation
df_previous_application = pd.read_csv(previous_csv_path, usecols=columns_to_load)
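
The previous-application table has several rows per `SK_ID_CURR`, so it cannot be joined to the training data directly. A minimal sketch of one way to do it, aggregating to one row per applicant first, is shown below. The toy data, the status values `"Approved"` and `"Refused"`, and the `PREV_*` feature names are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-ins for df_train and df_previous_application (values are invented)
train = pd.DataFrame(
    {"TARGET": [0, 1, 0]},
    index=pd.Index([100, 101, 102], name="SK_ID_CURR"),
)
previous = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 101, 101, 101],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Approved", "Refused"],
    "CNT_PAYMENT": [12.0, 24.0, 6.0, None, 12.0],
})

# Collapse to one row per applicant (PREV_* names are hypothetical)
prev_agg = previous.groupby("SK_ID_CURR").agg(
    PREV_APP_COUNT=("NAME_CONTRACT_STATUS", "size"),
    PREV_REFUSED_SHARE=("NAME_CONTRACT_STATUS", lambda s: (s == "Refused").mean()),
    PREV_CNT_PAYMENT_MEAN=("CNT_PAYMENT", "mean"),
)

# Left join on the SK_ID_CURR index so applicants without history are kept
train_enriched = train.join(prev_agg, how="left")

# Applicants with no previous applications have a count of zero
train_enriched["PREV_APP_COUNT"] = train_enriched["PREV_APP_COUNT"].fillna(0).astype(int)

print(train_enriched)
```

The left join keeps every training row. The remaining `PREV_*` columns stay missing for applicants without history and need the same missing-value treatment as the other features.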

# Feature Engineering

## Cleaning

In [None]:
# Drop rows with NaN values
# NOTE: this removes every row that has at least one missing value, which can
# discard a large share of the data; imputation may be a better choice.
df_train = df_train.dropna()
df_test = df_test.dropna()

# Get categorical columns from the training set
categorical_cols = df_train.select_dtypes(include=['object', 'category']).columns

# Apply a separate LabelEncoder to each categorical column, fitted on the training set only
encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df_train[col] = le.fit_transform(df_train[col])
    encoders[col] = le  # keep the fitted encoder so the mapping can be reused later
    if col in df_test.columns:
        # Drop test rows whose category was not seen in the training set
        # (.copy() avoids assigning into a slice of the original frame)
        df_test = df_test.loc[df_test[col].isin(le.classes_)].copy()
        # Transform the remaining data
        df_test[col] = le.transform(df_test[col])
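
The cell above drops every row with a missing value and every test row with an unseen category, which shrinks both datasets. As a hedged alternative, the sketch below fills missing values with statistics learned from the training set and maps unseen test categories to `-1` using scikit-learn's `OrdinalEncoder`. The toy data and the column names `CODE_GENDER` and `AMT_CREDIT` are assumptions for illustration; this is one possible approach, not the notebook's chosen method.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for df_train and df_test (values are invented)
train = pd.DataFrame({
    "CODE_GENDER": ["F", "M", np.nan, "F"],
    "AMT_CREDIT": [100.0, np.nan, 300.0, 200.0],
})
test = pd.DataFrame({
    "CODE_GENDER": ["M", "XNA"],  # "XNA" never appears in the training set
    "AMT_CREDIT": [np.nan, 150.0],
})

cat_cols = ["CODE_GENDER"]
num_cols = ["AMT_CREDIT"]

# Learn fill values from the training set only, then apply them to both sets
num_fill = train[num_cols].median()
cat_fill = train[cat_cols].mode().iloc[0]
for frame in (train, test):
    frame[num_cols] = frame[num_cols].fillna(num_fill)
    frame[cat_cols] = frame[cat_cols].fillna(cat_fill)

# Encode categories; anything unseen during fit becomes -1 instead of raising
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train[cat_cols] = encoder.fit_transform(train[cat_cols]).astype(int)
test[cat_cols] = encoder.transform(test[cat_cols]).astype(int)

print(train)
print(test)
```

Because no rows are removed, the test set keeps one prediction per applicant, which matters if every application must receive a score.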


In [None]:
# @title Cleaning data of null values

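# NOTE: clean_df is not defined in this notebook; it is assumed to be dataprep.clean.clean_df
from dataprep.clean import clean_df

inferred_dtypes_train, df_train = clean_df(df_train)
inferred_dtypes_test, df_test = clean_df(df_test)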

# Feature Selection
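
This section is still to be completed. As a starting point, the following is a minimal sketch of univariate feature selection with scikit-learn's `SelectKBest`. It runs on synthetic data from `make_classification`, and the feature names, the ANOVA F-score, and `k=3` are all assumptions chosen for illustration rather than decisions derived from the project data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the prepared training features and target
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Score every feature against the target and keep the k highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)

# Names of the retained features and the full ranking of scores
selected_features = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

print(selected_features)
print(scores)
```

Univariate scores ignore interactions between features, so the ranking is best read alongside the EDA findings rather than as a final answer.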

# Modeling
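
This section is still to be completed. The sketch below shows one way the model comparison could be structured: a linear baseline and a tree ensemble scored with stratified cross-validation. The synthetic imbalanced data, the two candidate models, their hyperparameters, and ROC AUC as the metric are all assumptions for illustration; the actual candidates should come from the algorithms covered in the course.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the prepared training data
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, weights=[0.9, 0.1], random_state=0
)

# Candidate models (hypothetical shortlist)
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated ROC AUC per candidate
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_model_name = max(cv_auc, key=cv_auc.get)

print(cv_auc)
print(best_model_name)
```

Wrapping the scaler and the model in one pipeline means the scaler is refit inside each fold, which avoids leaking validation data into preprocessing.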

# Evaluation
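
This section is still to be completed. As a placeholder, the sketch below computes a few performance measures commonly reported for an imbalanced binary target. The labels and predicted probabilities are hand-written toy values, and the `0.5` threshold is an arbitrary assumption; in practice the threshold would be chosen from the business cost of a missed default versus a rejected good applicant.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy hold-out labels and predicted default probabilities (invented values)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.70, 0.15, 0.90])

# Threshold-independent ranking quality
auc = roc_auc_score(y_true, y_prob)

# Convert probabilities to class labels at an assumed threshold
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

# Confusion-matrix counts and the metrics derived from them
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"ROC AUC: {auc:.3f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Reporting the confusion-matrix counts alongside AUC makes it easier to translate model quality into business terms for the executive committee.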

# Conclusion