# Data Preparation 

This notebook prepares the dataset for later analysis, so that the data is ready for exploration and modeling.

A key step in this project is creating the features that will serve as target variables in the upcoming analysis. These features are generated by aggregating the records for each unique ID in the credit sample. Specifically, we create the `6mo_delinquency` and `12mo_delinquency` features, which indicate whether an individual had a delinquent account in the past 6 and 12 months, respectively.
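As a rough illustration of this kind of per-ID aggregation, here is a minimal sketch on toy data. It is not the `n_mo_delinquency` function imported later, whose implementation is not shown in this notebook. The column names `MONTHS_BALANCE` and `STATUS`, the window definition, the set of status codes treated as delinquent, and the helper name `flag_delinquency` are all assumptions made for the example.

```python
import pandas as pd


def flag_delinquency(credit, n_months, delinquent_statuses=("1", "2", "3", "4", "5")):
    """Hypothetical helper: 1 if an ID has any delinquent record in the last n_months.

    Assumptions (not confirmed by the notebook):
    - MONTHS_BALANCE is 0 for the current month, -1 for the previous one, and so on.
    - STATUS values listed in delinquent_statuses count as delinquent.
    """
    # Keep only records inside the look-back window (months 0 .. -(n_months - 1)).
    recent = credit[credit["MONTHS_BALANCE"] > -n_months]
    is_delinquent = recent["STATUS"].astype(str).isin(delinquent_statuses)

    # Aggregate per ID: True if any record in the window is delinquent.
    flags = is_delinquent.groupby(recent["ID"]).any()

    # IDs with no records inside the window get a flag of 0.
    all_ids = pd.Index(credit["ID"].unique(), name="ID")
    flags = flags.reindex(all_ids, fill_value=False).astype(int)
    return flags.rename(f"{n_months}mo_delinquency")


# Toy credit records, invented for illustration only.
toy_credit = pd.DataFrame(
    {
        "ID": [1, 1, 1, 2, 2, 3],
        "MONTHS_BALANCE": [0, -1, -2, 0, -7, -1],
        "STATUS": ["C", "0", "C", "X", "2", "1"],
    }
)

six_month = flag_delinquency(toy_credit, 6)
twelve_month = flag_delinquency(toy_credit, 12)
print(pd.concat([six_month, twelve_month], axis=1))
```

In this toy data, ID 2 has a delinquent record seven months back, so it is flagged in the 12-month window but not in the 6-month one.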

Additionally, we have addressed data quality issues by removing duplicated rows, handling missing values, and generating new features where appropriate. 

Finally, we have merged the two samples, `credit` and `applications`, to bring together relevant information and provide a comprehensive dataset for future analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

In [2]:
from functions import credit_approval_data_cleaner, n_mo_delinquency

In [3]:
applications = pd.read_csv('../data/application_record.csv')
credits = pd.read_csv('../data/credit_record.csv')


In [4]:
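# Remove applications whose ID appears more than once in the application dataset.
# keep=False drops every row that shares a duplicated ID, not just the extra copies.

# Earlier alternative (equivalent only when a duplicated ID appears exactly twice):
# applications = applications.groupby('ID').filter(lambda x: len(x) != 2)
applications = applications.drop_duplicates(subset='ID', keep=False)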
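The effect of `keep=False` is easy to miss, so here is a small sketch on an invented frame (the `value` column and the toy rows are hypothetical). It compares `keep=False`, which discards every row with a repeated ID, against `keep='first'`, which retains one row per ID.

```python
import pandas as pd

# Toy frame: ID 2 appears twice with conflicting values.
toy_applications = pd.DataFrame({"ID": [1, 2, 2, 3], "value": [10, 20, 25, 30]})

# keep=False: both rows with ID 2 are removed.
dropped_all = toy_applications.drop_duplicates(subset="ID", keep=False)

# keep='first': the first row with ID 2 is retained.
kept_first = toy_applications.drop_duplicates(subset="ID", keep="first")

print(dropped_all["ID"].tolist())
print(kept_first["ID"].tolist())
```

Dropping all copies is the more conservative choice when there is no way to tell which of the conflicting rows is correct, at the cost of losing those applicants entirely.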

> Next, we create the training and test samples.

In [5]:
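# Sample 15% of the application IDs for the test set, so the split is made by ID
# and all records for a given ID land on the same side.
ID_test = applications['ID'].sample(frac=0.15, random_state=42)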

applications_test = applications[applications['ID'].isin(ID_test)]
applications_train = applications[~applications['ID'].isin(ID_test)]

credits_test = credits[credits['ID'].isin(ID_test)]
credits_train = credits[~credits['ID'].isin(ID_test)]
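A few sanity checks can confirm that an ID-based split like this one behaves as intended. The sketch below repeats the same split logic on toy frames (invented data, with credit IDs 21 to 25 deliberately missing from the applications) and checks that the two sides are disjoint and together cover all applications. It also shows one consequence of filtering with `~isin(ID_test)`: credit records whose ID has no matching application end up on the training side.

```python
import pandas as pd

# Toy data: 20 applications, credit records for IDs 1..25 (two months each).
applications = pd.DataFrame({"ID": list(range(1, 21))})
credits = pd.DataFrame(
    {
        "ID": [i for i in range(1, 26) for _ in range(2)],
        "MONTHS_BALANCE": [0, -1] * 25,
    }
)

# Same split logic as in the cell above.
ID_test = applications["ID"].sample(frac=0.15, random_state=42)

applications_test = applications[applications["ID"].isin(ID_test)]
applications_train = applications[~applications["ID"].isin(ID_test)]
credits_test = credits[credits["ID"].isin(ID_test)]
credits_train = credits[~credits["ID"].isin(ID_test)]

# No ID appears on both sides, and no application is lost.
assert set(applications_train["ID"]).isdisjoint(set(applications_test["ID"]))
assert len(applications_train) + len(applications_test) == len(applications)
assert set(credits_train["ID"]).isdisjoint(set(credits_test["ID"]))

# Credit IDs without a matching application fall into the training side.
orphan_ids = set(credits_train["ID"]) - set(applications["ID"])
print(sorted(orphan_ids))
```

Whether such unmatched credit records matter depends on how the later merge is done, which this notebook does not show.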

The next code block uses the `credit_approval_data_cleaner` function to clean the training data. The test set will be cleaned later, in the model evaluation notebook.

In [6]:
train_cleaned = credit_approval_data_cleaner(credits_train, applications_train, [3, 6, 12])

In [7]:
train_cleaned.to_csv('../data/train_cleaned.csv', index=False)

In [8]:
applications_test.to_csv('../data/applications_test.csv', index=False)
credits_test.to_csv('../data/credits_test.csv', index=False)