# Data Profiling
# 0. Set up

In [ ]:
# import libraries
import pandas as pd
import seaborn as sns

# load data
df = pd.read_csv('input/dirty-loan-data.csv')
df.shape

In [ ]:
df.head(5)

# 1. Create Data Profile

In [ ]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title='Loan Dataset Profiling Report', minimal=True)
profile.to_file(output_file='output/profile/loan_data_profile.html')

# 1.1. Analyze Integrity
## 1.1.1. Duplicates
To identify duplicates, we first remove the first column, which simply contains a row index. If we keep this column, every row looks unique and the profiling identifies no rows as duplicates.

In [ ]:
# drop first column
df = df.drop(columns='Unnamed: 0')
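
As a minimal sketch of why the index column hides duplicates, consider the made-up three-row frame below. The `toy` DataFrame and its values are illustrative assumptions, not taken from the loan data.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 describe the same loan, but the exported
# row index in 'Unnamed: 0' makes every row look unique.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'member_id': [101, 101, 102],
    'installment': [250.0, 250.0, 310.5],
})

# With the index column present, no row is an exact copy of another.
dupes_with_index = int(toy.duplicated().sum())

# After dropping the index column, the repeated loan is detected.
dupes_without_index = int(toy.drop(columns='Unnamed: 0').duplicated().sum())

print(dupes_with_index, dupes_without_index)
```

If the first column is always the exported index, another option is to pass `index_col=0` to `pd.read_csv` when loading, so the column never becomes a regular variable.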

Looking at the output generated by the profiling, the first obvious problem is that the ID variables (**id_pk**, **id_member**) do not seem to be unique. This might be due to duplicated entries in the dataset. We can check this by filtering for all entries with a duplicated 'member_id' and sorting them by 'member_id'.

In [ ]:
# filter for all entries containing duplicated 'member_id' order by member_id
df[df.duplicated(subset='member_id', keep=False)].sort_values(by='member_id')

We can also check for rows where all variables are identical. `duplicated()` returns a boolean Series in which True marks a row that duplicates an earlier row. Summing the True values gives the number of duplicates.

In [ ]:
# show number of duplicated rows (rows where all columns are equal)
df.duplicated().sum()

The output shows 95 duplicated rows (rows whose values are identical across all variables). However, 200 rows share a member ID with at least one other row (here duplicates are identified by member ID only). This suggests that in some cases the member ID is duplicated while the other variables are not identical, which points to inconsistencies in the data. We start by removing the rows where all variables are identical.
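
The two checks count different things: `duplicated()` with its default `keep='first'` leaves the first occurrence unflagged, while `keep=False` flags every copy. The sketch below illustrates this on made-up data (the `toy` frame is a hypothetical example, not the loan dataset) and shows one way to isolate member IDs whose rows conflict.

```python
import pandas as pd

# Hypothetical toy data: member 1 appears twice with identical values,
# member 2 appears twice with conflicting installments.
toy = pd.DataFrame({
    'member_id': [1, 1, 2, 2, 3],
    'installment': [100.0, 100.0, 200.0, 50000.0, 300.0],
})

# Rows that repeat an earlier row in every column (first occurrence not counted).
n_exact = int(toy.duplicated().sum())

# All rows whose member_id occurs more than once (keep=False flags every copy).
n_id_rows = int(toy.duplicated(subset='member_id', keep=False).sum())

# member_ids that still repeat after exact duplicates are removed: these conflict.
deduped = toy.drop_duplicates()
conflicting_ids = deduped.loc[
    deduped.duplicated(subset='member_id', keep=False), 'member_id'
].unique().tolist()

print(n_exact, n_id_rows, conflicting_ids)
```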

In [ ]:
# remove duplicates
df = df.drop_duplicates()
df.shape

Now we can re-check whether duplicates remain in the member_id variable. As the output below shows, this is indeed the case. The discrepancy lies in the variable **installment**, which has differing values for the same member. This is likely a data error: **installment** should contain the monthly payment owed by the borrower, and a monthly payment of 50000 is implausible. We therefore remove the duplicated cases in which installment has a very high value.

In [ ]:
# filter for all entries containing duplicated 'member_id' order by member_id and installment
df[df.duplicated(subset='member_id', keep=False)].sort_values(by=['member_id', 'installment'])

In [ ]:
# where a member_id is duplicated and installment differs, keep the row with the lowest installment
df = df.sort_values(by='installment', ascending=True).drop_duplicates(subset='member_id', keep='first')
# check if all duplicates in member_id are removed
df[df.duplicated(subset='member_id', keep=False)].shape

In [ ]:
# show row for member id 5779043
df[df['member_id'] == 5779043]
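
Sorting by installment before dropping duplicates also reorders the whole frame. If the original row order matters, one option is to restore it with `sort_index()` afterwards. The sketch below uses a made-up `toy` frame and, like the step above, assumes that the higher installment is always the erroneous one.

```python
import pandas as pd

# Hypothetical toy data: member 2 has a plausible and an implausible installment.
toy = pd.DataFrame({
    'member_id': [1, 2, 2, 3],
    'installment': [100.0, 50000.0, 200.0, 300.0],
})

cleaned = (
    toy.sort_values(by='installment', ascending=True)
       .drop_duplicates(subset='member_id', keep='first')  # keep lowest installment per member
       .sort_index()  # restore the original row order
)

print(cleaned)
```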

## 1.1.2. Inconsistencies
* Inconsistencies in Class Variables
* Data Types

# 1.2. Analyze Completeness
## 1.2.1. Missing Values

# 1.4. Analyze Accuracy
## 1.4.1. Data Types and Formats
Another problem in the data is that dates are currently split into year and month. This first becomes visible in the variables **issue_year** and **issue_month**. Since we only have observations for 2015, **issue_year** on its own has no variation and therefore provides no useful information for future analytical applications. Only in combination with the month does the year provide useful information.

In [ ]:
# show distribution of issue_year
sns.countplot(data=df, x='issue_year')

Contrary to **issue_year**, the variable **issue_month** does show some variation, as shown in the plot below.

In [ ]:
# show distribution of issue_month
sns.countplot(data=df, x='issue_month')

However, we also see that *October* was misspelled as *Octxyz*. We can fix this by replacing the misspelled value with the correct one.

In [ ]:
# replace misspelled value
df['issue_month'] = df['issue_month'].replace('Octxyz', 'Oct')
df['issue_month'].unique()
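
Instead of spotting misspellings in a plot, unexpected labels can also be listed programmatically. The sketch below assumes that valid months are stored as three-letter English abbreviations. The `VALID_MONTHS` list and the `months` Series are illustrative assumptions.

```python
import pandas as pd

# Assumption: valid labels are three-letter English month abbreviations.
VALID_MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical month column containing one misspelled label.
months = pd.Series(['Jan', 'Oct', 'Octxyz', 'Dec', 'Octxyz'])

# Count every label that is not a recognized month abbreviation.
unexpected = months[~months.isin(VALID_MONTHS)].value_counts()

print(unexpected)
```

The same check could be run on the other month variables before they are converted to dates.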

To remedy the fact that issue_year contains no variation, but we still would like to know the year, we can create a new variable that combines the year and month into a single variable. This will allow us to keep the information about the year, but also the month. The same transformation can be applied to the other time variables in the dataset (earliest_cr_line_month, earliest_cr_line_year, last_pymnt_month, last_pymnt_year, next_pymnt_month, next_pymnt_year, last_credit_pull_month, last_credit_pull_year).

In [ ]:
# create new variable issue_date with data type datetime
df['issue_date'] = pd.to_datetime(df['issue_year'].astype(str) + '-' + df['issue_month'], format='%Y-%b')

# drop issue_year and issue_month
df = df.drop(columns=['issue_year', 'issue_month'])

# show first 5 rows
df['issue_date'].head(5)
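
Applying the same transformation to the other time variables could be sketched with a small helper. The function name `combine_year_month` is hypothetical, and the sketch assumes that each pair follows the `<prefix>_year` / `<prefix>_month` naming, that years are stored as integers, and that months are three-letter abbreviations. Values that cannot be parsed become `NaT` rather than raising an error.

```python
import pandas as pd

def combine_year_month(frame, prefix):
    """Return a copy with <prefix>_year and <prefix>_month merged into <prefix>_date."""
    year_col, month_col = f'{prefix}_year', f'{prefix}_month'
    out = frame.copy()
    out[f'{prefix}_date'] = pd.to_datetime(
        out[year_col].astype(str) + '-' + out[month_col].astype(str),
        format='%Y-%b',
        errors='coerce',  # unparseable or missing values become NaT
    )
    return out.drop(columns=[year_col, month_col])

# Hypothetical toy data with one invalid month label.
toy = pd.DataFrame({
    'last_pymnt_year': [2015, 2016],
    'last_pymnt_month': ['Jan', 'Dec'],
    'next_pymnt_year': [2016, 2016],
    'next_pymnt_month': ['Feb', 'Bad'],
})

for prefix in ['last_pymnt', 'next_pymnt']:
    toy = combine_year_month(toy, prefix)

print(toy)
```

Because `errors='coerce'` silently turns bad values into `NaT`, it is worth counting the missing dates after the conversion.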

# 1.5. Text Issues

# Data Protection
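We might want to pseudonymize variables that contain sensitive information, such as the url, because the url might contain information that could be used to identify the borrower. We can do this by replacing the url with a hash value. Note that hashing is one-way and is not encryption, and that Python's built-in `hash()` is randomized per session for strings, so we use a SHA-256 digest from `hashlib` to get reproducible values.

In [ ]:
# replace url with its SHA-256 hex digest
import hashlib
df['url'] = df['url'].apply(lambda x: hashlib.sha256(str(x).encode('utf-8')).hexdigest())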
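
Python's built-in `hash()` is salted per interpreter session for strings, so its values are not reproducible across runs, and an unkeyed digest of a guessable URL can be recomputed by anyone able to enumerate candidate URLs. One possible way to address both points is a keyed hash (HMAC). The sketch below is only an illustration: `SECRET_KEY`, the `pseudonymize` helper, and the example URLs are hypothetical.

```python
import hashlib
import hmac

import pandas as pd

# Hypothetical secret; in practice it would be loaded from a secure location.
SECRET_KEY = b'replace-with-a-real-secret'

def pseudonymize(value):
    """Return a keyed SHA-256 digest (hex string) of the value's string form."""
    return hmac.new(SECRET_KEY, str(value).encode('utf-8'), hashlib.sha256).hexdigest()

# Hypothetical URLs; identical inputs map to identical tokens.
urls = pd.Series([
    'https://example.com/loan?id=1',
    'https://example.com/loan?id=2',
    'https://example.com/loan?id=1',
])
tokens = urls.apply(pseudonymize)

print(tokens)
```

Equal URLs still map to equal tokens, so joins and duplicate checks on the pseudonymized column keep working.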