# 1. Improving credit card scoring through data quality

We are all familiar with the axiom "garbage in, garbage out". It holds especially true in a market where models are becoming more and more commoditized: the business advantage will remain in the component that is unique to every organization, the data.

In every case, and particularly for credit scoring use cases, data preparation is paramount. Nevertheless, despite the progress observed in the past few years, data preparation is still the most challenging and time-consuming step. Ensuring data quality helps data teams achieve a bigger ROI from AI initiatives at a fraction of the effort it used to take, translating into better scorecards that positively impact the business and customer experience.

In credit scoring, a few particular issues can dampen model accuracy: outliers, missing values and imbalanced classes.

In this use case we will explore not only an iterative, traceable and comparable data processing workflow to improve the quality of the data for credit scorecards, but also how to mitigate each of the identified challenges: missing data, duplicates and, last but not least, imbalanced data.


The dataset used for this use case can be found on [Kaggle - Give me some credit](https://www.kaggle.com/competitions/GiveMeSomeCredit/overview).

## Read the data

The first step is to read the data. Data scientists very often work with CSV files as their main data source. For that reason we provide a flexible interface to consume CSV data:

- Through the YData SDK `LocalConnector`: this method automatically returns a `Dataset` object that is ready to be consumed by downstream applications such as the Synthesizer.
- With pandas: pandas is available in the YData image. This method requires the data to be converted into a `Dataset` if the user wants to leverage downstream applications such as Synthesizers, or the scale of our distributed computing engine.

In this particular example the data sits in cloud storage, so we load it through the SDK, using the `GCSConnector` method `read_file`.
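
The pandas route described above could look like the following minimal sketch. It assumes a local copy of the CSV; here an in-memory sample with a few of the columns (values taken from the preview further down) stands in for the file so the snippet runs on its own. Passing `index_col=0` is our choice, to keep the unnamed index column out of the features. The conversion to a `Dataset` is shown only as a comment and follows the `Dataset(data)` pattern used later in this notebook.

```python
import io

import pandas as pd

# Stand-in for a local copy of the CSV (subset of columns, for illustration).
sample_csv = io.StringIO(
    ",SeriousDlqin2yrs,age,DebtRatio,MonthlyIncome\n"
    "0,1,45,0.802982,9120.0\n"
    "1,0,40,0.121876,2600.0\n"
    "6,0,57,5710.0,\n"
)

# index_col=0 treats the first, unnamed column as the row index
# instead of loading it as an "Unnamed: 0" feature.
df = pd.read_csv(sample_csv, index_col=0)

print(df.dtypes)
print(df.isna().sum())

# To use the SDK's downstream applications, the frame would then be wrapped:
# dataset = Dataset(df)
```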

### Import needed packages

In [14]:
import os

import numpy as np
import pandas as pd

from ydata.utils.formats import read_json
from ydata.dataset import Dataset
from ydata.metadata import Metadata

from ydata.connectors.filetype import FileType
from ydata.connectors import GCSConnector

### Read the data from a cloud storage

In [15]:
token = read_json('ydata-academy.json')
connector = GCSConnector(project_id=token['project_id'], keyfile_dict=token)

In [16]:
dataset = connector.read_file('gs://ydata-academy/credit_scoring/data.csv', file_type=FileType.CSV)

#Calculate the dataset Metadata
metadata = Metadata(dataset)

[########################################] | 100% Completed | 3.98 s


In [17]:
dataset.head(20)

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0
5,0,0.213179,74,0,0.375607,3500.0,3,0,1,0,1.0
6,0,0.305682,57,0,5710.0,,8,0,3,0,0.0
7,0,0.754464,39,0,0.20994,3500.0,8,0,0,0,0.0
8,0,0.116951,27,0,46.0,,2,0,0,0,
9,0,0.189169,57,0,0.606291,23684.0,9,0,4,0,2.0
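
The preview already shows missing `MonthlyIncome` and `NumberOfDependents` values. A quick pandas check along the following lines could quantify the issues named in the introduction (missing values, duplicates, class imbalance). This is a minimal sketch on a hand-built sample, not the real data; the helper name `quality_summary` is hypothetical, and on the actual data it would be applied to the frame returned by `dataset.to_pandas()`.

```python
import numpy as np
import pandas as pd


def quality_summary(df: pd.DataFrame, target: str) -> dict:
    """Return simple data-quality indicators for a scoring dataset."""
    return {
        # Share of missing values per column
        "missing_rate": df.isna().mean().to_dict(),
        # Number of rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of each target class, to spot class imbalance
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Hand-built sample for illustration only.
sample = pd.DataFrame(
    {
        "SeriousDlqin2yrs": [1, 0, 0, 0, 0],
        "age": [45, 40, 57, 27, 40],
        "MonthlyIncome": [9120.0, 2600.0, np.nan, np.nan, 2600.0],
    }
)

summary = quality_summary(sample, target="SeriousDlqin2yrs")
print(summary)
```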


## Preparing the data for the data preparation workflow
Now that the dataset is loaded, and before any data analysis or preparation, it is recommended to set aside a holdout set. Evaluating on data that played no part in the preparation steps keeps the measured improvements unbiased.

In [23]:
data = dataset.to_pandas()

In [24]:
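# Read the holdout fraction from the environment, defaulting to 15%
try:
    holdout_size = float(os.environ['HOLDOUT_SIZE'])
except (KeyError, ValueError):
    # Variable not set, or not a valid number
    holdout_size = 0.15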
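
A slightly stricter way to read this setting is sketched below. It assumes the same `HOLDOUT_SIZE` environment variable and the same 0.15 fallback as the cell above; the helper name `read_holdout_size` and the check that the value lies strictly between 0 and 1 are our additions.

```python
import os


def read_holdout_size(env=os.environ, default: float = 0.15) -> float:
    """Read HOLDOUT_SIZE from `env`, falling back to `default`."""
    raw = env.get("HOLDOUT_SIZE")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        # Not a number: keep the default rather than fail the pipeline step
        return default
    # A holdout fraction only makes sense strictly between 0 and 1
    return value if 0.0 < value < 1.0 else default


holdout_size = read_holdout_size()
print(holdout_size)
```

Passing the environment as an argument keeps the helper easy to exercise with a plain dictionary, for example `read_holdout_size({"HOLDOUT_SIZE": "0.2"})`.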

In [25]:
# Create the holdout set: each row stays in training with probability (1 - holdout_size)
msk = np.random.rand(len(data)) < (1 - holdout_size)
test = data[~msk]  # holdout rows
data = data[msk]   # training rows
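
Because the mask above is drawn without a seed, each run produces a different holdout. If results need to be comparable across runs, a seeded variant could look like this sketch; the function name `split_holdout` and its `seed` parameter are hypothetical, and the sample frame is a stand-in for `data`.

```python
import numpy as np
import pandas as pd


def split_holdout(df: pd.DataFrame, holdout_size: float = 0.15, seed: int = 42):
    """Randomly split `df` into (train, holdout) with a reproducible mask."""
    rng = np.random.default_rng(seed)
    # True -> row stays in the training set; roughly (1 - holdout_size) of rows
    mask = rng.random(len(df)) < (1 - holdout_size)
    return df[mask], df[~mask]


# Stand-in data; in the notebook this would be `data`.
sample = pd.DataFrame({"x": range(1000)})
train, holdout = split_holdout(sample, holdout_size=0.15, seed=0)

print(len(train), len(holdout))
```

As with the original mask, the holdout share is only approximately `holdout_size`, since every row is assigned independently.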

In [None]:
dataset = Dataset(data)
metadata = Metadata(dataset)

In [None]:
print(metadata)

In [None]:
# Collect the metadata warnings into a flat table
warnings = []
for warning, val in metadata.warnings.items():
    for col in val:
        try:
            level = col.details['level'].name
            value = round(col.details['value'], 4)
        except Exception:
            # Fall back when a warning has no usable level/value details
            level = None
            value = None
        warnings.append({'warning': warning, 'column': col.column, 'level': level, 'value': value})

df_warnings = pd.DataFrame(warnings)

## Create pipeline outputs
We now set the pipeline outputs, which are used to share data and artifacts between the different steps of the pipeline as well as for visualization.

### Outputs

In [9]:
#Save the train metadata
metadata.save('metadata.pkl')

#Save the train dataset
data.to_csv('train.csv')

#Save the holdout set
test.to_csv('test.csv')

### Visualizations

In [10]:
import json

# Table description for the pipeline UI. A separate name is used so the
# Metadata object computed earlier is not overwritten.
ui_metadata = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(df_warnings.columns),
            'source': df_warnings.to_csv(header=False, index=False)
        }
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(ui_metadata, metadata_file)
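
The dictionary above describes an inline CSV table that the pipeline UI reads from `mlpipeline-ui-metadata.json`. As a sketch, the same entry could be built by a small helper for any DataFrame. The name `table_output` is hypothetical, and the sample warnings below are made up for illustration; they are not results from this dataset.

```python
import json

import pandas as pd


def table_output(df: pd.DataFrame) -> dict:
    """Build an inline CSV table entry from a DataFrame."""
    return {
        "type": "table",
        "storage": "inline",
        "format": "csv",
        "header": list(df.columns),
        # Rows only: the header is passed separately above
        "source": df.to_csv(header=False, index=False),
    }


# Made-up warnings, for illustration only.
sample_warnings = pd.DataFrame(
    [
        {"warning": "missings", "column": "MonthlyIncome", "level": "HIGH", "value": 0.2},
        {"warning": "skewness", "column": "DebtRatio", "level": None, "value": None},
    ]
)

ui_metadata = {"outputs": [table_output(sample_warnings)]}
payload = json.dumps(ui_metadata)
print(payload)
```

Keeping the header out of `source` mirrors the cell above, where `to_csv` is called with `header=False` and the column names travel in the `header` field.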