# 4. Duplicate records

## Processing

Duplicates can be a serious problem when training a machine learning model. In many cases it is recommended to drop duplicate records, unless they are one of the behaviors being modeled and their presence carries relevant information for the use case being explored.

For the particular pipeline we are exploring, credit scoring, the presence of duplicates can lead to:

- Overfitting of the model to the training data, which leads to poor generalization in production systems. Biased models can be dangerous, particularly for the example use case we are exploring;
- Information leakage between the training and validation sets. If copies of the same record end up in both the training and validation sets, the model's validation performance will be inflated.
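
The leakage effect can be illustrated with a minimal sketch. The toy frame, its column names (`income`, `default`), and the helper `count_leaked_rows` are hypothetical and not part of this pipeline; the sketch only counts how many validation rows also appear verbatim in the training split, before and after deduplication.

```python
import pandas as pd

# Hypothetical toy data: rows 0/3 and 1/4 are exact duplicates.
df = pd.DataFrame({
    "income": [10, 20, 30, 10, 20, 40],
    "default": [0, 1, 0, 0, 1, 1],
})


def count_leaked_rows(train: pd.DataFrame, valid: pd.DataFrame) -> int:
    """Count validation rows that have an identical row in the training set."""
    # An inner merge on every column keeps only rows present in both frames.
    # Deduplicating the training side ensures each validation row matches at most once.
    matches = valid.merge(
        train.drop_duplicates(), how="inner", on=list(valid.columns)
    )
    return len(matches)


# Naive positional split on the raw data: duplicates straddle the split.
train, valid = df.iloc[:3], df.iloc[3:]
leaked_before = count_leaked_rows(train, valid)

# Same split after dropping duplicates first.
deduped = df.drop_duplicates()
train_d, valid_d = deduped.iloc[:3], deduped.iloc[3:]
leaked_after = count_leaked_rows(train_d, valid_d)

print(leaked_before, leaked_after)
```

In this toy case two of the three validation rows were already seen during training, so any score computed on them would say little about generalization.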

For this particular dataset the number of duplicates is rather small. Nevertheless, we will compare the performance of our scoring model before and after dropping the duplicates.

### Import the needed packages

In [9]:
import os
import pickle

import pandas as pd

In [10]:
with open('prep_parameters.pkl', 'rb') as params_file:
    prep = pickle.load(params_file)

## Identify the number of duplicates

In [11]:
# Read the data from the previous pipeline step
train_data = pd.read_csv('prep_traindata.csv', index_col=[0])

In [12]:
# Count the rows that are exact copies of an earlier row
n_duplicates = int(train_data.duplicated().sum())
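
As a side note on what this count means: with its default `keep='first'`, `DataFrame.duplicated()` flags only the later copies of a row, not the first occurrence. The small sketch below uses a hypothetical toy frame (not the pipeline data) to show the difference from `keep=False`, and that the default count equals the number of rows `drop_duplicates()` would remove.

```python
import pandas as pd

# Hypothetical toy frame: the first three rows are identical.
toy = pd.DataFrame({"a": [1, 1, 1, 2], "b": ["x", "x", "x", "y"]})

# Default keep='first': only the second and third copies are flagged.
extra_copies = int(toy.duplicated().sum())

# keep=False: every row that has at least one identical twin is flagged.
all_involved = int(toy.duplicated(keep=False).sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
remaining = len(toy.drop_duplicates())

print(extra_copies, all_involved, remaining)
```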

In [13]:
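# DROP toggles deduplication (1 = drop, 0 = keep); default to dropping
# when the variable is unset or is not an integer
try:
    drop = int(os.getenv('DROP'))
except (TypeError, ValueError):
    drop = 1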
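
If more steps need the same kind of switch, the parsing could be factored into a small helper. The function name `read_flag` below is hypothetical and is not part of the pipeline; it is only a sketch of the same fallback behavior (use the default when the variable is unset or not an integer).

```python
import os


def read_flag(name: str, default: int = 1) -> int:
    """Read an integer flag from the environment, falling back to a default."""
    raw = os.getenv(name)
    if raw is None:
        # Variable not set at all
        return default
    try:
        return int(raw)
    except ValueError:
        # Variable set to something that is not an integer, e.g. "yes"
        return default


drop_flag = read_flag('DROP')
```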

In [14]:
if drop and n_duplicates > 0:
    prep_data = train_data.drop_duplicates()
    prep['Drop Duplicates'] = 1
else:
    prep_data = train_data
    prep['Drop Duplicates'] = 0

## Outputs

In [15]:
prep_data.to_csv('prep_traindata.csv')

with open('prep_parameters.pkl', 'wb') as params_file:
    pickle.dump(prep, params_file)

### Creating the pipeline step outputs

In [16]:
import json

metadata = {
    'outputs': [
        {
            'type': 'markdown',
            'storage': 'inline',
            'source': f'## **Drop duplicates:** {bool(drop)}',
        },
        {
            'type': 'markdown',
            'storage': 'inline',
            'source': f'## **Number of duplicates:** {n_duplicates}',
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(metadata, metadata_file)