# 1. The impact of synthetic data on customer churn

We are all familiar with the axiom "garbage in, garbage out", and it holds especially true in a market where models are becoming more and more commoditized. The business advantage will remain in the one component that is unique to every organization: the data.

In every case, and particularly for churn prediction use cases, data preparation is paramount. Nevertheless, despite the progress made in the past few years, data preparation is still the most challenging and time-consuming step. Ensuring data quality helps data teams achieve a bigger ROI from AI initiatives at a fraction of the effort it used to take, translating into better models that positively impact the business and the customer experience.

When we look at churn prediction, a few data issues in particular can dampen model accuracy: outliers, missing values and imbalanced classes.

In this use case we will explore not only an iterative, traceable and comparable data preparation process to improve the quality of the data for churn models, but also how to mitigate each of the identified challenges: missing data, duplicates and, last but not least, imbalanced data.
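
As a rough illustration of the three checks named above, the sketch below uses plain pandas on a small made-up DataFrame to count missing values, duplicate rows and the share of each target class. The column names and values are invented for illustration and are not rows from the Telco dataset, and this is not how Fabric computes its own profile.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the Telco dataset).
toy = pd.DataFrame(
    {
        "Tenure": [1, 24, 24, 60, None, 3],
        "Monthly Charges": [53.85, 70.70, 70.70, 99.65, 104.80, 20.05],
        "Churn": [1, 0, 0, 0, 0, 1],
    }
)

# Missing values per column.
missing_per_column = toy.isna().sum()

# Fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(toy.duplicated().sum())

# Share of each class in the target column, to spot imbalance.
class_share = toy["Churn"].value_counts(normalize=True)

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
print(class_share)
```

On real data the same three calls give a first, cheap indication of how much cleaning and rebalancing is likely to be needed.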

The dataset used for this use case can be found on [Kaggle - Telco churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn/code).


## Import needed packages

In [1]:
import os
import pandas as pd
from pathlib import Path

# Importing YData's SDK packages
from ydata.labs.datasources import DataSources
from ydata.dataset import Dataset

from functions.saving_functions import save_file

## Read the data

The first step is to read the data. We have previously created the DataSource in the UI.

Fabric makes it easy to integrate the labs with other workbench elements (such as DataSources and Synthesizers). The code below shows how:

In [2]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='17488144-9467-4902-8df9-5380a6a5700e', namespace='a0717134-f616-42fd-a38e-468c63d26802')
dataset = datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
metadata = datasource.metadata
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 32
Number of rows: 7043
Duplicate rows: 33
Target column:

Column detail:
               Column    Data type Variable type Characteristics
0          CustomerID       string        string              id
1               Count  categorical           int                
2             Country  categorical        string        location
3               State  categorical        string        location
4                City       string        string                
5            Zip Code    numerical           int                
6            Lat Long       string        string                
7            Latitude    numerical         float                
8           Longitude    numerical         float                
9              Gender  categorical        string                
10     Senior Citizen  categorical        string               

In [3]:
#Convert dataset to Dask engine
dd_dataset = dataset.to_dask()

#Rename the columns tenure and Churn Value for ease of exploration
dd_dataset = dd_dataset.rename(columns={"Tenure Months": "Tenure", "Churn Value": "Churn"})

In [4]:
dataset = Dataset(dd_dataset)

## Data Exploration 

As a critical first step, we should develop a more comprehensive view of our data in order to understand the key drivers of customer churn.

In [5]:
# Quickly previewing the Dataset
dataset.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Yes,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,5340,Competitor had better devices


It is crucial for any data science process that the dataset's variable types are set correctly; otherwise data preparation may be suboptimal, resulting in lower performance of the classifiers built downstream.
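
To make the point about variable types concrete, here is a small pandas sketch in which a numeric column arrives as text and an identifier-like column arrives as an integer. The values are toy data invented for illustration. Using `pd.to_numeric` with `errors="coerce"` and a categorical cast is one common way to handle this in plain pandas; it is an assumption about how one might do it outside Fabric, not a description of what the SDK does internally.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "Zip Code": [90003, 90005, 90006],           # looks numeric, behaves like a label
        "Total Charges": ["108.15", "151.65", " "],  # numeric values stored as text
    }
)

# Types as inferred on load.
print(toy.dtypes)

# Text that cannot be parsed as a number becomes NaN instead of raising.
total_charges = pd.to_numeric(toy["Total Charges"], errors="coerce")

# Treat the zip code as a category so it is never averaged or scaled.
zip_code = toy["Zip Code"].astype("category")

print(total_charges.isna().sum())
print(total_charges.mean())
print(zip_code.dtype)
```

The blank entry only shows up as a missing value once the column is numeric, which is why type corrections are best made before profiling missing data.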

In [6]:
print(dataset)

Dataset

Shape: (7043, 32)
Schema:
               Column Variable type
0          CustomerID        string
1               Count           int
2             Country        string
3               State        string
4                City        string
5            Zip Code           int
6            Lat Long        string
7            Latitude         float
8           Longitude         float
9              Gender        string
10     Senior Citizen        string
11            Partner        string
12         Dependents        string
13             Tenure           int
14      Phone Service        string
15     Multiple Lines        string
16   Internet Service        string
17    Online Security        string
18      Online Backup        string
19  Device Protection        string
20       Tech Support        string
21       Streaming TV        string
22   Streaming Movies        string
23           Contract        string
24  Paperless Billing        string
25

Based on the dataset summary and metadata above, we have identified that the **CustomerID** column is in fact an *ID* type. For that reason we update the data types stored in the Metadata, setting **CustomerID** to `string` and **Churn Reason** to `longtext`.

In [9]:
#Updating the metadata with the correct data types
metadata.update_datatypes({'CustomerID': 'string',
                           'Churn Reason': 'longtext'})

## Create pipeline outputs

In [10]:
metadata.save('metadata.pkl')
#saving the dataset
save_file(dataset, file_path='dataset.pkl')
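
The `save_file` helper is imported from a local module that is not shown here. A minimal stand-in, assuming it simply pickles the object to the given path (an assumption based only on the `.pkl` extension), might look like the sketch below. The names `save_file_sketch` and `load_file_sketch` are hypothetical, and the dictionary is a placeholder for the real Dataset object.

```python
import pickle
import tempfile
from pathlib import Path


def save_file_sketch(obj, file_path):
    """Hypothetical stand-in for save_file: pickle obj to file_path."""
    path = Path(file_path)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_file_sketch(file_path):
    """Read back an object written by save_file_sketch."""
    with Path(file_path).open("rb") as handle:
        return pickle.load(handle)


# Write to a temporary directory so no real pipeline output is overwritten.
target = Path(tempfile.mkdtemp()) / "dataset.pkl"
save_file_sketch({"rows": 7043, "columns": 32}, target)

restored = load_file_sketch(target)
print(restored)
```

A later pipeline step could then load the file the same way to pick up where this step left off.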

In [11]:
# Collect the metadata warnings into a flat list of records
warnings = []
for warning, val in metadata.warnings.items():
    for col in val:
        try:
            level = col.details['level'].name
            value = round(col.details['value'], 4)
        except Exception:
            # Fall back to None when level/value details are unavailable
            level = None
            value = None
        warnings.append({'warning': warning, 'column': col.column, 'level': level, 'value': value})

df_warnings = pd.DataFrame(warnings)
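
The loop above flattens a mapping of warning names to lists of per-column warning objects into one record per warning. The sketch below reproduces that shape with hypothetical stand-in objects so the logic can be run without Fabric: `FakeWarning`, `Level`, the warning names and the values are all invented here and are not part of the YData SDK. Unlike the cell above, it looks up `level` and `value` separately instead of catching an exception.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """Hypothetical severity levels, invented for this sketch."""
    MODERATE = 1
    HIGH = 2


@dataclass
class FakeWarning:
    """Hypothetical stand-in for one per-column warning object."""
    column: str
    details: dict = field(default_factory=dict)


def flatten_warnings(warnings_by_type):
    """Turn {warning name: [warning objects]} into a flat list of dicts."""
    rows = []
    for name, items in warnings_by_type.items():
        for item in items:
            level = item.details.get("level")
            value = item.details.get("value")
            rows.append(
                {
                    "warning": name,
                    "column": item.column,
                    "level": level.name if level is not None else None,
                    "value": round(value, 4) if value is not None else None,
                }
            )
    return rows


# Made-up example input: one warning with details, one without.
example = {
    "missings": [FakeWarning("Churn Reason", {"level": Level.HIGH, "value": 0.123456})],
    "constant": [FakeWarning("Count")],
}

rows = flatten_warnings(example)
for row in rows:
    print(row)
```

A list of records in this shape can be passed directly to `pd.DataFrame`, as the cell above does.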

In [12]:
import json

# Named ui_metadata so it does not shadow the Metadata object loaded earlier
ui_metadata = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(df_warnings.columns),
            'source': df_warnings.to_csv(header=False, index=False),
        }
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(ui_metadata, metadata_file)
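
As a quick sanity check, a file in this shape can be read back and its inline CSV parsed into rows again. The sketch below uses only the standard library and a made-up two-row table. The file name `example-ui-metadata.json` and the temporary directory are chosen here so the real output file is not overwritten.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Made-up table content, invented for illustration only.
header = ["warning", "column", "level", "value"]
rows = [
    ["missings", "Churn Reason", "HIGH", "0.1235"],
    ["constant", "Count", "", ""],
]

# Serialize the rows as header-less CSV text, mirroring the 'source' field.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)

example_metadata = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": header,
            "source": buffer.getvalue(),
        }
    ]
}

path = Path(tempfile.mkdtemp()) / "example-ui-metadata.json"
path.write_text(json.dumps(example_metadata))

# Read the file back and rebuild one dict per table row.
loaded = json.loads(path.read_text())
table = loaded["outputs"][0]
parsed = [
    dict(zip(table["header"], record))
    for record in csv.reader(io.StringIO(table["source"]))
]
print(parsed)
```

Checking that the header length matches the number of fields in each parsed row is a cheap way to catch a malformed table before the pipeline UI tries to render it.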