# Generate churn training dataset

We start from the [telco churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn) and generate new features of the kind that typically appear in churn models.

We keep the 'customerID', 'tenure' and 'Churn' columns from the original dataset and end up with the following features:
- id: Customer Id
- customer_tenure: How many months the customer has been subscribed
- product_tenure: How many months the customer has had the specific product
- activity_last_6_months: Total minutes talked on the phone in the last 6 months
- activity_last_12_months: Total minutes talked on the phone in the last 12 months
- churned: Boolean indicating whether the customer churned
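
To make the target layout concrete, the sketch below builds a two-row toy frame with the columns listed above and checks it against that list. The row values and the `EXPECTED_COLUMNS` name are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Target column order, as listed above (the constant name is hypothetical).
EXPECTED_COLUMNS = [
    "id", "customer_tenure", "product_tenure",
    "activity_last_6_months", "activity_last_12_months", "churned",
]

# Two made-up rows, only to illustrate the expected types.
toy = pd.DataFrame({
    "id": ["user_00000", "user_00001"],
    "customer_tenure": [1, 34],
    "product_tenure": [1, 12],
    "activity_last_6_months": [250, 610],
    "activity_last_12_months": [250, 1400],
    "churned": [False, True],
})

# Any expected column that the frame does not contain.
missing = [c for c in EXPECTED_COLUMNS if c not in toy.columns]
```

The same `missing` check could be run against `df` at the end of the notebook, before the file is written.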

In [None]:
# Convert original dataset into a simplified one
import pandas as pd
df = pd.read_csv('Original_Dataset.csv')

In [None]:
columns_to_keep = ['customerID','tenure','Churn']
df = df[columns_to_keep]
df.head(2)

In [None]:
df.rename(columns = {'customerID':'id', 'tenure': 'customer_tenure'}, inplace = True) 
df.head(2)

In [None]:
df['churned'] = (df['Churn'] == "Yes")
df.drop('Churn',axis=1, inplace=True)
df.head(5)

In [None]:
import numpy as np
# Synthetic 6-month activity: random whole minutes from 0 to 999, one value per row
df['activity_last_6_months'] = np.random.randint(0,1000,size=len(df))
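
The random draws in this notebook are unseeded, so each run produces a different dataset. If reproducibility matters, one option is a seeded NumPy `Generator`. This is a minimal sketch on a toy frame; the seed value 42 and the toy rows are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are made up).
toy = pd.DataFrame({"customer_tenure": [1, 34, 2, 45]})

# A seeded Generator yields the same draws on every run; 42 is an arbitrary seed.
rng = np.random.default_rng(42)
toy["activity_last_6_months"] = rng.integers(0, 1000, size=len(toy))

# Re-creating the generator with the same seed reproduces the column.
rng_again = np.random.default_rng(42)
repeat = rng_again.integers(0, 1000, size=len(toy))
```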

In [None]:
# The 12-month total is 2x to 3x the 6-month total, truncated to whole minutes
df['activity_last_12_months'] = df['activity_last_6_months'] * (2 + np.random.random(size=len(df)))
df['activity_last_12_months'] = df['activity_last_12_months'].astype(int)
# Customers with a tenure of 6 months or less have no earlier activity,
# so their 12-month total equals their 6-month total
new_customers = df['customer_tenure'] <= 6
df.loc[new_customers, 'activity_last_12_months'] = df.loc[new_customers, 'activity_last_6_months']
df.head(5)
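
The rule for the 12-month activity can be checked on a small example. The sketch below repeats the same logic on a made-up four-row frame (seed and values are assumptions) and verifies two properties: the 12-month total is never below the 6-month total, and the two are equal for customers with at most 6 months of tenure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for a repeatable example

# Made-up rows: two customers at or below 6 months of tenure, two above.
toy = pd.DataFrame({
    "customer_tenure": [1, 6, 7, 34],
    "activity_last_6_months": [100, 200, 300, 400],
})

# 12-month total is 2x to 3x the 6-month total, truncated to int.
factor = 2 + rng.random(size=len(toy))
toy["activity_last_12_months"] = (toy["activity_last_6_months"] * factor).astype(int)

# Customers with at most 6 months of tenure have no earlier activity.
new_customers = toy["customer_tenure"] <= 6
toy.loc[new_customers, "activity_last_12_months"] = toy.loc[new_customers, "activity_last_6_months"]

# Property checks on the result.
never_smaller = (toy["activity_last_12_months"] >= toy["activity_last_6_months"]).all()
new_equal = (
    toy.loc[new_customers, "activity_last_12_months"]
    == toy.loc[new_customers, "activity_last_6_months"]
).all()
```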

In [None]:
# Generate random product tenure in months (integers from 0 to 73)
df['product_tenure'] = np.random.randint(0,74,size=len(df))
# Cap each value at the customer's overall tenure
too_long = df['product_tenure'] > df['customer_tenure']
df.loc[too_long, 'product_tenure'] = df.loc[too_long, 'customer_tenure']
df.head(5)
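
Capping one column at another can also be written as an element-wise minimum. The sketch below shows this alternative with `np.minimum` on made-up values; it is not the notebook's own code, only an equivalent formulation.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first and last product tenures exceed the customer tenure.
toy = pd.DataFrame({
    "customer_tenure": [1, 34, 2, 72],
    "product_tenure": [50, 10, 2, 73],
})

# Element-wise minimum keeps product_tenure where it is already valid
# and replaces it with customer_tenure where it is too large.
toy["product_tenure"] = np.minimum(toy["product_tenure"], toy["customer_tenure"])
```

After this step the column holds 1, 10, 2 and 72, so no product tenure exceeds the customer tenure.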

In [None]:
# Replace id with user_<zero-padded index number>.
df['id'] = 'user'
# Dataset has 42258 rows so zero-padding the index to 5 digits is enough
df['id'] = df['id'].str.cat(df.index.to_series().map('{:05d}'.format), sep ="_")
df.head()
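
The padding width above is hard-coded to 5. A possible variation, shown here as a sketch on a made-up three-row frame, derives the width from the row count and never goes below 5, so the IDs stay aligned if the dataset grows. The `width` variable is a hypothetical name.

```python
import pandas as pd

# Toy frame with a default 0..n-1 index (values are made up).
toy = pd.DataFrame({"customer_tenure": [1, 34, 2]})

# Digits needed for the largest index value, with a minimum of 5.
width = max(5, len(str(len(toy) - 1)))

# Build "user_<zero-padded index>" for every row.
toy["id"] = "user_" + toy.index.to_series().astype(str).str.zfill(width)
```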

In [None]:
# Reorder columns
df = df.reindex(columns= ['id', 'customer_tenure', 'product_tenure','activity_last_6_months', 'activity_last_12_months', 'churned'])
df.head(5)

In [None]:
# pandas-profiling has been renamed to ydata-profiling
!pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

ProfileReport(df, title='Pandas Profiling Report', explorative=True)

In [None]:
# Writing parquet requires an engine such as pyarrow or fastparquet to be installed
df.to_parquet('CustomerInfo.parquet')