In [2]:
import pandas as pd
import numpy as np

This notebook produces normalised scores for event metric observations. Rescaling the observations in this way makes them easier to interpret, particularly when the metric is highly skewed. Skewed metrics are log-transformed to remove the skew.

The procedure for producing the scores is as follows:

1) Determine whether the metric is significantly skewed by checking the skew statistic. A typical threshold considers the skew to be significant when it is above 4. Also check that the minimum metric value (before any transformation) is zero or greater. If the metric is not significantly skewed or has negative values, skip to step 4.
2) Add 1 to the skewed customer metrics, so that a customer that has a zero event count now has 1, a customer with 1 now has 2, and so on.
3) Take the logarithm of all the skewed metrics.
4) Calculate the mean and the standard deviation of the metric values at this point in the process. If the metric was not skewed, these values are simply the original metrics. If the metric was skewed, use the logarithm of 1 plus the original customer metrics.
5) Subtract the mean from all values.
6) Divide all values by the standard deviation.

The result is a score that measures how many standard deviations above or below the mean each value falls.
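
The six steps above can be sketched for a single metric as follows. This is a minimal illustration rather than the notebook's own implementation: the function name `score_metric` and the toy values are assumptions, and the toy call lowers the threshold to 2.0 only because a ten-value sample cannot reach a skew of 4.

```python
import numpy as np
import pandas as pd


def score_metric(values: pd.Series, skew_thresh: float = 4.0) -> pd.Series:
    """Return normalised scores for one metric, following steps 1 to 6."""
    # Step 1: only transform metrics that are strongly skewed and non-negative
    is_skewed = values.skew() > skew_thresh and values.min() >= 0
    if is_skewed:
        # Steps 2 and 3: log(1 + x), so a zero count maps to zero
        values = np.log1p(values)
    # Steps 4 to 6: centre on the mean and scale by the standard deviation
    return (values - values.mean()) / values.std()


# Toy data (made up): mostly small counts plus one very large value
toy_counts = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 5, 5000], dtype=float)
toy_scores = score_metric(toy_counts, skew_thresh=2.0)
print(toy_scores.round(3))
```

Because the mean and standard deviation are computed after the log transform, the returned scores have mean 0 and standard deviation 1 whether or not the metric was treated as skewed.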

In [17]:
# we consider columns with a skew of > 4.0 to be skewed
skew_thresh = 4.0

# read in the event metrics dataset
dataset_path = "../create-churn-dataset/socialnet_dataset.csv"
churn_data = pd.read_csv(dataset_path, index_col=[0, 1])
data_scores = churn_data.copy()
# is_churn column should not be converted into a score
data_scores = data_scores.drop('is_churn', axis=1)

# read in the summary stats
summarystats_path = "../metric-summary-stats/socialnet_dataset_summarystats.csv"
stats = pd.read_csv(summarystats_path, index_col=0)
stats = stats.drop('is_churn')

# identify the skewed columns: skew above the threshold and no negative values
skewed_columns = (stats['skew'] > skew_thresh) & (stats['min'] >= 0)
skewed_columns = skewed_columns[skewed_columns]

# log-transform each skewed column, then recompute its mean and standard
# deviation so the normalisation below uses the transformed values
for col in skewed_columns.index:
    data_scores[col] = np.log(1.0 + data_scores[col])
    stats.at[col, 'mean'] = data_scores[col].mean()
    stats.at[col, 'std'] = data_scores[col].std()

# normalise all columns by subtracting the mean and dividing by the standard deviation
data_scores = (data_scores-stats['mean']) / stats['std'] 
data_scores['is_churn'] = churn_data['is_churn'] 
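
The normalisation line relies on pandas matching the index of `stats['mean']` and `stats['std']` to the column labels of `data_scores`. The sketch below shows that alignment on made-up data; the names `toy_metrics` and `toy_stats` and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy metrics table (made-up values): rows are accounts, columns are metrics
toy_metrics = pd.DataFrame({"likes": [1.0, 2.0, 3.0], "posts": [10.0, 20.0, 30.0]})

# Toy summary stats indexed by metric name, listed in a different order
toy_stats = pd.DataFrame(
    {"mean": [20.0, 2.0], "std": [10.0, 1.0]}, index=["posts", "likes"]
)

# The Series index is matched to the DataFrame column labels before the
# arithmetic, so each column uses its own mean and standard deviation
toy_scores = (toy_metrics - toy_stats["mean"]) / toy_stats["std"]
print(toy_scores)
```

A column with no matching entry in the stats would come out as all NaN, which is why the summary stats and the metric columns need to cover the same names.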

In [20]:
data_scores.head()

account_id,observation_date,account_tenure,adview_per_month,dislike_per_month,is_churn,like_per_month,message_per_month,newfriend_per_month,post_per_month,reply_per_month,unfriend_per_month
46,09/02/2020,-0.23279,-0.536797,0.703097,False,0.422753,2.1981,-0.291202,-1.655806,2.399927,-0.498848
95,09/02/2020,-0.23279,1.077039,-0.363161,False,0.000366,-1.155536,-0.450281,1.567462,-0.813962,-0.498848
216,09/02/2020,-0.23279,1.091718,0.486525,False,-0.72495,-0.949616,-0.291202,0.402234,-0.813962,1.529451
219,09/02/2020,-0.23279,0.223127,0.309573,False,-0.72495,0.417219,-0.132124,0.251232,0.720016,-0.498848
321,09/02/2020,-0.23279,-0.015462,-0.363161,False,1.578468,0.23733,-0.132124,0.163971,0.507796,-0.498848


We can check these scores by using `describe()`.

For every metric the scores have been centred on the mean (each mean is zero to within floating-point error) and scaled so that the standard deviation is 1. Most values lie within a few standard deviations of the mean; the exceptions are `newfriend_per_month` and `unfriend_per_month`, whose maximum scores are about 15.1 and 7.6 respectively.

In [24]:
data_scores.describe().transpose()

metric,count,mean,std,min,25%,50%,75%,max
account_tenure,9307.0,-1.221520e-17,1.0,-0.654713,-0.496492,-0.391011,-0.23279,4.513846
adview_per_month,9307.0,2.046045e-16,1.0,-2.261376,-0.64754,0.038063,0.687243,4.041998
dislike_per_month,9307.0,-5.344149e-17,1.0,-1.9252,-0.579733,-0.036598,0.703097,3.90665
like_per_month,9307.0,2.58046e-16,1.0,-2.519082,-0.659847,-0.02589,0.670302,3.697388
message_per_month,9307.0,1.297865e-17,1.0,-1.941914,-0.659387,-0.052898,0.680433,3.588445
newfriend_per_month,9307.0,-5.191459e-17,1.0,-0.768437,-0.609359,-0.291202,0.186033,15.139403
post_per_month,9307.0,3.130144e-16,1.0,-2.229881,-0.618247,0.012969,0.665982,3.767534
reply_per_month,9307.0,-1.236789e-16,1.0,-1.325288,-0.813962,-0.138026,0.720016,3.653187
unfriend_per_month,9307.0,8.207085e-17,1.0,-0.498848,-0.498848,-0.498848,-0.498848,7.614347
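
The same check can be done programmatically instead of by reading the table. The sketch below uses made-up data in place of `data_scores`; the names `toy`, `summary` and `outlier_thresh`, and the cut-off of 5 standard deviations, are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): one evenly spread metric, one with a single outlier
toy = pd.DataFrame(
    {"metric_a": np.linspace(-1.0, 1.0, 101), "metric_b": [0.0] * 100 + [100.0]}
)
toy_scores = (toy - toy.mean()) / toy.std()

summary = toy_scores.describe().transpose()

# Each score column should have mean 0 and standard deviation 1, up to rounding
means_ok = np.allclose(summary["mean"], 0.0, atol=1e-9)
stds_ok = np.allclose(summary["std"], 1.0)

# Hypothetical cut-off: list metrics whose most extreme score is beyond it
outlier_thresh = 5.0
extreme = summary.index[summary[["min", "max"]].abs().max(axis=1) > outlier_thresh]
print(means_ok, stds_ok, list(extreme))
```

Metrics flagged this way still have extreme values after scoring and may deserve a closer look.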


In [21]:
# save the scores
data_scores.to_csv('../../output/socialnet_dataset_scores.csv', header=True)
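
When the saved scores are read back, passing `index_col=[0, 1]` restores the two-level (`account_id`, `observation_date`) index, in the same way as the dataset was loaded earlier. The round trip below is a sketch on made-up rows written to a temporary directory; the file name `toy_scores.csv` and the values are assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy scores (made-up values) with the same two-level index as the dataset
toy_scores = pd.DataFrame(
    {"like_per_month": [0.5, -0.5], "is_churn": [False, True]},
    index=pd.MultiIndex.from_tuples(
        [(46, "09/02/2020"), (95, "09/02/2020")],
        names=["account_id", "observation_date"],
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_scores.csv")
    toy_scores.to_csv(path, header=True)
    # index_col=[0, 1] rebuilds the (account_id, observation_date) index
    reloaded = pd.read_csv(path, index_col=[0, 1])

print(reloaded.index.names)
```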