In [2]:
import pandas as pd
import numpy as np

This notebook produces normalised scores for event metric observations. Rescaling the observations in this way makes them easier to interpret, particularly when the metric is highly skewed. Skewed metrics are log-transformed to remove the skew.
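As a rough illustration of why the log transform helps, the sketch below compares the pandas skew statistic before and after the transform. The counts are randomly generated stand-ins, not values from the dataset, and the name `fake_counts` and the distribution parameters are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical event counts drawn from a heavy-tailed distribution
# (made-up data, used only to illustrate the effect of the transform).
rng = np.random.default_rng(0)
fake_counts = pd.Series(np.floor(rng.lognormal(mean=1.0, sigma=1.5, size=1000)))

# log(1 + x) keeps zero counts at zero and compresses the long right tail.
logged = np.log(1.0 + fake_counts)

skew_before = fake_counts.skew()
skew_after = logged.skew()

print(f"skew before transform: {skew_before:.2f}")
print(f"skew after transform:  {skew_after:.2f}")
```

The skew statistic should drop noticeably after the transform, which is what makes the resulting scores easier to read.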

The procedure for producing the scores is as follows:

1) Determine whether the metric is significantly skewed by checking the skew statistic. A typical threshold treats the skew as significant when it is above 4. Also check that the minimum metric value (before any transformation) is not negative; for event counts it is typically zero. If the metric is not significantly skewed or has negative values, skip to step 4.
2) Add 1 to the skewed customer metrics, so that a customer that has a zero event count now has 1, a customer with 1 now has 2, and so on.
3) Take the logarithm of all the skewed metrics.
4) Calculate the mean and the standard deviation of the metric values at this point in the process. If the metric was not skewed, these values are simply the original metrics. If the metric was skewed, use the logarithm of 1 plus the original customer metrics.
5) Subtract the mean from all values.
6) Divide all values by the standard deviation.

The result is a score that measures how many standard deviations above or below the mean the value falls.
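The steps above can be sketched for a single metric as follows. This is a minimal illustration rather than the notebook's own implementation: the function name `score_metric` and the sample counts are hypothetical, and the default threshold of 4.0 simply mirrors the one described in step 1.

```python
import numpy as np
import pandas as pd

def score_metric(values: pd.Series, skew_thresh: float = 4.0) -> pd.Series:
    """Return standard scores, log-transforming first if the metric is skewed."""
    # Step 1: only transform metrics that are strongly skewed and non-negative.
    if values.skew() > skew_thresh and values.min() >= 0:
        # Steps 2 and 3: add 1, then take the logarithm.
        values = np.log(1.0 + values)
    # Steps 4 to 6: subtract the mean and divide by the standard deviation.
    return (values - values.mean()) / values.std()

# Hypothetical event counts: 99 light users and one very heavy user.
counts = pd.Series([0, 1, 2] * 33 + [1000], dtype=float)
scores = score_metric(counts)

# A small, unskewed metric is scored without any transformation.
plain_scores = score_metric(pd.Series([1.0, 2.0, 3.0]))
```

Whichever branch is taken, the returned scores have a mean of zero and a standard deviation of one.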

In [14]:
# we consider columns with a skew of > 4.0 to be skewed
skew_thresh = 4.0

# read in the event metrics dataset
dataset_path = "../create-churn-dataset/socialnet_dataset.csv"
churn_data = pd.read_csv(dataset_path, index_col=[0, 1])
data_scores = churn_data.copy()
# is_churn column should not be converted into a score
data_scores = data_scores.drop('is_churn', axis=1)

# read in the summary stats
summarystats_path = "../metric-summary-stats/socialnet_dataset_summarystats.csv"
stats = pd.read_csv(summarystats_path, index_col=0)
stats = stats.drop('is_churn')

# identify the skewed columns (non-negative metrics whose skew exceeds the threshold)
skewed_mask = (stats['skew'] > skew_thresh) & (stats['min'] >= 0)
skewed_columns = skewed_mask[skewed_mask].index

# log-transform the skewed columns and update their summary stats to match
for col in skewed_columns:
    data_scores[col] = np.log(1.0 + data_scores[col])
    stats.at[col, 'mean'] = data_scores[col].mean()
    stats.at[col, 'std'] = data_scores[col].std()

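# normalise all columns by subtracting the mean and dividing by the standard deviation
# (one way to do this, relying on pandas aligning the stats index with the column names)
data_scores = (data_scores - stats['mean']) / stats['std']
# carry the untransformed churn label over to the scored dataset
data_scores['is_churn'] = churn_data['is_churn']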

In [15]:
type(skewed_columns)

pandas.core.indexes.base.Index

In [16]:
skewed_columns

Index(['like_per_month', 'post_per_month', 'adview_per_month',
       'dislike_per_month', 'message_per_month', 'reply_per_month'],
      dtype='object')