In this example we show how the model hub can be used to estimate how much individual features contribute to reaching conversion. With the model hub, you can estimate these contributions as well as evaluate the performance of the underlying model.

In [None]:
from modelhub import ModelHub
from matplotlib import pyplot as plt

In [None]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='YYYY-MM-DD')
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-01')

In [None]:
# define which events to use as conversion events
modelhub.add_conversion_event(location_stack=df.location_stack.json[{'id': 'objectiv-on-github', 
                                                                     '_type': 'LinkContext'}:].fillna(
                                             df.location_stack.json[{'id': 'github', '_type': 'LinkContext'}:]),
                              event_type='PressEvent',
                              name='github_press')

In [None]:
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')
df['nice_name'] = df.location_stack.ls.nice_name

We want to measure the impact on conversion of presses in the individual sections (root locations) of our website. To do so, we predict conversion per user from the number of presses in each root location on our site, using a logistic regression model. The coefficients of this regression can be interpreted as the contribution to conversion (direction and magnitude). This interpretation assumes a true causal relation between the number of presses per root location and conversion, so keep this assumption in mind when using this model on your own data.
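
As a minimal sketch of how such coefficients can be read: in a logistic regression, a coefficient is the change in the log odds of conversion per additional press, so exponentiating it gives an odds ratio. The root location names and coefficient values below are made up for illustration and are not results from this notebook.

```python
import math

# Hypothetical coefficients per root location (illustrative values only).
coefficients = {'home': 0.40, 'docs': -0.25, 'blog': 0.0}


def odds_ratio(coef):
    """Factor by which the odds of conversion change per additional press."""
    return math.exp(coef)


for root, coef in coefficients.items():
    if coef > 0:
        direction = 'raises'
    elif coef < 0:
        direction = 'lowers'
    else:
        direction = 'does not change'
    print(f"{root}: one extra press {direction} the odds (x{odds_ratio(coef):.2f})")
```

A positive coefficient gives an odds ratio above 1, a negative one gives a ratio below 1, and a coefficient of zero leaves the odds unchanged.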

The feature importance model returns a trained model, as well as the data set that was used to train the model, based on the parameters.

This lets you adjust the data set further or use the model as is.

The model has methods for assessing its accuracy.

#### todo 

model should be a class that allows for rerunning the data on a cleaned data set. currently just a dict with results.

In [None]:
# todo, do return the model but not fitted
# X_temp, y_temp, model = modelhub.agg.feature_importance_new(
#     data=df[df.event_type=='PressEvent'],
#     name='github_press',
#     feature_column='root'
# )

In [None]:
X_temp, y_temp = modelhub.agg.create_feature_usage_data_set(
    data=df[df.event_type=='PressEvent'],
    name='github_press',
    feature_column='root'
)

In [None]:
y_temp.head()

In [None]:
X_temp.head()

In our example, we assess the model's accuracy in detail, so we won't jump to the model results. Instead, we first look at our data set and prepare a proper data set for the model.

In [None]:
data_set_temp = X_temp.copy()
data_set_temp['is_converted'] = y_temp
# todo sum axis = 1? now gets all user ids
data_set_temp['total_press'] = X_temp.stack().to_frame().reset_index().groupby('user_id').__stacked.sum().to_pandas()

### Cleaning the dataset

First, the data set has to be prepared. The data set and the relation between the predictors and the predicted classes have to fulfill several assumptions, such as a sufficient sample size, linearity between the features and the log odds, and no influential outliers. We look at our data to try to get the best possible data set for the model.


In [None]:
data_set_temp.describe().head()

In [None]:
data_set_temp.is_converted.value_counts().head()

In [None]:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()

We see that most variables have a mean of less than one. We can also look at the distributions of the variables. We split the histograms for each variable by conversion.

In [None]:
# figure, axis = plt.subplots(len(X_temp.data_columns), 2,figsize=(15,30))

# plot each variable (not only 'about'), split by conversion
# for idx, name in enumerate(X_temp.data_columns):
#     data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20,title='Converted',ax=axis[idx][0])
#     data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20,title='Not converted',ax=axis[idx][1])
# plt.tight_layout()

To reduce the skew in the data, we first drop two variables that show (almost) no variation at all: privacy and join-slack.

In [None]:
data_set_temp = data_set_temp.drop(columns=['privacy','join-slack'])
columns_remaining = [x for x in data_set_temp.data_columns if x in X_temp.data_columns]

Also, to unskew the data, we drop all users that have only a single press, as we believe that such cases don't have any explanatory power for the target (it means reaching the goal after one click).

Those might, for instance, be users that wanted to go to our GitHub, and this was the quickest way to get there with the results Google provided them. In that case, the intent of the user (something of which we can never be sure) was to go to the GitHub page, and it was not the features that convinced them.

By filtering like this, it is more likely that the features used on our website did or did not convince a user to check out our product on GitHub. This is exactly what we are after.

In [None]:
data_set_temp = data_set_temp[data_set_temp.total_press>1]

In [None]:
data_set_temp.describe().head()

In [None]:
# figure, axis = plt.subplots(len(columns_remaining), 2,figsize=(15,30))

# plot each remaining variable (the dropped columns no longer exist), split by conversion
# for idx, name in enumerate(columns_remaining):
#     data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20,title='Converted',ax=axis[idx][0])
#     data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20,title='Not converted',ax=axis[idx][1])
# plt.tight_layout()

Although our feature usage is still skewed, with many users using a feature 0 times, it is better than before. The data set is also smaller, but (slightly) more balanced. This can be seen from the (in most cases) higher mean and std, as well as from the plots.

In [None]:
data_set_temp.is_converted.value_counts().head()

In [None]:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()

In [None]:
data_set = data_set_temp[columns_remaining+['is_converted']]
X = data_set[columns_remaining]
y = data_set.is_converted

## Model

The model of choice is a logistic regression. This model gives a probability of converting and also lets us interpret the feature coefficients, which is key to our goal.

#### Error metric

We also have to choose an error metric. We do not use the overall F1 score, because we have an imbalanced data set. Instead, we look at how well conversions in particular are predicted, while keeping in mind that overall accuracy shouldn't drop too much.

One way to deal with the imbalance is to balance the data set, but we don't do this, because we want to use all data.
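
To illustrate why an overall score can be misleading on an imbalanced data set, the sketch below scores a "model" that always predicts the majority class. The labels are hypothetical (10 converted users out of 100) and are not taken from this notebook's data.

```python
# Hypothetical labels: 10 converted users out of 100 (not the notebook's data).
y_true = [1] * 10 + [0] * 90
# A "model" that always predicts the majority class (not converted).
y_pred = [0] * 100

# Overall accuracy: share of users with a correct prediction.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the converted class: share of converted users that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_converted = true_positives / sum(y_true)

print('accuracy:', accuracy)
print('recall for conversions:', recall_converted)
```

The overall accuracy looks high even though not a single conversion is predicted, which is why the conversions are looked at separately.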

The most important measure of model goodness is the AUC. The reason is that we are less interested in the actual predicted labels than in the coefficients of the model. The AUC can then give a good indication of performance compared to a baseline.
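
As a rough sketch of what the AUC measures: it is the share of (converted, not converted) pairs of users in which the converted user gets the higher predicted score. A model that cannot tell the two groups apart scores 0.5, which serves as the baseline. The helper `auc` and the example scores below are illustrative assumptions, not the model hub's implementation.

```python
def auc(y_true, scores):
    """AUC as the share of (converted, not converted) pairs ranked correctly.

    Ties in score count as half a correctly ranked pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities (illustrative values only).
y_true = [1, 1, 0, 0, 0, 0]
informative = [0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
constant = [0.5] * 6  # baseline: the same score for every user

print(auc(y_true, informative))  # 7 of 8 pairs ranked correctly -> 0.875
print(auc(y_true, constant))     # all pairs tied -> 0.5
```

Because the AUC only depends on how users are ranked, it is not affected by the threshold used to turn probabilities into labels.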



The 'feature_importance_proto' model splits the data into five folds and runs the model five times. The result is based on the average of the coefficients of the five runs. The AUC of each run is averaged and interpreted as model goodness.
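
The procedure described above can be sketched with the standard library only. This is not the implementation of 'feature_importance_proto': the gradient descent fit, the helper names (`fit_logistic`, `auc`, `cross_validate`), the synthetic data and the two location names are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit a logistic regression with batch gradient descent.

    Returns (weights, bias).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def auc(y_true, scores):
    """Share of (converted, not converted) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validate(X, y, n_folds=5, seed=0):
    """Train on all folds but one, test on the held-out fold, for every fold."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    coefs, aucs = [], []
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        # The linear score ranks users the same way as the probability does.
        scores = [b + sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
        coefs.append(w)
        aucs.append(auc([y[i] for i in test], scores))
    return coefs, aucs


# Synthetic data: presses in two hypothetical locations; only the first
# one influences conversion.
rng = random.Random(42)
X = [[rng.randint(0, 4), rng.randint(0, 4)] for _ in range(200)]
y = [int(rng.random() < 1 / (1 + math.exp(2 - x[0]))) for x in X]

coefs, aucs = cross_validate(X, y)
print('mean AUC:', round(mean(aucs), 2))
for j, name in enumerate(['location_a', 'location_b']):
    values = [c[j] for c in coefs]
    print(name, 'mean coef:', round(mean(values), 2), 'std:', round(stdev(values), 2))
```

Averaging over folds gives both a coefficient estimate and a spread per feature, and every AUC is computed on data that the corresponding model has not seen during training.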

In [None]:
results = modelhub.aggregate.feature_importance_proto(X, y, print_report = False)

In [None]:
results['auc_mean']

The average AUC over the five test sets indicates an OK performance. Therefore the coefficients can be interpreted as giving some explanation of conversion.

In [None]:
results['feature']

In [None]:
results['coef']

For most features, the average coefficient is quite stable and has the same sign in all runs, with each run tested solely on data that was not used to train that model. The lower the std, the more certain we are of the actual value of the feature importance. Do note that any two training sets share 3/4 of their data.
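
The overlap between training sets follows directly from the five-fold setup, as the small sketch below shows. It assumes folds of equal size, so that counting folds is the same as counting rows.

```python
from itertools import combinations

n_folds = 5
fold_ids = set(range(n_folds))

# Each training set consists of all folds except the held-out one.
training_sets = [fold_ids - {held_out} for held_out in range(n_folds)]

# Share of one training set that also appears in another training set.
overlaps = {len(a & b) / len(a) for a, b in combinations(training_sets, 2)}

print(overlaps)  # every pair of training sets shares 3 of its 4 folds
```

Because the runs are trained on largely the same data, the std of the coefficients across runs likely understates the real uncertainty.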