This is one of the Objectiv example notebooks. For more examples, visit the 
[example notebooks](https://objectiv.io/docs/modeling/example-notebooks/) section of our docs. The notebooks can run with the demo data set that comes with our [quickstart](https://objectiv.io/docs/home/quickstart-guide/), but can also be run on your own collected data.

All example notebooks are also available in our [quickstart](https://objectiv.io/docs/home/quickstart-guide/). With the quickstart you can spin up a fully functional Objectiv demo pipeline in five minutes. This also allows you to run these notebooks and experiment with them on a demo data set.

In this example we show how the model hub provides a toolkit for modeling the importance of features
for achieving a conversion goal.

### Import the required packages for this notebook
The open model hub package can be installed with `pip install objectiv-modelhub` (this installs Bach as well).  
If you are running this notebook from our quickstart, the model hub and Bach are already installed, so you don't have to install them separately.

In [None]:
from modelhub import ModelHub
from matplotlib import pyplot as plt

First we have to instantiate the model hub and the Objectiv DataFrame object.

In [None]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='YYYY-MM-DD')
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-02-15', end_date='2022-05-16')

The feature importance model from the open model hub returns two things: a Bach data set that can be used
to train the model, and the model itself, which produces the results. The model also includes tools to
assess its accuracy.

First we have to define the conversion goal that we are predicting as well as the features that we want to
use as predictors.

In [None]:
# define which events to use as conversion events
modelhub.add_conversion_event(location_stack=df.location_stack.json[{'id': 'modeling', '_type': 'RootLocationContext'}:],
                              event_type='PressEvent',
                              name='use_modeling')

In [None]:
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id')

In our example, the conversion goal is reaching the modeling section. We want to measure the impact of
presses in the individual sections (root locations) of our website. We assume there is a causal relation
between the number of presses per user in each root location and conversion. For demonstration purposes
these are appropriate features because this data set has a limited number of root locations.
Keep this assumption in mind when using this model on your own data. We therefore estimate
conversion from the number of presses per user in each root location on our site, using a logistic
regression model. The coefficients of this regression can be interpreted as the contribution to conversion
(direction and magnitude).
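To make the interpretation of a coefficient concrete, here is a minimal sketch in plain Python. The values of `intercept` and `coef_blog`, and the helper names, are hypothetical and chosen for illustration only; they are not output of the model hub. In a logistic regression, one extra press in a feature multiplies the odds of conversion by `exp(coefficient)`, so a positive coefficient raises the odds and a negative one lowers them.

```python
import math

# Hypothetical values for illustration only; not results from the model hub.
intercept = -2.0
coef_blog = 0.5  # coefficient for presses in a hypothetical 'blog' root location


def conversion_probability(presses: int) -> float:
    """Apply the logistic function to the linear predictor."""
    logit = intercept + coef_blog * presses
    return 1.0 / (1.0 + math.exp(-logit))


def odds(probability: float) -> float:
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


# One extra press multiplies the odds by exp(coefficient),
# regardless of how many presses the user already had.
odds_ratio = odds(conversion_probability(3)) / odds(conversion_probability(2))
print(round(odds_ratio, 4), round(math.exp(coef_blog), 4))
```

The two printed numbers agree, which is why the sign of a coefficient gives the direction of the effect and its size gives the magnitude.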

Now the data set and untrained model can be instantiated.

In [None]:
X_temp, y_temp, model = modelhub.agg.feature_importance(
    data=df[df.event_type=='PressEvent'],
    name='use_modeling',
    feature_column='root'
)

This lets you adjust the data set further or use the model as is. `y_temp` is a BooleanSeries that
indicates conversion per user. `X_temp` is a DataFrame with the number
of presses per root location for each user_id. For users that converted in the selected data, only usage
from _before_ reaching conversion is counted. The `model` is the
toolkit that can be used to assess the feature importance for our conversion goal.
In this example we first review the data set with Bach before using it for the actual model training (hence
the `_temp` suffix). We create a single DataFrame that has all the features, the target and a sum of all
features.
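The idea of counting only the presses from before conversion can be sketched in plain Python. This is not how the model hub implements it (that happens in SQL through Bach); the `events` records and the `features_and_target` helper are hypothetical, and leaving the conversion press itself out of the features is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical (user_id, moment, root, is_conversion_event) records; not real Objectiv data.
events = [
    ("u1", 1, "home", False),
    ("u1", 2, "blog", False),
    ("u1", 3, "modeling", True),  # u1 converts here
    ("u1", 4, "blog", False),     # after conversion: not counted
    ("u2", 1, "home", False),
    ("u2", 2, "home", False),
]


def features_and_target(records):
    """Return press counts per user per root (X) and a conversion flag per user (y)."""
    X = defaultdict(Counter)
    y = defaultdict(bool)
    # Process each user's events in chronological order.
    for user, _moment, root, is_conversion in sorted(records, key=lambda r: (r[0], r[1])):
        if y[user]:
            continue  # ignore everything after the first conversion
        if is_conversion:
            y[user] = True
            continue  # assumption: the conversion press itself is not a feature
        X[user][root] += 1
    return {user: dict(counts) for user, counts in X.items()}, dict(y)


X_sketch, y_sketch = features_and_target(events)
print(X_sketch)
print(y_sketch)
```

In this toy data the second 'blog' press of `u1` is dropped because it happens after conversion, while both 'home' presses of the non-converted user `u2` are kept.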

In [None]:
y_temp

In [None]:
y_temp.head()

In [None]:
X_temp

In [None]:
X_temp.head()

In our example, we will assess the model's accuracy in detail, so we won't jump to the model results. Instead, we first look at our data and prepare a proper data set for the model.

In [None]:
data_set_temp = X_temp.copy()
# we save the columns that are in our data set, these will be used later.
columns = X_temp.data_columns
data_set_temp['is_converted'] = y_temp
data_set_temp['total_press'] = modelhub.map.sum_feature_rows(X_temp)

### Reviewing the data set

A logistic regression relies on several assumptions: a sufficient sample size, no influential outliers and
a linear relation between the features and the logit of the goal. We will look at our data to
get the best possible data set for our model.

In [None]:
data_set_temp.describe().head()

We have 543 samples in our data. The description of our data set tells us that both the mean and the
standard deviation are quite low for most features. This indicates that feature usage is not distributed
very well.

In [None]:
data_set_temp.is_converted.value_counts().head()

In [None]:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()

The data set is not balanced in terms of users that did or did not reach conversion: 74 converted users
(13.6%). While this is not necessarily a problem, it influences the metric we choose to look at for model
performance. The model that we instantiated already accounts for this.
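The following small calculation uses the numbers from the text (74 converted users out of 543) to show why imbalance affects the choice of metric. The variable names are only for illustration. A trivial classifier that always predicts "not converted" already reaches a high accuracy without learning anything about the features, which is why plain accuracy is a poor yardstick here.

```python
# Numbers taken from the text: 74 converted users out of 543 samples.
total_users = 543
converted_users = 74

conversion_rate = converted_users / total_users

# A classifier that always predicts "not converted" is right for every non-converted user.
majority_baseline_accuracy = (total_users - converted_users) / total_users

print(f"conversion rate: {conversion_rate:.1%}")
print(f"accuracy of always predicting 'not converted': {majority_baseline_accuracy:.1%}")
```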

We can also plot histograms with Bach of the features so we can inspect the distributions more closely.

In [None]:
figure, axis = plt.subplots(len(columns), 2, figsize=(15,30))

for idx, name in enumerate(columns):
    data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20, title='Converted', ax=axis[idx][0])
    data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20, title='Not converted', ax=axis[idx][1])
plt.tight_layout()

We see that some features are not useful at all ('join-slack' and 'privacy'), so we will remove them. Moreover,
we think that users who clicked only once in any of the root locations will not provide us with any
explanatory behavior for the goal.

Those users might, for instance, be users that wanted to go to our modeling section, and this was the
quickest way to get there with the results Google provided them. In that case, the intent of the user
(something of which we can never be sure), was going to the modeling section. The features did not convince them.

After filtering like this, it is more likely that the features used on our website did, or did not,
convince a user to check out the modeling section of our docs. This is exactly what we are after. An
additional advantage is that the distribution of feature usage will most likely become more favorable
after removing 1-press-users.

In [None]:
data_set_temp = data_set_temp.drop(columns=['privacy','join-slack'])
# we update the columns that are still in our data set.
columns = [x for x in data_set_temp.data_columns if x in X_temp.data_columns]

In [None]:
data_set_temp = data_set_temp[data_set_temp.total_press>1]

If we rerun the code above to review the data set, we find that it is more balanced (16.5%
converted), although it is a bit small now (406 samples). The distributions, as shown by the description of
the data set and the histograms, do look better for our model now. We will use this data set to create the
X and y data sets that we will use in the model.

In [None]:
data_set_temp.describe().head()

In [None]:
data_set_temp.is_converted.value_counts().head()

In [None]:
(data_set_temp.is_converted.value_counts()/data_set_temp.is_converted.count()).head()

In [None]:
figure, axis = plt.subplots(len(columns), 2, figsize=(15,30))

for idx, name in enumerate(columns):
    data_set_temp[data_set_temp.is_converted==True][[name]].plot.hist(bins=20, title='Converted', ax=axis[idx][0])
    data_set_temp[data_set_temp.is_converted==False][[name]].plot.hist(bins=20, title='Not converted', ax=axis[idx][1])
plt.tight_layout()

In [None]:
X = data_set_temp[columns]
y = data_set_temp.is_converted

### Train and evaluate the model

As mentioned above, the model is based on logistic regression. Logistic regression seems sensible as it is
used for classification, but also has coefficients for the features that are relatively easy to interpret.
The feature importance model uses the AUC to assess the performance. This is because we are more interested
in the coefficients than in the actual predicted labels, and also because this metric can handle imbalanced
data sets. By default, the feature importance model trains a logistic regression model three times on the
entire data set, split in three folds. This way we can not only calculate the AUC on a test set after
training the model, but also see whether the coefficients of the model are relatively stable when trained
on different data. After fitting the model, the results (the average coefficients of the three models) as
well as the performance of the three models can be retrieved with `model` methods.
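As a reminder of what the AUC measures, here is a minimal plain-Python sketch. It is not the model hub's implementation; the `auc` helper and the example `labels` and `scores` are hypothetical. It computes the AUC as the fraction of (converted, not converted) pairs in which the converted user gets the higher predicted score, with ties counting as half. Because it only looks at the ranking of scores, it is not inflated by a large share of non-converted users.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [True, False, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1]

print(auc(labels, scores))
# A constant score carries no ranking information and gives the 0.5 baseline.
print(auc(labels, [0.5] * len(labels)))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch but not how one would compute the AUC on a large data set.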

In [None]:
model.fit(X, y, seed=.4)

In [None]:
model.results()

The mean of the coefficients is returned together with the standard deviation. The lower the standard
deviation, the more stable the coefficients are across the various runs. Our results show that 'about' has
the most negative impact on conversion, while 'tracking', 'blog' and 'taxonomy' have the most positive impact.
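How the mean and standard deviation summarize the folds can be sketched with the standard library. The coefficients below are made up, the feature names are placeholders, and using the sample standard deviation is an assumption; the model hub may compute it differently.

```python
from statistics import mean, stdev

# Hypothetical coefficients from three folds; not output of the model hub.
fold_coefficients = {
    "feature_a": [0.42, 0.38, 0.46],   # similar in every fold: stable
    "feature_b": [0.30, -0.25, 0.10],  # changes sign between folds: unstable
}

# Assumption: sample standard deviation (n - 1 in the denominator).
summary = {
    name: (mean(values), stdev(values))
    for name, values in fold_coefficients.items()
}

for name, (average, spread) in summary.items():
    print(f"{name}: mean={average:.2f}, std={spread:.2f}")
```

A coefficient such as the one for `feature_b`, whose standard deviation is large relative to its mean, should be interpreted with care, since its direction depends on which part of the data the model was trained on.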

In [None]:
model.auc()

The average AUC of our models is 0.69. This is better than a baseline model (0.5 AUC). However, it also
means that it is not a perfect model and therefore the chosen features don't fully explain the conversion.
Among other things, further models might be improved by a larger test set, other explanatory
variables (e.g. more detailed locations instead of only root locations) and more information on the users (e.g. user referrer as a proxy for user intent).

In [None]:
model.results(full=True)

This concludes the example of our feature importance model in the model hub.

For an overview of all currently available models, check out the models section of our docs.