This example shows how a data set prepared in Bach can be used for machine learning with sklearn. We build a
small per-user feature set, export it to pandas, cluster the users with k-means and add the resulting clusters
back to the Objectiv data. Describing data, finding outliers, and transforming, grouping and aggregating data
are covered in detail in our separate feature engineering example.

In [None]:
from modelhub import ModelHub
from sklearn import cluster

First we instantiate the model hub and get the Objectiv DataFrame object.

In [None]:
# instantiate the model hub
modelhub = ModelHub(time_aggregation='YYYY-MM-DD')
# get the Bach DataFrame with Objectiv data
df = modelhub.get_objectiv_dataframe(start_date='2022-01-04')

We create a data set that contains, per user, the number of clicks on each root location. For the ins and outs of feature engineering, see our feature engineering example.

In [None]:
df['root'] = df.location_stack.ls.get_from_context_with_type_series(type='RootLocationContext', key='id').fillna('empty')

In [None]:
features = df[(df.event_type=='PressEvent')].groupby('user_id').root.value_counts()

In [None]:
features.head()

In [None]:
features_unstacked = features.unstack(fill_value=0).drop(columns=['empty'])
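# use the full feature set ...
kmeans_frame = features_unstacked
# ... or overwrite it with a 50% sample; comment out the next line to keep the full data set
kmeans_frame = features_unstacked.get_sample(table_name='kmeans_test', sample_percentage=50, overwrite=True, seed=2224)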
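
To make the shape of this feature table concrete, here is a minimal plain-Python sketch of the same steps on a few made-up events: count the presses per user and root location, pivot to one row per user with zeros for missing combinations, and drop the 'empty' column. The event tuples and the helper name `build_feature_table` are illustrative assumptions and not part of Bach or the model hub; in the notebook this work is done by Bach.

```python
from collections import Counter


def build_feature_table(events, drop=("empty",)):
    """Return {user_id: {root: press_count}} with zeros filled in.

    `events` is an iterable of (user_id, root) pairs, one per press.
    """
    events = list(events)
    # Equivalent of groupby('user_id').root.value_counts()
    counts = Counter(events)
    # Equivalent of unstack(fill_value=0).drop(columns=[...])
    roots = sorted({root for _, root in events} - set(drop))
    users = sorted({user for user, _ in events})
    return {user: {root: counts[(user, root)] for root in roots} for user in users}


# Hypothetical press events: (user_id, root location)
events = [
    ("u1", "home"), ("u1", "docs"), ("u1", "docs"),
    ("u2", "home"), ("u2", "empty"),
    ("u3", "blog"),
]

table = build_feature_table(events)
for user, row in table.items():
    print(user, row)
```

Each user ends up with one row and one column per root location, which is the layout a clustering algorithm expects.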

Now we have a basic feature set that is small enough to fit in memory. It can be used with sklearn, as we
demonstrate in the rest of this example.

In [None]:
# export to pandas now
pdf = kmeans_frame.to_pandas()

In [None]:
pdf

In [None]:
# do basic kmeans
est = cluster.KMeans(n_clusters=3)
est.fit(pdf)
pdf['cluster'] = est.labels_
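
For readers who want to see what the clustering step does to the feature rows, the following is a bare-bones sketch of Lloyd's algorithm, the textbook k-means iteration, in plain Python. It is not how sklearn implements `KMeans` (sklearn uses a smarter initialisation and optimised code), and the function name `simple_kmeans` and the toy rows are made up for illustration.

```python
import random


def simple_kmeans(rows, k, iterations=20, seed=0):
    """Assign each row to one of k clusters; returns a list of labels."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random as initial centers.
    centers = [list(row) for row in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical feature rows: press counts per root location for four users.
rows = [[0, 1], [1, 1], [10, 1], [11, 1]]
labels = simple_kmeans(rows, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful. That holds for `est.labels_` as well.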

You can now use the created clusters on your entire data set by adding them back to your Bach DataFrame.
This is simple, as Bach and pandas cooperate nicely. After the next cell, the feature DataFrame has a
'cluster' column.

In [None]:
kmeans_frame['cluster'] = pdf['cluster']

You can use this column just like any other. For example, after merging it into the original Objectiv data,
you can group models from the model hub by cluster:

In [None]:
kmeans_frame.sort_values(['cluster','docs']).head(100)

In [None]:
df_with_cluster = df.merge(kmeans_frame[['cluster']], on='user_id')

In [None]:
df_with_cluster.head()
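
The effect of the merge can be sketched in plain Python. This assumes the merge behaves like an inner join on `user_id` (the pandas default; worth verifying for Bach), so events from users without a cluster label, for example users outside the 50% sample, drop out. The helper name `attach_clusters` and the event records are hypothetical.

```python
from collections import Counter


def attach_clusters(events, cluster_by_user):
    """Inner join: keep only events whose user has a cluster label."""
    return [
        {**event, "cluster": cluster_by_user[event["user_id"]]}
        for event in events
        if event["user_id"] in cluster_by_user
    ]


# Hypothetical event rows and cluster labels.
events = [
    {"user_id": "u1", "event_type": "PressEvent"},
    {"user_id": "u1", "event_type": "VisibleEvent"},
    {"user_id": "u2", "event_type": "PressEvent"},
    {"user_id": "u3", "event_type": "PressEvent"},  # u3 has no cluster label
]
cluster_by_user = {"u1": 0, "u2": 1}

joined = attach_clusters(events, cluster_by_user)
# Once every event row carries a cluster, grouping by it is straightforward.
events_per_cluster = Counter(row["cluster"] for row in joined)
print(events_per_cluster)
```

Note that every event of a labelled user receives the label, not only the press events that were used to build the features.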

In [None]:
modelhub.aggregate.session_duration(df_with_cluster, groupby='cluster').head()