# Sarus Demo - Build and activate marketing segmentation on sensitive households data with privacy guarantees

In this example, we use a public retail dataset, available on our GitHub: https://github.com/sarus-tech/demo-notebooks

The objective is to build a market segmentation for the digital marketing team to activate. The data is sensitive because it contains confidential information about households. To segment the household data and push the result to the activation tool without ever seeing the data directly, we work with Sarus.

Read more in this blog post: https://www.sarus.tech/post/marketing-segmentation-strategy-without-data-access.

#0. Installing Sarus and importing modules

In [None]:
%%capture
!pip install "sarus[sklearn]==0.6.5"

In [None]:
%%capture
import sarus

# Prefix the usual import lines with "sarus." (see supported libraries in the Sarus documentation)
import sarus.pandas as pd
from sarus.sklearn.cluster import KMeans

In [None]:
from sarus import Client
client = Client(url='https://admin.sarus.tech/gateway', email='analyst@example.com')

Password: ··········


#1. Selecting protected households data and extracting the table of interest

In [None]:
remote_dataset = client.dataset(slugname='retail_data')

In [None]:
remote_dataset.tables()

[['retail_data', 'private', 'demographics_demo'],
 ['retail_data', 'private', 'transactions_sample'],
 ['retail_data', 'private', 'products_demo']]

In [None]:
### Checking the structure of the table: the output falls back on synthetic data because viewing rows of the real data is forbidden
remote_dataset.table(["demographics_demo"]).as_pandas().head(5)

Evaluated from synthetic data only


Unnamed: 0,household_id,age,income,home_ownership,marital_status,household_size,household_comp,kids_count
0,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0
1,2,35-44,35-49K,Homeowner,Married,4,2 Adults Kids,2
2,3,55-64,25-34K,,Unmarried,2,1 Adult Kids,1
3,4,45-54,50-74K,,Married,2,1 Adult Kids,3+
4,5,25-34,50-74K,Homeowner,Unmarried,1,1 Adult No Kids,0


In [None]:
### Extracting the relevant part of the dataset with a SQL query

query = """ 
SELECT *
FROM retail_data.private.demographics_demo d
    JOIN retail_data.private.transactions_sample t
        USING (household_id)
    JOIN retail_data.private.products_demo p
        USING (product_id)
"""

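To make the shape of the joined result concrete, here is a minimal local sketch of the same `JOIN ... USING` pattern. It runs on SQLite from the Python standard library rather than on Sarus; the table contents are made up, the schema prefix `retail_data.private` is dropped, and only a few columns are kept. Column order in the output may differ from what the Sarus query returns.

```python
import sqlite3

# Toy stand-ins for the three tables; all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographics_demo (household_id INTEGER, age TEXT);
CREATE TABLE transactions_sample (household_id INTEGER, product_id INTEGER, quantity INTEGER);
CREATE TABLE products_demo (product_id INTEGER, department TEXT);

INSERT INTO demographics_demo VALUES (1, '65+'), (2, '35-44'), (3, '55-64');
INSERT INTO transactions_sample VALUES (1, 10, 2), (1, 11, 1), (2, 10, 3);
INSERT INTO products_demo VALUES (10, 'DELI'), (11, 'COSMETICS');
""")

# Same join structure as the notebook query, on the toy tables.
cursor = conn.execute("""
SELECT *
FROM demographics_demo d
    JOIN transactions_sample t USING (household_id)
    JOIN products_demo p USING (product_id)
""")
columns = [col[0] for col in cursor.description]
rows = sorted(cursor.fetchall())
conn.close()

print(columns)
for row in rows:
    print(row)
```

The result has one row per transaction, with the household's demographics repeated on each of its rows. Household 3 has no transactions in the toy data, so the inner join drops it, and each `USING` key appears only once in the output.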
#2. Exploring the data

In [None]:
df = remote_dataset.sql(query).as_pandas()
print(df.shape)
df.head(5)

Evaluated from synthetic data only
(406868, 24)
Evaluated from synthetic data only


Unnamed: 0,product_id,household_id,age,income,home_ownership,marital_status,household_size,household_comp,kids_count,store_id,...,coupon_disc,coupon_match_disc,week,transaction_timestamp,manufacturer_id,department,brand,product_category,product_type,package_size
0,2,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0,317,...,0.0,0.0,5,2017:46:07-07-05 21,5565,DELI,National,SOFDIOGT,SPIZ GOOAGAT,
1,2,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0,31824,...,0.0,0.0,4,2017-04-26:29:05 17,5565,DELI,National,SOFDIOGT,SPIZ GOOAGAT,
2,2,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0,453,...,0.0,0.0,3,2017-08 23 17-18:01,5565,DELI,National,SOFDIOGT,SPIZ GOOAGAT,
3,4,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0,368,...,0.0,0.0,17,2017-01:44:16:46:42,978,GROCERY,National,REAS,SO,
4,5,1,65+,35-49K,,Unmarried,1,1 Adult No Kids,0,400,...,0.0,0.0,40,201-11-15:49:06-12,260,NUTRITION,National,DREEFREED BRS/REERSND,CAIRYDEED FOTOS ENSETS-C00%/,1.5 TE


In [None]:
# Checking the number of households 
df.household_id.nunique()

Evaluated from synthetic data only


800

In [None]:
# Checking the missing values
df.isna().sum()

Evaluated from synthetic data only


product_id               0
household_id             0
age                      0
income                   0
home_ownership           0
marital_status           0
household_size           0
household_comp           0
kids_count               0
store_id                 0
basket_id                0
quantity                 0
sales_value              0
retail_disc              0
coupon_disc              0
coupon_match_disc        0
week                     0
transaction_timestamp    0
manufacturer_id          0
department               0
brand                    0
product_category         0
product_type             0
package_size             0
dtype: int64

#3. Preprocessing the data and training a clustering ML model

In [None]:
df_dem = df[['household_id', 'home_ownership', 'age', 'income', 'marital_status', 'household_size', 'household_comp', 'kids_count']]
df_dem = df_dem.drop_duplicates()

In [None]:
### Encoding categorical variables

cat = pd.get_dummies(df_dem.select_dtypes(["object"]), drop_first=True)
cat = pd.concat([df_dem['household_id'], cat], axis=1)
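As an illustration of what the encoding step produces, here is a small sketch in plain pandas (not `sarus.pandas`) on made-up rows. It passes the columns to encode explicitly instead of using `select_dtypes`, which is a simplification for this example only.

```python
import pandas as pd

# Made-up demographic rows for illustration only.
toy = pd.DataFrame({
    "household_id": [1, 2, 3],
    "marital_status": ["Unmarried", "Married", "Unmarried"],
    "home_ownership": ["Renter", "Homeowner", "Homeowner"],
})

# One indicator column per category level, minus the first level of each
# variable (drop_first=True), which becomes the implicit baseline.
encoded = pd.get_dummies(
    toy, columns=["marital_status", "home_ownership"], drop_first=True
)
print(encoded.columns.tolist())
```

With `drop_first=True`, the alphabetically first level of each variable ("Married", "Homeowner") gets no column of its own: a household with all indicators at zero belongs to those baseline levels.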

In [None]:
### Adding the cosmetics consumption column
cosmetics_consumption = df.loc[df['department'] == 'COSMETICS'].groupby('household_id').\
  agg({'income' : 'count'}).rename(columns = {'income' : 'count_cosmetics_consumption'})
df_full = pd.merge(cat, cosmetics_consumption, how='left', on=['household_id']).fillna(0)
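The left join followed by `fillna(0)` is what gives households without any cosmetics purchase a count of zero instead of dropping them. A minimal sketch in plain pandas on made-up rows follows; it counts rows with `groupby(...).size()` rather than counting the `income` column, which gives the same result as long as `income` has no missing values.

```python
import pandas as pd

# Made-up rows: one line per purchased item, as in the joined frame.
purchases = pd.DataFrame({
    "household_id": [1, 1, 2, 3, 3, 3],
    "department": ["COSMETICS", "DELI", "GROCERY", "COSMETICS", "COSMETICS", "DELI"],
})
households = pd.DataFrame({"household_id": [1, 2, 3]})

# Count cosmetics rows per household; households without any are absent here.
cosmetics = (
    purchases.loc[purchases["department"] == "COSMETICS"]
    .groupby("household_id")
    .size()
    .rename("count_cosmetics_consumption")
    .reset_index()
)

# The left join keeps every household; fillna(0) turns "no match" into a count of 0.
features = households.merge(cosmetics, how="left", on="household_id").fillna(0)
print(features)
```

Household 2 never bought cosmetics in the toy data, so it is missing from `cosmetics` and ends up with a count of 0 after the merge.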

In [None]:
### Fitting a sklearn clustering model
model = KMeans(n_clusters=2)
fitted_model = model.fit(df_full)
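Note that `df_full` as built above still contains `household_id`, so the fit treats the identifier as a numeric feature; the notebook does not say whether that is intended. The sketch below shows one possible variant, in plain scikit-learn on made-up rows, that keeps the id aside and clusters on the feature columns only. The `n_init` and `random_state` values are choices made for this example, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up feature table for illustration only.
toy = pd.DataFrame({
    "household_id": [101, 102, 103, 104, 105, 106],
    "marital_status_Unmarried": [1, 1, 0, 0, 1, 0],
    "count_cosmetics_consumption": [0, 1, 0, 12, 14, 11],
})

# Fit on the feature columns only; the id is kept aside for reporting.
features = toy.drop(columns=["household_id"])
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["group"] = model.fit_predict(features)
print(toy[["household_id", "group"]])
```

On this toy data the two groups separate low and high cosmetics buyers. Which group receives label 0 and which receives label 1 is arbitrary.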

In [None]:
### Checking the model. "Whitelisted" means the data owner has exceptionally granted me the right to fit the model directly on the real remote data, with Differential Privacy
fitted_model

Whitelisted


#4. Pushing the resulting ids to an endpoint for activation in a marketing tool



In [None]:
labels = fitted_model.predict(df_full)
new_df = pd.concat([df_dem.reset_index()['household_id'], pd.DataFrame(labels, columns=['group'])], axis=1)
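The `reset_index()` call matters here: after `drop_duplicates()` the index of `df_dem` has gaps, while the labels frame is indexed from 0, and `pd.concat(..., axis=1)` aligns on index labels rather than on position. A minimal sketch in plain pandas on made-up values shows the difference (it uses `reset_index(drop=True)`, a small variation on the notebook's call):

```python
import pandas as pd

# Made-up frame with repeated ids, mimicking one line per transaction.
df = pd.DataFrame({"household_id": [1, 1, 2, 2, 2, 3]})
dedup = df.drop_duplicates()  # index is now [0, 2, 5]
labels = pd.DataFrame([0, 1, 1], columns=["group"])  # index is [0, 1, 2]

# Without resetting, concat aligns on index labels and produces NaNs.
misaligned = pd.concat([dedup["household_id"], labels], axis=1)

# Resetting the index first pairs rows by position, as intended.
aligned = pd.concat(
    [dedup.reset_index(drop=True)["household_id"], labels], axis=1
)
print(misaligned)
print(aligned)
```

This pairing by position is only valid if the rows passed to `predict` are in the same order as the deduplicated households.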

In [None]:
new_df.head()

Evaluated from synthetic data only


Unnamed: 0,household_id,group
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1


In [None]:
### Creating the two lists of ids

list_ids_1 = new_df.loc[new_df.group == 0]['household_id']
list_ids_2 = new_df.loc[new_df.group == 1]['household_id']

In [None]:
list_ids_1.shape

In [None]:
### Pushing the first segment of ids to the endpoint for activation in the marketing campaign solution

sarus.push(list_ids_1, endpoint="https://my_marketing_solution/activate", name='cosmetics_audience_1') # NB: BETA VERSION. NB2: the server is implemented outside of Sarus

In [None]:
### Pushing the second segment of ids to the endpoint for activation in the marketing campaign solution

sarus.push(list_ids_2, endpoint="https://my_marketing_solution/activate", name='cosmetics_audience_2') # NB: BETA VERSION. NB2: the server is implemented outside of Sarus

#5. Conclusion

We built a market segmentation using the usual Python libraries, without ever seeing the real household data, and pushed the resulting segments to a third-party tool for the digital marketing team to use! The data stayed protected throughout, and we were still able to extract value from it.

Want to schedule a test and see Sarus in action on your data? [Get in touch!](https://www.sarus.tech/contact) 