#   Developing a model to predict new purchases by users

##  Data description and loading

The online store collects data on user actions: views, purchases, participation in mailings, and other interactions with the platform.
The goal of the project is to build a machine learning model that predicts whether a user will make a purchase within the next 90 days.

This will make it possible to:

- segment users by purchase probability,

- tune marketing mailings effectively,

- increase conversion and optimize the advertising budget.

Task type: binary classification  
Target variable: whether the user makes a purchase within 90 days (0 or 1)

**Work plan**

1. [Data loading and analysis](#загрузка-и-анализ-данных)   
   Importing the data, reviewing its structure, looking for missing values and anomalies

2. [Exploratory data analysis](#Предобработка-данных)  
   Encoding categorical variables, scaling, and cleaning

3. [Feature engineering](#инженерия-признаков)  
   Building behavioral, technical, and time-based features

4. [Model training](#Обучение-моделей)  
   Building and tuning models: logistic regression, trees, boosting

5. [Quality evaluation](#Оценка-качества)  
   Computing metrics: ROC-AUC, F1, feature importance analysis

6. [Final conclusions](#Финальные-выводы)  
   Interpreting the results and recommendations for applying the model

# Data loading and preprocessing
We import the libraries, load the dataset, and run an initial analysis:
- Data size
- Merging the tables
- Feature types
- Missing values
- Distributions

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [72]:
data_folder_path = "/home/pavel/Data/filtered_data/"
name_messages = "apparel-messages.csv"
name_purchases = "apparel-purchases.csv"
name_target = "apparel-target_binary.csv"

name_campaign = "full_campaign_daily_event.csv"
name_campaign_channel = "full_campaign_daily_event_channel.csv"

In [73]:
messages = pd.read_csv(data_folder_path + name_messages)
purchases = pd.read_csv(data_folder_path + name_purchases)
target = pd.read_csv(data_folder_path + name_target)

campaign = pd.read_csv(data_folder_path + name_campaign)
campaign_channel = pd.read_csv(data_folder_path + name_campaign_channel)
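
The string concatenation above only works because `data_folder_path` ends with a slash. A minimal sketch of a more forgiving variant builds the paths with `pathlib`; the helper name `build_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def build_paths(folder, file_names):
    """Map each table name to its full path inside `folder`."""
    folder = Path(folder)
    return {table: folder / file_name for table, file_name in file_names.items()}


paths = build_paths(
    "/home/pavel/Data/filtered_data",  # a trailing slash is no longer required
    {
        "messages": "apparel-messages.csv",
        "purchases": "apparel-purchases.csv",
        "target": "apparel-target_binary.csv",
    },
)
# Each value can be passed straight to pandas, e.g. pd.read_csv(paths["messages"])
print(paths["messages"])
```
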

In [74]:
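def check_data(data):
    """Show the first rows, the info() summary, and describe() statistics."""
    print("head")
    display(data.head())
    print("info")
    # info() prints its report and returns None, so display() also shows "None"
    display(data.info())
    print("describe")
    display(data.describe())
    print("")
    print("")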
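
`DataFrame.info()` prints its summary and returns `None`, which is why wrapping it in `display` adds a `None` line to each report below. A small sketch on a toy frame (values invented for illustration) shows the return value and how the summary can be captured as text through the `buf` parameter instead.

```python
import io

import pandas as pd

toy = pd.DataFrame({"client_id": [1, 2, 3], "target": [0, 1, 0]})  # illustrative data

buffer = io.StringIO()
result = toy.info(buf=buffer)     # writes the summary into the buffer, not stdout
summary_text = buffer.getvalue()  # the text that info() would otherwise print

print(result)  # info() has no return value
print(summary_text)
```
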

In [75]:
print("MESSAGES")
check_data(messages)

print("PURCHASES")
check_data(purchases)

print("TARGET")
check_data(target)

print("CAMPAIGN")
check_data(campaign)

print("CAMPAIGN_CHANNEL")
check_data(campaign_channel)

MESSAGES
head


Unnamed: 0,bulk_campaign_id,client_id,message_id,event,channel,date,created_at
0,4439,1515915625626736623,1515915625626736623-4439-6283415ac07ea,open,email,2022-05-19,2022-05-19 00:14:20
1,4439,1515915625490086521,1515915625490086521-4439-62834150016dd,open,email,2022-05-19,2022-05-19 00:39:34
2,4439,1515915625553578558,1515915625553578558-4439-6283415b36b4f,open,email,2022-05-19,2022-05-19 00:51:49
3,4439,1515915625553578558,1515915625553578558-4439-6283415b36b4f,click,email,2022-05-19,2022-05-19 00:52:20
4,4439,1515915625471518311,1515915625471518311-4439-628341570c133,open,email,2022-05-19,2022-05-19 00:56:52


info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12739798 entries, 0 to 12739797
Data columns (total 7 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   bulk_campaign_id  int64 
 1   client_id         int64 
 2   message_id        object
 3   event             object
 4   channel           object
 5   date              object
 6   created_at        object
dtypes: int64(2), object(5)
memory usage: 680.4+ MB


None

describe


Unnamed: 0,bulk_campaign_id,client_id
count,12739800.0,12739800.0
mean,11604.59,1.515916e+18
std,3259.211,326551800.0
min,548.0,1.515916e+18
25%,8746.0,1.515916e+18
50%,13516.0,1.515916e+18
75%,14158.0,1.515916e+18
max,14657.0,1.515916e+18




PURCHASES
head


Unnamed: 0,client_id,quantity,price,category_ids,date,message_id
0,1515915625468169594,1,1999.0,"['4', '28', '57', '431']",2022-05-16,1515915625468169594-4301-627b661e9736d
1,1515915625468169594,1,2499.0,"['4', '28', '57', '431']",2022-05-16,1515915625468169594-4301-627b661e9736d
2,1515915625471138230,1,6499.0,"['4', '28', '57', '431']",2022-05-16,1515915625471138230-4437-6282242f27843
3,1515915625471138230,1,4999.0,"['4', '28', '244', '432']",2022-05-16,1515915625471138230-4437-6282242f27843
4,1515915625471138230,1,4999.0,"['4', '28', '49', '413']",2022-05-16,1515915625471138230-4437-6282242f27843


info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202208 entries, 0 to 202207
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   client_id     202208 non-null  int64  
 1   quantity      202208 non-null  int64  
 2   price         202208 non-null  float64
 3   category_ids  202208 non-null  object 
 4   date          202208 non-null  object 
 5   message_id    202208 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 9.3+ MB


None

describe


Unnamed: 0,client_id,quantity,price
count,202208.0,202208.0,202208.0
mean,1.515916e+18,1.006483,1193.301516
std,145951400.0,0.184384,1342.252664
min,1.515916e+18,1.0,1.0
25%,1.515916e+18,1.0,352.0
50%,1.515916e+18,1.0,987.0
75%,1.515916e+18,1.0,1699.0
max,1.515916e+18,30.0,85499.0




TARGET
head


Unnamed: 0,client_id,target
0,1515915625468060902,0
1,1515915625468061003,1
2,1515915625468061099,0
3,1515915625468061100,0
4,1515915625468061170,0


info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49849 entries, 0 to 49848
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   client_id  49849 non-null  int64
 1   target     49849 non-null  int64
dtypes: int64(2)
memory usage: 779.0 KB


None

describe


Unnamed: 0,client_id,target
count,49849.0,49849.0
mean,1.515916e+18,0.019278
std,148794700.0,0.137503
min,1.515916e+18,0.0
25%,1.515916e+18,0.0
50%,1.515916e+18,0.0
75%,1.515916e+18,0.0
max,1.515916e+18,1.0




CAMPAIGN
head


Unnamed: 0,date,bulk_campaign_id,count_click,count_complain,count_hard_bounce,count_open,count_purchase,count_send,count_soft_bounce,count_subscribe,...,nunique_open,nunique_purchase,nunique_send,nunique_soft_bounce,nunique_subscribe,nunique_unsubscribe,count_hbq_spam,nunique_hbq_spam,count_close,nunique_close
0,2022-05-19,563,0,0,0,4,0,0,0,0,...,4,0,0,0,0,0,0,0,0,0
1,2022-05-19,577,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,2022-05-19,622,0,0,0,2,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
3,2022-05-19,634,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,2022-05-19,676,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131072 entries, 0 to 131071
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   date                 131072 non-null  object
 1   bulk_campaign_id     131072 non-null  int64 
 2   count_click          131072 non-null  int64 
 3   count_complain       131072 non-null  int64 
 4   count_hard_bounce    131072 non-null  int64 
 5   count_open           131072 non-null  int64 
 6   count_purchase       131072 non-null  int64 
 7   count_send           131072 non-null  int64 
 8   count_soft_bounce    131072 non-null  int64 
 9   count_subscribe      131072 non-null  int64 
 10  count_unsubscribe    131072 non-null  int64 
 11  nunique_click        131072 non-null  int64 
 12  nunique_complain     131072 non-null  int64 
 13  nunique_hard_bounce  131072 non-null  int64 
 14  nunique_open         131072 non-null  int64 
 15  nunique_purchase     131072 n

None

describe


Unnamed: 0,bulk_campaign_id,count_click,count_complain,count_hard_bounce,count_open,count_purchase,count_send,count_soft_bounce,count_subscribe,count_unsubscribe,...,nunique_open,nunique_purchase,nunique_send,nunique_soft_bounce,nunique_subscribe,nunique_unsubscribe,count_hbq_spam,nunique_hbq_spam,count_close,nunique_close
count,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,...,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0
mean,8416.743378,90.982971,0.932655,78.473434,3771.091,0.577927,11634.14,27.807312,0.140518,6.362679,...,3683.0,0.465103,11537.16,27.573799,0.134125,5.960602,0.810364,0.809799,8e-06,8e-06
std,4877.369306,1275.503564,30.198326,1961.317826,65160.67,9.10704,175709.5,736.944714,2.072777,79.172069,...,62586.47,7.126368,172700.5,734.0507,1.976439,73.284148,183.298579,183.298245,0.002762,0.002762
min,548.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4116.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7477.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,13732.0,2.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,1.0,...,30.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
max,15150.0,128453.0,5160.0,287404.0,5076151.0,1077.0,11543510.0,76284.0,189.0,9089.0,...,2922440.0,779.0,7094600.0,76281.0,177.0,8299.0,63920.0,63920.0,1.0,1.0




CAMPAIGN_CHANNEL
head


Unnamed: 0,date,bulk_campaign_id,count_click_email,count_click_mobile_push,count_open_email,count_open_mobile_push,count_purchase_email,count_purchase_mobile_push,count_soft_bounce_email,count_subscribe_email,...,count_send_email,nunique_hard_bounce_email,nunique_hbq_spam_email,nunique_send_email,count_soft_bounce_mobile_push,nunique_soft_bounce_mobile_push,count_complain_email,nunique_complain_email,count_close_mobile_push,nunique_close_mobile_push
0,2022-05-19,563,0,0,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2022-05-19,577,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2022-05-19,622,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2022-05-19,634,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2022-05-19,676,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131072 entries, 0 to 131071
Data columns (total 36 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   date                             131072 non-null  object
 1   bulk_campaign_id                 131072 non-null  int64 
 2   count_click_email                131072 non-null  int64 
 3   count_click_mobile_push          131072 non-null  int64 
 4   count_open_email                 131072 non-null  int64 
 5   count_open_mobile_push           131072 non-null  int64 
 6   count_purchase_email             131072 non-null  int64 
 7   count_purchase_mobile_push       131072 non-null  int64 
 8   count_soft_bounce_email          131072 non-null  int64 
 9   count_subscribe_email            131072 non-null  int64 
 10  count_unsubscribe_email          131072 non-null  int64 
 11  nunique_click_email              131072 non-null  int64 
 12  nunique_cli

None

describe


Unnamed: 0,bulk_campaign_id,count_click_email,count_click_mobile_push,count_open_email,count_open_mobile_push,count_purchase_email,count_purchase_mobile_push,count_soft_bounce_email,count_subscribe_email,count_unsubscribe_email,...,count_send_email,nunique_hard_bounce_email,nunique_hbq_spam_email,nunique_send_email,count_soft_bounce_mobile_push,nunique_soft_bounce_mobile_push,count_complain_email,nunique_complain_email,count_close_mobile_push,nunique_close_mobile_push
count,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,...,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0,131072.0
mean,8416.743378,41.582169,49.400803,423.706,3347.385,0.357483,0.220444,24.474823,0.140518,6.362679,...,4189.581,18.535683,0.809799,4186.898,3.332489,3.311653,0.932655,0.921326,8e-06,8e-06
std,4877.369306,745.484035,1036.952898,9753.384,64448.59,8.287483,3.7965,727.069387,2.072777,79.172069,...,107319.8,1349.473695,183.298245,107261.8,120.916269,120.094858,30.198326,29.71517,0.002762,0.002762
min,548.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4116.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7477.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,13732.0,1.0,0.0,23.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,15150.0,59365.0,128453.0,2597015.0,5076151.0,1077.0,431.0,76284.0,189.0,9089.0,...,7094600.0,287341.0,63920.0,7094600.0,21831.0,21389.0,5160.0,5043.0,1.0,1.0






The data needs to be prepared. The messages table is far larger than the other sets, so it has to be checked for duplicates, including implicit ones. After that the tables are merged (messages + purchases + target).

In [76]:
messages.duplicated().sum()


48610

Exact duplicates are dropped right away.

In [77]:
messages = messages.drop_duplicates()

Let's see which values can repeat.

In [78]:
messages.nunique()

bulk_campaign_id       2709
client_id             53329
message_id          9061667
event                    11
channel                   2
date                    638
created_at          4103539
dtype: int64

In [79]:
messages['created_at'] = pd.to_datetime(messages['created_at'])
messages['date'] = pd.to_datetime(messages['date'])
messages['date'] = messages['date'].dt.strftime('%Y-%m-%d')
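
Parsing a date column and formatting it back with `%Y-%m-%d` leaves it as text in one canonical form, so it can be used as a merge key against tables that also store dates as strings. The sketch below illustrates the same idea on invented values; unlike the cell above, it derives the day key from the timestamp column.

```python
import pandas as pd

toy = pd.DataFrame({"created_at": ["2022-05-19 00:14:20", "2022-05-20 23:59:59"]})

toy["created_at"] = pd.to_datetime(toy["created_at"])    # real timestamps
toy["date"] = toy["created_at"].dt.strftime("%Y-%m-%d")  # day-level string key

print(toy.dtypes)
print(toy["date"].tolist())
```
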

In [80]:
messages['created_at'].value_counts()

created_at
2023-12-29 15:20:53    608
2023-07-03 10:22:53    530
2023-12-29 14:51:33    475
2023-12-29 14:51:53    474
2023-12-29 15:20:13    468
                      ... 
2023-06-04 05:00:57      1
2023-06-04 05:03:38      1
2023-06-04 05:04:18      1
2023-06-04 05:06:38      1
2023-07-03 11:38:31      1
Name: count, Length: 4103539, dtype: int64

In [81]:
messages.duplicated(subset=['client_id', 'message_id', 'date', 'channel', 'event', 'bulk_campaign_id']).sum()

176631

In [82]:
messages = messages.drop_duplicates(subset=['client_id', 'message_id', 'date', 'channel', 'event', 'bulk_campaign_id'], keep='first')
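
The difference between exact and implicit duplicates can be seen on a toy frame (invented values, reduced column set): an exact duplicate matches in every column, while an implicit one matches on the key columns and differs only in a field such as `created_at`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "client_id": [1, 1, 1, 2],
        "event": ["open", "open", "open", "click"],
        "created_at": ["00:14:20", "00:14:20", "00:15:00", "00:52:20"],
    }
)

exact = toy.duplicated().sum()  # rows identical in every column
implicit = toy.duplicated(subset=["client_id", "event"]).sum()  # identical on key columns only

deduped = toy.drop_duplicates(subset=["client_id", "event"], keep="first")
print(exact, implicit, len(deduped))
```

Note that `keep='first'` retains whichever row comes first in the current order, so sorting by `created_at` beforehand decides which event survives.
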

Now let's look at purchases.

In [83]:
purchases['date'] = pd.to_datetime(purchases['date'])
purchases['date'] = purchases['date'].dt.strftime('%Y-%m-%d')

In [84]:
purchases.duplicated().sum()

73020

In [85]:
purchases = purchases.drop_duplicates()

Now we merge.

In [86]:
messages = purchases.merge(messages, on=['client_id', 'date'], how='left').dropna()
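
Two properties of this merge are worth keeping in mind. A left merge followed by `dropna()` keeps the same rows as an inner merge, as long as the inputs have no missing values of their own. And when a client has several purchases and several message events on the same day, every purchase is paired with every event, so the row count multiplies. A sketch on invented toy data:

```python
import pandas as pd

purchases_toy = pd.DataFrame(
    {"client_id": [1, 1, 2], "date": ["2022-05-16"] * 3, "price": [1999.0, 2499.0, 6499.0]}
)
messages_toy = pd.DataFrame(
    {"client_id": [1, 1], "date": ["2022-05-16"] * 2, "event": ["open", "click"]}
)

keys = ["client_id", "date"]
left_then_drop = purchases_toy.merge(messages_toy, on=keys, how="left").dropna()
inner = purchases_toy.merge(messages_toy, on=keys, how="inner")

# Client 1: 2 purchases x 2 events = 4 rows; client 2 has no events and is dropped
print(len(left_then_drop), len(inner))
```
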

message_id will not be used anywhere later in this work: the merges with purchases and target are on client_id, and the campaign tables can be merged on bulk_campaign_id.

In [87]:
messages = messages.drop(['message_id_y', 'message_id_x'], axis=1)

Now let's look at target.

In [88]:
target.duplicated().sum()

0
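
The `describe` output for `target` above reports a mean of about 0.019, so roughly 1.9% of clients belong to the positive class. A sketch of checking the class balance directly, on a toy frame with invented values:

```python
import pandas as pd

target_toy = pd.DataFrame({"client_id": range(10), "target": [0] * 9 + [1]})

class_share = target_toy["target"].value_counts(normalize=True)
positive_rate = target_toy["target"].mean()  # for a 0/1 column the mean is the positive share

print(class_share)
print(positive_rate)
```

With an imbalance this strong, accuracy alone says little, which is consistent with the plan's choice of ROC-AUC and F1.
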

In [89]:
df = target.merge(messages, on='client_id', how='left').dropna()

Let's check the data in campaign and campaign_channel.

In [90]:
campaign_columns = campaign.columns

In [91]:
campaign_columns = campaign_columns[2:]

In [92]:
campaign_columns

Index(['count_click', 'count_complain', 'count_hard_bounce', 'count_open',
       'count_purchase', 'count_send', 'count_soft_bounce', 'count_subscribe',
       'count_unsubscribe', 'nunique_click', 'nunique_complain',
       'nunique_hard_bounce', 'nunique_open', 'nunique_purchase',
       'nunique_send', 'nunique_soft_bounce', 'nunique_subscribe',
       'nunique_unsubscribe', 'count_hbq_spam', 'nunique_hbq_spam',
       'count_close', 'nunique_close'],
      dtype='object')

In [93]:
df['bulk_campaign_id'] =  df['bulk_campaign_id'].astype(int)

In [94]:
df = df.merge(campaign, on=['bulk_campaign_id', 'date'], how='left').dropna()
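
Each row of `df` is expected to pick up at most one campaign row, which requires the pair (`bulk_campaign_id`, `date`) to be unique on the campaign side. That is an assumption, not something the outputs above confirm. A hedged sketch of how the `validate` argument of `merge` can check it, using invented toy frames:

```python
import pandas as pd

events_toy = pd.DataFrame(
    {
        "bulk_campaign_id": [563, 563, 577],
        "date": ["2022-05-19"] * 3,
        "event": ["open", "click", "open"],
    }
)
campaign_toy = pd.DataFrame(
    {"bulk_campaign_id": [563, 577], "date": ["2022-05-19"] * 2, "count_open": [4, 1]}
)

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand keys repeat
merged = events_toy.merge(
    campaign_toy, on=["bulk_campaign_id", "date"], how="left", validate="many_to_one"
)
print(merged)
```
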

In [95]:
campaign_channel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131072 entries, 0 to 131071
Data columns (total 36 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   date                             131072 non-null  object
 1   bulk_campaign_id                 131072 non-null  int64 
 2   count_click_email                131072 non-null  int64 
 3   count_click_mobile_push          131072 non-null  int64 
 4   count_open_email                 131072 non-null  int64 
 5   count_open_mobile_push           131072 non-null  int64 
 6   count_purchase_email             131072 non-null  int64 
 7   count_purchase_mobile_push       131072 non-null  int64 
 8   count_soft_bounce_email          131072 non-null  int64 
 9   count_subscribe_email            131072 non-null  int64 
 10  count_unsubscribe_email          131072 non-null  int64 
 11  nunique_click_email              131072 non-null  int64 
 12  nunique_click_mo

In [96]:
campaign_channel_columns = campaign_channel.columns

In [97]:
campaign_channel_columns = campaign_channel_columns[2:]
campaign_channel_columns

Index(['count_click_email', 'count_click_mobile_push', 'count_open_email',
       'count_open_mobile_push', 'count_purchase_email',
       'count_purchase_mobile_push', 'count_soft_bounce_email',
       'count_subscribe_email', 'count_unsubscribe_email',
       'nunique_click_email', 'nunique_click_mobile_push',
       'nunique_open_email', 'nunique_open_mobile_push',
       'nunique_purchase_email', 'nunique_purchase_mobile_push',
       'nunique_soft_bounce_email', 'nunique_subscribe_email',
       'nunique_unsubscribe_email', 'count_hard_bounce_mobile_push',
       'count_send_mobile_push', 'nunique_hard_bounce_mobile_push',
       'nunique_send_mobile_push', 'count_hard_bounce_email',
       'count_hbq_spam_email', 'count_send_email', 'nunique_hard_bounce_email',
       'nunique_hbq_spam_email', 'nunique_send_email',
       'count_soft_bounce_mobile_push', 'nunique_soft_bounce_mobile_push',
       'count_complain_email', 'nunique_complain_email',
       'count_close_mobile_push', '

In [98]:
df = df.merge(campaign_channel, on=['bulk_campaign_id', 'date'], how='left').dropna()

In [99]:
df

Unnamed: 0,client_id,target,quantity,price,category_ids,date,bulk_campaign_id,event,channel,created_at,...,count_send_email,nunique_hard_bounce_email,nunique_hbq_spam_email,nunique_send_email,count_soft_bounce_mobile_push,nunique_soft_bounce_mobile_push,count_complain_email,nunique_complain_email,count_close_mobile_push,nunique_close_mobile_push
0,1515915625468060902,0,1.0,199.0,"['4', '27', '176', '458']",2022-05-27,4617,send,email,2022-05-27 05:49:50,...,1013705,46,3,1013705,0,0,0,0,0,0
1,1515915625468060902,0,1.0,199.0,"['4', '27', '176', '458']",2022-05-27,4617,open,email,2022-05-27 10:57:52,...,1013705,46,3,1013705,0,0,0,0,0,0
2,1515915625468060902,0,1.0,199.0,"['4', '27', '176', '458']",2022-05-27,4617,click,email,2022-05-27 10:59:04,...,1013705,46,3,1013705,0,0,0,0,0,0
3,1515915625468060902,0,1.0,199.0,"['4', '27', '176', '458']",2022-05-27,4617,purchase,email,2022-05-27 11:26:49,...,1013705,46,3,1013705,0,0,0,0,0,0
4,1515915625468060902,0,1.0,199.0,"['4', '27', '176', '458']",2022-05-27,4439,open,email,2022-05-27 11:43:00,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308282,1515915626010079153,0,1.0,2999.0,"['2', '18', '217', '663']",2024-02-13,14632,click,mobile_push,2024-02-13 06:33:33,...,0,0,0,0,0,0,0,0,0,0
308283,1515915626010079153,0,1.0,2999.0,"['2', '18', '217', '663']",2024-02-13,14632,purchase,mobile_push,2024-02-13 07:01:33,...,0,0,0,0,0,0,0,0,0,0
308284,1515915626010152263,0,1.0,419.0,"['2', '18', '267', '443']",2024-02-14,14649,send,mobile_push,2024-02-14 12:36:30,...,0,0,0,0,1902,1900,0,0,0,0
308285,1515915626010152263,0,1.0,419.0,"['2', '18', '267', '443']",2024-02-14,14649,click,mobile_push,2024-02-14 19:25:16,...,0,0,0,0,1902,1900,0,0,0,0


Next, we explore the resulting data.

# Exploratory data analysis

We now have a data frame with all the available data, including target. It contains more than 60 features, most of which are per-campaign daily counters (count_* and nunique_*) taken from the campaign tables. We need to check the feature distributions, get rid of redundant data and, if necessary, add new features.

In [100]:
df[campaign_columns] = df[campaign_columns].astype(int)
df[campaign_channel_columns] = df[campaign_channel_columns].astype(int)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308287 entries, 0 to 308286
Data columns (total 66 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   client_id                        308287 non-null  int64         
 1   target                           308287 non-null  int64         
 2   quantity                         308287 non-null  float64       
 3   price                            308287 non-null  float64       
 4   category_ids                     308287 non-null  object        
 5   date                             308287 non-null  object        
 6   bulk_campaign_id                 308287 non-null  int64         
 7   event                            308287 non-null  object        
 8   channel                          308287 non-null  object        
 9   created_at                       308287 non-null  datetime64[ns]
 10  count_click                      308287 non-
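
In the summary above `quantity` is `float64`, although it was `int64` in `purchases`. This is the usual effect of a left merge: unmatched rows receive `NaN`, which forces integer columns to float, and the dtype stays float after `dropna()`. That is presumably why the campaign counters are cast back with `astype(int)`. A toy reproduction with invented values:

```python
import pandas as pd

left = pd.DataFrame({"client_id": [1, 2, 3]})
right = pd.DataFrame({"client_id": [1, 2], "quantity": [1, 30]})  # integer column

merged = left.merge(right, on="client_id", how="left")  # client 3 has no match -> NaN
print(merged["quantity"].dtype)  # float64

cleaned = merged.dropna()
print(cleaned["quantity"].dtype)  # float64

cleaned = cleaned.astype({"quantity": int})  # safe here: no NaN is left
print(cleaned["quantity"].dtype)
```
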

In [105]:
print(df['channel'].value_counts())
print(df['category_ids'].value_counts())
print(df['event'].value_counts())

channel
email          207121
mobile_push    101166
Name: count, dtype: int64
category_ids
['4', '28', '57', '431']            15804
['4', '28', '244', '432']           11979
['4', '28', '260', '420']           10726
['4', '28', '275', '421']            8103
['2', '18', '258', '441']            7786
                                    ...  
['5562', '5633', '5567', '697']         1
['6060', '6057', '6074', '6239']        1
['5562', '5632', '5706', '1125']        1
['5562', '5597', '5696', '751']         1
['4', '28', '44', '528']                1
Name: count, Length: 931, dtype: int64
event
purchase       94301
click          76823
send           70242
open           66769
hard_bounce       62
soft_bounce       38
unsubscribe       29
complain           9
hbq_spam           7
subscribe          7
Name: count, dtype: int64
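
`category_ids` is stored as text that looks like a Python list, so every distinct string counts as its own category (931 of them above). One possible way to make it usable, sketched here as an assumption rather than as the notebook's actual approach, is to parse it with `ast.literal_eval` and derive simpler columns. The names `category_list`, `first_category` and `category_depth` are hypothetical, and whether the first id is the top level of a hierarchy is not confirmed by the data shown.

```python
import ast

import pandas as pd

toy = pd.DataFrame(
    {"category_ids": ["['4', '28', '57', '431']", "['2', '18', '258', '441']"]}
)

# Turn the text into real Python lists
toy["category_list"] = toy["category_ids"].apply(ast.literal_eval)

# First id in each list (None for an empty list) and the number of ids
toy["first_category"] = toy["category_list"].apply(lambda ids: ids[0] if ids else None)
toy["category_depth"] = toy["category_list"].apply(len)

print(toy[["first_category", "category_depth"]])
```
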


Of the remaining 

# Feature engineering
Creating new features.
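
Because the target is defined per `client_id` while `df` has one row per combination of purchase and message event, new features will usually have to be aggregated to one row per client. A minimal sketch under that assumption, with invented feature names (`n_rows`, `mean_price`, `n_clicks`, `last_date`) and toy values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "client_id": [1, 1, 1, 2],
        "price": [199.0, 199.0, 2999.0, 419.0],
        "event": ["open", "click", "click", "send"],
        "date": ["2022-05-27", "2022-05-27", "2023-01-10", "2024-02-14"],
    }
)

features = (
    toy.assign(is_click=toy["event"].eq("click"))
    .groupby("client_id")
    .agg(
        n_rows=("event", "size"),
        mean_price=("price", "mean"),
        n_clicks=("is_click", "sum"),
        last_date=("date", "max"),  # ISO date strings sort chronologically
    )
    .reset_index()
)
print(features)
```

A mean is used for `price` rather than a sum because the merged frame repeats each purchase once per message event, so sums over it would be inflated.
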