# Final Project: Churn Prediction for a Level-Based Casual Mobile Game

## 1. Case Overview

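Mobile games play a dominant role in everyday entertainment and have become an effective way for people to relax. In recent years, mobile games of all kinds, especially casual level-based games, have gained a very broad market by making use of fragmented time. In such games, however, new-user churn is a serious problem: a considerable share of new users give up after a brief trial. If users with a high churn probability can be targeted with interventions (for example item rewards or friendly text messages) before they uninstall the game completely, they may be won back, raising the game's activity and the company's potential revenue. Churn prediction is therefore an important and challenging problem. In this final project we start from unstructured log data from a real game, build a churn prediction model, and draw on what we have learned to design an algorithm suited to a practical problem.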

## 2. Assignment

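* Using the provided data (user play history, level features, and so on), predict whether each user in the test set churned (binary classification);
* Any method may be used. Submissions are evaluated on Baidu Cloud, and the evaluation metric is AUC;
* Submit code and an experiment report. The report should present observations on the data, the analysis, the final solution, and a comparison of the different approaches tried;
* The final grade takes into account both the performance achieved and the analysis of the methods tried.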
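
Because the metric is AUC, only the ranking of the predicted churn probabilities matters, not their absolute values. The sketch below computes AUC directly from its pairwise definition on made-up labels and scores; `pairwise_auc` is an illustrative helper, not part of the assignment's evaluation code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example: label 1 = churned, scores are predicted churn probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ordered correctly
```

The double loop is quadratic and only meant to make the definition concrete; for real evaluation a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice.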

## 3. Data Overview

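The data come from a casual level-based mobile game. Users clear levels one after another; the basic task of each level is to reach some goal within a limited number of moves. Each attempt may succeed or fail. Normally a user moves on to the next level only after completing the current one, and may use items, hints, or other help during an attempt.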

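For most mobile games churn tends to happen early, so retention in the following week is a key concern for the company. The data cover all users who registered on 2020-02-01 and their interactions from 02-01 through 02-04, filtered so that every user logged in on at least two of those first four days. Churn is defined by logins in the following week (02-07 through 02-13): a user with no login in that period is labeled as churned.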
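
To make the label definition concrete, here is a minimal sketch of how such a label could be derived from a list of login dates. The provided files only cover 02-01 through 02-04, so the login histories and the helper `is_churned` below are hypothetical; the real labels come from `train.csv` and `dev.csv`.

```python
from datetime import date

# Label window taken from the description above (02-07 through 02-13, inclusive).
LABEL_START = date(2020, 2, 7)
LABEL_END = date(2020, 2, 13)


def is_churned(login_dates):
    """Return 1 if the user has no login inside the label window, else 0."""
    returned = any(LABEL_START <= d <= LABEL_END for d in login_dates)
    return 0 if returned else 1


# Hypothetical login histories (not part of the provided files).
user_a = [date(2020, 2, 1), date(2020, 2, 3), date(2020, 2, 9)]
user_b = [date(2020, 2, 1), date(2020, 2, 2)]
labels = {"user_a": is_churned(user_a), "user_b": is_churned(user_b)}
```

Here `user_a` logs in on 02-09 and is labeled 0 (retained), while `user_b` never returns in the label window and is labeled 1 (churned).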

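Unlike the structured data used previously, this dataset is presented as rawer records, closer to the form of a company's actual logs. It consists of 5 files: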

### train.csv

Training-set users: the user id (ids start from 1) and the label indicating whether the user churned (1: churned, 0: retained).

In [1]:
import pandas as pd
import time
import numpy as np

train_df = pd.read_csv('./data/train.csv', sep='\t')
train_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8148,8149,8150,8151,8152,8153,8154,8155,8156,8157
user_id,2774,2775,2776,2777,2778,2779,2780,2781,2782,2783,...,10922,10923,10924,10925,10926,10927,10928,10929,10930,10931
label,0,0,1,0,1,1,0,0,0,1,...,0,0,0,1,1,1,1,0,1,0


In [2]:
train_df['label'].value_counts()

0    5428
1    2730
Name: label, dtype: int64

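The training set contains 8158 users, of whom about one third churned. Note that, for anonymization, the data were subjected to non-uniform sampling, so the churn rate does not reflect the real game. User and level ids were also renumbered, which has no effect on the churn prediction task.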

### dev.csv

The validation set has the same format as the training set. It is mainly intended for offline testing and model selection.

In [3]:
dev_df = pd.read_csv('./data/dev.csv', sep='\t')
dev_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2648,2649,2650,2651,2652,2653,2654,2655,2656,2657
user_id,10932,10933,10934,10935,10936,10937,10938,10939,10940,10941,...,13580,13581,13582,13583,13584,13585,13586,13587,13588,13589
label,0,1,0,1,0,0,0,0,0,1,...,0,1,1,0,1,0,0,0,1,0


### test.csv
The test set contains only user ids; the task is to predict the churn probability of these users.

In [4]:
test_df = pd.read_csv('./data/test.csv', sep='\t')
test_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2763,2764,2765,2766,2767,2768,2769,2770,2771,2772
user_id,1,2,3,4,5,6,7,8,9,10,...,2764,2765,2766,2767,2768,2769,2770,2771,2772,2773


### level_seq.csv
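This is the core data file. It records every level each user played; each row is one attempt at one level. The columns are:

* `user_id`: user id, which matches the ids in the training, validation, and test sets;
* `level_id`: level id;
* `f_success`: whether the attempt cleared the level (1: cleared, 0: failed);
* `f_duration`: time spent on the attempt (in seconds);
* `f_reststep`: ratio of remaining moves to the move limit (0 for a failed attempt);
* `f_help`: whether extra help such as items or hints was used (1: used, 0: not used);
* `time`: timestamp.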
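
As a small illustration of how these columns turn into per-user features, the sketch below builds a toy table in the same layout and derives two features. The values are made up, and `active_days` and `pass_rate` are illustrative names rather than features used later in this notebook.

```python
import pandas as pd

# Toy records in the same column layout as level_seq.csv (values are made up).
toy_seq = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "level_id": [1, 2, 2, 1, 2],
    "f_success": [1, 0, 1, 1, 1],
    "time": [
        "2020-02-01 00:02:07", "2020-02-01 00:03:57", "2020-02-03 10:00:00",
        "2020-02-02 08:00:00", "2020-02-04 09:30:00",
    ],
})

# Parse the timestamp strings once so that date arithmetic is available.
toy_seq["time"] = pd.to_datetime(toy_seq["time"])

# Number of distinct calendar days on which each user played.
active_days = toy_seq.groupby("user_id")["time"].apply(lambda t: t.dt.date.nunique())

# Per-user pass rate over all attempts.
pass_rate = toy_seq.groupby("user_id")["f_success"].mean()
```

Both results are Series indexed by `user_id`, so they can be attached to the train, dev, or test tables by id.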

In [5]:
seq_df = pd.read_csv('./data/level_seq.csv', sep='\t')
seq_df

Unnamed: 0,user_id,level_id,f_success,f_duration,f_reststep,f_help,time
0,10932,1,1,127.0,0.500000,0,2020-02-01 00:05:51
1,10932,2,1,69.0,0.703704,0,2020-02-01 00:08:01
2,10932,3,1,67.0,0.560000,0,2020-02-01 00:09:50
3,10932,4,1,58.0,0.700000,0,2020-02-01 00:11:16
4,10932,5,1,83.0,0.666667,0,2020-02-01 00:13:12
...,...,...,...,...,...,...,...
2194346,10931,40,1,111.0,0.250000,1,2020-02-03 16:26:37
2194347,10931,41,1,76.0,0.277778,0,2020-02-03 16:28:06
2194348,10931,42,0,121.0,0.000000,1,2020-02-03 16:30:17
2194349,10931,42,0,115.0,0.000000,0,2020-02-03 16:33:40


In [6]:
seq_df['user_id'].value_counts()

4963     4122
7884     3893
12822    2938
11238    2548
5502     2369
         ... 
6807        2
862         2
397         2
4860        2
4283        2
Name: user_id, Length: 13589, dtype: int64

In [7]:
group_user = seq_df.groupby('user_id')

In [8]:
group_user['f_success'].count() - group_user['f_success'].sum()

user_id
1        291
2        115
3         90
4         50
5        138
        ... 
13585    136
13586    172
13587      6
13588      1
13589      4
Name: f_success, Length: 13589, dtype: int64

In [9]:
seq_df[seq_df['user_id'] == 1]

Unnamed: 0,user_id,level_id,f_success,f_duration,f_reststep,f_help,time
222,1,1,1,25.0,0.500000,0,2020-02-01 00:02:07
223,1,2,1,55.0,0.642857,0,2020-02-01 00:03:57
224,1,3,0,74.0,0.000000,0,2020-02-01 00:05:44
225,1,3,1,82.0,0.160000,0,2020-02-01 00:07:08
226,1,4,1,74.0,0.466667,0,2020-02-01 00:08:24
...,...,...,...,...,...,...,...
612,1,104,0,100.0,0.000000,0,2020-02-04 19:30:52
613,1,104,0,118.0,0.000000,0,2020-02-04 20:45:59
614,1,104,0,122.0,0.000000,0,2020-02-04 20:48:02
615,1,104,0,125.0,0.000000,0,2020-02-04 20:50:08


In [10]:
# Level that each user attempted most often (used as the `max_retry` feature below)
max_retry = group_user['level_id'].apply(lambda x : x.value_counts().idxmax())

In [11]:
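# Sanity check: convert a timestamp string to epoch seconds
# (time.mktime interprets the string in the local time zone)
time.mktime(time.strptime('2020-02-04 20:52:10', "%Y-%m-%d %H:%M:%S"))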

1580820730.0

In [12]:
# Seconds from each user's last record to the end of the observation window
window_end = pd.Timestamp('2020-02-04 23:59:59')
last_time = (window_end - pd.to_datetime(group_user['time'].max())).dt.total_seconds()

In [13]:
# Highest level reached
max_level = group_user['level_id'].max()

In [14]:
# Number of successful attempts
sum_success = group_user['f_success'].sum()
# Number of failed attempts (used as the retry count)
sum_retry = group_user['f_success'].count() - group_user['f_success'].sum()

In [15]:
# Total play time (s)
sum_duration = group_user['f_duration'].sum()
# Number of attempts that used help
sum_help = group_user['f_help'].sum()
# Sum of the remaining-step ratios
sum_step = group_user['f_reststep'].sum()

In [16]:
# Whether the last recorded attempt was a success (relies on rows being in time order per user)
last_success = group_user.last()['f_success']

In [17]:
# Number of records on the last day (2020-02-04)
last_day_play = group_user['time'].apply(lambda x : sum(x > '2020-02-04 00:00:00'))
# Number of attempts that used help on the last day
last_day_help = group_user.apply(lambda x : sum(x['f_help'][x['time'] > '2020-02-04 00:00:00']))
# Total play time (s) on the last day
last_day_duration = group_user.apply(lambda x : sum(x['f_duration'][x['time'] > '2020-02-04 00:00:00']))
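
The three last-day features call a Python lambda once per user, which can be slow on roughly 2.2 million rows. One possible alternative, sketched here on made-up data, is to filter the last-day rows once and aggregate them in a single `groupby`. The names `toy_seq`, `all_users`, and `last_day_stats` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy records with the columns needed for the last-day features (values are made up).
toy_seq = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "f_help": [0, 1, 1, 0],
    "f_duration": [30.0, 50.0, 20.0, 40.0],
    "time": [
        "2020-02-03 23:00:00", "2020-02-04 10:00:00", "2020-02-04 11:00:00",
        "2020-02-02 09:00:00",
    ],
})
all_users = pd.Index(sorted(toy_seq["user_id"].unique()), name="user_id")

# Keep only last-day rows, then aggregate once per user.
last_day = toy_seq[toy_seq["time"] > "2020-02-04 00:00:00"]
last_day_stats = (
    last_day.groupby("user_id")
    .agg(
        last_day_play=("time", "count"),
        last_day_help=("f_help", "sum"),
        last_day_duration=("f_duration", "sum"),
    )
    # Users with no last-day rows vanish after the filter; restore them with zeros.
    .reindex(all_users, fill_value=0)
)
```

The `reindex` step matters: without it, users who did not play on the last day would be missing from the result instead of having zero counts.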

### level_meta.csv
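Summary statistics for each level, which can be used to represent the level. The columns are:

* `f_avg_duration`: average time spent per attempt (in seconds, over both successful and failed attempts);
* `f_avg_passrate`: average pass rate;
* `f_avg_win_duration`: average time spent per successful attempt (in seconds, over successful attempts only);
* `f_avg_retrytimes`: average number of retries (the second attempt at the same level counts as retry 1);
* `level_id`: level id, which matches the levels in level_seq.csv.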

In [18]:
meta_df = pd.read_csv('./data/level_meta.csv', sep='\t')
meta_df

Unnamed: 0,f_avg_duration,f_avg_passrate,f_avg_win_duration,f_avg_retrytimes,level_id
0,39.889940,0.944467,35.582757,0.017225,1
1,60.683975,0.991836,56.715706,0.004638,2
2,76.947355,0.991232,71.789943,0.004480,3
3,58.170347,0.993843,54.842882,0.004761,4
4,101.784577,0.954170,85.650547,0.027353,5
...,...,...,...,...,...
1504,594.878788,0.453730,133.625000,3.187500,1505
1505,486.562500,0.454180,115.906250,3.218750,1506
1506,325.968750,0.573525,86.250000,2.687500,1507
1507,793.096774,0.322684,164.000000,5.419355,1508


In [19]:
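meta_df = meta_df.set_index('level_id')
# Average pass rate of the highest level each user reached
# (computed here but not added to the feature table below)
max_level_pass = max_level.map(meta_df['f_avg_passrate'])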
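
Beyond a single lookup, the level statistics can be joined onto every attempt to express a user's behavior relative to the level average. The sketch below does this on made-up data; `rel_duration`, `mean_rel_duration`, and `mean_level_passrate` are hypothetical feature names, not ones this notebook uses.

```python
import pandas as pd

# Toy attempt records and level statistics (values are made up).
toy_seq = pd.DataFrame({
    "user_id": [1, 1, 2],
    "level_id": [1, 2, 1],
    "f_duration": [20.0, 90.0, 60.0],
})
toy_meta = pd.DataFrame({
    "level_id": [1, 2],
    "f_avg_duration": [40.0, 60.0],
    "f_avg_passrate": [0.9, 0.5],
})

# Attach the level statistics to every attempt.
merged = toy_seq.merge(toy_meta, on="level_id", how="left")

# A ratio above 1 means the attempt took longer than the level's average attempt.
merged["rel_duration"] = merged["f_duration"] / merged["f_avg_duration"]

# Per-user features: mean relative duration, and the mean pass rate of the
# levels attempted (a rough measure of how hard those levels are).
user_level_feats = merged.groupby("user_id").agg(
    mean_rel_duration=("rel_duration", "mean"),
    mean_level_passrate=("f_avg_passrate", "mean"),
)
```

Whether such relative features improve AUC on this dataset would have to be checked on the validation set.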

## 4. Tips

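* A basic approach: extract features for each user from the level play records → combine them with the labels into a tabular dataset → train and test different models;
* Other models (such as recurrent neural networks) can also be used to model each user's history sequence directly;
* If the data are too large and runs take too long, tune hyperparameters on a sampled, smaller training set first;
* Ensembling several models often gives better results;
* Any open-source tools may be used.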
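
For the sequence-modeling tip, user histories have different lengths, so they usually need to be padded or truncated to a fixed shape before being fed to a recurrent network. The following is a minimal NumPy sketch under that assumption; `pad_sequences` is a hypothetical helper written for this example and the attempt tuples are made up.

```python
import numpy as np


def pad_sequences(seqs, max_len, n_features):
    """Truncate to the most recent attempts and right-pad to a fixed shape.

    Returns (batch, mask): batch has shape (n_users, max_len, n_features);
    mask is 1 where a real attempt is present and 0 where padding was added.
    """
    batch = np.zeros((len(seqs), max_len, n_features), dtype=np.float32)
    mask = np.zeros((len(seqs), max_len), dtype=np.float32)
    for i, seq in enumerate(seqs):
        arr = np.asarray(seq, dtype=np.float32).reshape(-1, n_features)
        arr = arr[-max_len:]  # keep only the most recent attempts
        batch[i, :len(arr)] = arr
        mask[i, :len(arr)] = 1.0
    return batch, mask


# Two hypothetical users; each attempt is (f_success, f_duration, f_help).
user_seqs = [
    [(1, 25.0, 0), (1, 55.0, 0), (0, 74.0, 0)],
    [(1, 40.0, 1)],
]
batch, mask = pad_sequences(user_seqs, max_len=2, n_features=3)
```

The mask lets a downstream model ignore padded positions; how it is consumed depends on the deep learning framework chosen.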

In [20]:
def create_dataset(df):
    # Attach each per-user feature Series (indexed by user_id) as a column.
    # Note: df is modified in place and also returned.
    features = {
        'max_level': max_level,
        'sum_success': sum_success,
        'sum_retry': sum_retry,
        'sum_duration': sum_duration,
        'sum_help': sum_help,
        'last_time': last_time,
        'sum_step': sum_step,
        'last_success': last_success,
        'max_retry': max_retry,
        'last_day_play': last_day_play,
        'last_day_help': last_day_help,
        'last_day_duration': last_day_duration,
    }
    for name, series in features.items():
        df[name] = df['user_id'].map(series)
    return df

In [21]:
train_df = create_dataset(train_df)

In [22]:
train_df

Unnamed: 0,user_id,label,max_level,sum_success,sum_retry,sum_duration,sum_help,last_time,sum_step,last_success,max_retry,last_day_play,last_day_help,last_day_duration
0,2774,0,134,136,79,25398.0,18,3256.0,40.647103,0,116,31,3,4229.0
1,2775,0,116,82,29,18839.0,14,9369.0,28.688567,0,81,35,3,7790.0
2,2776,1,123,44,25,6119.0,1,122486.0,12.871456,0,68,0,0,0.0
3,2777,0,164,145,141,40808.0,4,13053.0,35.533978,0,164,49,2,6830.0
4,2778,1,122,109,53,32045.0,9,134536.0,48.510957,0,84,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8153,10927,1,207,177,173,34697.0,10,10823.0,58.264718,0,195,54,3,7307.0
8154,10928,1,48,50,0,4073.0,3,193348.0,23.560141,1,26,0,0,0.0
8155,10929,0,122,114,129,28858.0,14,12934.0,28.663942,0,97,39,4,4870.0
8156,10930,1,39,37,2,6120.0,6,99241.0,14.827311,1,33,0,0,0.0


### XGBoost

In [23]:
columns = train_df.columns
columns = columns.drop(['label', 'user_id'])

In [24]:
from xgboost import XGBClassifier

In [25]:
from sklearn.model_selection import cross_val_score
xgb_model = XGBClassifier(objective='binary:logistic', max_depth=2, min_child_weight=140, gamma=0.01, scale_pos_weight = 1)
np.mean(cross_val_score(xgb_model, train_df[columns], train_df['label'], cv=10, scoring='roc_auc'))

0.8044224409823479

In [27]:
xgb_model = XGBClassifier(objective='binary:logistic', max_depth=2, min_child_weight=140, gamma=0.01, scale_pos_weight = 1)
xgb_model.fit(train_df[columns], train_df['label'])

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0.01, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=2, max_leaves=0, min_child_weight=140,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

### Random Forest

In [27]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=10, max_features=0.8, max_samples=0.5, min_samples_leaf=50)
np.mean(cross_val_score(rfc, train_df[columns], train_df['label'], cv=10, scoring='roc_auc'))

0.805384446489113

In [28]:
rfc.fit(train_df[columns], train_df['label'])

RandomForestClassifier(max_depth=10, max_features=0.8, max_samples=0.5,
                       min_samples_leaf=50)

### SVM

In [37]:
from sklearn.svm import SVC

In [38]:
svc = SVC(kernel='poly', probability=True, degree=3, C=0.01)
np.mean(cross_val_score(svc, train_df[columns], train_df['label'], cv=5, scoring='roc_auc'))

0.7795503841831208

In [41]:
svc.fit(train_df[columns], train_df['label'])

SVC(C=0.01, kernel='poly', probability=True)

### KNN

In [59]:
from sklearn.neighbors import KNeighborsClassifier

In [60]:
knn = KNeighborsClassifier(n_neighbors=60, leaf_size=200)
np.mean(cross_val_score(knn, train_df[columns], train_df['label'], cv=5, scoring='roc_auc'))

0.7967077474185962
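
SVM and KNN both depend on distances or inner products between feature vectors, so columns with large ranges (here `sum_duration` or `last_time`) can dominate the others when features are left unscaled. One common remedy is to standardize the features inside a pipeline. The sketch below shows the pattern on synthetic data rather than the project data; it makes no claim about how much, if at all, scaling would change the scores reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (not the project data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
# Inflate one column so that it dominates Euclidean distances.
X[:, 0] *= 10_000

raw_knn = KNeighborsClassifier(n_neighbors=15)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))

# The scaler is refit inside each CV fold, so held-out statistics do not leak.
raw_auc = cross_val_score(raw_knn, X, y, cv=5, scoring="roc_auc").mean()
scaled_auc = cross_val_score(scaled_knn, X, y, cv=5, scoring="roc_auc").mean()
```

A pipeline built this way can be passed to `cross_val_score` or `VotingClassifier` in place of the bare estimator.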

### Weighted Ensemble

In [61]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score

In [62]:
dev_df = create_dataset(dev_df)
test_df = create_dataset(test_df)

In [63]:
voting_clf = VotingClassifier(estimators=[
    ('xgb', xgb_model),
    ('rfc', rfc),
    ('svc', svc),
    ('knn', knn)
], voting='soft', weights=[0.4, 0.4, 0.1, 0.1])
voting_clf.fit(train_df[columns], train_df['label'])
roc_auc_score(dev_df['label'], voting_clf.predict_proba(dev_df[columns])[:, 1])

0.8038314476358084
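
With `voting='soft'`, the ensemble's predicted probability is a weighted average of the individual models' predicted probabilities. The sketch below reproduces that arithmetic with the same weights on made-up probabilities for three users; the numbers are illustrative and are not outputs of the models above.

```python
import numpy as np

# Hypothetical churn probabilities from four models (rows) for three users (columns).
probas = np.array([
    [0.80, 0.30, 0.10],  # xgb
    [0.70, 0.40, 0.20],  # rfc
    [0.60, 0.50, 0.50],  # svc
    [0.90, 0.20, 0.30],  # knn
])
weights = np.array([0.4, 0.4, 0.1, 0.1])

# Weighted mean over the model axis.
ensemble = np.average(probas, axis=0, weights=weights)
```

Because the weights sum to 1, the result stays in the range 0 to 1. The weights could also be tuned on the validation set rather than fixed by hand.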

In [64]:
test_df['proba'] = voting_clf.predict_proba(test_df[columns])[:, 1]
test_df.to_csv("result.csv", columns = ['user_id', 'proba'], index=False)

In [65]:
test_df

Unnamed: 0,user_id,max_level,sum_success,sum_retry,sum_duration,sum_help,last_time,sum_step,last_success,max_retry,last_day_play,last_day_help,last_day_duration,proba
0,1,122,104,291,38860.0,8,11269.0,23.907676,0,91,35,0,3751.0,0.077054
1,2,170,122,115,20190.0,20,6686.0,35.679515,0,146,35,1,2840.0,0.101239
2,3,186,140,90,22291.0,14,212372.0,54.124739,0,111,0,0,0.0,0.723313
3,4,178,57,50,13234.0,8,61207.0,15.381000,0,51,13,3,1811.0,0.279685
4,5,123,100,138,29454.0,20,26822.0,28.278164,0,64,7,4,1124.0,0.152967
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2768,2769,37,34,7,3294.0,3,125208.0,13.312551,0,19,0,0,0.0,0.539809
2769,2770,311,206,205,41576.0,18,115.0,60.652032,0,190,123,2,11981.0,0.066073
2770,2771,312,179,76,24327.0,15,779.0,50.530162,0,155,74,6,7400.0,0.101854
2771,2772,57,55,32,10432.0,1,92568.0,18.386226,0,57,0,0,0.0,0.372443
