![](https://openi.nlm.nih.gov/imgs/512/257/4385593/PMC4385593_BMRI2015-530828.003.png)
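* Business Understanding: as I see it, this means understanding the business you are working on and its goals.
* Data Understanding: load the data and browse it (head), the features (columns), the summary statistics, heatmaps, and so on. Then look at how each feature is distributed against the target variable, usually through visualization.
* Data Preparation: mainly data wrangling (turning categorical variables into dummy variables), filling missing values, and feature engineering (binning continuous variables into categories, creating new features, combining all the variables to be used, then checking feature importance with a decision tree to help with variable selection).
* Modeling: logistic regression (LR) is a good start, followed by SVM, RF, GBDT, KNN, Gaussian Naive Bayes, MLP, LinearRegression, and so on for a baseline evaluation. Then tune the hyperparameters until the settings look like the best available, and move on to error analysis to improve the model further.
* Evaluation: the steps above have dealt with accuracy and generalization; now assess whether the model meets the requirements: performance, efficiency, time, defects, and so on.
* Deployment: deploy the model.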

# 1. Business understanding
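TalkingData, China's largest third-party mobile platform, collects information such as app installs and app launches from users. The goal of this challenge is to profile users based on the provided app usage, geolocation, and phone attributes, and to predict each user's age and gender.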

# 2. Data understanding
## 2.1 Overview data
### Load data

In [71]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import log_loss

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [72]:
# Read ids as strings so they are not parsed as numbers.
# The np.str alias was removed in NumPy 1.24; the built-in str is equivalent.
df_gender_age_test = pd.read_csv('../input/gender_age_test.csv', dtype={'device_id': str})
df_gender_age_train = pd.read_csv('../input/gender_age_train.csv', dtype={'device_id': str})

df_app_events = pd.read_csv('../input/app_events.csv', dtype={'app_id': str})
df_events = pd.read_csv('../input/events.csv', dtype={'device_id': str})

df_app_labels = pd.read_csv('../input/app_labels.csv', dtype={'app_id': str})
df_label_categories = pd.read_csv('../input/label_categories.csv')

df_phone_brands = pd.read_csv('../input/phone_brand_device_model.csv', dtype={'device_id': str})

### Gender Age

In [73]:
df_gender_age_test.head()

In [74]:
df_gender_age_test.device_id.nunique(), df_gender_age_test.shape[0]

Each device id maps to one user's gender, age, and the corresponding group.

In [75]:
df_gender_age_train.head()

In [76]:
df_gender_age_train.device_id.nunique(), df_gender_age_train.shape[0]

In [77]:
df_gender_age_train.info()

In [78]:
df_gender_age_train.describe(include='all').T

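There are no duplicate device ids and no null values, which is good. The summary statistics show a mean age of 31 and more male than female users. The maximum age is 96, though. Is that a problem? Let's leave a TODO here for now.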

In [79]:
df_ga_full = pd.concat([df_gender_age_train, df_gender_age_test], axis=0, sort=False)

In [80]:
df_ga_full.device_id.nunique()

### Events
Events records the trigger time and the longitude/latitude of each event on each device.

In [81]:
df_events.head()

In [82]:
df_events.event_id.nunique(), df_events.device_id.nunique(), df_events.shape[0]

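The full dataset has 186716 device ids, but events only covers 60865 devices, a missing rate of 67.4%. If that holds, only PhoneBrand is usable for the remaining devices, so we should check how completely the device ids in the test set are covered by events. If they are covered, we can drop the events tied to device ids that are of no use.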

In [83]:
100 * (df_gender_age_test.device_id.isin(df_events.device_id.unique())).sum()/df_gender_age_test.device_id.nunique()

In [84]:
100 * (df_gender_age_train.device_id.isin(df_events.device_id.unique())).sum()/df_gender_age_train.device_id.nunique()
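
The two coverage checks above come down to the share of device ids that also show up in the events table. A minimal sketch on invented toy data (the ids below are illustrative, not taken from the dataset); because the device ids are unique, the mean of the boolean mask equals the `sum()/nunique()` ratio used above:

```python
import pandas as pd

# Toy stand-ins for the real tables; the ids are invented for illustration.
toy_devices = pd.DataFrame({'device_id': ['a', 'b', 'c', 'd']})
toy_events = pd.DataFrame({'device_id': ['a', 'a', 'c', 'x']})

# isin() yields a boolean Series, so its mean is the covered fraction.
covered = toy_devices.device_id.isin(toy_events.device_id.unique())
coverage_pct = 100 * covered.mean()
print(coverage_pct)
```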

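The check shows that for both train and test, only about 31% of the device ids appear in events; for the rest, the event data is NA.
An event links to a lot of further data, such as the trigger time recorded in events and the installed apps with their categories. App Events records the installed and active status of the apps in each event; one event maps to multiple apps.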

In [85]:
df_app_events.head()

In [86]:
df_app_events.event_id.nunique(), df_app_events.shape[0]

In [87]:
# df_gender_age_train.device_id[]
in_train_events = df_events[df_events.device_id.isin(set(df_gender_age_train.device_id) & set(df_events.device_id))]
in_train_app_events = df_app_events[df_app_events.event_id.isin(in_train_events.event_id)]
in_train_app_events.event_id.nunique(), in_train_app_events.event_id.size, len(in_train_events)

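The events that belong to train device ids total 1215595 rows, roughly 1.21 million events, but app events only captured the app install status for about 550 thousand of them, again roughly half. To summarize the training data: the training set has 74645 device ids, events has 60865 device ids, and 23309 device ids appear in both. On those 23309 devices, 1215595 events were collected, which is about 37% of all 3252950 events, a little over 1/3.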

In [88]:
in_test_events = df_events[df_events.device_id.isin(set(df_gender_age_test.device_id) & set(df_events.device_id))]
in_test_app_events = df_app_events[df_app_events.event_id.isin(in_test_events.event_id)]
in_test_app_events.event_id.nunique(), in_test_app_events.event_id.size, len(in_test_events)

In [89]:
del in_train_events
del in_train_app_events
del in_test_events
del in_test_app_events

In [90]:
import gc
gc.collect()

### App label
App label holds the label information for each app, indicating its category.

In [91]:
df_app_labels.head()

In [92]:
df_app_labels.app_id.nunique(), df_app_labels.label_id.nunique(), df_app_labels.shape[0]

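The numbers above show about 459.9 thousand rows in total, covering roughly 110 thousand apps and 507 labels, so one app can have multiple labels.
The label categories table gives the text description of each app label, one to one.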

In [93]:
df_label_categories.head()

In [94]:
df_label_categories.category.nunique(), df_label_categories.shape[0]

### Phone brands
A device id identifies an individual user and also identifies the phone model.

In [95]:
df_phone_brands.head()

In [96]:
df_phone_brands.device_id.nunique(), df_phone_brands.shape[0]

This shows that Phone Brands contains duplicated device ids. Let's pull out those device ids and see whether the duplicated rows are identical.

In [97]:
# keep=False marks every row of a duplicated device_id, not only the later occurrences
df_phone_brands[df_phone_brands.device_id.duplicated(keep=False)].sort_values('device_id')

At a glance they are basically all identical, so these are just duplicate records and can simply be dropped.

In [98]:
df_phone_brands.drop_duplicates(subset='device_id', inplace=True)
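
Eyeballing the duplicated rows works, but the same check can be done programmatically. The sketch below uses invented toy data (not the real table) and lists the device ids whose duplicated rows disagree on brand or model; an empty result would support dropping duplicates by `device_id` alone.

```python
import pandas as pd

# Toy data, invented for illustration: device '1' is a true duplicate,
# device '3' appears twice with conflicting brands.
toy_brands = pd.DataFrame({
    'device_id': ['1', '1', '2', '3', '3'],
    'phone_brand': ['A', 'A', 'B', 'C', 'D'],
    'device_model': ['m1', 'm1', 'm2', 'm3', 'm3'],
})

# All rows whose device_id occurs more than once.
dup_rows = toy_brands[toy_brands.device_id.duplicated(keep=False)]

# After removing fully identical rows, count what is left per device.
# More than one remaining row means the duplicates disagree.
n_variants = dup_rows.drop_duplicates().groupby('device_id').size()
conflicting = n_variants[n_variants > 1].index.tolist()
print(conflicting)
```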

Are there phones that share the same model name under different brands?

In [99]:
a = df_phone_brands.groupby(['device_model']).phone_brand.nunique()[df_phone_brands.groupby(['device_model']).phone_brand.nunique() > 1]
a

In [100]:
df_phone_brands[df_phone_brands.device_model.isin(a.index.tolist())].sort_values(['device_model', 'phone_brand'])

In [101]:
a.shape[0]

So there are 54 model names that exist under more than one brand. To tell the models apart, we need to tie each model to its brand.

In [102]:
df_phone_brands.phone_brand = df_phone_brands.phone_brand.map(str.strip).map(str.lower)
df_phone_brands.device_model = df_phone_brands.device_model.map(str.strip).map(str.lower)
df_phone_brands.device_model = df_phone_brands.phone_brand.str.cat(df_phone_brands.device_model)
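
To illustrate why the model name is prefixed with the brand, here is a small sketch on invented toy data. It uses the vectorized `.str` accessors and, as an assumption that differs from the cell above, inserts a space separator purely for readability.

```python
import pandas as pd

# Toy data, invented for illustration: the model name "S1" exists under two brands.
toy = pd.DataFrame({
    'phone_brand': [' Xiaomi', 'Huawei ', 'xiaomi'],
    'device_model': ['S1', 's1', 'S1 '],
})

# Normalize whitespace and case before combining.
brand = toy.phone_brand.str.strip().str.lower()
model = toy.device_model.str.strip().str.lower()

# sep=' ' is an assumption made here; the notebook concatenates without a separator.
toy['brand_model'] = brand.str.cat(model, sep=' ')
print(toy.brand_model.tolist())
```

After normalization the bare model column has a single distinct value, while the combined column keeps the two brands apart.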

In [103]:
df_phone_brands.info()

In [104]:
df_phone_brands.describe()

We have now gone through all the data: device id and the corresponding age, gender, group, phone brand, device model, event_id, event timestamp, event location, event app, event app installed, event app label, and event app category.

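Although events and app events are missing a lot of data, many people still use them. Let's merge phone brands first. The event and app information needs more analysis to decide how to process it and how to build features from it.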

### Merge them all

In [105]:
df_ga_full = df_ga_full.merge(df_phone_brands, how='left', on='device_id')

In [106]:
df_train = df_ga_full.loc[df_ga_full.device_id.isin(df_gender_age_train.device_id.tolist())]
df_test = df_ga_full.loc[df_ga_full.device_id.isin(df_gender_age_test.device_id.tolist())]

## 2.2 Visualization
Use visualization to see how each variable contributes to the distribution of the target variable.

In [107]:
# sns.kdeplot(df_gender_age_train.age)
fig = plt.figure(figsize=(9, 6))
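# distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same histogram plus density curve
sns.histplot(df_gender_age_train.age, kde=True, stat='density', ax=fig.gca())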
plt.title('Age distribution')
sns.despine()

The plot above shows that most users are between 20 and 50 years old; the plot below shows that male users far outnumber female users.

In [108]:
fig = plt.figure(figsize=(7, 4))
sns.barplot(x = df_gender_age_train.gender.value_counts().index, y=df_gender_age_train.gender.value_counts().values, ax=fig.gca())
sns.despine()
plt.title('Gender distribution')

In [109]:
df_gender_age_train.groupby('group').device_id.size().sort_index(ascending=False).plot.barh(title='Age Gender Group Distribution')
sns.despine()

So how do we plot the relationship between phone brand and group? Brand on the X axis, age on the Y axis, and gender as hue.

In [110]:
# for brands
c = df_train.phone_brand.value_counts()
# value_counts sorts by count in descending order automatically
market_share = c.cumsum()/c.sum()
# for models
c2 = df_train.device_model.value_counts()
market_share2 = c2.cumsum()/c2.sum()

In [111]:
ax = plt.subplot(1,2,1)
plt.gcf().set_figheight(4)
plt.gcf().set_figwidth(12)
plt.plot(market_share.values, 'b-')
plt.title('Brand share')
sns.despine()

ax = plt.subplot(1,2,2)
plt.plot(market_share2.values, 'g-')
plt.title('Model share')
sns.despine()

plt.subplots_adjust(top=0.8)
plt.suptitle('Brand and model share');

In [112]:
share_majority = market_share[~(market_share>0.95)].index.tolist()
share_others = market_share[market_share>0.95].index.tolist()

share_majority2 = market_share2[~(market_share2>0.60)].index.tolist()
share_others2 = market_share2[market_share2>0.60].index.tolist()
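
The majority/others split above relies on the cumulative share of the descending value counts. A minimal sketch on invented toy data, using the same 0.95 threshold, shows how the cut works:

```python
import pandas as pd

# Toy brand column, invented for illustration: a=60%, b=30%, c=10%.
toy_brand = pd.Series(['a'] * 6 + ['b'] * 3 + ['c'])

counts = toy_brand.value_counts()            # descending: a=6, b=3, c=1
cum_share = counts.cumsum() / counts.sum()   # a=0.6, b=0.9, c=1.0

# Brands whose cumulative share stays within 95% form the majority;
# the long tail beyond that goes into "others".
majority = cum_share[~(cum_share > 0.95)].index.tolist()
others = cum_share[cum_share > 0.95].index.tolist()
print(majority, others)
```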

In [113]:
str(share_majority2)

In [114]:
# https://seaborn.pydata.org/tutorial/categorical.html
# sns.swarmplot(x="phone_brand", y="age", hue="gender", data=df_train);
fig = plt.figure(figsize=(20, 6))
ax = sns.boxplot(x="phone_brand", y="age", hue="gender", data=df_train[df_train.phone_brand.isin(share_majority)].sort_values('age'), ax=fig.gca());
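# Reuse the tick labels seaborn assigned: the boxes follow the order of appearance in the
# age-sorted data, which is not the order of share_majority.
ax.set_xticklabels(ax.get_xticklabels(), rotation=30);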
str(share_majority)

In [115]:
fig = plt.figure(figsize=(20, 6))
ax = sns.boxplot(x="device_model", y="age", hue="gender", data=df_train[df_train.device_model.isin(share_majority2)].sort_values('age'), ax=fig.gca());
ax.set_xticklabels(ax.get_xticklabels(), rotation=30);
str(share_majority2)

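The brands that make up 95% of the market show no obvious age difference; they all look about the same. The device models in the top 60%, however, do seem to show a slight difference.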

# 3. Data preparation

In [116]:
df_train.head()

## 3.1 App events and events.
Try to extract the number of installed apps and the app labels.

In [117]:
df_app_labels.head()

Try to collect the labels of each app.

In [118]:
# .groups shows what each group looks like
# df_app_labels.groupby('app_id').label_id.groups
df_app_labels = df_app_labels.groupby('app_id').label_id.apply(lambda x: ' '.join(str(s) for s in x))
df_app_labels.head()
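
The groupby-and-join step collapses all labels of an app into one space-separated string, which can then be mapped onto another table by app id. A small sketch on invented toy data (the ids are illustrative):

```python
import pandas as pd

# Toy app/label pairs, invented for illustration.
toy_labels = pd.DataFrame({
    'app_id': ['10', '10', '20'],
    'label_id': [251, 406, 251],
})

# One string per app: its label ids joined by spaces.
labels_per_app = toy_labels.groupby('app_id').label_id.apply(
    lambda ids: ' '.join(str(i) for i in ids))

# Mapping by app_id; apps without any label come out as NaN.
mapped = pd.Series(['20', '10', '99']).map(labels_per_app)
print(labels_per_app.to_dict())
```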

In [119]:
df_app_events.head()

In [120]:
df_app_events['app_lab'] = df_app_events['app_id'].map(df_app_labels)

In [121]:
df_app_events.head()

In [122]:
df_app_events = df_app_events.groupby('event_id').app_lab.apply(lambda x: ' '.join(str(s) for s in x))

In [123]:
df_app_events.head()

In [124]:
del df_label_categories
del df_app_labels

In [125]:
df_events.head()

In [126]:
df_events['app_lab'] = df_events.event_id.map(df_app_events)

In [127]:
df_events.head()

In [128]:
df_events['timestamp'] = pd.to_datetime(df_events['timestamp'])

In [129]:
df_events['hour'] = df_events['timestamp'].dt.hour

In [160]:
time_large = df_events.groupby('device_id')['hour'].max()  # latest hour of day seen per device

In [161]:
time_small = df_events.groupby('device_id')['hour'].min()  # earliest hour of day seen per device

In [154]:
from collections import Counter
time_most = df_events.groupby('device_id')['hour'].apply(lambda x: Counter(x).most_common(1)[0][0])
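
The three hour features (earliest, latest, and most frequent hour per device) can be checked on a tiny invented example. With `Counter.most_common`, ties go to the hour that was encountered first.

```python
from collections import Counter

import pandas as pd

# Toy events, invented for illustration.
toy_events = pd.DataFrame({
    'device_id': ['a', 'a', 'a', 'b', 'b'],
    'hour': [9, 22, 22, 7, 7],
})

hours = toy_events.groupby('device_id')['hour']
summary = pd.DataFrame({
    'time_small': hours.min(),   # earliest hour of day
    'time_large': hours.max(),   # latest hour of day
    'time_most': hours.apply(lambda h: Counter(h).most_common(1)[0][0]),
})
print(summary)
```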

In [165]:
del df_app_events

In [166]:
df_events.app_lab = df_events.app_lab.fillna('Missing')
df_events = df_events.groupby('device_id').app_lab.apply(lambda x: ' '.join(str(s) for s in x))

In [167]:
df_events.head()

All the app labels are now extracted. Next we combine them with Phone Brand.

In [183]:
df_ga_full['app_lab']= df_ga_full['device_id'].map(df_events)
df_ga_full['time_most']= df_ga_full['device_id'].map(time_most)
df_ga_full['time_large']= df_ga_full['device_id'].map(time_large)
df_ga_full['time_small']= df_ga_full['device_id'].map(time_small)

In [184]:
df_ga_full.head()

In [178]:
del df_train
del df_test
del df_events
del df_phone_brands
del time_large
del time_most
del time_small

In [186]:
fig = plt.figure(figsize=(20, 6))
ax = sns.boxplot(x="time_most", y="age", hue="gender", data=df_ga_full, ax=fig.gca());
ax.set_xticklabels(ax.get_xticklabels(), rotation=30);

Users whose most active hour is 5 o'clock tend to be older, above 35.

In [187]:
fig = plt.figure(figsize=(20, 6))
ax = sns.boxplot(x="time_large", y="age", hue="gender", data=df_ga_full, ax=fig.gca());
ax.set_xticklabels(ax.get_xticklabels(), rotation=30);

In [188]:
fig = plt.figure(figsize=(20, 6))
ax = sns.boxplot(x="time_small", y="age", hue="gender", data=df_ga_full, ax=fig.gca());
ax.set_xticklabels(ax.get_xticklabels(), rotation=30);

Women seem to be more common among users whose activity runs as late as 22:00.

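Next we need to split this app label string. Into what? One column per label, set to 1 when the device has an app with that label. My first idea was to create an empty DataFrame with as many columns as the largest label value, split each app label string into a numeric array, use it as an index selection, and assign row by row. That turned out to be far too slow: on my 8th-gen i7, one row takes 1.5 s, and with 74645 rows that comes to about 31 hours.

That was scary. I also tried multithreading (but Python threads effectively run on a single core), and multiprocessing had all sorts of problems of its own. This issue cost me a week, until today I came across the kernel [Low RAM bag-of-apps(Python)](https://www.kaggle.com/xiaoml/low-ram-bag-of-apps-python/code).

It extracts the app labels and gathers them per device id. At that point it switches to text processing: each row's app labels are treated as a sentence, and each app label in it as a word. The vectorizer collects the words from all rows as column names, then for every row sets to 1 the columns whose words occur in that row. The result is one column per app label, where each row has a 1 only for the labels that device id actually has.

There is a good CSDN article on [text preprocessing: CountVectorizer, TfidfTransformer and TfidfVectorizer in sklearn](https://blog.csdn.net/m0_37324740/article/details/79411651).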
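
The bag-of-words idea described above can be seen on a tiny invented example. Each string stands for the space-joined labels of one device, and `binary=True` records presence rather than counts. One caveat worth checking on real data: with the default token pattern, `CountVectorizer` only keeps tokens of at least two characters, so single-digit label ids would be dropped silently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy "sentences", invented for illustration: one string of label ids per device.
docs = ['251 406 251', '406 713', 'Missing']

vec = CountVectorizer(binary=True)
X_toy = vec.fit_transform(docs)   # sparse matrix, one row per device

# Column names in column order (tokens are lowercased by default).
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X_toy.toarray())
```

The repeated label 251 in the first row still yields a single 1, which is the "has this label or not" encoding the text describes.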


In [189]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
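# Treat NA as a category of its own.
df_app_lab_vectorized = vectorizer.fit_transform(df_ga_full['app_lab'].fillna('Missing')) 
# Consider replacing the feature names with the label category text, which is easier to read.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
str(vectorizer.get_feature_names_out())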

In [190]:
app_labels = pd.DataFrame(df_app_lab_vectorized.toarray(), columns=vectorizer.get_feature_names_out(), index=df_ga_full.device_id)
app_labels.head(3)
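
Calling `toarray()` turns the sparse matrix into a dense one, which can use a lot of memory on the full dataset. As an alternative that this notebook does not use, the features could stay sparse and be stacked with `scipy.sparse.hstack`. The sketch below is only an illustration on invented data; `other` is a hypothetical stand-in for any additional one-hot feature block.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy label strings, invented for illustration.
docs = ['251 406', '406', 'Missing']
bag = CountVectorizer(binary=True).fit_transform(docs)   # sparse, 3 rows x 3 columns

# Hypothetical extra feature block (for example a one-hot encoded brand).
other = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))

# Stack the blocks column-wise without ever building a dense array.
features = sparse.hstack([other, bag], format='csr')
print(features.shape)
```

Many estimators, including `LogisticRegression` and `xgb.DMatrix`, accept such sparse input directly.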

In [191]:
df_ga_full = df_ga_full.merge(app_labels, how='left', left_on='device_id', right_index=True)

In [192]:
df_ga_full.head(3)

In [193]:
df_ga_full = pd.get_dummies(df_ga_full.drop(columns=['gender', 'age', 'app_lab']), columns=['phone_brand', 'device_model', 'time_most', 'time_large', 'time_small'])

In [194]:
df_ga_full.head(3)

In [195]:
df_ga_full.shape

In [196]:
df_ga_full.info()

In [197]:
df_ga_full.describe()

In [198]:
train = df_ga_full[df_ga_full.device_id.isin(df_gender_age_train.device_id)]
test = df_ga_full[df_ga_full.device_id.isin(df_gender_age_test.device_id)].drop(columns=['group'])

X = train.drop(columns=['group'])
encoder = LabelEncoder()
Y = encoder.fit_transform(train['group'])

In [199]:
X.shape, Y.shape
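
`LabelEncoder` maps the group strings to integer codes in sorted order and keeps the mapping in `classes_`, which is what later provides the column names for the predicted probabilities. A minimal sketch with illustrative group names (assumed here, not read from the data):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative group names; the real ones come from the 'group' column.
groups = ['M23-26', 'F23-', 'M23-26', 'F24-26']

enc = LabelEncoder()
codes = enc.fit_transform(groups)

# classes_ holds the sorted unique labels; a code is an index into it.
print(list(enc.classes_))
print(codes.tolist())
print(enc.inverse_transform(codes).tolist())
```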

# 4. Model

In [200]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
# scores = cross_val_score(LogisticRegression(), X, Y, scoring='neg_log_loss',cv=10, verbose=1)

In [201]:
# scores.mean(), scores

In [202]:
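# from sklearn.model_selection import cross_val_predict
# log_loss needs class probabilities, so ask for predict_proba instead of hard labels
# y_pred = cross_val_predict(LogisticRegression(), X, Y, cv=10, n_jobs=-1, verbose=1, method='predict_proba')
# log_loss(Y, y_pred)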

In [203]:
# from sklearn.model_selection import StratifiedKFold
# kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # random_state only takes effect with shuffle=True
# pred = np.zeros((Y.shape[0], len(encoder.classes_)))
# for train_index, test_index in kf.split(X, Y):
#     X_train, X_test = X.iloc[train_index], X.iloc[test_index]
#     y_train, y_test = Y[train_index], Y[test_index]  # Y is a NumPy array, so index it directly
#     lr = LogisticRegression(solver='sag').fit(X_train, y_train)
#     pred[test_index,:] = lr.predict_proba(X_test)
#     # Downsize to one fold only for kernels
#     print("{:.5f}".format(log_loss(y_test, pred[test_index, :])), end=' ')

# # log_loss(Y, pred)

In [204]:
import xgboost as xgb
from sklearn.model_selection import train_test_split

X.set_index('device_id', inplace=True)
X_train, X_val, y_train, y_val = train_test_split(X, Y, train_size=.80)

##################
#     XGBoost
##################

dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_val, y_val)

params = {
    "objective": "multi:softprob",
    "num_class": 12, # Y has 12 classes in total
    "booster": "gbtree", # tree-based model (the default); gblinear is the linear alternative
    "eval_metric": "mlogloss",
    "eta": 0.3, # similar to the learning rate in GBM
    "verbosity": 1, # 0 = silent, 1 = warning, 2 = info, 3 = debug; replaces the deprecated "silent" flag
}
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
gbm = xgb.train(params, dtrain, 140, evals=watchlist, verbose_eval=True)
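
The `mlogloss` metric reported during training is the multiclass log loss: the mean negative log of the probability assigned to the true class. A small sketch with invented numbers computes it by hand and compares it with `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented example: 3 samples, 3 classes, each row of probabilities sums to 1.
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Pick the probability of the true class in each row, then average -log(p).
p_true = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(p_true))

reference = log_loss(y_true, proba, labels=[0, 1, 2])
print(manual, reference)
```

A uniform guess over 12 classes would score ln(12), roughly 2.48, which is a useful baseline when reading the training log.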

In [None]:
test.set_index('device_id', inplace=True)
# Training used no early stopping, so there is no best iteration to cut at; predict with all 140 rounds.
y_pre = gbm.predict(xgb.DMatrix(test))

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, Y, scoring='neg_log_loss',cv=10, verbose=1)

In [None]:
# scores.mean(), scores

# 5. Evaluation

# 6. Deployment

In [None]:
pd.read_csv('../input/sample_submission.csv').head()

In [None]:
result = pd.DataFrame(y_pre, index=test.index, columns=encoder.classes_)
result.head()

In [None]:
result.to_csv('./predict_prob.csv')

In [None]:
pd.read_csv('./predict_prob.csv').head()
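
Before submitting, a few sanity checks on the result frame can catch format problems early. The sketch below uses an invented two-class frame as a stand-in for `result`; the class names and device ids are illustrative.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the real prediction frame.
classes = ['F23-', 'M23-26']
proba = np.array([[0.25, 0.75], [0.6, 0.4]])
toy_result = pd.DataFrame(
    proba,
    index=pd.Index(['d1', 'd2'], name='device_id'),
    columns=classes,
)

# Each row should be a probability distribution, and each device should appear once.
rows_sum_to_one = np.allclose(toy_result.sum(axis=1), 1.0)
ids_unique = toy_result.index.is_unique

# Without a path, to_csv returns the CSV text, so the header can be inspected.
header = toy_result.to_csv().splitlines()[0]
print(rows_sum_to_one, ids_unique, header)
```

The header should match the column order of `sample_submission.csv`: `device_id` first, then one column per group.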

**Submit it!**