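## Project Background

This project uses user data from the Taobao app as its dataset and applies standard industry metrics to analyze user behavior, with the goal of uncovering behavioral patterns among Taobao users. The metrics covered are: daily PV and daily UV, paying rate, repurchase behavior, funnel drop-off, and user value via the RFM model.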

## Data Source

[Alibaba Cloud Tianchi (阿里云天池)](https://tianchi.aliyun.com/dataset/dataDetail?dataId=46&userId=1)

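## Questions to Answer

* What is the daily PV?
* What is the daily UV?
* What does the paying rate look like?
* What is the repurchase rate?
* Where do users drop off in the funnel?
* How is user value distributed?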

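## Data Description

The dataset contains about 12 million user-behavior records from the Taobao app, collected between November 18, 2014 and December 18, 2014. The data has been anonymized and has six fields:
* user_id: user ID
* item_id: item ID
* behavior_type: type of user behavior (click, favorite, add to cart, and payment, encoded as 1, 2, 3, and 4 respectively)
* user_geohash: geographic location (encrypted)
* item_category: category ID (the category the item belongs to)
* time: time at which the behavior occurred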
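
To make the numeric `behavior_type` codes easier to read in later steps, they can be mapped to labels. The following is a minimal sketch on made-up rows; the label strings and the names `BEHAVIOR_LABELS` and `sample` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mapping taken from the field description; the label strings are illustrative.
BEHAVIOR_LABELS = {1: "click", 2: "favorite", 3: "add_to_cart", 4: "purchase"}

# Toy rows standing in for the real dataset (values are made up).
sample = pd.DataFrame({"user_id": [1, 1, 2, 3], "behavior_type": [1, 4, 3, 2]})

# Add a readable label column next to the numeric code.
sample["behavior"] = sample["behavior_type"].map(BEHAVIOR_LABELS)

# Count how often each behavior occurs.
behavior_counts = sample["behavior"].value_counts()
print(sample)
print(behavior_counts)
```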

## Data Cleaning

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
user_data = pd.read_csv('./tianchi_mobile_recommend_train_user.csv')

In [3]:
user_data.head()

Unnamed: 0,user_id,item_id,behavior_type,user_geohash,item_category,time
0,98047837,232431562,1,,4245,2014-12-06 02
1,97726136,383583590,1,,5894,2014-12-09 20
2,98607707,64749712,1,,2883,2014-12-18 11
3,98662432,320593836,1,96nn52n,6562,2014-12-06 10
4,98145908,290208520,1,,13926,2014-12-16 21


In [4]:
user_data

Unnamed: 0,user_id,item_id,behavior_type,user_geohash,item_category,time
0,98047837,232431562,1,,4245,2014-12-06 02
1,97726136,383583590,1,,5894,2014-12-09 20
2,98607707,64749712,1,,2883,2014-12-18 11
3,98662432,320593836,1,96nn52n,6562,2014-12-06 10
4,98145908,290208520,1,,13926,2014-12-16 21
...,...,...,...,...,...,...
12256901,93812622,378365755,1,95q6d6a,11,2014-12-13 21
12256902,93812622,177724753,1,,12311,2014-12-14 21
12256903,93812622,234391443,1,,8765,2014-12-11 16
12256904,93812622,26452000,1,95q6dqc,7951,2014-12-08 22


In [5]:
user_data.isnull().sum()

user_id                0
item_id                0
behavior_type          0
user_geohash     8334824
item_category          0
time                   0
dtype: int64

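Only `user_geohash` has missing values: 8,334,824 rows. These rows should not be dropped, because the location field was encrypted and transformed when the dataset was collected; the missing values are therefore left as they are.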
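
Expressing the missing values as a share of all rows makes it clearer why dropping them is not an option: with the counts above, 8,334,824 out of roughly 12.26 million rows is about 68%. A minimal sketch of that calculation on toy data (the `toy` frame and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as the real data (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "user_geohash": [np.nan, "96nn52n", np.nan, np.nan],
})

# isnull() yields booleans; their mean is the fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)
```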

In [6]:
user_data.dtypes

user_id           int64
item_id           int64
behavior_type     int64
user_geohash     object
item_category     int64
time             object
dtype: object

In [7]:
# Split the "YYYY-MM-DD HH" string into a date part and an hour part
user_data['date'] = user_data['time'].apply(lambda x: x.split(' ')[0])
user_data['hour'] = user_data['time'].apply(lambda x: x.split(' ')[1])

### Splitting the Date Data

In [8]:
user_data

Unnamed: 0,user_id,item_id,behavior_type,user_geohash,item_category,time,date,hour
0,98047837,232431562,1,,4245,2014-12-06 02,2014-12-06,02
1,97726136,383583590,1,,5894,2014-12-09 20,2014-12-09,20
2,98607707,64749712,1,,2883,2014-12-18 11,2014-12-18,11
3,98662432,320593836,1,96nn52n,6562,2014-12-06 10,2014-12-06,10
4,98145908,290208520,1,,13926,2014-12-16 21,2014-12-16,21
...,...,...,...,...,...,...,...,...
12256901,93812622,378365755,1,95q6d6a,11,2014-12-13 21,2014-12-13,21
12256902,93812622,177724753,1,,12311,2014-12-14 21,2014-12-14,21
12256903,93812622,234391443,1,,8765,2014-12-11 16,2014-12-11,16
12256904,93812622,26452000,1,95q6dqc,7951,2014-12-08 22,2014-12-08,22


In [9]:
user_data.dtypes

user_id           int64
item_id           int64
behavior_type     int64
user_geohash     object
item_category     int64
time             object
date             object
hour             object
dtype: object

In [10]:
# Convert data types
user_data['time'] = pd.to_datetime(user_data['time'])
user_data['date'] = pd.to_datetime(user_data['date'])
user_data['hour'] = user_data['hour'].astype('int64')
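
As an alternative to splitting strings row by row, the `time` column could be parsed once and the date and hour derived through the `.dt` accessor, which is usually faster on millions of rows. This is a sketch on toy data, not the notebook's own method; the explicit `format` string is an assumption based on the `YYYY-MM-DD HH` values shown in the output above.

```python
import pandas as pd

# Toy values in the same "YYYY-MM-DD HH" layout as the real time column.
toy = pd.DataFrame({"time": ["2014-12-06 02", "2014-12-09 20", "2014-11-18 00"]})

# Parse once with an explicit format, then derive the other columns from it.
toy["time"] = pd.to_datetime(toy["time"], format="%Y-%m-%d %H")
toy["date"] = toy["time"].dt.normalize()  # same day at midnight, still a datetime
toy["hour"] = toy["time"].dt.hour         # integer hour in the range 0-23
print(toy)
```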

In [11]:
user_data.dtypes

user_id                   int64
item_id                   int64
behavior_type             int64
user_geohash             object
item_category             int64
time             datetime64[ns]
date             datetime64[ns]
hour                      int64
dtype: object

In [12]:
user_data = user_data.sort_values(by='time')

In [14]:
user_data.reset_index(drop=True, inplace=True)

In [15]:
user_data

Unnamed: 0,user_id,item_id,behavior_type,user_geohash,item_category,time,date,hour
0,73462715,378485233,1,,9130,2014-11-18 00:00:00,2014-11-18,0
1,36090137,236748115,1,,10523,2014-11-18 00:00:00,2014-11-18,0
2,40459733,155218177,1,,8561,2014-11-18 00:00:00,2014-11-18,0
3,814199,149808524,1,,9053,2014-11-18 00:00:00,2014-11-18,0
4,113309982,5730861,1,,3783,2014-11-18 00:00:00,2014-11-18,0
...,...,...,...,...,...,...,...,...
12256901,132653097,119946062,2,,6054,2014-12-18 23:00:00,2014-12-18,23
12256902,130082553,296196819,1,,11532,2014-12-18 23:00:00,2014-12-18,23
12256903,43592945,350594832,1,9rhhgph,9541,2014-12-18 23:00:00,2014-12-18,23
12256904,12833799,186993938,1,954g37v,3798,2014-12-18 23:00:00,2014-12-18,23


In [16]:
user_data.describe()

Unnamed: 0,user_id,item_id,behavior_type,item_category,hour
count,12256910.0,12256910.0,12256910.0,12256910.0,12256910.0
mean,71707320.0,202308400.0,1.105271,6846.162,14.81799
std,41229200.0,116739700.0,0.4572662,3809.922,6.474778
min,4913.0,64.0,1.0,2.0,0.0
25%,35849650.0,101413000.0,1.0,3721.0,10.0
50%,72928040.0,202135900.0,1.0,6209.0,16.0
75%,107377400.0,303540500.0,1.0,10290.0,20.0
max,142455900.0,404562500.0,4.0,14080.0,23.0


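## User Behavior Analysis

PV (page views): the number of page views or clicks on a site; each time a page is refreshed, it counts as one view.

UV (unique visitors): each distinct client device that visits the site counts as one visitor.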

In [None]:
# Daily PV: number of behavior records per day
pv_daily = user_data.groupby('date')['user_id'].count().reset_index()
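
Following the definitions above, daily PV and daily UV can be computed in a single grouped aggregation. The sketch below runs on toy data and assumes that one row is one page view and that a distinct `user_id` stands in for a unique visitor; the column names `pv`, `uv`, and `pv_per_uv` are illustrative.

```python
import pandas as pd

# Toy behavior log (values are made up): one row per recorded action.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "date": pd.to_datetime(
        ["2014-11-18", "2014-11-18", "2014-11-18", "2014-11-19", "2014-11-19"]
    ),
})

# PV counts every row per day; UV counts distinct users per day.
daily = (
    toy.groupby("date")["user_id"]
       .agg(pv="count", uv="nunique")
       .reset_index()
)

# Average number of views per visitor on each day.
daily["pv_per_uv"] = daily["pv"] / daily["uv"]
print(daily)
```

Keeping both metrics in one frame makes it straightforward to plot them on a shared date axis afterwards.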

## Purchase Behavior Analysis

## Repurchase Analysis

## Funnel Drop-off Analysis