##  **任务要求**
### **任务 1 数据预处理**
- 任务 1.1 对照附录 1，理解各字段的含义，进行缺失值、重复值等方面的必12要处理，将处理结果保存为“task1_1_X.csv”（如果包含多张数据表，X 可从 1 开始往后编号），并在报告中描述处理过程。
- 任务 1.2 对用户信息表中 recently_logged 字段的“--”值进行必要的处理，将处理结果保存为“task1_2.csv”，并在报告中描述处理过程。
### **任务 2 平台用户活跃度分析**
- 任务 2.1 分别绘制各省份与各城市平台登录次数热力地图，并分析用户分布情况。
- 任务 2.2 分别绘制工作日与非工作日各时段的用户登录次数柱状图，并分析用户活跃的主要时间段。
- 任务 2.3 记𝑇𝑇𝑒𝑒𝑒𝑒𝑒𝑒为数据观察窗口截止时间（如：赛题数据的采集截止时间为2020 年 6 月 18 日），𝑇𝑇𝑖𝑖为用户 i 的最近访问时间，𝜎𝜎𝑖𝑖 = 𝑇𝑇𝑒𝑒𝑒𝑒𝑒𝑒 − 𝑇𝑇𝑖𝑖，若𝜎𝜎𝑖𝑖 > 90天，则称用户 i 为流失用户。根据该定义计算平台用户的流失率。
- 任务 2.4 根据任务 2.1 至任务 2.3，分析平台用户的活跃度，为该教育平台的线上管理决策提供建议。
### **任务 3 线上课程推荐**
- 任务 3.1 根据用户参与学习的记录，统计每门课程的参与人数，计算每门课程的受欢迎程度，列出最受欢迎的前 10 门课程，并绘制相应的柱状图。受欢迎程度定义如下：𝛾𝛾𝑖𝑖 = 𝑄𝑄𝑖𝑖 − 𝑄𝑄min𝑄𝑄max− 𝑄𝑄min。其中，𝛾𝛾𝑖𝑖为第 i 门课程的受欢迎程度，𝑄𝑄𝑖𝑖为参与第 i 门课程学习的人数，𝑄𝑄max和𝑄𝑄min分别为所有课程中参与人数最多和最少的课程所对应的人数。
- 任务 3.2 根据用户选择课程情况，构建用户和课程的关系表（二元矩阵），使用基于物品的协同过滤算法计算课程之间的相似度，并结合用户已选课程的记录，为总学习进度最高的 5 名用户推荐 3 门课程。
- 任务 3.3 在任务 3.1 和任务 3.2 的基础上，结合用户学习进度数据，分析付费课程和免费课程的差异，给出线上课程的综合推荐策略。

## 研究思路及分析过程
### 任务一：数据预处理
- 缺失情况分析
> 数值为0/空值的情况需要分开讨论，且关注缺失数据是否为真实缺失
- 异常情况分析
> 对出现“--”的情况进行分析，且关注该符号的实际意义以及占比情况
- 重复情况分析
对于重复数据进行删除
### 任务二：用户整体情况分析
- 用户分布分析
> 根据海内外、省份分析、乡镇分析入手,找到核心差异点所在
- 用户活跃度分析
> 细分整体情况与工作日差异
- 用户流失情况分析
细分整体情况与用户流失风险
- 线上管理决策建议
宣传、活跃度、流失为切口进行分析
### 任务三：用户课程选择分析
- 用户参与课程情况
> 现有课程选择分析与受欢迎度计算
- 用户课程推荐——基于协同过滤算法
> 基于协同过滤算法进行重点课程推荐
- 收费课程与用户学习进度相关分析
> 线上课程综合推荐策略制定

In [1]:
import pandas as pd
import numpy as np
import datetime
import jieba
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['font.sans-serif'] = ['SimHei'] 
matplotlib.rcParams['font.family']='sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False
from chinese_calendar import is_workday
from pyecharts.charts import Bar

### 1.1缺失值处理
首先判断该缺失值是否为真实缺失。针对不同的数据缺失情况，本次分析将会采用不同的处理方式：

- 1、针对数值为 0 的情况，需要进行实际的分析，回归到原始数据中去，判断该数据为 0 时是否具有实际意义。如果没有就将其作为缺失值做删除处理
- 2、针对数据为空值的情况，如果该特征数据缺失情况低于 10%，则结合该特征的重要性进行综合判断。如果字段重要性较低，则考虑直接删除，如果字段重要性较高，则进行插值法或者采用数据均值进行填补

### 1.2重复值处理
在完成缺失数据和异常数据处理之后，对数据进行重复值的删除处理。此处的重复值是指在数据表中用于分析的各个字段均一致。

### 1.3异常值处理

In [2]:
#utf-8解析错误，无缺失值
login = pd.read_csv(r'login.csv', encoding='gbk')
login.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387144 entries, 0 to 387143
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   user_id      387144 non-null  object
 1   login_time   387144 non-null  object
 2   login_place  387144 non-null  object
dtypes: object(3)
memory usage: 8.9+ MB


In [3]:
#一天内重复登录的行为我们将视为一次登录即可，以减少数据量，因此需要对login_time进行日期转化
def time_to_date(df, column='login_time'):
    df[column] = pd.to_datetime(df[column]).apply(lambda x:x.strftime('%Y-%m-%d'))
    return df
login = time_to_date(login)
login.drop_duplicates(inplace=True)
login.head(5)                                             

Unnamed: 0,user_id,login_time,login_place
0,用户3,2018-09-06,中国广东广州
1,用户3,2018-09-07,中国广东广州
5,用户3,2018-09-10,中国广东广州
8,用户3,2018-09-10,中国北京
10,用户3,2018-09-10,中国广东


In [4]:
#根据截止日期2020-06-18计算流失时长
login['last_date_gap'] = pd.to_datetime('2020-06-18')- pd.to_datetime(login['login_time'])
login.head(5)  

Unnamed: 0,user_id,login_time,login_place,last_date_gap
0,用户3,2018-09-06,中国广东广州,651 days
1,用户3,2018-09-07,中国广东广州,650 days
5,用户3,2018-09-10,中国广东广州,647 days
8,用户3,2018-09-10,中国北京,647 days
10,用户3,2018-09-10,中国广东,647 days


In [5]:
# 重新排序
login=login.reset_index()
del login['index']
#根据login_place进行切分国家省份城市
# 改进版本
for i in range(login.shape[0]):
    if login.loc[i,'login_place'][0:2]=='中国':
        login.loc[i,'国家']='中国'
        if '黑龙江' in login.loc[i,'login_place']:
            login.loc[i,'省份']='黑龙江'
            if len(login.loc[i,'login_place'])>5:
                login.loc[i,'地区']=login.loc[i,'login_place'][5:]
            else:pass
        if '新疆维吾尔' in login.loc[i,'login_place']:
            login.loc[i,'省份']='新疆维吾尔'
            if len(login.loc[i,'login_place'])>7:
                login.loc[i,'地区']=login.loc[i,'login_place'][7:]
            else:pass
        if '内蒙古' in login.loc[i,'login_place']:
            login.loc[i,'省份']='内蒙古'
            if len(login.loc[i,'login_place'])>5:
                login.loc[i,'地区']=login.loc[i,'login_place'][5:]
            else:pass
        else:
            login.loc[i,'省份']=login.loc[i,'login_place'][2:4]
            login.loc[i,'地区']=login.loc[i,'login_place'][4:]   
    else:
        li=[word for word in jieba.cut(login.iloc[i,2])]
        if len(li)==2:
            login.loc[i,'国家']=li[0]
            login.loc[i,'省份']=li[1]
        else:
            login.loc[i,'国家']=li[0]
    if i%10000==0:
        print(f'{round(i*100/(int(login.shape[0])),2)}%')
login.head(5)


Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Lenovo\AppData\Local\Temp\jieba.cache


0.0%


Loading model cost 0.408 seconds.
Prefix dict has been built successfully.


3.74%
7.47%
11.21%
14.94%
18.68%
22.41%
26.15%
29.88%
33.62%
37.35%
41.09%
44.82%
48.56%
52.29%
56.03%
59.76%
63.5%
67.23%
70.97%
74.7%
78.44%
82.17%
85.91%
89.64%
93.38%
97.11%


Unnamed: 0,user_id,login_time,login_place,last_date_gap,国家,省份,地区
0,用户3,2018-09-06,中国广东广州,651 days,中国,广东,广州
1,用户3,2018-09-07,中国广东广州,650 days,中国,广东,广州
2,用户3,2018-09-10,中国广东广州,647 days,中国,广东,广州
3,用户3,2018-09-10,中国北京,647 days,中国,北京,
4,用户3,2018-09-10,中国广东,647 days,中国,广东,


In [6]:
login.to_csv('user_area_info.csv',index=False)

In [7]:
#处理study_information表
#price有部分缺失
study_info = pd.read_csv(r'study_information.csv', encoding='gbk')
study_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194974 entries, 0 to 194973
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_id           194974 non-null  object 
 1   course_id         194974 non-null  object 
 2   course_join_time  194974 non-null  object 
 3   learn_process     194974 non-null  object 
 4   price             190736 non-null  float64
dtypes: float64(1), object(4)
memory usage: 7.4+ MB


In [8]:
#查看price缺失部分情况
study_info[study_info.price.isnull()]
#查看缺失price的课程有几门:'课程96','课程51'
study_info[study_info.price.isnull()]['course_id'].unique()
#查看是否这些课程的所有价格都是缺失的？:是的
## way1
# study_info[study_info.course_id == '课程96']['price'].unique()
# study_info[study_info.course_id == '课程51']['price'].unique()
## way2,还能检查是否有差异化定价：没有
course_price = study_info.groupby(['course_id']).agg({'price':['max','min']})
course_price[course_price['price']['max']-  course_price['price']['min'] != 0] 
#暂时不做处理

Unnamed: 0_level_0,price,price
Unnamed: 0_level_1,max,min
course_id,Unnamed: 1_level_2,Unnamed: 2_level_2
课程51,,
课程96,,


In [9]:
#转日期
study_info = time_to_date(study_info, 'course_join_time')
#将进度转为数字
study_info['learn_process'] = study_info['learn_process'].apply(lambda x:int(x.split(':')[1].split('%')[0].strip()))
study_info.head(5)   

Unnamed: 0,user_id,course_id,course_join_time,learn_process,price
0,用户3,课程106,2020-04-21,0,0.0
1,用户3,课程136,2020-03-05,1,0.0
2,用户3,课程205,2018-09-10,63,0.0
3,用户4,课程26,2020-03-31,0,319.0
4,用户4,课程34,2020-03-31,0,299.0


In [10]:
#处理user表
users = pd.read_csv('users.csv', encoding='gbk')
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43983 entries, 0 to 43982
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 43916 non-null  object 
 1   register_time           43983 non-null  object 
 2   recently_logged         43983 non-null  object 
 3   number_of_classes_join  43983 non-null  int64  
 4   number_of_classes_out   43983 non-null  int64  
 5   learn_time              43983 non-null  float64
 6   school                  10571 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 2.3+ MB


In [11]:
#由于用户ID是唯一标识，难以填充，因此删除缺失行
#school为非必要字段，可以暂时不处理
users = users[users.user_id.notnull()]
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43916 entries, 0 to 43982
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 43916 non-null  object 
 1   register_time           43916 non-null  object 
 2   recently_logged         43916 non-null  object 
 3   number_of_classes_join  43916 non-null  int64  
 4   number_of_classes_out   43916 non-null  int64  
 5   learn_time              43916 non-null  float64
 6   school                  10569 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 2.7+ MB


In [12]:
#根据school是为为空新生成一个字段：is_school
users['is_school'] = users['school']
users.is_school[users['school'].isnull()] = 0
users.is_school[~users['school'].isnull()] = 1
users.is_school.value_counts()                                          

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  users.is_school[users['school'].isnull()] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  users.is_school[~users['school'].isnull()] = 1


0    33347
1    10569
Name: is_school, dtype: int64

In [13]:
#查看recently_logger中的异常值'--'
users[users['recently_logged'] == '--']

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school
11,用户44240,2020/6/17 17:25,--,1,0,1667.28,,0
12,用户44239,2020/6/17 17:24,--,1,0,2109.75,,0
14,用户44235,2020/6/17 16:39,--,1,0,0.00,,0
15,用户44237,2020/6/17 16:39,--,1,0,10348.62,,0
16,用户44232,2020/6/17 16:39,--,1,0,9054.72,,0
...,...,...,...,...,...,...,...,...
43772,用户214,2018/10/25 20:46,--,0,0,0.00,,0
43789,用户197,2018/10/25 19:53,--,0,0,3.10,,0
43834,用户151,2018/10/25 18:26,--,0,0,0.00,,0
43868,用户117,2018/10/25 17:47,--,0,0,0.00,,0


In [14]:
# 缺失的login_time可以考虑用最近一次login_time计算
#考虑先merge login和study_info
user_login_recently = login.groupby(['user_id'], as_index=False)['login_time'].max()
users_merge = pd.merge(users, user_login_recently,how='left', on='user_id')
users_merge.head(5)                                    

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time
0,用户44251,2020/6/18 9:49,2020/6/18 9:49,0,0,41.25,,0,
1,用户44250,2020/6/18 9:47,2020/6/18 9:48,0,0,0.0,,0,2020-06-18
2,用户44249,2020/6/18 9:43,2020/6/18 9:43,0,0,16.22,,0,2020-06-18
3,用户44248,2020/6/18 9:09,2020/6/18 9:09,0,0,0.0,,0,2020-06-18
4,用户44247,2020/6/18 7:41,2020/6/18 8:15,0,0,1.8,,0,2020-06-18


In [15]:
#理论上recently_logged缺失有两种情况：1.注册后从未登录，2.抓取数据的时候还未退出
users_merge[users_merge['recently_logged'] == '--']

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time
11,用户44240,2020/6/17 17:25,--,1,0,1667.28,,0,
12,用户44239,2020/6/17 17:24,--,1,0,2109.75,,0,
14,用户44235,2020/6/17 16:39,--,1,0,0.00,,0,
15,用户44237,2020/6/17 16:39,--,1,0,10348.62,,0,
16,用户44232,2020/6/17 16:39,--,1,0,9054.72,,0,
...,...,...,...,...,...,...,...,...,...
43705,用户214,2018/10/25 20:46,--,0,0,0.00,,0,
43722,用户197,2018/10/25 19:53,--,0,0,3.10,,0,2018-12-23
43767,用户151,2018/10/25 18:26,--,0,0,0.00,,0,
43801,用户117,2018/10/25 17:47,--,0,0,0.00,,0,


In [16]:
#将数据分成两部分，处理更快
users_merge = time_to_date(users_merge, 'register_time')
                                          
users_ready = users_merge[users_merge['recently_logged'] != '--']
users_process = users_merge[users_merge['recently_logged'] == '--']



In [17]:
#注册后未登录过的标记为注册时间, 有登录过的则通过登录时间+round(学习时间/(60*8)计算
columns = users_process.columns.to_list()
def get_logged_date(x):
    if type(x[columns.index('login_time')]) == float and pd.isna(x[columns.index('login_time')]):
        return x[columns.index('register_time')]
    else:
        if pd.to_datetime(x[columns.index('login_time')])+datetime.timedelta(days=int(x[columns.index('learn_time')])/480) > pd.to_datetime('2020-06-18'):
            return pd.to_datetime('2020-06-18')
        else:
            return pd.to_datetime(pd.to_datetime(x[columns.index('login_time')])+datetime.timedelta(days=int(x[columns.index('learn_time')])/480))
                                                           
                                                                                                                   
users_process['recently_logged'] = users_process.apply(lambda x:get_logged_date(x), axis=1)
users_process = time_to_date(users_process, 'recently_logged')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  users_process['recently_logged'] = users_process.apply(lambda x:get_logged_date(x), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column] = pd.to_datetime(df[column]).apply(lambda x:x.strftime('%Y-%m-%d'))


In [18]:
users_info = pd.concat([users_ready, users_process])
users_info.head(5)

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time
0,用户44251,2020-06-18,2020/6/18 9:49,0,0,41.25,,0,
1,用户44250,2020-06-18,2020/6/18 9:48,0,0,0.0,,0,2020-06-18
2,用户44249,2020-06-18,2020/6/18 9:43,0,0,16.22,,0,2020-06-18
3,用户44248,2020-06-18,2020/6/18 9:09,0,0,0.0,,0,2020-06-18
4,用户44247,2020-06-18,2020/6/18 8:15,0,0,1.8,,0,2020-06-18


In [19]:
#以当前时间为基准，计算登录注册与当前的时间差
users_info = time_to_date(users_info, 'recently_logged')
users_info['register_logged_time'] = pd.to_datetime(users_info['recently_logged']) - pd.to_datetime(users_info['register_time'])
users_info['logged_now_time'] = pd.to_datetime('2020-06-18') - pd.to_datetime(users_info['recently_logged'])
users_info['regiter_now_time'] = pd.to_datetime('2020-06-18') - pd.to_datetime(users_info['register_time'])
users_info.head(5)

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time,register_logged_time,logged_now_time,regiter_now_time
0,用户44251,2020-06-18,2020-06-18,0,0,41.25,,0,,0 days,0 days,0 days
1,用户44250,2020-06-18,2020-06-18,0,0,0.0,,0,2020-06-18,0 days,0 days,0 days
2,用户44249,2020-06-18,2020-06-18,0,0,16.22,,0,2020-06-18,0 days,0 days,0 days
3,用户44248,2020-06-18,2020-06-18,0,0,0.0,,0,2020-06-18,0 days,0 days,0 days
4,用户44247,2020-06-18,2020-06-18,0,0,1.8,,0,2020-06-18,0 days,0 days,0 days


In [20]:
#计算当前加入的课程数
users_info['number_of_class_now'] = users_info['number_of_classes_join'] - users_info['number_of_classes_out']

### 当前数据预处理已完成

In [21]:
#计算选课数量，根据每个user和course出现的次数
def nx_data(df=study_info,group_name=['course_id','user_id']):
    # 得到共现字典
    user_dic={}  
    stu_info_data=df.groupby(group_name)['course_id'].count().unstack()
    column=stu_info_data.columns.tolist()
    for i in range(stu_info_data.shape[0]):
        user_dic[column[i]]=stu_info_data[stu_info_data[column[i]]==1].index.tolist()
        
    #构造共现矩阵
    course_name=list(set(df['course_id'].values.tolist()))
    course_data=pd.DataFrame(data=np.zeros(shape=(len(course_name),len(course_name))),index=course_name,columns=course_name)
    for value in user_dic.values():
        if len(value)==1:
            pass
        else:
            for i in range(len(value)):
                for j in range(i+1,len(value)):
                    course_data.loc[value[i],value[j]]+=1
    return (user_dic,course_data)

user_dic,course_data=nx_data()

for i,key in enumerate(user_dic.keys()):
    users_info.loc[i,'选课数量']=len(user_dic[key])
users_info

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time,register_logged_time,logged_now_time,regiter_now_time,number_of_class_now,选课数量
0,用户44251,2020-06-18,2020-06-18,0,0,41.25,,0,,0 days,0 days,0 days,0,2.0
1,用户44250,2020-06-18,2020-06-18,0,0,0.00,,0,2020-06-18,0 days,0 days,0 days,0,1.0
2,用户44249,2020-06-18,2020-06-18,0,0,16.22,,0,2020-06-18,0 days,0 days,0 days,0,1.0
3,用户44248,2020-06-18,2020-06-18,0,0,0.00,,0,2020-06-18,0 days,0 days,0 days,0,1.0
4,用户44247,2020-06-18,2020-06-18,0,0,1.80,,0,2020-06-18,0 days,0 days,0 days,0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43705,用户214,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,
43722,用户197,2018-10-25,2018-12-23,0,0,3.10,,0,2018-12-23,59 days,543 days,602 days,0,
43767,用户151,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,
43801,用户117,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,


In [22]:
login.columns

Index(['user_id', 'login_time', 'login_place', 'last_date_gap', '国家', '省份',
       '地区'],
      dtype='object')

### 地区合并
取最近的登录地点合并


In [24]:
#将用户按照登陆时间远近排序，将最近的一次置顶
login_recently_are = login.sort_values(by=['user_id', 'last_date_gap'])
#只取最近一次登录数据
login_del = login.user_id.drop_duplicates()
login_diff = login.iloc[login_del.index,:]

#合并数据
info_all = pd.merge(users_info, login_diff, how='left', on='user_id')
info_all.to_csv('全部信息.csv')
info_all.reset_index(inplace=True)
info_all.drop(columns=['index'], inplace=True)
info_all

Unnamed: 0,user_id,register_time,recently_logged,number_of_classes_join,number_of_classes_out,learn_time,school,is_school,login_time_x,register_logged_time,logged_now_time,regiter_now_time,number_of_class_now,选课数量,login_time_y,login_place,last_date_gap,国家,省份,地区
0,用户44251,2020-06-18,2020-06-18,0,0,41.25,,0,,0 days,0 days,0 days,0,2.0,,,NaT,,,
1,用户44250,2020-06-18,2020-06-18,0,0,0.00,,0,2020-06-18,0 days,0 days,0 days,0,1.0,2020-06-18,中国江西南昌,0 days,中国,江西,南昌
2,用户44249,2020-06-18,2020-06-18,0,0,16.22,,0,2020-06-18,0 days,0 days,0 days,0,1.0,2020-06-18,中国北京,0 days,中国,北京,
3,用户44248,2020-06-18,2020-06-18,0,0,0.00,,0,2020-06-18,0 days,0 days,0 days,0,1.0,2020-06-18,中国天津,0 days,中国,天津,
4,用户44247,2020-06-18,2020-06-18,0,0,1.80,,0,2020-06-18,0 days,0 days,0 days,0,1.0,2020-06-18,中国湖北武汉,0 days,中国,湖北,武汉
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43911,用户214,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,,,,NaT,,,
43912,用户197,2018-10-25,2018-12-23,0,0,3.10,,0,2018-12-23,59 days,543 days,602 days,0,,2018-12-23,中国山东,543 days,中国,山东,
43913,用户151,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,,,,NaT,,,
43914,用户117,2018-10-25,2018-10-25,0,0,0.00,,0,,0 days,602 days,602 days,0,,,,NaT,,,


### 分析阶段
#### 用户区域分析

In [25]:
#先看看总体国家分析
login.国家.value_counts()

中国    267582
英国        84
德国        22
越南        11
荷兰         8
波兰         7
南非         3
捷克         2
泰国         2
挪威         1
瑞典         1
瑞士         1
希腊         1
Name: 国家, dtype: int64

In [86]:
from pyecharts.globals import CurrentConfig, NotebookType
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB
from pyecharts.charts import Line, Pie, Grid
import pyecharts.options as opts
#非中国部分分析
country_index = login[login['国家'] != '中国'].国家.value_counts().index.tolist()
country_value = login[login['国家'] != '中国'].国家.value_counts().values.tolist()

line = (
    Line()
    .add_xaxis(country_index)
    .add_yaxis('count', country_value,
               markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_=["max"])]),
               # markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_=["min"])]),
               markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")])
              )
    .set_global_opts(title_opts=opts.TitleOpts(title="country count"))    
)
# line.load_javascript()
line.render_notebook()

In [113]:
pie = (
    Pie()
    .add(series_name='county', data_pair=[list(z) for z in zip(country_index, country_value)],
        radius=[105, 165],
        center=["50%", "50%"],
        rosetype="radius"
        )
    .set_global_opts(legend_opts=opts.LegendOpts(is_show=False))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c} ({d}%)"))
)
pie.render_notebook()

In [120]:
#再看国内省份的分布
login=pd.read_csv('user_area_info.csv',index_col=0)
province_index = login[login['国家'] == '中国'].省份.value_counts().index.tolist()
province_value = login[login['国家'] == '中国'].省份.value_counts().values.tolist()

bar = (
    Bar()
    .add_xaxis(province_index)
    .add_yaxis('count', province_value,
              markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_=["max"])]),
              markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")])
              )
)
bar.render_notebook()

In [127]:
#地图展示
from pyecharts.charts import Map
from pyecharts.commons.utils import JsCode
map = (
    Map()
    .add('', [list(z) for z in zip(province_index, province_value)], 'china',
        itemstyle_opts={
                "normal": {"areaColor": "#323c48", "borderColor": "#404a59"},
                "emphasis": {
                    # "label": {"show": Timeline},
                    "areaColor": "rgba(255,255,255, 0.5)",
                },
        },)
    .set_global_opts(
            title_opts=opts.TitleOpts(
                subtitle="",
                pos_left="center",
                pos_top="top",
                title_textstyle_opts=opts.TextStyleOpts(
                    font_size=25, color="rgba(255,255,255, 0.9)"
                ),
            ),
            tooltip_opts=opts.TooltipOpts(
                is_show=True,
                formatter=JsCode(
                    """function(params) {
                    if ('value' in params.data) {
                        return params.data.value[2] + ': ' + params.data.value[0];
                    }
                }"""
                ),
            ),)
)
map.render('province_count_map.html')

'D:\\书籍笔记\\数据分析\\项目\\自身项目\\教育平台线上课程用户行为\\province_count_map.html'

In [138]:
from pyecharts.charts import Pie
pie = (
    Pie()
    .add(
            '',
            data_pair=[list(z) for z in zip(province_index, province_value)],              
            radius = [120,50],           #环形内外圆的半径
            rosetype="radius", #玫瑰饼图          
        )
    .set_global_opts(legend_opts=opts.LegendOpts(is_show=False)) #''：图例名（不使用图例）
)
pie.render_notebook()

In [141]:
info_all.columns

Index(['user_id', 'register_time', 'recently_logged', 'number_of_classes_join',
       'number_of_classes_out', 'learn_time', 'school', 'is_school',
       'login_time_x', 'register_logged_time', 'logged_now_time',
       'regiter_now_time', 'number_of_class_now', '选课数量', 'login_time_y',
       'login_place', 'last_date_gap', '国家', '省份', '地区'],
      dtype='object')

In [143]:
info_all.groupby(['省份']).agg({'learn_time':['sum','mean','count'],'number_of_class_now':['sum','mean'],'选课数量':['sum','mean']})

Unnamed: 0_level_0,learn_time,learn_time,learn_time,number_of_class_now,number_of_class_now,选课数量,选课数量
Unnamed: 0_level_1,sum,mean,count,sum,mean,sum,mean
省份,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
,512257.45,862.386279,594,568,0.956229,1.0,1.0
上海,452444.64,319.974993,1414,211,0.149222,5.0,1.25
云南,237193.66,513.40619,462,260,0.562771,15.0,2.142857
内蒙古,147719.3,645.062445,229,173,0.755459,0.0,
北京,352498.33,188.602638,1869,179,0.095773,5.0,1.0
台湾,12910.27,280.658043,46,11,0.23913,0.0,
吉林,130434.7,430.477558,303,95,0.313531,2.0,1.0
四川,957668.17,600.79559,1594,808,0.506901,8.0,1.142857
天津,130559.82,324.775672,402,102,0.253731,8.0,2.0
宁夏,125475.36,836.5024,150,104,0.693333,5.0,1.25


### 用户活跃度分析

In [147]:
time = info_all[info_all['recently_logged']>'2020-01-01'].groupby(by='recently_logged').user_id.count().index.tolist()
count = info_all[info_all['recently_logged']>'2020-01-01'].groupby(by='recently_logged').user_id.count().values.tolist()


line = (
    Line()
    .add_xaxis(time)
    .add_yaxis('count', count,
               markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_=["max"])]),
               # markpoint_opts=opts.MarkPointOpts(data=[opts.MarkPointItem(type_=["min"])]),
               markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")])
              )
    .set_global_opts(title_opts=opts.TitleOpts(title="country count"))    
)
# line.load_javascript()
line.render_notebook()

In [150]:
#可以看到，上述活跃度有异常点，接着对该点进行分析，该天为'2020-06-11'
info_all[info_all['recently_logged'] == '2020-06-11'].describe()
#看结果应该是做了推广

Unnamed: 0,number_of_classes_join,number_of_classes_out,learn_time,register_logged_time,logged_now_time,regiter_now_time,number_of_class_now,选课数量,last_date_gap
count,2489.0,2489.0,2489.0,2489,2489,2489,2489.0,12.0,258
mean,1.030936,0.004821,257.961937,7 days 08:44:44.451586982,7 days 00:00:00,14 days 08:44:44.451586982,1.026115,1.5,69 days 06:41:51.627906977
std,0.368918,0.074858,1692.174817,38 days 11:39:57.646745670,0 days 00:00:00,38 days 11:39:57.646745670,0.351445,1.732051,87 days 22:01:11.613348387
min,0.0,0.0,0.0,0 days 00:00:00,7 days 00:00:00,7 days 00:00:00,0.0,1.0,7 days 00:00:00
25%,1.0,0.0,0.0,0 days 00:00:00,7 days 00:00:00,7 days 00:00:00,1.0,1.0,18 days 00:00:00
50%,1.0,0.0,0.0,0 days 00:00:00,7 days 00:00:00,7 days 00:00:00,1.0,1.0,35 days 12:00:00
75%,1.0,0.0,0.0,0 days 00:00:00,7 days 00:00:00,7 days 00:00:00,1.0,1.0,107 days 18:00:00
max,7.0,2.0,58530.88,623 days 00:00:00,7 days 00:00:00,630 days 00:00:00,7.0,7.0,630 days 00:00:00


In [153]:
#6月11日有一个异常点
#非省份因素
info_all[info_all['recently_logged']=='2020-06-11'].groupby(['省份']).user_id.count()
# 筛选出填写信息差异较大，因此考虑是当时进行学校注册优惠活动？
info_all[info_all['recently_logged']=='2020-06-11'].is_school.value_counts()

1    2293
0     196
Name: is_school, dtype: int64

### 用户流失分析

In [154]:
info_all.groupby(['省份','logged_now_time']).user_id.count().unstack()

logged_now_time,0 days,1 days,2 days,3 days,4 days,5 days,6 days,7 days,8 days,9 days,...,611 days,612 days,614 days,619 days,623 days,624 days,625 days,627 days,628 days,646 days
省份,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
,,13.0,12.0,6.0,6.0,5.0,4.0,7.0,1.0,5.0,...,,,,,,,,,,
上海,,4.0,4.0,3.0,7.0,,4.0,9.0,3.0,2.0,...,,,,,,,,,,
云南,,14.0,16.0,14.0,2.0,2.0,1.0,16.0,8.0,7.0,...,,,,,,,,,,
内蒙古,,7.0,2.0,2.0,,1.0,1.0,,,2.0,...,,,,,,,,,,
北京,4.0,6.0,,3.0,1.0,,2.0,7.0,1.0,1.0,...,,,,,,,,,1.0,
台湾,,,,,,,1.0,,,1.0,...,,,,,,,,,,
吉林,,3.0,2.0,1.0,,,3.0,,,,...,,,,,,,,,,
四川,3.0,8.0,13.0,7.0,5.0,11.0,6.0,8.0,3.0,8.0,...,,,,,,,,,,
天津,1.0,1.0,3.0,1.0,,1.0,,1.0,1.0,,...,,,,,,,,1.0,,
宁夏,1.0,7.0,8.0,4.0,2.0,1.0,5.0,7.0,1.0,,...,,,,,,,,,,


In [158]:
for i in range(info_all.shape[0]):
    if int(str(info_all.loc[i,'logged_now_time'])[:-14]) > 150:        
        info_all.loc[i,'流失时间划分']='大于150天'
    elif 90 <= int(str(info_all.loc[i,'logged_now_time'])[:-14]) < 150:        
        info_all.loc[i,'流失时间划分']='大于90天'
    elif 30 <= int(str(info_all.loc[i,'logged_now_time'])[:-14]) < 90:        
        info_all.loc[i,'流失时间划分']='大于30天'
    elif 15 <= int(str(info_all.loc[i,'logged_now_time'])[:-14]) < 30:        
        info_all.loc[i,'流失时间划分']='大于15天'
    elif 7 <= int(str(info_all.loc[i,'logged_now_time'])[:-14]) < 15:        
        info_all.loc[i,'流失时间划分']='大于7天'
    elif 0 <= int(str(info_all.loc[i,'logged_now_time'])[:-14]) < 7:        
        info_all.loc[i,'流失时间划分']='7天内'
info_all.groupby(['省份','流失时间划分']).user_id.count().unstack()

流失时间划分,7天内,大于150天,大于15天,大于30天,大于7天,大于90天
省份,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
,46.0,79.0,33.0,257.0,30.0,149.0
上海,22.0,1140.0,24.0,130.0,28.0,70.0
云南,49.0,172.0,32.0,117.0,43.0,49.0
内蒙古,13.0,62.0,10.0,90.0,4.0,50.0
北京,16.0,1542.0,18.0,157.0,16.0,119.0
台湾,1.0,33.0,,8.0,2.0,2.0
吉林,9.0,160.0,5.0,65.0,2.0,61.0
四川,53.0,811.0,48.0,417.0,40.0,224.0
天津,7.0,278.0,7.0,73.0,4.0,33.0
宁夏,28.0,31.0,10.0,32.0,11.0,38.0


In [167]:

users_province = info_all.groupby(['省份'])['user_id'].nunique()
leave_rate_province = pd.merge(leave_num, users_province, how='left', on='省份')
leave_rate_province

Unnamed: 0_level_0,user_id_x,user_id_y
省份,Unnamed: 1_level_1,Unnamed: 2_level_1
,46,594
,79,594
,33,594
,257,594
,30,594
...,...,...
黑龙,213,446
黑龙,24,446
黑龙,117,446
黑龙,6,446
