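# 0. Introduction
Is there a shortcut to learning data analysis? If there is, it is probably Kaggle.

Besides its many competitions, Kaggle hosts a large number of datasets of all sizes, and they make excellent practice material.  
The datasets range from a few KB to more than 10 GB, so you can download whatever suits your needs. Download page: https://www.kaggle.com/datasets  
I recommend filtering for "Only Datasets with Tasks". These datasets come with tasks, and each task has solutions submitted by other users, so every dataset effectively doubles as a tutorial.
![Illustration](img/1.png)

The dataset used here covers apps on the Google Play Store and their download figures, and contains more than 70,000 rows.  
Download page: https://www.kaggle.com/lava18/google-play-store-apps/tasks?taskId=276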


# 1. Dataset preprocessing

In [52]:
import pandas as pd
import numpy as np
import matplotlib 
import matplotlib.pyplot as plt
import datetime

## 1.1 Reading the data

In [2]:
location1 = r'dateset_kaggle/googleplaystore.csv'
df1 = pd.read_csv(location1)
df1.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [3]:
location2 = r'dateset_kaggle/googleplaystore_user_reviews.csv'
df2 = pd.read_csv(location2)
df2.tail()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
64290,Houzz Interior Design Ideas,,,,
64291,Houzz Interior Design Ideas,,,,
64292,Houzz Interior Design Ideas,,,,
64293,Houzz Interior Design Ideas,,,,
64294,Houzz Interior Design Ideas,,,,


In [4]:
df1.info()
#df1.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


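After loading the data, the first step is to get an overview of it.

This already reveals several format problems in the raw data.  
The Reviews and Installs columns are stored as strings.  
The Price column carries a $ symbol, contains the stray value Everyone, and is stored as strings.  
The Size column carries the units M and k, and also contains Varies with device.  
The Last Updated column stores dates as text such as "July 25, 2018" rather than as a datetime type.  
The Current Ver column contains Varies with device and is stored as strings.  
The Android Ver column contains Varies with device and is stored as strings.

So the data needs some basic cleaning first. Later, specific tasks may call for additional, task-specific processing.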
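
The overview step can also be automated. The following is a minimal sketch, assuming a small made-up sample frame rather than the real CSV; the helper name `summarize_columns` is hypothetical. It collects each column's dtype, missing-value count, and a few distinct values, which is usually enough to spot stray units and placeholder strings.

```python
import pandas as pd

# Made-up sample that mimics a few of the raw columns; not the real data.
sample = pd.DataFrame({
    "Reviews": ["38", "4", "398307"],
    "Installs": ["5,000+", "100+", "10,000,000+"],
    "Price": ["0", "$4.99", "Everyone"],
    "Rating": [4.5, 5.0, None],
})

def summarize_columns(df, max_values=5):
    """Return one row per column: dtype, missing count, and a few distinct values."""
    rows = []
    for col in df.columns:
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_missing": int(df[col].isnull().sum()),
            "examples": df[col].dropna().unique()[:max_values].tolist(),
        })
    return pd.DataFrame(rows)

summary = summarize_columns(sample)
print(summary)
```

On the real data, calling `summarize_columns(df1)` should surface the same kinds of problems listed above.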

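## 1.2 Converting data formats

### 1.2.1 The Price column: drop the anomalous value, strip the $, convert to numeric

Use df1['Price'].unique() to list the distinct values in the Price column.  
Three problems show up:  
1) There is an anomalous value, "Everyone".  
2) Values carry the $ symbol.  
3) All values are strings but should be numeric.  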

In [5]:
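# 1) Drop the anomalous value

# First check how many rows have the value "Everyone", to judge how much dropping them affects the data.
df1.groupby(['Price']).size()['Everyone']  # returns 1: only one 'Everyone' row, so just drop it.

# Drop the row whose 'Price' is "Everyone" by negating the isin() mask.
# .copy() makes df1 an independent frame, so the later column assignments do not raise SettingWithCopyWarning.
df1 = df1[~df1['Price'].isin(['Everyone'])].copy()

df1['Price'].unique()   # check that the value is gone

# 2) Strip the $ from the 'Price' column
df1['Price'] = df1['Price'].str.strip('$')

# 3) Convert the 'Price' column to a numeric type
df1['Price'] = df1['Price'].astype('float')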
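
When the stray values are not known in advance, one option is to let pandas find them. The following is a minimal sketch on a made-up Series (not the real column): `pd.to_numeric` with `errors="coerce"` turns anything that is not a number into NaN, and the NaN mask then points back at the offending raw values.

```python
import pandas as pd

# Made-up sample values; not taken from the real dataset.
prices = pd.Series(["0", "$4.99", "Everyone", "$1.49"])

stripped = prices.str.lstrip("$")                    # remove the leading currency symbol
numeric = pd.to_numeric(stripped, errors="coerce")   # non-numeric strings become NaN

bad_values = prices[numeric.isna()].tolist()         # raw values that failed to convert
clean = numeric.dropna()                             # numeric prices only

print(bad_values)
print(clean.tolist())
```

This is an alternative to the explicit `isin(['Everyone'])` filter, not a replacement for inspecting the bad rows before dropping them.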

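### 1.2.2 The Size column: replace the anomalous value, unify the units, strip the unit, convert to numeric

Use df1['Size'].unique() to list the distinct values in the Size column.

Two problems show up:  
1) There is an anomalous value, 'Varies with device'. (Open question: why did it not stand out in the output of df1['Size'].unique()?)  
2) The units are mixed (M and k) and need to be unified; the unit suffix then has to be removed, and the values, which are all strings, converted to numbers.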

In [6]:
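# 1) Handle the anomalous value

#df1.groupby(['Size']).size()['Varies with device']   
# Returns 1695. That is too many rows to drop, so the value is replaced with '0' instead.

df1['Size'] = df1['Size'].replace('Varies with device','0') # replace 'Varies with device' with the string '0'
'0'  in  df1['Size'].values.tolist()   # check that '0' is now present
df1.groupby(['Size']).size()['0']      # count the replaced rows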
1695

In [7]:
# 2) Unify the units (1 M = 1024 k), then strip the unit and convert to float (values end up in MB).
def convert_K2M(item): 
    if 'k' in item :
        item = round(float(item.replace('k',''))/1024 ,2)
    else:
        item = round(float(item.replace('M','')),2)     
    return item 

df1['Size']= df1['Size'].apply(convert_K2M)
df1['Size'].unique()

array([1.9e+01, 1.4e+01, 8.7e+00, 2.5e+01, 2.8e+00, 5.6e+00, 2.9e+01,
       3.3e+01, 3.1e+00, 2.8e+01, 1.2e+01, 2.0e+01, 2.1e+01, 3.7e+01,
       2.7e+00, 5.5e+00, 1.7e+01, 3.9e+01, 3.1e+01, 4.2e+00, 7.0e+00,
       2.3e+01, 6.0e+00, 6.1e+00, 4.6e+00, 9.2e+00, 5.2e+00, 1.1e+01,
       2.4e+01, 0.0e+00, 9.4e+00, 1.5e+01, 1.0e+01, 1.2e+00, 2.6e+01,
       8.0e+00, 7.9e+00, 5.6e+01, 5.7e+01, 3.5e+01, 5.4e+01, 2.0e-01,
       3.6e+00, 5.7e+00, 8.6e+00, 2.4e+00, 2.7e+01, 2.5e+00, 1.6e+01,
       3.4e+00, 8.9e+00, 3.9e+00, 2.9e+00, 3.8e+01, 3.2e+01, 5.4e+00,
       1.8e+01, 1.1e+00, 2.2e+00, 4.5e+00, 9.8e+00, 5.2e+01, 9.0e+00,
       6.7e+00, 3.0e+01, 2.6e+00, 7.1e+00, 3.7e+00, 2.2e+01, 7.4e+00,
       6.4e+00, 3.2e+00, 8.2e+00, 9.9e+00, 4.9e+00, 9.5e+00, 5.0e+00,
       5.9e+00, 1.3e+01, 7.3e+01, 6.8e+00, 3.5e+00, 4.0e+00, 2.3e+00,
       7.2e+00, 2.1e+00, 4.2e+01, 7.3e+00, 9.1e+00, 5.5e+01, 2.0e-02,
       6.5e+00, 1.5e+00, 7.5e+00, 5.1e+01, 4.1e+01, 4.8e+01, 8.5e+00,
       4.6e+01, 8.3e
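
Replacing 'Varies with device' with 0 keeps the rows, but the zeros will pull down any later average of app size. As a hedged alternative, the following sketch marks unknown sizes as NaN instead, which pandas skips in aggregations by default. It runs on a made-up Series, and the helper name `size_to_mb` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample values; not taken from the real dataset.
sizes = pd.Series(["19M", "201k", "Varies with device", "3.6M"])

def size_to_mb(value):
    """Convert '19M' or '201k' to megabytes; anything else becomes NaN."""
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return round(float(value[:-1]) / 1024, 2)
    return np.nan

size_mb = sizes.apply(size_to_mb)

print(size_mb)
print(size_mb.mean())   # NaN entries are ignored by mean()
```

Whether 0 or NaN is the better placeholder depends on the later task; NaN is the safer default for statistics, while 0 can be convenient for plotting.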

### 1.2.3 The Reviews column: change the data type

In [19]:
# Convert to float
df1['Reviews'] = df1['Reviews'].astype('float')

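### 1.2.4 The Installs column: remove the plus sign and commas, change the data type

In [30]:
# 1) Remove the + and , characters from the values.
# Series.replace matches whole cell values, so .str.replace is needed for substrings;
# regex=False treats '+' as a literal character rather than a regex quantifier.
df1['Installs'] = df1['Installs'].str.replace('+','', regex=False).str.replace(',','', regex=False)

In [31]:
# 2) Convert to float
df1['Installs'] = df1['Installs'].astype('float')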
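
The difference between `Series.replace` and `Series.str.replace` matters for this step: the former matches entire cell values by default, while the latter operates on substrings. A minimal sketch on a made-up Series (not the real column) shows the two side by side.

```python
import pandas as pd

# Made-up sample values; the last entry is exactly "+" to show the whole-cell match.
installs = pd.Series(["5,000+", "100+", "+"])

# Series.replace: only a cell that equals "+" exactly is changed.
whole_cell = installs.replace("+", "")

# Series.str.replace: every occurrence of the substring is removed.
substring = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
)

print(whole_cell.tolist())
print(substring.tolist())
```

Only the substring version yields strings such as "5000" that can be cast to a numeric type afterwards.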

### 1.2.5 The Last Updated column: convert to a datetime type

In [None]:
# The raw dates are strings such as 'January 7, 2018' (meaning 2018-01-07).
# ('January 7, 2018').split(' ') returns ['January', '7,', '2018']

def time_update (item):  
    t1 = item.replace(',','').replace(' ','') # gives e.g. 'January72018'
    t2 = datetime.datetime.strptime(t1,'%B%d%Y') # %B is the full English month name
    return t2

df1['Last Updated'] = df1['Last Updated'].apply(time_update )
df1['Last Updated']
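
The same conversion can be done without a row-by-row helper. The following is a minimal sketch on a made-up Series (not the real column), assuming every valid date follows the "Month day, year" pattern: `pd.to_datetime` parses the whole column at once with an explicit format, and `errors="coerce"` turns unparseable entries into NaT instead of raising.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately invalid.
dates = pd.Series(["January 7, 2018", "July 25, 2017", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

print(parsed)
```

Passing the format explicitly keeps the original string intact, so there is no need to strip commas and spaces first.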