# Las Vegas Business Dataset Preprocessing

### Blog post for this notebook: [link](http://xiehongfeng100.github.io/2018/07/31/yelper-dpps-las-vegas-data-preprocessing/)

In [1]:
import pandas as pd

## 1. Load the Data

### 1.1 Load the Business Data

In [2]:
yelp_lv_bizes = pd.read_csv('../../dataset/las_vegas/business/las_vegas_business_with_db_id.csv')

In [3]:
len(yelp_lv_bizes)

26777

In [4]:
yelp_lv_bizes[:5]

Unnamed: 0,db_id,business_id,stars,review_count,latitude,longitude,city
0,4,--9e1ONYQuAa-CB_Rrw7Tw,4.0,1451,36.1232,-115.169,Las Vegas
1,11,--DdmeR16TRb3LsjG0ejrQ,3.0,5,36.1143,-115.171,Las Vegas
2,12,--e8PjCNhEz32pprnPhCwQ,3.5,19,36.1589,-115.133,Las Vegas
3,29,--o5BoU7qYMALeVDK6mwVg,3.5,6,36.1016,-115.132,Las Vegas
4,33,--q7kSBRb0vWC8lSkXFByA,4.0,7,36.0167,-115.173,Las Vegas


### 1.2 Load the Checkin Data and Rename Columns

In [5]:
yelp_lv_cks = pd.read_csv('../../dataset/las_vegas/business/las_vegas_checkin_with_db_id.csv')

In [6]:
len(yelp_lv_cks)

23242

In [7]:
yelp_lv_cks[:5]

Unnamed: 0,business_db_id,count
0,4,2568
1,11,30
2,12,1
3,33,107
4,48,2


In [8]:
# Rename the columns so the checkin data can be merged with the business dataset on the db_id field
yelp_lv_cks = yelp_lv_cks.rename(columns={'business_db_id': 'db_id', 'count': 'checkin_count'})

In [9]:
yelp_lv_cks[:5]

Unnamed: 0,db_id,checkin_count
0,4,2568
1,11,30
2,12,1
3,33,107
4,48,2


## 2. Left Join the Business Dataset with the Checkin Dataset

In [10]:
yelp_lv_bizes = pd.merge(yelp_lv_bizes, yelp_lv_cks, how='left', on=['db_id'])

In [11]:
yelp_lv_bizes[:5]

Unnamed: 0,db_id,business_id,stars,review_count,latitude,longitude,city,checkin_count
0,4,--9e1ONYQuAa-CB_Rrw7Tw,4.0,1451,36.1232,-115.169,Las Vegas,2568.0
1,11,--DdmeR16TRb3LsjG0ejrQ,3.0,5,36.1143,-115.171,Las Vegas,30.0
2,12,--e8PjCNhEz32pprnPhCwQ,3.5,19,36.1589,-115.133,Las Vegas,1.0
3,29,--o5BoU7qYMALeVDK6mwVg,3.5,6,36.1016,-115.132,Las Vegas,
4,33,--q7kSBRb0vWC8lSkXFByA,4.0,7,36.0167,-115.173,Las Vegas,107.0
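
The row with `db_id` 29 has an empty `checkin_count` because that business has no entry in the checkin file. A minimal sketch of how such rows could be spotted, using toy frames built from the first rows printed above rather than the full dataset. The `indicator` and `validate` arguments are additions that the notebook does not use, and `validate='one_to_one'` assumes each `db_id` appears at most once in each file, which the notebook does not state.

```python
import pandas as pd

# Toy frames copied from the first rows shown in the notebook output (not the full data).
bizes = pd.DataFrame({
    'db_id': [4, 11, 12, 29, 33],
    'review_count': [1451, 5, 19, 6, 7],
})
cks = pd.DataFrame({
    'db_id': [4, 11, 12, 33, 48],
    'checkin_count': [2568, 30, 1, 107, 2],
})

# indicator=True adds a `_merge` column describing where each row came from.
# validate='one_to_one' raises a MergeError if db_id is duplicated on either side
# (ASSUMPTION: db_id is unique per file).
merged = pd.merge(bizes, cks, how='left', on='db_id',
                  indicator=True, validate='one_to_one')

# Businesses without any checkin record keep NaN in checkin_count.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['db_id', 'checkin_count']])

# A left join keeps every business row and drops checkin-only ids such as 48.
print(len(merged) == len(bizes))
```

A left join is the natural choice here because every business must stay in the result even when it has never been checked in.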


In [12]:
# After the left join, fill the NA/NaN values in the checkin_count column with 0.0 (the other columns were already checked and contain no NA/NaN values)
yelp_lv_bizes = yelp_lv_bizes.fillna(0)

In [13]:
yelp_lv_bizes[:5]

Unnamed: 0,db_id,business_id,stars,review_count,latitude,longitude,city,checkin_count
0,4,--9e1ONYQuAa-CB_Rrw7Tw,4.0,1451,36.1232,-115.169,Las Vegas,2568.0
1,11,--DdmeR16TRb3LsjG0ejrQ,3.0,5,36.1143,-115.171,Las Vegas,30.0
2,12,--e8PjCNhEz32pprnPhCwQ,3.5,19,36.1589,-115.133,Las Vegas,1.0
3,29,--o5BoU7qYMALeVDK6mwVg,3.5,6,36.1016,-115.132,Las Vegas,0.0
4,33,--q7kSBRb0vWC8lSkXFByA,4.0,7,36.0167,-115.173,Las Vegas,107.0
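
Calling `fillna(0)` on the whole frame is safe only because the other columns were checked beforehand. A slightly more defensive sketch, on toy data, fills just the one column the join can leave empty. The cast back to an integer dtype is an optional extra that the notebook does not perform.

```python
import pandas as pd

# Toy frame mimicking the merged result; NaN marks a business with no checkin record.
merged = pd.DataFrame({
    'db_id': [4, 11, 12, 29, 33],
    'checkin_count': [2568.0, 30.0, 1.0, float('nan'), 107.0],
})

# Fill only checkin_count, so a stray NaN in any other column would stay visible.
# The astype(int) step is optional: it undoes the float conversion caused by NaN.
merged['checkin_count'] = merged['checkin_count'].fillna(0).astype(int)

print(merged)
```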


## 3. Normalize review_count and checkin_count

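`We normalize each business's review_count and checkin_count because the popularity score computed later combines the two, yet they are counted on very different scales, so each must be normalized separately before they are combined.` Specifically, we use [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) from the preprocessing module of sklearn, whose normalization formula is: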

![](images/MinMaxScalerEq.png)

Here we use `lowLimit=0, upLimit=1`, which matches MinMaxScaler's default `feature_range=(0, 1)`.
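
In case the image does not render, the scaling documented for MinMaxScaler can be written as follows. The names `lowLimit` and `upLimit` are this notebook's names for the target range, and the image is assumed to express the same formula.

$$
X_{std} = \frac{X - X_{min}}{X_{max} - X_{min}}, \qquad X_{scaled} = X_{std} \cdot (upLimit - lowLimit) + lowLimit
$$

With `lowLimit = 0` and `upLimit = 1` the second step is the identity, so $X_{scaled} = X_{std}$.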

In [14]:
from sklearn.preprocessing import MinMaxScaler

### 3.1 Normalize review_count to the [0, 1] Range

In [15]:
rc_scaler = MinMaxScaler()

In [16]:
# MinMaxScaler expects a 2-D array, so reshape the column to shape (n_samples, 1)
rc_scaled = rc_scaler.fit_transform(yelp_lv_bizes.review_count.values.reshape(-1, 1))



In [17]:
rc_scaled[:5]

array([[0.19679261],
       [0.00027181],
       [0.0021745 ],
       [0.00040772],
       [0.00054363]])
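
These values can be checked by hand with the min-max formula. The sketch below is plain Python, not the scikit-learn implementation. The minimum (3) and maximum (7361) of `review_count` are not given anywhere in the notebook. They are back-solved from the printed `rc_scaled` values and serve only as an assumption to make the arithmetic concrete.

```python
def min_max_scale(x, data_min, data_max, low_limit=0.0, up_limit=1.0):
    """Map x from [data_min, data_max] onto [low_limit, up_limit]."""
    if data_max == data_min:
        raise ValueError("data_min and data_max must differ")
    x_std = (x - data_min) / (data_max - data_min)
    return x_std * (up_limit - low_limit) + low_limit

# ASSUMPTION: inferred from the printed output, not read from the dataset.
ASSUMED_RC_MIN = 3
ASSUMED_RC_MAX = 7361

# review_count of the first five businesses, as printed earlier in the notebook.
review_counts = [1451, 5, 19, 6, 7]
scaled = [min_max_scale(rc, ASSUMED_RC_MIN, ASSUMED_RC_MAX) for rc in review_counts]

for rc, s in zip(review_counts, scaled):
    print(f"{rc:>5} -> {s:.8f}")
```

Under that assumption the five results agree with the array printed above to eight decimal places.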

In [18]:
yelp_lv_bizes = yelp_lv_bizes.assign(review_count_scaled=rc_scaled)

### 3.2 Normalize checkin_count to the [0, 1] Range

In [19]:
ck_scaler = MinMaxScaler()

In [20]:
# A separate scaler is fitted so checkin_count gets its own min and max
ck_scaled = ck_scaler.fit_transform(yelp_lv_bizes.checkin_count.values.reshape(-1, 1))

In [21]:
ck_scaled[:5]

array([[1.94607375e-02],
       [2.27345064e-04],
       [7.57816881e-06],
       [0.00000000e+00],
       [8.10864063e-04]])

In [22]:
yelp_lv_bizes = yelp_lv_bizes.assign(checkin_count_scaled=ck_scaled)

## 4. Compute Popularity

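For popularity, we construct a new metric, popularity, on top of review_count and checkin_count. How should it be computed? Consider that `reviews can be positive or negative, so a higher review count does not necessarily mean customers are more satisfied with a business; check-ins are different, and more check-ins can be taken to mean higher customer satisfaction with that business`. So, when computing popularity, we simply treat checkin_count as twice as important as review_count, that is: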
$$
popularity = review\_count\_scaled + 2 * checkin\_count\_scaled
$$
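
As a quick worked example, the formula can be applied to the scaled values of the first business printed in section 3. The function name and the `checkin_weight` parameter are hypothetical and chosen only for illustration. The notebook hard-codes the factor 2.

```python
def popularity_score(review_count_scaled, checkin_count_scaled, checkin_weight=2.0):
    """Weighted sum of the two scaled counts (checkin_weight is a hypothetical knob)."""
    return review_count_scaled + checkin_weight * checkin_count_scaled

# Scaled values of the first business, copied from the printed rc_scaled and ck_scaled arrays.
raw = popularity_score(0.19679261, 1.94607375e-02)
print(round(raw, 6))

# Both inputs lie in [0, 1], so with a weight of 2 the raw score can be as large as 3.
max_raw = popularity_score(1.0, 1.0)
print(max_raw)
```

Because the raw score can exceed 1, it is reasonable to rescale it to [0, 1] once more before storing it.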

In [23]:
yelp_lv_bizes[:5]

Unnamed: 0,db_id,business_id,stars,review_count,latitude,longitude,city,checkin_count,review_count_scaled,checkin_count_scaled
0,4,--9e1ONYQuAa-CB_Rrw7Tw,4.0,1451,36.1232,-115.169,Las Vegas,2568.0,0.196793,0.019461
1,11,--DdmeR16TRb3LsjG0ejrQ,3.0,5,36.1143,-115.171,Las Vegas,30.0,0.000272,0.000227
2,12,--e8PjCNhEz32pprnPhCwQ,3.5,19,36.1589,-115.133,Las Vegas,1.0,0.002175,8e-06
3,29,--o5BoU7qYMALeVDK6mwVg,3.5,6,36.1016,-115.132,Las Vegas,0.0,0.000408,0.0
4,33,--q7kSBRb0vWC8lSkXFByA,4.0,7,36.0167,-115.173,Las Vegas,107.0,0.000544,0.000811


In [24]:
popularity = yelp_lv_bizes.review_count_scaled + 2 * yelp_lv_bizes.checkin_count_scaled

In [25]:
popularity[:5]

0    0.235714
1    0.000727
2    0.002190
3    0.000408
4    0.002165
dtype: float64

In [26]:
# The weighted sum can exceed 1, so rescale it to the [0, 1] range once more
pp_scaler = MinMaxScaler()
popularity_scaled = pp_scaler.fit_transform(popularity.values.reshape(-1, 1))

In [27]:
popularity_scaled[:5]

array([[0.09637074],
       [0.00029703],
       [0.00089523],
       [0.00016669],
       [0.0008853 ]])

In [28]:
yelp_lv_bizes = yelp_lv_bizes.assign(popularity=popularity_scaled)

In [29]:
yelp_lv_bizes[:5]

Unnamed: 0,db_id,business_id,stars,review_count,latitude,longitude,city,checkin_count,review_count_scaled,checkin_count_scaled,popularity
0,4,--9e1ONYQuAa-CB_Rrw7Tw,4.0,1451,36.1232,-115.169,Las Vegas,2568.0,0.196793,0.019461,0.096371
1,11,--DdmeR16TRb3LsjG0ejrQ,3.0,5,36.1143,-115.171,Las Vegas,30.0,0.000272,0.000227,0.000297
2,12,--e8PjCNhEz32pprnPhCwQ,3.5,19,36.1589,-115.133,Las Vegas,1.0,0.002175,8e-06,0.000895
3,29,--o5BoU7qYMALeVDK6mwVg,3.5,6,36.1016,-115.132,Las Vegas,0.0,0.000408,0.0,0.000167
4,33,--q7kSBRb0vWC8lSkXFByA,4.0,7,36.0167,-115.173,Las Vegas,107.0,0.000544,0.000811,0.000885
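
Sections 3 and 4 can also be condensed into one helper. The sketch below uses plain pandas arithmetic instead of MinMaxScaler. The function names and the `checkin_weight` parameter are hypothetical. Because the toy frame holds only the five rows shown above, its minima and maxima differ from the full dataset, and so do the resulting numbers.

```python
import pandas as pd

def min_max(series):
    """Min-max scale a numeric Series to [0, 1]; a constant Series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span

def add_popularity(df, checkin_weight=2.0):
    """Return a copy of df with the scaled counts and the rescaled popularity added."""
    out = df.copy()
    out['review_count_scaled'] = min_max(out['review_count'])
    out['checkin_count_scaled'] = min_max(out['checkin_count'])
    raw = out['review_count_scaled'] + checkin_weight * out['checkin_count_scaled']
    out['popularity'] = min_max(raw)
    return out

# Toy data: the five rows printed in the notebook, not the full dataset.
sample = pd.DataFrame({
    'db_id': [4, 11, 12, 29, 33],
    'review_count': [1451, 5, 19, 6, 7],
    'checkin_count': [2568.0, 30.0, 1.0, 0.0, 107.0],
})

result = add_popularity(sample)
print(result[['db_id', 'popularity']])
```

Working on a copy keeps the input frame unchanged, which makes the helper safe to rerun in a notebook.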


## 5. Save the Preprocessed Result

In [30]:
yelp_lv_bizes.to_csv('../../dataset/las_vegas/business/las_vegas_business_preprocessed_with_db_id.csv', index=False)
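
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column. A small round-trip sketch on toy data shows this. It writes to a temporary directory, and the file name is a placeholder rather than the notebook's path.

```python
import os
import tempfile

import pandas as pd

# Two toy rows using values printed in the notebook.
df = pd.DataFrame({
    'db_id': [4, 11],
    'popularity': [0.096371, 0.000297],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder file name, used only for this illustration.
    path = os.path.join(tmp_dir, 'preprocessed_sample.csv')
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame has exactly the columns that were written.
print(list(reloaded.columns))
```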