# Data Preprocessing & Sentiment Analysis


Package setup

In [49]:
import jieba
import jieba.analyse
import pandas as pd
%matplotlib inline

## MetaData
- Data source: PTT car board and CarShop (car trading) board
- Date range: 2020/12/01 ~ 2023/01/31

| Brand | Keywords | Total records | Records after cleaning |
| --- | --- | --- | --- |
| Nissan | Nissan、裕隆、裕日、日產、Sentra、Kicks、仙草 | 2,464 | 2,079 |
| Toyota | Toyota、Altis、Cross、豐田、和泰、阿提斯、卡羅拉 | 8,755 | 7,420 |
| Ford | 福特、六和、九和、上正、Ford、Focus | 4,684 | 4,011 |
| Honda | Honda、HRV、本田 | 2,488 | 2,171 |
| Mazda | Mazda、CX-3、CX-30、馬三、Mazda 3 | 3,321 | 2,787 |

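## Data Loading
Adjust the path below to match your own data directory. <br>
The data is not committed to GitHub, so make sure the data folder is covered by your ignore rules.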

In [175]:
ptt = pd.read_csv("../data/rawData/mazda_ptt_data.csv") 
ptt.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01 00:09:42,city0504,car,原文連結：\nhttps://ctee.com.tw/lohas/car/378518.ht...,"[{""cmtStatus"": ""→"", ""cmtPoster"": ""mingchaoliu""...",111.243.121.95,2020-12-01 00:04:07,ptt
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01 11:03:24,yamatobar,car,原文連結：\nhttps://auto.ltn.com.tw/news/16610/3\n原...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""XXXXBANG"", ""...",1.171.168.195,2020-12-02 00:04:03,ptt
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01 11:04:57,oppoR20,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！\nhttps://w...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""wang960615"",...",140.125.222.17,2020-12-02 00:04:03,ptt


In [176]:
# Check how many posts there are
print(f"number of posts: {ptt.shape[0]}")
print(f"date range: {(ptt['artDate'].min(), ptt['artDate'].max())}")
print(f"category: \n{ptt['artCatagory'].value_counts()}")


number of posts: 3321
date range: ('2020-12-01 00:09:42', '2023-01-30 21:30:53')
category: 
artCatagory
CarShop    1777
car        1544
Name: count, dtype: int64


### Comment Extraction
Extract the `cmtContent` of every comment stored in `artComment`.

In [177]:
import ast

ptt = ptt[ptt.artComment != '[]'].copy()  # drop posts without comments

# Concatenate the cmtContent of every comment in a post
def getComtInfo(com):
    # literal_eval only accepts Python literals, unlike eval, which runs arbitrary code
    comments = ast.literal_eval(com)
    return "".join(c['cmtContent'] + "。" for c in comments)


ptt['cmtContent'] = ptt['artComment'].apply(getComtInfo)
ptt.head(3)

Unnamed: 0,system_id,artUrl,artTitle,artDate,artPoster,artCatagory,artContent,artComment,e_ip,insertedDate,dataSource,cmtContent
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01 00:09:42,city0504,car,原文連結：\nhttps://ctee.com.tw/lohas/car/378518.ht...,"[{""cmtStatus"": ""→"", ""cmtPoster"": ""mingchaoliu""...",111.243.121.95,2020-12-01 00:04:07,ptt,:要看樓主住哪或開車活動範圍在哪？西部是還好啦。:一年是可以去保養廠幾次。:3008快小改款...
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01 11:03:24,yamatobar,car,原文連結：\nhttps://auto.ltn.com.tw/news/16610/3\n原...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""XXXXBANG"", ""...",1.171.168.195,2020-12-02 00:04:03,ptt,:電動浴缸？。:防水電動車。:哈哈。:原廠有打算出敞篷RAV4嗎?。:RAV250h？？？。...
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01 11:04:57,oppoR20,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！\nhttps://w...,"[{""cmtStatus"": ""推"", ""cmtPoster"": ""wang960615"",...",140.125.222.17,2020-12-02 00:04:03,ptt,:北美的YARIS停產了。:但是這篇內文的東西總感覺可能全球發售。:畢竟馬二的銷量除了日本本...
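To make the extraction step concrete, the sketch below parses a small hand-written string shaped like one `artComment` cell and joins its `cmtContent` values. The sample comments and the helper name `join_comments` are made up for illustration, and `ast.literal_eval` is an assumed substitute for `eval`, chosen because it only accepts Python literals.

```python
import ast

# Hypothetical artComment cell: a stringified list of comment dicts
sample = (
    '[{"cmtStatus": "推", "cmtPoster": "user_a", "cmtContent": ":good car"}, '
    '{"cmtStatus": "→", "cmtPoster": "user_b", "cmtContent": ":too expensive"}]'
)

def join_comments(raw):
    """Parse the stringified list and join every cmtContent with a full stop."""
    comments = ast.literal_eval(raw)
    return "".join(c["cmtContent"] + "。" for c in comments)

joined = join_comments(sample)
print(joined)
```

An empty list string (`'[]'`) yields an empty result, which is why such posts are filtered out beforehand.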


Merge `artTitle`, `artContent`, and `cmtContent` into a new column, `whole_content`.

In [178]:
ptt['whole_content'] = ptt['artTitle'] + ptt['artContent'] + ptt['cmtContent']
ptt = ptt[['system_id', 'artUrl', 'artTitle', 'artDate', 'artCatagory', 'artContent', 'whole_content']]
ptt

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01 00:09:42,car,原文連結：\nhttps://ctee.com.tw/lohas/car/378518.ht...,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH原文連結：\nhttps://...
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01 11:03:24,car,原文連結：\nhttps://auto.ltn.com.tw/news/16610/3\n原...,[新聞]預計明年現身，ToyotaRAV4將推全新動力！原文連結：\nhttps://aut...
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01 11:04:57,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！\nhttps://w...,[情報]新世代Mazda2/CX-3有望沿用Yaris平台新一代Mazda 2有望直接沿用Y...
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01 11:22:11,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊\nhttps://www.carstu...,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台這是依照之前 Mazda Q...
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01 11:55:33,car,新增小七車\nhttps://www.7car.tw/articles/read/70876...,[情報]2020年11月份臺灣汽車市場銷售報告新增小七車\nhttps://www.7car...
...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29 00:11:03,CarShop,車輛狀況：2023 全新\n\n車輛品牌：Mazda\n\n車款型式：CX30 20S Ca...,[購車]全新MazdaCX3020SCarbonEdition車輛狀況：2023 全新\n\...
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29 23:04:49,CarShop,車輛狀況：全新\n\n車輛品牌：Mazda\n\n車款型式：CX-5 20S Premium...,[購車]MazdaCX-520SPremiumSE車輛狀況：全新\n\n車輛品牌：Mazda...
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30 09:24:31,CarShop,車輛狀況：全新\n\n車輛品牌：Mazda\n\n車款型式：2023 Mazda CX-5 ...,[購車]2023MazdaCX-520SPremiumSE車輛狀況：全新\n\n車輛品牌：M...
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30 11:51:36,CarShop,車輛狀況：全新\n \n車輛品牌：Mazda3\n \n車款型式：Mazda3 5D 202...,[購車]Mazda35D2023年式20SSignature/Prem車輛狀況：全新\n \...


Convert `artDate` from a timestamp string to a date.

In [179]:
ptt["artDate"] = pd.to_datetime(ptt["artDate"])
ptt["artDate"] = ptt["artDate"].dt.date
ptt

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：\nhttps://ctee.com.tw/lohas/car/378518.ht...,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH原文連結：\nhttps://...
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：\nhttps://auto.ltn.com.tw/news/16610/3\n原...,[新聞]預計明年現身，ToyotaRAV4將推全新動力！原文連結：\nhttps://aut...
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！\nhttps://w...,[情報]新世代Mazda2/CX-3有望沿用Yaris平台新一代Mazda 2有望直接沿用Y...
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊\nhttps://www.carstu...,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台這是依照之前 Mazda Q...
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車\nhttps://www.7car.tw/articles/read/70876...,[情報]2020年11月份臺灣汽車市場銷售報告新增小七車\nhttps://www.7car...
...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新\n\n車輛品牌：Mazda\n\n車款型式：CX30 20S Ca...,[購車]全新MazdaCX3020SCarbonEdition車輛狀況：2023 全新\n\...
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新\n\n車輛品牌：Mazda\n\n車款型式：CX-5 20S Premium...,[購車]MazdaCX-520SPremiumSE車輛狀況：全新\n\n車輛品牌：Mazda...
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新\n\n車輛品牌：Mazda\n\n車款型式：2023 Mazda CX-5 ...,[購車]2023MazdaCX-520SPremiumSE車輛狀況：全新\n\n車輛品牌：M...
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新\n \n車輛品牌：Mazda3\n \n車款型式：Mazda3 5D 202...,[購車]Mazda35D2023年式20SSignature/Prem車輛狀況：全新\n \...


### Data Cleaning

In [180]:
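# Drop rows whose merged text is missing
ptt = ptt.dropna(subset=['whole_content'])

# Remove all newline characters (applies to every column)
ptt = ptt.replace(r'[\n]+', '', regex=True)

# Remove URLs from the text
# Note: newlines are already gone, so '.*' also removes everything that follows the first URL
ptt['whole_content'] = ptt['whole_content'].str.replace('(http|https)://.*', '', regex=True).replace(r'www\S+', '', regex=True)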

ptt

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH原文連結：
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,[新聞]預計明年現身，ToyotaRAV4將推全新動力！原文連結：
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,[情報]新世代Mazda2/CX-3有望沿用Yaris平台新一代Mazda 2有望直接沿用Y...
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台這是依照之前 Mazda Q...
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,[情報]2020年11月份臺灣汽車市場銷售報告新增小七車
...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,[購車]全新MazdaCX3020SCarbonEdition車輛狀況：2023 全新車輛品...
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,[購車]MazdaCX-520SPremiumSE車輛狀況：全新車輛品牌：Mazda車款型式...
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,[購車]2023MazdaCX-520SPremiumSE車輛狀況：全新車輛品牌：Mazda...
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,[購車]Mazda35D2023年式20SSignature/Prem車輛狀況：全新 車輛品...
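In the output above, rows whose text contained a link stop right where the link began (row 0 ends at "原文連結："). A likely cause is that `.*` runs to the end of the string once the newlines are gone. The sketch below reproduces this on a made-up string and compares it with a pattern limited to ASCII URL characters. The sample text and the `URL_ASCII` pattern are assumptions for illustration, not part of the original pipeline.

```python
import re

# Made-up post: newlines already stripped, so the URL is glued to the text after it
text = "原文連結：https://example.com/news/1.html原文內容：新車上市"

# Pattern used in the cleaning cell: '.' matches everything up to the end of the string
greedy = re.sub(r"(http|https)://.*", "", text)

# Assumed alternative: stop at the first character that cannot appear in an ASCII URL
URL_ASCII = r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+"
bounded = re.sub(URL_ASCII, "", text)

print(greedy)
print(bounded)
```

With the bounded pattern the text after the link survives. Removing URLs before stripping newlines would be another option, at the cost of reordering the cleaning steps.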


### String Replacement

In [181]:
replace = pd.read_csv('../dict/replace.csv')
# Map every alias to an empty string, so each occurrence is removed from the text
replace_dict = {key: '' for key in replace['alias']}

In [182]:
def replace_str(data):
    for old, new in replace_dict.items():
        data = data.replace(old, new)
    return data

In [183]:
replace_df = ptt.copy()
replace_df['whole_content'] = replace_df['whole_content'].apply(lambda x : replace_str(x))
replace_df

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH：
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,[新聞]預計明年現身，ToyotaRAV4將推全新動力！：
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,[情報]新世代Mazda2/CX-3有望沿用Yaris平台新一代Mazda 2有望直接沿用Y...
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台這是依照之前 Mazda Q...
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,[情報]2020年11月份臺灣汽車市場銷售報告新增小七車
...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,[購車]全新MazdaCX3020SCarbonEdition：2023 全新車輛品牌：Ma...
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,[購車]MazdaCX-520SPremiumSE：全新車輛品牌：Mazda：CX-5 20...
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,[購車]2023MazdaCX-520SPremiumSE：全新車輛品牌：Mazda：202...
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,[購車]Mazda35D2023年式20SSignature/Prem：全新 車輛品牌：Ma...
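The replacement step removes every alias by looping over the dictionary once per string. As a rough sketch of an equivalent single-pass approach, the aliases can be compiled into one regular expression. The three aliases below are hypothetical, inferred only from the fragments missing in the output above; the real list lives in `../dict/replace.csv`.

```python
import re

# Hypothetical aliases; the actual contents of replace.csv are not shown in this notebook
aliases = ["原文連結", "車輛狀況", "車款型式"]

# One alternation, longest alias first so a shorter alias cannot pre-empt a longer one
pattern = re.compile(
    "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))
)

def strip_aliases(text):
    """Remove every alias occurrence in a single pass over the text."""
    return pattern.sub("", text)

cleaned = strip_aliases("車輛狀況：全新車款型式：CX-5")
print(cleaned)
```

Note that the colons following each removed label remain. They are only dropped later, when punctuation is stripped.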


### Word Segmentation
Initialize the segmentation engine.

In [184]:
jieba.set_dictionary('../dict/dict.txt')
jieba.load_userdict('../dict/user_dict.txt')

Building prefix dict from d:\Projects\NSYSU\2023_BigDataAnalysis\dict\dict.txt ...
Loading model from cache C:\Users\s2568\AppData\Local\Temp\jieba.uaa528441c6063f69433245c0db13322d.cache
Loading model cost 0.622 seconds.
Prefix dict has been built successfully.


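First, remove punctuation: the pattern deletes every character that is neither a word character nor whitespace.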

In [185]:
clear_df = replace_df.copy()

clear_df['whole_content'] = clear_df['whole_content'].str.replace(r'[^\w\s\d]+', '', regex=True).astype(str)

clear_df

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,新聞小休旅熱鬧好玩PEUGEOT300815LBlueH
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,新聞預計明年現身ToyotaRAV4將推全新動力
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,情報新世代Mazda2CX3有望沿用Yaris平台新一代Mazda 2有望直接沿用Yaris...
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,Re情報新世代Mazda2CX3有望沿用Yaris平台這是依照之前 Mazda Q3 財報會...
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,情報2020年11月份臺灣汽車市場銷售報告新增小七車
...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,購車全新MazdaCX3020SCarbonEdition2023 全新車輛品牌MazdaC...
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,購車MazdaCX520SPremiumSE全新車輛品牌MazdaCX5 20S Premi...
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,購車2023MazdaCX520SPremiumSE全新車輛品牌Mazda2023 Mazd...
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,購車Mazda35D2023年式20SSignaturePrem全新 車輛品牌Mazda3 ...
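The effect of the punctuation pattern is easiest to see on short strings. The sketch below uses `[^\w\s]+`, which is equivalent to the notebook's `[^\w\s\d]+` because `\w` already covers digits. In Python 3, `\w` also matches CJK characters, so Chinese text is kept. The second sample string is made up.

```python
import re

# Equivalent to the notebook's class: keep word characters (incl. CJK, digits) and whitespace
PUNCT = re.compile(r"[^\w\s]+")

samples = ["[情報]新世代Mazda2/CX-3有望沿用Yaris平台", "PEUGEOT 3008 1.5L"]
stripped = [PUNCT.sub("", s) for s in samples]
print(stripped)
```

A side effect worth keeping in mind is that separators inside model names and numbers disappear too: `CX-3` becomes `CX3` and `1.5L` becomes `15L`, which matches tokens such as `PEUGEOT300815LBlueH` in the output above.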


Run jieba segmentation.

In [186]:
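# Switch to the Traditional Chinese dictionary
# Note: set_dictionary() makes jieba re-initialize on the next cut, so words added
# earlier with load_userdict() may need to be loaded again afterwards
jieba.set_dictionary("../dict/dict.txt.big")

stopwords_manual = ['恭喜', '有無', '有人', '是不是', '本來', '遇到', '機車', '時間', '討論', '10', 'XD', '20', '未來', '現在', '今年']

# Load the stopword file and append the manual additions
with open("../dict/stopwords.txt", encoding="utf-8") as f:
    stopWords = [line.strip() for line in f.readlines()]
stopWords.extend(stopwords_manual)

# Tokenization function
def getToken(row):
    if not isinstance(row, str):  # make sure the input is a string
        row = str(row)  # coerce non-string values to a string
    seg_list = jieba.cut(row, cut_all=False)
    seg_list = [
        w for w in seg_list if w not in stopWords and len(w) > 1
    ]  # drop stopwords and single-character tokens
    return seg_list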

clear_df["words"] = clear_df["whole_content"].apply(getToken)
clear_df

Building prefix dict from d:\Projects\NSYSU\2023_BigDataAnalysis\dict\dict.txt.big ...
Loading model from cache C:\Users\s2568\AppData\Local\Temp\jieba.u87526c01a2c6093fa84ac3f5467b7506.cache


Loading model cost 1.293 seconds.
Prefix dict has been built successfully.


Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content,words
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,新聞小休旅熱鬧好玩PEUGEOT300815LBlueH,"[新聞, 小休, 熱鬧, 好玩, PEUGEOT300815LBlueH]"
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,新聞預計明年現身ToyotaRAV4將推全新動力,"[新聞, 預計, 明年, 現身, ToyotaRAV4, 將推, 動力]"
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,情報新世代Mazda2CX3有望沿用Yaris平台新一代Mazda 2有望直接沿用Yaris...,"[情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, 新一代, Ma..."
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,Re情報新世代Mazda2CX3有望沿用Yaris平台這是依照之前 Mazda Q3 財報會...,"[Re, 情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, Maz..."
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,情報2020年11月份臺灣汽車市場銷售報告新增小七車,"[情報, 2020, 11, 月份, 臺灣汽車, 市場, 銷售, 報告, 新增, 小七車]"
...,...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,購車全新MazdaCX3020SCarbonEdition2023 全新車輛品牌MazdaC...,"[購車, MazdaCX3020SCarbonEdition2023, MazdaCX30,..."
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,購車MazdaCX520SPremiumSE全新車輛品牌MazdaCX5 20S Premi...,"[購車, MazdaCX520SPremiumSE, MazdaCX5, 20S, Prem..."
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,購車2023MazdaCX520SPremiumSE全新車輛品牌Mazda2023 Mazd...,"[購車, 2023MazdaCX520SPremiumSE, Mazda2023, Mazd..."
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,購車Mazda35D2023年式20SSignaturePrem全新 車輛品牌Mazda3 ...,"[購車, Mazda35D2023, 年式, 20SSignaturePrem, Mazda..."
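The filtering rule inside `getToken` can be tried without jieba by feeding it a ready-made token list. In the sketch below the tokens and the stopword set are hypothetical stand-ins for the output of `jieba.cut` and for `stopwords.txt`. A `set` is used as an assumed optimization, since membership tests on a set do not scan the whole collection the way a list does.

```python
# Hypothetical tokens, standing in for the output of jieba.cut
tokens = ["情報", "的", "Mazda", "XD", "平台", "了", "20"]

# A set gives constant-time membership tests; the notebook itself uses a list
stop_set = {"XD", "20", "現在"}

def filter_tokens(seq, stopwords):
    """Keep tokens that are not stopwords and are longer than one character."""
    return [w for w in seq if w not in stopwords and len(w) > 1]

kept = filter_tokens(tokens, stop_set)
print(kept)
```

Single-character tokens such as "的" and "了" are dropped by the length check alone, so they do not need to appear in the stopword list.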


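## Sentiment Analysis
Sentiment is scored with the `LIWC` dictionary.
+ Sentiment score: `pos / (pos + neg)`, where `pos` and `neg` count the tokens labeled `positive` and `negative`; a post with no labeled token gets 0.5. Other classes (anger, anx, sad) are not counted by the function below.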

In [187]:
senti_df = clear_df.copy()

In [188]:
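# Load the sentiment dictionary
liwc_dict = pd.read_csv("../dict/liwc/LIWC_CH.csv")
liwc_dict = liwc_dict.rename(columns={'name': 'word', "class": 'sentiments'})
# Build a word -> class lookup; if a word is listed more than once, the last class wins
liwc_dict = liwc_dict.set_index('word')['sentiments'].to_dict()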
# liwc_dict

In [189]:
def get_sentiment(words, liwc_dict):
    # Count the tokens labeled positive / negative in the LIWC dictionary
    pos = 0
    neg = 0
    for word in words:
        label = liwc_dict.get(word)
        if label == "positive":
            pos += 1
        elif label == "negative":
            neg += 1

    # No labeled token at all: fall back to the neutral midpoint
    if pos + neg == 0:
        return 0.5
    # Share of positive tokens among all labeled tokens, between 0 and 1
    return round(pos / (pos + neg), 3)
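A few hand-made inputs show how the ratio behaves at its edges. The sketch below re-implements the same rule under the hypothetical name `positive_share`. The mini-lexicon and its labels are invented for illustration; the real labels come from `LIWC_CH.csv`.

```python
def positive_share(words, lexicon):
    """Share of positive tokens among tokens labeled positive or negative; 0.5 if none."""
    pos = sum(1 for w in words if lexicon.get(w) == "positive")
    neg = sum(1 for w in words if lexicon.get(w) == "negative")
    return 0.5 if pos + neg == 0 else round(pos / (pos + neg), 3)

# Hypothetical mini-lexicon; labels are assumptions, not taken from the real dictionary
lexicon = {"好玩": "positive", "熱鬧": "positive", "故障": "negative", "生氣": "anger"}

scores = [
    positive_share(["新聞", "熱鬧", "好玩"], lexicon),  # 2 positive, 0 negative
    positive_share(["熱鬧", "好玩", "故障"], lexicon),  # 2 positive, 1 negative
    positive_share(["新聞", "生氣"], lexicon),          # no positive/negative label
]
print(scores)
```

Because the score is a ratio, a post with a single positive match and no negative one already reaches 1.0, so extreme values are sensitive to short posts. A value of exactly 0.5 can mean either "balanced" or "no labeled token found".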

In [190]:
# Add a sentiment score to each post
senti_df['sentimentRatio'] = senti_df.apply(lambda row : get_sentiment(row['words'],liwc_dict), axis = 1)
senti_df

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,whole_content,words,sentimentRatio
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,新聞小休旅熱鬧好玩PEUGEOT300815LBlueH,"[新聞, 小休, 熱鬧, 好玩, PEUGEOT300815LBlueH]",1.0
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,新聞預計明年現身ToyotaRAV4將推全新動力,"[新聞, 預計, 明年, 現身, ToyotaRAV4, 將推, 動力]",0.5
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,情報新世代Mazda2CX3有望沿用Yaris平台新一代Mazda 2有望直接沿用Yaris...,"[情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, 新一代, Ma...",0.5
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,Re情報新世代Mazda2CX3有望沿用Yaris平台這是依照之前 Mazda Q3 財報會...,"[Re, 情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, Maz...",0.5
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,情報2020年11月份臺灣汽車市場銷售報告新增小七車,"[情報, 2020, 11, 月份, 臺灣汽車, 市場, 銷售, 報告, 新增, 小七車]",0.5
...,...,...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,購車全新MazdaCX3020SCarbonEdition2023 全新車輛品牌MazdaC...,"[購車, MazdaCX3020SCarbonEdition2023, MazdaCX30,...",0.5
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,購車MazdaCX520SPremiumSE全新車輛品牌MazdaCX5 20S Premi...,"[購車, MazdaCX520SPremiumSE, MazdaCX5, 20S, Prem...",1.0
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,購車2023MazdaCX520SPremiumSE全新車輛品牌Mazda2023 Mazd...,"[購車, 2023MazdaCX520SPremiumSE, Mazda2023, Mazd...",1.0
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,購車Mazda35D2023年式20SSignaturePrem全新 車輛品牌Mazda3 ...,"[購車, Mazda35D2023, 年式, 20SSignaturePrem, Mazda...",1.0


In [191]:
# Inspect the distribution of the sentiment scores
senti_df['sentimentRatio'].describe()

count    2787.000000
mean        0.630432
std         0.259330
min         0.000000
25%         0.500000
50%         0.500000
75%         0.857000
max         1.000000
Name: sentimentRatio, dtype: float64

In [192]:
final_df = senti_df.drop('whole_content', axis=1)
final_df

Unnamed: 0,system_id,artUrl,artTitle,artDate,artCatagory,artContent,words,sentimentRatio
0,1,https://www.ptt.cc/bbs/car/M.1606752584.A.175....,[新聞]小休旅熱鬧好玩PEUGEOT30081.5LBlueH,2020-12-01,car,原文連結：https://ctee.com.tw/lohas/car/378518.html...,"[新聞, 小休, 熱鬧, 好玩, PEUGEOT300815LBlueH]",1.0
1,2,https://www.ptt.cc/bbs/car/M.1606791807.A.CA7....,[新聞]預計明年現身，ToyotaRAV4將推全新動力！,2020-12-01,car,原文連結：https://auto.ltn.com.tw/news/16610/3原文內容：...,"[新聞, 預計, 明年, 現身, ToyotaRAV4, 將推, 動力]",0.5
2,3,https://www.ptt.cc/bbs/car/M.1606791901.A.39C....,[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,新一代Mazda 2有望直接沿用Yaris平台，CX3也可能直接辦理！https://www...,"[情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, 新一代, Ma...",0.5
3,4,https://www.ptt.cc/bbs/car/M.1606792933.A.D92....,Re:[情報]新世代Mazda2/CX-3有望沿用Yaris平台,2020-12-01,car,這是依照之前 Mazda Q3 財報會上所公佈的資訊https://www.carstuff...,"[Re, 情報, 世代, Mazda2CX3, 有望, 沿用, Yaris, 平台, Maz...",0.5
4,5,https://www.ptt.cc/bbs/car/M.1606794935.A.AA7....,[情報]2020年11月份臺灣汽車市場銷售報告,2020-12-01,car,新增小七車https://www.7car.tw/articles/read/70876?馬...,"[情報, 2020, 11, 月份, 臺灣汽車, 市場, 銷售, 報告, 新增, 小七車]",0.5
...,...,...,...,...,...,...,...,...
3316,3317,https://www.ptt.cc/bbs/CarShop/M.1674922265.A....,[購車]全新MazdaCX3020SCarbonEdition,2023-01-29,CarShop,車輛狀況：2023 全新車輛品牌：Mazda車款型式：CX30 20S Carbon車輛顏色...,"[購車, MazdaCX3020SCarbonEdition2023, MazdaCX30,...",0.5
3317,3318,https://www.ptt.cc/bbs/CarShop/M.1675004691.A....,[購車]MazdaCX-520SPremiumSE,2023-01-29,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：CX-5 20S Premium SE車輛顏色：...,"[購車, MazdaCX520SPremiumSE, MazdaCX5, 20S, Prem...",1.0
3318,3319,https://www.ptt.cc/bbs/CarShop/M.1675041873.A....,[購車]2023MazdaCX-520SPremiumSE,2023-01-30,CarShop,車輛狀況：全新車輛品牌：Mazda車款型式：2023 Mazda CX-5 20S Prem...,"[購車, 2023MazdaCX520SPremiumSE, Mazda2023, Mazd...",1.0
3319,3320,https://www.ptt.cc/bbs/CarShop/M.1675050698.A....,[購車]Mazda35D2023年式20SSignature/Prem,2023-01-30,CarShop,車輛狀況：全新 車輛品牌：Mazda3 車款型式：Mazda3 5D 2023年式 20S ...,"[購車, Mazda35D2023, 年式, 20SSignaturePrem, Mazda...",1.0


## Save Results

In [193]:
final_df.to_csv("../data/sentiment/mazda_clean_data.csv", encoding = 'utf-8',index = False)
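One thing to watch when reloading this file is that the `words` column holds Python lists, and CSV stores them as their string representation. The sketch below shows the round trip with the standard-library `csv` module, used here as a stand-in for pandas so that it runs on its own. The row content is made up.

```python
import ast
import csv
import io

# Hypothetical row with a list-valued "words" column
rows = [{"system_id": 1, "words": ["新聞", "熱鬧", "好玩"]}]

# Write to an in-memory CSV; the list is stored as its string representation
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["system_id", "words"])
writer.writeheader()
writer.writerows(rows)

# Read it back: "words" is a plain string until it is parsed again
buf.seek(0)
loaded = list(csv.DictReader(buf))
raw_words = loaded[0]["words"]
parsed_words = ast.literal_eval(raw_words)

print(type(raw_words).__name__, type(parsed_words).__name__)
```

After `pd.read_csv`, the same conversion could be applied to the column, for example with `df["words"].apply(ast.literal_eval)`.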