## Lesson 20 - PTT Graph





### Table of Contents

* [PTT.cc](#ptt)
* [Import data to MongoDB](#import_mongo)
* [Load data from MongoDB](#load_data_from_mongodb)
* [Create Keyword Node](#create_keyword)
* [Create Post Node](#create_post_node)
* [Create User Node](#create_user_node)
* [Relationship of PTTPost keywords and post](#relationship_ptt_keyword_POST)
* [Relationship of PTTPost and User](#relationship_post_user)
* [Relationship of PTTPost keywords and comment](#relationship_ptt_keyword_COMMENT)
* [Create comment Node](#create_comment_node)
* [Relationship of Comment and Post](#relationship_comment_post)
* [Relationship of Comment and User](#relationship_comment_user)
* [Import to Neo4j](#import_neo4j)
* [Cypher query](#cypher)




<a id="ptt"></a>
## PTT.cc

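Ptt is known in Chinese as 批踢踢實業坊. Few people know where the site name comes from, and some have even assumed that Ptt stands for 「怕太太」 ("afraid of one's wife").

The name comes from the account name of the site's founder, 杜奕瑾 (ID: Ptt). According to a talk he gave in 2010, it derives from the nickname "Panda Tu". "Panda" refers to the dark circles he had under his eyes from lack of sleep at university, which made him look like a panda, and "Tu" is the romanization of his surname 杜 as it appears on his passport. He took the initials P and T, felt that a second T made the name sound better, and ended up with Ptt.

Also according to 杜奕瑾, once Ptt had become the site name it took on further meanings: "P" for 「批判」 (criticize) and "T" for 「踢爆」 (expose), reflecting Ptt's spirit of questioning established social ideas. Another derived meaning is "Professional Technology Temple" (專業科技殿堂), so the name did not start out as an abbreviation of these three words.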

<img src="images/ptt_incoming.png">

<a id="import_mongo"></a>
## Import data to MongoDB

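- Dump files: `./data/mongo/ptt/` (run the command below from `./data/mongo/`, so that the positional `ptt` argument points at this directory)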

```
mongorestore -h 127.0.0.1 --port 27017 --db ptt ptt --drop
```

<img src="images/restore_ptt_mongodb_done.png">

<img src="images/robomongo_ptt.png">

In [1]:
from __future__ import absolute_import

import json
import requests
import datetime
import time
import pandas as pd
import numpy as np
import csv
import ast

# from tqdm import tqdm
# from bs4 import BeautifulSoup
from urllib.parse import urlencode
from datetime import (datetime as dt, timedelta as td)
from abc import ABCMeta, abstractmethod
from urllib.parse import quote, unquote
from pymongo import MongoClient
from pymongo import UpdateOne

In [2]:
def get_keyword_id(keyword):
    # NOTE: hashtag_keyword_map is not defined in this notebook, and this helper is never called here
    return hashtag_keyword_map[str(keyword).strip()]

def add_quote(keyword):
    return '"'+str(keyword).replace('\n', ' ').replace('\r', ' ').strip()+'"'

def replace_double_quote(s):
    return str(s).strip().replace('"',"'")

def if_in_list(item, target_list):
    if item not in target_list:
        return 0
    return 1

def extract_ptt_post_id(url):
    # Last path segment of the post URL without the ".html" suffix
    return url.split('/')[-1].replace('.html','')

def trim_locale(locale):
    return locale.replace('(','').replace(')','')

def extract_hashtag(content):
    hashtags = []
    if type(content)==float:
        return ""
    if '#' in content:
        for h in content.split('#')[1:]:
            hashtags.append('#'+h.split(' ')[0].replace('\n','').replace('、','').strip())
    return "/".join(hashtags) if hashtags else ""

def extract_mention(content):
    mentions = []
    if type(content)==float:
        return ""
    if '@' in content:
        for h in content.split('@')[1:]:
            mentions.append('@'+h.split(' ')[0].replace('\n','').replace('、','').strip())
    return "/".join(mentions) if mentions else ""
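
To make the helper behaviour concrete, here is a small self-contained sketch that repeats `extract_ptt_post_id` and `extract_hashtag` from the cell above and runs them on made-up inputs. The URL and the sample text are hypothetical and are not taken from the data set.

```python
def extract_ptt_post_id(url):
    # Last path segment without the ".html" suffix
    return url.split('/')[-1].replace('.html', '')

def extract_hashtag(content):
    # Missing values arrive from pandas as float NaN, so return "" for them
    if isinstance(content, float):
        return ""
    hashtags = []
    for h in content.split('#')[1:]:
        hashtags.append('#' + h.split(' ')[0].replace('\n', '').replace('、', '').strip())
    return "/".join(hashtags)

# Hypothetical inputs, for illustration only
sample_url = "https://www.ptt.cc/bbs/Gossiping/M.0000000000.A.000.html"
sample_text = "hello #neo4j and #graph db"

post_id = extract_ptt_post_id(sample_url)
tags = extract_hashtag(sample_text)
print(post_id)  # M.0000000000.A.000
print(tags)     # #neo4j/#graph
```

`extract_mention` works the same way with `@` as the marker.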

In [3]:
import jieba
import jieba.analyse

jieba.initialize()
jieba.set_dictionary("./dict/dict.big.txt")
jieba.load_userdict('./dict/mydic.txt')

wantWordList = set()
with open('./dict/mydic.txt', 'r', encoding="utf8") as file:
    wantWordList = file.readlines()
    wantWordList = [wantword.strip('\n').replace('\ufeff','').strip().split(' ')[0] for wantword in wantWordList]

stopwords = set()
with open('./dict/stopwords.txt', 'r', encoding="utf8") as file:
    stopwords = file.readlines()
    stopwords = [stopword.strip('\n').strip() for stopword in stopwords]
# Use a context manager so the exception-word file is closed after reading
with open("./dict/hippo_exception_word.txt", encoding='utf-8') as except_file:
    exception = except_file.read().split(',')
exception.append(" ")

punct = set(u''':!),.:;?]}$¢'"、。〉》」』】〕〗〞︰︱︳﹐､﹒﹔﹕﹖﹗﹚﹜﹞！），．：；？｜｝︴︶︸︺︼︾﹀﹂﹄﹏､～￠々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖（［｛￡￥〝︵︷︹︻︽︿﹁﹃﹙﹛﹝（｛“‘-—_…''')
punct |= set(exception)
punct = list(punct)
stopwords = set(stopwords+punct)

names = set()
with open('./dict/name.txt', 'r', encoding="utf-8") as file:
    names= file.readlines()
    names = [name.strip('\n').strip() for name in names]

def segmentWord(text):
    words = [word for word in jieba.cut(text, cut_all=False) if len(word.strip())>1 and (word not in stopwords)  ]
    return "/".join(set(words))

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\princ\AppData\Local\Temp\jieba.cache
Loading model cost 0.924 seconds.
Prefix dict has been built succesfully.
Building prefix dict from D:\Programming\Python\Neo4j\課程\dict\dict.big.txt ...
Loading model from cache C:\Users\princ\AppData\Local\Temp\jieba.u2119b7327cb8dafa65534ff0e37256b6.cache
Loading model cost 1.424 seconds.
Prefix dict has been built succesfully.
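
`segmentWord` joins a Python `set`, and the iteration order of a set of strings can differ between runs, so the same text may produce keyword strings in a different order each time. The sketch below shows one possible order-preserving variant of the filtering step. It does not call jieba: `filter_tokens` is a hypothetical helper, and the token list stands in for the output of `jieba.cut(...)`.

```python
def filter_tokens(tokens, stopwords):
    # Keep tokens longer than one character that are not stopwords,
    # dropping duplicates while preserving first-seen order.
    kept = [t for t in tokens if len(t.strip()) > 1 and t not in stopwords]
    return "/".join(dict.fromkeys(kept))

# Hypothetical tokens standing in for the output of jieba.cut(...)
tokens = ["口罩", "的", "豬肉", "口罩", "，", "瘦肉精"]
stopwords = {"的", "，"}

result = filter_tokens(tokens, stopwords)
print(result)  # 口罩/豬肉/瘦肉精
```

A stable order does not change which keyword nodes are created, but it makes the exported CSV files easier to compare between runs.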


In [4]:
MONGO_URI = 'mongodb://127.0.0.1:27017'

client = MongoClient(MONGO_URI)
database_name = 'ptt'
collection_name = 'Gossiping'
db = client[database_name]

<a id="load_data_from_mongodb"></a>
## Load data from MongoDB

In [5]:
start = datetime.datetime.strptime("2020-11-12 00:00:00", "%Y-%m-%d %H:%M:%S")
end = datetime.datetime.strptime("2020-11-13 23:59:59", "%Y-%m-%d %H:%M:%S")
result = list(db[collection_name].find({ "date":{'$gt': start, '$lt': end} }))
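
Note that `$gt` and `$lt` are strict comparisons, so a document stamped exactly at `start` or `end` is excluded. A common alternative is a half-open interval: `$gte` the first day at midnight and `$lt` the day after the last day. The sketch below only builds the filter dictionary and does not connect to MongoDB. `build_date_filter` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def build_date_filter(first_day, last_day, field="date"):
    # Half-open interval: first_day 00:00:00 <= field < (last_day + 1 day) 00:00:00
    start = datetime.strptime(first_day, "%Y-%m-%d")
    end = datetime.strptime(last_day, "%Y-%m-%d") + timedelta(days=1)
    return {field: {"$gte": start, "$lt": end}}

date_filter = build_date_filter("2020-11-12", "2020-11-13")
print(date_filter)
```

The resulting dictionary could be passed to the same call used above, for example `db[collection_name].find(date_filter)`.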

In [6]:
df = pd.DataFrame(result)
if '_id' in df.columns:
    del df['_id']
df.fillna('', inplace=True)
df['postId'] = df['url'].apply(lambda x:extract_ptt_post_id(x))
df['locale'] = df['locale'].apply(lambda x:trim_locale(x))
df['author'] = df['author'].apply(lambda x:str(x).strip())
df = df.replace(np.nan, '', regex=True)
df.sample()

| | board | title | author | content | date | ip | locale | comments | score | url | updatetime | postId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2037 | Gossiping | [問卦] 我朋友當儀隊被川普老婆挽著有反映怎辦? | shoga | \n\n我朋友美國大鵰當美軍儀隊啦\n碰到總統來主持退伍軍人紀念儀式\n現場下雨要我朋友幫第... | 2020-11-12 21:58:55 | 123.194.160.91 | 臺灣 | [{'user': 'A80211ab', 'content': ': 要是我也會有反應 超... | 2 | https://www.ptt.cc/bbs/Gossiping/M.1605189538.... | 2020-11-16 10:47:13.158 | M.1605189538.A.8A8 |


<a id="create_post_node"></a>
## Create Post Node

In [7]:
ptt_post_nodes = df[['postId','title','author','content','date','ip','locale','score','url','board']].copy()
ptt_post_nodes[":LABEL"] = "PTTPost"
ptt_post_nodes = ptt_post_nodes.rename(columns={'postId': 'PTTPost:ID'})
ptt_post_nodes = ptt_post_nodes.drop_duplicates(subset='PTTPost:ID', keep="first")
ptt_post_nodes['keywords'] = list(map(segmentWord, ptt_post_nodes['content']+ptt_post_nodes['title']))
ptt_post_nodes["author"] = ptt_post_nodes["author"].apply(lambda x : replace_double_quote(x))
ptt_post_nodes["author"] = ptt_post_nodes["author"].apply(lambda x : add_quote(x))
ptt_post_nodes["content"] = ptt_post_nodes["content"].apply(lambda x : replace_double_quote(x))
ptt_post_nodes["content"] = ptt_post_nodes["content"].apply(lambda x : add_quote(x))
ptt_post_nodes["title"] = ptt_post_nodes["title"].apply(lambda x : replace_double_quote(x))
ptt_post_nodes["title"] = ptt_post_nodes["title"].apply(lambda x : add_quote(x))
ptt_post_nodes["content"] = ptt_post_nodes["content"].apply(lambda x:str(x).replace('\n',' ').strip())

if ptt_post_nodes.shape[0]>0:
    ptt_post_nodes_opt = ptt_post_nodes[["PTTPost:ID","title","author","content","date","ip","locale","score","url","board",":LABEL"]]
else:
    ptt_post_nodes_opt = pd.DataFrame(columns=["PTTPost:ID","title","author","content","date","ip","locale","score","url","board",":LABEL"])
ptt_post_nodes_opt.to_csv('csv/ptt/ptt_post_nodes.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_post_nodes.shape[0])
ptt_post_nodes_opt.sample()

4023


| | PTTPost:ID | title | author | content | date | ip | locale | score | url | board | :LABEL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3497 | M.1605148480.A.B57 | "Re: [新聞] 高市議會通過萊劑零檢出 陳其邁：盼中央" | "weakerman" | "" | 2020-11-12 10:34:37 | 114.47.85.248 | 臺灣 | 2 | https://www.ptt.cc/bbs/Gossiping/M.1605148480.... | Gossiping | PTTPost |
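
The cell above quotes text fields by hand (`replace_double_quote` followed by `add_quote`) and then writes the file with `csv.QUOTE_NONE`. A possible alternative is to let the `csv` module do the quoting, which keeps embedded double quotes instead of rewriting them to single quotes. The sketch below uses only the standard library and an in-memory buffer. The row values are hypothetical, and the header follows the same `PTTPost:ID` / `:LABEL` layout used in this lesson. Check that your Neo4j version's importer reads doubled quotes the way you expect before switching.

```python
import csv
import io

header = ["PTTPost:ID", "title", "author", ":LABEL"]
# Hypothetical row: the title contains a comma and double quotes on purpose
rows = [["M.0000000000.A.000", 'He said "hi", then left', "someuser", "PTTPost"]]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the original values unchanged
parsed = list(csv.reader(io.StringIO(csv_text)))
```

With `QUOTE_MINIMAL`, only fields that contain a delimiter, a quote character, or a line break are quoted, and embedded quotes are doubled.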


<a id="create_user_node"></a>
## Create User Node

In [8]:
cnt_comment = 10000000
commentCountList = []
commentList = []
for index, row in df[df['comments'].str.len() > 0].iterrows():
    try:
        for x in row['comments']:
            if len(x['content'].replace(':','').strip())>0:
                year = str(row['date']).split('-')[0]
                date_time = x['time'].replace('/','-')
                comment_date = year+'-'+date_time+':00'
                comment_content = x['content'].replace(':','').strip()
                commentList.append({"postId":row['postId'],"user":x['user'].strip(),"comment":comment_content, "date":comment_date})
    except Exception as e:
        pass

df_ptt_comment = pd.DataFrame(commentList)
df_ptt_comment['commmentId'] = df_ptt_comment.index +cnt_comment
df_ptt_comment['commmentId'] = df_ptt_comment['commmentId'].apply(lambda x:"CMT"+str(x))
df_ptt_comment['keywords'] = list(map(segmentWord, df_ptt_comment['comment']))
print(df_ptt_comment.shape[0])
df_ptt_comment.sample(1)

138875


| | postId | user | comment | date | commmentId | keywords |
|---|---|---|---|---|---|---|
| 117169 | M.1605150059.A.B42 | susanna026 | 覺得 口罩管成這樣 要進黑心瘦肉精豬肉？ | 2020-11-12 13:17:00 | CMT10117169 | 口罩/要進/瘦肉精/豬肉/管成/黑心 |
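
The comment timestamps carry no year, so the cell above borrows the year from the post date. That works for this two-day sample, but a comment written in January on a December post would get the wrong year. The sketch below shows one way to handle that case, assuming the comment time looks like `11/12 13:17` (which is what the string handling above implies). `build_comment_datetime` is a hypothetical helper.

```python
from datetime import datetime

def build_comment_datetime(post_date, comment_time):
    # comment_time is assumed to look like "11/12 13:17" (month/day hour:minute)
    parsed = datetime.strptime(f"{post_date.year}/{comment_time.strip()}", "%Y/%m/%d %H:%M")
    # A comment cannot be older than its post, so an earlier timestamp is
    # taken to mean the comment was written after a year boundary.
    # (A 02/29 comment rolled into a non-leap year is not handled here.)
    if parsed < post_date.replace(second=0, microsecond=0):
        parsed = parsed.replace(year=parsed.year + 1)
    return parsed

same_year = build_comment_datetime(datetime(2020, 11, 12, 10, 34, 37), "11/12 13:17")
next_year = build_comment_datetime(datetime(2020, 12, 31, 23, 50, 0), "01/01 00:05")
print(same_year)  # 2020-11-12 13:17:00
print(next_year)  # 2021-01-01 00:05:00
```

Returning a `datetime` instead of a string also lets pandas write the column in one consistent format.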


In [9]:
list_user = set(list(df['author'])+list(df_ptt_comment['user']))
# Pass lists rather than sets: newer pandas versions reject sets as unordered
df_user = pd.DataFrame(list(list_user), columns=["user"])
df_user = df_user.drop_duplicates(subset='user', keep="first")
bag_username_list = []
user_id_map = {}
cnt_user = 100000000
for index, row in df_user.iterrows():
    for user in row['user'].strip().split(','):
        if user not in bag_username_list:
            cnt_user = cnt_user+1
            cnt_user_code = "CU"+str(cnt_user)
            user_id_map[user] = cnt_user_code
            bag_username_list.append(user)

In [10]:
ptt_user_nodes = df_user[['user']].copy()
ptt_user_nodes = ptt_user_nodes.rename(columns={'user': 'username'})
ptt_user_nodes["PTTUser:ID"] = ptt_user_nodes["username"].map(user_id_map)
ptt_user_nodes[":LABEL"] = "PTTUser"
ptt_user_nodes = ptt_user_nodes[["PTTUser:ID","username",":LABEL"]]
ptt_user_nodes = ptt_user_nodes.drop_duplicates(subset='PTTUser:ID', keep="first")
ptt_user_nodes["username"] = ptt_user_nodes["username"].apply(lambda x : replace_double_quote(x))
ptt_user_nodes["username"] = ptt_user_nodes["username"].apply(lambda x : add_quote(x))
ptt_user_nodes.to_csv('csv/ptt/ptt_user_nodes.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_user_nodes.shape[0])
ptt_user_nodes.head(1)

23230


| | PTTUser:ID | username | :LABEL |
|---|---|---|---|
| 0 | CU100000001 | "nolander" | PTTUser |
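
The ID assignment above checks membership in a Python list, which scans the whole list for every name. A dictionary gives the same first-seen numbering with constant-time lookups. The sketch below is a simplified stand-alone version. `assign_ids` is a hypothetical helper, the user names are made up, and unlike the cell above it skips empty names.

```python
def assign_ids(names, prefix, start):
    # Map each distinct name to prefix + running number, in first-seen order
    id_map = {}
    counter = start
    for name in names:
        name = name.strip()
        if name and name not in id_map:  # dict lookup instead of a list scan
            counter += 1
            id_map[name] = f"{prefix}{counter}"
    return id_map

# Hypothetical user names
user_id_map_demo = assign_ids(["alice", "bob", "alice", " carol "], "CU", 100000000)
print(user_id_map_demo)
# {'alice': 'CU100000001', 'bob': 'CU100000002', 'carol': 'CU100000003'}
```

The same helper could number keywords by passing a different prefix, such as `"K"`.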


<a id="create_keyword"></a>
## Create Keyword Node

In [11]:
df['keywords'] = list(map(segmentWord, df['title']+df['content']))
df_ptt_comment.fillna('', inplace=True)
df_ptt_comment['keywords'] = list(map(segmentWord, df_ptt_comment['comment']))
df_keyword = pd.DataFrame(list(df['keywords'])+list(df_ptt_comment['keywords']), columns=['keywords'])
df_keyword.head()

| | keywords |
|---|---|
| 0 | 但謝/死人型/橋頭/里長/權利/討論/姊弟/周遭/褫奪公權/署名/致電/友人/得易科/今年/... |
| 1 | 霍萱側/甜美/粉色/一轉/Iris/人字奶/賣萌/車頭燈/至今/誠意/IG/朝聖/內褲/豆子... |
| 2 | 心慧/蓓今/許可/分別/根本無法/30%/政策/表示/制度/至今/巨資/強調/市長/哲在/二... |
| 3 | ASUS/放在/冰會/Asus/on/X00QD/問卦/去年/家樂福/my/夏天/常溫/Se... |
| 4 | 可惜/實力派/on/Sent/SM/唱個/JPTT/唱歌/一堆/相關/my/興趣/好像/接班... |


In [12]:
bag_keyword_list = []
global_bag_keyword_list = []
seen_keywords = set()  # same contents as global_bag_keyword_list, but with O(1) membership tests
keyword_id_map = {}
cnt_keyword = 100_000_000
for index, row in df_keyword.iterrows():
    for keyword in row['keywords'].split('/'):
        if keyword.strip() not in seen_keywords:
            cnt_keyword = cnt_keyword+1
            cnt_keyword_code = "K"+str(cnt_keyword)
            keyword_id_map[keyword] = cnt_keyword_code
            bag_keyword_list.append({"id":cnt_keyword_code, "keyword":keyword})
            global_bag_keyword_list.append(keyword)
            seen_keywords.add(keyword)

if global_bag_keyword_list:
    # Create keyword nodes
    all_keyword_nodes = pd.DataFrame(bag_keyword_list)
    all_keyword_nodes = all_keyword_nodes.rename(columns={'id':'Keyword:ID'})
    all_keyword_nodes[":LABEL"] = "Keywords"
    all_keyword_nodes["keyword"] = all_keyword_nodes["keyword"].apply(lambda x : replace_double_quote(x))
    all_keyword_nodes["keyword"] = all_keyword_nodes["keyword"].apply(lambda x: add_quote(x))
    all_keyword_nodes.to_csv(f'csv/ptt/post_keyword_nodes.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
else:
    # Create empty keyword nodes
    keyword_id_map = {}
    all_keyword_nodes = pd.DataFrame(columns=['Keyword:ID', 'keyword', ':LABEL'])
    all_keyword_nodes.to_csv(f'csv/ptt/post_keyword_nodes.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
all_keyword_nodes.sample()

| | Keyword:ID | keyword | :LABEL |
|---|---|---|---|
| 39940 | K100039941 | "山線" | Keywords |


<a id="relationship_ptt_keyword_POST"></a>
## Relationship of PTTPost keywords and post

In [13]:
to_id_list = list(ptt_post_nodes['PTTPost:ID'])
Ptt_keyword_list = []
for index, row in ptt_post_nodes[ptt_post_nodes['keywords']!=''].iterrows():
    for keyword in row['keywords'].replace('\n',' ').strip().split('/'):
        Ptt_keyword_list.append({"keyword":keyword, "postId":row['PTTPost:ID']})
Ptt_keyword_nodes = pd.DataFrame(Ptt_keyword_list)
Ptt_keyword_relation = Ptt_keyword_nodes[['keyword','postId']]
Ptt_keyword_relation = Ptt_keyword_relation.rename(columns={'keyword': 'keyword',
                                                            'postId': ':END_ID'})
Ptt_keyword_relation[":START_ID"] =  Ptt_keyword_relation["keyword"].map(keyword_id_map)
Ptt_keyword_relation[":END_ID"] = Ptt_keyword_relation[":END_ID"].apply(lambda x:str(x))
Ptt_keyword_relation['in_list'] = Ptt_keyword_relation[':END_ID'].apply(lambda x:if_in_list(x, to_id_list))
Ptt_keyword_relation = Ptt_keyword_relation[Ptt_keyword_relation['in_list']==1]
del Ptt_keyword_relation['in_list']
Ptt_keyword_relation[':TYPE'] = 'RELATED_TO'
Ptt_keyword_relation["keyword"] = Ptt_keyword_relation["keyword"].apply(lambda x : replace_double_quote(x))
Ptt_keyword_relation["keyword"] = Ptt_keyword_relation["keyword"].apply(lambda x : add_quote(x))
Ptt_keyword_relation = Ptt_keyword_relation[[":START_ID","keyword",":END_ID",":TYPE"]]
Ptt_keyword_relation.to_csv('csv/ptt/ptt_post_keyword_rel.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(Ptt_keyword_relation.shape[0])
Ptt_keyword_relation.sample()

130381


| | :START_ID | keyword | :END_ID | :TYPE |
|---|---|---|---|---|
| 5768 | K100000359 | "my" | M.1605275683.A.A58 | RELATED_TO |
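
Stripped of the pandas bookkeeping, the cell above splits each post's `/`-joined keyword string and emits one `RELATED_TO` row per keyword, pointing from the keyword node to the post. The sketch below shows that core step with plain Python. `keyword_relationships` is a hypothetical helper, and the post IDs and keyword IDs are made up.

```python
def keyword_relationships(posts, keyword_id_map, rel_type="RELATED_TO"):
    # posts: iterable of (post_id, "kw1/kw2/...") pairs
    rows = []
    for post_id, keywords in posts:
        for keyword in keywords.split('/'):
            start_id = keyword_id_map.get(keyword)
            if start_id is None:  # skip keywords that have no node ID
                continue
            rows.append({":START_ID": start_id, ":END_ID": post_id, ":TYPE": rel_type})
    return rows

# Hypothetical data
demo_map = {"口罩": "K1", "豬肉": "K2"}
demo_posts = [("P1", "口罩/豬肉"), ("P2", ""), ("P3", "豬肉/未知")]

demo_rows = keyword_relationships(demo_posts, demo_map)
for row in demo_rows:
    print(row)
```

Skipping keywords without an ID avoids writing rows with a missing `:START_ID`, which `Series.map` would otherwise leave as `NaN`.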


<a id="relationship_ptt_keyword_COMMENT"></a>
## Relationship of PTTPost keywords and comment

In [14]:
to_id_list = list(df_ptt_comment['commmentId'])
Ptt_keyword_list = []
for index, row in df_ptt_comment[df_ptt_comment['keywords']!=''].iterrows():
    for keyword in row['keywords'].replace('\n',' ').strip().split('/'):
        Ptt_keyword_list.append({"keyword":keyword, "commmentId":row['commmentId']})

In [15]:
Ptt_keyword_nodes = pd.DataFrame(Ptt_keyword_list)
Ptt_keyword_relation = Ptt_keyword_nodes[['keyword','commmentId']]
Ptt_keyword_relation = Ptt_keyword_relation.rename(columns={'keyword': 'keyword',
                                                            'commmentId': ':END_ID'})
Ptt_keyword_relation[":START_ID"] =  Ptt_keyword_relation["keyword"].map(keyword_id_map)
Ptt_keyword_relation[":END_ID"] = Ptt_keyword_relation[":END_ID"].apply(lambda x:str(x))
Ptt_keyword_relation['in_list'] = Ptt_keyword_relation[':END_ID'].apply(lambda x:if_in_list(x, to_id_list))
Ptt_keyword_relation = Ptt_keyword_relation[Ptt_keyword_relation['in_list']==1]
del Ptt_keyword_relation['in_list']
Ptt_keyword_relation[':TYPE'] = 'RELATED_TO'
Ptt_keyword_relation["keyword"] = Ptt_keyword_relation["keyword"].apply(lambda x : replace_double_quote(x))
Ptt_keyword_relation["keyword"] = Ptt_keyword_relation["keyword"].apply(lambda x : add_quote(x))
Ptt_keyword_relation = Ptt_keyword_relation[[":START_ID","keyword",":END_ID",":TYPE"]]
Ptt_keyword_relation.to_csv('csv/ptt/ptt_comment_keyword_rel.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(Ptt_keyword_relation.shape[0])
Ptt_keyword_relation.sample()

448206


| | :START_ID | keyword | :END_ID | :TYPE |
|---|---|---|---|---|
| 357737 | K100011364 | "自律" | CMT10112949 | RELATED_TO |


<a id="relationship_post_user"></a>
## Relationship of PTTPost and User

In [16]:
to_id_list = list(ptt_post_nodes['PTTPost:ID'])
# .copy() gives an independent DataFrame, so the assignments below do not trigger pandas' SettingWithCopyWarning
ptt_user_post_relation = df[['author','title','postId']].copy()
ptt_user_post_relation[":START_ID"] = ptt_user_post_relation["author"].map(user_id_map)
ptt_user_post_relation = ptt_user_post_relation.rename(columns={'author': 'username',
                                                                'postId': ':END_ID'})
ptt_user_post_relation['in_list'] = ptt_user_post_relation[':END_ID'].apply(lambda x:if_in_list(x, to_id_list))
ptt_user_post_relation = ptt_user_post_relation[ptt_user_post_relation['in_list']==1]
del ptt_user_post_relation['in_list']
ptt_user_post_relation[':TYPE'] = 'POSTS'
# ptt_user_post_relation[':START_ID'] = ptt_user_post_relation[':START_ID'].apply(np.int64)
ptt_user_post_relation[':START_ID'] = ptt_user_post_relation[':START_ID'].apply(lambda x:str(x))
ptt_user_post_relation = ptt_user_post_relation[[":START_ID","username","title",":END_ID",":TYPE"]]
ptt_user_post_relation["username"] = ptt_user_post_relation["username"].apply(lambda x : replace_double_quote(x))
ptt_user_post_relation["username"] = ptt_user_post_relation["username"].apply(lambda x : add_quote(x))
ptt_user_post_relation["title"] = ptt_user_post_relation["title"].apply(lambda x : replace_double_quote(x))
ptt_user_post_relation["title"] = ptt_user_post_relation["title"].apply(lambda x : add_quote(x))
ptt_user_post_relation.to_csv('csv/ptt/ptt_user_post_rel.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_user_post_relation.shape[0])
ptt_user_post_relation.sample(1)


4023


| | :START_ID | username | title | :END_ID | :TYPE |
|---|---|---|---|---|---|
| 184 | CU100018808 | "lelingzi" | "[問卦] 唸文組有什麼壓力嗎？" | M.1605275511.A.0DE | POSTS |


<a id="create_comment_node"></a>
## Create comment Node

In [17]:
df_ptt_comment.sample(1)

| | postId | user | comment | date | commmentId | keywords |
|---|---|---|---|---|---|---|
| 38111 | M.1605237144.A.C6E | scottandk | 認真說，很會善用自己優勢，但機體(腦袋)受限 | 2020-11-13 11:40:00 | CMT10038111 | 很會/機體/善用/受限/認真/優勢/腦袋 |


In [18]:
ptt_comment_nodes = df_ptt_comment[['commmentId','comment','user','date']].copy()

ptt_comment_nodes[":LABEL"] = "PTTComment"
ptt_comment_nodes = ptt_comment_nodes.rename(columns={'commmentId': 'PTTComment:ID'})
ptt_comment_nodes = ptt_comment_nodes.drop_duplicates(subset='PTTComment:ID', keep="first")

ptt_comment_nodes["user"] = ptt_comment_nodes["user"].apply(lambda x : replace_double_quote(x))
ptt_comment_nodes["user"] = ptt_comment_nodes["user"].apply(lambda x : add_quote(x))
ptt_comment_nodes["comment"] = ptt_comment_nodes["comment"].apply(lambda x : replace_double_quote(x))
ptt_comment_nodes["comment"] = ptt_comment_nodes["comment"].apply(lambda x : add_quote(x))
ptt_comment_nodes.to_csv('csv/ptt/ptt_comment_nodes.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_comment_nodes.shape[0])
ptt_comment_nodes.sample(1)

138875


| | PTTComment:ID | comment | user | date | :LABEL |
|---|---|---|---|---|---|
| 44939 | CMT10044939 | "4歲年輕人？滿口幹話" | "ShadeSea" | 2020-11-13 09:52:00 | PTTComment |


<a id="relationship_comment_post"></a>
## Relationship of Comment and Post

In [19]:
print(df_ptt_comment.shape[0])
df_ptt_comment2 = df_ptt_comment[df_ptt_comment['user'].notna()]
print(df_ptt_comment2.shape[0])
df_ptt_comment2.sample(1)

138875
138875


| | postId | user | comment | date | commmentId | keywords |
|---|---|---|---|---|---|---|
| 23812 | M.1605249729.A.876 | daruq | 種九層塔 | 2020-11-13 16:25:00 | CMT10023812 | 九層 |


In [20]:
ptt_post_nodes.sample(1)

| | PTTPost:ID | title | author | content | date | ip | locale | score | url | board | :LABEL | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3502 | M.1605148417.A.88C | "[問卦] 有沒有 Overwatch 還死掉的八卦" | "mossdevin" | "今天早上 youtube 大家都知道掛掉了 炒得沸沸揚揚 結果有人貼出一個偵測網站" | 2020-11-12 10:33:34 | 180.217.87.229 | 臺灣 | 0 | https://www.ptt.cc/bbs/Gossiping/M.1605148417.... | Gossiping | PTTPost | 貼出/掛掉/youtube/炒得/有人/Overwatch/死掉/問卦/八卦/偵測/早上/網... |


In [21]:
to_id_list = list(ptt_post_nodes['PTTPost:ID'])
ptt_post_comment_relation = df_ptt_comment2[['commmentId','user','comment','postId']]
ptt_post_comment_relation = ptt_post_comment_relation.rename(columns={'commmentId':':START_ID', 'user': 'username',
                                                                      'postId': ':END_ID'})
ptt_post_comment_relation['in_list'] = ptt_post_comment_relation[':END_ID'].apply(lambda x:if_in_list(x, to_id_list))
ptt_post_comment_relation = ptt_post_comment_relation[ptt_post_comment_relation['in_list']==1]
del ptt_post_comment_relation['in_list']
ptt_post_comment_relation[':TYPE'] = 'COMMENT_OF'
# ptt_post_comment_relation[':START_ID'] = ptt_post_comment_relation[':START_ID'].apply(np.int64)
ptt_post_comment_relation["username"] = ptt_post_comment_relation["username"].apply(lambda x : replace_double_quote(x))
ptt_post_comment_relation["username"] = ptt_post_comment_relation["username"].apply(lambda x : add_quote(x))
ptt_post_comment_relation["comment"] = ptt_post_comment_relation["comment"].apply(lambda x : replace_double_quote(x))
ptt_post_comment_relation["comment"] = ptt_post_comment_relation["comment"].apply(lambda x : add_quote(x))
ptt_post_comment_relation.to_csv('csv/ptt/ptt_post_comment_rel.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_post_comment_relation.shape[0])
ptt_post_comment_relation.sample(3)

138875


| | :START_ID | username | comment | :END_ID | :TYPE |
|---|---|---|---|---|---|
| 12772 | CMT10012772 | "linhsiuwei" | "那憑什麼罷免韓國瑜？" | M.1605261835.A.AA8 | COMMENT_OF |
| 76237 | CMT10076237 | "jack529" | "沒被抓包不就報公帳，這麼簡單還要護航，真的可憐" | M.1605181924.A.4E3 | COMMENT_OF |
| 12376 | CMT10012376 | "odaaaaa" | "真的很有錢，我是說民進黨" | M.1605262183.A.5E9 | COMMENT_OF |


<a id="relationship_comment_user"></a>
## Relationship of Comment and User

In [22]:
to_id_list = list(ptt_comment_nodes['PTTComment:ID'])
# .copy() gives an independent DataFrame, so the assignments below do not trigger pandas' SettingWithCopyWarning
ptt_user_comment_relation = df_ptt_comment[['user','comment','commmentId']].copy()
ptt_user_comment_relation[":START_ID"] = ptt_user_comment_relation["user"].map(user_id_map)
ptt_user_comment_relation = ptt_user_comment_relation.rename(columns={'user': 'username',
                                                                      'commmentId': ':END_ID'})
ptt_user_comment_relation['in_list'] = ptt_user_comment_relation[':END_ID'].apply(lambda x:if_in_list(x, to_id_list))
ptt_user_comment_relation = ptt_user_comment_relation[ptt_user_comment_relation['in_list']==1]
del ptt_user_comment_relation['in_list']
ptt_user_comment_relation[':TYPE'] = 'COMMENTS'
# ptt_user_comment_relation[':START_ID'] = ptt_user_comment_relation[':START_ID'].apply(np.int64)
ptt_user_comment_relation = ptt_user_comment_relation[[":START_ID","username","comment",":END_ID",":TYPE"]]
ptt_user_comment_relation["username"] = ptt_user_comment_relation["username"].apply(lambda x : replace_double_quote(x))
ptt_user_comment_relation["username"] = ptt_user_comment_relation["username"].apply(lambda x : add_quote(x))
ptt_user_comment_relation["comment"] = ptt_user_comment_relation["comment"].apply(lambda x : replace_double_quote(x))
ptt_user_comment_relation["comment"] = ptt_user_comment_relation["comment"].apply(lambda x : add_quote(x))
ptt_user_comment_relation.to_csv('csv/ptt/ptt_user_comment_rel.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_NONE, escapechar='\\')
print(ptt_user_comment_relation.shape[0])
ptt_user_comment_relation.sample(1)


138875


| | :START_ID | username | comment | :END_ID | :TYPE |
|---|---|---|---|---|---|
| 85924 | CU100017837 | "ynd" | "乾～～～～" | CMT10085924 | COMMENTS |
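
The relationship cells use `if_in_list(x, to_id_list)`, which scans a Python list once per row and only checks the `:END_ID` side. Before importing, it can be useful to check both ends of every relationship against the node IDs, using a set for fast lookups. The sketch below is a minimal stand-alone version of such a check. `dangling_rows` is a hypothetical helper and the IDs are made up.

```python
def dangling_rows(relationships, node_ids):
    # Return relationship rows whose start or end ID is not a known node
    known = set(node_ids)  # set lookup instead of scanning a list per row
    return [r for r in relationships
            if r[":START_ID"] not in known or r[":END_ID"] not in known]

# Hypothetical node IDs and relationship rows
node_ids = ["CU1", "CU2", "CMT1", "CMT2"]
relationships = [
    {":START_ID": "CU1", ":END_ID": "CMT1", ":TYPE": "COMMENTS"},
    {":START_ID": "CU2", ":END_ID": "CMT9", ":TYPE": "COMMENTS"},  # unknown comment
    {":START_ID": "nan", ":END_ID": "CMT2", ":TYPE": "COMMENTS"},  # unmapped user
]

bad = dangling_rows(relationships, node_ids)
print(len(bad))  # 2
```

An empty result means every relationship row refers to nodes that exist in the node files, so problems surface before the import is run.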


<a id="import_neo4j"></a>
## Import to Neo4j

The notebook writes the CSV files to `csv/ptt/`, while the command below reads them from Neo4j's `import/` directory, so copy them there first. Run the command from the Neo4j home directory.

```
.\bin\neo4j-admin import --nodes=import/ptt_comment_nodes.csv --nodes=import/ptt_post_nodes.csv --nodes=import/post_keyword_nodes.csv --relationships=import/ptt_user_comment_rel.csv --nodes=import/ptt_user_nodes.csv --relationships=import/ptt_user_post_rel.csv --relationships=import/ptt_post_comment_rel.csv --relationships=import/ptt_post_keyword_rel.csv --relationships=import/ptt_comment_keyword_rel.csv
```

<a id="cypher"></a>
## Cypher query

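```cypher
// Users, their direct neighbours, and anything reachable within two further outgoing hops
MATCH p=( (n:PTTUser)-[r1]-(n1)-[*0..2]->(nd) )
RETURN * LIMIT 25
```

```cypher
// Everything within one or two hops of a user, in either direction
MATCH p=( (n:PTTUser)-[*1..2]-(m) )
RETURN * LIMIT 25
```

```cypher
// Users within two hops of the keyword "道歉" (apology)
MATCH (n:Keywords{keyword:"道歉"})-[*1..2]-(m:PTTUser) RETURN * LIMIT 25
```

```cypher
// Posts and their authors
MATCH (n:PTTPost)-[r:POSTS]-(m:PTTUser) RETURN * LIMIT 25
```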

```cypher
// Users who both post and comment (POSTS links a PTTUser to a PTTPost, so m must be a PTTPost)
MATCH (n:PTTUser)-[r:POSTS]-(m:PTTPost),(n:PTTUser)-[r2:COMMENTS]-(o:PTTComment) RETURN * LIMIT 25
```

```cypher
// A user's comment, the post it belongs to, and the other comments on that post
MATCH (n:PTTUser)-[r1:COMMENTS]-(o:PTTComment)-[r2:COMMENT_OF]-(p:PTTPost)-[r3:COMMENT_OF]-(c:PTTComment) RETURN * LIMIT 25
```

```cypher
// As above, plus the users who wrote those other comments
MATCH (n:PTTUser)-[r1:COMMENTS]-(o:PTTComment)-[r2:COMMENT_OF]-(p:PTTPost)-[r3:COMMENT_OF]-(c:PTTComment)-[r4:COMMENTS]-(q:PTTUser) RETURN * LIMIT 25
```

```cypher
// Users who commented more than once on the same post
MATCH (n:PTTUser)-[r1:COMMENTS]-(o:PTTComment)-[r2:COMMENT_OF]-(p:PTTPost)-[r3:COMMENT_OF]-(c:PTTComment)-[r4:COMMENTS]-(n) RETURN * LIMIT 25
```

```cypher
// Users who have posted and have also commented on some post
MATCH (n:PTTUser)-[r1:POSTS]-(m),(n:PTTUser)-[r2:COMMENTS]-(o:PTTComment)-[r3:COMMENT_OF]-(p:PTTPost) RETURN * LIMIT 25
```

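```cypher
// As above, plus the other comments on the commented post (each relationship needs its own variable, hence r4)
MATCH (n:PTTUser)-[r1:POSTS]-(m),(n:PTTUser)-[r2:COMMENTS]-(o:PTTComment)-[r3:COMMENT_OF]-(p:PTTPost)-[r4:COMMENT_OF]-(c:PTTComment) RETURN * LIMIT 25
```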

```cypher
// As above, plus the users who wrote those other comments
MATCH (n:PTTUser)-[r1:POSTS]-(m),(n:PTTUser)-[r2:COMMENTS]-(o:PTTComment)-[r3:COMMENT_OF]-(p:PTTPost)-[r4:COMMENT_OF]-(c:PTTComment)-[r5:COMMENTS]-(u:PTTUser) RETURN * LIMIT 25
```

```cypher
// Users who have posted and have commented more than once on the same post
MATCH (n:PTTUser)-[r1:POSTS]-(m),(n:PTTUser)-[r2:COMMENTS]-(o:PTTComment)-[r3:COMMENT_OF]-(p:PTTPost)-[r4:COMMENT_OF]-(c:PTTComment)-[r5:COMMENTS]-(n) RETURN * LIMIT 25
```

## Summary

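In this exercise we built keyword nodes but did not extract hashtags or mentions (`extract_hashtag` and `extract_mention` are defined but never called). You can add more nodes and relationships by following the approach from the previous lesson.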

## Homework

- Change `collection_name` to another collection in your MongoDB.
- Create more nodes and relationships.