# Scraping Facebook Fan Page Data, Part 2: Fetching Commenters and Their Comments

### We fetch fan page data through the Facebook Graph API, which requires an access token
### We create a Facebook App and use its App ID and App Secret as the credentials

[Note] What is the Graph API? http://www.tonylin.idv.tw/dokuwiki/doku.php/facebook:basic:graphapi

![Imgur](https://i.imgur.com/PkpTIlC.png)
![Imgur](https://i.imgur.com/6Uhz2ds.png)
![Imgur](https://i.imgur.com/d02QeHz.png)

In [None]:
# Load the Python libraries
import json
import datetime
import csv
import time
import pymongo
from pymongo import MongoClient
try:
    from urllib.request import urlopen, Request
except ImportError:
    from urllib2 import urlopen, Request

## First connect to the database and open the collection that was created to store the data

In [None]:
client = MongoClient("140.120.13.242",27017)
print(client)

db= client["FB_DB2"]
col=db["fansPage"]

## Use the App ID and App Secret obtained above as the access token


In [None]:
app_id = ""
app_secret = ""
access_token = app_id + "|" + app_secret

## Get the fan page ID

![Imgur](https://i.imgur.com/3C9Gu9R.png)

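## Because we analyze more than one fan page, put the page IDs and the matching page names into lists
## Then build a dictionary whose keys are the page IDs and whose values are the page names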

In [None]:
file_id = ["appledaily.tw","tsaiingwen","MaYingjeou","starbuckstaiwan","duncanlindesign","jay","ashin555","YahooTWNews","ETtoday","news.ebc"]
file_name=["台灣蘋果日報","蔡英文 Tsai Ing-wen","馬英九","統一星巴克咖啡同好會","Duncan","周杰倫 Jay Chou","五月天 阿信","Yahoo!奇摩新聞","ETNEWS新聞雲","東森新聞"]

# Pair each page ID with its page name
dic = dict(zip(file_id, file_name))

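## The program is built from five main functions

request_until_succeed(url):
Retries the request so that the fetch succeeds

getFacebookCommentFeedUrl(base_url):
Builds the request URL string

getReactionsForComments(base_url):
Gets the reactions (Like, Love, ...) that each comment received

processFacebookComment(comment, status_id, parent_id=''):
Structures a comment into a tuple

scrapeFacebookPageFeedComments(page_id, access_token):
The main routine; it calls the functions above to do the scraping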

-------------------------------------------------------------------------

## Make sure the connection succeeds

In [1]:
def request_until_succeed(url):
    req = Request(url)
    success = False
    count = 0  # retry counter, so the loop cannot keep retrying forever
    while success is False:
        try:
            response = urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print(e)
            time.sleep(5)

            print("Error for URL {}: {}".format(url, datetime.datetime.now()))
            print("Retrying.")
            if count == 3:  # maximum number of retries; adjust as needed
                return None
            else:
                count+=1
    return response.read().decode('utf8')
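
The bounded-retry idea can be tried without touching the network. The following is a minimal sketch, not the notebook's own function: `retry` and `flaky` are hypothetical names, and `flaky` stands in for a request that fails twice before it succeeds.

```python
import time


def retry(fetch, max_retries=3, delay=5.0):
    """Call fetch() until it succeeds or the retry budget is spent.

    Returns the result of fetch(), or None after max_retries failed retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception as e:
            print("Attempt {} failed: {}".format(attempt + 1, e))
            if attempt < max_retries:
                time.sleep(delay)
    return None


# Hypothetical stand-in for a network call: fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("temporary failure")
    return '{"data": []}'


result = retry(flaky, max_retries=3, delay=0)
print(result)
```

Because the helper returns None once the budget is spent, a caller has to check for None before passing the result to `json.loads`.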

--------

## Adjust the encoding

In [2]:
def unicode_decode(text):
    try:
        return text.encode('utf-8').decode()
    except UnicodeDecodeError:
        return text.encode('utf-8')

---------

## Build the URL string

url = base_url + fields

fields: the data fields you want to retrieve

In [3]:
def getFacebookCommentFeedUrl(base_url):

    # Construct the URL string
    fields = "&fields=id,message,reactions.limit(0).summary(true)" + \
        ",created_time,comments,from,attachment"
    url = base_url + fields

    return url
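
To see what the finished request looks like, the pieces can be put together in one place. This is only a sketch: `build_comment_url` is a hypothetical helper, and the status ID and token are placeholders rather than real values.

```python
def build_comment_url(status_id, access_token, limit=100, after=None):
    """Assemble a comments request URL from the same pieces the notebook uses."""
    base = "https://graph.facebook.com/v2.9"
    node = "/{}/comments".format(status_id)
    parameters = "/?limit={}&access_token={}".format(limit, access_token)
    fields = ("&fields=id,message,reactions.limit(0).summary(true)"
              ",created_time,comments,from,attachment")
    # The paging cursor is only added when one is available
    cursor = "" if not after else "&after={}".format(after)
    return base + node + parameters + cursor + fields


# Placeholder status ID and token, for illustration only
url = build_comment_url("123_456", "APP_ID|APP_SECRET")
print(url)
```

Printing the URL and opening it in a browser is a quick way to check the field list before running the full scrape.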

---------

## Get the reactions each comment received (Like, Love, Wow, ...)

In [4]:
def getReactionsForComments(base_url):

    reaction_types = ['like', 'love', 'wow', 'haha', 'sad', 'angry']
    reactions_dict = {}   # dict of {status_id: tuple<6>}

    for reaction_type in reaction_types:
        fields = "&fields=reactions.type({}).limit(0).summary(total_count)".format(
            reaction_type.upper())

        url = base_url + fields

        data = json.loads(request_until_succeed(url))['data']

        data_processed = set()  # set() removes rare duplicates in statuses
        for status in data:
            id = status['id']
            count = status['reactions']['summary']['total_count']
            data_processed.add((id, count))

        for id, count in data_processed:
            if id in reactions_dict:
                reactions_dict[id] = reactions_dict[id] + (count,)
            else:
                reactions_dict[id] = (count,)

    return reactions_dict
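
The way the per-type counts end up as one tuple per comment can be shown with in-memory data. This is a sketch under assumptions: `merge_reaction_counts` is a hypothetical helper and the counts are made up. It also shows how the "special" reactions (thankful/pride) are later derived by subtraction.

```python
REACTION_TYPES = ['like', 'love', 'wow', 'haha', 'sad', 'angry']


def merge_reaction_counts(per_type):
    """Merge {reaction_type: {comment_id: count}} into {comment_id: tuple}.

    Counts are appended in the order of REACTION_TYPES, so position 0 of
    each tuple is the number of likes, position 1 the number of loves, etc.
    """
    merged = {}
    for reaction_type in REACTION_TYPES:
        for comment_id, count in per_type.get(reaction_type, {}).items():
            merged[comment_id] = merged.get(comment_id, ()) + (count,)
    return merged


# Made-up counts for two comments
sample = {
    'like': {'c1': 5, 'c2': 1},
    'love': {'c1': 2, 'c2': 0},
    'wow': {'c1': 0, 'c2': 0},
    'haha': {'c1': 1, 'c2': 0},
    'sad': {'c1': 0, 'c2': 0},
    'angry': {'c1': 0, 'c2': 3},
}
merged = merge_reaction_counts(sample)

# If the API reported 9 reactions in total for c1, the remainder is "special"
total_reactions = 9
num_special = total_reactions - sum(merged['c1'])
print(merged['c1'], num_special)
```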

-----------

## Structure each comment as a tuple and return it

In [5]:
def processFacebookComment(comment, status_id, parent_id=''):

    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.

    # Additionally, some items may not always exist,
    # so must check for existence first

    # Check whether each field has a value and handle the missing ones
    comment_id = comment['id']
    comment_message = '' if 'message' not in comment or comment['message'] \
        == '' else unicode_decode(comment['message'])
    comment_author = unicode_decode(comment['from']['name'])
    num_reactions = 0 if 'reactions' not in comment else \
        comment['reactions']['summary']['total_count']

    if 'attachment' in comment:
        attachment_type = comment['attachment']['type']
        attachment_type = 'gif' if attachment_type == 'animated_image_share' \
            else attachment_type
        attach_tag = "[[{}]]".format(attachment_type.upper())
        comment_message = attach_tag if comment_message == '' else \
            comment_message + " " + attach_tag

    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.

    comment_published = datetime.datetime.strptime(
        comment['created_time'], '%Y-%m-%dT%H:%M:%S+0000')
    # Shift to Taiwan time (UTC+8)
    comment_published = comment_published + datetime.timedelta(hours=8)
    comment_published = comment_published.strftime(
        '%Y-%m-%d %H:%M:%S')  # best time format for spreadsheet programs

    # Return a tuple of all processed data

    return (comment_id, status_id, parent_id, comment_message, comment_author,
            comment_published, num_reactions)
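
The time handling can also be written with a timezone-aware datetime instead of adding eight hours by hand. This is an alternative sketch, assuming Python 3 (where `%z` parses the `+0000` suffix); `to_taiwan_time` is a hypothetical helper and is not part of the notebook.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset used for Taiwan time
TAIWAN = timezone(timedelta(hours=8))


def to_taiwan_time(created_time):
    """Convert a Graph API timestamp to a spreadsheet-friendly Taiwan time string."""
    utc_time = datetime.strptime(created_time, '%Y-%m-%dT%H:%M:%S%z')
    return utc_time.astimezone(TAIWAN).strftime('%Y-%m-%d %H:%M:%S')


print(to_taiwan_time('2017-09-21T08:48:19+0000'))  # prints 2017-09-21 16:48:19
```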

-----------------------------------------------------------------

## The main routine, which calls the functions above to do the scraping
## It uses two nested while loops to fetch every comment and every reply

In [6]:
def scrapeFacebookPageFeedComments(page_id, access_token):
    # Create the output CSV file that will hold this page's comments
    with open('{}_facebook_comments.csv'.format(page_id), 'w', newline='', encoding='utf-8-sig') as file:
        w = csv.writer(file)

        # Write the header row
        w.writerow(["comment_id", "status_id", "parent_id", "comment_message",
                    "comment_author", "comment_published", "num_reactions",
                    "num_likes", "num_loves", "num_wows", "num_hahas",
                    "num_sads", "num_angrys", "num_special"])

        num_processed = 0  # number of comments processed so far
        scrape_starttime = datetime.datetime.now()
        after = ''
        base = "https://graph.facebook.com/v2.9"
        parameters = "/?limit={}&access_token={}".format(
            100, access_token)

        print("Scraping {} Comments From Posts: {}\n".format(
            dic[page_id], scrape_starttime))

        with open('{}_facebook_statuses.csv'.format(page_id), 'r',encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)

            # Uncomment below line to scrape comments for a specific status_id
            # reader = [dict(status_id='5550296508_10154352768246509')]

            for status in reader:
                has_next_page = True
                after = ''  # reset the paging cursor for each post

                while has_next_page:

                    node = "/{}/comments".format(status['status_id'])
                    after = '' if after == '' else "&after={}".format(after)
                    base_url = base + node + parameters + after

                    url = getFacebookCommentFeedUrl(base_url)
                    # If the page cannot be fetched, skip the rest of this post
                    # so that the whole program does not stall
                    try:
                        comments = json.loads(request_until_succeed(url))
                        reactions = getReactionsForComments(base_url)
                    except Exception:
                        print("Server did not respond")
                        break

                    for comment in comments['data']:
                        comment_data = processFacebookComment(
                            comment, status['status_id'])
                        # print(comment_data[0])
                        reactions_data = reactions[comment_data[0]]
                        # commenter name = comment_data[4]
                        # comment text   = comment_data[3]
                        # Append the comment to this commenter's list for the page;
                        # upsert=True creates the document if the commenter is new
                        col.update_one({"name": comment_data[4]},
                                       {"$push": {dic[page_id]: comment_data[3]}},
                                       upsert=True)


                        # calculate thankful/pride through algebra
                        num_special = comment_data[6] - sum(reactions_data)
                        w.writerow(comment_data + reactions_data +
                                   (num_special, ))

                        if 'comments' in comment:
                            has_next_subpage = True
                            sub_after = ''

                            while has_next_subpage:
                                sub_node = "/{}/comments".format(comment['id'])
                                sub_after = '' if sub_after == '' else "&after={}".format(
                                    sub_after)
                                sub_base_url = base + sub_node + parameters + sub_after

                                sub_url = getFacebookCommentFeedUrl(sub_base_url)
                                try:
                                    sub_comments = json.loads(request_until_succeed(sub_url))
                                    sub_reactions = getReactionsForComments(sub_base_url)
                                except Exception:
                                    # Give up on this reply thread instead of looping forever
                                    print("Server did not respond (sub)")
                                    break

                                for sub_comment in sub_comments['data']:
                                    sub_comment_data = processFacebookComment(sub_comment, status['status_id'], comment['id'])
                                    sub_reactions_data = sub_reactions[sub_comment_data[0]]
                                    # reply author = sub_comment_data[4]
                                    # reply text   = sub_comment_data[3]

                                    num_sub_special = sub_comment_data[6] - sum(sub_reactions_data)
                                    w.writerow(sub_comment_data +sub_reactions_data + (num_sub_special,))

                                    num_processed += 1
                                    if num_processed % 100 == 0:
                                        print("{} Comments Processed: {}".format(num_processed,datetime.datetime.now()))

                                if 'paging' in sub_comments:
                                    if 'next' in sub_comments['paging']:
                                        sub_after = sub_comments['paging']['cursors']['after']
                                    else:
                                        has_next_subpage = False
                                else:
                                    has_next_subpage = False

                        # output progress occasionally to make sure code is not
                        # stalling
                        num_processed += 1
                        if num_processed % 100 == 0:
                            print("{} Comments Processed: {}".format(num_processed, datetime.datetime.now()))
                    if 'paging' in comments:
                        if 'next' in comments['paging']:
                            after = comments['paging']['cursors']['after']
                        else:
                            has_next_page = False
                    else:
                        has_next_page = False

        print("\nDone!\n{} Comments Processed in {}".format(num_processed, datetime.datetime.now() - scrape_starttime))
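
The cursor-based paging that both while loops rely on can be isolated as a small generator. This is a sketch with made-up in-memory responses: `iterate_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the response shape (`data`, `paging.cursors.after`, `paging.next`) mirrors the keys the code above reads.

```python
def iterate_pages(fetch_page):
    """Yield every item across pages, following the 'after' cursor.

    fetch_page(after) must return a dict shaped like a Graph API response.
    Paging stops as soon as a response has no 'next' link.
    """
    after = None
    while True:
        page = fetch_page(after)
        for item in page.get('data', []):
            yield item
        paging = page.get('paging', {})
        if 'next' not in paging:
            break
        after = paging['cursors']['after']


# Made-up responses keyed by the cursor that requests them
FAKE_PAGES = {
    None: {'data': [{'id': 'c1'}, {'id': 'c2'}],
           'paging': {'cursors': {'after': 'A'},
                      'next': 'https://example.invalid/page2'}},
    'A': {'data': [{'id': 'c3'}],
          'paging': {'cursors': {'after': 'B'}}},  # no 'next': last page
}


def fake_fetch(after):
    return FAKE_PAGES[after]


ids = [comment['id'] for comment in iterate_pages(fake_fetch)]
print(ids)
```

Keeping the cursor local to the generator means each post or reply thread starts from a fresh cursor.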

--------------------------

## Run the main program

In [None]:
if __name__ == '__main__':
    # Use a separate loop variable so the file_name list is not overwritten
    for page_id in file_id:
        scrapeFacebookPageFeedComments(page_id, access_token)

100 Comments Processed: 2017-09-21 16:48:19.520377 

200 Comments Processed: 2017-09-21 16:48:32.247251 

300 Comments Processed: 2017-09-21 16:48:49.736166 

400 Comments Processed: 2017-09-21 16:49:07.349318 

500 Comments Processed: 2017-09-21 16:49:14.678843 

600 Comments Processed: 2017-09-21 16:49:21.920978 

700 Comments Processed: 2017-09-21 16:49:30.839750 

800 Comments Processed: 2017-09-21 16:49:36.122160 

900 Comments Processed: 2017-09-21 16:49:42.785571 

Done!

## Finally, a CSV file containing all the comments is produced for each fan page
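
As a quick check of the output, the comments CSV can be read back and summarized. This is a sketch that uses a small inline sample in place of a real file; `count_comments_by_author` is a hypothetical helper, and the column names follow the header row written by the scraper.

```python
import csv
import io
from collections import Counter


def count_comments_by_author(csv_file):
    """Count how many comments each author wrote in a comments CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(row['comment_author'] for row in reader)


# Inline sample standing in for a '<page_id>_facebook_comments.csv' file
sample_csv = io.StringIO(
    "comment_id,status_id,parent_id,comment_message,comment_author\n"
    "1_1,1,,hello,Alice\n"
    "1_2,1,1_1,hi,Bob\n"
    "1_3,1,,thanks,Alice\n"
)
counts = count_comments_by_author(sample_csv)
print(counts.most_common())
```

For a real file, pass a handle opened with `encoding='utf-8-sig'` and `newline=''` so the byte order mark written by the scraper is stripped.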