## Facebook Page Crawling 

In this notebook, we crawl data from a Facebook Page using the facebook-scraper library:
https://github.com/kevinzg/facebook-scraper

### Import libraries

In [1]:
import lib.facebook_scraper as fs
import lib.utils as utils
import pandas as pd
import numpy as np
import os

### Define variables
First, we define some variables that are used throughout the notebook:
- FANPAGE_LINK: The identifier of the fanpage that we want to crawl data from. It can be found by opening the fanpage and copying the link from the address bar. In this project, the link to the fanpage of [Vẽ bậy](https://www.facebook.com/vebay69) is https://www.facebook.com/vebay69, so the value of FANPAGE_LINK is "vebay69".
- DATA_FOLDER_PATH: The path to the folder where the data is saved.
- COOKIE_PATH: The path to the cookie file used to authenticate with Facebook. The cookie file can be obtained by logging into Facebook and exporting the cookies from the browser. For example, in Chromium, use the extension [Get cookies.txt LOCALLY](https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid) to export the cookies. Save them to a file and use the path to this file as the value of COOKIE_PATH. <span style="color:red; font-weight:bold">USE A COOKIE FROM A FAKE ACCOUNT; OTHERWISE YOUR REAL ACCOUNT MIGHT GET BANNED.</span>

In [2]:
FANPAGE_LINK = "vebay69"
DATA_FOLDER_PATH = "data"
COOKIE_PATH = os.path.join('cookies', 'cookie_1.txt')
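
As a small illustration of how the FANPAGE_LINK value relates to the page URL, the sketch below extracts the first path segment of the link. The helper name `page_id_from_url` is hypothetical and not part of the notebook or of facebook-scraper; it assumes a simple `https://www.facebook.com/<page>` style link.

```python
from urllib.parse import urlparse


def page_id_from_url(url: str) -> str:
    """Return the first path segment of a Facebook page URL (hypothetical helper)."""
    path = urlparse(url).path
    # Drop empty segments caused by leading or trailing slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"No page identifier found in URL: {url!r}")
    return segments[0]


print(page_id_from_url("https://www.facebook.com/vebay69"))
```

Links that carry the identifier in the query string (for example `profile.php?id=...`) are not handled by this sketch.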

### Set cookies
First, set cookies to authenticate with Facebook

In [3]:
fs.set_cookies(COOKIE_PATH)  # reuse the path defined above instead of hard-coding it

### Get profile of the FANPAGE we will crawl

Use the function get_page_info; timeout is the number of seconds to wait for a response before the request is abandoned

In [4]:
page_info = fs.get_page_info(
    account=FANPAGE_LINK,
    timeout=60,
)
page_info



{'top_post': {'post_id': '691288359773038',
  'text': 'Ám ảnh mấy hôm nay…\n#Muonggg',
  'post_text': 'Ám ảnh mấy hôm nay…\n#Muonggg',
  'shared_text': '',
  'original_text': None,
  'time': datetime.datetime(2023, 12, 1, 14, 47, 31),
  'timestamp': 1701416851,
  'image': 'https://scontent.fhan14-4.fna.fbcdn.net/v/t39.30808-6/406876813_691288056439735_2430868875471301490_n.jpg?stp=cp0_dst-jpg_e15_fr_q65&_nc_cat=102&ccb=1-7&_nc_sid=a0818e&efg=eyJpIjoidCJ9&_nc_eui2=AeGLs92NOyhd_dsSDI1AecPdzEUdJOAh57jMRR0k4CHnuK58PksI0UnLkKNzFtk6CsLV8AR_Hm3qcvPnvGeDtpLB&_nc_ohc=pt6SMiOK5-8AX_PH6Jd&tn=RTPKl52GhDXvlbye&_nc_ht=scontent.fhan14-4.fna&oh=00_AfAUCARSY3QtgMVN6WPAy1ifLARaUv9Gh6Q-4tTzLWgChQ&oe=656F0A5B&manual_redirect=1',
  'image_lowquality': 'https://scontent.fhan14-4.fna.fbcdn.net/v/t39.30808-6/406876813_691288056439735_2430868875471301490_n.jpg?stp=cp0_dst-jpg_e15_p320x320_q65&_nc_cat=102&ccb=1-7&_nc_sid=a0818e&efg=eyJpIjoidCJ9&_nc_eui2=AeGLs92NOyhd_dsSDI1AecPdzEUdJOAh57jMRR0k4CHnuK58PksI0UnLkK

### Start crawling
First, we crawl only the basic information of each post, such as the number of reactions, comments, and shares, the text, and the post_id.
Later, we crawl the full data of the posts (comments_full and the details of sharers and reactors) in the notebooks [Crawl comment](crawl_comment.ipynb), [Crawl sharer](crawl_sharer.ipynb), [Crawl reactor](crawl_reactor.ipynb). These notebooks use post_id to fetch the posts one after another.
- Define the number of posts we want to crawl
- Create a variable resume_post_url to store the URL of the current position in the post listing, so that scraping can be resumed later. This variable is saved to a .txt file at the path resume_post_url_file_path

In [5]:
NUMBER_POST = 100
resume_post_url_file_path = os.path.join(DATA_FOLDER_PATH, FANPAGE_LINK, "url", "resume_post_url.txt")

Define a callback function that stores the next post URL and appends it to a history log file

In [6]:
def handle_pagination_url(url):
    global resume_post_url
    resume_post_url = url
    print(f"Resume url: {url}")
    utils.append_url_to_history(
        url=resume_post_url,
        file_path=os.path.join(DATA_FOLDER_PATH, FANPAGE_LINK, "url", "post_url_history.txt"),
    )
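
The implementation of `utils.append_url_to_history` is not shown in this notebook. A minimal sketch of what such a helper might look like is given below, assuming one URL per line in a plain-text file; the name `append_url_sketch` is hypothetical and chosen to avoid confusion with the real function.

```python
import os


def append_url_sketch(url, file_path):
    """Append one URL per line to a history file (hypothetical stand-in)."""
    if not url:
        # Nothing to record, e.g. when no pagination URL is available.
        return
    folder = os.path.dirname(file_path)
    if folder:
        # Create the parent folder on first use.
        os.makedirs(folder, exist_ok=True)
    with open(file_path, "a", encoding="utf-8") as history_file:
        history_file.write(url + "\n")
```

Appending rather than overwriting keeps every pagination URL seen so far, which can help when an earlier position needs to be revisited.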

Read resume_post_url from path. If the file does not exist, create an empty file

In [7]:
resume_post_url = utils.read_url_file(file_path=resume_post_url_file_path)
print(f"Resume post url: {resume_post_url}")
post_list = []

Resume post url: None
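
The read step described above can be sketched as follows. This is an assumption about how `utils.read_url_file` might behave, based only on the description and on the `None` result shown; the name `read_url_sketch` is hypothetical.

```python
import os


def read_url_sketch(file_path):
    """Return the stored URL, or None when the file is missing or empty."""
    folder = os.path.dirname(file_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    if not os.path.exists(file_path):
        # Create an empty file so that a later save has a target.
        open(file_path, "w", encoding="utf-8").close()
        return None
    with open(file_path, "r", encoding="utf-8") as url_file:
        url = url_file.read().strip()
    # An empty file means there is no position to resume from.
    return url or None
```

Returning `None` for a missing or empty file lets the caller pass the value straight to `start_url`, so a first run starts from the top of the page.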


- The function get_posts returns an iterator of dictionaries, and we iterate through it to retrieve the data for each post.
- This function takes page_limit as a parameter, which limits the number of pages requested. Since each request fetches about 10 posts, the total number of posts returned is roughly 10 times page_limit.
- The options parameter defines special options; when specified, a separate request is made for each individual post to retrieve the specific data.
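
The relationship between the target number of posts and page_limit can be sketched with a small calculation. The figure of 10 posts per request is taken from the text above and is treated here as an assumption; `pages_needed` is a hypothetical helper that rounds up, so a target that is not a multiple of 10 is not cut short.

```python
POSTS_PER_PAGE = 10  # assumption from the text; the real page size may vary


def pages_needed(number_of_posts, posts_per_page=POSTS_PER_PAGE):
    """Return the smallest page count that covers number_of_posts."""
    if number_of_posts <= 0:
        return 0
    # Ceiling division using integer arithmetic only.
    return -(-number_of_posts // posts_per_page)


print(pages_needed(100))  # 10
print(pages_needed(95))   # 10, whereas 95 // 10 would give 9
```

For a target that is an exact multiple of 10, this gives the same result as floor division.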

In [8]:
try:
    for post in fs.get_posts(
        account=FANPAGE_LINK,
        page_limit=NUMBER_POST//10,
        start_url=resume_post_url,
        request_url_callback=handle_pagination_url,
        options={
            "allow_extra_requests": True,
            "reactions": True,
        },
        timeout=120,
    ):
        print(post)
        post_list.append(post)
        utils.sleep(np.random.randint(5, 10))
except fs.exceptions.TemporarilyBanned:
    print("Error: Temporarily Banned")

except fs.exceptions.AccountDisabled:
    print("Error: Account Disabled")

except Exception as e:
    print(e)

Resume url: https://m.facebook.com/vebay69/
Append resume url to history: ./data/vebay69/url/post_url_history.txt


{'post_id': '688805450021329', 'text': 'Ăn xin công nghệ cao:\n#Hoho', 'post_text': 'Ăn xin công nghệ cao:\n#Hoho', 'shared_text': '', 'original_text': None, 'time': datetime.datetime(2023, 11, 28, 12, 0, 2), 'timestamp': 1701172802, 'image': None, 'image_lowquality': 'https://scontent-sin6-3.xx.fbcdn.net/v/t39.30808-6/404762694_688805226688018_6037405769867400504_n.jpg?stp=cp0_dst-jpg_e15_p320x320_q65&_nc_cat=104&ccb=1-7&_nc_sid=5f2048&efg=eyJpIjoidCJ9&_nc_ohc=rlCbxNzd3MIAX-TBk5e&_nc_oc=AQnC2ZEMMBA6nomiZcLTdvKTZorhSG2zDvccBhNKyAP4DIqdM25YZe0Yannw9IaErcw&_nc_ht=scontent-sin6-3.xx&oh=00_AfDHAGYfp1D0wjGzLnY-aL-uceMGtJdH5fn6ziLnBWi9gg&oe=656C3062', 'images': [], 'images_description': [], 'images_lowquality': ['https://scontent-sin6-3.xx.fbcdn.net/v/t39.30808-6/404762694_688805226688018_6037405769867400504_n.jpg?stp=cp0_dst-jpg_e15_p320x320_q65&_nc_cat=104&ccb=1-7&_nc_sid=5f2048&efg=eyJpIjoidCJ9&_nc_ohc=rlCbxNzd3MIAX-TBk5e&_nc_oc=AQnC2ZEMMBA6nomiZcLTdvKTZorhSG2zDvccBhNKyAP4DIqdM25YZe0Yannw

Compare the number of crawled posts with the initially defined quantity

In [9]:
len(post_list), NUMBER_POST

(79, 100)

Save the resume post URL and the posts data; the data file name includes the current datetime

In [10]:
utils.write_url_file(
    file_path=resume_post_url_file_path,
    url=resume_post_url,
)
if post_list:
    posts_df = utils.save_data(
        data_list=post_list,
        type="posts",
        folder_path=os.path.join(DATA_FOLDER_PATH, FANPAGE_LINK, "raw"),
    )

Save resume url: ./data/vebay69/url/resume_post_url.txt
Save posts data: ./data/vebay69/raw/posts_2023-11-29_00-29-15.csv
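
The file name in the output above follows a `<type>_<date>_<time>.csv` pattern. A minimal sketch of how such a path could be built is shown below; `timestamped_csv_path` is a hypothetical helper, and the actual `utils.save_data` may construct the name differently.

```python
import os
from datetime import datetime


def timestamped_csv_path(folder_path, data_type, now=None):
    """Build a path like <folder>/<type>_YYYY-MM-DD_HH-MM-SS.csv."""
    # Allow the caller to inject a fixed time, which makes the helper testable.
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.join(folder_path, f"{data_type}_{stamp}.csv")


print(timestamped_csv_path(os.path.join("data", "vebay69", "raw"), "posts"))
```

Including the time in the name means that repeated runs write separate files instead of overwriting earlier results.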
