We use the facebook_scraper library for Python
to scrape posts from the Facebook pages
of popular news channels.

In [22]:
from facebook_scraper import get_posts as get
import pandas as pd
import numpy as np
from datetime import datetime as dt

In [25]:
news = ['bbcnews','cnn','nytimes',
        'FoxNews','ABCNews',
        'HuffPostIndia','HuffPostUK',
        'NBCNews','time','DailyMail',
        'businessinsider','NowThisNews','aljazeera']

For each news page we scrape every post found in the first 250 pages of its feed. The time taken to scrape is printed at the end.

In [28]:
now = dt.now()
print('Start Date : '+str(now))
posts = []
for channel in news :     
    print(channel)
    for post in get(channel,pages = 250) :
        post['channel'] = channel
        posts.append(post)

print('Time taken to scrape data : '+str((dt.now() - now).seconds/60)+' minutes')


Start Date : 2020-04-28 17:22:02.209437
bbcnews
cnn
nytimes
FoxNews
ABCNews
HuffPostIndia
HuffPostUK
NBCNews
time
DailyMail
businessinsider
NowThisNews
aljazeera
Time taken to scrape data : 45.05 minutes
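
A scrape of this length can be interrupted by a single page that fails to load. The sketch below shows one possible way to keep going when a channel raises an error. The names `collect_posts` and `fake_fetch` are hypothetical and not part of facebook_scraper; `fake_fetch` is a stand-in that makes no network request, so the example only illustrates the control flow.

```python
import time

def collect_posts(channels, fetch, pages=250):
    """Collect posts per channel and tag each post with its channel name.

    `fetch` is any callable of the form fetch(channel, pages=...) that
    yields dicts. Channels that raise are recorded instead of aborting the run.
    """
    posts, failed = [], []
    start = time.perf_counter()
    for channel in channels:
        try:
            for post in fetch(channel, pages=pages):
                post['channel'] = channel
                posts.append(post)
        except Exception as exc:  # broad on purpose: scraping can fail in many ways
            failed.append((channel, repr(exc)))
    elapsed_minutes = (time.perf_counter() - start) / 60
    return posts, failed, elapsed_minutes


# Hypothetical stand-in fetcher for demonstration only (no network access).
def fake_fetch(channel, pages=1):
    if channel == 'broken':
        raise RuntimeError('page unavailable')
    for i in range(2):
        yield {'post_id': f'{channel}-{i}'}


posts_demo, failed_demo, minutes_demo = collect_posts(
    ['bbcnews', 'broken', 'cnn'], fake_fetch, pages=1)
print(len(posts_demo), [name for name, _ in failed_demo])
```

In the notebook, the imported `get` function could presumably be passed as `fetch`, since it is called above in the same way, as `get(channel, pages = 250)`.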


In [29]:
original = posts.copy()
print('Source Copy Created.')

Source Copy Created.


<b>Filtering Out Posts</b> : The pages upload a few posts that are irrelevant and need to be ignored so that they do not affect the efficiency of our model. Posts with a missing <b>post_url, post_id or time</b>, with zero <b>shares or likes</b>, or with an empty <b>shared_text</b> are dropped from our data.

In [30]:
posts = original.copy()
refined_post = []
for post in posts : 
    
    check = post['post_url'] is None or post['post_id'] is None
    check = check or post['time'] is None or post['shares'] == 0
    check = check or post['likes'] == 0 or len(post['shared_text']) == 0
    
    if not check : 
        refined_post.append(post)
        
posts = refined_post.copy()

print(str(len(posts))+' posts refined out of '+str(len(original))+' posts')

6803 posts refined out of 11100 posts
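
The same filter can be written as a small predicate, which makes the rule easier to test on its own. This is a minimal sketch: `is_usable` is a hypothetical helper name, the sample posts are made-up demonstration data, and treating a `None` value like a zero or empty one is an added assumption that the cell above does not make.

```python
def is_usable(post):
    """Return True if the post has every field the later steps rely on."""
    required = ('post_url', 'post_id', 'time')
    if any(post.get(key) is None for key in required):
        return False
    # Zero (or missing) engagement counts are treated as unusable.
    if not post.get('shares') or not post.get('likes'):
        return False
    # `or ''` guards against shared_text being None rather than an empty string.
    return len(post.get('shared_text') or '') > 0


# Made-up posts for illustration; only the first one passes every check.
sample_posts = [
    {'post_url': 'u1', 'post_id': '1', 'time': '2020-04-28',
     'shares': 10, 'likes': 5, 'shared_text': 'BBC.COM\nHeadline'},
    {'post_url': None, 'post_id': '2', 'time': '2020-04-28',
     'shares': 10, 'likes': 5, 'shared_text': 'BBC.COM\nHeadline'},
    {'post_url': 'u3', 'post_id': '3', 'time': '2020-04-28',
     'shares': 0, 'likes': 5, 'shared_text': 'BBC.COM\nHeadline'},
    {'post_url': 'u4', 'post_id': '4', 'time': '2020-04-28',
     'shares': 10, 'likes': 5, 'shared_text': None},
]

kept = [post for post in sample_posts if is_usable(post)]
print(str(len(kept)) + ' posts refined out of ' + str(len(sample_posts)) + ' posts')
```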


In [38]:
print('Example Post')
for i in posts[0] : 
    string = str(posts[0][i])
    if i != 'time' : 
        string = string[:20]
        
    print(str(i)+' : '+string)

Example Post
post_id : 10157728122492217
text : A pregnant nurse who
post_text : A pregnant nurse who
shared_text : BBC.COM
Remembering 
time : 2020-04-28 16:23:16
image : https://external.fbo
likes : 2327
comments : 333
shares : 1015
post_url : https://m.facebook.c
link : https://bbc.in/2W3Q7
channel : bbcnews
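
In the example post, `shared_text` begins with the source name (`BBC.COM`) on its own line, followed by the headline. A minimal sketch of separating the two is given below. `split_shared_text` is a hypothetical helper, and the sample string is assembled from the output shown in this notebook, on the assumption that nothing follows the headline.

```python
def split_shared_text(shared_text):
    """Split shared_text into (source, headline) at the first newline.

    If there is no newline, the whole string is returned as the headline
    and the source is left empty.
    """
    source, separator, headline = shared_text.partition('\n')
    if not separator:
        return '', shared_text
    return source, headline


source, headline = split_shared_text('BBC.COM\nRemembering 100 NHS workers who have died')
print(source)    # BBC.COM
print(headline)  # Remembering 100 NHS workers who have died
```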


From the dataset above we select <b> post_id, shared_text, time (duration between time posted and time scraped), image (present or not - 0/1), likes, comments, shares, post_url and channel </b> name to create a table.

In [39]:
col = ['channel','id','url','timespan','text','image','likes','comments','shares']
scraped_at = dt.now()  # one reference time shared by every row
rows = []
for post in posts : 
    # shared_text starts with the source name on its own line; keep what follows
    index = post['shared_text'].find('\n')
    text = post['shared_text'][index+1 : ]
    if len(text) > 0 : 
        rows.append({col[0] : post['channel'], col[1] : post['post_id'],
                     col[2] : post['post_url'],
                     # hours since posting; total_seconds() also counts whole days
                     col[3] : (scraped_at - post['time']).total_seconds()/3600,
                     col[4] : text,
                     col[5] : np.float64(post['image'] is not None),col[6] : np.float64(post['likes']),
                     col[7] : np.float64(post['comments']),col[8] : np.float64(post['shares'])})

# DataFrame.append was removed in pandas 2.0, so the table is built in one step
table = pd.DataFrame(rows, columns = col)

print('Table Data')
print(table.head(5))

Table Data
   channel                 id  \
0  bbcnews  10157728122492217   
1  bbcnews  10157728074267217   
2  bbcnews  10157727901822217   
3  bbcnews  10157727795172217   
4  bbcnews  10157727671812217   

                                                 url  timespan  \
0  https://m.facebook.com/story.php?story_fbid=10...  1.891389   
1  https://m.facebook.com/story.php?story_fbid=10...  2.320278   
2  https://m.facebook.com/story.php?story_fbid=10...  3.813333   
3  https://m.facebook.com/story.php?story_fbid=10...  4.589444   
4  https://m.facebook.com/story.php?story_fbid=10...  5.345833   

                                                text  image   likes  comments  \
0          Remembering 100 NHS workers who have died    1.0  2327.0     333.0   
1  Rising virus care home toll leads to record de...    1.0   950.0     197.0   
2      Top NYC coronavirus doctor takes her own life    1.0  5854.0    1533.0   
3      How NZ beat the virus and got its coffee back    1.0  3488.0  
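
The timespan column holds the number of hours between the time a post was published and the time it was scraped. When computing it, note that `timedelta.seconds` excludes whole days, while `timedelta.total_seconds()` includes them. The sketch below illustrates the difference with made-up timestamps; `hours_between` is a hypothetical helper name.

```python
from datetime import datetime

def hours_between(posted, scraped):
    """Elapsed hours between two datetimes, counting whole days as well."""
    return (scraped - posted).total_seconds() / 3600


# Made-up timestamps exactly 1 day and 2 hours apart.
posted = datetime(2020, 4, 27, 16, 23, 16)
scraped = datetime(2020, 4, 28, 18, 23, 16)
delta = scraped - posted

print(delta.seconds / 3600)             # 2.0  (the full day is dropped)
print(hours_between(posted, scraped))   # 26.0
```

For posts older than a day, the `.seconds` form would make them look more recent than they are, which matters when 250 feed pages per channel are scraped.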

Finally, we save the table as a CSV file for <b> later use </b>

In [41]:
table_copy = table.copy()
table.to_csv('News.csv',index = False)
print('Table saved as .csv format')

Table saved as .csv format
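
When the file is loaded again later, it can help to read the id column as text so that post ids are handled as identifiers rather than numbers. The sketch below is a small round trip through an in-memory buffer instead of `News.csv`; the one-row frame reuses values printed earlier in this notebook, and the `dtype` choice is a suggestion rather than something the notebook itself does.

```python
import io
import pandas as pd

# One-row demonstration frame; values are taken from the example post.
demo = pd.DataFrame({'channel': ['bbcnews'],
                     'id': ['10157728122492217'],
                     'likes': [2327.0]})

# Write to an in-memory buffer so the example does not touch the disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Read the ids back as strings instead of letting pandas infer integers.
restored = pd.read_csv(buffer, dtype={'id': str})
print(restored.dtypes)
```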
