# Fake News websites data analysis

This analysis uses data downloaded through CrowdTangle's "historical data" feature rather than multiple requests to the API. The latter option would end up taking longer due to API limitations.

The data was downloaded as several .csv files, saved in `./data/in`.


Time period for the analysis:
* Start - 2019-01-01
* End - 2021-03-27

In [48]:
import requests
import json
import pandas as pd
import numpy as np
import timeit
import time
import glob

## Get list of pages on each category

Because the .csv files generated by CrowdTangle do not specify which list each page comes from, API calls are needed to get the IDs of the pages in each list.

The lists are:

* 'least-biased' : '1525935'
* 'conspiracy-pseudoscience' : '1525936'
* 'pro-science' : '1525937'

In [2]:
lists = {
    'least-biased' : '1525935',
    'conspiracy-pseudoscience' : '1525936',
    'pro-science' : '1525937'
}

In [3]:
with open('./ctoken') as f:
    token = f.read().strip()  # strip a trailing newline, if any

In [4]:
def generate_account_list_url(listid, token=token):
    '''
    Generates the API URL for the get request with the lists of accounts.
    
    ARGS:
    listid - STR - ID of the list for which to retrieve accounts. This is provided as a path variable in the URL
    token - STR - API token
    
    Returns:
    STR - CrowdTangle API URL, for getting IDs of accounts in a list
    '''
    return 'https://api.crowdtangle.com/lists/{}/accounts?token={}&count=100'.format(listid, token)
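If the token ever contains characters that are not URL-safe, the query string can be built with the standard library instead of string formatting. This is only a sketch: `build_account_list_url` and the sample token are hypothetical names used for illustration, and the endpoint is the one already used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.crowdtangle.com/lists/{}/accounts'

def build_account_list_url(listid, token, count=100):
    # Hypothetical variant of generate_account_list_url:
    # urlencode escapes the query parameters
    query = urlencode({'token': token, 'count': count})
    return BASE_URL.format(listid) + '?' + query

# 'abc 123' is a made-up token, used only to show the escaping
example_url = build_account_list_url('1525935', 'abc 123')
print(example_url)
```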

In [11]:
platformid_to_list = dict()

In [12]:
for listname, listid in lists.items():
    print(listname)
    page = 0
    url = generate_account_list_url(listid)
    while url:
        page += 1
        print('DOWNLOADING PAGE', page)

        # Parse the response once instead of calling .json() repeatedly
        result = requests.get(url).json()['result']

        for account in result['accounts']:
            platformid_to_list[account['platformId']] = listname

        # Follow the pagination link until there is none left
        url = result['pagination'].get('nextPage')
        if url:
            time.sleep(10)  # pause between requests

least-biased
DOWNLOADING PAGE 1
DOWNLOADING PAGE 2
DOWNLOADING PAGE 3
DOWNLOADING PAGE 4
conspiracy-pseudoscience
DOWNLOADING PAGE 1
DOWNLOADING PAGE 2
pro-science
DOWNLOADING PAGE 1
DOWNLOADING PAGE 2
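
The pagination logic can be tried offline by separating it from the HTTP call. The sketch below is not part of the original notebook: `collect_accounts`, `fetch_json` and `fake_pages` are hypothetical names, and the fake responses only mimic the `result` / `accounts` / `pagination` / `nextPage` keys used in the loop above.

```python
def collect_accounts(first_url, fetch_json, pause=None):
    """Follow 'nextPage' links and return every account found."""
    accounts = []
    url = first_url
    while url:
        result = fetch_json(url)['result']
        accounts.extend(result['accounts'])
        # A missing 'nextPage' key ends the loop
        url = result.get('pagination', {}).get('nextPage')
        if url and pause is not None:
            pause()
    return accounts

# Fake two-page API, used only to exercise the loop without a network
fake_pages = {
    'page1': {'result': {'accounts': [{'platformId': '1'}, {'platformId': '2'}],
                         'pagination': {'nextPage': 'page2'}}},
    'page2': {'result': {'accounts': [{'platformId': '3'}],
                         'pagination': {}}},
}

found = collect_accounts('page1', fake_pages.__getitem__)
ids = [account['platformId'] for account in found]
print(ids)
```

Against the real API, `fetch_json` could be something like `lambda u: requests.get(u).json()` and `pause` something like `lambda: time.sleep(10)`.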


## Creates and cleans DF

The data was downloaded as several .csv files. They are merged into one single DF.

*Note: yes, this will probably use up a lot of RAM. I recently bought 32 GB, though, so I am going to use it ;)*

In [92]:
path = './data/in'
files = glob.glob(path + '/*.csv')

df_list = []

for filename in files:
    df = pd.read_csv(filename, index_col=None, low_memory=False, dtype={'Facebook Id' : str})
    df_list.append(df)

df = pd.concat(df_list, axis=0, ignore_index=True)
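
If RAM does become a problem, one option is to load only the columns that survive the cleaning step. This is a sketch on two tiny stand-in files written to a temporary folder; the `wanted` list is an illustrative subset, not the full set of columns kept in this notebook.

```python
import glob
import os
import tempfile

import pandas as pd

# Illustrative subset of the columns to keep
wanted = ['Page Name', 'Facebook Id', 'Comments']

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny stand-in files; the real exports have many more columns
    for i in range(2):
        pd.DataFrame({'Page Name': ['A', 'B'],
                      'Facebook Id': ['1', '2'],
                      'Comments': [i, i + 1],
                      'Page Description': ['x', 'y']}).to_csv(
            os.path.join(tmp, f'part{i}.csv'), index=False)

    files = sorted(glob.glob(os.path.join(tmp, '*.csv')))
    # usecols makes pandas skip every column that is not listed
    parts = [pd.read_csv(f, usecols=wanted, dtype={'Facebook Id': str})
             for f in files]
    small = pd.concat(parts, ignore_index=True)

print(small.shape)
```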

### Cleaning

Remove unnecessary columns and pages with fewer than 100 followers on average, the same threshold used by NYU researchers in [this article](https://medium.com/cybersecurity-for-democracy/far-right-news-sources-on-facebook-more-engaging-e04a01efae90).

In [93]:
df.columns

Index(['Page Name', 'User Name', 'Facebook Id', 'Page Category',
       'Page Admin Top Country', 'Page Description', 'Page Created',
       'Likes at Posting', 'Followers at Posting', 'Post Created',
       'Post Created Date', 'Post Created Time', 'Type', 'Total Interactions',
       'Likes', 'Comments', 'Shares', 'Love', 'Wow', 'Haha', 'Sad', 'Angry',
       'Care', 'Video Share Status', 'Is Video Owner?', 'Post Views',
       'Total Views', 'Total Views For All Crossposts', 'Video Length', 'URL',
       'Message', 'Link', 'Final Link', 'Image Text', 'Link Text',
       'Description', 'Sponsor Id', 'Sponsor Name', 'Sponsor Category',
       'Overperforming Score (weighted  —  Likes 1x Shares 1x Comments 1x Love 1x Wow 1x Haha 1x Sad 1x Angry 1x Care 1x )'],
      dtype='object')

In [95]:
columns_to_drop = ['User Name', 'Page Category', 'Page Admin Top Country', 'Page Description', 'Sponsor Id',
                   'Page Created','Likes at Posting', 'Post Created Date', 'Post Created Time', 'Video Length',
                   'Total Interactions', 'Video Share Status', 'Is Video Owner?', 'Post Views', 'Total Views For All Crossposts',
                   'Overperforming Score (weighted  —  Likes 1x Shares 1x Comments 1x Love 1x Wow 1x Haha 1x Sad 1x Angry 1x Care 1x )']

df.drop(columns_to_drop, axis = 1, inplace=True)

# TURN ALL REACTIONS INTO ONE COLUMN
df['Reactions'] = df[['Likes', 'Love', 'Wow', 'Haha', 'Sad', 'Angry','Care']].sum(axis=1)

columns_to_drop = ['Likes', 'Love', 'Wow', 'Haha', 'Sad', 'Angry','Care']
df.drop(columns_to_drop, axis = 1, inplace=True)

# This part recreates the Total Interactions column.
# My computer is set to PT-BR and CrowdTangle uses commas in its number formatting.
# The workaround for this is so involved that it is easier to just recreate the column.
df['Total Interactions'] = df[['Reactions', 'Comments', 'Shares']].sum(axis=1)
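
For reference, `pd.read_csv` can also be told how numbers are formatted. The sketch below assumes the commas act as thousands separators inside quoted fields, which I have not confirmed for the real export; if they were decimal marks, the `decimal=','` parameter would be the one to try instead. The sample data is made up.

```python
import io

import pandas as pd

# Made-up sample: numbers with comma thousands separators in quoted fields
raw = io.StringIO('Page Name,Total Interactions\nA,"1,234"\nB,"56"\n')

# thousands=',' tells the parser to drop the commas before converting
parsed = pd.read_csv(raw, thousands=',')
print(parsed['Total Interactions'].tolist())
```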

#### Lists pages below the 100 avg. followers threshold

In [96]:
grouped_by_followers = df.groupby('Facebook Id').agg({'Followers at Posting' : 'mean'})
grouped_by_followers['Followers at Posting'].min()

175.22222222222223

None of the pages fall under the threshold, so no action is necessary.
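
Had any page fallen under the threshold, it could have been dropped along these lines. This is a sketch on toy data; `THRESHOLD`, `posts` and `small_pages` are illustrative names, and the column names follow the CrowdTangle export used above.

```python
import pandas as pd

# Toy data: page '1' averages 70 followers, page '2' averages 600
posts = pd.DataFrame({
    'Facebook Id': ['1', '1', '2', '2'],
    'Followers at Posting': [50.0, 90.0, 500.0, 700.0],
})

THRESHOLD = 100
avg_followers = posts.groupby('Facebook Id')['Followers at Posting'].mean()

# IDs of the pages whose average is below the threshold
small_pages = avg_followers[avg_followers < THRESHOLD].index

# Keep only the posts from the remaining pages
kept = posts[~posts['Facebook Id'].isin(small_pages)]
print(kept)
```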

### Renames columns

To avoid mistakes later, all column names will be converted to lower case, with spaces replaced by underscores.

In [97]:
column_names = list()
for c in df.columns:
    column_names.append(c.lower().replace(' ', '_'))

df.columns = column_names
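
The same renaming can be written without an explicit loop, using the string methods of the column index. A small sketch on an empty toy frame; note that other characters, such as the question mark, are left untouched.

```python
import pandas as pd

# Empty toy frame with three of the original column names
frame = pd.DataFrame(columns=['Page Name', 'Facebook Id', 'Is Video Owner?'])

# Lower-case every name and swap spaces for underscores
frame.columns = frame.columns.str.lower().str.replace(' ', '_', regex=False)
print(list(frame.columns))
```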

### Removes pages not available before 2021

Since this will be a time series, it makes no sense to include pages that have no data before 2021 in the analysis.
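
No code is shown for this step. A minimal sketch of one way to do it, on toy data, assuming `post_created` has already been converted to datetime and reading "no data before 2021" as "earliest post on or after 2021-01-01":

```python
import pandas as pd

# Toy data: page '1' has posts since 2019, page '2' only shows up in 2021
posts = pd.DataFrame({
    'facebook_id': ['1', '1', '2'],
    'post_created': pd.to_datetime(['2019-05-01', '2021-02-01', '2021-01-15']),
})

# Earliest post of each page, repeated on every row of that page
first_post = posts.groupby('facebook_id')['post_created'].transform('min')

# Keep only the pages that already had posts before 2021
kept = posts[first_post < pd.Timestamp('2021-01-01')]
print(kept)
```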

### Add category column

In [98]:
def check_category(facebookid,
                   platformid_to_list=platformid_to_list):
    '''
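    Looks up the Facebook ID in the dictionary of category
    names and returns the category, or None when the ID is missing.
    
    ARGS:
    facebookid - STR - id to be found
    platformid_to_list - DICT - maps page IDs to their categories
    
    RETURN:
    'least-biased'|'conspiracy-pseudoscience'|'pro-science'|None
    '''
    # .get() returns None for unknown IDs instead of raising KeyError,
    # so the isna() check afterwards can catch them
    return platformid_to_list.get(facebookid)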

In [99]:
df['category'] = df['facebook_id'].apply(check_category)

# CHECKS FOR ERRORS
df[df['category'].isna()]

Unnamed: 0,page_name,facebook_id,followers_at_posting,post_created,type,comments,shares,total_views,url,message,link,final_link,image_text,link_text,description,sponsor_name,sponsor_category,reactions,total_interactions,category
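
An alternative to the helper function is `Series.map`, which takes the dictionary directly and leaves `NaN` wherever an ID is not found. A sketch on toy data; `id_to_category` stands in for `platformid_to_list`.

```python
import pandas as pd

# Stand-in for platformid_to_list
id_to_category = {'1': 'pro-science', '2': 'least-biased'}
posts = pd.DataFrame({'facebook_id': ['1', '2', '3']})

# IDs missing from the dictionary become NaN
posts['category'] = posts['facebook_id'].map(id_to_category)

# List the IDs that could not be matched to any category
unmatched = posts.loc[posts['category'].isna(), 'facebook_id'].tolist()
print(unmatched)
```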


### Converts post creation date to datetime

In [111]:
df['post_created'] = pd.to_datetime(df['post_created'])



## Analysis

In [112]:
df.columns

Index(['page_name', 'facebook_id', 'followers_at_posting', 'post_created',
       'type', 'comments', 'shares', 'total_views', 'url', 'message', 'link',
       'final_link', 'image_text', 'link_text', 'description', 'sponsor_name',
       'sponsor_category', 'reactions', 'total_interactions', 'category',
       'intactions_to_ratio'],
      dtype='object')

In [113]:
df.dtypes

page_name                       object
facebook_id                     object
followers_at_posting           float64
post_created            datetime64[ns]
type                            object
comments                         int64
shares                           int64
total_views                      int64
url                             object
message                         object
link                            object
final_link                      object
image_text                      object
link_text                       object
description                     object
sponsor_name                    object
sponsor_category                object
reactions                        int64
total_interactions               int64
category                        object
intactions_to_ratio            float64
dtype: object

In [114]:
df.describe()

Unnamed: 0,followers_at_posting,comments,shares,total_views,reactions,total_interactions,intactions_to_ratio
count,1844121.0,1877594.0,1877594.0,1877594.0,1877594.0,1877594.0,1844121.0
mean,4378894.0,89.71327,238.9516,36424.15,777.9799,1106.645,0.08859303
std,8194872.0,665.0261,2494.048,1500883.0,8212.704,9866.539,0.9569294
min,159.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,213432.0,1.0,2.0,0.0,14.0,20.0,0.002074421
50%,966764.0,5.0,10.0,0.0,57.0,84.0,0.01036909
75%,4192955.0,35.0,51.0,0.0,241.0,372.0,0.04829052
max,36721500.0,305338.0,726056.0,519348000.0,2390346.0,2654643.0,410.1335


In [115]:
df.head()

Unnamed: 0,page_name,facebook_id,followers_at_posting,post_created,type,comments,shares,total_views,url,message,...,final_link,image_text,link_text,description,sponsor_name,sponsor_category,reactions,total_interactions,category,intactions_to_ratio
0,Collective Evolution,131929868907,5136410.0,2021-01-01 23:57:32,Link,185,1286,0,https://www.facebook.com/CollectiveEvolutionPa...,Don't we have incredible innovation everywhere...,...,,,8 Year-Old Mexican Girl Invents A Solar Water ...,"Innovation comes from all ages, and this is fu...",,,6445,7916,conspiracy-pseudoscience,0.154115
1,21st Century Wire,182032255155419,34584.0,2021-01-01 23:33:02,Link,0,9,0,https://www.facebook.com/21WIRE.TV/posts/51392...,"Some of the good, the bad, and mostly ugly for...",...,,,INTO THE FIRE: 2021 Trends and Predictions fro...,"NEW YEARS DAY SPECIAL | Once again, we innocen...",,,16,25,conspiracy-pseudoscience,0.072288
2,Ancient Origins,530869733620642,865151.0,2021-01-01 23:30:09,Photo,62,302,0,https://www.facebook.com/ancientoriginsweb/pos...,Pompeii 1980 www.ancient-origins.net:=:https:/...,...,https://www.facebook.com/login/?next=https%3A%...,,Timeline Photos,,,,2937,3301,conspiracy-pseudoscience,0.381552
3,Jesus Daily,70630972354,33588034.0,2021-01-01 23:00:22,Link,4055,726,0,https://www.facebook.com/JesusDaily/posts/1016...,Does America today really need Jesus?,...,,,Does America really need Jesus anymore?,Watch and listen to Billy Grahams last message...,,,12895,17676,conspiracy-pseudoscience,0.052626
4,IFLScience,367116489976035,23885759.0,2021-01-01 23:00:11,Link,5958,4856,0,https://www.facebook.com/IFLScience/posts/4316...,"When same-sex marriage is legalized, it leads ...",...,,,Teenage Suicide Attempts Fall After Same-Sex M...,"When a state legalizes same-sex marriage, it l...",,,77462,88276,conspiracy-pseudoscience,0.369576


### Interaction to followers ratio

In [118]:
df['intactions_to_follow_ratio'] = (df['total_interactions'] / df['followers_at_posting'])*100
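
The ratio is a percentage: a post with 50 interactions on a page with 1,000 followers scores 5. In the `describe()` output, `followers_at_posting` has fewer non-null rows than the other columns, so some posts end up with a `NaN` ratio. The sketch below, on toy data, shows that behaviour and also guards against a follower count of zero, which is an assumption on my part since the minimum in this dataset is above zero.

```python
import numpy as np
import pandas as pd

# Toy data: one normal row, one with zero followers, one with a missing count
posts = pd.DataFrame({'total_interactions': [50, 10, 5],
                      'followers_at_posting': [1000.0, 0.0, np.nan]})

# Treat zero followers as missing to avoid dividing by zero
followers = posts['followers_at_posting'].replace(0, np.nan)
posts['ratio_pct'] = posts['total_interactions'] / followers * 100
print(posts)
```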

### Time comparison

In [143]:
df.groupby('category').agg({'facebook_id' : 'nunique',
                            'followers_at_posting' : 'mean',
                            'comments' : 'mean',
                            'shares' : 'mean',
                            'reactions' : 'mean',
                            'intactions_to_follow_ratio' : 'mean'})

Unnamed: 0_level_0,facebook_id,followers_at_posting,comments,shares,reactions,intactions_to_follow_ratio
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
conspiracy-pseudoscience,128,8227927.0,117.182794,555.083088,1409.772961,0.139939
least-biased,327,3170033.0,90.926867,80.316563,397.991307,0.06842
pro-science,112,1715200.0,47.869156,142.669755,726.085223,0.062359


In [144]:
df[df['post_created'] > '2021'].groupby('category').agg({'facebook_id' : 'nunique',
                            'followers_at_posting' : 'mean',
                            'comments' : 'mean',
                            'shares' : 'mean',
                            'reactions' : 'mean',
                            'intactions_to_follow_ratio' : 'mean'})

Unnamed: 0_level_0,facebook_id,followers_at_posting,comments,shares,reactions,intactions_to_follow_ratio
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
conspiracy-pseudoscience,115,6742823.0,114.430443,225.999428,1271.614858,0.126171
least-biased,316,3259983.0,85.768814,50.477554,400.509356,0.056092
pro-science,110,1803654.0,70.435572,134.058721,1459.082828,0.081664


In [145]:
df[df['post_created'] > '2020'].groupby('category').agg({'facebook_id' : 'nunique',
                            'followers_at_posting' : 'mean',
                            'comments' : 'mean',
                            'shares' : 'mean',
                            'reactions' : 'mean',
                            'intactions_to_follow_ratio' : 'mean'})

Unnamed: 0_level_0,facebook_id,followers_at_posting,comments,shares,reactions,intactions_to_follow_ratio
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
conspiracy-pseudoscience,128,8198056.0,116.957082,496.791906,1641.279099,0.154032
least-biased,322,3241167.0,92.415681,51.47525,400.483017,0.060571
pro-science,111,1762823.0,64.808835,173.4266,1131.931398,0.068396
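
The three tables above use overlapping windows (everything, after 2021, after 2020). For a side-by-side view with one row per category and calendar year, the year can be added as a second grouping key. A sketch on toy data, not a replacement for the tables above; the `year` column is introduced here for illustration.

```python
import pandas as pd

# Toy data with the same column names as the cleaned DF
posts = pd.DataFrame({
    'category': ['pro-science', 'pro-science', 'least-biased'],
    'post_created': pd.to_datetime(['2019-03-01', '2021-01-05', '2021-02-01']),
    'shares': [10, 30, 4],
})

# Group by category and calendar year in a single pass
by_year = (posts
           .assign(year=posts['post_created'].dt.year)
           .groupby(['category', 'year'])['shares']
           .mean())
print(by_year)
```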
