# Scraping Media Bias websites list and their Facebook pages

This script will go through the lists of websites tagged as "least biased", "conspiracy-pseudoscience" and "pro-science" on [Media Bias/Fact Check](https://mediabiasfactcheck.com/):

* https://mediabiasfactcheck.com/center/
* https://mediabiasfactcheck.com/conspiracy/
* https://mediabiasfactcheck.com/pro-science/

Later, it will go through each of these sites to gather their Facebook pages URLs.

These Facebook pages will later be listed in a format compatible with [CrowdTangle](https://www.crowdtangle.com/)'s import function. The tool will later be used for analysing engagement for these pages.

CrowdTangle requests data in a .csv file, using the `Page or Account URL,List` template, one page per line. List being the name of the list the referred page will be a part of.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from IPython.display import clear_output
import timeit
import time

## Generating the full websites lists

Gather information from Media Bias/Fact check.

These pages already contain the URL for almost all of the websites under each category in a table.

The scraped content will be added to a list of dictionaries (`contents`) one for each entry. In the end, each will contain:

```
{
    'category' : 'least-biased|conspiracy-pseudoscience|pro-science', 
    'website_name' : 'name',
    'url' : 'website's url',
    'report' : 'media bias report url',
    'facebook_pages' : 'the list of all facebook.com links found in the websites homepages',
    'number_facebook_urls' : int # no. of Facebook URLs found on the page
}
```

Each of these entries will need to be inspected one by one later. This is necessary because the script will gather all the Facebook pages in each of the entries' homepage, in case there are multiple ones. 

The data will be saved to a .xlsx file for making life easier when updating information during this inspection, `./data/interim/extracted_data.xlsx`.

In [2]:
urls = {'least-biased' : 'https://mediabiasfactcheck.com/center/',
       'conspiracy-pseudoscience' : 'https://mediabiasfactcheck.com/conspiracy/',
       'pro-science' : 'https://mediabiasfactcheck.com/pro-science/'}

contents = list()

Each line with the URLs follow one of two formats: either it contains just the URL, or it contains the website name and the URL in parentheses. They will be extracted using the `extract_url` function below.

A second part of the script will go through the ones that do not have the URLs in the lists already and gather the information from their reports.

In [3]:
def extract_url(text):
    '''
    Extracts URL information from each row's website name.
    
    Args:
    text: STR - the text information for each row (row.text)
    
    Output:
    STR
    '''
    if '(' in text:
        # THE URL IS ALWAYS THE LAST PART WHEN IN PARENTHESES
        url = text.split('(')[-1].strip(')')
    elif '.' in text:
        url = text
    else:
        return '-'
    if url.startswith('http'):
        return url
    else:
        return ('https://' + url)

### Scraping loops

*Note: using nested for loops is not the most performatic way of handling this task, but it should do the trick just fine given that it is not really that much data.*

In [4]:
categories_ran = 0

for category, url in urls.items():
    categories_ran += 1
    print('WORKING ON CATEGORY #', categories_ran, '/', len(urls), '-',  category)
    
    # COLLECT THE PAGE
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # GETS THE TABLE WITH THE WEBSITE LIST
    table = soup.find( 'table', {'id':'mbfc-table'} )
    
    for row in table.findAll("tr"):
        # SKIPS ROWS WITHOUT ANCHOR <a> TAG, FOR THEY ARE EMPTY LINES
        if not row.find('a'):
            continue
        else:
            data = {'category' : category,
                    'website_name' : row.text,
                    'url' : extract_url(row.text),
                    'report' : 'https://mediabiasfactcheck.com' + row.find('a')['href']}
            contents.append(data)
    
print('')
print('COMPLETE')
print('*' * len('COMPLETE'))
print('')

WORKING ON CATEGORY # 1 / 3 - least-biased
WORKING ON CATEGORY # 2 / 3 - conspiracy-pseudoscience
WORKING ON CATEGORY # 3 / 3 - pro-science

COMPLETE
********



#### Pages lacking URL

Goes through the report pages for the lines which did not contain the website URL in order to extract this info.

In [5]:
def extract_url_report(report_url):
    '''
    Extracts URL information from a Media Bias/Fact Check report page.
    
    Args:
    url: STR - the web address for the report page
    
    Output:
    STR - the URL for the website in the report
    '''    
    report_page = requests.get(report_url)
    report_soup = BeautifulSoup(report_page.text, 'html.parser')
    report_content = report_soup.find('div', {'class':'entry-content'} )
    for p in report_content.find_all('p'):
        if p.text.startswith('Source: '):
            return p.find('a')['href']

In [6]:
for entry in contents:
    if entry['url'] == '-':
        print('WORKING ON', entry['report'])
        entry['url'] = extract_url_report(entry['report'])
        time.sleep(1)
        
print('')
print('COMPLETE')
print('*' * len('COMPLETE'))
print('')

WORKING ON https://mediabiasfactcheck.com/adweek/
WORKING ON https://mediabiasfactcheck.com/air-force-times/
WORKING ON https://mediabiasfactcheck.com/allafrica/
WORKING ON https://mediabiasfactcheck.com/aptn-news/
WORKING ON https://mediabiasfactcheck.com/army-times/
WORKING ON https://mediabiasfactcheck.com/biloxi-sun-herald/
WORKING ON https://mediabiasfactcheck.com/bozeman-daily-chronicle/
WORKING ON https://mediabiasfactcheck.com/burnett-county-sentinel/
WORKING ON https://mediabiasfactcheck.com/denton-record-chronicle/
WORKING ON https://mediabiasfactcheck.com/eagle-tribune/
WORKING ON https://mediabiasfactcheck.com/elko-daily-free-press/
WORKING ON https://mediabiasfactcheck.com/hastings-tribune/
WORKING ON https://mediabiasfactcheck.com/how-to-geek/
WORKING ON https://mediabiasfactcheck.com/longview-news-journal/
WORKING ON https://mediabiasfactcheck.com/norfolk-daily-news/
WORKING ON https://mediabiasfactcheck.com/south-bend-tribune/
WORKING ON https://mediabiasfactcheck.com/s

## Enters each webpage and extracts any Facebook URL

First, try to find all the Facebook links on a page using `requests`/`BeautifulSoup`. All the unique values will be put on a list. The funcion will return both the number of unique Facebook URLs found and the complete list. Having the number of URLs found should make it easier when going through the records one by one to check the Facebook links.

Some of the pages in the analysis are dynamic, so we will use `Selenium` to rescan the ones that failed to find any Facebook URLs.

In [7]:
def generate_requests(extract_url):
    '''
    Uses requests for extracting page data.
    
    Args:
    extract_url - STR - url for the extraction
    
    Returns:
    BeutifulSoup object with the parsed page.
    '''
    page = requests.get(extract_url, timeout=60)
    return BeautifulSoup(page.text, 'html.parser')

def generate_selenium(extract_url):
    '''
    Sets the Selenium parameters for extracting the page data.
    
    Args:
    extract_url - STR - url for the extraction
    
    Returns:
    BeutifulSoup object with the parsed page.
    '''
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    
    driver.set_page_load_timeout(60)
    driver.get(extract_url)
    
    extract_soup = BeautifulSoup(driver.page_source)    
    driver.quit()

    return extract_soup
    
def extract_facebook(page_object):
    '''
    Uses BeautifulSoup to extract all links pointing to Facebook from a requests or Selenium source code object.
    
    Args:
    page_object: requests or Selenium source code object
    
    Output:
    List of Facebook pages.
    '''
    facebook_urls = list()
    
    for a in page_object.find_all('a'):
        if a.has_attr('href') and 'facebook.com' in a['href']:
            if a['href'] not in facebook_urls:
                facebook_urls.append(a['href'])
    
    if len(facebook_urls) > 0:
        return (facebook_urls, len(facebook_urls))
    else:
        return (['None found'], 0)

In [8]:
total_runs = 0
start = timeit.default_timer()
content_error = list()

for entry in contents:
    clear_output(wait=True)
    total_runs += 1
    
    try:
        if entry['url'] != None and entry['url'].startswith('http'):
            soup = generate_requests(entry['url'])
            extracted_data = extract_facebook(soup)
            entry['facebook_page'] = extracted_data[0]
            entry['number_facebook_urls'] = extracted_data[1]

        stop = timeit.default_timer()
        expected_time = np.round((stop-start) / (total_runs / len(contents)) / 60, 2)

    except Exception as e:
        content_error.append((entry, e))
        continue

    print('WORKING ON URL #', total_runs, 'out of', len(contents), '-', entry['url'])
    print('Progress: ', np.round((total_runs/len(contents) * 100), 2), '%')
    print('Current run time:', np.round((stop - start)/60, 2), 'minutes')
    print('Expected run time:', expected_time, 'minutes') 

WORKING ON URL # 949 out of 949 - https://www.zmescience.com
Progress:  100.0 %
Current run time: 28.82 minutes
Expected run time: 28.82 minutes


### Checks errors

In [9]:
len(content_error)

52

In [10]:
content_error

[({'category': 'least-biased',
   'website_name': 'Asia Times (www.atimes.com)',
   'url': 'https://www.atimes.com',
   'report': 'https://mediabiasfactcheck.com/asia-times/'},
  requests.exceptions.SSLError(urllib3.exceptions.MaxRetryError('HTTPSConnectionPool(host=\'www.atimes.com\', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError("hostname \'www.atimes.com\' doesn\'t match either of \'*.parkingcrew.net\', \'parkingcrew.net\'")))'))),
 ({'category': 'least-biased',
   'website_name': 'Belleville News-Democrat (www.bnd.com)',
   'url': 'https://www.bnd.com',
   'report': 'https://mediabiasfactcheck.com/belleville-news-democrat/'},
  requests.exceptions.ReadTimeout(urllib3.exceptions.ReadTimeoutError("HTTPSConnectionPool(host='www.bnd.com', port=443): Read timed out. (read timeout=60)"))),
 ({'category': 'least-biased',
   'website_name': 'Biloxi Sun Herald',
   'url': 'http://sunherald.com',
   'report': 'https://mediabiasfactcheck.com/biloxi-s

Most of these are actually offline, as seen on manual inspection. The other ones will be discarded, as they are not numerous enough to impact any outcome in this analysis.

#### Runs with Selenium

*Note: A copy of the contents that have already been extracted will be created just in case...*

In [11]:
contents_selenium = contents.copy()
content_error_selenium = list()

In [12]:
total_runs = 0
start = timeit.default_timer()
content_error = list()

for entry in contents_selenium:
    clear_output(wait=True)
    total_runs += 1
    
    if 'number_facebook_urls' in entry:
        if entry['number_facebook_urls'] > 0:
            continue
    else:
        try:
            if entry['url'] != None and entry['url'].startswith('http'):
                soup = generate_selenium(entry['url'])
                extracted_data = extract_facebook(soup)
                entry['facebook_page'] = extracted_data[0]
                entry['number_facebook_urls'] = extracted_data[1]

            stop = timeit.default_timer()
            expected_time = np.round((stop-start) / (total_runs / len(contents_selenium)) / 60, 2)

        except Exception as e:
            content_error.append((entry, e))
            continue

    print('WORKING ON URL #', total_runs, 'out of', len(contents_selenium), '-', entry['url'])
    print('Progress: ', np.round((total_runs/len(contents_selenium) * 100), 2), '%')
    print('Current run time:', np.round((stop - start)/60, 2), 'minutes')
    print('Expected run time:', expected_time, 'minutes') 

WORKING ON URL # 947 out of 949 - https://www.who.int
Progress:  99.79 %
Current run time: 5.24 minutes
Expected run time: 5.31 minutes


### Creates DF and saves to Excel for inspection

In [13]:
df = pd.DataFrame(contents_selenium)
df = df[df['number_facebook_urls'] > 0].reset_index(drop=True)

In [14]:
df

Unnamed: 0,category,website_name,url,report,facebook_page,number_facebook_urls
0,least-biased,24ur.com,https://24ur.com,https://mediabiasfactcheck.com/24ur-com/,[https://facebook.com/24urcom],1.0
1,least-biased,38 North (www.38north.org),https://www.38north.org,https://mediabiasfactcheck.com/38-north/,[https://www.facebook.com/38NorthNK],1.0
2,least-biased,680 News (www.680news.com),https://www.680news.com,https://mediabiasfactcheck.com/680-news/,[https://www.facebook.com/680News],1.0
3,least-biased,1010 WINS AM (1010wins.radio.com),https://1010wins.radio.com,https://mediabiasfactcheck.com/1010-wins-am/,"[https://www.facebook.com/1010wins/, https://w...",3.0
4,least-biased,ABC7Chicago.com,https://ABC7Chicago.com,https://mediabiasfactcheck.com/abc7chicago-com/,[https://www.facebook.com/pages/ABC-7-Chicago/...,1.0
...,...,...,...,...,...,...
542,pro-science,Understanding Reality Through Science (underst...,https://understandrealitythroughscience.blogsp...,https://mediabiasfactcheck.com/understand-real...,[https://www.facebook.com/thom.raff],1.0
543,pro-science,VeryWell (www.verywell.com),https://www.verywell.com,https://mediabiasfactcheck.com/verywell/,[https://www.facebook.com/verywell],1.0
544,pro-science,VeryWell Family (verywellfamily.com),https://verywellfamily.com,https://mediabiasfactcheck.com/verywell-family/,[https://www.facebook.com/verywell],1.0
545,pro-science,World Meteorological Organization (public.wmo....,https://public.wmo.int,https://mediabiasfactcheck.com/world-meteorolo...,[https://www.facebook.com/World-Meteorological...,1.0


In [19]:
interim_file = './data/interim/extracted_data.xlsx'
df.to_excel(interim_file, index=False)

## Imports DF after cleaning and generates CrowdTangle's CSV file

Generate a .csv using the `Page or Account URL,List` template.

It will be saved to `./data/interim/crowdtangle_batch_upload.csv`. The `category` column will be the list name.

In [20]:
df = pd.read_excel(interim_file)

## Export

In [21]:
df['facebook_page'] = df['facebook_page'].apply(lambda x: x.replace("'","").replace('[','').
                                                replace(']','').strip())

In [22]:
df_to_export = df[['facebook_page','category']]
df_to_export.columns = ['Page or Account URL','List']
df_to_export.to_csv('./data/interim/crowdtangle_batch_upload.csv',
                    sep=',', index=False)