# Left-wing vs Right-wing Media on US President Donald Trump

This notebook primarily focuses on data collection and formatting.

## News Data Acquisition

No readily available dataset separates right-wing from left-wing media coverage, so we have to collect it from the web. CNN is widely regarded as left-leaning and Fox News as right-leaning. We therefore use their content as proxies for left-leaning and right-leaning discussion of Donald Trump and the incumbent government.

### Data from CNN and Fox

Both CNN and Fox News have a search feature. We search each site for "Donald Trump" and extract the URLs of the news articles from the results. The goal is around 3000-4000 articles per outlet, so we first collect 3000-4000 URLs from the search results and then extract the title and text from each article's URL. Selenium drives a browser to open the search pages and collect the URLs; BeautifulSoup extracts the text from each article page.

In [2]:
import requests, json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from bs4 import BeautifulSoup  # required by the article scrapers below
import time
from selenium import webdriver

## CNN URLs Collection

In [1]:
a = time.time()

chrome_path = r"C:\Users\Yash Kumar\Downloads\chromedriver_win32\chromedriver.exe"
cnn_masterlist = []

try:
    for counter in range(1, 600): # Focusing on ~600 pages of search results
        driver =  webdriver.Chrome(chrome_path)

        if counter == 1:
            driver.get("https://www.cnn.com/search/?q=donald%20trump&size=10&page=1&sort=newest&type=article")
            time.sleep(5) # 5-second delay so the page can load and the server is not flooded with requests

        else:
            # Each search page increments counter by 1 and the `from` offset by 10
            num = (counter - 1) * 10
            driver.get(f"""https://www.cnn.com/search/?q=donald%20trump&size=10&page={counter}&from={num}&sort=newest&type=article""")
            time.sleep(5)

        for i in range(1, 11):
            # Below given is the URL xpath
            post = driver.find_element_by_xpath(f"/html/body/div[5]/div[2]/div/div[2]/div[2]/div/div[3]/div[{i}]/div[2]/h3/a")
            link = post.get_attribute("href")
            cnn_masterlist.append(link)
            
        driver.close()
        
        if counter % 10 == 0:
            print(counter, time.time()-a)

except Exception as e:
    # Stop at the first failure but keep the URLs collected so far
    print("worked until page", counter, "- error:", e)

print(time.time() - a)
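
The paging scheme used in the search URLs can be checked without opening a browser. The sketch below rebuilds the same URL patterns from a page number; the helper names `cnn_search_url` and `fox_search_url` are hypothetical and not part of the original notebook, and the query parameters are simply copied from the URLs used in the scraping cells.

```python
from urllib.parse import urlencode, quote

CNN_SEARCH = "https://www.cnn.com/search/"
FOX_SEARCH = "https://www.foxnews.com/search-results/search"


def cnn_search_url(page, query="donald trump", size=10):
    """Hypothetical helper: CNN search URL for a 1-based page number."""
    params = {"q": query, "size": size, "page": page}
    if page > 1:
        # Pages after the first also carry a result offset
        params["from"] = (page - 1) * size
    params["sort"] = "newest"
    params["type"] = "article"
    return CNN_SEARCH + "?" + urlencode(params, quote_via=quote)


def fox_search_url(page, query="donald trump", size=10):
    """Hypothetical helper: Fox search URL, paged by a `start` offset."""
    params = {"q": query, "ss": "fn", "type": "story", "start": (page - 1) * size}
    return FOX_SEARCH + "?" + urlencode(params, quote_via=quote)


print(cnn_search_url(1))
print(cnn_search_url(3))
print(fox_search_url(3))
```

Keeping URL construction in one place makes it easier to change the query or page size later without editing the scraping loop.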

## Fox URLs Collection

In [None]:
a = time.time()

chrome_path = r"C:\Users\Yash Kumar\Downloads\chromedriver_win32\chromedriver.exe"
fox_masterlist = []

try:
    for counter in range(1, 600):
        driver =  webdriver.Chrome(chrome_path)

        if counter == 1:
            driver.get("""https://www.foxnews.com/search-results/search?q=donald%20trump&ss=fn&type=story&start=0""")
            time.sleep(3) # 3-second delay so the page can load and the server is not flooded with requests

        else:
            num = (counter - 1) * 10
            driver.get(f"""https://www.foxnews.com/search-results/search?q=donald%20trump&ss=fn&type=story&start={num}""")
            time.sleep(3) # same delay as on the first page, so the results can load

        for i in range(3, 13):
            post = driver.find_element_by_xpath(f"""//*[@id="search-container"]/div[{i}]/div/div/h3/a""")
            # xpath of Fox URLs
            link = post.get_attribute("href")

            # Fox search results contain many repeated links, mostly one after the other,
            # so skip a link that is identical to the previous one
            if not fox_masterlist or fox_masterlist[-1] != link:
                fox_masterlist.append(link)

        driver.close()

        if counter % 10 == 0:
            print(counter, time.time() - a)

except Exception as e:
    # Stop at the first failure but keep the URLs collected so far
    print("worked until page", counter, "- error:", e)

print(time.time() - a)
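
The check in the Fox loop only skips a link that repeats the one immediately before it. If the same article also shows up on non-adjacent result pages, a second pass over the finished list can remove those too. This is a minimal sketch; `dedupe_preserving_order` is a hypothetical helper and the sample URLs are made up for illustration.

```python
def dedupe_preserving_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


# Made-up sample: "a" repeats back to back and again later in the list
sample_links = [
    "https://example.com/a",
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/c",
]

unique_links = dedupe_preserving_order(sample_links)
print(unique_links)
```

Applied to the notebook's data, this would be `fox_masterlist = dedupe_preserving_order(fox_masterlist)` before the URLs are saved.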

In [58]:
# Saving URLs as data frames
CNN_trump = pd.DataFrame({'CNN_URL':cnn_masterlist})
CNN_trump.to_csv("CNN URLs.csv")

Fox_trump = pd.DataFrame({'Fox_URL':fox_masterlist})
Fox_trump.to_csv("Fox URLs.csv")

In [90]:
# Title and text scraper from URL
def cnn_text_title(url):
    try:
        time.sleep(0.2)
        page = requests.get(url).text
        soup = BeautifulSoup(page, "html.parser")
        # HTML tag associated with paragraphs
        name_box = soup.find_all("div", attrs={"class": lambda L: L and L.startswith("zn-body__paragraph")})
        text = " ".join(chk.get_text(' ', strip=True) for chk in name_box) # Concatenating all paragraphs together
        return [text, soup.find("h1").text] # Title
    
    except Exception:  # any request or parsing failure yields an empty record
        return ["", ""]
    
CNN_trump["text_title"] = CNN_trump["CNN_URL"].apply(lambda x: cnn_text_title(x))
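
The selector logic above (first `h1` as the title, every `div` whose class starts with `zn-body__paragraph` as body text) can be tried offline on a small HTML string before running it over thousands of URLs. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it needs no network access or extra packages; `ArticleParser`, `parse_article` and the sample HTML are hypothetical and only mimic the structure the scraper expects.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the first <h1> text and the text of matching <div> blocks."""

    def __init__(self, class_prefix="zn-body__paragraph"):
        super().__init__()
        self.class_prefix = class_prefix
        self.title_parts = []
        self.paragraphs = []
        self._in_h1 = False
        self._title_done = False
        self._div_depth = 0  # > 0 while inside a matching <div>
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._title_done:
            self._in_h1 = True
        elif tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1  # nested <div> inside a paragraph block
            elif (dict(attrs).get("class") or "").startswith(self.class_prefix):
                self._div_depth = 1
                self._current = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._title_done = True
        elif tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
            if self._div_depth == 0 and self._current:
                self.paragraphs.append(" ".join(self._current))

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._in_h1:
            self.title_parts.append(data)
        elif self._div_depth > 0:
            self._current.append(data)


def parse_article(html):
    """Return [text, title], the same shape the scraper functions return."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return [" ".join(parser.paragraphs), " ".join(parser.title_parts)]


sample_html = (
    "<html><body><h1>Sample headline</h1>"
    '<div class="zn-body__paragraph speakable">First <a href="#">linked</a> part.</div>'
    '<div class="other">Unrelated block</div>'
    '<div class="zn-body__paragraph">Second paragraph.</div>'
    "</body></html>"
)
print(parse_article(sample_html))
```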

In [91]:
# Title and text scraper from URL
def fox_text_title(url):
    try:
        time.sleep(0.2)
        page = requests.get(url).text
        soup = BeautifulSoup(page, "html.parser")
        # HTML tag associated with paragraphs
        name_box = soup.find_all("p", attrs={"class":''})
        text = " ".join(chk.get_text(' ', strip=True) for chk in name_box if len(list(chk.descendants)) == 1).replace("\xa0", " ").replace("&apos;", "")
        return [text, soup.find("h1").text] # Title
    except Exception:  # any request or parsing failure yields an empty record
        return ["", ""]
    
Fox_trump["text_title"] = Fox_trump["Fox_URL"].apply(lambda x: fox_text_title(x))

In [94]:
# Splitting title and text from list of text and title
CNN_trump["title"] = CNN_trump["text_title"].apply(lambda x: x[1])
CNN_trump["text"] = CNN_trump["text_title"].apply(lambda x: x[0])

# fox_text_title always returns a [text, title] list, so no empty-string check is needed
Fox_trump["title"] = Fox_trump["text_title"].apply(lambda x: x[1])
Fox_trump["text"] = Fox_trump["text_title"].apply(lambda x: x[0])

# CNN_trump.drop("")

The data is now in pandas dataframes with URL, title and text columns. These dataframes are saved as pickle files, which store the text columns exactly as they are in memory. CSV files are more fragile for long article text, because commas, quotes, tabs and line breaks inside a field can break parsing if the file is written or read without proper quoting.

In [98]:
CNN_trump.drop("text_title", axis=1).to_pickle("CNN Trump Articles.pkl")
Fox_trump.drop("text_title", axis=1).to_pickle("Fox Trump Articles.pkl")
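
The CSV concern mentioned above can be illustrated with the standard library alone. In this minimal sketch the record is made up: a naive comma join breaks it into too many fields, a properly quoted CSV survives, and pickle returns the object unchanged. It illustrates the trade-off and is not a claim about how pandas writes CSV files.

```python
import csv
import io
import pickle

# Made-up record containing a comma, quotes, a line break and a tab
rows = [{
    "title": 'Trump says "no deal", aides disagree',
    "text": "Line one,\nline two\twith a tab.",
}]

# Naive comma-joined line: splitting it back yields more than two fields
naive_line = ",".join([rows[0]["title"], rows[0]["text"]])
naive_fields = naive_line.split(",")

# CSV written with the csv module quotes the fields, so they survive
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "text"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
csv_rows = list(csv.DictReader(buffer))

# Pickle round-trips the Python objects without any text parsing
pickled_rows = pickle.loads(pickle.dumps(rows))

print(len(naive_fields), "fields from the naive split")
print(csv_rows[0]["text"] == rows[0]["text"])
print(pickled_rows == rows)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.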