# Evaluating How Liberal vs. Conservative Opinion News Shows Influenced the Narrative about the COVID-19 Pandemic

**Group Members: Jennifer Andre, Tobi Jegede, Callie Lambert, & Lori Zakalik**

**Date**: May 2, 2022

**Disclaimer:** To run our team's notebook, please make sure the following Python libraries and packages are installed on your device:
* glob
* os
* matplotlib.pyplot
* numpy
* string
* regex
* sklearn.feature_extraction
* operator
* collections
* spacy
* sklearn.decomposition
* wordcloud
* nltk.sentiment
    *  Note: If you run into issues after installing the above package (nltk.sentiment), please use the code chunk below: 
    
        ```python
        import nltk
        import ssl

        # work around SSL certificate errors when downloading NLTK data
        try:
            _create_unverified_https_context = ssl._create_unverified_context
        except AttributeError:
            pass
        else:
            ssl._create_default_https_context = _create_unverified_https_context

        nltk.download('vader_lexicon')
        ```

## Explanation of the Data

As mentioned in our Final Project Proposal, we wanted to specifically analyze how the different **opinion** news arms of popular cable news channels talked about the COVID-19 pandemic. 

Using research from the Pew Research Center, we found that CNN and MSNBC were the most popular news channels among individuals who consistently voted for liberal political candidates, while Fox News was the most watched cable news channel among individuals who consistently voted for conservative political candidates. 

We then found additional articles on the most watched TV shows on CNN, MSNBC, and Fox News. These showed that Anderson Cooper 360 was the most watched show on CNN, Rachel Maddow was the most watched show on MSNBC, and Tucker Carlson and The Five were the most watched shows on Fox News. 

We then used web scraping to pull the transcript text files for each of the shows mentioned above from March 2020 to March 2022. The web scraping code can be found in the code folder on our project's GitHub page: https://github.com/tobijegede/opinion-news-nlp 



# Setup & Data Pre-Processing 

In [1]:
#import packages & libraries
import glob 
import os
import matplotlib.pyplot as plt
import numpy as np
import string
import regex as re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from operator import itemgetter
from collections import Counter
import spacy
from sklearn.decomposition import LatentDirichletAllocation
from wordcloud import WordCloud
from nltk.sentiment import SentimentIntensityAnalyzer

## Load Data

**Disclaimer:** You may need to adjust the code chunk below to properly load in the data. Our current file structure is:
- Code Repository Folder (opinion-news-nlp)
    - data
        - 01-raw
            - name of opinion news show
                - a list of text files

In [5]:

#get the starting file path (the parent folder of the current working directory)
repo_path = os.path.dirname(os.getcwd()) 

#read in the liberal corpus
rm_paths = glob.glob(repo_path + "/data/01-raw/rachel_maddow/*.txt") #the paths for the rachel maddow transcript files
ac_paths = glob.glob(repo_path + "/data/01-raw/anderson_cooper/*.txt") #the paths for the anderson cooper transcript files
all_liberal_files = rm_paths + ac_paths # all liberal transcripts

#read in the conservative corpus 
tc_paths = glob.glob(repo_path + "/data/01-raw/tucker_carlson/*.txt") #the paths for the tucker carlson transcript files
tf_paths = glob.glob(repo_path + '/data/01-raw/the_five/*.txt') #the paths for the five transcript files
all_conservative_files = tc_paths + tf_paths #all conservative transcripts

#read in the CDC transcripts
cdc_paths = glob.glob(repo_path + "/data/01-raw/cdc_press_releases/*.txt") #the paths for cdc transcript files


The basic summary statistics for the data that we used in our analysis are as follows:
1. Conservative Corpus, N = 458
    - Tucker Carlson, 208
    - The Five, 250
2. Liberal Corpus, N = 1,008
    - Anderson Cooper, 530
    - Rachel Maddow, 478
3. CDC, N = 47
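
The counts above can be checked directly against the folder layout. The snippet below is a minimal sketch, assuming the `data/01-raw/<show>/*.txt` structure described earlier; `count_transcripts` is a hypothetical helper name and is not part of the project code.

```python
import glob
import os


def count_transcripts(raw_dir):
    """Return {show_folder_name: number of .txt files} for each subfolder of raw_dir."""
    counts = {}
    for show_dir in sorted(glob.glob(os.path.join(raw_dir, "*"))):
        if os.path.isdir(show_dir):
            n_files = len(glob.glob(os.path.join(show_dir, "*.txt")))
            counts[os.path.basename(show_dir)] = n_files
    return counts


# Example call (assumes the notebook is run from the repo's code folder,
# as in the loading cell above):
# repo_path = os.path.dirname(os.getcwd())
# print(count_transcripts(os.path.join(repo_path, "data", "01-raw")))
```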


Although we have fewer transcripts in our conservative corpus, the transcripts are **{insert average transcript length here}** <<EDIT THIS!>>
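
One way to compare transcript lengths across corpora is to count whitespace-separated tokens per file. This is a minimal sketch, assuming the transcripts are UTF-8 plain text; `average_word_count` is a hypothetical helper name, and splitting on whitespace is only a rough proxy for word count.

```python
def average_word_count(paths, encoding="utf-8"):
    """Mean number of whitespace-separated tokens per file; 0.0 for an empty list."""
    if not paths:
        return 0.0
    total = 0
    for path in paths:
        # errors="ignore" skips undecodable bytes instead of raising
        with open(path, encoding=encoding, errors="ignore") as f:
            total += len(f.read().split())
    return total / len(paths)


# Example calls, using the path lists built in the loading cell:
# average_word_count(all_conservative_files)
# average_word_count(all_liberal_files)
```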

## Define COVID Terms

In the cells below, we create lists of words to use in the co-occurrence analysis later in our notebook. We specifically wanted to catalog most of the ways that COVID, mask, and vaccine words could show up in the transcripts in the dataset.

In [None]:
# covid terms
#covid_terms = ['coronavirus', 'covid', 'covid-19', 'covid-', 'covid19', 'virus']

covid_terms = ['coronavirus', 'covid', 'covid-19', 'covid-', 
                'covid19', 'virus', 'sars', 'sars-', 'sars-cov-2']

In [None]:
# vaccine terms
#vaccine_terms = ['vaccine', 'vaccination', 'vaccinated', 'vaccinated', 'mrna', 'booster', 'vax', 'vaxx', 'vaxxed']

vaccine_terms = ['vaccine', 'vaccination', 'vaccinated', 'mrna', 'booster', 'vax', 'vaxx', 
                'vaxxed', 'pfizer', 'moderna', 'johnson', 'j&j']



In [None]:
# mask terms
#mask_terms = ['mask', 'masking']

mask_terms = ['mask', 'masking', 'n95', 'kn95']

In [None]:
# other COVID-related terms (can choose to use or not)
other_terms = ['china', 'wuhan', 'mandate', 'pandemic', 'epidemic', 'virus',
                'distancing', 'spread', 'immunity', 'incubation', 'quarantine']

# combine the lists, dropping duplicates such as 'virus' while keeping order
all_covid_terms = list(dict.fromkeys(covid_terms + other_terms))
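
To illustrate how term lists like these can feed a co-occurrence count, here is a minimal sketch, not the notebook's actual implementation. It assumes a simple whitespace tokenizer and a fixed token window; `cooccurrence_counts` and its `window` parameter are hypothetical names introduced for this example.

```python
from collections import Counter
import string


def cooccurrence_counts(document, target_terms, window=5):
    """Count words appearing within `window` tokens of any target term."""
    # lowercase and strip surrounding punctuation, keeping inner hyphens (e.g. 'covid-19')
    tokens = [tok.strip(string.punctuation) for tok in document.lower().split()]
    tokens = [tok for tok in tokens if tok]
    targets = set(target_terms)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in targets:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            # skip the target terms themselves so only context words are counted
            counts.update(w for w in neighbours if w not in targets)
    return counts


example = "The COVID-19 vaccine rollout began. Mask mandates and the virus spread."
print(cooccurrence_counts(example, ["covid-19", "virus"], window=2))
```

In the notebook, a list such as `covid_terms` would take the place of the short target list used here. Note that a word sitting near two different target terms is counted once per target in this sketch.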

## Update Stop Words

In the code chunk below, we add corpus-specific stop words for the conservative, liberal, and CDC sources. This removes named entities, such as the names of the talk show hosts and the CDC itself, as well as transcript markers like "crosstalk" and "video clip", from the list of relevant words to count.

In [None]:
# add network, host names for conservative news corpus
add_stop_words_conservative = ['tucker', 'carlson', 'fox', 'news', 'five', 
                'greg', 'gutfeld', 'dana', 'perino', 'jesse', 'watters', 
                'jeanine', 'pirro', 'geraldo', 'rivera', 'jessica', 'tarlov',
                'harold', 'ford', 'jr', 'ok', 'williams',  'pavlich', 
                'mcdowell', 'juan', 'thanks', 'crosstalk', 'unidentified',
                 'video', 'clip', 'voiceover', 'videotape']

# add host names, important figures for the liberal news corpus
add_stop_words_liberal = ['anderson', 'cooper', 'rachel', 'maddow', 
                  'chris', 'hayes', 'ari', 'berman', 'michael', 'osterholm',
                  'cnn', 'msnbc', 'cnns', 'msnbcs',
                  'vivek', 'murthy', 'rochelle', 'walensky', 'jerome', 'adams', 'alex', 'azar',
                  'anthony', 'fauci', 'faucis',
                  'cuomo', 'erin', 'david',
                  'leana', 'wen', 'deborah', 'birx',
                  'robert', 'redfield', 'gavin', 'newsom',
                  'ashish', 'jha', 'tom', 'frieden',
                  'video', 'clip', 'voiceover', 'videotape']

# add common words for the cdc news corpus
add_stop_words_cdc = ['question', 'cdc', "fauci", "dr", "thanks", "thank", "people"]


In [None]:
#create the full set of stop words for each of the corpora
full_stop_words_conservative = text.ENGLISH_STOP_WORDS.union(add_stop_words_conservative)
full_stop_words_liberal = text.ENGLISH_STOP_WORDS.union(add_stop_words_liberal)
full_stop_words_cdc = text.ENGLISH_STOP_WORDS.union(add_stop_words_cdc)
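
As a rough illustration of how a combined stop word set like these can be passed to sklearn's `CountVectorizer`, consider the sketch below. The toy documents and the shortened stop word list are made up for the example and are not drawn from the transcripts.

```python
import numpy as np
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# toy documents standing in for cleaned transcripts (illustrative only)
docs = [
    "tucker said the vaccine mandate is spreading",
    "the mandate and the vaccine were discussed on fox news",
]

# same construction as in the cell above, with a shortened custom list
stop_words = text.ENGLISH_STOP_WORDS.union(["tucker", "fox", "news"])

# CountVectorizer documents 'english', a list, or None for stop_words,
# so the frozenset is converted to a list here
vectorizer = CountVectorizer(stop_words=list(stop_words))
counts = vectorizer.fit_transform(docs)

# total count of each remaining word across all documents
totals = np.asarray(counts.sum(axis=0)).ravel()
word_totals = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}
print(sorted(word_totals.items(), key=lambda kv: -kv[1]))
```

Converting the set to a list is a defensive choice: `union` returns a `frozenset`, and some scikit-learn versions validate the `stop_words` argument strictly.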

## Read in, Clean, & Store Cleaned Transcripts

## Store Transcripts by Year

# Analysis #1: Word Frequency Analysis

To get a high-level picture of the datasets that we had available, and to do a basic sanity check, we first conducted a word frequency analysis using sklearn's `CountVectorizer` and

# Analysis #2: Co-Occurrence Analysis

# Analysis #3: Topic Modeling using LDA

# Analysis #4: Sentiment Analysis

# Policy Implications

**INSERT SOMETHING HERE ABOUT POLICY IMPLICATIONS**

# Future Work & Analysis Limitations

**Future Work**
1. Look at transcripts of **local news channels** instead of national news channels to better approximate localized opinions about COVID-19 and its associated mitigation strategies
    - This matches the way that COVID-19 mitigation is being addressed now: on a case-by-case, local level
2. Examine **change over time** in more depth by comparing coverage of COVID-19 under different presidential administrations, rather than pooling them, to see how the discussion of the pandemic has changed

**Limitations**
1. With text analysis alone, the **context** and **tone** surrounding the words are absent
    - We found that the words used across the liberal and conservative news channels are the same, but the context or tone in which they are used could differ (and meaningfully so). Text analysis alone cannot capture this