# Analyzing the Discourse on 'Climate Change' in U.S. Media: A Basic Tutorial of How to Do Data Science With Python

Version: 1.0 (8 August 2022)

For a more detailed version, please see the corresponding tutorial on [Medium.com]().

--------

# Introduction

## What is the aim of this tutorial?
In this contribution, I will collect and analyze a small dataset of news articles retrieved via an API to illustrate the typical pipeline for data-driven research:

**Collecting Data** ➔ **Cleaning/Transforming Data** ➔ **Analyzing/Visualizing Data**

This tutorial is aimed at beginners who have just started doing data science with Python. My goal is to illustrate the overall approach mentioned above. Therefore, I will restrict the analysis to a few easy-to-understand methods. Still, I hope to show that basic methods can lead to promising first results, even when applied to a rather small dataset.

## How are we going to do this?
In this tutorial, I will mainly be using [Jupyter](https://jupyter.org/) notebooks and the programming language [Python](https://www.python.org/).

Jupyter notebooks are interactive documents that can be displayed in the browser. Among other things, they allow the step-by-step execution of code in code cells as well as detailed documentation of the code in text cells via [Markdown](https://www.markdownguide.org/). Jupyter notebooks are particularly suitable for data-driven research because they make each step of the analysis transparent while presenting the results in a way that everyone can follow, including people without any programming knowledge.

The Jupyter notebook of this analysis is available on [GitHub](https://github.com/thomjur/climate_change_in_us_media_tutorial).

# Data

The data we'll be working with consists of English articles from well-known U.S. media websites that mention the term "climate change," which I have collected using [News API's](https://newsapi.org/) *free tier*.

In this case, an API (*application programming interface*) is an interface that enables programs or users to access and retrieve data from an external web server (usually in a JSON format). Regarding our example, querying the News API allows us to retrieve large amounts of article data in a semi-structured form using a simple HTTP query string. For the details of the queries, please see the

A typical article that we collect via the News API looks like this:

```json
{
    "source": {
        "id": "reuters",
        "name": "Reuters"
    },
    "author": null,
    "title": "Wary shoppers muddy outlook for tech, auto firms in Asia - Reuters",
    "description": "Asian tech firms from chipmaker Samsung to display panel maker [...]",
    "urlToImage": "https://www.reuters.com/resizer/43w65Nb0zXMVr68fW8Al2pM83M8=/1200x628",
    "publishedAt": "2022-07-28T08:02:00Z",
    "content": "July 28 (Reuters) - Asian tech firms from chipmaker Samsung to display … [+5170 chars]"
}
```

As we can see, it includes a lot of information. In this tutorial, we will focus on the titles of the retrieved articles. The dataset for our analysis includes articles published between 22 June 2022 and 22 July 2022 from the following news websites:

1. Fox News
2. Breitbart
3. The Washington Post
4. CNN

If you want to include other websites, you can retrieve their articles via the News API in the same way. The articles from the four websites were grouped into two corpora according to the general political orientation of the websites (right-wing/conservative vs. liberal): Fox News and Breitbart (**Corpus Conservative**, 195 articles) and The Washington Post and CNN (**Corpus Liberal**, 184 articles).
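The grouping described above can be sketched in a few lines. The snippet below is a minimal illustration, not the notebook's actual code: it parses articles in the JSON shape shown earlier, reads each `source.id`, and assigns the title to a corpus. The mapping `CORPUS_BY_SOURCE` and the helper `assign_corpus` are hypothetical names, and the sample titles are made up.

```python
import json
from collections import defaultdict

# Hypothetical mapping from News API source IDs to the two corpora.
CORPUS_BY_SOURCE = {
    "fox-news": "conservative",
    "breitbart-news": "conservative",
    "cnn": "liberal",
    "the-washington-post": "liberal",
}

def assign_corpus(articles):
    """Group article titles by corpus; unknown sources and missing titles are skipped."""
    corpora = defaultdict(list)
    for article in articles:
        corpus = CORPUS_BY_SOURCE.get(article["source"]["id"])
        if corpus is not None and article.get("title"):
            corpora[corpus].append(article["title"])
    return dict(corpora)

# Illustrative input in the same shape as the example article (titles are made up).
raw = '''[
  {"source": {"id": "cnn", "name": "CNN"}, "title": "Example title A"},
  {"source": {"id": "fox-news", "name": "Fox News"}, "title": "Example title B"},
  {"source": {"id": "reuters", "name": "Reuters"}, "title": "Example title C"}
]'''

grouped = assign_corpus(json.loads(raw))
print(grouped)
```

Keeping the mapping in one dictionary makes it easy to add further websites to either corpus.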


# Research question and methods
In this tutorial, the titles of the retrieved articles will be examined. The following methods are used in the analysis, and their results are visualized:

1. **Named-Entity Recognition**
2. **Bag-of-Words**
3. **Sentiment Analysis**

Since the focus of this article lies on the interplay of the individual steps in the pipeline, the analysis is restricted to a selection of easy-to-understand and easy-to-implement techniques. For a more in-depth analysis, please feel free to add further methods, for example from the field of corpus linguistics, or a complementary qualitative analysis in the sense of a *mixed-methods* approach. It might also make sense to increase the size and variety of the corpus data.

---------------

# ANALYSIS


In [None]:
from newsapi import NewsApiClient
import pickle

# 1. Collecting and Loading Data (News API)

In [None]:
# News API source IDs of the news websites

NEWSPAGES_A = ["fox-news", "breitbart-news"]
NEWSPAGES_B = ["cnn", "the-washington-post"]

## 1.1 Collecting data via the News API

In [None]:
# Initialize the News API wrapper (replace the placeholder with your own API key)
newsapi = NewsApiClient(api_key='YOUR_KEY_GOES_HERE')

In [None]:
# collecting data for Corpus Conservative

for portal in NEWSPAGES_A:
    all_articles = newsapi.get_everything(q='\"climate change\"',
                                          sources=portal,
                                         )
    with open(portal+'.pkl', 'wb') as f:
        pickle.dump(all_articles, f)

In [None]:
# collecting data for Corpus Liberal

for portal in NEWSPAGES_B:
    all_articles = newsapi.get_everything(q='\"climate change\"',
                                          sources=portal,
                                         )
    with open(portal+'.pkl', 'wb') as f:
        pickle.dump(all_articles, f)
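The wrapper calls above do not restrict the publication date. As a hedged sketch of what a date-bounded request could look like at the HTTP level, the snippet below builds a query string for News API's `/v2/everything` endpoint using only the standard library. The parameter names (`q`, `sources`, `from`, `to`, `language`, `apiKey`) follow News API's public documentation as far as I am aware; the helper `build_query_url` is a hypothetical name, and whether a given date range is retrievable depends on your plan.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"

def build_query_url(source, api_key, date_from, date_to):
    """Return the request URL for an exact-phrase search on one source."""
    params = {
        "q": '"climate change"',  # quotes request an exact phrase match
        "sources": source,
        "from": date_from,        # ISO 8601 date, e.g. 2022-06-22
        "to": date_to,
        "language": "en",
        "apiKey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url("cnn", "YOUR_KEY_GOES_HERE", "2022-06-22", "2022-07-22")
print(url)
```

With the `newsapi` wrapper, the corresponding keyword arguments of `get_everything` are, to my knowledge, `from_param`, `to`, and `language`; please check the wrapper's documentation before relying on them.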

## 1.2 Loading stored data

In [None]:
# loading Corpus Conservative

titles_corpus_A = list()
for portal in NEWSPAGES_A:
    with open(portal+'.pkl', "rb") as f:
        data = pickle.load(f)
        print(f'Loading data from {portal}.')
        for article in data['articles']:
            titles_corpus_A.append(article['title'])
print(f'\nData for {len(titles_corpus_A)} articles are available.')

In [None]:
# loading Corpus Liberal

titles_corpus_B = list()
for portal in NEWSPAGES_B:
    with open(portal+'.pkl', "rb") as f:
        data = pickle.load(f)
        print(f'Loading data from {portal}.')
        for article in data['articles']:
            titles_corpus_B.append(article['title'])
print(f'\nData for {len(titles_corpus_B)} articles are available.')

In [None]:
titles_corpus_B[:5]
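The pipeline sketched in the introduction includes a cleaning step, which this notebook keeps implicit. As one possible, minimal cleaning pass, the snippet below drops missing and duplicate titles and strips a trailing source suffix such as ` - Reuters` (as seen in the example article). The function `clean_titles` and the suffix rule are my assumptions, not part of the original analysis; the sample titles are made up.

```python
import re

# Matches a trailing " - Source Name" suffix (assumed convention, see example article).
SUFFIX_PATTERN = re.compile(r"\s+-\s+[^-]+$")

def clean_titles(titles):
    """Remove None/empty entries, strip source suffixes, and drop duplicates (order kept)."""
    seen = set()
    cleaned = []
    for title in titles:
        if not title:
            continue
        title = SUFFIX_PATTERN.sub("", title).strip()
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    return cleaned

sample = [
    "Example headline one - Reuters",
    None,
    "Example headline one - Reuters",
    "Example headline two",
]
print(clean_titles(sample))
```

Note that the suffix rule would also shorten a title that legitimately ends in ` - something`, so it is worth inspecting a few results before applying it to a whole corpus.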

# 2. Analysis & Visualization

In [None]:
# import necessary libraries

import spacy
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
sns.set_theme(style="whitegrid")

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
docs_A = nlp(" ".join(titles_corpus_A))

In [None]:
docs_B = nlp(" ".join(titles_corpus_B))

## 2.1 Named-Entity Recognition (NER)

### NER: Corpus Conservative

In [None]:
corpus_A_NER_counter_text = Counter() 
corpus_A_NER_counter_label = Counter()

for ent in docs_A.ents:
    corpus_A_NER_counter_text.update([ent.text])
    corpus_A_NER_counter_label.update([ent.label_])

#### Named-Entities (Individuals)

In [None]:
corpus_A_NER_counter_text.most_common(10)

#### Named-Entities (Categories)

In [None]:
corpus_A_NER_counter_label.most_common(5)

### NER: Corpus Liberal

In [None]:
corpus_B_NER_counter_text = Counter() 
corpus_B_NER_counter_label = Counter()

for ent in docs_B.ents:
    corpus_B_NER_counter_text.update([ent.text])
    corpus_B_NER_counter_label.update([ent.label_])

#### Named-Entities (Individuals)

In [None]:
corpus_B_NER_counter_text.most_common(10)

#### Named-Entities (Categories)

In [None]:
corpus_B_NER_counter_label.most_common(5)

### Visualization / Comparison

In [None]:
df_con = pd.DataFrame(corpus_A_NER_counter_label.most_common(10), columns=['entity', 'amount'])
df_lib = pd.DataFrame(corpus_B_NER_counter_label.most_common(10), columns=['entity', 'amount'])
df_con['corpus'] = 'conservative'
df_lib['corpus'] = 'liberal'

df = pd.concat([df_con, df_lib])

In [None]:
f, ax = plt.subplots(figsize=(10, 7))

sns.barplot(
    data=df,
    x='amount', y='entity', hue="corpus",
    palette={'conservative': 'r', 'liberal': 'b'},
     alpha=.8
)
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(xlabel="Abs. Freq.", ylabel='NER Category')
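Because the two corpora differ slightly in size (195 vs. 184 articles), absolute counts are not strictly comparable. A minimal sketch of one way to normalize is to divide each label count by the total number of entities found in its corpus. The helper `relative_frequencies` is a hypothetical name, and the counts below are placeholders, not results from this analysis.

```python
from collections import Counter

def relative_frequencies(counter):
    """Convert absolute counts to shares of the counter's total (empty dict if no counts)."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counter.items()}

# Placeholder counts for illustration only.
toy_counter = Counter({"PERSON": 5, "GPE": 3, "ORG": 2})
shares = relative_frequencies(toy_counter)
print(shares)
```

In the notebook, the same function could be applied to `corpus_A_NER_counter_label` and `corpus_B_NER_counter_label` before building the DataFrame.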

## 2.2 Bag of Words (BoW)

In [None]:
corpus_A_BoW = Counter() 
corpus_B_BoW = Counter()
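The two counters above are initialized but not filled in this notebook; the word clouds below compute their word frequencies internally. As a hedged sketch of how such counters could be populated, the snippet tokenizes titles with a simple regular expression and skips a tiny stop-word list. `STOP_WORDS`, `bag_of_words`, and the sample titles are illustrative assumptions; in the notebook, spaCy's tokens would be a natural alternative.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"the", "and", "for", "are", "with"}

def bag_of_words(titles, min_length=3):
    """Count lower-cased word tokens, skipping short tokens and stop words."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z']+", title.lower())
        counts.update(t for t in tokens if len(t) >= min_length and t not in STOP_WORDS)
    return counts

# Made-up titles for illustration.
sample_titles = [
    "Example title about climate policy",
    "Another example title about the climate",
]
bow = bag_of_words(sample_titles)
print(bow.most_common(3))
```

The `min_length=3` default mirrors the `min_word_length=3` setting used for the word clouds.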

### Word Cloud and BoW Corpus Conservative

In [None]:
wc = WordCloud(width=800, height=400, background_color="white", min_word_length=3).generate(docs_A.text.lower())
plt.figure(figsize=(40,80))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

### Word Cloud and BoW Corpus Liberal

In [None]:
wc = WordCloud(width=800, height=400, background_color="white", min_word_length=3).generate(docs_B.text.lower())
plt.figure(figsize=(40,80))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

## 2.3 Sentiment Analysis

In [None]:
from textblob import TextBlob

### Sentiment Analysis: Corpus Conservative

In [None]:
negatives = 0
neg_sentences_A = ""
neg_sentences_A_list = list()

for title in titles_corpus_A:
    polarity = TextBlob(title).sentiment.polarity
    if polarity < 0:
        negatives += 1
        neg_sentences_A += title + " "
        neg_sentences_A_list.append(title)

print(f'{negatives/len(titles_corpus_A)*100:.2f}% of the titles are negative.')
        

### Sentiment Analysis: Corpus Liberal

In [None]:
negatives = 0
neg_sentences_B = ""
neg_sentences_B_list = list()

for title in titles_corpus_B:
    polarity = TextBlob(title).sentiment.polarity
    if polarity < 0:
        negatives += 1
        neg_sentences_B += title + " "
        neg_sentences_B_list.append(title)

print(f'{negatives/len(titles_corpus_B)*100:.2f}% of the titles are negative.')
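The two sentiment cells repeat the same loop. A minimal sketch of a reusable helper is shown below; it takes the scoring function as an argument, so it runs here with a toy scorer and without TextBlob installed. `negative_share` and `toy_polarity` are hypothetical names, and the toy scorer is only a stand-in, not a sentiment model.

```python
def negative_share(titles, polarity_fn):
    """Return (share of titles with polarity < 0, list of those titles)."""
    negative_titles = [t for t in titles if polarity_fn(t) < 0]
    share = len(negative_titles) / len(titles) if titles else 0.0
    return share, negative_titles

def toy_polarity(text):
    """Stand-in scorer: -1.0 if the text contains 'bad', +1.0 if 'good', else 0.0."""
    lowered = text.lower()
    if "bad" in lowered:
        return -1.0
    if "good" in lowered:
        return 1.0
    return 0.0

# Made-up titles for illustration.
sample = ["A bad example", "A good example", "A neutral example", "Another bad one"]
share, negatives_found = negative_share(sample, toy_polarity)
print(f"{share * 100:.2f}% of the titles are negative.")
```

In the notebook, passing `lambda t: TextBlob(t).sentiment.polarity` as `polarity_fn` should reproduce the loops above for either corpus.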
        

### Word Cloud: Negative Sentences Corpus Conservative

In [None]:
wc = WordCloud(width=800, height=400, background_color="white", min_word_length=3).generate(neg_sentences_A.lower())
plt.figure(figsize=(40,80))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

### Word Cloud: Negative Sentences Corpus Liberal

In [None]:
wc = WordCloud(width=800, height=400, background_color="white", min_word_length=3).generate(neg_sentences_B.lower())
plt.figure(figsize=(40,80))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

# Qualitative Analysis

In [None]:
import random

In [None]:
# Corpus Conservative
random.shuffle(neg_sentences_A_list)
neg_sentences_A_list[:5]

In [None]:
# Corpus Liberal
random.shuffle(neg_sentences_B_list)
neg_sentences_B_list[:5]
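`random.shuffle` reorders the lists in place and yields a different sample on every run. If reproducible samples are preferred, one option is a seeded generator with `random.Random(seed).sample`, sketched below. The seed value and the helper name `sample_titles` are arbitrary choices of mine.

```python
import random

def sample_titles(titles, k=5, seed=42):
    """Return up to k titles drawn without replacement; the input list is left unchanged."""
    rng = random.Random(seed)
    return rng.sample(titles, min(k, len(titles)))

# Made-up titles for illustration.
example = [f"Example title {i}" for i in range(10)]
first = sample_titles(example)
second = sample_titles(example)
print(first == second)
```

Because the generator is re-seeded on each call, repeated calls with the same list and seed return the same titles, which makes the qualitative reading easier to share with others.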