# Dataset statistics

This notebook does not include any code specific to the study. It only computes descriptive statistics for the news article corpus published with the paper.

Paper reference: _Spliethöver, Keiff, Wachsmuth (2022): "No Word Embedding Model Is Perfect: Evaluating the Representation Accuracy for Social Bias in the Media", EMNLP 2022, Abu Dhabi._

Code & Data reference: https://github.com/webis-de/EMNLP-22

## Data preparation and loading

Please run the following cells before computing any statistics. They import the required packages, set commonly used variables, and load the articles. The statistics cells below depend on them.

In [1]:
import json
import pandas as pd
import sqlite3
import sys

from os import path
from tqdm.notebook import tqdm

PARENT_DIR = path.abspath("../src")
sys.path.append(PARENT_DIR)
from embedding_bias.config import NEWS_ARTICLE_DB_NAME
from embedding_bias.util import *

tqdm.pandas()

In [2]:
# PARENT_DIR is a plain string, so use path.dirname instead of a pathlib-style `.parent`
DATA_DIR = path.join(path.dirname(PARENT_DIR), "data")
DB_PATH = path.join(DATA_DIR, "raw", NEWS_ARTICLE_DB_NAME)
ALLSIDES_RANKING_PATH = path.join(DATA_DIR, "raw", "allsides-ranking.csv")
OUTLET_CONFIG_PATH = path.join(DATA_DIR, "raw", "outlet-config.json")

# Target sqlite database
target_db_connection = sqlite3.connect(DB_PATH)

# Outlet config file
outlet_config = pd.read_json(OUTLET_CONFIG_PATH)
outlet_selection = outlet_config

# Allsides ranking
allsides_ranking = pd.read_csv(ALLSIDES_RANKING_PATH)

# Map from outlet name to political orientation
outlet_orientation_map = {
    o["name"].lower(): o["allsides_rating"]
    for i,o in outlet_selection.iterrows()}

# Groups of political orientations
orientation_groups = {
    "left": ["Lean Left", "Left"],
    "center": ["Center"],
    "right": ["Lean Right", "Right"]}

In [3]:
articles = get_articles_as_df(
    allsides_ranking=allsides_ranking,
    db_connection=target_db_connection,
    outlet_selection=outlet_selection,
    preprocessed=True)

Collecting articles for HuffPost
Collecting articles for Daily Beast
Collecting articles for Vox
Collecting articles for Rolling Stone
Collecting articles for Newsweek
Collecting articles for MSNBC
Collecting articles for Vice
Collecting articles for Slate
Collecting articles for New York Daily News
Collecting articles for The New Yorker
Collecting articles for CNN
Collecting articles for The New York Times
Collecting articles for The Guardian
Collecting articles for The Washington Post
Collecting articles for Bloomberg
Collecting articles for ABC News
Collecting articles for NBC News
Collecting articles for Politico
Collecting articles for U.S. News & World Report
Collecting articles for CBS News
Collecting articles for BBC News
Collecting articles for Heavy
Collecting articles for Business Insider
Collecting articles for Patch
Collecting articles for CNBC
Collecting articles for Forbes
Collecting articles for Reuters
Collecting articles for FiveThirtyEight
Collecting articles for USA

In [4]:
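# Ignore articles with malformed dates (non-empty, but shorter than a full YYYY-MM-DD string).
# The explicit copy avoids pandas' SettingWithCopyWarning when adding columns below.
articles = articles[~((articles.date.str.len() < 10) & (articles.date.str.len() > 0))].copy()
articles["date_dt"] = pd.to_datetime(articles.date, format="%Y-%m-%d")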

# Add orientation groupings
articles["orientation_group"] = articles.orientation.apply(
    lambda x: [k for k, g in orientation_groups.items() if x in g][0])
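
The date filter and the orientation grouping in the cell above can be tried out on a small toy frame. The following is a minimal sketch, not the notebook's own code: the `toy` rows are made up for illustration, `rating_to_group` is a hypothetical helper name, and `errors="coerce"` is an assumption added here so that empty dates reliably become `NaT`.

```python
import pandas as pd

# Toy stand-in for the real `articles` frame (values are made up).
toy = pd.DataFrame({
    "date": ["2017-03-13", "", "2018", "2021-11-18"],
    "orientation": ["Left", "Lean Left", "Center", "Right"],
})

# Same rule as in the notebook: drop rows whose date is non-empty but
# shorter than the 10 characters of a full YYYY-MM-DD string.
date_len = toy.date.str.len()
toy = toy[~((date_len < 10) & (date_len > 0))].copy()

# Rows with an empty date are kept and end up as NaT.
toy["date_dt"] = pd.to_datetime(toy.date, format="%Y-%m-%d", errors="coerce")

# Coarse groups as defined in the notebook.
orientation_groups = {
    "left": ["Lean Left", "Left"],
    "center": ["Center"],
    "right": ["Lean Right", "Right"]}

# Hypothetical inverted lookup: fine-grained rating -> coarse group.
rating_to_group = {
    rating: group
    for group, ratings in orientation_groups.items()
    for rating in ratings}
toy["orientation_group"] = toy.orientation.map(rating_to_group)

print(toy)
```

Mapping through an inverted dictionary gives the same result as the list-comprehension lookup, but yields `NaN` instead of raising an `IndexError` if a rating is missing from `orientation_groups`.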

In [5]:
articles

Unnamed: 0,text,date,outlet,orientation,date_dt,orientation_group
0,The term “integrative therapies” describes the...,2017-03-13,HuffPost,Left,2017-03-13,left
1,"If his politics is rooted in communal hatred, ...",,HuffPost,Left,NaT,left
2,"“I had people ― wealthy, billionaires ― callin...",2018-02-28,HuffPost,Left,2018-02-28,left
4,HuffPost Canada closed in 2021 and this site i...,,HuffPost,Left,NaT,left
5,I was more than a little disappointed when I s...,,HuffPost,Left,NaT,left
...,...,...,...,...,...,...
496624,There is more bad news out of US college campu...,2016-04-07,Townhall,Right,2016-04-07,right
496625,Million Insights - World's Fastest Growing Mar...,2021-11-18,Townhall,Right,2021-11-18,right
496626,The opinions expressed by columnists are their...,2020-10-09,Townhall,Right,2020-10-09,right
496627,Data & News supplied by www.cloudquote.io Stoc...,2021-01-08,Townhall,Right,2021-01-08,right


## Computing statistics

In [6]:
# Total number of news articles
total_articles = len(articles)

# Total number of media outlets
total_outlets = len(articles.outlet.unique())

# Number of news articles per media outlet
articles_per_outlet = articles.outlet.value_counts()

# Number of news articles per political orientation
articles_per_orientation = articles.orientation.value_counts()

# Number of news articles per political orientation group
# (coarser-grained division of orientations)
articles_per_orientation_group = articles.orientation_group.value_counts()

# Number of news articles for which an automatically extracted date of publication exists
articles_with_date = len(articles[~(articles.date.str.len() < 10)])

# Number of news articles for which _no_ date of publication exists
articles_without_date = len(articles[(articles.date.str.len() < 10)])

# All news articles that have an associated publication date within the timeframe of interest
articles_filtered_by_date = articles[
    (articles.date_dt.dt.year >= 2010) & (articles.date_dt.dt.year <= 2021)].date_dt.dropna()

# Number of news articles published in each year of interest
articles_per_year = articles_filtered_by_date.dt.year.value_counts().sort_index()

# Number of news articles per political orientation in each year of interest
articles_per_year_per_orientation = {
    group: articles_filtered_by_date[
        articles.orientation == group].dt.year.value_counts().sort_index()
    for group in articles.orientation.unique()
}

# Number of media outlets available for each political orientation
outlets_per_orientation = {
    orientation: len(articles[articles.orientation == orientation].outlet.unique())
    for orientation in articles.orientation.unique()
}
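
As an alternative view of the per-year, per-orientation counts, the same numbers can be arranged in a single table. This is a minimal sketch under stated assumptions: the `toy` frame is made up, and `pd.crosstab` is swapped in for the dictionary of `value_counts()` results used in the notebook.

```python
import pandas as pd

# Toy stand-in for the real `articles` frame (values are made up).
toy = pd.DataFrame({
    "orientation": ["Left", "Left", "Right", "Right", "Center", "Left"],
    "date_dt": pd.to_datetime(
        ["2010-05-01", "2010-07-15", "2010-01-02",
         "2021-12-31", "2009-06-30", None]),
})

# Keep only articles dated within the timeframe of interest (2010 to 2021).
# Rows without a date (NaT) compare as False and are dropped as well.
in_range = toy.date_dt.dt.year.between(2010, 2021)
filtered = toy[in_range]

# One row per year, one column per orientation, zero where no article exists.
per_year_per_orientation = pd.crosstab(
    filtered.date_dt.dt.year.rename("year"), filtered.orientation)

print(per_year_per_orientation)
```

Unlike the dictionary of separate series, the table makes years in which an orientation has no articles visible as explicit zeros.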

In [7]:
# Pretty printing calculated statistics

with pd.option_context('display.max_rows', None):
    separator_length = 60
    print("#" * (separator_length + 20))
    print(f"Total number of articles:\t{total_articles}")
    print(f"Total number of outlets:\t{total_outlets}")

    print("-" * separator_length)
    print("Articles per outlet:")
    print(articles_per_outlet)

    print("-" * separator_length)
    print("Articles per bias:")
    print(articles_per_orientation)

    print("-" * separator_length)
    print("Articles per bias group:")
    print(articles_per_orientation_group)

    print("-" * separator_length)
    print("Outlets per orientation")
    for orientation, count in outlets_per_orientation.items():
        print(f"{orientation}:\t{count}")

    print("#" * (separator_length + 20))
    print(f"With date\t\t{articles_with_date}")
    print(f"Without date\t{articles_without_date}")

    print("-" * separator_length)
    print("Articles per year (filtered)")
    print(articles_per_year)

    print("-" * separator_length)
    print("Articles per year (filtered) per orientation")
    for group in articles.orientation.unique():
        print(f"{group.capitalize()}\n{articles_per_year_per_orientation[group]}")

################################################################################
Total number of articles:	520798
Total number of outlets:	47
------------------------------------------------------------
Articles per outlet:
Newsweek                    36602
The Washington Post         35457
Vox                         33974
HuffPost                    33498
Breitbart News              28230
Daily Beast                 27991
Wall Street Journal         23975
Washington Times            23964
New York Post               22287
The New Yorker              19789
Rolling Stone               19602
BBC News                    19468
CNBC                        18814
U.S. News & World Report    18704
PJ Media                    16344
Newsmax                     15156
Reuters                     14453
Vice                        14015
CNN                         11647
Slate                       10607
MSNBC                        7392
ABC News                     6640
Red State                   