### Aggregate articles to outlet-level

Take all the articles extracted in the previous step and aggregate the data so we get one row per news outlet, with the relevant features (e.g. theme1_AvgTone, theme1_PosTone, ..., themeX_AvgTone).

In [None]:
!pip3 install scikit-learn

In [2]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import time
import ast
import os

In [3]:
def split_on_theme(s):
    if isinstance(s,float) or str(s) == "nan" or not s:
        return []
    # make cell into list of themes
    return list(ast.literal_eval(s))
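
As a quick sanity check, here is a minimal sketch of what `split_on_theme` returns for a few inputs. The helper is repeated so the snippet runs on its own, and the theme strings are made-up examples; the real `THEMES_SUBSET` cells may look different.

```python
import ast

def split_on_theme(s):
    # same logic as the helper above, repeated so this snippet is standalone
    if isinstance(s, float) or str(s) == "nan" or not s:
        return []
    return list(ast.literal_eval(s))

# hypothetical cell values (assumption: cells hold a stringified Python list)
parsed = split_on_theme("['TAX_FNCACT', 'EDUCATION']")
missing = split_on_theme(float("nan"))  # NaN cell from pandas
empty = split_on_theme("")              # empty string
print(parsed, missing, empty)
```

Missing and empty cells both come back as an empty list, so the later `explode` step does not fail on them.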

Get the file and subset the columns to the relevant ones for this analysis

In [4]:
# get extras (headers and mdfc)
articles = pd.read_csv("GDELT_GKG/gkg_csvs/one_year_outlets_eMFD.csv")
#articles = articles.head(1000)

In [None]:
articles.columns

In [6]:
subset = ['V2SOURCECOMMONNAME', 
          'IMGorEMBED', # combo of 3 below
          #'V21RELATEDIMAGES', 
          #'V21SOCIALIMAGEEMBEDS',
          #'V21SOCIALVIDEOEMBEDS', 
          'THEMES_SUBSET', 'PosScore','NegScore', 
          #'AvgTone',  # not needed since it's just PosScore minus NegScore
          'Polarity', 'ActRefDens', 'SelfRefDens', 'WordCount',
          # get the eMFD features
          'care-harm','fairness-cheating', 'loyalty-betrayal', 'authority-subversion',
          'sanctity-degradation']

articles = articles[subset]

In [None]:
articles.shape # (14523991,14)

In [None]:
# how many outlets do we have
articles["V2SOURCECOMMONNAME"].unique().shape # 1512 if we do intersect with MBFC, otherwise 21544

In [9]:
# get total number of articles per outlet for normalising later (?)
total_article_counts = articles.groupby(by="V2SOURCECOMMONNAME").size()

In [None]:
# see what kind of tail we're dealing with - seems there's quite a few outlets that haven't published a lot
total_article_counts.sort_values(ascending=True).head(1000)

In [None]:
# compared to top publishers...
total_article_counts.sort_values(ascending=True).tail(10)

The dataset is large, but many outlets are not very active and publish rarely. We may therefore want to exclude outlets that have published fewer than X articles in the timeframe of the data, so that we avoid mostly uninformative rows.

The six-month sample covers 181 days, so one option would be to keep only outlets that posted at least 5 articles per day. The cell below currently uses 2 articles per day over 365 days, and the filter is switched off by default.

In [12]:
# Decide whether to exclude outlets with less than X articles in the timeframe
filter_out_small_outlets = False

min_art_per_day = 2
min_articles = min_art_per_day*365

# filter out news outlets with fewer than X articles 
if filter_out_small_outlets:
    subset_outlets = total_article_counts[total_article_counts >= min_articles].index
    articles = articles[articles["V2SOURCECOMMONNAME"].isin(subset_outlets)]

In [None]:
# how many outlets do we have now?
articles["V2SOURCECOMMONNAME"].unique().shape # 511 with 6 months, 4349 with all outlets for 1 year

For each row in articles where there are multiple themes, we want to split so that there's one theme per row (with one cell keeping track of all related themes in case).
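
To make the idea concrete, here is a small sketch of `DataFrame.explode` on toy data. The outlet names, themes and scores are invented for illustration only.

```python
import pandas as pd

# toy data with hypothetical outlet and theme names
toy = pd.DataFrame({
    "V2SOURCECOMMONNAME": ["a.com", "b.com", "c.com"],
    "THEME": [["ECON", "HEALTH"], ["ECON"], []],
    "PosScore": [1.0, 2.0, 3.0],
})

# one row per (article, theme); the other columns are repeated
exploded = toy.explode("THEME").reset_index(drop=True)
print(exploded)
```

An article with an empty theme list ends up as a single row with a missing `THEME`. Since `groupby` drops missing keys by default, such articles should not contribute to any outlet/theme group later on.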

In [14]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None

In [None]:
# get robertson ground truths and separate outlets into either Has-GroundTruth or Not-GroundTruth
robertson = pd.read_csv("GDELT_GKG/extras/Bias Ratings/robertson.csv",usecols=["domain","score"])
# use a set so the membership test in the loop below is fast
robertson_outlets = set(robertson.domain.values)

# start processing
start = time.time()
# clean column into list format
articles["THEME"] = articles["THEMES_SUBSET"].apply(split_on_theme)
# add column with list of all themes present in case
#articles["THEMES_SUBSET"] = articles["THEME"]
# can drop theme_subset for now unless find nice way to utilise later
articles.drop(columns=["THEMES_SUBSET"], inplace=True)
# expand a row with multiple themes into many rows with unique themes
e = articles.explode("THEME").reset_index(drop=True)

grouped = e.groupby(by=["V2SOURCECOMMONNAME", "THEME"])

# for each outlet & theme, make a row of sentiment per theme
GT_rows = []
noGT_rows = [] # outlets which don't have ground truth
#i=1
for x in tqdm(grouped.groups,desc="Splitting into rows per outlet and theme..."):
    outlet, theme = x
    group = grouped.get_group(x)
    # average the numeric feature columns only (the string columns are skipped)
    t = group.mean(numeric_only=True)
    colnames = [theme + "_" + col for col in list(t.index)]
    row = pd.DataFrame([t.to_list()], columns=colnames)
    # this gets the number of articles for this outlet with this theme
    art_num = len(group)
    row[theme + "_article_count"] = art_num
    row.insert(0, "outlet", outlet)
    if outlet in robertson_outlets:
        #i+=1
        GT_rows.append(row)
        #if i>50:
        #    print(i)
    else:
        noGT_rows.append(row)

print("Done splitting!")
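
The loop above builds one small DataFrame per outlet/theme pair, which can be slow with many groups. As a possible alternative (a sketch on toy data, not what this notebook actually runs), the same wide layout can be produced with a single `groupby` followed by `unstack`. All values below are made up, and the count columns come out as `<theme>_article_count`, matching the naming used above.

```python
import pandas as pd

# toy stand-in for the exploded frame `e` (hypothetical values)
e_toy = pd.DataFrame({
    "V2SOURCECOMMONNAME": ["a.com", "a.com", "a.com", "b.com"],
    "THEME": ["ECON", "ECON", "HEALTH", "ECON"],
    "PosScore": [1.0, 3.0, 5.0, 2.0],
    "NegScore": [2.0, 4.0, 6.0, 8.0],
})

g = e_toy.groupby(["V2SOURCECOMMONNAME", "THEME"])
means = g.mean(numeric_only=True)    # one row per (outlet, theme)
means["article_count"] = g.size()    # articles per (outlet, theme)

# move THEME from the row index into the column names
wide = means.unstack("THEME")
wide.columns = [f"{theme}_{feature}" for feature, theme in wide.columns]
wide.index.name = "outlet"
print(wide)
```

Themes an outlet never covered stay as NaN here, so it is still up to us whether to fill them with 0 afterwards.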


In [None]:
# split the rows into outlets in MBFC and those outside of MBFC
"""
df_mbfc = pd.read_csv("/home/insert_user/GDELT_GKG/extras/Bias Ratings/MBFC_features.csv")
df_mbfc = df_mbfc.rename(columns={'URL': 'outlet', 'Bias Rating': 'lean', "Latutude":"Latitude"})
mbfc_outlets = df_mbfc.outlet.values

# split rows into outlets which have or don't have Robertson Ground Truths
robertson = pd.read_csv("GDELT_GKG/extras/Bias Ratings/robertson.csv",usecols=["domain","score"])

# gather two separate smaller lists of rows for mbfc outlets and all others
mbfc_rows = []
gdelt_rows = []
for row in tqdm(rows):
    if row.outlet.values[0] in robertson_outlets:
        continue #mbfc_rows.append(row)
    elif row.outlet.values[0] in mbfc_outlets:
        continue #mbfc_rows.append(row)
    else:
        gdelt_rows.append(row)

print(len(mbfc_rows), len(gdelt_rows))
"""

In [None]:
# `rows` only exists if the commented-out cell above is run, so free just `articles`
del articles

## GDELT Dataset

Let's now do the same, but without the MBFC extra information for the rest of the GDELT outlets.

In [17]:
def batch_concat(rows,stepsize,save_name):
    max_batches = len(rows)
    
    for batch_start in tqdm(np.arange(0,max_batches,step=stepsize)):
        # if file already exists, move on
        if os.path.exists("/home/insert_user/GDELT_GKG/extras/{}_{}.csv".format(save_name,int(batch_start))):
            print("Skipping, file already exists for this batch!")
            continue

        batch_end = batch_start + stepsize

        # slice ends are exclusive, so cap at len(rows) to keep the last row
        if batch_end > max_batches:
            batch_end = max_batches

        # concatenate batch
        sub_rows = rows[batch_start:batch_end]
        sub_df = pd.concat(sub_rows)

        # do some processing while we're at it
        sub_df.fillna(0, inplace=True) # replace nan's
        sub_df.set_index("outlet", inplace=True)
        # get aggregate functions per column - mean for all except for article counts
        agg_dict = {col:('mean' if not col.endswith("_article_count") else 'sum') 
                for col in sub_df.columns}
        # group by so we get the final rows, apply aggregate function for all cols
        res = sub_df.groupby("outlet").aggregate(agg_dict)

        # normalise article counts per theme by total article counts per outlet
        if "outlet" not in res.columns.to_list(): # 
            res["outlet"] = res.index
        res["tot_art"] = res["outlet"].map(total_article_counts)
        tot_art_columns = res.columns.str.endswith("_article_count")
        res.loc[:,tot_art_columns] = res.loc[:,tot_art_columns].div(res["tot_art"], axis=0)

        # save to file
        #sub_df.to_csv("gdelt_articles_part_{}.csv".format(batch))
        res.to_csv("/home/insert_user/GDELT_GKG/extras/{}_{}.csv".format(save_name,int(batch_start)))
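
One thing that may be worth double-checking in `batch_concat` is the order of `fillna(0)` and the mean. Each row coming out of the loop only has values for one theme, so filling the gaps with 0 before averaging pulls those zeros into the mean. The sketch below shows the difference on a toy example with one hypothetical outlet and two themes; it is an illustration, not a claim about which behaviour is intended here.

```python
import pandas as pd

# two per-theme rows for one hypothetical outlet, shaped like the loop output
rows_toy = pd.concat([
    pd.DataFrame({"outlet": ["a.com"], "ECON_PosScore": [4.0]}),
    pd.DataFrame({"outlet": ["a.com"], "HEALTH_PosScore": [2.0]}),
])

# fill first, then average: the filled zeros are averaged in
filled_first = rows_toy.fillna(0).groupby("outlet").mean()

# average first (NaNs are skipped), then fill whatever is still missing
mean_first = rows_toy.groupby("outlet").mean().fillna(0)

print(filled_first)
print(mean_first)
```

With the fill-first order the ECON score is halved because the HEALTH row contributes a zero; with the mean-first order each theme keeps its own average.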


In [None]:
max_batches = len(GT_rows)
stepsize = 100000

batch_concat(GT_rows,stepsize=stepsize,save_name="robertson_outlets_part")

In [None]:
max_batches = len(noGT_rows)
stepsize = 100000

batch_concat(noGT_rows,stepsize=stepsize,save_name="gdelt_outlets_2_part")

Let's now combine these concatenated dfs.
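
A minimal sketch of how the part files could be read back and stacked. `combine_parts` is a hypothetical helper (it is not defined anywhere else in this notebook), and it assumes each part file has the outlet in its first column and follows the `<save_name>_<batch_start>.csv` naming used by `batch_concat`. The demo writes two tiny files to a temporary folder instead of touching the real data.

```python
import glob
import os
import tempfile

import pandas as pd

def combine_parts(folder, save_name):
    # hypothetical helper: gather every "<save_name>_<batch_start>.csv" file
    pattern = os.path.join(folder, "{}_*.csv".format(save_name))
    parts = [pd.read_csv(p, index_col=0) for p in sorted(glob.glob(pattern))]
    # themes missing from a part become NaN on concat; fill them with 0
    return pd.concat(parts).fillna(0)

# demo with two tiny part files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    part_a = pd.DataFrame({"ECON_PosScore": [1.0]},
                          index=pd.Index(["a.com"], name="outlet"))
    part_b = pd.DataFrame({"HEALTH_PosScore": [2.0]},
                          index=pd.Index(["b.com"], name="outlet"))
    part_a.to_csv(os.path.join(tmp, "demo_part_0.csv"))
    part_b.to_csv(os.path.join(tmp, "demo_part_100000.csv"))
    combined = combine_parts(tmp, "demo_part")

print(combined)
```

Because the batches are cut by row position, an outlet whose rows straddle a batch boundary would show up once per part file, so the combined frame may still need a final per-outlet aggregation.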

## MBFC Dataset Rows

In [None]:
# concat sentiment rows so that we get a line with outlet and theme-sentiment 
# in one row, populated with 0s when new sentiments added
print("Concatenating...")
# NOTE: mbfc_rows is only built in the commented-out split cell above
df = pd.concat(mbfc_rows)

df.fillna(0, inplace=True) # replace nan's
df.set_index("outlet", inplace=True)

# `rows` is not defined unless the commented-out split cell was run, so it is not deleted here
del colnames, e, grouped, outlet, row, t, theme, x

"""
now group by each outlet and aggregate by the average! -> we get a row of 
each outlet's tones per theme! we want to do the average for all except
the counts of how many articles the outlet published featuring a theme
"""

print("Aggregating...")

agg_dict = {col:('mean' if not col.endswith("_article_count") else 'sum') 
            for col in df.columns}
# group by so we get the final rows
res = df.groupby("outlet").aggregate(agg_dict)

end = time.time()

print("This took {} seconds".format(end-start))

In [None]:
res.head()

In [None]:
res.shape

Now we have a row per outlet, with each outlet's average tone per theme (and number of articles mentioning that theme).

Let's add the features from MBFC to the data.

### Fixing MBFC Features data

We first need to rename some things...

In [None]:
# if we need to change something without rerunning all code above
#res = pd.read_csv("/home/insert_user/GDELT_GKG/outlet_sentiments.csv")
# skip faulty MBFC columns
#res = res.iloc[:,:-8]

In [None]:
# get full dataframe of measures extracted from MBFC
df_mbfc = pd.read_csv("/home/insert_user/GDELT_GKG/extras/MBFC_features.csv")
# rename columns
df_mbfc = df_mbfc.rename(columns={'URL': 'outlet', 'Bias Rating': 'lean', "Latutude":"Latitude"})

In [None]:
df_mbfc.head()

In [None]:
res = res.merge(df_mbfc[["outlet","lean","Factuality","PressFreedom","MediaType","Traffic","Credibility","Longitude","Latitude"]], 
                on="outlet", how = 'left') # add other categs
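
With a left merge, outlets that have no MBFC entry silently get NaN in the new columns. A small sketch of how that could be checked using the `indicator` option of `merge`; the frames below are toy data with hypothetical outlets and a made-up `lean` value.

```python
import pandas as pd

# toy stand-ins for `res` and `df_mbfc` (hypothetical values)
res_toy = pd.DataFrame({"outlet": ["a.com", "b.com"],
                        "ECON_PosScore": [1.0, 2.0]})
mbfc_toy = pd.DataFrame({"outlet": ["a.com"], "lean": ["Left"]})

# indicator=True adds a "_merge" column saying where each row was found
merged = res_toy.merge(mbfc_toy, on="outlet", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "outlet"].tolist()
print(unmatched)
```

Listing the unmatched outlets is a cheap way to spot naming mismatches between the GDELT source names and the MBFC URLs before going further.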

In [None]:
res.head()

Since different outlets will greatly vary in how many articles they publish in a day, we want to make sure to normalise the article counts per theme by the total number of articles that the outlet publishes in general - this way we get a ratio of articles per theme.

In [None]:
# normalise article counts per theme by total article counts per outlet
res["tot_art"] = res["outlet"].map(total_article_counts)
tot_art_columns = res.columns.str.endswith("_article_count")
res.loc[:,tot_art_columns] = res.loc[:,tot_art_columns].div(res["tot_art"], axis=0)
#res.drop(columns=["tot_art"], inplace=True) let's keep this if we want to remove outlets with few articles later

In [None]:
res.head()

In [None]:
# save file
res.to_csv("/home/insert_user/GDELT_GKG/extras/mbfc_outlet_sentiments.csv", index=False)

In [None]:
res = pd.read_csv("/home/insert_user/GDELT_GKG/extras/mbfc_outlet_sentiments.csv")