# Notebook for Named Entity Recognition

Using spaCy for named entity recognition, we want to create relative frequency tables for the entities by year. At this point, we are only interested in the entities that appear most frequently.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (my copy is renamed to "Fakespeak_ENG_modified.xlsx" for a more consistent path). It will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [None]:
!pip install "spacy~=3.0.6"

In [None]:
!python -m spacy download en_core_web_md

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
!pip install spacy-entity-linker==1.0.3

In [None]:
!python -m spacy_entity_linker "download_knowledge_base"

In [1]:
import os
from typing import Iterable
from itertools import chain
import spacy
from spacy.tokens.doc import Doc
from spacy_entity_linker.EntityElement import EntityElement
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG

In [None]:
# Only run this code if you're loading from Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Loading the articles

In [3]:
fakespeak_config = BASE_FAKESPEAK_CONFIG | {
    "output_path": "./data/Fakespeak-ENG/Analysis_output/Fakespeak_named_entities_frequency.xlsx",
    "output_headlines_path": "./data/Fakespeak-ENG/Analysis_output/Fakespeak_named_entities_frequency_headlines.xlsx",
    "usecols": BASE_FAKESPEAK_CONFIG["usecols"] + ["originalHeadline"]
}

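misinfotext_config = BASE_MISINFOTEXT_CONFIG | {
    # Write results under Analysis_output so the source workbook is not overwritten
    "output_path": "./data/MisInfoText/Analysis_output/MisInfoText_named_entities_frequency.xlsx",
    "output_headlines_path": "./data/MisInfoText/Analysis_output/MisInfoText_named_entities_frequency_headlines.xlsx",
}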

In [6]:
using_dataset = misinfotext_config

In [None]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)

Unnamed: 0,factcheckURL,originalURL,originalBodyText,originalHeadline,originalTextType,originalDate,originalDateYear
0,http://www.politifact.com/arizona/statements/2...,https://associatedmediacoverage.com/three-stat...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016-05-06,2016
1,http://www.politifact.com/california/statement...,https://users.focalbeam.com/fs/distribution:wl...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016-07-12,2016
2,http://www.politifact.com/california/statement...,http://www.sacbee.com/opinion/op-ed/soapbox/ar...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017-08-04,2017
3,http://www.politifact.com/california/statement...,https://nocagastax.com/california-gas-tax-hike...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017-06-15,2017
4,http://www.politifact.com/california/statement...,https://chu.house.gov/media-center/press-relea...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017-05-04,2017
...,...,...,...,...,...,...,...
650,http://www.politifact.com/wisconsin/statements...,https://x.com/ScottWalker/status/9428776407421...,Road projects across the state are staying on ...,,Social media,2017-12-18,2017
651,http://www.politifact.com/wisconsin/statements...,https://x.com/ScottWalker/status/9511017961011...,The last thing we need is more Madison in our ...,,Social media,2018-01-10,2018
652,http://www.politifact.com/wisconsin/statements...,https://x.com/MahlonMitchell/status/9538161542...,When \n@ScottWalker\n told firefighters we did...,,Social media,2018-01-18,2018
653,http://www.politifact.com/wisconsin/statements...,http://dailycaller.com/2018/01/25/hey-look-sen...,"Now that its 2018, an election year, I would l...",HEY LOOK! Senator Tammy Baldwin Is Back In Wis...,News and blog,2018-01-25,2018


In [None]:
# Set this to True to keep only the "News and blog" and
# "Social media" text types. Results are then saved to a separate
# subdirectory so the existing output files are not overwritten.
only_use_news_blog_and_social_media = True

if only_use_news_blog_and_social_media:
    # .copy() avoids SettingWithCopyWarning when columns are added later
    dataset_df = dataset_df[dataset_df["originalTextType"].isin(["News and blog", "Social media"])].copy()

    # using_dataset is a dict, so paths are read and written by key
    output_path_split = using_dataset["output_path"].split("/")
    output_path_split.insert(len(output_path_split) - 1, "news_blog_and_social_media")
    using_dataset["output_path"] = "/".join(output_path_split)

    output_headlines_path_split = using_dataset["output_headlines_path"].split("/")
    output_headlines_path_split.insert(len(output_headlines_path_split) - 1, "news_blog_and_social_media")
    using_dataset["output_headlines_path"] = "/".join(output_headlines_path_split)

    # Both output files share this directory
    os.makedirs("/".join(output_path_split[:-1]), exist_ok=True)
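
The path handling in the cell above could also be expressed with `pathlib`. The following is a minimal sketch rather than the notebook's own code: `with_subdirectory` is a hypothetical helper, and the example path is the Fakespeak output path from the config cell.

```python
from pathlib import PurePosixPath


def with_subdirectory(path: str, subdirectory: str) -> str:
    """Return `path` with `subdirectory` inserted just before the file name."""
    original = PurePosixPath(path)
    return str(original.parent / subdirectory / original.name)


example = with_subdirectory(
    "./data/Fakespeak-ENG/Analysis_output/Fakespeak_named_entities_frequency.xlsx",
    "news_blog_and_social_media",
)
print(example)
```

Note that `pathlib` drops the leading `./` when it normalizes the path, so the result is still relative to the working directory but is not character-for-character identical to the string-splitting version.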

In [7]:
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalHeadline,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",,2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,,2019


## Tagging named entities using spaCy

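Consolidating similar named entities is difficult, so tagging accuracy in the initial NER step matters. A larger spaCy web model should help here. Note that the cell below currently loads `en_core_web_md`; swap in `en_core_web_lg` (downloaded above) to use the large model.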

Documentation for entityLinker: https://github.com/egerber/spaCy-entity-linker

In [8]:
# load spacy model
nlp = spacy.load("en_core_web_md")

# add custom entityLinker pipeline
entity_linker = nlp.add_pipe("entityLinker", last=True)


In [None]:
def get_entities_from_doc(doc: Doc) -> Iterable[EntityElement]:
    return doc._.linkedEntities

def get_entity_data(row: pd.Series, entities_col: str):
    entities: Iterable[EntityElement] = row[entities_col]
    return [{
        "Entity": entity.get_label(),
        "Wikidata_id": entity.get_id(),
        "Wikidata_url": entity.get_url(),
        "Year": row[using_dataset["year_col"]],
        "Span_text": entity.get_span().text,
    } for entity in entities]

In [10]:
# For some reason, any spans of just "President" (or similar)
# get tagged as Zhong Chenle, maybe because he has an alias "President".
# The following code fixes that to point to the correct Wikidata entry
# for the generic term "president".

zhong_chenle_president_aliases = {'PRESIDENT', 'President', 'Presidents'}
zhong_chenle_wikidata_id = 30945670
president_wikidata_id = 30461

def clean_incorrect_president_entity(df: pd.DataFrame):
    zhong_chenle_as_president_filter = (df["Wikidata_id"] == zhong_chenle_wikidata_id) & (df["Span_text"].isin(zhong_chenle_president_aliases))
    df.loc[zhong_chenle_as_president_filter, "Entity"] = "president"
    df.loc[zhong_chenle_as_president_filter, "Wikidata_id"] = president_wikidata_id
    df.loc[zhong_chenle_as_president_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{president_wikidata_id}"

In [11]:
# A similar thing is happening where the state of Texas
# is sometimes confused for a musical play named "Texas". 

texas_musical_wikidata_id = 7707415
texas_state_wikidata_id = 1439

def clean_incorrect_texas_entity(df: pd.DataFrame):
    texas_musical_filter = df["Wikidata_id"] == texas_musical_wikidata_id
    df.loc[texas_musical_filter, "Wikidata_id"] = texas_state_wikidata_id
    df.loc[texas_musical_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{texas_state_wikidata_id}"
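
If more mislinked entities turn up, the two fixes above could be folded into a single lookup table. The sketch below is only an illustration in plain Python (no pandas), under the assumption that each entity is a dict with the same keys as the rows built by `get_entity_data`. `CORRECTIONS` and `apply_corrections` are hypothetical names, the Wikidata IDs are the ones used in the two cells above, and the sample records are toy values.

```python
# Hypothetical correction table: wrongly linked Wikidata ID -> replacement.
# "spans" limits a rule to specific span texts; None means "always apply".
# "label" replaces the entity label; None means "keep the existing label".
CORRECTIONS = {
    30945670: {"id": 30461, "label": "president",
               "spans": {"PRESIDENT", "President", "Presidents"}},
    7707415: {"id": 1439, "label": None, "spans": None},
}


def apply_corrections(records):
    """Return new records with known mislinked entities repointed."""
    fixed = []
    for record in records:
        rule = CORRECTIONS.get(record["Wikidata_id"])
        applies = rule is not None and (
            rule["spans"] is None or record["Span_text"] in rule["spans"]
        )
        if applies:
            record = dict(record)  # copy so the input is left untouched
            record["Wikidata_id"] = rule["id"]
            record["Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{rule['id']}"
            if rule["label"] is not None:
                record["Entity"] = rule["label"]
        fixed.append(record)
    return fixed


# Toy records for illustration only
sample = [
    {"Entity": "Zhong Chenle", "Wikidata_id": 30945670,
     "Wikidata_url": "https://www.wikidata.org/wiki/Q30945670", "Span_text": "President"},
    {"Entity": "Zhong Chenle", "Wikidata_id": 30945670,
     "Wikidata_url": "https://www.wikidata.org/wiki/Q30945670", "Span_text": "Chenle"},
    {"Entity": "Texas", "Wikidata_id": 7707415,
     "Wikidata_url": "https://www.wikidata.org/wiki/Q7707415", "Span_text": "Texas"},
]
fixed = apply_corrections(sample)
```

As in the original cells, the "president" rule only fires for the listed span texts, while the Texas rule fires for every row carrying the musical's ID.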

In [None]:
dataset_df["doc"] = list(nlp.pipe(dataset_df[using_dataset["text_col"]]))
dataset_df["entities"] = dataset_df["doc"].apply(get_entities_from_doc)

all_entities_data = list(chain.from_iterable(dataset_df.apply(get_entity_data, args=("entities",), axis=1)))
entities_df = pd.DataFrame(all_entities_data)

clean_incorrect_president_entity(entities_df)
clean_incorrect_texas_entity(entities_df)

entities_df

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,Mexico,96,https://www.wikidata.org/wiki/Q96,2019,Mexico
1,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
2,United States–Mexico–Canada Agreement,56839716,https://www.wikidata.org/wiki/Q56839716,2019,USMCA
3,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
4,parking lot,6501349,https://www.wikidata.org/wiki/Q6501349,2019,lot
...,...,...,...,...,...
106236,UPDATE,1076005,https://www.wikidata.org/wiki/Q1076005,2023,UPDATES
106237,INSANE,3153089,https://www.wikidata.org/wiki/Q3153089,2023,INSANE
106238,tax,8161,https://www.wikidata.org/wiki/Q8161,2023,TAXES
106239,Batouri Airport,2265760,https://www.wikidata.org/wiki/Q2265760,2023,OUR


In [13]:
dataset_df['originalHeadline'] = dataset_df['originalHeadline'].fillna("")
dataset_df["doc_headline"] = list(nlp.pipe(dataset_df['originalHeadline']))
dataset_df["entities_headline"] = dataset_df["doc_headline"].apply(get_entities_from_doc)

all_entities_headlines_data = list(chain.from_iterable(dataset_df.apply(get_entity_data, args=("entities_headline",), axis=1)))
entities_headlines_df = pd.DataFrame(all_entities_headlines_data)

clean_incorrect_president_entity(entities_headlines_df)
clean_incorrect_texas_entity(entities_headlines_df)

entities_headlines_df

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,Joe Biden,6279,https://www.wikidata.org/wiki/Q6279,2019,Joe Biden
1,agency,3951828,https://www.wikidata.org/wiki/Q3951828,2019,Thoughts
2,Straight,7620981,https://www.wikidata.org/wiki/Q7620981,2019,Straight
3,Tom Selleck,213706,https://www.wikidata.org/wiki/Q213706,2019,Tom Selleck
4,You,39082126,https://www.wikidata.org/wiki/Q39082126,2019,You
...,...,...,...,...,...
2247,Place,29468697,https://www.wikidata.org/wiki/Q29468697,2021,Place
2248,The Watch,29313,https://www.wikidata.org/wiki/Q29313,2020,WATCH
2249,Bill Gates,5284,https://www.wikidata.org/wiki/Q5284,2020,Bill Gates
2250,Vaccine,7907941,https://www.wikidata.org/wiki/Q7907941,2020,Vaccine


## Filter dataframes by year and named entities
Currently, entityLinker links every entity it can match, not just proper nouns. To work around this, we first split the entities into one dataframe per year, then tag each entity label with spaCy's part-of-speech tagger. The POS tags let us filter the dataframes further by dropping common nouns.

In [14]:
grouped_by_year = entities_df.groupby(by="Year")
entity_years_dfs = [grouped_by_year.get_group(group).copy() for group in grouped_by_year.groups]

In [15]:
grouped_by_year_headlines = entities_headlines_df.groupby(by="Year")
entity_years_headlines_dfs = [grouped_by_year_headlines.get_group(group).copy() for group in grouped_by_year_headlines.groups]

In [16]:
# Helper function for counting entities in each year
def get_count(df: pd.DataFrame):
    # Number of mentions per entity label, repeated on every row of that entity
    df['Count'] = df.groupby(['Entity'])['Wikidata_id'].transform('count')
    sorted_df = df.sort_values(by=['Count', 'Entity', 'Wikidata_id'], ascending=False)
    # Keep a single row per Wikidata ID
    unique_df = sorted_df.drop_duplicates(subset=["Wikidata_id"])

    return unique_df
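
To make the count, sort, and de-duplicate steps easier to follow, here is a rough plain-Python equivalent of `get_count`. It is a sketch under the assumption that each mention is a dict with `Entity` and `Wikidata_id` keys. `count_entities` is a hypothetical name and the mentions are toy values.

```python
from collections import Counter


def count_entities(records):
    """Count mentions per entity label, then keep one row per Wikidata ID."""
    counts = Counter(record["Entity"] for record in records)
    # Most frequent first; ties broken by label, then ID, in descending order
    ranked = sorted(
        records,
        key=lambda r: (counts[r["Entity"]], r["Entity"], r["Wikidata_id"]),
        reverse=True,
    )
    seen_ids = set()
    unique = []
    for record in ranked:
        if record["Wikidata_id"] in seen_ids:
            continue
        seen_ids.add(record["Wikidata_id"])
        unique.append({**record, "Count": counts[record["Entity"]]})
    return unique


# Toy mentions for illustration only
mentions = [
    {"Entity": "Donald Trump", "Wikidata_id": 22686},
    {"Entity": "Joe Biden", "Wikidata_id": 6279},
    {"Entity": "Donald Trump", "Wikidata_id": 22686},
    {"Entity": "Donald Trump", "Wikidata_id": 22686},
]
table = count_entities(mentions)
```

As in `get_count`, mentions are counted by entity label while duplicates are dropped by Wikidata ID, so two IDs that share a label would each appear once with the same combined count.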

In [17]:
# from each dataframe, obtain the counts of entities, sort by count, then keep unique values
# dropping N/A values to account for error in entityLinker tagging
entity_counts_dfs = [get_count(df).dropna() for df in entity_years_dfs]

entity_counts_dfs[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count
178,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,91.0
43,human,5,https://www.wikidata.org/wiki/Q5,2019,person,68.0
129,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,67.0
67,year,577,https://www.wikidata.org/wiki/Q577,2019,year,53.0
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,50.0


In [18]:
entity_counts_headlines_dfs = [get_count(df).dropna() for df in entity_years_headlines_dfs]

entity_counts_headlines_dfs[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count
5,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald Trump,3
0,Joe Biden,6279,https://www.wikidata.org/wiki/Q6279,2019,Joe Biden,2
1323,Alexandria Ocasio-Cortez,55223040,https://www.wikidata.org/wiki/Q55223040,2019,AOC,2
1724,training,918385,https://www.wikidata.org/wiki/Q918385,2019,training,1
1415,taxpayer,1938414,https://www.wikidata.org/wiki/Q1938414,2019,Taxpayers,1


In [19]:
tagger = spacy.load("en_core_web_md")

Keep only the entities whose label is tagged as a proper noun, which removes common words. Only the first token of each label is checked.

In [20]:
for df in entity_counts_dfs:
    df['POS'] = [doc[0].pos_ for doc in tagger.pipe(df['Entity'])]

proper_noun_entity_counts_df = [df[df["POS"] == "PROPN"].copy() for df in entity_counts_dfs]

proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS
178,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,91.0,PROPN
129,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,67.0,PROPN
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,50.0,PROPN
427,United States Congress,11268,https://www.wikidata.org/wiki/Q11268,2019,US Congress,34.0,PROPN
61769,Virginia,1370,https://www.wikidata.org/wiki/Q1370,2019,Virginia,22.0,PROPN


In [21]:
for df in entity_counts_headlines_dfs:
    df['POS'] = [doc[0].pos_ for doc in tagger.pipe(df['Entity'])]

proper_noun_entity_counts_headlines_df = [df[df["POS"] == "PROPN"].copy() for df in entity_counts_headlines_dfs]

proper_noun_entity_counts_headlines_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS
5,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald Trump,3,PROPN
0,Joe Biden,6279,https://www.wikidata.org/wiki/Q6279,2019,Joe Biden,2,PROPN
1323,Alexandria Ocasio-Cortez,55223040,https://www.wikidata.org/wiki/Q55223040,2019,AOC,2,PROPN
1731,social,345367,https://www.wikidata.org/wiki/Q345367,2019,Social,1,PROPN
30,millimetre,174789,https://www.wikidata.org/wiki/Q174789,2019,mm,1,PROPN


In [22]:
for df in proper_noun_entity_counts_df:
    df["Proportion"] = df['Count'] / df['Count'].sum()

proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS,Proportion
178,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,91.0,PROPN,0.056034
129,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,67.0,PROPN,0.041256
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,50.0,PROPN,0.030788
427,United States Congress,11268,https://www.wikidata.org/wiki/Q11268,2019,US Congress,34.0,PROPN,0.020936
61769,Virginia,1370,https://www.wikidata.org/wiki/Q1370,2019,Virginia,22.0,PROPN,0.013547


In [23]:
for df in proper_noun_entity_counts_headlines_df:
    df["Proportion"] = df['Count'] / df['Count'].sum()

proper_noun_entity_counts_headlines_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS,Proportion
5,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald Trump,3,PROPN,0.078947
0,Joe Biden,6279,https://www.wikidata.org/wiki/Q6279,2019,Joe Biden,2,PROPN,0.052632
1323,Alexandria Ocasio-Cortez,55223040,https://www.wikidata.org/wiki/Q55223040,2019,AOC,2,PROPN,0.052632
1731,social,345367,https://www.wikidata.org/wiki/Q345367,2019,Social,1,PROPN,0.026316
30,millimetre,174789,https://www.wikidata.org/wiki/Q174789,2019,mm,1,PROPN,0.026316
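
The `Proportion` column divides each entity's count by the sum of the counts that remain after the proper-noun filter, so within a year the proportions add up to 1. A small plain-Python sketch of that arithmetic, using toy counts and a hypothetical `add_proportions` helper:

```python
def add_proportions(rows):
    """Attach each row's share of the total count across all rows."""
    total = sum(row["Count"] for row in rows)
    return [{**row, "Proportion": row["Count"] / total} for row in rows]


# Toy counts for illustration only
rows = [
    {"Entity": "A", "Count": 6},
    {"Entity": "B", "Count": 3},
    {"Entity": "C", "Count": 1},
]
with_share = add_proportions(rows)
total_share = sum(row["Proportion"] for row in with_share)
```

Because the denominator only covers entities that survived the filter, the values are shares of proper-noun mentions in that year, not shares of all linked entities.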


## Write results to Excel spreadsheet

In [None]:
!pip install xlsxwriter

In [24]:
# Create an Excel writer for a new workbook; the context manager
# saves and closes the file on exit.
# using_dataset is a dict, so the path is read by key.
with pd.ExcelWriter(using_dataset["output_path"], engine="xlsxwriter") as writer:
    # One sheet per year
    for df in proper_noun_entity_counts_df:
        year = str(df["Year"].iloc[0])
        df.to_excel(writer, sheet_name=year, columns=['Entity', 'Wikidata_id', 'Wikidata_url', 'Count', 'Proportion'], index=False)

In [25]:
# Create an Excel writer for the headlines workbook; the context manager
# saves and closes the file on exit.
with pd.ExcelWriter(using_dataset["output_headlines_path"], engine="xlsxwriter") as writer:
    # One sheet per year
    for df in proper_noun_entity_counts_headlines_df:
        year = str(df["Year"].iloc[0])
        df.to_excel(writer, sheet_name=year, columns=['Entity', 'Wikidata_id', 'Wikidata_url', 'Count', 'Proportion'], index=False)