# Notebook for Named Entity Recognition

Using spaCy for named entity recognition, we want to create relative frequency tables for the entities by year. At this point, we are only interested in the entities that appear most frequently.
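
As a rough illustration of what "relative frequency by year" means here, the sketch below computes each entity's share of its year's mentions using only the standard library. The `mentions` data and the `relative_frequencies` helper are hypothetical and are not part of the notebook, which does the same kind of calculation with pandas further down.

```python
from collections import Counter, defaultdict

# Hypothetical (entity, year) pairs standing in for the linked entities.
mentions = [
    ("Donald Trump", 2019),
    ("Donald Trump", 2019),
    ("Mexico", 2019),
    ("Ontario", 2023),
]

def relative_frequencies(pairs):
    """Map each year to {entity: share of that year's mentions}."""
    counts_by_year = defaultdict(Counter)
    for entity, year in pairs:
        counts_by_year[year][entity] += 1
    return {
        year: {entity: n / sum(counts.values()) for entity, n in counts.items()}
        for year, counts in counts_by_year.items()
    }

tables = relative_frequencies(mentions)
```

Within each year the shares sum to 1, so years with different numbers of articles can be compared directly.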

This notebook was written for the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" to create a more consistent path); the configuration below also covers data from MisInfoText.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [None]:
!pip install "spacy~=3.0.6"

In [None]:
!python -m spacy download en_core_web_md

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
!pip install spacy-entity-linker==1.0.3

In [None]:
!python -m spacy_entity_linker "download_knowledge_base"

In [1]:
import os
from typing import Iterable
from itertools import chain
import spacy
from spacy.tokens.doc import Doc
from spacy_entity_linker.EntityElement import EntityElement
import pandas as pd

In [None]:
# Only run this code if you're loading from Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Loading the articles

In [2]:
from typing import Optional

class DatasetConfig:
    """Paths and read options for one input spreadsheet."""
    input_path: str
    output_path: str
    sheet_name: str
    usecols: Optional[list[str]]  # None means "read every column"

    def __init__(self, input_path: str, output_path: str, sheet_name: str, usecols: Optional[list[str]]):
        self.input_path = input_path
        self.output_path = output_path
        self.sheet_name = sheet_name
        self.usecols = usecols

In [3]:
fakespeak_config = DatasetConfig(
    # Google Drive alternative:
    # input_path="/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx",
    input_path="./data/Fakespeak-ENG/Fakespeak-ENG modified.xlsx",
    output_path="./data/Fakespeak-ENG/Analysis_output/Fakespeak_named_entities_frequency.xlsx",
    sheet_name="Working",
    usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear']
)

misinfotext_config = DatasetConfig(
    input_path="./data/MisInfoText/PolitiFact_original_modified.xlsx",
    output_path="./data/MisInfoText/Analysis_output/MisInfoText_named_entities_frequency.xlsx",
    sheet_name="Working",
    usecols=None
)

In [None]:
using_dataset = misinfotext_config

In [39]:
dataset_df = pd.read_excel(
    using_dataset.input_path, 
    sheet_name=using_dataset.sheet_name, 
    usecols=using_dataset.usecols
)

In [40]:
dataset_df.head()

Unnamed: 0,ID,combinedLabel,originalTextType,originalBodyText,originalDateYear
0,Politifact_FALSE_Social media_687276,False,Social media,Mexico is paying for the Wall through the new ...,2019
1,Politifact_FALSE_Social media_25111,False,Social media,"Chuck Schumer: ""why should American citizens b...",2019
2,Politifact_FALSE_Social media_735424,False,Social media,Billions of dollars are sent to the State of C...,2019
3,Politifact_FALSE_Social media_594307,False,Social media,If 50 Billion $$ were set aside to go towards ...,2019
4,Politifact_FALSE_Social media_839325,False,Social media,Huge@#CD 9 news. \n@ncsbe\n sent letter to eve...,2019


## Tagging named entities using spaCy

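To make up for the difficulties of consolidating similar named entities, we want high tagging accuracy in the initial NER step. spaCy's large web model (`en_core_web_lg`, downloaded above) is the intended choice; note that the cell below currently loads the medium model (`en_core_web_md`).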

Documentation for entityLinker: https://github.com/egerber/spaCy-entity-linker

In [7]:
# load spacy model
nlp = spacy.load("en_core_web_md")

# add custom entityLinker pipeline
entity_linker = nlp.add_pipe("entityLinker", last=True)

In [8]:
def get_entities_from_doc(doc: Doc) -> Iterable[EntityElement]:
    return doc._.linkedEntities

def get_entity_data(row: pd.Series):
    entities: Iterable[EntityElement] = row["entities"]
    return [{
        "Entity": entity.get_label(),
        "Wikidata_id": entity.get_id(),
        "Wikidata_url": entity.get_url(),
        "Year": row["originalDateYear"],
        "Span_text": entity.get_span().text,
    } for entity in entities]

In [41]:
dataset_df["doc"] = list(nlp.pipe(dataset_df['originalBodyText']))
dataset_df["entities"] = dataset_df["doc"].apply(get_entities_from_doc)

all_entities_data = list(chain.from_iterable(dataset_df.apply(get_entity_data, axis=1)))
entities_df = pd.DataFrame(all_entities_data)
entities_df

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,Mexico,96,https://www.wikidata.org/wiki/Q96,2019,Mexico
1,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
2,United States–Mexico–Canada Agreement,56839716,https://www.wikidata.org/wiki/Q56839716,2019,USMCA
3,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
4,parking lot,6501349,https://www.wikidata.org/wiki/Q6501349,2019,lot
...,...,...,...,...,...
109935,UPDATE,1076005,https://www.wikidata.org/wiki/Q1076005,2023,UPDATES
109936,INSANE,3153089,https://www.wikidata.org/wiki/Q3153089,2023,INSANE
109937,tax,8161,https://www.wikidata.org/wiki/Q8161,2023,TAXES
109938,Ontario,1904,https://www.wikidata.org/wiki/Q1904,2023,ON


In [42]:
# For some reason, any spans of just "President" (or similar)
# get tagged as Zhong Chenle, maybe because he has an alias "President".
# The following code fixes that to point to the correct Wikidata entry
# for the generic term "president".

zhong_chenle_president_aliases = {'PRESIDENT', 'President', 'Presidents'}
zhong_chenle_wikidata_id = 30945670
zhong_chenle_as_president_filter = (entities_df["Wikidata_id"] == zhong_chenle_wikidata_id) & (entities_df["Span_text"].isin(zhong_chenle_president_aliases))
president_wikidata_id = 30461

entities_df.loc[zhong_chenle_as_president_filter, "Entity"] = "president"
entities_df.loc[zhong_chenle_as_president_filter, "Wikidata_id"] = president_wikidata_id
entities_df.loc[zhong_chenle_as_president_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{president_wikidata_id}"
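
If more mislinked entities like this turn up, one option is to collect the fixes in a single lookup table rather than writing a new filter each time. The sketch below is a pandas-free illustration of that idea; `CORRECTIONS` and `apply_corrections` are hypothetical names, and only the two Wikidata ids and the three span texts come from the cell above.

```python
# Hypothetical correction table: (wrong Wikidata id, span text) -> (label, correct id).
CORRECTIONS = {
    (30945670, "President"): ("president", 30461),
    (30945670, "PRESIDENT"): ("president", 30461),
    (30945670, "Presidents"): ("president", 30461),
}

def apply_corrections(record, corrections=CORRECTIONS):
    """Return a corrected copy of one entity record, or the record unchanged."""
    key = (record["Wikidata_id"], record["Span_text"])
    if key not in corrections:
        return record
    label, qid = corrections[key]
    return {
        **record,
        "Entity": label,
        "Wikidata_id": qid,
        "Wikidata_url": f"https://www.wikidata.org/wiki/Q{qid}",
    }

fixed = apply_corrections({
    "Entity": "Zhong Chenle",
    "Wikidata_id": 30945670,
    "Wikidata_url": "https://www.wikidata.org/wiki/Q30945670",
    "Year": 2019,
    "Span_text": "President",
})
```

Keying on both the id and the span text mirrors the filter above: a span that really does refer to Zhong Chenle keeps its original link.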

In [43]:
entities_df.head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,Mexico,96,https://www.wikidata.org/wiki/Q96,2019,Mexico
1,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
2,United States–Mexico–Canada Agreement,56839716,https://www.wikidata.org/wiki/Q56839716,2019,USMCA
3,The Wall,27964590,https://www.wikidata.org/wiki/Q27964590,2019,Wall
4,parking lot,6501349,https://www.wikidata.org/wiki/Q6501349,2019,lot


## Filter dataframes by year and named entities
Currently, entityLinker catches all entities, not just proper nouns. To get around this, we first split the entities into one dataframe per year, then tag each entity label with spaCy's POS tagger. This lets us filter the dataframes further, keeping only entities tagged as proper nouns (`PROPN`).
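
The per-year filtering can be pictured with a small standard-library sketch. Everything in it is an assumption made for illustration: the `rows` data, the `proper_noun_counts` helper, and especially `PROPER_NOUNS`, a hard-coded set standing in for the POS tag that the notebook obtains from spaCy.

```python
from collections import Counter

# Hypothetical rows: (entity label, Wikidata id, year).
rows = [
    ("Donald Trump", 22686, 2019),
    ("Donald Trump", 22686, 2019),
    ("year", 577, 2019),
    ("Mexico", 96, 2019),
    ("tax", 8161, 2023),
]

# Stand-in for the POS check; the notebook gets this tag from spaCy instead.
PROPER_NOUNS = {"Donald Trump", "Mexico"}

def proper_noun_counts(rows, year, is_proper_noun):
    """Count entities for one year, keeping only those accepted by is_proper_noun."""
    counts = Counter((label, qid) for label, qid, y in rows if y == year)
    kept = {key: n for key, n in counts.items() if is_proper_noun(key[0])}
    # Sort by descending count so the most frequent entities come first.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)

result_2019 = proper_noun_counts(rows, 2019, PROPER_NOUNS.__contains__)
result_2023 = proper_noun_counts(rows, 2023, PROPER_NOUNS.__contains__)
```

Generic entries such as "year" or "tax" drop out, which is the effect the POS filter is meant to have on the real data.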

In [44]:
grouped_by_year = entities_df.groupby(by="Year")
entity_years_dfs = [grouped_by_year.get_group(group).copy() for group in grouped_by_year.groups]

In [27]:
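# helper function for counting entities in each year
def get_count(df: pd.DataFrame) -> pd.DataFrame:
    # number of mentions per entity label (adds a 'Count' column to df in place)
    df['Count'] = df.groupby(['Entity'])['Wikidata_id'].transform('count')
    sorted_df = df.sort_values(by=['Count', 'Entity', 'Wikidata_id'], ascending=False)
    # keep one row per Wikidata entry; after sorting, this is its first row
    unique_df = sorted_df.drop_duplicates(subset=["Wikidata_id"])

    return unique_df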

In [45]:
# from each dataframe, obtain the counts of entities, sort by count, then keep unique values
# dropping N/A values to account for error in entityLinker tagging
entity_counts_dfs = [get_count(df).dropna() for df in entity_years_dfs]

In [46]:
entity_counts_dfs[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count
125,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,86.0
176,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,85.0
43,human,5,https://www.wikidata.org/wiki/Q5,2019,person,80.0
67,year,577,https://www.wikidata.org/wiki/Q577,2019,year,68.0
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,55.0


In [47]:
tagger = spacy.load("en_core_web_md")

In [48]:
for df in entity_counts_dfs:
    # tag each entity label in isolation and keep the POS of its first token only
    df['POS'] = [doc[0].pos_ for doc in tagger.pipe(df['Entity'])]

In [49]:
proper_noun_entity_counts_df = [df[df["POS"] == "PROPN"].copy() for df in entity_counts_dfs]

In [50]:
proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS
125,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,86.0,PROPN
176,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,85.0,PROPN
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,55.0,PROPN
426,United States Congress,11268,https://www.wikidata.org/wiki/Q11268,2019,US Congress,38.0,PROPN
190,president,1255921,https://www.wikidata.org/wiki/Q1255921,2019,president,27.0,PROPN


In [51]:
for df in proper_noun_entity_counts_df:
    df["Proportion"] = df['Count'] / df['Count'].sum()

In [52]:
proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS,Proportion
125,United States of America,30,https://www.wikidata.org/wiki/Q30,2019,USA,86.0,PROPN,0.049397
176,Donald Trump,22686,https://www.wikidata.org/wiki/Q22686,2019,Donald J. Trump,85.0,PROPN,0.048823
60,Democratic Party,29552,https://www.wikidata.org/wiki/Q29552,2019,Democrats,55.0,PROPN,0.031591
426,United States Congress,11268,https://www.wikidata.org/wiki/Q11268,2019,US Congress,38.0,PROPN,0.021827
190,president,1255921,https://www.wikidata.org/wiki/Q1255921,2019,president,27.0,PROPN,0.015508


## Write results to Excel spreadsheet

In [None]:
!pip install xlsxwriter

In [53]:
# make sure the output folder exists before creating the workbook
output_dir = os.path.dirname(using_dataset.output_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

# the context manager saves and closes the workbook on exit
with pd.ExcelWriter(using_dataset.output_path, engine="xlsxwriter") as writer:
    # one sheet per year, named after the year
    for df in proper_noun_entity_counts_df:
        year = str(df["Year"].iloc[0])
        df.to_excel(writer, sheet_name=year, columns=['Entity', 'Wikidata_id', 'Wikidata_url', 'Count', 'Proportion'], index=False)