# Notebook for Named Entity Recognition

Using spaCy for named entity recognition, we want to create relative frequency tables for the entities by year. At this point, we are only interested in the entities that appear most frequently.

This notebook was written for the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" to create a more consistent path), and is also meant to run on data from MisInfoText; the `using_dataset` variable below selects which dataset is processed.

From the original data file, we use the following columns: `ID`, `combinedLabel`, `originalTextType`, `originalBodyText`, and `originalDateYear`.

We are processing text from the "originalBodyText" column.

In [None]:
!pip install "spacy~=3.0.6"

In [None]:
!python -m spacy download en_core_web_md

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
!pip install spacy-entity-linker==1.0.3

In [None]:
!python -m spacy_entity_linker "download_knowledge_base"

In [17]:
import os
from typing import Iterable
from itertools import chain
import spacy
from spacy.tokens.doc import Doc
from spacy_entity_linker.EntityElement import EntityElement
import pandas as pd

In [None]:
# Only run this code if you're loading from Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Loading the articles

In [101]:
from typing import Optional

class DatasetInfo():
    input_path: str
    output_path: str
    sheet_name: str
    usecols: Optional[list[str]]  # None means "read every column"

    def __init__(self, input_path: str, output_path: str, sheet_name: str, usecols: Optional[list[str]]):
        self.input_path = input_path
        self.output_path = output_path
        self.sheet_name = sheet_name
        self.usecols = usecols
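
Since `DatasetInfo` only stores four values, the same record could be written more compactly as a dataclass. The following is a minimal sketch, not the notebook's own code: the class name `DatasetInfoSketch` and the example paths are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetInfoSketch:
    """Hypothetical dataclass equivalent of DatasetInfo (illustrative name)."""
    input_path: str
    output_path: str
    sheet_name: str
    # None means "read every column", which is how pandas.read_excel treats usecols=None
    usecols: Optional[list[str]] = None


example_info = DatasetInfoSketch(
    input_path="./data/example/input.xlsx",    # placeholder path
    output_path="./data/example/output.xlsx",  # placeholder path
    sheet_name="Working",
)
print(example_info)
```

The generated `__init__` takes the same keyword arguments as the hand-written one, and `frozen=True` prevents a dataset description from being modified by accident later in the notebook.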

In [102]:
fakespeak_info = DatasetInfo(
    # input_path="/content/drive/My Drive/fake_news_over_time/Fakespeak_ENG_modified.xlsx",
    input_path="./data/Fakespeak-ENG/Fakespeak-ENG modified.xlsx",
    output_path="./data/Fakespeak-ENG/Analysis_output/Fakespeak_named_entities_frequency.xlsx",
    sheet_name="Working",
    usecols=['ID', 'combinedLabel', 'originalTextType', 'originalBodyText', 'originalDateYear']
)

misinfotext_info = DatasetInfo(
    input_path="./data/MisInfoText/PolitiFact_original_modified.xlsx",
    output_path="./data/MisInfoText/Analysis_output/MisInfoText_named_entities_frequency.xlsx",
    sheet_name="Working",
    usecols=None
)

In [103]:
using_dataset = misinfotext_info

In [None]:
dataset_df = pd.read_excel(
    using_dataset.input_path, 
    sheet_name=using_dataset.sheet_name, 
    usecols=using_dataset.usecols
)

In [32]:
dataset_df.head()

Unnamed: 0,factcheckURL,originalURL,originalBodyText,originalHeadline,originalTextType,originalDate,originalDateYear
0,http://www.politifact.com/arizona/statements/2...,https://associatedmediacoverage.com/three-stat...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016-05-06,2016
1,http://www.politifact.com/california/statement...,https://users.focalbeam.com/fs/distribution:wl...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016-07-12,2016
2,http://www.politifact.com/california/statement...,http://www.sacbee.com/opinion/op-ed/soapbox/ar...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017-08-04,2017
3,http://www.politifact.com/california/statement...,https://nocagastax.com/california-gas-tax-hike...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017-06-15,2017
4,http://www.politifact.com/california/statement...,https://chu.house.gov/media-center/press-relea...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017-05-04,2017


## Tagging named entities using spaCy

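To make up for the difficulty of consolidating similar named entities, the intent is to use spaCy's large web model (`en_core_web_lg`) for higher tagging accuracy in the initial NER step. Note that the cell below currently loads the medium model (`en_core_web_md`); change the model name there to use the large one.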

Documentation for entityLinker: https://github.com/egerber/spaCy-entity-linker
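
To illustrate why linking helps with consolidation: mentions with different surface forms resolve to the same Wikidata ID, so they can be counted together. The sketch below uses plain Python and hand-written pairs rather than real entityLinker output. The IDs 1124 (Bill Clinton) and 30 (United States of America) appear in the output tables of this notebook; the span texts "Bill Clinton" and "United States" are made up for the example.

```python
from collections import defaultdict

# Illustrative (span text, Wikidata ID) pairs, not actual linker output
linked_mentions = [
    ("Clinton", 1124),
    ("Bill Clinton", 1124),
    ("U.S.", 30),
    ("United States", 30),
    ("Clinton", 1124),
]

spans_by_id = defaultdict(set)    # every surface form seen for an ID
mentions_per_id = defaultdict(int)  # total mentions for an ID

for span_text, wikidata_id in linked_mentions:
    spans_by_id[wikidata_id].add(span_text)
    mentions_per_id[wikidata_id] += 1

for wikidata_id, total in mentions_per_id.items():
    print(wikidata_id, total, sorted(spans_by_id[wikidata_id]))
```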

In [33]:
# load spacy model
nlp = spacy.load("en_core_web_md")

# add custom entityLinker pipeline
entity_linker = nlp.add_pipe("entityLinker", last=True)

In [39]:
def get_entities_from_doc(doc: Doc) -> Iterable[EntityElement]:
    return doc._.linkedEntities

def get_entity_data(row: pd.Series):
    entities: Iterable[EntityElement] = row["entities"]
    return [{
        "Entity": entity.get_label(),
        "Wikidata_id": entity.get_id(),
        "Wikidata_url": entity.get_url(),
        "Year": row["originalDateYear"],
        "Span_text": entity.get_span().text,
    } for entity in entities]

In [40]:
dataset_df["doc"] = list(nlp.pipe(dataset_df['originalBodyText']))
dataset_df["entities"] = dataset_df["doc"].apply(get_entities_from_doc)

all_entities_data = list(chain.from_iterable(dataset_df.apply(get_entity_data, axis=1)))
entities_df = pd.DataFrame(all_entities_data)
entities_df

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,The Residents,947955,https://www.wikidata.org/wiki/Q947955,2016,Residents
1,state,7275,https://www.wikidata.org/wiki/Q7275,2016,states
2,pet,39201,https://www.wikidata.org/wiki/Q39201,2016,pet
3,humane society,1636604,https://www.wikidata.org/wiki/Q1636604,2016,Humane Society
4,compliance,633140,https://www.wikidata.org/wiki/Q633140,2016,compliance
...,...,...,...,...,...
61897,spouse,1196129,https://www.wikidata.org/wiki/Q1196129,2018,spouses
61898,firefighter,107711,https://www.wikidata.org/wiki/Q107711,2018,fire fighters
61899,Line of Duty,6553279,https://www.wikidata.org/wiki/Q6553279,2018,line of duty
61900,duty,878070,https://www.wikidata.org/wiki/Q878070,2018,duty


In [41]:
# For some reason, any spans of just "President" (or similar)
# get tagged as Zhong Chenle, maybe because he has an alias "President".
# The following code fixes that to point to the correct Wikidata entry
# for the generic term "president".

zhong_chenle_president_aliases = {'PRESIDENT', 'President', 'Presidents'}
zhong_chenle_wikidata_id = 30945670
zhong_chenle_as_president_filter = (entities_df["Wikidata_id"] == zhong_chenle_wikidata_id) & (entities_df["Span_text"].isin(zhong_chenle_president_aliases))
president_wikidata_id = 30461

entities_df.loc[zhong_chenle_as_president_filter, "Entity"] = "president"
entities_df.loc[zhong_chenle_as_president_filter, "Wikidata_id"] = president_wikidata_id
entities_df.loc[zhong_chenle_as_president_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{president_wikidata_id}"
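
If more mislinked entities turn up, the same mask-and-assign pattern could be wrapped in a small function. This is a sketch under assumptions: `relink` is a hypothetical helper, and `toy_entities` is a hand-made stand-in for `entities_df` with the same column names.

```python
import pandas as pd

# Toy stand-in for entities_df (illustrative rows only)
toy_entities = pd.DataFrame({
    "Entity": ["Zhong Chenle", "Zhong Chenle", "Iraq"],
    "Wikidata_id": [30945670, 30945670, 796],
    "Wikidata_url": [
        "https://www.wikidata.org/wiki/Q30945670",
        "https://www.wikidata.org/wiki/Q30945670",
        "https://www.wikidata.org/wiki/Q796",
    ],
    "Span_text": ["President", "Chenle", "Iraq"],
})


def relink(df, wrong_id, span_texts, new_label, new_id):
    """Hypothetical helper: re-point rows that have wrong_id AND a matching span text."""
    mask = (df["Wikidata_id"] == wrong_id) & df["Span_text"].isin(span_texts)
    df.loc[mask, "Entity"] = new_label
    df.loc[mask, "Wikidata_id"] = new_id
    df.loc[mask, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{new_id}"
    return int(mask.sum())  # number of rows that were corrected


changed = relink(
    toy_entities,
    wrong_id=30945670,
    span_texts=["PRESIDENT", "President", "Presidents"],
    new_label="president",
    new_id=30461,
)
print(changed)
print(toy_entities)
```

Requiring both the ID and the span text to match leaves genuine mentions of the person (the "Chenle" row here) untouched.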

In [42]:
entities_df.head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text
0,The Residents,947955,https://www.wikidata.org/wiki/Q947955,2016,Residents
1,state,7275,https://www.wikidata.org/wiki/Q7275,2016,states
2,pet,39201,https://www.wikidata.org/wiki/Q39201,2016,pet
3,humane society,1636604,https://www.wikidata.org/wiki/Q1636604,2016,Humane Society
4,compliance,633140,https://www.wikidata.org/wiki/Q633140,2016,compliance


## Filter dataframes by year and named entities
Currently, entityLinker links all kinds of entities, not just proper nouns. To work around this, we first split the entities into one dataframe per year, then tag each entity label with its part of speech (POS) using spaCy. This lets us filter the dataframes further, keeping only entities tagged as proper nouns (`PROPN`) and excluding common nouns.

In [51]:
grouped_by_year = entities_df.groupby(by="Year")
entity_years_dfs = [grouped_by_year.get_group(group).copy() for group in grouped_by_year.groups]
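
As an alternative to a positional list, iterating over the groupby object yields `(year, sub-frame)` pairs, so the year can stay attached to each frame as a dictionary key. A minimal sketch on made-up rows (`toy_entities` and `frames_by_year` are illustrative names):

```python
import pandas as pd

toy_entities = pd.DataFrame({
    "Entity": ["Iraq", "Bill Clinton", "state", "pet"],
    "Year": [2007, 2007, 2016, 2016],
})

# Iterating a groupby yields (key, sub-frame) pairs, sorted by key by default
frames_by_year = {
    int(year): group.copy() for year, group in toy_entities.groupby("Year")
}

for year, frame in frames_by_year.items():
    print(year, list(frame["Entity"]))
```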

In [44]:
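# helper function for counting entities in each year
def get_count(df: pd.DataFrame):
    # number of mentions per entity label, repeated on every row of that entity
    df['Count'] = df.groupby(['Entity'])['Wikidata_id'].transform('count')
    sorted_df = df.sort_values(by=['Count', 'Entity', 'Wikidata_id'], ascending=False)
    # drops rows that are identical in every column, so an entity can still
    # appear once per distinct Span_text
    unique_df = sorted_df.drop_duplicates()

    return unique_df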
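
One behaviour of this helper is worth seeing on a small example: `drop_duplicates()` with no arguments compares whole rows, so an entity mentioned under two different span texts keeps two rows, each carrying the full count. Whether that is intended is not stated in this notebook. The sketch below uses made-up rows and shows, as an assumption about what may be wanted, de-duplication on the entity columns only.

```python
import pandas as pd

toy_year = pd.DataFrame({
    "Entity": ["Bill Clinton", "Bill Clinton", "Bill Clinton", "Iraq"],
    "Wikidata_id": [1124, 1124, 1124, 796],
    "Span_text": ["Clinton", "Clinton", "Bill Clinton", "Iraq"],
})
toy_year["Count"] = toy_year.groupby("Entity")["Wikidata_id"].transform("count")

# Whole-row de-duplication: one row per distinct (Entity, Span_text) pair
per_span = toy_year.drop_duplicates()

# Assumed alternative: one row per entity, regardless of span text
per_entity = toy_year.drop_duplicates(subset=["Entity", "Wikidata_id"])

print(len(per_span), per_span["Count"].sum())
print(len(per_entity), per_entity["Count"].sum())
```

With the per-entity variant, the counts sum to the number of mentions (4 here), which matters if the counts are later divided by their sum to obtain proportions.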

In [None]:
# from each dataframe, obtain the counts of entities, sort by count, then keep unique values
# dropping N/A values to account for error in entityLinker tagging
entity_counts_dfs = [get_count(df).dropna() for df in entity_years_dfs]

In [None]:
entity_counts_dfs[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS
42334,withdrawal,1760704,https://www.wikidata.org/wiki/Q1760704,2007,withdrawal,2,NOUN
42327,troop,1080137,https://www.wikidata.org/wiki/Q1080137,2007,troops,2,NOUN
42326,jerk,497332,https://www.wikidata.org/wiki/Q497332,2007,surge,2,NOUN
42328,Iraq,796,https://www.wikidata.org/wiki/Q796,2007,Iraq,2,PROPN
42322,Bill Clinton,1124,https://www.wikidata.org/wiki/Q1124,2007,Clinton,2,PROPN


In [55]:
tagger = spacy.load("en_core_web_md")

In [None]:
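# Tag each entity label and keep the POS of its first token only,
# so a multi-word label is classified by its first word.
for df in entity_counts_dfs:
    df['POS'] = [doc[0].pos_ for doc in tagger.pipe(df['Entity'])]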
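
Because only `doc[0]` is inspected, the outcome for multi-word labels depends entirely on the first word. The sketch below compares that rule with a looser "any token is a proper noun" rule. The per-token tags are hand-written for illustration and are not spaCy output; both function names are hypothetical.

```python
def first_token_is_propn(tags):
    """Rule used above: look at the first token's tag only."""
    return bool(tags) and tags[0] == "PROPN"


def any_token_is_propn(tags):
    """Looser alternative: keep the label if any token is a proper noun."""
    return "PROPN" in tags


# Hand-written, illustrative tags (assumption: not produced by a real tagger)
illustrative_tags = {
    "Iraq": ["PROPN"],
    "The Residents": ["DET", "PROPN"],
    "state": ["NOUN"],
}

kept_first = [label for label, tags in illustrative_tags.items() if first_token_is_propn(tags)]
kept_any = [label for label, tags in illustrative_tags.items() if any_token_is_propn(tags)]

print(kept_first)
print(kept_any)
```

Under these assumed tags, a label that starts with a determiner is dropped by the first-token rule but kept by the looser one.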

In [90]:
proper_noun_entity_counts_df = [df[df["POS"] == "PROPN"].copy() for df in entity_counts_dfs]

In [91]:
proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS
42328,Iraq,796,https://www.wikidata.org/wiki/Q796,2007,Iraq,2,PROPN
42322,Bill Clinton,1124,https://www.wikidata.org/wiki/Q1124,2007,Clinton,2,PROPN
42335,United States of America,30,https://www.wikidata.org/wiki/Q30,2007,U.S.,1,PROPN
42321,Monday,105,https://www.wikidata.org/wiki/Q105,2007,Monday,1,PROPN
42352,Grave Mistake,5597792,https://www.wikidata.org/wiki/Q5597792,2007,grave mistake,1,PROPN


In [92]:
# helper function to calculate relative frequency as a proportion between 0 and 1
def get_prop(df):
    df['Proportion'] = df['Count'] / df['Count'].sum()

    return df

In [None]:
# get_prop adds the Proportion column to each dataframe in place
for df in proper_noun_entity_counts_df:
    get_prop(df)

In [94]:
proper_noun_entity_counts_df[0].head()

Unnamed: 0,Entity,Wikidata_id,Wikidata_url,Year,Span_text,Count,POS,Proportion
42328,Iraq,796,https://www.wikidata.org/wiki/Q796,2007,Iraq,2,PROPN,0.285714
42322,Bill Clinton,1124,https://www.wikidata.org/wiki/Q1124,2007,Clinton,2,PROPN,0.285714
42335,United States of America,30,https://www.wikidata.org/wiki/Q30,2007,U.S.,1,PROPN,0.142857
42321,Monday,105,https://www.wikidata.org/wiki/Q105,2007,Monday,1,PROPN,0.142857
42352,Grave Mistake,5597792,https://www.wikidata.org/wiki/Q5597792,2007,grave mistake,1,PROPN,0.142857
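
The proportions in the table above can be checked by hand: the five counts shown sum to 7, and 2/7 ≈ 0.285714 and 1/7 ≈ 0.142857 match the displayed values, which suggests these five rows make up the whole 2007 frame. A small plain-Python check using the counts from the table:

```python
# Counts copied from the 2007 table above
counts = {
    "Iraq": 2,
    "Bill Clinton": 2,
    "United States of America": 1,
    "Monday": 1,
    "Grave Mistake": 1,
}

total = sum(counts.values())
proportions = {entity: count / total for entity, count in counts.items()}

for entity, proportion in proportions.items():
    print(f"{entity}: {proportion:.6f}")
```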


## Write results to Excel spreadsheet

In [None]:
!pip install xlsxwriter

In [None]:
# write one sheet per year; the context manager saves and closes the workbook on exit
with pd.ExcelWriter(using_dataset.output_path, engine="xlsxwriter") as writer:
    for df in proper_noun_entity_counts_df:
        year = str(df["Year"].iloc[0])
        df.to_excel(writer, sheet_name=year, columns=['Entity', 'Wikidata_id', 'Wikidata_url', 'Count', 'Proportion'], index=False)
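
Two details can trip up this step: the writer does not create a missing output folder, and a year stored as a float would produce a sheet named like "2016.0". The sketch below shows one way to guard against both. `year_to_sheet_name` and `ensure_parent_dir` are hypothetical helpers, and the file name is a placeholder written to a temporary directory.

```python
import os
import tempfile


def year_to_sheet_name(year) -> str:
    """Hypothetical helper: turn 2016, 2016.0 or "2016" into the sheet name "2016"."""
    return str(int(float(year)))[:31]  # Excel limits sheet names to 31 characters


def ensure_parent_dir(path: str) -> str:
    """Hypothetical helper: create the folder that will hold `path` if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder output path inside a throwaway directory
    target = os.path.join(tmp, "Analysis_output", "named_entities_frequency.xlsx")
    created = ensure_parent_dir(target)
    parent_exists = os.path.isdir(created)

print(year_to_sheet_name(2016.0), parent_exists)
```

In the notebook, `ensure_parent_dir(using_dataset.output_path)` could be called once before the writer is opened.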

In [None]:
# scratch cell: run a test sentence containing "President" through the pipeline
tnlp = nlp("John Doe would be disastrous as President!")
tnlp