# Notebook for Named Entity Recognition

Using spaCy for named entity recognition, we want to create relative frequency tables for the entities by year. At this point, we are only interested in the entities that appear most frequently.

This notebook was written for the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" to create a more consistent path). It can also be run on data from MisInfoText by switching the `using_dataset` variable below.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [None]:
!pip install "spacy~=3.0.6"

In [None]:
!python -m spacy download en_core_web_md

In [None]:
!python -m spacy download en_core_web_lg

In [None]:
!pip install spacy-entity-linker==1.0.3

In [None]:
!python -m spacy_entity_linker "download_knowledge_base"

In [None]:
# Only run this code if you're loading from Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [1]:
import spacy
import pandas as pd
from dataset_config import BASE_FAKESPEAK_CONFIG, BASE_MISINFOTEXT_CONFIG
from helpers import get_groups, make_output_path, make_output_path_for_type

## Loading the articles

In [2]:
fakespeak_config = BASE_FAKESPEAK_CONFIG | {
    "headline_col": "originalHeadline",
    "usecols": BASE_FAKESPEAK_CONFIG["usecols"] + ["originalHeadline"]
}

misinfotext_config = BASE_MISINFOTEXT_CONFIG | {
    "headline_col": "originalHeadline",
}
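The configs above are built with the dict union operator (`|`, available from Python 3.9), which returns a new dict and lets keys on the right override keys on the left. A minimal sketch of that pattern follows; the contents of `base_config` are made up for illustration and are not the real `BASE_FAKESPEAK_CONFIG` from `dataset_config.py`.

```python
# Hypothetical base config, for illustration only.
base_config = {
    "text_col": "originalBodyText",
    "usecols": ["ID", "originalBodyText", "originalDateYear"],
}

# `|` builds a new dict: keys on the right win, and the base dict is left untouched.
derived_config = base_config | {
    "headline_col": "originalHeadline",
    # List concatenation also creates a new list rather than mutating the base one.
    "usecols": base_config["usecols"] + ["originalHeadline"],
}

print(derived_config["usecols"])
print("headline_col" in base_config)
```

Because neither the dict nor the `usecols` list is mutated, the base config can be reused safely for other notebooks.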

In [3]:
using_dataset = misinfotext_config

In [4]:
dataset_df = pd.read_excel(
    using_dataset["input_path"], 
    sheet_name=using_dataset["sheet_name"], 
    usecols=using_dataset["usecols"]
)
dataset_df.head()

Unnamed: 0,factcheckURL,originalURL,originalBodyText,originalHeadline,originalTextType,originalDate,originalDateYear
0,http://www.politifact.com/arizona/statements/2...,https://associatedmediacoverage.com/three-stat...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016-05-06,2016
1,http://www.politifact.com/california/statement...,https://users.focalbeam.com/fs/distribution:wl...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016-07-12,2016
2,http://www.politifact.com/california/statement...,http://www.sacbee.com/opinion/op-ed/soapbox/ar...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017-08-04,2017
3,http://www.politifact.com/california/statement...,https://nocagastax.com/california-gas-tax-hike...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017-06-15,2017
4,http://www.politifact.com/california/statement...,https://chu.house.gov/media-center/press-relea...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017-05-04,2017


## Tagging named entities using spaCy

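Consolidating similar named entities is difficult, so tagging accuracy in the initial NER step matters, and a larger spaCy web model helps with that. Note that the cell below loads `en_core_web_md`; `en_core_web_lg`, which is also downloaded above, can be substituted in `spacy.load`.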

Documentation for entityLinker: https://github.com/egerber/spaCy-entity-linker

In [5]:
# load spacy model
nlp = spacy.load("en_core_web_md")

# add custom entityLinker pipeline
entity_linker = nlp.add_pipe("entityLinker", last=True)



In [6]:
dataset_df["text_doc"] = list(nlp.pipe(dataset_df[using_dataset["text_col"]]))
dataset_df["headline_doc"] = list(nlp.pipe(dataset_df[using_dataset["headline_col"]].fillna("")))
dataset_df.head()

Unnamed: 0,factcheckURL,originalURL,originalBodyText,originalHeadline,originalTextType,originalDate,originalDateYear,text_doc,headline_doc
0,http://www.politifact.com/arizona/statements/2...,https://associatedmediacoverage.com/three-stat...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016-05-06,2016,"(Residents, of, multiple, states, will, be, as...","(Multiple, States, Have, Agreed, To, Implement..."
1,http://www.politifact.com/california/statement...,https://users.focalbeam.com/fs/distribution:wl...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016-07-12,2016,"(Sacramento, ,, CA, -, United, States, Senator...","(U.S., Senator, Dianne, Feinstein, Opposes, Pr..."
2,http://www.politifact.com/california/statement...,http://www.sacbee.com/opinion/op-ed/soapbox/ar...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017-08-04,2017,"(We, should, anticipate, black, and, gray, mar...","(Why, you, should, buy, a, locking, gasoline, ..."
3,http://www.politifact.com/california/statement...,https://nocagastax.com/california-gas-tax-hike...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017-06-15,2017,"(As, a, ballot, initiative, calling, for, repe...","(California, Gas, -, Tax, -, Hike, Repeal, Cam..."
4,http://www.politifact.com/california/statement...,https://chu.house.gov/media-center/press-relea...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017-05-04,2017,"(WASHINGTON, ,, DC, , The, House, of, Represe...","(Rep., Chu, Decries, "", Heartless, "", ACA, Rep..."


In [7]:
# For some reason, any spans of just "President" (or similar)
# get tagged as Zhong Chenle, maybe because he has an alias "President".
# The following code fixes that to point to the correct Wikidata entry
# for the generic term "president".

zhong_chenle_president_aliases = {'PRESIDENT', 'President', 'Presidents'}
zhong_chenle_wikidata_id = 30945670
president_wikidata_id = 30461

def clean_incorrect_president_entity(df: pd.DataFrame):
    zhong_chenle_as_president_filter = (df["Wikidata_id"] == zhong_chenle_wikidata_id) & (df["Span_text"].isin(zhong_chenle_president_aliases))
    df.loc[zhong_chenle_as_president_filter, "Entity"] = "president"
    df.loc[zhong_chenle_as_president_filter, "Wikidata_id"] = president_wikidata_id
    df.loc[zhong_chenle_as_president_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{president_wikidata_id}"

In [8]:
# A similar thing is happening where the state of Texas
# is sometimes confused for a musical play named "Texas". 

texas_musical_wikidata_id = 7707415
texas_state_wikidata_id = 1439

def clean_incorrect_texas_entity(df: pd.DataFrame):
    texas_musical_filter = df["Wikidata_id"] == texas_musical_wikidata_id
    df.loc[texas_musical_filter, "Wikidata_id"] = texas_state_wikidata_id
    df.loc[texas_musical_filter, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{texas_state_wikidata_id}"
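If more mislinked entities turn up, the two one-off fixes above could be folded into a single lookup table. The sketch below is one possible shape, not the notebook's method: `WIKIDATA_OVERRIDES` and `apply_wikidata_overrides` are hypothetical names, the replacement label "Texas" is an assumption, and it only covers unconditional remaps (the "president" fix also checks the span text, so it would need an extra condition).

```python
import pandas as pd

# Hypothetical override table: wrong Wikidata ID -> (correct ID, label to use).
# The IDs are the ones from the Texas fix above; the label is an assumption.
WIKIDATA_OVERRIDES = {
    7707415: (1439, "Texas"),
}

def apply_wikidata_overrides(df: pd.DataFrame, overrides=WIKIDATA_OVERRIDES) -> pd.DataFrame:
    """Return a copy of df with known-bad Wikidata IDs remapped."""
    fixed = df.copy()
    for wrong_id, (right_id, label) in overrides.items():
        mask = fixed["Wikidata_id"] == wrong_id
        fixed.loc[mask, "Entity"] = label
        fixed.loc[mask, "Wikidata_id"] = right_id
        fixed.loc[mask, "Wikidata_url"] = f"https://www.wikidata.org/wiki/Q{right_id}"
    return fixed

# Toy data standing in for entity_details_df.
toy = pd.DataFrame({
    "Entity": ["Texas", "Iraq"],
    "Wikidata_id": [7707415, 796],
    "Wikidata_url": [
        "https://www.wikidata.org/wiki/Q7707415",
        "https://www.wikidata.org/wiki/Q796",
    ],
})
fixed = apply_wikidata_overrides(toy)
print(fixed)
```

Unlike the functions above, this version returns a copy instead of modifying the dataframe in place, which is a design choice rather than a requirement.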

In [9]:
def get_entity_details_df(df: pd.DataFrame, doc_col: str):
    copied_df = df.copy()
    copied_df["entity"] = copied_df[doc_col].apply(lambda doc: doc._.linkedEntities.entities)

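    # Drop only rows without an entity (docs with no linked entities explode to NaN).
    # A bare dropna() would also drop rows that have a NaN in any other column.
    entity_df = copied_df.explode("entity").dropna(subset=["entity"])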

    # TODO: extract entity type
    entity_details_df = pd.DataFrame(
        data={
            "year": entity_df[using_dataset["year_col"]],
            "type": entity_df[using_dataset["type_col"]],
            "Entity": entity_df["entity"].apply(lambda ent: ent.get_label()),
            "Wikidata_id": entity_df["entity"].apply(lambda ent: ent.get_id()),
            "Wikidata_url": entity_df["entity"].apply(lambda ent: ent.get_url()),
            "Span": entity_df["entity"].apply(lambda ent: ent.get_span()),
            "Span_text": entity_df["entity"].apply(lambda ent: ent.get_span().text)
        }
    )

    clean_incorrect_president_entity(entity_details_df)
    clean_incorrect_texas_entity(entity_details_df)

    # If the entity label is missing, fill it in with the span text.
    # This is rare, but sometimes happens
    entity_details_df["Entity"] = entity_details_df["Entity"].fillna(entity_details_df["Span_text"])

    return entity_details_df
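The function above relies on `DataFrame.explode` to turn one row per article into one row per entity. The toy example below, with made-up column values, shows how `explode` behaves with an empty list and how the choice of `dropna` arguments changes which rows survive. The `note` column is hypothetical and stands in for any unrelated column that contains missing values.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2016, 2017, 2017],
    "note": ["a", None, "c"],               # unrelated column with a missing value
    "entity": [["Iraq", "Texas"], ["Iraq"], []],
})

# One row per list element; an empty list becomes a single row with NaN.
exploded = toy.explode("entity")

# Drops only the row whose entity is missing.
missing_entity_dropped = exploded.dropna(subset=["entity"])

# Drops every row with a missing value in ANY column, including "note".
any_missing_dropped = exploded.dropna()

print(len(exploded), len(missing_entity_dropped), len(any_missing_dropped))
```

Note that the exploded rows keep the index of the article they came from, which is why the index values repeat in the outputs below.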

In [10]:
text_entity_details_df = get_entity_details_df(dataset_df, "text_doc")
text_entity_details_df.head()

Unnamed: 0,year,type,Entity,Wikidata_id,Wikidata_url,Span,Span_text
0,2016,News and blog,The Residents,947955,https://www.wikidata.org/wiki/Q947955,(Residents),Residents
0,2016,News and blog,state,7275,https://www.wikidata.org/wiki/Q7275,(states),states
0,2016,News and blog,pet,39201,https://www.wikidata.org/wiki/Q39201,(pet),pet
0,2016,News and blog,humane society,1636604,https://www.wikidata.org/wiki/Q1636604,"(Humane, Society)",Humane Society
0,2016,News and blog,compliance,633140,https://www.wikidata.org/wiki/Q633140,(compliance),compliance


In [11]:
headline_entity_details_df = get_entity_details_df(dataset_df, "headline_doc")
headline_entity_details_df.head()

Unnamed: 0,year,type,Entity,Wikidata_id,Wikidata_url,Span,Span_text
0,2016,News and blog,State,16928008,https://www.wikidata.org/wiki/Q16928008,(States),States
0,2016,News and blog,tool,39546,https://www.wikidata.org/wiki/Q39546,(Implement),Implement
0,2016,News and blog,Ordinance,25339629,https://www.wikidata.org/wiki/Q25339629,(Ordinance),Ordinance
0,2016,News and blog,Pet,22905746,https://www.wikidata.org/wiki/Q22905746,(Pet),Pet
1,2016,Press release,theatrical property,942297,https://www.wikidata.org/wiki/Q942297,(Prop),Prop


## Group dataframes by year and count named entities
Currently, entityLinker links every span it can match, not just proper nouns. To work around this, we first split the dataframe by year, then tag each entity label with spaCy's part-of-speech tagger, and finally keep only the entities tagged as proper nouns (`PROPN`), which filters out common nouns.

In [12]:
def get_count(df: pd.DataFrame):
  copied_df = df.copy()
  copied_df['Count'] = copied_df.groupby(['Wikidata_id'])['Wikidata_id'].transform('count')
  sorted_df = copied_df.sort_values(by=['Count'], ascending=False)
  unique_df = sorted_df.drop_duplicates(subset=["Wikidata_id"])

  return unique_df
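To make the counting step concrete, here is the same `groupby` / `transform('count')` / `drop_duplicates` pattern applied to a small made-up table. The `value_counts` line is an alternative way to get the same numbers and is included only as a cross-check.

```python
import pandas as pd

# Toy mentions: one row per linked entity occurrence.
toy = pd.DataFrame({
    "Entity": ["Iraq", "Bill Clinton", "Iraq", "Iraq", "Bill Clinton", "Monday"],
    "Wikidata_id": [796, 1124, 796, 796, 1124, 105],
})

# Attach to each row the number of rows sharing its Wikidata ID.
toy["Count"] = toy.groupby("Wikidata_id")["Wikidata_id"].transform("count")

# Sort by count and keep a single row per entity.
counts = (
    toy.sort_values(by="Count", ascending=False)
       .drop_duplicates(subset=["Wikidata_id"])
)

# Cross-check: value_counts yields the same per-ID totals.
alt = toy["Wikidata_id"].value_counts()

print(counts)
```

Keeping one full row per entity (rather than just the totals) preserves the label, URL, and an example span for the spreadsheet.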

In [13]:
tagger = spacy.load("en_core_web_md")

Get the entity counts for each year and keep only proper nouns, which removes common everyday words.

In [14]:
def get_years_dfs(df: pd.DataFrame):
    years, years_dfs = get_groups(df, "year")

    year_counts_dfs = [get_count(df) for df in years_dfs]

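    # Tag each entity label and keep the POS of its first token only.
    # An empty label yields an empty Doc, so guard before indexing into it.
    for year_df in year_counts_dfs:
        year_df['POS'] = [doc[0].pos_ if len(doc) else "" for doc in tagger.pipe(year_df['Entity'])]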
    
    propn_year_counts_dfs = [df[df["POS"] == "PROPN"] for df in year_counts_dfs]

    return years, propn_year_counts_dfs
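The introduction mentions relative frequency tables, while the cells above produce raw counts per year. One way to derive relative frequencies from a per-year count table is sketched below. The function name, the `Relative_frequency` column, and the choice of denominator (the total of the counts in that year's table) are all assumptions; dividing by the number of articles per year would be an equally reasonable definition.

```python
import pandas as pd

def add_relative_frequency(counts_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each entity's share of the table's total count."""
    out = counts_df.copy()
    total = out["Count"].sum()
    # Avoid dividing by zero for an empty year.
    out["Relative_frequency"] = out["Count"] / total if total > 0 else 0.0
    return out

# Toy per-year table in the shape produced by the counting step.
toy_year = pd.DataFrame({
    "Entity": ["Iraq", "Bill Clinton", "Monday"],
    "Count": [2, 2, 1],
})
with_rel = add_relative_frequency(toy_year)
print(with_rel)
```

Applied to the real data, this would be mapped over each dataframe in `years_text_dfs` before writing the spreadsheets.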

In [15]:
years_text, years_text_dfs = get_years_dfs(text_entity_details_df)
years_text_dfs[0].head()

Unnamed: 0,year,type,Entity,Wikidata_id,Wikidata_url,Span,Span_text,Count,POS
428,2007,Press release,withdrawal,1760704,https://www.wikidata.org/wiki/Q1760704,(withdrawal),withdrawal,2,PROPN
428,2007,Press release,Bill Clinton,1124,https://www.wikidata.org/wiki/Q1124,(Clinton),Clinton,2,PROPN
428,2007,Press release,Iraq,796,https://www.wikidata.org/wiki/Q796,(Iraq),Iraq,2,PROPN
428,2007,Press release,Monday,105,https://www.wikidata.org/wiki/Q105,(Monday),Monday,1,PROPN
428,2007,Press release,Chance the Rapper,12470060,https://www.wikidata.org/wiki/Q12470060,(chance),chance,1,PROPN


In [16]:
years_headline, years_headline_dfs = get_years_dfs(headline_entity_details_df)
years_headline_dfs[0].head()

Unnamed: 0,year,type,Entity,Wikidata_id,Wikidata_url,Span,Span_text,Count,POS
428,2007,Press release,John McCain,10390,https://www.wikidata.org/wiki/Q10390,"(John, McCain)",John McCain,1,PROPN
428,2007,Press release,Hillary Clinton,6294,https://www.wikidata.org/wiki/Q6294,"(Hillary, Clinton)",Hillary Clinton,1,PROPN


## Write results to Excel spreadsheet

In [17]:
def save_entity_counts_for_years(years: list[int], dfs: list[pd.DataFrame], output_path: str):
    # One sheet per year; the context manager saves and closes the workbook on exit.
    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        for year, df in zip(years, dfs):
            df.to_excel(
                writer,
                sheet_name=str(year),
                index=False,
                columns=["Entity", "Wikidata_id", "Wikidata_url", "Span_text", "Count"]
            )

In [18]:
save_entity_counts_for_years(
    years=years_text, 
    dfs=years_text_dfs, 
    output_path=make_output_path(using_dataset, "named_entities_frequency")
)

save_entity_counts_for_years(
    years=years_headline, 
    dfs=years_headline_dfs, 
    output_path=make_output_path(using_dataset, "named_entities_frequency_headlines")
)

In [19]:
def save_entity_counts_for_types(entity_details_df: pd.DataFrame, suffix: str = ""):
    types, types_dfs = get_groups(entity_details_df, "type")

    # `text_type` avoids shadowing the built-in `type`; the local names
    # `years` and `years_dfs` avoid shadowing the notebook-level variables.
    for text_type, type_df in zip(types, types_dfs):
        years, years_dfs = get_years_dfs(type_df)

        save_entity_counts_for_years(
            years=years, 
            dfs=years_dfs, 
            output_path=make_output_path_for_type(using_dataset, text_type, f"named_entities_frequency{suffix}")
        )

In [20]:
save_entity_counts_for_types(text_entity_details_df)
save_entity_counts_for_types(headline_entity_details_df, "_headlines")