# Notebook for Named Entity Recognition

Using spaCy for named entity recognition, we want to create relative frequency tables for the entities by year. At this point, we are only interested in the entities that appear most frequently.

This notebook currently processes the "Fakespeak-ENG modified.xlsx" file (I've renamed my copy to "Fakespeak_ENG_modified.xlsx" so the path is more consistent), but it will eventually be run on data from MisInfoText as well.

From the original data file, we use the following columns: ID, combinedLabel, originalTextType, originalBodyText, originalDateYear

We are processing text from the "originalBodyText" column.

In [1]:
import spacy
import pandas as pd
from spacy.tokens.span import Span
from spacy.tokens.doc import Doc
from spacy_entity_linker.EntityElement import EntityElement
from helpers import load_data, get_groups, load_stop_word_list, is_all_stop_words

## Loading the articles

In [2]:
dataset_df = load_data()
dataset_df.head()

Unnamed: 0,id,text,headline,text_type,year
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017


## Tagging named entities using spaCy

To offset the difficulty of consolidating similar named entities later on, we use spaCy's large English web model (`en_core_web_lg`) for higher tagging accuracy in the initial NER step.

Documentation for entityLinker: https://github.com/egerber/spaCy-entity-linker

In [3]:
# load spacy model
nlp = spacy.load("en_core_web_lg")

# add custom entityLinker pipeline
entity_linker = nlp.add_pipe("entityLinker", last=True)



In [4]:
dataset_df["text_doc"] = list(nlp.pipe(dataset_df["text"]))
dataset_df["headline_doc"] = list(nlp.pipe(dataset_df["headline"].fillna("")))
dataset_df.head()

Unnamed: 0,id,text,headline,text_type,year,text_doc,headline_doc
0,http://www.politifact.com/arizona/statements/2...,Residents of multiple states will be asked to ...,Multiple States Have Agreed To Implement A ‘Tw...,News and blog,2016,"(Residents, of, multiple, states, will, be, as...","(Multiple, States, Have, Agreed, To, Implement..."
1,http://www.politifact.com/california/statement...,"Sacramento, CA - United States Senator Dianne ...",U.S. Senator Dianne Feinstein Opposes Prop. 64...,Press release,2016,"(Sacramento, ,, CA, -, United, States, Senator...","(U.S., Senator, Dianne, Feinstein, Opposes, Pr..."
2,http://www.politifact.com/california/statement...,We should anticipate black and gray markets in...,Why you should buy a locking gasoline cap,News and blog,2017,"(We, should, anticipate, black, and, gray, mar...","(Why, you, should, buy, a, locking, gasoline, ..."
3,http://www.politifact.com/california/statement...,As a ballot initiative calling for repeal of a...,California Gas-Tax-Hike Repeal Campaign Heats Up,News and blog,2017,"(As, a, ballot, initiative, calling, for, repe...","(California, Gas, -, Tax, -, Hike, Repeal, Cam..."
4,http://www.politifact.com/california/statement...,"WASHINGTON, DC The House of Representatives t...","Rep. Chu Decries ""Heartless"" ACA Repeal Vote",Press release,2017,"(WASHINGTON, ,, DC, , The, House, of, Represe...","(Rep., Chu, Decries, "", Heartless, "", ACA, Rep..."


In [5]:
# For some reason, any spans of just "President" (or similar)
# get tagged as Zhong Chenle, maybe because he has an alias "President".
# The following code fixes that to point to the correct Wikidata entry
# for the generic term "president".

zhong_chenle_president_aliases = {"PRESIDENT", "President", "Presidents"}
zhong_chenle_wikidata_id = 30945670
president_wikidata_id = 30461

def clean_incorrect_president_entity(df: pd.DataFrame):
    zhong_chenle_as_president_filter = (df["wikidata_id"] == zhong_chenle_wikidata_id) & (df["span_text"].isin(zhong_chenle_president_aliases))
    df.loc[zhong_chenle_as_president_filter, "entity"] = "president"
    df.loc[zhong_chenle_as_president_filter, "wikidata_id"] = president_wikidata_id
    df.loc[zhong_chenle_as_president_filter, "wikidata_url"] = f"https://www.wikidata.org/wiki/Q{president_wikidata_id}"

In [6]:
# A similar thing is happening where the state of Texas
# is sometimes confused for a musical play named "Texas". 

texas_musical_wikidata_id = 7707415
texas_state_wikidata_id = 1439

def clean_incorrect_texas_entity(df: pd.DataFrame):
    texas_musical_filter = df["wikidata_id"] == texas_musical_wikidata_id
    df.loc[texas_musical_filter, "wikidata_id"] = texas_state_wikidata_id
    df.loc[texas_musical_filter, "wikidata_url"] = f"https://www.wikidata.org/wiki/Q{texas_state_wikidata_id}"
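
The two fixes above follow the same pattern, so further mislinked entities could be handled from a single lookup table. The following is only a minimal sketch, not part of the notebook's pipeline: `WIKIDATA_CORRECTIONS` and `apply_wikidata_corrections` are hypothetical names, and the toy DataFrame merely mimics the `entity`, `wikidata_id`, `wikidata_url` and `span_text` columns that are built later in the notebook.

```python
import pandas as pd

# Hypothetical correction table: wrong Wikidata ID -> how to repair it.
# "spans" limits the fix to certain span texts (None = apply to every span);
# "label" overrides the entity label (None = keep the existing label).
WIKIDATA_CORRECTIONS = {
    30945670: {"new_id": 30461, "label": "president",
               "spans": {"PRESIDENT", "President", "Presidents"}},
    7707415: {"new_id": 1439, "label": None, "spans": None},
}


def apply_wikidata_corrections(df: pd.DataFrame, corrections: dict) -> pd.DataFrame:
    """Return a copy of df with known mislinked entities re-pointed."""
    fixed = df.copy()
    for wrong_id, rule in corrections.items():
        mask = fixed["wikidata_id"] == wrong_id
        if rule["spans"] is not None:
            mask &= fixed["span_text"].isin(rule["spans"])
        if rule["label"] is not None:
            fixed.loc[mask, "entity"] = rule["label"]
        fixed.loc[mask, "wikidata_id"] = rule["new_id"]
        fixed.loc[mask, "wikidata_url"] = f"https://www.wikidata.org/wiki/Q{rule['new_id']}"
    return fixed


# Toy rows for illustration only.
toy_entities_df = pd.DataFrame({
    "entity": ["Zhong Chenle", "Texas", "Arizona"],
    "wikidata_id": [30945670, 7707415, 816],
    "wikidata_url": [
        "https://www.wikidata.org/wiki/Q30945670",
        "https://www.wikidata.org/wiki/Q7707415",
        "https://www.wikidata.org/wiki/Q816",
    ],
    "span_text": ["President", "Texas", "Arizona"],
})
corrected_df = apply_wikidata_corrections(toy_entities_df, WIKIDATA_CORRECTIONS)
```

Unlike the in-place functions above, this sketch returns a corrected copy, which makes it easier to compare the data before and after a new rule is added.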

The spacy_entity_linker package doesn't attach NER tags such as PERSON, ORG, or GPE to its linked entities. To recover them, we match each linked entity to the original spaCy entities and take the NER tag from the one that contains it. This doesn't always work, because the two sets of spans don't always line up, but it's the best we can do.

In [7]:
def get_entity_tag(row: pd.Series, doc_col: str):
    linked_entity: EntityElement = row["entity"]
    linked_entity_span: Span = linked_entity.get_span()

    doc: Doc = row[doc_col]

    for entity in doc.ents:
        if linked_entity_span.start >= entity.start and linked_entity_span.end <= entity.end:
            return entity.label_

    return None
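
To illustrate why the spans "don't always line up", here is a small sketch of the same containment test using plain `(start, end)` token offsets instead of real spaCy objects. The names `is_contained`, `overlaps` and `find_tag`, as well as the example offsets, are hypothetical and for illustration only; the overlap variant is shown as a possible looser alternative, not something the notebook uses.

```python
from typing import Callable, Optional

# Token offsets are half-open ranges [start, end), like Span.start / Span.end in spaCy.
TokenRange = tuple[int, int]


def is_contained(inner: TokenRange, outer: TokenRange) -> bool:
    """True when inner lies entirely within outer (the test used in get_entity_tag)."""
    return inner[0] >= outer[0] and inner[1] <= outer[1]


def overlaps(a: TokenRange, b: TokenRange) -> bool:
    """True when the two ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def find_tag(
    linked: TokenRange,
    ner_entities: list[tuple[TokenRange, str]],
    match: Callable[[TokenRange, TokenRange], bool] = is_contained,
) -> Optional[str]:
    """Return the label of the first NER entity that matches the linked span."""
    for ner_range, label in ner_entities:
        if match(linked, ner_range):
            return label
    return None


# Suppose the NER step tagged tokens 3-4 as ORG.
toy_ner_entities = [((3, 5), "ORG")]

tag_same_span = find_tag((3, 5), toy_ner_entities)            # linked span equals the NER span
tag_wider_span = find_tag((3, 9), toy_ner_entities)           # linked span runs past the NER span
tag_wider_overlap = find_tag((3, 9), toy_ner_entities, overlaps)
```

With strict containment, a linked span that extends beyond the NER entity gets no tag at all; an overlap test would tag it, at the cost of occasionally assigning a tag that belongs to a neighbouring entity.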

In [8]:
def get_entity_details_df(df: pd.DataFrame, doc_col: str):
    copied_df = df.copy()
    copied_df["entity"] = copied_df[doc_col].apply(lambda doc: doc._.linkedEntities.entities)

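    # Docs without linked entities explode to NaN; drop only those rows,
    # not rows that merely lack a value in another column (e.g. a headline).
    entity_df = copied_df.explode("entity").dropna(subset=["entity"])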
    entity_df["tag"] = entity_df.apply(get_entity_tag, args=(doc_col,), axis=1)

    entity_details_df = pd.DataFrame(
        data={
            "year": entity_df["year"],
            "type": entity_df["text_type"],
            "entity": entity_df["entity"].apply(lambda ent: ent.get_label()),
            "tag": entity_df["tag"],
            "wikidata_id": entity_df["entity"].apply(lambda ent: ent.get_id()),
            "wikidata_url": entity_df["entity"].apply(lambda ent: ent.get_url()),
            "span": entity_df["entity"].apply(lambda ent: ent.get_span()),
            "span_text": entity_df["entity"].apply(lambda ent: ent.get_span().text)
        }
    )

    clean_incorrect_president_entity(entity_details_df)
    clean_incorrect_texas_entity(entity_details_df)

    # If the entity label is missing, fill it in with the span text.
    # This is rare, but sometimes happens
    entity_details_df["entity"] = entity_details_df["entity"].fillna(entity_details_df["span_text"])

    return entity_details_df

In [9]:
text_entity_details_df = get_entity_details_df(dataset_df, "text_doc")
text_entity_details_df.head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text
0,2016,News and blog,The Residents,,947955,https://www.wikidata.org/wiki/Q947955,(Residents),Residents
0,2016,News and blog,state,,7275,https://www.wikidata.org/wiki/Q7275,(states),states
0,2016,News and blog,pet,,39201,https://www.wikidata.org/wiki/Q39201,(pet),pet
0,2016,News and blog,humane society,ORG,1636604,https://www.wikidata.org/wiki/Q1636604,"(Humane, Society)",Humane Society
0,2016,News and blog,compliance,,633140,https://www.wikidata.org/wiki/Q633140,(compliance),compliance


In [10]:
headline_entity_details_df = get_entity_details_df(dataset_df, "headline_doc")
headline_entity_details_df.head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text
0,2016,News and blog,Jamie Madrox,ORG,2456058,https://www.wikidata.org/wiki/Q2456058,(Multiple),Multiple
0,2016,News and blog,Ordinance,,25339629,https://www.wikidata.org/wiki/Q25339629,(Ordinance),Ordinance
0,2016,News and blog,Α,ORG,9887,https://www.wikidata.org/wiki/Q9887,(A),A
0,2016,News and blog,Pet,,22905746,https://www.wikidata.org/wiki/Q22905746,(Pet),Pet
0,2016,News and blog,Two,,2665675,https://www.wikidata.org/wiki/Q2665675,(Two),Two


In [11]:
entity_types_to_keep = [
    "EVENT",
    "FAC",
    "GPE",
    "LANGUAGE",
    "LAW",
    "LOC",
    "NORP",
    "ORG",
    "PERSON",
    "PRODUCT",
    "WORK_OF_ART",
]

In [12]:
stopword_list = load_stop_word_list()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Adam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
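
The `is_all_stop_words` helper lives in the `helpers` module and is not shown in this notebook. As a rough idea of what such a check might look like, here is a minimal sketch; the name `is_all_stop_words_sketch`, the whitespace tokenization and the lowercasing are all assumptions, and the real helper may behave differently.

```python
def is_all_stop_words_sketch(text: str, stop_words: set[str]) -> bool:
    """True when every whitespace-separated token of text is a stop word."""
    tokens = text.lower().split()
    # all() over an empty list is True, so an empty label is also flagged.
    return all(token in stop_words for token in tokens)


# Tiny illustrative stop-word set; the notebook loads its list via load_stop_word_list().
toy_stop_words = {"a", "the", "of", "in"}

only_stop_words = is_all_stop_words_sketch("of the", toy_stop_words)
mixed_label = is_all_stop_words_sketch("Society of the United States", toy_stop_words)
```

Requiring every token to be a stop word keeps labels that merely contain one, such as "Society of the United States", while discarding labels that consist of nothing else.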


In [23]:
pos_filter = text_entity_details_df["tag"].isin(entity_types_to_keep)
filtered_text_entity_details_df = text_entity_details_df[pos_filter]

stopwords_in_df = filtered_text_entity_details_df["entity"].apply(is_all_stop_words, args=(stopword_list,))
filtered_text_entity_details_df = filtered_text_entity_details_df[~stopwords_in_df]

filtered_text_entity_details_df.head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text
0,2016,News and blog,humane society,ORG,1636604,https://www.wikidata.org/wiki/Q1636604,"(Humane, Society)",Humane Society
0,2016,News and blog,Texas,GPE,1439,https://www.wikidata.org/wiki/Q1439,(Texas),Texas
0,2016,News and blog,Arizona,GPE,816,https://www.wikidata.org/wiki/Q816,(Arizona),Arizona
0,2016,News and blog,Missouri,GPE,1581,https://www.wikidata.org/wiki/Q1581,(Missouri),Missouri
0,2016,News and blog,Society of the United States,ORG,5963598,https://www.wikidata.org/wiki/Q5963598,"(Society, of, the, United, States)",Society of the United States


In [24]:
pos_filter = headline_entity_details_df["tag"].isin(entity_types_to_keep)
filtered_headline_entity_details_df = headline_entity_details_df[pos_filter]

stopwords_in_df = filtered_headline_entity_details_df["entity"].apply(is_all_stop_words, args=(stopword_list,))
filtered_headline_entity_details_df = filtered_headline_entity_details_df[~stopwords_in_df]

filtered_headline_entity_details_df.head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text
0,2016,News and blog,Jamie Madrox,ORG,2456058,https://www.wikidata.org/wiki/Q2456058,(Multiple),Multiple
0,2016,News and blog,Α,ORG,9887,https://www.wikidata.org/wiki/Q9887,(A),A
1,2016,Press release,Dianne Feinstein,PERSON,230733,https://www.wikidata.org/wiki/Q230733,"(Dianne, Feinstein)",Dianne Feinstein
1,2016,Press release,theatrical property,LAW,942297,https://www.wikidata.org/wiki/Q942297,(Prop),Prop
1,2016,Press release,full stop,LAW,172008,https://www.wikidata.org/wiki/Q172008,(.),.


## Group dataframes by year and count named entities

In [25]:
def get_count(df: pd.DataFrame):
    copied_df = df.copy()
    copied_df["count"] = copied_df.groupby(["wikidata_id"])["wikidata_id"].transform("count")
    sorted_df = copied_df.sort_values(by=["count"], ascending=False)
    unique_df = sorted_df.drop_duplicates(subset=["wikidata_id"])

    return unique_df
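
`get_count` yields raw counts per entity. To get the relative frequencies mentioned at the top of the notebook, each count can be divided by the total number of entity mentions in the same DataFrame (here, one year). The following is a minimal sketch under that assumption; `add_relative_frequency` and the `relative_frequency` column are hypothetical names, and the toy counts are made up for illustration.

```python
import pandas as pd


def add_relative_frequency(counts_df: pd.DataFrame) -> pd.DataFrame:
    """Add each entity's share of all entity mentions in counts_df."""
    result = counts_df.copy()
    # counts_df has one row per entity, so the column sum is the total number of mentions.
    total_mentions = result["count"].sum()
    result["relative_frequency"] = result["count"] / total_mentions
    return result


# Toy counts for a single year (illustrative values, not taken from the dataset).
toy_counts_df = pd.DataFrame({
    "entity": ["United States of America", "Barack Obama", "AARP"],
    "wikidata_id": [30, 76, 463410],
    "count": [5, 3, 2],
})
toy_relative_df = add_relative_frequency(toy_counts_df)
```

Normalizing within each year makes years with very different numbers of articles comparable, which raw counts alone do not.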

In [26]:
def get_count_dfs_for_years(df: pd.DataFrame):
    years, years_dfs = get_groups(df, "year")

    year_counts_dfs = [get_count(df) for df in years_dfs]

    return years, year_counts_dfs
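
`get_groups` also comes from the `helpers` module. Judging only from how it is called here (it returns the group keys and one DataFrame per key), it could be a thin wrapper around `DataFrame.groupby`. The sketch below is an assumption along those lines, with the hypothetical name `get_groups_sketch`; the real helper may differ.

```python
import pandas as pd


def get_groups_sketch(df: pd.DataFrame, column: str) -> tuple[list, list[pd.DataFrame]]:
    """Split df into one DataFrame per distinct value of column."""
    keys, group_dfs = [], []
    # groupby sorts group keys by default, so years come out in ascending order.
    for key, group_df in df.groupby(column):
        keys.append(key)
        group_dfs.append(group_df)
    return keys, group_dfs


# Toy rows for illustration only.
toy_articles_df = pd.DataFrame({
    "year": [2017, 2016, 2017],
    "entity": ["Texas", "Arizona", "Missouri"],
})
toy_years, toy_year_dfs = get_groups_sketch(toy_articles_df, "year")
```

Returning the keys and the DataFrames as two parallel lists matches the way the results are later consumed with `zip(years, dfs)`.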

In [27]:
years_text, years_text_dfs = get_count_dfs_for_years(filtered_text_entity_details_df)
years_text_dfs[0].head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text,count
438,2009,News and blog,United States of America,GPE,30,https://www.wikidata.org/wiki/Q30,(U.S.),U.S.,21
445,2009,News and blog,Barack Obama,PERSON,76,https://www.wikidata.org/wiki/Q76,(Obama),Obama,18
434,2009,News and blog,Americans,NORP,846570,https://www.wikidata.org/wiki/Q846570,(Americans),Americans,17
450,2009,Press release,AARP,ORG,463410,https://www.wikidata.org/wiki/Q463410,(AARP),AARP,15
441,2009,News and blog,House,ORG,23558,https://www.wikidata.org/wiki/Q23558,(House),House,14


In [28]:
years_headline, years_headline_dfs = get_count_dfs_for_years(filtered_headline_entity_details_df)
years_headline_dfs[0].head()

Unnamed: 0,year,type,entity,tag,wikidata_id,wikidata_url,span,span_text,count
450,2009,Press release,AARP,ORG,463410,https://www.wikidata.org/wiki/Q463410,(AARP),AARP,2
445,2009,News and blog,Barack Obama,ORG,76,https://www.wikidata.org/wiki/Q76,(Obama),Obama,2
437,2009,News and blog,USS Constitution,LAW,944436,https://www.wikidata.org/wiki/Q944436,(Constitution),Constitution,2
450,2009,Press release,Medicare,ORG,559392,https://www.wikidata.org/wiki/Q559392,(Medicare),Medicare,1
449,2009,News and blog,Shift,ORG,18712525,https://www.wikidata.org/wiki/Q18712525,(Shifts),Shifts,1


## Write results to Excel spreadsheet

In [29]:
def save_entity_counts_for_years(years: list[int], dfs: list[pd.DataFrame], output_path: str):
    # Write one sheet per year; the context manager closes the workbook even if a write fails.
    with pd.ExcelWriter(output_path, engine="xlsxwriter") as writer:
        for year, df in zip(years, dfs):
            df.to_excel(
                writer,
                sheet_name=str(year),
                index=False,
                columns=["entity", "tag", "wikidata_id", "wikidata_url", "span_text", "count", "year"]
            )

In [30]:
save_entity_counts_for_years(
    years=years_text, 
    dfs=years_text_dfs, 
    output_path="./output/ner.xlsx"
)

save_entity_counts_for_years(
    years=years_headline, 
    dfs=years_headline_dfs, 
    output_path="./output/ner_headlines.xlsx"
)

In [31]:
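import os

types, types_dfs = get_groups(text_entity_details_df, "type")

# `text_type` avoids shadowing the built-in `type`; the per-type year variables
# get their own names so the overall results from above are not overwritten.
for text_type, df in zip(types, types_dfs):
    type_years, type_year_dfs = get_count_dfs_for_years(df)

    type_str = str(text_type).lower().replace(" ", "_")

    # Make sure the per-type output folder exists before writing to it.
    os.makedirs(f"./output/{type_str}", exist_ok=True)

    save_entity_counts_for_years(
        years=type_years,
        dfs=type_year_dfs,
        output_path=f"./output/{type_str}/ner_{type_str}.xlsx"
    )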

In [32]:
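import os

types, types_dfs = get_groups(headline_entity_details_df, "type")

# `text_type` avoids shadowing the built-in `type`; the per-type year variables
# get their own names so the overall results from above are not overwritten.
for text_type, df in zip(types, types_dfs):
    type_years, type_year_dfs = get_count_dfs_for_years(df)

    type_str = str(text_type).lower().replace(" ", "_")

    # Make sure the per-type output folder exists before writing to it.
    os.makedirs(f"./output/{type_str}", exist_ok=True)

    save_entity_counts_for_years(
        years=type_years,
        dfs=type_year_dfs,
        output_path=f"./output/{type_str}/ner_{type_str}_headlines.xlsx"
    )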