## Entity Disambiguation

Here we do simple rule-based entity disambiguation. Specifically, we compute the word overlap between the different textual forms of named entities and group together forms whose overlap exceeds a threshold. Each cluster represents a single entity, with its different textual forms as synonyms. Synonyms may be overridden manually for correctness.
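
The sketch below shows one way the overlap-and-threshold idea can produce clusters. It is an illustration of the general idea rather than the code used in this notebook, which appears in the cells below and groups matches by their left-hand entity instead of merging them transitively. The helper names `jaccard` and `cluster`, the toy names, and the 0.5 threshold are assumptions made for the example, and names are assumed to be non-empty.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard overlap between two strings, ignoring case."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def cluster(names, threshold):
    """Group names linked by pairwise overlap above threshold (union-find)."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Link every pair whose overlap exceeds the threshold.
    for a, b in combinations(names, 2):
        if jaccard(a, b) > threshold:
            parent[find(a)] = find(b)

    # Collect names by their root; each list is one cluster.
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


# Toy input, for illustration only.
names = ["Deutsche Bank", "the Deutsche Bank", "Bank of Cyprus", "the Bank of Cyprus"]
groups = cluster(names, 0.5)
print(groups)
```

With a transitive merge like this, two forms can end up in the same cluster through an intermediate form even if they do not overlap directly, which may or may not be desirable depending on the entity type.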

In [1]:
import os
import pandas as pd

from itertools import combinations
from nltk.metrics.distance import edit_distance

In [2]:
DATA_DIR = "../../data/entity-graph"

In [3]:
entities_df = pd.read_csv(os.path.join(DATA_DIR, "entities.tsv"), sep="\t", 
                          names=["sid", "sent", "ent_text", "ent_label", "ent_start", "end_end"])
entities_df.head()

Unnamed: 0,sid,sent,ent_text,ent_label,ent_start,end_end
0,0,President Donald J. Trump claims he has the “ ...,Donald J. Trump,PERSON,10,25
1,0,President Donald J. Trump claims he has the “ ...,the Justice Department,ORG,90,112
2,2,For clues to what may surface when investigato...,Deutsche Bank,ORG,112,125
3,2,For clues to what may surface when investigato...,Russia,GPE,169,175
4,3,Recent media reports confirm that Deutsche Ban...,Deutsche Bank,ORG,34,47


## Entity Types Detected

In [4]:
entities_df["ent_label"].value_counts()

ORG            593
PERSON         314
GPE            286
DATE           175
NORP           109
MONEY           84
CARDINAL        76
LOC             51
PRODUCT         31
FAC             21
ORDINAL         17
WORK_OF_ART      7
EVENT            6
LAW              4
PERCENT          3
TIME             2
LANGUAGE         1
Name: ent_label, dtype: int64

## Common Functions

In [5]:
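def entity_similarity(e1, e2):
    """Return 1 if one string contains the other, else the token-level Jaccard overlap."""
    if e1 == e2:
        return 1
    elif e2 in e1 or e1 in e2:
        # case-sensitive substring match, e.g. "Levin" inside "Carl Levin"
        return 1
    else:
        # Jaccard similarity over lowercased, whitespace-separated tokens
        t1, t2 = set(e1.lower().split()), set(e2.lower().split())
        score = len(t1.intersection(t2)) / len(t1.union(t2))
        return score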
    

def construct_all_pairs(ent_df, sim_threshold):
    """Return pairs of distinct entity strings whose similarity exceeds sim_threshold."""
    ent_pairs_df = pd.DataFrame(list(combinations(ent_df.ent_text, 2)),
        columns=['ent_text_x', 'ent_text_y'])
    # compute similarity
    ent_pairs_df["sim"] = ent_pairs_df.apply(
        lambda row: entity_similarity(row.ent_text_x, row.ent_text_y), axis=1)
    # remove low similarity pairs
    ent_pairs_df = ent_pairs_df[ent_pairs_df["sim"] > sim_threshold]
    # remove textually identical pairs
    ent_pairs_df = ent_pairs_df[ent_pairs_df["ent_text_x"] != ent_pairs_df["ent_text_y"]]
    # keep only pairs whose LHS sorts lexicographically before the RHS (upper triangle entries)
    ent_pairs_df = ent_pairs_df[ent_pairs_df["ent_text_x"] <= ent_pairs_df["ent_text_y"]]
    return ent_pairs_df


def accumulate_synonyms(ent_pairs_df):
    """Collect, for each left-hand entity, its matched forms as a pipe-delimited string."""
    ent_pairs_grouped_df = (ent_pairs_df.groupby(["ent_text_x"])["ent_text_y"]
        .apply(list)
        .reset_index(name="syns_list"))
    ent_pairs_grouped_df["synonyms"] = ent_pairs_grouped_df["syns_list"].str.join("|")
    ent_pairs_grouped_df = ent_pairs_grouped_df.drop(columns=["syns_list"])
    return ent_pairs_grouped_df
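
One detail of `construct_all_pairs` is worth keeping in mind: `combinations` emits each unordered pair once, in input order, so the final `<=` filter discards a pair whenever the form that appears first happens to sort after the other one. The sketch below illustrates this on a toy input and shows a possible alternative that sorts each pair instead of filtering. The variable names and the input are assumptions for illustration, and whether this affects the data here is not visible from the outputs.

```python
from itertools import combinations

# Toy input: the shorter form appears before the longer one.
names = ["Trump", "Donald Trump"]

# combinations() preserves input order, so the only pair is ("Trump", "Donald Trump").
raw_pairs = list(combinations(names, 2))

# Filtering on x <= y drops this pair, because "Trump" sorts after "Donald Trump".
kept_by_filter = [(x, y) for x, y in raw_pairs if x <= y]

# Sorting each pair keeps it and still puts the smaller string on the left.
kept_by_sorting = [tuple(sorted(pair)) for pair in raw_pairs]

print(kept_by_filter)
print(kept_by_sorting)
```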


## ORG

In [6]:
s1 = entity_similarity("the Justice Department", "US Department of Justice")
s2 = entity_similarity("the Southern District", "Office for the Southern District of NY")
s1, s2

(0.4, 1)
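
To see where these scores come from: the first pair has no substring relationship, so its score is the Jaccard overlap of the lowercased token sets {the, justice, department} and {us, department, of, justice}:

$$
\mathrm{sim} = \frac{|\{\text{justice}, \text{department}\}|}{|\{\text{the}, \text{justice}, \text{department}, \text{us}, \text{of}\}|} = \frac{2}{5} = 0.4
$$

The second pair scores 1 because "the Southern District" occurs verbatim inside the longer string, which triggers the substring shortcut before any token comparison. A score of 0.4 clears the 0.35 threshold used for ORG below.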

In [7]:
org_df = entities_df[entities_df["ent_label"] == "ORG"]
org_df.head()

Unnamed: 0,sid,sent,ent_text,ent_label,ent_start,end_end
1,0,President Donald J. Trump claims he has the “ ...,the Justice Department,ORG,90,112
2,2,For clues to what may surface when investigato...,Deutsche Bank,ORG,112,125
4,3,Recent media reports confirm that Deutsche Ban...,Deutsche Bank,ORG,34,47
9,5,"Indeed , Deutsche Bank is a common thread in n...",Deutsche Bank,ORG,9,22
12,7,In all the flurry of media attention to the Ru...,the Department of Justice,ORG,103,128


In [8]:
org_pairs_df = construct_all_pairs(org_df, 0.35)
org_pairs_df.head(20)

Unnamed: 0,ent_text_x,ent_text_y,sim
7,the Justice Department,the US Department of Justice,0.6
9,the Justice Department,the US Department of Justice,0.6
13,the Justice Department,the US Department of Justice,0.6
36,the Justice Department,the US Department of Justice,0.6
39,the Justice Department,the US Department of Justice,0.6
55,the Justice Department,the US Department of Justice,0.6
222,the Justice Department,the US Department of Justice,0.6
228,the Justice Department,the US Department of Justice,0.6
232,the Justice Department,the US Department of Justice,0.6
264,the Justice Department,the US Department of Justice,0.6


In [9]:
org_pairs_grouped_df = accumulate_synonyms(org_pairs_df)
org_pairs_grouped_df.head(10)

Unnamed: 0,ent_text_x,synonyms
0,Alfa,Alfa Group
1,Bank,Deutsche Bank|Deutsche Bank|Deutsche Bank
2,Bank of Cyprus,the Cyprus Popular Bank|the Central Bank of Cy...
3,Bank of New York,New York Times|New York State Department of Fi...
4,Bannon at Trump Tower,Trump|Trump
5,Bloomberg,Bloomberg News|Bloomberg News
6,Commercial Bank of SF,Russian Commercial Bank
7,Committee,the Committee on Financial Services
8,Crime and Corruption Reporting Project,Organized Crime and Corruption Reporting Project
9,Department of Financial Services,The Department of Financial Services|The Depar...


In [10]:
org_pairs_grouped_df.to_csv(os.path.join(DATA_DIR, "org_syns.csv"), index=False)

## PERSON

In [11]:
s1 = entity_similarity("Jared Kushner", "Jared Kushner")
s2 = entity_similarity("Carl Levin", "Levin")
s3 = entity_similarity("Donald J. Trump", "Donald Trump")
s4 = entity_similarity("Donald J. Trump", "Donald Trump Jr.")
s1, s2, s3, s4

(1, 1, 0.6666666666666666, 0.5)

In [12]:
person_df = (entities_df[entities_df["ent_label"] == "PERSON"][["ent_text"]]
    .drop_duplicates())
person_df.head()

Unnamed: 0,ent_text
0,Donald J. Trump
17,Donald Trump
18,Jared Kushner
46,Robert Mueller
70,Breuer


In [13]:
person_pairs_df = construct_all_pairs(person_df, 0.6)
person_pairs_df.head(20)

Unnamed: 0,ent_text_x,ent_text_y,sim
0,Donald J. Trump,Donald Trump,0.666667
25,Donald J. Trump,Trump,1.0
183,Donald Trump,Trump,1.0
207,Donald Trump,Donald Trump ’s,1.0
293,Donald Trump,Elect Donald Trump,1.0
298,Donald Trump,Donald Trump Jr.,1.0
337,Jared Kushner,Kushner,1.0
674,Breuer,Rolf - Ernst Breuer,1.0
785,Diane Glossman,Glossman,1.0
2281,Carl Levin,Levin,1.0


In [14]:
person_pairs_grouped_df = accumulate_synonyms(person_pairs_df)
person_pairs_grouped_df.head(10)

Unnamed: 0,ent_text_x,synonyms
0,Ackerman,Josef Ackermann
1,Alex Sapir,Sapir
2,Boris Rotenberg,Rotenberg
3,Breuer,Rolf - Ernst Breuer
4,Carl Levin,Levin
5,Crown Prince Mohammed bin Zayed al Nahyan,Mohammed bin Zayed
6,David Kautter,Kautter
7,Diane Glossman,Glossman
8,Donald,Donald Jr.|Elect Donald Trump|Donald Trump Jr.
9,Donald J. Trump,Donald Trump|Trump


In [15]:
person_pairs_grouped_df.to_csv(os.path.join(DATA_DIR, "person_syns.csv"), index=False)
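
The grouped output above shows why manual overrides can be needed: the row for "Donald" collects "Donald Jr." and "Donald Trump Jr." alongside "Elect Donald Trump", mixing two different people. Below is a minimal sketch of how such corrections might be applied to rows in the same `ent_text_x` / `synonyms` layout. The function `apply_overrides` and the `remove` / `add` dictionaries are hypothetical and not part of this notebook, and the particular corrections are only an example.

```python
def apply_overrides(syn_rows, remove, add):
    """Return a corrected copy of syn_rows.

    syn_rows: dict mapping a canonical form to its pipe-delimited synonyms.
    remove:   dict mapping a canonical form to synonyms that should be dropped.
    add:      dict mapping a canonical form to synonyms that should be appended.
    """
    result = {}
    # Visit existing rows first, then any canonical forms that only appear in add.
    for canonical in list(syn_rows) + [c for c in add if c not in syn_rows]:
        current = [s for s in syn_rows.get(canonical, "").split("|") if s]
        kept = [s for s in current if s not in remove.get(canonical, ())]
        for extra in add.get(canonical, ()):
            if extra not in kept:
                kept.append(extra)
        if kept:  # rows left without synonyms are omitted
            result[canonical] = "|".join(kept)
    return result


# Two rows copied from the grouped PERSON output above.
rows = {
    "Donald J. Trump": "Donald Trump|Trump",
    "Donald": "Donald Jr.|Elect Donald Trump|Donald Trump Jr.",
}
fixed = apply_overrides(
    rows,
    remove={"Donald": ["Donald Jr.", "Donald Trump Jr."]},
    add={"Donald J. Trump": ["Elect Donald Trump"]},
)
print(fixed)
```

Keeping the overrides in a separate structure, rather than editing the generated CSV by hand, means they can be reapplied whenever the synonyms are regenerated.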

## GPE + LOC + NORP

In [16]:
s1 = entity_similarity("Russia", "the Russian Federation")
s2 = entity_similarity("Virgin Island", "the British Virgin Islands")
s1, s2

(1, 1)

In [17]:
gpe_df = (entities_df[(entities_df["ent_label"] == "GPE") | 
                      (entities_df["ent_label"] == "LOC") |
                      (entities_df["ent_label"] == "NORP")][["ent_text"]]
    .drop_duplicates())
gpe_df.head()

Unnamed: 0,ent_text
3,Russia
6,American
7,Russian
10,US
15,Manhattan


In [18]:
gpe_pairs_df = construct_all_pairs(gpe_df, 0.5)
gpe_pairs_df.head(50)

Unnamed: 0,ent_text_x,ent_text_y,sim
1,Russia,Russian,1.0
25,Russia,the Russian Federation,1.0
33,Russia,Russians,1.0
248,Russian,the Russian Federation,1.0
256,Russian,Russians,1.0
556,Southern District,the Southern District,1.0
569,Southern District,The Southern District,1.0
693,New York,New Yorker,1.0
756,New York,New York Preet Bharara,1.0
759,New York,New York of Deutsche Bank,1.0


In [19]:
gpe_pairs_grouped_df = accumulate_synonyms(gpe_pairs_df)
gpe_pairs_grouped_df.head(10)

Unnamed: 0,ent_text_x,synonyms
0,Africa,Africa - Israel|East Africa
1,Constantinos Koudellaris,Koudellaris
2,Cyprus,the Cyprus Mail
3,Deutsche,New York of Deutsche Bank
4,Eastern District,the Eastern District|this Eastern District
5,Israel,Israeli
6,Latvia,Latvian
7,New York,New Yorker|New York Preet Bharara|New York of ...
8,Panama,Panama City
9,Russia,Russian|the Russian Federation|Russians


In [20]:
gpe_pairs_grouped_df.to_csv(os.path.join(DATA_DIR, "gpe_syns.csv"), index=False)

## DATE

It is not clear that disambiguation will help for dates, but being able to group the narrative by time might. We will revisit this if the need arises.

In [21]:
date_df = entities_df[entities_df["ent_label"] == "DATE"]["ent_text"]
date_df.head(20)

8                  the 1990s
34               December 22
45                      2016
52                      1870
60             December 1998
75                      2001
85      just the past decade
86             December 2016
90     between 2005 and 2007
92                      2008
93                      2009
94                      2015
98                 June 2015
100            November 2015
115               April 2016
120          April 20 , 2017
129          March 30 , 2016
137             January 2017
154        1996 through 2002
157       December 21 , 2010
Name: ent_text, dtype: object
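
If grouping by time does become useful, one simple starting point might be to pull four-digit years out of the DATE strings and index the mentions by year. The sketch below is only an assumption about how that could look: the names `YEAR_RE`, `years_in`, and `group_by_year` are hypothetical, the pattern only recognizes years from 1800 to 2099, and strings without a year (such as "December 22") are left ungrouped.

```python
import re
from collections import defaultdict

# Four-digit years 1800-2099 that are not part of a longer digit run.
# Lookarounds are used instead of \b so that "1990s" still yields 1990.
YEAR_RE = re.compile(r"(?<!\d)(?:1[89]|20)\d{2}(?!\d)")


def years_in(text):
    """Return every year mentioned in a DATE string, in order of appearance."""
    return [int(y) for y in YEAR_RE.findall(text)]


def group_by_year(date_texts):
    """Map each year to the DATE strings that mention it."""
    by_year = defaultdict(list)
    for text in date_texts:
        for year in years_in(text):
            by_year[year].append(text)
    return dict(by_year)


# A few values taken from the output above.
samples = ["the 1990s", "December 22", "December 1998",
           "between 2005 and 2007", "April 20 , 2017"]
grouped = group_by_year(samples)
print(grouped)
```

Ranges such as "between 2005 and 2007" are indexed only under their two endpoints here; expanding them to the years in between would be a further design decision.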