# Entity and Relationship Extraction

This notebook has two parts.

1. The first part extracts entities (subject and object) from the scraped news articles. This is done with the help of spaCy.
2. The second part stores the extracted information in a directed graph. I have used the **networkx** library for this task. The graph is then saved for future inference.

#### Install Requirements

In [78]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 21.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


#### Import Libraries

In [79]:
import spacy
import os
import glob
import re
import string
from tqdm import tqdm
import networkx as nx
import matplotlib.pyplot as plt
nlp = spacy.load('en_core_web_md')

#### Data Preparation

In [80]:
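def clean(text):
  '''
  Lowercase a sentence and strip mentions, URLs, punctuation and other
  non-alphanumeric characters.

  Input:
  text: string

  Returns the cleaned string.
  '''
  # lowercase the text
  text = text.lower()
  # remove @mentions, URLs, a leading "rt", and every character that is not
  # an ASCII letter, digit, space or tab (punctuation and non-ASCII included)
  text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)

  return text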
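
To make the effect of the cleaning step concrete, here is a small self-contained sketch. `clean_text` is a hypothetical stand-in that lowercases the input and applies a pattern modeled on the one in `clean`; the sample sentence and URL are made up and do not come from the dataset.

```python
import re

# Pattern modeled on the one used in clean(): mentions, any character that is
# not an ASCII letter/digit/space/tab, URLs, a leading "rt", stray "http".
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

def clean_text(text):
    """Hypothetical stand-in for clean(): lowercase, then strip unwanted characters."""
    return re.sub(PATTERN, "", text.lower())

sample = "Apple's CEO visited Berlin, see https://example.com/news today"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The apostrophe and the comma disappear, the capitals are lowercased, and removing the URL leaves two consecutive spaces behind.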


In [81]:
PATH = '/content/drive/MyDrive/Intern Task/Data/'

sentences = []

for file in glob.glob(PATH + '*.txt'):
  with open(file, 'r') as f:
    text = f.read()

  # split on '.' and '?' (a rough sentence boundary), then clean each piece
  sentences.extend(clean(sent) for sent in re.split('[.?]', text))
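
Splitting on `[.?]` treats every period and question mark as a sentence boundary. The sketch below, on a made-up piece of text, shows what that means for decimals and for the end of a file. `split_sentences` is a hypothetical helper, not part of the notebook, that additionally strips whitespace and drops empty pieces.

```python
import re

text = "Prices rose 3.5 percent. Why? Nobody knows."

# The same split as in the loading loop: every '.' and '?' is a boundary,
# including the decimal point, and the final '.' yields an empty last piece.
pieces = re.split('[.?]', text)
print(pieces)

def split_sentences(text):
    """Hypothetical helper: same split, but strip whitespace and drop empty pieces."""
    return [p.strip() for p in re.split('[.?]', text) if p.strip()]

print(split_sentences(text))
```

Empty pieces like the last one are included in the sentence list, so they also count toward `len(sentences)`.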


In [82]:
len(sentences)

5311

#### Global Variables

In [83]:
# To store the subject and object as a tuple for creating the graph later
EDGES = []

# To store the relationship between subject and object
RELATIONS = []

#### Entity Extraction

In [84]:
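def getAdj(text, index):

  '''
  Collect the adjectives attached to an extracted subject or object.

  Input:
  text: (string) the sentence
  index: (int) position of the subject or object token in the parsed sentence

  Returns the adjectives as a string, each preceded by a space
  ('' if there are none).
  '''

  doc = nlp(text)

  phrase = ''

  for token in doc:

    if token.i == index:

      # adjectives that are direct children of the token in the dependency tree
      for subtoken in token.children:
        if subtoken.pos_ == 'ADJ':
          phrase += ' ' + subtoken.text
      break

  return phrase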

In [93]:
def getSVO(text):
  '''
  Extract a subject, an object and the verb that relates them from a text.

  Input:
  text: (string)

  Returns (sub, obj, rel); any of the three is None if it was not found.
  '''

  doc = nlp(text)
  sub = None
  rel = None
  obj = None

  # If the sentence has several verbs, sub and rel come from the last verb
  # with a matching subject, and obj from the last such verb that also has
  # a matching direct object.
  for token in doc:
    # consider every verb, not only the root of the parse
    if token.pos_ == 'VERB':

      # only extract noun, proper noun or pronoun subjects
      for sub_tok in token.lefts:

        if (sub_tok.dep_ in ['nsubj', 'nsubjpass']) and (sub_tok.pos_ in ['NOUN', 'PROPN', 'PRON']):

          # look for subject modifier
          adj = getAdj(text, sub_tok.i)

          sub = (adj + ' ' + sub_tok.text).strip()

          rel = token.text

          # check for noun or proper noun direct objects
          for obj_tok in token.rights:

            if (obj_tok.dep_ in ['dobj']) and (obj_tok.pos_ in ['NOUN', 'PROPN']):

              # look for object modifier
              adj = getAdj(text, obj_tok.i)

              obj = (adj + ' ' + obj_tok.text).strip()

  return sub, obj, rel
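
The traversal in `getSVO` can be traced without loading a model. The sketch below is only an illustration: `Tok` is a hypothetical stand-in for a spaCy token (it is not the spaCy API), the two parses are built by hand and are not real spaCy output, and the adjective lookup is left out. `svo_from_tokens` repeats the same loop structure: for every verb, take a subject from its left children, then a direct object from its right children.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; the parse is supplied by hand."""
    text: str
    pos: str
    dep: str = ""
    lefts: List["Tok"] = field(default_factory=list)
    rights: List["Tok"] = field(default_factory=list)

def svo_from_tokens(tokens):
    """Same loop structure as getSVO, minus the adjective lookup."""
    sub = rel = obj = None
    for tok in tokens:
        if tok.pos != "VERB":
            continue
        for left in tok.lefts:
            if left.dep in ("nsubj", "nsubjpass") and left.pos in ("NOUN", "PROPN", "PRON"):
                sub, rel = left.text, tok.text
                for right in tok.rights:
                    if right.dep == "dobj" and right.pos in ("NOUN", "PROPN"):
                        obj = right.text
    return sub, obj, rel

# "company bought startup": one verb with a subject and a direct object
company = Tok("company", "NOUN", "nsubj")
startup = Tok("startup", "NOUN", "dobj")
bought = Tok("bought", "VERB", "ROOT", lefts=[company], rights=[startup])
one_verb = svo_from_tokens([company, bought, startup])
print(one_verb)

# "... and shares fell": a second verb with a subject but no direct object
shares = Tok("shares", "NOUN", "nsubj")
fell = Tok("fell", "VERB", "conj", lefts=[shares])
two_verbs = svo_from_tokens([company, bought, startup, shares, fell])
print(two_verbs)
```

In the second call the subject and relation come from the later verb while the object is left over from the earlier one, so a sentence with several verbs can produce a mixed triple. If that turns out to matter for the graph, one option would be to collect one triple per verb.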

In [94]:
for text in tqdm(sentences):
  sub, obj, rel = getSVO(text)
  EDGES.append((sub, obj))
  RELATIONS.append(rel)

100%|██████████| 5311/5311 [01:38<00:00, 54.03it/s]


In [95]:
## Drop every triple in which the subject, the object or the relation
## was not detected.

NEW_EDGES = []
NEW_RELATIONS = []
for tup, rel in zip(EDGES, RELATIONS):
  if all(tup) and rel is not None:
    NEW_EDGES.append(tup)
    NEW_RELATIONS.append(rel)
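
As a quick check of the filtering rule, here is a minimal sketch on made-up triples (the lists below are illustrative, not extracted data). A triple is kept only if both subject and object are truthy and the relation is not `None`, so `None` and empty strings are both rejected.

```python
edges = [("company", "startup"), (None, "bill"), ("senate", None), ("", "tax"), ("he", "car")]
relations = ["bought", "passed", "voted", "cut", None]

# Keep a triple only when subject, object and relation are all present.
kept = [
    (sub, obj, rel)
    for (sub, obj), rel in zip(edges, relations)
    if all((sub, obj)) and rel is not None
]
print(kept)
```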

#### Create and store information in Graph

In [97]:
G = nx.DiGraph()
for i, entity in enumerate(NEW_EDGES):
  sub, obj = entity
  G.add_edge(sub, obj, value=NEW_RELATIONS[i])
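
The sketch below shows, on made-up triples, how the stored relations can be read back and what happens when the same subject and object appear twice. The triples are illustrative only; the calls are standard networkx ones.

```python
import networkx as nx

triples = [
    ("company", "startup", "bought"),
    ("company", "startup", "sold"),   # same pair, different relation
    ("senate", "bill", "passed"),
]

G = nx.DiGraph()
for sub, obj, rel in triples:
    G.add_edge(sub, obj, value=rel)

# A DiGraph holds at most one edge per (subject, object) pair,
# so the second relation replaces the first.
n_edges = G.number_of_edges()
relation = G["company"]["startup"]["value"]
objects = list(G.successors("company"))
print(n_edges, relation, objects)
```

Because a `DiGraph` keeps a single edge per ordered pair, only the last relation seen for a pair survives. If every relation needs to be kept, `nx.MultiDiGraph` allows parallel edges.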

#### Visualize the Graph

In [98]:
edge_labels = nx.get_edge_attributes(G, 'value')
pos = nx.spring_layout(G)
plt.figure(figsize=(20,20))
nx.draw(
    G, pos, edge_color='black', width=1, linewidths=1,
    node_size=500, node_color='pink', alpha=0.9,
    labels={node: node for node in G.nodes()}
)
nx.draw_networkx_edge_labels(
    G, pos,
    edge_labels=edge_labels,
    font_color='red'
)


#### Save the Graph

In [99]:
import pickle

# use a context manager so the file is closed after writing
with open('/content/drive/MyDrive/Intern Task/Graph_DB/graph_db.pickle', 'wb') as f:
  pickle.dump(G, f)
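
For later inference the graph has to be loaded again. The following is a minimal round-trip sketch under stated assumptions: it uses a toy graph and a temporary file (the `path` below is hypothetical) instead of the Google Drive location used in this notebook.

```python
import os
import pickle
import tempfile

import networkx as nx

G = nx.DiGraph()
G.add_edge("company", "startup", value="bought")

# Hypothetical location; the notebook writes to a Google Drive path instead.
path = os.path.join(tempfile.mkdtemp(), "graph_db.pickle")

with open(path, "wb") as f:
    pickle.dump(G, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# The edge and its relation attribute survive the round trip.
print(list(loaded.edges(data=True)))
```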