<a href="https://colab.research.google.com/github/zaneprice5/knowledge-graph-with-spaCy/blob/main/NLP_knowledge_graph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# This notebook explores the NYT headlines dataset from https://www.kaggle.com/datasets/tmishinev/nyt-headlines-20102021?select=nyt_articles_2021.csv

We will mostly use spaCy (https://spacy.io/), an industrial-strength text-processing library, to create a knowledge graph of the information contained in the NYT dataset.

Sources:
1. https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk
2. https://spacy.io/

Last updated: 9/7/2022

In [1]:
# Import the necessary libraries

import re

import bs4
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import requests
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from tqdm import tqdm

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

pd.set_option('display.max_colwidth', 200)
%matplotlib inline


In [2]:
# Read the data into a pandas DataFrame
candidate_sentences = pd.read_csv("/content/nyt_articles_2021.csv")
candidate_sentences.shape

(37337, 7)

In [3]:
#inspect the data to understand its structure

candidate_sentences.head()

Unnamed: 0,pub_date,abstract,headline,lead_paragraph,news_desk,section_name,word_count
0,2021-01-01 00:16:53+00:00,The video shows a man raising something to his car window before a bang is heard. An officer ducks for cover and then fires several rounds at the man.,Minneapolis Police Release Body Camera Video of Its First Killing Since George Floyd,"The Minneapolis Police Department released body camera footage on Thursday that shed new light on a fatal police shooting the night before, the first killing by a city police officer since George ...",National,U.S.,861
1,2021-01-01 00:58:19+00:00,"Every December since 2017, Ada Rojas has guided women through the process of memorializing their New Year’s resolutions on a vision board, a collage that reflects their goals and helps keep them o...",Resolving to live a lot better than in 2020.,"Every December since 2017, Ada Rojas has guided women through the process of memorializing their New Year’s resolutions on a vision board, a collage that reflects their goals and helps keep them o...",Express,U.S.,263
2,2021-01-01 01:24:55+00:00,"The suit, led by Representative Louie Gohmert of Texas, seeks to give the vice president the power to reject electoral votes that were cast for Joseph R. Biden Jr.",Justice Dept. Asks Judge to Toss Election Lawsuit Against Pence,[Here’s what you need to know about President-elect Joseph R. Biden Jr.’s Inauguration Day.],Washington,U.S.,695
3,2021-01-01 01:28:22+00:00,"The United States recorded its 20 millionth case since the start of the coronavirus pandemic on Thursday, surpassing a grim milestone just as the prospects for getting the virus under control quic...",The U.S. reaches 20 million cases.,"The United States recorded its 20 millionth case since the start of the coronavirus pandemic on Thursday, surpassing a grim milestone just as the prospects for getting the virus under control quic...",Foreign,World,438
4,2021-01-01 03:00:05+00:00,Milo Beckman hides some pleasant surprises in his New Year’s puzzle — let’s hope for more of that!,Party Hearty,"FRIDAY PUZZLE — I hope people have a lot of fun with this “themeless” Friday, which revealed its center (to this solver, at least) like a secret box with a series of sliding parts. If Milo Beckman...",Games,Crosswords & Games,670


#### After looking at the dataframe, we see that we have 7 features: pub_date, abstract, headline, lead_paragraph, news_desk, section_name, and word_count. We can do quite a lot with this information!

### Let's explore creating a knowledge graph from the sentences contained in the dataset!

### Knowledge graphs are built from nodes and edges. An edge connects two nodes and describes the relationship between them. The smallest possible knowledge graph we can build is called a 'triple': two nodes joined by one edge.

An example of a triple would be (dog, to eat, bone). This triple may be encoded in a sentence like "The dog ate the bone yesterday."
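
As a minimal sketch, the triple above can be stored as a directed graph with networkx (imported at the top of this notebook): each node is an entity and the edge carries the relationship. The attribute name `relation` is an illustrative choice and not something the library requires.

```python
import networkx as nx

# Build a directed graph that holds the single triple (dog, to eat, bone).
graph = nx.DiGraph()

# 'relation' is an illustrative attribute name chosen for this sketch.
graph.add_edge("dog", "bone", relation="to eat")

# Walk over every edge and print it as source -[relation]-> target.
for source, target, attrs in graph.edges(data=True):
    print(f"{source} -[{attrs['relation']}]-> {target}")
```

A directed graph is used here because the relationship has a direction: the dog eats the bone, not the other way around.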

#### The simplest way to start constructing a knowledge graph is to break a sentence down into its grammatical parts.

In [5]:
doc = nlp("the dog ate the bone yesterday.")

for tok in doc:
  print(tok.text, "...", tok.dep_)

the ... det
dog ... nsubj
ate ... ROOT
the ... det
bone ... dobj
yesterday ... npadvmod
. ... punct


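Here we can see that spaCy assigns a syntactic dependency label to every token in the sentence: `dog` is the nominal subject (`nsubj`), `ate` is the root of the parse (`ROOT`), and `bone` is the direct object (`dobj`). This will be important later on, when we use a function to build the knowledge graph from sentences.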
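
To illustrate how these labels can lead to a triple, here is a minimal sketch that works on (text, label) pairs like the ones printed above. The helper name `extract_triple` is hypothetical, and it assumes a simple sentence with one `nsubj`, one `ROOT`, and one `dobj`; real headlines will need more careful handling.

```python
from typing import Iterable, Optional, Tuple


def extract_triple(pairs: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, relation, object) from (token text, dependency label) pairs.

    Hypothetical helper for illustration: it keeps the first token seen for
    each of the labels nsubj, ROOT, and dobj, and returns None if any is missing.
    """
    found = {}
    for text, dep in pairs:
        if dep in ("nsubj", "ROOT", "dobj"):
            found.setdefault(dep, text)  # keep only the first match per label
    if len(found) < 3:
        return None
    return found["nsubj"], found["ROOT"], found["dobj"]


# The (text, label) pairs printed by the cell above.
parsed = [
    ("the", "det"),
    ("dog", "nsubj"),
    ("ate", "ROOT"),
    ("the", "det"),
    ("bone", "dobj"),
    ("yesterday", "npadvmod"),
    (".", "punct"),
]

triple = extract_triple(parsed)
print(triple)
```

With a spaCy `doc`, the same pairs could be produced with `[(tok.text, tok.dep_) for tok in doc]`.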