<a href="https://colab.research.google.com/github/srushtibhavsar/Gambling/blob/main/Week_1_2_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### <font color='red'>NOTE: Please do not edit this file. </font> Go to <font color='blue'>*File > Save a copy in Drive*</font>.

# **openHPI Course: Knowledge Graphs 2023**
## **Week 1: Knowledge Representation with Graphs**
### **Hands-On 1.2: NLP**

---


This is the first Python notebook for week 1 (Knowledge Representation with Graphs) in the openHPI Course **Knowledge Graphs 2023**.

In this notebook you will learn the basics of Natural Language Processing (NLP).

- For the rest of this hands-on session, we will be using data uploaded in this [Google Drive](https://drive.google.com/drive/folders/15HNd46z9G2tuN35LzYox8gf_bJbyjNzb?usp=sharing) folder.
- Copy this folder to your own machine and/or to your Google Drive.

# <a name="TOC"></a> Table of Contents:
1. Google Colab Basics
2. The Art of Understanding: Natural Language Processing ([NLP](#2-nlp))
  - [Tokenization](#nlp-tokenization)
  - Morphological and Syntactic Analysis
    - [Lemmatization](#nlp-lemma)
    - [Part-of-Speech Tagging](#nlp-pos)
    - [Dependency Parsing](#nlp-dep)
  - [Named Entity Recognition](#nlp-ner)


# <a name="1-google collab"></a> 1. Google Colab Basics


### 1.1 How to upload files from your local machine to this notebook?
  First things first, let's load the *star-trek.json* file from your local machine into this notebook.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving star-trek.json to star-trek (1).json
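
`files.upload()` returns a dictionary that maps each uploaded file name to the file's raw bytes. The sketch below shows one way to turn that dictionary into parsed JSON. The helper name `load_uploaded_json` is hypothetical, and `fake_upload` with its invented payload only stands in for the real return value so that the snippet also runs outside Colab; the actual structure of *star-trek.json* is not assumed.

```python
import json


def load_uploaded_json(uploaded, encoding="utf-8"):
    """Parse every uploaded file as JSON and return {file_name: parsed_content}."""
    return {name: json.loads(raw.decode(encoding)) for name, raw in uploaded.items()}


# Stand-in for the dictionary returned by files.upload(); the payload is invented.
fake_upload = {"star-trek.json": b'{"title": "Star Trek"}'}

parsed = load_uploaded_json(fake_upload)
for name, content in parsed.items():
    print(name, "->", content)
```

Iterating over the keys, rather than hard-coding the file name, also covers the case shown in the output above, where Colab saved a repeated upload as *star-trek (1).json*.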


### 1.2 How to upload files from Google Drive to this notebook?
  - Using [PyDrive](https://pythonhosted.org/PyDrive/)

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

- After executing the code above, Colab will prompt you to **allow Google to have access to your Drive**. Grant permission, then go to your Drive, right-click *star-trek.json*, and select "Get link".

In [None]:
link = "https://drive.google.com/file/d/1BYbGVY2tr4nSMb2XGG3tXZlVC2R1YjH1/view?usp=drive_link"
file_id = link.split('/')[-2]
# We are only interested in the file id
print(f' File ID: {file_id}')

downloaded = drive.CreateFile({'id':file_id})
downloaded.GetContentFile('star-trek.json')

 File ID: 1BYbGVY2tr4nSMb2XGG3tXZlVC2R1YjH1
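
Taking the second-to-last path segment works for the link shown above, but it silently returns the wrong piece if the link has a different number of segments. A slightly more defensive sketch uses a regular expression; the helper name `extract_file_id` is hypothetical, and the pattern assumes the `/file/d/<ID>/` link shape used in this notebook, so other Drive link formats are not handled.

```python
import re

# Matches the ID segment in links of the form .../file/d/<ID>/...
FILE_ID_PATTERN = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_file_id(link):
    """Return the file ID from a '.../file/d/<ID>/...' link, or raise ValueError."""
    match = FILE_ID_PATTERN.search(link)
    if match is None:
        raise ValueError(f"No file ID found in link: {link}")
    return match.group(1)


link = "https://drive.google.com/file/d/1BYbGVY2tr4nSMb2XGG3tXZlVC2R1YjH1/view?usp=drive_link"
print(extract_file_id(link))
```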


---

# <a name="2-nlp"></a> 2. Natural Language Processing

As discussed in the lecture in section **1.3 The Art of Understanding**, the process of understanding relies on the correct interpretation of various aspects of natural language.



![nlu](https://drive.google.com/uc?id=1kMwWjG5gYvRvw7YBYt_ujBfA90hE3kE3)

In this notebook, we will look at some of the NLP techniques implemented in the [spaCy](https://spacy.io) Python library, using the `en_core_web_sm` [model](https://spacy.io/models/en), an English language pipeline that is optimized for CPU and was trained on blog entries, news articles, and online comments.

## <a name="nlp-tokenization"></a> 2.1 Tokenization
Tokenization is the **process of separating character sequences** into
smaller pieces, called **tokens**. Depending on the rules of the individual tokenizer,
certain characters, such as punctuation, might be omitted. spaCy keeps punctuation marks as separate tokens.

For this example, let's use an excerpt from Leonard Nimoy's [Wikipedia page](https://en.wikipedia.org/wiki/Leonard_Nimoy):

> *Leonard Simon Nimoy was born on March 26, 1931, in an Irish section of West End of Boston, Massachusetts, to Jewish immigrants from Iziaslav, Ukraine. His mother, Dora (née Spinner; 1904–1987), was a homemaker, and his father, Max Nimoy (1901–1987), owned a barbershop in the Mattapan section of Boston. Leonard Simon Nimoy was an American actor, famed for playing Spock in the Star Trek franchise for almost 50 years.*

In [None]:
# First, import the spaCy library and load the English language model
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp('Leonard Simon Nimoy was born on March 26, 1931, in an Irish section \
of West End of Boston, Massachusetts, to Jewish immigrants from Iziaslav, Ukraine. \
His mother, Dora (née Spinner; 1904–1987), was a homemaker, and his father, \
Max Nimoy (1901–1987), owned a barbershop in the Mattapan section of Boston. \
Leonard Simon Nimoy was an American actor, famed for playing Spock in the Star Trek \
franchise for almost 50 years.')

# Let's view the first 10 tokens
for token in doc[:10]:
  print(token.text)



Leonard
Simon
Nimoy
was
born
on
March
26
,
1931
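
To make the idea of splitting character sequences concrete, here is a minimal rule-based sketch built on a single regular expression. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions); the names `TOKEN_PATTERN` and `simple_tokenize` are hypothetical and exist only for this illustration.

```python
import re

# A token is either a run of word characters or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Leonard Simon Nimoy was born on March 26, 1931,")
print(tokens[:10])

# Contractions show the limits of the rule: the apostrophe becomes its own token.
print(simple_tokenize("don't"))
```

On this sentence the simple rule yields the same first ten tokens as the spaCy output above, but it breaks *don't* into three pieces, whereas spaCy splits it into *do* and *n't*.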


Traditionally, tokenization starts by splitting a text into sentences. spaCy makes it possible to retrieve sentences from the [Doc](https://spacy.io/api/doc) container object.

In [None]:
# doc.sents - an iterator over the sentences in the Doc object
for idx, sent in enumerate(doc.sents, start=1):  # idx avoids shadowing the built-in id()
  print(f'Sentence {idx}: {sent}')

Sentence 1: Leonard Simon Nimoy was born on March 26, 1931, in an Irish section of West End of Boston, Massachusetts, to Jewish immigrants from Iziaslav, Ukraine.
Sentence 2: His mother, Dora (née Spinner; 1904–1987), was a homemaker, and his father, Max Nimoy (1901–1987), owned a barbershop in the Mattapan section of Boston.
Sentence 3: Leonard Simon Nimoy was an American actor, famed for playing Spock in the Star Trek franchise for almost 50 years.
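
For comparison, a naive sentence splitter can be sketched with one regular expression. This is a deliberately simple stand-in and not the method behind `doc.sents`; `SENTENCE_BOUNDARY` and `naive_sentences` are hypothetical names, and the sample text is a shortened paraphrase of the excerpt.

```python
import re

# Split at whitespace that follows '.', '!' or '?' and precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_sentences(text):
    """Return the sentences found by the punctuation-plus-capital rule."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


sample = "Leonard Nimoy was born in Boston. His father owned a barbershop. He played Spock."
for number, sentence in enumerate(naive_sentences(sample), start=1):
    print(f"Sentence {number}: {sentence}")

# Abbreviations defeat the rule: the title is cut off from the name.
print(naive_sentences("Dr. Smith arrived."))
```

The second call shows why a fixed rule is fragile: an abbreviation followed by a capitalized name is mistaken for a sentence boundary.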


Under the hood, spaCy's [Doc](https://spacy.io/api/doc) object is made up of a sequence of [Token](https://spacy.io/api/token) objects.

## 2.2 Morphological and Syntactic Analysis
The statement `doc = nlp('...')` in spaCy is very powerful. It runs several NLP steps at once: tokenization, lemmatization, part-of-speech (POS) tagging, and dependency parsing.

<a name="nlp-lemma"></a>**Lemmatization** falls under the umbrella of **morphological analysis**, the study of words. It requires morphological rules and a lexicon to reduce a word to its base form. In this way, words that have different inflections can be treated as the same item. For example, the auxiliary verbs *is*, *are*, *was*, and *were* are grouped together under the lemma *be*.

POS tagging and dependency parsing are part of **syntactic analysis**. <a name="nlp-pos"></a> With POS tagging, each word is classified according to its syntactic **part of speech** and labelled with a tag from a specified tagset, while **dependency parsing** identifies the syntactic relationships that exist between the tokens of a sentence.

Let's now inspect spaCy's [Token](https://spacy.io/api/token) object. Besides `token.text`, there are several linguistic annotations that we can access through this object:

1. `token.lemma_` - the base form of the word. Example: the lemma of *was* is *be*.
2. `token.pos_` - the coarse-grained part-of-speech tag from the [Universal POS tagset (UPOS)](https://universaldependencies.org/u/pos/).
3. `token.tag_` - the fine-grained part-of-speech tag from the [Penn Treebank tagset](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html).
4. `token.dep_` - the syntactic dependency label that describes the relation between the token and its head in the sentence.
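
Lemmatization was described above as a combination of a lexicon and morphological rules. The toy sketch below illustrates that combination. It is not spaCy's lemmatizer: the lexicon `IRREGULAR`, the rule list `SUFFIX_RULES`, and the function `toy_lemma` are all hypothetical and far from complete.

```python
# Tiny hand-made lexicon of irregular forms (illustrative only).
IRREGULAR = {"is": "be", "are": "be", "was": "be", "were": "be", "born": "bear"}

# (suffix, replacement) rules, tried in order; deliberately crude.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemma(word):
    """Look the word up in the lexicon first, then fall back to suffix stripping."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three characters to avoid mangling short words.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: len(lower) - len(suffix)] + replacement
    return lower


for word in ["was", "were", "born", "immigrants", "owned", "playing", "famed"]:
    print(f"{word:12} -> {toy_lemma(word)}")
```

The last word shows the weakness of pure suffix rules: *famed* is reduced to *fam*, which is why practical lemmatizers rely on much larger lexicons and on the part of speech of each word.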

In [None]:
token_details = []
for idx, token in enumerate(doc):
  token_details.append((idx, token.text, token.lemma_, token.pos_, token.tag_, token.dep_))

Let's print out the first 25 tokens and their attributes in tabular form.

In [None]:
from tabulate import tabulate

print(tabulate(token_details[:25], headers=['ID', 'TEXT', 'LEMMA', 'POS', 'TAG', 'DEP']))

  ID  TEXT           LEMMA          POS    TAG    DEP
----  -------------  -------------  -----  -----  ---------
   0  Leonard        Leonard        PROPN  NNP    compound
   1  Simon          Simon          PROPN  NNP    compound
   2  Nimoy          Nimoy          PROPN  NNP    nsubjpass
   3  was            be             AUX    VBD    auxpass
   4  born           bear           VERB   VBN    ROOT
   5  on             on             ADP    IN     prep
   6  March          March          PROPN  NNP    pobj
   7  26             26             NUM    CD     nummod
   8  ,              ,              PUNCT  ,      punct
   9  1931           1931           NUM    CD     nummod
  10  ,              ,              PUNCT  ,      punct
  11  in             in             ADP    IN     prep
  12  an             an             DET    DT     det
  13  Irish          irish          ADJ    JJ     amod
  14  section        section        NOUN   NN     pobj
  15  of             of             ADP 

## 2.3 <a name="nlp-dep"></a>Dependency Parsing

As part of syntactic analysis, dependency parsing is a process that determines and examines the grammatical relationships between phrases and words in a sentence.

Each relationship connects a **head** (tail of the arrow) with one of its modifiers, a **dependent** (tip of the arrow). In the diagram below, *born* is the head, *was* is the dependent, and the relation between them is labelled [auxpass](https://universaldependencies.org/docs/en/dep/auxpass.html) (passive auxiliary).

![was-born](https://drive.google.com/uc?id=1OtoC1Ww1I4HIKRFyW4Rm6eSDCWabkjHW)



In [None]:
from spacy import displacy

for sent in doc.sents:
  displacy.render(sent, style="dep", jupyter=True, options={'distance': 100})
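
The head/dependent structure can also be explored without a diagram. The sketch below stores the arcs for *Leonard Simon Nimoy was born* as plain tuples and groups the dependents under their heads. The dependency labels come from the token table in section 2.2; the head indices are written by hand and are an assumption, since that table does not list them. In spaCy itself, the same information is available through `token.head` and `token.children`.

```python
from collections import defaultdict

# (index, text, dependency label, head index); the ROOT points to itself.
# Head indices are hand-written assumptions, not parser output.
ARCS = [
    (0, "Leonard", "compound", 2),
    (1, "Simon", "compound", 2),
    (2, "Nimoy", "nsubjpass", 4),
    (3, "was", "auxpass", 4),
    (4, "born", "ROOT", 4),
]


def children_by_head(arcs):
    """Group (dependent text, label) pairs under the index of their head."""
    children = defaultdict(list)
    for index, text, label, head in arcs:
        if head != index:  # skip the ROOT self-loop
            children[head].append((text, label))
    return dict(children)


def find_root(arcs):
    """Return the text of the token labelled ROOT."""
    return next(text for _, text, label, _ in arcs if label == "ROOT")


children = children_by_head(ARCS)
print("Root:", find_root(ARCS))
print("Dependents of 'born':", children[4])
print("Dependents of 'Nimoy':", children[2])
```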

## 2.4 <a name="nlp-ner"></a>Named Entity Recognition (NER)
Recall from [1.4 Graphs and Triples](https://docs.google.com/presentation/d/147-hjulZqnsuSfK-66NGq8GfTTNmeUNjew5XjQ8IrzI/edit?usp=sharing) that graphs are constructed from *triples*. Each triple is composed of two vertices connected with a directed edge. The *vertices* are the entities, while the *edges* represent the relationships that exist between an entity pair.



![triple](https://drive.google.com/uc?id=1BMERaNGHJADpL7FYW8qfTBoLpK00IVm5)
![graph](https://drive.google.com/uc?id=1JNBej83x1wJSe340lECAUQwBzFNhQIBs)
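
A minimal sketch of the triple idea in plain Python: each triple is a `(subject, predicate, object)` tuple, and the graph is an adjacency dictionary built from those tuples. The facts are taken from the Nimoy excerpt, but the predicate names (`bornIn`, `played`, `locatedIn`) and the helper functions are hypothetical choices made for this illustration.

```python
from collections import defaultdict

# (subject, predicate, object) triples; hand-written for illustration.
TRIPLES = [
    ("Leonard Nimoy", "bornIn", "Boston"),
    ("Leonard Nimoy", "played", "Spock"),
    ("Boston", "locatedIn", "Massachusetts"),
]


def build_graph(triples):
    """Return {subject: [(predicate, object), ...]}, one entry per directed edge."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def vertices(triples):
    """Collect every entity that appears as a subject or as an object."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


graph = build_graph(TRIPLES)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
print(sorted(vertices(TRIPLES)))
```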

NER involves locating real-world objects mentioned in a text and classifying them into categories such as **persons, organizations, locations, expressions of time, quantities, monetary values**, etc.

spaCy's pipeline predicts these named entities and stores them in the `ents` property of the [Doc](https://spacy.io/api/doc) object. Each entity has the following attributes:
1. `ent.text` - the text of the entity mention.
2. `ent.start_char` - the character offset at which the mention starts, with `0` being the first character of the document.
3. `ent.end_char` - the character offset just past the last character of the mention, so that `doc.text[ent.start_char:ent.end_char]` equals `ent.text`.
4. `ent.label_` - the category, such as ORG, GPE (geopolitical entity), MONEY, etc.

In [None]:
ner_details = []

for ent in doc.ents:
  ner_details.append((ent.text, ent.start_char, ent.end_char, ent.label_))

In [None]:
import pandas as pd

# for now, just for printing tabular data nicely ;)
pd.DataFrame(ner_details, columns=['TEXT', 'START', 'END', 'LABEL'])

    TEXT                   START    END  LABEL
--  -------------------  -------  -----  ------
 0  Leonard Simon Nimoy        0     19  PERSON
 1  March 26, 1931            32     46  DATE
 2  Irish                     54     59  NORP
 3  West End                  71     79  GPE
 4  Boston                    83     89  GPE
 5  Massachusetts             91    104  GPE
 6  Jewish                   109    115  NORP
 7  Iziaslav                 132    140  GPE
 8  Ukraine                  142    149  GPE
 9  Dora                     163    167  PERSON
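
The START and END values in the table above behave like Python slice bounds over the whole document text: START is the offset of the first character and END is the offset just past the last one. The sketch below checks this for the first sentence using plain string slicing; the rows are copied from the table, and no spaCy call is needed.

```python
text = (
    "Leonard Simon Nimoy was born on March 26, 1931, in an Irish section "
    "of West End of Boston, Massachusetts, to Jewish immigrants from Iziaslav, Ukraine."
)

# (TEXT, START, END, LABEL) rows copied from the table above.
rows = [
    ("Leonard Simon Nimoy", 0, 19, "PERSON"),
    ("March 26, 1931", 32, 46, "DATE"),
    ("Irish", 54, 59, "NORP"),
    ("West End", 71, 79, "GPE"),
    ("Boston", 83, 89, "GPE"),
    ("Massachusetts", 91, 104, "GPE"),
    ("Jewish", 109, 115, "NORP"),
    ("Iziaslav", 132, 140, "GPE"),
    ("Ukraine", 142, 149, "GPE"),
]

# Slicing with [START:END] recovers each entity text.
recovered = [text[start:end] for _, start, end, _ in rows]
for (_, start, end, label), mention in zip(rows, recovered):
    print(f"{label:<7}{start:>4}{end:>5}  {mention}")
```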


In [None]:
from spacy import displacy

# Let's use displacy to display the entities.
displacy.render(doc, style='ent', jupyter=True)

To look up the description of an entity label, use `spacy.explain(label)`.

In [None]:
spacy.explain('NORP')

'Nationalities or religious or political groups'

Inspecting the results of the NER and the dependency graph, we can observe the following patterns: nouns and noun phrases are identified as entities, while verbal, prepositional, and adjectival phrases signify relations.
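
The observation above suggests a very naive heuristic: treat the text between two consecutive entities as a candidate relation. The sketch below applies it to entity offsets copied from the NER table. It is an illustration under that assumption only, not the method used in this course, and `candidate_relations` is a hypothetical helper.

```python
text = (
    "Leonard Simon Nimoy was born on March 26, 1931, in an Irish section "
    "of West End of Boston, Massachusetts, to Jewish immigrants from Iziaslav, Ukraine."
)

# (entity text, start offset, end offset) copied from the NER table.
entities = [
    ("Leonard Simon Nimoy", 0, 19),
    ("March 26, 1931", 32, 46),
    ("Irish", 54, 59),
    ("West End", 71, 79),
    ("Boston", 83, 89),
    ("Massachusetts", 91, 104),
]


def candidate_relations(text, entities):
    """Pair consecutive entities with the words that stand between them."""
    candidates = []
    for (left, _, left_end), (right, right_start, _) in zip(entities, entities[1:]):
        between = text[left_end:right_start].strip(" ,")
        if between:  # skip pairs separated only by punctuation
            candidates.append((left, between, right))
    return candidates


candidates = candidate_relations(text, entities)
for subject, relation, obj in candidates:
    print(f"({subject}) --[{relation}]--> ({obj})")
```

Only the first candidate, *was born on*, reads like a meaningful relation. The others are fragments such as *in an* or *of*, which shows why the dependency graph is needed to decide which words actually connect two entities.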

In summary, syntactic analysis of texts is often straightforward. Unfortunately, the same cannot be said when it comes to semantics. The next section illustrates the complexity of understanding the meaning of words.