<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Introduction_to_Information_Extraction/Introduction_to_Information_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1> Introduction to Information Extraction </h1> </center>



![](https://www.tex-ai.com/wp-content/uploads/2020/10/NLP-techniques-for-information-extraction.jpg)

## Housekeeping
1. Check that the recording is on
2. Check audio and screenshare
3. Share link to notebook in chat
4. Light mode and readable font size

## What is information extraction?

Information extraction (IE) involves converting *unstructured or semi-structured information* from *machine readable documents* into *structured knowledge*, which can be queried to *automatically access* specific information *at scale*.

Broadly, it is the culmination of NLP, dataset creation, and search and retrieval.

It marks a move from processing data to incorporating knowledge into tasks.

### Terminology
- *Unstructured or semi-structured information*
- *Machine readable documents*
- *Structured knowledge*
- *Automatically access*
- *At scale*

## Processing the World Wide Web: an infinite knowledge source
- Millions of contributors, and aggregate agreement on the correctness of information
- Easy to train algorithms to access human-created information
- Large group of people who can agree on how entities are related to each other
- Semantic Web: formally representing web metadata, entities, and locations to utilize information in scalable ways

## Shallow vs Advanced Information extraction

- Regular expressions
  - Matching patterns (such as capitalized letters) or digits (money)
- Matching keywords against registries
  - Names
  - Geographical entities
  - Currency
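
The two shallow techniques listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production extractor: the patterns are simplified, and the `GAZETTEER` registry and the `shallow_extract` helper are hypothetical names introduced only for this example.

```python
import re

# Illustrative patterns; real systems need broader, locale-aware rules.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*")

# Hypothetical registry (gazetteer) mapping known names to an entity type.
GAZETTEER = {
    "New York City": "location",
    "Brooklyn": "location",
    "Bill de Blasio": "person",
}

def shallow_extract(text):
    """Return money amounts, capitalized spans, and registry matches found in text."""
    return {
        "money": MONEY.findall(text),
        "capitalized": CAPITALIZED.findall(text),
        "registry": {name: kind for name, kind in GAZETTEER.items() if name in text},
    }

sample = "Mayor Bill de Blasio said anyone in Brooklyn who resists could be fined up to $1,000."
result = shallow_extract(sample)
print(result["money"])
print(result["capitalized"])
print(result["registry"])
```

Note that the capitalization pattern breaks "Bill de Blasio" apart at the lowercase "de", while the registry lookup recovers the full name. Shallow methods are usually combined for this reason.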

With NLP, we can annotate text with far more information, and with LLMs we can add embeddings that better model the relationships between words in a document. This lets us move beyond extracting expected patterns toward processing documents in ways that make searching for information more effective.

## What can we extract?

### Named entity recognition (NER) or identification
- Aims at finding real-world objects in texts.
- Classifies them into predefined categories such as names of persons, organizations, locations, temporal expressions, products, etc.


Question: Is this an example best solved by shallow or deep IE?

### Quantities and monetary values
- Currency
- Stocks
- Number

### Entity Disambiguation and Term Evolution
- Same name may point to two different entities
- An entity may have a new name

Question: How does one differentiate information about the Bush administrations of two different presidents?

Question: How do we connect news about Twitter from 2010-2024?
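
One possible starting point for both questions is to use the date of the document as context. The sketch below is only an illustration of that idea: the `ALIASES` table is hand-written for this example, the `resolve` helper is a hypothetical name, and a real system would draw on a much larger knowledge base and on more context than the year alone.

```python
# (mention, canonical entity, first year, last year); None means "still current".
ALIASES = [
    ("Bush administration", "George H. W. Bush administration", 1989, 1993),
    ("Bush administration", "George W. Bush administration", 2001, 2009),
    ("Twitter", "Twitter / X", 2006, 2023),
    ("X", "Twitter / X", 2023, None),
]

def resolve(mention, year):
    """Return canonical entities whose validity range covers the document year."""
    return [
        canonical
        for surface, canonical, start, end in ALIASES
        if surface == mention and start <= year and (end is None or year <= end)
    ]

# Same name, different entities: the year of the document picks one.
print(resolve("Bush administration", 2003))
# Different names, same entity: both mentions map to one canonical entry.
print(resolve("Twitter", 2010) == resolve("X", 2024))
```

A mention that falls outside every range, such as "Bush administration" in 1997, returns an empty list, which signals that more context is needed.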

### Entity classes
- Categories assigned to an entity
- Determine its relationships with other entities

A given entity can belong to more than one class.

Question: What are some classes Mark Zuckerberg belongs to?

### Entity Relations
- Meaning building involves connecting concepts
- IE includes extracting those connections


# Creating the components of a knowledge base

Information extraction is thus motivated by the need to enrich a knowledge base that can be improved and added to. This happens by:

 - Selecting data sources (newspaper articles, reddit posts, question-answer datasets)
 - Extracting entities, classes, and relationships in the dataset
 - Consistently integrating and linking new information in the right place in an existing knowledge base
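
To make the three steps above concrete, here is a minimal sketch of a knowledge base held in memory. `TinyKnowledgeBase` and its methods are hypothetical names for this illustration; real knowledge bases use dedicated graph or triple stores, and "consistent integration" involves far more than the duplicate check shown here.

```python
from collections import defaultdict

class TinyKnowledgeBase:
    """Toy in-memory store of (head, relation, tail) facts; names are illustrative."""

    def __init__(self):
        self.classes = defaultdict(set)   # entity -> set of class labels
        self.facts = set()                # set of (head, relation, tail) triples

    def add_entity(self, entity, *class_labels):
        """Register an entity under one or more classes."""
        self.classes[entity].update(class_labels)

    def add_fact(self, head, relation, tail):
        """Add a triple; return False if it was already present (no duplicates)."""
        triple = (head, relation, tail)
        if triple in self.facts:
            return False
        self.facts.add(triple)
        return True

    def query(self, head=None, relation=None, tail=None):
        """Return facts matching every field that is not None, sorted for stable output."""
        return sorted(
            f for f in self.facts
            if (head is None or f[0] == head)
            and (relation is None or f[1] == relation)
            and (tail is None or f[2] == tail)
        )

kb = TinyKnowledgeBase()
kb.add_entity("Batman", "fictional character", "superhero")
kb.add_fact("Batman", "creator", "Bob Kane")
kb.add_fact("Batman", "creator", "Bill Finger")
kb.add_fact("Batman", "creator", "Bob Kane")   # duplicate, ignored
print(kb.query(head="Batman", relation="creator"))
```

Storing facts as triples keeps entities, classes, and relations separate, so new information can be linked to an existing entity instead of creating a second copy of it.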

# How do we carry out information extraction?

- Data processing:
  - Tokenization
  - Tagging
  - Stop word removal
  - Dependency parsing

- Using a language model to:
  - Carry out the data processing
  - Utilize tags to assess entities and relations
  - Make predictions and use them to extract information

- Processing the output
  - Human-readability and organization
  - Natural Language Understanding
  - Utilizing the extracted information for downstream tasks (search, text classification, summarization, returning related content)
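
The first data-processing steps can be illustrated without any NLP library. The sketch below covers only tokenization and stop word removal, using a deliberately tiny, hand-picked stop word list; tagging and dependency parsing need a trained model such as the spaCy pipeline used later in this notebook. The function names are assumptions made for this example.

```python
import re

# A deliberately tiny stop-word list; NLP libraries ship much larger ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was"}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop stop words and punctuation, keeping content-bearing tokens."""
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]

tokens = tokenize("Batman is the alias of Bruce Wayne, who resides in Gotham City.")
print(tokens)
print(remove_stop_words(tokens))
```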

## How do we measure the success of IE?
### TF/IDF
- Term Frequency-Inverse Document Frequency
- Evaluates the importance of a word in one given document, relative to a collection of documents.
- Helps find the exact documents relevant to an entity, when it is frequent in one specific document but rare across the entire dataset
- Also provides a measure for "common" entities
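
As a worked illustration, the sketch below computes TF-IDF with raw term frequency and `idf = log(N / df)`. This is one of several common variants, chosen here for simplicity; libraries often add smoothing or normalization. The three-document corpus is made up for the example.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF with raw term frequency and idf = log(N / df); one of several variants."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    "batman lives in gotham city".split(),
    "the mayor lives in new york city".split(),
    "measles cases rose in the city".split(),
]
print(tf_idf("batman", corpus[0], corpus))  # rare term: positive score
print(tf_idf("city", corpus[0], corpus))    # appears in every document: 0.0
```

A term that occurs in every document gets an inverse document frequency of `log(1) = 0`, which is how the measure flags "common" entities.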

### Precision
- The proportion of true positives among all positive (true and false) predictions made by a model.
- Measure of how well a model identifies relevant instances.

### Recall
- The proportion of true positives among all actual positive instances (true positives and false negatives).
- It indicates a model's ability to capture relevant cases.
- Focus on minimizing false negatives.
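
Both measures can be computed by comparing the set of extracted entities with a hand-annotated gold standard. The sketch below assumes exact matching on `(text, label)` pairs; the gold and predicted sets are hypothetical and exist only to show the arithmetic.

```python
def precision_recall(predicted, gold):
    """Compare predicted and gold-standard entity sets; return (precision, recall)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold annotations and model output as (text, label) pairs.
gold = {("New York City", "GPE"), ("Bill de Blasio", "PERSON"),
        ("Brooklyn", "GPE"), ("Tuesday", "DATE")}
predicted = {("New York City", "GPE"), ("Brooklyn", "GPE"), ("Williamsburg", "PERSON")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Here two of the three predictions are correct (precision 2/3), and two of the four gold entities were found (recall 2/4).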


# Why Information Extraction?

Information extraction is an essential step for downstream tasks such as text summarization, text classification, sentiment analysis, and finding causality.

# Working examples of information extraction

1. Information Collection
2. Process Data
3. Choosing the Right Model
4. Evaluation of the Model
5. Deploying Model in Production

# Entity extraction with SpaCy

Source: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

**Note for Cyverse users:**

In case you have issues running the cells with 'pip install...', do the following:
1.  Open a terminal window.
2. Run all the installation code lines.
3. Click on Kernel > Restart Kernel.

In [None]:
! pip install spacy
! python -m spacy download en_core_web_sm

In [None]:
import spacy

# Import and load the English language model for vocabulary, syntax & entities
import en_core_web_sm
nlp = en_core_web_sm.load()

# For visualizing detected entities, import displacy from spaCy:
from spacy import displacy

# For querying tags
from spacy import explain

In [None]:
nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i, i.label_, i.label) for i in nytimes.ents]
entities

In [None]:
displacy.render(nytimes, style = "ent",jupyter = True)

In [None]:
# Query tags:
import spacy
spacy.explain("GPE")

## Discussion

This small, lightweight language model can parse, tag, and extract entities relatively quickly. This example used the small English model; a different model size may give different results.

# Simple LLM-based relation extractor

For this example, we will extract relations as well as entities. The model used is REBEL, a simple but powerful relation extractor that outputs each relation as a triplet (Entity1, Relation, Entity2). Because we are using a pretrained model, this can run on a CPU.

Note: the output of the model is not human readable in this example.

Model source: https://huggingface.co/Babelscape/rebel-large

**Note for Cyverse users:**

In case you have issues running the cells with 'pip install...', do the following:
1.  Open a terminal window.
2. Run all the installation code lines.
3. Click on Kernel > Restart Kernel.

In [None]:
!pip install transformers



In [None]:
#Load model and format for creating the triplets:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Function to extract triplets from text:
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets
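
To see what the parser is doing, it helps to run it on a decoded string by hand. The snippet below is a condensed restatement of the same parsing logic, written so that it runs on its own without downloading the model; it is intended to behave the same way on well-formed output. The `raw` string is hand-written in the linearized format the parser expects (`<triplet> head <subj> tail <obj> relation`), not actual model output, and `parse_linearized` is a name introduced for this sketch.

```python
def parse_linearized(decoded):
    """Condensed restatement of the triplet parser, so this snippet runs on its own."""
    triplets, fields, current = [], {"t": "", "s": "", "o": ""}, None
    markers = {"<triplet>": "t", "<subj>": "s", "<obj>": "o"}

    def flush():
        # Emit a triplet once head, tail, and relation text are all present.
        if all(fields.values()):
            triplets.append({"head": fields["t"].strip(),
                             "type": fields["o"].strip(),
                             "tail": fields["s"].strip()})

    for special in ("<s>", "</s>", "<pad>"):
        decoded = decoded.replace(special, "")
    for token in decoded.split():
        if token in markers:
            current = markers[token]
            if current == "t":        # new head: close the previous triplet
                flush()
                fields.update(t="", s="", o="")
            elif current == "s":      # new tail for the same head
                flush()
                fields.update(s="", o="")
            # "<obj>" starts the relation text; nothing to reset here
        elif current:
            fields[current] += " " + token
    flush()
    return triplets

# Hand-written example in the linearized format, not real model output.
raw = "<s><triplet> Batman <subj> Bob Kane <obj> creator <subj> Bill Finger <obj> creator</s>"
for triplet in parse_linearized(raw):
    print(triplet)
```

One head entity can be followed by several `<subj> ... <obj> ...` pairs, which is why a single `<triplet>` marker here yields two triplets.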




In [None]:
# function for processing a text corpus:
def print_triplets(text):
  # Tokenize text
  model_inputs = tokenizer(text, max_length=1000, padding=True, truncation=True, return_tensors = 'pt')

  # Generate
  generated_tokens = model.generate(
      model_inputs["input_ids"].to(model.device),
      attention_mask=model_inputs["attention_mask"].to(model.device),
      **gen_kwargs,
  )

  # Extract text
  decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

  # Print triplets:
  for idx, sentence in enumerate(decoded_preds):
      print(f'Prediction triplets {idx}')
      [print(f"Entity1: {item['head']}\n Relation:{item['type']}\n Entity2:{item['tail']}") for item in extract_triplets(sentence)]
      break #we only print 1 prediction from the model, comment this for more predictions



In [None]:
# Example 1: relation extraction for a given sentence:
text1 = "Batman was created by the artist Bob Kane and writer Bill Finger,\
        and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939."
print_triplets(text1)

Prediction triplets 0
Entity1: Batman
 Relation:creator
 Entity2:Bob Kane
Entity1: Batman
 Relation:creator
 Entity2:Bill Finger


In [None]:
# function for processing a text corpus:
def extract_triplets_form_corpus(corpus):
  sents = corpus.split('.')
  while("" in sents):
    sents.remove("")
  print(f"No. of sentences: {len(sents)}")
  for n, sent in enumerate(sents):
      print(f"Sentence no. {n} to be processed: {sent}")
      # Tokenize text
      model_inputs = tokenizer(sent, max_length=1000, padding=True, truncation=True, return_tensors = 'pt')
      print(f"sent: {n} tokenized.")

      # Generate
      generated_tokens = model.generate(
          model_inputs["input_ids"].to(model.device),
          attention_mask=model_inputs["attention_mask"].to(model.device),
          **gen_kwargs,
      )
      print(f"model output for sent: {n} generated.")

      # Extract text
      decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)
      print(f"text from sent: {n} extracted.")

      # Print triplets:
      for idx, sentence in enumerate(decoded_preds):
          print(f'Prediction triplets {idx}')
          [print(f"Entity1: {item['head']}\n Relation:{item['type']}\n Entity2:{item['tail']}")
          for item in extract_triplets(sentence)]
          break #we only print 1 prediction
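
The corpus function above splits on every period, which also cuts text at abbreviations such as "c. 1434". A slightly more careful splitter, sketched below with the standard `re` module, splits only where sentence-ending punctuation is followed by whitespace and a capital letter. This is still a heuristic (it would wrongly split after "Dr." in "Dr. Smith"), and `split_sentences` is a hypothetical helper, not part of the pipeline above. The sample is a shortened version of the Leonardo da Vinci text used later.

```python
import re

def split_sentences(corpus):
    """Split after ., ! or ? only when whitespace and a capital letter follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", corpus.strip())
    return [p for p in parts if p]

sample = ("He was born to Piero da Vinci and Caterina di Meo Lippi (c. 1434–1494). "
          "He was an Italian polymath of the High Renaissance.")
for sentence in split_sentences(sample):
    print(sentence)

# For comparison: a plain split on "." produces extra fragments.
print(len([s for s in sample.split(".") if s]))
```

For robust sentence segmentation, a library component such as a spaCy pipeline is generally a better choice than a regular expression.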

In [None]:
# Text to extract triplets from
# Source: https://en.wikipedia.org/wiki/Batman
text = 'Batman, created by the artist Bob Kane and writer Bill Finger, is a superhero who appears in American comic books published by DC Comics \
        and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939.\
        In the DC Universe, Batman is the alias of Bruce Wayne, a wealthy American playboy, \
        philanthropist, and industrialist who resides in Gotham City.\
        His origin story features him swearing vengeance against criminals after witnessing the murder of his parents,\
        Thomas and Martha.'


In [None]:
extract_triplets_form_corpus(text)

No. of sentences: 3
Sentence no. 0 to be processed: Batman, created by the artist Bob Kane and writer Bill Finger, is a superhero who appears in American comic books published by DC Comics         and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939
sent: 0 tokenized.
model output for sent: 0 generated.
text from sent: 0 extracted.
Prediction triplets 0
Entity1: Batman
 Relation:creator
 Entity2:Bob Kane
Entity1: Batman
 Relation:instance of
 Entity2:superhero
Entity1: Batman
 Relation:inception
 Entity2:March 30, 1939
Entity1: Detective Comics
 Relation:publisher
 Entity2:DC Comics
Sentence no. 1 to be processed:         In the DC Universe, Batman is the alias of Bruce Wayne, a wealthy American playboy,         philanthropist, and industrialist who resides in Gotham City
sent: 1 tokenized.
model output for sent: 1 generated.
text from sent: 1 extracted.
Prediction triplets 0
Entity1: Batman
 Relation:residence
 Entity2:Gotham City
Sentence no. 2 to be pro

## Discussion

When we provide a text based on both real and fictitious events, our output connects entities with a variety of relations, including "subclass of", "spouse", and "residence".

Since "Batman" refers to both a fictitious person as well as a work of art, our relations extractor may offer all kinds of relations. We can use this pipeline to extract information based on our needs.

In [None]:
# Try this example on your own:
# Source: https://en.wikipedia.org/wiki/Leonardo_da_Vinci
text = 'Leonardo da Vinci was born on 15 April 1452 in, or close to, \
        the Tuscan hill town of Vinci, 20 miles from Florence.\
        He was born to Piero da Vinci a Florentine legal notary, and Caterina di Meo Lippi (c. 1434–1494),\
        from the lower class.\
        He was an Italian polymath of the High Renaissance who was active as a painter, draughtsman, \
        engineer, scientist, theorist, sculptor, and architect.'
extract_triplets_form_corpus(text)

# Automatic resume scraping with LLMs

How are applicant tracking systems used for collecting information from resumes?

Source: https://huggingface.co/foduucom/resume-extractor

This simple example uses an LLM to extract information from a resume PDF. It uses Ollama to access Llama 3.1, a large language model.

Note: This will take a long while, and requires a GPU to run in a reasonable timeframe.



In [None]:
## install and download ollama with dependencies:
## Not needed for Cyverse
# ! sudo apt-get install -y pciutils
# !curl https://ollama.ai/install.sh | sh
# !pip install ollama


In [None]:
# Install necessary libraries
!pip install ollama langchain_community pdfminer.six

In [None]:
# import necessary python libraries
import os
import threading
import subprocess
import requests
import json

def ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])


In [None]:
# start Ollama
ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()
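
Because the server starts in the background, the next cell can run before it is ready. A small polling helper, sketched below with only the standard library, can wait for it. The port comes from the `OLLAMA_HOST` setting above; that the server answers a plain HTTP request at its root URL is an assumption here, and `wait_for_server` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll url until it answers an HTTP request or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True           # any successful response means the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)      # not reachable yet: wait and retry
    return False
```

Calling `wait_for_server()` before pulling the model returns `True` once the server responds and `False` if it never does within the timeout, in which case it is worth checking that `ollama serve` actually started.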

In [None]:
!ollama pull llama3.1

In [None]:
# # run embedding model
# !ollama run llama3

In [None]:
from langchain_community.llms import Ollama

# Format for the extracted output:
json_content = """{{
    "name": "",
    "email" : "",
    "phone_1": "",
    "phone_2": "",
    "address": "",
    "city": "",
    "linkedin": "",
    "professional_experience_in_years": "",
    "highest_education": "",
    "is_fresher": "yes/no",
    "is_student": "yes/no",
    "skills": ["",""],
    "applied_for_profile": "",
    "education": [
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }},
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }}
    ],
    "professional_experience": [
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }},
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }}
    ]
}}"""

class InputData:
    # LLM Prompt
    def input_data(text):

        input = f"""Extract relevant information from the following resume text and fill the provided JSON template.
                    Ensure all keys in the template are present in the output,
                    even if the value is empty or unknown.
                    If a specific piece of information is not found in the text, use 'Not provided' as the value.

        Resume text:
        {text}

        JSON template:
        {json_content}

        Instructions:
        1. Carefully analyse the resume text.
        2. Extract relevant information for each field in the JSON template.
        3. If a piece of information is not explicitly stated, make a reasonable inference based on the context.
        4. Ensure all keys from the template are present in the output JSON.
        5. Format the output as a valid JSON string.

        Output the filled JSON template only, without any additional text or explanations."""

        return input
    # run LLM:
    def llm():
        llm = Ollama(model="llama3.1")
        return llm

# Process resume and print results:
from pdfminer.high_level import extract_text
import sys
sys.path.append("/content/resume-extractor/")

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

In [None]:
# Extraction:
text = extract_text_from_pdf(r'/content/anti-cv.pdf')

# InputData's methods take no `self`, so call them on the class directly
llm = InputData.llm()
data = llm.invoke(InputData.input_data(text))

print(data)
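
The model returns a string, and despite the prompt it may wrap the JSON in extra text or omit keys. A defensive parsing step is sketched below under those assumptions. `parse_llm_json` and `REQUIRED_KEYS` are hypothetical names, the key list is only a subset of the template above, and `fake_reply` is a hand-written stand-in for a model reply, not real output.

```python
import json

# A subset of the template keys, chosen for illustration.
REQUIRED_KEYS = ["name", "email", "skills", "education", "professional_experience"]

def parse_llm_json(raw, required_keys=REQUIRED_KEYS, default="Not provided"):
    """Extract the outermost {...} span from raw, parse it, and fill missing keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    record = json.loads(raw[start:end + 1])   # raises ValueError on invalid JSON
    for key in required_keys:
        record.setdefault(key, default)
    return record

# Hand-written stand-in for a model reply, including stray text around the JSON.
fake_reply = 'Here is the result:\n{"name": "Jane Doe", "skills": ["Python", "NLP"]}\nDone.'
record = parse_llm_json(fake_reply)
print(record["name"], record["email"])
```

Parsing into a dictionary also makes it possible to check the extracted fields against the source resume, which matters because the prompt explicitly allows the model to infer missing values.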

## Discussion

With more computing power and larger LLMs, we can automate the entire information extraction pipeline and work using simple prompts.

Our pipeline is able to take an opaque document (a PDF), extract the text from it, process it, and carry out an information extraction task, all in one shot, with very good results. However, running a model like this requires a GPU and more overhead.

# References

- https://nanonets.com/blog/information-extraction/
- https://www.geeksforgeeks.org/information-extraction-in-nlp/
- "Introduction to Information Extraction: Basic Notions and Current Trends"
