<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Introduction_to_Information_Extraction/Introduction_to_Information_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1> Introduction to Information Extraction </h1> </center>



![](https://www.tex-ai.com/wp-content/uploads/2020/10/NLP-techniques-for-information-extraction.jpg)

## Housekeeping
1. Check that the recording is on
2. Check audio and screenshare
3. Share link to notebook in chat
4. Light mode and readable font size

## What is information extraction?

Information extraction (IE) involves converting *unstructured or semi-structured information* from *machine readable documents* into *structured knowledge*, which can be queried to *automatically access* specific information *at scale*.

Broadly, it is the culmination of NLP, dataset creation, and search and retrieval.

It marks a move from processing data to incorporating knowledge into tasks.

### Terminology
- *Unstructured or semi-structured information*
- *Machine readable documents*
- *Structured knowledge*
- *Automatically access*
- *At scale*

## Processing the World Wide Web: an infinite knowledge source
- Millions of contributors, and aggregate agreement on the correctness of information
- Easy to train algorithms to access human-created information
- Large group of people who can agree on how entities are related to each other

### Shallow vs Advanced Information extraction

- Regular expressions
  - Matching patterns (such as capitalized letters) or digits (money)
- Matching keywords against registries
  - Names
  - Geographical entities
  - Currency
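
The two shallow techniques above can be sketched with the standard library. This is a minimal illustration, not a production extractor: the patterns are deliberately simple, and the `PLACES` registry is a hypothetical hand-made list.

```python
import re

# Pattern-based extraction: dollar amounts and runs of capitalized words.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b")

# Registry (gazetteer) lookup: a hypothetical, hand-made list of known places.
PLACES = {"New York City", "Brooklyn", "Williamsburg"}

def shallow_extract(text):
    """Return money amounts, capitalized spans, and registry matches found in text."""
    money = MONEY.findall(text)
    capitalized = CAPITALIZED.findall(text)
    places = sorted(p for p in PLACES if p in text)
    return {"money": money, "capitalized": capitalized, "places": places}

sample = ("New York City ordered vaccinations in Brooklyn. "
          "Anyone who resists could be fined up to $1,000.")
result = shallow_extract(sample)
print(result)
```

Note that the capitalization pattern also picks up the sentence-initial word "Anyone", which is not an entity. False positives like this are a typical limitation of shallow patterns.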

With the advent of NLP, we can annotate text with much more information, and with LLMs we can add embeddings that better model relationships between words in a document. This lets us move beyond extracting expected information toward processing documents in ways that make information easier to search.

## What can we extract?

### Named entity recognition (NER) or identification
- Aims at finding real-world objects in texts.
- Classifies them into predefined categories such as names of persons, organizations, locations, temporal expressions, products, etc.


Question: Is this an example best solved by shallow or advanced IE?

### Quantities and monetary values
- Currency
- Stocks
- Number

### Entity Disambiguation and Term Evolution
- Same name may point to two different entities
- An entity may have a new name

Question: How does one differentiate information about the Bush administrations of two different presidents?

Question: How do we connect news about Twitter from 2010-2024?
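
One possible starting point for both questions is sketched below, under loud assumptions: the `CANDIDATES` and `ALIASES` tables are hypothetical hand-made registries, and choosing an entity by document date is a crude heuristic (an article from 2005 can still discuss the earlier administration).

```python
from datetime import date

# Hypothetical registry: one surface form maps to several candidate entities,
# each with the date range in which a document most plausibly refers to it.
CANDIDATES = {
    "Bush administration": [
        ("George H. W. Bush administration", date(1989, 1, 20), date(1993, 1, 20)),
        ("George W. Bush administration", date(2001, 1, 20), date(2009, 1, 20)),
    ],
}

# Term evolution: older and newer names point to one canonical entity.
ALIASES = {"Twitter": "Twitter/X", "X (formerly Twitter)": "Twitter/X"}

def disambiguate(mention, doc_date):
    """Pick the candidate whose date range contains the document date, if any."""
    for entity, start, end in CANDIDATES.get(mention, []):
        if start <= doc_date <= end:
            return entity
    return None

def canonical(mention):
    """Map a mention to its canonical entity name, defaulting to the mention itself."""
    return ALIASES.get(mention, mention)

print(disambiguate("Bush administration", date(1991, 3, 1)))
print(canonical("Twitter") == canonical("X (formerly Twitter)"))
```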

### Entity classes
- Categories assigned to an entity
- Determine its relationships with other entities

A given entity can belong to more than one class.

Question: What are some classes Mark Zuckerberg belongs to?

### Entity Relations
- Meaning building involves connecting concepts
- IE includes extracting those connections


# Creating the components of a knowledge base

The goal of information extraction is thus to enrich a knowledge base that can be improved and added to over time. This happens by:

 - Selecting data sources (newspaper articles, Reddit posts, question-answer datasets)
 - Extracting entities, classes, and relationships from the dataset
 - Consistently integrating and linking new information in the right place in an existing knowledge base
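
To make the idea of a knowledge base concrete, here is a toy in-memory sketch. The `KnowledgeBase` class and its method names are hypothetical, chosen for illustration; real systems use graph databases or triple stores.

```python
from collections import defaultdict

class KnowledgeBase:
    """A toy in-memory store of entity classes and (head, relation, tail) triples."""

    def __init__(self):
        self.triples = set()              # deduplicated facts
        self.classes = defaultdict(set)   # entity -> set of classes

    def add_class(self, entity, entity_class):
        self.classes[entity].add(entity_class)

    def add_triple(self, head, relation, tail):
        # Adding the same fact twice leaves the store unchanged.
        self.triples.add((head, relation, tail))

    def query(self, head=None, relation=None, tail=None):
        """Return triples matching every field that is not None."""
        return sorted(
            t for t in self.triples
            if (head is None or t[0] == head)
            and (relation is None or t[1] == relation)
            and (tail is None or t[2] == tail)
        )

kb = KnowledgeBase()
kb.add_class("Mark Zuckerberg", "person")
kb.add_class("Mark Zuckerberg", "CEO")
kb.add_triple("Mark Zuckerberg", "founder of", "Facebook")
kb.add_triple("Mark Zuckerberg", "founder of", "Facebook")  # duplicate, ignored
kb.add_triple("Bruce Wayne", "resides in", "Gotham City")
print(kb.query(head="Mark Zuckerberg"))
```

Storing facts in a set means re-extracting the same fact from a new source does not create a duplicate, which is one small part of integrating new information consistently.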

# How do we carry out information extraction?

- Data processing:
  - Tokenization
  - Tagging
  - Stop word removal
  - Dependency parsing

- Choosing a language model trained to identify entities and relations

- Processing the output
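
Tokenization and stop word removal can be sketched with the standard library alone; tagging and dependency parsing need a trained model, such as the spaCy pipeline used later in this notebook. The stop word list here is a small hypothetical one, and the function names are illustrative.

```python
import re

# A tiny, hypothetical stop word list; real pipelines use a library-provided one.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "to", "is"}

def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop stop words and punctuation, comparing case-insensitively."""
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]

tokens = tokenize("The mayor declared an emergency in Brooklyn.")
content_tokens = remove_stop_words(tokens)
print(tokens)
print(content_tokens)
```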

## Some more concepts
### Precision
### Recall
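
Precision is the fraction of extracted items that are correct; recall is the fraction of correct items that were extracted. A minimal sketch, where the `gold` and `predicted` sets are made-up examples rather than real model output:

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up annotations: (entity text, label) pairs.
gold = {("New York City", "GPE"), ("Brooklyn", "GPE"),
        ("Bill de Blasio", "PERSON"), ("Tuesday", "DATE")}
predicted = {("New York City", "GPE"), ("Brooklyn", "GPE"),
             ("Williamsburg", "PERSON")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold entities were found (recall 1/2).
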
### TF/IDF
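
TF-IDF weights a term by how often it appears in a document (term frequency) and how rare it is across the collection (inverse document frequency). Many variants exist; the sketch below assumes one common unsmoothed form, tf × log(N / df), with whitespace tokenization and made-up documents.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "measles outbreak in brooklyn",
    "measles vaccinations ordered",
    "batman resides in gotham",
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "outbreak" scores higher than "measles" because "measles" also appears in the second document and is therefore less distinctive.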



# Why do we need this?

Essential for downstream tasks, such as text summarization, text classification and

# Working examples of information extraction

1. Information Collection
2. Process Data
3. Choosing the Right Model
4. Evaluation of the Model
5. Deploying Model in Production

# Entity extraction with spaCy

Source: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

In [None]:
! pip install spacy
! python -m spacy download en_core_web_sm
# import and load the English language model for vocabulary, syntax & entities
import en_core_web_sm
nlp = en_core_web_sm.load()

# import displacy from spaCy to visualize the detected entities

from spacy import displacy

In [None]:


nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

# Collect each entity span, its string label, and the label's integer ID
entities = [(ent, ent.label_, ent.label) for ent in nytimes.ents]
entities

In [None]:
displacy.render(nytimes, style="ent", jupyter=True)


In [None]:
# Query tags: look up the human-readable description of an entity label
import spacy
spacy.explain("GPE")

# Simple LLM-based relation extractor

This example uses REBEL, a relation extraction model that outputs each relation as a triplet (Entity1, Relation, Entity2).

Model source: https://huggingface.co/Babelscape/rebel-large

In [None]:
#Load model and format for creating the triplets:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 3,
}

# Extract triplets from text:
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
    return triplets



In [None]:
# function for generating and printing triplets for a piece of text:
def print_triplets(text):
  # Tokenize text
  model_inputs = tokenizer(text, max_length=1000, padding=True, truncation=True, return_tensors = 'pt')

  # Generate
  generated_tokens = model.generate(
      model_inputs["input_ids"].to(model.device),
      attention_mask=model_inputs["attention_mask"].to(model.device),
      **gen_kwargs,
  )

  # Extract text
  decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)

  # Print triplets:
  for idx, sentence in enumerate(decoded_preds):
      print(f'Prediction triplets {idx}')
      for item in extract_triplets(sentence):
          print(f"Entity1: {item['head']}\n Relation: {item['type']}\n Entity2: {item['tail']}")

In [None]:
# Text to extract triplets from
text = 'Batman, created by the artist Bob Kane and writer Bill Finger, is a superhero who appears in American comic books published by DC Comics \
        and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939.\
        In the DC Universe, Batman is the alias of Bruce Wayne, a wealthy American playboy, \
        philanthropist, and industrialist who resides in Gotham City.\
        His origin story features him swearing vengeance against criminals after witnessing the murder of his parents,\
        Thomas and Martha.'


In [None]:
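# Split on periods and skip empty fragments (e.g. after the final period)
sentences = [s.strip() for s in text.split('.') if s.strip()]
for i, sent in enumerate(sentences):
  print(f'Sentence being processed: {i}')
  print_triplets(sent)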

# Automatic resume scraping with LLMs

Source: https://huggingface.co/foduucom/resume-extractor

This example uses an LLM to extract information from a resume in PDF format.

A sobering example of how resumes get auto-rejected.

In [None]:
# install and download ollama with dependencies
! sudo apt-get install -y pciutils
!curl https://ollama.ai/install.sh | sh
!pip install ollama langchain_community pdfminer.six

# import necessary python libraries
import os
import threading
import subprocess
import requests
import json

def ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])


In [5]:
# start Ollama
ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()

In [None]:
# download the llama3 model; the Ollama server loads it on the first request
!ollama pull llama3

In [None]:
from langchain_community.llms import Ollama

json_content = """{{
    "name": "",
    "email" : "",
    "phone_1": "",
    "phone_2": "",
    "address": "",
    "city": "",
    "linkedin": "",
    "professional_experience_in_years": "",
    "highest_education": "",
    "is_fresher": "yes/no",
    "is_student": "yes/no",
    "skills": ["",""],
    "applied_for_profile": "",
    "education": [
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }},
        {{
            "institute_name": "",
            "year_of_passing": "",
            "score": ""
        }}
    ],
    "professional_experience": [
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }},
        {{
            "organisation_name": "",
            "duration": "",
            "profile": ""
        }}
    ]
}}"""


class InputData:
    @staticmethod
    def input_data(text):
        # Build the extraction prompt; named `prompt` to avoid shadowing the built-in input()
        prompt = f"""Extract relevant information from the following resume text and fill the provided JSON template.
                    Ensure all keys in the template are present in the output,
                    even if the value is empty or unknown.
                    If a specific piece of information is not found in the text, use 'Not provided' as the value.

        Resume text:
        {text}

        JSON template:
        {json_content}

        Instructions:
        1. Carefully analyse the resume text.
        2. Extract relevant information for each field in the JSON template.
        3. If a piece of information is not explicitly stated, make a reasonable inference based on the context.
        4. Ensure all keys from the template are present in the output JSON.
        5. Format the output as a valid JSON string.

        Output the filled JSON template only, without any additional text or explanations."""

        return prompt

    @staticmethod
    def llm():
        # Client for the llama3 model served by the local Ollama instance
        return Ollama(model="llama3")

In [None]:
from pdfminer.high_level import extract_text
import sys
sys.path.append("/content/resume-extractor/")

def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

text = extract_text_from_pdf(r'/content/anti-cv.pdf')

# Call the helpers on the InputData class (`input` is Python's built-in function)
llm = InputData.llm()
data = llm.invoke(InputData.input_data(text))

print(data)
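
The prompt asks for JSON only, but a model may still wrap its answer in extra text. A hedged sketch for turning the raw output into a dictionary: `parse_llm_json` and `REQUIRED_KEYS` are hypothetical names, the key subset is chosen for illustration, and `sample_output` is a made-up response rather than real model output.

```python
import json

# A subset of the template keys, chosen for illustration.
REQUIRED_KEYS = {"name", "email", "skills"}

def parse_llm_json(raw):
    """Parse the text between the first '{' and the last '}' as JSON.

    Returns None if no valid JSON object is found; fills missing
    required keys with 'Not provided', as the prompt requests.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    for key in REQUIRED_KEYS - parsed.keys():
        parsed[key] = "Not provided"
    return parsed

# Made-up model response with leading chatter before the JSON object.
sample_output = 'Here is the JSON:\n{"name": "Jane Doe", "email": "jane@example.com"}'
record = parse_llm_json(sample_output)
print(record)
```

In the notebook, `parse_llm_json(data)` would be applied to the string returned by `llm.invoke`.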

# References

- https://nanonets.com/blog/information-extraction/
- https://www.geeksforgeeks.org/information-extraction-in-nlp/
- "Introduction to Information Extraction: Basic Notions and Current Trends"
