<a href="https://colab.research.google.com/github/shahidulislamkhokon/ApacheSpark-PySpark-/blob/main/Information_Extrantion_from_Social_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Extraction for Social Science Research

Install spaCy and the other dependencies for this project.

In [None]:
!nvcc --version
!pip install --upgrade spacy
!pip install --upgrade spacy[cuda111,transformers]
!pip install jsonlines
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm

!wget https://andrewhalterman.com/files/cleaned_masdar.jsonl

I am going to use two techniques for information extraction: named entity recognition and rule-based extraction using dependency parses. The plan is to:

- get started with some hands-on named entity recognition
- step back and discuss information extraction and structured prediction at a higher level
- return to NER with some applications on real text
- next, use dependency parses and custom rules as a technique for information extraction
- conclude with some thoughts about extensions

Getting started with **NER** and **spaCy**

In [None]:
import re

import jsonlines
from tqdm.autonotebook import tqdm

import spacy
from spacy import displacy
# assert spacy.__version__ == "3.1.3"

spaCy needs a pretrained model to process documents. The English models come in several sizes:
- large model (`en_core_web_lg`): includes pretrained word vectors
- small model (`en_core_web_sm`): no pretrained word vectors
- trf model (`en_core_web_trf`): transformer-based

I am using the large model (`en_core_web_lg`) and also the small model (`en_core_web_sm`) for comparison.

In [None]:
nlp = spacy.load("en_core_web_lg")
nlp_sm = spacy.load("en_core_web_sm")

Next, I'll load in a collection of news stories from a local **pro-government newspaper** in **Syria** called al-Masdar. The articles here primarily describe the **civil war in Syria in 2016 and 2017**.

In [None]:
with jsonlines.open("cleaned_masdar.jsonl", "r") as f:
    articles = list(f.iter())


In [None]:
# for i in range (len(articles)):
#   print(articles[i])
# print(len(articles))

In [None]:
article = articles[100]
article

To process a document with **spaCy**, I'll pass a piece of text to the **nlp object** instantiated earlier. The **nlp object** returns a **Doc object**, which has both document-level and token-level attributes.

In [None]:
doc = nlp(article['body'])
print(doc)
# check how many tokens are in the document
len(doc)

In [None]:
# look at document-level attributes
dir(doc)

In [None]:
# tokens in a document can be accessed by their index:
print(doc[34])
# dir(doc[5])

One of the attributes it assigns is **named entity** information for the document. Using **spaCy's built-in visualizer**, we can see all the detected named entities in the document.

In [None]:
displacy.render(doc, style="ent", jupyter=True)

In [None]:
just_text = [i['body'] for i in articles]
docs = list(tqdm(nlp.pipe(just_text), total=len(just_text)))
print(docs)

# Information Extraction and Structured Prediction

In [None]:
for i in doc[0:]:
  print(i)

In [None]:

[(i.text, i.ent_iob_ + "-" + i.ent_type_) for i in doc[0:]]
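
The IOB tags listed above label each token individually; turning them back into entities means grouping consecutive `B`/`I` tokens that share a label. The following is a minimal sketch of that grouping in plain Python. The function name `iob_to_spans` and the sample sentence are illustrative assumptions, not part of spaCy, which already exposes the grouped spans as `doc.ents`.

```python
def iob_to_spans(tokens, tags):
    """Group (token, IOB-tag) pairs into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        # tags look like "B-ORG", "I-ORG", or "O-" (empty label outside entities)
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # close any open span before starting a new one or moving outside
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label
    # flush a span that runs to the end of the sequence
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# hypothetical tokens and tags in the same "IOB-TYPE" format as the cell above
tokens = ["The", "Syrian", "Arab", "Army", "entered", "Aleppo", "."]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O-", "B-GPE", "O-"]
print(iob_to_spans(tokens, tags))
# [('The Syrian Arab Army', 'ORG'), ('Aleppo', 'GPE')]
```

This is why NER is treated as structured prediction: the model predicts one tag per token, but the tags only make sense as a sequence.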


# Models for NER

Let's identify which organizations are mentioned most in our corpus.

In [None]:
from collections import Counter

all_orgs = []
for d in docs:
    orgs = [ent.text for ent in d.ents if ent.label_ == "ORG"]
    all_orgs.extend(orgs)
    
Counter(all_orgs).most_common(15)
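
Raw entity strings can differ in small ways (a leading "the", extra whitespace) that split one organization across several counter keys. A minimal sketch of normalizing mentions before counting is shown below; `normalize_mention` and the sample list are hypothetical and stand in for the `ent.text` values collected above.

```python
from collections import Counter


def normalize_mention(text):
    """Collapse whitespace and drop a leading 'the' so variants count together."""
    words = text.split()
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)


# hypothetical mentions, not taken from the al-Masdar corpus
sample_mentions = [
    "the Syrian Arab Army",
    "Syrian Arab Army",
    "ISIS",
    "The  Syrian Arab Army",
    "ISIS",
]
counts = Counter(normalize_mention(m) for m in sample_mentions)
print(counts.most_common(2))
# [('Syrian Arab Army', 3), ('ISIS', 2)]
```

Whether this kind of normalization is appropriate depends on the corpus; it is worth inspecting the raw counts first.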

Which organizations are mentioned most alongside mentions of "ceasefires" or "negotiations"?

In [None]:
negotiation_orgs = []
for d in docs:
    for ent in d.ents:
        if ent.label_ != "ORG":
            continue
        if re.search("negotiat|ceasefire|talks", ent.sent.text):
            negotiation_orgs.append(ent.text)
            
negotiation_orgs            
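
The pattern `"negotiat|ceasefire|talks"` is case-sensitive and matches anywhere inside a word, so it misses a sentence-initial "Ceasefire" and would match "stalks". One possible variant, sketched here with word boundaries and `re.IGNORECASE`, is shown below; the names `NEGOTIATION_RE` and `mentions_negotiation` are illustrative assumptions.

```python
import re

# \b anchors the match to word boundaries; IGNORECASE also catches
# capitalized forms such as "Negotiations" or "Ceasefire"
NEGOTIATION_RE = re.compile(r"\b(negotiat\w*|ceasefires?|talks)\b", re.IGNORECASE)


def mentions_negotiation(sentence):
    """Return True if the sentence contains one of the negotiation keywords."""
    return NEGOTIATION_RE.search(sentence) is not None


print(mentions_negotiation("Ceasefire talks resumed in Geneva."))  # True
print(mentions_negotiation("The stalks were cut."))                # False
```

In the loop above, `mentions_negotiation(ent.sent.text)` could replace the `re.search` call.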

# Dependency Parses

In [None]:
doc = nlp(articles[313]['body'])
sent = list(doc.sents)[1]
displacy.render(sent, style="dep", jupyter=True)

In [None]:
# doc = "Al-Aqsa Mosque is located in close proximity to various historical and holy sites in Judaism and Christianity, most notably that of the Temple in Jerusalem."
# doc = nlp(doc)
# print(doc)
# # take a look at how many words in a document
# len(doc)
# displacy.render(doc, style="ent", jupyter=True)

## Example information extraction with dependency parses
On its own, a dependency parse doesn't give you the ability to extract information from documents. That said, the information within a dependency parse can help you build a rule-based approach for extracting information.

One thing we might want to be able to extract from text is generally what kinds of behaviors or actions are occurring in a particular location. Let's write a function to identify verbs + direct objects that are grammatically linked to a location.

In [None]:
print(doc)
tok = doc[21]
print("tok: ",tok)


def loc_to_verb(tok):
    verb_phrase = []
    # first, iterate through all the ancestors of the token
    for i in tok.ancestors:
        # when you get to a verb (using a POS tag)...
        if i.pos_ == "VERB":
            # ...add the verb to the verb phrase list
            verb_phrase.append(i)
            # then, also add the direct object(s) of the verb, as long as the original token
            # is in the same subtree as the direct object
            verb_phrase.extend([j for j in i.children if j.dep_ == "dobj" and tok in i.subtree])
            # we only want the first verb, so stop after we find one
            print("verb_phrase", verb_phrase)
            break
    # expand out the verb phrase to get modifiers ("amod") of the direct object
    for i in verb_phrase:
        for j in i.children:
            if j.dep_ == "amod":
                verb_phrase.append(j)

    # sort the tokens by their position in the original sentence
    new_list = sorted(verb_phrase, key=lambda x: x.i)
    # join them together with the correct whitespace and return
    return ''.join([i.text_with_ws for i in new_list]).strip()

loc_to_verb(tok)
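
To make the traversal in `loc_to_verb` concrete without loading a model, here is a toy sketch of the same idea on a hand-annotated parse. `ToyToken`, `ancestors`, and `first_verb_with_dobj` are hypothetical stand-ins for spaCy's `Token`, `Token.ancestors`, and the function above; the parse itself is written by hand, not produced by spaCy.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    text: str
    pos: str
    dep: str
    head: int  # index of the syntactic parent; the root points to itself


def ancestors(tokens, idx):
    """Yield indices of the parents of tokens[idx], nearest first."""
    while tokens[idx].head != idx:
        idx = tokens[idx].head
        yield idx


def first_verb_with_dobj(tokens, idx):
    """Return the nearest ancestor verb plus its direct objects, in sentence order."""
    for a in ancestors(tokens, idx):
        if tokens[a].pos == "VERB":
            dobjs = [j for j, t in enumerate(tokens) if t.head == a and t.dep == "dobj"]
            return " ".join(tokens[j].text for j in sorted([a] + dobjs))
    return ""


# hand-annotated parse of "Rebels shelled neighborhoods in Aleppo"
toy = [
    ToyToken("Rebels", "NOUN", "nsubj", 1),
    ToyToken("shelled", "VERB", "ROOT", 1),
    ToyToken("neighborhoods", "NOUN", "dobj", 1),
    ToyToken("in", "ADP", "prep", 2),
    ToyToken("Aleppo", "PROPN", "pobj", 3),
]
print(first_verb_with_dobj(toy, 4))  # shelled neighborhoods
```

Starting from "Aleppo", the walk goes up through "in" and "neighborhoods" before reaching the verb "shelled", then collects that verb's direct object.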

We can then use our function to identify all the actions related to a single city, Aleppo.

In [None]:
aleppo_actions = []

for d in docs:
    for i in d:
        if i.text == "Aleppo":
            aleppo_actions.append(loc_to_verb(i))

sorted(list(set(aleppo_actions)))

In [None]:

def clean_phrase(subtree):
    """Sort and join tokens into a string"""
    new_list = sorted(list(subtree), key=lambda x: x.i)
    return ''.join([i.text_with_ws for i in new_list])


for i in doc:
    # Find instances of the word "backed" that play the role of an adjectival modifier
    if i.text == "backed" and i.dep_ == "amod":
        # The children of "backed" will report who the backer is
        print("Backer: ", clean_phrase(i.children))
        # Next, we go up one level to the immediate parent of the word "backed"
        parent = list(i.ancestors)[0]
        branches = [parent]
        # for each of the children of that word, except for the original "backed" token,
        # add it to the branch
        for j in parent.children:
            if j != i:
                branches.append(j)

        print(branches)
    


# Extensions and Experiments

In [None]:
!python -m spacy download en_core_web_trf

nlp_trf = spacy.load("en_core_web_trf")

doc = nlp_trf(articles[313]['body'])
displacy.render(doc, style="ent", jupyter=True)

In [None]:
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"

hugg = pipeline('question-answering', model=model_name, tokenizer=model_name)

In [None]:
QA_input = {
    'question': "Who controls Deir Hafer and Al-Bab?",
    'context': sent.text
}
res = hugg(QA_input)

print(res)
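
The question-answering pipeline returns a dictionary with `score`, `start`, `end`, and `answer` keys, and it always proposes some span even when the context does not contain a good answer. A minimal sketch of filtering on the score follows; `confident_answer`, the 0.5 threshold, and the example dictionary are assumptions for illustration, not recommended values or real model output.

```python
def confident_answer(result, min_score=0.5):
    """Return the answer text if the model's score clears the threshold, else None."""
    if result.get("score", 0.0) >= min_score:
        return result["answer"].strip()
    return None


# hypothetical result with the same keys as the pipeline output
example = {"score": 0.12, "start": 0, "end": 4, "answer": "ISIS"}
print(confident_answer(example))  # None
```

Calling `confident_answer(res)` on the output above would keep only answers the model is reasonably sure about; the right threshold has to be chosen by inspecting results on the corpus.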