<a href="https://colab.research.google.com/github/shahidulislamkhokon/NLP/blob/main/Information_Extrantion_from_Social_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Extraction for Social Science Research

Install spaCy and the other dependencies for this project.

In [None]:
!nvcc --version
!pip install --upgrade spacy
!pip install --upgrade spacy[cuda111,transformers]
!pip install jsonlines
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm

!wget https://andrewhalterman.com/files/cleaned_masdar.jsonl

I am going to use two techniques for information extraction: named entity recognition and rule-based extraction using dependency parses. The plan is to:

- get started with some hands-on named entity recognition
- step back and discuss information extraction and structured prediction at a higher level
- return to NER with some applications on real text
- next, use dependency parses and custom rules as a technique for information extraction
- conclude with some thoughts about extensions

Getting started with **NER** and **spaCy**

In [None]:
import re

import jsonlines
import spacy
from spacy import displacy
from tqdm.autonotebook import tqdm
# assert spacy.__version__ == "3.1.3"

spaCy needs a pretrained pipeline to process documents. The English pipelines come in several sizes:
- large model (`en_core_web_lg`): includes pretrained word vectors
- small model (`en_core_web_sm`): no pretrained word vectors
- trf model (`en_core_web_trf`): transformer-based

I am using the large model (`en_core_web_lg`) and, for comparison, the small model (`en_core_web_sm`).

In [None]:
nlp = spacy.load("en_core_web_lg")
nlp_sm = spacy.load("en_core_web_sm")

Next, I'll load a collection of news stories from a local **pro-government newspaper** in **Syria** called al-Masdar. The articles primarily describe the **civil war in Syria in 2016 and 2017**.

In [None]:
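# each line of the file is one JSON object (one article)
with jsonlines.open("cleaned_masdar.jsonl", "r") as f:
    articles = list(f.iter())
# show one article rather than dumping the whole corpus
print(articles[0])
print(len(articles))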
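
For reference, JSON Lines is simply one JSON object per line, so the same load can be sketched with the standard library alone. This is a minimal illustration, not the `jsonlines` implementation: `read_jsonl` is a hypothetical helper, and the two sample records are made up (only the `body` key comes from this notebook).

```python
import io
import json


def read_jsonl(handle):
    """Parse one JSON object per non-empty line of an open text file."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical two-record sample standing in for cleaned_masdar.jsonl.
sample = io.StringIO(
    '{"body": "First story."}\n'
    '{"body": "Second story."}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["body"])
```

With the real file, the same helper would be called on `open("cleaned_masdar.jsonl", encoding="utf-8")`.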

In [None]:
article = articles[500]
article

To process a document with **spaCy**, I'll use the **nlp object** I instantiated earlier and pass a piece of text to it. The **nlp object** returns a **`Doc` object**, which has both document-level and token-level attributes.

In [None]:
doc = nlp(article['body'])

# take a look at how many tokens are in the document
len(doc)

In [None]:
# look at the document-level attributes
dir(doc)

In [None]:
# tokens in a document can be accessed by their index:
print(doc[5])
dir(doc[5])

One of the attributes it assigns is **named entity** information for the document. Using **spaCy's built-in visualizer**, we can see all the detected named entities in the document.

In [None]:
displacy.render(doc, style="ent", jupyter=True)

In [None]:
just_text = [i['body'] for i in articles]
docs = list(tqdm(nlp.pipe(just_text), total=len(just_text)))
# report how many documents were processed rather than printing every Doc
print(len(docs))

# Information Extraction and Structured Prediction

In [None]:
for token in doc:
    print(token)

In [None]:
# pair each of the first 30 tokens with its IOB code and entity type, e.g. "B-ORG"
[(token.text, token.ent_iob_ + "-" + token.ent_type_) for token in doc[0:30]]
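
The token-level tags above encode entity spans: `B` begins an entity, `I` continues it, and `O` is outside any entity. As a rough sketch of how such tags map back to spans, the hypothetical helper `iob_to_spans` below groups them by hand. The tagged tokens are hand-written for illustration, not model output, and in practice spaCy already exposes the spans through `doc.ents`.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) spans."""
    spans = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == "B" or (iob == "I" and current_label is None):
            # A new entity starts here; close any entity that is still open.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I":
            # Continue the entity that is currently open.
            current_words.append(text)
        else:
            # "O" (or an empty tag): outside any entity, so close the open one.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


# Hand-written example tags (an assumption, not real pipeline output).
tagged = [
    ("The", "O", ""),
    ("Syrian", "B", "ORG"),
    ("Arab", "I", "ORG"),
    ("Army", "I", "ORG"),
    ("advanced", "O", ""),
    ("in", "O", ""),
    ("Aleppo", "B", "GPE"),
    (".", "O", ""),
]
spans = iob_to_spans(tagged)
print(spans)  # [('Syrian Arab Army', 'ORG'), ('Aleppo', 'GPE')]
```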


# Models for NER

Let's identify which organizations are mentioned most often in our corpus.

In [None]:
from collections import Counter

all_orgs = []
for d in docs:
    orgs = [ent.text for ent in d.ents if ent.label_ == "ORG"]
    all_orgs.extend(orgs)

Counter(all_orgs).most_common(15)
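
Counting exact strings treats "the Syrian Arab Army" and "Syrian Arab Army" as different organizations. One possible refinement, sketched here under my own assumptions, is to normalize the surface form before counting. `normalize_org` is a hypothetical helper and the input list is made up for illustration; real corpora would likely need more rules than this.

```python
from collections import Counter


def normalize_org(name):
    """Drop a leading 'the' (any casing) and collapse repeated whitespace."""
    words = name.split()
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)


# Hypothetical surface forms standing in for the ORG entity texts.
raw_orgs = [
    "the Syrian Arab Army",
    "Syrian Arab Army",
    "ISIS",
    "The  Syrian Arab Army",
    "ISIS",
]
org_counts = Counter(normalize_org(org) for org in raw_orgs)
print(org_counts.most_common(2))  # [('Syrian Arab Army', 3), ('ISIS', 2)]
```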

Which organizations are mentioned most alongside mentions of "ceasefires" or "negotiations"?

In [None]:
negotiation_orgs = []
for d in docs:
    for ent in d.ents:
        if ent.label_ != "ORG":
            continue
        if re.search("negotiat|ceasefire|talks", ent.sent.text):
            negotiation_orgs.append(ent.text)
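# count the collected mentions to see which organizations appear most often
Counter(negotiation_orgs).most_common(15)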
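
The pattern in the cell above is case-sensitive and unanchored, so it misses a sentence-initial "Ceasefire" and would match "talks" inside a word such as "stalks". A slightly stricter variant, sketched here as one option, uses word boundaries and `re.IGNORECASE`. The `(organization, sentence)` pairs are hypothetical stand-ins for `(ent.text, ent.sent.text)`, and `KEYWORDS` is a name I introduce for illustration.

```python
import re
from collections import Counter

# Case-insensitive; \b keeps "talks" from matching inside words like "stalks".
KEYWORDS = re.compile(r"\b(negotiat\w*|ceasefires?|talks)\b", re.IGNORECASE)

# Hypothetical (organization, sentence) pairs, not taken from the corpus.
mentions = [
    ("UN", "Ceasefire talks resumed under the UN."),
    ("UN", "The UN delivered aid to the city."),
    ("HNC", "The HNC rejected further negotiations."),
    ("UN", "The UN called for a new ceasefire."),
]
# Keep only organizations whose sentence contains one of the keywords.
keyword_counts = Counter(org for org, sent in mentions if KEYWORDS.search(sent))
print(keyword_counts.most_common())  # [('UN', 2), ('HNC', 1)]
```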