# **The sixth in-class-exercise (20 points in total, 10/14/2020)**

## **1. Rule-based information extraction (10 points)**

Search Google Scholar using any keywords related to data science, natural language processing, or machine learning, and collect the **titles** of 100 articles on the topic (either by web scraping or manually). Then define a set of patterns to extract the research questions/problems, methods/algorithms/models, datasets, applications, or any other important information about the topic.
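As a starting point, the pattern set could be a handful of regular expressions keyed by the kind of information they capture. The sketch below is only an illustration: the `PATTERNS` dictionary, the `extract_from_title` helper, and the two sample titles are assumptions made up for this example, and real titles will need more (and more careful) patterns.

```python
import re

# Hypothetical pattern set: each label maps to a regex with one capture group.
PATTERNS = {
    # "... using <method>", "... based on <method>", "... via <method>", "... with <method>"
    "method": re.compile(
        r"\b(?:using|based on|via|with)\s+(.+?)(?=\s+(?:for|on|in)\b|[:.]|$)", re.I
    ),
    # "... for <application>"
    "application": re.compile(
        r"\bfor\s+(.+?)(?=\s+(?:using|based on|via|with)\b|[:.]|$)", re.I
    ),
    # "... on the <name> dataset/corpus/benchmark"
    "dataset": re.compile(
        r"\bon\s+(?:the\s+)?(.+?)\s+(?:dataset|corpus|benchmark)\b", re.I
    ),
}


def extract_from_title(title):
    """Return {label: [matches]} for every pattern that fires on the title."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = [m.group(1).strip() for m in pattern.finditer(title)]
        if matches:
            found[label] = matches
    return found


# Made-up titles, used only to exercise the patterns.
titles = [
    "Sentiment analysis for product reviews using convolutional neural networks",
    "Named entity recognition on the CoNLL corpus with conditional random fields",
]
for title in titles:
    print(title, "->", extract_from_title(title))
```

Keeping the patterns in one dictionary makes it easy to add a new category (for example, research problems) without touching the extraction loop.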

In [None]:
from IPython.display import Image
Image('https://raw.githubusercontent.com/unt-iialab/INFO5731_Spring2020/master/Interesting_Code/rule-based.png')

In [9]:
import re 
import string 
import nltk 
import spacy 
import pandas as pd 
import numpy as np 
import math 
from tqdm import tqdm 

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 

pd.set_option('display.max_colwidth', 100)
# load spaCy model
nlp = spacy.load("en_core_web_sm")

In [13]:
# sample text 
text = "The practice of data science often includes the use of cloud computing products." 

# create a spaCy object 
doc = nlp(text)

In [14]:
# print token, dependency, POS tag 
for tok in doc: 
    print(tok.text, "-->", tok.dep_, "-->", tok.pos_)

The --> det --> DET
practice --> nsubj --> NOUN
of --> prep --> ADP
data --> compound --> NOUN
science --> pobj --> NOUN
often --> advmod --> ADV
includes --> ROOT --> VERB
the --> det --> DET
use --> dobj --> NOUN
of --> prep --> ADP
cloud --> compound --> NOUN
computing --> compound --> NOUN
products --> pobj --> NOUN
. --> punct --> PUNCT
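One simple rule that can be built on this parse is to join each run of `compound` tokens with the noun that follows it, which yields candidate multi-word terms. The sketch below works on the triples printed above rather than on a live spaCy `Doc`, so it runs without the model; the `compound_terms` helper is a hypothetical name, and the rule assumes that compound modifiers sit directly before their head noun, which holds for this sentence but not for every parse.

```python
# (text, dependency label, POS tag) triples copied from the parse shown above
parsed = [
    ("The", "det", "DET"), ("practice", "nsubj", "NOUN"), ("of", "prep", "ADP"),
    ("data", "compound", "NOUN"), ("science", "pobj", "NOUN"),
    ("often", "advmod", "ADV"), ("includes", "ROOT", "VERB"),
    ("the", "det", "DET"), ("use", "dobj", "NOUN"), ("of", "prep", "ADP"),
    ("cloud", "compound", "NOUN"), ("computing", "compound", "NOUN"),
    ("products", "pobj", "NOUN"), (".", "punct", "PUNCT"),
]


def compound_terms(tokens):
    """Join each run of `compound` tokens with the noun that follows it."""
    terms, buffer = [], []
    for text, dep, pos in tokens:
        if dep == "compound":
            # Collect modifiers until the head noun shows up.
            buffer.append(text)
        elif buffer and pos in ("NOUN", "PROPN"):
            terms.append(" ".join(buffer + [text]))
            buffer = []
        else:
            # Anything else breaks the run.
            buffer = []
    return terms


print(compound_terms(parsed))  # ['data science', 'cloud computing products']
```

With a real `Doc`, the same triples can be produced as `[(tok.text, tok.dep_, tok.pos_) for tok in doc]`.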


## **2. Domain-specific information extraction (10 points)**

For the legal case used in the data cleaning exercise, [01-05-1 Adams v Tanner.txt](https://github.com/unt-iialab/INFO5731_FALL2020/blob/master/In_class_exercise/01-05-1%20%20Adams%20v%20Tanner.txt), use [LexNLP](https://lexpredict-lexnlp.readthedocs.io/en/latest/modules/extract/extract.html#nlp-based-extraction-methods) to extract the following information from the text (if a type of information does not exist in the text, just print None):

(1) acts, e.g., “section 1 of the Advancing Hope Act, 1986”

(2) amounts, e.g., “ten pounds” or “5.8 megawatts”

(3) citations, e.g., “10 U.S. 100” or “1998 S. Ct. 1”

(4) companies, e.g., “Lexpredict LLC”

(5) conditions, e.g., “subject to …” or “unless and until …”

(6) constraints, e.g., “no more than”

(7) copyright, e.g., “(C) Copyright 2000 Acme”

(8) courts, e.g., “Supreme Court of New York”

(9) CUSIP, e.g., “392690QT3”

(10) dates, e.g., “June 1, 2017” or “2018-01-01”

(11) definitions, e.g., “Term shall mean …”

(12) distances, e.g., “fifteen miles”

(13) durations, e.g., “ten years” or “thirty days”

(14) geographic and geopolitical entities, e.g., “New York” or “Norway”

(15) money and currency usages, e.g., “$5” or “10 Euro”

(16) percents and rates, e.g., “10%” or “50 bps”

(17) PII, e.g., “212-212-2121” or “999-999-9999”

(18) ratios, e.g., “3:1” or “four to three”

(19) regulations, e.g., “32 CFR 170”

(20) trademarks, e.g., “MyApp (TM)”

(21) URLs, e.g., “http://acme.com/”

(22) addresses, e.g., “1999 Mount Read Blvd, Rochester, NY, USA, 14615”

(23) persons, e.g., “John Doe”
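Since the task asks to print None whenever a type of information is absent, a small wrapper can apply that rule uniformly to every extractor. The sketch below is an assumption about how this could be organised: `extract_or_none` is a hypothetical helper, and `find_years` is a stand-in extractor written only for this example (it is not part of LexNLP) so that the block runs without the library installed.

```python
import re


def extract_or_none(extractor, text):
    """Run an extractor and return its results as a list, or None if empty."""
    results = list(extractor(text))
    return results if results else None


# Stand-in extractor for illustration only (not part of LexNLP):
# yields every four-digit number that starts with 1.
def find_years(text):
    return (m.group(0) for m in re.finditer(r"\b1[0-9]{3}\b", text))


print(extract_or_none(find_years, "Synopsis 1843"))   # ['1843']
print(extract_or_none(find_years, "no year here"))    # None
```

Because the wrapper calls `list(...)` on the result, it should work the same way for extractors that return a list and for those that return a generator, so a function such as `lexnlp.extract.en.dates.get_dates` could be passed in place of `find_years`.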

In [None]:
import lexnlp.extract.en.acts
text = "Synopsis 1843"
print(lexnlp.extract.en.acts.get_act_list(text))
[{'location_start': None,
  'location_end': None,
  'section': 1,
  'year': 1843,
  'ambiguous': None,
  'act_name': 'Act',
  'value': None}]



In [None]:
>>> import lexnlp.extract.en.amounts
>>> text = "the sum of thirty-seven hundred and seventy-seven 80-100 dollars"
>>> print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[3700, 77.80]

In [None]:
>>> import lexnlp.extract.en.citations
>>> text = None
>>> print(list(lexnlp.extract.en.citations.get_citations(text)))
[(10, None, None, None, None, None, None)]

In [None]:
>>> import lexnlp.extract.en.entities.nltk_re

>>> text = None
>>> print(list(lexnlp.extract.en.entities.nltk_re.get_companies(text)))
[(None, None, None)]

In [None]:
>>> import lexnlp.extract.en.conditions
>>> text = "unless they should be relieved from their engagements as indorsers."
>>> print(list(lexnlp.extract.en.conditions.get_conditions(text)))
[('unless', None, None)]

In [None]:
>>> import lexnlp.extract.en.constraints
>>> text = "the levy had nothing more than a mere equitable right to redeem the cotton by paying the debts indorsed by the claimants."
>>> print(list(lexnlp.extract.en.constraints.get_constraints(text)))
[('more than', None, None)]

In [None]:
>>> import lexnlp.extract.en.copyright
>>> text = "(C) Copyright 2019 Thomson Reuters"
>>> print(list(lexnlp.extract.en.copyright.get_copyright(text)))
[('Copyright', '2019', 'Thomson Reuters')]

In [None]:
>>> # Manually set court configuration data
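>>> import lexnlp.extract.en.courts
>>> # entity_config must be imported before use (module path as in older LexNLP releases)
>>> from lexnlp.extract.en.dict_entities import entity_config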
>>> text = "None"
>>> court_config_data = [entity_config(0, None, 0, None),
    entity_config(1, "None", 0, ["None"])]
>>> for entity, alias in lexnlp.extract.en.courts.get_courts(text, court_config_data):
    print("entity=", entity)
    print("alias=", alias)
entity= (0, None, 0, [(None, None, False, None), (None, None, False, None)])
alias= (None, None, False, None)

In [None]:
>>> import lexnlp.extract.en.cusip
>>> text = None
>>> print(lexnlp.extract.en.cusip.get_cusip(text))
[{'location_start': None,
  'location_end': None,
  'text': None,
  'issuer_id': None,
  'issue_id': None,
  'checksum': None,
  'ppn': None,
  'tba': None,
  'internal': None}]

In [None]:
>>> import lexnlp.extract.en.dates
>>> text = None
>>> print(list(lexnlp.extract.en.dates.get_dates(text)))
[datetime.date(None, None, None)]

In [None]:
>>> import lexnlp.extract.en.definitions
>>> text = None
>>> print(list(lexnlp.extract.en.definitions.get_definitions(text)))
[None]


In [None]:
>>> import lexnlp.extract.en.distances
>>> text = None
>>> print(list(lexnlp.extract.en.distances.get_distances(text)))
[(None, None)]

In [None]:
>>> import lexnlp.extract.en.durations
>>> text = None
>>> print(list(lexnlp.extract.en.durations.get_durations(text)))
[('None', None, None)]

In [None]:
>>> import lexnlp.extract.en.money
>>> text = "amounting to upwards of fourteen thousand dollars."
>>> print(list(lexnlp.extract.en.money.get_money(text)))
[(14000.0, 'USD')]


In [None]:
>>> import lexnlp.extract.en.percents
>>> text = None
>>> print(list(lexnlp.extract.en.percents.get_percents(text)))
[(None, None, None)]

In [None]:
>>> import lexnlp.extract.en.pii
>>> text = None
>>> print(list(lexnlp.extract.en.pii.get_pii(text)))
[(None, None)]

In [None]:
>>> import lexnlp.extract.en.ratios
>>> text = None
>>> print(list(lexnlp.extract.en.ratios.get_ratios(text)))
[(None, None, None)]

In [None]:
>>> import lexnlp.extract.en.regulations
>>> text = None
>>> print(list(lexnlp.extract.en.regulations.get_regulations(text)))
[(None, None)]

In [None]:
>>> import lexnlp.extract.en.trademarks
>>> text = None
>>> print(list(lexnlp.extract.en.trademarks.get_trademarks(text)))
[None]

In [None]:
>>> import lexnlp.extract.en.urls
>>> text = None
>>> print(list(lexnlp.extract.en.urls.get_urls(text)))
[None]

In [None]:
>>> # Use the address extractor rather than the amounts extractor
>>> from lexnlp.extract.en.addresses import addresses
>>> text = None
>>> print(list(addresses.get_addresses(text)))
[None, None]

In [None]:
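>>> # Person names are extracted by get_persons in the nltk_maxent entities module
>>> import lexnlp.extract.en.entities.nltk_maxent
>>> text = "C. J. Collier"
>>> print(list(lexnlp.extract.en.entities.nltk_maxent.get_persons(text)))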
["C.J. Collier"]

In [None]:
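>>> # Person names are extracted by get_persons in the nltk_maxent entities module
>>> import lexnlp.extract.en.entities.nltk_maxent
>>> text = "J. Ormond"
>>> print(list(lexnlp.extract.en.entities.nltk_maxent.get_persons(text)))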
["J. Ormond"]