# Text preprocessing

## Design
*Input*: Text file (.txt) containing the text extracted from HTML, PDF, Word, etc.

*Output*: JSON file containing the sentences ready for use, each with its own ID. If needed, we can keep both the original sentence (before splitting) and the processed sentence (after splitting).

- *Sample output template for document with id 23effs8765*:
```
{
    "23effs8765": {
        "metadata": {
            "n_sentences": 23,
            "n_words": 1000,
            "filename": "FederalSomething.pdf",
            "format": "pdf",
            "country": "USA"
        },
        "sentences": [
            {
                "sentence_1": "Here is a sample sentence that is NOT an incentive",
                "label": 0
            },
            {
                "sentence_2": "This sentence should be an incentive",
                "label": 1
            }
        ]
    }
}
```
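
A minimal sketch of how such a record could be assembled in Python. The function name `build_document_record` and its parameters are hypothetical, and counting words by whitespace splitting is an assumption that the design above does not fix.

```python
import json


def build_document_record(doc_id, sentences, labels, filename, file_format, country):
    """Assemble the output structure for one document (hypothetical helper)."""
    # Assumption: words are counted by splitting each sentence on whitespace.
    n_words = sum(len(sentence.split()) for sentence in sentences)

    return {
        doc_id: {
            "metadata": {
                "n_sentences": len(sentences),
                "n_words": n_words,
                "filename": filename,
                "format": file_format,
                "country": country,
            },
            # Keys follow the "sentence_<n>" convention of the template above.
            "sentences": [
                {f"sentence_{i}": sentence, "label": label}
                for i, (sentence, label) in enumerate(zip(sentences, labels), start=1)
            ],
        }
    }


record = build_document_record(
    doc_id="23effs8765",
    sentences=[
        "Here is a sample sentence that is NOT an incentive",
        "This sentence should be an incentive",
    ],
    labels=[0, 1],
    filename="FederalSomething.pdf",
    file_format="pdf",
    country="USA",
)
print(json.dumps(record, indent=4))
```

If the "label" field turns out not to be needed, the `labels` argument and the `"label"` key can simply be dropped.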


## Pipeline:

- **1st component:** A few basic rules to deal with acronyms ("U.T.M"), bullet points ("(3)") and abbreviations ("ord."). The rules differ per country, state or local level, so that they can adapt to the variability of formats. Rule creation will be as standardized as possible, so that writing them is easy regardless of country/state.
    - Dictionary of abbreviations and acronyms
    - 1-3 rules for the characters that come before/after a period, to avoid confusing the sentence-splitting model
    - 1-3 rules for ensuring good processing of bullet points as sentences/phrases
- **2nd component:** Pre-built sentence splitter (NLTK or spaCy)
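
The two components could be wired together roughly as follows. This is only a sketch: `COUNTRY_RULES`, `apply_rules`, `naive_split` and `preprocess_and_split` are hypothetical names, and the regex splitter is a stand-in so that the example runs without NLTK or spaCy. In the real pipeline the second stage would be the pre-built splitter.

```python
import re

# Hypothetical per-country rule sets: abbreviations/acronyms mapped to
# period-free replacements.
COUNTRY_RULES = {
    "USA": {"No.": "No", "Sec.": "Sec", "U.S.": "US"},
}


def apply_rules(text, country):
    """1st component: strip periods from known abbreviations and acronyms."""
    for original, replacement in COUNTRY_RULES.get(country, {}).items():
        text = text.replace(original, replacement)
    return text


def naive_split(text):
    """Stand-in for the 2nd component: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def preprocess_and_split(text, country, splitter=naive_split):
    """Run the country-specific rules, then hand the text to the splitter."""
    return splitter(apply_rules(text, country))


sample = "See Docket No. 12 of the U.S. code. This rule is final."
print(preprocess_and_split(sample, "USA"))
```

Passing the splitter as an argument keeps the rules independent of the library chosen for the second component.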

### THINGS TO CONFIRM WITH JORDI:
- Do we still need the "label" field in our output JSON, for the data augmentation pipeline?

## Sentence splitting rules

### USA

*Notes from preliminary analysis:*
- We can filter out everything up to "ACTION: Final rule." or "-------------------"
- We need to figure out how laws and docket numbers ("Docket No. FWS-R4-ES-2018-0074.") are represented, as well as abbreviations for Congress ("Cong."), sessions ("Sess."), districts ("Dist."), etc.
- To figure out common patterns, we should grab everything that comes before a "." and see if we can build rules from it
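
One way to collect those candidates is to count the tokens that immediately precede a period. A minimal sketch, in which the function name `tokens_before_period` and the sample text are illustrative assumptions:

```python
import re
from collections import Counter


def tokens_before_period(text):
    """Count every run of word characters that is directly followed by a period."""
    return Counter(re.findall(r"\b(\w+)\.", text))


sample = (
    "Docket No. FWS-R4-ES-2018-0074. See 116th Cong. 2d Sess. "
    "and Docket No. 12 for the U.S. district."
)
counts = tokens_before_period(sample)
for token, count in counts.most_common(3):
    print(token, count)
```

Tokens that appear often and are short or capitalized ("No", "Cong", "Sess") are good candidates for the abbreviation dictionary.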

In [1]:
import re
import random
import nltk.data
import spacy 
import string
from collections import Counter

In [277]:
base_path = "../input/USA/"
usa_paths = ["Federal Register, Volume 85 Issue 190 (Wednesday, September 30, 2020).htm", "Federal Register, Volume 86 Issue 28 (Friday, February 12, 2021).htm", "Federal Register, Volume 86 Issue 29 (Tuesday, February 16, 2021).htm"]
fname = usa_paths[0]
txt_path = base_path + fname

with open(txt_path, "r") as txt_file:
    txt = txt_file.read()

In [178]:
def remove_html_tags(text):
    """Remove HTML tags from a string."""
    return re.sub(r'<.*?>', '', text)

def replace_links(text):
    """Replace URLs (anything starting with http or www) with the [URL] tag."""
    text = re.sub(r'http\S+', '[URL]', text)
    return re.sub(r'www\S+', '[URL]', text)

def remove_multiple_spaces(text):
    """Collapse any run of whitespace into a single space."""
    return re.sub(r'\s+', ' ', text)

# Optional preprocessing
txt = replace_links(remove_html_tags(txt)).replace("\n", " ").replace("\t", " ").strip()
txt = remove_multiple_spaces(txt)

#### 1. Find what happens around periods

In [156]:
def get_surrounding_chars(txt, radius=1):
    surrounding_chars = []
    all_period_idx = [indices.start() for indices in re.finditer(r"\.", txt)]
    
    for period_idx in all_period_idx:
        start_idx = period_idx - radius
        end_idx = period_idx + radius + 1
        substring = txt[start_idx: end_idx]
        
        if substring:
            surrounding_chars.append(substring)
    
    return surrounding_chars

surrounding_chars_1 = get_surrounding_chars(txt)
surrounding_chars_2 = get_surrounding_chars(txt, radius=2)

print(f"For 1 character before and after a period, we have {len(set(surrounding_chars_1))} unique patterns")
print(f"For 2 characters before and after a period, we have {len(set(surrounding_chars_2))} unique patterns")

For 1 character before and after a period, we have 55 unique patterns
For 2 characters before and after a period, we have 192 unique patterns


In [126]:
from collections import defaultdict

def get_possible_chars(neighboring_chars):
    possible_chars = defaultdict(list)

    for pattern in neighboring_chars:
        if pattern[-1] == " ":
            possible_chars[" "].append(pattern)
        elif pattern[-1].isalpha():
            possible_chars["alpha"].append(pattern)
        elif pattern[-1].isnumeric():
            possible_chars["numeric"].append(pattern)
        elif not pattern[-1].isalnum():
            possible_chars["symbol"].append(pattern)
        else:
            possible_chars["other"].append(pattern)
    
    print(f"Total: {len(neighboring_chars)}")
    return possible_chars

In [127]:
def print_char_stats(possible_chars):
    print(f"Space: {len(possible_chars[' '])}")
    print(f"Alpha: {len(possible_chars['alpha'])}")
    print(f"Numeric: {len(possible_chars['numeric'])}")
    print(f"Symbol: {len(possible_chars['symbol'])}")
    print(f"Other: {len(possible_chars['other'])}")

Let's analyze the characters surrounding every period in the text.

In [128]:
possible_chars = get_possible_chars(surrounding_chars_1)
print_char_stats(possible_chars)

Total: 339
Space: 229
Alpha: 45
Numeric: 22
Symbol: 43
Other: 0


Now, we will do the same analysis but for unique patterns

In [136]:
possible_chars = get_possible_chars(set(surrounding_chars_1))
print_char_stats(possible_chars)

Total: 55
Space: 30
Alpha: 6
Numeric: 7
Symbol: 12
Other: 0


#### 1.1 Conclusions from period analysis

- 25 of the 55 unique patterns (45%) have a period followed by a non-space character.
- 110 of the 339 periods in the text (32%) are followed by a non-space character.
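
These shares follow directly from the counts printed above (total minus the "Space" bucket). As a quick arithmetic check, with a hypothetical helper name `non_space_share`:

```python
def non_space_share(total, followed_by_space):
    """Return (count, rounded percentage) of periods NOT followed by a space."""
    non_space = total - followed_by_space
    return non_space, round(100 * non_space / total)


print(non_space_share(55, 30))    # unique patterns -> (25, 45)
print(non_space_share(339, 229))  # all instances   -> (110, 32)
```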

#### 1.2 Potential rules

For neighboring characters within a radius of 1:
   - If the character after a period is not a space, delete the period
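
This rule can also be expressed as a single regular expression with a lookahead. The following is a sketch rather than the notebook's implementation, and `drop_inner_periods` is a hypothetical name:

```python
import re


def drop_inner_periods(text):
    """Delete every period that is immediately followed by a non-space character."""
    # (?=\S) is a lookahead: it requires a non-whitespace character after the
    # period without consuming it, so a period at the end of the text is kept.
    return re.sub(r"\.(?=\S)", "", text)


print(drop_inner_periods("The U.S. Fish and Wildlife Service paid 3.5 million."))
```

Note that the rule also merges decimals ("3.5" becomes "35") and only removes the inner period of "U.S.", which is why the abbreviation and acronym dictionary is still needed.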

In [281]:
abreviations = {"No.", "Sec.", "Cong.", "Dist."}
acronyms = {"W.D.", "U.S.", "H.R."}

def parse_abrev_acro(text):
    """
    Remove the periods from abbreviations and acronyms in the text (i.e. "Sec." becomes "Sec" and "U.S." becomes "US").
    """
    for abreviation in abreviations:
        text = text.replace(abreviation, abreviation[:-1])
        
    for acronym in acronyms:
        new_acronym = acronym.replace(".", "")
        text = text.replace(acronym, new_acronym)
        
    return text

def potential_preprocessing(txt):
    """
    Steps in the preprocessing of text:
        1. Remove HTML tags
        2. Replace URLs with the tag [URL]
        3. Replace new lines and tabs with normal spaces - sometimes sentences have new lines in the middle
        4. Remove excessive spaces (more than 1 occurrence)
        5. Parse abbreviations and acronyms
        6. Drop periods that are unlikely to end a sentence
    """
    txt = replace_links(remove_html_tags(txt)).replace("\n", " ").replace("\t", " ").strip()
    txt = remove_multiple_spaces(txt)
    txt = parse_abrev_acro(txt)
    
    new_txt = ""
    all_period_idx = set(indices.start() for indices in re.finditer(r"\.", txt))
    
    for i, char in enumerate(txt):
        
        if i in all_period_idx:
            # Drop a period followed by a non-space character (e.g. "3.5").
            # The bounds check keeps a period that is the last character of the text.
            if i + 1 < len(txt) and txt[i + 1] != " ":
                continue
            # Drop a period followed by a space and then a digit (e.g. "No. 12")
            if i + 2 < len(txt) and txt[i + 2].isnumeric():
                continue
            
        new_txt += char

    return new_txt
        
    
ppp = potential_preprocessing(txt)
surrounding_chars_1 = get_surrounding_chars(ppp)
possible_chars = get_possible_chars(surrounding_chars_1)
print_char_stats(possible_chars)

Total: 171
Space: 171
Alpha: 0
Numeric: 0
Symbol: 0
Other: 0


In [274]:
import nltk
en_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
es_tokenizer = nltk.data.load("tokenizers/punkt/spanish.pickle")

def get_nltk_sents(txt, tokenizer):
    sents = tokenizer.tokenize(txt)
    return sents

In [284]:
ppp



In [285]:
sents = get_nltk_sents(ppp, en_tokenizer)

In [286]:
sents

 'ACTION: Final rule.',
 '----------------------------------------------------------------------- SUMMARY: We, the US Fish and Wildlife Service (Service), adopt a rule under section 4(d) of the Endangered Species Act of 1973 (Act), as amended, for the trispot darter (Etheostoma trisella), a fish from Alabama, Georgia, and Tennessee.',
 'This rule provides measures that are necessary and advisable to conserve the species.',
 'DATES: This rule is effective October 30, 2020.',
 'ADDRESSES: This final rule is available on the internet at [URL] under Docket No FWS-R4-ES-2018-0074 and at [URL] Comments and materials we received, as well as supporting documentation we used in preparing this rule, are available for public inspection at [URL] under Docket No FWS-R4-ES-2018-0074.',
 'FOR FURTHER INFORMATION CONTACT: William Pearson, Field Supervisor, US Fish and Wildlife Service, Alabama Ecological Services Field Office, 1208-B Main Street, Daphne, AL 36526; telephone 251-441-5870.',
 'Persons w