# Data acquisition and pre-processing

## Data acquisition  
The dataset is available from the LINDAT repository: <https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3052>.

The data acquisition process is fully automated and consists of the following steps:
* A ZIP file containing the whole dataset is downloaded from the main data source into the directory specified by the **DIR_RAW_DATA** constant.
* All documents are extracted from the ZIP file into a separate directory specified by the **DIR_EXTRACTED_DATA** constant.
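
As a small illustration of the extraction step, the sketch below builds a tiny in-memory archive and extracts only its `.txt` members into a flat directory. The archive layout and file names are hypothetical and are not taken from the real dataset.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny in-memory archive with a nested layout (hypothetical file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('SupCo/2010/doc_a.txt', 'first document')
    archive.writestr('SupCo/2011/doc_b.txt', 'second document')
    archive.writestr('SupCo/README.md', 'not a document')

# Extract only the .txt members and drop their directory prefix.
target_dir = tempfile.mkdtemp()
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    for zip_info in archive.infolist():
        fname = os.path.basename(zip_info.filename)
        if fname.endswith('.txt'):
            zip_info.filename = fname  # Overrides the path used on disk
            archive.extract(zip_info, target_dir)

extracted = sorted(os.listdir(target_dir))
print(extracted)
```

Flattening keeps the later steps simple, because every document can be addressed by its file name alone. It assumes that file names are unique across the archive's subdirectories.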

In [None]:
import os
import requests
import zipfile
from tqdm.notebook import tqdm

In [None]:
URL_RAW_DATA = 'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11372/LRT-3052/SupCo.zip?sequence=3&isAllowed=y'
FNAME_ZIP_DATA = 'SupCo.zip'
DIR_RAW_DATA = 'raw_data'
DIR_EXTRACTED_DATA = 'extracted_data'

In [None]:
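def download_dataset(url, local_fpath):
    # Stream the response so the whole archive is never held in memory
    resp = requests.get(url, stream=True)
    resp.raise_for_status()  # Fail early instead of saving an error page as the ZIP file
    total = int(resp.headers.get('content-length', 0))

    # Write the archive chunk by chunk while updating the progress bar
    with open(local_fpath, 'wb') as file, tqdm(desc=local_fpath, total=total, unit='iB',
                                                unit_scale=True, unit_divisor=1024) as bar:
        for data in resp.iter_content(chunk_size=1024):
            size = file.write(data)
            bar.update(size)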

In [None]:
# Download the archive only if it is not already present
zip_fpath = os.path.join(DIR_RAW_DATA, FNAME_ZIP_DATA)
if not os.path.exists(zip_fpath):
    os.makedirs(DIR_RAW_DATA, exist_ok=True)
    download_dataset(URL_RAW_DATA, zip_fpath)

In [None]:
if not os.path.exists(DIR_EXTRACTED_DATA):
    with zipfile.ZipFile(os.path.join(DIR_RAW_DATA, FNAME_ZIP_DATA)) as zip_file:
        for zip_info in tqdm(zip_file.infolist()):
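            # Keep only the text documents and drop their directory prefix,
            # so that all files end up directly in DIR_EXTRACTED_DATA
            fname = os.path.basename(zip_info.filename)
            if fname.endswith('.txt'):
                zip_info.filename = fname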
                zip_file.extract(zip_info, DIR_EXTRACTED_DATA)

In [None]:
print(f'Number of extracted documents: {len(os.listdir(DIR_EXTRACTED_DATA))}')

## Data pre-processing
This section describes the data pre-processing pipeline as well as a parallel approach to running it.

### Input format:
Raw text files (encoding: Windows-1250).
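
The input encoding matters because Windows-1250 stores each Czech character in a single byte, while UTF-8 needs two bytes for the accented ones. A minimal sketch with a made-up sample string:

```python
# Made-up sample text; the real documents are read from disk.
sample = 'Odůvodnění: žalobce'

# Simulate the bytes of a Windows-1250 file.
raw_bytes = sample.encode('Windows-1250')

# Decoding with the correct codec restores the text ...
decoded = raw_bytes.decode('Windows-1250')

# ... while a wrong codec silently produces different characters.
wrongly_decoded = raw_bytes.decode('latin-1')

# The same text takes more bytes in UTF-8.
utf8_bytes = decoded.encode('utf-8')

print(decoded)
print(wrongly_decoded)
print(len(raw_bytes), len(utf8_bytes))
```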

### Pre-processing steps:
#### I. Text cleaning
* Numbers are separated from strings, e.g. 'abc123' -> 'abc 123'.
* Quotation marks are removed.
* Leading and trailing spaces are removed.
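
The three cleaning rules can be sketched on a few sample tokens. The helper name `clean_token` and the sample inputs are illustrative only.

```python
import re

def clean_token(token):
    # Put spaces around every run of digits
    token = ' '.join(re.split(r'(\d+)', token))
    # Remove Czech and ASCII quotation marks
    token = re.sub(r'[„“"\']+', '', token)
    # Strip leading and trailing spaces left over from the steps above
    return token.strip()

for sample in ['abc123', '12abc', 'a1b', '„soud“']:
    print(repr(sample), '->', repr(clean_token(sample)))
```

Because `re.split` is called with a capturing group, the digit runs themselves are kept in the result and only the surrounding spaces are added.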

#### II. Document filtering:
* Only documents containing one of the terms **důvodnění** or **d ů v o d n ě n í** are selected, and only the text following the first occurrence of the term is kept.
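
A minimal sketch of this filtering step, assuming that only the text after the matched term is of interest. The helper name `text_after_marker` and the sample strings are hypothetical.

```python
MARKERS = ('důvodnění', 'd ů v o d n ě n í')

def text_after_marker(text, markers=MARKERS):
    """Return the text following the first matching marker, or None."""
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            # len(marker) avoids hard-coding the offset for each marker
            return text[index + len(marker):]
    return None

print(repr(text_after_marker('Odůvodnění: soud rozhodl')))
print(repr(text_after_marker('O d ů v o d n ě n í : soud rozhodl')))
print(repr(text_after_marker('Usnesení bez hledaného slova')))
```

Note that the spaced variant is 17 characters long (9 letters and 8 spaces), so computing the offset with `len` is safer than a literal number.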

#### III. Part-of-speech tagging (POS tagging)
* Tokens are filtered by POS tag: only nouns and adjectives are kept. Stop words and punctuation are discarded.

#### IV. Lemmatization
* The filtered tokens with length >= 3 are lemmatized and converted to lowercase.
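
Steps III and IV can be illustrated without loading a language model. In the sketch below, the `Token` tuples stand in for tagged tokens, and their POS tags and lemmas are assigned by hand for illustration; they are not output of the UDPipe model. The stop-word and punctuation checks are omitted for brevity.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagged token (hand-assigned tags and lemmas).
Token = namedtuple('Token', ['text', 'pos_', 'lemma_'])

tokens = [
    Token('Okresního', 'ADJ', 'okresní'),
    Token('soudu', 'NOUN', 'soud'),
    Token('ve', 'ADP', 'v'),
    Token('věci', 'NOUN', 'věc'),
    Token('rozhodl', 'VERB', 'rozhodnout'),
    Token('čl', 'NOUN', 'čl'),
]

def select_lemmas(tokens, allowed_pos_tags, min_token_len, to_lower):
    lemmas = []
    for token in tokens:
        # Keep a token only if its tag is allowed and it is long enough
        if token.pos_ in allowed_pos_tags and len(token.text) >= min_token_len:
            lemmas.append(token.lemma_.lower() if to_lower else token.lemma_)
    return lemmas

result = select_lemmas(tokens, {'NOUN', 'ADJ'}, min_token_len=3, to_lower=True)
print(result)
```

The length check is applied to the surface form of the token, not to its lemma, which matches the description of the minimum token length as a pre-lemmatization filter.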

### Output format:
JSON files (encoding: UTF-8), each containing the list of lemmas extracted from one document.
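
The output format can be sketched as follows. The lemma list and the file name are made up. Passing `ensure_ascii=False` keeps Czech characters readable in the file instead of writing them as `\uXXXX` escapes.

```python
import json
import os
import tempfile

# Made-up lemma list standing in for one pre-processed document.
lemmas = ['odůvodnění', 'soud', 'žalobce']

path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w', encoding='utf-8') as output_file:
    json.dump(lemmas, output_file, ensure_ascii=False)

# Read the file back, both as raw text and as parsed JSON.
with open(path, 'r', encoding='utf-8') as input_file:
    raw = input_file.read()
loaded = json.loads(raw)

# For comparison: the default settings escape all non-ASCII characters.
escaped = json.dumps(lemmas)

print(raw)
print(escaped)
```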

In [None]:
import spacy_udpipe
import json
import re

In [None]:
def preprocess_token(token):
    token = ' '.join(re.split(r'(\d+)', token)) # Separate numbers
    token = re.sub(r'[„“\"\']+', '', token) # Remove quotation marks
    token = token.strip() # Strip leading and trailing spaces
    return token

def extract_lemmas(text, spacy_model, allowed_pos_tags, min_token_len, to_lower):
    lemmas = []
    doc = spacy_model(text)

    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ in allowed_pos_tags and len(token) >= min_token_len:
            if to_lower:
                lemma = token.lemma_.lower()
            else:
                lemma = token.lemma_
            lemmas.append(lemma)
    return lemmas
    
def preprocess_file(fpath, spacy_model, allowed_pos_tags, min_token_len, lemmas_to_lower, encoding):
    with open(fpath, 'r', encoding=encoding, errors='ignore') as file:
        textorig = ' '.join([preprocess_token(token) for token in file.read().split(' ')])

    # Keep only the text that follows the first matching marker;
    # documents without a marker are skipped (the function returns None)
    for marker in ('důvodnění', 'd ů v o d n ě n í'):
        index = textorig.find(marker)
        if index != -1:
            # len(marker) is 9 for the plain marker and 17 for the spaced one
            text = textorig[index + len(marker):]
            return extract_lemmas(text, spacy_model=spacy_model, allowed_pos_tags=allowed_pos_tags, min_token_len=min_token_len, to_lower=lemmas_to_lower)
    return None

### Parallel pre-processing
Multiprocessing can be used to speed up the pre-processing.  
The function **worker_func** takes only a file name as its argument and reads all other settings from the module-level constants below, as it serves only as an auxiliary function for multiprocessing.

**Parameters**:
* **INPUT_ENCODING**: Encoding of the extracted (input) documents. (Default: 'Windows-1250')
* **OUTPUT_ENCODING**: Encoding of the output JSON files. (Default: 'utf-8')
* **DIR_PREPROCESSED_DATA**: Target directory for the pre-processed JSON files.
* **N_PROCESSES**: Number of parallel Python processes to execute.
* **ALLOWED_POS_TAGS**: Part-of-speech tags used for token filtering: a token is selected if its POS tag is listed here. (Default: {'NOUN','ADJ'})
* **MIN_TOKEN_LEN**: Minimum token length before lemmatization. (Default: 3)
* **LEMMAS_TO_LOWERCASE**: If True, convert the lemmas to lowercase. (Default: True)
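
Rather than fixing **N_PROCESSES** by hand, its value could be derived from the machine and the workload. The helper below is a hypothetical sketch and not part of the original pipeline; it caps the requested value by the number of CPUs and by the number of files to process.

```python
import multiprocessing as mp

def pick_n_processes(requested, n_tasks):
    """Cap the requested process count by the CPU count and the task count."""
    available = mp.cpu_count()
    # Always return at least one process, even for an empty task list
    return max(1, min(requested, available, n_tasks))

print(pick_n_processes(requested=14, n_tasks=1000))
print(pick_n_processes(requested=14, n_tasks=3))
```

Starting more processes than there are CPU cores rarely helps for CPU-bound work such as tagging and lemmatization, and each extra process may hold its own copy of the language model in memory.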

In [None]:
import multiprocessing as mp

In [None]:
print(f'Maximum recommended number of parallel processes: {mp.cpu_count()}')

In [None]:
INPUT_ENCODING = 'Windows-1250'
OUTPUT_ENCODING = 'utf-8'
DIR_PREPROCESSED_DATA = 'preprocessed_data'
N_PROCESSES = 14

ALLOWED_POS_TAGS = {'NOUN','ADJ'}
MIN_TOKEN_LEN = 3 # <3, ...>
LEMMAS_TO_LOWERCASE = True

spacy_udpipe.download('cs')
MODEL = spacy_udpipe.load('cs')

In [None]:
print(f'Number of documents to process: {len(os.listdir(DIR_EXTRACTED_DATA))}')

In [None]:
def worker_func(fname):
    input_fpath = os.path.join(DIR_EXTRACTED_DATA, fname)
    output_fpath = os.path.join(DIR_PREPROCESSED_DATA, f'{fname.split(".")[0]}.json')

    preprocessed_data = preprocess_file(input_fpath, 
                                        spacy_model=MODEL, # Module-level model, available in every worker process
                                        allowed_pos_tags=ALLOWED_POS_TAGS, 
                                        min_token_len=MIN_TOKEN_LEN,
                                        lemmas_to_lower=LEMMAS_TO_LOWERCASE,
                                        encoding=INPUT_ENCODING)
    
    # Check if we have successfully extracted any lemmas and create an output JSON file
    if preprocessed_data is not None:
        with open(output_fpath, 'w', encoding=OUTPUT_ENCODING) as output_file:
            json.dump(obj=preprocessed_data, fp=output_file, ensure_ascii=False)

if not os.path.exists(DIR_PREPROCESSED_DATA):
    os.mkdir(DIR_PREPROCESSED_DATA)

input_fnames = os.listdir(DIR_EXTRACTED_DATA)

with mp.Pool(processes=N_PROCESSES) as pool:
    for _ in tqdm(pool.imap_unordered(worker_func, input_fnames), total=len(input_fnames)):
        # Consume the iterator so that tqdm advances as each result arrives
        pass

In [None]:
print(f'Number of pre-processed documents: {len(os.listdir(DIR_PREPROCESSED_DATA))}')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
doc_lengths = []

for fname in tqdm(os.listdir(DIR_PREPROCESSED_DATA)):
    with open(os.path.join(DIR_PREPROCESSED_DATA, fname), 'r', encoding=OUTPUT_ENCODING) as f:
        res = json.load(f)
        doc_lengths.append(len(res))

In [None]:
plt.figure(figsize=(14, 7))
plt.title('Distribution of the number of lemmas per pre-processed document')
plt.xlabel('Number of lemmas')
sns.histplot(doc_lengths, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()