<div style="background-color: lightgreen; border-radius: 5px; padding: 10px;">
    <h4>Pre-processing Paragraphs</h4>
    <p>...</p>
</div>

## Packages and Loading Files

In [1]:
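# Load Packages
import os
import pandas as pd
from tqdm.auto import tqdm  # progress bars; imported explicitly instead of relying on the star import below

# Silence pandas' SettingWithCopyWarning
pd.options.mode.chained_assignment = None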

# Import Scripts
from preprocessing_functions import *

## Pre-processing Function

In [2]:
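def lemmatise_city_pair(df, POS, OVERWRITE=False, ONLY_ENGLISH_WORDS=False, ENGLISH_WORDS=None,
    english_words_file="../../../input/english_words_alpha_370k.txt", NLP_MAX_LENGTH=1500000):
    """Add one lemma column per POS tag to df, plus an optional '<tag>_clean' column.

    Existing columns are reused unless OVERWRITE is True.
    english_words_file is currently unused; pass the word list via ENGLISH_WORDS.
    """
    # Avoid a mutable default argument
    if ENGLISH_WORDS is None:
        ENGLISH_WORDS = []

    for tag in tqdm(POS, desc=f"POS: {POS}", leave=False):
        # Lemmatise the raw paragraphs for this POS tag
        if OVERWRITE or tag not in df.columns:
            df.loc[:, tag] = lemmatise_paragraphs(paragraphs=df.loc[:, 'paragraph'], POStag=tag, NLP_MAX_LENGTH=NLP_MAX_LENGTH)

        # Optionally keep only the lemmas that appear in the English word list
        if ONLY_ENGLISH_WORDS and (OVERWRITE or f'{tag}_clean' not in df.columns):
            df.loc[:, f'{tag}_clean'] = keep_english_words_in_paragraphs(paragraphs=df.loc[:, tag], english_words=ENGLISH_WORDS)

    return df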
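
`keep_english_words_in_paragraphs` is defined in `preprocessing_functions` and is not shown in this notebook. As a rough illustration of what the `_clean` step appears to do, the sketch below assumes each cell holds a list of lemmas and that filtering is a plain membership test against the word list. The name `filter_english_lemmas` is hypothetical and only stands in for the real helper.

```python
def filter_english_lemmas(lemma_lists, english_words):
    """Keep only lemmas found in english_words (hypothetical stand-in, not the project's helper)."""
    vocabulary = set(english_words)  # set membership is O(1) on average
    return [[lemma for lemma in lemmas if lemma in vocabulary] for lemmas in lemma_lists]


lemma_lists = [["access", "biocapacity", "world"], ["community", "adherent"]]
vocabulary = ["access", "world", "community", "adherent"]
cleaned = filter_english_lemmas(lemma_lists, vocabulary)
print(cleaned)
```

Under these assumptions, a lemma such as `biocapacity` that is missing from the word list is dropped from the `_clean` column while the original column keeps it.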

## Variables

In [3]:
data_dir = '../../../../data_clean/' # directory where selected articles will be saved, change if you want to save these elsewhere
out_dir = 'output/'
in_dir = '../../input/'
# extr_dir = path/to/wikidump/ex

In [7]:
POS = ["NOUN", "VERB", "ADJ"]
OVERWRITE=False
ONLY_ENGLISH_WORDS=True
NLP_MAX_LENGTH = 1500000

ENGLISH_WORDS = get_english_words(path=f"{in_dir}english_words_alpha_370k.txt")
# df =  pd.read_csv("../../../../../data/clean/paragraphs/paragraphs_raw_folder_62.csv")
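
`get_english_words` also comes from `preprocessing_functions`, so its behaviour is not visible here. A minimal sketch of such a loader, assuming the word file holds one word per line, might look like the following. `load_word_list` and the demo file are hypothetical.

```python
import os
import tempfile


def load_word_list(path):
    """Read one word per line into a lowercase set (hypothetical stand-in for get_english_words)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Demo with a throwaway file instead of the real word list
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "words.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("access\nWorld\n\nhectare\n")
    demo_words = load_word_list(demo_path)

print(sorted(demo_words))
```

Returning a set rather than a list keeps the later membership checks cheap when the list has several hundred thousand entries.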

In [13]:
# Read every paragraph file, then concatenate once instead of growing the DataFrame inside the loop.
# paragraphs_dir must be defined before this cell runs.
# Expected columns: city_1, city_2, paragraph_id, paragraph, article_id, title
frames = []

for file in tqdm(os.listdir(paragraphs_dir), desc= 'Pre-processing paragraphs...'):
    fp = os.path.join(paragraphs_dir, file)
    frames.append(pd.read_csv(fp))

all_paragraphs_df = pd.concat(frames)

Pre-processing paragraphs...:   0%|          | 0/5 [00:00<?, ?it/s]

In [24]:
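# Split the rows into `div` nearly equal chunks; the first `num % div` chunks get one extra row
num, div = len(all_paragraphs_df), 20
chunks = [num // div + (1 if x < num % div else 0) for x in range(div)]

# Cumulative boundaries: cum_chunks[i] is the first row of chunk i
cum_chunks = [0]
for size in chunks:
    cum_chunks.append(cum_chunks[-1] + size)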

In [26]:
chunks_min_max = list(zip(cum_chunks, cum_chunks[1:]))
chunks_min_max

[(0, 103821),
 (103821, 207642),
 (207642, 311463),
 (311463, 415284),
 (415284, 519104),
 (519104, 622924),
 (622924, 726744),
 (726744, 830564),
 (830564, 934384),
 (934384, 1038204),
 (1038204, 1142024),
 (1142024, 1245844),
 (1245844, 1349664),
 (1349664, 1453484),
 (1453484, 1557304),
 (1557304, 1661124),
 (1661124, 1764944),
 (1764944, 1868764),
 (1868764, 1972584),
 (1972584, 2076404)]
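
The boundaries above follow from integer division: 2076404 rows split into 20 chunks gives a base size of 103820 with a remainder of 4, so the first four chunks hold 103821 rows and the rest hold 103820. The same calculation can be sketched with `divmod` and `itertools.accumulate`; the function name `chunk_bounds` is an illustrative choice, not something the notebook defines.

```python
from itertools import accumulate


def chunk_bounds(num_rows, num_chunks):
    """Return (start, end) row ranges for num_chunks nearly equal chunks."""
    base, extra = divmod(num_rows, num_chunks)
    # The first `extra` chunks receive one additional row
    sizes = [base + 1 if i < extra else base for i in range(num_chunks)]
    edges = [0, *accumulate(sizes)]
    return list(zip(edges, edges[1:]))


bounds = chunk_bounds(2076404, 20)
print(bounds[0], bounds[4], bounds[-1])
# (0, 103821) (415284, 519104) (1972584, 2076404)
```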

In [35]:
# Write each chunk to its own CSV, numbered from 1
for count, (start, end) in enumerate(tqdm(chunks_min_max), start=1):
    sub_df = all_paragraphs_df.iloc[start:end]
    file_path = f"paragraphs_{count}_{start}_{end}.csv"
    sub_df.to_csv(file_path, index=False)

  0%|          | 0/20 [00:00<?, ?it/s]
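
Each file name encodes the chunk number and its row range. Because `os.listdir` returns names in arbitrary order and plain string sorting does not order the chunk numbers numerically, it can help to parse the numbers back out before iterating. The sketch below assumes the `paragraphs_{count}_{start}_{end}.csv` pattern used above; `parse_chunk_name` is a hypothetical helper.

```python
import re

CHUNK_NAME = re.compile(r"paragraphs_(\d+)_(\d+)_(\d+)\.csv")


def parse_chunk_name(name):
    """Return (count, start, end) as integers, or None if the name does not match."""
    match = CHUNK_NAME.fullmatch(name)
    return tuple(int(part) for part in match.groups()) if match else None


names = [
    "paragraphs_20_1972584_2076404.csv",
    "paragraphs_2_103821_207642.csv",
    "paragraphs_1_0_103821.csv",
]
ordered = sorted(names, key=parse_chunk_name)  # numeric order by chunk number
print(ordered)
```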

## Pre-processing

In [8]:
paragraphs_dir = os.path.join(data_dir, 'paragraphs_smaller')

for file in tqdm(os.listdir(paragraphs_dir), desc= 'Pre-processing paragraphs...'):
    fp = os.path.join(paragraphs_dir, file)
    filename = file.split('.')[0]
    out_fp = f"{filename}_preprocessed.csv"

    # Skip chunks that already have an output file, so an interrupted run can be resumed
    if not os.path.exists(out_fp):
        df = pd.read_csv(fp)

        lemmatised_df = lemmatise_city_pair(df=df, POS=POS, OVERWRITE=OVERWRITE, ONLY_ENGLISH_WORDS=ONLY_ENGLISH_WORDS, ENGLISH_WORDS=ENGLISH_WORDS, NLP_MAX_LENGTH=NLP_MAX_LENGTH)
        lemmatised_df.to_csv(out_fp, index=False)

        print(f"Pre-processed {filename}, containing {len(df)} paragraphs.")

Pre-processing paragraphs...:   0%|          | 0/20 [00:00<?, ?it/s]

Pre-processed paragraphs_1_0_103821, containing 103821 paragraphs.
Pre-processed paragraphs_20_1972584_2076404, containing 103820 paragraphs.
Pre-processed paragraphs_2_103821_207642, containing 103821 paragraphs.
Pre-processed paragraphs_3_207642_311463, containing 103821 paragraphs.
Pre-processed paragraphs_4_311463_415284, containing 103821 paragraphs.
Pre-processed paragraphs_5_415284_519104, containing 103820 paragraphs.
Pre-processed paragraphs_6_519104_622924, containing 103820 paragraphs.
Pre-processed paragraphs_7_622924_726744, containing 103820 paragraphs.
Pre-processed paragraphs_8_726744_830564, containing 103820 paragraphs.
Pre-processed paragraphs_9_830564_934384, containing 103820 paragraphs.
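
The loop in this section only processes a chunk when its `_preprocessed.csv` output is missing, which makes a long run resumable after an interruption. The standalone sketch below shows that skip-if-exists pattern with the standard library only. `process_pending` and the placeholder `process` callback are hypothetical and stand in for reading, lemmatising, and saving a chunk.

```python
import tempfile
from pathlib import Path


def process_pending(input_names, out_dir, process):
    """Run process(name) for each input whose output file is missing; return the names processed."""
    done = []
    for name in input_names:
        out_path = Path(out_dir) / f"{Path(name).stem}_preprocessed.csv"
        if out_path.exists():
            continue  # already handled in an earlier run
        out_path.write_text(process(name), encoding="utf-8")
        done.append(name)
    return done


with tempfile.TemporaryDirectory() as tmp_dir:
    names = ["paragraphs_1_0_103821.csv", "paragraphs_2_103821_207642.csv"]
    # First run handles only one chunk, as if the job had been interrupted
    first_run = process_pending(names[:1], tmp_dir, process=lambda name: "placeholder\n")
    # Second run sees the existing output and only processes the remaining chunk
    second_run = process_pending(names, tmp_dir, process=lambda name: "placeholder\n")

print(first_run, second_run)
```

One caveat of this pattern is that a partially written output file still counts as finished, so a run interrupted in the middle of a write may need that file removed by hand.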


## Everything from here may be deleted later on

In [12]:
test_file = os.path.join(paragraphs_dir, "paragraphs_0_101735_test.csv")
df =  pd.read_csv(test_file)

In [14]:
df

| Unnamed: 0 | city_1 | city_2 | paragraph_id | paragraph | article_id | title |
|---|---|---|---|---|---|---|
| 0 | Birmingham | Florence | 1 | The first community of adherents of the Baha'i... | 303 | Alabama |
| 1 | Florence | Birmingham | 2 | The first community of adherents of the Baha'i... | 303 | Alabama |
| 2 | Paris | London | 3 | A major revision of the work by composer and a... | 309 | An American in Paris |
| 3 | London | Paris | 4 | A major revision of the work by composer and a... | 309 | An American in Paris |
| 4 | Madrid | Rome | 5 | Access to biocapacity in Algeria is lower than... | 358 | Algeria |
| ... | ... | ... | ... | ... | ... | ... |
| 488331 | Dublin | London | 488332 | Allman-Smith played hockey for Dublin Universi... | 3001932 | Edward Allman-Smith |
| 488332 | London | Dublin | 488333 | O'Kelly and Condell met in Dublin in 1969 and ... | 3001953 | Tir na nOg (band) |
| 488333 | Dublin | London | 488334 | O'Kelly and Condell met in Dublin in 1969 and ... | 3001953 | Tir na nOg (band) |
| 488334 | Birmingham | Dublin | 488335 | Tir na nOg reformed in 1985, releasing the sin... | 3001953 | Tir na nOg (band) |


In [38]:
# %%time

# lemmatised_df = lemmatise_city_pair(df=df[:50000], POS=POS, OVERWRITE=OVERWRITE, ONLY_ENGLISH_WORDS=ONLY_ENGLISH_WORDS, ENGLISH_WORDS=ENGLISH_WORDS, NLP_MAX_LENGTH=NLP_MAX_LENGTH)
# lemmatised_df.to_csv("lemmatised_file.csv", index=False)

CPU times: total: 7min 57s
Wall time: 23min 6s


In [35]:
# test_df = pd.read_csv('lemmatised_file.csv')
# test_df

| Unnamed: 0 | city_1 | city_2 | paragraph_id | paragraph | article_id | title | NOUN | NOUN_clean | VERB | VERB_clean | ADJ | ADJ_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Birmingham | Florence | 1 | The first community of adherents of the Baha'i... | 303 | Alabama | ['community', 'adherent', 'center'] | ['community', 'adherent', 'center'] | ['found', 'move', 'exist'] | ['found', 'move', 'exist'] | [] | [] |
| 1 | Florence | Birmingham | 2 | The first community of adherents of the Baha'i... | 303 | Alabama | ['community', 'adherent', 'center'] | ['community', 'adherent', 'center'] | ['found', 'move', 'exist'] | ['found', 'move', 'exist'] | [] | [] |
| 2 | Paris | London | 3 | A major revision of the work by composer and a... | 309 | An American in Paris | ['revision', 'work', 'composer', 'arranger', '... | ['revision', 'work', 'composer', 'arranger', '... | ['simplify', 'reduce', 'eliminate', 'avoid', '... | ['simplify', 'reduce', 'eliminate', 'avoid', '... | ['major', 'standard', 'original', 'original'] | ['major', 'standard', 'original', 'original'] |
| 3 | London | Paris | 4 | A major revision of the work by composer and a... | 309 | An American in Paris | ['revision', 'work', 'composer', 'arranger', '... | ['revision', 'work', 'composer', 'arranger', '... | ['simplify', 'reduce', 'eliminate', 'avoid', '... | ['simplify', 'reduce', 'eliminate', 'avoid', '... | ['major', 'standard', 'original', 'original'] | ['major', 'standard', 'original', 'original'] |
| 4 | Madrid | Rome | 5 | Access to biocapacity in Algeria is lower than... | 358 | Algeria | ['access', 'biocapacity', 'world', 'hectare', ... | ['access', 'world', 'hectare', 'person', 'terr... | ['mean', 'use', 'contain', 'run', 'hold', 'sec... | ['mean', 'use', 'contain', 'run', 'hold', 'sec... | ['low', 'average', 'global', 'global', 'global... | ['low', 'average', 'global', 'global', 'global... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Athens | Stockholm | 996 | Athens was awarded the 2004 Summer Olympics on... | 1216 | Athens | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['previous', 'second', 'inaugural', 'unsuccess... | ['previous', 'second', 'inaugural', 'unsuccess... |
| 996 | Rome | Athens | 997 | Athens was awarded the 2004 Summer Olympics on... | 1216 | Athens | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['previous', 'second', 'inaugural', 'unsuccess... | ['previous', 'second', 'inaugural', 'unsuccess... |
| 997 | Rome | Stockholm | 998 | Athens was awarded the 2004 Summer Olympics on... | 1216 | Athens | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['previous', 'second', 'inaugural', 'unsuccess... | ['previous', 'second', 'inaugural', 'unsuccess... |
| 998 | Stockholm | Athens | 999 | Athens was awarded the 2004 Summer Olympics on... | 1216 | Athens | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['bid', 'time', 'game', 'event', 'bid', 'bid',... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['award', 'have', 'lose', 'host', 'host', 'fol... | ['previous', 'second', 'inaugural', 'unsuccess... | ['previous', 'second', 'inaugural', 'unsuccess... |
