# Data Prep


# Download datasets

`get_data.sh` prints the commands that download the datasets; piping them to `parallel -j <n>` runs the downloads in parallel.

All datasets are saved under the `datasets` directory.

A `datasets/<lang>-eng/mtdata.signature.txt` file marks a successful download. Delete these files to re-download.

Note: this is a disk- and network-intensive task, so don't use too many parallel jobs; 20 is a good limit.
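
To see which language pairs still lack the signature file, a check along these lines may help. This is a minimal sketch that assumes the `datasets/<lang>-eng/` layout described above; the helper name `pending_downloads` is hypothetical and not part of `get_data.sh`.

```python
from pathlib import Path

def pending_downloads(root="datasets", flag="mtdata.signature.txt"):
    """Return the names of `<lang>-eng` directories that lack the signature file."""
    root = Path(root)
    if not root.is_dir():
        # Nothing has been downloaded yet (or the path is wrong)
        return []
    return sorted(
        d.name
        for d in root.glob("*-eng")
        if d.is_dir() and not (d / flag).is_file()
    )
```

Calling `pending_downloads()` from the notebook's working directory lists the pairs that would be fetched again on the next run.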

In [None]:
%%bash

./get_data.sh  | parallel -j 20

/bin/sh: q: command not found


# Collect statistics

In [18]:
%%bash 
stats=dataset-stats.tsv 
[[ -f $stats ]] || {
    (printf "Lang\tSents\tXXX\tENG\tHeldout\n"
    for i in datasets/*; do
        lang=$(echo $i | cut -f2 -d/ | cut -f1 -d-)  # datasets/spa-eng -> spa
        stat1=$(wc -lw < $i/train.$lang | awk -v OFS='\t' '{print $1,$2}')
        stat2=$(wc -w < $i/train.eng)
        heldout=$(ls $i/tests/*.eng 2> /dev/null | wc -l )
        printf "%s\t%s\t%s\t%s\n" "$lang" "$stat1" "$stat2" "$heldout"  # data stays out of the format string
    done
    ) > $stats
}

In [4]:
import pandas as pd

In [8]:
pd.read_csv('dataset-stats.tsv', sep='\t', header=0)

Unnamed: 0,Lang,Sents,XXX,ENG,Heldout
0,abk,24699,307650,439727,0
1,ace,1021,12026,12200,0
2,ach,78958,1544241,1359292,0
3,acm,11,52,75,0
4,acu,12953,292220,290637,0
...,...,...,...,...,...
565,zpa,365,6381,6642,0
566,zsl,2299,37803,37420,0
567,zsm,1234,7898,8674,0
568,zul,1066419,14324960,19162983,0


# Merge


In [22]:
%%bash
merged=merged

[[ -d $merged/tests ]] || mkdir -p $merged/tests
[[ -f $merged/train.raw.tsv ]] || (
    for i in datasets/*-eng; do
        lang=$(echo $i | awk -F '[/-]' '{print $2}')
        cp $i/tests/*.* $merged/tests 2> /dev/null
        paste <(zcat $i/train.meta.gz) $i/train.$lang $i/train.eng  | sed "s/^/$lang\t/" 
    done
) >  $merged/train.raw.tsv

In [2]:
!ls -lh merged/train.raw.tsv

-rw------- 1 tgowda G-819290 119G Jun  7 03:42 merged/train.raw.tsv


In [8]:
# smaller file for development with Spark
!head -100000 merged/train.raw.tsv > merged/train-sample.raw.tsv
!wc -l merged/train-sample.raw.tsv
!head -2 merged/train-sample.raw.tsv

100000 merged/train-sample.raw.tsv
abk	JW300-abk_eng	Аҵакы	Table of Contents
abk	JW300-abk_eng	© 2016 Watch Tower Bible and Tract Society of Pennsylvania	© 2016 Watch Tower Bible and Tract Society of Pennsylvania


In [1]:
from pyspark.sql import SparkSession
from functools import partial

In [8]:
from sacremoses import MosesTokenizer, MosesPunctNormalizer
from html import unescape

normr = MosesPunctNormalizer(
        lang='en',
        norm_quote_commas=True,
        norm_numbers=True,
        pre_replace_unicode_punct=True,
        post_remove_control_chars=True,
    )
tok = MosesTokenizer(lang='en')

def tokenize_eng(text):
    try:
        text=unescape(text)
        text = normr.normalize(text)
        text = tok.tokenize(text, escape=False, return_str=True, aggressive_dash_splits=True,
            protected_patterns=tok.WEB_PROTECTED_PATTERNS)
        return text
    except Exception:  # on tokenizer failure, tag the line instead of aborting the job
        if text:
            return '<TOKERR> ' + text
        else:
            return ''

tokenize_src = tokenize_eng  # using the English tokenizer on the source side; not terrific, but not terrible either
print(tokenize_eng("This's cool-stuff...! http://isi.edu  @id #hashtag email@gmail.com and,this but oh.no  no... no."))
print(tokenize_src("मैं तुमसे प्यार करता हूँ!। चिल्लाईए मत।"))

This 's cool @-@ stuff ... ! http://isi.edu @id #hashtag email@gmail.com and , this but oh.no no ... no .
मैं तुमसे प्यार करता हूँ ! । चिल्लाईए मत ।


In [3]:

spark = SparkSession \
    .builder \
    .appName("SacreMoses Tokenizer on PySpark") \
    .config("spark.driver.memory", "20g") \
    .getOrCreate()

In [9]:
raw_file = 'merged/train.raw.tsv'
df = spark.read.csv(raw_file, sep='\t', schema='lang STRING, ds_name STRING, src STRING, eng STRING')

In [10]:
df_tok = df.rdd.map(lambda r: (r.lang, r.ds_name, tokenize_src(r.src), tokenize_eng(r.eng)))

In [None]:
def to_tsv(rec):
    return '\t'.join(col.replace('\t', ' ') for col in rec)

out_file = raw_file.replace('.raw.tsv', '.tok.tsv')
print(f'saving at {out_file}')
df_tok.map(to_tsv).saveAsTextFile(out_file)

saving at merged/train.tok.tsv
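
`saveAsTextFile` writes a directory of `part-*` files rather than a single file, so `merged/train.tok.tsv` ends up as a directory. If one flat TSV is wanted, the parts can be concatenated. The sketch below assumes uncompressed output; `concat_parts` is a hypothetical helper, not a Spark API.

```python
import shutil
from pathlib import Path

def concat_parts(spark_out_dir, single_file):
    """Concatenate Spark `part-*` files, in name order, into one file.

    Returns the number of part files that were merged.
    """
    # `part-*` skips the `_SUCCESS` marker and hidden `.part-*.crc` checksum files
    parts = sorted(Path(spark_out_dir).glob("part-*"))
    with open(single_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

The output path is up to the caller; it should sit outside the Spark output directory so that a second run does not pick up its own result.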


In [4]:
spark.stop()

---

Select the top languages; exclude extremely low-resource languages.

"Top" is based on the number of parallel sentences or tokens.

It would be nice to see which languages rank highest by number of speakers.

https://store.ethnologue.com/2019-ethnologue-200 has a list, but it costs $250 to obtain.
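
One way to rank by parallel sentences is sketched below, using rows in the `dataset-stats.tsv` layout. The `min_sents` cutoff of 10,000 is an arbitrary illustration, not a threshold chosen in this notebook, and `select_top_languages` is a hypothetical helper.

```python
import csv
import io

def select_top_languages(rows, min_sents=10_000, key="Sents"):
    """Drop languages below `min_sents` sentences; sort the rest by `key`, largest first."""
    kept = [r for r in rows if int(r["Sents"]) >= min_sents]
    return sorted(kept, key=lambda r: int(r[key]), reverse=True)

# A few rows copied from the statistics table above, in dataset-stats.tsv layout
sample_tsv = (
    "Lang\tSents\tXXX\tENG\tHeldout\n"
    "abk\t24699\t307650\t439727\t0\n"
    "acm\t11\t52\t75\t0\n"
    "zsm\t1234\t7898\t8674\t0\n"
    "zul\t1066419\t14324960\t19162983\t0\n"
)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
top = select_top_languages(rows, min_sents=10_000)
print([r["Lang"] for r in top])  # ['zul', 'abk']
```

Passing `key="ENG"` or `key="XXX"` ranks by token counts instead of sentence counts.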



In [2]:
%%bash
ls datasets/ | head

abk-eng
ace-eng
ach-eng
acm-eng
acu-eng
ada-eng
ady-eng
aed-eng
afb-eng
afh-eng


In [24]:
!module load gcc/7


Lmod is automatically replacing "xl/16.1.1" with "gcc/7.3.0".


Inactive Modules:
  1) spectrum_mpi



In [None]:
%%bash
