<a href="https://colab.research.google.com/github/soniasol/Formation_TEI/blob/main/notebooks/test_normalisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Test of the Normalization Pipeline

Install all the dependencies.

In [None]:
%%capture
!pip install fairseq@git+https://github.com/pytorch/fairseq.git@5a75b079bf8911a327940c28794608e003a9fa52
!pip install sentencepiece sacrebleu hydra-core omegaconf==2.0.5 gdown==4.2.0

Download the model files.

Later, we can specify our own .zip file (after creating a release file).

In [None]:
!wget https://github.com/gabays/32M7131/releases/download/Norm/Normalisation-models.zip
!unzip Normalisation-models.zip
!mv -f French-normalisation-data-models data-models
!rm Normalisation-models.zip

--2023-11-06 09:43:08--  https://github.com/gabays/32M7131/releases/download/Norm/Normalisation-models.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/609944064/426229d6-b699-4f7b-b942-7f27168e037f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231106%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231106T094308Z&X-Amz-Expires=300&X-Amz-Signature=8e373ce61c319e24e03b99f533664b780c1efe54ee6799fe3e33c94e8e54cb5e&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=609944064&response-content-disposition=attachment%3B%20filename%3DNormalisation-models.zip&response-content-type=application%2Foctet-stream [following]
--2023-11-06 09:43:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/609944064/426229d6-b699-4f7b-b942-7f27168

Clone our repository.

In [None]:
!git clone https://github.com/soniasol/test_normalisation.git

fatal: destination path 'test_normalisation' already exists and is not an empty directory.


Move the model files we are going to be using under the `models` folder.

In [None]:
!mv data-models/bpe_joint_1000.model test_normalisation/models/bpe_joint_1000.model
!mv data-models/bpe_joint_1000.vocab test_normalisation/models/bpe_joint_1000.vocab
!mv data-models/lstm_norm.pt test_normalisation/models/lstm_norm.pt

mv: cannot stat 'data-models/bpe_joint_1000.model': No such file or directory
mv: cannot stat 'data-models/bpe_joint_1000.vocab': No such file or directory
mv: cannot stat 'data-models/lstm_norm.pt': No such file or directory


---

We can now start by defining a few functions we are going to use. These functions will eventually be moved to a Python file saved in the repository, and then imported like the functions of any other Python module.

For instance, if the file is saved as `$PATH_TO_REPO/utils/file_utils.py`, where `$PATH_TO_REPO` is the path to the repository on the machine we are using, we will be able to import the functions by:
* adding `$PATH_TO_REPO` to the `PYTHONPATH` (either using the terminal or using the Python library `sys`)
* using `from utils import file_utils` and then, e.g., calling `file_utils.read_file()` to use the `read_file()` function.
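
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: a throwaway temporary directory stands in for `$PATH_TO_REPO`, the module written to disk is a shortened stand-in for the real `file_utils.py`, and the empty `__init__.py` is added here simply to make `utils` an explicit package.

```python
import os
import sys
import tempfile

# Assumption: a temporary directory plays the role of $PATH_TO_REPO
path_to_repo = tempfile.mkdtemp()
path_to_utils = os.path.join(path_to_repo, "utils")
os.makedirs(path_to_utils)

# Shortened stand-in for the real utils/file_utils.py
module_source = (
    "def read_file(filename):\n"
    "    with open(filename) as fp:\n"
    "        return [line.strip() for line in fp]\n"
)
with open(os.path.join(path_to_utils, "__init__.py"), "w") as fp:
    fp.write("")
with open(os.path.join(path_to_utils, "file_utils.py"), "w") as fp:
    fp.write(module_source)

# Step 1: make the repository root importable (the `sys` variant of PYTHONPATH)
if path_to_repo not in sys.path:
    sys.path.insert(0, path_to_repo)

# Step 2: import the module and call the function through it
from utils import file_utils

path_to_demo_file = os.path.join(path_to_repo, "demo.txt")
with open(path_to_demo_file, "w") as fp:
    fp.write("first line\nsecond line\n")

demo_lines = file_utils.read_file(path_to_demo_file)
print(demo_lines)
```

In the notebook itself, only the `sys.path` line and the `import` would be needed, since the repository already exists on disk.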

In [None]:
def read_file(filename):

  """
    Read a (text) file line by line.
  """

  # list storing all of the lines (strings)
  str_list = []

  # `fp` is a file pointer (see documentation) to the file `filename`
  # `fp` will exist, in this case, in the scope of the `with` statement only (see documentation)
  with open(filename) as fp:

    # append every line of the (text) file to our list `str_list`
    for line in fp:
      str_list.append(line.strip())

  # return the list
  return str_list

# -----------------------------

def write_file(str_list, path_to_file):

  """
    Write a list of strings `str_list` to a file `path_to_file`.
    The `path_to_file` variable must contain the path to the file and the file extension.
    For instance, `path_to_file` might be "/home/user/Desktop/output.txt"
  """

  # as before, `fp` is a file pointer (see documentation), this time to the file `path_to_file`
  # `fp` will exist, in this case, in the scope of the `with` statement only (see documentation)
  with open(path_to_file, 'w') as fp:

    # write every string to the file `path_to_file`, one string per line
    for string in str_list:
      fp.write(string + '\n')
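
A quick way to check that the two functions agree with each other is a write-then-read round trip. The sketch below restates both functions in compact form so that it runs on its own. The explicit `encoding="utf-8"` and the temporary file are assumptions made for this illustration, not part of the functions defined above.

```python
import os
import tempfile

def read_file(filename):
    # compact restatement: one stripped string per line
    with open(filename, encoding="utf-8") as fp:
        return [line.strip() for line in fp]

def write_file(str_list, path_to_file):
    # compact restatement: one string per line
    with open(path_to_file, "w", encoding="utf-8") as fp:
        for string in str_list:
            fp.write(string + "\n")

# two lines taken from the input text used in this notebook
lines = ["¶ Chanson nouuelle de lorigine/", "autorite/ ⁊ puissance de leuãgile/"]

# assumption: a temporary file stands in for a real output path
path_to_tmp_file = os.path.join(tempfile.mkdtemp(), "roundtrip.txt")

write_file(lines, path_to_tmp_file)
lines_read_back = read_file(path_to_tmp_file)

print(lines_read_back == lines)
```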

Let's define some constants (i.e., variables written in capital letters that we know are not going to change and should not be changed anywhere in the notebook!)

In [None]:
import os

PATH_TO_REPO = "/content/test_normalisation"

PATH_TO_MODELS = os.path.join(PATH_TO_REPO, "models")

PATH_TO_DATA = os.path.join(PATH_TO_REPO, "data")
PATH_TO_INPUT_DATA = os.path.join(PATH_TO_DATA, "input_data")
PATH_TO_OUTPUT_DATA = os.path.join(PATH_TO_DATA, "output_data")

Let's also make sure the output data folder is created (if not there yet!)

Note: use `os.mkdir` whenever you are sure that only the last directory is missing. In this case, `/content/test_normalisation/data` SHOULD exist, and if it does not we want an error to pop up (that's why we use `os.mkdir`). Conversely, `os.makedirs` creates the whole directory tree.

In [None]:
if not os.path.exists(PATH_TO_OUTPUT_DATA):
  print("The directory", PATH_TO_OUTPUT_DATA, "doesn't exist yet. Creating it...")
  os.mkdir(PATH_TO_OUTPUT_DATA)
else:
  print("The directory", PATH_TO_OUTPUT_DATA, "exists already.")

The directory /content/test_normalisation/data/output_data exists already.
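
The difference between `os.mkdir` and `os.makedirs` described in the note can be seen in a small sketch. It works in a temporary directory (an assumption made so that nothing in the repository is touched) where the intermediate `data` folder is deliberately missing.

```python
import os
import tempfile

# assumption: a temporary directory where "data" does not exist yet
base_dir = tempfile.mkdtemp()
nested_dir = os.path.join(base_dir, "data", "output_data")

# os.mkdir creates only the last directory, so it fails when "data" is missing
try:
    os.mkdir(nested_dir)
    mkdir_failed = False
except FileNotFoundError:
    mkdir_failed = True

# os.makedirs creates every missing directory along the path
os.makedirs(nested_dir, exist_ok=True)

print("os.mkdir failed:", mkdir_failed)
print("directory exists now:", os.path.isdir(nested_dir))
```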


We can then test the `read_file()` function right away.

The following command uses what's called a "list comprehension": a compact way to build a list from another iterable in a single expression. It usually follows the syntax `[x for x in whatever() if condition(x)]`, where `x` is any variable name that does not shadow a Python built-in (e.g., DO NOT use `str` or `list`), `whatever()` is anything that returns an iterable, and `condition(x)` is a test the element `x` must pass to be kept. The `if` part can be omitted.

In [None]:
input_files_list = [f for f in os.listdir(PATH_TO_INPUT_DATA) if f.lower().endswith(".txt")]

This is equivalent to the much longer:

```
input_files_list = list()

# get all the files in the directory
for f in os.listdir(PATH_TO_INPUT_DATA):

  # keep only the files that end in ".txt"
  if f.lower().endswith(".txt"):
    input_files_list.append(f)
```
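
The same pattern can be tried on a plain in-memory list. In the sketch below, the file names are hypothetical and only stand in for what `os.listdir()` might return. Because `os.listdir()` gives no guarantee about ordering, the result is passed through `sorted()`.

```python
# hypothetical directory listing, in no particular order
file_names = ["Test1_Chansons_nouvelles.txt", "notes.md", "Test0_Chansons_nouvelles.TXT"]

# comprehension with a condition: keep only the text files
txt_files = [f for f in file_names if f.lower().endswith(".txt")]

# comprehension without a condition: transform every element
name_lengths = [len(f) for f in file_names]

# sorted() returns a new list in alphabetical order
txt_files_sorted = sorted(txt_files)

print(txt_files_sorted)
```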

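Let's select the first text file in the list built from the `PATH_TO_INPUT_DATA` folder. Note that `os.listdir()` returns entries in arbitrary order, so wrap the list in `sorted()` if alphabetical order is required.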

In [None]:
# print the list
print("input_files_list:", input_files_list)

# select the first element/file (fn stands for "filename")
input_fn = input_files_list[0]
print("\ninput_fn:", input_fn)

# generate the path to `input_fn`
path_to_input_file = os.path.join(PATH_TO_INPUT_DATA, input_fn)
print("\npath_to_input_file:", path_to_input_file)

input_files_list: ['Test0_Chansons_nouvelles.txt', 'Test1_Chansons_nouvelles.txt']

input_fn: Test0_Chansons_nouvelles.txt

path_to_input_file: /content/test_normalisation/data/input_data/Test0_Chansons_nouvelles.txt


Let's read the file.

In [None]:
# sanity check: the file should exist!
assert os.path.exists(path_to_input_file), "The file %s doesn't exist! Exiting..."%(path_to_input_file)

input_file = read_file(path_to_input_file)

We can inspect the input file.

Remember this will be a list of strings. We can print the first 5 strings in the list!

In [None]:
input_file[0:5]

['',
 '¶ Chanson nouuelle de lorigine/',
 'autorite/ ⁊ puissance de leuãgile/',
 'contre tous ceulx qui le appellent',
 'nouuelle doctrine/ Sur le chant:']

---

Let's now preprocess the sentences to be normalized by tokenizing them.

In [None]:
import sentencepiece

tokenization_model_name = "bpe_joint_1000.model"
path_to_tokenization_model = os.path.join(PATH_TO_MODELS, tokenization_model_name)

# load the model in a variable named `spm`
spm = sentencepiece.SentencePieceProcessor(model_file=path_to_tokenization_model)

# tokenize the input file!
input_file_tokenized = spm.encode(input_file, out_type=str)

We can inspect the output of the tokenization process by printing a few elements of `input_file_tokenized`. Remember, this will be a LIST OF LISTS (every line of the `.txt` file will generate a list!) with A LOT of elements!

For instance, `input_file_tokenized[0]` will be the tokenization of the first line in the `.txt` file, `input_file_tokenized[1]` the second line, and so on.

In [None]:
input_file_tokenized[1]

['▁',
 '¶',
 '▁Ch',
 'ans',
 'on',
 '▁n',
 'ouu',
 'elle',
 '▁de',
 '▁l',
 'or',
 'ig',
 'ine',
 '/']

We can save the result of the tokenization step in a `.txt` file named as the `.txt` file we tokenized, but with `_tokenized` at the end!

(this step is not really mandatory, but it's good for debugging purposes as well!)

Since the `write_file()` function expects a list of strings as input, we need to turn every inner list of `input_file_tokenized` (which, again, stores a list of lists) into a single string. We can do so in several ways, one of them being the string method `join()` combined with a list comprehension.

In [None]:
tokenized_sentence_list = [' '.join(tokens) for tokens in input_file_tokenized]

Every line is now a single string in which the tokens are separated by spaces. We can verify this by printing one element of the list. For instance, `tokenized_sentence_list[1]` stores the tokenized version of `input_file[1]`.

In [None]:
print("Input string/line:", input_file[1])

print("Tokenized version (space separated):", tokenized_sentence_list[1])

Input string/line: ¶ Chanson nouuelle de lorigine/
Tokenized version (space separated): ▁ ¶ ▁Ch ans on ▁n ouu elle ▁de ▁l or ig ine /


In [None]:
# get the name of the input file, without the extension
tokenized_file_name, extension = os.path.splitext(os.path.basename(path_to_input_file))

# add "_tokenized" at the end of the name, and add the extension back as well
tokenized_file_name = tokenized_file_name + "_tokenized" + extension

print("tokenized_file_name:", tokenized_file_name)

path_to_tokenized_file = os.path.join(PATH_TO_OUTPUT_DATA, tokenized_file_name)
print("\npath_to_tokenized_file:", path_to_tokenized_file)

tokenized_file_name: Test0_Chansons_nouvelles_tokenized.txt

path_to_tokenized_file: /content/test_normalisation/data/output_data/Test0_Chansons_nouvelles_tokenized.txt


Save the file at `path_to_tokenized_file`.

In [None]:
write_file(tokenized_sentence_list, path_to_tokenized_file)

---

Finally, we can run the `fairseq` model trained for normalization purposes on the tokens/sentences saved in the file at `path_to_tokenized_file`. Let's break down the following command before running it.

```
!head -n 10 $INPUT_FILE | fairseq-interactive $PATH_TO_MODELS --source-lang src --target-lang trg --path $PATH_TO_MODEL_FILE > $OUTPUT_FILE
```

The command `!head -n 10 $INPUT_FILE` returns the first 10 lines of the file at `$INPUT_FILE`. `head` is a Linux command run through the notebook's `!` prefix, so we need to either write out the path to the file in full or prefix a Python variable with `$` (e.g., `$path_to_tokenized_file`) so that the notebook substitutes its value instead of passing the name as plain text.

In [None]:
!head -n 10 $path_to_tokenized_file


▁ ¶ ▁Ch ans on ▁n ouu elle ▁de ▁l or ig ine /
▁au t or ite / ▁ ⁊ ▁puiss ance ▁de ▁l eu ã g ile /
▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el l ent
▁n ouu elle ▁d oc tr ine / ▁S ur ▁le ▁ch ant :
▁Ie ▁ne ▁s c ay ▁pas ▁c õ ment .
▁I
▁E ▁mes b a h is ▁c õ ment
▁L h um ain ▁ent end ement /
▁R emp l y ▁dou tre c uy d ance /


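The command `| fairseq-interactive $PATH_TO_MODELS --source-lang src --target-lang trg --path $PATH_TO_MODEL_FILE` runs the model on the lines output by the `head` command (see above). Here `$PATH_TO_MODELS` is the folder containing the dictionaries and `$PATH_TO_MODEL_FILE` is the path to the model file.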

Finally, `> $OUTPUT_FILE` saves the output in `$OUTPUT_FILE`.
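
For readers who prefer to stay in Python, the `head -n 10` part of the pipeline can be sketched with `itertools.islice`, which stops reading after a given number of lines. The demo file and its content are hypothetical and created only for this illustration.

```python
import os
import tempfile
from itertools import islice

# hypothetical demo file with 25 numbered lines
path_to_demo_file = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path_to_demo_file, "w", encoding="utf-8") as fp:
    for i in range(25):
        fp.write("line %d\n" % i)

# islice stops after 10 lines, like `head -n 10`
with open(path_to_demo_file, encoding="utf-8") as fp:
    first_lines = [line.rstrip("\n") for line in islice(fp, 10)]

print(len(first_lines), first_lines[0], first_lines[-1])
```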

Let's create a variable storing the path to the normalization model.

In [None]:
normalization_model_name = "lstm_norm.pt"
path_to_normalization_model = os.path.join(PATH_TO_MODELS, normalization_model_name)

Let's also create a variable storing the path to the file we want to save the output of the normalization step in.

In [None]:
# get the name of the input file, without the extension
normalized_file_name, extension = os.path.splitext(os.path.basename(path_to_input_file))

# add "_normalized" at the end of the name, and add the extension back as well
normalized_file_name = normalized_file_name + "_normalized" + extension

print("normalized_file_name:", normalized_file_name)

path_to_normalized_file = os.path.join(PATH_TO_OUTPUT_DATA, normalized_file_name)
print("\npath_to_normalized_file:", path_to_normalized_file)

normalized_file_name: Test0_Chansons_nouvelles_normalized.txt

path_to_normalized_file: /content/test_normalisation/data/output_data/Test0_Chansons_nouvelles_normalized.txt
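
The renaming logic is the same for the tokenized and the normalized file, so it could be wrapped in a small helper. The function name `add_suffix` is hypothetical and chosen only for this sketch.

```python
import os

def add_suffix(path_to_file, suffix):
    """Return the file name of `path_to_file` with `suffix` inserted before the extension."""
    name, extension = os.path.splitext(os.path.basename(path_to_file))
    return name + suffix + extension

path_to_example = "/content/test_normalisation/data/input_data/Test0_Chansons_nouvelles.txt"

tokenized_name = add_suffix(path_to_example, "_tokenized")
normalized_name = add_suffix(path_to_example, "_normalized")

print(tokenized_name)
print(normalized_name)
```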


Before running the normalization, we need to create two symbolic links for the `fairseq` normalization model to work. A symbolic link is not a copy: it is a small file that points to another file, so both names give access to the same content. The file we want to link to is `dict_norm.txt`. We need two links for the model to work: one must be named `dict.src.txt` and the other `dict.trg.txt`, so that when we specify the source and target language the model knows which dictionary to use. The location of the dictionaries must be given right after `fairseq-interactive`, before the `--source-lang` and `--target-lang` flags. In our case, we will save the symbolic links in the folder specified by the variable `PATH_TO_MODELS`.

Note: we can populate these in the release of the repository to make this cleaner. No worries!

In [None]:
# create symlinks
!ln -sf /content/data-models/dict_norm.txt $PATH_TO_MODELS/dict.src.txt
!ln -sf /content/data-models/dict_norm.txt $PATH_TO_MODELS/dict.trg.txt
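
The behaviour of a symbolic link can be checked from Python with `os.symlink`. The sketch below works in a temporary directory, and the dictionary content is made up for the illustration. It shows that a change to the original file is visible through the link, which would not be the case with a copy.

```python
import os
import tempfile

# assumption: a temporary directory with a made-up dictionary file
work_dir = tempfile.mkdtemp()
path_to_dict = os.path.join(work_dir, "dict_norm.txt")
with open(path_to_dict, "w", encoding="utf-8") as fp:
    fp.write("▁de 100\n")

# create dict.src.txt and dict.trg.txt, both pointing to dict_norm.txt
links = {}
for lang in ("src", "trg"):
    path_to_link = os.path.join(work_dir, "dict.%s.txt" % lang)
    os.symlink(path_to_dict, path_to_link)
    links[lang] = path_to_link

# modify the original file AFTER the links were created
with open(path_to_dict, "a", encoding="utf-8") as fp:
    fp.write("▁le 90\n")

# reading through the link shows the updated content
with open(links["src"], encoding="utf-8") as fp:
    content_via_link = fp.read()

print(os.path.islink(links["src"]), os.readlink(links["trg"]) == path_to_dict)
print(content_via_link)
```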

Finally, let's run the normalization step on the first 10 lines of the input `.txt` file. Note that this might take a moment.

In [None]:
!head -n 10 $path_to_tokenized_file | fairseq-interactive $PATH_TO_MODELS --source-lang src --target-lang trg --path $path_to_normalization_model > $path_to_normalized_file

2023-11-06 09:43:30.149685: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-06 09:43:30.149749: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-06 09:43:30.149774: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-06 09:43:30.155204: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's inspect some lines of the normalized file by loading it first.

In [None]:
normalized_file = read_file(path_to_normalized_file)

normalized_file[12:25]

['W-1\t0.220\tseconds',
 'H-1\t-0.2047189176082611\t▁ Ô ▁Ch ans on ▁n ouu elle ▁de ▁l or ig ine',
 'D-1\t-0.2047189176082611\t▁ Ô ▁Ch ans on ▁n ouu elle ▁de ▁l or ig ine',
 'P-1\t-0.0001 -1.7299 -0.0002 -0.3039 -0.0006 -0.0002 -0.2201 -0.0223 -0.0001 -0.0000 -0.0003 -0.0001 -0.0000 -0.5882',
 'S-2\t▁au t or ite / ▁ <unk> ▁puiss ance ▁de ▁l eu ã g ile /',
 'W-2\t0.254\tseconds',
 'H-2\t-0.41317692399024963\t▁au t or ite ▁: ▁ coup ▁puiss ance ▁de ▁l eu g ile ile',
 'D-2\t-0.41317692399024963\t▁au t or ite ▁: ▁ coup ▁puiss ance ▁de ▁l eu g ile ile',
 'P-2\t-0.0445 -0.0000 -0.0000 -0.0003 -3.0227 -0.1079 -3.0655 -0.0003 -0.0006 -0.0002 -0.0008 -0.0306 -0.0889 -0.0021 -0.2465 -0.0000',
 'S-3\t▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el l ent',
 'W-3\t0.238\tseconds',
 'H-3\t-0.08134998381137848\t▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el ent',
 'D-3\t-0.08134998381137848\t▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el ent']

The file contains a lot of information we don't need. Let's define a couple of functions to keep only the normalized sentences and then de-tokenize them.

In [None]:
def extract_hypothesis(path_to_file):
    outputs = []
    with open(path_to_file) as fp:
        for line in fp:
            # keep only the lines starting with H- (that stands for hypothesis)
            if line.startswith('H-'):
                # keep only the third column (since the indices start from [0], this will be [2])
                outputs.append(line.strip().split('\t')[2])
    return outputs

# -----------------------------

def decode_sp(str_list):
    return [sent.replace(' ', '').replace('▁', ' ').strip() for sent in str_list]
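
To see both steps on a small input, the sketch below applies them to three lines copied from the `fairseq` output shown above. It is a self-contained restatement rather than a call to the functions themselves. It matches the `H-` prefix with `startswith`, keeps the third tab-separated column, removes the spaces between tokens and turns every `▁` back into a space.

```python
# three lines copied from the fairseq output shown above
sample_lines = [
    "S-3\t▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el l ent",
    "W-3\t0.238\tseconds",
    "H-3\t-0.08134998381137848\t▁contre ▁tous ▁c eu l x ▁qui ▁le ▁app el ent",
]

# keep the third column of the lines that start with "H-"
hypotheses = [line.split("\t")[2] for line in sample_lines if line.startswith("H-")]

# de-tokenize: drop the spaces between tokens, then turn "▁" back into a space
decoded = [h.replace(" ", "").replace("▁", " ").strip() for h in hypotheses]

print(decoded)
```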

Let's keep only the normalized sentences and de-tokenize them. We can then compare them to the corresponding lines of the input file.

In [None]:
normalized_sentences_tokenized = extract_hypothesis(path_to_normalized_file)
normalized_sentences = decode_sp(normalized_sentences_tokenized)
num_normalized_sentences = len(normalized_sentences)

# overwrite `path_to_normalized_file` with the sentences only
write_file(normalized_sentences, path_to_normalized_file)

for line_number, (orig, norm) in enumerate(zip(input_file[0:num_normalized_sentences], normalized_sentences)):
  print("\nLine", line_number)
  print("\torig:", orig)
  print("\tnorm:", norm)


Line 0
	orig: 
	norm: Â

Line 1
	orig: ¶ Chanson nouuelle de lorigine/
	norm: Ô Chanson nouuelle de lorigine

Line 2
	orig: autorite/ ⁊ puissance de leuãgile/
	norm: autorite : coup puissance de leugileile

Line 3
	orig: contre tous ceulx qui le appellent
	norm: contre tous ceulx qui le appelent

Line 4
	orig: nouuelle doctrine/ Sur le chant:
	norm: nouuelle doctrineine Sur le chant:

Line 5
	orig: Ie ne scay pas cõment.
	norm: Je ne sais pas commment.

Line 6
	orig: I
	norm: I

Line 7
	orig: E mesbahis cõment
	norm: E mesbahis commment

Line 8
	orig: Lhumain entendement/
	norm: Lhumain entendement

Line 9
	orig: Remply doutrecuydance/
	norm: Remply doutrecuidance


We can download the output files (tokenized and normalized) by running the following cells.

In [None]:
from google.colab import files

files.download(path_to_tokenized_file)


In [None]:
files.download(path_to_normalized_file)
