# Lab: Korean lemmatizer and generator
`Shami THIRION SEN`

### Instructions:

### 1. Data and objectives: <br>

In this lab, we build a morphosyntactic lemmatizer/segmenter for Korean.
- The data come from UD (KAIST corpus).
- They are already split into train/valid/test.
(But remember to limit the size of the training set while developing the model!)
- The model can be tested with and without Unicode normalization + decomposition
(Python package unicodedata)
- It is recommended to extract the vocabulary "manually".
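
As an illustration of the optional normalization + decomposition step, the standard `unicodedata` module can split a precomposed Hangul syllable into its conjoining jamo (NFD) and recompose it (NFC). The helper name `decompose` is only illustrative and is not part of the lab's `utils.py`.

```python
import unicodedata

def decompose(word: str) -> str:
    """Return the NFD form: each Hangul syllable becomes its conjoining jamo."""
    return unicodedata.normalize("NFD", word)

word = "한"
jamo = decompose(word)

# The decomposed form is longer than the syllable it came from.
print(len(word), len(jamo))
# NFC recomposes the jamo into the original syllable.
print(unicodedata.normalize("NFC", jamo) == word)
```

Working on jamo shrinks the character vocabulary considerably, at the cost of longer sequences, which matters with a fixed length of 48 tokens.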

  ---

### 2. Neural architectures:
- The architectures explored in this lab are "sequence to sequence" models, and
more precisely encoder-decoders.
- Encoder: a first RNN "encodes" the information it reads and summarizes it in a single
vector (seq2vec)
- Decoder: a second RNN "decodes" and generates the target from that vector
- Several models are to be considered. The encoder part does not change, but the sequence generated by the decoder can then be produced:
    - **basic model**: from the encoded vector only. That vector has to be repeated with a RepeatVector layer.
    - **advanced model**: start with the encoded vector, then continue by feeding the decoder with what it has just generated at t-1
- **Bonus**: the two approaches can be combined by merging the encoded vector and the vector decoded at t-1, by concatenation (layers.Concatenate) or sum (layers.Add)
- **Bonus 2**: the UD files also provide morpheme tags, so one can try to produce **a double output**: segmentation on one side and tagging on the other.

Warning: to use the "advanced" model, a Generator class inspired by the "Brassens" lab has to be provided because, unlike during training, the sequence to produce is not known in advance at inference time (nor in a rigorous test).

- See the provided graphs, which describe the models to build.

---

### 3. Tokenization and vectorization
A few guidelines for the TextVectorizer:
- we tokenize ourselves with the function **tf.strings.unicode_split**
- we build the vocabulary list of all possible characters ourselves,
adding the tokens [START] and [END]
- we avoid RaggedTensor by fixing the **length to 48 tokens** (masking still
needs attention)
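
The guidelines above can be sketched in plain Python, without TensorFlow: split into characters, map each one to an integer id, then right-pad to 48 with 0 so that the padding can later be masked. The function `encode` and the toy vocabulary layout (index 0 for padding, index 1 for `[UNK]`) are assumptions made for this illustration.

```python
MAX_LEN = 48            # fixed sequence length from the guidelines
PAD, UNK = "", "[UNK]"  # index 0 is reserved for padding so it can be masked

def encode(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to integer ids, truncate to max_len, right-pad with 0."""
    t2i = {tok: i for i, tok in enumerate(vocab)}
    ids = [t2i.get(tok, t2i[UNK]) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

# Toy vocabulary: special tokens first, then the characters.
toy_vocab = [PAD, UNK, "[START]", "[END]", "+", "도", "승", "짐"]
ids = encode(list("짐승+도"), toy_vocab)
print(ids[:6], len(ids))
```

Because every padded position holds 0, an `Embedding` layer built with `mask_zero=True` can ignore those positions.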

---

### 4. Building the instances (X,Y)
- With Keras, it is possible to build training data more complex than a simple pair of tensors (X,Y).
- To do so, we provide a **Python dictionary** in place of the tensor X.
    - The dictionary keys must have the same names as the network's "Input Layer" layers, and the values are the tensors sent to those inputs.
    - In the present case, we therefore build dictionaries with the following data:<br>
    
    **input_1**: word form as observed in the text <br>
    
    **input_2**: analysed word that enters the decoder (hence shifted right, starting with a [START] token) <br>
    
    **Y**: simply the analysed word followed by the [END] token
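
A minimal sketch of how one (X, Y) instance could be assembled before vectorization. The helper `make_instance` is hypothetical; the actual construction is done by `create_instances` in `utils.py`, which is not shown here.

```python
START, END = "[START]", "[END]"

def make_instance(form: str, analysis: str):
    """Return (input_1, input_2, y) as token lists for one word."""
    input_1 = list(form)                # observed word form, one token per character
    input_2 = [START] + list(analysis)  # decoder input, shifted right by [START]
    y = list(analysis) + [END]          # target: the analysis followed by [END]
    return input_1, input_2, y

x1, x2, y = make_instance("짐승도", "짐승+도")
print(x1)
print(x2)
print(y)
```

At every position t, `input_2[t]` is the token that precedes `y[t]`, which is what lets the decoder learn to predict the next character from the previous one.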

---

### Importing the libraries

In [1]:
# %env TF_FORCE_GPU_ALLOW_GROWTH=true
# %matplotlib widget
from typing import Optional
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import tensorflow as tf
from tensorflow import keras
from tensorflow.data import TextLineDataset
from keras.layers import Input, LSTM, Dense, RepeatVector, Embedding, Bidirectional, Concatenate, Dropout, StringLookup, RNN, LayerNormalization
from keras import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


from utils import *
import pandas as pd
from pprint import pprint
import random



### 1. Data extraction
- extraction of the 1st and 2nd columns
- We first experiment with **1000** tokens, then with 10000
- The results only start to be reliable with **10000** tokens

In [2]:
dev_tokens, dev_morphs= read_csv_file('corpus_ko/dev.tsv', 0,1) # read columns 0 and 1
test_tokens, test_morphs= read_csv_file('corpus_ko/test.tsv',0,1)
train_tokens, train_morphs= read_csv_file('corpus_ko/train.tsv',0,1)

In [3]:
tf.size(dev_tokens), tf.size(test_tokens), tf.size(train_tokens), len(train_tokens), type(train_tokens)

(<tf.Tensor: shape=(), dtype=int32, numpy=25278>,
 <tf.Tensor: shape=(), dtype=int32, numpy=28366>,
 <tf.Tensor: shape=(), dtype=int32, numpy=296446>,
 296446,
 list)

### 1000 tokens

In [4]:
train_size = 1000 # final runs use train sizes 1000, 5000, 10000
dev_size = int(train_size * 0.2)
test_size = int(train_size * 0.2)

train_X, train_Y = train_tokens[:train_size], train_morphs[:train_size]
dev_X, dev_Y = dev_tokens[:dev_size], dev_morphs[:dev_size]
# note: the test split is sliced from train_tokens/train_morphs, so it overlaps the training data
test_X, test_Y = train_tokens[:test_size], train_morphs[:test_size]

# manual extraction of the vocabulary from the lemmatized corpus
vocab = build_vocab(train_Y)
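
`build_vocab` lives in `utils.py`, which is not shown in this notebook. The sketch below is only a guess at the kind of "manual" extraction the instructions ask for: collecting the set of characters seen in the analysed words. The name `build_char_vocab` is hypothetical, and the real helper may differ (for example by returning a tensor).

```python
def build_char_vocab(analyses):
    """Collect the sorted set of characters seen in the analysed words."""
    chars = set()
    for analysis in analyses:
        chars.update(analysis)  # a str is iterated character by character
    return sorted(chars)

sample = ["짐승+도", "하+면", "정도+는"]
char_vocab = build_char_vocab(sample)
print(len(char_vocab), char_vocab)
```

The special tokens `[START]` and `[END]` are not characters of the corpus, so they would have to be added to this list separately.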

In [5]:
# To compare tokens and lemmas
print(test_X[:10])
print(test_Y[:10])

['하기야', '짐승도', '잘', '가르치기만', '하면', '어느', '정도는', '순치될', '수', '있다']
['하기야', '짐승+도', '잘', '가르치+기+만', '하+면', '어느', '정도+는', '순치+되+ㄹ', '수', '있+다']


In [6]:
# vocabulary size
tf.size(vocab)


<tf.Tensor: shape=(), dtype=int32, numpy=398>

### 2. Creating the instances - tokenization and vectorization
 - functions in ***utils.py***

#### Reminder
A few guidelines for the TextVectorizer
- we tokenize ourselves with the function **tf.strings.unicode_split**
- we build the vocabulary list of all possible characters ourselves, adding the tokens [START] and [END]
- we avoid RaggedTensor by fixing the length to 48 tokens (masking still needs attention) -> `mask_zero=True`
- add extra tokens => `tf.constant(['[START]', '[END]'])`


#### Data vectorization

In [7]:
X1_train, X2_train, Y_train, vectorization = create_instances(train_X, train_Y,vocab)
X1_dev, X2_dev, Y_dev, _ = create_instances(dev_X, dev_Y,vocab)
X1_test, X2_test, Y_test, _ = create_instances(test_X, test_Y,vocab)

X1 before split:  ['하기야', '짐승도']
X1:  <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'\xeb\x8f\x84']]>

 X2 before split:  ['[START]하기야', '[START]짐승+도']

 X2:  <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'+', b'\xeb\x8f\x84']]>

Y beofore split:  ['하기야[END]', '짐승+도[END]']

Data Y <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'+', b'\xeb\x8f\x84']]>
X1 before split:  ['내', '고향은']
X1:  <tf.RaggedTensor [[b'\xeb\x82\xb4'], [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'\xec\x9d\x80']]>

 X2 before split:  ['[START]내', '[START]고향+은']

 X2:  <tf.RaggedTensor [[b'\xeb\x82\xb4'],
 [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'+', b'\xec\x9d\x80']]>

Y beofore split:  ['내[END]', '고향+은[END]']

Data Y <tf.RaggedTensor [[b'\xeb\x82\xb4'],
 [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'+', b'\xec\x9d\x80']]>
X1 before split:  ['하기야', '짐승도']


In [8]:
# pprint(X1_train[5])
# pprint(X2_train[5])
# pprint(Y_train[5])

####   Formatting the data - basic model
- preparing the input data for the basic model, as a dictionary <br>
```x_train = {'input1': X1, 'input2': X2}   |    y_train = Y``` 

In [9]:
x_train = {'input1': X1_train, 'input2': X2_train}
y_train = Y_train

x_dev = {'input1': X1_dev, 'input2': X2_dev}
y_dev = Y_dev

x_test = {'input1': X1_test, 'input2': X2_test}
y_test = Y_test

### Basic model
- the input names must be the same as the dictionary keys
    - here **'input1'** and **'input2'**
- RepeatVector is applied to the last state of the encoders to form the decoder input

In [10]:
def basic_model():

    # input1
    input1 = Input(shape=(None,), name='input1')
    emb1 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding1')(input1)
    emb1 = Dropout(0.3)(emb1)
    encoder_lstm1 = Bidirectional(LSTM(64, return_state=True, name='lstm1'))
    encoder_outputs1, forward_h1, forward_c1, backward_h1, backward_c1 = encoder_lstm1(emb1)

    # input2
    input2 = Input(shape=(None,), name='input2')
    emb2 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding2')(input2)
    emb2 = Dropout(0.3)(emb2)
    encoder_lstm2 = Bidirectional(LSTM(64, return_state=True, name='lstm2'))
    encoder_outputs2, forward_h2, forward_c2, backward_h2, backward_c2 = encoder_lstm2(emb2)

    # Concatenate the states
    state_h = Concatenate()([forward_h1, backward_h1, forward_h2, backward_h2])
    state_c = Concatenate()([forward_c1, backward_c1, forward_c2, backward_c2])
    encoder_states = [state_h, state_c]

    # encoder_states is the initial state of the decoder
    decoder_inputs = RepeatVector(48)(state_h) # basic model: repeat the encoded vector 48 times
    decoder_lstm = LSTM(256, return_sequences=True)
    decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(vectorization.vocabulary_size(), activation='softmax', name='output')
    decoder_outputs = decoder_dense(decoder_outputs)

    # define the model
    model = Model([input1, input2], decoder_outputs)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model


In [11]:
model = basic_model()

In [12]:
model.summary()

In [13]:
history = model.fit(x_train, y_train, 
                    validation_data=(x_dev, y_dev),
                    epochs=10, batch_size=32)

Epoch 1/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 9s 120ms/step - accuracy: 0.8075 - loss: 3.6175 - val_accuracy: 0.9251 - val_loss: 0.5886
Epoch 2/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 3s 106ms/step - accuracy: 0.9196 - loss: 0.5565 - val_accuracy: 0.9251 - val_loss: 0.5062
Epoch 3/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 4s 116ms/step - accuracy: 0.9222 - loss: 0.4765 - val_accuracy: 0.9251 - val_loss: 0.4740
Epoch 4/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 4s 126ms/step - accuracy: 0.9248 - loss: 0.4386 - val_accuracy: 0.9414 - val_loss: 0.4025
Epoch 5/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 4s 125ms/step - accuracy: 0.9392 - loss: 0.3691 - val_accuracy: 0.9430 - val_loss: 0.3582
Epoch 6/10
32/32 ━━━━━━━━━━━━━━━━━━━━ 4s 136ms/step - accuracy: 0.9422 - loss: 0.3241 - val_accuracy: 0.9429 - val_loss: 0.3435
Epoch 7/10
32/32

### Evaluation on the test data

In [14]:
# evaluation on the test data
test_loss, test_accuracy = model.evaluate(x_test, y_test)

print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - accuracy: 0.9450 - loss: 0.2853
Test Loss: 0.2887529134750366
Test Accuracy: 0.9441667199134827


### Let us look at the generated sequences
- creation of the dictionaries **index_to_token - i2t** and **token_to_index - t2i**

In [15]:
vocabulary = vectorization.get_vocabulary()
i2t = {i: tok for i, tok in enumerate(vocabulary)} # index-to-token mapping
# print(i2t)

t2i = {tok: i for i, tok in enumerate(vocabulary)}
# print(t2i)

# print(t2i.get('+'))

In [16]:
## DECODE 
predictions = model.predict(x_test)

7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 116ms/step


#### Extracting the predicted sequences and comparing them with the real sequences
- With little data (100 or 1000 training tokens):
    - many repetitions of the same characters
   


In [17]:
# I did not find a built-in Keras way to decode...
predicted_tokens = []

for pred in predictions:
    predicted = ""
    for idx in pred:
        idx=  tf.argmax(idx, axis=-1).numpy()
        predicted+=i2t[idx] 
    predicted_tokens.append(predicted)

# predicted_tokens[:10]

In [18]:
test_tokens = ["".join(map(lambda idx: i2t[idx.numpy().astype(int) ] , test)) for test in y_test]
# test_tokens[:20]

### Comparing generated and real tokens

In [19]:
# compare predicted_tokens with test_tokens
for p , t in zip(predicted_tokens[:10], test_tokens[:10]):
    print("predicted = ", p)
    print("real = ", t, "\n")

common = [x for x in predicted_tokens if x in test_tokens]
print("Nombre de prédiction correcte:",len(common), "\nNombre de prediction fausse: ", len(predicted_tokens)-len(common))

predicted =  ++
real =  하기야 

predicted =  +++
real =  짐승+도 

predicted =  .
real =  잘 

predicted =  그+++++
real =  가르치+기+만 

predicted =  +++
real =  하+면 

predicted =  +
real =  어느 

predicted =  +++
real =  정도+는 

predicted =  것++++
real =  순치+되+ㄹ 

predicted =  .
real =  수 

predicted =  것+
real =  있+다 

Nombre de prédiction correcte: 55 
Nombre de prediction fausse:  145
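
The count above uses list membership (`x in test_tokens`), so a prediction is counted as correct whenever the same string occurs anywhere in the reference list. A stricter, position-wise exact match could be computed as sketched below. The function `exact_match` and the toy lists are illustrative and do not reproduce the figures reported in this notebook.

```python
def exact_match(predicted, reference):
    """Count positions where the predicted string equals the reference string."""
    if len(predicted) != len(reference):
        raise ValueError("both lists must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct, len(reference) - correct

pred = ["하기야", "도도+도", "잘", "잘"]
gold = ["하기야", "짐승+도", "잘", "수"]

correct, wrong = exact_match(pred, gold)
membership = len([p for p in pred if p in gold])

print("position-wise:", correct, "correct,", wrong, "wrong")
print("membership   :", membership, "counted as correct")
```

On this toy data the last prediction is wrong for its position but still passes the membership test, which is why the two counts differ.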


### Same training - with the number of tokens increased to 10000

In [20]:
train_size = 10000 # final runs use train sizes 1000, 5000, 10000
dev_size = int(train_size * 0.2)
test_size = int(train_size * 0.2)

train_X, train_Y = train_tokens[:train_size], train_morphs[:train_size]
dev_X, dev_Y = dev_tokens[:dev_size], dev_morphs[:dev_size]
# note: the test split is sliced from train_tokens/train_morphs, so it overlaps the training data
test_X, test_Y = train_tokens[:test_size], train_morphs[:test_size]

# manual extraction of the vocabulary from the lemmatized corpus
vocab = build_vocab(train_Y)

#### Vectorization

In [21]:
X1_train, X2_train, Y_train, vectorization = create_instances(train_X, train_Y,vocab)
X1_dev, X2_dev, Y_dev, _ = create_instances(dev_X, dev_Y,vocab)
X1_test, X2_test, Y_test, _ = create_instances(test_X, test_Y,vocab)

X1 before split:  ['하기야', '짐승도']
X1:  <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'\xeb\x8f\x84']]>

 X2 before split:  ['[START]하기야', '[START]짐승+도']

 X2:  <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'+', b'\xeb\x8f\x84']]>

Y beofore split:  ['하기야[END]', '짐승+도[END]']

Data Y <tf.RaggedTensor [[b'\xed\x95\x98', b'\xea\xb8\xb0', b'\xec\x95\xbc'],
 [b'\xec\xa7\x90', b'\xec\x8a\xb9', b'+', b'\xeb\x8f\x84']]>
X1 before split:  ['내', '고향은']
X1:  <tf.RaggedTensor [[b'\xeb\x82\xb4'], [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'\xec\x9d\x80']]>

 X2 before split:  ['[START]내', '[START]고향+은']

 X2:  <tf.RaggedTensor [[b'\xeb\x82\xb4'],
 [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'+', b'\xec\x9d\x80']]>

Y beofore split:  ['내[END]', '고향+은[END]']

Data Y <tf.RaggedTensor [[b'\xeb\x82\xb4'],
 [b'\xea\xb3\xa0', b'\xed\x96\xa5', b'+', b'\xec\x9d\x80']]>
X1 before split:  ['하기야', '짐승도']


In [22]:
def basic_model():

    # input1
    input1 = Input(shape=(None,), name='input1')
    emb1 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding1')(input1)
    emb1 = Dropout(0.3)(emb1)
    encoder_lstm1 = Bidirectional(LSTM(64, return_state=True, name='lstm1'))
    encoder_outputs1, forward_h1, forward_c1, backward_h1, backward_c1 = encoder_lstm1(emb1)

    # input2
    input2 = Input(shape=(None,), name='input2')
    emb2 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding2')(input2)
    emb2 = Dropout(0.3)(emb2)
    encoder_lstm2 = Bidirectional(LSTM(64, return_state=True, name='lstm2'))
    encoder_outputs2, forward_h2, forward_c2, backward_h2, backward_c2 = encoder_lstm2(emb2)

    # Concatenate the states
    state_h = Concatenate()([forward_h1, backward_h1, forward_h2, backward_h2])
    state_c = Concatenate()([forward_c1, backward_c1, forward_c2, backward_c2])
    encoder_states = [state_h, state_c]

    # encoder_states is the initial state of the decoder
    decoder_inputs = RepeatVector(48)(state_h) # basic model: repeat the encoded vector 48 times
    decoder_lstm = LSTM(256, return_sequences=True)
    decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(vectorization.vocabulary_size(), activation='softmax', name='output')
    decoder_outputs = decoder_dense(decoder_outputs)

    # define the model
    model = Model([input1, input2], decoder_outputs)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model


### Building the dictionaries

In [23]:
x_train = {'input1': X1_train, 'input2': X2_train}
y_train = Y_train

x_dev = {'input1': X1_dev, 'input2': X2_dev}
y_dev = Y_dev

x_test = {'input1': X1_test, 'input2': X2_test}
y_test = Y_test

In [24]:
big_model=basic_model()
history = big_model.fit(x_train, y_train, 
                    validation_data=(x_dev, y_dev),
                    epochs=10, batch_size=32)

Epoch 1/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 47s 134ms/step - accuracy: 0.9114 - loss: 1.2958 - val_accuracy: 0.9404 - val_loss: 0.3428
Epoch 2/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 43s 136ms/step - accuracy: 0.9471 - loss: 0.2866 - val_accuracy: 0.9466 - val_loss: 0.2967
Epoch 3/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 135ms/step - accuracy: 0.9536 - loss: 0.2385 - val_accuracy: 0.9511 - val_loss: 0.2666
Epoch 4/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 136ms/step - accuracy: 0.9605 - loss: 0.2000 - val_accuracy: 0.9568 - val_loss: 0.2401
Epoch 5/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 135ms/step - accuracy: 0.9654 - loss: 0.1727 - val_accuracy: 0.9606 - val_loss: 0.2154
Epoch 6/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 135ms/step - accuracy: 0.9691 - loss: 0.1510 - val_accuracy: 0.9643 - val_loss: 0.1944
Epoch 7/10

### Evaluation and token comparison

In [25]:
# evaluation on the test data
test_loss, test_accuracy = big_model.evaluate(x_test, y_test)

print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 41ms/step - accuracy: 0.9866 - loss: 0.0609
Test Loss: 0.057124290615320206
Test Accuracy: 0.9874478578567505


In [26]:
vocabulary = vectorization.get_vocabulary()
i2t = {i: tok for i, tok in enumerate(vocabulary)} # index-to-token mapping
# print(i2t)

t2i = {tok: i for i, tok in enumerate(vocabulary)}
# print(t2i)


In [27]:
## DECODE 
predictions = big_model.predict(x_test)

63/63 ━━━━━━━━━━━━━━━━━━━━ 3s 43ms/step


In [28]:
# I did not find a built-in Keras way to decode...
predicted_tokens = []

for pred in predictions:
    predicted = ""
    for idx in pred:
        idx=  tf.argmax(idx, axis=-1).numpy()
        predicted+=i2t[idx] 
    predicted_tokens.append(predicted)

# predicted_tokens[:10]

In [29]:
test_tokens = ["".join(map(lambda idx: i2t[idx.numpy().astype(int) ] , test)) for test in y_test]
# test_tokens[:20]

### Comparison

In [30]:
# compare predicted_tokens with test_tokens
for p , t in zip(predicted_tokens[:10], test_tokens[:10]):
    print("predicted = ", p)
    print("real = ", t, "\n")

common = [x for x in predicted_tokens if x in test_tokens]
print("Nombre de prédiction correcte:",len(common), "\nNombre de prediction fausse: ", len(predicted_tokens)-len(common))

predicted =  하기야
real =  하기야 

predicted =  도도+도
real =  짐승+도 

predicted =  잘
real =  잘 

predicted =  가가가+기+만
real =  가르치+기+만 

predicted =  하+면
real =  하+면 

predicted =  어느
real =  어느 

predicted =  정도+는
real =  정도+는 

predicted =  순치+되+ㄹ
real =  순치+되+ㄹ 

predicted =  수
real =  수 

predicted =  있+다
real =  있+다 

Nombre de prédiction correcte: 1282 
Nombre de prediction fausse:  718


## Advanced Model
#### start with the encoded vector, then continue by feeding the decoder with what it has just generated at t-1

Input1 (sequence) → Embedding → Dropout → Bidirectional LSTM → Encoder states <br>
Input2 (sequence) → Embedding → Dropout → Bidirectional LSTM → Encoder states <br>
                                                    ↓<br>
                            Concatenation of the states of the two encoders<br>
                                                    ↓<br>
                           Initialization of the decoder states <br>
                                                    ↓ <br>
                               Decoder LSTM (with the embedding of Input1) <br>
                                                    ↓<br>
                               Normalization of the decoder outputs <br>
                                                    ↓<br>
                               Dense (Softmax) → Final prediction <br>


In [31]:
def model_advanced():
    # Input1
    input1 = Input(shape=(None,), name='input1')
    emb1 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding1')(input1)
    emb1 = Dropout(0.3)(emb1)
    encoder_lstm1 = Bidirectional(LSTM(64, return_state=True, name='lstm1'))
    encoder_outputs1, forward_h1, forward_c1, backward_h1, backward_c1 = encoder_lstm1(emb1)
    
    # Input2
    input2 = Input(shape=(None,), name='input2')
    emb2 = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding2')(input2)
    emb2 = Dropout(0.3)(emb2)
    encoder_lstm2 = Bidirectional(LSTM(64, return_state=True, name='lstm2'))
    encoder_outputs2, forward_h2, forward_c2, backward_h2, backward_c2 = encoder_lstm2(emb2)
    
    # Concatenate states
    state_h = Concatenate()([forward_h1, backward_h1, forward_h2, backward_h2])
    state_c = Concatenate()([forward_c1, backward_c1, forward_c2, backward_c2])
    encoder_states = [state_h, state_c]
    
    # Decoder
    decoder_input_h = Input(shape=(None, ), name='decoder_input_h') # only the state is used
    decoder_input_c = Input(shape=(None, ), name='decoder_input_c') # only the state is used
    decoder_lstm = LSTM(256, return_sequences=True, return_state=True, name='decoder_lstm')
    
    # Initialize with the encoder states
    decoder_outputs, _, _ = decoder_lstm(emb1, initial_state=encoder_states) # encoder_states initialize the decoder
    decoder_outputs = LayerNormalization()(decoder_outputs)
    
    # Output 
    decoder_dense = Dense(vectorization.vocabulary_size(), activation='softmax', name='output')
    decoder_outputs = decoder_dense(decoder_outputs)
    
    # model
    model = Model([input1, input2], decoder_outputs)
    
    # Compile the model
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model


In [33]:
model2 = model_advanced()

In [34]:
model2.summary()

In [35]:
model2.fit(x_train, y_train, 
                    validation_data=(x_dev, y_dev),
                    epochs=10, batch_size=32)

Epoch 1/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 41s 113ms/step - accuracy: 0.0250 - loss: 4.0114 - val_accuracy: 0.0490 - val_loss: 1.5246
Epoch 2/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 114ms/step - accuracy: 0.0516 - loss: 0.8753 - val_accuracy: 0.0624 - val_loss: 0.7026
Epoch 3/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 114ms/step - accuracy: 0.0603 - loss: 0.2992 - val_accuracy: 0.0658 - val_loss: 0.4722
Epoch 4/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 114ms/step - accuracy: 0.0643 - loss: 0.1309 - val_accuracy: 0.0832 - val_loss: 0.4026
Epoch 5/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 116ms/step - accuracy: 0.0717 - loss: 0.0743 - val_accuracy: 0.0720 - val_loss: 0.3619
Epoch 6/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 114ms/step - accuracy: 0.0699 - loss: 0.0468 - val_accuracy: 0.0723 - val_loss: 0.3579
Epoch 7/10

<keras.src.callbacks.history.History at 0x787960336770>

### Generator 
- takes as input the advanced model and the token_to_index and index_to_token dictionaries
- generates the next sequence
- '[UNK]' if the token is not found in the index
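
Stripped of the Keras layers, the generation loop described above reduces to the following pattern: start from `[START]`, feed each generated token back in, and stop at `[END]` or at a maximum length. Here `step` stands for any function that wraps one decoder step, and `toy_step` is a stand-in that replays a fixed analysis instead of calling a trained model. Both names are hypothetical.

```python
def generate(step, start="[START]", end="[END]", max_len=48):
    """Feed each generated token back in until `end` or max_len tokens."""
    result, token, state = [], start, None
    while len(result) < max_len:
        token, state = step(token, state)  # one decoder step
        if token == end:
            break
        result.append(token)
    return "".join(result)

def toy_step(prev_token, state):
    """Stand-in for a trained decoder: replays a fixed analysis."""
    script = ["하", "+", "면", "[END]"]
    position = 0 if state is None else state
    return script[position], position + 1

print(generate(toy_step))
```

Carrying `state` from one call to the next is what distinguishes this loop from calling the decoder on each token independently.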


In [36]:

class Generator:
    def __init__(self, model, t2i, i2t):
        self.model = model
        self.t2i = t2i
        self.i2t = i2t

        self.encoder = self.find_layer_by_name('bidirectional', 'encoder')
        self.decoder = self.find_layer_by_name('decoder_lstm', 'decoder')
        self.emb = self.find_layer_by_name('embedding', 'embedding')
        self.rnn = self.decoder
        self.classif = self.find_layer_by_name('output', 'output')

    # return the first layer whose name contains name_keyword
    def find_layer_by_name(self, name_keyword, layer_type):
        for layer in self.model.layers:
            if name_keyword in layer.name:
                return layer
        raise ValueError(f"No {layer_type} layer found with keyword '{name_keyword}'")

    def _predict_next(self, last_char, state, encoder_output):
        e = self.emb(last_char)
        output_and_state = self.rnn(e, initial_state=state)  
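        output = output_and_state[0]  # extract the output tensor
        new_state = output_and_state[1:]  # extract the state
        probs = self.classif(output)
        # tf.random.categorical expects logits, so take the log of the softmax probabilities
        next_char = tf.random.categorical(tf.math.log(probs[-1, :, :]), 1)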
        return next_char, new_state
        
    def predict_seq(self, starting_token):
        result = []
        current_token = starting_token
        state = None
        encoder_output = None
        while current_token != '[END]' and len(result) <= 10:
            char_index = self.t2i.get(current_token, self.t2i['[UNK]']) # defaults to [UNK]
            char = tf.reshape(char_index, [1, 1])
            encoder_output = self.encoder(self.emb(char)) if encoder_output is None else encoder_output
            next_token, state = self._predict_next(char, state, encoder_output)
            current_token = self.i2t.get(tf.squeeze(next_token).numpy(), '[UNK]') # defaults to [UNK]
            if current_token != '[END]':
                result.append(current_token)
        return ''.join(result)
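
`tf.random.categorical` interprets its first argument as unnormalized log-probabilities (logits), whereas the `output` layer of the model ends in a softmax and therefore yields probabilities. The pure-Python sketch below, with a made-up distribution, illustrates why the difference matters when sampling: probabilities read as logits give a much flatter distribution, while taking the log first recovers the original one.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.97, 0.01, 0.01, 0.01]                   # made-up softmax output
as_logits = softmax(probs)                         # probabilities misread as logits
from_logs = softmax([math.log(p) for p in probs])  # log first, then normalize

print([round(p, 3) for p in as_logits])
print([round(p, 3) for p in from_logs])
```

With a vocabulary of several hundred characters the flattening is even stronger, so sampling without the log is close to drawing characters uniformly at random.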

In [37]:
model2 = model_advanced() # model2

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', save_best_only=True, monitor='val_loss')

history = model2.fit(x_train, y_train,
                    validation_data=(x_dev, y_dev),
                    epochs=10, batch_size=32, 
                    callbacks=[early_stopping, model_checkpoint])


Epoch 1/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 42s 116ms/step - accuracy: 0.0295 - loss: 3.9646 - val_accuracy: 0.0567 - val_loss: 1.5078
Epoch 2/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 37s 117ms/step - accuracy: 0.0611 - loss: 0.9029 - val_accuracy: 0.0869 - val_loss: 0.7116
Epoch 3/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 36s 117ms/step - accuracy: 0.0767 - loss: 0.3002 - val_accuracy: 0.0815 - val_loss: 0.4941
Epoch 4/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 37s 117ms/step - accuracy: 0.1011 - loss: 0.1343 - val_accuracy: 0.1091 - val_loss: 0.4276
Epoch 5/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 37s 118ms/step - accuracy: 0.0992 - loss: 0.0870 - val_accuracy: 0.4295 - val_loss: 0.3881
Epoch 6/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 37s 117ms/step - accuracy: 0.2412 - loss: 0.0536 - val_accuracy: 0.0980 - val_loss: 0.3692
Epoch 7/10

### Generation examples

In [38]:
gen = Generator(model2,t2i, i2t)
input_char = random.choice(vocab)
# input_char = '한'  # input Korean character
output_seq = gen.predict_seq(input_char)    
print(input_char, output_seq)

간 컵회뿌눈즐코수긋약틀샘


### Tokens generated for sequences taken from the test data

In [39]:

for input_token in test_X[:20]:
    token = gen.predict_seq(input_token)
    print(input_token, token)


하기야 위펼낭푸볕앉년남어얘것
짐승도 힘문퓨혈놀빛맙쌀냄꺾든
잘 똑늘강싶톳백둑흥층걸뛰
가르치기만 신년터욕글꽤넉끝근일약
하면 혈할량던황심눅휴률즘받
어느 햇싣좌남경.갓짖쭐분팔
정도는 허간쿵꿔판낮튼애온끗템
순치될 묻믿능빌빠풀곰꼭왕촉뿌
수 판욱매퉁념뻘윗즈업균렵
있다 맛그번맘풀묽관닭꿰일매
. 병임닝운딱나에쟁권데주
사람이 밸갑황땟삶문프거잔흑꽃
스스로 젖벽핵나첩끊끔꺼찍밭허
만물의 알히슬왕당닉떡형액애임
영장이라 찻쐬님까찬쳐륭썰베참님
하고 덜츠검웅히멀개거품딩개
우쭐대는 데람,악퍼토칭승리좁삶
까닭이 권업관갚뻐궁재답쩐방랄
여기에 약월못잣레심상은북잉외
있다 맘죽찰죠맥빗꽤쉬람물타


### ADVANCED MODEL 2
- The hidden states and cell states of the bidirectional LSTMs of the two input sequences are concatenated to form the initial states of the decoder
- the decoder is initialized directly with the concatenated encoder states.
- the embeddings of input1 are used as the sequential input of the decoder

### Generator:
- generates a sequence from a start token ([START]) until an end token ([END]) is reached or the maximum length is reached.


In [40]:
def model_advanced2():
    # Input1
    input1 = Input(shape=(None,), name='input1')
    emb1 = Embedding(vectorization.vocabulary_size(), 100, mask_zero=True, name='embedding1')(input1)
    emb1 = Dropout(0.3)(emb1)
    encoder_lstm1 = Bidirectional(LSTM(100, return_state=True, name='lstm1'))
    encoder_outputs1, forward_h1, forward_c1, backward_h1, backward_c1 = encoder_lstm1(emb1)

    # Input2
    input2 = Input(shape=(None,), name='input2')
    emb2 = Embedding(vectorization.vocabulary_size(), 100, mask_zero=True, name='embedding2')(input2)
    emb2 = Dropout(0.3)(emb2)
    encoder_lstm2 = Bidirectional(LSTM(100, return_state=True, name='lstm2'))
    encoder_outputs2, forward_h2, forward_c2, backward_h2, backward_c2 = encoder_lstm2(emb2)

    # Concatenate states
    state_h = Concatenate()([forward_h1, backward_h1, forward_h2, backward_h2])
    state_c = Concatenate()([forward_c1, backward_c1, forward_c2, backward_c2])
    encoder_states = [state_h, state_c]

    # Decoder
    decoder_input_h = Input(shape=(None, 300), name='decoder_input_h') # only the state will be used
    decoder_input_c = Input(shape=(None, 300), name='decoder_input_c') # only the state will be used
    decoder_lstm = LSTM(400, return_sequences=True, return_state=True, name='decoder_lstm')

    # Initialize decoder states with encoder states
    decoder_outputs, _, _ = decoder_lstm(emb1, initial_state=encoder_states)
    decoder_outputs = LayerNormalization()(decoder_outputs)

    # Output layer
    decoder_dense = Dense(vectorization.vocabulary_size(), activation='softmax', name='output')
    decoder_outputs = decoder_dense(decoder_outputs)

    # Define the model
    model = Model([input1, input2], decoder_outputs)

    # Compile the model
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

class Generator:
    def __init__(self, model, t2i, i2t):
        self.model = model
        self.t2i = t2i
        self.i2t = i2t

        self.encoder = self.find_layer_by_name('bidirectional', 'encoder')
        self.decoder = self.find_layer_by_name('decoder_lstm', 'decoder')
        self.emb = self.find_layer_by_name('embedding', 'embedding')
        self.rnn = self.decoder
        self.classif = self.find_layer_by_name('output', 'output')

    def find_layer_by_name(self, name_keyword, layer_type):
        for layer in self.model.layers:
            if name_keyword in layer.name:
                return layer
        raise ValueError(f"No {layer_type} layer found with keyword '{name_keyword}'")

    def _predict_next(self, last_char, state, encoder_output):
        e = self.emb(last_char)
        output_and_state = self.rnn(e, initial_state=state)
        output = output_and_state[0]
        new_state = output_and_state[1:]
        probs = self.classif(output)
        # tf.random.categorical expects logits, so take the log of the softmax probabilities
        next_char = tf.random.categorical(tf.math.log(probs[-1, :, :]), 1)
        return next_char, new_state

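    def predict_seq(self, starting_token='[START]'):
        # Caveat: state starts as None and encoder_output is never used by
        # _predict_next, so the sampled sequence is not conditioned on an input word.
        result = []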
        current_token = starting_token
        state = None
        encoder_output = None
        while current_token != '[END]' and len(result) <= 10:
            char_index = self.t2i.get(current_token, self.t2i['[UNK]'])
            char = tf.reshape(char_index, [1, 1])
            encoder_output = self.encoder(self.emb(char)) if encoder_output is None else encoder_output
            next_token, state = self._predict_next(char, state, encoder_output)
            current_token = self.i2t.get(tf.squeeze(next_token).numpy(), '[UNK]')
            if current_token != '[END]':
                result.append(current_token)
        return ''.join(result)
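
For lemmatization, a generator would need to read the whole input word first and then decode one symbol at a time, feeding back its own previous prediction. The loop below is a minimal sketch of such greedy decoding. `encode_fn` and `step_fn` are hypothetical stand-ins for the trained encoder and for one decoder step (embedding, LSTM, softmax); the toy functions exist only to make the example runnable and are not part of the notebook.

```python
def greedy_decode(encode_fn, step_fn, source_ids, start_id, end_id, max_len=48):
    """Greedy autoregressive decoding.

    encode_fn(source_ids) -> initial decoder state (hypothetical)
    step_fn(prev_id, state) -> (probabilities over the vocabulary, new state) (hypothetical)
    """
    state = encode_fn(source_ids)  # condition decoding on the whole input word
    prev_id = start_id
    output = []
    for _ in range(max_len):
        probs, state = step_fn(prev_id, state)
        # pick the most probable next symbol instead of sampling
        prev_id = max(range(len(probs)), key=probs.__getitem__)
        if prev_id == end_id:
            break
        output.append(prev_id)
    return output


# Toy stand-ins: the "state" is the list of source ids still to emit,
# so this fake decoder copies the source and then emits the end symbol.
VOCAB_SIZE, START_ID, END_ID = 6, 1, 2

def toy_encode(source_ids):
    return list(source_ids)

def toy_step(prev_id, state):
    probs = [0.0] * VOCAB_SIZE
    if state:
        probs[state[0]] = 1.0
        return probs, state[1:]
    probs[END_ID] = 1.0
    return probs, state

decoded = greedy_decode(toy_encode, toy_step, [3, 4, 5], START_ID, END_ID)
print(decoded)
```

With a real model, `encode_fn` would return the encoder states for the characters of the input token, and `step_fn` would run the decoder layers on a single time step.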


In [41]:
mod2= model_advanced2()

In [42]:
mod2.fit(x_train, y_train,
        validation_data=(x_dev, y_dev),
        epochs=10, batch_size=64, 
        callbacks=[early_stopping, model_checkpoint])

Epoch 1/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 58s 330ms/step - accuracy: 0.0619 - loss: 3.7681 - val_accuracy: 0.0757 - val_loss: 1.2762
Epoch 2/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 52s 333ms/step - accuracy: 0.0892 - loss: 0.6573 - val_accuracy: 0.5339 - val_loss: 0.6441
Epoch 3/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 52s 331ms/step - accuracy: 0.4399 - loss: 0.2070 - val_accuracy: 0.0759 - val_loss: 0.4536
Epoch 4/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 52s 332ms/step - accuracy: 0.0840 - loss: 0.0801 - val_accuracy: 0.0856 - val_loss: 0.3870
Epoch 5/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 52s 332ms/step - accuracy: 0.1189 - loss: 0.0455 - val_accuracy: 0.0898 - val_loss: 0.3579


<keras.src.callbacks.history.History at 0x78797ea558d0>

### Generation examples

In [43]:
gen = Generator(mod2,t2i, i2t)
input_char = '[START]'  # start-of-sequence marker
output_seq = gen.predict_seq(input_char)
    
print(output_seq)

슨덩쑤플볍빛엇얘떠관판


In [44]:
for input_token in test_X[:20]:
    print(input_token, gen.predict_seq(input_token))

하기야 슴돈노ㅆ당꿰토표럭의노
짐승도 밧뿍은쿨툼벽참둘콧외들
잘 땀님곡씻집린령냐준밥얘
가르치기만 덮걱초노릇럼려종냐찰잉
하면 맘론닭멀뱉벗밤맹꺾롭짖
어느 쭉얗물짜외땀싶얇겁끊균
정도는 심닐꺼터즙냉인음립곁푼
순치될 
수 향꿀메릎깥매쉽끊독딪참
있다 섬웬책등짙는돈ㅆㅂ샘올
. 운솔급막참싸아싶지에귀
사람이 씌감야빨톱색촉쭈평맥칠
스스로 햄추힘른레냐앉축괴팡흐
만물의 강창듬블왕널카막픔탁욕
영장이라 라함튼곁당임창느튀굽믿
하고 존앉클클얗깎늦틈싹경울
우쭐대는 찰넷찾틀황주끓풀챙에어
까닭이 제홀훌걱렵끼놓낫묵잉칙
여기에 슴넥볍,솜파출멸숫깊편
있다 독뽑깨컨부떡튀키칡풀코


---
---




## Multioutput
This part works with multi-input, multi-output data.
From the Korean CoNLL-U data, we build a model that takes 3 inputs:
- token
- lemmatized token (prefixed with [START])
- POS tags of the token's lemmas (prefixed with [START])

It produces two outputs:
- lemmatized token (suffixed with [END])
- POS tags of the token's lemmas (suffixed with [END])
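
The [START]-prefixed inputs and [END]-suffixed outputs form a teacher-forcing pair: at each position the decoder sees the previous gold symbol and must predict the next one. The sketch below illustrates this shift on one lemma and one tag string. It treats every character as a symbol and the two markers as single symbols, which is an assumption; `teacher_forcing_pair` is a hypothetical helper, not a function from `utils`.

```python
START, END = "[START]", "[END]"

def teacher_forcing_pair(symbols):
    """Return (decoder input, decoder target) for one sequence of symbols."""
    decoder_input = [START] + list(symbols)   # what the decoder reads
    decoder_target = list(symbols) + [END]    # what it must predict, shifted by one
    return decoder_input, decoder_target

lemma_in, lemma_out = teacher_forcing_pair("수석+연구원+은")
tag_in, tag_out = teacher_forcing_pair("NNG+NNG+JX")

print(lemma_in)
print(lemma_out)
```

Both sequences of a pair have the same length, so position `t` of the input lines up with position `t` of the target.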


In [45]:
# %env TF_FORCE_GPU_ALLOW_GROWTH=true
# %matplotlib widget
from typing import Optional
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import tensorflow as tf
from tensorflow import keras
from tensorflow.data import TextLineDataset
from keras.layers import Input, LSTM, Dense, RepeatVector, Embedding, Bidirectional, Concatenate, Dropout, StringLookup, RNN, LayerNormalization
from keras import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


from utils import *
import pandas as pd
from pprint import pprint
import random

#### Data extraction

Columns used from the CoNLL-U file:
- col2: tokens | col3: lemmatized_tokens | col5: lemma_tags
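
The `parse_conllu_file` helper comes from `utils` and is not shown here. As an illustration only, the sketch below reads the same three columns from CoNLL-U text. The function name `read_conllu` and the toy sample are assumptions; unknown columns in the sample are filled with `_`, the CoNLL-U placeholder for an unspecified value.

```python
def read_conllu(text):
    """Extract token (col 2), lemma (col 3) and XPOS tag (col 5) from CoNLL-U text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines separate sentences, '#' lines are comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        rows.append({"token": cols[1], "lemma": cols[2], "lemma_tag": cols[4]})
    return rows

# Toy sample (not taken from the corpus files)
SAMPLE = (
    "# toy sentence\n"
    "1\t수석연구원은\t수석+연구원+은\t_\tNNG+NNG+JX\t_\t_\t_\t_\t_\n"
    "2\t미국\t미국\t_\tNNP\t_\t_\t_\t_\t_\n"
)
rows = read_conllu(SAMPLE)
print(rows)
```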

In [46]:
# Load the data
df_train = parse_conllu_file('corpus_ko/ko_train.conllu')[:10000]
df_dev = parse_conllu_file('corpus_ko/ko_dev.conllu')[:2000]
df_test = parse_conllu_file('corpus_ko/ko_test.conllu')[:2000]

In [47]:
df_test[:10]

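| | token | lemma | lemma_tag |
|---|---|---|---|
| 0 | 현대증권 | 현대증권 | NNP |
| 1 | 배성영 | 배성영 | NNP |
| 2 | 수석연구원은 | 수석+연구원+은 | NNG+NNG+JX |
| 3 | " | " | SS |
| 4 | 미국 | 미국 | NNP |
| 5 | 통화당국의 | 통화+당국+의 | NNG+NNG+JKG |
| 6 | 경기부양 | 경기+부양 | NNG+NNG |
| 7 | 기조 | 기조 | NNG |
| 8 | 유지 | 유지 | NNG |
| 9 | 가능성과 | 가능+성+과 | XR+XSN+JC |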


In [48]:
print(list(df_test['lemma'])[:30])

['현대증권', '배성영', '수석+연구원+은', '"', '미국', '통화+당국+의', '경기+부양', '기조', '유지', '가능+성+과', '주요+20+개국', '(', 'G+20', ')', '정상+회의+를', '앞두+ㄴ', '위안+화', '절상', '압력+의', '고조+로', '인하+아', '당분간', '원화', '강세', '압력+은', '더', '지속+되+ㄹ', '전망', '"', '이']


In [49]:
print(list(df_test['lemma_tag'])[:30])

['NNP', 'NNP', 'NNG+NNG+JX', 'SS', 'NNP', 'NNG+NNG+JKG', 'NNG+NNG', 'NNG', 'NNG', 'XR+XSN+JC', 'NNG+SN+NNB', 'SS', 'SL+SN', 'SS', 'NNG+NNG+JKO', 'VV+ETM', 'NNG+XSN', 'NNG', 'NNG+JKG', 'NNG+JKB', 'VV+EC', 'MAG', 'NNG', 'NNG', 'NNG+JX', 'MAG', 'NNG+XSV+ETM', 'NNG', 'SS', 'VCP+EC']


### Building the vocabulary

In [50]:
vocab = create_char_vocab(df_train)
print(vocab[:30])


[b'\xea\xb6\x81', b'\xeb\x9e\xab', b'\xec\x84\xac', b'\xea\xb2\xac', b'\xe8\xbb\x8d', b'\xea\xba\xbc', b'\xed\x97\x9d', b'\xeb\x8d\xb8', b'\xeb\xaa\xbb', b'\xeb\x81\x8c', b'\xec\xb0\xac', b'\xec\xa0\x84', b'\xeb\x8a\xa6', b'\xec\x99\x95', b'\xeb\x8b\x99', b'\xeb\x82\x98', b'\xec\xb9\xa9', b'\xeb\x83\x90', b'\xeb\x86\x80', b'P', b'\xec\x9d\xbd', b',', b'\xe4\xb9\x9d', b'\xeb\x94\x94', b'\xec\x86\x9f', b'\xeb\xac\x98', b'\xeb\x90\x90', b'\xec\x98\x81', b'\xeb\x89\xb4', b'\xeb\xae\xa4']
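
`create_char_vocab` is defined in `utils` and its code is not shown. The sketch below is one possible way to extract a character vocabulary by hand; the name `build_char_vocab`, the ordering by frequency, and the two reserved entries (index 0 for padding, index 1 for unknown characters) are assumptions, not the notebook's actual implementation.

```python
from collections import Counter

def build_char_vocab(strings, specials=("", "[UNK]")):
    """Return the special entries followed by every character seen in `strings`."""
    counts = Counter(ch for s in strings for ch in s)
    # most frequent first; ties broken by code point for a reproducible order
    chars = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return list(specials) + chars

vocab_sketch = build_char_vocab(["잡스는", "잡스+는", "NNP+JX"])
char_to_id = {ch: i for i, ch in enumerate(vocab_sketch)}

# Characters never seen during extraction fall back to the [UNK] index
unknown_id = char_to_id.get("한", char_to_id["[UNK]"])
print(vocab_sketch, unknown_id)
```

A deterministic order keeps indices stable between runs, which matters when a saved model is reloaded with a rebuilt vocabulary.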


### Vectorizing the CoNLL-U data

In [51]:
# Vectorization
X1_train, X2_train, X3_train, Y1_train, Y2_train, vectorization = conll_instances(df_train, vocab)
X1_dev, X2_dev, X3_dev, Y1_dev, Y2_dev, _ = conll_instances(df_dev, vocab)
X1_test, X2_test, X3_test, Y1_test, Y2_test, _ = conll_instances(df_test, vocab)


0        잡스는
1     워즈니악에게
2        보수를
3         반씩
4        나누는
       ...  
95     다이오드의
96      본격적인
97       보급이
98     시작되었다
99         .
Name: token, Length: 100, dtype: object
0         잡스+는
1      워즈니악+에게
2         보수+를
3          반+씩
4         나누+는
        ...   
95      다이오드+의
96    본격+적+이+ㄴ
97        보급+이
98    시작+되+었+다
99           .
Name: lemma, Length: 100, dtype: object
0             NNP+JX
1            NNP+JKB
2            NNG+JKO
3            NNG+XSN
4             VV+ETM
           ...      
95           NNG+JKG
96    XR+XSN+VCP+ETM
97           NNG+JKS
98     NNG+XSV+EP+EF
99                SF
Name: lemma_tag, Length: 100, dtype: object
X1:  <tf.RaggedTensor [[b'\xec\x9e\xa1', b'\xec\x8a\xa4', b'\xeb\x8a\x94'],
 [b'\xec\x9b\x8c', b'\xec\xa6\x88', b'\xeb\x8b\x88', b'\xec\x95\x85',
  b'\xec\x97\x90', b'\xea\xb2\x8c']                                  ]>
X2:  <tf.RaggedTensor [[b'[', b'S', b'T', b'A', b'R', b'T', b']', b'\xec\x9e\xa1',
  b'\xec\x8a\xa4', b'+', b'\xeb\x8a\

In [52]:
## Build the reverse dictionary
vocabulary2 = vectorization.get_vocabulary()
reverse_vocab = {i: tok for i, tok in enumerate(vocabulary2)}  # maps each index to its token
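
A reverse dictionary of this kind can be used to turn a sequence of indices back into text. The helper below is a minimal sketch that skips padding and stops at the end marker. It assumes that index 0 is padding and that `[END]` is a single vocabulary entry, which may not hold for the vocabulary built in this notebook; `decode_ids` and the toy vocabulary are hypothetical.

```python
def decode_ids(ids, id_to_token, pad_id=0, end_token="[END]"):
    """Map indices to tokens, ignoring padding and stopping at the end marker."""
    pieces = []
    for idx in ids:
        if idx == pad_id:
            continue
        token = id_to_token.get(idx, "[UNK]")  # unknown indices become [UNK]
        if token == end_token:
            break
        pieces.append(token)
    return "".join(pieces)

# Toy vocabulary for illustration only
toy_vocab = ["", "[UNK]", "[END]", "미", "국"]
toy_reverse = {i: tok for i, tok in enumerate(toy_vocab)}

decoded_text = decode_ids([3, 4, 2, 0, 0], toy_reverse)
print(decoded_text)
```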

#### Multi-output sequence-to-sequence model

The model encodes the token, lemma, and tag sequences, concatenates their final states, and uses those states to generate the corresponding lemma and tag sequences.

**Inputs:**
- Token: input sequence for the tokens (input_token).
- Lemma: input sequence for the lemmas (input_lemma).
- Tag: input sequence for the tags (input_tag).

**Encoder:**
- Each input sequence goes through an embedding layer (size 30) followed by dropout (rate 0.3).
- Each embedding is then processed by a bidirectional LSTM with 64 units per direction.
- The hidden states and cell states of the bidirectional LSTMs are extracted for each input type (token, lemma, tag).

**State concatenation:**
The hidden states and cell states of the encoders are concatenated to form the initial states of the decoder.

**Decoder:**
- The concatenated hidden state (state_h) is repeated to match the expected output sequence length (48).
- The repeated sequence is concatenated with the token embeddings to form the input of the decoder LSTM.
- The decoder uses an LSTM with 384 units to generate the output sequences.


**Outputs:**
- Lemma: a dense layer with softmax activation generates the lemmas (output_lemma).
- Tag: another dense layer with softmax activation generates the tags (output_tag).
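
The decoder size follows from the concatenation: an LSTM's initial state must have as many dimensions as the layer has units. The short calculation below is plain arithmetic on the sizes quoted above, written out as a sketch; the variable names are illustrative.

```python
ENCODERS = ("token", "lemma", "tag")
DIRECTIONS = ("forward", "backward")
UNITS_PER_DIRECTION = 64
EMBEDDING_DIM = 30

# One hidden state per encoder and per direction is concatenated into state_h
state_parts = [(enc, d, UNITS_PER_DIRECTION) for enc in ENCODERS for d in DIRECTIONS]
decoder_units = sum(size for _, _, size in state_parts)

# Each decoder time step receives the repeated state_h plus one token embedding
decoder_input_dim = decoder_units + EMBEDDING_DIM

print(decoder_units, decoder_input_dim)
```

Because the repeated state has a fixed length of 48 time steps, concatenating it with the token embeddings only works if the token input is also padded to length 48.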


In [53]:
def create_model_multioutput():
    # Encoder for token
    input_token = Input(shape=(None,), name='input_token')
    emb_token = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding_token')(input_token)
    emb_token = Dropout(0.3)(emb_token)
    encoder_lstm_token = Bidirectional(LSTM(64, return_state=True, name='lstm_token'))
    _, forward_h_token, forward_c_token, backward_h_token, backward_c_token = encoder_lstm_token(emb_token)
    
    # Encoder for lemma
    input_lemma = Input(shape=(None,), name='input_lemma')
    emb_lemma = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding_lemma')(input_lemma)
    emb_lemma = Dropout(0.3)(emb_lemma)
    encoder_lstm_lemma = Bidirectional(LSTM(64, return_state=True, name='lstm_lemma'))
    _, forward_h_lemma, forward_c_lemma, backward_h_lemma, backward_c_lemma = encoder_lstm_lemma(emb_lemma)
    
    # Encoder for token tags
    input_tag = Input(shape=(None,), name='input_tag')
    emb_tag = Embedding(vectorization.vocabulary_size(), 30, mask_zero=True, name='embedding_tag')(input_tag)
    emb_tag = Dropout(0.3)(emb_tag)
    encoder_lstm_tag = Bidirectional(LSTM(64, return_state=True, name='lstm_tag'))
    _, forward_h_tag, forward_c_tag, backward_h_tag, backward_c_tag = encoder_lstm_tag(emb_tag)
    
    # Concatenate all encoder states
    state_h = Concatenate()([forward_h_token, backward_h_token, forward_h_lemma, backward_h_lemma, forward_h_tag, backward_h_tag])
    state_c = Concatenate()([forward_c_token, backward_c_token, forward_c_lemma, backward_c_lemma, forward_c_tag, backward_c_tag])
    encoder_states = [state_h, state_c]
    
    # Decoder
    dec_input_repeat = RepeatVector(48)(state_h)  
    dec_lstm_input = Concatenate()([dec_input_repeat, emb_token]) 
    
    dec_lstm = LSTM(384, return_sequences=True, return_state=True, name='dec_lstm')
    dec_outputs, _, _ = dec_lstm(dec_lstm_input, initial_state=encoder_states)
    
    # Decoder output for lemma
    dec_dense_lemma = Dense(vectorization.vocabulary_size(), activation='softmax', name='output_lemma')
    dec_outputs_lemma = dec_dense_lemma(dec_outputs)
    
    # Decoder output for lemma tags
    dec_dense_tag = Dense(vectorization.vocabulary_size(), activation='softmax', name='output_tag')
    dec_outputs_tag = dec_dense_tag(dec_outputs)
    
    model = Model([input_token, input_lemma, input_tag], [dec_outputs_lemma, dec_outputs_tag])
    model.compile(
        loss=["sparse_categorical_crossentropy", "sparse_categorical_crossentropy"], 
        optimizer="adam", 
        metrics=["accuracy", "accuracy"]
    )
    
    return model


In [54]:
model3 = create_model_multioutput()
model3.summary()

In [55]:
history = model3.fit(
    [X1_train, X2_train, X3_train],
    [Y1_train, Y2_train],
    validation_data=([X1_dev, X2_dev, X3_dev], [Y1_dev, Y2_dev]),
    epochs=10,
    batch_size=32
)


Epoch 1/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 102s 298ms/step - loss: 2.6724 - output_lemma_accuracy: 0.8305 - output_tag_accuracy: 0.7986 - val_loss: 0.7470 - val_output_lemma_accuracy: 0.9298 - val_output_tag_accuracy: 0.9169
Epoch 2/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 93s 298ms/step - loss: 0.6759 - output_lemma_accuracy: 0.9319 - output_tag_accuracy: 0.9281 - val_loss: 0.5101 - val_output_lemma_accuracy: 0.9400 - val_output_tag_accuracy: 0.9555
Epoch 3/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 94s 299ms/step - loss: 0.4759 - output_lemma_accuracy: 0.9422 - output_tag_accuracy: 0.9597 - val_loss: 0.3930 - val_output_lemma_accuracy: 0.9468 - val_output_tag_accuracy: 0.9745
Epoch 4/10
313/313 ━━━━━━━━━━━━━━━━━━━━ 94s 301ms/step - loss: 0.3675 - output_lemma_accuracy: 0.9495 - output_tag_accuracy: 0.9757 - val_loss: 0.3349 - val_output_lemma_accuracy: 0.9503 - val_output_

In [56]:
# Evaluate the model on the test set
test_loss, test_lemma_accuracy, test_tag_accuracy = model3.evaluate(
    [X1_test, X2_test, X3_test],
    [Y1_test, Y2_test]
)

print(f"\nTest Loss: {test_loss}, \nTest Lemma Accuracy: {test_lemma_accuracy}, \nTest Tag Accuracy: {test_tag_accuracy}")
# print(f"\nTest Lemma Accuracy: {test_lemma_accuracy}, \nTest Tag Accuracy: {test_tag_accuracy}")


63/63 ━━━━━━━━━━━━━━━━━━━━ 6s 87ms/step - loss: 0.1682 - output_lemma_accuracy: 0.9708 - output_tag_accuracy: 0.9937

Test Loss: 0.16909939050674438, 
Test Lemma Accuracy: 0.971281111240387, 
Test Tag Accuracy: 0.9938541650772095
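
These accuracies are computed per position. If padded positions are counted (whether they are depends on how masking propagates through this model, which is not verified here), they can raise the score well above what the visible characters justify. The sketch below compares a naive accuracy with one that ignores padding, on made-up index sequences; `accuracy` is a hypothetical helper and `pad_id=0` is an assumption.

```python
def accuracy(y_true, y_pred, pad_id=None):
    """Per-position accuracy; positions whose gold label is pad_id are skipped."""
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if pad_id is not None and t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# One toy sequence: 4 real symbols followed by 6 padding positions
y_true = [[5, 6, 7, 8, 0, 0, 0, 0, 0, 0]]
y_pred = [[5, 5, 5, 9, 0, 0, 0, 0, 0, 0]]

naive = accuracy(y_true, y_pred)
masked = accuracy(y_true, y_pred, pad_id=0)
print(naive, masked)
```

In this toy case only one of the four real symbols is right, yet the naive score is much higher because the six padding positions all match.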


### Predictions and comparison with the test data

In [70]:
# Make predictions
predictions = model3.predict([X1_test, X2_test, X3_test])

# Convert predictions to sequences
predicted_lemma_sequences = []
predicted_tag_sequences = []

for pred_lemma, pred_tag in zip(predictions[0], predictions[1]):
    predicted_lemma_sequence = ''.join([reverse_vocab.get(idx) for idx in tf.argmax(pred_lemma, axis=-1).numpy()])
    predicted_tag_sequence = ''.join([reverse_vocab.get(idx) for idx in tf.argmax(pred_tag, axis=-1).numpy()])
    predicted_lemma_sequences.append(predicted_lemma_sequence)
    predicted_tag_sequences.append(predicted_tag_sequence)

# Ensure Y1_test and Y2_test are numpy arrays
Y1_test_np = Y1_test.numpy()
Y2_test_np = Y2_test.numpy()

# Convert true sequences to human-readable format
true_lemma_sequences = []
true_tag_sequences = []

for y1, y2 in zip(Y1_test_np, Y2_test_np):
    true_lemma_sequence = ''.join([reverse_vocab.get(idx) for idx in y1])
    true_tag_sequence = ''.join([reverse_vocab.get(idx) for idx in y2])
    true_lemma_sequences.append(true_lemma_sequence)
    true_tag_sequences.append(true_tag_sequence)

# Print the predicted and true sequences for the first few examples in the test set
for i in range(10):
    print(f"Example {i + 1}:")
    print("Predicted lemmas:", predicted_lemma_sequences[i])
    print("Predicted tags:", predicted_tag_sequences[i])
    print("True lemmas:", true_lemma_sequences[i])
    print("True tags:", true_tag_sequences[i])
    print()




63/63 ━━━━━━━━━━━━━━━━━━━━ 5s 78ms/step
Example 1:
Predicted lemmas: 현현현가[UNK]END[UNK]
Predicted tags: NNP[UNK]END[UNK]
True lemmas: 현대증권[UNK]END[UNK]
True tags: NNP[UNK]END[UNK]

Example 2:
Predicted lemmas: 강민민[UNK]END[UNK]
Predicted tags: NNP[UNK]END[UNK]
True lemmas: 배성영[UNK]END[UNK]
True tags: NNP[UNK]END[UNK]

Example 3:
Predicted lemmas: 사원++원++은[UNK]END[UNK]
Predicted tags: NNG+NNG+JX[UNK]END[UNK]
True lemmas: 수석+연구원+은[UNK]END[UNK]
True tags: NNG+NNG+JX[UNK]END[UNK]

Example 4:
Predicted lemmas: "[UNK]END[UNK]
Predicted tags: SS[UNK]END[UNK]
True lemmas: "[UNK]END[UNK]
True tags: SS[UNK]END[UNK]

Example 5:
Predicted lemmas: 국국[UNK]END[UNK]
Predicted tags: NNP[UNK]END[UNK]
True lemmas: 미국[UNK]END[UNK]
True tags: NNP[UNK]END[UNK]

Example 6:
Predicted lemmas: 대민+대++의[UNK]END[UNK]
Predicted tags: NNG+NNG+JKG[UNK]END[UNK]
True lemmas: 통화+당국+의[UNK]END[UNK]
True tags: NNG+NNG+JKG[UNK]END[UNK]

Example 7:
Predicted lemmas: 경++목간[UNK]END[UNK]
Predicte

In [73]:
import pandas as pd

# Create DataFrame to represent common and different elements
data = {
    '': ['Common', 'Different'],
    'Lemmas': [common_lemma_count, len(predicted_lemma_set) - common_lemma_count],
    'Tokens': [common_token_count, len(predicted_token_set) - common_token_count]
}

df = pd.DataFrame(data)
df.set_index('', inplace=True)

# Display DataFrame
print("Comparison between Predicted and True Sequences:")
print(df)


Comparison between Predicted and True Sequences:
           Lemmas  Tokens
                         
Common        329     126
Different    1085     125
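
The table above compares sets of sequences, using counts computed in cells that are not shown. A complementary view is the share of words whose predicted lemma matches the reference exactly. The sketch below computes that rate on four of the examples printed earlier, with the end marker stripped by hand; `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(predicted, reference):
    """Fraction of positions where the predicted string equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# Examples 1, 2, 4 and 5 from the printed predictions
predicted_lemmas = ["현현현가", "강민민", '"', "국국"]
true_lemmas = ["현대증권", "배성영", '"', "미국"]

rate = exact_match_rate(predicted_lemmas, true_lemmas)
print(rate)
```

Unlike per-character accuracy, this metric gives no credit for a lemma that is only partly right, so it is closer to what a user of the lemmatizer would experience.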


### Observations

- While the basic model starts to produce reasonably reliable results, I did not manage, despite my research, to build a recursive (autoregressive) model for the advanced models.
- As for generation, I did not fully understand what is supposed to be generated:
    - for Brassens, text is generated starting from letters of the alphabet
    - here, since the task is lemmatization, the generator should produce the lemmatized form of a given token.
- The multi-output model seems to produce some correct results; however, I am not sure about the right way to decode.

### Possible improvements
- try normalizing the tokens
- vary the hyperparameters and compare performance
- confusion matrices
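
For the normalization idea, the standard `unicodedata` module can decompose each Hangul syllable into its jamo, which shrinks the character vocabulary to a few dozen symbols. The sketch below shows the round trip on a single syllable; the helper names `decompose` and `recompose` are illustrative.

```python
import unicodedata

def decompose(text):
    """Split precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)

def recompose(text):
    """Recombine conjoining jamo into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", text)

syllable = "한"
jamo = decompose(syllable)
code_points = [hex(ord(ch)) for ch in jamo]

print(len(jamo), code_points, recompose(jamo) == syllable)
```

Predictions made on decomposed text would need to be passed through `recompose` before being compared with the reference lemmas.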

