# Day 2: Embedding + BiLSTM Model (training)

Goal: build an **Embedding + BiLSTM** model, train it, and save the trained model.

Notes:
- We use `TextVectorization` (already covered on Day 1) through `TextPreprocessor`.
- We keep a train/validation split to monitor overfitting (Day 3 goes further with the learning curves).


## 1) Imports & Setup

In [1]:
from pathlib import Path
import sys

import numpy as np
import tensorflow as tf

cwd = Path.cwd().resolve()
PROJECT_DIR = None
for p in [cwd] + list(cwd.parents):
    if (p / 'src').exists():
        PROJECT_DIR = p
        break
if PROJECT_DIR is None:
    raise RuntimeError(f"Could not find project root containing 'src' starting from: {cwd}")
sys.path.insert(0, str(PROJECT_DIR))

from src.text_preprocessing import TextPreprocessor
from src.model_architecture import ModelConfig, build_bilstm_model

print('Project dir:', PROJECT_DIR)
print('Python:', sys.version)
print('TensorFlow:', tf.__version__)

Project dir: C:\Users\bello\Documents\data-science-portfolio\02_DL_NLP_Sentiment
Python: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
TensorFlow: 2.13.0


## 2) Load IMDB (vectorized + padded)

In [2]:
MAX_WORDS = 10_000
MAX_LEN = 200
VAL_SIZE = 5_000

pre = TextPreprocessor(max_words=MAX_WORDS, max_len=MAX_LEN)
data = pre.load_imdb_text(validation_size=VAL_SIZE, seed=42)

X_train, y_train = data.X_train, data.y_train
X_val, y_val = data.X_val, data.y_val

print('X_train:', X_train.shape, X_train.dtype)
print('X_val  :', X_val.shape, X_val.dtype)
print('y_train:', y_train.shape, y_train.dtype)
print('Positive-label ratio (mean):', float(y_train.mean()))

X_train: (20000, 200) int64
X_val  : (5000, 200) int64
y_train: (20000,) int32
Positive-label ratio (mean): 0.5001
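
To make the `(20000, 200)` shape concrete, here is a minimal pure-Python sketch of what integer vectorization with padding does to a single review. It follows the documented defaults of Keras `TextVectorization` (id 0 for padding, id 1 for out-of-vocabulary tokens, truncation and padding at the end of the sequence). The helper names `tokenize`, `build_vocab`, and `vectorize` are hypothetical, the two-sentence corpus is made up, and this is not the code inside `TextPreprocessor`.

```python
import re
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids, mirroring the TextVectorization defaults


def tokenize(text):
    # Lowercase, drop punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(texts, max_words):
    # Keep the most frequent tokens; two slots are reserved for PAD and OOV.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    kept = [tok for tok, _ in counts.most_common(max_words - 2)]
    return {tok: i + 2 for i, tok in enumerate(kept)}


def vectorize(text, vocab, max_len):
    # Map tokens to ids, truncate to max_len, then pad with PAD_ID at the end.
    ids = [vocab.get(tok, OOV_ID) for tok in tokenize(text)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


corpus = ["A great movie, great cast.", "A dull movie."]  # toy data
vocab = build_vocab(corpus, max_words=6)
row = vectorize("A great but slow movie!", vocab, max_len=8)
print(row)
```

Every row has exactly `max_len` entries, which is why `X_train` is a dense integer matrix. Words outside the vocabulary ("but", "slow") collapse to the OOV id, and the unused tail is filled with the padding id.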


## 3) Build the model (Embedding + BiLSTM + Dropout)

In [3]:
cfg = ModelConfig(
    vocab_size=MAX_WORDS,
    max_len=MAX_LEN,
    embedding_dim=128,
    rnn_units=64,
    dropout=0.3,
)
model = build_bilstm_model(cfg)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 128)          1280000   
                                                                 
 spatial_dropout1d (Spatial  (None, 200, 128)          0         
 Dropout1D)                                                      
                                                                 
 bidirectional (Bidirection  (None, 128)               98816     
 al)                                                             
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1378945 (5.26 MB)
Trainable params: 1378945 (5.26 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
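
The parameter counts in the summary can be checked by hand. The sketch below assumes a standard LSTM cell (four gates, each with an input kernel, a recurrent kernel, and a bias) and assumes the 128-wide bidirectional output is the concatenation of two 64-unit directions. The code of `build_bilstm_model` is not shown in this notebook, so this is a consistency check against the printed summary rather than a description of that function.

```python
vocab_size, embedding_dim, rnn_units = 10_000, 128, 64

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding = vocab_size * embedding_dim

# One LSTM direction: 4 gates x (input kernel + recurrent kernel + bias).
one_direction = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)
bilstm = 2 * one_direction  # forward + backward

# Dense(1) on the concatenated 2 * rnn_units features, plus one bias.
dense = 2 * rnn_units * 1 + 1

total = embedding + bilstm + dense
size_mb = total * 4 / 1024**2  # 4 bytes per float32 weight

print(embedding, bilstm, dense, total, round(size_mb, 2))
# → 1280000 98816 129 1378945 5.26
```

The embedding table accounts for roughly 93% of the weights, so `MAX_WORDS` and `embedding_dim` are the first knobs to turn if the model needs to shrink.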


## 4) Training (EarlyStopping + Checkpoint)

In [None]:
MODELS_DIR = PROJECT_DIR / 'models'
RESULTS_DIR = PROJECT_DIR / 'results'
MODELS_DIR.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Workaround for Keras format compatibility: checkpoint weights during training,
# then save the full model at the end.
weights_ckpt_path = MODELS_DIR / 'sentiment_model.weights.h5'
final_model_path = MODELS_DIR / 'sentiment_model.keras'

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(
        filepath=str(weights_ckpt_path),
        monitor='val_loss',
        save_best_only=True,
        save_weights_only=True,
    ),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=6,
    batch_size=128,
    callbacks=callbacks,
    verbose=1,
)

# Ensure best weights are on the model (restore_best_weights=True should already do it)
if weights_ckpt_path.exists():
    model.load_weights(str(weights_ckpt_path))

model.save(str(final_model_path))

print('Best weights saved to:', str(weights_ckpt_path))
print('Final model saved to:', str(final_model_path))


Epoch 1/6
Epoch 2/6
Epoch 3/6


ValueError: The following argument(s) are not supported with the native Keras format: ['include_optimizer']
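
The interaction between `patience=2` and `restore_best_weights=True` in the callbacks of section 4 is easier to see on a toy loss curve. This is a simplified pure-Python sketch of the stopping rule, not Keras code: the function name `simulate_early_stopping` is hypothetical, and the `val_loss` values are made up for illustration rather than taken from this run.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training stops here
    return len(val_losses), best_epoch  # ran all epochs


val_losses = [0.60, 0.45, 0.40, 0.42, 0.43, 0.41]  # hypothetical curve
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=2)
print(stop_epoch, best_epoch)
# → 5 3
```

With this curve, training stops after epoch 5 and the weights from epoch 3 are the ones restored, so the sixth epoch never runs. With `epochs=6` and `patience=2`, a run can therefore end early, and the saved model corresponds to the best validation epoch rather than the last one.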

## 5) Quick evaluation (val)

In [None]:
val_metrics = model.evaluate(X_val, y_val, verbose=0)
for name, value in zip(model.metrics_names, val_metrics):
    print(f'{name}: {value:.4f}')

loss: 0.3402
accuracy: 0.8618
roc_auc: 0.9382
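
For reference, the three reported numbers correspond to standard definitions that can be reproduced by hand on a tiny example. The sketch below uses four made-up predictions and assumes the model was compiled with binary cross-entropy loss and that `roc_auc` is an area-under-ROC metric. The function names are hypothetical. If the metric is a `tf.keras.metrics.AUC` instance, it approximates the curve with a fixed set of thresholds, so it may differ slightly from the exact pairwise value computed here.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean negative log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    # Fraction of examples whose thresholded prediction matches the label.
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def roc_auc(y_true, y_prob):
    # Probability that a random positive is scored above a random negative.
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]          # toy labels
y_prob = [0.9, 0.2, 0.4, 0.6]  # toy predicted probabilities

print(round(binary_cross_entropy(y_true, y_prob), 4))
print(accuracy(y_true, y_prob))
print(roc_auc(y_true, y_prob))
```

Accuracy depends on the 0.5 threshold, while ROC AUC only depends on how the scores are ranked. That is why a model can show a noticeably higher AUC than accuracy, as in the validation metrics above.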


## 6) Sanity check: predict a sentence

In [None]:
sample = 'This movie was surprisingly good, I loved the acting and the story.'
x = data.vectorizer(tf.constant([sample])).numpy()
p = float(model.predict(x, verbose=0)[0][0])
label = 'Positive' if p >= 0.5 else 'Negative'
print('Prob(positive):', round(p, 3), '=>', label)

Prob(positive): 0.822 => Positive


✅ End of Day 2: BiLSTM model trained and saved to `models/sentiment_model.keras`.

Next: mini-exercise *Ablation* (LSTM vs GRU, 2 epochs), then Day 3 (curves + test evaluation).