# IMDB: Tokenization + Padding (Day 1)

Goal: load the IMDB dataset and understand the **vocabulary**, the **vectorization** (tokenization) step, and **padding/truncation**.

**Important**: the TensorFlow version used here (2.13) does not support Python 3.14. Run this notebook in a Python 3.10 or 3.11 environment.


## 1) Imports & Setup

In [1]:
from pathlib import Path
import sys

import numpy as np
import tensorflow as tf

# VS Code notebooks run from the notebook folder by default.
# Find the project root (the folder that contains `src/`) and add it to PYTHONPATH.
cwd = Path.cwd().resolve()
PROJECT_DIR = None
for p in [cwd] + list(cwd.parents):
    if (p / "src").exists():
        PROJECT_DIR = p
        break

if PROJECT_DIR is None:
    raise RuntimeError(f"Could not find project root containing 'src' starting from: {cwd}")

sys.path.insert(0, str(PROJECT_DIR))

from src.text_preprocessing import TextPreprocessor

print("Project dir:", PROJECT_DIR)
print("Python:", sys.version)
print("TensorFlow:", tf.__version__)
PROJECT_DIR

Project dir: C:\Users\bello\Documents\data-science-portfolio\02_DL_NLP_Sentiment
Python: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
TensorFlow: 2.13.0


WindowsPath('C:/Users/bello/Documents/data-science-portfolio/02_DL_NLP_Sentiment')

## 2) Load IMDB (raw text)

In [2]:
MAX_WORDS = 10_000
MAX_LEN = 200
VAL_SIZE = 5_000

pre = TextPreprocessor(max_words=MAX_WORDS, max_len=MAX_LEN)
data = pre.load_imdb_text(validation_size=VAL_SIZE, seed=42)

print('X_train:', data.X_train.shape, data.X_train.dtype)
print('y_train:', data.y_train.shape, data.y_train.dtype)
print('X_val  :', data.X_val.shape, data.X_val.dtype)
print('X_test :', data.X_test.shape, data.X_test.dtype)
print('Vocab size (vectorizer):', len(data.vectorizer.get_vocabulary()))

X_train: (20000, 200) int64
y_train: (20000,) int32
X_val  : (5000, 200) int64
X_test : (25000, 200) int64
Vocab size (vectorizer): 10000
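
The shapes follow from the split: Keras ships IMDB as 25,000 training and 25,000 test reviews, and holding out `VAL_SIZE = 5_000` training reviews leaves 20,000. How `load_imdb_text` performs the split is not shown in this notebook. The sketch below is one common way to do a seeded hold-out split; `holdout_split` is a hypothetical helper, not a function from `src.text_preprocessing`.

```python
import random


def holdout_split(n_samples: int, validation_size: int, seed: int):
    """Shuffle indices reproducibly and reserve the first validation_size for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[validation_size:], indices[:validation_size]


# Assumed sizes: 25,000 training reviews, 5,000 held out for validation.
train_idx, val_idx = holdout_split(25_000, validation_size=5_000, seed=42)
print(len(train_idx), len(val_idx))  # 20000 5000
```

Fixing the seed keeps the validation set identical across runs, so metrics from different experiments stay comparable.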


## 3) Understanding tokenization (TextVectorization)

In [3]:
vocab = data.vectorizer.get_vocabulary()
print('Top 20 tokens:', vocab[:20])

# Index 0 is reserved for padding (the empty string); index 1 is the out-of-vocabulary token '[UNK]'.
print('Token[0]:', vocab[0])

Top 20 tokens: ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film']
Token[0]: 
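
To make the mapping concrete, here is a minimal pure-Python sketch of what an integer-mode `TextVectorization` layer does with its default settings: lowercase the text, strip punctuation, split on whitespace, then look up each token in the vocabulary. It assumes the project's vectorizer uses those defaults. The helper names (`standardize`, `tokens_to_ids`) are illustrative and not part of `src.text_preprocessing`, and the real layer runs these steps as TensorFlow ops.

```python
import string

# Illustrative vocabulary: the first 20 entries printed above.
# Index 0 is the padding token (''), index 1 is the out-of-vocabulary token.
VOCAB = ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it',
         'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film']
WORD_TO_ID = {word: idx for idx, word in enumerate(VOCAB)}
OOV_ID = 1


def standardize(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))


def tokens_to_ids(text: str) -> list[int]:
    """Split on whitespace and map each token to its id (unknown words -> OOV_ID)."""
    return [WORD_TO_ID.get(tok, OOV_ID) for tok in standardize(text).split()]


ids = tokens_to_ids('This movie was surprisingly good, I loved the acting and the story.')
print(ids)  # [11, 17, 13, 1, 1, 10, 1, 2, 1, 3, 2, 1]
```

With only 20 known words, most tokens fall back to `[UNK]` (id 1). A larger vocabulary, such as the 10,000 entries used in this notebook, maps far more words to their own id.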


## 4) Padding / truncation: seeing the effect

In [4]:
lengths = (data.X_train != 0).sum(axis=1)
print('Sequence lengths (non-zero tokens)')
print('min:', int(lengths.min()))
print('p50:', int(np.percentile(lengths, 50)))
print('p90:', int(np.percentile(lengths, 90)))
print('max:', int(lengths.max()))

print('Example vector (first 30 tokens):')
print(data.X_train[0][:30])

Sequence lengths (non-zero tokens)
min: 10
p50: 172
p90: 200
max: 200
Example vector (first 30 tokens):
[  10  173  255   10  236   37 1580  100  308   11   12   10  236   42
    6  901   55   85  121  715   20  938   11  903  667    7  156   18
    4    1]
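
The padding and truncation behaviour can be sketched in a few lines of plain Python. This assumes post-padding and post-truncation (zeros appended on the right, extra tokens dropped from the end), which is what `TextVectorization` does when `output_sequence_length` is set. `pad_or_truncate` is a hypothetical helper, not a function from the project.

```python
def pad_or_truncate(ids: list[int], max_len: int, pad_id: int = 0) -> list[int]:
    """Return exactly max_len ids: cut the tail, or append pad_id on the right."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_seq = pad_or_truncate([11, 17, 13], max_len=5)
long_seq = pad_or_truncate([11, 17, 13, 10, 2, 3, 2], max_len=5)
print(short_seq)  # [11, 17, 13, 0, 0]
print(long_seq)   # [11, 17, 13, 10, 2]

# Counting non-zero entries recovers the unpadded length, as in the cell above.
print(sum(1 for i in short_seq if i != 0))  # 3
```

This is also why `p90` and `max` both equal 200: any review with 200 or more tokens is cut to `MAX_LEN`, so its original length is no longer visible in `X_train`. A `p90` of 200 suggests that at least roughly one review in ten reaches that limit.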


## 5) Save the preprocessing artifact (vectorizer) for inference

In [5]:
MODELS_DIR = PROJECT_DIR / 'models'
MODELS_DIR.mkdir(parents=True, exist_ok=True)

vectorizer_model = pre.build_vectorizer_model(data.vectorizer)
vectorizer_path = MODELS_DIR / 'text_vectorizer.keras'
vectorizer_model.save(vectorizer_path)

print('Saved:', vectorizer_path)

Saved: C:\Users\bello\Documents\data-science-portfolio\02_DL_NLP_Sentiment\models\text_vectorizer.keras


## 6) Quick test: vectorize a sentence

In [6]:
sample = 'This movie was surprisingly good, I loved the acting and the story.'
vec = data.vectorizer(tf.constant([sample])).numpy()[0]
print(vec[:30])
print('Non-zero length:', int((vec != 0).sum()))

[  11   17   13 1181   49   10  437    2  111    3    2   63    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
Non-zero length: 12
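
A quick sanity check is to map the ids back to words. The sketch below uses only the 20 vocabulary entries printed in section 3, so ids outside that range are shown as `<id>` placeholders. In the notebook you would pass the full `vocab` list instead. `decode` is a hypothetical helper, not part of the project code.

```python
# First 20 vocabulary entries, copied from the output in section 3.
VOCAB_HEAD = ['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it',
              'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film']


def decode(ids, vocab):
    """Map ids back to tokens, skipping padding (0) and marking ids not in vocab."""
    words = []
    for i in ids:
        if i == 0:
            continue  # padding carries no token
        words.append(vocab[i] if i < len(vocab) else f'<{i}>')
    return ' '.join(words)


# Leading ids from the output above, followed by a few padding zeros.
vec = [11, 17, 13, 1181, 49, 10, 437, 2, 111, 3, 2, 63, 0, 0, 0]
print(decode(vec, VOCAB_HEAD))
# this movie was <1181> <49> i <437> the <111> and the <63>
```

The decoded words line up with the input sentence after lowercasing and punctuation removal, and none of the 12 ids is 1, so every word of the sample is in the 10,000-word vocabulary.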


✅ End of Day 1: you have a vectorized dataset with padding/truncation applied, and a `text_vectorizer.keras` saved for the Streamlit app.