# Data preparation

This notebook prepares the dataset before the models are trained.

Steps:
- Configuration
- Tweet cleaning
- Dataset construction
- Train / validation / test split
- Saving the CSV files

In [1]:
import os
import re
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Configuration

In [2]:

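DATA_PATH = "data/tweets.csv"  # raw, headerless CSV
TEXT_COL = "text"
LABEL_COL = "label"

RANDOM_STATE = 42  # fixed seed so the splits are reproducible
TEST_SIZE = 0.2  # fraction of the full dataset held out for testing
VAL_SIZE = 0.2  # fraction of the full dataset used for validation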

print("CWD :", Path.cwd())
print("DATA_PATH :", DATA_PATH, "| Exists ?", Path(DATA_PATH).exists())

CWD : C:\Users\Jeremy\IA\sentiment_tri
DATA_PATH : data/tweets.csv | Exists ? True


## Cleaning function

In [3]:
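_URL_RE = re.compile(r"http\S+|www\.\S+")
_MENTION_RE = re.compile(r"@\w+")
_HASHTAG_RE = re.compile(r"#")
_WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Remove URLs, replace mentions with '@user', drop '#', collapse whitespace."""
    text = str(text)  # note: a missing value (NaN) becomes the string "nan"
    text = _URL_RE.sub("", text)
    text = _MENTION_RE.sub("@user", text)  # keep a placeholder token for mentions
    text = _HASHTAG_RE.sub("", text)  # keep the hashtag word, drop only the '#'
    text = _WHITESPACE_RE.sub(" ", text).strip()
    return text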
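
To make the effect of each substitution concrete, here is a small sketch that applies the cleaning function to a few inputs. The sample strings are made up for illustration (the first one is loosely modeled on the first row of the dataset), and the function is repeated so the snippet runs on its own.

```python
import re

_URL_RE = re.compile(r"http\S+|www\.\S+")
_MENTION_RE = re.compile(r"@\w+")
_HASHTAG_RE = re.compile(r"#")

def clean_tweet(text: str) -> str:
    text = str(text)
    text = _URL_RE.sub("", text)
    text = _MENTION_RE.sub("@user", text)
    text = _HASHTAG_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample tweet: mention, URL, repeated spaces and a hashtag.
sample = "@switchfoot http://twitpic.com/2y1zl - Awww,   that's a #bummer"
cleaned = clean_tweet(sample)
print(cleaned)  # @user - Awww, that's a bummer

# Edge cases worth knowing about before building the dataset:
url_only = clean_tweet("http://example.com")  # becomes an empty string
missing = clean_tweet(float("nan"))           # becomes the string "nan"
print(repr(url_only), repr(missing))
```

The two edge cases suggest that missing values should be handled before cleaning, and that some cleaned tweets may end up empty.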

## Loading the dataset

In [4]:
df_raw = pd.read_csv(
    DATA_PATH,
    encoding="latin-1",
    header=None
)

df_raw.columns = ["target", "id", "date", "query", "user", "text"]

print("Raw shape:", df_raw.shape)
df_raw.head()

Raw shape: (1600000, 6)


| | target | id | date | query | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
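
As a variant of the loading cell, the column names can be passed to `read_csv` through `names=`, and `usecols=` can restrict parsing to the two columns used later. This is only a sketch on an in-memory sample: the two rows, `SAMPLE_CSV`, `COLUMNS` and `df_small` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the headerless six-column layout.
SAMPLE_CSV = (
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,alice,"so tired, long day"\n'
    '4,2,Mon Apr 06 22:20:01 PDT 2009,NO_QUERY,bob,great morning\n'
)

COLUMNS = ["target", "id", "date", "query", "user", "text"]

df_small = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    header=None,                 # no header row in the file
    names=COLUMNS,               # name the columns while reading
    usecols=["target", "text"],  # parse only what is needed downstream
)
print(df_small)
```

On the real file, the `encoding="latin-1"` argument used in the cell above would still be required.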


## Building the final dataset

In [5]:
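# Drop rows with missing text before cleaning: clean_tweet() calls str(),
# which would turn NaN into the string "nan" and hide it from dropna().
df = df_raw[["text", "target"]].dropna(subset=["text"]).copy()

# Binary label: 1 when target == 0, otherwise 0.
df[LABEL_COL] = (df["target"] == 0).astype(int)
df[TEXT_COL] = df["text"].map(clean_tweet)

df = df[[TEXT_COL, LABEL_COL]]

print("Final shape:", df.shape)
df.head()

Final shape: (1600000, 2)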


| | text | label |
|---|---|---|
| 0 | @user - Awww, that's a bummer. You shoulda got... | 1 |
| 1 | is upset that he can't update his Facebook by ... | 1 |
| 2 | @user I dived many times for the ball. Managed... | 1 |
| 3 | my whole body feels itchy and like its on fire | 1 |
| 4 | @user no, it's not behaving at all. i'm mad. w... | 1 |


## Checking the labels

In [6]:
df[LABEL_COL].value_counts(normalize=True)

label
1    0.5
0    0.5
Name: proportion, dtype: float64
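
The balance check confirms a 50/50 split but does not say which class `label == 1` denotes. The sketch below replays the mapping on a toy series. It assumes the raw `target` column follows the Sentiment140 convention (0 = negative, 4 = positive), which the notebook does not state explicitly. Under that assumption, `label == 1` marks negative tweets. The toy values and the `unexpected` check are illustrative additions.

```python
import pandas as pd

# Toy target values, assuming the 0 = negative / 4 = positive convention.
targets = pd.Series([0, 4, 4, 0, 0], name="target")

# Same mapping as in the notebook: 1 when target == 0, otherwise 0.
labels = (targets == 0).astype(int)
print(labels.tolist())  # [1, 0, 0, 1, 1]

# Optional sanity check: flag any raw value outside the assumed set,
# since every value other than 0 is silently mapped to label 0.
unexpected = set(targets.unique()) - {0, 4}
print("Unexpected target values:", unexpected)
```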

## Train / validation / test split

In [7]:
X = df[TEXT_COL].values
y = df[LABEL_COL].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    stratify=y,
    random_state=RANDOM_STATE
)

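# VAL_SIZE is a fraction of the full dataset, so rescale it to the
# remaining (1 - TEST_SIZE) portion: 0.2 / 0.8 = 0.25.
val_ratio = VAL_SIZE / (1 - TEST_SIZE)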

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train,
    test_size=val_ratio,
    stratify=y_train,
    random_state=RANDOM_STATE
)

print("Train :", len(X_train))
print("Validation :", len(X_val))
print("Test :", len(X_test))

Train : 960000
Validation : 320000
Test : 320000
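
The resulting 60 / 20 / 20 proportions follow from rescaling the validation fraction for the second split. The sketch below reproduces the same two-step split on a toy balanced array of 1,000 items. The toy data and the `X_tmp` / `y_tmp` names are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.2
VAL_SIZE = 0.2
RANDOM_STATE = 42

# Toy balanced dataset: 1,000 items, 500 per class.
X = np.arange(1000)
y = np.array([0, 1] * 500)

# First split: hold out 20% of everything for testing.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
)

# Second split: 20% of the full set is 25% of the remaining 80%.
val_ratio = VAL_SIZE / (1 - TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=val_ratio, stratify=y_tmp, random_state=RANDOM_STATE
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # (600, 200, 200)
```

Passing `VAL_SIZE` directly to the second call would instead give validation 20% of the remaining 80%, i.e. 16% of the full dataset.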


## Saving the splits

In [8]:
os.makedirs("data/processed", exist_ok=True)

train_df = pd.DataFrame({TEXT_COL: X_train, LABEL_COL: y_train})
val_df   = pd.DataFrame({TEXT_COL: X_val, LABEL_COL: y_val})
test_df  = pd.DataFrame({TEXT_COL: X_test, LABEL_COL: y_test})

train_df.to_csv("data/processed/train.csv", index=False)
val_df.to_csv("data/processed/val.csv", index=False)
test_df.to_csv("data/processed/test.csv", index=False)

print("Files saved.")

Files saved.
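
One point to keep in mind when reloading these files: a tweet that is empty after cleaning (for example, one that contained only a URL) is written as an empty CSV field, and `pd.read_csv` parses empty fields as `NaN` by default. The sketch below illustrates this round trip on a made-up three-row frame written to a temporary directory. Passing `keep_default_na=False` is one possible way to keep such rows as empty strings.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical mini split; the second text is empty after cleaning.
df = pd.DataFrame(
    {"text": ["@user hello", "", "good morning"], "label": [1, 0, 0]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    df.to_csv(path, index=False)

    # Default parsing: the empty field comes back as NaN.
    default_read = pd.read_csv(path)
    # Disabling the default NA markers keeps it as an empty string.
    safe_read = pd.read_csv(path, keep_default_na=False)

n_missing_default = int(default_read["text"].isna().sum())
n_missing_safe = int(safe_read["text"].isna().sum())
print(n_missing_default, n_missing_safe)  # 1 0
```

An alternative would be to filter out empty texts before saving, which changes the split sizes reported above.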
