# 0 – Data Importation and Visualization

In this notebook, we import the dataset introduced in *Learning Word Vectors for Sentiment Analysis* by Maas et al. (2011) and save it in a `data/` folder, which is ignored by git. The labeled IMDb reviews are stored as two DataFrames, `df_train` and `df_test`, each written to a Parquet file.

The dataset contains movie reviews with their associated binary sentiment polarity labels. The core dataset consists of 50,000 reviews, split evenly into a training set and a test set of 25,000 reviews each. The overall label distribution is balanced (25,000 positive and 25,000 negative).


In [1]:
import os
import tarfile
import urllib.request
import pandas as pd

In [2]:
# Paths and URL
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dossier_data = "data"
fichier_tar = os.path.join(dossier_data, "aclImdb_v1.tar.gz")
dossier_extrait = os.path.join(dossier_data, "aclImdb")

# Create the data folder if it doesn't exist
os.makedirs(dossier_data, exist_ok=True)

# Download the file if it doesn't already exist
if not os.path.exists(fichier_tar):
    print("Downloading data...")
    urllib.request.urlretrieve(url, filename=fichier_tar)
    print("Download complete.")
else:
    print("The file already exists.")

# Extract the data if it hasn't been extracted yet
if not os.path.exists(dossier_extrait):
    print("Extracting data...")
    with tarfile.open(fichier_tar, "r:gz") as tar:
        tar.extractall(path=dossier_data)
    print("Extraction complete.")
else:
    print("The data is already extracted.")

The file already exists.
The data is already extracted.
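
As a quick sanity check after extraction, the sketch below counts the `.txt` files in each split and label folder. `count_reviews` is a hypothetical helper that is not part of the original notebook. It assumes that `train/` and `test/` each contain `pos/` and `neg/` subfolders, which is the layout the loading cells in this notebook rely on. If each split is itself balanced, all four counts should be 12,500.

```python
import os


def count_reviews(root):
    """Count .txt files per (split, label) folder in an aclImdb-style tree.

    Hypothetical helper for illustration; missing folders are reported as 0.
    """
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            if os.path.isdir(folder):
                counts[(split, label)] = sum(
                    1 for name in os.listdir(folder) if name.endswith(".txt")
                )
            else:
                counts[(split, label)] = 0
    return counts


# Example: inspect the extracted archive used in this notebook
for key, n in count_reviews(os.path.join("data", "aclImdb")).items():
    print(key, n)
```

A count of 0 for any folder would suggest that the extraction was interrupted or that the archive layout differs from the one assumed here.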


In [4]:
# Paths
dossier_train = os.path.join("data", "aclImdb", "train")
paths = {
    "pos": os.path.join(dossier_train, "pos"),
    "neg": os.path.join(dossier_train, "neg")
}

# Load every review file in a folder and attach the given label
def load_reviews(folder, label):
    """Return a DataFrame with one row per file in `folder` (columns: texte, label)."""
    texts = []
    for file_name in os.listdir(folder):
        file_path = os.path.join(folder, file_name)
        with open(file_path, encoding="utf-8") as f:
            texts.append(f.read())
    return pd.DataFrame({"texte": texts, "label": label})

# Load positive and negative reviews
df_pos = load_reviews(paths["pos"], label=1)
df_neg = load_reviews(paths["neg"], label=0)

# Combine into a single DataFrame
df_train = pd.concat([df_pos, df_neg], ignore_index=True)

# Preview
df_train.head()


| | texte | label |
|---|---|---|
| 0 | A most recommendable masterpiece, not only for... | 1 |
| 1 | Full disclosure: I'm a cynic. I like my ending... | 1 |
| 2 | For quite a long time in my life, I either did... | 1 |
| 3 | Richard Attenborough who already given us magn... | 1 |
| 4 | I remember the first time I saw this movie -- ... | 1 |
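
The preview shows only positive labels because the positive reviews are concatenated first and `df_train` is not shuffled in this notebook. The sketch below shows one way to check the class counts and, if a mixed row order is wanted, to shuffle with a fixed seed using the same `sample(frac=1, random_state=...)` pattern applied to `df_test`. `label_counts` and `shuffle_rows` are hypothetical helper names, and the toy DataFrame stands in for the real data.

```python
import pandas as pd


def label_counts(df, label_col="label"):
    """Return {label: number of rows}, sorted by label value."""
    return df[label_col].value_counts().sort_index().to_dict()


def shuffle_rows(df, seed=42):
    """Return a shuffled copy of df with a fresh 0..n-1 index."""
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy stand-in for df_train: positives first, then negatives
demo = pd.DataFrame({
    "texte": ["great", "loved it", "awful", "boring"],
    "label": [1, 1, 0, 0],
})

print(label_counts(demo))
print(shuffle_rows(demo))
```

On the real data, `label_counts(df_train)` would be expected to show equal counts for labels 0 and 1 if the training split is balanced.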


In [6]:
# Paths for test data
dossier_test = os.path.join("data", "aclImdb", "test")
paths_test = {
    "pos": os.path.join(dossier_test, "pos"),
    "neg": os.path.join(dossier_test, "neg")
}

# Load positive and negative reviews for test
df_test_pos = load_reviews(paths_test["pos"], label=1)
df_test_neg = load_reviews(paths_test["neg"], label=0)

# Combine into a single DataFrame
df_test = pd.concat([df_test_pos, df_test_neg], ignore_index=True)

# Shuffle the rows
df_test = df_test.sample(frac=1, random_state=42).reset_index(drop=True)

# Preview
df_test.head()


| | texte | label |
|---|---|---|
| 0 | I don't cry easily over movies, but I have to ... | 1 |
| 1 | This movie stinks. IMDb needs negative numbers... | 0 |
| 2 | I loved the film "Eddie Monroe". The film had ... | 1 |
| 3 | This movie made my face hurt. I don't understa... | 0 |
| 4 | I think Downey was perhaps inspired by French ... | 0 |


In [7]:
# Save df_train and df_test as Parquet files
output_path_train = "data/df_train.parquet"
output_path_test = "data/df_test.parquet"

df_train.to_parquet(output_path_train, index=False)
df_test.to_parquet(output_path_test, index=False)
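
To confirm that the saved files can be read back unchanged, a round-trip check along the following lines may be useful. `roundtrip_ok` is a hypothetical helper, not part of the original notebook. It assumes a Parquet engine such as `pyarrow` or `fastparquet` is installed, which the `to_parquet` calls above already require.

```python
import pandas as pd


def roundtrip_ok(df, path):
    """Write df to Parquet, read it back, and report whether it is unchanged."""
    df.to_parquet(path, index=False)
    reloaded = pd.read_parquet(path)
    # DataFrame.equals compares shape, values, and dtypes
    return reloaded.equals(df)


# Possible usage with the DataFrames from this notebook:
# roundtrip_ok(df_train, "data/df_train.parquet")
# roundtrip_ok(df_test, "data/df_test.parquet")
```

Note that this helper rewrites the file at `path` before reading it, so it doubles as the save step rather than a read-only check.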