# 3. Data Preparation

### 3.1 Dataset Sampling

To speed up development, we draw a smaller, balanced sample from the full 1.6-million-tweet Sentiment140 dataset: 100,000 negative tweets (`target == 0`) and 100,000 positive tweets (`target == 4`), for 200,000 tweets in total. A fixed `random_state` makes the sample reproducible, and equal class sizes keep the subset balanced and manageable.

In [1]:
import pandas as pd

# Define paths and column names
DATA_PATH = '../data/training.1600000.processed.noemoticon.csv'
SAMPLED_DATA_PATH = '../data/sentiment140_sampled_200k.csv'
COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']

# Load the full dataset
try:
    df_full = pd.read_csv(DATA_PATH, encoding='latin-1', header=None, names=COLUMNS)
except FileNotFoundError:
    print(f"Error: The file {DATA_PATH} was not found.")
    print("Please ensure the training data is correctly placed in the 'data' directory.")
else:
    print(f"Full dataset loaded successfully. Shape: {df_full.shape}")

    # Separate by sentiment
    df_negative = df_full[df_full['target'] == 0]
    df_positive = df_full[df_full['target'] == 4]

    # Create balanced samples
    df_negative_sampled = df_negative.sample(n=100000, random_state=42)
    df_positive_sampled = df_positive.sample(n=100000, random_state=42)

    # Concatenate and shuffle the samples
    df_sampled = pd.concat([df_negative_sampled, df_positive_sampled])
    df_sampled = df_sampled.sample(frac=1, random_state=42).reset_index(drop=True)

    # Save the sampled dataset
    df_sampled.to_csv(SAMPLED_DATA_PATH, index=False)

    print(f"Created and saved a balanced sample of {df_sampled.shape[0]} tweets.")
    print("Target distribution in the sample:")
    print(df_sampled['target'].value_counts())

Full dataset loaded successfully. Shape: (1600000, 6)
Created and saved a balanced sample of 200000 tweets.
Target distribution in the sample:
target
4    100000
0    100000
Name: count, dtype: int64
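
The per-class sampling above can also be expressed with a single `groupby(...).sample(...)` call. The following is a minimal sketch on a small synthetic frame, not the notebook's actual data. The helper name `balanced_sample` and the toy rows are assumptions made for illustration, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=42):
    """Draw n_per_class rows from every class, then shuffle the result."""
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved rather than grouped together
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic stand-in for the full dataset: 6 negative (0) and 4 positive (4) rows
toy = pd.DataFrame({
    "target": [0] * 6 + [4] * 4,
    "text": [f"tweet {i}" for i in range(10)],
})

toy_sampled = balanced_sample(toy, "target", n_per_class=3)
print(toy_sampled["target"].value_counts().to_dict())
```

Whether this reads better than two explicit `.sample()` calls is a matter of taste; it mainly helps when there are more than two classes.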


### 3.2 Load and Verify Sampled Dataset

We now reload the sampled dataset from disk and confirm its shape and class balance before continuing.

In [2]:
import re

# Load the sampled dataset
df_sampled = pd.read_csv(SAMPLED_DATA_PATH)

print(f"Sampled dataset loaded. Shape: {df_sampled.shape}")
print("Target distribution:")
print(df_sampled['target'].value_counts())
df_sampled.head()

Sampled dataset loaded. Shape: (200000, 6)
Target distribution:
target
4    100000
0    100000
Name: count, dtype: int64


| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 4 | 1793657674 | Thu May 14 03:31:43 PDT 2009 | NO_QUERY | marita_holm | Looks like the sun finally located Trondheim ;... |
| 1 | 0 | 1971008586 | Sat May 30 05:56:45 PDT 2009 | NO_QUERY | addicthim | A long weekend begins. The sun is shining and ... |
| 2 | 4 | 1881031249 | Fri May 22 03:21:42 PDT 2009 | NO_QUERY | susietech | to the beach we go! hope it stays nice... |
| 3 | 0 | 1795951084 | Thu May 14 08:38:29 PDT 2009 | NO_QUERY | KatieBlockley | @JBFutureboy I missed it busted need to do a ... |
| 4 | 0 | 1978767005 | Sun May 31 00:23:54 PDT 2009 | NO_QUERY | Camila_love | Why I can't change my background image?? |
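
Beyond printing the shape and class counts, a few automated checks can make the verification step explicit. The sketch below is one possible approach, run on a tiny synthetic frame that is round-tripped through an in-memory CSV. The helper `check_sample` and the toy rows are hypothetical and not part of the notebook.

```python
import io

import pandas as pd


def check_sample(df, label_col="target", text_col="text"):
    """Return basic integrity facts about a sampled frame."""
    counts = df[label_col].value_counts()
    return {
        "balanced": bool(counts.nunique() == 1),   # every class has the same count
        "missing_text": int(df[text_col].isna().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Round-trip a tiny synthetic frame through CSV, as the notebook does on disk
toy = pd.DataFrame({"target": [0, 4, 0, 4], "text": ["a", "b", "c", "d"]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids an extra index column on reload
buffer.seek(0)
reloaded = pd.read_csv(buffer)

report = check_sample(reloaded)
print(reloaded.shape, report)
```

On the real data, the same function could be called as `check_sample(df_sampled)` right after loading.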


### 3.2 Advanced Text Cleaning

We now clean the tweets to handle Twitter-specific noise. The function lowercases the text, removes URLs, user mentions, digits, and punctuation, collapses elongated character runs, and normalizes whitespace, while keeping hashtag words (only the `#` symbol is dropped).

In [3]:
def clean_tweet(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove user mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # Remove '#' symbol, keeping the word
    text = re.sub(r'#', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove punctuation and special characters (keeping spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse runs of three or more identical characters to one (e.g., 'loooove' -> 'love')
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    # Remove extra spaces and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df_sampled['cleaned_text'] = df_sampled['text'].apply(clean_tweet)

print("Text after advanced cleaning:")
print(df_sampled[['text', 'cleaned_text']].head())

Text after advanced cleaning:
                                                text  \
0  Looks like the sun finally located Trondheim ;...   
1  A long weekend begins. The sun is shining and ...   
2         to the beach we go! hope it stays nice...    
3  @JBFutureboy I missed it  busted need to do a ...   
4          Why I can't change my background image??    

                                        cleaned_text  
0  looks like the sun finally located trondheim h...  
1  a long weekend begins the sun is shining and i...  
2              to the beach we go hope it stays nice  
3  i missed it busted need to do a reunion tour t...  
4              why i cant change my background image  
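
To make the effect of each substitution easier to follow, the sketch below applies the same regular expressions, in the same order, to two made-up tweets. The function name `clean_steps` and the example strings are assumptions for illustration and do not come from the dataset.

```python
import re


def clean_steps(text):
    """Apply the same substitutions as clean_tweet, in the same order."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # URLs
    text = re.sub(r'@\w+', '', text)           # user mentions
    text = re.sub(r'#', '', text)              # drop '#', keep the hashtag word
    text = re.sub(r'\d+', '', text)            # digits
    text = re.sub(r'[^a-z\s]', '', text)       # punctuation and other symbols
    text = re.sub(r'(.)\1{2,}', r'\1', text)   # runs of 3+ identical chars -> 1
    return re.sub(r'\s+', ' ', text).strip()   # collapse and trim whitespace


# Hypothetical tweets, not taken from the dataset
print(clean_steps("@user I looooove this!!! http://t.co/abc #Happy 2day"))
# -> i love this happy day
print(clean_steps("Sooo goood!!!"))
# -> so god
```

The second example shows a side effect worth keeping in mind: collapsing a run to a single character can turn one word into another (`goood` becomes `god`). A tweet made up only of mentions, URLs, and digits also cleans to an empty string, so it may be worth counting empty `cleaned_text` values before tokenization.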


### 3.3 Data Splitting

We split the cleaned sample into training (80%), validation (10%), and test (10%) sets, stratified by label so each subset keeps the 50/50 class balance. The model is trained on the first subset, tuned on the second, and evaluated on the held-out third to give an unbiased estimate of its performance.

In [4]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = df_sampled['cleaned_text']
y = df_sampled['target']

# Split into training and temporary sets (80% train, 20% temp)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Split temporary set into validation and test sets (50% validation, 50% test from temp)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Validation set shape: {X_val.shape}, {y_val.shape}")
print(f"Test set shape: {X_test.shape}, {y_test.shape}")

print("\nTraining target distribution:")
print(y_train.value_counts(normalize=True))
print("\nValidation target distribution:")
print(y_val.value_counts(normalize=True))
print("\nTest target distribution:")
print(y_test.value_counts(normalize=True))

Training set shape: (160000,), (160000,)
Validation set shape: (20000,), (20000,)
Test set shape: (20000,), (20000,)

Training target distribution:
target
0    0.5
4    0.5
Name: proportion, dtype: float64

Validation target distribution:
target
4    0.5
0    0.5
Name: proportion, dtype: float64

Test target distribution:
target
0    0.5
4    0.5
Name: proportion, dtype: float64


### 3.4 BERT-Specific Tokenization

We tokenize the cleaned text with the pre-trained tokenizer for `vinai/bertweet-base`, loaded through the Hugging Face `transformers` library. Tokenization converts each tweet into numerical input IDs and an attention mask, padded or truncated to 128 tokens, which BERT-style models require as input. We use BERTweet because it was pre-trained on tweets.

In [5]:
from transformers import AutoTokenizer

MODEL_NAME = 'vinai/bertweet-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(texts):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=128, return_tensors='pt')

# Tokenize the datasets
train_encodings = tokenize_function(X_train.tolist())
val_encodings = tokenize_function(X_val.tolist())
test_encodings = tokenize_function(X_test.tolist())

print("Tokenization complete.")
print(f"Train input_ids shape: {train_encodings['input_ids'].shape}")
print(f"Train attention_mask shape: {train_encodings['attention_mask'].shape}")
print(f"Validation input_ids shape: {val_encodings['input_ids'].shape}")
print(f"Validation attention_mask shape: {val_encodings['attention_mask'].shape}")
print(f"Test input_ids shape: {test_encodings['input_ids'].shape}")
print(f"Test attention_mask shape: {test_encodings['attention_mask'].shape}")

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


Tokenization complete.
Train input_ids shape: torch.Size([160000, 128])
Train attention_mask shape: torch.Size([160000, 128])
Validation input_ids shape: torch.Size([20000, 128])
Validation attention_mask shape: torch.Size([20000, 128])
Test input_ids shape: torch.Size([20000, 128])
Test attention_mask shape: torch.Size([20000, 128])
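
Every tensor above has a second dimension of 128 because each sequence is padded or truncated to `max_length`. The pure-Python sketch below illustrates that idea and how the attention mask relates to the padding. It is a simplified illustration, not the tokenizer's implementation: the token IDs and pad ID are made up, and special-token handling during truncation is ignored.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the matching mask."""
    kept = token_ids[:max_length]                     # analogous to truncation=True
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad               # analogous to padding='max_length'
    attention_mask = [1] * len(kept) + [0] * n_pad    # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Made-up token IDs and pad ID, chosen only for illustration
ids, mask = pad_and_mask([0, 101, 202, 2], max_length=6, pad_id=1)
print(ids)   # [0, 101, 202, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The model uses the mask to ignore padded positions, which is why both tensors share the same shape.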


### 3.5 Create PyTorch Datasets and DataLoaders

We will create a custom PyTorch Dataset class to handle the tokenized text and labels. This allows us to efficiently load and batch data during model training and evaluation. We will then create DataLoader instances for the training, validation, and test sets.

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader

class TweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
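        # Encodings are already tensors (return_tensors='pt'), so index them directly
        # instead of re-wrapping them with torch.tensor(), which triggers a UserWarning
        item = {key: val[idx] for key, val in self.encodings.items()}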
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

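# Convert labels to numpy arrays, mapping Sentiment140's {0, 4} to class indices {0, 1}
# (assumes a 2-class classification head, which expects labels in [0, num_labels))
y_train_np = (y_train.to_numpy() == 4).astype('int64')
y_val_np = (y_val.to_numpy() == 4).astype('int64')
y_test_np = (y_test.to_numpy() == 4).astype('int64')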

# Create Dataset objects
train_dataset = TweetDataset(train_encodings, y_train_np)
val_dataset = TweetDataset(val_encodings, y_val_np)
test_dataset = TweetDataset(test_encodings, y_test_np)

# Create DataLoader objects
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print("PyTorch Datasets and DataLoaders created.")
print(f"Train loader: {len(train_loader)} batches of size 16")
print(f"Validation loader: {len(val_loader)} batches of size 16")
print(f"Test loader: {len(test_loader)} batches of size 16")

PyTorch Datasets and DataLoaders created.
Train loader: 10000 batches of size 16
Validation loader: 1250 batches of size 16
Test loader: 1250 batches of size 16
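
The batch counts reported above follow from dividing each set size by the batch size. The sketch below shows the arithmetic, including the case where the last batch is only partially filled. `num_batches` is a hypothetical helper; its `drop_last` parameter mirrors the `DataLoader` option of the same name.

```python
import math


def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for n_samples items."""
    if drop_last:
        return n_samples // batch_size          # discard the final partial batch
    return math.ceil(n_samples / batch_size)    # keep the final partial batch


print(num_batches(160_000, 16))  # 10000
print(num_batches(20_000, 16))   # 1250
print(num_batches(20_010, 16))   # 1251 (last batch holds 10 items)
```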
