Submitted by: Sampriti Mahapatra, MDS202433

# SMS Spam Classification - Data Preparation

This notebook loads, preprocesses, splits, and saves the SMS spam dataset for model training.

## Imports and Setup

In [61]:
import pandas as pd
import numpy as np
import re
import os
import subprocess
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Project directory
PROJECT_DIR = '/Users/sampriti/Downloads/cmi/AML_2'

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load Data

Load the SMS spam data from a tab-separated file.

In [62]:
def load_data(file_path):
    # Read tab-separated file with proper encoding
    df = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'message'], encoding='utf-8')
    
    print(f"Total samples: {len(df)}")
    print(f"\nClass distribution:")
    print(df['label'].value_counts())
    print(f"\nClass percentages:")
    print(df['label'].value_counts(normalize=True) * 100)
    
    # Check for missing values
    print(f"\nMissing values:")
    print(df.isnull().sum())
    
    # Check for duplicates
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate rows: {duplicates}")
    
    return df
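
`load_data` relies on the default quoting rules of `pd.read_csv`. In a tab-separated file whose messages contain literal double quotes, a reader that treats `"` as a quote character can merge several lines into one field. The sketch below illustrates that effect with the standard-library `csv` module on a small invented sample (the four lines are made up for illustration and are not taken from the dataset). Whether this changes the row count for this particular file has not been checked here.

```python
import csv
import io

# Invented sample: the first message opens a double quote that is only closed two lines later
sample = (
    'ham\t"quoted start\n'
    'spam\tsecond message\n'
    'ham\tends with quote"\n'
    'ham\tlast\n'
)

# Default quoting: the quoted field swallows the line breaks in between
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: every double quote is treated as ordinary text
literal_rows = list(
    csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(f"Rows with default quoting: {len(default_rows)}")
print(f"Rows with QUOTE_NONE:      {len(literal_rows)}")
```

`pd.read_csv` accepts the same `quoting=csv.QUOTE_NONE` argument, so comparing `len(df)` with and without it would be a quick way to see if any messages are being merged.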

In [63]:
# Load the raw SMS data
df = load_data(os.path.join(PROJECT_DIR, 'sms+spam+collection/SMSSpamCollection'))

# Save the raw loaded data as raw_data.csv (before preprocessing)
raw_data_path = os.path.join(PROJECT_DIR, 'raw_data.csv')
df.to_csv(raw_data_path, index=False)
print(f"\nRaw data saved to {raw_data_path}")

# Display first few rows
print("\nFirst 5 rows:")
df.head()

Total samples: 5572

Class distribution:
label
ham     4825
spam     747
Name: count, dtype: int64

Class percentages:
label
ham     86.593683
spam    13.406317
Name: proportion, dtype: float64

Missing values:
label      0
message    0
dtype: int64

Duplicate rows: 403

Raw data saved to /Users/sampriti/Downloads/cmi/AML_2/raw_data.csv

First 5 rows:


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [64]:
# Display sample messages from each class
print("Sample HAM messages:")
for i, msg in enumerate(df[df['label'] == 'ham']['message'].sample(3, random_state=RANDOM_STATE).values, 1):
    print(f"{i}. {msg}")
    print()

print("\nSample SPAM messages:")
for i, msg in enumerate(df[df['label'] == 'spam']['message'].sample(3, random_state=RANDOM_STATE).values, 1):
    print(f"{i}. {msg}")
    print()

Sample HAM messages:
1. If i not meeting ü all rite then i'll go home lor. If ü dun feel like comin it's ok.

2. I.ll always be there, even if its just in spirit. I.ll get a bb soon. Just trying to be sure i need it.

3. Sorry that took so long, omw now


Sample SPAM messages:
1. Summers finally here! Fancy a chat or flirt with sexy singles in yr area? To get MATCHED up just reply SUMMER now. Free 2 Join. OptOut txt STOP Help08714742804

2. This is the 2nd time we have tried 2 contact u. U have won the 750 Pound prize. 2 claim is easy, call 08718726970 NOW! Only 10p per min. BT-national-rate 

3. Get ur 1st RINGTONE FREE NOW! Reply to this msg with TONE. Gr8 TOP 20 tones to your phone every week just £1.50 per wk 2 opt out send STOP 08452810071 16



## Preprocess Text

In [65]:
def preprocess_text(text):
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Replace URLs with token ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', 'URL', text)
    
    # Replace phone numbers with token (various formats)
    text = re.sub(r'\b\d{5,}\b', 'PHONE', text)  # 5+ consecutive digits
    text = re.sub(r'\+?\d[\d\s\-\(\)]{7,}\d', 'PHONE', text)  # Phone formats
    
    # Replace currency symbols with token
    text = re.sub(r'[£$€¥₹]', 'CURRENCY', text)
    
    # Replace numbers with token (but keep single digits for now)
    text = re.sub(r'\b\d{2,}\b', 'NUMBER', text)
    
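    # Remove special characters but keep spaces.
    # Note: the text is already lowercase at this point, so this character class
    # also strips the uppercase placeholder tokens (URL, PHONE, CURRENCY, NUMBER)
    # inserted above. Allow A-Z here, or use lowercase tokens, to keep them.
    text = re.sub(r'[^a-z0-9\s]', ' ', text)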
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
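
Because `preprocess_text` lowercases the text first and later removes everything outside `[a-z0-9\s]`, the uppercase placeholders (`URL`, `PHONE`, `CURRENCY`, `NUMBER`) do not survive to the output. If the intent is to keep them as features, one option is to use lowercase placeholders. The sketch below is a hypothetical variant, not the function used for the results in this notebook; the name `preprocess_text_keep_tokens` and the token spellings (`urltoken`, `phonetoken`, `currencytoken`, `numbertoken`) are assumptions made for illustration.

```python
import re

def preprocess_text_keep_tokens(text):
    """Hypothetical variant of preprocess_text whose placeholders survive cleaning."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Lowercase placeholders are not removed by the final character filter
    text = re.sub(r'http\S+|www\S+', ' urltoken ', text)
    text = re.sub(r'\b\d{5,}\b', ' phonetoken ', text)
    text = re.sub(r'\+?\d[\d\s\-\(\)]{7,}\d', ' phonetoken ', text)
    text = re.sub(r'[£$€¥₹]', ' currencytoken ', text)
    text = re.sub(r'\b\d{2,}\b', ' numbertoken ', text)
    # Drop remaining special characters, then collapse whitespace
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Invented example message
sample = "WIN £100 now! Call 08718726970 or visit www.example.com"
cleaned = preprocess_text_keep_tokens(sample)
print(cleaned)
```

Switching to a variant like this would change the cleaned text and possibly the number of empty messages, so the downstream outputs would need to be regenerated.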

In [66]:
# Apply preprocessing to all messages
print("Preprocessing all messages...")
df['cleaned_message'] = df['message'].apply(preprocess_text)

# Verify no empty messages after cleaning
empty_count = (df['cleaned_message'].str.len() == 0).sum()
print(f"Empty messages after cleaning: {empty_count}")

# Display examples
print("\nCleaning Results (sample):")
df[['message', 'cleaned_message']].sample(5, random_state=RANDOM_STATE)

Preprocessing all messages...
Empty messages after cleaning: 3

Cleaning Results (sample):


| | message | cleaned_message |
|---|---|---|
| 3245 | Squeeeeeze!! This is christmas hug.. If u lik ... | squeeeeeze this is christmas hug if u lik my f... |
| 944 | And also I've sorta blown him off a couple tim... | and also i ve sorta blown him off a couple tim... |
| 1044 | Mmm thats better now i got a roast down me! i... | mmm thats better now i got a roast down me i d... |
| 2484 | Mm have some kanji dont eat anything heavy ok | mm have some kanji dont eat anything heavy ok |
| 812 | So there's a ring that comes with the guys cos... | so there s a ring that comes with the guys cos... |


## Split Data into Train/Validation/Test Sets

In [67]:
def split_data(df, train_size=0.70, val_size=0.15, test_size=0.15, random_state=42):
    # Verify proportions sum to 1
    assert abs(train_size + val_size + test_size - 1.0) < 0.001, "Split proportions must sum to 1"
    
    # First split: separate test set
    train_val_df, test_df = train_test_split(
        df,
        test_size=test_size,
        random_state=random_state,
        stratify=df['label']
    )
    
    # Second split: separate train and validation
    val_proportion = val_size / (train_size + val_size)
    train_df, val_df = train_test_split(
        train_val_df,
        test_size=val_proportion,
        random_state=random_state,
        stratify=train_val_df['label']
    )
    
    # Display split statistics
    print("Data Split Summary:")
    print(f"Total samples: {len(df)}")
    print(f"\nTraining set:   {len(train_df)} samples ({len(train_df)/len(df)*100:.1f}%)")
    print(f"Validation set: {len(val_df)} samples ({len(val_df)/len(df)*100:.1f}%)")
    print(f"Test set:       {len(test_df)} samples ({len(test_df)/len(df)*100:.1f}%)")
    
    # Verify stratification
    print("\nClass distribution in each split:")
    
    for split_name, split_df in [('Training', train_df), ('Validation', val_df), ('Test', test_df)]:
        spam_count = (split_df['label'] == 'spam').sum()
        ham_count = (split_df['label'] == 'ham').sum()
        spam_pct = spam_count / len(split_df) * 100
        print(f"{split_name:12s}: Ham={ham_count:4d} ({100-spam_pct:5.2f}%), Spam={spam_count:3d} ({spam_pct:5.2f}%)")
    
    # Verify no overlap
    train_indices = set(train_df.index)
    val_indices = set(val_df.index)
    test_indices = set(test_df.index)
    
    assert len(train_indices & val_indices) == 0, "Train and validation sets overlap"
    assert len(train_indices & test_indices) == 0, "Train and test sets overlap"
    assert len(val_indices & test_indices) == 0, "Validation and test sets overlap"
    print("No data leakage detected between splits")
    
    return train_df, val_df, test_df
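
The overlap assertions in `split_data` compare row indices, so they confirm that no row is assigned to two splits. They do not compare message text, and the load step reported duplicate rows, so identical messages could in principle still land in different splits. A minimal sketch of a content-based check follows; `shared_messages` is a hypothetical helper and the message lists are invented for illustration.

```python
def shared_messages(split_a, split_b):
    """Return the message texts that appear in both splits."""
    return set(split_a) & set(split_b)

# Invented examples standing in for the cleaned_message columns of two splits
train_msgs = ["ok lar", "free entry now", "sorry i ll call later"]
test_msgs = ["sorry i ll call later", "see you at 5"]

overlap = shared_messages(train_msgs, test_msgs)
print(f"Messages shared by train and test: {len(overlap)}")
```

With the real data, the same helper could be called on `train_df['cleaned_message']` and `test_df['cleaned_message']`. Dropping duplicates before splitting would be one way to rule this out.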

In [68]:
# Split the data
train_df, val_df, test_df = split_data(
    df,
    train_size=0.70,
    val_size=0.15,
    test_size=0.15,
    random_state=RANDOM_STATE
)

Data Split Summary:
Total samples: 5572

Training set:   3900 samples (70.0%)
Validation set: 836 samples (15.0%)
Test set:       836 samples (15.0%)

Class distribution in each split:
Training    : Ham=3377 (86.59%), Spam=523 (13.41%)
Validation  : Ham= 724 (86.60%), Spam=112 (13.40%)
Test        : Ham= 724 (86.60%), Spam=112 (13.40%)
No data leakage detected between splits
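
The sizes above follow from the two-stage arithmetic in `split_data`: the test set is taken first, then the validation share is rescaled to `val_size / (train_size + val_size)` of what remains. The sketch below reproduces that arithmetic in plain Python. It assumes that a fractional `test_size` is rounded up to a whole number of samples, which is how `train_test_split` is understood to behave; `two_stage_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def two_stage_split_sizes(n_samples, train_size=0.70, val_size=0.15, test_size=0.15):
    """Expected (train, validation, test) sizes for the two-stage split."""
    # Stage 1: hold out the test set (fraction rounded up, by assumption)
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    # Stage 2: the validation share is relative to the remaining rows
    val_proportion = val_size / (train_size + val_size)
    n_val = math.ceil(val_proportion * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

print(two_stage_split_sizes(5572))
```

For 5572 rows this gives 3900 / 836 / 836, matching the split summary printed by `split_data`.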


In [69]:
# Filter out rows with empty cleaned_message
print("\nFiltering out empty cleaned messages...")
print(f"Before filtering:")
print(f"  Training: {len(train_df)} samples")
print(f"  Validation: {len(val_df)} samples")
print(f"  Test: {len(test_df)} samples")

# Count empty messages in each split
train_empty = (train_df['cleaned_message'].str.len() == 0).sum()
val_empty = (val_df['cleaned_message'].str.len() == 0).sum()
test_empty = (test_df['cleaned_message'].str.len() == 0).sum()
print(f"\nEmpty messages to remove:")
print(f"  Training: {train_empty}")
print(f"  Validation: {val_empty}")
print(f"  Test: {test_empty}")

# Filter out empty messages
train_df = train_df[train_df['cleaned_message'].str.len() > 0].copy()
val_df = val_df[val_df['cleaned_message'].str.len() > 0].copy()
test_df = test_df[test_df['cleaned_message'].str.len() > 0].copy()

print(f"\nAfter filtering:")
print(f"  Training: {len(train_df)} samples")
print(f"  Validation: {len(val_df)} samples")
print(f"  Test: {len(test_df)} samples")
print(f"  Total: {len(train_df) + len(val_df) + len(test_df)} samples")
print("\nEmpty messages removed successfully")


Filtering out empty cleaned messages...
Before filtering:
  Training: 3900 samples
  Validation: 836 samples
  Test: 836 samples

Empty messages to remove:
  Training: 2
  Validation: 0
  Test: 1

After filtering:
  Training: 3898 samples
  Validation: 836 samples
  Test: 835 samples
  Total: 5569 samples

Empty messages removed successfully


## Save Splits as CSV Files

In [70]:
def save_splits(train_df, val_df, test_df, output_dir='.'):
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Define file paths
    train_path = os.path.join(output_dir, 'train.csv')
    val_path = os.path.join(output_dir, 'validation.csv')
    test_path = os.path.join(output_dir, 'test.csv')
    
    # Select columns to save (label, original message, cleaned message)
    columns_to_save = ['label', 'message', 'cleaned_message']
    
    # Save to CSV
    train_df[columns_to_save].to_csv(train_path, index=False)
    val_df[columns_to_save].to_csv(val_path, index=False)
    test_df[columns_to_save].to_csv(test_path, index=False)
    
    print("Files saved successfully:")
    print(f"Training set:   {train_path}")
    print(f"Validation set: {val_path}")
    print(f"Test set:       {test_path}")

In [71]:
# Save the splits
output_directory = PROJECT_DIR  # Write the splits to the project directory
save_splits(train_df, val_df, test_df, output_dir=output_directory)

Files saved successfully:
Training set:   /Users/sampriti/Downloads/cmi/AML_2/train.csv
Validation set: /Users/sampriti/Downloads/cmi/AML_2/validation.csv
Test set:       /Users/sampriti/Downloads/cmi/AML_2/test.csv


## Initialize Git and DVC

In [72]:
os.chdir(PROJECT_DIR)

# Initialize git repository
!git init
!git config user.name "Sampriti Mahapatra"
!git config user.email "sampriti@example.com"

# Create .gitignore
gitignore_content = """# Python
__pycache__/
*.py[cod]
*.egg-info/
.venv/
venv/

# Jupyter
.ipynb_checkpoints/

# OS
.DS_Store

# Data files tracked by DVC
/raw_data.csv
/train.csv
/validation.csv
/test.csv
"""

with open(os.path.join(PROJECT_DIR, '.gitignore'), 'w') as f:
    f.write(gitignore_content.strip())
print("\n.gitignore created")

# Initialize DVC
!dvc init
print("\nDVC initialized successfully")

Reinitialized existing Git repository in /Users/sampriti/Downloads/cmi/AML_2/.git/

.gitignore created
ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

DVC initialized successfully
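
The success message above is printed even though `dvc init` reported an error, because a `!` shell line does not stop the cell when the command fails. One way to make the message conditional is to run the command through `subprocess` and inspect its exit code. The sketch below is an illustration under that assumption; `run_checked` is a hypothetical helper, and the Python interpreter stands in for `git` or `dvc` so the example runs anywhere.

```python
import subprocess
import sys

def run_checked(cmd):
    """Run a command and report success only if it exits with status 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0
    status = "succeeded" if ok else f"failed (exit code {result.returncode})"
    print(f"{' '.join(cmd)}: {status}")
    return ok

# Stand-in commands: one exits with 0, the other with a non-zero status
ok_result = run_checked([sys.executable, "-c", "raise SystemExit(0)"])
fail_result = run_checked([sys.executable, "-c", "raise SystemExit(3)"])
```

In the notebook, a call such as `run_checked(["dvc", "init"])` could then guard the confirmation message, so it is only printed when the command actually succeeded.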


In [73]:
# Initial commit with project setup
!git add .gitignore .dvc/ .dvcignore prepare.ipynb plan.md requirements.txt sms+spam+collection/
!git commit -m "Initial commit: project setup with DVC initialization"
!git log --oneline

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	keys
	mlruns/

nothing added to commit but untracked files present (use "git add" to track)
6661081 (HEAD -> main) Add train.ipynb with 3 benchmark models tracked by MLflow
564de3c Add Google Drive as DVC remote storage
eaf8eeb configure google drive as dvc remote storage
bff705e configure google drive as dvc remote storage
c4f01f1 Version 2: data splits with RANDOM_STATE=123
d3baad4 Version 1: data splits with RANDOM_STATE=42
4a140a4 Initial commit: project setup with DVC initialization
c88235d Add train.ipynb with 3 benchmark models tracked by MLflow
016ac01 Add Google Drive as DVC remote storage
70d0695 (tag: v2) Version 2: data splits with RANDOM_STATE=123
d9bca26 (tag: v1

## Track Data with DVC - Version 1 (seed=42)

In [74]:
# Track data files with DVC (Version 1 - seed=42)
os.chdir(PROJECT_DIR)
!dvc add raw_data.csv train.csv validation.csv test.csv

# Show the .dvc pointer files created
print("\nDVC tracking files created:")
!ls -la *.dvc

In [75]:
# Commit Version 1 (seed=42) to git
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore prepare.ipynb
!git commit -m "Version 1: data splits with RANDOM_STATE=42"
!git tag v1
print("\nVersion 1 committed and tagged as 'v1'")
!git log --oneline

[main 0edda6c] Version 1: data splits with RANDOM_STATE=42
 3 files changed, 6 insertions(+), 6 deletions(-)
fatal: tag 'v1' already exists

Version 1 committed and tagged as 'v1'
0edda6c (HEAD -> main) Version 1: data splits with RANDOM_STATE=42
6661081 Add train.ipynb with 3 benchmark models tracked by MLflow
564de3c Add Google Drive as DVC remote storage
eaf8eeb configure google drive as dvc remote storage
bff705e configure google drive as dvc remote storage
c4f01f1 Version 2: data splits with RANDOM_STATE=123
d3baad4 Version 1: data splits with RANDOM_STATE=42
4a140a4 Initial commit: project setup with DVC initialization
c88235d Add train.ipynb with 3 benchmark models tracked by MLflow
016ac01 Add Google Drive as DVC remote storage
70d0695 (tag: v2) Version 2: data splits with RANDOM_STATE=123
d9bca26 (

## Update Split - Version 2 (seed=123)

In [76]:
# Re-split with a different random seed
RANDOM_STATE_V2 = 123
np.random.seed(RANDOM_STATE_V2)

# Re-split using the preprocessed dataframe (df already has cleaned_message)
train_df_v2, val_df_v2, test_df_v2 = split_data(
    df,
    train_size=0.70,
    val_size=0.15,
    test_size=0.15,
    random_state=RANDOM_STATE_V2
)

# Filter empty messages
train_df_v2 = train_df_v2[train_df_v2['cleaned_message'].str.len() > 0].copy()
val_df_v2 = val_df_v2[val_df_v2['cleaned_message'].str.len() > 0].copy()
test_df_v2 = test_df_v2[test_df_v2['cleaned_message'].str.len() > 0].copy()

print(f"\nVersion 2 splits (seed=123):")
print(f"  Training:   {len(train_df_v2)} samples")
print(f"  Validation: {len(val_df_v2)} samples")
print(f"  Test:       {len(test_df_v2)} samples")

# Save the updated splits (overwrite existing files)
save_splits(train_df_v2, val_df_v2, test_df_v2, output_dir=PROJECT_DIR)

Data Split Summary:
Total samples: 5572

Training set:   3900 samples (70.0%)
Validation set: 836 samples (15.0%)
Test set:       836 samples (15.0%)

Class distribution in each split:
Training    : Ham=3377 (86.59%), Spam=523 (13.41%)
Validation  : Ham= 724 (86.60%), Spam=112 (13.40%)
Test        : Ham= 724 (86.60%), Spam=112 (13.40%)
No data leakage detected between splits

Version 2 splits (seed=123):
  Training:   3898 samples
  Validation: 835 samples
  Test:       836 samples
Files saved successfully:
Training set:   /Users/sampriti/Downloads/cmi/AML_2/train.csv
Validation set: /Users/sampriti/Downloads/cmi/AML_2/validation.csv
Test set:       /Users/sampriti/Downloads/cmi/AML_2/test.csv


In [77]:
# Track updated data with DVC
os.chdir(PROJECT_DIR)
!dvc add raw_data.csv train.csv validation.csv test.csv

# Commit Version 2
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc prepare.ipynb
!git commit -m "Version 2: data splits with RANDOM_STATE=123"
!git tag v2
print("\nVersion 2 committed and tagged as 'v2'")
!git log --oneline

## Checkout Versions and Compare Target Variable Distribution

In [78]:
def print_label_distribution(version_name):
    """Load current CSV files and print ham/spam distribution."""
    print(f"\n{'='*60}")
    print(f"Target Variable Distribution -- {version_name}")
    print(f"{'='*60}")
    
    for split_name, filename in [('Train', 'train.csv'), ('Validation', 'validation.csv'), ('Test', 'test.csv')]:
        filepath = os.path.join(PROJECT_DIR, filename)
        split_df = pd.read_csv(filepath)
        counts = split_df['label'].value_counts()
        total = len(split_df)
        print(f"\n{split_name} ({filename}):")
        print(f"  ham:   {counts.get('ham', 0):5d}  ({counts.get('ham', 0)/total*100:.2f}%)")
        print(f"  spam:  {counts.get('spam', 0):5d}  ({counts.get('spam', 0)/total*100:.2f}%)")
        print(f"  total: {total}")
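
The percentages printed by `print_label_distribution` are plain count ratios. A small standalone sketch of the same calculation is shown below; it uses `Counter` on an in-memory list rather than reading the CSV files, and `label_distribution` is a hypothetical helper. The label counts in the example are chosen to match a training split of 3375 ham and 523 spam messages.

```python
from collections import Counter

def label_distribution(labels):
    """Map each label to (count, percentage rounded to two decimals)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(n / total * 100, 2)) for label, n in counts.items()}

# Example input: 3375 ham and 523 spam labels
labels = ["ham"] * 3375 + ["spam"] * 523
dist = label_distribution(labels)
print(dist)
```

Returning the numbers instead of printing them makes it easier to compare two versions of a split programmatically.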

In [79]:
# Checkout Version 1 (seed=42) and print distribution
os.chdir(PROJECT_DIR)
!git checkout v1 -- raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc
!dvc checkout

print_label_distribution("Version 1 (seed=42)")

M       test.csv
M       train.csv
M       validation.csv

Target Variable Distribution -- Version 1 (seed=42)

Train (train.csv):
  ham:    3375  (86.58%)
  spam:    523  (13.42%)
  total: 3898

Validation (validation.csv):
  ham:     724  (86.60%)
  spam:    112  (13.40%)
  total: 836

Test (test.csv):
  ham:     723  (86.59%)
  spam:    112  (13.41%)
  total: 835


In [80]:
# Checkout Version 2 (seed=123) and print distribution
os.chdir(PROJECT_DIR)
!git checkout v2 -- raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc
!dvc checkout

print_label_distribution("Version 2 (seed=123)")

M       test.csv
M       train.csv
M       validation.csv

Target Variable Distribution -- Version 2 (seed=123)

Train (train.csv):
  ham:    3375  (86.58%)
  spam:    523  (13.42%)
  total: 3898

Validation (validation.csv):
  ham:     723  (86.59%)
  spam:    112  (13.41%)
  total: 835

Test (test.csv):
  ham:     724  (86.60%)
  spam:    112  (13.40%)
  total: 836


## Bonus: Configure Google Drive as DVC Remote Storage

In [81]:
!pip install dvc-gdrive -q

In [None]:
GDRIVE_FOLDER_ID = "1G9H6RgIKeXkNPi-7nYFup5DYxUK3i6bD"

!dvc remote add -d gdrive gdrive://{GDRIVE_FOLDER_ID} -f

!dvc remote modify gdrive gdrive_client_id ''
!dvc remote modify gdrive gdrive_client_secret ''

Setting 'gdrive' as a default remote.

In [83]:
!dvc push

Everything is up to date.

In [84]:
!git add .dvc/config
!git commit -m "configure google drive as dvc remote storage"

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	keys
	mlruns/

nothing added to commit but untracked files present (use "git add" to track)


In [85]:
# Commit remote configuration
!git add .dvc/config
!git commit -a -m "Add Google Drive as DVC remote storage"
!git log --oneline

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	keys
	mlruns/

nothing added to commit but untracked files present (use "git add" to track)
f97c3d1 (HEAD -> main) Version 2: data splits with RANDOM_STATE=123
0edda6c Version 1: data splits with RANDOM_STATE=42
6661081 Add train.ipynb with 3 benchmark models tracked by MLflow
564de3c Add Google Drive as DVC remote storage
eaf8eeb configure google drive as dvc remote storage
bff705e configure google drive as dvc remote storage
c4f01f1 Version 2: data splits with RANDOM_STATE=123
d3baad4 Version 1: data splits with RANDOM_STATE=42
4a140a4 Initial commit: project setup with DVC initialization
c88235d Add train.ipynb with 3 benchmark models tracked by MLflow
016ac01 Add Google Drive as DVC remote storage
70d0695 (tag: