# SMS Spam Data Preparation with DVC (Unstratified)

This notebook downloads the SMS Spam Collection dataset, splits it into training, validation, and test sets (without stratification), and uses DVC to track the data versions.

In [2]:
import os
import requests
import zipfile
import pandas as pd
from sklearn.model_selection import train_test_split
import subprocess

## 1. Download and Extract Data

In [3]:
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "smsspamcollection.zip"
data_file = "SMSSpamCollection"

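if not os.path.exists(data_file):
    print(f"Downloading dataset from {url}...")
    # Fail fast on HTTP errors instead of writing an error page to disk
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(zip_path, 'wb') as f:
        f.write(response.content)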

    print("Extracting dataset...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(".")
    print("Data downloaded and extracted.")
else:
    print("Data already exists.")

Data already exists.


## 2. Process and Save Raw Data

In [4]:
# Load raw data (tab-separated, no header)
df = pd.read_csv(data_file, sep='\t', names=['label', 'text'])
df.to_csv('raw_data.csv', index=False)
print(f"Raw data saved to raw_data.csv with {len(df)} records.")

Raw data saved to raw_data.csv with 5572 records.


## 3. Split Data into Train, Validation, and Test (Seed 42 - Unstratified)

In [5]:
# Split: 80% train, 10% validation, 10% test (unstratified)
# No stratify argument is passed, so class ratios can differ between splits
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_df.to_csv('train.csv', index=False)
val_df.to_csv('validation.csv', index=False)
test_df.to_csv('test.csv', index=False)

print(f"Train size: {len(train_df)}")
print(f"Validation size: {len(val_df)}")
print(f"Test size: {len(test_df)}")

Train size: 4457
Validation size: 557
Test size: 558
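
The sizes printed above follow from how a fractional `test_size` is rounded. As a rough sketch (assuming the held-out part is rounded up with a ceiling and the other part receives the remainder, which matches the output above), the arithmetic can be reproduced with the standard library alone. The helper name `split_sizes` is illustrative and not part of the notebook.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_first, n_second) for a fractional hold-out split.

    Assumption: the held-out part is ceil(test_fraction * n_samples)
    and the first part receives whatever is left.
    """
    n_second = math.ceil(test_fraction * n_samples)
    return n_samples - n_second, n_second

n_total = 5572  # record count reported when raw_data.csv was saved
n_train, n_temp = split_sizes(n_total, 0.2)  # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.5)     # held-out part split in half

print(f"train={n_train}, validation={n_val}, test={n_test}")
# prints: train=4457, validation=557, test=558
```

The odd held-out count (1115) is why validation and test differ by one row.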


## 4. Track with DVC (v1.0)

In [6]:
def run_command(cmd):
    """Run a shell command; print stdout on success, stderr on failure."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        print(result.stdout)
    else:
        print(result.stderr)

print("Adding files to DVC...")
run_command("dvc add raw_data.csv train.csv validation.csv test.csv")

print("Committing DVC changes to Git...")
run_command("git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore")
run_command('git commit -m "Add unstratified data versions (Seed 42)"')
run_command('git tag -a v1.0 -m "Unstratified version with seed 42"')

Adding files to DVC...

To track the changes with git, run:

	git add raw_data.csv.dvc validation.csv.dvc train.csv.dvc test.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

Committing DVC changes to Git...

[master c6ae12c] Add unstratified data versions (Seed 42)
 3 files changed, 6 insertions(+), 6 deletions(-)

fatal: tag 'v1.0' already exists
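
The `fatal: tag 'v1.0' already exists` message appears when the cell runs in a repository where the tag was created earlier. One possible way to make the tagging step safe to re-run is to check for the tag first. The sketch below is an assumption about how that could look, not part of the original notebook: `ensure_tag` and its `run` parameter are hypothetical names, and the check relies on `git tag --list <name>` printing the tag name only when the tag exists.

```python
import subprocess

def ensure_tag(tag, message, run=subprocess.run):
    """Create an annotated Git tag unless it already exists.

    Returns True if the tag was created, False if it was already present.
    `run` is injectable (a hypothetical convenience) so the logic can be
    exercised without touching a real repository.
    """
    listed = run(["git", "tag", "--list", tag],
                 capture_output=True, text=True)
    if tag in listed.stdout.split():
        print(f"Tag {tag} already exists; leaving it untouched.")
        return False
    created = run(["git", "tag", "-a", tag, "-m", message],
                  capture_output=True, text=True)
    if created.returncode != 0:
        raise RuntimeError(created.stderr.strip())
    return True
```

A call such as `ensure_tag("v1.0", "Unstratified version with seed 42")` would then replace the plain `git tag` command. Note that an existing tag keeps pointing at its original commit; moving it to a newer commit is a separate decision.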



## 5. Update Split with a Different Random Seed (Seed 100 - Unstratified)

In this section, we re-split the same raw data with `random_state=100`, again without stratification. The split sizes stay the same; only the assignment of rows to each split changes.

In [7]:
new_seed = 100
print(f"Updating splits with random_state={new_seed}...")

# No stratify argument is passed, so class ratios can differ between splits
train_df_new, temp_df_new = train_test_split(df, test_size=0.2, random_state=new_seed)
val_df_new, test_df_new = train_test_split(temp_df_new, test_size=0.5, random_state=new_seed)

train_df_new.to_csv('train.csv', index=False)
val_df_new.to_csv('validation.csv', index=False)
test_df_new.to_csv('test.csv', index=False)

print(f"New Train size: {len(train_df_new)}")
print(f"New Validation size: {len(val_df_new)}")
print(f"New Test size: {len(test_df_new)}")

Updating splits with random_state=100...
New Train size: 4457
New Validation size: 557
New Test size: 558
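
Identical split sizes do not mean identical splits: the seed decides which rows land where. The sketch below illustrates this with the standard library's `random` module as a stand-in for scikit-learn's shuffling, so the actual row assignments will differ from those produced by `train_test_split`. The helper `held_out_indices` is hypothetical.

```python
import math
import random

def held_out_indices(n_samples, fraction, seed):
    """Shuffle row indices with a seeded generator and hold out a fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_held_out = math.ceil(fraction * n_samples)
    return set(indices[:n_held_out])

n_rows = 5572  # size of the raw dataset
held_out_42 = held_out_indices(n_rows, 0.2, seed=42)
held_out_100 = held_out_indices(n_rows, 0.2, seed=100)

shared = held_out_42 & held_out_100
print(f"Held-out size with seed 42:  {len(held_out_42)}")
print(f"Held-out size with seed 100: {len(held_out_100)}")
print(f"Rows held out under both seeds: {len(shared)}")
```

Both held-out sets have the same size, but only part of their rows overlap, which is why the regenerated CSV files need a new DVC version.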


## 6. Track New Version with DVC (v2.0)

Next, we track the re-split files with DVC and commit the updated `.dvc` pointer files to Git.

In [8]:
print("Adding new data versions to DVC...")
run_command("dvc add train.csv validation.csv test.csv")

print("Committing updated DVC pointers to Git...")
run_command("git add train.csv.dvc validation.csv.dvc test.csv.dvc")
run_command(f'git commit -m "Update unstratified data splits (Seed {new_seed})"')
run_command('git tag -a v2.0 -m "Unstratified version with seed 100"')

Adding new data versions to DVC...

To track the changes with git, run:

	git add train.csv.dvc validation.csv.dvc test.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

Committing updated DVC pointers to Git...

[master 78487fa] Update unstratified data splits (Seed 100)
 3 files changed, 6 insertions(+), 6 deletions(-)

fatal: tag 'v2.0' already exists



## 8. Compare Distributions (v1.0 vs v2.0)

In [9]:
def get_distribution(tag_name):
    """Restore the data tracked at a Git tag and print label counts per split."""
    print(f"\n--- Checking out {tag_name} ---")
    # Checkout .dvc files from the git tag
    run_command(f"git checkout {tag_name} -- train.csv.dvc validation.csv.dvc test.csv.dvc")
    # DVC checkout to restore data
    run_command("dvc checkout")

    for filename in ['train.csv', 'validation.csv', 'test.csv']:
        if os.path.exists(filename):
            df = pd.read_csv(filename)
            print(f"Distribution for {filename} ({tag_name}):")
            print(df['label'].value_counts())
            print("-" * 20)



In [10]:
# 1. Checkout v1.0 and print distribution
get_distribution("v1.0")


--- Checking out v1.0 ---

M       test.csv
M       train.csv
M       validation.csv

Distribution for train.csv (v1.0):
label
ham     3859
spam     598
Name: count, dtype: int64
--------------------
Distribution for validation.csv (v1.0):
label
ham     485
spam     72
Name: count, dtype: int64
--------------------
Distribution for test.csv (v1.0):
label
ham     481
spam     77
Name: count, dtype: int64
--------------------


In [11]:
# 2. Checkout v2.0 and print distribution
get_distribution("v2.0")


--- Checking out v2.0 ---

M       test.csv
M       train.csv
M       validation.csv

Distribution for train.csv (v2.0):
label
ham     3857
spam     600
Name: count, dtype: int64
--------------------
Distribution for validation.csv (v2.0):
label
ham     490
spam     67
Name: count, dtype: int64
--------------------
Distribution for test.csv (v2.0):
label
ham     478
spam     80
Name: count, dtype: int64
--------------------
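
Raw counts are hard to compare across splits of different sizes, so it can help to convert them to spam proportions. The snippet below is a small sketch that hard-codes the counts printed above; the dictionary layout and the `spam_share` helper are illustrative choices.

```python
# (ham, spam) counts copied from the outputs above
counts = {
    "v1.0": {"train": (3859, 598), "validation": (485, 72), "test": (481, 77)},
    "v2.0": {"train": (3857, 600), "validation": (490, 67), "test": (478, 80)},
}

def spam_share(ham, spam):
    """Fraction of messages in a split that are labelled spam."""
    return spam / (ham + spam)

for tag, splits in counts.items():
    total_ham = sum(ham for ham, _ in splits.values())
    total_spam = sum(spam for _, spam in splits.values())
    print(f"{tag}: overall spam share {spam_share(total_ham, total_spam):.2%}")
    for name, (ham, spam) in splits.items():
        print(f"  {name:<10} {spam_share(ham, spam):.2%}")
```

Both versions contain the same 747 spam messages overall (about 13.4%), but without stratification the share in the smaller splits drifts: in v2.0, validation falls to roughly 12.0% while test rises to roughly 14.3%.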


## BONUS: Decouple Compute and Storage by Tracking Data Versions on Google Drive


### NOTE: My OAuth client ID and client secret are not shown here, but they were used for authentication

DVC remote storage (Google Drive folder): https://drive.google.com/drive/folders/1gMTOXybo0KtEAHUu-uHj9iaQlTC3lPQt?usp=sharing

In [None]:
# run_command("dvc remote modify myremote gdrive_client_id MY_CLIENT_ID")
# run_command("dvc remote modify myremote gdrive_client_secret MY_CLIENT_SECRET")
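
The remote named `myremote` in the commented commands has to be registered before `dvc push` can use it. The sketch below shows one way the commands could be assembled from the shared folder link. `gdrive_remote_commands` is a hypothetical helper, the `gdrive://<folder-id>` form follows DVC's Google Drive remote convention, and whether the remote was configured exactly this way is an assumption.

```python
from urllib.parse import urlparse

def gdrive_remote_commands(folder_url, remote_name="myremote"):
    """Build DVC commands that register a Google Drive folder as the default remote."""
    # The folder ID is the last path segment of the sharing link
    folder_id = urlparse(folder_url).path.rstrip("/").split("/")[-1]
    return [
        f"dvc remote add -d {remote_name} gdrive://{folder_id}",
        # Placeholders only: real credentials should stay out of the notebook
        f"dvc remote modify {remote_name} gdrive_client_id MY_CLIENT_ID",
        f"dvc remote modify {remote_name} gdrive_client_secret MY_CLIENT_SECRET",
    ]

folder_url = "https://drive.google.com/drive/folders/1gMTOXybo0KtEAHUu-uHj9iaQlTC3lPQt?usp=sharing"
for command in gdrive_remote_commands(folder_url):
    print(command)
```

Adding `--local` to the `dvc remote modify` commands writes the credentials to `.dvc/config.local`, which is not tracked by Git.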

In [13]:
run_command("dvc push")

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=142561804177-am7q7haaassvih0pu36d8ovqbl0ostgi.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Authentication successful.
4 files pushed



## 10. Push All Versions to Remote

We ensure all tagged versions (v1.0 and v2.0) are pushed to Google Drive using `dvc push --all-tags`.

In [14]:
print("Pushing all tags to remote...")
run_command("dvc push --all-tags")
print("Push complete.")

Pushing all tags to remote...
Everything is up to date.

Push complete.


## 11. Verify Remote Retrieval

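We simulate retrieving data from the remote by deleting the local data files and pulling each tagged version from Google Drive. Deleting only the workspace files leaves the local DVC cache in place, so `dvc pull` may restore them from the cache without downloading; a fresh clone or an emptied cache gives a stricter test.
**Note**: We use `git checkout <tag> -- <file>.dvc` to update the DVC pointer files without changing the notebook itself.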

In [16]:
def verify_version(tag):
    """Restore the .dvc pointers for a tag, delete local data, and pull it back."""
    print(f"\n--- Verifying Remote Pull for {tag} ---")

    # 1. Update DVC files to match the tag
    dvc_files = "raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc"
    run_command(f"git checkout {tag} -- {dvc_files}")

    # 2. Delete local workspace copies so dvc pull has to restore them
    for f in ['raw_data.csv', 'train.csv', 'validation.csv', 'test.csv']:
        if os.path.exists(f):
            os.remove(f)

    # 3. Pull the data back
    # Note: DVC restores from the local cache when it can, so this only
    # exercises the remote if the cache does not already hold the files
    run_command("dvc pull")

    # 4. Verify existence
    if os.path.exists("train.csv"):
        print(f"SUCCESS: Data for {tag} pulled from remote.")
        # Row count is a sanity check only: both versions have the same size
        print(f"Train rows: {len(pd.read_csv('train.csv'))}")
    else:
        print(f"FAILURE: Data for {tag} NOT found after pull.")

# Verify v1.0
verify_version("v1.0")

# Verify v2.0
verify_version("v2.0")

# Restore DVC files to HEAD (Current state)
print("\n--- Restoring HEAD ---")
run_command("git checkout HEAD -- raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc")
run_command("dvc pull")


--- Verifying Remote Pull for v1.0 ---

A       raw_data.csv
A       test.csv
A       train.csv
A       validation.csv
4 files added

SUCCESS: Data for v1.0 pulled from remote.
Train rows: 4457

--- Verifying Remote Pull for v2.0 ---

A       raw_data.csv
A       test.csv
A       train.csv
A       validation.csv
4 files added

SUCCESS: Data for v2.0 pulled from remote.
Train rows: 4457

--- Restoring HEAD ---

Everything is up to date.
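
Because both versions have 4457 training rows, the row count cannot tell v1.0 and v2.0 apart. A stricter check, sketched below under the assumption that the label counts printed in the comparison section are the reference values, compares the spam count in the pulled `train.csv` with the figure expected for each tag. The names `EXPECTED_TRAIN_SPAM` and `train_matches_tag` are hypothetical.

```python
import csv
from collections import Counter

# Spam counts in train.csv per tag, taken from the distribution output above
EXPECTED_TRAIN_SPAM = {"v1.0": 598, "v2.0": 600}

def train_matches_tag(tag, path="train.csv"):
    """Return True if the spam count in `path` equals the value expected for `tag`."""
    with open(path, newline="", encoding="utf-8") as handle:
        labels = Counter(row["label"] for row in csv.DictReader(handle))
    return labels["spam"] == EXPECTED_TRAIN_SPAM[tag]
```

Calling `train_matches_tag(tag)` right after `dvc pull` inside `verify_version` would flag a pull that restored the wrong version.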



In [18]:
def analyze_distribution(tag):
    print(f"\n=== Distribution Analysis for {tag} ===")
    
    # 1. Checkout DVC files
    dvc_files = "raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc"
    run_command(f"git checkout {tag} -- {dvc_files}")
    
    # 2. Force pull data
    for f in ['train.csv', 'validation.csv', 'test.csv']:
        if os.path.exists(f):
            os.remove(f)
    run_command("dvc pull")
    
    # 3. Analyze
    for filename in ['train.csv', 'validation.csv', 'test.csv']:
        if os.path.exists(filename):
            df = pd.read_csv(filename)
            print(f"\n--- {filename} ({tag}) ---")
            print(df['label'].value_counts())
            print(f"Total rows: {len(df)}")
        else:
            print(f"Warning: {filename} not found for {tag}")

# Analyze v1.0
analyze_distribution("v1.0")

# Analyze v2.0
analyze_distribution("v2.0")

# Restore HEAD
print("\n=== Restoring HEAD ===")
run_command("git checkout HEAD -- raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc")
run_command("dvc pull")


=== Distribution Analysis for v1.0 ===



A       test.csv
A       train.csv
A       validation.csv
3 files added


--- train.csv (v1.0) ---
label
ham     3859
spam     598
Name: count, dtype: int64
Total rows: 4457

--- validation.csv (v1.0) ---
label
ham     485
spam     72
Name: count, dtype: int64
Total rows: 557

--- test.csv (v1.0) ---
label
ham     481
spam     77
Name: count, dtype: int64
Total rows: 558

=== Distribution Analysis for v2.0 ===

A       test.csv
A       train.csv
A       validation.csv
3 files added


--- train.csv (v2.0) ---
label
ham     3857
spam     600
Name: count, dtype: int64
Total rows: 4457

--- validation.csv (v2.0) ---
label
ham     490
spam     67
Name: count, dtype: int64
Total rows: 557

--- test.csv (v2.0) ---
label
ham     478
spam     80
Name: count, dtype: int64
Total rows: 558

=== Restoring HEAD ===

Everything is up to date.
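
Both cells above restore the HEAD pointers manually at the end, a step that is skipped if an exception occurs midway. A possible alternative, offered as a sketch rather than the notebook's method, wraps the checkout in a context manager so the restore always runs. `pinned_version` is a hypothetical name, and `run_shell` is an assumed stand-in for the notebook's `run_command`.

```python
import subprocess
from contextlib import contextmanager

DVC_FILES = "raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc"

def run_shell(cmd):
    """Run a shell command and return its exit code (stand-in for run_command)."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).returncode

@contextmanager
def pinned_version(tag, run=run_shell):
    """Temporarily switch the DVC pointers and data to `tag`, then restore HEAD."""
    run(f"git checkout {tag} -- {DVC_FILES}")
    run("dvc checkout")
    try:
        yield tag
    finally:
        # Always runs, even if the body raises
        run(f"git checkout HEAD -- {DVC_FILES}")
        run("dvc checkout")
```

Any analysis placed inside `with pinned_version("v1.0"):` would then see the v1.0 data, and the workspace returns to HEAD when the block exits.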

