# Assignment 1

### Ayush Yadav (MDS202315)

**Data Version Control**

In `prepare.ipynb`, track the versions of the data using DVC:
1) Load the raw data into `raw_data.csv` and save the split data into `train.csv`, `validation.csv`, and `test.csv`.
2) Update the train/validation/test split by choosing a different random seed.
3) Check out the first version (before the update) using DVC and print the distribution of the target variable (number of 0s and number of 1s) in `train.csv`, `validation.csv`, and `test.csv`.
4) Check out the updated version using DVC and print the distribution of the target variable in `train.csv`, `validation.csv`, and `test.csv`.
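
The distribution check in steps 3 and 4 can be sketched with the standard library alone. This is a minimal sketch, not the notebook's implementation (which uses pandas further down). The helper name `label_counts` and the tiny demo file are hypothetical, and it assumes each split CSV has a `label` column holding 0 or 1.

```python
import csv
import os
import tempfile
from collections import Counter


def label_counts(csv_path, label_column="label"):
    """Count how often each label value occurs in one split file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[label_column] for row in csv.DictReader(handle))


# Hypothetical demo file standing in for train.csv.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "train.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["preprocessed", "label"])
    writer.writerows([["ok lar", 0], ["free entry", 1], ["way office", 0]])

# csv.DictReader yields strings, so the keys are "0" and "1".
demo_counts = label_counts(demo_path)
print("0s:", demo_counts["0"], "1s:", demo_counts["1"])
```

Running the same helper after each `dvc checkout` would give the per-version counts for all three files.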



## Set up DVC

In [11]:
!dvc init --no-scm

Initialized DVC repository.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [12]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	./

nothing added to commit but untracked files present (use "git add" to track)


In [14]:
!git add .dvc
!git commit -m "Reinitialize DVC"
!git push origin main

[main 2005ffa] Reinitialize DVC
 2 files changed, 2 insertions(+)
 create mode 100644 Assignment 2/.dvc/config
 create mode 100644 Assignment 2/.dvc/tmp/btime


To https://github.com/yadavayush7028/AppliedMachineLearning.git
   8364a24..2005ffa  main -> main


In [15]:
!dvc remote add -d gdrive_remote gdrive://1lqWMn1zwrachgWtH82xSFnOsYZsoiDgU

Setting 'gdrive_remote' as a default remote.


In [16]:
!dvc remote modify gdrive_remote gdrive_use_service_account true
!dvc remote modify gdrive_remote --local \
            gdrive_service_account_json_file_path maps-391110-98fa3766ee6f.json

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from scipy import sparse
from scipy.sparse import save_npz
import pickle

#### 1. Loading the Data

In [19]:
data = pd.read_csv('./sms+spam+collection/SMSSpamCollection', sep='\t', names=["label", "message"])
data

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [24]:
# Save the raw data
data.to_csv("raw_data.csv", index=False)

# Track the raw data using DVC
!dvc add raw_data.csv
!git add raw_data.csv.dvc
!git commit -m "Added raw data"
!dvc push

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .dvc/config

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.dvc/cache/
	.dvc/config.local
	.dvc/tmp/lock
	.dvc/tmp/rwlock
	.dvc/tmp/rwlock.lock
	.dvcignore
	maps-391110-98fa3766ee6f.json
	prepare.ipynb
	raw_data.csv
	sms+spam+collection/
	train.ipynb

no changes added to commit (use "git add" and/or "git commit -a")
1 file pushed


In [25]:
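# 'length' (message length in characters) must exist before describe() can summarize it per label
data.assign(length=data['message'].str.len()).groupby('label').describe()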

Summary statistics of `length`, grouped by `label`:

| label | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ham | 4825.0 | 71.482487 | 58.440652 | 2.0 | 33.0 | 52.0 | 93.0 | 910.0 |
| spam | 747.0 | 138.670683 | 28.873603 | 13.0 | 133.0 | 149.0 | 157.0 | 223.0 |


In [26]:
data['length'] = data['message'].str.len()  # message length in characters
data.length.describe()

count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64

#### 2. Data Preprocessing

In [27]:
STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip non-letters, drop stopwords, and lemmatize the remaining words."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove digits, punctuation, and other non-alphabetic characters
    # Drop stopwords, then lemmatize the words that remain
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split() if word not in STOPWORDS)

    return text
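
To see what the cleaning steps do to a single message without downloading the NLTK corpora, here is a simplified sketch. It is a stand-in under stated assumptions, not the function above: `TOY_STOPWORDS` is a hypothetical five-word list rather than NLTK's English stopwords, and lemmatization is skipped entirely.

```python
import re

# Hypothetical miniature stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"i", "he", "to", "in", "a"}


def toy_preprocess(text):
    """Lowercase, drop non-letters, and remove the toy stopwords (no lemmatization)."""
    text = text.lower()
    # Digits and punctuation are deleted outright, so "don't" collapses to "dont".
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)


cleaned_short = toy_preprocess("Ok lar... Joking wif u oni...")
cleaned_long = toy_preprocess("I don't think he goes to usf")
print(cleaned_short)
print(cleaned_long)
```

Because the apostrophe is removed before the stopword check, a contraction such as "don't" becomes "dont" and no longer matches a stopword entry, which would explain why "dont" survives in the `preprocessed` column.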

In [28]:
data['preprocessed'] = data.message.apply(preprocess_text)
data['label'] = data.label.map({'ham':0,'spam':1})
data.drop(columns=['length'], inplace=True)
data

| | label | message | preprocessed |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win fa cup final tkts st ... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah dont think go usf life around though |
| ... | ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... | nd time tried contact u u pound prize claim ea... |
| 5568 | 0 | Will ü b going to esplanade fr home? | b going esplanade fr home |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... | pity mood soany suggestion |
| 5570 | 0 | The guy did some bitching but I acted like i'd... | guy bitching acted like id interested buying s... |


In [30]:
### Tokenize the text and create the vocabulary

tokens = word_tokenize(" ".join(data['preprocessed']))
bag_of_words = CountVectorizer().fit(tokens)

vocab = bag_of_words.vocabulary_
print("Vocabulary Size:",len(vocab))

Vocabulary Size: 7947


In [31]:
with open('./sms+spam+collection/bag_of_words.pkl','wb') as f:
    pickle.dump(bag_of_words,f)

In [32]:
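# Note: the raw messages are transformed here, while the vocabulary was fitted on the preprocessed text
bow_msgs = bag_of_words.transform(data['message'])
print('sparse matrix shape:', bow_msgs.shape)
print('number of non-zeros:', bow_msgs.nnz)
# nnz / total cells is the density (share of non-zero entries), not the sparsity
print('density: %.2f%%' % (100.0 * bow_msgs.nnz / (bow_msgs.shape[0] * bow_msgs.shape[1])))

sparse matrix shape: (5572, 7947)
number of non-zeros: 44810
density: 0.10%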
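
The 0.10% figure is the share of cells that are non-zero, which is usually called density; sparsity is its complement. A short arithmetic check using only the numbers printed above (no new data is assumed):

```python
# Figures taken from the output above.
n_rows, n_cols = 5572, 7947
n_nonzero = 44810

density = n_nonzero / (n_rows * n_cols)  # share of cells that hold a count
sparsity = 1.0 - density                 # share of cells that are zero

print(f"density:  {100 * density:.2f}%")
print(f"sparsity: {100 * sparsity:.2f}%")
```

In other words, roughly one cell in a thousand is populated, which is why a sparse matrix format is the natural choice for the bag-of-words representation.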


In [33]:
bow_msgs.shape

(5572, 7947)

In [34]:
DATASET = data['preprocessed']
LABELS = data['label'].values

#### 3. Version 1: Splitting the Data

In [35]:
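# Hold out 15% of the data as the test set
X, X_test, Y, Y_test = train_test_split(DATASET, LABELS, test_size=0.15, random_state=143)

In [36]:
# Split the remaining 85% into train (80%) and validation (20%)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=143)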

In [37]:
print("Train Size:", len(X_train))
print("Validation Size:", len(X_val))
print("Test Size:", len(X_test))

Train Size: 3788
Validation Size: 948
Test Size: 836
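
The three sizes follow from the two-stage split. The sketch below reproduces them by hand, under the assumption that the held-out share is rounded up and the other part takes the remainder, which agrees with the sizes printed above. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction, val_fraction):
    """Return (train, validation, test) sizes for a two-stage split."""
    n_test = math.ceil(n_samples * test_fraction)   # first split: hold out the test set
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_fraction)   # second split: carve validation out of the rest
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


sizes = split_sizes(5572, test_fraction=0.15, val_fraction=0.2)
print(sizes)
```

The effective proportions are therefore about 68% train, 17% validation, and 15% test.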


In [38]:
TRAIN_DF = pd.DataFrame(X_train)
TRAIN_DF['label'] = Y_train

VAL_DF = pd.DataFrame(X_val)
VAL_DF['label'] = Y_val

TEST_DF = pd.DataFrame(X_test)
TEST_DF['label'] = Y_test

In [39]:
TRAIN_DF

| | preprocessed | label |
|---|---|---|
| 4982 | said okay sorry | 0 |
| 3597 | good morning princess happy new year | 0 |
| 2319 | way office da | 0 |
| 1741 | ur going bahamas callfreefone speak live opera... | 1 |
| 5467 | get garden ready summer free selection summer ... | 1 |
| ... | ... | ... |
| 5256 | well shes big surprise | 0 |
| 3899 | otherwise part time job natuition | 0 |
| 5117 | aslamalaikkuminsha allah tohar beeen muht albi... | 0 |
| 4972 | hey come online use msn | 0 |


In [40]:
VAL_DF

| | preprocessed | label |
|---|---|---|
| 3819 | xmas iscoming ur awarded either cd gift vouche... | 1 |
| 5058 | hey next sun there basic yoga course bugis go ... | 0 |
| 480 | whenre guy getting back g said thinking stayin... | 0 |
| 2398 | neshanthtel r u | 0 |
| 2700 | oh baby house come dont new picture facebook | 0 |
| ... | ... | ... |
| 2593 | friend got say he upping order gram he got ltg... | 0 |
| 1455 | decide faster co si going home liao | 0 |
| 2014 | great news call freefone claim guaranteed cash... | 1 |
| 4095 | miss | 0 |


In [41]:
TEST_DF

| | preprocessed | label |
|---|---|---|
| 2248 | back work morro half term u c nite sexy passio... | 1 |
| 4111 | yo gonna still stock tomorrowtoday im trying g... | 0 |
| 4478 | oh outside player allowed play know | 0 |
| 2759 | time im prob | 0 |
| 1140 | messagesome text missing sendername missing nu... | 0 |
| ... | ... | ... |
| 2613 | yes innocent fun | 0 |
| 4576 | directly behind abt row behind | 0 |
| 4986 | dont let studying stress lr | 0 |
| 1869 | today system sh get readyall well also deep well | 0 |


Saving the Splits in Separate Files for Version 1

In [44]:
TRAIN_DF.to_csv('./TRAIN.csv', index=False)
VAL_DF.to_csv('./VALIDATION.csv', index=False)
TEST_DF.to_csv('./TEST.csv', index=False)

In [45]:
# Track the version 1 with DVC
!dvc add TRAIN.csv VALIDATION.csv TEST.csv
!git add TRAIN.csv.dvc VALIDATION.csv.dvc TEST.csv.dvc
!git commit -m "Version 1 of train/validation/test split"
!dvc push

[main 63c8d6a] Version 1 of train/validation/test split
 3 files changed, 15 insertions(+)
 create mode 100644 Assignment 2/TEST.csv.dvc
 create mode 100644 Assignment 2/TRAIN.csv.dvc
 create mode 100644 Assignment 2/VALIDATION.csv.dvc
3 files pushed
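
DVC can tell two versions of a split apart because each `.dvc` file records a content hash of the data file it tracks, and Git versions that small file. The sketch below illustrates the idea with `hashlib`. It is not DVC's own code, and the file contents and the helper names `file_md5` and `write_demo` are made up for the demonstration.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_demo(path, rows):
    """Overwrite the demo CSV with a header and the given rows."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("preprocessed,label\n" + "\n".join(rows) + "\n")


demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "TRAIN.csv")

write_demo(path, ["ok lar,0", "free entry,1"])      # made-up "version 1"
hash_v1 = file_md5(path)

write_demo(path, ["free entry,1", "way office,0"])  # made-up "version 2"
hash_v2 = file_md5(path)

print(hash_v1 != hash_v2)  # different content gives a different hash
```

Checking out an older Git commit restores the older hash in the `.dvc` file, and `dvc checkout` then swaps in the matching file from the cache.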


#### 4. Version 2: Splitting the Data with different random seed

In [46]:
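# Same 15% test split as Version 1, but with a different random seed
X, X_test, Y, Y_test = train_test_split(DATASET, LABELS, test_size=0.15, random_state=1432)

In [47]:
# Split the remaining 85% into train (80%) and validation (20%) with the new seed
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=1432)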

In [48]:
print("Train Size:", len(X_train))
print("Validation Size:", len(X_val))
print("Test Size:", len(X_test))

Train Size: 3788
Validation Size: 948
Test Size: 836
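
Changing the seed changes which rows land in each split, not how many. A toy illustration with the standard library's `random` module follows. This is an assumption made for the sketch: `train_test_split` relies on NumPy's generator, so the actual permutations differ, and `shuffled` is a hypothetical helper.

```python
import random

indices = list(range(20))  # stand-in for 20 row indices


def shuffled(seed):
    """Return a seeded shuffle of the indices without touching the original list."""
    rng = random.Random(seed)
    copy = indices[:]
    rng.shuffle(copy)
    return copy


first = shuffled(143)
second = shuffled(1432)
repeat = shuffled(143)

print(first == repeat)            # the same seed reproduces the same order
print(len(first) == len(second))  # the seed does not affect the number of rows
```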


In [49]:
TRAIN_DF = pd.DataFrame(X_train)
TRAIN_DF['label'] = Y_train

VAL_DF = pd.DataFrame(X_val)
VAL_DF['label'] = Y_val

TEST_DF = pd.DataFrame(X_test)
TEST_DF['label'] = Y_test

Saving the Splits in Separate Files for Version 2

In [50]:
TRAIN_DF.to_csv('./TRAIN.csv', index=False)
VAL_DF.to_csv('./VALIDATION.csv', index=False)
TEST_DF.to_csv('./TEST.csv', index=False)

In [51]:
# Track version 2 with DVC
!dvc add TRAIN.csv VALIDATION.csv TEST.csv
!git add TRAIN.csv.dvc VALIDATION.csv.dvc TEST.csv.dvc
!git commit -m "Version 2 of train/validation/test split"
!dvc push

[main a249381] Version 2 of train/validation/test split
 3 files changed, 6 insertions(+), 6 deletions(-)
3 files pushed


#### Checking the Version 1 distribution

In [52]:
!git log 

commit a2493817ce89ceaefa38eac1e865f5bff7efa168
Author: Ayush Yadav <yadavayush7028@gmail.com>
Date:   Tue Mar 4 14:18:12 2025 +0530

    Version 2 of train/validation/test split

commit 63c8d6aef7d9530e41cb4169c6c966f5d81e6403
Author: Ayush Yadav <yadavayush7028@gmail.com>
Date:   Tue Mar 4 14:15:02 2025 +0530

    Version 1 of train/validation/test split

commit 0e29c03effe0d869bf37c9ea008c5eb848dc1b1f
Author: Ayush Yadav <yadavayush7028@gmail.com>
Date:   Tue Mar 4 14:07:15 2025 +0530

    Added raw data

commit 2005ffab53da966f6c93435e7fc54d4f849e9640
Author: Ayush Yadav <yadavayush7028@gmail.com>
Date:   Tue Mar 4 14:03:09 2025 +0530

    Reinitialize DVC

commit 8364a24b0beedc2cf89955fa4e8de59da396ce65
Author: Ayush Yadav <yadavayush7028@gmail.com>
Date:   Thu Jan 30 16:16:26 2025 +0530

    ''

commit fc9e18397d0a0fb58e330d28941f0ee7d687af86
Author: Ayush Yadav <89350275+yadavayush7028@users.noreply.github.com>
Date:   Thu Jan 9 17:05:58 2025 +0530

    Initial commit
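
Instead of copying a hash out of the log by eye, the commit for a given version can be looked up by its message. The sketch below parses text in the format printed above. `find_commit` is a hypothetical helper, and the sample string is an abridged copy of the log shown here (author and date lines omitted).

```python
def find_commit(log_text, message_fragment):
    """Return the hash of the first commit whose message contains the fragment."""
    current_hash = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            current_hash = line.split()[1]
        elif line.startswith("    ") and message_fragment in line:
            # Commit messages are indented by four spaces in `git log` output.
            return current_hash
    return None


sample_log = """\
commit a2493817ce89ceaefa38eac1e865f5bff7efa168

    Version 2 of train/validation/test split

commit 63c8d6aef7d9530e41cb4169c6c966f5d81e6403

    Version 1 of train/validation/test split
"""

version_1 = find_commit(sample_log, "Version 1")
print(version_1)
```

The returned hash is what gets passed to `git checkout` before running `dvc checkout`.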


In [54]:
!git checkout 63c8d6aef7d9530e41cb4169c6c966f5d81e6403
!dvc checkout

M	Assignment 2/.dvc/config


Note: switching to '63c8d6aef7d9530e41cb4169c6c966f5d81e6403'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 63c8d6a Version 1 of train/validation/test split


M       TEST.csv
M       VALIDATION.csv
M       TRAIN.csv


In [56]:
# Load and print class distributions
TRAIN = pd.read_csv("TRAIN.csv")
VALIDATION = pd.read_csv("VALIDATION.csv")
TEST = pd.read_csv("TEST.csv")

print("Version 1: class distribution:")
print("Train:\n", TRAIN["label"].value_counts())
print("Validation:\n", VALIDATION["label"].value_counts())
print("Test:\n", TEST["label"].value_counts())

Version 1: class distribution:
Train:
 label
0    3273
1     515
Name: count, dtype: int64
Validation:
 label
0    827
1    121
Name: count, dtype: int64
Test:
 label
0    725
1    111
Name: count, dtype: int64


#### Checking the Version 2 distribution

In [57]:
!git checkout a2493817ce89ceaefa38eac1e865f5bff7efa168
!dvc checkout

M	Assignment 2/.dvc/config


Previous HEAD position was 63c8d6a Version 1 of train/validation/test split
HEAD is now at a249381 Version 2 of train/validation/test split


M       VALIDATION.csv
M       TEST.csv
M       TRAIN.csv


In [58]:
# Load and print class distributions
TRAIN = pd.read_csv("TRAIN.csv")
VALIDATION = pd.read_csv("VALIDATION.csv")
TEST = pd.read_csv("TEST.csv")

print("Version 2: class distribution:")
print("Train:\n", TRAIN["label"].value_counts())
print("Validation:\n", VALIDATION["label"].value_counts())
print("Test:\n", TEST["label"].value_counts())

Version 2: class distribution:
Train:
 label
0    3296
1     492
Name: count, dtype: int64
Validation:
 label
0    817
1    131
Name: count, dtype: int64
Test:
 label
0    712
1    124
Name: count, dtype: int64
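
To compare the two versions directly, the raw counts can be turned into spam percentages per split. The sketch below uses only the counts printed in the two outputs above; the dictionary layout and variable names are illustrative choices.

```python
# (ham, spam) counts copied from the two outputs above.
counts = {
    "version 1": {"train": (3273, 515), "validation": (827, 121), "test": (725, 111)},
    "version 2": {"train": (3296, 492), "validation": (817, 131), "test": (712, 124)},
}

# Spam share in percent for every split of every version.
spam_share = {
    version: {split: 100 * spam / (ham + spam) for split, (ham, spam) in splits.items()}
    for version, splits in counts.items()
}

for version, shares in spam_share.items():
    summary = ", ".join(f"{split} {share:.2f}%" for split, share in shares.items())
    print(f"{version}: {summary}")
```

Both versions contain the same 747 spam and 4825 ham messages overall. Only their allocation across the splits changes, with the spam share ranging from roughly 12.8% to 14.8%.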
