#### **AML Assignment 02 : Version Control**
#### **Task 2.1: Data version control**


- Name: Soumyajoy Kundu
- Roll No: MDS202349

----

* In `prepare.ipynb`, track the versions of the data using DVC:
    * Load the raw data into `raw_data.csv` and save the split data into `train.csv`/`validation.csv`/`test.csv`.
    * Update the train/validation/test split by choosing a different random seed.
    * Check out the first version (before the update) using DVC and print the distribution of the target variable (number of 0s and number of 1s) in `train.csv`, `validation.csv`, and `test.csv`.
    * Check out the updated version using DVC and print the distribution of the target variable in `train.csv`, `validation.csv`, and `test.csv`.
    * *Bonus*: (decouple compute and storage) track the data versions using Google Drive as storage.
---
**References**: (Data Version Control)
* https://dvc.org/doc/start/data-management/data-versioning
* https://realpython.com/python-data-version-control/
* https://towardsdatascience.com/how-to-manage-files-in-google-drive-with-python-d26471d91ecd
* https://madewithml.com/courses/mlops/versioning/


In [1]:
!pip install dvc
!pip install pandas scikit-learn



### Importing necessary libraries

In [4]:
import csv
import random
import re
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

%matplotlib inline

random.seed(42)
nltk.download('stopwords')
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


* Converting the input file to CSV format for further processing

In [5]:
# Read the text file and process the data
input_file = "/content/SMSSpamCollection"
output_file = "/content/SMSSpamCollection.csv"

with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8", newline="") as outfile:
    writer = csv.writer(outfile)

    # Write the header
    writer.writerow(["Label", "Message"])

    # Read each line and split it into label and message
    for line in infile:
        try:
            # Split only on the first tab so that messages containing tabs stay intact
            label, message = line.strip().split("\t", 1)
            writer.writerow([label, message])
        except ValueError:
            # Skip lines that don't match the format
            continue

print(f"Conversion complete! The CSV file is saved as {output_file}.")

Conversion complete! The CSV file is saved as /content/SMSSpamCollection.csv.
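
The same tab-delimited parsing can also be sketched with the standard `csv` module instead of a manual `split`. This is a minimal illustration on an in-memory sample rather than the real file; the sample rows and the `parse_sms_lines` helper are hypothetical.

```python
import csv
import io

def parse_sms_lines(lines):
    """Yield (label, message) pairs from tab-separated lines.

    QUOTE_NONE keeps quote characters inside messages as literal text.
    Lines without a tab are skipped; extra tabs are kept in the message.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            yield row[0], "\t".join(row[1:])

# Hypothetical sample standing in for SMSSpamCollection
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
    "malformed line without a tab\n"
)
rows = list(parse_sms_lines(sample))
print(rows)
```

The malformed third line is dropped, which mirrors the `except ValueError: continue` branch in the cell above.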


In [6]:
raw_data = pd.read_csv("/content/SMSSpamCollection.csv")
raw_data

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5569 | spam | This is the 2nd time we have tried 2 contact u... |
| 5570 | ham | Will ü b going to esplanade fr home? |
| 5571 | ham | Pity, * was in mood for that. So...any other s... |
| 5572 | ham | The guy did some bitching but I acted like i'd... |


### Preprocessing

In [7]:
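STOP_WORDS = set(stopwords.words('english'))  # build once instead of once per word

def preprocess_text(text):
    """Lowercase, strip a leading 'subject:' prefix, and drop non-alphabetic tokens and stopwords."""
    text = text.lower()
    regex = r"^subject:\s(.*)"  # raw string, so \s is not treated as a string escape
    match = re.search(regex, text)
    if match:
        text = match.group(1)
    text = re.sub(r"[^a-z .]", "", text)
    words = text.split()
    words = [word for word in words if word.isalpha() and word not in STOP_WORDS]
    return ' '.join(words)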
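
To see what the function keeps, the sketch below applies the same steps to a single message. It swaps NLTK's stopword list for a tiny hard-coded set so that it runs without any download; `DEMO_STOPWORDS` and `preprocess_demo` are hypothetical names, and real output depends on NLTK's full English list.

```python
import re

# Assumption: a small stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"in", "a", "to", "the", "is"}

def preprocess_demo(text, stop_words=DEMO_STOPWORDS):
    """Same steps as preprocess_text, with an injectable stopword set."""
    text = text.lower()
    match = re.search(r"^subject:\s(.*)", text)
    if match:
        text = match.group(1)
    # Keep only lowercase letters, spaces, and periods
    text = re.sub(r"[^a-z .]", "", text)
    # Keep purely alphabetic tokens that are not stopwords
    return " ".join(w for w in text.split() if w.isalpha() and w not in stop_words)

cleaned = preprocess_demo("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)  # free entry wkly comp win fa cup
```

Note that a token that still carries a period, such as `lar...`, fails `isalpha()` and is dropped entirely rather than trimmed.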

In [8]:
print("Begin text preprocessing:", end="\n\n")
raw_data["processed_text"] = ""
for i in range(raw_data.shape[0]):
    if i % 500 == 0 and i != 0:
        a = round(i/raw_data.shape[0]*100)
        print("+"*(a//10*4) + "-"*(40-(a//10*4)) + " : "+ str(a) + "% completed")
    raw_data.loc[i, "processed_text"] = preprocess_text(str(raw_data.loc[i, "Message"]))
    if i == raw_data.shape[0]-1:
        print("+"*40 + " : " + "100% completed", end="\n\n")
print("Preprocessing complete")

Begin text preprocessing:

---------------------------------------- : 9% completed
++++------------------------------------ : 18% completed
++++++++-------------------------------- : 27% completed
++++++++++++---------------------------- : 36% completed
++++++++++++++++------------------------ : 45% completed
++++++++++++++++++++-------------------- : 54% completed
++++++++++++++++++++++++---------------- : 63% completed
++++++++++++++++++++++++++++------------ : 72% completed
++++++++++++++++++++++++++++++++-------- : 81% completed
++++++++++++++++++++++++++++++++++++---- : 90% completed
++++++++++++++++++++++++++++++++++++---- : 99% completed
++++++++++++++++++++++++++++++++++++++++ : 100% completed

Preprocessing complete


In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Initializing DVC

In [10]:
!dvc init -f --no-scm

Initialized DVC repository.

### Setting up remote storage

In [29]:
!dvc remote add --default storage gdrive://1dkzdzTtGUES5kMWg9lk5N6ECtmy2a5u3

Setting 'storage' as a default remote.
ERROR: configuration error - config file error: remote 'storage' already exists. Use `-f|--force` to overwrite it.

In [30]:
!mkdir -p /content/drive/MyDrive/sk_aml2_dvc_remote

In [31]:
!dvc remote add -d myremote /content/drive/MyDrive/sk_aml2_dvc_remote

Setting 'myremote' as a default remote.

### Git

In [12]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


In [23]:
!git config --global user.name "soumyajoykundu"
!git config --global user.email "skundu072003@gmail.com"

In [24]:
!git status

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../../.config/
	../../.dvc/
	../../.dvcignore
	../../SMSSpamCollection
	../../SMSSpamCollection.csv
	../../SMSSpamCollection.csv.dvc
	../../drive/
	../../sample_data/

nothing added to commit but untracked files present (use "git add" to track)


In [25]:
!mkdir -p Assignments/Assignment\ 02\ :\ Version\ Control
!cp *.txt Assignments/Assignment\ 02\ :\ Version\ Control/
!cp *.csv Assignments/Assignment\ 02\ :\ Version\ Control/
!cp *.ipynb Assignments/Assignment\ 02\ :\ Version\ Control/
%cd Assignments/Assignment\ 02\ :\ Version\ Control
!git add .
!git commit -m "Added files"

cp: cannot stat '*.txt': No such file or directory
cp: cannot stat '*.ipynb': No such file or directory
/content/Assignments/Assignment 02 : Version Control/Assignments/Assignment 02 : Version Control
[main 13d7c8a] Added files
 1 file changed, 5575 insertions(+)
 create mode 100644 Assignments/Assignment 02 : Version Control/Assignments/Assignment 02 : Version Control/SMSSpamCollection.csv


### Adding the raw data to DVC first

In [27]:
!dvc add SMSSpamCollection.csv

Adding...: 100% 1/1 [00:00<00:00, 31.98file/s]

In [34]:
!dvc push

Collecting          |2.00 [00:00,  141entry/s]
Pushing
100% 1/1 [00:00<00:00,  2.84files/s]
Everything is up to date.

### Splitting the data (Version 1)

In [35]:
# Breaking the dataset into 70%, 15%, 15% for train, validation and test respectively

train, val_test = train_test_split(raw_data[["processed_text", "Label"]], test_size = 0.30, random_state=42)
val, test = train_test_split(val_test, test_size = 0.50, random_state=42)
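
The two-stage split gives roughly 70/15/15 because the second call halves the 30% holdout. The arithmetic can be sketched as below, assuming scikit-learn rounds the test portion up and gives the remainder to the other portion when `test_size` is a float. `split_sizes` is a hypothetical helper, and 5574 is the row count implied by the Version 1 counts printed in the distribution section.

```python
import math

def split_sizes(n_rows, holdout=0.30, test_share=0.50):
    """Return (train, validation, test) sizes for a two-stage split.

    Assumption: the held-out part of each split is rounded up,
    and the remainder goes to the other part.
    """
    n_holdout = math.ceil(holdout * n_rows)      # validation + test together
    n_train = n_rows - n_holdout
    n_test = math.ceil(test_share * n_holdout)   # second split halves the holdout
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

sizes = split_sizes(5574)
print(sizes)  # (3901, 836, 837)
```

These sizes agree with the per-split totals reported for Version 1 (3374 + 527, 732 + 104, and 721 + 116).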

In [36]:
# Save splits in .csv format
train.to_csv('train.csv', index=False)
val.to_csv('validation.csv', index=False)
test.to_csv('test.csv', index=False)

#### Adding these split files to dvc tracking

In [37]:
!dvc add train.csv validation.csv test.csv

In [60]:
!git add train.csv.dvc test.csv.dvc .gitignore validation.csv.dvc

In [40]:
!dvc commit

In [42]:
!dvc push

Collecting          |5.00 [00:00,  165entry/s]
Pushing
100% 4/4 [00:00<00:00,  5.17files/s]
Everything is up to date.

### Splitting the data (Version 2)

In [43]:
train_data, test_data = train_test_split(raw_data, test_size=0.3, random_state=100)
val_data, test_data = train_test_split(test_data, test_size=0.5, random_state=100)

In [44]:
train_data.to_csv('train.csv', index=False)
val_data.to_csv('validation.csv', index=False)
test_data.to_csv('test.csv', index=False)
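
Overwriting `train.csv`, `validation.csv`, and `test.csv` changes their contents, which is what lets DVC tell the two versions apart: each `.dvc` pointer file records a content hash of the data it tracks. The sketch below illustrates the idea with a plain MD5 digest. The helper name and sample byte strings are hypothetical, and DVC's exact hashing details may differ.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return a hex digest that changes whenever the bytes change."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-ins for two versions of the same split file
split_v1 = b"processed_text,Label\nok joking,ham\n"
split_v2 = b"processed_text,Label\nfree entry,spam\n"

same_again = content_hash(split_v1) == content_hash(split_v1)  # identical bytes
changed = content_hash(split_v1) != content_hash(split_v2)     # different bytes
print(same_again, changed)  # True True
```

Because the pointer files are small text files, committing them to Git after each `dvc add` is what ties a data version to a Git revision.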

#### Adding the updated split files to DVC tracking

In [45]:
!dvc add train.csv validation.csv test.csv

In [46]:
!dvc commit

In [47]:
!dvc push

Collecting          |8.00 [00:00,  135entry/s]
Pushing
 86% 6/7 [00:00<00:00, 10.15files/s]
Everything is up to date.

In [48]:
!dvc checkout train.csv.dvc validation.csv.dvc test.csv.dvc

Building workspace index          |9.00 [00:00,  375entry/s]
Comparing indexes          |10.0 [00:00,  718entry/s]
Applying changes          |0.00 [00:00,     ?file/s]

### Distribution of Version 1 split

In [54]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
val = pd.read_csv("validation.csv")

In [57]:
test_y = test["Label"]
train_y = train["Label"]
val_y = val["Label"]

* Labelling the target attributes
    * 0 : `ham`
    * 1 : `spam`

In [60]:
print("Training data:", end = "\n\n")
print("Number of 0 =", np.sum(train_y == 'ham'))
print("Number of 1 =", np.sum(train_y == 'spam'), end = "\n\n\n\n")
print("Validation data:", end = "\n\n")
print("Number of 0 =", np.sum(val_y == 'ham'))
print("Number of 1 =", np.sum(val_y == 'spam'), end = "\n\n\n\n")
print("Testing data:", end = "\n\n")
print("Number of 0 =", np.sum(test_y == 'ham'))
print("Number of 1 =", np.sum(test_y == 'spam'))

Training data:

Number of 0 = 3374
Number of 1 = 527



Validation data:

Number of 0 = 732
Number of 1 = 104



Testing data:

Number of 0 = 721
Number of 1 = 116
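
The same counts can be obtained in one pass by tallying the `Label` column. A minimal sketch using only the standard library follows; `label_counts` is a hypothetical helper and the inline CSV text stands in for a real split file.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file):
    """Count how often each value of the Label column occurs."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Label"] for row in reader)

# Hypothetical three-row file in the same layout as train.csv
demo = io.StringIO(
    "processed_text,Label\n"
    "ok joking,ham\n"
    "free entry wkly comp,spam\n"
    "nah think goes usf,ham\n"
)
counts = label_counts(demo)
print(counts["ham"], counts["spam"])  # 2 1
```

With pandas already loaded, `train["Label"].value_counts()` gives the same tally for a DataFrame.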


### Distribution of Version 2 split

In [None]:
!git checkout d56e477102e984884aa91d89f1245a0079e7e796

Note: switching to 'd56e477102e984884aa91d89f1245a0079e7e796'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at d56e477 Version 2 split
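
Restoring a data version takes two steps: Git restores the `.dvc` pointer files for a revision, and `dvc checkout` then syncs the data files to match those pointers. A minimal sketch of that sequence is shown below. `checkout_commands` and `restore_version` are hypothetical helpers, and the revision `HEAD~1` is a placeholder rather than a commit from this notebook.

```python
import subprocess

POINTERS = ["train.csv.dvc", "validation.csv.dvc", "test.csv.dvc"]

def checkout_commands(revision, pointers=POINTERS):
    """Build the two commands that restore the data for a Git revision."""
    return [
        # Step 1: restore only the pointer files from the given revision
        ["git", "checkout", revision, "--", *pointers],
        # Step 2: make the workspace data match the restored pointers
        ["dvc", "checkout", *pointers],
    ]

def restore_version(revision):
    """Run both steps; requires a repository with Git and DVC initialised."""
    for cmd in checkout_commands(revision):
        subprocess.run(cmd, check=True)

commands = checkout_commands("HEAD~1")  # placeholder revision
for cmd in commands:
    print(" ".join(cmd))
```

Unlike the full `git checkout <commit>` used in this notebook, the path-limited form leaves HEAD on the current branch, so there is no detached HEAD state to undo afterwards.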


In [52]:
!dvc checkout

Building workspace index          |14.0 [00:00,  506entry/s]
Comparing indexes          |15.0 [00:00, 1.70kentry/s]
Applying changes          |0.00 [00:00,     ?file/s]

In [60]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
val = pd.read_csv("validation.csv")

In [60]:
test_y = test["Label"]
train_y = train["Label"]
val_y = val["Label"]

* Labelling the target attributes
    * 0 : `ham`
    * 1 : `spam`

In [None]:
print("Training data:", end = "\n\n")
print("Number of 0 =", np.sum(train_y == 'ham'))
print("Number of 1 =", np.sum(train_y == 'spam'), end = "\n\n\n\n")
print("Validation data:", end = "\n\n")
print("Number of 0 =", np.sum(val_y == 'ham'))
print("Number of 1 =", np.sum(val_y == 'spam'), end = "\n\n\n\n")
print("Testing data:", end = "\n\n")
print("Number of 0 =", np.sum(test_y == 'ham'))
print("Number of 1 =", np.sum(test_y == 'spam'))

Training data:

Number of 0 = 3045
Number of 1 = 964



Validation data:

Number of 0 = 661
Number of 1 = 198



Testing data:

Number of 0 = 654
Number of 1 = 206
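
Raw counts make it hard to compare class balance across versions, so it helps to convert them to spam shares. The sketch below does that arithmetic on the counts printed in the two distribution sections; `spam_share` is a hypothetical helper.

```python
def spam_share(n_ham, n_spam):
    """Fraction of rows labelled spam (class 1)."""
    return n_spam / (n_ham + n_spam)

# (ham, spam) counts as printed above for each version and split
reported = {
    "v1": {"train": (3374, 527), "validation": (732, 104), "test": (721, 116)},
    "v2": {"train": (3045, 964), "validation": (661, 198), "test": (654, 206)},
}

shares = {
    version: {name: round(spam_share(*pair), 3) for name, pair in splits.items()}
    for version, splits in reported.items()
}
for version, by_split in shares.items():
    print(version, by_split)
```

On the reported counts, the spam share is roughly 0.12 to 0.14 in Version 1 and roughly 0.23 to 0.24 in Version 2. The row totals also differ (5574 versus 5728), which suggests the two versions were built from different inputs and not only from a different random seed.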


#### Push all data versions to Google Drive

In [None]:
!dvc push

3 files pushed
