Analysis implemented and available at https://github.com/ivanovitchm/mlops/tree/main/lessons/week_09

<a href="https://colab.research.google.com/github/terrematte/mlops_wandb/blob/main/lesson1/03_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning and Deduplication for Sentiment Analysis Dataset

This data cleaning exercise identifies and removes duplicate text files from a sentiment analysis dataset hosted on **Weights & Biases (wandb)** under the project **`my_user/sentiment_analysis`**, stored as the artifact **`txt_sentoken:v0`**. The dataset contains text files labeled as positive or negative and organized into two subfolders, **`pos`** and **`neg`**.

The following steps deduplicate the dataset and upload the cleaned data back to wandb:

1. **Wandb Initialization:**
   - A wandb run was initialized using **`wandb.init()`** in the **`sentiment_analysis`** project with **`job_type="preprocessing"`**.

2. **Artifact Retrieval:**
   - The **`txt_sentoken:v0`** artifact was retrieved from wandb using `wandb.use_artifact()` and its content was downloaded to the local directory using **`artifact.download()`**.

3. **Hash Calculation:**
   - A function named **`calculate_hash`** was defined to compute the SHA-256 hash of a file's contents, reading the file in 4 KB chunks. Files with identical bytes produce the same hash, so the hash serves as a fingerprint of the file content.

4. **Duplicate Identification and Removal:**
   - A function named **`identify_and_remove_duplicates`** was defined to walk through every file in a folder, calculate its SHA-256 hash, and compare it with the hashes already seen. The first file with a given hash is kept; any later file with the same hash is removed using **`os.remove()`**. The function is called once for **`pos`** and once for **`neg`**, so a file duplicated across the two subfolders is not detected.

5. **Cleaned Data Artifact Creation:**
   - A new wandb artifact named **`clean_data`**, of type **`CleanData`**, was created to hold the cleaned dataset. Its description states that it is the dataset with duplicates removed.

6. **Adding Cleaned Data to the Artifact:**
   - The cleaned data directories (**`pos`** and **`neg`** subfolders) were added to the **`clean_data`** artifact using **`clean_data_artifact.add_dir()`**.

7. **Logging the Cleaned Data Artifact to Wandb:**
   - The **`clean_data`** artifact was logged to wandb using **`wandb.log_artifact()`**, making the cleaned dataset available for further analysis or modeling.

8. **Wandb Run Termination (Optional):**
   - Optionally, the wandb run was concluded using **`wandb.finish()`** to indicate the end of the data cleaning run.

Together, these steps download the original dataset, remove files with duplicate content, and upload the result as a new wandb artifact, so that later analysis or model training starts from a deduplicated dataset.

## Install and load libraries

In [8]:
!pip install wandb



In [9]:
# Login to Weights & Biases
!wandb login --relogin

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [10]:
import hashlib
import os

import wandb

## Wandb Initialization and Artifact Retrieval

In [11]:
# Initialize wandb run
wandb.init(project='sentiment_analysis', job_type="preprocessing")

# Get the artifact
artifact = wandb.use_artifact('txt_sentoken:v0')

# Download the content of the artifact to the local directory
artifact_dir = artifact.download()

wandb:   2000 of 2000 files downloaded.


## Hash Calculation and Duplicate Identification and Removal

In [12]:
def calculate_hash(file_path):
    """Calculate SHA-256 hash of a file."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read and update hash in chunks of 4K
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()

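def identify_and_remove_duplicates(folder_path):
    """Remove files whose content duplicates an earlier file in folder_path."""
    file_hashes = {}  # content hash -> path of the first file seen with it
    for root, dirs, files in os.walk(folder_path):
        dirs.sort()  # visit subfolders in a fixed order
        for file in sorted(files):  # fixed order, so the kept copy is predictable
            file_path = os.path.join(root, file)
            file_hash = calculate_hash(file_path)
            if file_hash in file_hashes:
                # Same content already seen: log the removal and delete this copy
                log_message = f'Removing duplicate: {file_path}'
                wandb.log({'log': log_message})
                os.remove(file_path)
            else:
                file_hashes[file_hash] = file_path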

# Paths to 'pos' and 'neg' subfolders
pos_folder_path = os.path.join(artifact_dir, 'pos')
neg_folder_path = os.path.join(artifact_dir, 'neg')

# Identify and remove duplicates in 'pos' and 'neg' subfolders
identify_and_remove_duplicates(pos_folder_path)
identify_and_remove_duplicates(neg_folder_path)
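
Because the cell above deletes files in place and handles `pos` and `neg` separately, it can help to preview the duplicates first. The following is a minimal sketch, not part of the original notebook: the helper names `sha256_of_file` and `find_duplicate_groups` and the sample file names are hypothetical, and the demonstration runs on a temporary directory rather than on the downloaded artifact. It groups files by SHA-256 hash across a whole directory tree and deletes nothing.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def sha256_of_file(file_path, chunk_size=4096):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicate_groups(root_dir):
    """Group files under root_dir by content hash.

    Returns only the groups that contain two or more files.
    Nothing is deleted; paths inside each group are sorted.
    """
    paths_by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            paths_by_hash[sha256_of_file(path)].append(path)
    return [sorted(paths) for paths in paths_by_hash.values() if len(paths) > 1]


# Demonstration on a throwaway directory (hypothetical file names and texts).
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("pos", "neg"):
        os.makedirs(os.path.join(tmp, sub))
    samples = {
        "pos/a.txt": "great film",
        "pos/b.txt": "great film",  # same content as pos/a.txt
        "neg/c.txt": "great film",  # same content, but in the other class folder
        "neg/d.txt": "dull film",
    }
    for rel_path, text in samples.items():
        with open(os.path.join(tmp, rel_path), "w", encoding="utf-8") as f:
            f.write(text)

    demo_groups = find_duplicate_groups(tmp)
    # Store paths relative to the temporary directory for readability.
    demo_names = [[os.path.relpath(p, tmp) for p in g] for g in demo_groups]
```

In the notebook, calling `find_duplicate_groups(artifact_dir)` before the removal step would list every group of identical files, including any that appear in both `pos` and `neg`, which the per-folder removal would leave untouched.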

## Cleaned Data Artifact Creation and Adding Cleaned Data to the Artifact

In [13]:
# Create a new artifact for the clean data
clean_data_artifact = wandb.Artifact(
    name='clean_data',
    type='CleanData',
    description='Cleaned dataset with duplicates removed'
)

# Add the cleaned data directories to the clean_data artifact
clean_data_artifact.add_dir(pos_folder_path, name='pos')
clean_data_artifact.add_dir(neg_folder_path, name='neg')

wandb: Adding directory to artifact (/content/artifacts/txt_sentoken:v0/pos)... Done. 0.7s
wandb: Adding directory to artifact (/content/artifacts/txt_sentoken:v0/neg)... Done. 0.7s
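
Before logging the artifact, a quick count of the remaining files per class shows how many duplicates were removed relative to the 2000 files originally downloaded. This is a minimal sketch under assumptions: `count_files` and `summarize_classes` are hypothetical helpers that do not appear in the original notebook, and the demonstration uses a temporary directory in place of `artifact_dir`.

```python
import os
import tempfile


def count_files(folder_path):
    """Count the files under folder_path, including nested subfolders."""
    return sum(len(files) for _root, _dirs, files in os.walk(folder_path))


def summarize_classes(dataset_dir, class_names=("pos", "neg")):
    """Return a {class_name: file_count} mapping for each class subfolder."""
    return {
        name: count_files(os.path.join(dataset_dir, name))
        for name in class_names
    }


# Demonstration on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for rel_path in ("pos/a.txt", "pos/b.txt", "neg/c.txt"):
        full_path = os.path.join(tmp, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "w", encoding="utf-8") as f:
            f.write("sample review text")

    demo_summary = summarize_classes(tmp)
```

In the notebook, `summarize_classes(artifact_dir)` would report the per-class counts of the cleaned data. One option, if the counts are computed before the artifact is created, is to pass them through the `metadata` argument of `wandb.Artifact` so they are stored alongside the dataset.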


## Logging the Cleaned Data Artifact to Wandb and Terminate

In [14]:
# Log the clean_data artifact to wandb
wandb.log_artifact(clean_data_artifact)

# Optionally, finish the wandb run (if this is the end of your script)
wandb.finish()