# Training a sentiment analysis model using artifacts from Wandb

The steps below set up, run, and log a training session for a sentiment analysis model, using the Wandb (Weights & Biases) platform for experiment tracking and artifact management.

1. **Wandb Initialization**:
   - The Wandb (Weights & Biases) run is initialized with the project name **``sentiment_analysis``** and the job type **``train``**. No entity is passed, so the run is created under the default entity of the logged-in account. This step associates the code execution with the specific project on Wandb for tracking experiments, logging metrics, and storing artifacts.

2. **Downloading Artifacts**:
   - Three artifacts are used: training data (**``train_data:v0``**), test data (**``test_data:v0``**, used here for validation), and vocabulary (**``vocab:v0``**). Each artifact is referenced by name and version with **``wandb.use_artifact``**, which also records it as an input of the run, and is then downloaded with **``artifact.download()``**. This method retrieves the artifact from Wandb, saves its files locally, and returns the path of the local directory.

3. **Loading Training Data**:
   - The **``load_data``** function reads a downloaded CSV file (**``train_data.csv``** or **``test_data.csv``**) using Pandas. Each file contains two columns: **``text``** for the reviews and **``label``** for the sentiments (0 for negative and 1 for positive). The function returns the text as a Pandas Series and the labels as a NumPy array.

4. **Loading Vocabulary**:
   - The **``load_vocab``** function opens the **``vocabulary.txt``** file from the downloaded vocabulary artifact directory. It reads all the lines, splits them into individual words, and then converts the list of words into a set. This set is used to filter the tokens in the training data so that only words present in the vocabulary are included.

5. **Cleaning and Tokenizing Documents**:
   - The **``clean_doc``** function takes a document string, tokenizes it by whitespace, removes punctuation, filters out non-alphabetic tokens, stop words, and short tokens. This results in a list of clean tokens.
   - The **``filter_by_vocab``** function is applied to the list of training documents to retain only the tokens that are present in the vocabulary set.

6. **Tokenization**:
   - A Keras **``Tokenizer``** is created using the **``create_tokenizer``** function which fits on the filtered training documents. The tokenizer converts the text documents into sequences of integers, where each integer represents a unique word in the vocabulary.

7. **Data Encoding**:
   - The training documents are converted into a matrix representation with the **``texts_to_matrix``** method of the tokenizer object, using the **``freq``** mode to represent token frequency in the documents.

8. **Model Definition**:
   - The **``define_model``** function constructs a Sequential neural network model with an input layer sized according to the number of words (features) and two Dense layers. The first Dense layer has 50 units with ReLU activation, and the second one is the output layer with a single unit and sigmoid activation, suitable for binary classification.

9. **Model Training**:
   - The model is trained on the encoded training data for 10 epochs using the **``fit``** method, with the encoded test data passed as validation data. During this process, the model learns to associate the input features with the sentiment labels.

10. **Logging with Wandb**:
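    - Per-epoch loss and accuracy (training and validation), together with the epoch index and learning rate, are logged to Wandb by the **``WandbMetricsLogger``** callback passed to **``fit``**, rather than by explicit **``wandb.log``** calls. The **``WandbModelCheckpoint``** callback additionally saves the best model to **``model.keras``**. This allows for tracking the model's performance metrics in the Wandb dashboard.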

11. **Cleanup**:
    - Finally, the Wandb run is closed with **``wandb.finish()``** to signal that this run is complete, which helps in organizing and comparing runs within the Wandb interface.
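To make step 7 concrete, here is a minimal pure-Python sketch of what a frequency-mode document matrix looks like. It is not the Keras implementation: the helper names `build_word_index` and `freq_matrix` are hypothetical, the two sample documents are made up, and column 0 is left unused on the assumption that, as in Keras, word indices start at 1.

```python
from collections import Counter


def build_word_index(docs):
    """Map each word to an integer index starting at 1, most frequent first."""
    counts = Counter(word for doc in docs for word in doc.split())
    # most_common() orders words by descending count.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def freq_matrix(docs, word_index):
    """Return one row per document; each cell is count(word) / len(document)."""
    n_cols = len(word_index) + 1  # column 0 is reserved and stays 0.0
    rows = []
    for doc in docs:
        tokens = [w for w in doc.split() if w in word_index]
        row = [0.0] * n_cols
        for word, count in Counter(tokens).items():
            row[word_index[word]] = count / len(tokens)
        rows.append(row)
    return rows


# Made-up documents for illustration.
docs = ["good good film", "bad film"]
word_index = build_word_index(docs)
matrix = freq_matrix(docs, word_index)
```

Each non-empty document's row sums to 1, and the number of columns is what the notebook later passes to the model as `n_words`.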

## Install, load libraries, and set up wandb

In [1]:
!pip install wandb



In [2]:
# Login to Weights & Biases
!wandb login --relogin

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [13]:
import string
import re
import pandas as pd
from numpy import array
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import wandb
import os
import nltk
from wandb.integration.keras import WandbMetricsLogger

In [8]:
# Ensure that NLTK Stopwords are downloaded
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Initialization, Wandb Run Setup and Artifact Download

In [19]:
# Initialize the W&B run
wandb.init(project="sentiment_analysis", job_type="train")

# Use W&B artifact for training data
train_data_artifact = wandb.use_artifact('sentiment_analysis/train_data:v0', type='TrainData')
train_data_dir = train_data_artifact.download()

# Use W&B artifact for validation data
test_data_artifact = wandb.use_artifact('sentiment_analysis/test_data:v0', type='TestData')
test_data_dir = test_data_artifact.download()

# Use W&B artifact for vocabulary
vocab_artifact = wandb.use_artifact('sentiment_analysis/vocab:v0', type='Vocab')
vocab_dir = vocab_artifact.download()

[34m[1mwandb[0m:   1 of 1 files downloaded.  
[34m[1mwandb[0m:   1 of 1 files downloaded.  
[34m[1mwandb[0m:   1 of 1 files downloaded.  


## Loading Vocabulary, Cleaning and Tokenizing Documents, Tokenization, Model Definition

In [20]:
# Function to load the training data
def load_data(data_dir):
    df = pd.read_csv(data_dir)
    return df['text'], array(df['label'])

# Function to load the vocabulary
def load_vocab(vocab_dir):
    with open(vocab_dir, 'r') as file:
        vocab = file.read().split()
    return set(vocab)

# Function to clean the documents
def clean_doc(doc):
    tokens = doc.split()
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# Function to filter documents by vocabulary
def filter_by_vocab(docs, vocab):
    new_docs = []
    for doc in docs:
        tokens = clean_doc(doc)
        tokens = [w for w in tokens if w in vocab]
        new_docs.append(' '.join(tokens))
    return new_docs

# Function to create the tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# Function to define the model
def define_model(n_words):
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

# Load the vocabulary
full_vocab_dir = os.path.join(vocab_dir, 'vocabulary.txt')
vocab = load_vocab(full_vocab_dir)

# Load all reviews
# Train
full_train_data_dir = os.path.join(train_data_dir, 'train_data.csv')
train_docs, y_train = load_data(full_train_data_dir)
train_docs = filter_by_vocab(train_docs, vocab)

## Create the tokenizer
tokenizer_train = create_tokenizer(train_docs)

## Encode data
x_train = tokenizer_train.texts_to_matrix(train_docs, mode='freq')

# Validation
full_test_data_dir = os.path.join(test_data_dir, 'test_data.csv')
test_docs, y_test = load_data(full_test_data_dir)
test_docs = filter_by_vocab(test_docs, vocab)

## Encode data
x_test = tokenizer_train.texts_to_matrix(test_docs, mode='freq')

# Define the model
n_words = x_train.shape[1]
model = define_model(n_words)

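To see what the cleaning and vocabulary filtering do to a single review, the following self-contained sketch mirrors `clean_doc` on one sentence. To avoid depending on the NLTK download, it uses a tiny hand-picked stop-word set (an assumption for illustration; the notebook uses NLTK's English list), and both the sentence and the vocabulary are made up.

```python
import re
import string

# Assumption: a tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"this", "was", "a", "i", "it", "the", "and"}


def clean_doc_sketch(doc, stop_words=STOP_WORDS):
    """Split on whitespace, strip punctuation, then drop non-alphabetic,
    stop-word, and single-character tokens."""
    re_punc = re.compile("[%s]" % re.escape(string.punctuation))
    tokens = [re_punc.sub("", w) for w in doc.split()]
    tokens = [w for w in tokens if w.isalpha()]
    tokens = [w for w in tokens if w not in stop_words]
    return [w for w in tokens if len(w) > 1]


vocab = {"movie", "great", "loved"}  # hypothetical vocabulary
tokens = clean_doc_sketch("this movie was great, i loved it 2 times!")
kept = [w for w in tokens if w in vocab]  # same test as filter_by_vocab
```

Here `tokens` is `['movie', 'great', 'loved', 'times']` and `kept` drops `times` because it is not in the vocabulary. Like the original function, the sketch does not lowercase tokens, so a capitalized `This` would pass the stop-word filter unless the reviews are already lowercased.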


## Training

In [37]:
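# Fit network
# Use the callbacks from wandb.integration.keras, the module already imported above
from wandb.integration.keras import WandbMetricsLogger, WandbModelCheckpoint

model.fit(x_train,
          y_train,
          epochs=10,
          verbose=0,
          validation_data=(x_test, y_test),
          callbacks=[WandbMetricsLogger(),
                     WandbModelCheckpoint(filepath='model.keras', save_best_only=True)])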



<keras.src.callbacks.history.History at 0x7e745c341300>
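`model.fit` returns a `History` object whose `history` attribute is a dict of per-epoch metric lists; the cell above discards it. Assuming the checkpoint callback monitors validation loss (this appears to be its default, but it is worth checking against the installed wandb version), the epoch retained by `save_best_only=True` could be located with a sketch like this. `best_epoch` is a hypothetical helper, and the numbers are invented purely for illustration.

```python
def best_epoch(history, monitor="val_loss"):
    """Return (epoch_index, value) for the lowest value of the monitored metric."""
    values = history[monitor]
    # min() returns the first index on ties, i.e. the earliest best epoch.
    index = min(range(len(values)), key=values.__getitem__)
    return index, values[index]


# Invented values for illustration only; not taken from the run above.
example_history = {"val_loss": [0.69, 0.52, 0.41, 0.43, 0.40]}
epoch, value = best_epoch(example_history)
```

With a real run, the dict would come from `history = model.fit(...)` followed by `history.history`.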

In [38]:
# Finish the W&B run
wandb.finish()

Run history:
epoch/accuracy,▁▄▃▆▆▇▇▇▇▇▇█████████
epoch/epoch,▁▂▃▃▄▅▆▆▇█▁▂▃▃▄▅▆▆▇█
epoch/learning_rate,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
epoch/loss,███▇▇▆▆▅▄▄▃▃▃▂▂▂▂▁▁▁
epoch/val_accuracy,▂▃▁▄▆▄▄▅▅▇▇▇▇█▇▇█▇██
epoch/val_loss,██▇▇▆▆▅▅▄▄▃▃▃▂▂▂▁▁▁▁

Run summary:
epoch/accuracy,0.995
epoch/epoch,9.0
epoch/learning_rate,0.001
epoch/loss,0.12297
epoch/val_accuracy,0.86
epoch/val_loss,0.34533
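The accuracy values in the summary come from thresholding the model's sigmoid output. As a sketch of that calculation, assuming a 0.5 cut-off (the default for Keras binary accuracy) and using made-up probabilities rather than this model's predictions:

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == y_true))


# Made-up probabilities for illustration; not output from the model above.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
accuracy = binary_accuracy(labels, probs)
```

With the trained model, the probabilities would presumably come from something like `model.predict(x_test).ravel()`, compared against `y_test`.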
