<a href="https://colab.research.google.com/github/rajkstats/uplimit-mlops/blob/main/MLOPS_Week_1_FINAL_RK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1 Project: Sentiment Analysis Project for Lamada E-commerce Platform
Welcome to the Week 1 Project! In this project you will take on the role of an MLOps Engineer working on a new greenfield project at Lamada, a leading e-commerce giant.

![Lamada](https://drive.google.com/uc?id=17r13P5Wy9DtjmLEaiaXf3DUVViUXmFso)

## Background
Lamada, an e-commerce platform, currently relies on manual review analysis by customer support specialists and product analysts. This process is time-consuming, error-prone, and struggles to scale with increasing review volumes.

## Project Goal
Implement automated sentiment analysis to improve the efficiency and accuracy of review processing at Lamada.

## Benefits of Automated Sentiment Analysis
- Real-time processing for quicker customer feedback responses
- Systematic identification of common themes and issues
- Data-driven insights for targeted marketing strategies
- Efficient prioritization of customer support tasks
- Trend analysis for product satisfaction and inventory management

In this project, you will develop a machine learning model to automate the sentiment analysis process, addressing Lamada's current challenges and unlocking these benefits.

## MLOps Focus
This greenfield project presents an opportunity to address issues that have historically challenged machine learning teams at Lamada. By implementing MLOps best practices, we aim to enhance the entire ML lifecycle, resulting in:

- Improved model quality and reliability
- Faster development and deployment cycles
- Better collaboration between data scientists and operations teams
- Increased reproducibility of results
- Enhanced monitoring and maintenance of models in production

In the following section we will dive into the data currently available and perform some quick EDA to understand it and all its quirks.

In [None]:
# Installing all the necessary packages

!pip install \
ucimlrepo \
ydata-profiling \
pandas \
snorkel \
ipytest \
pytest \
great_expectations==0.18.19 \
scikit-learn \
wandb \
skl2onnx \
onnxruntime \
checklist \
openai==0.28




# Exploratory Data Analysis (EDA)

## Importance of EDA in Sentiment Analysis

Exploratory Data Analysis is a crucial step in any data science project, particularly in sentiment analysis. For our Lamada e-commerce platform project using the Drugs.com dataset, EDA will help us:

1. Understand the data distribution and quality
2. Identify patterns and relationships between variables
3. Detect anomalies or outliers that might affect our model
4. Inform feature engineering and selection
5. Guide our choice of machine learning algorithms

## Key Areas to Explore

Given the structure of our Drugs.com dataset, we should focus on:

1. Text Analysis:
   - Review length distribution
   - Common words and phrases
   - Correlation between review text and ratings

2. Rating Distribution:
   - Overall rating distribution
   - Rating patterns across different drugs and conditions

3. Temporal Trends:
   - Changes in sentiment over time
   - Seasonal patterns in reviews or ratings

4. Drug and Condition Analysis:
   - Most reviewed drugs and conditions
   - Relationship between conditions and ratings

5. Useful Count Analysis:
   - Distribution of useful votes
   - Correlation between usefulness and sentiment
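
As a rough starting point for the first two areas, the sketch below computes review-length statistics and a rating distribution. The sample rows are made up and only stand in for the real `df`; with pandas, similar numbers should come from `df["review"].str.split().str.len().describe()` and `df["rating"].value_counts()`.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample rows standing in for the Drugs.com DataFrame.
sample_reviews = [
    {"review": "It has no side effect", "rating": 9},
    {"review": "Helped with my pain after two weeks", "rating": 8},
    {"review": "Made me dizzy and nauseous", "rating": 2},
    {"review": "Works great", "rating": 10},
    {"review": "No change at all", "rating": 5},
]

# Review length measured in whitespace-separated tokens.
lengths = [len(row["review"].split()) for row in sample_reviews]
length_summary = {
    "min": min(lengths),
    "max": max(lengths),
    "mean": mean(lengths),
    "median": median(lengths),
}

# How often each rating occurs; a skewed count hints at class imbalance.
rating_counts = Counter(row["rating"] for row in sample_reviews)

print(length_summary)
print(sorted(rating_counts.items()))
```

Token counts are a crude proxy for review length; character counts or a proper tokenizer may be more appropriate once the text preprocessing needs are clearer.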

## Potential Challenges

During EDA, we need to be aware of:

1. Class Imbalance: The rating distribution might be skewed, affecting our model's performance.

2. Data Quality: Look for missing values, inconsistencies in drug names or conditions, and potential data entry errors.

3. Text Preprocessing Needs: Identify requirements for text cleaning, such as handling abbreviations, misspellings, or medical jargon.

4. Bias Detection: Check for potential biases in the data, such as overrepresentation of certain drugs or conditions.

5. Feature Relevance: Assess which features contribute most to sentiment and which might introduce noise.


In [None]:
# Fetching the dataset that we will be using throughout this course.
# Read more about it here: https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

from ucimlrepo import fetch_ucirepo
import pandas as pd


drug_reviews_drugs_com = fetch_ucirepo(id=462)
df = pd.concat([drug_reviews_drugs_com.data.features, drug_reviews_drugs_com.data.targets])

In [None]:
df.head()

| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9 | 20-May-12 | 27 |
| 1 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8 | 27-Apr-10 | 192 |
| 2 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5 | 14-Dec-09 | 17 |
| 3 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8 | 3-Nov-15 | 10 |
| 4 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9 | 27-Nov-16 | 37 |


## Quick & Easy EDA using ydata-profiling

While manual EDA is crucial for deep understanding, we can accelerate our initial data exploration using automated tools. For this project, we'll be utilizing [ydata-profiling](https://github.com/ydataai/ydata-profiling), a powerful library that generates comprehensive exploratory data analysis reports.

### Benefits of ydata-profiling

1. **Speed**: Quickly generates an in-depth EDA report, saving time in the initial exploration phase.
2. **Comprehensiveness**: Provides a wide range of statistics and visualizations for each variable in the dataset.
3. **Interactivity**: Creates an interactive HTML report that allows for easy navigation and exploration.
4. **Correlation Analysis**: Automatically detects and visualizes relationships between variables.
5. **Missing Data Overview**: Clearly highlights missing data patterns across the dataset.

While ydata-profiling will provide us with a solid starting point, remember that *it's a complement to, not a replacement for, thoughtful manual EDA*. We'll use its insights as a springboard for more targeted analysis and feature engineering.

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Drug Reviews Dataset Profiling Report")

In [None]:
profile.to_notebook_iframe()


### Analyzing the ydata-profiling Report

Now that you've generated the ydata-profiling report for our Drugs.com dataset, it's time to dive in and extract meaningful insights.

## Add in your observations
1. List at least five interesting findings from the report.
2. Identify any potential data quality issues that need addressing.
3. Based on this initial analysis, propose three hypotheses about sentiment in drug reviews that we could test in our deeper analysis.

Remember, this automated report is a starting point. Use these insights to guide your manual EDA and feature engineering in the next steps of our project.

# ANSWER

## Interesting Observations from ydata profile report

- Most reviews are positive, with ratings of 9 or 10. This class imbalance may make it hard for the model to learn to classify negative reviews, so we may need to revisit class rebalancing later.

- There are missing values in columns such as drugName, condition, review, rating, date, and usefulCount, which span both numeric and text columns.

- Reviews spike in certain months. This could indicate seasonal health concerns, such as allergies or flu, and is worth exploring further.

- There appears to be a correlation between the number of useful votes a review receives and its sentiment. Positive reviews with high ratings generally received more useful votes, which suggests that helpful content tends to be perceived positively.

- Words like "side effect", "relief", and "pain" recur across reviews, which shows what users care about most: the drug's effectiveness and its side effects. This is helpful for guiding our sentiment analysis.


## Potential data quality issues that need addressing
- With so many positive reviews, class imbalance can make it hard to build an effective model. We may need resampling techniques to address this.

- There are a few duplicate reviews that could skew our results, so it is best to remove them before modeling.

- There are missing values in certain text and numeric fields. These must be imputed or otherwise handled in a way that does not add bias to the model.

- We also need to check for inconsistencies in drug naming conventions, which can be problematic as well.

## Three hypotheses about sentiment in drug reviews

- Drug review sentiment changes with the season due to factors such as allergies or flu.

- Drug reviews that mention side effects tend to have lower ratings.

- Drug reviews that mention relief or effectiveness tend to have higher sentiment scores.
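
A hypothesis like the one about side effects could be checked with a simple group comparison. The sketch below uses a handful of made-up `(review, rating)` pairs and a crude keyword match, both of which are assumptions for illustration; a real check would run on the full dataset and use a proper statistical test.

```python
from statistics import mean

# Hypothetical (review, rating) pairs; the real analysis would use df.
rows = [
    ("Terrible side effects, stopped after a week", 2),
    ("Some side effects but manageable", 6),
    ("Great relief from pain", 9),
    ("Effective and easy to take", 10),
    ("Side effects were awful", 1),
]


def mentions_side_effects(text: str) -> bool:
    """Crude keyword match; misses paraphrases such as 'made me dizzy'."""
    return "side effect" in text.lower()


# Split ratings by whether the review mentions side effects.
with_mention = [rating for text, rating in rows if mentions_side_effects(text)]
without_mention = [rating for text, rating in rows if not mentions_side_effects(text)]

print("mean rating, mentions side effects:", mean(with_mention))
print("mean rating, no mention:", mean(without_mention))
```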

# Designing the ML System

Before diving into model development, it's crucial to plan and scope out our machine learning project. This ensures that we're building a system that meets stakeholders' needs and aligns with business objectives. We'll use the Machine Learning Canvas to structure our planning process.

## Project Expectations from Stakeholders

Imagine the following stakeholder requests for our sentiment analysis system at Lamada:

- Chief Customer Officer (CCO): "We need a system that can automatically categorize the sentiment of drug reviews in real-time. This will help us respond quickly to negative feedback and highlight positive experiences."
- Head of Product: "The system should be able to process reviews as they come in, with a latency of no more than 200ms per review. We want to use this data to inform our product recommendations and marketing strategies."
- CTO: "We need to ensure the system can handle our current load of about 1000 reviews per hour, with the ability to scale up to 5000 reviews per hour during peak times."
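
It can help to turn these requests into numbers before designing anything. The back-of-the-envelope sketch below converts the peak load into an arrival rate and compares it with what a single worker could sustain at the 200ms latency ceiling. It assumes evenly spread arrivals and one review processed at a time per worker, which real traffic will not respect.

```python
import math

PEAK_REVIEWS_PER_HOUR = 5000   # CTO's stated peak load
MAX_LATENCY_SECONDS = 0.200    # Head of Product's latency ceiling

# Average arrival rate at peak, in reviews per second.
peak_rate = PEAK_REVIEWS_PER_HOUR / 3600

# A worker that takes the full 200 ms per review handles 5 reviews per second.
per_worker_rate = 1 / MAX_LATENCY_SECONDS

# Workers needed to keep up with the average peak rate (ignores bursts).
workers_needed = math.ceil(peak_rate / per_worker_rate)

print(f"peak arrival rate: {peak_rate:.2f} reviews/s")
print(f"one worker sustains: {per_worker_rate:.0f} reviews/s")
print(f"workers needed at peak: {workers_needed}")
```

Under these assumptions the throughput target is modest compared with the latency target, which suggests per-review latency is the constraint to watch when choosing a model.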

## Machine Learning Canvas Overview

Let's break down each component of the Machine Learning Canvas and what you should consider:

1. **Background**:
   - Consider Lamada's current manual review process and its limitations.
   - Think about the volume of reviews and the impact of slow response times on customer satisfaction.

2. **Value Proposition**:
   - How will automated sentiment analysis improve Lamada's operations and customer experience?
   - What specific pain points will this solution address?

3. **Objectives**:
   - Break down the high-level goal of "sentiment analysis" into specific, measurable objectives.
   - Consider accuracy targets, processing speed, and scalability requirements.

4. **Solution**:
   - Outline the key features of your sentiment analysis system.
   - Consider how it will integrate with Lamada's existing e-commerce platform.
   - Define what's in scope (e.g., English language reviews) and out of scope (e.g., image-based reviews).

5. **Feasibility**:
   - Assess if you have the necessary data, computational resources, and expertise to build this system.
   - Consider any potential technical or ethical challenges.

6. **Data**:
   - Describe the Drugs.com dataset and how it will be used for training.
   - Consider how you'll handle real-time incoming review data in production.
   - Think about data privacy and security considerations.

7. **Metrics**:
   - Define key performance indicators (KPIs) for your model, such as accuracy, F1 score, and latency.
   - Consider business metrics like customer satisfaction scores or response time improvements.

8. **Evaluation**:
   - Design your offline evaluation strategy using the Drugs.com dataset.
   - Plan how you'll conduct online evaluation once the system is deployed.

9. **Modeling**:
   - Outline your approach to building the sentiment analysis model.
   - Consider starting with a baseline model and iterating to more complex approaches.

10. **Inference**:
    - Based on the stakeholder requirements, you'll need to design for real-time inference.
    - Consider how to optimize your model for low-latency predictions.

11. **Feedback**:
    - Plan how you'll collect feedback on the model's performance in production.
    - Consider implementing a human-in-the-loop system for reviewing uncertain predictions.

12. **Project**:
    - Outline the team members needed (e.g., data scientists, ML engineers, DevOps).
    - Create a timeline for development, testing, and deployment phases.
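
To make the metrics in item 7 concrete, here is a minimal from-scratch sketch of accuracy and per-class F1 on made-up labels. The label names and predictions are hypothetical; in practice a library routine such as scikit-learn's `f1_score` would normally be used instead.

```python
def per_class_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical labels: a model that leans towards the majority class.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "pos", "neg"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
neg_precision, neg_recall, neg_f1 = per_class_f1(y_true, y_pred, "neg")

print(f"accuracy: {accuracy:.3f}")
print(f"negative class: precision={neg_precision:.3f} recall={neg_recall:.3f} f1={neg_f1:.3f}")
```

Accuracy looks healthy here while recall on the minority class is only one half, which is why F1 is worth tracking alongside accuracy on imbalanced data.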


![ML Canvas](https://drive.google.com/uc?id=1j1dXJ3PLdpbvbAMIeyUNAL-vJjiV8P-w)

## [OPTIONAL] Fill up the ML Canvas
Go through the notebook and finish the minimum requirements before filling out the Machine Learning Canvas for our Lamada sentiment analysis project. Be sure to consider the stakeholder requirements and the insights gained from our EDA phase. This canvas will serve as a roadmap for the rest of our project, ensuring that we're building a system that meets both technical and business needs. Make sure to include how we can create ground truth labels for this use case (for example, we could use the `rating` column to map scores of `1` to `Negative` and `10` to `Positive`).

**Disclaimer**:
For the purposes of this course project, you are not required to fill out all sections of the Machine Learning Canvas in full detail. Specifically, the "Project" section, including team member requirements and timelines, is optional. However, we encourage you to think about these aspects as they are crucial in real-world ML projects. Considering the full spectrum of project planning will give you a more comprehensive understanding of what goes into deploying a machine learning system in a production environment.

Focus primarily on the technical aspects such as the data, modeling, metrics, and evaluation strategies. These elements will directly inform the implementation phase of our project. The business and operational considerations (like team composition and timelines) are included to give you a holistic view of ML project planning, which will be valuable in your future career.


## Machine Learning Canvas

### Background
Lamada's current review process is manual, requiring customer support specialists to analyze drug reviews. This approach is time-consuming, error-prone, and cannot keep up with the growing volume of reviews. Slow response times to customer feedback negatively impact customer satisfaction.

### Value Proposition
- An automated sentiment analysis system will improve Lamada's operational efficiency
- It will improve customer experience by reducing response times and making review processing more scalable

### Objectives

- Achieve an accuracy score of at least 80% for sentiment classification
- Reduce average review processing time
- Deliver a scalable solution that can handle increasing volumes of reviews without a decline in performance

### Solution
- The sentiment analysis system will classify customer reviews as positive, neutral, or negative based on their content
- Key features include real-time processing for incoming reviews, integration with the existing platform, and an interface for customer support to review flagged content

### Data
- The Drugs.com dataset contains reviews and ratings. Ratings will be used to generate sentiment labels: 1-3 (negative) and 10 (positive)

- Incoming real-time review data from Lamada's platform will be processed and passed to the trained model

- Review text will be anonymized to protect customer identities


### Metrics
- Model performance metrics include accuracy, F1 score (to balance precision and recall), and latency, targeting under 200 ms per review in line with the Head of Product's requirement

- Improvements in customer satisfaction scores and reductions in response times will be the business KPIs for success

### Evaluation
- Offline evaluation: the model will be tested on a holdout set from the dataset, focusing on accuracy and F1 score
- Online evaluation: once deployed, A/B testing will be conducted to compare customer satisfaction before and after introducing the automated system

### Modeling

- Start with a baseline majority-class predictor
- Iterate towards more advanced models, such as logistic regression (LR) and LR on top of BERT embeddings
- Hyperparameter tuning and cross-validation will be used to improve performance


### Inference

- The system will be designed for real-time inference to classify reviews as they come in
- For low latency, batch processing or GPU acceleration techniques will be explored

### Feedback

- A human-in-the-loop mechanism will be implemented, allowing customer support agents to review uncertain model predictions and improve model accuracy
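
One way such a human-in-the-loop step might be wired up is a confidence threshold on the model's class probabilities. The sketch below is a minimal illustration: the threshold value, the label names, and the `route_prediction` helper are all hypothetical rather than part of any existing Lamada system.

```python
CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off; tune on validation data


def route_prediction(probabilities: dict) -> tuple:
    """Return (label, destination) for one review's class probabilities."""
    # Pick the most probable class and read off its probability.
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    # Low-confidence predictions go to a support agent instead of being auto-applied.
    destination = "auto" if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    return label, destination


print(route_prediction({"negative": 0.05, "neutral": 0.10, "positive": 0.85}))
print(route_prediction({"negative": 0.40, "neutral": 0.35, "positive": 0.25}))
```

Labels corrected by agents can then be stored and fed back into later training runs.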

# Data Preparation

While our Drugs.com dataset is conveniently packaged for this project, it's important to understand that in real-world scenarios like Lamada, data is rarely so neatly organized. Let's explore how data is typically handled in production environments.

## Data in the Wild: The Reality of Enterprise Data

In most enterprises, including e-commerce platforms like Lamada, data is:

1. Distributed: Stored across multiple databases, data lakes, and other storage systems.
2. Heterogeneous: Comes in various formats (structured, semi-structured, and unstructured).
3. Dynamic: Constantly updated and growing.
4. Raw: Often requires significant processing before it's usable for analysis or modeling.

## The Role of Data Engineering

Data Engineers play a crucial role in making raw data usable for Data Scientists. Their responsibilities include:

1. Data Integration: Combining data from various sources.
2. Data Transformation: Converting data into a suitable format for analysis.
3. Data Quality Assurance: Ensuring data accuracy, completeness, and consistency.
4. Data Pipeline Management: Creating and maintaining data flows.

## ETL vs ELT Processes


![ELT Pipeline](https://drive.google.com/uc?id=11AGHxqmvfNgYGCL7kZmInyQhxml0X-6r)

Two common approaches to data preparation are:

1. Extract, Transform, Load (ETL):
   - Data is extracted from source systems.
   - Transformed to fit operational needs.
   - Loaded into the target system (e.g., Data Warehouse).

2. Extract, Load, Transform (ELT):
   - Data is extracted from source systems.
   - Loaded into the target system in its original format.
   - Transformed within the target system as needed.

ELT is becoming increasingly popular due to the decreasing cost of storage and the increasing power of modern data warehouses to handle transformations.
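
The difference between the two is mostly about where the transformation runs. The sketch below contrasts them using an in-memory SQLite database as a stand-in for a data warehouse; the table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [("  Valsartan ", "9"), ("guanfacine", "8"), ("Lybrel", None)]

conn = sqlite3.connect(":memory:")

# ETL: clean in application code first, then load only the cleaned rows.
conn.execute("CREATE TABLE reviews_etl (drug TEXT, rating INTEGER)")
cleaned = [
    (drug.strip(), int(rating))
    for drug, rating in raw_rows
    if rating is not None
]
conn.executemany("INSERT INTO reviews_etl VALUES (?, ?)", cleaned)

# ELT: load the raw rows untouched, then transform inside the database.
conn.execute("CREATE TABLE reviews_raw (drug TEXT, rating TEXT)")
conn.executemany("INSERT INTO reviews_raw VALUES (?, ?)", raw_rows)
conn.execute(
    """
    CREATE TABLE reviews_elt AS
    SELECT TRIM(drug) AS drug, CAST(rating AS INTEGER) AS rating
    FROM reviews_raw
    WHERE rating IS NOT NULL
    """
)

etl_rows = conn.execute("SELECT drug, rating FROM reviews_etl ORDER BY rating").fetchall()
elt_rows = conn.execute("SELECT drug, rating FROM reviews_elt ORDER BY rating").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM reviews_raw").fetchone()[0]

print(etl_rows)
print(elt_rows)
print(raw_count)
```

Both paths end with the same cleaned table, but only the ELT path keeps the raw rows in the target system, so a transformation can be revised and re-run later without going back to the source.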

## Data Warehouses and Their Role


![Data Systems](https://drive.google.com/uc?id=1Rd1WLEm30gWlJzaJiQXEzlFQ6mJNWfMl)


Data Warehouses like Snowflake, AWS Redshift, or Google BigQuery serve as centralized repositories for integrated data from various sources. They offer:

1. Scalability: Can handle large volumes of data.
2. Performance: Optimized for complex queries and analytics.
3. Integration: Can combine data from multiple sources.
4. Historical Data: Maintain historical records for trend analysis.

## From Raw Data to ML-Ready Datasets

For our Lamada sentiment analysis project, the process might look like this:

1. Data Collection:
   - Customer reviews are collected from web forms, mobile apps, and customer service interactions.
   - Product information is stored in product databases.
   - User data is kept in customer relationship management (CRM) systems.

2. Data Integration:
   - Data Engineers create pipelines to extract this data from various sources.
   - The data is loaded into a staging area in the Data Warehouse.

3. Data Transformation:
   - Engineers apply transformations to clean the data (e.g., handling missing values, standardizing formats).
   - They join different tables to create a unified view of reviews with associated metadata.

4. Feature Engineering:
   - Data Scientists work with the integrated data to create relevant features.
   - This might include text preprocessing, sentiment score calculation, or deriving new features from existing data.

5. Data Labelling:
   - If manual labelling is required (e.g., for a subset of reviews to train or validate the model), a labelling workflow is set up.
   - This could involve a team of human annotators or a crowdsourcing platform.

6. Dataset Creation:
   - The final ML-ready dataset is created, combining the engineered features and labels.
   - This dataset is versioned and stored, often in a format optimized for ML workflows (e.g., Parquet files in a data lake).

## Data Quality Checks

### Importance of Data Quality Checks

Data quality is crucial for any machine learning project. Poor data quality can lead to unreliable models, incorrect insights, and wasted time and resources. Implementing data quality checks on raw data is essential because:

1. **Garbage In, Garbage Out**: The quality of your model's output is directly dependent on the quality of input data.

2. **Early Error Detection**: Catching data issues early in the pipeline saves time and prevents downstream problems.

3. **Consistency**: Ensures that data meets predefined standards and is consistent across different batches or sources.

4. **Trust**: Builds confidence in the data and subsequent analysis among stakeholders.

5. **Compliance**: Helps meet regulatory requirements in industries where data quality is mandated.

6. **Efficiency**: Automates the process of data validation, reducing manual checks and human error.

7. **Documentation**: Creates a clear record of data expectations and quality over time.


### Implementing Data Quality Checks with Great Expectations

Great Expectations is a powerful tool for data validation and documentation. It allows you to express what you "expect" from your data and then validates those expectations.

Implement the following data quality checks using Great Expectations for our Drugs.com review dataset



In [None]:
!great_expectations init

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.

Great Expectations will create a new directory with the following structure:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- data_docs
        |-- validations

OK to proceed? [Y/n]: Y


Congratulations! You are now ready to customize your Great Expectations configuration.

You can customize your configuration in many ways. Here are some examples:

  Use the CLI to:

In [None]:
import great_expectations as gx
from great_expectations.dataset import PandasDataset

# Initialize Great Expectations context
context = gx.get_context()

# Delete the existing datasource if it already exists
if "my_pandas_datasource" in context.datasources:
    context.delete_datasource("my_pandas_datasource")


# Add Pandas dataframe as a datasource
datasource = context.sources.add_pandas(name="my_pandas_datasource")
data_asset = datasource.add_dataframe_asset(name="drug_reviews", dataframe=df)

# Create an Expectation Suite
expectation_suite_name = "drug_reviews_suite"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)

# Create a validator
validator = context.get_validator(
    datasource_name="my_pandas_datasource",
    data_asset_name="drug_reviews",
    expectation_suite_name=expectation_suite_name
)

In [None]:
# Refer to the GX Expectations Gallery to solve this section:
# https://greatexpectations.io/expectations

# 1: Check for the presence and order of all expected columns
# Hint: This expectation ensures that the table has exactly these columns in this order

validator.expect_table_columns_to_match_ordered_list(
    column_list=["drugName", "condition", "review", "rating", "date", "usefulCount"]
)


#2: Ensure 'rating' is between 1 and 10
# Hint: This expectation checks if all values in the 'rating' column are within the specified range

validator.expect_column_values_to_be_between(
    column="rating",
    min_value=1,
    max_value=10
)

# 3: Check that 'review' column doesn't contain null values
# Hint: This expectation verifies that there are no null values in the 'review' column
validator.expect_column_values_to_not_be_null(column="review")


# 4: Verify that 'usefulCount' is non-negative
# Hint: This expectation ensures all values in 'usefulCount' are greater than or equal to zero
validator.expect_column_values_to_be_between(
    column="usefulCount",
    min_value=0
)

# Save Expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint_name = "my_checkpoint"
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validator=validator,
)

checkpoint_result = checkpoint.run()

print(checkpoint_result)



{
  "run_id": {
    "run_name": null,
    "run_time": "2024-10-14T10:54:29.105042+00:00"
  },
  "run_results": {
    "ValidationResultIdentifier::drug_reviews_suite/__none__/20241014T105429.105042Z/my_pandas_datasource-drug_reviews": {
      "validation_result": {
        "success": true,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_table_columns_to_match_ordered_list",
              "kwargs": {
                "column_list": [
                  "drugName",
                  "condition",
                  "review",
                  "rating",
                  "date",
                  "usefulCount"
                ],
                "batch_id": "my_pandas_datasource-drug_reviews"
              },
              "meta": {}
            },
            "result": {
              "observed_value": [
                "drugName",
                "condition",
                "review",
                "

### [OPTIONAL] Create more checks for the data

In this section you can try performing more checks based on the ones we have listed down below or feel free to add in checks that you feel would be suitable for our use case!
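
One of the checks in the cell below validates the `date` column against the strftime pattern `%d-%b-%y`. The sketch below uses Python's own `datetime.strptime` to show which strings that pattern accepts. It illustrates the format only, not how Great Expectations implements the check, and `matches_format` is a hypothetical helper.

```python
from datetime import datetime

DATE_FORMAT = "%d-%b-%y"  # day, abbreviated month name, two-digit year


def matches_format(value: str) -> bool:
    """True when value parses under DATE_FORMAT (month names are locale-dependent)."""
    try:
        datetime.strptime(value, DATE_FORMAT)
        return True
    except ValueError:
        return False


# Dates taken from the sample rows shown earlier, plus one in a different format.
parsed = datetime.strptime("20-May-12", DATE_FORMAT).date()
print(parsed)
print(matches_format("3-Nov-15"))
print(matches_format("2012-05-20"))
```

Days without a leading zero, such as `3-Nov-15`, still parse, while an ISO-style date is rejected.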

In [None]:
import great_expectations as gx
from great_expectations.dataset import PandasDataset

# Initialize Great Expectations context
context = gx.get_context()

# Delete the existing datasource if it already exists
if "my_pandas_datasource" in context.datasources:
    context.delete_datasource("my_pandas_datasource")

# Add Pandas dataframe as a datasource
datasource = context.sources.add_pandas(name="my_pandas_datasource")
data_asset = datasource.add_dataframe_asset(name="drug_reviews", dataframe=df)

# Create an Expectation Suite
expectation_suite_name = "drug_reviews_suite"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)

# Create a Validator for the dataset
validator = context.get_validator(
    datasource_name="my_pandas_datasource",
    data_asset_name="drug_reviews",
    expectation_suite_name=expectation_suite_name
)


# 1: Check for the presence and order of all expected columns
# Hint: Use expect_table_columns_to_match_ordered_list() method
# Expected columns: "drugName", "condition", "review", "rating", "date", "usefulCount"

validator.expect_table_columns_to_match_ordered_list(
    column_list=["drugName", "condition", "review", "rating", "date", "usefulCount"]
)

#2: Ensure 'rating' is between 1 and 10
# Hint: Use expect_column_values_to_be_between() method
validator.expect_column_values_to_be_between(
    column="rating",
    min_value=1,
    max_value=10
)

#3: Check that 'review' column doesn't contain null values
# Hint: Use expect_column_values_to_not_be_null() method

validator.expect_column_values_to_not_be_null(column="review")



# 4: Verify that 'usefulCount' is non-negative
# Hint: Use expect_column_values_to_be_between() method with only a min_value
validator.expect_column_values_to_be_between(
    column="usefulCount",
    min_value=0
)


#5: Ensure 'date' follows the expected format (DD-Mon-YY)
# Hint: Use expect_column_values_to_match_strftime_format() method
# The strftime format for DD-Mon-YY is "%d-%b-%y"
validator.expect_column_values_to_match_strftime_format(
    column="date",
    strftime_format="%d-%b-%y"
)

# Ensure 'rating' has no missing values
validator.expect_column_values_to_not_be_null(column="rating")

# Check 'condition' column to make sure it has no null values
validator.expect_column_values_to_not_be_null(column="condition")

# Save the Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)


checkpoint_name = "my_checkpoint"
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validator=validator,
)

checkpoint_result = checkpoint.run()

print(checkpoint_result)



{
  "run_id": {
    "run_name": null,
    "run_time": "2024-10-14T10:54:35.441359+00:00"
  },
  "run_results": {
    "ValidationResultIdentifier::drug_reviews_suite/__none__/20241014T105435.441359Z/my_pandas_datasource-drug_reviews": {
      "validation_result": {
        "success": false,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_table_columns_to_match_ordered_list",
              "kwargs": {
                "column_list": [
                  "drugName",
                  "condition",
                  "review",
                  "rating",
                  "date",
                  "usefulCount"
                ],
                "batch_id": "my_pandas_datasource-drug_reviews"
              },
              "meta": {}
            },
            "result": {
              "observed_value": [
                "drugName",
                "condition",
                "review",
                

## Data Labeling for Sentiment Analysis

Looking at our sample dataset, we can see that we have reviews with associated ratings. For our sentiment analysis task, we need to convert these ratings into sentiment labels. Let's explore different labeling techniques:

### 1. Natural Labels

In our case, we're fortunate to have natural labels in the form of ratings, which can be used to infer sentiment without additional manual labeling.

> Tasks with natural labels are tasks where the model's predictions can be automatically evaluated or partially evaluated by the system.

### 2. Threshold-Based Labeling (Programmatic Labeling)

![Programmatic Labeling](https://drive.google.com/uc?id=1Esp2XapZggGpyUyST5X_clzlrIQnTRfo)

A simple approach to start with:
- `Ratings 1-4`: Negative sentiment
- `Ratings 5-6`: Neutral sentiment
- `Ratings 7-10`: Positive sentiment

This method is quick and easy but may oversimplify the nuances in the reviews. You can use a tool like [Snorkel](https://github.com/snorkel-team/snorkel) for this task.
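
As a quick illustration of the thresholds above, the same mapping can be sketched as a vectorized pandas operation. This is only an alternative sketch, not the labeling function used later in this notebook; the `ratings` values are made up to cover each boundary.

```python
import pandas as pd

# Hypothetical ratings that hit every boundary of the thresholds above
ratings = pd.Series([1, 4, 5, 6, 7, 10], name="rating")

# Right-closed bins: (0, 4] -> 0 (negative), (4, 6] -> 1 (neutral), (6, 10] -> 2 (positive)
sentiment = pd.cut(ratings, bins=[0, 4, 6, 10], labels=[0, 1, 2]).astype(int)

print(pd.DataFrame({"rating": ratings, "sentiment_label": sentiment}))
```

Ratings outside 1-10 would fall outside every bin and become missing values, so the final `astype(int)` assumes the ratings have already been validated.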

### 3. Hand Labeling

![Hand Labeling](https://drive.google.com/uc?id=14tb6DFFpe3ApCYlI8n0vnBv5MM7Z6Xtf)

For more nuanced labeling, we could manually review a subset of the data:
- Read each review
- Assign a sentiment label (Negative, Neutral, Positive)
- Consider the rating as a guide, but allow for discrepancies

While this method can be more accurate, it is **time-consuming**, prone to annotator bias, and may not be feasible for large datasets.

### 4. LLM-Assisted Labeling
![Data Labelling using LLMs](https://drive.google.com/uc?id=1jRIMZFUUDr3_03I6btVsOlv6y9thlcZA)

Large Language Models (LLMs) have revolutionized the labeling process:
- Use an LLM to analyze the review text and suggest a sentiment label
- Optionally, have a human review the LLM's suggestions for quality control

This method can significantly speed up the labeling process while maintaining high quality. More information about it can be found [here](https://www.refuel.ai/blog-posts/llm-labeling-technical-report).

### 5. Active Learning

![Active Learning](https://drive.google.com/uc?id=1xEUrPHaKHSA4G9PYGA35013QbdHRwvfl)



If resources are limited:
1. Label a small initial dataset
2. Train a model on this dataset
3. Use the model to predict labels for unlabeled data
4. Select the most uncertain predictions for human review
5. Add these newly labeled examples to the training set
6. Repeat steps 2-5

This iterative process can efficiently improve your model with minimal labeling effort.
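
The loop above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the features are synthetic (`make_classification` stands in for featurized reviews), the "human" is simulated by revealing the known label, and uncertainty is measured with the least-confident strategy. The helper name `least_confident_indices` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def least_confident_indices(model, X_pool, k):
    """Return positions of the k pool samples whose top class probability is lowest."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:k]

# Synthetic stand-in for featurized reviews with three sentiment classes (assumption)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

labeled = np.arange(30)       # step 1: small initial labeled set
pool = np.arange(30, 500)     # everything else is treated as unlabeled

for round_idx in range(3):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])  # step 2
    picked = least_confident_indices(model, X[pool], k=20)                 # steps 3-4
    labeled = np.concatenate([labeled, pool[picked]])                      # step 5 (y simulates the annotator)
    pool = np.delete(pool, picked)
    print(f"round {round_idx}: {len(labeled)} labeled, {len(pool)} in pool")
```

Other selection criteria (margin or entropy based) plug into the same loop by swapping the helper.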

If you would like to dive a bit deeper into Active Learning then check out this [video](https://youtu.be/7kX6rhUGtzA?si=sLhi6gRZFaPeMA4X) about the topic.

For our project, let's start with the threshold-based approach for quick results. As we progress, we can explore LLM-assisted labeling to refine our dataset and potentially improve model performance.

Remember, the quality of your labels directly impacts your model's performance. Always validate a sample of your labels, regardless of the method used.
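
One way to pull a sample for validation is to draw the same number of rows from each label, so that rare classes are not skipped. The sketch below uses a small made-up frame; the column names mirror this notebook, but the rows and the `audit_sample` name are illustrative assumptions.

```python
import pandas as pd

# Hypothetical labeled frame; in the notebook this would be the output of the labeling step
labeled = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "rating": [1, 2, 3, 4, 5, 6, 5, 6, 7, 8, 9, 10],
})
labeled["sentiment_label"] = pd.cut(
    labeled["rating"], bins=[0, 4, 6, 10], labels=[0, 1, 2]
).astype(int)

# Draw two rows per label for manual review
audit_sample = labeled.groupby("sentiment_label").sample(n=2, random_state=42)
print(audit_sample[["review", "rating", "sentiment_label"]])
```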

In [None]:
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
import pytest
import ipytest

ipytest.autoconfig()

# Create the labeling function based on the threshold we created and
# leveraging the LABEL_MAPPING below:
# Label 0 => Ratings 1-4: Negative sentiment
# Label 1 => Ratings 5-6: Neutral sentiment
# Label 2 => Ratings 7-10: Positive sentiment

LABEL_MAPPING = {
    "NEGATIVE": 0,
    "NEUTRAL": 1,
    "POSITIVE": 2,
}

@labeling_function()
def label_sentiment(x):
    # Assign labels based on the rating thresholds
    # Accept either a row with a 'rating' field or a bare rating value
    rating = x.rating if hasattr(x, "rating") else x
    if 1 <= rating <= 4:
        return LABEL_MAPPING["NEGATIVE"]
    elif 5 <= rating <= 6:
        return LABEL_MAPPING["NEUTRAL"]
    elif 7 <= rating <= 10:
        return LABEL_MAPPING["POSITIVE"]
    else:
        return -1  # Return -1 if the value doesn't fall within any specified range

In [None]:
def label_data(df):
    lf_applier = PandasLFApplier([label_sentiment])
    labels = lf_applier.apply(df)
    df['sentiment_label'] = labels
    return df

In [None]:
%%ipytest -vv -s

@pytest.mark.parametrize("rating, expected_label", [
    (2, 0),
    (5, 1),
    (7, 2),
    (10, 2),
    (1, 0),
    (3, 0),
    (4, 0),
    (6, 1),
    (8, 2),
])
def test_sentiment_labeling(rating, expected_label):
    # Create a sample dataframe with a single row
    data = {
        'drugName': ['Drug A'],
        'condition': ['Condition A'],
        'review': ['Review A'],
        'rating': [rating],
        'date': ['2022-01-01'],
        'usefulCount': [10]
    }
    df = pd.DataFrame(data)
    labeled_df = label_data(df)
    assert labeled_df['sentiment_label'].iloc[0] == expected_label, f"Labeling doesn't match expected output for rating {rating}"
    assert all(labeled_df[col].iloc[0] == df[col].iloc[0] for col in df.columns), "Original data was modified"
    assert 'sentiment_label' in labeled_df.columns, "sentiment_label column not added"

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content
plugins: anyio-3.7.1, typeguard-4.3.0
collecting ... collected 9 items

t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[2-0] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[5-1] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[7-2] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[10-2] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[1-0] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[3-0] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[4-0] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[6-1] PASSED
t_c42cfb26bd7d4efead96730430754815.py::test_sentiment_labeling[8-2] PASSED


## [OPTIONAL] LLM-Assisted Labeling

In this section, you'll explore how to use a Large Language Model (LLM) to assist in labeling our dataset.
This can be especially useful for more nuanced sentiment analysis or when dealing with large datasets.

Steps to complete:
1. Choose an LLM API (e.g., OpenAI's GPT-3, Hugging Face's API, or any other accessible LLM)
2. Set up the necessary API credentials
3. Create a function to send reviews to the LLM and interpret its responses
4. Apply the LLM labeling to a subset of our data
5. Compare LLM-generated labels with our rule-based labels

**NOTE**: Be mindful of API usage costs and rate limits when using LLM services.
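
A common way to stay within rate limits is to retry failed calls with exponential backoff. The sketch below is a generic, hypothetical helper (`call_with_backoff` is not part of any SDK) and it retries on any exception for simplicity; real code would catch only the rate-limit error of the chosen client. The `flaky_request` function only simulates an API that fails twice.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays when it raises."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; a little jitter avoids synchronized retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.01))

# Simulated API call that is "rate limited" on its first two attempts
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "positive"

result = call_with_backoff(flaky_request, base_delay=0.01)
print(result, "after", attempts["count"], "attempts")
```

Labeling only a small subset first, as in the steps above, also keeps costs predictable before scaling up.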

You could also try out [autolabel](https://github.com/refuel-ai/autolabel) which does all of this right out of the box for you!

**Discussion Questions**:
1. What is the agreement rate between the rule-based and LLM-generated labels?
2. In cases of disagreement, which labeling method seems more accurate? Why?
3. What are the pros and cons of using an LLM for labeling compared to our rule-based approach?

In [None]:
from getpass import getpass
import openai
import pandas as pd
import os

In [None]:
openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

OpenAI API Key: ··········


In [None]:
# Define a function to use a chat model for sentiment analysis
def get_sentiment_from_llm(review_text):
    try:
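        # Call the OpenAI chat endpoint with the review text.
        # Note: openai.ChatCompletion is the pre-1.0 SDK interface;
        # openai>=1.0 replaced it with client.chat.completions.create.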
        response = openai.ChatCompletion.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert sentiment analyzer."},
                {"role": "user", "content": f"Classify the sentiment of the following drug review into one of these labels: 'negative', 'neutral', or 'positive'. Respond with only one word ('negative', 'neutral', or 'positive'):\n\n{review_text}"}
            ],
            max_tokens=10,
            temperature=0.0
        )

        # Extract sentiment from response
        sentiment = response.choices[0].message["content"].strip().lower()

        # Remove quotes if present
        sentiment = sentiment.replace("'", "")

        # Print response for debugging
        print(f"LLM Response: {sentiment}")

        # Convert sentiment to label
        if sentiment == "negative":
            return 0
        elif sentiment == "neutral":
            return 1
        elif sentiment == "positive":
            return 2
        else:
            return -1
    except Exception as e:
        print(f"Error while getting sentiment from LLM: {e}")
        return -1

In [None]:
# Apply LLM-assisted labeling to a subset of our dataset (first 10 rows)
# .copy() gives us an independent frame, which avoids pandas' SettingWithCopyWarning
subset_df = df.head(10).copy()
subset_df['llm_label'] = subset_df['review'].apply(get_sentiment_from_llm)

LLM Response: positive
LLM Response: positive
LLM Response: negative
LLM Response: positive
LLM Response: positive
LLM Response: negative
LLM Response: negative
LLM Response: positive
LLM Response: negative
LLM Response: neutral


In [None]:
# Compare LLM-generated labels with rule-based labels
subset_df['rule_based_label'] = subset_df['rating'].apply(label_sentiment)

# Analyze agreement
agreement_rate = (subset_df['llm_label'] == subset_df['rule_based_label']).mean()
print(f"Agreement Rate: {agreement_rate:.2f}")

# Samples for manual inspection
print(subset_df[['review', 'llm_label', 'rule_based_label']])


Agreement Rate: 0.80
                                              review  llm_label  rule_based_label
0  "It has no side effect, I take it in combinati...          2                 2
1  "My son is halfway through his fourth week of ...          2                 2
2  "I used to take another oral contraceptive, wh...          0                 1
3  "This is my first time using any form of birth...          2                 2
4  "Suboxone has completely turned my life around...          2                 2
5  "2nd day on 5mg started to work with rock hard...          0                 0
6  "He pulled out, but he cummed a bit in me. I t...          0                 0
7  "Abilify changed my life. There is hope. I was...          2                 2
8  " I Ve had  nothing but problems with the Kepp...          0                 0
9  "I had been on the pill for many years. When m...          1                 2

An agreement rate of 80% between the LLM-generated and rule-based labels is promising, but it is based on only 10 reviews. The two disagreements (rows 2 and 9) are both cases where one method says neutral and the other does not, so they are worth inspecting manually.
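
Raw agreement does not account for agreement that would happen by chance, so it can help to also look at Cohen's kappa and a cross-tabulation of the two label sets. The sketch below hard-codes the 10 label pairs printed above; using kappa here is an added suggestion, not something the original analysis computed.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Labels copied from the 10-row comparison printed above
llm_labels = [2, 2, 0, 2, 2, 0, 0, 2, 0, 1]
rule_labels = [2, 2, 1, 2, 2, 0, 0, 2, 0, 2]

# Fraction of rows where both methods agree
agreement = sum(a == b for a, b in zip(llm_labels, rule_labels)) / len(llm_labels)

# Chance-corrected agreement
kappa = cohen_kappa_score(llm_labels, rule_labels)

# Rows: rule-based label, columns: LLM label
confusion = pd.crosstab(
    pd.Series(rule_labels, name="rule_based"),
    pd.Series(llm_labels, name="llm"),
)

print(f"Agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
print(confusion)
```

With so few rows the numbers are only indicative; the same code applies unchanged to a larger labeled subset.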

## Data Quality Checks on Features


In [None]:
from sklearn.model_selection import train_test_split


labels = label_data(df)
sentiment_df = labels[['review', 'sentiment_label']]
sentiment_df = sentiment_df.rename(columns={'review': 'text', 'sentiment_label': 'label'})
train_df, test_df = train_test_split(sentiment_df, test_size=0.2, random_state=42)

print("Sample of sentiment_df:")
print(sentiment_df.head())

print(f"\nTrain set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

100%|██████████| 215063/215063 [00:02<00:00, 82408.72it/s]


Sample of sentiment_df:
                                                text  label
0  "It has no side effect, I take it in combinati...      2
1  "My son is halfway through his fourth week of ...      2
2  "I used to take another oral contraceptive, wh...      1
3  "This is my first time using any form of birth...      2
4  "Suboxone has completely turned my life around...      2

Train set shape: (172050, 2)
Test set shape: (43013, 2)


In [None]:
import great_expectations as gx
from great_expectations.dataset import PandasDataset
import pandas as pd

def perform_data_quality_checks(train_df, test_df):
    train_ge = gx.dataset.PandasDataset(train_df)
    test_ge = gx.dataset.PandasDataset(test_df)

    print("Performing data quality checks on processed data...")

    # Check that the labels are of values 0, 1, 2
    # Hint: Use the expect_column_values_to_be_in_set() method

    overall_success = True

    print("\n1. Checking label values:")

    train_label_check = train_ge.expect_column_values_to_be_in_set(
        column='label',
        value_set={0, 1, 2}
    )

    test_label_check = test_ge.expect_column_values_to_be_in_set(
        column='label',
        value_set={0, 1, 2}
    )
    print("Train label values check:", "Passed" if train_label_check.success else "Failed")
    print("Test label values check:", "Passed" if test_label_check.success else "Failed")

    if not (train_label_check.success and test_label_check.success):
        overall_success = False


    # Verify that we only have columns text and label
    # Hint: Use the expect_table_columns_to_match_ordered_list() method

    print("\n2. Checking columns:")

    expected_columns = ["text", "label"]
    train_columns_check = train_ge.expect_table_columns_to_match_ordered_list(column_list=expected_columns)
    test_columns_check = test_ge.expect_table_columns_to_match_ordered_list(column_list=expected_columns)

    print("Train columns check:", "Passed" if train_columns_check.success else "Failed")

    print("Test columns check:", "Passed" if test_columns_check.success else "Failed")

    if not (train_columns_check.success and test_columns_check.success):
        overall_success = False


    # Check for data leakage between train and test data on text
    # Hint: Compare the 'text' columns of train and test dataframes
    # Look for any duplicate texts between the two datasets

    print("\n3. Checking for data leakage:")
    train_texts = set(train_df['text'])
    test_texts = set(test_df['text'])
    leakage_count = len(train_texts.intersection(test_texts))
    if leakage_count == 0:
        print("Data leakage check: Passed")
    else:
        print(f"Data leakage check: Failed - Found {leakage_count} duplicate texts between train and test sets")
        overall_success = False


    # Check for duplicates in each dataset
    # Hint: Use the expect_column_values_to_be_unique() method on the 'text' column
    print("\n4. Checking for duplicate reviews within each dataset:")

    train_duplicates_check = train_ge.expect_column_values_to_be_unique(column='text')
    test_duplicates_check = test_ge.expect_column_values_to_be_unique(column='text')

    print("Train duplicates check:", "Passed" if train_duplicates_check.success else "Failed")

    print("Test duplicates check:", "Passed" if test_duplicates_check.success else "Failed")

    if not (train_duplicates_check.success and test_duplicates_check.success):
        overall_success = False

    #Update overall success check
    # Hint: Combine the results of all previous checks
    print("\nOverall data quality check result:", "Passed" if overall_success else "Failed")
    return overall_success

quality_check_passed = perform_data_quality_checks(train_df, test_df)

if quality_check_passed:
    print("All data quality checks passed. Proceeding with model training...")
else:
    print("Data quality checks failed. Please address the issues before proceeding.")

Performing data quality checks on processed data...

1. Checking label values:
Train label values check: Passed
Test label values check: Passed

2. Checking columns:
Train columns check: Passed
Test columns check: Passed

3. Checking for data leakage:
Data leakage check: Failed - Found 27545 duplicate texts between train and test sets

4. Checking for duplicate reviews within each dataset:
Train duplicates check: Failed
Test duplicates check: Failed

Overall data quality check result: Failed
Data quality checks failed. Please address the issues before proceeding.


## Addressing Data Issues

Our data leakage check has revealed a significant number of duplicate reviews between the train and test sets. This is a critical issue that needs to be resolved before we can proceed with model training. There are a couple of things that we can do of varying complexity:

### TODO: Drop Duplicates (Simple Fix)
A simple fix is to drop the duplicate rows from the original dataset and then redo the train/test split.

### [OPTIONAL] Investigate and Handle Duplicates
Perform a deeper EDA to understand why the duplicates occur in the first place, and from there decide how they should be handled.
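
One possible starting point for that investigation is to count how often each review text repeats and whether repeated texts carry conflicting labels. The sketch below runs on a tiny made-up frame; the `text` and `label` column names match this notebook, while the rows and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical reviews: one text repeated with the same label,
# one repeated with conflicting labels, and one unique text
reviews = pd.DataFrame({
    "text": ["works well", "works well", "made me dizzy", "made me dizzy", "no change"],
    "label": [2, 2, 0, 1, 1],
})

# For each distinct text: how many copies exist and how many different labels they carry
per_text = reviews.groupby("text")["label"].agg(copies="size", distinct_labels="nunique")

repeated = per_text[per_text["copies"] > 1]
conflicting = repeated[repeated["distinct_labels"] > 1]

print(f"{len(repeated)} repeated texts, {len(conflicting)} with conflicting labels")
print(conflicting)
```

Texts with conflicting labels deserve particular attention, because `drop_duplicates` keeps whichever copy comes first.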



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

print(f"Original dataset shape: {df.shape}")
labels = label_data(df)
sentiment_df = labels[['review', 'sentiment_label']]
sentiment_df = sentiment_df.rename(columns={'review': 'text', 'sentiment_label': 'label'})

# Drop duplicates here
df_deduplicated = sentiment_df.drop_duplicates(subset='text')

print(f"Shape after removing duplicates: {df_deduplicated.shape}")

train_df, test_df = train_test_split(df_deduplicated, test_size=0.2, random_state=42)

Original dataset shape: (215063, 7)


100%|██████████| 215063/215063 [00:02<00:00, 76152.29it/s]


Shape after removing duplicates: (128478, 2)


In [None]:
quality_check_passed = perform_data_quality_checks(train_df, test_df)

if quality_check_passed:
    print("All data quality checks passed. Proceeding with model training...")
else:
    print("Data quality checks failed. Please review the processing steps.")

Performing data quality checks on processed data...

1. Checking label values:
Train label values check: Passed
Test label values check: Passed

2. Checking columns:
Train columns check: Passed
Test columns check: Passed

3. Checking for data leakage:
Data leakage check: Passed

4. Checking for duplicate reviews within each dataset:
Train duplicates check: Passed
Test duplicates check: Passed

Overall data quality check result: Passed
All data quality checks passed. Proceeding with model training...


## Model Development

Now that we have prepared and validated our data, we can focus our efforts on modeling. Before diving into complex algorithms, it's crucial to establish a baseline model. This step, often overlooked, is fundamental in the machine learning development process.

### The Importance of Baseline Models

1. **Benchmark for Comparison**:
   A baseline model provides a point of reference against which we can compare more sophisticated models. It helps answer the question: "Is our complex model actually performing better than a simple approach?"

2. **Justification for Complexity**:
   If a simple model performs nearly as well as a complex one, it may not be worth the additional computational cost and potential overfitting risk of the complex model.

3. **Problem Understanding**:
   Implementing a baseline forces us to understand the fundamental aspects of our problem and data.

4. **Quick Insights**:
   Baseline models can provide quick insights into the problem, potentially revealing simple patterns or biases in the data.

5. **Sanity Check**:
   If a complex model performs worse than the baseline, it's a clear sign that something is wrong – either with the model, the data, or our approach.

### Baseline Models for Sentiment Analysis

For our drug review sentiment analysis task, we can consider the following baseline approaches:

**Majority Class Predictor**:
Always predict the most common sentiment in our training data. This is the simplest possible baseline.

```python
from sklearn.dummy import DummyClassifier

majority_baseline = DummyClassifier(strategy='most_frequent')
majority_baseline.fit(X_train, y_train)
majority_accuracy = majority_baseline.score(X_test, y_test)
print(f"Majority Class Baseline Accuracy: {majority_accuracy:.4f}")
```

In [None]:
!wandb login

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
import os
import wandb
import pandas as pd

def save_and_log_datasets(train_df, test_df, y_probas, run):
    """
    Save the split datasets and probabilities to CSV files and log them as artifacts in W&B.

    Args:
    train_df (pd.DataFrame): Training dataframe
    test_df (pd.DataFrame): Test dataframe
    y_probas (np.array): Predicted probabilities for test set, (n_samples, n_classes)
    run (wandb.Run): The current W&B run
    """

    os.makedirs('datasets', exist_ok=True)

    train_df.to_csv('datasets/train.csv', index=False)
    test_df.to_csv('datasets/test.csv', index=False)

    probas_df = pd.DataFrame(y_probas, columns=LABELS)
    probas_df.to_csv('datasets/test_probas.csv', index=False)

    artifact = wandb.Artifact(name="drug-review-dataset", type="dataset")

    artifact.add_file(local_path="datasets/train.csv", name="train.csv")
    artifact.add_file(local_path="datasets/test.csv", name="test.csv")
    artifact.add_file(local_path="datasets/test_probas.csv", name="test_probas.csv")

    run.log_artifact(artifact)

    print("Datasets and probabilities saved and logged as artifacts in W&B.")




In [None]:
# Pull the text and label columns out as 1-D arrays
X_train, y_train = train_df['text'].to_numpy(), train_df['label'].to_numpy()
X_test, y_test = test_df['text'].to_numpy(), test_df['label'].to_numpy()

In [None]:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Define the labels for classification
LABELS = ['Negative', 'Neutral', 'Positive']

def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_probas = model.predict_proba(X_test)

    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='weighted'),
        'precision': precision_score(y_test, y_pred, average='weighted'),
        'recall': recall_score(y_test, y_pred, average='weighted')
    }
    # Ensure that we save the dataset used so that we can use it for debugging
    # or in the future for comparisons with other models
    save_and_log_datasets(train_df, test_df, y_probas, run)

    # Log all metrics in a single call so they share one W&B step
    wandb.log(metrics)

    wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=LABELS)
    wandb.sklearn.plot_roc(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_precision_recall(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_class_proportions(y_train, y_test, LABELS)

    return metrics

# Initialize W&B run
run = wandb.init(project="Drug Review MLOps Uplimit", name="DummyClassifier_Stratified",
           notes="Dummy Classifier Baseline", tags=["baseline", "dummy", "stratified"])

config = {
    "model": "DummyClassifier",
    "strategy": "stratified"
}
wandb.config.update(config)

dummy_clf = DummyClassifier(strategy='stratified', random_state=42)
dummy_metrics = train_and_evaluate(dummy_clf, X_train, y_train, X_test, y_test)

wandb.finish()

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: rajkstats. Use `wandb login --relogin` to force relogin


Datasets and probabilities saved and logged as artifacts in W&B.


Run summary:
accuracy   0.51086
f1         0.5115
precision  0.51215
recall     0.51086


In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def check_batch_training(X, y, n_gram_size, max_iter, batch_size=32):
    """
    Check if the model can train on a batch of specified size using a scikit-learn Pipeline.
    """
    X_batch = X[:batch_size]
    y_batch = y[:batch_size]

    pipeline = Pipeline([
        ('featurizer', CountVectorizer(ngram_range=(1, n_gram_size))),
        ('classifier', LogisticRegression(max_iter=max_iter))
    ])

    try:
        pipeline.fit(X_batch, y_batch)
        return True
    except Exception as e:
        print(f"Error training on batch of size {batch_size}: {str(e)}")
        return False

# Perform batch check before main training loop
print("Performing batch training check...")
N_GRAM_SIZE = 3
LR_MAX_ITER = 100
batch_success = check_batch_training(X_train, y_train, N_GRAM_SIZE, LR_MAX_ITER, batch_size=32)

if not batch_success:
    print("Batch training failed. Please review your model and data.")
else:
    print("Batch training successful. Proceeding with full training...")

Performing batch training check...
Batch training successful. Proceeding with full training...


In [None]:
import wandb
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
import os
from pathlib import Path

# Constants
N_GRAM_SIZE = 3
LR_MAX_ITER = 100
TRAIN_SIZE_EVALS = [1000, 5000, 10000, len(X_train)]

def train_and_evaluate(pipeline, X_train, y_train, X_test, y_test, run_name, run):
    pipeline.fit(X_train, y_train)
    registered_model_name = "review-sentiment-analysis-dev"

    y_pred = pipeline.predict(X_test)
    y_probas = pipeline.predict_proba(X_test)

    # Log headline metrics so this run can be compared with the baseline in W&B
    run.log({
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='weighted'),
    })

    wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=LABELS)
    wandb.sklearn.plot_roc(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_precision_recall(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_class_proportions(y_train, y_test, LABELS)

    save_and_log_datasets(train_df, test_df, y_probas, run)

    # Export the model to ONNX format
    initial_type = [('text_input', StringTensorType([None, 1]))]
    onx = convert_sklearn(pipeline, initial_types=initial_type)

    # Save the ONNX model locally
    onnx_filename = f"logreg_model_{run_name}.onnx"
    onnx_filepath = Path(onnx_filename)
    with open(onnx_filepath, "wb") as f:
        f.write(onx.SerializeToString())

    run.link_model(
        onnx_filepath,
        registered_model_name
    )

    # Clean up the local file
    os.remove(onnx_filepath)

# OPTIONAL: Initially, let's just train on a small sample. If you have the time,
# go ahead and try training on the entire dataset!
n = TRAIN_SIZE_EVALS[0]
run_name = f"LR_train_size_{n}"
run = wandb.init(project="Drug Review MLOps Uplimit", name=run_name,
            notes="Logistic Regression with various train sizes",
            tags=["logistic-regression", "experiment"])

config = {
    "model": "LogisticRegression",
    "n_gram_size": N_GRAM_SIZE,
    "max_iter": LR_MAX_ITER,
    "train_size": n
}
run.config.update(config)

X_train_i = X_train[:n]
y_train_i = y_train[:n]

pipeline = Pipeline([
    ('featurizer', CountVectorizer(ngram_range=(1, N_GRAM_SIZE))),
    ('classifier', LogisticRegression(max_iter=LR_MAX_ITER))
])

train_and_evaluate(pipeline, X_train_i, y_train_i, X_test, y_test, run_name, run)

wandb.finish()

print("Experiment tracking completed.")


Datasets and probabilities saved and logged as artifacts in W&B.



Experiment tracking completed.


## Compare Models in W&B

Once we have trained our models, we can use W&B to compare them: head to the project page and look at the generated charts. W&B compares the runs across the charts that we logged during training, so it's important to decide which plots we want ahead of time to ensure that runs can always be compared easily!

![Compare ML Models in W&B](https://drive.google.com/uc?id=1WSKNK3rlwTr_e-mhVhjCNy-LcSBg3amw)

## Fetch and Run Inference using the Model from Model Registry

Now that we have logged our model to the model registry, let's load it here and run inference with ONNX Runtime to verify that everything works as expected!

In [None]:
import wandb
run = wandb.init(project="Drug Review MLOps Uplimit")
downloaded_model_path = run.use_model(name="run-g1k7ho60-logreg_model_LR_train_size_1000.onnx:v0")
print(downloaded_model_path)

wandb:   1 of 1 files downloaded.


/content/artifacts/run-g1k7ho60-logreg_model_LR_train_size_1000.onnx:v0/logreg_model_LR_train_size_1000.onnx


In [None]:
import numpy as np
import onnxruntime as rt

# First we must start a session.
sess = rt.InferenceSession(downloaded_model_path)
# The name of the input is saved as part of the .onnx file.
# We are retrieving it because we will need it later.
input_name = sess.get_inputs()[0].name
print(f"{input_name=}")
# This code will run the model on our behalf.
query = "I loved the product!"
_, probas = sess.run(None, {input_name: np.array([[query]])})
print(probas[0])

input_name='text_input'
{0: 0.1563500612974167, 1: 0.038675498217344284, 2: 0.8049744367599487}
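
The model returns a dictionary that maps each label id to a probability. A small sketch of turning that into a readable prediction follows; the probabilities are copied from the output above, and the `LABELS` order is assumed to match the label ids 0, 1, 2 used throughout this notebook.

```python
# Label names in the same order as the label ids 0, 1, 2
LABELS = ['Negative', 'Neutral', 'Positive']

# Probabilities copied from the inference output above
probas = {0: 0.1563500612974167, 1: 0.038675498217344284, 2: 0.8049744367599487}

# Pick the label id with the highest probability and map it to its name
predicted_id = max(probas, key=probas.get)
predicted_label = LABELS[predicted_id]

print(f"{predicted_label} (p={probas[predicted_id]:.3f})")
```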


## [OPTIONAL] Advanced Models for Sentiment Analysis

For those interested in exploring more sophisticated approaches, this section introduces two advanced techniques: using BERT embeddings as a featurizer and leveraging Large Language Models (LLMs) for sentiment analysis.

### 1. BERT Embeddings with Logistic Regression

This approach uses BERT to create embeddings, which are then fed into a logistic regression classifier.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score

class TransformerFeaturizer(BaseEstimator, TransformerMixin):
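    def __init__(self, model_name='all-MiniLM-L6-v2'):
        # Store the constructor argument unchanged (scikit-learn estimator convention)
        self.model_name = model_name
        # TODO: Initialize the SentenceTransformer model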

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # TODO: Use the SentenceTransformer model to create embeddings for the input text
        pass

# Training and evaluation
models_advanced = {}
for n in TRAIN_SIZE_EVALS:
    print(f"Evaluating BERT+LR for training data size = {n}")
    X_train_i = X_train[:n]
    Y_train_i = y_train[:n]

    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer()),
        ('classifier', LogisticRegression(max_iter=1000))
    ])

    # TODO: Fit the pipeline, make predictions, and calculate metrics
    # Store results in models_advanced[n]

    print(f"Accuracy on test set: {models_advanced[n]['accuracy']}")
```

### 2. LLM-based Sentiment Analysis

This approach uses a Large Language Model for zero-shot and few-shot sentiment analysis.

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Initialize your LLM (replace with your preferred model)
llm = OpenAI(temperature=0)

# Zero-shot prompting
zero_shot_template = """
YOUR PROMPT
"""
zero_shot_prompt = PromptTemplate(input_variables=["review"], template=zero_shot_template)
zero_shot_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

# TODO: Implement zero-shot sentiment analysis on a sample of drug reviews

# Few-shot prompting
few_shot_template = """
YOUR PROMPT
"""
few_shot_prompt = PromptTemplate(input_variables=["review"], template=few_shot_template)
few_shot_chain = LLMChain(llm=llm, prompt=few_shot_prompt)

# TODO: Implement few-shot sentiment analysis on a sample of drug reviews

# TODO: Compare the performance of zero-shot and few-shot approaches
```

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-3.2.0-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.0-py3-none-any.whl (255 kB)
Installing collected packages: sentence_transformers
Successfully installed sentence_transformers-3.2.0


### BERT Embeddings with Logistic Regression



In [48]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score
import pickle
import os

class TransformerFeaturizer(BaseEstimator, TransformerMixin):
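    def __init__(self, model_name='all-MiniLM-L6-v2'):
        # Keep the constructor argument as an attribute so that scikit-learn's
        # get_params()/clone() can introspect this estimator.
        self.model_name = model_name
        # Initialize the SentenceTransformer model
        self.model = SentenceTransformer(model_name)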

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Use the SentenceTransformer model to create embeddings for the input text
        embeddings = self.model.encode(X, show_progress_bar=True)
        return embeddings


# Initialize W&B run for tracking
run = wandb.init(
    project="Drug Review MLOps Uplimit",
    name="BERT+LR_Advanced",
    notes="Using BERT embeddings with Logistic Regression",
    tags=["BERT", "Logistic Regression", "NLP", "sentiment-analysis"]
)

# Training and evaluation
models_advanced = {}
for n in TRAIN_SIZE_EVALS:
    print(f"Evaluating BERT+LR for training data size = {n}")
    X_train_i = X_train[:n]
    Y_train_i = y_train[:n]

    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer()),
        ('classifier', LogisticRegression(max_iter=1000))
    ])

    # Fit the pipeline on subset of data
    pipeline.fit(X_train_i, Y_train_i)

    # Make predictions on the test set
    y_pred = pipeline.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Store results in models_advanced[n]
    models_advanced[n] = {
        'pipeline': pipeline,
        'accuracy': accuracy,
        'f1_score': f1
    }

    # Log the metrics to W&B
    wandb.log({
        "train_size": n,
        "accuracy": accuracy,
        "f1_score": f1
    })

    # Save the model to a file
    model_filename = f"bert_logreg_pipeline_{n}.pkl"
    with open(model_filename, "wb") as f:
        pickle.dump(pipeline, f)

    # Log the model file as an artifact in W&B
    artifact = wandb.Artifact(name=f"bert_logreg_pipeline_{n}", type="model")
    artifact.add_file(model_filename)
    run.log_artifact(artifact)

    # Clean up the local file
    os.remove(model_filename)


    print(f"Accuracy on test set: {models_advanced[n]['accuracy']}")

Evaluating BERT+LR for training data size = 1000
Accuracy on test set: 0.7115504358655044
Evaluating BERT+LR for training data size = 5000
Accuracy on test set: 0.7334993773349938
Evaluating BERT+LR for training data size = 10000
Accuracy on test set: 0.7392590286425903
Evaluating BERT+LR for training data size = 102782
Accuracy on test set: 0.7475093399750934


### LLM-based Sentiment Analysis

In [None]:
!pip install langchain langchain-community

Collecting langchain
  Downloading langchain-0.3.3-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.2-py3-none-any.whl.metadata (2.8 kB)
Collecting langchain-core<0.4.0,>=0.3.10 (from langchain)
  Downloading langchain_core-0.3.10-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.134-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.5.2-py3-none-any.whl.metadata (3.5 kB)
Collectin

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Initialize your LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Zero-shot prompt template
zero_shot_template = """
Classify the sentiment of the following drug review into one of the following categories: 'positive', 'neutral', or 'negative'.

Review: "{review}"
Respond with only one word (negative, neutral, or positive)
"""
zero_shot_prompt = PromptTemplate(input_variables=["review"], template=zero_shot_template)
zero_shot_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

# Perform zero-shot sentiment analysis on a sample of drug reviews
sample_reviews = X_test[:5]
for review in sample_reviews:
    sentiment = zero_shot_chain.run({"review": review})
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")

Review: "This birth control was great for me. I went from an a cup to a d cup on it though. No weight gain etc. It was pretty easy on my emotions and I wish I could find it again now."
Predicted Sentiment: positive

Review: "I&#039;m a 50 yr old female who is still menstruating. 

I started this drug in the hospital and my cycle went from 3-5 days - up to 9 days then stopped.



This month- I am on day 15 of an incredibly heavy cycle. I wear TWO extra long overnight pads ( which should last 10 hours) and I go through them in less than 2 hours.



I started this drug after a mild stroke that seems to have been a by-product of A-fib . I didn&#039;t want Coumadin (aka Rat Poison) - and thought out of the 3 two-year old medicines left, this seemed to be the best for my active lifestyle.   It is not.



The bleeding is so heavy, I am exhausted. Pains in my feet adn ankles. Back pains, neck pains and Migraine like headaches on the opposite side of my stroke affected side. HELP!"
Predicted Se

In [None]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Few-shot prompt template with examples
few_shot_template = """
Classify the sentiment of the following drug review into one of the following categories: 'positive', 'neutral', or 'negative'.

Examples:
1. Review: "This medication worked really well for me, I feel like myself again." Sentiment: positive
2. Review: "I had a few side effects, but nothing too serious. It helped a bit." Sentiment: neutral
3. Review: "I couldn't handle the side effects at all, and it didn't help my condition." Sentiment: negative

Now classify the sentiment of the given review:

Review: "{review}"
Respond with only one word (negative, neutral, or positive)
"""

# Create a PromptTemplate instance using the few-shot prompt
few_shot_prompt = PromptTemplate(input_variables=["review"], template=few_shot_template)
few_shot_chain = LLMChain(llm=llm, prompt=few_shot_prompt)


In [None]:
# Sample reviews from the test set for evaluation
sample_reviews = X_test[:5]

# Perform few-shot sentiment analysis on a sample of drug reviews
print("Few-shot Sentiment Analysis Results:\n")
few_shot_results = []
for review in sample_reviews:
    sentiment = few_shot_chain.run({"review": review})
    few_shot_results.append(sentiment.strip())  # Collect predictions and remove any extra whitespace
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")

Few-shot Sentiment Analysis Results:

Review: "This birth control was great for me. I went from an a cup to a d cup on it though. No weight gain etc. It was pretty easy on my emotions and I wish I could find it again now."
Predicted Sentiment: positive

Review: "I&#039;m a 50 yr old female who is still menstruating. 

I started this drug in the hospital and my cycle went from 3-5 days - up to 9 days then stopped.



This month- I am on day 15 of an incredibly heavy cycle. I wear TWO extra long overnight pads ( which should last 10 hours) and I go through them in less than 2 hours.



I started this drug after a mild stroke that seems to have been a by-product of A-fib . I didn&#039;t want Coumadin (aka Rat Poison) - and thought out of the 3 two-year old medicines left, this seemed to be the best for my active lifestyle.   It is not.



The bleeding is so heavy, I am exhausted. Pains in my feet adn ankles. Back pains, neck pains and Migraine like headaches on the opposite side of my str

In [41]:
from sklearn.metrics import classification_report, accuracy_score

# Initialize W&B run for zero-shot vs few-shot comparison
run = wandb.init(project="Drug Review MLOps Uplimit",
                 name="Zero-vs-Few-Shot-Comparison",
                 notes="Compare zero-shot and few-shot prompting for sentiment analysis",
                 tags=["zero-shot", "few-shot", "LLM", "sentiment-analysis"])

# Logging the LLM configuration
run.config.update({
    "model": "gpt-4o-mini",
    "temperature": 0,
    "prompt_type": "comparison between zero-shot and few-shot"
})

# Define the labels in both numerical and string formats
target_names = ['negative', 'neutral', 'positive']
label_mapping = {0: 'negative', 1: 'neutral', 2: 'positive'}

# Assume y_true contains the true labels for the sample reviews
y_true = y_test[:5]


# Convert numerical true labels to strings
y_true_str = [label_mapping[label] for label in y_true]

# Zero-shot predictions
zero_shot_preds = [zero_shot_chain.run({"review": review}).strip().lower() for review in sample_reviews]

# Few-shot predictions
few_shot_preds = [few_shot_chain.run({"review": review}).strip().lower() for review in sample_reviews]

# Convert predictions to numerical labels
reverse_label_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
zero_shot_preds_num = [reverse_label_mapping[pred] for pred in zero_shot_preds]
few_shot_preds_num = [reverse_label_mapping[pred] for pred in few_shot_preds]

# Calculate accuracy
zero_shot_accuracy = accuracy_score(y_true_str, zero_shot_preds)
few_shot_accuracy = accuracy_score(y_true_str, few_shot_preds)

# Generate classification reports with specified labels to avoid class mismatch errors
zero_shot_report = classification_report(
    y_true_str,
    zero_shot_preds,
    target_names=target_names,
    labels=target_names,
    output_dict=True,
    zero_division=0  # Set to 0 to handle missing classes
)

few_shot_report = classification_report(
    y_true_str,
    few_shot_preds,
    target_names=target_names,
    labels=target_names,
    output_dict=True,
    zero_division=0  # Set to 0 to handle missing classes
)

# Log metrics to W&B
wandb.log({"zero_shot_accuracy": zero_shot_accuracy,
           "few_shot_accuracy": few_shot_accuracy})

# Log detailed metrics for both approaches
for label in target_names:
    wandb.log({
        f"zero_shot_{label}_precision": zero_shot_report[label]['precision'],
        f"zero_shot_{label}_recall": zero_shot_report[label]['recall'],
        f"zero_shot_{label}_f1-score": zero_shot_report[label]['f1-score'],
        f"few_shot_{label}_precision": few_shot_report[label]['precision'],
        f"few_shot_{label}_recall": few_shot_report[label]['recall'],
        f"few_shot_{label}_f1-score": few_shot_report[label]['f1-score']
    })

# Log sample reviews and predictions to W&B
for i, review in enumerate(sample_reviews):
    wandb.log({
        f"review_{i}": review,
        f"true_label_{i}": y_true_str[i],
        f"zero_shot_pred_{i}": zero_shot_preds[i],
        f"few_shot_pred_{i}": few_shot_preds[i]
    })

# Finish W&B run
wandb.finish()

print("Zero-shot vs Few-shot comparison logged successfully")

few_shot_accuracy,1
few_shot_negative_f1-score,1
few_shot_negative_precision,1
few_shot_negative_recall,1
few_shot_neutral_f1-score,0
few_shot_neutral_precision,0
few_shot_neutral_recall,0
few_shot_positive_f1-score,1
few_shot_positive_precision,1
few_shot_positive_recall,1


Zero-shot vs Few-shot comparison logged successfully
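
The comparison cell looks each reply up directly in `reverse_label_mapping`, so a reply such as `Positive.` (with trailing punctuation) would raise a `KeyError` even after `.strip().lower()`. The sketch below shows one possible way to normalize free-form replies before scoring them. `normalize_sentiment` and `VALID_LABELS` are hypothetical names introduced for illustration, and treating ambiguous replies as `None` is an assumption about how you want to count them.

```python
import re

VALID_LABELS = ("negative", "neutral", "positive")

def normalize_sentiment(raw_output, default=None):
    """Map a free-form LLM reply to one of VALID_LABELS, or `default` if unclear."""
    words = re.findall(r"[a-z]+", raw_output.lower())
    found = [word for word in words if word in VALID_LABELS]
    # Accept the reply only if it names exactly one distinct label.
    if len(set(found)) == 1:
        return found[0]
    return default

replies = ["positive", " Negative.\n", "Sentiment: neutral", "It is hard to say"]
print([normalize_sentiment(reply) for reply in replies])
# ['positive', 'negative', 'neutral', None]
```

Replies that come back as `None` can then be counted as errors or inspected by hand, rather than stopping the whole evaluation.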


## [OPTIONAL] Model Evaluation: Estimating Confidence Intervals with Bootstrap Sampling

When evaluating machine learning models, it's crucial to understand not just the point estimates of performance metrics, but also the uncertainty around these estimates. Bootstrap sampling is a powerful technique that allows us to estimate confidence intervals for our model's performance metrics.

## Why Bootstrap Sampling?

1. **Quantify Uncertainty**: Bootstrap sampling helps us quantify the uncertainty in our model's performance metrics.
2. **Robustness**: An interval shows how much a metric could vary with a different test sample, which a single point estimate hides.
3. **No Distributional Assumptions**: Bootstrap sampling doesn't require assumptions about the underlying distribution of the data.

## Implementing Bootstrap Sampling

Here's a step-by-step guide to implement bootstrap sampling for estimating confidence intervals:

1. **Generate Bootstrap Samples**:
   Create N (e.g., 1000) bootstrap samples, each the same size as your original test set. Each sample is created by randomly selecting instances from the test set with replacement.

2. **Calculate Metrics for Each Sample**:
   For each bootstrap sample, calculate the performance metrics you're interested in (e.g., accuracy, F1-score).

3. **Sort the Results**:
   Sort the N values for each metric in ascending order.

4. **Compute Confidence Intervals**:
   The 95% confidence interval is given by the 2.5th and 97.5th percentiles of the sorted values.

In [50]:
import wandb
import pickle

# Initialize a W&B run to access the saved model artifact
run = wandb.init(project="Drug Review MLOps Uplimit", job_type="inference")

artifact = run.use_artifact('rajkstats/Drug Review MLOps Uplimit/bert_logreg_pipeline_1000:v0', type='model')
artifact_dir = artifact.download()

# Load the model pipeline from the downloaded artifact directory
model_path = f"{artifact_dir}/bert_logreg_pipeline_1000.pkl"
with open(model_path, "rb") as model_file:
    model = pickle.load(model_file)

# End W&B run
run.finish()

print("Model loaded successfully from W&B.")


wandb: Downloading large artifact bert_logreg_pipeline_1000:v0, 87.17MB. 1 files...
wandb:   1 of 1 files downloaded.
Done. 0:0:0.4

Model loaded successfully from W&B.


In [51]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

def bootstrap_sample(X, y, n_samples):
    # Generate random indices for bootstrap sampling
    indices = np.random.randint(0, n_samples, n_samples)
    return X[indices], y[indices]

def bootstrap_confidence_interval(model, X_test, y_test, n_iterations=1000):
    accuracies = []
    f1_scores = []
    n_samples = len(X_test)

    for _ in tqdm(range(n_iterations)):
        # Generate bootstrap sample
        X_sample, y_sample = bootstrap_sample(X_test, y_test, n_samples)

        # Make predictions on bootstrap sample
        y_pred = model.predict(X_sample)

        # Calculate accuracy and F1-score
        accuracy = accuracy_score(y_sample, y_pred)
        f1 = f1_score(y_sample, y_pred, average='weighted')

        # Append results to accuracies and f1_scores lists
        accuracies.append(accuracy)
        f1_scores.append(f1)

    # Calculate mean and confidence intervals for accuracy and F1-score
    accuracy_mean = np.mean(accuracies)
    accuracy_ci = (np.percentile(accuracies, 2.5), np.percentile(accuracies, 97.5))
    f1_mean = np.mean(f1_scores)
    f1_ci = (np.percentile(f1_scores, 2.5), np.percentile(f1_scores, 97.5))

    # Return results in a dictionary format
    return {
        'accuracy_mean': accuracy_mean,
        'accuracy_ci': accuracy_ci,
        'f1_mean': f1_mean,
        'f1_ci': f1_ci
    }

# Usage
# Call the bootstrap_confidence_interval function with your model and test data
results = bootstrap_confidence_interval(model, X_test, y_test)
# Print the results
print(results)


  1%|          | 8/1000 [05:41<11:45:15, 42.66s/it]


KeyboardInterrupt: 
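
The run above was interrupted after 8 of 1000 iterations at roughly 42 seconds each, because every iteration calls `model.predict`, which recomputes the sentence embeddings for the whole resampled test set. Since the model is fixed, its prediction for a given review never changes, so one alternative is to predict once and resample the (true label, prediction) pairs instead. The following is a minimal pure-Python sketch of that idea, not the notebook's original method; `bootstrap_ci` and `accuracy` are hypothetical helpers, and the toy labels are made up for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric=accuracy, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `metric`, resampling (label, prediction) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iterations):
        # Draw n indices with replacement and score the resampled pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    # Nearest-rank percentiles: drop alpha/2 of the sorted scores from each tail.
    lower_idx = int(n_iterations * alpha / 2)
    upper_idx = n_iterations - 1 - lower_idx
    return sum(scores) / n_iterations, (scores[lower_idx], scores[upper_idx])

# Toy data: 200 labels, every fifth prediction is wrong (80% accuracy).
y_true = [i % 3 for i in range(200)]
y_pred = [t if i % 5 else (t + 1) % 3 for i, t in enumerate(y_true)]
mean, (ci_low, ci_high) = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With the notebook's objects, this would presumably be used as `y_pred = model.predict(X_test)` computed once, followed by `bootstrap_ci(list(y_test), list(y_pred))`. A weighted F1 interval can be obtained by passing a different `metric` callable, for example one that wraps `f1_score(..., average='weighted')`.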

# [OPTIONAL] Post-Training Tests on Model Behavior

After training your sentiment analysis model, it's crucial to thoroughly test its behavior beyond just accuracy metrics. This section introduces three types of behavioral tests that will help you understand your model's strengths, weaknesses, and potential biases.

## Invariance Tests

Invariance tests check whether your model's output remains unchanged when irrelevant input features are modified.

### Example: Name Invariance Test

In sentiment analysis, a person's name mentioned in a review should not affect the sentiment prediction.

**Test Setup:**
1. Select a set of reviews from your test dataset.
2. Create copies of these reviews, replacing any mentioned names with different names.
3. Run both the original and modified reviews through your model.
4. Compare the sentiment predictions.

**Expected Outcome:** The sentiment predictions should remain the same for both original and name-modified reviews.
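
The setup above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: `toy_predict` is a keyword-based stand-in so the example runs on its own (replace it with a function that calls your trained model), and the templates, names, and `name_invariance_failures` helper are hypothetical.

```python
def toy_predict(texts):
    """Stand-in classifier (NOT the trained model): simple keyword lookup."""
    labels = []
    for text in texts:
        lowered = text.lower()
        if "great" in lowered or "love" in lowered:
            labels.append("positive")
        elif "awful" in lowered or "worse" in lowered:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels

def name_invariance_failures(predict, templates, names):
    """Return (template, name, prediction, baseline) whenever swapping the name changes the prediction."""
    failures = []
    for template in templates:
        variants = [template.format(name=name) for name in names]
        predictions = predict(variants)
        baseline = predictions[0]  # prediction for the first name
        for name, prediction in zip(names, predictions):
            if prediction != baseline:
                failures.append((template, name, prediction, baseline))
    return failures

templates = [
    "Dr. {name} prescribed this and it worked great.",
    "{name} at the pharmacy warned me, and it made my symptoms worse.",
]
names = ["John", "Emma", "Mohammed", "Yuki", "Maria"]
failures = name_invariance_failures(toy_predict, templates, names)
print(f"{len(failures)} invariance failures")  # 0 invariance failures
```

Any entry in `failures` points to a review whose predicted sentiment depends on the name, which is worth inspecting for bias.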

## Directional Expectation Tests

Directional Expectation Tests check if changes in input lead to expected changes in output.

### Example: Intensifier Test

Adding intensity-related words should increase the strength of the sentiment prediction.

**Test Setup:**
1. Select a set of positive and negative reviews from your test dataset.
2. Create copies of these reviews, adding intensifiers like "very", "extremely", or "incredibly".
3. Run both the original and modified reviews through your model.
4. Compare the sentiment prediction probabilities.

**Expected Outcome:** The sentiment prediction probability should increase in the direction of the original sentiment.
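
A directional test compares probabilities rather than labels. The sketch below assumes a scoring function that returns the probability of the positive class for each text; `toy_predict_proba` is a made-up stand-in so the example is self-contained, and `directional_failures` is a hypothetical helper. With the ONNX session shown earlier, the positive-class probability would presumably be read from the per-class dictionary instead.

```python
import math

INTENSIFIERS = {"very", "extremely", "incredibly"}
POSITIVE = {"good", "effective", "helpful"}
NEGATIVE = {"bad", "painful", "useless"}

def toy_predict_proba(texts):
    """Stand-in scorer (NOT the trained model): returns P(positive) per text."""
    probas = []
    for text in texts:
        words = text.lower().replace(".", "").split()
        boost = 1 + sum(w in INTENSIFIERS for w in words)
        polarity = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        probas.append(1 / (1 + math.exp(-boost * polarity)))  # logistic squashing
    return probas

def directional_failures(predict_proba, pairs):
    """pairs holds (original, intensified, expected_label) triples.

    P(positive) should not drop for positive reviews and should not rise
    for negative reviews once an intensifier is added.
    """
    failures = []
    for original, intensified, label in pairs:
        p_orig, p_int = predict_proba([original, intensified])
        ok = p_int >= p_orig if label == "positive" else p_int <= p_orig
        if not ok:
            failures.append((original, intensified, p_orig, p_int))
    return failures

pairs = [
    ("This drug was effective.", "This drug was extremely effective.", "positive"),
    ("The side effects were bad.", "The side effects were very bad.", "negative"),
]
print(directional_failures(toy_predict_proba, pairs))  # []
```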

## Minimum Functionality Tests

Minimum Functionality Tests check if your model performs correctly on very simple or critical cases.

### Example: Explicit Sentiment Words Test

Reviews containing explicit sentiment words should be correctly classified.

**Test Setup:**
1. Create a list of reviews using explicit positive and negative sentiment words.
2. Run these through your model.
3. Check if the predictions match the expected sentiments.

**Expected Outcome:** The model should correctly classify these simple, explicit sentiment expressions with high accuracy.
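
Before reaching for a testing library, the same check can be sketched with plain Python. In this illustration `toy_predict` is a stand-in that deliberately does not recognize "fantastic", so that the report shows what a failure looks like; the product names and the `run_mft` helper are hypothetical.

```python
from itertools import product

POSITIVE_WORDS = ["excellent", "amazing", "fantastic"]
NEGATIVE_WORDS = ["terrible", "awful", "horrible"]
PRODUCTS = ["medication", "treatment"]  # hypothetical fill-ins for the template

def toy_predict(texts):
    """Stand-in classifier (NOT the trained model); it does not know "fantastic"."""
    known_positive = {"excellent", "amazing"}
    labels = []
    for text in texts:
        words = set(text.lower().rstrip(".").split())
        if words & known_positive:
            labels.append("positive")
        elif words & set(NEGATIVE_WORDS):
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels

def run_mft(predict, cases):
    """cases is a list of (text, expected_label); returns (pass_rate, failures)."""
    predictions = predict([text for text, _ in cases])
    failures = [
        (text, expected, got)
        for (text, expected), got in zip(cases, predictions)
        if got != expected
    ]
    return 1 - len(failures) / len(cases), failures

cases = [(f"This {p} is {w}.", "positive") for p, w in product(PRODUCTS, POSITIVE_WORDS)]
cases += [(f"This {p} is {w}.", "negative") for p, w in product(PRODUCTS, NEGATIVE_WORDS)]

pass_rate, failures = run_mft(toy_predict, cases)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 83%
for text, expected, got in failures:
    print(f"FAIL {text!r}: expected {expected}, got {got}")
```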

In [None]:
import checklist
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

editor = Editor()

# TODO: Define variables for test data generation
# Example:
# positive_words = ['excellent', 'amazing', 'fantastic']
# negative_words = ['terrible', 'awful', 'horrible']
# neutral_words = ['okay', 'average', 'mediocre']
# products = ['movie', 'book', 'restaurant', 'hotel']

# TODO: Create templates for Invariance Tests
# Example: Name Invariance Test
# ret = editor.template('According to {name}, the {product} was {quality}.',
#                       name=['John', 'Emma', 'Mohammed', 'Yuki', 'Maria'],
#                       product=products,
#                       quality=positive_words + negative_words + neutral_words,
#                       labels='Sentiment')

# TODO: Create templates for Directional Expectation Tests
# Example: Intensifier Test
# ret += editor.template('The {product} was {intensifier} {quality}.',
#                        product=products,
#                        intensifier=['', 'very', 'extremely'],
#                        quality=positive_words + negative_words,
#                        labels='Sentiment')

# TODO: Create templates for Minimum Functionality Tests
# Example: Explicit Sentiment Words Test
# ret += editor.template('This {product} is {quality}.',
#                        product=products,
#                        quality=positive_words + negative_words,
#                        labels='Sentiment')

# Part 2: Test Configuration

# TODO: Configure the MFT test
# test = MFT(**ret, name='Sentiment Analysis Behavioral Tests')

# Part 3: Test Run & Results summary

# TODO: Implement a function to use your trained model for predictions
def predict_sentiment(texts):
    # Your code here to make predictions using your trained model
    pass

# TODO: Wrap your prediction function
# wrapped_predictor = PredictorWrapper.wrap_predict(predict_sentiment)

# TODO: Run the test and display results
# test.run(wrapped_predictor)
# test.summary()

# Selecting the Model to Productionize

Now that you've trained, evaluated, and performed behavioral testing on your sentiment analysis models, it's time to select the best model for production deployment. This decision should be based on a comprehensive analysis of each model's performance, behavior, and suitability for the real-world application. For this project, the full justification step is left as an optional task. We recommend that you pick a simple model that can be easily deployed, such as the Logistic Regression model.

## [OPTIONAL] Model Selection and Justification

Example table of factors to consider:

| Aspect | Logistic Regression | BERT | [Other Models] |
|--------|---------------------|------|----------------|
| Accuracy | | | |
| F1 Score | | | |
| Invariance Test | | | |
| Directional Test | | | |
| MF Test | | | |
| Pros | | | |
| Cons | | | |
| Training Time | | | |
| Inference Time | | | |
| Explainability | | | |

1. Review your models' performance, considering factors such as:
   - Accuracy, F1 Score, Precision, Recall
   - Results from Invariance, Directional Expectation, and Minimum Functionality Tests
   - Training and inference time
   - Model size and resource requirements
   - Explainability and interpretability
   - Robustness and potential biases

2. Write a brief justification for your chosen model, addressing:
   - Why this model is the best fit for the drug review sentiment analysis task
   - How it balances performance, efficiency, and robustness
   - Any potential challenges or limitations, and how you plan to address them
   - How this model aligns with the business requirements and constraints

3. Promote the selected model to production in Weights & Biases (W&B):
   - Log into your W&B account
   - Navigate to your project
   - Find the run corresponding to your selected model
   - Use the W&B UI to promote this model by adding the `production` alias to its model artifact
   - Provide any necessary metadata or tags