[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openlayer-ai/examples-gallery/blob/main/text-classification/sklearn/sentiment-analysis/sentiment-sklearn.ipynb)


# <a id="top">Sentiment analysis using sklearn</a>

This notebook illustrates how sklearn models can be uploaded to the Openlayer platform.

## <a id="toc">Table of contents</a>

1. [**Getting the data and training the model**](#1)
    - [Downloading the dataset](#download)
    - [Training the model](#train)
    

2. [**Using Openlayer's Python API**](#2)
    - [Instantiating the client](#client)
    - [Creating a project](#project)
    - [Uploading datasets](#dataset)
    - [Uploading models](#model)
        - [Shell models](#shell)
        - [Full models](#full-model)
    - [Committing and pushing to the platform](#commit)

In [None]:
%%bash

if [ ! -e "requirements.txt" ]; then
    curl "https://raw.githubusercontent.com/openlayer-ai/examples-gallery/main/text-classification/sklearn/sentiment-analysis/requirements.txt" --output "requirements.txt"
fi

In [None]:
!pip install -r requirements.txt

## <a id="1"> 1. Getting the data and training the model </a>

[Back to top](#top)

In this first part, we will get the dataset, pre-process it, split it into training and validation sets, and train a model. Feel free to skim through this section if you are already comfortable with how these steps look for an sklearn model.   

In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

### <a id="download">Downloading the dataset </a>


We have stored the dataset in an S3 bucket. If, for some reason, you get an error reading the CSV directly from it, download the CSV files manually instead. Alternatively, you can also find the original datasets in [this Kaggle dataset](https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset?select=testdata.manual.2009.06.14.csv). The training set in this example corresponds to the first 20,000 rows of the original training set. Note that the cells below read the CSV files from a local `./quora-insincere-questions-classification/` folder.

In [None]:
'''
Test dataset has no labels, so we will have to split the training set into a test partition

df_test = pd.read_csv(
    "./quora-insincere-questions-classification/test.csv",
    encoding='ISO-8859-1'
)

# Load the labels
df_test_labels = pd.read_csv('test_labels.csv')

# Join the two dataframes by the shared column
merged_df = pd.merge(df_test, df_test_labels, on='id')

# Save the merged dataframe to a new CSV file
merged_df.to_csv('test_with_labels.csv', index=False)

'''

In [2]:
df = pd.read_csv(
    "./quora-insincere-questions-classification/train.csv",
    encoding='ISO-8859-1', 
)

df.drop(df[df['question_text'].str.len() > 1000].index, inplace = True)

def split_dataframe(df):
    train, test = train_test_split(df, test_size=0.25)
    return train, test

df_train, df_val = split_dataframe(df)
df_test = pd.read_csv(
    "./quora-insincere-questions-classification/test.csv",
    encoding='ISO-8859-1'
)

In [3]:
print(df_val.columns)
print(df_train.columns)
print(df_test.columns)

Index(['qid', 'question_text', 'target'], dtype='object')
Index(['qid', 'question_text', 'target'], dtype='object')
Index(['qid', 'question_text'], dtype='object')


In [4]:
df_train.head()

|  | qid | question_text | target |
|---|---|---|---|
| 946370 | b971b06e15aad10d2211 | What advice would you give to someone who is m... | 0 |
| 613421 | 78222924f9f1844dd68c | Do Indian Muslims know that they are converted... | 1 |
| 477613 | 5d87fd135e566f8826a7 | How powerful was the USA before WW2? | 0 |
| 517537 | 65582a45c0dca303d0a8 | What is the importance of probability? | 0 |
| 967893 | bda271da57ddf70f9c01 | How does it feel or how did you react when you... | 0 |


### <a id="train">Training the model</a>

In [5]:
sklearn_model = Pipeline([("count_vect", 
                           CountVectorizer(min_df=100, 
                                           ngram_range=(1, 2), 
                                           stop_words="english"),),
                          ("lr", LogisticRegression()),])
sklearn_model.fit(df_train.question_text, df_train.target)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('count_vect',
                 CountVectorizer(min_df=100, ngram_range=(1, 2),
                                 stop_words='english')),
                ('lr', LogisticRegression())])

In [6]:
x_val, y_val = df_val.question_text, df_val.target
print(classification_report(y_val, sklearn_model.predict(x_val)))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97    306376
           1       0.67      0.39      0.49     20155

    accuracy                           0.95    326531
   macro avg       0.82      0.69      0.73    326531
weighted avg       0.94      0.95      0.94    326531
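
The F1 scores in the report follow directly from the precision and recall columns. As a quick check, the sketch below recomputes the class-1 F1 and the macro-averaged F1 from the rounded values printed above, so the results are only accurate to about two decimals. The helper name `f1_score_from` is made up for this illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above
f1_class_1 = f1_score_from(precision=0.67, recall=0.39)
macro_f1 = (0.97 + 0.49) / 2  # unweighted mean of the per-class F1 scores

print(round(f1_class_1, 2))  # 0.49
print(round(macro_f1, 2))    # 0.73
```

The gap between the overall accuracy (0.95) and the class-1 recall (0.39) reflects the class imbalance: only 20,155 of the 326,531 validation rows belong to class 1.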



## <a id="2"> 2. Using Openlayer's Python API</a>

[Back to top](#top)

Now it's time to upload the datasets and model to the Openlayer platform.

In [7]:
!pip install openlayer





### <a id="client">Instantiating the client</a>

In [8]:
import openlayer

openlayer.api.OPENLAYER_ENDPOINT = "https://api-staging.openlayer.com/v1"
# Replace the placeholder with your own API key; avoid committing real keys to shared notebooks
client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")
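
Hard-coding an API key in a shared notebook makes it easy to leak. One common alternative is to read the key from an environment variable. The sketch below assumes a variable named `OPENLAYER_API_KEY`, which is a hypothetical name chosen here, not something the Openlayer client looks for on its own. The helper `get_api_key` is likewise illustrative.

```python
import os


def get_api_key(env=os.environ, var_name="OPENLAYER_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    `OPENLAYER_API_KEY` is a hypothetical variable name used for this sketch.
    """
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return api_key


# Usage (assumes the variable was exported before starting the notebook):
# client = openlayer.OpenlayerClient(get_api_key())
```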

### <a id="project">Creating a project on the platform</a>

In [24]:
from openlayer import TaskType

project = client.create_or_load_project(
    name="Parthib and Vikas: Insincere Questions",
    task_type=TaskType.TextClassification,
    description="Classifying quora questions by sincerity"
)

Found your project. Navigate to https://staging.openlayer.com/openlayer/8f02885f-bc0c-4f79-bc20-23f59b56d470 to see it.


### <a id="dataset">Uploading datasets</a>

Before adding the datasets to a project, we need to do two things:
1. Add the extra columns the platform needs to each dataset, such as a column for labels and one for model predictions (if you're uploading a model as well).
2. Prepare a `dataset_config.yaml` file. This file contains all the information the Openlayer platform needs to use the dataset, including the column names, the class names, etc. For details on the fields of the `dataset_config.yaml` file, see the [API reference](https://reference.openlayer.com/reference/api/openlayer.OpenlayerClient.add_dataset.html#openlayer.OpenlayerClient.add_dataset).

Let's start by enhancing the datasets with the extra columns:

In [25]:
# Adding the column with the predictions (since we'll also upload a model later)
df_train["predictions"] = sklearn_model.predict_proba(df_train['question_text']).tolist()
df_val["predictions"] = sklearn_model.predict_proba(df_val['question_text']).tolist()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train["predictions"] = sklearn_model.predict_proba(df_train['question_text']).tolist()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_val["predictions"] = sklearn_model.predict_proba(df_val['question_text']).tolist()
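
Each entry of the `predictions` column is a plain Python list with one probability per class. The following sketch shows one way to sanity-check such rows before uploading. The helper `is_valid_score_row` is hypothetical, and the values are made up to stand in for the real column.

```python
import math


def is_valid_score_row(scores, n_classes):
    """Check that `scores` looks like one row of class probabilities."""
    return (
        isinstance(scores, list)
        and len(scores) == n_classes
        and all(0.0 <= s <= 1.0 for s in scores)
        and math.isclose(sum(scores), 1.0, abs_tol=1e-6)
    )


# Made-up rows standing in for df_train["predictions"].tolist()
rows = [[0.93, 0.07], [0.12, 0.88]]
print(all(is_valid_score_row(row, n_classes=2) for row in rows))  # True
```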


Now, we can prepare the `dataset_config.yaml` files for the training and validation sets.

In [26]:
# Some variables that will go into the `dataset_config.yaml` file
column_names = list(df_train.columns)
class_names = ["sincere", "insincere"]  # target 0 = sincere, target 1 = insincere
label_column_name = "target"
prediction_scores_column_name = "predictions"
text_column_name = "question_text"

In [27]:
import yaml 

# Note the camelCase for the dict's keys
training_dataset_config = {
    "classNames": class_names,
    "columnNames": column_names,
    "textColumnName": text_column_name,
    "label": "training",
    "labelColumnName": label_column_name,
    "predictionScoresColumnName": prediction_scores_column_name,
}

with open("training_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(training_dataset_config, dataset_config_file, default_flow_style=False)

In [28]:
import copy

validation_dataset_config = copy.deepcopy(training_dataset_config)

# In our case, the only field that changes is the `label`, from "training" -> "validation"
validation_dataset_config["label"] = "validation"

with open("validation_dataset_config.yaml", "w") as dataset_config_file:
    yaml.dump(validation_dataset_config, dataset_config_file, default_flow_style=False)
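
Since the config files are built from plain dicts, a typo in a column name is easy to miss. The sketch below is a small, hypothetical pre-check (`find_config_problems` is not part of the Openlayer API). It only verifies that the keys used in this notebook are present and that the referenced columns appear in `columnNames`.

```python
EXPECTED_KEYS = {
    "classNames",
    "columnNames",
    "textColumnName",
    "label",
    "labelColumnName",
    "predictionScoresColumnName",
}
COLUMN_KEYS = ("textColumnName", "labelColumnName", "predictionScoresColumnName")


def find_config_problems(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    missing = sorted(EXPECTED_KEYS - set(config))
    if missing:
        problems.append(f"missing keys: {missing}")
    columns = config.get("columnNames", [])
    for key in COLUMN_KEYS:
        if key in config and config[key] not in columns:
            problems.append(f"{key}={config[key]!r} is not in columnNames")
    return problems


example_config = {
    "classNames": ["class_a", "class_b"],  # placeholder class names
    "columnNames": ["qid", "question_text", "target", "predictions"],
    "textColumnName": "question_text",
    "label": "validation",
    "labelColumnName": "target",
    "predictionScoresColumnName": "prediction",  # deliberate typo
}
print(find_config_problems(example_config))
# ["predictionScoresColumnName='prediction' is not in columnNames"]
```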

In [29]:
# Training set
project.add_dataframe(
    dataset_df=df_train,
    dataset_config_file_path="training_dataset_config.yaml",
)

Found an existing `training` resource staged.
Do you want to overwrite it? [y/n] y
Overwriting previously staged `training` resource...
Staged the `training` resource!


In [30]:
# Validation set
project.add_dataframe(
    dataset_df=df_val,
    dataset_config_file_path="validation_dataset_config.yaml",
)

Staged the `validation` resource!


We can check that both datasets are now staged using the `project.status()` method. 

In [31]:
project.status()

The following resources are staged, waiting to be committed:
	 - training
	 - model
	 - validation
Use the `commit` method to add a commit message to your changes.


### <a id="model">Uploading models</a>

When it comes to uploading models to the Openlayer platform, there are two options:

- The first one is to upload a **shell model**. Shell models are the most straightforward way to get started. They consist only of metadata, and all of the analysis is done via their predictions (which are [uploaded with the datasets](#dataset)).
- The second one is to upload a **full model**, with artifacts. Once a full model is uploaded, it becomes available in the platform, which makes it possible to perform what-if analysis, use all the explainability techniques available, and run a series of robustness assessments with it. 

#### <a id="shell">Shell models</a>

To upload a shell model, we only need to define its name, the architecture type, and add some metadata that will be rendered in the platform to help us identify it. This information should be saved to a `model_config.yaml` file.

Let's create a `model_config.yaml` file for our model:

In [32]:
import yaml

model_config = {
    "name": "Sentiment analysis model",
    "architectureType": "sklearn",
    "metadata": {  # Can add anything here, as long as it is a dict
        "model_type": "Logistic Regression",
        "regularization": "L2 (scikit-learn default, C=1.0)",
    },
    "classNames": class_names,
}

with open("model_config.yaml", "w") as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)

In [33]:
project.add_model(
    model_config_file_path="model_config.yaml",
)

Found an existing `model` resource staged.
Do you want to overwrite it? [y/n] y
Overwriting previously staged `model` resource...
Staged the `model` resource!


We can check that both datasets and model are staged using the `project.status()` method.

In [34]:
project.status()

The following resources are staged, waiting to be committed:
	 - training
	 - model
	 - validation
Use the `commit` method to add a commit message to your changes.


Since we're interested in uploading a full model in this example, let's unstage the shell model:

In [None]:
project.restore("model")

#### <a id="full-model"> Full models </a>

To upload a full model to Openlayer, you will need to create a model package, which is nothing more than a folder with all the necessary information to run inference with the model. The package should include the following:
1. A `requirements.txt` file listing the dependencies for the model.
2. Serialized model files, such as model weights, encoders, etc., in a format specific to the framework used for training (e.g. `.pkl` for sklearn, `.pb` for TensorFlow, and so on.)
3. A `prediction_interface.py` file that acts as a wrapper for the model and implements the `predict_proba` function. 

In addition to the model package, a `model_config.yaml` file is needed. It provides the Openlayer platform with information about the model, such as the framework used, feature names, and categorical feature names.

Let's prepare the model package one piece at a time.

In [None]:
# Creating the model package folder (we'll call it `model_package`)
!mkdir model_package

**1. Adding the `requirements.txt` to the model package**

In [None]:
!cp requirements.txt model_package

**2. Serializing the model and other objects needed**

In [None]:
import pickle 

# Trained model pipeline
with open('model_package/model.pkl', 'wb') as handle:
    pickle.dump(sklearn_model, handle, protocol=pickle.HIGHEST_PROTOCOL)

**3. Writing the `prediction_interface.py` file**

In [None]:
%%writefile model_package/prediction_interface.py

import pickle
from pathlib import Path

import pandas as pd

PACKAGE_PATH = Path(__file__).parent


class SklearnModel:
    def __init__(self):
        """This is where the serialized objects needed should
        be loaded as class attributes."""

        with open(PACKAGE_PATH / "model.pkl", "rb") as model_file:
            self.model = pickle.load(model_file)

    def predict_proba(self, input_data_df: pd.DataFrame):
        """Makes predictions with the model. Returns the class probabilities."""
        text_column = input_data_df.columns[0]
        return self.model.predict_proba(input_data_df[text_column])


def load_model():
    """Function that returns the wrapped model object."""
    return SklearnModel()

**Creating the `model_config.yaml`**

In [None]:
import yaml 

model_config = {
    "name": "Sentiment analysis model",
    "architectureType": "sklearn",
    "classNames": class_names,
}

with open('model_config.yaml', 'w') as model_config_file:
    yaml.dump(model_config, model_config_file, default_flow_style=False)
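
At this point the `model_package` folder should hold the three pieces listed earlier. A minimal local pre-check could look like the sketch below. It uses a hypothetical helper, `missing_package_files`, and assumes the model is serialized as a `.pkl` file. It only checks that the files exist and is not a substitute for Openlayer's own validator.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("requirements.txt", "prediction_interface.py")


def missing_package_files(package_dir, expected=EXPECTED_FILES, model_glob="*.pkl"):
    """Return the expected file names (or glob pattern) not found in `package_dir`."""
    package_path = Path(package_dir)
    missing = [name for name in expected if not (package_path / name).is_file()]
    if not any(package_path.glob(model_glob)):
        missing.append(model_glob)
    return missing


# Demo on a throwaway folder that only contains a requirements file
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "requirements.txt").write_text("scikit-learn\n")
    print(missing_package_files(tmp_dir))  # ['prediction_interface.py', '*.pkl']
```

In the notebook, the call would be `missing_package_files("model_package")`, which should return an empty list once all three pieces are in place.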

Let's check that the model package contains everything needed:

In [None]:
from openlayer.validators import model_validators

model_validator = model_validators.ModelValidator(
    model_package_dir="model_package",
    model_config_file_path="model_config.yaml",
    sample_data=df_val[["question_text"]].iloc[:10, :],  # the text column is `question_text`
)
model_validator.validate()

Now, we are ready to add the model:

In [None]:
project.add_model(
    model_package_dir="model_package",
    model_config_file_path="model_config.yaml",
    sample_data=df_val[["question_text"]].iloc[:10, :],  # the text column is `question_text`
)

We can check that both datasets and model are staged using the `project.status()` method.

In [35]:
project.status()

The following resources are staged, waiting to be committed:
	 - training
	 - model
	 - validation
Use the `commit` method to add a commit message to your changes.


### <a id="commit"> Committing and pushing to the platform </a>

Finally, we can commit the first project version to the platform. 

In [36]:
project.commit("Initial commit!")

Committed!


In [37]:
project.status()

The following resources are committed, waiting to be pushed:
	 - training
	 - model
	 - validation
Commit message from Wed May 10 22:12:43 2023:
	 Initial commit!
Use the `push` method to push your changes to the platform.


In [38]:
project.push()

Pushing changes to the platform with the commit message: 
	 - Message: Initial commit! 
	 - Date: Wed May 10 22:12:43 2023


100%|█████████████████████████████████████████████████████████████████████████████| 84.3M/84.3M [01:46<00:00, 826kB/s]


Pushed!
