<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/scikit_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SCIKIT-LLM
Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.

# SETUP

In [None]:
%%capture
!pip install scikit-llm watermark


In [None]:
%load_ext watermark
%watermark -a "Sudarshan Koirala" -vmp scikit-llm

Author: Sudarshan Koirala

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

scikit-llm: not installed

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.15.107+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit



In [None]:
# Import SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig

# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")

# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")

# OPENAI

## Zero-Shot Text Classification
One of ChatGPT's most useful features is its ability to classify text without being retrained. All it needs is a set of descriptive labels.

`ZeroShotGPTClassifier` lets you create such a model as a regular scikit-learn classifier.
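To make the idea concrete, here is a minimal sketch of how a zero-shot prompt could be assembled from nothing but the candidate labels. The `build_zero_shot_prompt` helper, the prompt wording, and the sample review are assumptions made for illustration; they are not Scikit-LLM's actual template or data.

```python
def build_zero_shot_prompt(text, labels):
    """Assemble a classification prompt from candidate labels (hypothetical template)."""
    label_list = ", ".join(f"'{label}'" for label in labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


# Illustrative review, not taken from the dataset
prompt = build_zero_shot_prompt(
    "The film was fine, nothing more.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

The only task-specific input is the label list, which is why descriptive label names matter so much in the zero-shot setting.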

### Training as a regular classifier

In [None]:
# Import the ZeroShotGPTClassifier class and the sample classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

In [None]:
# sentiment analysis dataset
# labels: positive, negative, neutral
X, y = get_classification_dataset()

In [None]:
len(X)

30

In [None]:
X

["I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.",
 "The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.",
 "'The Lost Symphony' was a masterclass in character development and storytelling. The score was hauntingly beautiful and complimented the intense, emotional scenes perfectly. Kudos to the director and cast for creating such a masterpiece.",
 "I was pleasantly surprised by 'Love in the Time of Cholera'. The romantic storyline was heartwarming and the characters were incredibly realistic. The cinematography was also top-notch. A must-watch for all romance lovers.",
 "I went into 'Marble Street' with low expectations, but I was pleasantly surprised. The suspense was well-maint

In [None]:
y

['positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'negative',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral',
 'neutral']

In [None]:
def subset_of_data(data):
    """Return two items from each class (the dataset is ordered by label)."""
    subset_1 = data[:2]     # first two of items 1-10 (positive)
    subset_2 = data[10:12]  # first two of items 11-20 (negative)
    subset_3 = data[20:22]  # first two of items 21-30 (neutral)

    combined_data = subset_1 + subset_2 + subset_3
    return combined_data

In [None]:
X_subset = subset_of_data(X)
X_subset

["I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.",
 "The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.",
 "I was thoroughly disappointed with 'Silver Shadows'. The plot was confusing and the performances were lackluster. I wouldn't recommend wasting your time on this one.",
 "'The Darkened Path' was a disaster. The storyline was unoriginal, the acting was wooden and the special effects were laughably bad. Save your money and skip this one.",
 "'Remember the Days' was utterly forgettable. The storyline was dull, the performances were bland, and the dialogue was cringeworthy. A big disappointment.",
 "'The Last Frontier' was simply okay. The plot was decent and the performances w

In [None]:
y_subset = subset_of_data(y)
y_subset

['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']

In [None]:
# Choose the OpenAI model to use
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# No model is retrained here; fit() only needs the labels found in y_subset
clf.fit(X_subset, y_subset)

In [None]:
# predicting the data
labels = clf.predict(X_subset)

100%|██████████| 6/6 [00:06<00:00,  1.02s/it]


In [None]:
labels

['positive', 'positive', 'negative', 'negative', 'negative', 'neutral']
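A quick way to check these predictions is to compare them with `y_subset`. The sketch below copies both lists from the outputs above and computes plain accuracy without extra dependencies; on real data, `sklearn.metrics.accuracy_score` would give the same number.

```python
# Values copied from the notebook outputs above
y_true = ['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']
y_pred = ['positive', 'positive', 'negative', 'negative', 'negative', 'neutral']

# Count positions where the predicted label matches the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"{correct}/{len(y_true)} correct, accuracy = {accuracy:.2f}")
# prints: 5/6 correct, accuracy = 0.83
```

The single miss is the first neutral review, which the model labeled as negative.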

- Scikit-LLM will automatically query the OpenAI API and transform the response into a regular list of labels.
- Scikit-LLM will ensure that the obtained response contains a valid label. If this is not the case, a label will be selected randomly (label probabilities are proportional to label occurrences in the training set).
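The fallback described in the second point can be sketched as follows. This is an illustration of frequency-weighted random selection, not Scikit-LLM's own code: the helper names `validate_label` and `fallback_label` are hypothetical, and lowercasing the response before checking it is an added assumption.

```python
import random
from collections import Counter


def fallback_label(training_labels, rng=random):
    """Pick a label at random, weighted by its frequency in the training labels."""
    counts = Counter(training_labels)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def validate_label(response, training_labels, rng=random):
    """Return the response if it is a known label, otherwise draw a fallback label."""
    candidate = response.strip().lower()  # assumption: normalize before comparing
    if candidate in set(training_labels):
        return candidate
    return fallback_label(training_labels, rng)


y_train = ['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']
print(validate_label("Negative", y_train))       # valid after normalization
print(validate_label("I am not sure", y_train))  # invalid, so a label is drawn at random
```

With a balanced training set such as `y_subset`, each label is equally likely to be drawn; with a skewed one, frequent labels are drawn more often.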

## Training without labels (what if you don't have labeled data?)
- You don't even need labeled data to fit the model; the candidate labels are enough.

In [None]:
# defining the model
clf_no_label = ZeroShotGPTClassifier()

# No training data: pass None for X and only the candidate labels for y
clf_no_label.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels
labels = clf_no_label.predict(X_subset)
labels

100%|██████████| 6/6 [00:06<00:00,  1.01s/it]


['positive', 'positive', 'negative', 'negative', 'negative', 'neutral']

- You can train a classifier without explicitly labeled data, simply by specifying the potential labels.
- Labels have to be expressed in natural language and be descriptive and self-explanatory.
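Why `fit(None, labels)` is enough can be seen in a toy estimator. The class below is an illustrative stand-in, not Scikit-LLM code: `ToyZeroShotClassifier`, `ask_model`, and `fake_model` are hypothetical names, and the "model" is a keyword stub instead of a real LLM call.

```python
class ToyZeroShotClassifier:
    """Illustrative stand-in for a zero-shot estimator; not Scikit-LLM code."""

    def __init__(self, ask_model):
        # ask_model(text, labels) -> str is a hypothetical callable wrapping an LLM
        self.ask_model = ask_model

    def fit(self, X, y):
        # X is ignored: only the distinct labels in y are remembered, in order
        self.classes_ = list(dict.fromkeys(y))
        return self

    def predict(self, X):
        return [self.ask_model(text, self.classes_) for text in X]


def fake_model(text, labels):
    # Stub instead of a real LLM: return the first label mentioned in the text,
    # or the last candidate label if none is mentioned
    return next((label for label in labels if label in text.lower()), labels[-1])


clf = ToyZeroShotClassifier(fake_model).fit(None, ['positive', 'negative', 'neutral'])
print(clf.classes_)
print(clf.predict(["A positive surprise.", "Nothing special."]))
```

Since fitting only collects the label set, passing a labeled dataset or a bare list of candidate labels leads to the same candidate classes.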

### <font color="orange">What about the multi-label case? Scikit-LLM also offers multi-label zero-shot text classification, which likewise works with or without labeled data.</font>

# GPT4ALL
- Same data, this time with one of the models from GPT4ALL.
- When running the first time, the model will be downloaded automatically.

In [None]:
%%capture
!pip install "scikit-llm[gpt4all]"

- To switch from an OpenAI model to a GPT4ALL model, simply provide a string of the format `gpt4all::<model_name>` as the model argument.
- Although the model runs entirely locally, the estimator still treats it as an OpenAI endpoint and checks that an API key is present, so you can provide any string as the key.
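The `gpt4all::<model_name>` convention can be read as a backend prefix followed by a model name. The sketch below shows one way such a string could be split; `split_model_string` is a hypothetical helper written for illustration, and it makes no claim about how Scikit-LLM parses the argument internally.

```python
def split_model_string(model):
    """Split 'backend::name' into (backend, name); plain names default to 'openai'."""
    backend, sep, name = model.partition("::")
    if not sep:
        # No '::' separator, so treat the whole string as an OpenAI model name
        return "openai", model
    return backend, name


print(split_model_string("gpt-3.5-turbo"))
# ('openai', 'gpt-3.5-turbo')
print(split_model_string("gpt4all::ggml-gpt4all-j-v1.3-groovy"))
# ('gpt4all', 'ggml-gpt4all-j-v1.3-groovy')
```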

In [None]:
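from skllm import ZeroShotGPTClassifier
from skllm.config import SKLLMConfig

# Placeholder values: the estimator only checks that a key is set
SKLLMConfig.set_openai_key("any string")
SKLLMConfig.set_openai_org("any string")

# The "gpt4all::" prefix selects a local GPT4ALL model instead of an OpenAI model
clf_gpt4all = ZeroShotGPTClassifier(openai_model="gpt4all::ggml-gpt4all-j-v1.3-groovy")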

In [None]:
from skllm.datasets import get_classification_dataset
X, y = get_classification_dataset()


def subset_of_data(data):
    """Return two items from each class (the dataset is ordered by label)."""
    subset_1 = data[:2]     # first two of items 1-10 (positive)
    subset_2 = data[10:12]  # first two of items 11-20 (negative)
    subset_3 = data[20:22]  # first two of items 21-30 (neutral)

    combined_data = subset_1 + subset_2 + subset_3
    return combined_data

In [None]:
X_subset = subset_of_data(X)
X_subset

["I was absolutely blown away by the performances in 'Summer's End'. The acting was top-notch, and the plot had me gripped from start to finish. A truly captivating cinematic experience that I would highly recommend.",
 "The special effects in 'Star Battles: Nebula Conflict' were out of this world. I felt like I was actually in space. The storyline was incredibly engaging and left me wanting more. Excellent film.",
 "I was thoroughly disappointed with 'Silver Shadows'. The plot was confusing and the performances were lackluster. I wouldn't recommend wasting your time on this one.",
 "'The Darkened Path' was a disaster. The storyline was unoriginal, the acting was wooden and the special effects were laughably bad. Save your money and skip this one.",
 "'Remember the Days' was utterly forgettable. The storyline was dull, the performances were bland, and the dialogue was cringeworthy. A big disappointment.",
 "'The Last Frontier' was simply okay. The plot was decent and the performances w

In [None]:
y_subset = subset_of_data(y)
y_subset

['positive', 'positive', 'negative', 'negative', 'neutral', 'neutral']

In [None]:
# fitting the data
clf_gpt4all.fit(X_subset, y_subset)

In [None]:
# predicting the labels
labels = clf_gpt4all.predict(X_subset)
labels

  0%|          | 0/6 [00:00<?, ?it/s]
  0%|          | 0.00/3.79G [00:00<?, ?iB/s]

Model downloaded at:  /root/.cache/gpt4all/ggml-gpt4all-j-v1.3-groovy.bin


 17%|█▋        | 1/6 [06:41<33:26, 401.35s/it]
 33%|███▎      | 2/6 [10:53<20:55, 313.80s/it]
 50%|█████     | 3/6 [15:25<14:43, 294.55s/it]
 67%|██████▋   | 4/6 [21:21<10:37, 318.96s/it]
 83%|████████▎ | 5/6 [27:57<05:46, 346.55s/it]
100%|██████████| 6/6 [35:21<00:00, 353.60s/it]

['positive', 'positive', 'negative', 'negative', 'negative', 'negative']

Now assume we don't have labels for the data.

In [None]:
# defining the model
clf_gpt4all_no_label = ZeroShotGPTClassifier(openai_model="gpt4all::ggml-gpt4all-j-v1.3-groovy")

# No training data: pass None for X and only the candidate labels for y
clf_gpt4all_no_label.fit(None, ['positive', 'negative', 'neutral'])

In [None]:
%%time
# predicting the labels
labels = clf_gpt4all_no_label.predict(X_subset)
labels

 17%|█▋        | 1/6 [19:43<1:38:39, 1183.83s/it]
 33%|███▎      | 2/6 [26:44<49:00, 735.00s/it]
 50%|█████     | 3/6 [34:31<30:38, 612.77s/it]
 67%|██████▋   | 4/6 [54:11<27:53, 836.68s/it]
 83%|████████▎ | 5/6 [1:00:58<11:21, 681.72s/it]
100%|██████████| 6/6 [1:08:38<00:00, 686.39s/it]

CPU times: user 1h 59min 59s, sys: 5.3 s, total: 2h 4s
Wall time: 1h 8min 38s

['positive', 'positive', 'negative', 'negative', 'negative', 'negative']
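To put the runtimes in perspective, the sketch below converts the elapsed times shown in the progress bars above into seconds per review. The totals are copied from this notebook's outputs (a 2-core Colab CPU); the first GPT4ALL run appears to include the one-time model download, so treat the numbers as rough indications only. The `to_seconds` helper is introduced here for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an elapsed time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


n_items = 6  # reviews in X_subset
runs = {
    "OpenAI gpt-3.5-turbo": to_seconds(seconds=6),                      # 00:06
    "GPT4ALL, with labels": to_seconds(minutes=35, seconds=21),          # 35:21
    "GPT4ALL, labels only": to_seconds(hours=1, minutes=8, seconds=38),  # 1:08:38
}

for name, total in runs.items():
    print(f"{name}: {total} s total, {total / n_items:.1f} s per review")
```

On this hardware the local model takes minutes per review where the OpenAI API takes about a second.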