This notebook contains some experiments with the scikit-llm package: https://github.com/iryna-kondr/scikit-llm

In [1]:
# Importing packages
import os
import random

from skllm.config import SKLLMConfig
from dotenv import load_dotenv

from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset

from sklearn.metrics import accuracy_score

In [2]:
# Setting the API key from the environment:
load_dotenv()
SKLLMConfig.set_openai_key(os.getenv("OPENAI_API_KEY"))

## Zero Shot GPTClassifier
Sentiment classifier

In [93]:
# Loading the default dataset
X, y = get_classification_dataset()

In [20]:
len(X)

30

In [14]:
random.sample(X, 5)

["I found 'After the Rain' to be pretty average. The plot was okay and the performances were decent, but it didn't leave a lasting impression on me.",
 "'The Darkened Path' was a disaster. The storyline was unoriginal, the acting was wooden and the special effects were laughably bad. Save your money and skip this one.",
 "I thought 'The Scent of Roses' was pretty average. The plot was somewhat engaging, and the performances were okay, but it didn't live up to my expectations.",
 "I was thoroughly disappointed with 'Silver Shadows'. The plot was confusing and the performances were lackluster. I wouldn't recommend wasting your time on this one.",
 "The screenwriting in 'Under the Willow Tree' was superb. The dialogue felt real and the characters were well-rounded. The performances were also fantastic. I haven't enjoyed a movie this much in a while."]

In [15]:
random.sample(y,5)

['neutral', 'positive', 'negative', 'negative', 'neutral']

In [122]:
# Splitting the data into training and test sets
X_train = X[:21]
y_train = y[:21]
X_test = X[21:]
y_test = y[21:]

In [22]:
# Defining the model:
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# Fitting the model. For a zero-shot classifier this does not train anything;
# it only collects the candidate labels from y_train:
clf.fit(X_train, y_train)

In [23]:
# Predicting labels for the test data
labels = clf.predict(X_test)

100%|█████████████████████████████████████████████| 9/9 [00:09<00:00,  1.01s/it]


In [29]:
# Function that compares the predicted labels with the true labels:
def compare_results(predict, y_test):
    # True where the prediction matches the true label, False otherwise
    return [pred == true for pred, true in zip(predict, y_test)]

# Accuracy is computed further down with sklearn's accuracy_score

In [30]:
results = compare_results(labels, y_test)

In [31]:
results

[True, True, True, True, False, False, True, True, True]

In [39]:
accuracy_score(y_test, labels)

0.7777777777777778
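The accuracy reported above is simply the fraction of `True` entries in the `results` list. As a minimal sketch, assuming the nine boolean values printed earlier, the same number can be recovered without scikit-learn (the `accuracy` helper is a hypothetical name introduced here for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = [true == pred for true, pred in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


# Boolean comparison copied from the notebook output above.
results = [True, True, True, True, False, False, True, True, True]

# True counts as 1 and False as 0, so the mean of the list is the accuracy.
fraction_correct = sum(results) / len(results)
print(fraction_correct)
```

Seven of the nine predictions match, which is where the value of roughly 0.78 comes from.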

In [47]:
clf.predict(['All the product are good, but I dont know how i am feeling about'])

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.12it/s]


['neutral']

## When the sentences are not paired with labels

In [89]:
X2, _ = get_classification_dataset()


In [90]:
clf2 = ZeroShotGPTClassifier()
clf2.fit(None,['positive', 'negative', 'neutral'])

In [91]:
labels2 = clf2.predict(X2)

100%|███████████████████████████████████████████| 30/30 [00:24<00:00,  1.20it/s]


In [95]:
accuracy_score(y, labels2)

0.9

We can set up a classifier without explicitly labeled data, simply by specifying the candidate labels.

## Multi-label classification


In [64]:
# Loading packages:
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset


In [65]:
# Loading the data:
X3, y3 = get_multilabel_classification_dataset()


In [66]:
X3

['The product was of excellent quality, and the packaging was also very good. Highly recommend!',
 'The delivery was super fast, but the product did not match the information provided on the website.',
 'Great variety of products, but the customer support was quite unresponsive.',
 'Affordable prices and an easy-to-use website. A great shopping experience overall.',
 'The delivery was delayed, and the packaging was damaged. Not a good experience.',
 'Excellent customer support, but the return policy is quite complicated.',
 'The product was not as described. However, the return process was easy and quick.',
 'Great service and fast delivery. The product was also of high quality.',
 'The prices are a bit high. However, the product quality and user experience are worth it.',
 'The website provides detailed information about products. The delivery was also very fast.']

In [67]:
y3

[['Quality', 'Packaging'],
 ['Delivery', 'Product Information'],
 ['Product Variety', 'Customer Support'],
 ['Price', 'User Experience'],
 ['Delivery', 'Packaging'],
 ['Customer Support', 'Return Policy'],
 ['Product Information', 'Return Policy'],
 ['Service', 'Delivery', 'Quality'],
 ['Price', 'Quality', 'User Experience'],
 ['Product Information', 'Delivery']]

In [72]:
# Defining the model:
clf3 = MultiLabelZeroShotGPTClassifier(max_labels=2)

In [73]:
# Fitting the model:
clf3.fit(X3, y3)

In [74]:
labels3 = clf3.predict(X3)

100%|███████████████████████████████████████████| 10/10 [00:10<00:00,  1.02s/it]


In [75]:
labels3

[['Quality', 'Packaging'],
 ['Delivery', 'Product Information'],
 ['Product Variety', 'Customer Support'],
 ['Price', 'User Experience'],
 ['Delivery', 'Packaging'],
 ['Customer Support', 'Return Policy'],
 ['Product Information', 'Return Policy'],
 ['Delivery', 'Quality'],
 ['Price', 'Quality'],
 ['Product Information', 'Delivery']]

## When the multi-labels are not provided:

In [76]:
X3, _ = get_multilabel_classification_dataset()


In [77]:
# Defining the candidate labels to predict:
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]

In [83]:
# Creating the model:
clf3 = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Fitting with the candidate labels only
clf3.fit(None, [candidate_labels])

In [84]:
labels3 = clf3.predict(X3)

100%|███████████████████████████████████████████| 10/10 [00:09<00:00,  1.04it/s]


In [85]:
y3

[['Quality', 'Packaging'],
 ['Delivery', 'Product Information'],
 ['Product Variety', 'Customer Support'],
 ['Price', 'User Experience'],
 ['Delivery', 'Packaging'],
 ['Customer Support', 'Return Policy'],
 ['Product Information', 'Return Policy'],
 ['Service', 'Delivery', 'Quality'],
 ['Price', 'Quality', 'User Experience'],
 ['Product Information', 'Delivery']]

In [86]:
labels3

[['Quality'],
 ['Delivery'],
 ['Product Variety', 'Service'],
 ['Price', 'Service'],
 ['Delivery', 'Quality'],
 ['Service'],
 ['Quality', 'Service'],
 ['Service', 'Delivery', 'Quality'],
 ['Quality', 'Price'],
 ['Product Variety', 'Delivery']]
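Comparing `y3` and `labels3` by eye is awkward because each sample carries a set of labels. One option, offered here as a sketch and not as something the notebook computed, is the mean Jaccard similarity per sample (size of the intersection divided by size of the union). The two lists below are copied from the outputs above; the function name `jaccard` is a hypothetical name introduced for illustration.

```python
def jaccard(true_labels, predicted_labels):
    """Jaccard similarity between two label collections, treated as sets."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    union = true_set | pred_set
    if not union:
        return 1.0  # both empty: treat as a perfect match
    return len(true_set & pred_set) / len(union)


# Reference labels (y3) and predictions (labels3) copied from the cells above.
y3 = [
    ['Quality', 'Packaging'],
    ['Delivery', 'Product Information'],
    ['Product Variety', 'Customer Support'],
    ['Price', 'User Experience'],
    ['Delivery', 'Packaging'],
    ['Customer Support', 'Return Policy'],
    ['Product Information', 'Return Policy'],
    ['Service', 'Delivery', 'Quality'],
    ['Price', 'Quality', 'User Experience'],
    ['Product Information', 'Delivery'],
]
labels3 = [
    ['Quality'],
    ['Delivery'],
    ['Product Variety', 'Service'],
    ['Price', 'Service'],
    ['Delivery', 'Quality'],
    ['Service'],
    ['Quality', 'Service'],
    ['Service', 'Delivery', 'Quality'],
    ['Quality', 'Price'],
    ['Product Variety', 'Delivery'],
]

scores = [jaccard(true, pred) for true, pred in zip(y3, labels3)]
mean_jaccard = sum(scores) / len(scores)
print(round(mean_jaccard, 3))
```

The score is limited by design here: the candidate list contains only five labels, so reference labels such as 'Packaging' or 'Return Policy' can never be predicted.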

## Text Vectorization

Text vectorization is the process of converting text into numbers for later analysis.

In [102]:
from skllm.preprocessing import GPTVectorizer


In [103]:
# Defining the model
model = GPTVectorizer()


In [104]:
# Applying the transformation (text vectorization):
vectors = model.fit_transform(X)

100%|███████████████████████████████████████████| 30/30 [00:13<00:00,  2.21it/s]


In [105]:
vectors

array([[-1.08071305e-02,  8.25319032e-04,  6.32961188e-03, ...,
        -1.49337249e-02, -7.62059426e-05, -4.83753905e-02],
       [-7.46790506e-03, -1.57088041e-02, -1.33570693e-02, ...,
        -1.55908894e-02, -1.83553249e-02, -3.60293686e-02],
       [ 7.93671049e-03, -1.18406452e-02,  1.95840914e-02, ...,
         4.15195618e-03, -1.71231963e-02, -3.01491935e-02],
       ...,
       [-6.81832340e-03, -1.20293340e-02,  1.34453001e-02, ...,
        -3.72488401e-03, -9.90538485e-03, -2.87530422e-02],
       [-1.88294612e-02, -2.02019094e-03,  1.10821901e-02, ...,
        -1.68541633e-02, -1.12681761e-02, -1.94066595e-02],
       [-2.28717495e-02, -6.95682364e-03,  1.82211604e-02, ...,
        -3.40789035e-02, -2.81830765e-02, -1.62643548e-02]])

Calling fit_transform() above, GPTVectorizer() turns each sentence into a numeric vector of fixed dimension. These vectors can then be fed into other tools. The example below combines them with the XGBoost classifier.
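Because every sentence maps to a vector of the same length, the vectors can also be compared directly. A common choice is cosine similarity. The sketch below uses small hypothetical vectors (real embeddings have far more dimensions) and a helper name, `cosine_similarity`, that is introduced here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for rows of the `vectors` array.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Values close to 1 indicate vectors pointing the same way, and values close to 0 indicate unrelated directions.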

In [108]:
# Importing packages:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

In [123]:
# Creating an instance of the LabelEncoder class
le = LabelEncoder()

# Encoding the training labels, y_train:
y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels, y_test:
y_test_encoded = le.transform(y_test)

In [124]:
# Defining the list of pipeline steps
steps = [('GPT', GPTVectorizer()),
         ('Clf', XGBClassifier())]

In [125]:
# Building the pipeline from the steps above:
clf = Pipeline(steps)

# Fitting the pipeline
clf.fit(X_train, y_train_encoded)

100%|███████████████████████████████████████████| 21/21 [00:08<00:00,  2.37it/s]


In [126]:
# Making predictions:
labels4 = clf.predict(X_test)

100%|█████████████████████████████████████████████| 9/9 [00:03<00:00,  2.90it/s]


In [127]:
accuracy_score(y_test_encoded, labels4)

0.2222222222222222

In [131]:
X_test

["'The Last Frontier' was simply okay. The plot was decent and the performances were acceptable. However, it lacked a certain spark to make it truly memorable.",
 "'Through the Storm' was not bad, but it wasn't great either. The storyline was somewhat predictable, and the characters were somewhat stereotypical. It was an average movie at best.",
 "I found 'After the Rain' to be pretty average. The plot was okay and the performances were decent, but it didn't leave a lasting impression on me.",
 "'Beyond the Horizon' was neither good nor bad. The plot was interesting enough, but the characters were not very well developed. It was an okay watch.",
 "'The Silent Echo' was a mediocre movie. The storyline was passable and the performances were fair, but it didn't stand out in any way.",
 "I thought 'The Scent of Roses' was pretty average. The plot was somewhat engaging, and the performances were okay, but it didn't live up to my expectations.",
 "'Under the Same Sky' was an okay movie. The 

In [134]:
y_test_encoded

array([1, 1, 1, 1, 1, 1, 1, 1, 1])

In [133]:
labels4

array([2, 1, 2, 0, 2, 0, 0, 1, 0])
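The encoded test labels above are all the same class, which suggests that the sequential slice `X[21:]` left the training set with very few examples of that class. That may explain the low accuracy. One way to avoid this is a shuffled, stratified split. The sketch below is pure Python with toy data standing in for the real dataset; the function name `stratified_split` and the grouped-by-class ordering of the toy labels are assumptions made for illustration. scikit-learn's `train_test_split` with the `stratify` argument offers the same idea.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle within each class, then send the same fraction of every class to the test set."""
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    X_train, y_train = pick(train_idx)
    X_test, y_test = pick(test_idx)
    return X_train, X_test, y_train, y_test


# Toy stand-in (assumption): 10 samples per class, stored grouped by class.
toy_y = ["positive"] * 10 + ["negative"] * 10 + ["neutral"] * 10
toy_X = [f"review {i}" for i in range(len(toy_y))]

X_tr, X_te, y_tr, y_te = stratified_split(toy_X, toy_y, test_fraction=0.3, seed=42)
print(sorted(set(y_te)))
```

With this split, every class appears in both the training and the test portion, so the downstream classifier sees examples of each label before being evaluated on it.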

## Text Summarization

Text summarization is the process of shortening a text while keeping the most important information and dropping secondary or redundant details. The result is a condensed version that is quicker and easier to read but still conveys the essence of the original message.

In [7]:
# Importing packages:
from skllm.preprocessing import GPTSummarizer

# Importing data
from skllm.datasets import get_summarization_dataset

In [137]:
# Loading the data:
X = get_summarization_dataset()

In [138]:
X

['The AI research company, OpenAI, has launched a new language model called GPT-4. This model is the latest in a series of transformer-based AI systems designed to perform complex tasks, such as generating human-like text, translating languages, and answering questions. According to OpenAI, GPT-4 is even more powerful and versatile than its predecessors.',
 'John went to the grocery store in the morning to prepare for a small get-together at his house. He bought fresh apples, juicy oranges, and a bottle of milk. Once back home, he used the apples and oranges to make a delicious fruit salad, which he served to his guests in the evening.',
 'The first Mars rover, named Sojourner, was launched by NASA in 1996. The mission was a part of the Mars Pathfinder project and was a major success. The data Sojourner provided about Martian terrain and atmosphere greatly contributed to our understanding of the Red Planet.',
 'A new study suggests that regular exercise can improve memory and cognitive

In [139]:
# Defining the model:
summary = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

In [140]:
# Applying the model to the data:
summaries = summary.fit_transform(X)

100%|███████████████████████████████████████████| 10/10 [00:12<00:00,  1.24s/it]


In [141]:
summaries

array(['OpenAI has released GPT-4, a powerful and versatile language model for complex tasks.',
       'John bought groceries in the morning and made a fruit salad for his guests in the evening.',
       "NASA's first Mars rover, Sojourner, launched in 1996, greatly contributed to our understanding of the Red Planet.",
       'Regular exercise improves memory and cognitive function in older adults, recommends 30 minutes daily.',
       'The Eiffel Tower, completed in 1889, is a beloved symbol of Paris and French architecture.',
       'Microsoft announces new version of Windows with improved security and redesigned user interface.',
       'WHO declares global public health emergency due to unknown virus outbreak, urges nations to strengthen response systems.',
       "Paris, France will host the 2024 Olympics, marking the city's third time hosting the games.",
       "Apple's latest iPhone model features improved camera, faster processor, longer battery life, launching soon.",
       

## Other tests with GPTSummarizer

Texts taken from the Embrapa guide on sweet potato production: https://www.embrapa.br/documents/1355126/8971369/Sistema+de+Produ%C3%A7%C3%A3o+de+Batata-Doce.pdf/4632fe60-0c35-71af-79cc-7c15a01680c9

In [31]:
text = ["""
A formação da raiz tuberosa pode começar quatro semanas após o plantio, embora o mais comum seja entre 4 a 6 semanas, dependendo da cultivar e das
condições ambientais. A presença de condições ambientais favoráveis durante o primeiro mês após o plantio é de vital importância para o início da
formação das raízes tuberosas. Sete semanas após o plantio, 80% das raízes tuberosas estarão formadas e entre 8 a 2 semanas após o plantio a planta
deixará de formar novas raízes tuberosas. Depois disso, toda a energia é direcionada para o engrossamento das raízes tuberosas. Quando são formadas muitas raízes tuberosas por planta, normalmente o peso da raiz é baixo, enquanto poucas raízes por planta normalmente resultam em raízes maiores.
"""]
n_words = 20

summary = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=n_words)
summaries = summary.fit_transform(text)
summaries

100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.88s/it]


array(['Root tuber formation in plants typically begins 4-6 weeks after planting, with favorable environmental conditions being crucial. After 8-12 weeks, no new tuber roots are formed, and energy is directed towards thickening existing roots.'],
      dtype=object)
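The summary above looks longer than the 20 words requested, which suggests that `max_words` behaves as a soft target and not as a hard cap. A quick check, sketched below with a hypothetical helper named `count_words`, is to count whitespace-separated tokens in each summary and flag the ones that exceed the limit. The summary string is copied from the output above.

```python
def count_words(text):
    """Number of whitespace-separated tokens in a string."""
    return len(text.split())


n_words = 20  # same limit passed to GPTSummarizer in the cell above

# Summary copied from the notebook output.
summaries = [
    "Root tuber formation in plants typically begins 4-6 weeks after planting, "
    "with favorable environmental conditions being crucial. After 8-12 weeks, "
    "no new tuber roots are formed, and energy is directed towards thickening "
    "existing roots."
]

word_counts = [count_words(summary) for summary in summaries]
over_limit = [count > n_words for count in word_counts]
print(word_counts, over_limit)
```

If a strict limit matters, the summaries can be truncated or re-requested after a check like this one.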