![](https://i.ibb.co/TWw70TP/logo.png)

# Chad the chatbot

## Notebook description

1. Creation of a custom dataset by scraping [Cross Validated - Stack Exchange](https://stats.stackexchange.com).

2. Fine-tuning of the ["EleutherAI/gpt-neox-20b"](https://huggingface.co/EleutherAI/gpt-neox-20b) model using bitsandbytes, 4-bit quantization and QLoRA.

3. Zero-shot-classification with ["facebook/bart-large-mnli"](https://huggingface.co/facebook/bart-large-mnli) to filter questions related to our topics.

4. Creation of the chatbot itself, which classifies each question and, when the question is on topic, generates an answer with the fine-tuned model.

In [1]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

## Dataset creation

In [None]:
import requests
import pandas as pd
import numpy as np
import pickle as pkl
from bs4 import BeautifulSoup
from tqdm import tqdm
from random import uniform
from time import sleep
from datasets import DatasetDict, Dataset

#### Getting all the links of the pages

In [None]:
last_page = 4150
link_list = []
for i in range(1, last_page+1):
    link_list.append("https://stats.stackexchange.com/questions?tab=votes&page={}".format(i))

#### Getting all the links of the questions

Only performed on the first 500 pages, each of which contained 50 questions, sorted by votes (the `tab=votes` parameter in the links above).

In [None]:
questions = []
start = 0
end = 500
for link in tqdm(link_list[start:end]):
    page = requests.get(link)
    if page.status_code == 200:
        pageParsed = BeautifulSoup(page.content, 'html.parser')
        try:
            all_page = pageParsed.find_all('div', {'class':'s-post-summary--content'})
            for question in all_page:
                question_link = question.find('h3', class_='s-post-summary--content-title').find('a')['href']
                questions.append('https://stats.stackexchange.com' + question_link)
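        except Exception as exc:
            # report which page could not be parsed instead of failing silently
            print(f'Failed: {link} ({exc})')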

#### Save the links list of the questions

In [None]:
with open('questions_links.pkl', 'wb') as file:
    pkl.dump(questions, file)

#### Open the list

In [27]:
with open('questions_links.pkl', 'rb') as file:
    questions_links = pkl.load(file)
print(f'Total number of questions: {len(questions_links)}')

Total number of questions: 25000


#### Create a dictionary that contains:

- index

- question

- answer

In [None]:
def scrape(df_dict, idx, start, stop, save=True):
    
    # Set headers and user agent
    headers = {'User-Agent': 
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
    page_num = start
    for question_link in tqdm(questions_links[start:stop]):
        sleep(uniform(0.4, 0.6))

        try:
            page = requests.get(question_link, headers=headers)
            pageParsed = BeautifulSoup(page.content, 'html.parser')

            question = pageParsed.find('div', {'class': 'd-flex sm:fd-column'}).find('a').text
            texts = pageParsed.find_all('div', {'class': 's-prose js-post-body'})

            if len(texts) > 1:

                for answer in texts[1:]:
                    df_dict[idx] = {'question': question, 'answer': answer.text.strip()}
                    idx += 1
        except Exception as exc:
            print(f'Failed page: {page_num} ({exc})')
            
        # move on to the position of the next link in `questions_links`
        page_num += 1

    if save:
        with open('df_dict.pkl', 'wb') as file:
            pkl.dump(df_dict, file)
            
    return df_dict, idx

In [None]:
df_dict = {}
idx = 1
start = 0
stop = 25000
df_dict, idx = scrape(df_dict=df_dict, idx=idx, start=start, stop=stop, save=True)
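
The resulting dictionary holds one entry per answer, so a question with several answers appears several times under consecutive indices. A minimal sketch of that layout, using made-up strings and a hypothetical helper `add_answers` in place of the scraped pages:

```python
# Toy illustration of the layout built by `scrape`: one entry per answer.
# `add_answers` and the sample strings are hypothetical, not part of the notebook.
def add_answers(df_dict, idx, question, answers):
    """Store every answer under its own index, repeating the question."""
    for answer in answers:
        df_dict[idx] = {'question': question, 'answer': answer.strip()}
        idx += 1
    return df_dict, idx

toy_dict, next_idx = add_answers({}, 1, 'What is PCA?', [' First answer ', 'Second answer'])
toy_dict, next_idx = add_answers(toy_dict, next_idx, 'What is a p-value?', ['Only answer'])
print(toy_dict)
print(next_idx)
```

Returning the next free index is what allows the scraping to be resumed in several calls without overwriting earlier entries.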

#### Create a dataframe and then a Hugging Face `datasets` Dataset from the dictionary

In [None]:
rows = [{'idx': key, 'question': value['question'], 'answer': value['answer']} for key, value in df_dict.items()]
df = pd.DataFrame(rows)
df['conversation'] = df['question'] + '\n\n' + df['answer']
print(df.shape)
display(df.head())
dataset = Dataset.from_pandas(df)

(623, 4)


| | idx | question | answer | conversation |
|---|---|---|---|---|
| 0 | 1 | In today's pattern recognition class my profes... | Imagine a big family dinner where everybody st... | In today's pattern recognition class my profes... |
| 1 | 2 | In today's pattern recognition class my profes... | The manuscript "A tutorial on Principal Compon... | In today's pattern recognition class my profes... |
| 2 | 3 | In today's pattern recognition class my profes... | Let's do (2) first. PCA fits an ellipsoid to ... | In today's pattern recognition class my profes... |
| 3 | 4 | In today's pattern recognition class my profes... | Hmm, here goes for a completely non-mathematic... | In today's pattern recognition class my profes... |
| 4 | 5 | In today's pattern recognition class my profes... | I'd answer in "layman's terms" by saying that ... | In today's pattern recognition class my profes... |


In [None]:
# create the pandas dataframe from the dictionary
rows = [{'idx': key, 'question': value['question'], 'answer': value['answer']} for key, value in df_dict.items()]
df = pd.DataFrame(rows)

# reduce the max length of the questions and answers
q_max_length = df["question"].str.len().max()
print(f'Question max length: {q_max_length}')
df['question_cut'] = df['question'].str.slice(0, 100)
a_max_length = df["answer"].str.len().max()
print(f'Answer max length: {a_max_length}')
df['answer_cut'] = df['answer'].str.slice(0, 200)

# create a new column called 'conversation' joining 'question' and 'answer'
df['conversation'] = df['question'] + '\n' + df['answer']
# create a new column called 'conversation_cut' joining 'question_cut' and 'answer_cut'
df['conversation_cut'] = df['question_cut'] + '\n' + df['answer_cut']

# resetting the index column
df['idx'] = np.arange(1, len(df)+1)
df.head()
df2 = df[['idx', 'conversation_cut']]

# transforming the pandas df into a Hugging Face DatasetDict
dataset = Dataset.from_pandas(df)
dataset_dict = DatasetDict({'train': dataset})
dataset2 = Dataset.from_pandas(df2)
dataset_dict2 = DatasetDict({'train': dataset2})
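
Note that the cuts above are measured in characters, not tokens. A pure-Python sketch of the same truncation, with a hypothetical helper name `build_conversation_cut` and the 100/200 limits taken from the cell above:

```python
# Pure-Python sketch of the truncation done with `str.slice` above.
# `build_conversation_cut` is a hypothetical helper, not part of the notebook.
def build_conversation_cut(question, answer, q_len=100, a_len=200):
    """Cut question and answer to fixed character lengths, then join them."""
    return question[:q_len] + '\n' + answer[:a_len]

sample = build_conversation_cut('Q' * 150, 'A' * 500)
print(len(sample))  # question part + newline + answer part
print(repr(build_conversation_cut('short question', 'short answer')))
```

Texts shorter than the limits pass through unchanged, so only long questions and answers are affected.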

#### Save the dataset

*The dataset has also been uploaded to [Hugging Face](https://huggingface.co) and is available at [Prot10/CrossValidated](https://huggingface.co/datasets/Prot10/CrossValidated).*

In [None]:
with open('dataset.pkl', 'wb') as file:
    pkl.dump(dataset_dict, file)
    
with open('dataset_sub.pkl', 'wb') as file:
    pkl.dump(dataset_dict2, file)
    
df.to_csv('dataset.csv', index=False)

## Fine-tuning of the model for text-generation

#### First let's load the model we are going to use: `GPT-NeoX-20B`

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

Check the memory usage:

In [28]:
memory_footprint = model.get_memory_footprint()
memory_footprint_gb = memory_footprint / (1024 ** 3)
print(f"Memory Footprint: {memory_footprint_gb:.2f} GB")

Memory Footprint: 10.61 GB
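
As a rough sanity check on this figure, 4-bit weights take half a byte each. The sketch below assumes a parameter count of about 20.6 billion for GPT-NeoX-20B (an approximation that is not stated in this notebook) and ignores quantization constants and any tensors kept in higher precision, so it only gives an order-of-magnitude estimate.

```python
# Back-of-the-envelope estimate; N_PARAMS is an assumed, approximate figure.
N_PARAMS = 20.6e9

def weights_size_gib(n_params, bits_per_weight):
    """Size of the weights alone, in GiB, at a given precision."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

estimate_4bit = weights_size_gib(N_PARAMS, 4)
estimate_16bit = weights_size_gib(N_PARAMS, 16)
print(f"4-bit weights only:  {estimate_4bit:.2f} GiB")
print(f"16-bit weights only: {estimate_16bit:.2f} GiB")
```

The measured footprint is somewhat above the 4-bit estimate, plausibly because some tensors are not quantized, and far below what the same weights would need in 16-bit precision.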


#### Prepare the model for training with the `prepare_model_for_kbit_training` function from PEFT

In [None]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=20, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
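
LoRA adds two low-rank matrices to each targeted layer, one of shape `r x d_in` and one of shape `d_out x r`, so each adapter has `r * (d_in + d_out)` trainable parameters. The sketch below assumes the GPT-NeoX-20B dimensions (hidden size 6144, 44 layers, a fused `query_key_value` projection of width 3 x 6144); these values are not stated in the notebook, so treat the result as an estimate to compare against the output of `print_trainable_parameters`.

```python
# Estimate of the LoRA parameter count for the config above.
# HIDDEN_SIZE and NUM_LAYERS are assumed values for GPT-NeoX-20B.
HIDDEN_SIZE = 6144
NUM_LAYERS = 44
RANK = 20  # `r` in LoraConfig

def lora_params(d_in, d_out, r):
    """Parameters of one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = lora_params(HIDDEN_SIZE, 3 * HIDDEN_SIZE, RANK)
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapters amount to a few tens of millions of parameters, a tiny fraction of the base model.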

#### Load the dataset previously created (from Hugging Face)

*N.B. Skip this cell if you have already built the dataset in the previous part...*

In [None]:
from datasets import load_dataset

dataset = load_dataset("Prot10/CrossValidated")
data = dataset.map(lambda samples: tokenizer(samples["conversation"]), batched=True)

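*...and just run this one instead, which tokenizes the truncated dataset built above:*

In [None]:
# `dataset_dict2` only has the 'idx' and 'conversation_cut' columns
data = dataset_dict2.map(lambda samples: tokenizer(samples["conversation_cut"]), batched=True)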

#### Let's now train the model

In [None]:
import transformers

# needed for gpt-neo-x tokenizer
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=6,
        warmup_steps=4,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False
trainer.train()

You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  attn_scores = torch.where(causal_mask, attn_scores, mask_value)


| Step | Training Loss |
|---|---|
| 1 | 3.0828 |
| 2 | 3.0359 |
| 3 | 3.2475 |
| 4 | 3.0053 |
| 5 | 2.9713 |
| 6 | 2.5866 |
| 7 | 3.159 |
| 8 | 2.9985 |
| 9 | 2.9484 |
| 10 | 2.6374 |


TrainOutput(global_step=20, training_loss=2.8552961707115174, metrics={'train_runtime': 547.1652, 'train_samples_per_second': 0.439, 'train_steps_per_second': 0.037, 'total_flos': 994427072593920.0, 'train_loss': 2.8552961707115174, 'epoch': 0.0})
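
The reported throughput can be cross-checked from the training arguments: each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples. A small check, assuming a single GPU (consistent with `device_map={"":0}` above):

```python
# Cross-check of the TrainOutput metrics, assuming one device.
per_device_train_batch_size = 2
gradient_accumulation_steps = 6
max_steps = 20
train_runtime = 547.1652  # seconds, from the TrainOutput above

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_steps
print(effective_batch, samples_seen)
print(round(samples_seen / train_runtime, 3))  # compare with train_samples_per_second
print(round(max_steps / train_runtime, 3))     # compare with train_steps_per_second
```

The run therefore sees only a few hundred samples, which is consistent with the reported `'epoch': 0.0`.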

#### Save the fine-tuned model

*The fine-tuned model has also been uploaded to [Hugging Face](https://huggingface.co) and is available at [Prot10/chad](https://huggingface.co/Prot10/chad).*

In [None]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("outputs")

In [None]:
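from peft import PeftModel

# Load the saved adapter weights from 'outputs' on top of the base model.
# Building a new adapter from the config alone would leave its weights untrained.
# This assumes `model` is the quantized base model loaded in a fresh session.
model = PeftModel.from_pretrained(model, 'outputs')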

#### And now we can try the model

In [8]:
text = 'What is the definition of expected value?'
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
What is the definition of expected value?


I am trying to understand the definition of expected value.
I have seen the definition of expected value as:

The expected value of a random variable X is the sum of the probabilities of all possible outcomes of X, multiplied by the value of the outcome.
I am trying


## Model for zero-shot-classification 

#### Load the model we are going to use: `bart-large-mnli`

In [11]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

#### Set some parameters and give it a try

In [24]:
candidate_labels = ['greetings', 'stats', 'probs', 'math', 'machine-learning', 'other']
threshold = 0.5

sequence_to_classify = "what is an integral?"

result = classifier(sequence_to_classify, candidate_labels)
labels = dict(zip(result['labels'], result['scores']))
print('Predicted labels:')
print(labels, '\n')
if labels['greetings'] + labels['other'] < threshold:
    print("It's an appropriate question!")
else:
    print("It's not an appropriate question!")

Predicted labels:
{'math': 0.6207872033119202, 'probs': 0.2595915198326111, 'other': 0.05335698649287224, 'stats': 0.027889791876077652, 'machine-learning': 0.021786486729979515, 'greetings': 0.01658804342150688} 

It's an appropriate question!


## Chatbot

#### Define a function to run the Chatbot

In [13]:
def chat(question, threshold=0.5, max_new_tokens=60):
    
    candidate_labels = ['greetings', 'stats', 'probs', 'math', 'machine-learning', 'other']
    
    result = classifier(question, candidate_labels)
    labels = dict(zip(result['labels'], result['scores']))

    if labels['greetings'] + labels['other'] < threshold:
        device = "cuda:0"
        inputs = tokenizer(question, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        print(f'\n\n\033[1mCHAD:\033[0m {tokenizer.decode(outputs[0], skip_special_tokens=True)[len(question)+2:]}')
    
    elif labels['greetings'] > 0.5:
        print("\033[1mCHAD:\033[0m Hi, I'm Chad and I was created to help you deepen concepts related to the world of Data Science, ask me what you need!")
    
    else:
        print("\033[1mCHAD:\033[0m I'm sorry but the question asked does not fall within the topics to which I can answer. If you think this is a mistake please try to rephrase the question differently!")
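
The branching in `chat` can be checked without loading either model. The sketch below replaces the classifier output with plain score dictionaries; `route` is a hypothetical helper, the first dictionary reuses (rounded) the scores printed earlier for "what is an integral?", and the other two are made up. It also shows how the prompt is stripped from the decoded text.

```python
# Sketch of the routing in `chat`, with the models replaced by plain score dicts.
# `route` and the hypothetical score dicts are illustrative only.
def route(scores, threshold=0.5):
    """Return 'answer', 'greet' or 'reject' following the branches of `chat`."""
    if scores['greetings'] + scores['other'] < threshold:
        return 'answer'
    elif scores['greetings'] > 0.5:
        return 'greet'
    return 'reject'

# Scores printed earlier for "what is an integral?" (rounded)
integral = {'math': 0.621, 'probs': 0.260, 'other': 0.053,
            'stats': 0.028, 'machine-learning': 0.022, 'greetings': 0.017}
# Made-up scores for a greeting and for an off-topic question
greeting = {'greetings': 0.80, 'other': 0.10, 'math': 0.04,
            'stats': 0.03, 'probs': 0.02, 'machine-learning': 0.01}
off_topic = {'other': 0.70, 'greetings': 0.10, 'math': 0.08,
             'stats': 0.06, 'probs': 0.04, 'machine-learning': 0.02}

print(route(integral), route(greeting), route(off_topic))

# The generated text starts with the prompt; `chat` drops it plus two characters.
question = "What is the definition of expected value?"
decoded = question + "\n\nI am trying to understand"  # made-up decoded output
stripped = decoded[len(question) + 2:]
print(stripped)
```

Because the greeting and "other" scores are summed, a question is only answered when the topical labels together hold more than half of the probability mass.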

#### And finally run the Chatbot with a few different examples

- **Example 1:**

  Question: "What is the definition of expected value?"

  Class: "probs"

In [32]:
question = "What is the definition of expected value?"
chat(question)

CHAD: I am trying to understand the definition of expected value.
I have seen the definition of expected value as:

The expected value of a random variable X is the sum of the probabilities of all possible outcomes of X, multiplied by the value of the outcome.
I am trying


- **Example 2:**

  Question: "Hi, what's your name?"

  Class: "greetings"

In [31]:
question = "Hi, what's your name?"
chat(question)

CHAD: Hi, I'm Chad and I was created to help you deepen concepts related to the world of Data Science, ask me what you need!


- **Example 3:**

  Question: "What is a Blue Glaucus?"

  Class: "other"

In [30]:
question = "What is a Blue Glaucus?"
chat(question)

CHAD: I'm sorry but the question asked does not fall within the topics to which I can answer. If you think this is a mistake please try to rephrase the question differently!


### Try the model on Colab

You can find the notebook here $\rightsquigarrow$ [CHAD](https://colab.research.google.com/drive/1k8wHhCZkePFECJp1AOaPZ4ZH8BiZGtwq?usp=share_link)