# Transfer Learning with RAG Models
----
**Objective**: In this notebook, you will experiment with a question-answering (QA) RAG model on the climate_fever dataset using the Haystack framework. You will walk through the whole process of fitting the parts of a RAG model together, and learn how to prompt it with queries to get answers grounded in the provided dataset.

NOTE: Make sure to change the runtime from CPU to TPU or GPU so the language model runs faster.

## Install Libraries
Install the Haystack (for colab) and Datasets libraries

In [5]:
!pip install farm-haystack[colab]
!pip install datasets

Collecting farm-haystack[colab]
  Downloading farm_haystack-1.21.2-py3-none-any.whl (819 kB)
Collecting boilerpy3 (from farm-haystack[colab])
  Downloading boilerpy3-1.0.6-py3-none-any.whl (22 kB)
Collecting events (from farm-haystack[colab])
  Downloading Events-0.5-py3-none-any.whl (6.8 kB)
Collecting httpx (from farm-haystack[colab])
  Downloading httpx-0.25.0-py3-none-any.whl (75 kB)
Collecting lazy-imports==0.3.1 (from farm-haystack[colab])
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting posthog (from farm-haystack[colab])
  Downloading posthog-3.0.2-py2.py3-none-any.whl (37 kB)
Collecting prompthub-py==4.0.0 (from farm-haystack[colab])
  Downloading prompthub_py-4.0.0-py3-none-any.whl (6.9 kB)
Collecting quantulum3

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl (493 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.14.6 dill-0.3.7 multiprocess-0.70.15


## Import Dataset
----
In this section, we use the climate_fever dataset as an example. The dataset consists of 1535 claims about climate change, each paired with evidence sentences that support it, refute it, or do not provide enough information. We build on a specific topic so that we can get more accurate answers, but keep in mind that bigger datasets with open topics can also be used.

**Question 1**: Use the "load_dataset" function to load the "climate_fever" dataset with the "test" split.

In [6]:
from datasets import load_dataset

# Load the dataset
data = load_dataset("climate_fever", split="test")


Downloading builder script:   0%|          | 0.00/5.13k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.00k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/686k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/1535 [00:00<?, ? examples/s]

## Formatting and Writing Documents
----
First, we need to extract, format and write the documents from our chosen dataset so that we can later build our QA Pipeline. This Pipeline will facilitate the process of building our RAG model and getting answers from it.

Keep in mind that for this notebook we will focus on how to build the pipeline with the simplest configurations. Feel free to experiment with different parameters.



In [7]:
data[0]

{'claim_id': '0',
 'claim': 'Global warming is driving polar bears toward extinction',
 'claim_label': 0,
 'evidences': [{'evidence_id': 'Extinction risk from global warming:170',
   'evidence_label': 2,
   'article': 'Extinction risk from global warming',
   'evidence': '"Recent Research Shows Human Activity Driving Earth Towards Global Extinction Event".',
   'entropy': 0.6931471824645996,
   'votes': ['SUPPORTS', 'NOT_ENOUGH_INFO', None, None, None]},
  {'evidence_id': 'Global warming:14',
   'evidence_label': 0,
   'article': 'Global warming',
   'evidence': 'Environmental impacts include the extinction or relocation of many species as their ecosystems change, most immediately the environments of coral reefs, mountains, and the Arctic.',
   'entropy': 0.0,
   'votes': ['SUPPORTS', 'SUPPORTS', None, None, None]},
  {'evidence_id': 'Global warming:178',
   'evidence_label': 2,
   'article': 'Global warming',
   'evidence': 'Rising temperatures push bees to their physiological limits,
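
The label fields in the record above are integer codes. The sketch below turns them into readable names and counts the evidence labels for one record. The name lists are an assumption based on the dataset card for climate_fever, `summarize_record` is a hypothetical helper, and the sample keeps only the label fields visible in the output above.

```python
from collections import Counter

# Assumed label order, taken from the climate_fever dataset card.
CLAIM_LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]
EVIDENCE_LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"]


def summarize_record(record):
    """Return the claim, its label name, and a count of evidence label names."""
    counts = Counter(
        EVIDENCE_LABELS[ev["evidence_label"]] for ev in record["evidences"]
    )
    return {
        "claim": record["claim"],
        "claim_label": CLAIM_LABELS[record["claim_label"]],
        "evidence_counts": dict(counts),
    }


# Trimmed copy of data[0]: only the three evidence labels shown above.
sample = {
    "claim": "Global warming is driving polar bears toward extinction",
    "claim_label": 0,
    "evidences": [
        {"evidence_label": 2},
        {"evidence_label": 0},
        {"evidence_label": 2},
    ],
}

summary = summarize_record(sample)
print(summary["claim_label"], summary["evidence_counts"])
```

With the real dataset loaded, the same helper could be called as `summarize_record(data[0])`.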

**Question 2**: Use the write_documents method to save the formatted documents into document_storage

In [8]:
!pip install --upgrade tensorflow

Collecting tensorflow
  Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (489.8 MB)
Collecting tensorboard<2.15,>=2.14 (from tensorflow)
  Downloading tensorboard-2.14.1-py3-none-any.whl (5.5 MB)
Collecting tensorflow-estimator<2.15,>=2.14.0 (from tensorflow)
  Downloading tensorflow_estimator-2.14.0-py2.py3-none-any.whl (440 kB)
Collecting keras<2.15,>=2.14.0 (from tensorflow)
  Downloading keras-2.14.0-py3-none-any.whl (1.7 MB)
Collecting google-auth-oauthlib

In [9]:
from haystack.document_stores import InMemoryDocumentStore

# Create an InMemoryDocumentStore with BM25 enabled so a BM25Retriever can query it
document_storage = InMemoryDocumentStore(use_bm25=True)

# Extract and format the documents from the dataset.
# "content" holds the text that is indexed and retrieved (the combined evidence);
# the claim is kept under "text" and its ID under "meta" for reference.
documents = []
for example in data:
    claim_id = example["claim_id"]
    text = example["claim"]
    content = " ".join([ev["evidence"] for ev in example["evidences"]])  # Combine evidence texts
    document = {"text": text, "content": content, "meta": {"claim_id": claim_id}}
    documents.append(document)

# Write the documents to the document store
document_storage.write_documents(documents)

Updating BM25 representation...: 100%|██████████| 1525/1525 [00:00<00:00, 6905.75 docs/s]


In [29]:
documents2 = [{"content": document['claim']} for document in data]

In [30]:
documents2

[{'content': 'Global warming is driving polar bears toward extinction'},
 {'content': 'The sun has gone into ‘lockdown’ which could cause freezing weather, earthquakes and famine, say scientists'},
 {'content': 'The polar bear population has been growing.'},
 {'content': "Ironic' study finds more CO2 has slightly cooled the planet"},
 {'content': 'Human additions of CO2 are in the margin of error of current measurements and the gradual increase in CO2 is mainly from oceans degassing as the planet slowly emerges from the last ice age.'},
 {'content': 'They tell us that we are the primary forces controlling earth temperatures by the burning of fossil fuels and releasing their carbon dioxide.'},
 {'content': 'The Great Barrier Reef is experiencing the most widespread bleaching ever recorded'},
 {'content': 'it’s not a pollutant that threatens human civilization.'},
 {'content': 'If CO2 was so terrible for the planet, then installing a CO2 generator in a greenhouse would kill the plants.'}

In [10]:
documents[0]

{'text': 'Global warming is driving polar bears toward extinction',
 'content': '"Recent Research Shows Human Activity Driving Earth Towards Global Extinction Event". Environmental impacts include the extinction or relocation of many species as their ecosystems change, most immediately the environments of coral reefs, mountains, and the Arctic. Rising temperatures push bees to their physiological limits, and could cause the extinction of bee populations. Rising global temperatures, caused by the greenhouse effect, contribute to habitat destruction, endangering various species, such as the polar bear. "Bear hunting caught in global warming debate".',
 'meta': {'claim_id': '0'}}

In [11]:
documents[1]

{'text': 'The sun has gone into ‘lockdown’ which could cause freezing weather, earthquakes and famine, say scientists',
 'content': "The current consensus of the scientific community is that the aerosols and dust released into the upper atmosphere causes cooler temperatures by preventing the sun's energy from reaching the ground. The Little Ice Age caused crop failures and famines in Europe. The persistently cold, wet weather caused great hardship, was primarily responsible for the Great Famine of 1315–1317, and strongly contributed to the weakened immunity and malnutrition leading up to the Black Death (1348–1350). The manifestation of the meteorological winter (freezing temperatures) in the northerly snow–prone latitudes is highly variable depending on elevation, position versus marine winds and the amount of precipitation. In many regions, winter is associated with snow and freezing temperatures.",
 'meta': {'claim_id': '5'}}
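
As a variation on the formatting loop above, the sketch below wraps the conversion in a small function that stores the claim explicitly under `meta` and skips records with no usable evidence text. The helper name `to_document_dict` and the `"claim"` meta key are hypothetical, and the empty second evidence entry in the sample is invented only to exercise the skip logic.

```python
def to_document_dict(example):
    """Convert one climate_fever record into a Haystack-style document dict.

    Returns None when the record has no non-empty evidence text.
    """
    evidence_texts = [
        ev["evidence"].strip()
        for ev in example.get("evidences", [])
        if ev.get("evidence")
    ]
    if not evidence_texts:
        return None
    return {
        "content": " ".join(evidence_texts),
        "meta": {"claim_id": example["claim_id"], "claim": example["claim"]},
    }


# Sample record: first evidence copied from data[0], second one invented and empty.
sample_record = {
    "claim_id": "0",
    "claim": "Global warming is driving polar bears toward extinction",
    "evidences": [
        {"evidence": '"Recent Research Shows Human Activity Driving Earth Towards Global Extinction Event".'},
        {"evidence": ""},
    ],
}

doc = to_document_dict(sample_record)
print(doc)
```

On the full dataset this could be used as `[d for d in map(to_document_dict, data) if d is not None]` before calling `write_documents`.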

## Preparing the Retriever
----
Next, we prepare the Retriever node of our pipeline. It is responsible for fetching the relevant documents from our document storage so that the Language Model can use them later. We will use the BM25Retriever provided by Haystack, as it is the recommended Retriever for beginners.

**Question 3**: Create the BM25Retriever using the document_storage created earlier, with a top_k of value 2

In [12]:
from haystack.nodes import BM25Retriever

# Note: A higher top_k gives the language model more context, which can improve answers, but each query gets slower
retriever = BM25Retriever(document_store=document_storage, top_k=2)
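
To build intuition for what a BM25 retriever does with `top_k`, here is a minimal pure-Python sketch of BM25 ranking. It is not Haystack's implementation: the whitespace tokenizer, the `k1` and `b` values, the function names, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_top_k(query, docs, top_k=2, k1=1.5, b=0.75):
    """Return up to top_k documents ranked by a basic BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))

    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep only documents that matched at least one query term.
    return [docs[idx] for score, idx in ranked[:top_k] if score > 0]


# Toy documents, invented for this sketch.
toy_docs = [
    "polar bears are losing sea ice habitat",
    "solar panels and wind turbines produce clean energy",
    "sea level rise threatens coastal cities",
]

print(bm25_top_k("clean energy", toy_docs))
print(bm25_top_k("sea", toy_docs, top_k=1))
```

Raising `top_k` returns more candidates per query, which is the same trade-off the `top_k=2` setting above controls.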


## Preparing the Language Model
----
Now, we will prepare our Language Model using the prompt node. We first need to create our prompt, for which Haystack requires a specific structure. We then define our desired language model alongside the prompt template we created. When creating this template, we also need to parse the output into a format that Haystack can use, which is the job of the output parser.

**Question 4**: Define the prompt node using PromptNode with the model name as "google/flan-t5-large" and the default prompt template as the created "rag_prompt"

In [31]:
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

rag_prompt = PromptTemplate(
    prompt="""Create comprehensive answers from the related text given the questions.
                             Provide a clear and concise response that displays the key points and information presented in the related text.
                             Your answer should be in your own words and be no longer than 50 words.
                             \n\n Related text: {join(documents)} \n\n Question: {query} \n\n Answer:""",
    output_parser=AnswerParser(),
)

prompt_node = PromptNode(model_name_or_path="google/flan-t5-large", default_prompt_template=rag_prompt)
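
To make the template's placeholders concrete, the sketch below approximates how the retrieved documents and the query end up in the final prompt text. Haystack does this filling internally; `RAG_TEMPLATE` and `fill_rag_prompt` are hypothetical stand-ins built on plain `str.format`, and the template wording is shortened.

```python
# Shortened stand-in for the rag_prompt text, with plain str.format placeholders.
RAG_TEMPLATE = (
    "Create comprehensive answers from the related text given the questions. "
    "Your answer should be in your own words and be no longer than 50 words."
    "\n\n Related text: {documents} \n\n Question: {query} \n\n Answer:"
)


def fill_rag_prompt(query, documents):
    """Join the retrieved texts and insert them, with the query, into the template."""
    return RAG_TEMPLATE.format(documents=" ".join(documents), query=query)


# Two evidence sentences taken from documents[1] above.
retrieved = [
    "The Little Ice Age caused crop failures and famines in Europe.",
    "In many regions, winter is associated with snow and freezing temperatures.",
]

filled_prompt = fill_rag_prompt("What did the Little Ice Age cause?", retrieved)
print(filled_prompt)
```

The printed text is roughly what the language model receives: instructions, the retrieved context, the question, and a trailing "Answer:" cue for it to complete.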




In [14]:
# Without model_name_or_path, PromptNode loads its default model (google/flan-t5-base)
prompt_node = PromptNode(default_prompt_template=rag_prompt)

(…)le/flan-t5-base/resolve/main/config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

(…)base/resolve/main/generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

(…)-base/resolve/main/tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

(…)flan-t5-base/resolve/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

(…)ase/resolve/main/special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

## Fitting our Pipeline Together
----
Finally, we put our pipeline nodes together using the Pipeline class from Haystack. With the pipeline ready, you will be able to ask it questions and get answers.

**Question 5**: Add the retriever node and prompt_node created in the previous steps to the Pipeline using the add_node function. Hint: you need to provide the inputs to each of these nodes.

In [15]:
from haystack.pipelines import Pipeline

pipe = Pipeline()



In [24]:
# Add the retriever node, specifying the name and inputs
pipe.add_node(retriever, name="retriever", inputs=["Query"])

# Add the prompt node, specifying the name and inputs
pipe.add_node(prompt_node, name="prompt_node", inputs=["retriever"])
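
The `inputs` argument is what wires the nodes into a chain: the query feeds the retriever, and the retriever's output feeds the prompt node. The sketch below imitates that data flow with plain functions. `TinyPipeline` and the toy components are hypothetical and far simpler than Haystack's Pipeline; they only illustrate the idea.

```python
class TinyPipeline:
    """Toy pipeline: runs nodes in insertion order, passing outputs along."""

    def __init__(self):
        self.nodes = []  # list of (name, component, inputs)

    def add_node(self, component, name, inputs):
        self.nodes.append((name, component, inputs))

    def run(self, query):
        # "Query" is the root input, mirroring inputs=["Query"] above.
        outputs = {"Query": {"query": query}}
        for name, component, inputs in self.nodes:
            payload = {}
            for source in inputs:
                payload.update(outputs[source])
            # Each node sees its inputs and adds its own keys to the result.
            outputs[name] = {**payload, **component(**payload)}
        return outputs[self.nodes[-1][0]]


def toy_retriever(query):
    """Return corpus entries sharing at least one word with the query."""
    corpus = [
        "wind and solar are clean energy sources",
        "polar bears depend on sea ice",
    ]
    words = set(query.lower().split())
    return {"documents": [d for d in corpus if words & set(d.lower().split())]}


def toy_prompt_node(query, documents):
    """Stand-in for the language model: reports what it was given."""
    return {"answers": [f"{len(documents)} document(s) retrieved for: {query}"]}


tiny = TinyPipeline()
tiny.add_node(toy_retriever, name="retriever", inputs=["Query"])
tiny.add_node(toy_prompt_node, name="prompt_node", inputs=["retriever"])

result = tiny.run("clean energy")
print(result["answers"][0])
```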


## Asking the RAG Model Questions
----
We use the pipeline's .run() method to ask a question. The output is a dictionary whose "answers" entry holds Haystack answer objects, so we read the generated text from the .answer attribute of the first one, as done inside the print() call.

In [25]:
output = pipe.run(query="When did global warming start")

print(output["answers"][0].answer)

1993.


In [28]:
output = pipe.run(query="What is the biggest damaging factor for the climate?")

print(output["answers"][0].answer)



30% in 2010[update].


In [None]:
output = pipe.run(query="Who is most responsible for pollution")

print(output["answers"][0].answer)

In [None]:
# Here are some other examples you can use
examples = [
    "Who is most responsible for pollution",
    "What is the biggest damaging factor for the climate?",
    "What are some clean energy sources?",
    "How much does the average temperature of our planet rise per decade?"
]
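
To try several questions in one go, a small loop is enough. The sketch below assumes the pipeline returns a dictionary with an "answers" list whose items expose an `.answer` attribute, as in the cells above. `ask_all` is a hypothetical helper, and `StubPipeline` is a stand-in that only echoes the query so the sketch runs without Haystack.

```python
from types import SimpleNamespace


def ask_all(pipeline, questions):
    """Run each question through the pipeline and keep the first answer's text."""
    results = {}
    for question in questions:
        output = pipeline.run(query=question)
        answers = output.get("answers", [])
        # Guard against an empty answer list instead of indexing blindly.
        results[question] = answers[0].answer if answers else None
    return results


class StubPipeline:
    """Stand-in for the real pipeline; it just echoes the query back."""

    def run(self, query):
        return {"answers": [SimpleNamespace(answer=f"(stub answer to: {query})")]}


sample_questions = [
    "Who is most responsible for pollution",
    "What are some clean energy sources?",
]

stub_results = ask_all(StubPipeline(), sample_questions)
for question, answer in stub_results.items():
    print(f"{question} -> {answer}")
```

With the real pipeline, the call would be `ask_all(pipe, examples)`.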