# Install and load the libraries. 
To start, we need to install the necessary Python packages. 
* **[langchain](https://python.langchain.com/docs/get_started/introduction.html)**. A popular framework for building applications with large language models. 
* **[sentence_transformers](https://www.sbert.net/)**. Needed to create the embeddings that we are going to store in the vector database.  
* **[chromadb](https://www.trychroma.com/)**. This is our vector database. ChromaDB is open source, easy to use, and one of the most widely used vector databases for storing embeddings. 

In [None]:
!pip install --upgrade pip

In [2]:
!pip install -q chromadb==0.4.22
!pip install -q langchain==0.1.4
!pip install -q sentence_transformers==2.3.0


[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip



You probably already know the next two packages, Numpy and Pandas, two of the most widely used Python libraries.

Numpy is a powerful library for numerical computing. 

Pandas is a library for data manipulation and analysis.

In [3]:
import numpy as np 
import pandas as pd

# Load the Dataset
As you can see, the notebook is ready to work with two different datasets. Just uncomment the lines of the dataset you want to use. 

Because we are working in a free, limited environment with just 30 GB of memory, I limited the number of news items with the variable MAX_NEWS. If you are using a GPU, its memory will be limited to 16 GB. 

The name of the field containing the text of each news item is stored in the variable *DOCUMENT*, and the name of the field used as metadata is stored in *TOPIC*.

In [4]:
news = pd.read_csv('labelled_newscatcher_dataset.csv', sep=';')
MAX_NEWS = 1000
DOCUMENT="title"
TOPIC="topic"

#news = pd.read_csv('/kaggle/input/bbc-news/bbc_news.csv')
#MAX_NEWS = 500
#DOCUMENT="description"
#TOPIC="title"

# Because this is just a course, we select a small portion of the news.
subset_news = news.head(MAX_NEWS)

In [5]:
news.head(2)

Unnamed: 0,topic,link,domain,published_date,title,lang
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en


## CREATE THE DOCUMENT FROM THE DATAFRAME
We are going to load the data from a pandas DataFrame. However, LangChain, through its document_loaders module, supports many other data sources, such as Word documents, Excel files, plain text, and SQL.

We also import the Chroma class, which is used to save the embeddings in the ChromaDB database.

In [6]:
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma


First, we create the loader, indicating the data source and the name of the column in the DataFrame where we store what we could consider as the document, that is, the information we want to pass to the model so that it takes it into account in its responses.

In [7]:
df_loader = DataFrameLoader(subset_news, page_content_column=DOCUMENT)

Then, we use the loader to load the document.

In [8]:
df_document = df_loader.load()

In [9]:
display(df_document[:2])

[Document(page_content="A closer look at water-splitting's solar fuel potential", metadata={'topic': 'SCIENCE', 'link': 'https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php', 'domain': 'eurekalert.org', 'published_date': '2020-08-06 13:59:45', 'lang': 'en'}),
 Document(page_content='An irresistible scent makes locusts swarm, study finds', metadata={'topic': 'SCIENCE', 'link': 'https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw', 'domain': 'pulse.ng', 'published_date': '2020-08-12 15:14:19', 'lang': 'en'})]
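
The output above shows how each row was turned into a document: the column named in *DOCUMENT* became `page_content`, and the remaining columns became `metadata`. The mapping can be sketched in plain Python as follows. `SimpleDocument` and `rows_to_documents` are hypothetical names used only for illustration; this is not LangChain's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into a document: one column becomes the text,
    every other column becomes metadata."""
    documents = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        documents.append(SimpleDocument(row[page_content_column], metadata))
    return documents


# One illustrative row with a subset of the dataset's columns.
rows = [
    {
        "topic": "SCIENCE",
        "title": "An irresistible scent makes locusts swarm, study finds",
        "lang": "en",
    },
]
docs = rows_to_documents(rows, page_content_column="title")
print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the topic, link, and date as metadata means they travel with the text and remain available when a document is retrieved later.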

# Creating the embeddings
First, we import a couple of classes.
* CharacterTextSplitter: we will use it to split the documents into chunks.
* HuggingFaceEmbeddings: it creates the embeddings in the format that we will store in the database. Its import is commented out in the cell below because the notebook uses SentenceTransformerEmbeddings instead.

In [10]:
from langchain.text_splitter import CharacterTextSplitter
#from langchain.embeddings import HuggingFaceEmbeddings

As mentioned above, we split the data into manageable chunks to store as vectors using **CharacterTextSplitter**. There is no exact rule and no magic number for the chunk size.

What matters is the trade-off: larger chunks give the model more context with each retrieved result, while smaller chunks produce more vectors and therefore increase the size of our vector store.

In [11]:
text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(df_document)

In [12]:
display(texts[:2])

[Document(page_content="A closer look at water-splitting's solar fuel potential", metadata={'topic': 'SCIENCE', 'link': 'https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php', 'domain': 'eurekalert.org', 'published_date': '2020-08-06 13:59:45', 'lang': 'en'}),
 Document(page_content='An irresistible scent makes locusts swarm, study finds', metadata={'topic': 'SCIENCE', 'link': 'https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw', 'domain': 'pulse.ng', 'published_date': '2020-08-12 15:14:19', 'lang': 'en'})]
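
Note that the chunks shown above are identical to the original documents, because every headline is shorter than `chunk_size`. To see what `chunk_size` and `chunk_overlap` mean on longer text, here is a minimal sketch of a fixed-window character splitter. It is a simplification: `split_with_overlap` is a hypothetical helper, and the real CharacterTextSplitter first splits on a separator and then merges the pieces, so its output can differ.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters. Each window
    starts chunk_size - chunk_overlap characters after the previous one,
    so consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


headline = "An irresistible scent makes locusts swarm, study finds"

# Shorter than chunk_size: the text comes back as a single chunk.
print(split_with_overlap(headline, chunk_size=250, chunk_overlap=10))

# A tiny example where the overlap is visible.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

In the second call, each chunk begins with the last character of the previous one, which is the role the overlap plays: it keeps a little shared context across chunk boundaries.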

Next, we load a pretrained sentence-embedding model from Hugging Face to create the embeddings. 


In [13]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

#embedding_function = HuggingFaceEmbeddings(
#    model_name="sentence-transformers/all-MiniLM-L6-v2"
#)  



Here we create the Chroma index from the chunked documents and the embedding function defined above. 

In [14]:
chroma_db = Chroma.from_documents(
    texts, embedding_function, persist_directory='./input'
)
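
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below illustrates that idea in plain Python. Everything in it is an assumption made for illustration: `toy_embed` uses simple word counts instead of the all-MiniLM-L6-v2 model, and `ToyVectorStore` is a hypothetical class, not ChromaDB.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: lowercase word counts (NOT a neural embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs."""

    def __init__(self, texts):
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        query_vector = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "A closer look at water-splitting's solar fuel potential",
    "An irresistible scent makes locusts swarm, study finds",
])
print(store.similarity_search("why do locusts swarm", k=1))
```

A real embedding model matches on meaning rather than on exact words, which is why the notebook uses a pretrained sentence-transformer instead of word counts.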

## LANGCHAIN

Finally, the time has come to create our chain with LangChain. It will be straightforward. All we do is give it a retriever and a model to call with the result obtained from the retriever.

Now we import the RetrievalQA and HuggingFacePipeline classes from langchain, plus StrOutputParser from langchain_core, which we will use later in the LCEL chain.  

In [15]:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain_core.output_parsers import StrOutputParser

Now we create the retriever object, which is responsible for returning the data stored in the ChromaDB database. 

In [16]:
retriever = chroma_db.as_retriever()

I tested the notebook with two models from Hugging Face. 

The first one is [dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b), the smallest Dolly model. It has 3 billion parameters, more than enough for our sample, and works much better than GPT2. It's a text-generation model, and therefore generates slightly more imaginative responses.

The second one is a T5 model. This is a text2text-generation model, so it will produce more concise responses.

Be sure to test both, and if you want, select other models from Hugging Face. 

In [17]:
model_id = "databricks/dolly-v2-3b" #my favourite textgeneration model for testing
task="text-generation"

#model_id = "google/flan-t5-large" #Nice text2text model
#task="text2text-generation"

We use HuggingFacePipeline class to create a pipeline for a specific Hugging Face language model. Let's break down the code:

* **model_id**: This is the ID of the Hugging Face language model you want to use. It typically consists of the organization and the model name, as in databricks/dolly-v2-3b.
* **task**: This parameter specifies the task that you want to perform using the language model. In this notebook it is either "text-generation" or "text2text-generation", depending on the model selected above.
* **model_kwargs**: Allows you to provide additional arguments specific to the chosen model. In this case, it sets "temperature" to 0 (asking for deterministic output) and "max_length" to 1024, which limits the length of the generated sequence to 1024 tokens.
* **pipeline_kwargs**: Additional arguments for the pipeline itself. Here it sets "repetition_penalty" to 1.1, which discourages the model from repeating the same text.
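
To give some intuition for why a temperature of 0 is associated with deterministic output, here is a small sketch of how temperature reshapes next-token probabilities in general. It is an illustration under stated assumptions, not the code the library runs: `next_token_distribution` is a hypothetical function, the logits are made up, and temperature 0 is treated here as "always pick the highest-scoring token".

```python
import math


def next_token_distribution(logits, temperature):
    """Softmax over logits divided by the temperature.
    Temperature 0 is treated as greedy: all probability on the best token."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5, 0):
    probabilities = next_token_distribution(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature drops, more probability concentrates on the top token, so sampling becomes less varied; at 0 there is nothing left to sample.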


In [None]:
hf_llm = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task=task,
    model_kwargs={
        "temperature": 0,
        "max_length": 1024
    },
    pipeline_kwargs={
        "repetition_penalty":1.1
    }
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We are setting up ***document_qa***, a **RetrievalQA** object that we are going to use to run the questions. 

The ***stuff*** type is the simplest type of chain that we can have. It gets the documents from the retriever and passes them all to the language model in a single prompt to obtain a response. 

In [19]:
chain_type = "stuff"  # simplest chain type: all retrieved documents go into one prompt
document_qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type=chain_type, retriever=retriever
)
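
The idea behind the stuff chain type can be sketched in a few lines of plain Python: retrieve the documents, join them into one context block, and send a single prompt to the model. The prompt wording, `build_stuff_prompt`, `answer_with_stuff`, `fake_retrieve`, and `fake_llm` are all hypothetical and exist only for this illustration; LangChain's own prompt template is different.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved text into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer_with_stuff(question, retrieve, llm):
    """Retrieve documents, stuff them into one prompt, call the model once."""
    documents = retrieve(question)
    return llm(build_stuff_prompt(question, documents))


# Hypothetical stand-ins so the sketch runs without a model or a database.
def fake_retrieve(question):
    return ["Headline one", "Headline two"]


def fake_llm(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


print(build_stuff_prompt("Can I buy a Toshiba laptop?", fake_retrieve("")))
print(answer_with_stuff("Can I buy a Toshiba laptop?", fake_retrieve, fake_llm))
```

Because everything goes into one prompt, this approach only works while the retrieved documents fit in the model's context window, which is one reason short headlines suit it well.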


Time to call the chain and obtain the responses!

In [20]:
#Sample question for newscatcher dataset. 
response = document_qa.invoke("Can I buy a Toshiba laptop?")

#Sample question for BBC Dataset. 
#response = document_qa.invoke("Who is going to meet boris johnson?")

display(response)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


{'query': 'Can I buy a Toshiba laptop?',
 'result': ' No, Toshiba officially stopped making laptops in April 2023.\n\n'}

## USING THE NEW LCEL Architecture from LangChain. 
LangChain now recommends using LCEL (LangChain Expression Language) instead of the older Chain classes. 
I use both methods in this notebook, but note that this one is the recommended approach. 

In [21]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
template = """Answer the question based on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

In [22]:

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | hf_llm
    | StrOutputParser()
)
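
The `|` syntax above composes steps so that the output of one becomes the input of the next, and the dictionary at the start feeds the same question to several branches at once. The following plain-Python sketch mimics that behaviour with operator overloading. `Step` and `parallel` are hypothetical helpers written for this illustration and are not part of LangChain.

```python
class Step:
    """Hypothetical minimal runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Chain two steps: run self first, then feed its result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(
        lambda value: {name: step.invoke(value) for name, step in branches.items()}
    )


# Stand-ins for the retriever, the prompt, the model, and the output parser.
retrieve = Step(lambda question: "Headline one\nHeadline two")
passthrough = Step(lambda value: value)
fill_prompt = Step(
    lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}\n"
)
fake_llm = Step(lambda prompt: prompt.upper())  # pretend model: upper-cases text
to_text = Step(str.strip)

toy_chain = (
    parallel(context=retrieve, question=passthrough)
    | fill_prompt
    | fake_llm
    | to_text
)
print(toy_chain.invoke("Can I buy a Toshiba laptop?"))
```

The appeal of this style is that each step stays small and replaceable: swapping the model or the prompt means changing one element of the pipeline rather than rewriting the chain.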

In [23]:
chain.invoke("Can I buy a Toshiba laptop?")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


'Answer: No, Toshiba officially ended its laptop manufacturing operations in 2023.\n\n'