**DSS : BUILDING LARGE LANGUAGE MODELS FOR BUSINESS APPLICATIONS Day 2**

# Creating Conversational AI with Large Language Models for Business

## Training Objective

In this module, we will embark on a journey to explore the fascinating world of Conversational AI and its applications in various business domains. We will delve into the principles, techniques, and best practices for harnessing the potential of LLMs to build robust and effective conversational systems. Whether you are a business professional, data scientist, or developer, this module will equip you with the knowledge and skills needed to leverage LLMs for creating advanced conversational AI solutions tailored to the specific needs of your organization.

- **Large Language Models: Architecture, Transformer, and Key Concepts**
   - Overview of Large Language Models and their Architecture
   - Understanding the Transformer
   - Explanation of pre-training and fine-tuning of language models
   - Introduction to popular Large Language Models like GPT-3, GPT-2, and BERT
   - Understanding the capabilities and limitations of Large Language Models
   - Explanation of the LangChain concept
   - Setting the API key and .env


- **Building Question-Answering Systems with Large Language Models**
   - Introduction to Question-Answering System
   - Steps involved in connecting databases with LLM
   - Basics of building a Question-Answering System using LLM with a database
   - Demonstration of using OpenAI and LangChain to build a Question-Answering System
   - Using LangChain and OpenAI to build a Question-Answering System with text data
   - Steps involved in connecting CSV data with LLM
   - Demonstration of using LangChain and OpenAI to build a Question-Answering System with text data


- **Text Generation with HuggingFace**
   - Introduction to the Text Generation model in HuggingFace
   - Setting the .env token key
   - Applying HuggingFace's Inference API to use LLM without OpenAI credits
   - Integrating HuggingFace's Inference API into the previously built Question-Answering System
   - Demonstration of using HuggingFace's Inference API to build a Question-Answering System

# Large Language Models


**What are Large Language Models?**

Large Language Models (LLMs) are an advanced type of language model that represents a breakthrough in the field of natural language processing (NLP). These models are designed to understand and generate human-like text by leveraging the power of deep learning algorithms and massive amounts of data.

If you've ever chatted with a virtual assistant or interacted with an AI customer service agent, you might have interacted with a large language model without even realizing it. These models have a wide range of applications, from chatbots to language translation to content creation.

Some of the most impressive large language models are developed by OpenAI. Their GPT-3 model, for example, has over [175 billion parameters](https://www.techtarget.com/searchenterpriseai/definition/GPT-3#:~:text=GPT%2D3%20has%20more%20than,(BERT)%20and%20Turing%20NLG.) and is able to perform tasks like [summarization](https://wandb.ai/mostafaibrahim17/ml-articles/reports/Compressing-the-Story-The-Magic-of-Text-Summarization--VmlldzozNTYxMjc2), [question-answering](https://wandb.ai/mostafaibrahim17/ml-articles/reports/The-Answer-Key-Unlocking-the-Potential-of-Question-Answering-With-NLP--VmlldzozNTcxMDE3), and even creative writing.



**How is a Large Language Model Built?**

The architecture of LLMs is based on the Transformer model, which has revolutionized NLP tasks. The Transformer model utilizes a self-attention mechanism that allows the model to focus on different parts of the input sequence, capturing dependencies and relationships between words more effectively. This architecture enables LLMs to generate coherent and contextually relevant responses, making them valuable tools for a wide range of applications.

A large-scale transformer model known as a “large language model” is typically too massive to run on a single computer and is, therefore, provided as a service over an API or web interface. These models are trained on vast amounts of text data from sources such as books, articles, websites, and numerous other forms of written content. By analyzing the statistical relationships between words, phrases, and sentences through this training process, the models can generate coherent and contextually relevant responses to prompts or queries.
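To make "statistical relationships between words" concrete, here is a deliberately tiny sketch: a bigram model that counts which word follows which and predicts the most frequent follower. It is an illustration only and an assumption of this example, not how GPT-style models work internally (they use neural networks over much longer contexts). The function names and the three-sentence corpus are made up for this sketch.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if the word was never followed."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus (not from any real dataset)
corpus = [
    "the customer asked a question",
    "the customer asked for a refund",
    "the agent answered the question",
]
model = train_bigram_model(corpus)
print(predict_next(model, "customer"))  # asked
```

A real LLM replaces the counting table with billions of learned parameters, but the training signal is similar in spirit: predict a word from the words around it.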

*OpenAI's GPT-3* model, for instance, was trained on massive amounts of internet text data, giving it the ability to understand various languages and draw on knowledge of diverse topics. As a result, it can produce text in multiple styles. Its capabilities, including translation, text summarization, and question answering, may seem impressive, but they are not surprising: these tasks follow patterns (special “grammars”) that the model learned during training and that match up with the prompts it receives.

### Understanding the Transformer

The Transformer is a type of deep learning architecture that has revolutionized the field of natural language processing. It was introduced in the paper ["Attention Is All You Need" by Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762). The Transformer model employs self-attention mechanisms to capture dependencies between words in a sentence, enabling it to learn contextual relationships and generate coherent and contextually relevant text.

The Transformer architecture excels at handling text data, which is inherently sequential. A Transformer takes a text sequence as input and produces another text sequence as output, e.g., translating an input French sentence to English.

<img src="https://jalammar.github.io/images/t/the_transformer_3.png">

The Transformer architecture consists of two main components: the encoder and the decoder. The encoder processes the input sequence and generates a representation, which is then passed to the decoder. The decoder generates the output sequence based on the encoder's representation and previous outputs.

<img src="https://jalammar.github.io/images/t/The_transformer_encoders_decoders.png">

Here are the key components and concepts of the Transformer architecture:

1. **Positional Encoding**: Transformers incorporate positional encoding to provide the model with information about the order of words in the input sequence. Positional encoding is usually added to the input embeddings and allows the model to differentiate between the positions of words.

![positional_encoding](assets/positional_encoding.png)

2. **Self-Attention Mechanism**: Self-attention allows each word in the input sequence to attend to all other words. It computes the attention weight between each pair of words and uses them to generate a weighted sum of the word embeddings. This mechanism enables the model to capture long-range dependencies and learn contextual relationships effectively.

![self_attention](assets/self_attention.png)
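The two components above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the positional encoding is the sinusoidal variant from the Vaswani et al. paper, and the self-attention uses the input vectors directly as queries, keys, and values (real Transformers apply learned projection matrices first). The toy embeddings and function names are invented for this example.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this word to every word in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all word vectors
        outputs.append([sum(w_j * value[c] for w_j, value in zip(w, x)) for c in range(d)])
    return outputs, weights

# Hypothetical 4-dimensional embeddings for a 3-word sentence
embeddings = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
pe = positional_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]
outputs, weights = self_attention(x)
print(weights[0])  # attention weights of the first word over all three words
```

Each row of `weights` sums to 1, so every output vector is a blend of all input vectors, which is how a word's representation comes to reflect its context.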

### Pre-training and Fine-tuning of Language Models

Pre-training and fine-tuning are two key steps in the training process of language models, including Large Language Models (LLMs). 

**Pre-training**: In the pre-training phase, a language model is trained on a large corpus of unlabeled text data. During this phase, the model learns to predict words from their context, and in doing so develops an understanding of language patterns, grammar, and contextual relationships. The exact objective depends on the model family: BERT uses masked language modeling, where certain words are randomly masked and the model learns to predict them based on the remaining context, while GPT models learn to predict the next word given the preceding words.
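As a rough sketch of how masked language modeling turns unlabeled text into training examples, the snippet below hides a random subset of tokens and records the hidden words as prediction targets. The helper name, the `[MASK]` placeholder, and the masking rate are assumptions chosen for illustration; production pipelines work on subword tokens and add further rules.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and return the hidden targets."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the word the model must learn to predict
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "the model learns to predict missing words from context".split()
masked, targets = mask_tokens(tokens, mask_rate=0.3)
print(masked)   # model input, with some words hidden
print(targets)  # position -> original word, used as the training label
```

No human labeling is needed: the text itself supplies the answers, which is why pre-training can use very large unlabeled corpora.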

**Fine-tuning**: After pre-training, the language model is fine-tuned on specific labeled datasets for specific downstream tasks. Fine-tuning involves training the pre-trained model on labeled data related to a particular task, such as question answering, sentiment analysis, or text classification. This process allows the model to adapt to the specific task by learning task-specific patterns and improving its performance. Fine-tuning is performed on a smaller dataset, which is typically task-specific and labeled by human experts.

In the context of Large Language Models (LLMs), the terms "pre-trained" and "fine-tuned" refer to two stages in the model development process. This two-step process offers several advantages:

- Pre-training on large-scale data helps LLMs learn from a diverse range of linguistic patterns and structures, enhancing their language understanding capabilities.
- Fine-tuning allows LLMs to adapt to specific tasks or domains, improving their performance and making them more efficient in generating desired outputs.
- Fine-tuning requires comparatively less labeled data than training from scratch, making it a practical approach when labeled data is limited.

LLMs, such as GPT-3, GPT-2, and BERT, are examples of large-scale language models that have undergone extensive pre-training and fine-tuning processes. They have been trained on vast amounts of text data and have a large number of parameters. This pre-training and fine-tuning approach allows LLMs to capture complex language patterns, generate coherent text, and perform well on a wide range of natural language processing tasks.

### Popular Large Language Models 

Popular Large Language Models (LLMs) are advanced models that have gained significant attention in the field of natural language processing. They have been trained on massive amounts of text data and have a large number of parameters, allowing them to capture complex language patterns and generate coherent text.

Here are some examples of popular LLMs:

1. **[GPT-3](https://openai.com/blog/gpt-3-apps) (Generative Pre-trained Transformer 3)**: GPT-3 is a state-of-the-art language model developed by OpenAI. It is renowned for its impressive size, consisting of 175 billion parameters. GPT-3 has been trained on a vast amount of internet text data, enabling it to understand and generate human-like text. It can perform a wide range of natural language processing tasks, including language translation, text completion, sentiment analysis, and more. GPT-3 has shown remarkable capabilities in generating coherent and contextually relevant responses, making it a powerful tool for various applications.

2. **[GPT-2](https://huggingface.co/gpt2) (Generative Pre-trained Transformer 2)**: GPT-2 is the predecessor to GPT-3, also developed by OpenAI. Although smaller in size with 1.5 billion parameters, GPT-2 still delivers impressive language generation capabilities. It has been trained on diverse internet text sources, allowing it to produce high-quality text in a variety of styles and topics. GPT-2 is widely used for tasks such as text completion, text generation, and language understanding.

3. **[BERT](https://machinelearningmastery.com/a-brief-introduction-to-bert/) (Bidirectional Encoder Representations from Transformers)**: BERT is a groundbreaking language model developed by Google. It introduced the concept of bidirectional training, which significantly improved the understanding of context in natural language processing. BERT has been trained on large-scale text data and employs a transformer architecture. It excels in various language understanding tasks, including question-answering, sentiment analysis, named entity recognition, and more. BERT has set new benchmarks in several natural language processing tasks and has been widely adopted in both research and industry.


### Capabilities and limitations of Large Language Models

Large language models like GPT-3, GPT-2, and BERT exhibit impressive capabilities in tasks such as:

- text generation
- language translation
- text understanding
- sentiment analysis
- text summarization
- question answering

They can understand complex language structures, generate coherent text, and perform well on a range of natural language processing tasks. 

However, it's essential to acknowledge the limitations of LLMs:

- Bias and Ethical Concerns: LLMs can inherit biases present in the training data, leading to biased or controversial outputs. Ensuring fairness, diversity, and ethical use of LLMs is a critical challenge.
- Contextual Understanding: While LLMs can generate coherent text, they may sometimes struggle with understanding the broader context or resolving ambiguous statements.
- Lack of Real-World Knowledge: LLMs are trained on vast amounts of text data, but they lack true real-world experience and common-sense reasoning abilities. They may provide accurate information but lack true understanding.
- Computational Requirements: LLMs are computationally intensive, requiring significant computational resources for training and inference. This can limit their accessibility and scalability for some applications.
- Data Dependency: LLMs heavily rely on the quality and diversity of the training data. Inadequate or biased data can impact their performance and generalization capabilities.

## Introduction to LangChain

[LangChain](https://python.langchain.com/docs/get_started/introduction.html) is a framework for developing applications powered by language models. It connects language models, such as OpenAI's GPT-3 or GPT-2, with other tools, data sources, and APIs to build a flexible language processing pipeline that addresses specific business needs.

The LangChain concept aims to leverage the strengths of each language model and API to create a comprehensive language processing system. It allows developers to combine different models for tasks like question answering, text generation, translation, summarization, sentiment analysis, and more.

The core idea of the library is that we can **“chain”** together different components to create more advanced use cases around LLMs. Chains may consist of multiple components from several modules:

1. **Prompt templates**: Templates for different types of prompts, such as “chatbot”-style templates, ELI5 question answering, etc.

2. **LLMs**: Large language models like GPT-3, BLOOM, etc.

3. **Agents**: Agents use LLMs to decide what actions should be taken. Tools like web search or calculators can be used, and all are packaged into a logical loop of operations.

4. **Memory**: Short-term and long-term memory that lets a chain or agent retain information between calls.





### Environment Set-up

Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we'll use OpenAI's model APIs.

#### Setting API key and `.env`

Accessing the API requires an API key, which you can get by creating an account with the provider (here, OpenAI) and generating a key from your account page. To set up an API key with a `.env` file in your Python project, follow these general steps:

1. **Obtain an API key**: If you're working with an external API or service that requires an API key, you need to obtain one from the provider. This usually involves signing up for an account and generating an API key specific to your project.

2. **Create a .env file**: In your project directory, create a new file and name it ".env". This file will store your API key and other sensitive information securely.

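3. **Store API key in .env**: Open the .env file in a text editor and add a line to store your API key. The format should be `API_KEY=your_api_key`, where "API_KEY" is the name of the variable and "your_api_key" is the actual value of your API key. Make sure not to include any quotes or spaces around the value. For the OpenAI integration used in this module, LangChain reads the key from the `OPENAI_API_KEY` variable, so the line would be `OPENAI_API_KEY=your_api_key`.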

4. **Load environment variables**: In your Python code, you need to load the environment variables from the .env file before accessing them. Import the dotenv module and add the following code at the beginning of your script:

```python
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
```

> The `dotenv` library (installed as `python-dotenv`) is a popular Python library that simplifies the process of loading environment variables from a .env file into your Python application. It allows you to store configuration variables separately from your code, making it easier to manage sensitive information such as API keys, database credentials, or other environment-specific settings.
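Once the variables are loaded, it helps to check that the key is actually present before calling any API. The sketch below uses only the standard library; `require_env`, `mask_secret`, and the `DEMO_API_KEY` variable are hypothetical names for this illustration, and the stand-in value is there only so the snippet runs without a real key.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment.")
    return value

def mask_secret(secret, visible=4):
    """Show only the last few characters so the full key never appears in logs."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Stand-in value for the sketch; in a real project load_dotenv() would populate the variable.
os.environ.setdefault("DEMO_API_KEY", "sk-demo-1234")
print(mask_secret(require_env("DEMO_API_KEY")))
```

Failing early with a readable error is usually easier to debug than an authentication error raised deep inside a library call.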


```python
from dotenv import load_dotenv

load_dotenv()
```

```
True
```

## LangChain Quickstart

In `LangChain`, a QuickStart involves working with three key components: Prompt, Chain, and Agent. 

With the **Prompt, Chain, and Agent** components working together, we can engage in interactive conversations with the language model. The Prompt sets the context or initiates the conversation, the Chain links the prompt, the model, and optionally other components into a single workflow, and the Agent uses the model to decide which actions or tools to use.

Using these components, we can build dynamic and **interactive applications** that involve back-and-forth interactions with the language model, allowing us to create **conversational agents**, **chatbots**, **question-answering systems**, and more.

To use an OpenAI language model through the `LangChain` library, we follow these steps:

1. **Importing the Required Module**: The code imports the LangChain library by using the statement `from langchain import OpenAI`.

2. **Creating an OpenAI Instance**: The code creates an instance of the `OpenAI` class and assigns it to the variable `llm`. This instance represents the connection to the OpenAI language model.

3. **Setting the Temperature Parameter**: The `temperature` parameter is passed to the `OpenAI` instance during its initialization. Temperature is a parameter that controls the randomness of the language model's output. 
> A higher temperature value (e.g., 0.9) makes the generated text more **diverse and creative**, while a lower value (e.g., 0.2) makes it more **focused and deterministic**.
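To see why temperature changes the output, the sketch below applies it to a softmax over made-up scores for three candidate next words. Dividing the scores by a small temperature sharpens the distribution, while a larger temperature flattens it. The scores and the function name are assumptions for illustration; the actual sampling inside the OpenAI service is not shown here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words
logits = [2.0, 1.0, 0.5]
for temperature in (0.2, 0.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature nearly all probability mass sits on the top-scoring word, so repeated runs tend to give the same answer; at the higher temperature the alternatives are sampled more often.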


```python
from langchain import OpenAI

# GPT-3 family completion model with a low temperature
llm = OpenAI(temperature=0.1)
```

By creating an instance of `OpenAI` and setting the desired temperature, we can now use the `llm` object to interact with the OpenAI language model. We can pass prompts or messages to the `llm` object, receive the generated responses, and customize the behavior of the language model using additional parameters and methods provided by the LangChain library.

### Prompt

#### Basic Prompt

**Prompt** refers to the initial input or instruction given to the language model to generate a response. It sets the context and provides guidance for the language model to produce relevant and coherent text. 

In this example, the prompt asks for a suggestion of a good name for a brand specializing in local burgers. 

```python
prompt = "What is a good name for a brand that makes local burger?"
print(llm.predict(prompt))
```

```
Burger Towne.
```


The language model then uses its knowledge and training to generate a response that fits the given prompt. Notice that each re-run may generate a different answer.

> The `llm.predict()` function is called with the prompt as the input. This function sends the prompt to the language model and **generates** a response based on the given input. The generated text represents the **language model's prediction** or **completion** of the prompt.

We can also do this in other languages. Let's try Bahasa Indonesia:

```python
# Indonesian for: "A good name for a brand that makes mentai fried banana?"
print(llm.predict("Nama yang bagus untuk brand yang membuat pisang goreng mentai?"))
```

```
Delicious Delightful's Pisang Goreng Mentai.
```

By simply providing a prompt in Bahasa Indonesia, we obtain a response that is relevant to the Indonesian prompt. This showcases the versatility of the language models accessed through LangChain in understanding and generating text in various languages, allowing for multilingual applications and interactions.

#### Prompt Templates

LLM applications typically utilize a prompt template instead of directly inputting user queries into the LLM. This approach involves incorporating the user input into a larger text context known as a prompt template.

A **prompt template** is a structured format designed to generate prompts in a consistent manner. It consists of a text string, referred to as the "template," which can incorporate various parameters provided by the end user to create a dynamic prompt.

The prompt template can include:

- Instructions to guide the language model's response.
- A set of few-shot examples to assist the language model in generating more accurate and contextually appropriate outputs.
- A question posed to the language model.

In the previous example, the text passed to the model contained instructions to generate a brand name based on a given description. In our application, it would be convenient for users to only provide the description of their company or product without the need to explicitly provide instructions to the model.
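Before turning to LangChain's own class, the underlying idea can be sketched with plain Python string formatting: a template is a string with named placeholders, and the user supplies only the values. The helper `template_variables` is a hypothetical name for this sketch and is not part of LangChain.

```python
from string import Formatter

def template_variables(template):
    """List the placeholder names that appear in a format-style template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]

template = "What is a good name for a brand that makes {product}?"
print(template_variables(template))              # ['product']
print(template.format(product="local burger"))   # the user only supplies the product
```

Keeping the instruction in the template and the user input in a variable means the instruction can be improved later without changing how users interact with the application.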

To create a prompt template using LangChain, we begin by importing the `PromptTemplate` class from the `langchain.prompts` module. This class allows us to create and manipulate prompt templates.

```python
from langchain.prompts import PromptTemplate
```

**Create** a prompt template: Use the `PromptTemplate.from_template()` method to create a `PromptTemplate` object from the template string. 

In this case, the template string is "What is a good name for a brand that makes {product}?", where `{product}` acts as a placeholder for the product name.

```python
# Create a prompt template
template_prompt = PromptTemplate.from_template("What is a good name for a brand that makes {product}?")
```

**Format** the prompt template: Use the `.format()` method of the `PromptTemplate` object to replace the placeholder in the template with the desired value. In this case, the placeholder `{product}` is replaced with the string "local burger".

```python
# Format the prompt template
prompt = template_prompt.format(product="local burger")

# Print the prompt
print(prompt)
```

```
What is a good name for a brand that makes local burger?
```


Notice that the instruction changes automatically based on the user input. This instruction is then passed to `llm` to generate the response. Let's get the response generated by the language model (`llm`) for the given prompt.

```python
print(llm.predict(prompt))
```

```
Burger Boro.
```


A template can also take more than one input variable. For example:

```python
# Define a string template for a poem
template = "Write a {adjective} poem about {subject}"

# Create a prompt template
poem_template = PromptTemplate(
    input_variables=["adjective", "subject"],
    template=template,
)

print(poem_template)
```

```
input_variables=['adjective', 'subject'] output_parser=None partial_variables={} template='Write a {adjective} poem about {subject}' template_format='f-string' validate_template=True
```

The `format()` method replaces the placeholders `{adjective}` and `{subject}` with the provided values. The resulting string is "Write a sad poem about ducks".

```python
poem_template.format(adjective='sad', subject='ducks')
```

```
'Write a sad poem about ducks'
```

```python
# Generate a response
print(llm.predict(poem_template.format(adjective='sad', subject='ducks')))
```

```
The little ducks waddle on by,
Hoping for some company,
But sadly no one stops,
To offer them some kindness

The power of the wind ripples along,
Their hearts and bodies heavy and worn,
Lonely days on the lake,
No one stopping to take

Just quack after quack,
Resounding like a sad lack,
Of something they desire,
A place in this world to thrive

The Mallards take flight,
But this is their plight,
In the sky looking down,
At the pond so still and sound

The ducks mourn and cry,
For the life they could have had,
As they look up to the sky,
Unable to comprehend why.
```


```python
# Generate a response with Indonesian values: 'sedih' means sad, 'bebek' means duck
print(llm.predict(poem_template.format(adjective='sedih', subject='bebek')))
```

```
A bebek in the pond, so still and small
Its wings so pure and white, never to be flown at all
A life of peace and bliss, never to face any woes
A flower waiting to bloom, held back by freezing snow

The ripples of its pond, the world without a sound
An onlooker stares in awe, but all else goes unseen around
A gentle splash, a soft breath, whatever the life ahead
A single moment of joy, held in this fragile thread

Growing old in the pond, while life around it whizzes
A fleeting eternity, never again in its soul to be eased
A murky fate always looming, as the years will wear it thin
No more ripples, no more dream, a bebek left to give in
```

```python
poem_template.format(adjective='sedih', subject='bebek')
```

```
'Write a sedih poem about bebek'
```

```python
from langchain.prompts import PromptTemplate

# Indonesian for: "make a recipe for {makanan} (food) with {bahan} (ingredient)",
# followed by an English request to translate the result
template = "buatlah resep membuat {makanan} dengan {bahan} and please translate it to english"

recipe_template = PromptTemplate(
    input_variables=["makanan", "bahan"],
    template=template,
)

print(recipe_template)
```

```
input_variables=['makanan', 'bahan'] output_parser=None partial_variables={} template='buatlah resep membuat {makanan} dengan {bahan} and please translate it to english' template_format='f-string' validate_template=True
```

```python
# 'gorengan' means fried snack, 'bebek' means duck
print(llm.predict(recipe_template.format(makanan='gorengan', bahan='bebek')))
```

```
Bahan-bahan : 
1 bebek 
Tepung bumbu 
Minyak goreng untuk menggoreng

Cara Membuat : 
1. Potong bebek menjadi bagian-bagian yang kecil 
2. Marinasi potongan bebek dengan tepung bumbu selama 15 - 20 menit 
3. Panaskan minyak dalam wajan 
4. Goreng potongan bebek hingga berwarna kuning keemasan. 
5. Angkat dan sajikan.

English Translation : 
Ingredients : 
1 duck
Seasoning flour
Frying oil for frying

Instructions : 
1. Cut the duck into small pieces
2. Marinate the duck pieces with seasoning flour for 15 - 20 minutes
3. Heat the oil in the pan 
4. Fry the duck pieces until golden brown. 
5. Lift and serve.
```


We can also create a prompt template that makes the model act as a naming consultant for new companies:

```python
# Define the prompt template
template = """
I want you to act as a naming consultant for new companies.

Here are some examples of good company names:

- search engine, Google
- social media, Facebook
- video sharing, YouTube

The name should be short, catchy and easy to remember. 

What is a good name for a brand that makes {product}?
"""

# Create a PromptTemplate object
brand_template = PromptTemplate(
    input_variables=["product"],
    template=template,
)

# Format the prompt template with a specific product
batik_prompt = brand_template.format(product='batik')

# Print the model's response to the formatted prompt
print(llm.predict(batik_prompt))
```

```
BatikBali.
```


By using prompt templates, we can easily generate prompts for various industries by filling in the specific values for the variables. This approach allows us to create dynamic and **customizable** prompts for the naming consultant application.
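The few-shot examples in such a prompt do not have to be hard-coded. As a small sketch in plain Python, the example lines can be generated from a list, which makes it easy to swap examples per industry. The function `build_few_shot_prompt` and its arguments are hypothetical names for this illustration, not a LangChain API.

```python
def build_few_shot_prompt(instruction, examples, question):
    """Assemble the instruction, the example lines, and the final question into one prompt."""
    example_lines = "\n".join(f"- {category}, {name}" for category, name in examples)
    return (
        f"{instruction}\n\n"
        "Here are some examples of good company names:\n\n"
        f"{example_lines}\n\n"
        f"{question}"
    )

examples = [("search engine", "Google"), ("social media", "Facebook"), ("video sharing", "YouTube")]
prompt = build_few_shot_prompt(
    "I want you to act as a naming consultant for new companies.",
    examples,
    "What is a good name for a brand that makes batik?",
)
print(prompt)
```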

### Chain

Now that we have our model and prompt template, we can combine them by creating a "chain". Chains provide a mechanism to link or connect multiple components, such as models, prompts, and other chains.

The most common type of chain is an LLMChain, which involves passing the input through a PromptTemplate and then to an LLM. We can create an LLMChain using our existing model and prompt template.
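Conceptually, such a chain is just function composition: format the template, then call the model. The sketch below shows that idea in plain Python with a stand-in model so it runs without an API key. `make_chain` and `echo_llm` are hypothetical names for this illustration; they are not how LangChain implements `LLMChain`.

```python
def make_chain(template, llm):
    """Return a function that formats the template and passes the result to `llm`."""
    def run(**inputs):
        prompt = template.format(**inputs)
        return llm(prompt)
    return run

def echo_llm(prompt):
    # Stand-in for a real model call: it only reports the prompt it received.
    return f"[model received] {prompt}"

name_chain = make_chain("What is a good name for a brand that makes {product}?", echo_llm)
print(name_chain(product="rendang mozarella"))
```

Because the chain hides the formatting step, callers only pass the variable values, which is the convenience the `LLMChain` class provides.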

For example, if we want to generate a response using our template, our workflow would be as follows:

1. Create the prompt based on input with `template_prompt`

```python
# Create a prompt template
template_prompt = PromptTemplate.from_template(
    "What is a good name for a brand that makes {product}?"
)
prompt = template_prompt.format(product="rendang mozarella")

print(prompt)
```

```
What is a good name for a brand that makes rendang mozarella?
```


2. Generate response from prompt with `llm`

```python
print(llm.predict(prompt))
```

```
Mozarella Rendang Co.
```


We can simplify this workflow by linking the prompt template and the model with a chain:

```python
# Import LLMChain class from langchain
from langchain.chains import LLMChain
```

```python
# Chain the prompt template and llm
chain = LLMChain(llm=llm, prompt=template_prompt, verbose=True)
```

```python
# Execute the chained model and prompt template
print(chain.run('rendang mozarella'))
```

```
> Entering new LLMChain chain...
Prompt after formatting:
What is a good name for a brand that makes rendang mozarella?

> Finished chain.

Mozarella Rendang Co.
```


```python
template = """
I want you to act as a naming consultant for new companies.

Here are some examples of good company names:

- search engine, Google
- social media, Facebook
- video sharing, YouTube

The name should be short, catchy and easy to remember. 

What is a good name for a brand that makes {product}?
"""

# Create a PromptTemplate object
brand_template = PromptTemplate(
    input_variables=["product"],
    template=template,
)

chain = LLMChain(llm=llm, prompt=brand_template, verbose=True)

print(chain.run('bakso'))
```

```
> Entering new LLMChain chain...
Prompt after formatting:

I want you to act as a naming consultant for new companies.

Here are some examples of good company names:

- search engine, Google
- social media, Facebook
- video sharing, YouTube

The name should be short, catchy and easy to remember. 

What is a good name for a brand that makes bakso?


> Finished chain.

BaksoBros.
```


The `chain.run()` method generates a response from the LLM model based on the provided input. 

By chaining the LLM model and the prompt template using the `LLMChain` class, we can conveniently pass inputs through the template and obtain contextually relevant responses from the model. This simple chain allows us to generate responses with just **one line** of code for each new input. Understanding the workings of this basic chain will serve as a solid foundation for working with more intricate chains.

### Agents

In more complex workflows, it becomes crucial to have the ability to make decisions and choose actions based on the given context. This is where agents come into play.

Agents utilize a language model to **determine which actions** to take and in what sequence. They have a set of tools at their disposal, and they continually select, execute, and evaluate these tools until they arrive at the optimal solution. Agents provide a dynamic and adaptable approach to **problem-solving** within the LangChain framework, allowing for more sophisticated and flexible workflows.

To load an agent in LangChain, you need to consider the following components:

- **LLM/Chat model:** This refers to the language model that powers the agent. It is responsible for generating responses based on the given input. You can choose from various pre-trained models or use your own custom models.

- **Tools:** Tools are functions or methods that perform specific tasks within the agent's workflow. These can include actions like Google Search, Database lookup, Python REPL (Read-Eval-Print Loop), or even other chains. LangChain provides a set of predefined tools with their specifications, which you can refer to in the Tools documentation.

- **Agent name:** The agent name is a string that identifies a supported agent class. Each agent class is parameterized by the prompt that the language model uses to determine the appropriate action to take. In this context, we will focus on using the standard supported agents, rather than implementing custom agents. You can explore the list of supported agents and their specifications to choose the most suitable one for your application.

For the specific example mentioned, we will utilize the `wikipedia` tool to query and retrieve responses based on Wikipedia information. This tool allows the agent to access relevant information from Wikipedia and provide informative responses based on the given input.
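Before loading a real agent, the select-execute-evaluate loop described above can be sketched in plain Python. Everything here is a stand-in and an assumption of this sketch: the two toy tools, the simplified `Action` / `Action Input` / `Final Answer` text format, and `scripted_model`, which replays fixed replies in place of an LLM so the example runs without an API key. It is not LangChain's implementation.

```python
import re

def calculator(expression):
    """Toy tool: add or multiply two numbers written as 'a + b' or 'a * b'."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return str(a + b if op == "+" else a * b)

def lookup(topic):
    """Toy tool: stands in for a knowledge source such as an encyclopedia search."""
    facts = {"transformer": "Introduced in 2017 in 'Attention Is All You Need'."}
    return facts.get(topic.lower(), "No entry found.")

TOOLS = {"Calculator": calculator, "Lookup": lookup}

def scripted_model(history):
    """Stand-in for the LLM: picks the next reply from the number of observations so far."""
    steps = [
        "Action: Lookup\nAction Input: transformer",
        "Action: Calculator\nAction Input: 2017 + 6",
        "Final Answer: The Transformer dates from 2017; six years later is 2023.",
    ]
    return steps[history.count("Observation:")]

def run_agent(question, model, tools, max_steps=5):
    """Ask the model for a step, run the chosen tool, feed the result back, and repeat."""
    history = f"Question: {question}"
    for _ in range(max_steps):
        reply = model(history)
        if reply.startswith("Final Answer:"):
            return reply[len("Final Answer:"):].strip(), history
        match = re.search(r"Action: (.+)\nAction Input: (.+)", reply)
        tool_name, tool_input = match.group(1).strip(), match.group(2).strip()
        observation = tools[tool_name](tool_input)
        history += f"\n{reply}\nObservation: {observation}"
    raise RuntimeError("Agent did not finish within max_steps")

answer, history = run_agent("When was the Transformer introduced, plus six years?", scripted_model, TOOLS)
print(history)
print(answer)
```

In a real agent the scripted replies are replaced by LLM calls, and the growing `history` is what lets the model decide its next action from earlier observations.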

**Import the required modules**: The code starts by importing the necessary modules from LangChain, such as `AgentType`, `initialize_agent`, and `load_tools`. These modules provide the functionalities required to create and configure the agent.

```python
from langchain.agents import AgentType, initialize_agent, load_tools
```

**Define the language model for the agent**: In this example, the `llm_agent` is initialized with the `OpenAI` class, which represents the language model. The `temperature` parameter determines the level of randomness in the generated responses.

In [86]:
# The language model we're going to use to control the agent.
llm_agent = OpenAI(temperature=0)

**Load the tools**: The `load_tools` function is used to load the desired tools for the agent. In this case, the tools "wikipedia" and "llm-math" are loaded. 

> The "wikipedia" tool allows the agent to access information from Wikipedia, while the "llm-math" tool utilizes the language model for mathematical operations.

In [94]:
# The tools we'll give the Agent access to. Note that the 'llm-math' tool uses an LLM, so we need to pass that in.
tools = load_tools(["wikipedia", "llm-math"], llm=llm_agent)

**Initialize the agent**: The `initialize_agent` function is called to create an agent instance. It takes the loaded tools, the language model (`llm_agent`), the agent type (`AgentType.ZERO_SHOT_REACT_DESCRIPTION`), and an optional `verbose` parameter. The agent type determines how the agent decides what to do: this one follows the ReAct pattern (alternating reasoning and acting) and chooses a tool based on each tool's description, without worked examples in the prompt. With `verbose=True`, the intermediate steps are printed.

In [98]:
# Finally, let's initialize an agent with the tools, the language model, and the type of agent we want to use.
agent = initialize_agent(tools, llm_agent, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent

AgentExecutor(memory=None, callbacks=None, callback_manager=None, verbose=True, agent=ZeroShotAgent(llm_chain=LLMChain(memory=None, callbacks=None, callback_manager=None, verbose=False, prompt=PromptTemplate(input_variables=['input', 'agent_scratchpad'], output_parser=None, partial_variables={}, template='Answer the following questions as best you can. You have access to the following tools:\n\nWikipedia: A wrapper around Wikipedia. Useful for when you need to answer general questions about people, places, companies, facts, historical events, or other subjects. Input should be a search query.\nCalculator: Useful for when you need to answer questions about math.\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [Wikipedia, Calculator]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can
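
The prompt template in the output above asks the model to reply with `Action:` and `Action Input:` lines. To act on such a reply, the text has to be parsed. The following is a minimal sketch of that idea using a regular expression; `parse_step` is a hypothetical helper written for illustration and is not LangChain's own output parser.

```python
import re
from typing import Tuple

# Matches an "Action: ..." line immediately followed by "Action Input: ...".
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)

def parse_step(llm_text: str) -> Tuple[str, str]:
    """Return ("final", answer) or (tool_name, tool_input)."""
    if "Final Answer:" in llm_text:
        return "final", llm_text.split("Final Answer:", 1)[1].strip()
    match = ACTION_RE.search(llm_text)
    if match is None:
        raise ValueError("could not find an Action / Action Input pair")
    return match.group(1).strip(), match.group(2).strip()

step = parse_step(
    " I need to find out when Messi joined Barcelona.\n"
    "Action: Wikipedia\n"
    "Action Input: Lionel Messi"
)
print(step)
```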

Let's test it out

In [99]:
agent.run("What year did Lionel Messi Joined Barcelona? What is his current age raised to the 0.43 power?")

> Entering new AgentExecutor chain...
I need to find out when Messi joined Barcelona and then calculate his current age raised to the 0.43 power.
Action: Wikipedia
Action Input: Lionel Messi
Observation: Page: Lionel Messi
Summary: Lionel Andrés Messi (Spanish pronunciation: [ljoˈnel anˈdɾes ˈmesi] (listen); born 24 June 1987), also known as Leo Messi, is an Argentine professional footballer who plays as a forward for and captains both Major League Soccer club Inter Miami and the Argentina national team. Widely regarded as one of the greatest players of all time, Messi has won a record seven Ballon d'Or awards and a record six European Golden Shoes, and in 2020 he was named to the Ballon d'Or Dream Team. Until leaving the club in 2021, he had spent his entire professional career with Barcelona, where he won a club-record 34 trophies, including ten La Liga titles, seven Copa del Rey titles and the UEFA Champions League four times. With his countr

'Lionel Messi joined Barcelona in 2004 and is currently 34 years old, with his age raised to the 0.43 power being 4.555498776452875.'
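
The arithmetic in the final answer can be checked independently. The sketch below only verifies the exponentiation; the age of 34 is taken from the agent's answer as-is and is not verified here.

```python
age = 34          # taken from the agent's answer above, not independently verified
exponent = 0.43

result = age ** exponent
print(result)
```

Checking the number separately from the facts is useful because the calculator tool can be correct even when the value fed into it by the model is not.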

In [100]:
agent.run('Siapa president Republik Indonesia')  # Indonesian: "Who is the president of the Republic of Indonesia?"

> Entering new AgentExecutor chain...
I need to find out who the president of Indonesia is
Action: Wikipedia
Action Input: President of Indonesia
Observation: Page: President of Indonesia
Summary: The president of the Republic of Indonesia (Indonesian: Presiden Republik Indonesia) is both the head of state and the head of government of the Republic of Indonesia. The president leads the executive branch of the Indonesian government and is the commander-in-chief of the Indonesian National Armed Forces. Since 2004, the president and vice president are directly elected to a five-year term, once renewable, allowing for a maximum of 10 years in office.
Joko Widodo is the seventh and current president of Indonesia. He assumed office on 20 October 2014.

Page: List of presidents of Indonesia
Summary: The president is the head of state and also head of government of the Republic of Indonesia. The president leads the executive branch of the Indonesian gov

"Joko Widodo is the seventh and current president of Indonesia, and Ma'ruf Amin is the 13th and current vice president of Indonesia."

By executing these steps, we establish an agent that can utilize various tools and interact with the chosen language model to generate contextually relevant responses based on the given input.

# Build Question Answering System

## Introduction to Question-Answer System

As we know, LangChain is an open-source library that provides developers with powerful tools for building applications using Large Language Models (LLMs). In our previous example, we saw how we could use an LLM to generate responses based on a given question. However, there may be cases where we need to ask more specific questions related to our business domain. For instance, we might want to ask the LLM about our company's top revenue-generating product.

LLMs have certain limitations when it comes to specific contextual knowledge, as they are trained on a vast amount of general information. To overcome this limitation, we can provide additional documents or context to the LLM. The idea is to retrieve relevant documents related to our question from a corpus or database and then pass them along with the original question to the LLM. This allows the LLM to generate a response that is informed by the specific information contained in the retrieved documents.

These documents can come from various sources such as databases, PDF files, plain text files, or even information extracted from websites. By connecting and feeding these documents to the LLM, we can build a powerful Question-Answer System that leverages the LLM's language generation capabilities while incorporating domain-specific knowledge.

In this section, we will explore how to connect a database and text documents to an LLM in order to build a Question-Answer System that can provide contextually relevant answers to specific business-related questions.

## Database

### Connecting Database

LangChain provides a class called `SQLDatabase` that connects a database to an LLM. It also provides `SQLDatabaseChain`, which chains together the database and the LLM so that SQL queries are generated and executed based on a natural language prompt.

The integration process involves establishing a **connection** to the database, **generating** the necessary SQL queries from the user's prompt, and **executing** those queries against the database. This allows for a user-friendly and intuitive way to interact with databases, leveraging the language model's capabilities to understand and process natural language input.

To use LangChain to connect with a database, we need to utilize the `SQLDatabase` class. This class allows us to establish a connection between LangChain and an SQL database, enabling seamless integration of natural language queries and execution of SQL commands.

In [101]:
from langchain import SQLDatabase, SQLDatabaseChain

Next, we need to load the data. We will use the Chinook database from our academy class as an example. The connection URI must state explicitly which kind of database is being loaded, for example with the `sqlite:///` prefix for SQLite.

Then we can just load the database using `SQLDatabase` from `langchain`.

In [102]:
dburi = "sqlite:///data_input/chinook.db"

db = SQLDatabase.from_uri(dburi)


> `from_uri()` allows us to interact with the database and perform various operations such as querying data.

After that, we chain the `db` to the model, creating an agent that can search for answers in the Chinook database based on the prompt input.

Let's try a prompt to find out how many rows are there in the "tracks" table of this database.

In [103]:
llm = OpenAI(temperature=0) # parameter temperature

In [110]:
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

db_chain.run("How many rows is in the tracks table of this db?")

> Entering new SQLDatabaseChain chain...
How many rows is in the tracks table of this db?
SQLQuery: SELECT COUNT(*) FROM tracks;
SQLResult: [(3503,)]
Answer: The tracks table has 3503 rows.
> Finished chain.


'The tracks table has 3503 rows.'

In [111]:
db_chain.run("please describe this database for me? please avoid ambigous name")

> Entering new SQLDatabaseChain chain...
please describe this database for me?
SQLQuery: SELECT "ArtistId", "Name" FROM artists
          UNION
          SELECT "EmployeeId", "LastName" FROM employees
          UNION
          SELECT "GenreId", "Name" FROM genres
          UNION
          SELECT "MediaTypeId", "Name" FROM media_types
          UNION
          SELECT "PlaylistId", "Name" FROM playlists
          UNION
          SELECT "AlbumId", "Title" FROM albums
          UNION
          SELECT "CustomerId", "LastName" FROM customers
          UNION
          SELECT "InvoiceId", "InvoiceDate" FROM invoices
          UNION
          SELECT "TrackId", "Name" FROM tracks
          UNION
          SELECT "InvoiceLineId", "UnitPrice" FROM invoice_items
          UNION
          SELECT "PlaylistId", "TrackId" FROM playlist_track
          LIMIT 5;
SQLResult: [(1, 0.99), (1, 1), (1, 2), (1, 3), (1, 4)]
Answer: This database contains inf

'This database contains information about artists, employees, genres, media types, playlists, albums, customers, invoices, tracks, invoice items, and playlist tracks.'

In [112]:
db_chain.run("can you please make a dataset that contains track info, artist name, and genres")

> Entering new SQLDatabaseChain chain...
can you please make a dataset that contains track info, artist name, and genres
SQLQuery: SELECT t."Name" AS TrackName, a."Name" AS ArtistName, g."Name" AS GenreName
FROM tracks t
INNER JOIN albums al ON t."AlbumId" = al."AlbumId"
INNER JOIN artists a ON al."ArtistId" = a."ArtistId"
INNER JOIN genres g ON t."GenreId" = g."GenreId"
LIMIT 5;
SQLResult: [('For Those About To Rock (We Salute You)', 'AC/DC', 'Rock'), ('Put The Finger On You', 'AC/DC', 'Rock'), ("Let's Get It Up", 'AC/DC', 'Rock'), ('Inject The Venom', 'AC/DC', 'Rock'), ('Snowballed', 'AC/DC', 'Rock')]
Answer: The dataset contains track info, artist name, and genres.
> Finished chain.


'The dataset contains track info, artist name, and genres.'

Notice that the output contains several components:
- `SQLQuery`, the SQL statement the model generated to answer the prompt.
- `SQLResult`, the result obtained from executing that SQL query on our database.
- `Answer`, which restates the `SQLResult` in natural language and is returned as the final answer.
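
These components map onto a simple pipeline: turn the question into SQL, run the SQL, and phrase the result. The following is a minimal sketch using only the standard-library `sqlite3` module. `fake_text_to_sql` is a hypothetical stand-in for the LLM, and the in-memory `tracks` table holds toy rows, not the Chinook data.

```python
import sqlite3

def fake_text_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM that writes the SQL query."""
    return "SELECT COUNT(*) FROM tracks;"

def answer_question(conn: sqlite3.Connection, question: str) -> dict:
    sql_query = fake_text_to_sql(question)            # corresponds to SQLQuery
    sql_result = conn.execute(sql_query).fetchall()   # corresponds to SQLResult
    count = sql_result[0][0]
    answer = f"The tracks table has {count} rows."    # corresponds to Answer
    return {"SQLQuery": sql_query, "SQLResult": sql_result, "Answer": answer}

# Toy in-memory database with three made-up tracks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO tracks (Name) VALUES (?)", [("A",), ("B",), ("C",)])

out = answer_question(conn, "How many rows are in the tracks table?")
print(out["SQLQuery"])
print(out["SQLResult"])
print(out["Answer"])
```

In the real chain, both the first and the last step are performed by the language model, which is why the generated query should be reviewed before its result is trusted.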

### Basics of Building Question-Answer System using LLM

The `SQLDatabaseChain` is a powerful tool that enables us to answer questions by querying a SQL database. It integrates the language model's capabilities with SQL queries, providing a convenient way to retrieve specific information from structured data stored in a database. With the `SQLDatabaseChain`, we can combine the language model and the SQL database to build a question-answering system for our data-driven applications.

As another example, we use the following question:

> all sales in rock genre in 2012

In [115]:
db_chain.run("all sales in rock genre in 2012 that happened in germany based on invoice please dont use limit statement")

> Entering new SQLDatabaseChain chain...
all sales in rock genre in 2012 that happened in germany based on invoice please dont use limit statement
SQLQuery: SELECT i.InvoiceId, i.InvoiceDate, i.BillingCountry, t.GenreId, t.Name
FROM invoices i 
INNER JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId 
INNER JOIN tracks t ON ii.TrackId = t.TrackId 
WHERE t.GenreId = 1 
AND strftime('%Y', i.InvoiceDate) = '2012' 
AND i.BillingCountry = 'Germany'
SQLResult: [(291, '2012-06-30 00:00:00', 'Germany', 1, 'Creep'), (291, '2012-06-30 00:00:00', 'Germany', 1, 'Dark Corners'), (293, '2012-07-13 00:00:00', 'Germany', 1, 'Boris The Spider')]
Answer: There were 3 sales in rock genre in 2012 that happened in Germany based on invoice.
> Finished chain.


'There were 3 sales in rock genre in 2012 that happened in Germany based on invoice.'

If you want the result as a DataFrame, you can copy the SQL query from the output above and run it with `pandas`.

In [120]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('data_input/chinook.db')

pd.read_sql_query('''SELECT i.InvoiceId, i.InvoiceDate, i.BillingCountry, t.GenreId, t.Name 
FROM invoices i 
INNER JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId 
INNER JOIN tracks t ON ii.TrackId = t.TrackId 
WHERE t.GenreId = 1 
AND strftime('%Y', i.InvoiceDate) = '2012' 
AND i.BillingCountry = 'Germany' 
''', con = conn)

|   | InvoiceId | InvoiceDate | BillingCountry | GenreId | Name |
|---|---|---|---|---|---|
| 0 | 291 | 2012-06-30 00:00:00 | Germany | 1 | Creep |
| 1 | 291 | 2012-06-30 00:00:00 | Germany | 1 | Dark Corners |
| 2 | 293 | 2012-07-13 00:00:00 | Germany | 1 | Boris The Spider |


In [121]:
db_chain.run("We want the returned DataFrame to contain only the Pop genre and only when the UnitPrice of the track is 0.99")

> Entering new SQLDatabaseChain chain...
We want the returned DataFrame to contain only the Pop genre and only when the UnitPrice of the track is 0.99
SQLQuery: SELECT * FROM tracks t JOIN genres g ON t.GenreId = g.GenreId WHERE g.Name = 'Pop' AND t.UnitPrice = 0.99 LIMIT 5;
SQLResult: [(323, 'Dig-Dig, Lambe-Lambe (Ao Vivo)', 29, 1, 9, 'Cassiano Costa/Cintia Maviane/J.F./Lucas Costa', 205479, 6892516, 0.99, 9, 'Pop'), (324, 'Pererê', 29, 1, 9, 'Augusto Conceição/Chiclete Com Banana', 198661, 6643207, 0.99, 9, 'Pop'), (325, 'TriboTchan', 29, 1, 9, 'Cal Adan/Paulo Levi', 194194, 6507950, 0.99, 9, 'Pop'), (326, 'Tapa Aqui, Descobre Ali', 29, 1, 9, 'Paulo Levi/W. Rangel', 188630, 6327391, 0.99, 9, 'Pop'), (327, 'Daniela', 29, 1, 9, 'Jorge Cardoso/Pierre Onasis', 230791, 7748006, 0.99, 9, 'Pop')]
Answer: The DataFrame contains 5 tracks from the Pop genre with a UnitPrice of 0.99.
> Finished chain.


'The DataFrame contains 5 tracks from the Pop genre with a UnitPrice of 0.99.'

In [122]:
db_chain.run("Tampilkan lagu dengan Genre Pop")  # Indonesian: "Show songs in the Pop genre"

> Entering new SQLDatabaseChain chain...
Tampilkan lagu dengan Genre Pop
SQLQuery: SELECT "Name" FROM tracks WHERE "GenreId" = (SELECT "GenreId" FROM genres WHERE "Name" = 'Pop') LIMIT 5;
SQLResult: [('Dig-Dig, Lambe-Lambe (Ao Vivo)',), ('Pererê',), ('TriboTchan',), ('Tapa Aqui, Descobre Ali',), ('Daniela',)]
Answer: Lagu dengan Genre Pop adalah Dig-Dig, Lambe-Lambe (Ao Vivo), Pererê, TriboTchan, Tapa Aqui, Descobre Ali, dan Daniela.
> Finished chain.


'Lagu dengan Genre Pop adalah Dig-Dig, Lambe-Lambe (Ao Vivo), Pererê, TriboTchan, Tapa Aqui, Descobre Ali, dan Daniela.'

## Structured Data

### Connecting to CSV

Structured data is not only stored in database files; it can also be stored in other formats such as `.xlsx` and `.csv`, which represent data in a tabular form with columns and rows. In addition to providing agents to generate answers from databases using SQL based on natural language prompts, LangChain also offers agents to generate answers based on **tabular structured data sources**, such as CSV files. In this section, we will demonstrate how to utilize the agent for CSV data.


To begin, let's define the file path of our dataset `rice.csv`, which contains rice transaction data.

In [123]:
filepath = "data_input/rice.csv"

Next, we will create an agent specifically designed for working with CSV data. This agent will allow us **to query and retrieve information from the `rice.csv` dataset**. Since we are using the same LLM model as in the SQL part, there is no need to redefine the LLM. We can utilize the existing LLM model for our CSV agent.

In [124]:
from langchain.agents import create_csv_agent
agent = create_csv_agent(llm, filepath, verbose=True)

Then we simply ask the agent a question about our data.

In [125]:
agent.run("berikan detail banyaknya transaksi yang terjadi di setiap format")  # Indonesian: "give details of the number of transactions in each format"

> Entering new AgentExecutor chain...
Thought: saya harus menghitung jumlah transaksi yang terjadi di setiap format
Action: python_repl_ast
Action Input: df.groupby('format')['receipt_id'].count()
Observation: format
hypermarket     999
minimarket     7088
supermarket    3913
Name: receipt_id, dtype: int64
Thought: Saya sekarang tahu jawaban akhir
Final Answer: Hypermarket memiliki 999 transaksi, Minimarket memiliki 7088 transaksi, dan Supermarket memiliki 3913 transaksi.

> Finished chain.


'Hypermarket memiliki 999 transaksi, Minimarket memiliki 7088 transaksi, dan Supermarket memiliki 3913 transaksi.'

In [126]:
agent.run("apakah di data ini ada yang masih NA")  # Indonesian: "are there any NA values left in this data?"

> Entering new AgentExecutor chain...
Thought: Saya harus mengecek apakah ada nilai yang masih NA
Action: python_repl_ast
Action Input: df.isna().sum()
Observation: Unnamed: 0          0
receipt_id          0
receipts_item_id    0
purchase_time       0
category            0
sub_category        0
format              0
unit_price          0
discount            0
quantity            0
yearmonth           0
dtype: int64
Thought: Saya sekarang tahu jawabannya
Final Answer: Tidak, tidak ada nilai yang masih NA.

> Finished chain.


'Tidak, tidak ada nilai yang masih NA.'

In [127]:
agent.run('apakah ada data yang duplicate')  # Indonesian: "is there any duplicate data?"

> Entering new AgentExecutor chain...
Thought: Saya harus mencari tahu apakah ada data yang sama
Action: python_repl_ast
Action Input: df.duplicated()
Observation: 0        False
1        False
2        False
3        False
4        False
         ...  
11995    False
11996    False
11997    False
11998    False
11999    False
Length: 12000, dtype: bool
Thought: Saya sekarang tahu jawabannya
Final Answer: Tidak ada data yang duplicate.

> Finished chain.


'Tidak ada data yang duplicate.'

In [128]:
agent.run('coba bersihkan data yang duplikat')  # Indonesian: "try to clean up the duplicate data"

> Entering new AgentExecutor chain...
Thought: saya harus menemukan data yang duplikat
Action: python_repl_ast
Action Input: df.drop_duplicates()
Observation:
       Unnamed: 0  receipt_id  receipts_item_id     purchase_time category
0               1     9622257          32369294   7/22/2018 21:19     Rice  \
1               2     9446359          31885876   7/15/2018 16:17     Rice   
2               3     9470290          31930241   7/15/2018 12:12     Rice   
3               4     9643416          32418582    7/24/2018 8:27     Rice   
4               5     9692093          32561236   7/26/2018 11:28     Rice   
...           ...         ...               ...               ...      ...   
11995       11996     5760491          17555486  12/15/2017 21:06     Rice   
11996       11997     5598782          16999147   12/2/2017 14:12     Rice   
11997       11998     5735850          17434503  12/13/2017 19:17     Rice   
11998       11999    

'Data telah dibersihkan dari duplikat.'

In [129]:
agent.run('berikan saya rekomendasi dimana tempat paling murah untuk saya membeli beras')  # Indonesian: "give me a recommendation for the cheapest place to buy rice"

> Entering new AgentExecutor chain...
Thought: saya harus mencari rata-rata harga beras di supermarket dan minimarket
Action: python_repl_ast
Action Input: df.groupby('format')['unit_price'].mean()
Observation: format
hypermarket    71205.458458
minimarket     67135.569554
supermarket    74921.182150
Name: unit_price, dtype: float64
Thought: rekomendasi saya adalah minimarket
Final Answer: Rekomendasi saya adalah untuk membeli beras di minimarket.

> Finished chain.


'Rekomendasi saya adalah untuk membeli beras di minimarket.'

In [130]:
agent.run('coba deskripsikan isi dari data ini apa')  # Indonesian: "try to describe what this data contains"

> Entering new AgentExecutor chain...
Thought: Saya harus mencari tahu informasi tentang data ini
Action: python_repl_ast
Action Input: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        12000 non-null  int64  
 1   receipt_id        12000 non-null  int64  
 2   receipts_item_id  12000 non-null  int64  
 3   purchase_time     12000 non-null  object 
 4   category          12000 non-null  object 
 5   sub_category      12000 non-null  object 
 6   format            12000 non-null  object 
 7   unit_price        12000 non-null  float64
 8   discount          12000 non-null  int64  
 9   quantity          12000 non-null  int64  
 10  yearmonth         12000 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 1.0+ MB

Observation: None
Thought:

'Data ini berisi informasi tentang pembelian, termasuk nomor resi, ID item, waktu pembelian, kategori, sub-kategori, format, harga unit, diskon, jumlah, dan bulan tahun pembelian.'

In [131]:
agent.run('berikan saya ringkasan statistika dari data ini')  # Indonesian: "give me a statistical summary of this data"

> Entering new AgentExecutor chain...
Thought: saya harus menggunakan fungsi statistik untuk menghasilkan ringkasan
Action: python_repl_ast
Action Input: df.describe()
Observation:
        Unnamed: 0    receipt_id  receipts_item_id     unit_price
count  12000.00000  1.200000e+04      1.200000e+04   12000.000000  \
mean    6000.50000  7.650135e+06      2.457950e+07   70013.146313   
std     3464.24595  1.873838e+06      7.171105e+06   29905.391437   
min        1.00000  3.173994e+06      9.282023e+06    9395.000000   
25%     3000.75000  5.983209e+06      1.820694e+07   61900.000000   
50%     6000.50000  7.443618e+06      2.234337e+07   63500.000000   
75%     9000.25000  9.149786e+06      3.108379e+07   66000.000000   
max    12000.00000  1.146321e+07      3.842939e+07  219400.000000   

            discount      quantity  
count   12000.000000  12000.000000  
mean      835.305750      1.332917  
std      6207.475704      0.980304  
min     -

'Ringkasan statistik dari dataframe df adalah sebagai berikut: jumlah, rata-rata, standar deviasi, nilai minimum, nilai tengah, nilai maksimum, diskon, dan kuantitas.'

Notice that there are additional components in the output:
- `Thought`: This represents the **agent's thought** process on how to solve the problem based on the given prompt. It provides insights into the agent's decision-making and the reasoning behind its actions.
- `Action`: This describes the **actions taken** by the agent to solve the problem. In this case, it involves using the `python_repl_ast` tool, which is a Python shell. It also indicates the specific `pandas` command used by the agent to extract the result from the CSV data.
- `Final Answer`: This is the natural language representation of the **answer** derived from the result of the `Action Input`. It presents the final response to the prompt in a human-readable format.
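
The commands the agent chose are ordinary data operations, and it helps to know what they compute when judging the agent's conclusions. The following is a minimal sketch using only the standard library on made-up rows; only the column names `receipt_id` and `format` come from the output above. It mirrors the per-format count and counts duplicate rows explicitly instead of reading a long `True`/`False` listing.

```python
import csv
import io
from collections import Counter

# Toy data with the `receipt_id` and `format` columns; the values are made up.
raw = """receipt_id,format
1,minimarket
2,supermarket
3,minimarket
3,minimarket
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Rough equivalent of df.groupby('format')['receipt_id'].count()
per_format = Counter(row["format"] for row in rows)

# Count rows that repeat an earlier, fully identical row.
row_counts = Counter(tuple(row.items()) for row in rows)
n_duplicates = sum(count - 1 for count in row_counts.values())

print(dict(per_format))
print(n_duplicates)
```

In pandas, the corresponding aggregate is `df.duplicated().sum()`, which is more reliable than inspecting the truncated display of `df.duplicated()`. Note also that `df.drop_duplicates()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.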

## Unstructured Data

The company stores not only structured data but also unstructured data, such as **meeting summaries, task reports, and product descriptions**. When we need to retrieve information or ask questions about these documents, it usually requires manual searching and reading through them.

However, what if we could leverage the power of LLMs to find the answers for us? What if we have unique documents or regulations specific to our company? With OpenAI Embeddings, we can index our own company-specific documents so that relevant passages are retrieved and passed to the LLM as context. This allows us to build a question-answering system that can provide answers based on our own company documents.

In [132]:
from langchain.document_loaders import DirectoryLoader, TextLoader  # for loading data
from langchain.text_splitter import CharacterTextSplitter  # for splitting text into chunks
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()

Working with documents and text data in LangChain involves several modules and components:

- `DirectoryLoader` and `TextLoader` are document loaders that allow us to load documents from a directory or individual text files.
- `CharacterTextSplitter` is a text splitter that splits text on a separator and groups the pieces into chunks whose size is measured in characters.
- `OpenAIEmbeddings` is a module that provides embeddings for our text data. Embeddings are numerical representations of words or sentences that capture their meaning.
- `Chroma` is a vector store that stores the embeddings in a way that allows for efficient similarity searches.
- `RetrievalQA` is a question-answering chain that uses the vector store to retrieve relevant chunks and passes them to the LLM to generate an answer.

By creating an instance of `OpenAIEmbeddings()` and assigning it to the variable `embeddings`, we can now use it to encode and represent our text data. This enables us to build powerful question-answering systems using the RetrievalQA module.

Let's utilize the `summary.txt` file, which contains summaries of coal news for Australia, Indonesia, and China. We can use the `TextLoader` module to load this text file and process its contents.

In [133]:
loader = TextLoader('data_input/summary.txt')

In [136]:
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
#texts

The `texts` object contains the result of splitting the loaded documents into smaller chunks of text. Each chunk of text is limited to a maximum of 2500 characters (`chunk_size`) and there is no overlap between the chunks (`chunk_overlap`). 

The purpose of splitting the text into smaller chunks is to facilitate processing and analysis of the document content in a more manageable way. The `texts` object now holds these smaller text chunks, which can be further utilized for various text processing tasks such as information retrieval or natural language understanding.
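
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking. It is a simplification: `split_fixed` is a hypothetical helper that slices purely by position, whereas `CharacterTextSplitter` first splits on a separator and then groups the pieces.

```python
from typing import List

def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Slice text into chunks of at most chunk_size characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap, neighbouring chunks share some text, which can help when a relevant sentence would otherwise be cut at a chunk boundary, at the cost of storing more chunks.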

In [138]:
documents

[Document(page_content='Summary for Australia:\n\n- Australia\'s coal and gas exports may reduce by half within the next five years due to the passing of its peak and the efforts of Asian countries to decrease greenhouse gas emissions. The earnings of minerals and energy exports are predicted to reach $464bn in 2022-23 from $128bn in thermal coal exports and $91bn in liquidified natural gas (LNG) exports. These figures have resulted from the global energy crisis caused by Russia\'s invasion of Ukraine, leading to high fossil fuel prices, causing the replacement of Russian gas with alternative supplies in northern hemisphere nations. \n\n- The seaborne coal market grew by 5.9% year-on-year to 1208 million tonnes in 2022, reversing the negative trend of previous years, according to shipbroker Banchero Costa. Although Australia\'s coal exports declined by 5% in 2022 due to China\'s adoption  of alternative markets, relations between the two countries have mended and coal shipments are exp

Next, we create a `docsearch` object using the `Chroma.from_documents` method to build a vector store from the `texts`. This `docsearch` object enables efficient similarity search and retrieval of documents based on their text representations.

In [137]:
docsearch = Chroma.from_documents(texts, embeddings)
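
The idea behind similarity search can be illustrated with cosine similarity on small vectors. This is a sketch under stated assumptions: the three-dimensional vectors and chunk labels are made up, real embeddings have far more dimensions, and Chroma's internal indexing is not shown here.

```python
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for the embeddings of three text chunks.
store: Dict[str, List[float]] = {
    "Australia coal exports": [0.9, 0.1, 0.0],
    "Indonesia export ban": [0.1, 0.9, 0.1],
    "China coal imports": [0.2, 0.3, 0.9],
}

def most_similar(query_vector: List[float]) -> str:
    """Return the label of the stored vector closest to the query."""
    return max(store, key=lambda name: cosine_similarity(store[name], query_vector))

best = most_similar([0.0, 1.0, 0.0])
print(best)
```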



To create a question-answering system, we use the `RetrievalQA.from_chain_type` method to construct a `qa_chain` object. This chain type leverages an LLM model, specified by `OpenAI()`, and utilizes the `docsearch` object as a retriever to find relevant documents for answering questions.

In [139]:
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)
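
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The following sketch shows that idea in plain Python; `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
from typing import List

def build_stuff_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Join all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "Is there an export ban on coal?",
    ["Chunk about Indonesia.", "Chunk about Australia."],
)
print(prompt)
```

This approach is simple, but the combined chunks must fit within the model's context window, which is one reason to keep chunk sizes moderate.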

Now we can use the `qa_chain.run()` method to interact with the question-answering system and obtain answers based on the given prompts or questions.

In [140]:
qa_chain.run("What are the effects of legislations surrounding emissions on the Australia coal market?")



" Australia's coal and gas exports may reduce by half within the next five years due to the efforts of Asian countries to decrease greenhouse gas emissions. Coal producers are in talks with the government of New South Wales, following the government's announcement that coal miners should reserve up to 10% of production for domestic supply to control rising energy costs in Australia."

In [141]:
qa_chain.run("Is there an export ban on Coal in Indonesia? Why?")

' Yes, Indonesia has implemented a ban on coal exports to ensure adequate supply for its state-owned electricity companies.'

In [142]:
qa_chain.run("Who are the main exporters of Coal to China? What is the role of Indonesia in this?")

' The main exporters of coal to China are Indonesia, Russia and Mongolia. Indonesia is the largest exporter, accounting for 58.3% of total imports. Technology and investment from China is set to play a vital role in Indonesian coal deep-processing.'

The output of `qa_chain.run()` provides the generated answer based on the given prompts or questions using the question-answering system.

In [144]:
qa_chain.run("please give me summary of this document")

" This document provides information about the impacts of the global energy crisis and the Indonesian coal ban on coal and gas exports from Australia, China, and Indonesia. Australia's coal and gas exports are projected to decrease by half in the next five years due to the passing of its peak and efforts to reduce greenhouse gas emissions. In China, coal imports from Russia have risen and Mongolia is set to be the largest coking coal seller to China for the third consecutive year. Meanwhile, Indonesia is set to produce a record 695 million tonnes of coal in 2023 and export 518 million tonnes. Ford is investing in a nickel processing facility in Indonesia and the country's economy is expected to experience strong growth in 2023."

# HuggingFace

Hugging Face is a company that specializes in natural language processing (NLP) and provides tools and models to help with NLP tasks. 

One of their popular offerings is the **Transformers** library, which offers **pre-trained** models for various NLP tasks like text classification, sentiment analysis, and question answering. These models are based on the Transformer architecture, the same family that includes GPT (Generative Pre-trained Transformer) developed by OpenAI.

Hugging Face also has a platform called the **Hugging Face Hub** where users can **access and share models** and datasets. It simplifies the process of using pre-trained models and promotes collaboration among researchers and developers. LangChain allows us to connect with the Hugging Face API, giving us access to additional Transformers models.

## Setting up API Key and `.env` file

#### `.env` file

To use the Hugging Face API, we need to create an API Key. You can create your API Key by going to [this link](https://huggingface.co/settings/tokens). Once we have the API Key, we need to store it in a file called `.env`.

To create the `.env` file, open a text editor and enter the following information:

```plaintext
OPENAI_API_KEY={your_openai_api_key}
HUGGINGFACEHUB_API_TOKEN={your_huggingface_api_key}
```

Replace `{your_openai_api_key}` and `{your_huggingface_api_key}` with the respective API keys.

Save the file as `.env` in the same directory as your Python script or notebook.

To load the environment variables from the `.env` file, we can use the `load_dotenv()` function from the `dotenv` library. Make sure to add the following line of code at the beginning of your script or notebook:

```python
from dotenv import load_dotenv

load_dotenv()
```

In [145]:
load_dotenv()

True

By doing this, Python will recognize and use the API keys stored in the `.env` file throughout your code.
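
Before making any API call, it can help to confirm that both keys were actually loaded. The sketch below is illustrative only: `missing_keys` is a hypothetical helper (not part of `dotenv` or LangChain), and it checks only that each variable is set and non-empty, not that the key itself is valid.

```python
import os

# Variable names taken from the `.env` file described above.
REQUIRED_KEYS = ["OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing from environment:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Running this right after `load_dotenv()` surfaces a misnamed or missing entry in `.env` early, rather than as an authentication error later on.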

## Implementation

To use Hugging Face models from the Hugging Face Hub in LangChain, we need to import the following modules:
- `HuggingFaceHub` to call models hosted on the Hugging Face Hub.
- `LLMChain` to create a chain that passes a formatted prompt to a Hugging Face language model.
- `PromptTemplate` to create reusable prompt templates that are filled in with user inputs.

In [146]:
from langchain import HuggingFaceHub, LLMChain
from langchain.prompts import PromptTemplate

As with the OpenAI integration, working with Hugging Face models in LangChain requires us to specify which model to use. This means selecting a pre-trained model from the Hugging Face Hub that suits our task or application.

Once the model is chosen, we can set various parameters specific to the Hugging Face model, such as:

- `temperature` parameter to control the randomness of the generated text,
- `max_length` parameter to limit the length of the generated output.

These parameters let us adjust the behavior of the Hugging Face model at generation time. They do not retrain or fine-tune the model; they only change how text is sampled from it. A lower `temperature` makes the output more predictable, and `max_length` caps how much text is produced.
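
To build some intuition for `temperature`, the sketch below shows the usual way it is applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. This is a simplified illustration with made-up scores, not the code any particular model runs, and `softmax_with_temperature` is a name chosen here for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature shrinks, almost all of the probability moves to the highest-scoring token. A temperature of exactly 0 cannot be plugged into this formula (it would divide by zero); it is typically treated as "always pick the highest-scoring token".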

In [152]:
hub_llm = HuggingFaceHub(
    repo_id='gpt2',
    model_kwargs={'temperature': 0, 'max_length': 50}
)

To create a chain using Hugging Face models in LangChain, we first define a prompt template that specifies the format of the input.

The template structures the text that will be passed to the model. Once it is defined, we create the chain by initializing an instance of the `LLMChain` class. The chain connects the prompt template to the Hugging Face model, so that each input is formatted and then sent to the model for a response.

In [148]:
prompt = PromptTemplate(
    input_variables=["question"],
    template="""Question: {question}"""
)

hub_chain = LLMChain(prompt=prompt, llm=hub_llm, verbose=True)
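
Conceptually, the template is a string with named placeholders that are filled in at run time. The sketch below uses only Python's built-in `str.format` to illustrate the idea; it is a stand-in for, not a reproduction of, how `PromptTemplate` works internally, and `render_prompt` is a hypothetical helper.

```python
TEMPLATE = "Question: {question}"  # same template string as in the cell above

def render_prompt(template, **variables):
    """Fill the named placeholders in `template` with the given values."""
    return template.format(**variables)

print(render_prompt(TEMPLATE, question="who won FIFA World Cup in the year 1994?"))
```

The printed string is the text the model receives, which is what the chain reports under "Prompt after formatting" when `verbose=True`.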

Let's ask a question:

In [149]:
hub_chain.run("who won FIFA World Cup in the year 1994?")



> Entering new LLMChain chain...
Prompt after formatting:
Question: who won FIFA World Cup in the year 1994?

> Finished chain.


'\n\nA: The winner of the World Cup in 1994 was the United States.\n\nQ: What was the most important thing about the World Cup?\n\nA: The World'

The output is the text the model generated to continue the prompt, and here the answer is wrong: Brazil, not the United States, won the 1994 FIFA World Cup (the United States hosted it). GPT-2 is a small, general-purpose text-completion model that has not been tuned to answer questions, so it continues the prompt with plausible-sounding text instead of retrieving a fact. It also goes on to invent a follow-up question of its own until it reaches the `max_length` limit, which is why the output stops mid-sentence.

A language model's response reflects the patterns it learned from its training data. When a fact is missing or only weakly represented there, the model may still produce a fluent but inaccurate answer. The quality of the response therefore depends on both the training data and the capabilities of the chosen model.
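
The raw completion above also runs past the first answer into a new question that the model made up. A small post-processing step can keep only the first answer. The sketch below is one possible approach, assuming the completion follows the `A: ... Q: ...` pattern seen in this output; `first_answer` is a hypothetical helper, not a LangChain function.

```python
def first_answer(completion):
    """Keep only the text before the model starts a new 'Q:' turn."""
    head = completion.split("\n\nQ:", 1)[0].strip()
    # Drop the leading "A:" label if the model produced one.
    if head.startswith("A:"):
        head = head[len("A:"):].strip()
    return head

# The completion returned by `hub_chain.run(...)` above.
raw = (
    "\n\nA: The winner of the World Cup in 1994 was the United States."
    "\n\nQ: What was the most important thing about the World Cup?"
    "\n\nA: The World"
)
print(first_answer(raw))
```

Trimming only tidies the text. It does not make the answer itself any more accurate.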

## Integrating HuggingFace's Inference API

What if we want to use a free model from Hugging Face to create a question answering system based on our database? We can import a pre-trained model that can translate natural language queries into SQL queries. One of the models available is `t5-base-finetuned-wikiSQL`.

To integrate the model, we need to create a `HuggingFaceHub` object. We set the `repo_id` parameter to the repository ID of the desired model and use `model_kwargs` to configure it. For example, we can set `temperature=0` to get deterministic responses. The object then lets us call the model hosted on the Hugging Face Hub for our question-answering tasks.

In [153]:
# Import the llm model from huggingface
hf_llm_t5 = HuggingFaceHub(
    repo_id='mrm8488/t5-base-finetuned-wikiSQL',
    model_kwargs={'temperature': 0, 'max_token':2000}
)

After importing the necessary modules and loading the model, we can create a prompt template using `PromptTemplate`. 

- The `input_variables` parameter specifies the input variables we want to use.
- The `template` parameter defines the template string with placeholders for those input variables.

This allows us to easily customize the prompt based on our specific question answering needs.

In [154]:
# Create prompt template
prompt_db = PromptTemplate(
    input_variables=['question'],
    template="Translate English to SQL: {question}"
)

After setting up the prompt template and model, we can chain all the components together using `LLMChain`. This allows us to connect the prompt template with the language model and create a unified system for question answering. 


In [155]:
# Chaining
hub_chain = LLMChain(prompt=prompt_db, llm=hf_llm_t5, verbose=True)


Once the chain is established, we can use the `run()` method to generate responses based on the input provided through the prompt template.

In [157]:
# Run the template
hub_chain.run("How many rows is in the tracks's table?")



> Entering new LLMChain chain...
Prompt after formatting:
Translate English to SQL: How many rows is in the tracks's table?

> Finished chain.


'SELECT COUNT Rows FROM table WHERE Tracks = Tracks'

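The generated query is not valid for our database: a correct query would be along the lines of `SELECT COUNT(*) FROM tracks`. A likely reason is that this T5 model was fine-tuned on the WikiSQL dataset, where every question refers to a single table, so it tends to emit the generic name `table` instead of a real table name. In addition, our prompt contains only the question, so the model has no information about the database schema. Unlike the OpenAI-based chain built earlier, this setup does not pass any table information to the model, which highlights the importance of matching both the model and the prompt to the task.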
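
One practical safeguard is to check a generated query against the database before relying on it. The sketch below assumes a SQLite database and uses a small in-memory stand-in with a `tracks` table whose columns and rows are made up for illustration; `is_valid_sql` is a hypothetical helper, not part of LangChain.

```python
import sqlite3

def is_valid_sql(conn, query):
    """Return True if SQLite can compile `query` against the connected database."""
    try:
        # EXPLAIN compiles the statement without running it.
        conn.execute("EXPLAIN " + query)
        return True
    except sqlite3.Error:
        return False

# Stand-in database: a tiny `tracks` table with made-up columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany("INSERT INTO tracks (Name) VALUES (?)", [("a",), ("b",), ("c",)])

generated = "SELECT COUNT Rows FROM table WHERE Tracks = Tracks"  # model output
expected = "SELECT COUNT(*) FROM tracks"  # hand-written query

print(is_valid_sql(conn, generated))
print(is_valid_sql(conn, expected))
print(conn.execute(expected).fetchone()[0])
```

A check like this catches queries that refer to tables or columns that do not exist, or that are not valid SQL at all. It cannot tell whether a valid query actually answers the question that was asked.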

It's worth noting that HuggingFace provides a wide range of models that are designed for specific tasks. These models have been fine-tuned on specific datasets and can be a valuable resource for various natural language processing tasks, including question answering. By selecting the appropriate model for the task at hand, we can improve the accuracy and reliability of the answers generated.

# Summary

In this module, we explored various aspects of working with large language models (LLMs) and building question-answering systems. We started by understanding the architecture and key concepts of large language models, including the Transformer model and the process of pre-training and fine-tuning. We also introduced popular LLMs like GPT-3, GPT-2, and BERT, discussing their capabilities and limitations. We then delved into the LangChain concept, which allows us to connect databases and text data with LLMs for question answering. We demonstrated the use of OpenAI and LangChain to build question-answering systems with both databases and text data. Additionally, we explored text generation using HuggingFace's models, covering the setup and integration of the HuggingFace Hub. Overall, this module provided insights into harnessing the power of large language models for question answering and text generation tasks.

# Reference

- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)