# Introduction
An LLM application built with the open-source LangChain framework that takes unstructured text data as input, extracts relevant insights in response to a query prompt, and generates contextual answers through a chatbot interface

The following components encapsulated in langchain abstraction are used to orchestrate the application:
1.  A **CSVLoader** to read the unstructured data, movie scripts classified by genre, from a CSV file. The data source is available [here](https://huggingface.co/datasets/aneeshas/imsdb-genre-movie-scripts)
2.  A **text splitter** to split the loaded data into manageable chunks that can be stored and retrieved as the context for building responses
3.  An **Embedding Model** to translate unstructured text data into embedding vectors for retrieval through semantic search or other methods during prompt processing
4.  A **vectorstore** for storing the generated embeddings to retrieve during query processing
5.  A **PromptTemplate** to compile the system prompt that steers the behavior of the LLM underlying the chat application
6.  A **Conversation Buffer Memory** to store the chat context of the chatbot
7.  A **Conversational Retrieval Chain** that wires the different components together to build the internal plumbing required for the chat application

The application is inspired by my learnings from the short course named **LangChain: Chat with Your Data** ([link](https://www.deeplearning.ai/short-courses/langchain-chat-with-your-data/)) and other short courses on LLMs offered on the DeepLearning.AI portal

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/enron-email-dataset/emails.csv


### Installing and Importing the required modules and libraries


The modules required for this application can be installed using the code snippet below if not already present in your environment

In [4]:
!pip install --upgrade datasets
!pip install --upgrade langchain
!pip install --upgrade chromadb
!pip install --upgrade tiktoken
!pip install transformers
!pip install InstructorEmbedding
!pip install -U sentence-transformers
!pip install openai
!pip install gradio



The snippet that follows has all the necessary library imports for orchestrating the chat application

In [5]:
from langchain.vectorstores.chroma import Chroma
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from datasets import load_dataset
import numpy as np
import pandas as pd

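An OpenAI API key is required because the application calls the OpenAI API. The CSV field size limit is also raised here so that full-length movie scripts can be parsed from the CSV file later on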

In [6]:
import sys
import csv
csv.field_size_limit(sys.maxsize)
openai_api_key = '<enter your open ai api key here>'

### Fetching and loading the Dataset

The dataset on movie scripts to be used as context for generating query responses is fetched from the Hugging Face repository and converted to pandas format for ease of processing

In [7]:
script_data = load_dataset('aneeshas/imsdb-genre-movie-scripts')
script_data['train'].set_format('pandas')
script_df = script_data['train'][:]
script_df.head()


| | Action | Horror | Sci-Fi | Comedy | Drama |
|---|---|---|---|---|---|
| 0 | `15 Minutes\r\n\r\n\r\n\tFADE IN\r\n\r\n\ton th...` | `28 DAYS LATER\r\n \r\n...` | `Twelve Monkeys\r\n\r\n\r\n\t\t\t\tTWELVE MONKE...` | `Ten Things I Hate About You - by Karen McCulla...` | `12 AND HOLDING\r\n \r\n \r\n...` |
| 1 | `2012\r\n \r\n \r\n ...` | `A QUIET PLACE\r\n\r\n\r\n\r\n ...` | `2001: A SPACE ODYSSEY\r\n\r\n\t\t\t\t\t Scr...` | `12 - Script\r\n\r\n\r\n\r\n\r\nCUT FROM BLACK\...` | `Twelve Monkeys\r\n\r\n\r\n\t\t\t\tTWELVE MONKE...` |
| 2 | `30 MINUTES OR LESS\r\n\r\n\r\n\r\n\r\n ...` | `THE ADDAMS FAMILY\r\n\r\n ...` | `2012\r\n \r\n \r\n ...` | `17 AGAIN\r\n \r\n \r\n ...` | `12 YEARS A SLAVE\r\n\r\n\r\n\r\n\r\n ...` |
| 3 | `"48 HRS." -- Unknown draft/writers\r\n\r\n\r\n...` | `AFTER.LIFE\r\n\r\n \r\...` | `28 DAYS LATER\r\n \r\n...` | `30 MINUTES OR LESS\r\n\r\n\r\n\r\n\r\n ...` | `127 HOURS\r\n\r\n\r\n\r\n ...` |
| 4 | `A MOST VIOLENT YEAR\r\n\r\n\r\n\r\n\r\n ...` | `"Alien", early draft, by Dan O'Bannon\r\n\r\n\...` | `9\r\n \r\n \r\n ...` | `"48 HRS." -- Unknown draft/writers\r\n\r\n\r\n...` | `1492: CONQUEST OF PARADISE\r\n\r\n\r\n\r\n\r\n...` |


The dataset stores the movie scripts in one column per genre. This wide layout is awkward to process, so it is reshaped into a long format, with one genre and script pair per row, using the melt() function in pandas

In [8]:
script_unp_df = pd.melt(script_df,id_vars=None,value_vars=['Action','Horror','Sci-Fi','Comedy','Drama'],var_name='Genre',value_name='Script')
script_unp_df.head(5)

| | Genre | Script |
|---|---|---|
| 0 | Action | `15 Minutes\r\n\r\n\r\n\tFADE IN\r\n\r\n\ton th...` |
| 1 | Action | `2012\r\n \r\n \r\n ...` |
| 2 | Action | `30 MINUTES OR LESS\r\n\r\n\r\n\r\n\r\n ...` |
| 3 | Action | `"48 HRS." -- Unknown draft/writers\r\n\r\n\r\n...` |
| 4 | Action | `A MOST VIOLENT YEAR\r\n\r\n\r\n\r\n\r\n ...` |
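
To make the reshaping concrete, here is a minimal sketch of the wide-to-long transformation that melt() performs, written in plain Python on a made-up two-genre table. The function name `melt_wide_table` and the placeholder script titles are assumptions for illustration only; the notebook itself relies on `pd.melt`.

```python
# Toy wide table: one column per genre (made-up placeholder titles).
wide = {
    "Action": ["Script A1", "Script A2"],
    "Horror": ["Script H1", "Script H2"],
}


def melt_wide_table(table, var_name="Genre", value_name="Script"):
    """Turn {column: [values]} into a list of one-cell-per-row records."""
    long_rows = []
    for column, values in table.items():   # columns are stacked one after another
        for value in values:               # every cell becomes its own row
            long_rows.append({var_name: column, value_name: value})
    return long_rows


long_rows = melt_wide_table(wide)
for row in long_rows:
    print(row)
```

Each output row carries its genre as a value instead of as a column name, which is the shape the later loading and chunking steps work with.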


### The Algorithm
What follows are the different steps involved in building the chat interface and the underlying LLM-powered logic of the chat application

#### 1. Loading Data Required for Context Generation

To speed up execution and keep OpenAI API costs down, only 3 randomly sampled scripts from the dataset are written to a CSV file and loaded using the CSVLoader of langchain

In [9]:
script_samples = script_unp_df.sample(3)
script_samples.to_csv('script_data_by_genre')
loader = CSVLoader(
    file_path = '/kaggle/working/script_data_by_genre',
)
sc_loaded_data = loader.load()

#### 2. Splitting unstructured text data into manageable chunks

Individual movie scripts are long, which makes it inefficient to store and retrieve them whole as context for the LLM. The text splitter breaks each document into multiple chunks, and the metadata of each chunk keeps track of the original entry it came from. Here each chunk holds at most 256 characters, and consecutive chunks can share up to 32 characters

In [10]:
rcts = RecursiveCharacterTextSplitter(chunk_size=256,chunk_overlap=32)
script_chunks = rcts.split_documents(sc_loaded_data)
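
The effect of `chunk_size` and `chunk_overlap` can be sketched with a simple sliding window over characters. This is a simplification under stated assumptions: the real RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` helper below cuts at fixed positions.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=32):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap      # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                          # the last window already reached the end
    return chunks


sample = "INT. WAREHOUSE - NIGHT. " * 30   # made-up script text
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk reappears at the head of the next one.
print(chunks[0][-32:] == chunks[1][:32])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.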

#### 3. Embedding Generation from text chunks and storage in a vectorstore

The text chunks are stored in a vector store, which forms the knowledge base for on-demand retrieval and response generation based on the system prompt and the query received from the user. To make the store searchable by meaning, an embedding vector is generated for each chunk of unstructured text and stored alongside it

Data access can be performed using techniques such as semantic search, maximal marginal relevance search, etc. for ensuring that the right portions of the knowledge base are leveraged for response generation

In [11]:
embedding = HuggingFaceInstructEmbeddings()
script_db = Chroma.from_documents(documents=script_chunks,embedding=embedding)
print(script_db._collection.count())

load INSTRUCTOR_Transformer




max_seq_length  512
5913
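
The two retrieval strategies mentioned above can be sketched on toy vectors. This is an illustrative, dependency-free sketch, not the code Chroma or langchain runs internally: the 2-D vectors, the function names, and the `lambda_mult` weighting are assumptions chosen to show the idea. Similarity search ranks chunks by cosine similarity to the query, while maximal marginal relevance (MMR) also penalises chunks that resemble ones already picked.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_search(query_vec, doc_vecs, k=2):
    """Indices of the k vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]


def mmr_search(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against redundancy with earlier picks."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]   # toy embeddings; 0 and 1 are near-duplicates
query = [1.0, 0.2]
print(similarity_search(query, docs))  # [1, 0]
print(mmr_search(query, docs))         # [1, 2]
```

Similarity search returns the two near-duplicate vectors, whereas MMR swaps the second one for a less similar but more diverse result. In langchain, the retriever's strategy can be chosen when calling `as_retriever`, for example with `search_type="mmr"`.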


#### 4. Instantiation of LLM based chat agent for prompt processing

An OpenAI chat model based on GPT-3.5 is used to understand the query and generate relevant responses from the contextual references fetched from the vectorstore database

In [12]:
llm = ChatOpenAI(model_name='gpt-3.5-turbo',temperature=0.1,openai_api_key = openai_api_key)

#### 5. Creation of the Query Template

The query template should be designed with clear instructions on how to respond to a particular question shared by the user. Using the PromptTemplate module offered by langchain, it is possible to leave placeholders for the retrieved context and the user query, which are filled in at run time during the execution of the chat application

In [17]:
template = """
    You are an assistant that helps with answering questions on movie storylines.
    Please read the question delimited by 3 backticks and follow the instructions mentioned in the steps:
    Step 1: Check if the question is related to a movie or the user is just exchanging pleasantries
    Step 2: If the user is exchanging pleasantries, respond likewise in a friendly manner
    Step 3: If the question is regarding a movie, please use the context below on movie scripts from the script field to generate a response
    If you don't know the answer, please professionally acknowledge it.
    Step 4: Check if the user has more questions
    Step 5: If yes, repeat steps 1 to 4. If no, thank the user for their time.
    Restrict every answer to less than 30 words and keep it concise like a free flowing conversation
    Portray a friendly demeanour when answering the questions
    {context}
    ```Question:{question}```
"""
QUERY_PROMPT = PromptTemplate.from_template(template)
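
As a rough illustration of what happens to the placeholders at run time, the sketch below fills a shortened template with plain `str.format`. The `mini_template` text, the `fill_prompt` helper, and the sample chunks are assumptions for illustration; in the notebook the chain performs this substitution itself, joining the retrieved chunks into `{context}`.

```python
# Shortened, hypothetical template with the same two placeholders.
mini_template = (
    "Answer using only the context below.\n"
    "{context}\n"
    "Question: {question}"
)


def fill_prompt(template, context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return template.format(context=context, question=question)


filled = fill_prompt(
    mini_template,
    ["FADE IN on a city street.", "A phone rings."],   # made-up retrieved chunks
    "How does the film open?",
)
print(filled)
```

Because the instructions stay fixed and only the context and question change, the same template can serve every turn of the conversation.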

#### 6. Creation of Memory Buffer for tracking Chat History

The chains (workflows) for retrieving the relevant data points and responding to user questions are stateless. In order to store the chat history across multiple queries in a conversation, the buffer memory is used

In [21]:
chat_bot_memory = ConversationBufferMemory(
    memory_key = 'chat_history',
    return_messages = True
)
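
The role of the buffer can be sketched without any library: the chain itself remembers nothing, so earlier turns have to be loaded before each call and saved after it. The `SimpleChatMemory` class and `stateless_chain` function below are hypothetical stand-ins, not langchain classes.

```python
class SimpleChatMemory:
    """Hypothetical stand-in for a conversation buffer: an append-only list of turns."""

    def __init__(self):
        self.turns = []

    def load_history(self):
        # Return a copy so callers cannot mutate the stored history by accident.
        return list(self.turns)

    def save_turn(self, question, answer):
        self.turns.append((question, answer))


def stateless_chain(question, chat_history):
    """Toy chain: it only knows about earlier turns if they are passed in."""
    return f"Answer to '{question}' (saw {len(chat_history)} earlier turns)"


memory = SimpleChatMemory()
answers = []
for question in ["Who is the lead character?", "What happens to her next?"]:
    history = memory.load_history()          # read the buffer before answering
    answer = stateless_chain(question, history)
    memory.save_turn(question, answer)       # write the new turn back
    answers.append(answer)
    print(answer)
```

The second question ("her") is only answerable because the first turn is handed back to the chain, which is the job the memory object does in the notebook.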

#### 7. Chaining the various components to build the conversational logic

The ConversationalRetrievalChain ties together the various components instantiated across the previous 6 steps to create a workflow for understanding and responding to the user queries in a chat application

In [22]:
chatbot_chain = ConversationalRetrievalChain.from_llm(
    llm = llm,
    retriever=script_db.as_retriever(),
    memory = chat_bot_memory,
    combine_docs_chain_kwargs = {'prompt':QUERY_PROMPT}
)


def respond(question,chat_history):
    result = chatbot_chain({'question':question})
    current_response = result['answer']
    chat_history.append((question,current_response))
    return "",chat_history
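
The callback contract can be checked without spending API calls by passing in a stand-in chain. This is a sketch under assumptions: `FakeChain` and `respond_with` are hypothetical names, and the history format (a list of question and answer tuples) mirrors what the respond function above appends.

```python
class FakeChain:
    """Hypothetical stand-in that mimics the chain's call shape without calling any API."""

    def __call__(self, inputs):
        return {"answer": f"Echo: {inputs['question']}"}


def respond_with(chain, question, chat_history):
    """Same logic as respond(), but the chain is passed in so it can be swapped."""
    result = chain({"question": question})
    chat_history.append((question, result["answer"]))
    # The empty string clears the text box; the updated history refreshes the chat.
    return "", chat_history


cleared_box, history = respond_with(FakeChain(), "Hi there", [])
print(repr(cleared_box), history)
```

Returning two values matches the two output components wired to the callback: the first resets the question box and the second updates the chat window.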

### The Chat Interface

Hugging Face's gradio library can be used to create the user interface for the chat application, so that stakeholders who treat the application as a black box can validate the various use cases.

Actions on the gradio interface (a button click or pressing Enter in the text box) invoke the conversational retrieval chain to generate the appropriate response

In [23]:
import gradio as gr
gr.close_all()

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(height=500)
    msg = gr.Textbox(label='Question')
    btn = gr.Button('Submit')
    clr_btn = gr.ClearButton(components=[chatbot,msg],value='Clear Console')
    btn.click(fn=respond,inputs=[msg,chatbot],outputs=[msg,chatbot])
    msg.submit(fn=respond,inputs=[msg,chatbot],outputs=[msg,chatbot])
demo.queue().launch(share=True)

Running on local URL:  http://127.0.0.1:7862
Running on public URL: https://e710c97d1dec03a3d3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


