## The Outside-in Analysis Assistant

This application uses the LangChain framework to run Retrieval Augmented Generation (RAG) on the annual report of a target company. It extracts the challenges the company faces and its strategic business priorities, each along a fixed set of dimensions. Together, these inform the competencies that the company needs to cultivate to sustain a strategic advantage in the market.

Two categories of LLM chains are involved here:
* The first category (called extraction chains) is built using LangChain Expression Language (LCEL) and LangChain's OpenAI function utilities to extract relevant data points from the vectorized company annual report
* The second category weaves the extracted data into a structure suitable for executive consumption and identifies the potential business competencies required

The results generated by the workflow are captured in the following variables
* Business challenges: **Variable** - b_challenges
* Strategic business priorities: **Variable** - b_priorities
* Competency: **Variable** - potential_competency

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/annual-report-input/2022-sensata-ar.pdf
/kaggle/input/example-drhp/Example_DRHP.pdf
/kaggle/input/sample-drhp/imagine-marketing-limited-drhp.pdf
/kaggle/input/annual-input-report/Microchip_Annual.pdf


Install the required modules and import the required libraries

In [2]:
#!pip install --upgrade pip
!pip install typing_extensions openai langchain --quiet
!pip install transformers --quiet
!pip install InstructorEmbedding --quiet
!pip install chromadb --quiet
!pip install sentence-transformers --quiet
!pip uninstall typing_extensions --yes --quiet
!pip uninstall openai --yes --quiet
!pip install typing_extensions==3.10.0.2 openai==0.28.0 --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 14.0.1 which is incompatible.
google-cloud-bigquery 2.34.4 requires packaging<22.0dev,>=14.3, but you have packaging 23.2 which is incompatible.
jupyterlab 4.0.5 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.0.1 requires jupyter-lsp>=2.0.0, but you have jupyter-lsp 1.5.1 which is incompatible.
jupyterlab-lsp 5.0.1 requires jupyterlab<5.0.0a0,>=4.0.6, but you have jupyterlab 4.0.5 which is incompatible.
libpysal 4.9.2 requires shapely>=2.0.1, but you have shapely 1.8.5.post1 which is incompatible.
momepy 0.7.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
pymc

In [3]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.schema.output_parser import StrOutputParser
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from typing import List
from pydantic import BaseModel,Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function
from langchain.chains import SequentialChain
from langchain.chains import LLMChain
import transformers
import torch
import json
import openai
import time

Use PyPDFLoader to parse the content of the PDF annual report and store the relevant pages in a Chroma vector database for retrieval with queries

In [4]:
loader = PyPDFLoader('/kaggle/input/example-drhp/Example_DRHP.pdf')
pages = loader.load_and_split()

#Set the start_page and end_page as user inputs to extract the relevant document pages for vectorization
start_page = 57
end_page = 86
relevant_pages = [page for page in pages if start_page <= page.metadata['page'] <= end_page]

#Use the sentence-transformer based embeddings for vector representation of content to store in the Chroma database
embedding_routine = HuggingFaceInstructEmbeddings()

#Store the relevant page sections in the vector database using the embeddings generator
vectordb = Chroma.from_documents(documents=relevant_pages,embedding=embedding_routine)

load INSTRUCTOR_Transformer
max_seq_length  512
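
The page filter above depends only on the `page` entry in each chunk's metadata. The sketch below reproduces that filter on stand-in objects so it can be tried without a PDF; the `Chunk` class and `select_page_range` function are hypothetical names introduced for illustration. As far as I know, PyPDFLoader numbers pages from zero, so `start_page = 57` would correspond to the 58th page of the PDF; this is worth verifying against the document before choosing the range.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded document chunk with a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def select_page_range(chunks, start_page, end_page):
    """Keep chunks whose metadata['page'] lies in [start_page, end_page], inclusive."""
    return [c for c in chunks if start_page <= c.metadata.get("page", -1) <= end_page]


# Fake chunks for pages 55..89, one chunk per page
chunks = [Chunk(f"text from page {p}", {"page": p}) for p in range(55, 90)]
selected = select_page_range(chunks, start_page=57, end_page=86)
print(len(selected), selected[0].metadata["page"], selected[-1].metadata["page"])
```

Both bounds are inclusive, so the range 57 to 86 keeps 30 pages when each page yields a single chunk.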


Define pydantic classes whose fields describe the values to be extracted from the retrieved content. These classes will be translated into OpenAI function definitions that can be plugged into LLM chains to retrieve specific pieces of data

In [5]:
class Challenge(BaseModel):
    """Detailed information of challenges and constraints faced by the company that hinder business value. These are rich illustrations of specific issues relevant to the company and not generic statements from field description. If nothing is found, please mention not found and do not make up things"""
    customers: str = Field(description="Anything related to customer demand patterns and changes in customer preferences")
    competition: str = Field(description="Anything related to competitive pressure and substitutes in the market")
    production_factors: str = Field(description="Anything related to shortages or ineffective deployment of factors of production or service delivery")
    procurement: str = Field(description="Anything related to supply side constraints in procurement of raw materials or services")
    supply_chain: str = Field(description="Anything related to supply chain constraints, labor constraints, or organization model constraints")
    process_inefficiency: str = Field(description="Anything related to process inefficiencies and functional silos across departments")
        
class StratPrior(BaseModel):
    """Detailed information on strategic business priorities of the company. These are rich illustrations of areas specific to the company and not generic statements. If nothing is found, please mention not found and do not make up things"""
    new_product: str = Field(description="Details on new product or services targeted for growth")
    new_market: str = Field(description="Details on potential new market targets for growth acceleration")
    efficiency: str = Field(description="Details on efficiency improvements targeted for internal operations")
    digitization: str = Field(description="Details on initiatives identified for digitization and analytics")
    technology: str = Field(description="Details on requirements for a change in technological landscape")
    operating_model: str = Field(description="Details on planned interventions for operating model transformation")
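
To make the role of the field descriptions concrete, here is a rough sketch of the kind of function definition that the OpenAI function-calling format expects: a name, a description, and a JSON Schema for the parameters. The helper `to_function_schema` is hypothetical and written with the standard library only; the dictionary produced by LangChain's converter may differ in detail (for example, extra `title` keys).

```python
import json


def to_function_schema(name, description, fields):
    """Build an OpenAI-function-style schema from {field_name: description} pairs.

    Hypothetical helper for illustration; every field is a required string,
    mirroring the `str` fields declared on the classes above.
    """
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {
                key: {"type": "string", "description": text}
                for key, text in fields.items()
            },
            "required": list(fields),
        },
    }


schema = to_function_schema(
    "Challenge",
    "Detailed information of challenges and constraints faced by the company",
    {
        "customers": "Anything related to customer demand patterns and changes in customer preferences",
        "competition": "Anything related to competitive pressure and substitutes in the market",
    },
)
print(json.dumps(schema, indent=2))
```

The class docstring and the `Field` descriptions are the only guidance the model receives about what belongs in each slot, which is why they are written as detailed instructions.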

Translate the pydantic classes into OpenAI function definitions and bind them to an instance of the ChatOpenAI model to extract the challenges and strategic priorities of the target company from the content retrieved from the vector database

In [6]:
fn_extracts = [
    convert_pydantic_to_openai_function(f) for f in [
        Challenge, StratPrior
    ]
]

#Instantiate an OpenAI endpoint powered by GPT3.5 LLM for running the queries encapsulated in the langchain functions 
#on the retrieved data from the vector database
model = ChatOpenAI(model_name='gpt-3.5-turbo',temperature=0.05,openai_api_key='sk-****************************************')

#Bind the langchain API functions with the model to inform the retrieval of relevant data points 
info_extractor = model.bind(functions=fn_extracts)


The prompt template here is designed for insertion into the LLM chain. The query parameter of the prompt determines which of the bound functions the model invokes to extract the relevant data from the content retrieved from the vector database

In [7]:
instruction = """Please respond to questions on the business areas based on the reference context. If the answer is not known, do not mention the field description in the response"""

prompt = ChatPromptTemplate.from_messages([
    ("system",instruction),
    ("user","{query} answered by using:{context}")
])

Define helper functions that feed the output generated by one chain, in the right format, into the prompt template of the subsequent chains

In [11]:
#A json cleanup function that solves for the common errors faced during generation of json-formatted string by OpenAI LLM
#This is inserted as the last step in an LLM chain to translate LLM-generated output in json format after performing a few validation checks
def clean_str_for_json(in_str):
    str_output = in_str.additional_kwargs['function_call']['arguments']
    try:
        formt_json = json.loads(str_output)
        return formt_json
    except ValueError as e:
        try:
            formt_json = json.loads(str_output+"\"\n}")
            return formt_json
        except ValueError as e:
            return json.loads(str_output[:-3]+"}")


#Use the function to combine values corresponding to common keys across the json objects generated by the LLM extraction chain
#This handles the OpenAI API rate limit of 3 queries per minute, which forces the vectorized data to be processed in chunks of 3 blocks each
def combine_content_with_common_key(list_input):
    #Join the non-empty values found under each key across all the extracted json objects
    combined_out = {k:".".join([str(d[k]) for d in list_input if d.get(k)]) for k in set().union(*list_input)}
    #Build a new dict without the empty entries; popping keys while iterating over a dict raises a RuntimeError
    return {key:val for key,val in combined_out.items() if val != ''}

#This function and the next are used to convert json objects to strings for plugging into prompt templates
def concat_dict(gen_output):
    out_text = "".join([key+":"+val for key,val in gen_output.items() if val is not None])
    return out_text

def compile_para(json_text):
    return ".".join([val for val in json_text.values() if val not in ['None']])

#Adds a delay of 60 seconds to stay within the query limits posed by OpenAI
def delay():
    time.sleep(60)
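
The repair order used by `clean_str_for_json` can be tried on plain strings. The sketch below is a standalone variant that takes the raw arguments string directly rather than a model message; `repair_json` is a hypothetical name, and the sample strings are made up to trigger each branch.

```python
import json


def repair_json(raw):
    """Try the raw string, then two simple repairs, in the same order as clean_str_for_json."""
    candidates = [
        raw,              # already valid JSON
        raw + "\"\n}",    # output cut off inside the last string value
        raw[:-3] + "}",   # three characters of trailing debris before the closing brace
    ]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except ValueError:
            continue
    raise ValueError("could not repair JSON string")


valid = repair_json('{"customers": "Demand is seasonal"}')
truncated = repair_json('{"customers": "Demand is seas')
trailing = repair_json('{"customers": "x",\n}')
print(valid, truncated, trailing)
```

Unlike the original helper, this variant raises a clear `ValueError` when none of the repairs work. Both approaches are heuristics and will not recover every malformed reply.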

Using LangChain Expression Language (LCEL), build a customized data extraction chain from the prompt, the function-bound model and the json cleanup function. When invoked, this chain runs the three steps in order and returns the result as a json object

In [9]:
extraction_chain = prompt | info_extractor | clean_str_for_json
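
The `|` operator composes the steps so that the output of each one becomes the input of the next. The toy sketch below imitates that behaviour with a hypothetical `Step` class and a fake model; it is not how LangChain implements its runnables, only an illustration of the data flow through a chain of this shape.

```python
import json


class Step:
    """Hypothetical minimal runnable: wraps a function and supports `|` chaining."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # self | other -> a new Step that feeds self's output into other
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))


# Stand-ins for the prompt, the function-bound model and the json cleanup step
fill_prompt = Step(lambda d: f"{d['query']} answered by using:{d['context']}")
fake_model = Step(lambda text: {"arguments": json.dumps({"prompt_length": len(text)})})
parse_arguments = lambda message: json.loads(message["arguments"])

chain = fill_prompt | fake_model | parse_arguments
result = chain.invoke({"query": "Challenges", "context": "sample text"})
print(result)
```

As in the real chain, the last step is a plain Python function; it is wrapped automatically when it is piped into the chain.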

Extract the relevant content sections by running a semantic search with the query on the vector database. Invoke the extraction chain query on the retrieved content sections to pull out the company challenges in json format

In [12]:
query_pp = "Challenges and constraints related to market, customers, competition, suppliers or internal processes"
context_pp = vectordb.similarity_search(query=query_pp,k=27)
challenges = list()
for tr in range(3):
    challenges += [extraction_chain.invoke({"query":query_pp,"context":context_pp[tr*9+i*3:tr*9+i*3+3]}) for i in range(3)]
    delay()
conct_challenges = combine_content_with_common_key(challenges)
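
The index arithmetic in the loop above splits the 27 retrieved documents into nine chunks of three and pauses after every third call. A more general version of the same pattern might look like the sketch below; `batched_extract` and its parameters are hypothetical, and the sleep function is injectable so the logic can be checked without waiting. Note one difference: this sketch skips the pause after the final batch, whereas the loop above always pauses, which also protects the calls made in later cells.

```python
import time


def batched_extract(docs, call, chunk_size=3, calls_per_batch=3, pause=60.0, sleep=time.sleep):
    """Split docs into chunks, run call(chunk) on each, and pause after every full batch of calls."""
    results = []
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    for n, chunk in enumerate(chunks, start=1):
        results.append(call(chunk))
        # Pause between batches, but not after the last chunk
        if n % calls_per_batch == 0 and n < len(chunks):
            sleep(pause)
    return results


pauses = []  # records the requested pauses instead of actually sleeping
results = batched_extract(list(range(27)), call=lambda chunk: chunk, sleep=pauses.append)
print(len(results), results[0], results[-1], pauses)
```

Because the chunks are derived from `len(docs)`, the helper never issues a call with an empty chunk when fewer documents are retrieved than expected.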

Extract the relevant content sections by running a semantic search with the query on the vector database. Invoke the extraction chain on the retrieved content sections to pull out the strategic business priorities of the company in json format

In [13]:
query_stimp = "Strategic priorities and initiatives envisioned by the company related to product, market, internal processes or capability development"
context_stimp = vectordb.similarity_search(query=query_stimp,k=9)
sb_priorities = list()
#k=9 yields three chunks of 3 documents each, so one batch of three calls covers all the retrieved content
sb_priorities += [extraction_chain.invoke({"query":query_stimp,"context":context_stimp[i*3:i*3+3]}) for i in range(3)]
delay()
conct_sb_priorities = combine_content_with_common_key(sb_priorities)

Create separate LLM chains that parse the data returned by the extraction chains and produce professional content on the business challenges and strategic priorities, suitable for executive consumption

In [14]:
interpret_model = ChatOpenAI(model_name='gpt-3.5-turbo',temperature=0.05,openai_api_key='sk-****************************************')
pp_prompt = ChatPromptTemplate.from_template(
    "Summarize the specific challenges of the company in one or two sentences based on the reference text in triple backticks. Please avoid generic statements, be as specific as possible and eliminate duplication. Please lay out challenge details in crisp format and structure response as JSON\n```{challenges}```"
)
chain_pp = LLMChain(llm=interpret_model,prompt=pp_prompt)
b_challenges = json.loads(chain_pp.run(concat_dict(conct_challenges)))


In [15]:
sb_prompt = ChatPromptTemplate.from_template(
    "Summarize the specific business priorities of the company in one or two sentences based on the reference text in triple backticks. Please avoid generic statements, be as specific as possible and eliminate duplication. Please lay out priority details in crisp format and structure response as JSON\n```{sb_priorities}```"
)
chain_sb = LLMChain(llm=interpret_model,prompt=sb_prompt)
b_priorities = json.loads(chain_sb.run(concat_dict(conct_sb_priorities)))
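
Calling `json.loads` directly on the reply works when the model returns bare JSON, but chat models may wrap the JSON in a Markdown code fence, which would make the call fail. A tolerant parser could look like the sketch below; `parse_llm_json` is a hypothetical helper, not part of the notebook or of LangChain.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker, built this way to keep the example readable


def parse_llm_json(text):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, cleaned, flags=re.DOTALL)
    if match:
        cleaned = match.group(1)  # keep only the content between the fences
    return json.loads(cleaned)


bare = parse_llm_json('{"supply_chain": "Chip shortages"}')
fenced = parse_llm_json(FENCE + 'json\n{"supply_chain": "Chip shortages"}\n' + FENCE)
print(bare == fenced)
```

Under that assumption, `json.loads(chain_sb.run(...))` could be replaced with `parse_llm_json(chain_sb.run(...))`.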

Create an LLM chain comprising a prompt and an LLM that processes the business challenges and business priorities generated by the prior chains to recommend the competencies that the business should focus on developing

In [16]:
comp_prompt = ChatPromptTemplate.from_template(
    """Identify the differentiating competencies that need to be developed by the company. A competency is a specific area of expertise that would help improve product, brand or internal operations developed through the right alignment of system, process reinvention and operating model transformation. 
    A competency is not an initiative. The competencies should be able to solve the challenges enclosed in ### and help achieve strategic business priorities enclosed in triple backticks. 
    Please lay the response in json format having a crisp verbiage suitable for CXO consumption\n###{b_challenges}###\n```{b_priorities}```. 
    Each competency should be accompanied by short description as value of less than 30 words outlining the goal it would achieve. No need to list the challenges and business priorities separately. 
    Follow the unrelated example enclosed in $$$ for guidance with response enclosed in ***
    $$$###Fragmented and adhoc reporting which is manually driven and incoherence between the insights across reports###
    ```Simplified and systematic procedures for generating business insights to be consumed by business executives```$$$
    ***Reporting factory:Trigger based generation of reports backed by reporting templates to simplify and consolidate delivery of insights***"""
)
chain_comp = LLMChain(llm=interpret_model,prompt=comp_prompt)
competency = json.loads(chain_comp.run({"b_challenges":compile_para(b_challenges),"b_priorities":compile_para(b_priorities)}))

In [17]:
comp_dict = competency['competencies']
potential_competency = {item['competency']:item['description'] for item in comp_dict}
potential_competency

{'Strengthen credit appraisal and risk management systems': 'Improve the ability to assess creditworthiness and manage risks associated with loans',
 'Invest in technology systems and processes': 'Enhance operational and managerial efficiency through the adoption of advanced technology',
 'Deploy strong technology systems': 'Enable swift response to market opportunities and challenges through robust technology infrastructure',
 'Grow Gold loan business': "Expand the company's gold loan portfolio to increase market share and revenue",
 'Explore opportunities in rural India': 'Tap into the potential of rural areas in India for gold loan financing',
 'Implement technology-led processing systems': 'Improve efficiency and risk management capabilities through the adoption of technology-driven processing systems'}
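
The last cell assumes the reply has a `competencies` list whose items carry `competency` and `description` keys. The prompt does not fix those key names, so the model may choose a different shape on another run. The sketch below normalizes two plausible shapes into the same mapping; `to_competency_map` and the flat-mapping shape are assumptions made for illustration.

```python
def to_competency_map(parsed):
    """Normalize a parsed reply into {competency_name: description}."""
    # Unwrap the top-level "competencies" key when present
    items = parsed.get("competencies", parsed) if isinstance(parsed, dict) else parsed
    if isinstance(items, dict):
        # Flat mapping of name -> description
        return {str(name): str(text) for name, text in items.items()}
    # List of {"competency": ..., "description": ...} records
    return {item["competency"]: item["description"] for item in items}


nested = {"competencies": [{"competency": "Reporting factory", "description": "Trigger based reports"}]}
flat = {"Reporting factory": "Trigger based reports"}
print(to_competency_map(nested) == to_competency_map(flat))
```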