<a href="https://colab.research.google.com/github/sanjeeth-baliga/The-Learning-Expedition/blob/main/Business_Value_Delivery_with_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction

This work demonstrates how Large Language Models (LLMs) can help strategize and ideate solutions to business problems by applying the knowledge held in internal expert systems to customer feedback. The experiment uses English-language Amazon reviews from a publicly available Hugging Face [data source](https://huggingface.co/datasets/amazon_reviews_multi).

The experiment comprises four broad stages:


1.   **Dataset load and augmentation:** This stage extracts and prepares the customer reviews for analysis in the subsequent stages. An LLM determines the sentiment of each review, and that sentiment is added to the information associated with the review
2.   **Semantic Similarity Search for Extraction of Relevant Reviews:** Semantic Kernel orchestrates the retrieval of a specific set of reviews based on a query that names the product category of interest and the sentiment relevant to the business requirements. The extracted customer reviews are then assessed with an LLM to identify the underlying key factors to be used for root cause analysis
3.  **Root Cause Analysis with Cosine Similarity Evaluation of Sentence Embeddings:** This stage uses semantic similarity on vector embeddings to identify the root causes, fetched from expert systems, that correspond most closely to the identified key customer issues
4. **Solution Ideation with Cosine Similarity Evaluation of Sentence Embeddings:** Once the root causes are identified, the internal expert systems can be relied upon to ideate suitable solutions by applying the same semantic similarity technique to the vector embeddings of the root causes



The solution relies heavily on open source Hugging Face LLMs and on Semantic Kernel, Microsoft's open source orchestration SDK

In [162]:
!pip install transformers
!pip install semantic_kernel
!pip install datasets
!pip install chromadb
!pip install sentence-transformers



In [201]:
from datasets import load_dataset
from datasets import Dataset
from transformers import AutoTokenizer,AutoModelForSequenceClassification,AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer
import semantic_kernel as sk
from semantic_kernel.connectors.ai.hugging_face import HuggingFaceTextCompletion, HuggingFaceTextEmbedding
from semantic_kernel.connectors.memory.chroma import ChromaMemoryStore
from IPython.display import display, Markdown
import torch
import numpy as np
import pandas as pd
import itertools

##Dataset load and augmentation

To experiment with the idea, the English-language Amazon reviews dataset is used, as it reflects the customer experience of purchases and after-sales service on the e-commerce platform

Only a small fraction of the data entries is used here to accommodate computing resource constraints. This is not expected to make much difference, as the LLMs are used primarily through in-context learning rather than being trained on this data

In [167]:
reviews = load_dataset('amazon_reviews_multi','en',split = 'train[:10%]')
reviews = reviews.select(range(600))


To augment the dataset for the subsequent semantic similarity search, the DistilBERT model is used to estimate the sentiment of each review, and the sentiment is appended to the review text

In [164]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

#Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

#Instantiate the model
id2label = {0:'Negative',1:'Positive'}
label2id = {id2label[key]:key for key in id2label.keys()}
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,num_labels=2,id2label=id2label,label2id=label2id)
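A sequence classifier of this kind returns one raw score (logit) per class, and the predicted label is the class with the largest score, looked up through `id2label`. The sketch below illustrates that step in plain Python. The logits are hypothetical values, not real model output, and the helper names `softmax` and `label_from_logits` are introduced here only for illustration.

```python
import math

# Same label mapping as the notebook's id2label
id2label = {0: 'Negative', 1: 'Positive'}

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Hypothetical logits for two reviews (not real model output)
example_logits = [[2.1, -1.3], [-0.4, 1.7]]
labels = [label_from_logits(row) for row in example_logits]
```

Taking the argmax of the logits gives the same label as taking the argmax of the softmax probabilities, so the softmax is only needed when a confidence value is wanted alongside the label.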


In [165]:
#Function to identify the sentiment of each review and append it to the review text for subsequent parsing by LLM
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
def gauge_sentiment(item):
  review = [item['review_title'][i].strip()+': '+item['review_body'][i].strip() for i in range(len(item['review_title']))]
  encoded_data = tokenizer(review,truncation=True,padding=True,max_length=512,return_tensors='pt')
  device = model.device
  encoded_data.to(device)
  with torch.no_grad():
    generated_data = model(**encoded_data).logits
  sentiment = [model.config.id2label[data.argmax().item()] for data in generated_data]
  item['review_data'] = [f"""###Product Category:{item['product_category'][i]}###Review:{review[i]}\
                        ###Sentiment:{sentiment[i]}###Ratings:{item['stars'][i]}""" for i in range(len(item['product_category']))]
  return item
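The `review_data` string packs four fields behind `###` markers, and later cells need to pull individual fields back out. As a minimal sketch of that record format, the helpers below build a record without the stray whitespace that a line continuation inside a triple-quoted string introduces, and parse it back into a dictionary. `build_record` and `parse_record` are hypothetical names, and the sample review is invented.

```python
def build_record(category, review, sentiment, stars):
    """Join the four fields with '###' markers, without stray whitespace."""
    return (f"###Product Category:{category}###Review:{review}"
            f"###Sentiment:{sentiment}###Ratings:{stars}")

def parse_record(record):
    """Split a record back into a dict of field name -> value."""
    fields = {}
    for chunk in record.split('###'):
        if not chunk.strip():
            continue  # skip the empty piece before the first marker
        name, _, value = chunk.partition(':')  # split on the first colon only
        fields[name.strip()] = value.strip()
    return fields

# Hypothetical review, for illustration only
record = build_record('apparel', 'Too small: runs a size short', 'Negative', 2)
parsed = parse_record(record)
```

Splitting on the first colon only keeps colons inside the review text intact. A review that itself contains `###` would still confuse this format, so a structured store (for example separate metadata fields) may be a safer choice outside a demonstration.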

During sentiment extraction, the columns that are no longer needed can be dropped from the dataset. The column of interest is the review_data column

In [168]:
review_processed = reviews.map(gauge_sentiment,batched=True,batch_size=100,remove_columns=list(reviews.features.keys()))


##Semantic Similarity Search for Extraction of Relevant Reviews

Semantic Kernel is invoked to orchestrate the similarity search and identify the reviews relevant to a query. For this purpose, an embedding generation service backed by a Hugging Face model is added to the kernel as a connector. The reviews to be searched are stored in the Chroma vector database

In [169]:
kernel = sk.Kernel()
kernel.add_text_embedding_generation_service("sentence-transformers/all-MiniLM-L6-v2",HuggingFaceTextEmbedding("sentence-transformers/all-MiniLM-L6-v2"))
kernel.register_memory_store(memory_store=ChromaMemoryStore(persist_directory='product_review'))
memory_collection = 'review_store'
print("You made an open source kernel using an open source AI model!")


You made an open source kernel using an open source AI model!


The reviews augmented with the sentiment information are stored in the Chroma memory vector store as embeddings for subsequent retrieval through semantic similarity

In [170]:
for i,review_context in enumerate(review_processed['review_data']):
  await kernel.memory.save_information_async(memory_collection,id=f'Review-{i}',text=review_context)

The semantic kernel can be leveraged to run a semantic similarity search query on the stored vector embeddings to obtain the most relevant reviews based on the specified product category

In [181]:
query = 'What are the common challenges faced in the apparel category based on the reviews'
analysis_context = await kernel.memory.search_async(memory_collection,query,limit=15,min_relevance_score=0.5)
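#Keep only the review body: take the 'Review:<title>: <body>' field and split on the first two colons only, so that colons inside the body are preserved
k_repository = [result.text.split('###')[2].strip().split(':',2)[2].strip() for result in analysis_context]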
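The search call takes a `limit` and a `min_relevance_score`. The following sketch shows what those two parameters mean, assuming the relevance score is the cosine similarity between the query embedding and each stored embedding. It is not how Semantic Kernel or Chroma implement the search; the `search` helper and the toy 2-D vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(store, query_vec, limit=15, min_relevance_score=0.5):
    """Return up to `limit` (id, score) pairs scoring at least the threshold."""
    scored = [(cosine(vec, query_vec), key) for key, vec in store.items()]
    kept = [(score, key) for score, key in scored if score >= min_relevance_score]
    kept.sort(reverse=True)  # highest similarity first
    return [(key, round(score, 3)) for score, key in kept[:limit]]

# Toy 2-D vectors standing in for review embeddings (hypothetical)
store = {
    'Review-0': [1.0, 0.0],
    'Review-1': [0.8, 0.6],
    'Review-2': [0.0, 1.0],
    'Review-3': [-1.0, 0.0],
}
hits = search(store, [1.0, 0.0], limit=2)
```

The threshold is applied before the limit, so a strict threshold can return fewer than `limit` results even when the store holds many entries.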

The raw reviews cannot be used directly for further processing. They require cleanup and summarization to extract the primary issue for subsequent root cause analysis.

For this purpose, the flan-t5 model is used in in-context learning mode, as it is a versatile LLM fine-tuned on a wide range of tasks, including summarization and extraction

In [190]:
#The flan-t5 model requires a specific prompt for best performance based on the text pattern it was fine-tuned on
prompt = [f'Here is a dialogue:{text}\nWhat was going on the conversation?' for text in k_repository]

summary_checkpoint = 'google/flan-t5-base'
summary_token = AutoTokenizer.from_pretrained(summary_checkpoint)
summary_model = AutoModelForSeq2SeqLM.from_pretrained(summary_checkpoint)
#The T5 tokenizer already defines its own <pad> token, so no extra padding token is added here

#A function is defined to encode the raw reviews and extract their key traits with the flan-t5 LLM
def extract_issue(item):
  encoded_data = summary_token(item,truncation=True,padding=True,max_length=512,return_tensors='pt')
  #Pass the attention mask along with the input ids so that padded positions are ignored
  issue_tokens = summary_model.generate(**encoded_data,max_length=512)
  decoded_tokens = summary_token.batch_decode(issue_tokens,skip_special_tokens=True)
  return decoded_tokens

The cleaned up problems are stored in a pain points pandas data frame for subsequent processing

In [215]:
pain_points = [extract_issue(text)[0] for text in prompt]
pp_frame = pd.DataFrame({'PainPoints':pain_points})

##Root Cause Analysis with Cosine Similarity Evaluation of Sentence Embeddings

Once the reviews are cleaned up and the key issues they represent are extracted, the key issues can be evaluated for root cause determination

To perform the root cause analysis, a JSON file was created with synthetically generated data (using ChatGPT) on the common industrial problems faced in e-commerce and their root causes. This data can be replaced by the historical issue patterns from a company's internal data store

In [216]:
#The code snippet here loads the JSON file and structures it into a format suitable for embedding vector generation
rca = load_dataset('json',data_files = 'define.json')
rca_dict={'Problem':[],'Reasons':[]}
for item in rca['train']['CustomerOperationalProblems'][0]:
  rca_dict['Problem'].append(item['Problem'])
  rca_dict['Reasons'].append(item['Reasons'])
rca_set = Dataset.from_dict(rca_dict)
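Judging from the keys the loader reads, the JSON file holds a `CustomerOperationalProblems` list whose entries each carry a `Problem` string and a list of `Reasons`. The sketch below shows a file of that shape and the same flattening done with the standard `json` module instead of `load_dataset`. The problem and reason texts are invented for illustration and are not taken from the actual `define.json`; `flatten_problems` is a hypothetical helper.

```python
import json

# Hypothetical content in the shape the loader expects; the keys mirror
# those read by the notebook, the values are invented for illustration.
sample_json = """
{
  "CustomerOperationalProblems": [
    {"Problem": "Late delivery",
     "Reasons": ["Carrier delays", "Inaccurate delivery estimates"]},
    {"Problem": "Product differs from description",
     "Reasons": ["Outdated product images", "Incomplete product descriptions"]}
  ]
}
"""

def flatten_problems(raw):
    """Turn the nested JSON into parallel 'Problem' and 'Reasons' columns."""
    records = json.loads(raw)['CustomerOperationalProblems']
    return {
        'Problem': [record['Problem'] for record in records],
        'Reasons': [record['Reasons'] for record in records],
    }

rca_example = flatten_problems(sample_json)
```

A company's own issue history could be exported into the same two-column shape, which is all the later embedding and matching steps rely on.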

Once the data on industrial issue patterns is loaded, the problems captured in it are matched against the problems identified from the customer reviews.

For matching and similarity detection, the embedding vector associated with each problem identified from customer reviews is evaluated against each problem in the industrial issue patterns through a cosine similarity function.
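For reference, the cosine similarity of two embedding vectors $\mathbf{u}$ and $\mathbf{v}$ is the cosine of the angle between them, written here in its standard form (library implementations typically add a small constant to the denominator to avoid division by zero):

$$
\operatorname{cos\_sim}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert_2\,\lVert\mathbf{v}\rVert_2} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
$$

As a small worked example with made-up vectors, $\mathbf{u}=(1,0)$ and $\mathbf{v}=(0.8,0.6)$ give $\mathbf{u}\cdot\mathbf{v}=0.8$ and $\lVert\mathbf{u}\rVert_2=\lVert\mathbf{v}\rVert_2=1$, so the similarity is $0.8$. Values close to 1 indicate that two sentences point in nearly the same direction in embedding space, regardless of vector length.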

In [217]:
# A sentence transformer is used to generate the vector embedding for each sentence representing a problem
embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

#The function generates a sentence embedding for each problem specified in the industry issue pattern and
#integrates the resulting answer into a dataframe with the text representation of the problem
def generate_frame_with_embedding(data):
  def gen_embedding(item):
    item['ProblemEmbedding'] = embedding_model.encode(item['Problem'])
    return item
  embed_dataset = data.map(gen_embedding)
  embed_dataset.set_format('pandas')
  frame = embed_dataset[:]
  return frame

In [218]:
rca_frame = generate_frame_with_embedding(rca_set)


The customer pain points (problems) extracted from the reviews are also subjected to embeddings generation with the resultant embeddings stored in the pain points data frame

In [219]:
pain_point_embedding = embedding_model.encode(pp_frame['PainPoints'])
pp_frame['PainPointEmbedding'] = pain_point_embedding.tolist()

The embeddings of the customer pain points are subjected to cosine similarity with the embeddings of each industrial issue. The top 2 industrial issues with the highest similarity are integrated into the pain points data frame

In [220]:
pp_embedding = torch.tensor(pp_frame['PainPointEmbedding'])
prob_embedding = torch.tensor(rca_frame['ProblemEmbedding'])
cossim = torch.nn.CosineSimilarity(dim=1,eps=1e-6)
pp_frame['MatchingIndustrialIssue'] = [rca_frame.loc[torch.topk(cossim(p_emb,prob_embedding),2).indices.tolist(),'Problem'].tolist() for p_emb in pp_embedding]

The root causes associated with the top 2 industrial issues are also looked up and their embedding vectors are generated.

Next, even among the identified root causes, the top 2 which generate the best similarity match with the original customer pain point are integrated into the pain points data frame

In [221]:
root_cause = []
root_cause_embedding = []

#The code snippet that follows looks up the root causes corresponding to the matching industrial issues and then
#extracts the top 2 root causes amongst those
for idx,core_problem in enumerate(pp_frame['MatchingIndustrialIssue']):
  reason = [rca_frame[item==rca_frame.Problem]['Reasons'].values[0].tolist() for item in core_problem]
  reason = list(itertools.chain(*reason))
  reason_embeddings = [embedding_model.encode(item) for item in reason]
  reason_embeddings_tensor = torch.tensor(reason_embeddings)
  top_reasons_index = torch.topk(cossim(torch.tensor(pp_frame.loc[idx,'PainPointEmbedding']),reason_embeddings_tensor),2).indices.tolist()
  root_cause.append([reason[top_i] for top_i in top_reasons_index])
  root_cause_embedding.append([reason_embeddings[top_i] for top_i in top_reasons_index])

#The top 2 likely root causes are integrated into the RootCause column while the corresponding embedding vectors are
#integrated into the RootCauseEmbedding column
pp_frame['RootCause'] = root_cause
pp_frame['RootCauseEmbedding'] = root_cause_embedding
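The root cause lookup is a two-stage narrowing: first pick the two closest industrial issues, then pool their reasons and pick the two reasons closest to the original pain point. The sketch below walks through that logic in plain Python with toy 2-D vectors in place of sentence embeddings. All names, texts, and vectors here are hypothetical; the notebook itself uses `torch.topk` on real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, candidates, k=2):
    """Return the k candidate names most similar to the query vector."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:k]

# Toy vectors standing in for sentence embeddings (hypothetical)
problem_vecs = {'Late delivery': [1.0, 0.1],
                'Wrong size': [0.1, 1.0],
                'Damaged item': [0.7, 0.7]}
reasons = {'Late delivery': ['Carrier delays', 'Poor delivery estimates'],
           'Wrong size': ['Inconsistent size charts'],
           'Damaged item': ['Weak packaging']}
reason_vecs = {'Carrier delays': [0.9, 0.0],
               'Poor delivery estimates': [0.8, 0.3],
               'Inconsistent size charts': [0.0, 1.0],
               'Weak packaging': [0.5, 0.6]}

pain_point = [1.0, 0.2]

# Stage 1: the two industrial issues closest to the pain point
issues = top_k(pain_point, problem_vecs)
# Stage 2: pool the reasons of those issues, then keep the two closest reasons
pooled = {r: reason_vecs[r] for issue in issues for r in reasons[issue]}
root_causes = top_k(pain_point, pooled)
```

Restricting stage 2 to the reasons of the shortlisted issues keeps the candidate pool small, at the cost of missing a relevant reason that belongs to an issue outside the top two.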

##Solution Ideation with Cosine Similarity Evaluation of Sentence Embeddings

The next step involves using the mapped root causes to ideate a set of interventions that can be set in motion to resolve the customer issues

For this purpose, a set of industrial solutions corresponding to the root causes of the industrial issue patterns was generated in JSON format (using ChatGPT) for reference. This list can easily be replaced with expert advice on solutions generated internally within the company

In [222]:
#The code snippet here loads the JSON file and structures it into a format suitable for embedding vector generation
ideas = load_dataset('json',data_files = 'ideas.json')
idea_dict = {'RootCause':[],'Ideas':[]}
for idea in ideas['train']['OperationalProblemIdeas'][0]:
  idea_dict['RootCause'].append(idea['RootCause'])
  idea_dict['Ideas'].append(idea['Ideas'])
idea_set = Dataset.from_dict(idea_dict)

In [223]:
#The function generates a sentence embedding for each industrial root cause and
#integrates the resulting answer into a dataframe with the text representation of the root causes and their corresponding ideas
def generate_idea_frame_with_embedding(data):
  def gen_embedding_for_idea(item):
    item['RootCauseEmbedding'] = embedding_model.encode(item['RootCause'])
    return item
  embedding_set = data.map(gen_embedding_for_idea)
  embedding_set.set_format('pandas')
  embedding_frame = embedding_set[:]
  return embedding_frame

In [224]:
idea_embedding_frame = generate_idea_frame_with_embedding(idea_set)


In [225]:
rootcause_emb = torch.tensor(idea_embedding_frame['RootCauseEmbedding'])
idea_list = []

#The code snippet that follows matches the root cause embeddings corresponding to the customer pain points with
#that of the root causes in the industrial solutions to look up the right ideas to generate interventions for each customer pain point
for idx,rc_embed in enumerate(pp_frame['RootCauseEmbedding']):
  top_match = [torch.topk(cossim(torch.tensor(r_cause),rootcause_emb),1).indices.item() for r_cause in rc_embed]
  ideas = [idea_embedding_frame.loc[i,'Ideas'].tolist() for i in top_match]
  ideas = set(itertools.chain(*ideas))
  idea_list.append(list(ideas))

#The interventions identified for each customer pain point are integrated into the IdeaList column
pp_frame['IdeaList'] = idea_list
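Each pain point has two root causes, and the ideas looked up for them can overlap, which is why the loop collapses them through a `set`. A set does not keep a stable order, so the sketch below shows an alternative that removes duplicates while preserving first-seen order. This is a swapped-in technique rather than the notebook's method; `merge_ideas` is a hypothetical helper and the idea texts are invented.

```python
def merge_ideas(idea_groups):
    """Flatten several idea lists, dropping duplicates but keeping first-seen order."""
    merged = {}
    for group in idea_groups:
        for idea in group:
            merged.setdefault(idea, None)  # dict keys keep insertion order
    return list(merged)

# Hypothetical ideas looked up for the two root causes of one pain point
ideas_per_root_cause = [
    ['Policy Change: Share delivery windows with a buffer',
     'Organizational Change: Review carrier contracts'],
    ['Policy Change: Share delivery windows with a buffer'],
]
merged = merge_ideas(ideas_per_root_cause)
```

When the final list is sorted before printing, as in the last cell of this notebook, either approach gives the same result; the order-preserving variant mainly helps when the per-pain-point lists are inspected directly.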

The exhaustive set of ideas can be listed for the business to act upon and resolve the customer pain points

In [226]:
print('The company should consider implementing the following ideas to overcome customer issues')
idea_overall = list(set(itertools.chain(*idea_list)))
idea_overall.sort()
print('-> '+'\n-> '.join(idea_overall))

The company should consider implementing the following ideas to overcome customer issues
-> Organizational Change: Improve coordination between shipping and inventory teams to minimize stockouts and overstock situations.
-> Organizational Change: Improve quality control and inspection processes to catch discrepancies before shipping.
-> Organizational Change: Involve product experts in the content creation process to prevent inaccuracies.
-> Policy Change: Clearly communicate delivery time frames with a buffer for potential delays to manage customer expectations.
-> Policy Change: Create a process for regularly updating product images to reflect any changes or variations.
-> Policy Change: Implement strict guidelines for product descriptions, ensuring accuracy and completeness.
-> Policy Change: Include prepaid return labels with all orders, making it easier for customers to initiate returns.
-> Product Improvement: Invest in high-quality product photography and image editing to acc