## AI summarization of passenger chats

This notebook summarizes Baly passenger chats with two LLM backends: Llama 3 (served through the Ollama API) and OpenAI (via Azure OpenAI).
The chat data is downloaded through the Freshchat REST APIs.

### Setup

You need:
  1. An input data frame containing the Freshchat data for the chosen period.
  2. An OpenAI `.env` file containing the API key and related settings.
  3. An Ollama endpoint.
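
Before running the cells that follow, it can help to confirm that the `.env` settings are actually visible to the process. The sketch below is only an illustration: `AZURE_OPENAI_KEY` and `AZURE_OPENAI_DEPLOYMENT` are the names this notebook reads later, while `AZURE_OPENAI_ENDPOINT` is an assumption (the `AzureOpenAI` client falls back to it when no endpoint argument is passed). `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# The first two names are read later in this notebook; the third is an
# assumption about how the Azure endpoint is supplied.
REQUIRED_VARS = ["AZURE_OPENAI_KEY", "AZURE_OPENAI_DEPLOYMENT", "AZURE_OPENAI_ENDPOINT"]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Call it after `load_dotenv()` so that values from the `.env` file are included in the check.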

In [15]:
%%capture
%pip install --quiet ollama openai tiktoken

## Data Load
We first load the Freshchat data, then clean it so that each conversation appears only once and boilerplate messages are removed.

In [52]:
import pandas as pd

df = pd.read_csv(
    'data.passenger.20240524.csv',
    dtype={'conversation_id': str, 'user_name': str, 'messages': str},
    usecols=['user_name', 'conversation_id', 'messages'],
)
df.head()

Unnamed: 0,user_name,conversation_id,messages
0,علي بازوكه,28516b4d-3cef-4270-95b0-43ae1a3139c2,['Conversation reopened due to new incoming me...
1,هشام,180bba97-f80d-4cb4-b29c-e789620ae9f7,"['🖕🖕🖕', 'Conversation was reopened by هشام', '..."
2,علي الموسوي الموسوي,8e66eba4-46e8-42ce-9ee5-7d092c7f2af9,['Conversation was reopened by علي الموسوي ال...
3,همام علي,f9262057-afbc-4527-9b0a-cafb1e0e97d8,['إذا ممكن اني طلبت رحله قبل شهر تقريبا \nوالو...
4,عصام ابو عسل,58ea8ba5-c0e8-45b4-a72d-81fd61fe2c15,"['Conversation was reopened by عصام ابو عسل', ..."


In [53]:
df = df.drop_duplicates(subset='conversation_id', keep='first')
df.describe()

Unnamed: 0,user_name,conversation_id,messages
count,681,681,681
unique,651,681,662
top,علي,28516b4d-3cef-4270-95b0-43ae1a3139c2,['السلام عليكم']
freq,4,1,6


In [54]:
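import re
# Candidate pattern for boilerplate messages. It is not applied in this cell,
# so the frame below is unchanged; the actual filtering happens later in
# filter_common_strings.
removeChatRegex = r"^(Conversation|التحدث مع موظف الدعم|السلام عليكم).*"

df.head()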

Unnamed: 0,user_name,conversation_id,messages
0,علي بازوكه,28516b4d-3cef-4270-95b0-43ae1a3139c2,['Conversation reopened due to new incoming me...
1,هشام,180bba97-f80d-4cb4-b29c-e789620ae9f7,"['🖕🖕🖕', 'Conversation was reopened by هشام', '..."
2,علي الموسوي الموسوي,8e66eba4-46e8-42ce-9ee5-7d092c7f2af9,['Conversation was reopened by علي الموسوي ال...
3,همام علي,f9262057-afbc-4527-9b0a-cafb1e0e97d8,['إذا ممكن اني طلبت رحله قبل شهر تقريبا \nوالو...
4,عصام ابو عسل,58ea8ba5-c0e8-45b4-a72d-81fd61fe2c15,"['Conversation was reopened by عصام ابو عسل', ..."


The `messages` column is stored in the CSV as the string representation of a Python list, so each cell has to be parsed back into a list of strings with `ast.literal_eval`.

In [55]:
from ast import literal_eval
df['messages'] = df['messages'].apply(literal_eval)

In [109]:
print(df['messages'][12])

['السائق طلب مبلغ إضافي', 'مشكلة في رحلة سابقة', 'الكابتن اخذ  مبلغ اضافي لعب بعداد الكروة']
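
The parsing step above assumes every cell is a well-formed list literal. A more tolerant variant is sketched here, assuming that missing or malformed cells should simply become empty lists; `parse_messages` is a hypothetical helper name.

```python
from ast import literal_eval

def parse_messages(cell):
    """Parse a stringified list of messages; return [] for missing or malformed cells."""
    if not isinstance(cell, str):
        return []  # e.g. NaN read from an empty CSV cell
    try:
        value = literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(value, list):
        return []
    # Keep only string entries so later joins do not fail.
    return [m for m in value if isinstance(m, str)]

print(parse_messages("['a', 'b']"))
```

It could replace the bare `literal_eval` in `df['messages'].apply(...)` if the export ever contains empty or truncated cells.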


In [57]:
import re

def filter_common_strings(array):
    # Drop exact boilerplate messages and any message starting with "Conversation" or "BLY-".
    filtered_strings = ['Conversation reopened due to new incoming message', 'سلام عليكم']
    filtered_re = r"^(Conversation|BLY-).*"
    return [x for x in array if x not in filtered_strings and re.search(filtered_re, x) is None]

df['messages'] = df['messages'].apply(filter_common_strings)

In [58]:
df.head()

Unnamed: 0,user_name,conversation_id,messages
0,علي بازوكه,28516b4d-3cef-4270-95b0-43ae1a3139c2,"[نسيت سماعات, نسيت يمه سماعات ودك علي مي جاوب ..."
1,هشام,180bba97-f80d-4cb4-b29c-e789620ae9f7,"[🖕🖕🖕, اني دافحص البرنامج همه مرضه يوافقون راسا..."
2,علي الموسوي الموسوي,8e66eba4-46e8-42ce-9ee5-7d092c7f2af9,"[ان شاء الـلَّــــﷻـــه, يرجى الاهتمام بهذه ال..."
3,همام علي,f9262057-afbc-4527-9b0a-cafb1e0e97d8,[إذا ممكن اني طلبت رحله قبل شهر تقريبا \nوالول...
4,عصام ابو عسل,58ea8ba5-c0e8-45b4-a72d-81fd61fe2c15,[التحدث مع موظف الدعم]


## OpenAI Summarization

In [23]:
%%capture
%pip install --quiet openai python-dotenv

In [24]:
from openai import AzureOpenAI
from dotenv import load_dotenv
import os
load_dotenv()

True

In [25]:
aclient = AzureOpenAI(
        api_key=os.environ['AZURE_OPENAI_KEY'],
        api_version = "2023-05-15")

deployment=os.environ['AZURE_OPENAI_DEPLOYMENT']
ENCODING_MODEL = "gpt-3.5-turbo"

In [26]:
import tiktoken
tokenizer = tiktoken.encoding_for_model(ENCODING_MODEL)

context = ''
token_length = 0

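for x in df['messages']:
    chat = '.'.join(x)  # join one conversation's messages into a single string
    token_length += len(tokenizer.encode(chat))
    # The count is updated before the check, so the printed total includes the
    # first chat that crossed the budget even though that chat is not added.
    if token_length >= 3000:
        break
    context += '\n\n' + chat

print(token_length, context)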

3171 

نسيت سماعات.نسيت يمه سماعات ودك علي مي جاوب .نسيت أغراضي خلال الرحلة

🖕🖕🖕.اني دافحص البرنامج همه مرضه يوافقون راسا حته ما الحك الغي .همه الكباتن ليش يوافقون على رحله نقطة الوصول والانطلاق نفس المكان .دفتح الحظر بابه

ان شاء الـلَّــــﷻـــه.يرجى الاهتمام بهذه الامور.فضلا عن نظافة وترتيب السيارة.وكذلك رقم ونوع ولون السيارة.غالبا الكابتن الموجود بالتطبيق يختلف عن الحقيقة

إذا ممكن اني طلبت رحله قبل شهر تقريبا 
والولد بقى يراسلني ويعاكسني وكتله اني مره مزوجه تلافي للمشكله 
ورجع كل شويه يراسلني اني. معجب بيج 
ومااذكر يارحله منهم بالضبط
بس ارقام ليراسلني منهم عندي.التحدث مع موظف الدعم.رجاءاً كلش محتاجتكم

التحدث مع موظف الدعم

استفسار عن المحفظة داخل التطبيق.تمام حبيبي .بعد الف كال ماعدي راح اضيفه عل محفضه .حبيبي اطيته خمسه رجعلي الف 

اي نعم نفس الشي صار بالرحلة الثانية

شكراً استاذ.اقتراحي تخفيض الاجور هذا يعود بمردود مادي قوي للشركة ومنافسه مشروعه وعدم قبول طلبات السواق وقتراح اخر تقديم جائزة افضل سائق شهرياً من خلال النقاط والتقييم الممتاز الي يحصل عليه هذا يشجعه على العمل بصورك م
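
The loop above stops at the first chat that crosses the token budget, so only the first few conversations reach the model. One possible alternative, sketched under assumptions, is to pack all chats into several batches that each stay within the budget. `pack_chats`, `count_tokens`, and `budget` are hypothetical names; the separators between chats are not counted, so leave some headroom.

```python
def pack_chats(chats, count_tokens, budget):
    """Greedily group chats into batches whose summed token count stays within `budget`."""
    batches, current, used = [], [], 0
    for chat in chats:
        n = count_tokens(chat)
        if n > budget:
            continue  # a single oversized chat is skipped in this sketch
        if current and used + n > budget:
            # Close the current batch before it would overflow.
            batches.append("\n\n".join(current))
            current, used = [], 0
        current.append(chat)
        used += n
    if current:
        batches.append("\n\n".join(current))
    return batches

# Toy run with character counts standing in for token counts.
print(pack_chats(["aaa", "bb", "cccc", "d"], len, 5))
```

In this notebook the call might look like `pack_chats(('.'.join(m) for m in df['messages']), lambda s: len(tokenizer.encode(s)), 3000)`, with each resulting batch summarized separately.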

In [124]:
systemPrompt = "You are an Arabic expert working for Baly, a ride-hailing company in Iraq"
def get_completion(prompt):
    messages = [
        {"role": "system", "content": systemPrompt},
        {"role": "user", "content": prompt},
    ]
    response = aclient.chat.completions.create(
        model=deployment,
        messages=messages,
        temperature=0,
        # max_tokens=900,
    )
    return response.choices[0].message.content

In [28]:
prompt = f"""
        Each line represents a conversation, containing messages from a single user in a session. Generate top 5 actionable insights for management.
        For each of your insights, you must include conversations with original text and its English translation that best support it.
        ```{context}```
         """
response = get_completion(prompt)
print(response)

Insight 1: Improve driver behavior and professionalism
- Conversation: إذا ممكن اني طلبت رحله قبل شهر تقريبا والولد بقى يراسلني ويعاكسني وكتله اني مره مزوجه تلافي للمشكله ورجع كل شويه يراسلني اني. معجب بيج
- Translation: If possible, I requested a ride about a month ago and the driver kept messaging me and harassing me, saying that I'm married to avoid the problem. He keeps messaging me every now and then, saying that I like him. 

Insight 2: Improve customer support response time and effectiveness
- Conversation: استفسار عن المحفظة داخل التطبيق.تمام حبيبي .بعد الف كال ماعدي راح اضيفه عل محفضه .حبيبي اطيته خمسه رجعلي الف
- Translation: Inquiry about the wallet inside the app. Okay, my dear. After a thousand calls, I still haven't added it to the wallet. I gave him five, he returned a thousand.

Insight 3: Address issues with fare calculation and payment discrepancies
- Conversation: والرحلة بالتطبيق 3000.صديقي دفعت 7000.على هذا الرقم.07724111351.صديقي رحت رحلة والكابتن اخذ مبلغ اضافي ك

## Using Llama3

Now let's try the same summarization using Llama 3 on Ollama.

In [10]:
%pip install --upgrade --quiet ollama langchain-text-splitters

Note: you may need to restart the kernel to use updated packages.


In [59]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

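# Note: this splitter counts tokens with a sentence-transformers tokenizer,
# not Llama 3's own tokenizer, so the counts are only an approximation.
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)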

print(splitter.count_tokens(text=prompt))



3016


In [60]:
import ollama

ollamaBase = 'http://ollama-alibo-gpu-testing.apps.private.okd4.teh-2.snappcloud.io/'
llmModel = 'llama3'
client = ollama.Client(ollamaBase)

def ollama_llm(prompt):
    response = client.chat(
        model=llmModel,
        options={'temperature': 0},
        messages=[
            {'role': 'system', 'content': systemPrompt},
            {'role': 'user', 'content': prompt},
        ],
    )
    return response['message']['content']

In [68]:
olmcontext = ''
token_length = 0
for x in df['messages']:
    chat = '.'.join(x)
    token_length += splitter.count_tokens(text=chat)
    if token_length >= 7000:
        print("reached max length",token_length)
        break
    olmcontext += '\n\n' + chat

llmprompt = f"""
        Each line represents a conversation, containing messages from a single user in a session. Generate top 5 actionable insights for management.
        For each of your insights, you must include conversations with original text and its English translation that best support it.
        ```{olmcontext}```
         """

reached max length 7107


In [69]:
print(ollama_llm(llmprompt))

I'm happy to help you with your ride-hailing company's CSAT score improvement! As an expert analyst, I'll provide you with some insights and suggestions based on the data you've shared.

Firstly, it seems that there are some issues with the driver's behavior, such as not following the route correctly, taking a longer route than necessary, and not providing a smooth ride. This could be affecting the overall CSAT score of your company.

To improve the CSAT score, I would suggest implementing some changes to the driver training program. For instance, you could provide additional training on navigation and route optimization, as well as emphasize the importance of providing a smooth ride for passengers.

Additionally, it might be helpful to implement a system that allows passengers to rate their drivers after each trip. This would give you valuable feedback on how your drivers are performing and allow you to identify any areas where they may need additional training or support.

It's also 

### Classification

TODO: Introduce function calling here so that the responses are structured.
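
Until function calling is in place, one lighter-weight option is to ask the model for a JSON object and validate the label against the fixed category list. The sketch below is an assumption about how that could look, not the notebook's method: `classify_structured`, `CATEGORIES`, and `FALLBACK` are hypothetical names, and `llm` is any callable that takes a prompt string and returns the model's reply as a string.

```python
import json

# Category names taken from the classification prompt in this notebook.
CATEGORIES = [
    "Fare Dispute", "Payment Issue", "Forgot an item",
    "Map/Pickup issues", "Different Driver/Car",
    "Acceptance/Cancellation issue",
]
FALLBACK = "Could not classify"

def classify_structured(chat, llm):
    """Ask `llm` for a JSON object and return a validated category label."""
    prompt = (
        "Classify the customer support chat into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + f". If none fits, use '{FALLBACK}'. "
        + 'Reply with JSON only, in the form {"category": "<label>"}.\n\nChat:\n'
        + chat
    )
    try:
        label = json.loads(llm(prompt)).get("category", "")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return FALLBACK  # reply was not a JSON object
    # Reject labels outside the fixed list instead of trusting the model.
    return label if label in CATEGORIES else FALLBACK
```

With the `ollama_llm` helper defined earlier, the call would be `classify_structured(chat, ollama_llm)`. The Ollama client's `chat` method also accepts a `format='json'` argument, which may make JSON replies more reliable; that is worth verifying against the installed client version.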

In [70]:
from langchain_experimental.llms.ollama_functions import OllamaFunctions
from langchain_core.pydantic_v1 import BaseModel

In [147]:
zero_shot_prompt = '''
You are analysing all chats with customer support and classifying them into one of six categories.
The six categories are Fare dispute, Payment Issue, Forgot an item, Map/Pickup issues, Different Driver/Car, Acceptance/Cancellation issue
If you can't tell what it is, say Could not classify. Below are some examples for each category

Fare Dispute: Driver charged more, incorrect fare, didn't return money/amount to wallet
Map/Pickup issues: Can't find Driver, route taken was incorrect, can't find address on app
Acceptance/Cancellation issue: Driver cancelled ride/asked to cancel
Payment Issue: Problems making payments
Different Driver/Car: Mismatch of plate number or driver 

Chat:

{0}

The classification is:'''

def classify(query):
    return ollama_llm(zero_shot_prompt.format(query))



In [158]:
from tqdm import tqdm
pdf = df.head(20).copy()  # copy so the new column is not assigned on a slice of df
pdf['class'] = [classify(x) for x in tqdm(pdf['messages'])]
pdf.head(20)


Unnamed: 0,user_name,conversation_id,messages,class
0,علي بازوكه,28516b4d-3cef-4270-95b0-43ae1a3139c2,"[نسيت سماعات, نسيت يمه سماعات ودك علي مي جاوب ...","Based on the chat, I would classify it as ""For..."
1,هشام,180bba97-f80d-4cb4-b29c-e789620ae9f7,"[🖕🖕🖕, اني دافحص البرنامج همه مرضه يوافقون راسا...",Could not classify. The chat appears to be in ...
2,علي الموسوي الموسوي,8e66eba4-46e8-42ce-9ee5-7d092c7f2af9,"[ان شاء الـلَّــــﷻـــه, يرجى الاهتمام بهذه ال...","Based on the chat, I would classify it as ""Dif..."
3,همام علي,f9262057-afbc-4527-9b0a-cafb1e0e97d8,[إذا ممكن اني طلبت رحله قبل شهر تقريبا \nوالول...,Could not classify. The chat appears to be a c...
4,عصام ابو عسل,58ea8ba5-c0e8-45b4-a72d-81fd61fe2c15,[التحدث مع موظف الدعم],Could not classify. The chat appears to be a g...
5,بقع,d46cbb0c-f693-42a1-96e2-cb8f0b97d80e,"[استفسار عن المحفظة داخل التطبيق, تمام حبيبي ,...",Could not classify. The chat appears to be a c...
6,Esraa Alhayali,d1cab72e-7129-4ea5-9f09-97f240f7510e,[اي نعم نفس الشي صار بالرحلة الثانية],Could not classify. The chat message appears t...
7,marwa,5e17693a-9081-4fcf-a48b-36c8208a8f33,"[شكراً استاذ, اقتراحي تخفيض الاجور هذا يعود بم...",Could not classify. The chat appears to be a t...
8,عبدالله حسام,f4ecb87d-4ecf-4c0b-a1f4-77119c6b5f31,"[شكرا الكم, بس اليوم انتبهت مستقطع من التطبيق ...","Based on the chat, I would classify it as:\n\n..."
9,عبود,d0218a6d-bee0-4e50-ac48-ecb4815379a6,"[والرحلة بالتطبيق 3000, صديقي دفعت 7000, على ه...",Fare Dispute\n\nReasoning: The customer is com...
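
The `class` column above holds free-form replies ("Based on the chat, I would classify it as ...") rather than bare labels. A simple post-processing step, sketched here as an assumption, is to pick whichever known label is mentioned first in the reply. `extract_label` is a hypothetical helper, and the matching is a plain case-insensitive substring search, so it can be fooled by replies that mention several categories.

```python
def extract_label(response, categories, fallback="Could not classify"):
    """Return the known label (or the fallback) mentioned earliest in a free-form reply."""
    text = response.lower()
    # The fallback competes with the categories, so a reply that opens with
    # "Could not classify" is not overridden by a category named later on.
    candidates = list(categories) + [fallback]
    hits = [(text.find(c.lower()), c) for c in candidates if c.lower() in text]
    return min(hits)[1] if hits else fallback

CATEGORIES = [
    "Fare Dispute", "Payment Issue", "Forgot an item",
    "Map/Pickup issues", "Different Driver/Car",
    "Acceptance/Cancellation issue",
]
print(extract_label("Fare Dispute\n\nReasoning: the customer was overcharged.", CATEGORIES))
```

Applied to the frame, this would be along the lines of `pdf['class'].apply(lambda r: extract_label(r, CATEGORIES))`.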
