# JioMart product metadata extraction

In this notebook we extract structured product metadata from JioMart item descriptions using the OpenAI GPT-3.5 API. The analysis is done in 2 steps:

1. Type identification: find out what kind of product an item description refers to. This is one of 3 values:
    - phone
    - smartwatch
    - null (anything else, such as phone cases or screen protectors)

2. Metadata extraction: for the items identified as a phone or a smartwatch, extract the following attributes as JSON:
    - Phones: brand, model, storage, ram, color
    - Smartwatches: brand, model, size, ram, color, special_features

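### Data Source

The data is the [JioMart product items](https://www.kaggle.com/datasets/mohit2512/jio-mart-product-items) dataset from Kaggle, downloaded below with `opendatasets`.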

### Environment Setup
This notebook was run using Python 3.8 with the following packages installed:
- langchain==0.0.177
- pandas==2.0.1
- opendatasets==0.1.22

## Load data

In [1]:
import opendatasets as od
from tempfile import mkdtemp
import pandas as pd

In [2]:
data_dir = mkdtemp()
od.download("https://www.kaggle.com/datasets/mohit2512/jio-mart-product-items", 
            data_dir=data_dir)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

  aarshayjain


Your Kaggle Key:

  ········


Downloading jio-mart-product-items.zip to /var/folders/rj/4_r942cx5xv13l6l57n5wz3m0000gn/T/tmp1s9w6lpe/jio-mart-product-items


100%|██████████| 2.31M/2.31M [00:00<00:00, 30.3MB/s]

In [3]:
df = pd.read_csv(f"{data_dir}/jio-mart-product-items/jio_mart_items.csv")
# drop href column
df.drop(columns="href", inplace=True)
# add ID column
df = df.reset_index(names=["id"])
df.shape

(162313, 5)

In [4]:
df.sample(2)

| | id | category | sub_category | items | price |
|---|---|---|---|---|---|
| 100140 | 100140 | Home & Kitchen | Garden & Outdoor | Kraft Seeds Rajnigandha Tuberose Flower Bulbs ... | 75000.0 |
| 20689 | 20689 | Groceries | Personal Care | Fem Gold Ultra Creme Bleach 30 g | 76.0 |


In [5]:
df.category.value_counts()

category
Home & Kitchen    60335
Groceries         46044
Fashion           26101
Electronics       19022
Beauty            10741
Jewellery            70
Name: count, dtype: int64

## Process mobiles and tablets

In [6]:
df_mnt = df.loc[(df.category == "Electronics") & (df.sub_category == "Mobiles & Tablets")]
df_mnt.shape

(3810, 5)

In [7]:
# take sample of 100
df_mnt = df_mnt.sample(100)
df_mnt.shape

(100, 5)

In [59]:
df_mnt["items"].sample(5).tolist()

['Regor Finger Grip/Selfie Holder & Mobile Stand for iPhones & Android Smartphones (Eiffel Tower)',
 'itek WCH003_BL 4 USB Ports 1 Amp Rapid Charge Wall Adapter',
 'Fossil Q Gen 4 Hr FTW6015 Smart Watch, Nude',
 'Noise Colorfit Pro 2 Full Touch Control Smart Watch (Deep Wine)',
 'Oppo A15 32 GB, 3 GB RAM, Dynamic Black, Mobile Phone']

## Prompt Engineering

Visit the [OpenAI API Playground](https://platform.openai.com/playground?mode=chat) to find the right prompts.

Let's say we only care about phones and smartwatches. So we have to do a few things:
1. find whether the item is a phone or a watch
2. if it's a phone, extract:
    - brand
    - model
    - storage
    - ram
    - color
3. if it's a watch, extract:
    - brand
    - model
    - size
    - storage
    - special_features
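
The target fields above can be written down as a small schema so that later parsing steps have something to check against. The sketch below is illustrative only: `FIELDS_BY_TYPE` and `align_to_schema` are hypothetical names that are not used elsewhere in this notebook.

```python
# Hypothetical schema for the fields listed above; not part of the original notebook.
FIELDS_BY_TYPE = {
    "phone": ["brand", "model", "storage", "ram", "color"],
    "watch": ["brand", "model", "size", "storage", "special_features"],
}


def align_to_schema(item_type, parsed):
    """Keep only the expected fields for item_type, filling missing ones with None."""
    fields = FIELDS_BY_TYPE.get(item_type)
    if fields is None:
        # unknown type (e.g. a phone case): nothing to extract
        return None
    return {field: parsed.get(field) for field in fields}


example = align_to_schema("phone", {"brand": "OPPO", "model": "A15s", "extra": 1})
print(example)
```

Aligning every parsed reply to a fixed field list keeps the resulting DataFrame columns stable even when the model omits or adds keys.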

### Cached prompts

Identify phone vs. watch:

```
given a product description of an electronics item, you are supposed to identify whether it is a phone or a smartwatch. if it is neither of them, return null. 

reply in only one word.

description:
Inclu For Lenovo K3 Note Waterproof,Artificial Leather,Scratch Resident,Magnetic Lock Holster Case
```

Additions for separating out phone cases:
```
some examples of null categories are phone cases
```
```
some of the product descriptions would appear like a phone but are a case for the phone, those should be marked as null and not phone.
```

Get phone information:
```
given a product description of a mobile phone, you are supposed to extract the following information from it in a json format:
- brand: which brand is selling the phone
- model: model of the specific phone
- storage: the amount of data that can be stored in the phone
- ram: memory of the phone
- color
- special_features: any other features mentioned

if the entered description is not for a mobile phone, return null.

create the json for the following:

Redmi Note 11 Pro Plus 5G 6 BM RAM, 128 GB, Phantom White, Mobile Phone
```

## Process all items

### Setup langchain

[API reference](https://python.langchain.com/en/latest/modules/models/chat/integrations/openai.html)

In [9]:
# load open ai api key
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (SystemMessagePromptTemplate, 
                                    HumanMessagePromptTemplate, 
                                    ChatPromptTemplate)

In [11]:
def build_chat_prompt(template):
    system_message_prompt = SystemMessagePromptTemplate.from_template("")
    user_message_prompt = HumanMessagePromptTemplate.from_template(template)
    return ChatPromptTemplate.from_messages([
        system_message_prompt,
        user_message_prompt])

In [12]:
chat = ChatOpenAI(temperature=0)

### Parse types

In [15]:
type_prompt = build_chat_prompt("""
given a product description of an electronics item, you are supposed to identify \
whether it is a phone or a smartwatch. if it is neither of them, return null. 

some of the product descriptions would appear like a phone but are other items. \
eg, phone cases, screen protectors. those should be marked as null and not phone.

reply in only one word.

description:
{description}
""")

In [16]:
desc = df_mnt["items"].sample(1).tolist()[0]
response = chat(type_prompt.format_prompt(description=desc).to_messages())
print(f"description: {desc} \n\nresponse: {response.content}")

description: POCO F4 GT Back Screen Protector By Ctel, 3D Back Skin Carbon Fiber Ultra-Thin Protective Film (2 Packs) Transparent Back Cover For POCO F4 GT 

response: null


In [17]:
# create batch messages to pass to api
batch_messages_type = [type_prompt.format_prompt(description=desc).to_messages()
                       for desc in df_mnt["items"]]
len(batch_messages_type)

100

In [18]:
response_type = chat.generate(batch_messages_type)

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised RateLimitError: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID 90fb8377f9632ae1f7c4dfa17e9b1366 in your message.).
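
The retry messages above come from langchain's built-in retry logic when the API reports it is overloaded. One possible mitigation, sketched here as an assumption rather than something this notebook does, is to split the batch into smaller chunks and call `chat.generate` once per chunk. `chunked` is a hypothetical helper, and the `chat.generate` call is shown only in a comment so the block runs without API access.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with the objects defined in this notebook:
# generations = []
# for chunk in chunked(batch_messages_type, 20):
#     generations.extend(chat.generate(chunk).generations)

chunks = list(chunked(list(range(5)), 2))
print(chunks)
```

With this approach the token usage is reported per chunk, so totals would have to be summed across the chunk responses.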

In [19]:
# strip whitespace so replies such as "phone\n" still match the expected labels
df_mnt.loc[:, "type"] = [x[0].text.strip().lower() for x in response_type.generations]

In [20]:
df_mnt.type.value_counts()

type
null          60
phone         27
smartwatch    13
Name: count, dtype: int64
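
The counts above rely on the model replying with exactly one of the three expected words. A defensive normalization step could look like the sketch below; `VALID_TYPES` and `normalize_type` are hypothetical names, and mapping every unexpected reply to `null` is an assumption about the desired behavior.

```python
import string

VALID_TYPES = {"phone", "smartwatch", "null"}  # labels requested in the prompt


def normalize_type(raw):
    """Lowercase a one-word reply, drop surrounding whitespace and punctuation,
    and map anything outside the expected labels to "null"."""
    cleaned = raw.strip().strip(string.punctuation).strip().lower()
    return cleaned if cleaned in VALID_TYPES else "null"


labels = [normalize_type(r) for r in ["Phone", " smartwatch.\n", "Tablet", "null"]]
print(labels)
```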

## Extract phone metadata

In [21]:
df_phone = df_mnt.loc[df_mnt.type == "phone"]
df_phone.shape

(27, 6)

In [22]:
# initialize our results
phone_db = {}

In [23]:
phone_meta_prompt = build_chat_prompt("""
given a product description of a mobile phone, you are supposed to extract the following \
information from it in a json format:
- brand
- model: model of the specific phone
- storage: the amount of data that can be stored in the phone
- ram: memory of the phone
- color

create the json for the following:

{description}
""")

In [24]:
desc = df_phone["items"].sample(1).tolist()[0]
response = chat(phone_meta_prompt.format_prompt(description=desc).to_messages())
print(f"description: {desc} \n\nresponse: {response.content}")

description: OPPO A15s 64 GB, 4 GB RAM, Dynamic Black, Mobile Phone 

response: {
  "brand": "OPPO",
  "model": "A15s",
  "storage": "64 GB",
  "ram": "4 GB",
  "color": "Dynamic Black"
}


In [25]:
# create batch messages to pass to api
batch_messages_phone = [phone_meta_prompt.format_prompt(description=desc).to_messages()
                       for desc in df_phone["items"]]
len(batch_messages_phone)

27

In [26]:
response_phone = chat.generate(batch_messages_phone)

In [29]:
import re
import json

In [70]:
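def extract_dict(content, id):
    """Parse the first {...} block in the model reply and attach the item id."""
    try:
        # non-greedy match: stops at the first "}", so nested JSON objects fail to parse
        pattern = re.compile(r"(\{(?s:.*?)\})")
        json_text = pattern.search(content).group(1)
        return {**json.loads(json_text), "id": id}
    except (AttributeError, json.JSONDecodeError):
        # AttributeError: no {...} block found; JSONDecodeError: matched text is not valid JSON
        print(f"unable to parse json for id: {id}, description: {content}, returning empty")
        return {}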

In [51]:
df_phone_meta = pd.DataFrame([
    extract_dict(x[0].text, id)
    for id, x in zip(df_phone.id, response_phone.generations)])\
    .merge(df_mnt[["id", "items"]], how="left", on=["id"])

In [54]:
df_phone_meta.sample(5)

| | brand | model | storage | ram | color | id | items |
|---|---|---|---|---|---|---|---|
| 22 | Redmi | Note 10S | 64 GB | 6 GB | Frost White | 133029 | Redmi Note 10S 64 GB, 6 GB RAM, Frost White, M... |
| 4 | Apple | iPhone 13 Pro | 512 GB | Not specified | Green | 132766 | Apple iPhone 13 Pro 512 GB, Green |
| 25 | Xiaomi | Redmi 10 Prime | 128 GB | 6 GB | Astral White | 132831 | Xiaomi Redmi 10 Prime 128 GB, 6 GB RAM, Astral... |
| 21 | Samsung | Galaxy Z Series Fold3 5G | 256 GB | 12 GB | Phantom Green | 132871 | Samsung Galaxy Z Series Fold3 5G 256 GB, 12 GB... |
| 12 | Redmi | Note 10S | 128 GB | 6 GB | Frost White | 133079 | Redmi Note 10S 128 GB, 6 GB RAM, Frost White, ... |
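
The extracted `storage` and `ram` values are strings such as `64 GB`, and a missing value can come back as free text like `Not specified` (see the iPhone row). If numeric columns are needed, one option is a small parser like the one below. `parse_gb` is a hypothetical helper and assumes sizes are always expressed in GB.

```python
import re

_GB_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*GB", re.IGNORECASE)


def parse_gb(value):
    """Return the size in GB as a float, or None if no GB size is found."""
    if not isinstance(value, str):
        # covers None and the NaN floats pandas uses for missing values
        return None
    match = _GB_PATTERN.search(value)
    return float(match.group(1)) if match else None


sizes = [parse_gb(v) for v in ["64 GB", "Not specified", "12gb", None]]
print(sizes)
```

Applied to the table above, this could be used as `df_phone_meta["storage"].map(parse_gb)` to build a numeric column.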


## Extract smartwatch metadata

In [55]:
df_watch = df_mnt.loc[df_mnt.type == "smartwatch"]
df_watch.shape

(13, 6)

In [62]:
watch_meta_prompt = build_chat_prompt("""
given a product description of a smart watch, you are supposed to extract the following \
information from it in a json format:
- brand
- model: model of the specific watch
- size
- ram: memory of the watch
- color
- special_features: any additional features mentioned

if any value is missing, return null. 

create the json for the following:

{description}
""")

In [63]:
desc = df_watch["items"].sample(1).tolist()[0]
response = chat(watch_meta_prompt.format_prompt(description=desc).to_messages())
print(f"description: {desc} \n\nresponse: {response.content}")

description: Samsung Galaxy Fit Lite SM-R375 Fitness Band, Black 

response: {
  "brand": "Samsung",
  "model": "Galaxy Fit Lite SM-R375",
  "size": null,
  "ram": null,
  "color": "Black",
  "special_features": "Fitness Band"
}


In [64]:
# create batch messages to pass to api
batch_messages_watch = [watch_meta_prompt.format_prompt(description=desc).to_messages()
                       for desc in df_watch["items"]]
len(batch_messages_watch)

13

In [65]:
response_watch = chat.generate(batch_messages_watch)

In [71]:
df_watch_meta = pd.DataFrame([
    extract_dict(x[0].text, id)
    for id, x in zip(df_watch.id, response_watch.generations)])\
    .merge(df_mnt[["id", "items"]], how="left", on=["id"])

unable to parse json for id: 133921, description: {
  "brand": "Apple",
  "model": "Watch Series SE GPS + Cellular",
  "size": "40 mm",
  "ram": null,
  "color": {
    "case": "Silver Aluminum",
    "band": "Abyss Blue Sport"
  },
  "special_features": [
    "GPS",
    "Cellular connectivity"
  ]
}, returning empty
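
The failure above is caused by the non-greedy pattern in `extract_dict`: it stops at the first `}`, which here closes the nested `color` object, so the captured text is not valid JSON. A possible alternative, sketched below under the assumption that the reply contains a single top-level JSON object, is to let `json.JSONDecoder.raw_decode` parse from the first `{`. `extract_json_object` is a hypothetical name.

```python
import json


def extract_json_object(content):
    """Return the first JSON object embedded in content, or None if there is none."""
    start = content.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores trailing text
        obj, _end = json.JSONDecoder().raw_decode(content, start)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


reply = """Here is the json:
{
  "brand": "Apple",
  "color": {"case": "Silver Aluminum", "band": "Abyss Blue Sport"},
  "special_features": ["GPS", "Cellular connectivity"]
}"""
parsed = extract_json_object(reply)
print(parsed["color"]["band"])
```

Nested values such as the `color` dict and the `special_features` list would still need flattening before they fit cleanly into a single DataFrame column.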


In [72]:
df_watch_meta.sample(5)

| | brand | model | size | ram | color | special_features | id | items |
|---|---|---|---|---|---|---|---|---|
| 8 | Noise | Colorfit Pro 2 | | | Deep Wine | Full Touch Control | 133399.0 | Noise Colorfit Pro 2 Full Touch Control Smart ... |
| 2 | Fossil | Q Gen 4 Hr FTW6015 | | | Nude | | 134024.0 | Fossil Q Gen 4 Hr FTW6015 Smart Watch, Nude |
| 1 | Pebble | Revo Smartwatch | 1.3" | | Black | [HD Touchscreen, Bluetooth Calling, Rolling UI... | 133813.0 | Pebble Revo Smartwatch, 1.3" HD Touchscreen, B... |
| 10 | Samsung | Watch 4 Classic LTE | 42 mm | | Black | [Smartwatch with Bluetooth Connectivity, IP68 ... | 133872.0 | Samsung Watch 4 Classic LTE 42 mm Smartwatch w... |
| 3 | Amazfit | Huami GTS A1914 | | | Obsidian Black | | 133984.0 | Amazfit Huami GTS A1914 Smart Watch, Obsidian ... |


### What did this cost us?

As per the [pricing](https://openai.com/pricing#language-models) documentation of OpenAI, the calculation below assumes the gpt-3.5-turbo API is charged at \$0.002 for every 1K tokens. At that rate, the \$0.035332 total printed below corresponds to 17,666 tokens across the three batch calls. This is fairly low, but we only processed a sample of 100 items. The cost would ramp up as we process more of the catalog.

In [77]:
response_phone.llm_output

{'token_usage': {'prompt_tokens': 2781,
  'completion_tokens': 1232,
  'total_tokens': 4013},
 'model_name': 'gpt-3.5-turbo'}

In [80]:
all_results = [response_type, response_phone, response_watch]
total_tokens = sum(response.llm_output["token_usage"]["total_tokens"] for response in all_results)
# price assumed here: $0.002 per 1K tokens for gpt-3.5-turbo
cost = total_tokens / 1000 * 0.002
print(f"Total cost: ${cost}")

Total cost: $0.035332
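
To get a rough sense of how this scales, the sample cost can be extrapolated linearly to the full Mobiles & Tablets sub-category. The sketch below is a back-of-the-envelope estimate: `estimate_cost` is a hypothetical helper, the \$0.002 per 1K tokens price is the one used in the cell above, and the projection assumes the full sub-category has the same mix of phones, watches and other items as the sample.

```python
def estimate_cost(total_tokens, price_per_1k=0.002):
    """Cost in USD for total_tokens at an assumed price per 1K tokens."""
    return total_tokens / 1000 * price_per_1k


sample_tokens = 17666      # implied by the $0.035332 printed above at $0.002 per 1K
sample_items = 100         # size of the sample processed in this notebook
subcategory_items = 3810   # rows in Mobiles & Tablets (shape printed earlier)

sample_cost = estimate_cost(sample_tokens)
# linear extrapolation from the sample to the whole sub-category
projected_cost = sample_cost * subcategory_items / sample_items
print(f"sample: ${sample_cost:.4f}, projected: ${projected_cost:.2f}")
```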
