## Data Processing 2: Standardizing Product Data with Generative AI


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/smvinodkumar910/market-mirror/blob/main/backend/04_processing_data_02.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fsmvinodkumar910%2Fmarket-mirror%2Frefs%2Fheads%2Fmain%2Fbackend%2F04_processing_data_02.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/smvinodkumar910/market-mirror/refs/heads/main/backend/04_processing_data_02.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://github.com/smvinodkumar910/market-mirror/blob/main/backend/04_processing_data_02.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/475654/github-color.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment.

In [None]:
import sys

if "google.colab" in sys.modules:
    # Support for third party widgets
    from google.colab import auth, output

    auth.authenticate_user()
    output.enable_custom_widget_manager()

### Setting Up the Environment

* Change the variables `PROJECT_ID`, `BUCKET_NAME`, and `LOCATION` to match your own project.

In [91]:
import os

PROJECT_ID = "market-mirror-dev"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
BUCKET_NAME = "marke-mirror-dev-data"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
LOCATION = "US"  # @param {type: "string", placeholder: "[your-region]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

if not LOCATION or LOCATION == "[your-region]":
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "US")


In [92]:
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
os.environ['GOOGLE_CLOUD_REGION'] = LOCATION

In [93]:
BQ_BRONZE_DATASET = "APP_MARKET_BRONZE" # @param {type: "string", placeholder: "[bronze-dataset]", isTemplate: true}
BQ_SILVER_DATASET = "APP_MARKET_SILVER" # @param {type: "string", placeholder: "[silver-dataset]", isTemplate: true}
BQ_GOLD_DATASET = "APP_MARKET_GOLD" # @param {type: "string", placeholder: "[gold-dataset]", isTemplate: true}

In [94]:
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
from bigframes.ml import llm

bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = LOCATION

In [None]:
# Read all app detail datasets (google, windows, apple)
google_app_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_GOOGLE_APP_DETAILS')
windows_app_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_WINDOWS_APP_DETAILS')
apple_app_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_APPLE_APP_DETAILS')

incompatibilies with previous reads of this table. To read the latest
version, set `use_cache=False` or close the current session with
Session.close() or bigframes.pandas.close_session().
  return method(*args, **kwargs)


In [None]:
# View all the columns in each dataset
print(google_app_df.columns.tolist())
print(windows_app_df.columns.tolist())
print(apple_app_df.columns.tolist())

['title', 'description', 'summary', 'ratings', 'reviews', 'price', 'free', 'genre']
['Name', 'Price', 'Description', 'Category', 'Size']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'user_rating', 'prime_genre', 'app_desc']


* **From each app details table, we keep only the required columns:**

1. app_name
2. app_genre
3. app_description
4. app_price
5. free_flag
6. app_rating (`windows_app_df` has no rating column)
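The mapping from each platform's raw columns to this common schema can be summarized in one place. The following is a minimal sketch in plain Python that uses the column names printed above. `TARGET_COLUMNS`, `RENAME_MAPS`, and `missing_after_rename` are hypothetical names introduced for illustration and are not part of the notebook.

```python
# Target schema shared by all three platforms (taken from the list above).
TARGET_COLUMNS = ["app_name", "app_genre", "app_description",
                  "app_price", "free_flag", "app_rating"]

# Raw column -> common column, per platform (hypothetical helper structure).
RENAME_MAPS = {
    "google": {"title": "app_name", "genre": "app_genre",
               "description": "app_description", "price": "app_price",
               "free": "free_flag", "ratings": "app_rating"},
    "windows": {"Name": "app_name", "Category": "app_genre",
                "Description": "app_description", "Price": "app_price"},
    "apple": {"track_name": "app_name", "prime_genre": "app_genre",
              "app_desc": "app_description", "price": "app_price",
              "user_rating": "app_rating"},
}


def missing_after_rename(platform):
    """Return the target columns that renaming alone does not provide."""
    provided = set(RENAME_MAPS[platform].values())
    return [c for c in TARGET_COLUMNS if c not in provided]


for name in RENAME_MAPS:
    print(name, missing_after_rename(name))
```

Columns reported as missing cannot be obtained by renaming and have to be derived, which is the case for `free_flag` on Windows and Apple and for `app_rating` on Windows.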


### Processing `T_GOOGLE_APP_DETAILS`

In [None]:
## Processing google_app_df

google_app_df = google_app_df.rename(columns={'title':'app_name',
                              'genre':'app_genre',
                              'description':'app_description',
                              'price':'app_price',
                              'free':'free_flag',
                              'ratings':'app_rating'})

# Keep only the common-schema columns, in a fixed order
google_app_df = google_app_df[['app_name','app_genre','app_description','app_price','free_flag','app_rating']]

In [98]:
google_app_df.head()

Unnamed: 0,app_name,app_genre,app_description,app_price,free_flag,app_rating
0,ادعية و اذكار المسلم بالصوت,Education,يضم البرنامج حصن المسلم من الأذكار و الأدعية ا...,0.0,True,13660.0
1,Block Puzzle 99: Fish Go,Puzzle,🐠<b>Block Puzzle 99: Fish Go</b>🐠 is a wonderf...,0.0,True,226.0
2,Speech Blubs: Language Therapy,Parenting,Do you need more proof? Check out the featured...,0.0,True,7496.0
3,Kids flashcard game,Education,Application created for preschool kids to lear...,0.0,True,1433.0
4,Magnet Balls 2: Physics Puzzle,Puzzle,The newest game in the popular Magnet Balls se...,0.0,True,885.0


In [99]:
google_app_df.info()

<class 'bigframes.dataframe.DataFrame'>
Index: 11593 entries, 0 to 11592
Data columns (total 6 columns):
  #  Column           Non-Null Count    Dtype
---  ---------------  ----------------  -------
  0  app_name         11593 non-null    string
  1  app_genre        11593 non-null    string
  2  app_description  11593 non-null    string
  3  app_price        11593 non-null    Float64
  4  free_flag        11593 non-null    boolean
  5  app_rating       11593 non-null    Float64
dtypes: Float64(2), boolean(1), string(3)
memory usage: 568057 bytes


### Processing `T_WINDOWS_APP_DETAILS`

In [None]:
## Processing windows_app_df

windows_app_df = windows_app_df.rename(columns={'Name':'app_name',
                              'Category':'app_genre',
                              'Description':'app_description',
                              'Price':'app_price'
                              })
# The Windows dataset has no rating column, so use 0 as a placeholder.
windows_app_df['app_rating'] = 0

# Updating the free_flag column based on value in app_price field.
windows_app_df['free_flag'] = (windows_app_df['app_price'] == 'Free')


#### Price Extraction Using `AI.GENERATE_INT`

* In `windows_app_df`, the `app_price` column is a string that contains the price together with discount details as free text.
* To extract the price as an integer (`Int64`) field, we use the `AI.GENERATE_INT` function in BigQuery.
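For the simplest price strings, a deterministic parser can serve as a cross-check on the LLM output. The sketch below is not what this notebook does; it is a regex baseline offered under the assumption that simple prices look like the `₹ 379.00` value seen in this dataset. The name `parse_simple_price` and the discount-style example string are hypothetical.

```python
import re

# Matches text that is only an optional currency symbol followed by a number,
# e.g. "₹ 379.00" or "$1,299.50". Anything more complex returns None.
_SIMPLE_PRICE = re.compile(r"^\s*[^\d\s]?\s*(\d[\d,]*(?:\.\d+)?)\s*$")


def parse_simple_price(text):
    """Return the price as a float, 0.0 for 'Free', or None if not a simple price."""
    if text.strip().lower() == "free":
        return 0.0
    match = _SIMPLE_PRICE.match(text)
    if match is None:
        return None
    # Remove thousands separators before converting.
    return float(match.group(1).replace(",", ""))


print(parse_simple_price("₹ 379.00"))
print(parse_simple_price("Free"))
print(parse_simple_price("₹ 379.00 was ₹ 500.00"))  # hypothetical discount text
```

Rows for which the parser returns `None` are the ones that genuinely need the LLM. Rows where both methods produce a value can be compared to spot extraction errors.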

In [None]:
# Filter the rows where free_flag is False, i.e. where the price text should contain a number,
# and keep only the app_name and app_price fields as input to the LLM.
windows_app_price_extract = windows_app_df.loc[windows_app_df['free_flag'] == False, ['app_name','app_price']]

In [None]:
# Writing this subset of data to BQ as temporary storage
windows_app_price_extract.to_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_WINDOWS_APP_PRICE_EXTRACT', if_exists='replace')

* The SQL below calls the `AI.GENERATE_INT` function in BigQuery, passing `app_name` and the `app_price` text as input to the LLM to extract the price.
* The LLM output is stored back in the table `T_WINDOWS_APP_PRICE_EXTRACT`.

**Make sure to replace `PROJECT_NAME` with your project ID.**

In [None]:
%%bigquery

create or replace table `PROJECT_NAME.APP_MARKET_SILVER.T_WINDOWS_APP_PRICE_EXTRACT` as
select app_name, app_price,
AI.GENERATE_INT(
  prompt => concat(
    'I want you to understand the given app_price text which is related to the pricing of the windows application and ',
    'answer what is the price of the app. ',
    'If only the price is mentioned with currency, just return the price value. ',
    'app_name: ', app_name, ' app_price: ', app_price),
  connection_id => 'us.vertex-remote-models',
  endpoint => 'gemini-2.5-flash'
).result
from `PROJECT_NAME.APP_MARKET_SILVER.T_WINDOWS_APP_PRICE_EXTRACT`;
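To avoid editing `PROJECT_NAME` by hand, the statement can be built from the variables defined earlier. This is a minimal sketch, assuming the same connection and endpoint as in the SQL cell. `build_price_extract_sql` is a hypothetical helper and not part of the notebook.

```python
def build_price_extract_sql(project_id, dataset,
                            table="T_WINDOWS_APP_PRICE_EXTRACT"):
    """Return the CREATE OR REPLACE statement with the table name filled in."""
    full_name = f"{project_id}.{dataset}.{table}"
    return f"""
create or replace table `{full_name}` as
select app_name, app_price,
AI.GENERATE_INT(
  prompt => concat(
    'I want you to understand the given app_price text which is related to the pricing of the windows application and ',
    'answer what is the price of the app. ',
    'If only the price is mentioned with currency, just return the price value. ',
    'app_name: ', app_name, ' app_price: ', app_price),
  connection_id => 'us.vertex-remote-models',
  endpoint => 'gemini-2.5-flash'
).result
from `{full_name}`;
"""


# Example values taken from the configuration cells of this notebook.
sql = build_price_extract_sql("market-mirror-dev", "APP_MARKET_SILVER")
print(sql)
```

The resulting string could then be run with a BigQuery client, for example `google.cloud.bigquery.Client(project=PROJECT_ID).query(sql).result()`, assuming the `google-cloud-bigquery` package is installed and the session is authenticated.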

In [None]:
# Read the output data from BQ
windows_app_price_extract = bpd.read_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_WINDOWS_APP_PRICE_EXTRACT')



In [None]:
# Drop the app_price text column
windows_app_price_extract = windows_app_price_extract.drop(columns=['app_price'])



In [None]:
# Join the LLM-generated price info with windows_app_df
windows_app_with_price_extract = bpd.merge( windows_app_df , windows_app_price_extract, on='app_name',how='left')

In [None]:
# view some sample data
windows_app_with_price_extract.head(5)

Unnamed: 0,app_name,app_price,app_description,app_genre,Size,app_rating,free_flag,result
0,HyperX NGENUITY (Beta),Free,"HyperX NGENUITY is powerful, intuitive softwar...",Utilities & tools,151.3 MB,0,True,
1,Real Pool 3D,Free,The world`s best snooker and 8 ball pool game ...,Action & adventure,91.72 MB,0,True,
2,Pick Me Up - Taxi Driver,Free,Have you ever wanted to be a ride share driver...,Action & adventure,123.9 MB,0,True,
3,Mythic Wonders: The Philosopher's Stone (Full),₹ 379.00,FANTASTIC HIDDEN OBJECT PUZZLE ADVENTURE GAME ...,Puzzle & trivia,703.64 MB,0,False,379.0
4,"Warhammer 40,000: Freeblade",Free,Warhammer 40K: Freeblade is the Space Marine G...,Action & adventure,1.35 GB,0,True,


In [None]:
# In the result above, the price has been extracted correctly into the result column.

# Set result to 0 where it is null.
windows_app_with_price_extract['result'] = windows_app_with_price_extract['result'].fillna(0)
# Copy the LLM-extracted price into the app_price column
windows_app_with_price_extract['app_price'] = windows_app_with_price_extract['result']
# Drop unnecessary columns
windows_app_with_price_extract = windows_app_with_price_extract.drop(columns=['result','Size'])


In [None]:
# view some samples
windows_app_with_price_extract.head()

Unnamed: 0,app_name,app_price,app_description,app_genre,app_rating,free_flag
0,HyperX NGENUITY (Beta),0,"HyperX NGENUITY is powerful, intuitive softwar...",Utilities & tools,0,True
1,Real Pool 3D,0,The world`s best snooker and 8 ball pool game ...,Action & adventure,0,True
2,Pick Me Up - Taxi Driver,0,Have you ever wanted to be a ride share driver...,Action & adventure,0,True
3,Mythic Wonders: The Philosopher's Stone (Full),379,FANTASTIC HIDDEN OBJECT PUZZLE ADVENTURE GAME ...,Puzzle & trivia,0,False
4,"Warhammer 40,000: Freeblade",0,Warhammer 40K: Freeblade is the Space Marine G...,Action & adventure,0,True


In [None]:
# Reorder the columns as required.
windows_app_with_price_extract = windows_app_with_price_extract[['app_name','app_genre','app_description','app_price','free_flag','app_rating']]
windows_app_with_price_extract.head()



Unnamed: 0,app_name,app_genre,app_description,app_price,free_flag,app_rating
0,HyperX NGENUITY (Beta),Utilities & tools,"HyperX NGENUITY is powerful, intuitive softwar...",0,True,0
1,Real Pool 3D,Action & adventure,The world`s best snooker and 8 ball pool game ...,0,True,0
2,Pick Me Up - Taxi Driver,Action & adventure,Have you ever wanted to be a ride share driver...,0,True,0
3,Mythic Wonders: The Philosopher's Stone (Full),Puzzle & trivia,FANTASTIC HIDDEN OBJECT PUZZLE ADVENTURE GAME ...,379,False,0
4,"Warhammer 40,000: Freeblade",Action & adventure,Warhammer 40K: Freeblade is the Space Marine G...,0,True,0


In [None]:
# final table
windows_app_with_price_extract.info()

<class 'bigframes.dataframe.DataFrame'>
Index: 3960 entries, 0 to 3959
Data columns (total 6 columns):
  #  Column           Non-Null Count    Dtype
---  ---------------  ----------------  -------
  0  app_name         3960 non-null     string
  1  app_genre        3960 non-null     string
  2  app_description  3960 non-null     string
  3  app_price        3960 non-null     Int64
  4  free_flag        3960 non-null     boolean
  5  app_rating       3960 non-null     Int64
dtypes: Int64(2), boolean(1), string(3)
memory usage: 194040 bytes


### Processing `T_APPLE_APP_DETAILS`

In [113]:
## Processing Apple

print(apple_app_df.columns.tolist())

['id', 'track_name', 'size_bytes', 'currency', 'price', 'user_rating', 'prime_genre', 'app_desc']


In [None]:
# see some sample data
apple_app_df.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,user_rating,prime_genre,app_desc
0,992762497,记账·圈子账本(专业版)—可共享的全能记帐本软件,77918208,USD,0.99,4.5,Finance,◆圈子账本，年轻人都在用的记账APP ◆AppStore首页推荐，千万用户的选择 ◆腾讯、...
1,1080243808,Arcadecraft,150849536,USD,0.99,3.5,Games,The hit console game has been refreshed and is...
2,989254177,Good Morning Alarm Clock - Sleep Cycle Tracker,59503616,USD,3.99,4.5,Health & Fitness,Wake up feeling refreshed and ready for the da...
3,956373528,The Lost Heir: The Fall of Daria,47482880,USD,2.99,5.0,Games,Take back the throne that was rightfully yours...
4,646100661,"AOL: News, Email, Weather & Video",85273600,USD,0.0,4.0,News,"Stay informed, entertained and in touch with A..."


In [None]:
# Rename the columns to match the common schema
apple_app_df = apple_app_df.rename(columns={'track_name':'app_name',
                              'prime_genre':'app_genre',
                              'app_desc':'app_description',
                              'price':'app_price',
                              'user_rating':'app_rating'})

In [None]:
# add free_flag column
apple_app_df['free_flag'] = (apple_app_df['app_price'] == 0)


In [None]:
# re-order the columns
apple_app_df = apple_app_df[['app_name','app_genre','app_description','app_price','free_flag','app_rating']]

In [None]:
# print all the columns available in the 3 platforms
print(google_app_df.columns.tolist())
print(windows_app_with_price_extract.columns.tolist())
print(apple_app_df.columns.tolist())

['app_name', 'app_genre', 'app_description', 'app_price', 'free_flag', 'app_rating']
['app_name', 'app_genre', 'app_description', 'app_price', 'free_flag', 'app_rating']
['app_name', 'app_genre', 'app_description', 'app_price', 'free_flag', 'app_rating']


* All three platform datasets now share the same schema, which makes it easier to summarize the data in the GOLD layer.

In [None]:
# Write each final DataFrame to the silver dataset with the suffix _CLEANED
google_app_df.to_gbq(f"{PROJECT_ID}.{BQ_SILVER_DATASET}.T_GOOGLE_APP_DETAIL_CLEANED", if_exists='replace')
windows_app_with_price_extract.to_gbq(f"{PROJECT_ID}.{BQ_SILVER_DATASET}.T_WINDOWS_APP_DETAIL_CLEANED", if_exists='replace')
apple_app_df.to_gbq(f"{PROJECT_ID}.{BQ_SILVER_DATASET}.T_APPLE_APP_DETAIL_CLEANED", if_exists='replace')

'market-mirror-dev.APP_MARKET_SILVER.T_APPLE_APP_DETAIL_CLEANED'

### Migrating Data to GOLD Layer

In [None]:
# Write each final DataFrame to the gold dataset with the suffix _FINAL
google_app_df.to_gbq(f"{PROJECT_ID}.{BQ_GOLD_DATASET}.T_GOOGLE_APP_DETAIL_FINAL", if_exists='replace')
windows_app_with_price_extract.to_gbq(f"{PROJECT_ID}.{BQ_GOLD_DATASET}.T_WINDOWS_APP_DETAIL_FINAL", if_exists='replace')
apple_app_df.to_gbq(f"{PROJECT_ID}.{BQ_GOLD_DATASET}.T_APPLE_APP_DETAIL_FINAL", if_exists='replace')

'market-mirror-dev.APP_MARKET_GOLD.T_APPLE_APP_DETAIL_FINAL'

* In some records where the LLM could not find price information, it returned -1. These values need to be set to 0, which is done in the step below. Note that the statement updates only the silver table.

**Make sure to replace `PROJECT_NAME` with your project ID.**

In [None]:
%%bigquery

UPDATE `PROJECT_NAME.APP_MARKET_SILVER.T_WINDOWS_APP_DETAIL_CLEANED` SET app_price=0 where app_price=-1;
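The two clean-up rules applied to the extracted price in this notebook (a null becomes 0 through `fillna`, and -1 becomes 0 through the `UPDATE`) can be expressed as a single function. The following is a minimal sketch in plain Python. `normalize_price` is a hypothetical name, and the sample values are illustrative.

```python
def normalize_price(value):
    """Map a missing price (None) or the -1 sentinel to 0; keep other values."""
    if value is None or value == -1:
        return 0
    return value


# Illustrative inputs: a real price, a missing value, the sentinel, and a free app.
prices = [379, None, -1, 0]
print([normalize_price(p) for p in prices])
```

Applying a rule like this to the DataFrame before the `to_gbq` calls would let both the silver and gold tables receive the corrected values. That ordering is a suggestion and not what the notebook currently does.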