# Process Data

In this notebook, we process the data loaded into the Bronze layer using BigQuery AI/ML functions.

### Setting Up the Environment

In [20]:
import os

PROJECT_ID = "market-mirror-dev"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
BUCKET_NAME = "marke-mirror-dev-data"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
LOCATION = "US"  # @param {type: "string", placeholder: "[your-region]", isTemplate: true}
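if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    # Fall back to the environment; default to "" so that a missing
    # variable does not become the literal string "None".
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")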

if not LOCATION or LOCATION == "[your-region]":
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "US")


In [37]:
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
os.environ['GOOGLE_CLOUD_REGION'] = LOCATION

In [10]:
BQ_BRONZE_DATASET = "APP_MARKET_BRONZE" # @param {type: "string", placeholder: "[bronze-dataset]", isTemplate: true}
BQ_SILVER_DATASET = "APP_MARKET_SILVER" # @param {type: "string", placeholder: "[silver-dataset]", isTemplate: true}
BQ_GOLD_DATASET = "APP_MARKET_GOLD" # @param {type: "string", placeholder: "[gold-dataset]", isTemplate: true}

### Objective

#### Data Definitions



**We have loaded six tables into BigQuery, as follows:**

**Review Tables:**

1. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.multilingual_mobile_app_reviews_2025`

    * This table contains user reviews of various apps across platforms such as Android, iOS, and Windows. It also contains information about the reviewer, such as age, country, and gender.

    * No. of columns: 15
    * No. of records: 2514

2. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.googleplaystore_user_reviews`

    * This table contains reviews from the Google Play Store only, along with sentiment analysis.

    * No. of columns: 5
    * No. of records: 64292


**Product Information Tables:**

3. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.googleplaystore`

    * This table contains app-related information from the Google Play Store, including category, genre, app rating, number of reviews, and number of downloads.

    * No. of columns: 13
    * No. of records: 10841

4. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.AppleStore`

    * This table contains app-related information from the Apple App Store, including category, genre, app rating, size, and price.

    * No. of columns: 17
    * No. of records: 7197

5. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.appleStore_description`

    * This table contains detailed descriptions, in various languages, of the apps available in the Apple App Store.

    * No. of columns: 5
    * No. of records: 7197

6. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.windows_store`

    * This table contains detailed descriptions of the apps available in the Windows Store.

    * No. of columns: 9
    * No. of records: 3960
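
The table IDs above all follow the pattern `{PROJECT_ID}.{BQ_BRONZE_DATASET}.<table>`. A minimal sketch of building them in Python, assuming the project and dataset values set earlier in this notebook; the helper name `bronze_table_id` is hypothetical and not part of the original notebook:

```python
# Values copied from the setup cells of this notebook.
PROJECT_ID = "market-mirror-dev"
BQ_BRONZE_DATASET = "APP_MARKET_BRONZE"

# The six Bronze-layer tables listed above.
BRONZE_TABLES = [
    "multilingual_mobile_app_reviews_2025",
    "googleplaystore_user_reviews",
    "googleplaystore",
    "AppleStore",
    "appleStore_description",
    "windows_store",
]


def bronze_table_id(table_name: str) -> str:
    """Return the fully qualified ID of a Bronze-layer table (hypothetical helper)."""
    return f"{PROJECT_ID}.{BQ_BRONZE_DATASET}.{table_name}"


for name in BRONZE_TABLES:
    print(bronze_table_id(name))
```

Keeping the names in one list makes it easier to loop over all Bronze tables later, for example when checking row counts.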





#### Data Engineering

**Review Tables**


*   In the `multilingual_mobile_app_reviews_2025` table, `app_category` is mapped incorrectly for many apps: the same app is assigned to irrelevant categories (for example, Google Maps appears under Social Networking). We are going to assign the correct category using **BigQuery AI/ML** functions.

*   The reviews in the same `multilingual_mobile_app_reviews_2025` table are written in various languages, so we are going to translate them to English using **BigQuery AI/ML** functions.

* Clean up both review tables and then combine them into a single table in the Silver layer.

* Load only the necessary columns into the Silver layer.
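
The translation step could be expressed with BigQuery's `ML.GENERATE_TEXT` function, which reads a `prompt` column and, with `flatten_json_output` enabled, returns the model's answer in `ml_generate_text_llm_result`. The sketch below only builds the SQL string. It assumes a remote text model named `gemini` exists in the Silver dataset; this notebook does not create that model, and both the model name and the helper `build_translation_sql` are hypothetical.

```python
PROJECT_ID = "market-mirror-dev"
BQ_BRONZE_DATASET = "APP_MARKET_BRONZE"
BQ_SILVER_DATASET = "APP_MARKET_SILVER"

# Assumption: a remote text-generation model with this name already exists.
TEXT_MODEL = f"{PROJECT_ID}.{BQ_SILVER_DATASET}.gemini"


def build_translation_sql(source_table: str, model: str = TEXT_MODEL) -> str:
    """Build a query that asks the model to translate each review to English."""
    return f"""
SELECT
  review_id,
  ml_generate_text_llm_result AS review_text_en
FROM ML.GENERATE_TEXT(
  MODEL `{model}`,
  (
    SELECT
      review_id,
      CONCAT(
        'Translate this app review to English. Return only the translation: ',
        review_text
      ) AS prompt
    FROM `{source_table}`
    WHERE review_text IS NOT NULL
  ),
  STRUCT(TRUE AS flatten_json_output)
)
""".strip()


sql = build_translation_sql(
    f"{PROJECT_ID}.{BQ_BRONZE_DATASET}.multilingual_mobile_app_reviews_2025"
)
print(sql)
```

The same pattern, with a different prompt, could be used to ask the model for a corrected `app_category` given the `app_name`.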


**Product Information Tables**

* Combine the `AppleStore` and `appleStore_description` tables into a single table.

* The descriptions in `appleStore_description` are written in various languages. Convert them into a single language.

* Generate embeddings for the app description columns to enable vector search.

* Create `googleplaystore` and `windows_store` as separate tables with only the necessary columns.
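
The embedding step could be sketched with `ML.GENERATE_EMBEDDING`, which expects the text in a column named `content` and, with `flatten_json_output` enabled, returns the vector in `ml_generate_embedding_result`. The example below only builds the SQL string. It assumes a remote embedding model named `embeddings` exists in the Silver dataset and uses the `Name` and `Description` columns of `windows_store` for illustration; the helper `build_embedding_sql` and the output column name are hypothetical.

```python
def build_embedding_sql(
    model: str, source_table: str, id_column: str, text_column: str
) -> str:
    """Build a query that embeds one text column of a table (hypothetical helper)."""
    return f"""
SELECT
  {id_column},
  ml_generate_embedding_result AS description_embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `{model}`,
  (
    SELECT {id_column}, {text_column} AS content
    FROM `{source_table}`
    WHERE {text_column} IS NOT NULL
  ),
  STRUCT(TRUE AS flatten_json_output)
)
""".strip()


embedding_sql = build_embedding_sql(
    model="APP_MARKET_SILVER.embeddings",
    source_table="APP_MARKET_BRONZE.windows_store",
    id_column="Name",
    text_column="Description",
)
print(embedding_sql)
```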

#### Creating Vertex AI Remote Models

In [38]:
!bq mk --connection --location=$GOOGLE_CLOUD_REGION --project_id=$GOOGLE_CLOUD_PROJECT \
    --connection_type=CLOUD_RESOURCE vertex-remote-models

Connection 468982775008.us.vertex-remote-models successfully created


In [40]:
%%bigquery
CREATE OR REPLACE MODEL `{BQ_SILVER_DATASET}.embeddings`
REMOTE WITH CONNECTION `us.vertex-remote-models`
OPTIONS (ENDPOINT = 'text-embedding-005');

Executing query with job ID: df82fefb-da63-492d-b297-b8018f6f6e6e
Query executing: 0.23s


ERROR:
 400 Invalid dataset ID "{{BQ_SILVER_DATASET}}". Dataset IDs must be alphanumeric (plus underscores and dashes) and must be at most 1024 characters long.; reason: invalid, location: {{BQ_SILVER_DATASET}}.embeddings, message: Invalid dataset ID "{{BQ_SILVER_DATASET}}". Dataset IDs must be alphanumeric (plus underscores and dashes) and must be at most 1024 characters long.

Location: US
Job ID: df82fefb-da63-492d-b297-b8018f6f6e6e
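
The error above shows that the `BQ_SILVER_DATASET` placeholder reached BigQuery unexpanded, so the dataset name was never substituted into the statement. One possible workaround, sketched below, is to format the DDL in Python and submit it through the `google-cloud-bigquery` client. The helper names are hypothetical, and the client call assumes the library is installed and credentials are configured; only the SQL builder runs when the cell is executed.

```python
import re

BQ_SILVER_DATASET = "APP_MARKET_SILVER"
CONNECTION_ID = "us.vertex-remote-models"  # connection created earlier in this notebook
ENDPOINT = "text-embedding-005"


def build_create_model_sql(dataset: str, model_name: str = "embeddings") -> str:
    """Return the CREATE MODEL statement with the dataset name filled in."""
    # Mirror the rule quoted in the error message: alphanumeric characters
    # plus underscores and dashes, at most 1024 characters.
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,1024}", dataset):
        raise ValueError(f"Invalid dataset ID: {dataset!r}")
    return (
        f"CREATE OR REPLACE MODEL `{dataset}.{model_name}`\n"
        f"REMOTE WITH CONNECTION `{CONNECTION_ID}`\n"
        f"OPTIONS (ENDPOINT = '{ENDPOINT}');"
    )


def create_remote_model(project_id: str, dataset: str) -> None:
    """Run the DDL against BigQuery (requires google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # imported lazily so the builder works offline

    client = bigquery.Client(project=project_id)
    client.query(build_create_model_sql(dataset)).result()


create_model_sql = build_create_model_sql(BQ_SILVER_DATASET)
print(create_model_sql)
```

Validating the dataset name before sending the statement catches an unexpanded placeholder locally instead of as a 400 response.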



In [14]:
import bigframes.pandas as bpd

# Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = LOCATION

In [23]:
df = bpd.read_gbq_table('APP_MARKET_BRONZE.multilingual_mobile_app_reviews_2025')





In [24]:
df.count()

review_id            2514
user_id              2514
app_name             2514
app_category         2514
review_text          2455
review_language      2514
rating               2477
review_date          2514
verified_purchase    2514
device_type          2514
num_helpful_votes    2514
user_age             2514
user_country         2473
user_gender          1927
app_version          2484
dtype: Int64
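
Since `count()` reports non-null values, the gap between each count and the 2514 total rows gives the number of missing values per column. A small sketch of that arithmetic, using the counts printed above (the variable names are illustrative):

```python
TOTAL_ROWS = 2514

# Non-null counts copied from the df.count() output above.
non_null_counts = {
    "review_id": 2514,
    "user_id": 2514,
    "app_name": 2514,
    "app_category": 2514,
    "review_text": 2455,
    "review_language": 2514,
    "rating": 2477,
    "review_date": 2514,
    "verified_purchase": 2514,
    "device_type": 2514,
    "num_helpful_votes": 2514,
    "user_age": 2514,
    "user_country": 2473,
    "user_gender": 1927,
    "app_version": 2484,
}

# Keep only the columns that actually have missing values.
missing = {
    column: TOTAL_ROWS - count
    for column, count in non_null_counts.items()
    if count < TOTAL_ROWS
}

# Print the columns with the most missing values first.
for column, n_missing in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column}: {n_missing} missing ({n_missing / TOTAL_ROWS:.1%})")
```

`user_gender` stands out with 587 missing values, while `review_text`, `user_country`, `rating`, and `app_version` are each missing in fewer than 60 rows.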

In [16]:
df.head(5)

| | review_id | user_id | app_name | app_category | review_text | review_language | rating | review_date | verified_purchase | device_type | num_helpful_votes | user_age | user_country | user_gender | app_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 470 | 2402031 | Google Maps | Social Networking | Cum veritatis minima. Cumque consectetur quos ... | hi | 4.7 | 2024-02-21 02:48:54+00:00 | False | Android | 728 | 75.0 | France | Female | 6.9.13-beta |
| 1 | 82 | 9758155 | Udemy | News & Magazines | This app is amazing! Really love the new featu... | th | 2.4 | 2023-12-29 21:36:49+00:00 | False | Windows Phone | 853 | 24.0 | Thailand | Male | 1.8 |
| 2 | 1280 | 5696685 | eBay | Productivity | This app is amazing! Really love the new featu... | hi | 1.8 | 2024-09-07 12:31:24+00:00 | True | iOS | 1028 | 40.0 | Japan | Non-binary | 6.4.22 |
| 3 | 1981 | 3633909 | LinkedIn | Entertainment | Possimus perferendis ducimus adipisci sequi vo... | ja | 2.7 | 2024-03-08 09:07:42+00:00 | True | iOS | 703 | 70.0 | India | Female | 2.4.48 |
| 4 | 1525 | 1725497 | WhatsApp | Social Networking | Placeat quo consectetur. | ar | 4.9 | 2024-05-06 22:33:59+00:00 | False | Windows Phone | 1080 | 60.0 | Russia | Prefer not to say | 11.1 |


In [19]:
df.describe()

| | review_id | user_id | rating | review_date | num_helpful_votes | user_age |
|---|---|---|---|---|---|---|
| count | 2514.0 | 2514.0 | 2477.0 | 2514.0 | 2514.0 | 2514.0 |
| mean | 1257.5 | 5080736.584328 | 3.021034 | | 616.704057 | 44.247812 |
| std | 725.873612 | 2846939.152732 | 1.149955 | | 363.745326 | 18.37229 |
| min | 1.0 | 100599.0 | 1.0 | | 0.0 | 13.0 |
| 25% | 622.0 | 2586379.0 | 2.0 | | 286.0 | 28.0 |
| 50% | 1255.0 | 5050084.0 | 3.0 | | 622.0 | 44.0 |
| 75% | 1888.0 | 7550817.0 | 4.0 | | 921.0 | 60.0 |
| max | 2514.0 | 9995027.0 | 5.0 | | 1249.0 | 75.0 |


In [31]:
df = bpd.read_gbq_table('APP_MARKET_BRONZE.windows_store')

df.head(5)

| | Name | Price | Description | Publisher | Date_of_Release | Category | Size | Age_Rating | Languages |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3D Chess Game | Free | Play against the right A.I. level for you, in ... | A Trillion Games Ltd | 06-02-2014 | Card & board | 45.6 MB | For ages 3 and up | English(UnitedStates) |
| 1 | Edge of Reality: Mark of Fate | ₹ 549.00 | Big Fish Editor's Choice! This title was selec... | Big Fish Games | 18-01-2020 | Action & adventure | 892.71 MB | For ages 16 and up | English(UnitedStates) |
| 2 | Demolition | Free | Explore, admire, then destroy works of archite... | Khor Chin Heong | 11-05-2015 | Simulation | 2.49 GB | For ages 3 and up | English(UnitedStates) |
| 3 | Lonely Mountains: Downhill | Included  +  with Game Pass | Key Features: • Travel to the Lonely Mountains... | Thunderful Publishing | 23-10-2019 | Action & adventure | | For ages 3 and up | English(UnitedStates) |
| 4 | Screen Recorder by Animotica | ₹ 1,099.00 | Screen Recorder by Animotica brings you an eas... | Mixilab LLC | 03-11-2020 | Productivity | 24.02 MB | For ages 3 and up | English(UnitedStates) |


In [32]:
df.count()

Name               3960
Price              3960
Description        3960
Publisher          3960
Date_of_Release    3954
Category           3960
Size               3642
Age_Rating         3960
Languages          3908
dtype: Int64
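
The `Size` column mixes units (`45.6 MB`, `2.49 GB`) and is empty for some rows, so it would need normalizing before it can be compared numerically in the Silver layer. A possible cleanup is sketched below; the helper `size_to_mb` is hypothetical, and the conversion assumes 1 GB = 1024 MB, which the source data does not confirm.

```python
from typing import Optional

# Assumption: binary units (1 GB = 1024 MB, 1 MB = 1024 KB).
_UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def size_to_mb(size: Optional[str]) -> Optional[float]:
    """Convert strings such as '45.6 MB' or '2.49 GB' to megabytes.

    Returns None for missing or unrecognized values.
    """
    if not size:
        return None
    parts = size.split()
    if len(parts) != 2:
        return None
    number, unit = parts[0], parts[1].upper()
    if unit not in _UNIT_TO_MB:
        return None
    try:
        value = float(number.replace(",", ""))
    except ValueError:
        return None
    return value * _UNIT_TO_MB[unit]


# Sample values taken from the df.head(5) output above, plus a missing value.
for raw in ["45.6 MB", "892.71 MB", "2.49 GB", None]:
    print(repr(raw), "->", size_to_mb(raw))
```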