# Explore Raw Data

* In this notebook we explore the data loaded into the BRONZE layer.
* We clean the data, keep only the columns we require, and move the result to the SILVER layer.
* We create a BigQuery remote embedding model and a remote generative model for data enrichment using the GenAI functions in BigQuery.


<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/smvinodkumar910/market-mirror/blob/main/backend/02_explore_data.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fsmvinodkumar910%2Fmarket-mirror%2Frefs%2Fheads%2Fmain%2Fbackend%2F02_explore_data.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/smvinodkumar910/market-mirror/refs/heads/main/backend/02_explore_data.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://github.com/smvinodkumar910/market-mirror/blob/main/backend/02_explore_data.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/475654/github-color.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment.


In [None]:
import sys

if "google.colab" in sys.modules:
    # Support for third party widgets
    from google.colab import auth, output

    auth.authenticate_user()
    output.enable_custom_widget_manager()

### Setting-up Environment

* Change the variables `PROJECT_ID`, `BUCKET_NAME`, and `LOCATION` to match your own project.

In [None]:
import os

PROJECT_ID = "market-mirror-dev"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
BUCKET_NAME = "marke-mirror-dev-data"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
LOCATION = "US"  # @param {type: "string", placeholder: "[your-region]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

if not LOCATION or LOCATION == "[your-region]":
    LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "US")
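
The fallback logic above can also be written as a small helper. This is a minimal sketch and not part of the original notebook: `resolve_setting` is a hypothetical name. Unlike `str(os.environ.get(...))`, which turns a missing variable into the string `"None"`, it raises an error when no value can be resolved.

```python
import os


def resolve_setting(value, placeholder, env_var, default=None):
    """Return `value` unless it is empty or still the template placeholder.

    Falls back to the environment variable `env_var`, then to `default`.
    Raises ValueError if nothing usable is found.
    (Hypothetical helper, shown for illustration only.)
    """
    if value and value != placeholder:
        return value
    resolved = os.environ.get(env_var, default)
    if not resolved:
        raise ValueError(f"Set {env_var} or edit the notebook variable.")
    return resolved


# Example: an explicit value wins over the environment.
project_id = resolve_setting("market-mirror-dev", "[your-project-id]", "GOOGLE_CLOUD_PROJECT")
# Example: an empty value falls back to the default when the variable is unset.
location = resolve_setting("", "[your-region]", "DEMO_UNSET_REGION_VAR", default="US")
print(project_id, location)
```

Failing early here keeps a literal `"None"` from being written into `GOOGLE_CLOUD_PROJECT` in the next cell.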


In [None]:
os.environ['GOOGLE_CLOUD_PROJECT'] = PROJECT_ID
os.environ['GOOGLE_CLOUD_REGION'] = LOCATION

In [None]:
BQ_BRONZE_DATASET = "APP_MARKET_BRONZE" # @param {type: "string", placeholder: "[bronze-dataset]", isTemplate: true}
BQ_SILVER_DATASET = "APP_MARKET_SILVER" # @param {type: "string", placeholder: "[silver-dataset]", isTemplate: true}
BQ_GOLD_DATASET = "APP_MARKET_GOLD" # @param {type: "string", placeholder: "[gold-dataset]", isTemplate: true}

### Objective

#### Data Definitions



**We have loaded 6 tables into BigQuery, as follows:**

**Review Tables:**

1. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.google_play_reviews`

    * This table contains user reviews of various apps from the Google Play Store.

    * No. of columns : 9
    * No. of records : 4888

2. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.googleplaystore_user_reviews`

    * This table contains user reviews of various apps from the Google Play Store, with sentiment information.

    * No. of columns : 5
    * No. of records : 64295


**Product Information Tables:**

3. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.cleanapp`

    * This table holds app information from the Google Play Store, including category, genre, app ratings, number of reviews, and number of downloads.

    * No. of columns : 29
    * No. of records : 11593

4. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.AppleStore`

    * This table holds app information from the Apple Store, including category, genre, app ratings, size, and price.

    * No. of columns : 17
    * No. of records : 7197

5. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.appleStore_description`

    * This table holds detailed descriptions, in various languages, of the apps available in the Apple Store.

    * No. of columns : 5
    * No. of records : 7197

6. `{PROJECT_ID}.{BQ_BRONZE_DATASET}.windows_store`

    * This table holds detailed descriptions of the apps available in the Windows Store.

    * No. of columns : 9
    * No. of records : 3960


#### Data Engineering

**Review Tables**

* We have two review tables, `google_play_reviews` and `googleplaystore_user_reviews`, in the BRONZE layer.

* We explore and clean the data, keeping only specific columns.

* We union both tables into a single table and write it to the SILVER layer.

* We then use BigQuery GenAI capabilities to generate a sentiment for each review.

* We also use GenAI capabilities to generate a response to each user.


**Product Information Tables**

* Combine the `AppleStore` and `appleStore_description` tables into a single table.

* The `appleStore_description` table contains descriptions in various languages. Translate them into a single language.

* Generate embeddings for the app description columns to enable vector search.

* Write `cleanapp` and `windows_store` to separate tables with only the necessary columns.
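
As a rough sketch of the embedding step, the snippet below builds a query that calls BigQuery's `ML.GENERATE_EMBEDDING` function, which expects the text to embed in a column named `content`. The helper name `build_embedding_query` is hypothetical, and pointing it at the `T_APPLE_APP_DETAILS` table and its `app_desc` column is an assumption based on the names used in this notebook. It only builds the SQL string and does not run it.

```python
def build_embedding_query(project_id, dataset, table, text_column, model_name="embeddings"):
    """Build a query that embeds one text column with a BigQuery remote model.

    Hypothetical helper for illustration; the model is assumed to be the
    remote `embeddings` model created in the same dataset.
    """
    return f"""
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `{project_id}.{dataset}.{model_name}`,
  (SELECT *, {text_column} AS content FROM `{project_id}.{dataset}.{table}`),
  STRUCT(TRUE AS flatten_json_output)
)
""".strip()


embedding_query = build_embedding_query(
    project_id="market-mirror-dev",
    dataset="APP_MARKET_SILVER",
    table="T_APPLE_APP_DETAILS",
    text_column="app_desc",
)
print(embedding_query)
```

The resulting string could be run with the `%%bigquery` magic or passed to `bpd.read_gbq`, once the remote model exists.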

#### Creating Vertex AI Remote Models

* In this section we create a CONNECTION object in BigQuery and then create the REMOTE MODELS.

* The command below creates a BigQuery connection to Vertex AI models, named `vertex-remote-models`.

In [None]:
!bq mk --connection --location=$GOOGLE_CLOUD_REGION --project_id=$GOOGLE_CLOUD_PROJECT \
    --connection_type=CLOUD_RESOURCE vertex-remote-models

* The steps below create two remote models in BigQuery:

1. A remote model named `embeddings`, using the `text-embedding-005` model available in Vertex AI.
2. A remote model named `gemini`, using the `gemini-2.0-flash` model available in Vertex AI.
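
The connection is referenced as `<location>.<connection name>`, so the location prefix has to match the region used when the connection was created. A minimal sketch, assuming the same names as in this notebook, that derives the prefix from `LOCATION` instead of hard-coding it (`build_remote_model_ddl` is a hypothetical helper):

```python
def build_remote_model_ddl(project_id, dataset, model_name, endpoint, location,
                           connection_name="vertex-remote-models"):
    """Build a CREATE MODEL statement for a BigQuery remote model.

    The connection reference is derived from `location`, lower-cased,
    so "US" becomes `us.vertex-remote-models`.
    """
    connection_id = f"{location.lower()}.{connection_name}"
    return (
        f"CREATE OR REPLACE MODEL `{project_id}.{dataset}.{model_name}`\n"
        f"REMOTE WITH CONNECTION `{connection_id}`\n"
        f"OPTIONS (ENDPOINT = '{endpoint}');"
    )


embed_ddl = build_remote_model_ddl(
    "market-mirror-dev", "APP_MARKET_SILVER", "embeddings", "text-embedding-005", "US"
)
gemini_ddl = build_remote_model_ddl(
    "market-mirror-dev", "APP_MARKET_SILVER", "gemini", "gemini-2.0-flash", "US"
)
print(embed_ddl)
print(gemini_ddl)
```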


In [None]:
create_embed_model = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.{BQ_SILVER_DATASET}.embeddings`
REMOTE WITH CONNECTION `us.vertex-remote-models`
OPTIONS (ENDPOINT = 'text-embedding-005');
"""

create_gen_model = f"""
CREATE OR REPLACE MODEL `{PROJECT_ID}.{BQ_SILVER_DATASET}.gemini`
REMOTE WITH CONNECTION `us.vertex-remote-models`
OPTIONS (ENDPOINT = 'gemini-2.0-flash');
"""


In [None]:
# @title Error Handling Tip

'''
If you get a service account privilege error while running the cells below
that create the remote models, run the command below after replacing
`SERVICE_ACCOUNT_EMAIL` with the service account shown in the error.
'''
!gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/aiplatform.user"
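
The service account can also be read from the connection itself rather than from the error message. The sketch below is hedged: it assumes that `bq show --connection --format=json <project>.<location>.vertex-remote-models` returns JSON containing a `serviceAccountId` field somewhere in the payload. The sample payload and the helper `find_service_account` are made up for illustration, so check the real output in your project.

```python
import json


def find_service_account(node):
    """Search parsed JSON recursively for a `serviceAccountId` value."""
    if isinstance(node, dict):
        if "serviceAccountId" in node:
            return node["serviceAccountId"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_service_account(child)
        if found:
            return found
    return None


# Hypothetical sample of the JSON printed by `bq show --connection --format=json ...`.
sample_output = json.dumps({
    "name": "projects/123/locations/us/connections/vertex-remote-models",
    "cloudResource": {
        "serviceAccountId": "example-sa@example-project.iam.gserviceaccount.com"
    },
})

service_account = find_service_account(json.loads(sample_output))
iam_member = f"serviceAccount:{service_account}"
print(iam_member)
```

The printed value is what would go into the `--member` flag of the `gcloud` command above.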


In [None]:
%%bigquery
$create_embed_model

In [None]:
%%bigquery
$create_gen_model

#### Review Table Data Processing

* In this section, we:

    1. Read the two review tables from the BRONZE layer.
    2. Clean them and keep only the required columns.
    3. Union both tables.
    4. Write the result to the SILVER layer so that all reviews live in a single table.
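
The union step can be sketched on toy data. This example uses plain pandas as a stand-in for BigQuery DataFrames and invented rows, so it is an illustration of the idea rather than the notebook's actual data: when two frames with different columns are concatenated, the result has the union of the columns and missing values where a source table lacks a column.

```python
import pandas as pd

# Toy stand-in for the subset of `google_play_reviews` (has a rating, no sentiment).
reviews_a = pd.DataFrame({
    "id": [0, 1],
    "app_name": ["App A", "App B"],
    "app_genre": ["Tools", "Games"],
    "review_text": ["Works well", "Crashes often"],
    "rating": [5, 2],
})

# Toy stand-in for the subset of `googleplaystore_user_reviews` (has a sentiment, no rating).
reviews_b = pd.DataFrame({
    "app_name": ["App C"],
    "review_text": ["Love it"],
    "sentiment": ["Positive"],
})

# Row-wise union: columns missing from one input are filled with NaN.
combined = pd.concat([reviews_a, reviews_b], axis=0, ignore_index=True)
null_counts = combined.isna().sum()

print(combined.columns.tolist())
print(null_counts.to_dict())
```

The missing `sentiment` values for the first table are what the later GenAI enrichment step is meant to fill in.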


In [None]:
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
from bigframes.ml import llm

# Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = LOCATION

##### Exploring app review table `google_play_reviews`

In [None]:
#read data to dataframe
review_df1 = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.google_play_reviews')

In [None]:
review_df1.head()

In [None]:
review_df1.columns

In [None]:
#renaming column 'Unnamed: 0' to 'id'
review_df1 = review_df1.rename(columns={'Unnamed: 0':'id'})

In [None]:
# Keep only the necessary columns
review_df1_subset = review_df1[['id', 'app_name','app_genre','review_text','rating']]

In [None]:
review_df1_subset.head(5)

In [None]:
review_df1_subset.info()

##### Exploring app review table `googleplaystore_user_reviews`

In [None]:
review_df2 = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.googleplaystore_user_reviews')

In [None]:
review_df2.head(5)

In [None]:
review_df2.columns

In [None]:
# Keep only the necessary columns

review_df2_subset = review_df2[['App','Translated_Review','Sentiment']]

In [None]:
review_df2_subset.info()

In [None]:
review_df2_subset = review_df2_subset.rename(columns={'App':'app_name','Translated_Review':'review_text','Sentiment':'sentiment'})

In [None]:
review_df2_subset.head(5)

In [None]:
# Concatenate both review tables.
review_df = bpd.concat([review_df1_subset,review_df2_subset],axis=0)

In [None]:
review_df.head()

In [None]:
review_df.count()

In [None]:
review_df.isna().sum()
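
The outputs of `count()` and `isna().sum()` can be sanity-checked against the record counts quoted in the Data Definitions section. This is a small arithmetic sketch with a hypothetical helper, `expected_minimum_nulls`. It gives lower bounds only, because the source tables may contain additional nulls of their own.

```python
# Record counts quoted in the Data Definitions section.
BRONZE_ROW_COUNTS = {
    "google_play_reviews": 4888,
    "googleplaystore_user_reviews": 64295,
}

# Columns kept from only one of the two review tables.
COLUMNS_ONLY_IN = {
    "google_play_reviews": ["id", "app_genre", "rating"],
    "googleplaystore_user_reviews": ["sentiment"],
}


def expected_minimum_nulls(row_counts, columns_only_in):
    """Lower bound on nulls per column after the union.

    A column present in only one table is null for every row
    contributed by the other table(s).
    """
    total_rows = sum(row_counts.values())
    return {
        column: total_rows - row_counts[table]
        for table, columns in columns_only_in.items()
        for column in columns
    }


expected_total = sum(BRONZE_ROW_COUNTS.values())
expected_nulls = expected_minimum_nulls(BRONZE_ROW_COUNTS, COLUMNS_ONLY_IN)
print(expected_total)
print(expected_nulls)
```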

In [None]:
# Writing the table to silver layer
review_df.to_gbq(destination_table=f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_APP_REVIEWS',if_exists='replace')

#### Product Tables Data Processing

* In this section we read the product description tables from the three platforms:

    1. Google   - table name: cleanapp
    2. Apple    - table names: AppleStore, appleStore_description
    3. Windows  - table name: windows_store

* We clean the data, keep only the required columns, and write the tables to the SILVER layer.

##### Exploring Google Apps table `cleanapp`

In [None]:
google_apps_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.cleanapp')

In [None]:
google_apps_df.head(5)

In [None]:
google_apps_df.columns

In [None]:
google_apps_df.count()

In [None]:
# keep only specific subset of columns
google_apps_df_subset = google_apps_df[['title','description','summary','ratings','reviews','price','free','genre']]

In [None]:
google_apps_df_subset.head(5)

In [None]:
# write to SILVER Layer
google_apps_df_subset.to_gbq(destination_table=f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_GOOGLE_APP_DETAILS',if_exists='replace')

##### Exploring Windows Apps table `windows_store`

In [None]:
windows_app_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.windows_store')

In [None]:
windows_app_df.head(5)

In [None]:
#keep only a subset of columns
windows_app_df_subset = windows_app_df[['Name','Price','Description','Category','Size']]

In [None]:
windows_app_df_subset.count()

In [None]:
#write to BQ SILVER Layer
windows_app_df_subset.to_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_WINDOWS_APP_DETAILS',if_exists='replace')

##### Exploring Apple Apps tables `AppleStore` and `appleStore_description`

In [None]:
#Read data
apple_app_df = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.AppleStore')

In [None]:
apple_app_df.head(5)

In [None]:
#Keep only a subset of columns
apple_app_df_subset = apple_app_df[['id','track_name','size_bytes','currency','price','user_rating','prime_genre']]

In [None]:
apple_app_df_subset.head(5)

In [None]:
apple_app_df_subset.count()

In [None]:
# Read the 2nd Apple product description table 
apple_app_df2 = bpd.read_gbq(f'{PROJECT_ID}.{BQ_BRONZE_DATASET}.appleStore_description')

In [None]:
apple_app_df2.head()

In [None]:
apple_app_df2_subset = apple_app_df2[['id','app_desc']]

In [None]:
apple_app_df2_subset.count()

In [None]:
#join both the tables
apple_app_df_subset = bpd.merge(apple_app_df_subset,apple_app_df2_subset,left_on='id',right_on='id',how='left')

In [None]:
apple_app_df_subset.head(5)
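
Before writing the joined table, it can help to confirm that the join neither duplicated rows nor left apps without a description. The sketch below uses plain pandas and invented rows. The `validate` and `indicator` arguments are pandas features, and whether BigQuery DataFrames supports them is not confirmed here, so treat this as a local illustration.

```python
import pandas as pd

# Toy stand-ins for the two Apple tables, keyed on `id`.
apps = pd.DataFrame({"id": [1, 2, 3], "track_name": ["App A", "App B", "App C"]})
descriptions = pd.DataFrame({"id": [1, 2], "app_desc": ["About A", "About B"]})

# validate="one_to_one" raises if `id` is duplicated on either side;
# indicator=True adds a `_merge` column marking rows without a match.
merged = pd.merge(
    apps, descriptions, on="id", how="left", validate="one_to_one", indicator=True
)

ids_without_description = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(len(merged), ids_without_description)
```

With a left join on a unique key, the row count of the result should equal the row count of the left table.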

In [None]:
#finally write the table to BQ Silver layer
apple_app_df_subset.to_gbq(f'{PROJECT_ID}.{BQ_SILVER_DATASET}.T_APPLE_APP_DETAILS',if_exists='replace')