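## Environment Initialization
### This section imports the required libraries, initializes the BigQuery client, and sets the global variables for the analysis.

**Important: Enter the Google Cloud project ID for your current lab in the cell below. Every resource in the lab environment is accessed through this value.**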

In [None]:
# User: Please enter your Project ID in this cell.
PROJECT_ID = 'your-gcp-project-id'  # <-- ENTER YOUR ACTUAL PROJECT ID HERE!

# Verify that PROJECT_ID is set and is no longer the placeholder. If not, raise an error.
if not PROJECT_ID or PROJECT_ID == 'your-gcp-project-id':
    raise ValueError("ERROR: PROJECT_ID is not set. Please enter your Project ID above.")

print(f"Project ID set to: {PROJECT_ID}")
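
A check for a non-empty value does not catch a mistyped ID. As a rough sketch, a stricter check could also validate the shape of the ID before any client is created. The helper name `looks_like_project_id` is hypothetical and not part of the lab; the pattern encodes the standard Google Cloud project ID rules (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen) and does not cover legacy domain-scoped IDs.

```python
import re

# Hypothetical helper, not part of the lab: checks only the shape of the ID,
# not whether the project exists or is accessible.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")

def looks_like_project_id(project_id: str) -> bool:
    """Return True if the string has the shape of a Google Cloud project ID."""
    return bool(_PROJECT_ID_PATTERN.fullmatch(project_id))
```

Note that the placeholder `'your-gcp-project-id'` itself has a valid shape, so a shape check complements the checks in the cell rather than replacing them.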

Now run this cell to initialize the environment. It imports all required libraries, sets up the connection to BigQuery, and defines the key variables used throughout the lab (for example, the GCS bucket path).

In [None]:
# This cell imports necessary libraries, initializes the BigQuery client,
# and sets up global variables for the analysis.
from google.cloud import bigquery
import pandas as pd
from IPython.display import HTML, display, Image, Video
from google.cloud import storage
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure PROJECT_ID has been defined in the cell above.
if 'PROJECT_ID' not in globals() or not PROJECT_ID:
    raise ValueError("ERROR: PROJECT_ID is not set. Please run the Project ID cell above first.")

# IMPORTANT: Verify that PROJECT_ID matches your lab's project ID.
DATASET_ID = 'cymbal'
REGION = 'us-central1'

# Pin the client to the lab project and region.
client = bigquery.Client(project=PROJECT_ID, location=REGION)
CONNECTION_ID_FOR_EXTERNAL_TABLE = f'{REGION}.gemini_conn'
GEMINI_MODEL_NAME = f'{PROJECT_ID}.{DATASET_ID}.gemini_flash_model'
GCS_BUCKET_URI = f'gs://{PROJECT_ID}-bucket'
CSV_GCS_URI = f'{GCS_BUCKET_URI}/review/customer_reviews.csv'
IMAGES_GCS_URI_PATTERN = f'{GCS_BUCKET_URI}/review/images/*'
VIDEOS_GCS_URI_PATTERN = f'{GCS_BUCKET_URI}/review/videos/*'

# Create the dataset if it doesn't exist to avoid errors.
client.create_dataset(DATASET_ID, exists_ok=True)
print(f"Dataset {DATASET_ID} ensured.")
print(f"BigQuery Client Initialized. Project ID: {PROJECT_ID}")

def run_bq_query(sql: str, client: bigquery.Client):
    """A helper function to run BigQuery queries and return results."""
    try:
        query_job = client.query(sql)
        print(f"Job {query_job.job_id} in state {query_job.state}")
        if query_job.statement_type == 'SELECT':
            df = query_job.to_dataframe()
            print(f"Query complete. Fetched {len(df)} rows.")
            return df
        else:
            query_job.result()
            print(f"Query for statement type {query_job.statement_type} complete.")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

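## Create the External Table for Text Reviews
### Create an **external table** with an explicitly defined schema that points directly at the CSV file in GCS.
This approach avoids problems that loading the data or schema auto-detection can cause.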

In [None]:

table_id_reviews_external = f"{PROJECT_ID}.{DATASET_ID}.customer_reviews_external"
sql_create_external_table = f"""
CREATE OR REPLACE EXTERNAL TABLE `{table_id_reviews_external}` (
    customer_review_id INT64,
    customer_id INT64,
    location_id INT64,
    review_datetime DATETIME,
    review_text STRING,
    social_media_source STRING,
    social_media_handle STRING,
    product_id INT64,
    rating INT64
)
OPTIONS (
  format = 'CSV',
  uris = ['{CSV_GCS_URI}'],
  field_delimiter = ',',
  skip_leading_rows = 1,
  allow_quoted_newlines = TRUE
);
"""
print(f"Creating external table: {table_id_reviews_external}...")
run_bq_query(sql_create_external_table, client)
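
The `allow_quoted_newlines = TRUE` option matters because free-text reviews may contain line breaks inside quoted fields. The sketch below uses Python's standard `csv` module rather than BigQuery, with made-up sample data, to illustrate what such a record looks like: it spans two physical lines but is a single row.

```python
import csv
import io

# Made-up sample, not taken from the lab's customer_reviews.csv.
sample_csv = (
    "customer_review_id,review_text,rating\n"
    '1,"Great coffee.\nWill come back!",5\n'
    "2,Too slow,2\n"
)

def parse_rows(text: str) -> list[dict]:
    """Parse CSV text, keeping newlines that appear inside quoted fields."""
    # newline="" lets the csv module handle embedded line breaks itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))

rows = parse_rows(sample_csv)
print(len(sample_csv.splitlines()), "physical lines ->", len(rows), "rows")
```

A parser that treated every physical line as a record would split the first review in two, which is the kind of failure the option is meant to prevent.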


## Verify the Text Review Table

In [None]:
%%bigquery
SELECT * FROM `cymbal.customer_reviews_external`
LIMIT 5

## Create Object Tables for Images and Videos

In [None]:
# Creates an object table for review images.
table_id_review_images = f"{PROJECT_ID}.{DATASET_ID}.review_images"
sql_create_image_table = f"""
CREATE OR REPLACE EXTERNAL TABLE `{table_id_review_images}`
WITH CONNECTION `{CONNECTION_ID_FOR_EXTERNAL_TABLE}`
OPTIONS (object_metadata = 'SIMPLE', uris = ['{IMAGES_GCS_URI_PATTERN}']);
"""
print(f"\nCreating object table for review images: {table_id_review_images}")
run_bq_query(sql_create_image_table, client)

# Creates an object table for review videos.
table_id_review_videos = f"{PROJECT_ID}.{DATASET_ID}.review_videos"
sql_create_video_table = f"""
CREATE OR REPLACE EXTERNAL TABLE `{table_id_review_videos}`
WITH CONNECTION `{CONNECTION_ID_FOR_EXTERNAL_TABLE}`
OPTIONS (object_metadata = 'SIMPLE', uris = ['{VIDEOS_GCS_URI_PATTERN}']);
"""
print(f"\nCreating object table for review videos: {table_id_review_videos}")
run_bq_query(sql_create_video_table, client)

## Verify the BigQuery Object Table (Review Images)

In [None]:
%%bigquery
SELECT * FROM `cymbal.review_images`
LIMIT 5

## Verify the BigQuery Object Table (Review Videos)

In [None]:
%%bigquery
SELECT * FROM `cymbal.review_videos`
LIMIT 3

## Create the Gemini Model in BigQuery
### This SQL statement creates a remote model in BigQuery that connects to the Gemini Flash endpoint through the connection configured earlier.

In [None]:
sql_create_gemini_model = f"""
CREATE OR REPLACE MODEL `{GEMINI_MODEL_NAME}`
REMOTE WITH CONNECTION `{CONNECTION_ID_FOR_EXTERNAL_TABLE}`
OPTIONS (endpoint = 'gemini-2.0-flash-001');
"""
print(f"Creating Gemini model: {GEMINI_MODEL_NAME}...")
run_bq_query(sql_create_gemini_model, client)

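## Keyword and Sentiment Text Analysis
### Because the source table is now guaranteed to have the correct schema, you can use this simple, efficient 'pass-through' pattern. The model processes each review and passes `customer_review_id` through to the output so the results can be joined easily later.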

In [None]:
# Analyze text for keywords
table_id_reviews_keywords = f"{PROJECT_ID}.{DATASET_ID}.customer_reviews_keywords"
sql_analyze_keywords = f"""
CREATE OR REPLACE TABLE `{table_id_reviews_keywords}` AS
SELECT
  customer_review_id,
  ml_generate_text_llm_result AS keywords_json_string
FROM ML.GENERATE_TEXT(
    MODEL `{GEMINI_MODEL_NAME}`,
    (
      SELECT
        customer_review_id,
        CONCAT('Extract keywords from the following customer review. Return as a JSON string array like {{"keywords": ["keyword1"]}}. Review: ', review_text) AS prompt
      FROM
        `{table_id_reviews_external}`
    ),
    STRUCT(0.2 AS temperature, TRUE AS flatten_json_output)
  );
"""
print("Starting customer review keyword analysis...")
run_bq_query(sql_analyze_keywords, client)


# Analyze text for sentiment
table_id_reviews_analysis = f"{PROJECT_ID}.{DATASET_ID}.customer_reviews_analysis"
sql_analyze_sentiment = f"""
CREATE OR REPLACE TABLE `{table_id_reviews_analysis}` AS
SELECT
  customer_review_id,
  ml_generate_text_llm_result AS sentiment_json_string
FROM ML.GENERATE_TEXT(
    MODEL `{GEMINI_MODEL_NAME}`,
    (
      SELECT
        customer_review_id,
        CONCAT('Classify the sentiment of the following review as "positive", "negative", or "neutral". Return as a JSON string like {{"sentiment": "positive"}}. Review: ', review_text) AS prompt
      FROM
        `{table_id_reviews_external}`
    ),
    STRUCT(0.2 AS temperature, TRUE AS flatten_json_output)
  );
"""
print("\nStarting customer review sentiment analysis...")
run_bq_query(sql_analyze_sentiment, client)

## Verify the Text Analysis Results

In [None]:
%%bigquery
SELECT * FROM `cymbal.customer_reviews_keywords`
LIMIT 5

In [None]:
%%bigquery
SELECT * FROM `cymbal.customer_reviews_analysis`
LIMIT 5
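
The result columns hold the model's reply as a string, and models often wrap JSON in a Markdown code fence. A minimal sketch of a defensive parser follows; `parse_llm_json` is a hypothetical helper, and the fenced-reply format is an assumption about the model output rather than something the lab guarantees.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker

def parse_llm_json(raw):
    """Return the parsed JSON value, or None if the reply is not valid JSON."""
    if not isinstance(raw, str):
        return None  # covers NULL results (None or NaN after loading into pandas)
    text = raw.strip()
    # Drop an optional opening fence (with or without a language tag) and closing fence.
    if text.startswith(FENCE):
        text = re.sub(r"^`{3}[A-Za-z]*\s*", "", text)
        text = re.sub(r"\s*`{3}$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

With a DataFrame `df` fetched from the keywords table, this could be applied as `df['keywords_json_string'].map(parse_llm_json)`.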

## Image and Video Analysis
### Analyze images and videos using Gemini, BigQuery SQL, and object tables

In [None]:
# Invokes Gemini to analyze the content of each image in the object table.
table_id_image_results = f"{PROJECT_ID}.{DATASET_ID}.review_images_results"
sql_analyze_images = f"""
CREATE OR REPLACE TABLE `{table_id_image_results}` AS
SELECT uri, ml_generate_text_llm_result AS image_analysis_json
FROM ML.GENERATE_TEXT( MODEL `{GEMINI_MODEL_NAME}`, TABLE `{table_id_review_images}`,
    STRUCT('For each image, summarize it and extract relevant keywords. Answer in JSON with keys "summary" and "keywords".' AS prompt, TRUE AS flatten_json_output)
);
"""
print("\nStarting image analysis...")
run_bq_query(sql_analyze_images, client)

# Invokes Gemini to analyze the content of each video in the object table.
table_id_video_results = f"{PROJECT_ID}.{DATASET_ID}.review_videos_results"
sql_analyze_videos = f"""
CREATE OR REPLACE TABLE `{table_id_video_results}` AS
SELECT uri, ml_generate_text_llm_result AS video_analysis_json
FROM ML.GENERATE_TEXT( MODEL `{GEMINI_MODEL_NAME}`, TABLE `{table_id_review_videos}`,
    STRUCT('For each video, summarize it and extract keywords. Answer in JSON with keys "summary" and "keywords".' AS prompt, TRUE AS flatten_json_output)
);
"""
print("\nStarting video analysis...")
run_bq_query(sql_analyze_videos, client)

## Review Image and Video Analysis Samples

In [None]:
# This cell fetches and displays media files for direct comparison with the analysis results.
storage_client = storage.Client(project=PROJECT_ID)

print(f"\n--- Displaying Individual Image Samples & Analysis ---")
df_img_samples = run_bq_query(f"SELECT uri, image_analysis_json FROM `{table_id_image_results}` LIMIT 2", client)
if df_img_samples is not None:
    for _, row in df_img_samples.iterrows():
        print("-" * 30)
        print(f"Analysis for: {row['uri']}")
        display(HTML(f"<pre style='white-space: pre-wrap;'>{row['image_analysis_json']}</pre>"))
        try:
            bucket_name, blob_name = row['uri'].replace("gs://", "").split("/", 1)
            display(Image(data=storage_client.bucket(bucket_name).blob(blob_name).download_as_bytes(), width=300))
        except Exception as e:
            print(f"--> Could not display image {row['uri']}. Error: {e}")

print(f"\n--- Displaying Individual Video Samples & Analysis ---")
df_vid_samples = run_bq_query(f"SELECT uri, video_analysis_json FROM `{table_id_video_results}` LIMIT 1", client)
if df_vid_samples is not None:
    for _, row in df_vid_samples.iterrows():
        print("-" * 30)
        print(f"Analysis for: {row['uri']}")
        display(HTML(f"<pre style='white-space: pre-wrap;'>{row['video_analysis_json']}</pre>"))

video_url=f"https://storage.googleapis.com/{PROJECT_ID}-bucket/review/videos/Review%20Video%20(1).mp4"
Video(video_url, width=640)
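
The image loop splits each `gs://` URI into a bucket name and an object name inline. As a sketch, the same step can be pulled into a small helper that also rejects malformed URIs. `split_gcs_uri` is a hypothetical name, and the example URI is invented.

```python
def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # partition splits on the first "/" only, so the object name keeps its own slashes.
    bucket, sep, object_name = uri[len(prefix):].partition("/")
    if not bucket or not sep or not object_name:
        raise ValueError(f"GCS URI must include a bucket and an object name: {uri!r}")
    return bucket, object_name

bucket_name, blob_name = split_gcs_uri("gs://example-bucket/review/images/Review Image (1).png")
```

Raising on a malformed URI keeps the failure close to its cause instead of surfacing later as a download error.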

## Create the Unified Analysis Table
### Combine everything into a single BigQuery multimodal table

In [None]:
# Each REGEXP_EXTRACT pattern has exactly one capturing group, `(\\d+)`, which pulls the review ID out of the filename.
# That ID lets us join the image/video analysis back to the original review.
table_id_multimodal_reviews = f"{PROJECT_ID}.{DATASET_ID}.multimodal_customer_reviews"
sql_create_multimodal_table = f"""
CREATE OR REPLACE TABLE `{table_id_multimodal_reviews}` AS
WITH
  image_results_parsed AS (
    SELECT SAFE_CAST(REGEXP_EXTRACT(uri, r'Review.*\\((\\d+)\\)') AS INT64) AS customer_review_id, uri AS image_uri, image_analysis_json
    FROM `{table_id_image_results}`
  ),
  video_results_parsed AS (
    SELECT SAFE_CAST(REGEXP_EXTRACT(uri, r'Video.*\\((\\d+)\\)') AS INT64) AS customer_review_id, uri AS video_uri, video_analysis_json
    FROM `{table_id_video_results}`
  )
SELECT
    cr.*, -- Select all columns from the correctly-defined source table
    s.sentiment_json_string,
    k.keywords_json_string,
    irp.image_uri,
    irp.image_analysis_json,
    vrp.video_uri,
    vrp.video_analysis_json
FROM `{table_id_reviews_external}` AS cr
LEFT JOIN `{table_id_reviews_analysis}` AS s ON cr.customer_review_id = s.customer_review_id
LEFT JOIN `{table_id_reviews_keywords}` AS k ON cr.customer_review_id = k.customer_review_id
LEFT JOIN image_results_parsed AS irp ON cr.customer_review_id = irp.customer_review_id
LEFT JOIN video_results_parsed AS vrp ON cr.customer_review_id = vrp.customer_review_id;
"""
print("Creating unified multimodal analysis table...")
run_bq_query(sql_create_multimodal_table, client)
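
Because the SQL lives in a regular (non-raw) Python f-string, each doubled backslash in the source collapses to a single backslash before BigQuery sees the pattern. The sketch below applies the collapsed patterns with Python's `re` module to show what the two `REGEXP_EXTRACT` calls pull out. The image filename is an assumed example; only the video filename `Review Video (1).mp4` appears in this notebook. Python's `re` and BigQuery's RE2 agree on this simple syntax, but they are different engines, so treat this as an illustration only.

```python
import re

# Patterns as BigQuery receives them once Python has collapsed the doubled backslashes.
IMAGE_ID_PATTERN = re.compile(r"Review.*\((\d+)\)")
VIDEO_ID_PATTERN = re.compile(r"Video.*\((\d+)\)")

def extract_review_id(uri, pattern):
    """Mimic SAFE_CAST(REGEXP_EXTRACT(...) AS INT64): None when nothing matches."""
    match = pattern.search(uri)
    return int(match.group(1)) if match else None

image_id = extract_review_id("gs://example-bucket/review/images/Review Image (12).png", IMAGE_ID_PATTERN)
video_id = extract_review_id("gs://example-bucket/review/videos/Review Video (1).mp4", VIDEO_ID_PATTERN)
no_id = extract_review_id("gs://example-bucket/review/images/banner.png", IMAGE_ID_PATTERN)
```

Files whose names do not follow the pattern produce a NULL `customer_review_id`, so the `LEFT JOIN` simply leaves their columns empty for every review.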

## Verify the Unified Table

In [None]:
%%bigquery
SELECT * FROM `cymbal.multimodal_customer_reviews`
WHERE video_uri IS NOT NULL

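## Visualize the Sentiment Distribution with GenAI


- In this step, you use the generative AI assistant built into the notebook to create a plot.
- Click the **+ Code** button to add a new code cell.
- Inside the new cell, click the **Generate** button.
- In the prompt box, enter the following as a comment:
   - `plot a bar chart for the distribution of sentiment from the sentiment_json_string column in the multimodal_customer_reviews table`
- Accept the suggested code, then run the cell to display the chart. This gives you a quick view of the overall sentiment balance.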
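
For orientation, any generated code has to turn each value in `sentiment_json_string` into a label and count the labels before plotting. A rough sketch of that counting step in plain Python follows; the sample rows are invented, and the assistant's actual code will likely differ (for example, it may use pandas).

```python
import json
from collections import Counter

# Invented sample values standing in for the sentiment_json_string column.
sample_rows = [
    '{"sentiment": "positive"}',
    '{"sentiment": "negative"}',
    '{"sentiment": "positive"}',
    None,              # a review without an analysis result
    'not valid json',  # a reply the model did not format as JSON
]

def sentiment_counts(rows):
    """Count sentiment labels, grouping missing or unparseable rows as 'unknown'."""
    counts = Counter()
    for raw in rows:
        try:
            label = json.loads(raw)["sentiment"]
        except (TypeError, ValueError, KeyError):
            label = "unknown"
        counts[label] += 1
    return counts

counts = sentiment_counts(sample_rows)
```

The labels and counts map directly onto a bar chart, for instance `plt.bar(list(counts), list(counts.values()))` with the `plt` module imported during initialization.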


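## Hands-On Lab: Generate Plots with GenAI


- Now it is your turn to use the generative AI assistant built into the notebook. Write simple prompts to generate visualizations yourself.
- Your task is to ask the generative AI assistant new, creative questions to uncover hidden patterns and insights in `table_id_multimodal_reviews`.
- Below are a few examples for inspiration. Run them, then create your own!

   1. Create a line chart that tracks the daily counts of positive, negative, and neutral reviews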
   ```
   I want to analyze how customer sentiment has changed day by day.
   Select data from the table_id_multimodal_reviews table and generate a line chart that tracks the daily counts of positive, negative, and neutral sentiments.
   The sentiment is in the 'sentiment_json_string' field, and the date is in the 'review_datetime' field.
   ```
   2. Create a bar chart that compares the total number of reviews with an image against the total number of reviews with a video
   ```
   Using table_id_multimodal_reviews, count the number of reviews that have an image and the number of reviews that have a video. Show the result as a bar chart.
   ```
   3. Plot a grouped bar chart showing the number of positive, negative, and neutral reviews for the customer age groups '18-29', '30-45', '46-60', and '61+'
   ```
   I need a breakdown of sentiment by customer age group.
   First, join the `table_id_multimodal_reviews` table with the `customers` table using `customer_id`.
   Then, create four age groups from the `age` column: '18-29', '30-45', '46-60', and '61+'.
   Finally, create a grouped bar chart where each age group shows the total count of 'positive', 'negative', and 'neutral' sentiments.
   ```
   4. Create a grouped bar chart that compares the total number of positive, negative, and neutral reviews across all gender categories
   ```
   I want to analyze if customer sentiment differs by gender.
   Join the `table_id_multimodal_reviews` table with the `customers` table using `customer_id`.
   For each gender, count the total number of 'positive', 'negative', and 'neutral' reviews.
   Present this comparison as a grouped bar chart, where each gender has its own set of bars for the sentiments.
   ```
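
For the age-group prompt, the key step is mapping each age to one of the four labels. A minimal sketch of that bucketing in plain Python follows; `age_group` is a hypothetical helper, and returning `None` for ages under 18 is an assumption, since the prompt does not say how to treat them.

```python
from typing import Optional

def age_group(age: int) -> Optional[str]:
    """Map an age to the labels used in the prompt: '18-29', '30-45', '46-60', '61+'."""
    if age < 18:
        return None  # Assumption: ages below the first bucket are left ungrouped.
    if age <= 29:
        return "18-29"
    if age <= 45:
        return "30-45"
    if age <= 60:
        return "46-60"
    return "61+"
```

Comparing the assistant's generated bucketing against boundary ages such as 29, 30, 45, 46, 60, and 61 is a quick way to confirm that it matches the intended groups.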
