[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sandy-lee29/musicapp-review-analysis/blob/main/App_Review_Tagging_with_AI.ipynb)


# 📱 App Review Tagging  
## 🎧 Industry: Music
### Companies Analyzed:
*   Spotify
*   Apple Music
*   Amazon Music
*   YouTube Music

In [None]:
%pip install python-dotenv



In [None]:
from dotenv import load_dotenv, find_dotenv
import os
import json
import uuid
import pandas as pd
import re
import openai
from typing import List, Dict, Any
from openai import OpenAI
import time


## 📌 Bring in App Reviews + Create review_id for iOS Reviews

In [None]:
df = pd.read_csv('cleaned_reviews.csv')

In [None]:
# Keep review_id values that already look like UUIDs; replace anything else (e.g., missing IDs) with a new UUID4
uuid_pattern = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.IGNORECASE)
df['review_id'] = df['review_id'].apply(
    lambda x: str(uuid.uuid4()) if not (isinstance(x, str) and uuid_pattern.match(x)) else x
)

df.head()

Unnamed: 0,review_id,review,rating,time,company,data_source
0,4e0df3ef-2798-4ed3-bff0-1b79ede603c2,improve the shuffle and you will be the best p...,4,2025-04-03 03:45:31,Apple Music,Android
1,962dd7a1-e24f-4ddd-9a6f-451b350f6e88,it wouldnt let me play any song i was trying to,1,2025-04-03 02:24:13,Apple Music,Android
2,e12531ef-3319-480f-a6d6-d6c3b1a04b8b,no response from this application mi telefo ot...,1,2025-04-03 01:32:37,Apple Music,Android
3,4549216e-6e5c-4e53-8299-ae3684a8bd7e,would love it if i could set app language with...,4,2025-04-03 00:05:24,Apple Music,Android
4,2e874922-1ac3-4292-a0ea-09ca236fb667,music is superb just improve song recommendations,5,2025-04-02 16:45:37,Apple Music,Android
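The ID check above can be tried in isolation. The following is a minimal sketch, assuming the same UUID pattern as the cell above; the helper name `ensure_review_id` and the sample inputs are hypothetical and exist only for illustration.

```python
import re
import uuid

# Same layout as the notebook's pattern: canonical 8-4-4-4-12 hex groups.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

def ensure_review_id(value):
    """Return value unchanged if it is a UUID-shaped string, else a fresh UUID4."""
    if isinstance(value, str) and UUID_PATTERN.match(value):
        return value
    return str(uuid.uuid4())

# Hypothetical inputs: one valid ID, one missing value, one placeholder number.
samples = ["4e0df3ef-2798-4ed3-bff0-1b79ede603c2", None, 12345]
fixed = [ensure_review_id(s) for s in samples]
```

Valid IDs pass through untouched, so rerunning the cell does not reassign IDs that are already UUIDs. Non-string values (including the `NaN` that pandas uses for missing cells) are replaced.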


In [None]:
df.count()

Unnamed: 0,0
review_id,8803
review,8803
rating,8803
time,8803
company,8803
data_source,8803


# 🎯 Customer Review Analysis Pipeline (High-Level Overview)

## 📌 1. Data Loading & Environment Setup
- Load **customer reviews dataset** (`cleaned_reviews.csv`).
- Retrieve **OpenAI API key** from `.env` file.
- Initialize **OpenAI client** for API calls.

## 📌 2. OpenAI API Call for Review Analysis
**Purpose:** Extract **sentiment, problems, aspects, and topics** from customer reviews.  
**Process:**
- Define **sentiment classification rules**.
- Send **review text** to `gpt-4o-mini`.
- Receive structured **JSON response**:
  - **Sentiment** (positive, neutral, negative)
  - **Problems** (if any)
  - **Aspects** (generalized issue category)
  - **Topics** (standardized topic classification)
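
To make the response shape concrete, here is a small sketch that parses and checks one reply. The validator `validate_analysis` and the example reply are hypothetical: the reply is invented for illustration and is not real model output.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
LIST_FIELDS = ("problems", "aspects", "topics")

def validate_analysis(raw: str) -> dict:
    """Parse a JSON reply and check it has the fields described above."""
    data = json.loads(raw)
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    for field in LIST_FIELDS:
        items = data.get(field)
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"{field} must be a list of strings")
    return data

# Hypothetical reply, invented for illustration.
example_reply = json.dumps({
    "review_id": "r-001",
    "sentiment": "negative",
    "problems": ["it wouldnt let me play any song"],
    "aspects": ["song playback failure"],
    "topics": ["functionality"],
})
parsed = validate_analysis(example_reply)
```

A check like this is cheap insurance even when the API enforces a schema, because it fails loudly on a malformed reply instead of letting it reach the output file.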

## 📌 3. Topic Standardization
Convert extracted **topics** into **predefined categories** for consistency.  
**Matching Process:**
- **Exact Match:** If topic exists in predefined categories, use it.
- **Semantic Similarity Matching:** If no exact match, select the most semantically similar category using sentence embeddings (the cosine similarity must exceed 0.6).
- **Fallback:** If no match, classify as `"other"`.
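
The three stages can be sketched without loading an embedding model. In the sketch below, a word-overlap (Jaccard) score stands in for the sentence-embedding cosine similarity, so the scores are only illustrative. The names `jaccard` and `match_topic`, and the trimmed category table, are assumptions made for this example.

```python
def jaccard(a: str, b: str) -> float:
    """Toy stand-in for embedding cosine similarity: word-set overlap."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def match_topic(topic, categories, similarity=jaccard, threshold=0.6):
    # 1) Exact match against any listed phrase.
    for name, phrases in categories.items():
        if topic in phrases:
            return name
    # 2) Best-scoring category, kept only if it clears the threshold.
    best_name, best_score = "other", 0.0
    for name, phrases in categories.items():
        for phrase in phrases:
            score = similarity(topic, phrase)
            if score > threshold and score > best_score:
                best_name, best_score = name, score
    # 3) Still "other" if nothing cleared the threshold.
    return best_name

# Hypothetical, trimmed-down category table.
categories = {
    "performance": ["slow loading", "battery drain"],
    "pricing": ["subscription cost", "hidden fees"],
}

exact = match_topic("battery drain", categories)        # stage 1
similar = match_topic("slow loading times", categories) # stage 2: overlap is 2/3
fallback = match_topic("dark mode", categories)         # stage 3
```

Passing the similarity function as a parameter keeps the matching logic testable and lets the real embedding-based scorer be swapped in later.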

## 📌 4. Data Processing & Transformation
- **Neutral/Negative reviews** must have at least **one problem, aspect, and topic**.
- **Positive reviews** retain sentiment but **exclude problems**.
- Limit extracted issues to **one per 200 characters** to avoid redundancy.
- **Apply topic standardization** for structured categorization.
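
The first three rules above fit in a few lines. This is a sketch under the assumption that the model returns three parallel lists; the function name `apply_issue_rules` and the sample review are hypothetical.

```python
def apply_issue_rules(review_text, sentiment, problems, aspects, topics):
    """Apply the post-processing rules listed above to one review's lists."""
    if sentiment == "positive":
        return [], [], []
    # One issue per 200 characters, but always allow at least one.
    limit = max(1, len(review_text) // 200)
    problems, aspects, topics = problems[:limit], aspects[:limit], topics[:limit]
    # Neutral/negative reviews need one entry each; "N/A" marks a missing value.
    return problems or ["N/A"], aspects or ["N/A"], topics or ["N/A"]

# Hypothetical short review (well under 200 characters), so only one issue is kept.
short_review = "crashes on launch and the ads are endless"
result = apply_issue_rules(
    short_review, "negative",
    ["crashes on launch", "ads are endless"],
    ["app stability issues", "excessive ad frequency"],
    ["performance", "ads"],
)
```

Because the cap uses integer division, a review needs at least 400 characters before a second issue is kept.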

## 📌 5. Sampling & Output Generation
- Randomly **sample 1,000 reviews** (fixed `random_state` for reproducibility).
- Execute the **review analysis pipeline**.
- Save processed dataset as **`Music_1000.csv`**.

## 🎤 Key Takeaways
✔ **Automated** sentiment & issue extraction using **OpenAI API**.  
✔ **Standardized** topic categorization for better insights.  
✔ **Structured & scalable** approach to analyzing customer feedback.  


##  📌 Initialize OpenAI API Key




In [None]:
# Write the key in KEY=value form so load_dotenv can expose it as OPENAI_API_KEY
with open(".env", "w") as f:
    f.write("OPENAI_API_KEY=your-key-here\n")

In [None]:
# Load environment variables from .env file
dotenv_path = "/content/.env"
load_dotenv(dotenv_path)


# Get the OpenAI API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI API client with the API key
client = OpenAI(api_key=openai_api_key)
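
`os.getenv` returns `None` when the variable is missing, and the failure then only surfaces on the first API call. A small guard makes the problem visible earlier. This is a sketch; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used so that no real key is touched.

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=... to your .env file"
        )
    return value

# Demonstration with a hypothetical variable name.
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
```

In this notebook the call would be `require_env("OPENAI_API_KEY")`, placed right after `load_dotenv`.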

In [None]:
def prompting(task_name: str, system_prompt: str, user_input: str, json_formatter: Dict[str, Any]) -> Dict[str, Any]:
    """
    Standardized function for calling OpenAI API.

    Parameters:
    -----------
    - task_name: str → The name of the request task (e.g., "get_pcs", "extract_problems").
    - system_prompt: str → System prompt defining AI's role and instructions.
    - user_input: str → User input containing the review text for analysis.
    - json_formatter: Dict[str, Any] → JSON schema specifying the expected API response format.

    Returns:
    --------
    - Dict[str, Any] → JSON response from OpenAI API.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": task_name,
                "schema": json_formatter,
                "strict": True
            }
        },
        max_tokens=2000,
        temperature=0.7,
    )

    return json.loads(response.choices[0].message.content)
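
Long runs over hundreds of reviews can hit transient network or rate-limit errors. One common mitigation is to retry with exponential backoff. The wrapper below is a sketch, not part of the original pipeline: `call_with_retry` and the `flaky` stand-in function are hypothetical, and catching every `Exception` is a simplification that a real run would likely narrow.

```python
import time

def call_with_retry(fn, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying on any exception with delays of base_delay * 2**attempt."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)

# Hypothetical flaky function standing in for the API call: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return {"sentiment": "positive"}

# Record the delays instead of actually sleeping, to keep the demo instant.
delays = []
result = call_with_retry(flaky, retries=3, base_delay=0.5, sleep=delays.append)
```

With the function above, a call would look like `call_with_retry(prompting, task_name, system_prompt, user_input, json_formatter)`.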

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch

# Load sentence transformer model once
model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity_match(topic: str, predefined_categories: dict, threshold: float = 0.6) -> str:
    topic_embedding = model.encode(topic, convert_to_tensor=True)

    best_match = "other"
    highest_score = 0

    for category, phrases in predefined_categories.items():
        for phrase in phrases:
            phrase_embedding = model.encode(phrase, convert_to_tensor=True)
            score = util.pytorch_cos_sim(topic_embedding, phrase_embedding).item()
            if score > highest_score and score > threshold:
                highest_score = score
                best_match = category
    return best_match

def standardize_topic(topic: str) -> str:
    """Standardizes a topic name: exact phrase match first, then semantic similarity as a fallback."""

    # Normalize the topic string
    topic = topic.lower().strip()
    topic = topic.replace("app ", "").replace("application ", "")
    topic = topic.replace("/", " ").replace("-", " ").replace("_", " ")

    # Broad topics with predefined keywords
    predefined_categories = {
        "usability": [
            "ui issues", "navigation problems", "user experience", "accessibility", "ease of use",
            "complex interface", "poor design choices", "difficult to navigate", "unintuitive layout",
            "hard to use", "confusing interface", "user unfriendly", "complicated menus"
        ],
        "functionality": [
            "feature request", "missing feature", "broken feature", "app functionality",
            "inconsistent behavior", "limited customization options", "feature not working",
            "functionality issues", "feature malfunction", "unavailable features", "feature glitches"
        ],
        "performance": [
            "slow app speed", "app lagging", "performance issues", "crash", "freezes", "buffering",
            "high battery consumption", "excessive data usage", "slow loading", "app hangs",
            "unresponsive app", "performance lag", "battery drain", "memory usage issues"
        ],
        "security": [
            "security risk", "privacy concern", "data protection", "account security",
            "unauthorized charges", "data leaks", "intrusive permissions", "security vulnerability",
            "data breach", "unsecured connection", "privacy invasion", "security flaws"
        ],
        "customer_service": [
            "customer support", "support issues", "help desk response", "slow response time",
            "unhelpful support", "refund issues", "account recovery problems", "poor customer service",
            "no support", "unresponsive support", "support not helpful", "difficult to contact support"
        ],
        "pricing": [
            "pricing issues", "subscription cost", "perceived value", "monetization concerns",
            "hidden fees", "auto renewal problems", "difficult cancellation", "price increase",
            "expensive subscription", "unjustified cost", "billing issues", "overpriced"
        ],
        "content": [
            "content access", "content discovery", "content recommendations", "content quality",
            "region locked content", "missing content", "irrelevant suggestions", "limited library",
            "outdated content", "poor content selection", "content not updating", "content availability"
        ],
        "ads": [
            "too many notifications", "spammy ads", "intrusive popups", "irrelevant ads", "ad frequency",
            "unable to disable ads", "annoying advertisements", "ads interrupting", "excessive advertising",
            "forced ads", "advertising overload", "ads disrupting experience", "excessive advertisements", "advertisements", "ads"
        ],
        "recommendation_system": [
            "playlist recommendations", "algorithm accuracy", "music suggestions", "personalized playlists",
            "recommendation relevance", "song suggestions", "curated playlists", "music discovery",
            "recommendation engine", "suggested tracks", "playlist curation", "music matching"
        ],
        "offline_listening": [
            "download limitations", "offline mode issues", "music downloads", "offline playback",
            "downloaded songs disappearing", "offline access", "download errors", "offline functionality",
            "download restrictions", "offline listening problems", "download quality", "offline mode limitations"
        ],
        "audio_quality": [
            "low bitrate", "sound quality issues", "streaming quality", "audio fidelity", "poor sound quality",
            "audio glitches", "sound distortion", "quality loss", "uneven volume", "audio dropouts",
            "sound clarity", "audio performance", "audio", "audio quality"
        ],
        "cross_platform_compatibility": [
            "device support", "os compatibility", "third party integrations", "platform sync issues",
            "multi device support", "compatibility problems", "integration with other apps", "cross device functionality",
            "platform limitations", "syncing issues", "device pairing", "compatibility errors"
        ]
    }

    # Try exact match first
    for standard_topic, variations in predefined_categories.items():
        if topic in variations:
            return standard_topic

    # Semantic match using transformer
    return semantic_similarity_match(topic, predefined_categories, threshold=0.6)



# 📌 Main Function for Sentiment & Problem Analysis
def apply_sentiment_problem_analysis(df: pd.DataFrame, api_key: str) -> pd.DataFrame:
    """Processes a DataFrame containing customer reviews and extracts structured information."""

    print("Existing columns in DataFrame:", df.columns.tolist())

    # ✅ System Prompt with Updated Rules for Sentiment Classification
    system_prompt = """
    You are analyzing customer reviews to extract structured information.
    Each review contains potential problems, aspects, and sentiments.

    **Sentiment Classification Rules:**
    - **Positive**: The review expresses strong satisfaction, enthusiasm, or clear approval.
       - **Even if there is a minor issue, the review must be classified as positive if the overall tone is clearly enthusiastic or highly satisfied.**
       - **Positive words such as "love", "amazing", "best", "fantastic", "highly recommend", "great experience", "very useful", "super helpful" strongly indicate a positive sentiment.**
       - If a review contains **praise that outweighs a minor complaint**, classify it as positive.
       - Examples of positive reviews:
         - `"I love this app! It works perfectly and I use it every day."`
         - `"This is the best app I have ever used. Highly recommend it!"`
         - `"Very useful and convenient. A must-have!"`
         - `"Great experience, the recommendations are spot on!"`
         - `"Super helpful, but I wish it had more features."`

    - **Neutral**: The review contains both positive and negative aspects but does not strongly favor one side.
       - The review must be **clearly mixed or balanced** to be considered neutral.
       - It may include mild suggestions for improvement while still being generally balanced.
       - If a review is **mostly positive with a small complaint, classify it as positive instead.**
       - Examples of neutral reviews:
         - `"The app is useful, but it crashes sometimes."`
         - `"The service is okay. Not great, but not bad either."`
         - `"It's good, but I wish it had more features."`

    - **Negative**: The review expresses dissatisfaction, frustration, or a clear complaint.


    **Analysis Process:**
    1) Identify sentiment (positive, negative, or neutral).
    2) Extract exact phrasing of problems from negative & neutral reviews.
       - If positive, return an empty list for problems.
       - Extract **max 1 problem per 200 characters** in the review text, prioritizing the most critical issues. If multiple issues exist, rank them by relevance and severity, ensuring that the most significant problems are captured first. Avoid selecting minor or redundant issues when more critical problems are present elsewhere in the text.
    3) Generalize each problem into a broader 'aspect' category.
       - **Each aspect must be at least 2-3 words long**.
    4) Assign each aspect to a corresponding broad 'topic' using standardization rules.
     - Convert topics to **lowercase**.
     - Use **broad** topics (max **2 words**).
     - Remove "app" or "application" from topics.
     - Merge similar topics into **one standard term** to prevent fragmentation.
     - If an exact match exists in the predefined topic list, use the predefined term.
     - If an exact match does not exist, choose the **most semantically similar topic** from the predefined list.
        - Use semantic similarity based on meaning, not just character overlap.
        - Avoid defaulting to "other" unless no reasonable semantic match exists (similarity < 60%).
     - **Industry-specific topics are dynamically updated within the predefined topic list.**
     - General topics apply to all app types (e.g., usability, functionality, performance, security, customer service, pricing, content, ads).
     - **App-specific topics are updated separately based on the industry category.**
       - **For music streaming apps:** Includes "recommendation system," "offline listening," "audio quality," and "cross-platform compatibility."


    **Output Requirements:**
    - Return structured JSON with review_id, sentiment, problems, aspects, topics.
    - Neutral and Negative reviews **MUST have at least one problem, aspect, and topic.**
    - Positive reviews must be included but should have empty lists for problems, aspects, and topics.
    """

    # ✅ JSON schema for AI response
    # The list fields are always arrays of strings; positive reviews simply return empty arrays.
    # The enum keeps sentiment lowercase so the `sentiment == "positive"` check below is reliable.
    json_formatter = {
        "type": "object",
        "properties": {
            "review_id": {"type": "string"},
            "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            "problems": {"type": "array", "items": {"type": "string"}},
            "aspects": {"type": "array", "items": {"type": "string"}},
            "topics": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["review_id", "sentiment", "problems", "aspects", "topics"],
        "additionalProperties": False
    }

    expanded_reviews = []

    for _, row in df.iterrows():
        review_id = str(row["review_id"])
        review_text = row["review"]

        # Retrieve additional fields
        review_time = row.get("time", "")
        review_rating = row.get("rating", "")
        company = row.get("company", "")
        data_source = row.get("data_source", "")

        # Construct user input for AI model
        user_input = f"""
        Analyze the following customer review:
        - Review ID: {review_id}
        - Review Text: "{review_text}"
        """

        try:
            # Call AI model
            response = prompting("analyze_review_sentiment", system_prompt, user_input, json_formatter)

            sentiment = response["sentiment"]

            # ✅ If sentiment is positive, enforce empty lists
            if sentiment == "positive":
                problems, aspects, topics = [], [], []
            else:
                problems = response["problems"]
                aspects = response["aspects"]
                topics = response["topics"]

                # 🔹 Limit Problems to 1 per 200 Characters
                max_problems = max(1, len(review_text) // 200)
                problems, aspects, topics = problems[:max_problems], aspects[:max_problems], topics[:max_problems]

                # ✅ Ensure Neutral & Negative Reviews Always Have At Least One Problem
                if not problems or not aspects or not topics:
                    problems = problems or ["N/A"]
                    aspects = aspects or ["N/A"]
                    topics = topics or ["N/A"]

            # 🔹 Aspect Index & Topic Standardization
            for idx, (problem, aspect, topic) in enumerate(zip(problems, aspects, topics), start=1):
                aspect_index = f"{review_id}_{idx}_{uuid.uuid4().hex[:8]}"
                standardized_topic = standardize_topic(topic)  # ✅ Apply topic standardization

                expanded_reviews.append({
                    "review_id": review_id,
                    "review": review_text,
                    "rating": review_rating,
                    "time": review_time,
                    "aspect_index": aspect_index,
                    "topic": standardized_topic,
                    "problem": problem,
                    "sentiment": sentiment,
                    "aspect": aspect,
                    "modified_flag": False,
                    "company": company,
                    "data_source": data_source
                })

        except Exception as e:
            print(f"Error processing review {review_id}: {e}")

    # ✅ Arrange columns in a specific order
    column_order = ["review_id", "review", "rating", "time", "aspect_index",
                    "topic", "problem", "sentiment", "aspect", "modified_flag",
                    "company", "data_source"]

    # Convert processed data into a DataFrame. Passing columns= fixes the order and
    # still yields an empty, correctly shaped DataFrame if no rows were produced.
    processed_df = pd.DataFrame(expanded_reviews, columns=column_order)

    return processed_df
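
One detail of the normalization in `standardize_topic` is worth noting: `replace("app ", "")` works on raw substrings, so it also strips "app " from inside longer words (for example, "whatsapp integration" becomes "whatsintegration"). The variant below uses word boundaries instead. It is a sketch of an alternative, not the notebook's behavior, and `normalize_topic` is a hypothetical name.

```python
import re

def normalize_topic(topic: str) -> str:
    """Lowercase, turn separators into spaces, and drop standalone 'app'/'application'."""
    topic = topic.lower().strip()
    topic = re.sub(r"[/\-_]+", " ", topic)                   # separators -> spaces
    topic = re.sub(r"\b(?:app|application)\b", " ", topic)   # whole words only
    return " ".join(topic.split())                           # collapse repeated whitespace

examples = {
    "App Performance": normalize_topic("App Performance"),
    "cross-platform_compatibility": normalize_topic("cross-platform_compatibility"),
    "whatsapp integration": normalize_topic("whatsapp integration"),
}
```

Replacing separators before removing "app" also handles inputs such as "app-performance", where the hyphen would otherwise hide the word from a space-based match.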



In [None]:
# Extract 1000 samples from the dataset
df_sample = df.sample(n=1000, random_state=42)

# Execute
processed_sample_df = apply_sentiment_problem_analysis(df_sample, api_key=openai_api_key)

# Save the processed sample data to a CSV file
processed_sample_df.to_csv("Music_1000.csv", index=False)


Existing columns in DataFrame: ['review_id', 'review', 'rating', 'time', 'company', 'data_source']


In [None]:
print(processed_sample_df.head())

                              review_id  \
0  3c0d2476-b866-41d0-ae55-b88e8512e649   
1  da509029-0a92-4b62-bd22-b5d02a652070   
2  443bc365-b40e-46bc-95c3-8e5a2338e962   
3  e8e865dc-109b-470e-b4dc-7a1129e7617c   
4  f7554b67-9ea4-46d6-88d2-59380e0d355c   

                                              review  rating  \
0        unable to share lyrics in my android device       1   
1  app is not working properly unable to play any...       1   
2        it keeps me well entertained and very happy       5   
3  used to be good but now  nope its understandab...       1   
4  as an android user  i really enjoy this app  e...       3   

                  time                                     aspect_index  \
0  2025-03-27 16:48:32  3c0d2476-b866-41d0-ae55-b88e8512e649_1_d9a5e8ec   
1  2025-03-22 04:48:50  da509029-0a92-4b62-bd22-b5d02a652070_1_81510e9d   
2  2025-03-31 21:36:55  443bc365-b40e-46bc-95c3-8e5a2338e962_1_c7e1d272   
3  2025-03-06 17:42:14  e8e865dc-109b-470e-b4dc-7a1129e7