# 🏷️ Automated Data Labeling with Label Studio

This notebook implements a complete workflow for automated data labeling:

1. **📊 Data Sampling**: Read data and sample 10% for manual labeling
2. **🏷️ Manual Labeling**: Send sample to Label Studio for multi-label classification
3. **🤖 Rule Generation**: Extract patterns from labeled data
4. **⚡ Auto Labeling**: Apply rules to label the entire dataset

---


## 📦 1. Setup and Dependencies


In [1]:
# Install required packages
import subprocess
import sys


def install_package(package):
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✅ {package} installed successfully")
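    except subprocess.CalledProcessError:
        # pip exits non-zero on a real failure; an already-installed package is not an error
        print(f"⚠️  {package} installation failed, check the pip output above")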


# Install Label Studio SDK and other dependencies
packages = ["label-studio-sdk", "pandas", "numpy", "requests", "scikit-learn"]

print("📦 Installing required packages...")
for package in packages:
    install_package(package)

print("\n🎉 All packages ready!")

📦 Installing required packages...
✅ label-studio-sdk installed successfully
✅ pandas installed successfully
✅ numpy installed successfully
✅ requests installed successfully
✅ scikit-learn installed successfully

🎉 All packages ready!


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import json
import requests
import time
import re
from typing import List, Dict, Any
from label_studio_sdk import Client
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")

print("📚 Libraries imported successfully!")

📚 Libraries imported successfully!


## 📊 2. Data Loading and Sampling


In [3]:
# Configure data path and sampling parameters
DATA_PATH = "../data/cleaned_google_reviews.csv"
SAMPLE_PERCENTAGE = 0.1  # 10% sampling
RANDOM_STATE = 42

print(f"📁 Loading data from: {DATA_PATH}")
print(f"🎲 Sampling {SAMPLE_PERCENTAGE*100}% of data for manual labeling")

# Load the complete dataset
df_full = pd.read_csv(DATA_PATH)

# Clean and prepare data
df_full = df_full[df_full["review_text"].str.len() > 0].reset_index(drop=True)

print(f"\n📊 Dataset Overview:")
print(f"   Total rows: {len(df_full):,}")
print(f"   Columns: {df_full.columns.tolist()}")
print(f"   Memory usage: {df_full.memory_usage().sum() / 1024**2:.1f} MB")

# Sample data for manual labeling
sample_size = int(len(df_full) * SAMPLE_PERCENTAGE)
df_sample = df_full.sample(n=sample_size, random_state=RANDOM_STATE).reset_index(
    drop=True
)

print(f"\n🎯 Sampling Results:")
print(f"   Sample size: {len(df_sample):,} rows ({SAMPLE_PERCENTAGE*100}%)")
print(f"   Remaining for auto-labeling: {len(df_full) - len(df_sample):,} rows")

# Display sample data
print(f"\n📝 Sample Data Preview:")
display_cols = (
    ["review_text", "rating", "category"]
    if "category" in df_sample.columns
    else ["review_text", "rating"]
)
for i, row in df_sample.head(3).iterrows():
    print(f"\n   Row {i+1}:")
    print(f"   📝 Review: '{row['review_text'][:100]}...'")
    print(f"   ⭐ Rating: {row.get('rating', 'N/A')}")
    if "category" in row:
        print(f"   🏷️  Category: {row.get('category', 'N/A')}")

📁 Loading data from: ../data/cleaned_google_reviews.csv
🎲 Sampling 10.0% of data for manual labeling

📊 Dataset Overview:
   Total rows: 347,087
   Columns: ['user_id', 'user_name', 'review_time', 'rating', 'review_text', 'pics', 'resp', 'gmap_id', 'has_resp', 'resp_text', 'resp_time', 'biz_name', 'description', 'category', 'avg_rating', 'num_of_reviews', 'price_level']
   Memory usage: 45.0 MB

🎯 Sampling Results:
   Sample size: 34,708 rows (10.0%)
   Remaining for auto-labeling: 312,379 rows

📝 Sample Data Preview:

   Row 1:
   📝 Review: 'Love the campground. Staff has gone above and beyond. Have stayed for two months. Sites are paved, a...'
   ⭐ Rating: 5
   🏷️  Category: ['Campground']

   Row 2:
   📝 Review: '1st time evee.give McDonald's a 5 star thanks to guy named Tyler working front counter. Had the best...'
   ⭐ Rating: 5
   🏷️  Category: ['Fast food restaurant', 'Breakfast restaurant', 'Coffee shop', 'Hamburger restaurant', 'Restaurant', 'Sandwich shop']

   Row 3:
   📝 Re

In [12]:
df_full.head()

| | user_id | user_name | review_time | rating | review_text | pics | resp | gmap_id | has_resp | resp_text | resp_time | biz_name | description | category | avg_rating | num_of_reviews | price_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103563353519118155776 | Peri Gray | 2018-01-16 17:11:15.780000+00:00 | 5 | Great place to care for our children. | False | | 0x532af45db8f30779:0xd9be9359f1e56178 | False | | | CRST WIC Office | | | 4.7 | 8.0 | 0.0 |
| 1 | 101824980797027237888 | Suzy Berndt | 2018-07-30 03:45:50.314000+00:00 | 5 | Th sw y are so nice | False | | 0x532af45db8f30779:0xd9be9359f1e56178 | False | | | CRST WIC Office | | | 4.7 | 8.0 | 0.0 |
| 2 | 108711640480272777216 | Rosemary Red Legs | 2018-07-07 13:11:33.932000+00:00 | 5 | Went with my daughter | False | | 0x532af45db8f30779:0xd9be9359f1e56178 | False | | | CRST WIC Office | | | 4.7 | 8.0 | 0.0 |
| 3 | 111135746986864017408 | hypnotherapycw | 2017-02-18 23:59:28.190000+00:00 | 5 | Julie and the crew are AMAZING. DONATE DONATE ... | False | | 0x532af4588c5f80b1:0x19574964b8ecd9a0 | False | | | Cheyenne River Youth Project | | ['Youth social services organization'] | 4.5 | 35.0 | 0.0 |
| 4 | 108987444312280645632 | C J Blue Coat | 2016-02-25 10:10:42.607000+00:00 | 2 | They dont have any activities for youth. If so... | False | | 0x532af4588c5f80b1:0x19574964b8ecd9a0 | False | | | Cheyenne River Youth Project | | ['Youth social services organization'] | 4.5 | 35.0 | 0.0 |
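
The sampling cell resets the index of `df_sample`, so the link back to rows in `df_full` is lost, and the "remaining" count is computed by subtraction rather than from an actual remainder frame. The following is a minimal sketch of one way to keep that link. It uses a toy DataFrame in place of the real data, and the column name `orig_row_id` and the variable `df_remaining` are assumptions introduced for illustration only.

```python
import pandas as pd

# Toy stand-in for df_full; the notebook loads the real data from DATA_PATH.
df_full = pd.DataFrame({"review_text": [f"review {i}" for i in range(20)]})

SAMPLE_PERCENTAGE = 0.1
RANDOM_STATE = 42

# Sample WITHOUT resetting the index so the original row labels survive.
sample_size = int(len(df_full) * SAMPLE_PERCENTAGE)
df_sample = df_full.sample(n=sample_size, random_state=RANDOM_STATE)

# Rows that were not drawn into the manual-labeling sample.
df_remaining = df_full.drop(index=df_sample.index)

# Move the original index into a column (hypothetical name) before re-indexing.
df_sample = df_sample.rename_axis("orig_row_id").reset_index()

print(len(df_sample), len(df_remaining))
```

With the original row id stored as a column, annotations exported from Label Studio could later be joined back onto `df_full`.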


## 🔧 3. Label Studio Configuration


In [4]:
# Label Studio Configuration
LABEL_STUDIO_URL = "http://localhost:8080"
API_KEY = "YOUR_LABEL_STUDIO_API_KEY"  # Replace with your actual API key; never commit a real token

# Multi-label classification configuration
LABEL_CONFIG = """
<View>
  <Header value="Review Classification Task"/>
  
  <Text name="review" value="$review_text" granularity="word"/>
  
  <Header value="Select all applicable labels:"/>
  <Choices name="classification" toName="review" choice="multiple">
    <Choice value="Advertisement" hint="Contains promotional content, contact info, or marketing language"/>
    <Choice value="Irrelevant" hint="Off-topic content not related to the business/service"/>
    <Choice value="Fake_Rant" hint="Complaints from users who likely never visited the business"/>
  </Choices>
  
  <Header value="Additional Context:"/>
  <Text name="metadata" value="$metadata" granularity="word"/>
</View>
"""

print("🔧 Label Studio Configuration:")
print(f"   URL: {LABEL_STUDIO_URL}")
print(f"   Labels: Advertisement, Irrelevant, Fake_Rant")
print(f"   Type: Multi-label classification")
print("\n⚠️  Make sure Label Studio is running: label-studio start")

🔧 Label Studio Configuration:
   URL: http://localhost:8080
   Labels: Advertisement, Irrelevant, Fake_Rant
   Type: Multi-label classification

⚠️  Make sure Label Studio is running: label-studio start
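
The rule extraction step later looks for results whose `from_name` is `classification`, so the labeling config and the parsing code have to agree on names. The sketch below parses a trimmed copy of the config with the standard library and lists the control name, labels, and task data keys. The helper `summarize_label_config` is a hypothetical name, not part of the Label Studio SDK.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the notebook's LABEL_CONFIG (headers and hints omitted).
LABEL_CONFIG = """
<View>
  <Text name="review" value="$review_text"/>
  <Choices name="classification" toName="review" choice="multiple">
    <Choice value="Advertisement"/>
    <Choice value="Irrelevant"/>
    <Choice value="Fake_Rant"/>
  </Choices>
  <Text name="metadata" value="$metadata"/>
</View>
"""


def summarize_label_config(config: str) -> dict:
    """Hypothetical helper: extract the names the rest of the pipeline relies on."""
    root = ET.fromstring(config.strip())
    choices = root.find("Choices")
    return {
        "control_name": choices.get("name"),
        "to_name": choices.get("toName"),
        "multi_label": choices.get("choice") == "multiple",
        "labels": [c.get("value") for c in choices.findall("Choice")],
        # Keys that every imported task must provide under "data".
        "data_keys": [t.get("value").lstrip("$") for t in root.iter("Text")],
    }


summary = summarize_label_config(LABEL_CONFIG)
print(summary)
```

Running a check like this before creating the project can catch a renamed control or a missing data key early.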


## 🏷️ 4. Label Studio Integration


In [5]:
class LabelStudioManager:
    """Manager class for Label Studio operations"""

    def __init__(self, url: str, api_key: str):
        self.url = url
        self.api_key = api_key
        self.headers = {
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        }
        self.client = None

    def connect(self):
        """Test connection to Label Studio"""
        try:
            response = requests.get(f"{self.url}/api/version", headers=self.headers)
            if response.status_code == 200:
                print(f"✅ Connected to Label Studio: {response.json()}")
                self.client = Client(url=self.url, api_key=self.api_key)
                return True
            else:
                print(f"❌ Connection failed: {response.status_code}")
                return False
        except Exception as e:
            print(f"❌ Connection error: {e}")
            return False

    def create_project(self, title: str, label_config: str) -> int:
        """Create a new Label Studio project"""
        payload = {
            "title": title,
            "label_config": label_config,
            "expert_instruction": "Label reviews with applicable categories: Advertisement, Irrelevant, or Fake_Rant",
        }

        response = requests.post(
            f"{self.url}/api/projects/", headers=self.headers, data=json.dumps(payload)
        )

        if response.status_code == 201:
            project_id = response.json()["id"]
            print(f"✅ Project created with ID: {project_id}")
            return project_id
        else:
            print(
                f"❌ Project creation failed: {response.status_code} - {response.text}"
            )
            return None

    def prepare_tasks(self, df: pd.DataFrame) -> List[Dict]:
        """Convert DataFrame to Label Studio task format"""
        tasks = []

        for idx, row in df.iterrows():
            # Create metadata string from all non-text columns
            metadata_fields = []
            for col, val in row.items():
                if col != "review_text" and pd.notna(val):
                    metadata_fields.append(f"{col}: {val}")

            metadata_str = " | ".join(metadata_fields)

            task = {
                "data": {
                    "review_text": str(row.get("review_text", "")),
                    "metadata": metadata_str,
                    # Index within the DataFrame passed in. df_sample was
                    # re-indexed after sampling, so this is not the df_full row.
                    "row_id": idx,
                }
            }
            tasks.append(task)

        print(f"✅ Prepared {len(tasks)} tasks for labeling")
        return tasks

    def import_tasks(self, project_id: int, tasks: List[Dict]) -> bool:
        """Import tasks to Label Studio project"""
        response = requests.post(
            f"{self.url}/api/projects/{project_id}/import",
            headers=self.headers,
            data=json.dumps(tasks),
        )

        if response.status_code in [200, 201]:
            print(f"✅ Successfully imported {len(tasks)} tasks")
            return True
        else:
            print(f"❌ Task import failed: {response.status_code} - {response.text}")
            return False

    def get_annotations(self, project_id: int) -> List[Dict]:
        """Retrieve annotations from project"""
        response = requests.get(
            f"{self.url}/api/projects/{project_id}/export",
            headers=self.headers,
            params={"exportType": "JSON"},
        )

        if response.status_code == 200:
            annotations = response.json()
            print(f"✅ Retrieved {len(annotations)} annotations")
            return annotations
        else:
            print(f"❌ Failed to retrieve annotations: {response.status_code}")
            return []


# Initialize Label Studio Manager
ls_manager = LabelStudioManager(LABEL_STUDIO_URL, API_KEY)
print("🔧 Label Studio Manager initialized")

🔧 Label Studio Manager initialized
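
`import_tasks` sends every task in a single POST request. With a sample of tens of thousands of reviews, it may be more robust to send smaller batches. Whether a single large request works depends on the server setup, so treat this as an optional sketch. The helper `chunk_tasks` and the batch size of 1000 are assumptions for illustration.

```python
from typing import Dict, Iterator, List


def chunk_tasks(tasks: List[Dict], chunk_size: int = 1000) -> Iterator[List[Dict]]:
    """Hypothetical helper: yield consecutive slices of at most chunk_size tasks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tasks), chunk_size):
        yield tasks[start:start + chunk_size]


# Toy tasks in the same shape that prepare_tasks produces.
tasks = [{"data": {"review_text": f"review {i}", "row_id": i}} for i in range(2500)]

chunk_sizes = [len(chunk) for chunk in chunk_tasks(tasks)]
print(chunk_sizes)  # [1000, 1000, 500]
```

Each chunk could then be passed to `ls_manager.import_tasks(project_id, chunk)`, stopping at the first batch that returns `False`.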


In [6]:
# Test connection and create project
print("🔌 Testing Label Studio connection...")

if ls_manager.connect():
    # Create project for manual labeling
    project_title = f"Review Classification - {time.strftime('%Y%m%d_%H%M%S')}"
    project_id = ls_manager.create_project(project_title, LABEL_CONFIG)

    if project_id:
        # Prepare and import sample tasks
        print(f"\n📤 Preparing tasks for manual labeling...")
        tasks = ls_manager.prepare_tasks(df_sample)

        if ls_manager.import_tasks(project_id, tasks):
            print(f"\n🎉 Setup Complete!")
            print(f"   Project ID: {project_id}")
            print(f"   Tasks imported: {len(tasks)}")
            print(f"   Access URL: {LABEL_STUDIO_URL}/projects/{project_id}")
            print(f"\n👉 Next Steps:")
            print(f"   1. Go to Label Studio UI and label the sample data")
            print(f"   2. Run the next cell to extract labeling rules")
            print(f"   3. Apply rules to auto-label the full dataset")
        else:
            print("❌ Failed to import tasks")
    else:
        print("❌ Failed to create project")
else:
    print("❌ Could not connect to Label Studio")
    print("\n🔧 Troubleshooting:")
    print("   1. Make sure Label Studio is running: label-studio start")
    print("   2. Check if the URL is correct")
    print("   3. Verify your API key is valid")

🔌 Testing Label Studio connection...
✅ Connected to Label Studio: {'release': '1.20.0', 'label-studio-os-package': {'version': '1.20.0', 'short_version': '1.20', 'latest_version_from_pypi': '1.20.0', 'latest_version_upload_time': '2025-07-01T07:29:54', 'current_version_is_outdated': False}, 'label-studio-os-backend': {'message': 'fix: FIT-306: Zooming out the page breaks audio rendering of the wavef ...', 'commit': 'fb90125beebf3f951d844194cb401cd22d8f18b9', 'date': '2025/06/27 07:31:54', 'branch': '', 'version': '1.20.0+0.gfb90125'}, 'label-studio-frontend': {'message': 'fix: FIT-306: Zooming out the page breaks audio rendering of the wavef ...', 'commit': 'fb9012', 'date': '2025-06-27T12:31:54.000Z', 'branch': 'develop'}, 'dm2': {'message': 'fix: FIT-306: Zooming out the page breaks audio rendering of the wavef ...', 'commit': 'fb9012', 'date': '2025-06-27T12:31:54.000Z', 'branch': 'develop'}, 'label-studio-converter': {'version': '1.0.18'}, 'edition': 'Community', 'lsf': {'message':

## 🤖 5. Extract Labeling Rules from Manual Annotations


In [7]:
class RuleExtractor:
    """Extract labeling rules from manually labeled data"""

    def __init__(self):
        self.rules = {"Advertisement": [], "Irrelevant": [], "Fake_Rant": []}

    def extract_rules_from_annotations(self, annotations: List[Dict]) -> Dict:
        """Extract pattern-based rules from labeled data"""
        labeled_data = {"Advertisement": [], "Irrelevant": [], "Fake_Rant": []}

        # Process annotations to extract labeled examples
        for annotation in annotations:
            if "annotations" in annotation and len(annotation["annotations"]) > 0:
                review_text = annotation["data"]["review_text"]
                labels = annotation["annotations"][0]["result"]

                # Extract selected labels
                selected_labels = []
                for label in labels:
                    if label["from_name"] == "classification":
                        selected_labels.extend(label["value"]["choices"])

                # Add to appropriate categories
                for category in ["Advertisement", "Irrelevant", "Fake_Rant"]:
                    if category in selected_labels:
                        labeled_data[category].append(review_text.lower())

        print(f"📊 Labeled Data Distribution:")
        for category, texts in labeled_data.items():
            print(f"   {category}: {len(texts)} examples")

        # Extract keyword-based rules
        self._extract_keyword_rules(labeled_data)

        return self.rules

    def _extract_keyword_rules(self, labeled_data: Dict[str, List[str]]):
        """Extract keyword patterns from labeled examples"""

        # Advertisement patterns
        ad_patterns = [
            r"\b(call|phone|contact)\b",
            r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",  # Phone numbers
            r"\b(visit|website|www|http)\b",
            r"\b(discount|sale|promo|deal|offer|coupon)\b",
            r"\b(free|limited|special)\b",
            r"@\w+\.[a-z]+",  # Email patterns
        ]

        # Irrelevant patterns
        irrelevant_patterns = [
            r"\b(weather|traffic|politics|government)\b",
            r"\b(my car|my phone|my house)\b",
            r"\b(news|television|movie|sports)\b",
            r"\b(unrelated|off.topic|nothing to do)\b",
        ]

        # Fake rant patterns
        fake_rant_patterns = [
            r"\b(never been|never visited|never went)\b",
            r"\b(heard|looks like|seems like|probably)\b",
            r"\b(all these places|these types|hate these)\b",
            r"\b(avoid|stay away|waste of time)\b",
        ]

        self.rules = {
            "Advertisement": ad_patterns,
            "Irrelevant": irrelevant_patterns,
            "Fake_Rant": fake_rant_patterns,
        }

        # Enhance rules based on actual labeled data
        self._enhance_rules_from_examples(labeled_data)

        print(f"\n🔍 Extracted Rules:")
        for category, patterns in self.rules.items():
            print(f"   {category}: {len(patterns)} patterns")

    def _enhance_rules_from_examples(self, labeled_data: Dict[str, List[str]]):
        """Enhance rules by analyzing common words in labeled examples"""
        from collections import Counter

        for category, texts in labeled_data.items():
            if len(texts) > 0:
                # Extract common words from this category
                all_words = []
                for text in texts:
                    words = re.findall(r"\b\w+\b", text.lower())
                    all_words.extend(words)

                # Find most common words (excluding common stop words)
                stop_words = {
                    "the",
                    "a",
                    "an",
                    "and",
                    "or",
                    "but",
                    "in",
                    "on",
                    "at",
                    "to",
                    "for",
                    "of",
                    "with",
                    "by",
                    "is",
                    "was",
                    "are",
                    "were",
                    "be",
                    "been",
                    "have",
                    "has",
                    "had",
                    "do",
                    "does",
                    "did",
                    "will",
                    "would",
                    "could",
                    "should",
                    "this",
                    "that",
                    "they",
                    "them",
                    "their",
                }

                word_counts = Counter(
                    [w for w in all_words if len(w) > 3 and w not in stop_words]
                )
                top_words = word_counts.most_common(5)

                # Add high-frequency words as patterns
                for word, count in top_words:
                    if count >= 2:  # Only if appears in multiple examples
                        pattern = f"\\b{re.escape(word)}\\b"
                        if pattern not in self.rules[category]:
                            self.rules[category].append(pattern)


# Initialize rule extractor
rule_extractor = RuleExtractor()
print("🤖 Rule Extractor initialized")

🤖 Rule Extractor initialized
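
`extract_rules_from_annotations` assumes a specific shape for the exported JSON: a list of tasks, each with a `data` dict and an `annotations` list whose first entry holds the `result` items. The sketch below reproduces that grouping logic on hand-written example tasks so the expected shape is visible. The example reviews and the function name `group_by_label` are made up for illustration.

```python
from typing import Dict, List

CATEGORIES = ["Advertisement", "Irrelevant", "Fake_Rant"]

# Hand-written tasks in the shape the extractor expects (not real export data).
exported = [
    {
        "data": {"review_text": "Call 555-123-4567 for a FREE quote!"},
        "annotations": [{"result": [{
            "from_name": "classification",
            "to_name": "review",
            "type": "choices",
            "value": {"choices": ["Advertisement"]},
        }]}],
    },
    # Annotated, but no label selected.
    {"data": {"review_text": "Great food."}, "annotations": [{"result": []}]},
    # Not annotated yet: skipped.
    {"data": {"review_text": "Not labeled yet."}, "annotations": []},
]


def group_by_label(tasks: List[Dict]) -> Dict[str, List[str]]:
    """Group lower-cased review texts by selected label, first annotation only."""
    grouped: Dict[str, List[str]] = {c: [] for c in CATEGORIES}
    for task in tasks:
        if not task.get("annotations"):
            continue
        result = task["annotations"][0]["result"]
        selected = [
            choice
            for item in result
            if item.get("from_name") == "classification"
            for choice in item["value"]["choices"]
        ]
        for category in CATEGORIES:
            if category in selected:
                grouped[category].append(task["data"]["review_text"].lower())
    return grouped


grouped = group_by_label(exported)
print({k: len(v) for k, v in grouped.items()})
```

Note that only the first annotation per task is read, so if several annotators label the same review, the later annotations are ignored by this logic.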


In [8]:
# Extract rules from manual annotations
print("📥 Retrieving manual annotations...")

# Note: Make sure you have labeled some data in Label Studio before running this
try:
    annotations = ls_manager.get_annotations(project_id)

    if len(annotations) > 0:
        print(f"\n🔍 Extracting labeling rules from {len(annotations)} annotations...")
        extracted_rules = rule_extractor.extract_rules_from_annotations(annotations)

        print(f"\n📋 Final Rules Summary:")
        for category, patterns in extracted_rules.items():
            print(f"\n   🎯 {category} ({len(patterns)} rules):")
            for i, pattern in enumerate(patterns[:5], 1):  # Show first 5 rules
                print(f"      {i}. {pattern}")
            if len(patterns) > 5:
                print(f"      ... and {len(patterns) - 5} more")

        print(f"\n✅ Rules extracted successfully!")
        print(f"📊 Ready to apply to full dataset ({len(df_full):,} rows)")

    else:
        print("⚠️  No annotations found. Please label some data in Label Studio first.")
        print(f"   Go to: {LABEL_STUDIO_URL}/projects/{project_id}")

        # Use default rules if no annotations
        print("\n🔧 Using default rule patterns...")
        extracted_rules = rule_extractor.rules

except Exception as e:
    print(f"❌ Error retrieving annotations: {e}")
    # Use default rules
    extracted_rules = rule_extractor.rules
    print("🔧 Using default rule patterns...")

📥 Retrieving manual annotations...
❌ Failed to retrieve annotations: 404
⚠️  No annotations found. Please label some data in Label Studio first.
   Go to: http://localhost:8080/projects/None

🔧 Using default rule patterns...
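
In this run the fallback assigns `rule_extractor.rules` as set in `__init__`, which holds three empty lists. The hard-coded patterns are only installed inside `_extract_keyword_rules`, so the "default" rule set is actually empty. A minimal sketch of an explicit fallback follows. The names `DEFAULT_RULES` and `rules_or_default` are hypothetical, and the patterns are a subset copied from `_extract_keyword_rules`.

```python
import re
from typing import Dict, List

# Subset of the patterns hard-coded in RuleExtractor._extract_keyword_rules.
DEFAULT_RULES: Dict[str, List[str]] = {
    "Advertisement": [
        r"\b(call|phone|contact)\b",
        r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",  # Phone numbers
        r"\b(discount|sale|promo|deal|offer|coupon)\b",
    ],
    "Irrelevant": [r"\b(weather|traffic|politics|government)\b"],
    "Fake_Rant": [r"\b(never been|never visited|never went)\b"],
}


def rules_or_default(rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return rules unchanged if any category has a pattern, else a copy of the defaults."""
    if any(rules.values()):
        return rules
    return {category: list(patterns) for category, patterns in DEFAULT_RULES.items()}


empty_rules = {"Advertisement": [], "Irrelevant": [], "Fake_Rant": []}
active_rules = rules_or_default(empty_rules)
total = sum(len(patterns) for patterns in active_rules.values())
print(f"Active patterns: {total}")
```

Wrapping the fallback assignment as `extracted_rules = rules_or_default(rule_extractor.rules)` would give the auto-labeler something to match against when no annotations are available.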


## ⚡ 6. Auto-Label Full Dataset


In [9]:
class AutoLabeler:
    """Apply extracted rules to automatically label the full dataset"""

    def __init__(self, rules: Dict[str, List[str]]):
        self.rules = rules
        self.compiled_patterns = {}
        self._compile_patterns()

    def _compile_patterns(self):
        """Compile regex patterns for better performance"""
        for category, patterns in self.rules.items():
            compiled = []
            for pattern in patterns:
                try:
                    compiled.append(re.compile(pattern, re.IGNORECASE))
                except re.error:
                    print(f"⚠️  Invalid regex pattern skipped: {pattern}")
            self.compiled_patterns[category] = compiled

        print(
            f"✅ Compiled {sum(len(p) for p in self.compiled_patterns.values())} regex patterns"
        )

    def label_text(self, text: str) -> Dict[str, bool]:
        """Apply rules to label a single text"""
        labels = {"Advertisement": False, "Irrelevant": False, "Fake_Rant": False}

        if pd.isna(text) or text == "":
            return labels

        text_lower = str(text).lower()

        for category, patterns in self.compiled_patterns.items():
            for pattern in patterns:
                if pattern.search(text_lower):
                    labels[category] = True
                    break  # One match is enough for this category

        return labels

    def label_dataframe(
        self, df: pd.DataFrame, text_column: str = "review_text"
    ) -> pd.DataFrame:
        """Apply auto-labeling to entire dataframe"""
        print(f"🚀 Starting auto-labeling on {len(df):,} rows...")

        # Initialize result columns
        df_labeled = df.copy()
        df_labeled["advertisement"] = False
        df_labeled["irrelevant"] = False
        df_labeled["fake_rant"] = False

        # Apply labeling rules
        batch_size = 1000
        total_batches = (len(df) + batch_size - 1) // batch_size

        for i in range(0, len(df), batch_size):
            batch_end = min(i + batch_size, len(df))
            batch_num = i // batch_size + 1

            print(
                f"   Processing batch {batch_num}/{total_batches} ({i+1}-{batch_end})..."
            )

            for idx in range(i, batch_end):
                text = df.iloc[idx][text_column]
                labels = self.label_text(text)

                df_labeled.iloc[idx, df_labeled.columns.get_loc("advertisement")] = (
                    labels["Advertisement"]
                )
                df_labeled.iloc[idx, df_labeled.columns.get_loc("irrelevant")] = labels[
                    "Irrelevant"
                ]
                df_labeled.iloc[idx, df_labeled.columns.get_loc("fake_rant")] = labels[
                    "Fake_Rant"
                ]

        # Calculate statistics
        stats = {
            "advertisement": df_labeled["advertisement"].sum(),
            "irrelevant": df_labeled["irrelevant"].sum(),
            "fake_rant": df_labeled["fake_rant"].sum(),
        }

        print(f"\n✅ Auto-labeling completed!")
        print(f"\n📊 Labeling Results:")
        for label, count in stats.items():
            percentage = (count / len(df_labeled)) * 100
            print(f"   {label}: {count:,} ({percentage:.1f}%)")

        return df_labeled


# Initialize auto-labeler
auto_labeler = AutoLabeler(extracted_rules)
print("⚡ Auto-labeler initialized with extracted rules")

✅ Compiled 0 regex patterns
⚡ Auto-labeler initialized with extracted rules
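
`label_text` tries each pattern of a category in turn and stops at the first match. An alternative, sketched here as an option rather than a replacement for the notebook's approach, is to join all patterns of a category into one alternation so each text is scanned once per category. The helpers `compile_combined` and `label_text_combined` are hypothetical names. The sketch assumes no pattern uses numbered backreferences, since joining patterns renumbers capture groups, and one invalid pattern would make the whole combined expression fail to compile.

```python
import re
from typing import Dict, List, Optional, Pattern


def compile_combined(rules: Dict[str, List[str]]) -> Dict[str, Optional[Pattern[str]]]:
    """Join each category's patterns into a single case-insensitive alternation."""
    combined: Dict[str, Optional[Pattern[str]]] = {}
    for category, patterns in rules.items():
        if patterns:
            joined = "|".join(f"(?:{p})" for p in patterns)
            combined[category] = re.compile(joined, re.IGNORECASE)
        else:
            combined[category] = None  # No rules: the category can never match
    return combined


def label_text_combined(text: str, combined: Dict[str, Optional[Pattern[str]]]) -> Dict[str, bool]:
    """Return one boolean per category for a single text."""
    return {
        category: bool(rx is not None and rx.search(text))
        for category, rx in combined.items()
    }


example_rules = {
    "Advertisement": [r"\b(call|phone|contact)\b", r"\b(discount|sale)\b"],
    "Irrelevant": [],
    "Fake_Rant": [r"\b(never been|never visited)\b"],
}
combined = compile_combined(example_rules)

ad_labels = label_text_combined("CALL now for a discount", combined)
rant_labels = label_text_combined("I have never been here", combined)
print(ad_labels)
print(rant_labels)
```

A category with an empty pattern list always yields `False`, which is also what happens for every row when the labeler is built from an empty rule set.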


In [10]:
# Apply auto-labeling to full dataset
print(f"🎯 Applying auto-labeling to full dataset...")
print(f"📊 Dataset size: {len(df_full):,} rows")

# Apply auto-labeling
df_auto_labeled = auto_labeler.label_dataframe(df_full, "review_text")

print(f"\n🎉 Auto-labeling pipeline completed!")

# Display sample results
print(f"\n📋 Sample Auto-Labeled Results:")
sample_results = df_auto_labeled[
    ["review_text", "advertisement", "irrelevant", "fake_rant"]
].head(10)

for idx, row in sample_results.iterrows():
    labels = []
    if row["advertisement"]:
        labels.append("Advertisement")
    if row["irrelevant"]:
        labels.append("Irrelevant")
    if row["fake_rant"]:
        labels.append("Fake_Rant")

    label_str = ", ".join(labels) if labels else "Clean"

    print(f"\n   Row {idx}:")
    print(f"   📝 Review: '{row['review_text'][:80]}...'")
    print(f"   🏷️  Labels: {label_str}")

# Calculate final statistics
total_rows = len(df_auto_labeled)
labeled_rows = (
    df_auto_labeled[["advertisement", "irrelevant", "fake_rant"]].any(axis=1).sum()
)
clean_rows = total_rows - labeled_rows

print(f"\n📈 Final Statistics:")
print(f"   Total rows processed: {total_rows:,}")
print(f"   Rows with violations: {labeled_rows:,} ({labeled_rows/total_rows*100:.1f}%)")
print(f"   Clean rows: {clean_rows:,} ({clean_rows/total_rows*100:.1f}%)")

🎯 Applying auto-labeling to full dataset...
📊 Dataset size: 347,087 rows
🚀 Starting auto-labeling on 347,087 rows...
   Processing batch 1/348 (1-1000)...
   Processing batch 2/348 (1001-2000)...
   Processing batch 3/348 (2001-3000)...
   Processing batch 4/348 (3001-4000)...
   Processing batch 5/348 (4001-5000)...
   Processing batch 6/348 (5001-6000)...
   Processing batch 7/348 (6001-7000)...
   Processing batch 8/348 (7001-8000)...
   Processing batch 9/348 (8001-9000)...
   Processing batch 10/348 (9001-10000)...
   Processing batch 11/348 (10001-11000)...
   Processing batch 12/348 (11001-12000)...
   Processing batch 13/348 (12001-13000)...
   Processing batch 14/348 (13001-14000)...
   Processing batch 15/348 (14001-15000)...
   Processing batch 16/348 (15001-16000)...
   Processing batch 17/348 (16001-17000)...
   Processing batch 18/348 (17001-18000)...
   Processing batch 19/348 (18001-19000)...
   Processing batch 20/348 (19001-20000)...
   Processing batch 21/348 (20001-

## 💾 7. Save Results


In [11]:
# Save auto-labeled dataset
import os
from datetime import datetime

# Create output directory
output_dir = "../outputs/auto_labeled"
os.makedirs(output_dir, exist_ok=True)

# Generate filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_filename = f"auto_labeled_reviews_{timestamp}.csv"
output_path = os.path.join(output_dir, output_filename)

# Save the labeled dataset
df_auto_labeled.to_csv(output_path, index=False)

print(f"💾 Auto-labeled dataset saved:")
print(f"   📁 Path: {output_path}")
print(f"   📊 Size: {len(df_auto_labeled):,} rows")
print(f"   🏷️  Columns: {df_auto_labeled.columns.tolist()}")

# Save rules for future use
rules_filename = f"labeling_rules_{timestamp}.json"
rules_path = os.path.join(output_dir, rules_filename)

with open(rules_path, "w") as f:
    json.dump(extracted_rules, f, indent=2)

print(f"\n📋 Labeling rules saved:")
print(f"   📁 Path: {rules_path}")

# Create summary report
summary = {
    "timestamp": timestamp,
    "original_dataset": DATA_PATH,
    "total_rows_processed": len(df_auto_labeled),
    "sample_size_for_manual_labeling": len(df_sample),
    "sample_percentage": SAMPLE_PERCENTAGE,
    "label_statistics": {
        "advertisement": int(df_auto_labeled["advertisement"].sum()),
        "irrelevant": int(df_auto_labeled["irrelevant"].sum()),
        "fake_rant": int(df_auto_labeled["fake_rant"].sum()),
        "clean": int(
            (
                ~df_auto_labeled[["advertisement", "irrelevant", "fake_rant"]].any(
                    axis=1
                )
            ).sum()
        ),
    },
    "rules_count": {
        category: len(patterns) for category, patterns in extracted_rules.items()
    },
    "output_files": {"labeled_data": output_path, "rules": rules_path},
}

summary_path = os.path.join(output_dir, f"summary_{timestamp}.json")
with open(summary_path, "w") as f:
    json.dump(summary, f, indent=2)

print(f"\n📊 Summary report saved: {summary_path}")

print(f"\n🎉 AUTO-LABELING PIPELINE COMPLETE!")
print(f"\n📋 Workflow Summary:")
print(f"   1. ✅ Loaded {len(df_full):,} reviews from {DATA_PATH}")
print(
    f"   2. ✅ Sampled {len(df_sample):,} rows ({SAMPLE_PERCENTAGE*100}%) for manual labeling"
)
print(f"   3. ✅ Sent sample to Label Studio for manual labeling")
print(
    f"   4. ✅ Extracted {sum(len(p) for p in extracted_rules.values())} labeling rules"
)
print(f"   5. ✅ Applied rules to auto-label {len(df_auto_labeled):,} rows")
print(f"   6. ✅ Saved results to {output_dir}")

print(f"\n🚀 Ready for ML model training with labeled data!")

💾 Auto-labeled dataset saved:
   📁 Path: ../outputs/auto_labeled/auto_labeled_reviews_20250829_153319.csv
   📊 Size: 347,087 rows
   🏷️  Columns: ['user_id', 'user_name', 'review_time', 'rating', 'review_text', 'pics', 'resp', 'gmap_id', 'has_resp', 'resp_text', 'resp_time', 'biz_name', 'description', 'category', 'avg_rating', 'num_of_reviews', 'price_level', 'advertisement', 'irrelevant', 'fake_rant']

📋 Labeling rules saved:
   📁 Path: ../outputs/auto_labeled/labeling_rules_20250829_153319.json

📊 Summary report saved: ../outputs/auto_labeled/summary_20250829_153319.json

🎉 AUTO-LABELING PIPELINE COMPLETE!

📋 Workflow Summary:
   1. ✅ Loaded 347,087 reviews from ../data/cleaned_google_reviews.csv
   2. ✅ Sampled 34,708 rows (10.0%) for manual labeling
   3. ✅ Sent sample to Label Studio for manual labeling
   4. ✅ Extracted 0 labeling rules
   5. ✅ Applied rules to auto-label 347,087 rows
   6. ✅ Saved results to ../outputs/auto_labeled

🚀 Ready for ML model training with labeled dat
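
The rules are saved as plain JSON, so a later session can reload them without rerunning the extraction. Below is a minimal round-trip sketch that writes example rules to a temporary directory, reads them back, and recompiles them. The file name and the example patterns are placeholders, not the timestamped files produced above.

```python
import json
import os
import re
import tempfile

# Example rules; in the notebook these would come from extracted_rules.
rules = {
    "Advertisement": [r"\b(call|phone|contact)\b"],
    "Irrelevant": [],
    "Fake_Rant": [r"\b(never been|never visited|never went)\b"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    rules_path = os.path.join(tmp_dir, "labeling_rules_example.json")

    # Same call the notebook uses to persist the rules.
    with open(rules_path, "w", encoding="utf-8") as f:
        json.dump(rules, f, indent=2)

    # Reload in a "later session".
    with open(rules_path, encoding="utf-8") as f:
        loaded = json.load(f)

# Rebuild compiled patterns from the reloaded strings.
compiled = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in loaded.items()
}
print({category: len(patterns) for category, patterns in compiled.items()})
```

The reloaded dictionary has the same structure as `extracted_rules`, so it could be passed directly to `AutoLabeler`.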