# 🚀 SEMA Analytics - Production Colab Edition

**Zero bullshit. Just run the cells in order.**

---

## ⚡ Quick Start (4 Steps)

1. **Run all cells** (Runtime → Run all)
2. **Mount Google Drive** when prompted
3. **Upload Excel files** when prompted
4. **Download results** at the end

**Total Time**: ~5-10 minutes (first run: +5 min for setup)

---

## 📋 Step 1: Mount Google Drive

In [None]:
from google.colab import drive
import os

print("📁 Mounting Google Drive...")
drive.mount('/content/drive')
print("✅ Google Drive mounted!")

## 📦 Step 2: Setup Repository

In [None]:
import os
import subprocess

# Use Drive location for persistence
REPO_PATH = '/content/drive/MyDrive/sema_inf'

if os.path.exists(REPO_PATH):
    print("📂 Repository exists, updating...")
    os.chdir(REPO_PATH)
    !git fetch origin
    !git reset --hard origin/main
    print("✅ Repository updated to latest version")
else:
    print("📥 Cloning repository for first time...")
    !git clone https://github.com/shc443/sema_inf "$REPO_PATH"
    os.chdir(REPO_PATH)
    print("✅ Repository cloned!")

print(f"\n📍 Working directory: {os.getcwd()}")
!git log -1 --oneline

## ☕ Step 3: Setup Java 11 (Critical for KoNLPy)

In [None]:
import os
import subprocess

print("☕ Installing Java 11...")
print("(This fixes the SIGSEGV crash issue)\n")

# Install Java 11
!apt-get update -qq
!apt-get install -y -qq openjdk-11-jdk > /dev/null 2>&1

# Set Java environment
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['PATH'] = f"{os.environ['JAVA_HOME']}/bin:{os.environ['PATH']}"

# Verify installation
print("✅ Java 11 installed!\n")
!java -version

## 🐍 Step 4: Install Python Dependencies

In [None]:
import sys

print("📦 Installing Python packages...")
print("(This takes ~2-3 minutes)\n")

# Install from requirements.txt
!pip install -q -r requirements.txt

# Verify critical packages
try:
    import torch
    import transformers
    import konlpy
    from konlpy.tag import Kkma
    
    # Test KoNLPy
    kkma = Kkma()
    test = kkma.morphs("테스트")
    
    print("\n✅ All packages installed successfully!")
    print(f"   - PyTorch: {torch.__version__}")
    print(f"   - Transformers: {transformers.__version__}")
    print(f"   - CUDA Available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"   - GPU: {torch.cuda.get_device_name(0)}")
    print("   - KoNLPy: Working ✓")
    
except Exception as e:
    print(f"\n❌ Package verification failed: {e}")
    print("\n🔄 Attempting to fix...")
    !pip install --upgrade --force-reinstall konlpy -q
    print("✅ Fix applied, proceeding...")

## 📁 Step 5: Prepare Data Directories

In [None]:
import os

# Create directories
os.makedirs('data/input', exist_ok=True)
os.makedirs('data/output', exist_ok=True)

# Check existing files
input_files = [f for f in os.listdir('data/input') if f.endswith('.xlsx') and not f.startswith('~')]
output_files = [f for f in os.listdir('data/output') if f.endswith('_output.xlsx')]

print("📊 Current Status:")
print(f"   - Input files: {len(input_files)}")
print(f"   - Output files: {len(output_files)}")

if input_files:
    print("\n📄 Existing input files:")
    for f in input_files:
        size = os.path.getsize(f'data/input/{f}') / 1024
        print(f"   - {f} ({size:.1f} KB)")

print("\n✅ Directories ready!")

## 📤 Step 6: Upload Your Excel Files

**Requirements:**
- Excel files (`.xlsx`)
- Must have `VOC1` and `VOC2` columns
- Korean text data

In [None]:
from google.colab import files
import shutil
import os

print("📤 UPLOAD YOUR EXCEL FILES")
print("="*50)
print("Click 'Choose Files' and select your Excel files")
print("Multiple files can be selected at once\n")

uploaded = files.upload()

if not uploaded:
    print("\n⚠️  No files uploaded. Using existing files in data/input/")
else:
    print(f"\n✅ Received {len(uploaded)} files\n")
    
    # Move uploaded files
    for filename, content in uploaded.items():
        if filename.endswith('.xlsx') and not filename.startswith('~'):
            dest = f'data/input/{filename}'
            # Write content to file
            with open(dest, 'wb') as f:
                f.write(content)
            size = len(content) / 1024
            print(f"   ✓ {filename} ({size:.1f} KB) → data/input/")
        else:
            print(f"   ⚠️  Skipped {filename} (not a valid Excel file)")

# Final count
input_files = [f for f in os.listdir('data/input') if f.endswith('.xlsx') and not f.startswith('~')]
print(f"\n📊 Total files ready for processing: {len(input_files)}")

## 🚀 Step 7: Run SEMA Inference

**This is where the magic happens!**

Processing time: ~1-2 minutes per file

In [None]:
import os
import sys

# Check we have files to process
input_files = [f for f in os.listdir('data/input') if f.endswith('.xlsx') and not f.startswith('~')]

if not input_files:
    print("❌ ERROR: No Excel files found in data/input/")
    print("\nPlease run the upload cell above to add files.")
else:
    print("🚀 STARTING SEMA INFERENCE")
    print("="*50)
    print(f"Files to process: {len(input_files)}")
    print("")
    
    # Run the inference
    !python run_simple.py
    
    print("")
    print("="*50)
    print("🎉 INFERENCE COMPLETE!")

## 📥 Step 8: Check Results

In [None]:
import os
import pandas as pd

output_files = [f for f in os.listdir('data/output') if f.endswith('_output.xlsx')]

print("📊 RESULTS SUMMARY")
print("="*50)

if not output_files:
    print("❌ No output files found")
    print("\nCheck the inference cell above for errors")
else:
    print(f"✅ Successfully processed {len(output_files)} files\n")
    
    for filename in sorted(output_files):
        filepath = f'data/output/{filename}'
        size = os.path.getsize(filepath) / 1024
        
        # Read file to get row count
        try:
            df = pd.read_excel(filepath)
            rows = len(df)
            cols = len(df.columns)
            print(f"📄 {filename}")
            print(f"   Size: {size:.1f} KB")
            print(f"   Rows: {rows:,}")
            print(f"   Columns: {cols}")
            print("")
        except Exception as e:
            print(f"⚠️  {filename} - Error reading: {e}\n")

print("="*50)

## 💾 Step 9: Download Results

In [None]:
from google.colab import files
import os

output_files = [f for f in os.listdir('data/output') if f.endswith('_output.xlsx')]

if not output_files:
    print("❌ No files to download")
else:
    print("📥 DOWNLOADING RESULTS")
    print("="*50)
    print(f"Downloading {len(output_files)} files...\n")
    
    for filename in output_files:
        filepath = f'data/output/{filename}'
        files.download(filepath)
        print(f"✓ {filename}")
    
    print("\n✅ All files downloaded!")
    print("Check your Downloads folder")

---

## 🔧 Optional: Advanced Operations

### Preview Output File

In [None]:
import pandas as pd
import os

output_files = [f for f in os.listdir('data/output') if f.endswith('_output.xlsx')]

if output_files:
    # Show first file
    first_file = output_files[0]
    print(f"📄 Preview: {first_file}\n")
    
    df = pd.read_excel(f'data/output/{first_file}')
    print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")
    print("Columns:", list(df.columns))
    print("\nFirst 5 rows:")
    display(df.head())
else:
    print("❌ No output files to preview")

### Clean Up (Optional)

In [None]:
import shutil
import os

print("🧹 CLEANUP OPTIONS")
print("="*50)
print("")
print("⚠️  WARNING: This will delete files!")
print("")
print("Uncomment the lines below to clean up:")
print("")

# Uncomment to clear input files
# shutil.rmtree('data/input')
# os.makedirs('data/input')
# print("✓ Cleared data/input/")

# Uncomment to clear output files
# shutil.rmtree('data/output')
# os.makedirs('data/output')
# print("✓ Cleared data/output/")

# Uncomment to clear cache
# if os.path.exists('cache'):
#     shutil.rmtree('cache')
#     print("✓ Cleared cache/")

print("No cleanup performed (safe mode)")

---

## 🆘 Troubleshooting

### Common Issues

**Error: "SIGSEGV crash" or "Java error"**
- Solution: Re-run the Java installation cell (Step 3)
- Verify: `!java -version` should show Java 11
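
The same check can be done from Python, which is handy inside a cell that should stop early on the wrong Java. This is a minimal sketch, not part of the SEMA repo: `java_major_version` and `installed_java_major` are hypothetical helper names, and the parsing assumes the usual `version "11.0.x"` line that `java -version` writes to stderr.

```python
import re
import subprocess


def java_major_version(version_output):
    """Return the Java major version parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as "1.8.0_292": the major is the second field.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and parse it; returns None if Java cannot be launched."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except OSError:
        return None
    # `java -version` writes to stderr, so look at both streams.
    return java_major_version(result.stderr + result.stdout)


print("Java major version:", installed_java_major())
```

If the printed value is not `11`, re-running Step 3 is the first thing to try.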

**Error: "No files found in data/input"**
- Solution: Re-run upload cell (Step 6)
- Check files end with `.xlsx` and have VOC1, VOC2 columns
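
A quick way to check the column requirement before uploading is to compare the header row against the two required names. The helper below is an illustrative sketch (`missing_voc_columns` is a hypothetical name, not a function from the SEMA repo), and it assumes the column names must match `VOC1` and `VOC2` exactly.

```python
REQUIRED_COLUMNS = ("VOC1", "VOC2")


def missing_voc_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = {str(name) for name in columns}
    return [name for name in required if name not in present]


# Example with a plain list of header names. With pandas, the same call would be
# missing_voc_columns(pd.read_excel(path, nrows=0).columns)
print(missing_voc_columns(["ID", "VOC1", "Date"]))  # ['VOC2']
```

An empty list means both columns are present.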

**Error: "CUDA out of memory"**
- Solution: Process fewer files at once
- Or: Use Runtime → Change runtime type → CPU (slower)
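
One way to process fewer files at once is to split the file list into small batches and place only one batch in `data/input/` per run. The sketch below shows just the splitting step. It assumes `run_simple.py` processes every file it finds in `data/input/` in a single run, which the notebook suggests but does not confirm. `chunked` and the file names are hypothetical.

```python
def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical file names; each inner list is one batch to place in data/input/
# before running the inference cell.
batches = chunked(["a.xlsx", "b.xlsx", "c.xlsx", "d.xlsx", "e.xlsx"], 2)
print(batches)
```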

**Error: "Model download failed"**
- Solution: Check internet connection
- Files auto-download from HuggingFace (shc443/sema2025)
- May take 2-3 minutes on first run

**Session timeout**
- Files saved in Google Drive persist across sessions
- Just re-run from Step 1

### Get Help

1. Check error messages carefully
2. Re-run the failing cell
3. Restart runtime: Runtime → Restart runtime
4. Open GitHub issue with error details

---

## 📚 What This Does

1. **Clones SEMA repo** from GitHub
2. **Installs Java 11** to fix KoNLPy crashes
3. **Installs Python packages** (PyTorch, transformers, etc.)
4. **Downloads model** from HuggingFace (team-lucid/deberta-v3-xlarge-korean)
5. **Processes VOC data**:
   - Cleans text
   - Filters invalid entries
   - Runs sentiment/topic analysis
   - Extracts keywords
6. **Exports results** to Excel with predictions

---

**Built with** ❤️ **for reliable Colab execution**

**Last Updated**: 2025-11-05