# Semantic Analysis Pipeline

This notebook demonstrates semantic text analysis (keyword extraction, theme identification, and category classification) using the project's custom analyzers.

## Setup
Import required packages and configure the environment:



In [1]:
import sys
from pathlib import Path


In [2]:
# Add the project root to the Python path.
# Assumes the notebook runs from a direct subdirectory of the project root.
project_root = str(Path().resolve().parent)
if project_root not in sys.path:
    sys.path.append(project_root)
    print(f"Added {project_root} to Python path")

Added C:\Users\tja\OneDrive - Rastor-instituutti ry\Tiedostot\Rastor-instituutti\kehittäminen\analytiikka\repos\semantic-text-analyzer to Python path


TODO:
- define language
- fix logging
- compound words

### Import notebook helper modules

In [3]:
from src.nb_helpers.environment import setup_notebook_env, verify_environment
from src.nb_helpers.logging import configure_logging
from src.nb_helpers.debug import (
    debug_theme_analysis,
    debug_category_analysis,
    debug_keyword_analysis,
    debug_full_pipeline
)
from src.nb_helpers.excel import analyze_excel_content



In [4]:
# Set up the notebook environment (logging is configured in a later cell)
setup_notebook_env()


In [5]:
verify_environment()


Environment Check Results:

Basic Setup:
-----------
✓ Project root in path
✓ FileUtils initialized
✓ .env file loaded

Environment Variables:
---------------------
✓ OPENAI_API_KEY set
✓ ANTHROPIC_API_KEY set

Project Structure:
-----------------
✓ Raw data exists
✓ Processed data exists
✓ Configuration exists
✓ Main config.yaml exists

Environment Status: Ready ✓


True
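
The internals of `verify_environment` are not shown in this notebook. As a rough illustration of the "Environment Variables" part of the report above, a check along the following lines would be enough. `check_env_vars` and `print_env_report` are hypothetical names introduced here, not part of `src.nb_helpers`.

```python
import os

# Variable names taken from the check output above; extend as needed.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def check_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return {name: bool} saying whether each variable is set and non-empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    return {name: bool(env.get(name, "").strip()) for name in names}


def print_env_report(status):
    """Print one check-mark line per variable, in the spirit of the report above."""
    for name, ok in status.items():
        mark = "✓" if ok else "✗"
        print(f"{mark} {name} {'set' if ok else 'missing'}")


# Pass an explicit mapping to try the check without touching the real environment.
status = check_env_vars(environ={"OPENAI_API_KEY": "dummy-value"})
print_env_report(status)
```

Calling `check_env_vars()` with no arguments reads from `os.environ`, which is where values from a loaded `.env` file normally end up.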

In [11]:
configure_logging(level="DEBUG")
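
What `configure_logging` does internally is not shown here. A minimal sketch of a comparable helper built on the standard `logging` module might look like this. The name `configure_logging_sketch` and the log format are assumptions made for illustration; the real helper may do more.

```python
import logging


def configure_logging_sketch(level="INFO"):
    """Set the root logger level from a level name such as "DEBUG" or "WARNING".

    Hypothetical stand-in for illustration only.
    """
    numeric_level = getattr(logging, level.upper(), None)
    if not isinstance(numeric_level, int):
        raise ValueError(f"Unknown logging level: {level!r}")
    logging.basicConfig(
        level=numeric_level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace handlers left by earlier cells (Python 3.8+)
    )
    return numeric_level


configure_logging_sketch("WARNING")
logging.getLogger("demo").debug("hidden at WARNING level")
logging.getLogger("demo").warning("shown at WARNING level")
```

The `force=True` argument matters in notebooks, where `basicConfig` would otherwise be a no-op once any earlier cell has attached a handler to the root logger.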

## Analysis Functions

### Single Analysis with Debug Output
Run a detailed analysis of a single text:


In [7]:
example_texts = {
    "Business Analysis": """
        Q3 revenue increased by 15% with strong growth in enterprise sales.
        Customer retention improved while acquisition costs decreased.
        New market expansion initiatives are showing positive early results.
    """,
    
    "Technical Content": """
        The application uses microservices architecture with containerized deployments.
        Data processing pipeline incorporates machine learning models for prediction.
        System monitoring ensures high availability and performance metrics.
    """,
    
    "Mixed Content": """
        The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.
    """,
    # Finnish sample: educational content ("koulutussisältö")
    "koulutussisältö": """
        Verkko-oppimisalusta sisältää interaktiivisia moduuleja ja oman tahdin edistymisen seurannan.
        Virtuaaliluokat mahdollistavat reaaliaikaisen yhteistyön opiskelijoiden ja ohjaajien välillä.
        Digitaaliset arviointityökalut antavat välitöntä palautetta oppimistuloksista.
    """,
    # Finnish sample: technical content ("tekninen sisältö")
    "tekninen_sisältö": """
        Koneoppimismalleja koulutetaan suurilla datajoukoilla tunnistamaan kaavoja.
        Neuroverkon arkkitehtuuri sisältää useita kerroksia piirteiden erottamiseen.
        Datan esikäsittely ja piirteiden suunnittelu ovat keskeisiä vaiheita prosessissa.
    """,
}

In [9]:
# Select the text to pass to the debug helpers below
# text = example_texts["Mixed Content"]
text = example_texts["koulutussisältö"]


In [None]:
await debug_theme_analysis(text)



Debug Theme Analysis

Input Text:
--------------------
The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.

Running Analysis...
--------------------


In [None]:
await debug_category_analysis(text)



Debug Category Analysis

Input Text:
--------------------
The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.

Running Analysis...
--------------------

Categories Found:

  • technical
    Confidence: [█████████████████░░░] (0.85)
    The text discusses a cloud migration project and DevOps implementation, both of which are technical processes related to software development and system management.
    Evidence:
      - cloud migration project
      - DevOps implementation
      - improved deployment frequency

  • business
    Confidence: [███████████████░░░░░] (0.75)
    The text mentions reduced infrastructure costs and growth in monthly recurring revenue, which are key indicators of business performance and financial health.
    Evidence:
      - reduced infrastructure costs by 25%
      - monthly rec

CategoryOutput(language='en', error=None, success=True, categories=[CategoryInfo(name='technical', confidence=0.85, explanation='The text discusses a cloud migration project and DevOps implementation, both of which are technical processes related to software development and system management.', evidence=['cloud migration project', 'DevOps implementation', 'improved deployment frequency'], themes=['cloud computing', 'DevOps', 'infrastructure management']), CategoryInfo(name='business', confidence=0.75, explanation='The text mentions reduced infrastructure costs and growth in monthly recurring revenue, which are key indicators of business performance and financial health.', evidence=['reduced infrastructure costs by 25%', 'monthly recurring revenue from SaaS products grew steadily'], themes=['cost reduction', 'revenue growth', 'SaaS business model'])], explanations={'technical': 'The text discusses a cloud migration project and DevOps implementation, both of which are technical processes
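
The result object above exposes a `confidence` and an `evidence` list per category, so downstream code can keep only the categories it trusts. The sketch below uses a small stand-in dataclass rather than the project's own `CategoryInfo`; the class name, the helper name, and the 0.8 threshold are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CategoryInfoSketch:
    """Illustrative stand-in with some of the fields visible in the output above."""
    name: str
    confidence: float
    evidence: List[str] = field(default_factory=list)


def filter_by_confidence(categories, threshold=0.8):
    """Keep categories at or above the threshold, highest confidence first."""
    kept = [c for c in categories if c.confidence >= threshold]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)


categories = [
    CategoryInfoSketch("business", 0.75, ["reduced infrastructure costs by 25%"]),
    CategoryInfoSketch("technical", 0.85, ["cloud migration project", "DevOps implementation"]),
]
for cat in filter_by_confidence(categories, threshold=0.8):
    print(f"{cat.name}: {cat.confidence:.2f} ({len(cat.evidence)} evidence items)")
```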

In [12]:
await debug_keyword_analysis(text)



Debug Keyword Analysis

Input Text:
--------------------
Verkko-oppimisalusta sisältää interaktiivisia moduuleja ja oman tahdin edistymisen seurannan. 
        Virtuaaliluokat mahdollistavat reaaliaikaisen yhteistyön opiskelijoiden ja ohjaajien välillä. 
        Digitaaliset arviointityökalut antavat välitöntä palautetta oppimistuloksista.

Running Analysis...
--------------------

Keywords Found:
  • verkko-oppimisalusta [████████████████████] (1.00)
  • interaktiivisia      [████████████████████] (1.00)
  • moduuleja            [████████████████████] (1.00)
  • tahdin               [████████████████████] (1.00)
  • virtuaaliluokat      [████████████████████] (1.00)
  • reaaliaikaisen       [████████████████████] (1.00)
  • oman                 [█████████░░░░░░░░░░░] (0.48)
  • edistymisen          [█████████░░░░░░░░░░░] (0.48)
  • seurannan            [█████████░░░░░░░░░░░] (0.48)
  • mahdollistavat       [█████████░░░░░░░░░░░] (0.48)

Debug Information:
--------------------

Confid

KeywordAnalysisResult(keywords=[KeywordInfo(keyword='verkko-oppimisalusta', score=1.0, domain='technical', compound_parts=None), KeywordInfo(keyword='interaktiivisia', score=1.0, domain='technical', compound_parts=None), KeywordInfo(keyword='moduuleja', score=1.0, domain='technical', compound_parts=None), KeywordInfo(keyword='tahdin', score=1.0, domain='general', compound_parts=None), KeywordInfo(keyword='virtuaaliluokat', score=1.0, domain='technical', compound_parts=None), KeywordInfo(keyword='reaaliaikaisen', score=1.0, domain='technical', compound_parts=None), KeywordInfo(keyword='oman', score=0.48, domain=None, compound_parts=None), KeywordInfo(keyword='edistymisen', score=0.48, domain=None, compound_parts=None), KeywordInfo(keyword='seurannan', score=0.48, domain=None, compound_parts=None), KeywordInfo(keyword='mahdollistavat', score=0.48, domain=None, compound_parts=None)], compound_words=[], domain_keywords={'technical': ['verkko-oppimisalusta', 'interaktiivisia', 'moduuleja', 
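
The result above carries a `domain` per keyword and a `domain_keywords` mapping. How the analyzer builds that mapping is not shown; one plausible way to derive it is sketched below. `KeywordInfoSketch`, `group_by_domain`, and the `"unclassified"` bucket for keywords without a domain are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeywordInfoSketch:
    """Illustrative stand-in with some of the fields visible in the output above."""
    keyword: str
    score: float
    domain: Optional[str] = None


def group_by_domain(keywords, unknown_label="unclassified"):
    """Group keyword strings by domain, preserving input order within each group."""
    groups = defaultdict(list)
    for kw in keywords:
        groups[kw.domain or unknown_label].append(kw.keyword)
    return dict(groups)


keywords = [
    KeywordInfoSketch("verkko-oppimisalusta", 1.0, "technical"),
    KeywordInfoSketch("tahdin", 1.0, "general"),
    KeywordInfoSketch("oman", 0.48, None),
]
print(group_by_domain(keywords))
```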

In [None]:
# Or run full pipeline with debug info
await debug_full_pipeline(text)


### Batch Processing from Excel
Process multiple texts from an Excel file:


In [None]:
await analyze_excel_content(
    input_file="test_content.xlsx",  # Input Excel file path
    output_file="analysis_results",  # Output filename (without extension)
    content_column="content"         # Column containing text to analyze
)
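
The internals of `analyze_excel_content` are not shown in this notebook. The general batch pattern (take one text per row, analyze rows concurrently with a cap on parallel calls, keep row order) can be sketched with the standard library alone. Here plain dictionaries stand in for spreadsheet rows, and `analyze_text` is a hypothetical placeholder that counts words instead of calling an analyzer.

```python
import asyncio


async def analyze_text(text):
    """Hypothetical placeholder analyzer: returns a word count, not a real analysis."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"word_count": len(text.split())}


async def analyze_rows(rows, content_column="content", max_concurrency=3):
    """Analyze the text in each row concurrently, preserving row order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def analyze_one(row):
        async with semaphore:  # cap the number of analyses in flight
            result = await analyze_text(row[content_column])
        return {**row, **result}

    return await asyncio.gather(*(analyze_one(row) for row in rows))


rows = [
    {"content": "Q3 revenue increased by 15%"},
    {"content": "The application uses microservices architecture"},
]
results = asyncio.run(analyze_rows(rows))
for row in results:
    print(row)
```

Inside a notebook cell an event loop is already running, so use `results = await analyze_rows(rows)` there instead of `asyncio.run(...)`.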


## Parameters
- Configure analyzers using parameter files
- Control output detail with DebugOptions
- Set the logging level to control verbosity

## Example Outputs
The analysis provides:
- Keywords with confidence scores
- Theme identification and descriptions
- Category classification with evidence
- Confidence visualizations with Unicode bars
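
The Unicode confidence bars in the outputs above can be reproduced with a few lines. This is a sketch, not the helper's actual code: the function name `confidence_bar` is hypothetical, and truncating (rather than rounding) the filled width is an assumption chosen because it matches the sample bars, where 0.48 fills 9 of 20 cells.

```python
def confidence_bar(score, width=20, filled="█", empty="░"):
    """Render a score in [0, 1] as a fixed-width bar followed by the numeric value."""
    clamped = max(0.0, min(1.0, score))
    n_filled = int(clamped * width)  # truncation: an assumption matching the sample output
    return f"[{filled * n_filled}{empty * (width - n_filled)}] ({score:.2f})"


for label, score in [("technical", 0.85), ("business", 0.75), ("oman", 0.48)]:
    print(f"{label:<10} {confidence_bar(score)}")
```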

## Notes
- Set the logging level to WARNING, for example `configure_logging(level="WARNING")`, to minimize output
- Use the debug functions to inspect the results of an individual analyzer in detail
- The Excel output combines all analysis types