# Semantic Analysis Pipeline

This notebook demonstrates semantic text analysis (keyword, theme, and category extraction) using the project's custom analyzers.

## Setup
Import required packages and configure the environment:



In [1]:
# At start of notebook
import sys
from pathlib import Path
import logging
import os

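# Add the project root to the Python path so that the `src` package is importable.
# This assumes the notebook runs from a directory directly under the project root.
project_root = str(Path().resolve().parent)
if project_root not in sys.path:
    sys.path.append(project_root)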

In [2]:
from src.nb_helpers.logging import configure_logging
from src.nb_helpers.environment import setup_notebook_env, verify_environment

# Set up the notebook environment with DEBUG-level logging
setup_notebook_env(log_level="DEBUG")

# Verify the environment; passing log_level keeps the DEBUG level in place
verify_environment(log_level="DEBUG")

2024-11-16 13:36:26 - src.utils.FileUtils.file_utils - DEBUG - Initialized FileUtils with log level: DEBUG
2024-11-16 13:36:26 - src.utils.FileUtils.file_utils - DEBUG - Project root: c:\Users\tja\OneDrive - Rastor-instituutti ry\Tiedostot\Rastor-instituutti\kehittäminen\analytiikka\repos\semantic-text-analyzer
2024-11-16 13:36:26 - src.nb_helpers.environment - DEBUG - Before environment setup - Root logger level: DEBUG
2024-11-16 13:36:26 - src.nb_helpers.environment - DEBUG - After environment setup - Root logger level: DEBUG
2024-11-16 13:36:26 - src.utils.FileUtils.file_utils - DEBUG - Initialized FileUtils with log level: DEBUG
2024-11-16 13:36:26 - src.utils.FileUtils.file_utils - DEBUG - Project root: c:\Users\tja\OneDrive - Rastor-instituutti ry\Tiedostot\Rastor-instituutti\kehittäminen\analytiikka\repos\semantic-text-analyzer


Environment Check Results:

Basic Setup:
-----------
✓ Project root in path
✓ FileUtils initialized
✓ .env file loaded

Environment Variables:
---------------------
✓ OPENAI_API_KEY set
✓ ANTHROPIC_API_KEY set

Project Structure:
-----------------
✓ Raw data exists
✓ Processed data exists
✓ Configuration exists
✓ Main config.yaml exists

Environment Status: Ready ✓


True

In [4]:
# # Setup environment first
# from src.nb_helpers.environment import setup_notebook_env
# setup_notebook_env()

# # Configure logging levels
# import logging
# from src.nb_helpers.logging import configure_logging, verify_logging_setup

# # Set root logger to DEBUG
# root = logging.getLogger()
# root.setLevel(logging.DEBUG)
# for handler in root.handlers:
#     handler.setLevel(logging.DEBUG)

# # Set module loggers to DEBUG
# for name in ["src.nb_helpers.analyzers", "src.analyzers.keyword_analyzer", 
#             "src.analyzers.theme_analyzer", "src.analyzers.category_analyzer", 
#             "src.utils.FileUtils.file_utils"]:
#     logging.getLogger(name).setLevel(logging.DEBUG)


# Keep HTTP loggers at INFO
for name in ["httpx", "httpcore", "openai", "anthropic"]:
    logging.getLogger(name).setLevel(logging.INFO)





In [5]:
detailed_logging_info = True
if detailed_logging_info:
    from src.nb_helpers.logging import verify_logging_setup_with_hierarchy
    # Configure logging
    # configure_logging(level="DEBUG")
    # Verify with detailed information
    verify_logging_setup_with_hierarchy()



Logging Configuration:
--------------------------------------------------

Logger: root
Set Level: DEBUG
Effective Level: DEBUG
Propagates to root: True
Handlers:
  Handler 1 level: DEBUG

Logger: src.nb_helpers.analyzers
Hierarchy:
  src: NOTSET
  src.nb_helpers: NOTSET
  src.nb_helpers.analyzers: NOTSET
Set Level: NOTSET
Effective Level: DEBUG
Propagates to root: True
No handlers (uses root handlers)

Logger: src.analyzers.keyword_analyzer
Hierarchy:
  src: NOTSET
  src.analyzers: NOTSET
  src.analyzers.keyword_analyzer: NOTSET
Set Level: NOTSET
Effective Level: DEBUG
Propagates to root: True
No handlers (uses root handlers)

Logger: src.analyzers.theme_analyzer
Hierarchy:
  src: NOTSET
  src.analyzers: NOTSET
  src.analyzers.theme_analyzer: NOTSET
Set Level: NOTSET
Effective Level: DEBUG
Propagates to root: True
No handlers (uses root handlers)

Logger: src.analyzers.category_analyzer
Hierarchy:
  src: NOTSET
  src.analyzers: NOTSET
  src.analyzers.category_analyzer: NOTSET
Set Lev
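
The listing above shows loggers whose own level is NOTSET but whose effective level is DEBUG. Python's logging module resolves the effective level by walking up the dotted logger name until it finds an ancestor with an explicit level. The sketch below reproduces that lookup with the standard library only. The `demo_pkg` logger names and the two helper functions are illustrative assumptions, and this is not necessarily how verify_logging_setup_with_hierarchy is implemented.

```python
import logging


def level_chain(name: str) -> list[tuple[str, str]]:
    """Return (logger name, explicitly set level) for each dotted prefix of name."""
    parts = name.split(".")
    chain = []
    for i in range(1, len(parts) + 1):
        prefix = ".".join(parts[:i])
        logger = logging.getLogger(prefix)
        chain.append((prefix, logging.getLevelName(logger.level)))
    return chain


def effective_level(name: str) -> str:
    """Return the level the logger actually uses after inheritance."""
    return logging.getLevelName(logging.getLogger(name).getEffectiveLevel())


# Only the top-level demo logger gets an explicit level; children stay NOTSET.
logging.getLogger("demo_pkg").setLevel(logging.DEBUG)

name = "demo_pkg.analyzers.keyword_analyzer"
for logger_name, level in level_chain(name):
    print(f"  {logger_name}: {level}")
print("Effective Level:", effective_level(name))
```

In the notebook's own output every intermediate logger is NOTSET, so the DEBUG level comes from the root logger rather than from a package-level logger as in this demo.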

In [20]:
# Run environment verification
# from src.nb_helpers.environment import verify_environment
# verify_environment()

In [6]:
# Import other modules after logging is configured
from src.nb_helpers.analyzers import (
    analyze_keywords, 
    analyze_themes,
    analyze_categories,
    analyze_text,
    AnalysisOptions
)

options = AnalysisOptions(
    show_confidence=True,
    show_evidence=True,
    show_keywords=True,
    show_raw_data=True,
    debug_mode=True
)

In [7]:
# Test logging
logger = logging.getLogger("src.analyzers.keyword_analyzer")
logger.debug("Testing keyword analyzer logging")

2024-11-16 13:37:00 - src.analyzers.keyword_analyzer - DEBUG - Testing keyword analyzer logging


TODO:
- define language

<!-- ## Analysis Functions

### Single Analysis with Debug Output
Run detailed analysis for a single text: -->


In [8]:
example_texts = {
    "Business Analysis": """
        Q3 revenue increased by 15% with strong growth in enterprise sales.
        Customer retention improved while acquisition costs decreased.
        New market expansion initiatives are showing positive early results.
    """,
    
    "Technical Content": """
        The application uses microservices architecture with containerized deployments.
        Data processing pipeline incorporates machine learning models for prediction.
        System monitoring ensures high availability and performance metrics.
    """,
    
    "Mixed Content": """
        The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.
    """,
    "koulutus":
    """
        Verkko-oppimisalusta sisältää interaktiivisia moduuleja ja oman tahdin edistymisen seurannan. 
        Virtuaaliluokat mahdollistavat reaaliaikaisen yhteistyön opiskelijoiden ja ohjaajien välillä. 
        Digitaaliset arviointityökalut antavat välitöntä palautetta oppimistuloksista.
    """,
    "tekninen":
    """
        Koneoppimismalleja koulutetaan suurilla datajoukolla tunnistamaan kaavoja. 
        Neuroverkon arkkitehtuuri sisältää useita kerroksia piirteiden erottamiseen. 
        Datan esikäsittely ja piirteiden suunnittelu ovat keskeisiä vaiheita prosessissa.

    """
}

In [9]:
# Pick the text to analyze; the key must match one defined in example_texts
# text = example_texts["koulutus"]
text = example_texts["Mixed Content"]

In [10]:
await analyze_keywords(text, options=options)


2024-11-16 13:37:40 - src.nb_helpers.analyzers - DEBUG - Starting keyword analysis
2024-11-16 13:37:40 - src.nb_helpers.analyzers - DEBUG - Initialized TextAnalyzer with options: AnalysisOptions(show_confidence=True, show_evidence=True, show_keywords=True, show_raw_data=True, debug_mode=True)
2024-11-16 13:37:40 - src.nb_helpers.analyzers - DEBUG - Starting Keyword analysis
2024-11-16 13:37:40 - src.nb_helpers.analyzers - DEBUG - Parameter file: None
2024-11-16 13:37:40 - src.nb_helpers.analyzers - DEBUG - Creating keyword tester with parameter file: None



Keyword Analysis

Input Text:
--------------------
The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.

Analyzing...
--------------------


2024-11-16 13:38:05 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



Keywords Found:
  • cloud migration      [██████████████████░░] (0.94)
  • infrastructure costs [█████████████████░░░] (0.89)
  • monthly recurring revenue [█████████████████░░░] (0.87)
  • DevOps implementation [████████████████░░░░] (0.84)
  • deployment frequency [███████████████░░░░░] (0.79)
  • SaaS products        [█████████████░░░░░░░] (0.70)
  • infrastructure       [██████████░░░░░░░░░░] (0.53)
  • migration            [█████████░░░░░░░░░░░] (0.48)
  • cost                 [█████████░░░░░░░░░░░] (0.48)
  • implementation       [█████████░░░░░░░░░░░] (0.48)

Debug Information:
--------------------
{
  "keywords": [
    {
      "keyword": "cloud migration",
      "score": 0.9439199999999998,
      "domain": "technical",
      "compound_parts": [
        "cloud",
        "migration"
      ]
    },
    {
      "keyword": "infrastructure costs",
      "score": 0.8942399999999999,
      "domain": "business",
      "compound_parts": [
        "infrastructure",
        "costs"
      

KeywordAnalysisResult(keywords=[KeywordInfo(keyword='cloud migration', score=0.9439199999999998, domain='technical', compound_parts=['cloud', 'migration']), KeywordInfo(keyword='infrastructure costs', score=0.8942399999999999, domain='business', compound_parts=['infrastructure', 'costs']), KeywordInfo(keyword='monthly recurring revenue', score=0.8693999999999998, domain='business', compound_parts=['monthly', 'recurring', 'revenue']), KeywordInfo(keyword='DevOps implementation', score=0.8445599999999999, domain='technical', compound_parts=['DevOps', 'implementation']), KeywordInfo(keyword='deployment frequency', score=0.7948799999999998, domain='technical', compound_parts=['deployment', 'frequency']), KeywordInfo(keyword='SaaS products', score=0.6955199999999998, domain='business', compound_parts=['SaaS', 'products']), KeywordInfo(keyword='infrastructure', score=0.528, domain=None, compound_parts=['infra', 'structure']), KeywordInfo(keyword='migration', score=0.48, domain=None, compound
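
The truncated repr above suggests that each keyword carries a score, an optional domain, and its compound parts. As a sketch of how such a result might be post-processed, the block below groups keywords by domain. The `KeywordInfo` dataclass here is a stand-in reconstructed from the printed repr, not the project's own class, and `group_by_domain` and `min_score` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeywordInfo:
    """Stand-in mirroring the fields visible in the printed repr (assumption)."""
    keyword: str
    score: float
    domain: Optional[str] = None
    compound_parts: list[str] = field(default_factory=list)


def group_by_domain(keywords: list[KeywordInfo], min_score: float = 0.0) -> dict[str, list[str]]:
    """Group keyword strings by domain, highest score first, dropping low scores."""
    groups: dict[str, list[str]] = defaultdict(list)
    for kw in sorted(keywords, key=lambda k: k.score, reverse=True):
        if kw.score >= min_score:
            # Keywords without a domain are collected under a placeholder label.
            groups[kw.domain or "unclassified"].append(kw.keyword)
    return dict(groups)


# Values copied from the output shown above.
sample = [
    KeywordInfo("cloud migration", 0.9439199999999998, "technical", ["cloud", "migration"]),
    KeywordInfo("infrastructure costs", 0.8942399999999999, "business", ["infrastructure", "costs"]),
    KeywordInfo("infrastructure", 0.528, None, ["infra", "structure"]),
]

print(group_by_domain(sample, min_score=0.5))
# {'technical': ['cloud migration'], 'business': ['infrastructure costs'], 'unclassified': ['infrastructure']}
```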

In [None]:
await analyze_themes(text, options=options)



Debug Theme Analysis

Input Text:
--------------------
The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.

Running Analysis...
--------------------


In [None]:
await analyze_categories(text, options=options)



Debug Category Analysis

Input Text:
--------------------
The IT department's cloud migration project reduced infrastructure costs by 25%.
        DevOps implementation improved deployment frequency while maintaining quality.
        Monthly recurring revenue from SaaS products grew steadily.

Running Analysis...
--------------------

Categories Found:

  • technical
    Confidence: [█████████████████░░░] (0.85)
    The text discusses a cloud migration project and DevOps implementation, both of which are technical processes related to software development and system management.
    Evidence:
      - cloud migration project
      - DevOps implementation
      - improved deployment frequency

  • business
    Confidence: [███████████████░░░░░] (0.75)
    The text mentions reduced infrastructure costs and growth in monthly recurring revenue, which are key indicators of business performance and financial health.
    Evidence:
      - reduced infrastructure costs by 25%
      - monthly rec

CategoryOutput(language='en', error=None, success=True, categories=[CategoryInfo(name='technical', confidence=0.85, explanation='The text discusses a cloud migration project and DevOps implementation, both of which are technical processes related to software development and system management.', evidence=['cloud migration project', 'DevOps implementation', 'improved deployment frequency'], themes=['cloud computing', 'DevOps', 'infrastructure management']), CategoryInfo(name='business', confidence=0.75, explanation='The text mentions reduced infrastructure costs and growth in monthly recurring revenue, which are key indicators of business performance and financial health.', evidence=['reduced infrastructure costs by 25%', 'monthly recurring revenue from SaaS products grew steadily'], themes=['cost reduction', 'revenue growth', 'SaaS business model'])], explanations={'technical': 'The text discusses a cloud migration project and DevOps implementation, both of which are technical processes

In [None]:
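# Or run the full pipeline with debug info.
# Note: debug_full_pipeline is not imported in the cells above; import it before running this cell.
await debug_full_pipeline(text)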


### Batch Processing from Excel
Process multiple texts from an Excel file:


In [None]:
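# Note: analyze_excel_content is not imported in the cells above; import it before running this cell.
await analyze_excel_content(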
    input_file="test_content.xlsx",  # Input Excel file path
    output_file="analysis_results",  # Output filename (without extension)
    content_column="content"         # Column containing text to analyze
)
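
How analyze_excel_content works internally is not shown in this notebook. As a rough sketch of the general pattern, the block below runs an async analyzer over a list of row dictionaries with bounded concurrency and records per-row errors instead of aborting the batch. Everything here is an assumption for illustration: `analyze_rows`, `fake_analyze`, and `max_concurrency` are hypothetical names, and the rows are plain dictionaries (they could, for example, be loaded with `pandas.read_excel(...).to_dict("records")`).

```python
import asyncio
from typing import Any, Awaitable, Callable


async def analyze_rows(
    rows: list[dict[str, Any]],
    analyze: Callable[[str], Awaitable[Any]],
    content_column: str = "content",
    max_concurrency: int = 3,
) -> list[dict[str, Any]]:
    """Analyze the text in content_column of every row, preserving row order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def analyze_one(row: dict[str, Any]) -> dict[str, Any]:
        text = str(row.get(content_column, "")).strip()
        if not text:
            return {**row, "result": None, "error": "empty content"}
        async with semaphore:  # limit the number of simultaneous analyzer calls
            try:
                return {**row, "result": await analyze(text), "error": None}
            except Exception as exc:  # keep going when a single row fails
                return {**row, "result": None, "error": str(exc)}

    # gather returns results in the same order as the input rows
    return await asyncio.gather(*(analyze_one(row) for row in rows))


async def fake_analyze(text: str) -> dict[str, int]:
    """Stand-in for a real analyzer call (hypothetical)."""
    await asyncio.sleep(0)
    return {"word_count": len(text.split())}


# In a notebook cell: results = await analyze_rows(rows, fake_analyze)
# In a script:        results = asyncio.run(analyze_rows(rows, fake_analyze))
```

Bounding concurrency with a semaphore is one way to avoid sending too many simultaneous requests to an LLM provider; the right limit depends on the provider and account.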


## Parameters
- Configure analyzers using parameter files
- Control output detail with AnalysisOptions (for example show_confidence, show_evidence, debug_mode)
- Set logging level for verbosity control

## Example Outputs
The analysis provides:
- Keywords with confidence scores
- Theme identification and descriptions
- Category classification with evidence
- Confidence visualizations with Unicode bars
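
The bars printed earlier in this notebook appear to use 20 cells and to round the filled portion down (a score of 0.94 shows 18 filled cells, 0.53 shows 10). A minimal sketch that reproduces this rendering under that assumption follows; `confidence_bar` is a hypothetical helper, not the project's own function.

```python
def confidence_bar(score: float, width: int = 20) -> str:
    """Render a score in [0, 1] as a fixed-width Unicode bar plus the rounded value."""
    clamped = min(max(score, 0.0), 1.0)  # guard against out-of-range scores
    filled = int(clamped * width)        # truncate, matching the bars shown above
    return f"[{'█' * filled}{'░' * (width - filled)}] ({score:.2f})"


print(confidence_bar(0.9439199999999998))
print(confidence_bar(0.528))
```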

## Notes
- Set the logging level to WARNING to minimize output
- Use debug functions for detailed analysis inspection
- Excel output combines all analysis types
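
Because the HTTP client loggers were given an explicit INFO level earlier in this notebook, lowering only the root logger would not silence them. The sketch below shows one way to quiet everything using the standard library alone. `quiet_logging` is a hypothetical helper; the project's own setup_notebook_env(log_level=...) may already cover this.

```python
import logging


def quiet_logging(
    level: int = logging.WARNING,
    explicit_loggers: tuple[str, ...] = ("httpx", "httpcore", "openai", "anthropic"),
) -> None:
    """Raise the threshold on the root logger, its handlers, and explicitly set loggers."""
    root = logging.getLogger()
    root.setLevel(level)
    for handler in root.handlers:
        handler.setLevel(level)
    # Loggers with their own level do not inherit from root, so set them directly.
    for name in explicit_loggers:
        logging.getLogger(name).setLevel(level)


quiet_logging()
```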