# Parent-Infant Interaction Analysis Pipeline

## Setup and Dependencies
```python
import logging
import os
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple, Union

import librosa
import numpy as np
import pandas as pd
import soundfile as sf
import torch
import whisper
from pydub import AudioSegment
from pydub.silence import split_on_silence
from scipy.signal import butter, filtfilt
from transformers import WhisperProcessor, WhisperForConditionalGeneration
```
Key dependencies:
- OpenAI Whisper for speech recognition
- PyDub for audio processing
- FFmpeg for audio file manipulation
- Pandas for data handling
- NumPy for numerical operations

## Pipeline Overview
This notebook implements a three-stage pipeline for analyzing parent-infant interactions:
1. Audio Preprocessing
2. Speech Recognition
3. Accuracy Analysis

### 1. Audio Preprocessing (`AudioPreprocessor` class)
Prepares audio files for speech recognition with a single FFmpeg filter chain:
- Converts stereo to mono (Whisper operates on mono audio)
- Resamples to 16 kHz
- Optionally applies noise reduction (FFmpeg `anlmdn` filter)
- Applies a fixed volume gain

Silence is not removed, so natural pauses are kept for interaction analysis.

Example usage:
```python
preprocessor = AudioPreprocessor(input_file)
processed_file = preprocessor.process_audio(
    output_dir=output_dir,
    apply_noise_reduction=True  # Optional denoising; pauses are left intact
)
```
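
As a quick sanity check after preprocessing, the output file can be inspected with Python's standard `wave` module. The sketch below is not part of the original pipeline: the helper name `is_whisper_ready` is hypothetical, and it assumes the processed file is an uncompressed PCM WAV.

```python
import wave
from pathlib import Path
from typing import Union


def is_whisper_ready(path: Union[str, Path], target_rate: int = 16000) -> bool:
    """Return True if the WAV file is mono and sampled at target_rate Hz."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnchannels() == 1 and wav.getframerate() == target_rate
```

Calling `is_whisper_ready(processed_file)` right after `process_audio` would flag a file that was accidentally left in stereo or at its original sample rate.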

### 2. Speech Recognition (`transcribe_audio` function)
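Uses Whisper ASR with parent-infant specific settings:
- Large-v2 model for accuracy
- Custom initial prompt describing the parent-infant context
- Word-level timestamps requested from Whisper (the exported table keeps segment-level times)
- Optional keyword-based speaker labels (a text heuristic, not acoustic speaker diarization)

Output includes:
- Transcribed text
- Segment start and end times
- Optional speaker labels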
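
When `word_timestamps=True` is passed, each Whisper segment also carries a `words` list with per-word start and end times. The sketch below shows one way to flatten that nested structure into one row per word. The helper name `flatten_words` and the hand-written `example_result` are illustrative assumptions, not part of the notebook.

```python
from typing import Any, Dict, List


def flatten_words(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn Whisper's nested segment/word structure into one row per word."""
    rows = []
    for seg_index, segment in enumerate(result.get("segments", [])):
        # "words" is only present when word_timestamps=True was requested.
        for word in segment.get("words", []):
            rows.append({
                "segment": seg_index,
                "word": word["word"].strip(),
                "start": word["start"],
                "end": word["end"],
            })
    return rows


# Hand-written stand-in for a Whisper result (not real model output).
example_result = {
    "segments": [
        {"start": 0.0, "end": 1.0, "text": " Come here.",
         "words": [
             {"word": " Come", "start": 0.0, "end": 0.4},
             {"word": " here.", "start": 0.4, "end": 1.0},
         ]},
    ]
}
word_rows = flatten_words(example_result)
```

Passing the returned list to `pd.DataFrame(...)` would give a word-level table alongside the segment-level one.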

### 3. Accuracy Analysis (`EnhancedTranscriptionComparator` class)
Compares Whisper output with human transcriptions:
- Text similarity scoring
- Word-level accuracy metrics
- Detailed analysis reports

### Key Metrics
1. Text Similarity
   - SequenceMatcher score (Ratcliff/Obershelp algorithm), computed character by character on the cleaned text
   - Formula: ratio = 2.0 * M / T
     * M = number of matching characters (sum of the lengths of all matching blocks)
     * T = total length of both strings combined

2. Word-level Analysis
   - Word count comparison
   - Correct words identification
   - Mismatch detection
   - Accuracy percentage calculation
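
The text similarity formula can be checked by hand with `difflib`. The sketch below recomputes M and T from the matching blocks and compares the result with `ratio()`. The two example strings are made up for illustration, and the helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level Ratcliff/Obershelp ratio, 2.0 * M / T."""
    return SequenceMatcher(None, a, b).ratio()


human = "oh did it tickle"
whisper = "oh did it tickle here"

matcher = SequenceMatcher(None, human, whisper)
# M: total number of characters covered by the matching blocks.
m = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both strings.
t = len(human) + len(whisper)

score = similarity(human, whisper)
manual = 2.0 * m / t  # 2 * 16 / 37, roughly 0.865
```

Because the score is character-based, a Whisper segment that contains the whole human utterance plus extra words still scores well below 1.0.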

### Output Format
CSV file with columns:
- Time Range
- Whisper Transcription
- Human Transcription
- SequenceMatcher Score
- Word Count (Human)
- Word Count (Whisper)
- Correct Words
- Mismatches
- Accuracy (%)
- Comments/Notes
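
The Correct Words and Mismatches columns are derived from sets of unique words, so a word repeated within a segment counts only once. The sketch below contrasts that behaviour with a `collections.Counter` variant that counts repeats. Both helper names are hypothetical, and the tokenizer is a simplified stand-in for the notebook's text cleaning (it skips the variant replacements).

```python
import string
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def unique_word_metrics(human: str, whisper: str) -> Dict[str, int]:
    """Set-based counts: each distinct word counts once."""
    h, w = set(tokenize(human)), set(tokenize(whisper))
    correct = len(h & w)
    return {"correct": correct, "mismatches": max(len(h), len(w)) - correct}


def multiset_word_metrics(human: str, whisper: str) -> Dict[str, int]:
    """Counter-based counts: repeated words are counted each time."""
    h, w = Counter(tokenize(human)), Counter(tokenize(whisper))
    correct = sum((h & w).values())
    total = max(sum(h.values()), sum(w.values()))
    return {"correct": correct, "mismatches": total - correct}


human = "No, no, no!"
whisper = "No."
set_based = unique_word_metrics(human, whisper)      # repeats collapse
count_based = multiset_word_metrics(human, whisper)  # repeats are kept
```

For repetitive infant-directed speech, the two approaches can diverge noticeably, which is worth keeping in mind when reading the accuracy column.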

## Results Interpretation
The comparison output provides:
- Time-aligned transcriptions
- Word-level accuracy metrics
- Quality indicators
- Automated issue detection

Example metrics from a sample run (values vary by recording):
- Total segments: 95
- Average accuracy: 65.3%
- High quality matches: 39
- Low quality matches: 38

## Future Improvements
Consider:
- GPU acceleration for faster processing
- Enhanced speaker detection
- Controlled vocabulary for infant sounds
- Batch processing capabilities
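
For the GPU item, one possible starting point is to choose the device at runtime instead of hard-coding `"cpu"`. This is a minimal sketch under the assumption that PyTorch is the backend in use. The helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# The result could replace the hard-coded device in the transcription step,
# e.g. whisper.load_model("large-v2", device=device)
```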

## Suggested File Structure

```
File Structure Overview:
project/wearable/
├── media/[MONTH]/                    # Audio files
│   ├── IW_[ID]_[MONTH]_TL.wav       # Original audio
│   └── processed/                    # Processed audio files
│       └── IW_[ID]_[MONTH]_TL_processed.wav   # Output of AudioPreprocessor
│
└── transcription/[MONTH]/            # Transcription files
    ├── IW_[ID]_[MONTH]_whisper_results_without_speaker.csv  # Whisper output
    └── IW_[ID]_[MONTH]_transcription_comparison.csv         # Analysis results

Naming Convention:
- [ID]: Participant ID (e.g., 002)
- [MONTH]: Recording month (e.g., 12 in file names; directories use 12mon)
```
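
The naming convention lends itself to a small parser, which helps when looping over many recordings. The sketch below assumes the month appears as a plain number in file names, as in `IW_002_12_TL.wav`. The pattern and both helper names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Pattern for the original recordings, e.g. IW_002_12_TL.wav
RECORDING_PATTERN = re.compile(r"^IW_(?P<id>\d+)_(?P<month>\d+)_TL$")


def parse_recording_name(path: str) -> Optional[Dict[str, str]]:
    """Extract participant ID and month from a recording file name."""
    match = RECORDING_PATTERN.match(Path(path).stem)
    return match.groupdict() if match else None


def comparison_csv_name(participant_id: str, month: str) -> str:
    """Build the analysis file name shown in the structure above."""
    return f"IW_{participant_id}_{month}_transcription_comparison.csv"


info = parse_recording_name("/path/to/media/12mon/IW_002_12_TL.wav")
```

Files that do not follow the convention return `None`, so they can be skipped or logged rather than silently mislabeled.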

## Usage Workflow

### 1. Preprocess Audio
```python
# Initialize and process audio
preprocessor = AudioPreprocessor("/path/to/media/12mon/IW_002_12_TL.wav")
processed_file = preprocessor.process_audio(
    output_dir="/path/to/media/12mon/processed",
    apply_noise_reduction=True    # Silence is never removed, so pauses are kept
)
# Output: IW_002_12_TL_processed.wav
```

### 2. Transcribe Audio
```python
# Run Whisper transcription
df, result = transcribe_audio(
    audio_path="/path/to/processed/IW_002_12_TL_processed.wav",
    speaker_hints=["[Mother]", "[Father]", "[Baby]"]
)
df.to_csv("002_12_whisper_results.csv", index=False)
```

### 3. Compare Transcriptions
```python
# Compare with human transcription
results_df, stats = compare_and_export(
    whisper_df,    # Whisper output
    human_df,      # Human transcription
    "002_12_transcription_comparison.csv"
)
```

### Expected Outputs
1. Processed Audio: `*_processed.wav`
2. Whisper Output: `*_whisper_results.csv`
3. Analysis: `*_transcription_comparison.csv`


## Technical Notes
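- Time tolerance: 0.5 seconds; a Whisper segment is a candidate if its start time falls between the human segment's start minus 0.5 s and its end plus 0.5 s
- Similarity threshold: candidate pairs are kept only if their SequenceMatcher score exceeds 0.3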
- Text cleaning includes:
  * Case normalization
  * Punctuation removal
  * Common transcription variant standardization
- High quality matches: Accuracy > 80%
- Low quality matches: Accuracy < 60%
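
The matching window and quality bands above can be expressed in a few lines of plain Python. This is a simplified sketch, not the comparator itself: the helper names are hypothetical, and the "medium" label for the 60-80% band is an assumption, since the notebook leaves that band unnamed. The example times are taken from the transcript output shown later in this notebook.

```python
from typing import Dict, List

TIME_TOLERANCE = 0.5   # seconds
HIGH_QUALITY = 80.0    # accuracy (%) above this counts as high quality
LOW_QUALITY = 60.0     # accuracy (%) below this counts as low quality


def candidates(human_seg: Dict[str, float],
               whisper_segs: List[Dict[str, float]],
               tolerance: float = TIME_TOLERANCE) -> List[Dict[str, float]]:
    """Whisper segments whose start lies inside the widened human window."""
    window_start = human_seg["start"] - tolerance
    window_end = human_seg["end"] + tolerance
    return [seg for seg in whisper_segs
            if window_start <= seg["start"] <= window_end]


def quality_label(accuracy: float) -> str:
    """Map an accuracy percentage to a quality band."""
    if accuracy > HIGH_QUALITY:
        return "high"
    if accuracy < LOW_QUALITY:
        return "low"
    return "medium"  # assumed name for the unlabeled 60-80% band


human_seg = {"start": 46.5, "end": 47.9}
whisper_segs = [
    {"start": 46.40, "end": 47.78},
    {"start": 48.24, "end": 51.08},
    {"start": 51.98, "end": 52.36},
]
picked = candidates(human_seg, whisper_segs)
```

Here the first two Whisper segments fall inside the window, which is why a single human segment can be paired with more than one Whisper segment.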





In [None]:
###preprocessing

import os
import subprocess
import logging
from pathlib import Path
from typing import Union, List

class AudioPreprocessor:
    def __init__(self, input_path: Union[str, Path]):
        """Initialize audio preprocessor with input file path"""
        self.input_path = Path(input_path)
        self.setup_logging()
        
    def setup_logging(self):
        """Setup logging configuration"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)
    
    def process_audio(self, 
                     output_dir: Union[str, Path],
                     apply_noise_reduction: bool = True,
                     noise_reduction_strength: float = 0.21,
                     target_volume: float = 1.5) -> Path:
        """
        Process audio using an FFmpeg filter chain:
        1. Resample to 16 kHz
        2. Convert to mono (16-bit samples)
        3. Optional noise reduction (anlmdn)
        4. Apply a fixed volume gain

        Args:
            output_dir: Directory to save processed audio (created if missing)
            apply_noise_reduction: Whether to apply noise reduction
            noise_reduction_strength: Strength of noise reduction (0.01 to 1.0)
            target_volume: Linear gain passed to FFmpeg's volume filter
                (1.5 means 1.5x amplitude); a fixed gain, not loudness normalization

        Returns:
            Path to processed audio file
        """
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)  # Also create missing parent directories
        
        # Construct output path
        output_path = output_dir / f"{self.input_path.stem}_processed.wav"
        
        # Build FFmpeg filter chain
        filters = [
            'aresample=16000',  # Convert to 16kHz
            'aformat=sample_fmts=s16:channel_layouts=mono',  # Convert to mono
        ]
        
        # Add noise reduction if requested
        if apply_noise_reduction:
            filters.append(f'anlmdn=s={noise_reduction_strength}')
        
        # Apply a fixed volume gain (linear multiplier)
        filters.append(f'volume={target_volume}')
        
        # Join filters into a single filter chain
        filter_chain = ','.join(filters)
        
        # Build FFmpeg command
        command = [
            'ffmpeg',
            '-i', str(self.input_path),
            '-af', filter_chain,
            '-ar', '16000',  # Ensure 16kHz output
            '-ac', '1',      # Ensure mono output
            '-y',            # Overwrite output if exists
            str(output_path)
        ]
        
        # Log the command for debugging
        self.logger.info("Running FFmpeg command:")
        self.logger.info(' '.join(command))
        
        try:
            # Run FFmpeg command
            subprocess.run(command, 
                         check=True,
                         capture_output=True,
                         text=True)
            
            self.logger.info(f"Successfully processed audio to: {output_path}")
            return output_path
            
        except subprocess.CalledProcessError as e:
            self.logger.error(f"FFmpeg error: {e.stderr}")
            raise
        except Exception as e:
            self.logger.error(f"Error during processing: {str(e)}")
            raise

# Example usage
if __name__ == "__main__":
    input_file = Path("/Users/yueyan/Documents/project/wearable/media/12mon/raw/IW_002_12_TL.wav")
    output_dir = Path("/Users/yueyan/Documents/project/wearable/media/12mon/processed")
    
    try:
        # Initialize preprocessor
        preprocessor = AudioPreprocessor(input_file)
        
        # Process audio
        processed_file = preprocessor.process_audio(
            output_dir=output_dir,
            apply_noise_reduction=True,
            noise_reduction_strength=0.01,  # Adjust between 0.01 and 1.0
            target_volume=1.5              # Adjust as needed
        )
        
        print(f"Processed file: {processed_file}")
            
    except Exception as e:
        logging.error(f"Processing failed: {str(e)}")
        import traceback
        traceback.print_exc()

In [None]:
## whisper transcription with speaker recognition
import whisper
import pandas as pd
from datetime import timedelta
import numpy as np

def transcribe_audio(audio_path, speaker_hints=None):
    """Transcribe audio using Whisper with enhanced speaker detection"""
    print("Loading model...")
    model = whisper.load_model("large-v2", device="cpu")
    
    # Keep the parent-infant context in the prompt but make it more transcription-focused
    initial_prompt = """This is a parent-infant interaction recording. 
    Please transcribe all speech, including infant vocalizations, parent speech, and any notable sounds.
    Include both verbal and non-verbal vocalizations."""
    
    print("Transcribing...")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        verbose=True,
        initial_prompt=initial_prompt,
        language="en",
        task="transcribe"
    )
    
    # Process segments
    segments = []
    
    # First pass: Get clean transcriptions with timing
    for segment in result["segments"]:
        text = segment["text"].strip()
        start = segment["start"]
        end = segment["end"]
        duration = end - start
        
        # Skip empty segments
        if not text:
            continue
            
        segment_info = {
            'start': start,
            'end': end,
            'start_time': str(timedelta(seconds=int(start))),
            'end_time': str(timedelta(seconds=int(end))),
            'text': text,
            'duration': duration
        }
        
        # Optional: Add speaker detection without modifying the original text
        if speaker_hints:
            speaker = detect_speaker(text, speaker_hints)
            segment_info['speaker'] = speaker
        
        segments.append(segment_info)
    
    # Create DataFrame
    df = pd.DataFrame(segments)
    
    return df, result

import re

def detect_speaker(text, speaker_hints):
    """Label a segment from keywords in its text without modifying the transcription.

    Note: `speaker_hints` is currently unused by this keyword heuristic.
    """
    text_lower = text.lower()

    def has(pattern):
        return re.search(pattern, text_lower) is not None

    # Word boundaries (\b) stop short cues from firing inside longer words,
    # e.g. 'goo' in 'good', 'coo' in 'cool', or 'mom' in 'moment'.
    if has(r'\b(crying|cries|waa)'):
        return '[Baby Crying]'
    elif has(r'\b(laugh|giggle)'):  # also matches laughs, laughing, giggles
        return '[Baby Laughing]'
    elif has(r'\b(goo|gah|bah|coo)\b'):
        return '[Infant Vocalization]'
    elif '[mother]' in text_lower or has(r'\bmom(my)?\b'):
        return '[Mother]'
    elif '[father]' in text_lower or has(r'\bdad(dy)?\b'):
        return '[Father]'
    else:
        return '[Unspecified]'

# Example usage
if __name__ == "__main__":
    speaker_hints = [
        "[Mother]", "[Father]", "[Baby]",
        "[Infant Vocalization]", "[Baby Crying]", "[Baby Laughing]"
    ]

    ###replace with your audio file
    try:
        df, result = transcribe_audio(
            "/Users/yueyan/Documents/project/wearable/media/12mon/your file name",
            speaker_hints=speaker_hints
        )
        
        # Save results; replace with your path
        df.to_csv("/Users/yueyan/Documents/project/wearable/transcription/12mon/your file name to be saved", index=False)
        
        # Print statistics
        print("\nTranscription Statistics:")
        print(f"Total segments: {len(df)}")
        
        # If speaker detection is enabled
        if 'speaker' in df.columns:
            print("\nSpeaker distribution:")
            print(df['speaker'].value_counts())
        
        print("\nFirst few transcriptions:")
        print(df[['start_time', 'end_time', 'text']].head())
        
        # Optionally display with speakers if available
        if 'speaker' in df.columns:
            print("\nFirst few transcriptions with speakers:")
            print(df[['start_time', 'end_time', 'speaker', 'text']].head())
        
    except Exception as e:
        print(f"Error occurred: {str(e)}")
        import traceback
        traceback.print_exc()

In [7]:
##whisper transcription without speaker recognition
import whisper
import pandas as pd
from datetime import timedelta
import numpy as np
def transcribe_audio(audio_path):
    """Transcribe audio using Whisper"""
    print("Loading model...")
    model = whisper.load_model("large-v2", device="cpu")
    
    # Context-only prompt; the longer instruction text used in the
    # speaker-recognition variant is disabled in this cell.
    initial_prompt = "This is a parent-infant interaction recording. "
    
    print("Transcribing...")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        verbose=True,
        initial_prompt=initial_prompt,
        language="en",
        task="transcribe"
    )
    
    # Process segments
    segments = []
    
    # First pass: Get clean transcriptions with timing
    for segment in result["segments"]:
        text = segment["text"].strip()
        start = segment["start"]
        end = segment["end"]
        duration = end - start
        
        # Skip empty segments
        if not text:
            continue
            
        segment_info = {
            'start': start,
            'end': end,
            'start_time': str(timedelta(seconds=int(start))),
            'end_time': str(timedelta(seconds=int(end))),
            'text': text,
            'duration': duration
        }
        
        segments.append(segment_info)
    
    # Create DataFrame
    df = pd.DataFrame(segments)
    
    return df, result

# Example usage
if __name__ == "__main__":
    try:
        df, result = transcribe_audio(
            "/Users/yueyan/Documents/project/wearable/media/12mon/raw/IW_038_12_TL.wav"
        )
        
        # Save results
        df.to_csv("/Users/yueyan/Documents/project/wearable/transcription/whisper/12mon/IW_038_12_whisper_wopre_wos_111924.csv", index=False)
        
        # Print statistics
        print("\nTranscription Statistics:")
        print(f"Total segments: {len(df)}")
        
        print("\nFirst few transcriptions:")
        print(df[['start_time', 'end_time', 'text']].head())
        
    except Exception as e:
        print(f"Error occurred: {str(e)}")
        import traceback
        traceback.print_exc()

Loading model...


Transcribing...




[00:00.880 --> 00:02.560]  It's just Dougie, so...
[00:02.560 --> 00:02.980]  Come here, you.
[00:03.400 --> 00:04.640]  ...if you don't mind watching him play.
[00:06.100 --> 00:08.080]  Oh no, look what I do!
[00:13.220 --> 00:15.420]  The camera is really fast.
[00:22.800 --> 00:27.660]  Your play area is about where your leg is, so you guys have all that room to play.
[00:28.080 --> 00:31.000]  Okay, and I'll be back in four minutes to pick up the toys.
[00:33.760 --> 00:35.460]  Dougie, are you just going to eat the block?
[00:37.360 --> 00:38.160]  Oh, squeaky!
[00:43.820 --> 00:44.220]  Dougie...
[00:46.400 --> 00:47.780]  Oh, did it tickle?
[00:48.240 --> 00:51.080]  Did it tickle? Here, got it? Oh, you got it!
[00:51.980 --> 00:52.360]  Come on.
[00:53.460 --> 00:53.860]  Look.
[01:01.040 --> 01:01.780]  Cool.
[01:12.220 --> 01:15.720]  Oh, you got spit on this one already.
[01:17.260 --> 01:18.260]  What is he doing?
[01:21.780 --> 01:23.520]  What is that?
[01:25.040 --> 01:

In [8]:
###load datasets
whisper_df = pd.read_csv("/Users/yueyan/Documents/project/wearable/transcription/whisper/12mon/IW_038_12_whisper_wopre_wos_111924.csv")
human_df = pd.read_csv("/Users/yueyan/Documents/project/wearable/transcription/human/12mon/IW_038_12_human_transcription.csv")

In [9]:
import pandas as pd
import numpy as np
from difflib import SequenceMatcher
from tqdm import tqdm
import string
import re
from typing import Tuple, Dict, List

class EnhancedTranscriptionComparator:
    def __init__(self, whisper_df: pd.DataFrame, human_df: pd.DataFrame):
        """Initialize with enhanced text cleaning and comparison capabilities"""
        self.whisper_data = whisper_df.copy()
        self.human_data = human_df.copy()
        
        # Pre-clean the data
        self.human_data = self.human_data[self.human_data['text'].notna()]
        if 'end ' in self.human_data.columns:
            self.human_data = self.human_data.rename(columns={'end ': 'end'})
        
        # Convert times to numeric
        self.human_data['start'] = pd.to_numeric(self.human_data['start'])
        self.human_data['end'] = pd.to_numeric(self.human_data['end'])
        
        # Clean text and store originals
        self.whisper_data['clean_text'] = self.whisper_data['text'].apply(self.clean_text)
        self.human_data['clean_text'] = self.human_data['text'].apply(self.clean_text)
        self.whisper_data['original_text'] = self.whisper_data['text']
        self.human_data['original_text'] = self.human_data['text']
        
        # Find overlap range
        self.overlap_start = max(self.whisper_data['start'].min(), self.human_data['start'].min())
        self.overlap_end = min(self.whisper_data['end'].max(), self.human_data['end'].max())

    @staticmethod
    def clean_text(text: str) -> str:
        """Enhanced text cleaning function"""
        if pd.isna(text):
            return ""
            
        text = str(text).lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = " ".join(text.split())
        
        replacements = {
            'uhm': 'um', 'uhh': 'uh', 'hmm': 'hm',
            'mhm': 'mm', 'yeah': 'yes', 'yah': 'yes',
            'nah': 'no'
        }
        
        for old, new in replacements.items():
            text = re.sub(r'\b' + old + r'\b', new, text)
            
        return text

    def get_word_metrics(self, human_text: str, whisper_text: str) -> Dict[str, int]:
        """Calculate word-level metrics over unique words (repeated words count once)"""
        human_words = set(self.clean_text(human_text).split())
        whisper_words = set(self.clean_text(whisper_text).split())
        
        correct_words = len(human_words.intersection(whisper_words))
        total_words_human = len(human_words)
        total_words_whisper = len(whisper_words)
        mismatches = max(total_words_human, total_words_whisper) - correct_words
        
        return {
            'word_count_human': total_words_human,
            'word_count_whisper': total_words_whisper,
            'correct_words': correct_words,
            'mismatches': mismatches
        }

    def generate_comparison_results(self, time_tolerance: float = 0.5, 
                                  similarity_threshold: float = 0.3) -> pd.DataFrame:
        """Generate comprehensive comparison results for spreadsheet"""
        results = []
        
        human_segments = self.human_data[
            (self.human_data['start'] >= self.overlap_start) &
            (self.human_data['end'] <= self.overlap_end)
        ]
        
        whisper_segments = self.whisper_data[
            (self.whisper_data['start'] >= self.overlap_start) &
            (self.whisper_data['end'] <= self.overlap_end)
        ]
        
        for _, human_seg in tqdm(human_segments.iterrows(), desc="Analyzing segments"):
            potential_matches = whisper_segments[
                (whisper_segments['start'] >= human_seg['start'] - time_tolerance) &
                (whisper_segments['start'] <= human_seg['end'] + time_tolerance)
            ]
            
            for _, whisper_seg in potential_matches.iterrows():
                similarity = SequenceMatcher(
                    None,
                    human_seg['clean_text'],
                    whisper_seg['clean_text']
                ).ratio()
                
                if similarity > similarity_threshold:
                    # Get word-level metrics
                    word_metrics = self.get_word_metrics(
                        human_seg['original_text'],
                        whisper_seg['original_text']
                    )
                    
                    # Calculate accuracy
                    accuracy = (word_metrics['correct_words'] / word_metrics['word_count_human'] * 100) \
                        if word_metrics['word_count_human'] > 0 else 0
                    
                    # Generate comments
                    comments = []
                    if similarity < 0.5:
                        comments.append("Low similarity score")
                    if abs(word_metrics['word_count_human'] - word_metrics['word_count_whisper']) > 3:
                        comments.append("Significant word count difference")
                    if accuracy < 60:
                        comments.append("Low accuracy")
                    
                    results.append({
                        'Time Range': f"{human_seg['start']:.1f}-{human_seg['end']:.1f}",
                        'Whisper Transcription': whisper_seg['original_text'],
                        'Human Transcription': human_seg['original_text'],
                        'SequenceMatcher Score': round(similarity, 3),
                        'Word Count (Human)': word_metrics['word_count_human'],
                        'Word Count (Whisper)': word_metrics['word_count_whisper'],
                        'Correct Words': word_metrics['correct_words'],
                        'Mismatches': word_metrics['mismatches'],
                        'Accuracy (%)': round(accuracy, 1),
                        'Comments/Notes': "; ".join(comments) if comments else "OK"
                    })
        
        return pd.DataFrame(results)

def compare_and_export(whisper_file: pd.DataFrame, human_file: pd.DataFrame, 
                      output_path: str = "transcription_comparison.csv") -> Tuple[pd.DataFrame, Dict]:
    """Compare transcriptions and export results (both inputs are DataFrames, not file paths)"""
    # Initialize comparator
    comparator = EnhancedTranscriptionComparator(whisper_file, human_file)
    
    # Generate comparison results
    results_df = comparator.generate_comparison_results()
    
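    # Calculate summary statistics. Each row is one human/Whisper pair that
    # passed the similarity threshold, so "segments" here means matched pairs.
    if results_df.empty:
        raise ValueError("No segment pairs matched; check time alignment and thresholds.")
    stats = {
        'Total Segments': len(results_df),
        'Average Accuracy': results_df['Accuracy (%)'].mean(),
        'Average Similarity': results_df['SequenceMatcher Score'].mean(),
        'High Quality Matches': len(results_df[results_df['Accuracy (%)'] > 80]),
        'Low Quality Matches': len(results_df[results_df['Accuracy (%)'] < 60])
    }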
    
    # Export to CSV
    results_df.to_csv(output_path, index=False)
    
    # Print summary
    print("\nComparison Summary:")
    print(f"Total segments analyzed: {stats['Total Segments']}")
    print(f"Average accuracy: {stats['Average Accuracy']:.1f}%")
    print(f"Average similarity score: {stats['Average Similarity']:.3f}")
    print(f"High quality matches (>80%): {stats['High Quality Matches']}")
    print(f"Low quality matches (<60%): {stats['Low Quality Matches']}")
    
    return results_df, stats

# Example usage
if __name__ == "__main__":
    # Assume whisper_df and human_df are your input DataFrames; replace with your desired path for the comparison results to be saved
    results_df, stats = compare_and_export(whisper_df, human_df, "/Users/yueyan/Documents/project/wearable/transcription/comparison/12mon/IW_038_12_transcription_comparison.csv")
    
    # Display first few results
    print("\nSample comparison results:")
    print(results_df[['Time Range', 'Human Transcription', 
                     'Whisper Transcription', 'Accuracy (%)']].head())

Analyzing segments: 180it [00:00, 3118.03it/s]


Comparison Summary:
Total segments analyzed: 121
Average accuracy: 74.2%
Average similarity score: 0.806
High quality matches (>80%): 72
Low quality matches (<60%): 39

Sample comparison results:
  Time Range  Human Transcription  \
0  37.2-38.4          oh squeaky.   
1  46.5-47.9    oh did it tickle?   
2  46.5-47.9    oh did it tickle?   
3  48.4-49.5  did it tickle here.   
4  52.1-52.3             come on.   

                          Whisper Transcription  Accuracy (%)  
0                                  Oh, squeaky!         100.0  
1                            Oh, did it tickle?         100.0  
2  Did it tickle? Here, got it? Oh, you got it!         100.0  
3  Did it tickle? Here, got it? Oh, you got it!         100.0  
4                                      Come on.         100.0  
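
The sample output shows the same human segment (46.5-47.9) paired with two different Whisper segments, because every candidate above the similarity threshold produces a row. If one row per human segment is preferred, a post-processing step could keep only the best-scoring pair. The sketch below works on plain dictionaries. The helper name `best_match_per_segment` is hypothetical and the scores in the example rows are made up.

```python
from typing import Dict, List


def best_match_per_segment(rows: List[Dict],
                           key: str = "Time Range",
                           score: str = "SequenceMatcher Score") -> List[Dict]:
    """Keep, for each human time range, only the row with the highest score."""
    best: Dict[str, Dict] = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row[score] > current[score]:
            best[row[key]] = row
    return list(best.values())


# Illustrative rows; the scores are invented for this example.
rows = [
    {"Time Range": "46.5-47.9",
     "Whisper Transcription": "Oh, did it tickle?",
     "SequenceMatcher Score": 1.0},
    {"Time Range": "46.5-47.9",
     "Whisper Transcription": "Did it tickle? Here, got it? Oh, you got it!",
     "SequenceMatcher Score": 0.5},
    {"Time Range": "52.1-52.3",
     "Whisper Transcription": "Come on.",
     "SequenceMatcher Score": 1.0},
]
deduplicated = best_match_per_segment(rows)
```

With the real results, `results_df.to_dict("records")` would supply the rows. Note that this keys on the formatted time range, so two human segments with identical rounded times would be merged.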



