<a href="https://colab.research.google.com/github/vanderbilt-data-science/ai-days-collaboration/blob/main/Process_cvs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Days Collaboration Matcher
This notebook extracts research profiles from CVs and research statements using Claude 3.7 Sonnet, structures the information in JSON format, and prepares data for collaboration matching.

# Section 1: Initial Setup
Run this section once at the beginning of your session to set up the environment, import libraries, and connect to Google Drive.


## 1.1 Mount Google Drive
This mounts your Google Drive so the notebook can read the input files and save results.

In [39]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1.2 Install Required Packages
These packages are needed for processing documents and using the Claude API.


In [40]:
# Install required packages
!pip install anthropic PyPDF2 python-docx pandas tqdm



## 1.3 Import Libraries
Import all necessary libraries for file processing, data manipulation, and API interaction.


In [41]:
# Import necessary libraries
import os
import pandas as pd
import PyPDF2
import docx
import re
import json
import time
import datetime
import anthropic
from tqdm.notebook import tqdm
from google.colab import userdata





## 1.4 Define Paths and Constants
Configure file locations and other important settings.

In [42]:
# Define paths for multiple input directories
INPUT_DIRS = [
    "/content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research",
    "/content/drive/My Drive/Data Science/Symposium/Recent_CV_uploads"
]

# Output directory for saving profiles
OUTPUT_PATH = "/content/drive/My Drive/Data Science/Symposium/AI Days Output"
OUTPUT_SUMMARY_FILE = os.path.join(OUTPUT_PATH, "all_profiles.json")

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_PATH, exist_ok=True)

# Google Sheet information for form responses
FORM_RESPONSES_SHEET_ID = "1347yg-cKPv3VWmXSg6M7X-AI4QDCDETQUBPaxXWpfmc"
FORM_RESPONSES_RANGE = "Form Responses 1!A:Z"  # This gets all columns, we'll filter to what we need

# Column name for research statements in the form
RESEARCH_STATEMENT_COLUMN = "OPTION 2: Paste your Information\n\nPaste your professional summary and interests below (max 500 words) "

# Column names for identifying information
EMAIL_COLUMN = "Email"
FIRST_NAME_COLUMN = "First Name"
LAST_NAME_COLUMN = "Last Name"
AFFILIATION_COLUMN = "Affiliation"  # Must match the sheet header exactly, including any trailing space


## 1.5 Set Up API Key
Configure access to the Anthropic Claude API.

In [43]:
# Setup Anthropic API key
from google.colab import userdata
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
anthropic_client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
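
The cell above only works inside Colab, where `userdata` is available. A minimal sketch of a fallback for running the same code elsewhere, assuming the key may also be exported as an environment variable with the same name; `get_api_key` is a hypothetical helper, not part of the notebook:

```python
import os


def get_api_key(secret_name="ANTHROPIC_API_KEY"):
    """Return the API key from Colab secrets, falling back to an environment variable."""
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        key = userdata.get(secret_name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing / access was not granted
        pass

    key = os.environ.get(secret_name)
    if not key:
        raise RuntimeError(
            f"{secret_name} not found in Colab secrets or environment variables"
        )
    return key
```

The returned value could then be passed to `anthropic.Anthropic(api_key=...)` exactly as in the cell above.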


# Section 2: Core Functions
This section contains all the core functions for processing documents, extracting text, and generating research profiles using Claude 3.7 Sonnet.




## 2.1 File Processing Functions
Functions to extract text from various document formats (PDF, DOCX, TXT).

In [44]:
def extract_text_from_pdf(file_path):
    """Extract text from PDF files"""
    text = ""
    try:
        with open(file_path, 'rb') as f:
            pdf_reader = PyPDF2.PdfReader(f)
            for page in pdf_reader.pages:
                # extract_text() can yield nothing for image-only pages
                text += (page.extract_text() or "") + "\n"
    except Exception as e:
        print(f"Error extracting PDF {file_path}: {str(e)}")
    return text

def extract_text_from_docx(file_path):
    """Extract text from DOCX files"""
    try:
        doc = docx.Document(file_path)
        return "\n".join([para.text for para in doc.paragraphs])
    except Exception as e:
        print(f"Error extracting DOCX {file_path}: {str(e)}")
        return ""

def extract_text_from_file(file_path):
    """Extract text from various file formats"""
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()

    try:
        if ext == '.pdf':
            return extract_text_from_pdf(file_path)
        elif ext in ['.docx', '.doc']:
            # python-docx only reads .docx; a legacy .doc file fails here and returns ""
            return extract_text_from_docx(file_path)
        elif ext == '.txt':
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                return f.read()
        else:
            print(f"Unsupported file format: {ext}")
            return ""
    except Exception as e:
        print(f"Error extracting text from {file_path}: {str(e)}")
        return ""
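
The extension-based routing in `extract_text_from_file` can also be written as a lookup table, which makes adding a format a one-line change. The sketch below is illustrative only: `READERS`, `read_txt`, and `extract_text` are hypothetical names, and only the plain-text reader is registered so the example runs without PyPDF2 or python-docx.

```python
import os
import tempfile


def read_txt(file_path):
    """Read a plain-text file, ignoring undecodable bytes."""
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()


# Hypothetical registry: extension -> reader function.
# In the notebook, '.pdf' and '.docx' would map to the extractors defined above.
READERS = {".txt": read_txt}


def extract_text(file_path, readers=READERS):
    """Route a file to the reader registered for its (case-insensitive) extension."""
    ext = os.path.splitext(file_path)[1].lower()
    reader = readers.get(ext)
    if reader is None:
        print(f"Unsupported file format: {ext}")
        return ""
    return reader(file_path)


# Usage: write a small text file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "statement.TXT")
    with open(path, "w", encoding="utf-8") as f:
        f.write("I study graph neural networks.")
    sample_text = extract_text(path)
    unsupported = extract_text(os.path.join(tmp, "slides.pptx"))
```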

## 2.2 Research Profile Extraction
Function to use Claude 3.7 Sonnet to extract structured information from documents.


In [45]:
def extract_research_profile(text, file_name):
    """
    Extract research profile using Claude 3.7 Sonnet in structured JSON format

    Returns:
        Dictionary with structured research profile information
    """
    # Create system prompt
    system_prompt = """You are an expert assistant helping to extract structured information from academic resumes and research statements.
    Extract the information according to the specified JSON format, ensuring all fields are included with the correct structure."""

    # Create user prompt with structured JSON format
    user_prompt = f"""
    Analyze this document and extract key research information in the following JSON format:

    {{
      "basic_info": {{
        "name": "Full name of the researcher",
        "email": "Email address if available, otherwise null",
        "affiliation": "University or organization name",
        "role": "The person's role (faculty, physician-scientist, clinical_faculty, clinician_in_training, postdoc, research_scientist, graduate_student, undergraduate_student, research_staff, industry, community, or other)"
      }},
      "research_profile": {{
        "primary_focus": "A 1-2 sentence description of their main research area",
        "methodologies": ["Method 1", "Method 2", "Method 3"],
        "domains": ["Application domain 1", "Application domain 2"],
        "research_summary": "A 150-200 word paragraph describing their research in detail"
      }},
      "collaboration_potential": {{
        "expertise_offered": ["Specific expertise 1", "Specific expertise 2"],
        "resources_available": ["Dataset", "Tool", "Framework"],
        "complementary_fields": ["Field 1", "Field 2"]
      }},
      "keywords": ["keyword1", "keyword2", "keyword3", "keyword4", "keyword5"]
    }}

    Guidelines:
    1. Extract information explicitly stated in the document when available
    2. Make reasonable inferences for missing fields, marking these as "[inferred]"
    3. Use null (the JSON literal, not the string "null") for fields where no information is available and no inference can be made
    4. Limit lists to 3-5 items, prioritizing the most significant ones
    5. Ensure the research_summary provides a coherent overview of their work
    6. Focus on information most relevant for identifying potential research collaborations
    7. For the "role" field, determine the most appropriate category:
       - faculty: Academic professors and researchers without significant clinical duties
       - physician-scientist: Medical doctors who conduct significant research
       - clinical_faculty: Primarily clinical practitioners with academic appointments
       - clinician_in_training: Residents, fellows, and others in clinical training positions
       - postdoc: Postdoctoral researchers
       - research_scientist: Non-faculty research professionals
       - graduate_student: Master's and PhD students
       - undergraduate_student: Bachelor's degree students
       - research_staff: Research assistants, lab managers, etc.
       - industry: Private sector professionals
       - community: Community organization members
       - other: Roles that don't fit the above categories

    The JSON structure must be strictly followed without additional narrative or explanation outside the JSON object.

    Document:
    {text[:15000]}
    """

    try:
        # Call Anthropic API with Claude 3.7 Sonnet
        response = anthropic_client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=1000,
            system=system_prompt,
            messages=[
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.2
        )

        # Extract and parse the JSON response
        response_text = response.content[0].text

        # Find JSON in the response (in case Claude adds any commentary)
        json_match = re.search(r'({.*})', response_text, re.DOTALL)
        if json_match:
            json_str = json_match.group(1)
            profile = json.loads(json_str)

            # Add file information
            profile['file_info'] = {
                'file_name': file_name,
                'processing_time': pd.Timestamp.now().isoformat()
            }

            return profile
        else:
            # If no JSON formatting, return an error structure
            return {
                "basic_info": {
                    "name": "Error parsing",
                    "email": "Error parsing",
                    "affiliation": "Error parsing",
                    "role": "Error parsing"
                },
                "research_profile": {
                    "primary_focus": "Error parsing JSON from response",
                    "methodologies": [],
                    "domains": [],
                    "research_summary": response_text
                },
                "collaboration_potential": {
                    "expertise_offered": [],
                    "resources_available": [],
                    "complementary_fields": []
                },
                "keywords": [],
                "file_info": {
                    "file_name": file_name,
                    "processing_time": pd.Timestamp.now().isoformat(),
                    "error": "Failed to parse JSON from response"
                }
            }

    except Exception as e:
        print(f"Error extracting research profile from {file_name}: {str(e)}")
        return {
            "basic_info": {
                "name": "Error",
                "email": "Error",
                "affiliation": "Error",
                "role": "Error"
            },
            "research_profile": {
                "primary_focus": "Error processing",
                "methodologies": [],
                "domains": [],
                "research_summary": f"Error: {str(e)}"
            },
            "collaboration_potential": {
                "expertise_offered": [],
                "resources_available": [],
                "complementary_fields": []
            },
            "keywords": [],
            "file_info": {
                "file_name": file_name,
                "processing_time": pd.Timestamp.now().isoformat(),
                "error": str(e)
            }
        }
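
The pattern `({.*})` is greedy, so it spans from the first `{` to the last `}` in the response. If the model adds trailing commentary that contains braces, `json.loads` fails on the combined string. One possible alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, parses the first complete JSON object and ignores whatever follows. `extract_first_json_object` is a hypothetical helper and the reply text is made up for illustration:

```python
import json


def extract_first_json_object(response_text):
    """Return the first parseable JSON object in the text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = response_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest
            obj, _end = decoder.raw_decode(response_text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Not valid JSON from this brace; try the next one
        start = response_text.find("{", start + 1)
    return None


reply = (
    'Here is the profile:\n'
    '{"keywords": ["nlp", "ehr"], "basic_info": {"name": "A. Researcher"}}\n'
    'Let me know if you need {more}.'
)
profile = extract_first_json_object(reply)
```

A response cut off by the `max_tokens` limit contains no complete object, so this helper returns `None` for it rather than raising.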


## 2.3 Directory Scanning Functions
Functions to scan multiple directories and get file lists.


In [46]:
def scan_directory(directory, limit=None):
    """
    Scan a single directory for files

    Args:
        directory: Path to scan
        limit: Optional limit on number of files to return

    Returns:
        List of file paths
    """
    file_paths = []

    try:
        # Walk through the directory
        for root, _, files in os.walk(directory):
            for file in files:
                # Only include document files
                _, ext = os.path.splitext(file)
                if ext.lower() in ['.pdf', '.docx', '.doc', '.txt']:
                    file_path = os.path.join(root, file)
                    file_paths.append(file_path)
    except Exception as e:
        print(f"Error scanning directory {directory}: {str(e)}")

    # Limit the number of files if specified
    if limit and len(file_paths) > limit:
        file_paths = file_paths[:limit]

    return file_paths

def scan_all_directories(directories, limit=None):
    """
    Scan multiple directories and combine results

    Args:
        directories: List of directory paths to scan
        limit: Optional limit on total number of files to return

    Returns:
        List of file paths
    """
    all_files = []

    for directory in directories:
        print(f"Scanning directory: {directory}")
        files = scan_directory(directory)
        print(f"Found {len(files)} files")
        all_files.extend(files)

    # Limit the total number of files if specified
    if limit and len(all_files) > limit:
        all_files = all_files[:limit]

    print(f"Total files found: {len(all_files)}")
    return all_files
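
`os.walk` yields entries in whatever order the filesystem returns them, so slicing with `limit` may pick different files from run to run. A minimal sketch of a variant that sorts before limiting, exercised on a temporary directory; `scan_directory_sorted` and `DOCUMENT_EXTENSIONS` are hypothetical names, and the file names are invented for the example:

```python
import os
import tempfile

DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".doc", ".txt"}


def scan_directory_sorted(directory, limit=None):
    """Recursively collect document files in a stable, sorted order."""
    file_paths = []
    for root, _, files in os.walk(directory):
        for name in files:
            if os.path.splitext(name)[1].lower() in DOCUMENT_EXTENSIONS:
                file_paths.append(os.path.join(root, name))
    file_paths.sort()
    return file_paths[:limit] if limit else file_paths


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ["b_cv.PDF", "a_statement.txt", "photo.png",
                os.path.join("nested", "c_cv.docx")]:
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in scan_directory_sorted(tmp)]
    first_two = [os.path.relpath(p, tmp) for p in scan_directory_sorted(tmp, limit=2)]
```

Here `photo.png` is filtered out, and the nested `.docx` file is found because the walk is recursive.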

## 2.4 Profile Display Function
Function to display extracted profiles in a readable format.

In [47]:
def display_profiles(profiles, limit=None):
    """Display the extracted profiles in a readable format"""
    # Limit number of profiles to display if specified
    if limit and len(profiles) > limit:
        profiles_to_show = profiles[:limit]
        print(f"Displaying {limit} of {len(profiles)} profiles")
    else:
        profiles_to_show = profiles

    for i, profile in enumerate(profiles_to_show):
        print(f"\n{'='*80}\nProfile {i+1}: {profile.get('file_info', {}).get('file_name', 'Unknown')}\n{'='*80}")

        # Basic Info
        basic_info = profile.get('basic_info', {})
        print(f"Name: {basic_info.get('name', 'Not extracted')}")
        print(f"Email: {basic_info.get('email', 'Not extracted')}")
        print(f"Affiliation: {basic_info.get('affiliation', 'Not extracted')}")
        print(f"Role: {basic_info.get('role', 'Not extracted')}")

        # Research Profile
        research_profile = profile.get('research_profile', {})
        print(f"\nPrimary Focus: {research_profile.get('primary_focus', 'Not extracted')}")

        print("\nMethodologies:")
        for method in research_profile.get('methodologies', []):
            print(f"- {method}")

        print("\nDomains:")
        for domain in research_profile.get('domains', []):
            print(f"- {domain}")

        print(f"\nResearch Summary:\n{research_profile.get('research_summary', 'Not extracted')}")

        # Collaboration Potential
        collab = profile.get('collaboration_potential', {})
        print("\nExpertise Offered:")
        for expertise in collab.get('expertise_offered', []):
            print(f"- {expertise}")

        print("\nResources Available:")
        for resource in collab.get('resources_available', []):
            print(f"- {resource}")

        print("\nComplementary Fields:")
        for field in collab.get('complementary_fields', []):
            print(f"- {field}")

        # Keywords
        print("\nKeywords:")
        for keyword in profile.get('keywords', []):
            print(f"- {keyword}")

    if limit and len(profiles) > limit:
        print(f"\n... and {len(profiles) - limit} more profiles (not displayed)")

## 2.5 Form Response Processing Functions
Functions to load and process text-based research statements from Google Form responses stored in a Google Sheet.

In [48]:
def load_form_responses(sheet_id, range_name):
    """
    Load form responses from Google Sheets with special handling for multiline headers
    """
    from googleapiclient.discovery import build
    from google.auth import default
    import pandas as pd

    print(f"Loading form responses from sheet ID: {sheet_id}")

    try:
        # Authenticate and build the service
        creds, _ = default()
        service = build('sheets', 'v4', credentials=creds)

        # Call the Sheets API to get raw values
        sheet = service.spreadsheets()
        result = sheet.values().get(spreadsheetId=sheet_id, range=range_name).execute()
        values = result.get('values', [])

        if not values:
            print("No data found in the sheet.")
            return pd.DataFrame()

        # Print raw headers to help with debugging
        headers = values[0]
        print("Raw headers from API:")
        for i, header in enumerate(headers):
            print(f"  {i}: {header}")

        # Get research statement column index (usually column K = index 10)
        research_col_index = 10  # Default to column K

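        # Create DataFrame, padding short rows because the Sheets API omits trailing empty cells
        rows = [row + [''] * (len(headers) - len(row)) for row in values[1:]]
        df = pd.DataFrame(rows, columns=headers)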

        # Handle research statement column directly from raw data
        if len(headers) > research_col_index:
            research_header = headers[research_col_index]
            print(f"Using research statement column: '{research_header}'")

            # Add a simpler column name that's easier to reference
            df['research_statement'] = None

            # Directly copy values from the raw data
            for i, row in enumerate(values[1:]):
                if len(row) > research_col_index and row[research_col_index]:
                    df.at[i, 'research_statement'] = row[research_col_index]

        print(f"Loaded {len(df)} form responses")

        # Count non-empty research statements
        if 'research_statement' in df.columns:
            non_empty = df['research_statement'].notna().sum()
            print(f"Found {non_empty} non-empty research statements")

            # Show examples
            if non_empty > 0:
                examples = df[df['research_statement'].notna()].head(3)
                print("\nExamples of research statements:")
                for i, row in examples.iterrows():
                    text = row['research_statement']
                    preview = text[:100] + "..." if len(text) > 100 else text
                    print(f"  Row {i+1}: {preview}")

        return df

    except Exception as e:
        print(f"Error loading form responses: {str(e)}")
        return pd.DataFrame()

def extract_profile_from_text_response(row, research_column, email_column, first_name_column, last_name_column):
    """
    Extract research profile from text-based response using Claude

    Args:
        row: Row from form responses DataFrame
        research_column: Column name containing the research statement
        email_column: Column name containing email address
        first_name_column: Column name containing respondent's first name
        last_name_column: Column name containing respondent's last name

    Returns:
        Dictionary with structured research profile information
    """
    # Get identification info
    email = row.get(email_column, '')
    first_name = row.get(first_name_column, '')
    last_name = row.get(last_name_column, '')
    full_name = f"{first_name} {last_name}".strip()

    # Get research statement text
    text = row.get(research_column, '')

    if not text or str(text).strip() == '':
        print(f"No text found for {full_name} ({email})")
        return None

    # Use the same extraction function we use for files
    print(f"Extracting profile for {full_name} ({email})")
    profile = extract_research_profile(text, f"Form Response - {full_name}")

    # Add form response info
    profile['form_info'] = {
        'email': email,
        'first_name': first_name,
        'last_name': last_name,
        'full_name': full_name,
        'response_type': 'text',
        'processing_time': pd.Timestamp.now().isoformat()
    }

    return profile

def process_form_responses(sheet_id, range_name, email_column, first_name_column, last_name_column, processed_emails=None):
    """
    Process all form responses with text-based research statements

    Args:
        sheet_id: The ID of the Google Sheet
        range_name: The range to load
        email_column: Column name containing email addresses
        first_name_column: Column name containing first names
        last_name_column: Column name containing last names
        processed_emails: Set of emails that have already been processed

    Returns:
        List of extracted profiles
    """
    # Initialize processed_emails if not provided
    if processed_emails is None:
        processed_emails = set()
    else:
        print(f"Skipping {len(processed_emails)} already processed form submissions")

    # Load form responses with special handling for research statements
    df = load_form_responses(sheet_id, range_name)

    if df.empty:
        return []

    # Track emails processed in this run to avoid duplicates
    currently_processed = set()
    profiles = []

    print(f"Processing {len(df)} form responses")

    # Use the simplified 'research_statement' column
    research_column = 'research_statement'

    for i, row in df.iterrows():
        try:
            # Extract email for identification
            email = row.get(email_column, '')

            # Skip if no email (unlikely but possible)
            if not email:
                print(f"Skipping row {i+1}: No email address found")
                continue

            # Skip if already processed in a previous run
            if email in processed_emails:
                print(f"Skipping previously processed submission from {email}")
                continue

            # Skip if already processed in this run (duplicate in sheet)
            if email in currently_processed:
                print(f"Skipping duplicate entry for {email}")
                continue

            # Check if this row has a text-based research statement
            if research_column in row and row[research_column] and isinstance(row[research_column], str) and row[research_column].strip():
                profile = extract_profile_from_text_response(row, research_column, email_column, first_name_column, last_name_column)

                if profile:
                    # Save individual profile
                    safe_email = email.replace('@', '_at_').replace('.', '_dot_')
                    output_file = os.path.join(OUTPUT_PATH, f"form_response_{safe_email}_profile.json")
                    with open(output_file, 'w', encoding='utf-8') as f:
                        json.dump(profile, f, indent=2)

                    profiles.append(profile)
                    currently_processed.add(email)
                    print(f"Processed form response {i+1}/{len(df)} for {email}")
            else:
                print(f"Skipping row {i+1}: No research statement text found")

        except Exception as e:
            print(f"Error processing row {i+1}: {str(e)}")

    print(f"Extracted {len(profiles)} profiles from form responses")
    return profiles
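
Two details of the sheet data are worth handling explicitly. The Sheets API omits trailing empty cells, so data rows can be shorter than the header row, and the research statement column is identified by position (index 10) rather than by its header text. A sketch of both ideas using plain lists; `pad_rows`, `find_column`, and the sample values are hypothetical and only mimic the shape returned by `values().get()`:

```python
def pad_rows(values):
    """Pad (or trim) each data row so it matches the header length."""
    headers, rows = values[0], values[1:]
    width = len(headers)
    return headers, [row[:width] + [""] * (width - len(row)) for row in rows]


def find_column(headers, prefix):
    """Return the index of the first header starting with `prefix`, or None."""
    for index, header in enumerate(headers):
        if header.strip().startswith(prefix):
            return index
    return None


# Hypothetical sheet contents: the first row holds the headers
values = [
    ["Email", "First Name", "OPTION 2: Paste your Information"],
    ["a@example.edu", "Ada", "I work on causal inference."],
    ["b@example.edu"],  # trailing empty cells were omitted
]
headers, rows = pad_rows(values)
statement_index = find_column(headers, "OPTION 2")
statements = [row[statement_index] for row in rows]
```

Matching on a header prefix keeps the lookup working if columns are added to the form, at the cost of depending on the question wording staying the same.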

# Section 3: Processing Functions
Functions for tracking processed files, estimating runtime, and handling incremental updates.



## 3.1 File Tracking Functions
Functions to track processed files and identify new submissions.

In [49]:
def load_existing_profiles():
    """
    Load existing profiles from the output summary file

    Returns:
        List of profiles, a set of processed file paths, and a set of processed emails
    """
    profiles = []
    processed_files = set()
    processed_emails = set()

    try:
        if os.path.exists(OUTPUT_SUMMARY_FILE):
            with open(OUTPUT_SUMMARY_FILE, 'r', encoding='utf-8') as f:
                profiles = json.load(f)

            # Extract file paths and emails from the loaded profiles
            for profile in profiles:
                # Track file-based profiles
                if 'file_info' in profile and 'file_path' in profile['file_info']:
                    processed_files.add(profile['file_info']['file_path'])

                # Track form-based profiles by email
                if 'form_info' in profile and 'email' in profile['form_info']:
                    processed_emails.add(profile['form_info']['email'])

            print(f"Loaded {len(profiles)} existing profiles")
            print(f"Found {len(processed_files)} previously processed files")
            print(f"Found {len(processed_emails)} previously processed form submissions")
        else:
            print("No existing profiles found")
    except Exception as e:
        print(f"Error loading existing profiles: {str(e)}")

    return profiles, processed_files, processed_emails

def identify_new_files(all_files, processed_files):
    """
    Identify new files that haven't been processed yet

    Args:
        all_files: List of all file paths
        processed_files: Set of previously processed file paths

    Returns:
        List of file paths that haven't been processed
    """
    new_files = [f for f in all_files if f not in processed_files]
    print(f"Found {len(new_files)} new files to process")
    return new_files

def save_all_profiles(profiles):
    """
    Save all profiles to the output summary file

    Args:
        profiles: List of profiles to save
    """
    try:
        with open(OUTPUT_SUMMARY_FILE, 'w', encoding='utf-8') as f:
            json.dump(profiles, f, indent=2)
        print(f"Saved {len(profiles)} profiles to {OUTPUT_SUMMARY_FILE}")
    except Exception as e:
        print(f"Error saving profiles: {str(e)}")
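
Because `all_profiles.json` is the record of what has already been processed, an interrupted write could leave it truncated and cause everything to be reprocessed. One common safeguard, sketched below, writes to a temporary file in the same directory and then swaps it into place with `os.replace`. `save_json_atomically` is a hypothetical helper, and the atomicity of the swap on a mounted Google Drive folder is an assumption that has not been verified here:

```python
import json
import os
import tempfile


def save_json_atomically(data, target_path):
    """Write JSON to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        # Replaces the target in one step when both paths share a filesystem
        os.replace(tmp_path, target_path)
    except Exception:
        # Clean up the partial temp file and let the caller see the error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    summary_file = os.path.join(tmp, "all_profiles.json")
    save_json_atomically([{"keywords": ["nlp"]}], summary_file)
    save_json_atomically([{"keywords": ["nlp"]}, {"keywords": ["vision"]}], summary_file)
    with open(summary_file, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".tmp")]
```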

## 3.2 Estimation Function
Function to estimate runtime and cost for processing files.


In [50]:
def estimate_processing(files_to_process, sample_size=5):
    """
    Process a sample of files and estimate total runtime and cost

    Args:
        files_to_process: List of files to process
        sample_size: Number of files to process for the estimate

    Returns:
        Dictionary with estimation results
    """
    total_files = len(files_to_process)

    if total_files == 0:
        print("No files to process.")
        return None

    # Use the minimum of sample_size or total_files
    actual_sample_size = min(sample_size, total_files)

    # Process a sample of files
    sample_files = files_to_process[:actual_sample_size]
    print(f"Processing {actual_sample_size} sample files out of {total_files} total files...")

    # Track timing and token usage
    start_time = time.time()
    token_counts = []

    # Process each file in the sample
    results = []

    for file_path in tqdm(sample_files):
        file_name = os.path.basename(file_path)

        try:
            # Extract text from file
            text = extract_text_from_file(file_path)

            if text:
                # Estimate token count (rough estimate: 4 chars per token)
                estimated_tokens = len(text) // 4
                token_counts.append(min(estimated_tokens, 15000 // 4))  # Cap at our limit

                # Extract research profile
                profile = extract_research_profile(text, file_name)

                # Add file info
                profile['file_info'] = {
                    'file_name': file_name,
                    'file_path': file_path,
                    'processing_time': pd.Timestamp.now().isoformat()
                }

                # Save individual profile
                output_file = os.path.join(OUTPUT_PATH, f"{os.path.splitext(file_name)[0]}_profile.json")
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(profile, f, indent=2)

                results.append(profile)
                print(f"Processed: {file_name}")
            else:
                print(f"No text extracted from: {file_name}")

        except Exception as e:
            print(f"Error processing {file_name}: {str(e)}")

    # Display the results
    if len(results) > 0:
        print("\nSample profiles extracted:")
        display_profiles(results)

    # Calculate timing
    end_time = time.time()
    total_sample_time = end_time - start_time
    avg_time_per_file = total_sample_time / max(len(sample_files), 1)
    estimated_total_time = avg_time_per_file * total_files

    # Calculate estimated token usage
    avg_tokens_per_file = sum(token_counts) / max(len(token_counts), 1) if token_counts else 0
    estimated_total_tokens_in = avg_tokens_per_file * total_files

    # Estimate output tokens (typically smaller than input)
    estimated_tokens_out = estimated_total_tokens_in * 0.3  # Rough estimate

    # Calculate cost (Claude 3.7 Sonnet pricing)
    # As of March 2025: $3 per million input tokens, $15 per million output tokens
    input_cost_per_1k = 0.003  # $0.003 per 1K input tokens for Claude 3.7 Sonnet
    output_cost_per_1k = 0.015  # $0.015 per 1K output tokens for Claude 3.7 Sonnet

    estimated_input_cost = (estimated_total_tokens_in / 1000) * input_cost_per_1k
    estimated_output_cost = (estimated_tokens_out / 1000) * output_cost_per_1k
    estimated_total_cost = estimated_input_cost + estimated_output_cost

    # Display results
    print("\n" + "="*80)
    print("PROCESSING ESTIMATE SUMMARY")
    print("="*80)
    print(f"Sample size: {len(sample_files)} files")
    print(f"Total files to process: {total_files}")
    print(f"\nTiming Information:")
    print(f"  Average processing time per file: {avg_time_per_file:.2f} seconds")
    print(f"  Estimated total processing time: {estimated_total_time:.2f} seconds " +
          f"({estimated_total_time/60:.2f} minutes, {estimated_total_time/3600:.2f} hours)")

    print(f"\nToken Usage Estimate:")
    print(f"  Average input tokens per file: {avg_tokens_per_file:.0f}")
    print(f"  Estimated total input tokens: {estimated_total_tokens_in:.0f}")
    print(f"  Estimated total output tokens: {estimated_tokens_out:.0f}")

    print(f"\nCost Estimate:")
    print(f"  Estimated input cost: ${estimated_input_cost:.2f}")
    print(f"  Estimated output cost: ${estimated_output_cost:.2f}")
    print(f"  Estimated total cost: ${estimated_total_cost:.2f}")

    print("\nNote: These are rough estimates based on the sample. Actual values may vary.")
    print("="*80)

    # Return estimation results
    return {
        'sample_size': len(sample_files),
        'total_files': total_files,
        'avg_time_per_file': avg_time_per_file,
        'estimated_total_time': estimated_total_time,
        'estimated_total_time_minutes': estimated_total_time/60,
        'estimated_total_time_hours': estimated_total_time/3600,
        'avg_tokens_per_file': avg_tokens_per_file,
        'estimated_total_tokens_in': estimated_total_tokens_in,
        'estimated_total_tokens_out': estimated_tokens_out,
        'estimated_input_cost': estimated_input_cost,
        'estimated_output_cost': estimated_output_cost,
        'estimated_total_cost': estimated_total_cost,
        'sample_profiles': results
    }
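
The arithmetic behind the estimate can be checked in isolation. The sketch below reproduces it with the same heuristics as the function above (4 characters per token, a 15,000-character cap, output at 30% of input). `estimate_cost` is a hypothetical helper, the document lengths are invented, and the prices are passed in as parameters because they change over time; the $3 and $15 per million tokens used in the example are assumptions to verify against current pricing.

```python
def estimate_cost(sample_char_counts, total_files,
                  input_price_per_mtok, output_price_per_mtok,
                  max_chars=15000, chars_per_token=4, output_ratio=0.3):
    """Extrapolate token usage and cost from a sample of document lengths."""
    if not sample_char_counts:
        return None
    # Each document is truncated to max_chars before being sent
    sample_tokens = [min(n, max_chars) // chars_per_token for n in sample_char_counts]
    avg_tokens = sum(sample_tokens) / len(sample_tokens)
    tokens_in = avg_tokens * total_files
    tokens_out = tokens_in * output_ratio
    cost = ((tokens_in / 1_000_000) * input_price_per_mtok
            + (tokens_out / 1_000_000) * output_price_per_mtok)
    return {"tokens_in": tokens_in, "tokens_out": tokens_out, "cost": cost}


# Hypothetical sample: documents of 8,000, 20,000 and 12,000 characters,
# extrapolated to 100 files
estimate = estimate_cost([8000, 20000, 12000], total_files=100,
                         input_price_per_mtok=3.0, output_price_per_mtok=15.0)
```

With these inputs the sample averages about 2,917 input tokens per file, giving roughly 291,667 input tokens, 87,500 output tokens, and a total of about $2.19 for 100 files.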

## 3.3 Full Processing Function
Function to process all files or only new files.

In [51]:
def process_files(files_to_process, progress_update=None):
    """
    Process a list of files and extract research profiles

    Args:
        files_to_process: List of files to process
        progress_update: Optional callback function for progress updates

    Returns:
        List of extracted profiles
    """
    results = []
    total = len(files_to_process)

    start_time = time.time()

    for i, file_path in enumerate(tqdm(files_to_process)):
        file_name = os.path.basename(file_path)

        try:
            # Extract text from file
            text = extract_text_from_file(file_path)

            if text:
                # Extract research profile
                profile = extract_research_profile(text, file_name)

                # Add file info
                profile['file_info'] = {
                    'file_name': file_name,
                    'file_path': file_path,
                    'processing_time': pd.Timestamp.now().isoformat()
                }

                # Save individual profile
                output_file = os.path.join(OUTPUT_PATH, f"{os.path.splitext(file_name)[0]}_profile.json")
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(profile, f, indent=2)

                results.append(profile)
                print(f"Processed ({i+1}/{total}): {file_name}")
            else:
                print(f"No text extracted from: {file_name}")

        except Exception as e:
            print(f"Error processing {file_name}: {str(e)}")

        # Provide progress update if callback is provided
        if progress_update:
            progress_update(i+1, total)

    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"\nProcessing complete! Processed {len(results)} files in {elapsed_time:.2f} seconds")
    print(f"Average time per file: {elapsed_time/max(len(files_to_process), 1):.2f} seconds")

    return results
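
The `progress_update` argument accepts any callable that takes the number of files handled so far and the total. Because `process_files` calls it after every iteration, the count includes files that failed or yielded no text. The following is a minimal sketch of such a callback; `make_progress_logger` is a hypothetical helper for illustration, not part of this notebook, and it reports only on every Nth file to keep the output short.

```python
def make_progress_logger(every=10):
    """Return a callback usable as process_files(progress_update=...).

    Hypothetical helper: prints (and records) a message on every
    `every`-th file and on the final file.
    """
    messages = []

    def progress_update(done, total):
        # Report at regular intervals and always on the last file
        if done % every == 0 or done == total:
            pct = 100 * done / max(total, 1)
            message = f"{done}/{total} files handled ({pct:.0f}%)"
            messages.append(message)
            print(message)

    # Expose the recorded messages for later inspection
    progress_update.messages = messages
    return progress_update


# Simulate the calls process_files would make for 25 files
callback = make_progress_logger(every=10)
for i in range(25):
    callback(i + 1, 25)
```

In the notebook, the callback would be passed as `process_files(files_to_process, progress_update=callback)`.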

# Section 4: Main Execution Functions
Functions to run the profile extraction process with different options.

In [52]:
def confirm_action(message):
    """
    Ask for user confirmation before proceeding

    Args:
        message: Message to display

    Returns:
        True if user confirms, False otherwise
    """
    response = input(f"{message} (y/n): ").strip().lower()
    return response in ('y', 'yes')

def run_profile_extraction(run_updates_only=True, sample_size=5, run_estimation=True, include_form_responses=True):
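    """
    Run the profile extraction process

    Args:
        run_updates_only: If True, process only files not already in the saved profiles
        sample_size: Number of files to sample when estimating time and cost
        run_estimation: If True, estimate time and cost and ask for confirmation first
        include_form_responses: If True, also process text-based form responses

    Returns:
        List of all profiles, or None if the user cancels a full reprocess
    """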
    print(f"Running profile extraction with updates_only={run_updates_only}, include_form_responses={include_form_responses}")

    # Scan all directories for files
    all_files = scan_all_directories(INPUT_DIRS)

    if run_updates_only:
        # Load existing profiles and identify new files
        existing_profiles, processed_files, processed_emails = load_existing_profiles()
        files_to_process = identify_new_files(all_files, processed_files)

        if not files_to_process and not include_form_responses:
            print("No new files to process.")
            return existing_profiles
    else:
        # Confirm overwrite
        if os.path.exists(OUTPUT_SUMMARY_FILE):
            if not confirm_action("This will reprocess all files and overwrite existing profiles. Continue?"):
                print("Operation cancelled.")
                return None

        # Process all files
        existing_profiles = []
        processed_emails = set()
        files_to_process = all_files

    # Process files if needed
    new_profiles = []
    if files_to_process:
        # Run estimation if requested
        if run_estimation:
            estimation_results = estimate_processing(files_to_process, sample_size)

            if estimation_results:
                # Ask for confirmation before proceeding with full processing
                prompt = (
                    f"Estimated processing time: {estimation_results['estimated_total_time_minutes']:.2f} minutes\n"
                    f"Estimated cost: ${estimation_results['estimated_total_cost']:.2f}\n\n"
                    f"Proceed with processing {len(files_to_process)} files?"
                )
                if not confirm_action(prompt):
                    print("File processing cancelled.")
                    files_to_process = []

        # Process files
        if files_to_process:
            new_profiles = process_files(files_to_process)

    # Combine new profiles with existing ones (if in update mode)
    all_profiles = existing_profiles + new_profiles

    # Process form responses if requested
    if include_form_responses:
        print("\nProcessing text-based form responses...")
        form_profiles = process_form_responses(
            FORM_RESPONSES_SHEET_ID,
            FORM_RESPONSES_RANGE,
            EMAIL_COLUMN,
            FIRST_NAME_COLUMN,
            LAST_NAME_COLUMN,
            processed_emails if run_updates_only else None  # Only pass if doing incremental update
        )

        # Combine with file-based profiles
        all_profiles = all_profiles + form_profiles

    # Save all profiles
    if new_profiles or (include_form_responses and len(form_profiles) > 0):
        save_all_profiles(all_profiles)

    print(f"Profile extraction complete. Total profiles: {len(all_profiles)}")
    return all_profiles

# Section 5: Interactive Run
Use this section to run the profile extraction process interactively.


In [53]:
# Set parameters for the run
run_updates_only = True  # Set to False to reprocess all files
include_form_responses = True  # Set to False to skip form responses
sample_size = 5  # Number of files to use for estimation
run_estimation = False  # Set to True to estimate time and cost before processing

# Run the profile extraction process
profiles = run_profile_extraction(
    run_updates_only=run_updates_only,
    sample_size=sample_size,
    run_estimation=run_estimation,
    include_form_responses=include_form_responses
)

Running profile extraction with updates_only=True, include_form_responses=True
Scanning directory: /content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research
Found 152 files
Scanning directory: /content/drive/My Drive/Data Science/Symposium/Recent_CV_uploads
Found 27 files
Total files found: 179
Loaded 256 existing profiles
Found 177 previously processed files
Found 79 previously processed form submissions
Found 2 new files to process


  0%|          | 0/2 [00:00<?, ?it/s]

Error extracting DOCX /content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research/Ryan Patrick - Sr. Director of Analytics - Ryan Patrick.doc: Package not found at '/content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research/Ryan Patrick - Sr. Director of Analytics - Ryan Patrick.doc'
No text extracted from: Ryan Patrick - Sr. Director of Analytics - Ryan Patrick.doc
Error extracting DOCX /content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research/Golann_CV_Jan 2025 for website - Joanne Golann.doc: file '/content/drive/My Drive/Data Science/Symposium/AI Days Resumes_Research/Golann_CV_Jan 2025 for website - Joanne Golann.doc' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.themeManager+xml'
No text extracted from: Golann_CV_Jan 2025 for website - Joanne Golann.doc

Processing complete! Processed 0 files in 0.87 seconds
Average time per file: 0.44 seconds

Processing text-based form responses...
Skipping 79 already p
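
Both failures in the output above involve files with a `.doc` extension. The "Package not found" message may indicate a legacy binary Word file, which python-docx cannot open because it expects a ZIP-based `.docx` package. One way to tell the two apart before extraction is to check the file signature rather than the extension. The sketch below assumes only the standard OLE2 and ZIP signatures; `sniff_word_format` is a hypothetical helper and is not part of this notebook.

```python
import os
import tempfile

# Signature of OLE2 compound files (legacy binary Office formats such as .doc)
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP archive, the container format used by .docx
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(file_path):
    """Classify a file by its first bytes, ignoring the extension."""
    with open(file_path, "rb") as f:
        header = f.read(8)
    if header.startswith(OLE2_MAGIC):
        return "legacy-doc"
    if header.startswith(ZIP_MAGIC):
        return "zip-based"
    return "unknown"


# Demonstrate on two throwaway files that contain only the signatures
with tempfile.TemporaryDirectory() as tmp:
    legacy = os.path.join(tmp, "old.doc")
    modern = os.path.join(tmp, "new.doc")  # .doc extension, ZIP content
    with open(legacy, "wb") as f:
        f.write(OLE2_MAGIC + b"\x00" * 8)
    with open(modern, "wb") as f:
        f.write(ZIP_MAGIC + b"\x00" * 8)
    detected = [sniff_word_format(legacy), sniff_word_format(modern)]

print(detected)
```

A "zip-based" result only means the file is a ZIP archive; it does not guarantee a valid Word document, as the second error above suggests. Files reported as "legacy-doc" would need to be converted to `.docx` or read with a different tool.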

## Display Results
Execute this cell to display the extracted profiles.

# Display the extracted profiles (limit to 10 for readability)
if profiles:
    display_profiles(profiles, limit=10)

# Section 6: Future Sections (Placeholders)
The following sections will be implemented in future iterations.

## 6.1 Collaboration Analysis
This section will analyze the extracted profiles to identify potential collaborations.

TODO: Implement collaboration matching algorithms.

## 6.2 Email Notifications
This section will generate and send email notifications to participants.

TODO: Implement email notification functionality.

## 6.3 Advanced Visualization
This section will provide advanced visualizations of the research network.

TODO: Implement research network visualization.