# 📊 CenanoInk Data Loader - Web of Science Excel Files

## 🎯 Purpose
This notebook handles the loading, consolidation, and initial validation of Web of Science Excel files for the CenanoInk nanotechnology research project.

## 📁 Expected Data Structure
- Input: 39 Excel files (.xls format) from Web of Science
- Each file contains ~1000 scientific article records
- Total: ~39,000 articles about nanomaterials and nanocoatings

## 🔄 Workflow
1. **Load** all Excel files from directory
2. **Consolidate** into single DataFrame
3. **Map** WoS columns to standard format
4. **Validate** data quality and completeness
5. **Export** consolidated dataset for analysis

In [1]:
# 📦 DEPENDENCIES AND CONFIGURATION
import pandas as pd
import glob
import os
from datetime import datetime
from typing import Dict, List, Optional, Any
import warnings
warnings.filterwarnings('ignore')

# Configuration
CONFIG = {
    'data_directory': '/home/delon/Modelos/modeloCenanoInk/data/raw/arquivos_excel_artigo_cienciometrico/',
    'output_directory': '/home/delon/Modelos/modeloCenanoInk/data/processed/',
    'expected_files': 39,
    'file_pattern': '*.xls'
}

print("📊 CenanoInk Data Loader - Web of Science Excel")
print("=" * 50)
print(f"📁 Data directory: {CONFIG['data_directory']}")
print(f"💾 Output directory: {CONFIG['output_directory']}")
print(f"📋 Expected files: {CONFIG['expected_files']}")

📊 CenanoInk Data Loader - Web of Science Excel
📁 Data directory: /home/delon/Modelos/modeloCenanoInk/data/raw/arquivos_excel_artigo_cienciometrico/
💾 Output directory: /home/delon/Modelos/modeloCenanoInk/data/processed/
📋 Expected files: 39
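
The paths in `CONFIG` are absolute and tied to one machine. A minimal sketch of how the same dictionary could be derived from a single base directory, assuming a hypothetical `CENANOINK_BASE_DIR` environment variable and a hypothetical `build_config` helper (neither is part of the original notebook):

```python
import os
from pathlib import Path
from typing import Any, Dict, Optional


def build_config(base_dir: Optional[str] = None) -> Dict[str, Any]:
    """Build the loader configuration relative to one base directory.

    CENANOINK_BASE_DIR is a hypothetical environment variable used only
    for this illustration; the fallback is the current working directory.
    """
    base = Path(base_dir or os.environ.get("CENANOINK_BASE_DIR", ".")).expanduser()
    return {
        "data_directory": str(base / "data" / "raw" / "arquivos_excel_artigo_cienciometrico"),
        "output_directory": str(base / "data" / "processed"),
        "expected_files": 39,
        "file_pattern": "*.xls",
    }


config = build_config("/tmp/modeloCenanoInk")
print(config["data_directory"])
print(config["output_directory"])
```

The rest of the notebook only reads the four keys shown, so a dictionary built this way could be used in place of the hard-coded one.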


In [2]:
# 🔍 EXCEL FILE DISCOVERY AND VALIDATION
import glob
import os
from datetime import datetime  # used to format file modification times
from typing import List

# CONFIG is defined in the first cell (In [1]), which must be run before this one.

def discover_excel_files(directory: str, file_pattern: str, expected_files_count: int) -> List[str]:
    """
    Discovers, validates, and lists Excel files in a specified directory. 📂✨

    This function meticulously searches for files matching a given pattern (e.g., '*.xls')
    within the provided directory. It then validates the number of files found against
    an expected count, providing clear feedback. Finally, it prints a summary of the
    findings, including a sample of the discovered files with their names, sizes,
    and last modification dates.

    Args:
        directory (str): The absolute path to the directory where Excel files are located.
                         Example: '/path/to/your/data/excel_files'
        file_pattern (str): The glob pattern used to identify Excel files.
                            Example: '*.xls', 'data_*.xlsx'
        expected_files_count (int): The anticipated number of Excel files to be found.
                                    This helps in quick verification of data integrity.

    Returns:
        List[str]: A sorted list of absolute paths to the discovered Excel files.
                   Returns an empty list if no files are found, if the directory
                   does not exist, or if an error occurs during the search.
    
    Side Effects:
        Prints detailed logs, warnings, and error messages to the console.
    """
    print("🔄 [INFO] Initializing Excel file discovery and validation process...")
    print("===================================================================")

    # Step 1: Validate the input directory path
    print(f"📁 [VALIDATION] Checking existence of directory: {directory}")
    if not os.path.isdir(directory):
        print(f"❌ [ERROR] Directory not found: {directory}. Please verify the path in your CONFIG.")
        print("===================================================================")
        return []
    print(f"✅ [SUCCESS] Directory confirmed: {directory}")

    # Step 2: Construct the full search pattern
    search_pattern = os.path.join(directory, file_pattern)
    print(f"🔍 [SEARCH] Using search pattern: {search_pattern}")

    # Step 3: Discover files using the glob module
    excel_files: List[str] = []
    try:
        excel_files = glob.glob(search_pattern)
        print(f"💡 [RESULT] Glob search completed. Found {len(excel_files)} potential item(s) matching pattern.")
    except Exception as e:
        print(f"❌ [ERROR] A critical error occurred during file search with glob: {e}")
        print("This could be due to an invalid pattern or system issues. Aborting discovery.")
        print("===================================================================")
        return []

    # Step 4: File Validation and Reporting
    if not excel_files:
        print(f"⚠️ [WARN] No Excel files found matching the pattern '{file_pattern}' in '{directory}'.")
        print("Possible reasons: Incorrect directory, wrong file pattern, or no such files exist there.")
        print("Please double-check the 'data_directory' and 'file_pattern' in your CONFIG settings.")
        print("===================================================================")
        return []

    found_files_count = len(excel_files)
    print(f"📊 [COUNT] Successfully identified {found_files_count} Excel file(s).")

    # Step 5: Validate against the expected file count
    if found_files_count != expected_files_count:
        print(f"⚠️ [VALIDATION WARN] Expected {expected_files_count} file(s), but actually found {found_files_count}.")
        print("This discrepancy might indicate missing data, extra unprocessed files, or an outdated 'expected_files' count in CONFIG.")
    else:
        print(f"✅ [VALIDATION SUCCESS] File count ({found_files_count}) perfectly matches the expectation ({expected_files_count} files).")

    # Step 6: File Sample Display (with enhanced details)
    print("\n📋 [INFO] Displaying a sample of discovered files (up to 5 files, sorted alphabetically):")
    
    # Lexicographic sort for consistent output; note that 'savedrecs10.xls'
    # sorts before 'savedrecs2.xls' under this ordering.
    sorted_excel_files = sorted(excel_files)
    
    MAX_SAMPLES_TO_SHOW = 5 
    for i, file_path in enumerate(sorted_excel_files[:MAX_SAMPLES_TO_SHOW]):
        try:
            file_name = os.path.basename(file_path)
            file_size_bytes = os.path.getsize(file_path)
            file_size_mb = file_size_bytes / (1024 * 1024)
            last_modified_timestamp = os.path.getmtime(file_path)
            # Format timestamp to a readable string
            last_modified_date = datetime.fromtimestamp(last_modified_timestamp).strftime('%Y-%m-%d %H:%M:%S')
            
            print(f"  {i+1:2d}. 📄 File: {file_name} | Size: {file_size_mb:.2f} MB | Modified: {last_modified_date}")
        except OSError as e:
            file_name_fallback = os.path.basename(file_path) 
            print(f"  {i+1:2d}. ❓ Error accessing metadata for: {file_name_fallback}. Reason: {e}")

    if found_files_count > MAX_SAMPLES_TO_SHOW:
        print(f"     ... and {found_files_count - MAX_SAMPLES_TO_SHOW} more file(s) not listed in this sample.")
    
    print("\n👍 [COMPLETE] Excel file discovery and validation process finished successfully.")
    print("===================================================================")
    return sorted_excel_files

# --- FUNCTION EXECUTION ---
# This part calls the function defined above.
# It relies on 'CONFIG' being available from a previous cell.
# Imports like 'datetime' are handled at the top of this cell's code.

print("🚀 Executing the refactored 'discover_excel_files' function using CONFIG values...")
excel_files = discover_excel_files(
    directory=CONFIG['data_directory'],
    file_pattern=CONFIG['file_pattern'],
    expected_files_count=CONFIG['expected_files']
)

# --- POST-EXECUTION SUMMARY ---
# This provides a quick summary of what happened.
if excel_files:
    print(f"\n✅ [SUMMARY] Successfully discovered and validated {len(excel_files)} Excel file(s).")
    print(f"📝 The variable 'excel_files' is now populated with their paths and ready for use in subsequent cells.")
else:
    print("\n❌ [SUMMARY] File discovery did not yield any results or encountered an error during the process.")
    print("Please review the detailed logs printed above to diagnose the issue. Check CONFIG and file locations.")

🚀 Executing the refactored 'discover_excel_files' function using CONFIG values...
🔄 [INFO] Initializing Excel file discovery and validation process...
📁 [VALIDATION] Checking existence of directory: /home/delon/Modelos/modeloCenanoInk/data/raw/arquivos_excel_artigo_cienciometrico/
✅ [SUCCESS] Directory confirmed: /home/delon/Modelos/modeloCenanoInk/data/raw/arquivos_excel_artigo_cienciometrico/
🔍 [SEARCH] Using search pattern: /home/delon/Modelos/modeloCenanoInk/data/raw/arquivos_excel_artigo_cienciometrico/*.xls
💡 [RESULT] Glob search completed. Found 39 potential item(s) matching pattern.
📊 [COUNT] Successfully identified 39 Excel file(s).
✅ [VALIDATION SUCCESS] File count (39) perfectly matches the expectation (39 files).

📋 [INFO] Displaying a sample of discovered files (up to 5 files, sorted alphabetically):
   1. 📄 File: savedrecs1.xls | Size: 4.29 MB | Modified: 2025-04-23 15:47:08
   2. 📄 File: savedrecs10.xls | Size: 4.25 MB | Modified: 2025-04-23 15:59:24
   3. 📄 File: savedr
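
`sorted()` orders the paths lexicographically, which is why `savedrecs10.xls` is listed directly after `savedrecs1.xls`. If numeric order matters downstream (for example for `batch_number`), a natural-sort key is one option. The sketch below is an illustration, not part of the notebook; `natural_key` is a hypothetical helper name.

```python
import os
import re
from typing import List, Union


def natural_key(path: str) -> List[Union[int, str]]:
    """Split the file name into text and integer chunks so that
    'savedrecs2.xls' sorts before 'savedrecs10.xls'."""
    name = os.path.basename(path).lower()
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


files = ["savedrecs10.xls", "savedrecs2.xls", "savedrecs1.xls"]
print(sorted(files))                   # lexicographic order
print(sorted(files, key=natural_key))  # numeric-aware order
```

Because `re.split` with a capturing group always alternates text and digit chunks, the keys compare text with text and integers with integers.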

In [3]:
# 📊 EXCEL FILE STRUCTURE ANALYSIS
def analyze_excel_structure(file_path: str) -> Dict[str, Any]:
    """
    Analyze the structure of a single Excel file
    """
    try:
        df = pd.read_excel(file_path, engine='xlrd')
        
        analysis = {
            'file_name': os.path.basename(file_path),
            'rows': len(df),
            'columns': len(df.columns),
            'column_names': list(df.columns),
            'success': True,
            'error': None
        }
        
        # Check for key WoS columns
        key_columns = ['Article Title', 'Abstract', 'Authors', 'Publication Year']
        analysis['key_columns_present'] = {col: col in df.columns for col in key_columns}
        
        return analysis
        
    except Exception as e:
        return {
            'file_name': os.path.basename(file_path),
            'success': False,
            'error': str(e)
        }

def analyze_all_files(file_paths: List[str]) -> Dict[str, Any]:
    """
    Analyze structure of all Excel files
    """
    print("\n📊 ANALYZING FILE STRUCTURES")
    print("=" * 35)
    
    analyses = []
    successful_files = 0
    total_rows = 0
    
    for i, file_path in enumerate(file_paths):
        print(f"📄 Analyzing {i+1}/{len(file_paths)}: {os.path.basename(file_path)}")
        
        analysis = analyze_excel_structure(file_path)
        analyses.append(analysis)
        
        if analysis['success']:
            successful_files += 1
            total_rows += analysis['rows']
            print(f"  ✅ {analysis['rows']} rows, {analysis['columns']} columns")
        else:
            print(f"  ❌ Error: {analysis['error']}")
    
    # Summary
    print(f"\n📊 ANALYSIS SUMMARY")
    print(f"✅ Successful files: {successful_files}/{len(file_paths)}")
    print(f"📋 Total estimated rows: {total_rows:,}")
    
    # Check structure consistency
    if successful_files > 1:
        first_success = next(a for a in analyses if a['success'])
        consistent_structure = True
        
        for analysis in analyses:
            if analysis['success']:
                if set(analysis['column_names']) != set(first_success['column_names']):
                    consistent_structure = False
                    print(f"⚠️ Structure inconsistency in {analysis['file_name']}")
        
        if consistent_structure:
            print("✅ All files have consistent structure")
    
    return {
        'analyses': analyses,
        'successful_files': successful_files,
        'total_rows': total_rows,
        'consistent_structure': consistent_structure if successful_files > 1 else True
    }

# Analyze files if they were found
if excel_files:
    structure_analysis = analyze_all_files(excel_files)
else:
    print("⚠️ No files to analyze")
    structure_analysis = None


📊 ANALYZING FILE STRUCTURES
📄 Analyzing 1/39: savedrecs1.xls


  ✅ 1000 rows, 72 columns
📄 Analyzing 2/39: savedrecs10.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 3/39: savedrecs11.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 4/39: savedrecs12.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 5/39: savedrecs13.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 6/39: savedrecs14.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 7/39: savedrecs15.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 8/39: savedrecs16.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 9/39: savedrecs17.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 10/39: savedrecs18.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 11/39: savedrecs19.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 12/39: savedrecs2.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 13/39: savedrecs20.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 14/39: savedrecs21.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 15/39: savedrecs22.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 16/39: savedrecs23.xls
  ✅ 1000 rows, 72 columns
📄 Analyzing 17/39: savedrecs24.xls
  ✅ 1000 rows, 72 columns
📄 Analy
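
The consistency check in this cell reports only that a file differs from the first successfully read one. A small sketch of how the differing columns themselves could be listed; `diff_columns` is a hypothetical helper and the column lists are made-up examples:

```python
from typing import Dict, List


def diff_columns(reference: List[str], candidate: List[str]) -> Dict[str, List[str]]:
    """Return the columns missing from, and extra in, `candidate`
    relative to `reference` (order-insensitive, like the set comparison
    used in analyze_all_files)."""
    ref, cand = set(reference), set(candidate)
    return {
        "missing": sorted(ref - cand),
        "extra": sorted(cand - ref),
    }


reference_cols = ["Article Title", "Abstract", "Authors", "Publication Year"]
other_cols = ["Article Title", "Abstract", "Authors", "DOI"]
print(diff_columns(reference_cols, other_cols))
```

The `column_names` entries returned by `analyze_excel_structure` could be passed in directly.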

In [4]:
# 🔗 EXCEL FILES CONSOLIDATION
def consolidate_excel_files(file_paths: List[str]) -> Optional[pd.DataFrame]:
    """
    Consolidate all Excel files into a single DataFrame
    """
    print("\n🔗 CONSOLIDATING EXCEL FILES")
    print("=" * 32)
    
    if not file_paths:
        print("❌ No files to consolidate")
        return None
    
    dataframes = []
    total_records = 0
    errors = []
    
    for i, file_path in enumerate(file_paths, 1):
        try:
            print(f"📄 Loading {i}/{len(file_paths)}: {os.path.basename(file_path)}")
            
            # Load Excel file
            df_temp = pd.read_excel(file_path, engine='xlrd')
            
            # Add metadata columns (batch_number follows the sorted file order,
            # not the numeric suffix in the file name)
            df_temp['source_file'] = os.path.basename(file_path)
            df_temp['batch_number'] = i
            df_temp['load_timestamp'] = datetime.now().isoformat()
            
            dataframes.append(df_temp)
            total_records += len(df_temp)
            
            print(f"  ✅ Loaded: {len(df_temp):,} records")
            
        except Exception as e:
            error_msg = f"Error in {os.path.basename(file_path)}: {str(e)}"
            print(f"  ❌ {error_msg}")
            errors.append(error_msg)
    
    if not dataframes:
        print("❌ No files loaded successfully!")
        return None
    
    # Consolidate DataFrames
    print(f"\n🔄 Consolidating {len(dataframes)} DataFrames...")
    consolidated_df = pd.concat(dataframes, ignore_index=True)
    
    print(f"\n✅ CONSOLIDATION COMPLETE")
    print(f"📊 Total records: {len(consolidated_df):,}")
    print(f"📋 Total columns: {len(consolidated_df.columns)}")
    print(f"📁 Files processed: {len(dataframes)}/{len(file_paths)}")
    
    if errors:
        print(f"\n⚠️ Errors encountered ({len(errors)}):")
        for error in errors[:3]:  # Show first 3 errors
            print(f"  - {error}")
        if len(errors) > 3:
            print(f"  ... and {len(errors) - 3} more errors")
    
    return consolidated_df

# Consolidate files if available
if excel_files and structure_analysis and structure_analysis['successful_files'] > 0:
    consolidated_data = consolidate_excel_files(excel_files)
else:
    print("⚠️ Prerequisites not met for consolidation")
    consolidated_data = None


🔗 CONSOLIDATING EXCEL FILES
📄 Loading 1/39: savedrecs1.xls
  ✅ Loaded: 1,000 records
📄 Loading 2/39: savedrecs10.xls
  ✅ Loaded: 1,000 records
📄 Loading 3/39: savedrecs11.xls
  ✅ Loaded: 1,000 records
📄 Loading 4/39: savedrecs12.xls
  ✅ Loaded: 1,000 records
📄 Loading 5/39: savedrecs13.xls
  ✅ Loaded: 1,000 records
📄 Loading 6/39: savedrecs14.xls
  ✅ Loaded: 1,000 records
📄 Loading 7/39: savedrecs15.xls
  ✅ Loaded: 1,000 records
📄 Loading 8/39: savedrecs16.xls
  ✅ Loaded: 1,000 records
📄 Loading 9/39: savedrecs17.xls
  ✅ Loaded: 1,000 records
📄 Loading 10/39: savedrecs18.xls
  ✅ Loaded: 1,000 records
📄 Loading 11/39: savedrecs19.xls
  ✅ Loaded: 1,000 records
📄 Loading 12/39: savedrecs2.xls
  ✅ Loaded: 1,000 records
📄 Loading 13/39: savedrecs20.xls
  ✅ Loaded: 1,000 records
📄 Loading 14/39: savedrecs21.xls
  ✅ Loaded: 1,000 records
📄 Loading 15/39: savedrecs22.xls
  ✅ Loaded: 1,000 records
📄 Loading 16/39: savedrecs23.xls
  ✅ Loaded: 1,000 records
📄 Loading 17/39: savedrecs24.xls
  ✅ L
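
`pd.concat` aligns frames by column name: when one file lacks a column, the combined frame still contains it and the missing cells become `NaN`. This is why the structure check in the previous cell is worth running before consolidating. A toy illustration with in-memory frames (the values are invented):

```python
import pandas as pd

# Two tiny stand-ins for per-file DataFrames; the second lacks 'DOI'.
batch_1 = pd.DataFrame({"Article Title": ["A", "B"], "DOI": ["10.1/a", "10.1/b"]})
batch_2 = pd.DataFrame({"Article Title": ["C"]})

# Tag each frame with its origin, as consolidate_excel_files does.
batch_1["source_file"] = "savedrecs1.xls"
batch_2["source_file"] = "savedrecs2.xls"

# ignore_index=True rebuilds a continuous 0..n-1 index for the result.
combined = pd.concat([batch_1, batch_2], ignore_index=True)

print(combined)
print(combined["DOI"].isna().sum())  # rows that have no DOI value
```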

In [5]:
# 🗺️ WEB OF SCIENCE COLUMN MAPPING
def map_wos_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Map Web of Science columns to standardized format for CenanoInk pipeline
    """
    print("\n🗺️ MAPPING WOS COLUMNS TO STANDARD FORMAT")
    print("=" * 45)
    
    # Standard column mapping for WoS data
    column_mapping = {
        'Article Title': 'Title',
        'Abstract': 'Abstract',
        'Author Keywords': 'Author_Keywords',
        'Keywords Plus': 'Keywords_Plus',
        'Authors': 'Authors',
        'Publication Year': 'Year',
        'Source Title': 'Journal',
        'DOI': 'DOI',
        'Addresses': 'Addresses',
        'Research Areas': 'Research_Areas',
        'Web of Science Categories': 'WoS_Categories',
        'Times Cited': 'Citations',
        'Volume': 'Volume',
        'Issue': 'Issue',
        'Pages': 'Pages'
    }
    
    # Identify available columns
    available_columns = list(df.columns)
    mapped_columns = {}
    missing_columns = []
    
    print("📋 Column mapping status:")
    for wos_col, standard_col in column_mapping.items():
        if wos_col in available_columns:
            mapped_columns[wos_col] = standard_col
            print(f"  ✅ {wos_col} → {standard_col}")
        else:
            missing_columns.append(wos_col)
            print(f"  ❌ {wos_col} → {standard_col} (missing)")
    
    # Apply mapping
    df_mapped = df.copy()
    df_mapped = df_mapped.rename(columns=mapped_columns)
    
    print(f"\n📊 Mapping results:")
    print(f"  ✅ Mapped columns: {len(mapped_columns)}")
    print(f"  ❌ Missing columns: {len(missing_columns)}")
    print(f"  📋 Total columns after mapping: {len(df_mapped.columns)}")
    
    return df_mapped

# Apply column mapping if data is available
if consolidated_data is not None:
    mapped_data = map_wos_columns(consolidated_data)
else:
    print("⚠️ No consolidated data available for mapping")
    mapped_data = None


🗺️ MAPPING WOS COLUMNS TO STANDARD FORMAT
📋 Column mapping status:
  ✅ Article Title → Title
  ✅ Abstract → Abstract
  ✅ Author Keywords → Author_Keywords
  ✅ Keywords Plus → Keywords_Plus
  ✅ Authors → Authors
  ✅ Publication Year → Year
  ✅ Source Title → Journal
  ✅ DOI → DOI
  ✅ Addresses → Addresses
  ✅ Research Areas → Research_Areas
  ❌ Web of Science Categories → WoS_Categories (missing)
  ❌ Times Cited → Citations (missing)
  ✅ Volume → Volume
  ✅ Issue → Issue
  ❌ Pages → Pages (missing)

📊 Mapping results:
  ✅ Mapped columns: 12
  ❌ Missing columns: 3
  📋 Total columns after mapping: 75
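
Three of the expected headers were not found. Before treating those fields as absent, it may be worth checking whether the export uses slightly different header names. A sketch using the standard library's `difflib.get_close_matches`; `suggest_columns` is a hypothetical helper, and the `available` list is an invented example, not the notebook's actual columns:

```python
import difflib
from typing import Dict, List


def suggest_columns(missing: List[str], available: List[str],
                    cutoff: float = 0.5) -> Dict[str, List[str]]:
    """For each missing header, list the most similar available headers."""
    return {
        name: difflib.get_close_matches(name, available, n=3, cutoff=cutoff)
        for name in missing
    }


# Hypothetical header names, for illustration only.
available = ["Article Title", "Start Page", "End Page", "Times Cited, All Databases"]
missing = ["Times Cited", "Pages"]

for name, candidates in suggest_columns(missing, available).items():
    print(f"{name}: {candidates}")
```

In the notebook, `available` would be `list(df.columns)` and `missing` the `missing_columns` list built inside `map_wos_columns`. Any suggestion still needs a manual check before it is added to `column_mapping`.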


In [6]:
# ✅ DATA QUALITY VALIDATION
def validate_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Comprehensive data quality validation for the consolidated dataset
    """
    print("\n✅ DATA QUALITY VALIDATION")
    print("=" * 30)
    
    validation = {
        'total_records': len(df),
        'total_columns': len(df.columns),
        'essential_columns': {},
        'abstract_quality': {},
        'title_quality': {},
        'year_distribution': {},
        'nano_content': {}
    }
    
    # Check essential columns
    essential_cols = ['Title', 'Abstract', 'Authors', 'Year']
    print("📋 Essential columns status:")
    
    for col in essential_cols:
        if col in df.columns:
            non_null = df[col].notna().sum()
            percentage = (non_null / len(df)) * 100
            validation['essential_columns'][col] = {
                'present': True,
                'non_null': non_null,
                'percentage': percentage
            }
            print(f"  ✅ {col}: {non_null:,} records ({percentage:.1f}%)")
        else:
            validation['essential_columns'][col] = {'present': False}
            print(f"  ❌ {col}: Not found")
    
    # Abstract quality analysis
    if 'Abstract' in df.columns:
        valid_abstracts = df['Abstract'].notna() & (df['Abstract'].str.len() > 50)
        validation['abstract_quality'] = {
            'total': len(df),
            'with_abstract': df['Abstract'].notna().sum(),
            'valid_abstracts': valid_abstracts.sum(),
            'avg_length': df['Abstract'].str.len().mean()
        }
        
        aq = validation['abstract_quality']
        print(f"\n📝 Abstract Quality:")
        print(f"  With abstract: {aq['with_abstract']:,} ({(aq['with_abstract']/aq['total'])*100:.1f}%)")
        print(f"  Valid abstracts (>50 chars): {aq['valid_abstracts']:,} ({(aq['valid_abstracts']/aq['total'])*100:.1f}%)")
        print(f"  Average length: {aq['avg_length']:.0f} characters")
    
    # Title quality analysis
    if 'Title' in df.columns:
        valid_titles = df['Title'].notna() & (df['Title'].str.len() > 10)
        validation['title_quality'] = {
            'total': len(df),
            'with_title': df['Title'].notna().sum(),
            'valid_titles': valid_titles.sum(),
            'avg_length': df['Title'].str.len().mean()
        }
        
        tq = validation['title_quality']
        print(f"\n🏷️ Title Quality:")
        print(f"  With title: {tq['with_title']:,} ({(tq['with_title']/tq['total'])*100:.1f}%)")
        print(f"  Valid titles (>10 chars): {tq['valid_titles']:,} ({(tq['valid_titles']/tq['total'])*100:.1f}%)")
        print(f"  Average length: {tq['avg_length']:.0f} characters")
    
    # Year distribution
    if 'Year' in df.columns:
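        years = pd.to_numeric(df['Year'], errors='coerce')
        # Sorted by year (the index), not by article count
        year_dist = years.value_counts().sort_index()
        validation['year_distribution'] = {
            'min_year': int(years.min()) if years.notna().any() else None,
            'max_year': int(years.max()) if years.notna().any() else None,
            # nunique() ignores NaN; len(years.unique()) would count it as a year
            'unique_years': years.nunique(),
            # NOTE: because year_dist is sorted by year, head(5) returns the five
            # earliest years, not the five years with the most articles.
            'top_years': year_dist.head(5).to_dict()
        }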
        
        yd = validation['year_distribution']
        if yd['min_year']:
            print(f"\n📅 Year Distribution:")
            print(f"  Period: {yd['min_year']} - {yd['max_year']}")
            print(f"  Unique years: {yd['unique_years']}")
            print(f"  Top years:")
            for year, count in list(yd['top_years'].items())[:3]:
                print(f"    {year}: {count:,} articles")
    
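    # Nanotechnology content analysis
    # Each count is the number of Title cells plus the number of Abstract cells
    # that contain the term as a substring. A record can therefore be counted
    # twice, and 'nano' also matches 'nanoparticle', 'nanotechnology', etc.
    nano_terms = ['nano', 'nanoparticle', 'nanotechnology', 'nanomaterial', 'coating', 'paint']
    nano_counts = {}
    
    for term in nano_terms:
        count_total = 0
        for col in ['Title', 'Abstract']:
            if col in df.columns:
                # regex=False: match the term literally instead of as a regular expression
                count = df[col].astype(str).str.lower().str.contains(term, na=False, regex=False).sum()
                count_total += count
        nano_counts[term] = count_total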
    
    validation['nano_content'] = nano_counts
    
    print(f"\n🔬 Nanotechnology Content:")
    for term, count in nano_counts.items():
        if count > 0:
            print(f"  {term}: {count:,} mentions")
    
    return validation

# Validate data quality if available
if mapped_data is not None:
    data_validation = validate_data_quality(mapped_data)
else:
    print("⚠️ No mapped data available for validation")
    data_validation = None


✅ DATA QUALITY VALIDATION
📋 Essential columns status:
  ✅ Title: 38,323 records (100.0%)
  ✅ Abstract: 38,237 records (99.8%)
  ✅ Authors: 38,323 records (100.0%)
  ✅ Year: 38,323 records (100.0%)

📝 Abstract Quality:
  With abstract: 38,237 (99.8%)
  Valid abstracts (>50 chars): 38,237 (99.8%)
  Average length: 1266 characters

🏷️ Title Quality:
  With title: 38,323 (100.0%)
  Valid titles (>10 chars): 38,323 (100.0%)
  Average length: 110 characters

📅 Year Distribution:
  Period: 2014 - 2025
  Unique years: 12
  Top years:
    2014: 2,192 articles
    2015: 2,384 articles
    2016: 2,622 articles

🔬 Nanotechnology Content:
  nano: 21,792 mentions
  nanoparticle: 6,424 mentions
  nanotechnology: 110 mentions
  nanomaterial: 558 mentions
  coating: 46,224 mentions
  paint: 1,814 mentions
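
Two of the figures above are easy to misread. "Top years" lists the earliest years, because the distribution is sorted by year before `head()` is taken. Each term count adds Title matches to Abstract matches, which is how `coating` (46,224) can exceed the number of records (38,323). A sketch of two alternative tallies on a toy frame, assuming the mapped column names `Title`, `Abstract`, and `Year`; the helper names and the sample rows are hypothetical:

```python
import pandas as pd
from typing import Dict, List


def records_mentioning(df: pd.DataFrame, terms: List[str],
                       cols: List[str]) -> Dict[str, int]:
    """Count records where the term appears in at least one of `cols`."""
    # Lower-case each text column once, treating missing cells as empty.
    lowered = [df[c].fillna("").astype(str).str.lower() for c in cols]
    counts = {}
    for term in terms:
        hit = pd.Series(False, index=df.index)
        for series in lowered:
            hit = hit | series.str.contains(term, regex=False)
        counts[term] = int(hit.sum())
    return counts


def most_frequent_years(df: pd.DataFrame, n: int = 3) -> Dict[int, int]:
    """Return the n years with the most records, largest count first."""
    years = pd.to_numeric(df["Year"], errors="coerce").dropna().astype(int)
    counts = years.value_counts().head(n)  # value_counts sorts by count
    return {int(year): int(count) for year, count in counts.items()}


toy = pd.DataFrame({
    "Title": ["Nano coating for steel", "Paint durability", "Coating wear"],
    "Abstract": ["A coating with nanoparticles.", None, "Coating tests."],
    "Year": [2020, 2021, 2020],
})

print(records_mentioning(toy, ["coating", "nano"], ["Title", "Abstract"]))
print(most_frequent_years(toy, n=1))
```

Substring overlap between terms (`nano` inside `nanoparticle`) is still present in this version; only the double counting across columns is removed.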


In [7]:
# 💾 EXPORT CONSOLIDATED DATASET
import numpy as np

def export_consolidated_data(df: pd.DataFrame,
                             output_dir: Optional[str] = None,
                             include_metadata: bool = True) -> str:
    """
    Export the consolidated dataset to CSV and, optionally, a JSON metadata file.

    Falls back to CONFIG['output_directory'] when output_dir is None. The metadata
    step reads the notebook-level variables `data_validation` and `excel_files`.
    Returns the path of the exported CSV file.
    """
    print("\n💾 EXPORTING CONSOLIDATED DATASET")
    print("=" * 35)
    
    if output_dir is None:
        output_dir = CONFIG['output_directory']
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Generate filename with timestamp
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f'cenanoink_wos_consolidated_{timestamp}.csv'
    filepath = os.path.join(output_dir, filename)
    
    # Export main dataset
    print(f"📁 Exporting to: {filepath}")
    df.to_csv(filepath, index=False, encoding='utf-8')
    
    file_size = os.path.getsize(filepath) / (1024 * 1024)  # MB
    print(f"✅ Dataset exported: {len(df):,} records, {file_size:.1f} MB")
    
    # Export metadata if requested
    if include_metadata and data_validation:
        metadata_filename = f'cenanoink_wos_metadata_{timestamp}.json'
        metadata_filepath = os.path.join(output_dir, metadata_filename)
        
        import json
        # data_validation contains NumPy scalars, which json cannot serialize.
        # This helper recursively converts NumPy types to native Python types.
        def convert_numpy_types(obj):
            if isinstance(obj, (np.integer, np.int64, np.int32)):
                return int(obj)
            elif isinstance(obj, (np.floating, np.float64, np.float32)):
                return float(obj)
            elif isinstance(obj, np.ndarray):
                return obj.tolist()
            elif isinstance(obj, dict):
                return {k: convert_numpy_types(v) for k, v in obj.items()}
            elif isinstance(obj, list):
                return [convert_numpy_types(i) for i in obj]
            return obj

        metadata = {
            'export_timestamp': datetime.now().isoformat(),
            'source_files_count': len(excel_files) if excel_files else 0,
            'total_records': len(df),
            'total_columns': len(df.columns),
            'data_validation': convert_numpy_types(data_validation), # Apply conversion here
            'column_names': list(df.columns)
        }
        
        with open(metadata_filepath, 'w', encoding='utf-8') as f:
            json.dump(metadata, f, indent=2, ensure_ascii=False)
        
        print(f"📊 Metadata exported: {metadata_filepath}")
    
    return filepath

# Export data if available
if mapped_data is not None:
    # exported_file is assigned on both branches so the summary cell can rely on it.
    exported_file = export_consolidated_data(mapped_data)
    print(f"\n🎉 DATA LOADING COMPLETE!")
    print(f"📄 Ready for analysis: {exported_file}")
    print(f"📊 Next step: Use '02_nanomaterials_database.ipynb' for pattern analysis")
else:
    print("\n⚠️ No data available for export")
    print("🔧 Check file paths and permissions")
    exported_file = None # Initialize exported_file to None if export is skipped


💾 EXPORTING CONSOLIDATED DATASET
📁 Exporting to: /home/delon/Modelos/modeloCenanoInk/data/processed/cenanoink_wos_consolidated_20250610_121227.csv
✅ Dataset exported: 38,323 records, 125.4 MB
📊 Metadata exported: /home/delon/Modelos/modeloCenanoInk/data/processed/cenanoink_wos_metadata_20250610_121227.json

🎉 DATA LOADING COMPLETE!
📄 Ready for analysis: /home/delon/Modelos/modeloCenanoInk/data/processed/cenanoink_wos_consolidated_20250610_121227.csv
📊 Next step: Use '02_nanomaterials_database.ipynb' for pattern analysis
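
`json.dump` cannot serialize most NumPy scalars or arrays, which is why the cell converts the validation dictionary first. An alternative, shown here only as a sketch, is to pass a `default` hook so the conversion happens during serialization; `numpy_default` is a hypothetical name and the values are toy data.

```python
import json
import numpy as np


def numpy_default(obj):
    """Fallback called by json for objects it cannot serialize natively."""
    if isinstance(obj, np.generic):   # any NumPy scalar (int64, float32, bool_, ...)
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


stats = {
    "non_null": np.int64(3),
    "avg_length": np.float32(0.5),
    "lengths": np.array([1, 2, 3]),
}
print(json.dumps(stats, default=numpy_default))
```

The hook is only called for values. Dictionary keys still need to be plain Python types, so the recursive converter in the cell and this hook behave the same in that respect.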


In [8]:
# 📋 SUMMARY AND NEXT STEPS
print("\n" + "=" * 60)
print("📋 CENANOINK DATA LOADER SUMMARY")
print("=" * 60)

if excel_files and mapped_data is not None:
    print(f"✅ Successfully processed {len(excel_files)} Excel files")
    print(f"📊 Total records loaded: {len(mapped_data):,}")
    print(f"📋 Columns available: {len(mapped_data.columns)}")
    
    # Conditionally print the exported file path
    if exported_file:
        print(f"💾 Data exported to: {exported_file}")
    else:
        print("💾 Data export was skipped or did not produce a file path.")
    
    # Key statistics
    if data_validation:
        if 'abstract_quality' in data_validation and data_validation['abstract_quality']:
            valid_abstracts = data_validation['abstract_quality']['valid_abstracts']
            print(f"📝 Valid abstracts: {valid_abstracts:,}")
        
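        if 'nano_content' in data_validation:
            # Sum of the per-term counts. The terms overlap (e.g. 'nano' and
            # 'nanoparticle') and Title and Abstract are counted separately,
            # so this total is not a number of distinct records.
            nano_total = sum(data_validation['nano_content'].values())
            print(f"🔬 Nanotechnology mentions: {nano_total:,}")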
    
    print("\n🎯 NEXT STEPS:")
    print("  1. Run '02_nanomaterials_database.ipynb' for pattern extraction")
    print("  2. Use '03_gemini_analysis.ipynb' for AI analysis") # This might need updating based on new notebook names
    print("  3. Generate reports with '04_reporting_system.ipynb'") # This might need updating
    print("  4. Orchestrate full pipeline with '05_main_orchestrator.ipynb'") # This might need updating
    
else:
    print("❌ Data loading failed or was incomplete.")
    # If mapped_data is None, it implies earlier stages might have issues.
    # exported_file would be None due to the change in the previous cell.
    if not excel_files:
        print("  - No Excel files were found or processed.")
    if mapped_data is None and excel_files: # Files found, but mapping/consolidation failed
        print("  - Data consolidation or mapping might have failed.")
        
    print("\n🔧 TROUBLESHOOTING:")
    print("  1. Check if Excel files exist in the data directory and match the pattern.")
    print("  2. Verify file permissions and that files are not corrupt.")
    print("  3. Ensure 'xlrd' library is installed if using .xls files.")
    print("  4. Review logs in previous cells for specific error messages.")
    print("  5. Check available disk space if export is failing.")

print("\n🎨 CenanoInk Project - Nanotechnology Research Pipeline")
print("=" * 60)


📋 CENANOINK DATA LOADER SUMMARY
✅ Successfully processed 39 Excel files
📊 Total records loaded: 38,323
📋 Columns available: 75
💾 Data exported to: /home/delon/Modelos/modeloCenanoInk/data/processed/cenanoink_wos_consolidated_20250610_121227.csv
📝 Valid abstracts: 38,237
🔬 Nanotechnology mentions: 76,922

🎯 NEXT STEPS:
  1. Run '02_nanomaterials_database.ipynb' for pattern extraction
  2. Use '03_gemini_analysis.ipynb' for AI analysis
  3. Generate reports with '04_reporting_system.ipynb'
  4. Orchestrate full pipeline with '05_main_orchestrator.ipynb'

🎨 CenanoInk Project - Nanotechnology Research Pipeline
