# Discovering Companion Documentation

Scientific datasets often come with external documentation that provides crucial context:
- **README files**: Dataset descriptions, methodology
- **Scripts**: Processing code, examples
- **Citations**: Papers, DOIs, authors
- **Documentation**: User guides, technical notes
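
As a rough illustration of how files might be sorted into these categories, the sketch below classifies a path by its filename alone. The function name `classify_companion` and the suffix sets are assumptions made for this example; `CompanionDocFinder` may well apply different or richer rules.

```python
from pathlib import Path

# Hypothetical suffix sets chosen for illustration only.
SCRIPT_SUFFIXES = {".py", ".r", ".m", ".sh"}
DOC_SUFFIXES = {".pdf", ".docx", ".rst"}


def classify_companion(path):
    """Return a companion category for a path, or None if it does not look like one."""
    p = Path(path)
    stem = p.stem.lower()
    suffix = p.suffix.lower()
    # Name-based checks come first so README.md is not mistaken for generic docs.
    if stem.startswith("readme"):
        return "readmes"
    if stem.startswith("citation"):
        return "citations"
    if suffix in SCRIPT_SUFFIXES:
        return "scripts"
    if suffix in DOC_SUFFIXES:
        return "documentation"
    return None


for name in ["README.md", "CITATION.txt", "process_data.py", "ocean_temperature.nc"]:
    print(name, "->", classify_companion(name))
```

Matching on the filename keeps discovery cheap because no file has to be opened; the trade-off is that unconventionally named documentation is missed.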

## Why Companions Matter

Academic datasets frequently carry richer metadata in companion files than in the data files themselves. Finding and incorporating this information makes the data more FAIR (Findable, Accessible, Interoperable, Reusable).

In [1]:
# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent))

from companion_finder import CompanionDocFinder
from companion_extractor import CompanionDocExtractor

## Example 1: Create Sample Companion Files

In [2]:
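# Create sample README (make sure the target directory exists first)
readme_file = Path("sample_data/README.md")
readme_file.parent.mkdir(parents=True, exist_ok=True)
with open(readme_file, 'w', encoding='utf-8') as f: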
    f.write("""# Ocean Temperature Dataset

## Description
This dataset contains sea surface temperature measurements from 2020-2023.
Data was collected using satellite remote sensing (MODIS Aqua).

## Variables
- **sst**: Sea surface temperature in Celsius
- **lat**: Latitude in degrees north
- **lon**: Longitude in degrees east
- **time**: Days since 2020-01-01

## Contact
Institution: Demo University Oceanography Department
Email: data@demo.edu
Version: 1.0
Date: 2023-05-15

## License
CC BY 4.0
""")

print(f"Created: {readme_file}")

Created: sample_data/README.md


In [3]:
# Create sample citation file
citation_file = Path("sample_data/CITATION.txt")
with open(citation_file, 'w') as f:
    f.write("""Citation Information

If you use this dataset, please cite:

Smith, J., Johnson, A., & Williams, B. (2023). 
Global Sea Surface Temperature Analysis Using MODIS.
Journal of Marine Science, 45(3), 234-256.
DOI: 10.1234/jms.2023.001

Dataset DOI: 10.5281/zenodo.1234567
URL: https://example.com/datasets/sst-2023
""")

print(f"Created: {citation_file}")

Created: sample_data/CITATION.txt


In [4]:
# Create sample processing script
script_file = Path("sample_data/process_data.py")
with open(script_file, 'w', encoding='utf-8') as f:  # the script text contains a non-ASCII "°"
    f.write('''#!/usr/bin/env python
"""
Sea Surface Temperature Data Processing

This script processes raw MODIS L2 data and generates
daily gridded sea surface temperature files.

Author: Jane Smith
Date: 2023-05-15
Version: 1.0
"""
import netCDF4
import numpy as np

def process_sst_data(input_file, output_file):
    """Process SST data with quality control"""
    # Quality control: remove values outside physical range
    # SST should be between -2°C and 40°C
    pass

# Processing parameters
GRID_RESOLUTION = 0.25  # degrees
QC_THRESHOLD = 3  # Quality control threshold
''')

print(f"Created: {script_file}")

Created: sample_data/process_data.py


## Example 2: Find Companions for a Data File

In [5]:
# Find companions
data_file = Path("sample_data/ocean_temperature.nc")
finder = CompanionDocFinder()

companions = finder.find_companions(data_file)

print("Companion Documents Found:")
print("=" * 60)
for doc_type, files in companions.items():
    if files:
        print(f"\n{doc_type.upper().replace('_', ' ')}:")
        for f in files:
            print(f"  - {f.name}")

# Get summary
summary = finder.get_companion_summary(companions)
print(f"\nSummary: {summary}")

Companion Documents Found:

READMES:
  - README.md

CITATIONS:
  - CITATION.txt

SCRIPTS:
  - process_data.py

Summary: Readmes: 1 file(s); Citations: 1 file(s); Scripts: 1 file(s)


## Example 3: Extract Content from Companions

In [6]:
extractor = CompanionDocExtractor()

# Extract README
if companions['readmes']:
    readme_data = extractor.extract_readme(companions['readmes'][0])
    
    print("README Extraction:")
    print("=" * 60)
    print(f"File: {readme_data['filepath']}")
    print(f"\nContent Preview (first 300 chars):")
    print(readme_data['content'][:300] + "...")
    
    if readme_data.get('metadata'):
        print(f"\nExtracted Metadata:")
        for key, value in readme_data['metadata'].items():
            print(f"  {key}: {value}")

README Extraction:
File: sample_data/README.md

Content Preview (first 300 chars):
# Ocean Temperature Dataset

## Description
This dataset contains sea surface temperature measurements from 2020-2023.
Data was collected using satellite remote sensing (MODIS Aqua).

## Variables
- **sst**: Sea surface temperature in Celsius
- **lat**: Latitude in degrees north
- **lon**: Longitude...


In [7]:
# Extract citation info
if companions['citations']:
    citation_data = extractor.extract_citation_info(companions['citations'][0])
    
    print("\nCitation Extraction:")
    print("=" * 60)
    print(f"File: {Path(citation_data['filepath']).name}")
    
    if citation_data.get('doi'):
        print(f"DOI: {citation_data['doi']}")
    
    if citation_data.get('authors'):
        print(f"Authors: {', '.join(citation_data['authors'])}")
    
    if citation_data.get('year'):
        print(f"Year: {citation_data['year']}")
    
    if citation_data.get('url'):
        print(f"URL: {citation_data['url']}")


Citation Extraction:
File: CITATION.txt
DOI: 10.1234/jms.2023.001
Year: 2023
URL: https://example.com/datasets/sst-2023
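
One plausible way to pull these fields out of free-form citation text is with regular expressions. The sketch below is an assumption about the general approach, not the actual logic of `extract_citation_info`; the function name `parse_citation_text` and the patterns are hypothetical. Note that the sample citation file contains two DOIs (article and dataset), so the sketch keeps all matches in addition to the first.

```python
import re

# Hypothetical patterns for illustration; real-world DOIs and URLs can be messier.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
YEAR_RE = re.compile(r"\((\d{4})\)")  # a four-digit year in parentheses, e.g. (2023)
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def parse_citation_text(text):
    """Extract DOI(s), publication year, and URL from citation text."""
    dois = DOI_RE.findall(text)
    year = YEAR_RE.search(text)
    url = URL_RE.search(text)
    return {
        "doi": dois[0] if dois else None,  # first DOI found
        "all_dois": dois,                  # every DOI, in order of appearance
        "year": int(year.group(1)) if year else None,
        "url": url.group(0) if url else None,
    }


sample = """Smith, J., Johnson, A., & Williams, B. (2023).
Journal of Marine Science, 45(3), 234-256.
DOI: 10.1234/jms.2023.001

Dataset DOI: 10.5281/zenodo.1234567
URL: https://example.com/datasets/sst-2023
"""
print(parse_citation_text(sample))
```

Author names are deliberately left out of the sketch: they are much harder to match reliably with a regular expression, which may be why no `Authors:` line appears in the output above.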


In [8]:
# Extract script metadata
if companions['scripts']:
    script_data = extractor.extract_script_metadata(companions['scripts'][0])
    
    print("\nScript Extraction:")
    print("=" * 60)
    print(f"File: {Path(script_data['filepath']).name}")
    print(f"Language: {script_data['language']}")
    
    if script_data.get('docstring'):
        print(f"\nDocstring:")
        print(script_data['docstring'][:200] + "...")
    
    if script_data.get('metadata'):
        print(f"\nExtracted Metadata:")
        for key, value in script_data['metadata'].items():
            print(f"  {key}: {value}")
    
    if script_data.get('imports'):
        print(f"\nImports: {', '.join(script_data['imports'][:5])}")


Script Extraction:
File: process_data.py
Language: py

Docstring:
Sea Surface Temperature Data Processing

This script processes raw MODIS L2 data and generates
daily gridded sea surface temperature files.

Author: Jane Smith
Date: 2023-05-15
Version: 1.0...

Extracted Metadata:
  author: Jane Smith
  date: 2023-05-15
  version: 1.0

Imports: netCDF4, numpy
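
For Python scripts, the docstring and imports shown above can be recovered without executing the file by using the standard-library `ast` module. The following is a minimal sketch under that assumption; `inspect_python_source` is a hypothetical helper, and `extract_script_metadata` may work differently (for example, to support other languages).

```python
import ast


def inspect_python_source(source):
    """Return the module docstring, imported module names, and simple 'Key: value' metadata."""
    tree = ast.parse(source)  # parses only; the script is never executed
    docstring = ast.get_docstring(tree)

    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)

    # Look for a few well-known keys on their own docstring lines.
    metadata = {}
    for line in (docstring or "").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in {"author", "date", "version"}:
            metadata[key.strip().lower()] = value.strip()

    return {"docstring": docstring, "imports": imports, "metadata": metadata}


source = '"""Demo script\n\nAuthor: Jane Smith\nVersion: 1.0\n"""\nimport netCDF4\nimport numpy as np\n'
print(inspect_python_source(source))
```

Because the source is only parsed, modules such as `netCDF4` do not need to be installed for the inspection to succeed.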


## Example 4: Create Searchable Summary

In [9]:
# Collect all companion data
companion_data = []

for readme in companions['readmes']:
    companion_data.append(extractor.extract_readme(readme))

for citation in companions['citations']:
    companion_data.append(extractor.extract_citation_info(citation))

for script in companions['scripts']:
    companion_data.append(extractor.extract_script_metadata(script))

# Create searchable summary
summary_text = extractor.create_companion_summary(companion_data)

print("Searchable Summary from Companions:")
print("=" * 60)
print(summary_text[:500] + "...")
print("\n" + "=" * 60)
print(f"Total length: {len(summary_text)} characters")
print("\nThis text will be combined with data file metadata for search.")

Searchable Summary from Companions:
# Ocean Temperature Dataset ## Description This dataset contains sea surface temperature measurements from 2020-2023. Data was collected using satellite remote sensing (MODIS Aqua). ## Variables - **sst**: Sea surface temperature in Celsius - **lat**: Latitude in degrees north - **lon**: Longitude in degrees east - **time**: Days since 2020-01-01 ## Contact Institution: Demo University Oceanography Department Email: data@demo.edu Version: 1.0 Date: 2023-05-15 ## License CC BY 4.0 DOI: 10.1234/jm...

Total length: 857 characters

This text will be combined with data file metadata for search.
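
The flattened, single-line shape of the summary above suggests that whitespace is collapsed and the extracted fields are concatenated. A minimal sketch of that idea follows; `build_summary` is a hypothetical stand-in, and the actual behavior of `create_companion_summary` is an assumption here. Only the record keys `content`, `docstring`, and `doi` are taken from the cells above.

```python
def build_summary(records):
    """Join companion records into one whitespace-normalized string for search."""
    parts = []
    for record in records:
        # Free text: collapse newlines and runs of spaces into single spaces.
        for key in ("content", "docstring"):
            if record.get(key):
                parts.append(" ".join(record[key].split()))
        # Structured fields: keep a label so the value remains searchable in context.
        if record.get("doi"):
            parts.append(f"DOI: {record['doi']}")
    return " ".join(parts)


records = [
    {"content": "# Ocean Temperature Dataset\n\n## License\nCC BY 4.0\n"},
    {"doi": "10.1234/jms.2023.001"},
]
print(build_summary(records))
```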


## Example 5: Directory-Wide Companion Discovery

In [10]:
# Find all companions in directory
dir_companions = finder.find_directory_companions(Path("sample_data"))

print("All Companions in Directory:")
print("=" * 60)
for doc_type, files in dir_companions.items():
    print(f"\n{doc_type.replace('_', ' ').title()}: {len(files)}")
    for f in files:
        print(f"  - {f.name}")

All Companions in Directory:

Readmes: 1
  - README.md

Citations: 1
  - CITATION.txt

Documentation: 0

Scripts: 1
  - process_data.py


## Impact on FAIR Principles

Incorporating companion documents improves:

1. **Findability**: More keywords and context for search
2. **Accessibility**: Contact information, URLs, DOIs
3. **Interoperability**: Processing examples, format details
4. **Reusability**: Citations, methodology, license info

## Best Practices

When creating datasets:
- Include a README.md with dataset description
- Add CITATION.txt with proper attribution
- Provide example processing scripts
- Document data quality and methodology
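
These recommendations can be checked mechanically before a dataset is published. The sketch below uses only the standard library and reports which of the suggested companions are present in a directory; the function name `check_dataset_docs` and the filename conventions it tests are assumptions for illustration.

```python
from pathlib import Path


def check_dataset_docs(directory):
    """Report which recommended companion files exist directly inside a directory."""
    names = [p.name.lower() for p in Path(directory).iterdir() if p.is_file()]
    return {
        "readme": any(n.startswith("readme") for n in names),
        "citation": any(n.startswith("citation") for n in names),
        "script": any(n.endswith((".py", ".r", ".sh")) for n in names),
    }


def report_missing(directory):
    """Return the names of recommended companions that were not found."""
    return [kind for kind, present in check_dataset_docs(directory).items() if not present]
```

For example, `report_missing("sample_data")` would return an empty list once the three sample files created earlier in this notebook exist.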

## Next Steps

- **Notebook 04**: Generate embeddings and build search index
- **Notebook 05**: See companions integrated into batch indexing