# COVID-19 Data Exploration

This notebook explores real COVID-19 datasets for our detection pipeline.

We'll examine various datasets for both pipeline stages:
1. **Stage 1**: Unstructured text for NER (clinical notes, research papers, etc.)
2. **Stage 2**: Structured EHR data for classification

In [1]:
import os
import json
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime

# Add project root to path
sys.path.append('..')
from src.data_fetcher import list_available_datasets

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

## Available COVID-19 Datasets

First, let's review the available datasets we can use for our pipeline.

In [2]:
datasets = list_available_datasets()

for i, dataset in enumerate(datasets, 1):
    print(f"{i}. {dataset['name']}")
    print(f"   Description: {dataset['description']}")
    print(f"   Data Type: {dataset['data_type']}")
    print(f"   Access: {dataset['access']}")
    print()

1. CORD-19 Research Papers
   Description: COVID-19 Open Research Dataset of scientific papers
   Data Type: Unstructured text (research papers)
   Access: Public

2. CDC COVID-19 Case Surveillance
   Description: De-identified patient-level data on COVID-19 cases
   Data Type: Structured data (patient records)
   Access: Public

3. COVID-19 Clinical Trials
   Description: Clinical trials related to COVID-19 with detailed descriptions
   Data Type: Semi-structured text (trial descriptions)
   Access: Public

4. COVID-19 Twitter Dataset
   Description: Tweets related to COVID-19 symptoms and experiences
   Data Type: Unstructured text (social media)
   Access: Public

5. MIMIC-III Clinical Database
   Description: Medical information for ICU patients (includes some COVID cases)
   Data Type: Structured data + unstructured clinical notes
   Access: Requires credential application

6. i2b2 NLP Research Datasets
   Description: Clinical NLP datasets with some COVID-19 related content
   Da

## Stage 1: Unstructured Text Data for NER

For our NER pipeline, we need unstructured text with descriptions of COVID-19 symptoms, treatments, and clinical presentations.

### CORD-19 Research Papers

The COVID-19 Open Research Dataset (CORD-19) contains scientific papers about COVID-19 and related coronaviruses.

In [3]:
# Load CORD-19 metadata (if available)
cord19_path = '../data/external/cord19_metadata.csv'

if os.path.exists(cord19_path):
    cord19_df = pd.read_csv(cord19_path)
    print(f"Loaded {len(cord19_df)} CORD-19 papers")
    display(cord19_df.head())
    
    # Show columns
    print("\nColumns:")
    for col in cord19_df.columns:
        print(f" - {col}")
else:
    print("CORD-19 metadata not found. Please download it first using the data_fetcher.py module.")

ParserError: Error tokenizing data. C error: EOF inside string starting at row 424824
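
An `EOF inside string` error of this kind usually points to an unbalanced quote or a truncated download, although the cause has not been confirmed for this file. As a minimal diagnostic sketch using only the standard library, the hypothetical helper `find_suspect_records` below flags CSV records that span an unusually large number of physical lines, which is what an unclosed quote tends to produce. The `max_span` threshold is an assumption to tune.

```python
import csv
import io

def find_suspect_records(lines, max_span=50):
    """Return (start_line, span) for CSV records covering more than max_span physical lines.

    `lines` is any iterable of text lines, e.g. an open file or io.StringIO.
    """
    reader = csv.reader(lines)
    suspects = []
    prev_line = 0  # last physical line consumed by the previous record
    try:
        for _ in reader:
            span = reader.line_num - prev_line
            if span > max_span:
                suspects.append((prev_line + 1, span))
            prev_line = reader.line_num
    except csv.Error:
        # e.g. a field grew past csv.field_size_limit(); report where it started
        suspects.append((prev_line + 1, reader.line_num - prev_line))
    return suspects

# Tiny illustrative input: the quote opened on line 3 is never closed
sample = 'id,abstract\n1,ok\n2,"unterminated\n3,plain\n4,plain\n'
print(find_suspect_records(io.StringIO(sample), max_span=2))
```

For the real file, the same function could be called on `open(cord19_path, newline='', encoding='utf-8')`; the reported line number is then a starting point for inspecting or re-downloading the file.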

### Clinical Trials Text Data

ClinicalTrials.gov provides detailed descriptions of COVID-19 trials, including symptoms, eligibility criteria, and interventions.

In [5]:
# Load clinical trials data (if available)
trials_path = '../data/external/covid19_clinical_trials.json'

if os.path.exists(trials_path):
    with open(trials_path, 'r') as f:
        trials_data = json.load(f)
    
    print(f"Loaded {len(trials_data['StudyFieldsResponse']['StudyFields'])} clinical trials")
    
    # Convert to DataFrame
    trials_df = pd.DataFrame(trials_data['StudyFieldsResponse']['StudyFields'])
    display(trials_df.head())
    
    # Sample detailed descriptions
    if 'DetailedDescription' in trials_df.columns:
        print("\nSample Clinical Trial Description:")
        has_desc = trials_df['DetailedDescription'].apply(lambda x: len(x) > 0)
        sample_desc = trials_df.loc[has_desc, 'DetailedDescription'].iloc[0][0]
        print(sample_desc[:500] + "...")
else:
    print("Clinical trials data not found. Please download it first using the data_fetcher.py module.")

Clinical trials data not found. Please download it first using the data_fetcher.py module.
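
The loading code above assumes each study is a dictionary whose values are lists (hence the `[0]` on `DetailedDescription`). As a sketch under that assumption, the hypothetical helper `flatten_study_fields` joins each list into a single string so the result is easier to inspect in a DataFrame. The sample record and the `NCTId` and `Condition` field names are made up for illustration.

```python
def flatten_study_fields(response, sep="; "):
    """Turn list-valued study fields into plain strings (None for empty lists)."""
    studies = response["StudyFieldsResponse"]["StudyFields"]
    rows = []
    for study in studies:
        row = {}
        for field, value in study.items():
            if isinstance(value, list):
                row[field] = sep.join(str(v) for v in value) if value else None
            else:
                row[field] = value
        rows.append(row)
    return rows

# Illustrative payload only; not real trial data
example_response = {
    "StudyFieldsResponse": {
        "StudyFields": [
            {"Rank": 1, "NCTId": ["NCT00000000"],
             "Condition": ["COVID-19", "Pneumonia"], "DetailedDescription": []},
        ]
    }
}
print(flatten_study_fields(example_response))
```

The returned list of dictionaries can be passed straight to `pd.DataFrame(...)`.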


### Twitter Data Analysis

Twitter data provides real-world accounts of COVID-19 symptoms and experiences.

In [None]:
# Load Twitter data (if available)
twitter_path = '../data/external/covid19_tweets.tsv'

if os.path.exists(twitter_path):
    # Read the tab-separated file; any parsing error is reported below
    try:
        twitter_df = pd.read_csv(twitter_path, sep='\t')
        print(f"Loaded {len(twitter_df)} COVID-19 tweets")
        display(twitter_df.head())
        
        # Show sample tweets
        tweet_col = [col for col in twitter_df.columns if 'text' in col.lower() or 'tweet' in col.lower()][0]
        print("\nSample tweets mentioning symptoms:")
        symptom_tweets = twitter_df[twitter_df[tweet_col].str.contains('symptom|cough|fever|breath', case=False, na=False)]
        for tweet in symptom_tweets[tweet_col].head(3).tolist():
            print(f" - {tweet}")
    except Exception as e:
        print(f"Error reading Twitter data: {e}")
else:
    print("Twitter data not found. Please download it first using the data_fetcher.py module.")

## Stage 2: Structured EHR Data for Classification

For our classification model, we need structured patient data with COVID-19 diagnoses.

### CDC COVID-19 Case Surveillance Data

This dataset contains de-identified patient data collected by state health departments.

In [6]:
# Load CDC case surveillance data (if available)
cdc_path = '../data/external/covid19_case_surveillance.csv'

if os.path.exists(cdc_path):
    # Read only the first 10,000 rows to limit memory use
    cdc_df = pd.read_csv(cdc_path, nrows=10000)
    print(f"Loaded {len(cdc_df)} rows from CDC COVID-19 case surveillance data")
    display(cdc_df.head())
    
    # Column info
    print("\nColumns:")
    for col in cdc_df.columns:
        print(f" - {col}")
        
    # Basic statistics
    if 'current_status' in cdc_df.columns:
        print("\nCase status distribution:")
        display(cdc_df['current_status'].value_counts())
else:
    print("CDC data not found. Please download it first using the data_fetcher.py module.")

Loaded 10000 rows from CDC COVID-19 case surveillance data


| | cdc_case_earliest_dt | cdc_report_dt | pos_spec_dt | onset_dt | current_status | sex | age_group | race_ethnicity_combined | hosp_yn | icu_yn | death_yn | medcond_yn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024/03/08 | 2024/03/08 | | | Laboratory-confirmed case | Male | 70 - 79 Years | Unknown | Missing | Missing | Unknown | Missing |
| 1 | 2022/09/12 | 2022/09/12 | | | Laboratory-confirmed case | Male | 70 - 79 Years | Unknown | Missing | Missing | Unknown | Missing |
| 2 | 2022/07/21 | 2022/07/21 | | | Laboratory-confirmed case | Male | 70 - 79 Years | Unknown | Missing | Missing | Unknown | Missing |
| 3 | 2022/11/10 | 2022/11/10 | | | Probable Case | Male | 70 - 79 Years | Unknown | Missing | Missing | Unknown | Missing |
| 4 | 2022/08/22 | 2022/08/22 | | | Probable Case | Male | 70 - 79 Years | Unknown | Missing | Missing | Unknown | Missing |



Columns:
 - cdc_case_earliest_dt 
 - cdc_report_dt
 - pos_spec_dt
 - onset_dt
 - current_status
 - sex
 - age_group
 - race_ethnicity_combined
 - hosp_yn
 - icu_yn
 - death_yn
 - medcond_yn

Case status distribution:


current_status
Laboratory-confirmed case    8159
Probable Case                1841
Name: count, dtype: int64
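
The counts above come from the first 10,000 rows only, and the preview rows all share the same sex and age group, which hints that the file may be ordered and that a head sample may not be representative. One possible alternative, sketched here with pandas' `chunksize` option, is to stream the whole file and accumulate the status counts along with the share of `Missing`/`Unknown`/empty values per column. The helper name `summarize_in_chunks`, the placeholder labels, and the sample data are assumptions for illustration.

```python
import io
from collections import Counter

import pandas as pd

# Assumption: these labels mark absent information, as in the preview above
PLACEHOLDERS = ["Missing", "Unknown"]

def summarize_in_chunks(source, status_col="current_status", chunksize=100_000):
    """Stream a CSV and return (status counts, placeholder fraction per column)."""
    status_counts = Counter()
    placeholder_counts = Counter()
    total_rows = 0
    for chunk in pd.read_csv(source, chunksize=chunksize, dtype=str):
        total_rows += len(chunk)
        status_counts.update(chunk[status_col].value_counts().to_dict())
        # Treat placeholder labels and empty cells alike
        is_placeholder = chunk.isin(PLACEHOLDERS) | chunk.isna()
        placeholder_counts.update(is_placeholder.sum().to_dict())
    fractions = {}
    if total_rows:
        fractions = {col: n / total_rows for col, n in placeholder_counts.items()}
    return status_counts, fractions

# Made-up rows that mimic the column layout
sample_csv = (
    "current_status,sex,hosp_yn,onset_dt\n"
    "Laboratory-confirmed case,Male,Missing,\n"
    "Probable Case,Female,Yes,2022/01/01\n"
    "Laboratory-confirmed case,Male,Unknown,\n"
    "Laboratory-confirmed case,Female,No,2022/01/02\n"
)
counts, fractions = summarize_in_chunks(io.StringIO(sample_csv), chunksize=2)
print(dict(counts))
print(fractions)
```

On the real data, `source` would be `cdc_path`; columns with a high placeholder fraction may be weak candidates for classification features.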

### MIMIC-III Clinical Notes

MIMIC contains real clinical notes, but requires credentialed access.

In [7]:
# Load MIMIC notes (if available)
mimic_path = '../data/external/mimic_notes.csv'

if os.path.exists(mimic_path):
    # Read only the first 1,000 rows to limit memory use
    mimic_df = pd.read_csv(mimic_path, nrows=1000)
    print(f"Loaded {len(mimic_df)} rows from MIMIC clinical notes")
    display(mimic_df.head())
    
    # Note types
    if 'CATEGORY' in mimic_df.columns:
        print("\nNote categories:")
        display(mimic_df['CATEGORY'].value_counts())
        
    # Sample text
    if 'TEXT' in mimic_df.columns:
        print("\nSample clinical note:")
        sample_note = mimic_df['TEXT'].iloc[0]
        print(sample_note[:500] + "...")
else:
    print("MIMIC data not found. Note that MIMIC requires credentialed access.")

MIMIC data not found. Note that MIMIC requires credentialed access.


## Data Integration Strategy

Based on our exploration, here's how we can combine these datasets for our pipeline:

### Stage 1: NER Data Integration

For the NER stage, we need to extract medical entities from unstructured text. Here's our strategy:

1. **Primary source**: CORD-19 abstracts and clinical trials descriptions
   - Rich medical terminology and symptom descriptions
   - Formal medical language similar to clinical notes

2. **Secondary source**: Twitter data
   - Real-world symptom descriptions in lay language
   - Useful for understanding how patients describe symptoms
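
Because these sources differ in register, some light normalization before annotation may help. The sketch below is one possible approach, not a settled design: it strips URLs, mentions, and hash signs from tweets, collapses whitespace, drops case-insensitive duplicates, and keeps a source tag on each document. The function names and the `"twitter"` source label are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")

def normalize_text(text, is_tweet=False):
    """Collapse whitespace; for tweets also drop URLs, @mentions and '#' signs."""
    if is_tweet:
        text = URL_RE.sub(" ", text)
        text = MENTION_RE.sub(" ", text)
        text = text.replace("#", "")
    return WHITESPACE_RE.sub(" ", text).strip()

def build_corpus(docs):
    """docs: iterable of (source, text) pairs. Returns de-duplicated records."""
    seen = set()
    corpus = []
    for source, text in docs:
        cleaned = normalize_text(text, is_tweet=(source == "twitter"))
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            corpus.append({"source": source, "text": cleaned})
    return corpus

# Made-up inputs for illustration
docs = [
    ("cord19", "Fever  and cough."),
    ("twitter", "fever and cough. https://example.com/x"),
    ("twitter", "@bob I have a   #fever"),
]
for record in build_corpus(docs):
    print(record)
```

Keeping the source tag makes it possible to balance or evaluate formal and lay-language text separately later on.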

In [None]:
# Example of combining text sources for NER training
def prepare_ner_corpus(cord19_path=None, trials_path=None, twitter_path=None):
    """
    Prepare a corpus of texts for NER training.
    """
    corpus = []
    
    # Add CORD-19 abstracts if available
    if cord19_path and os.path.exists(cord19_path):
        cord19_df = pd.read_csv(cord19_path)
        if 'abstract' in cord19_df.columns:
            abstracts = cord19_df['abstract'].dropna().tolist()
            corpus.extend(abstracts[:100])  # Limit to 100 for demonstration
            print(f"Added {len(abstracts[:100])} CORD-19 abstracts")
    
    # Add clinical trial descriptions if available
    if trials_path and os.path.exists(trials_path):
        with open(trials_path, 'r') as f:
            trials_data = json.load(f)
        
        trials_df = pd.DataFrame(trials_data['StudyFieldsResponse']['StudyFields'])
        if 'DetailedDescription' in trials_df.columns:
            descriptions = [desc[0] for desc in trials_df['DetailedDescription'] if desc]
            corpus.extend(descriptions[:50])  # Limit to 50 for demonstration
            print(f"Added {len(descriptions[:50])} clinical trial descriptions")
    
    # Add tweets if available
    if twitter_path and os.path.exists(twitter_path):
        twitter_df = pd.read_csv(twitter_path, sep='\t')
        tweet_col = [col for col in twitter_df.columns if 'text' in col.lower() or 'tweet' in col.lower()][0]
        symptom_tweets = twitter_df[twitter_df[tweet_col].str.contains('symptom|cough|fever|breath', case=False, na=False)]
        tweets = symptom_tweets[tweet_col].dropna().tolist()
        corpus.extend(tweets[:200])  # Limit to 200 for demonstration
        print(f"Added {len(tweets[:200])} COVID-related tweets")
    
    print(f"\nTotal corpus size: {len(corpus)} documents")
    return corpus

# Don't run this yet - we'll wait until we have the data
# ner_corpus = prepare_ner_corpus(
#     cord19_path='../data/external/cord19_metadata.csv',
#     trials_path='../data/external/covid19_clinical_trials.json',
#     twitter_path='../data/external/covid19_tweets.tsv'
# )

### Stage 2: Classification Data Integration

For the classification stage, we need structured patient data with COVID-19 diagnoses. Here's our strategy:

1. **Primary source**: CDC Case Surveillance data
   - Contains demographic information, case status (laboratory-confirmed or probable), and outcome flags such as hospitalization, ICU admission, and death
   - Large sample size for training classification models

2. **Secondary source**: Extracted features from NER
   - Will add extracted symptoms and severity as features
   - Bridges the gap between unstructured text and structured prediction
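
To illustrate how extracted entities could become structured features, here is a minimal sketch that maps a list of symptom strings to binary indicator columns. The vocabulary, the `has_` prefix, and the function name are assumptions; the real feature set would depend on the labels the NER stage produces.

```python
from typing import Dict, Iterable, List

# Hypothetical symptom vocabulary; a real one would come from the NER label set
SYMPTOM_VOCAB = ["cough", "fever", "shortness of breath", "fatigue"]

def entities_to_features(
    entities: Iterable[str], vocab: List[str] = SYMPTOM_VOCAB
) -> Dict[str, int]:
    """Return one 0/1 indicator per vocabulary term, matched case-insensitively."""
    found = {entity.strip().lower() for entity in entities}
    return {f"has_{term.replace(' ', '_')}": int(term in found) for term in vocab}

# Entities as they might come out of the NER stage for one document
print(entities_to_features(["Fever", " cough "]))
```

Each resulting dictionary is one row of indicator features that could be joined to the structured record for the same patient, provided a shared identifier exists.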

In [None]:
# Example of preparing classification data
def prepare_classification_data(cdc_path=None, mimic_path=None):
    """
    Prepare structured data for COVID-19 classification.
    """
    classification_data = None
    
    # Use CDC data if available
    if cdc_path and os.path.exists(cdc_path):
        cdc_df = pd.read_csv(cdc_path, nrows=10000)  # Limit rows for demonstration
        
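        # Keep only the columns present in this extract; the CDC file stores
        # race and ethnicity in a single 'race_ethnicity_combined' column
        relevant_cols = [col for col in cdc_df.columns if col in [
            'current_status', 'sex', 'age_group', 'race_ethnicity_combined',
            'hosp_yn', 'icu_yn', 'death_yn', 'medcond_yn'
        ]]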
        
        if relevant_cols:
            classification_data = cdc_df[relevant_cols].copy()
            print(f"Prepared classification data with {len(classification_data)} rows and {len(relevant_cols)} features")
        else:
            print("No relevant columns found in CDC data")
    
    # Use MIMIC data if available
    if mimic_path and os.path.exists(mimic_path) and classification_data is None:
        # This would require custom processing for MIMIC
        print("MIMIC data processing would be implemented here")
    
    if classification_data is None:
        print("No classification data available")
        return None
        
    return classification_data

# Don't run this yet - we'll wait until we have the data
# classification_data = prepare_classification_data(
#     cdc_path='../data/external/covid19_case_surveillance.csv',
#     mimic_path='../data/external/mimic_notes.csv'
# )

## Next Steps

Based on our exploration, here are the next steps for our COVID-19 detection pipeline:

1. **Download the datasets**
   - CORD-19 metadata and papers
   - CDC case surveillance data
   - Clinical trials data
   - Twitter data (if available)

2. **Prepare NER training data**
   - Extract and clean text from multiple sources
   - Annotate a sample for training our NER models

3. **Prepare classification features**
   - Process structured CDC data
   - Prepare to integrate extracted entities from NER stage

4. **Implement the complete pipeline**
   - NER to extract medical entities
   - Classification to predict COVID-19 likelihood

The next notebook will focus on preparing the NER training data from these sources.