# Machine Learning Course: Data Exploration

## Notebook 1: Initial Data Exploration of the Metabric Dataset

### Learning Objectives
By the end of this notebook, you will be able to:
1. Load and examine the structure of transcriptomics and clinical data
2. Identify and handle missing values in genomics datasets (if there are any)
3. Explore clinical variables and their distributions
4. Understand sample types and patient characteristics
5. Visualize key patterns in the data
6. Assess data quality and identify potential issues

### Dataset Overview
The Metabric (Molecular Taxonomy of Breast Cancer International Consortium) dataset contains:
- **Gene expression data**: ~24,000 genes across ~2,000 breast cancer samples
- **Clinical data**: Patient demographics, tumor characteristics, and survival outcomes
- **Sample metadata**: Information about sample collection and processing

---

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [None]:
# 📝 ACTIVITY 1: Import Required Libraries
# 
# Your task: Import the necessary libraries for data analysis and visualization
# 
# TODO: Import the following libraries with appropriate aliases:
# 1. pandas (as pd) - for data manipulation
# 2. numpy (as np) - for numerical operations  
# 3. matplotlib.pyplot (as plt) - for plotting
# 4. seaborn (as sns) - for statistical visualization
# 5. os - for file system operations
# 6. warnings - to suppress warning messages
#
# Write your code below:
# from scipy import stats
# from scipy.stats import chi2_contingency
# import numpy as np

## 2. Task 1: Data Loading

Let's start by loading the Metabric dataset. We'll load both the gene expression data and clinical data.

### 2.1 Load Data

In [None]:
# 📝 ACTIVITY 2: Load Gene Expression Data
#
# Your task: Load the Illumina microarray expression data (tab-separated format)
#
# TODO: Implement file loading with error handling:
# 1. Use a try-except block to handle potential file loading errors (optional)
# 2. Create the full file path using os.path.join(DATA_PATH, EXPRESSION_FILE)
# 3. Check if the file exists using os.path.exists()
#
# TODO: Load the expression data correctly:
# 4. The file is tab-separated with NO comment lines
# 5. First column is Hugo_Symbol (gene names), second is Entrez_Gene_Id
# 6. Use pd.read_csv(file_path, sep='\t', index_col=0) to load with gene symbols as index
# 7. This will make Hugo_Symbol the row index (gene names)
# 8. You can drop the Entrez_Gene_Id column if desired: .drop('Entrez_Gene_Id', axis=1)
#
# TODO: Display data information:
# 9. Print the shape of the loaded data
# 10. Print number of genes (rows) and samples (columns) with comma formatting
# 11. Print the data type (should be float64 for expression values)
# 12. Show first few gene names (index) and sample names (columns)
# 13. Display a small preview (first 5 genes × 5 samples)
#
# TODO: Validate the data structure:
# 14. Check that sample names follow MB-XXXX format
# 15. Verify expression values are in reasonable range (log2 or similar scale)
# 16. Check for any obvious data quality issues

# 17. Load the other two datasets containing patient information and sample information

# Write your code below:
# try:
#     expression_file_path = ...
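
One possible way to approach the loading steps listed in this activity is sketched below. It is a minimal example rather than the reference solution: `DATA_PATH`, `EXPRESSION_FILE`, the helper `load_expression`, and the tiny demo file are all placeholders made up for illustration, and the `MB-` followed by digits pattern is an assumption about what "MB-XXXX" means. Replace the placeholders with the real course paths before using it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical locations: DATA_PATH and EXPRESSION_FILE are placeholders,
# and the tiny file written here stands in for the real expression file.
DATA_PATH = tempfile.mkdtemp()
EXPRESSION_FILE = "expression_demo.txt"

demo_rows = [
    "Hugo_Symbol\tEntrez_Gene_Id\tMB-0001\tMB-0002\tMB-0003",
    "GENE_A\t1\t5.1\t6.2\t5.8",
    "GENE_B\t2\t7.4\t7.9\t8.3",
]
with open(os.path.join(DATA_PATH, EXPRESSION_FILE), "w") as handle:
    handle.write("\n".join(demo_rows) + "\n")


def load_expression(data_path, file_name):
    """Read a tab-separated expression matrix with gene symbols as the index."""
    file_path = os.path.join(data_path, file_name)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Expression file not found: {file_path}")
    data = pd.read_csv(file_path, sep="\t", index_col=0)
    # Entrez_Gene_Id is an identifier, not an expression value, so drop it if present.
    return data.drop(columns="Entrez_Gene_Id", errors="ignore")


try:
    expression_data = load_expression(DATA_PATH, EXPRESSION_FILE)
except (FileNotFoundError, pd.errors.ParserError) as exc:
    expression_data = None
    print(f"Could not load expression data: {exc}")
else:
    n_genes, n_samples = expression_data.shape
    print(f"Genes: {n_genes:,}  Samples: {n_samples:,}")
    print("dtypes:", expression_data.dtypes.unique())
    # Assumption: valid sample IDs look like MB- followed by digits.
    valid_ids = expression_data.columns.str.match(r"^MB-\d+$")
    print(f"Columns matching MB-XXXX: {valid_ids.sum()} of {n_samples}")
    print(expression_data.iloc[:5, :5])
```

Raising `FileNotFoundError` inside the helper keeps the existence check and the read in one place, so the `try`/`except` around the call only has to report the problem.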

## 3. Task 2: Data Exploration

Now let's explore the loaded data to understand its structure, quality, and characteristics.

### 3.1 Expression Data Exploration

In [None]:
# 📝 ACTIVITY 7: Explore Expression Data Structure
#
# Your task: Examine the basic structure and characteristics of gene expression data
#
# TODO: Display basic dataset information:
# 1. Print the shape of expression_data
# 2. Print the data type of the first column (use .dtypes.iloc[0])
# 3. Print the index name (genes); it may be None
# 4. Print the columns name (samples); it may be None
# 5. Check for NA (missing) values in the expression data
#
# TODO: Show sample of data structure:
# 6. Print "First few genes:" and display first 10 gene names as a list
# 7. Print "First few samples:" and display first 10 sample names as a list  
# 8. Print "Expression data preview:" and display first 5 rows × 5 columns
# 9. Use dataframe.iloc[a:b, c:d] to display (plain dataframe[a:b, c:d] is not valid pandas indexing)
#
# TODO: Provide context:
# 10. Check if sample normalisation is present
# 11. Explain what the expression values likely represent (Z-scores, log2, etc.)
#
# Expected output: Structured overview of expression data dimensions, types, and preview

# Write your code below:
# ...
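
The checks in this activity could look roughly like the sketch below. It runs on a small synthetic table built only for illustration (random values, made-up gene names, one injected missing value), so the numbers it prints say nothing about the real dataset. The median comparison at the end is one simple heuristic for spotting sample normalisation, not the only one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for expression_data (genes as rows, samples as columns).
rng = np.random.default_rng(0)
expression_data = pd.DataFrame(
    rng.normal(loc=7.0, scale=1.5, size=(6, 4)),
    index=pd.Index([f"GENE_{i}" for i in range(6)], name="Hugo_Symbol"),
    columns=[f"MB-{i:04d}" for i in range(1, 5)],
)
expression_data.iloc[2, 1] = np.nan  # inject one missing value for the demo

print("Shape:", expression_data.shape)
print("dtype of first column:", expression_data.dtypes.iloc[0])
print("Index name:", expression_data.index.name or "(unnamed)")
print("Columns name:", expression_data.columns.name or "(unnamed)")

n_missing = int(expression_data.isna().sum().sum())
print(f"Missing values: {n_missing:,} of {expression_data.size:,}")

print("First few genes:", expression_data.index[:10].tolist())
print("First few samples:", expression_data.columns[:10].tolist())
# Positional slicing goes through .iloc.
print("Expression data preview:")
print(expression_data.iloc[:5, :5])

# Rough normalisation check: per-sample medians should be similar
# if the samples were brought onto a common scale.
sample_medians = expression_data.median(axis=0)
print("Spread of sample medians:", float(sample_medians.max() - sample_medians.min()))
print("Overall value range:", float(expression_data.min().min()),
      "to", float(expression_data.max().max()))
```

As a rule of thumb, values centred near zero with both signs suggest Z-scores, while strictly positive values in a narrow band are more typical of log2 intensities. Confirm against the dataset documentation rather than relying on the range alone.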

### 3.2 Clinical Data Exploration

In [None]:
# 📝 ACTIVITY 11: Explore Clinical Data Structure
#
# Your task: Examine the structure and content of clinical data
#
# TODO: Display basic clinical data information:
# 2. Print number of patients using len(clinical_data) with comma formatting
# 3. Print number of variables using len(clinical_data.columns) with comma formatting
#
# TODO: Show column information systematically:
# 4. Print "Column names and data types:"
# 5. Loop through clinical_data.columns and clinical_data.dtypes using zip()
# 6. Print each column with index number, name (left-aligned in 30 characters), and data type
# 7. Use string formatting: f"{i+1:2d}. {col:<30} {str(dtype):<10}"
#
# TODO: Display data preview:
# 8. Print "First few rows:"
# 9. Use clinical_data.head() to show first 5 rows in nice format
#
# Expected output: Systematic overview of clinical data structure, columns, and preview

# Write your code below:
# ...
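
The column listing described in this activity can be sketched as follows. The three-row `clinical_data` table and its column names are invented for the example; the real clinical file will have different columns.

```python
import pandas as pd

# Synthetic stand-in for clinical_data; the column names are illustrative only.
clinical_data = pd.DataFrame({
    "PATIENT_ID": ["MB-0001", "MB-0002", "MB-0003"],
    "AGE_AT_DIAGNOSIS": [61.2, 48.7, 70.1],
    "ER_STATUS": ["Positive", "Negative", "Positive"],
})

print(f"Number of patients: {len(clinical_data):,}")
print(f"Number of variables: {len(clinical_data.columns):,}")

print("Column names and data types:")
column_lines = []
# enumerate() supplies the running index; zip() pairs each column with its dtype.
for i, (col, dtype) in enumerate(zip(clinical_data.columns, clinical_data.dtypes)):
    line = f"{i+1:2d}. {col:<30} {str(dtype):<10}"
    column_lines.append(line)
    print(line)

print("First few rows:")
print(clinical_data.head())
```

In a notebook, ending the cell with `clinical_data.head()` instead of wrapping it in `print()` gives the rendered table view.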

### 3.4 Sample Types and Patient Characteristics

In [None]:
# 📝 ACTIVITY 7: Patient Characteristics and Visualization
#
# Your task: Analyze and visualize key patient characteristics
#
# TODO: Analyze categorical variables:
# 1. Display distributions of key clinical variables
# 2. Create simple visualizations (bar plots or pie charts)
# 3. Identify any data quality issues
#
# TODO: Basic visualization:
# 4. Create 2-3 plots showing important clinical distributions
# 5. Use matplotlib/seaborn for visualization
#
# Expected output: Clinical characteristics summary with visualizations

# Write your code below:
# # Analyze key categorical variables
# for var in categorical_vars[:4]:  # First 4 categorical variables
#     if var in clinical_data.columns:
#         counts = clinical_data[var].value_counts()
#         print(f"\n{var}:")
#         for val, count in counts.head(3).items():
#             pct = (count/len(clinical_data))*100
#             print(f"  {val}: {count} ({pct:.1f}%)")
# 
# # Simple visualization
# fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# if len(categorical_vars) >= 2:
#     clinical_data[categorical_vars[0]].value_counts().plot(kind='bar', ax=axes[0])
#     axes[0].set_title(f'Distribution of {categorical_vars[0]}')
#     
#     clinical_data[categorical_vars[1]].value_counts().plot(kind='pie', ax=axes[1])
#     axes[1].set_title(f'Distribution of {categorical_vars[1]}')
# 
# plt.tight_layout()
# plt.show()
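
The hint code in this activity relies on a list called `categorical_vars` that has to be built first. A minimal sketch of one way to derive it is shown below, under the assumption that any non-numeric column counts as categorical. The table and its column names are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for clinical_data; the column names are illustrative only.
clinical_data = pd.DataFrame({
    "ER_STATUS": ["Positive", "Negative", "Positive", "Positive", None],
    "GRADE_LABEL": ["High", "Low", "High", "Mid", "Mid"],
    "AGE_AT_DIAGNOSIS": [61.2, 48.7, 70.1, 55.0, 66.3],
})

# Assumed rule: numeric dtypes are numerical, everything else is categorical.
numerical_vars = clinical_data.select_dtypes(include="number").columns.tolist()
categorical_vars = [c for c in clinical_data.columns if c not in numerical_vars]

for var in categorical_vars:
    # dropna=False keeps missing entries visible as their own category.
    counts = clinical_data[var].value_counts(dropna=False)
    print(f"\n{var}:")
    for val, count in counts.items():
        pct = count / len(clinical_data) * 100
        print(f"  {val}: {count} ({pct:.1f}%)")
```

Clinical tables often store categories as numbers (grades or stage codes, for example), so a dtype-based split is only a starting point. Review the resulting lists by hand and move any coded columns into `categorical_vars`.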

## 4. Summary and Key Findings

Let's summarize the key findings from our data exploration.

In [None]:
# 📝 ACTIVITY 18: Save Exploration Results (Optional)
#
# Your task: Save exploration results for future reference and analysis
# Note: This is an optional activity for organizing your findings
#
# TODO: Create results directory structure:
# 1. Use os.makedirs('../results/exploration', exist_ok=True) to create directory
# 2. This will store your exploration outputs for later use
#
# TODO: Create and save exploration summary dictionary:
# 3. Create exploration_summary dictionary with sections:
#    - dataset_info: shapes and missing value percentages
#    - expression_stats: basic statistics (if expression data available)  
#    - clinical_variables: lists of numerical and categorical variables
# 4. Handle cases where expression_data might not be available
#
# TODO: Save gene statistics (if expression data available):
# 5. Create DataFrame with gene-level statistics:
#    - gene names, mean, std, variance, min, max for each gene
# 6. Save as CSV: '../results/exploration/gene_statistics.csv'
# 7. Use .to_csv(filename, index=False)
#
# TODO: Save sample statistics (if expression data available):
# 8. Create DataFrame with sample-level statistics:
#    - sample names, mean, std, median, quartiles for each sample
# 9. Save as CSV: '../results/exploration/sample_statistics.csv'
#
# TODO: Save clinical data summary:
# 10. Create dictionary with value counts for each categorical variable
# 11. Use json.dump() to save as '../results/exploration/clinical_summary.json'
# 12. Handle potential JSON serialization issues with default=str
#
# TODO: Print confirmation messages:
# 13. List all files that were successfully saved
# 14. Provide guidance on what to do next
# 15. Mention the next notebook in the sequence
#
# Expected output: Saved files and confirmation messages

# Write your code below:
# import os
# os.makedirs('../results/exploration', exist_ok=True)
# ...

## 5. Save Exploration Results

Let's save some key exploration results for future reference.

In [None]:
# Imports used in this cell (repeated here so the cell runs on its own)
import json
import os

import pandas as pd

# Create results directory if it doesn't exist
os.makedirs('../results/exploration', exist_ok=True)

# Build the data summary
exploration_summary = {
    'dataset_info': {
        'expression_shape': expression_data.shape,
        'clinical_shape': clinical_data.shape,
        'expression_missing_pct': (expression_data.isnull().sum().sum() / expression_data.size) * 100,
        'clinical_missing_pct': (clinical_data.isnull().sum().sum() / clinical_data.size) * 100
    },
    'expression_stats': {
        'mean': float(expression_data.values.mean()),
        'std': float(expression_data.values.std()),
        'min': float(expression_data.values.min()),
        'max': float(expression_data.values.max()),
        'low_variance_genes': int((expression_data.var(axis=1) < 0.1).sum())
    },
    'clinical_variables': {
        'numerical': numerical_vars,
        'categorical': categorical_vars
    }
}

# Write the summary to disk (the file name is an assumption; adjust as needed).
# default=str converts any value that json cannot serialize natively.
with open('../results/exploration/exploration_summary.json', 'w') as f:
    json.dump(exploration_summary, f, indent=2, default=str)

# Save gene-level statistics (one row per gene)
gene_stats = pd.DataFrame({
    'gene': expression_data.index,
    'mean': expression_data.mean(axis=1),
    'std': expression_data.std(axis=1),
    'variance': expression_data.var(axis=1),
    'min': expression_data.min(axis=1),
    'max': expression_data.max(axis=1)
})

gene_stats.to_csv('../results/exploration/gene_statistics.csv', index=False)

# Save sample-level statistics (one row per sample)
sample_stats = pd.DataFrame({
    'sample': expression_data.columns,
    'mean': expression_data.mean(axis=0),
    'std': expression_data.std(axis=0),
    'median': expression_data.median(axis=0),
    'q25': expression_data.quantile(0.25, axis=0),
    'q75': expression_data.quantile(0.75, axis=0)
})

sample_stats.to_csv('../results/exploration/sample_statistics.csv', index=False)

saved_files = [
    '../results/exploration/exploration_summary.json',
    '../results/exploration/gene_statistics.csv',
    '../results/exploration/sample_statistics.csv',
]

# Save clinical data summary (value counts per categorical variable)
if len(categorical_vars) > 0:
    clinical_summary = {}
    for var in categorical_vars:
        if var in clinical_data.columns:
            counts = clinical_data[var].value_counts()
            # Cast keys to str and counts to int so json.dump accepts them
            clinical_summary[var] = {str(k): int(v) for k, v in counts.items()}

    with open('../results/exploration/clinical_summary.json', 'w') as f:
        json.dump(clinical_summary, f, indent=2, default=str)
    saved_files.append('../results/exploration/clinical_summary.json')

# Report only the files that were actually written
print("Exploration results saved to:")
for path in saved_files:
    print(f"  • {path}")

print("\n📝 Ready for the next notebook: 02_data_preprocessing.ipynb")

---

## 📚 Summary

### ✅ **What You Accomplished:**
1. **Setup & Data Loading**: Imported libraries and loaded Metabric datasets
2. **Data Overview**: Explored expression data (~24,000 genes) and clinical data structure  
3. **Quality Assessment**: Checked for missing values and basic data quality
4. **Clinical Analysis**: Analyzed patient characteristics and key variables
5. **Visualization**: Created basic plots of clinical distributions
6. **Export**: Saved results for next notebook

### 🔄 **Next Steps:**
Run `02_data_preprocessing.ipynb` to clean and prepare the data for machine learning.

---

**Great job exploring the data! 🎉**