# Machine Learning Course: Data Exploration

## Notebook 1: Initial Data Exploration of the Metabric Dataset

### Learning Objectives
By the end of this notebook, you will be able to:
1. Load and examine the structure of transcriptomics and clinical data
2. Identify and handle missing values in genomics datasets
3. Explore clinical variables and their distributions
4. Understand sample types and patient characteristics
5. Visualize key patterns in the data
6. Assess data quality and identify potential issues

### Dataset Overview
The Metabric (Molecular Taxonomy of Breast Cancer International Consortium) dataset contains:
- **Gene expression data**: ~20,603 genes across ~1,904 breast cancer samples
- **Clinical data**: Patient demographics, tumor characteristics, and survival outcomes
- **Sample metadata**: Information about sample collection and processing

---

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [None]:
# 📝 ACTIVITY 1: Import Required Libraries
# 
# Your task: Import the necessary libraries for data analysis and visualization
# 
# TODO: Import the following libraries with appropriate aliases:
# 1. pandas (as pd) - for data manipulation
# 2. numpy (as np) - for numerical operations  
# 3. matplotlib.pyplot (as plt) - for plotting
# 4. seaborn (as sns) - for statistical visualization
# 5. os - for file system operations
# 6. warnings - to suppress warning messages
#
# TODO: Import specific modules:
# 7. From scipy import stats, and from scipy.stats import chi2_contingency
#
# TODO: Configure the environment:
# 8. Suppress warnings using warnings.filterwarnings('ignore')
# 9. Set matplotlib plotting style to 'default'
# 10. Set seaborn color palette to "husl"
# 11. Set numpy random seed to 42 for reproducibility
#
# TODO: Print library versions to verify imports:
# 12. Print versions of pandas, numpy, matplotlib, and seaborn
# 13. Create the '../data' directory if it doesn't exist using os.makedirs()
#
# Expected output: Success messages with library versions and confirmation of data directory creation

# Write your code below:
# import ...

## 2. Task 1: Data Loading

Let's start by loading the Metabric dataset. We'll load both the gene expression data and clinical data.

In [None]:
# 📝 ACTIVITY 2: Set Up Data Loading Configuration
#
# Your task: Define the data file paths based on the actual files in the data folder
#
# TODO: Define variables for data organization:
# 1. Create a variable DATA_PATH pointing to '../data/'
# 2. Define file names for the actual datasets present:
#    - CLINICAL_PATIENT_FILE = 'data_clinical_patient.txt'
#    - CLINICAL_SAMPLE_FILE = 'data_clinical_sample.txt'  
#    - EXPRESSION_FILE = 'data_mrna_illumina_microarray.txt'
#
# TODO: Create an informative display:
# 3. Print a header "=== METABRIC DATA FILES STATUS ==="
# 4. Show information about the data files:
#    - All three data files are now present
#    - Clinical files have comment headers (start with #)
#    - Expression file is direct tab-separated format (no comments)
#    - Expression file contains ~20,603 genes
#
# TODO: Check file availability:
# 5. Create a list of the three file names
# 6. Loop through each file and check if os.path.exists(os.path.join(DATA_PATH, file))
# 7. Print ✓ for found files and ⚠️ for missing files
# 8. Display the expected file paths for any missing files
# 9. Print a summary message about file availability
#
# TODO: Provide dataset information:
# 10. Print information about expected data structure:
#     - Clinical patient data: ~1,904 patients with clinical variables
#     - Clinical sample data: Sample-level information
#     - Expression data: ~20,603 genes × ~1,904 samples (Illumina microarray data)
#
# Expected output: File availability status and dataset structure information

# Write your code below:
# DATA_PATH = ...
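
The availability check described in this activity could be sketched roughly as follows. This is only one possible shape for it: the helper name `check_files` is hypothetical, and the path and file names are the ones listed in the activity.

```python
import os

def check_files(data_path, file_names):
    """Print one status line per file and return {file_name: found?}."""
    status = {}
    for name in file_names:
        full_path = os.path.join(data_path, name)
        found = os.path.exists(full_path)
        status[name] = found
        if found:
            print(f"✓ {name}")
        else:
            # Show where the file was expected so it is easy to fix
            print(f"⚠️ {name} (expected at {full_path})")
    print(f"{sum(status.values())} of {len(status)} files found")
    return status

status = check_files(
    "../data/",
    [
        "data_clinical_patient.txt",
        "data_clinical_sample.txt",
        "data_mrna_illumina_microarray.txt",
    ],
)
```

Returning the dictionary as well as printing makes it easy for later cells to decide whether to load real data or fall back to dummy data.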

### 2.1 Load Gene Expression Data

In [None]:
# 📝 ACTIVITY 3: Load Gene Expression Data
#
# Your task: Load the Illumina microarray expression data (tab-separated format)
#
# TODO: Implement file loading with error handling:
# 1. Use a try-except block to handle potential file loading errors
# 2. Create the full file path using os.path.join(DATA_PATH, EXPRESSION_FILE)
# 3. Check if the file exists using os.path.exists()
#
# TODO: Load the expression data correctly:
# 4. The file is tab-separated with NO comment lines
# 5. First column is Hugo_Symbol (gene names), second is Entrez_Gene_Id
# 6. Use pd.read_csv(file_path, sep='\t', index_col=0) to load with gene symbols as index
# 7. This will make Hugo_Symbol the row index (gene names)
# 8. You can drop the Entrez_Gene_Id column if desired: .drop('Entrez_Gene_Id', axis=1)
#
# TODO: Display data information:
# 9. Print the shape of the loaded data
# 10. Print number of genes (rows) and samples (columns) with comma formatting
# 11. Print the data type (should be float64 for expression values)
# 12. Show first few gene names (index) and sample names (columns)
# 13. Display a small preview (first 5 genes × 5 samples)
#
# TODO: Validate the data structure:
# 14. Check that sample names follow MB-XXXX format
# 15. Verify expression values are in reasonable range (log2 or similar scale)
# 16. Check for any obvious data quality issues
#
# TODO: Handle loading errors:
# 17. If file loading fails, print informative error message
# 18. You can still create dummy data if needed for practice
# 19. But the real data should load successfully now
#
# Expected output: Successful loading of ~20,603 genes × ~1,904 samples expression matrix

# Write your code below:
# try:
#     expression_file_path = ...
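
For the validation steps in this activity, a minimal sketch of the sample-name check might look like this. The regular expression is an assumption based on the MB-XXXX examples in this notebook (four digits after `MB-`), the helper name is hypothetical, and the toy DataFrame only stands in for the real expression matrix.

```python
import re
import pandas as pd

# Assumed ID pattern: "MB-" followed by exactly four digits (e.g. MB-0362)
SAMPLE_ID_PATTERN = re.compile(r"^MB-\d{4}$")

def find_unexpected_sample_ids(expression_df):
    """Return the column names that do not look like MB-XXXX sample IDs."""
    return [
        col for col in expression_df.columns
        if not SAMPLE_ID_PATTERN.match(str(col))
    ]

# Toy stand-in for the loaded expression matrix (values are made up)
toy_expression = pd.DataFrame(
    {"MB-0362": [5.1, 7.3], "MB-0346": [6.0, 8.2], "Entrez_Gene_Id": [1.0, 2.0]},
    index=["GENE_A", "GENE_B"],
)

unexpected = find_unexpected_sample_ids(toy_expression)
print(unexpected)  # ['Entrez_Gene_Id']
```

A leftover `Entrez_Gene_Id` column is a typical thing this check catches when the drop step was skipped.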

### 2.2 Load Clinical Data

In [None]:
# 📝 ACTIVITY 4: Load Clinical Patient Data
#
# Your task: Load the actual clinical patient data (tab-separated format)
#
# TODO: Implement robust clinical data loading:
# 1. Create a try-except structure for error handling
# 2. Build the file path using os.path.join(DATA_PATH, CLINICAL_PATIENT_FILE)
# 3. Note: The file is tab-separated with comment lines starting with #
#
# TODO: Load the tab-separated file:
# 4. Use pd.read_csv(file_path, sep='\t', comment='#') to skip header lines
# 5. If that fails, try without comment parameter
# 6. If still fails, try with encoding='latin-1'
# 7. Print which loading method worked
#
# TODO: Display clinical data information:
# 8. Print the shape (rows, columns) with comma formatting
# 9. Print number of patients and clinical variables
# 10. Show the column names (there should be many clinical variables)
# 11. Display the first few rows to understand the data structure
#
# TODO: Explore the actual column names present:
# 12. The file contains columns like: PATIENT_ID, AGE_AT_DIAGNOSIS, OS_MONTHS, etc.
# 13. Print a sample of column names to see what clinical variables are available
# 14. Look for key variables: ER_IHC, HER2_SNP6, CLAUDIN_SUBTYPE, etc.
#
# TODO: Handle any loading errors:
# 15. If file loading fails, print the error message
# 16. Provide guidance on checking file format or permissions
#
# Expected output: Clinical data loading status and structure information

# Write your code below:
# try:
#     clinical_file_path = ...
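
The "try one method, then fall back" idea from this activity can be sketched as a small loop over candidate settings. The function name `load_tsv_with_fallbacks` is hypothetical; the three attempts mirror the ones listed in the TODOs.

```python
import pandas as pd

def load_tsv_with_fallbacks(file_path):
    """Try several pd.read_csv settings in order and report which one worked."""
    attempts = [
        ("tab-separated, skipping '#' lines", {"sep": "\t", "comment": "#"}),
        ("tab-separated, no comment handling", {"sep": "\t"}),
        ("tab-separated, latin-1 encoding", {"sep": "\t", "encoding": "latin-1"}),
    ]
    last_error = None
    for label, kwargs in attempts:
        try:
            df = pd.read_csv(file_path, **kwargs)
            print(f"Loaded with: {label}")
            return df
        except Exception as err:  # keep going and remember the last failure
            last_error = err
            print(f"Failed ({label}): {err}")
    raise RuntimeError(f"Could not load {file_path}") from last_error
```

It would be called as `load_tsv_with_fallbacks(os.path.join(DATA_PATH, CLINICAL_PATIENT_FILE))`. One caveat worth knowing: `comment='#'` also cuts off the rest of any data line that contains a `#`, so it is worth checking the loaded shape against expectations.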

### 2.3 Load Sample Metadata (if available)

In [None]:
# 📝 ACTIVITY 5: Load Clinical Sample Data
#
# Your task: Load the sample-level clinical data (tab-separated format)
#
# TODO: Implement sample data loading:
# 1. Use try-except to handle potential loading issues
# 2. Create file path using os.path.join(DATA_PATH, CLINICAL_SAMPLE_FILE)
# 3. This file contains sample-level information (may have multiple samples per patient)
#
# TODO: Load the tab-separated sample file:
# 4. Use pd.read_csv(file_path, sep='\t', comment='#') to skip header lines
# 5. If that fails, try without comment parameter
# 6. Print which format worked
# 7. Display shape and basic information if successful
#
# TODO: Explore sample data structure:
# 8. Print column names to see what sample-level variables are available
# 9. Look for key columns: SAMPLE_ID, PATIENT_ID, TUMOR_SIZE, TUMOR_STAGE, etc.
# 10. Check if there are multiple samples per patient
# 11. Show first few rows to understand data structure
#
# TODO: Compare with patient data:
# 12. Compare PATIENT_ID values between clinical_data and sample_data
# 13. Check if all patients have corresponding sample records
# 14. Note any differences in the number of records
#
# TODO: Handle errors gracefully:
# 15. If file loading fails, print error message and continue
# 16. Set sample_data = None if loading fails
#
# Expected output: Sample data information and comparison with patient data

# Write your code below:
# try:
#     sample_file_path = ...
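
The comparison between patient and sample records could be approached with plain set operations, as in the sketch below. The helper name is hypothetical and the two toy DataFrames are made up; only the `PATIENT_ID` and `SAMPLE_ID` column names come from the activity.

```python
import pandas as pd

def compare_patient_ids(clinical_data, sample_data, id_col="PATIENT_ID"):
    """Summarise how patient IDs line up between the two clinical tables."""
    patient_ids = set(clinical_data[id_col].dropna())
    sample_patient_ids = set(sample_data[id_col].dropna())
    samples_per_patient = sample_data[id_col].value_counts()
    return {
        "patients_without_samples": sorted(patient_ids - sample_patient_ids),
        "samples_without_patients": sorted(sample_patient_ids - patient_ids),
        "patients_with_multiple_samples": sorted(
            samples_per_patient[samples_per_patient > 1].index
        ),
    }

# Toy data standing in for clinical_data and sample_data
clinical_toy = pd.DataFrame({"PATIENT_ID": ["MB-0001", "MB-0002", "MB-0003"]})
sample_toy = pd.DataFrame({
    "PATIENT_ID": ["MB-0001", "MB-0001", "MB-0002", "MB-0004"],
    "SAMPLE_ID": ["S1", "S2", "S3", "S4"],
})

report = compare_patient_ids(clinical_toy, sample_toy)
for key, ids in report.items():
    print(f"{key}: {ids}")
```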

### 2.4 Data Loading Summary

In [None]:
# 📝 ACTIVITY 6: Create Data Loading Summary
#
# Your task: Generate a comprehensive summary of loaded datasets
#
# TODO: Create a formatted summary display:
# 1. Print a header with "=" characters: "DATA LOADING SUMMARY"  
# 2. Add separator line with 60 "=" characters
#
# TODO: Display dataset information:
# 3. Print expression data dimensions using the :, format specifier for thousands separators
# 4. Print clinical data dimensions with thousands separators
# 5. Handle the case where sample_data might be None
# 6. Display data types for each dataset
#
# TODO: Calculate and display memory usage:
# 7. Calculate memory usage for expression_data using .memory_usage(deep=True).sum()
# 8. Calculate memory usage for clinical_data the same way  
# 9. Convert bytes to MB by dividing by 1024**2
# 10. Display individual and total memory usage with 2 decimal places
#
# TODO: Add helpful context:
# 11. Print information about data types (DataFrame, etc.)
# 12. Provide guidance on what the numbers mean
# 13. Mention whether real or dummy data is being used
#
# Expected output: Professional summary with dataset dimensions, types, and memory usage

# Write your code below:
# print("\n" + "="*60)
# print("DATA LOADING SUMMARY")
# ...
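
As a rough illustration of the memory calculation in this activity, the sketch below wraps it in two small helpers. Both helper names are hypothetical, and the zero-filled DataFrame is only a stand-in so the example runs without the real files.

```python
import numpy as np
import pandas as pd

def memory_mb(df):
    """Memory footprint of a DataFrame in megabytes (bytes / 1024**2)."""
    return df.memory_usage(deep=True).sum() / 1024**2

def describe_dataset(name, df):
    """Print one summary line and return the memory used (0.0 if not loaded)."""
    if df is None:
        print(f"{name}: not loaded")
        return 0.0
    mb = memory_mb(df)
    print(f"{name}: {df.shape[0]:,} rows x {df.shape[1]:,} columns, {mb:.2f} MB")
    return mb

# Stand-ins for the real objects: a 1,000 x 100 float matrix and a missing sample table
toy_expression = pd.DataFrame(np.zeros((1000, 100)))
toy_sample_data = None

total_mb = describe_dataset("Expression data", toy_expression)
total_mb += describe_dataset("Sample data", toy_sample_data)
print(f"Total: {total_mb:.2f} MB")
```

Using `deep=True` matters mostly for the clinical tables, where text columns would otherwise be undercounted.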

### 🔄 **Quick Data Download Guide**

If you haven't downloaded the data yet, follow these steps:

1. **Create the data directory** (done automatically above)

2. **Download each file** by clicking these links and saving to `../data/`:
   - [Clinical Patient Data](https://drive.google.com/file/d/15AXx2ZKiQ8MhK8EgK6xnE0XbW_PAxY0j/view?usp=sharing) → Save as `data_clinical_patient.txt`
   - [Clinical Sample Data](https://drive.google.com/file/d/1q4I2v-12jUrwJ3Jf8CICi235ZvsODDW7/view?usp=sharing) → Save as `data_clinical_sample.txt`
   - [Expression Data](https://drive.google.com/file/d/1q4I2v-12jUrwJ3Jf8CICi235ZvsODDW7/view?usp=sharing) → Save as `data_mrna_illumina_microarray.txt`

3. **Re-run the data loading cells** above to load the real data

**Note**: The notebook will work with dummy data if files are not available, but real data is recommended for meaningful analysis.

## 3. Task 2: Data Exploration

Now let's explore the loaded data to understand its structure, quality, and characteristics.

### 3.1 Expression Data Exploration

In [None]:
# 📝 ACTIVITY 7: Explore Expression Data Structure
#
# Your task: Examine the basic structure and characteristics of gene expression data
#
# TODO: Display basic dataset information:
# 1. Print "EXPRESSION DATA EXPLORATION" header with "=" separators
# 2. Print the shape of expression_data
# 3. Print the data type of the first column (use .dtypes.iloc[0])
# 4. Print the index name (genes) - handle case where it might be None
# 5. Print the columns name (samples) - handle case where it might be None
#
# TODO: Show sample of data structure:
# 6. Print "First few genes:" and display first 10 gene names as a list
# 7. Print "First few samples:" and display first 10 sample names as a list  
# 8. Print "Expression data preview:" and display first 5 rows × 5 columns
# 9. Use display() function to show the DataFrame preview nicely
#
# TODO: Provide context:
# 10. Add explanatory text about what genes and samples represent
# 11. Explain what the expression values likely represent (Z-scores, log2, etc.)
#
# Expected output: Structured overview of expression data dimensions, types, and preview

# Write your code below:
# print("EXPRESSION DATA EXPLORATION")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 8: Calculate Expression Data Statistics
#
# Your task: Compute and display statistical summaries of expression values
#
# TODO: Calculate overall statistics:
# 1. Print "Statistical Summary of Expression Data:" with separator line
# 2. Calculate and print the overall mean of all expression values
# 3. Calculate and print the overall standard deviation
# 4. Calculate and print minimum and maximum values
# 5. Calculate and print the median using np.nanmedian() (np.median() returns NaN if any value is missing)
# 6. Format all statistics to 3 decimal places
#
# TODO: Perform data quality checks:
# 7. Count infinite values using np.isinf() and .sum()
# 8. Count extreme values where |x| > 10 using np.abs() and boolean indexing (this threshold suits Z-scores; choose a higher one for log2 intensities)
# 9. Calculate percentage of extreme values relative to total data size
# 10. Print these quality metrics with appropriate labels
#
# TODO: Interpret the results:
# 11. Add comments about what these statistics tell us about data quality
# 12. Explain what we'd expect for normalized expression data (Z-scores)
# 13. Flag any concerning patterns (too many extreme values, etc.)
#
# Expected output: Comprehensive statistical summary with quality assessments

# Write your code below:
# print("Statistical Summary of Expression Data:")
# print("="*50)
# ...
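
The quality checks in this activity could be sketched as below. The helper name and the `extreme_threshold` parameter are hypothetical, and the tiny DataFrame is made up so the counts are easy to verify by hand.

```python
import numpy as np
import pandas as pd

def expression_quality_counts(expression_df, extreme_threshold=10.0):
    """Count infinite and extreme values; NaN and inf are excluded from the median."""
    values = expression_df.to_numpy(dtype=float)
    n_infinite = int(np.isinf(values).sum())
    finite = values[np.isfinite(values)]  # drops both NaN and +/-inf
    n_extreme = int((np.abs(finite) > extreme_threshold).sum())
    return {
        "infinite": n_infinite,
        "extreme": n_extreme,
        "extreme_pct": 100 * n_extreme / values.size,
        "median": float(np.median(finite)),
    }

# Toy matrix: one inf, one NaN, and two values beyond +/-10
toy_expression = pd.DataFrame(
    [[1.0, 12.0], [np.inf, -11.0], [np.nan, 0.5]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["MB-0001", "MB-0002"],
)

quality = expression_quality_counts(toy_expression)
for name, value in quality.items():
    print(f"{name}: {value:.3f}")
```

Whether 10 is a sensible cut-off depends on the scale of the data, so it is left as a parameter rather than hard-coded.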

### 3.2 Missing Values Analysis

In [None]:
# 📝 ACTIVITY 9: Analyze Missing Values Pattern
#
# Your task: Systematically analyze missing values in both expression and clinical data
#
# TODO: Set up missing values analysis:
# 1. Print "MISSING VALUES ANALYSIS" header with "=" separators
# 2. Start with expression data analysis
#
# TODO: Analyze expression data missing values:
# 3. Use .isnull().sum() to count missing values per gene
# 4. Find genes with missing values using boolean indexing (where count > 0)
# 5. Calculate total missing values and percentage of total dataset size
# 6. Print summary: total missing, genes affected, percentage
# 7. If genes have missing values, show top 10 genes with most missing values
# 8. If no missing values, print a success message with checkmark
#
# TODO: Analyze clinical data missing values:
# 9. Print separator line and "Clinical Data Missing Values:"
# 10. Calculate missing values per clinical variable using same approach
# 11. Find variables with missing values
# 12. Calculate total missing values and percentage
# 13. If variables have missing values, create a summary DataFrame with:
#    - Missing_Count: number of missing values per variable
#    - Missing_Percentage: percentage of patients missing this variable
# 14. Sort by missing count (descending) and display using display()
# 15. If no missing values, print success message
#
# Expected output: Comprehensive missing values report for both datasets

# Write your code below:
# print("MISSING VALUES ANALYSIS")
# print("="*50)
# ...
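
The summary DataFrame described in step 13 might be built along these lines. The helper name `missing_summary` is hypothetical and the toy clinical table is invented; the output column names follow the activity.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Per-column missing counts and percentages, limited to columns with gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing_Count": counts,
        "Missing_Percentage": counts / len(df) * 100,
    })
    summary = summary[summary["Missing_Count"] > 0]
    return summary.sort_values("Missing_Count", ascending=False)

# Toy clinical table with gaps in two of its three columns
toy_clinical = pd.DataFrame({
    "PATIENT_ID": ["MB-0001", "MB-0002", "MB-0003", "MB-0004"],
    "AGE_AT_DIAGNOSIS": [60.2, np.nan, 45.0, np.nan],
    "ER_IHC": ["Positive", "Negative", None, "Positive"],
})

summary = missing_summary(toy_clinical)
print(summary)
```

In a notebook, `display(summary)` would render the same table more readably than `print`.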

In [None]:
# 📝 ACTIVITY 10: Visualize Missing Values Pattern
#
# Your task: Create visualizations to understand missing data patterns
#
# TODO: Create conditional visualization:
# 1. Check if clinical_data has any missing values using .isnull().sum().sum() > 0
# 2. If there are missing values, create a figure with 2 subplots using plt.subplots(1, 2, figsize=(15, 6))
#
# TODO: Create missing values heatmap (left subplot):
# 3. Use sns.heatmap() to plot clinical_data.isnull()
# 4. Set parameters: cbar=True, cmap='viridis'
# 5. Set title: 'Missing Values Pattern in Clinical Data'
# 6. Set xlabel: 'Variables', ylabel: 'Patients'
#
# TODO: Create missing values bar plot (right subplot):
# 7. Get missing counts for variables with missing values only
# 8. Create bar plot using .plot(kind='bar')
# 9. Set title: 'Missing Values Count by Variable'
# 10. Set xlabel: 'Variables', ylabel: 'Missing Count'
# 11. Rotate x-axis labels by 45 degrees using tick_params(axis='x', rotation=45)
#
# TODO: Finalize the plot:
# 12. Use plt.tight_layout() to prevent overlap
# 13. Use plt.show() to display
# 14. If no missing values, print "No missing values to visualize."
#
# Expected output: Heatmap and bar chart showing missing value patterns (or message if none)

# Write your code below:
# if clinical_data.isnull().sum().sum() > 0:
#     fig, axes = plt.subplots(1, 2, figsize=(15, 6))
#     ...

### 3.3 Clinical Data Exploration

In [None]:
# 📝 ACTIVITY 11: Explore Clinical Data Structure
#
# Your task: Examine the structure and content of clinical data
#
# TODO: Display basic clinical data information:
# 1. Print "CLINICAL DATA EXPLORATION" header with "=" separators
# 2. Print number of patients using len(clinical_data) with comma formatting
# 3. Print number of variables using len(clinical_data.columns) with comma formatting
#
# TODO: Show column information systematically:
# 4. Print "Column names and data types:"
# 5. Loop through clinical_data.columns and clinical_data.dtypes using zip()
# 6. Print each column with index number, name (left-aligned in 30 characters), and data type
# 7. Use string formatting: f"{i+1:2d}. {col:<30} {str(dtype):<10}"
#
# TODO: Display data preview:
# 8. Print "First few rows:"
# 9. Use display(clinical_data.head()) to show first 5 rows in nice format
#
# Expected output: Systematic overview of clinical data structure, columns, and preview

# Write your code below:
# print("CLINICAL DATA EXPLORATION")
# print("="*50)
# ...

In [None]:
# 📝 ACTIVITY 12: Categorize and Summarize Clinical Variables
#
# Your task: Identify different types of clinical variables and provide summaries
#
# TODO: Categorize variables by data type:
# 1. Use clinical_data.select_dtypes(include=[np.number]) to get numerical variables
# 2. Use clinical_data.select_dtypes(include=['object', 'category']) to get categorical variables
# 3. Convert to lists using .columns.tolist()
#
# TODO: Display variable categories:
# 4. Print f"Numerical variables ({len(numerical_vars)}):" 
# 5. Loop through numerical_vars and print each with bullet point: f"  - {var}"
# 6. Print f"Categorical variables ({len(categorical_vars)}):"
# 7. Loop through categorical_vars and print each with unique count: f"  - {var} ({unique_count} unique values)"
# 8. Get unique count using clinical_data[var].nunique()
#
# TODO: Show numerical variables summary:
# 9. Check if len(numerical_vars) > 0
# 10. If yes, print "Numerical Variables Summary:"
# 11. Use display(clinical_data[numerical_vars].describe()) to show statistics
#
# Expected output: Categorized variable lists and statistical summary for numerical variables

# Write your code below:
# numerical_vars = clinical_data.select_dtypes(include=[np.number]).columns.tolist()
# categorical_vars = ...

## 💡 Coding Hints and Templates

Need help getting started? Here are some code templates and hints for the actual Metabric data:

### 📋 **Template: Clinical Data Loading (with comment headers)**
```python
# Clinical files have comment lines starting with #
try:
    file_path = os.path.join(DATA_PATH, 'data_clinical_patient.txt')
    if os.path.exists(file_path):
        data = pd.read_csv(file_path, sep='\t', comment='#')
        print(f"✓ Loaded clinical data: {data.shape}")
    else:
        raise FileNotFoundError(f"File not found: {file_path}")
except Exception as e:
    print(f"❌ Error: {e}")
```

### 📋 **Template: Expression Data Loading (direct format)**
```python
# Expression file is direct tab-separated, no comments
try:
    file_path = os.path.join(DATA_PATH, 'data_mrna_illumina_microarray.txt')
    if os.path.exists(file_path):
        # Use first column (Hugo_Symbol) as index
        data = pd.read_csv(file_path, sep='\t', index_col=0)
        # Optional: drop Entrez_Gene_Id column
        if 'Entrez_Gene_Id' in data.columns:
            data = data.drop('Entrez_Gene_Id', axis=1)
        print(f"✓ Loaded expression data: {data.shape}")
    else:
        raise FileNotFoundError(f"File not found: {file_path}")
except Exception as e:
    print(f"❌ Error: {e}")
```

### 📊 **Template: Missing Values Analysis**
```python
# Missing values template
missing_data = dataset.isnull().sum()
missing_vars = missing_data[missing_data > 0]
total_missing = dataset.isnull().sum().sum()
missing_pct = (total_missing / dataset.size) * 100
print(f"Missing values: {total_missing:,} ({missing_pct:.2f}%)")
```

### 📈 **Template: Basic Statistics**
```python
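# Statistics template
# Flatten the DataFrame first: data.mean() returns one value per column,
# which cannot be formatted with :.3f
values = data.to_numpy(dtype=float).ravel()
print(f"Mean: {np.nanmean(values):.3f}")
print(f"Std: {np.nanstd(values):.3f}")
print(f"Min: {np.nanmin(values):.3f}")
print(f"Max: {np.nanmax(values):.3f}")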
```

### 🎨 **Template: Data Visualization**
```python
# Visualization template (assumes `data` is a single numeric pandas Series)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Plot 1
data.hist(ax=axes[0])
axes[0].set_title('Distribution')
# Plot 2  
data.plot(kind='box', ax=axes[1])
axes[1].set_title('Box Plot')
plt.tight_layout()
plt.show()
```

### 🔍 **Key Metabric Data Info**
- **Sample IDs**: Follow MB-XXXX format (e.g., MB-0362, MB-0346)
- **Clinical Variables**: ER_IHC, HER2_SNP6, CLAUDIN_SUBTYPE, AGE_AT_DIAGNOSIS, OS_MONTHS, etc.
- **Expression Values**: Log2-transformed Illumina microarray data
- **Gene Names**: HGNC symbols (Hugo_Symbol column)
- **File Formats**: Clinical = tab + comments, Expression = tab only

### 🔍 **Common Methods to Remember**
- `df.shape` - Get dimensions
- `df.dtypes` - Get data types
- `df.isnull().sum()` - Count missing values
- `df.describe()` - Statistical summary
- `df.head()` - Show first few rows
- `df.nunique()` - Count unique values per column
- `len(df)` - Number of rows

## 🎯 Learning Assessment

### ✅ **Self-Check Questions**

After completing the activities, you should be able to answer:

1. **Data Loading & Structure**
   - How many genes and samples are in the expression dataset? (~20,603 genes × ~1,904 samples)
   - What data types are used for clinical variables?
   - How do you handle files with comment headers vs. direct data format?
   - What's the difference between Hugo_Symbol and Entrez_Gene_Id?

2. **Missing Values**
   - What percentage of values are missing in each dataset?
   - Which clinical variables have the most missing data?
   - Are there any missing values in the expression data?
   - How might missing values affect your downstream analysis?

3. **Data Quality**
   - What is the range and distribution of expression values (should be log2-transformed)?
   - Do any samples appear to be outliers based on their expression profiles?
   - Are there genes with very low variance that might not be informative?
   - How well do sample IDs match between clinical and expression data?

4. **Clinical Characteristics**
   - What are the key biomarker variables (ER_IHC, HER2_SNP6)?
   - How are patients distributed across different molecular subtypes (CLAUDIN_SUBTYPE)?
   - What is the age range and survival time distribution?
   - What percentage of patients are ER+, HER2+, etc.?

5. **Expression Data Characteristics**
   - What genes show the highest variance across samples?
   - How correlated are samples with each other?
   - Do expression values follow expected distributions for microarray data?

### 🏆 **Success Criteria**

You have successfully completed this notebook if you can:
- ✅ Load all three datasets without errors
- ✅ Handle both comment-header and direct tab-separated formats
- ✅ Calculate basic statistics for both expression and clinical data
- ✅ Identify and quantify missing values in each dataset
- ✅ Categorize clinical variables into numerical vs categorical
- ✅ Create meaningful visualizations of data distributions
- ✅ Interpret the results and identify potential data quality issues
- ✅ Match samples between expression and clinical datasets

### 🚀 **Extension Challenges** (Optional)

For advanced students:
1. **Cross-Dataset Analysis**:
   - Compare expression variance across different ER status groups
   - Analyze correlation between age and molecular subtypes
   - Identify genes that might be associated with survival time

2. **Advanced Visualizations**:
   - Create correlation matrices between clinical variables
   - Plot expression distributions by molecular subtype
   - Generate sample-sample correlation heatmaps

3. **Quality Control**:
   - Identify potential batch effects in expression data
   - Find outlier samples based on multiple criteria
   - Assess data completeness across different patient groups

4. **Statistical Analysis**:
   - Perform t-tests comparing ER+ vs ER- expression profiles
   - Calculate survival statistics by molecular subtype
   - Test for associations between clinical variables

5. **Data Integration**:
   - Create a master dataset combining clinical and expression data
   - Handle cases where sample IDs don't perfectly match
   - Prepare data for machine learning modeling
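
As a starting point for the data integration challenge, matching samples across the two tables could be sketched as follows. This assumes that the expression column names are the same identifiers as `PATIENT_ID` in the clinical table, which should be verified on the real files; the helper name and toy data are hypothetical.

```python
import pandas as pd

def align_expression_and_clinical(expression_df, clinical_df, id_col="PATIENT_ID"):
    """Keep only IDs present in both tables, in the expression column order."""
    clinical_ids = set(clinical_df[id_col].dropna())
    shared_ids = [sid for sid in expression_df.columns if sid in clinical_ids]
    expression_aligned = expression_df[shared_ids]
    clinical_aligned = (
        clinical_df.drop_duplicates(subset=id_col)  # one row per patient
        .set_index(id_col)
        .loc[shared_ids]
    )
    return expression_aligned, clinical_aligned

# Toy data: MB-0005 has no clinical record, MB-0003 has no expression profile
toy_expression = pd.DataFrame(
    {"MB-0001": [7.1, 9.4], "MB-0002": [6.8, 9.9], "MB-0005": [7.0, 9.1]},
    index=["GENE_A", "GENE_B"],
)
toy_clinical = pd.DataFrame({
    "PATIENT_ID": ["MB-0002", "MB-0001", "MB-0003"],
    "AGE_AT_DIAGNOSIS": [43.2, 75.7, 48.9],
})

expr_aligned, clin_aligned = align_expression_and_clinical(toy_expression, toy_clinical)
print(list(expr_aligned.columns))  # ['MB-0001', 'MB-0002']
print(list(clin_aligned.index))    # ['MB-0001', 'MB-0002']
```

Transposing `expr_aligned` then gives a samples-by-genes matrix whose rows line up with `clin_aligned`, which is the layout most machine learning libraries expect.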

### 3.4 Sample Types and Patient Characteristics

In [None]:
# 📝 ACTIVITY 13: Analyze Sample Types and Patient Characteristics
#
# Your task: Explore categorical variables and patient characteristics from the actual data
#
# TODO: Set up the analysis:
# 1. Print "SAMPLE TYPES AND PATIENT CHARACTERISTICS" with "=" separators
# 2. Remember the actual columns include: ER_IHC, HER2_SNP6, CLAUDIN_SUBTYPE, etc.
#
# TODO: Analyze categorical variables from the actual data:
# 3. Select first 5 categorical variables to avoid overwhelming output
# 4. For each categorical variable:
#    - Print the variable name
#    - Use .value_counts(dropna=False) to count occurrences including missing
#    - Print the value counts
# 5. Calculate and display percentages:
#    - Use .value_counts(normalize=True, dropna=False) * 100
#    - Loop through and print each value with percentage formatted to 1 decimal
#
# TODO: Focus on key clinical variables (if present):
# 6. Look specifically for these important variables:
#    - ER_IHC (Estrogen Receptor status)
#    - HER2_SNP6 (HER2 status)  
#    - CLAUDIN_SUBTYPE (molecular subtype)
#    - INFERRED_MENOPAUSAL_STATE
#    - VITAL_STATUS
# 7. These tell us about patient characteristics and cancer types
#
# TODO: Interpret the results:
# 8. Comment on what the distributions tell us about the patient cohort
# 9. Note any imbalanced categories that might affect analysis
#
# Expected output: Distribution analysis of key categorical clinical variables

# Write your code below:
# print("SAMPLE TYPES AND PATIENT CHARACTERISTICS")
# print("="*50)
# ...
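
The counting and percentage steps for a single categorical variable might look like the sketch below. The toy `ER_IHC` values are invented for illustration; only the `value_counts` calls come from the activity.

```python
import pandas as pd

# Toy stand-in for one categorical clinical column, including a missing value
er_status = pd.Series(
    ["Positive", "Positive", "Positive", "Negative", None], name="ER_IHC"
)

# dropna=False keeps missing values as their own category
counts = er_status.value_counts(dropna=False)
percentages = er_status.value_counts(normalize=True, dropna=False) * 100

print(f"{er_status.name}:")
print(counts)
for value, pct in percentages.items():
    print(f"  {value}: {pct:.1f}%")
```

Keeping the missing category visible makes it obvious when a variable that looks balanced is actually mostly unrecorded.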

In [None]:
# 📝 ACTIVITY 14: Create Clinical Characteristics Visualizations
#
# Your task: Create comprehensive visualizations of clinical data distributions
#
# TODO: Set up the visualization framework:
# 1. Create a figure with subplots: plt.subplots(2, 3, figsize=(18, 12))
# 2. Use axes.ravel() to flatten the axes array for easy indexing
# 3. Initialize plot_idx = 0 to track which subplot you're using
#
# TODO: Create pie charts for categorical variables:
# 4. Select categorical variables with ≤ 10 unique values (to fit in pie charts)
# 5. For the first 4 suitable categorical variables:
#    - Get value counts using .value_counts()
#    - Create pie chart with axes[plot_idx].pie()
#    - Set parameters: labels=value_counts.index, autopct='%1.1f%%', startangle=90
#    - Set title: f'Distribution of {var}'
#    - Increment plot_idx
#
# TODO: Create histograms for numerical variables:
# 6. Select the first 2 numerical variables from your numerical_vars list
# 7. For each numerical variable:
#    - Create histogram using .hist(bins=30, ax=axes[plot_idx], alpha=0.7)
#    - Set title, xlabel, and ylabel appropriately
#    - Increment plot_idx
#
# TODO: Finalize the visualization:
# 8. Hide any unused subplots using a loop and axes[i].set_visible(False)
# 9. Use plt.tight_layout() to prevent overlapping
# 10. Display with plt.show()
#
# Expected output: Professional multi-panel visualization showing distributions of key variables

# Write your code below:
# fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# axes = axes.ravel()
# ...

### 3.5 Expression Data Distribution Analysis

In [None]:
# 📝 ACTIVITY 15: Analyze Expression Data Distribution (if available)
#
# Your task: Create comprehensive analysis of gene expression distributions
# Note: This activity only applies if expression data was successfully loaded
#
# TODO: Check if expression data is available:
# 1. Use a conditional: if 'expression_data' in locals() and expression_data is not None:
# 2. Print "EXPRESSION DATA DISTRIBUTION ANALYSIS" header with separators
# 3. If no expression data, print message about needing to download the data file
#
# TODO: Prepare data for visualization:
# 4. Sample a subset of genes to avoid overcrowding: min(100, expression_data.shape[0])
# 5. Use np.random.choice() to randomly select genes for visualization
# 6. Create sample_expression = expression_data.loc[sample_genes]
#
# TODO: Create a 2x2 subplot framework:
# 7. Use plt.subplots(2, 2, figsize=(15, 10))
#
# TODO: Plot 1 - Overall expression distribution:
# 8. Flatten all expression values: expression_data.values.flatten()
# 9. Create histogram with bins=50, alpha=0.7, density=True
# 10. Add title, labels, and a vertical line at the mean using axvline (the mean is near 0 only for Z-scored data)
# 11. Add legend
#
# TODO: Plot 2 - Sample-wise distribution (boxplot):
# 12. Select the first 10 samples: expression_data.iloc[:, :10]
# 13. Create a boxplot of those 10 samples with appropriate labels
# 14. Rotate x-axis labels by 45 degrees
#
# TODO: Plot 3 - Gene variance distribution:
# 15. Calculate gene variances: expression_data.var(axis=1)
# 16. Create histogram of variances
# 17. Add vertical line at median variance
#
# TODO: Plot 4 - Sample means distribution:
# 18. Calculate sample means: expression_data.mean(axis=0)
# 19. Create histogram of sample means
# 20. Use plt.tight_layout() and plt.show()
#
# TODO: Print summary statistics:
# 21. Calculate and print overall statistics
# 22. Count low and high variance genes
# 23. Interpret what these distributions tell us about data quality
#
# Expected output: Multi-panel visualization of expression data characteristics (if data available)

# Write your code below:
# if 'expression_data' in locals() and expression_data is not None:
#     print("EXPRESSION DATA DISTRIBUTION ANALYSIS")
#     ...
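
The non-plotting parts of this activity (gene subsampling, per-gene variance, per-sample means) can be tried on synthetic data first. The sketch below uses a made-up 200 x 30 matrix in place of `expression_data`; the 0.01 cut-off comes from the quality checks later in the notebook, and the 90th-percentile cut-off for "high variance" is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # same seed the setup cell asks for

# Toy stand-in for expression_data: 200 genes x 30 samples (values are made up)
toy_expression = pd.DataFrame(
    np.random.normal(loc=8.0, scale=1.0, size=(200, 30)),
    index=[f"GENE_{i}" for i in range(200)],
    columns=[f"MB-{i:04d}" for i in range(30)],
)
toy_expression.iloc[0] = 7.5  # force one constant gene so the low-variance count is non-zero

# Random gene subset; replace=False keeps every selected gene unique
n_genes = min(100, toy_expression.shape[0])
sample_genes = np.random.choice(
    toy_expression.index.to_numpy(), size=n_genes, replace=False
)
sample_expression = toy_expression.loc[sample_genes]

gene_variances = toy_expression.var(axis=1)   # one value per gene
sample_means = toy_expression.mean(axis=0)    # one value per sample

low_variance_count = int((gene_variances < 0.01).sum())
high_variance_count = int((gene_variances > gene_variances.quantile(0.9)).sum())
print(f"Low-variance genes: {low_variance_count}")
print(f"High-variance genes (top 10%): {high_variance_count}")
```

Note that `np.random.choice` samples with replacement by default, so `replace=False` is needed to avoid plotting the same gene twice.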

### 3.6 Sample-Sample Correlation Analysis

In [None]:
# 📝 ACTIVITY 16: Sample-Sample Correlation Analysis (if expression data available)
#
# Your task: Analyze correlations between samples to assess data quality
# Note: This requires expression data to be loaded
#
# TODO: Check for expression data availability:
# 1. Use conditional: if 'expression_data' in locals() and expression_data is not None:
# 2. Print "SAMPLE-SAMPLE CORRELATION ANALYSIS" header
# 3. If no expression data, print guidance about downloading the file
#
# TODO: Prepare correlation analysis:
# 4. For computational efficiency, limit to subset of samples: min(50, expression_data.shape[1])
# 5. Select sample subset: expression_data.iloc[:, :n_samples_to_correlate]
# 6. Print message about how many samples will be analyzed
#
# TODO: Calculate correlation matrix:
# 7. Use sample_subset.corr() to calculate pairwise correlations
# 8. This tells us how similar expression profiles are between samples
#
# TODO: Create correlation heatmap:
# 9. Create figure with plt.figure(figsize=(12, 10))
# 10. Create upper triangle mask: np.triu(np.ones_like(sample_corr, dtype=bool))
# 11. Use sns.heatmap() with parameters:
#     - mask=mask, cmap='coolwarm', center=0
#     - square=True, linewidths=0.5, cbar_kws={"shrink": .8}
# 12. Set appropriate title and use plt.tight_layout(), plt.show()
#
# TODO: Calculate correlation statistics:
# 13. Extract the upper triangle without the diagonal: sample_corr.where(np.triu(np.ones_like(sample_corr, dtype=bool), k=1))
# 14. Stack to get 1D array of correlations
# 15. Calculate and print: mean, std, min, max, median correlations
#
# TODO: Identify highly correlated samples:
# 16. Set threshold (e.g., 0.95) for high correlation
# 17. Find sample pairs with correlation above threshold
# 18. Print count and top 5 most correlated pairs if any exist
#
# TODO: Interpret results:
# 19. High correlations might indicate duplicates or very similar samples
# 20. Very low correlations might indicate quality issues
#
# Expected output: Correlation heatmap and statistics (if expression data available)

# Write your code below:
# if 'expression_data' in locals() and expression_data is not None:
#     print("SAMPLE-SAMPLE CORRELATION ANALYSIS")
#     ...
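
The correlation statistics and high-correlation search described in Activity 16 can be sketched on synthetic data. This is a minimal illustration, not the reference solution: `toy_expression`, its dimensions, and the planted near-duplicate sample are assumptions made up for the example, and the heatmap step is left out so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for expression_data (assumption): rows are genes, columns are samples.
rng = np.random.default_rng(42)
n_genes, n_samples = 200, 6
gene_profile = rng.normal(loc=8.0, scale=2.0, size=(n_genes, 1))  # signal shared by all samples
noise = rng.normal(scale=1.0, size=(n_genes, n_samples))          # sample-specific noise
toy_expression = pd.DataFrame(
    gene_profile + noise,
    index=[f"gene_{i}" for i in range(n_genes)],
    columns=[f"sample_{j}" for j in range(n_samples)],
)
# Plant a near-duplicate: sample_5 is sample_0 plus a tiny perturbation.
toy_expression["sample_5"] = toy_expression["sample_0"] + rng.normal(scale=0.01, size=n_genes)

# DataFrame.corr() correlates columns, so this is a sample-by-sample matrix.
sample_corr = toy_expression.corr()

# Strict upper triangle (k=1 drops the diagonal of 1.0s) keeps each pair exactly once.
upper = np.triu(np.ones(sample_corr.shape, dtype=bool), k=1)
pairwise = sample_corr.where(upper).stack().dropna()  # Series indexed by (sample_a, sample_b)

print(pairwise.agg(["mean", "std", "min", "max", "median"]))

threshold = 0.95
high_pairs = pairwise[pairwise > threshold].sort_values(ascending=False)
print(f"Pairs above {threshold}: {len(high_pairs)}")
print(high_pairs.head(5))
```

With real data, replace `toy_expression` with the sample subset selected in step 5. A pair flagged here is only a candidate duplicate and should be checked against the sample metadata before anything is removed.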

### 3.7 Data Quality Assessment

In [None]:
# 📝 ACTIVITY 17: Comprehensive Data Quality Assessment
#
# Your task: Perform systematic quality checks on both expression and clinical data
#
# TODO: Set up quality assessment framework:
# 1. Print "DATA QUALITY ASSESSMENT" header with separators
# 2. Create separate sections for expression and clinical data
#
# TODO: Expression data quality checks (if available):
# 3. Check if expression data exists: if 'expression_data' in locals() and expression_data is not None:
# 4. Print "Expression Data Quality:" with separator line
# 5. Check for constant genes (genes with no variation): (expression_data.std(axis=1) == 0).sum()
# 6. Check for low variance genes: (expression_data.var(axis=1) < 0.01).sum()
# 7. Identify outlier samples using 3-sigma rule:
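#    - Calculate per-sample means and standard deviations (axis=0)
#    - Flag samples where |sample_mean - overall_mean| > 3 * overall_std,
#      with overall_mean and overall_std computed across the per-sample means
#    - Do the same for the per-sample standard deviations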
# 8. Print counts of problematic genes and samples
#
# TODO: Clinical data quality checks:
# 9. Print "Clinical Data Quality:" section
# 10. Check for duplicate patient records:
#     - If 'PATIENT_ID' in columns, use .duplicated() on that column
#     - Otherwise check for completely duplicate rows
# 11. Print duplicate count
#
# TODO: Check for impossible/invalid values:
# 12. Create quality_issues list to collect problems
# 13. For each numerical variable in clinical data:
#     - Age variables: check for values < 0 or > 120
#     - Size/measurement variables: check for negative values
#     - Time variables (months/days): check for negative values
# 14. Collect and print all quality issues found
#
# TODO: Sample matching between datasets:
# 15. Create sets of sample IDs from expression data (columns) and clinical data
# 16. Find intersection (common samples) and differences
# 17. Calculate overlap percentage
# 18. Print matching statistics and any warnings about low overlap
#
# TODO: Summarize quality assessment:
# 19. Provide overall quality summary
# 20. Recommend next steps based on issues found
#
# Expected output: Comprehensive quality report with actionable recommendations

# Write your code below:
# print("DATA QUALITY ASSESSMENT")
# print("="*50)
# ...
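
The individual checks listed in Activity 17 can be tried out on toy data before running them on the real tables. The sketch below is one possible reading of the steps, under stated assumptions: `toy_expression`, `toy_clinical`, and the column name `AGE_AT_DIAGNOSIS` are hypothetical, and the 3-sigma rule is applied to the distribution of per-sample means.

```python
import numpy as np
import pandas as pd

# Toy expression matrix (assumption): 100 genes x 30 samples.
rng = np.random.default_rng(0)
toy_expression = pd.DataFrame(
    rng.normal(loc=8.0, scale=1.0, size=(100, 30)),
    index=[f"gene_{i}" for i in range(100)],
    columns=[f"S{j:02d}" for j in range(30)],
)
toy_expression["S29"] += 4.0        # one sample shifted upward (planted outlier)
toy_expression.loc["gene_0"] = 5.0  # one gene with no variation at all

# Toy clinical table (assumption): one repeated ID, one impossible age, one unmatched ID.
toy_clinical = pd.DataFrame({
    "PATIENT_ID": ["S00", "S01", "S01", "S02", "X99"],
    "AGE_AT_DIAGNOSIS": [54.0, 61.5, 61.5, -3.0, 47.2],
})

# Constant genes: zero standard deviation across samples.
constant_genes = int((toy_expression.std(axis=1) == 0).sum())

# 3-sigma rule on per-sample means.
sample_means = toy_expression.mean(axis=0)
z_scores = (sample_means - sample_means.mean()).abs() / sample_means.std()
outlier_samples = sample_means.index[z_scores > 3].tolist()

# Duplicate patient records and out-of-range ages.
n_duplicates = int(toy_clinical["PATIENT_ID"].duplicated().sum())
age = toy_clinical["AGE_AT_DIAGNOSIS"]
n_bad_age = int(((age < 0) | (age > 120)).sum())

# Sample matching between the two tables.
expr_ids = set(toy_expression.columns)
clin_ids = set(toy_clinical["PATIENT_ID"])
common = expr_ids & clin_ids
overlap_pct = 100 * len(common) / len(expr_ids)

print(f"Constant genes: {constant_genes}")
print(f"Outlier samples: {outlier_samples}")
print(f"Duplicate patient IDs: {n_duplicates}")
print(f"Ages outside [0, 120]: {n_bad_age}")
print(f"Overlap: {len(common)} samples ({overlap_pct:.1f}% of expression samples)")
print(f"Only in clinical data: {sorted(clin_ids - expr_ids)}")
```

A single extreme sample inflates the standard deviation it is judged against, so with few samples the 3-sigma rule can miss outliers. A median-based rule is a common alternative, though it is not part of this activity.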

## 4. Summary and Key Findings

Let's summarize the key findings from our data exploration.

In [None]:
# 📝 ACTIVITY 18: Save Exploration Results (Optional)
#
# Your task: Save exploration results for future reference and analysis
# Note: This is an optional activity for organizing your findings
#
# TODO: Create results directory structure:
# 1. Use os.makedirs('../results/exploration', exist_ok=True) to create directory
# 2. This will store your exploration outputs for later use
#
# TODO: Create and save exploration summary dictionary:
# 3. Create exploration_summary dictionary with sections:
#    - dataset_info: shapes and missing value percentages
#    - expression_stats: basic statistics (if expression data available)  
#    - clinical_variables: lists of numerical and categorical variables
# 4. Handle cases where expression_data might not be available
#
# TODO: Save gene statistics (if expression data available):
# 5. Create DataFrame with gene-level statistics:
#    - gene names, mean, std, variance, min, max for each gene
# 6. Save as CSV: '../results/exploration/gene_statistics.csv'
# 7. Use .to_csv(filename, index=False)
#
# TODO: Save sample statistics (if expression data available):
# 8. Create DataFrame with sample-level statistics:
#    - sample names, mean, std, median, quartiles for each sample
# 9. Save as CSV: '../results/exploration/sample_statistics.csv'
#
# TODO: Save clinical data summary:
# 10. Create dictionary with value counts for each categorical variable
# 11. Use json.dump() to save as '../results/exploration/clinical_summary.json'
# 12. Handle potential JSON serialization issues with default=str
#
# TODO: Print confirmation messages:
# 13. List all files that were successfully saved
# 14. Provide guidance on what to do next
# 15. Mention the next notebook in the sequence
#
# Expected output: Saved files and confirmation messages

# Write your code below:
# import os
# os.makedirs('../results/exploration', exist_ok=True)
# ...
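
Step 12 mentions JSON serialization issues. A small sketch of what can go wrong and what `default=str` does about it is shown here. The dictionary contents are made up for illustration, and `to_builtin` is a hypothetical helper, not something the notebook defines.

```python
import json

import numpy as np

# NumPy integers are not Python ints, so the json module rejects them by default.
summary = {"n_genes": np.int64(3), "shape": (3, 4)}

try:
    json.dumps(summary)
    raised = False
except TypeError:
    raised = True
print("Plain json.dumps raised TypeError:", raised)

# default=str is called for every object json cannot handle; the value becomes a string.
as_text = json.dumps(summary, default=str)
print(as_text)


def to_builtin(obj):
    """Hypothetical helper: convert NumPy scalars and arrays to plain Python objects."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return str(obj)


# With the helper, numbers stay numbers in the output file.
as_numbers = json.dumps(summary, default=to_builtin)
print(as_numbers)
```

`default=str` is the simplest fix and is what this activity asks for. A converter such as `to_builtin` is worth considering only if the saved numbers need to be read back as numbers later.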

## 5. Save Exploration Results

Let's save some key exploration results for future reference.

In [None]:
# Create results directory if it doesn't exist
os.makedirs('../results/exploration', exist_ok=True)

# Save data summaries
exploration_summary = {
    'dataset_info': {
        'expression_shape': expression_data.shape,
        'clinical_shape': clinical_data.shape,
        'expression_missing_pct': (expression_data.isnull().sum().sum() / expression_data.size) * 100,
        'clinical_missing_pct': (clinical_data.isnull().sum().sum() / clinical_data.size) * 100
    },
    'expression_stats': {
        # NaN-aware reductions, so missing values do not turn every statistic into nan
        'mean': float(np.nanmean(expression_data.values)),
        'std': float(np.nanstd(expression_data.values)),
        'min': float(np.nanmin(expression_data.values)),
        'max': float(np.nanmax(expression_data.values)),
        'low_variance_genes': int((expression_data.var(axis=1) < 0.1).sum())
    },
    'clinical_variables': {
        'numerical': numerical_vars,
        'categorical': categorical_vars
    }
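}

# Write the summary dictionary to disk (the file name is an assumption)
import json
with open('../results/exploration/exploration_summary.json', 'w') as f:
    json.dump(exploration_summary, f, indent=2, default=str)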

# Save gene variance information
gene_stats = pd.DataFrame({
    'gene': expression_data.index,
    'mean': expression_data.mean(axis=1),
    'std': expression_data.std(axis=1),
    'variance': expression_data.var(axis=1),
    'min': expression_data.min(axis=1),
    'max': expression_data.max(axis=1)
})

gene_stats.to_csv('../results/exploration/gene_statistics.csv', index=False)

# Save sample statistics
sample_stats = pd.DataFrame({
    'sample': expression_data.columns,
    'mean': expression_data.mean(axis=0),
    'std': expression_data.std(axis=0),
    'median': expression_data.median(axis=0),
    'q25': expression_data.quantile(0.25, axis=0),
    'q75': expression_data.quantile(0.75, axis=0)
})

sample_stats.to_csv('../results/exploration/sample_statistics.csv', index=False)

# Save clinical data summary
if len(categorical_vars) > 0:
    clinical_summary = {}
    for var in categorical_vars:
        if var in clinical_data.columns:
            clinical_summary[var] = clinical_data[var].value_counts().to_dict()
    
    import json
    with open('../results/exploration/clinical_summary.json', 'w') as f:
        json.dump(clinical_summary, f, indent=2, default=str)

print("Exploration results saved to:")
print("  • ../results/exploration/gene_statistics.csv")
print("  • ../results/exploration/sample_statistics.csv")
print("  • ../results/exploration/clinical_summary.json")

print("\n📝 Ready for the next notebook: 02_data_preprocessing.ipynb")

---

## 📚 Learning Summary

In this notebook, you have learned:

### ✅ **Completed Tasks:**
1. **Data Loading**: Successfully loaded transcriptomics and clinical data
2. **Missing Values Analysis**: Identified and quantified missing data patterns
3. **Clinical Data Exploration**: Analyzed patient characteristics and sample types
4. **Expression Data Analysis**: Examined gene expression distributions and quality
5. **Data Quality Assessment**: Identified potential issues and outliers
6. **Sample Matching**: Evaluated overlap between datasets

### 🎯 **Key Insights:**
- Understanding the structure and characteristics of genomics datasets
- Importance of data quality assessment in bioinformatics
- Common issues in transcriptomics data (missing values, low variance genes, outliers)
- Clinical variables that may be important for risk classification

### 🔄 **Next Steps:**
In the next notebook (`02_data_preprocessing.ipynb`), we will:
1. Handle missing values and outliers
2. Normalize gene expression data
3. Filter low-quality genes and samples
4. Create train/validation/test splits
5. Prepare data for machine learning modeling

---

**Great job completing the data exploration phase! 🎉**