# Working with GEO Matrix Files

This notebook demonstrates how to use pysradb to download and process GEO Matrix files, which contain processed expression data from NCBI's Gene Expression Omnibus (GEO).

## What are GEO Matrix Files?

GEO Matrix files are tab-delimited text files that contain processed expression data from microarray or sequencing experiments. They typically have the following structure:

1. Metadata lines that start with the `!` character
2. A header line with sample identifiers
3. Data rows with gene/probe identifiers and expression values
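
To make that layout concrete, here is a minimal sketch that splits a tiny hand-written sample into metadata and a data table using only pandas. The sample text, the placeholder accession, and the helper name `split_series_matrix` are made up for illustration; this is not how `GEOMatrix` is implemented. The `!series_matrix_table_begin` and `!series_matrix_table_end` markers are the ones GEO uses to delimit the table in series matrix files.

```python
import io

import pandas as pd

# Tiny hand-written sample in the series matrix layout (all values are made up).
SAMPLE = "\n".join([
    '!Series_title\t"Example series"',
    '!Series_geo_accession\t"GSE000000"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"',
    '"probe_1"\t1.5\t2.5',
    '"probe_2"\t3.0\t4.0',
    "!series_matrix_table_end",
])


def split_series_matrix(text):
    """Split series-matrix text into a metadata dict and an expression DataFrame."""
    metadata = {}
    table_lines = []
    for line in text.splitlines():
        if line.startswith("!"):
            # Metadata line: a key followed by tab-separated, quoted values.
            key, _, value = line[1:].partition("\t")
            if value:
                values = [v.strip('"') for v in value.split("\t")]
                metadata.setdefault(key, []).extend(values)
        elif line.strip():
            # Anything else is the header row or a data row.
            table_lines.append(line)
    data = pd.read_csv(io.StringIO("\n".join(table_lines)), sep="\t", index_col=0)
    return metadata, data


metadata, data = split_series_matrix(SAMPLE)
print(metadata)
print(data)
```

Keys that repeat across lines (as sample-level fields often do) are collected into a single list here; a real parser may prefer to keep each occurrence separate.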

## Using the GEOMatrix Class

First, let's import the necessary modules:

In [None]:
import os
import pandas as pd
from pysradb.geomatrix import GEOMatrix

### Initialize with a GEO Accession

Let's initialize the GEOMatrix class with a GEO accession number. We'll use GSE234190 as an example.

In [None]:
matrix = GEOMatrix("GSE234190")

### Get Matrix Links

Let's check which matrix files are available for this GEO accession:

In [None]:
links, url = matrix.get_matrix_links()
print(f"Matrix URL: {url}")
print(f"Available matrix files: {links}")
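
For orientation, GEO groups series directories on its public file server by replacing the last three digits of the accession with `nnn`. The sketch below builds the matrix directory URL by hand from that convention. The helper name `matrix_dir_url` is hypothetical, and whether `get_matrix_links()` derives its URL the same way is an assumption.

```python
import re

GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def matrix_dir_url(accession):
    """Build the GEO matrix directory URL for a GSE accession (illustrative helper)."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"Not a GSE accession: {accession!r}")
    digits = match.group(1)
    # Replace the last three digits with 'nnn', e.g. GSE234190 -> GSE234nnn.
    # Accessions with three or fewer digits fall under the 'GSEnnn' directory.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/{accession}/matrix/"


print(matrix_dir_url("GSE234190"))
```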

### Download Matrix Files

Now, let's download the matrix files:

In [None]:
out_dir = "./geo_matrix_files"
downloaded_files = matrix.download_matrix(out_dir=out_dir)
print(f"Downloaded files: {downloaded_files}")
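
GEO usually serves series matrix files gzip-compressed (`*_series_matrix.txt.gz`). If you want to look at the first lines of a downloaded file before parsing it, a small helper like the one below may be handy. It is a sketch that assumes you have a local file path; the helper name `peek_matrix` is made up, and the demo writes its own temporary file rather than relying on the download above.

```python
import gzip
import os
import tempfile


def peek_matrix(path, n_lines=3):
    """Return up to n_lines from a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # zip stops as soon as range(n_lines) is exhausted or the file ends.
        return [line.rstrip("\n") for _, line in zip(range(n_lines), handle)]


# Demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_series_matrix.txt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write('!Series_title\t"Example series"\n')
        handle.write("!series_matrix_table_begin\n")
    first_lines = peek_matrix(demo_path)

print(first_lines)
```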

### Parse Matrix Files

Now, let's parse the downloaded matrix file to extract metadata and data:

In [None]:
metadata, data = matrix.parse_matrix()

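# Print some metadata
print("Matrix file metadata (first 5 entries):")
for key, value in list(metadata.items())[:5]:
    print(f"{key}: {value}")
# Only report a remainder when there are more than 5 entries
remaining = len(metadata) - 5
if remaining > 0:
    print(f"... and {remaining} more metadata entries")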

# Print data summary
print(f"\nMatrix file data shape: {data.shape}")
print("First few rows and columns:")
print(data.iloc[:5, :5])

### Convert to DataFrame

You can also directly get the data as a pandas DataFrame:

In [None]:
df = matrix.to_dataframe()
print(f"DataFrame shape: {df.shape}")
print(f"First DataFrame columns: {list(df.columns[:5])}")
print(f"First DataFrame index labels: {list(df.index[:5])}")

### Convert to TSV

Finally, let's convert the matrix file to a clean TSV format:

In [None]:
output_file = os.path.join(out_dir, "GSE234190_matrix.tsv")
matrix.to_tsv(output_file)
print(f"Matrix file converted to TSV: {output_file}")

# Verify the TSV file
tsv_df = pd.read_csv(output_file, sep='\t', index_col=0)
print(f"TSV DataFrame shape: {tsv_df.shape}")
print("First few rows and columns:")
print(tsv_df.iloc[:5, :5])

## Using the Command Line Interface

pysradb also provides a command-line interface for working with GEO Matrix files. Here are some examples:

### Download GEO Matrix Files

```bash
pysradb geo-matrix --accession GSE234190
```

### Download and Convert to TSV

```bash
pysradb geo-matrix --accession GSE234190 --to-tsv
```

### Specify Output Directory

```bash
pysradb geo-matrix --accession GSE234190 --out-dir ./my_data
```

### Specify Output TSV File

```bash
pysradb geo-matrix --accession GSE234190 --to-tsv --output-file ./my_data/expression_data.tsv
```

### Download Only Matrix Files with the Existing `download` Command

```bash
pysradb download --geo GSE234190 --matrix-only
```

## Working with Expression Data

Once you have the expression data as a pandas DataFrame, you can perform various analyses:

In [None]:
# Basic statistics
print(f"Mean expression per sample:\n{df.mean().head()}")
print(f"\nMedian expression per sample:\n{df.median().head()}")

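# Filter for specific genes (only meaningful if the index holds gene symbols;
# series matrix files are often indexed by probe ID instead)
brca1_mask = df.index.astype(str).str.contains('BRCA1')
if brca1_mask.any():
    brca1_genes = df[brca1_mask]
    print(f"\nBRCA1 gene expression:\n{brca1_genes}")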

# Visualize distribution of expression values
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df.iloc[:100, :5])  # Plot first 100 genes, first 5 samples
    plt.title('Expression Distribution (First 100 genes, First 5 samples)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
except ImportError:
    print("Matplotlib and/or Seaborn not available for visualization")
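
Another common first step is to pick out the most variable rows. The sketch below does this on a small synthetic DataFrame so that it runs without a download; the toy values, the `log2(x + 1)` transform, and the choice of three rows are illustrative assumptions. Whether the values in a real matrix file are already log-transformed depends on how the submitter processed them, so check the metadata before applying a transform.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an expression matrix (rows: genes, columns: samples).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.uniform(1, 1000, size=(6, 4)),
    index=[f"gene_{i}" for i in range(6)],
    columns=[f"sample_{j}" for j in range(4)],
)

# log2(x + 1) compresses the dynamic range; skip this for already-logged data.
log_expr = np.log2(toy + 1)

# Rank rows by variance across samples and keep the three most variable.
row_variance = log_expr.var(axis=1)
top_genes = log_expr.loc[row_variance.nlargest(3).index]
print(top_genes)
```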

## Conclusion

pysradb provides a convenient way to download and process GEO Matrix files, making it easier to work with processed expression data from GEO. The GEOMatrix class offers methods for downloading, parsing, and converting matrix files, while the command-line interface provides a simple way to perform these operations from the terminal.