# Analysis

**Hypothesis**: Monocytes in severe COVID‐19 patients exhibit increased transcriptional heterogeneity in their global gene expression profiles compared to healthy individuals. This increased cell‐to‐cell variability (as measured by the coefficient of variation) may indicate a dysregulated inflammatory response and impaired coordination during antigen presentation, potentially contributing to disease immunopathology.

In [None]:
import scanpy as sc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

# Set up visualization defaults for better plots
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
# Scanpy exposes figure defaults through set_figure_params rather than plain attributes
sc.settings.set_figure_params(dpi=100, facecolor='white', figsize=(8, 8))
warnings.filterwarnings('ignore')

# Set Matplotlib and Seaborn styles for better visualization
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['savefig.dpi'] = 150
sns.set_style('whitegrid')
sns.set_context('notebook', font_scale=1.2)

# Load data
print("Loading data...")
adata = sc.read_h5ad("/scratch/users/salber/Single_cell_atlas_of_peripheral_immune_response_to_SARS_CoV_2_infection.h5ad")
print(f"Data loaded: {adata.shape[0]} cells and {adata.shape[1]} genes")


# Analysis Plan

## Steps:
1. Decode byte strings in the AnnData metadata (e.g., 'cell_type_coarse', 'Status', and other relevant fields) to standard UTF-8 strings, which simplifies downstream filtering and plotting.
2. Extract monocytes by selecting cells whose decoded 'cell_type_coarse' label contains 'Monocyte' (covering both CD14 and CD16 monocytes).
3. Check whether the expression data is already normalized. If the median expression per cell is high (e.g., median > 50), assume the matrix holds raw counts and apply a log1p transformation to reduce skew before computing the coefficient of variation (CV).
4. Compute a cell-level CV for each monocyte as the standard deviation divided by the mean of its expression values across all genes, adding a small constant (epsilon = 1e-8) to the denominator to avoid division by zero. The constant is small enough not to distort the CV while ensuring numerical stability.
5. Visualize the CV distribution in monocytes stratified by disease status (COVID vs. Healthy) with a violin plot, using the decoded 'Status' labels so that byte-string prefixes such as b'...' do not appear in the group names.
6. Compare the CV values of COVID and Healthy monocytes with a two-sided Mann-Whitney U test and print the test statistic and p-value. Downstream analyses (e.g., correlation with clinical metadata) should account for potential confounders such as cell quality metrics.
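
Steps 1 and 3 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: `decode_obs_strings` and `looks_like_raw_counts` are hypothetical helper names, and the raw-count check swaps the median threshold from Step 3 for a simpler integer test (raw counts are non-negative whole numbers), because the per-cell median of sparse single-cell data is often zero whether or not the matrix is normalized.

```python
import numpy as np
import pandas as pd


def decode_obs_strings(obs: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of `obs` with the given columns converted to plain str labels."""
    out = obs.copy()
    for col in columns:
        # Decode real bytes objects; everything else is passed through str().
        values = out[col].astype(object).map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else str(v)
        )
        # Also strip a literal b'...' wrapper left behind by str() on bytes.
        out[col] = values.str.replace(r"^b'(.*)'$", r"\1", regex=True)
    return out


def looks_like_raw_counts(X) -> bool:
    """Heuristic (assumption): raw counts are non-negative integers."""
    X = np.asarray(X)
    return bool(np.all(X >= 0) and np.all(np.mod(X, 1) == 0))


# Toy metadata mixing bytes, stringified bytes, and plain strings.
obs = pd.DataFrame({"Status": [b"COVID", "b'Healthy'", "COVID"]})
decoded = decode_obs_strings(obs, ["Status"])
print(decoded["Status"].tolist())

# Toy expression matrix (cells x genes) holding raw counts.
raw = np.array([[0, 3, 120], [1, 0, 57]])
X = np.log1p(raw) if looks_like_raw_counts(raw) else raw
print(looks_like_raw_counts(raw), looks_like_raw_counts(X))
```

On the real object, the same helpers would be applied to `adata.obs` and a dense copy of `adata.X`; `sc.pp.log1p(adata)` is the Scanpy counterpart of the `np.log1p` call used here.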


In [None]:
import scanpy as sc
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu

# Filter the AnnData object to obtain monocyte populations using 'cell_type_coarse'
monocyte_mask = adata.obs['cell_type_coarse'].astype(str).str.contains('Monocyte')
adata_monocytes = adata[monocyte_mask].copy()

# Compute the cell-level coefficient of variation (CV) across all genes
# Ensure that the expression matrix is in dense format
if not isinstance(adata_monocytes.X, np.ndarray):
    X = adata_monocytes.X.toarray()
else:
    X = adata_monocytes.X

# Calculate mean and standard deviation for each cell (rows of X)
cell_means = np.mean(X, axis=1)
cell_stds = np.std(X, axis=1)

# Add a small constant epsilon to avoid division by zero
epsilon = 1e-8
cell_cv = cell_stds / (cell_means + epsilon)

# Save the CV values back to the AnnData object
adata_monocytes.obs['CV'] = cell_cv

# Decode 'Status' once so the plot and the test use the same clean labels.
# Handles real bytes values as well as stringified bytes such as "b'COVID'".
status = (
    adata_monocytes.obs['Status']
    .astype(object)
    .map(lambda v: v.decode('utf-8') if isinstance(v, bytes) else str(v))
    .str.replace(r"^b'(.*)'$", r"\1", regex=True)
)

# Visualize the distribution of CV by disease status (COVID vs Healthy) using a violin plot
plt.figure(figsize=(8, 6))
ax = sns.violinplot(x=status, y=adata_monocytes.obs['CV'])
ax.set_title('Transcriptional Variability (CV) in Monocytes by Disease Status')
ax.set_xlabel('Disease Status')
ax.set_ylabel('Coefficient of Variation (CV)')
plt.tight_layout()
plt.show()

# Statistical testing: compare CV between COVID and Healthy monocytes (Mann-Whitney U test)
cv_covid = adata_monocytes.obs.loc[status == 'COVID', 'CV']
cv_healthy = adata_monocytes.obs.loc[status == 'Healthy', 'CV']

# Fail early with a readable message if the expected labels are missing
if cv_covid.empty or cv_healthy.empty:
    raise ValueError(f"Expected 'COVID' and 'Healthy' in Status, found: {sorted(status.unique())}")

stat, p_value = mannwhitneyu(cv_covid, cv_healthy, alternative='two-sided')
print(f'Mann-Whitney U test: Statistic = {stat}, p-value = {p_value}')
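
Step 6 of the plan notes that cell quality metrics may confound the comparison. The sketch below illustrates why, using simulated Poisson counts rather than the COVID-19 dataset: all simulated cells share the same underlying gene rates and differ only in sequencing depth, yet their CV still varies with total counts. The simulation parameters and the `cell_cv` helper are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import spearmanr


def cell_cv(X, epsilon=1e-8):
    """Per-cell coefficient of variation across genes (rows = cells)."""
    X = np.asarray(X, dtype=float)
    return X.std(axis=1) / (X.mean(axis=1) + epsilon)


rng = np.random.default_rng(0)
n_cells, n_genes = 300, 2000

# Every cell shares the same gene-level rates; only sequencing depth differs.
gene_rates = rng.gamma(shape=0.5, scale=2.0, size=n_genes)
depth = rng.uniform(0.2, 5.0, size=n_cells)
counts = rng.poisson(depth[:, None] * gene_rates[None, :])

cv = cell_cv(counts)
total_counts = counts.sum(axis=1)

# Rank correlation between a quality metric (library size) and CV.
rho, p = spearmanr(total_counts, cv)
print(f"Spearman rho(total counts, CV) = {rho:.2f}")
```

If a similar correlation appears in the monocyte data, comparing CV between groups within matched depth ranges, or including total counts as a covariate, would be one way to separate biological variability from sampling noise.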
