# Cell Communication Perturbation Analysis

This notebook demonstrates how to use the `cell_comm_perturb` package for analyzing and predicting perturbations in cell-cell communication.

## Setup

Install the package first if it is not already available in your environment.

In [None]:
# Uncomment to install the package in editable mode from a local checkout
# !pip install -e /path/to/cell_comm_perturb

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import our package
import cell_comm_perturb as ccp

## Data Loading

Load the four input files: the noise-model predictions and the PCT results for the control (NG) and experimental (DIAB) conditions.

In [None]:
# Define file paths
control_path = "path/to/noise_model_predictions_NG.csv"
exp_path = "path/to/noise_model_predictions_DIAB.csv"
control_pct_path = "path/to/NG_results.csv"
exp_pct_path = "path/to/DIAB_results.csv"

# Load data
control_df, exp_df, control_pct, exp_pct = ccp.data_processing.load_data(
    control_path, exp_path, control_pct_path, exp_pct_path,
    control_condition="NG", exp_condition="DIAB"
)

# Display a sample of control data
control_df.head()

## Data Processing

Filter both datasets by p-value, build the test and training sets, and split the control data into a training part and a held-out part.

In [None]:
# Apply PCT filter to both datasets
control_df = ccp.process_pct_filter(control_df, control_pct, pvalue_cutoff=0.05)
exp_df = ccp.process_pct_filter(exp_df, exp_pct, pvalue_cutoff=0.05)

# Create test and training datasets
final_df = ccp.test_set_create(exp_df, control_df)
exp_final = ccp.preprocess_datasets(final_df)

final_train_df = ccp.train_set_create(exp_df, control_df)
control_final = ccp.preprocess_datasets(final_train_df)

# Perform stratified split for training data
train_control, test_control = ccp.stratified_split_handle_rare_classes(control_final, 'source_target')
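
The cell above relies on `stratified_split_handle_rare_classes`, whose internals are not shown in this notebook. As a rough illustration of the idea, the sketch below splits a toy frame per class and keeps classes that have a single row entirely in the training part. The function name, the `test_frac` and `seed` parameters, and the toy data are assumptions made for this example, not the package's implementation.

```python
import pandas as pd

def stratified_split_sketch(df, label_col, test_frac=0.2, seed=0):
    """Split df per class; classes with one row stay in the training part.

    Assumes df has a unique index.
    """
    test_parts = []
    for _, group in df.groupby(label_col):
        if len(group) < 2:
            continue  # a single row cannot be split, so it stays in training
        n_test = max(1, int(round(len(group) * test_frac)))
        n_test = min(n_test, len(group) - 1)  # keep at least one training row
        test_parts.append(group.sample(n=n_test, random_state=seed))
    test = pd.concat(test_parts) if test_parts else df.iloc[0:0]
    train = df.drop(index=test.index)
    return train, test

# Toy data: one common class, one small class, one class with a single row
toy = pd.DataFrame({
    "source_target": ["A_B"] * 5 + ["A_C"] * 2 + ["B_C"],
    "score": [0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1, 2.0],
})
train_toy, test_toy = stratified_split_sketch(toy, "source_target")
print(train_toy["source_target"].value_counts().to_dict())
print(test_toy["source_target"].value_counts().to_dict())
```

Every class that appears in the held-out part also appears in the training part, which is the property a stratified split is meant to preserve.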

## Data Visualization

Plot the overlap between the control and experimental conditions.

In [None]:
# Create a Venn diagram showing overlap between conditions
ccp.plot_venn_diagram(control_df, exp_df, control_condition="NG", exp_condition="DIAB")
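
The numbers behind a two-set Venn diagram can be checked directly with set operations. The sketch below assumes the overlap is computed over interaction identifiers such as the `source_target` values; which column `plot_venn_diagram` actually compares is not stated here, and the identifiers are made up.

```python
def overlap_counts(control_ids, exp_ids):
    """Return the three region sizes of a two-set Venn diagram."""
    control_set, exp_set = set(control_ids), set(exp_ids)
    return {
        "control_only": len(control_set - exp_set),
        "shared": len(control_set & exp_set),
        "exp_only": len(exp_set - control_set),
    }

# Hypothetical interaction identifiers for the two conditions
ng_ids = ["A_B", "A_C", "B_C"]
diab_ids = ["A_C", "B_C", "C_D", "D_A"]

counts = overlap_counts(ng_ids, diab_ids)
print(counts)  # {'control_only': 1, 'shared': 2, 'exp_only': 2}
```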

## Model Training

Let's train a LightGBM model to predict perturbations.

In [None]:
# Train model with hyperparameter optimization (10 trials for example, increase for real use)
final_model, best_params = ccp.perform_kfold_cv(train_control, n_trials=10, n_jobs=None)

# Print best parameters
print("Best parameters:")
for param, value in best_params.items():
    print(f"{param}: {value}")
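
To make the trial-based search concrete, here is a dependency-light sketch of k-fold cross-validation used to pick a hyperparameter. It deliberately swaps in closed-form ridge regression for LightGBM and a three-value grid for the package's tuner, so it shows only the general pattern (score each candidate by mean validation RMSE, keep the best). All names and data are illustrative.

```python
import numpy as np

def kfold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k shuffled folds."""
    order = np.random.default_rng(seed).permutation(n_rows)
    folds = np.array_split(order, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression; a stand-in for the real model."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_rmse(X, y, alpha, k=5):
    """Mean validation RMSE over k folds for one candidate alpha."""
    scores = []
    for train_idx, valid_idx in kfold_indices(len(y), k):
        weights = ridge_fit(X[train_idx], y[train_idx], alpha)
        errors = X[valid_idx] @ weights - y[valid_idx]
        scores.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(scores))

# Synthetic regression data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One "trial" per candidate value; keep the one with the lowest CV error
trials = {alpha: cv_rmse(X, y, alpha) for alpha in (0.01, 1.0, 100.0)}
best_alpha = min(trials, key=trials.get)
print(best_alpha, round(trials[best_alpha], 4))
```

With only a handful of trials the search covers little of the parameter space, which is why the cell above suggests raising `n_trials` for real use.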

## Model Evaluation

Evaluate the model on the held-out control split.

In [None]:
# Preprocess test data for evaluation
lgb_control_data, features = ccp.preprocess_test_dataset(test_control)

# Evaluate the model with diagnostic plots
rmse, r2, test_residuals = ccp.evaluate_model_with_plots(final_model, lgb_control_data.data, lgb_control_data.label)

# Print the spread of the logTF Communication Score to put the RMSE in context
stdev_logTF_communication_score = test_control['logTF Communication Score'].std()
range_logTF_communication_score = test_control['logTF Communication Score'].max() - test_control['logTF Communication Score'].min()

print(f"Standard deviation of logTF Communication Score: {stdev_logTF_communication_score:.4f}")
print(f"Range of logTF Communication Score: {range_logTF_communication_score:.4f}")
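
The standard deviation and range are useful mainly as yardsticks for the RMSE. The sketch below computes RMSE, R², and the RMSE as a fraction of the target's standard deviation and range on made-up numbers. The residual sign convention (`y_true - y_pred`) and the function name are assumptions; `evaluate_model_with_plots` may differ.

```python
import numpy as np

def regression_summary(y_true, y_pred):
    """Return RMSE, R^2, and RMSE relative to the target's spread."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "rmse": rmse,
        "r2": 1.0 - ss_res / ss_tot,
        # ddof=1 matches the default of pandas Series.std()
        "rmse_over_std": rmse / float(np.std(y_true, ddof=1)),
        "rmse_over_range": rmse / float(y_true.max() - y_true.min()),
    }

summary = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

An RMSE well below the standard deviation indicates the model predicts better than a constant guess at the mean.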

## Feature Importance

Let's examine the feature importance from our model.

In [None]:
# Plot feature importance based on gain
ccp.plot_feature_importance(final_model, importance_type="gain")

# Plot feature importance based on split
ccp.plot_feature_importance(final_model, importance_type="split")
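
The two importance types can rank features differently. In LightGBM, "split" counts how many times a feature is used to split, while "gain" sums the loss reduction of those splits. The toy tally below uses invented split records and feature names to show how the two orderings can disagree.

```python
from collections import defaultdict

# Hypothetical split records: (feature used at the split, gain of that split)
splits = [
    ("ligand_expr", 12.0),
    ("receptor_expr", 3.0),
    ("ligand_expr", 0.5),
    ("receptor_expr", 2.5),
    ("receptor_expr", 1.0),
]

split_importance = defaultdict(int)
gain_importance = defaultdict(float)
for feature, gain in splits:
    split_importance[feature] += 1   # "split": number of uses
    gain_importance[feature] += gain  # "gain": total loss reduction

print(dict(split_importance))  # {'ligand_expr': 2, 'receptor_expr': 3}
print(dict(gain_importance))   # {'ligand_expr': 12.5, 'receptor_expr': 6.5}
```

Here `receptor_expr` is used more often, but `ligand_expr` contributes more total gain, so it helps to look at both plots.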

## Applying the Model to Experimental Data

Now let's apply our model to the experimental data.

In [None]:
# Train the final model on the entire training dataset
final_model = ccp.train_final_model(control_final, best_params)

# Evaluate the model on the experimental data
lgb_exp_data, features = ccp.preprocess_test_dataset(exp_final)
rmse, r2, test_residuals = ccp.evaluate_model_with_plots(final_model, lgb_exp_data.data, lgb_exp_data.label)

## Residuals Analysis and Filtering

Let's analyze the residuals and apply filtering based on the RMSE threshold.

In [None]:
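# Threshold shared by the plot and the filter. 0.81 is hard-coded here; if it is
# meant to be the RMSE from the previous cell, assign `rmse` instead.
rmse_threshold = 0.81

# Create an elbow plot of the residuals
ccp.visualization.plot_elbow_residuals(test_residuals, rmse=rmse_threshold)

# Apply filtering based on residuals
filtered_exp_final = ccp.data_processing.filter_with_residuals(
    exp_final,
    test_residuals,
    rmse_threshold=rmse_threshold,
    output_path="path/to/output/exp_final_with_residuals.csv"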
)
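
As a rough picture of residual-based filtering, the sketch below attaches residuals to a toy frame, sorts their magnitudes the way an elbow plot would display them, and keeps the rows whose absolute residual exceeds the threshold. Keeping the rows beyond the threshold (rather than within it) is an assumption about what `filter_with_residuals` does, and the data are made up.

```python
import numpy as np
import pandas as pd

def filter_by_residual(df, residuals, threshold):
    """Attach residuals and keep rows with |residual| above the threshold."""
    out = df.copy()
    out["residual"] = np.asarray(residuals, dtype=float)
    return out[out["residual"].abs() > threshold]

toy_exp = pd.DataFrame({"source_target": ["A_B", "A_C", "B_C", "C_D"]})
toy_residuals = [0.2, -1.5, 0.9, -0.3]

# Magnitudes in descending order, as they would appear along an elbow plot
sorted_abs = np.sort(np.abs(toy_residuals))[::-1]
print(sorted_abs)

kept = filter_by_residual(toy_exp, toy_residuals, threshold=0.81)
print(kept)
```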

## Save the Model

Let's save our model for future use.

In [None]:
# Save the final model
ccp.save_model(final_model, "path/to/cell_cell_comm_model.pkl")

# Later, we can load the model using:
# loaded_model = ccp.load_model("path/to/cell_cell_comm_model.pkl")
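
The `.pkl` extension suggests the model is serialized with `pickle`, although `save_model` and `load_model` are not shown here. Under that assumption, a save/load round trip looks like the sketch below, which uses a plain dictionary as a stand-in for the trained model.

```python
import os
import pickle
import tempfile

# Stand-in object; a real trained model would be pickled the same way
model_stub = {"params": {"learning_rate": 0.1}, "feature_names": ["f0", "f1"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_stub.pkl")
    with open(model_path, "wb") as fh:
        pickle.dump(model_stub, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == model_stub)  # True
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.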