
# Task
Develop a self-supervised learning model for decoding microbial communication from synthetic metagenomic data to enhance bio-fertilizer design, aiming for over 85% accuracy in communication decoding.

## Setup Environment and Define Problem

### Subtask:
Install necessary Python libraries and define the problem by outlining the simulated characteristics of 'metagenomic data' and 'microbial communication networks' for the synthetic dataset.


**Reasoning**:
The first step of the subtask is to install the necessary Python libraries. I will use `pip` to install `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, and `torch`.



In [1]:
pip install pandas numpy scikit-learn matplotlib seaborn torch



### Defining Synthetic Metagenomic Data and Microbial Communication Networks

To simulate a realistic environment for self-supervised learning, we will define the following characteristics for our synthetic dataset:

#### 1. Characteristics of Synthetic Metagenomic Data:

*   **Microbial Features (Individual Microbes):**
    *   **Functional Genes/Metabolic Capabilities:** Represented as binary vectors or numerical scores indicating the presence/absence or abundance of specific metabolic pathways or functional genes (e.g., genes for nitrogen fixation, phosphate solubilization, antibiotic production, degradation of specific compounds). For instance, a microbe might have `[1, 0, 1, 0]` for nitrogen fixation, antibiotic resistance, quorum sensing, and cellulose degradation, respectively.
    *   **Taxonomic Identity:** Represented as categorical labels or one-hot encoded vectors, indicating the species or genus of the microbe. This could influence its baseline metabolic profile and interaction patterns.
    *   **Growth Rate Potential:** A numerical value indicating the inherent growth potential under ideal conditions, which can be modulated by interactions.
    *   **Resource Preferences:** Numerical scores for uptake rates of various essential nutrients (e.g., carbon, nitrogen, phosphorus sources). For example, `[0.8, 0.2, 0.5]` for glucose, ammonia, and phosphate uptake efficiency.

*   **Encoding:** Features will primarily be encoded as numerical vectors (binary or floating-point) or one-hot encoded categorical variables, making them suitable for neural network input.
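
As a minimal sketch of the encoding described above, the snippet below flattens one hypothetical microbe into a single input vector. The species count, species index, and growth value are illustrative assumptions, not values taken from the dataset generated later.

```python
import numpy as np

# Hypothetical single microbe, using the four example capabilities named above:
# nitrogen fixation, antibiotic resistance, quorum sensing, cellulose degradation.
functional_genes = np.array([1, 0, 1, 0], dtype=np.float32)
num_species = 8               # assumed number of taxa
taxonomic_id = 2              # assumed species index
growth_rate_potential = 6.5   # assumed value on a 0-10 scale
resource_prefs = np.array([0.8, 0.2, 0.5], dtype=np.float32)  # glucose, ammonia, phosphate

# One-hot encode the categorical taxonomic identity.
taxonomy_one_hot = np.zeros(num_species, dtype=np.float32)
taxonomy_one_hot[taxonomic_id] = 1.0

# Concatenate everything into one flat vector suitable as network input.
feature_vector = np.concatenate([
    functional_genes,
    taxonomy_one_hot,
    np.array([growth_rate_potential], dtype=np.float32),
    resource_prefs,
])
print(feature_vector.shape)  # (16,)
```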

#### 2. Characteristics of Microbial Communication Networks:

*   **Communication Signals/Interaction Types:**
    *   **Resource Exchange (Commensalism/Mutualism):** Represented by the transfer of metabolic byproducts or essential nutrients from one microbe to another. This can be modeled as a directed interaction where one microbe produces a resource that another consumes.
    *   **Quorum Sensing Molecules:** Specific signaling molecules produced and detected by microbes, influencing gene expression and collective behaviors. The presence and concentration of these molecules can trigger changes in microbial state or growth.
    *   **Antagonistic Interactions (Competition/Inhibition):** Modeled as the production of inhibitory compounds (e.g., antibiotics, bacteriocins) by one microbe that negatively impacts the growth or survival of another.
    *   **Synergistic Interactions:** Cases where two or more microbes collectively perform a function that neither could do alone, potentially through sequential resource conversion or co-factor production.

*   **Strength/Nature of Interactions:**
    *   **Interaction Strength:** Numerical values representing the magnitude of the effect (e.g., amount of resource exchanged, inhibition strength). This can be a weight on the edge of a graph.
    *   **Interaction Type:** Categorical labels (e.g., 'resource_transfer', 'quorum_sensing', 'inhibition') associated with each interaction, allowing for diverse network structures.
    *   **Dynamic Nature:** Interactions may not be constant and could depend on environmental factors or the current state (e.g., population density) of the interacting microbes.
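
One possible way to hold such a network in memory is a weighted adjacency matrix per interaction type. The sketch below assumes four microbes and three hand-picked edges, and it assumes that inhibition counts negatively while the other types count positively; none of these choices come from the dataset generated later.

```python
import numpy as np

interaction_types = ["resource_transfer", "quorum_sensing", "inhibition"]
num_microbes = 4  # assumed community size

# Hypothetical directed edges: (source, target, type, strength).
edges = [
    (0, 1, "resource_transfer", 0.7),
    (1, 2, "quorum_sensing", 0.4),
    (3, 0, "inhibition", 0.9),
]

# One weighted adjacency matrix per interaction type.
adjacency = np.zeros((len(interaction_types), num_microbes, num_microbes))
for source, target, kind, strength in edges:
    adjacency[interaction_types.index(kind), source, target] = strength

# Signed net effect (assumption: inhibition is negative, the rest positive).
signs = np.array([1.0, 1.0, -1.0]).reshape(-1, 1, 1)
net_effect = (adjacency * signs).sum(axis=0)
print(net_effect)
```

Keeping one matrix per type preserves the categorical label, while `net_effect` collapses it into a single signed edge weight when only the direction of influence matters.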

#### 3. Support for Self-Supervised Learning Task:

The model will learn to decode communication patterns by reconstructing masked or perturbed parts of the metagenomic data or interaction network. For example:

*   **Masked Feature Prediction:** Predict missing microbial features (e.g., metabolic capabilities) based on observed features and known interactions.
*   **Interaction Prediction:** Predict the existence or type of interaction between two microbes given their features.
*   **Temporal Prediction:** Predict future states of the microbial community or communication patterns given past observations.
*   **Graph Representation Learning:** The model will learn embeddings for microbes and interactions, where similarity in the embedding space reflects similar functional roles or communication partners. The self-supervised task will involve using context (e.g., neighboring microbes in the network) to predict attributes of a central microbe, or predicting the existence of links between microbes.
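
The masked feature prediction task above can be sketched as follows. This is only an illustration under assumed settings: the 15% masking rate, the zero fill value, and the column-mean "model" are placeholders, and the helper names `mask_features` and `masked_mse` are hypothetical.

```python
import numpy as np

def mask_features(features, rng, mask_prob=0.15, mask_value=0.0):
    """Hide a random subset of entries; return the masked input and the mask."""
    mask = rng.random(features.shape) < mask_prob
    masked_input = np.where(mask, mask_value, features)
    return masked_input, mask

def masked_mse(predictions, features, mask):
    """Reconstruction error computed over the hidden entries only."""
    if not mask.any():
        return 0.0
    return float(np.mean((predictions[mask] - features[mask]) ** 2))

rng = np.random.default_rng(0)
features = rng.random((1000, 21))          # stand-in feature matrix
masked_input, mask = mask_features(features, rng)

# Trivial baseline: predict each hidden entry with the mean of the visible
# entries in the same column. A real model would replace this step.
visible_only = np.where(mask, np.nan, features)
column_means = np.nanmean(visible_only, axis=0)
predictions = np.broadcast_to(column_means, features.shape)

baseline_loss = masked_mse(predictions, features, mask)
print(f"Baseline masked MSE: {baseline_loss:.4f}")
```

A learned model is useful for this task only if it beats such a baseline, which requires the features to carry information about each other.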

## Generate Large Synthetic Metagenomic Dataset

### Subtask:
Develop a function to generate a synthetic dataset of over 5 million entries. This dataset should simulate 'metagenomic' features and implicit 'communication' signals, embedding patterns suitable for self-supervised learning.


**Reasoning**:
The subtask requires generating a synthetic dataset with over 5 million entries, simulating metagenomic features and communication signals for self-supervised learning. The first step is to import the necessary libraries and define a function that will create this dataset, including the simulation of microbial features and implicit communication patterns as described in the instructions.



In [2]:
import numpy as np
import pandas as pd

def generate_synthetic_data(num_samples):
    """
    Generates a synthetic dataset simulating metagenomic features and microbial communication signals.

    Args:
        num_samples (int): The number of entries to generate in the dataset.

    Returns:
        pandas.DataFrame: A DataFrame containing the synthetic metagenomic data.
    """
    # 1. Define simulation parameters
    num_genes = 15  # Number of functional genes/metabolic capabilities
    num_species = 8  # Number of different microbial species/groups
    num_resources = 4  # Number of different resource preferences

    # Initialize the dict of columns that will become the DataFrame
    data = {
        'sample_id': np.arange(num_samples)
    }

    # 2. Simulate microbial features
    # 'Functional Genes/Metabolic Capabilities': binary vectors
    for i in range(num_genes):
        data[f'gene_{i+1}'] = np.random.randint(0, 2, num_samples)

    # 'Taxonomic Identity': categorical (represented numerically for simplicity)
    data['taxonomic_id'] = np.random.randint(0, num_species, num_samples)

    # 'Growth Rate Potential': numerical
    data['growth_rate_potential'] = np.random.rand(num_samples) * 10 # Scale from 0-10

    # 'Resource Preferences': numerical values for each resource
    for i in range(num_resources):
        data[f'resource_pref_{i+1}'] = np.random.rand(num_samples)

    # 3. Simulate implicit 'communication' signals or interaction patterns
    # This part embeds patterns for self-supervised learning.
    # Example: Communication strength based on species and resource preferences
    # If species X prefers resource Y, and gene Z is present, communication might be higher.

    # Define a base communication signal
    data['communication_signal'] = np.random.rand(num_samples) * 5 # Base signal 0-5

    # Add patterns based on species and resource preferences
    # For simplicity, let's say species 0 interacts strongly with resource 1,
    # and species 1 interacts strongly with resource 2, etc.
    for i in range(num_species):
        species_mask = (data['taxonomic_id'] == i)
        # Link species to a specific resource preference for interaction
        resource_idx = i % num_resources
        data['communication_signal'][species_mask] += data[f'resource_pref_{resource_idx+1}'][species_mask] * 2

    # Add patterns based on gene presence (e.g., specific gene enhances communication)
    # Example: If gene_1 is present, it boosts communication signal
    data['communication_signal'] += data['gene_1'] * 3 # Gene_1 boosts signal by 3 if present
    data['communication_signal'] += data['gene_5'] * data['gene_10'] * 2 # Synergistic effect between gene_5 and gene_10

    # Clip at zero so the signal is non-negative (a no-op here, since every term added above is non-negative)
    data['communication_signal'] = np.maximum(0, data['communication_signal'])

    # Create DataFrame
    df = pd.DataFrame(data)

    print("Synthetic data generation complete. First 5 rows:")
    print(df.head())
    print("\nDataFrame Info:")
    df.info()
    return df

# Call the function to generate a dataset with at least 5,000,000 entries
num_entries = 5_000_000 # 5 million entries
synthetic_data_df = generate_synthetic_data(num_entries)

Synthetic data generation complete. First 5 rows:
   sample_id  gene_1  gene_2  gene_3  gene_4  gene_5  gene_6  gene_7  gene_8  \
0          0       0       1       1       0       0       1       1       0   
1          1       1       0       1       1       0       1       0       0   
2          2       0       1       0       0       1       0       1       0   
3          3       1       1       0       1       1       1       1       1   
4          4       0       0       1       1       1       1       0       0   

   gene_9  ...  gene_13  gene_14  gene_15  taxonomic_id  \
0       1  ...        1        1        0             2   
1       1  ...        0        0        0             7   
2       0  ...        0        1        0             3   
3       0  ...        1        1        1             0   
4       0  ...        0        1        0             3   

   growth_rate_potential  resource_pref_1  resource_pref_2  resource_pref_3  \
0               6.891323         0.

## Implement Self-Supervised Learning Model

### Subtask:
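Design and implement a self-supervised learning model architecture, such as a neural network, and prepare the synthetic dataset for training. The model will be designed to predict the 'communication_signal' from the remaining feature columns as its pre-training (pretext) task. Because this target is a generated column rather than a masked part of the input, the setup is formally a regression problem.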


**Reasoning**:
The first step is to separate the features and the target variable from the `synthetic_data_df` and convert them into PyTorch tensors, as required by the subtask instructions. This prepares the data for use with PyTorch's Dataset and DataLoader.



In [3]:
import torch
from torch.utils.data import Dataset, DataLoader

# 1. Separate features and target
# All columns except 'sample_id' and 'communication_signal' are features
feature_cols = [col for col in synthetic_data_df.columns if col not in ['sample_id', 'communication_signal']]
X = synthetic_data_df[feature_cols].values
y = synthetic_data_df['communication_signal'].values

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1) # unsqueeze for single output feature

print(f"Features tensor shape: {X_tensor.shape}")
print(f"Target tensor shape: {y_tensor.shape}")

Features tensor shape: torch.Size([5000000, 21])
Target tensor shape: torch.Size([5000000, 1])
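
The feature matrix above passes `taxonomic_id` to the network as a single integer column, although the problem definition describes taxonomic identity as categorical. A hedged alternative is to one-hot encode it before building the tensor. The sketch below uses a small random stand-in for the real columns, so the array names and sizes are assumptions.

```python
import numpy as np

num_species = 8  # matches num_species in generate_synthetic_data
rng = np.random.default_rng(0)

# Stand-ins for the 'taxonomic_id' column and the other 20 feature columns.
taxonomic_id = rng.integers(0, num_species, size=10)
other_features = rng.random((10, 20)).astype(np.float32)

# Row i of the identity matrix is the one-hot vector for species i.
taxonomy_one_hot = np.eye(num_species, dtype=np.float32)[taxonomic_id]

X_one_hot = np.concatenate([other_features, taxonomy_one_hot], axis=1)
print(X_one_hot.shape)  # (10, 28)
```

With this encoding the input dimension grows from 21 to 28, and the network no longer has to infer the species-to-resource mapping from an arbitrary integer ordering.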


**Reasoning**:
I will create a custom PyTorch Dataset class to handle the feature and target tensors, and then instantiate a DataLoader to enable efficient batch processing during training, as specified in the instructions.



In [4]:
class MetagenomicDataset(Dataset):
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

# Create an instance of the custom Dataset
metagenomic_dataset = MetagenomicDataset(X_tensor, y_tensor)

# Instantiate a DataLoader
batch_size = 64  # A common batch size for training
metagenomic_dataloader = DataLoader(metagenomic_dataset, batch_size=batch_size, shuffle=True)

print(f"Number of samples in Dataset: {len(metagenomic_dataset)}")
print(f"Number of batches in DataLoader: {len(metagenomic_dataloader)}")

# Verify a batch
for features, targets in metagenomic_dataloader:
    print(f"First batch features shape: {features.shape}")
    print(f"First batch targets shape: {targets.shape}")
    break

Number of samples in Dataset: 5000000
Number of batches in DataLoader: 78125
First batch features shape: torch.Size([64, 21])
First batch targets shape: torch.Size([64, 1])
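
The DataLoader above serves all 5,000,000 rows for training, which leaves no held-out data for evaluation. One possible remedy, sketched here with NumPy only, is to shuffle the row indices once and reserve a fraction for validation. The helper name `split_indices`, the 10% fraction, and the seed are assumptions.

```python
import numpy as np

def split_indices(num_samples, val_fraction=0.1, seed=42):
    """Shuffle sample indices and split them into train and validation sets."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(num_samples)
    num_val = int(num_samples * val_fraction)
    return permutation[num_val:], permutation[:num_val]

train_idx, val_idx = split_indices(5_000_000)
print(len(train_idx), len(val_idx))  # 4500000 500000
```

The resulting index arrays could then select rows from the tensors, for example `X_tensor[torch.from_numpy(train_idx)]`, before wrapping each subset in its own `MetagenomicDataset`.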


**Reasoning**:
Now that the data is prepared, I will define a neural network model using `torch.nn.Module`, specifying its architecture with linear layers and activation functions. I will also define the Mean Squared Error as the loss function and Adam as the optimizer, as required by the subtask instructions.



In [5]:
import torch.nn as nn
import torch.optim as optim

# Define a neural network model
class CommunicationPredictor(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, 128)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.layer3 = nn.Linear(64, 32)
        self.relu3 = nn.ReLU()
        self.output_layer = nn.Linear(32, 1) # Output a single 'communication_signal'

    def forward(self, x):
        x = self.relu1(self.layer1(x))
        x = self.relu2(self.layer2(x))
        x = self.relu3(self.layer3(x))
        x = self.output_layer(x)
        return x

# Get input dimension from our features tensor
input_dimension = X_tensor.shape[1]

# Instantiate the model
model = CommunicationPredictor(input_dimension)
print(f"Model architecture:\n{model}")

# Define the Mean Squared Error (MSE) as the loss function
criterion = nn.MSELoss()
print(f"Loss function: {criterion}")

# Choose an optimizer (e.g., Adam)
optimizer = optim.Adam(model.parameters(), lr=0.001)
print(f"Optimizer: {optimizer}")

Model architecture:
CommunicationPredictor(
  (layer1): Linear(in_features=21, out_features=128, bias=True)
  (relu1): ReLU()
  (layer2): Linear(in_features=128, out_features=64, bias=True)
  (relu2): ReLU()
  (layer3): Linear(in_features=64, out_features=32, bias=True)
  (relu3): ReLU()
  (output_layer): Linear(in_features=32, out_features=1, bias=True)
)
Loss function: MSELoss()
Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)
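
To get a feel for the size of this architecture, the number of trainable parameters can be worked out from the layer widths printed above. Each `nn.Linear` layer holds `in_features * out_features` weights plus `out_features` biases.

```python
# Layer widths taken from the printed architecture: 21 -> 128 -> 64 -> 32 -> 1.
layer_sizes = [21, 128, 64, 32, 1]

# Weights plus biases for each fully connected layer.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)  # [2816, 8256, 2080, 33] 13185
```

The same total should be returned by `sum(p.numel() for p in model.parameters())` on the instantiated model.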


**Reasoning**:
Now that the model, loss function, and optimizer are defined, I will implement a training loop to train the neural network model using the `metagenomic_dataloader`.



In [6]:
num_epochs = 5

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")

print("Starting model training...")
for epoch in range(num_epochs):
    model.train() # Set the model to training mode
    total_loss = 0
    for batch_features, batch_targets in metagenomic_dataloader:
        # Move data to the device
        batch_features, batch_targets = batch_features.to(device), batch_targets.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_features)

        # Calculate loss
        loss = criterion(outputs, batch_targets)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(metagenomic_dataloader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("Training complete.")

Using device: cpu
Starting model training...
Epoch [1/5], Loss: 2.2619
Epoch [2/5], Loss: 2.1327
Epoch [3/5], Loss: 2.1197
Epoch [4/5], Loss: 2.1144
Epoch [5/5], Loss: 2.1108
Training complete.


## Train and Fine-tune Model for Communication Decoding

### Subtask:
Evaluate the pre-trained model's performance in decoding microbial communication by predicting the 'communication_signal' and calculating a regression metric. Because the target is continuous, classification accuracy does not apply, so the goal of over 85% accuracy is interpreted as an R-squared score above 0.85.


**Reasoning**:
The next step is to evaluate the trained model's performance on the synthetic dataset by predicting the 'communication_signal' and calculating the R-squared metric to assess accuracy, following the detailed instructions provided.



In [7]:
from sklearn.metrics import r2_score

# Set the model to evaluation mode
model.eval()

# Create empty lists to store all predicted values and true target values
all_predictions = []
all_targets = []

print("Evaluating model performance...")
# Disable gradient calculations for inference
with torch.no_grad():
    # Iterate through the metagenomic_dataloader.
    # Note: this is the same data the model was trained on, so the score
    # below is a training-set score rather than a held-out estimate.
    for batch_features, batch_targets in metagenomic_dataloader:
        # Move features and targets to the device
        batch_features, batch_targets = batch_features.to(device), batch_targets.to(device)

        # Get predictions from the model
        outputs = model(batch_features)

        # Append the predictions and true targets (converted to NumPy arrays) to their respective lists
        all_predictions.append(outputs.cpu().numpy())
        all_targets.append(batch_targets.cpu().numpy())

# Concatenate all predictions and true targets to get a single array for each
predictions_np = np.concatenate(all_predictions, axis=0)
targets_np = np.concatenate(all_targets, axis=0)

# Calculate the R-squared score
r_squared = r2_score(targets_np, predictions_np)

# Print the R-squared score; the check below reports whether it meets the target of 85% (0.85)
print(f"R-squared score on the synthetic dataset: {r_squared:.4f}")

if r_squared > 0.85:
    print("Model performance meets the target of over 85% R-squared.")
else:
    print("Model performance does NOT meet the target of over 85% R-squared.")

Evaluating model performance...
R-squared score on the synthetic dataset: 0.6100
Model performance does NOT meet the target of over 85% R-squared.
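
It may help to check how high R-squared can go for this dataset at all. In `generate_synthetic_data`, the base signal `np.random.rand(num_samples) * 5` is independent of every feature, so no model can predict it. The sketch below computes the variance of each term analytically, assuming the terms are independent (which appears to hold, since every column is drawn separately), and derives the resulting ceiling.

```python
import numpy as np

# Variance of each independent term in communication_signal.
var_noise = 5.0**2 / 12              # base signal: Uniform(0, 5), unpredictable
var_resource = 2.0**2 / 12           # 2 * resource preference: 2 * Uniform(0, 1)
var_gene_1 = 3.0**2 * 0.5 * 0.5      # 3 * gene_1: 3 * Bernoulli(0.5)
var_synergy = 2.0**2 * 0.25 * 0.75   # 2 * gene_5 * gene_10: 2 * Bernoulli(0.25)

var_explainable = var_resource + var_gene_1 + var_synergy
var_total = var_noise + var_explainable
r2_ceiling = var_explainable / var_total

print(f"MSE floor: {var_noise:.4f}")
print(f"R-squared ceiling: {r2_ceiling:.4f}")
```

Under these assumptions the ceiling is 8/13, about 0.615, and the lowest achievable MSE is 25/12, about 2.083. The observed R-squared of 0.6100 and training loss of about 2.11 are already close to these limits, which suggests that reaching 0.85 would require reducing the base noise in the generator rather than changing the model or its hyperparameters.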


## Refine Model Training for Improved Communication Decoding

### Subtask:
The model's R-squared score (0.6100) did not meet the target of over 85%. To improve performance, refine the model training by adjusting hyperparameters such as the number of epochs and learning rate.


**Reasoning**:
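The model's R-squared score did not meet the target. Following the instructions, I will first increase the number of epochs to give the model more time to learn the patterns in the data, and keep the learning rate unchanged to isolate the effect of longer training. One caveat: the training loss had already plateaued near 2.11 after five epochs, which is close to the variance of the uniform base signal added in `generate_synthetic_data` (25/12 ≈ 2.08), so additional epochs may bring only marginal gains.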



In [None]:
num_epochs = 20 # Increased number of epochs; training continues from the weights learned above

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")

print("Starting model training with increased epochs...")
for epoch in range(num_epochs):
    model.train() # Set the model to training mode
    total_loss = 0
    for batch_features, batch_targets in metagenomic_dataloader:
        # Move data to the device
        batch_features, batch_targets = batch_features.to(device), batch_targets.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_features)

        # Calculate loss
        loss = criterion(outputs, batch_targets)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(metagenomic_dataloader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("Training complete.")

Using device: cpu
Starting model training with increased epochs...
Epoch [1/20], Loss: 2.1076
Epoch [2/20], Loss: 2.1056
Epoch [3/20], Loss: 2.1043
