<a href="https://colab.research.google.com/github/samaneh-m/TU-simulation-base-inference/blob/main/SBI_Group_19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Inference of protein secondary structure motifs
###Group 19. Sama and Ana Alonso

Proteins are long chains of amino acids that fold into specific shapes. One key level of organization is the secondary structure, where each amino acid is part of one of three local folding patterns (alpha-helix, beta-sheet or random coil [1]). These patterns then fold further into three-dimensional structures that define the function of the protein. In this project, we focus specifically on predicting alpha-helix patterns using a two-state Hidden Markov Model (HMM) [2]. The two states are "alpha-helix" and "other" (encompassing beta-sheets and coils). We assume fixed emission and transition probabilities derived from empirical data [3].

We define the following generative model for simulating amino acid sequences: The sequence always starts in the "other" state. We output an amino acid based on the following tables of emission probabilities:

We also use the following transition probabilities: If the current state is "alpha-helix", the next state is "alpha-helix" with probability p = 90% and "other" with probability 1 − p = 10%. If the current state is "other", the next state is "alpha-helix" with probability p = 5% and "other" with probability 1 − p = 95%.

Using this simulator, we can simulate amino acid chains of arbitrary length. Additionally, with the Viterbi algorithm [4], it is possible to infer the most likely state sequence for a given amino acid sequence (and, with the related forward-backward algorithm, per-position state probabilities), e.g. using the hmmlearn Python package [5]. Given pairs of amino acid sequences and state probabilities as training data, the goal is to use BayesFlow to train a neural posterior density estimator. We then compare the posterior state probability estimates for a new protein sequence to the known ground truth, for example the annotated secondary structure of human insulin [6].
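
The generative model described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the project's final simulator: the names `simulate_chain`, `TRANSITIONS` and `EMISSIONS` are hypothetical, and the emission probabilities are uniform placeholders because the empirical tables from [3] are not reproduced here.

```python
import random

STATES = ("alpha-helix", "other")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Transition probabilities from the text: TRANSITIONS[current][next].
TRANSITIONS = {
    "alpha-helix": {"alpha-helix": 0.90, "other": 0.10},
    "other": {"alpha-helix": 0.05, "other": 0.95},
}

# ASSUMPTION: uniform placeholder emissions; replace with the empirical tables [3].
EMISSIONS = {
    state: {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS} for state in STATES
}


def simulate_chain(length, seed=None):
    """Return (amino_acid_sequence, state_path) for a chain of the given length."""
    rng = random.Random(seed)
    state = "other"  # the sequence always starts in the "other" state
    residues, path = [], []
    for _ in range(length):
        path.append(state)
        # Emit one amino acid from the current state's emission distribution.
        letters = list(EMISSIONS[state])
        weights = [EMISSIONS[state][aa] for aa in letters]
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
        # Move to the next state according to the transition probabilities.
        candidates = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in candidates]
        state = rng.choices(candidates, weights=weights, k=1)[0]
    return "".join(residues), path


sequence, path = simulate_chain(30, seed=0)
print(sequence)
print("".join("H" if s == "alpha-helix" else "-" for s in path))
```

Passing a fixed `seed` makes a simulated chain reproducible, which is convenient when the same sequences are later reused as training data.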

###References
* 1. https://old-ib.bioninja.com.au/higher-level/topic-7-nucleic-acids/73-translation/protein-structure.html
* 2. https://scholar.harvard.edu/files/adegirmenci/files/hmm_adegirmenci_2014.pdf
* 3. https://www.kaggle.com/datasets/alfrandom/protein-secondary-structure
* 4. https://web.stanford.edu/~jurafsky/slp3/A.pdf
* 5. https://pypi.org/project/hmmlearn/
* 6. https://www.rcsb.org/3d-sequence/1A7F

In [None]:

# This notebook explores the application of Hidden Markov Models (HMMs)
# to the task of protein secondary structure prediction, specifically focusing
# on identifying alpha-helical regions.

# We will simulate amino acid sequences using a two-state HMM
# ("alpha-helix" and "other") with predefined emission and transition probabilities.
# Subsequently, we aim to use BayesFlow to train a neural posterior density estimator
# to infer the hidden states (secondary structure) given an amino acid sequence.

# The project will involve:
# 1. Defining the HMM structure (states, emission probabilities, transition probabilities).
# 2. Simulating amino acid sequences based on the defined HMM.
# 3. Using an HMM inference algorithm (like Viterbi) for ground truth state sequences.
# 4. Utilizing BayesFlow for neural posterior estimation.
# 5. Evaluating the performance of the trained estimator on a new protein sequence.

# Libraries that might be needed:
# !pip install hmmlearn # For HMM operations
# !pip install bayesflow # For neural posterior estimation
# !pip install numpy # For numerical operations
# !pip install pandas # For data handling (if needed)
# !pip install matplotlib # For plotting (if needed)

import numpy as np
from hmmlearn import hmm

# Define the states
states = ["alpha-helix", "other"]
n_states = len(states)

# Define the possible amino acids
# This is a simplified representation, a full set of 20 amino acids would be used in a real scenario
amino_acids = list("ARNDCEQGHI") # Example subset of amino acids
n_amino_acids = len(amino_acids)

# Assume emission probabilities (replace with actual values from source [3])
# This is a placeholder example:
# Rows represent states (alpha-helix, other)
# Columns represent amino acids (A, R, N, D, C, E, Q, G, H, I)
emission_probs = np.array([
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], # Example for alpha-helix
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]  # Example for other
])

# Assume transition probabilities (as described in the text)
# Rows represent current state (alpha-helix, other)
# Columns represent next state (alpha-helix, other)
transition_probs = np.array([
    [0.90, 0.10], # From alpha-helix
    [0.05, 0.95]  # From other
])

# Initial state probability (starts in "other")
start_prob = np.array([0.0, 1.0])

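# Create the HMM model
# CategoricalHMM emits one discrete symbol per position. In recent hmmlearn
# releases, MultinomialHMM expects count vectors instead, so it is not used here.
model = hmm.CategoricalHMM(n_components=n_states, random_state=42)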
model.startprob_ = start_prob
model.transmat_ = transition_probs
model.emissionprob_ = emission_probs

print("HMM model defined.")
print("States:", states)
print("Amino Acids (example):", amino_acids)
print("Transition Probabilities:\n", model.transmat_)
print("Emission Probabilities:\n", model.emissionprob_)
print("Initial State Probability:", model.startprob_)

# Future steps would involve:
# 1. Implementing sequence simulation using model.sample().
# 2. Implementing inference (e.g., Viterbi) using model.decode().
# 3. Integrating with BayesFlow for neural posterior estimation.
# 4. Downloading and processing real protein data (e.g., from RCSB PDB).
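
The inference step listed above can also be sketched without hmmlearn, which makes it easier to see what is being computed. The block below is a hedged, self-contained illustration: `viterbi` returns the most likely state path, and `helix_posterior` returns per-position alpha-helix probabilities via the forward-backward algorithm. Both function names are hypothetical, and `TOY_EMISSIONS` is a made-up table over a two-letter alphabet, used only so the example produces a non-trivial result; it is not the empirical data from [3].

```python
import math

STATES = ("alpha-helix", "other")
START = {"alpha-helix": 0.0, "other": 1.0}  # chains always start in "other"
TRANS = {
    "alpha-helix": {"alpha-helix": 0.90, "other": 0.10},
    "other": {"alpha-helix": 0.05, "other": 0.95},
}
# ASSUMPTION: made-up emissions over a two-letter alphabet, for illustration only.
TOY_EMISSIONS = {
    "alpha-helix": {"A": 0.8, "G": 0.2},
    "other": {"A": 0.2, "G": 0.8},
}


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, emissions=TOY_EMISSIONS):
    """Most likely hidden state path for a string of amino acid letters."""
    if not obs:
        return []
    score = {s: _log(START[s]) + _log(emissions[s][obs[0]]) for s in STATES}
    backpointers = []
    for aa in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] + _log(TRANS[r][s]))
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev] + _log(TRANS[best_prev][s]) + _log(emissions[s][aa])
            )
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


def helix_posterior(obs, emissions=TOY_EMISSIONS):
    """P(state at position t is alpha-helix | obs), via scaled forward-backward."""
    n = len(obs)
    if n == 0:
        return []
    forward, prev = [], None
    for t, aa in enumerate(obs):
        cur = {}
        for s in STATES:
            prior = START[s] if t == 0 else sum(prev[r] * TRANS[r][s] for r in STATES)
            cur[s] = prior * emissions[s][aa]
        norm = sum(cur.values())
        prev = {s: v / norm for s, v in cur.items()}  # rescale to avoid underflow
        forward.append(prev)
    backward = [None] * n
    backward[-1] = {s: 1.0 for s in STATES}
    for t in range(n - 2, -1, -1):
        nxt = {
            s: sum(
                TRANS[s][r] * emissions[r][obs[t + 1]] * backward[t + 1][r]
                for r in STATES
            )
            for s in STATES
        }
        norm = sum(nxt.values())
        backward[t] = {s: v / norm for s, v in nxt.items()}
    posterior = []
    for f, b in zip(forward, backward):
        helix = f["alpha-helix"] * b["alpha-helix"]
        other = f["other"] * b["other"]
        posterior.append(helix / (helix + other))
    return posterior


toy_sequence = "G" + "A" * 10
print(viterbi(toy_sequence))
print([round(p, 3) for p in helix_posterior(toy_sequence)])
```

The posterior probabilities, rather than the single Viterbi path, are the quantity that matches the "state probabilities" used as training targets in the project description.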
