# Biological Sequence Embedding Extraction Tutorial

## What Is an Embedding?

An **embedding** (embedding vector) is a numerical vector that represents text, sequences, or other unstructured data. In bioinformatics, biological sequences such as DNA and protein sequences can be converted into high-dimensional numerical vectors. These vectors can:

1. **Capture semantic information of sequences**: Similar sequences produce similar vectors
2. **Support machine learning**: Numerical vectors can be directly used in various machine learning algorithms
3. **Reduce dimensionality**: Complex sequence information is compressed into fixed-length vectors
4. **Support similarity calculation**: Similarity between sequences can be measured by vector distance
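To make the last point concrete, here is a minimal sketch of two common vector comparisons, cosine similarity and Euclidean distance, written in plain Python. The three 4-dimensional vectors are made-up stand-ins for real embeddings, and the function names are illustrative rather than part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors with made-up values, standing in for real sequence embeddings.
emb_a = [0.10, 0.80, -0.30, 0.50]
emb_b = [0.12, 0.75, -0.28, 0.55]   # nearly parallel to emb_a
emb_c = [-0.90, 0.10, 0.60, -0.20]  # points in a different direction

print("a vs b:", cosine_similarity(emb_a, emb_b), euclidean_distance(emb_a, emb_b))
print("a vs c:", cosine_similarity(emb_a, emb_c), euclidean_distance(emb_a, emb_c))
```

A higher cosine similarity or a lower Euclidean distance indicates that two vectors are closer; which measure suits a given task is a design choice.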

## Why Extract Embeddings?

In bioinformatics research, extracted embeddings are useful for:

- **Sequence classification**: Identify functional types of DNA sequences (such as promoters, enhancers, etc.)
- **Sequence similarity analysis**: Quickly find similar biological sequences
- **Functional prediction**: Predict protein function based on sequence embedding
- **Evolutionary analysis**: Study evolutionary relationships of sequences


**Note:** Due to limited resources, the API currently offers only the 1.2B and 10B models, supports a maximum embedding length of **128k**, and returns embeddings from the last layer only.

### Import Libraries


In [1]:
from genos import create_client

### Create Client


In [2]:
client = create_client(token="your_token_here")

**Before using the service, make sure you have applied for an API [token](https://cloud.stomics.tech/#/personal-center?tab=apiKey).**


## Basic Usage


In [4]:
# DNA sequence
sequence = "ATCGATCGATCGATCGATCGATCGATCG"

# Extract embedding; the full response carries 'result', 'status' and 'message'
response = client.get_embedding(sequence)
result = response['result']

# View the full response
print(response)


{'result': {'sequence': 'ATCGATCGATCGATCGATCGATCGATCG', 'sequence_length': 28, 'token_count': 30, 'embedding_shape': [1, 1024], 'embedding_dim': 1024, 'pooling_method': 'mean', 'model_type': 'flash', 'device': 'cuda', 'embedding': tensor([[ 0.0015,  0.0085, -0.0737,  ..., -0.4238, -0.1729,  0.0094]])}, 'status': 200, 'message': None}


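The `result` field contains:
- `sequence`: Input sequence
- `sequence_length`: Length of the input sequence
- `token_count`: Number of tokens
- `embedding_shape`: Shape of the returned embedding (here `[1, 1024]`)
- `embedding_dim`: Embedding dimension
- `embedding`: Embedding vector
- `pooling_method`: Pooling method used
- `model_type`: Model type used
- `device`: Device on which inference ran (here `cuda`)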
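The sketch below shows one way to check a response before using it. It assumes the full response has the layout of the printed output above (top-level `result`, `status`, and `message` keys) and that a `status` of 200 means success; both are inferred from that single example, not from API documentation. `unwrap_result` is a hypothetical helper, and `mock_response` is a hand-written stand-in shortened to three dimensions.

```python
def unwrap_result(response):
    """Validate a response dict and return its inner `result` payload.

    Assumes top-level `result`, `status` and `message` keys, with
    status 200 meaning success (an assumption based on the sample output).
    """
    status = response.get("status")
    if status != 200:
        raise RuntimeError(
            f"embedding request failed (status={status}): {response.get('message')}"
        )
    result = response["result"]
    # Sanity check: the last axis of embedding_shape should equal embedding_dim.
    if result["embedding_shape"][-1] != result["embedding_dim"]:
        raise ValueError("embedding_shape and embedding_dim disagree")
    return result


# Hand-written stand-in for a real response, shortened to a 3-dimensional vector.
mock_response = {
    "result": {
        "sequence": "ATCG",
        "embedding_shape": [1, 3],
        "embedding_dim": 3,
        "pooling_method": "mean",
        "embedding": [[0.0015, 0.0085, -0.0737]],
    },
    "status": 200,
    "message": None,
}

result = unwrap_result(mock_response)
print(result["embedding_dim"], result["pooling_method"])  # prints: 3 mean
```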



## Model Parameters

### Available Models

Genos supports multiple pre-trained models:

| Model | Parameters | Flash Attention | Use Case |
|-------|-----------|----------------|----------|
| `Genos-1.2B` | 1.2 billion | ✓ | Fast inference, general tasks |
| `Genos-10B` | 10 billion | ✓ | High-precision tasks |


### Pooling Methods

Pooling controls how the per-token embeddings are aggregated into a single sequence-level representation:

| Method | Description               | Output |
|--------|---------------------------|--------|
| `mean` | Average pooling (default) | Single vector |
| `max`  | Max pooling               | Single vector |
| `min`  | Min pooling               | Single vector |
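As an illustration of what these three methods conventionally compute, the sketch below pools a small matrix of token embeddings element-wise across tokens. It is a plain-Python illustration with made-up numbers. How the service handles details such as special tokens is not covered in this tutorial, so treat this as the textbook definition rather than the service's implementation.

```python
def pool(token_embeddings, method="mean"):
    """Aggregate per-token vectors into one vector, one value per dimension."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    # zip(*rows) regroups the values so each tuple holds one dimension across all tokens.
    columns = list(zip(*token_embeddings))
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown pooling method: {method!r}")


# Three tokens, each with a 3-dimensional embedding (made-up values).
tokens = [
    [1.0, -2.0, 0.5],
    [3.0, 0.0, 0.5],
    [2.0, 4.0, -1.0],
]

print(pool(tokens, "mean"))
print(pool(tokens, "max"))   # [3.0, 4.0, 0.5]
print(pool(tokens, "min"))   # [1.0, -2.0, -1.0]
```

Whatever the number of tokens, each method returns a single vector whose length equals the embedding dimension, which is why sequences of different lengths yield embeddings of the same size.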


## Applications of Embeddings

Extracted embeddings can be used for:

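1. **Sequence similarity calculation**:
   ```python
   # Calculate cosine similarity between two sequence embeddings, e.g. with scikit-learn
   from sklearn.metrics.pairwise import cosine_similarity
   similarity = cosine_similarity(embedding1, embedding2)  # inputs shaped (n_samples, dim)
   ```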

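2. **Sequence classification**:
   ```python
   # Train a classifier on embeddings, e.g. logistic regression from scikit-learn
   from sklearn.linear_model import LogisticRegression
   classifier = LogisticRegression(max_iter=1000).fit(embeddings, labels)
   ```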

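3. **Clustering analysis**:
   ```python
   # Cluster sequences by their embeddings, e.g. k-means (n_clusters=3 is illustrative)
   from sklearn.cluster import KMeans
   clusters = KMeans(n_clusters=3).fit_predict(embeddings)
   ```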

4. **Dimensionality reduction for visualization**:
   ```python
   # Use t-SNE (or PCA) to project embeddings to 2-D; t-SNE needs more samples than its perplexity
   from sklearn.manifold import TSNE
   reduced_embeddings = TSNE(n_components=2).fit_transform(embeddings)
   ```
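The snippets above expect `embeddings` as a 2-D structure with one row per sequence, whereas the Basic Usage output reports an `embedding_shape` of `[1, 1024]` for a single sequence. A minimal sketch of bridging the two follows. `to_feature_row` and `build_feature_matrix` are hypothetical helpers, and the code assumes only that a returned embedding is either a nested list or an object with a `tolist()` method (as PyTorch tensors and NumPy arrays provide).

```python
def to_feature_row(embedding):
    """Flatten one embedding of shape [1, dim] (or [dim]) into a flat list of floats."""
    if hasattr(embedding, "tolist"):  # e.g. torch tensors and NumPy arrays
        embedding = embedding.tolist()
    if embedding and isinstance(embedding[0], (list, tuple)):
        if len(embedding) != 1:
            raise ValueError("expected a single pooled vector of shape [1, dim]")
        embedding = embedding[0]
    return [float(x) for x in embedding]


def build_feature_matrix(embeddings):
    """Stack per-sequence embeddings into a list of rows, one row per sequence."""
    rows = [to_feature_row(e) for e in embeddings]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all embeddings must have the same dimension")
    return rows


# Made-up 2-dimensional embeddings in the two accepted layouts.
matrix = build_feature_matrix([[[0.1, 0.2]], [0.3, 0.4]])
print(matrix)  # [[0.1, 0.2], [0.3, 0.4]]
```

The resulting list of rows can be passed wherever a `(n_samples, dim)` array-like is expected.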


## Summary

Through this tutorial, we learned:

### Core Concepts
- **Embedding**: Technique for converting biological sequences into numerical vectors
- **Pre-trained models**: Large language models specifically trained for biological sequences
- **Hierarchical features**: Different layers capture sequence information at different levels (the API currently returns only the last layer)

### Technical Process
1. Load pre-trained biological sequence models
2. Encode DNA sequences into tokens
3. Obtain embeddings through model inference
4. Analyze the feature representations returned by the model (currently from the last layer only)

### Practical Value
- **Accelerate research**: Quickly analyze large amounts of biological sequences
- **Improve accuracy**: Use pre-trained knowledge to enhance prediction performance
- **Support downstream tasks**: Provide foundation for classification, clustering, similarity analysis, etc.

### Next Examples
1. Use extracted embeddings for downstream population prediction tasks
2. Downstream variant prediction tasks based on embeddings (API)
3. RNA coverage trajectory prediction (API)


**Congratulations!** You have mastered the basic methods of biological sequence embedding extraction!
