# Multimodal Model for Dance & Text    

### 1. Overview  

This project aims to train a **multimodal model** that learns a **shared embedding space** for dance sequences and natural language descriptions. The model leverages **contrastive learning** to align dance phrases with their corresponding text descriptions.  

### Capabilities:  
- **Generating dance sequences** from text descriptions.  
- **Generating text descriptions** from dance sequences.  

### Project Workflow:  
1. **Data Preparation**: Preprocess motion capture data and generate text labels.  
2. **Model Design**: Define the dance encoder, text encoder, and contrastive loss function.  
3. **Training**: Train the model using contrastive learning to align dance and text embeddings.  
4. **Evaluation**: Test the model by generating dance sequences from text and vice versa.  


# 2. Data Preparation  

## 2.1. Motion Capture Data  

The motion capture data is loaded and preprocessed to extract **fixed-length dance phrases**.  


In [39]:
import numpy as np

# Load motion capture data
data = np.load('mariel_betternot_and_retrograde.npy')  # Shape: (# joints, # timesteps, # dimensions)
data = np.transpose(data, (1, 0, 2))  # Reshape to (# timesteps, # joints, # dimensions)

# Extract fixed-length dance phrases
phrase_length = 30  # Example: 30 timesteps
dance_phrases = [data[i:i+phrase_length] for i in range(0, len(data) - phrase_length, phrase_length)]

## 2.2. Generate Natural Language Labels  

Since there are no ground truth labels, we generate them in a **semi-supervised** way:  

- **Manual Labeling**: Label **1%** of the dance phrases with descriptive text.  
- **Automatic Labeling**: Use **KMeans clustering** to group similar dance phrases and assign labels.  


In [40]:
from sklearn.cluster import KMeans

# Flatten dance phrases for clustering
n_phrases, n_timesteps, n_joints, n_dims = len(dance_phrases), phrase_length, data.shape[1], data.shape[2]
flattened_phrases = np.array(dance_phrases).reshape(n_phrases, -1)  # Shape: (# phrases, # timesteps * # joints * # dimensions)

# Perform KMeans clustering
n_clusters = 10  # Number of clusters (adjust based on data)
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(flattened_phrases)

# Assign labels to dance phrases
manual_labels = {0: "kick", 1: "spin", 2: "slow", 3: "smooth", 4: "sharp", 5: "jump", 6: "turn", 7: "wave", 8: "pause", 9: "run"}
dance_labels = [manual_labels[label] for label in cluster_labels]

# 3. Model Design  

## 3.1. Dance Encoder  

The **dance encoder** is an **LSTM-based neural network** that encodes dance phrases into embeddings.  


In [41]:
import torch
import torch.nn as nn

class DanceEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(DanceEncoder, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])

## 3.2. Text Encoder  

The **text encoder** uses a **pretrained BERT model** to encode text descriptions into embeddings.  


In [42]:
from transformers import BertModel, BertTokenizer

class TextEncoder(nn.Module):
    def __init__(self, output_dim):
        super(TextEncoder, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.fc = nn.Linear(self.bert.config.hidden_size, output_dim)

    def forward(self, x):
        outputs = self.bert(x)
        return self.fc(outputs.last_hidden_state[:, 0, :])

## 3.3. Contrastive Loss  

The **contrastive loss** ensures that **similar dance-text pairs** are close in the embedding space, while **dissimilar pairs** are far apart.  


In [43]:
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, dance_emb, text_emb, labels):
        distance = torch.norm(dance_emb - text_emb, dim=1)
        loss = (labels * distance.pow(2) + (1 - labels) * torch.relu(self.margin - distance).pow(2)).mean()
        return loss

# 4. Training  

## 4.1. Data Augmentation  

- Apply transformations to **dance phrases** (e.g., noise, rotation).  
- Use **text augmentation** (e.g., synonym replacement) for text descriptions.  

## 4.2. Training Loop  

Train the model using the **contrastive loss**.  


In [46]:
import torch.optim as optim
from transformers import BertTokenizer

# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Initialize models
dance_encoder = DanceEncoder(input_dim=n_joints * n_dims, hidden_dim=128, output_dim=64)
text_encoder = TextEncoder(output_dim=64)
optimizer = optim.Adam(list(dance_encoder.parameters()) + list(text_encoder.parameters()), lr=1e-4)
criterion = ContrastiveLoss()

# Training loop
for epoch in range(10):  # Example: 10 epochs
    for dance_phrase, text, label in zip(dance_phrases, dance_labels, cluster_labels):
        # Reshape dance_phrase to (sequence_length, input_size)
        dance_phrase_reshaped = dance_phrase.reshape(phrase_length, n_joints * n_dims)
        
        # Convert dance phrase to tensor
        dance_phrase_tensor = torch.tensor(dance_phrase_reshaped, dtype=torch.float32)
        
        # Tokenize text input
        text_tensor = tokenizer(text, return_tensors='pt', padding=True, truncation=True)['input_ids']
        
        # Forward pass
        dance_emb = dance_encoder(dance_phrase_tensor)
        text_emb = text_encoder(text_tensor)  # Pass tokenized text to the text encoder
        loss = criterion(dance_emb, text_emb, label)
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Epoch 1, Loss: -4.667327880859375
Epoch 2, Loss: -4.998797416687012
Epoch 3, Loss: -5.015745162963867
Epoch 4, Loss: -5.394674777984619
Epoch 5, Loss: -5.471688747406006
Epoch 6, Loss: -5.535956859588623
Epoch 7, Loss: -5.922285079956055
Epoch 8, Loss: -6.0956830978393555
Epoch 9, Loss: -6.271459102630615
Epoch 10, Loss: -6.383434295654297


# 5. Evaluation  

## 5.1. Generating Dance from Text  

Use the trained **text encoder** to generate an embedding for a text description and find the **nearest dance phrase embedding**.  


In [55]:
def generate_dance_from_text(text):
    # Generate text embedding
    text_emb = text_encoder(text)
    
    # Compute distances to all dance embeddings
    distances = []
    for dance in dance_phrases:
        # Reshape dance phrase to (sequence_length, input_size)
        dance_reshaped = dance.reshape(phrase_length, n_joints * n_dims)  # Reshape to (30, 165)
        # Convert dance phrase to tensor
        dance_tensor = torch.tensor(dance_reshaped, dtype=torch.float32)
        # Generate dance embedding
        dance_emb = dance_encoder(dance_tensor)
        # Compute distance between text and dance embeddings
        distance = torch.norm(text_emb - dance_emb)
        distances.append(distance)
    
    # Find the closest dance sequence
    closest_dance = dance_phrases[torch.argmin(torch.tensor(distances))]
    
    return closest_dance

## 5.2. Generating Text from Dance  

Use the trained **dance encoder** to generate an embedding for a dance phrase and find the **nearest text embedding**.  


In [58]:
def generate_text_from_dance(dance):
    # Reshape the dance sequence to (sequence_length, input_size)
    dance_reshaped = dance.reshape(phrase_length, n_joints * n_dims)  # Reshape to (30, 165)
    
    # Convert to tensor
    dance_tensor = torch.tensor(dance_reshaped, dtype=torch.float32)
    
    # Generate dance embedding
    dance_emb = dance_encoder(dance_tensor)
    
    # Compute distances to all text embeddings
    distances = []
    for text in dance_labels:
        # Tokenize the text
        text_tensor = tokenizer(text, return_tensors='pt', padding=True, truncation=True)['input_ids']
        # Generate text embedding
        text

In [59]:
# Example text input
text_description = "slow spin with a kick"

# Tokenize the text input (same as during training)
text_tensor = tokenizer(text_description, return_tensors='pt', padding=True, truncation=True)['input_ids']

# Generate dance sequence from text
generated_dance = generate_dance_from_text(text_tensor)
print(f"Generated dance sequence shape: {generated_dance.shape}")

Generated dance sequence shape: (30, 55, 3)


In [61]:
print(generated_dance)

[[[-2.06312513 -5.39422274 -1.61872065]
  [-2.0088675  -5.36202002 -1.3472805 ]
  [-2.14767075 -5.47529221 -1.29353142]
  ...
  [-2.15119791 -5.49386215 -1.08909988]
  [-2.08163595 -5.45382738 -0.83872068]
  [-1.97276092 -5.34514713 -1.16818988]]

 [[-2.09504938 -5.44246817 -1.6213491 ]
  [-2.04460382 -5.40191174 -1.35064662]
  [-2.1795938  -5.51978683 -1.29737091]
  ...
  [-2.18267584 -5.54136705 -1.09420991]
  [-2.11699033 -5.48914766 -0.83960086]
  [-2.00827456 -5.38622856 -1.17073238]]

 [[-2.12278891 -5.4755578  -1.62462592]
  [-2.07807064 -5.43500948 -1.35772872]
  [-2.20409489 -5.56046677 -1.30337548]
  ...
  [-2.20648813 -5.58359861 -1.10193825]
  [-2.15006471 -5.52228975 -0.84372193]
  [-2.0424273  -5.41833305 -1.17673659]]

 ...

 [[-2.15840364 -5.54407263 -1.63721228]
  [-2.11121583 -5.47131586 -1.36742949]
  [-2.2308507  -5.60166311 -1.31303537]
  ...
  [-2.23591852 -5.62535191 -1.11280727]
  [-2.18688416 -5.55673838 -0.8511765 ]
  [-2.0797677  -5.45239544 -1.18521094]]

 [