# Scientific Python Crash Course for Structural Bioinformatics

This crash course introduces the essential scientific Python libraries through structural bioinformatics examples.

**Goal:** Learn NumPy, Plotly, SciPy, and Pandas by working with protein structure coordinates.

**Context:** Proteins are 3D objects. We represent them as points (coordinates) in space. Today you'll learn to manipulate, analyze, and visualize these coordinates using Python's scientific computing tools.

In [1]:
# Check if running on Google Colab
try:
    from google.colab import drive

    is_google_colab = True
except ImportError:
    is_google_colab = False

# If on Google Colab, install packages
if is_google_colab:
    %pip install numpy==2.0.2 scipy==1.16.1 pandas==2.2.2 plotly==5.24.1

In [2]:
import numpy as np
import pandas as pd
import scipy
import plotly
import plotly.graph_objects as go
import plotly.express as px

In [3]:
# Print versions
print(f"Running on Google Colab: {is_google_colab}")
print(f"numpy=={np.__version__}")
print(f"pandas=={pd.__version__}")
print(f"scipy=={scipy.__version__}")
print(f"plotly=={plotly.__version__}")

Running on Google Colab: False
numpy==2.0.2
pandas==2.2.2
scipy==1.16.1
plotly==5.24.1


## Overview

We'll cover:
- **NumPy** - Working with 3D coordinates
- **Plotly** - Interactive 3D visualizations (works everywhere!)
- **SciPy** - Distance calculations and transformations
- **Pandas** - Organizing residue properties

---
## NumPy - The Foundation of Scientific Computing

[NumPy](https://numpy.org/) is the fundamental package for scientific computing with Python.

**Why NumPy for structural bioinformatics?**
- Proteins are 3D objects → we need multi-dimensional arrays
- Operations on thousands of atoms → we need efficiency
- Mathematical calculations (distances, angles, rotations) → we need convenient functions

**Key concept:** NumPy arrays are like lists, but:
- All elements must be the same type (usually numbers)
- Operations work on entire arrays at once (vectorization)
- Much faster for numerical computations
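
As a minimal sketch of what vectorization looks like in practice, the snippet below doubles a few numbers twice: once with a plain Python loop and once with a single NumPy expression. The values are hypothetical and chosen only for illustration.

```python
import numpy as np

values = [1.0, 2.5, 4.0]  # hypothetical numbers, for illustration only

# Plain Python: visit the elements one by one
doubled_list = [v * 2 for v in values]

# NumPy: one expression operates on the whole array at once
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
print(arr.dtype)  # every element shares a single type
```

Note that `values * 2` on a plain list would repeat the list instead of doubling each number, which is one reason arrays are more convenient for arithmetic.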

### Creating Arrays: Protein Coordinates

**Biological context:** Every atom in a protein has (x, y, z) coordinates in 3D space.

The **CA (alpha carbon)** is the central carbon atom in each amino acid's backbone. It's commonly used to represent protein structure because:
- One CA per amino acid (residue)
- Traces the protein backbone path
- Reduces complexity while preserving overall shape

In [4]:
# Coordinates of CA atoms for 5 amino acids
# Each row: [x, y, z] in Angstroms (Å)
# Could be any 3D points: atoms, molecules, cells, etc.

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],  # CA atom of residue 1
        [12.1, 15.8, 9.1],  # CA atom of residue 2
        [15.3, 17.2, 10.5],  # CA atom of residue 3
        [18.9, 19.1, 11.8],  # CA atom of residue 4
        [22.1, 20.5, 13.2],  # CA atom of residue 5
    ]
)

ca_coords

array([[10.5, 12.3,  8.2],
       [12.1, 15.8,  9.1],
       [15.3, 17.2, 10.5],
       [18.9, 19.1, 11.8],
       [22.1, 20.5, 13.2]])

In [5]:
# Array properties
print("Number of dimensions:", ca_coords.ndim)  # 2D array
print("Shape (rows, columns):", ca_coords.shape)  # 5 atoms × 3 coordinates
print("Total number of elements:", ca_coords.size)  # 15 numbers total
print("Data type:", ca_coords.dtype)  # float64 (decimal numbers)

Number of dimensions: 2
Shape (rows, columns): (5, 3)
Total number of elements: 15
Data type: float64


**Understanding shape:**
- Shape `(5, 3)` means 5 rows (atoms) × 3 columns (x, y, z)
- For N atoms: shape is always `(N, 3)` for 3D coordinates
- This is the standard format for protein structure data

### Indexing: Accessing Specific Atoms and Coordinates

**Key Python concept:** Indexing starts at 0!
- First element: index 0
- Last element: index -1

In [6]:
# Get first atom (residue 1)
first_atom = ca_coords[0]
print("First CA atom coordinates:", first_atom)

First CA atom coordinates: [10.5 12.3  8.2]


In [7]:
# Get last atom (residue 5)
last_atom = ca_coords[-1]
print("Last CA atom coordinates:", last_atom)

Last CA atom coordinates: [22.1 20.5 13.2]


In [8]:
# Get specific coordinate: atom 0, z-coordinate (index 2)
z_coord = ca_coords[0, 2]
print("Z-coordinate of first atom:", z_coord, "Å")

Z-coordinate of first atom: 8.2 Å


In [9]:
# Get all x-coordinates (first column)
all_x = ca_coords[:, 0]
print("All x-coordinates:", all_x)

All x-coordinates: [10.5 12.1 15.3 18.9 22.1]


In [10]:
# Get first 3 atoms (slicing)
first_three = ca_coords[0:3]  # or ca_coords[:3]
print("First 3 atoms:")
print(first_three)

First 3 atoms:
[[10.5 12.3  8.2]
 [12.1 15.8  9.1]
 [15.3 17.2 10.5]]


**Slicing notation:** `[start:stop]`
- Includes `start` index
- Excludes `stop` index
- `[0:3]` gives indices 0, 1, 2 (not 3!)

### Array Operations: Transforming Coordinates

**Biological operations:**
1. **Translation** - Moving protein in space (shift all coordinates)
2. **Centering** - Place geometric center at origin
3. **Scaling** - Convert units (e.g., Å to nm)

In [11]:
# Translation: shift entire protein by 5 Å in x-direction
shift = np.array([5.0, 0.0, 0.0])
translated = ca_coords + shift

print("Original first atom:", ca_coords[0])
print("Translated first atom:", translated[0])
print("Difference:", translated[0] - ca_coords[0])

Original first atom: [10.5 12.3  8.2]
Translated first atom: [15.5 12.3  8.2]
Difference: [5. 0. 0.]


**Broadcasting:** NumPy automatically extends the 1D shift `[5, 0, 0]` to work with all 5 atoms. This is much faster than looping!
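
To make the broadcasting rule concrete, here is a small sketch that compares the broadcast addition with an explicit loop over atoms. It uses only the first two CA atoms from above; the loop is shown purely for comparison.

```python
import numpy as np

coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
    ]
)  # shape (2, 3): two atoms
shift = np.array([5.0, 0.0, 0.0])  # shape (3,)

# Broadcasting: the (3,) shift is applied to every row of the (2, 3) array
broadcast_result = coords + shift

# Equivalent explicit loop, one atom at a time
loop_result = np.empty_like(coords)
for i in range(coords.shape[0]):
    loop_result[i] = coords[i] + shift

print(coords.shape, shift.shape, broadcast_result.shape)
print("Same result:", np.array_equal(broadcast_result, loop_result))
```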

In [12]:
# Calculate geometric center (centroid)
center = np.mean(ca_coords, axis=0)  # axis=0 means average over rows (atoms)
print("Geometric center:", center, "Å")
print("This is the 'middle point' of the protein")

Geometric center: [15.78 16.98 10.56] Å
This is the 'middle point' of the protein


In [13]:
# Center the protein (move center to origin [0, 0, 0])
centered_coords = ca_coords - center

print("Centered coordinates:")
print(centered_coords)
print("\nNew center:", np.mean(centered_coords, axis=0))  # Should be ~[0, 0, 0]

Centered coordinates:
[[-5.28 -4.68 -2.36]
 [-3.68 -1.18 -1.46]
 [-0.48  0.22 -0.06]
 [ 3.12  2.12  1.24]
 [ 6.32  3.52  2.64]]

New center: [-1.0658141e-15  0.0000000e+00  1.0658141e-15]


**Why centering matters:**
- Before comparing two proteins, we center them both
- Removes arbitrary position in space
- Focuses on the shape, not the location
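
A short sketch of why centering removes position: the same fragment is placed somewhere else in space (the offset is an arbitrary, hypothetical choice), and after centering both copies coincide.

```python
import numpy as np

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)

# Same fragment moved to an arbitrary (hypothetical) position
moved = ca_coords + np.array([100.0, -50.0, 25.0])

# Center each copy on its own geometric center
centered_a = ca_coords - ca_coords.mean(axis=0)
centered_b = moved - moved.mean(axis=0)

# After centering, the two copies match up to floating-point rounding
print("Identical after centering:", np.allclose(centered_a, centered_b))
```

Centering removes translation only; a rotated copy would still differ and needs a rotation step as well.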

In [14]:
# Scaling: convert Angstroms to nanometers (1 nm = 10 Å)
coords_nm = ca_coords / 10.0
print("Original (Å):", ca_coords[0])
print("Converted (nm):", coords_nm[0])

Original (Å): [10.5 12.3  8.2]
Converted (nm): [1.05 1.23 0.82]


### Mathematical Operations: Analyzing Structure

**Common structural biology calculations:**
- Distance between atoms
- Radius of gyration (compactness)
- Root Mean Square Deviation (RMSD)
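
RMSD is listed above but not computed in the cells that follow, so here is a minimal sketch. It assumes both coordinate sets have the same atoms in the same order and are already superimposed; the helper name `rmsd` and the second conformation (every atom displaced by 1 Å along z) are hypothetical.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) arrays with matching atom order (no superposition)."""
    diff = coords_a - coords_b
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))


ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)

# Hypothetical second conformation: every atom displaced by 1 Å along z
displaced = ca_coords + np.array([0.0, 0.0, 1.0])

# Every atom moved by exactly 1 Å, so the RMSD is 1.00 Å
print(f"RMSD: {rmsd(ca_coords, displaced):.2f} Å")
```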

In [15]:
# Distance between first and last atom
# Formula: distance = √[(x₂-x₁)² + (y₂-y₁)² + (z₂-z₁)²]

atom1 = ca_coords[0]
atom2 = ca_coords[-1]

# Method 1: Step by step
diff = atom2 - atom1  # Vector difference
squared_diff = diff**2  # Square each component
sum_squared = np.sum(squared_diff)  # Sum x², y², z²
distance = np.sqrt(sum_squared)  # Take square root

print(f"Distance between first and last CA: {distance:.2f} Å")

# Method 2: Compact (same result)
distance_compact = np.sqrt(np.sum((atom2 - atom1) ** 2))
print(f"Same calculation, compact: {distance_compact:.2f} Å")

Distance between first and last CA: 15.06 Å
Same calculation, compact: 15.06 Å


**Biological interpretation:**
- Our 5-residue fragment spans ~15 Å end-to-end
- 5 residues are joined by 4 CA-CA steps of ~3.8 Å each → at most ~15.2 Å if the chain were perfectly straight
- Our value is close to that limit → the fragment is almost fully extended

In [16]:
# Radius of gyration - measure of structural compactness
# Rg = √[mean(distance² from center)]

displacements = centered_coords  # Vectors from the center (already centered)
squared_distances = np.sum(displacements**2, axis=1)  # x²+y²+z² for each atom
rg = np.sqrt(np.mean(squared_distances))

print(f"Radius of gyration: {rg:.2f} Å")
print("This tells us how 'spread out' the structure is")

Radius of gyration: 5.44 Å
This tells us how 'spread out' the structure is


### Creating Special Arrays - Useful Patterns

In [17]:
# Array of zeros - initialize empty coordinates
empty_coords = np.zeros((10, 3))  # 10 atoms, 3 coordinates each
print("Empty coordinate array:")
print(empty_coords)

Empty coordinate array:
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [18]:
# Array of ones - useful as a starting point for scale factors or weights
ones_vector = np.ones(3)
print("Vector of ones:", ones_vector)  # Not a unit vector: its length is √3

Vector of ones: [1. 1. 1.]


In [19]:
# Identity matrix - the rotation matrix that means "no rotation"
identity = np.eye(3)
print("3×3 Identity matrix:")
print(identity)

3×3 Identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [20]:
# Evenly spaced values - useful for generating test coordinates
x_line = np.linspace(0, 20, 5)  # 5 points from 0 to 20
print("Evenly spaced x-coordinates:", x_line)

Evenly spaced x-coordinates: [ 0.  5. 10. 15. 20.]


In [21]:
# Random coordinates - useful for testing (values change on every run)
random_coords = np.random.randn(5, 3)  # 5 atoms, x/y/z drawn from a standard normal
print("Random coordinates:")
print(random_coords)

Random coordinates:
[[ 0.70501077  1.21458847 -0.4368845 ]
 [-0.0464494  -0.20241228  1.38322103]
 [-0.93768468  1.27601327  0.24282305]
 [-0.45400306 -0.05542865 -2.41252814]
 [-1.54202099 -0.78678115 -0.2174304 ]]


### Boolean Indexing: Finding Specific Atoms

In [22]:
# Find atoms with x-coordinate > 15 Å
mask = ca_coords[:, 0] > 15  # Boolean array: True/False for each atom
print("Which atoms have x > 15?", mask)

# Get coordinates of those atoms
filtered_atoms = ca_coords[mask]
print("\nAtoms with x > 15 Å:")
print(filtered_atoms)

Which atoms have x > 15? [False False  True  True  True]

Atoms with x > 15 Å:
[[15.3 17.2 10.5]
 [18.9 19.1 11.8]
 [22.1 20.5 13.2]]


In [23]:
# Find atoms in a specific region: x > 12 AND y < 20
region_mask = (ca_coords[:, 0] > 12) & (ca_coords[:, 1] < 20)
print("Atoms in region (x>12 and y<20):")
print(ca_coords[region_mask])

Atoms in region (x>12 and y<20):
[[12.1 15.8  9.1]
 [15.3 17.2 10.5]
 [18.9 19.1 11.8]]


**Use case:** Finding atoms in a specific region of space (e.g., binding pocket, membrane interface)
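
As a sketch of the binding-pocket use case, the same masking idea works with a spherical region: compute each atom's distance to a center point and keep the atoms inside a radius. The pocket center and radius below are hypothetical values chosen for illustration.

```python
import numpy as np

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)

# Hypothetical pocket definition
pocket_center = np.array([15.0, 17.0, 10.0])
pocket_radius = 5.0  # Å

# Distance of every atom to the pocket center (one value per row)
dist_to_center = np.linalg.norm(ca_coords - pocket_center, axis=1)

# Boolean mask: True for atoms inside the sphere
in_pocket = dist_to_center < pocket_radius
print("Inside pocket?", in_pocket)
print(ca_coords[in_pocket])
```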

---
## Plotly - Interactive 3D Visualizations

[Plotly](https://plotly.com/python/) is a powerful interactive visualization library.

**Why interactive visualization matters:**
- Verify calculations make sense through exploration
- Understand protein shape and features from all angles
- Communicate results with engaging, interactive plots

**For structural bioinformatics:**
- Interactive 3D scatter plots show atom positions
- Rotatable structures reveal hidden features
- Hover information provides instant details
- Professional plots ready for presentations

### Interactive 3D Visualization: Exploring Protein Structures

**Universal Compatibility:**
- ✅ **Local Jupyter Notebook** - Perfect interactive experience
- ✅ **Google Colab** - Full interactivity supported
- ✅ **VS Code Jupyter** - Excellent integration
- ✅ **JupyterLab** - Works seamlessly

**Why Plotly for Structural Bioinformatics:**
- 🔄 **Full 3D rotation** - Click and drag to explore structure
- 🔍 **Zoom and Pan** - Detailed examination of regions
- 📍 **Hover information** - Instant residue details
- 💾 **Export options** - Save as interactive HTML
- 🎨 **Professional styling** - Publication-ready plots

In [24]:
# Interactive 3D plot with Plotly (works everywhere!)
# Create colors: gradient from blue to red for N-terminus to C-terminus
residue_numbers = np.arange(len(ca_coords))
colors = residue_numbers / (len(ca_coords) - 1)  # Normalize to 0-1

fig_plotly = go.Figure()

# Add CA atoms as scatter points
fig_plotly.add_trace(
    go.Scatter3d(
        x=ca_coords[:, 0],
        y=ca_coords[:, 1],
        z=ca_coords[:, 2],
        mode="markers+text",
        marker=dict(
            size=15,
            color=colors,
            colorscale="RdBu_r",  # Blue to Red (N-terminus to C-terminus)
            opacity=0.8,
            line=dict(width=2, color="black"),
        ),
        text=[f"Res {i + 1}" for i in range(len(ca_coords))],
        textposition="middle right",
        name="CA atoms",
        hovertemplate="<b>%{text}</b><br>"
        + "X: %{x:.2f} Å<br>"
        + "Y: %{y:.2f} Å<br>"
        + "Z: %{z:.2f} Å<extra></extra>",
    )
)

# Add backbone trace
fig_plotly.add_trace(
    go.Scatter3d(
        x=ca_coords[:, 0],
        y=ca_coords[:, 1],
        z=ca_coords[:, 2],
        mode="lines",
        line=dict(color="black", width=8),
        name="Backbone",
        showlegend=True,
        hoverinfo="skip",
    )
)

# Update layout for better visualization
fig_plotly.update_layout(
    title={
        "text": "Interactive 3D Protein Structure<br><sub>Blue=N-terminus, Red=C-terminus</sub>",
        "x": 0.5,
        "font": {"size": 16},
    },
    scene=dict(
        xaxis_title="X coordinate (Å)",
        yaxis_title="Y coordinate (Å)",
        zaxis_title="Z coordinate (Å)",
        camera=dict(eye=dict(x=1.2, y=1.2, z=1.2)),
        aspectmode="data",  # Same scale on all axes, so distances are not distorted
    ),
    width=900,
    height=700,
    showlegend=True,
)

fig_plotly.show()

print("✅ Interactive plot created! You can:")
print("   🔄 Click and drag to rotate")
print("   🔍 Scroll to zoom")
print("   📍 Hover over points for details")

✅ Interactive plot created! You can:
   🔄 Click and drag to rotate
   🔍 Scroll to zoom
   📍 Hover over points for details


---
## SciPy - Scientific Computing Tools

[SciPy](https://scipy.org/) builds on NumPy with advanced scientific functions.

**For structural bioinformatics:**
- Distance calculations (all pairwise distances)
- Spatial data structures (fast neighbor finding)
- Linear algebra (rotations, transformations)
- Optimization (structure fitting)

### Distance Calculations: Contact Analysis

In [25]:
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
# pdist computes distances efficiently (N*(N-1)/2 comparisons)
distances_condensed = pdist(ca_coords)  # Condensed format
print("Number of pairwise distances:", len(distances_condensed))
print("Formula: N*(N-1)/2 = 5*4/2 =", 5 * 4 // 2)

# Convert to square matrix for easier interpretation
distance_matrix = squareform(distances_condensed)
print("\nDistance matrix shape:", distance_matrix.shape)
print("\nDistance matrix (Å):")
print(distance_matrix.round(2))

Number of pairwise distances: 10
Formula: N*(N-1)/2 = 5*4/2 = 10

Distance matrix shape: (5, 5)

Distance matrix (Å):
[[ 0.    3.95  7.23 11.39 15.06]
 [ 3.95  0.    3.76  8.03 11.79]
 [ 7.23  3.76  0.    4.27  8.03]
 [11.39  8.03  4.27  0.    3.76]
 [15.06 11.79  8.03  3.76  0.  ]]


**Understanding the distance matrix:**
- Symmetric: distance(i,j) = distance(j,i)
- Diagonal is zero: distance from atom to itself = 0
- Each entry (i, j) is the distance in Å between atoms i and j

In [26]:
# Interactive distance matrix heatmap with Plotly
fig_heatmap = go.Figure(
    data=go.Heatmap(
        z=distance_matrix,
        x=[f"Res {i + 1}" for i in range(len(distance_matrix))],
        y=[f"Res {i + 1}" for i in range(len(distance_matrix))],
        colorscale="Viridis",
        colorbar=dict(title="Distance (Å)"),
        hovertemplate="<b>%{y} → %{x}</b><br>" + "Distance: %{z:.2f} Å<extra></extra>",
    )
)

fig_heatmap.update_layout(
    title="CA-CA Distance Matrix<br><sub>Interactive heatmap - hover for details</sub>",
    xaxis_title="Residue",
    yaxis_title="Residue",
    width=600,
    height=500,
)

fig_heatmap.show()

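**Interpreting distance patterns:**
- Diagonal: always 0 (distance from a residue to itself)
- First off-diagonal: sequential residues are ~3.8 Å apart (the typical CA-CA distance across one peptide bond)
- Dark purple/blue elsewhere off the diagonal: residues close in 3D space (contacts)
- Yellow: residues far apart in 3D space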
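
Distances between consecutive residues sit on the first off-diagonal of the matrix, at positions (i, i+1). A small sketch of pulling them out with `np.diag`:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)
distance_matrix = squareform(pdist(ca_coords))

# k=1 selects the entries (i, i+1): distances between consecutive residues
sequential = np.diag(distance_matrix, k=1)
print("Consecutive CA-CA distances (Å):", sequential.round(2))
print(f"Path length along the backbone: {sequential.sum():.2f} Å")
```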

In [27]:
# Contact map: which residues are "in contact"?
contact_threshold = 8.0  # Å
contact_map = distance_matrix < contact_threshold

print("Contact map (True = within 8 Å):")
print(contact_map)

# Count contacts for each residue (exclude self-contact)
np.fill_diagonal(contact_map, False)  # Remove diagonal
contact_counts = np.sum(contact_map, axis=1)
print("\nNumber of contacts per residue:", contact_counts)

Contact map (True = within 8 Å):
[[ True  True  True False False]
 [ True  True  True False False]
 [ True  True  True  True False]
 [False False  True  True  True]
 [False False False  True  True]]

Number of contacts per residue: [2 2 3 2 1]


In [28]:
# Interactive contact map with Plotly
fig_contact = go.Figure(
    data=go.Heatmap(
        z=contact_map.astype(int),
        x=[f"Res {i + 1}" for i in range(len(contact_map))],
        y=[f"Res {i + 1}" for i in range(len(contact_map))],
        colorscale=[[0, "lightblue"], [1, "darkred"]],
        colorbar=dict(title="In Contact?", tickvals=[0, 1], ticktext=["No", "Yes"]),
        hovertemplate="<b>%{y} → %{x}</b><br>"
        + "In contact: %{z}<br>"
        + f"Threshold: {contact_threshold} Å<extra></extra>",
    )
)

fig_contact.update_layout(
    title=f"Contact Map (threshold = {contact_threshold} Å)<br><sub>Red = in contact, Blue = distant</sub>",
    xaxis_title="Residue",
    yaxis_title="Residue",
    width=600,
    height=500,
)

fig_contact.show()

**Biological significance of contacts:**
- Residues in contact can interact (hydrogen bonds, hydrophobic interactions)
- Contact patterns define protein fold
- Structure prediction methods use predicted contacts or inter-residue distances (the first AlphaFold predicted distance distributions between residue pairs)
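
Residues that are neighbors in sequence are always close in space, so contact analyses often ignore them. The sketch below keeps only contacts between residues at least `min_separation` positions apart; that cutoff value is a hypothetical choice for this tiny fragment, not a standard.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)
distance_matrix = squareform(pdist(ca_coords))

contact_threshold = 8.0  # Å
min_separation = 2  # hypothetical cutoff: skip pairs (i, i+1)

# |i - j| for every pair of residues, built with broadcasting
idx = np.arange(len(ca_coords))
seq_separation = np.abs(idx[:, None] - idx[None, :])

nonlocal_contacts = (distance_matrix < contact_threshold) & (
    seq_separation >= min_separation
)

# Upper triangle only, so each pair is listed once (0-based indices)
contact_pairs = np.argwhere(np.triu(nonlocal_contacts))
print("Non-local contact pairs:", contact_pairs.tolist())
```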

### Spatial Data Structures: Finding Neighbors Efficiently

In [29]:
from scipy.spatial import KDTree

# Build KDTree for fast spatial queries
tree = KDTree(ca_coords)

# Find all atoms within 8 Å of the first atom
# (the result includes the query atom itself, index 0)
query_point = ca_coords[0]
radius = 8.0

indices = tree.query_ball_point(query_point, radius)
print(f"Atoms within {radius} Å of first atom:")
print(f"Indices: {indices}")
print(f"Number of atoms found (including the query atom): {len(indices)}")

Atoms within 8.0 Å of first atom:
Indices: [0, 1, 2]
Number of atoms found (including the query atom): 3


**Why KDTree?**
- Simple approach: compare every pair of atoms → ~N² distance calculations
- KDTree: build once (~N log N), then each neighbor query takes roughly log N steps
- Essential for proteins with thousands of atoms
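
The tree can also return every close pair in a single call, which is the usual way to get contacts for a large structure. A minimal sketch using `KDTree.query_pairs` on the same five atoms:

```python
import numpy as np
from scipy.spatial import KDTree

ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)

tree = KDTree(ca_coords)

# All pairs (i, j) with i < j whose distance is at most 8 Å
pairs = tree.query_pairs(r=8.0)
print("Pairs within 8 Å:", sorted(pairs))
```

Each pair appears once and no atom is paired with itself, so the result needs no diagonal cleanup.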

### Linear Algebra: Rotation and Alignment

**Common operation:** Rotate protein to align with reference orientation

In [30]:
from scipy.linalg import svd

# Singular Value Decomposition (SVD) - the core step of the Kabsch algorithm
# Here we apply it to one centered structure to get its principal axes;
# Kabsch applies the same decomposition to the covariance of two structures

# Center coordinates first
coords_centered = ca_coords - np.mean(ca_coords, axis=0)

# Compute SVD (mathematical details not essential)
U, S, Vt = svd(coords_centered.T)

print("SVD components:")
print("U shape:", U.shape, "- Left singular vectors")
print("S shape:", S.shape, "- Singular values")
print("Vt shape:", Vt.shape, "- Right singular vectors")
print("\nSingular values:", S)
print("These describe the 'spread' along principal axes")

SVD components:
U shape: (3, 3) - Left singular vectors
S shape: (3,) - Singular values
Vt shape: (5, 5) - Right singular vectors

Singular values: [12.04775508  1.61384289  0.1229176 ]
These describe the 'spread' along principal axes


**SVD in structural biology:**
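- Kabsch algorithm (optimal superposition of two structures) uses SVD
- Principal Component Analysis (PCA) uses SVD
- Essential for comparing structures (RMSD after optimal superposition)

You don't need to understand the math, just that:
- SVD of one centered structure gives its principal axes (what we computed above)
- SVD of the covariance matrix between two centered structures gives the rotation that minimizes RMSD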
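
For readers who want to see the Kabsch idea in code, here is a minimal sketch under simple assumptions: both structures have the same atoms in the same order, and the "second structure" is just our fragment rotated by 30° about z and shifted by an arbitrary offset. The helper name `kabsch_rotation` is hypothetical; the cell above does not perform this alignment.

```python
import numpy as np
from scipy.linalg import svd


def kabsch_rotation(mobile, reference):
    """Rotation matrix that best maps centered `mobile` onto centered `reference`."""
    # 3×3 covariance matrix between the two centered coordinate sets
    H = mobile.T @ reference
    U, S, Vt = svd(H)
    # Flip one axis if needed so the result is a proper rotation, not a reflection
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T


ca_coords = np.array(
    [
        [10.5, 12.3, 8.2],
        [12.1, 15.8, 9.1],
        [15.3, 17.2, 10.5],
        [18.9, 19.1, 11.8],
        [22.1, 20.5, 13.2],
    ]
)

# Hypothetical second structure: rotate 30° about z, then translate
theta = np.deg2rad(30.0)
rot_z = np.array(
    [
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
moved = ca_coords @ rot_z.T + np.array([5.0, -3.0, 2.0])

# Step 1: center both structures
ref_centered = ca_coords - ca_coords.mean(axis=0)
mob_centered = moved - moved.mean(axis=0)

# Step 2: find the rotation and apply it (rows are atoms, hence R.T)
R = kabsch_rotation(mob_centered, ref_centered)
aligned = mob_centered @ R.T

# Step 3: RMSD after superposition (close to 0 because the shapes are identical)
rmsd_after = np.sqrt(np.mean(np.sum((aligned - ref_centered) ** 2, axis=1)))
print(f"RMSD after superposition: {rmsd_after:.4f} Å")
```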

### Statistical Analysis: Distribution of Distances

In [31]:
# Interactive distance distribution with Plotly
distances_list = distance_matrix[np.triu_indices_from(distance_matrix, k=1)]

print("Distance statistics:")
print(f"Mean distance: {np.mean(distances_list):.2f} Å")
print(f"Std deviation: {np.std(distances_list):.2f} Å")
print(f"Min distance: {np.min(distances_list):.2f} Å")
print(f"Max distance: {np.max(distances_list):.2f} Å")

# Create interactive histogram
fig_hist = go.Figure()

fig_hist.add_trace(
    go.Histogram(
        x=distances_list,
        nbinsx=10,
        name="Distance Distribution",
        opacity=0.7,
        marker_color="steelblue",
        hovertemplate="Distance: %{x:.2f} Å<br>Count: %{y}<extra></extra>",
    )
)

# Add mean line
mean_dist = np.mean(distances_list)
fig_hist.add_vline(
    x=mean_dist,
    line_dash="dash",
    line_color="red",
    annotation_text=f"Mean = {mean_dist:.2f} Å",
)

fig_hist.update_layout(
    title="Distribution of CA-CA Distances<br><sub>Interactive histogram with statistics</sub>",
    xaxis_title="Distance (Å)",
    yaxis_title="Count",
    width=800,
    height=500,
)

fig_hist.show()

Distance statistics:
Mean distance: 7.73 Å
Std deviation: 3.76 Å
Min distance: 3.76 Å
Max distance: 15.06 Å


---
## Pandas - Organizing Residue Properties

[Pandas](https://pandas.pydata.org/) provides DataFrames - spreadsheet-like data structures.

**Why Pandas for structural bioinformatics?**
- Proteins have multiple properties per residue (coordinates, B-factor, type, etc.)
- Need to filter, sort, and analyze these properties
- Easy to export results to CSV for other tools

**DataFrame = table with labeled columns**

### Creating a DataFrame: Residue Properties

In [32]:
# Create DataFrame with residue information
residue_data = pd.DataFrame(
    {
        "residue_number": [1, 2, 3, 4, 5],
        "residue_name": ["ALA", "VAL", "GLY", "LEU", "GLY"],
        "chain": ["A", "A", "A", "A", "A"],
        "atom_name": ["CA", "CA", "CA", "CA", "CA"],  # Atom type
        "x": ca_coords[:, 0],
        "y": ca_coords[:, 1],
        "z": ca_coords[:, 2],
        "b_factor": [25.3, 18.7, 32.1, 15.2, 28.9],  # Temperature factor (Å²)
    }
)

residue_data

   residue_number residue_name chain atom_name     x     y     z  b_factor
0               1          ALA     A        CA  10.5  12.3   8.2      25.3
1               2          VAL     A        CA  12.1  15.8   9.1      18.7
2               3          GLY     A        CA  15.3  17.2  10.5      32.1
3               4          LEU     A        CA  18.9  19.1  11.8      15.2
4               5          GLY     A        CA  22.1  20.5  13.2      28.9


**Understanding the properties:**
- **residue_number**: Position in chain (1, 2, 3...)
- **residue_name**: 3-letter amino acid code (ALA = alanine, GLY = glycine, etc.)
- **chain**: Protein chain identifier (A, B, C...)
- **atom_name**: Type of atom (CA = alpha carbon)
- **x, y, z**: CA atom coordinates (Å)
- **b_factor**: Thermal motion/disorder (higher = more flexible)

### Basic DataFrame Operations

In [33]:
# Get column names
print("Columns:", residue_data.columns.tolist())

# Get shape
print(f"Shape: {residue_data.shape[0]} rows × {residue_data.shape[1]} columns")

# Get data types
print("\nData types:")
print(residue_data.dtypes)

Columns: ['residue_number', 'residue_name', 'chain', 'atom_name', 'x', 'y', 'z', 'b_factor']
Shape: 5 rows × 8 columns

Data types:
residue_number      int64
residue_name       object
chain              object
atom_name          object
x                 float64
y                 float64
z                 float64
b_factor          float64
dtype: object


In [34]:
# Select single column
bfactors = residue_data["b_factor"]
print("B-factors:")
print(bfactors)

B-factors:
0    25.3
1    18.7
2    32.1
3    15.2
4    28.9
Name: b_factor, dtype: float64


In [35]:
# Select multiple columns
coords_df = residue_data[["x", "y", "z"]]
print("Coordinate columns:")
print(coords_df)

Coordinate columns:
      x     y     z
0  10.5  12.3   8.2
1  12.1  15.8   9.1
2  15.3  17.2  10.5
3  18.9  19.1  11.8
4  22.1  20.5  13.2


In [36]:
# Select row by index
first_residue = residue_data.iloc[0]  # iloc = integer location
print("First residue:")
print(first_residue)

First residue:
residue_number       1
residue_name       ALA
chain                A
atom_name           CA
x                 10.5
y                 12.3
z                  8.2
b_factor          25.3
Name: 0, dtype: object


In [37]:
# Select specific value
bfactor_res3 = residue_data.loc[2, "b_factor"]  # loc = label-based: row label 2, b_factor column
print(f"B-factor of residue 3: {bfactor_res3}")

B-factor of residue 3: 32.1
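
`iloc` selects by position and `loc` selects by label. In the table above the two agree because the row labels are 0, 1, 2, and so on, but they diverge as soon as rows are reordered. A small sketch with a hypothetical three-row table:

```python
import pandas as pd

# Hypothetical mini-table, for illustration only
df = pd.DataFrame(
    {
        "residue_name": ["ALA", "VAL", "GLY"],
        "b_factor": [25.3, 18.7, 32.1],
    }
)

# Sorting reorders the rows but keeps their original labels (1, 0, 2)
df_sorted = df.sort_values("b_factor")

print(df_sorted.iloc[0]["residue_name"])  # First row by position: VAL
print(df_sorted.loc[0, "residue_name"])  # Row with label 0: ALA
```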


### Filtering Data: Finding Specific Residues

In [38]:
# Find residues with high B-factor (flexible regions)
flexible = residue_data[residue_data["b_factor"] > 25]
print("Flexible residues (B-factor > 25):")
print(flexible)

Flexible residues (B-factor > 25):
   residue_number residue_name chain atom_name     x     y     z  b_factor
0               1          ALA     A        CA  10.5  12.3   8.2      25.3
2               3          GLY     A        CA  15.3  17.2  10.5      32.1
4               5          GLY     A        CA  22.1  20.5  13.2      28.9


In [39]:
# Find glycine residues (most flexible backbone: the side chain is a single hydrogen)
glycines = residue_data[residue_data["residue_name"] == "GLY"]
print("Glycine residues:")
print(glycines)

Glycine residues:
   residue_number residue_name chain atom_name     x     y     z  b_factor
2               3          GLY     A        CA  15.3  17.2  10.5      32.1
4               5          GLY     A        CA  22.1  20.5  13.2      28.9


In [40]:
# Multiple conditions: flexible (B-factor > 25, as above) AND among the first 3 residues
condition = (residue_data["b_factor"] > 25) & (residue_data["residue_number"] <= 3)
filtered = residue_data[condition]
print("Flexible residues in first half:")
print(filtered)

Flexible residues in first half:
   residue_number residue_name chain atom_name     x     y     z  b_factor
0               1          ALA     A        CA  10.5  12.3   8.2      25.3
2               3          GLY     A        CA  15.3  17.2  10.5      32.1


### Adding Calculated Columns

In [41]:
# Calculate distance from origin for each residue
residue_data["distance_from_origin"] = np.sqrt(
    residue_data["x"] ** 2 + residue_data["y"] ** 2 + residue_data["z"] ** 2
)

print("DataFrame with calculated column:")
print(residue_data[["residue_number", "residue_name", "distance_from_origin"]])

DataFrame with calculated column:
   residue_number residue_name  distance_from_origin
0               1          ALA             18.132292
1               2          VAL             21.882870
2               3          GLY             25.301779
3               4          LEU             29.347232
4               5          GLY             32.907446


### Summary Statistics

In [42]:
# Statistical summary of numerical columns
print("Summary statistics:")
print(residue_data.describe())

Summary statistics:
       residue_number          x          y          z   b_factor  \
count        5.000000   5.000000   5.000000   5.000000   5.000000   
mean         3.000000  15.780000  16.980000  10.560000  24.040000   
std          1.581139   4.778284   3.171277   2.013206   7.014841   
min          1.000000  10.500000  12.300000   8.200000  15.200000   
25%          2.000000  12.100000  15.800000   9.100000  18.700000   
50%          3.000000  15.300000  17.200000  10.500000  25.300000   
75%          4.000000  18.900000  19.100000  11.800000  28.900000   
max          5.000000  22.100000  20.500000  13.200000  32.100000   

       distance_from_origin  
count              5.000000  
mean              25.514324  
std                5.853983  
min               18.132292  
25%               21.882870  
50%               25.301779  
75%               29.347232  
max               32.907446  

In [43]:
# Group by residue type (GLY appears twice in our protein)
residue_stats = residue_data.groupby("residue_name")["b_factor"].agg(
    ["count", "mean", "min", "max"]
)
print("B-factor statistics by residue type:")
print(residue_stats)
print("\nNotice that GLY appears twice with different B-factors!")

B-factor statistics by residue type:
              count  mean   min   max
residue_name                         
ALA               1  25.3  25.3  25.3
GLY               2  30.5  28.9  32.1
LEU               1  15.2  15.2  15.2
VAL               1  18.7  18.7  18.7

Notice that GLY appears twice with different B-factors!


### Sorting and Ranking

In [44]:
# Sort by B-factor (most rigid to most flexible)
sorted_by_bfactor = residue_data.sort_values("b_factor")
print("Residues sorted by B-factor (rigid → flexible):")
print(sorted_by_bfactor[["residue_number", "residue_name", "b_factor"]])

Residues sorted by B-factor (rigid → flexible):
   residue_number residue_name  b_factor
3               4          LEU      15.2
1               2          VAL      18.7
0               1          ALA      25.3
4               5          GLY      28.9
2               3          GLY      32.1


In [45]:
# Sort by distance from origin (closest to farthest)
sorted_by_distance = residue_data.sort_values("distance_from_origin")
print("Residues sorted by distance from origin:")
print(sorted_by_distance[["residue_number", "distance_from_origin"]])

Residues sorted by distance from origin:
   residue_number  distance_from_origin
0               1             18.132292
1               2             21.882870
2               3             25.301779
3               4             29.347232
4               5             32.907446


### Visualization with Pandas

In [46]:
# Pandas with Plotly backend - much easier!
# Set Plotly as the plotting backend for pandas
pd.options.plotting.backend = "plotly"

# Simple scatter plot colored by B-factor
fig = residue_data.plot.scatter(
    x="residue_number",
    y="distance_from_origin",
    color="b_factor",
    size="b_factor",
    hover_data=["residue_name", "atom_name"],
    title="Residues Colored by B-factor<br><sub>Size and color represent flexibility</sub>",
    labels={
        "residue_number": "Residue Number",
        "distance_from_origin": "Distance from Origin (Å)",
        "b_factor": "B-factor (Å²)",
    },
    color_continuous_scale="Viridis",
)

fig.show()

print("✅ Pandas + Plotly backend makes visualization much easier!")
print("📊 The plot shows residue position vs distance, colored by flexibility")

✅ Pandas + Plotly backend makes visualization much easier!
📊 The plot shows residue position vs distance, colored by flexibility


### Exporting Results

In [47]:
# Save to CSV file
# residue_data.to_csv('residue_analysis.csv', index=False)
# print("Data saved to residue_analysis.csv")

# Save to Excel (requires openpyxl or xlsxwriter)
# residue_data.to_excel('residue_analysis.xlsx', index=False)

print("To export: uncomment the lines above")

To export: uncomment the lines above
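
To try the export without writing a file, the sketch below sends a small hypothetical table to an in-memory text buffer and reads it back. `to_csv` accepts a buffer in place of a file name, so the call is otherwise the same as in the commented lines above.

```python
import io

import pandas as pd

# Hypothetical mini-table, for illustration only
df = pd.DataFrame(
    {
        "residue_number": [1, 2],
        "residue_name": ["ALA", "VAL"],
        "b_factor": [25.3, 18.7],
    }
)

# Write CSV text into memory instead of onto disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and confirm nothing was lost
restored = pd.read_csv(io.StringIO(csv_text))
print("Round trip OK:", restored.equals(df))
```

Leaving out `index=False` adds an unlabeled first column of row labels, which shows up as `Unnamed: 0` when the file is read back.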


---
## Summary: What You've Learned

### NumPy - Arrays and Operations
- Create arrays for 3D coordinates
- Index and slice to access specific atoms
- Perform vector operations (translation, centering)
- Calculate geometric properties (distances, geometric center, radius of gyration)
- Boolean indexing to filter atoms

### Plotly - Interactive Visualization
- Create interactive 3D scatter plots of structures
- Customize colors, sizes, hover information
- Rotate, zoom, and explore structures from all angles
- Generate publication-quality interactive figures
- Export as interactive HTML for sharing

### SciPy - Scientific Computing
- Calculate pairwise distances efficiently
- Create contact maps
- Use spatial data structures (KDTree)
- Apply linear algebra (SVD for alignment)
- Analyze distance distributions

### Pandas - Data Organization
- Create DataFrames for residue properties
- Filter and select specific residues
- Add calculated columns
- Compute summary statistics
- Sort and rank residues
- Export results for further analysis

