## Metareader – exporting `atom_site` category to CSV and further usage

This notebook demonstrates an additional use case of the `metareader` script:
exporting the `atom_site` category to CSV format.

The `atom_site` category contains atomic-level information such as atom names,
residue identifiers, coordinates and occupancies. Exporting this category to CSV
allows easy downstream analysis, for example using the Pandas library. Atomic-level data can be filtered,
summarized and inspected using standard data analysis tools without the need
to work directly with the original mmCIF file.

One example structure is used:
- **1UBQ** (X-ray structure)

Input files are local mmCIF files stored in: `data/pdb/`

### Exporting atom_site for a single structure (1UBQ)

We first extract the `atom_site` category and export it to CSV.

In [6]:
!mkdir -p ../outputs/metareader_atom_site/1ubq

!python ../src/rnapolis/metareader.py \
  ../data/pdb/1ubq.cif \
  -c atom_site \
  --csv-directory ../outputs/metareader_atom_site/1ubq \
  > /dev/null

In [7]:
!ls ../outputs/metareader_atom_site/1ubq

atom_site.csv struct.csv


The directory now contains `atom_site.csv`, which holds the atomic information; a `struct.csv` file is written alongside it.
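The same export could also be driven from Python, for instance when the command has to be repeated for several files. The sketch below reuses only the script path and the `-c` / `--csv-directory` options shown in the cell above. The helper names `build_export_command` and `run_export` are hypothetical, and nothing else about the `metareader` command line is assumed.

```python
import subprocess
import sys
from pathlib import Path

# Script location as used in the cell above (relative to the notebook).
METAREADER = Path("../src/rnapolis/metareader.py")


def build_export_command(cif_path, out_dir, category="atom_site"):
    """Build the argument list for one CSV export (hypothetical helper)."""
    return [
        sys.executable,
        str(METAREADER),
        str(cif_path),
        "-c",
        category,
        "--csv-directory",
        str(out_dir),
    ]


def run_export(cif_path, out_dir):
    """Create the output directory and run the export, discarding stdout."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_export_command(cif_path, out_dir),
        check=True,
        stdout=subprocess.DEVNULL,
    )
```

Calling `run_export("../data/pdb/1ubq.cif", "../outputs/metareader_atom_site/1ubq")` would mirror the two shell commands above.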

In [10]:
!head -n 6 ../outputs/metareader_atom_site/1ubq/atom_site.csv

group_PDB,id,type_symbol,label_atom_id,label_alt_id,label_comp_id,label_asym_id,label_entity_id,label_seq_id,pdbx_PDB_ins_code,Cartn_x,Cartn_y,Cartn_z,occupancy,B_iso_or_equiv,pdbx_formal_charge,auth_seq_id,auth_comp_id,auth_asym_id,auth_atom_id,pdbx_PDB_model_num
ATOM,1,N,N,.,MET,A,1,1,?,27.340,24.430,2.614,1.00,9.67,?,1,MET,A,N,1
ATOM,2,C,CA,.,MET,A,1,1,?,26.266,25.413,2.842,1.00,10.38,?,1,MET,A,CA,1
ATOM,3,C,C,.,MET,A,1,1,?,26.913,26.639,3.531,1.00,9.62,?,1,MET,A,C,1
ATOM,4,O,O,.,MET,A,1,1,?,27.886,26.463,4.263,1.00,9.62,?,1,MET,A,O,1
ATOM,5,C,CB,.,MET,A,1,1,?,25.112,24.880,3.649,1.00,13.77,?,1,MET,A,CB,1


The first line of the file is the CSV header and defines all available columns.
Each subsequent line corresponds to a single atom in the structure.

The `atom_site` table contains detailed atomic information, including:
- the chemical element (`type_symbol`),
- atom and residue identifiers (`label_atom_id`, `label_comp_id`, `label_seq_id`),
- chain identifiers (`label_asym_id`),
- Cartesian coordinates (`Cartn_x`, `Cartn_y`, `Cartn_z`),
- occupancy and temperature factors (`occupancy`, `B_iso_or_equiv`).

This representation makes the atomic structure easy to process in a tabular form.

### Loading atom_site CSV into Pandas

The CSV file can be directly imported into Pandas for further analysis.

In [12]:
import pandas as pd

df = pd.read_csv("../outputs/metareader_atom_site/1ubq/atom_site.csv")

In [13]:
df.head()

Unnamed: 0,group_PDB,id,type_symbol,label_atom_id,label_alt_id,label_comp_id,label_asym_id,label_entity_id,label_seq_id,pdbx_PDB_ins_code,...,Cartn_y,Cartn_z,occupancy,B_iso_or_equiv,pdbx_formal_charge,auth_seq_id,auth_comp_id,auth_asym_id,auth_atom_id,pdbx_PDB_model_num
0,ATOM,1,N,N,.,MET,A,1,1,?,...,24.43,2.614,1.0,9.67,?,1,MET,A,N,1
1,ATOM,2,C,CA,.,MET,A,1,1,?,...,25.413,2.842,1.0,10.38,?,1,MET,A,CA,1
2,ATOM,3,C,C,.,MET,A,1,1,?,...,26.639,3.531,1.0,9.62,?,1,MET,A,C,1
3,ATOM,4,O,O,.,MET,A,1,1,?,...,26.463,4.263,1.0,9.62,?,1,MET,A,O,1
4,ATOM,5,C,CB,.,MET,A,1,1,?,...,24.88,3.649,1.0,13.77,?,1,MET,A,CB,1


Each row of the DataFrame represents one atom.
This makes it straightforward to perform filtering and simple statistics
without working directly with the original mmCIF file.
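In mmCIF, `?` marks an unknown value and `.` an inapplicable one, and both appear in the CSV as plain strings (see `label_alt_id` and `pdbx_PDB_ins_code` in the preview). The following is a minimal sketch of how they could be mapped to missing values at load time. It reads an in-memory copy of the first rows shown earlier, so it runs without the exported file. `keep_default_na=False` is an optional precaution: by default pandas also treats strings such as `NA` as missing, which would clash with identifiers like the sodium residue name `NA`.

```python
import io

import pandas as pd

# First rows of atom_site.csv for 1UBQ, copied from the `head` output above.
SAMPLE_CSV = """\
group_PDB,id,type_symbol,label_atom_id,label_alt_id,label_comp_id,label_asym_id,label_entity_id,label_seq_id,pdbx_PDB_ins_code,Cartn_x,Cartn_y,Cartn_z,occupancy,B_iso_or_equiv,pdbx_formal_charge,auth_seq_id,auth_comp_id,auth_asym_id,auth_atom_id,pdbx_PDB_model_num
ATOM,1,N,N,.,MET,A,1,1,?,27.340,24.430,2.614,1.00,9.67,?,1,MET,A,N,1
ATOM,2,C,CA,.,MET,A,1,1,?,26.266,25.413,2.842,1.00,10.38,?,1,MET,A,CA,1
ATOM,3,C,C,.,MET,A,1,1,?,26.913,26.639,3.531,1.00,9.62,?,1,MET,A,C,1
ATOM,4,O,O,.,MET,A,1,1,?,27.886,26.463,4.263,1.00,9.62,?,1,MET,A,O,1
ATOM,5,C,CB,.,MET,A,1,1,?,25.112,24.880,3.649,1.00,13.77,?,1,MET,A,CB,1
"""

atoms = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    na_values=["?", "."],   # mmCIF placeholders for unknown / inapplicable
    keep_default_na=False,  # do not treat strings such as "NA" as missing
)

# Columns that held only placeholders are now entirely missing.
print(atoms[["label_alt_id", "pdbx_PDB_ins_code"]].isna().all())
```

For the real file, the same keyword arguments can be passed to the `pd.read_csv` call used above.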

### Example: simple summaries

In [18]:
# total number of atoms

len(df)

660

In [32]:
# count atoms by element type

df["type_symbol"].value_counts().to_frame(name="atom_count")

type_symbol,atom_count
C,378
O,176
N,105
S,1


The atom count by element shows that the structure consists mainly of carbon,
oxygen and nitrogen atoms, with a single sulfur atom and no phosphorus atoms.
This is expected for a protein structure such as ubiquitin.
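Element counts can also be broken down by record type (`group_PDB`), which helps separate polymer atoms from heteroatoms such as water or ligands when a file contains both. The sketch below uses a small hand-made table with illustrative values, not data taken from 1UBQ.

```python
import pandas as pd

# Toy atom table (illustrative values only, not from 1UBQ).
atoms = pd.DataFrame(
    {
        "group_PDB": ["ATOM", "ATOM", "ATOM", "ATOM", "HETATM", "HETATM"],
        "type_symbol": ["N", "C", "C", "O", "O", "O"],
        "label_comp_id": ["MET", "MET", "MET", "MET", "HOH", "HOH"],
    }
)

# Rows: record type; columns: element; values: number of atoms.
element_table = pd.crosstab(atoms["group_PDB"], atoms["type_symbol"])
print(element_table)
```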

### Example: selecting a subset of atoms

As an example, atoms of a selected element can be extracted for downstream analysis.

In [25]:
# select only sulfur atoms

sulfur_atoms = df[df["type_symbol"] == "S"]
sulfur_atoms.head()

Unnamed: 0,group_PDB,id,type_symbol,label_atom_id,label_alt_id,label_comp_id,label_asym_id,label_entity_id,label_seq_id,pdbx_PDB_ins_code,...,Cartn_y,Cartn_z,occupancy,B_iso_or_equiv,pdbx_formal_charge,auth_seq_id,auth_comp_id,auth_asym_id,auth_atom_id,pdbx_PDB_model_num
6,ATOM,7,S,SD,.,MET,A,1,1,?,...,23.959,5.904,1.0,17.17,?,1,MET,A,SD,1


In this protein structure only a single sulfur atom is present:
the `SD` atom of the methionine residue at position 1.
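Several conditions can be combined in one selection, and the coordinate columns can then be handed to NumPy. The sketch below selects the backbone atoms of the first residue and computes their centroid. The values are the five atoms from the CSV preview; the variable names are illustrative.

```python
import pandas as pd

# First five atoms of 1UBQ, as listed in the CSV preview above.
atoms = pd.DataFrame(
    {
        "label_atom_id": ["N", "CA", "C", "O", "CB"],
        "label_comp_id": ["MET"] * 5,
        "label_seq_id": [1] * 5,
        "Cartn_x": [27.340, 26.266, 26.913, 27.886, 25.112],
        "Cartn_y": [24.430, 25.413, 26.639, 26.463, 24.880],
        "Cartn_z": [2.614, 2.842, 3.531, 4.263, 3.649],
    }
)

# Combine conditions with & (each condition needs its own parentheses).
backbone = atoms[
    (atoms["label_comp_id"] == "MET")
    & (atoms["label_atom_id"].isin(["N", "CA", "C", "O"]))
]

# Coordinates as an array of shape (n_atoms, 3), then their centroid.
coords = backbone[["Cartn_x", "Cartn_y", "Cartn_z"]].to_numpy()
centroid = coords.mean(axis=0)
print(centroid)
```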

### Example: counting atoms per residue

Atomic data can also be grouped at the residue level.

In [28]:
# count atoms per residue

atoms_per_residue = (
    df.groupby(["label_asym_id", "label_comp_id", "label_seq_id"])
      .size()
      .reset_index(name="atom_count")
)

atoms_per_residue.head()


Unnamed: 0,label_asym_id,label_comp_id,label_seq_id,atom_count
0,A,ALA,28,5
1,A,ALA,46,5
2,A,ARG,42,11
3,A,ARG,54,11
4,A,ARG,72,11


This shows how many atoms belong to each residue in a given chain.
Such information can be useful for checking residue completeness
or detecting unusual residues.
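A completeness check along these lines could compare the per-residue counts with the number of heavy atoms expected for each residue type. The sketch below uses a toy table and a deliberately tiny lookup dictionary (`EXPECTED_HEAVY_ATOMS` is a hypothetical name covering only two residue types); a real check would need a full table and would have to account for terminal atoms such as `OXT`. Passing `sort=False` to `groupby` keeps residues in the order in which they first appear instead of sorting them by the grouping keys.

```python
import pandas as pd

# Toy table: a complete glycine and an alanine that lacks its CB atom.
atoms = pd.DataFrame(
    {
        "label_asym_id": ["A"] * 8,
        "label_comp_id": ["GLY"] * 4 + ["ALA"] * 4,
        "label_seq_id": [1] * 4 + [2] * 4,
        "label_atom_id": ["N", "CA", "C", "O"] * 2,
    }
)

# Expected heavy-atom counts (hypothetical lookup, two residue types only).
EXPECTED_HEAVY_ATOMS = {"GLY": 4, "ALA": 5}

counts = (
    atoms.groupby(["label_asym_id", "label_comp_id", "label_seq_id"], sort=False)
    .size()
    .reset_index(name="atom_count")
)
counts["expected"] = counts["label_comp_id"].map(EXPECTED_HEAVY_ATOMS)

# Residues with fewer atoms than expected are candidates for inspection.
incomplete = counts[counts["atom_count"] < counts["expected"]]
print(incomplete)
```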