# Si-Ge cluster expansion workflow - part 1

This is a CASM project tutorial to generate a phase diagram using a Si-Ge binary alloy cluster expansion fit to DFT calculations. The overall workflow is split into two parts.

Topics covered in part 1:

1. **Project initialization**: Define the primitive crystal structure and allowed atoms on each crystal site
2. **Enumeration**: Enumerate crystal structures which are symmetrically distinct orderings of the atoms allowed by the prim occupation DoF
3. **Calculation**: Calculate the energies of the enumerated structures using DFT
4. **Import and mapping**: Import calculation results, mapping to orderings on the prim
5. **Set reference states**: Choose reference states to define a formation energy for each structure
6. **Query**: Query calculation properties


In [None]:
import pathlib
import libcasm.xtal as xtal
from casm.project import Project
from casm.project.json_io import safe_dump

input_dir = pathlib.Path("input")

project_path = pathlib.Path("SiGe_occ")
project_path.mkdir(parents=True, exist_ok=True)

## Project initialization

### Specify the "prim"

A primitive crystal structure and allowed degrees of freedom (the "prim") specifies:

- lattice vectors
- crystal basis sites
- global degrees of freedom
- site degrees of freedom, including allowed occupant species on each basis site.

When combined with a choice of basis function type, order, and truncation, the prim provides all the information needed to generate cluster expansion basis functions.

Here is the prim for the Si-Ge binary alloy project, which we write to a JSON-formatted file named "prim.json":

In [None]:
prim_data = {
    "title": "SiGe_occ",
    "lattice_vectors": [
        [0.000000000000, 2.800000000000, 2.800000000000],  # 1st lattice vector
        [2.800000000000, 0.000000000000, 2.800000000000],  # 2nd lattice vector
        [2.800000000000, 2.800000000000, 0.000000000000],  # 3rd lattice vector
    ],
    "coordinate_mode": "Fractional",
    "basis": [
        {
            "coordinate": [0.0, 0.0, 0.0],
            "occupant_dof": ["Si", "Ge"],
        },
        {
            "coordinate": [0.25, 0.25, 0.25],
            "occupant_dof": ["Si", "Ge"],
        },
    ],
}

with open(project_path / "prim.json", "w") as f:
    f.write(xtal.pretty_json(prim_data))

For this particular project, the prim contains:

- **lattice_vectors**: A list of crystal lattice vectors. Units are typically Angstrom, but are ultimately determined by the method used to perform calculations. 
- **basis**: A list of crystal basis sites, including coordinate and allowed degrees of freedom. For this Si-Ge project, the basis sites contain:
  - **coordinate**: The location of the basis site, according to the "coordinate_mode".
  - **occupants**: A list of the possible occupant species that may reside at each site. The names are case sensitive, and “Va” is reserved for vacancies.
- **coordinate_mode**: Defines the units of basis site coordinates. May be one of:
  - "Cartesian": To specify basis coordinates using Cartesian coordinates:
    $$ r_{cart} = (x, y, z) $$
  - "Fractional" or "Direct": To specify basis coordinates defined in terms of the lattice vectors:
    $$ r_{cart} = L r_{frac}, $$
    where:
    - $r_{frac}$ are the coordinates in the fractional representation
    - $r_{cart}$ are the coordinates in the Cartesian representation
    - $L$ is the lattice as a column-vector matrix. 
  
**Note**: It is common, but not required, to use the results of a fully relaxed calculation of the structure with the default occupation values for the prim lattice vectors. The default occupation on each site is the species listed first in "occupants". For occupation cluster expansions, ideal supercells of the prim lattice are used for the initial state of DFT calculations and are the default reference for strain.
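
The fractional-to-Cartesian relation $r_{cart} = L r_{frac}$ described above can be checked by hand for the second basis site of this prim. The following is a minimal sketch in plain Python, not a CASM call; the helper name `frac_to_cart` is hypothetical and exists only for this illustration.

```python
# Lattice vectors as listed in prim.json (one vector per row)
lattice_vectors = [
    [0.0, 2.8, 2.8],
    [2.8, 0.0, 2.8],
    [2.8, 2.8, 0.0],
]

# L holds the lattice vectors as columns, i.e. the transpose of the rows above
L = [list(column) for column in zip(*lattice_vectors)]


def frac_to_cart(L, r_frac):
    """Return r_cart = L r_frac for a 3x3 column-vector matrix L (hypothetical helper)."""
    return [sum(L[i][j] * r_frac[j] for j in range(3)) for i in range(3)]


# Second basis site of the Si-Ge prim, given in fractional coordinates
r_cart = frac_to_cart(L, [0.25, 0.25, 0.25])
print(r_cart)
```

Each Cartesian component is one quarter of the sum of the corresponding lattice vector components, which places the site at approximately (1.4, 1.4, 1.4) in the same length units as the lattice vectors.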

### Initialize a CASM project

A CASM project is a directory containing data related to a particular prim. The CASM project directory structure standardizes the location of various files used by multiple CASM methods. This makes it easier to perform the most common operations and easier to share a project with others.

A CASM project is initialized by defining a prim and using [Project.init TODO](TODO). This will:

1. Check if the prim has a primitive unit cell with a CASM standard lattice orientation
2. Perform a symmetry analysis
3. Generate some default directories, data, and settings
4. Perform a configuration check 


Notes:

- Project files that the user should not typically modify directly, including a copy of the prim, are stored in a hidden `.casm` sub-directory of the CASM project directory. The presence or absence of the `.casm` directory is used by CASM to detect a CASM project.
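
The detection rule in the note above can be sketched with `pathlib` alone. This is an illustration, not CASM's implementation: the function name `find_casm_project_root` is hypothetical, and searching upward through parent directories is an assumption made for this example.

```python
import pathlib
import tempfile


def find_casm_project_root(start):
    """Return the nearest directory at or above `start` that contains `.casm`, else None.

    Hypothetical helper: the upward search is an assumption of this sketch.
    """
    start = pathlib.Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / ".casm").is_dir():
            return candidate
    return None


# Demonstrate on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp).resolve()
    nested = root / "SiGe_occ" / "enumerations"
    nested.mkdir(parents=True)
    (root / "SiGe_occ" / ".casm").mkdir()

    found = find_casm_project_root(nested)
    found_name = found.name
    found_matches = found == root / "SiGe_occ"
    print(found_name)
```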


In [None]:
project = Project.init(path=project_path)

coming: 
- Show what happens with non-primitive prim or non-standard lattice?
- Visualize the prim

## Enumeration

### Introduction

To fit a cluster expansion for the Si-Ge system, we need a set of calculated energies for Si-Ge crystal structures with various orderings to use as training data. To begin, we use CASM to enumerate symmetrically distinct [Supercell]() and [Configuration]():

- A [Supercell]() defines the three-dimensional translations that repeat a crystal structure. 

  - A supercell can be specified by the integer transformation matrix, $T$, relating the superstructure lattice vectors, $S$, to the unit structure lattice vectors, $L$, according to $S = L T$, where $S$ and $L$ are shape=(3,3) matrices with lattice vectors as columns.
  
  TODO: figure
 
- A [Configuration]() is a compact representation of the unit cell for a crystal structure that is allowed by the DoF specified in the prim. For this Si-Ge project, a configuration can be specified by:
 
  - the supercell that is the unit cell for the crystal structure, and
  - the occupant on each site in the supercell (i.e. whether Si or Ge occupies each site).
  
  TODO: figure

The Supercell object holds symmetry representations that allow symmetry operations to be applied efficiently to a configuration, enabling the comparisons and checks that determine whether a configuration is symmetrically distinct. The same symmetry representations can be used to transform any configuration in the same supercell, so a Supercell object can be shared by multiple Configuration objects.
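
The relation $S = L T$ from the supercell definition above can be worked through for the Si-Ge prim. The sketch below uses plain Python rather than CASM objects; the helpers `matmul3` and `det3` are hypothetical, and the particular $T$ was chosen for illustration because it produces the conventional cubic cell.

```python
def matmul3(A, B):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
        - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
        + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])
    )


# Prim lattice vectors as columns (this matrix is symmetric)
L = [
    [0.0, 2.8, 2.8],
    [2.8, 0.0, 2.8],
    [2.8, 2.8, 0.0],
]

# Integer transformation matrix chosen for this example
T = [
    [-1, 1, 1],
    [1, -1, 1],
    [1, 1, -1],
]

S = matmul3(L, T)   # superlattice vectors as columns
volume = det3(T)    # supercell volume in multiples of the prim cell volume

for row in S:
    print(row)
print("volume:", volume)
```

Here $S$ comes out diagonal with entries of 5.6, and the determinant of $T$ gives a supercell volume of 4 prim cells, which is the kind of integer volume that `min` and `max` refer to in the enumeration methods.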

### Supercell enumeration

#### Enumerating supercells by volume

The method [enum.supercells_by_volume]() enumerates symmetrically distinct supercells from a minimum to a maximum volume, specified as integer multiples of the prim unit cell volume. It also has additional options, described in the reference documentation, for more complex use cases:

- Enumerate supercells of another supercell
- Enumerate 1d or 2d supercells
- Enumerate supercells with a fixed shape but different sizes

In [None]:
# Enumerate supercells with volume 1 to 4
project.enum.supercells_by_volume(
    max=4,
    min=1,
    id="supercells_by_volume.1",
    verbose=True,
)

#### Enumeration Data

The results of an enumeration can be accessed using the [EnumData]() class. The last enumeration is saved in [project.enum.last](). We can put additional information, including a text description of the enumeration, in the [EnumData.meta]() dict and save the updated EnumData using [commit](). Subsequently, if "desc" exists in [EnumData.meta](), it will be printed along with summary information such as the number of supercells.

In [None]:
enum_data = project.enum.last
enum_data.meta = {"desc": "Initial supercell enumeration"}
enum_data.commit()
print(enum_data)

Later, the [EnumData]() may also be accessed by id string using the [enum.get]() method.

In [None]:
enum_data = project.enum.get("supercells_by_volume.1")
print(enum_data)

#### SupercellSet

The supercells enumerated by [enum.supercells_by_volume]() are stored as [SupercellRecord](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.SupercellRecord.html#supercellrecord) in a [SupercellSet](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.SupercellSet.html#supercellset). Each SupercellRecord includes a [Supercell](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.Supercell.html#supercell) and some additional information about the supercell, including a [supercell_name]() string which is used as an identifier.

A SupercellSet:

- does not keep multiple SupercellRecord for supercells that have the same superlattice vectors;
- does allow storing separate SupercellRecord for supercells which are distinct (have different superlattice vectors) but are symmetrically equivalent (superlattice points are mapped by a crystal group operation).

Iterating over a [SupercellSet](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.SupercellSet.html#supercellset) yields [SupercellRecord](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.SupercellRecord.html#supercellrecord). Each SupercellRecord includes a [Supercell](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.Supercell.html#supercell) and some additional information about the Supercell.

In [None]:
# Iterate over the first three SupercellRecord
# in the SupercellSet and print the record
for i, record in enumerate(project.enum.last.supercell_set):
    print(record)
    if i == 2:
        break

#### Storing multiple enumerations

Enumerations are stored in directories based on their id string. If an id is not given, or is None, a new enumeration is created with an automatically generated sequential id. If the id of an existing enumeration is given, that enumeration is updated with any additional supercells generated.


In [None]:
# Enumerate supercells with volume 3 to 5
project.enum.supercells_by_volume(max=5, min=3, id="supercells_by_volume.2")
print()
print(project.enum.last)

### Configuration enumeration

#### Enumerating configurations by supercell

The method [enum.occ_by_supercell]() enumerates all occupations in supercells ranging from a minimum to a maximum volume. All configurations are guaranteed to be in a canonical supercell. By default it:

- only outputs primitive configurations,
- only outputs configurations in canonical form (the configuration that compares greater than every other symmetrically equivalent configuration in the same supercell).

With these defaults, if enumeration proceeds without skipping supercells, all symmetrically distinct configurations will be enumerated.
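
The ideas of "canonical form" and "primitive configuration" can be illustrated with a toy one-dimensional analogy. This is not CASM's algorithm: it assumes a ring of 4 sites, two species encoded as 0 and 1, and a symmetry group made only of translations along the ring. All function names are hypothetical.

```python
from itertools import product


def translations(occ):
    """All cyclic shifts of an occupation tuple (the toy symmetry group)."""
    return [occ[i:] + occ[:i] for i in range(len(occ))]


def canonical_form(occ):
    """The equivalent occupation that compares greatest."""
    return max(translations(occ))


def is_primitive(occ):
    """True if no non-trivial translation maps the occupation onto itself."""
    return all(shifted != occ for shifted in translations(occ)[1:])


n_sites = 4
all_occ = list(product((0, 1), repeat=n_sites))

# Keep one representative per symmetry-equivalent class
canonical = sorted({canonical_form(occ) for occ in all_occ})

# Drop occupations that repeat with a shorter period
primitive = [occ for occ in canonical if is_primitive(occ)]

print(len(all_occ), "occupations in total")
print(len(canonical), "canonical occupations:", canonical)
print(len(primitive), "primitive canonical occupations:", primitive)
```

Of the 16 occupations, 6 remain after keeping only canonical forms, and 3 of those are primitive. The non-primitive ones would already appear as configurations in a smaller ring, which is why skipping them loses nothing when all smaller supercells are also enumerated.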

As with enum.supercells_by_volume it also has additional options, described in the reference documentation, for more complex use cases:

- Enumerate occupations in supercells of another supercell
- Enumerate occupations in 1d or 2d supercells
- Enumerate occupations in supercells with a fixed shape but different sizes

**Warning**: The number of possible occupations in an $n$-component alloy with $m$ sites is $n^m$. Take care not to request too large an enumeration.
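
To get a feel for the $n^m$ growth in the warning above, the count can be tabulated for this project, where the prim has 2 basis sites and 2 allowed species per site. This is plain arithmetic, counted per supercell before any symmetry reduction or primitive-only filtering.

```python
n_components = 2    # Si and Ge
sites_per_prim = 2  # two basis sites in the Si-Ge prim

counts = []
for volume in range(1, 5):
    m = sites_per_prim * volume  # number of sites in a supercell of this volume
    counts.append(n_components**m)
    print(f"volume={volume}: {m} sites, {n_components**m} occupations")
```

The count quadruples with each unit of volume here, so a volume of 10 would already mean more than a million occupations per supercell.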
    

In [None]:
# Enumerate configurations in supercells with volume 1 to 4
project.enum.occ_by_supercell(
    max=4,
    min=1,
    id="occ_by_supercell.1",
)

#### ConfigurationSet

The configurations enumerated by [enum.occ_by_supercell]() are stored as [ConfigurationRecord](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.ConfigurationRecord.html#configurationrecord) in a [ConfigurationSet](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.ConfigurationSet.html#configurationset). Each ConfigurationRecord includes a [Configuration](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.Configuration.html#configuration) and some additional information about the configuration, including a [configuration_name]() string which is used as an identifier.

A ConfigurationSet:

- requires that configurations be in a canonical supercell;
- does not keep multiple ConfigurationRecord for configurations that have the same DoF values;
- does allow storing separate ConfigurationRecord for configurations which are distinct (have different DoF values) but are symmetrically equivalent (DoF values are mapped by a symmetry operation);
- does not enforce any other constraints (canonical configurations only, primitive configurations only, etc.); users are responsible for deciding which configurations are added.

**Warning**: ConfigurationSet is optimized for finding unique configurations using canonical supercells. Users must ensure that configurations added to a ConfigurationSet are in a canonical supercell. This is not checked by ConfigurationSet but is required to ensure proper configuration naming, serialization, and deserialization. Configurations that are not in a canonical supercell should be stored in a list or some other data structure.

Iterating over a [ConfigurationSet](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.ConfigurationSet.html#configurationset) yields [ConfigurationRecord](https://prisms-center.github.io/CASMcode_pydocs/libcasm/configuration/2.0/reference/libcasm/_autosummary/libcasm.configuration.ConfigurationRecord.html#configurationrecord).

In [None]:
# Iterate over the first three ConfigurationRecord
# in the ConfigurationSet and print the record
for i, record in enumerate(project.enum.last.configuration_set):
    print(record)
    if i == 2:
        break

#### Conversion to structure

The [Configuration.to_structure]() method converts a CASM [Configuration]() to a CASM [Structure](). A Structure:

- represents a crystal structure with a 3d lattice,
- is not restricted to the DoF values allowed by a prim,
- has built-in methods for conversions to and from VASP POSCAR format.
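
For readers unfamiliar with the POSCAR layout, the sketch below assembles a VASP 5 style POSCAR string by hand for the Si-Ge prim with Si on the first site and Ge on the second. It is an illustration only and is not how CASM's `to_poscar_str` is implemented; the function `make_poscar_str` and its arguments are hypothetical.

```python
def make_poscar_str(title, lattice_vectors, species, frac_coords):
    """Build a minimal POSCAR string (hypothetical helper, fractional coordinates only)."""
    # Unique species names in order of first appearance
    names = list(dict.fromkeys(species))

    lines = [title, "1.0"]  # title line and universal scaling factor
    for vector in lattice_vectors:
        lines.append(" ".join(f"{x:.8f}" for x in vector))
    lines.append(" ".join(names))
    lines.append(" ".join(str(species.count(name)) for name in names))
    lines.append("Direct")

    # Coordinates must be grouped by species, in the same order as the names line
    for name in names:
        for atom_type, coord in zip(species, frac_coords):
            if atom_type == name:
                lines.append(" ".join(f"{x:.8f}" for x in coord))
    return "\n".join(lines) + "\n"


poscar = make_poscar_str(
    title="SiGe example",
    lattice_vectors=[[0.0, 2.8, 2.8], [2.8, 0.0, 2.8], [2.8, 2.8, 0.0]],
    species=["Si", "Ge"],
    frac_coords=[[0.0, 0.0, 0.0], [0.25, 0.25, 0.25]],
)
print(poscar, end="")
```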

In [None]:
# Iterate over the first three ConfigurationRecord in the ConfigurationSet,
# convert each configuration to a Structure, and print it as a POSCAR
for i, record in enumerate(project.enum.last.configuration_set):
    name = record.configuration_name
    structure = record.configuration.to_structure()
    poscar_str = structure.to_poscar_str(title=name)

    print("~~~")
    print(f"Configuration: {name}")
    print(f"Structure: {structure}")
    print(f"POSCAR:\n{poscar_str}", end="")
    if i == 2:
        break

#### Filtered enumeration

A custom filter function may be used to filter configurations during enumeration. Here we:

- use [ConfigCompositionCalculator]() to calculate the number of each type of atom in the supercell,
- keep configurations that have exactly 2 Ge,
- use ``dry_run=True`` so the enumeration is not committed automatically.

In [None]:
# Enumerate configurations:
# - in supercells with volume 1 to 3
# - with exactly 2 Ge atoms in the supercell

from libcasm.configuration import (
    Configuration,
    SupercellRecord,
)
from casm.project import EnumData

# Get the casm.project.ConfigCompositionCalculator
comp = project.chemical_composition

# Get the index of Ge in the composition arrays
i_Ge = comp.components.index("Ge")

# Print each check?
verbose_checks = True


def filter_f(config: Configuration, enum_data: EnumData) -> bool:
    """Return True to include; False to exclude"""

    # Get number of Ge in the supercell
    N_Ge = comp.per_supercell(config)[i_Ge]

    # Print info about the config being checked
    if verbose_checks:
        record = SupercellRecord(config.supercell)
        print(
            f"~check~ {record.supercell_name}",
            config.occupation,
            f"include?: {N_Ge == 2}",
        )
    return N_Ge == 2


project.enum.occ_by_supercell(
    max=3,
    min=1,
    filter_f=filter_f,
    verbose=True,
    dry_run=True,
)

### Acting on enumeration data

This section provides a reference for various actions that can be performed on enumerations, and may be skipped for the Si-Ge demonstration project.

#### Get an enumeration by id

- Also, update enumeration metadata and commit.

In [None]:
enum_data = project.enum.get("supercells_by_volume.1")
enum_data.meta = {"desc": "Initial supercell enumeration"}
enum_data.commit()
print(enum_data)

#### List all enumerations

- Print a summary of each enumeration in the project

In [None]:
project.enum.list()

#### Copy an enumeration

- Will raise if the destination enumeration already exists

In [None]:
project.enum.copy(
    src_id="supercells_by_volume.2",
    dest_id="supercells_by_volume.3",
)
project.enum.list()

#### Merge enumerations

- Supercells and configurations in source enumeration sets are inserted into the destination enumeration sets.
- Supercells and configurations in source enumeration lists are appended to the destination enumeration lists if they are not already present.
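
The two merge behaviors described above can be mimicked with built-in Python containers. This is only an analogy for the semantics, not the code behind `project.enum.merge`; the helper names and the name strings are illustrative.

```python
def merge_into_set(dest, src):
    """Set semantics: insert everything; duplicates collapse automatically."""
    dest.update(src)


def merge_into_list(dest, src):
    """List semantics: append in order, skipping items already present."""
    for item in src:
        if item not in dest:
            dest.append(item)


# Illustrative name strings standing in for enumerated supercells
dest_set = {"SCEL1_1_1_1_0_0_0", "SCEL2_2_1_1_0_0_1"}
src_set = {"SCEL2_2_1_1_0_0_1", "SCEL3_3_1_1_0_0_2"}
merge_into_set(dest_set, src_set)

dest_list = ["SCEL1_1_1_1_0_0_0", "SCEL2_2_1_1_0_0_1"]
src_list = ["SCEL3_3_1_1_0_0_2", "SCEL1_1_1_1_0_0_0"]
merge_into_list(dest_list, src_list)

print(sorted(dest_set))
print(dest_list)
```

In both cases the destination ends up with three entries, and the list keeps its original order with the new item appended at the end.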

In [None]:
project.enum.merge(
    src_id="supercells_by_volume.1",
    dest_id="supercells_by_volume.3",
)
project.enum.list()

#### Remove an enumeration

- Will raise if the enumeration does not exist

In [None]:
project.enum.remove("supercells_by_volume.3")
project.enum.list()

## Calculation

The calculation process described here involves the following conversions:

    libcasm.configuration.Configuration 
    -> libcasm.xtal.Structure 
    -> ase.Atoms 
    -> VASP input files (POSCAR, KPOINTS, INCAR, POTCAR)

- CASM provides some standard directory structures for saving input files.
- CASM provides a very simple integration to [The Atomic Simulation Environment (ase)](https://wiki.fysik.dtu.dk/ase/index.html) which can be customized for a particular use case.
- Users can customize the input file generation for other DFT codes.


### Calculation settings for ASE + VASP

Write settings to [project.dir.calctype_settings_dir_v2]():

    <project>/calculation_settings/calctype.<calctype_id>/

- The settings files are calculation type dependent.
- Any necessary files can go in the calculation settings directory.


In [None]:
import os

# Give your calculation settings a name
calctype_id = "vasp.default"

# Make a calculation settings directory
calctype_settings_dir = project.dir.calctype_settings_dir_v2(
    calctype=calctype_id,
)
calctype_settings_dir.mkdir(parents=True, exist_ok=True)

# !! CHANGE THIS AS NECESSARY !!
with open(calctype_settings_dir / "INCAR", "w") as f:
    f.write(
        """ISPIN = 1 #does non spin-polarized calc.
PREC = Accurate #cutoff + wrap around errors.
IBRION= 2 #conj. grad. relaxation.
NSW=61 #number of ionic steps taken in minimization. Make it odd.
ISIF= 3 #whether stress tensor is calculated, what is allowed to relax.
ENMAX=600 #cutoff
ISMEAR = 1 #BZ integration method (for relaxation runs).
SIGMA = 0.2 #smearing width (keep T*S < 1meV/atom). 
LWAVE = .FALSE.
LCHARG = .FALSE.
"""
    )

# !! CHANGE THIS AS NECESSARY !!
with open(calctype_settings_dir / "KPOINTS", "w") as f:
    f.write(
        """Fully automatic mesh
0              ! 0 -> automatic generation scheme 
Auto           ! fully automatic
  10           ! length (R_k)
"""
    )

# !! CHANGE THIS AS NECESSARY !!
# Save pseudopotential and xc settings
ase_vasp_settings = {
    "setups": {
        "Si": "",
        "Ge": "_d",
    },
    "xc": "pbe",
}
safe_dump(ase_vasp_settings, calctype_settings_dir / "ase_vasp.json", force=True)

# !! CHANGE THIS AS NECESSARY !!
# Set VASP_PP_PATH
vasp_pp_path = input_dir / "dummy_vasp_potentials/"
os.environ["VASP_PP_PATH"] = str(vasp_pp_path.resolve())

### Generate VASP input files

The calculation directories created will be at:

    <enum_dir>/training_data/<configname>/calctype.<calctype_id>/

Where

    <enum_dir> = <project>/enumerations/enum.<enum_id>/
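
The layout above can be spelled out as a path-building function. This is a sketch of the directory pattern only: in the project itself the path comes from `project.dir.enum_calctype_dir`, while the function `sketch_calc_dir` and the example configuration name below are hypothetical.

```python
from pathlib import PurePosixPath


def sketch_calc_dir(project, enum_id, configname, calctype_id):
    """Compose <project>/enumerations/enum.<enum_id>/training_data/<configname>/calctype.<calctype_id>."""
    return (
        PurePosixPath(project)
        / "enumerations"
        / f"enum.{enum_id}"
        / "training_data"
        / configname
        / f"calctype.{calctype_id}"
    )


calc_dir_example = sketch_calc_dir(
    project="SiGe_occ",
    enum_id="occ_by_supercell.1",
    configname="SCEL1_1_1_1_0_0_0/0",  # illustrative configuration name
    calctype_id="vasp.default",
)
print(calc_dir_example)
```

Because a configuration name may itself contain a `/`, each configuration can end up in a nested directory under `training_data`.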



In [None]:
import casm.project.ase_utils as ase_utils
from casm.project.json_io import read_required, printpathstr

# Enumeration ID & Calculation type ID to use
enum_id = "occ_by_supercell.1"
calctype_id = "vasp.default"

# Location for VASP input files:
calctype_settings_dir = project.dir.calctype_settings_dir_v2(calctype_id)

# Read pseudopotential and xc settings
ase_vasp_settings = read_required(calctype_settings_dir / "ase_vasp.json")

# Construct AseVaspTool
ase_vasp_tool = ase_utils.AseVaspTool(
    calctype_settings_dir=calctype_settings_dir,
    setups=ase_vasp_settings.get("setups"),
    xc=ase_vasp_settings.get("xc"),
)

# Get the enumeration data
enum_data = project.enum.get(enum_id)

# Iterate over ConfigurationSet and write VASP input files using ase
print(f"Setting up {len(enum_data.configuration_set)} calculations...")
for record in enum_data.configuration_set:
    print("~~~")
    calc_dir = project.dir.enum_calctype_dir(  #
        enum=enum_id,
        configname=record.configuration_name,
        calctype=calctype_id,
    )
    ase_vasp_tool.setup(
        casm_structure=record.configuration.to_structure(),
        calc_dir=calc_dir,
    )
    files = os.listdir(calc_dir)
    print("calc_dir:", printpathstr(calc_dir))
    print("files:", files)

### Running Calculations

Some tools that help run calculations:

- [The Atomic Simulation Environment (ase)](https://wiki.fysik.dtu.dk/ase/index.html)
- [pymatgen](https://pymatgen.org/)
- [signac](https://signac.io/)
- [row](https://row.readthedocs.io/en/0.2.0/)


### Make CASM structures with properties

After running the calculations, construct CASM [Structure](https://prisms-center.github.io/CASMcode_pydocs/libcasm/xtal/2.0/reference/libcasm/_autosummary/libcasm.xtal.Structure.html#structure) with the calculated properties (energy, relaxed lattice, relaxed atom coordinates) and save as files named "properties.calc.json".

[CASM recognized properties](https://prisms-center.github.io/CASMcode_docs/formats/dof_and_properties/) can be transformed by symmetry operations and imported.

Example [JSON representation of structure properties](https://prisms-center.github.io/CASMcode_docs/formats/casm/crystallography/SimpleStructure/):

    "global_properties": {
        "energy": {
            "value": [-37.82958967]
        }
    }
    
    "atom_properties": {
        "force": {
            "value": [
                [0.00029669, -8.18e-06, 0.00029669],
                [0.0, 0.00022049, 0.0],
                [-0.00029669, -8.18e-06, -0.00029669],
                [0.00029669, 8.18e-06, 0.00029669],
                [0.0, -0.00022049, 0.0],
                [-0.00029669, 8.18e-06, -0.00029669],
                [0.0, -0.00014681, 0.0],
                [0.0, 0.00014681, 0.0]
            ]
        }
    }
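
As a quick check on a file like this, the values can be read back with the standard `json` module. The sketch below assumes the two fragments above are wrapped in a single JSON object; a real "properties.calc.json" file contains further keys (lattice, atom types, coordinates) that are omitted here.

```python
import json

# The example values from above, wrapped in one JSON object (an assumption of this sketch)
properties_json = """
{
    "global_properties": {"energy": {"value": [-37.82958967]}},
    "atom_properties": {
        "force": {
            "value": [
                [0.00029669, -8.18e-06, 0.00029669],
                [0.0, 0.00022049, 0.0],
                [-0.00029669, -8.18e-06, -0.00029669],
                [0.00029669, 8.18e-06, 0.00029669],
                [0.0, -0.00022049, 0.0],
                [-0.00029669, 8.18e-06, -0.00029669],
                [0.0, -0.00014681, 0.0],
                [0.0, 0.00014681, 0.0]
            ]
        }
    }
}
"""

data = json.loads(properties_json)

# Global properties are stored as a list, even for a scalar such as the energy
energy = data["global_properties"]["energy"]["value"][0]

# Atom properties hold one row per atom
forces = data["atom_properties"]["force"]["value"]
n_atoms = len(forces)
energy_per_atom = energy / n_atoms

# Largest force magnitude, a common convergence check
max_force = max(sum(c * c for c in f) ** 0.5 for f in forces)

print(f"energy = {energy}, atoms = {n_atoms}, energy per atom = {energy_per_atom:.8f}")
print(f"max force magnitude = {max_force:.2e}")
```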

Here we unzip precalculated results and read in all the "properties.calc.json" files as CASM [Structure](https://prisms-center.github.io/CASMcode_pydocs/libcasm/xtal/2.0/reference/libcasm/_autosummary/libcasm.xtal.Structure.html#structure) with properties.

## Import and Mapping

### Overview

The import and mapping process is essentially the reverse of the calculation setup process:

    VASP output files (vasprun.xml)
    -> ase.Atoms
    -> libcasm.xtal.Structure 
    -> libcasm.configuration.Configuration 

However, while the Configuration -> Structure conversion is deterministic, the relaxed Structure -> Configuration mapping is not. 

When performing mappings, the term "parent structure" is used for ideal superstructures of the prim (the Configuration being mapped to), and "child structure" is used for the (possibly relaxed) structure being mapped.

CASM provides several methods to systematically check and score possible mappings:

- [map_structures](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.methods.map_structures.html#map-structures):
  - A very general structure mapping method.
  - Proposes and checks [StructureMapping](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.StructureMapping.html#structuremapping) of parent structures to a child structure for a range of parent supercell volumes and lattice vector reorientations, allowing combinations of rigid rotation, translation, lattice strain, and atom displacement.
  - Roughly, the approach is to first propose and check lattice mappings, and then propose and check atom mappings.
  - Mappings are scored using a weighted sum of lattice strain cost and atomic displacement cost metrics.
- [map_lattices](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.methods.map_lattices.html#map-lattices):
  - A lattice mapping method for when the parent superstructure lattice is not known.
  - Proposes and checks [LatticeMapping](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.LatticeMapping.html#latticemapping) from parent superstructures to the child structure considering reorientations of the parent lattice vectors.
  - Scores mappings using a lattice strain cost metric.
  - This method is often used inside a loop over parent superstructures.
- [map_lattices_without_reorientation](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.methods.map_lattices_without_reorientation.html#map-lattices-without-reorientation):
  - A lattice mapping method for when the parent superstructure lattice is known.
  - Constructs a [LatticeMapping](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.LatticeMapping.html#latticemapping) from the parent superstructure lattice to the child lattice by calculating the lattice deformation gradient directly, without any reorientation of the lattice vectors.
  - Does not score the mapping, but [isotropic_strain_cost](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.isotropic_strain_cost.html#isotropic-strain-cost) and [symmetry_breaking_strain_cost](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.symmetry_breaking_strain_cost.html#symmetry-breaking-strain-cost) can be used independently.
  - This method is useful when CASM enumerated the initial configuration used for the calculation input.
- [map_atoms](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.methods.map_atoms.html#map-atoms):
  - An atom mapping method for when the lattice mapping is known.
  - Proposes and scores [AtomMapping](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.info.AtomMapping.html#atommapping) displacements from sites to atoms, considering the distinct translations of child atoms to parent structure sublattices.
  - Scores mappings using an atomic displacement cost metric.
  - This method can be used in combination with map_lattices or map_lattices_without_reorientation to get a complete [StructureMapping]().
- The [mapsearch subpackage](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.mapsearch.html#module-libcasm.mapping.mapsearch):
  - Enables the construction of a custom mapping search algorithm.
  - A tutorial on using [libcasm.mapping.mapsearch](https://prisms-center.github.io/CASMcode_pydocs/libcasm/mapping/2.0/reference/libcasm/_autosummary/libcasm.mapping.mapsearch.html#module-libcasm.mapping.mapsearch) is coming soon.

The mapping methods are described in the paper [Thomas, Natarajan, and Van der Ven, npj Computational Materials, 7 (2021), 164](https://doi.org/10.1038/s41524-021-00627-0).
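
The "weighted sum" scoring mentioned for map_structures can be sketched as follows. The form below assumes the atom cost receives the complementary weight, `1 - lattice_cost_weight`; the reference documentation for map_structures should be checked for the exact definition. The function name and the candidate cost values are made up for illustration.

```python
def weighted_total_cost(lattice_cost, atom_cost, lattice_cost_weight=0.5):
    """Weighted sum of lattice and atom mapping costs (assumed form, see lead-in)."""
    if not 0.0 <= lattice_cost_weight <= 1.0:
        raise ValueError("lattice_cost_weight must be between 0 and 1")
    return lattice_cost_weight * lattice_cost + (1.0 - lattice_cost_weight) * atom_cost


# Made-up candidate mappings: (label, lattice_cost, atom_cost)
candidates = [
    ("mapping A", 0.002, 0.010),
    ("mapping B", 0.001, 0.030),
    ("mapping C", 0.004, 0.004),
]

# Rank candidates by total cost, lowest first; keeping one entry mimics k_best=1
ranked = sorted(candidates, key=lambda c: weighted_total_cost(c[1], c[2]))
best = ranked[0]

for label, lattice_cost, atom_cost in ranked:
    print(label, weighted_total_cost(lattice_cost, atom_cost))
```

With equal weights, the candidate with the lowest lattice cost alone (mapping B) is not the best overall, which shows why the two costs are combined before ranking.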

### Collect calculated structures

In [None]:
import zipfile

# Unzip precalculated data to data_dir
zip_path = input_dir / "SiGe_occ_training_data.zip"
data_dir = project.path / "precalculated"
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(data_dir)

# Collect "properties.calc.json" files in `calculated`
calculated = []


def collect(path: pathlib.Path):
    struc_data = read_required(path)
    struc = xtal.Structure.from_dict(struc_data)
    calculated.append(
        {
            "calculated_structure": struc,
            "path": str(path),
        }
    )


# Crawl data_dir and collect "properties.calc.json" files
for root, dirs, files in os.walk(data_dir):
    for filename in files:
        if filename == "properties.calc.json":
            collect(pathlib.Path(root) / filename)

# Print structures with calculated energies
for i, x in enumerate(calculated):
    print(f"--- structure {i} ---")
    print(x.get("path"))
    print(x.get("calculated_structure"))
    print()

### Map structures to configurations

- Here we just find and keep the best mapping.
- In some cases, additional next-best mappings may be kept.
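
The mapping cell that follows fixes the parent supercell volume from the atom count, which is valid only when there are no vacancies. A stricter version of that arithmetic is sketched here; the helper name `supercell_volume_from_atoms` is hypothetical. It raises an error instead of silently truncating when the atom count is not a multiple of the number of prim basis sites.

```python
def supercell_volume_from_atoms(n_atoms, n_sites_per_prim):
    """Supercell volume (in prim cells) for a vacancy-free structure."""
    volume, remainder = divmod(n_atoms, n_sites_per_prim)
    if remainder != 0:
        raise ValueError(
            f"{n_atoms} atoms is not a multiple of {n_sites_per_prim} prim sites; "
            "the structure may contain vacancies or belong to a different prim"
        )
    return volume


# An 8-atom structure on the 2-site Si-Ge prim corresponds to volume 4
print(supercell_volume_from_atoms(8, 2))
```

Setting `min_vol` and `max_vol` to this single value restricts the search to supercells of exactly the expected size.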

In [None]:
import libcasm.mapping.methods as mapmethods

# Collect results here:
failed_structures = []
mapped_structures = []

for i, x in enumerate(calculated):
    # Structure to map
    structure = x.get("calculated_structure")
    structure_fg = xtal.make_structure_factor_group(structure)

    # no vacancies, exact volume
    xtal_prim = project.prim.xtal_prim
    vol = int(len(structure.atom_type()) / len(xtal_prim.occ_dof()))

    # find mappings
    mappings = mapmethods.map_structures(
        prim=xtal_prim,
        structure=structure,
        max_vol=vol,
        prim_factor_group=project.prim.factor_group.elements,
        structure_factor_group=structure_fg,
        min_vol=vol,
        min_cost=0.0,
        max_cost=1e20,
        lattice_cost_weight=0.5,
        lattice_cost_method="isotropic_strain_cost",
        atom_cost_method="isotropic_disp_cost",
        k_best=1,
    )

    # Check mapping results:
    if len(mappings) == 0:
        failed_structures.append(x)
    else:
        result = {
            # Keep the best one:
            "scored_structure_mapping": mappings[0],
        }
        result.update(x)
        mapped_structures.append(result)

print("Finished mapping structures:")
print(f"- # Successful mappings: {len(mapped_structures)}")
print(f"- # Failed mappings: {len(failed_structures)}")

### Convert mapped configurations with properties

In [None]:
from libcasm.configuration import (
    ConfigurationSet,
    ConfigurationWithProperties,
    make_canonical_configuration,
)

# Make mapped structures,
# from calculated structure and structure mapping,
# then make ConfigurationWithProperties
configuration_set = ConfigurationSet()
for i, x in enumerate(mapped_structures):
    print(f"--- structure {i} ---")
    print(x.get("path"))

    # Apply structure mapping to make mapped structure
    mapped_structure = mapmethods.make_mapped_structure(
        structure_mapping=x.get("scored_structure_mapping"),
        unmapped_structure=x.get("calculated_structure"),
    )
    x["mapped_structure"] = mapped_structure

    # Convert mapped structure to a ConfigurationWithProperties
    mapped_config_with_props = ConfigurationWithProperties.from_structure(
        prim=project.prim,
        structure=mapped_structure,
        supercells=enum_data.supercell_set,
    )
    print("Mapped configuration with properties:", mapped_config_with_props)
    mapped_canonical_config = make_canonical_configuration(
        mapped_config_with_props.configuration,
        in_canonical_supercell=True,
    )
    energy = mapped_config_with_props.scalar_global_property_value("energy")
    x["mapped_configuration_with_properties"] = mapped_config_with_props

    record = configuration_set.add(mapped_canonical_config)
    x["canonical_configuration_name"] = record.configuration_name
    print(f"canonical_form={record.configuration_name}, energy={energy}")
    print()

print("Finished making mapped configurations:")
print(f"- # of mapped structures: {len(mapped_structures)}")
print(f"- # of mapped configurations: {len(configuration_set)}")

In [None]:
# Helper function for consistent plot formatting
from bokeh.io import output_notebook

output_notebook()


def format_plot(p):
    """Apply consistent title and axis fonts to the bokeh figure `p`."""

    font_size_1 = "14pt"
    font_size_2 = "10pt"
    font_name = "helvetica"

    p.title.text_font = font_name
    p.title.text_font_size = font_size_1

    p.xaxis.axis_label_text_font = font_name
    p.xaxis.axis_label_text_font_size = font_size_1
    p.xaxis.major_label_text_font = font_name
    p.xaxis.major_label_text_font_size = font_size_2

    p.yaxis.axis_label_text_font = font_name
    p.yaxis.axis_label_text_font_size = font_size_1
    p.yaxis.major_label_text_font = font_name
    p.yaxis.major_label_text_font_size = font_size_2

### Plot lattice and atom mapping costs


In [None]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
import numpy as np

# Name each structure by the 2nd and 3rd components of its path,
# relative to data_dir
names = []
for x in mapped_structures:
    relpath = pathlib.Path(x["path"]).relative_to(data_dir)
    names.append(str(pathlib.Path(*relpath.parts[1:3])))

all_scores = [x["scored_structure_mapping"] for x in mapped_structures]
total_cost = np.array([x.total_cost() for x in all_scores])
lattice_cost = np.array([x.lattice_cost() for x in all_scores])
atom_cost = np.array([x.atom_cost() for x in all_scores])

data = {
    "names": names,
    "total_cost": total_cost,
    "lattice_cost": lattice_cost,
    "atom_cost": atom_cost,
}
tooltips = [
    ("name", "@names"),
    ("total_cost", "@total_cost"),
    ("lattice_cost", "@lattice_cost"),
    ("atom_cost", "@atom_cost"),
]
source = ColumnDataSource(data)

p = figure(width=600, height=400, tooltips=tooltips)
p.scatter("lattice_cost", "atom_cost", source=source, size=5, color="navy", alpha=0.5)
format_plot(p)
p.xaxis.axis_label = "Lattice cost"
p.yaxis.axis_label = "Atom cost"
show(p)

In [None]:
# Save structure mappings by input path
results = {}
for x in mapped_structures:
    results[str(x["path"])] = {
        "calculated_structure": x["calculated_structure"].to_dict(),
        "mapped_configuration_with_properties": x[
            "mapped_configuration_with_properties"
        ].to_dict(),
        "scored_structure_mapping": x["scored_structure_mapping"].to_dict(),
        "mapped_structure": x["mapped_structure"].to_dict(),
    }

safe_dump(
    data=results,
    path=data_dir / "mapped_structures.json",
    force=True,
)

## Composition axes

### Introduction

In a crystal with a fixed number of sites, the numbers of each species occupying the same sublattice are not independent. In general, a crystal occupied by $s$ component species will have $k<s$ independent compositions. CASM converts between compositions expressed as the number of each species per unit cell and compositions expressed in terms of independent "parametric composition axes" using:

\begin{align}
    \vec{n} &= \vec{n}_0 + \mathbf{Q} \vec{x} \\
    \vec{x} &= \mathbf{R}^{\mathsf{T}} (\vec{n} - \vec{n}_0)
\end{align}


where:

- $\vec{n}$: Vector of shape=($s$,), the number of each component species per unit cell (*mol_composition*).
- $\vec{x}$: Vector of shape=($k$,), the composition along each composition axis when referenced to the origin composition (*param_composition*).
- $\vec{n}_0$: Vector of shape=($s$,), the origin in composition space, as the number of each component species per unit cell.
- $\mathbf{Q}$: Matrix of shape=($s$, $k$), with columns representing the change in composition per unit cell going one unit distance along each independent composition axis.
- $\mathbf{R}$: Matrix of shape=($s$, $k$), such that $\mathbf{R}^{\mathsf{T}}\mathbf{Q} = \mathbf{Q}^{\mathsf{T}}\mathbf{R} = \mathbf{I}$.

The "parametric composition axes" are the columns of $\mathbf{Q}$, $\vec{q}_i$. Because the number of sites per unit cell is preserved, each column of $\mathbf{Q}$ sums to zero, $\sum_{i} Q_{ij} = 0$. If vacancies are allowed, they are included as a component species.
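
As a minimal numerical sketch of the two conversion equations above, consider the Si-Ge case with two sites per unit cell. The component order `[Si, Ge]`, the origin at Si$_2$, and the choice of $\mathbf{R}$ as the transposed pseudoinverse of $\mathbf{Q}$ are assumptions made for illustration; the axes CASM actually generates should be read from the project itself.

```python
import numpy as np

# Assumed component order: [Si, Ge]; two sites per primitive cell
n0 = np.array([2.0, 0.0])  # origin composition: Si2
Q = np.array([[-1.0], [1.0]])  # one axis: replace one Si with one Ge

# One valid choice of R (an assumption): transpose of the pseudoinverse of Q,
# which satisfies R^T Q = I
R = np.linalg.pinv(Q).T


def to_mol(x):
    """n = n0 + Q x"""
    return n0 + Q @ x


def to_param(n):
    """x = R^T (n - n0)"""
    return R.T @ (n - n0)


x = np.array([0.5])  # a = 0.5 in Si(2-a)Ge(a)
n = to_mol(x)  # number of Si and Ge per unit cell
x_back = to_param(n)  # round trip back to the parametric composition

print("n =", n)
print("x =", x_back)
print("column sums of Q:", Q.sum(axis=0))
```

The column of `Q` sums to zero because moving along the axis swaps one species for another without changing the number of sites.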

### Print standard parametric composition axes

When a CASM project is initialized, a set of standard choices for the parametric composition axes is determined and stored in the `Project.chemical_composition_axes` attribute. Printing `chemical_composition_axes` results in a table summarizing the possible choices:

In [None]:
# print possible axes
print(project.chemical_composition_axes)

### Select default parametric composition axes

To select a particular choice to be used as the default for calculating parametric compositions, use `set_current_axes()`:


In [None]:
# select axes with <key>, unset with None
project.chemical_composition_axes.set_current_axes(1)

# print possible axes and formulas for current choice
print(project.chemical_composition_axes)

# commit current choice
project.chemical_composition_axes.commit()

Notes:

- In CASM v2, by default the parametric composition axes are "normalized" in the sense that a unit distance along that axis corresponds to a change in occupation of one site per unit cell. In CASM v1, the standard composition axes were not normalized.
- The term "endmember" usually refers to the extreme compositions in a solid solution. In the context of CompositionConverter, the term "end member composition" is used to mean the composition one unit distance along a parametric composition axis, $\vec{n}_0 + \vec{q}_i$.
- When printing formulas, the characters "a", "b", "c", etc. are used to represent the parametric compositions, $x_1$, $x_2$, $x_3$, etc.
- When referring to parametric composition axes, the characters "a", "b", "c", etc. are used to represent the parametric composition axes, $\vec{q}_1$, $\vec{q}_2$, $\vec{q}_3$, etc.
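
To make the "end member composition" and axis labeling conventions concrete, here is a small sketch for a hypothetical ternary A-B-C alloy with one site per unit cell. This is not the Si-Ge prim, and the origin and axes are chosen by hand for illustration rather than generated by CASM.

```python
import numpy as np

# Hypothetical ternary on one site per unit cell (illustration only)
components = ["A", "B", "C"]
n0 = np.array([1.0, 0.0, 0.0])  # origin composition: pure A
Q = np.array(
    [
        [-1.0, -1.0],
        [1.0, 0.0],
        [0.0, 1.0],
    ]
)  # columns are the axes q_1, q_2

# Axes are labeled "a", "b", "c", ... in column order
axis_labels = "abcdefghijklmnopqrstuvwxyz"

# End member composition for each axis: n0 + q_i
end_members = {axis_labels[i]: n0 + Q[:, i] for i in range(Q.shape[1])}

for label, n in end_members.items():
    print(label, dict(zip(components, n.tolist())))
```

With this choice, axis "a" ends at pure B and axis "b" ends at pure C.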

## Analyze calculated properties

### Get properties

- Energy per unitcell (eV / unitcell)
- Parametric composition ($a$ in Si$_{2-a}$Ge$_{a}$)
  - Using [ConfigCompositionCalculator](https://prisms-center.github.io/CASMcode_pydocs/casm/project/2.0/reference/casm/_autosummary/casm.project.ConfigCompositionCalculator.html#configcompositioncalculator)

In [None]:
# This is a ConfigCompositionCalculator
calc_comp = project.chemical_composition_axes.config_composition

# Store properties here
mapped_configurations = []
energy_per_unitcell = []
comp_a = []

for x in mapped_structures:
    # Get configuration, energy, and param_composition
    record = x["mapped_configuration_with_properties"]
    configuration = record.configuration
    n_unitcells = configuration.supercell.n_unitcells
    energy = record.scalar_global_property_value("energy")
    param_composition = calc_comp.param_composition(configuration)

    # Append
    mapped_configurations.append(configuration)
    energy_per_unitcell.append(energy / n_unitcells)
    comp_a.append(param_composition[0])



### Plot energy per unitcell vs composition

- First plot without setting any energy reference

In [None]:

data = {
    "names": names,
    "total_cost": total_cost,
    "lattice_cost": lattice_cost,
    "atom_cost": atom_cost,
    "energy_per_unitcell": energy_per_unitcell,
    "comp_a": comp_a,
}
tooltips = [
    ("name", "@names"),
    ("total_cost", "@total_cost"),
    ("lattice_cost", "@lattice_cost"),
    ("atom_cost", "@atom_cost"),
    ("energy_per_unitcell", "@energy_per_unitcell"),
    ("comp_a", "@comp_a"),
]
source = ColumnDataSource(data)

p = figure(width=600, height=400, tooltips=tooltips)
p.scatter("comp_a", "energy_per_unitcell", source=source, size=5, color="navy", alpha=0.5)
format_plot(p)
p.xaxis.axis_label = "Parametric composition (a in Si(2-a)Ge(a))"
p.yaxis.axis_label = "Calculated energy per unitcell"
show(p)

### Calculate formation energy

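- Use the configurations with the minimum and maximum $a$ as reference states: get their energies and compositions, then calculate the formation energy of every configuration relative to them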
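
For a single composition axis, the usual definition of formation energy is the energy measured relative to the straight line connecting the two reference states. The sketch below writes that out with plain NumPy and made-up numbers. The helper `formation_energy_1d` is hypothetical, and it is an assumption that `FormationEnergyCalculator` gives the same values in this one-axis case.

```python
import numpy as np


def formation_energy_1d(x, e, x_min, e_min, x_max, e_max):
    """Energy relative to linear interpolation between two reference states.

    Hypothetical helper for illustration; not part of libcasm.
    """
    t = (x - x_min) / (x_max - x_min)  # fractional position between references
    e_ref = (1.0 - t) * e_min + t * e_max  # reference energy at composition x
    return e - e_ref


# Made-up reference states (composition a, energy per unit cell)
x_min, e_min = 0.0, -10.0
x_max, e_max = 2.0, -9.0

# A made-up configuration halfway between the references
ef = formation_energy_1d(1.0, -9.4, x_min, e_min, x_max, e_max)
print("formation energy:", ef)

# The reference states themselves have zero formation energy
ef_refs = formation_energy_1d(
    np.array([x_min, x_max]), np.array([e_min, e_max]), x_min, e_min, x_max, e_max
)
print("formation energy of references:", ef_refs)
```

By construction the two reference states sit at zero, so the sign of the formation energy shows whether a configuration lies below or above the line connecting the references.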

In [None]:
from libcasm.composition import FormationEnergyCalculator

# For 1 independent composition axis:
# get reference state - at max composition
i_max = np.argmax(comp_a)
e_max = energy_per_unitcell[i_max]
comp_a_max = comp_a[i_max]

# get reference state - at min composition
i_min = np.argmin(comp_a)
e_min = energy_per_unitcell[i_min]
comp_a_min = comp_a[i_min]

# Formation energy calculator
# Each column of composition_ref is the composition of one reference state
e_calc = FormationEnergyCalculator(
    composition_ref=np.array(
        [
            [comp_a_min],
            [comp_a_max],
        ]
    ).transpose(),
    energy_ref=np.array([e_min, e_max]),
)

# Calculate formation energies
formation_energy_per_unitcell = []
for i, config in enumerate(mapped_configurations):
    ef = e_calc.formation_energy(
        composition=calc_comp.param_composition(config),
        energy=energy_per_unitcell[i],
    )
    formation_energy_per_unitcell.append(ef)

### Plot formation energy per unitcell vs composition


In [None]:
data = {
    "names": names,
    "total_cost": total_cost,
    "lattice_cost": lattice_cost,
    "atom_cost": atom_cost,
    "energy_per_unitcell": energy_per_unitcell,
    "formation_energy_per_unitcell": formation_energy_per_unitcell,
    "comp_a": comp_a,
}
tooltips = [
    ("name", "@names"),
    ("total_cost", "@total_cost"),
    ("lattice_cost", "@lattice_cost"),
    ("atom_cost", "@atom_cost"),
    ("energy_per_unitcell", "@energy_per_unitcell"),
    ("formation_energy_per_unitcell", "@formation_energy_per_unitcell"),
    ("comp_a", "@comp_a"),
]
source = ColumnDataSource(data)

p = figure(width=600, height=400, tooltips=tooltips)
p.scatter("comp_a", "formation_energy_per_unitcell", source=source, size=5, color="navy", alpha=0.5)
format_plot(p)
p.xaxis.axis_label = "Parametric composition (a in Si(2-a)Ge(a))"
p.yaxis.axis_label = "Formation energy per unitcell"
show(p)