# Comprehensive Tutorial: Chemical Space Exploration with SMACT

## Introduction
In the search for new materials with desirable properties, researchers often face the challenge of navigating vast chemical spaces. The combinatorial explosion of possible element combinations makes it impractical to synthesize and test each one experimentally. Computational screening methods, such as those provided by the SMACT (Semiconducting Materials by Analogy and Chemical Theory) library, offer a way to narrow down the search to a more manageable set of candidates by applying chemical rules and filters.

This tutorial will guide you through using SMACT and related tools to:

- Generate chemical spaces either combinatorially or by fetching data from databases like the Materials Project.
- Apply chemical filters to screen out unlikely candidates.
- Identify potential materials for specific engineering applications such as solar cells, batteries, and aerospace alloys.

By the end of this tutorial, you will be equipped to use SMACT to efficiently explore and filter chemical spaces for materials discovery.

The workflow runs from initial composition generation to filtered candidate lists, illustrated with several real-world examples:
1. Binary oxides for photocatalysis
2. Ternary chalcogenides & quaternary oxide generation for photovoltaics
3. Double perovskites for ferroelectrics
4. Potential materials for batteries 


## Real-World Applications of Materials Generated in This Tutorial
### 1. Photocatalysts for Water Splitting
- Binary oxides such as TiO2 and Fe2O3
- Requirements: band gap of roughly 2.0-2.2 eV, suitable band alignment
### 2. Solar Cell Materials
- Ternary chalcogenides such as CuInSe2
- Requirements: band gap of 1.0-1.5 eV, high absorption coefficient
### 3. Ferroelectric Materials
- Double perovskites such as Ba2FeWO6
- Requirements: suitable tolerance factor, polar structure
### 4. Battery Materials
- Compounds such as LiCoO2 (a ternary oxide; the battery example in this tutorial screens binary compounds only)
- Requirements: high capacity, low cost, stable structure


### Exercises
After completing the follow-along, try the following to test your understanding:
1. Modify the binary oxide filter to target specific band gap ranges for visible light absorption.
2. Implement additional chemical filters for the ternary chalcogenides based on ionic radii.
3. Add stability predictions using simple chemical rules for the double perovskites.


## Setting Up the Environment
Before we begin, ensure you have the necessary Python packages installed:

```bash
pip install smact pymatgen matminer mp-api
```
Note: **mp-api** is the Materials Project API client library.

**Important Note:** Replace the placeholder value of `myapikey` in the code below with your actual Materials Project API key.

In [None]:
import multiprocessing
from itertools import combinations, product
from pymatgen.core import Composition
import smact
from smact import screening
from smact.screening import pauling_test, eneg_states_test
import csv
import pandas as pd
from mp_api.client import MPRester

myapikey = "replace with your mp_api key or env key"



# Preliminaries

## Understanding SMACT
In this section of the tutorial you will begin to get a feel for how SMACT works. SMACT is a Python library designed to facilitate the exploration of chemical spaces for materials discovery. It provides tools to:

- Generate possible compositions based on element combinations.
- Apply chemical rules such as charge neutrality and electronegativity balance.
- Filter out compositions that are unlikely to form stable compounds.

### Key Features of SMACT
- **Element Class**: Represents elements with properties like oxidation states and electronegativities.
- **Screening Functions**: Functions like **smact_filter** help apply chemical rules to filter compositions.
- **Integration with Other Libraries**: SMACT works well with **pymatgen** and **matminer** for further analysis.

# Part A: Generating Chemical Spaces
There are two primary ways to generate chemical spaces:
- Combinatorial Generation: As shown in the previous tutorial, this method systematically combines elements to create potential compositions.
- Fetching Data from Materials Project: Using the Materials Project database to obtain existing materials.

### Combinatorial Generation
We can generate all possible combinations of elements within a set to explore potential compounds.

Example 1: Generate all ternary combinations from a list of elements.

In [None]:
from itertools import combinations
from smact import element_dictionary

# Define elements of interest
symbol_list = ['Li', 'Na', 'K', 'Mg', 'Ca', 'Sr', 'Ba', 'Al', 'Ga', 'In', 'Sn', 'Pb', 'Zn', 'Cd', 'Hg']
all_elements = element_dictionary(symbol_list)

# Generate all ternary combinations
ternary_combinations = combinations(all_elements.values(), 3)

# Print the first 5 combinations
for i, combo in enumerate(list(ternary_combinations)[:5]):
    print(f"Combination {i+1}: {', '.join([el.symbol for el in combo])}")

Combination 1: Li, Na, K
Combination 2: Li, Na, Mg
Combination 3: Li, Na, Ca
Combination 4: Li, Na, Sr
Combination 5: Li, Na, Ba
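
The number of element combinations grows quickly with the size of the element list, which is why the filters in Part B matter. As a rough illustration using only the standard library (the helper name `count_combinations` is ours and is not part of SMACT), the sketch below counts the unordered combinations available from the 15-element list above.

```python
from math import comb

def count_combinations(n_elements: int, order: int) -> int:
    """Number of unordered element combinations of a given order."""
    return comb(n_elements, order)

n = 15  # length of symbol_list in the example above
for order, label in [(2, "binary"), (3, "ternary"), (4, "quaternary")]:
    print(f"{label}: {count_combinations(n, order)} combinations")
```

These counts cover element combinations only. Each combination still expands into many oxidation state and stoichiometry choices before any screening is applied.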


Example 2: Fetching data from the Materials Project.
The **Materials Project** is a database of computed materials properties. We can use its API to fetch materials data.



In [None]:
"""
This script queries the Materials Project API to find stable binary metallic compounds.
It searches for compounds formed between pairs of metallic elements, filtering out any
that contain non-metallic elements. For each compound, it retrieves key properties like
material ID, formula, stability, crystal system, band gap and theoretical status.
This script takes about 4 minutes to run, which may be long on a personal computer.
"""

import csv
from itertools import combinations
import pandas as pd
from mp_api.client import MPRester

def get_binary_compounds(api_key: str, metallic_elements: list) -> pd.DataFrame:
    """
    Query Materials Project for stable binary metallic compounds.
    
    Args:
        api_key: Materials Project API key
        metallic_elements: List of metallic element symbols to search
        
    Returns:
        DataFrame containing compound properties
    """
    compounds_info = []
    excluded_elements = ["O", "S", "Se", "Te", "F", "Cl", "Br", "I", "N", "P", "As"]
    fields = ["material_id", "formula_pretty", "elements", "energy_above_hull", 
             "symmetry", "band_gap", "theoretical"]
    
    with MPRester(api_key) as mpr:
        # Search each binary combination
        for pair in combinations(metallic_elements, 2):
            docs = mpr.materials.summary.search(
                elements=list(pair),
                num_elements=(2, 2), 
                energy_above_hull=(0, 0.1),
                fields=fields
            )
            
            # Filter and store results
            for doc in docs:
                if not any(elem.symbol in excluded_elements for elem in doc.elements):
                    compounds_info.append({
                        "material_id": doc.material_id,
                        "formula": doc.formula_pretty,
                        "elements": ", ".join(elem.symbol for elem in doc.elements),
                        "energy_above_hull": doc.energy_above_hull,
                        "crystal_system": doc.symmetry.crystal_system,
                        "band_gap": doc.band_gap,
                        "theoretical": doc.theoretical
                    })
                    
    return pd.DataFrame(compounds_info)

# Define metallic elements to search
metallic_elements = ["Li", "Be", "Na", "Mg", "Al", "K", "Ca", "Sc", "Ti", "V", "Cr", "Mn", 
                    "Fe", "Co", "Ni", "Cu", "Zn", "Ga", "Rb", "Sr", "Y", "Zr", "Nb", "Mo",
                    "Tc", "Ru", "Rh", "Pd", "Ag", "Cd", "In", "Sn", "Cs", "Ba", "La", "Hf", 
                    "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg", "Tl", "Pb", "Bi"]

# Query compounds and display results
api_key = myapikey  
df = get_binary_compounds(api_key, metallic_elements)
print("\nFirst 5 compounds found:")
print(df.head())
print(f"\nTotal compounds found: {len(df)}")



First 5 compounds found:
  material_id formula elements  energy_above_hull crystal_system  band_gap  theoretical
0  mp-1186151   NaLi3   Li, Na           0.043195     Tetragonal       0.0         True
1   mp-973316   NaLi3   Li, Na           0.040660      Hexagonal       0.0         True
2   mp-977218   Li2Mg   Li, Mg           0.017815   Orthorhombic       0.0         True
3  mp-1094586   Li2Mg   Li, Mg           0.010662   Orthorhombic       0.0         True
4   mp-982380   Li2Mg   Li, Mg           0.007795     Monoclinic       0.0         True

Total compounds found: 3826





## Part B: Applying Chemical Filters

After generating a chemical space, we need to apply chemical filters to narrow down potential candidates.

### Charge Neutrality and Electronegativity Tests
The SMACT screening module provides several functions for screening chemical spaces:

1. Charge neutrality: ensures the total charge in a compound is zero.
2. Pauling test: verifies that a combination of ions makes chemical sense (i.e. positive ions should have lower electronegativity than negative ions).
3. Eneg states test (with threshold and alternate variants): checks electronegativity criteria between anions and cations.
4. No repeats: checks whether any anion or cation appears twice.
5. ml_rep_generator: takes a composition of Elements and returns a list of values between 0 and 1 that describes the composition, which is useful for machine learning.
6. smact_filter: combines the charge neutrality and electronegativity tests in one call, for external scripts that wish to apply the general 'smact test'.
7. smact_validity: checks whether a composition is valid according to the SMACT rules. A composition is considered valid if it passes the charge neutrality test and the Pauling electronegativity test.
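
To make the first two checks concrete, here is a minimal pure-Python sketch of the ideas behind them. It is a simplified re-implementation for illustration, not SMACT's own code; the function names `neutral_stoichiometries` and `cations_less_electronegative` are hypothetical, and the Pauling electronegativity of 1.83 for Fe is supplied by hand.

```python
from itertools import product

def neutral_stoichiometries(oxidation_states, threshold=8):
    """All stoichiometries (each count from 1 to threshold) with zero net charge."""
    counts = range(1, threshold + 1)
    return [
        combo
        for combo in product(counts, repeat=len(oxidation_states))
        if sum(q * n for q, n in zip(oxidation_states, combo)) == 0
    ]

def cations_less_electronegative(oxidation_states, electronegativities):
    """True if every positive ion is less electronegative than every negative ion."""
    cations = [x for q, x in zip(oxidation_states, electronegativities) if q > 0]
    anions = [x for q, x in zip(oxidation_states, electronegativities) if q < 0]
    if not cations or not anions:
        return False
    return max(cations) < min(anions)

# Fe(3+) with O(2-); electronegativities 1.83 (Fe) and 3.44 (O)
print(neutral_stoichiometries([3, -2], threshold=4))
print(cations_less_electronegative([3, -2], [1.83, 3.44]))
```

Unlike this brute-force sketch, the library functions also handle details such as reduced ratios and repeated ions, so prefer them in real workflows.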


### Using ```smact_filter```
The ```smact_filter``` function combines the charge neutrality and electronegativity tests in one call, for simple use in external scripts that wish to apply the general 'smact test'.


The function takes the following arguments:
```python
def smact_filter(
    els: Union[Tuple[Element], List[Element]],
    threshold: Optional[int] = 8,
    stoichs: Optional[List[List[int]]] = None,
    species_unique: bool = True,
    oxidation_states_set: str = "icsd24",
    comp_tuple: bool = False,
) -> Union[List[Tuple[str, int, int]], List[Tuple[str, int]]]:
    ...
```
Parameters:

```els```: A tuple or list of Element objects.

```threshold```: Maximum allowed stoichiometry (default is 8).

```stoichs```: Specific stoichiometric ratios to consider.

```species_unique```: If True, considers different oxidation states as unique species.

```oxidation_states_set```: Set of oxidation states to use ('icsd24', 'smact14', 'pymatgen', 'wiki', or a custom file path). **WARNING:** For backwards compatibility in SMACT >=2.7, explicitly set oxidation_states_set to 'smact14' if you wish to use the 2014 SMACT default oxidation states. In SMACT 3.0, the smact_filter function will use a new default oxidation states set.

```comp_tuple```: If True, returns results as named tuples.


A minimal example applies ```smact_filter``` to sodium and chlorine:

In [None]:
from smact import Element
from smact.screening import smact_filter

# Define elements
elements = (Element('Na'), Element('Cl'))

# Apply SMACT filter
compositions = smact_filter(elements)

# Display valid compositions
for comp in compositions:
    print(comp)


(('Na', 'Cl'), (1, -1), (1, 1))
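
Each result printed above holds the element symbols, the oxidation states, and the stoichiometry, in that order. The helper below is a hypothetical convenience (`describe` is not a SMACT function) that sketches how such a tuple can be turned into a readable formula and checked for charge balance.

```python
def describe(result):
    """Format a (symbols, oxidation_states, stoichiometry) tuple for display."""
    symbols, ox_states, stoichs = result
    # Omit the count when it is 1, as in conventional formulas
    formula = "".join(f"{s}{n if n > 1 else ''}" for s, n in zip(symbols, stoichs))
    species = ", ".join(
        f"{s}{abs(q)}{'+' if q > 0 else '-'}" for s, q in zip(symbols, ox_states)
    )
    total_charge = sum(q * n for q, n in zip(ox_states, stoichs))
    return formula, species, total_charge

print(describe((("Na", "Cl"), (1, -1), (1, 1))))
```

A total charge of zero confirms that the composition passed the charge neutrality test.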


# Part C: Identifying potential materials for specific engineering applications

## 1: Binary Oxides for Photocatalysis
First, let's explore binary oxide semiconductors that might be suitable for water splitting:

In [None]:
def setup_binary_oxide_space():
    """Setup chemical space for binary oxides"""
    # Define transition metals of interest
    transition_metals = ["Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn"]
    
    # Convert to SMACT elements
    tm_elements = [smact.Element(symbol) for symbol in transition_metals]
    oxygen = smact.Element("O")
    
    return tm_elements, oxygen

def binary_oxide_filter(metal):
    """Filter binary oxides based on chemical rules"""
    compounds = []
    
    # Oxidation states for oxygen
    o_state = -2
    
    for ox_state in metal.oxidation_states:
        # Check charge neutrality
        cn_e, cn_r = smact.neutral_ratios([ox_state, o_state], threshold=8)
        
        if cn_e:
            # Check electronegativity
            eneg_ok = pauling_test(
                [ox_state, o_state],
                [metal.pauling_eneg, 3.44]  # 3.44 is O electronegativity
            )
            
            if eneg_ok:
                formula = [metal.symbol, "O"]
                compounds.append((formula, cn_r[0]))
    
    return compounds

# Generate candidates
metals, oxygen = setup_binary_oxide_space()
with multiprocessing.Pool() as pool:
    binary_results = pool.map(binary_oxide_filter, metals)

# Format results
binary_formulas = []
for result in binary_results:
    for comp in result:
        formula = "".join(f"{el}{amt}" for el, amt in zip(comp[0], comp[1]))
        binary_formulas.append(Composition(formula).reduced_formula)

print("Viable binary oxide candidates:")
print("\n".join(binary_formulas))

Viable binary oxide candidates:
Ti2O
TiO
Ti2O3
TiO2
V2O
VO
V2O3
VO2
V2O5
Cr2O
CrO
Cr2O3
CrO2
Cr2O5
CrO3
Mn2O
MnO
Mn2O3
MnO2
Mn2O5
MnO3
Mn2O7
Fe2O
FeO
Fe2O3
FeO2
Fe2O5
FeO3
Co2O
CoO
Co2O3
CoO2
Co2O5
Ni2O
NiO
Ni2O3
NiO2
Cu2O
CuO
Cu2O3
CuO2
Zn2O
ZnO


## 2: Ternary Chalcogenides for Solar Cells
Now let's explore ternary chalcogenides that might be suitable for solar cells:

In [None]:
def setup_chalcogenide_space():
    """Setup chemical space for ternary chalcogenides"""
    # Group 11 metals (Cu, Ag)
    group_11 = ["Cu", "Ag"]
    # Group 13 metals (In, Ga)
    group_13 = ["Ga", "In"]
    # Chalcogens
    chalcogens = ["S", "Se"]
    
    metal_1 = [smact.Element(m) for m in group_11]
    metal_2 = [smact.Element(m) for m in group_13]
    chalc = [smact.Element(c) for c in chalcogens]
    
    return metal_1, metal_2, chalc

def ternary_chalcogenide_filter(elements):
    """Filter ternary chalcogenides with specific criteria"""
    compounds = []
    m1, m2, ch = elements
    
    # Target band gap window for solar absorbers (eV). Kept for reference only:
    # this filter has no band gap data and never applies the window.
    bandgap_range = (1.0, 2.5)
    
    for ox_1 in m1.oxidation_states:
        for ox_2 in m2.oxidation_states:
            for ox_ch in ch.oxidation_states:
                ox_states = [ox_1, ox_2, ox_ch]
                
                # Charge neutrality check
                cn_e, cn_r = smact.neutral_ratios(ox_states, threshold=8)
                
                if cn_e:
                    # Electronegativity check
                    eneg_ok = pauling_test(
                        ox_states,
                        [m1.pauling_eneg, m2.pauling_eneg, ch.pauling_eneg]
                    )
                    
                    if eneg_ok:
                        formula = [m1.symbol, m2.symbol, ch.symbol]
                        compounds.append((formula, cn_r[0]))
    
    return compounds

# Generate candidates
m1_els, m2_els, ch_els = setup_chalcogenide_space()
ternary_combinations = [(m1, m2, ch) 
                       for m1 in m1_els 
                       for m2 in m2_els 
                       for ch in ch_els]

with multiprocessing.Pool() as pool:
    ternary_results = pool.map(ternary_chalcogenide_filter, ternary_combinations)
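
The cell above stores its output in `ternary_results` without displaying it. As a sketch of one way to inspect such output, the code below flattens a nested result list into unique formula strings. `sample_results` is a hypothetical stand-in with the same shape (one list per element combination, each entry a symbols list and a stoichiometry tuple), and the helper names are ours.

```python
# Hypothetical sample in the same shape as `ternary_results`
sample_results = [
    [(["Cu", "In", "Se"], (1, 1, 2)), (["Cu", "In", "Se"], (1, 1, 2))],
    [],  # a combination with no passing compositions
    [(["Ag", "Ga", "S"], (1, 1, 2))],
]

def to_formula(symbols, stoichs):
    """Join symbols and counts, omitting counts of 1."""
    return "".join(s if n == 1 else f"{s}{n}" for s, n in zip(symbols, stoichs))

def unique_formulas(results):
    """Flatten per-combination results, keeping first-seen order without repeats."""
    seen = []
    for per_combination in results:
        for symbols, stoichs in per_combination:
            formula = to_formula(symbols, stoichs)
            if formula not in seen:
                seen.append(formula)
    return seen

print(unique_formulas(sample_results))
```

Repeats arise because different oxidation state assignments can lead to the same stoichiometry, so deduplication is usually worthwhile before further analysis.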

### 2b: Quaternary Solar Oxides Example

For quaternary oxide generation, see the solar oxides example in the SMACT documentation, which is available as a Colab notebook:
- https://colab.research.google.com/github/WMD-group/SMACT/blob/master/docs/tutorials/smact_generation_of_solar_oxides.ipynb



## 3: Double Perovskites for Ferroelectrics
Next, let's explore double perovskites (A2BB'O6):

In [None]:
def setup_double_perovskite_space():
    """Setup chemical space for double perovskites"""
    # A-site cations (large ionic radius)
    a_site = ["Ba", "Sr", "Ca"]
    # B-site cations (smaller transition metals)
    b_site = ["Fe", "Mn", "Ni"]
    b_prime_site = ["Mo", "W", "Re"]
    
    a_els = [smact.Element(a) for a in a_site]
    b_els = [smact.Element(b) for b in b_site]
    b_prime_els = [smact.Element(bp) for bp in b_prime_site]
    oxygen = smact.Element("O")
    
    return a_els, b_els, b_prime_els, oxygen

def double_perovskite_filter(elements):
    """Filter double perovskites with specific criteria"""
    compounds = []
    a, b, b_prime, o = elements
    
    # Goldschmidt tolerance factor limits. Kept for reference only:
    # this filter has no ionic radii and never applies the limits.
    tol_factor_range = (0.8, 1.0)
    
    for a_ox in a.oxidation_states:
        for b_ox in b.oxidation_states:
            for bp_ox in b_prime.oxidation_states:
                ox_states = [a_ox, b_ox, bp_ox, -2]  # O is -2
                
                # Check charge balance
                if sum([2*a_ox, b_ox, bp_ox, 6*(-2)]) == 0:
                    # Check electronegativity ordering
                    eneg_ok = pauling_test(
                        ox_states,
                        [a.pauling_eneg, b.pauling_eneg, 
                         b_prime.pauling_eneg, 3.44]
                    )
                    
                    if eneg_ok:
                        formula = [a.symbol, b.symbol, b_prime.symbol, "O"]
                        compounds.append((formula, [2, 1, 1, 6]))
    
    return compounds

# Generate candidates
a_els, b_els, bp_els, oxygen = setup_double_perovskite_space()
perovskite_combinations = [(a, b, bp, oxygen) 
                          for a in a_els 
                          for b in b_els 
                          for bp in bp_els]

with multiprocessing.Pool() as pool:
    perovskite_results = pool.map(double_perovskite_filter, 
                                 perovskite_combinations)

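# Flatten results and create dataframe.
# Each charge-balanced oxidation-state assignment appends its own entry,
# so the same formula can appear on several rows.
flattened_results = [comp for result in perovskite_results if result for comp in result]
df = pd.DataFrame(flattened_results, columns=['Formula', 'Stoichiometry'])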

# Print first 5 candidates
print("\nFirst 5 double perovskite candidates:")
print(df.head())


First 5 double perovskite candidates:
           Formula Stoichiometry
0  [Ba, Fe, Mo, O]  [2, 1, 1, 6]
1  [Ba, Fe, Mo, O]  [2, 1, 1, 6]
2  [Ba, Fe, Mo, O]  [2, 1, 1, 6]
3  [Ba, Fe, Mo, O]  [2, 1, 1, 6]
4  [Ba, Fe, Mo, O]  [2, 1, 1, 6]
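
The filter above defines `tol_factor_range` but does not use it. A sketch of how the Goldschmidt tolerance factor could be evaluated for an A2BB'O6 double perovskite follows, using the common generalization t = (r_A + r_O) / (sqrt(2) * ((r_B + r_B') / 2 + r_O)). The function names are hypothetical, and the radii passed in the example (in angstroms, for Sr2+, Fe3+, Mo5+ and O2-) are approximate Shannon values entered by hand as assumptions; check them against a radii table before relying on the result.

```python
from math import sqrt

def tolerance_factor(r_a: float, r_b: float, r_b_prime: float, r_o: float) -> float:
    """Goldschmidt tolerance factor with the two B-site radii averaged."""
    r_b_mean = (r_b + r_b_prime) / 2
    return (r_a + r_o) / (sqrt(2) * (r_b_mean + r_o))

def within_range(t: float, limits=(0.8, 1.0)) -> bool:
    """Check t against the tolerance factor window used in the filter above."""
    low, high = limits
    return low <= t <= high

# Approximate radii in angstroms (assumed values, see the note above)
t = tolerance_factor(r_a=1.44, r_b=0.645, r_b_prime=0.61, r_o=1.40)
print(f"t = {t:.3f}, within range: {within_range(t)}")
```

Passing radii in explicitly keeps the sketch independent of any particular radii source; in practice the values depend on oxidation state and coordination number.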


## 4: Identifying Potential Battery Materials
**Note:** This example is a work in progress; its results still need to be checked for consistency with the Materials Project.

**Goal:** Find binary compounds suitable for battery applications. This is a simple example of what such a workflow might look like. For a more comprehensive example, see the [Materials Project tutorial](https://next-gen.materialsproject.org/batteries) and the accompanying [documentation](https://github.com/materialsproject/docs/blob/master/docs/user-guide/batteries-explorer.md).

The code approach:

**Fetch Data:** Use Materials Project API to get binary compounds.

**Filter Compounds:** Select those with low energy above hull (stable) and desired properties.

**Apply SMACT Validity Check:** Ensure the compounds pass SMACT's chemical rules.

In [None]:
from mp_api.client import MPRester
from smact.screening import smact_validity
from pymatgen.core import Composition

api_key = myapikey  # Replace with your Materials Project API key


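with MPRester(api_key) as mpr:
    # Fetch non-metallic binary compounds within 0.05 eV/atom of the convex hull,
    # requesting only the fields used below
    docs = mpr.materials.summary.search(
        num_elements=(2, 2),
        energy_above_hull=(0, 0.05),
        is_metal=False,
        fields=["material_id", "formula_pretty", "band_gap",
                "energy_above_hull", "formation_energy_per_atom"],
    )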

# Filter and validate compounds
valid_compounds = []
for doc in docs:
    formula = doc.formula_pretty
    if smact_validity(formula):
        valid_compounds.append({
            'material_id': doc.material_id,
            'formula': formula,
            'band_gap': doc.band_gap,
            'energy_above_hull': doc.energy_above_hull,
            'formation_energy_per_atom': doc.formation_energy_per_atom,
        })

print(f"Number of valid battery material candidates: {len(valid_compounds)}")

# Save to CSV
import pandas as pd
df = pd.DataFrame(valid_compounds)
df.to_csv('battery_material_candidates.csv', index=False)



Retrieving SummaryDoc documents: 100%|██████████| 2699/2699 [00:06<00:00, 429.73it/s]

Number of valid battery material candidates: 2214
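
The screen above checks stability and SMACT validity but nothing battery-specific. One possible next step, sketched here under the assumption that the working ion is Li, Na or Mg, is to keep only formulas containing one of those elements. The helper names and `sample_formulas` are hypothetical, and the simple regular expression only handles formulas without brackets.

```python
import re

def elements_in(formula: str) -> set:
    """Element symbols in a simple formula such as 'Li2O' (no brackets handled)."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

def contains_working_ion(formula: str, working_ions=("Li", "Na", "Mg")) -> bool:
    """True if the formula contains at least one candidate working ion."""
    return bool(elements_in(formula) & set(working_ions))

# Hypothetical formulas standing in for the 'formula' column of the results
sample_formulas = ["Li2O", "NaCl", "TiO2", "MgS", "ZnO"]
print([f for f in sample_formulas if contains_working_ion(f)])
```

The same predicate could be applied to the `formula` entries in `valid_compounds` before the CSV is written.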





## Part D: Advanced Methods


### Parallel Processing for Large Datasets
When dealing with large chemical spaces, computations can be time-consuming. Using multiprocessing can speed up the process.


In [None]:
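import multiprocessing
from itertools import combinations

def process_combinations(els):
    # Your filtering code here; return the compositions that pass
    return []

# Example input: every pair drawn from a short list of element symbols
element_combinations = list(combinations(["Li", "Na", "K", "O", "S"], 2))

# The guard keeps worker processes from re-running the pool on import
if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        results = pool.map(process_combinations, element_combinations)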


Parallel processing is particularly valuable when **featurizing** large datasets, but needs to be handled carefully.

For parallel featurization using matminer, you can set the number of parallel processes on a featurizer:
```python
from matminer.featurizers.composition import ElementProperty

featurizer = ElementProperty.from_preset("magpie")
featurizer.set_n_jobs(2)  # number of parallel processes
```

Using every available core can cause memory issues with large datasets.
A safer approach is to use one or two cores (setting n_jobs to 1 or 2) or to process the data in chunks:

### Example chunking approach
```python
import pandas as pd
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition

def process_chunk(chunk_df):
    # ElementProperty needs pymatgen Composition objects, so convert the
    # "formula" strings into a "composition" column first
    chunk_df = StrToComposition().featurize_dataframe(chunk_df, "formula")
    featurizer = ElementProperty.from_preset("magpie")
    return featurizer.featurize_dataframe(chunk_df, "composition")

# Split dataframe into chunks
chunk_size = 1000  # Adjust based on your memory constraints
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Process chunks sequentially
results = [process_chunk(chunk) for chunk in chunks]

# Combine results
final_df = pd.concat(results, ignore_index=True)
```

**Note:** Always test your featurization pipeline on a small subset first before processing the full dataset.

## Conclusion
In this tutorial, we've explored how to use SMACT and related tools to:

- Generate chemical spaces either combinatorially or by fetching data from databases.
- Apply chemical filters to narrow down potential material candidates.
- Identify materials suitable for specific engineering applications.

By leveraging SMACT's capabilities, researchers can efficiently navigate the vast landscape of possible compounds and focus on the most promising candidates for experimental validation.



