<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/foundry/foundry.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

# Foundry

Foundry is an easy-to-use API that gives the user access to many materials science datasets. The data can be loaded efficiently and with little hassle. Like the deepchem_pubchempy and MP_API notebooks, this notebook focuses on showing how to access and explore the datasets.

#### Video (general material databases)

https://www.youtube.com/watch?v=cdSENQPsAiI&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=7 (Materials Data Repositories)

## Setup

### Imports

In [12]:
from foundry import Foundry
f = Foundry()


In [13]:
# Search for band gap datasets
results = f.search("band gap", limit=5)
results

| | dataset_name | title | year | DOI |
|---|---|---|---|---|
| 0 | foundry_g4mp2_solvation_v1.2 | DFT Estimates of Solvation Energy in Multiple ... | root=2022 | 10.18126/jos5-wj65 |


## Data Loading 

Let's load the first dataset

In [14]:
# Get a dataset
results = f.search("band gap", limit=1)
dataset = results.iloc[0].FoundryDataset

# Get the schema
schema = dataset.get_schema()

print(f"Dataset: {schema['name']}")
print(f"Title: {schema['title']}")
print(f"DOI: {schema['doi']}")
print(f"Data Type: {schema['data_type']}")

Dataset: foundry_g4mp2_solvation_v1.2
Title: DFT Estimates of Solvation Energy in Multiple Solvents
DOI: root='10.18126/jos5-wj65'
Data Type: tabular


We can examine the dataset's fields (columns) and splits through its schema.

In [15]:
# Examine fields (columns)
print("Fields:")
print("-" * 60)
for field in schema['fields']:
    role = field['role']  # 'input' or 'target'
    name = field['name']
    desc = field['description'] or 'No description'
    units = field['units'] or ''
    print(f"  [{role:6}] {name}: {desc} {f'({units})' if units else ''}")

# Examine splits (train/test/validation)
print("Splits:")
print("-" * 60)
for split in schema['splits']:
    print(f"  - {split['name']}: {split.get('type', 'data')}")

Fields:
------------------------------------------------------------
  [input ] smiles_0: Input SMILES string 
  [input ] smiles_1: SMILES string after relaxation 
  [input ] inchi_0: InChi after generating coordinates with CORINA 
  [input ] inchi_1: InChi after relaxation 
  [input ] xyz: InChi after relaxation (XYZ coordinates after relaxation)
  [input ] atomic_charges: Atomic charges on each atom, as predicted from B3LYP 
  [input ] A: Rotational constant, A (GHz)
  [input ] B: Rotational constant, B (GHz)
  [input ] C: Rotational constant, C (GHz)
  [input ] inchi_1: InChi after relaxation 
  [input ] n_electrons: Number of electrons 
  [input ] n_heavy_atoms: Number of non-hydrogen atoms 
  [input ] n_atom: Number of atoms in molecule 
  [input ] mu: Dipole moment (D)
  [input ] alpha: Isotropic polarizability (a_0^3)
  [input ] R2: Electronic spatial extant (a_0^2)
  [input ] cv: Heat capacity at 298.15K (cal/mol-K)
  [target] g4mp2_hf298: G4MP2 Standard Enthalpy of Formation, 
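
The field listing above can also be grouped by role, which is handy when deciding which columns become model inputs and which become targets. The following is a minimal sketch: `split_fields_by_role` is a hypothetical helper (not part of Foundry), and `example_schema` is a hand-written stand-in that only assumes the schema can be indexed the same way as in the cell above (`schema['fields']`, `field['role']`, `field['name']`).

```python
def split_fields_by_role(schema):
    """Group field names by their 'role' entry (e.g. 'input' or 'target')."""
    by_role = {}
    for field in schema["fields"]:
        by_role.setdefault(field["role"], []).append(field["name"])
    return by_role


# Hand-written stand-in shaped like the schema printed above.
# It is NOT fetched from Foundry and lists only three of the fields.
example_schema = {
    "fields": [
        {"name": "smiles_0", "role": "input"},
        {"name": "mu", "role": "input"},
        {"name": "g4mp2_hf298", "role": "target"},
    ]
}

roles = split_fields_by_role(example_schema)
print("inputs: ", roles["input"])
print("targets:", roles["target"])
```

With a real dataset, the same helper could be called on the object returned by `dataset.get_schema()`, provided it supports the indexing shown here.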

Next, we can load our dataset. There are a few ways to find one. First, you can use f.list() to print all the available datasets. Alternatively, you can browse https://foundry-ml.org/#/datasets or https://www.materialsdatafacility.org/portal, which have a very nice UI for finding datasets.

Once you know which dataset you want, you can either copy and paste its DOI from the website or search for it within Python.

If you do not know the DOI of the dataset, you can find it by searching for the dataset's name, as shown. For this notebook we will use the 'Predicting the thermodynamic stability of perovskite oxides using machine learning models' dataset.

In [16]:
# TODO: add a cell here to load the dataset via DOI

Let's load a specific split of the data

In [19]:
# Load only training data
train_data = dataset.get_as_dict(split='train')
print(f"Training data keys: {train_data.keys() if isinstance(train_data, dict) else type(train_data)}")

TransferAPIError: ('GET', 'https://transfer.api.globus.org/v0.10/operation/endpoint/82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/ls?path=%2Ffoundry%2Ffoundry_g4mp2_solvation_v1.2%2F', 'Bearer', 502, 'ExternalError.DirListingFailed.Timeout', 'Command Failed: Error (connect)\nEndpoint: globuspublish#mdf-publications (82f1b5c6-6e9b-11e5-ba47-22000b92c6ec)\nServer: 141.142.218.119:443\nMessage: The operation timed out\n', 'CIMLI4ZVK')
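
The error above is a timeout on the remote endpoint, not a problem with the code, so simply trying again later often works. One way to automate that is a small retry wrapper with exponential backoff. This is a sketch under stated assumptions: `call_with_retries` and `flaky_load` are hypothetical names, the wrapper catches `Exception` broadly because the exact exception class depends on the transfer backend, and the demo uses a simulated failure instead of a real Foundry call.

```python
import time


def call_with_retries(load, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call load() until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return load()
        except Exception as error:  # broad on purpose: backend-specific error classes vary
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits 2 s, 4 s, 8 s, ...
    raise last_error


# Simulated loader that times out twice and then succeeds (stand-in for a real download).
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"train": "ok"}


# sleep is replaced with a no-op so the demo finishes instantly.
result = call_with_retries(flaky_load, attempts=3, sleep=lambda seconds: None)
print(result, "after", calls["n"], "calls")
```

In this notebook the wrapper could be used as `call_with_retries(lambda: dataset.get_as_dict(split='train'))`. If the endpoint stays down, the last error is re-raised unchanged.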

In [None]:
# Load all splits at once
all_data = dataset.get_as_dict()
print(f"All splits: {list(all_data.keys())}")

We can also load data with metadata

In [18]:
# Get data with schema attached
result = dataset.get_as_dict(include_schema=True)

print(f"Result keys: {result.keys()}")
print(f"\nSchema name: {result['schema']['name']}")
print(f"Data splits: {list(result['data'].keys())}")

TransferAPIError: ('GET', 'https://transfer.api.globus.org/v0.10/operation/endpoint/82f1b5c6-6e9b-11e5-ba47-22000b92c6ec/ls?path=%2Ffoundry%2Ffoundry_g4mp2_solvation_v1.2%2F', 'Bearer', 502, 'ExternalError.DirListingFailed.Timeout', 'Command Failed: Error (connect)\nEndpoint: globuspublish#mdf-publications (82f1b5c6-6e9b-11e5-ba47-22000b92c6ec)\nServer: 141.142.218.119:443\nMessage: The operation timed out\n', 'jknfL492i')

Assigning the loaded data to variables triggers the download. As a warning, some of these datasets can be quite large (300 MB+), so it's worth checking the dataset's size on the website before downloading it. This dataset is only 8.29 MB, but it's something to be aware of.

In [None]:
# Unpack the training split into inputs (X) and targets (y)
X_mp, y_mp = dataset.get_as_dict()['train']

Now that we've loaded our data we can inspect it and see what the data contains. 

In [None]:
X_mp.describe()

This dataset contains only one input value (formula), but we can featurize it to get more inputs to train on. This is a very simple dataset (one input, one output), but the available datasets can get quite large.
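
To make "featurizing a formula" concrete, here is a minimal sketch that turns a formula string into atomic fractions per element. It is only an illustration: `element_fractions` is a hypothetical helper, the formula `BaTiO3` is an example and not a row taken from the dataset, and the parser assumes flat formulas without parentheses. A dedicated featurization library would normally produce far richer descriptors.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter)
# followed by an optional integer or decimal count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def element_fractions(formula):
    """Return {element: atomic fraction} for a flat formula such as 'BaTiO3'."""
    counts = Counter()
    for symbol, amount in _TOKEN.findall(formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


fractions = element_fractions("BaTiO3")
print(fractions)
```

Each element fraction can then serve as one numeric input column, giving a model more than the raw formula string to learn from.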

## Try It Yourself!

- Use the Foundry API to grab the 'Charting the complete elastic properties of inorganic crystalline compounds' dataset
- Load the data and inspect it for what it contains
- Featurize the formula column and create a dataframe with those features, nsites, space group, and volume
- Assign the target variable to be the average bulk modulus
- Create train/test splits, standardize the data, and train a random forest model predicting average bulk modulus (K_Voigt)
- Score it using mean squared error, mean absolute error, and R2
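
As a reminder of what the three scores in the last step measure, the sketch below computes them from scratch on made-up numbers (not values from the dataset). `regression_scores` is a hypothetical helper; in practice `sklearn.metrics` offers `mean_squared_error`, `mean_absolute_error`, and `r2_score` for the same purpose.

```python
def regression_scores(y_true, y_pred):
    """Return (MSE, MAE, R2) for two equal-length sequences of numbers."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n       # mean squared error
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R2 is undefined when every true value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mse, mae, r2


# Made-up values purely for illustration.
mse, mae, r2 = regression_scores([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(f"MSE={mse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

Lower MSE and MAE are better, while R2 approaches 1 as predictions explain more of the variance in the target.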
