<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/foundry/foundry.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

# Foundry

Foundry is an easy-to-use API that gives the user access to a large collection of materials science datasets. The data can be loaded efficiently and with little hassle. Like the deepchem_pubchempy and MP_API notebooks, this notebook focuses on showing how to access and explore the datasets.

#### Video (general material databases)

https://www.youtube.com/watch?v=cdSENQPsAiI&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=7 (Materials Data Repositories)

## Setup

### Imports

In [1]:
from foundry import Foundry
f = Foundry()




In [2]:
import pathlib, sys
pkg_dir = pathlib.Path(sys.executable).parents[1] / "Lib" / "site-packages" / "foundry"
print("mdf_client exists?:", (pkg_dir / "mdf_client.py").exists())
print("files:", [p.name for p in pkg_dir.glob("*.py")])


mdf_client exists?: True
files: ['auth.py', 'errors.py', 'foundry.py', 'foundry_cache.py', 'foundry_dataset.py', 'https_download.py', 'https_upload.py', 'mdf_client.py', 'models.py', 'utils.py', '__init__.py', '__main__.py']
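The check above hard-codes the Windows "Lib/site-packages" layout, so it will not find the package on Linux or in Colab. A more portable sketch, using only the standard library, asks the import system where the package lives. The helper name package_files is an assumption made for illustration and is not part of Foundry.

```python
import importlib.util
import pathlib


def package_files(name):
    """Return (package_dir, sorted .py file names) for an importable package.

    Returns (None, []) when the name cannot be found or is not a package.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None, []
    # A regular package has a single search location: its own directory.
    pkg_dir = pathlib.Path(next(iter(spec.submodule_search_locations)))
    return pkg_dir, sorted(p.name for p in pkg_dir.glob("*.py"))


pkg_dir, files = package_files("foundry")
print("foundry found at:", pkg_dir)
print("mdf_client exists?:", "mdf_client.py" in files)
```

Because find_spec only locates a top-level package without importing it, this also works when the installed build is broken at import time.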


In [3]:
import sys

# 1) remove the broken PyPI build
!"{sys.executable}" -m pip uninstall -y foundry-ml

# 2) delete any leftover 'foundry' package folder (defensive cleanup)
import shutil, pathlib
site = pathlib.Path(sys.executable).parents[1] / "Lib" / "site-packages"
shutil.rmtree(site / "foundry", ignore_errors=True)

# 3) install from GitHub main
!"{sys.executable}" -m pip install --no-cache-dir "foundry-ml @ git+https://github.com/MLMI2-CSSI/foundry.git"


Found existing installation: foundry_ml 1.2.1
Uninstalling foundry_ml-1.2.1:
  Successfully uninstalled foundry_ml-1.2.1
Collecting foundry-ml@ git+https://github.com/MLMI2-CSSI/foundry.git
  Cloning https://github.com/MLMI2-CSSI/foundry.git to c:\users\taylo\appdata\local\temp\pip-install-bfgt4sb5\foundry-ml_96a70564a8fa41c9b409295266d16954
  Resolved https://github.com/MLMI2-CSSI/foundry.git to commit 4dd9c2336c39a2f78b52a10d4390522fb0c4fc70
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: foundry-ml
  Building wheel for foundry-ml (pyproject.toml): started
  Building wheel for foundry-ml (pyproject.toml): finished with status 'done'
  Created wheel for foundry-ml: fil

  Running command git clone --filter=blob:none --quiet https://github.com/MLMI2-CSSI/foundry.git 'C:\Users\taylo\AppData\Local\Temp\pip-install-bfgt4sb5\foundry-ml_96a70564a8fa41c9b409295266d16954'
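One way to confirm that the reinstall picked up the GitHub build rather than the PyPI release is to read the metadata that pip records. This is a minimal sketch using only the standard library. install_origin is a hypothetical helper name, and the check relies on pip writing a direct_url.json file for URL and VCS installs (PEP 610).

```python
import json
from importlib import metadata


def install_origin(dist_name):
    """Return (version, source_url) for an installed distribution.

    source_url is None for a regular index (PyPI) install, and both
    values are None when the distribution is not installed at all.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None, None
    # pip writes direct_url.json only for URL/VCS installs (PEP 610).
    raw = dist.read_text("direct_url.json")
    url = json.loads(raw).get("url") if raw else None
    return dist.version, url


print(install_origin("foundry-ml"))
```

If the second value is the GitHub URL used in the install command, the package came from the repository. You may need to restart the kernel before the new build is importable.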


## Data Loading 

First, we create a Foundry instance to use with the API. Passing use_globus=False means we skip the optional Globus integration that Foundry offers; we will not use it in this notebook.

In [4]:
f = Foundry(use_globus=False)

Next, we can load our dataset. There are a few ways to find one. You can call f.list() to print all the available datasets, or you can browse https://foundry-ml.org/#/datasets or https://www.materialsdatafacility.org/portal, both of which have a nice UI for finding them.

Once you know which dataset you want, you can either copy and paste its DOI from the website or search for it within Python.

If you do not know the DOI of the dataset, you can find it by searching for the dataset's name as shown. For this notebook we will use the 'Predicting the thermodynamic stability of perovskite oxides using machine learning models' dataset.

In [8]:
# Search Foundry by dataset title; the result is a table of matching datasets.
# Note: this assumes the installed foundry-ml version provides Foundry.search.
datasets = f.search('Predicting the thermodynamic stability of perovskite oxides using machine learning models')
datasets

A search does not always return a single dataset; it returns a table with information about every dataset that matches the query. Most importantly, this table contains both the source_id and the DOI of the dataset we want, and either one can be used to load the data. Alternatively, you can index the FoundryDataset object directly out of the datasets variable.

In [None]:
# loading with the source_id
data = f.get_dataset('perovskite_stability_v1.1')


In [None]:
# loading with the DOI
data = f.get_dataset('10.18126/qe5y-2dnz')


In [None]:
# straight indexing
data = datasets.iloc[0].FoundryDataset


Once the dataset is loaded, we assign its contents to variables, which triggers the download. As a warning, some of these datasets are quite large (300 MB+), so it is worth checking a dataset's size on the website before downloading it. This one is only 8.29 MB, but it is something to be aware of.

In [None]:
X_mp, y_mp = data.get_as_dict()['train']


Now that we've loaded our data we can inspect it and see what the data contains. 

In [None]:
X_mp.describe()


This dataset only contains one input value (formula) but we can featurize it to get more inputs to train on. This is a very simple dataset (one input, one output) but the datasets available can get quite large. 
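To make the idea of featurizing a formula concrete, here is a minimal sketch that turns a formula string into element fractions using only the standard library. The function name element_fractions is hypothetical, and the parser assumes flat formulas such as SrTiO3 or Ba0.5Sr0.5TiO3 with no parentheses. In practice you would more likely use a dedicated featurization library, which provides far richer descriptors.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter),
# followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def element_fractions(formula):
    """Map each element in a flat formula to its atomic fraction."""
    counts = Counter()
    for symbol, amount in _TOKEN.findall(formula):
        # A missing amount means one atom of that element.
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


print(element_fractions("SrTiO3"))
print(element_fractions("Ba0.5Sr0.5TiO3"))
```

Each fraction can then become one column in a feature table, giving the model more than a single string input to learn from.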

## Try It Yourself!

- Use the Foundry API to grab the 'Charting the complete elastic properties of inorganic crystalline compounds' dataset
- Load the data and inspect what it contains
- Featurize the formula column and create a dataframe with those features, nsites, space group, and volume
- Assign the target variable to be the average bulk modulus
- Create train/test splits, standardize the data, and train a random forest model predicting average bulk modulus (K_Voigt)
- Score it using mean squared error, mean absolute error, and R2
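As a reference for the last step, the three scores can be written out in a few lines of plain Python. This is only a sketch of the definitions, and regression_scores is a hypothetical helper name. In the exercise itself you would normally use mean_squared_error, mean_absolute_error, and r2_score from sklearn.metrics.

```python
def regression_scores(y_true, y_pred):
    """Return MSE, MAE, and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n       # mean squared error
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                 # undefined if y_true is constant
    return {"mse": mse, "mae": mae, "r2": r2}


print(regression_scores([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```

An R2 of 1 means perfect predictions, while a model that always predicts the mean of the targets scores 0.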
