# Transfer learning for experimental band gap prediction

## Overview
### Context
The band gap of a material governs many of its electrical properties and determines whether it is an insulator, a semiconductor, or a conductor. A model that predicts the experimental band gap of a given material, and that can help screen new materials, is useful in a variety of electronic applications.
### Problem formulation
Develop a machine learning (ML) model that predicts the experimental band gap of a material from its composition alone. Use transfer learning from a model trained on Materials Project (i.e., simulated) band gaps to improve accuracy.
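
As a rough illustration of the transfer learning idea stated above, the sketch below pretrains a linear model on a large "source" task and then fine-tunes it on a small, related "target" task. Everything in it is an assumption made for illustration: the synthetic data, the linear model, and the names `fine_tune` and `mse` are hypothetical stand-ins, not the `roost` model or the notebook's `helpers.tl` code.

```python
import numpy as np

rng = np.random.default_rng(0)


def mse(w, X, y):
    """Mean squared error of the linear model X @ w against y."""
    return float(np.mean((X @ w - y) ** 2))


def fine_tune(w_init, X, y, lr=0.05, steps=200):
    """Continue training from w_init with plain gradient descent on the MSE."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


# Synthetic stand-ins: a large source task (think simulated gaps) and a
# small, slightly shifted target task (think experimental gaps).
w_true_source = np.array([1.0, -2.0, 0.5])
w_true_target = w_true_source + np.array([0.3, 0.0, -0.2])
X_src = rng.normal(size=(500, 3))
y_src = X_src @ w_true_source
X_tgt = rng.normal(size=(15, 3))
y_tgt = X_tgt @ w_true_target

# Step 1: pretrain on the source task (ordinary least squares).
w_pre, *_ = np.linalg.lstsq(X_src, y_src, rcond=None)

# Step 2: fine-tune on the target task, starting from the pretrained weights.
w_ft = fine_tune(w_pre, X_tgt, y_tgt)

loss_before = mse(w_pre, X_tgt, y_tgt)
loss_after = mse(w_ft, X_tgt, y_tgt)
print(f"target MSE before fine-tuning: {loss_before:.4f}")
print(f"target MSE after fine-tuning:  {loss_after:.4f}")
```

The point of starting from `w_pre` rather than from random weights is that the small target set only has to correct the difference between the two tasks, not learn the full relationship from scratch.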
## Approach
- Use the `matbench_expt_gap` and `matbench_mp_gap` datasets
    - For the sake of time, I filtered out all materials with a band gap of zero. Ideally, I would first train a classification model to identify metals as a pre-screening step before regression. When predicting on a new material, metals would then automatically be assigned a band gap of zero, and all other materials would go through the trained regression model.
- For now, use the [`roost`](https://github.com/CompRhys/aviary) model
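
The metal pre-screening idea described in the list above is not implemented in this notebook. A minimal sketch of how the two stages could be combined at prediction time is shown here; `two_stage_predict`, `toy_is_metal`, and `toy_regress_gap` are hypothetical names, and the toy classifier and regressor are placeholders for real trained models.

```python
import numpy as np


def two_stage_predict(features, is_metal, regress_gap):
    """Assign a gap of 0 to predicted metals; use the regressor for the rest."""
    features = np.asarray(features, dtype=float)
    metal_mask = np.asarray(is_metal(features), dtype=bool)
    gaps = np.zeros(len(features))
    if (~metal_mask).any():
        # Only non-metals are passed to the regression model.
        gaps[~metal_mask] = regress_gap(features[~metal_mask])
    return gaps


# Hypothetical stand-ins for a trained classifier and a trained regressor.
def toy_is_metal(x):
    return x[:, 0] < 0.5


def toy_regress_gap(x):
    return 2.0 * x[:, 0]


X = np.array([[0.2], [1.0], [0.4], [1.5]])
predicted = two_stage_predict(X, toy_is_metal, toy_regress_gap)
print(predicted)
```

Keeping the classifier and regressor as separate callables means the regression model can be trained on non-zero gaps only, which matches the filtering used in this notebook.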

# Import modules 

In [1]:
# General python 
from scipy import stats
import pandas as pd
import numpy as np

# Plotting & EDA
import helpers.plotting as my_plt
import helpers.eda as eda

# Data
from matminer.datasets import load_dataset, get_all_dataset_info
from pymatgen.core import Composition

# Machine learning training & prediction
from sklearn.model_selection import train_test_split
from helpers.tl import train, predict

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

IndentationError: unindent does not match any outer indentation level (tl.py, line 37)

# Load and clean data

In [None]:
# Load experimental dataset
df_expt = load_dataset('matbench_expt_gap')
print(get_all_dataset_info('matbench_expt_gap'))
# Drop materials with a zero band gap (see Approach)
df_expt.drop(df_expt[df_expt['gap expt'] == 0].index, inplace=True)
# Note: this keeps the old index as a new 'index' column
df_expt.reset_index(inplace=True)

In [None]:
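# Load Materials Project dataset
df_mp = load_dataset('matbench_mp_gap')
print(get_all_dataset_info('matbench_mp_gap'))
# Drop materials with a zero band gap (see Approach)
df_mp.drop(df_mp[df_mp['gap pbe'] == 0].index, inplace=True)
# Note: this keeps the old index as a new 'index' column (dropped later)
df_mp.reset_index(inplace=True)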

## Data pre-processing
Check for null values and duplicated rows/compositions.

In [None]:
# Convert formula to composition for each sample
def get_composition(formula):
    return Composition(formula)

In [None]:
# Clean data: Check for null values and duplicated compositions
df_expt['composition'] = df_expt['composition'].apply(get_composition)
print('Null values?')
print(df_expt.isnull().any())
print('\nDuplicated compositions?')
print(df_expt['composition'].duplicated().any())

In [None]:
# Extract the chemical formula from each pymatgen structure
df_mp['formula'] = [s.formula for s in df_mp['structure']]
df_mp['formula']

In [None]:
# Clean data: Check for null values and duplicated compositions
df_mp.drop(columns=['index'], inplace=True)
df_mp['composition'] = df_mp['formula'].apply(get_composition)
print('Null values?')
print(df_mp.isnull().any())
print('\nDuplicated compositions?')
print(df_mp['composition'].duplicated().any())

In [None]:
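# Population standard deviation (ddof=0) of the PBE gap across entries that share a composition.
# Selecting 'gap pbe' first avoids aggregating the non-numeric 'structure' and 'formula' columns.
df_dups_std = df_mp.groupby('composition')['gap pbe'].std(ddof=0)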

In [None]:
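# Keep compositions whose duplicate entries agree to within 0.5 eV, then average their gaps
selection = df_dups_std[df_dups_std < 0.5]
df_no_dups = df_mp[df_mp['composition'].isin(selection.index)].groupby('composition')['gap pbe'].mean().reset_index()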

In [None]:
df_mp = df_no_dups
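
To make the duplicate handling above concrete, here is a small worked example on toy data. It follows the same three steps (per-composition standard deviation, a 0.5 threshold, then averaging), but the helper name `collapse_duplicates`, the plain-string compositions, and the gap values are made up for illustration and are not taken from the Materials Project dataset.

```python
import pandas as pd


def collapse_duplicates(df, key="composition", target="gap pbe", max_std=0.5):
    """Average the target over duplicated keys, dropping keys whose values disagree."""
    # Population standard deviation (ddof=0) within each group;
    # a group with a single row has a spread of exactly 0.
    spread = df.groupby(key)[target].std(ddof=0)
    consistent = spread[spread < max_std].index
    kept = df[df[key].isin(consistent)]
    return kept.groupby(key)[target].mean().reset_index()


# Made-up values: TiO2 duplicates agree closely, GaN duplicates do not.
toy = pd.DataFrame({
    "composition": ["TiO2", "TiO2", "ZnO", "GaN", "GaN"],
    "gap pbe": [1.8, 2.0, 0.7, 1.7, 3.3],
})
clean = collapse_duplicates(toy)
print(clean)
```

In this toy case the two TiO2 entries are averaged, ZnO passes through unchanged, and GaN is discarded because its two entries differ by more than the threshold allows.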

# Baseline model: No transfer learning