# Exploratory data analysis for materials informatics

## Overview
### Context
Exploratory data analysis (EDA) is the process of analyzing and summarizing the main features of a dataset. It serves both to "get to know" the data and to prepare it for further analysis (e.g., machine learning).

There is no single workflow for EDA; the steps should be tailored to each dataset and research question. Here, I walk through a typical workflow for a general materials dataset. It is not meant to be comprehensive, and not every step will be necessary for a given problem.
### Problem formulation
Clean the data (i.e., remove null values and potentially harmful outliers) and analyze patterns in the features and the target property.
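
One common way to flag potential outliers in a target column is the interquartile-range (IQR) rule. The sketch below is illustrative only: the values are made up, and the 1.5 × IQR cutoff is a conventional choice rather than something this notebook prescribes.

```python
import pandas as pd

# Made-up target values for illustration (not taken from matbench_expt_gap)
gaps = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3, 9.5], name='gap expt')

# Quartiles and interquartile range of the target
q1, q3 = gaps.quantile(0.25), gaps.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (gaps < lower) | (gaps > upper)

flagged = gaps[is_outlier].tolist()
print(flagged)
```

A flagged point is only a candidate for removal; whether it is actually harmful is a judgment call that depends on the dataset.
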
## Approach
### 1. Import and featurize the dataset with `matminer`
- Use the `matbench_expt_gap` dataset
    - For the sake of time, I filtered out all materials with a band gap of zero. Ideally, I would first develop a classification model to identify metals as a pre-screening step before performing regression. When predicting on a new material, metals would then automatically be assigned a band gap of zero and all other materials would be passed to the trained regression model.
- For now, use only compositional features
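
The two-stage idea described above (classify metals first, then regress only on non-metals) can be sketched as a small routing function. This is a minimal sketch, not something this notebook implements: `predict_band_gap`, `toy_is_metal`, and `toy_regress_gap` are hypothetical names, and the two stand-in "models" are arbitrary rules used only to show the control flow.

```python
import numpy as np

def predict_band_gap(X, is_metal, regress_gap):
    """Assign 0 eV to predicted metals and regress the gap for the rest."""
    X = np.asarray(X, dtype=float)
    metal_mask = np.asarray(is_metal(X), dtype=bool)
    gaps = np.zeros(len(X))  # metals keep the default gap of zero
    if (~metal_mask).any():
        gaps[~metal_mask] = regress_gap(X[~metal_mask])
    return gaps

# Hypothetical stand-ins for a trained classifier and a trained regressor
def toy_is_metal(X):
    return X[:, 0] < 0.5

def toy_regress_gap(X):
    return 2.0 * X[:, 0]

X_new = np.array([[0.2], [0.8], [1.5]])  # made-up single-feature inputs
gaps = predict_band_gap(X_new, toy_is_metal, toy_regress_gap)
print(gaps)
```

In practice, the two callables would be replaced by the `predict` methods of fitted models.
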

In [1]:
# General Python
from scipy import stats
import pandas as pd
import numpy as np
import os

# Plotting & EDA
import helpers.plotting as my_plt
import helpers.eda as eda

# Data and feature engineering
from matminer.datasets import load_dataset, get_all_dataset_info
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from matminer.featurizers import structure as st
from pymatgen.core import Composition

# Load data and featurize

In [2]:
# Load data
dataset = 'matbench_expt_gap'
df = load_dataset(dataset)
print(get_all_dataset_info(dataset))
# Drop metals (zero band gap); reset_index() keeps the original row labels in an 'index' column
df = df[df['gap expt'] != 0].reset_index()

Dataset: matbench_expt_gap
Description: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gap expt: Target variable. Experimentally measured gap, in eV.
Num Entries: 4604
Reference: Y. Zhuo, A. Masouri Tehrani, J. Brgoch (2018) Predicting the Band Gaps of Inorganic Solids by Machine Learning J. Phys. Chem. Lett. 2018, 9, 7, 1668-1673 https://doi.org/10.1021/acs.jpclett.8b00124.
Bibtex citations: ["@Article{Dunn2020

## Data pre-processing
Check for null values and duplicated rows. Neither occurs in this dataset, so no further pre-processing is needed.

In [3]:
# Clean data: Check for null values and duplicates
print('Null values?')
print(df.isnull().any())
print('\nDuplicated rows?')
print(df['composition'].duplicated().any())

Null values?
index          False
composition    False
gap expt       False
dtype: bool

Duplicated rows?
False
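
Had either check returned `True`, the cleanup could look something like the sketch below. The toy DataFrame is made up (placeholder formulas and values), and dropping rows is only one option; imputing values or averaging duplicates may be more appropriate for other datasets.

```python
import numpy as np
import pandas as pd

# Made-up data containing one null target and one duplicated composition
toy = pd.DataFrame({
    'composition': ['AB', 'AB2', 'AB2', 'A2B', 'ABC'],
    'gap expt': [1.0, 2.0, 2.0, np.nan, 3.0],
})

cleaned = (
    toy.dropna(subset=['gap expt'])                           # remove rows with a missing target
       .drop_duplicates(subset='composition', keep='first')   # keep one row per composition
       .reset_index(drop=True)
)
print(cleaned)
```
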


## Featurization based on composition

In [4]:
# Convert each formula string to a pymatgen Composition object
df['composition'] = df['composition'].apply(Composition)
df

|  | index | composition | gap expt |
|---|---|---|---|
| 0 | 2 | (Ag, Ge, Pb, S) | 1.83 |
| 1 | 3 | (Ag, Ge, Pb, Se) | 1.51 |
| 2 | 6 | (Ag, Ge, S) | 1.98 |
| 3 | 7 | (Ag, Ge, Se) | 0.90 |
| 4 | 8 | (Ag, Hg, I) | 2.47 |
| ... | ... | ... | ... |
| 2149 | 4584 | (Zr, Ni, Sb) | 0.55 |
| 2150 | 4586 | (Zr, O) | 4.99 |
| 2151 | 4592 | (Zr, S) | 2.75 |
| 2152 | 4596 | (Zr, Se) | 2.00 |


In [6]:
# Featurize once and cache the result; later runs reuse the cached CSV
data_fname = f'{dataset}_featurized.csv'
data_path = os.path.join(os.getcwd(), data_fname)

# Define the featurizer outside the branch so that compf.feature_labels()
# is still available later when the cached file is loaded
compf = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
                            cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
if os.path.isfile(data_path):
    df = pd.read_csv(data_path)
else:
    # Feature engineering: get compositional features from matminer
    df = compf.featurize_dataframe(df, col_id='composition')
    df.to_csv(data_path, index=False)
df.columns

Index(['index', 'composition', 'gap expt', '0-norm', '2-norm', '3-norm',
       '5-norm', '7-norm', '10-norm', 'MagpieData minimum Number',
       ...
       'MagpieData mean SpaceGroupNumber',
       'MagpieData avg_dev SpaceGroupNumber',
       'MagpieData mode SpaceGroupNumber', 'avg s valence electrons',
       'avg p valence electrons', 'avg d valence electrons',
       'avg f valence electrons', 'compound possible', 'max ionic char',
       'avg ionic char'],
      dtype='object', length=148)
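
The `0-norm`, `2-norm`, ... columns come from `cf.Stoichiometry()`. As I understand that featurizer, they are Lp norms of the atomic fractions, with the 0-norm counting the number of distinct elements. The NumPy sketch below reproduces that definition under this assumption; `stoichiometry_p_norm` is a hypothetical helper, not a matminer function.

```python
import numpy as np

def stoichiometry_p_norm(fractions, p):
    """Lp norm of atomic fractions; p = 0 counts the elements present."""
    fractions = np.asarray(fractions, dtype=float)
    if p == 0:
        return float(np.count_nonzero(fractions))
    return float(np.sum(fractions ** p) ** (1.0 / p))

# Atomic fractions of a generic AB2 compound (1/3 A, 2/3 B)
fractions = [1 / 3, 2 / 3]
norms = {p: stoichiometry_p_norm(fractions, p) for p in (0, 2, 3)}
print(norms)
```

Under this definition, the norms for p > 0 equal 1 for a pure element and shrink as the composition is spread over more elements.
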

In [None]:
# Split data into input (X) and output (y)
X_col = compf.feature_labels()  # only use the compositional features for ML
# Note: this overwrites the cached CSV with only the feature and target columns
df[X_col + ['gap expt']].to_csv(data_fname, index=False)
X_df = df[X_col]
y = df['gap expt']

# Exploratory Data Analysis
## Feature correlations
Here, I analyze the correlation between each pair of features and drop features that are very highly correlated (correlation coefficient > 0.8).

In [None]:
# drop highly correlated pairs
X_nocorr = eda.corr_filter(X_df, 0.8)
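
The implementation of `eda.corr_filter` is not shown here. A correlation filter of this kind is often written along the lines of the sketch below, which assumes the filter uses the absolute Pearson correlation and drops the later column of each highly correlated pair. `drop_correlated` is a hypothetical stand-in and may not match what `corr_filter` actually does.

```python
import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop every column whose |Pearson r| with an earlier column exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Made-up features: 'b' duplicates the information in 'a', 'c' does not
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
filtered = drop_correlated(toy, threshold=0.8)
print(list(filtered.columns))
```

Because the filter walks the columns in order, which member of a correlated pair survives depends on column order; a column is also dropped if it correlates with an earlier column that was itself dropped.
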