In [7]:
import json
from pymatgen.core.structure import Structure as ST

In [8]:
import sys

sys.path.append("../")

from src.Utility import bad_data_clean

### Initial data cleaning and feature extraction
---------------------------

This notebook cleans the raw output data from VASP. The data are stored in .json format and
contain both the computed NMR tensors and the crystal structure information derived from
.cif files.

In [9]:
# read the raw data.
data_path = "../data/"
with open(data_path + "raw/Alnmr.json", "r") as file:
    data = json.load(file)
    print("length of file is {}".format(len(data)))

length of file is 3479


Each record in the raw data is a dictionary with the following keys:

* *structure*--crystal structure information from .cif.
* *formula*--the chemical formula of the material.
* *g0*--the G=0 contribution to the NMR tensors (see the VASP wiki for details).
* *cs*--the raw chemical shielding tensor.
* *efg*--the raw electric field gradient (EFG) tensor.


In [10]:
data[0].keys()

dict_keys(['structure', 'formula', 'g0', 'cs', 'efg'])
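
Before cleaning, it can help to confirm that every record carries these five keys. The following is only a minimal sketch: `EXPECTED_KEYS` and `find_incomplete_records` are hypothetical names (not part of `src.Utility`), and the toy records stand in for `data`.

```python
# Hypothetical helper, not part of src.Utility.
EXPECTED_KEYS = {"structure", "formula", "g0", "cs", "efg"}


def find_incomplete_records(records, expected=EXPECTED_KEYS):
    """Return (index, missing_keys) for every record lacking an expected key."""
    incomplete = []
    for idx, record in enumerate(records):
        missing = expected - set(record.keys())
        if missing:
            incomplete.append((idx, sorted(missing)))
    return incomplete


# Toy records standing in for the real `data` list.
toy_records = [
    {"structure": {}, "formula": "Al2O3", "g0": [], "cs": [], "efg": []},
    {"structure": {}, "formula": "AlN"},  # missing the three tensor keys
]
incomplete = find_incomplete_records(toy_records)
print(incomplete)
```

On the real data, `find_incomplete_records(data)` would return an empty list if all records share the layout shown above.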

Some data points contain no Al atoms or no structure information at all.
We can remove these bad data points with the following helper function.

In [11]:
data = bad_data_clean(data)

num of problem compound: 8
len of none problematic data: 3471
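
The implementation of `bad_data_clean` lives in `src/Utility` and is not shown here. As a rough illustration of the kind of filter described above, the sketch below drops records with no structure or no Al site. It assumes the `structure` entry follows the dictionary layout produced by pymatgen's `Structure.as_dict()` (a `sites` list whose entries hold a `species` list with an `element` field); `has_aluminium` and `drop_bad_entries` are hypothetical names, and the real helper may behave differently.

```python
def has_aluminium(structure_dict):
    """True if any site in a pymatgen-style structure dict lists Al."""
    for site in structure_dict.get("sites", []):
        for species in site.get("species", []):
            if species.get("element") == "Al":
                return True
    return False


def drop_bad_entries(records):
    """Split records into (kept, dropped).

    Hypothetical stand-in for bad_data_clean: a record is dropped when its
    structure is missing or contains no Al site.
    """
    kept, dropped = [], []
    for record in records:
        structure = record.get("structure")
        if isinstance(structure, dict) and has_aluminium(structure):
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


# Toy records: one valid, one without Al, one without any structure.
toy_records = [
    {"formula": "AlN", "structure": {"sites": [
        {"species": [{"element": "Al", "occu": 1}]},
        {"species": [{"element": "N", "occu": 1}]},
    ]}},
    {"formula": "SiO2", "structure": {"sites": [
        {"species": [{"element": "Si", "occu": 1}]},
    ]}},
    {"formula": "Al2O3", "structure": None},
]
kept, dropped = drop_bad_entries(toy_records)
print("kept:", len(kept), "dropped:", len(dropped))
```

Returning the dropped records alongside the kept ones makes it easy to inspect why a compound was rejected.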


Some data points are exact duplicates of each other, so we also need to remove the redundant entries.

In [12]:
# Remove exact duplicates: serialize each record with sorted keys so that
# identical records map to the same string, then keep only the unique strings.
unique_strings = {json.dumps(record, sort_keys=True) for record in data}
data = [json.loads(string) for string in unique_strings]
print("length of file without redundancy is {}".format(len(data)))

length of file without redundancy is 3022
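
Deduplicating through a `set` does not preserve the original record order, so the order of the cleaned list can differ between runs. If a reproducible order matters, an order-preserving variant is possible. The sketch below is one option, with `deduplicate_records` as a hypothetical helper name; it keeps the first occurrence of each record by relying on `dict` insertion order (Python 3.7+).

```python
import json


def deduplicate_records(records):
    """Drop exact duplicates, keeping the first occurrence of each record."""
    # sort_keys=True makes the serialized form independent of key order.
    serialized = (json.dumps(record, sort_keys=True) for record in records)
    # dict.fromkeys keeps only the first copy of each string, in input order.
    return [json.loads(string) for string in dict.fromkeys(serialized)]


# Toy records: the first two differ only in key order, so they are duplicates.
toy = [
    {"formula": "Al2O3", "cs": [1.0, 2.0]},
    {"cs": [1.0, 2.0], "formula": "Al2O3"},
    {"formula": "AlN", "cs": [3.0]},
]
unique = deduplicate_records(toy)
print(len(unique))
```

Both approaches treat two records as duplicates only when every field matches exactly, including all floating-point tensor components.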


Save the cleaned data as another .json file in the interim folder in /data/.

In [13]:
filename = "Alnmr_clean.json"
with open(data_path + "interim/" + filename, "w") as outfile:
    json.dump(data, outfile)
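
As an optional check, the saved file can be reloaded and compared with the in-memory list. This is a minimal sketch using only the standard library; `save_and_verify` is a hypothetical helper, and the temporary directory stands in for `../data/interim/`. Note that the comparison would report a mismatch if the data contained NaN values, since NaN never compares equal to itself.

```python
import json
import tempfile
from pathlib import Path


def save_and_verify(records, path):
    """Write records as JSON, reload them, and report whether they match."""
    path = Path(path)
    # Create the target folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as outfile:
        json.dump(records, outfile)
    with path.open("r") as infile:
        reloaded = json.load(infile)
    return reloaded == records


# Toy records written to a temporary folder instead of ../data/interim/.
toy_records = [{"formula": "AlN", "cs": [3.0]}, {"formula": "Al2O3", "cs": [1.0, 2.0]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    ok = save_and_verify(toy_records, Path(tmp_dir) / "interim" / "Alnmr_clean.json")
print(ok)
```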