<a href="https://colab.research.google.com/github/sandeepchemistry/CVPAT/blob/master/Parsing_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will use pymatgen, a Python package, to parse output files from VASP and Gaussian. pymatgen is well documented, which makes it easy to use (https://pymatgen.org/index.html).

In [None]:
# install pymatgen
!pip install pymatgen

In [None]:
# fetch the required files from GitHub
!git clone https://github.com/vinayak2019/Parsing_Files
!tar -xf  Parsing_Files/vasp/vasprun.xml.tar.gz -C Parsing_Files/vasp/

# **Parsing VASP files**

VASP (the Vienna Ab initio Simulation Package) is a program for atomic-scale materials modelling from first principles (https://www.vasp.at/).

Here we will parse an XML file (vasprun.xml) and a text file (OUTCAR) created by VASP.

In [None]:
# load Vasprun from pymatgen
from pymatgen.io.vasp import Vasprun

In [None]:
# read the xml file
vasprun = Vasprun("/content/Parsing_Files/vasp/vasprun.xml")

In [None]:
# list the attributes and methods available on the Vasprun object
dir(vasprun)

In [None]:
# check whether the electronic steps converged
# (vasprun.converged additionally checks ionic convergence)
vasprun.converged_electronic

In [None]:
# get final energy
vasprun.final_energy

pymatgen also has plotting modules. Here we will use DosPlotter to plot the density of states (https://en.wikipedia.org/wiki/Density_of_states).

In [None]:
# plotting density of states
from pymatgen.electronic_structure.plotter import DosPlotter

tdos = vasprun.tdos
plotter = DosPlotter(sigma=0.1)
plotter.add_dos("Total DOS", tdos)
plotter.show()

pymatgen can also parse the text files created by VASP. Below is an example that parses the OUTCAR file.

In [None]:
# load the module Outcar
from pymatgen.io.vasp import Outcar

# read the file
outcar = Outcar("/content/Parsing_Files/vasp/OUTCAR")

# display the statistics for the job
outcar.run_stats

# **Parsing Gaussian files**

Gaussian 16 is the latest in the Gaussian series of programs. It provides state-of-the-art capabilities for electronic structure modeling. Gaussian 16 is licensed for a wide variety of computer systems. All versions of Gaussian 16 contain every scientific/modeling feature, and none imposes any artificial limitations on calculations other than your computing resources and patience. https://gaussian.com/gaussian16/

## **pymatgen**

The modules to parse Gaussian files are available in pymatgen. Other codes that can be parsed are listed here (https://pymatgen.org/pymatgen.io.html).

In [None]:
# import the class that parses Gaussian output files
from pymatgen.io.gaussian  import GaussianOutput

In [None]:
# reading the log file
gout = GaussianOutput("/content/Parsing_Files/gaussian/tddft.log")

In [None]:
# list the attributes and methods available on the GaussianOutput object
dir(gout)

In [None]:
# getting the final energy
gout.final_energy

In [None]:
# final structure
gout.final_structure

In [None]:
# TDDFT excitations
gout.read_excitation_energies()
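
According to the pymatgen documentation, read_excitation_energies() returns a list of tuples of the form (energy in eV, wavelength in nm, oscillator strength). Treat that layout as an assumption and check it against your installed version. The sketch below uses made-up numbers in that layout (they are not taken from tddft.log) to pick the transition with the largest oscillator strength.

```python
# Hypothetical excitations, made up for illustration only.
# Assumed tuple layout: (energy in eV, wavelength in nm, oscillator strength)
excitations = [
    (3.10, 399.9, 0.0021),
    (3.85, 322.0, 0.4510),
    (4.20, 295.2, 0.0350),
]

# The brightest transition is the one with the largest oscillator strength
brightest = max(excitations, key=lambda t: t[2])
energy_ev, wavelength_nm, osc_strength = brightest
print(f"Brightest transition: {energy_ev} eV, {wavelength_nm} nm, f = {osc_strength}")

# Sanity check: wavelength (nm) is approximately 1239.84 / energy (eV)
print(abs(1239.84 / energy_ev - wavelength_nm) < 0.1)
```

With real data, the same lines should work on the list returned by gout.read_excitation_energies(), provided the tuple layout matches the assumption above.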

## **Generic text parsing**

We will use regular expressions for parsing text files. (https://en.wikipedia.org/wiki/Regular_expression)


The process for parsing is as follows:
1.   Find a unique pattern that marks the start of the segment to parse
2.   Find a pattern that marks the end of the segment
3.   Read the file
4.   Look for the line that matches the start pattern
5.   Collect lines until the end pattern is encountered
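
The steps above can be sketched as one small helper. This is a minimal illustration, not the notebook's own code: the function name `parse_between`, its `skip` parameter, and the sample text are all hypothetical.

```python
import re


def parse_between(lines, start_pattern, end_pattern, skip=1):
    """Return the stripped lines between the first start match and the next end match.

    `skip` is the offset from the start line at which collection begins
    (skip=1 begins on the line right after the start pattern).
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)

    # Look for the line with the start pattern
    for idx, line in enumerate(lines):
        if start_re.match(line.strip()):
            break
    else:
        raise ValueError("start pattern not found")

    # Collect lines until the end pattern is encountered
    segment = []
    for line in lines[idx + skip:]:
        if end_re.match(line.strip()):
            return segment
        segment.append(line.strip())
    raise ValueError("end pattern not found")


# Hypothetical sample text, made up for illustration only
sample = """header
BEGIN DATA
a 1
b 2
END DATA
footer
""".splitlines()

print(parse_between(sample, r"BEGIN DATA", r"END DATA"))
```

Raising an error when either pattern is missing makes a malformed file fail loudly instead of silently returning the wrong segment.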



Use regex101 (https://regex101.com/) to test your regular expressions.

In [None]:
# import regular expression module
import re

In [None]:
# We will parse the Mulliken charges for all atoms

# Find the pattern
start_pattern = re.compile(r'Mulliken charges:')

In [None]:
# The pattern at the end
end_pattern = re.compile(r'Sum of Mulliken charges')

In [None]:
# read the file (I prefer to read lines as list of lines)
with open("/content/Parsing_Files/gaussian/tddft.log") as f:
  lines = f.readlines()

In [None]:
# check types
print("The type of lines variable is ",type(lines))

# print first 10 lines
print(lines[:10])

We will use re.match() to check whether a line, once stripped of surrounding whitespace, starts with our start pattern.

In [None]:
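# find line with start pattern
for idx, line in enumerate(lines): # loops over the lines
  if re.match(start_pattern, line.strip()):
    break
else:
  # the loop finished without a match, so idx would be meaningless
  raise ValueError("start pattern not found")
print(idx)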
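
A short aside on why the line is stripped first: re.match() only matches at the beginning of a string, whereas re.search() scans the whole string. The sketch below assumes a log line with a leading space, which is an illustrative example rather than a line copied from tddft.log.

```python
import re

pattern = re.compile(r"Mulliken charges:")

# Assumed example line with leading whitespace and a trailing newline
raw_line = " Mulliken charges:\n"

# match() fails on the raw line because the string starts with a space
print(pattern.match(raw_line) is None)

# match() succeeds once the surrounding whitespace is stripped
print(pattern.match(raw_line.strip()) is not None)

# search() finds the pattern anywhere in the string, stripped or not
print(pattern.search(raw_line) is not None)
```

Using match() on stripped lines is the stricter choice: it will not fire on a line that merely mentions the pattern somewhere in the middle.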

To stop parsing we will again use re.match(), this time to match the end pattern. Every line before it is appended to a list.

In [None]:
# parse lines
# skip the "Mulliken charges:" line and the column-header line that follows it
idx = idx + 2
line = lines[idx]

parsed_lines = []
while not re.match(end_pattern, line.strip()):
  parsed_lines.append(line.strip())
  idx += 1
  line = lines[idx]

In [None]:
# parsed data
print(parsed_lines)

To clean up the data we will create a dictionary for each atom of the form
{"number": 1,
"atom": "C",
"charge": -0.4}

In [None]:
# testing on one line
parsed_lines[0].strip().split()

In [None]:
# creating dictionary
clean_line = parsed_lines[0].strip().split()
d = {
    "number": int(clean_line[0]),
    "atom": clean_line[1],
    "charge": float(clean_line[2]),
}
print(d)
print(d)

In [None]:
# clean up the data
data = []
for line in parsed_lines:
  clean_line = line.strip().split()
  data.append({
      "number": int(clean_line[0]),
      "atom": clean_line[1],
      "charge": float(clean_line[2]),
  })
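
As an alternative to splitting on whitespace, a regular expression with named groups can validate the line and extract the fields in one step. This is a sketch under an assumed line layout (atom number, element symbol, charge); the helper name `line_to_dict` and the sample line are hypothetical.

```python
import re

# Assumed layout: atom number, element symbol, charge, e.g. "1  C   -0.412345"
line_re = re.compile(
    r"(?P<number>\d+)\s+(?P<atom>[A-Za-z]+)\s+(?P<charge>[-+]?\d+\.\d+)"
)


def line_to_dict(line):
    """Return a dict for one charge line, or None if the line does not fit the layout."""
    m = line_re.match(line.strip())
    if m is None:
        return None
    return {
        "number": int(m.group("number")),
        "atom": m.group("atom"),
        "charge": float(m.group("charge")),
    }


# Made-up sample line for illustration
print(line_to_dict("     1  C   -0.412345"))
# A line that does not fit the layout yields None instead of raising
print(line_to_dict("Sum of Mulliken charges =   0.00000"))
```

Returning None for lines that do not fit lets the caller skip or report unexpected lines rather than crash on an IndexError or ValueError.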

In [None]:
# creating a table with pandas
import pandas as pd

pd.DataFrame(data)
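
A quick sanity check on the parsed table: Mulliken charges should sum to the net charge of the molecule, which is the value Gaussian prints on the "Sum of Mulliken charges" line. The sketch below uses made-up charges for a neutral three-atom molecule, not values from tddft.log, and the tolerance is an assumption.

```python
import math
from collections import defaultdict

# Hypothetical parsed data (values made up for illustration)
data = [
    {"number": 1, "atom": "O", "charge": -0.80},
    {"number": 2, "atom": "H", "charge": 0.40},
    {"number": 3, "atom": "H", "charge": 0.40},
]

expected_total = 0.0  # assumed net charge of the molecule

# The total should match the net charge within rounding error
total = sum(row["charge"] for row in data)
print(math.isclose(total, expected_total, abs_tol=1e-4))

# Charge summed per element, handy for spotting parsing mistakes
per_element = defaultdict(float)
for row in data:
    per_element[row["atom"]] += row["charge"]
print(dict(per_element))
```

If the check fails on real data, a likely cause is that the start or end pattern captured too few or too many lines.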

### **Exercise**

Parse the Mulliken charges with hydrogens summed into heavy atoms

In [None]:
# YOUR CODE HERE

Please take this survey to help me improve the workshop https://docs.google.com/forms/d/e/1FAIpQLSdpn3lpq1n1fA4aqLDvfA9VARsTNBnD5p6gcCtJ_VaYGiYxlA/viewform?usp=sf_link