# CHEM7370 Class 3
## Conditions, files, and strings, oh my!

In [6]:
# Here's our list once again
energy_kcal = [-13.4, -2.7, 5.4, 42.1]

The `append` function adds a new item to the end of an existing list. Let's use it in a `for` loop to convert the entire list of energies from kcal to kJ.

Note the first line of the block below: `energy_kJ = []` creates an empty list before the loop starts. Without that statement the code fails with a `NameError`, because `append` can only add to a list that already exists.

In [12]:
energy_kJ = []
for number in energy_kcal:
    # 1 kcal = 4.184 kJ
    kJ = number * 4.184
    energy_kJ.append(kJ)

print(energy_kJ)

[-56.0656, -11.296800000000001, 22.593600000000002, 176.1464]


## Making choices: logic statements
Within your code, you may need to evaluate a variable and then do something if the variable has a particular value. This type of logic is handled by an `if` statement. In the following example, we only append the negative numbers to a new list.

In [18]:
negative_energy_kJ = []

for number in energy_kJ:
    if number < 0:
        negative_energy_kJ.append(number)

print(negative_energy_kJ)

[-56.0656, -11.296800000000001]


Other logic operations include

* equal to `==`
* not equal to `!=`
* greater than `>`
* less than `<`
* greater than or equal to `>=`
* less than or equal to `<=`

You can also use `and`, `or`, and `not` to check more than one condition.

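Strings are compared with the same `==` and `!=` operators as numbers. In addition, `in` and `not in` test whether one item is contained in another (a substring in a string, or an element in a list), while `is` and `is not` test whether two names refer to the very same object. We will see these types of logic operators used soon.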
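As a small illustration of combining conditions, the sketch below uses a made-up energy value and a made-up molecule name (both are assumptions for this example, not data from the class).

```python
# Hypothetical values chosen only for illustration
energy = -11.3
name = 'water'

# `and` requires both conditions to be True
in_window = energy < 0 and energy > -20

# `or` requires at least one condition to be True
is_extreme = energy < -50 or energy > 50

# `not` flips True to False and vice versa
not_positive = not energy > 0

# Strings are compared with == and !=; `in` tests membership
is_water = name == 'water'
has_ate = 'ate' in name

print(in_window, is_extreme, not_positive, is_water, has_ate)
```

This should print `True False True True True`.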

## Exercise
The following list contains some floating point numbers and some numbers which have been saved as strings. 

In [27]:
data_list = ['-12.5', 14.4, 8.1, '42']

Set up a `for` loop to go over each element of `data_list`. If the element is a string (`str`), recast it as a float. Save *all* of the numbers to a new list called `number_list`. Pay close attention to your indentation!

In [32]:
number_list = []
for i in data_list:
    # Convert strings to floats; numbers pass through unchanged
    if type(i) is str:
        i = float(i)
    number_list.append(i)
print(number_list)

[-12.5, 14.4, 8.1, 42.0]


The `if` statement can handle more complicated logic: we can specify what to do if the condition is satisfied *and* what other thing to do otherwise. We can even check for more conditions and finally do something entirely different if none of them are satisfied. The `if/elif/else` structure comes in handy for this purpose.

In [34]:
molecules = ['water','benzene','DNA']
animals = ['dog','cat','octopus']
animals.append('shark')
thought = 'shark'
if thought in molecules:
    print("I'm thinking about a molecule. Its name is",thought,".")
elif thought in animals:
    print("I'm thinking about an animal. Its name is",thought,".")
else:
    print("I'm thinking about something. It is",thought,".")

I'm thinking about an animal. Its name is shark .


## Combining loops and logic
The `for` loop goes through a code block a predetermined number of times: for example, `for x in range(5):` will always execute the next code block exactly 5 times. Sometimes we want to be more nimble and decide on the fly if we want to keep doing the loop or not, by checking some condition like in an `if` statement. For this purpose, we need a `while` loop.

In the following example, to count down the seconds until launch, we will import the `time` module so that we can use the `time.sleep()` command. What does it do? Yes, you guessed right! It does nothing for a given number of seconds.

In [35]:
import time
t = 10
while t>0:
    print(t,"seconds until launch.")
    t -=1
    time.sleep(1.0)
print("Launch!!!")

10 seconds until launch.
9 seconds until launch.
8 seconds until launch.
7 seconds until launch.
6 seconds until launch.
5 seconds until launch.
4 seconds until launch.
3 seconds until launch.
2 seconds until launch.
1 seconds until launch.
Launch!!!


Want to launch earlier? No problem, we can `break` out of the `while` loop:

In [None]:
t = 10
while t>0:
    print(t,"seconds until launch.")
    if t<4.0:
        print("Launch accelerated!")
        break
    t -=1
    time.sleep(1.0)
print("Launch!!!")

## One quirky feature of lists - be careful when copying them!
Let's make a simple list and call it `l1`:

In [None]:
l1 = list(range(10))
print(l1)

We will now assign the list to a new variable `l2` and then modify one of the elements of `l1`. Wait, what happened to `l2`?

In [None]:
l2 = l1
l1[4] = 100
print(l1)
print(l2)

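The assignment `l2 = l1` does not copy the list; it only makes `l2` a second name for the same list object. To prevent this behavior (unwanted 99% of the time), make an actual copy by slicing out the entire list: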

In [None]:
l1 = list(range(10))
l2 = l1[:]
l1[4] = 100
print(l1)
print(l2)
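To check whether two variables refer to the same list, the `is` operator can be used. The following sketch (the names `original`, `alias`, and `copied` are made up for this example) contrasts plain assignment with a slice copy; `original.copy()` would behave like the slice.

```python
original = [0, 1, 2, 3]
alias = original        # second name for the same list object
copied = original[:]    # new list with the same elements

original[0] = 100

print(alias is original)    # True: both names point to one list
print(copied is original)   # False: the slice made a separate list
print(alias)                # [100, 1, 2, 3]
print(copied)               # [0, 1, 2, 3]
```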

## Working with files
One of the most common tasks in research is analyzing data. Many computational chemistry programs output text files that include a large amount of information including text and data that you need to analyze. Often, you need to sort through the output file and identify particular pieces of information that are most important to you. In general, this is called file parsing.

## Working with file paths - the os.path module
For this section, we will be working with the file `ethanol.out` in the `outfiles` directory, which is inside the `data` directory.

To see what is in your current directory, go to a new cell and type `ls`. `ls` stands for ‘list’ and lists all of the contents of the current directory. It is not a Python command, but it works in the Jupyter notebook. To see everything in the `data` directory, type

In [None]:
ls data

In order to parse a file, you must tell Python the location of the file, or the “file path”. For example, you can see what folder your Jupyter notebook is in by typing `pwd` into a cell in your notebook and evaluating it. `pwd` stands for ‘print working directory’, and can also be used in your terminal to see what directory you’re in.

In [None]:
pwd

Notice that the file paths are different for different operating systems. The Windows system uses a backslash (`\`), while Mac and Linux use a forward slash (`/`) for filepaths.

When we write a script, we want it to be usable on any operating system. We will therefore use a Python module called `os.path`, which allows us to define file paths in a general way.

In order to get the path to the `ethanol.out` file in a general way, type

In [None]:
import os

ethanol_file = os.path.join('data', 'outfiles', 'ethanol.out')
print(ethanol_file)
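A few other functions from `os.path` can be handy when working with paths. The sketch below is only an illustration; it rebuilds the same path and takes it apart again, and the last line reports whether the file is present in your own working directory.

```python
import os

# Same call as above; the separator is chosen by the operating system
ethanol_file = os.path.join('data', 'outfiles', 'ethanol.out')

# os.sep is the separator used on the current system
print(os.sep)

# Take the path apart again
print(os.path.basename(ethanol_file))   # file name only
print(os.path.dirname(ethanol_file))    # everything before the file name

# Check whether the file is really there before trying to open it
print(os.path.exists(ethanol_file))
```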

## Reading a file
In Python, there are many ways to read in information from a text file. The best method to use depends on the type of data and the type of analysis you are performing. If you have a file with lots of different types of information, text and numbers, with different types of formatting, the most generic way to read in information is the `readlines()` function. Before you can read in a file, you have to open the file using the file path we defined above. This will create a file object, or filehandle. The file we will be analyzing in this example is a PSI4 output file for an SCF/cc-pVDZ energy calculation for an ethanol molecule.

In [None]:
outfile = open(ethanol_file,"r")
data = outfile.readlines()


This code opens a file for reading and assigns it to the filehandle `outfile`. The `r` argument in the function stands for read. Other arguments might be `w` for write if we want to write new information to the file (this erases whatever the file contained before), or `a` for append if we want to add new information at the end of the file.

In the next line, we use the `readlines()` function to get our file as a list of strings. Notice the dot notation introduced last lesson; readlines acts on the file object given right before the dot. The function creates a list called `data` where each element of the list is a string that is one line of the file. This is always how the `readlines()` function works.

Note that the `readlines()` function reads a file object to its end, so calling it a second time on the same file object returns an empty list. If you forget to assign the result of `outfile.readlines()` to a variable, you must open the file again in order to get the contents of the file.

After you open and read information from a file object, you should always close the file.

In [None]:
outfile.close()
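Python can also close the file for us. A common pattern, sketched below, is the `with` statement, which closes the file automatically when the indented block ends. To keep the sketch runnable anywhere, it first writes a small made-up file (`example.out` in a temporary directory, with invented contents) instead of using `ethanol.out`.

```python
import os
import tempfile

# Create a small stand-in file; its name and contents are made up
tmp_dir = tempfile.mkdtemp()
example_file = os.path.join(tmp_dir, 'example.out')
with open(example_file, 'w') as outfile:
    outfile.write('first line\nsecond line\n')

# Read it back; the file is closed when the with block ends
with open(example_file, 'r') as infile:
    lines = infile.readlines()
    second_read = infile.readlines()   # nothing left to read

print(lines)          # ['first line\n', 'second line\n']
print(second_read)    # []
print(infile.closed)  # True
```

The second call to `readlines()` returns an empty list, which is why the contents have to be saved to a variable the first time.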

## Exercise
Check that your file was read in correctly by determining how many lines are in the file.

## Searching for a pattern in your file
The file we opened is an output file which calculates the energy (and a lot of other stuff!) for an ethanol molecule. As stated previously, the `readlines()` function put the file contents into a list where each element is a line of the file. As we learned in the previous lesson, a `for` loop can be used to execute the same code repeatedly, so we can use one to iterate through the elements of this list.

Let’s take a look at what’s in the file.

In [None]:
for line in data:
    print(line)

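This will print every line of the file. Because each line already ends with a newline character and `print` adds another one, the output appears double-spaced.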

If you look through the output, you will see that the critical line says “Final Energy”. We want to search through this file and find that line, and print only that line. We can do this using an `if` statement.

Returning to our file example,

In [None]:
for line in data:
    if 'Final Energy' in line:
        energy_line = line
        print(energy_line)

Remember that `readlines()` saves each line of the file as a string, so `energy_line` is a string that contains the whole line. For our analysis, if we are most interested in the energy, we need to split up the line so we can save just the number as a different variable name. To do this, we use a new function called `split`. The `split` function takes a string and divides it into its components using a delimiter.

The delimiter is specified as an argument to the function (put in the parentheses `()`). If you do not specify a delimiter, the string is split on whitespace (spaces, tabs, and newlines) by default. Let’s try this out.

In [None]:
energy_line.split()

Or, we can use the colon (‘:’) as the delimiter.

In [None]:
energy_line.split(':')

When we use ‘:’ as the delimiter, a list with two elements is returned. It is split where a colon was found.
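Here is how `split` behaves on a stand-alone string. The line below is invented for this sketch (the label and the number are not taken from `ethanol.out`).

```python
# A made-up line, similar in shape to a line of an output file
example_line = '  Total Charge:   0.5 \n'

# No argument: split on any run of whitespace, dropping empty pieces
print(example_line.split())       # ['Total', 'Charge:', '0.5']

# With ':' as delimiter: whitespace is kept inside the pieces
print(example_line.split(':'))    # ['  Total Charge', '   0.5 \n']
```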

We can save the output of this function to a variable as a new list. In the example below, we take the line we found in the `for` loop and split it up into its individual words.

In [None]:
words = energy_line.split()
print(words)

From this `print` statement, we now see that we have a list called `words`, where we have split `energy_line`. The energy is the fourth, and here also the last, element of this list, so we can save it as a new variable by using the index `-1`.

In [None]:
energy = words[-1]
print(energy)

If we now try to do a math operation on `energy`, we get an error message. Why do you think that is?

In [None]:
energy + 50

Try to change the definition of `energy` so that you can add to that number.
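If you get stuck, recall the recasting we did with `data_list` earlier. The sketch below shows the idea on a made-up number stored as text (the value is an assumption for illustration, not the energy from the file).

```python
# A made-up number stored as text, as split() would return it
energy_text = '-154.09'

# float() converts the string to a number that supports arithmetic
energy_value = float(energy_text)

print(type(energy_text))
print(type(energy_value))
print(energy_value + 50)
```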

## Exercise on File Parsing
Use the provided `sapt.out` file. In this output file, the program calculates the interaction energy for an ethene-ethyne complex. The output reports four interaction energy components: electrostatics, exchange, induction, and dispersion. Parse each of these energies, in kcal/mol, from the output file. (Hint: study the file in a text editor to help you decide what to search for.) Calculate the total interaction energy by adding the four components together. Your code’s output should look something like this:

`Electrostatics : -2.25850118 kcal/mol`

`Exchange : 2.27730198 kcal/mol`

`Induction : -0.5216933 kcal/mol`

`Dispersion : -0.9446677 kcal/mol`

`Total Energy : -1.4475602 kcal/mol`

## Searching for a particular line number in your file
There is a lot of other information in the output file we might be interested in. For example, we might want to pull out the initial coordinates for the molecule. If we look through the file in a text editor, we notice that the coordinates begin with a line that says

`Center X Y Z Mass`

and then the coordinates begin on the next line. In this case, we don’t want to pull something out of this line, as we did in our previous example, but we want to know which line of the file this is so that we can then pull the coordinates from the next few lines.

When you use a `for` loop, it is easy to have Python keep track of the line numbers using the `enumerate` function. The general syntax is

`for linenum, line in enumerate(list_name):`

`    do things in the loop`

In this notation, there are now two variables you can use in your loop commands. The variable `linenum` (which can be named something else) keeps track of which iteration of the loop you are on, in this case which line of the file. The variable `line` (which could also be named something else) functions exactly as it did before, holding the actual information from the list. Finally, instead of just giving the list name, you use `enumerate(list_name)`.

This block of code searches our file for the line that contains “Center” and reports the line number.

In [None]:
for linenum, line in enumerate(data):
    if 'Center' in line:
        print(linenum)
        print(line)

Now we know that this is line 77 in our file (remember that you start counting at zero!).
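The same pattern can be tried on a short made-up list, which makes the counting from zero easy to see. The list contents and the variable `center_line` below are invented for this sketch.

```python
# A short made-up list standing in for the lines of a file
example_lines = ['header', 'Center X Y Z', 'C 0.0 0.0 0.0']

center_line = None
for linenum, line in enumerate(example_lines):
    print(linenum, line)
    if 'Center' in line:
        # Remember the index so it can be used after the loop
        center_line = linenum

print(center_line)   # 1, because counting starts at zero
```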

## Check Your Understanding

What would be printed if you entered the following:

In [None]:
print(data[77])
print(data[78])
print(data[79])
print(data[80])
print(data[81])

# Advanced string operations
## Is this reaction balanced?
Here's the combustion reaction for ethanol, given as a string variable. Can we process the string and find out if this reaction is balanced?

In [None]:
reaction = 'C2H5OH + 3O2 -> 2CO2 + 3H2O'

This does not look like a simple task, but we will try to break it apart into a sequence of easier steps. First, it would be nice to parse the string to identify a list of `reactants` and a list of `products`. The `split` method will come in handy. Can you make these two lists?

Let's now pick one of the molecules from the list of `reactants` or `products`. The first question we might want to answer is: do we have one molecule of that type, or is there a stoichiometric coefficient in front of it? Write the code that will find the integer `coefficient` for this molecule. The code has to be aware that, if a specific number is not given, the `coefficient` is 1.

You can do a lot of the same things with strings as you do with lists; in particular, you can slice them with the same `[m:n]` operator, and you can iterate over them: `for x in string_variable:` will run over the individual characters in a string, one at a time. For a single character `x`, you have some functions to find out what character it is:

* `x.isdigit()` will be `True` if the character `x` is a digit (0,...,9) and `False` otherwise.
* `x.isupper()` will be `True` if the character `x` is an uppercase letter and `False` otherwise.
* `x.islower()` will be `True` if the character `x` is a lowercase letter and `False` otherwise.
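As a quick illustration of slicing a string and testing its characters, the sketch below uses a made-up string (`'2NaCl'` is an assumption for this example and is not part of the ethanol reaction).

```python
# A made-up string mixing digits, uppercase and lowercase letters
text = '2NaCl'

print(text[0])      # first character: '2'
print(text[1:3])    # characters 1 and 2: 'Na'

for x in text:
    if x.isdigit():
        kind = 'digit'
    elif x.isupper():
        kind = 'uppercase letter'
    elif x.islower():
        kind = 'lowercase letter'
    else:
        kind = 'something else'
    print(x, 'is a', kind)
```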

How should we store the information about the atoms found and their numbers? I suggest creating a *dictionary*, which is a data type similar to a list, except that its entries are indexed by descriptions (keys) instead of numbers. Dictionaries are denoted by curly brackets `{}`.

This example code first creates an empty dictionary `atomic_masses` and then adds a few (rounded) atomic masses to it, indexed by the element symbol.

In [None]:
atomic_masses = {}
atomic_masses['H'] = 1
atomic_masses['He'] = 4
atomic_masses['C'] = 12
atomic_masses['O'] = 16
print(atomic_masses)

Adding new entries to a dictionary can be a little tricky, especially if we use the dictionary to count atoms: if we just set the entry to a new value, we forget the previous number of atoms of this kind instead of adding to it. Therefore, before modifying a dictionary entry, it's good to check if the entry already exists. The easiest way to do so is finding out if the entry is `in` the list of keys for the dictionary:

In [None]:
print(atomic_masses.keys())

The result of `dictionary.keys()` is, as you see, not strictly a list, but works like a list for practical purposes (and can be converted to a list). *Note: since Python 3.7, the keys are returned in the order in which they were inserted; older versions of Python did not guarantee any specific order.*

In [None]:
print(list(atomic_masses.keys()))
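The "check before you modify" pattern described above can be sketched on a simpler counting problem. Here we count letters in a made-up word (the names `word` and `letter_counts` are invented for this example); counting atoms follows the same logic.

```python
# Count how often each letter occurs in a made-up word
word = 'banana'
letter_counts = {}

for letter in word:
    if letter in letter_counts.keys():
        # Entry exists: add to the previous count
        letter_counts[letter] = letter_counts[letter] + 1
    else:
        # First time we see this letter: create the entry
        letter_counts[letter] = 1

print(letter_counts)   # {'b': 1, 'a': 3, 'n': 2}
```

Writing `if letter in letter_counts:` without `.keys()` performs the same check.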

Let's now go back to our problem of counting atoms in a molecule. You have a string `molecule` that contains its formula (with no stoichiometric coefficient left in front of it, assume that if we found a coefficient, we cut it out). Your task is to extract a dictionary `atoms` that contains the number of atoms of each type. For example, if `molecule = "C2H5OH"`, the resulting dictionary should be `atoms = {'H' : 6, 'C' : 2, 'O' : 1}`.

Now that you counted atoms in one molecule, can you count them in all the reactant molecules (multiplying the count by each stoichiometric coefficient)? Make sure that you don't lose atoms that are in one reactant molecule but not the other(s). Make dictionaries `allreactants` and `allproducts` listing the number of atoms of each type in *all* reactant/product molecules together.

We got to the last step! Are the numbers of atoms of each type in reactants and products the same? If they are the same for all atom types, the reaction is balanced!
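For the comparison, it may help to know that two dictionaries compare equal with `==` when they hold the same keys with the same values, regardless of the order in which the entries were added. The dictionaries below are made up for this sketch and do not come from the ethanol reaction.

```python
# Made-up atom counts, entered in different orders
left = {'C': 1, 'O': 2}
right = {'O': 2, 'C': 1}
other = {'C': 1, 'O': 1}

print(left == right)   # True: same keys with the same values
print(left == other)   # False: the O counts differ
```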