# Introduction to Python programming for MPECDT
### [Gerard Gorman](http://www.imperial.ac.uk/people/g.gorman), [Christian Jacobs](http://www.imperial.ac.uk/people/c.jacobs10)
### Modified for MPECDT by [David Ham](http://www.imperial.ac.uk/people/david.ham)

# Lecture 5: Files, strings, and dictionaries

Learning objectives: You will learn how to:

* Work with Python programs in files.
* Read data in from a file.
* Parse strings to extract specific data of interest.
* Use dictionaries to index data using any type of key.

## Reading data from a plain text file
We can read text from a [text file](http://en.wikipedia.org/wiki/Text_file) into strings in a program. This is a common (and simple) way for a program to get input data. The basic recipe is:
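
The recipe can be sketched as follows. This is a minimal sketch rather than the course's own code: the file name `sample.txt` and its contents are made up so that the example runs on its own.

```python
import os
import tempfile

# Create a small sample file so that the sketch is self-contained
# (the file name and contents are invented for illustration).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as outfile:
    outfile.write("first line\nsecond line\n")

# The recipe itself: open the file, loop over its lines, close it.
infile = open(path, "r")
lines = []
for line in infile:
    lines.append(line.rstrip("\n"))  # do something with each line
infile.close()

print(lines)
```

Each `line` delivered by the loop is a string that still ends with the newline character, which is why it is stripped here before being stored.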

Let's look at an example. You will have downloaded the file data1.txt in the data folder along with these lecture notes. The file contains a column of numbers:

### Running a program in a file.

The goal is to read this file and calculate the mean. This time, instead of typing code in an IPython notebook, we're going to use a Python program written in a file on disk. This is the most common way of writing programs and enables us to compose together pieces of code to undertake more complex operations than those which can easily be typed in a few lines in a notebook. 

First, let's look at the Python program we've written. It's in the repository you cloned, but instead of being in the `notepad` directory, it's in the `src` directory. It's called `mean.py`. Files containing Python code usually have the ending `.py`. This tells editors and other applications to treat the file as a Python application. Open the file in a text editor (gedit is a good choice if you don't already have a text editor). The code is exactly what you would type in a notebook for this task.

Now let's run the program. Open a terminal and change to the `src` directory. Then type:

```
ipython3 mean.py
```

You should expect the program to print out 20.95, which is the mean of the numbers above.
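
For orientation, a program of this kind might look like the sketch below. It is not the contents of `mean.py`: the file name `numbers.txt` and the three values are invented so that the block is self-contained.

```python
import os
import tempfile

# Hypothetical input file with one number per line (values are made up).
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as outfile:
    outfile.write("1.5\n2.5\n5.0\n")

# Read the numbers into a list, converting each line from text to float.
numbers = []
infile = open(path, "r")
for line in infile:
    numbers.append(float(line))
infile.close()

mean = sum(numbers) / len(numbers)
print(mean)
```

Note that `float` ignores the newline at the end of each line, so no explicit stripping is needed here.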

### Reading more complex data.
Let's make this example more interesting. There is a **lot** of data out there for you to discover all kinds of interesting facts - you just need to be interested in doing a little analysis. For this case I have downloaded tidal gauge data for the port of Avonmouth from the [BODC](http://www.bodc.ac.uk/). If you look at the header of file data/2012AVO.txt you will see the [metadata](http://en.wikipedia.org/wiki/Metadata):

The program `tides.py` reads the column ASLVTD02 (the surface elevation) and plots it. Open it in your text editor to study what it does, and then run it:

```
ipython3 tides.py
```

Quiz time:

* What tidal constituents can you identify by looking at this plot?
* Is this primarily a diurnal or semi-diurnal tidal region? (hint - change the x-axis range on the plot above).

You will notice in the above example that we used the *split()* string member function. This is a very useful function for grabbing individual words on a line. When called without any arguments it splits on any run of whitespace (spaces, tabs or newlines). However, you can also pass any other [delimiter](http://en.wikipedia.org/wiki/Delimiter) explicitly, *e.g.*, *line.split(';')*, *line.split(':')*.
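
A short illustration of the two calling styles. The sample strings are made up for this sketch and are not taken from the data files.

```python
# A made-up line in the spirit of a data file: date, time and a value
# separated by varying amounts of whitespace.
line = "2012/01/01  00:15:00     3.541"

# No argument: split on any run of whitespace.
words = line.split()
print(words)

# Explicit delimiter: split on every occurrence of that exact string.
fields = "Oslo;London;Paris".split(";")
print(fields)

# Note the difference: an explicit single space yields an empty string
# wherever two spaces are adjacent.
pieces = "a  b".split(" ")
print(pieces)
```

The last call shows why the argument-free form is usually the better choice for columns aligned with a varying number of spaces.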

## <span style="color:blue">Exercise 1: Read a two-column data file</span>
The file *data/xy.dat* contains two columns of numbers, corresponding to *x* and *y* coordinates on a curve. The start of the file looks like this:

-1.0000   -0.0000<br>
-0.9933   -0.0087<br>
-0.9867   -0.0179<br>
-0.9800   -0.0274<br>
-0.9733   -0.0374<br>

Make a program that reads the first column into a list *x* and the second column into a list *y*. Then convert the lists to arrays, and plot the curve. Print out the maximum and minimum y coordinates. (Hint: Read the file line by line, split each line into words, convert to float, and append to *x* and *y*.)

**Don't forget to commit your program to your git repository**

## <span style="color:blue">Exercise 2: Read a data file</span>
The files data/density_water.dat and data/density_air.dat contain data about the density of water and air (respectively) for different temperatures. The data files have some comment lines starting with # and some lines are blank. The rest of the lines contain density data: the temperature in the first column and the corresponding density in the second column. The goal of this exercise is to read the data in such a file and plot the density versus the temperature as distinct (small) circles for each data point (you might need to refer back to the documentation for the [plot](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot) function). Let the program take the name of the data file via *input()*. Apply the program to both files.

## <span style="color:blue">Exercise 3: Read acceleration data and find velocities</span>
A file data/acc.dat contains measurements $a_0, a_1, \ldots, a_{n-1}$ of the acceleration of an object moving along a straight line. The measurement $a_k$ is taken at time point $t_k = k\Delta t$, where $\Delta t$ is the time spacing between the measurements. The purpose of the exercise is to load the acceleration data into a program and compute the velocity $v(t)$ of the object at some time $t$.

In general, the acceleration $a(t)$ is related to the velocity $v(t)$ through $v^\prime(t) = a(t)$. This means that

$$
v(t) = v(0) + \int_0^t{a(\tau)d\tau}
$$

If $a(t)$ is only known at some discrete, equally spaced points in time, $a_0, \ldots, a_{n-1}$ (which is the case in this exercise), we must compute the integral above numerically, for example by the Trapezoidal rule:

$$
v(t_k) \approx \Delta t \left(\frac{1}{2}a_0 + \frac{1}{2}a_k + \sum_{i=1}^{k-1}a_i \right), \ \ 1 \leq k \leq n-1. 
$$

We assume $v(0) = 0$ so that also $v_0 = 0$.
Read the values $a_0, \ldots, a_{n-1}$ from file into an array, plot the acceleration versus time, and use the Trapezoidal rule to compute one $v(t_k)$ value, where $\Delta t$ and $k \geq 1$ are specified using *input()*.

## <span style="color:blue">Exercise 4: Read acceleration data and plot velocities</span>
The task in this exercise is the same as the one above, except that we now want to compute $v(t_k)$ for all time points $t_k = k\Delta t$ and plot the velocity versus time. Repeated use of the Trapezoidal rule for all $k$ values is very inefficient. A more efficient formula arises if we add the area of a new trapezoid to the previous integral:

$$
v(t_k) = v(t_{k-1}) + \int_{t_{k-1}}^{t_k}a(\tau)\ d\tau \approx v(t_{k-1}) + \Delta t \frac{1}{2}\left(a_{k-1} + a_k\right), 
$$

for $k = 1, 2, \ldots, n-1$, while $v_0 = 0$. Use this formula to fill an array *v* with velocity values. Now only $\Delta t$ is given via *input()*, and the $a_0, \ldots, a_{n-1}$ values must be read from file as in the previous exercise.
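
To see where this update comes from, one can subtract the Trapezoidal sum for $k-1$ from the sum for $k$. This is a short check, written for $k \geq 2$; for $k = 1$ the two formulas coincide because $v_0 = 0$. All terms cancel except those involving $a_{k-1}$ and $a_k$:

$$
v(t_k) - v(t_{k-1}) \approx \Delta t\left(\frac{1}{2}a_k - \frac{1}{2}a_{k-1} + a_{k-1}\right) = \Delta t\,\frac{1}{2}\left(a_{k-1} + a_k\right).
$$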

## Python dictionaries
Suppose we need to store the temperatures in Oslo, London and Paris. The Python list solution might look like:

In [None]:
temps = [13, 15.4, 17.5]
# temps[0]: Oslo
# temps[1]: London
# temps[2]: Paris

In this case we need to remember the mapping between the index and the city name. It would be easier to specify the name of the city to get the temperature. Containers such as lists and arrays use a series of consecutive integers to index elements. However, for many applications such an integer index is not useful.

**Dictionaries** are containers for which any immutable Python object can be used
as an index. Let's rewrite the previous example using a Python dictionary:

In [None]:
temps = {"Oslo": 13, "London": 15.4, "Paris": 17.5}
print("The temperature in London is", temps["London"])

Add a new element to a dictionary:

In [None]:
temps["Madrid"] = 26.0
print(temps)

Loop (iterate) over a dictionary:

In [None]:
for city in temps:
    print("The temperature in %s is %g" % (city, temps[city]))

The index in a dictionary is called the **key**. A dictionary is said to hold key–value pairs. So in general:
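
The general pattern might be sketched as below, where the dictionary name `d` and its keys and values are placeholders chosen for illustration:

```python
# Create a dictionary from key: value pairs (placeholder content).
d = {"key1": 1.0, "key2": 2.0}

# Look up the value stored under a key.
print(d["key1"])

# Assigning to a key adds it if it is new, or overwrites it otherwise.
d["key3"] = 3.0
d["key1"] = -1.0
print(d)

# Keys need not be strings: any immutable object, e.g. a tuple, works.
d[(0, 1)] = "a tuple as key"
print(len(d))
```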

Does the dictionary have a particular key (*i.e.* a particular data entry)?

In [None]:
if "Berlin" in temps:
    print("We have Berlin and its temperature is ", temps["Berlin"])
else:
    print("Me no can give you Berlin hot or cold.")

In [None]:
print("Oslo" in temps) # i.e. standard boolean expression

The keys and values can be obtained with the *keys()* and *values()* methods, which in Python 3 return view objects that can be iterated over:

In [None]:
print("Keys = ", temps.keys())
print("Values = ", temps.values())

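Note that the keys are **not sorted**! Since Python 3.7 a dictionary remembers the order in which keys were inserted, while older versions make no guarantee about the order at all. If you need a specific order of the keys then you should explicitly sort: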

In [None]:
for key in sorted(temps):
    value = temps[key]
    print(key, value)

Remove Oslo key:value:

In [None]:
del temps["Oslo"] # remove Oslo key w/value
print(temps, len(temps))

In a manner similar to what we saw for arrays, two variable names can refer to the same dictionary:

In [None]:
t1 = temps
t1["Stockholm"] = 10.0
print(temps)

So we can see that while we modified *t1*, the *temps* dictionary was also changed.
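
If an independent dictionary is wanted, one option is the `copy` method of dictionaries, which creates a new (shallow) dictionary holding the same key-value pairs. A small sketch with made-up data:

```python
temps = {"London": 15.4, "Paris": 17.5}

# t1 is just another name for the same dictionary object...
t1 = temps
# ...whereas t2 is a separate dictionary with the same contents.
t2 = temps.copy()

t2["Stockholm"] = 10.0
print("Stockholm" in temps)      # temps is unaffected by changes to t2
print(t1 is temps, t2 is temps)
```

The `is` operator tests whether two names refer to the very same object, which makes the difference between `t1` and `t2` explicit.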

Let's look at a simple example of reading the same data from a file and putting it into a dictionary. We will be reading the file *data/deg2.dat*.

In [None]:
infile = open("../data/deg2.dat", "r")
# Start with empty dictionary
temps = {}             
for line in infile:
    # If you examine the file you will see a ':' after the city name,
    # so let's use this as the delimiter for splitting the line.
    city, temp = line.split(":") 
    temps[city] = float(temp)
infile.close()
print(temps)

## <span style="color:blue">Exercise 5: Make a dictionary from a table</span>
The file *data/constants.txt* contains a table of the values and the dimensions of some fundamental constants from physics. We want to load this table into a dictionary *constants*, where the keys are the names of the constants. For example, *constants['gravitational constant']* holds the value of the gravitational constant (6.67259 $\times$ 10$^{-11}$) in Newton's law of gravitation. Make a function that reads and interprets the text in the file, and thereafter returns the dictionary.

## <span style="color:blue">Exercise 6: Explore syntax differences: lists vs. dictionaries</span>
Consider this code:

In [None]:
t1 = {}
t1[0] = -5
t1[1] = 10.5

Explain why the lines above work fine while the ones below do not:

In [None]:
t2 = []
t2[0] = -5
t2[1] = 10.5

What must be done in the last code snippet to make it work properly?

## <span style="color:blue">Exercise 7: Compute the area of a triangle</span>
An arbitrary triangle can be described by the coordinates of its three vertices: $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, numbered in a counterclockwise direction. The area of the triangle is given by the formula:

$A = \frac{1}{2}|x_2y_3 - x_3y_2 - x_1y_3 + x_3y_1 + x_1y_2 - x_2y_1|.$

Write a function *area(vertices)* that returns the area of a triangle whose vertices are specified by the argument vertices, which is a nested list of the vertex coordinates. For example, vertices can be [[0,0], [1,0], [0,2]] if the three corners of the triangle have coordinates (0, 0), (1, 0), and (0, 2).

Then, assume that the vertices of the triangle are stored in a dictionary and not a list. The keys in the dictionary correspond to the vertex number (1, 2, or 3) while the values are 2-tuples with the x and y coordinates of the vertex. For example, in a triangle with vertices (0, 0), (1, 0), and (0, 2) the vertices argument becomes:
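
Written out as a Python dictionary, following the description above, this could look like the sketch below (the variable name `vertices` simply mirrors the argument name used in the exercise text):

```python
# Keys are the vertex numbers, values are (x, y) tuples.
vertices = {1: (0, 0), 2: (1, 0), 3: (0, 2)}
print(vertices[3])
```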

## String manipulation
Text in Python is represented as **strings**. Programming with strings is therefore the key to interpreting text in files (*i.e.* **parsing**) and constructing new text. First we show some common string operations and then we apply them to real examples. Our sample string used for illustration is:

In [None]:
s = "Berlin: 18.4 C at 4 pm"

Strings behave much like lists/tuples - they are simply a sequence of characters:

In [None]:
print("s[0] = ", s[0])
print("s[1] = ", s[1])

Substrings are obtained by slicing, just as for lists and arrays:

In [None]:
# from index 8 to the end of the string
print(s[8:])

In [None]:
# index 8, 9, 10 and 11 (not 12!)
print(s[8:12])

In [None]:
# from index 8 up to, but not including, the 8th character from the end
print(s[8:-8])

You can also find the start of a substring:

In [None]:
# where does "Berlin" start?
print(s.find("Berlin"))

In [None]:
print(s.find("pm"))

In [None]:
print(s.find("Oslo"))

In this last example, "Oslo" does not occur in the string, so the return value is -1.

We can also check if a substring is contained in a string:

In [None]:
print("Berlin" in s)

In [None]:
print("Oslo" in s)

In [None]:
if "C" in s:
    print("C found")
else:
    print("C not found")

### Search and replace
Strings also support substituting a substring by another string. In general this looks like *s.replace(s1, s2)*, which replaces string *s1* in *s* by string *s2*, *e.g.*:

In [None]:
s = s.replace(" ", "_")
print(s)

In [None]:
s = s.replace("Berlin", "Bonn")
print(s)

In [None]:
# Replace the text before the first colon by 'London'
s = s.replace(s[:s.find(":")], "London")
print(s)

Notice that in all these examples we assign the new result back to *s*. One of the reasons we are doing this is that strings are actually constant (*i.e.* immutable) and therefore cannot be modified *in place*. We **cannot** write for example:
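
A minimal sketch of what goes wrong, using a fresh copy of the sample string so that the example is self-contained. Assigning to a single character raises a `TypeError`, which is caught here so that the cell still runs to the end.

```python
s = "Berlin: 18.4 C at 4 pm"

try:
    # Attempt to overwrite one character in place.
    s[18] = "5"
except TypeError as err:
    print("TypeError:", err)

# Build a new string instead and bind the name s to it.
s = s[:18] + "5" + s[19:]
print(s)
```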

We also encountered examples above where we used the split function to break up a line into separate substrings for a given separator (where whitespace is the default delimiter). Sometimes we want to split a string into lines - *i.e.* the delimiter is the line ending (loosely, the [carriage return](http://en.wikipedia.org/wiki/Carriage_return)). This can be surprisingly tricky because different computing platforms (*e.g.* Windows, Linux, Mac) use different characters to mark the end of a line. For example, Unix uses '\n' while Windows uses '\r\n'. Luckily Python provides a *cross platform* way of doing this so regardless of what platform created the data file, or what platform you are running Python on, it will do the *right thing*:

In [None]:
t = "1st line\n2nd line\n3rd line"
print("""original t =
""", t)

In [None]:
# This works here but will give you problems if you are switching
# files between Windows and either Mac or Linux.
print(t.split("\n"))

In [None]:
# Cross platform (i.e. better) solution
print(t.splitlines())
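
The difference only becomes visible with line endings other than "\n". As a sketch, here is the same text with Windows-style endings (the variable name `t_windows` is made up for this illustration):

```python
# The same three lines with Windows-style line endings ("\r\n").
t_windows = "1st line\r\n2nd line\r\n3rd line"

# Splitting on "\n" leaves a stray "\r" at the end of the first two items.
naive = t_windows.split("\n")
print(naive)

# splitlines() recognises both conventions and removes the line endings.
clean = t_windows.splitlines()
print(clean)
```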

### Stripping off leading/trailing whitespace
When processing text from a file and composing new strings, we frequently need to trim leading and trailing whitespace:

In [None]:
s = "        text with leading and trailing spaces          \n"
print("-->%s<--"%s.strip())

In [None]:
# left strip
print("-->%s<--"%s.lstrip())

In [None]:
# right strip
print("-->%s<--"%s.rstrip())

### join() (the opposite of split())
We can join a list of substrings to form a new string. Similarly to *split()* we put strings together with a delimiter in between:

In [None]:
strings = ["Newton", "Secant", "Bisection"]
print(", ".join(strings))

You can prove to yourself that these are inverse operations:
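
One way to check this is the sketch below, which joins a list and then splits the result with the same delimiter. The variable names are chosen for illustration only.

```python
strings = ["Newton", "Secant", "Bisection"]
delimiter = ", "

joined = delimiter.join(strings)        # list -> string
split_again = joined.split(delimiter)   # string -> list

print(joined)
print(split_again == strings)
```

The round trip only recovers the original list as long as the delimiter does not itself occur inside one of the strings.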

As an example, let's split off the first two words on a line:

In [None]:
line = "This is a line of words separated by space"
words = line.split()
print("words = ", words)
line2 = " ".join(words[2:])
print("line2 = ", line2)

## <span style="color:blue">Exercise 8: Improve a program</span>
The file *data/densities.dat* contains a table of densities of various substances measured in g/cm$^3$. The following program reads the data in this file and produces a dictionary whose keys are the names of substances, and the values are the corresponding densities.

In [None]:
def read_densities(filename):
    infile = open(filename, 'r')
    densities = {}
    for line in infile:
        words = line.split()
        density = float(words[-1])
    
        if len(words[:-1]) == 2:
            substance = words[0] + ' ' + words[1]
        else:
            substance = words[0]
        
        densities[substance] = density
    
    infile.close()
    return densities

densities = read_densities('../data/densities.dat')

One problem we face when implementing the program above is that the name of the substance can contain one or two words, and maybe more words in a more comprehensive table. The purpose of this exercise is to use string operations to shorten the code and make it more general. Implement the following two methods in separate functions in the same program, and check that they give the same result.

1. Let *substance* consist of all the words but the last, using the join method in string objects to combine the words.
2. Observe that all the densities start in the same column of the file and use substrings to divide each line into two parts. (Hint: Remember to strip the first part such that, e.g., the density of ice is obtained as *densities['ice']* and not *densities['ice     ']*.)

## File writing
Writing a file in Python is simple. You just collect the text you want to write in one or more strings and, for each string, use a statement along the lines of
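
A minimal sketch of such a statement, assuming a file object called `outfile` that has been opened for writing. The temporary file name is made up so that the example does not clutter your working directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file

outfile = open(path, "w")
s = "some text"
outfile.write(s)     # write the string exactly as it is
outfile.close()

# Read the file back to confirm what was written.
with open(path) as infile:
    content = infile.read()
print(repr(content))
```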

The write function does not add a newline character so you may have to do that explicitly:
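
For instance, one might append the newline to each string before writing it, as in this sketch (file name and strings are again made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example_lines.txt")  # hypothetical file

outfile = open(path, "w")
for s in ["first", "second"]:
    outfile.write(s + "\n")   # append the newline ourselves
outfile.close()

# Read the file back; repr() makes the newline characters visible.
with open(path) as infile:
    content = infile.read()
print(repr(content))
```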

That’s it! Compose the strings and write! Let's do an example. Write a nested list (table) to a file:

In [None]:
# Let's define some table of data
data = [[ 0.75,        0.29619813, -0.29619813, -0.75      ],
        [ 0.29619813,  0.11697778, -0.11697778, -0.29619813],
        [-0.29619813, -0.11697778,  0.11697778,  0.29619813],
        [-0.75,       -0.29619813,  0.29619813,  0.75      ]]

# Open the file for writing. Notice the "w" indicates we are writing!
outfile = open("tmp_table.dat", "w")
for row in data:
    for column in row:
        outfile.write("%14.8f" % column)
    outfile.write("\n")   # ensure newline
outfile.close()

And that's it - run the above cell and take a look at the file that was generated in the folder you run IPython from.

## <span style="color:blue">Exercise 9: Write function data to a file</span>
We want to dump $x$ and $f(x)$ values to a file named function_data.dat, where the $x$ values appear in the first column and the $f(x)$ values appear in the second. Choose $n$ equally spaced $x$ values in the interval [-4, 4]. Here, the function $f(x)$ is given by:

$f(x) = \frac{1}{\sqrt{2\pi}}\exp(-0.5x^2)$

Write a program which does this. **Don't forget to commit it to your repository.**