# Data Munging Workshop

Up to this point on the course we have been working with nice, clean, structured datasets. This is all very well, but not very realistic. In the real world, datasets will arrive at your door in bad shape: full of errors, null values and unreliable readings. You need to know how to approach these datasets, how to understand their limitations, and how to make repairs where necessary.

This week's workshop will explore how we can use tools in Python to import, clean, analyse and export new datasets. Specifically we'll be working with the **Pandas Data Analysis Library** (http://pandas.pydata.org), which you may have used elsewhere. During this tutorial, we'll work through an example data cleaning process, finishing with uploading the data directly to your MySQL Database. You'll learn how to approach new datasets, and further expand your skills in using Pandas.

So without further ado, the first thing we need to do is set up our working environment. **Run the scripts below to import the Pandas libraries.**

In [20]:
# import libraries, and set pd as the pandas alias
import pandas as pd

# run this command too - just to allow more data to be displayed than default
pd.set_option('display.max_rows', 200)
# this one ensures graphs properly display in the notebook
%matplotlib inline

## Data Import

Pandas has a range of functions for importing data into Python. The import functions are relatively straightforward to use, and require very little in the way of actual coding. There are also functions available for a range of different data formats.

In this initial section of the workshop, we will explore how to import CSV data using Pandas. Other data formats (e.g. Excel, JSON, HTML) can be imported with similar ease; the documentation for these tools can be found here: http://pandas.pydata.org/pandas-docs/stable/io.html

An important thing to know about Pandas is that it will always load your data into its own DataFrame format. This format basically acts as a multicolumn table, and so will eventually make loading our cleaned data into MySQL all the easier. However, before we get to that stage, we need to load the data in, check it for problems and fix it up.
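As a quick illustration of the loading step, here is a minimal sketch that reads CSV text from memory rather than from a file. The three rows and the name `sample` are assumptions made for this example only; they are written in the style of the air quality file, not taken from it.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (illustrative rows only)
csv_text = """Site,Species,Value
CD9,NO,63.0
CD9,NO,
CD9,NO2,41.5
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the result is a DataFrame
print(sample.shape)   # (number of rows, number of columns)
print(sample)         # the empty Value field is loaded as NaN
```

Note how the empty field becomes a null (`NaN`) value on import, which is exactly the kind of gap we will be looking for in the real data.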

During this part of the workshop we'll be working with air quality data obtained from the *London Air Quality Network* (http://www.londonair.org.uk). Understanding air quality is clearly a very important element of ensuring wellbeing in urban areas. However, the data is highly prone to errors and mistakes, and so requires careful handling and analysis.

We will use a dataset originally obtained from London Air, containing a year's worth of air quality data recorded on Euston Road. You can find this dataset on Moodle, named `'LaqnData_EustonRoad.csv'`.

**Download it and put it somewhere on your computer where you can access it. Add the directory to the command below and run the command.**

In [21]:
# REPLACE THE DIRECTORY BELOW WITH THE LOCATION OF YOUR FILE!

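# this command loads your csv data and sets up the 'smog' dataframe, using the Pandas (pd) libraries
# the r prefix makes this a raw string, so the backslashes in a Windows path are not read as escape characters
smog = pd.read_csv(r'D:\Xin\MSc Smart Cities and Urban Analytics\Modules\BENVGSC4\Week 4\LaqnData_EustonRoad.csv')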

We've now imported the data as a dataframe called `'smog'`.

## Initial Data Checks

You can start by checking the import by looking at the data itself - to do this you simply call the dataframe by name. **Look at the data using the command below, what does it tell you about what we have?**

In [22]:
smog

Unnamed: 0,Site,Species,ReadingDateTime,Value,Units,Provisional or Ratified
0,CD9,NO,01/01/2014 00:00,63.0,ug m-3,R
1,CD9,NO,01/01/2014 00:15,64.3,ug m-3,R
2,CD9,NO,01/01/2014 00:30,57.6,ug m-3,R
3,CD9,NO,01/01/2014 00:45,87.8,ug m-3,R
4,CD9,NO,01/01/2014 01:00,55.3,ug m-3,R
5,CD9,NO,01/01/2014 01:15,92.5,ug m-3,R
6,CD9,NO,01/01/2014 01:30,75.6,ug m-3,R
7,CD9,NO,01/01/2014 01:45,,ug m-3,R
8,CD9,NO,01/01/2014 02:00,,ug m-3,R
9,CD9,NO,01/01/2014 02:15,77.0,ug m-3,R


This gives you a sample of the data, but it may be that you want to look at only the first 10 rows. You can do this by adding the `.head(n)` function to the end of the dataframe name, replacing `n` with the number of rows you wish to see. **Try this out below.**

smog.head()

This gives us a sense of what the data looks like, but doesn't provide an idea of how complete it is across the whole dataset. The `.count()` function provides counts of *non-null* values in each column. **Try running this below.**

In [23]:
smog.count()

Site                       175200
Species                    175200
ReadingDateTime            175200
Value                      110233
Units                      175200
Provisional or Ratified    175200
dtype: int64

It would appear that we have a few null values to deal with in the `Value` column. We'll come on to that later.
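If you would rather count the nulls directly than infer them from `.count()`, one option is `.isna()`. This is a minimal sketch on a hypothetical four-row frame called `toy`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing readings
toy = pd.DataFrame({
    'Species': ['NO', 'NO', 'NO2', 'NO2'],
    'Value': [63.0, np.nan, 41.5, np.nan],
})

non_null = toy.count()                    # non-null values per column
nulls = toy.isna().sum()                  # null values per column
share_null = toy['Value'].isna().mean()   # fraction of Value that is null

print(non_null)
print(nulls)
print(share_null)
```

The fraction is often easier to reason about than a raw count when the dataset is large.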

Before we move on though, it would be worth exploring variation in the data we have imported. From the sample loaded earlier, it would appear we have a number of categorical columns, and it would be useful to know how many rows correspond to each category.

To do this, we use the `.value_counts()` function. In calling a function against a column, Pandas allows us to reference the column name directly within the function. This structure requires the dataframe (e.g. `smog`), the column name (e.g. `Site`), and the function name (e.g. `value_counts()`).

In [24]:
smog.Site.value_counts()

CD9    175200
Name: Site, dtype: int64

In [25]:
smog.Species.value_counts()

NO2      35040
PM2.5    35040
NO       35040
PM10     35040
NOX      35040
Name: Species, dtype: int64

In [26]:
smog.ReadingDateTime.value_counts()

09/08/2014 21:00    5
10/06/2014 01:30    5
05/09/2014 22:15    5
23/03/2014 16:00    5
09/11/2014 17:00    5
03/08/2014 04:45    5
18/12/2014 22:30    5
28/04/2014 01:15    5
07/09/2014 02:45    5
24/06/2014 12:30    5
20/07/2014 10:15    5
03/08/2014 03:30    5
21/03/2014 04:00    5
01/08/2014 01:30    5
26/10/2014 18:30    5
17/08/2014 05:30    5
11/11/2014 18:30    5
27/10/2014 00:00    5
10/06/2014 23:15    5
05/04/2014 20:30    5
12/01/2014 14:30    5
13/06/2014 22:15    5
16/05/2014 19:00    5
11/01/2014 14:00    5
27/10/2014 20:30    5
31/12/2014 04:00    5
02/05/2014 06:00    5
07/03/2014 19:45    5
31/03/2014 18:45    5
03/08/2014 13:00    5
06/08/2014 10:30    5
07/09/2014 18:15    5
31/07/2014 05:00    5
08/05/2014 15:45    5
26/10/2014 15:00    5
31/10/2014 23:45    5
15/11/2014 17:45    5
13/05/2014 16:45    5
16/07/2014 19:45    5
03/10/2014 16:30    5
21/08/2014 21:45    5
19/09/2014 09:00    5
15/08/2014 17:30    5
07/09/2014 17:30    5
10/10/2014 16:45    5
11/01/2014

**What do these results tell you about the dataset?**

Next run the same `.value_counts()` function for the `Units` column.

Now run it for the `'Provisional or Ratified'` column. **Hint**: Because of the space in the column name, we cannot call it directly, and so must use the `dataframe['column_name'].` format.

We might also want to test the co-occurrence of different values across different data columns. To do this we use the `.groupby()` function instead. This function takes one or more column names, and calling `.size()` on the result returns the number of rows in each group.

In [27]:
smog.groupby(['Species', 'Units']).size()

Species  Units        
NO       ug m-3           35040
NO2      ug m-3           35040
NOX      ug m-3 as NO2    35040
PM10     ug/m3            35040
PM2.5    ug m-3           35040
dtype: int64

Again, try this for the `'Species'` and `'Provisional or Ratified'` columns. **What do these queries tell you about the data?**

You should now be building a picture of what the data looks like and the associations between columns. 
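One distinction worth keeping in mind here is that `.size()` counts rows, while `.count()` counts only non-null values. The sketch below shows the difference on a hypothetical frame (`toy` and its values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five rows, two of them with missing values
toy = pd.DataFrame({
    'Species': ['NO', 'NO', 'NO2', 'NO2', 'NO2'],
    'Units': ['ug m-3'] * 5,
    'Value': [63.0, np.nan, 41.5, 40.2, np.nan],
})

# number of rows in each Species/Units combination (nulls included)
rows_per_group = toy.groupby(['Species', 'Units']).size()

# number of non-null readings for each Species
readings_per_species = toy.groupby('Species')['Value'].count()

print(rows_per_group)
print(readings_per_species)
```

Comparing the two results gives a quick per-category view of how complete the data is.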

The final checks we can do use the `.describe()` function. This provides some basic summary stats relating to variations in the column data. You run it in the same way as you did the `.value_counts()` function, by just calling it against a column name. **Try running this function for each column below.**

As you will see, different types of results are extracted for each column, some more meaningful than others. `.describe()` actually combines a number of statistical measures, including `.max()`, `.min()` and `.mean()`. The full range can be found here: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics
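To see why the output differs between columns, here is a small sketch comparing `.describe()` on a numeric column and on a text column. The frame `toy` is hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical toy data: one text column, one numeric column
toy = pd.DataFrame({
    'Species': ['NO', 'NO', 'NO2'],
    'Value': [63.0, 64.3, 57.6],
})

numeric_summary = toy['Value'].describe()   # count, mean, std, min, quartiles, max
text_summary = toy['Species'].describe()    # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

Numeric columns get distribution statistics, while text columns only get counts and the most frequent value.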

In running through the measures on this dataset, you would be rightfully surprised if you noticed that the `smog.ReadingDateTime.describe()` did not provide very useful data. This is because we have not yet set the correct data types. 

So, before we go on to more detailed organisation of the data, we need to specify how each column should be handled.

## Setting Data Types

As we saw in the last section, the statistics generated for certain columns were not as useful as we expected them to be. This is because we have not yet specified our data types, another important stage in the data cleaning process. 

So, now we've imported our data, we should check the data types assigned to each column of the imported dataset. We do this by reading the `.dtypes` attribute of the dataframe.

In [28]:
smog.dtypes

Site                        object
Species                     object
ReadingDateTime             object
Value                      float64
Units                       object
Provisional or Ratified     object
dtype: object

We can see that most of the columns have been imported as `object`, which here means text (strings). Based on the sample of the data above, this is satisfactory in some cases, but not all.

Given the issues noted earlier, let's first check out that date field. Most text loaders will not recognise text as a date by default, so we need to tell Pandas what it is seeing before we can use the data effectively.

Fortunately, Pandas has some robust tools for organising date and time data. For this we'll use the `pd.to_datetime()` function and pass it the column in question. The column, now in date format, is written back to the dataframe. We verify this change by checking `.dtypes` again.

In [29]:
# dayfirst=True because the file stores dates as day/month/year
smog['ReadingDateTime'] = pd.to_datetime(smog['ReadingDateTime'], dayfirst=True)
smog.dtypes

Site                               object
Species                            object
ReadingDateTime            datetime64[ns]
Value                             float64
Units                              object
Provisional or Ratified            object
dtype: object

Now, run the `.describe()` function, and see how the results differ from earlier.

In [30]:
smog.ReadingDateTime.describe()

count                  175200
unique                  35040
top       2014-11-20 03:00:00
freq                        5
first     2014-01-01 00:00:00
last      2014-12-31 23:45:00
Name: ReadingDateTime, dtype: object

We now have a decent idea of the range of dates within the dataset.
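Because the dates in this file are written day-first (for example `31/12/2014 04:00`), it is worth being explicit about the order when parsing. The sketch below uses a single hypothetical, ambiguous date string to show how the result can change; the variable names are assumptions for this example.

```python
import pandas as pd

# Hypothetical ambiguous value: 1 February or 2 January?
raw = pd.Series(['01/02/2014 00:15'])

default_parse = pd.to_datetime(raw)                               # month assumed first
dayfirst_parse = pd.to_datetime(raw, dayfirst=True)               # day assumed first
explicit_parse = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')     # format spelled out

print(default_parse[0])
print(dayfirst_parse[0])
print(explicit_parse[0])
```

Spelling out the `format` is the strictest option, since any value that does not match raises an error rather than being guessed at.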

The only other change we might want to make is to the `'Provisional or Ratified'` column. Storing a categorical column that only ever holds one of two values as text seems a bit superfluous, and it would be better served by a conversion to a boolean type. By doing so, a true value can simply reflect whether the reading has been ratified or not.

The first step is to set up a dictionary, linking key to value, and then mapping the new value to the existing column. As you will see, where the column once read `'R'` it now reads `True`, and where it was once `'P'` it is now `False`.

In [31]:
# set up dictionary containing mapping 
d = {'R': True, 'P': False}

# map new values to existing column values
smog['Provisional or Ratified'] = smog['Provisional or Ratified'].map(d)

Now the `True` or `False` values have been set, we need to convert the type of the column to boolean. This is achieved through the `.astype()` function, which again writes the values back to the dataframe. We send the text 'bool' with the function in order to convert to boolean.

Other values can be used instead of `'bool'` where necessary. A list of alternatives can be found in the user guide here: http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes 

In [32]:
smog['Provisional or Ratified'] = smog['Provisional or Ratified'].astype('bool')
smog.dtypes

Site                               object
Species                            object
ReadingDateTime            datetime64[ns]
Value                             float64
Units                              object
Provisional or Ratified              bool
dtype: object
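One thing to watch with `.map()` is that any value missing from the dictionary becomes `NaN`, and converting to boolean afterwards can hide that gap. A minimal sketch of a check you might run between the two steps follows; the series `codes` and the unexpected code `'X'` are invented for illustration.

```python
import pandas as pd

# Hypothetical codes, including one value ('X') the dictionary does not cover
codes = pd.Series(['R', 'P', 'R', 'X'])

mapped = codes.map({'R': True, 'P': False})

# rows where the mapping found no match
unmapped = codes[mapped.isna()]

print(mapped)
print(unmapped)
```

If `unmapped` is empty, every value was covered by the dictionary and the conversion is safe to apply.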

Now that we've made these changes, the column name doesn't make much sense. We can change that easily again using Pandas. This time we use the `.rename()` function, which will take a dictionary containing the changes we wish to make. We'll simply change the column name to reflect whether the data has been ratified or not.

In [33]:
smog = smog.rename(columns={'Provisional or Ratified': 'Ratified'})

**Now just check the data again to confirm that the column name changes have been made.**

In [34]:
smog

Unnamed: 0,Site,Species,ReadingDateTime,Value,Units,Ratified
0,CD9,NO,2014-01-01 00:00:00,63.0,ug m-3,True
1,CD9,NO,2014-01-01 00:15:00,64.3,ug m-3,True
2,CD9,NO,2014-01-01 00:30:00,57.6,ug m-3,True
3,CD9,NO,2014-01-01 00:45:00,87.8,ug m-3,True
4,CD9,NO,2014-01-01 01:00:00,55.3,ug m-3,True
5,CD9,NO,2014-01-01 01:15:00,92.5,ug m-3,True
6,CD9,NO,2014-01-01 01:30:00,75.6,ug m-3,True
7,CD9,NO,2014-01-01 01:45:00,,ug m-3,True
8,CD9,NO,2014-01-01 02:00:00,,ug m-3,True
9,CD9,NO,2014-01-01 02:15:00,77.0,ug m-3,True


**Finally, for simplicity, change the `ReadingDateTime` column name to simply `DateTime`. Then check the contents once more.**

You can check your answer or get a hint on **Slack AnswerBot** for this one! Look up **Week 3, Question 1.**

## Detailed Data Checking

Now that we have the data in a useable format, we can start breaking it into useful parts.

Our tests earlier identified five `Species` categories within the data. In view of what we know about the dataset (e.g. it's pollution data) and the other columns, this seems like a good place to start initially breaking down the dataset.

We'll do this by taking a subset of the original dataset and experimenting with it. To create this subset we use the `.loc` indexer, and provide some selection logic (sort of like a SQL `WHERE` clause). For this example, we'll take all data from `smog` where the `Species` equals `'NOX'`. **Run the scripts below.**

**Note**: There are lots more ways you can create subsets from Pandas dataframes, if you're interested then check out the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing

In [35]:
smog_nox = smog.loc[smog.Species == 'NOX']

Now we have this dataset, **check its contents** and verify it contains what we expect it to.

**Run the `.describe()` function too to check the statistics of the subset data.** Notice anything odd?

Let's also plot the data. Again, very simple using Pandas. We just use the `.plot()` function on the dataframe, and specify the `x` and `y` axes. In the case of this data, we're probably most interested in the variation of the value over time. **Run the code below to explore time series variation within the NOX subset, do you see anything strange?**

**Note**: The plot below is a very basic default and shouldn't be used for data presentation. More can be found about how to more carefully specify Pandas plots in the documentation here http://pandas.pydata.org/pandas-docs/stable/visualization.html (Although even prettier methods will be taught later in the course.)

In [36]:
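# this relies on the column having been renamed to 'DateTime' in the exercise above;
# a KeyError: 'DateTime' means that rename has not been applied yet
smog_nox.plot(x='DateTime', y='Value')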

Now, let's run a few of those tests we used earlier here, to find out the extents of this subset. Answer the following questions:

1. Does the subset contain null values? If so, how many?
2. What is its temporal range?
3. How much of the data has been ratified?

**Run these tests in the boxes below.**

In [None]:
# how many null values? (See AnswerBot Q2)

In [None]:
# temporal range? (See AnswerBot Q3)

In [None]:
# how much of the data is ratified? (See AnswerBot Q4)

Now we know a little about the NOX subset, we would like to know whether the other subsets of the data are of similar data quality.  

**Rerun the tests above for each of the five Species categories**. The seven boxes below have been provided for you to carry this out, but feel free to add more by going to the Insert menu and selecting Insert Cell Below (or click on the round down arrow button in the toolbar).

In [None]:
# create subset

In [None]:
# check the contents

In [None]:
# check the statistics

In [None]:
# plot data

In [None]:
# null values?

In [None]:
# temporal range?

In [None]:
# data ratified?

Now that we have an idea of the variation in values and data quality across each Species category, it's time to decide what to do about the data. **What do you think?**

Well, I'll tell you what I think we should do. The `PM2.5` and `PM10` data is potentially interesting, but its inconsistency and high proportion of null values make it unreliable. While other categories contain some null values, these are not as prevalent and so we will handle them later. I'd therefore suggest we remove the `PM2.5` and `PM10` data from our further analyses.

To do this, we'll turn back to the original dataset - `smog` - and work from there. We'll remove all rows where the Species attribute is `PM2.5` or `PM10`. To do this, we create a new dataframe that keeps only the rows we want to retain. **Run the example below, which creates a dataframe that excludes the `PM2.5` data.**

In [None]:
smog_nopm25 = smog.loc[smog.Species != 'PM2.5']

**Check the contents of the new dataframe, and confirm that the `'PM2.5'` data has been removed.**

Now, the problem is that the new dataset only excludes one of the subsets we want to remove. So we now need to add to the logic statement above to remove rows where the `Species` is `'PM10'`. 

Like SQL, we use `OR` and `AND` specifiers to join these statements - however, in Pandas the syntax is slightly different. Here, `OR` is defined using a bar like this `|` and `AND` is defined using a `&` symbol. We wrap each individual condition within round brackets, and then put the whole statement within square brackets (just like the simple version above). More information on creating these conditions can be found here http://pandas.pydata.org/pandas-docs/stable/cookbook.html#building-criteria
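To make the bracket placement concrete, here is a minimal sketch of the `|` and `&` syntax on a hypothetical frame. The frame `toy`, its values and the result names are assumptions for this example; `.isin()` with `~` (NOT) is included as one alternative way of writing a multi-value condition.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Species': ['NO', 'NO2', 'NOX', 'NO', 'NO2'],
    'Value': [63.0, 41.5, 120.3, -2.0, 38.9],
})

# OR: rows where Species is NO or NO2
either = toy.loc[(toy['Species'] == 'NO') | (toy['Species'] == 'NO2')]

# AND: rows where Species is NO and the value is positive
both = toy.loc[(toy['Species'] == 'NO') & (toy['Value'] > 0)]

# NOT + isin: rows where Species is neither NO nor NO2
neither = toy.loc[~toy['Species'].isin(['NO', 'NO2'])]

print(either)
print(both)
print(neither)
```

The round brackets around each condition matter, because `|` and `&` bind more tightly than `==` and `>` in Python.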

**Now, using this syntax, from the original dataframe exclude all rows where the `Species` is set as `PM2.5` or `PM10`.**

**And, now, of course, recheck the contents...**

Before we move on - Did you also notice the negative values earlier? There were some in the PM columns, but also some in the NO columns. Do you think these are acceptable? 

Have a look at those negative values again - extract them using the query below. This is the syntax for querying subsets of your data.

In [None]:
smog.loc[smog['Value'] < 0]

They don't seem to really make sense, so we need to get rid of them. We do this by assigning this subset of the data a null value (known as `NaN` in Pandas). The `NaN` object is found in the `numpy` library, so we have to import that first before we make the change.

**NOTE**: We could first extract the subset with the query above and then assign nulls to that extract. However, Pandas complains when you do this, producing a `SettingWithCopyWarning`, and the change may not reach the original dataframe. Instead we give `.loc` both the row condition and the column name, so the assignment is made directly on `smog`.

**Once you've run the script below, check again to see that the previously negative values have been changed to null values.**

In [None]:
# import the numpy library so we can use its null object
import numpy as np

# replaces the Value data with NaN where they are less than 0
smog.loc[smog['Value'] < 0, 'Value'] = np.nan
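If you want to rehearse this replacement before touching the real data, the sketch below runs the same pattern on a hypothetical four-row frame and counts the negatives and nulls before and after. All names and values here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two negative readings
toy = pd.DataFrame({'Value': [63.0, -1.2, 57.6, -0.4]})

negatives_before = (toy['Value'] < 0).sum()

# row condition and column name in one .loc call, so the assignment hits toy itself
toy.loc[toy['Value'] < 0, 'Value'] = np.nan

negatives_after = (toy['Value'] < 0).sum()
nulls_after = toy['Value'].isna().sum()

print(negatives_before, negatives_after, nulls_after)
```

Comparisons against `NaN` evaluate to false, which is why the nulls do not show up as negatives in the second count.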

## Data Transformation

Following our efforts above we are left with a dataset that includes only relatively clean (and so, hopefully, useful) data. We now want to think about how we get this data into a database for future use.

What we need to consider now is whether the data is organised in the most useful way. Is the current arrangement of columns and rows the most convenient considering that the values vary depending on the `Species`? Are there any columns that repeat across the dataset (resulting in duplication)? **Can you think of another way to arrange the data?**

What I would suggest is that we convert the data so that the `Species` become the column headers, and each row contains the date, time and a value for each column. The `Units` data can be removed as these values vary directly with `Species` (although we should note them for when we are working with the data). The `Ratified` column, however, would seem important and so we will include it within our transformed dataset. We are essentially *unstacking* the data, from the rather unhelpful format it arrived in.

To reshape the dataframe in the ways described above, we use the `.pivot()` function. This function takes an `index` name (to change on each row), a `columns` name for column groups, and a `values` name to indicate variation across these axes. 
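As a small illustration of the reshape, here is a sketch that pivots a hypothetical long-format frame of four rows into a wide one. The frame names `long_df` and `wide` and the values are assumptions for this example.

```python
import pandas as pd

# Hypothetical long-format data: one row per timestamp and species
long_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2014-01-01 00:00', '2014-01-01 00:00',
                                '2014-01-01 00:15', '2014-01-01 00:15']),
    'Species': ['NO', 'NO2', 'NO', 'NO2'],
    'Value': [63.0, 41.5, 64.3, 40.2],
})

# one row per DateTime, one column per Species
wide = long_df.pivot(index='DateTime', columns='Species', values='Value')

print(wide)
```

Note that `.pivot()` expects each index/column pair to appear only once; if the same timestamp and species occurred twice, it would raise an error rather than pick one.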

**The syntax to create a pivot table between `DateTime` and `Species` is shown below, run it and then check the contents to see how the table has been transformed.**

In [None]:
smog_pivot = smog.pivot(index='DateTime', columns='Species', values='Value')

In [None]:
# check the contents

The process of creating a pivot table removes all other columns, so we need to do the same for the `Ratified` data, recording how the ratification of the data changes over date and time.

**Run another pivot function below to create a table that shows whether a recording (`NO`, `NO2`, or `NOX`) has been ratified or not, then check the contents.** You can check your answer in **AnswerBot Q5**.

More information on pivot tables can be found here if needed -> http://pandas.pydata.org/pandas-docs/stable/reshaping.html

In [None]:
# create the pivot table

In [None]:
# check the contents

By now you should have two pivot tables - one showing variation in values, the other showing variation in ratification status. We can choose to either keep these tables separate, or join them together, depending on how we want to store them on the database.

In this case, given that they share the same date and time range, we will join them back together *side-by-side*. To perform this function, we use the `.join()` tool, which in fact works in a very similar way to the SQL `JOIN` function. 

To use this, we simply run the function against an existing table, naming the joined table in the function parameters. We can specify further information too - such as the suffixes to use for matching column names, the type of join (e.g. inner, outer, left, right), and which column to use for the join. As we specified the indexes on these tables during the pivot table creation stage, we do not need to make this final specification.
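To see what the suffixes do, here is a minimal sketch joining two hypothetical frames that share an index and a column name. The frames `left` and `right` and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical frames sharing the same index and the same column name
idx = pd.to_datetime(['2014-01-01 00:00', '2014-01-01 00:15'])
left = pd.DataFrame({'NO': [63.0, 64.3]}, index=idx)
right = pd.DataFrame({'NO': [True, False]}, index=idx)

# join on the index; the suffixes disambiguate the clashing column names
joined = left.join(right, lsuffix='_val', rsuffix='_ratified')

print(joined)
```

Without the suffixes, Pandas would refuse the join because it cannot tell the two `NO` columns apart.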

**Look at the specification of the function below and try running it.**

More information on the construction of dataframe joins (and merges, which are similar) can be found here http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

In [None]:
# the lsuffix and rsuffix parameters specify how we handle the matching column names
smog_join = smog_pivot.join(**insert table name here**, lsuffix='_val', rsuffix='_ratified')

# and check the results
smog_join

We're now almost there - hurrah! 

One final thing to consider - with all of these nulls dotted around the data, could we achieve a better coverage by aggregating to a less granular temporal scale? How about just using hourly data, rather than every 15 minutes? 

Well we can do this by running the `.resample()` function on the dataframe. This function groups the data into the specified temporal range, and we then choose how to aggregate each group, here with `.mean()`. For our case, we'll just calculate hourly averages.

**Run the code below, taking note of its arrangement.**

In [None]:
# the '60T' section defines the sampling range (T = minutes);
# recent Pandas releases prefer the equivalent alias '60min'
smog_resamp = smog_join.resample('60T').mean()
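The effect of resampling on nulls is easier to see on a small example. This sketch builds two hours of hypothetical 15-minute readings with one gap and averages them by hour; the frame `toy` and its values are invented, and the alias `'60min'` is used here because it is accepted across Pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute readings covering two hours, with one missing value
idx = pd.date_range('2014-01-01 00:00', periods=8, freq='15min')
toy = pd.DataFrame({'NO': [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 7.0, 9.0]}, index=idx)

hourly = toy.resample('60min').mean()            # mean of the non-null readings in each hour
readings_used = toy.resample('60min').count()    # how many readings fed each mean

print(hourly)
print(readings_used)
```

The null is skipped rather than propagated, so the second hour still gets a value; keeping the counts alongside shows how much data each average is based on.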

**Now one more time. Create a new dataframe that resamples for every day, recording the maximum value for each time period.** You can check this one using **AnswerBot Q6**.

**HINT**: For this stage, you'll need to work out how to specify the resample rate, using the list found here http://pandas.pydata.org/pandas-docs/dev/timeseries.html#offset-aliases and identify the correct sampling method from the documentation here http://pandas.pydata.org/pandas-docs/dev/timeseries.html#up-and-downsampling

## Database Import

We've finally reached a point where we can feel happy with the quality and format our data is in. It's now time to export that data to a database for future use.

We'll use the Pandas SQL functionality to achieve this. Other more comprehensive SQL packages are available, but the Pandas tools make working between dataframes and SQL quite simple.

### Installing Database Package

When accessing a database through an external script or application, you'll always need to ensure two things are done:

1. Make sure the *package* is installed that allows your application (e.g. Python) to talk to your database (e.g. MySQL).

2. Write a *connection string* within your application that tells it where the database is and how to connect to it. The syntax for this varies but the connection string always contains details of the database type, its location, and the access credentials (e.g. username and password) needed for connection.

**If you're working on your own machine, follow these steps to install the PyMySQL drivers.** This process is the same for any Python package being installed through Anaconda, but you'll find the majority of useful data science libraries come pre-installed. If you're using an ISD machine then unfortunately the installation will not persist outside of your current login, so you'll have to do this again every time you want to use the package.

1. Open Anaconda Navigator and go to the Environments tab.

2. Switch the drop down menu from Installed to All.

3. Search for 'pymysql'.

4. Click on the box on the left-hand side of the 'pymysql' entry, and hit 'Mark for Installation'.

5. Click the Apply button in the bottom right corner, then wait for the installation to complete.

### Making the Database Connection

The Pandas SQL connection builds on the SQLAlchemy library. To create the connection, we first load the SQLAlchemy library, and then create the connection. **The scripts for this are provided below, replace the access credentials where necessary, then run them.** You'll notice that the database location details (host and port numbers) have been provided for you.

The connection settings provided here are for the MySQL database, but different settings are required for other database environments. More information on how to create these connections is provided here: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html

In [None]:
# import the SQLAlchemy libraries
from sqlalchemy import create_engine

# create the connection string to the MySQL database
# replace USERNAME and PASSWORD with your own credentials 
engine = create_engine('mysql+pymysql://USERNAME:PASSWORD@dev.spatialdatacapture.org:3306/USERNAME')

# make the connection to the database
conn = engine.raw_connection()

### Accessing the Database from Python

Once we've created the connection to the database, we can write data directly to the database. This uses the `.to_sql()` function on the cleaned dataframe we wish to upload. Within this function we provide parameters that specify the database table name and connection details. The function helpfully creates the new table for us.

**Check out the query below, and then run it.** 

In [None]:
# change YOUR_TABLE_NAME to the name you want for your new table
# we pass the SQLAlchemy engine created above; the old flavor argument is no longer accepted by Pandas
smog_resamp.to_sql('YOUR_TABLE_NAME', engine)

Finally, we will want to check that the data has actually been correctly uploaded to the database. You can do this through MySQL Workbench, but it's good to know how to run SQL queries from Python too.

Here we will run a Pandas function that will call an SQL query and then place the results within a new dataframe. The function is called `.read_sql()` and just requires the SQL query, and the connection details. 

**Test this out below, filling the SQL query with a simple query on your new dataset. Once you've downloaded the data, check out the contents.**

In [None]:
# remember to enter your SQL query in below
smog_db_data = pd.read_sql('YOUR SQL QUERY HERE', conn)

In [None]:
# check the imported data

**Open up MySQL Workbench too and go and find your new dataset.**

More information on how to access databases, insert data and query tables using SQL can be found in the documentation here: http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
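If you would like to rehearse the write-then-read round trip without access to the MySQL server, one option is an in-memory SQLite database, which Pandas can talk to through Python's built-in `sqlite3` module. This is only a stand-in sketch: the table name `smog_demo`, the frame `toy` and its values are invented, and SQLite will not behave identically to MySQL.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned data, indexed by a text timestamp for simplicity
toy = pd.DataFrame(
    {'NO2': [41.5, 40.2], 'NOX': [120.3, 95.1]},
    index=pd.Index(['2014-01-01 00:00', '2014-01-01 01:00'], name='DateTime'),
)

# an in-memory SQLite database standing in for the MySQL server
demo_conn = sqlite3.connect(':memory:')

# write the dataframe to a new table; the index becomes a DateTime column
toy.to_sql('smog_demo', demo_conn, index=True)

# read a filtered subset back into a new dataframe
round_trip = pd.read_sql('SELECT * FROM smog_demo WHERE NOX > 100', demo_conn)
demo_conn.close()

print(round_trip)
```

Checking the row count and column names of the result against the original frame is a quick way to confirm an upload worked as intended.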

### Exercises

Well done for completing this tutorial. You should now have a good grounding in how to use Python and Pandas to load and clean data, and upload it to a database. We'll continue to use similar methods in the coming weeks. 

If you want to explore these methods further in the meantime, here are a few additional activities you might want to try:

* Download a new dataset from London Air (http://www.londonair.org.uk/), and go through the same process to fix up the data.

* If you don't want to work with a new air pollution dataset, use the SQL connection to download only the *ratified* data. Explore how the NO, NO2 and NOX readings vary together over time, using plots and any other methods you feel like trying out (covariance and correlation, perhaps?).

* Go and find another dataset to play with at the UK Open Datastore (http://data.gov.uk/data/search). There are some truly terrible, messy examples on there to get your teeth stuck into. 

