# MEI Introduction to Data Science
# Lesson 2 - Activity 1

The activities in this course use Python code to make it quick and easy for you to analyse a data set. You do not have to understand Python programming to use the code, but it should give you an appreciation of why code is a useful tool for working with data.

The activity uses the data from the Edexcel large data set which features weather data for 1987 and 2015 at 8 locations.

This activity contains additional guidance about how to use notebooks in Kaggle. When using these notebooks you should be able to:

* Run ready-made Python code in the notebook
* Make changes to the code and see the effects

### *Using this notebook*
This page is a kind of online document called a notebook. It includes some Python code. The code has been written for you. The code will analyse the contents of the 'large data set' and show you the results. 

As you work through this notebook you will:
* **run code** and look at the results
* **make changes** to the code and see how that changes the results

### *Make your own copy*
Before you go any further, make your own personal copy of this notebook. Look for the blue button in the top right corner of this window. It says 'copy and edit'. Click this button. It might take a few seconds to create the copy. Open your copy and continue reading down this page.


## Problem
> *What were the main differences in weather between 1987 and 2015 at the locations in the dataset?*

## Getting the data (1)
In this lesson you will initially look at the data from:

* Heathrow 1987
* Heathrow 2015

This notebook contains Python code that you can use to analyse the data. The Python code uses a code module called 'pandas'. This is an additional module that includes ready-made tools for analysing data. You do not need to have met Python or pandas before: the code has been written for you.

### *Running the code*

The next box holds some lines of Python code. Some lines start with the hash symbol #. These lines are comments. The comments explain what the code does.

* Click anywhere in the box of code. 
* You will see a large blue arrow to the left of the box. Click on the arrow to run the code.

The code below:
* **Imports** the pandas module. You only have to do this once in each notebook.
* **Imports** the content of the heathrow-2015.csv file as a data set in pandas
* **Displays** the first few rows of the data set for checking purposes.

The output of the code will appear below the code. The output will show 5 records from the Heathrow 2015 subset.

In [None]:
# import the pandas module
import pandas as pd

# copy data from a file called heathrow-2015.csv and store it in pandas as a dataset called heathrow_2015_data
heathrow_2015_data = pd.read_csv("../input/ldsedexcel/heathrow-2015.csv")

# show the first records from the data
heathrow_2015_data.head()
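
To get a feel for what `pd.read_csv` does without needing the Kaggle file, the sketch below reads a tiny CSV from a text string instead. The three rows and their values are made up for illustration and are not taken from the large data set. The 'Date' column name is also an assumption; only 'Daily Mean Temperature' and 'Daily Total Rainfall' follow field names used in this notebook.

```python
import io
import pandas as pd

# A tiny CSV written as text. The values are made up for illustration only.
csv_text = """Date,Daily Mean Temperature,Daily Total Rainfall
01/05/2015,10.5,0.0
02/05/2015,11.2,tr
03/05/2015,12.0,3.2
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_data = pd.read_csv(io.StringIO(csv_text))

# the first line of the CSV becomes the column (field) names
print(list(sample_data.columns))

# head() shows the first rows of the data set
print(sample_data.head())
```

Each line of the CSV after the first becomes one record, and each comma-separated value goes into one field.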

### *Adapting the code*
The next box contains a single line of code. It outputs 6 records from the data set called 'heathrow_2015_data'. 

* Change the code so it outputs 16 records
* What happens if you try to output 100 records? Or 1000 records?
* Try changing **head** to **tail** in the code. What do you think the output will be?

If you make a mistake you can press Ctrl-Z to undo your changes.

In [None]:
heathrow_2015_data.head(6)
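
If you would like to check your predictions on something smaller first, here is a minimal sketch using a made-up data set of 10 records. The name `toy_data` and the 'Day' column are assumptions for illustration and are not part of the large data set.

```python
import pandas as pd

# A made-up data set with 10 records, numbered 1 to 10
toy_data = pd.DataFrame({"Day": range(1, 11)})

# head(n) returns the first n records
first_three = toy_data.head(3)

# tail(n) returns the last n records
last_three = toy_data.tail(3)

# asking for more records than exist returns all of the records
all_records = toy_data.head(100)

print(first_three)
print(last_three)
print(len(all_records))
```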

### *Displaying the field names, data types and size of a data set*
In the previous tasks you looked at a table of data. It contains data for every day of 2015 from May to October. It is composed of columns and rows.

* Each column of the data table represents one field. Each field holds a different kind of data value, such as temperature or rainfall.
* Each row of the table represents one record. A record holds all the data values for a single date in 2015.

The next block of code will output the dimensions of the data set in the format (rows,columns).

* Run the code now and check the result

In [None]:
heathrow_2015_data.shape

The next block of code will output a list of the fields in this table. The output also shows the data type of each field. The main data types you will encounter are:
* **int64**: integers (64 bits are used to store the number)
* **float64**: a *floating point* number: i.e. includes a decimal point (64 bits are used to store the number)
* **object**: this includes text fields

It is important to know how the fields are stored because this affects the type of analysis you can do. For example, you will not be able to calculate the mean of a set of text values.

* Run the code now and check the result

In [None]:
heathrow_2015_data.dtypes
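
The sketch below shows how `shape` and `dtypes` behave on a small made-up table with one integer field, one floating point field and one text field. The values and the 'Day' column are assumptions for illustration. It also uses `select_dtypes`, a pandas method not used elsewhere in this notebook, to list only the numerical fields.

```python
import pandas as pd

# A made-up table: values are for illustration only
toy_data = pd.DataFrame({
    "Day": [1, 2, 3],                               # whole numbers
    "Daily Mean Temperature": [10.5, 11.2, 12.0],   # numbers with a decimal point
    "Mean Cardinal Direction": ["S", "SSW", "W"],   # text
})

# shape is a pair of numbers in the format (rows, columns)
rows, columns = toy_data.shape
print(rows, columns)

# dtypes lists each field with its data type
print(toy_data.dtypes)

# select_dtypes keeps only the fields of the requested kind
numeric_fields = list(toy_data.select_dtypes(include="number").columns)
print(numeric_fields)
```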

**Checkpoint**
> * How many records are there in the Heathrow 2015 data set?
> * How many fields are there in the Heathrow 2015 data set?
> * Which numerical fields have been imported as floating point numbers and which have been imported as integers?

## Exploring the data (1)
### *Numerical data*
In this part of the activity you will output summary values from the table. Summary values 'summarise' the values for one field of the table, using data from all the records in the table. For example the mean or standard deviation.

The next block of code will output a list of the summary values available for the field 'Daily Mean Temperature'.
* Run the code and see what summary values are available for Daily Mean Temperature

In [None]:
heathrow_2015_data['Daily Mean Temperature'].describe()

This code outputs summary values for 'Daily Mean Temperature'. You should have already output a list of the other field names using the `dtypes` command above.

* Edit the code to change 'Daily Mean Temperature' to the name of any other field. See what summary values are available. Note that you must use the precise name of the field and that it is *case-sensitive*. You can copy (Ctrl-C) and paste (Ctrl-V) from the list produced with `dtypes`.
* Explore the type of summary data that is available for each field in the subset.

In [None]:
heathrow_2015_data['Daily Mean Temperature'].describe()
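
The result of `describe()` is itself a small table, so you can pick out one summary value by name. The sketch below demonstrates this on four made-up temperatures; the variable names are assumptions for illustration.

```python
import pandas as pd

# Four made-up temperatures, for illustration only
temperatures = pd.Series([10.0, 12.0, 14.0, 16.0])

# describe() returns all the summary values together
summary = temperatures.describe()
print(summary)

# pick out a single summary value by its label
print(summary["mean"])

# each summary value can also be calculated on its own
print(temperatures.mean(), temperatures.median(), temperatures.std())
```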

You might find it useful to see a boxplot for the data. The code in the boxes below will give a boxplot for the Daily Mean Temperature. Plotting diagrams will be explored in more detail in lesson 4.

In [None]:
# this imports the plotting library - you only have to do this once in the notebook
# this box has no output
import matplotlib.pyplot as plt

In [None]:
# generate the box plot for the Daily Mean Temperature column
heathrow_2015_data.boxplot(column = ['Daily Mean Temperature'])
plt.show()

Try changing the code to display the boxplots for other columns. You can use the output of `dtypes` to decide which columns a boxplot will be appropriate for. 
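
A boxplot is drawn from the quartiles of the data. As a rough sketch of the numbers behind the picture, the code below calculates the quartiles and the interquartile range for nine made-up values. It assumes the default plotting settings, where the whiskers reach at most 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Nine made-up values, for illustration only
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# the box runs from the lower quartile to the upper quartile
lower_quartile = values.quantile(0.25)
median = values.quantile(0.5)
upper_quartile = values.quantile(0.75)

# the interquartile range is the height of the box
iqr = upper_quartile - lower_quartile

# with default settings, points beyond these limits are drawn as outliers
lower_limit = lower_quartile - 1.5 * iqr
upper_limit = upper_quartile + 1.5 * iqr

print(lower_quartile, median, upper_quartile, iqr)
print(lower_limit, upper_limit)
```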

### *Non-numeric values*
Some of the fields in the data set have non-numeric values. Summary values such as mean cannot be calculated for non-numeric fields. However we can still produce useful summary values for these fields. 

An example is wind direction. This is stored in the field called 'Mean Cardinal Direction'. The value is given as a compass direction such as 'S' (South) or 'SSW' (South-South-West). One way to summarise this data for a whole year is to count the number of times each value was recorded. 

The code below outputs the summary data and value counts for Heathrow 2015. Using the 'print' command lets you show two different results in the same output box.

* Run the code.


In [None]:
print(heathrow_2015_data['Mean Cardinal Direction'].describe())
print(heathrow_2015_data['Mean Cardinal Direction'].value_counts())
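
To see how these two commands treat text values, here is a minimal sketch with six made-up wind directions. The `normalize=True` option, which gives proportions instead of counts, is an extra not used elsewhere in this notebook.

```python
import pandas as pd

# Six made-up wind directions, for illustration only
directions = pd.Series(["S", "SW", "S", "W", "S", "SW"])

# for text, describe() reports count, unique, top (most common) and freq
text_summary = directions.describe()
print(text_summary)

# value_counts() counts how often each value appears, most common first
counts = directions.value_counts()
print(counts)

# normalize=True gives proportions instead of counts
proportions = directions.value_counts(normalize=True)
print(proportions)
```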

**Checkpoint**

The command `heathrow_2015_data['Daily Mean Temperature'].describe()` gave you some summary values.

> * The summary value 'count' indicates the number of valid data records in the data set. What about the summary value 'mean'? 
> * Go through all the summary values and say what you think each one tells you about the field 'Daily Mean Temperature'.

## Getting the data (2)
In this part of the lesson you will compare summary values for two different data subsets:
* Heathrow 2015
* Heathrow 1987

Before you begin - think about what these summary values might show. Do you expect the two data sets to be similar, or very different? 

### *Creating a new data set*
Here is some code you have seen before. It imports a data set.
* FROM a file named heathrow-2015
* TO a dataset named heathrow_2015_data
* Gives the summary of the 'Daily Mean Temperature' field for heathrow_2015_data

Change the code so that it imports data 
* FROM a file named heathrow-1987
* TO a dataset named heathrow_1987_data
* Gives the summary of the 'Daily Mean Temperature' field for heathrow_1987_data

You will have to change the code in three places.

In [None]:
heathrow_2015_data = pd.read_csv("../input/ldsedexcel/heathrow-2015.csv")
heathrow_2015_data['Daily Mean Temperature'].describe()
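
Once you have summary values for two data sets, it can help to place them side by side. The sketch below shows one possible way to do this, using two short made-up lists of temperatures labelled 'Year A' and 'Year B'. These are not values from the large data set, and all the names are assumptions for illustration.

```python
import pandas as pd

# Two made-up sets of temperatures, for illustration only
year_a = pd.Series([9.0, 11.0, 13.0])
year_b = pd.Series([10.0, 12.0, 14.0])

# put the two summaries next to each other as columns of one table
comparison = pd.DataFrame({
    "Year A": year_a.describe(),
    "Year B": year_b.describe(),
})

# add a column showing how much each summary value changed
comparison["Difference"] = comparison["Year B"] - comparison["Year A"]

print(comparison)
```

With real data you could replace the two made-up lists with the 'Daily Mean Temperature' field from each of your data sets.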

**Checkpoint**
> What differences can you identify between the temperature for the two years? Is this what you expected to see?

## Exploring the data (2)
### *Cleaning the data*

Sometimes a data set has missing data values. In other cases the values might be obviously incorrect. For example in a list of temperatures you might see a single value higher than 1000. This is obviously wrong and will affect your summary values for that data set.

Finding and fixing these problems is called 'cleaning' the data. In this part of the lesson you will find and fix some problems with the Heathrow data sets.
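
As a small illustration of both kinds of problem, the sketch below uses a made-up list of temperatures with one missing value and one obviously incorrect value. The methods `isna()` and the filter in square brackets are standard pandas tools, but they are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up temperatures: one is missing (None) and one is obviously wrong
temperatures = pd.Series([12.5, None, 14.1, 1500.0, 13.3])

# count the missing values
missing_count = temperatures.isna().sum()
print(missing_count)

# show only the values that are higher than 1000
suspicious = temperatures[temperatures > 1000]
print(suspicious)

# describe() leaves out the missing value, but the wrong value still affects the mean
print(temperatures.describe())
```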

The code below outputs the summary values for ‘Daily Total Rainfall’ for 2015.
* Run the code now
* Adapt the code to see the values for Heathrow 1987

How are the results different from the results for 'Daily Mean Temperature'? Does this problem apply to both years? 

Look back at the data table for Heathrow 2015 at the top of this notebook. Can you see any problems with the rainfall data?

In [None]:
heathrow_2015_data['Daily Total Rainfall'].describe()

Many of the values you might expect to see (such as 'mean') are not calculated for Daily Total Rainfall. That is because the field includes data of several different types:
* floating point numbers such as 3.2
* integers such as 7 or 0
* text such as 'tr'

The computer can't create summary values from a mixture of data types. So you need to clean the data before you can see summary values. The code in the two boxes below cleans the data for Heathrow 2015.
* The first box replaces 'tr' with 0.025. The value 'tr' represents a trace amount of rain, less than 0.05 mm, so it is replaced with a value in the middle of the range 0 to 0.05.
* The second box changes every value to the 'float' data type and then outputs the summary values.

Run the code in both boxes to clean the Heathrow 2015 data.
Make changes to the code and run it again to clean the Heathrow 1987 data.

In [None]:
# replace any instances of 'tr' with 0.025
heathrow_2015_data['Daily Total Rainfall'] = heathrow_2015_data['Daily Total Rainfall'].replace({'tr': 0.025})

In [None]:
# change the data type to float
heathrow_2015_data['Daily Total Rainfall'] = heathrow_2015_data['Daily Total Rainfall'].astype('float')

# get a summary for the field
heathrow_2015_data['Daily Total Rainfall'].describe()
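
The same two cleaning steps can be tried on a short made-up list of rainfall readings, which makes it easier to see what each step changes. In this sketch `dtype=object` is an assumption used to store every reading as text, which is similar to what can happen when a column mixes numbers with 'tr'.

```python
import pandas as pd

# Made-up rainfall readings stored as text, for illustration only
rainfall = pd.Series(["0.2", "tr", "3.2", "0", "7"], dtype=object)
print(rainfall.dtype)

# step 1: replace the text 'tr' with the number 0.025
rainfall = rainfall.replace({"tr": 0.025})

# step 2: convert every value to a floating point number
rainfall = rainfall.astype("float")
print(rainfall.dtype)

# summary values such as the mean can now be calculated
print(rainfall.mean())
```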

The next box shows the 'data cleaning' code again. If you have time, explore the effect of making changes to this code:

* Instead of replacing 'tr' with 0.025, replace with 0. What difference does it make to results?
* Delete the line that changes all the values to float data type. What difference does it make to results?

In [None]:
# Import the data set again so that the rainfall column contains the original "tr" values
heathrow_2015_data = pd.read_csv("../input/ldsedexcel/heathrow-2015.csv")

# You can edit the following three lines to change the value that replaces "tr"
heathrow_2015_data['Daily Total Rainfall'] = heathrow_2015_data['Daily Total Rainfall'].replace({'tr': 0.025})
heathrow_2015_data['Daily Total Rainfall'] = heathrow_2015_data['Daily Total Rainfall'].astype('float')
heathrow_2015_data['Daily Total Rainfall'].describe()
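
If you want to compare several replacement values without editing the code each time, one option is to wrap the cleaning steps in a function. The sketch below is only an illustration: `mean_rainfall` is a hypothetical helper, not part of pandas, and the four readings are made up.

```python
import pandas as pd

def mean_rainfall(raw_values, trace_value):
    """Hypothetical helper: replace 'tr' with trace_value, then return the mean."""
    cleaned = pd.Series(raw_values, dtype=object).replace({"tr": trace_value})
    return cleaned.astype("float").mean()

# Made-up readings stored as text, for illustration only
raw = ["tr", "1.0", "tr", "3.0"]

# try two different replacement values and compare the means
mean_with_midpoint = mean_rainfall(raw, 0.025)
mean_with_zero = mean_rainfall(raw, 0)

print(mean_with_midpoint)
print(mean_with_zero)
```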

**Checkpoint**
> Which value would you use to replace the string 'tr' in the rainfall field? Give your reasons.

## Analysing the data

In this section you can decide which fields to analyse so that you can answer the original question. 
The code below will take data from a CSV file called heathrow-2015. It stores it as a data set called heathrow_2015_data. It then outputs:

* Summary values for Daily Mean Temperature
* A boxplot for Daily Mean Temperature
* Value counts for Mean Cardinal Direction

By making changes to this code, you can load other data sets and explore the data in the different fields. You can see the other sheets saved as CSV files by expanding the "Data" panel in the right-hand column and clicking on the "ldsedexcel" folder.

In [None]:
# import the data
heathrow_2015_data = pd.read_csv("../input/ldsedexcel/heathrow-2015.csv")

# find the statistics for a numerical column 
print(heathrow_2015_data['Daily Mean Temperature'].describe())

# draw a boxplot 
heathrow_2015_data.boxplot(column = ['Daily Mean Temperature'])
plt.show()

# count the values in a non-numerical column
print(heathrow_2015_data['Mean Cardinal Direction'].value_counts())
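
When you repeat the same analysis for several data sets, it can save typing to put the steps into a function. The sketch below is one possible way to do this. `summarise` is a hypothetical helper written for this example, and the table it is tested on is made up.

```python
import pandas as pd

def summarise(data, numeric_field, text_field):
    """Hypothetical helper: output summaries for one numeric and one text field."""
    numeric_summary = data[numeric_field].describe()
    counts = data[text_field].value_counts()
    print(numeric_summary)
    print(counts)
    return numeric_summary, counts

# A made-up table, for illustration only
toy_data = pd.DataFrame({
    "Daily Mean Temperature": [10.0, 12.0, 14.0, 16.0],
    "Mean Cardinal Direction": ["S", "W", "S", "SW"],
})

numeric_summary, counts = summarise(
    toy_data, "Daily Mean Temperature", "Mean Cardinal Direction"
)
```

With a real data set loaded, the same function could be called as `summarise(heathrow_2015_data, 'Daily Mean Temperature', 'Mean Cardinal Direction')`.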

## Communicating the results
**Checkpoint**

> Use the statistics and charts produced to answer the initial problem: *What were the main differences in weather between 1987 and 2015 at the locations in the dataset?*