# MEI Introduction to Data Science
# Lesson 6 - Activity 1b

It is widely reported that the temperature has increased both worldwide and in the UK over the last 100 years (e.g. https://www.bbc.co.uk/news/science-environment-50976909). This activity uses data from the Edexcel large data set and related data from the Met Office website to explore how much the temperature has changed and whether this change is similar for different parts of the UK. The activity demonstrates working through at least two iterations of a data science cycle.

In activity 1a for this lesson you explored whether 2015 was a hotter year than 1987 for weather stations in the large data set. In activity 1b you will expand on this by working through another iteration of the data cycle with additional data.

* Run the code below to import pandas and matplotlib

In [None]:
# import pandas
import pandas as pd 

# import matplotlib
import matplotlib.pyplot as plt

## Problem (2)
The analysis in activity 1a only explored the data for 2 different years from at most 5 different weather stations. There is data for more stations and more years at https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data. 

Using this you could rephrase the initial problem as:
> Has the temperature been higher at UK locations since 1990?

To explore this you could find the temperature for each year and compare the years before and since 1990.
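
As a minimal sketch of this idea, the code below averages the monthly readings for each year and then compares the years before 1990 with the years since. The values in `sample` are made up for illustration and are not Met Office readings; the names `sample`, `yearly_tmax`, `before_1990` and `since_1990` are assumptions introduced here.

```python
import pandas as pd

# made-up monthly readings for illustration only (not Met Office data)
sample = pd.DataFrame({
    'yyyy': [1988, 1988, 1989, 1989, 1990, 1990, 1991, 1991],
    'mm':   [1, 8, 1, 8, 1, 8, 1, 8],
    'tmax': [7.0, 21.0, 8.0, 20.0, 9.0, 23.0, 8.0, 22.0],
})

# mean of the monthly tmax values for each year
yearly_tmax = sample.groupby('yyyy')['tmax'].mean()

# split the yearly means into the years before 1990 and the years since 1990
before_1990 = yearly_tmax[yearly_tmax.index < 1990]
since_1990 = yearly_tmax[yearly_tmax.index >= 1990]

print("Mean yearly tmax before 1990:", before_1990.mean())
print("Mean yearly tmax since 1990:", since_1990.mean())
```

A yearly mean is only comparable between years when every year has readings for the same months, which is one reason the rest of this activity compares a single month at a time.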

## Getting the data (2)
For this notebook the data for 5 weather stations (Bradford, Camborne, Heathrow, Hurn and Leuchars) have been stored as CSV files locally. *Note that Leeming is not available and has been replaced by Bradford.*

This dataset contains readings for:
* Mean daily maximum temperature (tmax)
* Mean daily minimum temperature (tmin)
* Days of air frost (af)
* Total rainfall (rain)
* Total sunshine duration (sun)

These are recorded for the year (yyyy) and the month (mm).

* Run the code below to import the data for Hurn

In [None]:
hurn_all_data = pd.read_csv('../input/metofficeweatherbymonth/hurndata.csv')
hurn_all_data.head()

Before exploring the data it will be helpful to add an additional field that stores whether the year is after 1989 (i.e. 1990 or later).

The code below generates a new field in the dataset that is set to *True* or *False* depending on the condition: `hurn_all_data['yyyy']>1989`
* Run the code in the boxes below to create the *post1989* field and check the head and tail of the data

In [None]:
# create a new column that will be True or False depending on the statement: hurn_all_data['yyyy']>1989
hurn_all_data['post1989'] = hurn_all_data['yyyy']>1989

# check the data by displaying the first few rows
hurn_all_data.head()

The first few rows should all show `False` for `post1989`.
* Add some code in the box below to check that the last few rows display `True`

In [None]:
# check the last rows of the dataset
hurn_all_data.tail()
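
Looking at the head and tail only shows the two ends of the data. A further check, sketched below on a small made-up table, is to count the rows in each group and to confirm that the flag changes between 1989 and 1990. The names `check`, `last_false_year` and `first_true_year` are assumptions introduced for this illustration.

```python
import pandas as pd

# made-up years around the boundary (illustrative only)
check = pd.DataFrame({'yyyy': [1988, 1989, 1990, 1991]})
check['post1989'] = check['yyyy'] > 1989

# count how many rows are True and how many are False
print(check['post1989'].value_counts())

# the latest year flagged False and the earliest year flagged True
last_false_year = check.loc[~check['post1989'], 'yyyy'].max()
first_true_year = check.loc[check['post1989'], 'yyyy'].min()
print(last_false_year, first_true_year)
```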


## Exploring the data (2)
* Add and run some code in the box below to check the data types

In [None]:
# check the data types
hurn_all_data.dtypes

The output should indicate that there are values in the tmax field that are not numbers. To find these you need to search the data for any values that cannot be directly converted to numbers.

The code in the box below checks every value in `hurn_all_data['tmax']` to see whether it is numeric, i.e. contains only digits, and displays each row where the check returns `False`.

Unfortunately `str.isnumeric()` does not recognise a decimal point as part of a number, so the value 9.3 would return `False`. The check is therefore made on the string with any decimal points removed, which is achieved by using `replace('.','')`.  
* Run the code below to search for any fields with additional characters

In [None]:
# search for any rows where the temperature is not a viable number
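# regex=False makes sure the '.' is treated as a literal full stop and not as a pattern
hurn_all_data[(( hurn_all_data['tmax'].str.replace('.', '', regex=False)).str.isnumeric() == False)]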

This output shows that one of the values has an asterisk in it. 
* Run the code below to remove the asterisk and convert to a floating point number.

In [None]:
# remove '*' signs by replacing them with an empty string (regex=False treats '*' as a literal character)
hurn_all_data['tmax'] = hurn_all_data['tmax'].str.replace('*', '', regex=False)

# convert to float type
hurn_all_data['tmax'] = hurn_all_data['tmax'].astype('float')

# check the summary statistics for the converted field
hurn_all_data['tmax'].describe()
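
The search-and-clean steps above can also be sketched with `pd.to_numeric`, which is an alternative technique and not the method used in this notebook. With `errors='coerce'` any string that cannot be read as a number becomes `NaN`, so the problem values can be picked out without removing decimal points first. The readings in `raw_tmax` are made up for illustration, and the names `as_numbers`, `problem_values` and `cleaned` are assumptions.

```python
import pandas as pd

# made-up string readings; '21.4*' imitates a value with a trailing asterisk
raw_tmax = pd.Series(['20.1', '21.4*', '19.8', '9.3'])

# str.isnumeric() rejects the decimal point
print('9.3'.isnumeric())                   # False
print('9.3'.replace('.', '').isnumeric())  # True

# strings that cannot be read as numbers become NaN
as_numbers = pd.to_numeric(raw_tmax, errors='coerce')

# keep the original strings that failed to convert
problem_values = raw_tmax[as_numbers.isna()]
print(problem_values)

# remove the literal '*' character and convert to float type
cleaned = raw_tmax.str.replace('*', '', regex=False).astype('float')
print(cleaned)
```

Note that this approach would also flag any values that were already missing, so it is worth looking at the flagged rows before deciding how to clean them.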

* Run the code below to plot a time series for the data

In [None]:
# plot a time series for the data
hurn_all_data.plot(y='tmax', figsize=(10,5))
plt.show()

This plot is difficult to interpret as the temperature varies with the month. You could filter the dataset to just look at August data.
* Run the code in the boxes below to create a new dataset of the August readings and plot a time series.

In [None]:
# create a new data set with just the august data
hurn_aug_data = hurn_all_data[hurn_all_data['mm'] == 8]

# check the head of the data
hurn_aug_data.head()

In [None]:
# plot a time series for the maximum temperature for august
hurn_aug_data.plot(x='yyyy', y='tmax', figsize=(12,5))
plt.show()

You could also explore the maximum temperature in other months.
* Add and run some code in the box below to explore the maximum temperature for at least one different month.

In [None]:
# create a new data set with the data for a single month
hurn_jan_data = hurn_all_data[hurn_all_data['mm'] == 1]
# check the head of the data
hurn_jan_data.head()


In [None]:
# plot a time series for the maximum temperature for a single month
hurn_jan_data.plot(x='yyyy', y='tmax', figsize=(12,5))
plt.show()

You could also explore the minimum temperature for different months.
* Add and run some code below to explore the temperature for some different months - *you might wish to add some additional code boxes*

In [None]:
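# create time series for the minimum temperature for some different months

# select any rows where tmin is not a plain number (run this line in its own box to see them)
hurn_all_data[(( hurn_all_data['tmin'].str.replace('.', '', regex=False).str.replace('-', '', regex=False)).str.isnumeric() == False)]

# remove '*' signs and convert to float type
hurn_all_data['tmin'] = hurn_all_data['tmin'].str.replace('*', '', regex=False)
hurn_all_data['tmin'] = hurn_all_data['tmin'].astype('float')

# recreate the monthly datasets so that they contain the converted tmin values
hurn_jan_data = hurn_all_data[hurn_all_data['mm'] == 1]
hurn_aug_data = hurn_all_data[hurn_all_data['mm'] == 8]

# plot a time series of the minimum temperature for January and for August
hurn_jan_data.plot(x='yyyy', y='tmin', figsize=(12,5))
hurn_aug_data.plot(x='yyyy', y='tmin', figsize=(12,5))
plt.show()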

**Checkpoint**
> * Why was it valid to remove the asterisk from the temperature and not just ignore this field? *You might want to refer to the original data at:  https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data*

On the Met Office source page an asterisk marks an estimated value, so the number is still a usable reading and can be kept once the asterisk is removed

> * Which other months have you chosen to explore the maximum temperature for? Why have you chosen these months?

January, as it is a winter month, so the change can be compared with a summer month

> * Which other months have you chosen to explore the minimum temperature for? Why have you chosen these months?

The same months, to draw comparisons

## Analysing the data (2)
Displaying the mean, standard deviation and boxplots for the years before 1990 and since 1990 will help you answer the problem.

* Run the code in the boxes below to display the means, standard deviations and boxplots

In [None]:
# print the means
print("Mean of the maximum temperature for August")
print(hurn_aug_data.groupby(['post1989'])['tmax'].mean())

# print a blank line
print("\n")

# print the standard deviation
print("Standard deviation of the maximum temperature for August")
print(hurn_aug_data.groupby(['post1989'])['tmax'].std())

# display the boxplot
hurn_aug_data.boxplot(column = ['tmax'],by='post1989', vert=False,figsize=(10, 5))
plt.title("Maximum temperature for August: Hurn")
plt.show()
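
The mean and standard deviation can also be collected into a single table, as in the sketch below. The August readings in `aug` are made up for illustration, and the names `aug`, `summary` and `mean_change` are assumptions introduced here.

```python
import pandas as pd

# made-up August readings (illustrative values only)
aug = pd.DataFrame({
    'yyyy': [1987, 1988, 1989, 1990, 1991, 1992],
    'tmax': [20.0, 21.0, 22.0, 22.0, 23.0, 24.0],
})
aug['post1989'] = aug['yyyy'] > 1989

# one row per group with the number of readings, the mean and the standard deviation
summary = aug.groupby('post1989')['tmax'].agg(['count', 'mean', 'std'])
print(summary)

# difference between the group means (since 1990 minus before 1990)
mean_change = aug.loc[aug['post1989'], 'tmax'].mean() - aug.loc[~aug['post1989'], 'tmax'].mean()
print(mean_change)
```

Including the count alongside the mean is a reminder that the two groups may contain different numbers of years.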

You can repeat this analysis for the other months for which you have created datasets.
* Add and run some code in the box below to display the means, standard deviations and boxplots for the maximum and minimum temperatures for these other months

In [None]:
# display means, standard deviations and boxplots for maximum and minimum temperature for the other months


print("Mean of the maximum temperature for January")
print(hurn_jan_data.groupby(['post1989'])['tmax'].mean())

print("\n")


print("Standard deviation of the maximum temperature for January")
print(hurn_jan_data.groupby(['post1989'])['tmax'].std())

hurn_jan_data.boxplot(column = ['tmax'],by='post1989', vert=False,figsize=(10, 5))
plt.title("Maximum temperature for January: Hurn")
plt.show()


You could also perform similar analysis on the data from other weather stations. 
* Add and run some code below to import the data from at least one of the other weather stations from the data in the metofficeweatherbymonth folder - *you might want to add additional code boxes*

In [None]:
# analyse the maximum and minimum temperatures for selected months from one of the other weather stations
heathrow_all_data = pd.read_csv('../input/metofficeweatherbymonth/heathrowdata.csv')
heathrow_all_data = heathrow_all_data.dropna(how='any',axis=0)

# convert tmax to text, select any values that are not plain numbers, then convert to float type
heathrow_all_data['tmax'] = heathrow_all_data['tmax'].astype('str')
heathrow_all_data[(( heathrow_all_data['tmax'].str.replace('.', '', regex=False).str.replace('-', '', regex=False)).str.isnumeric() == False)]
heathrow_all_data['tmax'] = heathrow_all_data['tmax'].astype('float')

# repeat the same steps for tmin
heathrow_all_data['tmin'] = heathrow_all_data['tmin'].astype('str')
heathrow_all_data[(( heathrow_all_data['tmin'].str.replace('.', '', regex=False).str.replace('-', '', regex=False)).str.isnumeric() == False)]
heathrow_all_data['tmin'] = heathrow_all_data['tmin'].astype('float')

# create the post1989 field before filtering so that the monthly datasets include it
heathrow_all_data['post1989'] = heathrow_all_data['yyyy']>1989

# create new data sets for January and August
heathrow_jan_data = heathrow_all_data[heathrow_all_data['mm'] == 1]
heathrow_aug_data = heathrow_all_data[heathrow_all_data['mm'] == 8]

# print the means and standard deviations for January
print("Mean of the maximum temperature for January")
print(heathrow_jan_data.groupby(['post1989'])['tmax'].mean())
print("\n")
print("Standard deviation of the maximum temperature for January")
print(heathrow_jan_data.groupby(['post1989'])['tmax'].std())

# plot time series of the minimum temperature for January and August
heathrow_jan_data.plot(x='yyyy', y='tmin', figsize=(12,5))
heathrow_aug_data.plot(x='yyyy', y='tmin', figsize=(12,5))

# print the means and standard deviations for August
print("Mean of the maximum temperature for August")
print(heathrow_aug_data.groupby(['post1989'])['tmax'].mean())
print("Standard deviation of the maximum temperature for August")
print(heathrow_aug_data.groupby(['post1989'])['tmax'].std())

# display the boxplots
heathrow_aug_data.boxplot(column = ['tmax'],by='post1989', vert=False,figsize=(10, 5))
plt.title("Maximum temperature for August: Heathrow")
heathrow_jan_data.boxplot(column = ['tmax'],by='post1989', vert=False,figsize=(10, 5))
plt.title("Maximum temperature for January: Heathrow")
plt.show()
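
When the same cleaning and summary steps are repeated for several stations, it may help to wrap them in small functions. The sketch below is one possible way to do this. `clean_temperature` and `summarise_month` are hypothetical helpers that are not part of this notebook or of pandas, and the readings in `station` are made up for illustration.

```python
import pandas as pd

def clean_temperature(column):
    """Hypothetical helper: remove '*' markers and convert a column to numbers."""
    as_text = column.astype('str').str.replace('*', '', regex=False)
    # any text that still cannot be read as a number becomes NaN
    return pd.to_numeric(as_text, errors='coerce')

def summarise_month(data, month, field):
    """Hypothetical helper: mean and standard deviation by post1989 for one month."""
    month_data = data[data['mm'] == month]
    return month_data.groupby('post1989')[field].agg(['mean', 'std'])

# made-up station readings (not real Met Office values)
station = pd.DataFrame({
    'yyyy': [1988, 1989, 1990, 1991, 1988, 1989, 1990, 1991],
    'mm':   [1, 1, 1, 1, 8, 8, 8, 8],
    'tmax': ['7.0', '8.0*', '9.0', '10.0', '21.0', '22.0', '23.0*', '24.0'],
})
station['tmax'] = clean_temperature(station['tmax'])

# add the flag before filtering so that every monthly subset includes it
station['post1989'] = station['yyyy'] > 1989

print(summarise_month(station, 1, 'tmax'))
print(summarise_month(station, 8, 'tmax'))
```

Adding the `post1989` field before filtering by month means that every monthly dataset created afterwards already contains it.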


**Checkpoint**
> * Which has shown the greater change for pre and post-1989: maximum or minimum temperature? 

Maximum

> * Is the change similar for different months or different weather stations?

Relatively similar across stations and months

> * Is maximum temperature more variable pre or post-1989?  

After

## Communicating the result (2)
**Checkpoint**
> Use the results above to answer the initial problem: *Has the temperature been higher at UK locations since 1990?*

## Problem (3) ...
You have used the data to answer two questions:
* *Was 2015 a hotter year than 1987?*
* *Has the temperature been higher at UK locations since 1990?*

What other questions could you investigate to help you explore how much the temperature has changed and whether this change is similar for different parts of the UK?