# Lesson 6: Processing data with pandas II
This week we will continue developing our skills using [pandas](https://pandas.pydata.org/) to process real data.

---

## General information

### Sources
This lesson is inspired by the Geo-python module at the University of Helsinki which in turn acknowledges the Programming in Python lessons from the Software Carpentry organization. This version was adapted for Colab and a UK context by Ruth Hamilton.

### About this document
This is a Google Colab Notebook. This particular notebook is designed to introduce you to a few of the basic concepts of programming in Python. Like other common notebook formats (e.g. Jupyter), the contents of this document are divided into cells, which can contain:

* Markdown-formatted text,
* Python code, or
* raw text

You can execute a snippet of code in a cell by pressing Shift-Enter or by pressing the Run Cell button that appears when your cursor is on the cell.

---




## Motivation



According to the Met Office, 2020 was the [second warmest year in their record](https://www.metoffice.gov.uk/about-us/press-office/news/weather-and-climate/2021/2020-ends-earths-warmest-10-years-on-record). In this lesson, we will use our data manipulation and analysis skills to continue to analyze the weather data for 2020 in Sheffield.

Along the way we will cover a number of useful techniques in pandas including:

* renaming columns
* iterating data frame rows and applying functions
* data aggregation
* repeating the analysis task for several input files



## About the data

We will be working with the same data as last week but will be using the raw file as downloaded from the [CEDA archives](https://data.ceda.ac.uk/badc/ukmo-midas-open/data/uk-daily-temperature-obs/dataset-version-202207/south-yorkshire/00525_sheffield/qc-version-1.). I have downloaded the files but kept the original file names, which makes the file paths a little more complicated.

- We have data for Sheffield weather recordings for 1920, 1930, 1940, ..., 2020.
- Each file has 90 rows of metadata at the start.


We will develop our analysis workflow using data for a single year. Then, we will repeat the same process for all the years.

## Reading the data

In order to get started, first mount your Google Drive so we can access the data:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Then import pandas.

In [None]:
# Import the pandas library under its usual alias
import pandas as pd

At this point, let's have another look at the dataset we used last week for 2020 and how it is structured. We are using a true CSV (comma separated value) file, i.e. the values are separated by commas (you can check this by opening one of the data files in a text editor like Notepad). This is the default for the `read_csv()` function but it can also read in data that uses another character (or characters) to separate values.

---
>**Input data structure**
>
>- **Delimiter:** If your data uses a different *delimiter*, e.g. white space, you can specify either the `sep` parameter or the `delim_whitespace` parameter (see the documentation for the [read_csv() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)), but note that you can't use both at the same time. Recent pandas releases (2.2 and later) deprecate `delim_whitespace` in favour of `sep=r'\s+'`. By default, `read_csv()` assumes that the values are separated by commas, i.e. `sep=','`.
>
>- **No Data values:** Some datasets may use a character to indicate 'no_data'. We can tell pandas to consider those characters as NaNs by specifying `na_values=['*', '#']`. This would replace any value of `*` or `#` as a `NaN` value.
---
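As a minimal sketch of the two parameters described above, the cell below reads a small made-up table from a text string rather than from a file. The data, the `;` delimiter and the variable names are assumptions for illustration only; they are not part of the Sheffield dataset.

```python
import io

import pandas as pd

# Hypothetical example data: values separated by semicolons, '*' marks a missing value
raw_text = "station;temp\nA;10.5\nB;*\nC;12.0\n"

# sep tells pandas which character separates the values;
# na_values lists the strings that should be treated as NaN
example = pd.read_csv(io.StringIO(raw_text), sep=";", na_values=["*"])

print(example)
print(example.dtypes)
```

Because the `*` is converted to `NaN`, the `temp` column can be stored as numbers rather than as text.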

Remember, each of the files starts with some metadata (90 rows in the raw data) so we need to add the `skiprows` parameter when we read in the files.

In [None]:
# Define the full path to the file
fp = r'/content/drive/Shareddrives/TRP479_Spatial_Data_science_2024/L6/Data/midas-open_uk-daily-temperature-obs_dv-202207_south-yorkshire_00525_sheffield_qcv-1_2020.csv'

# Read the comma-separated data, skipping the 90 rows of metadata at the top of the file
data = pd.read_csv(fp, sep=',', skiprows=90)

Let's see how the data looks by printing the first five rows with the `head()` function:

In [None]:
data.head()

All seems ok. However, we won't be needing all of the 22 columns. We can check all column names by running `data.columns`.
> Remember, `columns` is one of the pandas *attributes* that you can use with a pandas dataframe. *attributes* are called *without* brackets, e.g. `data.columns` .

In [None]:
#show the list of columns in the dataframe:
data.columns


There are a number of columns that are probably not relevant, and also a lot that only contain `NaN` values. To get a better idea of what is contained in the metadata, we could look at the .csv file in Excel - or we could read it into a new dataframe in pandas.

From last week, we know that the metadata for each file is contained in the first 90 rows. We can use the `nrows` parameter with the `read_csv()` function to specify the number of rows to read (in this case, the 90 rows of metadata). We also set the `header` parameter to `None` because the first line does not contain column headings.

In [None]:
#read the first 90 rows as meta-data
meta_data=pd.read_csv(fp,header=None,nrows=90)



In [None]:
meta_data.head()

A description for all these columns is also available in the metadata file [midas-open_uk-daily-temperature-obs_dv-202207_station-metadata.csv](https://drive.google.com/open?id=1rRzYF4BGDxQWcO7tsbyiOS6l12ODg0Zv&authuser=r.hamilton%40sheffield.ac.uk&usp=drive_fs).

### Reading in the data once again

This time, we will read in only some of the columns using the `usecols` parameter. Let's read in columns that might be useful to our analysis, or at least that contain some values that are meaningful to us, including the observation time, some id details, and the maximum and minimum air temperature readings: `'ob_end_time'`, `'id_type'`, `'ob_hour_count'`, `'src_id'`, `'max_air_temp'` and `'min_air_temp'`.

In [None]:
# Read in only selected columns
data = pd.read_csv(fp,
                   usecols=['ob_end_time', 'id_type', 'ob_hour_count',
       'src_id',  'max_air_temp',
       'min_air_temp'], skiprows=90)

# Check the dataframe
data.head()

Okay, so we can see that the data was successfully read to the DataFrame.
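To see how `skiprows` and `usecols` work together without needing the files on Google Drive, here is a small sketch using made-up file contents held in a string. The two metadata lines and the values are invented for illustration; only the column names are borrowed from the real data.

```python
import io

import pandas as pd

# Hypothetical file contents: two metadata lines, a header row, then two data rows
raw_text = (
    "title,Example station\n"
    "source,Made-up values\n"
    "ob_end_time,src_id,max_air_temp,min_air_temp\n"
    "2020-01-01 09:00:00,525,8.1,3.2\n"
    "2020-01-01 21:00:00,525,9.4,5.0\n"
)

# Skip the two metadata lines and keep only two of the four columns
example_data = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=2,
    usecols=["ob_end_time", "max_air_temp"],
)

print(example_data)
```

Note that the columns come back in the order they appear in the file, not necessarily in the order you list them in `usecols`.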

## Renaming columns

As we saw above, some of the column names are a bit awkward and difficult to interpret. Luckily, it is easy to alter labels in a pandas DataFrame using the [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method. In order to change the column names, we need to tell pandas how we want to rename the columns using a dictionary that lists the old and new column names.

Let's first check again the current column names in our DataFrame:

In [None]:
data.columns

<div class="alert alert-info"><b>Dictionaries</b><br/>

A [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) is a specific data structure in Python for storing key-value pairs. During this course, we will use dictionaries mainly when renaming columns in a pandas DataFrame, but dictionaries are useful for many different purposes! For more information about Python dictionaries, check out [this tutorial](https://realpython.com/python-dicts/).
</div>

We can define the new column names using a [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm) where we list "`key: value`" pairs, in which the original column name (the one which will be replaced) is the key and the new column name is the value.

- Let's change the following:
   
   - `ob_end_time` to `TIMESTAMP`
   - `max_air_temp` to `MAX`
   - `min_air_temp` to `MIN`

In [None]:
# Create the dictionary with old and new names
new_names = {"ob_end_time" : "TIMESTAMP", "max_air_temp" :"MAX", "min_air_temp": "MIN"}

# Let's see what the variable new_names looks like
new_names

In [None]:
# Check the data type of the new_names variable
type(new_names)

From above we can see that we have successfully created a new dictionary.

Now we can change the column names by passing that dictionary using the parameter `columns` in the `rename()` function:

In [None]:
# Rename the columns
data = data.rename(columns=new_names)

# Print the new columns
print(data.columns)

Perfect, now our column names are easier to understand and use.
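Two details of `rename()` are easy to miss, so here is a small sketch on a made-up DataFrame (the `toy` table and the key `not_a_column` are hypothetical): the method returns a new DataFrame rather than changing the original, and by default dictionary keys that do not match any column are silently ignored.

```python
import pandas as pd

# Hypothetical toy DataFrame, used only to illustrate rename()
toy = pd.DataFrame({"max_air_temp": [8.1, 9.4], "min_air_temp": [3.2, 5.0]})

# "not_a_column" does not exist in toy, so that entry has no effect
renamed = toy.rename(columns={"max_air_temp": "MAX", "not_a_column": "X"})

print(list(toy.columns))      # the original DataFrame is unchanged
print(list(renamed.columns))  # only the matching column was renamed
```

This is why the lesson assigns the result back with `data = data.rename(...)`, and why a typo in an old column name produces no error, just an unchanged column.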

### Check your understanding

The `src_id` column contains the *id* code for the location. Let's rename the column `src_id` to `source_id`.

In [None]:
# Create the dictionary with old and new names


# Rename the columns


# Check the output
data.head()

In [None]:
#@title Click to show code
# Create the dictionary with old and new names
new_names={"src_id":"source_id"}

# Rename the columns
data = data.rename(columns=new_names)

# Check the output
data.head()

## Data properties

As we learned last week, it's always a good idea to check basic properties of the input data before proceeding with the data analysis. Let's check the:

- Number of rows and columns

In [None]:
data.shape

- Top and bottom rows

In [None]:
data.head()

In [None]:
data.tail()

- Data types of the columns

In [None]:
data.dtypes

- Descriptive statistics

In [None]:
data.describe()

You should have noticed that the last line of our `data.tail()` call contained the `end data` flag. We need to remove that line again...

In [None]:
# 'drop' the last row of data in our data frame
data.drop(len(data)-1,axis = 'index',inplace =True)
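One thing to watch with the cell above: running it a second time would drop the (new) last row as well, which is a real observation. As a hedged alternative, sketched here on a made-up table, you can filter on the marker text itself, which gives the same result however many times it is run.

```python
import pandas as pd

# Hypothetical toy data ending with a marker row, similar to the 'end data' flag
toy = pd.DataFrame({
    "TIMESTAMP": ["2020-12-31 09:00:00", "2020-12-31 21:00:00", "end data"],
    "MAX": [6.0, 7.5, None],
})

# Option 1: keep every row except the last one (only run this once)
trimmed = toy.iloc[:-1]

# Option 2: keep only the rows that are not the marker (safe to re-run)
filtered = toy[toy["TIMESTAMP"] != "end data"]

print(len(toy), len(trimmed), len(filtered))
```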

## Parsing dates

You will have noticed that although the newly named `TIMESTAMP` column contains date/time information, its data type is read as an *object*. We will eventually want to group our data by month. We also want to be able to distinguish between *daytime* and *nighttime* readings (those recorded in the 12 hours preceding 21:00, and those recorded in the 12 hours before 09:00, respectively).

Let's have a closer look at the date and time information we have by checking the values in that column, and their data type:

In [None]:
data["TIMESTAMP"].head(10)

In [None]:
data["TIMESTAMP"].tail(10)

The `TIMESTAMP` column contains two observations per day. The timestamp for the first observation is `2020-01-01 09:00:00`, i.e. from 1st of January 2020, and the timestamp for the latest observation is `2020-12-31 21:00:00`.

In [None]:
#check the data types again
data.dtypes

The date information in TIMESTAMP is stored as an *object*, i.e. a *string*.

We want to extract the day and night temperatures and **aggregate the data on a monthly level**.  In order to do so, we need to "label" each row of data based on whether it is a day or night reading and the month when the record was observed. In order to do this, we need to separate the information about the month and the time for each row.

We create these "labels" by making a new column (or an index) containing information about the month and the time.



### String slicing

Our data is in string format, which can actually make it quite easy to work with. Our next step is to "cut" the needed information from the [string objects](https://docs.python.org/3/tutorial/introduction.html#strings). If we look at the first time stamp in the data (`2020-01-01 09:00:00`), we can see that there is a systematic pattern `YEAR-MONTH-DAY HOUR:MINUTE:SECOND`. The year, month and day are separated by a `-`, then there is a white space followed by the time on the 24-hour clock as `hours:minutes:seconds`.

In [None]:
# Let's look at the first entry in the TIMESTAMP column:
day_1 = data.iloc[0, 0]

# Print it on the screen
print(day_1)

# Identify the part of the string with the *date* information
print(day_1[0:10])

Based on this information, we can **slice** the correct range of characters from the `TIMESTAMP` column using [pandas.Series.str.slice()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html).


In [None]:
# Slice the string
data["DATE"] = data["TIMESTAMP"].str.slice(start=0, stop=10)

# Let's see what we have
data.head()

Nice! Now we have added an attribute to the rows based on information about date.
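If the `start` and `stop` positions feel fiddly, it can help to try them out on a couple of made-up timestamps first. The sketch below (the `stamps` series is hypothetical, but uses the same layout as `TIMESTAMP`) pulls out the year and the hour; remember that counting starts at 0 and that the `stop` position is not included.

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the TIMESTAMP column
stamps = pd.Series(["2020-01-01 09:00:00", "2020-07-31 21:00:00"])

# Character positions:  0         1
#                       0123456789012345678
#                       2020-07-31 21:00:00
year = stamps.str.slice(start=0, stop=4)    # characters 0 to 3
hour = stamps.str.slice(start=11, stop=13)  # characters 11 and 12

print(year.tolist())
print(hour.tolist())
```

The results are still strings (e.g. `'09'`), not numbers.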

### Check your understanding

Create three new columns, a `'MONTH'` column with only the **month**, `'DAY'` column with only the **day**, and a `'TIME'` column with information about the **time**.

In [None]:
# Extract information about *time* from the TIMESTAMP column into a new column 'TIME':


In [None]:
# Extract information about *day* from the TIMESTAMP column into a new column 'DAY':

In [None]:
# Extract information about *month* from the TIMESTAMP column into a new column 'MONTH':

In [None]:
#@title Click to show code
# Extract information about *month* from the TIMESTAMP column into a new column 'MONTH':
data["MONTH"] = data["TIMESTAMP"].str.slice(start=5, stop=7)

# Extract information about *day* from the TIMESTAMP column into a new column 'DAY':
data["DAY"] = data["TIMESTAMP"].str.slice(start=8, stop=10)

# Extract information about *time* from the TIMESTAMP column into a new column 'TIME':
# Slice the string
data["TIME"] = data["TIMESTAMP"].str.slice(start=11)



In [None]:
# Check the result
data.head()

Finally, let's create an identifier that contains the 'month and day' information; this allows us to easily identify records for a 24 hour period.

In [None]:
data.dtypes


In [None]:
data['MONTH_DAY']=data['MONTH'] + data['DAY']
data.head()

In this section, we have extracted date, month, day and time information from a text formatted string - **this is a very useful skill!** - and one that is very commonly used in Data Science. These methods will work on any strings e.g. postcodes, email addresses etc, although they may get a little more complicated if the text is not formatted as simply as it is here; this is where your problem-solving skills will come in handy!
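When the pieces of text are not at fixed positions, slicing by character number no longer works. One common alternative is to split at a separator character with `str.split()`. The sketch below uses made-up email addresses; the names and column labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical email addresses; the part before '@' varies in length
emails = pd.Series(["alex@example.org", "sam.lee@example.com"])

# Split each string at '@'; expand=True returns the pieces as separate columns
parts = emails.str.split("@", expand=True)
parts.columns = ["user", "domain"]

print(parts)
```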

Next, we will look at how pandas can work *natively* with date/time information.

### Datetime

In pandas, we can convert dates and times into a new data type, [datetime](https://docs.python.org/3.7/library/datetime.html), using the [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.

In [None]:
# Convert character strings to datetime
data['DATE_dt'] = pd.to_datetime(data['DATE'])

In [None]:
# Check the output
data['DATE_dt'].head()

<div class="alert alert-info"><b>Pandas Series datetime properties</b><br/>

There are several methods available for accessing information about the properties of datetime values. Read more from the pandas documentation about [datetime properties](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties).
</div>

Now, we can extract different time units based on the datetime-column using the [pandas.Series.dt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) accessor:

In [None]:
data['DATE_dt'].dt.year

In [None]:
data['DATE_dt'].dt.month

We can also combine the datetime functionalities with other methods from pandas. For example, we can check the number of unique months in our input data:

In [None]:
data['DATE_dt'].dt.month.nunique()
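The same approach also works on a full date-and-time string, not just the date part. As a sketch on two made-up timestamps (the `format` string is an assumption that matches the `YEAR-MONTH-DAY HOUR:MINUTE:SECOND` layout described earlier), we can parse the text and then read off the hour or the month name:

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the TIMESTAMP column
stamps = pd.Series(["2020-01-01 09:00:00", "2020-07-31 21:00:00"])

# Giving the format explicitly avoids any guessing about day/month order
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

hours = parsed.dt.hour                # integers, e.g. 9 and 21
month_names = parsed.dt.month_name()  # text, e.g. 'January'

print(hours.tolist())
print(month_names.tolist())
```

Unlike string slicing, the `.dt` accessor returns numbers where that makes sense, so the hour comes back as `9` rather than `'09'`.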

## Aggregating data in Pandas by grouping

Here, we will learn how to use [pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) which is a handy method for compressing large amounts of data and computing statistics for subgroups.

We will use the groupby method to calculate the average *day* and *night* temperatures for each month through these main steps:

  1. **Grouping the data** based on the time of day and month
  2. Calculating the average for each month (each group)
  3. Storing those values into **a new DataFrame** called `monthly_data`

Before we start grouping the data, let's once more see what our input data looks like.

In [None]:
print("number of rows:", len(data))

In [None]:
data.head()

We have quite a few rows of weather data, and two observations per day. Our goal is to create an aggregated data frame that has only one row per month with day and night-time temperatures.

Our `TIME` attribute is a bit confusing (09:00:00 corresponds to the *night-time* reading and 21:00:00 corresponds to the *daytime* reading). Let's create a column, `TofDay`, that contains a text flag indicating night or daytime readings.

We are going to use the `.loc` indexer we used last week to update the values in our new `TofDay` column depending on the value in the `TIME` column.

In [None]:

data.loc[data['TIME'] == "21:00:00",'TofDay'] = 'Day'
data.loc[data['TIME'] == "09:00:00",'TofDay'] = 'Night'

In [None]:
data.head()
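A different way to build the same kind of flag, shown here as a sketch on a made-up table, is to write the lookup as a dictionary and apply it with `Series.map()`. The `labels` dictionary is an assumption that mirrors the two `.loc` lines above.

```python
import pandas as pd

# Hypothetical toy data with the two reading times
toy = pd.DataFrame({"TIME": ["09:00:00", "21:00:00", "09:00:00"]})

# Each key is a value found in TIME; each value is the label we want instead
labels = {"21:00:00": "Day", "09:00:00": "Night"}

# map() looks up every TIME value in the dictionary
toy["TofDay"] = toy["TIME"].map(labels)

print(toy)
```

Any time that is not in the dictionary would become `NaN`, which is a quick way to spot unexpected reading times.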

Let's also create an *average* day/night-time temperature for each day. We will assume this is simply the *mean* of the minimum and maximum air temperatures for each 12 hour period. To reduce confusion, we will call this `'MID'` to indicate that it is the *midpoint* between the `MAX` and `MIN` values.

In [None]:
data['MID']=(data['MAX']+data['MIN'])/2
data.head()

Now we can start grouping our data. Let's **group** our data based on the  month.

In [None]:
grouped_m = data.groupby("MONTH")

---
>**Note**
>
>It is also possible to group by a combination of columns on the fly:
>
>```
>grouped_mt = data.groupby(['MONTH', 'TofDay'])
>```

---

Let's explore the new variable `grouped_m`.

In [None]:
type(grouped_m)

In [None]:
len(grouped_m)

We have a new object with type `DataFrameGroupBy` with 12 groups. In order to understand what just happened, let's also check the number of unique months in our data:

In [None]:
data['MONTH'].nunique()

Length of the grouped object should be the same as the number of unique values in the column we used for grouping. For each unique value, there is a group of data.

Let's explore our grouped data even further.

We can check the "names" of each group.

In [None]:
# Next line will print out all 12 group "keys"
grouped_m.groups.keys()
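Besides the group names, it is often useful to know how many rows ended up in each group. A minimal sketch on a made-up table (the values are invented) using the `ngroups` attribute and the `size()` method of the grouped object:

```python
import pandas as pd

# Hypothetical toy data: two rows for January, three for February
toy = pd.DataFrame({
    "MONTH": ["01", "01", "02", "02", "02"],
    "MID": [4.0, 6.0, 5.0, 7.0, 9.0],
})

grouped = toy.groupby("MONTH")

print(grouped.ngroups)  # number of groups
print(grouped.size())   # number of rows in each group
```

For the weather data, checking the group sizes is a quick way to see whether any month has missing readings.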

### Accessing data for one group

Let us now check the contents for the group representing June (the name of that group is `06`). We can get the values for that month from the grouped object using the `get_group()` method.

In [None]:
# Specify a month (as character string)
month="06"

# Select the group
group1 = grouped_m.get_group(month)

In [None]:
# Let's see what we have
group1.head()


Ahaa! As we can see, a single group contains a **DataFrame** with values only for that specific month. Let's check the DataType of this group.

In [None]:
type(group1)


So, as noted above, one group is a pandas DataFrame! This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc. for this specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).

We can do that by using the `mean()` method that we already used in Lesson 5.

- Let's calculate the mean for the following attributes all at once:

    - `MAX`
    - `MIN`
    - `MID`

In [None]:
# Specify the columns that will be part of the calculation
mean_cols = ['MAX','MIN','MID']

# Calculate the mean values all at one go
mean_values = group1[mean_cols].mean()

# Let's see what we have
print(mean_values)

Above, we saw how you can access data from a single group. In order to get information about all groups (all months) we can use a `for` loop or methods available in the grouped object.

### For loops and grouped objects

When iterating over the groups in our `DataFrameGroupBy` object it is important to understand that a single group in our `DataFrameGroupBy` actually contains not only the actual values, but also information about the `key` that was used to do the grouping. Hence, when iterating over the data we need to assign the `key` and the values into separate variables.

So, let's see how we can iterate over the groups and print the key and the data from a single group (again using `break` to only see what is happening for the first group).

In [None]:
# Iterate over groups
for key, group in grouped_m:
    # Print key and group
    print(f"Key:\n {key}")
    print(f"\nFirst rows of data in this group:\n {group.head()}")

    # Stop iteration with break command; we are using this to stop the loop in the first iteration so we can see what is going on.
    break

OK, so from here we can see that the `key` contains the 'name' of the group (month).

Let's build on this and see how we can create a DataFrame where we calculate the mean values for all those weather attributes that we were interested in. We will repeat some of the earlier steps here so you can see and better understand what is happening.

In [None]:
# Create an empty DataFrame for the aggregated values
monthly_data = pd.DataFrame()

# The columns that we want to aggregate
mean_cols = ["MAX", "MIN" , "MID"]

# Iterate over the groups
for key, group in grouped_m:

    # Calculate mean - the same calculation we used for the single group above
    mean_values = group[mean_cols].mean()

    # Add the `key` (i.e. the month) to the mean values
    mean_values["MONTH"] = key

    # Convert the mean_values series to a DataFrame and make it have a row orientation
    row = mean_values.to_frame().transpose()

    # Concatenate the aggregated values into the monthly_data DataFrame
    monthly_data = pd.concat([monthly_data, row], ignore_index=True)

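The loop has calculated the *mean* values for each month. Note that `mean_values` is overwritten on every iteration, so it now holds only the values for the last group: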

In [None]:
mean_values

Now, let us see what we have.

In [None]:
print(monthly_data)

Awesome! Now we have aggregated our data and we have a new DataFrame called `monthly_data` where we have mean values for each month in the data set.
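Calling `pd.concat()` inside the loop is fine for 12 groups, but it copies the growing DataFrame on every pass. A common alternative, sketched below on a made-up table, is to collect one dictionary per group in a list and build the DataFrame once at the end. The `toy` data and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: two readings in each of two months
toy = pd.DataFrame({
    "MONTH": ["01", "01", "02", "02"],
    "MAX": [8.0, 10.0, 9.0, 11.0],
    "MIN": [2.0, 4.0, 3.0, 5.0],
})

rows = []  # one dictionary per group

for key, group in toy.groupby("MONTH"):
    means = group[["MAX", "MIN"]].mean()
    rows.append({"MONTH": key, "MAX": means["MAX"], "MIN": means["MIN"]})

# Build the DataFrame in one step from the list of dictionaries
monthly = pd.DataFrame(rows)

print(monthly)
```

A side benefit is that the numeric columns keep a numeric data type, because the month label is never mixed into the same Series as the mean values.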

### Finding the mean for all groups at once

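We can also achieve the same result by computing the mean of all numeric columns for all groups in the grouped object. Our DataFrame also contains text columns (e.g. `TIMESTAMP` and `TIME`), so we pass `numeric_only=True`; without it, recent versions of pandas (2.0 and later) raise an error because text values cannot be averaged.

In [None]:
grouped_m.mean(numeric_only=True)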
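To see what `numeric_only=True` does, here is a small sketch on a made-up table that mixes a text column with a numeric one (the values are invented). Only the numeric column appears in the result.

```python
import pandas as pd

# Hypothetical toy data with one text column (TIME) and one numeric column (MID)
toy = pd.DataFrame({
    "MONTH": ["01", "01", "02"],
    "TIME": ["09:00:00", "21:00:00", "09:00:00"],
    "MID": [4.0, 6.0, 5.0],
})

# Text columns are left out of the calculation instead of causing an error
means = toy.groupby("MONTH").mean(numeric_only=True)

print(means)
```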

We can also link the methods together. The lines below show how we can group on multiple columns and calculate the mean all at once.

In [None]:
data_tm = data.groupby(['TofDay','MONTH'])[['MAX','MIN','MID']].mean()
print("Grouped by time of day *then* month:\n", data_tm.head()) # note the use of the 'newline' character '\n'.

data_mt = data.groupby(['MONTH','TofDay'])[['MAX','MIN','MID']].mean()
print("\n Grouped by month *then* time of day:\n", data_mt.head())

> **NOTE** the columns are selected with a **list** (as indicated by the `[[  ]]`). An earlier version of the first line, `data.groupby(['TofDay','MONTH'])['MAX','MIN','MID'].mean()`, resulted in a **warning**, and in recent versions of pandas (2.0 and later) it raises an error. After a quick internet search, I found the solution is to put the columns into a list.
>
>Programming standards can change quickly and you will encounter warnings and syntax errors frequently. With warnings, the code will execute *for now* but it is a good idea to follow up on them so that you can prevent your code from breaking in the future.

Although we have used the `groupby` method, the result of the `mean()` method is a dataframe. We can check that by checking the `type`. Since it is a dataframe, we can use `.loc` and `.iloc` functionality to extract individual values.




In [None]:
print(type(data_mt))
print("Average July night-time temperatures\n",data_mt.loc['07', 'Night'],"\n",sep='')
print("Average July temperatures\n", data_mt.loc['07'],sep='')
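Grouping on two columns gives a result with a two-level row index, which can take some getting used to. The sketch below, on a made-up table with invented values, shows two ways of working with it: selecting a single value with a `(month, time of day)` pair, and using `unstack()` to turn the `TofDay` level into columns so that day and night sit side by side.

```python
import pandas as pd

# Hypothetical toy data: one day and one night value for June and July
toy = pd.DataFrame({
    "MONTH": ["06", "06", "07", "07"],
    "TofDay": ["Day", "Night", "Day", "Night"],
    "MID": [18.0, 11.0, 20.0, 13.0],
})

means = toy.groupby(["MONTH", "TofDay"])[["MID"]].mean()

# Select one value: the row label is the pair (month, time of day)
july_night = means.loc[("07", "Night"), "MID"]

# Move the TofDay level from the rows to the columns
wide = means["MID"].unstack("TofDay")

print(july_night)
print(wide)
```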

In [None]:
# Mean of the MID values for each calendar day (day and night readings combined)
group_md = data.groupby('MONTH_DAY')['MID'].mean()
group_md.head()

## Detecting warm months

Now that we have aggregated our data at a monthly level, all we need to do is sort our results to check which month had the warmest daytime and night-time temperatures. A simple approach is to select all daytime readings from the data, group the data and check which group(s) have the highest mean value.

We can start this by selecting all records that are from the Day (regardless of the month).

In [None]:
daytime = data[data["TofDay"] == "Day"]

We can group by the month.

In [None]:
grouped = daytime.groupby(by="MONTH")

And then we can calculate the mean for each group.

In [None]:
monthly_day_mean = grouped.mean(numeric_only=True)

In [None]:
monthly_day_mean.head()

Finally, we can sort and check the highest temperature values. We can sort the data frame in a descending order to do this.

In [None]:
monthly_day_mean.sort_values(by="MID", ascending=False)
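Sorting lets you read the answer off the top row. If you want the answer as a value you can reuse in code, `idxmax()` returns the index label of the largest value directly. A minimal sketch on a made-up table of monthly means (the numbers are invented, not the Sheffield results):

```python
import pandas as pd

# Hypothetical monthly means, indexed by month like the grouped result above
monthly = pd.DataFrame(
    {"MID": [5.0, 17.5, 12.0]},
    index=pd.Index(["01", "07", "10"], name="MONTH"),
)

warmest_month = monthly["MID"].idxmax()  # index label of the largest value
warmest_value = monthly["MID"].max()     # the largest value itself

print(warmest_month, warmest_value)
```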

So, what month had the highest average daytime temperature?

## Check your understanding

Now find the month with the highest average night-time temperature.

In [None]:
#Now, find the month with the highest average night-time temperature

---

Up until 2010, the data only recorded the maximum and minimum over a 24 hour period. This means that, if we want to make comparisons across multiple years, we will have to calculate an *average* for the 24 hour period. Again, our simplest assumption is that this is the average of the max and min values from both the day and night-time readings.

Calculating this is going to be a bit more complicated...

We can use our *unique* `MONTH_DAY` column to group all values for one day, and then find the *mean* of the `MID` column.

In [None]:
# Group by month and day
grouped_month_day = data.groupby(by=['MONTH_DAY'])
daily_mean=grouped_month_day.mean(numeric_only=True)

print(daily_mean.sort_values(by='MID', ascending=False).head(5))
print("\n")

In [None]:
#check that values for July 31st have mean of 23.375
july_31 = data.loc[(data['MONTH']=="07") & (data['DAY']=="31")]
print(july_31)                          # shows all values for July 31
print(july_31.mean(numeric_only=True))  # shows mean of the numeric columns for July 31




## Repeating the data analysis with a larger dataset

To wrap up today's lesson, let's repeat the data analysis steps above for all the available data we have (!!). First, it would be good to confirm the path to the **folder** where all the input data are located.

The idea is that we will repeat the analysis process for each input file using a (rather long) for loop! In the following code block, we pull together all the main analysis steps we have done so far, along with some additional output info. If this works, we can then look at integrating it into a loop so we can automate the process over a number of files.

In [None]:
# this script brings together all of the main steps we have taken from reading in the data to renaming and creating columns
# Read in only selected columns
fp = r'/content/drive/Shareddrives/TRP479_Spatial_Data_science_2024/L6/Data/midas-open_uk-daily-temperature-obs_dv-202207_south-yorkshire_00525_sheffield_qcv-1_2020.csv'

data = pd.read_csv(fp,
                   usecols=['ob_end_time', 'id_type', 'ob_hour_count',
       'src_id',  'max_air_temp',
       'min_air_temp'], skiprows=90)

# 'drop' the last row of data in our data frame because we know it contains the 'end data' flag
data.drop(len(data)-1,axis = 'index',inplace =True)

# Rename the columns
new_names = {"ob_end_time" : "TIMESTAMP", "max_air_temp" :"MAX", "min_air_temp": "MIN"}
data = data.rename(columns=new_names)


#Print info about the current input file:
print("Date:", data.at[0,"TIMESTAMP"])
print("NUMBER OF OBSERVATIONS:", len(data))

# Create column
col_name = 'MID'
data[col_name] = None

# Calculate 'average' temperature as midpoint between max and min readings
data['MID'] = (data['MAX'] + data['MIN'])/2


# Slice the string into DATE, MONTH, DAY and TIME columns
data["DATE"] = data["TIMESTAMP"].str.slice(start=0, stop=10)
data["MONTH"] = data["TIMESTAMP"].str.slice(start=5, stop=7)
data["DAY"] = data["TIMESTAMP"].str.slice(start=8, stop=10)
data["TIME"] = data["TIMESTAMP"].str.slice(start=11)

#make MONTH_DAY unique identifier
data['MONTH_DAY']=data['MONTH']+data['DAY']

#create 'Time of Day' flag
data.loc[data['TIME'] == "21:00:00",'TofDay'] = 'Day'
data.loc[data['TIME'] == "09:00:00",'TofDay'] = 'Night'



# Extract the daytime and night-time observations
days = data[data['TofDay']=="Day"]
nights = data[data['TofDay']=="Night"]

# Group by month
grouped_day = days.groupby(by=['MONTH'])
grouped_night = nights.groupby(by=['MONTH'])

# Get mean values of the numeric columns for each group
monthly_day_mean = grouped_day.mean(numeric_only=True)
monthly_night_mean = grouped_night.mean(numeric_only=True)

# Print info
print(monthly_day_mean.sort_values(by='MID', ascending=False).head(5))
print("\n")

### But

Prior to 2010, only a single Min and Max reading was taken for each 24 hour period. So we need to calculate the mean for the 24 hour period *for all files*. We can do this using the `MONTH_DAY` flag.

In [None]:
# this script brings together all of the main steps we have taken from reading in the data to renaming and creating columns
# Read in only selected columns
fp = r'/content/drive/Shareddrives/TRP479_Spatial_Data_Science_Data/L6/Data/midas-open_uk-daily-temperature-obs_dv-202207_south-yorkshire_00525_sheffield_qcv-1_2020.csv'

data = pd.read_csv(fp,
                   usecols=['ob_end_time', 'id_type', 'ob_hour_count',
       'src_id',  'max_air_temp',
       'min_air_temp'], skiprows=90)

# 'drop' the last row of data in our data frame because we know it contains the 'end data' flag
data.drop(len(data)-1,axis = 'index',inplace =True)

# Rename the columns
new_names = {"ob_end_time" : "TIMESTAMP", "max_air_temp" :"MAX", "min_air_temp": "MIN"}
data = data.rename(columns=new_names)


#Print info about the current input file:
print("Date:", data.at[0,"TIMESTAMP"])
print("NUMBER OF OBSERVATIONS:", len(data))

# Create column
col_name = 'MID'
data[col_name] = None

# Calculate 'average' temperature as midpoint between max and min readings
data['MID'] = (data['MAX'] + data['MIN'])/2


# Slice the string into DATE, MONTH, DAY and TIME columns
data["DATE"] = data["TIMESTAMP"].str.slice(start=0, stop=10)
data["MONTH"] = data["TIMESTAMP"].str.slice(start=5, stop=7)
data["DAY"] = data["TIMESTAMP"].str.slice(start=8, stop=10)
data["TIME"] = data["TIMESTAMP"].str.slice(start=11)

#make MONTH_DAY unique identifier
data['MONTH_DAY']=data['MONTH']+data['DAY']

#create 'Time of Day' flag
#data.loc[data['TIME'] == "21:00:00",'TofDay'] = 'Day'
#data.loc[data['TIME'] == "09:00:00",'TofDay'] = 'Night'

# Group by month and day
grouped_month_day = data.groupby(by=['MONTH_DAY'])
daily_mean=grouped_month_day.mean(numeric_only=True)

#group by month
grouped_month= data.groupby(by=['MONTH'])
monthly_mean=grouped_month.mean(numeric_only=True)

# Print info
print(daily_mean.sort_values(by='MID', ascending=False).head(5))
print(monthly_mean.sort_values(by='MID', ascending=False).head(5))
print("\n")

#group_md=data.groupby('MONTH_DAY')['MID'].mean()
#group_md.head()

At this point we will use the `glob()` function from the module `glob` to list all of our input files. `glob` is a handy function for finding files in a directory that match a given pattern.

In [None]:
import glob

In [None]:
file_list = glob.glob(r"/content/drive/Shareddrives/TRP479_Spatial_Data_Science_Data/L6/Data/midas-open_uk-daily-temperature-obs_dv-202207_south-yorkshire*csv")


> **Note** that we're using the \* character as a wildcard, so any file that starts with `/content/drive/Shareddrives/TRP479_Spatial_Data_Science_Data/L6/Data/midas-open_uk-daily-temperature-obs_dv-202207_south-yorkshire` and ends with `csv` will be added to the list of files we will iterate over. We specifically use the path up to `_south-yorkshire`  to avoid having our metadata file included in the list.

In [None]:
print("Number of files in the list", len(file_list))
print(file_list)
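If you would like to experiment with wildcard patterns without touching the shared drive, the sketch below creates three empty files in a temporary folder and then searches it. The file names are made up (shortened versions of the real ones). Note that `glob.glob()` returns the matches in no guaranteed order, so the sketch wraps the result in `sorted()` to get a predictable list.

```python
import glob
import os
import tempfile

# Hypothetical file names: two data files and one metadata file
names = [
    "obs_south-yorkshire_1920.csv",
    "obs_south-yorkshire_2020.csv",
    "obs_station-metadata.csv",
]

with tempfile.TemporaryDirectory() as folder:
    # Create empty files so that there is something to find
    for name in names:
        open(os.path.join(folder, name), "w").close()

    # The * wildcard matches any characters between the fixed start and 'csv'
    pattern = os.path.join(folder, "obs_south-yorkshire*csv")
    matches = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(matches)
```

The metadata file is left out because its name does not start with the fixed part of the pattern.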

Now, you should have all the relevant file names in a list, and we can loop over the list using a for loop.

In [None]:
#creates an empty pandas dataframe to store our yearly data
summary_data = pd.DataFrame([])

# Repeat the analysis steps for each input file:
for fp in file_list:

# Read in only selected columns
   data = pd.read_csv(fp,
                    usecols=['ob_end_time', 'id_type', 'ob_hour_count',
       'src_id',  'max_air_temp',
       'min_air_temp'], skiprows=90)

# 'drop' the last row of data in our data frame because we know it contains the 'end data' flag
   data.drop(len(data)-1,axis = 'index',inplace =True)

# Rename the columns
   new_names = {"ob_end_time" : "TIMESTAMP", "max_air_temp" :"MAX", "min_air_temp": "MIN"}
   data = data.rename(columns=new_names)


#Print info about the current input file:
   print("Date:", data.at[0,"TIMESTAMP"])
   print("NUMBER OF OBSERVATIONS:", len(data))

# Create column
   col_name = 'MID'
   data[col_name] = None

# Calculate 'average' temperature as midpoint between max and min readings
   data['MID'] = (data['MAX'] + data['MIN'])/2


# Slice the string into DATE, YEAR, MONTH, DAY and TIME columns

   data["DATE"] = data["TIMESTAMP"].str.slice(start=0, stop=10)
   data["YEAR"] =data["TIMESTAMP"].str.slice(start=0, stop=4)
   data["MONTH"] = data["TIMESTAMP"].str.slice(start=5, stop=7)
   data["DAY"] = data["TIMESTAMP"].str.slice(start=8, stop=10)
   data["TIME"] = data["TIMESTAMP"].str.slice(start=11)

#make MONTH_DAY unique identifier
   data['MONTH_DAY']=data['MONTH']+data['DAY']


# Group by day
   grouped_month_day = data.groupby(by=['MONTH_DAY'])
   daily_mean=grouped_month_day.mean(numeric_only=True)
   daily_mean['YEAR']=str(data.loc[0]['YEAR'])  #stores the YEAR

   #group by month
   grouped_month= data.groupby(by=['MONTH'])
   monthly_mean=grouped_month.mean(numeric_only=True)
   monthly_mean['YEAR']=str(data.loc[0]['YEAR'])  #stores the YEAR
   daily_head= daily_mean.sort_values(by='MID', ascending=False).head(1)  #dataframe with only hottest average day of the year
   monthly_head= monthly_mean.reset_index()      #dataframe with monthly averages; reset_index() keeps MONTH as a column

   #update the summary_data dataframe with the next year's data
   #(DataFrame.append has been removed from recent versions of pandas; use pd.concat instead)
   summary_data = pd.concat([summary_data, monthly_head], ignore_index=True)




# Print info (values rounded to two decimal places)
   print(monthly_mean.sort_values(by='MID', ascending=False).head(5).round(2))
   print("\n")

In [None]:
summary_data


So, what month usually has the highest average temperature?
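One possible way to answer this in code, sketched here on a made-up summary table, is to find the warmest month within each year and then count how often each month comes out on top. The sketch assumes a table with `YEAR`, `MONTH` and `MID` columns; the values are invented and are not the Sheffield results.

```python
import pandas as pd

# Hypothetical summary table: three months for each of two years
summary = pd.DataFrame({
    "YEAR": ["1920", "1920", "1920", "2020", "2020", "2020"],
    "MONTH": ["06", "07", "08", "06", "07", "08"],
    "MID": [14.0, 16.0, 15.5, 16.5, 16.0, 18.0],
})

# For each year, find the row label of the highest MID value
warmest_rows = summary.groupby("YEAR")["MID"].idxmax()

# Use those labels to pull out the matching rows
warmest = summary.loc[warmest_rows, ["YEAR", "MONTH", "MID"]]

# Count how many times each month was the warmest
counts = warmest["MONTH"].value_counts()

print(warmest)
print(counts)
```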

---
## Check your understanding

You should be able to edit the previous code to output the *hottest* day of each year. When editing code, it's a good idea to use commenting to edit out bits of code rather than deleting lines...