### Neshyba & Deloya Garcia, 2024

# ClimateStats

## Overview
The idea of this module is to develop your skills and insights into the statistical analysis of weather records, which is the basis of climate science. The weather data we'll be accessing are archived at the NOAA website, https://gml.noaa.gov/dv/data/index.php?category=Meteorology&frequency=Hourly%2BAverages. We'll focus (as the name suggests) on statistics of _hourly_ measurements of key weather variables, including temperature, wind speed, and wind direction.

The main computing resource we'll be using to look at these data is the data management tool *pandas*, which organizes the data and metadata associated with weather records over time. We'll be using *pandas* a little more here than previously, in that we'll be using it to search for flags that indicate missing or bad data.

In terms of climate literacy, the important lessons here are the idea that climate science is a statistical science (specifically, the statistics of weather), and that statistical evidence of climate change is expected to be most pronounced in polar regions.


## Learning goals
1. I can find and interpret metadata for NOAA weather records.
1. I can use *pandas* to read in tables of data as dataframes, implement quality-control measures, and combine dataframes.
1. I can explain how probability densities are related to histograms, describe circumstances under which they might be preferable, and compute them from measurements.
1. I can use historical weather records to describe statistical evidence of polar amplification.  

In [None]:
# Get some resources
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
%matplotlib notebook

### Interpreting metadata
It's often useful to inspect metadata associated with a data set before diving in. A good way to start out this module is to just go to the NOAA website and look around! Locate the "i" icons on the right-hand side, and find  answers to the following questions:

1. The three-letter _region code_ tells you where the data were collected. What does "BRW" stand for?
1. The _units_ of numerical data are specified in there somewhere. What are the units of temperature? Of wind speed?
1. _Flags_ let you know if data are missing. For the NOAA dataset, these flags all look something like "-99.9", but specifics vary. What's the flag for missing temperature?

Enter your answers in the cell below.

YOUR ANSWER HERE

### Loading (and plotting) time series of hourly temperature data, using Pandas
Next, we'll collect some data for a recent year from the NOAA website, as a *pandas dataframe*. Execute the cell below and have a look at the output.

In [None]:
# Load the 2023 data
df2023 = pd.read_csv('https://gml.noaa.gov/aftp/data/meteorology/in-situ/brw/met_brw_insitu_1_obop_hour_2023.txt',
                        delimiter=r"\s+",header=None, 
                        usecols=[0,1,2,3,4,5,6,9],
                        names=['station','year','month','day','hour','winddirection','windspeed','temperature']) 

# Print some information about the dataframe
display(df2023)

# Plotting the temperatures as a function of month
plt.figure()
plt.plot(df2023['month'], df2023['temperature'], 'kx', label='2023')
plt.xlabel('month')
plt.ylabel('Temperature (C)')
plt.title('Hourly temperatures, by month, for year 2023')
plt.grid()
plt.legend()

### Pause for analysis
Take a close look at this graph. You'll probably notice some absurdly low "temperatures", close to -1000 degrees! Don't be alarmed -- this is not https://en.wikipedia.org/wiki/The_Day_After_Tomorrow. Those values are flags that mark the data as being bad or missing. This is a quality-control issue that we'll attend to in a bit, but in the meantime we'll practice getting other data.

In the cell below, do the same, but for data from the year 1977 (load, display, and plot).

In [None]:
# your code here 


### Quality control
The code below directs *pandas* to look at each row, and if the temperature is -999.9, it records the index of that row, in a list called "badindices". The next line of code is a very cool Pandas functionality: it drops (gets rid of) those indices from the dataframe! We also re-plot the data, just to be sure.

By the way, if you execute this cell twice, you'll see that the second time it says it isn't dropping any points -- because (as I'm sure you've guessed) it got all the bad data the first time.

In [None]:
# Find bad temperatures
badindices = df2023[ df2023['temperature'] == -999.9 ].index
print('I am dropping this many missing data points: ', len(badindices))
df2023.drop(badindices,inplace=True)

# Plot
plt.figure()
plt.plot(df2023['month'],df2023['temperature'], 'kx', label='2023')
plt.xlabel('month')
plt.ylabel('Temperature (C)')
plt.title('Hourly temperatures, by month, for year 2023')
plt.grid()
plt.legend()
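
An equivalent way to do this kind of quality control, sketched here on a small made-up dataframe (the name `df_demo` and its values are illustrative, not NOAA data), is to keep only the rows whose temperature is not the flag. Unlike `drop(..., inplace=True)`, this approach leaves the original dataframe untouched.

```python
import pandas as pd

# Made-up hourly records; -999.9 plays the role of the missing-data flag
df_demo = pd.DataFrame({
    'month': [1, 1, 2, 2, 3],
    'temperature': [-25.1, -999.9, -22.4, -999.9, -18.0],
})

# Boolean mask: True where the temperature is a real measurement
is_good = df_demo['temperature'] != -999.9

# Keep only the good rows; df_demo itself is not modified
df_clean = df_demo[is_good]

print('Rows before:', len(df_demo))
print('Rows after: ', len(df_clean))
print('Coldest real temperature:', df_clean['temperature'].min())
```

Which style to use is a matter of taste; the in-place version above saves memory, while the masked version makes it easy to go back and check what was removed.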

### Your turn
In the cells below, do the same for the "1977" dataframe (get rid of bad data, and plot hourly temperatures by month).

In [None]:
# your code here 


### Plotting hourly temperatures on the same graph
You have probably already noticed that it's a bit difficult to compare two datasets unless you graph them together. In the cell below, plot the 1977 and 2023 hourly temperature data on the same graph (still by month), using the black/blue coding you did before and the label/legend method.

In [None]:
# your code here 


### Pause for analysis
Take a moment to examine the plot you just made, and use the cell below to record a few observations about the seasonal variation it reveals. 

1. What are the hottest months at BRW (Utqiagvik)?
1. What are the coldest months?
1. Although it's not considered "climate" unless one is averaging over (ideally) 30 years, sometimes we look at shorter time periods anyway, because that's the data we have. What stands out about the data you see? If you were to choose two months in which warming seems to be amplified over other months, what two months would those be?

YOUR ANSWER HERE

### Focusing on a month of interest
The cell below will be used to focus on March as the "month of interest."

In [None]:
# Specify which month we want to focus on
month_of_interest = 3

### Examining hourly temperatures by day of month
The cell below shows how to extract and display hourly temperatures belonging to a certain month of the year.

In [None]:
# Extract data belonging to the month of interest
df2023_month_of_interest = df2023[df2023['month'] == month_of_interest]
display(df2023_month_of_interest) 

# Open a figure
plt.figure()

# Plot the temperature against the day of the month
plt.plot(df2023_month_of_interest['day'], df2023_month_of_interest['temperature'], 'kx')
plt.title('Hourly temperatures, by day, for month '+str(month_of_interest)+', year 2023')
plt.xlabel('day')
plt.ylabel('Temperature (C)')
plt.grid()

### Your turn
Repeat what we just did, but for 1977. Let's stick with the '+' and 'blue' representation we used before.

In [None]:
# your code here 


### Histograms of hourly temperatures
In the foregoing, you might have noticed that it's a little hard to infer trends from visual inspection of a time series. A more useful statistical strategy is called *binning*. Binning involves grouping data into ranges of the weather variable of interest -- in this case, temperatures in a given month. Binning is a key statistical method of climate science. Numpy's binning function is the _histogram_ function, referred to below as "np.histogram".

The cell below does this for the 2023 dataset.

In [None]:
# Get the histogram for the modern dataset
h2023_month_of_interest, e2023_month_of_interest = np.histogram(df2023_month_of_interest['temperature'],density=True)

# Check on some array lengths
print(np.size(h2023_month_of_interest))
print(np.size(e2023_month_of_interest))
print(np.size(e2023_month_of_interest[0:-1]))

# Plot the histogram 
plt.figure()
plt.plot(e2023_month_of_interest[0:-1],h2023_month_of_interest,'k')
plt.title('Hourly temperature probability density for month '+str(month_of_interest)+', year 2023')
plt.xlabel('Temperature (C)')
plt.grid()
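
To see how a probability density relates to an ordinary histogram of counts, the sketch below bins the same temperatures both ways. The data here are randomly generated for illustration (not NOAA records), and the variable names are our own. Dividing each count by the total number of observations times the bin width gives the density, so the area under the density curve is 1.

```python
import numpy as np

# Made-up "hourly temperatures" for one month (illustrative only)
rng = np.random.default_rng(0)
temps = rng.normal(loc=-22.0, scale=5.0, size=744)

# Same default bins, two normalizations
counts, edges = np.histogram(temps)
density, _ = np.histogram(temps, density=True)

# Bin widths (np.histogram uses equal-width bins by default)
widths = np.diff(edges)

# Convert counts to a density by hand
density_by_hand = counts / (counts.sum() * widths)

# Do the two agree, and what is the area under the density?
print(np.allclose(density, density_by_hand))
print(np.sum(density * widths))
```

Because a density does not depend on how many observations went into it, it is handy for comparing months or years that have different numbers of valid measurements.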

### Bins and bin edges in a probability density

It's worth pausing for a moment on the meaning of the x-axis in figures like the above. If we see that the peak in a probability density occurs at an x-value of (say) -22, it means that the most probable hourly temperature that month fell within a certain range beginning at -22 degrees C. That range is called a *bin*, and is decided automatically by np.histogram -- in this case, it has decided that each bin should be about three degrees in width, e.g., -22 to -19 degrees.

You might have noticed a strange notation here too: Why are we specifying e2023_month_of_interest[0:-1]? The short story is, a set of 10 *bins* requires that we specify 11 *edges* (AKA _bin boundaries_). So when we plot the probability density, we leave off the last edge. That's what the "-1" in e2023_month_of_interest[0:-1] does.
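
The relationship between bins and edges can be checked on a tiny made-up example. The sketch below also computes bin *centers*, which is one alternative (our suggestion, not what the cell above does) to plotting against the left-hand edges.

```python
import numpy as np

# Six made-up temperatures, sorted into 3 bins spanning -30 to -15 degrees
temps = np.array([-28.0, -24.0, -23.0, -22.0, -21.0, -17.0])
h, e = np.histogram(temps, bins=3, range=(-30.0, -15.0))

print(h)         # counts per bin: 3 values
print(e)         # bin edges: 4 values, one more than the number of bins
print(e[0:-1])   # left-hand edges only: 3 values again

# Midpoint of each bin, from its left and right edges
centers = 0.5 * (e[0:-1] + e[1:])
print(centers)
```

Here the middle bin, which runs from -25 to -20, holds four of the six observations, so it is the most probable range.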

### Your turn
In the cell below, do the following:

1. Calculate and plot the probability density of the 1977 dataset. 
1. Plot both the 2023 and 1977 probability densities on the same graph, with the label/legend method.

In [None]:
# your code here 


### Pause for analysis
Use the cell below to comment on what these data are telling us about climate changes in Utqiagvik between the years 1977 and 2023. Key ideas would include *most-probable temperature* and *temperature variability*. If you notice any significant bimodal behavior, make a note of it.

YOUR ANSWER HERE

### Combining dataframes from multiple years
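Below is an example of how to merge dataframes from multiple years. This will come in handy for building confidence in the statistical inferences we can draw from these data. Note that the newly loaded years have not yet been through quality control, so the combined dataframe will still contain missing-data flags.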

In [None]:
# Modern data: load three more years of data as separate dataframes
df2020 = pd.read_csv('https://gml.noaa.gov/aftp/data/meteorology/in-situ/brw/met_brw_insitu_1_obop_hour_2020.txt', 
                        delimiter=r"\s+",header=None, 
                        usecols=[0,1,2,3,5,6,9], 
                        names=['station','year','month','day','winddirection','windspeed','temperature']) 

df2021 = pd.read_csv('https://gml.noaa.gov/aftp/data/meteorology/in-situ/brw/met_brw_insitu_1_obop_hour_2021.txt', 
                        delimiter=r"\s+",header=None, 
                        usecols=[0,1,2,3,5,6,9], 
                        names=['station','year','month','day','winddirection','windspeed','temperature']) 

df2022 = pd.read_csv('https://gml.noaa.gov/aftp/data/meteorology/in-situ/brw/met_brw_insitu_1_obop_hour_2022.txt', 
                        delimiter=r"\s+",header=None, 
                        usecols=[0,1,2,3,5,6,9], 
                        names=['station','year','month','day','winddirection','windspeed','temperature']) 

# Now join them with df2023 (loaded earlier) and display the result
df2020s = pd.concat( [df2020, df2021, df2022, df2023])
display(df2020s)
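
How `pd.concat` behaves can be seen on two tiny made-up dataframes (the names `df_a` and `df_b` and their values are illustrative only). Two things are worth noticing: rows keep their original index labels unless `ignore_index=True` is given, and a column that is present in only one dataframe is filled with `NaN` in the rows that come from the other.

```python
import pandas as pd

# Two made-up "years"; only the second has an 'hour' column
df_a = pd.DataFrame({'year': [2020, 2020], 'temperature': [-20.5, -19.8]})
df_b = pd.DataFrame({'year': [2021, 2021], 'hour': [0, 1], 'temperature': [-15.2, -14.9]})

# Default: original index labels are kept, so labels 0 and 1 appear twice
stacked = pd.concat([df_a, df_b])
print(stacked)

# With ignore_index=True the rows are renumbered 0, 1, 2, 3
renumbered = pd.concat([df_a, df_b], ignore_index=True)
print(renumbered)
```

In the cell above, `df2023` carries an 'hour' column that the other three dataframes lack, so 'hour' shows up as `NaN` for the 2020 through 2022 rows of `df2020s`. That does not affect the temperature column.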

### Your turn
Do the same for three years in the 1970s, starting in 1977. Call the resulting dataframe "df1970s".

In [None]:
# your code here 


### Comparisons
In the cell below, the goal is to produce a plot of the probability density of hourly temperatures for the combined "2020s" dataframe, on the same graph as for the combined "1970s" dataframe (using label/legend, other annotation).
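
When two probability densities share a graph, it can help to give both calls to np.histogram the same bin edges, so that the two curves are evaluated over identical temperature ranges. Here is a minimal sketch of that idea, using randomly generated stand-ins for the two decades (not NOAA data); the names `t_early` and `t_late` and the choice of 3-degree bins are assumptions made for illustration.

```python
import numpy as np

# Randomly generated stand-ins for two sets of hourly temperatures
rng = np.random.default_rng(1)
t_early = rng.normal(loc=-26.0, scale=5.0, size=2000)
t_late = rng.normal(loc=-20.0, scale=6.0, size=2000)

# Shared bin edges, 3 degrees wide, covering both datasets
lo = np.floor(min(t_early.min(), t_late.min()))
hi = np.ceil(max(t_early.max(), t_late.max()))
common_edges = np.arange(lo, hi + 3.0, 3.0)

# Passing an array as "bins" tells np.histogram to use exactly those edges
h_early, e_early = np.histogram(t_early, bins=common_edges, density=True)
h_late, e_late = np.histogram(t_late, bins=common_edges, density=True)

# The two sets of edges are identical, so the curves line up bin for bin
print(np.array_equal(e_early, e_late))

# Left-hand edge of the most probable bin in each dataset
print(e_early[np.argmax(h_early)], e_late[np.argmax(h_late)])
```

Letting np.histogram choose its own bins for each dataset also works; shared edges just make differences in the most probable bin easier to read off.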

In [None]:
# your code here 


### Pause for analysis
In the space below, comment on the following:

1. Does it seem to you that inferences made from the multi-year probability density curves you just got are more or less reliable than from the 1-year probability densities you got before? One criterion you can use for this purpose: spiky probability densities tend to be an indicator of less reliable statistics.
1. What *is* the trend we're seeing in terms of temperature at Utqiagvik? Say, from most probable in the 1970s to most probable in the 2020s.
1. Go to https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/1/3/1850-2023, select the options shown below, and plot the results. Based on what you see, which season (Spring/Summer/Fall/Winter) in the Arctic seems to be warming the fastest? How does that warming rate compare with the global average of $\sim 1.5^\circ$C for the year 2023?

- Parameter: Average Temperature Anomaly
- Time Scale: 12-Month
- Month: All Months
- Start Year: 2023
- End Year: 2023
- Surface: Land and Ocean


YOUR ANSWER HERE

### Extensions
Your last task is to extend these ideas to other meteorological variables, using the two multi-year dataframes (1970s and 2020s at BRW) you've already built. You have two choices:

1. Compare monthly time series and probability densities of the hourly *wind direction* ('winddirection'). If you choose this option, you need to know that the flag for bad wind direction is -999.0. You'll also need to check NOAA's metadata for units so you can annotate your graphs appropriately.
1. Compare monthly time series and probability densities of the hourly *wind speed* ('windspeed'). If you choose this option, you need to know that the flag for bad wind speed data back in the 1970s was -99.9, and the flag for bad wind speed in the 2020s is -999.9. You'll also need to check NOAA's metadata for units so you can annotate your graphs appropriately.
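
Because the wind speed flag differs between the two eras, one way to screen a column against several flag values at once is the *pandas* method `isin`. The sketch below uses a made-up dataframe (the name `df_wind` and its values are illustrative only, not NOAA data).

```python
import pandas as pd

# Made-up wind speeds containing both the older and the newer flag
df_wind = pd.DataFrame({
    'year': [1977, 1977, 2023, 2023, 2023],
    'windspeed': [4.2, -99.9, 6.1, -999.9, 3.3],
})

# True wherever the wind speed matches any of the listed flags
flags = [-99.9, -999.9]
is_flagged = df_wind['windspeed'].isin(flags)
print('Flagged rows:', is_flagged.sum())

# The tilde (~) inverts the mask, keeping only unflagged rows
df_wind_clean = df_wind[~is_flagged]
print(df_wind_clean)
```

An exact-match test like this relies on the flag being read in as exactly the listed number; testing for values below some clearly unphysical threshold would be a looser alternative.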

In [None]:
# your code here 


### Pause for analysis
Describe the trend you see; as before, key ideas are most-probable values and variability. Examples:
- "Winds nowadays in Utqiagvik appear to be coming more from the ____ (North/South/West/East)"
- "Utqiagvik appears to be getting windier (or less windy) in today's warmer climate"
- "Wind speed in Utqiagvik seems to have remained about the same since the 1970s"

YOUR ANSWER HERE

### Refresh/save/validate
Double-check everything is OK, and press the "Validate" button (as usual).

### Close/submit/logout
Close, submit, and log out.