# Aggregating time series data and comparing distributions 

Today's goal is to compare three time series with different properties:

1. Temperature
2. Precipitation
3. Streamflow 

You will also get to practice your pandas skills. 


In [None]:
# First, let's import the pandas and matplotlib modules
import pandas as pd
import matplotlib.pyplot as plt 
# The below command is to make sure figures show up in your notebook
%matplotlib inline 

Now we need to read our data. Let's start with the weather station data, which we already loaded last week. 
The CSV file is located in the `W3_1_WeatherDataTimeSeries/data/` folder and is named `USC00442208_19000101-20240122.csv`.

In [None]:
# complete the line below to point to the correct weather data file
weatherDataPath = 
#
weatherData = pd.read_csv(weatherDataPath,
                    #  Make sure the dates import in datetime format. We tell pandas that this is a date and not text. 
                    parse_dates = ['DATE'],
                    #  Set DATE as the index so you can subset data by time period
                    index_col = ['DATE']
                      )


weatherData

Next we will read the stream discharge data. Before loading the data, have a look at the file contents, located in the `data` directory. You will see some key differences between the weather station data and this data: 

- There are quite a few lines without data, that we have to skip
- Data is delimited by `Tab` and not `,`

[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) has many options that help us to deal with pretty complex files. Here we set `skiprows` and `delimiter`. We also provide a list of column names for our data frame. 
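
To see how these options interact, here is a minimal sketch on a made-up, in-memory file. The comment lines, the station number, and the values are hypothetical and only mimic the general shape of a tab-delimited file with leading lines to skip; they are not taken from the real discharge file.

```python
import io
import pandas as pd

# Hypothetical miniature of a tab-delimited file with leading lines to skip
raw = (
    "# comment line 1\n"
    "# comment line 2\n"
    "agency_cd\tsite_no\tdatetime\tflow\tcode\n"
    "USGS\t00000000\t2000-01-01\t12.5\tA\n"
    "USGS\t00000000\t2000-01-02\t14.0\tA\n"
)

toy = pd.read_csv(
    io.StringIO(raw),
    skiprows=3,        # two comment lines plus the file's own header row
    delimiter="\t",    # columns are separated by tabs, not commas
    names=["agency", "stationID", "date", "discharge_cfs", "label"],
    parse_dates=["date"],
    index_col=["date"],
)
print(toy)
```

Because `names` supplies the column names, the file's own header row has to be counted in `skiprows`; otherwise it would be read as a data row.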

In [None]:
dischargeData = pd.read_csv('../data/USGS_01622000.txt', 
                        skiprows=31,
                        delimiter='\t', 
                        names = ['agency', 'stationID', 'date','discharge_cfs','label'], # column names for the data frame
                        parse_dates = ['date'],
                        index_col = ['date'])
dischargeData 

Right now our data is in two different dataframes. Let's change that and merge all data into a single dataframe. 

Note: [Combining dataframes](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) can get complicated very quickly, depending on how exactly you want to do it. 

In our case, we want to align our data by date, which is the `index` of both dataframes. 

We can simply copy the `'TMAX'` and `'PRCP'` columns to a new dataframe and then assign the discharge data to a new column called `discharge` in the same dataframe. 
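
A minimal sketch of this kind of assignment, using two tiny made-up tables (`weather` and `flow` are hypothetical names with invented values) whose dates only partly overlap:

```python
import pandas as pd

# Toy data: the two objects cover different, partly overlapping dates
weather = pd.DataFrame(
    {"TMAX": [10.0, 12.0, 11.0], "PRCP": [0.0, 5.0, 0.0]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
)
flow = pd.Series(
    [100.0, 90.0],
    index=pd.to_datetime(["2000-01-02", "2000-01-04"]),
)

combined = weather[["TMAX", "PRCP"]].copy()
# Assignment matches rows on their index labels (dates), not on position
combined["discharge"] = flow
print(combined)
```

In this toy case the result keeps the dates of the dataframe being assigned to: dates that are missing from `flow` become `NaN`, and dates that exist only in `flow` are dropped.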



In [None]:
# Complete the code below and check what happened. How was the data aligned? 
listOfColumns = 
dataCombined = weatherData[listOfColumns].copy() 
dataCombined['discharge']= dischargeData['discharge_cfs'] 

dataCombined

Let's quickly look at 10 years of data using the temporal subsetting with `.loc[]`. How about for the time period between 1990 and 2000?
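
As a reminder of how label-based time slicing behaves, here is a small sketch on synthetic daily data (the frame `toy` and its `value` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering 1985 through 2005
dates = pd.date_range("1985-01-01", "2005-12-31", freq="D")
toy = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# With a DatetimeIndex, year strings can be used directly as slice labels
subset = toy.loc["1990":"2000"]
print(subset.index.min(), subset.index.max())
```

Note that both ends of the slice are included, so `"1990":"2000"` runs through the last day of 2000. Pick the end year accordingly if you want exactly ten years.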

In [None]:
dataCombined.loc[].plot(subplots = True)

Now let's look at some basic statistics using `.describe()`.

In [None]:
# Enter your code here

<div class="alert alert-info" role="alert">
<h3 class="alert-heading">Questions</h3>
    
What do you notice about the data?

Do these statistics tell us something important?

</div>

The answer is yes: the statistics hint at something important, but it is much easier to see using plots.

`Histograms` and `boxplots` are both graph types that visualize distributions. 

Let's create histograms first for all three variables.  

You can create three separate histograms. Try this! Add a sensible number of bins (`bins`) and labels to your figures. You should also set the y-limits with `ylim=[start, end]` so the plots are readable.
The [`.plot()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) shows you all the options. You can also look at the previous week's code.
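
For orientation, a single histogram with these options could look like the sketch below. It uses a synthetic, right-skewed series; the name `skewed`, the number of bins, and the limits are arbitrary assumptions, not recommended values for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data, only for illustration
rng = np.random.default_rng(0)
skewed = pd.Series(rng.gamma(0.5, 10, 1000), name="skewed")

# bins sets the number of bars, ylim fixes the range of the count axis
ax = skewed.plot(kind="hist", bins=30, ylim=[0, 600], title="Synthetic skewed data")
ax.set_xlabel("value")
ax.set_ylabel("number of days")
```

The call returns a matplotlib `Axes` object, which is what lets you add axis labels afterwards.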

In [None]:
# Complete the 3 plots 
dataCombined.plot()
...


You can also have _matplotlib_ draw all of them at once as subplots by setting `subplots=True`. See what happens. 

In [None]:
dataCombined.plot(kind='hist', subplots=True)

<div class="alert alert-info" role="alert">
<h3 class="alert-heading">Questions</h3>
    
What is the problem here? 

</div>


[Box plots](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) are another, more compact way of describing the data. Let's focus on streamflow for a moment and create a new plot by setting `kind='box'`. 


In [None]:
dataCombined.plot()

Before we move on to climate data, let's inspect `'PRCP'`. Unlike temperature and discharge, we don't expect rain on most days, so the data set contains a lot of zeros, which are real data. 

We can use `.loc[]` with a condition (for example `dataCombined['PRCP'] > 0`) to select only days with precipitation. 
Using `.count()`, we then get the number of days with non-zero precipitation. 
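
The sketch below shows how a condition and `.count()` behave on a tiny made-up series (`prcp` and its values are hypothetical), including one missing value:

```python
import numpy as np
import pandas as pd

# Five made-up days: two dry, two wet, one with a missing measurement
prcp = pd.Series([0.0, 2.5, np.nan, 0.0, 7.1])

wet = prcp > 0                    # boolean Series; NaN > 0 evaluates to False
n_wet = prcp.loc[wet].count()     # days with rain
n_valid = prcp.count()            # days with a valid measurement (NaN excluded)
n_all = len(prcp)                 # all days, including the missing one
print(n_wet, n_valid, n_all)
```

`.count()` skips missing values while `len()` does not, which matters whenever you compare a count against the total number of days.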

**Quick challenge: What is the chance of rain on any given day? How would you calculate that?**

In [None]:
condition = dataCombined['PRCP']>0
dataCombined['PRCP'].loc[condition].count()



## Aggregating weather data to climate data

Because there is a lot of variability between years, we need to aggregate our data to [climate periods](https://www.ncei.noaa.gov/products/land-based-station/us-climate-normals), typically taken to be 30 years of data.

Luckily, you already know how to do that using `.pivot_table`.

<div class="alert alert-warning" role="warning">
<h3 class="alert-heading">Challenge</h3>
    
- Use the pivot_table function to create
    - mean monthly temperature
    - total (sum) monthly precipitation
    - mean monthly discharge 
- Select data from 1990 to 2020
- Create three plots showing your results. 
</div>
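
If you need a refresher on `pivot_table`, here is a minimal sketch on three years of synthetic daily data. The frame `toy` and its `temp` and `rain` columns are made up, and the month-by-year layout is only one possible arrangement, not necessarily the one used in earlier weeks.

```python
import numpy as np
import pandas as pd

# Three years of synthetic daily data
dates = pd.date_range("2000-01-01", "2002-12-31", freq="D")
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    {"temp": rng.normal(15, 5, len(dates)), "rain": rng.gamma(0.3, 5, len(dates))},
    index=dates,
)
toy["year"] = toy.index.year
toy["month"] = toy.index.month

# One row per calendar month, one column per year
monthly_mean_temp = toy.pivot_table(index="month", columns="year", values="temp", aggfunc="mean")
monthly_total_rain = toy.pivot_table(index="month", columns="year", values="rain", aggfunc="sum")

# Averaging across the year columns gives one value per calendar month
temp_climatology = monthly_mean_temp.mean(axis=1)
print(temp_climatology)
```

The choice of `aggfunc` follows the variable: a mean makes sense for temperature, while precipitation is usually summed to a monthly total first.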



In [None]:
dataCombined['year']=dataCombined.index.year
dataCombined['month']=dataCombined.index.month

# Complete the code below for the pivot tables and the plots 

<div class="alert alert-info" role="alert">
<h3 class="alert-heading">Questions</h3>
    
How do you explain these data?

What do you notice about streamflow, when looking at precipitation?

Do they likely influence each other?

</div>

We can also plot these two as a scatter plot to investigate further. 

In [None]:
dataCombined.plot(kind='scatter',x='PRCP',y='discharge')


# Take-aways / Conclusions

- Why does this matter? 
- Why do we need to know the shape of the variable distributions, when we think about climate data and climate risk? 
    - What do you think? 

