# Lab 7: Time Series Data


**Submission instructions**: Please upload a PDF of your completed notebook to Gradescope. This lab will be manually graded (no autograder).


In [None]:
# write your code here
me = ["Rick Marks", "rlmarks"]
partner = ["Piper Marks", "piper"]

## 1. Loading and preparing data (10 minutes)

I downloaded the monthly unemployment rate from the U.S. Bureau of Labor Statistics (http://data.bls.gov/dataViewer/view/timeseries/LNU04000000) for you and defined a function named `clean_data`. Run the cells below to read in and clean the data.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
def clean_data(file_name):
    # read csv file into pandas table
    unemployment = pd.read_csv(file_name)
    # drop irrelevant column
    unemployment = unemployment.drop(columns=['Series ID'])
    # rename columns (I don't expect you to know how to do this)
    unemployment = unemployment.rename(columns={'Label': 'Time', 'Value': 'Rate','Period':'Month'})
    # change data type of Time to datetime from string
    unemployment['Time'] = pd.to_datetime(unemployment['Time'], format='%Y %b')
    # change data type of Rate to float from string
    unemployment['Rate'] = unemployment['Rate'].astype(float)
    # convert Month from a string like 'M01' to the integer 1
    unemployment['Month'] = unemployment.Month.str.slice(1).astype(int)

    return unemployment
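
To see what each cleaning step does, here is a minimal sketch that applies the same operations to a tiny in-memory CSV. The column order and the three rows are assumptions made up for illustration (they are not taken from the real file); only the column names come from `clean_data` and the plotting code in this lab.

```Python
import io
import pandas as pd

# Hypothetical sample in the layout clean_data appears to expect.
# The values are illustrative, not real unemployment rates.
sample_csv = io.StringIO(
    "Series ID,Year,Period,Label,Value\n"
    "LNU04000000,2020,M03,2020 Mar,4.0\n"
    "LNU04000000,2020,M04,2020 Apr,5.0\n"
    "LNU04000000,2020,M05,2020 May,6.0\n"
)

raw = pd.read_csv(sample_csv)

# Same steps as clean_data, applied one at a time.
cleaned = raw.drop(columns=['Series ID'])
cleaned = cleaned.rename(columns={'Label': 'Time', 'Value': 'Rate', 'Period': 'Month'})
cleaned['Time'] = pd.to_datetime(cleaned['Time'], format='%Y %b')  # '2020 Mar' becomes a datetime
cleaned['Rate'] = cleaned['Rate'].astype(float)
cleaned['Month'] = cleaned['Month'].str.slice(1).astype(int)       # 'M03' becomes 3

print(cleaned.dtypes)
print(cleaned)
```

Printing `dtypes` before and after each step is a quick way to check that a conversion did what you expected.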

In [None]:
# call the function defined earlier on csv file 'unemployment_unadj.csv'

unemployment = clean_data('unemployment_unadj.csv') 
unemployment

**Question 1.1 (1 point)**: Looking at the definition of the function `clean_data` above, does it seem like a general-purpose function you could use to read and clean almost any data file? In one sentence, why or why not?

[Write your answer here]

**Question 1.2 (1 point)**: To see the overall trend, make the following plot:

![trend.png](trend.png)

You can do that by plugging in the appropriate parameters in the following line of code:

```Python
sns.lineplot(data=..., x=... , y=...)
```

In [None]:
# Write and run your code here


**Question 1.3 (1 point)**: Do you observe any trends? Any peaks? Any cycles? If so, how many? And approximately how many years does each cycle last? No calculation or code needed. There are multiple valid answers. Don't spend more than 1~2 minutes on this question.


[Write your answer here]

**Question 1.4 (1 point)**: Next, can you write 1~2 lines of code that plot only the unemployment rates from the past decade, i.e. 2014-2023? My solution uses boolean indexing.
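
As a reminder of how boolean indexing works on a datetime column, here is a small sketch on a made-up table. The table `toy`, its columns, and the cutoff year are all hypothetical, so you will need to adapt the idea to the unemployment data yourself.

```Python
import pandas as pd

# Hypothetical toy table: one observation per year, 2018-2023.
toy = pd.DataFrame({
    'Time': pd.to_datetime(['2018-01-01', '2019-01-01', '2020-01-01',
                            '2021-01-01', '2022-01-01', '2023-01-01']),
    'Value': [10, 11, 12, 13, 14, 15],
})

# The comparison produces a boolean Series: one True/False per row.
mask = toy['Time'].dt.year >= 2021

# Indexing with the mask keeps only the rows where it is True.
recent = toy[mask]
print(recent)
```

The filtered table can be passed to `sns.lineplot` just like the full one.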


In [None]:
# Write and run your code here


**Question 1.5 (1 point)**: Do you observe anything interesting from this plot?

[Write your answer here]

## 2. Seasonality (10 minutes)

Seasonality refers to regular patterns that repeat every calendar year (or at shorter intervals). For example, you would observe seasonal trends in ice cream sales, electricity usage, the number of Google searches about college basketball, traffic on Franklin St., etc. Does the unemployment rate have seasonal trends as well?

To see the monthly trends, copy, paste, and run the following code:

```Python
sns.lineplot(data=unemployment, x='Month', y='Rate', hue='Year')
```

In [None]:
# Copy paste and run the code here


Hmm, this seems a little cluttered. Let's try this:

```Python
sns.lineplot(data=unemployment, x='Month', y='Rate')
```

When there are multiple observations for the same x value (in this case, one per year for each month), seaborn automatically plots their **average**. The dark blue solid line is therefore the monthly unemployment rate averaged over the years.

The light blue band around the line represents the **level of uncertainty or randomness** in that average (by default, seaborn draws a 95% confidence interval). We'll discuss this concept (confidence intervals and standard deviations) at a later time, but for now, just think of it this way: a thin band means we are quite sure the blue line is a good representation of the data across the years, while a thick band means we are not so sure, because things really depend on the year.

In [None]:
# Copy paste and run the code here


**Question 2.1 (1 point)**: Using pandas operations, calculate the numbers in the dark blue line, i.e. unemployment rate averaged over the years.

**Hint**: Use `groupby`! Test your code by plotting your answer and comparing it against the blue line above.
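
If you need a refresher on `groupby`, here is a minimal sketch on a made-up sales table. The table and its column names are hypothetical and chosen only to show the pattern: group by one column, pick another column, then aggregate.

```Python
import pandas as pd

# Hypothetical toy data: sales for two quarters in two different years.
toy = pd.DataFrame({
    'Year':    [2021, 2021, 2022, 2022],
    'Quarter': [1, 2, 1, 2],
    'Sales':   [10.0, 30.0, 20.0, 50.0],
})

# Average Sales for each Quarter, pooling the rows from all years.
quarter_avg = toy.groupby('Quarter')['Sales'].mean()
print(quarter_avg)
```

The result is indexed by `Quarter`, with one averaged value per group.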

In [None]:
# Write and run your code here

**Question 2.2 (1 point)**:  Do you observe any seasonal trends? Do you have any guesses on why or why not?

[Write your answer here]

## 3. Seasonally adjusted numbers (5 minutes)

This is why the government also releases **seasonally adjusted numbers**. Roughly speaking, this means that from every January number they subtract the average January rate, from every February number they subtract the average February rate, and so on. This allows people to understand the unemployment rate in the context of the given month.

I've downloaded that for you in a file called `unemployment_season.csv`. 
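
To make the idea concrete, here is a sketch of the simplified "subtract the monthly average" adjustment described above, run on a made-up table. The numbers and the `Adjusted` column are hypothetical, and the official adjustment procedure is more elaborate than this, so treat it as an illustration of the concept only.

```Python
import pandas as pd

# Hypothetical toy data: two months (January and July) in two years.
toy = pd.DataFrame({
    'Year':  [2021, 2021, 2022, 2022],
    'Month': [1, 7, 1, 7],
    'Rate':  [6.0, 4.0, 8.0, 6.0],
})

# Average rate for each calendar month, repeated on every matching row.
monthly_avg = toy.groupby('Month')['Rate'].transform('mean')

# Subtract the month's own average from each observation.
toy['Adjusted'] = toy['Rate'] - monthly_avg
print(toy)
```

In this toy table January always runs 2 points higher than July. After the adjustment, both months of the same year share the same value, so the seasonal gap is gone and only the year-to-year change remains.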

**Question 3.1 (1 point)**: Write two lines of code below that

1. call the earlier defined function `clean_data` on this new file and save the returned table to a variable called `unemployment_adj`.
2. create a lineplot across `Time` to see the overall trend in this new table.

In [None]:
# Write and run your code here

**Question 3.2 (1 point)**: Compare this new plot with the first lineplot from the unadjusted unemployment rates. Do you observe any difference?


[Write your answer here]

**Question 3.3 (1 point)**: Finally, "zoom in" to the most recent decade (2014-2023) like you did earlier. You should be able to do so by copying and pasting some of your previous code and swapping out the table names. What difference do you notice?


In [None]:
# Write and run your code here

## BONUS: Adding text labels

You can add text to your lineplot using the code below. This might be helpful for your final project. Just sandwich your plotting code from 3.3 into the spot where the ... are.

```Python
fig, ax = plt.subplots(1)
...
...
ax.text(x=pd.to_datetime('2020-06-01'), y=14, s='RIP', fontsize=18)
```
Looks like the spike in unemployment rate from COVID can't be explained away by seasonality :(


![rip_covid.png](rip_covid.png)