<a id="top-of-da"></a>

# Data Analysis

Now we will work with real data to apply the concepts presented in the previous notebooks. Some things to keep in mind as we work through these examples:
- Real data is messy. It will almost always need to be cleaned up before analysis.
- You should explore your data to get a high-level understanding of what you're working with. This will help you explain results, identify errors in your code or analysis, and further your understanding of the data.
- A notebook is a great place to experiment with your analysis and try out new approaches. As you refine your analysis methods and decide on the best approach, clean up your notebook to include only the necessary steps. A clean analysis notebook can serve as documentation of your methods for yourself, your research group, and peers in your discipline if you decide to publish.

With the above in mind, this notebook will cover these topics:
1. [Cleaning up your data](#load-clean-data)
2. [Exploring your data](#explore-data)
3. [Analyzing your data](#analyze-data)

Exercises<br>
[Exercise 1](#exercise1-da)<br>
[Exercise 2](#exercise2-da)<br>
[Exercise 3](#exercise3-da)<br>
[Exercise 4](#exercise4-da)<br>
[Exercise 5](#exercise5-da)

# 0. Download data

Before we begin cleaning our data, we need data to work with! Let's download data from surface weather stations across the United States. Run the following cell to download one file with data from over 500 stations over the course of one day.

This file is about 35 MB, so it may take 30-60 seconds to download.

In [None]:
!mkdir -p data
!python download_data.py

[Return to top of notebook](#top-of-da)<br>
***
<a id="load-clean-data"></a>

# 1. Cleaning up your data

Let's load our data, organize it, and clean it up. The first thing we need to figure out is how to properly parse this file. The data is tabular, but the file begins with a few comment lines that come before the table itself.

In [None]:
import pandas as pd

In [None]:
raw_data = open("data/20120801.txt").readlines()

In [None]:
raw_data[:10]

There are a couple of ways we can handle this. We can skip those lines by passing their count to the `skiprows` keyword.

In [None]:
df = pd.read_csv("data/20120801.txt", skiprows=5)

Alternatively, since we know these lines are comments, we can use the `comment` keyword and pass it `"#"`.

In [None]:
df = pd.read_csv("data/20120801.txt", comment="#")
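
To see that the two approaches agree, here is a minimal sketch on a made-up, in-memory sample rather than the downloaded file. The sample text, its two comment lines, and the names `by_skip` and `by_comment` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical miniature of a station file: two comment lines, then the table.
sample = """#DEBUG: this line is a comment
#another comment
station,valid,tmpf
AWG,2012-08-01 00:15,75.2
AWG,2012-08-01 00:35,M
"""

# Option 1: skip a fixed number of leading lines (we must know the count).
by_skip = pd.read_csv(io.StringIO(sample), skiprows=2)

# Option 2: ignore anything that starts with the comment character.
by_comment = pd.read_csv(io.StringIO(sample), comment="#")

print(by_comment)
print(by_skip.equals(by_comment))
```

The `comment` approach does not depend on how many header lines a file has, which is one reason to prefer it when that number might differ between files.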

We can refine how our dataframe is structured when we open it. First, let's select an index column.

In [None]:
df = pd.read_csv("data/20120801.txt", comment="#", index_col="valid")

In [None]:
df.head(5)

Since these are timestamps, `pandas` can parse them so that we can operate on them later.

In [None]:
df = pd.read_csv("data/20120801.txt", comment="#", index_col="valid", parse_dates=True)

In [None]:
df.head(5)

In [None]:
df.index
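
As a rough illustration of what parsing the timestamps enables, the sketch below builds a small made-up table and filters it by hour of day. The sample rows and the names `demo` and `after_one` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows in the same shape as the station file.
sample = """station,valid,tmpf
AWG,2012-08-01 00:15,75.2
AWG,2012-08-01 01:15,74.0
AWG,2012-08-01 02:15,73.1
"""

demo = pd.read_csv(io.StringIO(sample), index_col="valid", parse_dates=True)

# With parse_dates=True the index is a DatetimeIndex, not plain strings.
print(type(demo.index).__name__)

# Datetime attributes such as .hour are now available on the index.
after_one = demo[demo.index.hour >= 1]
print(after_one)
```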

Next, we only want some of these columns for our analysis. We can tell `pandas` which columns we want when we load the data by passing a list of names to the `usecols` keyword.

In [None]:
df.columns

In [None]:
usecols = [
    "valid", "station", "lon", "lat", "tmpf", "dwpf", "relh", "drct", "mslp", "gust", "p01i"
]
df = pd.read_csv("data/20120801.txt", comment="#", index_col="valid", parse_dates=True, usecols=usecols)

In [None]:
df
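
One detail worth knowing: `usecols` selects columns but does not reorder them, so the result keeps the order found in the file. A minimal sketch on made-up data (the sample text and the name `demo` are assumptions):

```python
import io

import pandas as pd

# Hypothetical one-row file with five columns.
sample = """station,valid,lon,lat,tmpf
AWG,2012-08-01 00:15,-91.67,41.28,75.2
"""

# The list order here differs from the file order on purpose.
demo = pd.read_csv(
    io.StringIO(sample),
    usecols=["tmpf", "valid", "station"],
    index_col="valid",
)

# Columns come back in file order, with the index column removed.
print(list(demo.columns))
```

This matters whenever later code selects columns by position rather than by name.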

Now we have a smaller dataset to work with, containing the quantities that we want for our analysis. Before working with the data, we can do two more things:
1. Rename the columns to make them more readable
2. Remove missing data points

In [None]:
new_cols = {
    "station": "Station ID",
    "lon": "Longitude",
    "lat": "Latitude",
    "tmpf": "Temperature (deg F)",
    "dwpf": "Dewpoint (deg F)",
    "relh": "Relative Humidity (%)",
    "drct": "Wind direction (deg)",
    "gust": "Wind Gust (knot)",
    "mslp": "Mean Sea Level Pressure (hPa)",
    "p01i": "Precipitation 1-hour accumulation (inch)",
}
df = df.rename(columns=new_cols)

In [None]:
df

Now we have a dataframe that is readable and contains the data we care about. That took a bit of effort, but it will make the rest of our exploration and analysis easier. Let's remove the missing data points now.

In [None]:
df = df.where(df != "M")

In [None]:
df
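
An alternative, sketched here on made-up data, is to tell `read_csv` about the missing-value marker at load time with `na_values`. The sample text and the names `masked` and `at_load` are hypothetical; note that `na_values="M"` applies to every column, so it assumes no column uses `"M"` as a legitimate value.

```python
import io

import pandas as pd

# Hypothetical rows where "M" marks a missing measurement.
sample = """station,valid,tmpf,dwpf
AWG,2012-08-01 00:15,75.2,M
AWG,2012-08-01 00:35,M,66.2
"""

# Approach used above: load as text, then mask the marker with NaN.
as_text = pd.read_csv(io.StringIO(sample))
masked = as_text.where(as_text != "M")

# Alternative: declare the marker while reading, so the columns load as numbers.
at_load = pd.read_csv(io.StringIO(sample), na_values="M")

print(masked.dtypes)
print(at_load.dtypes)
```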

Since we read the data from a text file, any column that contained the `"M"` marker was loaded as strings rather than numbers. We want numeric values for all columns except `"Station ID"`, so let's convert them to `float`.

In [None]:
# Every column after "Station ID" holds numeric measurements
numeric_columns = df.columns[1:]
df[numeric_columns] = df[numeric_columns].astype(float)

In [None]:
df
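
If some entries cannot be parsed as numbers, `astype(float)` raises an error. A more forgiving variant, shown below as a sketch on made-up data, uses `pd.to_numeric` with `errors="coerce"` to turn unparseable entries into NaN. The frame `demo` and its `"bad"` entry are hypothetical.

```python
import pandas as pd

# Hypothetical frame with text-valued measurements, a gap, and a stray entry.
demo = pd.DataFrame(
    {
        "Station ID": ["AWG", "AWG", "BNW"],
        "Temperature (deg F)": ["75.2", None, "71.6"],
        "Dewpoint (deg F)": ["66.2", "64.4", "bad"],
    }
)

cols = demo.columns[1:]

# Convert column by column; anything that is not a number becomes NaN.
converted = demo[cols].apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
print(converted)
```

Coercion hides bad entries instead of reporting them, so it is worth checking how many NaN values appear afterwards.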

[Return to top of notebook](#top-of-da)<br>
[Return to top of section](#load-clean-data)
***
<a id="explore-data"></a>

# 2. Exploring your data

Now that we have a clean and organized dataset, let's look at what it contains.

In [None]:
df.describe()

<a id="exercise1-da"></a>
### Exercise 1

1. Notice that the `count` values for many of our columns are smaller than those for `"Latitude"` and `"Longitude"`. Why?
2. Inspect the minimums and maximums for a variable of your choice. Do you think these are reasonable values? Explain.
3. Plot the variable you chose. Would you change your answer to the above question based on this plot?

In [None]:
# your code here

***

We can explore relationships between our variables with, for example, `.corr()`. By default, `.corr()` uses the Pearson correlation; pass `method="spearman"` or `method="kendall"` if you want a rank-based measure instead.

In [None]:
df[numeric_columns].corr()
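
To get a feel for how the choice of method changes the result, here is a small sketch on made-up numbers where `y` grows with `x` but not linearly. The frame `demo` and the names `pearson` and `spearman` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y = x squared, a monotonic but non-linear relationship.
demo = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [1.0, 4.0, 9.0, 16.0, 25.0],
    }
)

pearson = demo.corr()  # measures linear association (default)
spearman = demo.corr(method="spearman")  # measures monotonic association via ranks

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

The Pearson value falls slightly below 1 because the relationship is curved, while the Spearman value reflects that the ordering of `y` matches the ordering of `x` exactly.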

<a id="exercise2-da"></a>
### Exercise 2
1. What does this tell us about our data?
2. What do you notice about the correlations?

In [None]:
# your code here

***
Let's subset the data to a single station.

In [None]:
station_name = "AWG"
station_data = df[df["Station ID"] == station_name]
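
The selection above works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`. A minimal sketch on a made-up frame (the values and the names `demo`, `mask`, and `subset` are hypothetical):

```python
import pandas as pd

# Hypothetical observations from two stations.
demo = pd.DataFrame(
    {
        "Station ID": ["AWG", "BNW", "AWG"],
        "Temperature (deg F)": [75.2, 71.6, 74.0],
    }
)

# Step 1: one True/False value per row.
mask = demo["Station ID"] == "AWG"
print(mask.tolist())

# Step 2: keep the rows where the mask is True.
subset = demo[mask]
print(subset)
```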

In [None]:
station_data.loc[:, "Temperature (deg F)"].plot()

In [None]:
station_data.loc[:, ["Temperature (deg F)", "Dewpoint (deg F)"]].plot()

In [None]:
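# Draw temperature and dewpoint first, then add humidity on a right-hand axis of the same plot
ax = station_data.loc[:, ["Temperature (deg F)", "Dewpoint (deg F)"]].plot()
station_data.loc[:, "Relative Humidity (%)"].plot(ax=ax, secondary_y=True, legend=True)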

In [None]:
station_data.loc[:, numeric_columns].corr()

<a id="exercise3-da"></a>
### Exercise 3
1. What do the correlations tell us about the data from this single station?
2. Why are some values NaN?

In [None]:
# your code here

[Return to top of notebook](#top-of-da)<br>
[Return to top of section](#explore-data)
***
<a id="analyze-data"></a>

# 3. Analyzing your data

Let's perform some simple analysis on our dataset. The temperature and dewpoint are in degrees Fahrenheit, but perhaps we want the units to be degrees Celsius. Let's write a function that converts the units, then assign the converted values to a new column.

In [None]:
def fahrenheit_to_celsius(temperature):
    return 5 / 9 * (temperature - 32)

In [None]:
fahrenheit_to_celsius(station_data.loc[:, "Temperature (deg F)"])
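
Before trusting a conversion on real data, it can help to check it against values we already know. The sketch below repeats the function and applies it to three reference temperatures; the Series `known` is made up for the check.

```python
import pandas as pd


def fahrenheit_to_celsius(temperature):
    return 5 / 9 * (temperature - 32)


# Reference points: freezing, boiling, and the value where both scales meet.
known = pd.Series([32.0, 212.0, -40.0])
converted = fahrenheit_to_celsius(known)

# Expect values very close to 0, 100, and -40 (up to floating-point rounding).
print(converted)
```

Because the function uses only arithmetic, it works element-wise on a whole Series as well as on a single number.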

In [None]:
station_data = station_data.assign(
    **{
        "Temperature (deg C)": fahrenheit_to_celsius(station_data.loc[:, "Temperature (deg F)"])
    }
)
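
The `**{...}` form is needed above because `.assign()` takes new column names as keyword arguments, and a name with spaces and parentheses is not a valid Python keyword. A small sketch of both spellings on made-up data (the frame `demo` and the short name `temp_c` are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"Temperature (deg F)": [32.0, 50.0]})
celsius = 5 / 9 * (demo["Temperature (deg F)"] - 32)

# A simple identifier can be passed directly as a keyword argument.
with_kwarg = demo.assign(temp_c=celsius)

# A name with spaces must go through dictionary unpacking instead.
with_dict = demo.assign(**{"Temperature (deg C)": celsius})

print(list(with_kwarg.columns))
print(list(with_dict.columns))

# .assign() returns a new DataFrame; the original is left unchanged.
print(list(demo.columns))
```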

In [None]:
station_data.columns

In [None]:
station_data.loc[:, "Temperature (deg C)"].plot()

<a id="exercise4-da"></a>
### Exercise 4
1. Convert dewpoint to degrees Celsius and assign it to a new column in `station_data`
2. Plot temperature and dewpoint in degrees Celsius together

In [None]:
# your code here

<a id="exercise5-da"></a>
### Exercise 5
1. Write a function to convert wind speed in knots to miles per hour. The conversion rate is approximately 1 knot = 1.15 mph.
2. Assign a new column to `station_data` using your function to convert wind speed from knots to miles per hour. (Of the columns loaded above, `"Wind Gust (knot)"` is the one measured in knots.)
3. Plot wind speed in miles per hour and wind direction on the same plot

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here