
### Summary Statistics for Boulder, Colorado Weather Data in 2017                                                               

In [43]:
import numpy as np 
import pandas as pd

The data we'll explore concerns temperatures and other weather observations in Boulder County over the month of July 2017.  

The data was obtained from the National Oceanic and Atmospheric Administration's [Climate.gov](https://www.climate.gov/) website.  You can find and download loads of climate-related data from NOAA [here](https://www.climate.gov/maps-data/datasets).   

The data is stored in a .csv file called `clean_boulder_weather.csv`.

In [None]:
# Path to the data file 
local_path = 'clean_boulder_weather.csv'

# Load the data into a DataFrame 
df = pd.read_csv(local_path)

In [45]:
df.head(10)

Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN
0,USW00094075,"BOULDER 14 W, CO US",2017-07-01,0.0,68.0,31.0
1,USW00094075,"BOULDER 14 W, CO US",2017-07-02,0.0,73.0,35.0
2,USW00094075,"BOULDER 14 W, CO US",2017-07-03,0.0,68.0,46.0
3,USW00094075,"BOULDER 14 W, CO US",2017-07-04,0.05,68.0,43.0
4,USW00094075,"BOULDER 14 W, CO US",2017-07-05,0.01,73.0,40.0
5,USW00094075,"BOULDER 14 W, CO US",2017-07-06,0.0,76.0,48.0
6,USW00094075,"BOULDER 14 W, CO US",2017-07-07,0.02,74.0,43.0
7,USW00094075,"BOULDER 14 W, CO US",2017-07-08,0.0,65.0,44.0
8,USW00094075,"BOULDER 14 W, CO US",2017-07-09,0.01,73.0,39.0
9,USW00094075,"BOULDER 14 W, CO US",2017-07-10,0.01,75.0,44.0


Each row in the DataFrame refers to a particular weather station / date combination.  The columns of the DataFrame are as follows: 

- **STATION**: The unique identification code for each weather station 
- **NAME**: The location / name of the weather station 
- **DATE**: The date of the observation 
- **PRCP**: The precipitation (in inches)
- **TMAX**: The daily maximum temperature (in Fahrenheit)
- **TMIN**: The daily minimum temperature (in Fahrenheit)

To observe how many weather stations we have, we can pass the **NAME** column (or the **STATION** column) into Python's set function. 

In [46]:
set(df["NAME"])

{'BOULDER 14 W, CO US',
 'BOULDER, CO US',
 'GROSS RESERVOIR, CO US',
 'NIWOT, CO US',
 'NORTHGLENN, CO US',
 'RALSTON RESERVOIR, CO US',
 'SUGARLOAF COLORADO, CO US'}
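As an alternative to `set`, pandas has its own tools for the same question: `Series.nunique()` counts distinct values and `Series.value_counts()` reports how many rows each value has. The sketch below uses a small made-up DataFrame named `toy` (the rows are hypothetical stand-ins, not the contents of the real file):

```python
import pandas as pd

# Hypothetical stand-in for the weather data (not the real file contents)
toy = pd.DataFrame({
    "NAME": ["BOULDER, CO US", "NIWOT, CO US", "BOULDER, CO US", "NORTHGLENN, CO US"],
    "TMAX": [88.0, 90.0, 91.0, 93.0],
})

n_stations = toy["NAME"].nunique()             # number of distinct station names
rows_per_station = toy["NAME"].value_counts()  # number of observations per station

print(n_stations)
print(rows_per_station)
```

`value_counts()` is also a quick way to spot a station with missing days.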

It appears that we have data from seven different weather stations.  For consistency, let's reduce the data to just the reports from the weather station in `Northglenn`.  

We extract the rows of the DataFrame concerned with the Northglenn weather station and store this data in a new DataFrame called `dfNorthglenn`. 

In [47]:
dfNorthglenn = df.loc[df['NAME'].eq('NORTHGLENN, CO US')]
dfNorthglenn.head()


Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN
184,USC00055984,"NORTHGLENN, CO US",2017-07-01,0.0,74.0,51.0
185,USC00055984,"NORTHGLENN, CO US",2017-07-02,0.0,91.0,55.0
186,USC00055984,"NORTHGLENN, CO US",2017-07-03,0.0,91.0,57.0
187,USC00055984,"NORTHGLENN, CO US",2017-07-04,0.0,91.0,56.0
188,USC00055984,"NORTHGLENN, CO US",2017-07-05,0.0,96.0,56.0


Pandas (and Numpy) have canned functions that compute each of the summary statistics. All of these functions can be called either on a Pandas Series (i.e. a column of a DataFrame) or on an entire DataFrame at one time.  

For instance, the sample mean of the maximum daily temperature is given by: 

In [48]:
dfNorthglenn["TMAX"].mean()

92.33333333333333

Let us observe what happens if we call .mean( ) on the entire DataFrame. 

In [49]:
# Keep only the numeric columns (this drops STATION, NAME, and DATE from dfNorthglenn)
dfNorthglenn = dfNorthglenn.select_dtypes(include='number')

# Calculate the mean values now 
mean_vals = dfNorthglenn.mean()

print(mean_vals)


PRCP     0.021667
TMAX    92.333333
TMIN    59.666667
dtype: float64


In this case, Pandas returned a Series with the means of all of the **numerical** data in the DataFrame. 
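The cell above overwrites `dfNorthglenn` with its numeric columns only. If the text columns need to be kept, one option is to pass `numeric_only=True` to `.mean()` instead. A minimal sketch on a hypothetical mixed-type frame called `toy` (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical mixed-type frame standing in for the station data
toy = pd.DataFrame({
    "NAME": ["A", "A", "A"],
    "TMAX": [90.0, 92.0, 94.0],
    "TMIN": [55.0, 57.0, 59.0],
})

# numeric_only=True skips the text column without modifying `toy`
means = toy.mean(numeric_only=True)
print(means)
```

The result is again a Series indexed by column name, and `toy` still contains its `NAME` column afterwards.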

The functions for the other summary statistics are as follows: 

\begin{array}{l|l}
\textrm{Function} & \textrm{Statistic} \\
\hline
\textrm{.var()} & \textrm{variance} \\
\textrm{.std()} & \textrm{standard deviation} \\
\textrm{.min()} & \textrm{minimum value} \\
\textrm{.max()} & \textrm{maximum value} \\
\textrm{.median()} & \textrm{median value} \\
\textrm{.quantile(q)} & \textrm{quantile, where q is the desired percentage as a decimal} \\
\end{array}
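One detail worth knowing when using these functions: pandas computes the *sample* variance and standard deviation (dividing by $n-1$), whereas NumPy's `np.std` divides by $n$ unless told otherwise. The sketch below illustrates this on a tiny hypothetical Series (the four values are made up):

```python
import numpy as np
import pandas as pd

temps = pd.Series([70.0, 80.0, 90.0, 100.0])  # hypothetical values

sample_var = temps.var()                    # divides by n - 1 (ddof=1)
sample_std = temps.std()                    # square root of the sample variance
population_std = np.std(temps.to_numpy())   # NumPy default divides by n (ddof=0)
median = temps.median()
q1 = temps.quantile(0.25)                   # default: linear interpolation between data points

print(sample_var, sample_std, population_std, median, q1)
```

Passing `ddof=1` to `np.std` makes the NumPy result agree with the pandas one.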

Let's use these functions to compute a 5-number summary for the maximum daily temperature in `dfNorthglenn`.

In [50]:
# STEP 1: compute 'minval', 'maxval', 'Q1', 'Q2', 'Q3'
# STEP 2: print them in increasing order (min, Q1, Q2, Q3, max), each formatted to two decimals

minval = dfNorthglenn['TMAX'].min()
maxval = dfNorthglenn['TMAX'].max()
Q1 = dfNorthglenn['TMAX'].quantile(0.25)
Q2 = dfNorthglenn['TMAX'].quantile(0.5)
Q3 = dfNorthglenn['TMAX'].quantile(0.75)

print("5-Number Summary: {:.2f}    {:.2f}    {:.2f}    {:.2f}    {:.2f}".format(minval, Q1, Q2, Q3, maxval))

5-Number Summary: 74.00    89.25    93.00    98.00    101.00


Pandas has a function called .describe( ) that will compute all of the standard summary statistics for you.  You can apply it either to a Pandas Series or to an entire DataFrame.  

We run the .describe( ) function on the **TMAX** column of the DataFrame `dfNorthglenn`, and check that the results agree with the computations from above.

In [51]:
tmax = dfNorthglenn['TMAX'].describe()

print(tmax)

count     30.000000
mean      92.333333
std        7.345340
min       74.000000
25%       89.250000
50%       93.000000
75%       98.000000
max      101.000000
Name: TMAX, dtype: float64
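Because `.describe()` returns a Series, individual statistics can be pulled out by label rather than read off the printout. A short sketch, again on a hypothetical four-value Series rather than the real data:

```python
import pandas as pd

temps = pd.Series([70.0, 80.0, 90.0, 100.0], name="TMAX")  # hypothetical values
summary = temps.describe()

# Select the five-number summary from the describe() output by label
five_number = summary[["min", "25%", "50%", "75%", "max"]]

print(five_number)
print(summary["50%"] == temps.median())
```

The `50%` entry is the median, so it should match `.median()` and `.quantile(0.5)`.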


Now, we'll explore how the mean and the standard deviation change when we perform basic transformations on the data.  In particular, we're interested in what happens if we 

1. Add or subtract some value from every entry in the data set 
1. Multiply every entry in the data set by some value 

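We know from above that the mean and standard deviation of the `Northglenn` **TMAX** values are approximately 92.333 and 7.345, respectively. The cell below adds 3 to every value, multiplies every value by 3, and recomputes both statistics in each case.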

In [53]:
tmax_add_3 = dfNorthglenn['TMAX'] + 3
tmax_add_3.head()

tmax_mult_3 = dfNorthglenn['TMAX'] * 3

new_tmax_mean_add_3 = tmax_add_3.mean()
new_tmax_std_add_3 = tmax_add_3.std()
new_tmax_mean_multiply_3 = tmax_mult_3.mean()
new_tmax_std_multiply_3 = tmax_mult_3.std()
print("Summary: {:.2f} {:.2f} {:.2f} {:.2f}".format(new_tmax_mean_add_3, new_tmax_std_add_3, new_tmax_mean_multiply_3, new_tmax_std_multiply_3))

Summary: 95.33 7.35 277.00 22.04


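Adding 3 to every value raised the mean by 3 and left the standard deviation unchanged, while multiplying every value by 3 multiplied both the mean and the standard deviation by 3. These observations can be proven using the formulas for the two statistics: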

$$
\bar{x} = \frac{1}{n} \displaystyle\sum_{k=1}^n x_k \quad \quad \textrm{and} \quad \quad s = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( x_k - \bar{x}\right)^2} 
$$

**The Mean with Addition**:  When we add a constant to each observation, the constant also gets added to the mean.  We can show this in general as follows.  Let $y_k = x_k + a$ be the shifted observations.  We then have:


$$
\bar{y} = \frac{1}{n} \sum_{k=1}^n y_k \quad = \quad \frac{1}{n} \sum_{k=1}^n (x_k + a) \quad = \quad 
\frac{1}{n} \sum_{k=1}^n x_k  + \frac{1}{n} \sum_{k=1}^n a 
\quad = \quad \bar{x}  + \frac{1}{n} \cdot an 
\quad = \quad \bar{x}  + a 
$$

**The Std Dev with Addition**: On the contrary, the standard deviation stays the same when we add a constant to each observation.  This should make intuitive sense because the std dev is a measure of the spread of the data, and by adding a constant to each observation we're just shifting everything along the number line without changing the spread.  Let's see if we can use the formula for std dev to confirm this mathematically. 

$$
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( y_k - \bar{y} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left[ (x_k + a) - (\bar{x}+a) \right]^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( x_k - \bar{x} \right)^2 } 
$$

We thus see that the standard deviations of the $x$'s and the $y$'s are the same. 

**The Mean with Multiplication**: When we multiply each observation by a constant the mean also gets multiplied by the constant.  We can show this in general as follows.  Let $z_k = b \cdot x_k$ be the multiplied observations.  We then have:


$$
\bar{z} = \frac{1}{n} \sum_{k=1}^n z_k \quad = \quad \frac{1}{n} \sum_{k=1}^n b \cdot x_k  \quad = \quad 
b \cdot \frac{1}{n} \sum_{k=1}^n x_k  
\quad = \quad b\cdot \bar{x}  
$$

**The Std Dev with Multiplication**: Further, the std dev gets multiplied by the absolute value of the constant, because $\sqrt{b^2} = |b|$.  For a positive constant such as 3, this is just the constant itself.

$$
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( z_k - \bar{z} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n \left( b\cdot x_k - b\cdot \bar{x} \right)^2 } \quad = \quad 
\sqrt{\frac{1}{n-1}\sum_{k=1}^n b^2 \cdot \left( x_k - \bar{x} \right)^2 } \quad = \quad 
|b|\cdot \sqrt{\frac{1}{n-1}\sum_{k=1}^n  \left( x_k - \bar{x} \right)^2 } 
$$
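The four rules can also be checked numerically. The sketch below uses the first five Northglenn **TMAX** values shown earlier, with a shift `a` and a deliberately negative scale factor `b` chosen here purely for illustration. With a negative factor, the standard deviation scales by $|b|$ rather than $b$, since $\sqrt{b^2} = |b|$ and a standard deviation can never be negative.

```python
import numpy as np
import pandas as pd

x = pd.Series([74.0, 91.0, 91.0, 91.0, 96.0])  # first five Northglenn TMAX values
a, b = 3.0, -2.0  # illustrative shift and (negative) scale factor

shifted = x + a
scaled = b * x

# Shifting moves the mean by a and leaves the spread unchanged
print(np.isclose(shifted.mean(), x.mean() + a))
print(np.isclose(shifted.std(), x.std()))

# Scaling multiplies the mean by b and the std dev by |b|
print(np.isclose(scaled.mean(), b * x.mean()))
print(np.isclose(scaled.std(), abs(b) * x.std()))
```

`np.isclose` is used instead of `==` because floating-point arithmetic can introduce tiny rounding differences.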



Now, we will convert **TMAX** and **TMIN** from Fahrenheit to Celsius.  Remember that the transformation is given by: 

$$
\textrm{CELSIUS} = \frac{5}{9} (\textrm{FAHRENHEIT}-32) 
$$

In [54]:
# dfNorthglenn.loc[:, 'TMAX'] selects every row of the TMAX column
dfNorthglenn['TMAX-C'] = (5/9) * (dfNorthglenn.loc[:, 'TMAX'] - 32)

dfNorthglenn['TMIN-C'] = (5/9) * (dfNorthglenn.loc[:, 'TMIN'] - 32)
dfNorthglenn.head()

Unnamed: 0,PRCP,TMAX,TMIN,TMAX-C,TMIN-C
184,0.0,74.0,51.0,23.333333,10.555556
185,0.0,91.0,55.0,32.777778,12.777778
186,0.0,91.0,57.0,32.777778,13.888889
187,0.0,91.0,56.0,32.777778,13.333333
188,0.0,96.0,56.0,35.555556,13.333333


What do we expect the mean and the standard deviation of the daily maximum temperature to be in Celsius?

Let $\bar{x}^{min}$ and $\bar{x}^{max}$ represent the means of **TMIN** and **TMAX** in Fahrenheit, respectively. We then expect the means of the associated values in Celsius, stored in `y_min` and `y_max`, to follow the same transformation as the individual observations. We will use the following two formulas for the calculations:

$$
\bar{y}^{min} = \frac{5}{9}\left( \bar{x}^{min} - 32 \right) \\
\bar{y}^{max} = \frac{5}{9}\left( \bar{x}^{max} - 32 \right)
$$

Finally, we will apply the .mean( ) method to **TMAX-C** and **TMIN-C** and compare the results.

In [55]:
x_min = dfNorthglenn['TMIN'].mean()  # mean of TMIN in F
x_max = dfNorthglenn['TMAX'].mean()  # mean of TMAX in F
y_min = (5/9) * (x_min - 32)  # predicted mean of TMIN in C, from the formula
y_max = (5/9) * (x_max - 32)  # predicted mean of TMAX in C, from the formula

print("Predicted Mean Min Temp in C = {:.3f}".format(y_min))
print("Predicted Mean Max Temp in C = {:.3f}".format(y_max))

# Means computed directly from the converted columns
ybar_min = dfNorthglenn['TMIN-C'].mean()
ybar_max = dfNorthglenn['TMAX-C'].mean()

print("Mean Min Temp in C = {:.3f}".format(ybar_min))
print("Mean Max Temp in C = {:.3f}".format(ybar_max))

Predicted Mean Min Temp in C = 15.370
Predicted Mean Max Temp in C = 33.519
Mean Min Temp in C = 15.370
Mean Max Temp in C = 33.519


In [56]:
std_min = dfNorthglenn['TMIN'].std()
std_max =  dfNorthglenn['TMAX'].std()
std_C_min = dfNorthglenn['TMIN-C'].std()
std_C_max =  dfNorthglenn['TMAX-C'].std()

print("Standard deviation in F for T min and T max: {:.2f} {:.2f}".format(std_min, std_max))
print("Standard deviation in C for T min and T max: {:.2f} {:.2f}".format(std_C_min, std_C_max))
# The standard deviation in C equals the one in F times 5/9: subtracting 32 has no effect on the spread, but multiplying by 5/9 scales it

Standard deviation in F for T min and T max: 4.91 7.35
Standard deviation in C for T min and T max: 2.73 4.08
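The 5/9 relationship can be confirmed directly by taking the ratio of the two standard deviations. A small sketch using the five **TMAX** values from the `head()` output above (a subset of the data, so the individual statistics differ from the full-month figures, but the ratio should not):

```python
import numpy as np
import pandas as pd

tmax_f = pd.Series([74.0, 91.0, 91.0, 91.0, 96.0])  # TMAX in F, from the head() output
tmax_c = (5 / 9) * (tmax_f - 32)                    # same values in C

# Ratio of the spreads: the -32 shift drops out, only the 5/9 factor remains
ratio = tmax_c.std() / tmax_f.std()

print(np.isclose(ratio, 5 / 9))
print(np.isclose(tmax_c.mean(), (5 / 9) * (tmax_f.mean() - 32)))
```

The full-month figures are consistent with this as well: $7.35 \times 5/9 \approx 4.08$ and $4.91 \times 5/9 \approx 2.73$.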


After these computations, we will work through the following steps:

(a) Compute the daily temperature range (max **minus** min) in Fahrenheit for each row in the `Northglenn` DataFrame and store it in a column called **TDIFF**.

(b)  What is the mean temperature difference over the month of July? 

(c)  What is the difference between the means of the max and min daily temperatures? 

(d)  Do you see a relationship between these two quantities?  If so, can you prove that it's always the case for mean difference and difference of means? 

In [57]:
# PART A
dfNorthglenn['TDIFF'] = dfNorthglenn['TMAX'] - dfNorthglenn['TMIN']
dfNorthglenn.head()

Unnamed: 0,PRCP,TMAX,TMIN,TMAX-C,TMIN-C,TDIFF
184,0.0,74.0,51.0,23.333333,10.555556,23.0
185,0.0,91.0,55.0,32.777778,12.777778,36.0
186,0.0,91.0,57.0,32.777778,13.888889,34.0
187,0.0,91.0,56.0,32.777778,13.333333,35.0
188,0.0,96.0,56.0,35.555556,13.333333,40.0


To compute the mean temperature difference we just need to compute the mean of the **TDIFF** column we just created.

In [58]:
# PART B
mean_diff = dfNorthglenn['TDIFF'].mean()
print("Mean Temp Diff = {:.3f}".format(mean_diff))


Mean Temp Diff = 32.667


We now compute the difference of the **max** and **min** temperature means from the `dfNorthglenn` data frame.

In [37]:
# PART C
diff_of_means = dfNorthglenn['TMAX'].mean() - dfNorthglenn['TMIN'].mean()

print("Diff of Mean Temps = {:.3f}".format(diff_of_means))

Diff of Mean Temps = 32.667
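For part (d), the two quantities agree, and a short derivation suggests this is no coincidence. Writing $x_k^{max}$ and $x_k^{min}$ for the maximum and minimum temperature on day $k$ (notation introduced here to match the means used earlier), and assuming both are recorded on the same $n$ days, we have:

$$
\frac{1}{n} \sum_{k=1}^n \left( x_k^{max} - x_k^{min} \right) \quad = \quad 
\frac{1}{n} \sum_{k=1}^n x_k^{max} - \frac{1}{n} \sum_{k=1}^n x_k^{min} \quad = \quad 
\bar{x}^{max} - \bar{x}^{min}
$$

So the mean of the differences always equals the difference of the means when every day has both readings. One caveat: pandas skips missing values when computing a mean, so if **TMAX** and **TMIN** were missing on different days, the two sides would be averaged over different sets of days and could disagree.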
