# Introduction

**Note that this lecture is currently incomplete; if you come back to the page and this message is gone, you can consider it complete.**

## What are "statistical methods," and why do we need them?

To address this, let's break down the course title into three pieces:

1. "Statistical"
2. "Methods"
3. "in Earth and Atmospheric Sciences"

### "Statistical"

Imagine a very simple dataset: "Whether or not it rained more than one foot in Central Park on August 15, 2023".  It consists of a single value: `No` (you can take my word for it). 

We don't need anything complicated to characterize it.  As the saying goes, in this rather odd instance, the data "speaks for itself."

Now let's look at a more realistic dataset: the actual daily precipitation as measured by the [Central Park weather station](https://w1.weather.gov/data/obhistory/KNYC.html).  This record spans from the year 1869 to today.  Does this data "speak for itself"?  

If it did, we could just print out its values, take it all in, and be pleased with ourselves.  Let's try that using Python, the programming language that we'll use throughout this course (if the lines of code below seem mysterious or confusing to you, just set that aside for now, and trust that we've used them to load daily data from the Central Park weather station):

In [4]:
import xarray as xr

In [6]:
filepath_in = "../data/central-park-station-data.nc"
precip_central_park = xr.open_dataset(filepath_in)["precip"]

In [10]:
print(f"# of values in the dataset: {len(precip_central_park)}")

# of values in the dataset: 56428


That's probably more than we can take in by eye.  Let's try anyway:

In [16]:
import numpy as np

# Need this so that it actually prints all the values rather than just a few.
with np.printoptions(threshold=np.inf):
    print(precip_central_park.values)

[0.75 0.03 0.   0.18 0.05 0.   0.   0.   0.   0.01 0.   0.85 0.   0.
 0.04 0.   0.   0.   0.15 0.   0.   0.   0.   0.   0.   0.   0.   0.
 0.   0.47 0.   0.   0.   0.   1.55 0.   0.   0.   0.   0.   0.85 0.05
 0.   0.   0.   2.6  0.   0.   0.54 0.06 0.   0.   0.   0.56 0.   0.
 0.66 0.   0.   0.   0.02 0.   0.1  0.   0.03 0.   0.   0.   1.06 0.
 0.   0.   0.02 0.18 0.   0.   0.   0.   0.28 0.   0.   0.95 0.   0.
 0.   0.78 0.   0.   1.15 0.04 0.01 0.43 0.   0.   0.   0.   0.3  0.
 0.   0.   0.   0.   0.   0.   0.   0.   0.04 0.   0.   0.12 0.08 0.
 0.   0.   0.   0.   0.   0.05 0.36 0.   0.58 0.7  0.02 0.   0.   0.
 0.   0.   0.   0.   0.   0.   0.12 0.   0.15 0.5  0.   0.   0.87 0.
 0.27 0.19 0.   0.   0.   0.4  0.   0.07 0.   0.14 0.14 0.   0.08 0.
 0.   0.08 0.   0.   0.33 0.   0.19 0.01 0.01 0.   0.93 0.06 0.   0.
 0.   0.   0.   1.53 0.02 0.   0.01 0.   0.02 0.67 0.2  0.   0.26 0.
 0.   0.42 0.   0.   0.   0.   0.   0.52 0.   0.   0.   0.1  0.18 0.05
 0.38 0.   0.   0.   0.04 0.64

That's definitely way too much for the data to "speak for itself".  In fact, that's true for anything more than one or two data points.  Moreover, while more than 50,000 data points may sound like a lot, in the Earth Sciences it's not uncommon to work with datasets with millions or even billions of points.

Pretty much always, we need to translate our raw data into digestible chunks: individual numbers and plots that capture key things about the dataset we want to know.  And that's exactly what statistics provides.

### "Methods"

Think of "methods" as just another word for "tools."  As such, what we're after in this course is building a toolkit of statistical techniques that will be valuable to you in your research and other coursework.  This has some important implications:

- We'll focus *less* on the underlying theoretical basis for the techniques we'll discuss.  Just enough to be confident in their meaning, their conceptual basis, and their interpretation.
- We'll focus *more* on implementing them IRL (in real life).  That means coding them up and using them to analyze real datasets from the Earth sciences.
- We'll devote time not just to specific analysis techniques, but all the other steps in the pipeline from a raw data file to accurately generated statistical measures and pretty plots.  That includes things like:
  - setting up your computational environment (python, conda, jupyter, ...)
  - dealing with different file formats (netCDF, CSV, .xls, grib, ...)
  - cleaning your data (missing data, clearly wrong data, probably wrong data, ...)
  - computer performance, i.e. making your calculations run in a reasonable amount of time (vectorized routines, parallelization, ...)

### "in Earth and Atmospheric Sciences"

Of course, being in the EAS department, we will focus almost exclusively on datasets and problems that arise in the Earth and Atmospheric Sciences (as opposed to, say, medicine).  And my own research emphasis within EAS is on rainfall and how it varies on many timescales (day to day, season to season, year to year, etc.), which will bias the example datasets in that direction.  Finally, since we're at *the* City College of New York, after all, we'll use data from our own backyard as much as possible.

But the overall approach---how to characterize datasets as succinctly and accurately as possible, individually and as they relate to one another---is universally applicable.  Likewise for the vast majority of the individual tools themselves; with modest effort nearly all of them can be adapted to a wide array of datasets, e.g. time series of economic data, astronomical data from telescopes, biometric data from your smart watch.

## Why care about statistics at all?

Why bother with things like conditional probabilities, autocorrelations, covariance matrices, etc., which especially if you haven't heard of them before might sound like kind of a pain?  Well, virtually every major problem in the Earth sciences---and in fact, virtually every major problem facing the world---ultimately has a statistical basis.

In the Earth sciences, that includes all of the following questions:

- Are extreme rainfall events in New York City becoming more frequent?
- How does the El Niño-Southern Oscillation influence wintertime
- In ice core records of Earth's climate going back thousands and millions of years, are there key recurring cycles, and if so how long do they last?
- What is the relationship between air pollution and child asthma?
  - Based on that, is there a meaningful difference in child asthma incidence in Harlem vs. the Upper East Side?

# Descriptive Statistics

Suppose you've just gotten hold of some data to analyze: in this case, the Central Park precipitation (or "precip" for short) time series.  In what ways can you condense it into digestible pieces?

The most compact way of representing anything is to boil it down to a single number (sometimes called a *scalar*).  Now we'll walk through some key scalar measures of any dataset.  They are grouped by which aspect of the data they most directly capture:

- **Measures of central tendency**: roughly, what do "typical" values of the dataset look like?
- **Measures of dispersion**: roughly, are the values tightly clumped together or spread far apart?
- **Measures of shape**: moving away from where the values most clump together, is there a long "tail" of values extending in one direction, or the other, or both?  (I.e., how lopsided is it, and how "fat-tailed" or "skinny-tailed" is it?)

## Measures of Central Tendency

### Mean

Probably the single most intuitive measure is the average, or *mean*.  Sum up all the values, and divide by the number of values.  Symbolically:

$$\overline{X}=\frac{1}{N}\sum_{i=1}^N X_i,$$

where

- the overbar $\overline{\phantom{X}}$ denotes the mean
- $X_i$ is our dataset
- the subscript $_i$ indexes the individual data points.  So $X_1$ is the first value, $X_2$ is the second value, etc.
- $N$ is the total number of points
- $\sum_{i=1}^N$ is the standard notation for summation.  It means: sum over all the values of $X_i$ from $i=1$ to $i=N$
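
To make the formula concrete, here is a minimal sketch that applies it by hand to a tiny array and compares the result against NumPy's built-in mean.  The six values in `sample` are made up for illustration; they are not taken from the Central Park record.

```python
import numpy as np

# Hypothetical daily precipitation values in inches (made up for illustration).
sample = np.array([0.0, 0.0, 0.75, 0.03, 0.0, 0.18])

# Apply the formula directly: sum all the values, then divide by N.
n_points = len(sample)
mean_by_hand = sample.sum() / n_points

# NumPy's built-in mean should agree with the by-hand calculation.
mean_builtin = sample.mean()

print(f"By hand: {mean_by_hand:.4f}, built-in: {mean_builtin:.4f}")
```
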

(footnote: weighted averages)

The mean of the Central Park rainfall dataset is:

In [29]:
precip_central_park.mean()

### Median

The median is the middle value of the sorted data: at least half of the data points lie at or below it, and at least half lie at or above it.

Compared to the mean, it is insensitive to "outliers"---that is, points that are way different than most of the other points.
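
A quick sketch of that insensitivity, using made-up numbers: replacing one value with an extreme outlier drags the mean far away but leaves the median where it was.

```python
import numpy as np

# Hypothetical values, made up for illustration.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Same data, but with the largest value replaced by an extreme outlier.
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

mean_before, median_before = values.mean(), np.median(values)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after}, median={median_after}")
```
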

The median of the Central Park rainfall dataset is:

In [26]:
precip_central_park.median()

...zero?  Can that be right?  Yes: on most days, meaning more than half of days, there is no rain (in New York City as well as most places other than rainforests and other extremely wet places).  So the median, which separates the dataset into a lower half and an upper half, will be zero.

This highlights that precipitation is not a truly *continuous* quantity the way say temperature is.  On days that precipitation occurs, the amount is indeed continuous---there are no discrete amounts of rainfall that must occur.  But on days with no precipitation at all, there is a single discrete value: zero.
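
One way to see this mixed discrete-plus-continuous behavior is with a synthetic sketch.  Everything below is an assumption made for illustration: the dry-day probability, the gamma distribution for wet-day amounts, and its parameters are arbitrary choices, not fits to the Central Park record.

```python
import numpy as np

# Synthetic data for illustration only; NOT the Central Park record.
rng = np.random.default_rng(seed=0)
n_days = 10_000

# Assumed probability that a given day is dry (a made-up round number).
prob_dry = 0.67
is_dry = rng.random(n_days) < prob_dry

# On wet days, draw a continuous positive amount.
# The gamma distribution and its parameters are arbitrary choices here.
wet_amounts = rng.gamma(shape=0.8, scale=0.5, size=n_days)
synthetic_precip = np.where(is_dry, 0.0, wet_amounts)

# Whenever more than half of the values are exactly zero, the median is zero.
frac_dry = np.mean(synthetic_precip == 0.0)
median_precip = np.median(synthetic_precip)

print(f"Fraction of dry days: {frac_dry:.2f}")
print(f"Median: {median_precip}")
```
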

The median is also just a particular example of a more general quantity, *quantiles* (if expressed as a fraction from 0 to 1) or *percentiles* (if expressed as percentages from 0% to 100%).
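
As a small check of that relationship, the sketch below computes the median, the 0.5 quantile, and the 50th percentile of a tiny made-up sample using NumPy; all three should be the same number.  The values in `sample` are hypothetical.

```python
import numpy as np

# Hypothetical daily precipitation values in inches (made up for illustration).
sample = np.array([0.0, 0.0, 0.0, 0.01, 0.05, 0.18, 0.75, 1.55])

median = np.median(sample)
q50 = np.quantile(sample, 0.5)    # quantile: expressed as a fraction from 0 to 1
p50 = np.percentile(sample, 50)   # percentile: expressed as a percentage from 0 to 100

print(median, q50, p50)
```

With an even number of points, NumPy's default is to interpolate between the two middle values, so here the result is the average of 0.01 and 0.05.
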

Let's look at a few select quantiles of the Central Park rainfall:

In [31]:
precip_central_park.quantile([0.25, 0.5, 0.75, 0.99])

So in this case, the bottom 75% of points span only a 0.05 inch range, while the 75th to 99th percentile spans 1.66 inches!

### Mode

The mode is simply the value that occurs most frequently.

Same as for the median, there's not a nice compact way of expressing that in an equation.  You just count up how many times different values occur and see which one does the most.
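
That counting procedure can be sketched in a few lines.  This is just one way to do it (using `np.unique`), applied to a tiny made-up sample rather than the real dataset.

```python
import numpy as np

# Hypothetical daily precipitation values in inches (made up for illustration).
sample = np.array([0.0, 0.0, 0.75, 0.0, 0.03, 0.03, 0.0])

# Count how many times each distinct value occurs.
unique_values, counts = np.unique(sample, return_counts=True)

# The mode is the distinct value with the largest count.
mode_value = unique_values[np.argmax(counts)]
mode_count = counts.max()

print(f"Mode: {mode_value} (occurs {mode_count} times)")
```
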

So based on our discussion so far of the Central Park rainfall dataset, it should come as no surprise that the mode is zero:

In [27]:
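# Import the stats submodule explicitly; a bare `import scipy` may not load it in older SciPy versions.
import scipy.stats
scipy.stats.mode(precip_central_park)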

ModeResult(mode=0.0, count=37690)

Note that this is a rare example where, for a continuous variable, the mode is well defined.  We can contrast that with the Central Park temperature record.
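
To illustrate why the mode is usually ill-defined for a truly continuous variable, here is a sketch using synthetic "temperatures" drawn from a normal distribution.  The distribution and its parameters are assumptions chosen only for illustration; this is not the Central Park temperature record.

```python
import numpy as np

# Synthetic "temperatures" for illustration only; NOT the Central Park record.
rng = np.random.default_rng(seed=0)
synthetic_temps = rng.normal(loc=55.0, scale=15.0, size=1_000)

# At full floating-point precision, essentially every value is distinct,
# so no single value occurs "most frequently".
_, counts_raw = np.unique(synthetic_temps, return_counts=True)
max_count_raw = counts_raw.max()

# Rounding to whole numbers makes repeats appear, and a mode becomes meaningful.
rounded = np.round(synthetic_temps)
values_rounded, counts_rounded = np.unique(rounded, return_counts=True)
mode_rounded = values_rounded[np.argmax(counts_rounded)]
max_count_rounded = counts_rounded.max()

print(f"Largest count, raw values: {max_count_raw}")
print(f"Largest count, rounded values: {max_count_rounded} (mode = {mode_rounded})")
```
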

## Measures of Dispersion
1. Range
2. Variance
3. Standard Deviation
4. Example: Analyzing Precipitation Variability

## Measures of Shape
1. Skewness
2. Kurtosis
3. Example: Analyzing Wind Speed Distribution

## Measures of Association
1. Covariance
2. Correlation
3. Example: Correlation between Temperature and Humidity

# Data Visualization
## Importance of Visualization
1. Understanding Data
2. Communicating Findings

## Types of Plots and Graphs
1. Histograms
2. Scatter Plots
3. Line Charts
4. Bar Charts
5. Box Plots

## Visualization Tools in Python
1. Matplotlib
2. Seaborn
3. Example: Visualizing Climate Patterns

## Best Practices in Visualization
1. Choosing the Right Chart Type
2. Labeling and Annotations
3. Aesthetics and Accessibility

# Descriptive statistics and visualization as quality control

The Central Park weather station has not always stayed in the exact same place or used the exact same instruments.

# Practical Exercise (Interactive/Lab Component)
1. Exploring a Real Dataset (e.g., Climate Data)
2. Calculating Descriptive Statistics
3. Creating Visualizations
4. Discussion and Interpretation

# Conclusion
1. Summary of Key Concepts
2. Relevance to Upcoming Topics
3. Q&A and Feedback

# Supplementary Materials
1. Additional Readings
2. Tutorial Videos or Links
3. Homework or Assignments