# CSS 120 - Environmental Data Science

## Lecture 02 -- Introduction

# The Evolution of Data Analysis

- **1970s Data Analysis**: Taught and practiced mainly as statistics, within mathematics or statistics departments.

- **Today's Trend**: Machine learning (ML), based in computer science departments, is prevalent.

- **Key Question**: Is ML just a rebranded form of statistics?

# Origins of ML and Statistics

- **Statistics**: Dates back to 1749, rooted in government data collection.

- **Probability's Beginnings**: Started in 1654 with Pascal and Fermat.

- **Growth of Statistics**: Expanded by Pierre-Simon Laplace in 1812 to a wide range of applications.

# Birth of Computer Science and AI

- **Post-WWII Developments**: Emergence of digital programmable computers.

- **AI's Inception**: The term 'artificial intelligence' was coined by John McCarthy for the 1956 Dartmouth Conference.

- **ML as AI Branch**: Concept of learning machines introduced by Turing in 1950.

# The Rise and Fall of AI

- **AI's Early Overpromises**: Led to two major 'AI winters' due to unrealistic expectations.

- **Lighthill's Critique**: 1973 report highlighting disappointments in AI progress.

- **Shift in Terminology**: Terms such as ML and data mining were adopted to avoid the stigma attached to AI.

# ML in the Internet Era

- **Data Abundance**: Boosted by the rise of the Internet in the 1990s.

- **Multilayer Perceptron**: A milestone in ML development in 1986.

- **Data Science Emergence**: Term introduced in late 1990s by statisticians.

# ML vs. Statistics in Data Science

- **Distinct Cultures**: Despite overlapping, ML and statistics have different approaches.

- **ML's Growth**: Fostered in environments supporting heuristic research.

- **ML's 'Black Box' Nature**: Contrasts with statistical models' interpretability.

# AI's Aspiration and the Yin-Yang of Data Science

- **AI's Ambition**: To mimic the human brain's structure in model complexity.

- **Hinton's Argument**: A large number of parameters is necessary in AI models.

- **Dualism in Data Science**: ML and statistics as complementary elements.

# AI's Aspiration and the Yin-Yang of Data Science

![](https://raw.githubusercontent.com/rauls3/R-Projects/main/1.1.png)

# Exploration Patterns: Yin and Yang in Science

- **Maritime Exploration Analogy**: Eastern route before Columbus's western journey.

- **Cosmology's Shift**: From visible matter to exploring dark matter and dark energy.

- **Data Science Evolution**: Starting with small parameter models in statistics, moving to ML's larger parameter models.

# Breaking Traditional Constraints in ML

- **The Parameter 'Sound Barrier'**: Overcoming the old constraint of parameter count.

- **Real-World Applications**: ML excels in complex problems like image recognition, self-driving cars.

- **The Yin and Yang of Data Size**: ML's rapid growth due to its capacity to handle large parameter spaces.

# Predictor Variables in Statistics vs. ML

- **Predictor Selection**: A common practice in statistics, less so in ML.

- **ML's Approach to Information**: Prefers to retain rather than discard data.

- **Issues with Predictor Selection**: Risk of overestimating prediction skill.

# The Tradeoff: Interpretability vs. Accuracy

- **Statistics**: Offers interpretability with fewer parameters and predictors.

- **Machine Learning**: Prioritizes accuracy over interpretability, especially in complex datasets.

- **Evolving Complexity**: As data grows in size and complexity, ML's accuracy becomes more vital.

# Physics and Data Science: Parallel Evolutions

- **Classical vs. Quantum Mechanics**: Transition from deterministic to probabilistic understanding.

- **Modern Physics Approach**: Utilizing both classical and quantum mechanics as needed.

- **Data Science's Duality**: Learning both statistics and ML for diverse data challenges.

# Introduction to Environmental Data Science

# Introduction to Environmental Data Science

- **Definition**: Intersection of environmental science (ES) and data science.

- **Branches of ES**: Includes atmospheric science, hydrology, oceanography, and more.

- **Data Characteristics**: Each ES branch has unique data characteristics.

- **Role of Statistics**: Long history of statistical methods application in various ES fields.

# Characteristics of Environmental Data

- **Data Types**: Environmental datasets usually contain continuous data (e.g., temperature, wind speed).

- **Comparison with Non-Environmental Data**: Non-environmental ML datasets often have discrete or categorical data.

- **Bounded vs. Unbounded Data**: Non-environmental data is typically bounded, while environmental data often isn't.

# Predictive Modeling in Environmental Data Science

- **Problem Types**: Classification for discrete output, regression for continuous output.

- **Method Development**: Many ML methods initially designed for classification, later adapted for regression.

- **Performance Evaluation**: Models are tested with separate datasets to evaluate accuracy.

# Challenges in Predictive Modeling with Environmental Data

- **Outlier Issues**: Greater risk with unbounded continuous data than with bounded data.

- **Extrapolation Risks**: Models trained on limited domains may yield inaccurate predictions for new, outlying data.

- **Complexity in Environmental Predictions**: Accurate prediction using environmental data can be more challenging than in other domains.

# Outliers in Environmental Data Science

![Outliers in Data Science](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.06.39.png)

# Outliers in Environmental Data Science

- **Figure Explanation**: Shows the challenge of outliers in 2-D input data.

- **Training vs. Test Data**: Difference in ranges can lead to extrapolation errors.

- **Case Study**: Excessive extrapolation in predicting air quality due to outlier test data.

# Air Quality Prediction Example

- **Predictor Variable**: Cumulated precipitation, used to predict a pollutant concentration.

- **Study Details**: Nonlinear regression model trained on 2013-2015 data, tested on 2010-2012 data.

- **Extrapolation Issue**: Significant difference in precipitation levels between training and test datasets.
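A simple guard against this kind of extrapolation is to check whether test inputs fall outside the range seen in training. The sketch below is only illustrative: the function name `outside_training_range` and the precipitation values are hypothetical and are not taken from the study.

```python
def outside_training_range(train_values, test_values, margin=0.0):
    """Return the test values lying outside the training range.

    margin widens the accepted range by a fraction of the training span.
    """
    lo, hi = min(train_values), max(train_values)
    span = hi - lo
    return [v for v in test_values
            if v < lo - margin * span or v > hi + margin * span]

# Hypothetical cumulated precipitation values (arbitrary units)
train_precip = [0.0, 2.5, 10.0, 4.1]
test_precip = [1.0, 12.0, -0.5, 9.9]

flagged = outside_training_range(train_precip, test_precip)
print(flagged)  # test inputs where the model would have to extrapolate
```

Predictions at flagged inputs deserve extra caution, since the model has seen no training data there.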

# Weather vs. Climate in Environmental Data Science

- **Old Saying**: "Climate is what you expect; weather is what you get."

- **Data Grouping**: Short-term variations ('weather') vs long-term averages ('climate').

- **Practical Implications**: Importance of seasonal forecasts in various industries.

# Central Limit Theorem's Effect on Data Nature

- **Data Transformation**: Weather data averaged over time becomes climate data.

- **Central Limit Theorem**: Causes weakening of non-linear relations in averaged data.

- **Model Performance**: Non-linear models more effective for weather data than climate data.

# Central Limit Theorem's Effect on Data Nature

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.06.52.png)
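The weakening of nonlinear relations under averaging can be illustrated with a toy experiment. The sketch below assumes independent daily values and a purely quadratic daily relation, both chosen for illustration only (real weather data are autocorrelated). It compares how much a quadratic fit gains over a linear fit before and after 30-day averaging.

```python
import numpy as np
from numpy.polynomial import Polynomial

def r_squared(x, y, order):
    """Fraction of variance in y explained by a polynomial fit of the given order."""
    fit = Polynomial.fit(x, y, deg=order)
    residual = y - fit(x)
    return 1.0 - residual.var() / y.var()

rng = np.random.default_rng(0)
n_periods, n_days = 2000, 30

# Hypothetical 'weather': y depends quadratically on x, plus noise
x_daily = rng.normal(size=(n_periods, n_days))
y_daily = x_daily**2 + 0.5 * rng.normal(size=x_daily.shape)

# 'Climate': averages over each 30-day period
x_avg = x_daily.mean(axis=1)
y_avg = y_daily.mean(axis=1)

gain = {}
for label, xs, ys in (("daily", x_daily.ravel(), y_daily.ravel()),
                      ("30-day mean", x_avg, y_avg)):
    gain[label] = r_squared(xs, ys, 2) - r_squared(xs, ys, 1)
    print(f"{label}: R^2 gain of quadratic over linear fit = {gain[label]:.3f}")
```

In this setup the quadratic model explains most of the daily variance but adds little for the averaged data.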

# Climate Extremes in Data Analysis

- **Shift in Focus**: Growing interest in climate extremes due to climate change.

- **Climate Extremes Variables**: Derived from daily data, e.g., annual number of frost days.

- **ML in Climate Extremes Study**: Increasing application in analyzing these variables.

# ML Adoption in Environmental Sciences

- **Varied Germination Rates**: ML took root at different speeds in different fields, depending on the strength of each field's existing models.

- **Meteorology vs. Hydrology**: Slower ML adoption in meteorology due to established numerical models.

- **Hydrology and Remote Sensing**: Quicker acceptance of ML models.

# Challenges for ML Models in Environmental Sciences

- **Need for Large Sample Sizes**: Nonlinear ML models need larger samples before they outperform linear statistical models.

- **Oceanography and Climate Science**: Slower ML adoption due to data collection difficulties and long timescales.

- **Traditional vs. ML Approaches**: Initially separate, now merging within environmental sciences.

# Recent Trends and Reviews in Environmental AI/ML

- **Recent Developments**: Rapid ML growth even in fields like oceanography and climate science.

- **Integration of ML and Physics**: Emerging trend of combining divergent approaches.

- **Key Reviews**: Important literature on AI/ML's role in environmental sciences by Haupt, Gagne, and Hsieh.

# Introduction to Curve Fitting

# Introduction to Curve Fitting

- **Concept**: Curve fitting with one independent variable $x$ and one dependent variable $y$.

- **True Signal**: A quadratic relation:

$$ y_{\text{signal}} = x - 0.25x^2 $$

- **Data Composition**:

$$ y = y_{\text{signal}} + \epsilon $$

where $\epsilon$ is Gaussian noise.

# Synthetic Data Advantage

- **Purpose**: With synthetic data the true signal is known, so each fit can be judged against it.

- **Noise Characteristics**: Gaussian distribution with zero mean and a standard deviation half that of $y_{\text{signal}}$.
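The synthetic dataset can be generated in a few lines. This is a minimal sketch, assuming evenly spaced inputs on $[-2, 2]$ and NumPy's random generator; the function name `make_synthetic_data`, the seed, and the spacing are illustrative choices rather than the lecture's exact setup.

```python
import numpy as np

def make_synthetic_data(n_points, noise_ratio=0.5, x_min=-2.0, x_max=2.0, seed=0):
    """Return x, noisy y, and the noise-free signal x - 0.25 x^2."""
    rng = np.random.default_rng(seed)
    x = np.linspace(x_min, x_max, n_points)    # assumption: evenly spaced inputs
    y_signal = x - 0.25 * x**2                 # true quadratic signal
    noise_std = noise_ratio * y_signal.std()   # noise std = half the signal std
    y = y_signal + rng.normal(0.0, noise_std, size=n_points)
    return x, y, y_signal

x, y, y_signal = make_synthetic_data(11)
print(np.column_stack([x, y_signal, y]).round(2))  # columns: x, signal, noisy data
```

Because the signal is returned alongside the noisy data, any fitted curve can be compared directly with the truth.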

# Polynomial Curve Fitting

- **Polynomial Function**: $$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_m x^m$$

- **Adjustable Parameters**: The weights $w_j$ ($j = 0, \ldots, m$) give $m + 1$ parameters in total.

- **Polynomial Orders**: Example uses orders 1, 2, 4, and 9 for fitting.

# Minimizing Mean Squared Error (MSE)

- **Objective**: Fit polynomials to data by minimizing the MSE between $\hat{y}$ and data $y$.

- **MSE Formula**: 

$$ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 $$

- **Data Points**: Fitting done with 11 data points in the example.

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.07.46.png)
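A least-squares polynomial fit is exactly the fit that minimizes the MSE on the training data. The sketch below uses NumPy's `Polynomial.fit` on 11 synthetic points; the seed and the evenly spaced inputs are assumptions, so the numbers will not match the figure.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 11)                 # 11 training inputs (assumed evenly spaced)
y_signal = x - 0.25 * x**2
y = y_signal + rng.normal(0.0, 0.5 * y_signal.std(), size=x.size)

def mse(y_hat, y):
    """Mean squared error between predictions and data."""
    return float(np.mean((y_hat - y) ** 2))

train_mse = {}
for order in (1, 2, 4, 9):
    model = Polynomial.fit(x, y, deg=order)    # least-squares fit, i.e. minimum MSE
    train_mse[order] = mse(model(x), y)
    print(f"order {order}: training MSE = {train_mse[order]:.4f}")
```

The training MSE can only go down as the order increases, which is why it cannot be used on its own to choose the order.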

# Underfitting and Overfitting Concepts

- **Simple Linear Regression**: Order 1 polynomial is a straight line fit.

- **Fit Improvement**: The fit to the true signal improves from order 1 to order 2, then worsens at higher orders.

- **Overfitting Example**: Order 9 polynomial fits training data but misses the true signal.

# Model Validation Importance

- **Model Limitation**: Real-world scenarios don't reveal the true signal.

- **Independent Validation Data**: Essential to detect overfitting or underfitting.

- **MSE Trends**: Training MSE keeps decreasing as the polynomial order increases, whereas validation MSE is lowest at order 2.

# Model Validation Importance

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.07.57.png)
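The role of independent validation data can be sketched as follows. The setup is assumed rather than taken from the lecture: 11 evenly spaced training points, 1000 validation points drawn uniformly from the same domain, and an arbitrary seed. Which order wins on validation depends on the noise draw.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

def signal(x):
    return x - 0.25 * x**2

x_train = np.linspace(-2.0, 2.0, 11)
x_val = rng.uniform(-2.0, 2.0, size=1000)      # independent validation inputs
noise_std = 0.5 * signal(x_train).std()
y_train = signal(x_train) + rng.normal(0.0, noise_std, size=x_train.size)
y_val = signal(x_val) + rng.normal(0.0, noise_std, size=x_val.size)

train_mse, val_mse = {}, {}
for order in range(1, 10):
    model = Polynomial.fit(x_train, y_train, deg=order)
    train_mse[order] = float(np.mean((model(x_train) - y_train) ** 2))
    val_mse[order] = float(np.mean((model(x_val) - y_val) ** 2))
    print(f"order {order}: train {train_mse[order]:.3f}, validation {val_mse[order]:.3f}")

best_order = min(val_mse, key=val_mse.get)     # order with the lowest validation MSE
print("order selected by validation:", best_order)
```

A gap that opens between training and validation MSE as the order increases is the practical signature of overfitting.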

# Impact of Training Data Size

- **More Data Reduces Overfitting**: Comparison between 15 and 100 data points.

- **Overfitting Reduction**: Order 9 polynomial fit improves with more data.
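The effect of training set size can be checked numerically by measuring how far an order 9 fit strays from the true signal. This sketch averages over 20 noise draws to smooth out randomness; the trial count, seed, and evaluation grid are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def signal(x):
    return x - 0.25 * x**2

def mean_signal_error(n_train, order=9, n_trials=20, seed=3):
    """Average MSE between the fitted curve and the true signal over several noise draws."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-2.0, 2.0, 401)       # dense grid inside the training domain
    x = np.linspace(-2.0, 2.0, n_train)
    noise_std = 0.5 * signal(x).std()
    errors = []
    for _ in range(n_trials):
        y = signal(x) + rng.normal(0.0, noise_std, size=n_train)
        model = Polynomial.fit(x, y, deg=order)
        errors.append(np.mean((model(x_grid) - signal(x_grid)) ** 2))
    return float(np.mean(errors))

err_small = mean_signal_error(15)
err_large = mean_signal_error(100)
print(f"order 9, 15 points:  error vs. true signal = {err_small:.4f}")
print(f"order 9, 100 points: error vs. true signal = {err_large:.4f}")
```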

# Effect of Noise Level and Data Quantity

- **High Noise Scenario**: Increased noise leads to worse overfitting.

- **Large Data with High Noise**: Overfitting is reduced even with noisy data if the dataset is large.

- **Modern Data Science Insight**: Ability to retrieve weak signals from noisy data with ample data points.

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.38.06.png)

# Extrapolation in Polynomial Solutions

- **Extrapolation Domain**: Analysis beyond the training domain of $x \in [-2, 2]$.

- **Solution Behavior**: Higher order polynomials perform poorly outside the training domain.

- **Reproducibility Issues**: Different extrapolations with different random data initializations.

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.38.20.png)

# Challenges with Polynomial Extrapolation

- **Rapid Increase Outside Domain**: High-order polynomials grow quickly in magnitude as $x \rightarrow \pm \infty$.

- **Modern Data Science Methods**: Artificial neural networks use basis functions that grow less aggressively.

- **Extrapolation Taming**: These methods reduce wild extrapolation but do not eliminate it.
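How badly polynomials extrapolate can be seen by evaluating fitted models outside the training domain. The sketch below trains on $[-2, 2]$ and evaluates at $x = 4$; the evaluation point, the 20 noise draws, and the seed are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def signal(x):
    return x - 0.25 * x**2

def mean_extrapolation_error(order, x_new=4.0, n_train=11, n_trials=20, seed=4):
    """Average absolute error at x_new, a point outside the training domain [-2, 2]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-2.0, 2.0, n_train)
    noise_std = 0.5 * signal(x).std()
    errors = []
    for _ in range(n_trials):
        y = signal(x) + rng.normal(0.0, noise_std, size=n_train)
        model = Polynomial.fit(x, y, deg=order)
        errors.append(abs(model(x_new) - signal(x_new)))
    return float(np.mean(errors))

extrap_error = {order: mean_extrapolation_error(order) for order in (1, 2, 4, 9)}
for order, err in extrap_error.items():
    print(f"order {order}: mean |error| at x = 4 is {err:.2f}")
```

Each order sees the same noise draws (same seed), so the comparison isolates the effect of polynomial order.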

# Basic Types of Data in Data Science and Types of Learning

# Basic Types of Data in Data Science

## A. Discrete or Categorical Variables
- **Nature**: Defined, distinct categories or specific values.
- **Examples**: Binary (e.g., [0, 1]), States (e.g., [on, off]), Truth Values (e.g., [true, false]).
- **Environmental Science Applications**: Weather states ([storm, no storm]), Temperature ranges ([cold, normal, warm]).

## B. Continuous Variables
- **Nature**: Variables that can take any value within a range.
- **Environmental Science Applications**: Temperature, wind speed, pollutant concentration.

## C. Probability Distributions
- **Role**: Describe the likelihood of different outcomes.
- **Examples**: Gaussian distribution for temperature, Weibull distribution for wind speed.

# Data Descriptions and Preferences in Machine Learning and Statistics

## Machine Learning vs. Statistics
- **Early ML Focus**: Predominantly on discrete/categorical data, especially in commercial/engineering fields.
- **Environmental Science Preference**: Continuous data for specific predictions (e.g., exact temperature).
- **Statistical Approach**: Use of probability distributions for detailed predictions (e.g., temperature with mean and standard deviation).

## Linkage Between ML and Statistics
- **Historical Gap**: Initial lack of strong connection between ML and statistical methods.
- **Recent Developments**: Improved integration, with ML methods increasingly cast in probabilistic frameworks.

# Types of Learning in Data Science

## Supervised Learning
- **Analogy**: Learning with a teacher's guidance; here the teacher is the target output data that the model is trained to match.

## Unsupervised Learning
- **Analogy**: Solitary learning (e.g., child solving a jigsaw puzzle on their own).
- **Nature**: Relying on self-organization without direct teaching.

## Reinforcement Learning
- **Brief Description**: A less common form of learning, in which a system learns from feedback on its actions.

# Supervised Learning

# Introduction to Supervised Learning

## Definition and Process
- **Objective**: To find a mapping from input variables $x$ to output variables $y$; the model's prediction is written $\hat{y}$.
- **Input Variables**: Also known as predictors, features, attributes, or covariates.
- **Output Variables**: Also known as response variables or predictands.
- **Notation**: $\{x_i, y_i\}$ represents the training data, where $i = 1, \ldots, N$ and $N$ is the sample size.

# Types of Supervised Learning

## Regression
- **Description**: Output variables are real or continuous (e.g., predicting next day's wind speed from temperature, humidity, and pressure).
- **Input Nature**: Usually real variables, but can include discrete/categorical variables.

## Classification
- **Description**: Output variables are discrete/categorical (e.g., classifying weather as ‘storm’ or ‘no storm’).
- **Binary vs. Multi-Class**: Binary classification for two classes, multi-class for more than two (e.g., classifying seasonal temperatures or satellite images).
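The two problem types differ only in the kind of target. The toy sketch below uses made-up data with two standardized predictors: a continuous target fitted by least-squares regression, and a binary target handled by a nearest-centroid rule. Both methods are stand-ins chosen for brevity and are not methods prescribed in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                       # two hypothetical standardized predictors

# Regression: continuous target
y_cont = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=n)
A = np.column_stack([np.ones(n), X])              # add an intercept column
w, *_ = np.linalg.lstsq(A, y_cont, rcond=None)    # least-squares weights
print("regression weights:", w.round(2))

# Classification: discrete target (1 = 'storm', 0 = 'no storm', hypothetical labels)
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
centroids = np.array([X[y_class == k].mean(axis=0) for k in (0, 1)])
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
y_pred = dist.argmin(axis=1)                      # assign each point to the nearest class mean
accuracy = float((y_pred == y_class).mean())
print(f"training accuracy: {accuracy:.2f}")
```

Regression is scored with an error measure such as MSE, while classification is scored by how often the predicted class is correct.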

# Supervised Learning in Environmental Science vs. Other Fields

## Environmental Science Applications
- **Common Use**: Tends to focus more on regression problems.
- **Combined Approach**: Some problems use both classification and regression (e.g., precipitation forecast).

## Non-Environmental Applications
- **Common Use**: Predominantly classification (e.g., spam filters, credit card fraud detection, handwriting recognition).
- **Large Class Variety**: Especially in applications like object recognition.

# Unsupervised Learning

# Introduction to Unsupervised Learning

## Concept and Goal
- **Key Difference from Supervised Learning**: No target output data $y$, only input data $x$.
- **Objective**: To find hidden structure or patterns within the input data $x$.

# Applications of Unsupervised Learning

## Clustering
- **Purpose**: Grouping similar data points in the input space.
- **Example**: Identifying teleconnection patterns in atmospheric data.

## Dimension Reduction
- **Purpose**: Condensing high-dimensional data into lower dimensions.
- **Techniques**: Such as principal component analysis.
- **Example**: Reducing environmental datasets from 100 dimensions to 2 or 3 dimensions.
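Dimension reduction by principal component analysis can be sketched with a singular value decomposition. The data here are synthetic and built to lie close to a 2-D subspace of a 100-D space; the sizes and noise level are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_latent = 500, 100, 2

# Synthetic 100-D data driven by 2 hidden variables plus small noise
latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# PCA: center the data, then take the SVD
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained = s**2 / np.sum(s**2)                   # variance fraction per component

X_reduced = X_centered @ Vt[:2].T                 # project onto the first 2 components
print("reduced shape:", X_reduced.shape)
print(f"variance explained by 2 components: {explained[:2].sum():.3f}")
```

When a few components explain most of the variance, the remaining dimensions can be dropped with little loss of information.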

# The Significance of Unsupervised Learning

## Perspectives on Unsupervised Learning
- **Geoffrey Hinton's View**: Emphasizes the vastness of unsupervised learning in human learning.
- **Quote**: Learning from labels supplies perhaps one bit per second, which is far too little; Hinton argues the brain needs more like $10^5$ bits per second, and only the input itself can supply that.

## Deep Learning and Unsupervised Learning
- **Review by LeCun, Bengio, and Hinton**: Unsupervised learning initially had a catalytic effect in reviving deep learning, then was overshadowed by the successes of supervised learning.
- **Future Expectation**: Anticipation of unsupervised learning gaining more importance, mirroring human and animal learning patterns.

# Understanding the Curse of Dimensionality

# Understanding the Curse of Dimensionality

## Definition

- **Origin**: Coined by Bellman in 1961.
- **Problem**: Methods that work well on low-dimensional datasets can become ineffective in high dimensions.

# Dimensionality and Data Coverage

## Illustrative Example
- **1-D Space**: A segment of width 0.5 covers half of the unit interval [0, 1].
- **2-D Space**: A square of width 0.5 covers a quarter of the unit square.
- **3-D and Beyond**: A cube of width 0.5 covers one eighth of the unit cube; in $d$ dimensions the coverage is $0.5^d$, an exponential decrease.

![](https://raw.githubusercontent.com/rauls3/R-Projects/img/Screenshot%202023-10-11%20at%2010.08.11.png)
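The coverage figures follow from raising the width to the power of the dimension. The short sketch below also counts the average number of data points per half-width cell, assuming a hypothetical sample of 100 points; both function names are illustrative.

```python
def coverage_fraction(width, dims):
    """Fraction of the unit hypercube covered by a sub-cube of the given width."""
    return width ** dims

def points_per_cell(n_points, dims, cells_per_axis=2):
    """Average number of data points per cell when each axis is split into equal parts."""
    return n_points / cells_per_axis ** dims

for d in (1, 2, 3, 10):
    print(f"{d:2d}-D: coverage = {coverage_fraction(0.5, d):.5f}, "
          f"points per cell (100 points) = {points_per_cell(100, d):.3f}")
```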

# Impact on Data Sampling and Techniques

## Sampling Challenges

- **High-Dimensional Spaces**: Data points become sparse in higher dimensions; e.g., 10-D space contains $2^{10} = 1024$ hypercubes of width 0.5, so even 100 data points would average fewer than 0.1 data points per hypercube.
- **Implication**: Difficulty in obtaining representative samples in high-dimensional spaces.

## Technique Breakdown

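- **Example**: K-nearest neighbors becomes ineffective because even the nearest neighbors are far away in high dimensions.
- **Polynomial Fit Challenges**: Polynomial fits generalize poorly to high dimensions, since the number of terms grows rapidly with the number of inputs.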
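The growing distance to the nearest neighbor can be measured directly. This sketch draws 200 uniform random points in the unit hypercube (an assumed sample size and distribution) and reports the mean nearest-neighbor distance for several dimensions, using only the standard library.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, dims, seed=0):
    """Mean Euclidean distance from each point to its nearest neighbor."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

nn_distance = {d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 10, 50)}
for d, dist in nn_distance.items():
    print(f"{d:2d}-D: mean nearest-neighbor distance = {dist:.3f}")
```

With the sample size held fixed, the "nearest" neighbor in high dimensions is no longer local, so averaging over neighbors stops being a local estimate.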

# Working with High-Dimensional Data

## Strategies for High Dimensions

- **Data Concentration**: Real high-dimensional data often lie close to a lower-dimensional subspace, so their effective dimension is much smaller.
- **Exploiting Smoothness**: Using properties like local interpolation in real data.

## Successful Methods

- **Approach**: Methods that leverage these properties tend to work well in high-dimensional spaces.

# Questions?

# See you in the next class!