---
title: "Exploratory Data Analysis 2"
author: 
  - name: "Beatrice Taylor"
    email: "beatrice.taylor@ucl.ac.uk"
date-as-string: "8th October 2025"
from: markdown+emoji
format: revealjs 
jupyter: python3
---

# Last week

## Overview of lecture 1

Need to fill this in. 

# This week 

## First things first

:::: {.columns}

::: {.column width="60%"}

<div style="text-align:center;">
  <img src="L3_images/scientific_method.png" alt="The scientific method" style="max-width:100%">
</div>


:::

::: {.column width="40%"}

<br>

Exploratory data analysis is the first step of any data science project. 

:::

::::

::: {.notes}
link to scientific process and how we test and then iterate over ideas and concepts. Something we will talk about more next lecture. 
:::

## Introducing statistical concepts 

::: {.incremental}
- This lecture is focussed on statistical concepts
- Data science is about using ideas from statistics to describe large datasets 
- Here looking to describe numerical data
- Focus on probability distributions 
:::

## Learning Objectives
By the end of this lecture you should be able to:

::: {.incremental}
1. Describe the characteristic features of common probability distributions. 
2. Calculate exponentials and logarithms. 
3. Evaluate whether a dataset is representative. 
:::

# Motivation 

## What is the likelihood of events occurring? 

**Question**

What is the probability of someone at UCL being over 190cm? 

. . . 

**How can we try to answer this?**

. . .

::: {.fragment .strike}
We could try and find someone on campus who is over 190cm. 
:::

. . .

Better to try and understand the distribution of heights. 

::: {.notes}
This week thinking about the distribution of data. 
How do we understand the spread of height data? How do we calculate important features of this data? How can we describe this data using mathematical equations? 

Next week, thinking about formulating and answering the research question. 
:::

# Random Sampling 

## The dream vs reality 
Ideally, we would like all the relevant data.

. . .

... in reality we normally only have some. 

<div style="text-align:center;">
  <img src="L2_images/stickfigures.gif" alt="Random Sampling" style="max-width:80%;">
</div>

::: {.notes}
It would be great if I knew the height of everyone at UCL, but that is unrealistic to collect. However, maybe I could collect the heights of all students in this lecture theatre.  
:::

## Approximating 
Hence, we sample a subset of the data. 

. . .

We need to choose our sample carefully. We want what happens in the sample to approximate what happens in the whole population. 

. . .

**In practice**

::: {.incremental}
- different sampling approaches
  - random sampling 
  - systematic sampling 
:::
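
As a rough sketch of the two approaches, the code below draws a random and a systematic sample from a simulated population. The population (10,000 heights drawn from a normal distribution) and the sample size are assumptions made up for illustration, not real UCL data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (not real data).
population = [random.gauss(165, 10) for _ in range(10_000)]
sample_size = 100

# Random sampling: every individual is equally likely to be chosen.
random_sample = random.sample(population, sample_size)

# Systematic sampling: random starting point, then every k-th individual.
step = len(population) // sample_size
start = random.randrange(step)
systematic_sample = population[start::step]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
systematic_mean = statistics.mean(systematic_sample)

print(f"population mean: {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"systematic sample mean: {systematic_mean:.2f}")
```

Both sample means should land close to the population mean, which is what we want from a well-chosen sample.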

# Is the data representative? 

## Bias
It's important to understand whether your dataset is truly representative, or whether it is susceptible to bias. 

<div style="text-align:center;">
  <img src="L2_images/stickfigures_biased.gif" alt="Biased sampling" style="max-width:80%;">
</div>


## Cognitive bias 
[*Systematic patterns in how we think about, and perceive, the world.*]{style="color:#49a0c4"} 

:::: {.columns}

::: {.column width="50%"}

We all have cognitive biases. 

These can impact our research:

::: {.incremental}
- data collection 
- data selection 
- data processing 
- modelling choices
:::

:::

::: {.column width="50%"}

<br>
<br>

<div style="text-align:center;">
  <img src="L2_images/human_brain_2.png" alt="Human Brain" style="max-width:60%">
</div>

:::

::::

## Why is this important? 

If we're not careful we can propagate bias to the research, and hence results. 

. . .

This can lead to incorrect conclusions. 

## Types of bias 

- Research bias
  - Cognitive bias 

- Dataset bias 
  - Historical bias 
  - Selection bias 

::: {.notes}
Already talked about the idea of cognitive bias. There are lots and lots of different types of biases - going to talk a little about different ways they can bias the dataset. 
:::

## Historical bias
[*Reflects existing, real world, inequalities*]{style="color:#49a0c4"} 

Examples: 

- [Police profiling](https://www.amnesty.org.uk/press-releases/uk-police-forces-supercharging-racism-crime-predicting-tech-new-report) 
  - Automated tools to detect 'criminals'. 
  - Trained on datasets which reflect current racist practices. 
- [Hungry judges](https://en.wikipedia.org/wiki/Hungry_judge_effect)
  - Tools to help sentencing of 'criminals'. 
  - Patterns in the dataset --- judges are less lenient before lunch. 

::: {.notes}
Police profiling models - meant to make it easy to spot criminals - but use existing crime data - which reflect racist prejudices in society - and the model learns to pick out people of certain ethnic backgrounds. 

Tools also used to help the sentencing of criminals in courts of law - observe the hungry judges effect, where they sentence differently based on how hungry they are - tend to be harsher just before lunch and more lenient after lunch. 
:::

## Selection bias 
[*When the sample chosen doesn’t represent the whole population of interest*]{style="color:#49a0c4"} 

Examples: 

- Self selection [Roy Model](https://en.wikipedia.org/wiki/Roy_model) 
  - Underlying characteristics of people who self select into certain groups. 
- [WEIRD people](https://oecs.mit.edu/pub/spow8trw/release/1) 
  - Commonly sampled in behavioural sciences. 
  - Reflects a very small proportion of global population. 

::: {.notes}
Roy model - example from economics - basically there's not a random selection of people in a specific job - they have chosen that job based on underlying unobserved characteristics. For example, suppose I'm interested in wages of workers in different occupations - but they have self-selected into that occupation based on their unique skill set - so it's a biased representation. 

WEIRD = western, educated, industrialised, rich, democratic
:::

## Can data ever be truly representative? 

Probably not. 

::: {.notes}
Even when we think we have really good data, someone has made the choice to collect this data. Why that data over other data? Does that data reflect their underlying cognitive biases?
:::

. . .

**Failing that...**
... we can acknowledge our biases!

<div style="text-align:center;">
  <img 
    src="https://imgs.xkcd.com/comics/flawed_data.png" 
    alt="XKCD - Flawed Data" 
    style="width:800px">
  <div style="font-size:0.8em; color: #555; margin-top:4px;">
    Image credit: [xkcd](https://xkcd.com/2494/)
  </div>
</div>

# Descriptive Statistics 

## What to declare 

- Sample size (n) 
- Mean, median, mode 
- Standard deviation 
- Range 

## Example 
::: columns

::: column

Let's look at a dataset of students' heights. 

<br>

Easy to print the summary statistics in Python, using `pandas`: 

```{python}
#| echo: true
#| output: false   # show code only here
import pandas as pd 

height_df = pd.read_csv("L2_data/heights.csv")
height_df.describe().round(2)
```


:::

::: column


```{python}
#| echo: false     # hide code
#| output: true    # show only output
import pandas as pd 

height_df = pd.read_csv("L2_data/heights.csv")
height_df.describe().round(2)
```

| Statistic | Height_cm |
|-----------|-----------|
| count     | 1000.0    |
| mean      | 161.19    |
| std       | 9.79      |
| min       | 128.59    |
| 25%       | 154.52    |
| 50%       | 161.25    |
| 75%       | 167.48    |
| max       | 199.53    |


:::
:::



## Same same but different 
What about when the descriptive statistics don't tell us enough? 

<div style="text-align:center;">
  <img src="L2_images/anscombes_four.png" alt="Anscombe's quartet" style="max-width:70%">
</div>

::: {.notes}
These all have mean x = 9, variance x = 11, mean y = 7.5, variance y = 4.1.
:::
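
As an illustrative check, the sketch below computes the mean and sample variance for the four datasets of Anscombe's quartet. The values are the standard published ones, typed in by hand here; they are not loaded from the lecture's files.

```python
import statistics

# Anscombe's quartet: datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),  # sample variance
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
    }
    s = summary[name]
    print(f"{name}: mean x={s['mean_x']:.2f}, var x={s['var_x']:.2f}, "
          f"mean y={s['mean_y']:.2f}, var y={s['var_y']:.2f}")
```

The four rows print almost identical numbers, even though the scatter plots look completely different. This is why we plot the data as well as summarising it.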


# Normal Distribution
The most fundamental distribution

## Everywhere you look

<div style="text-align:center;">
  <img src="L2_images/normal_distribution.png" alt="Normal distribution" style="max-width:90%">
</div>


::: {.notes}
- Human height
- Happiness
- Petal size
:::



## You've seen it all before  

::: {.incremental}
- Symmetrical
- Single peak 
- Smooth tails on both sides
:::


## Naturally occurring 

<div style="text-align:center;">
  <img src="L2_images/height_histogram.png" alt="Normal distribution" style="max-width:90%">
</div>

## Naturally occurring 

<div style="text-align:center;">
  <img src="L2_images/height_histogram_pdf.png" alt="Normal distribution" style="max-width:90%">
</div>


## Uniquely described by two parameters...

<div style="text-align:center;">
  <img src="L2_images/normal_distribution_annotated.png" alt="Normal distribution" style="max-width:90%">
</div>


## ...and a probability density function

<div style="text-align:center;">
  <img src="L2_images/normal_distribution_pdf.png" alt="Normal distribution with annotated probability density function" style="max-width:90%">
</div>

[*The probability density function (PDF) describes the likelihood of different outcomes for a continuous random variable*]{style="color:#49a0c4"} 

<!-- The PDF of the normal distribution is: 
```{=tex}
\begin{align}
\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{align}
``` -->

## Key features 

::: {.incremental}
- Characteristic size (most data points lie close to the mean): single peak
- Data is equally likely to be larger or smaller than average: symmetric, with smooth tails on both sides
- Data is continuous (it is something you measure not something you count) 
:::

## Sampling distributions 

[The distribution of a statistic, such as the sample mean, calculated from random samples of size $n$]{style="color:#49a0c4"} 

. . .

For the mean of a sample of size $n$ drawn from a normal distribution, the standard deviation becomes: 

```{=tex}
\begin{align}
\frac{\sigma}{\sqrt{n}}
\end{align}
```
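
A small simulation can make this concrete. The sketch below repeatedly draws samples of size $n$ from a normal distribution and compares the spread of the sample means with $\sigma / \sqrt{n}$. The values of `mu`, `sigma`, `n` and the number of repeats are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

mu, sigma = 160.0, 10.0   # assumed population mean and standard deviation
n = 25                    # size of each sample
n_samples = 2000          # how many samples we draw

# Mean of each random sample of size n.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(n_samples)
]

observed = statistics.stdev(sample_means)  # spread of the sample means
expected = sigma / math.sqrt(n)            # sigma / sqrt(n)

print(f"observed: {observed:.3f}, expected: {expected:.3f}")
```

The observed spread should sit close to the expected value of 2, and it shrinks further as $n$ grows.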

## Calculating probabilities 

Can use the PDF to evaluate the probability density at a specific point. 

```{=tex}
\begin{align}
x \sim N(0,1)
\end{align}
```

. . .

```{=tex}
\begin{align}
p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{align}
```

. . . 

```{=tex}
\begin{align}
p(x=0.5) = \frac{1}{\sqrt{2 \pi \times 1^2}} e^{-\frac{(0.5-0)^2}{2\times 1^2}}
\\
= \frac{1}{\sqrt{2 \pi}} e^{-\frac{0.25}{2}} = 0.3521
\end{align}
```

<!-- Generally want the probability that $x>p$ or $x<=p$. 

Area under the curve. -->
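
As a sketch, the density of the standard normal at $x=0.5$ can be evaluated in Python using only the standard library. With $\mu=0$ and $\sigma=1$ the exponent is $-0.5^2/2 = -0.125$. The second part assumes, purely for illustration, that heights are normally distributed with mean 161.19 and standard deviation 9.79 (the values in the summary statistics), and uses the area under the curve to estimate the probability of being over 190cm.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of N(mu, sigma^2) evaluated at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)

density = normal_pdf(0.5)
library_density = NormalDist(0, 1).pdf(0.5)  # cross-check with the standard library
print(round(density, 4))  # 0.3521

# Probability over a range is the area under the curve.
# Assumption: heights ~ N(161.19, 9.79^2), taken from the summary statistics.
heights = NormalDist(mu=161.19, sigma=9.79)
p_over_190 = 1 - heights.cdf(190)
print(f"P(height > 190cm) = {p_over_190:.4f}")
```

Note that the density at a point is not itself a probability. For continuous data we get probabilities by taking the area under the curve over a range.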

## Not everything is normal 
Many real world datasets are *approximately* normally distributed.

. . .

But not all ---  might not have a characteristic size, or not continuous, or not symmetric. 

# More generally, what is a probability distribution? 

## Continuous vs. Discrete 
**Continuous data** 
Measurable data which can take any value within a given range.  

[*example*: height]{style="color:#abc766"}

**Discrete data**
Measurable data which can take separate, countable values. 

[*example*: shoe size]{style="color:#abc766"} 

::: {.notes}
Suppose I have someone who is x cm tall, and someone who is y cm tall, I can find someone in between who's x.5cm tall. 

But shoe sizes are discrete (not the length of your foot). 
:::

## Back to the probability function 

```{=tex}
\begin{align}
p(x)
\end{align}
```

Having a function for the distribution allows us to evaluate the probability of events, and hence evaluate hypotheses. 

. . .

For discrete distributions we have the probability mass function. 

. . .

**Sampling distributions**

As for the normal distribution, in the general case we should be aware of the sampling distribution. 

# Binomial distribution 

## Coin toss 

:::: {.columns}

::: {.column width="50%"}

**Discrete outcomes**

Describes the number of successes in a fixed number of independent trials, each with 2 outcomes 

<br>

::: {.incremental}
- I flip a coin 10 times
- How often can I expect to get at least 7 heads? 
:::

:::

::: {.column width="50%"}

<br>
<br>

<div style="text-align:center;">
  <img src="L2_images/coin_flip.jpg" alt="A hand flipping a coin." style="max-width:70%">
</div>

:::

::::


## Probability mass function 

```{=tex}
\begin{align}
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
\end{align}
```

Where $n$ is the number of trials, and $p$ is the probability of success for each trial. 

<br>

[*The probability mass function (PMF) describes the likelihood of different outcomes for a discrete random variable*]{style="color:#49a0c4"} 

## Example 

::: {.incremental}
- I flip a coin 10 times 
  - **$n=10$, $p=0.5$**
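- How often can I expect to get at least 7 heads? 
  - **$k \geq 7$**
:::

. . . 

Evaluating the PMF at $k=7$ gives the probability of exactly 7 heads. Summing it over $k = 7, \dots, 10$ gives the probability of at least 7 heads: 

```{=tex}
\begin{align}
P(X = 7) &= \binom{10}{7} 0.5^7 (1-0.5)^{10-7} = 0.1172
\\ P(X \geq 7) &= \sum_{k=7}^{10} \binom{10}{k} 0.5^k (1-0.5)^{10-k} = 0.1719
\end{align}
```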
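
The coin toss calculation can be sketched in a few lines of Python. The helper `binomial_pmf` is a name introduced here for illustration; it simply implements the PMF formula.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5

# Probability of exactly 7 heads.
p_exactly_7 = binomial_pmf(7, n, p)

# Probability of at least 7 heads: add up k = 7, 8, 9, 10.
p_at_least_7 = sum(binomial_pmf(k, n, p) for k in range(7, n + 1))

print(round(p_exactly_7, 4))   # 0.1172
print(round(p_at_least_7, 4))  # 0.1719
```

"Exactly 7" and "at least 7" are different questions, so it is worth being explicit about which one is being asked.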

# Poisson distribution 

## Death by horse kicks 

:::: {.columns}

::: {.column width="50%"}

::: {.incremental}
- It's 1894. 
- You're the statistician Ladislaus Bortkiewicz.
- And you're wondering, 
- How many soldiers in the Prussian army have been killed by horse kicks?
:::

:::

::: {.column width="50%"}

<br>

![](L2_images/horse_kicks.jpeg)

:::

::::

## Measuring rare events

::: {.incremental}
- Imagine a situation where rare events (like the arrival of mail) occur **independently** of one another. 
- The Poisson distribution describes how many such events we can expect within a fixed interval.
- Fixed interval (e.g. one minute)
- Fixed rate of events ($\lambda$) (e.g. 4 cars per minute, $\lambda=4$)
- Events occur independently.
- The Poisson distribution gives the probability of observing $k$ events in the interval.
:::

## Probability mass function 

```{=tex}
\begin{align}
P(X = k) = \frac{\lambda^k e^{- \lambda}}{k!}
\end{align}
```

Where $\lambda$ is the expected number of events in a given interval.

## Example 

::: {.incremental}
- Between 1883 and 1893 there were an average of 2 deaths from horse kicks a year. 
  - **$\lambda=2$**
- What's the probability of seeing 10 deaths from horse kicks in 1894? 
::: 

. . .

```{=tex}
\begin{align}
P(X = 10) = \frac{2^{10} e^{-2}}{10!}
\\ = 0.000038
\end{align}
```
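
The same calculation as a short Python sketch. The helper `poisson_pmf` is a name introduced here for illustration; it implements the PMF formula directly.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) when events occur at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2  # average of 2 deaths per year, as in the example

p_10 = poisson_pmf(10, lam)
print(f"{p_10:.6f}")  # 0.000038

# For comparison: probabilities of seeing 0 to 4 deaths in a year.
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```

Seeing 10 deaths in one year would be extremely unlikely if the rate really were 2 per year.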

# Exponentials and Logarithms 

## Exponentials

If the Poisson measures the probability of x events within a time period, then the Exponential measures how long we are likely to wait between events.
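
As a rough illustration, the sketch below simulates waiting times between events that occur at a rate of 2 per year (an assumed value, chosen to match the horse kick example). For an exponential distribution with rate $\lambda$ the average wait is $1/\lambda$.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

rate = 2.0  # assumed rate: 2 events per year

# Simulated waiting times (in years) between consecutive events.
waits = [random.expovariate(rate) for _ in range(10_000)]

mean_wait = statistics.fmean(waits)
print(f"average wait: {mean_wait:.3f} years (theory: {1 / rate} years)")
```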

. . .

[*The greatest shortcoming of the human race is our inability to understand the exponential function*]{style="color:#49a0c4"} – Albert Bartlett (physicist)

::: {.notes}
He's a bit dramatic
:::

## A game of chess...

<!-- <div style="text-align:center;">
  <img src="L2_images/chessboard_rice.gif" alt="Exponential rice" style="max-width:60%;">
</div> -->

:::: {.columns}

::: {.column width="50%"}

::: {.incremental}
- You've invented chess. 
- The emperor asks what you would like as thanks. 
- You ask for grains of rice.  
:::

:::

::: {.column width="50%"}

<div style="text-align:center;">
  <img src="L2_images/Radha_Krishna_chess.jpg" alt="An illustration of Krishna and Radha playing Chaturanga." style="max-width:90%;">
  <div style="font-size:0.8em; color: #555; margin-top:4px;">
    Image credit: https://simple.wikipedia.org/wiki/Chaturanga
  </div>
</div>

:::

::::

::: {.notes}
Famous story, of the man who invented chess. 
Emperor was so grateful, he said what do you want in return 
The man asked for rice, such that on each square of the chess board, the number of grains of rice doubled. 
The Emperor thought this was a really small ask for such a great invention. He asked if he didn't want a better gift. 
But the man insisted that what he wanted was rice. 
:::

## ...and rice 

:::: {.columns}

::: {.column width="80%"}

<div style="text-align:center;">
  <img src="L2_images/chessboard_numbers.gif" alt="Exponential rice" style="max-width:75%;">
</div>

:::

::: {.column width="20%"}

::: {.incremental}
- which is more rice than there is on earth... 
:::

:::

::::

::: {.notes}
A bowl of rice is around 4,000 grains. 
:::

## Writing the equation

:::: {.columns}

::: {.column width="50%"}
<div style="text-align: center;">

| x   | y           |
|-----|-------------|
| 1   | 2           |
| 2   | 4           |
| 3   | 8           |
| 4   | 16          |
| 5   | 32          |
| 6   | 64          |
| 7   | 128         |
</div>
:::

::: {.column width="50%"}

```{=tex}
\begin{align}
y = 2^x
\end{align}
```

:::

::::


:::{.notes}
Here we have the base, 2, and the exponent x. 
:::
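
A short sketch of the equation in Python, following the convention of the table above where $x$ is the square number and $y = 2^x$. The function name `grains` is introduced here for illustration.

```python
def grains(x):
    """Grains of rice on square x, using y = 2^x."""
    return 2**x

# Reproduce the first rows of the table.
table = {x: grains(x) for x in range(1, 8)}
for x, y in table.items():
    print(x, y)

# A chessboard has 64 squares.
grains_on_last_square = grains(64)
print(f"{grains_on_last_square:,}")  # 18,446,744,073,709,551,616
```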

## What does this look like on a graph?

<div style="text-align:center;">
  <img src="L2_images/discrete_exponential_chess.gif" alt="Exponential grains of rice." style="max-width:90%;">
</div>


<!-- ## At it like rabbits 

:::: {.columns}

::: {.column width="50%"}

::: {.incremental}
- Each Sunday you go for a walk in your local park. 
- At first you notice two rabbits.
- The next time there's four. 
- Then eight.
- And you're wondering, 
- How many rabbits will there be in a year? 
:::

:::

::: {.column width="50%"}

<br>

![](L2_images/rabbits.jpeg)

:::

::::

## Counting rabbits 

<div style="text-align:center;">
  <img src="L2_images/discrete_exponential.gif" alt="Exponential rabbit populations" style="max-width:90%;">
</div>

 -->

## The exponential function

The (natural) exponential function is:

```{=tex}
\begin{align}
y=e^x
\end{align}
```

$e$ here is Euler's number - a mathematical constant. 
```{=tex}
\begin{align}
e \approx 2.718... 
\end{align}
```

:::{.notes}
It's a mathematical constant similar to pi - it's just a fixed number which we have a name for. 
Like pi comes from geometry (circles etc), e comes from the limit of the equation of compound interest.
:::

## When can I feed my town?

Allows us to answer questions like: 

::: {.incremental}
- "how many grains of rice at t=10?"
- “when will I have enough rice to feed my entire town?” 
:::

This is easier said than done – the best way is to invert the equation.

## Inverse operations 
[*A mathematical operation that reverses the effect of another.*]{style="color:#49a0c4"} 

Subtraction is the inverse of addition. 
```{=tex}
\begin{align}
2 + x=5 \implies 5-2=x
\end{align}
```
Division is the inverse of multiplication.
```{=tex}
\begin{align}
2 \times x =6 \implies 6 \div 2=x
\end{align}
```

## Logarithms 

Taking the logarithm is the inverse of taking the exponential. 

. . .

```{=tex}
\begin{align}
2^3 = 8 \implies \log_2(8) =3
\end{align}
```

. . .

More generally: 
```{=tex}
\begin{align}
a^x = b \implies \log_a(b) =x
\end{align}
```

. . .

For the natural logarithm:
```{=tex}
\begin{align}
e^x = b \implies \log_e(b) = \ln(b) = x
\end{align}
```

<!-- [*Note $ \log_e = \ln $*]{style="color:#49a0c4"} -->

::: {.notes}
Multiply and divide are inverse operations of one another (they reverse the process). 

Read the equation as log of 8 base 2 equals 3. 
:::
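
A sketch of how logarithms let us invert $y = 2^x$ and answer the "feed my town" question. The town size of 10,000 people is a made-up number for illustration, and the figure of roughly 4,000 grains per bowl is only a rough estimate.

```python
import math

# Logarithm as the inverse of the exponential: 2^3 = 8, so log2(8) = 3.
print(math.log2(8))           # 3.0
print(math.log(math.e ** 2))  # natural logarithm, approximately 2

# When will there be enough rice for one bowl each?
grains_per_bowl = 4_000   # rough estimate of grains in a bowl of rice
town_population = 10_000  # hypothetical town size
grains_needed = grains_per_bowl * town_population

# Solve 2^x = grains_needed for x, then round up to a whole square.
x_exact = math.log2(grains_needed)
first_square = math.ceil(x_exact)

print(f"x = {x_exact:.2f}, so square {first_square} is the first with enough rice")
```

Rather than doubling square by square until we pass the target, taking the logarithm gives the answer in one step.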

## Natural logarithm 

<div style="text-align:center;">
  <img src="L2_images/natural_logarithm.png" alt="Natural logarithm" style="max-width:90%;">
</div>

## Log rules 

There are some general rules for how we apply logarithms:

```{=tex}
\begin{align}
\log_a(b \times c) &= \log_a(b) + \log_a(c)
\\ \log_a\left(\frac{b}{c}\right) &= \log_a(b) - \log_a(c)
\\ \log_a(b^c) &= c \times \log_a(b)
\\ \log_a(1)&=0
\\ \log_a(a)&=1
\end{align}
```
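
These rules can be checked numerically. The sketch below uses arbitrary example values for $a$, $b$ and $c$, and compares both sides of each rule up to floating point rounding.

```python
import math

a, b, c = 2.0, 8.0, 32.0  # arbitrary example values

def log(base, value):
    """Logarithm of value in the given base."""
    return math.log(value) / math.log(base)

def close(left, right):
    # Compare two floats, allowing for tiny rounding errors.
    return math.isclose(left, right, rel_tol=1e-9, abs_tol=1e-12)

checks = {
    "product": close(log(a, b * c), log(a, b) + log(a, c)),
    "quotient": close(log(a, b / c), log(a, b) - log(a, c)),
    "power": close(log(a, b**c), c * log(a, b)),
    "log of 1": close(log(a, 1), 0),
    "log of base": close(log(a, a), 1),
}

for rule, holds in checks.items():
    print(rule, holds)
```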

## Transforming data 

Some of the most important rules: 

```{=tex}
\begin{align}
\log_a(a^x) = x
\\ \ln(e^x) = x
\end{align}
```

. . .

When we have exponential data we can take the logarithm of it - and hence simplify it. 
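
To illustrate, the sketch below builds some made-up exponential data, $y = 3e^{0.5t}$, and takes its natural logarithm. By the product rule and $\ln(e^x) = x$ we get $\ln(y) = \ln(3) + 0.5t$, which is a straight line in $t$.

```python
import math

# Hypothetical exponential data: y = 3 * e^(0.5 t), invented for illustration.
t = list(range(8))
y = [3 * math.exp(0.5 * ti) for ti in t]

# Log transform: ln(y) = ln(3) + 0.5 t
log_y = [math.log(v) for v in y]

# On the log scale each step in t adds the same amount (the growth rate 0.5).
steps = [log_y[i + 1] - log_y[i] for i in range(len(log_y) - 1)]
print([round(s, 3) for s in steps])
```

A curve that grows explosively on the original scale becomes a straight line after the transform, which is much easier to plot and to model.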

# Overview 
We've covered: 

- Representative data 
- Normal distribution
- Binomial distribution 
- Poisson distribution 
- Exponentials 
- Logarithms 


# Practical 
The practical will focus on plotting and generating summary statistics for data following a range of probability distributions. 

. . .

Have questions prepared!