# Statistics
Statistics is the practice of collecting and analyzing data to discover useful findings or to predict what causes those findings. Probability often plays a large role in statistics, as we use data to estimate how likely an event is to happen. **Descriptive statistics** summarizes data. **Inferential statistics** tries to uncover attributes of a larger population, often based on a sample.

## Populations
A population is a particular group of interest we want to study, such as “all seniors over the age of 65 in North America,” “all golden retrievers in Scotland,” or “current high school sophomores at Los Altos High School.”
Simulations can be useful but are rarely accurate, as they capture only so many variables and have assumptions built in.

## Samples
A sample is a subset of the population that is ideally random and unbiased, which we use to infer attributes about the population. We often have to study samples because polling the entire population is not always possible.

## Bias
If we are going to infer attributes about a population based on a sample, it’s important the sample be as random as possible so we do not skew our conclusions.
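
As a minimal sketch of why randomness matters, the snippet below builds a made-up population of ages (the numbers are purely illustrative, not real data) and compares a random sample with a deliberately biased one that only draws from the oldest members. The variable names are our own.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 ages between 18 and 90 (illustrative only)
population = [random.randint(18, 90) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Random sample: every member has the same chance of being picked
random_sample = random.sample(population, 200)
random_mean = sum(random_sample) / len(random_sample)

# Biased sample: only the 200 oldest members are considered
biased_sample = sorted(population)[-200:]
biased_mean = sum(biased_sample) / len(biased_sample)

print("Population mean:   ", population_mean)
print("Random sample mean:", random_mean)
print("Biased sample mean:", biased_mean)
```

The random sample should land close to the population mean, while the biased sample overstates it badly.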

# Mean
The mean is the average of a set of values. The operation is simple to do: sum the values and divide by the number of values. The mean is useful because it shows where the “center of gravity” exists for an observed set of values.
Calculating the mean in Python:

In [None]:
sample = [1, 3, 2, 5, 7, 0, 2, 3]

In [None]:
mean = sum(sample) / len(sample)

In [None]:
print(mean)

2.875


The mean we commonly use is actually a special case of a weighted average called the weighted mean, one that gives equal importance to each value. But we can manipulate the mean and give each item a different weight:

In [None]:
# Three exams of .20 weight each and final exam of .40 weight
sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]

In [None]:
weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)

In [None]:
print(weighted_mean)

81.4
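
Because the formula divides by the sum of the weights, the weights do not have to add up to 1. As a small sketch (the `weighted_mean` helper function is our own naming, not part of the original cells), integer weights in the same 1:1:1:2 proportion should give the same result:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Same exam scores; the final exam counts twice as much as each other exam
sample = [90, 80, 63, 87]
integer_weights = [1, 1, 1, 2]

print(weighted_mean(sample, integer_weights))
```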


# Median
The median is the middlemost value in a set of ordered values. You sequentially order the values, and the median will be the centermost value. When there is an even number of values, the median is the average of the two centermost values. The median can be a helpful alternative to the mean when data is skewed by outliers, or values that are extremely large or small compared to the rest of the values. Calculating the median in Python:

In [None]:
# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

In [None]:
def median(values):
  ordered = sorted(values)
  print(ordered)  # show the sorted values
  n = len(ordered)
  mid = n // 2
  if n % 2 == 0:
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2.0
  else:
    # odd count: take the single middle value
    return ordered[mid]

In [None]:
print(median(sample))

[0, 1, 5, 7, 9, 10, 14]
7
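
To see why the median can be a better summary when outliers are present, here is a small sketch using the standard library's `statistics` module. The extra value of 100 pets is a made-up outlier added purely for illustration.

```python
import statistics

# Number of pets each person owns, plus one hypothetical extreme value
sample = [0, 1, 5, 7, 9, 10, 14]
with_outlier = sample + [100]

print("Mean without / with outlier:  ",
      statistics.mean(sample), statistics.mean(with_outlier))
print("Median without / with outlier:",
      statistics.median(sample), statistics.median(with_outlier))
```

The single outlier nearly triples the mean, while the median only moves from 7 to 8.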


# The Median Is a Quantile
There is a concept of quantiles in descriptive statistics. The concept of quantiles is essentially the same as a median, just cutting the data in other places besides the middle. The median is actually the 50% quantile, or the value where 50% of ordered values are behind it. Then there are the 25%, 50%, and 75% quantiles, which are known as quartiles because they cut data in 25% increments.
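
As a sketch, the quartiles can be computed with `statistics.quantiles` from the standard library (Python 3.8+). It uses its default "exclusive" method here; other tools interpolate differently and may return slightly different cut points, so treat the exact values as method-dependent.

```python
import statistics

# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

# n=4 splits the data into four equally sized groups and
# returns the three cut points (25%, 50%, 75%)
q1, q2, q3 = statistics.quantiles(sample, n=4)

print("Quartiles:", q1, q2, q3)
print("Median:   ", statistics.median(sample))
```

The middle cut point matches the median, which is the 50% quantile.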

# Mode
The mode is the most frequently occurring set of values. It primarily becomes useful when your data is repetitive and you want to find which values occur the most frequently. When no value occurs more than once, there is no mode. When two values occur with an equal amount of frequency, then the dataset is considered bimodal. Calculating the mode in Python:

In [None]:
from collections import defaultdict

In [None]:
# Number of pets each person owns
sample = [1, 3, 2, 5, 7, 0, 2, 3]

In [None]:
def mode(values):
  counts = defaultdict(lambda: 0)

  for s in values:
    counts[s] += 1

  max_count = max(counts.values())
  modes = [v for v in set(values) if counts[v] == max_count]
  return modes

In [None]:
print(mode(sample))

[2, 3]
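
Note that the `mode` function above returns every value when nothing repeats, whereas the text says there is no mode in that case. A possible variant that follows the text more literally is sketched below; the name `modes_or_none` is our own, and returning an empty list for "no mode" is an assumption about the desired behavior.

```python
from collections import Counter

def modes_or_none(values):
    """Return the most frequent values, or [] when nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []  # empty input has no mode
    max_count = max(counts.values())
    if max_count == 1:
        return []  # no value occurs more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == max_count)

print(modes_or_none([1, 3, 2, 5, 7, 0, 2, 3]))  # two values tie: bimodal
print(modes_or_none([4, 8, 15, 16]))            # no repeated values
```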


# Variance
Variance is a measure of how spread out our data is: the average of the squared differences between each value and the mean. Calculating variance in Python:

In [None]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

In [None]:
def variance(values):
  mean = sum(values) / len(values)
  _variance = sum((v - mean) ** 2 for v in values) / len(values)
  return _variance

In [None]:
print(variance(data))

21.387755102040813


The square root of the variance gives us the standard deviation:

In [None]:
from math import sqrt

In [None]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

In [None]:
def variance(values):
  mean = sum(values) / len(values)
  _variance = sum((v - mean) ** 2 for v in values) / len(values)
  return _variance

In [None]:
def std_dev(values):
  return sqrt(variance(values))

In [None]:
print(std_dev(data))

4.624689730353898


When we calculate the variance of a sample rather than of a whole population, we average the squared differences by dividing by n - 1 rather than the total number of items n. We do this to decrease any bias in a sample and not underestimate the variance of the population based on our sample. By using a divisor one short of the number of items, we increase the variance and therefore capture greater uncertainty in our sample. Calculating standard deviation for a sample:

In [None]:
from math import sqrt

In [None]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

In [None]:
def variance(values, is_sample: bool = False):
  mean = sum(values) / len(values)
  _variance = sum((v - mean) ** 2 for v in values) / (len(values) - (1 if is_sample else 0))
  return _variance

In [None]:
def std_dev(values, is_sample: bool = False):
  return sqrt(variance(values, is_sample))

In [None]:
print("VARIANCE = {}".format(variance(data, is_sample=True)))
print("STD DEV = {}".format(std_dev(data, is_sample=True)))

VARIANCE = 24.95238095238095
STD DEV = 4.99523582550223
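
As a sanity check, the hand-written functions can be compared against the standard library's `statistics` module, where `pvariance` divides by n and `variance` / `stdev` divide by n - 1. This is only a cross-check sketch; the `variance` function is repeated here so the block runs on its own.

```python
import math
import statistics

# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

def variance(values, is_sample: bool = False):
    mean = sum(values) / len(values)
    divisor = len(values) - (1 if is_sample else 0)
    return sum((v - mean) ** 2 for v in values) / divisor

# Each line prints the library result next to the hand-written result
print(statistics.pvariance(data), variance(data))
print(statistics.variance(data), variance(data, is_sample=True))
print(statistics.stdev(data), math.sqrt(variance(data, is_sample=True)))
```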


# The Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a symmetrical bell-shaped distribution that has most of its mass around the mean, with a spread defined by its standard deviation.

# Properties of a Normal Distribution
The normal distribution has several important properties that make it useful:
- It’s symmetrical; both sides are identically mirrored at the mean, which is the center.
- Most mass is at the center around the mean.
- It has a spread (being narrow or wide) that is specified by standard deviation.
- The “tails” are the least likely outcomes and approach zero infinitely but never touch zero.
- It resembles a lot of phenomena in nature and daily life, and even generalizes nonnormal problems because of the central limit theorem, which we will talk about shortly.

The normal distribution function in Python:



In [None]:
import math

# Normal distribution PDF, returns the likelihood at x
def normal_pdf(x: float, mean: float, std_dev: float) -> float:
  return (1.0 / (2.0 * math.pi * std_dev ** 2) ** 0.5) * math.exp(-1.0 * ((x - mean) ** 2 / (2.0 * std_dev ** 2)))
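
A quick way to exercise this function is to evaluate it at a few points and compare against `statistics.NormalDist` from the standard library. The mean and standard deviation below are illustrative values assumed for this sketch, and the function is repeated so the block runs on its own.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mean: float, std_dev: float) -> float:
    coefficient = 1.0 / (2.0 * math.pi * std_dev ** 2) ** 0.5
    exponent = -1.0 * ((x - mean) ** 2 / (2.0 * std_dev ** 2))
    return coefficient * math.exp(exponent)

# Illustrative parameters (assumed for this example)
mean, std_dev = 64.43, 2.99

# One standard deviation below the mean, the mean itself, and one above
for x in (mean - std_dev, mean, mean + std_dev):
    ours = normal_pdf(x, mean, std_dev)
    reference = NormalDist(mean, std_dev).pdf(x)
    print(x, ours, reference)
```

The likelihood peaks at the mean and is the same one standard deviation to either side, reflecting the symmetry described above.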


# The Cumulative Distribution Function (CDF)
With the normal distribution, the vertical axis is not the probability but rather the likelihood for the data. To find the probability we need to look at a given range, and then find the area under the curve for that range.
The normal distribution CDF in Python:

In [None]:
from scipy.stats import norm

In [None]:
mean = 64.43
std_dev = 2.99

In [None]:
x = norm.cdf(64.43, mean, std_dev)

In [None]:
print(x)

0.5


This area of .5 or 50% up to the mean is known because of the symmetry of our normal distribution, and we can expect the other side of the bell curve to also have 50% of the area.
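
To get the probability for a range rather than "everything up to x", subtract one CDF value from another. The sketch below uses `statistics.NormalDist` from the standard library instead of SciPy so it runs without extra dependencies; the bounds 62 and 66 are chosen only for illustration.

```python
from statistics import NormalDist

# Same mean and standard deviation as the CDF example
dist = NormalDist(mu=64.43, sigma=2.99)

# P(62 <= X <= 66) is the area up to 66 minus the area up to 62
lower, upper = 62, 66
prob_in_range = dist.cdf(upper) - dist.cdf(lower)

print(prob_in_range)
```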

It is common to rescale a normal distribution so that the mean is 0 and the standard deviation is 1, which is known as the standard normal distribution. This makes it easy to compare the spread of one normal distribution to another normal distribution, even if they have different means and variances.
Of particular importance is that the standard normal distribution expresses all x-values in terms of standard deviations, known as Z-scores. Turning an x-value into a Z-score uses a basic scaling formula:
z = (x − μ) / σ
Turn Z-scores into x-values and vice versa:

In [None]:
def z_score(x, mean, std):
  return (x - mean) / std


In [None]:
def z_to_x(z, mean, std):
  return (z * std) + mean

In [None]:
mean = 140000
std_dev = 3000
x = 150000

In [None]:
# Convert to Z-score and then back to X
z = z_score(x, mean, std_dev)
back_to_x = z_to_x(z, mean, std_dev)

In [None]:
print("Z-Score: {}".format(z))
print("Back to X: {}".format(back_to_x))

Z-Score: 3.3333333333333335
Back to X: 150000.0
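
One consequence of this rescaling is that the area to the left of an x-value under its own normal distribution equals the area to the left of its Z-score under the standard normal distribution. A minimal sketch, again using the standard library's `statistics.NormalDist` rather than SciPy:

```python
from statistics import NormalDist

def z_score(x, mean, std):
    return (x - mean) / std

mean, std_dev, x = 140000, 3000, 150000
z = z_score(x, mean, std_dev)

# Area to the left of x under the original distribution...
p_original = NormalDist(mean, std_dev).cdf(x)
# ...equals the area to the left of z under the standard normal (mean 0, std 1)
p_standard = NormalDist(0, 1).cdf(z)

print(p_original, p_standard)
```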


Exploring the central limit theorem in Python:

In [None]:
import random
import plotly.express as px

In [None]:
sample_size = 31
sample_count = 1000

In [None]:
# Central limit theorem, 1000 samples each with 31 random numbers between 0.0 and 1.0
x_values = [(sum([random.uniform(0.0, 1.0) for i in range(sample_size)]) / sample_size) for _ in range(sample_count)]

In [None]:
y_values = [1 for _ in range(sample_count)]

In [None]:
px.histogram(x=x_values, y=y_values, nbins=20).show()

The central limit theorem states that when we take large enough samples of a population, calculate the mean of each, and plot them as a distribution, the following hold:
1. The mean of the sample means is equal to the population mean.
2. If the population is normal, then the sample means will be normal.
3. If the population is not normal, but the sample size is greater than 30, the sample means will still roughly form a normal distribution.
4. The standard deviation of the sample means equals the population standard deviation divided by the square root of n:


```
standard deviation of sample means = population standard deviation / sqrt(sample size)
```
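
The first and fourth properties can be checked numerically with a small simulation. This is a sketch under stated assumptions: it draws from a uniform(0, 1) population, whose mean is 0.5 and whose standard deviation is sqrt(1/12), and uses a fixed seed so the run is repeatable.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

sample_size = 31
sample_count = 1000

# Each entry is the mean of one sample of 31 uniform random numbers
sample_means = [
    sum(random.uniform(0.0, 1.0) for _ in range(sample_size)) / sample_size
    for _ in range(sample_count)
]

# Known properties of a uniform(0, 1) population
population_mean = 0.5
population_std = math.sqrt(1 / 12)

print("Mean of sample means:    ", statistics.mean(sample_means))
print("Population mean:         ", population_mean)
print("Std dev of sample means: ", statistics.stdev(sample_means))
print("Population std / sqrt(n):", population_std / math.sqrt(sample_size))
```

The two pairs of numbers should agree closely, though not exactly, since the simulation is itself a finite sample.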


A confidence interval is a range calculation showing how confidently we believe a sample mean (or other parameter) falls in a range for the population mean.
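
As a minimal sketch of the calculation for a sample larger than 30, the interval is the sample mean plus or minus a critical z-value times the standard deviation of the sample mean. All input values below are hypothetical, and `statistics.NormalDist` is used in place of SciPy to keep the block dependency-free.

```python
import math
from statistics import NormalDist

# Hypothetical sample summary (illustrative values only)
n = 31              # sample size
sample_mean = 64.4  # mean of the sample
sample_std = 2.05   # standard deviation of the sample
confidence = 0.95

# Critical z-value that leaves (1 - confidence) / 2 in each tail
z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Margin of error: critical value times the standard deviation of the sample mean
margin = z_critical * sample_std / math.sqrt(n)

print("z critical:", z_critical)
print("Interval:  ", (sample_mean - margin, sample_mean + margin))
```
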
The p-value is the probability of seeing a result at least as extreme as the one observed purely by chance, that is, if the hypothesized explanation played no role.
The null hypothesis (H0) says that the variable in question had no impact on the experiment and any positive results are just random luck. The alternative hypothesis (H1) poses that the variable in question (called the controlled variable) is causing a positive result.
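
A p-value under a normal null hypothesis can be sketched as a tail area of the CDF. Everything below is hypothetical: the null distribution's parameters, the observed value, and the conventional 0.05 significance threshold are assumptions made for illustration.

```python
from statistics import NormalDist

# Hypothetical setup: under H0 the statistic is normal with these parameters
null_dist = NormalDist(mu=18, sigma=1.5)
observed = 16        # hypothetical observed value
significance = 0.05  # conventional threshold, assumed here

# One-tailed: probability of a value this low or lower if H0 is true
p_one_tailed = null_dist.cdf(observed)

# Two-tailed: also count equally extreme values on the high side
p_two_tailed = 2 * min(p_one_tailed, 1 - p_one_tailed)

print("One-tailed p-value:", p_one_tailed)
print("Two-tailed p-value:", p_two_tailed)
print("Reject H0 (two-tailed)?", p_two_tailed < significance)
```
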
Let’s briefly address how to deal with smaller samples of 30 or fewer; we will need this when we do linear regression in Chapter 5. Whether we are calculating confidence intervals or doing hypothesis testing, if we have 30 or fewer items in a sample we would opt to use a T-distribution instead of a normal distribution. The T-distribution is like a normal distribution but has fatter tails to reflect more variance and uncertainty.
Getting a critical value range with a T-distribution:

In [20]:
from scipy.stats import t

In [21]:
# get critical value range for 95% confidence with a sample size of 25
n=25
lower = t.ppf(.025, df=n-1)
upper = t.ppf(.975, df=n-1)

In [22]:
print(lower, upper)

-2.063898561628021 2.0638985616280205
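
To illustrate the "fatter tails," the sketch below compares the upper 95% critical value of the T-distribution with that of the normal distribution for a few sample sizes. The sample sizes are chosen only for illustration.

```python
from scipy.stats import norm, t

# Upper critical value for 95% confidence under the normal distribution
z_upper = norm.ppf(0.975)
print("normal:", z_upper)

# The t critical value shrinks toward the normal one as the sample grows
for n in (5, 25, 100):
    t_upper = t.ppf(0.975, df=n - 1)
    print("n =", n, "t:", t_upper)
```

Smaller samples give a wider critical range, which reflects the extra uncertainty described above.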
