# Descriptive Statistics

In [19]:
import math
import pandas as pd

## Measures of central tendency
### Mean
$$\mu = \frac{\sum  x_{i}}{N}$$

In [20]:
def mean(*args):
    val_sum = sum(args)
    mean_val = val_sum / len(args)
    return mean_val
print(f"Mean: {mean(1, 2, 3, 4, 5)}")

Mean: 3.0


### Median
The median of a dataset is defined as the value that separates the higher half from the lower half of the data. 

- If the number of data points $n$ is **odd**, the median is the middle value: <br>
  $$\text{Median} = x_{\left(\frac{n+1}{2}\right)} $$

- If the number of data points $n$ is **even**, the median is the average of the two middle values: <br>

  $$\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}}{2}$$
  

In [21]:
def median(*args):
    values = sorted(args)
    n = len(values)
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values (0-based indices mid - 1 and mid)
        return (values[mid - 1] + values[mid]) / 2
    # Odd count: the single middle value
    return values[mid]
print(f"Median: {median(1, 2, 3, 4, 5, 6)}")

Median: 3.5
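As a cross-check on the odd and even cases defined above, the standard library's `statistics.median` can be applied to small inputs. This is a minimal sketch; the sample values are illustrative and not taken from the notebook.

```python
import statistics

# Even number of points: sorted values are 1, 3, 5, 7, so the
# median is the average of the two middle values (3 and 5).
even_data = [7, 1, 3, 5]
# Odd number of points: sorted values are 1, 2, 3, 4, 5, so the
# median is the single middle value.
odd_data = [5, 1, 4, 2, 3]

median_even = statistics.median(even_data)
median_odd = statistics.median(odd_data)

print(f"Median (even n): {median_even}")
print(f"Median (odd n): {median_odd}")
```

Unsorted input is used on purpose: the definition applies to the ordered data, so any implementation has to sort first.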


### Mode

The mode of a dataset is the value that appears **most frequently**. 

- For a given dataset $ x_1, x_2, \dots, x_n $, the mode is the value $ x $ that occurs with the highest frequency. In other words:
  
  $$
  \text{Mode} = \underset{x}{\text{argmax}} \ f(x)
  $$

  where $ f(x) $ represents the frequency of value $ x $ in the dataset.
  
- A dataset can be:
  - **Unimodal**: If there is only one mode.
  - **Bimodal**: If there are two modes.
  - **Multimodal**: If there are more than two modes.

In [22]:
def mode(*args):
    # Count how many times values show up in the list and put it in a dictionary
    dict_vals = {i: args.count(i) for i in args}
    # Create a list of keys that have the maximum number of occurrence in the list
    max_list = [k for k, v in dict_vals.items() if v == max(dict_vals.values())]
    return max_list
print(f"Mode: {mode(1, 2, 3, 4, 5, 5, 4)}")

Mode: [4, 5]
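The same argmax-over-frequencies idea can also be sketched with `collections.Counter`, which counts every value in a single pass. The helper name `modes` is hypothetical, and the sketch assumes a non-empty input.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)      # value -> number of occurrences
    top = max(counts.values())    # highest frequency, computed once
    return [value for value, count in counts.items() if count == top]

bimodal = modes([1, 2, 3, 4, 5, 5, 4])
unimodal = modes([1, 2, 2, 3])

print(f"Bimodal example: {bimodal}")
print(f"Unimodal example: {unimodal}")
```

Returning a list keeps the unimodal, bimodal, and multimodal cases under one interface.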


## Measures of variability
### Variance

Variance measures how far the values in a dataset spread out from the mean (average).  It is the average of the squared differences from the mean.

- For a dataset $x_1, x_2, \dots, x_n$, the variance $\sigma^2$ is given by:

  $$
  \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
  $$

  where:
  - $n$ is the number of data points,
  - $x_i$ is the $i$-th data point,
  - $\mu$ is the mean of the dataset.

- For a sample variance (used when data is a sample from a larger population), the formula is:

  $$
  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
  $$

  where:
  - $\bar{x}$ is the sample mean,
  - $n - 1$ is used instead of $n$ for an unbiased estimate of population variance.

In [23]:
def variance(*args):
    # Sample variance: divide by n - 1 rather than n
    mean_val = mean(*args)
    numerator = 0
    for i in args:
        numerator += (i - mean_val) ** 2
    denominator = len(args) - 1
    return numerator / denominator

print(f"Variance: {variance(4, 6, 3, 5, 2)}")

Variance: 2.5
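The population and sample formulas differ only in the divisor, so both can be sketched with one helper. The function name `variance_with_divisor` and its `sample` flag are hypothetical, introduced here only for illustration.

```python
def variance_with_divisor(values, sample=True):
    """Variance with divisor n - 1 (sample) or n (population)."""
    n = len(values)
    mean_val = sum(values) / n
    squared_diffs = sum((x - mean_val) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return squared_diffs / divisor

data = [4, 6, 3, 5, 2]
population_var = variance_with_divisor(data, sample=False)
sample_var = variance_with_divisor(data, sample=True)

print(f"Population variance: {population_var}")  # 10 / 5
print(f"Sample variance: {sample_var}")          # 10 / 4
```

For this data the squared differences sum to 10, so the population variance is 2.0 and the sample variance is 2.5. The gap between the two shrinks as $n$ grows.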


### Standard Deviation

The standard deviation is the square root of the variance. It measures the typical distance of the data points from the mean.
- Because the deviations are squared, the variance (and therefore the standard deviation) gives extra weight to outliers.
- It is a more interpretable measure of spread than the variance because it is expressed in the same units as the data.
- A larger standard deviation means a larger spread.

- For a population standard deviation:

  $$
  \sigma = \sqrt{\sigma^2}
  $$

  where $\sigma^2$ is the variance.

- For a sample standard deviation:

  $$
  s = \sqrt{s^2}
  $$

  where $s^2$ is the sample variance.

In [24]:
def standard_deviation(*args):
    return math.sqrt(variance(*args))

print(f"Standard Deviation: {standard_deviation(4, 6, 3, 5, 2)}")

Standard Deviation: 1.5811388300841898
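To illustrate the extra weight that squaring gives to outliers, the sketch below compares how much the standard deviation grows when one extreme value is added with how much the mean absolute deviation grows. The mean absolute deviation is used here only as a comparison point, and the helper names and the added value `20` are assumptions for illustration.

```python
import math

def mean_abs_deviation(values):
    """Average absolute distance from the mean (no squaring)."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def population_sd(values):
    """Square root of the population variance."""
    m = sum(values) / len(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

base = [4, 6, 3, 5, 2]
with_outlier = base + [20]  # one extreme value added

# Growth factor of each measure once the outlier is included
sd_growth = population_sd(with_outlier) / population_sd(base)
mad_growth = mean_abs_deviation(with_outlier) / mean_abs_deviation(base)

print(f"Standard deviation grows by a factor of {sd_growth:.2f}")
print(f"Mean absolute deviation grows by a factor of {mad_growth:.2f}")
```

The standard deviation grows by the larger factor, because the outlier's distance from the mean is squared before averaging.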


### Coefficient of Variation
The Coefficient of Variation (CV) measures the relative variability of a dataset by comparing the standard deviation to the mean. It is often expressed as a percentage, and because it is unitless it allows comparison across datasets measured in different units.

$$
\text{CV} = \frac{\sigma}{\mu} \times 100\%
$$

Where:
- $\sigma$ = standard deviation
- $\mu$ = mean of the dataset

In [25]:
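def coefficient_variation(*args):
    # Returns the CV as a ratio (multiply by 100 for a percentage);
    # uses the sample standard deviation defined above
    return standard_deviation(*args) / mean(*args)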

print(f"CV (miles): {coefficient_variation(3, 4, 4.5, 3.5)}")
print(f"CV (kms): {coefficient_variation(4.828, 6.437, 7.242, 5.632)}")

CV (miles): 0.17213259316477408
CV (kms): 0.17214686292344047
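The two results above differ slightly only because the kilometre values were rounded. A minimal sketch, assuming the exact conversion factor of 1.609344 km per mile and a hypothetical helper `cv_percent` built on the standard library's sample standard deviation, shows that the CV is unchanged by a change of units:

```python
import statistics

KM_PER_MILE = 1.609344  # exact definition of the international mile

def cv_percent(values):
    """Coefficient of variation as a percentage (sample SD / mean * 100)."""
    return statistics.stdev(values) / statistics.mean(values) * 100

miles = [3, 4, 4.5, 3.5]
kms = [m * KM_PER_MILE for m in miles]  # unrounded conversion

cv_miles = cv_percent(miles)
cv_kms = cv_percent(kms)

print(f"CV (miles): {cv_miles:.2f}%")
print(f"CV (kms): {cv_kms:.2f}%")
```

Scaling every value by a constant scales both the standard deviation and the mean by that constant, so the ratio stays the same.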


### Covariance

Covariance is a measure of the degree to which two random variables change together. It indicates the direction of the linear relationship between the variables.

#### Formula

For two random variables $X$ and $Y$, the covariance is calculated as:

$$
\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)
$$

where:
- $n$ is the number of data points,
- $x_i$ and $y_i$ are the individual sample points,
- $\mu_X$ is the mean of variable $X$,
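- $\mu_Y$ is the mean of variable $Y$.

For a sample, the sum is divided by $n - 1$ instead of $n$, as with the sample variance. The `covariance` function below uses this sample form.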

#### Interpretation
- A **positive covariance** indicates that as one variable increases, the other tends to increase as well.
- A **negative covariance** indicates that as one variable increases, the other tends to decrease.
- A covariance close to zero suggests that the variables do not have a linear relationship.

In [26]:
def covariance(*args):
    # Expects a single argument: a list holding the two series, [x_values, y_values]
    x_vals, y_vals = args[0]
    # Both series must have the same number of elements
    if len(x_vals) != len(y_vals):
        raise ValueError("You must have the same number of values in both lists")
    x_mean = mean(*x_vals)
    y_mean = mean(*y_vals)
    numerator = 0
    for x, y in zip(x_vals, y_vals):
        # Accumulate (x_i - x mean) * (y_i - y mean)
        numerator += (x - x_mean) * (y - y_mean)
    # Sample covariance: divide by n - 1
    denominator = len(x_vals) - 1
    return numerator / denominator

market_cap_earnings_arr = [[1532, 1488, 1343, 928, 615], [58, 35, 75, 41, 17]]
print(f"Stock covariance: {covariance(market_cap_earnings_arr)}")
 

Stock covariance: 5803.200000000001
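The sign rules in the interpretation list can be checked on small made-up series. This is a sketch only: the helper `sample_covariance` and the three series are hypothetical and use the $n - 1$ divisor.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length series (divisor n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to increase with hours
falling = [9, 7, 6, 4, 1]   # tends to decrease with hours

cov_rising = sample_covariance(hours, rising)
cov_falling = sample_covariance(hours, falling)

print(f"Covariance with rising series: {cov_rising}")
print(f"Covariance with falling series: {cov_falling}")
```

The first result is positive and the second negative. The magnitudes depend on the units of both series, so they are not directly comparable across datasets.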


### Correlation Coefficients

Correlation measures the strength and direction of a linear relationship between two variables. It is quantified by the correlation coefficient, often represented as $r$.

The formula for Pearson's correlation coefficient is:

$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
$$

Where:
- $x_i$ and $y_i$ are individual sample points.
- $\bar{x}$ is the mean of variable $X$.
- $\bar{y}$ is the mean of variable $Y$.

The value of $r$ ranges from $-1$ to $1$:
- $r = 1$: Perfect positive correlation.
- $r = -1$: Perfect negative correlation.
- $r = 0$: No linear correlation.

In [27]:
def correlation_coefficient(*args):
    # Expects a single argument: a list holding the two series, [x_values, y_values]
    x_vals, y_vals = args[0]
    # Sample standard deviation of each series
    list_1_sd = standard_deviation(*x_vals)
    list_2_sd = standard_deviation(*y_vals)
    print(f"L1 SD : {list_1_sd}")
    print(f"L2 SD : {list_2_sd}")
    denominator = list_1_sd * list_2_sd
    # Sample covariance of the two series
    numerator = covariance(*args)
    return numerator / denominator
    
market_cap_earnings_arr = [[1532, 1488, 1343, 928, 615], [58, 35, 75, 41, 17]]
print(f"Stock correlation coefficient: {correlation_coefficient(market_cap_earnings_arr)}")

L1 SD : 396.2508044155873
L2 SD : 22.185580902919806
Stock correlation coefficient: 0.660125602195931
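The function above divides the sample covariance by the product of the sample standard deviations. The $n - 1$ factors cancel, which leaves the sums-based formula for $r$ given earlier. A minimal sketch that evaluates that formula directly on the same stock data (the helper name `pearson_r` is hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from sums of deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

market_cap = [1532, 1488, 1343, 928, 615]
earnings = [58, 35, 75, 41, 17]

r = pearson_r(market_cap, earnings)
print(f"Pearson r: {r:.4f}")
```

Up to floating-point rounding, the result agrees with the value printed by the notebook cell above.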


### Interquartile Range (IQR)

The Interquartile Range (IQR) is a measure of statistical dispersion that represents the range within which the central 50% of the data points lie. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). The IQR is useful for understanding the spread of the middle half of the data and for identifying outliers: a common rule of thumb flags values below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$ as potential outliers.


$$
\text{IQR} = Q3 - Q1
$$

Where:
- $Q1$ = First Quartile (25th percentile)
- $Q3$ = Third Quartile (75th percentile)

In [28]:
# Creating a sample dataset
data = {
    "Values": [10, 15, 14, 20, 18, 25, 30, 35, 28, 22]
}
df = pd.DataFrame(data)

# Calculate Q1 and Q3
Q1 = df["Values"].quantile(0.25)  # First Quartile (25th percentile)
Q3 = df["Values"].quantile(0.75)  # Third Quartile (75th percentile)

# Calculate IQR
IQR = Q3 - Q1

# Calculate lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Display the results
print("Sample Data:", df["Values"].values)
print("First Quartile (Q1):", Q1)
print("Third Quartile (Q3):", Q3)
print("Interquartile Range (IQR):", IQR)
print("Lower Bound for Outliers:", lower_bound)
print("Upper Bound for Outliers:", upper_bound)

Sample Data: [10 15 14 20 18 25 30 35 28 22]
First Quartile (Q1): 15.75
Third Quartile (Q3): 27.25
Interquartile Range (IQR): 11.5
Lower Bound for Outliers: -1.5
Upper Bound for Outliers: 44.5
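The quartiles above are not always actual data points, because pandas interpolates linearly between neighbouring values by default. A plain-Python sketch of that interpolation, with a hypothetical helper `linear_quantile`, reproduces the same numbers and applies the outlier bounds:

```python
import math

def linear_quantile(values, q):
    """Quantile via linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional 0-based position
    lower = math.floor(pos)
    upper = math.ceil(pos)
    fraction = pos - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

values = [10, 15, 14, 20, 18, 25, 30, 35, 28, 22]

q1 = linear_quantile(values, 0.25)
q3 = linear_quantile(values, 0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are flagged as potential outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Outliers: {outliers}")
```

For Q1 the position is $0.25 \times 9 = 2.25$, a quarter of the way from the third sorted value (15) to the fourth (18), which gives 15.75. No value in this dataset falls outside the bounds, so the outlier list is empty.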
