In [None]:
# Dependencies

# Standard Dependencies
import os
import numpy as np
import pandas as pd
from math import sqrt

# Visualization
from pylab import *
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from statistics import median
from scipy import signal
from scipy.special import factorial
import scipy.stats as stats
from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr
from scipy.fftpack import fft, fftshift

# Scikit-learn for Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Seed for reproducibility
seed = 12345
np.random.seed(seed)

In [None]:
main = pd.read_csv('/kaggle/input/toy-dataset/toy_dataset.csv')

## Discrete and Continuous Variables

In statistics, we categorize variables into two main types: discrete and continuous.

### Discrete Variables

A **discrete variable** is one that can take only a countable number of distinct values: if you can list the possible values one by one, the variable is discrete. A simple example is the outcome of rolling a six-sided die, which can only be one of six values. A discrete random variable can also have infinitely many possible values, as long as they are countable, such as the natural numbers (1, 2, 3, etc.).

In statistical analysis, we represent the distribution of discrete variables using two key functions:
- **PMF (Probability Mass Function)**: The PMF assigns a probability to each possible value of the discrete random variable. Each probability lies between 0 and 1, and together they sum to 1.

- **CDF (Cumulative Distribution Function)**: The CDF represents the cumulative probability that the random variable X will have an outcome less than or equal to a given value x. The name CDF is used both for discrete and continuous distributions.
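As a small illustration of how the two functions relate, here is a sketch that builds the PMF and CDF of the die example by hand. The fair-die assumption and the names `faces`, `pmf`, and `cdf` are chosen for this example only.

```python
from fractions import Fraction

# Assumption for illustration: a fair six-sided die, each face equally likely.
faces = range(1, 7)
pmf = {face: Fraction(1, 6) for face in faces}

# The CDF at x is the running sum of the PMF over all outcomes <= x.
cdf = {}
running_total = Fraction(0)
for face in faces:
    running_total += pmf[face]
    cdf[face] = running_total

print(pmf[3])  # P(X = 3)  -> 1/6
print(cdf[3])  # P(X <= 3) -> 1/2
print(cdf[6])  # total probability -> 1
```

Exact fractions are used here so that the running sum reaches exactly 1 without floating-point rounding.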

### Continuous Variables

In contrast, a **continuous variable** can take on an "uncountable" number of values within a specific range. A classic example of a continuous variable is length. Length can be measured to any degree of precision, making it continuous.

For continuous variables, we use different probability functions compared to discrete ones:
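- **PDF (Probability Density Function)**: The PDF plays the role of the PMF for continuous variables, but its values are densities rather than probabilities. The probability of any single exact value is zero; the probability that the variable falls in an interval is the area under the PDF over that interval.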

- **CDF (Cumulative Distribution Function)**: Just like for discrete variables, the CDF for continuous variables represents the cumulative probability that the random variable X will have an outcome less than or equal to a given value x.
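To make the link between the PDF and the CDF concrete, the following sketch computes the probability of an interval in two ways: as a difference of CDF values, and as an approximate area under the PDF. It assumes a standard normal variable and uses `statistics.NormalDist` from the Python standard library; the interval and step count are arbitrary illustration values.

```python
from statistics import NormalDist

# Assumption for illustration: X follows a standard normal distribution.
dist = NormalDist(mu=0, sigma=1)
a, b = -1.0, 1.0

# Probability of an interval as a difference of CDF values.
prob_interval = dist.cdf(b) - dist.cdf(a)

# The same probability approximated as area under the PDF
# (midpoint Riemann sum: density * width for many thin slices).
steps = 10_000
width = (b - a) / steps
area = sum(dist.pdf(a + (i + 0.5) * width) * width for i in range(steps))

print(round(prob_interval, 4))  # 0.6827
print(round(area, 4))           # 0.6827
```

Both numbers describe the same quantity, P(-1 ≤ X ≤ 1); the second one is only a numerical approximation of the integral.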

While the mathematical functions that define PMFs, PDFs, and CDFs can appear complex at first, their visual representations are often more intuitive. These functions play a crucial role in statistical analysis and modeling.


## PMF (Probability Mass Function)

The PMF, or **Probability Mass Function**, is a concept used in probability theory and statistics. It describes the probability distribution of a discrete random variable, which means that the variable can only take on a countable set of values.

### Visualizing a PMF

To better understand the PMF, let's visualize it using the example of a binomial distribution. A PMF is defined only at discrete values, typically integers. For instance, a count of successes can be 50 or 51, but never 50.5.

### Binomial Distribution PMF

The PMF of a binomial distribution can be expressed as follows:

```
PMF(x; n, p) = (n choose x) * p^x * (1 - p)^(n - x)
```

- `x` represents the specific value of the random variable.
- `n` is the number of trials or experiments.
- `p` is the probability of success in a single trial.

This formula calculates the probability of getting `x` successful outcomes in `n` independent Bernoulli trials, where each trial has a probability of success equal to `p`.
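The formula can be evaluated directly with the standard library, as in the sketch below. The helper name `binomial_pmf` and the example values (10 trials with success probability 0.5) are assumptions made for illustration; the result should agree with SciPy's `binom.pmf(x, n, p)` up to floating-point error.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Assumed example values: 10 flips of a fair coin.
n_trials, p_success = 10, 0.5

# Probability of exactly 5 successes.
print(binomial_pmf(5, n_trials, p_success))  # 0.24609375

# Summing the PMF over every possible outcome gives total probability 1.
total = sum(binomial_pmf(x, n_trials, p_success) for x in range(n_trials + 1))
print(total)
```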

If you want to learn more about binomial distributions or other probability distributions, you can refer to the "Distributions" section for additional information.

In [None]:
# Set parameters for the binomial distribution
n = 1000
p = 0.1

# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(17, 5))

# Define the range of values for x using percentiles
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))

# Plot the PMF as blue circles
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF')

# Add vertical lines to the plot for better visualization
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)

# Create a frozen random variable for the binomial distribution
rv = binom(n, p)

# Uncomment the next line to plot the frozen PMF if needed
# ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='Frozen PMF')

# Add a legend and set plot title
ax.legend(loc='best', frameon=False, fontsize='xx-large')
plt.title(f'PMF of a binomial distribution (n={n}, p={p})', fontsize='xx-large')

# Show the plot
plt.show()


## PDF (Probability Density Function)

The PDF, or **Probability Density Function**, is a concept used in probability theory and statistics, similar to the PMF but for continuous random variables. Unlike the discrete PMF, the PDF describes the probability distribution of a continuous random variable, which can take on an uncountable number of values within a given range.

### Visualizing a PDF

The height of a PDF at a point is a density, not a probability: probabilities come from the area under the curve over a range of values. To visualize this concept, let's consider a simple example: the normal distribution.

### Normal Distribution PDF

The PDF of a normal distribution with mean (μ) and standard deviation (σ) can be expressed as follows. The standard normal distribution is the special case μ = 0, σ = 1.

```
PDF(x; μ, σ) = (1 / (σ * sqrt(2π))) * exp(-((x - μ)^2) / (2σ^2))
```

- `x` represents a specific value of the random variable.
- `μ` is the mean (average) of the distribution.
- `σ` is the standard deviation, which measures the spread of the distribution.
- `π` is the mathematical constant pi.

This formula calculates the probability density at a given point `x` on the normal distribution curve. It is highest at the mean (`μ`) and decreases as you move away from the mean in either direction.
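As a quick check of the formula, here is a minimal sketch that implements it with the standard library and compares it against `statistics.NormalDist`. The function name `normal_pdf` and the test point 1.5 are illustrative choices, not part of the original notebook.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std dev sigma at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The density peaks at the mean; for the standard normal this is 1/sqrt(2*pi).
peak = normal_pdf(0.0)
print(round(peak, 4))  # 0.3989

# The curve is symmetric around the mean.
print(normal_pdf(-1.5) == normal_pdf(1.5))  # True

# Cross-check against the standard library implementation.
reference = NormalDist(mu=0, sigma=1).pdf(1.5)
print(abs(normal_pdf(1.5) - reference) < 1e-12)  # True
```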

Visualizing the PDF of a normal distribution shows which ranges of values are more or less likely in a continuous dataset: the taller the curve over an interval, the more probability that interval holds.

In [None]:
# Define the parameters for the normal distribution
mu = 0
variance = 1
sigma = np.sqrt(variance)

# Generate values for the x-axis
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)

# Create a figure and axis for the plot
plt.figure(figsize=(16, 5))

# Plot the normal distribution PDF
plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution')

# Add title and legend
plt.title('Normal Distribution with Mean = 0 and Standard Deviation = 1', fontsize='xx-large')
plt.legend(fontsize='xx-large')

# Show the plot
plt.show()


## CDF (Cumulative Distribution Function)

The CDF, or **Cumulative Distribution Function**, is a fundamental concept in probability theory and statistics. It represents the cumulative probability that a random variable X will take a value less than or equal to a specified value x (denoted as P(X ≤ x)).

### Visualizing the CDF

For continuous random variables, the CDF accumulates probability as you move from left to right along the range of values. It never decreases and is bounded between 0 and 1, approaching 0 far to the left and 1 far to the right.

### Normal Distribution CDF

The CDF of a normal distribution with mean (μ) and standard deviation (σ) is a crucial example. The formula for the CDF of a normal distribution is:

```
CDF(x; μ, σ) = 0.5 * [1 + erf((x - μ) / (σ * sqrt(2)))]
```

- `x` represents the specific value for which you want to calculate the cumulative probability.
- `μ` is the mean (average) of the normal distribution.
- `σ` is the standard deviation, which measures the spread of the distribution.
- `erf` is the error function, defined as `erf(z) = (2 / sqrt(π)) * integral from 0 to z of exp(-t^2) dt`. It has no closed form in elementary functions and is evaluated numerically.

Visualizing the CDF of a normal distribution helps us understand how the probability accumulates as we move along the distribution curve. It's a valuable tool for various statistical analyses and hypothesis testing.
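The CDF formula can be evaluated with `math.erf` from the standard library, as sketched below. The helper name `normal_cdf` and the sample points are assumptions for illustration; SciPy's `stats.norm.cdf` should return the same values up to floating-point error.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Half of the probability lies below the mean.
print(normal_cdf(0.0))  # 0.5

# About 97.5% of a standard normal lies below 1.96.
print(round(normal_cdf(1.96), 3))  # 0.975

# The CDF never decreases as x grows.
points = [-3, -1, 0, 1, 3]
values = [normal_cdf(x) for x in points]
print(values == sorted(values))  # True
```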

In [None]:
# Generate an x grid and an unnormalized Gaussian curve.
# Note: exp(-x^2) is proportional to a normal PDF with mean 0 and
# standard deviation 1/sqrt(2), truncated here to the range [-2, 2).
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)

# Normalize so the area under the curve (Riemann sum) equals 1
Y = Y / (dx * Y).sum()

# Create a figure for the plot
plt.figure(figsize=(15, 5))

# Set the title
plt.title('Normal Distribution: PDF and CDF', fontsize='xx-large')

# Plot the Probability Density Function (PDF)
plt.plot(X, Y, label='Probability Density Function (PDF)')

# Approximate the CDF as the running sum of density * step width
plt.plot(X, np.cumsum(Y * dx), 'r', label='Cumulative Distribution Function (CDF)')

# Add a legend
plt.legend(fontsize='xx-large')

# Show the plot
plt.show()
