# Probability Theory Fundamentals

Welcome to the lecture on Probability Theory Fundamentals! In this lecture, we will dive into the essential concepts that form the backbone of probabilistic modeling in machine learning. Probability theory provides a mathematical framework for quantifying uncertainty and making informed decisions based on available information.


Understanding probability theory is crucial for working with machine learning models, as it allows us to:
- Represent and reason about uncertainty in data and model predictions
- Make probabilistic inferences and decisions
- Incorporate prior knowledge and update beliefs based on observed evidence


In this lecture, we will cover the following key topics:
- Random variables and probability distributions
- Discrete and continuous random variables
- Probability mass functions (PMF) and probability density functions (PDF)
- Cumulative distribution functions (CDF)
- Joint, marginal, and conditional probability
- Independence and conditional independence


We will explore each of these concepts in detail, using clear explanations and illustrative examples to help you grasp the fundamentals of probability theory.


For instance, let's consider a simple example of flipping a fair coin. The outcome of a coin flip can be represented as a random variable with two possible values: heads (H) or tails (T). We can assign probabilities to these outcomes, such as P(H) = 0.5 and P(T) = 0.5, indicating that both outcomes are equally likely.


Throughout this lecture, we will build upon such examples to illustrate the concepts of probability distributions, joint and conditional probabilities, and independence.


By the end of this lecture, you will have a solid understanding of the fundamental concepts of probability theory and how they relate to machine learning. You will be equipped with the necessary knowledge to work with probabilistic models and make informed decisions under uncertainty.


**Table of contents**<a id='toc0_'></a>    
- [Random Variables and Probability Distributions](#toc1_)    
  - [Random Variables](#toc1_1_)    
  - [Probability Distributions](#toc1_2_)    
  - [Cumulative Distribution Functions (CDF)](#toc1_3_)    
- [Discrete and Continuous Random Variables](#toc2_)    
  - [Discrete Random Variables](#toc2_1_)    
  - [Continuous Random Variables](#toc2_2_)    
- [Probability Mass Functions (PMF) and Probability Density Functions (PDF)](#toc3_)    
  - [Probability Mass Functions (PMF)](#toc3_1_)    
  - [Probability Density Functions (PDF)](#toc3_2_)    
- [Cumulative Distribution Functions (CDF)](#toc4_)    
  - [Definition of CDF](#toc4_1_)    
  - [Properties of CDF](#toc4_2_)    
  - [CDF for Discrete Random Variables](#toc4_3_)    
  - [CDF for Continuous Random Variables](#toc4_4_)    
- [Joint, Marginal, and Conditional Probability](#toc5_)    
  - [Joint Probability](#toc5_1_)    
  - [Marginal Probability](#toc5_2_)    
  - [Conditional Probability](#toc5_3_)    
- [Independence and Conditional Independence](#toc6_)    
  - [Independence](#toc6_1_)    
  - [Conditional Independence](#toc6_2_)    
- [Summary and Further Resources](#toc7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Random Variables and Probability Distributions](#toc0_)

In this section, we will introduce the concepts of **random variables** and **probability distributions**, which are fundamental building blocks of probability theory.


### <a id='toc1_1_'></a>[Random Variables](#toc0_)


A random variable is a variable whose value is subject to chance or uncertainty. Formally, it is a function that assigns a number to each possible outcome of a random experiment or process. Random variables are typically denoted by capital letters, such as *X*, *Y*, or *Z*.


There are two types of random variables:
1. **Discrete random variables**: These variables can only take on a countable number of distinct values. Examples include:
   - The *number of heads* obtained when flipping a coin three times (possible values: 0, 1, 2, 3)
   - The *outcome of rolling a six-sided die* (possible values: 1, 2, 3, 4, 5, 6)

2. **Continuous random variables**: These variables can take on any value within a specified range or interval. Examples include:
   - The *height of a randomly selected person* (possible values: any positive real number)
   - The *time it takes for a chemical reaction to complete* (possible values: any non-negative real number)


### <a id='toc1_2_'></a>[Probability Distributions](#toc0_)


A **probability distribution** is a mathematical function that describes the likelihood of different outcomes or values of a random variable. It assigns probabilities to the possible values that the random variable can take.


For discrete random variables, the probability distribution is specified by the **probability mass function (PMF)**. The PMF, denoted as *P(X = x)*, gives the probability that the random variable *X* takes on a specific value *x*.


**Example**: Let *X* be the number of heads obtained when flipping a fair coin twice. The PMF of *X* can be defined as:
- *P(X = 0) = 0.25* (probability of getting no heads)
- *P(X = 1) = 0.50* (probability of getting one head)
- *P(X = 2) = 0.25* (probability of getting two heads)
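The PMF above can be reproduced by listing every equally likely outcome and counting heads. The following is a minimal sketch using only the Python standard library; the variable names (`outcomes`, `head_counts`, `pmf`) are illustrative choices, not part of the lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes produce each number of heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/4, so P(X = k) = (outcomes with k heads) / 4.
pmf = {k: Fraction(n, len(outcomes)) for k, n in sorted(head_counts.items())}

for k, p in pmf.items():
    print(f"P(X = {k}) = {p}")
# P(X = 0) = 1/4
# P(X = 1) = 1/2
# P(X = 2) = 1/4
```

Exact fractions are used instead of floats so that the probabilities sum to exactly 1.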


For continuous random variables, the probability distribution is specified by the **probability density function (PDF)**. The PDF, denoted as *f(x)*, describes the relative likelihood of the random variable taking on a particular value.


**Example**: Let *X* be the time (in minutes) it takes for a student to complete a test. The PDF of *X* might be a normal distribution with a mean of 60 minutes and a standard deviation of 10 minutes, indicating that most students complete the test around the 60-minute mark, with some variability.


*It's important to note that for continuous random variables, the probability of the variable taking on any specific value is zero. Instead, we consider the probability of the variable falling within a certain range of values, which is determined by integrating the PDF over that range.*
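As a rough illustration of this point, the sketch below uses the normal distribution from the test-completion example (mean 60, standard deviation 10). For a normal distribution, the integral of the PDF can be written with the error function, so no numerical integration is needed here. The helper name `normal_cdf` and the interval from 50 to 70 minutes are assumptions chosen for illustration.

```python
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

mean, std = 60.0, 10.0  # values from the test-completion example

# Probability of finishing between 50 and 70 minutes (an interval chosen for illustration).
p_interval = normal_cdf(70.0, mean, std) - normal_cdf(50.0, mean, std)
print(f"P(50 <= X <= 70) = {p_interval:.4f}")  # 0.6827

# A single point is an interval of width zero, so its probability is zero.
p_point = normal_cdf(60.0, mean, std) - normal_cdf(60.0, mean, std)
print(f"P(X = 60) = {p_point}")  # 0.0
```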


### <a id='toc1_3_'></a>[Cumulative Distribution Functions (CDF)](#toc0_)


The **cumulative distribution function (CDF)** is another way to describe the probability distribution of a random variable. The CDF, denoted as *F(x)*, gives the probability that the random variable *X* takes on a value less than or equal to *x*.


For discrete random variables, the CDF is calculated by summing up the probabilities of all values less than or equal to *x*:
$F(x) = P(X \leq x) = \sum_{k \leq x} P(X = k)$


For continuous random variables, the CDF is obtained by integrating the PDF from negative infinity to *x*:
$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt$


The CDF is a non-decreasing function that approaches 0 as *x* decreases toward negative infinity and approaches 1 as *x* increases toward positive infinity.


Understanding random variables and probability distributions is essential for working with probabilistic models in machine learning. They provide a way to quantify uncertainty and make probabilistic statements about the outcomes of random processes.


In the next section, we will take a closer look at the two types of random variables: discrete and continuous.

## <a id='toc2_'></a>[Discrete and Continuous Random Variables](#toc0_)

In this section, we will delve deeper into the two types of random variables: **discrete** and **continuous**. Understanding the differences between these types of random variables is crucial for modeling and analyzing various phenomena in machine learning.


### <a id='toc2_1_'></a>[Discrete Random Variables](#toc0_)


A **discrete random variable** is a variable that can only take on a *countable* number of distinct values. These values are often integers or specific categories. The probability distribution of a discrete random variable is described by the **probability mass function (PMF)**.


Key characteristics of discrete random variables:
- The possible values are *countable* and often *finite*.
- Each possible value has a *specific probability* associated with it.
- The probabilities of all possible values **sum up to 1**.


Examples of discrete random variables:
1. The *number of defective items* in a batch of 100 products.
   - Possible values: 0, 1, 2, ..., 100
   - PMF: **P(X = k)** = probability of having exactly *k* defective items

2. The *outcome of a single die roll*.
   - Possible values: 1, 2, 3, 4, 5, 6
   - PMF: **P(X = k)** = 1/6 for each possible value *k* (assuming a fair die)

3. The *number of customers arriving* at a store in a given hour.
   - Possible values: 0, 1, 2, 3, ...
   - PMF: **P(X = k)** = probability of having exactly *k* customers arrive in an hour


Discrete random variables are commonly used to model **counts**, **categories**, or **outcomes** of experiments with a finite or countably infinite number of possibilities.


### <a id='toc2_2_'></a>[Continuous Random Variables](#toc0_)


A **continuous random variable** is a variable that can take on *any value* within a specified range or interval. The probability distribution of a continuous random variable is described by the **probability density function (PDF)**.


Key characteristics of continuous random variables:
- The possible values are *uncountable* and often span a *continuous range*.
- The probability of the variable taking on any *specific value* is **zero**.
- The probability of the variable falling within a certain range is determined by **integrating** the PDF over that range.


Examples of continuous random variables:
1. The *height of a randomly selected adult*.
   - Possible values: any positive real number (e.g., 1.65 meters, 1.78 meters)
   - PDF: **f(x)** = probability density at height *x*

2. The *time it takes for a machine to process a task*.
   - Possible values: any non-negative real number (e.g., 2.5 seconds, 4.8 seconds)
   - PDF: **f(x)** = probability density at time *x*

3. The *weight of a randomly chosen fruit* from a basket.
   - Possible values: any positive real number (e.g., 0.3 kilograms, 0.5 kilograms)
   - PDF: **f(x)** = probability density at weight *x*


Continuous random variables are often used to model **measurements**, **durations**, or **quantities** that can take on any value within a given range.


It's important to note that while the probability of a continuous random variable taking on a specific value is **zero**, we can still calculate the probability of the variable falling within a certain range by **integrating** the PDF over that range.


For example, if *X* is a continuous random variable representing the height of a person, we can calculate the probability of a person's height being between 1.6 and 1.8 meters by integrating the PDF of *X* over that range:
$P(1.6 \leq X \leq 1.8) = \int_{1.6}^{1.8} f(x) dx$
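The lecture does not specify a PDF for height, so the sketch below assumes, purely for illustration, a normal distribution with mean 1.70 m and standard deviation 0.08 m. It approximates the integral with the midpoint rule; the helper names `normal_pdf` and `integrate_midpoint` are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def integrate_midpoint(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

# Assumed for illustration only: heights ~ Normal(mean = 1.70 m, std = 0.08 m).
mean, std = 1.70, 0.08

p_height = integrate_midpoint(lambda x: normal_pdf(x, mean, std), 1.6, 1.8)
print(f"P(1.6 <= X <= 1.8) is approximately {p_height:.4f}")
```

With a different assumed distribution the number changes, but the procedure stays the same: the probability is the area under the PDF between the two bounds.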


Understanding the distinction between discrete and continuous random variables is essential for selecting appropriate probability distributions and applying the correct mathematical techniques when working with probabilistic models in machine learning.


In the next section, we will explore probability mass functions (PMF) and probability density functions (PDF) in more detail.

## <a id='toc3_'></a>[Probability Mass Functions (PMF) and Probability Density Functions (PDF)](#toc0_)

In this section, we will explore two important concepts in probability theory: **probability mass functions (PMF)** for discrete random variables and **probability density functions (PDF)** for continuous random variables. These functions provide a way to describe the probability distribution of a random variable.


### <a id='toc3_1_'></a>[Probability Mass Functions (PMF)](#toc0_)


A **probability mass function (PMF)** is a function that maps each possible value of a *discrete random variable* to its probability of occurrence. The PMF, denoted as **P(X = x)**, gives the probability that the random variable *X* takes on a specific value *x*.


Key properties of a PMF:
1. The PMF is defined only for the *possible values* of the discrete random variable.
2. The PMF returns a probability value between 0 and 1 (inclusive) for each possible value.
3. The **sum of probabilities** for all possible values of the random variable is equal to 1.


Mathematically, a PMF satisfies the following conditions:
- $0 \leq P(X = x) \leq 1$ for all possible values of *x*
- $\sum P(X = x) = 1$, where the sum is taken over all possible values of *x*


Example: Let *X* be a discrete random variable representing the number of heads obtained when flipping a fair coin three times. The PMF of *X* can be defined as:
- P(X = 0) = 1/8 (probability of getting no heads)
- P(X = 1) = 3/8 (probability of getting one head)
- P(X = 2) = 3/8 (probability of getting two heads)
- P(X = 3) = 1/8 (probability of getting three heads)
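These values follow the binomial formula $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ with $n = 3$ and $p = 1/2$. The sketch below evaluates it and checks both PMF conditions; the function name `binomial_pmf` is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) when X counts successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, Fraction(1, 2)  # three flips of a fair coin
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob}")
# P(X = 0) = 1/8
# P(X = 1) = 3/8
# P(X = 2) = 3/8
# P(X = 3) = 1/8

# The two PMF conditions: every value lies in [0, 1] and the values sum to 1.
print(all(0 <= prob <= 1 for prob in pmf.values()), sum(pmf.values()) == 1)  # True True
```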


### <a id='toc3_2_'></a>[Probability Density Functions (PDF)](#toc0_)


A **probability density function (PDF)** is a function that describes the relative likelihood of a *continuous random variable* taking on a specific value. The PDF, denoted as **f(x)**, does not directly give the probability of the random variable taking on a specific value, but rather the *probability density* at that value.


Key properties of a PDF:
1. The PDF is defined over the *entire range* of the continuous random variable.
2. The PDF can take on any non-negative value, including values greater than 1.
3. The **area under the PDF curve** over the entire range of the random variable is equal to 1.


Mathematically, a PDF satisfies the following conditions:
- $f(x) \geq 0$ for all values of *x*
- $\int f(x) dx = 1$, where the integral is taken over the entire range of *x*


To calculate the probability of a continuous random variable falling within a specific range, we need to integrate the PDF over that range:
$P(a \leq X \leq b) = \int_a^b f(x) dx$


Example: Let *X* be a continuous random variable representing the time (in minutes) it takes for a student to complete a test. The PDF of *X* might be a normal distribution with a mean of 60 minutes and a standard deviation of 10 minutes. The PDF, denoted as *f(x)*, can be expressed as:
$f(x) = \frac{1}{10\sqrt{2\pi}} e^{-\frac{(x-60)^2}{2\cdot10^2}}$
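To make the formula concrete, the sketch below evaluates it directly and compares the result with `statistics.NormalDist` from the Python standard library. It also shows that a density can exceed 1: the second distribution, with standard deviation 0.1, is a hypothetical example chosen only to demonstrate this.

```python
import math
from statistics import NormalDist

def completion_time_pdf(x):
    """The density written out above: normal with mean 60 and standard deviation 10."""
    return math.exp(-((x - 60.0) ** 2) / (2.0 * 10.0**2)) / (10.0 * math.sqrt(2.0 * math.pi))

completion_time = NormalDist(mu=60.0, sigma=10.0)

# The hand-written formula and the library agree at the mean (both about 0.0399).
print(completion_time_pdf(60.0), completion_time.pdf(60.0))

# A density is not a probability: with a small standard deviation it exceeds 1.
narrow = NormalDist(mu=0.0, sigma=0.1)  # hypothetical distribution for illustration
print(narrow.pdf(0.0))  # about 3.99

# Probabilities obtained from the density still lie between 0 and 1.
p_narrow = narrow.cdf(0.1) - narrow.cdf(-0.1)
print(p_narrow)
```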


While the PDF does not directly give probabilities, it allows us to calculate the probability of the random variable falling within a certain range by integrating the PDF over that range.


Understanding PMFs and PDFs is crucial for working with discrete and continuous random variables, respectively. They provide a way to quantify the probability distribution and make probabilistic statements about the random variable.


In the next section, we will discuss cumulative distribution functions (CDF), which offer another perspective on describing the probability distribution of a random variable.

## <a id='toc4_'></a>[Cumulative Distribution Functions (CDF)](#toc0_)

In this section, we will discuss **cumulative distribution functions (CDF)**, which provide another way to describe the probability distribution of a random variable. The CDF is a function that maps each possible value of a random variable to the probability that the random variable takes on a value less than or equal to that value.


### <a id='toc4_1_'></a>[Definition of CDF](#toc0_)


The **cumulative distribution function (CDF)** of a random variable *X*, denoted as **F(x)**, is defined as:
$F(x) = P(X \leq x)$


In other words, the CDF gives the probability that the random variable *X* takes on a value less than or equal to *x*.


### <a id='toc4_2_'></a>[Properties of CDF](#toc0_)


The CDF has the following properties:
1. The CDF is a *non-decreasing function*. As the value of *x* increases, the CDF either remains constant or increases, but never decreases.
2. The CDF is *right-continuous*. The limit of the CDF as *x* approaches a value from the right is equal to the CDF at that value.
3. The CDF ranges from 0 to 1. As *x* approaches negative infinity, the CDF approaches 0, and as *x* approaches positive infinity, the CDF approaches 1.


Mathematically, these properties can be expressed as:
- $F(x_1) \leq F(x_2)$ for all $x_1 \leq x_2$
- $\lim_{x \to a^+} F(x) = F(a)$ for all *a*
- $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$


### <a id='toc4_3_'></a>[CDF for Discrete Random Variables](#toc0_)


For a discrete random variable *X* with probability mass function (PMF) *P(X = x)*, the CDF can be calculated as:
$F(x) = \sum_{x_i \leq x} P(X = x_i)$


In other words, the CDF for a discrete random variable is the sum of the probabilities of all values less than or equal to *x*.


Example: Let *X* be a discrete random variable representing the number of heads obtained when flipping a fair coin twice. The PMF of *X* is:
- P(X = 0) = 1/4
- P(X = 1) = 1/2
- P(X = 2) = 1/4


The CDF of *X* can be calculated as:
- F(x) = 0 for x < 0
- F(x) = P(X = 0) = 1/4 for 0 ≤ x < 1
- F(x) = P(X = 0) + P(X = 1) = 1/4 + 1/2 = 3/4 for 1 ≤ x < 2
- F(x) = P(X = 0) + P(X = 1) + P(X = 2) = 1/4 + 1/2 + 1/4 = 1 for x ≥ 2
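The step-like shape of this CDF is easy to see in code. The sketch below defines `cdf` (an illustrative name) for the two-flip example and evaluates it at several points, including values that *X* itself can never take.

```python
from fractions import Fraction

# PMF of the number of heads in two fair coin flips.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): add up the PMF over every possible value at or below x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

for x in (-1, 0, 0.5, 1, 1.99, 2, 10):
    print(f"F({x}) = {cdf(x)}")
# F(-1) = 0
# F(0) = 1/4
# F(0.5) = 1/4
# F(1) = 3/4
# F(1.99) = 3/4
# F(2) = 1
# F(10) = 1
```

The function is flat between the possible values and jumps by exactly P(X = k) at each value *k*.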


### <a id='toc4_4_'></a>[CDF for Continuous Random Variables](#toc0_)


For a continuous random variable *X* with probability density function (PDF) *f(x)*, the CDF can be calculated as:
$F(x) = \int_{-\infty}^x f(t) dt$


In other words, the CDF for a continuous random variable is the integral of the PDF from negative infinity to *x*.


Example: Let *X* be a continuous random variable representing the time (in minutes) it takes for a student to complete a test. The PDF of *X* is a uniform distribution between 30 and 90 minutes:
$f(x) = \begin{cases} \frac{1}{60}, & 30 \leq x \leq 90 \\ 0, & \text{otherwise} \end{cases}$


The CDF of *X* can be calculated as:
$F(x) = \begin{cases} 0, & x < 30 \\ \frac{x-30}{60}, & 30 \leq x \leq 90 \\ 1, & x > 90 \end{cases}$
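The piecewise CDF translates directly into a small function, and interval probabilities then follow from $P(a \leq X \leq b) = F(b) - F(a)$. In the sketch below, the name `uniform_cdf` and the interval from 45 to 60 minutes are illustrative assumptions.

```python
def uniform_cdf(x, low=30.0, high=90.0):
    """CDF of a uniform distribution on [low, high]."""
    if x < low:
        return 0.0
    if x > high:
        return 1.0
    return (x - low) / (high - low)

# Probability that a student finishes between 45 and 60 minutes.
p_between = uniform_cdf(60.0) - uniform_cdf(45.0)
print(p_between)  # 0.25
```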


The CDF provides a cumulative perspective on the probability distribution of a random variable. It allows us to calculate probabilities for intervals and make probabilistic statements about the random variable.


In the next section, we will explore joint, marginal, and conditional probability, which deal with the relationships between multiple random variables.

## <a id='toc5_'></a>[Joint, Marginal, and Conditional Probability](#toc0_)

In this section, we will explore the concepts of joint, marginal, and conditional probability, which are fundamental for understanding the relationships between multiple random variables.


### <a id='toc5_1_'></a>[Joint Probability](#toc0_)


**Joint probability** is the probability of two or more events occurring simultaneously. For discrete random variables *X* and *Y*, the joint probability mass function (PMF) is denoted as **P(X = x, Y = y)**, which gives the probability that *X* takes on the value *x* and *Y* takes on the value *y* simultaneously.


Similarly, for continuous random variables *X* and *Y*, the joint probability density function (PDF) is denoted as **f(x, y)**, which describes the probability density of *X* and *Y* taking on specific values simultaneously.


Properties of joint probability:
1. The joint probability is always non-negative: $P(X = x, Y = y) \geq 0$ for discrete variables, and $f(x, y) \geq 0$ for continuous variables.
2. The sum (for discrete variables) or integral (for continuous variables) of the joint probability over all possible values of *X* and *Y* is equal to 1.


Example: Consider rolling two fair six-sided dice. Let *X* be the number on the first die and *Y* be the number on the second die. The joint PMF of *X* and *Y* is:
$P(X = x, Y = y) = \frac{1}{36}$ for $x, y \in \{1, 2, 3, 4, 5, 6\}$


### <a id='toc5_2_'></a>[Marginal Probability](#toc0_)


**Marginal probability** is the probability of an event occurring for a single random variable, regardless of the values of other random variables. It can be obtained by summing (for discrete variables) or integrating (for continuous variables) the joint probability over the other variables.


For discrete random variables *X* and *Y*, the marginal PMF of *X* is denoted as **P(X = x)** and can be calculated as:
$P(X = x) = \sum_y P(X = x, Y = y)$

For continuous random variables *X* and *Y*, the marginal PDF of *X* is denoted as **f(x)** and can be calculated as:
$f(x) = \int_{-\infty}^{\infty} f(x, y) dy$


Example: Using the previous example of rolling two fair six-sided dice, the marginal PMF of *X* (the number on the first die) is:
$P(X = x) = \sum_{y=1}^6 P(X = x, Y = y) = \frac{1}{6}$ for $x \in \{1, 2, 3, 4, 5, 6\}$
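One way to sketch the two-dice example in code is to store the joint PMF as a dictionary keyed by `(x, y)` pairs and obtain the marginal by summing over `y`. The dictionary layout and variable names are illustrative choices.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Joint PMF of two fair dice: every (x, y) pair has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(faces, repeat=2)}
total = sum(joint.values())
print(total)  # 1

# Marginal PMF of X: sum the joint PMF over all values of y.
marginal_x = {x: sum(joint[(x, y)] for y in faces) for x in faces}
print(marginal_x[3])  # 1/6
```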


### <a id='toc5_3_'></a>[Conditional Probability](#toc0_)


**Conditional probability** is the probability of an event occurring given that another event is known to have occurred. It is denoted as **P(X = x | Y = y)** for discrete random variables and **f(x | y)** for continuous random variables.


For discrete random variables *X* and *Y*, the conditional PMF is calculated as:
$P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$, defined whenever $P(Y = y) > 0$

For continuous random variables *X* and *Y*, the conditional PDF is calculated as:
$f(x | y) = \frac{f(x, y)}{f(y)}$, where $f(y)$ is the marginal PDF of *Y* and $f(y) > 0$


Example: In the dice rolling example, the conditional probability of *X* being 3 given that *Y* is 4 is:
$P(X = 3 | Y = 4) = \frac{P(X = 3, Y = 4)}{P(Y = 4)} = \frac{\frac{1}{36}}{\frac{1}{6}} = \frac{1}{6}$
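The same calculation can be sketched as a function that divides a joint probability by the marginal of the conditioning variable. The helper name `conditional_x_given_y` is hypothetical, and the joint PMF is rebuilt here so the snippet runs on its own.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of two fair dice, keyed by (x, y).
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def conditional_x_given_y(joint, x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    if p_y == 0:
        raise ValueError("cannot condition on an event with probability zero")
    return joint.get((x, y), Fraction(0)) / p_y

p_3_given_4 = conditional_x_given_y(joint, 3, 4)
print(p_3_given_4)  # 1/6
```

The result equals the marginal P(X = 3) = 1/6, so learning the value of the second die tells us nothing about the first.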


Conditional probability allows us to update our beliefs about one random variable based on the information about another random variable.


Understanding joint, marginal, and conditional probability is crucial for analyzing the relationships between random variables and making probabilistic inferences based on available information.


In the next section, we will discuss independence and conditional independence, which are important concepts in probability theory and machine learning.

## <a id='toc6_'></a>[Independence and Conditional Independence](#toc0_)

In this section, we will discuss the concepts of independence and conditional independence, which are fundamental for understanding the relationships between random variables and simplifying probability calculations.


### <a id='toc6_1_'></a>[Independence](#toc0_)


Two random variables *X* and *Y* are said to be **independent** if knowing the value of one does not change the probability distribution of the other. Equivalently, the joint probability of *X* and *Y* is equal to the product of their individual marginal probabilities.


For discrete random variables *X* and *Y*, independence is defined as:
$P(X = x, Y = y) = P(X = x) \cdot P(Y = y)$ for all values of *x* and *y*.


For continuous random variables *X* and *Y*, independence is defined as:
$f(x, y) = f(x) \cdot f(y)$ for all values of *x* and *y*.


If two random variables are independent, knowing the value of one variable does not provide any information about the value of the other variable.


Example: Consider flipping a fair coin and rolling a fair six-sided die. Let *X* be the outcome of the coin flip (0 for tails, 1 for heads) and *Y* be the number on the die. *X* and *Y* are independent because the outcome of the coin flip does not affect the outcome of the die roll, and vice versa. The joint PMF of *X* and *Y* is:
$P(X = x, Y = y) = P(X = x) \cdot P(Y = y) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$ for $x \in \{0, 1\}$ and $y \in \{1, 2, 3, 4, 5, 6\}$.
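The factorization condition can be checked mechanically for any finite joint PMF. The sketch below applies such a check to the coin-and-die example and, for contrast, to a hypothetical pair in which *Y* is simply a copy of *X*. The helper names `marginals` and `is_independent` are illustrative.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Recover the marginal PMFs of X and Y from a joint PMF keyed by (x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return p_x, p_y

def is_independent(joint):
    """True if P(X = x, Y = y) = P(X = x) * P(Y = y) for every pair (x, y)."""
    p_x, p_y = marginals(joint)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(p_x, p_y))

# Fair coin (0 = tails, 1 = heads) and fair die: every pair has probability 1/12.
coin_and_die = {(x, y): Fraction(1, 12) for x, y in product((0, 1), range(1, 7))}
print(is_independent(coin_and_die))  # True

# Hypothetical contrast: Y always equals X, so the two are fully dependent.
copied_coin = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
print(is_independent(copied_coin))  # False
```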


### <a id='toc6_2_'></a>[Conditional Independence](#toc0_)


Two random variables *X* and *Y* are said to be **conditionally independent** given a third random variable *Z* if, once the value of *Z* is known, learning the value of *X* provides no further information about *Y*, and vice versa. In other words, the joint conditional probability of *X* and *Y* given *Z* is equal to the product of their individual conditional probabilities given *Z*.


For discrete random variables *X*, *Y*, and *Z*, conditional independence is defined as:
$P(X = x, Y = y | Z = z) = P(X = x | Z = z) \cdot P(Y = y | Z = z)$ for all values of *x*, *y*, and *z*.


For continuous random variables *X*, *Y*, and *Z*, conditional independence is defined as:
$f(x, y | z) = f(x | z) \cdot f(y | z)$ for all values of *x*, *y*, and *z*.


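Conditional independence is neither weaker nor stronger than independence: neither property implies the other. Two variables can be dependent yet become independent once *Z* is known (for example, when *Z* is a common cause of both), and two independent variables can become dependent once we condition on a variable that they both influence.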


Example: Consider a simple weather model in which the presence of clouds (*C*) influences both the probability of rain (*R*) and the probability that a person carries an umbrella (*U*), and suppose the person decides about the umbrella by looking only at the sky. Under this modeling assumption, *R* and *U* are conditionally independent given *C*: once we know whether there are clouds, seeing an umbrella provides no additional information about the probability of rain, and vice versa. Mathematically:
$P(R = r, U = u | C = c) = P(R = r | C = c) \cdot P(U = u | C = c)$ for all values of *r*, *u*, and *c*.
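The lecture gives no numbers for this model, so the sketch below invents some purely for illustration: all probabilities in `p_c`, `p_r_given_c`, and `p_u_given_c` are hypothetical. The joint distribution is built so that rain and umbrella each depend only on clouds, which makes the conditional independence hold by construction. The more interesting check is the last one, which shows that *R* and *U* are nevertheless dependent when *C* is not known.

```python
from fractions import Fraction
from itertools import product

# Hypothetical numbers, chosen only for illustration.
p_c = {1: Fraction(1, 2), 0: Fraction(1, 2)}           # P(C = c)
p_r_given_c = {1: Fraction(3, 5), 0: Fraction(1, 10)}  # P(R = 1 | C = c)
p_u_given_c = {1: Fraction(7, 10), 0: Fraction(1, 5)}  # P(U = 1 | C = c)

def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) is p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint PMF built as P(c) * P(r | c) * P(u | c), keyed by (c, r, u).
joint = {
    (c, r, u): p_c[c] * bernoulli(p_r_given_c[c], r) * bernoulli(p_u_given_c[c], u)
    for c, r, u in product((0, 1), repeat=3)
}

def prob(c=None, r=None, u=None):
    """Sum the joint PMF over every entry that matches the values that were given."""
    wanted = (c, r, u)
    return sum(p for key, p in joint.items()
               if all(w is None or w == k for w, k in zip(wanted, key)))

# Check P(r, u | c) = P(r | c) * P(u | c) for every combination of values.
cond_independent = all(
    prob(c, r, u) / prob(c=c) == (prob(c=c, r=r) / prob(c=c)) * (prob(c=c, u=u) / prob(c=c))
    for c, r, u in product((0, 1), repeat=3)
)
print(cond_independent)  # True

# Without conditioning on C, rain and umbrella are dependent.
print(prob(r=1, u=1), prob(r=1) * prob(u=1))  # 11/50 63/400
```

In this sketch, seeing an umbrella makes rain more likely only because it hints at clouds. Once the cloud status is known, the umbrella carries no extra information.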


Independence and conditional independence are important concepts in probability theory and machine learning because they allow us to simplify probability calculations and make assumptions about the relationships between variables. Many machine learning algorithms, such as Naive Bayes classifiers and Bayesian networks, rely on independence assumptions to make probabilistic inferences efficiently.


Understanding independence and conditional independence is crucial for building and interpreting probabilistic models in machine learning.


This concludes our discussion of the fundamentals of probability theory; the final section summarizes the key points. In the next chapter, we will explore how these concepts are applied in various machine learning algorithms and techniques.

## <a id='toc7_'></a>[Summary and Further Resources](#toc0_)

In this lecture, we covered the fundamental concepts of probability theory that are essential for understanding and applying probabilistic modeling in machine learning. Here's a summary of the key points:

- We introduced random variables, which are variables that take on different values with associated probabilities. Random variables can be discrete or continuous.
- We discussed probability distributions, which describe the likelihood of a random variable taking on different values. Probability mass functions (PMF) are used for discrete random variables, while probability density functions (PDF) are used for continuous random variables.
- We explored cumulative distribution functions (CDF), which provide the probability that a random variable takes on a value less than or equal to a given value.
- We covered joint probability, which is the probability of two or more events occurring simultaneously, and marginal probability, which is the probability of a single event occurring regardless of other events.
- We introduced conditional probability, which is the probability of an event occurring given that another event has already occurred.
- We discussed independence and conditional independence, which are important concepts for simplifying probability calculations and making assumptions about the relationships between variables.


To further deepen your understanding of probability theory and its applications in machine learning, consider exploring the following resources:

- Books:
  - [Pattern Recognition and Machine Learning](https://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738) by Christopher M. Bishop
  - [Probability Theory: The Logic of Science](https://www.amazon.com/Probability-Theory-The-Logic-Science/dp/0521592712) by E.T. Jaynes
  - [Introduction to Probability](https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918) by Joseph K. Blitzstein and Jessica Hwang

- Online Courses:
  - [MIT OpenCourseWare: Probabilistic Systems Analysis and Applied Probability](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-041-probabilistic-systems-analysis-and-applied-probability-fall-2010/)
  - [Stanford University: Probability and Statistics](https://online.stanford.edu/courses/gse-yprobstat-probability-and-statistics)
  - [Khan Academy: Probability and Statistics](https://www.khanacademy.org/math/probability)

- Tutorials and Articles:
  - [A Comprehensive Guide to Probability Distributions](https://towardsdatascience.com/a-comprehensive-guide-to-probability-distributions-8f3d5a4e7d0c) on Towards Data Science
  - [Joint, Marginal, and Conditional Probability](https://www.geeksforgeeks.org/joint-marginal-and-conditional-probability/) on GeeksforGeeks
  - [Probability Theory](https://math.stackexchange.com/questions/tagged/probability-theory) on Mathematics Stack Exchange


These resources will help you gain a deeper understanding of probability theory and its applications in machine learning. They provide a mix of theoretical foundations, practical examples, and hands-on tutorials to reinforce your learning.


Remember, mastering probability theory is crucial for effectively working with probabilistic models in machine learning. So, take your time to explore these resources and practice applying the concepts to real-world problems.