# This notebook is all about statistics

## *Relative Frequency*

**Relative frequency** compares how often something happens with the total number of events.  

### Example:  
Imagine you have a bag of **100 candies**, and the colors are:  
- **Red - 30 candies**  
- **Blue - 20 candies**  
- **Green - 50 candies**  

The **relative frequency** of each color is:  
- **Red** = 30/100 = **0.30 (or 30%)**  
- **Blue** = 20/100 = **0.20 (or 20%)**  
- **Green** = 50/100 = **0.50 (or 50%)**  

So, relative frequency tells us how **frequent** an event is **compared to the total**. It's usually written as a **fraction, decimal, or percentage**.  

It helps in **probability, statistics, and data analysis** to understand trends or patterns. 😊
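
The candy example can be reproduced in a few lines of Python. This is a minimal sketch: the variable names (`candy_counts`, `relative_frequency`) are just illustrative choices, and the counts are the ones from the example above.

```python
from collections import Counter

# Candy counts from the example above.
candy_counts = Counter({"Red": 30, "Blue": 20, "Green": 50})

total = sum(candy_counts.values())  # 100 candies in all

# Relative frequency = count of a category / total count.
relative_frequency = {color: count / total for color, count in candy_counts.items()}

for color, freq in relative_frequency.items():
    print(f"{color}: {freq:.2f} ({freq:.0%})")
```

The relative frequencies always add up to 1 (or 100%), which is a handy sanity check.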

## *Mean Median Mode*


**Mean, median, and mode** are three ways to describe the **center** of a set of numbers.  

---

### **1. Mean (Average)**  
**Definition:** The mean is the **average** of all numbers. You add them up and divide by how many numbers there are.  

**Example:**  
Suppose your exam scores are: **50, 60, 70, 80, 90**  
To find the mean:  

\[
\frac{50 + 60 + 70 + 80 + 90}{5} = \frac{350}{5} = 70
\]

So, the **mean = 70**  

💡 **Think of it as:** The "balancing point" of all numbers.  

---

### **2. Median (Middle Value)**  
**Definition:** The median is the **middle number** when numbers are arranged in order.  

**Example:**  
Numbers: **10, 20, 30, 40, 50**  
The middle number is **30**, so the **median = 30**  

**What if there is an even number of values?**  
Example: **10, 20, 30, 40, 50, 60**  
Middle two numbers = **30, 40**  
Median = **(30 + 40) / 2 = 35**  

💡 **Think of it as:** The "middle seat" in a row.  

---

### **3. Mode (Most Frequent Value)**  
**Definition:** The mode is the number that **appears most often** in the list.  

**Example:**  
Numbers: **2, 3, 3, 3, 4, 5, 6, 6, 6, 7**  
- **3 appears 3 times**  
- **6 appears 3 times**  

So, there are **two modes: 3 & 6** (this is called **bimodal**).  
If no number repeats, there is **no mode**.  

💡 **Think of it as:** The "most popular" number.  

---

### **Quick Summary:**  
| Measure  | Meaning | Example |  
|----------|---------|---------|  
| **Mean** | Average | (50+60+70+80+90) ÷ 5 = 70 |  
| **Median** | Middle value | Middle of **10, 20, 30, 40, 50** → **30** |  
| **Mode** | Most frequent | **3 & 6** appear the most in **2,3,3,3,4,5,6,6,6,7** |

Each measure is useful in different situations. **Mean** is good for overall trends, **Median** helps when there are extreme values, and **Mode** helps when looking for the most common occurrence.  
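
The three examples above can be checked with Python's built-in `statistics` module. This is a small sketch using the same numbers. `multimode` is used (rather than `mode`) because it returns every value tied for most frequent, which matches the bimodal example.

```python
import statistics

scores = [50, 60, 70, 80, 90]                # mean example
odd_values = [10, 20, 30, 40, 50]            # median, odd count
even_values = [10, 20, 30, 40, 50, 60]       # median, even count
repeated = [2, 3, 3, 3, 4, 5, 6, 6, 6, 7]    # mode example

mean_score = statistics.mean(scores)          # 70
median_odd = statistics.median(odd_values)    # 30
median_even = statistics.median(even_values)  # 35.0 (average of 30 and 40)
modes = statistics.multimode(repeated)        # [3, 6]

print(mean_score, median_odd, median_even, modes)
```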



### **Pros and Cons of Mean, Median, and Mode**  

Each of these measures has strengths and weaknesses. Let’s break it down:  

---

#### **1. Mean (Average)**  
📌 **Pros:**  
✅ Uses all values → Gives a complete picture of the data.  
✅ Best for normally distributed (balanced) data.  
✅ Useful in calculations like variance and standard deviation.  

❌ **Cons:**  
❌ **Sensitive to outliers** (extreme values can distort it).  
   - Example: **10, 20, 30, 40, 1000** → Mean = **220**, which doesn’t represent most numbers.  
❌ Doesn’t always show the most common value.  

💡 **Best used when:** Data is evenly spread, and there are no extreme outliers.  

---

#### **2. Median (Middle Value)**  
📌 **Pros:**  
✅ **Robust to outliers** (good for skewed data).  
✅ Is an actual data value when the count is odd (with an even count it is the average of the two middle values).  
✅ Great for analyzing income, house prices, etc. (where extremes exist).  

❌ **Cons:**  
❌ Ignores most values except the middle ones.  
❌ Less useful for further calculations (like standard deviation).  
❌ If the dataset is large, finding the median manually takes time.  

💡 **Best used when:** There are extreme values or skewed distributions.  

---

#### **3. Mode (Most Frequent Value)**  
📌 **Pros:**  
✅ Works well for **categorical data** (e.g., most popular movie genre).  
✅ Simple to find, no calculations needed.  
✅ Useful for detecting trends (e.g., most common customer complaint).  

❌ **Cons:**  
❌ Sometimes there is **no mode** or **more than one mode**, making it unclear.  
❌ Doesn’t consider all numbers.  
❌ Less useful for numerical analysis compared to mean and median.  

💡 **Best used when:** You want to find the most common occurrence (e.g., most sold product, most common age group).  

---

### **Quick Comparison Table**  

| Measure  | Pros | Cons | Best Used When |
|----------|------|------|---------------|
| **Mean** | Uses all data, best for balanced data | Affected by outliers | Data is evenly distributed |
| **Median** | Robust to outliers, easy to interpret | Ignores most data | Data has extreme values (income, house prices) |
| **Mode** | Best for categories, shows trends | May not exist, ignores many values | Finding the most common occurrence |

Each measure is useful depending on the situation! 😊
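
To see the outlier effect in numbers, here is a short sketch that uses the `10, 20, 30, 40, 1000` example from the mean's cons list and compares it with the same data without the extreme value. The variable names are illustrative.

```python
import statistics

with_outlier = [10, 20, 30, 40, 1000]
without_outlier = [10, 20, 30, 40]

# The single extreme value drags the mean far away from the bulk of the data...
mean_with = statistics.mean(with_outlier)        # 220
mean_without = statistics.mean(without_outlier)  # 25

# ...while the median barely moves.
median_with = statistics.median(with_outlier)        # 30
median_without = statistics.median(without_outlier)  # 25.0

print(f"mean:   {mean_without} -> {mean_with}")
print(f"median: {median_without} -> {median_with}")
```

The mean jumps from 25 to 220, but the median only shifts from 25 to 30, which is why the median is usually preferred for data with extreme values.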

## *Skewness*

### **Skewness - Explained in Simple Terms**  

**Skewness** tells us **how "asymmetrical" or "lopsided"** a dataset is. It shows whether the data is **leaning more toward one side** instead of being evenly distributed.  

---

### **Types of Skewness**  

#### **1. No Skewness (Symmetrical / Normal Distribution)**
📌 **What it means:** The data is evenly spread around the center (mean = median = mode).  
📌 **Example:** Heights of adults in a population are approximately normally distributed.  

📊 **Graph Shape:** Bell-shaped, like this:  

```
       *
     *   *
   *       *
 *           *
-----------------
```

---

#### **2. Positive Skewness (Right-Skewed)**
📌 **What it means:** The tail of the distribution is **longer on the right** (higher values are pulling the mean up).  
📌 **Example:** **Income distribution** (a few people earn very high salaries, but most earn less).  

📊 **Graph Shape:**  
```
*  
**  
***  
*****________
```
✔ Typically **Mean > Median > Mode** (a rule of thumb, not a guarantee)  

---

#### **3. Negative Skewness (Left-Skewed)**
📌 **What it means:** The tail of the distribution is **longer on the left** (lower values are pulling the mean down).  
📌 **Example:** **Exam scores** (if most students score high, but a few fail with very low scores).  

📊 **Graph Shape:**  
```
        *****  
       ***  
      **  
     *________  
```
✔ Typically **Mean < Median < Mode** (a rule of thumb, not a guarantee)  

---

### **Why is Skewness Important?**  
- Helps understand **data distribution** before using **mean, median, or mode**.  
- Affects **decision-making** in finance, economics, and data analysis.  
- Determines if **data transformation** (like log scaling) is needed.  

### **Quick Summary:**  
| Type of Skewness | Meaning | Example | Mean vs. Median |
|-----------------|---------|---------|----------------|
| **No Skew (Symmetric)** | Evenly spread | Heights of people | Mean ≈ Median ≈ Mode |
| **Positive Skew (Right-Skewed)** | Long tail on right | Income, house prices | Mean > Median |
| **Negative Skew (Left-Skewed)** | Long tail on left | Exam scores where most students score high | Mean < Median |
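
Skewness can also be measured as a number. The sketch below uses one common definition, the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). Other definitions exist and give slightly different values. The three small datasets are made up purely for illustration.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population (divide-by-n) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data, chosen only to show the three cases.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [20, 22, 25, 27, 30, 35, 120]  # e.g. incomes with one very high earner
left_skewed = [95, 92, 90, 88, 85, 80, 20]    # e.g. exam scores with one very low score

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name:13s} skewness={skewness(data):+.2f}  "
          f"mean={statistics.mean(data):.1f}  median={statistics.median(data)}")
```

A value near 0 suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail. In these examples the mean is also pulled toward the tail, matching the table above.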


# Variance

**Variance** measures how far your data points are spread out from their average value. If the variance is **low**, the data points are close to the average. If the variance is **high**, the data points are spread out.

### **Example: Exam Scores**
Let’s say we have the math exam scores of two classes:

#### **Class A Scores:**  
80, 82, 81, 79, 83  
- The scores are close to each other.  
- The variance is **low** because everyone scored around the same marks.

#### **Class B Scores:**  
50, 90, 40, 95, 30  
- The scores are very spread out.  
- The variance is **high** because some students scored very low while others scored very high.

### **Why Does Variance Matter?**
- **Low variance** means your data is consistent, which is good in many cases.  
- **High variance** can indicate unpredictability or randomness in the data, which might make predictions less reliable.  
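
To put numbers on "low" and "high", the sketch below computes the variance of both classes from the definition: the average of the squared distances from the mean. It assumes the five scores are the whole population (divide by n). If they were a sample from a larger class, the usual convention is to divide by n - 1 instead, which is what `statistics.variance` does.

```python
import statistics


def population_variance(values):
    """Average squared distance from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


class_a = [80, 82, 81, 79, 83]
class_b = [50, 90, 40, 95, 30]

var_a = population_variance(class_a)  # 2.0
var_b = population_variance(class_b)  # 704.0

print(f"Class A variance: {var_a}")
print(f"Class B variance: {var_b}")

# Cross-check against the standard library.
print(statistics.pvariance(class_a), statistics.pvariance(class_b))
```

Class B's variance is several hundred times larger than Class A's, which matches the intuition that its scores are far more spread out.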


# Standard Deviation

### **Standard Deviation (SD) in Simple Terms**  
Standard deviation is just a way to measure **how much the data varies** from the average (mean). It's closely related to **variance** but in the same units as the original data, making it easier to understand.

### **Example: Exam Scores Again**  
Let's say two classes took a math test.

#### **Class A Scores:**  
80, 82, 81, 79, 83  
- The average (mean) score is **81**.  
- The scores are **close** to the average.  
- **Standard deviation is low** (small variation in scores).  

#### **Class B Scores:**  
50, 90, 40, 95, 30  
- The average (mean) score is **61**.  
- But the scores are **spread out** (some very high, some very low).  
- **Standard deviation is high** (large variation in scores).  

### **Formula Connection:**
- **Variance** measures the **spread** of the data.  
- **Standard Deviation (SD) = Square Root of Variance**  
  - Since variance is in **squared units**, taking the square root brings it back to the original unit.  

### **Why is SD Important?**  
- **Low SD** → Data points are **consistent** (good for reliable predictions).  
- **High SD** → Data points are **scattered** (less predictable, may need investigation).  
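
Here is a short sketch for the two classes, again treating the five scores as the full population (so `statistics.pstdev` and `statistics.pvariance` are used). It also checks the square-root relationship described above.

```python
import math
import statistics

class_a = [80, 82, 81, 79, 83]
class_b = [50, 90, 40, 95, 30]

sd_a = statistics.pstdev(class_a)  # about 1.41 points
sd_b = statistics.pstdev(class_b)  # about 26.53 points

print(f"Class A: mean={statistics.mean(class_a)}, SD={sd_a:.2f}")
print(f"Class B: mean={statistics.mean(class_b)}, SD={sd_b:.2f}")

# SD is the square root of the variance, so it is back in "points", not "points squared".
assert math.isclose(sd_a, math.sqrt(statistics.pvariance(class_a)))
assert math.isclose(sd_b, math.sqrt(statistics.pvariance(class_b)))
```

Because SD is in the same unit as the scores, it can be read directly: a typical Class A score is within a point or two of the mean, while a typical Class B score is more than 25 points away.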


# Covariance

### **Covariance in Simple Terms**  

Covariance tells us how **two variables change together**—whether they increase or decrease **together** or move **oppositely**.  

### **Example: Study Time vs. Exam Scores**  
Imagine we track **study time (hours)** and **exam scores** for 5 students:

| Study Time (Hours) | Exam Score (%) |
|-------------------|--------------|
| 2               | 50           |
| 4               | 65           |
| 6               | 80           |
| 8               | 90           |
| 10              | 95           |

Here, as **study time increases, exam scores also increase**.  
👉 **Covariance is positive** (both move in the same direction).  

### **Interpreting Covariance:**  
- **Positive Covariance** → Both variables increase or decrease together (e.g., more study time → higher marks).  
- **Negative Covariance** → One increases while the other decreases (e.g., more TV time → lower marks).  
- **Zero Covariance** → No linear relationship (the variables may still be related in a nonlinear way).  

### **Covariance vs. Correlation**  
Covariance only shows **direction**, but **correlation** (which normalizes covariance) tells us **both direction and strength** of the relationship.
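
As a worked calculation, the sketch below computes the covariance for the study-time table by hand. It assumes the five students are a sample, so it divides by n - 1. Dividing by n (the population version) would give a smaller number with the same sign. The function name `sample_covariance` is an illustrative choice.

```python
def sample_covariance(xs, ys):
    """Average product of deviations from the means (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


study_hours = [2, 4, 6, 8, 10]
exam_scores = [50, 65, 80, 90, 95]

cov = sample_covariance(study_hours, exam_scores)
print(f"Covariance: {cov}")  # 57.5 (in units of hours x percentage points)
```

The positive sign confirms that the two variables move together. The size of the number (57.5) is hard to judge on its own because it depends on the units of both variables.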


# Correlation

### **Correlation in Simple Terms**  

Correlation tells us **how strongly** two variables are related and in which direction. It's similar to **covariance**, but it's standardized, meaning it always ranges from **-1 to 1**.

### **Example: Study Time vs. Exam Scores**  
Let’s take the same example:

| Study Time (Hours) | Exam Score (%) |
|-------------------|--------------|
| 2               | 50           |
| 4               | 65           |
| 6               | 80           |
| 8               | 90           |
| 10              | 95           |

Here, as **study time increases, exam scores also increase**.  
👉 **Strong positive correlation (~1)**  

### **Correlation Values & Meaning**  
- **+1 → Perfect Positive Correlation** (Both increase together)  
- **0 → No Correlation** (No linear relationship)  
- **-1 → Perfect Negative Correlation** (One increases, the other decreases)  

### **Difference Between Covariance & Correlation**  
| Feature        | Covariance | Correlation |
|--------------|------------|--------------|
| **Range**     | Any value  | -1 to 1  |
| **Units?**    | Product of the two variables' units (e.g., hours × %) | Unit-free (easier to interpret) |
| **Meaning?**  | Shows **direction** (positive or negative) | Shows **direction & strength** of the relationship |

### **Why Use Correlation Instead of Covariance?**  
Covariance depends on the scale of data, making it harder to interpret. Correlation **removes the scale effect**, making it comparable across different datasets.
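
The scale effect is easy to demonstrate. This sketch computes the Pearson correlation for the study-time table as covariance divided by the product of the two standard deviations, then repeats the calculation with study time converted to minutes. The helper names are illustrative, and the sample (n - 1) convention is an assumption. The choice cancels out in the correlation.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


hours = [2, 4, 6, 8, 10]
scores = [50, 65, 80, 90, 95]
minutes = [h * 60 for h in hours]  # same data, different unit

cov_hours = covariance(hours, scores)      # 57.5
cov_minutes = covariance(minutes, scores)  # 3450.0 (60 times larger)

r_hours = correlation(hours, scores)       # about 0.98
r_minutes = correlation(minutes, scores)   # unchanged

print(f"covariance:  {cov_hours} (hours) vs {cov_minutes} (minutes)")
print(f"correlation: {r_hours:.2f} (hours) vs {r_minutes:.2f} (minutes)")
```

Changing the unit multiplies the covariance by 60 but leaves the correlation at about 0.98, which is why correlation is the easier number to compare across datasets.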


# Distributions

### **Understanding Different Distributions in Simple Terms**  

Distributions describe how data is spread or distributed. Let’s break down the **Normal Distribution, Uniform Distribution, and Standard Normal Distribution** in simple terms.  

---

## **1️⃣ Normal Distribution (Bell Curve)**
This is the most common distribution in statistics and machine learning. It looks like a **bell-shaped curve** where most values are clustered around the mean (average), and fewer values exist at the extremes.  

### **Example:**  
Imagine you measure the height of 1,000 people. Most people are of **average height**, and fewer people are either **very tall or very short**.  

### **Key Features:**
✅ **Symmetrical** around the mean.  
✅ **Mean = Median = Mode** (centered at the middle).  
✅ The data follows a predictable pattern:  
   - About 68% of values lie within **1 standard deviation** of the mean.  
   - About 95% of values lie within **2 standard deviations**.  
   - About 99.7% of values lie within **3 standard deviations**.  
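
These three percentages can be checked with `statistics.NormalDist` from the standard library (Python 3.8+). The sketch uses a standard normal (mean 0, SD 1), but the shares are the same for any normal distribution. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within = {k: share_within(k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} SD: {share:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```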

### **Real-Life Examples:**
📌 Heights, weights, IQ scores, blood pressure levels, etc.  

---

## **2️⃣ Uniform Distribution (Flat Distribution)**
In a **uniform distribution**, all values have an **equal chance of occurring**. Instead of a bell curve, you get a **flat, rectangular shape**.  

### **Example:**  
Imagine rolling a **fair 6-sided die**. Each number (1 to 6) has an equal chance (1/6 or ~16.67%) of appearing.  

### **Key Features:**
✅ All values in the range have **equal probability**.  
✅ The graph looks **flat** (no peaks or clustering).  

### **Real-Life Examples:**
📌 Rolling dice, picking a random number from 1 to 100, random password generation.  
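
A quick simulation shows the "flat" shape. This sketch rolls a simulated fair die many times and prints the relative frequency of each face. The seed and the number of rolls are arbitrary choices, and exact frequencies will vary slightly from run to run if the seed is changed.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

n_rolls = 60_000
rolls = [random.randint(1, 6) for _ in range(n_rolls)]  # randint includes both ends

counts = Counter(rolls)
relative_freq = {face: counts[face] / n_rolls for face in range(1, 7)}

for face, freq in relative_freq.items():
    print(f"{face}: {freq:.3f}  (expected {1 / 6:.3f})")
```

Each face should come out close to 1/6, with no peak anywhere, in contrast to the bell curve of the normal distribution.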

---

## **3️⃣ Standard Normal Distribution (Z-Score Distribution)**
A **Standard Normal Distribution** is just a **Normal Distribution** but with:  
✅ **Mean = 0**  
✅ **Standard Deviation = 1**  

This is useful because it allows us to compare different datasets using a **Z-score**, which tells us **how many standard deviations** a value is from the mean.  

### **Example:**  
If a student's test score is **2 standard deviations above the mean** (Z = 2), they scored higher than roughly 97.7% of students, assuming the scores are normally distributed.  

### **Why Use Standard Normal Distribution?**  
📌 It helps compare different datasets (e.g., comparing heights and weights, even though they have different units).  
📌 It is used in hypothesis testing, Z-tests, and machine learning models.  
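
A Z-score is computed as (value - mean) / standard deviation. The sketch below uses made-up numbers (a class mean of 70, an SD of 10, and a score of 90), and it assumes the scores are roughly normal when converting the Z-score into a percentile.

```python
from statistics import NormalDist

# Hypothetical class statistics, for illustration only.
class_mean = 70
class_sd = 10
score = 90

# Z-score: how many standard deviations the score is from the mean.
z = (score - class_mean) / class_sd  # 2.0

# Share of students expected to score below this value, if scores are normal.
share_below = NormalDist().cdf(z)

print(f"z = {z}, share below = {share_below:.3f}")  # z = 2.0, share below = 0.977
```

Because Z-scores are unit-free, a Z of 2 means the same thing for a test score, a height, or a weight: the value sits two standard deviations above its own mean.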

---

### **Summary Table**
| Distribution Type  | Shape | Example |
|------------------|--------|---------|
| **Normal (Bell Curve)** | Bell-shaped | Heights of people, IQ scores |
| **Uniform** | Flat (Equal chance) | Rolling a die, random numbers |
| **Standard Normal** | Bell-shaped (Mean = 0, SD = 1) | Used for comparing datasets using Z-scores |
