<a href="https://colab.research.google.com/github/shreyaasoba/Data-Science-Topics-Series/blob/main/Module4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Module 4: Statistics and Probability**

# Understanding Statistics for Machine Learning

This README gives a concise overview of the statistical concepts most often used in machine learning: descriptive statistics, probability distributions, and hypothesis testing. These concepts underpin data analysis, model evaluation, and the interpretation of results.

---

## Key Concepts

### 1. **Descriptive Statistics**
Descriptive statistics summarize and describe the main features of a dataset.

- **Mean**: The average value of a dataset, calculated as:
  \[
  \text{Mean} = \frac{\sum{\text{Values}}}{\text{Number of Values}}
  \]
  Example:
  ```python
  import numpy as np
  data = [1, 2, 3, 4, 5]
  mean = np.mean(data)
  print(mean)  # Output: 3.0
  ```

- **Median**: The middle value when the data is ordered from lowest to highest. If the dataset has an even number of values, the median is the average of the two middle values.
  Example:
  ```python
  median = np.median(data)
  print(median)  # Output: 3.0
  ```

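- **Mode**: The most frequently occurring value in the dataset. A dataset can have more than one mode, and the mode is not informative when every value occurs equally often.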
  Example:
  ```python
  from scipy import stats
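  # Every value in `data` occurs exactly once, so all values are tied;
  # in that case stats.mode returns the smallest one.
  mode = stats.mode(data)
  print(mode.mode, mode.count)  # mode 1, count 1 (scalars or 1-element arrays, depending on the SciPy version)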
  ```
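
The three definitions above can also be written out by hand, which makes the even-length median rule and tied modes easier to see. This is a minimal sketch rather than a replacement for NumPy or SciPy; the helper names (`mean`, `median`, `modes`) and the `sample` data are assumptions made for illustration.

```python
from collections import Counter

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

sample = [1, 2, 2, 3, 4, 10]  # hypothetical data with an even number of values
print(mean(sample))    # 22 / 6, about 3.67
print(median(sample))  # 2.5, the average of the middle values 2 and 3
print(modes(sample))   # [2]
```

Returning a list from `modes` is a deliberate choice here: it reports every tied value instead of silently picking one.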

---

### 2. **Probability Distributions**
A probability distribution describes how likely each possible value (or range of values) of a random variable is.

- **Normal Distribution (Bell Curve)**: A symmetrical distribution where:
  - The mean, median, and mode are equal.
  - Most data points cluster around the mean, with fewer values at the extremes.
  - Characterized by the probability density function (PDF):
    \[
    f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    \]
  Example:
  ```python
  import matplotlib.pyplot as plt
  import numpy as np
  
  data = np.random.normal(0, 1, 1000)
  plt.hist(data, bins=30, density=True)
  plt.title("Normal Distribution")
  plt.show()
  ```
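
To connect the PDF formula to code, it can be evaluated directly with the standard library. The sketch below is illustrative only; the function name `normal_pdf` and its default arguments are assumptions, with `mu` and `sigma` standing for the mean and standard deviation in the formula.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

peak = normal_pdf(0.0)  # highest point of the standard normal curve, about 0.3989
print(peak)

# The curve is symmetric about mu, so points equally far from the mean have the same density.
print(normal_pdf(1.0), normal_pdf(-1.0))
```

Plotting `normal_pdf` over the histogram from the previous example should show the curve tracing the tops of the bars, up to sampling noise.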

---

### 3. **Hypothesis Testing**
Hypothesis testing is used to make inferences or draw conclusions about a population based on a sample.

- **Null Hypothesis (H₀)**: The default assumption that there is no effect or no difference between groups.
  Example: "The mean test scores of two groups are equal."

- **P-value**: The probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true.
  - A **low p-value** (below the chosen significance level, commonly 0.05) is evidence against H₀ and leads to its rejection.
  - A **high p-value** means there is insufficient evidence to reject H₀; it does not show that H₀ is true.

Example:
```python
from scipy.stats import ttest_ind

# Two sample groups
group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]

# Perform t-test
t_stat, p_value = ttest_ind(group1, group2)
print(f"t-statistic: {t_stat}, p-value: {p_value}")
```
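
The decision rule described above (compare the p-value with a significance level fixed in advance) can be sketched as follows. The variable names `alpha` and `decision` are assumptions for illustration, and 0.05 is only the conventional default, not a required value.

```python
from scipy.stats import ttest_ind

alpha = 0.05  # significance level, chosen before looking at the data

group1 = [1, 2, 3, 4, 5]
group2 = [2, 3, 4, 5, 6]

# Two-sided independent-samples t-test (equal variances assumed by default)
t_stat, p_value = ttest_ind(group1, group2)

if p_value < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(f"t = {t_stat:.3f}, p = {p_value:.3f}: {decision}")
```

For these two groups the t-statistic is -1 and the p-value is roughly 0.35, so the test fails to reject H₀: a one-point difference in means is not distinguishable from chance with only five values per group.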

---

### 4. **Key Points to Remember**

1. **Descriptive Statistics**:
   - Provide a basic summary of the data (mean, median, mode).
   - Help understand the distribution and central tendencies.

2. **Probability Distributions**:
   - The normal distribution is a common reference in statistics.
   - Not all datasets follow a normal distribution.

3. **Hypothesis Testing**:
   - Involves comparing observed data against the null hypothesis.
   - Interpreting a p-value requires considering the study context and the predetermined significance level (commonly 0.05).

---

### Additional Resources
- [SciPy Documentation](https://docs.scipy.org/doc/scipy/)
- [NumPy Documentation](https://numpy.org/doc/stable/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)


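```python
import numpy as np

data = np.random.normal(loc=50, scale=10, size=100)  # Generate random data

mean = np.mean(data)
std_dev = np.std(data)  # Population standard deviation (ddof=0 by default)

print("Mean:", mean)
print("Standard Deviation:", std_dev)
```

Output from one run (values differ between runs because the data is random):

```
Mean: 48.53359731225125
Standard Deviation: 9.570923152973858
```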
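
One detail worth knowing about `np.std`: by default it divides by the number of values (`ddof=0`), which gives the population standard deviation. Passing `ddof=1` divides by one less than the number of values and gives the sample standard deviation. The sketch below uses a small hypothetical dataset, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data, mean 5

population_std = np.std(values)      # divides the sum of squared deviations (32) by n = 8
sample_std = np.std(values, ddof=1)  # divides by n - 1 = 7 instead

print(population_std)  # 2.0
print(sample_std)      # slightly larger, about 2.14
```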


---
**Practice:**

- Perform a t-test to compare two datasets.
- Plot a histogram of the generated data.

---