# 📊 Statistics Basics
### 🎓 **Advanced Machine Learning 2024**
#### 👨‍🏫 **By: George Samuel**

---

## 🧐 1. What is Statistics?

**Definition:**
> Statistics is the science of **summarizing** and **describing** data.

### üá™üá¨ Real-World Example:
Imagine you have a dataset containing **100,000,000 observations** about the height of Egyptian people.

* **The Problem:** If you want to describe how tall Egyptians are, you cannot list the height of every single person!
* **The Solution:** You summarize these 100M observations into one single number (e.g., *"The average height is 170cm"*).
* **The Result:** This single number (170cm) is called a **Statistical Measure**.

---

## 📏 2. Statistical Measures & Data Types

### 📌 What is a Statistical Measure?
A number calculated to **summarize many records (rows)** of information into a single value. These measures allow us to make **statistical inferences** about the population.

### 🗂️ Types of Data
Since measures depend on data, we must understand data types first:

| Feature | 🔢 Continuous Data (Numerical) | 📦 Discrete Data (Categorical) |
| :--- | :--- | :--- |
| **Definition** | Data with an **infinite** (or very large) number of possible values. | Data with a **finite**, usually small, number of possible values. |
| **Data Type** | `Float`, or `Int` with many unique values. | `String`, or `Int` with few unique values. |
| **Examples** | Salary, Weight, Number of hours played. | City Name, Number of children. |

---

## 📉 3. Popular Statistical Measures

We focus on three main categories:
1.  **🎲 Probability**
2.  **🎯 Measures of Central Tendency**
3.  **〰️ Measures of Dispersion (Deviation)**

---

### 🎲 A. Probability
It is the ratio of the **frequency of a value** to the **total number of samples**.

#### Examples:
* **Dice Roll 🎲:** Values $\{1, 2, 3, 4, 5, 6\}$.
    * Probability of rolling a `1` = $1/6 \approx 0.167$
* **Box of Balls 🔴🔵🟡:** Values $\{Blue, Red, Yellow\}$.
    * If there are 4 Blue balls out of 12 total:
    * Probability of Blue = $4/12 \approx 0.333$
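
The ratio described above can be computed by simple counting. The following is a minimal sketch using only the standard library; the helper name `empirical_probability` and the exact contents of the box (5 red and 3 yellow balls) are assumptions made for illustration.

```python
from collections import Counter


def empirical_probability(samples, value):
    """Frequency of `value` divided by the total number of samples."""
    counts = Counter(samples)
    return counts[value] / len(samples)


# Hypothetical box: 4 blue, 5 red, and 3 yellow balls (12 in total).
balls = ["Blue"] * 4 + ["Red"] * 5 + ["Yellow"] * 3

p_blue = empirical_probability(balls, "Blue")
print(round(p_blue, 3))  # 0.333
```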

---

### 🎯 B. Measures of Central Tendency
Used to represent the **typical (central) value** of the data.



#### 1️⃣ Mean ($\mu$)
* **Definition:** Summation of all values divided by the total number of observations.
* **Formula:** $\mu = \frac{\sum X}{N}$
* **Usage:** Numerical data without extreme values, because the mean is **sensitive to outliers**.

> **Example:**
> Data: `[5, 2, 3, 10, 20]`
> Mean = $(5+2+3+10+20) / 5 = 8$

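#### 2️⃣ Median
* **Definition:** The **middle value** in the data after being **sorted**. With an even count, it is the average of the two middle values.
* **Usage:** Numerical data containing **outliers** (the median is much less affected by them).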

> **Example 1 (Odd count):**
> Data: `[5, 2, 3, 10, 20]` $\rightarrow$ Sort: `[2, 3, 5, 10, 20]`
> Median = **5**
>
> **Example 2 (Even count):**
> Data: `[3, 5, 2, 3, 10, 20]` $\rightarrow$ Sort: `[2, 3, 3, 5, 10, 20]`
> Median = $(3+5) / 2 = 4$

#### 3️⃣ Mode
* **Definition:** The **most frequent** value.
* **Usage:** Categorical data.

> **Example 1:** `[5, 2, 3, 3, 2, 3, 1, 5]` $\rightarrow$ Mode = **3** (Most repeated)
>
> **Example 2:** `[Cairo, Alex, Aswan, Alex]` $\rightarrow$ Mode = **Alex**
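
The three measures above can be checked with Python's built-in `statistics` module. This is only a sketch that reuses the example lists from this section; the extra value `1000` is an assumed outlier added to illustrate why the median is preferred when outliers are present.

```python
import statistics

data = [5, 2, 3, 10, 20]

mean_value = statistics.mean(data)                       # (5+2+3+10+20) / 5
median_odd = statistics.median(data)                     # middle of the sorted list
median_even = statistics.median([3, 5, 2, 3, 10, 20])    # average of the two middle values
mode_num = statistics.mode([5, 2, 3, 3, 2, 3, 1, 5])
mode_city = statistics.mode(["Cairo", "Alex", "Aswan", "Alex"])
print(mean_value, median_odd, median_even, mode_num, mode_city)

# Assumed outlier: one extreme value moves the mean far more than the median.
with_outlier = data + [1000]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)
print(mean_with_outlier, median_with_outlier)
```

The first line of output should match the worked examples above (8, 5, 4, 3, Alex).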

---

### 〰️ C. Measures of Dispersion (Deviation)
Used to measure the **spread** of the data.



**Why do we need them?**
Imagine two sets with the **same Mean (5)**:
* Set 1: `[5, 5, 5, 5, 5]` (No spread)
* Set 2: `[-5, 0, 5, 10, 15]` (High spread)

We need a number to describe this "Spread".

#### 1️⃣ Variance ($\sigma^2$)
The average of squared differences between each value and the mean.
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$

> **Calculation Example:**
> Data: `[-5, 0, 5, 10, 15]`, Mean ($\mu$) = 5
> $\sigma^2 = \frac{(-5-5)^2 + (0-5)^2 + (5-5)^2 + (10-5)^2 + (15-5)^2}{5} = \frac{250}{5} = 50$

#### 2️⃣ Standard Deviation ($\sigma$)
The square root of the variance.
$$\sigma = \sqrt{\sigma^2}$$

* **Preference:** Standard Deviation is usually preferred over Variance for interpretation because taking the square root brings the measure back to the **original units** of the data (e.g., cm instead of cm²).

> **Calculation Example:**
> If Variance ($\sigma^2$) = 50
> Standard Deviation ($\sigma$) = $\sqrt{50} \approx 7.07$
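
The two formulas above can be written out directly. This is a minimal sketch that divides by $N$ (the population form used in these notes); the function name `population_variance` is an assumption for illustration. The standard library also offers `statistics.pvariance` and `statistics.pstdev` for the same purpose.

```python
import math


def population_variance(values):
    """Average of squared differences from the mean (divides by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


set_1 = [5, 5, 5, 5, 5]        # no spread
set_2 = [-5, 0, 5, 10, 15]     # high spread, same mean

var_1 = population_variance(set_1)
var_2 = population_variance(set_2)
std_2 = math.sqrt(var_2)       # back in the original units of the data
print(var_1, var_2, round(std_2, 2))  # 0.0 50.0 7.07
```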

---

## 🌍 4. Population vs. Sample



### 🏙️ Population
* **Definition:** The **whole complete set** of observations.
* **Example:** Heights of all **100,000,000** people in Egypt.
* **Challenge:** Usually impractical to collect due to time and money constraints.

### 🧪 Sample
* **Definition:** A **randomly chosen subset** of the population.
* **Goal:** Represent the whole population without dealing with all the data.
* **Trade-off:** The larger the sample, the better the representation, but the harder it is to collect.
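
The idea of estimating a population measure from a sample can be sketched as follows. The "population" here is synthetic (randomly generated heights with an assumed mean of 170 cm and an assumed spread of 10 cm), not real data, and the sizes and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic "population" of heights in cm (made-up numbers, not real data).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(round(population_mean, 1), round(sample_mean, 1))
```

The sample mean should land close to the population mean even though only 1% of the observations were used.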

# 🐼 5. Statistics using Pandas

## 🛠️ What is Pandas?

While you can apply statistics using **NumPy**, **Pandas** is a library built *on top* of NumPy and designed specifically for **tabular datasets**.

### 🏗️ Data Structures Comparison

| Concept | NumPy Term | Pandas Term | Description |
| :--- | :--- | :--- | :--- |
| **Matrix** | 2D Array | **DataFrame** | The main datatype; represents the whole table. |
| **Vector** | 1D Array | **Series** | Represents a single column or row within the DataFrame. |

### 📂 Reading Tabular Data
Pandas excels at reading common file formats:
* 📄 **CSV Files** (Comma Separated Values)
* 📊 **XLSX Files** (Excel Spreadsheets)

---

## 📈 7. Data Distribution

### 🧐 What is Data Distribution?
It describes how observations are **spread** across the unique values of the data.
> In simple terms: How frequently does each unique value occur?

**Example:**
Variable $X = [0, 5, 5, 5, 10, 0, 5, 10, 5]$
* **55.6%** of data belongs to $(X=5)$
* **22.2%** of data belongs to $(X=0)$
* **22.2%** of data belongs to $(X=10)$
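
The percentages above can be reproduced by counting each unique value and dividing by the number of observations. This is a small standard-library sketch; in Pandas, `pd.Series(x).value_counts(normalize=True)` gives the same proportions.

```python
from collections import Counter

x = [0, 5, 5, 5, 10, 0, 5, 10, 5]

counts = Counter(x)  # frequency of each unique value
distribution = {value: count / len(x) for value, count in sorted(counts.items())}

for value, share in distribution.items():
    print(f"X={value}: {share:.1%}")
# X=0: 22.2%
# X=5: 55.6%
# X=10: 22.2%
```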

### 📊 The Histogram
A common two-dimensional chart for visualizing a distribution:
* **X-axis:** Unique values (or ranges of values, called bins, for continuous data).
* **Y-axis:** Frequency (or probability) of each value.
* **Bar Height:** Represents that frequency or probability.


### 🔄 Types of Distributions

| Type | Description | Shape |
| :--- | :--- | :--- |
| **Uniform Distribution** | All values occur equally with the **same frequency**. | Flat / Rectangular |
| **Normal Distribution** | Values cluster around the **mean**. Symmetric "Bell Curve". | 🔔 Bell Shape |
| **Right-Skewed** | Tail extends to the **right**. Most data is on the left. | 📉 Tail $\rightarrow$ Right |
| **Left-Skewed** | Tail extends to the **left**. Most data is on the right. | 📈 Tail $\leftarrow$ Left |


---

## 🛑 8. Quartiles & Outliers

### 🕵️ What are Quartiles?
Quartiles are the three values that split **sorted** data into four equal parts. They are commonly used to identify **Outliers**.
> **Outlier:** An extreme value that is "strange" or rare (e.g., a person aged 180 years).

From the quartiles we calculate **Fences** (thresholds). Any value outside these fences is treated as an outlier.

### 🔢 The Three Quartiles
1.  **Q1 (First Quartile):** Median of the *lower* half.
2.  **Q2 (Second Quartile):** The **Median** of the entire data.
3.  **Q3 (Third Quartile):** Median of the *upper* half.

### 📐 How to Calculate?
1.  **Sort** the data.
2.  Find **Q2** (Median of all data).
3.  Find **Q1** (Median of the subset *left*/before Q2).
4.  Find **Q3** (Median of the subset *right*/after Q2).
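
The steps above can be sketched as a small function. The name `quartiles` and the sample list are assumptions for illustration. Note that this is the "median of halves" method; library routines such as `numpy.percentile` or Pandas `quantile` interpolate by default and may return slightly different values.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)                  # step 1: sort
    n = len(ordered)
    q2 = statistics.median(ordered)           # step 2: median of all data
    half = n // 2
    lower = ordered[:half]                    # values before the median
    # For an odd count the middle value itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)             # step 3
    q3 = statistics.median(upper)             # step 4
    return q1, q2, q3


data = [7, 1, 3, 9, 5, 11, 13]  # hypothetical sample
print(quartiles(data))  # (3, 7, 11)
```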

### 🚧 Calculating Outlier Fences
We use the **IQR (Inter-Quartile Range)** to build the fences.

$$IQR = Q3 - Q1$$

* **🔻 Lower Fence:** $Q1 - (1.5 \times IQR)$
    * *If value < Lower Fence $\rightarrow$ Outlier*
* **🔺 Upper Fence:** $Q3 + (1.5 \times IQR)$
    * *If value > Upper Fence $\rightarrow$ Outlier*
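
As an illustration of the fence formulas, here is a minimal sketch that computes Q1 and Q3 with the median-of-halves method and flags values outside the fences. The function name `iqr_outliers` and the list of ages are hypothetical.

```python
import statistics


def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


ages = [22, 25, 27, 30, 31, 35, 40, 180]  # hypothetical ages; 180 is the "strange" value
lower_fence, upper_fence, outliers = iqr_outliers(ages)
print(lower_fence, upper_fence, outliers)  # 8.75 54.75 [180]
```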


---

## 🔗 9. Covariance & Correlation

Both measure the **relationship** between two variables ($X$ and $Y$).

### 📉 Covariance ($Cov$)
Describes **how** two variables change together.
* **Formula:** $Cov(X,Y) = \frac{\sum (X_i - \mu_x)(Y_i - \mu_y)}{N}$
* **Interpretation:**
    * **Positive (+):** Direct relationship ($X \uparrow, Y \uparrow$).
    * **Negative (-):** Inverse relationship ($X \uparrow, Y \downarrow$).
    * **Zero:** No **linear** relationship.
* **Limitation:** Values are unbounded, $(-\infty, \infty)$, and depend on the units of $X$ and $Y$, so it is hard to compare strength between different pairs.

### 📊 Correlation ($Corr$)
It is the **normalized** version of Covariance.
* **Formula:** $Corr(X,Y) = \frac{Cov(X,Y)}{\sigma_x \times \sigma_y}$
* **Range:** Always within **$[-1, 1]$**.
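
Both formulas can be written out by hand to see why correlation is easier to compare than covariance. This is a sketch using the population form (dividing by $N$); the function names and the two small lists are hypothetical.

```python
import math


def covariance(x, y):
    """Population covariance: mean of the products of deviations (divides by N)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    sigma_x = math.sqrt(covariance(x, x))  # Cov(X, X) is the variance of X
    sigma_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sigma_x * sigma_y)


hours = [1, 2, 3, 4, 5]              # hypothetical study hours
score = [52, 55, 61, 70, 72]         # hypothetical exam scores
score_x10 = [s * 10 for s in score]  # same scores on a 10x larger scale

print(covariance(hours, score), covariance(hours, score_x10))
print(round(correlation(hours, score), 3), round(correlation(hours, score_x10), 3))
```

Rescaling one variable multiplies the covariance by the same factor, while the correlation stays the same.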

### 🆚 Comparison

| Feature | Covariance | Correlation |
| :--- | :--- | :--- |
| **Range** | $(-\infty, \infty)$ (Unbounded) | $[-1, 1]$ (Normalized) |
| **Interpretation** | Shows **Direction**; its magnitude depends on the units. | Shows **Direction** & **Strength**. |
| **Comparison** | Cannot compare two relationships. | Can compare strength (e.g., 0.8 indicates a stronger linear relationship than 0.3). |

> **Pandas Tip:**
> * `df.cov()` computes the Covariance Matrix.
> * `df.corr()` computes the Correlation Matrix.
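
A short sketch of the two calls in the tip, using a small made-up DataFrame (the column names `length` and `width` and their values are assumptions). One detail worth knowing: `df.cov()` normalizes by $N - 1$ by default (sample covariance), so its values differ slightly from the population formula that divides by $N$.

```python
import pandas as pd

# Hypothetical data used only for illustration.
df = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width": [2.1, 3.9, 6.2, 8.1, 9.8],
})

cov_matrix = df.cov()    # pairwise covariances, normalized by N - 1
corr_matrix = df.corr()  # pairwise Pearson correlations in [-1, 1]

print(cov_matrix)
print(corr_matrix)
```

The diagonal of the covariance matrix holds each column's variance, and the diagonal of the correlation matrix is always 1.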

In [1]:
import pandas as pd

**Reading Tabular Data**

In [2]:
# CSV files ("path" is a placeholder; replace it with the real file path)
df = pd.read_csv("path")
# XLSX files
df = pd.read_excel("path")

FileNotFoundError: [Errno 2] No such file or directory: 'path'

**Pandas for Statistics:**

In [None]:
# Mean of the length column (length is a column in the CSV)
df.length.mean()
# Median of the length column
df.length.median()
# Mode of the city column (mode() returns a Series, so take the first entry)
df.city.mode()[0]
# Standard deviation of the length column (sample version, ddof=1 by default)
df.length.std()
# Variance of the length column (sample version, ddof=1 by default)
df.length.var()
# Minimum value in the length column
df.length.min()
# Maximum value in the length column
df.length.max()
# Summary statistics for the length column
df.length.describe()
# Summary statistics for the entire DataFrame
df.describe()
# Correlation between the length and width columns
df.length.corr(df.width)
# Covariance between the length and width columns
df.length.cov(df.width)
# Number of unique values in the city column
df.city.nunique()
# Frequency count of values in the city column
df.city.value_counts()
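
One practical detail about the calls above: Pandas `var()` and `std()` divide by $N - 1$ by default (`ddof=1`), whereas the $\sigma^2$ formula given earlier in these notes divides by $N$. A minimal sketch of the difference, reusing the `[-5, 0, 5, 10, 15]` example:

```python
import pandas as pd

s = pd.Series([-5, 0, 5, 10, 15])

sample_var = s.var()             # default ddof=1: divides by N - 1
population_var = s.var(ddof=0)   # divides by N, matching the sigma^2 formula
population_std = s.std(ddof=0)

print(sample_var, population_var, round(population_std, 2))  # 62.5 50.0 7.07
```

Passing `ddof=0` reproduces the hand-calculated values of 50 and about 7.07.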

**Create Series:**

In [None]:
# With the default index
s = pd.Series([10, 20, 30, 40, 50])
# Specify the index
s_1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s, s_1)

0    10
1    20
2    30
3    40
4    50
dtype: int64 
 a    10
b    20
c    30
d    40
e    50
dtype: int64


**Create DataFrame**

In [None]:
data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
df

|  | col1 | col2 | col3 |
| :--- | :--- | :--- | :--- |
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


**Rename DataFrame Columns & Index**

In [None]:
# Rename DataFrame columns
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
display(df)
columns = ['col1', 'col2', 'col3']
df.columns = columns
display(df)
# Rename DataFrame index
df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
index = ['row1', 'row2', 'row3']
df.index = index
display(df)


|  | 0 | 1 | 2 |
| :--- | :--- | :--- | :--- |
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


|  | col1 | col2 | col3 |
| :--- | :--- | :--- | :--- |
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


# 🗂️ Common Pandas Dtypes
Pandas supports several data types to handle different kinds of information efficiently.



| Dtype | Description | Examples |
| :--- | :--- | :--- |
| **✅ Bool** | Represents **logical** (Boolean) values. | `True`, `False` |
| **🔢 Int** | Represents **Numerical** data with **integer** values. | `10`, `450`, `-5` |
| **🌊 Float** | Represents **Numerical** data with **continuous** (decimal) values. | `10.5`, `3.14`, `0.001` |
| **🏷️ Category** | Represents **Categorical** data (fixed number of unique values). Memory-efficient. | `"Low"`, `"Medium"`, `"High"` |
| **📦 Object** | General-purpose type that can hold any Python object; commonly used for strings and mixed values. | Strings `"Hello"`, Lists `[1,2]`, Tuples, etc. |


**Pandas Dtypes**

In [13]:
# Get the data types of all columns
df.dtypes
# Get the data type of one column
df["col1"].dtype

CategoricalDtype(categories=[1, 4, 7], ordered=False, categories_dtype=int64)

**Change Datatype:**

In [14]:
# Change the data type of one column
df["col1"] = df["col1"].astype("category")
df.dtypes
# Change the data types of a group of columns
cols = ["col1", "col3"]
df[cols] = df[cols].astype("category")
df.dtypes

col1    category
col2       int64
col3    category
dtype: object


## 🧐 What is EDA?
**Definition:**
> **EDA** stands for **Exploratory Data Analysis**.

It is the process of exploring and understanding the data to extract meaningful **insights**.
* **Pandas** provides built-in methods and features that help us answer different questions about the data efficiently.

---

## 🛠️ Pandas Methods for EDA
Here are the essential functions to explore your DataFrame:

### 1️⃣ Data Preview (Inspecting Rows)
Quickly look at the data to understand its structure.

| Task | Pandas Method |
| :--- | :--- |
| **Get the first `n` rows** | `df.head(n)` |
| **Get the last `n` rows** | `df.tail(n)` |

### 2️⃣ Statistical Summaries
Get a summary of the data distribution.

| Task | Pandas Method |
| :--- | :--- |
| **Numerical Columns** (Mean, std, min, max...) | `df.describe()` |
| **Categorical Columns** (Count, unique, top...) | `df.describe(include='O')` |

### 3️⃣ DataFrame Structure & Metadata
Understand the shape and types of your data.

| Task | Pandas Method |
| :--- | :--- |
| **Get Basic Information** (Dtypes, Non-nulls) | `df.info()` |
| **Get Column Names** | `df.columns` |
| **Get Row Labels** (Index) | `df.index` |
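
The preview, summary, and metadata methods from the tables above can be tried on a tiny DataFrame. The data below is made up for illustration only. For the non-numeric summary this sketch calls `describe()` on the single `city` column, which returns count, unique, top, and freq.

```python
import pandas as pd

# Small hypothetical dataset used only for illustration.
df = pd.DataFrame({
    "city": ["Cairo", "Alex", "Aswan", "Alex"],
    "length": [10.5, 12.0, 9.8, 11.1],
})

print(df.head(2))               # first 2 rows
print(df.tail(2))               # last 2 rows
print(df.describe())            # summary of the numerical columns
city_summary = df["city"].describe()  # count, unique, top, freq for one column
print(city_summary)
df.info()                       # dtypes and non-null counts (prints directly)
print(list(df.columns), list(df.index))
```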

---

### 4️⃣ Analyzing Values
Understand the content of specific columns.

```python
# 1. Get unique values (array of distinct values)
df['column'].unique()

# 2. Get the number of unique values (count of distinct values)
df['column'].nunique()

# 3. Get unique value frequencies (how many times each value appears)
df['column'].value_counts()
```

## 🔄 What is Data Manipulation?
Pandas provides built-in methods and attributes that allow you to **apply different operations** to Pandas DataFrames or Series. Below are the most popular and important ones for modifying and handling data.

---

## 1️⃣ Converting Pandas to NumPy
Sometimes you need to convert data back to NumPy arrays for mathematical operations or model inputs.

| Task | Pandas Method | Result Shape |
| :--- | :--- | :--- |
| **Convert Series to Array** | `series.values` or `series.to_numpy()` | **1D-Array** (Vector) |
| **Convert DataFrame to Array** | `df.values` or `df.to_numpy()` | **2D-Array** (Matrix) |
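
A brief sketch of both conversions, using a small made-up DataFrame, to show the resulting shapes:

```python
import pandas as pd

# Hypothetical 2-row table used only for illustration.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["col1", "col2", "col3"])

vector = df["col1"].to_numpy()  # 1D array from a Series
matrix = df.to_numpy()          # 2D array from the DataFrame

print(vector.shape, matrix.shape)  # (2,) (2, 3)
```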

---

## 2️⃣ Replacing Values
Modifying specific values within the DataFrame.

### 🔹 1. Replace a Single Value
Replacing one specific value with another.
```python
# Replace value 10 with 100 (returns a new DataFrame; df itself is unchanged)
df.replace(10, 100)
```

**Data Manipulation (Replace Values)**

In [20]:
# Each replace() call returns a new DataFrame; only the last result is displayed
data = [[1, 444, 'abc'],
        [2, 555, 'def'],
        [3, 666, 'ghi'],
        [4, 777, 'xyz']]
df = pd.DataFrame(data)
# Replace a single value
df.replace(555, "ali")
# Replace multiple values using a dictionary
df.replace({444: "omer", "abc": 444})
# Replace multiple values with one value
df.replace([1, 666, "abc"], "aaa")

|  | 0 | 1 | 2 |
| :--- | :--- | :--- | :--- |
| 0 | aaa | 444 | aaa |
| 1 | 2 | 555 | def |
| 2 | 3 | aaa | ghi |
| 3 | 4 | 777 | xyz |


**Data Manipulation (Sorting):**