# Exploratory Data Analysis

**Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).**

## Key Terms for Estimates of Location

**Mean**: The sum of all values divided by the number of values.
- Synonym: average

**Weighted mean**: The sum of all values, each multiplied by its weight, divided by the sum of the weights.
- Synonym: weighted average

**Median**: The value such that one-half of the data lies above it and one-half lies below it.
- Synonym: 50th percentile

**Percentile**: The value such that P percent of the data lies below.
- Synonym: quantile

**Weighted median**: The value in the sorted data such that one-half of the sum of the weights lies below it and one-half lies above it.

**Trimmed mean**: The average of all values after dropping a fixed number of extreme values.
- Synonym: truncated mean

**Robust**: Not sensitive to extreme values.
- Synonym: resistant

**Outlier**: A data value that is very different from most of the data.
- Synonym: extreme value
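
The first few definitions can also be written compactly as formulas. The notation here is introduced only for illustration and is an assumption, not part of the notes above: $n$ values $x_1, \dots, x_n$, the same values in sorted order $x_{(1)} \le \dots \le x_{(n)}$, weights $w_i$, and $p$ values trimmed from each end.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\qquad
\bar{x}_{\text{trim}} = \frac{1}{n - 2p}\sum_{i=p+1}^{n-p} x_{(i)}
$$

Setting every $w_i = 1$ in the weighted mean, or $p = 0$ in the trimmed mean, gives back the ordinary mean.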

### **Metric vs Estimate**

Statisticians often use the term **"estimate"** to describe a value calculated from data, emphasizing that it is an approximation — not the exact truth. This reflects the core principle of statistics: dealing with **uncertainty** and understanding how far our calculations might be from the true value.

In contrast, data scientists and business analysts typically refer to the same kind of value as a **"metric"**. Their focus is less on uncertainty and more on **tracking performance** or **achieving specific business objectives**. A metric is treated as a concrete number used for decision-making and evaluating progress.

Though the underlying data might be the same, the terminology reveals a difference in mindset and purpose:

* **Statisticians estimate** – they account for uncertainty and variability.
* **Data scientists measure** – they focus on actionable insights and goals.

**In summary:**

- **Statistics is about understanding uncertainty.**
- **Data science is about driving outcomes.**
- That’s why statisticians **estimate**, and data scientists **measure**.



## **Trimmed Mean**

A **trimmed mean** is a variation of the average (mean) that helps reduce the effect of extreme values (very high or very low numbers).

### **How It Works**

To calculate a trimmed mean:

1. **Sort** the data values from smallest to largest.
2. **Remove** a fixed number of the smallest and largest values (this number is usually the same at both ends).
3. **Calculate** the average (mean) of the values that are left.
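
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `trimmed_mean`, the parameter `k` (number of values dropped from each end), and the sample data are all made up for this example.

```python
def trimmed_mean(values, k):
    """Mean of `values` after dropping the k smallest and k largest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2k < len(values)")
    ordered = sorted(values)              # step 1: sort
    kept = ordered[k:len(ordered) - k]    # step 2: remove k values from each end
    return sum(kept) / len(kept)          # step 3: average what is left


data = [2, 4, 5, 6, 7, 8, 95]             # made-up values; 95 is an extreme value
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, k=1)
print(f"mean = {plain:.2f}, trimmed mean = {trimmed:.2f}")
```

Libraries usually trim a proportion rather than a count; for example, `scipy.stats.trim_mean` takes the fraction to cut from each end.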


#### **Real-Life Example**

In **international diving competitions**, each diver is scored by 5 judges. To prevent bias:

* The **highest** and **lowest** scores are removed.
* The remaining **three scores** are averaged to get the final score.

This makes it harder for a single judge to unfairly influence the result.
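
To see how much protection this gives, the sketch below compares the plain mean with the drop-highest-and-lowest score for two hypothetical panels. The scores and the helper name `diving_score` are invented for illustration and do not come from any real competition.

```python
from statistics import mean


def diving_score(judge_scores):
    """Average of five judges' scores after dropping the highest and lowest."""
    if len(judge_scores) != 5:
        raise ValueError("expected exactly five judge scores")
    middle_three = sorted(judge_scores)[1:-1]
    return mean(middle_three)


fair_panel = [8.0, 8.5, 8.5, 9.0, 8.5]
biased_panel = [8.0, 8.5, 8.5, 9.0, 2.0]   # one judge scores unfairly low

for name, panel in [("fair", fair_panel), ("biased", biased_panel)]:
    print(f"{name}: mean = {mean(panel):.2f}, trimmed = {diving_score(panel):.2f}")
```

In this made-up case the single unfair score moves the plain mean by more than a full point, while the trimmed score shifts only slightly.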

### **When to Use It**

Trimmed means are useful when your data may include **outliers** or **errors** that could skew a normal average. In many situations, a trimmed mean gives a **more representative** summary of the typical value than the regular mean, because a few extreme values cannot dominate it.




## **Weighted Mean (Weighted Average)**

A **weighted mean** is a type of average where each data value is multiplied by a **weight**, and the sum of those products is divided by the sum of the weights.

---
### **Why Use a Weighted Mean?**

There are **two main reasons** to use a weighted mean:

#### 1. **Some values are more variable or less reliable**

Example:
Suppose we’re collecting data from several sensors, and one of them is **less accurate**.
We can **give it a lower weight** so that its less reliable data doesn't affect the final average too much.
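
As a rough sketch of the sensor case, the code below down-weights a noisy sensor. The weighting scheme (inverse variance, i.e. `1 / sd**2`) is one common choice and is an assumption here, not something prescribed by these notes. The readings and noise levels are invented.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical temperature readings from three sensors (degrees Celsius).
readings = [20.0, 20.0, 26.0]
# Hypothetical noise level (standard deviation) per sensor; the third is less accurate.
noise_sd = [1.0, 1.0, 2.0]
# Assumed scheme: inverse-variance weights, so noisier sensors count less.
weights = [1.0 / sd ** 2 for sd in noise_sd]

simple = sum(readings) / len(readings)
weighted = weighted_mean(readings, weights)
print(f"simple mean = {simple:.2f}, weighted mean = {weighted:.2f}")
```

The weighted result sits closer to the two reliable sensors than the simple mean does.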

#### 2. **The data doesn’t fairly represent all groups**

Example:
In an **online experiment**, one group of users may have much more data than another.
To **correct this imbalance**, we can **assign weights** to each group to better represent the entire population.
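
One way to sketch this correction is to weight each observation by its group's population share divided by its share of the sample. The group names, metric values, and population shares below are hypothetical, and this ratio is only one possible weighting scheme.

```python
# Hypothetical experiment data: (group, metric value) per sampled user.
# The "mobile" group is over-sampled relative to "desktop".
samples = [("mobile", 4.0)] * 8 + [("desktop", 10.0)] * 2

# Assumed true population shares (hypothetical): half mobile, half desktop.
population_share = {"mobile": 0.5, "desktop": 0.5}

# Share of each group in the sample.
counts = {}
for group, _ in samples:
    counts[group] = counts.get(group, 0) + 1
sample_share = {g: c / len(samples) for g, c in counts.items()}

# Weight = population share / sample share, so under-represented groups count more.
weights = [population_share[g] / sample_share[g] for g, _ in samples]
values = [v for _, v in samples]

simple = sum(values) / len(values)
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(f"simple mean = {simple:.2f}, reweighted mean = {weighted:.2f}")
```

With these made-up numbers the simple mean is pulled toward the over-sampled group, while the reweighted mean treats both groups as half of the population.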

---


### **In Simple Terms:**

* **Simple Mean** = All values are treated equally.
* **Weighted Mean** = Some values are more important (weighted more).


### **When to Use a Weighted Mean**

Use a weighted mean when:

* Some data points are **more accurate**, **more important**, or **more relevant**.
* Your dataset is **imbalanced**, and you want to **correct for over- or under-represented groups**.
* You’re working with **survey data**, **multiple sensors**, or **group-level summaries**.


### **Quick Summary**

 * **Simple Mean** → All values are equal
 * **Weighted Mean** → Some values matter more



---
# **Median, Robust Estimates & Weighted Median**

## **What Is the Median?**

The **median** is the value that lies **in the middle** of a **sorted list** of data.

* If the number of data points is **odd**, the **middle value** is the median.
* If the number is **even**, the **median** is the **average of the two middle values**, even if that value isn't actually in the dataset.
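
The odd and even cases can be sketched as follows. The function is written out by hand only to show the rule; Python's standard library already provides `statistics.median`, which follows the same convention.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    if not values:
        raise ValueError("median of empty data is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: mean of the two middle values


print(median([7, 1, 3]))       # odd number of values
print(median([7, 1, 3, 5]))    # even number of values; result is not in the data
```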

---

## **Median vs Mean (Average)**

* The **mean** takes into account **all values** in the dataset.
* The **median** depends only on the value (or two values) at the **middle position** in the sorted list.

While the mean might seem more informative because it uses all data points, it’s also more **sensitive to outliers** (very large or very small values). In contrast, the **median is more stable** when outliers are present.

---

## **Example: Household Incomes in Seattle**

Imagine comparing household incomes in two neighborhoods near Lake Washington:

* **Medina**
* **Windermere**

If you calculate the **mean income**, Medina will appear much richer, because **Bill Gates lives there** and his income **heavily skews the average**.

However, if you use the **median**, Bill Gates' income won’t impact the result much, because the **median looks only at the middle** of the data — **not the extremes**.

This makes the **median a more "robust" measure** when there are outliers.
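
The effect is easy to reproduce with invented numbers. The incomes below are hypothetical (in thousands of dollars) and are not real data for either neighborhood.

```python
from statistics import mean, median

# Made-up household incomes, in thousands of dollars.
typical = [85, 90, 95, 100, 110, 120]
with_outlier = typical + [1_000_000]   # add one extremely high income

print(f"without outlier: mean = {mean(typical):,.1f}, median = {median(typical):,.1f}")
print(f"with outlier:    mean = {mean(with_outlier):,.1f}, median = {median(with_outlier):,.1f}")
```

Adding the single extreme household multiplies the mean by more than a thousand, while the median moves only from 97.5 to 100.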

---

## **What Does "Robust" Mean Here?**

A **robust estimate** is one that **doesn't get affected much by extreme values or errors** in the data.
The **median** is robust.
The **mean** is **not robust**, because outliers can pull it far away from the “typical” value.

---

## **Weighted Median — Like Median, But With Weights**

Just like we can compute a **weighted mean**, we can also compute a **weighted median**.

### How It Works:

1. First, sort the data values.
2. Each value has an associated **weight** (a number representing importance, frequency, or reliability).
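3. Instead of finding the middle value by position, the **weighted median** is the point where:

   * The **total weight** of values **below it** is **equal to** the **total weight** of values **above it**. In practice, this is the first value in sorted order at which the **cumulative weight** reaches **half of the total weight**.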
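
A minimal sketch of these steps is shown below. It uses one common convention: it returns the first sorted value whose cumulative weight reaches half of the total weight. Other conventions average two neighbouring values when the cumulative weight lands exactly on half. The function name and the data are illustrative assumptions.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total weight."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    pairs = sorted(zip(values, weights))   # sort the values, keeping each weight attached
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight               # running total of the weights so far
        if cumulative >= half:
            return value


# Heavier weights on the high values pull the weighted median upward.
print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 2, 6]))
# A lightly weighted extreme value barely matters.
print(weighted_median([1, 2, 3, 4, 1000], [3, 3, 3, 3, 1]))
```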

### Key Properties:

* The **weighted median**, like the median, is **robust to outliers**.
* It's useful when **some data points are more important or frequent** than others — for example, in surveys, population studies, or sensor data.

---

## **Summary**

| Term                | Meaning                                                             |
| ------------------- | ------------------------------------------------------------------- |
| **Mean**            | Average of all values; **sensitive to outliers**                    |
| **Median**          | Middle value in sorted list; **robust to outliers**                 |
| **Weighted Mean**   | Average where some values count more; **used for imbalanced data**  |
| **Weighted Median** | Middle value based on weights; **robust + accounts for importance** |

---

## **Outliers — What Are They and Why They Matter**

### **What Is an Outlier?**

An **outlier** is a value in a dataset that is **very different** (much higher or lower) compared to the rest of the values.

Outliers can **pull the results** in one direction and distort the true picture of the data — especially when using the **mean** (average).

---

### **Median Is Robust — It Handles Outliers Well**

The **median** is considered a **robust estimate of location** because it is **largely unaffected** by outliers: changing an extreme value leaves the middle of the sorted data where it was.

On the other hand, the **mean** can be **heavily influenced** by outliers, making it a **less reliable measure** when such values are present.

---

### **Outliers Are Not Always Wrong**

Just because a value is an outlier **doesn’t mean it’s incorrect**.

#### For example:

If you're measuring incomes in a neighborhood and **Bill Gates lives there**, his income would be a **huge outlier**, but it’s **100% valid**.

However, outliers **can also happen due to errors**, such as:

* ❌ Mixing up units (e.g., kilometers vs meters)
* ❌ Bad readings from sensors
* ❌ Data entry mistakes

---

### **Mean vs. Median in the Presence of Outliers**

| Statistic  | Sensitive to Outliers? | Suitable When Outliers Exist? |
| ---------- | ---------------------- | ----------------------------- |
| **Mean**   | ✅ Yes                  | ❌ Not ideal                   |
| **Median** | ❌ No                   | ✅ Much better                 |

If your data contains errors or outliers, the **mean can give misleading results**, while the **median will still give a reasonable estimate** of the center.

---

### **What Should You Do with Outliers?**

Outliers should not be ignored. Instead, they should be:

1. **Identified**
2. **Investigated**
3. **Understood** — Are they valid values or errors?

Sometimes they **reveal important insights**, and other times they point to **data quality problems**.
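
For the identification step, one widely used convention (an assumption here, not something these notes prescribe) is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to made-up distance data in which one entry looks like it was recorded in meters instead of kilometers.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (a common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up run distances in kilometers; 5000.0 may be a unit mix-up (meters).
distances_km = [4.8, 5.1, 5.0, 4.9, 5.2, 5000.0, 5.3]
print(flag_outliers(distances_km))
```

Flagging a value only identifies it; whether it is an error or a valid extreme still has to be investigated.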

---

## **Summary**

* **Outlier** = A value that is far away from the rest of the data
* **Median** = Robust to outliers; still gives a stable result
* **Mean** = Sensitive to outliers; can be misleading
* **Outliers aren’t always wrong**, but they are always worth checking

---
