### 1. **What is Statistics?**

  > Statistics is the discipline that involves collecting, organizing, summarizing, analyzing, and interpreting data to make informed decisions or predictions.

* **Descriptive Statistics**: Summarizes data.

  * *Example*: Average score of students in a class.
* **Inferential Statistics**: Uses sample data to draw conclusions or make predictions about a larger population.

  * *Example*: Predicting election results from a survey.

---

### 2. **Types of Variables**

* **Categorical**:

  * *Nominal*: No order.
    *Example*: Eye color (blue, green, brown)
  * *Ordinal*: Has order.
    *Example*: Movie ratings (poor, average, good)
* **Numerical**:

  * *Discrete*: Countable.
    *Example*: Number of children in a family
  * *Continuous*: Measurable.
    *Example*: Height of students

---

### Variable Measurement

Variables are measured on one of four scales:

1. **Nominal**
2. **Ordinal**
3. **Interval**
4. **Ratio**

Knowing which scale a variable is measured on matters because the scale determines which summaries and statistical methods are valid for it. Let’s explore each type:

#### Nominal

Nominal data is a kind of categorical (qualitative) data: values fall into distinct classes that have no inherent order. Examples include:

* Colors
* Gender
* Types of flowers

This type of data is used to label variables without providing any quantitative value.

#### Ordinal

In ordinal data, the order of values matters, but the size of the gap between them does not. For instance:

* Students' marks: 100, 96, 85, 57, 44
  * Ranks: 1st, 2nd, 3rd, 4th, 5th

Here, the ranks record only the order. The gap between 1st and 2nd (4 marks) is not the same as the gap between 3rd and 4th (28 marks).

#### Interval

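Interval data has a meaningful order and equal, meaningful differences between values, but lacks a true zero. For example:

* Temperature in Fahrenheit: 70°F, 80°F, 90°F (each step is the same 10-degree difference)
* Calendar years: 1990, 2000, 2010 (year 0 does not mean "no time")

Zero degrees Fahrenheit does not signify the absence of temperature, so 80°F is not "twice as hot" as 40°F. Distance, by contrast, has a true zero and belongs to the ratio scale described next.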

#### Ratio

Ratio data has all the properties of interval data, but with a meaningful zero point, indicating the absence of the variable being measured. This means you can make meaningful statements about how many times greater one object is compared to another. Examples include:

* **Height:** Zero height means the absence of height.
* **Weight:** Zero weight means the absence of weight.
* **Income:** Zero income means no income.
* **Distance:** Zero distance means no distance traveled.

### Examples and Analysis

#### Height

* **Data:** 150 cm, 160 cm, 170 cm, 180 cm, 190 cm
* **Properties:**
  * Order matters: 150 cm < 160 cm < 170 cm < 180 cm < 190 cm
  * Value matters: The difference between 150 cm and 160 cm is the same as between 180 cm and 190 cm.
  * True zero: A height of 0 cm indicates no height.

#### Weight

* **Data:** 50 kg, 60 kg, 70 kg, 80 kg, 90 kg
* **Properties:**
  * Order matters: 50 kg < 60 kg < 70 kg < 80 kg < 90 kg
  * Value matters: The difference between 50 kg and 60 kg is the same as between 80 kg and 90 kg.
  * True zero: A weight of 0 kg indicates no weight.

#### Income

* **Data:** $30,000, $40,000, $50,000, $60,000, $70,000
* **Properties:**
  * Order matters: $30,000 < $40,000 < $50,000 < $60,000 < $70,000
  * Value matters: The difference between $30,000 and $40,000 is the same as between $60,000 and $70,000.
  * True zero: An income of $0 indicates no income.

#### Distance

* **Data:** 10 km, 20 km, 30 km, 40 km, 50 km
* **Properties:**
  * Order matters: 10 km < 20 km < 30 km < 40 km < 50 km
  * Value matters: The difference between 10 km and 20 km is the same as between 40 km and 50 km.
  * True zero: A distance of 0 km indicates no distance traveled.

### Key Points

* **Order Matters:** Like ordinal and interval data, the order of values in ratio data is meaningful.
* **Value Matters:** The difference between values is consistent and meaningful.
* **True Zero:** Unlike interval data, ratio data has a true zero, making statements like "twice as much" meaningful.
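The true-zero point can be checked with a short sketch. It assumes only the standard unit conversions; the function names are illustrative. A ratio of two distances survives a change of units, while a ratio of two temperatures does not.

```python
KM_PER_MILE = 1.609344


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def km_to_miles(km: float) -> float:
    """Convert a distance from kilometres to miles."""
    return km / KM_PER_MILE


# Interval scale: the "ratio" changes when the unit changes.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: the ratio is the same in any unit.
ratio_km = 20 / 10
ratio_miles = km_to_miles(20) / km_to_miles(10)

print(f"20°C vs 10°C: {ratio_celsius:.2f}x in Celsius, {ratio_fahrenheit:.2f}x in Fahrenheit")
print(f"20 km vs 10 km: {ratio_km:.2f}x in km, {ratio_miles:.2f}x in miles")
```

Because the temperature ratio depends on the unit chosen, "twice as hot" carries no meaning on an interval scale, while "twice as far" does on a ratio scale.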

---

# 📊 Foundation of Statistics

*Understanding Data with Clarity and Purpose*

Statistics is the science of learning from data. It forms the foundation for nearly every data-driven decision in business, science, government, and technology. Whether you're analyzing trends, building machine learning models, or making predictions, a strong foundation in statistics is essential.

---

## 🧱 1. What Is Statistics?

**Definition:**
Statistics is the discipline that involves **collecting**, **organizing**, **summarizing**, **analyzing**, and **interpreting data** to make informed decisions or predictions.

### 🔍 Example

Imagine you survey 1,000 people about their favorite smartphone brand. You can use statistics to:

* Summarize responses (e.g., 60% prefer Apple)
* Predict trends (e.g., Apple's popularity among youth)
* Compare groups (e.g., preferences across regions)

---

## 🔢 2. Types of Data

### a. **Qualitative (Categorical) Data**

* Non-numeric data grouped into categories
* Examples: colors, brands, gender, blood types

### b. **Quantitative (Numerical) Data**

* Data represented with numbers
* Can be further classified into:

  * **Discrete**: Countable values (e.g., number of students in a class)
  * **Continuous**: Measurable values (e.g., weight, height)

---

## 📏 3. Scales of Measurement

These scales help determine which statistical methods can be applied.

| Scale    | Description                                | Example                    |
| -------- | ------------------------------------------ | -------------------------- |
| Nominal  | Categories with no order                   | Colors, Gender             |
| Ordinal  | Ordered categories                         | Movie ratings (★ to ★★★★★) |
| Interval | Numeric with equal intervals, no true zero | Temperature in °C          |
| Ratio    | Numeric with true zero                     | Height, Weight, Age        |

---

## 📈 4. Descriptive vs Inferential Statistics

### ✏️ **Descriptive Statistics**

* Used to summarize and describe features of data
* Includes:

  * Measures of central tendency (Mean, Median, Mode)
  * Measures of dispersion (Range, Variance, Standard Deviation)
  * Charts (bar chart, pie chart, histogram)
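As a minimal sketch, the measures listed above can be computed with Python's standard `statistics` module. The `scores` list is made-up data used only for illustration.

```python
import statistics

# Hypothetical test scores for one class (illustrative data only)
scores = [72, 85, 85, 90, 64, 78, 85, 91]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)      # sample standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.2f}, std dev={std_dev:.2f}")
```

`statistics.variance` and `statistics.stdev` treat the data as a sample; `pvariance` and `pstdev` are the population versions.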

### 🧪 **Inferential Statistics**

* Used to make generalizations or predictions about a population from a sample
* Includes:

  * Confidence intervals
  * Hypothesis testing
  * Regression analysis

### 🔍 Example

If you measure the test scores of 30 students and estimate the average score of the entire school, you're using **inferential statistics**.
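One way to sketch this example is a confidence interval for the school-wide average. The 30 scores below are simulated, and the interval uses a normal approximation, which is an assumption; with only 30 observations a t-based interval would be slightly wider.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical sample: simulated test scores for 30 students
sample = [random.gauss(70, 10) for _ in range(30)]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# 95% confidence interval using the normal approximation (z is about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.1f}")
print(f"Approximate 95% CI for the school average: ({ci_low:.1f}, {ci_high:.1f})")
```

The sample mean alone is a descriptive statistic; attaching a margin of error to it and making a claim about the whole school is the inferential step.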

---

## 🎲 5. Basics of Probability

Probability measures the **likelihood** of an event occurring. It ranges from 0 (impossible) to 1 (certain).

### 📌 Concepts

* **Independent events**: One event doesn't affect the other (e.g., flipping two coins)
* **Conditional probability**: Probability of one event given that another has occurred
* **Bayes’ Theorem**: Updates the probability of an event as new evidence becomes available
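The concepts above can be illustrated with a small worked sketch. The coin example follows the text; the screening-test numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are hypothetical values chosen for illustration, and the function name `bayes_posterior` is not from the original.

```python
from fractions import Fraction

# Independent events: the probability of two heads is the product of each probability.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads


def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    # Total probability of a positive test, with or without the condition
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


# Hypothetical screening test
posterior = bayes_posterior(
    prior=Fraction(1, 100),
    true_positive_rate=Fraction(95, 100),
    false_positive_rate=Fraction(5, 100),
)

print(f"P(two heads) = {p_two_heads}")
print(f"P(condition | positive test) = {posterior} (about {float(posterior):.1%})")
```

Even with an accurate test, the updated probability stays well below 50% here because the condition is rare to begin with.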

---

## 🧪 6. Population vs Sample

| Term       | Description                           | Example                  |
| ---------- | ------------------------------------- | ------------------------ |
| Population | The entire group you're studying      | All people in a country  |
| Sample     | A subset used to study the population | 1,000 survey respondents |

**Why sample?**
Studying the entire population is often impractical or impossible. A well-chosen sample gives reliable results at lower cost and effort.

---

### 4. **Data Collection**

* **Observational**: No control over variables.
  *Example*: Surveying public opinion
* **Experimental**: Researcher controls variables.
  *Example*: Testing a new drug on two groups

---

To survey 1,000 voters, we need to sample them properly. Here's how it's done — and what to avoid:

### 🚫 **What *not* to do:**

* **Sample of convenience**: Picking 1,000 people from your own town is easy, but not reliable. People in your town may not represent the whole country.
* **This causes bias**: Your sample might lean toward one political view, making the results misleading.

### 🎯 **3 Common Biases to Avoid:**

1. **Selection bias**: Some groups are more likely to be chosen (e.g., only people from one area).
2. **Non-response bias**: Some selected people don’t respond (e.g., parents busy at dinner time), and those who do respond may differ from those who don’t.
3. **Voluntary response bias**: Only people with strong opinions respond (like angry or thrilled customers writing reviews).

### ✅ **Better Sampling Methods:**

#### 1. **Simple Random Sampling**

* Every group of 1,000 voters has an equal chance of being selected.
* Example: A computer randomly dials phone numbers.

#### 2. **Stratified Random Sampling**

* Divide the population into groups (e.g., urban, suburban, rural).
* Randomly sample from each group.
* Combine the results. This usually gives more precise estimates than a simple random sample of the same size, but it is harder to carry out.
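Both methods can be sketched with Python's `random` module. The population of 10,000 voters and its regional split are made up, and allocating the stratified sample in proportion to each region's size is an assumption; the text only says to sample randomly from each group.

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical population: (voter_id, region) pairs
regions = ["urban"] * 6000 + ["suburban"] * 3000 + ["rural"] * 1000
population = list(enumerate(regions))

# Simple random sampling: every set of 1,000 voters is equally likely.
simple_sample = rng.sample(population, 1000)


def stratified_sample(population, total, rng):
    """Draw a random sample from each region, proportional to its size."""
    strata = {}
    for person in population:
        strata.setdefault(person[1], []).append(person)
    sample = []
    for members in strata.values():
        k = round(total * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(population, 1000, rng)

simple_counts = Counter(region for _, region in simple_sample)
stratified_counts = Counter(region for _, region in stratified)
print("Simple random sample:", dict(simple_counts))
print("Stratified sample:   ", dict(stratified_counts))
```

The simple random sample matches the regional split only approximately, while the stratified sample matches it by construction.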

---

## Sampling Techniques

1. **Simple Random Sampling**

   * Definition: Every member of the population has an equal chance of being selected.
   * Example: Randomly selecting individuals for a survey.
2. **Stratified Sampling**

   * Definition: The population is divided into non-overlapping groups (strata) and samples are taken from each group.
   * Example: Sampling based on gender (male, female) or age groups.
3. **Systematic Sampling**

   * Definition: Selecting every nth individual from the population.
   * Example: Surveying every 8th person entering a mall.
4. **Convenience Sampling**

   * Definition: Sampling individuals who are conveniently available.
   * Example: Surveying colleagues or passers-by because they are easy to reach. Such samples are prone to bias.
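Systematic sampling from the list above can be sketched in a few lines. Choosing a random starting point within the first interval is a common convention and an assumption here; the visitor list is hypothetical.

```python
import random


def systematic_sample(items, step, rng):
    """Select every `step`-th item, starting from a random offset."""
    start = rng.randrange(step)  # random start within the first interval
    return items[start::step]


# Hypothetical stream of 80 people entering a mall, numbered in arrival order
visitors = list(range(1, 81))
surveyed = systematic_sample(visitors, step=8, rng=random.Random(1))

print(f"Surveyed {len(surveyed)} of {len(visitors)} visitors: {surveyed}")
```

If the arrival order has a repeating pattern that lines up with the step size, a systematic sample can be biased, so the ordering deserves a quick check before using this method.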

## Practical Applications

* **Election Polls**: Using random sampling to predict election results.
* **Household Surveys**: RBI using stratified or random sampling for surveys.
* **Drug Testing**: Using stratified sampling based on age groups for testing new drugs.

In short: **Random sampling reduces bias** and gives more trustworthy results.

---

### 5. **Data Cleaning Basics**

* **Missing Values**: Filling blanks
  *Example*: If age is missing, fill with average age
* **Outliers**: Unusual values
  *Example*: Income of ₹10 Cr in a middle-income group
* **Encoding Categorical Data**: Convert categories to numbers
  *Example*: Gender → Male = 0, Female = 1
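The three cleaning steps can be sketched with the standard library. All values are made up. Filling with the mean and the 0/1 encoding follow the examples above; flagging outliers with the 1.5 × IQR rule is an assumption, since the text does not name a detection method.

```python
import statistics

# Hypothetical raw columns; None marks a missing value
ages = [25, None, 35, 30, 40]
incomes = [30_000, 35_000, 40_000, 42_000, 45_000,
           48_000, 50_000, 55_000, 100_000_000]  # last value is 10 crore
genders = ["Male", "Female", "Female", "Male"]

# 1. Missing values: fill blanks with the mean of the observed ages
mean_age = statistics.mean(a for a in ages if a is not None)
filled_ages = [mean_age if a is None else a for a in ages]

# 2. Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower or x > upper]

# 3. Encoding: map each category to a number
encoding = {"Male": 0, "Female": 1}
gender_codes = [encoding[g] for g in genders]

print("Filled ages:", filled_ages)
print("Outliers:", outliers)
print("Encoded genders:", gender_codes)
```

Whether to fill, drop, or keep such values depends on the analysis; a flagged outlier may be a data entry error or a genuine extreme value.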

## Observational Study vs. Experiment

---

### 🍖 **Red Meat and Cancer – What Does the Data Say?**

You may have seen news saying that people who eat red meat have higher rates of certain cancers. But **this alone doesn’t show that red meat causes cancer**.

#### 🔍 **Why Not?**

* People who don’t eat red meat often **exercise more** and **drink less alcohol**.
* These **other factors** could be the real reason for lower cancer rates.

---

### 🧪 **Observational Study vs. Experiment**

#### 📊 **Observational Study**

* Just **observes** people and **measures outcomes**.
* Example: Comparing cancer rates between red meat eaters and non-eaters.
* **Shows association**, **not causation**.
* Could be affected by **confounding factors** (also called **lurking variables**) like exercise or alcohol use.
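A small simulation can illustrate how a confounder produces an association without causation. Every probability below is invented for the sketch, and in this made-up model cancer risk depends only on exercise, never on red meat.

```python
import random

rng = random.Random(7)


def simulate_person(rng):
    """Return (exercises, eats_red_meat, gets_cancer) for one simulated person."""
    exercises = rng.random() < 0.5
    # In this made-up model, people who exercise are less likely to eat red meat.
    eats_red_meat = rng.random() < (0.4 if exercises else 0.8)
    # Cancer risk depends ONLY on exercise here, not on red meat.
    gets_cancer = rng.random() < (0.05 if exercises else 0.15)
    return exercises, eats_red_meat, gets_cancer


def cancer_rate(group):
    """Fraction of people in the group who get cancer."""
    return sum(cancer for _, _, cancer in group) / len(group)


people = [simulate_person(rng) for _ in range(50_000)]
meat = [p for p in people if p[1]]
no_meat = [p for p in people if not p[1]]

# Naive comparison, ignoring exercise
overall_gap = cancer_rate(meat) - cancer_rate(no_meat)

# Same comparison within each exercise group
gap_exercisers = (cancer_rate([p for p in meat if p[0]])
                  - cancer_rate([p for p in no_meat if p[0]]))
gap_non_exercisers = (cancer_rate([p for p in meat if not p[0]])
                      - cancer_rate([p for p in no_meat if not p[0]]))

print(f"Overall gap in cancer rate: {overall_gap:+.3f}")
print(f"Gap among exercisers:       {gap_exercisers:+.3f}")
print(f"Gap among non-exercisers:   {gap_non_exercisers:+.3f}")
```

The overall gap appears even though red meat has no effect in the model, and it largely vanishes once people with the same exercise habits are compared.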

---

### 🧫 **Experiment**

To find out if **red meat actually causes cancer**, we need an experiment.

#### Here's how it's done

1. **Treatment & Control Groups**:

   * One group gets the **treatment** (e.g., medication or specific diet).
   * The other group is the **control** (does not get the treatment).

2. **Random Assignment**:

   * People are randomly put into the treatment or control group (like tossing a coin).
   * This makes the groups **similar**, on average, in every way **except** for the treatment.

3. **Placebo**:

   * The control group receives something that looks like the treatment but does **nothing**.
   * This accounts for the **placebo effect** — where just thinking you're getting treated can make you feel better.

4. **Double-Blind Setup**:

   * Neither the **participants** nor the **evaluators** know who got the treatment.
   * This avoids **bias** in judging results.
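Random assignment (step 2) can be sketched as follows. The participant labels are hypothetical, and shuffling then splitting in half is one variant of the coin toss described above, chosen here because it guarantees equal group sizes.

```python
import random


def randomly_assign(participants, rng):
    """Shuffle participants and split them into treatment and control groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant labels
participants = [f"participant_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(participants, random.Random(3))

print("Treatment group:", treatment)
print("Control group:  ", control)
```

Because chance alone decides the grouping, neither the researcher nor the participants can steer particular people into the treatment group.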

---

### 💊 **The Placebo Effect**

* It's the effect of **thinking you're being treated**, even if you're not.
* It’s not fully understood — it lies between **biology and psychology**.
* If you're curious, check out: *“The Weird Power of the Placebo Effect, Explained”* by Brian Resnick.

---

A good experiment randomly assigns people to either the treatment or control group.

* This random process helps make sure that any other factors (called **confounders**) affect both groups equally.
* However, sometimes, just by chance, a confounder might still be more common in one group.
* The important part is: **because we used random assignment**, we can **measure how much of the difference in results could just be due to chance**.
* Later, we’ll learn how to calculate that.

---



<details>

<summary><h2>Questionnaire - Click Here </h2></summary>

### **1. Sampling at Times Square**

A news company located next to Times Square in New York wants to get a sense of how people feel about a proposed law on immigration. A reporter steps out of the building and randomly selects 100 people walking there and asks them about the proposed law.  
**What can we say about this sampling plan?**

* [ ] It represents a simple random sampling  
* [ ] It leads to selection bias  
* [ ] It leads to non-response bias  
* [ ] It leads to voluntary response bias  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: It leads to selection bias**  
This sample is not representative of the general population. People walking in Times Square are likely to differ systematically from the broader U.S. population.
</details>

---

### **2. Sampling Car Owners**

A car company wants to get a sense of how satisfied the owners of its new car model are with the quality of that car. It randomly selects 250 numbers from all the vehicle registration numbers that have been issued for this model and contacts the owners.  
**What can we say about this sampling plan?**

* [ ] It represents a simple random sampling  
* [ ] It leads to selection bias  
* [ ] It leads to non-response bias  
* [ ] It leads to voluntary response bias  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: It represents a simple random sampling**  
The car company is randomly selecting from the entire population of interest, making this a proper random sample.
</details>

---

### **3. Airline Email Survey (No Incentive)**

An airline sends an email to a random sample of customers who flew with the airline the day before. The email requests a 10-minute survey to improve service.  
**What can we say about this sampling plan?**

* [ ] It represents a simple random sampling  
* [ ] It leads to selection bias  
* [ ] It leads to non-response bias  
* [ ] It leads to voluntary response bias  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: It leads to non-response bias**  
People who are too busy or uninterested might ignore the survey, skewing the results.
</details>

---

### **4. Airline Email Survey (with $100 Incentive)**

Same as above, but now the email also offers a $100 gift card for completing the survey.  
**What can we say about this sampling plan?**

* [ ] It represents a simple random sampling  
* [ ] It leads to selection bias  
* [ ] It leads to non-response bias  
* [ ] It leads to voluntary response bias  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: It represents a simple random sampling**  
The incentive increases the response rate, helping reduce bias while maintaining randomness.
</details>

---

### **5. Paleo Diet Study**

A news channel recruits 100 people who have followed the Paleo diet and 100 who haven’t. They find more weight loss in the diet group, with statistical significance.  
**Which statement is true?**

* [ ] This is a randomized controlled experiment.  
* [ ] It is possible that the difference in weight loss is due to the placebo effect.  
* [ ] A future experiment proving the diet doesn’t work would confirm placebo effect as the cause.  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: It is possible that the difference in weight loss is due to the placebo effect**  
Without random assignment or placebo control, we can’t rule out placebo or other confounding variables.
</details>

---

### **6. Estrogen & Bone Loss Experiment**

Which group should be recruited for a proper experiment testing whether oral contraceptives prevent bone loss in female cross country runners?

* [ ] A group of women who are competitive runners and another group of non-athletes  
* [ ] A group of runners who already take oral contraceptives and another group who don’t  
* [ ] A group of runners not currently taking contraceptives, but willing to take them if asked  

<details>
<summary>🔍 Reveal Answer</summary>

**Answer: A group of runners not currently taking contraceptives, but willing to take them if asked**  
This setup allows for proper random assignment to treatment and control groups.
</details>

### **📌 Estimation Considerations**

**Question:**  
What should we take into consideration when making an estimate?

* [ ] It should be close to the parameter.  
* [ ] We should be aware of possible bias.  
* [ ] We should be aware of the unavoidable chance error.  

<details>
<summary>🔍 Reveal Answer</summary>

**Correct Answer:**  
✅ All of the above

### ✅ Explanation

When making an estimate, we should consider:

1. **Closeness to the Parameter** – The estimate should aim to be as close as possible to the actual population value.
2. **Bias** – Any consistent error in the estimate caused by the sampling method or other flaws should be avoided.
3. **Chance Error** – Some variability is always present due to random sampling; we must account for this natural uncertainty.

</details>

</details>


```python
# Requires IPython (for example, inside a Jupyter notebook).
from IPython.display import display, HTML

# Collapsible HTML summary of common descriptive measures
html_content = """
<details>
  <summary><strong>Click to expand: Summary of Key Findings</strong></summary>
  <ul>
    <li><b>Mean:</b> The average value across the dataset.</li>
    <li><b>Median:</b> The middle value when data is sorted.</li>
    <li><b>Mode:</b> The most frequently occurring value.</li>
    <li><b>Skewness:</b> Indicates asymmetry of the distribution.</li>
  </ul>
</details>
"""

# Render the HTML in the notebook's output cell
display(HTML(html_content))
```
