 
1. Statistics: The science of collecting, organizing, analyzing, and interpreting data.

2. Two main branches:
   - Descriptive statistics: Organizing and summarizing data
   - Inferential statistics: Using sample data to draw conclusions about a population

3. Key terms:
   - Population: The complete set of all items or individuals under study
   - Sample: A subset of the population, used to make inferences about the whole

 

 
1. Simple Random Sampling:
   - Every member of the population has an equal chance of being selected
   - Unbiased, but can be impractical for large populations

2. Stratified Sampling:
   - Population is divided into non-overlapping groups (strata)
   - Samples are then taken from each stratum
   - Ensures representation of all subgroups

3. Systematic Sampling:
   - Selecting every k-th individual from the population, usually after a random starting point
   - The interval k is determined by dividing population size by desired sample size
   - Easy to implement but can introduce bias if there's a pattern in the data

4. Convenience Sampling:
   - Selecting readily available individuals
   - Quick and easy, but often leads to bias
   - Not representative of the entire population
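
The first three techniques above can be sketched with Python's standard `random` module. This is a minimal illustration, not a production sampler: the population of 100 unit IDs, the two strata, and the helper names are all hypothetical, and the stratified version assumes proportional allocation (the same fraction drawn from every stratum).

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n items so that every item has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th item after a random start, with k = len(population) // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return population[start::k][:n]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(0)  # fixed seed so the run is repeatable
population = list(range(100))  # hypothetical population of 100 unit IDs
strata = {"low": population[:60], "high": population[60:]}  # hypothetical strata

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)

print("simple random:", sorted(srs))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
```

The systematic sample always has a constant gap of k between selected IDs, which is why a periodic pattern in the data can bias it.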

 

Sampling techniques are used for several important reasons:

- Cost-effectiveness: Studying an entire population can be prohibitively expensive and time-consuming. Sampling allows researchers to gather meaningful data at a fraction of the cost.
- Time efficiency: Collecting data from a sample is much faster than surveying an entire population, enabling quicker decision-making and more timely research.

# Variables:

Variables in statistics are characteristics or attributes that can be measured or observed. They are typically classified into two main categories:

1. Qualitative (Categorical) Variables:
   - Nominal: No natural order (e.g., colors, gender)
   - Ordinal: Have a natural order (e.g., education levels)

2. Quantitative (Numerical) Variables:
   - Discrete: Countable, whole numbers (e.g., number of children)
   - Continuous: Can take any value within a range (e.g., height, weight)

Additionally, variables can be classified as:
- Independent: Manipulated or controlled in a study
- Dependent: Observed or measured outcomes





---

### 1. Frequency Distribution
   **1.1 Discrete Data**
   - **Bar Graph**  
     - In addition to showing frequency, the spacing between bars emphasizes that the data is discrete (non-continuous). Bar graphs can also be grouped or stacked to compare multiple categories.
  
   **1.2 Continuous Data**
   - **Histogram**
     - Groups continuous values into bins and shows the frequency (or density) of each bin. When scaled so that the total bar area equals 1, it approximates the probability density function of the data.
     - The choice of bin width can significantly affect the shape of the histogram. Too wide a bin can obscure important details, while too narrow a bin can introduce noise.
     - Useful for showing the shape of a continuous dataset, such as a normal or skewed distribution.
   - **Probability Density Function (PDF)**
     - PDFs are the continuous counterpart of the probability mass functions (PMFs) used for discrete data.
     - Key property: the total area under a PDF equals 1, which corresponds to the total probability.
     - PDFs are integral to modeling distributions such as the normal (bell curve) and exponential distributions.
     - Useful in statistical modeling, hypothesis testing, and simulations (e.g., Monte Carlo methods).


   - **Kernel Density Estimation (KDE)**
     - KDE smooths the data by placing a kernel (usually a Gaussian function) at each data point, which helps visualize the distribution without the discrete jumps of histograms. 
     - Bandwidth (smoothing parameter) plays a crucial role in balancing smoothness with the level of detail in the estimate.
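
To make the roles of bin width and bandwidth concrete, here is a small standard-library sketch. The helper names `density_histogram` and `gaussian_kde` are illustrative (they are not library functions), the data are synthetic draws from a standard normal distribution, and the bin widths and bandwidths were chosen only for demonstration.

```python
import math
import random

def density_histogram(values, bin_width):
    """Return (left_edges, densities); the bar areas sum to 1."""
    low = min(values)
    n_bins = int((max(values) - low) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - low) / bin_width)] += 1
    densities = [c / (len(values) * bin_width) for c in counts]
    edges = [low + i * bin_width for i in range(n_bins)]
    return edges, densities

def gaussian_kde(values, bandwidth):
    """Return a function x -> density estimate built from Gaussian kernels."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

rng = random.Random(1)  # fixed seed for repeatability
data = [rng.gauss(0, 1) for _ in range(500)]  # synthetic sample

# Narrow bins give many noisy bars; wide bins give few, coarse bars.
for width in (0.25, 1.0):
    edges, dens = density_histogram(data, width)
    area = sum(d * width for d in dens)
    print(f"bin width {width}: {len(dens)} bins, total area = {area:.3f}")

# A small bandwidth follows the sample closely; a large one smooths it.
for bandwidth in (0.1, 0.5):
    kde = gaussian_kde(data, bandwidth)
    print(f"bandwidth {bandwidth}: estimated density at 0 = {kde(0.0):.3f}")
```

Both estimates integrate to 1, so either can be read as an approximation of the underlying PDF. Only the amount of smoothing differs.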


#### Measuring and Comparing Datasets
###  Measures of Central Tendency
   **3.1 Mean**  
   - It’s the most commonly used measure, but outliers can skew it. In financial datasets (e.g., income), the mean can be misleading due to extreme values.

Population Mean:
μ = (Σ X) / N

Where:
μ (mu) = population mean
Σ (sigma) = sum of
X = each value in the population
N = total number of values in the population

Sample Mean:
x̄ = (Σ x) / n

Where:
x̄ (x-bar) = sample mean
Σ (sigma) = sum of
x = each value in the sample
n = number of values in the sample

   **3.2 Median**  
   - Particularly useful in highly skewed distributions (e.g., real estate prices). The median can often represent the "typical" value better than the mean in such cases.
  
   **3.3 Mode**  
   - Mode is particularly useful for nominal data, such as identifying the most common category in survey responses.
   - In continuous data, a distribution can have multiple modes, suggesting the presence of clusters within the data.
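
A short sketch with Python's `statistics` module shows how a single outlier pulls the mean away from the median. The income figures and survey answers are invented for illustration.

```python
import statistics

# Hypothetical incomes in thousands; 400 is an outlier.
incomes = [30, 32, 35, 35, 38, 40, 400]

mean_income = statistics.mean(incomes)      # about 87.1, pulled up by the outlier
median_income = statistics.median(incomes)  # 35, the middle value
mode_income = statistics.mode(incomes)      # 35, the most frequent value

# Mode also works for nominal data, and multimode reports ties.
answers = ["red", "blue", "red", "blue", "green"]  # hypothetical survey responses
common_answers = statistics.multimode(answers)     # ['red', 'blue']

print(mean_income, median_income, mode_income, common_answers)
```

Here the median (35) describes a typical income far better than the mean (about 87), which is the behavior described in 3.1 and 3.2.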

### Measures of Dispersion
   **4.1 Range**  
   - While easy to compute, the range ignores the distribution of values within the dataset. Two datasets with the same range can have different shapes and spread of data.

   **4.2 Interquartile Range (IQR)**  
   - The IQR is the difference between the third and first quartiles (IQR = Q3 − Q1). It is particularly useful for identifying outliers, commonly defined as values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. This measure is often visualized using box plots.
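
The 1.5 × IQR rule can be sketched as follows. The data are made up, and the quartiles use `statistics.quantiles` with `method="inclusive"`; other quartile conventions exist and can give slightly different fences.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up values; 30 looks suspicious

# Quartiles under the "inclusive" convention (an assumption of this sketch).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```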
   
   **4.3 Variance**  
   - Variance is the average squared deviation of the data points from the mean. It is the basis for many advanced statistical techniques, such as regression and machine learning models (e.g., in loss functions).
   
   **4.4 Standard Deviation**  
   - The standard deviation is the square root of the variance, so it is expressed in the same units as the data. A small standard deviation indicates that data points are generally close to the mean, while a larger standard deviation indicates a wider spread. It is widely used in finance (e.g., for measuring stock price volatility).
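
As a hedged illustration of the dispersion measures above, the sketch below compares two invented datasets that share the same range but differ in spread. It also shows the population versus sample versions, mirroring the N versus n distinction used for the mean: the sample variance divides by n − 1 rather than n.

```python
import statistics

# Two invented datasets with the same range (7) but different spread.
clustered = [2, 5, 5, 5, 5, 5, 5, 9]
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]

for name, values in (("clustered", clustered), ("spread_out", spread_out)):
    value_range = max(values) - min(values)
    pop_var = statistics.pvariance(values)   # divides by N
    sample_var = statistics.variance(values) # divides by n - 1
    pop_sd = statistics.pstdev(values)       # square root of pop_var
    print(f"{name}: range={value_range}, pop var={pop_var:.3f}, "
          f"sample var={sample_var:.3f}, pop sd={pop_sd:.3f}")
```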
   
### Comparing Datasets
   - When comparing datasets, besides central tendency (mean, median), and dispersion (variance, standard deviation), skewness and kurtosis also become relevant.
     - **Skewness** measures the asymmetry of the data distribution. Positive skew means the longer tail is on the right, negative skew means it is on the left, and a value near zero indicates a roughly symmetric distribution.
     - **Kurtosis** indicates the "tailedness" of the distribution. High kurtosis means heavier tails and more extreme values (outliers) than a normal distribution, while low kurtosis means lighter tails and fewer extreme values.
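
Skewness and kurtosis can be computed as the third and fourth standardized moments. The sketch below uses the simple population (biased) formulas. Statistical packages often apply small-sample corrections, so their results may differ slightly. The helper name and the datasets are illustrative.

```python
def standardized_moment(data, k):
    """k-th central moment divided by the standard deviation to the power k."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return sum((x - mean) ** k for x in data) / n / variance ** (k / 2)

symmetric = [1, 2, 3, 4, 5]        # invented, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10] # invented, long right tail

skew_sym = standardized_moment(symmetric, 3)       # 0 for symmetric data
skew_right = standardized_moment(right_skewed, 3)  # positive: tail on the right

# Excess kurtosis subtracts 3, the kurtosis of a normal distribution.
excess_kurt_sym = standardized_moment(symmetric, 4) - 3
excess_kurt_right = standardized_moment(right_skewed, 4) - 3

print(f"skewness: symmetric={skew_sym:.3f}, right-skewed={skew_right:.3f}")
print(f"excess kurtosis: symmetric={excess_kurt_sym:.3f}, "
      f"right-skewed={excess_kurt_right:.3f}")
```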

 


### 1. **A/B Testing**
- **Definition**: Comparing two versions (A and B) of a variable to determine which performs better.
- **Application**: Used in marketing, UX design, and product optimization.
- **Key Points**:
  - **Control Group**: Version A (baseline).
  - **Treatment Group**: Version B (variant).
  - **Objective**: Measure impact on specific outcomes (e.g., conversion rate).

### 2. **SUTVA (Stable Unit Treatment Value Assumption)**
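- **Definition**: An assumption in causal inference that each unit's potential outcomes depend only on the treatment that unit receives.
- **Components**:
  - **No Interference**: Treatment of one unit doesn’t affect the outcomes of other units.
  - **Consistency**: There are no hidden versions of the treatment, so a unit's observed outcome equals its potential outcome under the treatment it actually received.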
- **Purpose**: Ensures valid causal effect estimation.

### 3. **Sampling Distributions**
- **Definition**: Distribution of a statistic (e.g., mean) from many samples drawn from the population.
- **Key Concepts**:
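  - **Central Limit Theorem**: The distribution of sample means approaches a normal distribution as sample size increases, provided the population has finite variance.
  - **Standard Error**: The standard deviation of a statistic's sampling distribution; for the sample mean it is σ / √n.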
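
A small simulation can make the sampling distribution tangible. The sketch below assumes a synthetic, strongly right-skewed population (exponential draws), then compares the observed spread of sample means with the theoretical standard error σ / √n. The population size, sample sizes, and repetition count are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for repeatability
population = [rng.expovariate(1.0) for _ in range(20_000)]  # skewed synthetic population
sigma = statistics.pstdev(population)

def sample_means(sample_size, repetitions=1000):
    """Means of many independent samples drawn from the population."""
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(repetitions)]

results = {}
for n in (5, 50):
    means = sample_means(n)
    observed_se = statistics.stdev(means)  # spread of the sample means
    theoretical_se = sigma / n ** 0.5      # sigma / sqrt(n)
    results[n] = (observed_se, theoretical_se)
    print(f"n={n}: observed SE={observed_se:.3f}, theoretical SE={theoretical_se:.3f}")
```

Larger samples give a narrower sampling distribution, and a histogram of `means` for n = 50 would look much closer to a bell curve than the skewed population itself.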

### 4. **Hypothesis Testing**
- **Definition**: Method for using sample data to decide whether there is enough evidence against a claim about a population parameter.
- **Components**:
  - **Null Hypothesis (H0)**: No effect or difference.
  - **Alternative Hypothesis (H1)**: Contradicts H0.
  - **Test Statistic**: Value calculated from data.
  - **P-value**: Probability, assuming H0 is true, of observing a test statistic at least as extreme as the one actually observed.
  - **Decision**: Reject or fail to reject H0 based on p-value and significance level.
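
The components above can be tied back to A/B testing with a two-proportion z-test, one common choice for comparing conversion rates. This is a sketch under stated assumptions: the visitor and conversion counts are invented, the significance level of 0.05 is a conventional choice, and the normal approximation assumes reasonably large samples.

```python
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: versions A and B have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under H0
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se                      # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 2000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)

alpha = 0.05  # significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these made-up counts the p-value falls below 0.05, so H0 is rejected. With identical conversion rates the statistic would be 0 and the p-value 1.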

### 5. **Bayesian Testing**
- **Definition**: Updating the probability of a hypothesis using Bayes' theorem.
- **Components**:
  - **Prior Probability**: Initial probability before data.
  - **Likelihood**: Probability of data given hypothesis.
  - **Posterior Probability**: Updated probability after data.
  - **Bayes' Theorem**: Calculates the posterior from the prior and likelihood: P(H | D) = P(D | H) × P(H) / P(D), where P(D) is the overall probability of the data.
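
As a minimal worked example of the update, the sketch below compares two hypothetical hypotheses about a coin after observing 8 heads in 10 flips. The equal priors, the "biased" heads probability of 0.8, and the helper names are all assumptions made for illustration.

```python
from math import comb

def binomial_likelihood(heads, flips, p_heads):
    """Probability of seeing `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)

def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data)
    return {h: value / evidence for h, value in unnormalized.items()}

prior = {"fair": 0.5, "biased": 0.5}  # belief before seeing data
likelihood = {
    "fair": binomial_likelihood(8, 10, 0.5),
    "biased": binomial_likelihood(8, 10, 0.8),
}
post = posterior(prior, likelihood)

for hypothesis, probability in post.items():
    print(f"P({hypothesis} | data) = {probability:.3f}")
```

The data shift most of the probability toward the "biased" hypothesis, and a different prior would shift the result accordingly.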