<h3 align="center">Data & Visualization Basics</h3>

#### Types of data
- Nominal data consists of categories or labels with no inherent order or ranking.
- Ordinal data, unlike nominal data, has a specific order or ranking among categories.
- Continuous data encompasses an infinite range of precise values, often with decimals.
- Discrete data, on the other hand, comprises distinct, countable numerical values.

#### Visualization Basics
- Pie chart and Bar chart
  - Use a bar chart when you have benchmark values to compare with.
  - Use a Horizontal bar chart when category labels are long.
  - Use a Horizontal bar chart instead of a pie chart when the number of categories is more than 5.
  - Use a vertical bar chart (Column chart) for time series data.
- Histogram and Line chart
  - Histograms are primarily used to show the frequency distribution of a continuous or discrete dataset.
  - In a histogram, bins (buckets) are typically of equal width, so bar heights can be compared directly.
  - A line chart is useful in presenting the trend or change in data over a period.
  - A stacked column chart is used to represent and compare multiple categories in a single bar, while also displaying the overall total.
- Scatter Plot and Bubble Plot
  - Scatter Plot helps in
    - Outlier detection
    - Visualizing the relationship between 2 variables (ex: area vs price, height vs exam score)
    - Identifying trend (ex: linear regression)
  - A bubble plot (or bubble chart) is used primarily in scenarios where you want to visualize relationships among three or more numeric variables in a single chart. It extends a scatter plot by adding a third dimension through the size of the bubbles (and sometimes a fourth dimension through color).
  - A bubble plot is best suited for exploring and communicating multivariate data relationships visually, where size and position convey important insights about three variables simultaneously.
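
The equal-width binning behind a histogram can be sketched without a plotting library. This is a minimal illustration, not a replacement for a charting tool: `histogram_counts` and the `scores` data are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    # Assumes high > low; identical values would give a zero bin width.
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up data
counts = histogram_counts(scores, num_bins=4)

# Text rendering of the frequency distribution, one row per bin.
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```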

<h3 align="center">Measures Of Central Tendency and Dispersion</h3>

#### Descriptive Vs Inferential Statistics
- Inferential statistics involve making predictions or drawing conclusions about a population based on a sample.
- Descriptive statistics are used to summarize and describe data, providing an overview of its main characteristics.

#### Measures of Dispersion (Variability): Range, IQR
- "IQR" and "Range" are also referred to as measures of dispersion or variability.
- Range, calculated as Maximum Value - Minimum Value, reflects data spread.
- Unlike Range, IQR (Inter Quartile Range) is less influenced by outliers, making it a robust measure.
- Quartile Q1, Q2, and Q3 correspond to the 25th, 50th, and 75th percentiles, respectively.
- The 50th percentile is commonly known as the median.
- IQR is the difference between Q3 and Q1 showing the spread of the middle 50% of data.

#### Outlier Treatment using IQR
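- Get Q1 and Q3 first, e.g. `df.height.quantile([0.25, 0.75])`, then compute `IQR = Q3 - Q1`.
- Lower and upper boundaries for outlier detection:
  - `lower_limit = Q1 - 1.5 * IQR`
  - `upper_limit = Q3 + 1.5 * IQR`
- Values below `lower_limit` or above `upper_limit` are treated as outliers.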
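
The limits above can be sketched with the standard library alone. This is a minimal example, assuming made-up `heights` data and a hypothetical helper `iqr_outliers`; `method="inclusive"` interpolates linearly between data points, which should give the same quartiles as the default interpolation of the pandas `quantile` call mentioned above.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_limit, upper_limit, outliers) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers

heights = [5.1, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0, 6.1, 9.5]  # made-up data
lower_limit, upper_limit, outliers = iqr_outliers(heights)
print(lower_limit, upper_limit, outliers)
```

Whether flagged values are dropped, capped at the limits, or kept is a separate decision that depends on the dataset.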

#### Measures of Dispersion: Variance and Standard Deviation
- Variance is a measure of how spread out a distribution is, that is, how far the values lie from the mean. It is calculated as the average of the squared differences from the mean.
- The smaller the variance, the less spread out the data is. Conversely, the larger the variance, the more spread out the data is.
- Standard deviation is a measure of the amount of variation or dispersion of a set of values. It is calculated as the square root of the variance.
- The smaller the standard deviation, the closer the data points are to the mean. Conversely, the larger the standard deviation, the more spread out the data points are.
- Stock market volatility is a common use case for variance and standard deviation.
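
As a small worked example of the definitions above (the `values` list is made up), the variance can be computed by hand and compared with the `statistics` module. Note that "average of the squared differences" is the population variance; the sample variance divides by n - 1 instead.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = statistics.mean(values)
# Population variance: average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# The statistics module offers the same population measures directly.
library_variance = statistics.pvariance(values)
library_std_dev = statistics.pstdev(values)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(values)

print(mean, variance, std_dev, sample_variance)
```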

#### Correlation and Causation
- Correlation is a statistical measure that shows the degree to which two variables are related.
- A correlation coefficient ranges from -1 to 1: -1 is a perfect negative correlation, 0 is no linear correlation, and 1 is a perfect positive correlation.
- Correlation: A statistical relationship between two variables, where changes in one variable are associated with changes in another, but it does not imply causation.
  - Example: Ice cream sales and drowning incidents often increase together during summer months. These two variables are correlated because both increase in warm weather. However, buying ice cream does not cause drowning incidents.
- Causation: A cause-and-effect relationship between variables, where changes in one variable directly lead to changes in another.
  - Example: Smoking causes an increase in the risk of lung cancer. Here, smoking is the direct cause, and the increase in lung cancer cases is the effect.
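
The correlation coefficient can be sketched as follows. The function `pearson_r` and the numbers are illustrative assumptions, loosely modeled on the ice cream example above; a high value here only shows association, not that one variable causes the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

ice_cream_sales = [110, 135, 160, 190, 215]  # made-up monthly figures
drowning_incidents = [1, 2, 2, 4, 5]         # made-up monthly figures

r = pearson_r(ice_cream_sales, drowning_incidents)
print(f"r = {r:.3f}")
```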

<h3 align="center">Probability and Distributions</h3>

#### Distribution
- Distribution: Arrangement or spread of different values.
- Normal Distribution: Most values clustered in the middle, forming a bell-shaped curve.
- Probability Distribution: Describes how likely each possible outcome of a random process is.
- Discrete Distribution: Outcomes take distinct, countable values, such as the number of heads in 10 coin flips.
- Continuous Distribution: Values occur anywhere within a range, such as height or weight measurements.

#### Skewness
- Right-skewed distribution: Most data on the left with a few high values extending right.
- Left-skewed distribution: Most data on the right with a few low values extending left.
- Zero-skewed distribution: Data evenly spread around the mean, forming a symmetrical shape.
- Normal Distribution
  - Mean: The average value, calculated by summing all values and dividing by their count.
  - Standard Deviation: Measures how far data is spread from the mean; lower indicates closer to average, higher indicates more spread out.
  - 68-95-99.7 Rule: In a normal distribution, about 68% of data falls within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
  - An outlier is a number/value in a set that is much higher or lower than the others.
  - Outliers can be identified using a normal distribution and standard deviation, as they typically fall far outside the typical range of values.
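
The 68-95-99.7 rule can be checked numerically with `statistics.NormalDist`. This is a minimal sketch; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```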

#### Z Score
- Z-score: Shows how many standard deviations a data point is from the mean.
- Formula for Z-score: (x−μ)/σ
- Uses: Comparing datasets and removing outliers.
- Identifying Outliers: Typically, outliers are identified when the Z-score exceeds 3 or falls below -3.
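
A minimal sketch of Z-score based outlier detection, assuming made-up `readings` and hypothetical helpers `z_scores` and `z_outliers`. It uses the population standard deviation to match the (x−μ)/σ formula above. In very small samples no point can reach a Z-score beyond 3, so the example uses 21 values.

```python
import statistics

def z_scores(values):
    """Z-score of each value: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Values whose Z-score is above threshold or below -threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

readings = [48, 49, 50, 51, 52] * 4 + [100]  # made-up data with one extreme value
outliers = z_outliers(readings)
print(outliers)
```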

<h3 align="center">Central Limit Theorem</h3>

- `The Law of Large Numbers` states that as the number of trials increases, the average of the trial results gets closer to the theoretical (true) average.
- `The Central Limit Theorem` states that as the sample size increases, the distribution of sample means approaches a normal distribution, regardless of the population's original distribution.
- For sample sizes of 30 or more, sample means and proportions tend to follow a nearly normal distribution.
- The standard error quantifies the precision of a sample mean, calculated as the sample standard deviation divided by the square root of the sample size.
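
The Central Limit Theorem and the standard error can be explored with a small simulation. This sketch assumes a right-skewed exponential population (mean 1, standard deviation 1); the constants `SAMPLE_SIZE` and `NUM_SAMPLES` and the helper names are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50
NUM_SAMPLES = 2000

def draw_sample(n):
    """Draw n values from an exponential population with rate 1."""
    return [random.expovariate(1.0) for _ in range(n)]

def standard_error(sample):
    """Estimate the standard error of the mean from a single sample."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Distribution of sample means across many repeated samples.
sample_means = [statistics.fmean(draw_sample(SAMPLE_SIZE)) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
# Population standard deviation (1.0) divided by sqrt(sample size).
theoretical_se = 1.0 / math.sqrt(SAMPLE_SIZE)

print(mean_of_means, spread_of_means, theoretical_se)
```

The mean of the sample means should land near the population mean, and their spread should land near the theoretical standard error, even though the population itself is far from normal.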

<h3 align="center">Hypothesis Testing</h3>

#### Null vs Alternate Hypothesis
- Hypothesis testing is a statistical technique for decision-making or inferring population characteristics.
- Null Hypothesis (H0): Represents a statement of no effect or no difference, serving as the benchmark in hypothesis testing.
- Alternative Hypothesis (Ha): Proposes a new effect or difference, challenging the null hypothesis.

#### Z Test, Rejection Region
- A Z-test decision can be reached in 2 ways:
  1. Rejection region
  2. P-Value
- Z-Test: A statistical test used to determine if there's a significant difference between sample and population means.
- Significance Level (Alpha): Defines the threshold for rejecting the null hypothesis, commonly set at 0.05 or 5%.
  - Formula: 1 - confidence level (e.g., 1 - 0.95 = 0.05)
- Critical Z-Value (Z(crit)): The cut-off point in a Z-test distinguishing the rejection region for the null hypothesis.
- Rejection Region: The area beyond the critical Z-value (Z(crit)) where, if the Z-score falls, the null hypothesis is rejected due to significant evidence.
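
The rejection-region approach can be sketched for a two-tailed, one-sample Z-test. The numbers are made up, the population standard deviation is assumed to be known, and `z_statistic` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def z_statistic(sample_mean, pop_mean, pop_std, n):
    """Z statistic for a one-sample test with known population standard deviation."""
    return (sample_mean - pop_mean) / (pop_std / math.sqrt(n))

alpha = 0.05
# Two-tailed test: alpha is split between both tails of the standard normal.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

z = z_statistic(sample_mean=52.0, pop_mean=50.0, pop_std=5.0, n=36)

# Reject H0 when the Z statistic falls in the rejection region beyond +/- z_crit.
reject_null = abs(z) > z_crit
print(z, z_crit, reject_null)
```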

#### Z Test, P-Value
- P-Value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is correct.
- If the p-value is less than the chosen significance level (e.g., 0.05), it suggests strong evidence against the null hypothesis, leading to its rejection.

#### One-Tailed vs Two-Tailed Test
- One-Tail Test: Checks if a parameter differs significantly in one specified direction (either greater or less).
- Two-Tail Test: Checks if a parameter differs significantly in either direction (greater or less).
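
The p-value and the one-tailed versus two-tailed distinction can be sketched together. `p_value` and its `tail` labels are hypothetical names, and the Z statistic of 1.8 is a made-up input chosen so that the two tests disagree at a 0.05 significance level.

```python
from statistics import NormalDist

def p_value(z, tail="two-sided"):
    """P-value of a Z statistic under the standard normal distribution."""
    std_normal = NormalDist()
    if tail == "greater":    # H_a: parameter is greater
        return 1 - std_normal.cdf(z)
    if tail == "less":       # H_a: parameter is less
        return std_normal.cdf(z)
    if tail == "two-sided":  # H_a: parameter differs in either direction
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown tail: {tail}")

z = 1.8
p_one_tailed = p_value(z, tail="greater")
p_two_tailed = p_value(z, tail="two-sided")
print(p_one_tailed, p_two_tailed)
```

For the same Z statistic the two-tailed p-value is twice the one-tailed one, so the choice of test should be made before looking at the data.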

#### Type 1 and Type 2 Errors
- Type 1 Error: Occurs when we incorrectly reject a true null hypothesis.
- Type 2 Error: Occurs when we fail to reject a false null hypothesis.
- Type 1 error is also known as a "false positive" while Type 2 error is a "false negative".
- Beta (β): The probability of making a Type 2 error (false negative).
- Statistical Power: The probability of correctly rejecting a false null hypothesis, equal to 1−β.
- Balancing Type 1 and Type 2 errors is crucial in statistical analysis.

#### Statistical power & effect size
- Statistical power, denoted as 1−β, represents the probability of correctly rejecting a false null hypothesis in a hypothesis test.
- Effect size quantifies the magnitude of the difference or relationship between two groups or variables in a study.
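
How power depends on effect size and sample size can be sketched for one specific case: a one-sided, one-sample Z-test with known population standard deviation. `z_test_power` is a hypothetical helper, and the effect size is expressed as the true mean shift in standard deviation units (Cohen's d).

```python
import math
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power (1 - beta) of a one-sided, one-sample Z-test.

    effect_size: true mean shift divided by the population standard deviation.
    n: sample size.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the Z statistic is centered at effect_size * sqrt(n).
    return 1 - std_normal.cdf(z_crit - effect_size * math.sqrt(n))

power = z_test_power(effect_size=0.5, n=25)
beta = 1 - power  # probability of a Type 2 error
print(power, beta)
```

Power rises with a larger effect size, a larger sample, or a less strict significance level.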

#### A/B Testing
- A/B Testing: Compares two versions (A and B) to determine which performs better based on user engagement and relevant metrics.
- One-Sample Test: Used to check if a sample significantly differs from known population parameters.
- Two-Sample Test: Compares two independent sample data sets to identify significant differences between their means or proportions.

#### A/B Testing using Z Test
- Control group: Baseline for comparing experimental effects.
- Test group: Subjects exposed to experimental treatment for evaluation.
- Statistical tests:
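  - Use the Z-test when the sample size is large (rule of thumb: >30) or the population standard deviation is known.
  - Use the T-test when the sample size is small (≤30) and the population standard deviation is unknown.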
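
An A/B test on conversion rates is often analyzed with a two-proportion Z-test. The sketch below assumes made-up visitor and conversion counts and a hypothetical helper `two_proportion_z_test`; it uses the pooled proportion for the standard error, which is one common choice.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control group (A) and test group (B), made-up counts.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
significant = p_value < 0.05
print(z, p_value, significant)
```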

<h3 align="center">Frequently Asked Questions</h3>

- Explain bias-variance tradeoff.
  - Bias: error from overly simple model assumptions (underfitting).
  - Variance: error due to model sensitivity to training fluctuations (overfitting).
  - Balance needed for model generalization.
- Difference between correlation and covariance?
  - Covariance indicates direction of linear relationship; unbounded values.
  - Correlation quantifies strength and direction between -1 and 1.
- What is a p-value?
  - Probability of observing data at least as extreme as the sample, assuming the null hypothesis is true.
  - Low p-value (<0.05) indicates evidence against null hypothesis.
- Type I and Type II errors?
  - Type I: false positive (rejecting true null hypothesis).
  - Type II: false negative (failing to reject a false null hypothesis).
- Explain `principal component analysis (PCA)`.
  - Reduces data dimensionality by transforming features into uncorrelated components keeping most variance.
- Hypothesis testing steps?
  - Formulate null & alternative hypotheses.
  - Choose significance level.
  - Calculate test statistic & p-value.
  - Decide whether to reject or fail to reject the null hypothesis.
- Multicollinearity problem?
  - High correlation among features inflates the variance of coefficient estimates; mitigate by dropping correlated features or applying PCA.
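
The covariance versus correlation answer above can be illustrated with a unit change. In this sketch (made-up data, hypothetical helpers `covariance` and `correlation`), converting heights from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

height_m = [1.5, 1.6, 1.7, 1.8, 1.9]  # made-up data
weight_kg = [50, 58, 63, 72, 80]      # made-up data
height_cm = [h * 100 for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)
print(cov_m, cov_cm, r_m, r_cm)
```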