## What this is about & why statistics matters
Statistics is like the “language” that helps us understand data. In machine learning / data science, we rarely deal with just a few numbers — there are usually many data points. Statistics gives tools to summarize, describe, compare, and infer relationships from data.

Statistics has two broad parts:
- Descriptive statistics: summarizing or describing what’s in your dataset (e.g. average, spread, correlations).
- Inferential statistics: using a sample of data to draw conclusions about a broader population — this includes estimation, hypothesis testing, regression, etc.

When we code or use machine learning, having good statistical grounding helps us understand data better, choose features wisely, evaluate models, and trust conclusions.


### Population and Sample
- Population: the full set of all possible data points you care about (e.g. heights of all people in a city).
- Sample: a subset chosen from that population (e.g. heights of 500 randomly selected people).

Why this matters: Often we can’t collect data from the entire population — so we use a sample to estimate what the population might look like. That’s where statistics helps you make valid generalizations.
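The gap between a population and a sample can be sketched in a few lines. This is only an illustration: the population below is simulated (normally distributed heights with a mean of 170 cm and a standard deviation of 10 cm are assumptions, not real data), and the sample size of 500 mirrors the example above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: heights (cm) of 100,000 people, simulated here.
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Simple random sample of 500 people, drawn without replacement.
sample = rng.choice(population, size=500, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()

# The sample mean should land close to, but not exactly on, the population mean.
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice the population mean is unknown; it is only available here because the population was simulated.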

### Estimation
Once we have a sample, we often want to estimate some property (parameter) of the population, such as the true average height.
- Point Estimation: giving a single value as your estimate (e.g. sample mean as estimate of population mean).

- Interval Estimation (Confidence Interval): instead of one value, we give a range that is expected to contain the true population parameter at a stated confidence level (e.g. 95%), typically of the form mean ± margin of error.
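A minimal sketch of both kinds of estimate, assuming a simulated sample of heights and a 95% confidence level (both are illustrative choices). The interval uses the t distribution, a common choice when the population standard deviation is unknown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=170.0, scale=10.0, size=500)  # simulated heights (cm)

n = sample.size

# Point estimate: the sample mean as an estimate of the population mean.
point_estimate = sample.mean()

# Standard error of the mean, using the sample standard deviation (ddof=1).
standard_error = sample.std(ddof=1) / np.sqrt(n)

# Interval estimate: mean ± margin of error at 95% confidence.
t_crit = stats.t.ppf(0.975, df=n - 1)
margin = t_crit * standard_error
ci_low, ci_high = point_estimate - margin, point_estimate + margin

print(f"point estimate: {point_estimate:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

A larger sample shrinks the standard error, and with it the margin of error.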

### Hypothesis Testing
This is a way to test assumptions/claims (hypotheses) about a population using sample data.
- Null Hypothesis (H₀): a default assumption (e.g. “the mean height = 170 cm”).
- Alternative Hypothesis (H₁): the competing claim that we look for evidence to support (e.g. “mean height ≠ 170 cm”).

### Errors:

- Type I Error: rejecting H₀ when it is actually true (a false positive).
- Type II Error: failing to reject H₀ when it is actually false (a false negative).
- p-Value: the probability of getting results at least as extreme as those observed, assuming H₀ is true. If the p-value is small (below a chosen threshold, e.g. 0.05), we reject H₀.
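As a rough illustration of the decision rule, the sketch below runs a one-sample t-test of H₀ “mean height = 170 cm” with SciPy's `ttest_1samp`. The sample is simulated, and the 0.05 threshold (`alpha`) is the conventional example value, not a universal rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
# Simulated sample; the true mean used to generate it (172) is an assumption.
sample = rng.normal(loc=172.0, scale=10.0, size=200)

alpha = 0.05  # significance level: the Type I error rate we accept

# Two-sided one-sample t-test of H0: population mean == 170.
result = stats.ttest_1samp(sample, popmean=170.0)
t_stat, p_value = result.statistic, result.pvalue

# Reject H0 only if the p-value falls below the threshold.
reject_h0 = bool(p_value < alpha)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Whether H₀ is rejected depends on the simulated draw. A different sample could lead to a Type II error even though the data were generated with a mean other than 170.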

### Common tests:
- t-Test / z-Test: compare means (one sample against a reference value, or two groups against each other).
- ANOVA (Analysis of Variance): compare means across more than 2 groups.

- Chi-Square Test: test association between categorical variables.
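The three tests listed above are available in `scipy.stats`. The following is a sketch on made-up data: the group values are simulated and the contingency table counts are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
# Three simulated groups (e.g. a measurement under three conditions).
group_a = rng.normal(50.0, 5.0, size=40)
group_b = rng.normal(52.0, 5.0, size=40)
group_c = rng.normal(55.0, 5.0, size=40)

# Two-sample t-test: compare the means of two groups.
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# One-way ANOVA: compare means across more than two groups.
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test of association on a 2x2 contingency table
# (rows and columns are two hypothetical categorical variables).
table = np.array([[30, 10],
                  [20, 40]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test p = {t_p:.4f}")
print(f"ANOVA p = {anova_p:.4f}")
print(f"chi-square p = {chi2_p:.4f} (dof = {dof})")
```

`expected` holds the counts that would be expected if the two categorical variables were independent; the test measures how far the observed table departs from it.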

Why this matters: In ML/data science, hypothesis testing helps determine if observed effects/patterns are “real” or could have arisen by chance.


### Covariance and Correlation

These help understand relationships between two variables.
- Covariance: indicates whether two variables change together (positive → move in the same direction; negative → opposite). Its magnitude depends on the units of the variables, so it is hard to interpret on its own.

- Correlation (e.g. Pearson’s correlation): normalized measure (–1 to +1) showing strength and direction of linear relationship between variables.

- Spearman Rank Correlation: measures the strength of a monotonic relationship, which may be non-linear, by correlating the ranks of the values.

Why this matters: In ML, exploring correlation helps with feature selection (which variables are related) and with detecting redundant features or multicollinearity. A scatter plot alongside the correlation coefficient is a quick way to check a relationship visually.
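The differences between these three measures show up clearly on synthetic data. In the sketch below, `y_linear` and `y_monotonic` are made-up variables: one is roughly linear in `x`, the other is an exponential (monotonic but non-linear) function of `x`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
x = rng.uniform(0.0, 10.0, size=200)
y_linear = 2.0 * x + rng.normal(0.0, 1.0, size=200)  # roughly linear in x
y_monotonic = np.exp(x)                               # monotonic, non-linear

# Covariance depends on scale: rescaling x by 100 rescales the covariance.
cov_xy = np.cov(x, y_linear)[0, 1]
cov_scaled = np.cov(100 * x, y_linear)[0, 1]

# Pearson correlation is normalized, so the same rescaling leaves it unchanged.
pearson_r, _ = stats.pearsonr(x, y_linear)
pearson_r_scaled, _ = stats.pearsonr(100 * x, y_linear)

# On a monotonic but non-linear relationship, Spearman stays at 1
# while Pearson falls below 1.
pearson_mono, _ = stats.pearsonr(x, y_monotonic)
spearman_mono, _ = stats.spearmanr(x, y_monotonic)

print(f"covariance: {cov_xy:.2f} (rescaled x: {cov_scaled:.2f})")
print(f"Pearson r:  {pearson_r:.3f} (rescaled x: {pearson_r_scaled:.3f})")
print(f"exp(x): Pearson {pearson_mono:.3f}, Spearman {spearman_mono:.3f}")
```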

### Visualization Techniques
Visualizing data helps you see patterns, distributions, relationships, outliers — often more intuitively than raw numbers.

Common techniques:

- Histograms — show distribution of single variable (how many values fall in each bin).
- Box Plots — show spread, median, quartiles, and outliers.
- Scatter Plots — useful for visualizing relationships between two variables.
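The three plot types above can be drawn side by side with Matplotlib. This is a sketch on simulated data: the `heights` and `weights` arrays and the output filename `eda_plots.png` are placeholders chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=5)
# Simulated data: heights (cm) and a loosely related weight (kg).
heights = rng.normal(170.0, 10.0, size=500)
weights = 0.9 * heights - 85.0 + rng.normal(0.0, 8.0, size=500)

fig, (ax_hist, ax_box, ax_scatter) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: distribution of a single variable, split into 20 bins.
ax_hist.hist(heights, bins=20)
ax_hist.set_title("Histogram of heights")

# Box plot: median, quartiles, spread, and outliers.
ax_box.boxplot(heights)
ax_box.set_title("Box plot of heights")

# Scatter plot: relationship between two variables.
ax_scatter.scatter(heights, weights, s=8)
ax_scatter.set_title("Height vs. weight")

fig.tight_layout()
fig.savefig("eda_plots.png")
```

In a notebook or interactive session, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used instead of saving to a file.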

Why this matters: Visual EDA (exploratory data analysis) is often the first step in any ML/data-science pipeline — helps you detect weird values, skewness, relationships, etc.

### Regression Analysis
Regression models relationships: how one or more predictor variables relate to a target variable.

- Simple Linear Regression: relationship between one predictor and one target.
- Multiple Linear Regression: when multiple predictors are used.
- Assumptions: linear regression assumes linearity, independence of errors, homoscedasticity (equal error variance), and normality of errors. Good statistical practice includes checking them.
- Interpretation of coefficients: each coefficient tells how much the target changes, on average, for a one-unit change in that predictor, keeping the other predictors constant.
- Model evaluation metrics: such as R-squared, adjusted R-squared, RMSE (root mean squared error).
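A multiple linear regression and the metrics above can be sketched with plain NumPy. The data here are simulated from a known linear relationship (intercept 3, coefficients 2 and -1.5, all assumed for illustration), so the fitted coefficients can be compared against the values used to generate them.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
n = 200
x1 = rng.uniform(0.0, 10.0, size=n)
x2 = rng.uniform(0.0, 5.0, size=n)
# Simulated target: known linear relationship plus Gaussian noise.
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0.0, 1.0, size=n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coef

y_pred = X @ coef
residuals = y - y_pred

# R-squared: share of the variance in y explained by the model.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p.
p = 2
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# RMSE: typical size of a prediction error, in the units of y.
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"intercept = {intercept:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, RMSE = {rmse:.3f}")
```

Here `b1` estimates how much `y` changes per unit of `x1` with `x2` held constant. Plotting `residuals` against `y_pred` is one common way to check the linearity and equal-variance assumptions.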

Why this matters: Many ML algorithms (linear models, regression, etc.) rely on statistical foundations, and regression also helps us understand why a model predicts what it predicts (interpretability).