# Data Science — Overview

## Purpose
- Build a repeatable pipeline from raw data to decisions.
- Combine statistics, programming, and domain knowledge.
- Quantify uncertainty and communicate impact.

## Key questions this section answers
- What data do we need and how do we validate it?
- Which model and metric match the business goal?
- How do we avoid leakage, bias, and overfitting?
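
One common guard against leakage is to fit preprocessing inside each cross-validation fold rather than on the full dataset. The sketch below assumes scikit-learn and a small synthetic dataset; the Ridge model, the 5-fold split, and the R^2 metric are illustrative choices, not something this section prescribes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two informative features plus unit-variance noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=300)

# Keeping the scaler inside the pipeline means it is re-fit on the training
# part of every fold, so validation rows never influence the scaling statistics.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Shuffled 5-fold cross-validation gives an out-of-sample estimate of R^2.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large gap between training and cross-validated scores is one practical sign of overfitting. Scaling the full dataset before splitting would be a mild form of leakage.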

## Topics
- Data collection, cleaning, and validation
- Exploratory data analysis (EDA) and visualization
- Feature engineering and preprocessing
- Supervised and unsupervised modeling
- Evaluation, experimentation, and error analysis
- Deployment, monitoring, and responsible AI

## References
- pandas, NumPy, scikit-learn, statsmodels, PyTorch, Plotly


```python
import numpy as np
import pandas as pd
import plotly.express as px

# Synthetic regression data: y depends linearly on x1 and x2, plus unit-variance noise.
rng = np.random.default_rng(7)  # fixed seed for reproducibility
n = 400
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 2.0 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

# Scatter of the target against x1, colored by x2 to show the second feature's effect.
fig = px.scatter(
    df,
    x="x1",
    y="y",
    color="x2",
    title="Signal + noise (synthetic regression data)",
    labels={"x1": "feature x1", "y": "target y", "x2": "feature x2"},
)
fig.show()

# Pairwise Pearson correlations, shown as a heatmap on a fixed [-1, 1] color range.
corr = df.corr(numeric_only=True)
fig = px.imshow(
    corr,
    text_auto=True,
    color_continuous_scale="RdBu",
    zmin=-1,
    zmax=1,
    title="Feature correlations",
)
fig.show()
```
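
Because the data above are generated as y = 2.0 * x1 - 0.8 * x2 + noise, a least-squares fit should approximately recover those coefficients. The following is a minimal sketch that regenerates the data with the same seed and uses `np.linalg.lstsq` as one possible solver; the variable names `beta` and `sigma_hat` are introduced here for illustration.

```python
import numpy as np

# Regenerate the same synthetic data (same seed and draw order as above).
rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 2.0 * x1 - 0.8 * x2 + rng.normal(0, 1, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: choose beta to minimize ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = beta

# The residual standard deviation estimates the noise scale (true value 1.0).
residuals = y - X @ beta
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

print(f"intercept={intercept:.3f}, b1={b1:.3f}, b2={b2:.3f}, sigma={sigma_hat:.3f}")
```

The estimates will not match 2.0 and -0.8 exactly: with 400 rows and unit noise, sampling error of a few hundredths per coefficient is expected.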


## Takeaway
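Even a simple linear baseline can capture meaningful structure when the data are clean. In the synthetic example above, y depends strongly on x1 and more weakly (and negatively) on x2, and the scatter plot and correlation matrix already show both effects. Start with simple baselines and validate carefully on held-out data before adding complexity.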
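
The baseline-first advice can be sketched as a held-out comparison between a predict-the-mean model and a linear model. This assumes scikit-learn and a synthetic dataset with the same coefficients as above (but not the same random draws); the 75/25 split and the R^2 metric are illustrative choices.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): same coefficients as above, different draws.
rng = np.random.default_rng(7)
n = 400
X = rng.normal(0, 1, size=(n, 2))
y = 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1, n)

# Hold out 25% of the rows; neither model sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Trivial baseline: always predict the training-set mean of y.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
# Simple model: ordinary least-squares linear regression.
linear = LinearRegression().fit(X_train, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_linear = r2_score(y_test, linear.predict(X_test))

print(f"held-out R^2, mean baseline: {r2_baseline:.3f}")
print(f"held-out R^2, linear model:  {r2_linear:.3f}")
```

A more complex model is worth its cost only if it beats the simple model on held-out data by a margin that matters for the decision at hand.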

