# Exploring Tabular Datasets

## 1. Characteristics of Rows

#### Independent and Identically Distributed (IID)

When working with tabular data, check whether the rows (i.e., examples) can be treated as **independent and identically distributed (IID)**. Time series and hierarchical data are the usual exceptions. IID means that each sample is drawn independently from the same underlying distribution, like repeated coin flips or dice rolls. This assumption simplifies many aspects of machine learning, such as model training, cross-validation, and sampling methods like bootstrapping.

However, many real-world datasets violate the IID assumption. For instance, sales from the same store or responses from students in the same class tend to be correlated, introducing non-IID characteristics. Non-IID data can lead to misleading model performance, overfitting to hidden relationships, and biased validation scores. It's important to detect these patterns through data exploration and understand how they might impact your modeling process.
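
One rough way to look for group-level dependence is to ask how much of a column's variance is explained by group membership, compared with a baseline where the group labels are shuffled. The sketch below uses synthetic data. The column names `store_id` and `sales` and the helper `between_group_variance_share` are hypothetical names chosen for illustration, not part of any library.

```python
import numpy as np
import pandas as pd


def between_group_variance_share(df, group_col, value_col):
    """Share of the total variance in value_col explained by group means.

    Values near 0 suggest rows behave independently of the grouping;
    values near 1 suggest rows within a group are strongly correlated.
    """
    overall_mean = df[value_col].mean()
    group_means = df.groupby(group_col)[value_col].transform("mean")
    ss_between = ((group_means - overall_mean) ** 2).sum()
    ss_total = ((df[value_col] - overall_mean) ** 2).sum()
    return float(ss_between / ss_total)


# Synthetic example: 20 hypothetical stores, 50 rows each, with a
# store-level effect that makes rows from the same store correlated.
rng = np.random.default_rng(0)
store_ids = np.repeat(np.arange(20), 50)
store_effect = rng.normal(0, 5, size=20)[store_ids]
sales = pd.DataFrame({
    "store_id": store_ids,
    "sales": 100 + store_effect + rng.normal(0, 1, size=store_ids.size),
})

# Baseline: the same labels, randomly reassigned to rows.
sales["shuffled_store"] = rng.permutation(sales["store_id"].to_numpy())

grouped_share = between_group_variance_share(sales, "store_id", "sales")
baseline_share = between_group_variance_share(sales, "shuffled_store", "sales")
print(f"real groups: {grouped_share:.2f}, shuffled groups: {baseline_share:.2f}")
```

A large gap between the two numbers hints that rows within a group are not independent. This is a screening heuristic, not a formal test.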

#### Why IID Matters in Machine Learning

Most statistical and machine learning methods assume that data is **independent and identically distributed (IID)**. Although machine learning is often data-driven and nonparametric, it still relies heavily on the IID assumption to make reliable predictions. In practice, few datasets are strictly IID.

A key limitation of most machine learning algorithms for tabular data is that they are **column-aware but not row-aware**: they learn relationships between features and the target but cannot model dependencies between rows. If the data contains hidden patterns due to time or grouping, the model may therefore misinterpret them as feature-based correlations.

**Time series and longitudinal data** are classic examples of non-IID data, where observations are autocorrelated. In such cases, time-based features (e.g., timestamps or lagged variables) help the model account for temporal dependencies. Depending on your data structure, you can:

* Use timestamps as features and treat the problem as a time series, which requires time-aware validation techniques.
* Create time-based features by pivoting multiple time points into separate columns, so that each row describes one unit and the rows can be treated as approximately IID.
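
Both options can be sketched with pandas. The example assumes a small long-format table with hypothetical columns `unit_id`, `month`, and `value`. It first adds a lagged feature while keeping one row per time point, then pivots the time points into separate columns so that each unit occupies a single row.

```python
import pandas as pd

# Hypothetical long-format data: one row per (unit, month).
long_df = pd.DataFrame({
    "unit_id": [1, 1, 1, 2, 2, 2],
    "month": [1, 2, 3, 1, 2, 3],
    "value": [10.0, 12.0, 11.0, 20.0, 19.0, 23.0],
})

# Option 1: keep the long format and add a lag feature.
# Sorting first ensures shift(1) picks up the previous month of the same unit.
long_df = long_df.sort_values(["unit_id", "month"])
long_df["value_lag1"] = long_df.groupby("unit_id")["value"].shift(1)

# Option 2: pivot time points into columns, giving one row per unit.
wide_df = long_df.pivot(index="unit_id", columns="month", values="value")
wide_df.columns = [f"value_m{m}" for m in wide_df.columns]
wide_df = wide_df.reset_index()

print(long_df)
print(wide_df)
```

The first month of each unit has no previous value, so its lag is missing and needs to be imputed or dropped before modeling.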

Properly identifying and handling non-IID structures is essential to avoid misleading results and to ensure your models generalize well.


#### Handling Non-IID Data in Time and Groups

Even in **cross-sectional datasets**, comparing different time periods can introduce **temporal dependencies** that make the data non-IID, even when there is no interaction between units or groups. In such cases, the **order of observations matters**, and time-aware methods such as time series models are needed to capture the correlation across time points.
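
A quick check for temporal dependence is the lag-1 autocorrelation of a column once the rows are sorted by time. The sketch below compares a synthetic IID series with a synthetic random walk, and also builds a 7-step moving average as an example of a smoothing feature. All data and variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# IID noise: consecutive values are unrelated.
iid_series = pd.Series(rng.normal(size=n))

# Random walk: each value depends heavily on the previous one.
drifting_series = pd.Series(rng.normal(size=n)).cumsum()

# Lag-1 autocorrelation: correlation between each value and its predecessor.
iid_autocorr = iid_series.autocorr(lag=1)
drifting_autocorr = drifting_series.autocorr(lag=1)
print(f"IID: {iid_autocorr:.2f}, random walk: {drifting_autocorr:.2f}")

# Moving average over the current and previous 6 observations.
# If the series is also the prediction target, shift it by one step
# (rolling_mean.shift(1)) so the feature only uses past values.
rolling_mean = drifting_series.rolling(window=7).mean()
```

An autocorrelation close to zero is consistent with independent rows, while a value close to one signals that the row order carries information.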

To prepare data effectively for modeling:

* **Check how time affects your data**, and consider using **time features**, **lags**, or **moving averages** to control for temporal shifts.
* Be **explicit about grouping**: if hidden relationships or groups exist in your data, represent them with features and use **group-aware validation techniques**.
* For grouped data, prefer **group cross-validation** to avoid splitting related rows across train and validation sets.
* For temporal data, use **time-based validation** methods to preserve the order and avoid data leakage.
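
The two validation strategies above can be sketched with scikit-learn's `GroupKFold` and `TimeSeriesSplit`. The toy arrays and the group assignment are assumptions made for illustration. The loops only verify the properties that matter: no group appears on both sides of a split, and every training index precedes every validation index.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n_rows = 12
X = np.arange(n_rows).reshape(-1, 1)   # rows assumed to be sorted by time
y = np.arange(n_rows) % 2
groups = np.repeat(np.arange(4), 3)    # 4 hypothetical groups, 3 rows each

# Group cross-validation: all rows of a group stay on the same side.
group_cv = GroupKFold(n_splits=4)
group_overlaps = []
for train_idx, val_idx in group_cv.split(X, y, groups=groups):
    shared = set(groups[train_idx]) & set(groups[val_idx])
    group_overlaps.append(len(shared))

# Time-based validation: train on the past, validate on the future.
time_cv = TimeSeriesSplit(n_splits=3)
time_ordered = []
for train_idx, val_idx in time_cv.split(X):
    time_ordered.append(bool(train_idx.max() < val_idx.min()))

print("groups shared per fold:", group_overlaps)
print("train precedes validation:", time_ordered)
```

Either splitter can be passed as the `cv` argument to helpers such as `cross_val_score`. For `GroupKFold`, the `groups` array must be passed along as well.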

Properly addressing these issues improves model reliability and prevents overfitting to spurious or time-based patterns.



## 2. Characteristics of Columns



| **Feature Type**      | **Description**                                                             | **Examples**                     | **Key Notes**                                                           |
| --------------------- | --------------------------------------------------------------------------- | -------------------------------- | ----------------------------------------------------------------------- |
| **Numeric (Float)**   | Real numbers; can be **ratio** (true zero) or **interval** (arbitrary zero) | Price, weight, temperature       | Can standardize; ensure consistent units; interval data can be negative |
| **Numeric (Integer)** | Whole numbers; may also be **ordinal** or **categorical**                   | Count, age                       | Check continuity & uniqueness; treat carefully if used as labels        |
| **Ordinal**           | Ordered categories with **uneven spacing**                                  | Ratings (e.g., 1–5 stars), ranks | Do not compute mean/std; preserve order but not magnitude               |
| **Categorical**       | Unordered labels (strings or integers)                                      | Color, country, product type     | Handle with encoding; manage cardinality (low vs. high)                 |
| **Binary**            | Special case of categorical with **two values**                             | Yes/No, 0/1, presence/absence    | Often used directly in models                                           |
| **Date/Time**         | Temporal data                                                               | Timestamps, dates                | Decompose into parts (day, month, year); can convert to Unix time       |
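
As a rough illustration of the notes in the table, the pandas sketch below prepares one column of each type. The data frame, its column names, and the category order are all hypothetical, and the encodings shown are common defaults rather than the only reasonable choices.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [9.99, 14.50, 3.25],                             # numeric (float)
    "rating": ["low", "high", "medium"],                      # ordinal
    "color": ["red", "blue", "red"],                          # categorical
    "in_stock": ["yes", "no", "yes"],                         # binary
    "ordered_at": ["2024-01-15", "2024-02-29", "2024-12-31"], # date/time
})

# Ordinal: declare the order explicitly, then use integer codes that
# preserve the order (but not the distance between levels).
rating_order = ["low", "medium", "high"]
df["rating"] = pd.Categorical(df["rating"], categories=rating_order, ordered=True)
df["rating_code"] = df["rating"].cat.codes

# Binary: map the two labels to 0/1.
df["in_stock"] = df["in_stock"].map({"yes": 1, "no": 0})

# Date/time: parse, then decompose into parts.
df["ordered_at"] = pd.to_datetime(df["ordered_at"])
df["order_year"] = df["ordered_at"].dt.year
df["order_month"] = df["ordered_at"].dt.month
df["order_dayofweek"] = df["ordered_at"].dt.dayofweek

# Categorical (low cardinality): one-hot encode.
df = pd.get_dummies(df, columns=["color"], prefix="color")

print(df.dtypes)
```

One-hot encoding suits low-cardinality columns. High-cardinality categoricals usually need a different strategy, which this sketch does not cover.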

---
