## 1. DataFrame Creation & Basics

**Why it matters:** As a data engineer, you'll create DataFrames from various sources - APIs returning JSON, config dicts, CSV files, databases. Understanding creation methods helps you handle any data source.

### 1.1 Creating from Dictionary
Keys become column names, values become column data.

**Why important:** Most common method when building DataFrames programmatically - API responses, transforming data structures, creating test data.

In [None]:
import pandas as pd

# Method 1: Dict of lists (column-oriented) - Most common
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'customer_id': ['C1', 'C2', 'C1', 'C3', 'C2', 'C1', 'C3', 'C2'],
    'product': ['laptop', 'phone', 'mouse', 'laptop', 'tablet', 'keyboard', 'mouse', 'laptop'],
    'amount': [1200, 800, 25, 1200, 500, 75, 25, 1200],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-06', 
                                   '2024-01-07', '2024-01-08', '2024-01-10',
                                   '2024-01-12', '2024-01-15'])
})
orders

In [None]:
# Method 2: List of dicts (row-oriented) - Common from API responses
data_rows = [
    {'name': 'Alice', 'age': 25, 'city': 'NYC'},
    {'name': 'Bob', 'age': 30, 'city': 'LA'},
    {'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
]
df_from_records = pd.DataFrame(data_rows)
df_from_records
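
API records do not always share the same keys. The sketch below, using made-up records, suggests how `pd.DataFrame` handles that case: the columns are the union of all keys, and a missing key shows up as `NaN`.

```python
import pandas as pd

# Hypothetical API-style records; the second one has no 'city' key
records = [
    {'name': 'Alice', 'age': 25, 'city': 'NYC'},
    {'name': 'Bob', 'age': 30},
]
df_partial = pd.DataFrame(records)

# Columns are the union of all keys; the missing value becomes NaN
print(df_partial)
print(df_partial['city'].isna().tolist())
```

Checking for these gaps right after creation is usually cheaper than debugging them downstream.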

### 1.2 Creating from CSV / JSON
Read data from files - the most common source in data pipelines.

**Why important:** Most pipeline data arrives as files or API responses. Know the key parameters for handling encoding, delimiters, and data type issues.

In [None]:
# From CSV file
# df = pd.read_csv('data.csv')
# df = pd.read_csv('data.csv', sep=';', encoding='utf-8')  # Custom delimiter & encoding
# df = pd.read_csv('data.csv', parse_dates=['date_col'])   # Auto-parse dates

# From JSON file
# df = pd.read_json('data.json')
# df = pd.read_json('data.json', orient='records')  # List of dicts format

# From JSON string (API response)
# from io import StringIO
# api_response = '[{"id": 1, "name": "test"}]'
# df = pd.read_json(StringIO(api_response))  # passing a literal JSON string is deprecated since pandas 2.1

print("Uncomment and modify paths to test with your files")
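
For a version that runs without any files on disk, here is a minimal sketch that reads CSV and JSON from in-memory text. The contents of `csv_text` and `json_text` are invented for illustration; `StringIO` stands in for a real file or an API response body.

```python
from io import StringIO

import pandas as pd

# Hypothetical file contents, held in memory so the example needs no files
csv_text = "order_id;order_date;amount\n1;2024-01-05;1200\n2;2024-01-06;800\n"
json_text = '[{"id": 1, "name": "test"}, {"id": 2, "name": "demo"}]'

# sep=';' handles the custom delimiter; parse_dates converts the column to datetime64
csv_df = pd.read_csv(StringIO(csv_text), sep=';', parse_dates=['order_date'])

# Wrapping the JSON text in StringIO gives read_json a file-like object
json_df = pd.read_json(StringIO(json_text))

print(csv_df.dtypes)
print(json_df)
```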

### 1.3 Basic Attributes
Quick properties to understand DataFrame structure without viewing data.

**Why important:** Before processing, confirm you have the expected columns, correct data types, and proper indexing.

In [None]:
# shape: (rows, columns) tuple
print("Shape:", orders.shape)

# columns: column names as Index
print("\nColumns:", orders.columns.tolist())

# dtypes: data type of each column
print("\nData Types:\n", orders.dtypes)

# index: row labels
print("\nIndex:", orders.index)
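
These attributes can also drive a lightweight schema check. The sketch below uses a small stand-in DataFrame and a hypothetical `expected_columns` list; it is one possible way to confirm columns and dtypes before processing, not a full validation framework.

```python
import pandas as pd

# Small stand-in for the orders DataFrame built earlier
orders_small = pd.DataFrame({
    'order_id': [1, 2, 3],
    'amount': [1200, 800, 25],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-01-06']),
})

# Hypothetical expected schema for this example
expected_columns = ['order_id', 'amount', 'order_date']

# Compare the expected names against .columns, then check dtypes
missing = [c for c in expected_columns if c not in orders_small.columns]
date_ok = pd.api.types.is_datetime64_any_dtype(orders_small['order_date'])
amount_ok = pd.api.types.is_numeric_dtype(orders_small['amount'])

print("Rows:", orders_small.shape[0])
print("Missing columns:", missing)
print("Dates parsed:", date_ok, "| amount numeric:", amount_ok)
```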

---

## Section 1 Summary: DataFrame Creation & Basics

| Method | What It Does | When to Use |
|--------|--------------|-------------|
| `pd.DataFrame(dict)` | Create from dict of lists | Building DataFrames programmatically |
| `pd.DataFrame(list_of_dicts)` | Create from list of dicts | API responses, row-oriented data |
| `pd.read_csv()` | Read from CSV file | Most file-based data sources |
| `pd.read_json()` | Read from JSON file/string | API responses, config files |
| `.shape` | Get (rows, cols) | Quick size check |
| `.columns` | Get column names | Verify schema |
| `.dtypes` | Get data types | Check type correctness |
| `.index` | Get row labels | Understand indexing |

## 2. Inspection & Exploration

**Why it matters:** Before any data transformation, you need to understand your data - its size, types, missing values, and distribution. These functions are your first step in any data pipeline.

### 2.1 head(n=5)
Returns the first n rows of the DataFrame.

**Why important:** Quickly preview your data after loading to verify it loaded correctly and understand the structure.


In [None]:
# Default: first 5 rows
orders.head()

# Custom: first 3 rows
# orders.head(3)

### 2.2 tail(n=5)
Returns the last n rows of the DataFrame.

**Why important:** Check if data loaded completely, especially for time-series data where you want to see the most recent records.

In [None]:
# Default: last 5 rows
orders.tail()

# Custom: last 2 rows
# orders.tail(2)

### 2.3 info()
Prints a concise summary: column names, non-null counts, data types, and memory usage.

**Why important:** Essential for data quality checks - identifies missing values (null counts) and incorrect data types (e.g., dates stored as strings).

In [None]:
# Schema, null counts, dtypes, memory usage
orders.info()
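
The `orders` data is already clean, so `info()` has little to flag there. As an illustration with made-up data, the sketch below shows a date column loaded as strings with one missing value, and how the summary changes after `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical raw load: dates arrived as strings, one of them missing
raw = pd.DataFrame({
    'order_id': [1, 2, 3],
    'order_date': ['2024-01-05', '2024-01-06', None],
})

# Before: order_date has a non-datetime dtype and 2 non-null values out of 3
raw.info()

# After conversion the dtype is datetime64 and the missing value is NaT
raw['order_date'] = pd.to_datetime(raw['order_date'])
raw.info()
```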

### 2.4 describe()
Generates descriptive statistics (count, mean, std, min, max, quartiles) for numeric columns by default. In pandas 2.0 and later, datetime columns are summarized as well.

**Why important:** Quickly spot data issues - outliers (compare min/max to mean), data entry errors, unexpected ranges. For data engineers, this validates data quality before downstream processing.

In [None]:
# Default: statistics for numeric columns (plus datetime columns in pandas 2.0+)
orders.describe()

# Include all columns (object/categorical too)
# orders.describe(include='all')
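
To make the outlier check concrete, here is a small sketch with an invented data entry error (an amount of 120000 where 1200 was probably meant). The factor of 10 is an arbitrary threshold chosen for this example, not a standard rule.

```python
import pandas as pd

# Hypothetical amounts; the last value looks like a data entry error
amounts = pd.DataFrame({'amount': [1200, 800, 25, 1200, 500, 75, 25, 120000]})

stats = amounts['amount'].describe()
print(stats[['mean', '75%', 'max']])

# A max far above the 75th percentile is a hint worth investigating
suspicious = stats['max'] > 10 * stats['75%']
print("Possible outlier:", suspicious)
```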

### 2.5 sample(n=1)
Returns n random rows from the DataFrame.

**Why important:** For large datasets, `head()` and `tail()` only show the edges. `sample()` gives an unbiased view across the entire dataset, which is useful for spotting patterns or issues that are not visible at the beginning or end.

In [None]:
# Random 3 rows
orders.sample(3)

# Reproducible random sample (same result every time)
# orders.sample(3, random_state=42)

# Sample 50% of rows
# orders.sample(frac=0.5)
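
The effect of `random_state` is easiest to see side by side. The following sketch uses a hypothetical 100-row DataFrame: two samples drawn with the same seed match, and `frac=0.5` returns half of the rows.

```python
import pandas as pd

# Hypothetical 100-row DataFrame
df = pd.DataFrame({'order_id': range(1, 101)})

# Same seed, same rows: useful for reproducible debugging
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
print(first.equals(second))

# frac selects a proportion of rows instead of a fixed count
half = df.sample(frac=0.5, random_state=0)
print(len(half))
```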

### 2.6 shape
Returns a tuple of (rows, columns): the dimensions of the DataFrame. Note that `shape` is an attribute, not a method, so it takes no parentheses.

**Why important:** First sanity check after loading data - did you get the expected number of rows? Helps catch truncated loads, filter mistakes, or join explosions.

In [None]:
# Get dimensions: (rows, columns)
print(f"Shape: {orders.shape}")
print(f"Rows: {orders.shape[0]}, Columns: {orders.shape[1]}")
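
A join explosion is one case where comparing `shape` before and after pays off. In this sketch, the `customers` lookup table is hypothetical and deliberately contains a duplicated key, so a left join returns more rows than the input.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['C1', 'C2', 'C1'],
})

# Hypothetical lookup table with a duplicated key ('C1' appears twice)
customers = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2'],
    'segment': ['retail', 'retail', 'b2b'],
})

joined = orders_demo.merge(customers, on='customer_id', how='left')

# Each 'C1' order matched two lookup rows, so 3 rows became 5
print("Before:", orders_demo.shape)
print("After: ", joined.shape)
```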

### 2.7 value_counts()
Returns the frequency of each unique value in a column (Series), sorted from most to least frequent by default.

**Why important:** Essential for understanding categorical data - detect dirty data (typos, inconsistent casing), unexpected values, and data distribution. Critical before groupby operations.

In [None]:
# Count occurrences of each product
orders['product'].value_counts()

# With percentages
# orders['product'].value_counts(normalize=True)

# Include NaN in counts
# orders['product'].value_counts(dropna=False)
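
As an illustration of the dirty-data case, the sketch below uses an invented product column with inconsistent casing and a trailing space. `value_counts()` exposes the variants, and a simple strip-and-lowercase pass merges them.

```python
import pandas as pd

# Hypothetical column with inconsistent casing and stray whitespace
products = pd.Series(['laptop', 'Laptop', 'laptop ', 'phone', 'phone'])

# Three different spellings of "laptop" appear as separate categories
print(products.value_counts())

# Normalizing whitespace and case collapses them into one
cleaned = products.str.strip().str.lower()
print(cleaned.value_counts())
```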

### 2.8 nunique()
Returns the count of unique values per column. NaN values are excluded by default.

**Why important:** Cardinality check - is this column a good join key (should be unique)? Too many categories for one-hot encoding? Detect if a column has only one value (useless for analysis).

In [None]:
# Unique count for all columns
orders.nunique()

# For a single column
# orders['customer_id'].nunique()
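
The cardinality checks described above might look like the following sketch. The data and the `status` column are made up for illustration; the idea is to compare `nunique()` against the row count for a candidate key, and to flag columns with a single value.

```python
import pandas as pd

orders_demo = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': ['C1', 'C2', 'C1', 'C3'],
    'status': ['paid', 'paid', 'paid', 'paid'],  # hypothetical constant column
})

# A column is a candidate unique key if every row has a distinct value
order_id_is_key = orders_demo['order_id'].nunique() == len(orders_demo)
customer_id_is_key = orders_demo['customer_id'].nunique() == len(orders_demo)
print(order_id_is_key, customer_id_is_key)

# Columns with at most one distinct value carry no information
constant_cols = [c for c in orders_demo.columns if orders_demo[c].nunique() <= 1]
print(constant_cols)
```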

### 2.9 isnull().sum()
Returns count of null/missing values per column.

**Why important:** Data quality is everything in pipelines. Know exactly how many nulls exist before deciding to drop, fill, or flag them. Prevents silent failures in downstream transformations.

In [None]:
# Null count per column
orders.isnull().sum()

# Null percentage per column
# (orders.isnull().sum() / len(orders) * 100).round(2)

# Total nulls in entire DataFrame
# orders.isnull().sum().sum()
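
Because `orders` contains no missing values, the calls above all return zeros. Here is a small sketch with invented gaps to show what the counts and percentages look like when nulls are present.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    'amount': [1200, np.nan, 25, np.nan],
    'product': ['laptop', None, 'mouse', 'phone'],
})

# Count of nulls per column
null_counts = df.isnull().sum()
print(null_counts)

# Percentage of nulls per column (mean of a boolean mask is the null share)
null_pct = (df.isnull().mean() * 100).round(2)
print(null_pct)
```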

---

## Section 2 Summary: Inspection & Exploration

| Function | What It Does | Why Important for Data Engineers |
|----------|--------------|----------------------------------|
| `head(n)` | First n rows | Verify data loaded correctly |
| `tail(n)` | Last n rows | Check completeness, see recent records |
| `info()` | Schema, nulls, dtypes, memory | Identify missing values & wrong types |
| `describe()` | Stats for numeric columns | Spot outliers, validate ranges |
| `sample(n)` | Random n rows | Unbiased view of large datasets |
| `shape` | (rows, columns) tuple | Sanity check row counts |
| `value_counts()` | Frequency of unique values | Understand categorical distributions |
| `nunique()` | Count of unique values | Cardinality check for joins/encoding |
| `isnull().sum()` | Null count per column | Data quality assessment |

**Pro Tip:** Run these in order when loading new data: `df.shape` → `df.info()` → `df.isnull().sum()` → `df.describe()` → `df.head()`.
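
The order suggested in the tip can be wrapped in a small helper. `quick_profile` is a hypothetical name used for this sketch, not a pandas function; it simply runs the five checks in sequence.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> None:
    """Print the five first-look checks in the suggested order."""
    print("Shape:", df.shape)
    df.info()
    print("\nNulls per column:\n", df.isnull().sum())
    print("\nSummary statistics:\n", df.describe())
    print("\nFirst rows:\n", df.head())


# Usage with a small made-up DataFrame
demo = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [1200, 800, None]})
quick_profile(demo)
```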