# Data Analysis Roadmap

This notebook begins with a **step-by-step structure** that you should follow when analyzing any dataset.  
It acts like a **checklist** or **guide** for your projects.  

Each section is designed to help you think like a data scientist:  

1. **Setup & Imports** → Load all the libraries you need.  
2. **Load Raw Data** → Bring your dataset into Python so you can start exploring it.  
3. **Inspect Data** → Understand the shape, column types, and missing values.  
4. **Data Cleaning** → Make the dataset usable by fixing problems.  
5. **Preprocessing & Feature Engineering** → Transform the data into a form suitable for analysis or modeling.  
6. **Exploratory Data Analysis (EDA)** → Use plots and statistics to discover insights and relationships.  
7. **Summary & Insights** → Write down what you found and decide the next steps.  

👉 You should try to follow this order whenever you start with a **new dataset**.  
It makes your work **organized, clear, and reproducible**, which is very important in real projects.  

---


## 1) Setup & Imports 

**Goal:** Load and configure the tools you need for analysis in a clean, repeatable way.

**Do here:**
- Import libraries you actually use (avoid unused imports).
- Set display options (e.g., show more columns).
- Fix a random seed for reproducibility if you plan to do modeling later.

**Checklist:**
- [ ] `import pandas as pd`, `import numpy as np`, plus the plotting libraries you need.
- [ ] `pd.set_option(...)` for better table display.
- [ ] Optional: `%matplotlib inline` in classic notebooks.

**Common mistakes:**
- Importing everything “just in case.” Keep it minimal and relevant.
- Mixing plotting backends/styles that make charts inconsistent.


In [None]:
# Minimal, clean imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Nice display options (customize if needed)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: f'{x:,.3f}')

# Optional: fixed randomness for later modeling steps
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
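
As a complement to the global seed above, the following is a minimal sketch of two other common ways to keep randomness repeatable: passing the seed explicitly to `DataFrame.sample`, and building a local NumPy `Generator` from the same seed. The `toy` frame and its values are made up for illustration and are not part of any real dataset.

```python
import numpy as np
import pandas as pd

RANDOM_SEED = 42

# Toy frame standing in for a real dataset (illustrative values only)
toy = pd.DataFrame({"id": range(10), "value": np.arange(10) * 1.5})

# Passing the seed explicitly makes the sample repeatable across runs
sample_a = toy.sample(n=3, random_state=RANDOM_SEED)
sample_b = toy.sample(n=3, random_state=RANDOM_SEED)

# Two generators built from the same seed produce identical draws,
# independently of the global state set by np.random.seed(...)
draws_1 = np.random.default_rng(RANDOM_SEED).normal(size=5)
draws_2 = np.random.default_rng(RANDOM_SEED).normal(size=5)

print("Samples identical:", sample_a.equals(sample_b))
print("Draws identical:", np.array_equal(draws_1, draws_2))
```

Passing the seed where it is used tends to be easier to audit than relying on global state, because each random step documents its own source of randomness.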

## 2) Load Raw Data

**Goal:** Read the dataset from a reliable path and confirm it loaded correctly.

**Do here:**
- Use a **relative path** inside your project/repo when possible.
- If CSV has special separators/encodings, set them explicitly.
- Immediately preview the first/last rows to sanity‑check.

**Checklist:**
- [ ] Use `pd.read_csv(...)` (or appropriate reader).
- [ ] Quick preview: `head()`, `tail()`.
- [ ] Confirm shape and column names look right.

**Common mistakes:**
- Hard‑coding absolute paths (breaks for other users).
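- Forgetting to handle encodings: `pd.read_csv` assumes UTF-8 by default, so files saved in another encoding need an explicit argument (e.g., `encoding='latin-1'`).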

**Writing tip (what to report):**
> “We loaded the dataset from \<source\>. It contains **N rows** and **M columns**. A quick preview indicates \<key columns\> are present and values appear reasonable.”


In [None]:
# TODO: Replace path with your actual file
data_path = 'path/to/your/raw_data.csv'

df = pd.read_csv(data_path)  # add args like sep=';', encoding='utf-8' if needed
display(df.head())
display(df.tail(2))

print('Shape:', df.shape)
print('Columns:', list(df.columns))
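
The sketch below illustrates the points about relative paths, separators, and encodings. To stay self-contained it reads from an in-memory buffer rather than a file; the semicolon-separated, Latin-1 encoded content is invented for illustration, and the `data/raw_data.csv` layout is a hypothetical project structure.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical relative path inside a project (built here, not opened)
data_path = Path("data") / "raw_data.csv"

# Made-up raw bytes: semicolon-separated, Latin-1 encoded, comma as decimal mark
raw_bytes = "city;temp_c\nMálaga;31,5\nZürich;24,0\n".encode("latin-1")

# Separator, encoding, and decimal mark are stated explicitly
# instead of relying on the defaults (',', 'utf-8', '.')
df_demo = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin-1", decimal=",")

print(df_demo.head())
print(f"Loaded {df_demo.shape[0]} rows and {df_demo.shape[1]} columns.")
```

With a real file, the same arguments would be passed alongside `data_path` in place of the in-memory buffer.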

## 3) Inspect Data

**Goal:** Understand the data types, ranges, and missingness before making changes.

**Do here:**
- Check info/types, descriptive stats, and missing values.
- Separate numeric vs. categorical summaries where helpful.

**Checklist:**
- [ ] `df.info()` for dtypes & null counts.
- [ ] `df.describe()` for numeric; `df.describe(include='object')` for categoricals.
- [ ] `df.isnull().sum()` for per‑column missingness.

**Common mistakes:**
- Skipping type checks (later causes model/plot errors).
- Ignoring high-cardinality categoricals (e.g., ID-like columns with a distinct value in almost every row).

**Writing tip (what to report):**
> “Columns \<A,B\> are numeric and \<C,D\> are categorical. We found **X%** missing in \<E\>. Numeric features have ranges \<min–max examples\>.”


In [None]:
# Types and memory
df.info()

# Summary stats
display(df.describe().T)  # numeric
try:
    display(df.describe(include='object').T)  # categorical
except Exception as e:
    print("No object columns or summary failed:", e)

# Missingness overview
missing = df.isnull().sum().sort_values(ascending=False)
display(missing[missing > 0])
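
To report missingness as a percentage and to spot high-cardinality columns, something along these lines may help. It is a sketch on a made-up `toy` frame; the 0.9 threshold for flagging ID-like columns is an arbitrary assumption, not a standard value.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Oslo", "Lima", None, "Oslo"],
    "user_id": ["u1", "u2", "u3", "u4"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)

# Number of distinct (non-missing) values per non-numeric column
cat_cols = toy.select_dtypes(exclude=[np.number]).columns
cardinality = toy[cat_cols].nunique()

# Flag columns where nearly every row has its own value (likely identifiers);
# the 0.9 cutoff is an assumption chosen for this example
high_card = cardinality[cardinality / len(toy) > 0.9].index.tolist()

print(missing_pct)
print(cardinality)
print("Possible ID-like columns:", high_card)
```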

## 4) Data Cleaning

**Goal:** Fix quality issues so the dataset is consistent and analysis‑ready.

**Typical actions:**
- Handle missing values (drop or impute with mean/median/mode or domain rules).
- Remove duplicates (`df.drop_duplicates()`).
- Convert data types (e.g., strings to dates, categories).
- Standardize categorical text (e.g., `'Male'`, `'male'` → `'Male'`).

**Checklist:**
- [ ] Choose an imputation strategy per column (justify it).
- [ ] Drop exact duplicate rows where appropriate (keep them if repeated records are legitimate).
- [ ] Convert columns to correct dtypes (`pd.to_datetime`, `.astype('category')`).
- [ ] Verify changes didn’t shrink/grow rows unintentionally.

**Common mistakes:**
- Imputing the target column, or computing imputation values from data that would not be available at prediction time (leakage).
- Mixing string categories due to whitespace/case differences.

**Writing tip:**
> “We removed **K** duplicate rows, converted \<date_col\> to datetime, normalized \<category_col\> labels, and imputed \<num_col\> with median due to skewness.”


In [None]:
# Example cleaning patterns — adapt to your dataset

# 1) Remove duplicates
before = len(df)
df = df.drop_duplicates()
after = len(df)
print(f"Removed {before - after} duplicate rows.")

# 2) Type conversions (edit to your columns)
# df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['category_col'] = df['category_col'].str.strip().str.title()

# 3) Basic missing-value handling (examples)
# numeric_cols = df.select_dtypes(include=[np.number]).columns
# for c in numeric_cols:
#     df[c] = df[c].fillna(df[c].median())

# cat_cols = df.select_dtypes(include=['object']).columns
# for c in cat_cols:
#     df[c] = df[c].fillna('Unknown')

# Sanity check after cleaning
df.info()  # info() prints directly and returns None, so it is not wrapped in display()
display(df.head(3))
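
The commented patterns above can be seen end to end on a small example. This is a sketch on invented data (the `gender`, `signup`, and `income` columns are hypothetical), and the median and `'Unknown'` fills are illustrative choices rather than recommendations for every dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one exact duplicate row, messy labels,
# an unparseable date, and a missing numeric value
toy = pd.DataFrame({
    "gender": [" male", "Male", "FEMALE ", "Male", None],
    "signup": ["2024-01-05", "2024-02-10", "not a date", "2024-02-10", "2024-03-01"],
    "income": [1000.0, 2000.0, np.nan, 2000.0, 9000.0],
})

rows_before = len(toy)

# 1) Remove exact duplicates (copy to avoid modifying a view of the original)
clean = toy.drop_duplicates().copy()
rows_removed = rows_before - len(clean)

# 2) Standardize text labels: trim whitespace, unify case
clean["gender"] = clean["gender"].str.strip().str.title()

# 3) Convert types; unparseable dates become NaT instead of raising
clean["signup"] = pd.to_datetime(clean["signup"], errors="coerce")

# 4) Impute: median for the numeric column, explicit label for the categorical
clean["income"] = clean["income"].fillna(clean["income"].median())
clean["gender"] = clean["gender"].fillna("Unknown")

# 5) Verify the row count changed only by the duplicates removed
print(f"Removed {rows_removed} duplicate rows; {len(clean)} rows remain.")
print(clean)
```

Note that `errors="coerce"` trades one problem for another: the bad date no longer raises, but it becomes a missing value that still needs a decision.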

## 5) Preprocessing & Feature Engineering

**Goal:** Prepare features that better represent the problem for models/analysis.

**Typical actions:**
- Rename columns to consistent, readable names.
- Encode categoricals (one‑hot, ordinal with care).
- Scale/normalize numeric features (only when needed).
- Create domain features (ratios, bins, interactions).

**Checklist:**
- [ ] Document any encoding/scaling choices.
- [ ] Keep a list of original → transformed columns.
- [ ] Avoid data leakage (fit transforms on train only).

**Common mistakes:**
- One‑hot exploding dimensionality without need.
- Scaling already standardized variables unnecessarily.

**Writing tip:**
> “We one‑hot encoded \<C\> (low cardinality), standardized \<X,Y\> due to varying scales, and engineered \<Z = X/Y\> to capture intensity.”


In [None]:
# Example renaming
# df = df.rename(columns={'Old Name':'new_name'})

# Example encoding
# df = pd.get_dummies(df, columns=['categorical_col'], drop_first=True)

# Example scaling (do this after train/test split in modeling workflows)
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# df[['num1','num2']] = scaler.fit_transform(df[['num1','num2']])

display(df.head(3))
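
To make the "fit transforms on train only" item concrete, here is a minimal sketch that standardizes a column by hand with pandas instead of `StandardScaler`, so the source of the statistics is visible. The data, the positional split, and the `column_map` bookkeeping dictionary are all assumptions made for illustration.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "color": ["red", "blue", "red", "green", "blue", "red"],
})

# Hypothetical split: first four rows are train, last two are test
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Mean and standard deviation come from the training rows only
mu = train["height_cm"].mean()
sigma = train["height_cm"].std(ddof=0)

# The same training statistics are applied to both splits
train["height_z"] = (train["height_cm"] - mu) / sigma
test["height_z"] = (test["height_cm"] - mu) / sigma

# One-hot encoding (shown on the full toy frame for brevity),
# with a record of which new columns came from which original column
encoded = pd.get_dummies(toy, columns=["color"], drop_first=True)
column_map = {"color": [c for c in encoded.columns if c.startswith("color_")]}

print(train[["height_cm", "height_z"]])
print(test[["height_cm", "height_z"]])
print(column_map)
```

Because the test rows are scaled with training statistics, their standardized values need not have mean 0 or unit variance; that is expected and is the point of avoiding leakage.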

## 6) Exploratory Data Analysis (EDA) 

**Goal:** Discover patterns, anomalies, and relationships to guide insights and modeling.

**Do here:**
- Plot distributions (histograms), outliers (boxplots).
- Explore relationships (scatterplots, group comparisons).
- Check correlations (heatmap for numeric features).

**Checklist:**
- [ ] At least one distribution plot per key numeric feature.
- [ ] Compare groups if there is a target/class column.
- [ ] Correlation matrix inspected; note strong/weak links.

**Common mistakes:**
- Drawing conclusions from tiny subgroups.
- Ignoring class imbalance in target variable.

**Writing tip:**
> “Feature \<X\> is right‑skewed with outliers; \<Y\> correlates moderately with the target (r≈0.45). Groups \<A vs. B\> differ in median \<metric\>.”


In [None]:
# --- Histograms (numeric)
numeric_cols = df.select_dtypes(include=[np.number]).columns[:8]  # limit to first few for readability
df[numeric_cols].hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

# --- Boxplot example (single feature; replace with your column)
# sns.boxplot(x=df['your_numeric_col'])
# plt.show()

# --- Scatter example (replace with meaningful pair)
# plt.figure(figsize=(5, 4))
# plt.scatter(df['feature_x'], df['feature_y'], alpha=0.6)
# plt.xlabel('feature_x')
# plt.ylabel('feature_y')
# plt.show()

# --- Correlation heatmap (numeric only)
if len(numeric_cols) > 1:
    corr = df.select_dtypes(include=[np.number]).corr()
    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=False, cmap='coolwarm', center=0)
    plt.title('Correlation Heatmap (numeric features)')
    plt.show()
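
Plots are easier to interpret alongside a few numbers. The sketch below, on invented data with a hypothetical binary `target` column, checks class balance, compares groups while showing their sizes, and ranks numeric features by correlation with the target.

```python
import pandas as pd

# Toy data (made up): an imbalanced binary target and two numeric features
toy = pd.DataFrame({
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
})

# Class balance: proportion of rows in each target class
class_share = toy["target"].value_counts(normalize=True)

# Group comparison: report the group size next to the median,
# so conclusions are not drawn from tiny subgroups unnoticed
group_summary = toy.groupby("target")["x"].agg(["count", "median"])

# Pearson correlation of each numeric feature with the target
target_corr = toy.corr()["target"].drop("target").sort_values(ascending=False)

print(class_share)
print(group_summary)
print(target_corr)
```

Here the positive class has only two rows, so any difference in medians would deserve caution before being reported as a finding.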

## 7) Summary & Insights

**Goal:** Convert findings into clear statements and next actions.

**Write here (replace bullets with your conclusions):**
- **Data Quality:** (e.g., “We imputed 3% missing in X with median; standardized labels in Y.”)
- **Key Patterns:** (e.g., “X is skewed; Y increases with Z; A and B groups differ significantly.”)
- **Implications:** (e.g., “We expect models to benefit from scaling X and encoding C; consider class balancing.”)
- **Next Steps:** (e.g., “Train/test split; baseline model; feature selection; validate with cross‑validation.”)

**Rubric tips for strong reports:**
- Support claims with a figure/table reference.
- Quantify (use %/corr values) rather than vague terms.
- Keep it reproducible—show code used to produce each figure/table.


> ✍️ **Template paragraph to adapt:**  
We followed a structured workflow to analyze the dataset. After cleaning duplicates and fixing types, we found moderate missingness in \<cols\> which we imputed with \<strategy\>. EDA showed \<main pattern 1\> and \<pattern 2\>, with \<feature\> moderately correlated with the target (r≈\<value\>). These insights motivate \<planned preprocessing/modeling steps\>. Next, we will proceed with \<train/test split, baseline model, evaluation plan\>.