# EDA

### **Basic Structure Check**

| Step             | Command         | Purpose                     |
|------------------|-----------------|-----------------------------|
| Shape of dataset | `df.shape`      | Get rows & columns count    |
| First few rows   | `df.head()`     | Peek at initial rows        |
| Last few rows    | `df.tail()`     | Peek at final rows          |
| Column info      | `df.info()`     | Dtypes + non-null counts    |
| Summary stats    | `df.describe()` | Mean, std, min, max etc.    |
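
As a minimal sketch of the checks above, the snippet below runs them on a small made-up DataFrame. The column names and values are hypothetical placeholders, not data from this guide.

```python
import pandas as pd

# Hypothetical toy data; column names are placeholders for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", None],
    "income": [40000.0, 52000.0, 61000.0, 58000.0, 75000.0],
})

n_rows, n_cols = df.shape   # (rows, columns)
first_rows = df.head(3)     # first 3 rows
last_rows = df.tail(2)      # last 2 rows
df.info()                   # prints dtypes and non-null counts per column
summary = df.describe()     # count, mean, std, min, quartiles, max (numeric columns by default)
```

Note that `df.info()` reports non-null counts, so missing values show up as counts lower than the number of rows.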

------------------------------------------------------------------------

### **Deep Dive Exploration**

| Step              | Command                    | Purpose                                 |
|------------------------|------------------------|------------------------|
| Data types        | `df.dtypes`                | Ensure correct types (int, float, etc.) |
| Unique values     | `df['col'].unique()`       | See all unique entries                  |
| Value counts      | `df['col'].value_counts()` | Frequency of each value                 |
| Missing values    | `df.isnull().sum()`        | Count nulls in each column              |
| Missing value viz | `msno.matrix(df)`          | Visualize missing data (📦 `missingno`) |
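
The pandas-only commands in this table can be tried on a toy frame like the one below. This is a sketch with invented columns (`color`, `size`); the `missingno` plot is left out so the example needs nothing beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame({
    "color": ["red", "blue", "red", None, "green", "red"],
    "size": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
})

dtypes = df.dtypes                                          # dtype of each column
unique_colors = df["color"].unique()                        # distinct entries, missing value included
color_counts = df["color"].value_counts()                   # missing values excluded by default
color_counts_all = df["color"].value_counts(dropna=False)   # missing values counted too
missing_per_col = df.isnull().sum()                         # number of nulls per column
missing_share = df.isnull().mean()                          # fraction of nulls per column
```

The fraction from `df.isnull().mean()` is often easier to compare across columns than the raw count.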

------------------------------------------------------------------------

### **Distribution & Summary**

| Step        | Command                                     | Purpose                        |
|------------------------|------------------------|------------------------|
| Histograms  | `df.hist()` or `sns.histplot()`             | Visualize distribution         |
| Boxplot     | `sns.boxplot(x='cat', y='num', data=df)`    | Detect outliers                |
| Violin plot | `sns.violinplot(x='cat', y='num', data=df)` | Distribution shape by group    |
| Pair plot   | `sns.pairplot(df)`                          | Relationships between features |
| Count plot  | `sns.countplot(x='col', data=df)`           | Category frequency             |

------------------------------------------------------------------------

### **Correlation Analysis**

| Step               | Command                                               | Purpose                   |
|------------------------|------------------------|------------------------|
| Correlation matrix | `df.corr(numeric_only=True)`                          | Numeric correlation check |
| Heatmap            | `sns.heatmap(df.corr(numeric_only=True), annot=True)` | Visual correlation check  |
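
The sketch below shows the correlation matrix on a frame that also contains a text column. The data is made up; `numeric_only=True` (available from pandas 1.5) is passed because pandas 2.x raises an error when `corr()` meets non-numeric columns.

```python
import pandas as pd

# Hypothetical data: three numeric columns and one text column.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exactly linear in x
    "z": [5, 3, 4, 1, 2],
    "label": ["a", "b", "a", "b", "a"],
})

corr = df.corr(numeric_only=True)   # Pearson correlation; the text column is skipped

# Rank the other columns by absolute correlation with x.
ranked = corr["x"].drop("x").abs().sort_values(ascending=False)
```

Ranking by absolute value keeps strong negative relationships near the top instead of at the bottom.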

------------------------------------------------------------------------

### **GroupBy & Aggregation**

| Step                | Command                                              | Purpose          |
|------------------------|------------------------|------------------------|
| Group and summarize | `df.groupby('col').mean(numeric_only=True)`          | Avg per category |
| Custom aggregation  | `df.groupby('col').agg({'target': ['mean', 'sum']})` | Flexible stats   |
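
A small sketch of both aggregation styles follows, using invented columns (`dept`, `salary`, `name`). The named-aggregation form at the end is an optional alternative that yields flat column names.

```python
import pandas as pd

# Hypothetical data; "name" is a text column that cannot be averaged.
df = pd.DataFrame({
    "dept": ["ops", "ops", "eng", "eng", "eng"],
    "salary": [50, 70, 90, 110, 100],
    "name": ["ann", "bob", "cy", "dee", "eli"],
})

# Mean of the numeric columns per group; numeric_only skips the text column.
mean_by_dept = df.groupby("dept").mean(numeric_only=True)

# Dict form: several statistics for one column (result has MultiIndex columns).
stats = df.groupby("dept").agg({"salary": ["mean", "sum"]})

# Named aggregation: same numbers, flat column names.
named = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    total_salary=("salary", "sum"),
)
```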

------------------------------------------------------------------------

### **Outlier Detection**

| Step               | Command                       | Purpose                          |
|------------------------|------------------------|------------------------|
| IQR-based outliers | `Q1 = ...` → `outliers = ...` | Manually detect outliers         |
| Z-score method     | `scipy.stats.zscore(df)`      | Standard score outlier detection |
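
The table leaves the IQR steps elided, so the following is only one common way to fill them in, not necessarily the author's. It assumes the conventional 1.5 × IQR fences and a z-score threshold of 3; both cutoffs and the sample values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample: 19 ordinary values and one extreme value.
s = pd.Series(
    [10, 11, 12, 11, 10, 12, 11, 13, 12, 11,
     10, 12, 11, 13, 12, 11, 10, 12, 11, 95],
    name="value",
)

# IQR rule (assumed 1.5 * IQR fences).
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score rule (assumed threshold of 3 standard deviations).
z = stats.zscore(s.to_numpy())
z_outliers = s[np.abs(z) > 3]
```

On very small samples the z-score rule can miss obvious outliers, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less sensitive to that.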

------------------------------------------------------------------------

### **Time Series Analysis (if date exists)**

| Step                | Command                         | Purpose                                |
|------------------------|------------------------|------------------------|
| Convert to datetime | `pd.to_datetime(df['date'])`    | Enables time ops                       |
| Set as index        | `df.set_index('date')`          | Prepares for resampling                |
| Resample & plot     | `df.resample('M').sum().plot()` | Monthly trend (`'ME'` in pandas ≥ 2.2) |
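
The three steps chain together roughly as in this sketch. The `date` and `sales` columns are invented, and the month-start alias `"MS"` is used here as an assumption of convenience, since the month-end alias was renamed from `'M'` to `'ME'` in pandas 2.2 while `"MS"` works across versions.

```python
import pandas as pd

# Hypothetical sales records with dates stored as strings.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-28", "2024-03-15"],
    "sales": [100, 150, 80, 120, 200],
})

df["date"] = pd.to_datetime(df["date"])   # strings -> datetime64
df = df.set_index("date").sort_index()    # resample needs a datetime index

# Sum per calendar month, labelled by the first day of each month.
monthly = df["sales"].resample("MS").sum()
# monthly.plot() would draw the trend in an interactive session.
```

Both `pd.to_datetime` and `set_index` return new objects, so the results have to be assigned back as shown.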

------------------------------------------------------------------------

### **Feature Engineering / Scaling**

| Step             | Command                                   | Purpose                       |
|------------------------|------------------------|------------------------|
| Standardization  | `StandardScaler().fit_transform(X)`       | Rescale to mean 0, std 1      |
| Normalization    | `MinMaxScaler().fit_transform(X)`         | Scale between 0–1             |
| One-hot encoding | `pd.get_dummies(df['cat'])`               | Convert categories to numeric |
| Label encoding   | `LabelEncoder().fit_transform(df['cat'])` | Integer-encode target labels  |
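
Here is a brief sketch of the four transforms on a toy frame with invented columns (`height`, `cat`). It fits the scalers on the whole frame for brevity; in a real workflow they would normally be fitted on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Hypothetical data: one numeric feature, one categorical column.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "cat": ["a", "b", "a", "c"],
})

X = df[["height"]]                             # scalers expect 2-D input
X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)     # rescaled to the [0, 1] range

dummies = pd.get_dummies(df["cat"], prefix="cat")   # one indicator column per category
labels = LabelEncoder().fit_transform(df["cat"])    # integer codes in sorted class order
```

scikit-learn documents `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` or one-hot encoding is usually the better fit.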

------------------------------------------------------------------------

### **Dimensionality Reduction**

| Step          | Command                                | Purpose                             |
|------------------------|------------------------|------------------------|
| PCA (sklearn) | `PCA(n_components=2).fit_transform(X)` | Reduce dimensions for visualization |
| Manual PCA    | Eigen decomposition                    | Understand from scratch             |
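
For the "manual PCA" row, one way to do the eigen decomposition with NumPy is sketched below. The data is random and the mixing matrix is arbitrary, both chosen only for illustration. `PCA(n_components=2)` from scikit-learn should give the same projection up to the sign of each component.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 correlated features.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 0.1]])
X = rng.normal(size=(100, 3)) @ mix

Xc = X - X.mean(axis=0)                   # 1. center each feature
cov = np.cov(Xc, rowvar=False)            # 2. covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigen decomposition (ascending eigenvalues)

order = np.argsort(eigvals)[::-1]         # 4. sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = eigvecs[:, :2]               # 5. keep the top 2 directions
X_2d = Xc @ components                    # 6. project the centered data
explained_ratio = eigvals[:2] / eigvals.sum()
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.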

------------------------------------------------------------------------

### **Ready for Modeling**

| Step             | Command                                    | Purpose                                    |
|------------------------|------------------------|------------------------|
| Train-test split | `train_test_split(X, y)`                   | Prepare train & test sets                  |
| Baseline model   | `LinearRegression().fit(X_train, y_train)` | First model check                          |
| Evaluation       | `accuracy_score()`, `r2_score()`, etc.     | Accuracy (classification), R² (regression) |
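
Put together, a baseline regression run might look like the sketch below. The data is synthetic (a noisy line with slope 3 and intercept 2), and the 80/20 split and fixed seeds are assumptions made so the run is repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)   # fit on the training split only
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)                      # regression metric on held-out data
```

`accuracy_score` applies to classifiers, so it would pair with a model such as `LogisticRegression` rather than with this regression baseline.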

------------------------------------------------------------------------

### **Extra Debug Tools**

| Step             | Command                         | Purpose                    |
|------------------|---------------------------------|----------------------------|
| Check duplicates | `df.duplicated().sum()`         | Count repeated rows        |
| Drop column      | `df.drop('col', axis=1)`        | Clean unnecessary features |
| Rename column    | `df.rename(columns={'a': 'b'})` | Rename for clarity         |
| Convert datatype | `df['col'].astype(float)`       | Fix type mismatch          |
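
These cleanup calls all return new objects rather than modifying `df` in place, so they chain naturally. The sketch below uses invented columns (`a`, `unused`) and adds `drop_duplicates()`, which is not in the table, to actually remove the rows that `duplicated()` counts.

```python
import pandas as pd

# Hypothetical messy data: numbers stored as strings, one repeated row.
df = pd.DataFrame({
    "a": ["1.5", "2.0", "2.0", "3.25"],
    "unused": ["x", "y", "y", "z"],
})

n_dupes = df.duplicated().sum()    # rows identical to an earlier row

clean = (
    df.drop_duplicates()           # remove the repeated rows
      .drop(columns="unused")      # same effect as drop('unused', axis=1)
      .rename(columns={"a": "b"})  # rename for clarity
      .astype({"b": float})        # string -> float
)
```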