# DS4440 - Practical Neural Networks
## Week 5 : Exploratory Data Analysis 

___
**Instructor** : Prof. Steve Schmidt <br/>
**Teaching Assistants** : Vishwajeet Hogale (hogale.v@northeastern.edu) | Chaitanya Agarwal (agarwal.cha@northeastern.edu)

## Part 1 : Exploratory Data Analysis

EDA helps you get to know your data better by:

1. Looking at the Data: Understanding what the data looks like, checking for patterns, missing values, or strange data points.
2. Visualizing: Using graphs, charts, and plots to visualize data distribution and relationships between variables.
3. Statistical Summary: Calculating basic statistics like averages, medians, and standard deviations to get a sense of the data's spread.

It's like getting a "first impression" of your data to spot trends, relationships, and problems, so you can make informed decisions for the next steps in analysis or modeling.

In this notebook, we'll follow the **four** key sections below:

1. **One Hot Encoding**
2. **Label Encoding**
3. **Scaling**
    - **Normalization**
    - **Standardization**
4. **Class Imbalance**


Let's dive in and explore how to prepare this data so that a neural network can learn from it!

## 0. Setup and Load libraries

The cell below installs the libraries and packages this notebook needs to run without errors.

In [1]:
! pip install -r requirements.txt



## 1. Data Gathering

### **About the Dataset**  

The **Superstore** dataset contains **sales transaction records** from a retail store, providing insights into **shipping, customer segments, locations, product categories, and financial metrics**. This dataset is widely used for **sales analysis, demand forecasting, and business intelligence applications**.  

The dataset includes the following **13 attributes**:  

1. **Ship Mode**: The method of shipping used for the order (e.g., Standard Class, First Class)  
2. **Segment**: Customer segment (e.g., Consumer, Corporate, Home Office)  
3. **Country**: The country where the order was placed  
4. **City**: The city where the order was shipped  
5. **State**: The state where the order was shipped  
6. **Postal Code**: The postal code of the delivery address  
7. **Region**: The geographical region (e.g., West, East, Central, South)  
8. **Category**: The broad category of the product (e.g., Furniture, Office Supplies, Technology)  
9. **Sub-Category**: The specific type of product within a category (e.g., Chairs, Phones, Binders)  
10. **Sales**: The revenue generated from the sale  
11. **Quantity**: The number of units sold  
12. **Discount**: The discount applied to the sale  
13. **Profit**: The profit earned from the sale after deducting costs  

### **Dataset Use Cases**  
This dataset is useful for various business analytics and machine learning tasks, including:  
- **Sales Forecasting**: Predicting future sales based on historical data  
- **Customer Segmentation**: Identifying different customer groups based on purchasing behavior  
- **Profitability Analysis**: Analyzing which products and regions generate the highest profits  
- **Discount Impact**: Evaluating how discounts affect overall sales and profitability  

### **Dataset Source**  
The **Superstore** dataset provides a great opportunity to explore **data preprocessing, exploratory data analysis (EDA), and predictive modeling** in retail analytics.


In [2]:
import pandas as pd
import numpy as np
from torchvision import datasets, transforms

# Load the dataset 
df = pd.read_csv("./data/Superstore.csv")

In [3]:
df.columns

Index(['Ship Mode', 'Segment', 'Country', 'City', 'State', 'Postal Code',
       'Region', 'Category', 'Sub-Category', 'Sales', 'Quantity', 'Discount',
       'Profit'],
      dtype='object')

In [4]:
df.head()

,Ship Mode,Segment,Country,City,State,Postal Code,Region,Category,Sub-Category,Sales,Quantity,Discount,Profit
0,Second Class,Consumer,United States,Henderson,Kentucky,42420,South,Furniture,Bookcases,261.96,2,0.0,41.9136
1,Second Class,Consumer,United States,Henderson,Kentucky,42420,South,Furniture,Chairs,731.94,3,0.0,219.582
2,Second Class,Corporate,United States,Los Angeles,California,90036,West,Office Supplies,Labels,14.62,2,0.0,6.8714
3,Standard Class,Consumer,United States,Fort Lauderdale,Florida,33311,South,Furniture,Tables,957.5775,5,0.45,-383.031
4,Standard Class,Consumer,United States,Fort Lauderdale,Florida,33311,South,Office Supplies,Storage,22.368,2,0.2,2.5164


### Check the size of the training dataset

In [5]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")

The dataset has 9994 rows and 13 columns


### Explore the NaNs

In [6]:
df.isna().sum()

Ship Mode       0
Segment         0
Country         0
City            0
State           0
Postal Code     0
Region          0
Category        0
Sub-Category    0
Sales           0
Quantity        0
Discount        0
Profit          0
dtype: int64

### Check the column datatypes


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Ship Mode     9994 non-null   object 
 1   Segment       9994 non-null   object 
 2   Country       9994 non-null   object 
 3   City          9994 non-null   object 
 4   State         9994 non-null   object 
 5   Postal Code   9994 non-null   int64  
 6   Region        9994 non-null   object 
 7   Category      9994 non-null   object 
 8   Sub-Category  9994 non-null   object 
 9   Sales         9994 non-null   float64
 10  Quantity      9994 non-null   int64  
 11  Discount      9994 non-null   float64
 12  Profit        9994 non-null   float64
dtypes: float64(3), int64(2), object(8)
memory usage: 1015.1+ KB


### Get statistical information on numerical columns

In [8]:
df.describe()

,Postal Code,Sales,Quantity,Discount,Profit
count,9994.0,9994.0,9994.0,9994.0,9994.0
mean,55190.379428,229.858001,3.789574,0.156203,28.656896
std,32063.69335,623.245101,2.22511,0.206452,234.260108
min,1040.0,0.444,1.0,0.0,-6599.978
25%,23223.0,17.28,2.0,0.0,1.72875
50%,56430.5,54.49,3.0,0.2,8.6665
75%,90008.0,209.94,5.0,0.2,29.364
max,99301.0,22638.48,14.0,0.8,8399.976


### Get statistical information on categorical columns

In [9]:
categorical_columns_df = df.select_dtypes("object")
categorical_columns_df.value_counts()

Ship Mode       Segment      Country        City           State       Region   Category         Sub-Category
Standard Class  Consumer     United States  New York City  New York    East     Office Supplies  Binders         61
                                                                                                 Paper           41
                                            Los Angeles    California  West     Office Supplies  Paper           40
                                            Houston        Texas       Central  Office Supplies  Binders         31
                                            Los Angeles    California  West     Furniture        Furnishings     31
                                                                                                                 ..
Second Class    Corporate    United States  Denver         Colorado    West     Furniture        Furnishings      1
                                            Delray Beach   Florida     South  
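
Calling `value_counts()` on the whole DataFrame counts unique *combinations* of all categorical columns, which is what the output above shows. For a per-column view, a minimal sketch on a small hypothetical frame could look like the following. The name `toy_df` and its values are illustrative assumptions, not the Superstore data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the dataset (values are made up).
toy_df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "Standard Class", "First Class"],
    "Segment": ["Consumer", "Consumer", "Corporate", "Consumer"],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# Keep only the non-numeric columns.
toy_categorical = toy_df.select_dtypes(exclude="number")

# count / unique / top / freq for every categorical column at once.
summary = toy_categorical.describe()
print(summary)

# Frequency of each category, one column at a time.
per_column_counts = {
    col: toy_categorical[col].value_counts() for col in toy_categorical.columns
}
for col, counts in per_column_counts.items():
    print(f"\n{col}")
    print(counts)
```

On the real data, the same two calls could be applied to `categorical_columns_df`.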

In [10]:
categorical_columns_df.head()

,Ship Mode,Segment,Country,City,State,Region,Category,Sub-Category
0,Second Class,Consumer,United States,Henderson,Kentucky,South,Furniture,Bookcases
1,Second Class,Consumer,United States,Henderson,Kentucky,South,Furniture,Chairs
2,Second Class,Corporate,United States,Los Angeles,California,West,Office Supplies,Labels
3,Standard Class,Consumer,United States,Fort Lauderdale,Florida,South,Furniture,Tables
4,Standard Class,Consumer,United States,Fort Lauderdale,Florida,South,Office Supplies,Storage


## 2. One Hot Encoding


### 2.1 **What is One-Hot Encoding?**  
One-hot encoding is a technique used to convert **categorical variables** into a numerical format by creating **binary (0 or 1) columns** for each unique category.  

For example, **Ship Mode** would be transformed as follows:  

| Ship Mode       | Standard Class | First Class | Second Class | Same Day |  
|---------------|---------------|------------|-------------|----------|  
| Standard Class | 1             | 0          | 0           | 0        |  
| First Class   | 0             | 1          | 0           | 0        |  
| Second Class  | 0             | 0          | 1           | 0        |  
| Same Day      | 0             | 0          | 0           | 1        |  


In [17]:
# Selecting categorical columns for One-Hot Encoding
categorical_columns = categorical_columns_df.columns

# Apply One-Hot Encoding using pandas get_dummies()
df_ohe = pd.get_dummies(df, columns=categorical_columns)


In [18]:
df_ohe.head()

,Postal Code,Sales,Quantity,Discount,Profit,Ship Mode_First Class,Ship Mode_Same Day,Ship Mode_Second Class,Ship Mode_Standard Class,Segment_Consumer,...,Sub-Category_Envelopes,Sub-Category_Fasteners,Sub-Category_Furnishings,Sub-Category_Labels,Sub-Category_Machines,Sub-Category_Paper,Sub-Category_Phones,Sub-Category_Storage,Sub-Category_Supplies,Sub-Category_Tables
0,42420,261.96,2,0.0,41.9136,False,False,True,False,True,...,False,False,False,False,False,False,False,False,False,False
1,42420,731.94,3,0.0,219.582,False,False,True,False,True,...,False,False,False,False,False,False,False,False,False,False
2,90036,14.62,2,0.0,6.8714,False,False,True,False,False,...,False,False,False,True,False,False,False,False,False,False
3,33311,957.5775,5,0.45,-383.031,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,True
4,33311,22.368,2,0.2,2.5164,False,False,False,True,True,...,False,False,False,False,False,False,False,True,False,False


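One-hot encoding creates one column per category, so the columns of each feature always sum to 1. **Dummy encoding** drops one category per feature to avoid this multicollinearity, and **target encoding** may be a better alternative when a feature has many unique categories.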

In [19]:
# Selecting categorical columns for dummy encoding
categorical_columns = categorical_columns_df.columns

# Apply dummy encoding: drop_first=True drops the first category of each feature
df_ohe = pd.get_dummies(df, columns=categorical_columns, drop_first=True)


In [20]:
df_ohe.head()

,Postal Code,Sales,Quantity,Discount,Profit,Ship Mode_Same Day,Ship Mode_Second Class,Ship Mode_Standard Class,Segment_Corporate,Segment_Home Office,...,Sub-Category_Envelopes,Sub-Category_Fasteners,Sub-Category_Furnishings,Sub-Category_Labels,Sub-Category_Machines,Sub-Category_Paper,Sub-Category_Phones,Sub-Category_Storage,Sub-Category_Supplies,Sub-Category_Tables
0,42420,261.96,2,0.0,41.9136,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,42420,731.94,3,0.0,219.582,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,90036,14.62,2,0.0,6.8714,False,True,False,True,False,...,False,False,False,True,False,False,False,False,False,False
3,33311,957.5775,5,0.45,-383.031,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True
4,33311,22.368,2,0.2,2.5164,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False


### 2.2 **When to Use One-Hot Encoding?**  
- When dealing with **categorical features** that do not have an **ordinal relationship** (i.e., the categories do not have a meaningful order).  
- When using machine learning models that cannot interpret categorical values directly, such as **linear regression, decision trees, and neural networks**.  


### 2.3 **Why do we use One-Hot Encoding?**

Several categorical columns in the **Superstore dataset** require one-hot encoding to be used effectively in machine learning models. Some key features that may need encoding include:  

- **Ship Mode** (e.g., Standard Class, First Class, Second Class)  
- **Segment** (e.g., Consumer, Corporate, Home Office)  
- **Region** (e.g., West, East, Central, South)  
- **Category** (e.g., Furniture, Office Supplies, Technology)  
- **Sub-Category** (e.g., Chairs, Phones, Binders)


### **Considerations**  
- One-hot encoding **increases the number of columns**, which can lead to **high-dimensional data**.    
- Some implementations of **tree-based models (e.g., LightGBM, CatBoost, and recent XGBoost versions) can handle categorical data directly** without needing one-hot encoding, while others (e.g., scikit-learn's Random Forest) still expect numeric input.  
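
To get a feel for how quickly the column count grows, the sketch below estimates the number of one-hot columns before encoding. It uses a small hypothetical frame (`toy_df` and its values are assumptions for illustration); on the real data the same idea would apply to `df` and `categorical_columns`.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Second Class", "First Class", "Same Day"],
    "Region": ["West", "East", "West", "South"],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})
categorical_cols = ["Ship Mode", "Region"]

# Each categorical column contributes one new column per unique category.
expected_new_columns = int(toy_df[categorical_cols].nunique().sum())

encoded = pd.get_dummies(toy_df, columns=categorical_cols)
encoded_drop_first = pd.get_dummies(toy_df, columns=categorical_cols, drop_first=True)

print(expected_new_columns)         # 7 one-hot columns (4 ship modes + 3 regions)
print(encoded.shape[1])             # 8: the one-hot columns plus the numeric column
print(encoded_drop_first.shape[1])  # 6: one column fewer per encoded feature
```

Checking `nunique()` first is a cheap way to spot features such as **City** that would add hundreds of columns.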


## 3. Label Encoding

### 3.1 **What is Label Encoding?**  
Label encoding is a technique used to convert **categorical variables** into numerical values by assigning each unique category a **unique integer (0, 1, 2, ...)**. Instead of creating multiple binary columns like one-hot encoding, it replaces each category with a single numeric label.

For example, if we apply label encoding to **Ship Mode**, we get:  

| Ship Mode       | Encoded Value |  
|---------------|--------------|  
| Standard Class | 0            |  
| First Class   | 1            |  
| Second Class  | 2            |  
| Same Day      | 3            | 
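
The mapping in the table above depends on how the integers are assigned. As a rough sketch using only pandas (the `ship_mode` series is a hypothetical example), `pd.factorize` numbers categories in order of first appearance, while category codes follow the sorted category order. scikit-learn's `LabelEncoder` also sorts its classes, so it would behave like the second variant.

```python
import pandas as pd

# Hypothetical sample of the Ship Mode column.
ship_mode = pd.Series(
    ["Standard Class", "First Class", "Second Class", "Same Day", "First Class"]
)

# factorize assigns integers in order of first appearance,
# which reproduces the table above for this particular ordering.
codes, uniques = pd.factorize(ship_mode)
print(codes.tolist())    # [0, 1, 2, 3, 1]
print(list(uniques))

# Category codes follow the sorted (alphabetical) category order instead.
sorted_codes = ship_mode.astype("category").cat.codes
print(sorted_codes.tolist())  # [3, 0, 2, 1, 0]
```

Either way the integers are arbitrary identifiers here, since shipping modes have no numeric meaning.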

### 3.2 **Why Use Label Encoding?**  
- **Memory Efficient:** Unlike one-hot encoding, which creates multiple new columns, label encoding uses a single column, reducing memory usage.  
- **Useful for Tree-Based Models:** Algorithms like **decision trees, random forests, and XGBoost** can handle categorical data in numerical format without issues.  
- **Maintains Ordinal Relationships:** If the categorical variable has a meaningful order (e.g., "Low," "Medium," "High"), label encoding preserves that relationship, provided the integers are assigned in that order (an alphabetical assignment would not).


### 3.3 **When to Use Label Encoding?**  
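- **When dealing with high-cardinality categorical variables** (e.g., hundreds or thousands of unique categories like ZIP codes or user IDs), where one-hot encoding would create too many columns and the model does not treat the integers as magnitudes.  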
- **When using tree-based models** that can naturally handle categorical variables.  
- **When categories have an ordinal relationship** (e.g., "Beginner" → 0, "Intermediate" → 1, "Expert" → 2).  
- **Avoid in linear models** if the categorical variable has no inherent order, as it might introduce unintended relationships.
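
For ordinal variables the order has to be stated explicitly, otherwise the encoder falls back to some default such as alphabetical order. A minimal sketch with pandas, using the skill levels from the list above (the `levels` series is a hypothetical example):

```python
import pandas as pd

# Hypothetical ordinal feature.
levels = pd.Series(["Intermediate", "Beginner", "Expert", "Beginner"])

# Explicit order: the position in this list becomes the encoded value.
skill_order = ["Beginner", "Intermediate", "Expert"]
ordered = pd.Categorical(levels, categories=skill_order, ordered=True)
encoded = pd.Series(ordered.codes, index=levels.index)
print(encoded.tolist())  # [1, 0, 2, 0]

# Without an explicit order, categories are sorted alphabetically,
# which puts "Expert" (1) below "Intermediate" (2).
alphabetical = levels.astype("category").cat.codes
print(alphabetical.tolist())  # [2, 0, 1, 0]
```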

### **When NOT to Use Label Encoding?**  
While label encoding is useful in many cases, there are situations where it can be problematic:  

1. **When Categories Have No Ordinal Relationship**  
   - Label encoding assigns numeric values (0, 1, 2, …), which might imply an order even when none exists.  
   - Example: If "Red," "Blue," and "Green" are assigned 0, 1, and 2, a model might incorrectly assume "Green" > "Blue" > "Red."  
   - **Better Alternative:** Use **one-hot encoding** for non-ordinal categorical variables.  

2. **When Using Linear Models**  
   - Linear regression and logistic regression assume numerical values have a continuous relationship, which can introduce bias if the assigned numbers are arbitrary.  
   - **Better Alternative:** One-hot encoding or target encoding (if applicable).  

3. **When There Are a Large Number of Unique Categories**  
   - If a categorical variable has thousands of unique values (e.g., customer IDs, product SKUs), label encoding can create issues by introducing meaningless relationships.  
   - **Better Alternative:** Embedding layers (for deep learning) or frequency encoding.  

4. **When Dealing with High Cardinality Categorical Variables in Some ML Models**  
   - Label encoding may not work well in distance-based models (e.g., KNN, SVM) since it can distort similarity measures.  
   - **Better Alternative:** Target encoding or entity embeddings.  

5. **When the Encoded Values Affect Model Performance Negatively**  
   - Some models may be biased by the assigned numeric values, leading to inaccurate predictions.  
   - **Better Alternative:** Experiment with both **one-hot encoding and label encoding** to see which works best.  
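
Frequency encoding, mentioned above as an alternative for high-cardinality features, replaces each category with how often it occurs. The following is a minimal sketch under the assumption of a small hypothetical `city` series; it is one simple variant, not the only way to do it.

```python
import pandas as pd

# Hypothetical high-cardinality feature (shortened for illustration).
city = pd.Series([
    "New York City", "Los Angeles", "New York City",
    "Houston", "New York City", "Los Angeles",
])

# Share of rows in which each category appears.
frequencies = city.value_counts(normalize=True)

# Replace each category with its relative frequency.
city_freq_encoded = city.map(frequencies)
print(city_freq_encoded.round(3).tolist())
```

One caveat of this approach is that two categories with the same frequency receive the same value and can no longer be told apart.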


### **Key Takeaway**  
Use **label encoding** only when:  
- The model can handle numerical categorical values (e.g., decision trees, XGBoost).  
- The categorical variable has an inherent order.  
- The number of unique categories is manageable.  

Otherwise, consider **one-hot encoding**, **target encoding**, or **embedding methods** depending on the use case.

## 4. Scaling

### 4.1 **Normalization**

#### 4.1.1 **What is Normalization?**  
Normalization is a technique used to **rescale numerical data** into a specific range, typically **[0,1]** or **[-1,1]**. It ensures that all features contribute equally to a model by eliminating differences in scale.

For example, using **Min-Max Normalization**:  

\[
X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}
\]

If we have values:  
- **Original Data:** `[10, 20, 30, 40, 50]`  
- **Normalized Data (0-1):** `[0.0, 0.25, 0.5, 0.75, 1.0]`  
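
The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch; the helper name `min_max_normalize` and the choice to return zeros for a constant feature are assumptions made for illustration.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # Constant feature: the formula would divide by zero, so return zeros.
        return np.zeros_like(values)
    return (values - values.min()) / value_range

normalized = min_max_normalize([10, 20, 30, 40, 50])
print(normalized.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

In practice the minimum and maximum would be computed on the training split only and then reused for validation and test data.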


#### 4.1.2 **Why Use Normalization?**  
- **Brings features to the same scale** → Prevents large-valued features from dominating the model.  
- **Speeds up gradient descent** → Helps models converge faster, because a single learning rate works well for all features when they share a similar scale.  
- **Improves model performance** → Especially for models sensitive to feature magnitudes (e.g., neural networks, k-NN, SVMs).  
- **Reduces numerical instability** → Avoids very large/small values that can cause computational issues.  


#### 4.1.3 **When to Use Normalization?**  
- **When features have different scales** (e.g., age in years vs. income in millions).  
- **When using distance-based or variance-based models** like **KNN, K-Means, SVM, PCA** (which rely on distances or variances that large-scale features would dominate).  
- **When working with deep learning models** (e.g., neural networks, CNNs, RNNs), as normalized inputs improve training stability.  


#### 4.1.4 **When NOT to Use Normalization?**  
- **When using tree-based models** (e.g., Decision Trees, Random Forest, XGBoost).  
  - Tree models split data based on feature values, and normalization doesn’t improve their performance.  
- **When the data has outliers**  
  - Min-Max normalization is sensitive to outliers as it rescales based on min/max values.  
  - **Better Alternative:** Use **Robust Scaling** or **Standardization** instead.  
- **When features are already on a similar scale**  
  - If all features share the same unit and a similar range (e.g., all measurements are lengths in cm), normalization might be unnecessary.  


### 4.2 Standardization

#### 4.2.1 **What is Standardization?**  
Standardization is a technique used to **transform numerical data** so that it has a **mean of 0** and a **standard deviation of 1**. This ensures that all features contribute equally to the model, regardless of their original scale.

The formula for **Z-score Standardization**:  

\[
X_{standardized} = \frac{X - \mu}{\sigma}
\]

Where:  
- \( X \) = Original value  
- \( \mu \) = Mean of the feature  
- \( \sigma \) = Standard deviation of the feature  

For example, if we have a dataset with heights in cm:  
- **Original Data:** `[150, 160, 170, 180, 190]`  
- **Standardized Data:** `[-1.41, -0.71, 0.00, 0.71, 1.41]`  
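
The numbers above can be checked with NumPy. Note the assumption that the population standard deviation (`ddof=0`, NumPy's default) is used; with the sample standard deviation (`ddof=1`) the magnitudes would be slightly smaller.

```python
import numpy as np

heights_cm = np.array([150, 160, 170, 180, 190], dtype=float)

mu = heights_cm.mean()
sigma = heights_cm.std()  # population standard deviation (ddof=0)

# Z-score: subtract the mean, divide by the standard deviation.
standardized = (heights_cm - mu) / sigma
print(np.round(standardized, 2).tolist())

# The result has mean 0 and standard deviation 1 (up to floating-point error).
print(standardized.mean(), standardized.std())
```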


#### 4.2.2 **Why Use Standardization?**  
- **Centers data around zero** → Makes optimization algorithms more stable.  
- **Ensures equal contribution of features** → Prevents larger-scaled features from dominating smaller-scaled ones.  
- **Speeds up gradient descent** → Helps deep learning models converge faster.  
- **Works well with normally distributed data** → Standardization does not require a **Gaussian (normal) distribution**, but z-scores are easiest to interpret when the feature is roughly bell-shaped.  


#### 4.2.3 **When to Use Standardization?**  
- **When using distance-based or variance-based models** → e.g., **KNN, K-Means, SVM, PCA** (which rely on distances or variances that large-scale features would dominate).  
- **When features have different units** → e.g., height in cm vs. weight in kg.  
- **When using deep learning models** → Neural networks perform better with standardized inputs.  
- **When data is approximately normally distributed** → The standardized values can then be read directly as "standard deviations away from the mean".  


#### 4.2.4 **When NOT to Use Standardization?**  
- **When using tree-based models** (e.g., Decision Trees, Random Forest, XGBoost).  
  - These models split data based on feature values and don’t require standardization.  
- **When the data is highly skewed**  
  - Standardization rescales the data but does not change the shape of its distribution; if the data is highly skewed, other techniques (like **log transformation**) may be better.  
- **When features are already on a similar scale**  
  - If all features have the same unit (e.g., percentages), standardization might be unnecessary.  
- **When using categorical data**  
  - Standardization is only for numerical features. Categorical variables should be handled separately (e.g., one-hot encoding).  
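
As a rough illustration of the log transformation mentioned above, the sketch below uses the five summary values of **Sales** (min, quartiles, max) from the `df.describe()` output earlier in this notebook. Five points are only a stand-in for the full column, so the exact skew values are an assumption-laden approximation; on the real data one would use `df["Sales"]`.

```python
import numpy as np
import pandas as pd

# Min, 25%, 50%, 75%, and max of Sales, taken from the describe() output.
sales = pd.Series([0.444, 17.28, 54.49, 209.94, 22638.48])

raw_skew = sales.skew()

# log1p computes log(1 + x), which stays defined for values at or near zero.
log_sales = np.log1p(sales)
log_skew = log_sales.skew()

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A log transform only applies to non-negative features, so it would not work directly on **Profit**, which contains negative values.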


### 4.3 **Standardization vs. Normalization**  
| Feature          | Standardization (Z-score) | Normalization (Min-Max) |  
|----------------|-------------------------|----------------------|  
| **Range**      | Unbounded; mean = 0, std = 1 | [0,1] or [-1,1]     |  
| **Sensitive to Outliers?** | Less sensitive, but still affected | Yes (min & max values affect range) |  
| **Best For?**  | Roughly normally distributed data | Data without strong assumptions |  
| **Used In?**   | KNN, SVM, PCA, Neural Networks | KNN, SVM, Neural Networks |  

<br/><br/><br/>

#### How do they perform in the presence of outliers?
- **Normalization (Min-Max Scaling) is highly sensitive to outliers and should be avoided if outliers exist.**
- **Standardization (Z-score Scaling) is more robust but still affected by outliers.**
- **Robust Scaling or Log Transformation is best for handling outliers effectively.**
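
The three scalers can be compared on a small hypothetical array with one outlier. This is a sketch under stated assumptions: the data is made up, and "robust scaling" is taken to mean centering on the median and dividing by the interquartile range, which is the idea behind scikit-learn's `RobustScaler` defaults.

```python
import numpy as np

# Hypothetical feature; the last value is an outlier.
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 1000.0])

min_max = (values - values.min()) / (values.max() - values.min())
z_score = (values - values.mean()) / values.std()

# Robust scaling: center on the median, scale by the interquartile range.
q1, median, q3 = np.percentile(values, [25, 50, 75])
robust = (values - median) / (q3 - q1)

def inlier_spread(scaled):
    """Range covered by the five non-outlier points after scaling."""
    return float(scaled[:5].max() - scaled[:5].min())

print(f"min-max: {inlier_spread(min_max):.3f}")
print(f"z-score: {inlier_spread(z_score):.3f}")
print(f"robust:  {inlier_spread(robust):.3f}")
```

With min-max scaling the five ordinary points are squeezed into a tiny slice of [0, 1], while robust scaling keeps them clearly separated.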


## 5. Class Imbalance

### 5.1 **What is Class Imbalance?**  
Class imbalance occurs when one class significantly outnumbers another in a classification dataset. This can lead to biased models that favor the majority class and fail to correctly predict the minority class.

For example, in a fraud detection dataset:  
- **Non-Fraud Transactions:** 98%  
- **Fraud Transactions:** 2%  

A model trained on this dataset could predict **"Non-Fraud"** for every transaction and reach 98% accuracy while failing to detect a single fraud case.
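
The fraud example can be made concrete with NumPy. The labels below are a hypothetical set of 100 transactions matching the 98% / 2% split, and the "model" simply predicts the majority class every time.

```python
import numpy as np

# 98 non-fraud (0) and 2 fraud (1) transactions.
y_true = np.array([0] * 98 + [1] * 2)

# A trivial baseline that always predicts "Non-Fraud".
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())

# Recall for the fraud class: detected frauds / actual frauds.
true_positives = int(np.sum((y_pred == 1) & (y_true == 1)))
recall = true_positives / int(np.sum(y_true == 1))

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.98
print(f"recall:   {recall:.2f}")    # recall:   0.00
```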


### 5.2 **Why Handle Class Imbalance?**  
- **Prevents biased models** → A model trained on imbalanced data may always predict the majority class and ignore the minority class.  
- **Improves recall for minority class** → Helps the model detect rare but important events (e.g., fraud, disease detection).  
- **Enhances overall model performance** → Avoids misleading accuracy metrics (e.g., 98% accuracy but 0% recall for the minority class).  


### 5.3 **When to Handle Class Imbalance?**  
- **When the minority class is critical** (e.g., fraud detection, medical diagnosis, defect detection).  
- **When accuracy alone is misleading** → Use **precision, recall, and F1-score** instead.  
- **When your model performs poorly on the minority class** → If the recall is very low, the model isn’t learning from the minority class.  
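
Precision, recall, and F1-score can all be derived from true positives, false positives, and false negatives. The helper below is a minimal sketch (the function name and the toy labels are assumptions for illustration); libraries such as scikit-learn provide equivalent metrics.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    # Fall back to 0.0 when a denominator is zero (no predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Hypothetical labels: 4 minority-class samples, 2 of them detected, 1 false alarm.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here accuracy is 70%, yet recall shows that half of the minority class is missed.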


### 5.4 **When NOT to Handle Class Imbalance?**  
- **When the imbalance is not severe** → If the ratio is **80:20 or more balanced**, some models (like tree-based models) can still perform well without intervention.  
- **When the dataset is large and diverse** → With enough data, some models naturally learn to handle imbalance.  
- **When using models like XGBoost or Random Forest** → These models are often more tolerant of imbalanced datasets and may not require resampling, although class weights can still help when the imbalance is severe.  


### 5.5 **Techniques to Handle Class Imbalance**  
1. **Resampling Methods**  
   - **Oversampling the minority class** (e.g., SMOTE, ADASYN)  
   - **Undersampling the majority class** (random undersampling)  

2. **Algorithmic Approaches**  
   - **Class-weight adjustments** → Assign higher weights to the minority class (e.g., `class_weight='balanced'` in sklearn).  
   - **Cost-sensitive learning** → Penalize misclassifications of the minority class.  

3. **Synthetic Data Generation**  
   - Use **SMOTE (Synthetic Minority Over-sampling Technique)** to create synthetic minority class examples.  

4. **Anomaly Detection Approach**  
   - If the minority class is extremely rare, consider framing the problem as **anomaly detection** instead of classification.  
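
Two of the techniques above can be sketched with NumPy alone. The class-weight formula shown is the "balanced" heuristic, `n_samples / (n_classes * count_per_class)`, which is how scikit-learn documents `class_weight='balanced'`. The random oversampling part is a simple illustration rather than SMOTE, and the labels are hypothetical.

```python
import numpy as np

# Hypothetical labels with a 98:2 imbalance.
y = np.array([0] * 98 + [1] * 2)

# Class weights: rarer classes receive proportionally larger weights.
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)
print(class_weights)  # roughly [0.51, 25.0]

# Random oversampling: resample minority indices (with replacement)
# until both classes have the same number of rows.
rng = np.random.default_rng(seed=0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
y_resampled = y[resampled_idx]
print(np.bincount(y_resampled))  # [98 98]
```

The same `resampled_idx` would be used to index the feature matrix. Resampling is normally applied to the training split only, so that validation and test data keep the original class distribution.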
