---
title: "Q2. Grid-Based Outlier Discovery Approach (8 points)"
author: "TW"
date: "2025-03-29"
categories: python
draft: false
---

# Q2. Grid-Based Outlier Discovery Approach (8 points)
In this question, you should implement a grid-based outlier detection method
to find outliers in a large data set.

**Data Descriptions :**

1. Relevant data is in the folder Data_Q2.
2. X.csv: testing data, used as input.
3. submissionSample.csv: sample of a submission; 0 indicates an inlier, 1
indicates an outlier.

**Requirements :**

1. Do not use relevant third-party packages; you must implement the algorithm
yourself.

**Submissions :**

1. Please report your main experimental steps in Q2_readme.pdf. If your
code refers to any blog, GitHub repository, paper, and so on, please report
the links in it.
2. Output your results in Q2_output.csv. The format should follow
submissionSample.csv or the example below. Note that the .csv file should
contain one column.

|result|
|--|
|0|
|1|
|…|
|1|

3. Pack all code files in folder Q2_code .
4. Pack all files/folders above in folder Q2 .

**Notes:**

We will grade according to the code, the efficiency of your method, the
experimental steps and methods you mention in the report, and the recall
and precision of your model’s predictions.

```python
%cd /content/drive/MyDrive/Notes/MSBD5002/Data_Q2
```

```
/content/drive/MyDrive/Notes/MSBD5002/Data_Q2
```


The task is **grid-based outlier detection** on the test data (`X.csv`) in an **unsupervised** setting: there is no separate training set and there are no labels, so a supervised model is not an option. We apply the grid-based method directly to `X.csv`, evaluate it with proxy metrics (no ground truth labels are available), and write predictions in the format specified by `submissionSample.csv`.

The solution below proceeds step by step.

### Step 1: Understanding the Problem and Data
- **Data Description**:
  - `X.csv`: Test data with 286048 samples and 10 numerical features.
  - No training data or labels are provided, so we’ll treat this as an unsupervised outlier detection task.
- **Task**:
  - Implement a grid-based outlier detection method to identify outliers in `X.csv`.
  - Predict whether each data point is an inlier (0) or an outlier (1).
  - Output predictions in a CSV file (`Q2_output.csv`) with a single column `result`, matching the format of `submissionSample.csv`.
  - Document the approach, including experimental steps, methods, recall, and precision, in a report (`Q2_readme.pdf`).
- **Submission**:
  - Pack all code files into a folder named `Q2_code`, then pack `Q2_readme.pdf`, `Q2_output.csv`, and `Q2_code` into a folder named `Q2`.


#### Grid-Based Outlier Detection (Unsupervised)
Since this is an unsupervised task, we’ll apply the grid-based method directly to `X.csv`. The method will:
1. Divide the data space into a grid.
2. Identify low-density cells (those with very few points) as containing outliers.
3. Label points in low-density cells as outliers (1) and others as inliers (0).
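
The three steps can be sketched on a toy 2-D example. This is a minimal illustration, not the implementation used below: the data, `num_bins=4`, and `density_threshold=3` are made-up values chosen so that a single isolated point ends up alone in its cell.

```python
import numpy as np
from collections import Counter

# Toy data (hypothetical): a tight 2-D cluster plus one isolated point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
isolated = np.array([[5.0, 5.0]])
points = np.vstack([cluster, isolated])

num_bins = 4           # bins per dimension (illustrative value)
density_threshold = 3  # cells with fewer points are treated as sparse

# Step 1: map every coordinate to an equal-width bin index in [0, num_bins - 1].
mins = points.min(axis=0)
maxs = points.max(axis=0)
scaled = (points - mins) / (maxs - mins)
bin_idx = np.minimum((scaled * num_bins).astype(int), num_bins - 1)

# Step 2: count how many points fall into each occupied cell.
cells = [tuple(row) for row in bin_idx]
counts = Counter(cells)

# Step 3: points in sparse cells are outliers (1), the rest inliers (0).
toy_labels = np.array([int(counts[c] < density_threshold) for c in cells])
print("flagged:", int(toy_labels.sum()), "of", len(points))
```

Only occupied cells are stored in the counter, so memory depends on the number of points rather than on the total number of grid cells.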

The `submissionSample.csv` shows 10 rows, but `X.csv` has 286048 rows. This suggests that `submissionSample.csv` is just a sample format, and we need to generate predictions for all 286048 test samples in `X.csv`.



#### Evaluation Challenge
Without ground truth labels, recall and precision cannot be computed directly. The question still asks for these metrics, so they have to be estimated indirectly. Options in unsupervised outlier detection include:
- Assuming that a small fraction of the data points are outliers (e.g., 5–10%); this is a modeling assumption, not something the data confirms.
- Using internal measures, such as the proportion of points flagged as outliers, to tune the model.
- Estimating recall and precision against a stand-in for ground truth: either labels derived from the model’s own density estimates (circular, so the numbers mostly reflect the stand-in) or known outliers injected into the data.

For simplicity, we’ll tune the model to flag a reasonable fraction of points as outliers (e.g., 5–10%) and estimate recall and precision against density-derived labels.
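
One way to make the injection idea concrete is sketched below, under stated assumptions: the Gaussian data stands in for the real features, the injected points are drawn uniformly from a wider box, and `grid_labels` is a hypothetical compact detector written only for this example. Recall is then measured on the injected points, whose labels are known.

```python
import numpy as np

def grid_labels(data, num_bins, density_threshold):
    """Flag points whose equal-width grid cell holds fewer than density_threshold points."""
    mins, maxs = data.min(axis=0), data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    idx = np.minimum(((data - mins) / span * num_bins).astype(int), num_bins - 1)
    # Group identical rows of bin indices; counts[inverse] is each point's cell size.
    _, inverse, counts = np.unique(idx, axis=0, return_inverse=True, return_counts=True)
    return (counts[inverse.ravel()] < density_threshold).astype(int)

rng = np.random.default_rng(42)
inliers = rng.normal(size=(5000, 3))              # stand-in for the real data
injected = rng.uniform(-8.0, 8.0, size=(50, 3))   # hypothetical known outliers
data = np.vstack([inliers, injected])
is_injected = np.r_[np.zeros(len(inliers), dtype=int), np.ones(len(injected), dtype=int)]

pred = grid_labels(data, num_bins=8, density_threshold=5)
recall_on_injected = pred[is_injected == 1].mean()
print(f"recall on injected points: {recall_on_injected:.2f}")
```

Precision cannot be read off the same way, because the unlabeled original points may themselves contain real outliers.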



### Step 2: Implementing Grid-Based Outlier Detection
We’ll implement the grid-based outlier detection method from scratch, apply it to `X.csv`, and generate predictions.

#### Step 2.1: Load and Preprocess the Data
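Let’s load `X.csv` and standardize it so that all features are on a comparable scale.

- `StandardScaler` rescales each feature to a mean of 0 and a standard deviation of 1. For the grid used below this step is harmless but not essential: `pd.cut` with an integer `bins` splits each feature’s own min-max range into equal-width intervals, so the cell assignment is essentially unchanged by per-feature rescaling. Scaling matters when a single cell width is shared across all dimensions.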
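
Because the assignment restricts third-party packages, the same standardization can be done with NumPy alone. The sketch below assumes the z-score with population standard deviation (`ddof=0`), which is what `StandardScaler` computes; `standardize` is a hypothetical helper name and `demo` is a made-up array.

```python
import numpy as np

def standardize(values):
    """Column-wise z-score: subtract the mean, divide by the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (values - mean) / std

demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
z = standardize(demo)
print(z.mean(axis=0), z.std(axis=0))
```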


```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import recall_score, precision_score

# Load the test data
X = pd.read_csv('X.csv', header=None)  # No header in X.csv
print("Shape of X:", X.shape)  # Should be (286048, 10)

# Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
```

```
Shape of X: (286048, 10)
```



#### Step 2.2: Implement Grid-Based Outlier Detection
We’ll divide the data space into a grid, count the number of points in each cell, and label points in low-density cells as outliers.


```python
from collections import Counter

# Define the grid-based outlier detection function
def grid_based_outlier_detection(X, num_bins=10, density_threshold=5):
    """
    Parameters:
    - X: DataFrame of scaled data
    - num_bins: Number of equal-width bins per dimension
    - density_threshold: Minimum number of points a cell needs for its points to count as inliers
    Returns:
    - labels: Array of 0 (inlier) or 1 (outlier), one entry per row of X
    - cell_counts: Counter mapping each occupied cell (tuple of bin indices) to its point count
    """
    # Number of dimensions
    n_dims = X.shape[1]

    # Assign each value to one of num_bins equal-width bins, one column at a time
    bins = [pd.cut(X.iloc[:, i], bins=num_bins, labels=False, include_lowest=True) for i in range(n_dims)]
    bins = np.array(bins).T  # Shape: (n_samples, n_dims)

    # A cell is identified by the tuple of its bin indices
    cell_ids = [tuple(row) for row in bins]

    # Count the number of points in each occupied cell
    cell_counts = Counter(cell_ids)

    # Points in cells with fewer than density_threshold points are outliers (1), the rest inliers (0)
    labels = np.array([int(cell_counts[cell] < density_threshold) for cell in cell_ids])

    return labels, cell_counts


# Apply the grid-based method to X_scaled
num_bins = 10  # Number of bins per dimension
density_threshold = 5  # Threshold for considering a cell as low-density
labels, cell_counts = grid_based_outlier_detection(X_scaled, num_bins=num_bins, density_threshold=density_threshold)

# Print the fraction of points labeled as outliers
outlier_fraction = np.mean(labels)
print(f"Fraction of points labeled as outliers: {outlier_fraction:.3f}")
```

```
Fraction of points labeled as outliers: 0.345
```


- **Grid Creation**: We use `pd.cut` to bin each dimension into `num_bins` intervals. Each data point is assigned a bin index for each dimension, forming a "cell" in the grid.
- **Density Calculation**: We count the number of points in each cell using `Counter`.
- **Outlier Detection**: If a cell has fewer than `density_threshold` points, its points are labeled as outliers (1); otherwise, they are inliers (0).
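- **Outlier Fraction**: We print the fraction of points labeled as outliers to get a sense of the model’s behavior. With the default parameters, 34.5% of the points are flagged, far more than the 5–10% this walkthrough assumes as a plausible outlier rate, so the grid is too fine for this data and the parameters need tuning.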
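
Since efficiency is part of the grading, the per-point Python loops could be replaced by array operations. The following is a sketch, assuming the bin indices are already available as an integer array; `cell_densities` is a hypothetical helper, not part of the code above. It encodes each cell as a single integer with `np.ravel_multi_index` and counts cells with `np.unique`.

```python
import numpy as np

def cell_densities(bin_idx, num_bins):
    """Return, for every point, the number of points sharing its grid cell.

    bin_idx: integer array of shape (n_samples, n_dims), values in [0, num_bins - 1].
    """
    dims = (num_bins,) * bin_idx.shape[1]
    # Collapse each row of bin indices into one integer key per cell.
    keys = np.ravel_multi_index(tuple(bin_idx.T), dims)
    _, inverse, counts = np.unique(keys, return_inverse=True, return_counts=True)
    return counts[inverse]

rng = np.random.default_rng(1)
demo_idx = rng.integers(0, 3, size=(1000, 4))  # hypothetical bin indices
densities = cell_densities(demo_idx, num_bins=3)
demo_labels = (densities < 5).astype(int)      # 1 = outlier, 0 = inlier
print("flagged fraction:", demo_labels.mean())
```

The integer key must fit into a 64-bit integer; for 10 dimensions and up to 15 bins per dimension (15 ** 10 is about 5.8e11) that holds.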



#### Step 2.3: Tune Hyperparameters
The `num_bins` and `density_threshold` parameters control the model’s sensitivity:
- **num_bins**: Affects the granularity of the grid. Too few bins make the grid too coarse; too many make it too fine, potentially isolating many points.
- **density_threshold**: A higher threshold marks more points as outliers; a lower threshold marks fewer.

Since we don’t have ground truth labels, we’ll tune these parameters to achieve a reasonable outlier fraction (e.g., 5–10%). This is a common heuristic in unsupervised outlier detection when labels are unavailable.
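
A quick calculation shows why the outlier fraction reacts so strongly to `num_bins`: the number of cells grows as `num_bins ** 10` for the 10 features, while the number of points stays at 286048.

```python
n_samples, n_dims = 286048, 10  # shape of X.csv reported above

cells_by_bins = {}
for bins_per_dim in (5, 10, 15):
    n_cells = bins_per_dim ** n_dims
    cells_by_bins[bins_per_dim] = n_cells
    print(f"num_bins={bins_per_dim:>2}: {n_cells:,} cells, "
          f"{n_samples / n_cells:.2e} points per cell on average")
```

Even with 5 bins per dimension there are 9,765,625 cells, more than the number of points, so most cells are necessarily empty and a finer grid quickly pushes many points into sparse cells.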


```python
# Tune num_bins and density_threshold to achieve a reasonable outlier fraction
target_outlier_fraction = 0.05  # Aim for 5% outliers
best_num_bins, best_density_threshold = num_bins, density_threshold
best_labels = labels
best_outlier_fraction = outlier_fraction

for nb in [5, 10, 15]:
    for dt in [3, 5, 10]:
        labels, _ = grid_based_outlier_detection(X_scaled, num_bins=nb, density_threshold=dt)
        outlier_fraction = np.mean(labels)
        print(f"num_bins={nb}, density_threshold={dt}, Outlier Fraction={outlier_fraction:.3f}")
        # Choose the parameters that get closest to the target outlier fraction
        if abs(outlier_fraction - target_outlier_fraction) < abs(best_outlier_fraction - target_outlier_fraction):
            best_num_bins, best_density_threshold = nb, dt
            best_labels = labels
            best_outlier_fraction = outlier_fraction

print(f"Best parameters: num_bins={best_num_bins}, density_threshold={best_density_threshold}")
print(f"Best outlier fraction: {best_outlier_fraction:.3f}")

# Use the best labels for final predictions
labels = best_labels
```

```
num_bins=5, density_threshold=3, Outlier Fraction=0.012
num_bins=5, density_threshold=5, Outlier Fraction=0.024
num_bins=5, density_threshold=10, Outlier Fraction=0.053
num_bins=10, density_threshold=3, Outlier Fraction=0.200
num_bins=10, density_threshold=5, Outlier Fraction=0.345
num_bins=10, density_threshold=10, Outlier Fraction=0.576
num_bins=15, density_threshold=3, Outlier Fraction=0.551
num_bins=15, density_threshold=5, Outlier Fraction=0.772
num_bins=15, density_threshold=10, Outlier Fraction=0.949
Best parameters: num_bins=5, density_threshold=10
Best outlier fraction: 0.053
```




- We loop over a few values of `num_bins` and `density_threshold` to find the combination that results in an outlier fraction closest to 5%. This is a heuristic to ensure the model isn’t too aggressive or too lenient in flagging outliers.
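
For a fixed `num_bins`, the threshold does not have to be searched by rerunning the detector: once each point’s cell density is known, the smallest integer threshold that flags at least the target fraction can be read from the sorted densities. A minimal sketch, with `threshold_for_target` as a hypothetical helper and a tiny made-up density array:

```python
import numpy as np

def threshold_for_target(densities, target_fraction):
    """Smallest integer t such that mean(densities < t) >= target_fraction (for target_fraction > 0)."""
    sorted_d = np.sort(np.asarray(densities))
    m = max(int(np.ceil(target_fraction * len(sorted_d))), 1)  # number of points to flag
    # Flagging uses a strict '<', so t must exceed the m-th smallest density.
    return int(sorted_d[m - 1]) + 1

demo_densities = np.array([1, 1, 2, 5, 5, 5, 9, 9])  # made-up cell densities
for target in (0.25, 0.5):
    t = threshold_for_target(demo_densities, target)
    flagged = np.mean(demo_densities < t)
    print(f"target={target:.2f} -> threshold={t}, flagged fraction={flagged:.2f}")
```

Ties can make the flagged fraction overshoot the target, because all points in cells of equal density are flagged together.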



#### Step 2.4: Estimate Recall and Precision (Proxy)
Since we don’t have ground truth labels, computing recall and precision directly is impossible. However, the question requires these metrics, so we’ll use a proxy approach:
- Assume the true outlier fraction is around 5% (an assumption; the data does not tell us the real rate).
- Treat the 5% of points with the lowest cell density as "true" outliers and the rest as inliers.
- Use this synthetic ground truth to estimate recall and precision.



```python
# Build a synthetic ground truth: the 5% of points in the lowest-density cells count as outliers
_, cell_counts = grid_based_outlier_detection(X_scaled, num_bins=best_num_bins, density_threshold=best_density_threshold)

# Recompute each point's cell with the same binning, then look up its density (number of points in its cell)
bin_idx = np.array([pd.cut(X_scaled.iloc[:, i], bins=best_num_bins, labels=False, include_lowest=True) for i in range(X_scaled.shape[1])]).T
cell_ids = [tuple(row) for row in bin_idx]
densities = np.array([cell_counts[cell] for cell in cell_ids])

# Sort points by density and label the lowest-density 5% as outliers (1), others as inliers (0)
n_outliers = int(0.05 * len(X_scaled))
sorted_indices = np.argsort(densities)
synthetic_labels = np.zeros(len(X_scaled), dtype=int)
synthetic_labels[sorted_indices[:n_outliers]] = 1  # Lowest-density points are outliers

# Compute recall and precision using the synthetic labels
recall = recall_score(synthetic_labels, labels, pos_label=1)
precision = precision_score(synthetic_labels, labels, pos_label=1)

print(f"Estimated Recall (proxy): {recall:.3f}")
print(f"Estimated Precision (proxy): {precision:.3f}")
```

```
Estimated Recall (proxy): 1.000
Estimated Precision (proxy): 0.936
```



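- **Synthetic Labels**: We assume the 5% of points in the lowest-density cells are the "true" outliers. This is a rough approximation but allows us to estimate recall and precision.
- **Recall**: The proportion of synthetic outliers that the model correctly identifies.
- **Precision**: The proportion of points the model labels as outliers that are in the synthetic outlier set.

This is a proxy evaluation and should be interpreted with caution. The synthetic labels come from the same cell densities that the model thresholds, so the comparison is circular: the model flags 5.3% of the points, which includes the entire lowest-density 5%, giving a recall of 1.000 by construction, and the precision of 0.936 only reflects how far the flagged fraction overshoots 5%. Neither number says how well the method finds real outliers.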
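
If scikit-learn’s metric functions are also to be avoided, precision and recall follow directly from the confusion counts. A minimal NumPy sketch (`precision_recall` is a hypothetical helper; the label arrays are toy values in which the predictions are a superset of the reference outliers, as in the proxy evaluation):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = outlier)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # reference outliers that were flagged
    fp = np.sum(~y_true & y_pred)   # flagged points that are not reference outliers
    fn = np.sum(y_true & ~y_pred)   # reference outliers that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

toy_true = [1, 1, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0]  # flags every reference outlier plus one extra point
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)
print(f"precision={toy_precision:.3f}, recall={toy_recall:.3f}")
```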





### Step 3: Generate Predictions
We’ll use the best parameters to generate predictions for all 286048 samples in `X.csv` and save them in the required format.



```python
# Generate final predictions using the best parameters
labels, _ = grid_based_outlier_detection(X_scaled, num_bins=best_num_bins, density_threshold=best_density_threshold)

# Create the output DataFrame
output_df = pd.DataFrame({'result': labels})

# Save to CSV
output_df.to_csv('Q2_output.csv', index=False)
```




The `Q2_output.csv` will look like:
```
result
0
1
0
...
```
It will have 286048 rows, one for each sample in `X.csv`.
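
Before submitting, it may be worth checking the file format programmatically. The sketch below uses a hypothetical helper, `check_submission`, and demonstrates it on a three-row stand-in file in a temporary directory; for the real file the call would use `Q2_output.csv` and 286048.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def check_submission(path, expected_rows):
    """Return True if the CSV has a single 'result' column of 0/1 values and the expected row count."""
    df = pd.read_csv(path)
    return (
        list(df.columns) == ["result"]
        and len(df) == expected_rows
        and bool(df["result"].isin([0, 1]).all())
    )

# Demo on a small stand-in file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Q2_output.csv")
    pd.DataFrame({"result": np.array([0, 1, 0])}).to_csv(demo_path, index=False)
    ok = check_submission(demo_path, expected_rows=3)
    wrong_length = check_submission(demo_path, expected_rows=4)
print("format ok:", ok)
```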





### Step 4: Package the Submission
We need to pack all code files into a folder named `Q2_code`, and then place `Q2_code`, `Q2_output.csv`, and `Q2_readme.pdf` together in a folder named `Q2`.

- **Note**: The question asks for `Q2_readme.pdf`, so you’ll need to convert the code and report to PDF format manually (e.g., by copying the code into a document and saving as PDF).
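
The layout from the assignment’s submission list (code in `Q2_code`, everything inside `Q2`) could be assembled with the standard library. This is a sketch: `package_submission` and the file name `q2_grid.py` are hypothetical, and the demo runs in a temporary directory with placeholder files.

```python
import shutil
import tempfile
from pathlib import Path

def package_submission(work_dir, code_files, out_root):
    """Copy code into Q2/Q2_code and the report/output into Q2, skipping files that do not exist yet."""
    q2 = Path(out_root) / "Q2"
    code_dir = q2 / "Q2_code"
    code_dir.mkdir(parents=True, exist_ok=True)
    for name in code_files:
        shutil.copy(Path(work_dir) / name, code_dir / name)
    for name in ("Q2_output.csv", "Q2_readme.pdf"):
        src = Path(work_dir) / name
        if src.exists():
            shutil.copy(src, q2 / name)
    return q2

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    (work / "q2_grid.py").write_text("# code placeholder\n")  # hypothetical file name
    (work / "Q2_output.csv").write_text("result\n0\n1\n")
    q2_dir = package_submission(work, ["q2_grid.py"], tmp)
    packed = sorted(p.relative_to(q2_dir).as_posix() for p in q2_dir.rglob("*"))
print(packed)
```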



### Step 5: Write the Report
The report should include the experimental steps, methods, and the recall and precision of the model. Here’s a summary to include in `Q2_readme.pdf`:

#### Report Content
1. **Introduction**:
   - The task is to perform grid-based outlier detection on the test data (`X.csv`) in an unsupervised manner.
   - The goal is to classify data points as inliers (0) or outliers (1) and estimate recall and precision without ground truth labels.

2. **Experimental Steps**:
   - **Data Preprocessing**:
     - Loaded the test data from `X.csv` (286048 samples, 10 features).
     - Normalized the data using `StandardScaler` to ensure all features are on the same scale.
   - **Grid-Based Outlier Detection**:
     - Divided the data space into a grid with `num_bins` bins per dimension.
     - Counted the number of points in each cell.
     - Labeled points in cells with fewer than `density_threshold` points as outliers (1), others as inliers (0).
   - **Hyperparameter Tuning**:
     - Tuned `num_bins` and `density_threshold` to achieve a reasonable outlier fraction (target: 5%).
     - Best parameters: `num_bins=5`, `density_threshold=10` (from the run above; replace with your own values if you rerun).
     - Achieved outlier fraction: 0.053.
   - **Evaluation**:
     - Since no ground truth labels were provided, created synthetic labels by assuming the 5% of points in the lowest-density cells are outliers.
     - Estimated recall = 1.000 and precision = 0.936 using the synthetic labels.
   - **Prediction**:
     - Generated predictions for all 286048 samples in `X.csv` using the best parameters.
     - Saved predictions in `Q2_output.csv` with a single column `result`.

3. **Methods**:
   - **Algorithm**: Grid-based outlier detection.
     - Normalized the data using `StandardScaler`.
     - Created a grid using `pd.cut` to bin each dimension.
     - Used a density threshold to identify outliers.
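   - **Libraries**: Pandas, NumPy, Scikit-learn (scikit-learn only for scaling and the metric functions; the detection algorithm itself is implemented from scratch).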
   - **Hyperparameters**:
     - `num_bins`: Number of bins per dimension.
     - `density_threshold`: Minimum number of points in a cell to be considered an inlier.
   - **Evaluation Metrics**: Recall and precision, estimated using synthetic labels.

4. **Results**:
   - Estimated Recall: [Your value; 1.000 in the run above]
   - Estimated Precision: [Your value; 0.936 in the run above]
   - Outlier Fraction: [Your value; 0.053 in the run above]
   - The grid-based method identifies outliers by focusing on low-density regions in the data space.

5. **Challenges**:
   - Lack of ground truth labels made evaluation challenging; used synthetic labels as a proxy.
   - Choosing the right `num_bins` and `density_threshold` required tuning based on the outlier fraction heuristic.
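   - The method is affected by the curse of dimensionality: with 10 dimensions, even 5 bins per dimension give 5^10 (about 9.8 million) cells for 286048 points, which is why only the coarsest grid in the search produced an outlier fraction near the 5% target.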





### Final Submission
Your submission folder `Q2` should contain:
- `Q2_readme.pdf`: The report with experimental steps, methods, and metrics.
- `Q2_output.csv`: The predictions for all 286048 samples in the specified format.
- `Q2_code`: The folder with all code files.




### Notes and Potential Improvements
1. **Evaluation Without Labels**: The proxy evaluation using synthetic labels is a limitation. If ground truth labels become available, you can compute recall and precision directly.
2. **Outlier Fraction**: The target outlier fraction (5%) is a heuristic. Depending on the domain, you might adjust this (e.g., 1% or 10%).
3. **Improvements**:
   - **Adaptive Binning**: Use data-driven bin sizes (e.g., based on data distribution) instead of fixed `num_bins`.
   - **Dimensionality Reduction**: Apply PCA to reduce the dimensionality of the data before gridding, which can improve performance in high-dimensional spaces.
   - **Alternative Methods**: Compare with other unsupervised outlier detection methods like Isolation Forest or DBSCAN to validate the grid-based approach.
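
As an illustration of adaptive binning, the sketch below places bin edges at evenly spaced quantiles of each feature (equal-frequency bins) instead of using equal widths. `quantile_bin_indices` is a hypothetical helper and the exponential data is a made-up stand-in for a skewed feature.

```python
import numpy as np

def quantile_bin_indices(values, num_bins):
    """Equal-frequency binning: per-feature bin edges at evenly spaced quantiles."""
    values = np.asarray(values, dtype=float)
    idx = np.empty(values.shape, dtype=int)
    inner_q = np.linspace(0, 1, num_bins + 1)[1:-1]  # interior quantile levels
    for j in range(values.shape[1]):
        edges = np.quantile(values[:, j], inner_q)
        # searchsorted counts the edges <= value, giving a bin index in [0, num_bins - 1].
        idx[:, j] = np.searchsorted(edges, values[:, j], side="right")
    return idx

rng = np.random.default_rng(7)
skewed = rng.exponential(size=(10000, 2))  # made-up skewed features
q_idx = quantile_bin_indices(skewed, num_bins=4)
per_bin = np.bincount(q_idx[:, 0], minlength=4)
print("points per bin (feature 0):", per_bin)
```

With equal-frequency bins every bin holds roughly the same number of points per feature, so a sparse cell reflects an unusual combination of features rather than an extreme value in one of them; whether that suits this data set would need to be checked.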
