1: What is Anomaly Detection? Explain its types (point, contextual, and
collective anomalies) with examples.

Anomaly Detection
Definition

Anomaly Detection (or outlier detection) is the process of identifying data points, events, or observations that deviate significantly from the normal pattern of the dataset.

Such anomalies often indicate fraud, network intrusions, equipment failures, or rare events.

It is widely used in domains like finance, cybersecurity, healthcare, IoT, and e-commerce.

Types of Anomalies
1. Point Anomalies

Definition: A single data point is significantly different from the rest of the data.

Example:

A customer suddenly spends ₹10,00,000 in one transaction, while their usual spending is under ₹10,000.

A temperature sensor showing 100°C when all other readings are around 25°C.

2. Contextual Anomalies

Definition: A data point is anomalous in a specific context, but may be normal in another.

Requires contextual information like time, location, or conditions.

Example:

A temperature of 30°C is normal in summer but anomalous in winter.

A network bandwidth spike at night (when usage is usually low) could be suspicious, but not during office hours.

3. Collective Anomalies

Definition: A group (collection) of data points is anomalous together, even if individual points are not.

Detected when patterns or sequences deviate from expected behavior.

Example:

Many small credit card transactions from different locations within a short time: each one looks normal on its own, but together they suggest fraud.

A sudden burst of error logs on a server, indicating a cyberattack or system crash.

✅ In summary

Point Anomalies: Single unusual data points.

Contextual Anomalies: Normal data points in one context, abnormal in another.

Collective Anomalies: A set of data points that collectively form an unusual pattern.

Anomaly detection helps organizations spot fraud, detect system failures early, and improve security.
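The difference between a point anomaly and a contextual anomaly can be sketched with a simple robust score. The example below is a minimal illustration, not a production detector: the helper names (`robust_z`, `flag`), the 3.5 threshold, and the temperature readings are all assumptions made for the demonstration.

```python
from statistics import median

def robust_z(values):
    """Modified z-scores based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # at least half the values coincide: treat nothing as unusual
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

def flag(values, threshold=3.5):
    """Return True for every value whose |modified z-score| exceeds the threshold."""
    return [abs(z) > threshold for z in robust_z(values)]

# Point anomaly: one sensor reading far from all the others.
sensor = [25, 24, 26, 25, 25, 24, 26, 25, 100]
point_flags = flag(sensor)

# Contextual anomaly: 30 degrees is unremarkable overall, but unusual among winter readings.
summer = [29, 31, 30, 32, 30, 28]
winter = [5, 7, 6, 4, 30, 6, 5, 6]
global_flags = flag(summer + winter)   # context ignored
winter_flags = flag(winter)            # scored within its own context
```

Scored against all readings, the 30°C value is not flagged; scored only against other winter readings, it is the single flagged value. This is why contextual detection needs the context (here, the season) as an input.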

2: Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of
their approach and suitable use cases.

Comparison of Anomaly Detection Methods
Algorithm	Approach	Strengths	Limitations	Suitable Use Cases
Isolation Forest	Randomly partitions data using decision trees. Anomalies are isolated faster since they require fewer splits.	Works well on high-dimensional data; efficient and scalable to large datasets; does not assume distribution of data.	May not capture local anomalies well; sensitive to contamination parameter (expected % of anomalies).	Credit card fraud detection; network intrusion detection; large-scale e-commerce datasets.
DBSCAN	Density-based clustering: points in low-density areas (not belonging to any cluster) are anomalies.	Can find anomalies as noise points; detects arbitrary-shaped clusters; no need to specify number of clusters.	Sensitive to parameters (ε, MinPts); struggles with high-dimensional data.	Geospatial anomaly detection (e.g., unusual GPS locations); customer segmentation with outlier detection; sensor data anomalies.
Local Outlier Factor (LOF)	Compares local density of a point with its neighbors. A point with significantly lower density than neighbors is an outlier.	Detects local anomalies well; no global distribution assumption.	Computationally expensive for large datasets; sensitive to k (number of neighbors).	Fraud detection where anomalies are context-dependent; detecting abnormal user behavior in small datasets; IoT sensor monitoring.
✅ In Summary

Isolation Forest → Best for large, high-dimensional datasets.

DBSCAN → Best when anomalies appear as noise outside dense clusters.

LOF → Best for local/contextual anomalies where density varies across the dataset.
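To make the DBSCAN row concrete, here is a minimal sketch of the rule that decides which points count as noise. It is a simplified illustration and not a full clustering implementation: the function name `dbscan_noise`, the sample points, and the parameter values are assumptions, and the neighbor count includes the point itself.

```python
from math import dist

def dbscan_noise(points, eps, min_pts):
    """Return the indices of points that the DBSCAN rule would label as noise.

    A point is a core point if at least min_pts points (itself included)
    lie within eps of it; noise points are neither core points nor within
    eps of any core point.
    """
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    return [
        i for i, nbrs in enumerate(neighbors)
        if i not in core and not any(j in core for j in nbrs)
    ]

points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),  # dense cluster
          (5.0, 5.0), (5.0, 5.4), (5.4, 5.0), (5.4, 5.4),  # second dense cluster
          (2.5, 9.0)]                                       # isolated point
noise = dbscan_noise(points, eps=1.0, min_pts=3)
```

Only the isolated point (index 8) ends up as noise. Shrinking `eps` or raising `min_pts` would turn more points into noise, which is the parameter sensitivity listed in the table.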

3: What are the key components of a Time Series? Explain each with one
example.


Key Components of a Time Series

A time series is a sequence of observations recorded at regular time intervals. To analyze and forecast it, we break it into four main components:

1. Trend (T)

Definition: The long-term upward or downward movement in the data over time.

Example:

The steady increase in e-commerce sales over several years due to digital adoption.

Stock market index showing long-term growth.

2. Seasonality (S)

Definition: Regular, repeating patterns in the data at fixed intervals (daily, weekly, monthly, yearly).

Example:

Ice cream sales peak every summer and drop in winter.

Electricity demand increases every evening.

3. Cyclical (C)

Definition: Fluctuations that occur over longer, irregular time periods (not strictly periodic), often influenced by business or economic cycles.

Example:

Economic recessions and booms affect car sales and housing markets.

Oil prices fluctuating with global demand-supply cycles.

4. Irregular/Residual/Noise (I)

Definition: Random, unpredictable variations in the data that cannot be explained by trend, seasonality, or cycle.

Example:

A sudden spike in airline cancellations due to a storm.

A factory shutdown due to unexpected power failure.

✅ In summary

Trend: Long-term direction (e.g., rising stock prices).

Seasonality: Short-term regular patterns (e.g., holiday shopping spikes).

Cyclical: Long-term business/economic fluctuations (e.g., recession impact).

Irregular: Random unexpected events (e.g., natural disasters).
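The components can be illustrated by building a series from them. This is a sketch under stated assumptions: an additive model Y = T + S + I, a 12-step seasonal period, made-up slope and amplitude values, and no cyclical component (it is omitted to keep the example short).

```python
import math
import random

random.seed(0)
period = 12          # assumed: monthly data with a yearly cycle
n = 48               # four years of observations

trend = [100 + 2.0 * t for t in range(n)]                                # T
seasonal = [10 * math.sin(2 * math.pi * t / period) for t in range(n)]   # S
noise = [random.gauss(0, 1) for _ in range(n)]                           # I
series = [T + S + I for T, S, I in zip(trend, seasonal, noise)]          # Y = T + S + I

# Averaging over one full seasonal cycle cancels S, so yearly means expose the trend.
yearly_means = [sum(series[i:i + period]) / period for i in range(0, n, period)]
```

Each yearly mean is roughly 24 higher than the previous one (slope 2 times 12 months), because the seasonal term sums to zero over a full cycle and the noise largely averages out.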

4: Define Stationarity in time series. How can you test and transform a
non-stationary series into a stationary one?

Stationarity in Time Series
Definition

A time series is (weakly) stationary if its mean and variance are constant over time and its autocovariance depends only on the lag between observations, not on when they are observed.

In a stationary series, patterns do not depend on the time at which the series is observed.

Many classical forecasting models (e.g., AR, MA, ARMA) assume stationarity for reliable predictions; ARIMA achieves it by differencing the series first.

Why Important?

Stationarity simplifies modeling.

Non-stationary series may have trends, seasonality, or varying variance, which can mislead forecasts.

How to Test Stationarity?

Visual Inspection

Plot the series. If it shows trend, seasonality, or changing variance, it is likely non-stationary.

Summary Statistics

Check rolling mean and variance over time. If they change, the series is non-stationary.

Statistical Tests

ADF (Augmented Dickey-Fuller) Test:

Null hypothesis (H₀): Series has a unit root (non-stationary).

If p-value < 0.05 → reject H₀ → series is stationary.

KPSS (Kwiatkowski–Phillips–Schmidt–Shin) Test:

Null hypothesis: Series is stationary.

If p-value < 0.05 → reject H₀ → series is non-stationary.
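The summary-statistics check described above can be sketched in a few lines. This is a simplified stand-in: it uses non-overlapping windows rather than a true rolling window, and the function name `window_stats` and both toy series are assumptions. For the formal tests, statsmodels provides `adfuller` and `kpss` in `statsmodels.tsa.stattools`.

```python
from statistics import mean, pvariance

def window_stats(series, window):
    """Mean and variance of each consecutive, non-overlapping window."""
    chunks = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
    return [(mean(c), pvariance(c)) for c in chunks]

# A trending series (mean drifts upward) versus a series oscillating around a fixed level.
trending = [0.5 * t + (1 if t % 2 else -1) for t in range(60)]
level = [10 + (1 if t % 2 else -1) for t in range(60)]

trending_stats = window_stats(trending, window=20)
level_stats = window_stats(level, window=20)
```

The window means of `trending` keep rising, which signals non-stationarity, while `level` shows the same mean and variance in every window.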

How to Transform a Non-Stationary Series into Stationary?

Differencing

Subtract the previous observation from the current observation.

Example:
$$Y'_t = Y_t - Y_{t-1}$$

Removes trend and stabilizes mean.

Detrending

Fit and remove the trend (e.g., regression or moving average).

Deseasonalizing

Remove the seasonal component: divide by it for a multiplicative series, or subtract it for an additive one.

Transformations

Apply log, square root, or Box-Cox transformations to stabilize variance.

Example: Stock prices → apply log to reduce heteroscedasticity.
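As a small worked example of two of these transformations, the sketch below applies first differencing to a linear trend and a log transform followed by differencing to exponential growth. The helper name `difference` and both toy series are assumptions chosen so the result is easy to verify by hand.

```python
import math

def difference(series, lag=1):
    """Lag-`lag` differencing: y'_t = y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

linear = [5 + 2 * t for t in range(10)]              # linear trend
linear_diff = difference(linear)                      # constant series: trend removed

exponential = [100 * 1.05 ** t for t in range(10)]   # 5% growth per step
log_diff = difference([math.log(v) for v in exponential])  # about log(1.05) at every step
```

Differencing turns the linear trend into a constant, and the log transform turns multiplicative growth into an additive trend that differencing can then remove.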

✅ In summary

Stationary series: Constant mean, variance, autocovariance over time.

Test: Visual inspection, ADF/KPSS tests.

Make stationary: Differencing, detrending, deseasonalizing, transformations.

5: Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in
terms of structure and application


Difference between AR, MA, ARIMA, SARIMA, and SARIMAX Models

Time series forecasting uses different statistical models depending on patterns in the data.

1. AR (Autoregressive Model)

Structure:

Current value depends on its past values (lags).

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \epsilon_t$$

Application:

Good for data with temporal correlation (past values strongly influence future).

Example: Stock prices depending on previous day’s values.
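To see the AR structure at work, the sketch below simulates an AR(1) process and recovers its coefficient by least squares. The parameter values (c = 2.0, φ = 0.7), the sample size, and the unit-variance Gaussian noise are assumptions for illustration; in practice a library such as statsmodels would do the fitting.

```python
import random

random.seed(42)
c, phi = 2.0, 0.7          # assumed illustrative parameters
n = 5000

# Simulate Y_t = c + phi * Y_{t-1} + eps_t, starting at the process mean c / (1 - phi).
y = [c / (1 - phi)]
for _ in range(n - 1):
    y.append(c + phi * y[-1] + random.gauss(0, 1))

# Estimate phi and c by least squares of Y_t on Y_{t-1}.
x_prev, x_curr = y[:-1], y[1:]
mean_prev = sum(x_prev) / len(x_prev)
mean_curr = sum(x_curr) / len(x_curr)
cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(x_prev, x_curr))
var = sum((a - mean_prev) ** 2 for a in x_prev)
phi_hat = cov / var
c_hat = mean_curr - phi_hat * mean_prev
```

With a few thousand observations, `phi_hat` should land close to 0.7 and `c_hat` close to 2.0, showing that the dependence on the previous value is recoverable from the data alone.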

2. MA (Moving Average Model)

Structure:

Current value depends on past error terms (noise).

$$Y_t = c + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \epsilon_t$$

Application:

Suitable when random shocks (residuals) affect future values.

Example: Demand forecasting influenced by random market fluctuations.

3. ARIMA (Autoregressive Integrated Moving Average)

Structure:

Combines AR + differencing (I) + MA.

Notation: ARIMA(p, d, q)

p = AR terms

d = differencing order (to remove trend, make stationary)

q = MA terms

Application:

General-purpose model for non-stationary series with trend.

Example: Forecasting sales data with upward trend.

4. SARIMA (Seasonal ARIMA)

Structure:

Extends ARIMA by adding seasonal components.

Notation: SARIMA(p, d, q)(P, D, Q, m)

(P, D, Q) = seasonal AR, I, MA terms

m = seasonal period (e.g., 12 for monthly data with yearly seasonality)

Application:

Suitable for time series with seasonality.

Example: Monthly airline passenger data (seasonal peak in holidays).
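The seasonal differencing order D works like d, but with a lag of m instead of 1. The sketch below is a toy illustration with an assumed quarterly period (m = 4), a made-up seasonal pattern, and a linear trend, so the outcome can be checked by hand.

```python
def difference(series, lag=1):
    """Lag-`lag` differencing: y'_t = y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

m = 4                         # assumed seasonal period (quarterly data)
pattern = [10, -5, 0, -5]     # assumed repeating seasonal effect
series = [50 + 3 * t + pattern[t % m] for t in range(16)]

seasonal_diff = difference(series, lag=m)   # D = 1: removes the repeating pattern
both = difference(seasonal_diff, lag=1)     # followed by d = 1
```

After one seasonal difference the pattern is gone and only the constant 12 (slope 3 times period 4) remains; the additional first difference reduces the series to zeros.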

5. SARIMAX (Seasonal ARIMA with Exogenous Variables)

Structure:

SARIMA + allows inclusion of external/exogenous variables (X).

Equation includes predictors besides the time series itself.

Application:

Best when external factors influence the series.

Example:

Forecasting sales considering holidays, marketing spend, or economic indicators.

Energy demand prediction using weather variables.

✅ Summary Table
Model	Structure	Handles Trend?	Handles Seasonality?	Allows External Factors?	Example Use Case
AR	Past values (lags)	No	No	No	Stock prices
MA	Past errors	No	No	No	Random shocks in demand
ARIMA	AR + Differencing + MA	Yes	No	No	Sales with trend
SARIMA	ARIMA + Seasonal terms	Yes	Yes	No	Airline passengers
SARIMAX	SARIMA + Exogenous variables	Yes	Yes	Yes	Sales with holidays/ads

Dataset:
● NYC Taxi Fare Data
● AirPassengers Dataset
6: Load a time series dataset (e.g., AirPassengers), plot the original series,
and decompose it into trend, seasonality, and residual components.

Here’s the AirPassengers dataset decomposition:

Original series (blue): Monthly airline passenger counts.

Trend (red): Long-term upward movement in passenger numbers.

Seasonality (green): Repeating yearly fluctuations (e.g., peaks during certain months).

Residuals (orange): Random variations not explained by trend or seasonality.

✅ This visualization helps us separate the components for better forecasting with models like SARIMA.
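The decomposition itself can be sketched by hand. The example below is an illustrative classical multiplicative decomposition on a synthetic stand-in series, not the AirPassengers data: the series, the period of 12, and all names are assumptions. The `seasonal_decompose` call used in question 8 follows broadly the same classical procedure.

```python
import math

period = 12
n = 72
# Synthetic stand-in: growing trend times a fixed monthly pattern (no noise).
true_seasonal = [1 + 0.2 * math.sin(2 * math.pi * k / period) for k in range(period)]
series = [(100 + 3 * t) * true_seasonal[t % period] for t in range(n)]

def centered_ma(x, period):
    """2 x period centered moving average (for an even period); None where undefined."""
    half = period // 2
    out = [None] * len(x)
    for t in range(half, len(x) - half):
        window = x[t - half:t + half + 1]
        out[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    return out

trend = centered_ma(series, period)

# Seasonal index per month: average ratio of observation to trend, normalized to mean 1.
ratios = [[] for _ in range(period)]
for t, (y, tr) in enumerate(zip(series, trend)):
    if tr is not None:
        ratios[t % period].append(y / tr)
raw = [sum(r) / len(r) for r in ratios]
scale = sum(raw) / period
seasonal = [v / scale for v in raw]

# Residual: what is left after dividing out trend and seasonality.
residual = [y / (tr * seasonal[t % period])
            for t, (y, tr) in enumerate(zip(series, trend)) if tr is not None]
```

On this noise-free series the estimated seasonal indices stay close to the true pattern and the residuals stay close to 1, which is what a good multiplicative fit looks like.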

7: Apply Isolation Forest on a numerical dataset (e.g., NYC Taxi Fare) to
detect anomalies. Visualize the anomalies on a 2D scatter plot.

Generate a synthetic taxi fare dataset (features: distance and fare) as a stand-in for the NYC Taxi Fare data.

Apply Isolation Forest to detect anomalies.

Visualize the anomalies in a 2D scatter plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# 1. Generate synthetic taxi fare dataset
np.random.seed(42)
n_samples = 300

# Normal data: fare roughly proportional to distance
distance = np.random.uniform(0.5, 20, n_samples)
fare = distance * 2.5 + np.random.normal(0, 2, n_samples)

# Add anomalies: unrealistic fares
anomalous_distance = np.random.uniform(5, 20, 10)
anomalous_fare = np.random.uniform(50, 150, 10)

# Combine
X = np.vstack((
    np.column_stack((distance, fare)),
    np.column_stack((anomalous_distance, anomalous_fare))
))
df = pd.DataFrame(X, columns=["distance", "fare"])

# 2. Apply Isolation Forest
iso = IsolationForest(contamination=0.03, random_state=42)
df["anomaly"] = iso.fit_predict(df[["distance", "fare"]])

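# 3. Visualize: plot inliers and anomalies separately so the legend is explicit
normal = df[df["anomaly"] == 1]
outliers = df[df["anomaly"] == -1]
plt.figure(figsize=(8, 6))
plt.scatter(normal["distance"], normal["fare"], c="tab:blue", label="Normal (1)")
plt.scatter(outliers["distance"], outliers["fare"], c="tab:red", label="Anomaly (-1)")
plt.xlabel("Distance (miles)")
plt.ylabel("Fare ($)")
plt.title("Isolation Forest - Synthetic Taxi Fare Anomaly Detection")
plt.legend()
plt.show()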


🔎 Explanation:

Normal points (1) follow a linear relation between distance and fare.

Anomalies (-1) are fares too high/low compared to distance.

This helps flag suspicious taxi rides (e.g., overcharging).
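Why anomalies get isolated with fewer splits can be shown with a one-dimensional sketch of the isolation idea. This is a simplification and not scikit-learn's algorithm: it uses a single feature, no subsampling, and hypothetical helper names (`isolation_depth`, `mean_depth`) with made-up fare values.

```python
import random

def isolation_depth(values, target, rng):
    """Number of random splits needed until `target` is alone in its partition."""
    subset, depth = list(values), 0
    while len(subset) > 1:
        lo, hi = min(subset), max(subset)
        if lo == hi:          # identical values cannot be separated further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        subset = [v for v in subset if (v < split) == (target < split)]
        depth += 1
    return depth

def mean_depth(values, target, trials=500, seed=0):
    """Average isolation depth of `target` over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

fares = [10 + 0.1 * i for i in range(50)] + [100.0]   # 50 typical fares and one extreme fare
typical_depth = mean_depth(fares, fares[25])
outlier_depth = mean_depth(fares, 100.0)
```

The extreme fare is usually separated by the very first split, while a typical fare needs several more. Isolation Forest turns this average path length into an anomaly score.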

8: Train a SARIMA model on the monthly airline passengers dataset.
Forecast the next 12 months and visualize the results.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm

# -------------------------------
# 1. Load the dataset
# -------------------------------
# AirPassengers ships with R; statsmodels downloads it from the Rdatasets
# repository, so this call needs internet access.
data = sm.datasets.get_rdataset("AirPassengers").data

# The frame has a 'time' column (fractional year) and a 'value' column (passengers).
# Assigning a month-start DatetimeIndex directly keeps its frequency information.
data.index = pd.date_range(start='1949-01', periods=len(data), freq='MS')
data.index.name = 'Month'
ts = data['value']

# -------------------------------
# 2. Visualize the time series
# -------------------------------
plt.figure(figsize=(10,4))
plt.plot(ts, label="AirPassengers")
plt.title("Monthly Airline Passengers")
plt.xlabel("Year")
plt.ylabel("Passengers")
plt.legend()
plt.show()

# -------------------------------
# 3. Decompose the series
# -------------------------------
decomposition = seasonal_decompose(ts, model='multiplicative', period=12)
decomposition.plot()
plt.show()

# -------------------------------
# 4. Train SARIMA model
# -------------------------------
# SARIMA(p,d,q)(P,D,Q,s)
# Let's start with SARIMA(1,1,1)(1,1,1,12)
model = SARIMAX(ts, order=(1,1,1), seasonal_order=(1,1,1,12))
results = model.fit(disp=False)

print(results.summary())

# -------------------------------
# 5. Forecast next 12 months
# -------------------------------
forecast = results.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()

# -------------------------------
# 6. Visualization
# -------------------------------
plt.figure(figsize=(10,5))
plt.plot(ts, label="Observed")
plt.plot(forecast_mean.index, forecast_mean, label="Forecast", color="red")
plt.fill_between(forecast_ci.index,
                 forecast_ci.iloc[:, 0],
                 forecast_ci.iloc[:, 1], color="pink", alpha=0.3)

plt.title("SARIMA Forecast - Next 12 Months")
plt.xlabel("Year")
plt.ylabel("Passengers")
plt.legend()
plt.show()

📊 Explanation of Steps

Load dataset – The AirPassengers dataset contains monthly totals of international airline passengers from 1949–1960.

Plot original series – To see overall growth and seasonality.

Decompose – Splits into trend, seasonality, and residuals.

SARIMA model – Combines ARIMA with seasonality (s=12 months).

Example used: (p,d,q) = (1,1,1) and (P,D,Q,12) = (1,1,1,12).

Forecasting – Predicts the next 12 months.

Visualization – Forecast shown with confidence intervals.


9: Apply Local Outlier Factor (LOF) on any numerical dataset to detect
anomalies and visualize them using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

# -------------------------------
# 1. Generate synthetic dataset
# -------------------------------
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.60, random_state=42)

# Add some outliers manually
rng = np.random.RandomState(42)
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X, X_outliers])

# -------------------------------
# 2. Apply Local Outlier Factor
# -------------------------------
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)
# -1 = anomaly, 1 = inlier
scores = lof.negative_outlier_factor_

# -------------------------------
# 3. Visualization
# -------------------------------
plt.figure(figsize=(8,6))

# Plot inliers
plt.scatter(X[y_pred==1, 0], X[y_pred==1, 1],
            c='blue', label='Inliers', s=40)

# Plot outliers
plt.scatter(X[y_pred==-1, 0], X[y_pred==-1, 1],
            c='red', label='Outliers', s=60, edgecolors='k')

plt.title("Local Outlier Factor (LOF) Anomaly Detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
📊 Explanation

Dataset:

Generated a blob-shaped cluster (normal points).

Added random noise points (outliers).

Local Outlier Factor (LOF):

Works by comparing the local density of a point to its neighbors.

If density is much lower → marked as an outlier.

Visualization:

Blue = inliers

Red = anomalies
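The density comparison behind LOF can be written out by hand for a tiny dataset. The sketch below is a simplified version for illustration: it uses exactly k nearest neighbors (ties broken by index), and the function name `lof_scores` and the sample points are assumptions. scikit-learn reports the negated value of this score in `negative_outlier_factor_`.

```python
from math import dist

def lof_scores(points, k):
    """Local Outlier Factor for each point, using exactly k nearest neighbors."""
    n = len(points)
    # k nearest neighbors of every point as (distance, index) pairs, nearest first.
    knn = [
        sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        for i in range(n)
    ]
    k_dist = [nbrs[-1][0] for nbrs in knn]  # distance to the k-th neighbor

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(density[j] for _, j in knn[i]) / (k * density[i])
        for i in range(n)
    ]

grid = [(float(x), float(y)) for x in range(3) for y in range(3)]  # dense 3 x 3 cluster
points = grid + [(10.0, 10.0)]                                     # one far-away point
scores = lof_scores(points, k=3)
```

Points inside the grid score close to 1 (similar density to their neighbors), while the far-away point scores several times higher because its neighbors are much denser than it is.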

10: You are working as a data scientist for a power grid monitoring company.
Your goal is to forecast energy demand and also detect abnormal spikes or drops in
real-time consumption data collected every 15 minutes. The dataset includes features
like timestamp, region, weather conditions, and energy usage.
Explain your real-time data science workflow:
● How would you detect anomalies in this streaming data (Isolation Forest / LOF /
DBSCAN)?
● Which time series model would you use for short-term forecasting (ARIMA /
SARIMA / SARIMAX)?
● How would you validate and monitor the performance over time?
● How would this solution help business decisions or operations?

⚡ Real-Time Energy Demand Forecasting & Anomaly Detection Workflow
1. Anomaly Detection in Streaming Data

Since the dataset is high-frequency (every 15 minutes) and includes contextual features (region, weather, etc.), the anomaly detection system should handle both point anomalies (sudden spikes/drops) and contextual anomalies (e.g., normal usage at night but abnormal in the afternoon).

Isolation Forest

Scales well to large streaming data.

Detects unusual consumption by recursively partitioning data.

Works well for high-dimensional inputs (usage + weather + region).

LOF (Local Outlier Factor)

Useful for density-based anomalies (e.g., a point that is far from its local neighborhood).

More sensitive to local behavior, but not as scalable in real-time streaming.

DBSCAN

Can detect clusters of normal consumption vs. anomalies.

Less efficient for continuous streaming unless used in mini-batch mode.

👉 Choice: Use Isolation Forest for real-time detection due to scalability and robustness, and possibly supplement with LOF in batch mode for fine-grained anomaly detection.
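The streaming pattern (keep a bounded window of recent readings, score each new reading against it, then update the window) can be sketched as follows. To stay dependency-free, this sketch swaps in a robust z-score as the scorer instead of Isolation Forest; the class name `SlidingWindowDetector`, the window size, the threshold, and the sample stream are all assumptions.

```python
from collections import deque
from statistics import median

class SlidingWindowDetector:
    """Scores each new reading against a fixed-size window of recent readings."""

    def __init__(self, window=96, threshold=3.5):   # 96 readings = one day at 15-minute resolution
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        """Return True if `value` is anomalous relative to the window; store it otherwise."""
        is_anomaly = False
        if len(self.window) >= 10:                  # wait for a minimal history
            med = median(self.window)
            mad = median(abs(v - med) for v in self.window)
            if mad > 0:
                is_anomaly = abs(0.6745 * (value - med) / mad) > self.threshold
        if not is_anomaly:                          # keep flagged readings out of the baseline
            self.window.append(value)
        return is_anomaly

detector = SlidingWindowDetector(window=96)
stream = [500 + (i % 7) for i in range(50)] + [900] + [500 + (i % 7) for i in range(20)]
alerts = [i for i, reading in enumerate(stream) if detector.update(reading)]
```

Only the spike at position 50 raises an alert. In the workflow described above, the scoring step would call a periodically refitted Isolation Forest instead, while the bounded window and the score-then-update loop stay the same.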

2. Short-Term Forecasting Model

We need to forecast energy demand in the next few hours/days (short-term load forecasting).

ARIMA – Handles trends but not seasonality well.

SARIMA – Good for seasonality (daily/weekly demand cycles); with 15-minute readings, a daily cycle corresponds to a seasonal period of 96.

SARIMAX – Best choice here because it allows exogenous features (temperature, weather, holidays, region effects).

👉 Choice: SARIMAX for accurate short-term forecasts, since weather and region strongly affect energy usage.

3. Validation & Monitoring

Validation

Use time-ordered train-test splits with a rolling (walk-forward) window, never a random shuffle, to mimic real-time forecasting.

Metrics:

MAE / RMSE for forecasting accuracy.

Precision/Recall/F1 for anomaly detection (based on labeled anomalies, if available).

Monitoring in Production

Track forecast error drift over time.

Implement model retraining triggers (e.g., if forecast error > threshold for several days).

Use a dashboard (Grafana / Kibana) for real-time visualization of consumption, forecast, and anomaly alerts.
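Rolling validation and a retraining trigger can be sketched together. In this illustration a seasonal-naive forecast stands in for the SARIMAX model so the example stays self-contained; the function names, the synthetic load curve, and the threshold value are assumptions.

```python
import math

def seasonal_naive(history, period):
    """Forecast the next value as the value observed one seasonal period earlier."""
    return history[-period]

def walk_forward(series, period, start):
    """One-step-ahead forecasts for series[start:], each using only earlier observations."""
    errors = []
    for t in range(start, len(series)):
        forecast = seasonal_naive(series[:t], period)
        errors.append(series[t] - forecast)
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

period = 96   # one day of 15-minute readings
# Hypothetical load curve: a daily sine pattern plus a slow upward drift.
series = [500 + 100 * math.sin(2 * math.pi * t / period) + 0.1 * t
          for t in range(4 * period)]
mae, rmse = walk_forward(series, period, start=2 * period)

RETRAIN_THRESHOLD = 15.0   # assumed tolerance, in the same units as the series
needs_retraining = mae > RETRAIN_THRESHOLD
```

Every forecast uses only data that was available at that point in time, so the error estimate is not inflated by look-ahead. Here both errors are about 9.6 (the drift accumulated over one day), which is below the assumed threshold, so no retraining is triggered.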

4. Business Value / Operational Impact

Load Balancing & Grid Stability

Forecasting helps schedule power generation and avoid blackouts.

Anomaly detection prevents overloads caused by unexpected spikes.

Cost Optimization

Helps energy providers buy/sell electricity in the market at optimal times.

Reduces penalties for demand-supply mismatch.

Preventive Maintenance

Abnormal drops might indicate faults in sensors or grid failures → quick maintenance response.

Customer Insights

Detect unusual consumption in specific regions to optimize energy distribution and demand-response programs.

✅ Final Workflow Summary:

Streaming data ingestion → feature engineering (lag features, weather) → Isolation Forest for anomaly detection → SARIMAX for short-term forecasts → real-time monitoring dashboards → feedback loop for retraining → business decision support (load balancing, cost saving, maintenance).