Question 1: What is Anomaly Detection? Explain its types (point, contextual, and
collective anomalies) with examples.

Answer: Anomaly Detection

Anomaly Detection is a data analysis technique used to identify data points, patterns, or observations that deviate significantly from normal behavior. These unusual patterns are called anomalies or outliers and often indicate critical events such as fraud, system failures, intrusions, or errors.

Types of Anomalies
1. Point Anomalies

A point anomaly occurs when a single data instance is significantly different from the rest of the data.
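
A minimal sketch of how a single deviating instance might be flagged, assuming made-up temperature readings and a modified z-score built from the median and the median absolute deviation (MAD). The 3.5 cutoff is a common rule of thumb, not something fixed by the definition above.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD (assumes MAD is not zero)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical temperature readings in degrees Celsius; 120 is the injected spike.
readings = [25, 26, 27, 28, 29, 30, 120, 26, 27]

scores = modified_z_scores(readings)
point_anomalies = [r for r, z in zip(readings, scores) if abs(z) > 3.5]
print(point_anomalies)
```

The median and MAD are used here because a plain mean and standard deviation are themselves pulled toward the spike, which can hide it.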

Example:

In a credit card transaction dataset, a transaction of ₹5,00,000 when the user usually spends ₹500–₹5,000.

A sudden spike in temperature sensor data showing 120°C when normal values are around 25–30°C.

📌 Most common and easiest type to detect.

2. Contextual Anomalies

A contextual anomaly is an observation that is anomalous only in a specific context (such as time, location, or season).
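
One way to sketch this idea is to compare each value against a baseline for its own context. The example below assumes hypothetical hourly request counts and a z-score threshold of 3; the same value is judged differently depending on the hour.

```python
from statistics import mean, stdev

# Hypothetical request counts observed on previous days, keyed by hour of day.
history = {
    3: [40, 55, 38, 47, 50],        # 3 AM: normally quiet
    14: [880, 910, 940, 895, 925],  # 2 PM: normally busy
}

def is_contextual_anomaly(hour, value, threshold=3.0):
    """Compare a value with the baseline for its own hour (its context)."""
    baseline = history[hour]
    z = (value - mean(baseline)) / stdev(baseline)
    return abs(z) > threshold

print(is_contextual_anomaly(14, 900))  # the value checked against the busy-hour baseline
print(is_contextual_anomaly(3, 900))   # the same value checked against the quiet-hour baseline
```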

Example:

25°C temperature is normal in summer but anomalous in winter.

High website traffic at 2 PM is normal, but the same traffic at 3 AM may be anomalous.

📌 Requires contextual information (time, location, environment).

3. Collective Anomalies

A collective anomaly occurs when a group of related data points is anomalous, even though individual points may appear normal.
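
A rough sketch of the difference between point-level and group-level checks, assuming hypothetical withdrawal amounts and illustrative limits. No single withdrawal crosses the per-transaction limit, but some runs of consecutive withdrawals exceed a windowed total.

```python
# Hypothetical withdrawals in currency units, in time order.
withdrawals = [200, 0, 150, 0, 0, 180, 190, 195, 185, 190, 0, 120]

PER_TXN_LIMIT = 1000   # assumed limit for flagging a single withdrawal
WINDOW = 4             # assumed number of consecutive withdrawals to sum
WINDOW_LIMIT = 700     # assumed limit for the windowed total

# Point-level check: is any individual withdrawal too large?
point_flags = [w > PER_TXN_LIMIT for w in withdrawals]

# Group-level check: (start index, total) of every window whose sum is too large.
collective_windows = [
    (start, sum(withdrawals[start:start + WINDOW]))
    for start in range(len(withdrawals) - WINDOW + 1)
    if sum(withdrawals[start:start + WINDOW]) > WINDOW_LIMIT
]
print(any(point_flags), collective_windows)
```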

Example:

A series of small network packets sent repeatedly could indicate a DDoS attack, even though each packet alone seems normal.

Continuous small withdrawals from a bank account indicating fraudulent behavior.

📌 Focuses on patterns rather than individual points.

Question 2: Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of
their approach and suitable use cases.

Answer: 1. Isolation Forest (iForest)

Approach:

Based on the idea that anomalies are easier to isolate than normal points.

Uses an ensemble of random decision trees.

Anomalies require fewer splits to be isolated in the trees.
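
The isolation idea can be sketched in one dimension with plain Python. This is a simplified illustration, not the scikit-learn implementation: a real Isolation Forest also picks a random feature at each split and builds trees on subsamples. The readings and helper names below are hypothetical.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits until `target` is the only point left in its partition."""
    points = list(values)
    depth = 0
    while len(points) > 1:
        split = rng.uniform(min(points), max(points))
        # Keep only the side of the split that contains the target.
        if target < split:
            points = [p for p in points if p < split]
        else:
            points = [p for p in points if p >= split]
        depth += 1
    return depth

def mean_depth(values, target, trials=500, seed=0):
    """Average isolation depth over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Hypothetical readings: a dense group between 25 and 30 plus one extreme value.
values = [25 + 0.25 * i for i in range(21)] + [120.0]

print(mean_depth(values, 120.0))  # extreme value
print(mean_depth(values, 27.5))   # value inside the dense group
```

The extreme value should come out with a clearly smaller average depth, which is the quantity an Isolation Forest turns into an anomaly score.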

Key Characteristics:

Does not rely on distance or density.

Works well with high-dimensional data.

Scales efficiently to large datasets.

Suitable Use Cases:

Fraud detection (credit card, insurance)

Network intrusion detection

High-dimensional datasets (logs, sensor data)

Limitations:

Less interpretable

May struggle with local anomalies in dense regions

2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Approach:

A density-based clustering algorithm.

Points in low-density regions are labeled as noise (anomalies).

Uses two parameters: eps (neighborhood radius) and min_samples.
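
A small sketch of noise labeling with scikit-learn's `DBSCAN`, assuming toy 2-D points; the `eps` and `min_samples` values are chosen for this data only and would need tuning on a real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: a tight group near the origin and one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [0.1, 0.1], [0.2, 0.1], [5.0, 5.0],
])

# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

noise_points = X[labels == -1]   # DBSCAN marks noise with the label -1
print(labels)
print(noise_points)
```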

Key Characteristics:

Identifies arbitrarily shaped clusters.

Does not require specifying the number of clusters.

Sensitive to parameter selection.

Suitable Use Cases:

Spatial data analysis

Geographical data

Datasets with clear density differences

Limitations:

Poor performance in high-dimensional data

Struggles with varying density clusters

3. Local Outlier Factor (LOF)

Approach:

Measures local density deviation of a point compared to its neighbors.

Points with significantly lower density than neighbors are flagged as anomalies.

Key Characteristics:

Excellent at detecting local anomalies.

Density-based, with density estimated from distances to the k nearest neighbors.

Provides an outlier score (LOF score).
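
As a sketch of how the score can be read in scikit-learn, the fitted `LocalOutlierFactor` exposes `negative_outlier_factor_`, which holds the negated LOF of each training sample. The grid data and `n_neighbors=5` below are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a 5x5 grid of closely spaced points plus one distant point.
grid = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)])
X = np.vstack([grid, [[3.0, 3.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)

# scikit-learn stores the negated LOF; values far below -1 indicate outliers.
scores = lof.negative_outlier_factor_
most_anomalous = int(np.argmin(scores))
print(most_anomalous, X[most_anomalous])
```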

Suitable Use Cases:

Detecting subtle, local outliers

Datasets with varying densities

Time-series and spatial datasets

Limitations:

Computationally expensive

Sensitive to choice of k (number of neighbors)

Comparison Table
Aspect	Isolation Forest	DBSCAN	Local Outlier Factor
Core Idea	Isolation via random splits	Density-based clustering	Local density comparison
Type	Tree-based	Density-based	Density-based
Handles High Dimensions	✅ Yes	❌ No	❌ Limited
Detects Local Outliers	⚠️ Limited	❌ No	✅ Yes
Scalability	High	Medium	Low–Medium
Key Parameters	n_estimators, contamination	eps, min_samples	n_neighbors
Output	Anomaly score	Clusters + noise	LOF score

Question 3: What are the key components of a Time Series? Explain each with one
example.

Answer: 1. Trend (T)

The trend represents the long-term movement or overall direction of the data over time.

Example:

A company’s annual sales increasing steadily over the last 10 years.

Rising average global temperature over decades.

📈 Shows long-term growth or decline.

2. Seasonality (S)

Seasonality refers to regular and repeating patterns at fixed intervals due to seasonal factors.

Example:

Ice cream sales increase every summer and decrease in winter.

Higher electricity consumption during daytime hours.

🔁 Occurs at known, fixed intervals.

3. Cyclical Component (C)

The cyclical component represents long-term fluctuations that do not have a fixed or regular period, often influenced by economic or business cycles.

Example:

Economic expansions and recessions affecting employment rates.

Real estate market ups and downs over several years.

🔄 Period and amplitude are not fixed.

4. Irregular / Random Component (R)

The irregular (random) component captures unpredictable, random variations caused by unexpected events.

Example:

A sudden drop in airline bookings due to a pandemic.

Stock price fluctuations caused by breaking news.

⚡ Noise that cannot be predicted easily.
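
To make the components concrete, here is a minimal sketch that builds a synthetic monthly series from an assumed additive model Y = T + S + R and then recovers the seasonal shape. All numbers are made up, and the cyclical component is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, period = 10, 12
t = np.arange(n_years * period)

trend = 100 + 2.0 * t                                           # T: steady long-term growth
pattern = 10 * np.sin(2 * np.pi * np.arange(period) / period)
seasonal = np.tile(pattern, n_years)                            # S: repeats every 12 steps
irregular = rng.normal(0, 0.5, size=t.size)                     # R: random noise

series = trend + seasonal + irregular                           # additive model Y = T + S + R

# With the trend removed, averaging each calendar position recovers the seasonal shape.
detrended = series - trend
estimated_pattern = detrended.reshape(n_years, period).mean(axis=0)
print(np.round(estimated_pattern, 1))
```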

Question 4: Define stationarity in time series. How can you test and transform a
non-stationary series into a stationary one?

Answer: Stationarity in Time Series

A time series is said to be stationary if its statistical properties remain constant over time.
This means the series has:

Constant mean

Constant variance

Constant autocovariance (the covariance between two observations depends only on the lag between them, not on time)

Stationarity is important because many time-series models (like ARIMA) assume the data is stationary.

How to Test Stationarity
1. Visual Inspection

Plot the time series.

If the mean or variance changes over time, the series is likely non-stationary.

Example:
An upward-sloping sales plot indicates non-stationarity due to trend.

2. Augmented Dickey–Fuller (ADF) Test

Null hypothesis (H₀): The time series has a unit root (is non-stationary).

Alternative hypothesis (H₁): The time series is stationary.

Decision Rule:

If p-value < 0.05, reject H₀ → the series is treated as stationary.

If p-value ≥ 0.05, fail to reject H₀ → the series is treated as non-stationary.

3. KPSS Test

Null hypothesis (H₀): Time series is stationary.

Alternative hypothesis (H₁): Time series is non-stationary.

📌 ADF and KPSS have opposite null hypotheses, so agreement between the two gives a more reliable conclusion.

How to Transform a Non-Stationary Series into a Stationary One
1. Differencing

Subtract the previous observation from the current one.

$$Y'_t = Y_t - Y_{t-1}$$

Removes trend

First-order differencing is most common
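
A short sketch of first-order differencing on a made-up series with a linear trend. After differencing, the trend disappears and only the constant step size remains.

```python
import numpy as np

# Hypothetical series with a linear trend: it rises by 2 units per step.
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# First-order differencing: y'_t = y_t - y_{t-1} (one observation is lost).
y_diff = np.diff(y)
print(y_diff)
```

With a pandas Series, `Series.diff()` performs the same operation and leaves a NaN in the first position.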

2. Log / Power Transformation
Apply log, square root, or Box-Cox transformation.

Stabilizes variance

Useful when variance increases over time

Example:

$$Y'_t = \log(Y_t)$$

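As a small illustration of variance stabilization, the sketch below assumes a hypothetical series that grows by 50% per step. The raw step sizes grow with the level of the series, while the steps of the log-transformed series are all equal.

```python
import numpy as np

t = np.arange(6)
y = 100 * 1.5 ** t                 # hypothetical series growing 50% per step

raw_steps = np.diff(y)             # step sizes grow with the level of the series
log_steps = np.diff(np.log(y))     # after the log transform every step is equal

print(np.round(raw_steps, 1))
print(np.round(log_steps, 4))
```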
3. Detrending

Remove the trend component using regression or smoothing.

Useful when trend is deterministic

4. Seasonal Differencing

Subtract values from the same season.

$$Y'_t = Y_t - Y_{t-m}$$

(where m is the seasonal period, e.g., 12 for monthly data)
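
A minimal sketch of seasonal differencing, assuming a toy series with period m = 4 (for example, quarterly data) made of a repeating pattern on a constant level. Subtracting the value from one season earlier removes the pattern entirely.

```python
import numpy as np

m = 4  # assumed seasonal period for this toy example
pattern = np.array([5.0, -2.0, -6.0, 3.0])

# Hypothetical series: the same seasonal pattern repeated on top of a constant level.
y = 50 + np.tile(pattern, 3)

# Seasonal differencing: y'_t = y_t - y_{t-m} (the first m observations are lost).
y_seasonal_diff = y[m:] - y[:-m]
print(y_seasonal_diff)
```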

Question 5: Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in
terms of structure and application.

Answer: 1. AR (AutoRegressive) Model

Structure:

$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t$$

Key Idea:

Current value depends on past values of the series.
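
To illustrate the dependence on past values, the sketch below simulates an AR(1) process with assumed parameters (c = 2, phi = 0.7) and then estimates them by regressing each value on its predecessor. This is an illustration with NumPy only, not a replacement for a proper time-series fitting routine.

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, n = 2.0, 0.7, 5000     # assumed constant, AR coefficient, and sample size

# Simulate Y_t = c + phi * Y_{t-1} + eps_t, starting at the process mean c / (1 - phi).
y = np.empty(n)
y[0] = c / (1 - phi)
eps = rng.normal(size=n)
for t in range(1, n):
    y[t] = c + phi * y[t - 1] + eps[t]

# Regress Y_t on Y_{t-1} by least squares; polyfit returns (slope, intercept).
phi_hat, c_hat = np.polyfit(y[:-1], y[1:], deg=1)
print(round(phi_hat, 2), round(c_hat, 2))
```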

Application:

When the time series is stationary and shows strong autocorrelation.

Example:

Daily temperature forecasting.

2. MA (Moving Average) Model

Structure:

$$Y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$

Key Idea:

Current value depends on past error terms.
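
A sketch of what dependence on past errors looks like in practice, assuming a simulated MA(1) process with theta = 0.5. For MA(1), theory gives a lag-1 autocorrelation of theta / (1 + theta^2) and zero autocorrelation at higher lags, and the sample values should land close to that.

```python
import numpy as np

rng = np.random.default_rng(0)
c, theta, n = 0.0, 0.5, 20000  # assumed constant, MA coefficient, and sample size

# Simulate Y_t = c + eps_t + theta * eps_{t-1}.
eps = rng.normal(size=n + 1)
y = c + eps[1:] + theta * eps[:-1]

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return float(np.dot(x[lag:], x[:-lag]) / np.dot(x, x))

# Sample autocorrelations at lags 1 and 2, followed by the theoretical lag-1 value.
print(round(autocorr(y, 1), 2), round(autocorr(y, 2), 2))
print(theta / (1 + theta ** 2))
```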

Application:

When past shocks/noise influence the series.

Example:

Modeling random demand fluctuations.

3. ARIMA (AutoRegressive Integrated Moving Average)

Structure:

$$\text{ARIMA}(p, d, q)$$

p: AR order

d: Differencing order (to make series stationary)

q: MA order

Key Idea:

Combines AR + differencing + MA.

Application:

Non-stationary time series without seasonality.

Example:

Sales forecasting with a trend.

4. SARIMA (Seasonal ARIMA)

Structure:

$$\text{SARIMA}(p, d, q)(P, D, Q)_m$$

Adds seasonal AR, differencing, and MA components

m: Seasonal period (e.g., 12 for monthly data)

Key Idea:

Captures seasonal patterns in time series.

Application:

Time series with trend and seasonality.

Example:

Monthly airline passenger data.

5. SARIMAX (Seasonal ARIMA with Exogenous Variables)

Structure:

$$\text{SARIMAX}(p, d, q)(P, D, Q)_m + X_t$$

Includes external (exogenous) variables

Key Idea:

Forecast depends on past values + seasonality + external factors.

Application:

When time series is influenced by other variables.

Example:

Sales forecasting using promotions and holidays.

Comparison Table
Model	Components	Seasonality	External Variables	Application
AR	Past values	❌ No	❌ No	Stationary data
MA	Past errors	❌ No	❌ No	Noise-based patterns
ARIMA	AR + I + MA	❌ No	❌ No	Trend, non-stationary
SARIMA	ARIMA + seasonal	✅ Yes	❌ No	Seasonal time series
SARIMAX	SARIMA + exogenous	✅ Yes	✅ Yes	Seasonal + external factors


Dataset:
● NYC Taxi Fare Data
● AirPassengers Dataset
Question 6: Load a time series dataset (e.g., AirPassengers), plot the original series,
and decompose it into trend, seasonality, and residual components.

Answer:

Step 1: Import Required Libraries



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose


Step 2: Load the AirPassengers Dataset

In [None]:
# Load dataset
data = pd.read_csv(
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
)

# Convert Month column to datetime
data['Month'] = pd.to_datetime(data['Month'])
data.set_index('Month', inplace=True)

# Rename column for convenience
data.columns = ['Passengers']

print(data.head())


Step 3: Plot the Original Time Series

In [2]:
plt.figure()
plt.plot(data, label='Air Passengers')
plt.title('AirPassengers Time Series')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()


Step 4: Decompose the Time Series

Since the seasonal swings grow as the level of the series rises, we use a multiplicative model.

In [3]:
decomposition = seasonal_decompose(
    data['Passengers'],
    model='multiplicative',
    period=12
)

decomposition.plot()
plt.show()


Step 5: Components Explanation
1. Trend

Shows long-term growth in air passengers.

Reflects increasing air travel demand.

2. Seasonality

Repeating yearly pattern.

Peaks in mid-year (July–August, the summer travel season).

3. Residual (Irregular)

Random fluctuations not explained by trend or seasonality.

Contains noise and unexpected variations.

Question 7: Apply Isolation Forest on a numerical dataset (e.g., NYC Taxi Fare) to
detect anomalies. Visualize the anomalies on a 2D scatter plot.

Answer:

Step 1: Import Required Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest


Step 2: Load and Prepare the Dataset

(Assume a simplified NYC Taxi Fare dataset with numerical features)

In [5]:
# Sample structure of NYC Taxi Fare data
# Features: trip_distance, fare_amount
data = pd.read_csv("nyc_taxi_fare.csv")

# Select numerical features
X = data[['trip_distance', 'fare_amount']]


Step 3: Apply Isolation Forest

In [6]:
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,
    random_state=42
)

# Fit and predict
data['anomaly'] = iso_forest.fit_predict(X)

# -1 → anomaly, 1 → normal


Step 4: Separate Normal Points and Anomalies

In [None]:
normal = data[data['anomaly'] == 1]
anomalies = data[data['anomaly'] == -1]


Step 5: Visualize Anomalies (2D Scatter Plot)

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(normal['trip_distance'], normal['fare_amount'], label='Normal', alpha=0.5)
plt.scatter(anomalies['trip_distance'], anomalies['fare_amount'], label='Anomaly')
plt.xlabel('Trip Distance')
plt.ylabel('Fare Amount')
plt.title('Isolation Forest – NYC Taxi Fare Anomaly Detection')
plt.legend()
plt.show()


Interpretation of Results

Normal points follow the expected fare–distance relationship

Anomalies include:

Very high fare for short distance

Unusually low fare for long distance

These may indicate:

Meter faults

Data entry errors

Fraudulent rides

Why Isolation Forest Works Well Here

NYC Taxi Fare data is large-scale

Isolation Forest:

Does not rely on distance or density

Efficient for high-volume numerical data

Isolates rare, abnormal observations quickly

Question 8: Train a SARIMA model on the monthly airline passengers dataset.
Forecast the next 12 months and visualize the results.

Answer:

Step 1: Import Required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX


Step 2: Load the AirPassengers Dataset

In [None]:
# Load dataset
data = pd.read_csv(
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
)

# Convert to datetime
data['Month'] = pd.to_datetime(data['Month'])
data.set_index('Month', inplace=True)
data.columns = ['Passengers']

print(data.head())


Step 3: Train the SARIMA Model

AirPassengers data has:

Trend

Strong yearly seasonality (period = 12)

We use:
SARIMA(1,1,1)(1,1,1,12)

In [None]:
model = SARIMAX(
    data['Passengers'],
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12)
)

sarima_model = model.fit()
print(sarima_model.summary())


Step 4: Forecast the Next 12 Months

In [None]:
forecast = sarima_model.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()


Step 5: Visualize the Forecast

In [7]:
plt.figure(figsize=(10, 6))

# Plot original data
plt.plot(data, label='Observed')

# Plot forecast
plt.plot(forecast_mean, label='Forecast', linestyle='--')

# Confidence interval
plt.fill_between(
    forecast_ci.index,
    forecast_ci.iloc[:, 0],
    forecast_ci.iloc[:, 1],
    alpha=0.3
)

plt.title('SARIMA Forecast – AirPassengers Dataset')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()


Question 9: Apply Local Outlier Factor (LOF) on any numerical dataset to detect
anomalies and visualize them using matplotlib.

Answer:

Step 1: Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor


Step 2: Create / Load a Numerical Dataset


In [7]:
np.random.seed(42)

# Generate normal data
X_normal = 0.3 * np.random.randn(200, 2)

# Generate outliers
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# Combine data
X = np.vstack((X_normal, X_outliers))

data = pd.DataFrame(X, columns=['Feature1', 'Feature2'])


Step 3: Apply Local Outlier Factor (LOF)

In [None]:
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)

# Fit and predict
data['anomaly'] = lof.fit_predict(data[['Feature1', 'Feature2']])

# -1 → anomaly, 1 → normal


Step 4: Separate Normal Points and Anomalies

In [None]:
normal = data[data['anomaly'] == 1]
anomalies = data[data['anomaly'] == -1]


Step 5: Visualize Anomalies Using Matplotlib

In [None]:
plt.figure(figsize=(8, 6))

plt.scatter(
    normal['Feature1'], normal['Feature2'],
    label='Normal', alpha=0.6
)

plt.scatter(
    anomalies['Feature1'], anomalies['Feature2'],
    label='Anomaly'
)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Local Outlier Factor (LOF) – Anomaly Detection')
plt.legend()
plt.show()


Interpretation of Results

Normal points form dense clusters

Anomalies lie in low-density regions

LOF detects local density deviations, not just global outliers

Why LOF is Useful

Excellent for local anomalies

Works well when:

Data has varying densities

Outliers are subtle and context-dependent

Question 10: You are working as a data scientist for a power grid monitoring company.
Your goal is to forecast energy demand and also detect abnormal spikes or drops in
real-time consumption data collected every 15 minutes. The dataset includes features
like timestamp, region, weather conditions, and energy usage.
Explain your real-time data science workflow:
● How would you detect anomalies in this streaming data (Isolation Forest / LOF /
DBSCAN)?
● Which time series model would you use for short-term forecasting (ARIMA /
SARIMA / SARIMAX)?
● How would you validate and monitor the performance over time?
● How would this solution help business decisions or operations?

Answer: 1. Real-Time Anomaly Detection (Streaming Data)
Recommended Approach: Isolation Forest + LOF (Hybrid)
Why not only DBSCAN?

DBSCAN is not ideal for streaming data

Sensitive to eps and min_samples

Struggles with high-dimensional and evolving data

Isolation Forest (Primary Detector)

Why Isolation Forest?

Fast and scalable for high-frequency (15-minute) data

Works well with high-dimensional features

Suitable for near real-time scoring

Features used:

Energy usage

Temperature, humidity

Hour of day, day of week

Region-level aggregated demand

What it detects:

Sudden spikes or drops

Meter faults

Data ingestion errors

Local Outlier Factor (Secondary Validator)

Why LOF?

Detects local anomalies

Useful when:

One region behaves abnormally compared to nearby regions

Gradual drifts occur

📌 Workflow:

Isolation Forest flags anomalies first

LOF confirms anomalies locally (reduces false positives)
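
The two-stage workflow could be sketched as follows on a single batch of readings. The feature values, the `contamination` setting, and the rule of alerting only when both detectors agree are assumptions made for illustration; in practice the features would also be scaled first.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical batch of [energy_usage, temperature] readings plus one extreme spike.
normal_batch = rng.normal(loc=[500.0, 25.0], scale=[20.0, 2.0], size=(200, 2))
X = np.vstack([normal_batch, [[900.0, 25.0]]])

# Stage 1: Isolation Forest proposes candidate anomalies (-1 = anomaly).
iso_flags = IsolationForest(contamination=0.05, random_state=42).fit_predict(X) == -1

# Stage 2: LOF must agree before an alert is raised.
lof_flags = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X) == -1

confirmed = iso_flags & lof_flags
print(iso_flags.sum(), lof_flags.sum(), confirmed.sum())
```

For true streaming use, both models would be fitted on historical data and applied to new readings; scikit-learn's `LocalOutlierFactor` needs `novelty=True` to score unseen points.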

2. Short-Term Energy Demand Forecasting
Recommended Model: SARIMAX
Why SARIMAX?

Energy demand has:

Strong daily & weekly seasonality

Dependence on external factors

Model Structure:

$$\text{SARIMAX}(p, d, q)(P, D, Q)_m + X$$

Exogenous Variables (X):

Temperature

Humidity

Weather events

Holiday indicators

📌 Forecast Horizon:

Next 1–24 hours

Next 1–7 days (short-term grid planning)

Why not ARIMA or SARIMA alone?
Model	Limitation
ARIMA	No seasonality, no external features
SARIMA	Seasonality yes, but ignores weather
SARIMAX	✔ Seasonality + ✔ Weather impact
3. Validation & Performance Monitoring
Forecast Validation

Rolling / sliding window evaluation

Metrics:

MAE (Mean Absolute Error)

RMSE

MAPE

Compare:

Forecast vs actual demand per region
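
The three metrics can be computed with a few lines of plain Python. The demand and forecast values below are hypothetical, and the MAPE helper assumes that no actual value is zero.

```python
import math

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical demand (MW) for four 15-minute intervals and the matching forecasts.
actual = [100.0, 200.0, 400.0, 500.0]
forecast = [110.0, 190.0, 380.0, 520.0]

print(mae(actual, forecast), rmse(actual, forecast), mape(actual, forecast))
```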

Anomaly Detection Monitoring

Track:

Number of anomalies per day

False positives confirmed by operators

Periodically retrain models:

Weekly or monthly

Adapt to seasonal load changes

Concept Drift Handling

Monitor:

Feature distributions

Prediction error trends

Retrain models if:

Error increases beyond threshold

Consumption patterns shift (policy changes, EV adoption)
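
One simple way to sketch the error-based retraining trigger is a rolling mean of recent absolute forecast errors compared against a threshold. The window size, threshold, and error values below are hypothetical.

```python
from collections import deque

def retrain_signals(abs_errors, window=4, threshold=20.0):
    """Yield True whenever the rolling mean of recent absolute errors exceeds the threshold."""
    recent = deque(maxlen=window)
    for err in abs_errors:
        recent.append(err)
        # Only signal once a full window of errors is available.
        yield len(recent) == window and sum(recent) / window > threshold

# Hypothetical absolute forecast errors (MW); errors grow as consumption patterns shift.
errors = [8, 12, 10, 9, 11, 25, 30, 35, 40]
signals = list(retrain_signals(errors))
print(signals.index(True))  # position of the first retraining signal
```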

4. Business & Operational Impact
Operational Benefits

🚨 Early fault detection

Transformer failures

Sensor malfunctions

⚡ Prevent blackouts

Proactive load balancing

🔧 Predictive maintenance

Reduce downtime

Business Benefits

💰 Cost optimization

Efficient energy generation planning

📉 Reduced penalties

Avoid overload fines

📊 Regulatory compliance

Accurate demand reporting

Decision Support

Real-time dashboards for grid operators

Automated alerts for anomalies

Data-driven capacity planning