# Anomaly Detection and Time Series

**1. What is Anomaly Detection? Explain its types (point, contextual, and
collective anomalies) with examples.**

- Anomaly detection is the process of identifying data points, or groups of points, that deviate significantly from the normal pattern of the data.

1. **Point Anomaly** – A single data point is unusual compared with the rest of the data.
   *Example: A sudden ₹5,00,000 transaction on a card whose transactions are usually around ₹5,000.*

2. **Contextual Anomaly** – A data point is abnormal only in a specific context (time, place, situation).
   *Example: A temperature of 30°C in winter, which would be normal in summer.*

3. **Collective Anomaly** – A group or sequence of points is abnormal together, even if each point looks normal on its own.
   *Example: A continuous drop in network traffic lasting 10 minutes.*
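
The difference between a point anomaly and a contextual anomaly can be sketched with a simple z-score check. This is only an illustration under assumed conditions: the temperature data is synthetic, the helper `zscores` and the threshold of 3 are hypothetical choices, and real detectors are usually more robust than a plain z-score.

```python
import numpy as np

# Hypothetical data: daily temperatures (°C) for one winter and one summer month
rng = np.random.default_rng(0)
winter = rng.normal(loc=5.0, scale=2.0, size=30)
summer = rng.normal(loc=30.0, scale=2.0, size=30)
winter[10] = 30.0   # contextual anomaly: normal for summer, unusual for winter
summer[20] = 80.0   # point anomaly: extreme in any context (e.g. a sensor glitch)

temps = np.concatenate([winter, summer])
season = np.array(["winter"] * 30 + ["summer"] * 30)

def zscores(x):
    """Standardize values against the mean and std of the given group."""
    return (x - x.mean()) / x.std()

# Global check: compare every value against the whole dataset
global_flags = np.abs(zscores(temps)) > 3

# Contextual check: compare each value only against its own season
context_flags = np.zeros(len(temps), dtype=bool)
for s in ("winter", "summer"):
    mask = season == s
    context_flags[mask] = np.abs(zscores(temps[mask])) > 3

print("Flagged globally:     ", np.flatnonzero(global_flags))
print("Flagged within season:", np.flatnonzero(context_flags))
```

The global check catches the extreme reading but misses the 30°C winter day, because 30°C is unremarkable across the whole year. The per-season check catches both.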





**2. Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of
their approach and suitable use cases.**

- **Isolation Forest (iForest):**
Isolation Forest detects anomalies by isolating data points using random splits. Points that are easier to isolate (require fewer splits) are considered anomalies. It is effective for high-dimensional and large datasets, and works well when anomalies are globally different from normal data.
*Use case:* Credit card fraud detection.

- **DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**
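DBSCAN identifies clusters based on point density. Points that do not belong to any dense cluster are treated as outliers (noise). It works well for datasets with clusters of arbitrary shapes, but it is sensitive to its parameters (the neighborhood radius `eps` and the minimum neighbor count `min_samples`) and struggles when cluster densities differ widely.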
*Use case:* Detecting unusual GPS locations.

- **Local Outlier Factor (LOF):**
LOF detects anomalies by comparing the local density of a point to the density of its neighbors. Points with significantly lower density than their neighbors are considered outliers. It is useful for detecting local anomalies in datasets with varying densities.
*Use case:* Identifying unusual behavior within a subgroup of customers.
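
As a rough side-by-side sketch, the three methods can be run on the same data with scikit-learn. The dataset is synthetic, and the parameter values (`contamination=0.02`, `n_neighbors=20`, `eps=1.5`, `min_samples=5`) are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: one tight cluster, one loose cluster, one far-away point
rng = np.random.default_rng(42)
tight = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
loose = rng.normal(loc=[10, 10], scale=1.5, size=(100, 2))
far_point = np.array([[30.0, -20.0]])
X = np.vstack([tight, loose, far_point])

# Isolation Forest: random splits; points isolated quickly get label -1
iso_labels = IsolationForest(contamination=0.02, random_state=42).fit_predict(X)

# LOF: compares each point's local density with that of its neighbors; -1 = outlier
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(X)

# DBSCAN: points not reachable from any dense region get cluster label -1 (noise)
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

for name, labels in [("Isolation Forest", iso_labels),
                     ("LOF", lof_labels),
                     ("DBSCAN", db_labels)]:
    print(f"{name}: {np.sum(labels == -1)} points flagged")
```

All three flag the far-away point, but they may disagree on borderline points at the edge of the loose cluster, which is where the choice of method and parameters matters most.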



**3. What are the key components of a Time Series? Explain each with one
example.**

1. **Trend (T)** – The long-term movement or direction of the data over time, showing whether values are generally increasing, decreasing, or constant.
   *Example: The gradual rise in global average temperatures over decades.*

2. **Seasonality (S)** – Regular, repeating patterns or fluctuations within a fixed period (e.g., daily, monthly, yearly).
   *Example: Increased ice cream sales every summer.*

3. **Cyclic Component (C)** – Long-term fluctuations caused by economic or business cycles, with no fixed length.
   *Example: Economic boom and recession cycles affecting stock market prices over several years.*

4. **Irregular / Random Component (I)** – Unpredictable, random variations or noise in the data.
   *Example: A sudden spike in flight cancellations due to a natural disaster.*
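
To make the four components concrete, the sketch below builds a synthetic monthly series by adding them together (an additive model). All numbers are hypothetical, and the cyclic term is given a fixed 50-month period purely for simplicity, whereas real cycles vary in length.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months = 120
t = np.arange(n_months)

trend = 100 + 0.5 * t                        # T: steady long-term rise
seasonal = 10 * np.sin(2 * np.pi * t / 12)   # S: pattern repeating every 12 months
cycle = 5 * np.sin(2 * np.pi * t / 50)       # C: slower swing (fixed period only for this sketch)
irregular = rng.normal(0, 2, n_months)       # I: random noise

# Additive model: the observed series is the sum of its components
series = trend + seasonal + cycle + irregular

# A 12-month moving average cancels the 12-month seasonal term,
# leaving mostly trend plus cycle (with the noise averaged down)
window = 12
smoothed = np.convolve(series, np.ones(window) / window, mode="valid")

print("First smoothed value:", round(float(smoothed[0]), 2))
print("Last smoothed value: ", round(float(smoothed[-1]), 2))
```

Averaging over exactly one seasonal period is also the basic idea behind how classical decomposition estimates the trend.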

**4. Define stationarity in time series. How can you test and transform a
non-stationary series into a stationary one?**

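- A **stationary time series** is one whose **mean, variance, and autocorrelation structure remain constant over time**, so it shows no trend or seasonal pattern.
- Stationarity is important because many forecasting models, like ARIMA, assume a stable underlying process. To check stationarity, we can use **visual inspection** (for example, plotting the rolling mean and variance) or statistical tests such as **ADF** (null hypothesis: the series has a unit root, so a small p-value suggests stationarity) and **KPSS** (null hypothesis: the series is stationary, so a small p-value suggests non-stationarity).
- If a series is non-stationary, it can be transformed: **differencing** removes trends, **seasonal differencing or seasonal adjustment** removes seasonality, **log transformations** stabilize a growing variance, and **detrending** subtracts a fitted trend. The transformed series is then suitable for modeling and forecasting.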


**5. Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in
terms of structure and application.**


- **AR, MA, ARIMA, SARIMA, and SARIMAX** are time series forecasting models that build on one another.
- The **AR (AutoRegressive) model**, AR(p), predicts the current value from the previous p observations and is suitable for stationary series.
- The **MA (Moving Average) model**, MA(q), predicts the current value from the previous q forecast errors and is also meant for stationary series.
- **ARIMA**(p, d, q) combines AR and MA terms with d rounds of differencing to handle non-stationary data with trends.
- **SARIMA**(p, d, q)(P, D, Q, s) extends ARIMA with seasonal AR, differencing, and MA terms at seasonal period s, making it suitable for series with both trend and seasonality.
- **SARIMAX** further extends SARIMA by incorporating exogenous variables, allowing the model to account for external factors affecting the time series.
- The model is chosen based on the presence of trend, seasonality, and external influences in the data.
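
The structural difference between AR and MA terms, and the role of differencing in ARIMA, can be seen by simulating them directly with NumPy. This is a minimal sketch: the coefficients `phi` and `theta` and the helper `autocorr` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
eps = rng.normal(0, 1, n)    # white-noise shocks
phi, theta = 0.7, 0.7        # hypothetical AR and MA coefficients

# AR(1): y_t = phi * y_{t-1} + eps_t  (depends on its own past value)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = phi * ar[t - 1] + eps[t]

# MA(1): y_t = eps_t + theta * eps_{t-1}  (depends on the past shock)
ma = eps.copy()
ma[1:] += theta * eps[:-1]

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# AR autocorrelation decays gradually; MA(1) autocorrelation cuts off after lag 1
for lag in (1, 2, 3):
    print(f"lag {lag}: AR={autocorr(ar, lag):.2f}  MA={autocorr(ma, lag):.2f}")

# The "I" in ARIMA: integrating the AR(1) series gives a non-stationary
# ARIMA(1,1,0)-style series, and one round of differencing recovers the AR(1) part
integrated = np.cumsum(ar)
recovered = np.diff(integrated)
print("Differencing recovers AR series:", np.allclose(recovered, ar[1:]))
```

This decay-versus-cutoff behavior is why autocorrelation plots are commonly used to pick the orders p and q.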



In [None]:
# 6. Load a time series dataset (e.g., AirPassengers), plot the original series, and decompose it into trend, seasonality, and residual components.

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.datasets import get_rdataset

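# Load AirPassengers dataset (downloaded from the Rdatasets repository; needs internet access)
data = get_rdataset('AirPassengers').data
# The 'time' column holds fractional years (1949.0, 1949.083, ...), which cannot be
# parsed with '%Y-%m', so build the monthly index directly (series starts Jan 1949)
data.index = pd.date_range(start='1949-01', periods=len(data), freq='MS')
data.index.name = 'Month'
ts = data['value']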

# Plot original time series
plt.figure(figsize=(12, 4))
plt.plot(ts, color='blue')
plt.title('Original AirPassengers Time Series')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.show()

# Decompose the time series
decomposition = seasonal_decompose(ts, model='multiplicative', period=12)

# Plot decomposed components
plt.figure(figsize=(12, 10))

plt.subplot(4,1,1)
plt.plot(ts, label='Original')
plt.legend(loc='best')

plt.subplot(4,1,2)
plt.plot(decomposition.trend, label='Trend', color='orange')
plt.legend(loc='best')

plt.subplot(4,1,3)
plt.plot(decomposition.seasonal, label='Seasonality', color='green')
plt.legend(loc='best')

plt.subplot(4,1,4)
plt.plot(decomposition.resid, label='Residuals', color='red')
plt.legend(loc='best')

plt.tight_layout()
plt.show()


In [None]:
# 7. Apply Isolation Forest on a numerical dataset (e.g., NYC Taxi Fare) to detect anomalies. Visualize the anomalies on a 2D scatter plot.

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Load a sample NYC Taxi Fare dataset
# Here, we generate a small synthetic example for demonstration
# Replace with actual CSV file if available
data = pd.DataFrame({
    'passenger_count': [1, 2, 1, 3, 1, 2, 100, 1, 2, 3, 2, 1, 5, 1, 3, 2, 0, 2],
    'fare_amount': [10, 15, 12, 20, 8, 14, 500, 11, 13, 18, 15, 9, 25, 12, 19, 16, 2, 14]
})

# Fit Isolation Forest on the feature columns only
features = ['passenger_count', 'fare_amount']
iso_forest = IsolationForest(contamination=0.1, random_state=42)
data['anomaly'] = iso_forest.fit_predict(data[features])

# -1 for anomalies, 1 for normal points
anomalies = data[data['anomaly'] == -1]
normal = data[data['anomaly'] == 1]

# Visualize anomalies on a 2D scatter plot
plt.figure(figsize=(10,6))
plt.scatter(normal['passenger_count'], normal['fare_amount'], c='blue', label='Normal')
plt.scatter(anomalies['passenger_count'], anomalies['fare_amount'], c='red', label='Anomaly')
plt.xlabel('Passenger Count')
plt.ylabel('Fare Amount')
plt.title('Anomaly Detection using Isolation Forest')
plt.legend()
plt.show()


In [None]:
# 8. Train a SARIMA model on the monthly airline passengers dataset. Forecast the next 12 months and visualize the results.

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.datasets import get_rdataset

# Load AirPassengers dataset (statsmodels does not bundle it as a module;
# get_rdataset downloads it from the Rdatasets repository)
ts = get_rdataset('AirPassengers').data['value']
ts.index = pd.date_range(start='1949-01', periods=len(ts), freq='MS')  # month-start index

# Split data (optional, here we use full dataset)
train = ts

# Define SARIMA model
# SARIMA(p,d,q)(P,D,Q,s)
# Common starting values: p=1, d=1, q=1, P=1, D=1, Q=1, s=12 (monthly data)
model = SARIMAX(train, 
                order=(1,1,1), 
                seasonal_order=(1,1,1,12), 
                enforce_stationarity=False, 
                enforce_invertibility=False)

# Fit the model
results = model.fit(disp=False)

# Forecast next 12 months
forecast = results.get_forecast(steps=12)
forecast_series = forecast.predicted_mean  # already indexed by the 12 future months
forecast_index = forecast_series.index
forecast_ci = forecast.conf_int()

# Plot original series and forecast
plt.figure(figsize=(12,6))
plt.plot(ts, label='Original', color='blue')
plt.plot(forecast_series, label='Forecast', color='red')
plt.fill_between(forecast_index, 
                 forecast_ci.iloc[:,0], 
                 forecast_ci.iloc[:,1], color='pink', alpha=0.3)
plt.title('SARIMA Forecast for AirPassengers')
plt.xlabel('Year')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()


In [None]:
# 9. Apply Local Outlier Factor (LOF) on any numerical dataset to detect anomalies and visualize them using matplotlib

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# Sample numerical dataset (can replace with real dataset)
data = pd.DataFrame({
    'Feature1': [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, 15, 12, 14, 13, 12, 11],
    'Feature2': [20, 22, 21, 23, 22, 24, 200, 21, 23, 22, 25, 22, 24, 23, 22, 21]
})

# Apply LOF on the feature columns only
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
data['anomaly'] = lof.fit_predict(data[['Feature1', 'Feature2']])  # -1 = anomaly, 1 = normal

# Separate normal points and anomalies
normal = data[data['anomaly'] == 1]
anomalies = data[data['anomaly'] == -1]

# Visualize anomalies
plt.figure(figsize=(10,6))
plt.scatter(normal['Feature1'], normal['Feature2'], c='blue', label='Normal')
plt.scatter(anomalies['Feature1'], anomalies['Feature2'], c='red', label='Anomaly')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('Anomaly Detection using LOF')
plt.legend()
plt.show()


In [None]:
# 10. You are working as a data scientist for a power grid monitoring company. Your goal is to forecast energy demand and also detect abnormal spikes or drops in real-time consumption data collected every 15 minutes. The dataset includes features like timestamp, region, weather conditions, and energy usage. Explain your real-time data science workflow:
# ● How would you detect anomalies in this streaming data (Isolation Forest / LOF / DBSCAN)?
# ● Which time series model would you use for short-term forecasting (ARIMA / SARIMA / SARIMAX)?
# ● How would you validate and monitor the performance over time?
# ● How would this solution help business decisions or operations?

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ---------------------------
# 1. Generate synthetic dataset
# ---------------------------
np.random.seed(42)
date_rng = pd.date_range(start='2025-01-01', end='2025-01-07', freq='15min')  # 15-min intervals ('15T' alias is deprecated in recent pandas)
n = len(date_rng)
energy_usage = 50 + 10*np.sin(np.arange(n)*2*np.pi/96) + np.random.normal(0, 3, n)  # daily seasonality
weather_temp = 20 + 5*np.sin(np.arange(n)*2*np.pi/96) + np.random.normal(0,1,n)      # synthetic weather
region = np.random.choice(['North','South'], size=n)

data = pd.DataFrame({'timestamp': date_rng, 'region': region, 
                     'temperature': weather_temp, 'energy_usage': energy_usage})
data.set_index('timestamp', inplace=True)

# Introduce some anomalies
data.loc[data.sample(5).index, 'energy_usage'] *= 2

# ---------------------------
# 2. Anomaly Detection with Isolation Forest
# ---------------------------
features = ['energy_usage','temperature']
iso_forest = IsolationForest(contamination=0.01, random_state=42)
data['anomaly'] = iso_forest.fit_predict(data[features])  # -1 = anomaly, 1 = normal

# Visualize anomalies
plt.figure(figsize=(12,6))
normal = data[data['anomaly']==1]
anomaly = data[data['anomaly']==-1]
plt.plot(normal.index, normal['energy_usage'], label='Normal')
plt.scatter(anomaly.index, anomaly['energy_usage'], color='red', label='Anomaly', s=50)
plt.xlabel('Timestamp')
plt.ylabel('Energy Usage')
plt.title('Anomaly Detection with Isolation Forest')
plt.legend()
plt.show()

# ---------------------------
# 3. Short-term Forecasting with SARIMAX
# ---------------------------
# SARIMAX uses energy_usage as endogenous and temperature as exogenous
train = data['energy_usage'][:-96]  # last day for testing
exog_train = data['temperature'][:-96]
test = data['energy_usage'][-96:]   # last day
exog_test = data['temperature'][-96:]

# Define SARIMAX model (daily seasonality, 96 periods per day)
# Note: a seasonal period as long as 96 makes fitting slow and memory-hungry
model = SARIMAX(train, exog=exog_train, order=(1,1,1), seasonal_order=(1,1,1,96),
                enforce_stationarity=False, enforce_invertibility=False)
results = model.fit(disp=False)

# Forecast next 96 periods (1 day ahead)
forecast = results.get_forecast(steps=96, exog=exog_test)
forecast_values = forecast.predicted_mean
forecast_ci = forecast.conf_int()

# Plot forecast vs actual
plt.figure(figsize=(12,6))
plt.plot(data.index, data['energy_usage'], label='Actual')
plt.plot(test.index, forecast_values, color='red', label='Forecast')
plt.fill_between(test.index, forecast_ci.iloc[:,0], forecast_ci.iloc[:,1], color='pink', alpha=0.3)
plt.xlabel('Timestamp')
plt.ylabel('Energy Usage')
plt.title('SARIMAX Short-Term Forecasting')
plt.legend()
plt.show()
