### Irregular Time Series for Predictive Modeling — Part I

> Transforming, visualizing, and decomposing irregular time series. [Hands-On](https://medium.com/data-science-collective/hands-on-irregular-time-series-pt-i-2b8730bff40b)

This section introduces irregular time series, explores the dataset, and applies initial data transformations — like the log transformation to address data skewness.

This project explores an intriguing AI application in the real estate market: **predicting property sales using irregular time series modeling**.

#### Generating the Fictitious Dataset

In [None]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

# Set seed for reproducibility
np.random.seed(42)
random.seed(42)

# Generate random dates from 2011 to 2023
def generate_random_dates(start_year, end_year, n_samples):
    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year, 12, 31)
    date_range = (end_date - start_date).days
    return [start_date + timedelta(days=random.randint(0, date_range)) for _ in range(n_samples)]

# Define dataset parameters
n_samples = 29580
sale_dates = generate_random_dates(2011, 2023, n_samples)
prices = np.random.randint(50000, 1000000, size=n_samples)
property_types = np.random.choice(['house', 'apartment'], size=n_samples)
num_rooms = np.random.randint(1, 6, size=n_samples)

# Create fictitious dataset
# This is for demonstration purposes and to support the project execution.
dataset = pd.DataFrame({
    'sale_date': sale_dates,
    'price': prices,
    'property_type': property_types,
    'num_rooms': num_rooms
})

# Save to CSV
dataset.to_csv('dataset.csv', index=False)

print(dataset.head())

In [None]:
# 1. Install the Plotly and LightGBM packages quietly
!pip install --no-cache-dir --disable-pip-version-check -q plotly lightgbm

In [None]:
# 2. Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 3. Import sklearn modules for preprocessing and metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error

# 4. Import statsmodels for statistical analysis
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

# 5. Import LightGBM for machine learning models
import lightgbm
from lightgbm import LGBMRegressor

# 6. Import sklearn's linear model for regression
from sklearn.linear_model import LinearRegression

In [None]:
# 7. Load the dataset
df = pd.read_csv('dataset.csv')

In [None]:
# 8. Check dataset shape
df.shape

In [None]:
# 9. Display first rows of data
df.head()

In [None]:
# 10. Display the last rows of the dataset
df.tail()

In [None]:
# 11. Check data types of each column
df.dtypes

In [None]:
# 12. Display unique values in the property_type column
df['property_type'].unique()

In [None]:
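# 13. Set sale_date as the index and sort chronologically
# (the generated dates are random, so the rows are not in time order)
df.index = pd.to_datetime(df['sale_date'])
df = df.sort_index()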

In [None]:
# 14. Drop the original sale_date column
df = df.drop(columns=['sale_date'])

In [None]:
# 15. Display the first rows of the updated dataset
df.head()

In [None]:
pd.options.display.float_format = '{:.8f}'.format

In [None]:
# 16. ADF test p-value for the num_rooms column
print(f"P-value for Number of Rooms Column: {adfuller(df['num_rooms'])[1]}")

In [None]:
# 17. P-value for the price column
print(f"P-value for Price Column: {adfuller(df['price'])[1]}")

A p-value below the usual 0.05 threshold rejects the ADF null hypothesis of a unit root, so the `price` series is also stationary.

Both variables, `num_rooms` and `price`, exhibit the behavior needed for applying statistical modeling strategies.

They pass the stationarity test even though they are irregular series.

In [None]:
# 18. Display dataframe columns
print(df.columns)

In [None]:
# 19. Visualizing the price time series with all records
import plotly.express as px

fig = px.line(df, y='price', labels={'index': 'Time', 'price': 'Price'},
              title='Price Time Series',
              template='simple_white')
fig.update_traces(line_color='green')
fig.show()

In [None]:
# 20. Visualizing the price time series with the first 300 records
fig = px.line(df.iloc[:300], y='price', labels={'index': 'Time', 'price': 'Price'},
              title='Price Time Series (First 300 Records)',
              template='simple_white')
fig.update_traces(line_color='red')
fig.show()

In [None]:
# 21. Scatter plot between price and number of rooms
fig = px.scatter(df, x='num_rooms', y='price',
                 labels={'num_rooms': 'Number of Rooms', 'price': 'Price'},
                 title='Price vs. Number of Rooms',
                 template='simple_white')
fig.update_yaxes(tickprefix="$", tickformat=",.0f")
fig.show()

By reducing the view to 300 records, the irregularity becomes more evident.

- The green graph (full data) suggests a smoother trend.
- The red graph (first 300 records) clearly shows irregularities.

There's no clear pattern, trend, or seasonality in the red graph. The line fluctuates with highs, lows, and breaks, which is expected behavior in the real estate market.

Unlike regular consumer products, real estate sales are irregular. An agency might go weeks without sales, then close multiple deals in a short period.

This detailed, segmented analysis helps identify patterns and potential issues more effectively than just viewing the complete dataset.
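
The irregularity described here can also be quantified by measuring the time between consecutive sales. The sketch below uses a handful of hypothetical sale dates and prices (not the project data) and computes the gap in days between each record and the previous one. A regular series would show a single gap length; an irregular one shows many.

```python
import pandas as pd

# Hypothetical sale timestamps: quiet stretches followed by bursts of deals
sale_dates = pd.to_datetime([
    "2023-01-03", "2023-01-04", "2023-02-20",
    "2023-02-21", "2023-02-21", "2023-04-10",
])
sales = pd.Series(
    [310_000, 295_000, 480_000, 150_000, 620_000, 275_000],
    index=sale_dates, name="price",
).sort_index()

# Days elapsed between consecutive sales
gaps = sales.index.to_series().diff().dt.days.dropna()

print(gaps.describe())
print("Distinct gap lengths:", sorted(gaps.unique()))
```

Applied to the notebook's `df` (after sorting by date), the same `diff()` call on the index would give the gap distribution for the full dataset.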

In [None]:
# 22. Boxplot of the price variable
fig = px.box(df, y='price',
             labels={'price': 'Price'},
             title='Price Distribution',
             template='simple_white')
fig.update_yaxes(tickprefix="$", tickformat=".2s")
fig.show()

The boxplot displays the median, quartiles, maximum/minimum values, and outliers.

The distribution is skewed: the box looks flattened rather than showing the more expanded shape one would expect.

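The median price is around 550 thousand.
However, there are several outliers reaching up to 8 million. (The fictitious dataset generated above caps prices below 1,000,000, so a rerun will not show outliers of that size.)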

In [None]:
# 23. Histogram of the price variable
fig = px.histogram(df, x='price',
                   labels={'price': 'Price'},
                   title='Price Distribution Histogram',
                   template='simple_white')
fig.update_xaxes(tickprefix="$", tickformat=".2s")
fig.show()

In [None]:
# 24. Apply log transformation to the price variable
import numpy as np

df['log_price'] = np.log(df['price'])

# 25. Histogram of the log-transformed price variable
fig = px.histogram(df, x='log_price',
                   labels={'log_price': 'Log of Price'},
                   title='Histogram of Log-Transformed Price',
                   template='simple_white')
fig.update_xaxes(title_text='Log of Price')
fig.update_yaxes(title_text='Count')
fig.show()
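
A quick way to check whether the log transformation actually reduces skewness is to compare the skewness coefficient before and after. The sketch below does this on a hypothetical right-skewed sample drawn from a log-normal distribution (an assumption for illustration, not the project data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed prices (log-normal), not the project data
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=5_000))

skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print(f"Skewness before log: {skew_raw:.2f}")
print(f"Skewness after log:  {skew_log:.2f}")
```

A value near zero after the transformation suggests a roughly symmetric distribution. Running the same two `.skew()` calls on `df['price']` and `df['log_price']` would show whether the transformation helps for the data at hand.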

In [None]:
# 26. Line plot of the log-transformed price variable
fig = px.line(df, y='log_price',
              labels={'index': 'Time', 'log_price': 'Log of Price'},
              title='Log-Transformed Price Over Time',
              template='simple_white')
fig.update_traces(line_color='blue')
fig.update_xaxes(title_text='Time')
fig.update_yaxes(title_text='Log of Price')
fig.show()

In [None]:
# 28. Apply encoding to the property type variable
from sklearn.preprocessing import LabelEncoder

df['property_type'] = LabelEncoder().fit_transform(df['property_type'])

In [None]:
# 29. Resample the series to monthly and calculate the mean
df = df.resample('ME').mean()

In [None]:
# 30. Display the first 10 rows of the dataset
print(df.head(10))
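
Monthly resampling turns the irregular records into an evenly spaced series, but months with no sales become `NaN`, and months with few sales yield noisy averages. The sketch below, on hypothetical data, reports the monthly mean alongside the number of records behind it. The `try`/`except` fallback reflects that the `'ME'` alias is only available in recent pandas versions.

```python
import pandas as pd

# Hypothetical irregular sales: none at all in February
sales = pd.DataFrame(
    {"price": [300_000, 500_000, 420_000]},
    index=pd.to_datetime(["2023-01-05", "2023-01-25", "2023-03-14"]),
)

# 'ME' (month end) needs pandas >= 2.2; older versions use 'M'
try:
    grouped = sales["price"].resample("ME")
except ValueError:
    grouped = sales["price"].resample("M")

# Mean price per month plus the number of sales it is based on
monthly = grouped.agg(["mean", "count"])
print(monthly)
```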

In [None]:
# 31. Remove the property type variable as it cannot be grouped adequately by month
df.drop('property_type', axis=1, inplace=True)
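
The column is dropped here because the monthly mean of a label-encoded category has no direct meaning as a category. With two classes encoded as 0 and 1, however, that mean is simply the share of the class encoded as 1. As a hedged alternative to dropping the column, the sketch below computes the monthly share of each property type on hypothetical data; it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical sales with a categorical property type
sales = pd.DataFrame(
    {"property_type": ["house", "apartment", "house", "house"]},
    index=pd.to_datetime(
        ["2023-01-05", "2023-01-25", "2023-02-10", "2023-02-18"]
    ),
)

# One indicator column per category, then average within each calendar month
dummies = pd.get_dummies(sales["property_type"], dtype=float)
monthly_share = dummies.groupby(dummies.index.to_period("M")).mean()
print(monthly_share)
```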

In [None]:
# 32. Round the values of the num_rooms variable
df['num_rooms'] = df['num_rooms'].round()

In [None]:
# 33. Remove rows with missing values
df = df.dropna()

# 34. Display the first rows of the dataset
print(df.head())

# 35. Display the last rows of the dataset
print(df.tail())

In [None]:
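# 36. Decompose the monthly price series into trend, seasonality, and residuals
# (period=12 is set explicitly in case dropna() removed the index frequency)
result = seasonal_decompose(df['price'], model='additive', period=12)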

# 37. Plot the decomposition results
result.plot()

### Irregular Time Series for Predictive Modeling — Part II

> Feature Engineering, Model Training, and Forecasting Strategies [Hands-On](https://medium.com/data-science-collective/hands-on-irregular-time-series-for-predictive-modeling-part-ii-e5070e721bd6)

- **Feature Engineering:** Creating time-based features (like year and month) to enrich the dataset.

- **Data Preparation:** Structuring the data for effective model training.

- **Model Development:** Building and comparing machine learning models — starting with simple benchmarks and progressing to more refined approaches.

- **Forecasting:** Using the trained models to make future price predictions.

The goal is to develop accurate and interpretable models for real estate price forecasting, ensuring that complexity is added only when necessary.

In [None]:
# 43. Extract year and month for feature engineering
df['year'] = df.index.year
df['month'] = df.index.month

# 44. Display the first records to validate the new features
print(df.head())

In [None]:
# 45. Create the index for 70/30 split
index = int(len(df) * .7)

# 46. Display the dataset length and split index
print(len(df), index)

In [None]:
# 47. Training data (maintaining sequence)
train_data = df.iloc[:index]

# 48. Display the last record of the training data
train_data.tail(1)

In [None]:
# 49. Testing data (maintaining sequence)
test_data = df.iloc[index:]

# 50. Display the first record of the testing data
test_data.head(1)

In [None]:
# 51. Display the last record of the testing data
test_data.tail(1)
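
The split above keeps the records in time order, so the model is trained on the past and evaluated on the future. A minimal sketch of the same idea as a reusable function follows; the name `chronological_split` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_fraction=0.7):
    """Split a time-indexed frame into train/test without shuffling."""
    frame = frame.sort_index()
    cut = int(len(frame) * train_fraction)
    return frame.iloc[:cut], frame.iloc[cut:]

# Hypothetical monthly data
idx = pd.date_range("2022-01-01", periods=10, freq="MS")
monthly = pd.DataFrame({"log_price": np.linspace(12.0, 13.0, 10)}, index=idx)

train, test = chronological_split(monthly)
print(len(train), len(test))
# Every training timestamp precedes every test timestamp
print(train.index.max() < test.index.min())
```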

In [None]:
# 52. Input features for training: every column except the log-transformed target
X_train = train_data.drop(columns=['log_price'])

# 53. Display the first records of the training input data
X_train.head()

In [None]:
# 54. The target is the log-transformed price ('log_price')
y_train = train_data[['log_price']]

# 55. Display the first records of the target training data
y_train.head()

#### Building the First Version of the Model

In [None]:
# 56. Create the model
model_v1 = LGBMRegressor()

# 57. Train the model using the log-transformed target
model_v1.fit(X_train, y_train['log_price'])

In [None]:
# 58. Prepare test data for input and output
X_test = test_data.drop(columns=['log_price'])
y_test = test_data[['log_price']]
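
One thing worth checking at this step: `X_test` is built by dropping only `log_price`, so the raw `price` column stays among the inputs, and the target is closely tied to it. The sketch below, on hypothetical data, shows one way to hold out both columns so that the model only sees features that would be known at forecast time. The list `target_related` is an assumption for illustration, not the setup used in this walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame with the same column names as the walkthrough
frame = pd.DataFrame({
    "price": [400_000.0, 420_000.0, 390_000.0],
    "num_rooms": [3.0, 4.0, 3.0],
    "year": [2023, 2023, 2023],
    "month": [1, 2, 3],
})
frame["log_price"] = np.log(frame["price"])

# Assumption: exclude both the target and the column it is derived from
target_related = ["price", "log_price"]
X = frame.drop(columns=target_related)
y = frame["log_price"]

print(list(X.columns))
```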

In [None]:
# 59. Generate predictions with the test data
predictions_v1 = model_v1.predict(X_test)

In [None]:
# 60. Apply the inverse log transformation to the predictions
predictions_v1 = np.exp(predictions_v1)

In [None]:
# 61. Apply the inverse log transformation to the real test data
y_test = np.exp(y_test)

In [None]:
from sklearn.metrics import mean_absolute_error

# 62. Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, predictions_v1)
print(f'Mean Absolute Error: {mae:.2f}')
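
Because the model is trained on the log of the price, the error can be reported on either scale. The sketch below uses hypothetical values to contrast the two: the log-scale error behaves roughly like a relative error, while the back-transformed error is expressed in currency units, which is what the cell above reports.

```python
import numpy as np

# Hypothetical actual prices and predictions made on the log scale
actual_price = np.array([400_000.0, 520_000.0, 610_000.0])
predicted_log = np.log(np.array([380_000.0, 540_000.0, 600_000.0]))

# Error measured in log units (roughly a relative error)
mae_log = np.mean(np.abs(np.log(actual_price) - predicted_log))

# Back-transform first, then measure the error in currency units
predicted_price = np.exp(predicted_log)
mae_price = np.mean(np.abs(actual_price - predicted_price))

print(f"MAE on log scale:   {mae_log:.4f}")
print(f"MAE on price scale: {mae_price:,.2f}")
```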

In [None]:
import plotly.graph_objects as go

# 63. Advanced Plot with Plotly
fig = go.Figure()
# Add actual values trace
fig.add_trace(go.Scatter(
    y=y_test.values.flatten(),
    mode='lines',
    name='Actual',
    line=dict(color='firebrick', width=2),
))
# Add predicted values trace
fig.add_trace(go.Scatter(
    y=predictions_v1,
    mode='lines',
    name='Predicted',
    line=dict(color='royalblue', width=2, dash='dash'),
))
# Customize layout for clarity and style
fig.update_layout(
    title='Actual vs Predicted Prices',
    xaxis_title='Time Index',
    yaxis_title='Price',
    template='simple_white',
    width=900,
    height=500,
    font=dict(size=12),
    legend=dict(title='Legend', orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1),
)
fig.show()

#### Building the Second Version of the Model

In [None]:
# 64. Create the model
model_v2 = LinearRegression()

In [None]:
# 65. Train the model using the training data
model_v2.fit(X_train, y_train['log_price'])

In [None]:
# 66. Prepare input and output data for testing
X_test = test_data.drop(columns=['log_price'])
y_test = test_data[['log_price']]

In [None]:
# 67. Generate predictions using the test data
predictions_v2 = model_v2.predict(X_test)

In [None]:
# 68. Apply inverse log transformation to the predictions
predictions_v2 = np.exp(predictions_v2)

# 69. Apply inverse log transformation to the real test data
y_test = np.exp(y_test)

In [None]:
# 70. Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, predictions_v2)
print(f'Mean Absolute Error: {mae:.2f}')

In [None]:
import plotly.graph_objects as go
import joblib

# 71. Advanced Plot with Plotly
fig = go.Figure()
# Add actual values trace
fig.add_trace(go.Scatter(
    y=y_test.values.flatten(),
    mode='lines',
    name='Actual',
    line=dict(color='firebrick', width=2),
))
# Add predicted values trace
fig.add_trace(go.Scatter(
    y=predictions_v2,
    mode='lines',
    name='Predicted',
    line=dict(color='royalblue', width=2, dash='dash'),
))
# Customize layout for clarity and style
fig.update_layout(
    title='Actual vs Predicted Prices',
    xaxis_title='Time Index',
    yaxis_title='Price',
    template='simple_white',
    width=900,
    height=500,
    font=dict(size=12),
    legend=dict(title='Legend', orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1),
)
fig.show()

In [None]:
# 72. Import necessary libraries
import os
import joblib

# 73. Create the directory if it doesn't exist
os.makedirs('model', exist_ok=True)

# 74. Define the filename for saving the model
filename = 'model/model_v2.sav'

# 75. Save the model to disk
joblib.dump(model_v2, filename)
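
A saved model is only useful if it reloads with the same behavior. The sketch below is a minimal round-trip check on a tiny hypothetical model: it saves with `joblib.dump`, reloads with `joblib.load`, and compares predictions. The temporary directory and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data for a tiny model
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Save to a temporary directory, reload, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

Since joblib files are pickle-based, it is safest to load only files from trusted sources and with library versions compatible with those used for saving.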

#### Forecast

In [None]:
# 76. Display the last record of the test input data
X_test.tail(1)

In [None]:
# 77. Display the last record of the test target data (original price scale)
y_test.tail(1)

In [None]:
# 78. Prepare new data for the forecast
new_data = {
    'num_rooms': [4.0],
    'year': [2023],
    'month': [8]
}

In [None]:
# 79. Create the date index for the new record
date_index = pd.to_datetime('2023-08-31')

In [None]:
# 80. Create DataFrame for the new forecast input
input_forecast_data = pd.DataFrame({
    'price': [0],  # Dummy column to match the training data
    'num_rooms': [4.0],
    'year': [2023],
    'month': [8]
}, index=[date_index])

# 81. Display the prepared forecast
input_forecast_data

In [None]:
# 82. Load the saved model from disk
import joblib
model_v2 = joblib.load('model/model_v2.sav')

In [None]:
# 83. Generate the forecast using the model
forecast = model_v2.predict(input_forecast_data)

In [None]:
# 84. Display the forecast (in log scale)
forecast

In [None]:
# 85. Apply inverse log transformation to get the original price
forecast = np.exp(forecast)

# 86. Display the transformed forecast
forecast