### Other Algorithms for Analyzing Population Progression over Time

In [1]:
# Libraries: pandas for DataFrames, NumPy for arrays
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

#evaluating the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#modeling Algorithm
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# Plotting
import matplotlib.pyplot as plt

# Time series modeling
from statsmodels.tsa.arima.model import ARIMA

# Interactive DataFrame viewer
from pandasgui import show

In [2]:
#Loading from the previously processed dataframe
population_data_origin  = pd.read_csv('datasets/population_merged_reduced.csv',  sep=",",low_memory=False)

# Display the DataFrame information including data types
population_data_origin.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17538 entries, 0 to 17537
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       17538 non-null  int64  
 1   LocID            17538 non-null  int64  
 2   Time             17538 non-null  int64  
 3   TPopulation1Jan  17538 non-null  float64
 4   PopDensity       17538 non-null  float64
 5   PopSexRatio      17538 non-null  float64
 6   MedianAgePop     17538 non-null  float64
 7   PopGrowthRate    17538 non-null  float64
 8   DoublingTime     17538 non-null  float64
 9   MAC              17538 non-null  float64
 10  SRB              17538 non-null  float64
 11  CDR              17538 non-null  float64
 12  NetMigrations    17538 non-null  float64
 13  Location         17538 non-null  object 
dtypes: float64(10), int64(3), object(1)
memory usage: 1.9+ MB


In [3]:
# Drop the first column ('Unnamed: 0'), a leftover row index from the saved CSV
population_data_origin = population_data_origin.iloc[:, 1:]

#show the dataframe
population_data_origin.head()

,LocID,Time,TPopulation1Jan,PopDensity,PopSexRatio,MedianAgePop,PopGrowthRate,DoublingTime,MAC,SRB,CDR,NetMigrations,Location
0,108,1950,2229.322,86.8637,91.9472,18.3147,2.2,31.5067,30.995,102.5,23.546,-13.343,Burundi
1,108,1951,2278.903,88.7571,92.1448,18.0842,2.114,32.7884,30.996,102.5,23.879,-13.217,Burundi
2,108,1952,2327.593,90.6179,92.3191,17.8744,2.036,34.0446,31.026,102.5,23.815,-13.715,Burundi
3,108,1953,2375.478,92.4508,92.488,17.6693,1.969,35.203,31.03,102.5,23.604,-14.962,Burundi
4,108,1954,2422.721,94.2874,92.6503,17.4706,1.965,35.2747,31.036,102.5,23.347,-14.599,Burundi
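
The `Unnamed: 0` column typically appears when a DataFrame is saved with its row index and then read back with default settings. A minimal sketch of how it arises and how `index_col=0` avoids it, using a tiny in-memory CSV (the CSV text here is a made-up stand-in for the real file):

```python
import io
import pandas as pd

# A tiny CSV whose first header cell is empty, as produced when the row index is saved
csv_text = ",LocID,Time\n0,108,1950\n1,108,1951\n"

# Default read: the saved index comes back as a column named "Unnamed: 0"
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either approach (reading with `index_col=0` or slicing the column away with `iloc`) yields the same set of data columns.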


In [4]:
# Drop the Location name: it is a string column, and LocID already identifies each location
population_data =  population_data_origin.drop("Location", axis='columns')

## Splitting the data into train and test sets (rows are keyed by the composite index LocID, Time)

In [5]:
# Separate the independent variables (attributes) and the dependent variable
X = population_data.drop('TPopulation1Jan', axis=1)  # The dependent variable dropped
y = population_data['TPopulation1Jan']  # y for the dependent variable


# Each (LocID, Time) pair identifies exactly one row, so the pair cannot be used
# as a stratification label: every class would contain a single member.
# A mask built by testing each key against the full set of keys is True for
# every row, so passing it to `stratify` has no effect on the split.
# A plain random 80/20 split is therefore used.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
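
Because every (LocID, Time) pair is unique, it cannot act as a stratification label. For time-dependent data, one common alternative is a chronological split, in which each location contributes its earlier years to training and its later years to testing. The sketch below assumes a hypothetical cutoff year and a small synthetic frame with the same key and target column names; the location IDs and values are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for population_data: 2 hypothetical locations x 10 years
demo = pd.DataFrame({
    "LocID": [1] * 10 + [2] * 10,
    "Time": list(range(1950, 1960)) * 2,
    "TPopulation1Jan": [float(v) for v in range(20)],
})

cutoff_year = 1957  # hypothetical cutoff: rows up to this year go to training

train_mask = demo["Time"] <= cutoff_year
X_train_demo = demo.loc[train_mask].drop(columns="TPopulation1Jan")
y_train_demo = demo.loc[train_mask, "TPopulation1Jan"]
X_test_demo = demo.loc[~train_mask].drop(columns="TPopulation1Jan")
y_test_demo = demo.loc[~train_mask, "TPopulation1Jan"]

# Training rows all precede test rows in time, for every location
print(len(X_train_demo), len(X_test_demo))
```

This keeps future observations out of the training set, which a random split does not guarantee.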

### Long Short-Term Memory (LSTM) Networks
 
LSTM is a type of recurrent neural network (RNN) that can effectively model and predict sequences, including time series data. LSTM networks can capture long-term dependencies and are commonly used for time series forecasting tasks.
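
LSTM layers consume input shaped (samples, timesteps, features). One common way to obtain such input from a yearly series is a sliding window, where each sample holds the previous `n_steps` values and the target is the value that follows. A minimal sketch, assuming a hypothetical helper `make_windows` and illustrative values for a single location:

```python
import numpy as np

def make_windows(series, n_steps):
    """Turn a 1D series into (samples, n_steps, 1) inputs and next-value targets."""
    series = np.asarray(series, dtype="float32")
    X, y = [], []
    for start in range(len(series) - n_steps):
        X.append(series[start:start + n_steps])  # n_steps consecutive values
        y.append(series[start + n_steps])        # the value right after the window
    X = np.array(X).reshape(-1, n_steps, 1)      # add the trailing feature axis
    return X, np.array(y)

# Illustrative yearly values for one location (not real data)
population = [2229.3, 2278.9, 2327.6, 2375.5, 2422.7, 2470.0, 2518.2]
X_seq, y_seq = make_windows(population, n_steps=3)
print(X_seq.shape, y_seq.shape)
```

With seven values and a window of three, this produces four training samples; the windows would need to be built separately per location so that no window spans two countries.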



In [6]:
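# Keras LSTM layers expect 3D input: (samples, timesteps, features per step).
# Here each tabular column is treated as one timestep carrying a single feature.
X_train_lstm = X_train.to_numpy(dtype="float32").reshape(-1, X_train.shape[1], 1)
X_test_lstm = X_test.to_numpy(dtype="float32").reshape(-1, X_test.shape[1], 1)

# Create the LSTM model
model_lstm = Sequential()
model_lstm.add(LSTM(units=50, input_shape=(X_train.shape[1], 1)))
model_lstm.add(Dense(units=1))  # Output layer: one regression value

# Compile the model
model_lstm.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model_lstm.fit(X_train_lstm, y_train, epochs=10, batch_size=32)

# Make predictions on the training and test data
train_predictions = model_lstm.predict(X_train_lstm).ravel()
predictions = model_lstm.predict(X_test_lstm).ravel()

# Evaluate the model: R^2 must compare the true targets with the predictions
#loss = model_lstm.evaluate(X_test_lstm, y_test)
#print("Test Loss:", loss)
train_r2_lstm = r2_score(y_train, train_predictions)
test_r2_lstm = r2_score(y_test, predictions)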

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
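
The R² score is only informative when the true targets are compared with model predictions. As a sketch of what the metric computes, the hypothetical helper `r2` below reimplements the standard definition (1 minus the ratio of residual to total sum of squares) in NumPy, on made-up numbers:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)                 # identical inputs always score 1.0
baseline = r2(y_true, [2.5] * 4)             # predicting the mean scores 0.0
close = r2(y_true, [1.1, 1.9, 3.2, 3.8])     # small errors score just below 1
print(perfect, baseline, close)
```

Scoring a target against itself therefore always returns 1.0, regardless of how well the model fits.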


### ARIMA (AutoRegressive Integrated Moving Average)

ARIMA is a well-known time series forecasting method used to model and predict time-dependent data. It combines autoregressive terms, differencing to remove trends, and moving-average terms to model the noise; seasonality is handled by the seasonal extension of the model (SARIMA).
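
For reference, the standard textbook form of an ARIMA(1, 1, 1) model can be written with the backshift operator $B$, where $B y_t = y_{t-1}$. This form omits any constant term and exogenous regressors:

```latex
(1 - \phi_1 B)\,(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
```

Here $p = 1$ contributes the autoregressive coefficient $\phi_1$, $d = 1$ contributes the single differencing factor $(1 - B)$, and $q = 1$ contributes the moving-average coefficient $\theta_1$; $\varepsilon_t$ is white noise.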



In [7]:
# 'Time' is the column representing the year
population_data['Time'] = pd.to_datetime(population_data['Time'], format='%Y')
population_data.set_index('Time', inplace=True)

# Resample or interpolate missing data (if needed)
#population_data = population_data.resample('Y').mean()  # Resample to yearly frequency and calculate the mean for missing years

X = population_data.drop('TPopulation1Jan', axis=1)  # Independent variables
y = population_data['TPopulation1Jan']  # Dependent variable


arima_order = (1, 1, 1)  # Example ARIMA order (p, d, q)
arima_model = ARIMA(y, order=arima_order, exog=X)
arima_model_fit = arima_model.fit()

#Generate future time points
future_time_points = pd.date_range(start='2030-01-01', end='2035-01-01', freq='Y')


# Future values are needed for every exogenous (independent) variable used to fit the model.
# 'Time' is excluded because it is now the index, which leaves these 10 columns:
# 'LocID','PopDensity','PopSexRatio','MedianAgePop','PopGrowthRate','DoublingTime','MAC','SRB','CDR','NetMigrations'
X_future_data = {
    'feature_2030': [0,0,0,0,0,0,0,0,0,0],  
    'feature_2031': [0,0,0,0,0,0,0,0,0,0],  
    'feature_2032': [0,0,0,0,0,0,0,0,0,0],
    'feature_2033': [0,0,0,0,0,0,0,0,0,0],
    'feature_2034': [0,0,0,0,0,0,0,0,0,0],
    'feature_2035': [0,0,0,0,0,0,0,0,0,0]
}    
        

# Make predictions for the future using the ARIMA model
y_future_pred = arima_model_fit.forecast(steps=len(future_time_points), exog=X_future_data ).values

print(y_future_pred)





ValueError: Provided exogenous values are not of the appropriate shape. Required (5, 10), got ().

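The previous cell raises a ValueError because the exogenous values passed to `forecast` have the wrong shape, not because the values themselves are out of range. The model was fitted with 10 exogenous columns, and `future_time_points` contains 5 dates (`freq='Y'` anchors each date to the year end, so the range runs from 2030-12-31 to 2034-12-31). `exog` must therefore be a 2D array or DataFrame of shape (5, 10), with one row per forecast step and one column per variable. `X_future_data` is a dict, which NumPy treats as a single object of shape ().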
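
The shape mismatch can be reproduced without fitting a model. The sketch below, using only NumPy and pandas, compares the shape NumPy infers from a dict like `X_future_data` with that of a DataFrame laid out as one row per forecast step and one column per exogenous variable. The column names are taken from the `info()` output above; the zeros are placeholders, not real projections.

```python
import numpy as np
import pandas as pd

exog_columns = ["LocID", "PopDensity", "PopSexRatio", "MedianAgePop", "PopGrowthRate",
                "DoublingTime", "MAC", "SRB", "CDR", "NetMigrations"]
n_steps = 5  # number of forecast steps requested in the failing call

# A dict is treated as a single opaque object, so its array shape is ()
as_dict = {"feature_2030": [0] * 10, "feature_2031": [0] * 10}
dict_shape = np.shape(as_dict)
print(dict_shape)

# One row per forecast step, one column per exogenous variable
X_future = pd.DataFrame(np.zeros((n_steps, len(exog_columns))), columns=exog_columns)
print(X_future.shape)
```

A frame shaped like `X_future` matches the (5, 10) requirement in the error message; meaningful forecasts still require plausible values in place of the zeros.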

Options for providing future values of the independent variables:

- Historical Averages: Calculate the historical average of each independent variable and use it as a constant value for all future time points. This assumes the future will resemble the historical average, so it serves only as a simple baseline.

- Seasonal Decomposition: If the data exhibits seasonal patterns, use a decomposition technique such as STL (Seasonal-Trend decomposition using LOESS) to isolate the seasonal component and extrapolate it into the future. This applies mainly to data sampled more often than once a year; yearly data has no within-year seasonality.

- Forecasting Models: Fit a forecasting model to each independent variable and predict its future values, for example with ARIMA or another time series technique. This approach takes the temporal dependencies in the data into account.

- Machine Learning: Train machine learning models on the historical data to predict the independent variables. Depending on the nature of the data, regression models or time series forecasting algorithms can be used.

- Domain Expertise: If subject matter experts are available, consult them to make reasonable assumptions about the future values of the independent variables.
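
As a sketch of two of these options, the following builds future exogenous values by repeating the historical mean of each column, and by extrapolating a per-column straight line fitted with `np.polyfit`. The linear fit is a deliberately simple stand-in for a proper forecasting model. The `history` frame, its two columns, and the year ranges are synthetic assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic history for one location: two exogenous columns over 1950-1959
years = np.arange(1950, 1960)
history = pd.DataFrame({
    "PopDensity": 80.0 + 2.0 * (years - 1950),  # steadily rising
    "SRB": np.full(len(years), 102.5),          # constant
}, index=years)

future_years = np.arange(1960, 1965)  # five forecast steps

# Historical averages: the same mean row repeated for every future step
means = history.mean()
X_future_mean = pd.DataFrame([means.to_numpy()] * len(future_years),
                             index=future_years, columns=history.columns)

# Simple forecasting model: fit a straight line per column and extrapolate it
X_future_trend = pd.DataFrame(index=future_years, columns=history.columns, dtype=float)
for col in history.columns:
    slope, intercept = np.polyfit(years, history[col].to_numpy(), deg=1)
    X_future_trend[col] = slope * future_years + intercept

print(X_future_mean.shape, X_future_trend.shape)
```

Both frames have one row per forecast step and one column per variable, which is the layout the `exog` argument expects. In a multi-country dataset the values would need to be built per location.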