
### Predict the GDP of Malaysia

**Objective:** Train and evaluate a linear regression model in PySpark that predicts the GDP of Malaysia from a synthetic dataset of economic indicators.

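- **Step 1: Import Necessary Libraries**: Imports the Spark session, feature assembler, linear regression, and evaluator classes used in the steps below.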

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.evaluation import RegressionEvaluator

- **Step 2: Initialize Spark Session**: Starts a Spark session to utilize Spark's distributed computing capabilities.

In [None]:
# Step 2: Initialize Spark Session
spark = SparkSession.builder \
    .appName("Malaysia GDP Prediction") \
    .getOrCreate()

- **Step 3: Load and Prepare Data**: Reads the synthetic dataset from a CSV file into a Spark DataFrame, taking column names from the header row and inferring each column's type.

In [None]:
# Step 3: Load and Prepare Data
data = spark.read.csv("/FileStore/tables/malaysia_gdp_dataset.csv", header=True, inferSchema=True)

- **Step 4: Feature Engineering**: Uses `VectorAssembler` to combine selected feature columns into a single vector column named "features", which is required for model training.

In [None]:
# Step 4: Feature Engineering
feature_columns = ["Year", "Population", "GDP_Per_Capita", "Employment_Rate", "Inflation_Rate", "Interest_Rate"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = assembler.transform(data).select("features", "GDP")

- **Step 5: Build and Train Regression Model**: Initializes a `LinearRegression` estimator from PySpark's machine learning library (`pyspark.ml`) with "GDP" as the label column; the feature column defaults to "features". The model is then fitted on the full prepared dataset.

In [None]:
# Step 5: Build and Train Regression Model
lr = LinearRegression(labelCol="GDP")
model = lr.fit(data)

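- **Step 6: Evaluate the Model**: Generates predictions for the same dataset the model was trained on and scores them with the Root Mean Squared Error (RMSE), the square root of the mean squared difference between predicted and actual GDP. Because no rows are held out, this is an in-sample error and is likely to understate the error on unseen data.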

In [None]:
# Step 6: Evaluate the Model
predictions = model.transform(data)
evaluator = RegressionEvaluator(labelCol="GDP", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Root Mean Squared Error (RMSE): 110531700877.20
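
To make the metric concrete, here is a minimal plain-Python sketch of what RMSE computes. The helper name `rmse` is hypothetical and is not part of PySpark. The two (GDP, prediction) pairs are the rows for 2020 and 2000 from this notebook's `predictions` output, chosen because their values are printed in full.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Two (GDP, prediction) pairs taken from the notebook's predictions output
gdp = [2.713512052938978e11, 3.023843158708312e11]
pred = [3.002323404232118e11, 3.080621710125841e11]

print(f"RMSE on these two rows: {rmse(gdp, pred):.3e}")
```

RMSE is expressed in the same units as GDP, so it should be judged against the magnitude of the GDP values themselves rather than as an absolute number.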


In [None]:
predictions.select("prediction", "GDP", "features").show(5)

+--------------------+--------------------+--------------------+
|          prediction|                 GDP|            features|
+--------------------+--------------------+--------------------+
|3.002323404232118E11|2.713512052938978E11|[2020.0,3.4842850...|
| 3.01690923543166E11|2.482283504872224...|[2003.0,2.9599996...|
|3.080621710125841E11|3.023843158708312E11|[2000.0,2.8354246...|
|2.717016765072492...|2.364924699445132...|[2023.0,3.0269213...|
| 2.77660559956756E11|4.398302507998309...|[2008.0,3.2642778...|
+--------------------+--------------------+--------------------+
only showing top 5 rows



In [None]:
predictions.show()

+--------------------+--------------------+--------------------+
|            features|                 GDP|          prediction|
+--------------------+--------------------+--------------------+
|[2020.0,3.4842850...|2.713512052938978E11|3.002323404232118E11|
|[2003.0,2.9599996...|2.482283504872224...| 3.01690923543166E11|
|[2000.0,2.8354246...|3.023843158708312E11|3.080621710125841E11|
|[2023.0,3.0269213...|2.364924699445132...|2.717016765072492...|
|[2008.0,3.2642778...|4.398302507998309...| 2.77660559956756E11|
|[2007.0,3.1639998...|4.289323672360232...|3.105909635989288...|
|[2007.0,2.3435721...|1.422155482575994...|3.231072249304668E11|
|[2004.0,2.0481503...|4.843150268858315E11|3.049732129893963...|
|[2023.0,2.4731795...|3.542340424405783...|3.028531706361333E11|
|[2003.0,2.4016113...|4.314829244009833E11|3.171629143155991E11|
|[2021.0,2.3164742...|3.829234574824308E11|2.922898454102738E11|
|[2023.0,3.4143645...|2.741948580030708E11|2.656274607829859...|
|[2017.0,3.3145514...|3.9...
(output truncated)

- **Step 7: Save and Load Model (Optional)**: Saves the trained model for future use. Loading the saved model is shown in section 3 below.

In [None]:
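# Step 7: Save the trained model
# write().overwrite() replaces any model already saved at this path, so the cell
# can be re-run; plain model.save() raises an error if the path already exists.
model.write().overwrite().save("/FileStore/tables/malaysia_gdp_prediction_model")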

### 1. Prepare the New Sample Data

First, prepare the new sample data so that each row supplies a value for every input feature the Spark Linear Regression model was trained on: Year, Population, GDP_Per_Capita, Employment_Rate, Inflation_Rate, and Interest_Rate. List the values in this order so that they line up with the schema defined in the next section.

Here's an example of how you might structure the new sample data:

In [None]:
# Example of new sample data
new_sample = [
    (2024, 36000000.0, 6800.0, 73.0, 6.0, 2.0),  # Adjust values accordingly
    (2025, 36500000.0, 6900.0, 74.0, 5.5, 1.8),  # Another example row
    # Add more rows as needed
]
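
Since each tuple carries no column names, a value in the wrong position would be silently assigned to the wrong feature. The following plain-Python sketch pairs each value with its column name and rejects rows of the wrong width before they reach Spark. The helper `label_rows` is hypothetical and is not part of the notebook or of PySpark.

```python
feature_columns = ["Year", "Population", "GDP_Per_Capita",
                   "Employment_Rate", "Inflation_Rate", "Interest_Rate"]

new_sample = [
    (2024, 36000000.0, 6800.0, 73.0, 6.0, 2.0),
    (2025, 36500000.0, 6900.0, 74.0, 5.5, 1.8),
]

def label_rows(rows, columns):
    """Pair each value with its column name; fail if a row has the wrong width."""
    labeled = []
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {i} has {len(row)} values, expected {len(columns)}"
            )
        labeled.append(dict(zip(columns, row)))
    return labeled

# Print each row as a name -> value mapping for a quick visual check
for record in label_rows(new_sample, feature_columns):
    print(record)
```

This check only catches missing or extra values. Two values swapped within a row still pass, so the printed mapping is worth a quick read.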

### 2. Convert Data to DataFrame

Convert the new sample data into a Spark DataFrame with an explicit schema, so that each column has the expected name and type:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

# Schema matching the feature columns used in training
schema = StructType([
    StructField("Year", IntegerType(), True),
    StructField("Population", DoubleType(), True),
    StructField("GDP_Per_Capita", DoubleType(), True),
    StructField("Employment_Rate", DoubleType(), True),
    StructField("Inflation_Rate", DoubleType(), True),
    StructField("Interest_Rate", DoubleType(), True),
])

# Reuse the Spark session from Step 2. If running this cell on its own,
# uncomment the next line to create one first:
# spark = SparkSession.builder.appName("GDPPrediction").getOrCreate()

# Convert list of tuples to DataFrame
df = spark.createDataFrame(new_sample, schema)

In [None]:
# Same feature columns, in the same order, as in Step 4
input_features = ["Year", "Population", "GDP_Per_Capita", "Employment_Rate", "Inflation_Rate", "Interest_Rate"]

# Create a VectorAssembler instance
assembler = VectorAssembler(inputCols=input_features, outputCol="features")

# Transform the new sample data
df = assembler.transform(df)

### 3. Load the Trained Model

Load the trained Spark Linear Regression model for inference. This requires that the model was saved after training (Step 7) and that the path below matches the path used there.

In [None]:
# The path must match the one used when saving the model in Step 7
model = LinearRegressionModel.load("/FileStore/tables/malaysia_gdp_prediction_model")

### 4. Transform and Predict

Use the loaded model to transform the new sample data and make predictions:

In [None]:
# Transform the new data
predictions = model.transform(df)

# Show the predictions
predictions.select("Year", "Prediction").show()

+----+--------------------+
|Year|          Prediction|
+----+--------------------+
|2024|2.962294960213685E11|
|2025|2.962315011918384E11|
+----+--------------------+



- **Step 8: Stop Spark Session**: Terminates the Spark session to release resources.

# Great Job!