# Build an ML Pipeline for Airfoil Noise Prediction

In this project, we use a modified version of the NASA Airfoil Self-Noise dataset to create a model that will predict the SoundLevel based on other columns in the dataset. After training the model, we assess its performance using relevant metrics to gauge accuracy and effectiveness. The model is saved for future use, ensuring it can be retrieved and deployed in real-world applications to make predictions on new data.

This project has four parts:

- Part 1 - Perform ETL activity
  - Load a csv dataset
  - Remove duplicates if any
  - Drop rows with null values if any
  - Make transformations
  - Store the cleaned data in parquet format
- Part 2 - Create a Machine Learning Pipeline
  - Create a machine learning pipeline for prediction
- Part 3 - Evaluate the Model
  - Evaluate the model using relevant metrics
- Part 4 - Persist the Model
  - Save the model for future production use
  - Load and verify the stored model

### Preliminaries: Installing libraries and downloading data

Install the required libraries

In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

Download the required data file

In [2]:
# Download the `NASA_airfoil_noise_raw.csv` file
import wget
wget.download("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv")

'NASA_airfoil_noise_raw.csv'

### Importing Libraries

Importing the required libraries

In [3]:
import os
import findspark
import warnings

def warn(*args, **kwargs):
    pass

# Suppress generated warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

findspark.init()

# import functions/Classes for sparkml
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

# import functions/Classes for pipeline creation
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel

# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator

### Create a spark session

You can ignore any warnings emitted when the SparkSession is created.

In [4]:
spark = SparkSession \
    .builder \
    .appName("Airfoil Noise Prediction") \
    .getOrCreate()

24/04/26 02:26:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### Tasks

#### Part 1 - Perform ETL activity

Our initial step involves data cleaning, where we eliminate duplicate rows and those with missing values. This process ensures that the data remains reliable and consistent for subsequent analysis.

Load a csv dataset

* Using the `spark.read.csv` function we load the data into a dataframe
* The `header=True` indicates that there is a header row in our csv file
* The `inferSchema=True` tells spark to automatically determine the data types of the columns

In [5]:
df = spark.read.csv("NASA_airfoil_noise_raw.csv", header=True, inferSchema=True)

Show top 5 rows from the dataset

In [6]:
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows
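To make the two read options concrete, here is a minimal plain-Python sketch of what `header=True` and `inferSchema=True` amount to. It uses the standard `csv` module rather than Spark, the two sample rows are copied from the preview above, and the helper name `infer` is hypothetical. Spark's own inference is more involved; this only illustrates the idea.

```python
import csv
import io

# Two sample rows copied from the preview; the real file has many more.
raw = """Frequency,AngleOfAttack,ChordLength,FreeStreamVelocity,SuctionSideDisplacement,SoundLevel
800,0.0,0.3048,71.3,0.00266337,126.201
1000,0.0,0.3048,71.3,0.00266337,125.201
"""

def infer(value):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value

# DictReader treats the first line as column names, similar to header=True.
rows = [
    {name: infer(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]

# Record the Python type chosen for each column, similar in spirit to inferSchema=True.
column_types = {name: type(value).__name__ for name, value in rows[0].items()}
print(column_types)
```

Without type inference every column would be read as a string, and the numeric stages later in the pipeline would not be able to use them directly.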



Show total number of rows in the dataset

In [7]:
rowcount1 = df.count()
print(rowcount1)

1522

Remove duplicates, if any

In [8]:
df = df.dropDuplicates()

Show total number of rows in the dataset

In [9]:
rowcount2 = df.count()
print(rowcount2)



1503

Drop rows with null values, if any

In [10]:
df = df.dropna()

Show total number of rows in the dataset

In [11]:
rowcount3 = df.count()
print(rowcount3)



1499

Make transformations

* Rename the column `SoundLevel` to `SoundLevelDecibels`

In [12]:
df = df.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

Store the cleaned data in parquet format

* Save the dataframe as `NASA_airfoil_noise_cleaned.parquet`

In [13]:
# Overwrite any existing output so this cell can be rerun without an error
df.write.mode("overwrite").parquet("NASA_airfoil_noise_cleaned.parquet")

#### Part 1 - Evaluation

In [14]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("New column name = ", df.columns[-1])

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("NASA_airfoil_noise_cleaned.parquet"))

Part 1 - Evaluation
Total rows =  1522
Total rows after dropping duplicate rows =  1503
Total rows after dropping duplicate rows and rows with null values =  1499
New column name =  SoundLevelDecibels
NASA_airfoil_noise_cleaned.parquet exists : True
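The three row counts also tell us how many rows each cleaning step removed. The short sketch below simply redoes that arithmetic with the counts printed above; the variable names `duplicates_removed` and `null_rows_removed` are introduced here for illustration.

```python
# Row counts reported by the notebook after each cleaning step.
rowcount1, rowcount2, rowcount3 = 1522, 1503, 1499

duplicates_removed = rowcount1 - rowcount2   # rows dropped by dropDuplicates()
null_rows_removed = rowcount2 - rowcount3    # rows dropped by dropna()

print("Duplicate rows removed =", duplicates_removed)        # 19
print("Rows with null values removed =", null_rows_removed)  # 4
```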


#### Part 2 - Create a Machine Learning Pipeline

Next, we create a machine learning pipeline with three stages: feature assembly, feature scaling, and linear regression. The fitted pipeline predicts `SoundLevelDecibels` (the renamed `SoundLevel` column) from the other columns in the dataset.

Create a machine learning pipeline for prediction

First, load data from "NASA_airfoil_noise_cleaned.parquet" into a dataframe

In [15]:
df = spark.read.parquet("NASA_airfoil_noise_cleaned.parquet")

Show total number of rows in the dataset

In [16]:
rowcount4 = df.count()
print(rowcount4)



1499

Define the VectorAssembler pipeline stage

Stage 1:
* Assemble the input columns into a single column `features`
* Use all the columns except `SoundLevelDecibels` as input features


In [17]:
assembler = VectorAssembler(
    inputCols=[
        "Frequency",
        "AngleOfAttack",
        "ChordLength",
        "FreeStreamVelocity",
        "SuctionSideDisplacement"
    ],
    outputCol="features"
)
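As an alternative to typing the feature names by hand, the list can be derived from the column names by excluding the label. This is only a sketch: the `columns` list below is written out from the dataset preview with the renamed label, and in the notebook it could presumably be replaced by `df.columns`.

```python
# Column names after the rename in Part 1 (in the notebook: df.columns).
columns = [
    "Frequency",
    "AngleOfAttack",
    "ChordLength",
    "FreeStreamVelocity",
    "SuctionSideDisplacement",
    "SoundLevelDecibels",
]
label_col = "SoundLevelDecibels"

# Every column except the label becomes an input feature.
feature_cols = [c for c in columns if c != label_col]
print(feature_cols)
```

The resulting list could then be passed as `inputCols` to `VectorAssembler`, which keeps the stage correct if the label is the only column that should be excluded.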

Define the StandardScaler pipeline stage

Stage 2:
* Scale the `features` using standard scaler and store in `scaledFeatures` column


In [18]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
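To see what this stage does to a single feature, here is a plain-Python sketch applied to the five `Frequency` values from the dataset preview. It assumes the scaler's default settings, which to our understanding divide each feature by its sample standard deviation without subtracting the mean (`withStd=True`, `withMean=False`). In the pipeline the statistics come from the whole training set, not from five rows.

```python
import statistics

# The five Frequency values shown in the dataset preview.
frequency = [800, 1000, 1250, 1600, 2000]

# Sample standard deviation (n - 1 in the denominator).
std = statistics.stdev(frequency)

# Divide by the standard deviation only; the values are not centered.
scaled = [x / std for x in frequency]

print("Standard deviation before scaling =", round(std, 2))
print("Standard deviation after scaling =", round(statistics.stdev(scaled), 2))
```

After scaling, every feature has unit standard deviation, so a feature measured in thousands (such as `Frequency`) no longer dominates one measured in thousandths (such as `SuctionSideDisplacement`).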

Define the LinearRegression pipeline stage

Stage 3:
* Create a LinearRegression stage to predict `SoundLevelDecibels` using `scaledFeatures`

In [19]:
lr = LinearRegression(featuresCol="scaledFeatures", labelCol="SoundLevelDecibels")

Build a pipeline using the above three stages


In [20]:
pipeline = Pipeline(stages=[assembler, scaler, lr])

Split the data
* Split the data into training and testing sets with 70:30 split
* Set the value of seed to 42

Do not set the seed to any value other than 42.

In [21]:
(trainingData, testingData) = df.randomSplit([0.7, 0.3], seed=42)
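A weighted random split assigns each row to a part independently, so the resulting sizes are close to 70:30 but usually not exact. The sketch below illustrates that idea in plain Python with a hypothetical helper `random_split`. It is not Spark's implementation and will not select the same rows as `df.randomSplit`, even with the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part at random, in proportion to the weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

# Stand-in row ids for the 1499 cleaned rows.
train, test = random_split(list(range(1499)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the assignment repeatable, which is why the metrics reported later can be reproduced from run to run.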

Fit the pipeline using the training data

In [22]:
pipelineModel = pipeline.fit(trainingData)

24/04/26 02:41:57 WARN util.Instrumentation: [e2f62d28] regParam is zero, which might cause numerical instability and overfitting.
24/04/26 02:42:07 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
24/04/26 02:42:07 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
24/04/26 02:42:09 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
24/04/26 02:42:09 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

#### Part 2 - Evaluation


In [23]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  1499
Pipeline Stage 1 =  VectorAssembler
Pipeline Stage 2 =  StandardScaler
Pipeline Stage 3 =  LinearRegression
Label column =  SoundLevelDecibels


#### Part 3 - Evaluate the Model

After training the model, we assess its accuracy on the held-out testing data using three regression metrics. In Part 4 we then save the model so it can be retrieved later and used to make predictions on new data.

Evaluate the model using relevant metrics

Make predictions on testing data

In [24]:
predictions = pipelineModel.transform(testingData)

Print the Mean Squared Error (MSE)

* The lower the value, the better the model

In [25]:
evaluator = RegressionEvaluator(
    labelCol="SoundLevelDecibels",
    predictionCol="prediction",
    metricName="mse"
)

mse = evaluator.evaluate(predictions)
print(f"MSE = {mse}")



MSE = 22.593754071348812

Print the Mean Absolute Error (MAE)

* The lower the value, the better the model

In [26]:
evaluator = RegressionEvaluator(
    labelCol="SoundLevelDecibels",
    predictionCol="prediction",
    metricName="mae"
)

mae = evaluator.evaluate(predictions)
print(f"MAE = {mae}")



MAE = 3.7336902294631287

Print the R-Squared (R2)

* Higher values indicate better performance

In [27]:
evaluator = RegressionEvaluator(
    labelCol="SoundLevelDecibels",
    predictionCol="prediction",
    metricName="r2"
)

r2 = evaluator.evaluate(predictions)
print(f"R Squared = {r2}")



R Squared = 0.5426016508689058
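For reference, the three metrics can be written out by hand. The sketch below uses the usual textbook definitions, which we assume match what `RegressionEvaluator` computes, and applies them to the five label/prediction pairs that appear in the Part 4 preview. With only five rows the numbers will not match the full test-set results, and R2 in particular is not meaningful on such a small sample and may even be negative.

```python
# Five label/prediction pairs copied from the Part 4 predictions preview.
labels = [127.315, 119.975, 121.783, 127.224, 122.229]
preds = [
    123.64344009624753,
    123.48695788614877,
    124.38983849684254,
    121.44706993294302,
    125.68312652454188,
]

n = len(labels)
errors = [y - p for y, p in zip(labels, preds)]

# Mean Squared Error: average of the squared errors.
mse_sample = sum(e * e for e in errors) / n

# Mean Absolute Error: average of the absolute errors.
mae_sample = sum(abs(e) for e in errors) / n

# R-Squared: 1 - (residual sum of squares / total sum of squares).
mean_label = sum(labels) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((y - mean_label) ** 2 for y in labels)
r2_sample = 1 - ss_res / ss_tot

print(f"MSE = {mse_sample}")
print(f"MAE = {mae_sample}")
print(f"R Squared = {r2_sample}")
```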


                                                                                

#### Part 3 - Evaluation


In [28]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipelineModel.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))

Part 3 - Evaluation
Mean Squared Error =  22.59
Mean Absolute Error =  3.73
R Squared =  0.54
Intercept =  132.6
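Because MSE is expressed in squared decibels, it can help to take its square root to get an error in the same units as the label. The sketch below does this for the MSE reported above. The evaluator can presumably also report this value directly with `metricName="rmse"`.

```python
import math

# MSE reported by the notebook on the testing data.
mse = 22.593754071348812

# Root Mean Squared Error, in the same units as SoundLevelDecibels.
rmse = math.sqrt(mse)
print("Root Mean Squared Error = ", round(rmse, 2))  # 4.75
```

An RMSE of about 4.75 is somewhat larger than the MAE of 3.73, which is expected because squaring gives large errors more weight.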


#### Part 4 - Persist the Model

Save the model for future production use

* Save the pipeline model as "Final_Project"

In [29]:
# create folder to save model
!mkdir -p Final_Project

# Persist the model to the path "./Final_Project/"
pipelineModel.write().overwrite().save("./Final_Project/")

Load and verify the stored model

In [30]:
loadedPipelineModel = PipelineModel.load("./Final_Project/")

Use the loaded pipeline model and make predictions using testingData


In [31]:
predictions = loadedPipelineModel.transform(testingData)

Show the predictions

* Show the top 5 rows from the predictions dataframe
* Display only the label column and predictions

In [32]:
predictions.select("SoundLevelDecibels","prediction").show(5)

+------------------+------------------+
|SoundLevelDecibels|        prediction|
+------------------+------------------+
|           127.315|123.64344009624753|
|           119.975|123.48695788614877|
|           121.783|124.38983849684254|
|           127.224|121.44706993294302|
|           122.229|125.68312652454188|
+------------------+------------------+
only showing top 5 rows

#### Part 4 - Evaluation


In [33]:
print("Part 4 - Evaluation")

loadedmodel = loadedPipelineModel.stages[-1]
totalstages = len(loadedPipelineModel.stages)
inputcolumns = loadedPipelineModel.stages[0].getInputCols()

print("Number of stages in the pipeline = ", totalstages)
for i,j in zip(inputcolumns, loadedmodel.coefficients):
    print(f"Coefficient for {i} is {round(j,4)}")

Part 4 - Evaluation
Number of stages in the pipeline =  3
Coefficient for Frequency is -3.9728
Coefficient for AngleOfAttack is -2.4775
Coefficient for ChordLength is -3.3818
Coefficient for FreeStreamVelocity is 1.5789
Coefficient for SuctionSideDisplacement is -1.6465
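Since the regression stage was fitted on `scaledFeatures`, each coefficient describes the change in predicted sound level per one standard deviation of its feature, so the magnitudes are roughly comparable. The sketch below ranks the coefficients printed above by absolute value. Treat the ranking as a rough guide only, since correlated features can shift weight between coefficients.

```python
# Coefficients printed by the notebook for the loaded model.
coefficients = {
    "Frequency": -3.9728,
    "AngleOfAttack": -2.4775,
    "ChordLength": -3.3818,
    "FreeStreamVelocity": 1.5789,
    "SuctionSideDisplacement": -1.6465,
}

# Sort features from largest to smallest absolute coefficient.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked:
    print(f"{name}: {value}")
```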


### Stop Spark Session


In [34]:
spark.stop()

### Change Log


|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2024-04-24  | 0.2  | Pravin Regismond | Modified to fulfill project requirements |
| 2023-05-26  | 0.1  | Ramesh Sannareddy | Initial Version Created |

Copyright © 2023 IBM Corporation. All rights reserved.