## NASA Airfoil Noise Prediction using PySpark

## Objective


As a Data Engineer passionate about building scalable data solutions, I worked on a project involving the NASA Airfoil Self-Noise dataset, commonly used in aeronautics research. The objective was to support data scientists by handling all ETL and ML pipeline development tasks so they could focus on algorithm optimization.

In this project, I cleaned and preprocessed the dataset by removing duplicate records and handling missing values to ensure data quality. I then built a complete end-to-end ML pipeline using PySpark, with the goal of predicting SoundLevel based on various physical attributes of the airfoil.

The pipeline included feature engineering, model training, and evaluation using MSE, MAE, and R². Finally, I persisted the trained model for future use in production environments or batch inference tasks.


## Datasets

This project uses the following dataset:

 - The original NASA Airfoil Self-Noise dataset is available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/291/airfoil+self+noise
 
 - The dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.


Diagram of an airfoil (for reference only).


![Airfoil with flow](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_with_flow.png)


Diagram showing the angle of attack (for reference only).


![Airfoil angle of attack](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/images/Airfoil_angle_of_attack.jpg)


In [1]:
!pip install pyspark==3.1.2 -q
!pip install findspark -q

### Importing Required Libraries



In [2]:
# this section is to suppress warnings generated by code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

## Perform ETL activity


### Import required libraries


In [3]:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.pipeline import PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import StandardScaler


### Create a spark session


In [46]:
spark = SparkSession.builder.appName("Final Project").getOrCreate()

### Download the CSV file


In [5]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv


--2025-07-09 05:57:38--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/NASA_airfoil_noise_raw.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60682 (59K) [text/csv]
Saving to: ‘NASA_airfoil_noise_raw.csv.2’


2025-07-09 05:57:38 (60.8 MB/s) - ‘NASA_airfoil_noise_raw.csv.2’ saved [60682/60682]



### Load the dataset into the spark dataframe


In [6]:
df = spark.read.csv("NASA_airfoil_noise_raw.csv", header=True, inferSchema=True)

### Top 5 rows of the dataset


In [7]:
df.show(5)

+---------+-------------+-----------+------------------+-----------------------+----------+
|Frequency|AngleOfAttack|ChordLength|FreeStreamVelocity|SuctionSideDisplacement|SoundLevel|
+---------+-------------+-----------+------------------+-----------------------+----------+
|      800|          0.0|     0.3048|              71.3|             0.00266337|   126.201|
|     1000|          0.0|     0.3048|              71.3|             0.00266337|   125.201|
|     1250|          0.0|     0.3048|              71.3|             0.00266337|   125.951|
|     1600|          0.0|     0.3048|              71.3|             0.00266337|   127.591|
|     2000|          0.0|     0.3048|              71.3|             0.00266337|   127.461|
+---------+-------------+-----------+------------------+-----------------------+----------+
only showing top 5 rows



### Total number of rows in the dataset


In [8]:
rowcount1 = df.count()
print(rowcount1)

1522


### Drop all the duplicate rows from the dataset


In [9]:
df = df.dropDuplicates()

### Total number of rows after dropping duplicates


In [10]:
rowcount2 = df.count()
print(rowcount2)



1503

### Drop all the rows that contain null values from the dataset


In [11]:
df = df.dropna()
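
The two cleaning steps can be illustrated without Spark. The following is a minimal plain-Python sketch on made-up toy rows (the values are illustrative, not taken from the dataset's removed records). It mirrors `dropDuplicates()` with no column subset, which keeps one copy of each fully identical row, and `dropna()` with its default `how="any"`, which drops a row if any field is missing. Unlike this sketch, Spark does not guarantee row order after deduplication.

```python
# Toy rows standing in for DataFrame records; None marks a missing value.
rows = [
    (800, 0.0, 126.201),
    (800, 0.0, 126.201),    # exact duplicate of the first row
    (1000, 0.0, None),      # missing label
    (1250, None, 125.951),  # missing feature
    (1600, 0.0, 127.591),
]

# Analogue of dropDuplicates(): keep one copy of each fully identical row.
deduplicated = list(dict.fromkeys(rows))

# Analogue of dropna() with how="any": drop a row if any field is missing.
cleaned = [row for row in deduplicated if all(value is not None for value in row)]

print(len(rows), len(deduplicated), len(cleaned))
```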

### Total number of rows after dropping null values


In [12]:
rowcount3 = df.count()
print(rowcount3)



1499

### Renaming the column "SoundLevel" to "SoundLevelDecibels"


In [13]:
df = df.withColumnRenamed("SoundLevel", "SoundLevelDecibels")

### Saving the dataframe in parquet format


In [14]:
df.write.mode("overwrite").parquet("NASA_airfoil_noise_cleaned.parquet")

25/07/09 05:58:25 WARN hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
25/07/09 05:58:25 WARN hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
25/07/09 05:58:26 WARN hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

#### Evaluation


In [15]:
print("Part 1 - Evaluation")

print("Total rows = ", rowcount1)
print("Total rows after dropping duplicate rows = ", rowcount2)
print("Total rows after dropping duplicate rows and rows with null values = ", rowcount3)
print("New column name = ", df.columns[-1])

import os

print("NASA_airfoil_noise_cleaned.parquet exists :", os.path.isdir("NASA_airfoil_noise_cleaned.parquet"))

Part 1 - Evaluation
Total rows =  1522
Total rows after dropping duplicate rows =  1503
Total rows after dropping duplicate rows and rows with null values =  1499
New column name =  SoundLevelDecibels
NASA_airfoil_noise_cleaned.parquet exists : True
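
The row counts reported above also show how much each cleaning step removed. A small sketch of that arithmetic, with the three counts copied from the output (the variable names here are illustrative and not part of the notebook):

```python
# Row counts copied from the Part 1 evaluation output.
total_rows = 1522
after_dedup = 1503
after_dropna = 1499

duplicates_removed = total_rows - after_dedup
null_rows_removed = after_dedup - after_dropna
retained_fraction = after_dropna / total_rows

print("Duplicate rows removed:", duplicates_removed)
print("Rows with nulls removed:", null_rows_removed)
print(f"Share of raw rows retained: {retained_fraction:.1%}")
```

Cleaning removed only a small share of the raw file, so the modelling step sees nearly all of the original records.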


## Create a Machine Learning Pipeline


### Loading "NASA_airfoil_noise_cleaned.parquet" into a dataframe


In [16]:
df = spark.read.parquet("NASA_airfoil_noise_cleaned.parquet")

### Total number of rows in the dataset


In [17]:
rowcount4 = df.count()
print(rowcount4)

1499

### VectorAssembler pipeline stage


In [18]:
assembler = VectorAssembler(inputCols=["Frequency","AngleOfAttack","ChordLength", "FreeStreamVelocity", "SuctionSideDisplacement"], outputCol="features")

### Define the StandardScaler pipeline stage


In [19]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
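
To make the scaling step concrete: with its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not centre it. The plain-Python sketch below applies that rule to the five `Frequency` values shown in the `df.show(5)` output. It is only an illustration on five values; in the pipeline, Spark estimates the standard deviation from the full training set, per feature.

```python
import statistics

# Frequency values from the first five rows shown by df.show(5).
frequency = [800, 1000, 1250, 1600, 2000]

# Default StandardScaler behaviour: divide by the sample (n - 1) standard
# deviation, without subtracting the mean.
std = statistics.stdev(frequency)
scaled = [value / std for value in frequency]

print([round(value, 3) for value in scaled])
print(round(statistics.stdev(scaled), 6))  # unit standard deviation after scaling
```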

### Define the Model creation pipeline stage


In [20]:
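# Trains on the unscaled "features" column; the scaler's "scaledFeatures" output is not used by this stage.
lr = LinearRegression(featuresCol="features", labelCol="SoundLevelDecibels")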

### Build the pipeline


In [21]:
pipeline = Pipeline(stages=[assembler, scaler, lr])

### Split the data


In [22]:
(trainingData, testingData) = df.randomSplit([0.7, 0.3], seed=42)
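
`randomSplit` normalises the weights if they do not sum to 1 and assigns rows at random, so a 70/30 split yields roughly, not exactly, 70% of rows for training. The sketch below is a simplified plain-Python analogue of that idea, not Spark's implementation: the helper `random_split` is hypothetical, and because it uses Python's random number generator, its counts will not match Spark's for the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Normalise the weights into cumulative upper bounds, e.g. [0.7, 1.0].
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)
    boundaries[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(boundaries):
            if draw < upper:
                splits[index].append(row)
                break
    return splits


# 1499 row ids stand in for the cleaned dataset.
train, test = random_split(range(1499), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `seed=42`.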

### Fit the pipeline


In [23]:
pipelineModel = pipeline.fit(trainingData)

25/07/09 06:00:36 WARN util.Instrumentation: [370c580d] regParam is zero, which might cause numerical instability and overfitting.
25/07/09 06:00:38 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
25/07/09 06:00:38 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
25/07/09 06:00:38 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
25/07/09 06:00:38 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK

#### Evaluation


In [24]:
print("Part 2 - Evaluation")
print("Total rows = ", rowcount4)
ps = [str(x).split("_")[0] for x in pipeline.getStages()]

print("Pipeline Stage 1 = ", ps[0])
print("Pipeline Stage 2 = ", ps[1])
print("Pipeline Stage 3 = ", ps[2])

print("Label column = ", lr.getLabelCol())

Part 2 - Evaluation
Total rows =  1499
Pipeline Stage 1 =  VectorAssembler
Pipeline Stage 2 =  StandardScaler
Pipeline Stage 3 =  LinearRegression
Label column =  SoundLevelDecibels


## Evaluate the Model


### Predict using the model


In [25]:
predictions = pipelineModel.transform(testingData)

### Print the MSE


In [26]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="SoundLevelDecibels", metricName="mse")
mse = evaluator.evaluate(predictions)
print(mse)

22.5937540713487

### Print the MAE


In [27]:
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="SoundLevelDecibels", metricName="mae")
mae = evaluator.evaluate(predictions)
print(mae)

3.7336902294630927

### Print the R-Squared (R2)


In [28]:
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="SoundLevelDecibels", metricName="r2")
r2 = evaluator.evaluate(predictions)
print(r2)


0.542601650868908
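
R² and MSE are linked: R² = 1 - SS_res / SS_tot, which on a single dataset equals 1 - MSE / Var(y), where Var(y) is the population variance of the labels. Assuming the evaluator uses this standard definition, the two printed metrics imply the spread of the test labels. A short worked sketch using the values printed above:

```python
# Metric values copied from the MSE and R2 cells.
mse = 22.5937540713487
r2 = 0.542601650868908

# Rearranging R2 = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R2).
label_variance = mse / (1 - r2)
label_std = label_variance ** 0.5

print(f"Implied variance of the test labels: {label_variance:.2f}")
print(f"Implied standard deviation of the test labels: {label_std:.2f} dB")
```

Read this way, an R² of about 0.54 means the model's squared error is a little under half of what a constant prediction at the label mean would produce.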


                                                                                

#### Evaluation


In [29]:
print("Part 3 - Evaluation")

print("Mean Squared Error = ", round(mse,2))
print("Mean Absolute Error = ", round(mae,2))
print("R Squared = ", round(r2,2))

lrModel = pipelineModel.stages[-1]

print("Intercept = ", round(lrModel.intercept,2))


Part 3 - Evaluation
Mean Squared Error =  22.59
Mean Absolute Error =  3.73
R Squared =  0.54
Intercept =  132.6
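
MSE is expressed in squared decibels, which is hard to read against the target. Taking its square root gives the RMSE, which is in the same units as `SoundLevelDecibels`. A minimal sketch using the MSE value printed earlier (`RegressionEvaluator` also accepts `metricName="rmse"` to compute this directly):

```python
import math

mse = 22.5937540713487  # MSE printed by the evaluator earlier
rmse = math.sqrt(mse)   # RMSE is the square root of the MSE

print(f"RMSE = {rmse:.2f} dB")
```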


## Persist the Model


### Saving the model


In [30]:
pipelineModel.write().overwrite().save("./Final_Project/")

                                                                                

### Load the model from the path "Final_Project"


In [31]:
loadedPipelineModel = PipelineModel.load("./Final_Project/")

### Make predictions using the loaded model on the test data


In [32]:
predictions = loadedPipelineModel.transform(testingData)

### Show the predictions


In [33]:
predictions.select("SoundLevelDecibels", "prediction").show(5)


+------------------+------------------+
|SoundLevelDecibels|        prediction|
+------------------+------------------+
|           127.315|123.64344009624709|
|           119.975| 123.4869578861485|
|           121.783|124.38983849684239|
|           127.224|121.44706993294264|
|           122.229|125.68312652454149|
+------------------+------------------+
only showing top 5 rows
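
To see how the error metrics relate to individual predictions, the sketch below recomputes residuals, MAE, and MSE by hand for the five rows shown above. These five rows are only a small sample, so the figures will differ from the metrics computed over the whole test set.

```python
# (actual, predicted) pairs copied from the five rows shown above.
pairs = [
    (127.315, 123.64344009624709),
    (119.975, 123.4869578861485),
    (121.783, 124.38983849684239),
    (127.224, 121.44706993294264),
    (122.229, 125.68312652454149),
]

# Residual = actual value minus predicted value.
residuals = [actual - predicted for actual, predicted in pairs]
mae = sum(abs(r) for r in residuals) / len(residuals)
mse = sum(r * r for r in residuals) / len(residuals)

for (actual, predicted), residual in zip(pairs, residuals):
    print(f"actual={actual:.3f} predicted={predicted:.3f} residual={residual:+.3f}")
print(f"MAE over these five rows: {mae:.2f}")
print(f"MSE over these five rows: {mse:.2f}")
```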



                                                                                

#### Evaluation


In [47]:
print("Part 4 - Evaluation")

loadedmodel = loadedPipelineModel.stages[-1]
totalstages = len(loadedPipelineModel.stages)
inputcolumns = loadedPipelineModel.stages[0].getInputCols()

print("Number of stages in the pipeline = ", totalstages)
for i,j in zip(inputcolumns, loadedmodel.coefficients):
    print(f"Coefficient for {i} is {round(j,4)}")

Part 4 - Evaluation
Number of stages in the pipeline =  3
Coefficient for Frequency is -0.0013
Coefficient for AngleOfAttack is -0.4111
Coefficient for ChordLength is -36.1575
Coefficient for FreeStreamVelocity is 0.1019
Coefficient for SuctionSideDisplacement is -135.2067
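
A linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below applies the rounded intercept and coefficients printed in the evaluation cells to the first row shown by `df.show(5)`. Because it uses rounded values, the result is only an approximation of what the fitted model would return for that row.

```python
# Rounded values printed in the Part 3 and Part 4 evaluation cells.
intercept = 132.6
coefficients = {
    "Frequency": -0.0013,
    "AngleOfAttack": -0.4111,
    "ChordLength": -36.1575,
    "FreeStreamVelocity": 0.1019,
    "SuctionSideDisplacement": -135.2067,
}

# First row of the dataset as shown by df.show(5); its SoundLevel is 126.201.
row = {
    "Frequency": 800,
    "AngleOfAttack": 0.0,
    "ChordLength": 0.3048,
    "FreeStreamVelocity": 71.3,
    "SuctionSideDisplacement": 0.00266337,
}

# prediction = intercept + sum(coefficient * feature value)
prediction = intercept + sum(coefficients[name] * row[name] for name in coefficients)

print(f"Approximate prediction: {prediction:.2f} dB (actual: 126.201 dB)")
```

Since the features are on very different scales, the raw coefficient sizes are not directly comparable: the large coefficient on `SuctionSideDisplacement` multiplies a value of a few thousandths, while the small coefficient on `Frequency` multiplies values in the hundreds or thousands.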


### Stop Spark Session


In [53]:
### By Qasim Alibadruddin With the Help of IBMSkillsNetwork Labs.

In [50]:
spark.stop()
