The dataset is taken from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise#

Donor: Dr Roberto Lopez (robertolopez '@' intelnics.com), Intelnics

Creators: Thomas F. Brooks, D. Stuart Pope and Michael A. Marcolini NASA

The NASA data set comprises NACA 0012 airfoils of different sizes at various wind tunnel speeds and angles of attack. The task is to predict the scaled sound pressure level in decibels.


In [20]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

**Reading the dataset**

In [None]:
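# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns',10)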

dataset=pd.read_csv("/content/airfoil_self_noise.dat",sep="\t",names=["Frequency","Angle of attack","Chord length","Free-stream velocity","Suction side displacement" ,"Scaled sound pressure level"])

#Checking for shape and head/tail and info of data
print(dataset.head())
print(dataset.tail())
dataset.info()  # info() prints its summary itself and returns None
print(dataset.shape)

#Checking for presence of any null value, there are no null values present in dataset
print(dataset.isnull().any())


In [None]:
# Finding the correlation among the variables to check for multicollinearity among independent variables
corelation=dataset.corr()
print(corelation)
# Draw the heatmap; printing its return value would only show the Axes object
sns.heatmap(corelation)
plt.show()

"""Since no correlation value is higher than 0.7 or lower than -0.7,
no high multicollinearity exists between the independent variables """
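
Pairwise correlations can miss collinearity that involves three or more variables at once, so the variance inflation factor (VIF) is often used as a complementary check. The following is a minimal sketch on synthetic data, not the airfoil data. The helper name `vif_from_features` is hypothetical, and it relies on the identity that the VIF of feature j is the j-th diagonal entry of the inverse of the feature correlation matrix.

```python
import numpy as np

def vif_from_features(X):
    """Return one variance inflation factor per column of X (hypothetical helper)."""
    # Correlation matrix of the feature columns only (the target is excluded).
    corr = np.corrcoef(X, rowvar=False)
    # VIF_j equals the j-th diagonal entry of the inverse correlation matrix.
    return np.diag(np.linalg.inv(corr))

# Synthetic illustration (assumption: stands in for the feature columns).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`

vifs = vif_from_features(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. In this sketch the independent column `b` stays close to 1, while `a` and `c` are strongly inflated.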

In [23]:
#Splitting the dataset in train data and test data using train_test_split from sklearn
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,-1].values

X_train, X_test, y_train, y_test =train_test_split(X,y,random_state=42,test_size=.2,shuffle=True)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [25]:
# Always perform feature scaling after splitting, so that the scaler is fit on the training data only
from sklearn.preprocessing import StandardScaler
scalar=StandardScaler()
X_train=scalar.fit_transform(X_train)
X_test=scalar.transform(X_test)
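
To illustrate why the scaler is fit on the training split and only applied to the test split, the sketch below reproduces the standardization by hand with NumPy. The arrays are synthetic stand-ins (an assumption, not the airfoil data), and the `_demo` names are hypothetical. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Synthetic stand-ins for X_train / X_test (assumption: not the airfoil data).
rng = np.random.default_rng(42)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# The statistics come from the training rows only.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)   # ddof=0, matching StandardScaler

# The same training statistics are applied to both splits.
X_train_scaled = (X_train_demo - train_mean) / train_std
X_test_scaled = (X_test_demo - train_mean) / train_std

print(X_train_scaled.mean(axis=0))   # numerically zero
print(X_test_scaled.mean(axis=0))    # close to zero, but not exactly
```

The test split ends up only approximately centered, which is expected: no information from the test rows leaks into the preprocessing.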

# **Building a Multiple Linear Regression model**

In [26]:
from sklearn.linear_model import LinearRegression
linear=LinearRegression()
linear.fit(X_train,y_train)
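# Use the full option name; the short form 'precision' is ambiguous in recent pandas versions
pd.set_option('display.precision',2)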
predcitedvalue=linear.predict(X_test)


In [27]:
# print the coefficients and intercept
print(f'Linear coefficients are {linear.coef_}')
print(f'The line intercept is {linear.intercept_}')

# printing the root mean squared error, taken as the square root of the MSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
print(f'The root mean squared error is {np.sqrt(mean_squared_error(y_test,predcitedvalue))}')
print(f'The mean absolute error is {mean_absolute_error(y_test,predcitedvalue)}')
print(f'R2 Score is {r2_score(y_test,predcitedvalue)}')

Linear coefficients are [-4.05944704 -2.36673679 -3.21995215  1.526901   -1.81559015]
The line intercept is 124.87696006655574
The root mean squared error is 4.704109194974886
The mean absolute error is 3.6724145641788013
R2 Score is 0.5582979754897286
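
The three metrics printed above can also be written out by hand, which makes clear what each number measures. This is a minimal sketch on a tiny made-up example. The helper `regression_metrics` is hypothetical and uses the standard textbook definitions; it is not taken from scikit-learn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) from the standard definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Root mean squared error: square root of the average squared residual.
    rmse = np.sqrt(np.mean(residuals ** 2))
    # Mean absolute error: average absolute residual.
    mae = np.mean(np.abs(residuals))
    # R2: 1 minus the ratio of the residual sum of squares to the total sum of squares.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up values for illustration only.
rmse, mae, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(rmse, mae, r2)   # approximately 0.7071, 0.5, 0.9
```

RMSE and MAE are in the units of the target (decibels here), while R2 is unitless and compares the model against always predicting the mean.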


In [None]:
# making a new dataframe where original values and predicted values are stored
newdataframe=pd.DataFrame([predcitedvalue,y_test]).T

newdataframe.columns=["predictedvalue","originalvalue"]
# newcol holds the residual: original value minus predicted value
newdataframe["newcol"]=newdataframe.originalvalue-newdataframe.predictedvalue
print(newdataframe)

**The R2 score of the Linear Regression model is low (about 0.56), so the model is not a good fit. We will now try a Decision Tree Regressor.**

# **Building a Decision Tree Regressor**

In [31]:
from sklearn.tree import DecisionTreeRegressor
decisionregressor=DecisionTreeRegressor(random_state=23)
decisionregressor.fit(X_train,y_train)
predvalue_decision=decisionregressor.predict(X_test)

In [32]:
# printing the root mean squared error, taken as the square root of the MSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
print(f'The root mean squared error is {np.sqrt(mean_squared_error(y_test,predvalue_decision))}')
print(f'The mean absolute error is {mean_absolute_error(y_test,predvalue_decision)}')
print(f'R2 Score is {r2_score(y_test,predvalue_decision)}')

The root mean squared error is 2.619758695235585
The mean absolute error is 1.8362790697674412
R2 Score is 0.8630073766926125


**The R2 score has improved significantly with the Decision Tree Regressor compared to the Linear Regression model. We can now try a Random Forest Regressor.**

In [34]:
from sklearn.ensemble import RandomForestRegressor
randomRegressor=RandomForestRegressor(random_state=23,n_estimators=30)
randomRegressor.fit(X_train,y_train)
predvalue_Random=randomRegressor.predict(X_test)

In [35]:
# printing the root mean squared error, taken as the square root of the MSE
# (the squared=False argument is deprecated in recent scikit-learn releases)
print(f'The root mean squared error is {np.sqrt(mean_squared_error(y_test,predvalue_Random))}')
print(f'The mean absolute error is {mean_absolute_error(y_test,predvalue_Random)}')
print(f'R2 Score is {r2_score(y_test,predvalue_Random)}')

The root mean squared error is 1.9190851296815714
The mean absolute error is 1.3578409745293467
R2 Score is 0.9264871802042807


**The predictions have improved further, with an R2 score of about 0.93 using the Random Forest Regressor. The model can be tuned further, for example by increasing the number of trees (n_estimators).**
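
Whether more trees actually help can be checked with cross-validation on the training data rather than by repeatedly scoring the test set. The sketch below is one possible way to do that, shown on synthetic data (an assumption, standing in for `X_train` and `y_train`). The candidate list `candidate_sizes` is a hypothetical grid, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic non-linear regression data (assumption: not the airfoil data).
rng = np.random.default_rng(23)
X_demo = rng.uniform(-2.0, 2.0, size=(300, 5))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=300)

candidate_sizes = [10, 30, 100]   # hypothetical grid of n_estimators values
cv_scores = {}
for n in candidate_sizes:
    model = RandomForestRegressor(n_estimators=n, random_state=23)
    # Mean 5-fold cross-validated R2 for this forest size.
    cv_scores[n] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()

# Pick the forest size with the highest mean cross-validated R2.
best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_n)
```

Gains from additional trees usually flatten out, so the smallest forest whose score is close to the best one is often a reasonable choice. Other hyperparameters such as `max_depth` or `min_samples_leaf` could be searched the same way.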