# AP Statistics Final Project
**Neural Network for Predicting Airfoil Self-Noise**

The [data](https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise) for this project is a Kaggle mirror of a NASA data set from the openly available UC Irvine Machine Learning Repository.

More details about data features and the data set itself can be found at the link above, or in my presentation.

The objective (target variable) of this model is the sound pressure level of the airfoil, measured in decibels (dB).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Reading the input data.

In [None]:
input_data = pd.read_csv("/kaggle/input/nasa-airfoil-self-noise/NASA_airfoil_self_noise.csv")

The following is a statistical summary of each variable, including the objective (sound). Listed for each variable is the five-number summary, the sample size (n), the mean, and the standard deviation.

In [None]:
input_data.describe()

The following is a correlation table showing the value of *r*, the Pearson correlation coefficient, for each pair of variables.

In [None]:
input_data.corr()
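
To make the entries of the correlation table concrete, here is a minimal sketch that computes *r* for one pair of columns by hand and compares it with `DataFrame.corr()`. The five rows below are made-up numbers for illustration only, not values from the airfoil data, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the airfoil data set.
toy = pd.DataFrame({
    "Frequency": [800.0, 1000.0, 1250.0, 1600.0, 2000.0],
    "Sound": [126.2, 125.2, 125.9, 127.6, 127.5],
})

def pearson_r(x, y):
    """Pearson r: sum of co-deviations over the product of deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_manual = pearson_r(toy["Frequency"], toy["Sound"])
# DataFrame.corr() uses the Pearson method by default.
r_pandas = float(toy.corr().loc["Frequency", "Sound"])
print(r_manual, r_pandas)
```

Both numbers should agree up to floating-point rounding, since each cell of the table is this same calculation applied to a different pair of columns.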

The following six graphs are histograms of the five feature variables and the single output variable.

In [None]:
input_data["Frequency"].plot(kind="hist")

In [None]:
input_data["AngleAttack"].plot(kind="hist")

In [None]:
input_data["ChordLength"].plot(kind="hist")

In [None]:
input_data["FreeStreamVelocity"].plot(kind="hist")

In [None]:
input_data["SuctionSide"].plot(kind="hist")

In [None]:
input_data["Sound"].plot(kind="hist")

Splitting data into features and objective.

In [None]:
y = input_data["Sound"]
X = input_data.drop("Sound", axis=1)

Creating a train/test split of 80/20.

In [None]:
# train_test_split was imported in the first cell; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
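
As a small illustration of what the 80/20 split and the fixed `random_state` do, the sketch below splits a toy array instead of the real data. `X_toy` and `y_toy` are hypothetical stand-ins with arbitrary values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 10 rows, 5 feature columns (values are arbitrary).
X_toy = np.arange(50).reshape(10, 5)
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Same seed, same shuffle: both calls hold out identical rows.
same_rows = np.array_equal(y_te, split_b[3])
print(same_rows)
```

Fixing the seed means the test set stays the same across reruns of the notebook, so test-set metrics from different runs are evaluated on the same rows.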

Constructing the actual model layer by layer. The "relu" activation function is used for all Dense layers except the last one, which uses a linear activation so the output can be any real value, as a regression target requires.

In [None]:
import keras
from keras.layers import Dense  # import only the layer that is used
model = keras.models.Sequential()
model.add(Dense(128, input_dim = 5, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(256, activation="relu"))
model.add(Dense(512, activation="relu"))
model.add(Dense(1, activation="linear"))

Compiling the model.

In [None]:
from keras import metrics
model.compile(optimizer="adam", loss = "mean_squared_error", metrics=[metrics.MeanSquaredError()])

Fitting the model to the training data.

In [None]:
model.fit(X_train, y_train, batch_size = 64, epochs = 5000, verbose = 0)

Summary of the model.

In [None]:
model.summary()
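
The parameter counts that `model.summary()` reports can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below applies that rule to the layer widths used in the model above; `dense_params` is a hypothetical helper name.

```python
# Layer widths copied from the Sequential model above: 5 inputs, nine hidden
# Dense layers (128, seven of 256, 512), and 1 output unit.
layer_sizes = [5, 128, 256, 256, 256, 256, 256, 256, 256, 512, 1]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# One entry per Dense layer, pairing each width with the next one.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)
print(total)
```

The total comes to 560,641 trainable parameters, most of them in the 256-to-256 and 256-to-512 layers.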

Calculating the loss of the model on the test set (in this case, it is mean squared error).

In [None]:
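# With a metric compiled in, evaluate() returns [loss, metric]; unpack both so loss is a scalar.
loss, mse = model.evaluate(X_test, y_test, verbose=1)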

In [None]:
print("Mean Squared Error:", loss)
print("Root Mean Squared Error:", np.sqrt(loss))
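
For reference, here is a minimal worked example of the two quantities printed above, using three made-up values rather than real model output. The root mean squared error is in the same units as the target (dB), which makes it easier to interpret than the mean squared error.

```python
import numpy as np

# Made-up actual and predicted sound levels (dB), for illustration only.
y_true = np.array([120.0, 125.0, 130.0])
y_hat = np.array([121.0, 123.0, 130.0])

errors = y_hat - y_true            # [1, -2, 0]
mse_toy = np.mean(errors ** 2)     # (1 + 4 + 0) / 3
rmse_toy = np.sqrt(mse_toy)        # back in dB
print(mse_toy, rmse_toy)
```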

In [None]:
y_pred = model.predict(X_test)

Creating a residual plot of predicted vs actual values to see if it is appropriate to calculate the correlation coefficient *r*.

In [None]:
import seaborn as sns
# Recent seaborn versions require x and y as keyword arguments; y_pred is flattened to 1-D.
sns.residplot(x=y_test, y=y_pred.flatten(), lowess=True, color="g")


A scatterplot of our predictions vs the actual values for the test set.

In [None]:
from matplotlib import pyplot
pyplot.scatter(y_test, y_pred)

Flattening our predictions from shape (n, 1) to shape (n,) so they match the shape of the test targets and we can calculate the correlation coefficient.

In [None]:
y_pred_1 = y_pred.flatten()
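
The sketch below shows why the flattening step matters, using made-up numbers in place of real predictions. A model with a single output unit returns a column of shape (n, 1), and NumPy broadcasting turns arithmetic between that column and a 1-D array into an (n, n) grid.

```python
import numpy as np

# Stand-ins: targets are 1-D, predictions come back as a column of shape (n, 1).
y_true = np.array([120.0, 125.0, 130.0])
y_col = np.array([[121.0], [123.0], [130.0]])

# Subtracting without flattening broadcasts to an (n, n) grid, not n residuals.
wrong = y_true - y_col
right = y_true - y_col.flatten()
print(wrong.shape, right.shape)
print(right)
```

Only the flattened version gives one residual per test row.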

Calculating the value of the correlation coefficient *r* between our predictions on the test set and the actual values.

In [None]:
import scipy.stats
corr, _ = scipy.stats.pearsonr(y_test, y_pred_1)
print("Pearson's correlation:", corr)

Calculating *r^2*, the square of the correlation coefficient.

In [None]:
r2 = np.power(corr,2)
print(r2)
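
One caveat worth illustrating: the squared Pearson correlation measures how linearly related the predictions and actual values are, so it ignores a constant offset. A related quantity, the coefficient of determination computed as 1 - SS_res / SS_tot, does penalize such an offset. The sketch below contrasts the two on made-up numbers where every prediction is 2 dB too high; it is an illustration, not part of the original analysis.

```python
import numpy as np

# Made-up values: every prediction is exactly 2 dB above the actual value.
y_true = np.array([120.0, 122.0, 124.0, 126.0])
y_hat = y_true + 2.0

# Squared Pearson correlation: perfect linear relationship.
r_toy = np.corrcoef(y_true, y_hat)[0, 1]
r_squared = r_toy ** 2

# Coefficient of determination: 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_hat) ** 2)          # 4 * 2**2 = 16
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_det = 1.0 - ss_res / ss_tot

print(r_squared, r2_det)
```

Here the squared correlation is 1 while the coefficient of determination is only 0.2, so a high *r^2* alone does not rule out systematic bias in the predictions.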

Linear Regression of Observed vs Predicted.

In [None]:
from scipy import stats
# linregress returns the standard error of the slope as its fifth value
slope, intercept, r_value, p_value, std_err = stats.linregress(y_test, y_pred_1)
print(slope, intercept)
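
As a guide to reading the slope and intercept: if predictions matched the observed values without systematic bias, the fitted line would have a slope near 1 and an intercept near 0. The sketch below uses made-up predictions that compress the range (slope 0.9) to show what a departure from that looks like; the numbers are not from the model.

```python
import numpy as np
from scipy import stats

# Made-up observed values (dB) and predictions that compress the range:
# pred = 0.9 * obs + 12, purely for illustration.
obs = np.array([110.0, 115.0, 120.0, 125.0, 130.0])
pred = 0.9 * obs + 12.0

fit = stats.linregress(obs, pred)
print(fit.slope, fit.intercept)

# Where the fitted line crosses the ideal line pred = obs.
crossover = fit.intercept / (1.0 - fit.slope)
print(crossover)
```

With a slope below 1, values under the crossover point are overpredicted and values above it are underpredicted, even though the correlation in this toy case is perfect.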