# Comparison of ANN vs Statistical Model for Regression

In this notebook, we compare the performance of a statistical model and an Artificial Neural Network (ANN) on a simple **Linear Regression** problem.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import r2_score,mean_squared_error
import time
import math

We will be using the Employee Salary Dataset. The dataset has only two columns: #Years of Experience and Salary. Let's look at the dataset in detail.

In [None]:
dataset = pd.read_csv('../input/random-salary-data-of-employes-age-wise/Salary_Data.csv')

In [None]:
dataset.shape

In [None]:
dataset.isnull().sum()

In [None]:
dataset.head()

Now, let's take #Years of Experience as our X value and Salary as our y value. Remember that X has to be a matrix (2-D, one column per feature) and y has to be a vector (1-D).

In [None]:
X = dataset.iloc[:,0:1].values 
y = dataset.iloc[:,1].values
print("X Shape: ",X.shape,'\ny Shape :',y.shape)
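The matrix/vector distinction comes down to how the column is sliced. Here is a minimal sketch on a small made-up array (the values are illustrative, not taken from the dataset); `iloc` slicing on a DataFrame follows the same rule before `.values` is called.

```python
import numpy as np

# Hypothetical stand-in for the salary data: column 0 = years, column 1 = salary.
toy = np.array([[1.0, 40000.0],
                [2.0, 45000.0],
                [3.0, 52000.0]])

X_matrix = toy[:, 0:1]  # a slice (0:1) keeps the column axis, so the result is 2-D
x_vector = toy[:, 0]    # an integer index drops the column axis, so the result is 1-D
y_vector = toy[:, 1]    # the target is taken as a 1-D vector

print(X_matrix.shape, x_vector.shape, y_vector.shape)
```

sklearn estimators expect the 2-D form for the features, which is why `0:1` is used instead of `0`.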

In [None]:
plt.figure(figsize=(20,10))
plt.scatter(X,y, color='red')
plt.title('Salary vs Exp Scatterplot')
plt.xlabel('years of exp')
plt.ylabel('Salary')
plt.show()

# Train Test Split

As we have only 30 observations, we will use an 80-20 train-test split.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=0)
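To make the split sizes concrete, here is a rough sketch of what an 80-20 shuffle split does with 30 rows. It uses a plain NumPy permutation as a stand-in, so the seed below will not reproduce the rows that `train_test_split(..., random_state=0)` selects; only the sizes carry over.

```python
import math
import numpy as np

n_samples = 30
test_size = 0.2

# The test count is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Illustrative shuffle (an assumption, not sklearn's internal procedure).
rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)  # 24 6
```

With only 6 test points, the test metrics reported later can shift noticeably from one random split to another.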

# Statistical Model 

We will use sklearn's LinearRegression class to build our statistical regression model. We will also measure how long the fit takes, using Python's time module.

In [None]:
start_1 = time.time()
regressor = LinearRegression()
regressor.fit(X_train,y_train)
end_1 = time.time()
print('Time Taken :', end_1-start_1,'seconds')
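A single `time.time()` measurement of such a fast fit can be dominated by noise. One possible alternative, sketched below, repeats the call and uses `time.perf_counter()`; the helper name `time_call` and the dummy workload are hypothetical and only illustrate the pattern.

```python
import time

def time_call(fn, repeats=5):
    """Return the shortest wall-clock duration (in seconds) of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Dummy workload standing in for a model fit.
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"best of 5 runs: {elapsed:.6f} seconds")
```

In the notebook, the same helper could wrap something like `lambda: LinearRegression().fit(X_train, y_train)`.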

In [None]:
y_pred_reg = regressor.predict(X_test)
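With a single feature, ordinary least squares has a closed-form solution, which is part of why the statistical model fits so quickly. The sketch below computes the slope and intercept by hand on made-up data that lie exactly on a line; the numbers are illustrative assumptions, not values from the salary dataset.

```python
import numpy as np

# Made-up data lying exactly on the line y = 30000 + 9000 * x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 30000.0 + 9000.0 * x

# Closed-form simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

For the fitted model in this notebook, the corresponding values are available as `regressor.coef_` and `regressor.intercept_`.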

# Artificial Neural Network

We will now build our ANN model and fit the data. We will also have a look at the time taken.

In [None]:
start_2 = time.time()

model = Sequential()
model.add(Dense(32,activation='relu',input_dim=1))
model.add(Dense(32,activation='relu'))
model.add(Dense(32,activation='relu'))
model.add(Dense(1))
opt  = keras.optimizers.RMSprop(learning_rate = 0.0099)
model.compile(optimizer=opt,loss='mean_squared_error')
model.fit(X_train,y_train,epochs=500)

end_2 = time.time()
print('\nTime Taken :',end_2-start_2,'seconds')
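To put the size of this network in perspective, the arithmetic below counts its trainable parameters and the number of weight updates. It assumes each Dense layer keeps its default bias term, that the 80-20 split of 30 rows leaves 24 training samples, and that Keras's default batch size of 32 applies because `fit` is not given one.

```python
import math

# Layer widths: 1 input feature, three hidden layers of 32 units, 1 output.
layer_sizes = [1, 32, 32, 32, 1]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

# Assumed training setup: 24 samples, batch size 32, 500 epochs.
steps_per_epoch = math.ceil(24 / 32)
total_updates = steps_per_epoch * 500

print(per_layer, total_params)  # [64, 1056, 1056, 33] 2209
print(total_updates)            # 500
```

The linear model, by comparison, has two parameters (one slope and one intercept) and needs no iterative updates.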

In [None]:
y_pred_ann = model.predict(X_test)

# Let's visualize the results on the training dataset

Here, the blue line is the statistical model's prediction, the green line is the ANN's prediction, and the red points are the actual values.

In [None]:
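# Sort the inputs so each prediction curve is drawn from left to right
X_sorted = np.sort(X_train, axis=0)
plt.figure(figsize=(20,10))
plt.scatter(X_train,y_train, color='red', label='Actual')
plt.plot(X_sorted,regressor.predict(X_sorted), color='blue', label='Statistical model')
plt.plot(X_sorted,model.predict(X_sorted), color='green', label='ANN')
plt.title('Salary vs Exp (training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()  # uses the labels set above
plt.show()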

# Let's look at the predictions on the test values

In [None]:
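# The curves are the fitted models evaluated over the (sorted) training range;
# the red points are the held-out test observations.
X_sorted = np.sort(X_train, axis=0)
plt.figure(figsize=(20,10))
plt.scatter(X_test,y_test, color='red', label='Actual (test)')
plt.plot(X_sorted,regressor.predict(X_sorted), color='blue', label='Statistical model')
plt.plot(X_sorted,model.predict(X_sorted), color='green', label='ANN')
plt.title('Salary vs Exp (test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()  # uses the labels set above
plt.show()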

# Accuracy Measures & Performance

In [None]:
print('Criteria \t | Statistical Model \t\t| ANN ')
print("R-Sq \t\t | ", r2_score(y_test, y_pred_reg),' \t\t| ', r2_score(y_test, y_pred_ann))
print("RMSE \t\t | ", math.sqrt(mean_squared_error(y_test, y_pred_reg)),' \t\t| ', math.sqrt(mean_squared_error(y_test, y_pred_ann)))
print("Time \t\t | ", end_1 - start_1,' \t| ', end_2 - start_2)
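For reference, both criteria can be computed directly from their definitions. The sketch below uses made-up numbers, with the predictions shaped `(n, 1)` to mimic what a Keras model with a single output unit returns; flattening them first avoids accidental broadcasting against the 1-D targets.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([[2.5], [5.5], [7.0], [9.0]])  # shape (4, 1), like model.predict output
y_hat = y_hat.ravel()                            # flatten to shape (4,)

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # R-squared
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # root mean squared error

print(r2, rmse)
```

R-squared is unitless, while RMSE is in the same units as the target (salary, in this notebook).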

# Conclusion

Both models performed well, with only a small difference between them. With respect to computational time and cost, however, the statistical approach is the clear winner. **For simple problems of this sort, it is always better to use a simpler method such as the statistical approach.**