
# Regression

This notebook applies five common regression techniques to the same dataset:

- Regression using Artificial Neural Networks
- Multiple Linear Regression
- Decision Tree Regression
- Random Forest Regression
- Support Vector Regression


### Dataset Information

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work at full load. The features are hourly averages of four ambient variables: Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V). They are used to predict the net hourly electrical energy output (PE) of the plant.

Source: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

### Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

In [2]:
tf.__version__

'2.3.0'

## Data Preprocessing

### Importing the dataset

In [3]:
dataset = pd.read_excel('Folds5x2_pp.xlsx')

In [4]:
dataset.shape

(9568, 5)

In [5]:
dataset.head()

|   | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


**Matrix of Features:**

1. AT - Temperature - in the range 1.81°C to 37.11°C
2. V - Exhaust Vacuum - in the range 25.36 to 81.56 cm Hg
3. AP - Ambient Pressure - in the range 992.89 to 1033.30 millibar
4. RH - Relative Humidity - in the range 25.56% to 100.16%

**Dependent Variable Vector:**

1. PE - Net hourly electrical energy output - in the range 420.26 to 495.76 MW

In [6]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [7]:
X[:2]

array([[  14.96,   41.76, 1024.07,   73.17],
       [  25.18,   62.96, 1020.04,   59.08]])

In [8]:
y[:2]

array([463.26, 444.37])

### Missing Data

In [9]:
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# imputer.fit(X[:, 1:3])
# X[:, 1:3] = imputer.transform(X[:, 1:3])

### Encoding the Independent Variable

In [10]:
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# X = np.array(ct.fit_transform(X))

### Encoding the Dependent Variable

In [11]:
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# y = le.fit_transform(y)

### Splitting the dataset into the Training set and Test set

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [13]:
len(X_train)

7654

In [14]:
len(y_test)

1914
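
The two sizes above follow directly from `test_size = 0.2`. The sketch below reproduces them in plain Python, assuming the test set size is rounded up and the remainder goes to the training set (an assumption that is consistent with the 7654 and 1914 reported above).

```python
import math

n_samples = 9568      # rows in the dataset (from dataset.shape)
test_size = 0.2       # fraction passed to train_test_split

# Assumption: the test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```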

## Artificial Neural Networks

### Initializing the ANN

In [15]:
ann = tf.keras.models.Sequential()

`ann` is created as an instance of the Keras `Sequential` class, which represents the network as a linear stack of layers.

### Adding the input layer and the first hidden layer

In [16]:
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

A Dense layer is used so that the input layer is fully connected to the first hidden layer: each input neuron is connected to every neuron in the first hidden layer.

The number of input neurons is not specified here. Keras infers it from the shape of the matrix of features the first time data is fed to the network.
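
To make the "dense connection" concrete, here is a minimal NumPy sketch of what a fully connected layer with ReLU computes. It is an illustration only, not the Keras implementation: the weights `W` and biases `b` are hypothetical random values, and the helper name `dense_relu` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4   # AT, V, AP, RH
n_units = 6      # units=6 in the Dense layer

# Hypothetical randomly initialized parameters (Keras chooses its own initial values)
W = rng.normal(size=(n_features, n_units))
b = np.zeros(n_units)

def dense_relu(x, W, b):
    """Fully connected layer followed by ReLU: max(0, x @ W + b)."""
    return np.maximum(0.0, x @ W + b)

# First two rows of the matrix of features shown earlier
x = np.array([[14.96, 41.76, 1024.07, 73.17],
              [25.18, 62.96, 1020.04, 59.08]])

hidden = dense_relu(x, W, b)
n_params = W.size + b.size  # 4 * 6 weights + 6 biases

print(hidden.shape, n_params)
```

Every one of the 4 inputs contributes to every one of the 6 units, which is why the weight matrix has 4 x 6 entries.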

**List of activation functions available in the Keras submodule of TensorFlow**

https://www.tensorflow.org/api_docs/python/tf/keras/activations 

### Adding the second hidden layer

In [17]:
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

### Adding the output layer

In [18]:
ann.add(tf.keras.layers.Dense(units=1))

For classification problems, a Sigmoid or Softmax activation function is recommended for the output layer.

Since this is a regression problem, the output layer uses no activation function, so the network can output any real value.
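
A short sketch of why this matters here: a sigmoid squashes its input into the interval (0, 1), while the target PE lies roughly between 420 and 496 MW. The pre-activation values below are hypothetical and chosen only to show the contrast.

```python
import math

def sigmoid(z):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    """No activation: the pre-activation value is returned unchanged."""
    return z

# Hypothetical pre-activation values of the output neuron
for z in (-5.0, 0.0, 5.0, 463.26):
    print(z, sigmoid(z), linear(z))
```

With a sigmoid on the output, the network could never produce a value near 463.26, whatever its weights.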

### Compiling the ANN

In [19]:
ann.compile(optimizer = 'adam', loss = 'mean_squared_error')

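**Optimizer** - the algorithm that updates the network's weights during training. Here it is a variant of stochastic gradient descent.

**Stochastic Gradient Descent** - a technique that repeatedly updates the weights of the network, using the gradient of the loss computed on batches of training data, in order to reduce the loss during training.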

**Adam** - https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
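
The update rule can be sketched in a few lines of NumPy. Note that this is plain mini-batch gradient descent on a single linear neuron with mean squared error, not Adam (Adam additionally adapts the step size per parameter). The data is synthetic and the learning rate is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: y = 3*x0 - 2*x1 + 1 + noise
X = rng.normal(size=(256, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + 0.1 * rng.normal(size=256)

w = np.zeros(2)
b = 0.0
learning_rate = 0.05
batch_size = 32

def mse(w, b):
    """Mean squared error of the linear model on the full data."""
    return float(np.mean((X @ w + b - y) ** 2))

loss_before = mse(w, b)

for epoch in range(20):
    order = rng.permutation(len(X))                  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w + b - y[idx]              # prediction error on the batch
        grad_w = 2.0 * X[idx].T @ error / len(idx)   # d(MSE)/dw
        grad_b = 2.0 * error.mean()                  # d(MSE)/db
        w -= learning_rate * grad_w                  # step against the gradient
        b -= learning_rate * grad_b

loss_after = mse(w, b)
print(loss_before, loss_after)
```

The loss after training should be far smaller than before, which is the same behaviour `ann.fit` aims for on the real network.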

### Training the ANN model on the Training set

In [20]:
ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
...

<tensorflow.python.keras.callbacks.History at 0x7f7eb89b3cc0>

### Predicting the results of the Test set

In [21]:
y_pred = ann.predict(X_test)
np.set_printoptions(precision=2)  # print arrays with two decimal places
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[429.9  431.23]
 [460.99 460.01]
 [464.52 461.14]
 ...
 [471.73 473.26]
 [438.51 438.  ]
 [457.75 463.28]]


In [22]:
r2_score(y_test, y_pred)

0.9096879429909228
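
For reference, the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy; the helper name `r2` is made up for this example. It is applied only to the six (prediction, actual) pairs printed above, so its result will differ from the score on the full test set.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# The six (prediction, actual) pairs printed above; a small illustrative subset
y_pred_sample = [429.9, 460.99, 464.52, 471.73, 438.51, 457.75]
y_test_sample = [431.23, 460.01, 461.14, 473.26, 438.0, 463.28]

print(r2(y_test_sample, y_pred_sample))
```

A perfect model scores 1, and a model that always predicts the mean of the actual values scores 0.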

### Predicting based on custom inputs

In [23]:
new_pred = ann.predict([[15.6, 38.76, 1000.07, 80.17]])
print(new_pred)

[[456.8]]


## Multiple Linear Regression

In [24]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
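
Multiple linear regression models the target as an intercept plus a weighted sum of the features, with the weights chosen to minimize the squared residuals. The sketch below solves the same least-squares problem directly with NumPy on synthetic data. The coefficients in `true_coef` are invented for the illustration and are not estimates from the power plant dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only (not the power plant dataset)
X = rng.normal(size=(200, 4))
true_coef = np.array([-2.0, -0.3, 0.1, -0.15])
y = 450.0 + X @ true_coef + 0.01 * rng.normal(size=200)

# Append a column of ones so the intercept is estimated with the coefficients
X_design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of X_design @ beta ~ y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coef, intercept = beta[:-1], beta[-1]

print(coef, intercept)
```

On a fitted `LinearRegression` object, the corresponding values are available as `regressor.coef_` and `regressor.intercept_`.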

### Predicting the Test set results

In [25]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[431.43 431.23]
 [458.56 460.01]
 [462.75 461.14]
 ...
 [469.52 473.26]
 [442.42 438.  ]
 [461.88 463.28]]


### Evaluating the R2 score on the Test set

In [26]:
r2_score(y_test, y_pred)

0.9325315554761302

### Predicting based on custom inputs

In [27]:
new_pred = regressor.predict([[15.6, 38.76, 1000.07, 80.17]])
print(new_pred)

[464.1]


## Decision Tree Regression

In [28]:
from sklearn.tree import DecisionTreeRegressor
dtregressor = DecisionTreeRegressor(random_state = 0)
# Note: the tree is fit on the full dataset (X, y), not on X_train and y_train
dtregressor.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=0, splitter='best')

### Predicting a new result

In [29]:
dtregressor.predict([[15.6, 38.76, 1000.07, 80.17]])

array([467.89])

In [30]:
# R2 on the same data the tree was trained on: a training score, not a test score
dtregressor.score(X, y)

1.0
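
A score of 1.0 on the training data mostly shows that an unrestricted tree can memorize the points it was fit on. The sketch below illustrates the gap between training and held-out scores on synthetic noisy data. It is not a re-run on the power plant dataset, and the names `X_demo`, `y_demo` and `tree` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic noisy data for illustration only (not the power plant dataset)
X_demo = rng.uniform(size=(1000, 4))
y_demo = 450.0 - 30.0 * X_demo[:, 0] + 5.0 * rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_tr, y_tr)                 # fit on the training portion only

train_r2 = tree.score(X_tr, y_tr)    # data the tree has already seen
test_r2 = tree.score(X_te, y_te)     # held-out data

print(train_r2, test_r2)
```

To make the tree comparable with the ANN and the linear model, the same pattern would be applied with `X_train`, `y_train`, `X_test` and `y_test` from the split made earlier.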

## Random Forest Regression

In [31]:
from sklearn.ensemble import RandomForestRegressor
rfregressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
# Note: the forest is fit on the full dataset (X, y), not on X_train and y_train
rfregressor.fit(X, y)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=10, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

In [32]:
rfregressor.predict([[15.6, 38.76, 1000.07, 80.17]])

array([466.58])

In [33]:
# R2 on the same data the forest was trained on: a training score, not a test score
rfregressor.score(X, y)

0.9926872917918784
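
A random forest regressor predicts by averaging the predictions of its individual trees. The sketch below checks this on synthetic data using the fitted trees in `estimators_`. The data and the names `X_demo`, `y_demo` and `forest` are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data for illustration only
X_demo = rng.uniform(size=(300, 4))
y_demo = 450.0 - 30.0 * X_demo[:, 0] + rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

x_new = X_demo[:5]

# Prediction of each individual tree, shape (10, 5)
per_tree = np.array([tree.predict(x_new) for tree in forest.estimators_])

# The forest prediction is the mean over its trees
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(x_new)

print(np.allclose(manual_average, forest_prediction))
```

Averaging over trees trained on different bootstrap samples is what smooths out the memorization seen with a single unrestricted tree.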

## Support Vector Regression

### Scaling both independent and dependent variables

In [34]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()

In [35]:
scaled_X = sc_X.fit_transform(X)

In [36]:
scaled_y = y.reshape(len(y),1)

In [37]:
scaled_y = sc_y.fit_transform(scaled_y)

In [38]:
print(scaled_X)

[[-0.63 -0.99  1.82 -0.01]
 [ 0.74  0.68  1.14 -0.97]
 [-1.95 -1.17 -0.19  1.29]
 ...
 [ 1.57  1.58 -0.06 -2.52]
 [ 0.65  1.19  0.1  -0.75]
 [ 0.26  0.65  0.67 -0.37]]


In [39]:
print(scaled_y)

[[ 0.52]
 [-0.59]
 [ 2.  ]
 ...
 [-1.45]
 [-1.09]
 [-0.06]]
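
Standardization subtracts the column mean and divides by the column standard deviation, so each scaled column has mean 0 and standard deviation 1. The sketch below does this by hand for the five AT values shown by `dataset.head()`. Because it uses only five values, the numbers differ from the full-dataset output above.

```python
import numpy as np

# AT values from the first five rows shown by dataset.head()
at = np.array([14.96, 25.18, 5.11, 20.86, 10.82])

mean = at.mean()
std = at.std()            # population standard deviation (ddof=0)

scaled = (at - mean) / std            # what fit_transform computes per column
restored = scaled * std + mean        # what inverse_transform computes

print(scaled)
print(restored)
```

The second line of output recovers the original values, which is the role `sc_y.inverse_transform` plays when converting a scaled prediction back to MW.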


### Training the SVR model on the whole dataset

In [40]:
from sklearn.svm import SVR
svregressor = SVR(kernel = 'rbf')
# Caution: this fits on the unscaled X and y, not on scaled_X and scaled_y computed above
svregressor.fit(X, y)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [41]:
# R2 on the unscaled data the model was fit on
svregressor.score(X, y)

0.45806229804505816

### Predicting a new result

In [42]:
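# Caution: the model was fit on unscaled data, so scaling the input and inverse-transforming
# the output mixes scales; the value below is far outside the 420.26 to 495.76 MW range of PE
sc_y.inverse_transform(svregressor.predict(sc_X.transform([[15.6, 38.76, 1000.07, 80.17]])))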

array([7904.13])
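
For the scaling to be consistent, the model has to be fit in the same scaled space that is used at prediction time. The sketch below shows that pattern end to end on synthetic stand-in data. The data, the relationship between AT and the target, and all names ending in `_demo` are assumptions made for the illustration, so the printed value is not a prediction for the real plant.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data, for illustration only: four features in roughly the
# ranges listed earlier, and a target that falls as temperature rises
X_demo = np.column_stack([
    rng.uniform(1.81, 37.11, 500),      # AT
    rng.uniform(25.36, 81.56, 500),     # V
    rng.uniform(992.89, 1033.30, 500),  # AP
    rng.uniform(25.56, 100.16, 500),    # RH
])
y_demo = 495.0 - 2.0 * X_demo[:, 0] + rng.normal(size=500)

sc_X_demo = StandardScaler()
sc_y_demo = StandardScaler()
X_scaled = sc_X_demo.fit_transform(X_demo)
y_scaled = sc_y_demo.fit_transform(y_demo.reshape(-1, 1)).ravel()

svr = SVR(kernel='rbf')
svr.fit(X_scaled, y_scaled)             # fit in the scaled space

x_new = sc_X_demo.transform([[15.6, 38.76, 1000.07, 80.17]])
pred_scaled = svr.predict(x_new)        # prediction in the scaled space
pred = sc_y_demo.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

print(pred)
```

In the notebook itself, the equivalent change would be to fit on `scaled_X` and `scaled_y.ravel()` and to evaluate the score on the same scaled arrays.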