Random Forest Algorithm with Python and Scikit-Learn
-------------------------------------------------------------------------------

__Random forest is a supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where you combine different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.__

How the Random Forest Algorithm Works
------------------------------------------------------------

The following are the basic steps involved in performing the random forest algorithm:

>1 Pick N random records from the dataset.

>2 Build a decision tree based on these N records.

>3 Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

>4 In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value is calculated by taking the average of the values predicted by all the trees in the forest.

>Or, in case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.
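
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not scikit-learn's actual implementation: the synthetic data, the names `n_trees`, `n_records`, `per_tree` and `forest_prediction`, and the choice to draw records with replacement are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (assumption: a stand-in for a real dataset)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

n_trees = 25        # step 3: number of trees in the forest
n_records = len(X)  # N: number of records drawn for each tree

trees = []
for i in range(n_trees):
    # step 1: pick N random records (drawn with replacement in this sketch)
    idx = rng.integers(0, len(X), size=n_records)
    # step 2: build a decision tree on these N records
    tree = DecisionTreeRegressor(random_state=i).fit(X[idx], y[idx])
    trees.append(tree)

# step 4 (regression): average the values predicted by all the trees
X_new = np.array([[0.5, -1.0], [2.0, 1.5]])
per_tree = np.array([tree.predict(X_new) for tree in trees])  # shape (n_trees, 2)
forest_prediction = per_tree.mean(axis=0)
print(forest_prediction)

# step 4 (classification): majority vote over hypothetical per-tree class votes
votes = np.array([0, 1, 1, 0, 1])
majority_class = np.bincount(votes).argmax()
print(majority_class)  # 1
```

In practice the `RandomForestRegressor` and `RandomForestClassifier` classes used later do this work for you, and they can additionally restrict each split to a random subset of features through the `max_features` parameter.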

Advantages of using Random Forest
-----------------------------------------------------
As with any algorithm, there are advantages and disadvantages to using it. In the next two sections we'll take a look at the pros and cons of using random forest for classification and regression.

1> The random forest algorithm is less prone to overfitting than a single decision tree, since there are multiple trees and each tree is trained on a random subset of the data. Basically, the random forest algorithm relies on the power of "the crowd"; averaging over many trees reduces the variance of the overall model.

2> This algorithm is very stable. Even if a new data point is introduced in the dataset, the overall algorithm is not affected much, since new data may impact one tree, but it is very hard for it to impact all the trees.

3> The random forest algorithm works well when you have both categorical and numerical features. 

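4> The random forest algorithm also works well when the data has not been scaled, because tree splits depend only on the ordering of the feature values. It can also cope with missing values, although whether they can be passed in directly depends on the implementation and library version.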
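
The point about scaling can be checked with a small experiment. The sketch below uses synthetic data (an assumption, not the tutorial's dataset) and rescales each feature by a different power of two, chosen so that the rescaling is exact in floating point. The names `scales`, `forest_raw` and `forest_rescaled` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption: stands in for a real dataset)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Give every feature a very different scale
scales = np.array([1.0, 8.0, 1024.0, 65536.0])
X_rescaled = X * scales

# Same hyperparameters and seed, trained on raw and on rescaled features
forest_raw = RandomForestRegressor(n_estimators=20, random_state=0)
forest_raw.fit(X[:150], y[:150])
forest_rescaled = RandomForestRegressor(n_estimators=20, random_state=0)
forest_rescaled.fit(X_rescaled[:150], y[:150])

pred_raw = forest_raw.predict(X[150:])
pred_rescaled = forest_rescaled.predict(X_rescaled[150:])

# The largest difference between the two sets of predictions should be negligible
print(np.max(np.abs(pred_raw - pred_rescaled)))
```

Because each split only asks whether a feature value lies above or below a threshold, multiplying a feature by a positive constant moves the thresholds but should leave the resulting partitions, and therefore the predictions, unchanged.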

Disadvantages of using Random Forest
---------------------------------------------------------
1> A major disadvantage of random forests lies in their complexity. They require much more computational resources, owing to the large number of decision trees joined together.

2> Due to their complexity, they require much more time to train than other comparable algorithms.

1> Using Random Forest for Regression
----------------------------------------------------------

Problem Definition : The problem here is to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population with a driving license.

In [3]:
# Import Libraries
import pandas as pd  
import numpy as np  

dataset = pd.read_csv('./datasets_n_images/datasets_module_4/petrol_consumption.csv')  
dataset.head()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


In [4]:
dataset.describe()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
count,48.0,48.0,48.0,48.0,48.0
mean,7.668333,4241.833333,5565.416667,0.570333,576.770833
std,0.95077,573.623768,3491.507166,0.05547,111.885816
min,5.0,3063.0,431.0,0.451,344.0
25%,7.0,3739.0,3110.25,0.52975,509.5
50%,7.5,4298.0,4735.5,0.5645,568.5
75%,8.125,4578.75,7156.0,0.59525,632.75
max,10.0,5342.0,17782.0,0.724,968.0


In [24]:
# Preparing the Data
# divide the data into attributes and labels

X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

# divide the data into training and test sets
# (test_size defaults to 0.25; random_state is the seed that makes the split reproducible)

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


# Training and Making Predictions
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=70, max_depth=2, max_features=3, random_state=9)  
regressor.fit(X_train, y_train)  
y_pred = regressor.predict(X_test)  

# The RandomForestRegressor class of the sklearn.ensemble library
# is used to solve regression problems via random forest. 
# The most important parameter of the RandomForestRegressor class 
# is the n_estimators parameter. 
# This parameter defines the number of trees in the random forest.

# Evaluating the Algorithm
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 46.419105405170214
Mean Squared Error: 3736.588868290004
Root Mean Squared Error: 61.12764405970513


With 70 trees, the root mean squared error is 61.13, which is greater than 10 percent of the average petrol consumption, i.e. 576.77. This may indicate, among other things, that we have not used enough estimators (trees).
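
The comparison against 10 percent of the mean can be checked directly from the printed output. The snippet below is only a worked piece of arithmetic: it reuses the root mean squared error printed above and the mean of `Petrol_Consumption` from `dataset.describe()`. The variable names are illustrative.

```python
rmse = 61.12764405970513       # root mean squared error printed above
mean_consumption = 576.770833  # mean of Petrol_Consumption from dataset.describe()

relative_rmse = rmse / mean_consumption
print(f"RMSE is {relative_rmse:.2%} of the mean consumption")
# RMSE is 10.60% of the mean consumption
```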

In [20]:
# If the number of estimators is changed to 60, the results are as follows:
regressor = RandomForestRegressor(n_estimators=60, random_state=0)  
regressor.fit(X_train, y_train)  
y_pred = regressor.predict(X_test)  

# Evaluating the Algorithm
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 47.69097222222222
Mean Squared Error: 3290.683854166667
Root Mean Squared Error: 57.364482514589696


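With 60 trees, the root mean squared error is 57.36, which is close to 10 percent of the average petrol consumption, i.e. 576.77. Note that this second model also drops the max_depth and max_features limits, so the change is not due to the number of trees alone. (These values may change depending on the train-test split.)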
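
To get a feel for how much the error depends on the train-test split, the same model can be evaluated on several different splits. The sketch below is hedged: it uses a synthetic 48-row, 4-feature dataset as a stand-in (an assumption, since the petrol CSV file is not bundled here), and `rmse_per_split` is a name made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 48 rows and 4 features (assumption: not the petrol data)
X, y = make_regression(n_samples=48, n_features=4, noise=20.0, random_state=0)

rmse_per_split = []
for seed in range(5):
    # A different random_state gives a different train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed)
    model = RandomForestRegressor(n_estimators=60, random_state=0)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    rmse_per_split.append(rmse)
    print(f"split seed {seed}: RMSE = {rmse:.2f}")

print(f"mean RMSE = {np.mean(rmse_per_split):.2f}, std = {np.std(rmse_per_split):.2f}")
```

With a test set of only about 10 records, the spread between splits can be large, which is worth keeping in mind when comparing two models on a single split.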

2> Using Random Forest for Classification
------------------------------------------------------------

Problem Definition : The task here is to predict whether a bank currency note is authentic or not based on four attributes of its wavelet-transformed image, i.e. variance, skewness, curtosis (kurtosis), and entropy.

In [22]:
# doing the minimum necessary imports
# more modules would be imported as and when needed

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
%matplotlib inline

# reading data from CSV file. 
# reading bank currency note data into pandas dataframe.
bankdata = pd.read_csv("./datasets_n_images/datasets_module_4/bill_authentication.csv")  

# Exploratory Data Analysis
print(bankdata.shape)  
print("------------")
print(bankdata.head()) 

(1372, 5)
------------
   Variance  Skewness  Curtosis  Entropy  Class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0


In [23]:
bankdata['Class'].value_counts()

0    762
1    610
Name: Class, dtype: int64

In [27]:
# Data Preprocessing
# Data preprocessing involves 
# (1) Dividing the data into attributes and labels and 
# (2) dividing the data into training and testing sets.

# (1) divide the data into attributes and labels
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']


# the final preprocessing step is to divide data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)



# Training the Algorithm.
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=30, random_state=0)  
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)


# Evaluating the Algorithm
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))

[[156   0]
 [  1 118]]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       156
           1       1.00      0.99      1.00       119

    accuracy                           1.00       275
   macro avg       1.00      1.00      1.00       275
weighted avg       1.00      1.00      1.00       275

0.9963636363636363


The accuracy achieved by our random forest classifier with 30 trees is 99.64%. 
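
The reported metrics can be re-derived by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. This is only a worked calculation; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[156, 0],
               [1, 118]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / all truly in that class

print(accuracy)   # 274 / 275, about 0.9964
print(precision)  # about 0.99 for class 0 and 1.00 for class 1
print(recall)     # about 1.00 for class 0 and 0.99 for class 1
```

The single misclassified note is a class 1 record predicted as class 0, which is why the precision of class 0 and the recall of class 1 both fall just short of 1.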

As an exercise, try n_estimators values of 10 and 50. 
Do you observe any difference?
We find that n_estimators=20 gives the best result (the lowest error).
Also, please note that similarly good results can recur at regular intervals of n_estimators, for example 20, 40, 60, ...
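
One way to approach this exercise is to loop over several values of n_estimators and record the accuracy for each. The sketch below uses synthetic four-feature binary data from `make_classification` as a stand-in (an assumption, not the bank note data), so the numbers it prints will not match the ones above. `accuracy_by_trees` is a name made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with four features (assumption: stands in for the bank note data)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

accuracy_by_trees = {}
for n_trees in (10, 20, 30, 50):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X_train, y_train)
    accuracy_by_trees[n_trees] = accuracy_score(y_test, clf.predict(X_test))

for n_trees, acc in accuracy_by_trees.items():
    print(f"n_estimators={n_trees}: accuracy={acc:.4f}")
```

Fixing `random_state` in both the split and the classifier keeps the comparison between tree counts repeatable from run to run.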

__Note:
On real-world datasets we will almost never get 100 percent accuracy. Realistic results are typically between 85 and 97 percent.__

In [28]:
from sklearn.ensemble import GradientBoostingRegressor

In [49]:
gbreg = GradientBoostingRegressor(n_estimators=20,learning_rate=0.01)

In [50]:
gbreg = GradientBoostingRegressor(n_estimators=20)

In [51]:
dataset = pd.read_csv('./datasets_n_images/datasets_module_4/petrol_consumption.csv')  
dataset.head()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


In [52]:

X = dataset.drop('Petrol_Consumption', axis=1)  
y = dataset['Petrol_Consumption']  

# divide the data into training and test sets
# (test_size defaults to 0.25; random_state is the seed that makes the split reproducible)

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


In [53]:
gbreg.fit(X_train, y_train)

GradientBoostingRegressor(n_estimators=20)

In [54]:
y_pred = gbreg.predict(X_test)  

# The GradientBoostingRegressor class of the sklearn.ensemble library
# is used to solve regression problems via gradient boosting. 
# The n_estimators parameter defines the number of boosting stages,
# i.e. the number of trees that are fitted one after another.

# Evaluating the Algorithm
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 52.59556628817243
Mean Squared Error: 3600.0085379296374
Root Mean Squared Error: 60.00007114937146
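
The two `gbreg` definitions above differ only in `learning_rate`. The sketch below illustrates, on synthetic data (an assumption, not the petrol consumption data), how a small learning rate combined with only 20 boosting stages leaves the model close to its starting point. `train_mse` is a name made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption: stands in for the petrol consumption data)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

train_mse = {}
for lr in (0.01, 0.1):  # 0.1 is scikit-learn's default learning_rate
    model = GradientBoostingRegressor(n_estimators=20, learning_rate=lr,
                                      random_state=0)
    model.fit(X, y)
    # Error on the training data itself, to show how far the fit has progressed
    train_mse[lr] = mean_squared_error(y, model.predict(X))
    print(f"learning_rate={lr}: training MSE = {train_mse[lr]:.1f}")
```

Each boosting stage adds only `learning_rate` times a new tree's prediction, so a smaller learning rate generally needs more stages to reach the same fit.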
