 <h1 style='background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;' >Random Forest </h1>


Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the mode of the trees' predicted classes (classification) or the mean of their predictions (regression).


<img src="https://www.frontiersin.org/files/Articles/284242/fnagi-09-00329-HTML/image_m/fnagi-09-00329-g001.jpg" width="700px">
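
The idea described above can be sketched by hand: train several decision trees on bootstrap samples and take the mode of their votes. This is a minimal illustration on synthetic data, not the notebook's model; the names `trees`, `votes`, and `majority` are made up for this example. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by hard voting, so this sketch follows the textbook definition, not that implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (illustrative only, not the competition data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(0, 1_000_000)),
    )
    trees.append(tree.fit(X[idx], y[idx]))

# One row of predictions per tree: shape (n_trees, n_samples).
votes = np.stack([t.predict(X) for t in trees])

# For labels 0/1, the mode is 1 when more than half of the trees vote 1.
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the voted ensemble:", (majority == y).mean())
```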

 <h1 style='background-color:LimeGreen; font-family:newtimeroman; font-size:180%; text-align:center; border-radius: 15px 50px;' > Data Description </h1>



In this competition, you will be predicting the probability [0, 1] of a binary target column.

The data contains binary features (bin_*), nominal features (nom_*), ordinal features (ord_*) as well as (potentially cyclical) day (of the week) and month features. The string ordinal features ord_{3-5} are lexically ordered according to string.ascii_letters.
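
As a hedged sketch of what the description above implies, the ordering given by `string.ascii_letters` can be turned into integer ranks, and the day and month features can be mapped onto a circle with sine and cosine. The toy values, the assumption that `ord_5` holds two-character strings, the assumption that `day` runs from 1 to 7 and `month` from 1 to 12, and the helper `two_letter_rank` are all illustrative and not taken from the competition files.

```python
import string

import numpy as np
import pandas as pd

# Toy frame with made-up values (not read from train.csv).
toy = pd.DataFrame({
    "ord_3": ["a", "c", "o", None],
    "ord_5": ["AP", "Zv", "aB", "zz"],
    "day": [1, 3, 7, 4],
    "month": [1, 6, 12, 9],
})

# Rank of each character in string.ascii_letters (lowercase first, then uppercase).
letter_rank = {ch: i for i, ch in enumerate(string.ascii_letters)}

# Single-letter feature: direct lookup; missing values stay NaN.
toy["ord_3_rank"] = toy["ord_3"].map(letter_rank)


def two_letter_rank(value):
    """Rank a two-character string by first letter, then second (hypothetical helper)."""
    if pd.isna(value):
        return np.nan
    first, second = value
    return letter_rank[first] * len(string.ascii_letters) + letter_rank[second]


toy["ord_5_rank"] = toy["ord_5"].map(two_letter_rank)

# Cyclical encoding: position on a circle, so the last value is adjacent to the first.
toy["day_sin"] = np.sin(2 * np.pi * (toy["day"] - 1) / 7)
toy["day_cos"] = np.cos(2 * np.pi * (toy["day"] - 1) / 7)
toy["month_sin"] = np.sin(2 * np.pi * (toy["month"] - 1) / 12)
toy["month_cos"] = np.cos(2 * np.pi * (toy["month"] - 1) / 12)

print(toy)
```

With this encoding, day 7 and day 1 end up close together in the (sin, cos) plane, which a plain integer code cannot express.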

The purpose of this competition is to explore various encoding strategies. Unlike the first Categorical Feature Encoding Challenge, the data for this challenge has missing values and feature interactions.

### Files
* train.csv - the training set
* test.csv - the test set; you must make predictions against this data
* sample_submission.csv - a sample submission file in the correct format


#### Dataset Link :
[Here](https://www.kaggle.com/c/cat-in-the-dat-ii/data)


<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Load Required Libraries </span>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Read Data</span>

In [None]:
df_train = pd.read_csv('../input/cat-in-the-dat-ii/train.csv')
df_test = pd.read_csv('../input/cat-in-the-dat-ii/test.csv')

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Convert Boolean Strings to Integers for Encoding</span>

In [None]:
# Map the string flags to integers. Values missing from the mapping (including NaN)
# stay NaN, so replace_nan can impute them instead of silently coding them as 0.
df_train['bin_3'] = df_train['bin_3'].map({'T': 1, 'F': 0})
df_train['bin_4'] = df_train['bin_4'].map({'Y': 1, 'N': 0})
df_test['bin_3'] = df_test['bin_3'].map({'T': 1, 'F': 0})
df_test['bin_4'] = df_test['bin_4'].map({'Y': 1, 'N': 0})

In [None]:
def replace_nan(data):
    for column in data.columns:
        if data[column].isna().sum() > 0:
            data[column] = data[column].fillna(data[column].mode()[0])


replace_nan(df_train)
replace_nan(df_test)

In [None]:
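features = []

# Label-encode every column except the last one (the target in train.csv).
# The encoder is fitted on train and test together so both share one mapping.
for col in df_train.columns[:-1]:
    rd = LabelEncoder()
    # Series.append was removed in pandas 2.0; pd.concat is the replacement.
    rd.fit(pd.concat([df_train[col], df_test[col]]))
    df_train[col] = rd.transform(df_train[col])
    df_test[col] = rd.transform(df_test[col])
    features.append(col)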

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
import pandas_profiling as pp
pp.ProfileReport(df_train)

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Split into train and test sets</span>

In [None]:
X = df_train.drop('target', axis=1)
Y = df_train['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 0)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Feature Scaling</span>

In [None]:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Random Forest Classifier</span>

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state = 0)
forest.fit(X_train, Y_train)

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = forest.predict(X_test)
cm = confusion_matrix(Y_test, y_pred)
print(cm)
accuracy_score(Y_test, y_pred)

In [None]:
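# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions,
# so [0,0]=TN, [0,1]=FP, [1,0]=FN, [1,1]=TP.
conf_matrix = confusion_matrix(Y_test, y_pred)

print(f'Confusion Matrix: \n{conf_matrix}\n')

sns.heatmap(conf_matrix, annot=True, fmt='d')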

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Performance Measures</span>

In [None]:
tn = conf_matrix[0,0]
fp = conf_matrix[0,1]
tp = conf_matrix[1,1]
fn = conf_matrix[1,0]

total = tn + fp + tp + fn
real_positive = tp + fn
real_negative = tn + fp

In [None]:
accuracy  = (tp + tn) / total # Accuracy Rate
precision = tp / (tp + fp) # Positive Predictive Value
recall    = tp / (tp + fn) # True Positive Rate
f1score  = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp) # True Negative Rate
error_rate = (fp + fn) / total # Misclassification Rate
prevalence = real_positive / total
miss_rate = fn / real_positive # False Negative Rate
fall_out = fp / real_negative # False Positive Rate

print(f'Accuracy    : {accuracy}')
print(f'Precision   : {precision}')
print(f'Recall      : {recall}')
print(f'F1 score    : {f1score}')
print(f'Specificity : {specificity}')
print(f'Error Rate  : {error_rate}')
print(f'Prevalence  : {prevalence}')
print(f'Miss Rate   : {miss_rate}')
print(f'Fall Out    : {fall_out}')
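
As a sanity check on formulas like the ones above, the hand-computed values can be compared with scikit-learn's metric functions. This is a small sketch on made-up labels; `y_true_demo` and `y_pred_demo` are illustrative names and the values are not results from this notebook. It relies on `confusion_matrix(y_true, y_pred)` putting true labels on the rows, so `.ravel()` yields TN, FP, FN, TP in that order for binary labels.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]].
tn_d, fp_d, fn_d, tp_d = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Manual formulas.
accuracy_d = (tp_d + tn_d) / (tn_d + fp_d + fn_d + tp_d)
precision_d = tp_d / (tp_d + fp_d)
recall_d = tp_d / (tp_d + fn_d)
f1_d = 2 * precision_d * recall_d / (precision_d + recall_d)

# Library equivalents should agree with the manual values.
print(accuracy_d, accuracy_score(y_true_demo, y_pred_demo))
print(precision_d, precision_score(y_true_demo, y_pred_demo))
print(recall_d, recall_score(y_true_demo, y_pred_demo))
print(f1_d, f1_score(y_true_demo, y_pred_demo))
```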

<span style="color: #0087e4; font-family: Segoe UI; font-size: 2.1em; font-weight: 200;">Classification Report</span>

In [None]:
from sklearn.metrics import classification_report
# classification_report expects the true labels first, then the predictions.
print(classification_report(Y_test, y_pred))
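
Since the task asks for a probability in [0, 1] rather than a hard label, it may be useful to look at `predict_proba` as well. The sketch below uses synthetic data and illustrative names (`demo_forest`, `proba_pos`); scoring with ROC AUC is an assumption made here as one common way to evaluate predicted probabilities, and the numbers it prints say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_forest = RandomForestClassifier(n_estimators=50, random_state=0)
demo_forest.fit(X_tr, y_tr)

# predict_proba returns one column per class, ordered as in demo_forest.classes_;
# column 1 is the probability of the positive class (label 1).
proba_pos = demo_forest.predict_proba(X_val)[:, 1]

auc = roc_auc_score(y_val, proba_pos)
print(f"Validation ROC AUC: {auc:.3f}")
```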