In [1]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
import random

###### First, we import the libraries used to build, fit, and evaluate the decision tree model on the given data.

In [2]:
data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### 1 ) Decision Tree

###### We read the dataset and separate it into the feature matrix X and the label vector Y. The target column has to be dropped from X: if the label were left among the features, the tree would simply read it off and report 100 % accuracy, which sounds good but says nothing about how the model performs on unseen data.


In [3]:
data = pd.read_csv('data.csv')
Y = data.target
X = data.drop(columns=['target'])

###### The train_test_split function of scikit-learn divides the whole dataset into a training set and a test set with a chosen ratio. Here we use 80 % of the data for training the model and the remaining 20 % for testing.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

###### Fitting the model on our training dataset

In [5]:
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

###### Getting the model's predictions for our test dataset

In [6]:
y_pred = clf.predict(X_test)

###### Finally, we compute the accuracy of our model on the test dataset

In [7]:
print("Decision Tree Accuracy:",metrics.accuracy_score(y_test, y_pred))

Decision Tree Accuracy: 0.7213114754098361


### 2 ) Random Forest

#### 2 - 1 ) Sampling

In [8]:
data = pd.read_csv('data.csv')
Y = data.target
X = data.drop(columns=['target'])

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

#### 2 - 2 ) Bagging (Bootstrap aggregating)

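###### We generate 5 training subsets, each with 150 rows. By default, pandas' sample draws rows without replacement, so no row is repeated within a subset, although the 5 subsets can have rows in common. Note that this is subsampling rather than a strict bootstrap, which would draw with replacement (replace=True).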

In [10]:
Xs_train = [X_train.sample(n = 150) for i in range(5)]

In [11]:
ys_train = [y_train[x_train.index] for x_train in Xs_train]

In [12]:
clfs = [DecisionTreeClassifier() for i in range(5)]
fitted_clfs = [clfs[i].fit(Xs_train[i], ys_train[i]) for i in range(5)]

In [13]:
preds = [fitted_clf.predict(X_test) for fitted_clf in fitted_clfs]

###### After getting each Decision Tree's predictions for the test dataset, we use voting to produce the final prediction. For each row of the test dataset we look at the targets predicted by all the trees and take the most frequently predicted one as the final target. Combining several Decision Trees this way can give better predictions than a single tree.

In [14]:
final_preds = []
for i in range(len(X_test)):
    ones = [pred[i] for pred in preds].count(1)
    zeros = [pred[i] for pred in preds].count(0)
    if zeros < ones:
        final_preds.append(1)
    else:
        final_preds.append(0)
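
The voting loop above can also be sketched in vectorized form. This is a minimal illustration, not part of the original notebook: `majority_vote` is a hypothetical helper name and `example_preds` is made-up data. It assumes binary 0/1 labels and keeps the same tie rule as the loop (a tie goes to 0).

```python
import numpy as np

def majority_vote(preds):
    """Return the per-row majority label from a list of 0/1 prediction arrays."""
    stacked = np.asarray(preds)      # shape: (n_models, n_rows)
    ones = stacked.sum(axis=0)       # number of models voting 1 for each row
    # Predict 1 only when strictly more than half of the models vote 1,
    # which matches the `zeros < ones` rule (ties go to 0).
    return (ones * 2 > stacked.shape[0]).astype(int)

# Hypothetical predictions from three trees on four rows
example_preds = [
    np.array([1, 0, 1, 0]),
    np.array([1, 1, 0, 0]),
    np.array([0, 1, 1, 0]),
]
voted = majority_vote(example_preds)
print(voted)
```

With the notebook's variables, `majority_vote(preds)` would play the role of `final_preds`.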

In [15]:
print("Bagging Accuracy:",metrics.accuracy_score(y_test, final_preds))

Bagging Accuracy: 0.7213114754098361


#### 2 - 3 ) Model with feature excluded

###### We train a decision tree with one selected column omitted, to check whether that column's presence in the dataset helps us build a better model or not.

In [16]:
def get_decision_tree_accuracy_without_column(column):
    data = pd.read_csv('data.csv')
    Y = data.target
    X = data.drop(columns=['target', column])
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
    clf = DecisionTreeClassifier()
    clf = clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    return metrics.accuracy_score(y_test, y_pred)

In [17]:
data = pd.read_csv('data.csv')
for column in data.columns:
    if column != 'target':
        print("Decision Tree Accuracy after removing column,", column, " : ")
        print(get_decision_tree_accuracy_without_column(column))

Decision Tree Accuracy after removing column, age  : 
0.7049180327868853
Decision Tree Accuracy after removing column, sex  : 
0.7049180327868853
Decision Tree Accuracy after removing column, cp  : 
0.7213114754098361
Decision Tree Accuracy after removing column, trestbps  : 
0.7213114754098361
Decision Tree Accuracy after removing column, chol  : 
0.7213114754098361
Decision Tree Accuracy after removing column, fbs  : 
0.6885245901639344
Decision Tree Accuracy after removing column, restecg  : 
0.7049180327868853
Decision Tree Accuracy after removing column, thalach  : 
0.6721311475409836
Decision Tree Accuracy after removing column, exang  : 
0.7704918032786885
Decision Tree Accuracy after removing column, oldpeak  : 
0.7540983606557377
Decision Tree Accuracy after removing column, slope  : 
0.7377049180327869
Decision Tree Accuracy after removing column, ca  : 
0.7377049180327869
Decision Tree Accuracy after removing column, thal  : 
0.6721311475409836


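We can see that, on this split, removing the exang feature gives the highest accuracy (0.770, versus 0.721 with all features). The gap corresponds to only 3 of the 61 test rows, and the tree is not seeded, so the ranking can change from run to run.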
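
Since a single train/test split is noisy, one way to check whether dropping a feature really helps is to average the accuracy over several folds. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of `data.csv`, the column names `f0`..`f5` are made up, and `cv_accuracy_without` is a hypothetical helper.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart-disease dataset)
X_arr, y = make_classification(n_samples=300, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

def cv_accuracy_without(column=None, folds=5):
    """Mean cross-validated accuracy of a tree, optionally dropping one column."""
    features = X if column is None else X.drop(columns=[column])
    tree = DecisionTreeClassifier(random_state=0)  # seeded for repeatability
    return cross_val_score(tree, features, y, cv=folds, scoring="accuracy").mean()

baseline = cv_accuracy_without()
scores = {col: cv_accuracy_without(col) for col in X.columns}

print("all features:", round(baseline, 3))
for col, acc in scores.items():
    print("without", col, ":", round(acc, 3))
```

A feature whose removal improves the mean accuracy across folds is a more convincing candidate for dropping than one that only helps on a single split.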

#### 2 - 4 ) Model with 5 Random Features

In [19]:
data = pd.read_csv('data.csv')
temp_columns = list(data.columns)
temp_columns.remove('target')
columns = random.sample(temp_columns, k = 5)
columns.append('target')
data = data[columns]
Y = data.target
X = data.drop(columns=['target'])
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("columns included in dataset for training the tree :")
print(columns)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

columns included in dataset for training the tree :
['trestbps', 'chol', 'thal', 'slope', 'sex', 'target']
Accuracy: 0.639344262295082


#### 2 - 5 ) Random Forest with Random Features

In [None]:
def get_random_features():
    data = pd.read_csv('data.csv')
    temp_columns = list(data.columns)
    temp_columns.remove('target')
    columns = random.sample(temp_columns, k = 5)
    columns.append('target')
    return columns

In [20]:
def get_random_forest_accuracy_with(columns):
    data = pd.read_csv('data.csv')
    data = data[columns]
    Y = data.target
    X = data.drop(columns=['target'])
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
    Xs_train = [X_train.sample(n = 150) for i in range(5)]
    ys_train = [y_train[x_train.index] for x_train in Xs_train]
    clfs = [DecisionTreeClassifier() for i in range(5)]
    fitted_clfs = [clfs[i].fit(Xs_train[i], ys_train[i]) for i in range(5)]
    preds = [fitted_clf.predict(X_test) for fitted_clf in fitted_clfs]
    final_preds = []
    for i in range(len(X_test)):
        ones = [pred[i] for pred in preds].count(1)
        zeros = [pred[i] for pred in preds].count(0)
        if zeros < ones:
            final_preds.append(1)
        else:
            final_preds.append(0)
    return metrics.accuracy_score(y_test, final_preds)

In [21]:
data = pd.read_csv('data.csv')
temp_columns = list(data.columns)
temp_columns.remove('target')
columns = random.sample(temp_columns, k = 5)
columns.append('target')
print("features included in the Model: ")
print(columns)

features included in the Model: 
['restecg', 'slope', 'chol', 'sex', 'cp', 'target']


In [22]:
print("Random Forest Accuracy: ",get_random_forest_accuracy_with(columns))

Random Forest Accuracy:  0.7704918032786885


##### Bootstrapping

Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to have the fraction (1 - 1/e) (≈ 63.2 %) of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. Then, m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

It reduces variance and helps to avoid overfitting.
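
The 1 - 1/e figure can be checked numerically. The following is a small sketch, assuming a hypothetical training-set size of 10,000 and an arbitrary seed; it draws one bootstrap sample of row indices and compares the fraction of distinct rows with the theoretical value.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
n = 10_000                       # hypothetical training-set size

# One bootstrap sample: n row indices drawn uniformly with replacement
bootstrap_idx = rng.integers(0, n, size=n)

unique_fraction = np.unique(bootstrap_idx).size / n
expected = 1 - 1 / np.e

print(f"unique fraction: {unique_fraction:.3f}")
print(f"1 - 1/e:         {expected:.3f}")
```

On a pandas DataFrame, the same kind of sample can be drawn with `sample(n=len(df), replace=True)`.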

##### Overfitting

Overfitting is the phenomenon in which a model fits the given training data so tightly that it becomes inaccurate in predicting the outcomes for unseen data. In decision trees, overfitting occurs when the tree is grown until it perfectly fits every sample in the training dataset.

A fully grown tree is a high-variance model, which is why bagging, by combining many such trees trained on different samples, helps against overfitting.
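
The effect described in this section can be illustrated by comparing training and test accuracy of an unrestricted tree with those of a depth-limited one. This is a sketch on synthetic data (an assumption; it does not use `data.csv`), and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Unrestricted tree: keeps splitting until all leaves are pure
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: cannot memorize every training sample
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train = deep.score(X_train, y_train)
deep_test = deep.score(X_test, y_test)
shallow_train = shallow.score(X_train, y_train)
shallow_test = shallow.score(X_test, y_test)

print("deep tree    train/test:", deep_train, deep_test)
print("shallow tree train/test:", shallow_train, shallow_test)
```

A large gap between training and test accuracy for the unrestricted tree is the usual symptom of overfitting.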



##### Random Forest

Random forests differ in only one way from bagging: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging".

The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing them to become correlated.
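
For comparison with the hand-built version in this notebook, which picks one random set of 5 features for the whole ensemble, scikit-learn's `RandomForestClassifier` draws a fresh random feature subset at each split. The sketch below runs on synthetic data (an assumption; not the heart-disease dataset), and the parameter values are illustrative choices rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features, like the notebook's dataset
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # size of the random feature subset tried at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    random_state=0,
).fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
print("Random forest accuracy on synthetic data:", forest_accuracy)
```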

##### Conclusion

Although this dataset and the numbers we used for building the trees were not ideal, and although running the notebook again would give different results because of its random nature, bagging usually achieves higher accuracy than a single decision tree because it reduces overfitting; in the run shown here the two happened to tie at 0.721. We can also conclude that a random forest can be better than both the single decision tree and plain bagging (0.770 in the run shown), because it has not just the benefits of bagging but also reduces the correlation between trees.