## Question: 3

Train and fine-tune a decision tree on the wine dataset using the following steps:

  1. Use load_wine() to load the wine dataset
  2. Split the dataset into train and test sets
  3. Use randomized search CV to tune the Decision Tree's hyperparameters
  4. Try to achieve an accuracy of at least 85%


Grow a random forest using the following steps:

  1. Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for it.
  2. Train 1 decision tree on each subset, using the best hyperparameter values found in the previous question.
  3. Evaluate all the trees on the test dataset. Are they performing better than the tree created in the previous question?

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score

### Step 1: Load wine dataset

In [2]:
wine = load_wine()
X, y = wine.data, wine.target

### Step 2: Split the dataset into train and test dataset

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
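
The split above is purely random, so the class proportions in the test set can drift from those of the full dataset. As a minimal sketch, assuming a stratified split is acceptable for this exercise (the `stratify=y` argument is an addition, not part of the original solution), the class balance can be inspected like this:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Same split settings as above, plus stratify=y (an assumption, not in the original)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train size:", X_tr.shape[0], "test size:", X_te.shape[0])
print("class counts (full):", np.bincount(y))
print("class counts (test):", np.bincount(y_te))
```

The variable names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen here only to avoid clashing with the notebook's own split.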

### Step 3: Use random search CV to hyperparameter tune the Decision Tree

In [4]:
param_dist = {
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'criterion': ['gini', 'entropy']
}

In [5]:
tree = DecisionTreeClassifier()
random_search = RandomizedSearchCV(tree, param_distributions=param_dist, n_iter=100, cv=5, random_state=42)
random_search.fit(X_train, y_train)

In [6]:
# Best hyperparameters
best_params = random_search.best_params_
best_params

{'criterion': 'gini',
 'max_depth': 17,
 'min_samples_leaf': 1,
 'min_samples_split': 17}
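
Beyond `best_params_`, the fitted search object also exposes the mean cross-validated score of the winning candidate and a model already refit on the full training set. The following is a rough, self-contained sketch of reading those attributes. It assumes a smaller `n_iter=20` to keep it quick and a fixed `random_state` on the tree for repeatability, neither of which is in the original, so the parameters it finds may differ from those shown above. Note that `randint(low, high)` excludes `high`, so `randint(1, 20)` samples values from 1 to 19.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_dist = {
    "max_depth": randint(1, 20),          # samples 1..19 (upper bound exclusive)
    "min_samples_split": randint(2, 20),  # samples 2..19
    "min_samples_leaf": randint(1, 20),   # samples 1..19
    "criterion": ["gini", "entropy"],
}

# n_iter=20 and the tree's random_state are assumptions made for this sketch
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(f"mean CV accuracy of best candidate: {search.best_score_:.3f}")

# With the default refit=True, best_estimator_ is already trained on all of X_train
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Using `best_estimator_` directly avoids retraining a fresh tree from `best_params_`, although both routes should give an equivalent model when the tree's `random_state` is fixed.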

### Step 4: Try to achieve an accuracy of at least 85%

In [7]:
best_tree = DecisionTreeClassifier(**best_params)
best_tree.fit(X_train, y_train)
y_pred = best_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

In [8]:
print(f"Best Decision Tree Accuracy: {accuracy*100:.2f}%")

Best Decision Tree Accuracy: 94.44%


## Random Forest:

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

### Step 1: Create 10 subsets of the training dataset using ShuffleSplit

In [10]:
n_splits = 10
shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)

### Step 2: Train 1 decision tree on each subset using the best hyperparameter values

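In [11]:
# RandomForestClassifier grows its n_estimators trees internally from bootstrap samples
forest = RandomForestClassifier(**best_params, n_estimators=n_splits, random_state=42)

In [12]:
# Caution: every call to fit() retrains the whole forest from scratch, so after this loop
# the forest holds 10 trees grown from the last subset only, not one tree per subset.
for train_index, _ in shuffle_split.split(X_train):
    forest.fit(X_train[train_index], y_train[train_index])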
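
The exercise asks for one decision tree per subset. A minimal sketch of that reading, assuming the best hyperparameters reported above and a fixed `random_state` on each tree (an addition for repeatability), trains a separate `DecisionTreeClassifier` on each `ShuffleSplit` subset and then combines the trees by majority vote. The names `trees`, `tree_accuracies`, and `vote_accuracy` are illustrative, and the accuracies this produces will not necessarily match the outputs shown in this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters reported by the randomized search above
params = {"criterion": "gini", "max_depth": 17,
          "min_samples_leaf": 1, "min_samples_split": 17}
base_tree = DecisionTreeClassifier(**params, random_state=42)

# Each of the 10 subsets is a random 80% of the training rows
splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

trees = []
for subset_idx, _ in splitter.split(X_train):
    fitted = clone(base_tree)  # fresh, unfitted copy for every subset
    fitted.fit(X_train[subset_idx], y_train[subset_idx])
    trees.append(fitted)

# Accuracy of every individual tree on the held-out test set
tree_accuracies = [accuracy_score(y_test, t.predict(X_test)) for t in trees]
for i, acc in enumerate(tree_accuracies, start=1):
    print(f"Tree {i} accuracy: {acc * 100:.2f}%")

# Majority vote: the most frequent predicted label per test sample
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (10, n_test)
majority = np.array(
    [np.bincount(column, minlength=3).argmax() for column in all_preds.T]
)
vote_accuracy = accuracy_score(y_test, majority)
print(f"Majority-vote accuracy: {vote_accuracy * 100:.2f}%")
```

Keeping the trees in a plain list makes the "one tree per subset" structure explicit, whereas `RandomForestClassifier` draws its own bootstrap samples and random feature subsets internally.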

### Step 3: Evaluate all the trees on the test dataset

In [13]:
# Per-tree predictions on the test set (one array per fitted estimator)
ensemble_predictions = [estimator.predict(X_test) for estimator in forest.estimators_]

In [14]:
# Check performance of each tree
for i, y_pred_tree in enumerate(ensemble_predictions):
    accuracy_tree = accuracy_score(y_test, y_pred_tree)
    print(f"Tree {i+1} Accuracy: {accuracy_tree*100:.2f}%")

Tree 1 Accuracy: 80.56%
Tree 2 Accuracy: 88.89%
Tree 3 Accuracy: 80.56%
Tree 4 Accuracy: 91.67%
Tree 5 Accuracy: 80.56%
Tree 6 Accuracy: 94.44%
Tree 7 Accuracy: 91.67%
Tree 8 Accuracy: 77.78%
Tree 9 Accuracy: 88.89%
Tree 10 Accuracy: 80.56%


In [15]:
# Overall performance of the Random Forest
ensemble_predictions_majority = [max(set(predictions), key=predictions.count) for predictions in zip(*ensemble_predictions)]
accuracy_forest = accuracy_score(y_test, ensemble_predictions_majority)

In [16]:
print(f"Random Forest Accuracy: {accuracy_forest*100:.2f}%")

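Random Forest Accuracy: 100.00%

None of the ten trees beats the tuned tree from the previous question: their accuracies range from 77.78% to 94.44%, against 94.44% for the single tree. Their majority vote, however, classifies all 36 test samples correctly.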
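
The accuracy above comes from a hand-rolled majority (hard) vote. scikit-learn's `RandomForestClassifier.predict` instead averages the per-tree class probabilities, so the two can disagree on individual samples. A small sketch comparing them follows, assuming the forest is fit once on the full training set with the hyperparameters reported earlier. The names `soft_pred` and `hard_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(
    criterion="gini", max_depth=17, min_samples_leaf=1, min_samples_split=17,
    n_estimators=10, random_state=42,
)
forest.fit(X_train, y_train)  # fit once, on the whole training set

# Built-in prediction: averages each tree's predicted class probabilities
soft_pred = forest.predict(X_test)

# Manual hard vote: sub-estimators return class indices (as floats),
# so cast to int, count votes per sample, and map back through classes_
per_tree = np.array([est.predict(X_test) for est in forest.estimators_]).astype(int)
n_classes = len(forest.classes_)
winners = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in per_tree.T]
)
hard_pred = forest.classes_[winners]

soft_accuracy = (soft_pred == y_test).mean()
hard_accuracy = (hard_pred == y_test).mean()
print(f"forest.predict (probability averaging): {soft_accuracy * 100:.2f}%")
print(f"manual majority vote:                   {hard_accuracy * 100:.2f}%")
print("samples where the two disagree:", int((soft_pred != hard_pred).sum()))
```

In practice `forest.score(X_test, y_test)` is the shorter route to the built-in accuracy. The manual vote is shown only to mirror the approach used in this notebook.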
