### Train and fine-tune a decision tree on the wine dataset using the following steps:-

  - Use load_wine() to load the wine dataset
  - Split the dataset into train and test sets
  - Use randomized search CV (RandomizedSearchCV) to tune the Decision Tree's hyperparameters
  - Try to achieve an accuracy of at least 85%


### Grow a random forest using the following steps:-

  - Continuing the previous question, create 10 subsets of the training dataset. You can use the ShuffleSplit class for this.
  - Train 1 decision tree on each subset, using the best hyperparameter values found in the previous question.
  - Evaluate all the trees on the test dataset. Are they performing better than the tree created in the previous question?

In [6]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, RandomizedSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
from sklearn.metrics import accuracy_score
import numpy as np

In [2]:
# Step 1: Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

In [3]:
# Step 2: Split the dataset into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
# Step 3: Hyperparameter tuning using RandomizedSearchCV for Decision Tree
param_dist = {
    'max_depth': randint(1, 10),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 10),
    'criterion': ['gini', 'entropy']
}
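
A note on the search ranges: `scipy.stats.randint(low, high)` treats `high` as exclusive, so `randint(1, 10)` draws depths from 1 to 9. The short sketch below samples from the same distribution to check the bounds; the sample size and seed are arbitrary choices made for illustration.

```python
from scipy.stats import randint

# randint(low, high) samples integers from low up to high - 1 (high is exclusive)
depth_dist = randint(1, 10)

# Sample size and seed are illustrative assumptions, not part of the notebook
samples = depth_dist.rvs(size=1000, random_state=0)

print("Smallest sampled max_depth:", samples.min())
print("Largest sampled max_depth:", samples.max())
```

The same rule applies to `min_samples_split` and `min_samples_leaf`, so raise the upper bound by one if a value of 10 should be reachable.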

In [5]:
tree = DecisionTreeClassifier()
random_search = RandomizedSearchCV(tree, param_distributions=param_dist, n_iter=100, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)


In [7]:
print("Best parameters found for Decision Tree:")
print(random_search.best_params_)
print()

Best parameters found for Decision Tree:
{'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 5, 'min_samples_split': 9}



In [8]:
# Step 4: Evaluate the Decision Tree
best_tree = random_search.best_estimator_
y_pred_tree = best_tree.predict(X_test)
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Accuracy of Decision Tree:", accuracy_tree)

Accuracy of Decision Tree: 0.9444444444444444


In [9]:
# Step 5: Train Random Forest using 10 subsets of the training dataset
num_subsets = 10
trees = []
cv = ShuffleSplit(n_splits=num_subsets, test_size=0.2, random_state=42)

In [10]:
for train_index, _ in cv.split(X_train):
    X_subset_train, y_subset_train = X_train[train_index], y_train[train_index]
    tree = DecisionTreeClassifier(**random_search.best_params_)
    tree.fit(X_subset_train, y_subset_train)
    trees.append(tree)

In [11]:
# Step 6: Evaluate all the trees on the test dataset
predictions = np.zeros((len(X_test), num_subsets))

In [12]:
for i, tree in enumerate(trees):
    predictions[:, i] = tree.predict(X_test)
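
The task also asks how each individual tree performs on the test set. A self-contained sketch of that per-tree evaluation follows. It reuses the best parameters printed by the search above and the same split settings; the `random_state=0` passed to each tree is an assumption added here for repeatability, so the scores may differ from those of the trees trained in the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Best parameters as printed by the randomized search in this notebook
best_params = {
    'criterion': 'entropy',
    'max_depth': 9,
    'min_samples_leaf': 5,
    'min_samples_split': 9,
}

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
tree_accuracies = []
for train_index, _ in cv.split(X_train):
    # random_state=0 is an assumption for repeatability, not from the notebook
    subset_tree = DecisionTreeClassifier(random_state=0, **best_params)
    subset_tree.fit(X_train[train_index], y_train[train_index])
    tree_accuracies.append(accuracy_score(y_test, subset_tree.predict(X_test)))

for i, acc in enumerate(tree_accuracies):
    print(f"Tree {i}: {acc:.3f}")
print("Mean accuracy across trees:", sum(tree_accuracies) / len(tree_accuracies))
```

Comparing these values with the single tuned tree's test accuracy answers the question directly. Each subset tree sees only 80% of the training data, so individually they may score somewhat lower.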

In [13]:
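# Combine the trees' votes. Note: averaging integer class labels and rounding
# only approximates a majority vote (votes for classes 0 and 2 average to 1).
y_pred_rf = np.mean(predictions, axis=1)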
accuracy_rf = accuracy_score(y_test, y_pred_rf.round())
print("Accuracy of Random Forest:", accuracy_rf)

Accuracy of Random Forest: 0.9166666666666666
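
Averaging class labels treats them as numbers, which can produce a class that few trees voted for. A random forest normally combines trees by majority vote instead. The sketch below contrasts the two on a small made-up vote matrix; `majority_vote` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

def majority_vote(predictions):
    """Return the most frequent label in each row of an (n_samples, n_trees) array."""
    predictions = np.asarray(predictions).astype(int)
    # np.bincount counts occurrences of each label; argmax picks the most common.
    # Ties resolve to the smallest label.
    return np.array([np.bincount(row).argmax() for row in predictions])

# Made-up votes from 5 trees for 2 samples (illustrative data only)
votes = np.array([
    [0, 0, 2, 2, 2],  # most trees say class 2, but the mean is 1.2
    [1, 1, 1, 0, 2],  # most trees say class 1, and the mean is 1.0
])

mean_rounded = np.mean(votes, axis=1).round().astype(int)
voted = majority_vote(votes)

print("Mean then round:", mean_rounded)  # [1 1]
print("Majority vote:  ", voted)         # [2 1]
```

With the notebook's `predictions` array, `majority_vote(predictions)` could be passed to `accuracy_score` in place of `y_pred_rf.round()`. The resulting accuracy may differ from the value printed above.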


In [15]:
len(X_test)

36