#**Decision Tree Assignment**

#**Theoretical Questions**

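1. What is a Decision Tree, and how does it work?
  - A decision tree is a supervised learning algorithm used for both classification and regression tasks.
  - It represents a sequence of decisions as a tree: each internal node tests a feature (for example, whether a value is below a threshold), each branch corresponds to an outcome of that test, and each leaf node holds the final prediction.
  - To make a prediction, a sample starts at the root node and follows the branches that match its feature values until it reaches a leaf.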
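
The root-to-leaf traversal can be sketched with a small hand-built tree. The feature names, thresholds, and the `predict` helper below are illustrative assumptions, not values learned from data.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaves carry the prediction. Thresholds are made up for illustration.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"leaf": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"leaf": "versicolor"},
        "right": {"leaf": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when the feature value <= threshold."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```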

2. What are impurity measures in Decision Trees?
  - Impurity measures quantify how heterogeneous the data in a node is, that is, how mixed the classes are within that node.
  - A node that contains a single class has impurity 0. Common measures are Gini impurity and entropy for classification, and variance (mean squared error) for regression.

3. What is the mathematical formula for Gini Impurity?
  - The Gini Impurity of a node is 1 - Σ(pᵢ²), where pᵢ is the proportion of samples in the node that belong to class i.
  - Decision tree algorithms use it to choose the split at each node, preferring the split that gives the lowest weighted impurity in the resulting child nodes.
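
A minimal sketch of the formula 1 - Σ(pᵢ²) in plain Python. The helper name `gini_impurity` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2), where p_i is the share of class i in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = ["a", "a", "a", "a"]    # one class only
mixed_node = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```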

4. What is the mathematical formula for Entropy?
  - The entropy of a node is H = -Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of samples in the node that belong to class i.
  - Entropy is 0 for a pure node and reaches its maximum, log₂(k) for k classes, when all classes are equally represented.
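
A minimal sketch of the node entropy used in decision trees, H = -Σ pᵢ log₂(pᵢ) over the class proportions pᵢ. The helper name `entropy` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class proportions in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

pure_node = ["a", "a", "a", "a"]    # one class only
mixed_node = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(entropy(pure_node))   # 0.0
print(entropy(mixed_node))  # 1.0
```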

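5. What is Information Gain, and how is it used in Decision Trees?
  - Information Gain is the reduction in impurity (typically entropy) achieved by splitting a node: IG = H(parent) - Σ (nₖ/n) · H(childₖ), where nₖ is the number of samples in child k and n is the number of samples in the parent.
  - At each node, the algorithm evaluates the candidate splits and chooses the one with the highest Information Gain.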
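
As a sketch, Information Gain can be computed as the parent's entropy minus the size-weighted entropy of the child nodes. The helper names and the toy "yes"/"no" labels below are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class proportions in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly recovers all of the parent's entropy.
perfect = [["yes"] * 5, ["no"] * 5]
print(information_gain(parent, perfect))  # 1.0

# A split that leaves each child mostly, but not fully, pure gains less.
partial = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(round(information_gain(parent, partial), 3))  # 0.278
```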

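6. What is the difference between Gini Impurity and Entropy?
  - Both measure how mixed the classes in a node are. Gini Impurity is 1 - Σ(pᵢ²) and ranges from 0 to 1 - 1/k for k classes; Entropy is -Σ pᵢ log₂(pᵢ) and ranges from 0 to log₂(k).
  - Gini Impurity is slightly cheaper to compute because it avoids the logarithm. In practice the two criteria usually select similar splits and produce similar trees.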

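7. What is the mathematical explanation behind Decision Trees?
  - Decision trees recursively partition the data based on feature values, aiming to create increasingly pure subsets.
  - At each node, the algorithm evaluates candidate splits (a feature and a threshold) and picks the one that minimizes the weighted impurity of the children, (n_left/n) · I(left) + (n_right/n) · I(right), where I is an impurity measure such as Gini Impurity or Entropy.
  - Splitting stops when a node is pure or a stopping condition, such as a maximum depth, is met.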
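
One step of this partitioning can be sketched as a search for the best threshold on a single numeric feature. This is a simplified illustration using Gini impurity, not the algorithm of any particular library; `gini` and `best_threshold` are hypothetical helpers and the data is made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))  # (6.5, 0.0)
```

A full tree builder would repeat this search over every feature and then recurse into the two child nodes.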

8. What is Pre-Pruning in Decision Trees?
  - Pre-pruning, also known as early stopping, prevents overfitting by halting the tree's growth before it is fully grown.
  - In scikit-learn this is done with constraints such as max_depth, min_samples_split, and min_samples_leaf.

9. What is Post-Pruning in Decision Trees?
  - Post-pruning, also known as backward pruning, simplifies the tree after it has been fully grown and has potentially overfit the training data.
  - It involves removing branches or subtrees from the fully grown tree, usually based on performance metrics on a validation set, to improve generalization to unseen data.

10. What is the difference between Pre-Pruning and Post-Pruning?
  - Pre-pruning and post-pruning are two strategies used to optimize decision trees by reducing their complexity and preventing overfitting.
  - Pre-pruning stops the tree's growth during construction, while post-pruning simplifies a fully grown tree.

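11. What is a Decision Tree Regressor?
  - A decision tree regressor is a machine learning model that uses a tree-like structure to predict continuous numerical values.
  - Splits are chosen to reduce the spread of the target values (for example, the mean squared error) within each child node, and each leaf predicts the mean of the training targets that reach it.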
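
A minimal sketch of a depth-1 regression tree (a "stump") on one feature, assuming squared error as the split criterion and the leaf mean as the prediction. The helpers `fit_stump` and `predict_stump` and the toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean of values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Find the threshold minimizing the total squared error of the two leaves.

    Returns (threshold, left_mean, right_mean); assumes at least two distinct xs.
    """
    pairs = sorted(zip(xs, ys))
    best_error = float("inf")
    best = (None, None, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if error < best_error:
            best_error = error
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, mean(left), mean(right))
    return best

def predict_stump(stump, x):
    """Each leaf predicts the mean target of the training samples it received."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_stump(xs, ys)
print(stump)                       # (6.5, 6.0, 21.0)
print(predict_stump(stump, 2.5))   # 6.0
print(predict_stump(stump, 11.5))  # 21.0
```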

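12. What are the advantages and disadvantages of Decision Trees?
  - Advantages include ease of understanding and interpretation, the ability to handle both categorical and numerical data, and little need for data preparation such as feature scaling.
  - Disadvantages include a tendency to overfit when the tree is grown without constraints, instability (small changes in the training data can produce a very different tree), and bias toward the majority class on imbalanced datasets.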

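13. How does a Decision Tree handle missing values?
  - How a decision tree handles missing values depends on the implementation. One common approach is to use surrogate splits, where alternative features are used to guide data points with missing values down the appropriate branch.
  - Where the implementation offers no such support, missing values are typically imputed (for example, with the mean, median, or most frequent value) before training.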

14. How does a Decision Tree handle categorical features?
  - Conceptually, decision trees can handle categorical features directly: a split partitions the data according to the categories of the feature.
  - In practice this depends on the implementation. scikit-learn's tree estimators expect numeric input, so categorical features are usually encoded first (for example, with one-hot or ordinal encoding).
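
Some implementations require numeric input, in which case a categorical feature has to be encoded before training. The dependency-free sketch below illustrates one-hot encoding; `one_hot_encode` is a hypothetical helper written for this example, and in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category, sorted."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
for row in encoded:
    print(row)
# [0, 0, 1]
# [0, 1, 0]
# [1, 0, 0]
# [0, 1, 0]
```

Each encoded row can then be passed to a tree as ordinary numeric features, where a split such as "green <= 0.5" separates one category from the rest.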

15. What are some real-world applications of Decision Trees?
  - Decision trees are versatile tools with applications across many fields, including business, healthcare, finance, and marketing. They are used for everything from predicting customer behavior to diagnosing diseases and assessing financial risk.

#**Practical Questions**

In [None]:
#16 Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Classifier Accuracy on Iris dataset: {accuracy:.2f}")


In [None]:
#17 Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Decision Tree Classifier with Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Print feature importances
print("Feature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")


In [None]:
#18 Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Decision Tree Classifier using Entropy
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy (using entropy): {accuracy:.4f}")


In [None]:
#19 Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE).

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")


In [None]:
#20 Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import graphviz

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Export the tree to Graphviz DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True
)

# Visualize the tree using graphviz
graph = graphviz.Source(dot_data)
graph.render("iris_tree", view=True)  # Saves iris_tree.pdf and opens it in the default viewer


In [None]:
#21 Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_limited.fit(X_train, y_train)
y_pred_limited = dt_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Train fully grown Decision Tree (no max_depth)
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
y_pred_full = dt_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

# Print results
print(f"Accuracy with max_depth=3: {acc_limited:.4f}")
print(f"Accuracy with fully grown tree: {acc_full:.4f}")


In [None]:
#22 Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree with min_samples_split=5
dt_split5 = DecisionTreeClassifier(min_samples_split=5, random_state=42)
dt_split5.fit(X_train, y_train)
y_pred_split5 = dt_split5.predict(X_test)
acc_split5 = accuracy_score(y_test, y_pred_split5)

# Default Decision Tree
dt_default = DecisionTreeClassifier(random_state=42)
dt_default.fit(X_train, y_train)
y_pred_default = dt_default.predict(X_test)
acc_default = accuracy_score(y_test, y_pred_default)

# Print results
print(f"Accuracy with min_samples_split=5: {acc_split5:.4f}")
print(f"Accuracy with default parameters: {acc_default:.4f}")


In [None]:
#23 Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train on unscaled data
clf_unscaled = DecisionTreeClassifier(random_state=42)
clf_unscaled.fit(X_train, y_train)
y_pred_unscaled = clf_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train on scaled data
clf_scaled = DecisionTreeClassifier(random_state=42)
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print the results
print(f"Accuracy without scaling: {acc_unscaled:.4f}")
print(f"Accuracy with scaling:    {acc_scaled:.4f}")


In [None]:
#24 Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Wrap DecisionTreeClassifier with OneVsRestClassifier
ovr_model = OneVsRestClassifier(DecisionTreeClassifier(random_state=42))
ovr_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = ovr_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output results
print(f"Accuracy using One-vs-Rest with Decision Tree: {accuracy:.4f}")


In [None]:
#25 Write a Python program to train a Decision Tree Classifier and display the feature importance scores.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Get feature importances
importances = clf.feature_importances_

# Create a DataFrame for better display
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance Score': importances
}).sort_values(by='Importance Score', ascending=False)

# Display the feature importance scores
print("Feature Importance Scores:")
print(importance_df)


In [None]:
#26 Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree.

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor with max_depth=5
reg_limited = DecisionTreeRegressor(max_depth=5, random_state=42)
reg_limited.fit(X_train, y_train)
y_pred_limited = reg_limited.predict(X_test)
mse_limited = mean_squared_error(y_test, y_pred_limited)

# Train unrestricted Decision Tree Regressor
reg_full = DecisionTreeRegressor(random_state=42)
reg_full.fit(X_train, y_train)
y_pred_full = reg_full.predict(X_test)
mse_full = mean_squared_error(y_test, y_pred_full)

# Compare MSEs
print(f"Mean Squared Error (max_depth=5): {mse_limited:.4f}")
print(f"Mean Squared Error (unrestricted): {mse_full:.4f}")


In [None]:
#27 Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# Train an initial Decision Tree to get effective alphas
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Train trees for each value of ccp_alpha
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

# Remove the last tree (usually trivial with one node)
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Compute train and test accuracy for each alpha
train_scores = [accuracy_score(y_train, clf.predict(X_train)) for clf in clfs]
test_scores = [accuracy_score(y_test, clf.predict(X_test)) for clf in clfs]

# Plotting accuracy vs ccp_alpha
plt.figure(figsize=(10, 6))
plt.plot(ccp_alphas, train_scores, marker='o', label="Train Accuracy", drawstyle="steps-post")
plt.plot(ccp_alphas, test_scores, marker='o', label="Test Accuracy", drawstyle="steps-post")
plt.xlabel("ccp_alpha (Pruning Parameter)")
plt.ylabel("Accuracy")
plt.title("Effect of Cost Complexity Pruning on Accuracy")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
#28 Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Evaluate using precision, recall, and F1-score
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n")
print(report)


In [None]:
#29 Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
class_names = iris.target_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix using seaborn heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Decision Tree Classifier')
plt.tight_layout()
plt.show()


In [None]:
#30 Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Use GridSearchCV to search best parameters
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best estimator
best_model = grid_search.best_estimator_

# Predict on test set
y_pred = best_model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)

# Output results
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)
print("Test Set Accuracy:", accuracy)
