
In [None]:
### 1. What is a Decision Tree, and how does it work?
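A Decision Tree is a supervised machine learning algorithm used for classification and regression. It works by recursively splitting the dataset into subsets based on feature values, which produces a tree-like structure: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds a prediction (a class label or a numeric value).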

In [None]:
### 2. What are impurity measures in Decision Trees?
Impurity measures quantify how mixed the target values in a node are; a node whose samples all share one class has impurity 0. Common impurity measures:
- Gini Impurity
- Entropy
- Mean Squared Error (for regression)

In [None]:
### 3. What is the mathematical formula for Gini Impurity?
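$$\text{Gini} = 1 - \sum_{i=1}^{C} p_i^2$$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes.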
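
As a quick check of the formula, the sketch below computes Gini impurity from a list of class labels. The helper name `gini_impurity` is illustrative only and is not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
print("Pure node:", gini_impurity(pure_node))
print("Evenly mixed two-class node:", gini_impurity(mixed_node))
```

A pure node scores 0, and an evenly mixed two-class node scores 0.5, the maximum for two classes.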

In [None]:
### 4. What is the mathematical formula for Entropy?
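$$\text{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i$$

Here $p_i$ is the proportion of samples in the node that belong to class $i$, and $C$ is the number of classes. Classes with $p_i = 0$ contribute nothing to the sum.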
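
The formula can be sketched in a few lines of plain Python. The function name `entropy` is an illustrative choice; only classes that actually occur in the node are counted, so no `log2(0)` is ever evaluated.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

print("Pure node:", entropy(["a", "a", "a", "a"]))
print("Evenly mixed two-class node:", entropy(["a", "a", "b", "b"]))
```

An evenly mixed two-class node has an entropy of 1 bit, while a pure node has an entropy of 0.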

In [None]:
### 5. What is Information Gain, and how is it used in Decision Trees?
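Information Gain measures how much a split reduces entropy:

$$\text{Information Gain} = \text{Entropy}(\text{parent}) - \sum_{k} \frac{n_k}{n}\,\text{Entropy}(\text{child}_k)$$

where $n_k$ is the number of samples in child $k$ and $n$ is the number of samples in the parent. At each node the tree evaluates the candidate splits and chooses the one with the highest information gain.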
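
To make the weighting concrete, here is a minimal sketch that computes the information gain of one hypothetical split. The helper names and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy node with 5 "yes" and 5 "no" samples, split into two children of 5 samples each
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, children)
print("Information gain:", round(gain, 4))
```

A split that separates the classes perfectly would recover the full parent entropy (1 bit here); this imperfect split recovers only part of it.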

In [None]:
### 6. What is the difference between Gini Impurity and Entropy?
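- Gini needs no logarithm, so it is slightly cheaper to compute. For a two-class node it ranges from 0 to 0.5.
- Entropy comes from information theory and needs a logarithm per class. For a two-class node it ranges from 0 to 1 bit.
- In practice the two criteria usually select similar splits.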

In [None]:
### 7. What is the mathematical explanation behind Decision Trees?
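Training is a greedy, recursive binary partitioning. At each node the algorithm evaluates candidate (feature, threshold) splits, picks the one that reduces impurity the most, and repeats the process on each child until a stopping condition is met. Because each split is chosen locally, the resulting tree is not guaranteed to be globally optimal.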
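
The search for a single split can be sketched as follows for one numeric feature. This is a simplified illustration, not scikit-learn's implementation; `best_threshold` and its midpoint-candidate strategy are assumptions made for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

result = best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(result)
```

A full tree builder would run this search over every feature, apply the winning split, and recurse into the two child nodes.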

In [None]:
### 8. What is Pre-Pruning in Decision Trees?
Pre-pruning stops tree growth early by enforcing limits such as a maximum depth, a minimum number of samples required to split a node, or a minimum number of samples per leaf.
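
In scikit-learn these limits are constructor arguments of the tree estimators. The sketch below compares an unrestricted tree with a pre-pruned one on the Wine dataset; the specific limit values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is reached
pre_pruned = DecisionTreeClassifier(
    max_depth=3,           # at most 3 levels of splits
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

print("Full tree:  depth", full.get_depth(), "leaves", full.get_n_leaves())
print("Pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```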

In [None]:
### 9. What is Post-Pruning in Decision Trees?
Post-pruning grows the full tree first and then removes branches that contribute little, guided by validation performance or by cost-complexity pruning.
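
For reference, cost-complexity pruning (the method behind scikit-learn's `ccp_alpha` parameter) scores a tree $T$ with the measure below. Here $R(T)$ is the total error or impurity of the leaves, $|\tilde{T}|$ is the number of leaves, and $\alpha \ge 0$ is the penalty per leaf. The symbols follow common textbook notation and are not taken from the answer above.

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$$

Larger values of $\alpha$ favor smaller trees, and $\alpha = 0$ keeps the full tree.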

In [None]:
### 10. What is the difference between Pre-Pruning and Post-Pruning?
Pre-pruning limits complexity while the tree is being built; post-pruning simplifies the tree after it has been fully grown.

In [None]:
### 11. What is a Decision Tree Regressor?
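A Decision Tree Regressor is a decision tree that predicts continuous outputs. Splits are typically chosen to minimize the mean squared error (MSE), and each leaf predicts the mean target value of the training samples that reach it.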
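
A tiny sketch of this behavior, assuming scikit-learn's default squared-error criterion: a depth-1 regressor fitted on made-up data predicts the mean target of whichever leaf a sample falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: two well-separated groups along a single feature
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# A depth-1 tree makes exactly one split, so it has two leaves
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
pred_low, pred_high = stump.predict([[2.0], [11.0]])

print("Prediction for x=2: ", pred_low, "| mean of low group: ", y[:3].mean())
print("Prediction for x=11:", pred_high, "| mean of high group:", y[3:].mean())
```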

In [None]:
### 12. What are the advantages and disadvantages of Decision Trees?
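- Advantages: easy to interpret, no need for feature scaling, can handle mixed data types.
- Disadvantages: overfits easily, unstable (small changes in the data can produce a very different tree), biased toward features with many distinct values because they offer more candidate splits.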

In [None]:
### 13. How does a Decision Tree handle missing values?
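- Surrogate splits: when the split feature is missing, fall back to another feature whose split closely mimics the primary one.
- Ignoring missing values during the split: evaluate each candidate split only on the samples where the feature is observed.
- Imputation: fill in missing values (for example with the mean or median) before training.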
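
The imputation option can be sketched with a scikit-learn pipeline. The toy data and the median strategy are illustrative assumptions, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix with missing entries marked as NaN
X = np.array([
    [1.0, 5.0],
    [2.0, np.nan],
    [np.nan, 7.0],
    [8.0, 9.0],
    [9.0, np.nan],
    [10.0, 12.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# The imputer learns per-column medians during fit and reuses them at predict time
model = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

prediction = model.predict([[np.nan, 6.0]])
print("Predicted class for a sample with a missing feature:", prediction)
```

Keeping the imputer inside the pipeline means new samples with missing values are filled with the medians learned from the training data.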

In [None]:
### 14. How does a Decision Tree handle categorical features?
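- scikit-learn's trees require numeric input, so categorical features are encoded first, typically with one-hot encoding or ordinal (label) encoding. For features, `OrdinalEncoder` is the appropriate tool; `LabelEncoder` is intended for target labels.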
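
A minimal sketch of the one-hot route, using a made-up single categorical feature. The data and the `handle_unknown="ignore"` setting are illustrative choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data: one categorical feature, binary target
X = np.array([["red"], ["red"], ["green"], ["green"], ["blue"], ["blue"]])
y = np.array([1, 1, 0, 0, 1, 1])

# One-hot encode the category, then fit the tree on the resulting indicator columns
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories become all-zero rows
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

predictions = model.predict(np.array([["green"], ["blue"]]))
print(predictions)
```

On this toy data the tree only needs to split on the "green" indicator column to separate the two classes.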

In [None]:
### 15. What are some real-world applications of Decision Trees?
- Diagnosis, fraud detection, credit scoring, marketing, manufacturing, etc.

In [None]:
## 1. Train a Decision Tree Classifier on the Wine dataset

In [None]:
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

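# The variable is named `iris`, but load_wine() returns the Wine dataset (178 samples, 13 features)
iris = load_wine()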
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9444444444444444


In [None]:
## 2. Train with Gini Impurity and print feature importances

In [None]:
clf_gini = DecisionTreeClassifier(criterion='gini')
clf_gini.fit(X_train, y_train)
print("Feature Importances:", clf_gini.feature_importances_)

Feature Importances: [0.01899507 0.         0.02094206 0.         0.         0.
 0.41105326 0.         0.         0.38493424 0.         0.
 0.16407537]


In [None]:
## 3. Train with Entropy and print accuracy

In [None]:
clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train, y_train)
y_pred = clf_entropy.predict(X_test)
print("Accuracy with Entropy:", accuracy_score(y_test, y_pred))

Accuracy with Entropy: 0.9166666666666666


In [None]:
## 4. Decision Tree Regressor with MSE

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))

MSE: 0.4947938357017442


In [None]:
## 5. Visualize tree using Graphviz

In [None]:

from sklearn.tree import export_graphviz
import graphviz

dot_data = export_graphviz(clf, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
graph = graphviz.Source(dot_data)
graph.render("wine_tree")

In [None]:
## 6. Compare max_depth=3 vs full tree

In [None]:
iris = load_wine()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

clf_full = DecisionTreeClassifier()
clf_full.fit(X_train, y_train)
print("Full Tree Accuracy:", accuracy_score(y_test, clf_full.predict(X_test)))

clf_limited = DecisionTreeClassifier(max_depth=3)
clf_limited.fit(X_train, y_train)
print("Depth=3 Tree Accuracy:", accuracy_score(y_test, clf_limited.predict(X_test)))

Full Tree Accuracy: 0.9444444444444444
Depth=3 Tree Accuracy: 0.9444444444444444


In [None]:
## 7. Use min_samples_split=5

In [None]:
clf_split5 = DecisionTreeClassifier(min_samples_split=5)
clf_split5.fit(X_train, y_train)
print("Accuracy with min_samples_split=5:", accuracy_score(y_test, clf_split5.predict(X_test)))

Accuracy with min_samples_split=5: 0.9444444444444444


In [None]:
## 8. Feature scaling comparison

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, iris.target, test_size=0.2, random_state=42)
clf_scaled = DecisionTreeClassifier()
clf_scaled.fit(X_train_s, y_train_s)
print("Scaled Accuracy:", accuracy_score(y_test_s, clf_scaled.predict(X_test_s)))

Scaled Accuracy: 0.9444444444444444


In [None]:
## 9. One-vs-Rest for multiclass

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(DecisionTreeClassifier())
ovr_clf.fit(X_train, y_train)
print("OvR Accuracy:", accuracy_score(y_test, ovr_clf.predict(X_test)))

OvR Accuracy: 0.9444444444444444


In [None]:
## 10. Display feature importance

In [None]:
print("Feature Importances:", clf.feature_importances_)

Feature Importances: [0.01899507 0.         0.02094206 0.         0.         0.
 0.41105326 0.         0.         0.38493424 0.         0.
 0.16407537]


In [None]:

## 11. Regressor with max_depth=5 vs unrestricted

In [None]:
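# Use a fresh housing split: X_train/y_train have held the Wine split since section 6
Xh_train, Xh_test, yh_train, yh_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)
for depth in (5, None):  # None leaves the tree unrestricted
    reg_d = DecisionTreeRegressor(max_depth=depth).fit(Xh_train, yh_train)
    print(f"MSE (max_depth={depth}):", mean_squared_error(yh_test, reg_d.predict(Xh_test)))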


In [None]:
## 12. Cost Complexity Pruning

In [None]:
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
# Fit one tree per candidate alpha; larger alphas prune more aggressively
clfs = [DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train) for alpha in ccp_alphas]
scores = [accuracy_score(y_test, pruned.predict(X_test)) for pruned in clfs]
print("Pruned Accuracies:", scores)

Pruned Accuracies: [0.9444444444444444, 0.9444444444444444, 0.9444444444444444, 0.9444444444444444, 0.8611111111111112, 0.3888888888888889]


In [None]:
## 13. Precision, Recall, F1

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
# Recompute classifier predictions: y_pred was last set by the housing regressor in section 4
y_pred = clf.predict(X_test)
print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))

In [None]:
## 14. Confusion matrix visualization

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


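# y_pred must hold the classifier's predictions for the Wine test split
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)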
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

In [None]:
## 15. GridSearchCV for best hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)

Best Params: {'max_depth': 3, 'min_samples_split': 2}
Best Score: 0.9224137931034484
