**1. What is a Decision Tree, and how does it work in the context of
classification?**

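A Decision Tree is a supervised learning algorithm used for both classification and regression tasks, though it is most commonly used for classification. For classification, it recursively splits the training data on the feature and threshold that best separate the classes, as scored by an impurity criterion such as Gini or entropy. The result is a tree of if/else tests: a new sample is routed from the root down to a leaf, and the majority class of the training samples in that leaf becomes the prediction.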
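
To make this concrete, a fitted tree can be printed as the nested if/else rules it has learned. The snippet below is a minimal sketch using scikit-learn's `export_text` on the Iris dataset. The choice of `max_depth=2` is only an assumption to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Print the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)

# Classify one sample by routing it from the root to a leaf.
sample = iris.data[:1]
predicted = iris.target_names[tree.predict(sample)[0]]
print("Predicted class:", predicted)
```

Each indented line in the printout is one test on a feature, and each `class:` line is a leaf.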

**2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**

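Gini Impurity measures how often a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node's class distribution: Gini = 1 - sum(p_i^2), where p_i is the fraction of samples in class i. Entropy measures the randomness or disorder in a node: Entropy = -sum(p_i * log2(p_i)). Both are 0 for a pure node and largest when the classes are evenly mixed. At each node, the Decision Tree evaluates candidate splits and chooses the one that most reduces the size-weighted impurity of the child nodes, so both measures steer the tree toward purer children. In practice they usually pick similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.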
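
Both measures can be computed directly from the class labels in a node. The helper names `gini` and `entropy` below are illustrative, not part of any library; this is a small sketch of the standard definitions.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


mixed = ["sick", "sick", "healthy", "healthy"]   # evenly mixed node
pure = ["healthy", "healthy", "healthy"]         # pure node

print(gini(mixed), entropy(mixed))   # 0.5 1.0
print(gini(pure), entropy(pure))     # 0.0 0.0
```

For two classes, Gini peaks at 0.5 and entropy at 1.0 bit, both at a 50/50 mix.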

**3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

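Pre-Pruning stops the tree from growing too deep by applying constraints (like max depth or min samples per leaf) during training, preventing overfitting early. Its practical advantage is lower training cost: branches that would be discarded are never built, which matters on large datasets.
Post-Pruning allows the tree to grow fully and then cuts back less important branches based on validation performance. Its practical advantage is that it can keep splits that only pay off deeper in the tree, which a pre-pruning limit might cut off too early.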
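
The two approaches can be compared side by side in scikit-learn. In this sketch, pre-pruning uses `max_depth` and `min_samples_leaf`, and post-pruning uses cost-complexity pruning via `ccp_alpha`. The specific limits and the use of 5-fold cross-validation instead of a separate validation set to pick the pruning strength are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth is constrained while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42).fit(X_train, y_train)

# Reference: an unconstrained, fully grown tree.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Post-pruning: list the candidate pruning strengths for the full tree.
# The last alpha collapses the tree to its root, so it is dropped.
# Clipping guards against tiny negative values from floating-point error.
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))


def cv_score(alpha):
    """Mean 5-fold CV accuracy of a tree pruned with this alpha."""
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    return cross_val_score(candidate, X_train, y_train, cv=5).mean()


best_alpha = float(max(alphas, key=cv_score))
post_pruned = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test, y_test))
```

On a dataset as small and clean as Iris the accuracy differences may be negligible; the leaf counts show the effect of pruning more clearly.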

**4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

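Information Gain measures how much uncertainty (entropy) is reduced after splitting a dataset based on a feature. It is the entropy of the parent node minus the size-weighted average entropy of the child nodes. It matters because the tree uses it to rank candidate splits: at each node it chooses the split with the highest gain, which is the one that produces the purest children.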
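
A short worked example may help. The functions `entropy` and `information_gain` below are hypothetical helpers written for this illustration; they compare a split that separates the classes perfectly with one that does not separate them at all.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]

# Split A separates the classes perfectly; split B leaves both children mixed.
gain_a = information_gain(parent, [[0, 0], [1, 1]])
gain_b = information_gain(parent, [[0, 1], [0, 1]])

print(gain_a)  # 1.0
print(gain_b)  # 0.0
```

A tree using the entropy criterion would prefer split A, since it removes all of the parent's uncertainty.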

**5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?**

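Decision Trees are used in areas like credit scoring, medical diagnosis, and fraud detection. Their main advantages are that they are easy to interpret and can handle mixed data types with little preprocessing. Their main limitations are that they overfit easily when grown deep and are unstable: a small change in the training data can produce a very different tree.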

**10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.**

To build a disease prediction model, first handle missing values by imputing numerical features with the median and categorical ones with the mode (or use more advanced imputers if needed). Then encode categorical data using one-hot encoding for nominal features and ordinal encoding for ordered ones. Next, train a Decision Tree classifier inside a preprocessing pipeline that includes the imputation and encoding steps, so that they are fitted on the training data only and nothing leaks from the test set. Use GridSearchCV or RandomizedSearchCV to tune key hyperparameters such as max_depth, min_samples_split, and criterion based on cross-validation results. Finally, evaluate the model on a held-out test set using metrics like accuracy, precision, recall, F1-score, and ROC-AUC. For disease prediction, recall deserves particular attention, because a missed case is usually more costly than a false alarm.
In practice, this model helps healthcare providers predict diseases early, prioritize high-risk patients, and optimize medical resources, improving both patient outcomes and operational efficiency.
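
The workflow described above can be sketched end to end with scikit-learn. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`, `region`), the synthetic values, and the rule that generates the label are made up so the example runs on its own. Ordinal encoding is omitted because the toy data has no ordered categories, and scoring on recall is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and label rule).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to simulate missing data.
df.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "region"]

# Median imputation for numbers; mode imputation plus one-hot for categories.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42)

# Tune the tree inside the pipeline so preprocessing is refit on each CV fold.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out test set.
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("ROC-AUC:", auc)
```

Keeping imputation and encoding inside the pipeline means the same fitted transformations are applied automatically when the model scores new patients.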


In [1]:
'''6. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

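# Load the data and hold out 30% of the samples for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a tree that scores candidate splits with Gini impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Report test accuracy and the importance of each feature
# (order: sepal length, sepal width, petal length, petal width)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)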


Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


In [2]:
'''7. Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data and hold out 30% of the samples for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruned tree: growth stops at depth 3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
y_pred_limited = tree_limited.predict(X_test)

# Fully grown tree: no depth limit, splits until the leaves are pure
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
y_pred_full = tree_full.predict(X_test)

# Compare both trees on the same test set
print("Accuracy (max_depth=3):", accuracy_score(y_test, y_pred_limited))
print("Accuracy (fully-grown):", accuracy_score(y_test, y_pred_full))


Accuracy (max_depth=3): 1.0
Accuracy (fully-grown): 1.0


In [3]:
'''8. Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances'''

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

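# Note: load_boston was removed in scikit-learn 1.2, so the California
# Housing dataset is used here in place of the Boston Housing dataset.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit an unconstrained regression tree
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Report test-set MSE and the importance of each feature
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)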


Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


In [4]:
'''9. Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data and hold out 30% of the samples for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try every combination of the two hyperparameters with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 4, 5, None], 'min_samples_split': [2, 3, 4, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is the best configuration refit on the full training set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy: 1.0
