## **Assignment: Decision Tree**

**Question 1**: What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a model that makes decisions by asking a series of questions about the features of the data.
- In classification, each question splits the data into smaller groups, step by step, until a leaf node is reached that gives the final decision (class label).
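
As a minimal sketch of the "series of questions" idea, a small tree can be written by hand as nested conditions. The function name `classify_iris` and the thresholds are illustrative assumptions (they resemble what a tree trained on Iris tends to learn, but they are not taken from a fitted model):

```python
def classify_iris(petal_length_cm: float, petal_width_cm: float) -> str:
    """Classify an iris flower by answering one question per tree node."""
    # Root node: the first question separates one class completely
    if petal_length_cm <= 2.45:
        return "setosa"  # leaf node: final class label
    # Internal node: a second question splits the remaining flowers
    if petal_width_cm <= 1.75:
        return "versicolor"  # leaf node
    return "virginica"  # leaf node


print(classify_iris(1.4, 0.2))  # setosa
print(classify_iris(4.5, 1.5))  # versicolor
print(classify_iris(6.0, 2.2))  # virginica
```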
---

**Question 2**: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
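- Gini Impurity measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the node's class distribution (1 minus the sum of the squared class proportions).
- Entropy measures the disorder or uncertainty in a node's class labels (the negative sum of p × log2(p) over the class proportions p).
- Both are 0 for a pure node and highest when the classes are evenly mixed. Lower impurity = better split: the tree picks the split whose child nodes have the lowest weighted impurity.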
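
The two measures can be sketched in a few lines of plain Python. The helper names (`class_proportions`, `gini`, `entropy`) and the toy labels are assumptions made for illustration only:

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the class proportions."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure = ["yes", "yes", "yes", "yes"]   # only one class present
mixed = ["yes", "yes", "no", "no"]    # two classes, evenly mixed

print(gini(pure), entropy(pure))      # 0.0 0.0
print(gini(mixed), entropy(mixed))    # 0.5 1.0
```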
---

**Question 3**: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
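- Pre-Pruning: Stops tree growth early (e.g., by limiting max_depth or min_samples_split) → Faster training and less overfitting
- Post-Pruning: Grows the full tree, then removes weak branches (e.g., cost-complexity pruning) → Simplifies the tree while keeping the splits that actually help accuracy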
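
A rough sketch of both approaches in scikit-learn follows. Pre-pruning uses growth limits such as `max_depth`, while post-pruning uses cost-complexity pruning through `ccp_alpha`. Taking the alpha from the middle of the pruning path is an arbitrary choice for illustration; in practice it would normally be selected by cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Pre-pruning: restrict growth before training starts
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first ...
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ... then compute the candidate pruning strengths (alphas) for it
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the alpha in the middle of the path
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={tree.tree_.node_count}, test accuracy={tree.score(X_test, y_test):.3f}")
```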
---

**Question 4**: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
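- Information Gain = parent impurity minus the weighted average impurity of the child nodes after a split. Strictly, the term refers to entropy; the same reduction computed with Gini is usually called Gini gain.
- The tree evaluates the candidate splits and chooses the one with the highest gain, i.e., the feature and threshold that best separate the classes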
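
A small worked example, assuming entropy as the impurity measure. The function names and the toy label lists are made up for illustration:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A weak split leaves both children mixed; a perfect split makes them pure
weak_split = [[0, 0, 0, 1], [0, 1, 1, 1]]
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]

print(round(information_gain(parent, weak_split), 3))  # 0.189
print(information_gain(parent, perfect_split))         # 1.0
```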
---

**Question 5**: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Uses: Medical diagnosis, credit scoring, fraud detection, marketing
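- Advantages: Easy to understand and visualize, works with both numeric and categorical data, needs no feature scaling
- Limitations: Can overfit, sensitive to noisy data, and unstable (small changes in the training data can produce a very different tree)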
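
The overfitting limitation can be illustrated with a sketch on synthetic data. The dataset from `make_classification` and the amount of label noise are assumptions chosen for the demo. A fully grown tree memorizes the noisy training labels, so its training accuracy is perfect while its test accuracy is typically much lower:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: flip_y=0.2 assigns a random class to about 20% of the samples
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No growth limit: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth limit: the tree cannot memorize individual noisy samples
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("fully grown", full_tree), ("max_depth=3", shallow_tree)]:
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```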
---

**Question 6**: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model’s accuracy and feature importances

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris into a DataFrame so the feature names stay attached
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Hold out 20% of the rows for testing
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Gini is scikit-learn's default criterion; it is set explicitly here.
# No random_state is set on the tree, so results can vary slightly between runs.
model = DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy score:', accuracy_score(y_test, y_pred))
print('Feature Importance', model.feature_importances_)
```

Output:

```
Accuracy score: 0.9666666666666667
Feature Importance [0.01253395 0.01880092 0.07584566 0.89281948]
```

The importances follow the order of data.feature_names, so petal width (0.89) dominates the splits.


**Question 7**: Write a Python program to:
- Load the Iris Dataset
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris and hold out 20% of the rows for testing
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fully grown tree: no depth limit
model1 = DecisionTreeClassifier()
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)

# Pre-pruned tree: at most 3 levels of splits
model2 = DecisionTreeClassifier(max_depth=3)
model2.fit(X_train, y_train)
y_pred2 = model2.predict(X_test)

print('Normal Decision Tree Classifier:', accuracy_score(y_test, y_pred1))
print('Depth 3 Decision Tree Classifier:', accuracy_score(y_test, y_pred2))
```

Output:

```
Normal Decision Tree Classifier: 0.9666666666666667
Depth 3 Decision Tree Classifier: 0.9666666666666667
```

On this split both trees reach the same test accuracy, so limiting the depth to 3 gives a simpler model at no cost in accuracy.


**Question 8**: Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

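Note: load_boston was removed in scikit-learn 1.2, so this answer uses the California Housing dataset as a stand-in.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# California Housing replaces Boston Housing (no longer shipped with scikit-learn)
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Hold out 20% of the rows for testing
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fully grown regression tree with default settings
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Mean squared Error', mean_squared_error(y_test, y_pred))
print('Feature Importance', model.feature_importances_)
```

Output:

```
Mean squared Error 0.49865452532836724
Feature Importance [0.50670517 0.05058811 0.0335729  0.02892374 0.03277802 0.14651809
 0.09858541 0.10232857]
```

The importances follow the order of data.feature_names, so median income (MedInc, 0.51) is by far the most important feature.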


**Question 9**: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Load Iris and hold out 20% of the rows for testing
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Every combination of these values is evaluated with 5-fold cross-validation
params = {
    'max_depth': [2, 3, 4, 5, 10],
    'min_samples_split': [2, 3, 4, 5, 10]
}
model = DecisionTreeClassifier()
grid = GridSearchCV(model, param_grid=params, cv=5, verbose=0)
grid.fit(X_train, y_train)

# refit=True (the default) retrains the best combination on the full training set
y_pred = grid.best_estimator_.predict(X_test)

print('Best parameters: ', grid.best_params_)
print('Accuracy:', accuracy_score(y_test, y_pred))
```

Output:

```
Best parameters:  {'max_depth': 4, 'min_samples_split': 3}
Accuracy: 0.9666666666666667
```


**Question 10**: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

And describe what business value this model could provide in the real-world
setting.

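**Answer**:
- Handle the missing values -> Drop columns that are mostly missing; impute numeric features with the median and categorical features with the mode, fitting the imputers on the training split only to avoid data leakage
- Encode the categorical features -> One-Hot encoding for nominal features and Ordinal encoding for ordered ones
- Train Decision Tree -> Split the data into train and test sets, then fit the DecisionTreeClassifier on the preprocessed training data
- Tune Hyperparameters -> Use GridSearchCV over max_depth, min_samples_split, criterion and similar parameters
- Evaluate performance -> Check accuracy, precision, recall, the confusion matrix, and ROC-AUC on the held-out test set; for disease detection, recall deserves particular attention because a missed case is usually costlier than a false alarm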
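
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the toy `disease` label, and the missing-value pattern are synthetic stand-ins for real patient records. Scoring the grid search on recall is likewise one possible choice, not a requirement:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east"], n),
})
df["disease"] = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)  # toy label

# Simulate missing values in one numeric and one categorical column
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Steps 1 and 2: impute and encode inside the pipeline, so the imputers
# and the encoder are fitted on training folds only (no leakage)
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker", "region"]),
])

# Step 3: the decision tree is the final pipeline step
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 4: tune the tree's hyperparameters with 5-fold cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 5],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```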


#### Business value provided by the model:
- Predicting disease risk early supports faster diagnosis, reduces costs, and improves patient outcomes