**Question 1: What is a Decision Tree, and how does it work in the context of
classification?**

ans:

A **Decision Tree** is a supervised machine learning algorithm that predicts an output by following a series of decision rules based on the values of input features. In classification, it assigns data points to predefined categories.


**How it works in classification:**

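1. **Root node:** The algorithm starts with the entire training dataset at the root node.

2. **Best split selection:** At each node, it evaluates all features and possible split points using an impurity measure such as Gini Impurity or Entropy. The split that produces the purest child nodes is chosen.

3. **Stopping criteria:** Splitting stops when a node becomes pure (all of its samples share one class) or when a limit such as a maximum depth or a minimum number of samples per node is reached. Each leaf then predicts the majority class of the training samples it contains.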
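The steps above can be seen in the rules a fitted tree actually learns. The following is a minimal sketch, assuming scikit-learn's bundled Iris data and an arbitrary depth limit of 2 to keep the printout short; neither choice comes from the answer itself.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris ships with scikit-learn, so no download is needed
iris = load_iris()

# A shallow tree keeps the printed rules short (depth 2 is an arbitrary choice)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line is a decision rule; each "class:" line is a leaf prediction
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path a sample takes from the root node to a leaf.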









**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?**


ans:

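1. **Gini impurity:**

 - Measures how often a randomly chosen element from the node would be incorrectly classified if it were randomly labeled according to the distribution of labels in the node. For class proportions $p_i$, it equals $1 - \sum_i p_i^2$.

2. **Entropy:**

 - Comes from information theory; measures the amount of disorder or uncertainty in the node. For class proportions $p_i$, it equals $-\sum_i p_i \log_2 p_i$.

**Impact on splits:** Both measures are 0 for a pure node and largest when the classes are evenly mixed. At each node, the tree picks the split that most reduces the weighted impurity of the child nodes, and the two criteria usually select similar splits.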
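To make the two measures concrete, here is a small sketch that computes both from a list of class labels. The helper names (`class_proportions`, `gini_impurity`, `entropy`) and the toy labels are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


mixed = ["a", "a", "b", "b"]  # evenly mixed two-class node
pure = ["a", "a", "a", "a"]   # pure node

print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
```

Both functions return 0 for the pure node and their two-class maximum for the evenly mixed node, which is why either can be used to rank candidate splits.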



**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.**

ans:

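**Post-pruning:** The tree is grown to its maximum size, then branches that provide little predictive power are removed. Practical advantage: because the full tree is seen before anything is cut, a useful split deeper in the tree is less likely to be missed.

**Pre-pruning:** The growth of the tree is stopped early by limits such as `max_depth` or `min_samples_split`, which are usually tuned (for example with cross-validation) to find the best parameter values. Practical advantage: training is faster and cheaper, because the full tree is never built.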
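Both styles of pruning are available in scikit-learn. The sketch below assumes the Iris data, arbitrary pre-pruning limits, and cost-complexity pruning (`ccp_alpha`) as the post-pruning method; the mid-range alpha is picked only for illustration and would normally be tuned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: limit growth while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with cost-complexity pruning
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Arbitrary mid-range alpha for illustration; clamp to guard against rounding
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, model.tree_.node_count, model.score(X_test, y_test))
```

The node counts show how much each approach shrinks the tree relative to the fully grown one.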



**Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?**

ans:

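**Information gain:** The reduction in impurity (typically entropy) obtained by splitting a node: the parent's entropy minus the weighted average entropy of its child nodes. The tree uses it to choose the best feature and threshold at the root node and at every later split, since the split with the highest gain separates the classes most cleanly.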
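As a worked illustration, the sketch below computes information gain for two candidate splits of a toy node. The function names and the labels are assumptions made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]  # children are as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree comparing these two candidates would pick the first, because it removes all of the uncertainty in the parent node.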

**Question 5: What are some common real-world applications of Decision Trees?**


ans:


**Common real-world applications:**

1. **Credit scoring:** Classifying loan applicants as “low risk” or “high risk” based on financial history.

2. **Fraud detection:** Identifying suspicious transactions.



**Question 6: Write a Python program to:**

- Load the Iris Dataset
- Train a Decision Tree Classifier using the Gini criterion
- Print the model's accuracy and feature importances

In [None]:
from sklearn.datasets import load_iris

In [None]:
data = load_iris()

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data.data,columns=data.feature_names)

In [None]:
X = df
y = data.target

In [None]:
#train test split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
#train Decision tree classifier with Gini criterion
from sklearn.tree import DecisionTreeClassifier

In [None]:
clf = DecisionTreeClassifier(criterion="gini",random_state=42)

In [None]:
clf.fit(X_train,y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
#calculate accuracy
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test,y_pred)

1.0
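The cells above report accuracy only, while the task also asks for feature importances. A minimal self-contained sketch, assuming the same 70-30 split and `random_state=42` used above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train with the Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy:", accuracy)

# feature_importances_ holds one non-negative weight per feature, summing to 1
importances = dict(zip(data.feature_names, clf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Features that the tree never splits on receive an importance of 0.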

**Question 7: Write a Python program to:**

- Load the Iris Dataset
- Train a Decision Tree Classifier with **max_depth=3** and compare its accuracy to a fully-grown tree.

In [None]:
iris = load_iris()

In [None]:
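X = iris.data
y = iris.target

In [None]:
# Split into train/test sets (stratified by class)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
# Tree with max_depth=3
clf_md3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_md3.fit(X_train, y_train)

In [None]:
# Fully grown tree (no depth limit) for comparison
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

In [None]:
# Compare test accuracy of the two trees
print("max_depth=3 accuracy:", accuracy_score(y_test, clf_md3.predict(X_test)))
print("Fully grown accuracy:", accuracy_score(y_test, clf_full.predict(X_test)))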


**Question 8: Write a Python program to:**
- Load the California Housing dataset from sklearn
- Train a Decision Tree Regressor
- Print the Mean Squared Error (MSE) and feature importances

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Set up the grid of parameters to search
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5, n_jobs=-1)

# Fit GridSearch to the training data
grid_search.fit(X_train, y_train)

# Best parameters found
best_params = grid_search.best_params_
print("Best parameters:", best_params)

# Evaluate the best estimator on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
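The cell above tunes an Iris classifier, whereas the task statement asks for a regressor on California Housing. The following is a rough sketch of that task. `fit_and_report` and `run_on_california_housing` are hypothetical helper names, the 80-20 split is an assumption, and the `make_regression` data is only an offline stand-in so the example runs without a download (`fetch_california_housing` fetches the data on first use).

```python
from sklearn.datasets import fetch_california_housing, make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def fit_and_report(X, y, feature_names):
    """Train a DecisionTreeRegressor; return test MSE and importances by name."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    reg = DecisionTreeRegressor(random_state=42)
    reg.fit(X_train, y_train)
    mse = mean_squared_error(y_test, reg.predict(X_test))
    importances = dict(zip(feature_names, reg.feature_importances_))
    return mse, importances


def run_on_california_housing():
    """Run the report on the real dataset (requires a one-time download)."""
    housing = fetch_california_housing()
    return fit_and_report(housing.data, housing.target, housing.feature_names)


# Offline stand-in data (synthetic, not California Housing)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
demo_mse, demo_importances = fit_and_report(X_demo, y_demo, ["f0", "f1", "f2", "f3"])
print("MSE:", demo_mse)
print("Feature importances:", demo_importances)

# For the actual task, call:
# mse, importances = run_on_california_housing()
```

Calling `run_on_california_housing()` in a notebook with internet access prints nothing by itself; wrap the returned values in `print` to report the MSE and importances.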

**Question 9: Write a Python program to:**

- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy

In [2]:
from sklearn.datasets import load_iris

In [3]:
#load dataset
iris = load_iris()

In [4]:
X = iris.data
y = iris.target

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Split into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
from sklearn.tree import DecisionTreeClassifier

In [8]:
clf = DecisionTreeClassifier()

In [9]:
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, None],
    'min_samples_split': [2, 5, 10]
}

In [10]:
from sklearn.model_selection import GridSearchCV

In [11]:
gscv = GridSearchCV(clf,param_grid=param_grid,cv = 5)

In [12]:
gscv.fit(X_train,y_train)

In [13]:
gscv.best_params_

{'max_depth': 4, 'min_samples_split': 2}

In [14]:
clf = DecisionTreeClassifier(max_depth=4,min_samples_split=2)

In [15]:
clf.fit(X_train,y_train)

In [17]:
y_pred = clf.predict(X_test)

In [18]:
from sklearn.metrics import accuracy_score

In [19]:
accuracy_score(y_test,y_pred)

1.0

**Question 10:
Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.**

- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model



1. **Handle missing values**

- Understand the missingness: First, analyze how data is missing. Is it random, or does it follow a pattern? This affects how you handle it.

2. **Encode Categorical Features**

- Identify categorical variables: This could be patient gender, blood type, or any non-numeric info.

3. **Train a Decision Tree Model**

- Split the data: Use an 80-20 or 70-30 split between training and test sets, or use cross-validation to ensure your results generalize.
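The three steps can be combined in a single scikit-learn pipeline. This is a minimal sketch under stated assumptions: the patient table, its column names, and the label are all synthetic and made up for illustration, median and most-frequent imputation are just one reasonable choice, and `max_depth=4` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (hypothetical columns and label)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], size=n),
})
y = (df["age"] + rng.normal(0, 10, size=n) > 50).astype(int)

# Knock out some values to imitate missing data
df.loc[rng.choice(n, size=20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "blood_type"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["gender", "blood_type"]

preprocess = ColumnTransformer([
    # Step 1: fill numeric gaps with the median
    ("num", SimpleImputer(strategy="median"), numeric),
    # Steps 1 and 2: fill categorical gaps, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: train the tree on the preprocessed features
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Keeping imputation and encoding inside the pipeline means they are fitted on the training split only, so no information from the test set leaks into preprocessing.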
