Question 1: What is a Decision Tree, and how does it work in the context of
classification?

* A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
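* A decision tree starts with a root node, which does not have any incoming branches. The outgoing branches from the root node feed into the internal nodes, also known as decision nodes. The root and each internal node test one of the available features and split the data into more homogeneous subsets, which end in leaf nodes, or terminal nodes. The leaf nodes represent the possible outcomes within the dataset. To classify a new sample, the tree follows the branch that matches the sample's feature value at each node until it reaches a leaf, and the class stored at that leaf (the majority class of the training samples that reached it) is the prediction.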
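
As a rough illustration of how a sample travels from the root to a leaf, the sketch below uses a small hand-written tree. The `Node` class, the feature names, and the thresholds are hypothetical and chosen only for illustration; they are not learned from data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a hypothetical binary decision tree."""
    feature: Optional[str] = None      # feature tested at a decision node
    threshold: Optional[float] = None  # go left if value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # set only on leaf nodes


def predict(node: Node, sample: dict) -> str:
    """Follow branches from the root until a leaf is reached."""
    while node.label is None:
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Hand-built tree: root node -> one internal node -> three leaves.
tree = Node(
    feature="petal_length", threshold=2.5,
    left=Node(label="setosa"),
    right=Node(
        feature="petal_width", threshold=1.7,
        left=Node(label="versicolor"),
        right=Node(label="virginica"),
    ),
)

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))
```

A trained tree works the same way at prediction time; the difference is that the features and thresholds are chosen by the learning algorithm.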


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?

* Entropy measures the impurity (uncertainty) of the class labels at a node, and the tree uses it to select the best split. For binary classification, entropy lies between 0 (a pure node) and 1 (an even mix of both classes). The entropy of a set $S$ can be calculated with this formula:

$$H(S) = -P_{(+)} \log_2 P_{(+)} - P_{(-)} \log_2 P_{(-)}$$

  Here $P_{(+)}$ is the proportion of the positive class and $P_{(-)}$ is the proportion of the negative class in $S$.
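* At each node, the algorithm calculates the entropy of the subsets produced by each candidate feature split and selects the split that reduces entropy the most. It then repeats the process on the resulting child nodes.
* For binary classification, entropy rises from 0 to a maximum of 1 as the positive-class proportion goes from 0 to 0.5, then falls back to 0 as the proportion approaches 1.
* In general, with $C$ classes, the range of entropy is from 0 to $\log_2 C$.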
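* The Gini Impurity of a node after splitting can be calculated with this formula:

$$GI = 1 - \sum_{i=1}^{n} (p_i)^2$$

  where $p_i$ is the proportion of class $i$ among the $n$ classes. For binary classification this becomes:

$$GI = 1 - \left[ (P_{(+)})^2 + (P_{(-)})^2 \right]$$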
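* For binary classification, Gini Impurity peaks at 0.5 (an even class mix) and falls back to 0 for a pure node. It is also slightly cheaper to compute than entropy because it involves no logarithm.
* The range of Gini Impurity is from 0 to 0.5 for binary classification, and from 0 to $1 - 1/C$ for $C$ classes.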
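
The two measures can be compared with a short sketch. The helper names `entropy` and `gini` are illustrative and are not part of any library; each takes a list of class proportions that sum to 1.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; p * log2(1/p) is the same as -p * log2(p)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)


# Sweep the positive-class proportion for a binary problem.
for p_pos in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    probs = [p_pos, 1.0 - p_pos]
    print(f"P(+)={p_pos:.1f}  entropy={entropy(probs):.3f}  gini={gini(probs):.3f}")
```

Both measures are 0 for a pure node and largest at an even class mix, so they usually rank candidate splits in a similar order.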


Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
* Pre-pruning, also called early stopping, is a technique where you halt the construction of the decision tree before it fully grows. The idea is to prevent the tree from overfitting by limiting its size during the initial building phase.
* Example: In areas like financial trading or online recommendation systems, where speed is critical, pre-pruning can be a game-changer. Let’s say you’re building a model to recommend products to users as they shop online. Here, milliseconds matter. You want a decision tree that can give quick predictions without being bogged down by too many complex branches. Pre-pruning helps you limit the depth of the tree so the model stays nimble, even if you lose a bit of accuracy.
* Post-pruning (cost-complexity pruning is one common method) works by allowing the decision tree to grow as much as it wants, capturing all the patterns (and possibly noise). After the tree is fully grown, we then come in and systematically remove the branches or splits that don’t contribute much to the model’s performance. Essentially, we’re simplifying the tree after the fact.
* Example: In fields like medical diagnosis, the stakes are high. You want your model to be as accurate as possible, even if it means more computational work. Post-pruning allows you to first capture all potential interactions within the data, then simplify the tree to focus only on the branches that make meaningful contributions. For instance, in diagnosing diseases, the decision tree might initially grow very large, considering all sorts of patient symptoms and medical history, but post-pruning ensures that the tree only retains branches that genuinely impact the diagnosis.
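
Both approaches can be tried in scikit-learn. The sketch below is only an illustration on the Iris data: the specific limits (`max_depth=3`, `min_samples_leaf=5`) and the choice of the middle value on the pruning path are arbitrary assumptions, and in practice the pruning strength would be selected with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=1)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back
# with cost-complexity pruning (ccp_alpha).
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary pick for illustration: the middle alpha on the pruning path.
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```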



Question 4: What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
* Information Gain (IG) is a measure used in decision trees to quantify the effectiveness of a feature in splitting the dataset into classes. It calculates the reduction in entropy (uncertainty) of the target variable (class labels) when a particular feature is known.
* Decision trees aim to create subsets that are as pure as possible.
* A split with higher IG reduces more uncertainty, meaning the feature provides more information about the class labels.
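* By selecting the feature with the highest IG at each node, the decision tree makes the most informative split available at that step, which tends to produce smaller trees and better classification performance. The choice is greedy, so it does not guarantee the best possible tree overall.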
* Without IG, splits could be chosen arbitrarily and may not improve the predictive power of the model.
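
Information Gain is the parent node's entropy minus the size-weighted average entropy of the child nodes: $IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$. A minimal sketch with made-up labels follows; the function names and the two example splits are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B barely changes the mix.
split_a = [["yes"] * 5, ["no"] * 5]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

print("IG of split A:", information_gain(parent, split_a))
print("IG of split B:", information_gain(parent, split_b))
```

Split A has the higher gain, so a tree using this criterion would prefer it.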


Question 5: What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
* Loan Approval in Banking: Banks use Decision Trees to assess whether a loan application should be approved. The decision is based on factors like credit score, income, employment status and loan history. Predicting approval or rejection in this way enables quick and consistent decisions.
* Medical Diagnosis: In healthcare, Decision Trees assist in diagnosing diseases. For example, they can predict whether a patient has diabetes based on clinical data like glucose levels, BMI and blood pressure. This helps classify patients into diabetic or non-diabetic categories, supporting early diagnosis and treatment.
* Fraud Detection: In finance, Decision Trees are used to detect fraudulent activities, such as credit card fraud. By analyzing past transaction data and patterns, Decision Trees can identify suspicious activities and flag them for further investigation.
* Advantages: Easy to Understand, Versatility (classification and regression), No Need for Feature Scaling, Handles Non-linear Relationships, Can Handle Missing Data (in some implementations).
* Limitations: Overfitting, Bias towards Features with Many Categories, Difficulty in Capturing Complex Interactions, Computationally Expensive for Large Datasets.






In [None]:
'''Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances'''
from sklearn.datasets import load_iris
data=load_iris()
import pandas as pd
df=pd.DataFrame(data.data,columns=data.feature_names)
df['target']=data.target

from sklearn.model_selection import train_test_split
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)

from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier(criterion='gini')
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.9666666666666667
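
The Question 6 cell prints only the accuracy. Feature importances can be read from the fitted model's `feature_importances_` attribute. A minimal sketch follows, refitting the same kind of model; the `random_state=1` on the classifier is an added assumption for reproducibility and is not in the original cell.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# random_state on the model is an assumption added for repeatable results.
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)

# One importance value per feature; the values sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```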


In [None]:
'''Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.'''
from sklearn.datasets import load_iris
data=load_iris()
import pandas as pd
df=pd.DataFrame(data.data,columns=data.feature_names)
df['target']=data.target

from sklearn.model_selection import train_test_split
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Depth-limited tree: the question asks for max_depth=3.
# Note: this tree uses entropy, while the full tree below uses the default Gini.
classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
print(accuracy_score(y_test, y_pred))

# Fully grown tree (no depth limit) for comparison.
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))


0.9666666666666667
0.9666666666666667


In [10]:
'''Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances'''
import pandas as pd
df=pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
from sklearn.tree import DecisionTreeRegressor
model=DecisionTreeRegressor()
model.fit(x_train,y_train)
y_pred=model.predict(x_test)
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test,y_pred))


feature_importances = pd.DataFrame({ "Feature": x.columns, "Importance": model.feature_importances_ }).sort_values(by="Importance", ascending=False)
print(feature_importances)

17.756078431372547
    Feature  Importance
12    lstat    0.534187
5        rm    0.255755
7       dis    0.077248
0      crim    0.040197
4       nox    0.033416
10  ptratio    0.020333
6       age    0.011827
2     indus    0.008660
9       tax    0.007822
11        b    0.005689
3      chas    0.003115
1        zn    0.001228
8       rad    0.000523


In [30]:
'''Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy'''
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

dt = DecisionTreeClassifier(random_state=1)


params = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8, 10]
}

grid = GridSearchCV(dt, param_grid=params, cv=5, scoring='accuracy', verbose=1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Test Set Accuracy:", accuracy)


Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Test Set Accuracy: 0.9666666666666667


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting
* First, I would understand the data: what each column means, its data type, and what the target is.
* Do EDA on the dataset.
* Apply df.isnull().sum() to check whether any null values are present. If they are, replace them with the mean or median for numerical columns and the mode for categorical columns, depending on the data.
* Encode the categorical features using one-hot encoding, label encoding, target encoding, etc., depending on the type of categorical data.
* Then I would apply a train-test split.
* Initialize a Decision Tree classifier (the target here is whether the patient has the disease). The Decision Tree algorithm then learns decision rules based on feature splits to predict the target variable.
* Tune the model using techniques like Grid Search or Random Search with cross-validation, over hyperparameters such as max_depth, min_samples_split and criterion.
* Take the best parameters and evaluate the resulting model on the test set using accuracy and, because missed cases matter in healthcare, metrics such as precision, recall and the confusion matrix.
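
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is a stand-in: the column names (`age`, `glucose`, `smoker`, `disease`), the synthetic values, and the rule that generates the label are hypothetical, and `scoring="recall"` is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "glucose": rng.normal(110, 25, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["glucose"] > 120) | (df["age"] > 65)).astype(int)

# Inject some missing values into one numeric and one categorical column.
df.loc[rng.choice(n, 30, replace=False), "glucose"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

numeric = ["age", "glucose"]
categorical = ["smoker"]

# Median for numeric columns; mode plus one-hot encoding for categorical ones.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Putting the imputers and the encoder inside the pipeline means they are fitted only on the training folds during cross-validation, which avoids leaking information from the validation data.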


Business value: this model can support diagnosis and early detection of the disease, which in turn helps with prevention and with choosing treatment for the patients it flags.