#ASSIGNMENT

QUES1. What is a Decision Tree, and how does it work in the context of classification?
- A Decision Tree is a supervised machine learning model structured like a flowchart. Each internal node asks a yes/no question about one feature, and each answer sends the sample down one branch, so the data is split into smaller and smaller groups. In classification, a sample follows these questions from the root until it reaches a leaf node, which assigns the final class. For example, to classify a flower, the tree may ask questions about petal length, petal width, etc., and finally decide which species it belongs to. It is simple to understand and easy to visualize.
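
To make the flowchart idea concrete, a tiny tree can be written by hand as nested conditions. This is only a sketch: the function name and the thresholds below are illustrative assumptions, not values learned from data.

```python
def classify_iris(petal_length_cm: float, petal_width_cm: float) -> str:
    """Toy hand-written tree; the thresholds are illustrative, not learned."""
    # Root node: first yes/no question
    if petal_length_cm < 2.5:
        return "setosa"          # leaf
    # Internal node: second yes/no question
    if petal_width_cm < 1.8:
        return "versicolor"      # leaf
    return "virginica"           # leaf


print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.5))
print(classify_iris(6.0, 2.2))
```

A trained tree has the same shape; the difference is that the algorithm chooses the questions and thresholds from the training data.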

QUES2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
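- Both Gini Impurity and Entropy measure how "impure" (mixed) the classes are within a group of samples.
- Gini Impurity is the probability of misclassifying a randomly chosen item if its label is drawn at random according to the class proportions in the group. It equals 1 minus the sum of the squared class proportions, so lower Gini means a purer group.
- Entropy measures disorder or uncertainty in the class labels. If all items belong to one class, entropy is 0 (pure); it is highest when the classes are evenly mixed.

In a Decision Tree, the algorithm evaluates candidate splits and picks the one that reduces impurity the most, measured by Gini or entropy. Each chosen split makes the resulting groups more homogeneous, which is what lets the leaves predict a single class reliably.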
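
Both measures can be computed directly from the class counts in a group. The following is a minimal sketch using the standard definitions; the helper names `gini` and `entropy` and the toy label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["a", "a", "a", "a"]    # only one class
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini ranges from 0 to 0.5 and entropy from 0 to 1 bit, with both reaching their maximum on an even mix.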

QUES3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
- Pre-Pruning means stopping the tree from growing too deep in the first place, for example by setting a maximum depth or a minimum number of samples per split. Its main advantage is that training is faster and overfitting is reduced, because the extra branches are never built.
- Post-Pruning means letting the tree grow fully and then cutting back branches that add little predictive value. Its advantage is that pruning decisions are based on the fully grown tree, so it can produce a simpler model while keeping accuracy high.

Both methods help avoid overly complex trees and improve generalization on new data.
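
The two approaches can be compared in scikit-learn roughly as follows. This is a sketch under assumptions: the Iris data, the specific `max_depth` and `min_samples_split` values, and the choice of the middle `ccp_alpha` from the pruning path are all arbitrary illustrations. Cost-complexity pruning is used here as one form of post-pruning.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unpruned reference tree
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: limit growth while the tree is being built
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: get candidate alphas from the fully grown tree, then fit a
# tree that is pruned with one of them (the middle value is an arbitrary pick)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(0.0, float(path.ccp_alphas[len(path.ccp_alphas) // 2]))
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```

In practice the pruning strength would normally be chosen by cross-validation rather than picked by position in the path.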

QUES4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
- Information Gain measures how much impurity decreases after making a split. It is the impurity of the parent group (usually its entropy) minus the weighted average impurity of the child groups, where each child is weighted by its share of the samples. The higher the Information Gain, the better the split.

It is important because it guides the tree to choose the most useful feature and threshold at each step. Splits with high Information Gain separate the classes quickly, which tends to give a smaller, more accurate, and easier-to-read tree than arbitrary splits would.
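
The calculation can be sketched for a single binary split as below. The helper names and the four-sample toy data are assumptions for illustration; entropy is used as the impurity measure.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children


parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly
perfect = information_gain(parent, ["yes", "yes"], ["no", "no"])
# A split that leaves both children as mixed as the parent
useless = information_gain(parent, ["yes", "no"], ["yes", "no"])

print("perfect split gain:", perfect)
print("useless split gain:", useless)
```

The tree-building algorithm runs this kind of comparison over many candidate splits and keeps the one with the largest gain.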

QUES5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
- Applications of Decision Trees include:  
- Medical diagnosis (predicting diseases)  
- Banking and finance (loan approval, fraud detection)  
- Marketing (customer churn prediction)  
- E-commerce (product recommendations)  

Advantages: They are easy to understand, can handle both numbers and categories, and require little data preparation.  
Limitations: They can overfit the training data, are sensitive to small changes in that data (a few different samples can produce a very different tree), and sometimes grow very large and become hard to interpret.  
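
The "easy to understand" advantage can be seen by printing a small tree as text rules. This is a sketch assuming the Iris dataset and a depth limit of 2, both chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read at a glance
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the learned splits as indented if/else style rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Removing the depth limit shows the limitation as well: the printed rules get longer and harder to follow as the tree grows.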


In [None]:
#QUES6.Write a Python program to:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model with Gini
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_train, y_train)

# Print accuracy and feature importance
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("Feature Importances:", clf.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.01667014 0.         0.40593501 0.57739485]


In [None]:
#QUES7.Write a Python program to:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Limited depth
clf1 = DecisionTreeClassifier(max_depth=3)
clf1.fit(X_train, y_train)

# Full tree
clf2 = DecisionTreeClassifier()
clf2.fit(X_train, y_train)

print("Max_depth=3 Accuracy:", accuracy_score(y_test, clf1.predict(X_test)))
print("Full Tree Accuracy:", accuracy_score(y_test, clf2.predict(X_test)))


Max_depth=3 Accuracy: 1.0
Full Tree Accuracy: 1.0


In [None]:
#QUES8.Write a Python program to:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train regressor
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

# Print results
print("MSE:", mean_squared_error(y_test, reg.predict(X_test)))
print("Feature Importances:", reg.feature_importances_)


MSE: 0.5003812696454941
Feature Importances: [0.52777883 0.05275811 0.0532922  0.02747635 0.03071986 0.13138811
 0.09359448 0.08299207]


In [None]:
#QUES9.Write a Python program to:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# X_train and y_train are reused from the previous cell (California housing),
# so the target is continuous and a regressor is tuned here.

# Parameters to tune
params = {"max_depth": [2, 3, 4, None], "min_samples_split": [2, 3, 4]}

# 3-fold cross-validated grid search; a regressor's default score is R^2
grid = GridSearchCV(DecisionTreeRegressor(), params, cv=3)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
# best_score_ is the mean cross-validated R^2 of the regressor, not an accuracy
print("Best CV R^2:", grid.best_score_)

Best Params: {'max_depth': None, 'min_samples_split': 4}
Best CV R^2: 0.6052932867473401


QUES10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process.
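- Step 1: Handle missing values → Split the data into training and test sets first. For numerical data, fill with the mean or median. For categorical data, fill with the most frequent value. Compute these fill values on the training set only, so nothing leaks from the test set.
- Step 2: Encode categorical features → Convert text into numbers using One-Hot Encoding for unordered categories, or ordinal (label-style) encoding for categories with a natural order.
- Step 3: Train Decision Tree → Use the cleaned training data to train a decision tree classifier.
- Step 4: Tune hyperparameters → Use GridSearchCV to try different max_depth, min_samples_split, etc., and select the best model by cross-validation.
- Step 5: Evaluate performance → Check accuracy, precision, recall, and the confusion matrix on the test set. For disease prediction, recall matters most when missing a sick patient is costlier than a false alarm.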
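
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, and the column names (`age`, `blood_pressure`, `smoker`), the labeling rule, the parameter grid, and the choice of recall as the tuning metric are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (not real medical records)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
# Made-up labeling rule, applied before values are removed
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)
# Remove some values to mimic missing data
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "smoker"] = np.nan

numeric, categorical = ["age", "blood_pressure"], ["smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree sits at the end of the pipeline, so preprocessing is
# fitted on training folds only
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# Step 4: tune hyperparameters (recall as the metric is an assumption)
param_grid = {"tree__max_depth": [2, 3, 5, None], "tree__min_samples_split": [2, 5, 10]}
grid = GridSearchCV(model, param_grid, cv=3, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("Best params:", grid.best_params_)
print("Confusion matrix:\n", cm)
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
```

Wrapping the imputers and encoder in the pipeline means GridSearchCV refits them inside each cross-validation fold, which avoids leaking test-fold statistics into training.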

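Business value: This model can help doctors flag likely cases quickly and consistently, which supports earlier treatment, reduces human error, and saves costs. Because a decision tree's rules can be read directly, clinicians can also check why a prediction was made before acting on it.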
