# Decision Tree Assignment
Prepared by: Sanskriti Jaiswal
Assignment Code: DA-AG-012

### Question 1: What is a Decision Tree, and how does it work in the context of classification?
**Answer:**
A Decision Tree is a flowchart-like model used for both classification and regression. In classification, it recursively splits the dataset into smaller subsets based on the values of the input features. Each internal node represents a test on a feature, each branch an outcome of that test, and each leaf node a class label. To classify a sample, the model starts at the root, follows the branch that matches the sample's feature values at each node, and predicts the class stored at the leaf it reaches. Decision Trees are simple and interpretable, and they work well on many classification problems.
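
To make the node, branch, and leaf vocabulary concrete, here is a minimal sketch of how a tree classifies one sample. The tree itself, its thresholds, and the feature names are hypothetical and written by hand for illustration; they are not learned from data.

```python
# A hand-written (hypothetical) decision tree stored as nested dicts.
# Internal nodes hold a feature test; leaves hold a class label.
TREE = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # taken when petal_length <= 2.5
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # taken when petal_width <= 1.7
        "right": {"label": "virginica"},    # taken when petal_width > 1.7
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per test."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


flower = {"petal_length": 4.8, "petal_width": 1.5}
prediction = predict(TREE, flower)
print(prediction)  # versicolor
```

A trained model does the same walk; the difference is that the tests and thresholds are chosen automatically from the training data.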

### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
**Answer:**
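- **Gini Impurity** is the probability that a randomly chosen sample from a node would be misclassified if it were labeled at random according to the node's class distribution. A value of 0 means the node is pure.
- **Entropy** is a measure from information theory that quantifies uncertainty. Lower entropy means less uncertainty about the class, that is, a purer node.

Both metrics are used to score candidate splits. For each candidate, the tree computes the size-weighted average impurity of the resulting child nodes and chooses the split that lowers impurity the most. In practice Gini and Entropy usually select similar splits; Gini is slightly cheaper to compute because it needs no logarithm.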
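
The two measures can be computed by hand from the class counts in a node. The sketch below uses only the standard library; the helper names `gini` and `entropy` and the toy label lists are assumptions for illustration, not part of scikit-learn.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p_k * log2(p_k) over the classes."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["a"] * 4               # one class only
mixed = ["a", "a", "b", "b"]   # 50/50 split, the worst case for two classes

print(gini(pure), entropy(pure))    # 0.0 0.0
print(gini(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why a lower value indicates a better child node.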

### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
**Answer:**
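- **Pre-Pruning** stops tree growth early by setting constraints such as `max_depth` or `min_samples_split`. *Advantage:* Training is faster, and the smaller tree is less likely to overfit.
- **Post-Pruning** lets the tree grow fully and then removes branches that contribute little to predictive performance (for example, through cost-complexity pruning). *Advantage:* Pruning decisions are made after the full structure is known, so useful deeper splits are kept while branches that only fit noise are removed, which tends to improve accuracy on unseen data.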
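
Both strategies are available in scikit-learn, and the sketch below contrasts them on the Iris data. The specific constraint values and the choice of a mid-range `ccp_alpha` are assumptions made for illustration; in practice they would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constraints stop growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=2, min_samples_split=10, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then collapse weak branches.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Assumption for illustration: pick an alpha from the middle of the path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

# Report tree size and test accuracy for each variant.
for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, model.tree_.node_count, round(model.score(X_test, y_test), 3))
```

The node counts show how much each strategy shrinks the tree; whether accuracy improves depends on the dataset and on the chosen settings.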

### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
**Answer:**
Information Gain measures the reduction in entropy achieved by splitting a dataset on an attribute: the entropy of the parent node minus the size-weighted average entropy of the child nodes. It indicates how much a feature tells us about the class. At each node, the split with the highest Information Gain is chosen. This matters because it makes each split as informative as possible, which tends to produce smaller and more accurate trees.
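
As a worked example, the sketch below computes Information Gain for two candidate splits of the same parent node. The helper names and the toy labels are assumptions for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# Candidate split A separates the two classes perfectly.
split_a = [["yes"] * 5, ["no"] * 5]
# Candidate split B leaves both children as mixed as the parent.
split_b = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)  # 1.0 0.0
```

Split A removes all uncertainty and would be chosen; split B tells us nothing about the class, so its gain is zero.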

### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
**Answer:**
**Applications:**
- Healthcare diagnosis
- Credit scoring
- Fraud detection
- Marketing segmentation

**Advantages:**
- Easy to understand and visualize
- Requires little data preprocessing
- Works for both classification and regression

**Limitations:**
- Prone to overfitting
- Can be biased toward the majority class when classes are imbalanced
- Unstable: small changes in the data can produce a very different tree

In [None]:
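# Question 6
# Train a Decision Tree (Gini criterion) on Iris and report accuracy and feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# random_state makes the fitted tree reproducible across runs
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
# Pair each importance with its feature name so the output is readable
print("Feature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"  {name}: {importance:.3f}")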

In [None]:
# Question 7
# Compare a fully grown tree with one limited to max_depth=3 (reuses the Iris split from Question 6)
full_tree = DecisionTreeClassifier(random_state=42)
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

full_tree.fit(X_train, y_train)
limited_tree.fit(X_train, y_train)

print("Full Tree Accuracy:", accuracy_score(y_test, full_tree.predict(X_test)))
print("Depth-3 Tree Accuracy:", accuracy_score(y_test, limited_tree.predict(X_test)))

In [None]:
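# Question 8
# Fit a Decision Tree regressor on the California Housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

housing = fetch_california_housing()
# Separate names keep the Iris split from Question 6 intact for the classifier cells
X_h, y_h = housing.data, housing.target
X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(X_h, y_h, test_size=0.3, random_state=42)

# random_state makes the fitted tree reproducible across runs
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_h_train, y_h_train)
y_h_pred = reg.predict(X_h_test)

print("MSE:", mean_squared_error(y_h_test, y_h_pred))
print("Feature Importances:")
for name, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"  {name}: {importance:.3f}")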

In [None]:
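# Question 9
from sklearn.model_selection import GridSearchCV

# Load the Iris split explicitly so this cell does not depend on whichever
# dataset X_train last held; a classifier cannot be fit on a continuous target.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Search over tree depth and the minimum samples needed to split a node
params = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=params, cv=5)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy, not the test accuracy
print("Best CV Accuracy:", grid.best_score_)
print("Test Accuracy:", accuracy_score(y_test, grid.predict(X_test)))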

### Question 10:
**Answer:**
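Step-by-step approach:
1. **Handle Missing Values:**
   - Impute missing values (mean or median for numerical features, mode for categorical features). Fit the imputer on the training data only so that no information leaks from the test set.
2. **Encode Categorical Features:**
   - Use `OneHotEncoder` for nominal features or `OrdinalEncoder` for ordered ones. `LabelEncoder` is intended for the target column, not for input features.
3. **Train a Decision Tree:**
   - Split the data into training and test sets, then fit the tree on the training set.
4. **Tune Hyperparameters:**
   - Use `GridSearchCV` to optimize parameters such as `max_depth` and `min_samples_split`.
5. **Evaluate Performance:**
   - Report accuracy, precision, recall, and the confusion matrix. For disease detection, recall deserves particular attention because a missed case is usually costlier than a false alarm.

**Business Value:**
The model can support early disease detection, reduce healthcare costs, and improve patient outcomes by helping doctors make data-driven decisions.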
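
The five steps above can be combined into a single scikit-learn pipeline, sketched below. The dataset is synthetic and every column name (`age`, `blood_pressure`, `smoker`) is hypothetical; scoring on recall is also an assumption, chosen because missed cases are typically the costlier error in disease detection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; all column names here are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)
# Knock out some values to mimic missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

# Steps 1 and 2: impute and encode inside the pipeline, so both are
# fitted on the training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 3: hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune the tree's hyperparameters with cross-validation.
grid = GridSearchCV(
    model,
    param_grid={"tree__max_depth": [2, 3, 4, 5], "tree__min_samples_split": [2, 5, 10]},
    cv=5,
    scoring="recall",
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
print("Best params:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means `GridSearchCV` refits them on each training fold, which avoids leaking test information into preprocessing.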