## 1. Decision Tree - 1_4 Apr

Q1. Describe the decision tree classifier algorithm and how it works to make predictions.

Ans:
    
A decision tree classifier is a popular supervised machine learning algorithm used for classification tasks. It works by constructing a tree-like model of decisions and their possible consequences. The tree consists of nodes and edges: each internal node represents a test on one of the input features, each edge represents an outcome of that test, and each leaf node holds a class label.

The algorithm starts by analyzing the input data and selecting the feature that best separates the data into distinct classes. It then creates a node for this feature and splits the data into subsets based on its possible values. This process is repeated recursively for each subset until all data points in a subset belong to the same class, or until a certain stopping criterion is met (e.g., maximum depth, minimum number of samples per leaf).

To make a prediction for a new input instance, the algorithm traverses the decision tree from the root node down to a leaf node, following the path determined by the input features' values. The class associated with the leaf node reached by the input instance is then returned as the predicted class.
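
The traversal described above can be sketched in a few lines. This is a minimal illustration, not a learned model: the tree is written by hand, and the feature names (`outlook`, `windy`) and class labels are hypothetical. Internal nodes are dicts and leaves are plain strings.

```python
# A hand-built toy tree (hypothetical features and labels, not learned from data).
# Internal node: {"feature": name, "branches": {value: subtree}}; leaf: class label.
TOY_TREE = {
    "feature": "outlook",
    "branches": {
        "sunny": "stay_in",
        "overcast": "play",
        "rainy": {
            "feature": "windy",
            "branches": {True: "stay_in", False: "play"},
        },
    },
}


def predict(tree, instance):
    """Walk from the root to a leaf, following the instance's feature values."""
    node = tree
    while isinstance(node, dict):          # still at an internal (test) node
        value = instance[node["feature"]]  # outcome of the test at this node
        node = node["branches"][value]     # follow the matching edge
    return node                            # a leaf holds the predicted class


print(predict(TOY_TREE, {"outlook": "rainy", "windy": False}))  # play
```

Each loop iteration corresponds to one node on the root-to-leaf path, so the cost of a prediction is bounded by the depth of the tree.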

Q2. Provide a step-by-step explanation of the mathematical intuition behind decision tree classification.

Ans:
    
The decision tree algorithm uses a mathematical approach to construct the tree-like model of decisions and their consequences. The following is a step-by-step explanation of the mathematical intuition behind decision tree classification:

1. Information Gain: The first step in building a decision tree is to select the feature that best separates the data into distinct classes. This is done by calculating the information gain (IG) of each feature. IG measures the reduction in entropy (disorder) of the data when a feature is used to split it into subsets:

IG(D, f) = H(D) - Σ (|D_i| / |D|) H(D_i)

where H is the entropy defined in the next step and D_i is the subset of data points in D that have the value i for feature f. The higher the IG, the better the feature for splitting the data.

2. Entropy: Entropy (H) is a measure of the randomness or disorder of a set of data. In decision tree classification, the entropy of a dataset D is calculated as:

H(D) = - Σ p_i log2 p_i

where p_i is the proportion of data points in D that belong to class i. Entropy is maximum when the data is evenly distributed among all classes, and minimum when all data points belong to the same class.

3. Information Gain Ratio: Information Gain can be biased towards features with a large number of possible values. To address this issue, we can use Information Gain Ratio (IGR) instead. IGR takes into account the intrinsic information of a feature, which is measured as the entropy of the feature values. The formula for IGR is:

IGR(D, f) = IG(D, f) / IV(f)

where f is a feature, IG(D, f) is the information gain of feature f on dataset D, and IV(f) is the intrinsic value of feature f, which is defined as:

IV(f) = - Σ (|D_i| / |D|) log2 (|D_i| / |D|)

where D_i is the subset of data points in D that have the value i for feature f.

4. Gini Index: Another approach to measuring the quality of a split is the Gini index. The Gini index (G) is the probability that a randomly chosen data point from the set would be misclassified if it were labeled at random according to the class distribution of the set. The formula for G is:

G(D) = 1 - Σ p_i^2

where p_i is the proportion of data points in D that belong to class i. The Gini index is minimum when all data points belong to the same class.

5. Splitting: The feature whose split gives the highest IG or IGR, or the lowest weighted average Gini index across the resulting subsets, is selected, and the data is split into subsets based on the values of that feature. This process is repeated recursively for each subset until all data points in a subset belong to the same class, or until a certain stopping criterion is met.

6. Classification: To classify a new data point, the decision tree is traversed from the root node down to a leaf node, following the path determined by the input features' values. The class associated with the leaf node reached by the input instance is then returned as the predicted class.
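
The measures in steps 1 to 4 can be sketched directly from their formulas. This is a minimal sketch, assuming categorical features stored as dicts; the helper names (`proportions`, `split_labels`) and the toy data are invented for illustration and do not come from any particular library.

```python
from collections import Counter
from math import log2


def proportions(labels):
    """Class proportions p_i for a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """H(D) = -sum(p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in proportions(labels))


def gini(labels):
    """G(D) = 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in proportions(labels))


def split_labels(rows, labels, feature):
    """Group labels into subsets D_i by the value each row has for `feature`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    return list(subsets.values())


def information_gain(rows, labels, feature):
    """IG(D, f) = H(D) - sum(|D_i| / |D| * H(D_i))."""
    n = len(labels)
    subsets = split_labels(rows, labels, feature)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)


def gain_ratio(rows, labels, feature):
    """IGR(D, f) = IG(D, f) / IV(f); returned as 0.0 when IV(f) is 0."""
    n = len(labels)
    sizes = [len(s) / n for s in split_labels(rows, labels, feature)]
    intrinsic_value = sum(-w * log2(w) for w in sizes)
    if intrinsic_value == 0:
        return 0.0
    return information_gain(rows, labels, feature) / intrinsic_value


# Toy data, invented for illustration only.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "overcast", "windy": True},
]
labels = ["no", "no", "yes", "yes", "no", "yes"]

# Step 5: pick the feature with the highest information gain.
best = max(["outlook", "windy"], key=lambda f: information_gain(rows, labels, f))
print(best)  # outlook
```

On this toy data, splitting on `outlook` leaves two of the three subsets pure, while splitting on `windy` leaves both subsets mixed, which is why `outlook` has the higher gain.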

Q3. Explain how a decision tree classifier can be used to solve a binary classification problem.

Ans:
    
A decision tree classifier can be used to solve a binary classification problem by constructing a tree-like model that can predict the class label of new input data. The binary classification problem involves predicting one of two possible classes (e.g., positive or negative, true or false, etc.). The following is an explanation of how a decision tree classifier can be used to solve a binary classification problem:

1. Data Preparation: The first step is to prepare the data for training and testing the classifier. The data should be split into two sets: a training set and a test set. The training set is used to build the decision tree, and the test set is used to evaluate its performance.

2. Decision Tree Construction: The decision tree is constructed using the training set by selecting the feature that best separates the data into the two classes. This is done by calculating the information gain or Gini index of each feature. The decision tree is built recursively by splitting the data into subsets based on the possible values of the selected feature. This process is repeated until all data points in a subset belong to the same class or a stopping criterion is met.

3. Prediction: To predict the class label of a new input instance, the decision tree is traversed from the root node down to a leaf node, following the path determined by the input features' values. At each node, the decision tree tests the value of a feature and selects the edge that matches the input value. This process continues until a leaf node is reached, which represents the predicted class label.

4. Evaluation: The performance of the decision tree classifier is evaluated using the test set. The predicted class labels are compared to the true class labels to calculate metrics such as accuracy, precision, recall, and F1 score. These metrics indicate how well the classifier is able to predict the correct class label for new input data.
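
The four steps above can be sketched end to end on a tiny example. To keep it short, this sketch makes several simplifying assumptions: the tree has depth 1 (a decision stump), there is a single numeric feature with at least two distinct values, the split criterion is weighted Gini, and the train/test data are invented and split by hand.

```python
from collections import Counter


def gini(labels):
    """G(D) = 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most common class in a non-empty list of labels."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Choose the threshold on one numeric feature with the lowest weighted Gini."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate split: midpoint between neighbouring values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            # each leaf predicts the majority class of its training points
            best = (score, t, majority(left), majority(right))
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label


def predict(stump, x):
    """Follow the single test to one of the two leaves."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


# 1. Data preparation (invented data, split by hand into train and test).
train_x, train_y = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0], [0, 0, 0, 0, 1, 1, 1, 1]
test_x, test_y = [2.5, 4.5, 5.5, 8.5], [0, 1, 1, 1]

# 2. Tree construction on the training set only.
stump = fit_stump(train_x, train_y)

# 3. Prediction on unseen data.
predictions = [predict(stump, x) for x in test_x]

# 4. Evaluation against the true test labels.
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(stump, predictions, accuracy)
```

A full decision tree would apply `fit_stump` recursively to the left and right subsets, and would consider every feature at each node rather than just one.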

Q4. Discuss the geometric intuition behind decision tree classification and how it can be used to make
predictions.

Ans: 

The geometric intuition behind decision tree classification is that the tree divides the input space into regions using axis-aligned hyperplanes. In a standard tree, each internal node splits the space along a single feature (for a numeric feature, at a threshold such as x_j <= t), producing two or more subregions. The split points and the number of subregions per split depend on the algorithm used to build the decision tree.

Once the input space is partitioned into regions, the decision tree can be used to predict the class label of new input instances. To make a prediction, we start at the root node of the tree and traverse it by following the path determined by the input features' values. At each node, we test the value of a feature and select the edge that matches the input value. This process continues until we reach a leaf node, which represents the predicted class label.

Because each split tests a single feature, each boundary is a hyperplane perpendicular to that feature's axis. In two dimensions these boundaries are horizontal and vertical lines, and the resulting regions are rectangles. Each leaf corresponds to one region and carries one class label. In binary classification there can still be many regions; each one is simply labeled with one of the two classes, and any new input instance falling into a region is assigned that region's label.

The decision tree's ability to divide the input space into regions makes it a powerful tool for classification problems, especially when the classes can be separated by a modest number of axis-aligned splits. It is less efficient when the true boundary is diagonal or curved, which the tree can only approximate with a staircase of many small splits, and when the classes overlap heavily, where deeper trees tend to fit noise rather than the underlying pattern.
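
One way to see the regions is to write a small tree as nested conditions. The sketch below is hand-written for illustration: the two numeric features, the thresholds 5.0 and 3.0, and the labels "A" and "B" are all made up.

```python
def predict(x1, x2):
    """A depth-2 tree over two numeric features (thresholds are illustrative)."""
    if x1 <= 5.0:          # vertical boundary at x1 = 5
        if x2 <= 3.0:      # horizontal boundary at x2 = 3, only left of x1 = 5
            return "A"     # region: x1 <= 5 and x2 <= 3
        return "B"         # region: x1 <= 5 and x2 > 3
    return "A"             # region: x1 > 5, any x2


print(predict(2.0, 1.0), predict(2.0, 4.0), predict(7.0, 10.0))  # A B A
```

The two thresholds cut the plane into three rectangles, one per leaf. Class "A" owns two of them, so even with only two classes the input space is divided into more than two regions.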

Q5. Define the confusion matrix and describe how it can be used to evaluate the performance of a
classification model.

Ans:
    
A confusion matrix is a table that summarizes the performance of a classification model by comparing the predicted labels with the actual labels. For binary classification it is a 2×2 matrix with four entries:

True Positive (TP): The number of instances that are correctly predicted as positive (belonging to the positive class). 

False Positive (FP): The number of instances that are incorrectly predicted as positive (not belonging to the positive class). 

True Negative (TN): The number of instances that are correctly predicted as negative (not belonging to the positive class). 

False Negative (FN): The number of instances that are incorrectly predicted as negative (belonging to the positive class).

A confusion matrix helps to evaluate the performance of a classification model by providing insight into the types of errors that the model is making. By comparing the predicted labels with the actual labels, we can calculate several metrics such as:

    Accuracy : (TP+TN)/(TP+TN+FP+FN)

    Precision: TP/(TP+FP)

    Recall : TP/(TP+FN)

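    F-beta score: (1 + beta^2) × (Precision × Recall) / (beta^2 × Precision + Recall)

With beta = 1 this reduces to the F1 score, 2 × (Precision × Recall) / (Precision + Recall).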
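
The four counts and the metrics built from them can be sketched as follows. This is a minimal sketch: the function names and the label lists are invented for illustration, and the positive class is assumed to be encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn


def fbeta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if undefined."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


# Invented labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, tn, fn)                                     # 3 1 3 1
print(accuracy, precision, recall, fbeta(precision, recall))  # 0.75 0.75 0.75 0.75
```

Values of beta above 1 weight recall more heavily, and values below 1 weight precision more heavily.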


Q6. Provide an example of a confusion matrix and explain how precision, recall, and F1 score can be
calculated from it.

Ans:
    

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | 80 (True Positive) | 20 (False Positive) |
| Predicted Negative | 10 (False Negative) | 90 (True Negative) |

From this confusion matrix, we can calculate the accuracy, precision, recall, and F1 score as follows:

Accuracy = (80+90)/(80+20+10+90) = 0.85

Precision = 80/(80+20) = 0.80

Recall = 80/(80+10) ≈ 0.89

F1 score = 2 × (0.80 × 0.89)/(0.80 + 0.89) ≈ 0.84
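
The arithmetic above can be checked with a few lines of code. The counts are the ones from the example matrix; the variable names are only illustrative.

```python
# Counts taken from the example confusion matrix above.
tp, fp, fn, tn = 80, 20, 10, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")  # 0.85 0.80 0.89 0.84
```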

Q7. Discuss the importance of choosing an appropriate evaluation metric for a classification problem and
explain how this can be done.

Ans:
    
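The evaluation metric defines what counts as a good model, so a poorly chosen metric can make a model look better than it is. For example, accuracy can be high on imbalanced data even when the minority class is never predicted correctly. To choose an appropriate evaluation metric for a classification problem, you need to consider the following: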

1. The problem's objective: What is the ultimate goal of the problem? Are we interested in minimizing false positives or false negatives, or both?

2. The nature of the data: Is the data balanced or imbalanced? Are there multiple classes, or is it a binary classification problem?

3. The model's limitations: What are the limitations of the model being used? Does it have a high bias or high variance?

Once you have considered these factors, you can choose the appropriate evaluation metric that best reflects the problem's objective and the nature of the data. You can also use multiple evaluation metrics to gain a comprehensive understanding of the model's performance. It is essential to understand the limitations of the chosen evaluation metric and interpret the results accordingly.
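
A small example shows why the nature of the data matters when picking a metric. The sketch below uses invented labels with 5 positives out of 100 and a trivial model that always predicts the negative class.

```python
# Invented, heavily imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A trivial "model" that always predicts the negative class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy, recall)  # 0.95 0.0
```

Accuracy looks strong at 0.95, yet recall is 0.0 because no positive instance is ever found. On data like this, recall, precision, or the F1 score say more about the model than accuracy does.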

Q8. Provide an example of a classification problem where precision is the most important metric, and
explain why.

Ans:
    
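Example: Detecting spam in email.

If an email is not spam but our model predicts it as spam, that is a false positive, and the user may miss an important message. It's a blunder. Precision = TP/(TP+FP) is high only when false positives are rare, so precision is the metric to prioritize here.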

Q9. Provide an example of a classification problem where recall is the most important metric, and explain
why.

Ans: 
    
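Example: Detecting diabetes.

If a patient is diabetic but the model predicts non-diabetic, that is a false negative, and the patient goes untreated. It's a blunder. Recall = TP/(TP+FN) is high only when false negatives are rare, so recall is the metric to prioritize here.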