# i. Decision Tree

**Brief:** A decision tree reaches a final decision by working through a series of simple, usually binary, decisions.

A **`Decision Tree`** is a popular machine learning algorithm which can be used for classification as well as regression. It is a **tree-like model** in which **internal nodes represent the features or attributes of the data**, **branches represent the decision rules**, and **the leaves represent the final decisions or outcomes**.

**Process of Building a Decision Tree:** It involves selecting the best feature (one at a time) recursively to split the data into smaller subsets (and build a tree), based on some metric such as information gain or Gini impurity. Each split creates a new node in the tree, which is connected to its parent node by a branch. The leaf nodes of the tree represent the final decision, such as a classification or a predicted value.

**Example:** Say, we have a dataset of customer information for a company, including attributes like age, income, and gender, and we want to use a decision tree to predict whether a customer will make a purchase. 

The decision tree algorithm would first look at all the attributes and **select the one that has the highest information gain, which is a measure of how well a feature can split the data into different classes**. Let's say that the age attribute has the highest information gain. The algorithm would then split the dataset into subsets based on age, with one branch representing customers under the age of 30 and another branch representing customers over the age of 30. The algorithm would then repeat this process for each branch, selecting the attribute with the highest information gain and splitting the data again, until it reaches the leaf nodes, which represent the final classification of whether the customer will make a purchase or not.
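To make the information-gain idea concrete, here is a minimal sketch that scores the "under 30 / 30 and over" split from the example. The customer records are made up for illustration, and the helper names (`entropy`, `gini`, `information_gain`) are our own, not from any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy data: (age, purchased)
customers = [(22, "yes"), (25, "yes"), (28, "yes"), (29, "no"),
             (35, "no"), (41, "no"), (52, "no"), (60, "yes")]

labels = [purchased for _, purchased in customers]
under_30 = [purchased for age, purchased in customers if age < 30]
over_30 = [purchased for age, purchased in customers if age >= 30]

gain = information_gain(labels, [under_30, over_30])
print(f"parent entropy: {entropy(labels):.3f}, parent Gini: {gini(labels):.3f}")
print(f"information gain of the age split: {gain:.3f}")
```

A tree-building algorithm would compute this score for every candidate feature and threshold, then keep the split with the highest gain (or, equivalently, the lowest weighted impurity).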

### i.i Can decision trees handle missing data?

Yes, decision trees can handle missing data, although the exact mechanism depends on the implementation. When missing data is encountered during tree construction, the decision tree algorithm can either **impute the missing value**, **ignore the missing value**, or **create a new branch in the tree for instances with missing values**.

**Imputation:** One common approach for imputation is to use the most common value for categorical features or the mean or median for continuous features, **only when the missing data is not systematic and occurs randomly**. Another approach is to use regression imputation or other more complex methods to estimate the missing values based on the available data.

**Creating a new branch:** For splitting nodes, a decision tree algorithm can use different strategies depending on the type of feature and the nature of the data. One strategy is to create a separate branch for instances with missing values, and use the remaining data to split other nodes.

The choice of strategy depends on the nature of the data and the objectives of the analysis.
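The imputation and "separate branch" strategies can be sketched in a few lines. This is an illustrative sketch only: missing values are represented as `None`, and the `impute` and `flag_missing` helpers are hypothetical names, not part of any library.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def flag_missing(values, marker="missing"):
    """Treat 'missing' as its own category so the tree can branch on it."""
    return [marker if v is None else v for v in values]

# Hypothetical columns with gaps
ages = [25, None, 40, 31, None]
genders = ["F", "M", None, "F", "F"]

ages_filled = impute(ages, "median")       # continuous feature: median
genders_filled = impute(genders, "mode")   # categorical feature: most common value
genders_flagged = flag_missing(genders)    # alternative: keep missingness as a category

print(ages_filled)
print(genders_filled)
print(genders_flagged)
```

Flagging instead of imputing is usually the safer choice when the missingness itself may carry information, which is the systematic case mentioned above.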

### i.ii Can decision trees overfit to the training data?

Yes, decision trees are prone to overfitting. They are non-parametric algorithms, so a deep, complex tree can keep splitting until it fits the training data almost exactly, noise included.

To avoid overfitting, techniques such as **pruning**, **limiting the maximum depth of the tree**, and **using an appropriate minimum number of samples per leaf node** can be used.    
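The effect of a depth limit can be seen with a deliberately small, from-scratch tree builder. This is a simplified sketch (Gini impurity, exhaustive threshold search, toy data invented for illustration), not how any particular library implements it; `max_depth` and `min_samples_leaf` are named after the common hyperparameters but are our own arguments here.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, min_samples_leaf):
    """Return (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
                continue  # split would create a leaf that is too small
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, max_depth, min_samples_leaf=1, depth=0):
    split = None if depth >= max_depth else best_split(rows, labels, min_samples_leaf)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left],
                   max_depth, min_samples_leaf, depth + 1),
        build_tree([r for r, _ in right], [y for _, y in right],
                   max_depth, min_samples_leaf, depth + 1),
    )

def predict(tree, row):
    while isinstance(tree, tuple):  # internal node: (feature, threshold, left, right)
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def depth_of(tree):
    return 0 if not isinstance(tree, tuple) else 1 + max(depth_of(tree[2]), depth_of(tree[3]))

# Hypothetical (age, income) rows; the last customer is an unusual "yes"
rows = [(22, 30), (25, 45), (28, 50), (29, 20), (35, 60), (41, 80), (52, 75), (60, 40)]
labels = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

deep = build_tree(rows, labels, max_depth=10)    # free to memorize every point
shallow = build_tree(rows, labels, max_depth=1)  # a single split only

print(depth_of(deep), predict(deep, (60, 40)))
print(depth_of(shallow), predict(shallow, (60, 40)))
```

The unrestricted tree adds an extra split just to isolate the one unusual customer, while the depth-limited tree ignores it. Whether that extra split is signal or noise is exactly what validation data should decide.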

### i.iii State the advantages and disadvantages of the decision tree.

### Advantages:

- Decision trees are **very intuitive** and can be easily understood by both technical and non-technical audiences. The decision rules are represented in a tree-like structure, making it easy to visualize the decision-making process.<br><br>

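- Decision trees **can handle both numerical and categorical data without the need for data transformation** such as scaling or normalization. This makes them more versatile than machine learning algorithms that can only handle specific types of data. Note that some implementations still require categorical features to be encoded as numbers.<br><br>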

- Decision trees are a **non-parametric method**, which means that they do not make any assumptions about the distribution of the data. This makes them useful for both linear and non-linear data.<br><br>

- Decision trees **can handle missing data**, for example by ignoring the missing values and splitting the data based on the available values. Depending on the implementation, this can make them more robust to missing data than many other machine learning algorithms.<br><br>

- **Feature Selection:** Decision trees assign an **importance score** to each feature **based on its contribution to splitting the data**. This score is calculated by measuring the decrease in impurity of the target variable when a feature is used for splitting. Features with higher importance scores are considered more important for predicting the target variable, and can be selected for further analysis.<br><br>

- Decision trees **can handle multi-class problems**, which means that they can classify data into more than two classes. This makes them useful for a wide range of classification problems.<br><br>

- Decision trees are **easy to validate** using methods such as cross-validation, which helps to prevent overfitting and improve the generalization performance of the model.

### Disadvantages:

- The probability of overfitting is very high, especially when the tree is deep and complex. This can lead to poor generalization on unseen data.<br><br>

- **Instability:** Decision trees **are sensitive to small variations in the data**, and can produce different trees for different samples of data.<br><br>

- **Greediness:** Decision trees are greedy in nature: at each node they choose the split that looks best at that moment, without considering how that choice affects the splits further down the tree.<br>

    As a result, the tree may not reach the **global optimum**. It may miss patterns or interactions in the data that would only be found by exploring other splits, so the tree may not generalize as well on new data or be as accurate as it could be.<br><br>    

- Training can be slow on large datasets with many features, because every candidate split has to be evaluated at each node.

### i.iv Explain the bias-variance trade-off in the decision trees.

**Generally, decision trees have `low bias` and `high variance`.** This is because decision trees are capable of fitting complex and non-linear relationships in the data, which gives them low bias. However, this flexibility can also lead to overfitting, which causes high variance.

This overfitting can be controlled by pruning the tree or using ensemble methods such as random forests or gradient boosting, which combine multiple decision trees to reduce overfitting and improve performance on unseen data.

### i.v Explain the outliers handling in the decision trees.

**Decision trees are relatively robust to outliers because they make splits based on purity of data in the node, rather than trying to fit a best-fit line or curve to the data.** Outliers tend to be isolated data points that don't fit well with the rest of the data, **which makes them easy for a decision tree to split away from the rest of the data**. In other words, outliers can lead to split points in the tree that isolate them from the majority of the data, rather than compromising the overall fit of the model. 

However, it is important to note that decision trees can still be affected by outliers if they are present in large numbers or if they are influential in the outcome variable, so it is still important to preprocess the data and handle outliers appropriately.

# ii. Random Forests

### ii.i Explain Ensemble techniques.

An **Ensemble** in machine learning is a **technique that combines the predictions of multiple models to improve the overall performance of the system**. **The idea behind ensembling is that different models will make different errors, and by combining them, we can reduce the overall error rate and create a more accurate and robust prediction.**

There are several types of ensemble techniques, but the most commonly used ones are **bagging** and **boosting**. **Bagging**, **short for bootstrap aggregating**, involves training multiple models independently on different subsets of the training data and combining their predictions by taking the average or majority vote. **This approach helps to reduce the variance in the model and prevent overfitting.** 

**Boosting**, on the other hand, involves training models sequentially, with each subsequent model learning from the mistakes of the previous ones. **This approach helps to reduce bias and improve the accuracy of the model.**

Ensemble methods are particularly useful when the data is noisy or the underlying relationships are complex, and they can help to improve the robustness and reliability of the system.
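Bagging, as described above, can be sketched with nothing more than bootstrap resampling and a majority vote. The weak learner below is a hypothetical one-threshold "stump" on a single feature, chosen only to keep the example short; the data and function names are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: the single threshold that classifies the sample best."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for left_label, right_label in (("no", "yes"), ("yes", "no")):
            correct = sum((left_label if x <= t else right_label) == y for x, y in sample)
            if best is None or correct > best[0]:
                best = (correct, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"), (5, "yes"), (6, "yes")]

# Each model is trained independently on its own bootstrap sample
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagging_predict(models, 0), bagging_predict(models, 7))
```

Individual stumps disagree near the class boundary because their bootstrap samples differ, and the vote smooths those disagreements out. For regression, the vote would be replaced by an average of the predictions.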

### ii.ii Explain Random Forests.

**Random forests** are versatile machine learning algorithms that can be used for both regression and classification tasks. A random forest is a bagging ensemble of a large number of decision trees, each trained on a different subset of the training data.

**During training, the algorithm constructs the desired number of decision trees (a hyperparameter) using different subsets of the training data, allowing it to average the predictions of each decision tree for regression problems, or output the class that gains the most number of votes among all decision trees for classification problems.**

The **feature sampling process** in Random Forests helps to reduce overfitting and create a more generalized model by using different subsets of features for each tree. This also helps to reduce the correlation between the trees, as each tree is built on a different set of features. Additionally, the use of decision trees in Random Forests allows the model to capture complex non-linear relationships in the data, leading to a low bias.

**Overall, the Random Forest algorithm is a powerful and popular machine learning technique that leverages the bias-variance trade-off to create accurate and robust models.**

### ii.iii Explain the bias-variance trade-off in Random Forests.

The Random Forest algorithm **aims for both low bias and low variance**, which is desirable in an ideal model. The use of decision trees in Random Forests allows the model to capture complex non-linear relationships in the data, leading to a **low bias**. By constructing an ensemble of decision trees, the algorithm is able to reduce the variance of the model and improve its performance on new, unseen data.

Each individual decision tree in the ensemble has high variance and, compared with a tree grown on all features, somewhat higher bias (because only a randomly selected subset of features is considered, some potentially important predictors are left out, which can make a single tree less able to capture the relationships in the training data). Averaging many such trees sharply reduces the variance while leaving the bias roughly unchanged, so **the ensemble as a whole has low variance and comparatively low bias**.

To achieve this, Random Forests randomly sample both the rows and columns (features) of the dataset to create different subsets of the training data for each decision tree. This helps to reduce overfitting and increase the diversity of the ensemble.
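A minimal sketch of that sampling step is shown below. It follows the per-tree description in this section. Many implementations instead re-draw the feature subset at every split, and the square-root rule for the number of features is a common default that we assume here, not something fixed by the algorithm. The function name is hypothetical.

```python
import math
import random

def sample_rows_and_features(n_rows, n_features, rng):
    """Pick the row and column indices one tree would be trained on."""
    # Rows: bootstrap sample, drawn with replacement
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Columns: a random subset, drawn without replacement (assumed size: sqrt of total)
    k = max(1, int(math.sqrt(n_features)))
    feature_idx = sorted(rng.sample(range(n_features), k))
    return row_idx, feature_idx

rng = random.Random(42)
for tree_number in range(3):
    rows, features = sample_rows_and_features(n_rows=10, n_features=9, rng=rng)
    print(f"tree {tree_number}: rows={rows}, features={features}")
```

Because rows are drawn with replacement, each tree typically sees some rows more than once and misses others entirely, which is the source of the diversity between trees.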

During prediction, each decision tree in the ensemble outputs a prediction and the final output of the Random Forest is the average (for regression) or the majority vote (for classification) of all the individual tree predictions. **This process of averaging or voting helps to further reduce the variance of the model and make it more robust to noise and outliers in the data.**
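The variance reduction from averaging can be made precise with a standard result. As a simplifying assumption, suppose the $B$ trees $T_1, \dots, T_B$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration). Then the variance of the averaged prediction is:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks towards zero as more trees are added, while the first term depends only on how correlated the trees are. This is why row and feature sampling matter: they lower $\rho$, which lowers the floor that averaging alone cannot get below. For example, with $\sigma^2 = 1$, $\rho = 0.3$ and $B = 100$, the variance is $0.3 + 0.007 = 0.307$, compared with $1$ for a single tree.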

### ii.iv State the advantages and disadvantages of the Random Forests.

### Advantages:

- **Accuracy:** Random Forests are known for their high accuracy and have proven to perform well on a wide range of tasks, including classification and regression.<br><br>

- **Robustness:** Random Forests are robust to overfitting, thanks to the use of an ensemble of decision trees. By aggregating the predictions of many trees, the noise in the predictions of individual trees is reduced.<br><br>

- **Non-parametric:** Random Forests are non-parametric, which means that they don't make assumptions about the distribution of the data. This makes them suitable for a wide range of applications and data types.<br><br>

- **Feature Importance/Selection:** Random Forests provide a measure of feature importance, which can help in understanding the most important variables in a dataset.<br><br>

- **Scalability:** Random Forests can handle large datasets with high dimensionality and many features.<br><br>

- **Outlier Detection:** Random Forests can be used for outlier detection, which is the process of identifying data points that are significantly different from the majority of the data.<br>**One approach is to evaluate the effect of each individual sample on the overall accuracy of the model.** This is done by measuring the change in accuracy of the model when a particular sample is removed. Samples that have a significant impact on the model's accuracy are candidate outliers.<br>In a Random Forest model, each decision tree is constructed on a different subset of the data. Therefore, **the impact of an outlier on the overall model accuracy can be evaluated across multiple trees**, making the outlier detection more robust.

### Disadvantages:

- **Non-Interpretability:** As the number of decision trees in the ensemble increases, it becomes more difficult to interpret the results and understand the relationship between the input features and the target variable. This is because the Random Forest model is a black box model, meaning it doesn't provide clear insight into how the model makes its predictions.<br><br>

- **Computationally expensive:** Random Forests can be computationally expensive to train, especially when dealing with large datasets or a high number of trees in the ensemble. This can make it difficult to use in real-time applications where speed is critical.<br><br>

- **Can overfit:** While Random Forests are generally robust to overfitting, it is still possible for them to overfit the training data, **especially if the individual trees are very deep, the data is noisy, or the hyperparameters are not well-tuned**. Adding more trees does not by itself cause overfitting, but it does increase training and prediction cost.<br><br>

- **Limited extrapolation ability:** Extrapolation is the process of using a statistical model to make predictions beyond the range of the data used to build the model. In other words, it involves estimating values outside the range of the data by extending a curve or line beyond the observed data points.

    Random Forests may not perform well when extrapolating beyond the range of the training data. Each tree predicts from the training targets that fall into its leaves, so for regression the output is always an average of observed values and cannot go beyond the range of targets seen during training.<br><br>

- **Difficulty with imbalanced data:** Random Forests may struggle to correctly classify minority classes in imbalanced datasets, as they tend to favor the majority class. This can be mitigated by resampling techniques or adjusting class weights, but it still may be a challenge.
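The extrapolation limit listed above is easy to reproduce. The sketch below uses a deliberately simplified 1-D regression tree that splits at the median instead of searching for the best split. That is an assumption made to keep the code short, but the leaf behaviour (predict the mean of the training targets in the leaf) is the part that matters. Averaging many such trees, as a forest does, stays within the same bounds.

```python
from statistics import mean

def fit_tree(points, depth):
    """Tiny 1-D regression tree: split at the median x, leaves store the mean target."""
    points = sorted(points)
    if depth == 0 or len(points) < 2:
        return mean(y for _, y in points)
    mid = len(points) // 2
    threshold = (points[mid - 1][0] + points[mid][0]) / 2
    return (threshold,
            fit_tree(points[:mid], depth - 1),
            fit_tree(points[mid:], depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):  # internal node: (threshold, left, right)
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

# Hypothetical training data: y = 2x for x in 0..9, so targets range from 0 to 18
train = [(x, 2 * x) for x in range(10)]
tree = fit_tree(train, depth=3)

inside = predict(tree, 4)     # within the training range
outside = predict(tree, 100)  # far outside it; the true value would be 200

print(inside, outside)
```

The prediction for `x = 100` lands in the right-most leaf and returns the mean of the largest training targets, nowhere near 200. A linear model would extrapolate the trend, whereas a tree-based model can only repeat what it has seen.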