<a href="https://colab.research.google.com/github/rhodes-byu/RF-Proximities-Workshop/blob/main/demo.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>



# Random Forest Proximities and Their Applications in Data Science

## Intro

In this demo, we will cover the basics of random forests (RF), focusing primarily on random forest proximities. Proximities form a supervised similarity measure that serves as the basis for a variety of applications. Specifically, we will be covering the use of RF proximities for visualization for data exploration, missing data imputation, outlier detection, and ...

### Installation and Imports
Although Leo Breiman (random forests' primary author) viewed proximities as one of the most important aspects of random forests, Scikit-Learn's random forest implementation does not provide them! We will thus rely on the RF-GAP-Python package to generate proximities for our applications. Install it by running the cell below:

In [None]:
!pip install git+https://github.com/jakerhodes/RF-GAP-Python

In [None]:
from rfgap import RFGAP, impute
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.manifold import MDS

#### Loading the Data
First we will read in the `Titanic` dataset using `fetch_openml`. The dataset contains information about passengers on the Titanic, including items such as their name, sex, class, and whether they survived or not.

In [None]:
# Read in Titanic
titanic = fetch_openml('titanic', version=1, as_frame=True)

In [None]:
titanic.keys()

In [None]:
titanic.data.head()

In [None]:
titanic.data.info()

In [None]:
X = titanic.data.copy()
X.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'], axis=1, inplace=True)

# One-hot encoding of categorical variables
X_one_hot = pd.get_dummies(X, drop_first=False)

In [None]:
X.info()

Note that age, fare, and embarked have missing values. We can impute them using the RF-GAP proximities.

### Missing Value Imputation

Leo Breiman described two methods for random forest imputation. The first method does not actually use the random forest at all, but simply imputes using the mean, median, or most frequent category for an initial guess. The second method uses the random forest proximities to refine the imputation. Here is the original description:

---

**Random forests has two ways of replacing missing values.**

1. **Fast Method (Initial Guess)**

   - If the *m*th variable is **not categorical**, compute the **median** of all values of this variable in class *j*, then use this value to replace all missing values of the *m*th variable in class *j*.
   - If the *m*th variable is **categorical**, the replacement is the **most frequent non-missing** value in class *j*.
   - These replacement values are called **fills**.

2. **Proximity-Based Method (Refined Imputation)**

   - This method is **computationally more expensive** but has given better performance, even with large amounts of missing data.
   - It replaces missing values **only in the training set**.

   **Steps:**
   1. Perform a rough and inaccurate filling in of the missing values.
   2. Run a random forest and compute **proximities**.
   3. For a missing value:
      - If `x(m,n)` is a **missing continuous** value, estimate its fill as the **average** over the non-missing values of the *m*th variable, **weighted by the proximities** between the *n*th case and the cases with non-missing values.
      - If it is a **missing categorical** variable, replace it with the **most frequent non-missing** value, where **frequency is weighted by proximity**.
   4. **Iterate**:
      - Construct a new forest using the newly filled-in values.
      - Find new fills.
      - Repeat the process.

   - **Our experience is that 4–6 iterations are enough.**
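
Step 3 of the proximity-based method can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data, not the code used by `impute.rfgap_impute`: the helper names `proximity_weighted_fill` and `proximity_weighted_mode`, the toy arrays, and the proximity values are all assumptions made for the example.

```python
import numpy as np

def proximity_weighted_fill(values, prox_row):
    """Proximity-weighted average of the observed (non-NaN) values."""
    observed = ~np.isnan(values)
    weights = prox_row[observed]
    return np.sum(weights * values[observed]) / np.sum(weights)

def proximity_weighted_mode(categories, missing_mask, prox_row):
    """Most frequent observed category, with frequency weighted by proximity."""
    observed = ~missing_mask
    labels = np.unique(categories[observed])
    totals = [prox_row[observed & (categories == label)].sum() for label in labels]
    return labels[int(np.argmax(totals))]

# Hypothetical toy data: case 0 is missing both its age and its port.
age = np.array([np.nan, 20.0, 40.0, 60.0])
port = np.array(["?", "S", "C", "S"])
port_missing = np.array([True, False, False, False])

# Hypothetical proximities between case 0 and every case (itself included).
prox_row = np.array([0.0, 0.5, 0.3, 0.2])

age_fill = proximity_weighted_fill(age, prox_row)
port_fill = proximity_weighted_mode(port, port_missing, prox_row)
print(age_fill, port_fill)
```

Here the age fill works out to 0.5 · 20 + 0.3 · 40 + 0.2 · 60 = 34, and the port fill is `S` because its total proximity weight (0.7) exceeds that of `C` (0.3). The full method would then refit the forest on the filled data and repeat.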


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_one_hot, titanic.target, test_size=0.2, random_state=42)

In [None]:
X_train_imputed, X_test_imputed = impute.rfgap_impute(x = X_train, y = y_train, initialization = 'knn', x_test = X_test)

In [None]:
X_missing = X_train.isna().sum(axis = 1)

In [None]:
X_train_imputed_plot = X_train_imputed.copy()
X_train_imputed_plot['Sex'] = X_train_imputed_plot['sex_female'].map({1: 'female', 0: 'male'})
sns.swarmplot(data=X_train_imputed_plot, x='age', y='Sex', hue = X_missing)

In [None]:
# TODO: Review above see what to keep

## Training the RandomForest

The `RFGAP` class is a wrapper around the `RandomForestClassifier` or `RandomForestRegressor` from `sklearn`. It takes most of the same parameters, with the added benefit of proximity construction and the applications built on it. As with other models in `sklearn`, we train using the `model.fit(x, y)` method and make predictions using the `model.predict(x)` method. Afterward, we can evaluate the random forest model using our metric of choice.

To compute the out-of-bag (OOB) accuracy, we need to pass the `oob_score = True` argument. We can then access the score using `model.oob_score_`.
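
For intuition about what "out-of-bag" means, here is a minimal sketch assuming the standard bootstrap scheme, in which each tree is trained on n draws with replacement from the n training points. The sample size and tree count are arbitrary values chosen for illustration, and no forest is actually trained.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_trees = 1000, 200  # arbitrary illustrative sizes

oob_fractions = []
for _ in range(n_trees):
    # Each tree is trained on a bootstrap sample: n draws with replacement.
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Observations that were never drawn are out-of-bag for this tree.
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[in_bag] = False
    oob_fractions.append(oob_mask.mean())

mean_oob_fraction = float(np.mean(oob_fractions))
print(f"Average out-of-bag fraction: {mean_oob_fraction:.3f}")
```

Roughly a third of the observations (about 1/e ≈ 0.368) are left out of each tree. The OOB score evaluates each observation using only the trees that did not see it, which is why it behaves like a held-out estimate.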

In [None]:
rf = RFGAP(prox_method = 'rfgap', n_estimators = 500, oob_score = True, random_state = 42)
# Hint: RFGAP defaults to classification; include `y` as an argument or specify `prediction_type = 'regression'` for regression.

rf.fit(X_train_imputed, y_train)
print('OOB Score: ', rf.oob_score_)

### Predictions on the Test Set
As with other `sklearn` models, we can use the `predict` method to make predictions on the test set. The `predict_proba` method will return the predicted probabilities for each class.

In [None]:
yhat = rf.predict(X_test_imputed)
print(yhat)


In [None]:
print('Test Score: ', rf.score(X_test_imputed, y_test))

Note the similarity between the test accuracy and OOB accuracy.

## Generating the proximities
The `RFGAP` class provides the built-in method `get_proximities` to generate the random forest proximities. By default, the RF-GAP proximities are generated across the full training set. Other options include the original and OOB versions of the proximities.
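
For comparison, Breiman's original proximity between two observations is the proportion of trees in which they land in the same terminal node. The sketch below computes it from a hypothetical leaf-assignment matrix. With a fitted `sklearn` forest, such a matrix could be obtained from `forest.apply(X)`. This illustrates the original definition only, not the RF-GAP definition, and is not necessarily how the `rfgap` package computes it.

```python
import numpy as np

# Hypothetical leaf assignments: leaves[i, t] is the terminal node of
# observation i in tree t (4 observations, 3 trees).
leaves = np.array([
    [1, 4, 2],
    [1, 4, 3],
    [2, 4, 3],
    [2, 5, 2],
])

# Compare every pair of observations in every tree, then average over trees.
same_leaf = leaves[:, None, :] == leaves[None, :, :]
original_prox = same_leaf.mean(axis=2)
print(original_prox)
```

Observations 0 and 1 share a leaf in two of the three trees, so their proximity is 2/3. Observations 1 and 3 never share a leaf, so theirs is 0. Unlike the RF-GAP proximities, this matrix is symmetric with ones on the diagonal, and its rows do not sum to 1.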

As the RF-GAP proximities serve as weights (recall: $\hat{y}_i^{RF} = \sum_{j = 1}^{n} y_{j}\, p(i, j)$), each row of the proximity matrix sums to 1.
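
To make the weighted-sum formula concrete, here is a small sketch on a hypothetical 3-point proximity matrix. The numbers are made up for illustration; the only properties carried over from the text are that each row sums to 1 and that labels are one-hot encoded.

```python
import numpy as np

# Hypothetical proximities for 3 training points. Each row sums to 1, and the
# diagonal is 0 in this toy example so a point never votes for itself.
prox = np.array([
    [0.0, 0.7, 0.3],
    [0.6, 0.0, 0.4],
    [0.2, 0.8, 0.0],
])

# One-hot labels: columns are (class 0, class 1).
y_one_hot = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
])

# Row i holds the proximity-weighted class proportions for point i.
class_probs = prox @ y_one_hot
predicted = class_probs.argmax(axis=1)
print(class_probs)
print(predicted)
```

Because the rows of the proximity matrix sum to 1, each row of `class_probs` also sums to 1 and can be read as a vector of class probabilities.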

In [None]:
prox_rfgap = rf.get_proximities()
prox_rfgap.sum(axis = 1)

In [None]:
y_train_hot = pd.get_dummies(y_train, drop_first=False)

In [None]:
# Proximity-based predicted probabilities

In [None]:
weighted_sum = prox_rfgap @ y_train_hot
print(weighted_sum)

In [None]:
prox_predictions = np.argmax(weighted_sum, axis = 1)
oob_predictions = np.argmax(rf.oob_decision_function_, axis = 1)

In [None]:
np.sum(oob_predictions == prox_predictions)

All of the proximity predictions match the OOB predictions!

In [None]:
### Visualization of the proximities

In [None]:
help(RFGAP)

In [None]:
dir(rf)

In [None]:
rf.force_symmetric = True
rf.non_zero_diagonal = True

In [None]:
# rf.set_params(non_zero_diagonal = True, force_symmetric = True)
rfgap_symmetric = rf.get_proximities().toarray()
rf_distances = 1 - rfgap_symmetric
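
MDS with precomputed dissimilarities expects a symmetric matrix, which is why the proximities are symmetrized before being turned into distances. The sketch below shows one common way to do this by hand on a hypothetical matrix. Averaging with the transpose and zeroing the diagonal are assumptions made for illustration, not necessarily what `force_symmetric` and `non_zero_diagonal` do inside `RFGAP`.

```python
import numpy as np

# Hypothetical asymmetric proximity matrix (each row sums to 1).
prox = np.array([
    [0.0, 0.7, 0.3],
    [0.6, 0.0, 0.4],
    [0.2, 0.8, 0.0],
])

# Assumption: symmetrize by averaging the matrix with its transpose.
prox_sym = (prox + prox.T) / 2

# Turn similarities into dissimilarities; every point is at distance 0 from itself.
distances = 1 - prox_sym
np.fill_diagonal(distances, 0.0)
print(distances)
```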

In [None]:
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=42)
mds_fit = mds.fit_transform(rf_distances)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(22, 10))
markers = ['X', '.']
s = 80
sns.scatterplot(x=mds_fit[:, 0], y=mds_fit[:, 1], 
                hue=X_train_imputed.sex_male, style=y_train,
                palette='Set1', ax=axes[0], s=s,
                markers=markers)

sns.scatterplot(x=mds_fit[:, 0], y=mds_fit[:, 1], 
                hue=X_train_imputed.pclass, style=y_train, 
                palette='Dark2', ax=axes[1], s=s,
                markers=markers)


for ax in axes:
    ax.legend(loc='upper left')
fig.suptitle("Proximity Visualization via MDS", fontsize=16)

We can do better! PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) is a tool for visualizing high-dimensional data. We can apply PHATE to the proximities to obtain a better low-dimensional representation, or embedding.

In [None]:
!pip install git+https://github.com/jakerhodes/RF-PHATE

In [None]:
from rfphate import RFPHATE

In [None]:
## Bonus Material (MA, RF-PHATE, Time Series Classification)