# Usage Examples for `QuadratiK` in Python
Authors: Giovanni Saraceno, Marianthi Markatou, Raktim Mukhopadhyay, Mojgan Golzy

Date Modified: 20 February 2025

### Note on Matplotlib Usage

Matplotlib behaves differently in interactive environments (such as Jupyter Notebook) and non-interactive environments (such as the Python terminal):

1. In Jupyter Notebook:
   - Plots display automatically after cell execution
   - No explicit `plt.show()` is needed

2. In the Python terminal:
   - Call `plt.show()` explicitly to display plots, OR
   - Use `plt.ion()` to enable interactive mode

3. **Throughout this notebook, we provide two versions of plotting code**:
    - *One that works directly in Jupyter (default).*
    - *One in comments that works in Python terminal (needs uncommenting).*

For a detailed example, please see:
https://github.com/statsmodels/statsmodels/issues/1265

**The code in this notebook has been tested in Jupyter Notebook.**
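
The two cases described in this note can be folded into a single code path. The following is a minimal sketch and is not part of `QuadratiK`: the helper names `running_in_jupyter` and `show_if_needed` are hypothetical, and the check assumes that a Jupyter kernel reports the IPython shell class `ZMQInteractiveShell`.

```python
def running_in_jupyter() -> bool:
    """Return True when the code runs inside a Jupyter kernel (hypothetical helper)."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a Jupyter kernel.
        return False
    shell = get_ipython()
    # Jupyter kernels expose a ZMQInteractiveShell; plain Python returns None.
    return shell is not None and type(shell).__name__ == "ZMQInteractiveShell"


def show_if_needed() -> None:
    """Call plt.show() only outside Jupyter, where plots do not display automatically."""
    if not running_in_jupyter():
        import matplotlib.pyplot as plt

        plt.show()
```

Calling `show_if_needed()` after a plotting function would then display the figure in a terminal session and do nothing in a notebook.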

## Introduction

This document contains various Python examples illustrating the use of `QuadratiK`.

### Installation

The Python package `QuadratiK` and the other packages imported in the examples must be installed before running the code below.

## Normality Test

Click here to download the `Python` script for this example: [normality_test.py](py_scripts/normality_test.py)

We illustrate the use of `KernelTest` for the normality test. We generate one sample from a multivariate standard normal distribution, that is, $x = (x_1, \ldots, x_n) \sim N_d(0, I_d)$, with dimension $d = 4$ and sample size $n = 500$.

In [None]:
import numpy as np
from QuadratiK.kernel_test import KernelTest

np.random.seed(78990)

# data generation
data_norm = np.random.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=500)

# performing the normality test
normality_test = KernelTest(
    h=0.4, num_iter=150, method="subsampling", random_state=42
).test(data_norm)

# printing the summary for normality test
print(normality_test.summary())

## K-Sample Test

Click here to download the `Python` script for this example: [k_sample_test.py](py_scripts/k_sample_test.py)

We generate three samples, with $n=200$ observations each, from 2-dimensional Gaussian distributions with mean vectors $\mu_1 = (0, \sqrt{3}/3)$, $\mu_2 = (-1/2, -\sqrt{3}/6)$ and $\mu_3 = (1/2, -\sqrt{3}/6)$, and the identity matrix as covariance matrix.

In [None]:
import numpy as np
from QuadratiK.kernel_test import KernelTest

np.random.seed(0)

size = 200
eps = 1
x1 = np.random.multivariate_normal(
    mean=[0, np.sqrt(3) * eps / 3], cov=np.eye(2), size=size
)
x2 = np.random.multivariate_normal(
    mean=[-eps / 2, -np.sqrt(3) * eps / 6], cov=np.eye(2), size=size
)
x3 = np.random.multivariate_normal(
    mean=[eps / 2, -np.sqrt(3) * eps / 6], cov=np.eye(2), size=size
)
# Merge the three samples into a single dataset
X_k = np.concatenate([x1, x2, x3])
# The memberships are needed for k-sample test
y_k = np.repeat(np.array([1, 2, 3]), size).reshape(-1, 1)

# performing the k-sample test
k_sample_test = KernelTest(h=1.5, method="subsampling", random_state=42).test(X_k, y_k)

# printing the summary for the k-sample test
print(k_sample_test.summary())
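
The three mean vectors used in the data-generation code are the vertices of an equilateral triangle centered at the origin with side length `eps`, so `eps` controls how far apart the groups are. The short check below is a sketch using only the standard library; the names `means`, `side_lengths` and `centroid` are illustrative.

```python
import math
from itertools import combinations

eps = 1
# Mean vectors of the three groups, as in the data-generation code.
means = [
    (0.0, math.sqrt(3) * eps / 3),
    (-eps / 2, -math.sqrt(3) * eps / 6),
    (eps / 2, -math.sqrt(3) * eps / 6),
]

# Pairwise Euclidean distances between the mean vectors.
side_lengths = [math.dist(a, b) for a, b in combinations(means, 2)]

# Centroid of the three mean vectors.
centroid = tuple(sum(coord) / 3 for coord in zip(*means))

print(side_lengths)
print(centroid)
```

Up to floating-point rounding, every side length equals `eps` and the centroid is the origin.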

## Two-Sample Test


Click here to download the `Python` script for this example: [two_sample_test.py](py_scripts/two_sample_test.py)

This example shows the application of the two-sample test. 
Instead of providing the vector of group memberships, as for the 
k-sample test, the two-sample test can also be performed by providing 
the two samples to be compared. We generate the sample $x_1, \ldots, x_n$ from a 
standard normal distribution $N_d(0, I_d)$ and the sample $y_1, \ldots, y_n$ from a 
skew-normal distribution $SN_d(0, I_d, \lambda)$, where $d=4$, $n=200$ and 
$\lambda = (0.5, \ldots, 0.5)$. 

**Note:** If a value of `h` is not provided, the `select_h` function can be used to determine the optimal `h`. The use of `select_h` is illustrated in the examples below.

In [None]:
import numpy as np
from scipy.stats import skewnorm

from QuadratiK.kernel_test import KernelTest

np.random.seed(0)

# data generation
X_2 = np.random.multivariate_normal(mean=np.zeros(4), cov=np.eye(4), size=200)
Y_2 = skewnorm.rvs(
    size=(200, 4),
    loc=np.zeros(4),
    scale=np.ones(4),
    a=np.repeat(0.5, 4),
    random_state=20,
)
# performing the two sample test
two_sample_test = KernelTest(h=2, num_iter=150, random_state=42).test(X_2, Y_2)

# printing the summary for the two sample test
print(two_sample_test.summary())

The `qq_plot` function can be used to generate Q-Q plots comparing the two given samples.

In [None]:
from QuadratiK.tools import qq_plot

two_sample_qq_plot = qq_plot(X_2, Y_2)

# To save the qq plot: run the following line
# two_sample_qq_plot.savefig('two_sample_qq_plot.png')

"""
If you want to run the following line in python terminal or in a .py file, please uncomment the code below and run.
--------------------------------
from QuadratiK.tools import qq_plot
import matplotlib.pyplot as plt
two_sample_qq_plot = qq_plot(X_2, Y_2)
plt.show()
two_sample_qq_plot.savefig('two_sample_qq_plot.png')
--------------------------------
""";

## Uniformity Test

Click here to download the `Python` script for this example: [uniformity_test.py](py_scripts/uniformity_test.py)

We generate $n=200$ observations from the uniform distribution 
on $S^{d-1}$, with $d=3$.  

In [None]:
import numpy as np
from QuadratiK.poisson_kernel_test import PoissonKernelTest

np.random.seed(0)

# data generation
z = np.random.normal(size=(200, 3))
data_unif = z / np.sqrt(np.sum(z**2, axis=1, keepdims=True))

# performing the uniformity test
unif_test = PoissonKernelTest(rho=0.7, random_state=42).test(data_unif)

# printing the summary for uniformity test
print(unif_test.summary())
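
The data-generation step above relies on a standard fact: scaling a vector of independent standard normal draws to unit length gives a point uniformly distributed on the sphere. The sketch below reproduces that construction with the standard library only and checks that every generated point has unit norm; the names `sample_sphere`, `points` and `max_dev` are illustrative.

```python
import math
import random


def sample_sphere(n, d, seed=0):
    """Draw n points on the unit sphere S^{d-1} by normalizing Gaussian vectors."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        points.append([v / norm for v in z])
    return points


points = sample_sphere(n=200, d=3)
# Largest deviation of a point's norm from 1; expected to be at floating-point level.
max_dev = max(abs(math.sqrt(sum(v * v for v in p)) - 1.0) for p in points)
print(max_dev)
```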

The `qq_plot` function can be used to generate the qq-plots between the given samples and the uniform distribution.

## Tuning Parameter $h$ Selection

The algorithm is implemented through the function `select_h`. 
`select_h` takes as arguments the data matrix `x`, the vector of 
labels `y`, and the type of alternative (one of "location", "scale" or 
"skewness"). It returns the selected value of $h$ together with the computed 
power values and, when `power_plot=True`, a plot of the power against the 
considered values of $h$ for each tested value of $\delta$. 

### For Two-Sample Test

Click here to download the `Python` script for this example: [h_selection_two_sample_test.py](py_scripts/h_selection_two_sample_test.py)

We present the algorithm for selecting the optimal value of the tuning 
parameter in the two-sample problem. 
For the two-sample test, the selection of $h$ can also be performed by 
providing the two samples $x$ and $y$ directly.

In [None]:
from QuadratiK.kernel_test import select_h

# Perform the algorithm for selecting h
h_selected, all_powers, plot = select_h(
    x=X_2, y=Y_2, alternative="location", power_plot=True
)
print(f"Selected h is: {h_selected}")

# To save the power plot: run the following line
# plot.savefig('two_sample_power_plot.png')

"""
If you want to run the following line in python terminal or in a .py file, please uncomment the code below and run.
--------------------------------
from QuadratiK.kernel_test import select_h
import matplotlib.pyplot as plt

# Perform the algorithm for selecting h
h_selected, all_powers, plot = select_h(
    x=X_2, y=Y_2, alternative="location", power_plot=True
)
plt.show()
plot.savefig('two_sample_power_plot.png')
print(f"Selected h is: {h_selected}")
--------------------------------
""";

### For K-Sample Test

Click here to download the `Python` script for this example: [h_selection_k_sample_test.py](py_scripts/h_selection_k_sample_test.py)

We present the algorithm for selecting the optimal value of the tuning parameter in the k-sample problem. 

In [None]:
from QuadratiK.kernel_test import select_h

# Perform the algorithm for selecting h
h_selected, all_powers = select_h(
    x=X_k, y=y_k, alternative="skewness", power_plot=False, method="subsampling", b=0.2
)
print(f"Selected h is: {h_selected}")

## Real World Examples

### Two-Sample Test

Click here to download the `Python` script for this example: [real_world_two_sample_test.py](py_scripts/real_world_two_sample_test.py)

We utilize the Wisconsin Breast Cancer (Diagnostic) Dataset from the UCI repository to demonstrate the application of the Two-Sample Test in a real-world context.

In [None]:
from QuadratiK.datasets import load_wisconsin_breast_cancer_data
from QuadratiK.kernel_test import KernelTest, select_h

X, y = load_wisconsin_breast_cancer_data(return_X_y=True, scaled=True)

# Create masks for Malignant (M) and Benign (B) tumors
malignant_mask = y == 1
benign_mask = y == 0

# Create X1 and X2 using the masks
X1 = X[malignant_mask.all(axis=1)]
X2 = X[benign_mask.all(axis=1)]

# Perform the algorithm for selecting h
h_selected, all_powers = select_h(
    x=X1, y=X2, alternative="skewness", method="subsampling", b=0.5, n_jobs=-1
)
print(f"Selected h is: {h_selected}")

# performing the two sample test
two_sample_test = KernelTest(h=h_selected, num_iter=150, random_state=42).test(X1, X2)

# printing two sample test object
print(two_sample_test)

### K-Sample Test

Click here to download the `Python` script for this example: [real_world_k_sample_test.py](py_scripts/real_world_k_sample_test.py)

To illustrate the application of the K-Sample Test, we use the wine dataset from the UCI repository.

In [None]:
from QuadratiK.datasets import load_wine_data
from QuadratiK.kernel_test import KernelTest, select_h

X, y = load_wine_data(return_X_y=True, scaled=True)

# Perform the algorithm for selecting h
h_selected, all_powers = select_h(
    x=X, y=y, alternative="skewness", n_jobs=-1, b=0.5, method="subsampling"
)
print(f"Selected h is: {h_selected}")

# performing the k-sample test
k_sample_test = KernelTest(h=h_selected, num_iter=150, random_state=42).test(X, y)

# printing the summary for the k-sample test
print(k_sample_test.summary())

### Poisson Kernel Based Clustering

Click here to download the `Python` script for this example: [pkbc.py](py_scripts/pkbc.py)

We consider the Wireless Indoor Localization Data Set, publicly available on the UCI Machine Learning Repository website. This data set is used to study the performance of different indoor localization algorithms. 

The Wireless Indoor Localization data set contains measurements of the Wi-Fi signal strength in different indoor rooms. It consists of a data frame with 2000 rows and 8 columns. The first 7 variables report the values of the Wi-Fi signal strength received from 7 different Wi-Fi routers in an office location in Pittsburgh (USA). The last column contains the class labels, from 1 to 4, indicating the different rooms. Note that the Wi-Fi signal strength is measured in dBm (decibel-milliwatts) and is expressed as a negative value ranging from -100 to 0. In total, we have 500 observations for each room.

Given that the Wi-Fi signal strength takes values in a limited range, it is appropriate to consider the spherically transformed observations, obtained by $L_2$ normalization, and consequently run the clustering algorithm on the unit sphere $S^{6}$ in $\mathbb{R}^7$.
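
As a sketch of this spherical transformation, the snippet below applies $L_2$ normalization to two rows of signal strengths. The values are made up for illustration and are not taken from the data set, and `l2_normalize` is a hypothetical helper; the snippet only illustrates the transformation itself.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


# Two illustrative rows of Wi-Fi signal strengths in dBm (made-up values).
signals = [
    [-64, -56, -61, -66, -71, -82, -81],
    [-40, -55, -52, -45, -60, -75, -70],
]
on_sphere = [l2_normalize(row) for row in signals]

for point in on_sphere:
    # Each transformed row has unit norm, so it lies on the unit sphere in R^7.
    print(round(math.sqrt(sum(v * v for v in point)), 6))
```

After the transformation only the direction of each observation is retained, which is the information a clustering method on the sphere works with.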



In [None]:
import warnings

from QuadratiK.datasets import load_wireless_data
from QuadratiK.spherical_clustering import PKBC

warnings.filterwarnings("ignore")

X, y = load_wireless_data(return_X_y=True)
# fit the model for each number of clusters from 2 to 10
pkbc = PKBC(num_clust=range(2, 11), random_state=42).fit(X)

In [None]:
validation_metrics, elbow_plots = pkbc.validation(y_true=y)

# To save the elbow plots: run the following line
# elbow_plots.savefig('elbow_plots.png')

"""
If you want to run the following line in python terminal or in a .py file, please uncomment the code below and run.
--------------------------------
import matplotlib.pyplot as plt

validation_metrics, elbow_plots = pkbc.validation(y_true=y)
plt.show()
elbow_plots.savefig('elbow_plots.png')
--------------------------------
""";

To guide the choice of the number of clusters, the `validation` function provides cluster validation measures and graphical tools. Specifically, it displays the elbow plot of the computed within-cluster sum of squares values and returns a table of computed evaluation measures, as shown below. 

In [None]:
print(validation_metrics.round(2))

In [None]:
print(pkbc.summary())

In [None]:
"""
This plot is created using Plotly. For detailed instructions on saving the plot, please refer to the Plotly documentation at:
https://plotly.com/python/static-image-export/. Additionally, the current renderer is set to "png", but the plot can be saved in various formats; please consult the Plotly documentation for more information.

Please be aware that generating static images requires `Kaleido` and `nbformat`.
"""

# please feel free to change the default renderer, for options see: https://plotly.com/python/renderers/
import plotly.io as pio

pio.renderers.default = "png"

pkbc_clusters = pkbc.plot(num_clust=4, y_true=y)
pkbc_clusters.show()

# To save the plot: run the following line
# pkbc_clusters.write_image("pkbc_clusters.png")

"""
If you want to run the following line in python terminal, please uncomment the code below and run.
--------------------------------
import plotly.io as pio

# For viewing the plot please set:
pio.renderers.default = "browser"

pkbc_clusters = pkbc.plot(num_clust=4, y_true=y)
pkbc_clusters.show()

# Once the plot opens in the browser, you can save the plot by clicking on the "Download" button in the plot on top right corner.
--------------------------------
""";

The clusters identified with $k=4$ achieve high performance in terms of the adjusted Rand index (ARI), macro precision and macro recall.
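
For reference, the sketch below computes these three measures by hand on a toy labeling, using only the standard library. It is not the implementation used by `pkbc.validation`: the labels are made up, the function names are illustrative, and the macro precision and recall assume that the cluster labels have already been matched to the true classes.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(y_true)
    pairs = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    true_pairs = sum(comb(c, 2) for c in Counter(y_true).values())
    pred_pairs = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = true_pairs * pred_pairs / comb(n, 2)
    maximum = (true_pairs + pred_pairs) / 2
    return (pairs - expected) / (maximum - expected)


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall (labels already matched)."""
    classes = sorted(set(y_true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy labels (made up): one observation of class 2 is assigned to cluster 1.
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 2, 2]
ari = adjusted_rand_index(y_true, y_pred)
macro_p, macro_r = macro_precision_recall(y_true, y_pred)
print(round(ari, 3), round(macro_p, 3), round(macro_r, 3))
```

The ARI does not depend on how the clusters are numbered, whereas precision and recall do, which is why the label matching is stated as an assumption.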

## Initializing the Dashboard

We show the initialization of the dashboard application. The corresponding code snippet is given below.

In [None]:
# uncomment the below code to instantiate the dashboard on a local machine
"""
from QuadratiK.ui import UI
UI().run()
"""

![Dashboard](images/dash-landing.png)

The above image shows the landing page of the user interface in the `QuadratiK` package.