---
title: "Using liminal to understand high dimensional parameter space"
output: rmarkdown::html_vignette
link-citations: yes
bibliography: liminal.bib
vignette: >
  %\VignetteIndexEntry{geometry_parameter_space}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
theme_set(theme_bw())
```

This example is adapted from the tour examples described in @Cook2018-jm.
Here we use a tour to explore the principal component space, and
t-SNE to look for non-linear structure and clusters.

## Setting up the data

Data were obtained from the CT14HERA2 parton distribution function (PDF)
fits used in @Cook2018-jm. The parameter space of the fit has 28 directions.
Each of the variables labelled X1-X56 corresponds to moving ± 1 standard
deviation along one of these directions from the 'best' (maximum likelihood
estimate) fit, which gives 2 × 28 = 56 variables. Each observation holds the
predictions, under all 56 of these fits, for the corresponding measurement
from an experiment (see Table 3 in that paper for more explicit details).

The remaining columns are:

* `InFit`: a flag indicating whether an observation entered the fit of the
  CT14HERA2 parton distribution function
* `Type`: the first digit of `ID`
* `ID`: the identifier of the experiment; 1XX/2XX/5XX corresponds
  to Deep Inelastic Scattering (DIS) / Vector Boson Production (VBP) /
  Strong Interaction (JET). Every ID points to an experimental paper.
* `pt`: the per-experiment observation id
* `x`, `mu`: the kinematics of a parton. `x` is the parton momentum fraction, and
  `mu` is the factorisation scale.

First, we load the data as a `data.frame`:
```{r pdfsense-prepare}
library(liminal)
data(pdfsense)
```
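
As a quick check of the layout described above, the sketch below separates the metadata columns from the prediction columns. It assumes the six metadata columns carry exactly the names used in the list above; if each of the 28 directions contributes a +1 and a -1 standard deviation fit, the remainder should be 2 × 28 = 56 columns.

```{r}
library(liminal)
data(pdfsense)

# Assumption: the metadata columns carry the names used in the list above
meta_cols <- c("InFit", "Type", "ID", "pt", "x", "mu")
pred_cols <- setdiff(names(pdfsense), meta_cols)

# Number of prediction columns (expected to be 2 * 28 if the assumption holds)
length(pred_cols)

# Number of observations in each experiment type
table(pdfsense$Type)
```

If the count differs from 56, the column selection used for the principal components later on would need adjusting.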

## Linear embeddings and the tour

First we estimate the principal components of the parton distribution
fits with `prcomp()`. The fits occupy 56 columns (column 7 onwards), so this
gives 56 components:
```{r pdfsense}
pcs <- prcomp(pdfsense[, 7:ncol(pdfsense)])
```
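
For readers less familiar with `prcomp()`, the following sketch on simulated data (not the PDF fits) shows how the pieces of its result relate: `rotation` holds the loadings, `x` holds the scores, and the scores are the centered data multiplied by the loadings. The names `toy`, `toy_pcs` and `toy_scores` are made up for this illustration.

```{r}
set.seed(1)

# Simulated data: 200 observations on 5 variables
toy <- matrix(rnorm(200 * 5), ncol = 5)
toy_pcs <- prcomp(toy)

# prcomp() centers the columns by default, so we center before projecting
toy_centered <- scale(toy, center = TRUE, scale = FALSE)
toy_scores <- toy_centered %*% toy_pcs$rotation

# The manual projection matches the scores stored in the prcomp object
all.equal(unname(toy_scores), unname(toy_pcs$x))

# One standard deviation per component, in decreasing order
toy_pcs$sdev
```

The same structure applies to `pcs`: `pcs$x` has one row per observation and one column per component.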
Using this object, we can plot the cumulative proportion of variance
explained by the components (a cumulative version of the scree plot):
```{r, echo = TRUE}
res <- data.frame(
  component = seq_along(pcs$sdev),
  # the variance of a component is the square of its standard deviation
  variance_explained = cumsum(pcs$sdev^2) / sum(pcs$sdev^2)
)
ggplot(res, aes(x = component, y = variance_explained)) +
  geom_point() +
  scale_x_continuous(
    breaks = seq(0, 60, by = 5)
  ) +
  scale_y_continuous(
    labels = function(x) paste0(100 * x, "%")
  )
```

The first 15 principal components explain approximately `r round(100 * res$variance_explained[15])`% of the variance in the PDF fits.

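
Reading a threshold off the plot can also be done numerically. The sketch below uses a hypothetical helper, `components_needed()`, which is not part of liminal; it converts the standard deviations stored by `prcomp()` to variances by squaring them and returns the smallest number of components whose cumulative share reaches a threshold. It is demonstrated on the built-in `USArrests` data rather than the PDF fits.

```{r}
# Hypothetical helper (not part of liminal): smallest number of components
# whose cumulative share of variance reaches `threshold`
components_needed <- function(pca, threshold = 0.7) {
  var_share <- pca$sdev^2 / sum(pca$sdev^2)
  which(cumsum(var_share) >= threshold)[1]
}

# Demonstration on a small built-in data set, with variables standardised
toy_pca <- prcomp(USArrests, scale. = TRUE)
round(cumsum(toy_pca$sdev^2) / sum(toy_pca$sdev^2), 2)
components_needed(toy_pca, threshold = 0.7)
```

Applied to this vignette's object, `components_needed(pcs, 0.7)` reports how many principal components are needed to reach 70% of the variance.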
Next we augment our original data with the principal components:
```{r}
pdfsense <- dplyr::bind_cols(
  pdfsense,
  as.data.frame(pcs$x)
)
pdfsense$Type <- factor(pdfsense$Type)
```

We can view a simple tour of the first six principal components via
`limn_tour()` and color points by their experimental group (`Type`):
```{r, eval = FALSE}
limn_tour(pdfsense, PC1:PC6, Type)
```
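
A tour shows a sequence of low-dimensional projections of the data. As a rough illustration of a single frame, the sketch below projects simulated six-dimensional data onto a random two-dimensional orthonormal basis. This is only an assumption-laden picture of the idea, not how `limn_tour()` is implemented: a real tour also interpolates smoothly between successive bases. The names `toy_data`, `basis` and `frame` are made up for this example.

```{r}
set.seed(2)
n_dim <- 6

# Random 6 x 2 matrix, orthonormalised with a QR decomposition
basis <- qr.Q(qr(matrix(rnorm(n_dim * 2), ncol = 2)))

# The columns are orthonormal: t(basis) %*% basis is the 2 x 2 identity
round(crossprod(basis), 10)

# One frame of a tour: project 100 simulated points onto the plane
toy_data <- matrix(rnorm(100 * n_dim), ncol = n_dim)
frame <- toy_data %*% basis
dim(frame)
```

Each frame is an ordinary scatter plot of the two projected coordinates; the tour animates through many such frames.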

## Non-Linear embeddings

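Next we set up a non-linear embedding via t-SNE. Here we embed all 56
principal components, and initialise the layout from the first two
principal components after passing them through `clamp_sd()`: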
```{r}
set.seed(3099)
start <- clamp_sd(as.matrix(dplyr::select(pdfsense, PC1, PC2)), sd = 1e-4)
tsne <- Rtsne::Rtsne(
  dplyr::select(pdfsense, PC1:PC56),
  pca = FALSE,
  normalize = TRUE,
  perplexity = 50,
  exaggeration_factor = nrow(pdfsense) / 100,
  Y_init = start
)
```
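
The initial layout passed as `Y_init` is built from the first two principal components, shrunk to a very small scale. The sketch below shows one way such a rescaling could work, using a hypothetical helper `rescale_to_sd()` on simulated data. It is an assumption about the general idea (dividing by the standard deviation of the first column and multiplying by the target), not a copy of liminal's `clamp_sd()`.

```{r}
# Hypothetical helper: rescale a matrix so its first column has the given sd.
# An illustration of the idea only; liminal's clamp_sd() may differ in detail.
rescale_to_sd <- function(mat, sd = 1e-4) {
  stopifnot(is.matrix(mat))
  mat / stats::sd(mat[, 1]) * sd
}

set.seed(3)
toy_init <- cbind(rnorm(50, sd = 3), rnorm(50, sd = 1))
small_init <- rescale_to_sd(toy_init)

# The first column now has the target sd; the second keeps its relative scale
apply(small_init, 2, stats::sd)
```

Because every entry is divided by the same constant, the shape of the starting layout is preserved and only its overall size changes.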

Once t-SNE has run, we tidy the embedding into a `data.frame` so that it
can be used in a linked tour:
```{r tsne}
tsne_embedding <- as.data.frame(tsne$Y)
tsne_embedding <- dplyr::rename(tsne_embedding, tsneX = V1, tsneY = V2)
tsne_embedding$Type <- pdfsense$Type
```
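
The rename step relies on the default column names that `as.data.frame()` assigns to a matrix without column names. A small sketch with a made-up matrix (`toy_Y` is a stand-in for the embedding coordinates) shows where `V1` and `V2` come from:

```{r}
# A 2-column matrix without column names, standing in for the embedding
toy_Y <- matrix(c(0.1, 0.2, 0.3, 0.4), ncol = 2)
toy_embedding <- as.data.frame(toy_Y)

# as.data.frame() names the columns V1, V2, ... by default
names(toy_embedding)

# Base R equivalent of the dplyr::rename() call
names(toy_embedding) <- c("tsneX", "tsneY")
names(toy_embedding)
```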
We can view the clusters using a static scatter plot:
```{r}
ggplot(tsne_embedding,
       aes(x = tsneX, y = tsneY, color = Type)) +
  geom_point() +
  scale_color_manual(values = limn_pal_tableau10())
```

We can place a tour view next to the embedding and link the two, which
gives a clearer picture of the clustering:
```{r, eval = FALSE}
limn_tour_link(
  tour_data = pdfsense,
  embed_data = tsne_embedding,
  cols = PC1:PC6,
  color = Type
)
```

# References {-}