System information (for reproducibility):

In [13]:
versioninfo()

Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 12 × Apple M2 Max
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_EDITOR = code


Load packages:

In [14]:
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

[32m[1m  Activating[22m[39m project at `~/Documents/github.com/ucla-biostat-257/2025spring/slides/01-intro`


[32m[1mStatus[22m[39m `~/Documents/github.com/ucla-biostat-257/2025spring/slides/01-intro/Project.toml`
  [90m[6e4b80f9] [39mBenchmarkTools v1.6.0
  [90m[6f49c342] [39mRCall v0.14.6
  [90m[37e2e46d] [39mLinearAlgebra v1.11.0
  [90m[9a3f8284] [39mRandom v1.11.0


## Basic information

* Tue/Thu 1pm-2:50pm @ CHS 41-268.   

* Instructor: Dr. Hua Zhou.  

## What is statistics?

* Statistics, the science of *data analysis*, is the applied mathematics in the 21st century. 

* People (scientists, government, health professionals, companies) collect data in order to answer certain questions. A statistician's job is to help them extract knowledge and insights from data. 

* If existing software tools readily solve the problem, all the better. 

* Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming). 

* This entails at least two essential skills: **programming** and fundamental knowledge of **algorithms**. 

## What is this course about?

* **Not** a course on statistical packages. It does not answer questions such as _How to fit a linear mixed model in R,  Julia, SAS, SPSS, or Stata?_

* **Not** a pure programming course, although programming is important and we do homework in Julia.  

* **Not** a course on data science. [BIOSTAT 203B (Introduction to Data Science)](https://ucla-biostat-203b.github.io/2025winter/schedule/schedule-lec1.html) in winter quarter focuses on some R tools for data scientists.

* This course focuses on **algorithms**, mostly those in **numerical linear algebra** and **numerical optimization**. 

## Learning objectives

1. Be highly appreciative of this quote by [James Gentle](https://www.google.com/books/edition/Computational_Statistics/mQ5KAAAAQBAJ?hl=en&gbpv=1&dq=inauthor:%22James+E.+Gentle%22)

    > The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

    Examples: $\mathbf{X}^T \mathbf{W} \mathbf{X}$, $\operatorname{tr} (\mathbf{A} \mathbf{B})$, $\operatorname{diag} (\mathbf{A} \mathbf{B})$, multivariate normal density, ...  

2. Become **memory-conscious**. You care about looping order. You benchmark hot functions fanatically to make sure they are not allocating.    

3. **No inversion mentality**. Whenever you see a matrix inverse in a mathematical expression, your brain reacts with *matrix decomposition*, *iterative solvers*, etc. For R users, that means you almost never use the `solve(M)` function to obtain the inverse of a matrix $\boldsymbol{M}$.   

    Examples: $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, $\mathbf{y}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}$, Newton-Raphson algorithm, ...   

4. Master some basic strategies to solve **big data** problems. 

    Examples: how Google solves the PageRank problem with $10^{9}$ webpages, linear regression with $10^7$ observations, etc.  

5. Do not be afraid of **optimization**; treat it as a technology. Be able to recognize some major optimization problem classes and choose the best solver(s) accordingly.

6. Be immune to the language fight. 
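
To make the first objective concrete, here is a minimal sketch of two of its examples, $\operatorname{tr}(\mathbf{A}\mathbf{B})$ and $\operatorname{diag}(\mathbf{A}\mathbf{B})$, computed without forming the product $\mathbf{A}\mathbf{B}$. The matrix sizes and variable names are illustrative assumptions, not taken from the course material.

```julia
using LinearAlgebra, Random

Random.seed!(257)
# Illustrative sizes (an assumption for this sketch)
A = randn(200, 300)
B = randn(300, 200)

# Literal translation: form the full 200×200 product, then keep only its diagonal
tr_naive   = tr(A * B)
diag_naive = diag(A * B)

# tr(AB) = Σᵢⱼ Aᵢⱼ Bⱼᵢ, i.e. the elementwise inner product of A' and B
tr_fast = dot(A', B)

# diag(AB)ᵢ = Σₖ Aᵢₖ Bₖᵢ, i.e. row i of A dotted with column i of B
diag_fast = [dot(view(A, i, :), view(B, :, i)) for i in axes(A, 1)]

# Both routes agree up to floating-point rounding
tr_fast ≈ tr_naive && diag_fast ≈ diag_naive
```

The direct formulas touch each entry of $\mathbf{A}$ and $\mathbf{B}$ once, whereas the literal translation performs a full matrix multiplication and discards all off-diagonal entries.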

## Course logistics

* Course webpage: <https://ucla-biostat-257.github.io/2025spring>.

* [Syllabus](https://ucla-biostat-257.github.io/2025spring/syllabus/syllabus.html).

* Check the [Schedule](https://ucla-biostat-257.github.io/2025spring/schedule/schedule.html) page frequently. 

* Jupyter notebooks will be posted/updated before each lecture.

## How to get started

All course materials are in GitHub repo <https://github.com/ucla-biostat-257/2025spring>. Lecture notes are Jupyter Notebooks (`.ipynb` files) in the `slides` folder. It is a good idea to learn by running through the code examples. You can do this in several ways. 

### Run Jupyter Notebook in Binder

A quick and easy way to run the Jupyter Notebooks is Binder, a free service that allows users to run Jupyter Notebooks in the cloud. Simply follow the Binder link on the [schedule](https://ucla-biostat-257.github.io/2025spring/schedule/schedule.html) page. 

If you want the JupyterLab interface, replace `tree` with `lab` in the URL.  

### Run Jupyter Notebook locally on your own computer

1. Install Julia v1.11.x following instructions at <https://julialang.org/downloads/>.

2. Install `IJulia` package. Open Julia REPL, type `]` to enter the package mode, then type
```julia
add IJulia
build IJulia
```

3. Git clone the course material.   
```bash
git clone https://github.com/ucla-biostat-257/2025spring.git biostat-257-2025spring
```
You can change `biostat-257-2025spring` to any other directory name you prefer.

4. In a terminal, enter the folder for the ipynb file you want to run, e.g. `biostat-257-2025spring/slides/01-intro/`. 

5. Open Julia REPL, type  
```julia  
using IJulia
jupyterlab(dir = pwd())
```
to open JupyterLab in the browser, or
```julia  
using IJulia
notebook(dir = pwd())
```
to open a Jupyter notebook.

6. Course material is updated frequently. Remember to `git pull` to obtain the most recent material.

### Run Jupyter Notebook in VS Code

1. Install [Julia](https://julialang.org/downloads/), [VS Code](https://code.visualstudio.com/), and [Quarto](https://quarto.org/docs/get-started/).

2. Open VS Code and install extensions: Julia, Jupyter, Quarto, GitHub Copilot.

3. Git clone the course material.   
```bash
git clone https://github.com/ucla-biostat-257/2025spring.git biostat-257-2025spring
```
You can change `biostat-257-2025spring` to any other directory name you prefer.

4. Open the folder in VS Code.

## In-class discussion

The logistic regression is typically estimated by the Fisher scoring algorithm, or iteratively reweighted least squares (IWLS), which iterates according to
$$
\boldsymbol{\beta}^{(t)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
$$
where $\mathbf{z}^{(t)}$ are pseudo-responses and $\mathbf{W}^{(t)} = \text{diag}(\mathbf{w}^{(t)})$ is a diagonal matrix with nonnegative weights on the diagonal. Superscript $t$ is the iterate number.
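
For orientation, one such update might be sketched as follows. This is a minimal illustration, assuming the textbook logistic-regression forms $w_i = \mu_i (1 - \mu_i)$ and $z_i = \eta_i + (y_i - \mu_i) / w_i$ evaluated at the current iterate; the helper names `iwls_step` and `iwls` and the simulated binary response `y` are hypothetical and not part of the course code.

```julia
using LinearAlgebra, Random

Random.seed!(257)
n, p = 500, 5
X = [ones(n) randn(n, p - 1)]
# Simulated 0/1 responses (an assumption for this sketch)
y = Float64.(rand(n) .< 0.5)

# One Fisher scoring / IWLS update from the current iterate β (hypothetical helper)
function iwls_step(X, y, β)
    η = X * β                      # linear predictor
    μ = @. 1 / (1 + exp(-η))       # fitted probabilities
    w = @. μ * (1 - μ)             # weights; assumes no μᵢ is exactly 0 or 1
    z = @. η + (y - μ) / w         # pseudo-responses
    Xw = X .* w                    # rows of X scaled by w, i.e. W * X without forming W
    # Solve (X'WX) β = X'Wz by Cholesky rather than inverting
    return cholesky(Symmetric(X' * Xw)) \ (Xw' * z)
end

# Repeat the update a fixed number of times (hypothetical driver, no convergence check)
function iwls(X, y; iters = 10)
    β = zeros(size(X, 2))
    for _ in 1:iters
        β = iwls_step(X, y, β)
    end
    return β
end

βhat = iwls(X, y)
```

Each call to `iwls_step` solves exactly the weighted least squares problem in the displayed formula, which is why the cost of that single expression dominates the whole fit.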

### Poll: Numeric

How much speedup can we achieve, by careful consideration of flops and memory usage, over the following naive implementation?
```julia
inv(X' * diagm(w) * X) * X' * diagm(w) * z
```

### Experiment

First generate some data.

In [15]:
using LinearAlgebra, Random

# Random seed for reproducibility
Random.seed!(257)
# samples, features
n, p = 5000, 100
# design matrix
X = [ones(n) randn(n, p - 1)]
# pseudo-responses
z = randn(n)
# weights
w = 0.25 * rand(n);

### Method 1

The following code literally translates the mathematical expression into code.

In [16]:
# method 1 
res1 = inv(X' * diagm(w) * X) * X' * diagm(w) * z

100-element Vector{Float64}:
 -0.004731352650088043
  0.009183070405469696
 -0.01627522147347795
 -0.013279497350630196
  0.020014830435187928
  0.020674778392632612
  0.0007810692137151187
 -0.012360822702514544
  0.00112392670988122
  0.011690288350451017
 -0.019599718827196574
  0.01775819774235745
 -0.002506239462765153
  ⋮
 -0.018115321884488347
 -0.011950081272644483
 -0.0054037502392284865
  0.001766631586071268
  0.01889729150257136
 -0.02628676655057106
  0.034928418336936384
  0.0080085874357102
  0.00824432461294388
  0.013637070959968484
  0.01360393323312991
 -0.005396382879830027

In [17]:
using BenchmarkTools

bm1 = @benchmark ((inv($X' * diagm($w) * $X) * $X') * diagm($w)) * $z
bm1

BenchmarkTools.Trial: 67 samples with 1 evaluation per sample.
 Range (min … max):  71.416 ms … 96.663 ms  ┊ GC (min … max): 5.43% … 8.75%
 Time  (median):     74.180 ms              ┊ GC (median):    9.19%
 Time  (mean ± σ):   75.225 ms ±  4.119 ms  ┊ GC (mean ± σ):  9.09% ± 0.62%

Several unwise choices of algorithms waste lots of flops. The memory allocations, caused by intermediate results, also slow down the program because of the need for garbage collection. This is a common mistake for beginner programmers in high-level languages. For example, the following R code (same algorithm on the same data) shows similarly heavy allocation. The R code is much slower than Julia, possibly because of the outdated BLAS/LAPACK library being used. 

In [18]:
using RCall

R"""
library(bench)

# Interpolate Julia variables into R workspace
X <- $X
w <- $w
z <- $z

rbm1 <- bench::mark(
  solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z,
  iterations = 10
  ) |> 
  print(width = Inf)
""";

# A tibble: 1 x 13
  expression                                                    min   median
  <bch:expr>                                               <bch:tm> <bch:tm>
1 solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z    1.83s    1.85s
  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result         
      <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>         
1     0.539     401MB     1.08    10    20      18.6s <dbl [100 x 1]>
  memory              time            gc               
  <list>              <list>          <list>           
1 <Rprofmem [14 x 3]> <bench_tm [10]> <tibble [10 x 3]>
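
In the Julia version of Method 1, part of the allocation comes from `diagm(w)`, which materializes a dense $n \times n$ matrix; for $n = 5000$ that is $5000^2 \times 8$ bytes, about 200 MB, and the expression builds it twice. The sketch below compares it with the lazy `Diagonal(w)` wrapper. A smaller `n` is used here purely to keep the example light (an assumption of this sketch).

```julia
using LinearAlgebra

# Smaller than the n = 5000 used in the experiment, to keep the sketch light
n = 1000
w = 0.25 * rand(n)

W_dense = diagm(w)      # n×n Matrix{Float64}, almost entirely zeros
W_diag  = Diagonal(w)   # wraps w and stores only the n diagonal entries

# Total bytes reachable from each object
bytes_dense = Base.summarysize(W_dense)
bytes_diag  = Base.summarysize(W_diag)

# Both represent the same linear map
x = randn(n)
W_dense * x ≈ W_diag * x
```

Multiplying by `W_diag` also costs on the order of $n$ flops per vector instead of $n^2$, so the structured type saves both memory and arithmetic.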


[33m[1m└ [22m[39m[90m@ RCall ~/.julia/packages/RCall/0ggIQ/src/io.jl:172[39m


### Method 2

In the following code, we make smarter choices of algorithms (rearranging the order of evaluation; utilizing data structures such as diagonal, triangular, and positive definite matrices) and get rid of memory allocation by pre-allocating intermediate arrays. 

In [19]:
# preallocation
XtWt = Matrix{Float64}(undef, p, n)
XtX = Matrix{Float64}(undef, p, p)
Xtz = Vector{Float64}(undef, p)

function myfun(X, z, w, XtWt, XtX, Xtz)
    # XtWt = X' * W
    mul!(XtWt, transpose(X), Diagonal(w))
    # XtX = X' * W * X
    mul!(XtX, XtWt, X)
    # Xtz = X' * W * z
    mul!(Xtz, XtWt, z)
    # Cholesky on XtX
    LAPACK.potrf!('U', XtX)
    # Two triangular solves to solve (XtX) \ (Xtz)
    BLAS.trsv!('U', 'T', 'N', XtX, Xtz)
    BLAS.trsv!('U', 'N', 'N', XtX, Xtz)
end

# First check correctness vs Method 1
res2 = myfun(X, z, w, XtWt, XtX, Xtz)

100-element Vector{Float64}:
 -0.004731352650088043
  0.009183070405469736
 -0.016275221473477902
 -0.013279497350630177
  0.020014830435187862
  0.020674778392632622
  0.0007810692137151314
 -0.012360822702514545
  0.001123926709881188
  0.011690288350451031
 -0.01959971882719655
  0.017758197742357474
 -0.0025062394627651525
  ⋮
 -0.01811532188448826
 -0.011950081272644486
 -0.005403750239228435
  0.0017666315860712875
  0.018897291502571387
 -0.026286766550571057
  0.034928418336936204
  0.008008587435710167
  0.008244324612943875
  0.013637070959968462
  0.013603933233129917
 -0.005396382879830023
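
The two printed vectors agree in their leading digits. A programmatic check with `≈` is less error-prone than comparing by eye. The sketch below does this on a smaller problem (sizes chosen for illustration), comparing the literal translation against a higher-level variant that uses `Diagonal` and `cholesky`. That variant is a readability-oriented alternative that still allocates, not the pre-allocated `myfun` above.

```julia
using LinearAlgebra, Random

Random.seed!(257)
# Small illustrative sizes so the dense diagm(w) stays cheap
n, p = 500, 10
X = [ones(n) randn(n, p - 1)]
z = randn(n)
w = 0.25 * rand(n)

# Literal translation of the formula (Method 1)
β_naive = inv(X' * diagm(w) * X) * X' * diagm(w) * z

# Same estimate via a structured diagonal and a Cholesky solve
XtW = X' * Diagonal(w)                             # p×n, no n×n matrix formed
β_chol = cholesky(Symmetric(XtW * X)) \ (XtW * z)

# Agreement up to floating-point rounding, and the largest absolute gap
β_naive ≈ β_chol
maximum(abs, β_naive - β_chol)
```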

In [20]:
bm2 = @benchmark myfun($X, $z, $w, $XtWt, $XtX, $Xtz)
bm2

BenchmarkTools.Trial: 5653 samples with 1 evaluation per sample.
 Range (min … max):  793.500 μs …   3.680 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     867.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   883.638 μs ± 107.249 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In R, a better implementation is
```r
solve(t(X * w) %*% X, t(X) %*% (z * w))
```
It's much faster than the naive implementation. To achieve zero memory allocation, some low-level coding using C++ and RcppEigen is necessary.

In [21]:
R"""
rbm2 <- bench::mark(
  solve(t(X * w) %*% X, t(X) %*% (z * w)),
  ) |> 
  print(width = Inf)
""";

# A tibble: 1 x 13
  expression                                   min   median `itr/sec` mem_alloc
  <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 solve(t(X * w) %*% X, t(X) %*% (z * w))   20.8ms   21.3ms      46.4    11.6MB
  `gc/sec` n_itr  n_gc total_time result          memory             
     <dbl> <int> <dbl>   <bch:tm> <list>          <list>             
1     4.42    21     2      452ms <dbl [100 x 1]> <Rprofmem [10 x 3]>
  time            gc               
  <list>          <list>           
1 <bench_tm [23]> <tibble [23 x 3]>


### Conclusion

By careful consideration of the computational algorithms, we achieve a whopping speedup (in Julia) of

In [22]:
# speed-up
median(bm1.times) / median(bm2.times)

85.47347486677229

For PhD students, that means, instead of waiting more than two months for your simulations to finish, you only need one day!
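
A rough flop count suggests where the speedup comes from. This is a back-of-envelope sketch, assuming a dense $m \times k$ by $k \times n$ product costs about $2mkn$ flops and ignoring memory traffic and lower-order terms. In the benchmarked form of Method 1, two products involve the dense $n \times n$ matrix `diagm(w)`, each with a $p \times n$ or $n \times p$ partner:

$$
\text{Method 1:} \quad 2pn^2 + 2pn^2 = 4pn^2 = 4 \cdot 100 \cdot 5000^2 = 10^{10} \text{ flops}.
$$

In Method 2, the dominant cost is the product of the $p \times n$ matrix $\mathbf{X}^T \mathbf{W}$ with $\mathbf{X}$, while the diagonal scaling, the Cholesky factorization, and the triangular solves are of lower order:

$$
\text{Method 2:} \quad 2np^2 = 2 \cdot 5000 \cdot 100^2 = 10^{8} \text{ flops}.
$$

The ratio of about $100$ is of the same order as the measured speedup of roughly $85$.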