---
title: "Symbolic Regression in Julia"
date: 2024-03-14

jupyter: julia-1.10
format: 
    pdf: default
    html: default
    
engine: julia
---








![](cover.png)


## What is it?

A linear regression finds the line that is "closest" to a dataset. In a similar manner, symbolic regression is an algorithm that finds a combination of symbols that minimizes the mean squared error on a given dataset. These symbols are unary and binary operators, such as the binary $+$ or unary functions like $\cos$ and $1/x$, applied to the variables of the dataset.
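
To make the objective concrete: for a candidate expression $g$ and data points $(x_i, y_i)$, the mean squared error is $\frac{1}{n}\sum_{i=1}^{n} (g(x_i) - y_i)^2$. The sketch below scores two hand-written candidates on a small dataset in plain Julia. The helper `mse`, the names `xs` and `ys`, and the candidate list are illustrative assumptions, not part of SymbolicRegression.jl, which searches a far larger space of expressions automatically.

```julia
# Mean squared error of a candidate function g on the dataset (xs, ys).
# `mse` is a hypothetical helper for illustration, not a library function.
mse(g, xs, ys) = sum(abs2, g.(xs) .- ys) / length(xs)

xs = collect(-3:0.1:3)
ys = @. -xs^2 + 1                 # values of the target function

# Two hand-written candidates built from +, -, * and the variable x
candidates = [
    "x * x"     => (x -> x * x),
    "1 - x * x" => (x -> 1 - x * x),
]

for (name, g) in candidates
    println(rpad(name, 10), " MSE = ", mse(g, xs, ys))
end
```

The second candidate reproduces the target, so its error is zero up to rounding, while the first one is far off. A symbolic regression automates this comparison over many generated candidates.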

## Example 1

Let's try to approximate the function $f(x) = -x^2 + 1$ using the symbols $+, -, *$ combined with the variable $x$.


In [None]:
using SymbolicRegression, MLJ, SymbolicUtils
using Plots

x = [-3:0.1:3;]
y = @. - x^2 + 1;

scatter(x, y)

First, we define a model:


In [None]:
model = SRRegressor(
    binary_operators=[+, -, *],    
    niterations=50,
    seed = 1
);

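(Note: the argument `seed = 1` fixes the random seed so that the result stays the same each time this Quarto document is compiled; you can leave it out in your own code. Depending on the package version and parallelism settings, exact reproducibility may also require `deterministic = true` with serial execution.)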

and then fit it to our dataset:


In [None]:
#| output: false
X = reshape(x, (length(x), 1))

mach = machine(model, X, y)
fit!(mach)
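
A note on the `reshape` call: MLJ models take the features as a table or matrix with one row per observation and one column per variable, so the vector `x` is first turned into a single-column matrix. A quick check in plain Julia (no packages needed):

```julia
x = collect(-3:0.1:3)            # 61 sample points, same grid as above
X = reshape(x, (length(x), 1))   # single-column matrix: one row per observation

println(size(x))   # size of the original vector
println(size(X))   # size of the matrix passed to `machine`
```

With more than one input variable, each variable would become its own column of `X`.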

We can see a report about the results:


In [None]:
r = report(mach);

r

This report contains the losses,


In [None]:
r.losses

the equations,


In [None]:
r.equations

and the best of the equations found (i.e. the one picked by the model's selection method, which by default favors a low loss while also weighing complexity):


In [None]:
node_to_symbolic(r.equations[r.best_idx], model)

Here, we can read $x_1$ as $x$, because we only have one variable.

Notice that this expression simplifies to our original $f$.

## Example 2

Now let's take a more interesting example, $f(x) = x^2 + 2\cos(x)^2$:


In [None]:
y = @. x^2 + 2cos(x)^2 

scatter(x, y)

We again create a model and fit it, but now we allow more operations: besides the earlier binary operators, we also allow the unary `cos` function:


In [None]:
#| output: false

model = SRRegressor(
    binary_operators = [+, -, *],    
    unary_operators = [cos],
    niterations=50,
    seed = 1
);

mach = machine(model, X, y)
fit!(mach)

and see the best equation:


In [None]:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)

So, we got

$$
x \cdot x + \cos(x + x) - (-1) = x^2 + \cos(2x) + 1
$$

Since $\cos(2x) + 1 = 2\cos^2(x)$, we recover the original function.
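
For completeness, this identity follows from the double-angle formula $\cos(2x) = \cos^2(x) - \sin^2(x)$ together with $\sin^2(x) + \cos^2(x) = 1$:

$$
\cos(2x) + 1 = \cos^2(x) - \sin^2(x) + 1 = \cos^2(x) + \left(1 - \sin^2(x)\right) = 2\cos^2(x)
$$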

## Example 3

Even after adding some noise to a dataset, symbolic regression can still find a very good approximation.

Take $f(x) = 0.3x^3 - x^2 + 2\cos(x) + \epsilon(x)$, where $\epsilon(x)$ is uniform random noise in $[0, 1]$:


In [None]:
x = [-5:0.1:5;]
X = reshape(x, (length(x), 1))
errors = rand(length(x))
y = @. 0.3*x^3 - x^2 + 2cos(x) + errors

scatter(x, y)

As before, we create a model and fit it, this time with 60 iterations:

In [None]:
#| output: false
model = SRRegressor(
    binary_operators = [+, -, *],    
    unary_operators = [cos],
    niterations=60,
    seed = 1
);

mach = machine(model, X, y)
fit!(mach)

and see the best equation:


In [None]:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)

We can plot the prediction and the original dataset to compare them:


In [None]:
y_pred = predict(mach, X)
 
scatter(x, y, label = "data")
scatter!(x, y_pred, color = "red", label = "prediction")

Not bad at all!
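
One caveat when reading the fitted equation: the noise here is uniform on $[0, 1)$, so it has mean $1/2$ and variance $1/12$ rather than mean zero. A least-squares fit will therefore tend to absorb a constant offset of about $0.5$, and we should not expect the mean squared error to drop much below $1/12 \approx 0.083$ without fitting the noise itself. The sketch below checks both numbers empirically using only the standard library; the seed and the sample size are arbitrary choices for illustration, not part of the example above.

```julia
using Random, Statistics

rng = MersenneTwister(1)      # arbitrary seed, for illustration only
noise = rand(rng, 100_000)    # uniform samples in [0, 1), like `errors` above

println("mean     ≈ ", mean(noise))   # theoretical value: 1/2
println("variance ≈ ", var(noise))    # theoretical value: 1/12
```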

You can find more examples in the [package documentation](https://astroautomata.com/SymbolicRegression.jl/dev/examples/). If you are feeling brave, read the [original paper](https://arxiv.org/abs/2305.01582) on arXiv!