# How to use
This notebook explains how to use nb_sisso for classification problems.   

## Install
Run the following command in the environment where you want to install nb_sisso.

In [14]:
#!pip install git+https://github.com/souno1218/nb_sisso.git

## Data preparation

First, create random data.   
There are six initial features, `a` through `f` (`n_features=6`), 220 samples (`n_samples=220`), and a target variable named `target`.   
The target variable is binary, taking the values `[True,False]`.   

In [15]:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
x = rng.random((6, 220))            # shape (n_features, n_samples)
y = rng.choice((True, False), 220)  # binary target, shape (n_samples,)
columns = [chr(i + 97) for i in range(x.shape[0])]  # "a" .. "f"
index = [f"sample_{i}" for i in range(x.shape[1])]
df = pd.DataFrame(x.T, columns=columns, index=index)  # one row per sample
df["target"] = y
df.head()

,a,b,c,d,e,f,target
sample_0,0.921883,0.497861,0.948289,0.985947,0.808652,0.760885,False
sample_1,0.678842,0.154782,0.26095,0.419566,0.531651,0.746601,False
sample_2,0.651391,0.767856,0.295795,0.6106,0.676014,0.810512,True
sample_3,0.792191,0.519502,0.713539,0.402951,0.414618,0.164235,True
sample_4,0.317687,0.688351,0.715291,0.443022,0.795014,0.496883,False


Assume the features have the units `{a: 'm/s', b: 'm^2', c: 'm/s^2', d: 'a.u.', e: 'm', f: 'a.u.'}`.   
Two base units (`m` and `s`) appear here, so create an ndarray of shape `(n_features,2)` that holds the exponent of each base unit for each feature, as follows.   
The dtype must be int64.   

In [16]:
units=np.zeros((x.shape[0],2),dtype="int64")
units[0,0] = 1  # a:    "m"
units[0,1] = -1 # a:   "/s"
units[1,0] = 2  # b:  "m^2"
units[1,1] = 0  # b: "no s"
units[2,0] = 1  # c:    "m"
units[2,1] = -2 # c: "/s^2"
units[3,0] = 0  # d: "no m"
units[3,1] = 0  # d: "no s"
units[4,0] = 1  # e:    "m"
units[4,1] = 0  # e: "no s"
units[5,0] = 0  # f: "no m"
units[5,1] = 0  # f: "no s"
units_df=pd.DataFrame(units.T,columns=columns,index=["m","s"])
units_df

,a,b,c,d,e,f
m,1,2,1,0,1,0
s,-1,0,-2,0,0,0
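
As a worked illustration of why the exponents are stored this way, the sketch below applies the standard rules of dimensional analysis to these exponent vectors: multiplication adds them, division subtracts them, and addition or subtraction is only meaningful when they are equal. The helper names are hypothetical, and how nb_sisso applies such rules internally is not shown in this notebook.

```python
import numpy as np

# Exponents of (m, s) for each feature, matching the units table above.
feature_units = {
    "a": np.array([1, -1], dtype=np.int64),  # m/s
    "b": np.array([2, 0], dtype=np.int64),   # m^2
    "c": np.array([1, -2], dtype=np.int64),  # m/s^2
    "d": np.array([0, 0], dtype=np.int64),   # a.u. (no m, no s)
    "e": np.array([1, 0], dtype=np.int64),   # m
    "f": np.array([0, 0], dtype=np.int64),   # a.u. (no m, no s)
}

def unit_of_product(u, v):
    """Multiplying two quantities adds their unit exponents."""
    return u + v

def unit_of_quotient(u, v):
    """Dividing two quantities subtracts their unit exponents."""
    return u - v

def can_add(u, v):
    """Addition and subtraction need identical unit exponents."""
    return bool(np.array_equal(u, v))

# (m/s) / (m/s^2) = s, i.e. exponents [0, 1]
a_over_c = unit_of_quotient(feature_units["a"], feature_units["c"])
# (m/s) * m = m^2/s, i.e. exponents [2, -1]
a_times_e = unit_of_product(feature_units["a"], feature_units["e"])

print(a_over_c, a_times_e)
print(can_add(feature_units["d"], feature_units["f"]))  # both dimensionless
print(can_add(feature_units["a"], feature_units["e"]))  # m/s versus m
```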


Check the shape of each array.

In [17]:
print(f"x.shape={x.shape}, y.shape={y.shape}, units.shape={units.shape}")

x.shape=(6, 220), y.shape=(220,), units.shape=(6, 2)
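
The same requirements can be checked programmatically before running anything expensive. The following is a minimal sketch; `check_inputs` is a hypothetical helper written for this notebook, not part of nb_sisso, and it only encodes the constraints stated above (`x` is `(n_features, n_samples)`, `y` is `(n_samples,)`, `units` is `(n_features, n_unit_types)` with dtype int64).

```python
import numpy as np

def check_inputs(x, y, units):
    """Raise ValueError if the arrays do not match the layout described above."""
    if x.ndim != 2:
        raise ValueError(f"x must be 2-D (n_features, n_samples), got {x.shape}")
    n_features, n_samples = x.shape
    if y.shape != (n_samples,):
        raise ValueError(f"y must have shape ({n_samples},), got {y.shape}")
    if units.ndim != 2 or units.shape[0] != n_features:
        raise ValueError(f"units must have shape ({n_features}, k), got {units.shape}")
    if units.dtype != np.int64:
        raise ValueError(f"units must have dtype int64, got {units.dtype}")

# Arrays with the same layout as in this notebook.
rng = np.random.default_rng(0)
x = rng.random((6, 220))
y = rng.choice((True, False), 220)
units = np.zeros((6, 2), dtype="int64")

check_inputs(x, y, units)  # passes silently
```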


## SIS

Choose the operators to use. The following are available:   
`["+","-","*","/","*-1","**-1","**2","sqrt","| |","**3","cbrt","**6","exp","exp-","log","sin","cos"]`

In [18]:
operators_to_use=["+","-","*","/","**2","| |","**3"]

Choose the `model_score`. Several scoring functions are provided for binary classification; multi-class classification and regression are not supported at the moment.   
If you know Python, you can write your own fairly easily, as long as it takes only `x,y` as arguments and returns two float64 values.   

In [19]:
from nb_sisso.model_score_1d import Hull_1d
model_score=Hull_1d
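
To make the "takes `x,y`, returns two float64 values" contract concrete, here is a minimal sketch of a custom score. `threshold_score_1d` is a hypothetical function and is not how `Hull_1d` works: its first value is the best accuracy achievable with a single threshold on `x`, and its second is the distance between the class means, used as a tie-breaker. Whether nb_sisso additionally requires the function to be numba-compatible is not stated in this notebook, so treat this as plain NumPy only.

```python
import numpy as np

def threshold_score_1d(x, y):
    """Hypothetical score: (best single-threshold accuracy, class-mean gap)."""
    n = y.shape[0]
    y_sorted = y[np.argsort(x)]
    # true_left[k] = number of True labels among the k smallest x values
    true_left = np.concatenate(([0], np.cumsum(y_sorted)))
    k = np.arange(n + 1)
    total_true = true_left[-1]
    # Predict True for the k smallest values and False for the rest.
    false_right = (n - k) - (total_true - true_left)
    correct = true_left + false_right
    # The flipped rule (False on the left) gets n - correct samples right.
    accuracy = np.maximum(correct, n - correct).max() / n
    gap = abs(x[y].mean() - x[~y].mean())
    return np.float64(accuracy), np.float64(gap)

x_demo = np.array([0.1, 0.2, 0.8, 0.9])
y_demo = np.array([True, True, False, False])
acc, gap = threshold_score_1d(x_demo, y_demo)
print(acc, gap)
```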

Run SIS. Setting `is_use_1=True` includes `np.ones(n_samples)` as an initial feature.   
Even with `is_use_1=True`, you do not need to add `np.ones(n_samples)` to `x` yourself.   
In this example, the number of formulas to save is 5000 (`how_many_to_save=5000`), the maximum number of operators is 4 (`max_n_op=4`), and `is_use_1=True`.   
The first run is slow to start because the code is compiled.   
There are several optional arguments; see nb_SO.py.   


In [20]:
from nb_sisso import SIS
score,eq=SIS(x,y,model_score=model_score,units=units,how_many_to_save=5000,is_use_1=True,max_n_op=4,operators_to_use=operators_to_use)

2024-10-23 15:17:33,074 SIS [INFO] : numba=0.60.0, numpy=2.0.2
2024-10-23 15:17:33,075 SIS [INFO] : OPT=_OptLevel(3), THREADING_LAYER=default
2024-10-23 15:17:33,075 SIS [INFO] : USING_SVML=False, ENABLE_AVX=True, DISABLE_JIT=0
2024-10-23 15:17:33,075 SIS [INFO] : SIS
2024-10-23 15:17:33,076 SIS [INFO] : num_threads=8, how_many_to_save=5000, 
2024-10-23 15:17:33,077 SIS [INFO] : how_many_to_save_per_1_core=5000, 
2024-10-23 15:17:33,077 SIS [INFO] : max_n_op=4, model_score=Hull_1d, 
2024-10-23 15:17:33,077 SIS [INFO] : x.shape=(6, 220), is_use_1=True
2024-10-23 15:17:33,077 SIS [INFO] : use_binary_op=[-1, -2, -3, -4], 
2024-10-23 15:17:33,078 SIS [INFO] : use_unary_op=[-7, -9, -10]
2024-10-23 15:17:33,078 SIS [INFO] : units=[ 1 -1] , [2 0] , [ 1 -2] , [0 0] , [1 0] , [0 0]
2024-10-23 15:17:33,078 SIS [INFO] : compiling
2024-10-23 15:17:33,082 SIS [INFO] : END, compile
2024-10-23 15:17:33,083 SIS [INFO] :   n_op=1
2024-10-23 15:17:33,083 SIS [INFO] :     binary_op n_op1:n_op2 = 0:0,  lo
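
The `use_binary_op` and `use_unary_op` lines in this log can be read against the operator list: each code appears to be the negated 1-based position of the operator in the full list of available operators. The sketch below reproduces the logged values under that reading; it is an inference from the log output, and `operator_codes` is a hypothetical helper, not an nb_sisso function.

```python
# Full list of available operators, in the order given in this notebook.
ALL_OPERATORS = ["+", "-", "*", "/", "*-1", "**-1", "**2", "sqrt", "| |",
                 "**3", "cbrt", "**6", "exp", "exp-", "log", "sin", "cos"]
BINARY_OPERATORS = {"+", "-", "*", "/"}

def operator_codes(operators_to_use):
    """Split operators into binary and unary codes (assumed: -(1-based index))."""
    binary, unary = [], []
    for op in operators_to_use:
        code = -(ALL_OPERATORS.index(op) + 1)
        if op in BINARY_OPERATORS:
            binary.append(code)
        else:
            unary.append(code)
    return binary, unary

binary_codes, unary_codes = operator_codes(["+", "-", "*", "/", "**2", "| |", "**3"])
print(binary_codes)  # compare with use_binary_op in the log
print(unary_codes)   # compare with use_unary_op in the log
```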

`SIS` returns two values, `score` and `eq`. The shape of `score` is `(how_many_to_save,2)`, and row `i` of `score` corresponds to row `i` of `eq`.   
The second dimension is 2 because each model_score returns two score values.   
The shape of `eq` is `(how_many_to_save,2*max_n_op+1)`; it holds the formulas in an encoded form that is not human-readable.   
Pass a row of `eq` to the `decryption` function to get a readable string.

In [21]:
from nb_sisso.utils import decryption
decryption(eq[0])

'((((1-f)/a)/b)*c)'
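
A decoded string like this one can be sanity-checked by evaluating it directly on the raw features. The sketch below does so with Python's `eval` and a restricted namespace; `evaluate_formula` is a hypothetical helper, it only handles strings made of plain arithmetic (`+ - * / **`) and feature names, and `eval` should only ever be used on strings you generated yourself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Same layout as x in this notebook; the offset keeps values away from zero.
x_demo = rng.random((6, 220)) + 0.1
names = ["a", "b", "c", "d", "e", "f"]

def evaluate_formula(formula, x, names):
    """Evaluate an arithmetic formula string on the rows of x."""
    namespace = {name: x[i] for i, name in enumerate(names)}
    # Empty __builtins__ limits the expression to the feature names.
    return eval(formula, {"__builtins__": {}}, namespace)

values = evaluate_formula("((((1-f)/a)/b)*c)", x_demo, names)
# The same expression written out by hand, for comparison.
expected = (1 - x_demo[5]) / x_demo[0] / x_demo[1] * x_demo[2]
print(values.shape, np.allclose(values, expected))
```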

In [22]:
df_SIS_ans=pd.DataFrame()
str_eq=[decryption(eq[i]) for i in range(eq.shape[0])]
df_SIS_ans["score1"]=score[:,0]
df_SIS_ans["score2"]=score[:,1]
df_SIS_ans["str_eq"]=str_eq
df_SIS_ans.head()

,score1,score2,str_eq
0,0.090909,-9.205452,((((1-f)/a)/b)*c)
1,0.090909,-101.158127,(((a/(1-f))*b)/c)
2,0.086364,-2.659392,((((b*f)-b)*e)/a)
3,0.086364,-41.124011,((((b/f)-b)*c)*e)
4,0.086364,-116.164626,(((a/(1-f))/b)/e)
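
The rows above are consistent with a ranking by `score1` first and `score2` as a tie-breaker, larger being better in both. That reading is inferred from this output rather than from documented behavior. Under that assumption, the same ordering can be reproduced with `np.lexsort`, shown here on the five rows of the table in shuffled order.

```python
import numpy as np

# The five (score1, score2) pairs from the table above, deliberately shuffled.
score_demo = np.array([
    [0.086364, -2.659392],
    [0.090909, -101.158127],
    [0.086364, -116.164626],
    [0.090909, -9.205452],
    [0.086364, -41.124011],
])

# np.lexsort treats the LAST key as the primary one, so score1 is primary and
# score2 breaks ties; reversing turns the ascending order into descending.
order = np.lexsort((score_demo[:, 1], score_demo[:, 0]))[::-1]
ranked = score_demo[order]
print(ranked)
```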


## SO

SO evaluates every combination of two of the `how_many_to_save` SIS results and keeps the highest-scoring ones.   
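
To get a feel for the size of this search, the number of unordered pairs among 5000 features can be computed directly; it equals the `loop` value reported in the SO log in this notebook. The snippet uses only the Python standard library, and the small `combinations` call just illustrates which index pairs are visited.

```python
from itertools import combinations
from math import comb

how_many_to_save = 5000  # number of SIS results used in this notebook
combination_dim = 2      # features combined per candidate

# Number of unordered pairs: 5000 * 4999 / 2
n_candidates = comb(how_many_to_save, combination_dim)
print(n_candidates)  # 12497500

# The same enumeration on a tiny example with 4 features.
small_pairs = list(combinations(range(4), combination_dim))
print(small_pairs)
```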

Create the features for SO from the SIS results: pass the `x` used in SIS and the SIS return value `eq` to `eq_list_to_num`, which returns the feature values computed according to `eq`.   
SO expects the features as a list, so wrap the array in a list.   

In [23]:
from nb_sisso.utils import eq_list_to_num
X=eq_list_to_num(x,eq)
list_x=[X]

Choose the `model_score` for SO, this time from `model_score_2d`. As in the SIS step, only binary classification scores are provided; multi-class classification and regression are not supported at the moment.   
A custom score function also works here, as long as it takes only `x,y` as arguments and returns two float64 values.   

In [24]:
from nb_sisso.model_score_2d import Hull_2d
model_score=Hull_2d

Run SO. See nb_SO.py for the detailed role of `which_arr_to_choose_from`.   
`combination_dim` is the number of features combined in each candidate, chosen from the `how_many_to_save` features.   
There are several other optional arguments; see nb_SO.py.   

In [25]:
from nb_sisso import SO
score_list,index_list=SO(list_x,y,model_score=model_score,which_arr_to_choose_from={1:0,2:0},combination_dim=2)

2024-10-23 15:17:33,757 SO [INFO] : numba=0.60.0, numpy=2.0.2
2024-10-23 15:17:33,758 SO [INFO] : OPT=_OptLevel(3), THREADING_LAYER=default
2024-10-23 15:17:33,758 SO [INFO] : USING_SVML=False, ENABLE_AVX=True, DISABLE_JIT=0
2024-10-23 15:17:33,758 SO [INFO] : SO
2024-10-23 15:17:33,758 SO [INFO] : num_threads=8, how_many_to_save=50, 
2024-10-23 15:17:33,759 SO [INFO] : combination_dim=2, model_score=Hull_2d, 
2024-10-23 15:17:33,759 SO [INFO] : which_arr_to_choose_from={1: 0, 2: 0}
2024-10-23 15:17:33,759 SO [INFO] : loop=12497500
2024-10-23 15:17:33,759 SO [INFO] : compiling
2024-10-23 15:17:33,760 SO [INFO] : END, compile
2024-10-23 15:17:43,764 SO [INFO] :     513089/12497500  0:00:10.003461 : 0:03:53.654567
2024-10-23 15:17:53,775 SO [INFO] :    1018706/12497500  0:00:20.014311 : 0:03:45.521547
2024-10-23 15:18:03,783 SO [INFO] :    1552003/12497500  0:00:30.023081 : 0:03:31.737698
2024-10-23 15:18:13,794 SO [INFO] :    1993113/12497500  0:00:40.034067 : 0:03:30.993222
2024-10-23 

`SO` returns scores and indices. Each row of `index_list` holds the indices, into the features passed in `list_x`, of the combination that produced the corresponding row of `score_list`.

In [27]:
arr_columns=np.array([decryption(eq[i]) for i in range(eq.shape[0])])
df_ans=pd.DataFrame(columns=["score1","score2","data_x","data_y"])
df_ans["score1"]=score_list[:,0]
df_ans["score2"]=score_list[:,1]
df_ans["data_x"]=arr_columns[index_list[:,0]]
df_ans["data_y"]=arr_columns[index_list[:,1]]
df_ans

,score1,score2,data_x,data_y
0,0.5,0.0,(((c/f)/(b**3))**3),(d/(((b*b)**2)**3))
1,0.5,0.0,(((f/(b**3))**3)/e),(((c/(b**3))**3)/d)
2,0.5,0.0,(d/((a*(b**3))**3)),((((c/b)/b)**2)**3)
3,0.5,0.0,(((f/(b**3))**3)/e),(e*((c/(b**3))**3))
4,0.5,0.0,(((f/(b**3))**3)/e),(e/(((b**3)/c)**3))
5,0.5,0.0,(((f/(b**3))**3)/e),(a/(((b**3)/c)**3))
6,0.5,0.0,(((c/f)/(b**3))**3),(d/(((b*b)**3)**2))
7,0.5,0.0,(((c/f)/(b**3))**3),(d/((b*(b**3))**3))
8,0.5,0.0,(((c/f)/(b**3))**3),(d/(((b**3)**2)**2))
9,0.5,0.0,((d/a)/((b**3)**3)),(1/(((b**3)/c)**3))
