# How to use
This notebook explains how to use nb_sisso for binary classification problems.   

## Install
To install nb_sisso, uncomment and run the following line in the target environment.

In [2]:
#!pip install git+https://github.com/souno1218/nb_sisso.git

## Data preparation

First, create random data.   
The six initial features are named `a` to `f` (`n_features=6`), there are 220 samples (`n_samples=220`), and the target variable is `target`.   
The target variable is binary and takes the values `True` or `False`.   

In [3]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
x=rng.random((6,220))
y=rng.choice((True,False),220)
columns=[f"{chr(i + 97)}" for i in range(x.shape[0])]
index=[f"sample_{i}" for i in range(x.shape[1])]
df=pd.DataFrame(x.T,columns=columns,index=index)
df["target"]=y
df.head()

Unnamed: 0,a,b,c,d,e,f,target
sample_0,0.339269,0.115281,0.548645,0.889588,0.350939,0.401582,False
sample_1,0.77277,0.738557,0.570823,0.674099,0.512062,0.449124,False
sample_2,0.956897,0.982367,0.301842,0.778214,0.35994,0.64794,True
sample_3,0.378307,0.061617,0.308011,0.032697,0.63193,0.223087,True
sample_4,0.710007,0.445686,0.392497,0.933997,0.513576,0.398541,True
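
Because `np.random.default_rng()` is called without a seed, the values shown above differ on every run. The following is a minimal sketch of a reproducible variant. The helper name `make_data` and the seed value `0` are illustrative assumptions and are not part of nb_sisso or the original notebook.

```python
import numpy as np

SEED = 0  # arbitrary illustrative value, not from the original notebook


def make_data(seed, n_features=6, n_samples=220):
    """Hypothetical helper: same data layout as above, but with a fixed seed."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_features, n_samples))    # shape (n_features, n_samples)
    y = rng.choice((True, False), n_samples)   # boolean target
    return x, y


x1, y1 = make_data(SEED)
x2, y2 = make_data(SEED)
# The same seed gives identical arrays on every call.
print(np.array_equal(x1, x2), np.array_equal(y1, y2))
```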


Assume the features have the units `{a: 'm/s', b: 'm^2', c: 'm/s^2', d: 'a.u.', e: 'm', f: 'a.u.'}`.   
These involve two base units (`m` and `s`), so create an ndarray of shape `(n_features,2)` that holds the exponent of each base unit for each feature, as follows. Features in `a.u.` get exponent 0 for both.   
The dtype must be int64.   

In [4]:
units=np.zeros((x.shape[0],2),dtype="int64")
units[0,0] = 1  # a:    "m"
units[0,1] = -1 # a:   "/s"
units[1,0] = 2  # b:  "m^2"
units[1,1] = 0  # b: "no s"
units[2,0] = 1  # c:    "m"
units[2,1] = -2 # c: "/s^2"
units[3,0] = 0  # d: "no m"
units[3,1] = 0  # d: "no s"
units[4,0] = 1  # e:    "m"
units[4,1] = 0  # e: "no s"
units[5,0] = 0  # f: "no m"
units[5,1] = 0  # f: "no s"
units_df=pd.DataFrame(units.T,columns=columns,index=["m","s"])
units_df

Unnamed: 0,a,b,c,d,e,f
m,1,2,1,0,1,0
s,-1,0,-2,0,0,0
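
The same array can be built from a mapping, which keeps each feature's exponents on one line. The sketch below also illustrates the general rule of dimensional analysis behind this encoding: multiplication adds exponent vectors, and addition only makes sense between identical vectors. How SIS uses `units` internally is not described in this notebook, so the second half is an illustration of the encoding, not of the library's implementation.

```python
import numpy as np

# Exponents of the base units (m, s) per feature, copied from the table above.
unit_exponents = {
    "a": (1, -1),  # m/s
    "b": (2, 0),   # m^2
    "c": (1, -2),  # m/s^2
    "d": (0, 0),   # a.u. (no m, no s)
    "e": (1, 0),   # m
    "f": (0, 0),   # a.u. (no m, no s)
}
columns = list(unit_exponents)
units = np.array([unit_exponents[name] for name in columns], dtype="int64")

# General dimensional analysis (not nb_sisso-specific):
# multiplying quantities adds their exponent vectors.
a, c, e = (columns.index(name) for name in "ace")
units_a_squared = 2 * units[a]         # (m/s)^2    -> m^2/s^2
units_c_times_e = units[c] + units[e]  # (m/s^2)*m  -> m^2/s^2
# Two quantities can be added only if their exponent vectors are identical.
can_add = bool(np.array_equal(units_a_squared, units_c_times_e))
print(units.shape, units.dtype, can_add)
```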


Check the shape of each array.

In [5]:
print(f"x.shape={x.shape}, y.shape={y.shape}, units.shape={units.shape}")

x.shape=(6, 220), y.shape=(220,), units.shape=(6, 2)
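
The shape and dtype requirements stated so far can also be checked programmatically before the slow compile step. This is a minimal sketch: `check_inputs` is a hypothetical helper, not part of nb_sisso, and it only encodes the constraints mentioned in this notebook.

```python
import numpy as np


def check_inputs(x, y, units):
    """Hypothetical helper: raise ValueError if the arrays are inconsistent."""
    if x.ndim != 2:
        raise ValueError(f"x must be 2-D (n_features, n_samples), got {x.shape}")
    n_features, n_samples = x.shape
    if y.shape != (n_samples,):
        raise ValueError(f"y must have shape ({n_samples},), got {y.shape}")
    if units.ndim != 2 or units.shape[0] != n_features:
        raise ValueError(f"units must have shape ({n_features}, n_units), got {units.shape}")
    if units.dtype != np.int64:
        raise ValueError(f"units must be int64, got {units.dtype}")


# Stand-in arrays with the same layout as in this notebook.
x = np.zeros((6, 220))
y = np.zeros(220, dtype=bool)
units = np.zeros((6, 2), dtype="int64")
check_inputs(x, y, units)  # passes silently
```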


## SIS

Choose which operators to use. The following are available.   
`["+","-","*","/","*-1","**-1","**2","sqrt","| |","**3","cbrt","**6","exp","exp-","log","sin","cos"]`

In [6]:
operators_to_use=["+","-","*","/","**2","| |","**3"]

Choose the model_score. Several are provided for binary classification; multi-class classification and regression are not supported at the moment.   
If you know Python, you can write your own fairly easily, as long as it takes only `x,y` as arguments and returns two float64 values.   

In [7]:
from nb_sisso.model_score_1d import Hull_1d
model_score=Hull_1d
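
To make the "takes `x,y`, returns two float64 values" contract concrete, here is a minimal sketch of a custom one-dimensional score written in plain NumPy. The function `threshold_score_1d` and both of its scores are assumptions made for illustration; they are not what `Hull_1d` computes. The SIS log shows that nb_sisso runs on numba, so a real score function may need to be numba-compilable; check model_score_1d.py for the form the library expects.

```python
import numpy as np


def threshold_score_1d(x, y):
    """Hypothetical score: (best single-threshold accuracy, class-mean gap).

    x: float array of shape (n_samples,), y: bool array of shape (n_samples,).
    Assumes both classes are present; ties in x are not treated specially.
    """
    order = np.argsort(x)
    y_sorted = y[order].astype(np.int64)
    n = y_sorted.shape[0]
    n_true = y_sorted.sum()
    # true_left[k] = number of True labels among the k smallest x values.
    true_left = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(y_sorted)))
    k = np.arange(n + 1)
    false_right = (n - k) - (n_true - true_left)
    # Accuracy of the rule "True below the cut, False above" for every cut.
    # The mirrored rule has accuracy 1 - acc, so take the better of the two.
    acc = (true_left + false_right) / n
    score1 = float(np.maximum(acc, 1.0 - acc).max())
    # Tie-breaker: gap between the class means relative to the overall spread.
    gap = abs(x[y].mean() - x[~y].mean())
    score2 = float(gap / (x.std() + 1e-12))
    return score1, score2


x_demo = np.array([0.1, 0.9, 0.2, 0.8])
y_demo = np.array([False, True, False, True])
print(threshold_score_1d(x_demo, y_demo))
```

In this demo the two classes are perfectly separable by one cut, so the first score is 1.0.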

Run SIS. Setting `is_use_1=True` includes `np.ones(n_samples)` as an initial feature.   
When `is_use_1=True`, you do not need to add `np.ones(n_samples)` to `x` yourself.   
Here, 5000 results are saved (`how_many_to_save=5000`), the maximum number of operators is 4 (`max_n_op=4`), and `is_use_1=True`.   
The first run is slow to start because the code is compiled first (about 27 seconds in the log below).   
There are several optional arguments; see nb_SO.py.   


In [8]:
from nb_sisso import SIS
score,eq=SIS(x,y,model_score=model_score,units=units,how_many_to_save=5000,is_use_1=True,max_n_op=4,operators_to_use=operators_to_use)

2024-11-06 17:28:31,201 SIS [INFO] : numba=0.60.0, numpy=2.0.2
2024-11-06 17:28:31,202 SIS [INFO] : OPT=_OptLevel(3), THREADING_LAYER=default
2024-11-06 17:28:31,202 SIS [INFO] : USING_SVML=False, ENABLE_AVX=True, DISABLE_JIT=0
2024-11-06 17:28:31,202 SIS [INFO] : SIS
2024-11-06 17:28:31,203 SIS [INFO] : num_threads=8, how_many_to_save=5000, 
2024-11-06 17:28:31,203 SIS [INFO] : how_many_to_save_per_1_core=5000, 
2024-11-06 17:28:31,203 SIS [INFO] : max_n_op=4, model_score=Hull_1d, 
2024-11-06 17:28:31,204 SIS [INFO] : x.shape=(6, 220), is_use_1=True
2024-11-06 17:28:31,204 SIS [INFO] : use_binary_op=[-1, -2, -3, -4], 
2024-11-06 17:28:31,204 SIS [INFO] : use_unary_op=[-7, -9, -10]
2024-11-06 17:28:31,205 SIS [INFO] : units=[ 1 -1] , [2 0] , [ 1 -2] , [0 0] , [1 0] , [0 0]
2024-11-06 17:28:31,205 SIS [INFO] : compiling
2024-11-06 17:28:58,388 SIS [INFO] : END, compile
2024-11-06 17:28:58,388 SIS [INFO] :   n_op=1
2024-11-06 17:28:58,588 SIS [INFO] :     binary_op n_op1:n_op2 = 0:0,  lo

SIS returns two values, `score` and `eq`. `score` has shape `(how_many_to_save,2)`, and its rows correspond to the rows of `eq`.   
The second dimension is 2 because each expression has two scores.   
`eq` has shape `(how_many_to_save,2*max_n_op+1)`; it stores the expressions in an encoded form that is not human-readable.   
Pass a row of `eq` to the `decryption` function to get a readable string.

In [9]:
from nb_sisso.utils import decryption
decryption(eq[0])

'((f-((f+f)**2))*e)'
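
A decrypted string can be sanity-checked by evaluating it against named feature arrays. The sketch below assumes that the string is a valid Python expression over the feature names `a` to `f` and `np`, which is consistent with the outputs shown in this notebook but is not a documented guarantee. `evaluate_expression` is a hypothetical helper; within nb_sisso, `eq_list_to_num` is the function that computes feature values from `eq`.

```python
import numpy as np


def evaluate_expression(expr, features):
    """Hypothetical helper: evaluate a decrypted expression string.

    features maps each feature name to a 1-D array of length n_samples.
    Builtins are disabled so that only the feature names and np are visible.
    """
    namespace = {"np": np, **features}
    return eval(expr, {"__builtins__": {}}, namespace)


rng = np.random.default_rng(0)
features = {name: rng.random(5) for name in "abcdef"}  # stand-in data
values = evaluate_expression("((f-((f+f)**2))*e)", features)
print(values.shape)
```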

In [10]:
df_SIS_ans=pd.DataFrame()
str_eq=[decryption(eq[i]) for i in range(eq.shape[0])]
df_SIS_ans["score1"]=score[:,0]
df_SIS_ans["score2"]=score[:,1]
df_SIS_ans["str_eq"]=str_eq
df_SIS_ans.head()

Unnamed: 0,score1,score2,str_eq
0,0.077273,-0.995885,((f-((f+f)**2))*e)
1,0.077273,-0.999921,((f+((f+f)**2))*e)
2,0.077273,-1.0,(e/(((1+f)/f)**3))
3,0.077273,-1.0,(e*((f/(1+f))**3))
4,0.077273,-1.0,((((1+f)/f)**3)/e)


## SO

SO evaluates every combination of two of the `how_many_to_save` SIS results and keeps the highest-scoring combinations.   

Create the features for SO from the SIS results: pass the `x` used in SIS and the SIS return value `eq` to `eq_list_to_num`, which returns the feature values computed from each expression in `eq`.   
SO expects the features as a list, so wrap them in a list.   

In [11]:
from nb_sisso.utils import eq_list_to_num
X=eq_list_to_num(x,eq)
list_x=[X]

Choose the model_score for SO. Several are provided for binary classification; multi-class classification and regression are not supported at the moment. Here, the two-dimensional `Hull_2d` is used.   

In [12]:
from nb_sisso.model_score_2d import Hull_2d
model_score=Hull_2d

If you know Python, you can write your own fairly easily, as long as it takes only `x,y` as arguments and returns two float64 values.   
The two return values are called score1 and score2. Results are ranked by score1 first (higher is better); among results with the same score1, the one with the higher score2 is better.   

In [13]:
one_X=X[[0,1]].T
print(one_X.shape)
model_score(one_X,y)

(220, 2)


(0.0, -inf)
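
The ranking rule described above (score1 first, score2 as a tie-breaker) can be reproduced with `np.lexsort`. This is a sketch of the ordering only, using made-up score values; it is not taken from nb_sisso's implementation.

```python
import numpy as np

# Illustrative (score1, score2) pairs, not real nb_sisso output.
scores = np.array([
    [0.70, -1.0],
    [1.00, -1e-7],
    [1.00, -1e-10],
    [0.70, -0.5],
])

# np.lexsort uses the LAST key as the primary key, so score1 is primary
# and score2 breaks ties. The result is ascending; reverse it for best-first.
order = np.lexsort((scores[:, 1], scores[:, 0]))[::-1]
print(order)  # [2 1 3 0]
```

Rows 1 and 2 share the same score1, so row 2 ranks first because its score2 is higher.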

Run SO. See nb_SO.py for the detailed role of `which_arr_to_choose_from`.   
`combination_dim` is the number of features combined in each candidate, chosen from the `how_many_to_save` SIS results.   
There are several optional arguments; see nb_SO.py.   

In [14]:
from nb_sisso import SO
score_list,index_list=SO(list_x,y,model_score=model_score,which_arr_to_choose_from={1:0,2:0},combination_dim=2)

2024-11-06 17:29:04,008 SO [INFO] : numba=0.60.0, numpy=2.0.2
2024-11-06 17:29:04,009 SO [INFO] : OPT=_OptLevel(3), THREADING_LAYER=default
2024-11-06 17:29:04,009 SO [INFO] : USING_SVML=False, ENABLE_AVX=True, DISABLE_JIT=0
2024-11-06 17:29:04,009 SO [INFO] : SO
2024-11-06 17:29:04,009 SO [INFO] : num_threads=8, how_many_to_save=50, 
2024-11-06 17:29:04,009 SO [INFO] : combination_dim=2, model_score=Hull_2d, 
2024-11-06 17:29:04,009 SO [INFO] : which_arr_to_choose_from={1: 0, 2: 0}
2024-11-06 17:29:04,010 SO [INFO] : loop=12497500
2024-11-06 17:29:04,010 SO [INFO] : compiling
2024-11-06 17:29:10,716 SO [INFO] : END, compile
2024-11-06 17:29:20,726 SO [INFO] :     488024/12497500  0:00:10.009447 : 0:04:06.316192
2024-11-06 17:29:30,757 SO [INFO] :     951900/12497500  0:00:20.039879 : 0:04:03.063796
2024-11-06 17:29:40,775 SO [INFO] :    1389403/12497500  0:00:30.058563 : 0:04:00.314317
2024-11-06 17:29:50,792 SO [INFO] :    1812447/12497500  0:00:40.075138 : 0:03:56.257928
2024-11-06 
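
The `loop=12497500` value in the log above matches the number of ways to choose 2 features out of 5000. The sketch below checks that count and enumerates a small case. It assumes that `which_arr_to_choose_from={1:0,2:0}` means both features are drawn from the same array `list_x[0]`, which is an interpretation and not something stated in this notebook.

```python
import math
from itertools import combinations

n_candidates = 5000   # number of SIS results passed to SO
combination_dim = 2   # features combined per candidate

n_pairs = math.comb(n_candidates, combination_dim)
print(n_pairs)  # 12497500

# The same enumeration on a small case: unordered index pairs, no repeats.
small_pairs = list(combinations(range(4), combination_dim))
print(small_pairs)
```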

SO returns the scores and the indices. Each row of the index array holds the positions, within the features passed in `list_x`, of the features that were combined.

In [15]:
arr_columns=np.array([decryption(eq[i]) for i in range(eq.shape[0])])
df_ans=pd.DataFrame(columns=["score1","score2","data_x","data_y"])
df_ans["score1"]=score_list[:,0]
df_ans["score2"]=score_list[:,1]
df_ans["data_x"]=arr_columns[index_list[:,0]]
df_ans["data_y"]=arr_columns[index_list[:,1]]
df_ans

Unnamed: 0,score1,score2,data_x,data_y
0,1.0,-2.722169e-10,((f/((d/f)-f))**2),((d/((f*f)-d))**2)
1,1.0,-1.257615e-07,(d/(((d/f)-f)**2)),((d/((f*f)-d))**2)
2,0.736364,-4.344731e-16,((f/((d/f)-f))**2),(d/np.abs(((f*f)-d)))
3,0.736364,-4.344731e-16,((f/((d/f)-f))**2),np.abs((d/((f*f)-d)))
4,0.736364,-4.449938e-10,(d/(((d/f)-f)**2)),(d/np.abs(((f*f)-d)))
5,0.736364,-4.449938e-10,(d/(((d/f)-f)**2)),np.abs((d/((f*f)-d)))
6,0.722727,-200983500.0,np.abs(((d/(1-f))-1)),(((d/(1-f))**3)**3)
7,0.718182,-0.9998029,(((d/(1-f))-1)**2),(((d/(1-f))**3)**3)
8,0.713636,-1.147325,np.abs((((1-f)-f)*f)),((1/((1+f)+f))+f)
9,0.713636,-1.147325,(f*np.abs(((1-f)-f))),((1/((1+f)+f))+f)
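
To work with the selected pair further (for plotting or fitting a classifier, for example), the two feature columns can be pulled out of `X` by index, in the same way as `X[[0,1]].T` earlier. The sketch below uses stand-in arrays in place of the real `X` and `index_list`, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 220))                # stand-in for eq_list_to_num(x, eq)
index_list = np.array([[3, 7], [1, 4]])  # stand-in for the SO index output

best_pair = index_list[0]                # indices of the top-ranked pair
one_X = X[best_pair].T                   # shape (n_samples, 2)
print(one_X.shape)  # (220, 2)
```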
