
sklearn-compatible interface #147

Open
34j opened this issue Oct 24, 2023 · 15 comments

Comments

@34j

34j commented Oct 24, 2023

I think it would be great to have this feature, since sklearn is so widely used for tabular data. I tried skorch, but it does not accept TensorFrame inputs, so it did not work well.

(examples/tutorial.py)

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(module=model, max_epochs=args.epochs, lr=args.lr,
                          device=device, batch_size=args.batch_size,
                          classes=dataset.num_classes, iterator_train=DataLoader,
                          iterator_valid=DataLoader, train_split=None)
net.fit(train_dataset, y=None)
Traceback (most recent call last):
  File "\examples\tutorial.py", line 346, in <module>
    net.fit(train_dataset, y=None)
  File "\site-packages\skorch\classifier.py", line 165, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1319, in fit
    self.partial_fit(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1278, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "\site-packages\skorch\net.py", line 1190, in fit_loop
    self.run_single_epoch(iterator_train, training=True, prefix="train",
  File "\site-packages\skorch\net.py", line 1226, in run_single_epoch
    step = step_fn(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1105, in train_step
    self._step_optimizer(step_fn)
  File "\site-packages\skorch\net.py", line 1060, in _step_optimizer
    optimizer.step(step_fn)
  File "\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\torch\optim\sgd.py", line 66, in step
    loss = closure()
           ^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1094, in step_fn
    step = self.train_step_single(batch, **fit_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 993, in train_step_single
    y_pred = self.infer(Xi, **fit_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\net.py", line 1517, in infer
    x = to_tensor(x, device=self.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in to_tensor
    return [to_tensor_(x) for x in X]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 104, in <listcomp>
    return [to_tensor_(x) for x in X]
            ^^^^^^^^^^^^^
  File "\site-packages\skorch\utils.py", line 118, in to_tensor
    raise TypeError("Cannot convert this data type to a torch tensor.")
TypeError: Cannot convert this data type to a torch tensor.

I think the following changes are needed:

  • Add an ability to convert from DataFrame to TensorFrame without much prior information.
  • Create a wrapper that passes Tensor to skorch or create a scikit-learn compatible estimator specifically for this package.
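To make the second bullet concrete, here is a minimal pure-Python sketch of the estimator protocol such a scikit-learn-compatible wrapper would need to satisfy (`TensorFrameClassifier` and everything inside it are hypothetical; nothing here is from pytorch-frame or skorch):

```python
class TensorFrameClassifier:
    """Hypothetical sketch of an sklearn-style estimator wrapper."""

    def __init__(self, max_epochs=10, lr=1e-3):
        # sklearn convention: __init__ only stores hyperparameters as-is.
        self.max_epochs = max_epochs
        self.lr = lr

    def get_params(self, deep=True):
        # Required so GridSearchCV and clone() can introspect the estimator.
        return {"max_epochs": self.max_epochs, "lr": self.lr}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        # A real implementation would build a Dataset/TensorFrame from X
        # here and run the training loop; this sketch only records that
        # fit was called and returns self, as sklearn requires.
        self.fitted_ = True
        return self

    def predict(self, X):
        if not getattr(self, "fitted_", False):
            raise RuntimeError("call fit before predict")
        return [0 for _ in range(len(X))]  # placeholder predictions
```

The key constraints are that `__init__` does no work, `fit` returns `self`, and fitted state uses trailing-underscore attributes.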

I am sorry, but I cannot take much time to assist in creating this feature, so if it is not possible, please close this.

@yiweny
Contributor

yiweny commented Oct 25, 2023

You can convert a DataFrame to a TensorFrame easily with

dataset = Dataset(df, col_to_stype=col_to_stype, target_col="y")
dataset.tensor_frame

See tutorial.

@weihua916
Contributor

Thanks for your suggestion! I think this is great to add. Setting this as P2 feature, as we first want to prioritize more stype support #88.

@MacOS

MacOS commented Dec 18, 2023

Is someone already working on that?

@weihua916
Contributor

No, as far as I know. Let us know if you are interested!

@MacOS

MacOS commented Dec 22, 2023

Yes, I'm interested, so you can assign this to me. How soon should this task be completed?

@weihua916
Contributor

@MacOS Great, thank you! It'd be good to complete this feature by the end of January. Would that be possible?

@MacOS

MacOS commented Jan 1, 2024

@weihua916 As of now, yes.

@34j
Author

34j commented Mar 11, 2024

I have tried this, and it seems to be very difficult.
As a quick fix that isn't pretty, the following seems necessary:

  • Patch skorch.utils.to_tensor_ to bypass TensorFrame.
  • Add index = torch.tensor(index) to torch_frame.DataLoader.collate_fn so that it returns a TensorFrame instead of list[TensorFrame].

Next, we want to pass a validation dataset as well, but if we pass the two as a tuple, as in skorch.NeuralNet.fit((train_dataset.tensor_frame, val_dataset.tensor_frame), None), skorch raises a lot of errors. Therefore, I tried to split them inside skorch instead:

  • Pass col_to_stype as y, as in skorch.NeuralNet.fit(dataset.df, dataset.col_to_stype), utilizing the internal structure.
  • Remove self.check_data(X, y) in skorch.NeuralNet.fit_loop().
  • Modify TensorFrame to call self.materialize() in the constructor.
  • To avoid an error in torch_frame.Dataset.split(), set split_col like skorch.NeuralNet(... , dataset=lambda d, c: Dataset(d, c, split_col='split_col')).
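The patching idea in the first bullet can be illustrated generically: wrap a conversion function so that instances of a given type pass through untouched. This is a standalone sketch with made-up names (`SpecialType` stands in for TensorFrame, `to_number` for skorch's to_tensor); it is not the actual skorch patch:

```python
import functools


class SpecialType:
    """Stand-in for TensorFrame: a type the converter must not touch."""


def passthrough(convert, skip_type):
    # Wrap `convert` so instances of `skip_type` bypass conversion,
    # mirroring the suggested bypass in skorch's tensor conversion.
    @functools.wraps(convert)
    def wrapper(x, *args, **kwargs):
        if isinstance(x, skip_type):
            return x
        return convert(x, *args, **kwargs)
    return wrapper


def to_number(x):
    # Stand-in for the library conversion that chokes on SpecialType.
    return float(x)


to_number = passthrough(to_number, SpecialType)
```

The same wrapper shape would be applied to the real conversion function at import time, before skorch's training loop runs.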

@MacOS

MacOS commented Mar 11, 2024

🤔

Thank you for looking into this, @34j! I was about to start working on it.

Add an ability to convert from DataFrame to TensorFrame without much prior information.

I would simply have converted the DataFrame to a TensorFrame internally, worked with it, and, if requested, returned a DataFrame again. This means, of course, that one has to track what was given. Or am I missing something?
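The type-tracking idea can be sketched in a few lines of plain Python (all names here are hypothetical; a dict stands in for a DataFrame and a list for a TensorFrame):

```python
class FrameRoundTripper:
    """Hypothetical sketch: remember the caller's input representation
    so the same kind of object can be handed back after processing.
    A dict plays the role of DataFrame, a list the role of TensorFrame."""

    def to_internal(self, data):
        # Record what was given so to_external can mirror it.
        self._was_frame = isinstance(data, dict)
        if self._was_frame:
            self._columns = list(data.keys())
            return list(data.values())
        return data

    def to_external(self, internal):
        # Return the representation the caller originally supplied.
        if self._was_frame:
            return dict(zip(self._columns, internal))
        return internal
```

The real version would convert DataFrame to TensorFrame on the way in and back on the way out, but the bookkeeping is the same.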

Create a wrapper that passes Tensor to skorch or create a scikit-learn compatible estimator specifically for this package.

This seems to be very big and unrealistic because we would have to make all estimators compatible with scikit-learn, which is a lot to ask for. At the moment, scikit-learn is an optional dependency.

May I ask you, @34j, to post a self-contained example (or examples) of what would qualify pytorch-frame as sklearn-compatible?

PS: I will submit a PR today, though maybe only as a draft.

@34j
Author

34j commented Mar 11, 2024

Add an ability to convert from DataFrame to TensorFrame without much prior information.

This was an implicit request for the recently implemented infer_df_stype, so thankfully it has already been resolved.

Create a wrapper that passes Tensor to skorch

I feel like this could probably be done. I'll send a draft PR in an hour, and I want to ask @MacOS to take it over and do the documentation, testing, and tutorial work.

dirty prototype code

examples/tutorial.py:

import torch
import torch.nn as nn
from typing import List

from torch import Tensor
from skorch import NeuralNetClassifier
from skorch.dataset import Dataset as SkorchDataset
from torch_frame import TensorFrame
from torch_frame.data import DataLoader
from torch_frame.data.dataset import Dataset

# `model`, `args`, `device`, and `dataset` come from the surrounding
# tutorial script.


def create_dataset(df, _) -> Dataset:
    # Rebuild a materialized Dataset from the raw DataFrame.
    dataset_ = Dataset(
        df, dataset.col_to_stype, split_col="split_col", target_col="target_col"
    )
    dataset_.materialize()
    return dataset_


def split_dataset(dataset: Dataset) -> tuple[TensorFrame, TensorFrame]:
    # Hand skorch's train_split hook the train/val tensor frames.
    datasets = dataset.split()[:2]
    return datasets[0].tensor_frame, datasets[1].tensor_frame


class DataLoader2(DataLoader):
    def collate_fn(
        self, index: int | List[int] | range | slice | Tensor
    ) -> tuple[TensorFrame, Tensor | None]:
        # skorch passes plain index lists; coerce them to a Tensor so
        # collate_fn returns a TensorFrame instead of list[TensorFrame].
        index = torch.tensor(index)
        res = super().collate_fn(index).to(device)
        return res, res.y


def get_iterator(dataset: SkorchDataset, **kwargs) -> DataLoader:
    return DataLoader2(dataset, **kwargs)


net = NeuralNetClassifier(
    module=model,
    max_epochs=args.epochs,
    lr=args.lr,
    device=device,
    batch_size=6,
    iterator_train=get_iterator,
    dataset=create_dataset,
    iterator_valid=get_iterator,
    train_split=split_dataset,
    classes=dataset.df["target_col"].unique(),
    verbose=1,
    criterion=nn.CrossEntropyLoss,
)
net.fit(dataset.df, None)
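The index coercion in the collate_fn override above matters because skorch's batch iteration can hand the loader an int, list, range, or slice, while the torch_frame side wants a single tensor of positions. A pure-Python sketch of that normalization (lists standing in for tensors, `normalize_index` being a made-up helper name):

```python
def normalize_index(index, length):
    # Coerce the index forms a batch iterator may pass (int, list,
    # range, slice) into one flat list of row positions, mirroring the
    # torch.tensor(index) coercion in the collate_fn override.
    if isinstance(index, int):
        return [index]
    if isinstance(index, slice):
        # slice.indices clamps start/stop/step to the dataset length.
        return list(range(*index.indices(length)))
    return list(index)  # already a list or range of positions
```

With every index form reduced to one shape, the downstream batching code only has to handle a single case.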

@MacOS

MacOS commented Mar 11, 2024

@34j That's fine with me!

So we drop the second part of your request then, correct?

@MacOS

MacOS commented Mar 13, 2024

Heads up everyone, I have started working on it. I already merged the PR draft of @34j into my fork.

It would be nice if you were available in case I have questions. :)

@34j
Author

34j commented Mar 14, 2024

Heads up everyone, I started working on it. I already merge the PR draft of @34j into my fork.

Would be nice if you guys would be available in case I have questions. :)

May I ask what your question is? Never mind, sorry for my poor English comprehension.

@MacOS

MacOS commented Mar 19, 2024

So far none. I meant just in case.

Sorry for the delay, but I had personal matters to deal with. I'm confident that I can submit a PR this month.

@MacOS

MacOS commented Apr 6, 2024

Hi all,

A short update: unfortunately, I got sick, hence another delay. Should I still work on it?
