Data loading issues while train #16

Closed
Practcdi opened this issue Oct 4, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@Practcdi

Practcdi commented Oct 4, 2021

Hey,

Note: I have a pandas DataFrame containing 2 columns:

  1. Text
  2. Label

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data ,
                                                    test_size = 0.2, 
                                                    shuffle=False)

The train() and fit() methods are not working.

Here is the reference code:

image

How to fix it?

Thanks

@angrymeir
Collaborator

angrymeir commented Oct 4, 2021

Hey @Practcdi,

TL;DR

This happens because fit/train requires a list of strings instead of a DataFrame. (See function documentation here)

Fix: pass x_train.values.tolist(), y_train to clf.train()

A bit more insight into why it does not work:

Looking at the relevant lines of code (here):

x_train, y_train = list(x_train), list(y_train)

if len(x_train) != len(y_train):
    raise ValueError("`x_train` and `y_train` must have the same length")

If you pass a DataFrame of shape (535544, 1) as x_train, converting it to a list returns only the column names.
Thus the check ends up comparing the following:

if 1 != 535544:
    raise ValueError("`x_train` and `y_train` must have the same length")
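
To illustrate the above, here is a minimal sketch using a small made-up DataFrame (the column names `Text` and `Label` come from the original post; the sample rows and variable names are invented). It shows that `list()` on a DataFrame yields the column labels, while selecting the column first gives a flat list of strings. Note that on a one-column DataFrame `.values.tolist()` returns a list of one-element lists, so selecting the column may be the safer choice, depending on how `x_data` was built.

```python
import pandas as pd

# Toy stand-in for the real data; the rows are invented for illustration.
df = pd.DataFrame({
    "Text": ["good movie", "bad movie", "fine movie"],
    "Label": ["pos", "neg", "pos"],
})

x_frame = df[["Text"]]   # one-column DataFrame, shape (3, 1)
y_labels = df["Label"]   # Series, shape (3,)

# Iterating a DataFrame yields its column labels, not its rows.
as_list = list(x_frame)

# .values.tolist() on a one-column DataFrame gives nested one-element lists.
nested = x_frame.values.tolist()

# Selecting the column first gives a flat list of strings.
flat = x_frame["Text"].tolist()

# A Series iterates over its values, so list() behaves as expected.
labels = list(y_labels)

print(len(as_list), len(nested), len(flat), len(labels))
```

With `flat` and `labels` the two lengths match, so the length check quoted above would pass.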

@Practcdi Practcdi closed this as completed Oct 4, 2021
@Practcdi
Author

Practcdi commented Oct 4, 2021

Thanks a lot 😊

@sergioburdisso sergioburdisso added the enhancement New feature or request label Oct 5, 2021
@sergioburdisso
Owner

@Practcdi Thanks for sharing this issue with us!

@angrymeir Thanks for taking care of it 💪. By the way, what do you think of adding an extra check at the beginning of fit/train that raises a ValueError saying something like "the x_train argument is expected to be a list of strings" when the provided x_train isn't a list of strings? 🤔

@angrymeir
Collaborator

@sergioburdisso Hm unsure about that one because...

  1. Where to start and where to end? Is it only fit/train that needs this kind of validation or also other methods (potentially all methods with user input because of consistency)?
  2. I think it's difficult to detect whether an x_train can be cast to a list of strings without information loss. E.g. while a pandas.DataFrame can't be cast, a pandas.Series can be cast without issues, so it should stay a valid option?
  3. It's well documented, stating exactly what the function expects.
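
For the sake of discussion, one possible shape of such a check is sketched below. `ensure_list_of_strings` is a hypothetical helper, not part of the library. It assumes that anything exposing an `ndim` attribute other than 1 (a DataFrame, a 2-D array) should be rejected, while 1-D iterables such as a list, tuple, or pandas.Series stay valid. Checking only that every element is a string would not be enough, since `list()` on a DataFrame returns its column names, which are usually strings themselves.

```python
def ensure_list_of_strings(x, name="x_train"):
    """Hypothetical helper: return `x` as a list of str, or raise ValueError."""
    # A bare string is iterable, but list("abc") would split it into characters.
    if isinstance(x, str):
        raise ValueError(
            f"the {name} argument is expected to be a list of strings, "
            "got a single string"
        )
    # DataFrames and 2-D arrays expose `ndim`; objects without it count as 1-D.
    if getattr(x, "ndim", 1) != 1:
        raise ValueError(
            f"the {name} argument is expected to be a list of strings, "
            f"got a {type(x).__name__}"
        )
    items = list(x)
    # Reject any element that is not a string (numbers, None, nested lists).
    if not all(isinstance(item, str) for item in items):
        raise ValueError(
            f"the {name} argument is expected to be a list of strings"
        )
    return items


# Usage: 1-D iterables of strings pass through as plain lists.
print(ensure_list_of_strings(("first document", "second document")))
```

Whether a check like this belongs only in fit/train or in every method that takes user input is the open question from point 1.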


3 participants