
NSections Issue with Train Dataset #87

Closed
AlexanderLavelle opened this issue Oct 10, 2021 · 10 comments

@AlexanderLavelle

AlexanderLavelle commented Oct 10, 2021

When I pass my training df into .fit_predict(), I receive the initialization log:

- time: 72000 seconds
- cpus: 4 cores
- memory: 16 gb

Train data shape: (17290, 17455)

but then the process fails with error:

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\shape_base.py in array_split(ary, indices_or_sections, axis)
    771         # handle array case.
--> 772         Nsections = len(indices_or_sections) + 1
    773         div_points = [0] + list(indices_or_sections) + [Ntotal]

TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-134-bcf87a9972bc> in <module>
      5                        #general_params={'use_algos': [['lgb', 'cb', 'LinearLBFGS', 'linear_l1', 'xgb']]}
      6                       )
----> 7 oof_pred = automl.fit_predict(newTrainDum, roles=roles)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\tabular_presets.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    411             data, _ = read_data(valid_data, valid_features, self.cpu_limit, self.read_csv_params)
    412 
--> 413         oof_pred = super().fit_predict(train, roles=roles, cv_iter=cv_iter, valid_data=valid_data)
    414 
    415         return cast(NumpyDataset, oof_pred)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\presets\base.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    171         logger.info('- memory: {} gb\n'.format(self.memory_limit))
    172         self.timer.start()
--> 173         result = super().fit_predict(train_data, roles, train_features, cv_iter, valid_data, valid_features)
    174         logger.info('\nAutoml preset training completed in {:.2f} seconds.'.format(self.timer.time_spent))
    175 

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\automl\base.py in fit_predict(self, train_data, roles, train_features, cv_iter, valid_data, valid_features)
    155         """
    156         self.timer.start()
--> 157         train_dataset = self.reader.fit_read(train_data, train_features, roles)
    158 
    159         assert len(self._levels) <= 1 or train_dataset.folds is not None, \

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\base.py in fit_read(self, train_data, features_names, roles, **kwargs)
    323         dataset = PandasDataset(train_data[self.used_features], self.roles, task=self.task, **kwargs)
    324         if self.advanced_roles:
--> 325             new_roles = self.advanced_roles_guess(dataset, manual_roles=parsed_roles)
    326             droplist = [x for x in new_roles if new_roles[x].name == 'Drop' and not self._roles[x].force_input]
    327             self.upd_used_features(remove=droplist)

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\base.py in advanced_roles_guess(self, dataset, manual_roles)
    492         # guess roles nor numerics
    493 
--> 494         stat = get_numeric_roles_stat(dataset, manual_roles=manual_roles,
    495                                       random_state=self.random_state,
    496                                       subsample=self.samples,

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\guess_roles.py in get_numeric_roles_stat(train, subsample, random_state, manual_roles, n_jobs)
    263 
    264     # check scores as is
--> 265     res['raw_scores'] = get_score_from_pipe(train, target, empty_slice=empty_slice, n_jobs=n_jobs)
    266 
    267     # check unique values

C:\ProgramData\Anaconda3\lib\site-packages\lightautoml\reader\guess_roles.py in get_score_from_pipe(train, target, pipe, empty_slice, n_jobs)
    192         return _get_score_from_pipe(train, target, pipe, empty_slice)
    193 
--> 194     idx = np.array_split(np.arange(shape[1]), n_jobs)
    195     idx = [x for x in idx if len(x) > 0]
    196     n_jobs = len(idx)

<__array_function__ internals> in array_split(*args, **kwargs)

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\shape_base.py in array_split(ary, indices_or_sections, axis)
    776         Nsections = int(indices_or_sections)
    777         if Nsections <= 0:
--> 778             raise ValueError('number sections must be larger than 0.')
    779         Neach_section, extras = divmod(Ntotal, Nsections)
    780         section_sizes = ([0] +

ValueError: number sections must be larger than 0.

Any help adjusting my df would be greatly appreciated. I did try converting it to an array, but then the target param can't be found (which makes sense).

Thanks in advance!
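For context, a minimal sketch of the NumPy behaviour shown in the traceback; the 0 sections value below is illustrative and stands in for whatever n_jobs value ends up reaching array_split:

import numpy as np

# A plain int has no len(), so array_split falls through to the integer branch,
# and any value <= 0 raises "number sections must be larger than 0."
try:
    np.array_split(np.arange(5), 0)
except ValueError as err:
    print(err)  # number sections must be larger than 0.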

@alexmryzhkov
Contributor

Hi @AlexanderLavelle,

Could you please share the code you have used to receive the error?

As I can see, you have 17k+ features - are they real, or did you one-hot-encode some variables?

Alex

@AlexanderLavelle
Author

automl = TabularAutoML(task=task, 
                       timeout=TIMEOUT,
                       #cpu_limit=N_THREADS,
                       reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
                       #general_params={'use_algos': [['lgb', 'cb', 'LinearLBFGS', 'linear_l1', 'xgb']]}
                      )
oof_pred = automl.fit_predict(newTrainDum, roles=roles)

That is the line just above the error. The 17k features come from 20 original features; the one-hot-encoded feats are: ['id', 'yr_built', 'yr_renovated', 'zipcode']

The dataset is the toy King County housing dataset

@alexmryzhkov
Contributor

@AlexanderLavelle,

Now everything is clear. Here are a few points I can see from your code:

  • if you do not set the cpu_limit param, we use the default, which is 4 vCPU cores (as you can see at the beginning of the log)
  • as for use_algos - for now we have only 5 variants: 'linear_l2', 'lgb', 'lgb_tuned', 'cb', 'cb_tuned', but you can still combine them on different stacking levels if necessary
  • and the main part - to use our TabularAutoML preset you do not need to do any preprocessing: we can work with categorical features in their raw form, we handle unfilled NaNs in the dataset, etc. You can just use the raw dataset to train the model and get the result (see the sketch below).
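A minimal sketch of that raw-data usage, assuming a regression task with the 'logPrice' target mentioned later in this thread; the DataFrame names and parameter values are illustrative, not from the original code:

from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Train on the raw DataFrame directly: no one-hot encoding or NaN filling needed.
task = Task('reg')
automl = TabularAutoML(
    task=task,
    timeout=3600,  # seconds
    cpu_limit=4,
    reader_params={'n_jobs': 4, 'cv': 5, 'random_state': 42},
    # only the supported names: 'linear_l2', 'lgb', 'lgb_tuned', 'cb', 'cb_tuned'
    general_params={'use_algos': [['linear_l2', 'lgb'], ['cb_tuned']]},
)

# Out-of-fold predictions on the raw training data, then predictions on a hold-out set.
oof_pred = automl.fit_predict(train_df, roles={'target': 'logPrice'})
test_pred = automl.predict(test_df)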

Hope this helps.

Alex

@AlexanderLavelle
Author

@alexmryzhkov - I uncommented cpu_limit to properly utilize 12 threads. I backtracked and used the "wholeDf" with no OHE -- but this has resulted in the same issue. I have set

roles = {
    #'drop': 'id2', #done when I thought it needed a drop column
    #'group': 'breath_id',  #from the kaggle root of formatting
    #'category': autoMLcat,  #just commented out to test, not working either way 
    'target': 'logPrice',
}

but I still end up with the same error.

For use_algos, that line is commented out -- I will say that it's not quite clear from the documentation how to specify the various algorithms -- for instance, I used "LinearLBFGS" based on the documentation rather than the example on Kaggle.

In terms of preprocessing / category: I fed the categorical feats with the no-dummy df (wholeDf), and I fed [dummy cats + orig cats] to roles for "newTrainDum", but no matter what I receive the same error.

Perhaps I am just giving it too fine a tune for a beginner? Should I just try to run it in a naive style?

@alexmryzhkov
Contributor

Hi @AlexanderLavelle,

Please check my notebook on the King County dataset - if it works for you, cool. If you have any questions about it, please feel free to ask.

Alex

@AlexanderLavelle
Author

@alexmryzhkov I rewrote my notebook to better follow the flow of initializing the CV. It worked, which is great! I think the root of the problem may have been trying to set torch.device to 'cuda'?

Either way, thank you for your notebook and confirmation on the dataset!

@alexmryzhkov
Contributor

@AlexanderLavelle if you set torch.device to cuda, do you want to train models on GPU? If so, there is no need to do that - if your environment has a properly installed GPU and torch, our LightAutoML will automatically train CatBoost models on GPU (for the other models there would be almost no improvement, especially in Kaggle Kernels).

Alex
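A quick way to confirm whether torch actually sees a GPU in the environment, which is what the automatic CatBoost-on-GPU behaviour described above relies on; this check is an added suggestion, not part of the original reply:

import torch

# If this prints True, no manual torch.device('cuda') call should be needed:
# LightAutoML can pick up the GPU for CatBoost on its own.
print(torch.cuda.is_available())
print(torch.cuda.device_count())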

@AlexanderLavelle
Author

@alexmryzhkov I would like to train on GPU, top to bottom. I have sklearn-intelex and GPU versions of lightgbm on my local machine -- so in theory, any dataset within my 4GB nvidia card (planning to upgrade soon), I would like to have the pipeline do every calculation on GPU for speed. As far as gpu/enhanced sklearn (intelex), I have received notices that lightautoml will use the intelex augmented 'auc'.

@alexmryzhkov
Contributor

Hi @AlexanderLavelle,

Currently we do not have the full GPU pipeline, but we are working on it ☺️
The only parts that can work on the GPU for now are the models.

Alex

