How to train titanic_model #52

Open
szz01 opened this issue Jun 26, 2019 · 6 comments

Comments

@szz01

szz01 commented Jun 26, 2019

When I run it with my own data set, I get the following error:
AttributeError Traceback (most recent call last)
in
4 feature='sex',
5 feature_name='Gender',
----> 6 predict_kwds={}
7 )

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pdpbox/info_plots.py in actual_plot(model, X, feature, feature_name, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, show_percentile, show_outliers, endpoint, which_classes, predict_kwds, ncols, figsize, plot_params)
289 # make predictions
290 # info_df only contains feature value and actual predictions
--> 291 prediction = predict(X, **predict_kwds)
292 info_df = X[_make_list(feature)]
293 actual_prediction_columns = ['actual_prediction']

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
1282
1283 if validate_features:
-> 1284 self._validate_features(data)
1285
1286 length = c_bst_ulong()

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in _validate_features(self, data)
1669 """
1670 if self.feature_names is None:
-> 1671 self.feature_names = data.feature_names
1672 self.feature_types = data.feature_types
1673 else:

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'feature_names'

So I would like to know how to train the titanic_model used in the example.
Thanks for your advice.

@dyerrington

Looks like you're referencing an attribute that doesn't exist in your dataframe @szz01. Why don't you post your full code example?

@ivan-marroquin

Hi @dyerrington

I have the same issue with PDPbox version 0.2.0. I am using Python 3.6.5 on a Windows machine.

The classifier was created with XGBClassifier from xgboost 0.90, and I fitted it using Python arrays (the same data set is included in the attached zip file).

The attached zip file contains a Python script and the input data needed to reproduce the issue.

Many thanks,
Ivan

testing_pdpbox.zip

@ivan-marroquin

Hi there,

I was wondering if someone had the opportunity to look into this issue.

Many thanks,

Ivan

@SauceCat
Owner

@ivan-marroquin can you put your error messages here?

@ivan-marroquin

Hi @SauceCat

As per your request:

pdpbox_interaction= pdp.pdp_interact(model= best_trained_model, dataset= pd_test_inputs, model_features= feature_names, features= features_to_plot)

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 558, in pdp_interact
n_jobs=n_jobs, predict_kwds=predict_kwds, data_transformer=data_transformer)

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 159, in pdp_isolate
for feature_grid in feature_grids)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 921, in __call__
if self.dispatch_one_batch(iterator):

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 549, in __init__
self.results = batch()

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp_calc_utils.py", line 44, in _calc_ice_lines
preds = predict(_data[model_features], **predict_kwds)

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1284, in predict
self._validate_features(data)

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
if self.feature_names != data.feature_names:

File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'feature_names'

Many thanks,
Ivan

@dyerrington

To me, @ivan-marroquin, the error is descriptive:

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
if self.feature_names != data.feature_names:

File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'feature_names'

The part of the code from xgboost that throws this error is this:

Line ~1675 of xgboost/core.py

    def _validate_features(self, data):
        """
        Validate Booster and data's feature_names are identical.
        Set feature_names and feature_types from DMatrix
        """
        if self.feature_names is None:
            self.feature_names = data.feature_names
            self.feature_types = data.feature_types
        else:
            # Booster can't accept data with different feature names
            if self.feature_names != data.feature_names:
                dat_missing = set(self.feature_names) - set(data.feature_names)
                my_missing = set(data.feature_names) - set(self.feature_names)

                msg = 'feature_names mismatch: {0} {1}'

                if dat_missing:
                    msg += ('\nexpected ' + ', '.join(str(s) for s in dat_missing) +
                            ' in input data')

                if my_missing:
                    msg += ('\ntraining data did not have the following fields: ' +
                            ', '.join(str(s) for s in my_missing))

                raise ValueError(msg.format(self.feature_names,
                                            data.feature_names))
    

As far as I can tell, xgboost is checking that the feature names the model was trained with match those of the data being passed in. The object passed as data here is a pandas DataFrame, which has no .feature_names attribute, so the attribute lookup fails and the DataFrame raises the final error.
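To make that mechanism concrete, here is a minimal, library-free sketch of the same kind of check. `FakeBooster`, `FakeDMatrix`, and `FrameLike` are hypothetical stand-ins, not xgboost or pandas classes. The only thing carried over from the traceback is that the check reads `data.feature_names`, which the quoted docstring attributes to a DMatrix and which a DataFrame does not have.

```python
class FakeDMatrix:
    """Hypothetical stand-in for a DMatrix: it carries feature_names."""

    def __init__(self, feature_names):
        self.feature_names = list(feature_names)


class FrameLike:
    """Hypothetical stand-in for a DataFrame: columns, but no feature_names."""

    def __init__(self, columns):
        self.columns = list(columns)


class FakeBooster:
    """Hypothetical stand-in that mirrors the shape of the quoted check."""

    def __init__(self, feature_names=None):
        self.feature_names = feature_names

    def validate_features(self, data):
        if self.feature_names is None:
            self.feature_names = data.feature_names
        elif self.feature_names != data.feature_names:
            raise ValueError('feature_names mismatch: {0} {1}'.format(
                self.feature_names, data.feature_names))


def try_validate(booster, data):
    """Return 'ok', or the name of the exception the check raised."""
    try:
        booster.validate_features(data)
    except (AttributeError, ValueError) as exc:
        return type(exc).__name__
    return 'ok'


cols = ['sex', 'age']
booster = FakeBooster(feature_names=cols)
results = {
    'dmatrix_same_names': try_validate(booster, FakeDMatrix(cols)),
    'dmatrix_other_names': try_validate(booster, FakeDMatrix(['sex', 'fare'])),
    'frame_like': try_validate(booster, FrameLike(cols)),
}
print(results)
```

In this sketch, differing names produce a `ValueError`, and only the object without `feature_names` produces an `AttributeError`. That suggests the failure in the traceback depends on the type of object reaching `predict` rather than on the names themselves, although this thread does not confirm the root cause.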

The first thing I would check is that the model you trained matches the data you are trying to plot. Double-check everything, including the encoding of the feature names, and assert that they match exactly before doing anything with PDP; then fix any problems. If it still fails, reduce the problem and re-evaluate: build a model with fewer features and a very small number of observations so that it trains in seconds or milliseconds, then try to get it working in the same file or notebook, without any encoding, decoding, or serialization of the model.
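The name check suggested above could be sketched as a small helper. `compare_feature_names` and `assert_feature_names_match` are hypothetical functions, not part of PDPbox or xgboost. They assume you can get the model's training feature names as a list (for instance from the `feature_names` attribute seen in the traceback) and the plotting data's column names as a list (for instance `list(df.columns)`).

```python
def compare_feature_names(model_names, data_names):
    """Report how two ordered lists of feature names differ."""
    model_names = list(model_names)
    data_names = list(data_names)
    model_set, data_set = set(model_names), set(data_names)
    return {
        # Names the model was trained with that the data lacks.
        'missing_from_data': sorted(model_set - data_set, key=str),
        # Names in the data that the model never saw.
        'unexpected_in_data': sorted(data_set - model_set, key=str),
        # True only if names and their order are identical.
        'same_order': model_names == data_names,
    }


def assert_feature_names_match(model_names, data_names):
    """Raise ValueError with a readable report if the names differ."""
    report = compare_feature_names(model_names, data_names)
    if not report['same_order']:
        raise ValueError('feature names differ: {0}'.format(report))


# Illustrative names only: a capitalisation difference is enough to fail.
report = compare_feature_names(['sex', 'age', 'fare'], ['Sex', 'age', 'fare'])
print(report)
```

Running a check like this before calling any PDPbox function separates name or encoding problems from problems with the type of object being passed to the model.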

@SauceCat SauceCat reopened this Mar 13, 2021
@SauceCat SauceCat added to-do and removed to-do labels Mar 13, 2021