
In the ordinal encoder go ahead and update the existing column instea… #126

Merged

Conversation

JohnnyC08
Contributor

@JohnnyC08 JohnnyC08 commented Sep 23, 2018

To fix #100

The issue was arising because _tmp columns were being appended to the end of the data frame as part of the transform process.

First, we noticed that the transform process appended a temporary column, dropped the existing column, and then renamed the temporary column to the original column name.

So, we reduced that to a single step: we update the existing column in place using our mapping, which preserves the column order. I wasn't sure why the transform method mentioned above had that many steps; a single update keeps the tests passing.
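The single-step update described here can be sketched roughly as follows; the DataFrame and mapping are made up for illustration and are not the encoder's actual internals:

```python
import pandas as pd

# A toy frame standing in for the input to transform()
df = pd.DataFrame({'Trend': ['UP', 'DOWN', 'FLAT'], 'target': [1, 0, 1]})
mapping = {'UP': 1, 'DOWN': 2, 'FLAT': 3}

# Old approach (simplified): append 'Trend_tmp', drop 'Trend', rename back.
# The append pushes the column to the end, so the column order changes.
# New approach: one in-place update, so the order is never disturbed.
df['Trend'] = df['Trend'].map(mapping)

print(list(df.columns))  # column order is preserved: ['Trend', 'target']
```

The same `Series.map` call handles the value substitution that the append/drop/rename dance was doing in three steps.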

@janmotl I also noticed that in Travis the python3 step seems to be running Python 2.7 instead of Python 3. From install.sh I see mentions of a conda create, and in the CI logs I see a virtualenv being set up that I don't see mentioned anywhere in the project. Perhaps the Travis cache needs to be cleared?

…d of adding a new column, deleting the old one, and renaming the new column to the new column
@janmotl
Collaborator

janmotl commented Sep 23, 2018

The _tmp suffix is also used by LOO, TargetEncoder and WOE.

The TravisCI issue seems to be there from the beginning:
https://travis-ci.org/scikit-learn-contrib/categorical-encoding/jobs/148114560. I do not know how to approach it.

@janmotl
Collaborator

janmotl commented Sep 23, 2018

Regarding TravisCI: it starts with Python 2.7 but then follows with an update to Python 3.5:

Successfully installed chardet-3.0.4 coverage-4.5.1 coveralls-1.5.0 docopt-0.6.2 idna-2.7 requests-2.19.1 urllib3-1.23
Python 3.5.6 :: Anaconda, Inc.

on line 675.

But I do not know why python --version gets executed before the update.

@JohnnyC08
Contributor Author

Gotcha. I would have to add multiple debug echo statements to the CI scripts to even start to get to the bottom of it.

I verified the problem with the TargetEncoder. I'll add tests and take care of it and the WOE as well.

@janmotl
Collaborator

janmotl commented Sep 23, 2018

Btw., the test can be parametrized and executed on all encoders.

…mn order by having the final step be update the existing column and dropping the temporary column instead of the rename and drop strategy
@JohnnyC08
Contributor Author

@janmotl Went ahead and took care of the WOE and TargetEncoder. Tell me what you think.

@janmotl
Collaborator

janmotl commented Sep 24, 2018

@JohnnyC08 Nice - I suspected it was not going to be possible to avoid generating temporary columns.

Once LeaveOneOut is fixed, it will be possible to use a simplified test:

    def test_preserve_column_order(self):
        binary_cat_example = pd.DataFrame(
            {'Trend': ['UP', 'UP', 'DOWN', 'FLAT', 'DOWN', 'UP', 'DOWN', 'FLAT', 'FLAT', 'FLAT'],
             'target': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]}, columns=['Trend', 'target'])

        for encoder_name in encoders.__all__:
            with self.subTest(encoder_name=encoder_name):
                encoder = getattr(encoders, encoder_name)()
                result = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])
                columns = result.columns.values

                self.assertEqual('target', columns[-1], "Target must be the last column, as in the input")

@JohnnyC08
Contributor Author

@janmotl Taken care of. I opted to use your test. For some of the encoders I could eliminate the temporary columns, but I didn't want to risk anything until the data-driven testing gets in, so we can refactor with more impunity.

In the leave-one-out encoder I extracted some of the column names into variables to make the code easier to read.

Tell me what you think.

@janmotl
Collaborator

janmotl commented Sep 25, 2018

I am OK with that. Just note that if the dataset contains columns named Trend and Trend_tmp, it will result in errors in LOO, WOE and TargetEncoder:

    def test_tmp_column_name(self):
        binary_cat_example = pd.DataFrame(
            {'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
             'Trend_tmp': ['UP', 'UP', 'DOWN', 'FLAT'],
             'target': [1, 1, 0, 0]}, columns=['Trend', 'Trend_tmp', 'target'])

        for encoder_name in ['LeaveOneOutEncoder', 'TargetEncoder', 'WOEEncoder']:
            with self.subTest(encoder_name=encoder_name):
                encoder = getattr(encoders, encoder_name)()
                _ = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])

But that is not anything new. A possible workaround is to store the temporary data in a temporary Series instead of in the input DataFrame. As a bonus, it could simplify the code by removing the string concatenation used to create the temporary column name.
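The Series-based workaround might look roughly like this; the per-category statistic and column names are illustrative assumptions, not the encoders' actual internals:

```python
import pandas as pd

# Input that would break the old '_tmp' scheme: it already has 'Trend_tmp'
X = pd.DataFrame({'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
                  'Trend_tmp': ['UP', 'UP', 'DOWN', 'FLAT']},
                 columns=['Trend', 'Trend_tmp'])
y = pd.Series([1, 1, 0, 0])

# The temporary data lives in its own standalone Series keyed by category,
# so no 'Trend_tmp' column is ever written into X and nothing can collide.
category_means = y.groupby(X['Trend']).mean()
X['Trend'] = X['Trend'].map(category_means)

print(list(X.columns))  # unchanged: ['Trend', 'Trend_tmp']
```

Because the intermediate values never touch the input frame, there is also no need to build a temporary column name by string concatenation.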

@JohnnyC08
Contributor Author

You read my mind - I was already thinking of using a standalone Series rather than modifying the actual data frame. I'll make the corresponding changes.

@JohnnyC08
Contributor Author

@janmotl I went ahead and modified the LOO, WOE, TargetEncoder, and OrdinalEncoder to use a Series. In the process I found the fit components of the OrdinalEncoder and LOO were doing more than they had to, so I removed some processing there. The tests are passing, which makes me think I didn't break anything, but please be sure to review the latest commits since my last message and tell me what you think. I've broken the commits up into one for each encoder and two for the ordinal encoder.

Overall, using a Series has helped the readability of the code, which is great. As always, tell me what you think. Thanks, buddy.

@JohnnyC08
Contributor Author

After mentioning removing the DataFrame-modification code from the fit methods, I went ahead and double-checked the TargetEncoder and saw I could remove some from there too.

@janmotl
Collaborator

janmotl commented Sep 26, 2018

@JohnnyC08 Yes! That unnecessary transformation of X during fitting and the unnecessary passing of arguments in LOO was irritating me as well.

Just note that we now unnecessarily calculate smoothing in TargetEncoder when tmp[val]['count'] == 1. But I don't know whether the runtime would actually be better after restoring the condition.

Otherwise, it looks good!

@janmotl
Collaborator

janmotl commented Sep 26, 2018

On second thought, let's keep the calculation of smoothing there for all cases, to make the trained model easier to understand and the code easier to modify.
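For context, the smoothing under discussion is, roughly, a sigmoid blend between the global prior and the per-category mean. The formula and parameter names below are assumptions based on the usual target-encoding scheme, not a quote of the library's code:

```python
import numpy as np

def smoothed_mean(count, cat_mean, prior, min_samples_leaf=1, smoothing=1.0):
    """Blend the prior and the category mean; the weight rises with count."""
    smoove = 1.0 / (1.0 + np.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1.0 - smoove) + cat_mean * smoove

# With count == 1 (and min_samples_leaf == 1) the sigmoid is exactly 0.5,
# so the result is the midpoint of prior and category mean - a cheap
# computation, which is why always running it costs little.
print(smoothed_mean(count=1, cat_mean=1.0, prior=0.5))   # 0.75
print(smoothed_mean(count=1000, cat_mean=1.0, prior=0.5))  # close to 1.0
```

Computing it uniformly for every count, as agreed above, keeps the trained mapping self-consistent at the price of a trivial amount of extra arithmetic.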

@janmotl janmotl merged commit c209ae1 into scikit-learn-contrib:master Sep 26, 2018
@JohnnyC08 JohnnyC08 deleted the issue-106-ordinal-column-order branch September 26, 2018 14:56
Successfully merging this pull request may close these issues.

OrdinalEncoder changes column order