
In the ordinal encoder go ahead and update the existing column instea… #126

Merged

Conversation

JohnnyC08
Contributor

@JohnnyC08 JohnnyC08 commented Sep 23, 2018

To fix #100

The issue was arising because _tmp columns were being appended to the end of the data frame as part of the transform process.

First, we noticed that the transform process appended a temporary column, dropped the existing column, and then renamed the temporary column to the original column name.

So, we reduced that to a single step: we update the existing column in place using our mapping, which preserves the column order. I wasn't sure why the transform method mentioned above had that many steps; a single update keeps the tests passing.
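The single-step update described here can be sketched roughly as follows; the DataFrame and mapping are made up for illustration and are not the encoder's actual internals:

```python
import pandas as pd

# A toy frame standing in for the input to transform()
df = pd.DataFrame({'Trend': ['UP', 'DOWN', 'FLAT'], 'target': [1, 0, 1]})
mapping = {'UP': 1, 'DOWN': 2, 'FLAT': 3}

# Old approach (simplified): append 'Trend_tmp', drop 'Trend', rename back.
# The append pushes the column to the end, so the column order changes.
# New approach: one in-place update, so the order is never disturbed.
df['Trend'] = df['Trend'].map(mapping)

print(list(df.columns))  # column order is preserved: ['Trend', 'target']
```

The same `Series.map` call handles the value substitution that the append/drop/rename dance was doing in three steps.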

@janmotl I also noticed that in Travis the python3 step seems to be running Python 2.7 instead of Python 3. From install.sh I see mentions of a conda create, and in the CI logs I see a virtualenv being set up that I don't see mentioned anywhere in the project. Perhaps the Travis cache needs to be cleared?

…d of adding a new column, deleting the old one, and renaming the new column to the new column
@janmotl
Collaborator

janmotl commented Sep 23, 2018

The _tmp suffix is also used by LOO, TargetEncoder and WOE.

The TravisCI issue seems to be there from the beginning:
https://travis-ci.org/scikit-learn-contrib/categorical-encoding/jobs/148114560. I do not know how to approach it.

@janmotl
Collaborator

janmotl commented Sep 23, 2018

Regarding TravisCI: it starts with Python 2.7 but then follows with an update to Python 3.5:

Successfully installed chardet-3.0.4 coverage-4.5.1 coveralls-1.5.0 docopt-0.6.2 idna-2.7 requests-2.19.1 urllib3-1.23
Python 3.5.6 :: Anaconda, Inc.

on line 675.

But I do not know why python --version gets executed before the update.

@JohnnyC08
Contributor Author

Gotcha. I would have to add multiple debug echo statements to the CI scripts to even start to get to the bottom of it.

I verified the problem with the TargetEncoder. I'll add tests and take care of it and the WOE as well.

@janmotl
Collaborator

janmotl commented Sep 23, 2018

Btw., the test can be parametrized and executed on all encoders.

…mn order by having the final step be update the existing column and dropping the temporary column instead of the rename and drop strategy
@JohnnyC08
Contributor Author

@janmotl Went ahead and took care of the WOE and TargetEncoder. Tell me what you think.

@janmotl
Collaborator

janmotl commented Sep 24, 2018

@JohnnyC08 Nice - I suspected it was not going to be possible to avoid generating temporary columns.

Once LeaveOneOut is fixed, it will be possible to use a simplified test:

    def test_preserve_column_order(self):
        binary_cat_example = pd.DataFrame(
            {'Trend': ['UP', 'UP', 'DOWN', 'FLAT', 'DOWN', 'UP', 'DOWN', 'FLAT', 'FLAT', 'FLAT'],
             'target': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]}, columns=['Trend', 'target'])

        for encoder_name in encoders.__all__:
            with self.subTest(encoder_name=encoder_name):
                encoder = getattr(encoders, encoder_name)()
                result = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])
                columns = result.columns.values

                self.assertEqual('target', columns[-1], "Target must be the last column, as in the input")

@JohnnyC08
Contributor Author

@janmotl Taken care of. I opted to use your test. For some of the encoders I could eliminate the temporary columns, but I didn't want to risk anything until the data-driven testing gets in, so we can refactor with more impunity.

In the leave-one-out encoder I extracted some of the column names into variables to make the code easier to read.

Tell me what you think.

@janmotl
Collaborator

janmotl commented Sep 25, 2018

I am OK with that. Just note that if the dataset contains columns named Trend and Trend_tmp, it will result in errors in LOO, WOE and TargetEncoder:

    def test_tmp_column_name(self):
        binary_cat_example = pd.DataFrame(
            {'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
             'Trend_tmp': ['UP', 'UP', 'DOWN', 'FLAT'],
             'target': [1, 1, 0, 0]}, columns=['Trend', 'Trend_tmp', 'target'])

        for encoder_name in ['LeaveOneOutEncoder', 'TargetEncoder', 'WOEEncoder']:
            with self.subTest(encoder_name=encoder_name):
                encoder = getattr(encoders, encoder_name)()
                _ = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])

But that is not anything new. A possible workaround is to store the temporary data in a temporary Series instead of in the input DataFrame. As a bonus, it could simplify the code by removing the string concatenation used to create the temporary column name.
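The Series-based workaround might look roughly like this; the per-category statistic and column names are illustrative assumptions, not the encoders' actual internals:

```python
import pandas as pd

# Input that would break the old '_tmp' scheme: it already has 'Trend_tmp'
X = pd.DataFrame({'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
                  'Trend_tmp': ['UP', 'UP', 'DOWN', 'FLAT']},
                 columns=['Trend', 'Trend_tmp'])
y = pd.Series([1, 1, 0, 0])

# The temporary data lives in its own standalone Series keyed by category,
# so no 'Trend_tmp' column is ever written into X and nothing can collide.
category_means = y.groupby(X['Trend']).mean()
X['Trend'] = X['Trend'].map(category_means)

print(list(X.columns))  # unchanged: ['Trend', 'Trend_tmp']
```

Because the intermediate values never touch the input frame, there is also no need to build a temporary column name by string concatenation.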

@JohnnyC08
Contributor Author

You read my mind - I was already thinking of using a standalone Series rather than modifying the actual data frame. I'll make the corresponding changes.

@JohnnyC08
Contributor Author

@janmotl I went ahead and modified the LOO, WOE, TargetEncoder, and OrdinalEncoder to use a Series. In the process I found the fit components of the OrdinalEncoder and LOO were doing more than they had to, so I removed some processing there. The tests are passing, which makes me think I didn't break anything, but please be sure to review the latest commits since my last message and tell me what you think. I've broken the commits up into one for each encoder and two for the ordinal encoder.

Overall, using a Series has helped the readability of the code, which is great. As always, tell me what you think. Thanks, buddy.

@JohnnyC08
Contributor Author

After mentioning removing the DataFrame-modification code from the fit methods, I went ahead and double-checked the TargetEncoder and saw I could remove some from there too.

@janmotl
Collaborator

janmotl commented Sep 26, 2018

@JohnnyC08 Yes! That unnecessary transformation of X during fitting and the unnecessary passing of arguments in LOO was irritating me as well.

Just note that we now unnecessarily calculate smoothing in TargetEncoder when tmp[val]['count'] == 1. But I don't know whether the runtime would actually be better after restoring the condition.

Otherwise, it looks good!

@janmotl
Collaborator

janmotl commented Sep 26, 2018

On second thought, let's keep the calculation of smoothing there for all cases, to make the trained model easier to understand and the code easier to modify.
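For context, the smoothing under discussion is, roughly, a sigmoid blend between the global prior and the per-category mean. The formula and parameter names below are assumptions based on the usual target-encoding scheme, not a quote of the library's code:

```python
import numpy as np

def smoothed_mean(count, cat_mean, prior, min_samples_leaf=1, smoothing=1.0):
    """Blend the prior and the category mean; the weight rises with count."""
    smoove = 1.0 / (1.0 + np.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1.0 - smoove) + cat_mean * smoove

# With count == 1 (and min_samples_leaf == 1) the sigmoid is exactly 0.5,
# so the result is the midpoint of prior and category mean - a cheap
# computation, which is why always running it costs little.
print(smoothed_mean(count=1, cat_mean=1.0, prior=0.5))   # 0.75
print(smoothed_mean(count=1000, cat_mean=1.0, prior=0.5))  # close to 1.0
```

Computing it uniformly for every count, as agreed above, keeps the trained mapping self-consistent at the price of a trivial amount of extra arithmetic.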

@janmotl janmotl merged commit c209ae1 into scikit-learn-contrib:master Sep 26, 2018
@JohnnyC08 JohnnyC08 deleted the issue-106-ordinal-column-order branch September 26, 2018 14:56
Successfully merging this pull request may close these issues.

OrdinalEncoder changes column order