In the ordinal encoder go ahead and update the existing column instea… #126
Conversation
…d of adding a new column, deleting the old one, and renaming the new column to the old column's name
The TravisCI issue seems to be there from the beginning:
Regarding TravisCI: the build starts with Python 2.7, but it then follows with an update to Python 3.5 on line 675. But I do not know why.
Gotcha, I would have to add multiple debug echo statements in the CI scripts to even start to get to the bottom of it. I verified the problem with the TargetEncoder. I'll add tests and take care of it and the WOE as well.
Btw., the test can be parametrized and executed on all encoders.
…mn order by making the final step update the existing column and drop the temporary column, instead of the rename-and-drop strategy
@janmotl Went ahead and took care of the WOE and TargetEncoder. Tell me what you think.
@JohnnyC08 Nice - I had a suspicion that it was not going to be possible to avoid generating temporary columns. Once LeaveOneOut is fixed, it will be possible to use a simplified test:

```python
def test_preserve_column_order(self):
    binary_cat_example = pd.DataFrame(
        {'Trend': ['UP', 'UP', 'DOWN', 'FLAT', 'DOWN', 'UP', 'DOWN', 'FLAT', 'FLAT', 'FLAT'],
         'target': [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]}, columns=['Trend', 'target'])

    for encoder_name in encoders.__all__:
        with self.subTest(encoder_name=encoder_name):
            encoder = getattr(encoders, encoder_name)()
            result = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])
            columns = result.columns.values
            self.assertTrue('target' in columns[-1], "Target must be the last column as in the input")
```
@janmotl Taken care of. I opted to use your test. For some of the encoders I could eliminate the temporary columns, but I didn't want to risk anything until the data-driven testing gets in, so we can refactor with some more impunity. In the LeaveOneOut encoder I extracted some of the column names to variables to make the code easier to read. Tell me what you think.
I am ok with that. Just note that the encoders can clash with datasets that already contain columns with the temporary names:

```python
def test_tmp_column_name(self):
    binary_cat_example = pd.DataFrame(
        {'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
         'Trend_tmp': ['UP', 'UP', 'DOWN', 'FLAT'],
         'target': [1, 1, 0, 0]}, columns=['Trend', 'Trend_tmp', 'target'])

    for encoder_name in ['LeaveOneOutEncoder', 'TargetEncoder', 'WOEEncoder']:
        with self.subTest(encoder_name=encoder_name):
            encoder = getattr(encoders, encoder_name)()
            _ = encoder.fit_transform(binary_cat_example, binary_cat_example['target'])
```

But it is not anything new. A possible workaround is to store the temporary data in a temporary Series instead of the input DataFrame. As a bonus, it could simplify the code by removing the string concatenation in the temporary column name creation.
You read my mind, because I was thinking of using a Series by itself without modifying the actual data frame. I'll make the corresponding changes.
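As a rough sketch of that idea, the mapping can be computed into a standalone pandas Series and assigned back onto the existing column, so no `_tmp` column ever touches the input frame. The `encode_column` helper and the target-mean mapping below are illustrative assumptions, not the library's actual code:

```python
import pandas as pd

def encode_column(X, y, col):
    # Hypothetical sketch: hold intermediate results in a temporary
    # Series instead of writing a '<col>_tmp' column into X.
    mapping = y.groupby(X[col]).mean()   # category -> target mean
    tmp = X[col].map(mapping)            # temporary Series, never stored in X
    X[col] = tmp                         # update the existing column in place
    return X

df = pd.DataFrame({'Trend': ['UP', 'UP', 'DOWN', 'FLAT'],
                   'target': [1, 1, 0, 0]}, columns=['Trend', 'target'])
encoded = encode_column(df, df['target'], 'Trend')
print(list(encoded.columns))  # column order is unchanged
```

Because the intermediate result lives only in a local Series, a dataset that already has a `Trend_tmp` column can no longer collide with the encoder's internals.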
…. To handle the LeaveOneOut encoder we opted to use a Series, and removed unnecessary code from the fit method
… modifying the existing frame
@janmotl I went ahead and modified the LOO, WOE, TargetEncoder, and OrdinalEncoder to use a Series. In the process I found the fit components of the OrdinalEncoder and LOO were doing more than they had to, so I removed some processing there. The tests are passing, which makes me think I didn't break anything, but please be sure to review the latest commits since my last message and tell me what you think. I've broken the commits into one for each encoder and two for the ordinal encoder. Overall, using a Series has helped the readability of the code, which is great. As always, tell me what you think. Thanks buddy.
After mentioning removing the data frame modification code from the fit methods, I went ahead and double-checked the TargetEncoder and saw I could remove some from there too.
@JohnnyC08 Yes! That unnecessary transformation of X during fitting, and the unnecessary passing of arguments in LOO, was irritating me as well. Just note that we now unnecessarily calculate it. Otherwise, it looks good!
On second thought, let's keep the calculation of |
To fix #100

The issue was arising because `_tmp` columns were being appended to the end of the data frame as part of the transform process. First, we noticed that the transform process was to append a temporary column, drop the existing column, and rename the temporary column to the existing column's name.

So, we went ahead and reduced those steps to a single step where we update the existing column using our mapping, which preserves the column order. I wasn't sure why the above-mentioned transform method had that many steps, and a single update seems to keep the tests passing.
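For illustration, here is a minimal pandas sketch (not the actual encoder code) contrasting the old append/drop/rename strategy, which pushes the encoded column to the end, with the single in-place update, which keeps the original order:

```python
import pandas as pd

df = pd.DataFrame({'Trend': ['UP', 'DOWN'], 'target': [1, 0]},
                  columns=['Trend', 'target'])
mapping = {'UP': 1.0, 'DOWN': 0.0}  # assumed category -> value mapping

# Old strategy: append a _tmp column, drop the original, rename.
old = df.copy()
old['Trend_tmp'] = old['Trend'].map(mapping)
old = old.drop('Trend', axis=1).rename(columns={'Trend_tmp': 'Trend'})
print(list(old.columns))   # 'Trend' has moved to the end

# New strategy: update the existing column in place.
new = df.copy()
new['Trend'] = new['Trend'].map(mapping)
print(list(new.columns))   # original column order preserved
```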
@janmotl I also noticed in Travis that the python3 step seems to be running Python 2.7 instead of Python 3. From `install.sh` I see mentions of a `conda create`, and in the CI logs I see a `virtualenv` being set which I don't see mentioned in the project. Perhaps the Travis cache needs to be cleared?