multiple columns alias name is not effect! #128

Closed
feng-1985 opened this issue Oct 19, 2017 · 13 comments

Comments

@feng-1985

feng-1985 commented Oct 19, 2017

import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition, \
    sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', None, 'dog', 'cat', 'fish'],
                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
                      'salary':   [90, 24, 44, 27, 32, 59, 36, 27],
                     'age':[12,24,21,17,18,25,19,15]})

mapper = DataFrameMapper([
     ('pet', [CategoricalImputer(), sklearn.preprocessing.LabelBinarizer()]),
     (['children', 'salary'], sklearn.preprocessing.StandardScaler(), {'alias': 'children_scaled',
                                                                       'alias1':'salary_scaled'})

 ], df_out=True, default=None)
mapper.fit_transform(data.copy())
print(mapper.transformed_names_)

['pet_cat', 'pet_dog', 'pet_fish', 'children_scaled_0', 'children_scaled_1', 'age']

@feng-1985
Author

import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition, \
    sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', None, 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary':   [90, 24, 44, 27, 32, 59, 36, 27],
                     'age':      [12, 24, 21, 17, 18, 25, 19, 15]})

mapper = DataFrameMapper([
    (['children', 'salary'], sklearn.preprocessing.StandardScaler())
], df_out=True, default=None)
mapper.fit_transform(data.copy())  # !!!
print(mapper.transformed_names_)

['children_salary_0', 'children_salary_1', 'age_pet_0', 'age_pet_1']

@feng-1985
Author

feng-1985 commented Oct 20, 2017

I will try to solve this problem!

@dukebody
Collaborator

Can you describe what you expect vs. what the current behavior is? I don't quite understand where the issue is.

The alias1 key in the options dict for one of the features in your code doesn't do anything. A parameter with that name is neither expected nor used.

@feng-1985
Author

feng-1985 commented Oct 23, 2017

@dukebody
For the first example, I expect the result to be:

['pet_cat', 'pet_dog', 'pet_fish', 'children_scaled', 'salary_scaled', 'age']

Yes, the alias1 input is unused, so I tried passing a list instead:

mapper = DataFrameMapper([
     ('pet', [CategoricalImputer(), sklearn.preprocessing.LabelBinarizer()]),
     (['children', 'salary'], sklearn.preprocessing.StandardScaler(), {'alias': ['children_scaled','salary_scaled']})
 ], df_out=True, default=None)

But this code raises an error, so I updated dataframe_mapper.py and added some functions. I hope it helps.

For the second example, I expect the result to be:

['children', 'salary', 'age', 'pet']

The checks on #129 failed, and I don't know why.

@devforfu
Collaborator

@bifeng See the CircleCI report (click the Details link). Basically, your pull request has a few flake8 errors/warnings.

@feng-1985
Author

@devforfu I have finally removed these errors.

@feng-1985 feng-1985 reopened this Nov 7, 2017
@dukebody
Collaborator

@bifeng I believe you can get the columns you want using the right syntax with the DataFrameMapper, without modifying any code. Try the following:

import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition, \
    sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import CategoricalImputer

data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', None, 'dog', 'cat', 'fish'],
                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
                      'salary':   [90, 24, 44, 27, 32, 59, 36, 27],
                     'age':[12,24,21,17,18,25,19,15]})

mapper = DataFrameMapper([
     ('pet', [CategoricalImputer(), sklearn.preprocessing.LabelBinarizer()]),
     (['children'],
      sklearn.preprocessing.StandardScaler(), {'alias': 'children_scaled'}),
     (['salary'],
      sklearn.preprocessing.StandardScaler(), {'alias': 'salary_scaled'}),

 ], df_out=True, default=None)
mapper.fit_transform(data.copy())
print(mapper.transformed_names_)

It outputs:

['pet_cat', 'pet_dog', 'pet_fish', 'children_scaled', 'salary_scaled', 'age']

which is what you expect, right?

The problem in your original code is that if you use the feature definition:

['children', 'salary'], sklearn.preprocessing.StandardScaler()

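then it applies the StandardScaler to a numpy array consisting of both columns, producing a single two-column output. Both output columns share the one alias, which is why your first example shows the names children_scaled_0 and children_scaled_1.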

Let me know if the solution I outlined works for you.
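
The naming behavior seen in this thread can be sketched in plain Python. This is only an illustration of the observed output names, not the library's actual implementation: output_names is a hypothetical helper, and it covers just the numeric-suffix case (not the class-based names such as pet_cat that LabelBinarizer produces).

```python
def output_names(columns, n_outputs, alias=None):
    """Mimic the names seen in this thread for one feature definition.

    Hypothetical helper for illustration; not part of sklearn_pandas.
    """
    # The alias, when given, replaces the joined column names as the base name.
    base = alias if alias is not None else "_".join(columns)
    # A single output column keeps the base name unchanged.
    if n_outputs == 1:
        return [base]
    # Several output columns share the base name and get a positional suffix.
    return [f"{base}_{i}" for i in range(n_outputs)]


# One feature definition covering both columns: one alias, two outputs.
grouped = output_names(["children", "salary"], 2, alias="children_scaled")

# Two feature definitions, one per column: each alias names one output.
separate = (output_names(["children"], 1, alias="children_scaled")
            + output_names(["salary"], 1, alias="salary_scaled"))

print(grouped)
print(separate)
```

Under this sketch, a single alias can only ever name the whole group, so distinct names per column require one feature definition per column.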

@datajanko

I think he wants to avoid redundancy. Actually, it might be best to allow an aliases argument in gen_features.

@dukebody
Collaborator

dukebody commented Feb 4, 2018

I honestly don't think it's too much writing... However to avoid redundancy one can write something like:

features = [('pet', [CategoricalImputer(), sklearn.preprocessing.LabelBinarizer()])]
for column, alias in zip(['children', 'salary'], ['children_scaled', 'salary_scaled']):
    features.append(([column], sklearn.preprocessing.StandardScaler(), {'alias': alias}))

mapper = DataFrameMapper(features)

@datajanko

In that case, I have the feeling that we should modify gen_features, because it seems like you are essentially reimplementing the functionality of that function.

@devforfu
Collaborator

devforfu commented Sep 5, 2018

@dukebody Do you think we should proceed with PR #129, or close it for now? As noted above, the more straightforward solution would be to modify gen_features so that it can accept a list of aliases.

@dukebody
Collaborator

@devforfu I don't think #129 is necessary. We can modify gen_features to accept an aliases list perhaps, yes.
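
A minimal sketch of what such a helper might look like, written as a standalone function rather than a change to gen_features. Everything here is an assumption for illustration: gen_aliased_features and PlaceholderScaler are hypothetical names that do not exist in sklearn_pandas, and the placeholder class only stands in for a real transformer such as sklearn.preprocessing.StandardScaler so the snippet runs without extra dependencies.

```python
def gen_aliased_features(columns, aliases, make_transformer):
    """Build one feature definition per column, each with its own alias.

    Hypothetical helper; not part of sklearn_pandas.
    """
    if len(columns) != len(aliases):
        raise ValueError("columns and aliases must have the same length")
    # Call make_transformer once per column so no transformer instance
    # is shared between feature definitions.
    return [
        ([column], make_transformer(), {'alias': alias})
        for column, alias in zip(columns, aliases)
    ]


class PlaceholderScaler:
    """Stand-in for a real transformer class (an assumption for this demo)."""


features = gen_aliased_features(
    ['children', 'salary'],
    ['children_scaled', 'salary_scaled'],
    PlaceholderScaler,
)
for columns, transformer, options in features:
    print(columns, type(transformer).__name__, options)
```

The resulting list has the same shape as the tuples built by the loop earlier in this thread, so it could be concatenated with other feature definitions before being passed to DataFrameMapper.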

@Rendiere

@dukebody Any update on adding a list of aliases to gen_features? +1 from me for this functionality ^_^
