Planned functionality #1

rhiever · 2016-02-29T19:57:30Z

In the immediate future, datacleaner will:

Encode all non-numerical variables as numerical variables
Replace all NaNs with the median of the column or drop all NaN rows (configurable)

If anyone has more ideas, please add them here.
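
A minimal sketch of the two planned steps, assuming pandas. `clean_sketch` and its `drop_nans` flag are hypothetical names used only for illustration, not datacleaner's actual API:

```python
import numpy as np
import pandas as pd


def clean_sketch(df, drop_nans=False):
    """Hypothetical helper: handle NaNs, then encode non-numerical columns."""
    out = df.copy()
    if drop_nans:
        # Option 1: drop every row that contains at least one NaN.
        out = out.dropna()
    else:
        # Option 2: replace NaNs in numerical columns with the column median.
        for col in out.columns:
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
    # Encode each non-numerical column as integer codes (sorted label order).
    # Note: pd.factorize assigns -1 to any missing value that is still present.
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            codes, _ = pd.factorize(out[col], sort=True)
            out[col] = codes
    return out


example = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "x"]})
cleaned = clean_sketch(example)
```

Integer codes imply an ordering between labels that may not exist, which is part of what the rest of this thread discusses.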


We should use the mode (i.e., most common value) of the column for categorical variables, and the median for continuous variables. Since there’s no easy way to detect continuous vs. categorical variables in pandas, we use a heuristic: If >20% of the values in a column are unique, then it is probably a continuous variable. Otherwise, it is probably a categorical variable. (Related to #1)
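
The heuristic described above could be sketched roughly as follows. The function name `impute_by_heuristic` is hypothetical, and two details are assumptions rather than something stated in the thread: the unique ratio is computed over non-missing values only, and non-numerical columns are always treated as categorical.

```python
import pandas as pd


def impute_by_heuristic(df, unique_ratio=0.2):
    """Hypothetical sketch: median for 'continuous' columns, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        observed = out[col].dropna()
        if observed.empty:
            continue  # nothing to base an imputed value on
        ratio = observed.nunique() / len(observed)
        is_numeric = pd.api.types.is_numeric_dtype(observed)
        if is_numeric and ratio > unique_ratio:
            # More than 20% unique values: probably continuous, use the median.
            fill_value = observed.median()
        else:
            # Otherwise probably categorical: use the most common value.
            fill_value = observed.mode().iloc[0]
        out[col] = out[col].fillna(fill_value)
    return out


data = pd.DataFrame({
    "height": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 100.0, None],
    "grade": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, None],
    "color": ["red"] * 8 + ["blue"] * 3 + [None],
})
imputed = impute_by_heuristic(data)
```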


jaumebp · 2016-03-06T14:25:24Z

In my experience it is worth identifying ordinal variables (e.g. numerical grades) and handling them separately. In many cases these can be treated as continuous variables, but sometimes it is necessary to treat them as discrete ones. One example is missing value imputation: if you treat them as continuous, you may end up injecting fake values that can then mislead the downstream analysis.
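
A small illustration of that risk, using made-up grade values: the median of an ordinal column can be a value that never occurs in the data, while the mode is always an observed level.

```python
import pandas as pd

# Hypothetical ordinal grades (1 to 4) with two missing entries.
grades = pd.Series([1, 2, 2, 2, 3, 3, 4, 4, None, None])

median_fill = grades.median()       # average of the two middle values
mode_fill = grades.mode().iloc[0]   # most common observed grade

valid_levels = set(grades.dropna())
median_is_valid = median_fill in valid_levels
mode_is_valid = mode_fill in valid_levels
```

Here the median falls between two grades, so filling with it would introduce a grade that no sample actually has.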

Thanks for the project! I tested it on some of my biomedical datasets and compared the PCA before and after the cleaning. The only case where there were differences was a dataset with discrete variables (exome sequencing), specifically in the columns where some of the values were '0'. The following warning was printed:
sys:1: DtypeWarning: Columns (6,19,131,225,404,416,515,651,833,945,975,986,1265,1327,1387,1494,1541,1558,1715,1737,1854,1875,1947,1980,2015,2024,2111,2132,2140,2165,2426,2652,2667,2668,2871,2943,2978,2997,3165,3335,3634,3807,3945,4010,4018,4177,4191,4196,4243,4245,4389,4463,4553,4772,4814,4841,4962) have mixed types. Specify dtype option on import or set low_memory=False.
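
The warning itself suggests two remedies when loading the file with pandas. A minimal sketch of both, using an in-memory CSV with hypothetical column names in place of the real dataset:

```python
import io

import pandas as pd

# Hypothetical stand-in for a column that mixes '0' with string genotypes.
csv_text = "sample,variant\ns1,0\ns2,A/G\ns3,0\n"

# Remedy 1: state the dtype explicitly so every value is read as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"variant": str})

# Remedy 2: low_memory=False parses the file in one pass instead of in
# chunks, so pandas infers a single dtype per column.
whole_file = pd.read_csv(io.StringIO(csv_text), low_memory=False)
```

With mixed types resolved at import time, the '0' entries are no longer read as integers in some chunks and as strings in others.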

rhiever · 2016-03-06T14:29:56Z

Indeed, which is why I'm trying to discover how to identify ordinal vs. continuous variables. I posted this question on StackOverflow to brainstorm.

jaumebp · 2016-03-06T14:36:11Z

In our software we went with a much simpler approach: letting the user specify a list of attributes to be treated as ordinal. Of course, an automatic solution is far more elegant :)

westurner · 2016-12-14T09:01:01Z

"Convenience function: Detect if there are non-numerical features and encode them as numerical features" EpistasisLab/tpot#61

westurner · 2016-12-14T09:02:38Z

Do I have to do get_dummies() all by myself?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

... get_dummies() accepts a number of kwargs

westurner · 2016-12-14T09:06:59Z

> Do I have to do get_dummies() all by myself?

I think it illogical to e.g. average Exterior1st in the Kaggle House Prices Dataset: the average of ImStucc and Wd Sdng seems nonsensical?
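
For reference, a sketch of what `get_dummies()` does with such a nominal column. The rows below are made up for illustration and are not taken from the Kaggle dataset; the `columns` keyword restricts the expansion to the listed columns.

```python
import pandas as pd

# Made-up rows; only the column names mirror the House Prices dataset.
houses = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Exterior1st": ["ImStucc", "Wd Sdng", "VinylSd"],
})

# Expand only Exterior1st into indicator columns; LotArea is left untouched.
encoded = pd.get_dummies(houses, columns=["Exterior1st"])
indicator_columns = [c for c in encoded.columns if c.startswith("Exterior1st_")]
```

Each row then has exactly one indicator set, so no ordering or average between exterior materials is implied.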

westurner · 2016-12-14T09:08:27Z

CSVW as JSONLD may be a good way to specify a dataset header with the relevant metadata for such operations? pandas-dev/pandas#3402

rhiever · 2016-12-14T16:28:33Z

You should be able to use the sklearn OneHotEncoder to get the equivalent of the pandas get_dummies().

westurner · 2016-12-14T16:49:07Z

> You should be able to use the sklearn OneHotEncoder to get the equivalent of the pandas get_dummies().

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Is there a way to specify that I only need certain columns to be expanded into multiple columns w/ OneHotEncoder?

rhiever · 2016-12-14T16:51:03Z

See the docs you linked and the categorical_features parameter.
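
Note for readers on newer scikit-learn: the `categorical_features` parameter was deprecated in 0.20 and later removed. A sketch of the column-selective equivalent using `ColumnTransformer` (available from 0.20 onward), with made-up data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration.
houses = pd.DataFrame({
    "Exterior1st": ["ImStucc", "Wd Sdng", "ImStucc"],
    "LotArea": [8450, 9600, 11250],
})

transformer = ColumnTransformer(
    # One-hot encode only the listed column.
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Exterior1st"])],
    # Pass every other column through unchanged.
    remainder="passthrough",
    # Always return a dense array rather than a sparse matrix.
    sparse_threshold=0,
)
encoded = transformer.fit_transform(houses)
```

The encoded indicator columns come first in the output, followed by the passed-through columns. Adding more `(name, transformer, columns)` entries is one way to apply different preprocessors to different columns.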

westurner · 2016-12-14T17:14:46Z

Do I need to write a FunctionTransformer to stack multiple preprocessing modules?

westurner · 2016-12-14T17:16:01Z

> Do I need to write a FunctionTransformer to stack multiple preprocessing modules?

i.e., for different columns? Or just run autoclean multiple times?

rhiever · 2016-12-15T16:32:36Z

Running autoclean multiple times might be the easier solution. Might be a useful extension to autocleaner to allow the user to pass multiple preprocessors in a list.

westurner · 2016-12-23T08:33:55Z

> Might be a useful extension to autocleaner to allow the user to pass multiple preprocessors in a list.

https://github.com/paulgb/sklearn-pandas DataFrameMapper supports various combinations of columns and transformations.

westurner · 2016-12-23T08:36:41Z

It may be worth noting that pandas Categoricals have an ordered=True parameter. http://pandas.pydata.org/pandas-docs/stable/categorical.html#sorting-and-order

Does specifying the Categoricals have a different effect than inferring the ordinals from the happenstance sequence of strings in a given dataset?
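
A small sketch of the difference, using made-up size labels: without an explicit order, pandas sorts the categories lexically, so the integer codes follow alphabetical order rather than the real-world order.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Inferred: categories are sorted lexically (large, medium, small).
inferred = sizes.astype("category")

# Explicit: the user states the intended order.
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
explicit = sizes.astype(size_dtype)

inferred_codes = inferred.cat.codes.tolist()
explicit_codes = explicit.cat.codes.tolist()

# Ordered categoricals also support comparisons against a category.
smaller_than_large = (explicit < "large").tolist()
```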

adrose · 2017-02-13T01:12:26Z

Any plans to impute NAs rather than replacing them with the median value for continuous variables?

rhiever · 2017-02-13T19:05:59Z

@adrose, do you mean via model-based imputation?

adrose · 2017-02-13T19:16:09Z

@rhiever sorry, I should have been a lot more specific, but yes: something similar to what the Amelia command does in this R package, i.e. bootstrapped linear regression.

Happy to expand on it more, or would be excited to see if you have any thoughts on this function if you think it may be applicable.
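
To make the idea concrete, here is a rough NumPy sketch of imputation by bootstrapped linear regression with a single predictor. It is not a reimplementation of Amelia, and the function name `bootstrap_regression_impute` and its parameters are hypothetical:

```python
import numpy as np


def bootstrap_regression_impute(x, y, n_boot=200, seed=0):
    """Fill NaNs in y from x using least-squares fits on bootstrap resamples.

    Returns the filled copy of y and the spread (standard deviation) of the
    bootstrap predictions for each missing entry.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed_idx = np.flatnonzero(~missing)
    predictions = np.empty((n_boot, missing.sum()))
    for b in range(n_boot):
        # Resample the complete rows with replacement.
        sample = rng.choice(observed_idx, size=observed_idx.size, replace=True)
        # Fit y = slope * x + intercept on the resample.
        slope, intercept = np.polyfit(x[sample], y[sample], deg=1)
        predictions[b] = slope * x[missing] + intercept
    filled = y.copy()
    # Use the average prediction across resamples as the imputed value.
    filled[missing] = predictions.mean(axis=0)
    return filled, predictions.std(axis=0)
```

The per-entry spread gives a rough sense of how uncertain each imputed value is, which a single median fill cannot provide.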

westurner · 2017-02-14T04:50:57Z

https://en.wikipedia.org/wiki/Imputation_(statistics) :

In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". There are three main problems that missing data causes: missing data can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and create reductions in efficiency.[1] Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data.[2] Imputation theory is constantly developing and thus requires consistent attention to new information regarding the subject. There have been many theories embraced by scientists to account for missing data but the majority of them introduce large amounts of bias. A few of the well known attempts to deal with missing data include: hot deck and cold deck imputation; listwise and pairwise deletion; mean imputation; regression imputation; last observation carried forward; stochastic imputation; and multiple imputation. [emphasis added]
http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
- class sklearn.preprocessing.Imputer(, strategy=('mean', 'median', 'most_frequent'), )
http://scikit-learn.org/stable/auto_examples/missing_values.html :

Imputing missing values before building an estimator
This example shows that imputing the missing values can give better results than discarding the samples containing any missing value. Imputing does not always improve the predictions, so please check via cross-validation. Sometimes dropping rows or using marker values is more effective.
Missing values can be replaced by the mean, the median or the most frequent value using the strategy hyper-parameter. The median is a more robust estimator for data with high magnitude variables which could dominate results (otherwise known as a ‘long tail’).
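
A quick pandas sketch of the three strategies on a made-up long-tailed column, to show why the median is the more robust choice. In later scikit-learn releases the `Imputer` class linked above was replaced by `sklearn.impute.SimpleImputer`, which accepts the same strategy names.

```python
import pandas as pd

# Made-up values with one extreme outlier and one missing entry.
incomes = pd.Series([1.0, 2.0, 2.0, 3.0, 1000.0, None])

fills = {
    "mean": incomes.mean(),                   # pulled up by the outlier
    "median": incomes.median(),               # unaffected by the outlier
    "most_frequent": incomes.mode().iloc[0],  # most common observed value
}
imputed = {name: incomes.fillna(value) for name, value in fills.items()}
```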

rhiever added the enhancement label Feb 29, 2016

rhiever added a commit that referenced this issue Mar 2, 2016

First pass at #1

113f9b6

Currently only for autoclean()

rhiever added a commit that referenced this issue Mar 2, 2016

First pass at #1 for autoclean_cv()

ea83721

rhiever added a commit that referenced this issue Mar 2, 2016

Initial pass at command-line interface for #1

f7645cb

rhiever added a commit that referenced this issue Mar 2, 2016

Add drop NaN functionality

9a4f309

Related to #1 and imputing

ghk829 mentioned this issue May 22, 2019

add fill_func to fillna #17

Open

ghk829 referenced this issue in ghk829/datacleaner May 22, 2019

add fill_func to fillna #1

5d332b4

ghk829 mentioned this issue May 22, 2019

add fill_func to fillna #1 #18

Open
