
ValueError: could not convert string to float: 'aaa' #193

Closed
simonm3 opened this issue Nov 23, 2016 · 22 comments

Comments

@simonm3

simonm3 commented Nov 23, 2016

I have imbalanced classes with 10,000 1s and 10 million 0s. I want to undersample before I convert category columns to dummies, to save memory. I expected the sampler to ignore the content of x and select randomly based on y, but instead I get the error above. What am I not understanding, and how do I do this without converting category features to dummies first?

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

clf_sample = RandomUnderSampler(ratio=.025)
x = pd.DataFrame(np.random.random((100, 5)), columns=list("abcde"))
x.loc[:, "b"] = "aaa"  # a string column triggers the error
clf_sample.fit(x, y.head(100))  # y: a pandas Series of 0/1 labels, defined elsewhere
@glemaitre
Member

This could be due to pandas; I will check that. You should also know that we do not support pandas, and the resampled x and y will not be of DataFrame type.

I just tried the following NumPy example, which seems to work fine:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

@glemaitre
Member

@simonm3 What is y in your example?

@simonm3
Author

simonm3 commented Nov 23, 2016

y is just a list of integers that are 1 or 0; it is fine. You are correct that it is because of pandas. But if I convert the DataFrame to values it does not work either! The two arrays compare equal, yet the NumPy one has dtype "<U3" while the pandas one has dtype "object".

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab")).values
print(np.array_equal(x, x1))
print(x1.dtype, x.dtype)

y = np.array([0] * 10 + [1] * 90)
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x1, y)

@simonm3
Author

simonm3 commented Nov 23, 2016

Here is a solution:

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# create test data
x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

###### numpy test #####
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

###### pandas test #####
# convert pandas to numpy
x1 = x1.reset_index()
savedcols = x1.columns
x1 = x1.values.astype("U")

rus = RandomUnderSampler()
x1_res, y1_res = rus.fit_sample(x1, y)

# convert back to pandas
x1_res = pd.DataFrame(x1_res, columns=savedcols).set_index("index")
x1_res

@simonm3 simonm3 closed this as completed Nov 23, 2016
@simonm3
Author

simonm3 commented Nov 23, 2016

Though now all the numeric columns are converted to strings! An alternative is to pass just the index of my DataFrame to the sampler, then select the rows from the result. That should work... unless you can think of a better solution.

@simonm3 simonm3 reopened this Nov 23, 2016
@dvro
Member

dvro commented Nov 23, 2016

@simonm3 just save the column names and also the column types.

I changed your script below; let me know how it works for you:

import pandas as pd
import numpy as np

from imblearn.under_sampling import RandomUnderSampler

# create test data
x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab"), index=range(1000, 1100))
# adding numeric column
x1['n'] = np.arange(x1.shape[0])

y = np.array([0] * 10 + [1] * 90)

print(x1)

###### numpy test #####
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

###### pandas test #####
# convert pandas to numpy
x1 = x1.reset_index()
col_names = x1.columns
col_types = x1.dtypes

x1 = x1.values.astype("U")

rus = RandomUnderSampler()
x1_res, y1_res = rus.fit_sample(x1, y)

# convert back to pandas
x1_res = pd.DataFrame(x1_res)

x1_res.columns = col_names
for col_name, col_type in zip(col_names, col_types):
    x1_res[col_name] = x1_res[col_name].astype(col_type)

print(x1_res)
print(x1_res.dtypes)

The final result keeps the types:

[100 rows x 3 columns]
    index    a    b   n
0    1000  aaa  bbb   0
1    1001  aaa  bbb   1
2    1002  aaa  bbb   2
3    1003  aaa  bbb   3
4    1004  aaa  bbb   4
5    1005  aaa  bbb   5
6    1006  aaa  bbb   6
7    1007  aaa  bbb   7
8    1008  aaa  bbb   8
9    1009  aaa  bbb   9
10   1058  aaa  bbb  58
11   1082  aaa  bbb  82
12   1082  aaa  bbb  82
13   1066  aaa  bbb  66
14   1031  aaa  bbb  31
15   1044  aaa  bbb  44
16   1074  aaa  bbb  74
17   1053  aaa  bbb  53
18   1034  aaa  bbb  34
19   1045  aaa  bbb  45
index     int64
a        object
b        object
n         int64
dtype: object

@chkoar
Member

chkoar commented Nov 24, 2016

@simonm3 you could pass the index, as you said:

import numpy as np
import pandas as pd

from imblearn.under_sampling import RandomUnderSampler

# create test data
X = np.array([['aaa'] * 100, ['bbb'] * 100]).T
X_df = pd.DataFrame(X, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

# numpy test
X_res1, y_res1 = RandomUnderSampler().fit_sample(X, y)

#  pandas test
X_i = X_df.index.values.reshape(-1, 1)
_, _, i = RandomUnderSampler(return_indices=True).fit_sample(X_i, y)
X_res2, y_res2 = X_df.iloc[i, :], y[i]

@dvro @glemaitre we could implicitly support pandas

@simonm3
Author

simonm3 commented Nov 24, 2016 via email

@simonm3
Author

simonm3 commented Nov 24, 2016 via email

@chkoar
Member

chkoar commented Nov 24, 2016

@simonm3 obviously I proposed a solution according to your example and your usage of RUS.

@glemaitre
Member

I am closing this issue. Feel free to re-open if needed

@TenzinJhopee

TenzinJhopee commented Mar 20, 2017

from math import sqrt
import warnings
from collections import Counter
import numpy as np
import pandas as pd
import random

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to value less than total voting group STUPID!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
            distances.append([euclidean_distance, group])
            
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    print(Counter(votes).most_common(1))
          
    return vote_result

df = pd.read_csv('breast-cancer-wisconsin.data.txt', error_bad_lines=False)    
df.replace('?', -99999, inplace=True)
df.drop(['id'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
print(full_data[:10])
ValueError: could not convert string to float: 'Class'

Please help me to find a solution guys

@glemaitre
Member

I will be very straightforward:

  • this part of GitHub is an issue tracker; post only issues related to the software;
  • this GitHub page is for imbalanced-learn, which has nothing to do with your issue;
  • always read the contributing and issue guidelines when raising an issue. You will learn that you should wrap code in triple backquotes for readability.

@shaypal5

Hello!

I'm getting this error with imblearn v0.3.3 when trying to use RandomUnderSampler.fit_sample() when X includes a column with string values.

The problem is caused due to sklearn.utils.check_X_y being called in the following form:
check_X_y(X, y, accept_sparse=['csr', 'csc'])
Since the dtype parameter is not specified explicitly, it is set to "numeric" by default, as detailed in the function's documentation here:
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/validation.py#L479

As such, the defined behavior of check_X_y in this case is: "If 'numeric', dtype is preserved unless array.dtype is object."
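That default can be demonstrated directly with scikit-learn's validator; a minimal sketch (the array contents here are made up for illustration):

```python
import numpy as np
from sklearn.utils import check_X_y

# object-dtype strings, as DataFrame.values gives for mixed/string columns
X = np.array([['aaa', 'bbb']] * 6, dtype=object)
y = np.array([0, 0, 1, 1, 1, 1])

# Default dtype="numeric": an object array is cast to float, which fails on strings.
try:
    check_X_y(X, y, accept_sparse=['csr', 'csc'])
except ValueError as err:
    print(err)  # e.g. could not convert string to float: 'aaa'

# With dtype=None the input dtype is preserved and the strings pass through.
X_ok, y_ok = check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None)
print(X_ok.dtype)  # object
```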

I've cloned your repo and had to add dtype=None to the call to check_X_y in both SamplerMixin.sample() and BaseSampler.fit() to get RandomUnderSampler to work with string data.

Since prototype selection methods, unlike prototype generation methods, can support any kind of data, I think this check should not be forced for such methods.

A possible design is to add a _check_X_y method to SamplerMixin or BaseSampler which will call sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc']), and have prototype selection methods override this method with a version which will instead call sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None)

Whatever the design, if one can be agreed on / you can advise me on one, I don't mind writing it myself and opening a pull request. That is, assuming you agree with me that non-numeric data should be allowed for prototype selection methods.

Cheers, (and what a great package!)
Shay

@glemaitre
Member

glemaitre commented Apr 15, 2018 via email

@glemaitre
Member

glemaitre commented Apr 15, 2018 via email

@shaypal5

Cool, so for starters just these two can override the default way; but for that we need to have an override-able property that determines it - I think a function is the most proper way to do that.

(Also, would you mind editing your previous comment in this post and remove the quote of my entire post + some code that looks like it comes from email metadata? It makes it, and the entire thread, a bit unreadable)

@glemaitre
Member

Probably the way to go would be to improve the _check_X_y: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/base.py#L32

Now that I see it, I don't like that _check_X_y is only checking the hash. The check_X_y of scikit-learn should presumably go there. At first glance, I think we could have something like:

def _check_X_y(self, X, y):
    return check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype='numeric')

def check_inputs(self, X, y):
    if hasattr(self, 'ratio_'):
        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)
        X, y = self._check_X_y(X, y)
        X_hash, y_hash = hash_X_y(X, y)
        if self._X_hash != X_hash or self._y_hash != y_hash:
            raise RuntimeError("X and y need to be the same arrays "
                               "that were fitted earlier.")
        return X, y, binarize_y
    else:
        y = check_target_type(y)
        X, y = self._check_X_y(X, y)
        self._X_hash, self._y_hash = hash_X_y(X, y)
        return X, y

@shaypal5

Oh, right! I forgot it already exists!

Yes, I wholeheartedly agree that the call to scikit's check_X_y should happen there!

@shaypal5

I'll get to working on it later this week - I'll start with the version you've outlined above + overriding it in the random samplers, then I'll see if tests are passing and think about additional tests.

@glemaitre
Member

Solved in master for RandomUnderSampler and RandomOverSampler.
We just have to wait for scikit-learn 0.20 so that we can also release 0.4.

@Yevgnen

Yevgnen commented Aug 9, 2019

How can I set back the DataFrame type in a pipeline?

make_pipeline(SMOTENC(), LGBMClassifier())

I've tried writing a dummy transformer to turn it back into a DataFrame in the middle of the pipeline, but it didn't work:

make_pipeline(SMOTENC(), FixType(), LGBMClassifier())
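One way such a step can be written, sketched below with a hypothetical ToDataFrame transformer (the name is mine, not part of imbalanced-learn or scikit-learn); it is stateless and simply wraps the incoming array with fixed column names:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ToDataFrame(BaseEstimator, TransformerMixin):
    """Hypothetical pipeline step: wrap a NumPy array back into a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)
```

It would then slot in as `make_pipeline(SMOTENC(...), ToDataFrame(columns), LGBMClassifier())`; whether this helps depends on the downstream estimator actually inspecting column names rather than coercing its input back to an array.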
