
ValueError: could not convert string to float: 'aaa' #193

Closed
simonm3 opened this issue Nov 23, 2016 · 22 comments

Comments

@simonm3

simonm3 commented Nov 23, 2016

I have imbalanced classes with 10,000 1s and 10 million 0s. I want to undersample before I convert category columns to dummies, to save memory. I expected the sampler to ignore the content of x and select randomly based on y, but instead I get the error above. What am I not understanding, and how do I do this without converting category features to dummies first?

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

clf_sample = RandomUnderSampler(ratio=.025)
x = pd.DataFrame(np.random.random((100, 5)), columns=list("abcde"))
x.loc[:, "b"] = "aaa"  # a string column triggers the error
clf_sample.fit(x, y.head(100))  # y: a pandas Series of 0/1 labels, defined elsewhere
@glemaitre
Member

This could be due to pandas; I will check that. You should also know that we do not support pandas, and the resampled x and y will not be of DataFrame type.

I just tried the following NumPy example, which seems to work fine:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

@glemaitre
Member

@simonm3 What is y in your example?

@simonm3
Author

simonm3 commented Nov 23, 2016

y is just a list of integers that are 1 or 0; it is fine. You are correct that it is because of pandas. But if I convert the DataFrame to values it does not work either! The two arrays compare equal, yet the NumPy one has dtype "<U3" while the pandas one has dtype "object".

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab")).values
print(np.array_equal(x, x1))
print(x1.dtype, x.dtype)

y = np.array([0] * 10 + [1] * 90)
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x1, y)

@simonm3
Author

simonm3 commented Nov 23, 2016

Here is a solution:

import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# create test data
x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

###### numpy test #####
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

###### pandas test #####
# convert pandas to numpy
x1 = x1.reset_index()
savedcols = x1.columns
x1 = x1.values.astype("U")

rus = RandomUnderSampler()
x1_res, y1_res = rus.fit_sample(x1, y)

# convert back to pandas
x1_res = pd.DataFrame(x1_res, columns=savedcols).set_index("index")
x1_res

@simonm3 simonm3 closed this as completed Nov 23, 2016
@simonm3
Author

simonm3 commented Nov 23, 2016

Though now all the numeric columns are converted to strings! An alternative is to pass just the index of my DataFrame to the sampler, then select the rows from the result. That should work... unless you can think of a better solution.

@simonm3 simonm3 reopened this Nov 23, 2016
@dvro
Member

dvro commented Nov 23, 2016

@simonm3 just save the column names and also the column types.

I changed your script below; let me know how it works for you:

import pandas as pd
import numpy as np

from imblearn.under_sampling import RandomUnderSampler

# create test data
x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
x1 = pd.DataFrame(x, columns=list("ab"), index=range(1000, 1100))
# adding numeric column
x1['n'] = np.arange(x1.shape[0])

y = np.array([0] * 10 + [1] * 90)

print(x1)

###### numpy test #####
rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)

###### pandas test #####
# convert pandas to numpy
x1 = x1.reset_index()
col_names = x1.columns
col_types = x1.dtypes

x1 = x1.values.astype("U")

rus = RandomUnderSampler()
x1_res, y1_res = rus.fit_sample(x1, y)

# convert back to pandas
x1_res = pd.DataFrame(x1_res)

x1_res.columns = col_names
for col_name, col_type in zip(col_names, col_types):
    x1_res[col_name] = x1_res[col_name].astype(col_type)

print(x1_res)
print(x1_res.dtypes)

The final result keeps the types:

[100 rows x 3 columns]
    index    a    b   n
0    1000  aaa  bbb   0
1    1001  aaa  bbb   1
2    1002  aaa  bbb   2
3    1003  aaa  bbb   3
4    1004  aaa  bbb   4
5    1005  aaa  bbb   5
6    1006  aaa  bbb   6
7    1007  aaa  bbb   7
8    1008  aaa  bbb   8
9    1009  aaa  bbb   9
10   1058  aaa  bbb  58
11   1082  aaa  bbb  82
12   1082  aaa  bbb  82
13   1066  aaa  bbb  66
14   1031  aaa  bbb  31
15   1044  aaa  bbb  44
16   1074  aaa  bbb  74
17   1053  aaa  bbb  53
18   1034  aaa  bbb  34
19   1045  aaa  bbb  45
index     int64
a        object
b        object
n         int64
dtype: object

@chkoar
Member

chkoar commented Nov 24, 2016

@simonm3 you could pass the index, as you said:

import numpy as np
import pandas as pd

from imblearn.under_sampling import RandomUnderSampler

# create test data
X = np.array([['aaa'] * 100, ['bbb'] * 100]).T
X_df = pd.DataFrame(X, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

# numpy test
X_res1, y_res1 = RandomUnderSampler().fit_sample(X, y)

#  pandas test
X_i = X_df.index.values.reshape(-1, 1)
_, _, i = RandomUnderSampler(return_indices=True).fit_sample(X_i, y)
X_res2, y_res2 = X_df.iloc[i, :], y[i]

@dvro @glemaitre we could implicitly support pandas

@simonm3
Author

simonm3 commented Nov 24, 2016 via email

@simonm3
Author

simonm3 commented Nov 24, 2016 via email

@chkoar
Member

chkoar commented Nov 24, 2016

@simonm3 obviously I proposed a solution according to your example and your usage of RUS.

@glemaitre
Member

I am closing this issue. Feel free to re-open if needed

@TenzinJhopee

TenzinJhopee commented Mar 20, 2017

from math import sqrt
import warnings
from collections import Counter
import numpy as np
import pandas as pd
import random

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to value less than total voting group STUPID!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features)-np.array(predict))
            distances.append([euclidean_distance, group])
            
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    print(Counter(votes).most_common(1))
          
    return vote_result

df = pd.read_csv('breast-cancer-wisconsin.data.txt', error_bad_lines=False)    
df.replace('?', -99999, inplace=True)
df.drop(['id'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
print(full_data[:10])
ValueError: could not convert string to float: 'Class'

Please help me to find a solution guys

@glemaitre
Member

I will be very straightforward:

  • this part of GitHub is an issue tracker; post only issues related to the software;
  • this GitHub page is for imbalanced-learn, which has nothing to do with your issue;
  • always read the contributing and issue guidelines when raising an issue. You will learn that you should wrap code in triple backquotes for readability.

@shaypal5

Hello!

I'm getting this error with imblearn v0.3.3 when trying to use RandomUnderSampler.fit_sample() when X includes a column with string values.

The problem is caused due to sklearn.utils.check_X_y being called in the following form:
check_X_y(X, y, accept_sparse=['csr', 'csc'])
Since the dtype parameter is not specified explicitly, it is set to "numeric" by default, as detailed in the function's documentation here:
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/validation.py#L479

As such, the defined behavior of check_X_y in this case is: "If 'numeric', dtype is preserved unless array.dtype is object."
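That default can be demonstrated directly with scikit-learn's validator; a minimal sketch (the array contents here are made up for illustration):

```python
import numpy as np
from sklearn.utils import check_X_y

# object-dtype strings, as DataFrame.values gives for mixed/string columns
X = np.array([['aaa', 'bbb']] * 6, dtype=object)
y = np.array([0, 0, 1, 1, 1, 1])

# Default dtype="numeric": an object array is cast to float, which fails on strings.
try:
    check_X_y(X, y, accept_sparse=['csr', 'csc'])
except ValueError as err:
    print(err)  # e.g. could not convert string to float: 'aaa'

# With dtype=None the input dtype is preserved and the strings pass through.
X_ok, y_ok = check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None)
print(X_ok.dtype)  # object
```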

I've cloned your repo and had to add dtype=None to the call to check_X_y in both SamplerMixin.sample() and BaseSampler.fit() to get RandomUnderSampler to work with string data.

Since prototype selection methods, unlike prototype generation methods, can support any kind of data, I think this check should not be forced for such methods.

A possible design is to add a _check_X_y method to SamplerMixin or BaseSampler which will call sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc']), and have prototype selection methods override this method with a version which will instead call sklearn.utils.check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype=None)

Whatever the design, if one can be agreed on / you can advise me on one, I don't mind writing it myself and opening a pull request. That is, assuming you agree with me that non-numeric data should be allowed for prototype selection methods.

Cheers, (and what a great package!)
Shay

@glemaitre
Member

glemaitre commented Apr 15, 2018 via email

@glemaitre
Member

glemaitre commented Apr 15, 2018 via email

@shaypal5

Cool, so for starters just these two can override the default way; but for that we need to have an override-able property that determines it - I think a function is the most proper way to do that.

(Also, would you mind editing your previous comment in this post and remove the quote of my entire post + some code that looks like it comes from email metadata? It makes it, and the entire thread, a bit unreadable)

@glemaitre
Member

Probably the way to go would be to improve the _check_X_y: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/base.py#L32

Now that I see it, I don't like that _check_X_y is only checking the hash. The check_X_y of scikit-learn should presumably go there. At first glance, I think we could have something like:

def _check_X_y(self, X, y):
    return check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype='numeric')

def check_inputs(self, X, y):
    if hasattr(self, 'ratio_'):
        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)
        X, y = self._check_X_y(X, y)
        X_hash, y_hash = hash_X_y(X, y)
        if self._X_hash != X_hash or self._y_hash != y_hash:
            raise RuntimeError("X and y need to be the same arrays "
                               "that were fitted earlier.")
        return X, y, binarize_y
    else:
        y = check_target_type(y)
        X, y = self._check_X_y(X, y)
        self._X_hash, self._y_hash = hash_X_y(X, y)
        return X, y

@shaypal5

Oh, right! I forgot it already exists!

Yes, I wholeheartedly agree that the call to scikit's check_X_y should happen there!

@shaypal5

I'll get to working on it later this week - I'll start with the version you've outlined above + overriding it in the random samplers, then I'll see if tests are passing and think about additional tests.

@glemaitre
Member

Solved in master for RandomUnderSampler and RandomOverSampler.
We just have to wait for scikit-learn 0.20 so that we can also release 0.4.

@Yevgnen

Yevgnen commented Aug 9, 2019

How can I set back the DataFrame type in a pipeline?

make_pipeline(SMOTENC(), LGBMClassifier())

I've tried writing a dummy transformer to turn it back into a DataFrame in the middle of the pipeline, but it didn't work:

make_pipeline(SMOTENC(), FixType(), LGBMClassifier())
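One way such a step can be written, sketched below with a hypothetical ToDataFrame transformer (the name is mine, not part of imbalanced-learn or scikit-learn); it is stateless and simply wraps the incoming array with fixed column names:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ToDataFrame(BaseEstimator, TransformerMixin):
    """Hypothetical pipeline step: wrap a NumPy array back into a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)
```

It would then slot in as `make_pipeline(SMOTENC(...), ToDataFrame(columns), LGBMClassifier())`; whether this helps depends on the downstream estimator actually inspecting column names rather than coercing its input back to an array.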
