ValueError: could not convert string to float: 'aaa' #193
Comments
This could be due to pandas; I will check that. You should also know that we do not support pandas. I just tried the following example with plain numpy arrays:

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

rus = RandomUnderSampler()
x_res, y_res = rus.fit_sample(x, y)
```
@simonm3 What is `y`?
y is just a list of integers that are 1 or 0, so that part is fine. You are correct that it is because of pandas, but even if I convert the pandas frame to values it does not work. The two arrays are equal; however, the numpy one has dtype "<U3" and the pandas one has dtype "O" (object).
Here is a solution:
Though now all the number columns are converted to strings! An alternative is to just pass the index of my dataframe to the sampler, then select the rows from the result. That should work… unless you can think of a better solution.
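The index-passing idea can be sketched without imblearn at all; here is a minimal numpy-only illustration of random undersampling on an index (the class counts and index range are invented for the example):

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data: 10 samples of class 0, 90 of class 1 (illustrative only)
y = np.array([0] * 10 + [1] * 90)
X_index = np.arange(1000, 1100)  # stand-in for a DataFrame index

# Randomly undersample every class down to the minority class count
minority_count = np.bincount(y).min()
keep = np.concatenate([
    rng.choice(np.where(y == cls)[0], size=minority_count, replace=False)
    for cls in np.unique(y)
])

# Select the surviving index labels; the string columns are never touched
X_index_res, y_res = X_index[keep], y[keep]
```

With a real DataFrame you would then do `df.loc[X_index_res]` to recover the balanced rows with their original dtypes intact.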
@simonm3 Just save the column names and also the column types. I changed your script below; let me know how it works for you.

The final result keeps the types:
@simonm3 You could pass the index, as you said:

```python
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# create test data
X = np.array([['aaa'] * 100, ['bbb'] * 100]).T
X_df = pd.DataFrame(X, columns=list("ab"), index=range(1000, 1100))
y = np.array([0] * 10 + [1] * 90)

# numpy test
X_res1, y_res1 = RandomUnderSampler().fit_sample(X, y)

# pandas test
X_i = X_df.index.values.reshape(-1, 1)
_, _, i = RandomUnderSampler(return_indices=True).fit_sample(X_i, y)
X_res2, y_res2 = X_df.iloc[i, :], y[i]
```

@dvro @glemaitre we could implicitly support pandas.
Not sure about that. According to the docs, `return_indices` only returns "samples randomly selected from the majority class." Also there is no `return_indices` for RandomOverSampler.

In the end I did this:

```python
xindex, y = clf.fit_sample(x.index.values.reshape(-1, 1), y)
x = x.loc[xindex.ravel()]
```
Oh, of course there won't be one for the oversampler, as there are new samples. I guess the best solution is just to create the dummies and pass the whole file in. If I run out of memory I will just take a sample.
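For reference, the "create the dummies first" route might look like this with `pd.get_dummies` (the frame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "red", "green"],
                   "size": [1, 2, 3, 4]})

# One-hot encode the categorical column; numeric columns pass through
dummies = pd.get_dummies(df, columns=["colour"])

# "colour" becomes colour_blue / colour_green / colour_red indicators,
# giving a fully numeric frame the samplers can consume
print(list(dummies.columns))
```

The memory cost grows with the number of distinct categories, which is exactly why doing this before undersampling 10 million rows can hurt.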
…On 24 November 2016 at 18:02, simon mackenzie ***@***.***> wrote:
Not sure about that. According to the docs return indicees only returns "samples
randomly selected from the majority class." Also there is no
return_indicees for RandomOverSampler.
In the end I did this:
xindex, y = clf.fit_sample(x.index.values.reshape(-1,1), y)
x = x.loc[xindex.ravel()]
On 24 November 2016 at 12:28, chkoar ***@***.***> wrote:
> @simonm3 <https://github.com/simonm3> pass the index as you said
>
> import numpy as npimport pandas as pd
> from imblearn.under_sampling import RandomUnderSampler
> # create test dataX = np.array([['aaa'] * 100, ['bbb'] * 100]).TX_df = pd.DataFrame(X, columns=list("ab"), index=range(1000, 1100))
> y = np.array([0] * 10 + [1] * 90)
> # numpy testX_res1, y_res1 = RandomUnderSampler().fit_sample(X, y)
> # pandas testX_i = X_df.index.values.reshape(-1, 1)
> _, _, i = RandomUnderSampler(return_indices=True).fit_sample(X_i, y)X_res2, y_res2 = X_df.iloc[i, :], y[i]
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#193 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/ABJN6fepIpnD3MemgeD5mEdFePsLQPMqks5rBYLYgaJpZM4K6vB9>
> .
>
|
@simonm3 Obviously I proposed a solution according to your example and the usage of RandomUnderSampler.
I am closing this issue. Feel free to re-open if needed.
```python
from math import sqrt
import warnings
from collections import Counter
import random

import numpy as np
import pandas as pd

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than the total number of voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    print(Counter(votes).most_common(1))
    return vote_result

df = pd.read_csv('breast-cancer-wisconsin.data.txt', error_bad_lines=False)
df.replace('?', -99999, inplace=True)
df.drop(['id'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()  # note: .values, not .value
print(full_data[:10])
```

This raises `ValueError: could not convert string to float: 'Class'`. Please help me find a solution.
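The stray `'Class'` string suggests a header row (or a second, concatenated header) ended up inside the data itself. One hedged way to locate the offending cells before calling `astype(float)`, shown here on a fabricated frame, is `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Fabricated frame: the last row imitates a header leaking into the data
df = pd.DataFrame({"a": ["1", "2", "Class"], "b": ["3", "4", "5"]})

# Coerce everything to numeric; non-numeric cells become NaN
numeric = df.apply(pd.to_numeric, errors="coerce")

# Rows containing any non-numeric cell (e.g. a stray header row)
bad_rows = df[numeric.isna().any(axis=1)]
print(bad_rows)

# Drop them and convert safely
clean = numeric.dropna().astype(float)
```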
I will be very straightforward:
Hello! I'm getting this error with The problem is caused due to As such, the defined behavior of I've cloned your repo and had to add

Since prototype selection methods, unlike prototype generation methods, can support any kind of data, I think this check should not be forced for such methods. A possible design is to add a

Whatever the design, if one can be agreed on / you can advise me on one, I don't mind writing it myself and opening a pull request. That is, assuming you agree with me that non-numeric data should be allowed for prototype selection methods.

Cheers (and what a great package!)
That could be nice. We could accept a PR and check that the scikit-learn estimator checks pass. If yes, we need to add new common tests. But it seems a good idea.
Thinking about it a bit more: anything computing distances with kNN cannot use non-numeric data. So, off the top of my head, this is mainly true for the random over-sampler and under-sampler.
Cool, so for starters just these two can override the default behaviour; but for that we need an overridable property that determines it. I think a function is the most proper way to do that. (Also, would you mind editing your previous comment in this post to remove the quote of my entire post, plus some code that looks like it comes from email metadata? It makes it, and the entire thread, a bit unreadable.)
Probably the way to go would be to improve the Now that I see it, I don't like that

```python
def _check_X_y(self, X, y):
    return check_X_y(X, y, accept_sparse=['csr', 'csc'], dtype='numeric')

def check_inputs(self, X, y):
    if hasattr(self, 'ratio_'):
        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)
        X, y = self._check_X_y(X, y)
        X_hash, y_hash = hash_X_y(X, y)
        if self._X_hash != X_hash or self._y_hash != y_hash:
            raise RuntimeError("X and y need to be the same arrays as "
                               "those fitted earlier.")
        return X, y, binarize_y
    else:
        y = check_target_type(y)
        X, y = self._check_X_y(X, y)
        self._X_hash, self._y_hash = hash_X_y(X, y)
        return X, y
```
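The hash-guard idea in `check_inputs` can be illustrated standalone. This is a simplified sketch, not imbalanced-learn's actual implementation; `hash_X_y` here is a paraphrase and `GuardedSampler` is an invented stand-in:

```python
import hashlib
import numpy as np

def hash_X_y(X, y):
    # Hash the raw bytes of both arrays so sample() can verify it
    # received the same data that fit() saw
    h = hashlib.sha256()
    h.update(np.ascontiguousarray(X).tobytes())
    h.update(np.ascontiguousarray(y).tobytes())
    return h.hexdigest()

class GuardedSampler:
    def fit(self, X, y):
        self._hash = hash_X_y(X, y)
        return self

    def sample(self, X, y):
        if hash_X_y(X, y) != self._hash:
            raise RuntimeError("X and y must be the same arrays passed to fit.")
        return X, y  # a real sampler would resample here

X = np.array([[1.0], [2.0]])
y = np.array([0, 1])
GuardedSampler().fit(X, y).sample(X, y)  # same data: no error
```

The point of the design discussion above is that only the dtype validation (the `check_X_y` call), not this guard, needs to be overridable per sampler.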
Oh, right! I forgot it already exists! Yes, I wholeheartedly agree that the call to scikit's `check_X_y` should be overridable.
I'll get to working on it later this week. I'll start with the version you've outlined above, plus overriding it in the random samplers; then I'll see if tests are passing and think about additional tests.
Solved in master for RandomUnderSampler and RandomOverSampler.
How can I get the original type back after resampling? I've tried to write a dummy transformer to transform it back.
I have imbalanced classes with 10,000 ones and 10 million zeros. I want to undersample before I convert category columns to dummies, to save memory. I expected the sampler to ignore the content of x and select randomly based on y; however, I get the above error. What am I not understanding, and how do I do this without converting category features to dummies first?
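One way to realise that workflow (undersample on the index first, encode afterwards) can be sketched with pandas alone; the frame, column name, and class counts below are fabricated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# Fabricated imbalanced data: 10 positives, 90 negatives
df = pd.DataFrame({"cat": rng.choice(["a", "b", "c"], size=100)})
y = pd.Series([1] * 10 + [0] * 90, index=df.index)

# Undersample each class down to the minority count using only the
# index labels -- the string column is never touched, so no float
# conversion is ever attempted
n_min = y.value_counts().min()
keep = pd.concat([y[y == cls].sample(n_min, random_state=0)
                  for cls in y.unique()]).index

balanced = df.loc[keep]

# Only now convert categories to dummies, on the much smaller frame
encoded = pd.get_dummies(balanced, columns=["cat"])
```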