Error while performing Binary Relevance or Label Powerset #89

Closed
ansin218 opened this issue Dec 11, 2017 · 7 comments
@ansin218

I am trying to perform a simple classification using Binary Relevance or Label Powerset. I consistently hit the error below, even when I first convert the inputs into sparse matrices. How do I overcome this?

Here is my code:

import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset
from sklearn.metrics import f1_score

data = pd.read_csv("a_lucene_results.csv")
y = data[['isA','isB','isC']]
to_drop = ['id','isA','isB','isC']
X = data.drop(to_drop,axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
#X_train = sparse.csr_matrix(X_train)   # Here's the initialization of the sparse matrix.
#X_test = sparse.csr_matrix(X_test)
#y_train = sparse.csr_matrix(y_train)   # Here's the initialization of the sparse matrix.
#y_test = sparse.csr_matrix(y_test)
clf = BinaryRelevance(GaussianNB())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

However, I always get this:

Traceback (most recent call last):
  File "exp2.py", line 26, in <module>
    clf.fit(X_train, y_train)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/skmultilearn/problem_transform/br.py", line 60, in fit
    X, sparse_format='csr', enforce_sparse=True)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/skmultilearn/base/base.py", line 97, in ensure_input_format
    return matrix_creation_function_for_format(sparse_format)(X)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 79, in __init__
    self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 32, in __init__
    arg1 = arg1.asformat(self.format)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/base.py", line 287, in asformat
    return getattr(self, 'to' + format)()
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/coo.py", line 342, in tocsr
    data = np.empty_like(self.data, dtype=upcast(self.dtype))
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/sputils.py", line 51, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)
@ChristianSch
Member

As in #88, I cannot reproduce your problem; it seems to be caused by your data.

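from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(sparse=True, n_labels=5, return_indicator='sparse', allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = BinaryRelevance(GaussianNB())
clf.fit(X_train, y_train)  # the classifier must be fitted before predicting
y_pred = clf.predict(X_test)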
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))
# The macro averaged F1-score is: 0.785

Without knowing the data, I assume your dataframe contains an unusable column, which is what causes the exception (dtype('O')). If pandas can't determine the type of a column (because it contains mixed data, or structured data such as JSON), it falls back to object.
I'm not sure what train_test_split returns for your input, but it seems to return dataframes, which are subsequently converted to sparse matrices. Since there is no supported conversion for the object dtype, it crashes.

I strongly recommend setting the proper column types in pandas right after loading the data set, to avoid mistakes further down the rabbit hole (such as assuming a Gaussian distribution on the features, as you do with GaussianNB!).
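A minimal sketch of that check, assuming a small hypothetical frame in place of the real CSV (the column names `id`, `sentence`, and `isA` are made up for illustration). It lists the feature columns that are not numeric, which are the ones that cannot be converted to a sparse matrix as-is:

```python
import pandas as pd

# Hypothetical frame standing in for the CSV from the question.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "sentence": ["first example", "second example", "third example"],
    "isA": [1, 0, 1],
})

# Show the dtype pandas inferred for every column.
print(data.dtypes)

# Drop the id and label columns, then list the features that are not numeric.
features = data.drop(["id", "isA"], axis=1)
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric feature columns:", non_numeric)
```

Any column reported here needs to be encoded or vectorized before it is handed to a classifier.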

Without the data, I'm afraid I can't help you. I'm pretty sure, though, that it's an error on your side. Is the data publicly available?

PS: @souravsingh's issue #88 seems to use the same code as yours. If you two work together, I'd propose closing one issue to keep the open issues at a minimum.

@ChristianSch ChristianSch self-assigned this Dec 11, 2017
@ansin218
Author

ansin218 commented Dec 11, 2017

@ChristianSch : Yes, it's similar to his. However, I did try changing the type after loading the dataset. I also tried converting the data into a list and then into an ndarray or a sparse matrix. It still throws an error about dtype('O'), which is object. Also, the documentation states that using dtype=str will result in dtype=object. The column that I am using to train (a bunch of sentences) is entirely of type object.

@souravsingh Could you please post a snippet of how you got it working? Because I still cannot.

I can upload my raw dataset if you need me to.

Edit: I have also tried it with DecisionTree and RandomForest!

@souravsingh

My code is something like this:

import pandas as pd
import numpy as np
from sklearn.svm import SVC
#from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import GaussianNB
from modlamp.sequences import MixedLibrary
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
import xgboost as xgb
from mlxtend.classifier import StackingClassifier
data = pd.read_csv("full_dataset.csv")
y = data[['antiviral','antifungal','antibacterial']]
to_drop = ['# ID','Sequence','antiviral', 'antibacterial', 'antifungal']
X = data.drop(to_drop,axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LabelPowerset(xgb.XGBClassifier(n_estimators=500))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

While the code doesn't work with XGBoost as base classifier for some reason, it works with all the scikit-learn classifiers.

@ansin218
Author

That is exactly the same. Except that I have been trying to do it with scikit-learn classifiers.

I am assuming your columns antiviral, antifungal and antibacterial contain binary values, and that X contains your feature columns (in my case it is only one column with sentences). Could you please tell me where I might be going wrong? Otherwise, we could inspect my dataset if you wish.

@ChristianSch
Member

If you provide some data and the relevant preprocessing code, I'll take it for a spin!

@ansin218
Author

ansin218 commented Dec 11, 2017

Here you go, this is the code:

import pandas as pd
import numpy as np
import scipy
import scipy.sparse as sp
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
data = pd.read_csv("text.csv", dtype={'sentence':np.str_ })
print(data.info())
y = data[['isIssue','isAlternative','isPro','isCon','isDecision']]
to_drop = ['id','isIssue','isAlternative','isPro','isCon','isDecision']
X = data.drop(to_drop, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = BinaryRelevance(RandomForestClassifier())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

I have attached the CSV file as well.
text.csv.zip

@ChristianSch
Member

ChristianSch commented Dec 11, 2017

Well, you can't just throw your model at textual data. The sentence variable seems to consist solely of unique values, so it is not categorical data (which could be transformed via pd.get_dummies or something similar). You need to transform the text into vector representations in order to use it with any model.

See here for reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

This is also not related to scikit-multilearn, so I'm closing the issue.
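As a rough sketch of the vectorization step described above, the snippet below turns a text column into a numeric sparse matrix with scikit-learn's TfidfVectorizer. The sample sentences and the two label columns are invented for illustration and only stand in for text.csv; TF-IDF is one possible representation, not necessarily the best one for this data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for text.csv: one free-text column plus binary labels.
data = pd.DataFrame({
    "sentence": [
        "We should cache the index on disk",
        "Caching makes startup much faster",
        "Disk caching complicates invalidation",
        "Decided to cache the index",
        "Should the analyzer be configurable",
        "A configurable analyzer adds flexibility",
    ],
    "isIssue": [1, 0, 0, 0, 1, 0],
    "isPro": [0, 1, 0, 0, 0, 1],
})

X_text = data["sentence"]              # a Series of strings, not a DataFrame
y = data[["isIssue", "isPro"]].values  # plain integer label matrix

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.33, random_state=0
)

# Learn the vocabulary on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Both matrices are now numeric and sparse instead of dtype object.
print(X_train.dtype, X_train.shape, X_test.shape)
```

The resulting X_train and X_test are what would be passed to clf.fit and clf.predict in place of the raw dataframes from the question.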
