Error while performing Binary Relevance or Label Powerset #89

ansin218 · 2017-12-11T16:56:01Z

I am trying to perform a simple classification using Binary Relevance or Label Powerset. I consistently encounter the error shown below, even after trying to convert the inputs into sparse matrices. How do I overcome this?

Here is my code:

import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset
from sklearn.metrics import f1_score

data = pd.read_csv("a_lucene_results.csv")
y = data[['isA','isB','isC']]
to_drop = ['id','isA','isB','isC']
X = data.drop(to_drop,axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
#X_train = sparse.csr_matrix(X_train)   # Here's the initialization of the sparse matrix.
#X_test = sparse.csr_matrix(X_test)
#y_train = sparse.csr_matrix(y_train)   # Here's the initialization of the sparse matrix.
#y_test = sparse.csr_matrix(y_test)
clf = BinaryRelevance(GaussianNB())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

However, I always get this:

Traceback (most recent call last):
  File "exp2.py", line 26, in <module>
    clf.fit(X_train, y_train)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/skmultilearn/problem_transform/br.py", line 60, in fit
    X, sparse_format='csr', enforce_sparse=True)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/skmultilearn/base/base.py", line 97, in ensure_input_format
    return matrix_creation_function_for_format(sparse_format)(X)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 79, in __init__
    self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/compressed.py", line 32, in __init__
    arg1 = arg1.asformat(self.format)
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/base.py", line 287, in asformat
    return getattr(self, 'to' + format)()
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/coo.py", line 342, in tocsr
    data = np.empty_like(self.data, dtype=upcast(self.dtype))
  File "/home/ankur218/anaconda3/lib/python3.5/site-packages/scipy/sparse/sputils.py", line 51, in upcast
    raise TypeError('no supported conversion for types: %r' % (args,))
TypeError: no supported conversion for types: (dtype('O'),)

ChristianSch · 2017-12-11T20:35:44Z

As in #88, I cannot reproduce your problem; it seems to be caused by your data.

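from sklearn.datasets import make_multilabel_classification  # remaining imports as in the original post

# Synthetic multi-label data with sparse features and a sparse label indicator matrix
X, y = make_multilabel_classification(sparse=True, n_labels=5, return_indicator='sparse', allow_unlabeled=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = BinaryRelevance(GaussianNB())
clf.fit(X_train, y_train)  # the classifier must be fitted before predicting
y_pred = clf.predict(X_test)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))
# The macro averaged F1-score is: 0.785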

Without knowing the data, I assume your dataframe contains an unusable column, which causes the exception (dtype('O')). If pandas can't determine the type of a column (because it contains mixed data, or structured data such as JSON), it falls back to object.
I don't know exactly what train_test_split returns for your input, but it seems to return dataframes, which are subsequently converted to sparse matrices. Since the conversion doesn't know what to do with an object column, it crashes.

I strongly recommend setting the proper column types in pandas after loading the data set, to avoid doing something wrong further down the rabbit hole (such as assuming a Gaussian distribution on the features, as you do with GaussianNB!).
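To make the type check concrete, here is a minimal sketch of how one might spot the offending columns before fitting. The dataframe below is a hypothetical stand-in for the CSV in the question (the sentence column and its contents are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question: one free-text
# column plus binary label columns. The real file may look different.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "sentence": ["we should use X", "X is too slow", "use Y instead"],
    "isA": [1, 0, 0],
    "isB": [0, 1, 1],
})

X = data.drop(["id", "isA", "isB"], axis=1)

# Any feature column that is not numeric cannot be turned into a
# numeric sparse matrix as-is and has to be encoded first.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()

print(X.dtypes)
print("Non-numeric feature columns:", non_numeric)
```

If the list is not empty, those columns need to be encoded or dropped before the data is passed to fit.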

Without the data I'm afraid I can't help you--I'm pretty sure though it's an error on your side. Is it publicly available?

PS: @souravsingh's issue #88 seems to use the same code as yours. If you two work together, I'd propose closing one issue to keep the number of open issues to a minimum.

ansin218 · 2017-12-11T20:49:29Z

@ChristianSch : Yes, it's similar to his. However, I did try to change the type after loading the dataset. I also tried converting the columns into a list and then into an ndarray or a sparse matrix. However, it still throws the error about dtype('O'), which is an object. Also, the documentation states that using dtype=str will result in dtype=object. The column that I am using to train (a bunch of sentences) is entirely of type object.

@souravsingh Could you please post a snippet of how you got it working? Because I still cannot.

I can upload my raw dataset if you guys need me to.

Edit: I have also tried it with DecisionTree and RandomForest!

souravsingh · 2017-12-11T20:53:15Z

My code is something like this-

import pandas as pd
import numpy as np
from sklearn.svm import SVC
#from sklearn.multioutput import ClassifierChain
from sklearn.naive_bayes import GaussianNB
from modlamp.sequences import MixedLibrary
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
import xgboost as xgb
from mlxtend.classifier import StackingClassifier
data = pd.read_csv("full_dataset.csv")
y = data[['antiviral','antifungal','antibacterial']]
to_drop = ['# ID','Sequence','antiviral', 'antibacterial', 'antifungal']
X = data.drop(to_drop,axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
clf = LabelPowerset(xgb.XGBClassifier(n_estimators=500))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

While the code doesn't work with XGBoost as base classifier for some reason, it works with all the scikit-learn classifiers.

ansin218 · 2017-12-11T20:58:03Z

That is exactly the same. Except that I have been trying to do it with scikit-learn classifiers.

I am assuming your columns antiviral, antifungal and antibacterial are columns with binary values, and that X contains the columns with your features (in my case it is only one column with sentences). Could you please tell me where I am possibly going wrong? If not, we could inspect my dataset if you wish.

ChristianSch · 2017-12-11T21:06:37Z

If you provide some data and the relevant preprocessing code, I'll take it for a spin!

ansin218 · 2017-12-11T21:21:54Z

Here you go, this is the code:

import pandas as pd
import numpy as np
import scipy
import scipy.sparse as sp
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset, ClassifierChain
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
data = pd.read_csv("text.csv", dtype={'sentence':np.str_ })
print(data.info())
y = data[['isIssue','isAlternative','isPro','isCon','isDecision']]
to_drop = ['id','isIssue','isAlternative','isPro','isCon','isDecision']
X = data.drop(to_drop, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = BinaryRelevance(RandomForestClassifier())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("The macro averaged F1-score is: %.3f" %(f1_score(y_pred, y_test, average='macro')))

The CSV file is attached:
text.csv.zip

ChristianSch · 2017-12-11T21:42:25Z

Well, you can't just throw your model at textual data. The sentence column seems to consist solely of unique values, so it is not categorical data (which could be transformed via pd.get_dummies or something similar). You need to transform the text into vector representations in order to use it with any model.

See here for reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
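As a rough sketch of what that could look like (one option among several, not the only way to do it), the snippet below turns the sentences into TF-IDF vectors with scikit-learn's TfidfVectorizer. The sentences and label values are made up for illustration; only the column names are taken from the code above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up example rows; the real text.csv will differ.
data = pd.DataFrame({
    "sentence": [
        "should we cache the index",
        "we decided to cache the index",
        "caching makes lookups faster",
        "caching uses more memory",
        "which parser should we use",
        "we will use the default parser",
    ],
    "isIssue": [1, 0, 0, 0, 1, 0],
    "isDecision": [0, 1, 0, 0, 0, 1],
})

X_text = data["sentence"]
y = data[["isIssue", "isDecision"]].values

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.33, random_state=0
)

# Learn the vocabulary on the training sentences only, then reuse it
# for the test sentences so both matrices have the same columns.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

print(X_train.dtype, X_train.shape, X_test.shape)
```

X_train and X_test are now numeric sparse matrices, so they can be passed to clf.fit and clf.predict in place of the raw dataframe.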

This is also not related to scikit-multilearn, so I will close the issue afterwards.

ChristianSch self-assigned this Dec 11, 2017

ChristianSch closed this as completed Dec 13, 2017
