Solver Needs Samples of at least two classes #1

Closed
ghost opened this issue Mar 5, 2018 · 2 comments


ghost commented Mar 5, 2018

I came across your Jupyter notebook and was pleased to find solutions to a problem that had been giving me headaches: classifying data from a dataframe whose columns hold numeric attributes. I have data similar to yours and modified your code for my dataset, but it's not working. Your data has a column labelled "Type", which is just an array of ones.

Whenever I run your code on my dataset, I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: '1'

Do you know why this error comes up in my case when it doesn't in yours? I also tried the code from your webpage, which differs from the code here on GitHub on the following line:
website code: mask = np.random.rand(len(df)) &lt; ratio (an error comes up because lt is not defined anywhere in the code)
github code: mask = np.random.rand(len(df)) < ratio

When I run the code given on your website and make the above change (removing "&lt; ratio" and adding "< ratio"), the error changes to KeyError: "Type"
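
The `&lt` mentioned above is the HTML entity for `<`. Pasted into Python, `&lt; ratio` is read as a bitwise AND with a name `lt`, which is why `lt` is reported as undefined. As a minimal sketch (the variable names here are illustrative, not from the notebook), the standard library can decode such a copied line:

```python
import html

# A line as it might look when copied from a page showing raw HTML entities.
copied_line = "mask = np.random.rand(len(df)) &lt; ratio"

# html.unescape turns entities such as &lt;, &gt; and &amp; back into characters.
fixed_line = html.unescape(copied_line)
print(fixed_line)
```

This only repairs the text of the line; it does not say anything about the later KeyError.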

Do you know how I can solve this? Thanks in advance for the help.

Here is my code for the dataframe preprocessing
diffreport.txt

import warnings; warnings.simplefilter("ignore")
#importing important libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import csv

df = pd.read_csv("diffreport.csv", sep= ",")

d1 = df.drop("name", axis = 1)
d2 = d1.drop("isotopes", axis = 1)
d3 = d2.drop("adduct", axis = 1)
d4 = d3.drop("tstat", axis = 1)
d5 = d4.drop("pvalue", axis = 1)
d6 = d5.drop("fold", axis = 1)
d7 = d6.drop(d6.columns[0], axis = 1)
d8 = d7.drop("npeaks", axis = 1)
d9 = d8.drop("Eta6", axis = 1)
d10 = d9.drop("Eta8", axis = 1)
columns = ['Eta6_0', 'Eta6_2', 'Eta6_3', 'Eta8.1', 'Eta82', 'Eta83']
df1 = pd.DataFrame(d10, columns = columns)
df1['Type'] = "1"

The rest of my code is similar to yours, but I have pasted it below for clarity:
import time
import pandas as pd
import numpy as np

import pickle

# Some modules for plotting and visualizing

import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# And some Machine Learning modules from scikit-learn

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
    #"AdaBoost": AdaBoostClassifier(),
    #"QDA": QuadraticDiscriminantAnalysis(),
    #"Gaussian Process": GaussianProcessClassifier()
}

def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 5, verbose = True):
    """Fit the first no_classifiers models in dict_classifiers and record scores and timings."""
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:no_classifiers]:
        # time.clock() was removed in Python 3.8; perf_counter() is the replacement.
        t_start = time.perf_counter()
        classifier.fit(X_train, Y_train)
        t_end = time.perf_counter()

        t_diff = t_end - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)

        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score, 'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models

def label_encode(df, list_columns):
    """
    This method label-encodes (maps each distinct value to an integer)
    all columns specified in list_columns, modifying df in place.
    """
    for col in list_columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le.fit(col_values_unique)

        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

def expand_columns(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

def get_train_test(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]

    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    X_train = df_train[x_cols].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test

def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]

    df_ = pd.DataFrame(data=np.zeros(shape=(len(cls),4)), columns = ['classifier', 'train_score', 'test_score', 'train_time'])
    for ii in range(0,len(cls)):
        df_.loc[ii, 'classifier'] = cls[ii]
        df_.loc[ii, 'train_score'] = training_s[ii]
        df_.loc[ii, 'test_score'] = test_s[ii]
        df_.loc[ii, 'train_time'] = training_t[ii]

    display(df_.sort_values(by=sort_by, ascending=False))

def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0,len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8,8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [abs value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()

# Classification

y_col_glass = 'Type'
x_cols_glass = list(df1.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df1, y_col_glass, x_cols_glass, train_test_ratio)

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8)
display_dict_models(dict_models)
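
Because the code above sets every row of 'Type' to "1", the label column holds a single class, which matches the error message quoted earlier. A minimal sketch of a pre-fit check, assuming nothing beyond the standard library (the helper name `check_label_classes` is hypothetical, not part of the notebook or of scikit-learn):

```python
from collections import Counter

def check_label_classes(labels, min_classes=2):
    """Return per-class counts; raise if fewer than min_classes distinct labels."""
    counts = Counter(labels)
    if len(counts) < min_classes:
        raise ValueError(
            "label column has {} distinct class(es): {}; need at least {}".format(
                len(counts), list(counts), min_classes))
    return counts

# Same situation as df1['Type'] = "1": every row carries the same label.
constant_labels = ["1"] * 10
try:
    check_label_classes(constant_labels)
except ValueError as err:
    print("problem:", err)

# A label column with several classes passes the check.
print(check_label_classes([1, 2, 2, 3, 1]))
```

With pandas, `df1['Type'].value_counts()` gives the same per-class counts directly and makes a constant label column easy to spot before any classifier is fitted.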

taspinar (Owner) commented Mar 7, 2018

Hi,
The 'Type' column does not only contain 1, but the values 1, 2, 3, 5, 6, 7.
It is the column containing the label each entry belongs to.

So in the code you'll have to replace 'Type' in y_col = 'Type' with the name of the column in your dataset that contains the class labels.
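
As a rough sketch of that change, assuming the dataset has a real label column (here given the hypothetical name 'sample_group', with made-up toy values), the feature and label arrays could be built like this:

```python
import pandas as pd

# Toy stand-in for the user's data; 'sample_group' and its values are
# hypothetical and only illustrate a label column with more than one class.
df = pd.DataFrame({
    "Eta6_0": [1.0, 2.0, 3.0, 4.0],
    "Eta6_2": [0.5, 0.1, 0.9, 0.3],
    "sample_group": ["A", "A", "B", "B"],
})

# Point y_col at the column that actually holds the class labels.
y_col = "sample_group"
x_cols = [c for c in df.columns if c != y_col]

X = df[x_cols].values
Y = df[y_col].values
print(x_cols, sorted(set(Y)))
```

The resulting `y_col` and `x_cols` can then be passed to `get_train_test` in place of the 'Type' based values.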

ghost (Author) commented Mar 9, 2018

Hi, thanks for the reply. I just realised where my problem was, it has to do with the way in which my class label column was generated. Thanks for looking out.

ghost closed this as completed Mar 9, 2018