# Classification Algorithm Comparison
When starting out in machine learning, it can be tough to choose an appropriate algorithm for prediction.

This kernel helps beginners compare various algorithms used for classification.

## What is classification?
Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
For example, spam detection by email service providers is a classification problem. It is a binary classification since there are only two classes: spam and not spam. A classifier uses training data to learn how the input variables relate to the class. In this case, known spam and non-spam emails are used as the training data. Once the classifier is trained accurately, it can be used to classify an unseen email.
Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing.
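
The idea of learning a mapping f from X to y can be sketched in a few lines. This is only an illustration: the two features (number of links, occurrences of the word "free") and all values are made up, and `LogisticRegression` is just one possible classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features per email: [number of links, count of the word "free"]
X_train = np.array([[0, 0], [1, 0], [0, 1], [7, 5], [9, 4], [8, 6]])
# Known labels for the training emails: 0 = not spam, 1 = spam
y_train = np.array([0, 0, 0, 1, 1, 1])

# Learn an approximation of the mapping f: X -> y from the labelled examples
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Classify two previously unseen emails
X_new = np.array([[0, 1], [9, 5]])
predictions = clf.predict(X_new)
print(predictions)
```

The classifier never sees the labels of `X_new`; it assigns each new email to one of the two discrete classes it learned from the training data.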

There are two types of learners in classification: lazy learners and eager learners.
* Lazy learners: Lazy learners simply store the training data and wait until test data appears. When it does, classification is conducted based on the most closely related data in the stored training data. Compared to eager learners, lazy learners spend less time training but more time predicting.
Ex. k-nearest neighbor, case-based reasoning
* Eager learners: Eager learners construct a classification model from the given training data before receiving data to classify. The model must commit to a single hypothesis that covers the entire instance space. Because of the model construction, eager learners take a long time to train and less time to predict.
Ex. Decision Tree, Naive Bayes, Artificial Neural Networks
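
The difference between the two kinds of learner can be sketched with two tiny hand-written classifiers. Both classes below are hypothetical illustrations, not library code. The lazy one is a 1-nearest-neighbor rule; the eager one is a nearest-centroid rule, chosen here only because it is short (it is not one of the examples listed above).

```python
import numpy as np

class LazyNearestNeighbor:
    """Lazy learner: fit() only stores the data; the work happens in predict()."""

    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X, dtype=float), np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Distance from every query point to every stored training point
        dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        return self.y_[dists.argmin(axis=1)]


class EagerNearestCentroid:
    """Eager learner: fit() builds a compact model (one centroid per class)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Only the centroids are consulted; the training data is no longer needed
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


# Made-up, well-separated toy data
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
X_new = [[0.5, 0.5], [5.5, 5.5]]

lazy_pred = LazyNearestNeighbor().fit(X_train, y_train).predict(X_new)
eager_pred = EagerNearestCentroid().fit(X_train, y_train).predict(X_new)
print(lazy_pred, eager_pred)
```

The lazy model keeps all six training points and compares against each of them at prediction time, while the eager model reduces them to two centroids during `fit`.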

(Source: https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623)

**The performance figures below give only a ballpark estimate of how a particular algorithm performs. The rest depends on how much tuning the practitioner does to get higher accuracy.**
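
As a rough sketch of what such tuning can look like, the snippet below compares a classifier with its default settings against one chosen by a small grid search. The data is synthetic (generated with `make_classification`, not the wine data used in this kernel), and the parameter grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the wine data set used in this kernel
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Accuracy with the library defaults
default_acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# Small, arbitrary grid of candidate settings, scored by 5-fold cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11, 21], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

print("default accuracy:", default_acc)
print("best parameters :", grid.best_params_)
print("tuned accuracy  :", grid.score(X_te, y_te))
```

Tuning does not always improve the test score, which is one more reason to read a single accuracy number as an estimate only.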

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv("../input/winequality-red.csv")
df.head()

In [None]:
# Getting the consolidated information about the dataset
df.info()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
sns.set_color_codes("pastel")
%matplotlib inline

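# Keyword arguments are used because recent seaborn versions no longer accept x and y positionally
sns.barplot(x="quality", y="alcohol", data=df)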

In [None]:
#relation of features with other features

plt.style.use('ggplot')
fig=plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True)

* The heatmap shows some interesting relations; for example, the column 'citric acid' is highly positively correlated with the column 'fixed acidity'.
* The target column 'quality' is strongly correlated with the alcohol content of the wine (which was to be expected).
* The bar plot shows an increase in the alcohol content of the wine as the quality increases.
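
The same relations can also be read off numerically instead of from the colors of the heatmap. The sketch below uses a made-up miniature table that only borrows the column names; the values are hypothetical and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical miniature table with column names borrowed from the wine data
toy = pd.DataFrame({
    "alcohol":     [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "citric acid": [0.0, 0.1, 0.3, 0.2, 0.4, 0.5],
    "quality":     [4, 5, 5, 6, 7, 8],
})

# Correlation of every feature with the target column, largest first
target_corr = toy.corr()["quality"].drop("quality").sort_values(ascending=False)
print(target_corr)
```

On the real data, replacing `toy` with `df` would give the 'quality' row of the heatmap as a sorted list.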

In [None]:
#import various ML algorithms to be used from the library

from sklearn.svm import SVC,NuSVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
classification_algos_name = ["SVC", "NuSVC", "KNeighborsClassifier", "GaussianNB", "MultinomialNB", "SGDClassifier", "LogisticRegression", "DecisionTreeClassifier",
                            "ExtraTreeClassifier", "QuadraticDiscriminantAnalysis", "LinearDiscriminantAnalysis", "RandomForestClassifier", "AdaBoostClassifier",
                            "GradientBoostingClassifier", "XGBClassifier"]
classification_algos=[SVC(),
                      NuSVC(nu=0.285),
                      KNeighborsClassifier(),
                      GaussianNB(),
                      MultinomialNB(),
                      SGDClassifier(),
                      LogisticRegression(),
                      DecisionTreeClassifier(),
                      ExtraTreeClassifier(),
                      QuadraticDiscriminantAnalysis(),
                      LinearDiscriminantAnalysis(),
                      RandomForestClassifier(),
                      AdaBoostClassifier(),
                      GradientBoostingClassifier(),
                      XGBClassifier()]

# Preprocessing and cleaning of the data set


In [None]:
df.isnull().sum()

Since the data set contains no missing values, no cleaning is required.

In [None]:
# Converting the discrete values of the quality column into categorical values

bins=(2,6.5,8)
category=["bad","good"]
df["quality"]=pd.cut(df["quality"],bins=bins, labels=category)
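
To see what these bins do, here is a small sketch on a hand-made series of quality scores. `pd.cut` uses right-closed intervals by default, so the bins are (2, 6.5] and (6.5, 8]; the sample scores below are chosen only for illustration.

```python
import pandas as pd

# Hand-made sample of quality scores
scores = pd.Series([3, 4, 5, 6, 7, 8])
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=["bad", "good"])
print(labels.tolist())  # ['bad', 'bad', 'bad', 'bad', 'good', 'good']

# Scores outside (2, 8] fall into no bin and become NaN
out_of_range = pd.cut(pd.Series([2, 9]), bins=(2, 6.5, 8), labels=["bad", "good"])
print(out_of_range.isna().tolist())  # [True, True]
```

Scores up to 6 are labelled "bad" and scores of 7 or 8 are labelled "good".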

In [None]:
le=LabelEncoder()
df["quality"]=le.fit_transform(df["quality"])
df["quality"].unique()
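
A short sketch of what `LabelEncoder` does with the two category names: classes are sorted alphabetically and numbered from zero, so "bad" becomes 0 and "good" becomes 1. The input list is a made-up sample.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["bad", "good", "bad", "good", "good"])

print(le.classes_.tolist())                   # ['bad', 'good']
print(encoded.tolist())                       # [0, 1, 0, 1, 1]
print(le.inverse_transform([0, 1]).tolist())  # ['bad', 'good']
```

`inverse_transform` is handy later for turning numeric predictions back into readable labels.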

In [None]:
df["quality"].value_counts()
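
The class counts matter when reading accuracy scores: if one class dominates, a model that always predicts that class already scores high. A minimal sketch of such a baseline, using scikit-learn's `DummyClassifier` on hypothetical labels with a 90/10 split (not the actual counts of this data set):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(baseline_acc)  # 0.9
```

Any real classifier should be judged against this kind of baseline rather than against zero.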

In [None]:
# train_test_split returns the feature splits first, then the target splits
X_train, X_test, y_train, y_test = train_test_split(df.drop("quality", axis=1), df["quality"], test_size = 0.25, random_state = 1)
print(X_train.shape, " ", y_test.shape)

In [None]:
accuracy_score_list = []
for model in classification_algos:
    # Fit on the training split, then score on the held-out test split
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    accuracy_score_list.append(accuracy_score(y_test, pred))
for name, score in zip(classification_algos_name, accuracy_score_list):
    print(name, " ", score)
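
A single train/test split gives one number per algorithm, and that number can shift with a different `random_state`. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration on synthetic data with two of the classifiers, not a rerun of the comparison above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the wine data set used in this kernel
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, model in models.items():
    # Five accuracy scores, one per held-out fold
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The standard deviation across folds gives a feel for how much of the gap between two algorithms is just noise from the split.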

In [None]:
from bokeh.io import output_notebook
data = pd.DataFrame({"algorithms": classification_algos_name, "accuracy_score": accuracy_score_list})
data['color'] = ['#440154', '#404387', '#29788E', '#22A784', '#79D151', '#FDE724','#30678D','#084594', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7', '#f7fbff']
output_notebook()

In [None]:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

source = ColumnDataSource(data=data)

# `width`/`height` replace `plot_width`/`plot_height`, which were removed in Bokeh 3.0
p = figure(x_range=data['algorithms'],
           y_range=(0,1),
           width = 800,
           height = 600,
           title = "Comparison",
           tools="hover",
           tooltips="@algorithms: @accuracy_score")
p.vbar(x='algorithms', top='accuracy_score',color= 'color',
       width=0.95, source=source)

p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 1.2  # rotation angle in radians
output_file('comparison.html')
show(p)