# Classification Of Wine
This is a simple classification task performed on the Wine dataset from the UCI Machine Learning Repository. I show that logistic regression performs very well on this dataset. You can find the dataset here: http://archive.ics.uci.edu/ml/datasets/Wine .
A description of the data is available on the UCI repository page.

In [60]:
import numpy as np
import pandas as pd

In [61]:
df=pd.read_csv("datasets/Winedata.txt")
print(df.head())

   class  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  \
0      1    14.23        1.71  2.43               15.6        127   
1      1    13.20        1.78  2.14               11.2        100   
2      1    13.16        2.36  2.67               18.6        101   
3      1    14.37        1.95  2.50               16.8        113   
4      1    13.24        2.59  2.87               21.0        118   

   Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29   
1           2.65        2.76                  0.26             1.28   
2           2.80        3.24                  0.30             2.81   
3           3.85        3.49                  0.24             2.18   
4           2.80        2.69                  0.39             1.82   

   Color intensity   Hue  OD280/OD315  Proline     
0             5.64  1.04         3.92        1065  
1             4.38  1.05         3.40        1050  
2 

In [62]:
y=df['class']
df.drop(columns=['class'], inplace=True)
X=np.array(df)
print(X.shape)

(178, 13)


In [63]:
#preprocessing the data
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler().fit(X)
newX=scaler.transform(X)
#print(newX)
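
To make the preprocessing step concrete, here is a minimal sketch of the standardization that `StandardScaler` applies to each column: subtract the column mean and divide by the population standard deviation. It uses only the five `Alcohol` values visible in the `df.head()` output above, so the numbers will not match the scaler fitted on all 178 rows. The helper name `standardize` is made up for this illustration.

```python
from statistics import mean, pstdev

# First five "Alcohol" values from the df.head() output above.
alcohol = [14.23, 13.20, 13.16, 14.37, 13.24]

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    return [(x - mu) / sigma for x in values]

z = standardize(alcohol)
print([round(v, 3) for v in z])
# After standardization the values have (approximately) zero mean and unit std.
print(abs(mean(z)) < 1e-9, abs(pstdev(z) - 1.0) < 1e-9)
```

Putting all features on the same scale matters here because columns such as `Proline` (values around 1000) would otherwise dominate distance-based models such as `KNeighborsClassifier` and `SVC`.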

In [64]:
#splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(newX,y,test_size=0.30,random_state=33)

In [65]:
#checking simple models on the dataset
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

clf1=SVC()
clf2=LogisticRegression()
clf3=KNeighborsClassifier()
clfs=[clf1,clf2,clf3]

In [66]:
#checking which classifier performs best
for clf in clfs:
    clf.fit(X_train,y_train)
    accuracy=clf.score(X_test,y_test)
    print(accuracy)
    

0.981481481481
0.981481481481
0.962962962963
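
These accuracies are easier to read as counts. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples, which gives 54 test samples and agrees with the total support of 54 in the classification reports further down.

```python
import math

n_samples = 178    # rows in the dataset, from X.shape
test_size = 0.30   # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up to whole samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print("test samples:", n_test, "training samples:", n_train)

# Convert a number of correct predictions into the accuracy printed above.
for correct in (53, 52):
    print(correct, "of", n_test, "correct ->", round(correct / n_test, 6))
```

So the two best classifiers each misclassify a single test wine, and the third misclassifies two. With a test set this small, a difference of one sample should not be over-interpreted.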


Now we will build a simple function to evaluate the classifiers. It uses K-fold cross-validation and reports the per-fold scores, along with their mean and standard error, to show how each classifier performs on the training data.
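
As a rough illustration of what K-fold cross-validation does with the sample indices, here is a standard-library sketch. The function `kfold_indices` is hypothetical and only mimics the partitioning idea: it will not reproduce scikit-learn's exact shuffle. It assumes 124 training samples (178 rows minus 54 held out for testing).

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle range(n_samples) and split it into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % k folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = kfold_indices(124, 5)  # assumed number of training samples
for i, test_idx in enumerate(folds):
    # Every fold serves once as the test set; the rest form the training set.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    print("fold", i, "train:", len(train_idx), "test:", len(test_idx))
```

Each sample is used for testing exactly once, and the classifier is retrained from scratch for every fold.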

In [67]:
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))

for clf in clfs:
    evaluate_cross_validation(clf, X_train, y_train, 5)

[ 1.          1.          0.96        1.          0.91666667]
Mean score: 0.975 (+/-0.017)
[ 1.          0.96        0.96        1.          0.95833333]
Mean score: 0.976 (+/-0.010)
[ 0.96        0.96        0.92        1.          0.91666667]
Mean score: 0.951 (+/-0.015)
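
The `+/-` value in each summary line is the standard error of the mean: the sample standard deviation of the fold scores (with `ddof=1`, the default of `scipy.stats.sem`) divided by the square root of the number of folds. As a check, this sketch recomputes the first summary line from the five printed fold scores using only the standard library. The helper name `standard_error` is introduced for this illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Fold accuracies printed above for the first classifier (SVC).
scores = [1.0, 1.0, 0.96, 1.0, 0.91666667]

def standard_error(values):
    """Sample standard deviation (ddof=1) divided by sqrt(n)."""
    return stdev(values) / sqrt(len(values))

print("Mean score: {0:.3f} (+/-{1:.3f})".format(
    mean(scores), standard_error(scores)))
# prints: Mean score: 0.975 (+/-0.017)
```

With only five folds the standard error is itself a rough estimate, so the three mean scores overlap considerably once the `+/-` ranges are taken into account.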


Now we will build another function that uses sklearn.metrics to show how the classifiers perform on the data by printing the confusion matrix and the classification report. These two tools are very useful for assessing a classifier's performance.
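
The two outputs are closely related: every figure in the classification report can be derived from the confusion matrix. The sketch below recomputes per-class precision and recall for the confusion matrix that the first classifier (SVC) prints in the output of the next cell. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes. The function name `precision_recall` is made up for this illustration.

```python
# Confusion matrix printed for the first classifier (SVC):
# rows are true classes 1-3, columns are predicted classes 1-3.
cm = [
    [15, 0, 0],
    [0, 21, 0],
    [0, 1, 17],
]
labels = [1, 2, 3]

def precision_recall(matrix):
    """Return a (precision, recall) pair for every class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        results.append((true_pos / predicted_k, true_pos / actual_k))
    return results

for label, (p, r) in zip(labels, precision_recall(cm)):
    print(label, "precision:", round(p, 2), "recall:", round(r, 2))

# Overall accuracy is the diagonal sum divided by the total count.
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))
print("accuracy:", round(accuracy, 6))
```

The single off-diagonal entry is one class 3 wine predicted as class 2. That one error lowers the precision of class 2 to 21/22 and the recall of class 3 to 17/18, which is what the report shows as 0.95 and 0.94.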

In [68]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print ("Accuracy on training set:")
    print (clf.score(X_train, y_train))
    print ("Accuracy on testing set:")
    print (clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print ("Classification Report:")
    print (metrics.classification_report(y_test, y_pred))
    print ("Confusion Matrix:")
    print (metrics.confusion_matrix(y_test, y_pred))
    
for clf in clfs:
    train_and_evaluate(clf,X_train,X_test,y_train,y_test)

Accuracy on training set:
0.991935483871
Accuracy on testing set:
0.981481481481
Classification Report:
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        15
          2       0.95      1.00      0.98        21
          3       1.00      0.94      0.97        18

avg / total       0.98      0.98      0.98        54

Confusion Matrix:
[[15  0  0]
 [ 0 21  0]
 [ 0  1 17]]
Accuracy on training set:
1.0
Accuracy on testing set:
0.981481481481
Classification Report:
             precision    recall  f1-score   support

          1       0.94      1.00      0.97        15
          2       1.00      0.95      0.98        21
          3       1.00      1.00      1.00        18

avg / total       0.98      0.98      0.98        54

Confusion Matrix:
[[15  0  0]
 [ 1 20  0]
 [ 0  0 18]]
Accuracy on training set:
0.991935483871
Accuracy on testing set:
0.962962962963
Classification Report:
             precision    recall  f1-score   support


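As the results show, logistic regression performs at least as well as the other classifiers: it ties with the SVM on the test set (0.981 accuracy) and has the highest mean cross-validation score (0.976, with the smallest standard error), while k-nearest neighbors trails on both. Logistic regression is therefore a good choice for this classification task. I hope this helps you in learning.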
