## Data Set Information:

* The dataset was downloaded from the UCI Machine Learning Repository.

* These datasets can be viewed as classification tasks. The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, it is not certain that all input variables are relevant, so it could be interesting to test feature selection methods.

* Two datasets (red and white wines) were combined and a few values were randomly removed.
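
The feature selection idea mentioned above could be tried roughly as follows. This is only a minimal sketch on synthetic data: the column names (`informative`, `noise_a`, `noise_b`) and the data itself are made up for illustration, and `SelectKBest` with `f_classif` is just one of several possible scoring choices, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (an assumption, not the wine dataset):
# one column that depends on the label and two pure-noise columns.
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "informative": y_demo * 2.0 + rng.normal(0, 0.5, size=n),
    "noise_a": rng.normal(0, 1, size=n),
    "noise_b": rng.normal(0, 1, size=n),
})

# Score every feature with a one-way ANOVA F-test and keep the best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])

print(dict(zip(X_demo.columns, selector.scores_.round(2))))
print("selected:", selected)
```

On the real data, the same pattern would apply to the eleven physicochemical columns, with `k` chosen by experiment.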

### Attribute Information:


- Input variables (based on physicochemical tests):
   - 1 - fixed acidity
   - 2 - volatile acidity
   - 3 - citric acid
   - 4 - residual sugar
   - 5 - chlorides
   - 6 - free sulfur dioxide
   - 7 - total sulfur dioxide
   - 8 - density
   - 9 - pH
   - 10 - sulphates
   - 11 - alcohol
- Output variable (based on sensory data):
   - 12 - quality (score between 0 and 10)
        
### This is a classification problem: I will try to predict the wine type

In [None]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)

In [None]:
# Load the dataset
df = pd.read_csv('../input/winequalityN.csv')
df.head()

In [None]:
df.info()

In [None]:
df.shape

# Drop missing values


In [None]:
df.dropna(inplace=True)
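
Before dropping rows, it can help to see how many values are missing per column and how many rows the drop removes. A minimal sketch on a tiny made-up frame (the `demo` DataFrame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with two missing values (not the real data).
demo = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "pH": [3.0, np.nan, 3.2, 3.4],
    "sulphates": [0.45, 0.50, np.nan, 0.60],
})

# Count missing values per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value and report the loss.
cleaned = demo.dropna()
rows_dropped = len(demo) - len(cleaned)
print("rows dropped:", rows_dropped)
```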

In [None]:
df['quality'].unique()

In [None]:
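# 'type' is still a string column at this point, so correlate only the numeric columns
corr = df.select_dtypes(include='number').corr()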

In [None]:
plt.figure(figsize=(14,6))
sns.heatmap(corr,annot=True)

# LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['type'] = le.fit_transform(df['type'])

In [None]:
le.classes_

In [None]:
le.transform(le.classes_)

In [None]:
dict(zip(le.classes_, le.transform(le.classes_)))
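
`LabelEncoder` sorts the class labels before assigning integer codes, which is why 'red' maps to 0 and 'white' to 1. A small standalone sketch of that behaviour, using a made-up three-item list instead of the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real notebook encodes df['type'].
demo_le = LabelEncoder()
encoded = demo_le.fit_transform(["white", "red", "white"])

# Classes are stored in sorted order, so the code is the position in classes_.
mapping = {str(label): int(code)
           for label, code in zip(demo_le.classes_,
                                  demo_le.transform(demo_le.classes_))}

# inverse_transform turns codes back into the original labels.
decoded = [str(label) for label in demo_le.inverse_transform([0, 1])]

print(encoded.tolist())
print(mapping)
print(decoded)
```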

In [None]:
df.head()

In [None]:
df['type'].value_counts()

In [None]:
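# LabelEncoder mapping from the cells above: {'red': 0, 'white': 1}
plt.figure(figsize=(15,7))

# Data to plot: class counts after dropping missing values
counts = df['type'].value_counts()
labels = 'white', 'red'
sizes = [counts[1], counts[0]]  # 4870 white, 1593 red in this run
colors = ['white', 'red']
explode = (0.1, 0)  # explode the first (white) slice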
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('The percentage of type of wine',fontsize=20)
plt.legend(('white', 'red'),fontsize=15)
plt.axis('equal')
plt.show()

* There are more white wines than red wines in this dataset (about 75% vs. 25% after dropping missing values).

In [None]:
corr=df.corr()
plt.figure(figsize=(14,6))
sns.heatmap(corr,annot=True)

# Split data

In [None]:
# I chose 'total sulfur dioxide' because it has a correlation of 0.7 with type
# and 'free sulfur dioxide' because it has a correlation of 0.47 with type
X = df[['free sulfur dioxide', 'total sulfur dioxide']]
y = df['type']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
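
Because the two wine types are not balanced, a stratified split may be worth considering so that the train and test sets keep the same class proportions. This is a sketch on made-up labels (75 of one class, 25 of the other); the `stratify` argument is a real `train_test_split` parameter, but using it here is a suggestion, not what the notebook does, and the 0.2 test size is chosen only to give round numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 75 samples of class 1 and 25 of class 0.
y_demo = np.array([1] * 75 + [0] * 25)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 75/25 ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("test class counts:", np.bincount(y_te))
print("train class counts:", np.bincount(y_tr))
```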

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [None]:
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)
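
The scaler is fitted on the training data only, and the test data is then transformed with the training mean and standard deviation. A tiny numeric sketch of what that means (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up one-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the train statistics

print("learned mean:", demo_scaler.mean_)
print("train scaled:", train_scaled.ravel())
print("test scaled:", test_scaled.ravel())
```

The scaled training column has mean 0 and unit standard deviation, while the test column generally does not, because it is measured against the training statistics.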

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log_reg = LogisticRegression()

In [None]:
log_reg.fit(X_train,y_train)

In [None]:
y_pred = log_reg.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

In [None]:
print('Logistic Regression Accuracy = ', round(accuracy_score(y_test, y_pred) * 100, 2), '%')

In [None]:
# The confusion matrix shows the counts of correct and incorrect predictions for each class on the test set.
from sklearn.metrics import confusion_matrix

cm_log_reg = confusion_matrix(y_test, y_pred)
print(cm_log_reg)
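
With unbalanced classes, accuracy alone can hide how each class is doing, so it may be useful to read precision and recall off the confusion matrix. A minimal sketch with made-up labels and predictions (`y_true_demo` and `y_hat_demo` are illustrative only):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions for a binary problem.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat_demo = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_hat_demo)
tn, fp, fn, tp = cm_demo.ravel()

precision = precision_score(y_true_demo, y_hat_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_hat_demo)        # tp / (tp + fn)

print(cm_demo)
print("precision:", precision, "recall:", recall)
```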

In [None]:
X_train

# KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
knn.fit(X_train,y_train)

In [None]:
knn_pred=knn.predict(X_test)


In [None]:
print('KNN Accuracy = ', round(accuracy_score(y_test, knn_pred) * 100, 2), '%')

In [None]:
# The confusion matrix shows the counts of correct and incorrect predictions for each class on the test set.
from sklearn.metrics import confusion_matrix

cm_knn = confusion_matrix(y_test, knn_pred)
print(cm_knn)

# SVM

In [None]:
from sklearn.svm import SVC
svm_linear=SVC(kernel='linear').fit(X_train,y_train)
svm_pred=svm_linear.predict(X_test)
print('SVM Accuracy = ', round(accuracy_score(y_test, svm_pred) * 100, 2), '%')

In [None]:
# The confusion matrix shows the counts of correct and incorrect predictions for each class on the test set.
cm_svm_lin = confusion_matrix(y_test, svm_pred)
print(cm_svm_lin)

# Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB().fit(X_train,y_train)
nb_pred=nb.predict(X_test)
print('Naive Bayes Accuracy = ', round(accuracy_score(y_test, nb_pred) * 100, 2), '%')
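
A single train/test split can make small accuracy differences between models look more meaningful than they are. One possible way to compare the four classifiers more robustly is cross-validation. The sketch below runs on synthetic two-feature data from `make_classification` (an assumption standing in for the two sulfur dioxide columns), so its scores say nothing about the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two selected features (not the wine data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM (linear)": SVC(kernel="linear"),
    "Naive Bayes": GaussianNB(),
}

# Scaling sits inside the pipeline, so it is refitted on each training fold.
cv_scores = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```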


# The best accuracy was 93.62%, obtained with SVM