###### In this notebook, I first explore the data using matplotlib and seaborn. Then I use different classifier models to predict the quality of the wine:

1. Logistic regression

2. Random Forest Classifier
 
3. Support Vector Classifier(SVC)

###### Then I resample the given data, since it is imbalanced, which gives somewhat different results.

###### importing required packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.svm import SVC

In [None]:
data=pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

###### some basic observation of the dataset.

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.isnull().sum()

###### No null values, great!

In [None]:
data.describe()

In [None]:
data['quality'].unique()

In [None]:
data.info()

In [None]:
# Pass the column by keyword; recent seaborn versions do not accept it positionally as x.
sns.countplot(x='quality', data=data)

As we can see, the data is imbalanced.

Wines of quality 5 and 6 are far more numerous than wines of the other qualities.
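To put numbers on the imbalance rather than reading it off the plot, the class shares can be tabulated. This is a minimal sketch on a small made-up series standing in for `data['quality']`; the values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy scores standing in for data['quality'] (not the real dataset).
quality = pd.Series([5, 5, 5, 6, 6, 6, 6, 7, 4, 8], name="quality")

# Absolute number of rows per quality score, ordered by score.
counts = quality.value_counts().sort_index()

# Fraction of all rows that falls into each quality score.
shares = quality.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"count": counts, "share": shares}))
```

On the real data, the same two calls on `data['quality']` would show how dominant scores 5 and 6 are.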



In [None]:
data.corr()

###### some visualisation between the various features and the quality of the wine.

In [None]:
data_columns=data.columns

In [None]:
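# Plot every feature against quality; skip 'quality' itself, which would only plot the target against itself.
for ele in data_columns.drop('quality'):
    fig = plt.figure(figsize = (10,6))
    sns.barplot(x = 'quality', y = ele, data = data)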

###### As per the problem statement, we have to classify the wine into two varieties: good or bad.

###### 1 for good and 0 for bad.

###### The criterion is that a wine with a quality score below 6.5 is bad and a wine with a score above 6.5 is good. Since the scores are integers, this means a score of 7 or more is good.

###### Converting the quality to 0 and 1 depending on the score.

In [None]:
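# Vectorized conversion: 1 (good) if quality > 6.5, otherwise 0 (bad).
# This avoids the hard-coded row count and the chained assignment of an explicit loop.
data['quality'] = (data['quality'] > 6.5).astype(int)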

###### Now we have two varieties in our quality column, 0 and 1, and the count of each is shown in the countplot.

In [None]:
data['quality'].unique()

In [None]:
sns.countplot(x='quality', data=data)

###### This data is also imbalanced, but let's continue with it and see what we can do.

###### Using a logistic regression model.

In [None]:
# multi_class is deprecated in recent scikit-learn releases; the default behaves the same for this binary target.
model=LogisticRegression(solver='lbfgs',max_iter=1000)

In [None]:
y=data['quality']
X=data.drop(['quality'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
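Because the classes are imbalanced, a plain random split can leave the test set with a different share of good wines than the training set, and the scores change from run to run. A possible refinement, shown here on hypothetical toy data rather than the wine dataset, is to pass `stratify` and `random_state` to `train_test_split`. The names `X_toy`, `y_toy` and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 90 "bad" (0) rows and 10 "good" (1) rows.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"alcohol": rng.normal(10, 1, 100)})
y_toy = pd.Series([0] * 90 + [1] * 10, name="quality")

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# Share of good wines in each split.
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` and a fixed `random_state` to the existing call.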

In [None]:
model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)

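That looks like a good accuracy for a logistic regression model, but since the data is imbalanced, a model that always predicts the majority class (0) would also score high.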
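One way to judge an accuracy figure on imbalanced data is to compare it against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 86/14 split is a hypothetical example, not a figure measured in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 86 "bad" (0) and 14 "good" (1), chosen only for illustration.
y_toy = np.array([0] * 86 + [1] * 14)
X_toy = np.zeros((len(y_toy), 1))  # the baseline ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class.
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

A real model is only informative to the extent that it beats this number, which is why the classification reports used further down are a better guide than accuracy alone.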

Let's try some other methods on the data.

###### using random forest

random forest documentation https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

In [None]:
print(classification_report(y_test, pred_rfc))

In [None]:
print(confusion_matrix(y_test, pred_rfc))

###### using support vector machine
svc documentation https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
clf = SVC() 

In [None]:
clf.fit(X_train, y_train) 

In [None]:
pred_svc=clf.predict(X_test)

In [None]:
print(classification_report(y_test, pred_svc))

In [None]:
print(confusion_matrix(y_test, pred_svc))
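`StandardScaler` is imported at the top of the notebook but never used. An SVC with the default RBF kernel is sensitive to feature scale, so it may be worth trying it inside a pipeline that standardizes the features first. This is a rough sketch on synthetic data from `make_classification`; the data, the seed and the name `scaled_svc` are assumptions, and no claim is made about how much it would change the scores here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the wine features (not the real data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)
X_syn[:, 0] *= 1000.0  # exaggerate one feature's scale on purpose

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

scaled_acc = scaled_svc.score(X_te, y_te)
print(scaled_acc)
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters.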

###### Accuracy from logistic regression is 85.9%
###### Accuracy from Random Forest classifier is 91%
###### Accuracy from support vector machine is 85%

### Resampling
Data imbalance can be treated by resampling the data. Resampling can be of two types: undersampling and oversampling.

Here I am using oversampling.


Over-Sampling increases the number of instances in the minority class by randomly replicating them in order to present a higher representation of the minority class in the sample.
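The random replication described above can be sketched with `DataFrame.sample(replace=True)`, drawing minority rows until both classes are the same size. This is only an illustration on a hypothetical toy frame; the cells that follow use a simpler approach and duplicate every minority row once.

```python
import pandas as pd

# Hypothetical toy frame: six "bad" (0) rows and two "good" (1) rows.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 9.5, 10.5, 11.2, 12.8, 13.0],
    "quality": [0, 0, 0, 0, 0, 0, 1, 1],
})
minority = toy[toy.quality == 1]
majority = toy[toy.quality == 0]

# Draw minority rows with replacement until both classes have the same size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)

# Stack the untouched majority rows and the upsampled minority rows.
balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["quality"].value_counts())
```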

###### Collecting all the rows of quality 1 and all the rows of quality 0.

In [None]:
data_1=data[data.quality == 1]

In [None]:
data_0=data[data.quality == 0]

In [None]:
data_1.info()

In [None]:
data_0.info()

In [None]:
data_1_new = pd.concat([data_1, data_1],ignore_index=True, sort =False)

###### Running the above cell duplicates the data_1 dataframe, creating a dataframe with twice as many rows of the same data.

In [None]:
data_1_new.info()

###### Concatenating the two dataframes data_1_new and data_0 to create a training dataset for further use

In [None]:
data_new = pd.concat([data_1_new,data_0],ignore_index=True, sort =False)

In [None]:
data_new.info()

###### using all the above models one by one again on this resampled data.

In [None]:
y_new=data_new['quality']
X_new=data_new.drop(['quality'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.20)
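One caveat with splitting after the duplication: copies of the same good-wine row can land in both the training and the test set, which may inflate the test scores. A common alternative is to split first and oversample only the training part. The sketch below shows that ordering on a hypothetical toy frame; the names `toy`, `train_df`, `test_df` and `train_balanced` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 32 "bad" (0) rows and 8 "good" (1) rows.
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(40)],
    "quality": [0] * 32 + [1] * 8,
})

# 1) Split first, so the test set contains only original, unseen rows.
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["quality"], random_state=0
)

# 2) Duplicate the minority class in the training part only.
train_minority = train_df[train_df.quality == 1]
train_balanced = pd.concat([train_df, train_minority], ignore_index=True)

# No original row index is shared between the two splits.
overlap = set(train_df.index) & set(test_df.index)
print(len(overlap), len(train_balanced), len(test_df))
```

With this ordering the test set keeps the original class ratio, so its scores are comparable with the ones obtained before resampling.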

In [None]:
# multi_class is deprecated in recent scikit-learn releases; the default behaves the same for this binary target.
model=LogisticRegression(solver='lbfgs',max_iter=1000)

In [None]:
model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)

In [None]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

In [None]:
print(classification_report(y_test, pred_rfc))

In [None]:
clf = SVC() 

In [None]:
clf.fit(X_train, y_train) 

In [None]:
pred_svc_new=clf.predict(X_test)

In [None]:
print(classification_report(y_test, pred_svc_new))

###### Accuracy from logistic regression is 85.9%
###### Accuracy from Random Forest classifier is 95%
###### Accuracy from support vector machine is 89%

## Looks like resampling the data improved the scores for the random forest and the support vector machine.

# If you like my work, give it a thumbs up.