# RED WINE QUALITY -DA & ML

### ABOUT DATASET

The Wine Quality dataset contains information about various physicochemical properties of wines.
We are going to load the dataset into Python, perform an initial analysis to see what it contains, and then apply some machine learning algorithms.

### FEATURES DESCRIPTION
* Fixed acidity: The amount of tartaric acid in the wine, measured in g/dm3.
* Volatile acidity: The amount of acetic acid in the wine, measured in g/dm3.
* Citric acid: The amount of citric acid in the wine, measured in g/dm3.
* Residual sugar: The amount of sugar left in the wine after fermentation is done, measured in g/dm3.
* Chlorides: The amount of sodium chloride (salt) in the wine, measured in g/dm3.
* Free sulfur dioxide: The amount of sulfur dioxide (SO2) in free form, measured in mg/dm3.
* Total sulfur dioxide: The total amount of SO2 in the wine, measured in mg/dm3. This chemical works as an antioxidant and antimicrobial agent.
* Density: The density of the wine, measured in g/cm3.
* pH: The pH value of the wine. The scale runs from 0 to 14, where 0 indicates very high acidity and 14 indicates a strongly basic solution.
* Sulphates: The amount of potassium sulphate in the wine, measured in g/dm3.
* Alcohol: The alcohol content of the wine, in % by volume.
* Quality: The quality of the wine, scored on a scale from 0 to 10. The higher the value, the better the wine.


- - -

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load data file
red_wine=pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
red_wine.head()

In [None]:
red_wine.tail()

In [None]:
red_wine.shape

In [None]:
red_wine.columns

In [None]:
red_wine.isnull().sum()

### Observation:-
The red wine dataset has no missing values.

In [None]:
red_wine.info()

In [None]:
red_wine.describe()

In [None]:
plt.figure(figsize=(10, 7))
corrMatrix = red_wine.corr()
# Mask the upper triangle (including the diagonal) so each pair is shown only once
mask = np.triu(np.ones_like(corrMatrix, dtype=bool))
sns.heatmap(corrMatrix, mask=mask, annot=True, fmt='.2f', linewidths=2)
plt.show()

### Observation:-
* Alcohol is positively correlated with the quality of the red wine.
* Alcohol has a weak positive correlation with the pH value.
* Alcohol is negatively correlated with fixed acidity, volatile acidity, chlorides, free sulfur dioxide, total sulfur dioxide and density.
* Citric acid and density have a strong positive correlation with fixed acidity.
* pH has a negative correlation with density, fixed acidity, citric acid, and sulphates.
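
Reading pairs off a heatmap by eye is error-prone, so it can help to list them in order of strength. The following is a minimal sketch on a made-up toy frame (the columns `a`, `b`, `c` and their values are illustrative assumptions, not the wine data); the same loop could be pointed at `red_wine.corr()`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column "c" is built to follow "a" closely
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = 2 * a + 0.1 * rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "c": c})

corr = toy.corr()

# Walk the strictly lower triangle so every pair is visited exactly once
pairs = [
    (corr.index[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr))
    for j in range(i)
]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for row_name, col_name, value in pairs:
    print(f"{row_name} vs {col_name}: {value:.2f}")
```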

- - - 

### * Red wine quality-wise analysis

In [None]:
#red wine quality value count
quality_value_count=red_wine["quality"].value_counts()
quality_value_count

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='quality', data=red_wine)

In [None]:
quality_value_count.plot.pie(autopct="%.1f%%")

In [None]:
red_wine.corr()['quality'].nlargest()

### * Red wine  alcohol-wise analysis

In [None]:
# Let's see how the alcohol concentration is distributed across the red wines.
# histplot with kde=True replaces the deprecated distplot
sns.histplot(red_wine['alcohol'], kde=True)

In [None]:
from scipy.stats import skew
skew(red_wine['alcohol'])

The output verifies that alcohol is positively skewed.
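
As a rough illustration of what that number means, the sketch below recomputes the skewness by hand as the third central moment divided by the second central moment raised to the power 1.5, which is what `scipy.stats.skew` returns with its default `bias=True`. The helper name `sample_skewness` and the synthetic right-skewed sample are assumptions made for this example, not part of the wine data.

```python
import numpy as np
from scipy.stats import skew

def sample_skewness(x):
    """Biased sample skewness: m3 / m2**1.5 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance, ddof=0)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Synthetic right-skewed sample standing in for a column such as alcohol
rng = np.random.default_rng(0)
toy = rng.exponential(scale=1.0, size=1000)

manual = sample_skewness(toy)
print(manual, skew(toy))  # the two values should agree
```

A value above 0 indicates a longer right tail, a value below 0 a longer left tail.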

In [None]:
# Distribution plot of every column
fig, ax = plt.subplots(ncols=6, nrows=2, figsize=(20, 10))
ax = ax.flatten()

# histplot with kde=True replaces the deprecated distplot
for index, (col, value) in enumerate(red_wine.items()):
    sns.histplot(value, color='r', kde=True, ax=ax[index])
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

The figures above show the distribution of each feature. A few are approximately normally distributed, while the others are right-skewed. The range of each feature is also fairly narrow.

### Alcohol Vs Quality

In [None]:
sns.boxplot(x='quality', y='alcohol', data = red_wine)

The graph above shows some dots outside the whiskers. Those are outliers, and most of them belong to wines with quality 5 and 6. We can hide the outliers in the plot by passing the argument showfliers=False.

In [None]:
sns.boxplot(x='quality', y='alcohol', data = red_wine, showfliers=False)

### Observation:-
Wines with a higher alcohol concentration tend to have a higher quality score.

- - -

In [None]:
# red wine pH value wise count
red_wine["pH"].value_counts()

In [None]:
sns.boxplot(x='quality', y='pH', data = red_wine)

The graph above shows some dots outside the whiskers. Those are outliers, and they appear for wines with quality 4, 5, 6, 7 and 8. We can hide the outliers in the plot by passing the argument showfliers=False.

In [None]:
sns.boxplot(x='quality', y='pH', data = red_wine, showfliers=False)

### Observation:- 
For higher-quality red wine, the pH value lies in the range 3.0 to 3.6.
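
The dots drawn outside the whiskers in the box plots above follow, by default, the 1.5 × IQR rule: a point is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below applies that rule to a handful of made-up pH values; the helper `iqr_outliers` and the toy numbers are assumptions for illustration, and the quartile interpolation may differ slightly from what the plotting library uses.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying beyond k * IQR from the quartiles (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up pH readings, with one value far above the rest
toy_ph = pd.Series([3.1, 3.2, 3.25, 3.3, 3.3, 3.35, 3.4, 3.45, 4.0])

outliers = iqr_outliers(toy_ph)
print(outliers)
```

On the real data the same function could be applied per quality group, for example with `red_wine.groupby('quality')['pH'].apply(iqr_outliers)`.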

- - -

# Machine Learning Algorithm

# Linear Regression

Dependent variable: quality

Independent variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
# Drop the target column so that only the independent variables remain
data=red_wine.drop(['quality'],axis=1)
data.head()

In [None]:
x_train,x_test, y_train, y_test=train_test_split(data,red_wine['quality'],test_size=0.20,random_state=8)

In [None]:
model=LinearRegression()
model.fit(x_train,y_train)

In [None]:
# For a regressor, score() returns the R² coefficient of determination, not a classification accuracy
r2 = model.score(x_test, y_test)
print(r2*100, '%')

In [None]:
#Predicting the Test set result;  
y_pred= model.predict(x_test)
y_pred

In [None]:
print('Train Score: ', model.score(x_train, y_train))  
print('Test Score: ', model.score(x_test, y_test))  

In [None]:
# See which coefficients the regression model has chosen
coeff_df = pd.DataFrame(model.coef_,data.columns, columns=['Coefficient'])
coeff_df

In [None]:
red_wine_prediction = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
red_wine_prediction

* Evaluating the algorithm: the final step is to evaluate the performance of the algorithm. We'll do this by computing the MAE, MSE and RMSE on the test set:

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

### Observation:-
* The R² score of our red wine quality model using linear regression is about 35%, so linear regression is not a good fit for this problem.

* There is only a weak linear relationship between the dependent variable (quality) and the independent variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol).
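
For a regressor, scikit-learn's `score` method returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. As a minimal sketch on synthetic data (the arrays and coefficients below are made up for illustration and are not the wine dataset), R² can be recomputed by hand and compared with the library value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (hypothetical): weak signal, a lot of noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=1.0, size=200)

# Simple hold-out split: first 150 rows to train, last 50 to test
toy_model = LinearRegression().fit(X[:150], y[:150])
y_true = y[150:]
y_hat = toy_model.predict(X[150:])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat), toy_model.score(X[150:], y_true))
```

An R² near 0 means the model does little better than always predicting the mean; it can even be negative on a test set.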

- - -

# Logistic Regression in Machine Learning

* Logistic regression predicts the output of a categorical dependent variable. 

* Dependent variable: quality (a categorical score taking the values 3, 4, 5, 6, 7 and 8)

* Independent variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol
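
With more than two categories, a fitted logistic regression produces one probability per class and predicts the class with the highest probability. The sketch below shows this on a synthetic three-class problem generated with `make_classification`; the data, the `max_iter=1000` setting, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem standing in for the quality classes
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_redundant=0, n_classes=3, random_state=0,
)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X[:5])

# The predicted label is the class with the largest probability
predicted = clf.classes_[np.argmax(proba, axis=1)]

print(proba.round(3))
print(predicted, clf.predict(X[:5]))
```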

In [None]:
#Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [None]:
logistic_data=red_wine.drop(['quality'],axis=1)
logistic_data.head()

Now we will split the dataset into a training set and test set. Below is the code for it:

In [None]:
x_train,x_test, y_train, y_test=train_test_split(logistic_data,red_wine['quality'],test_size=0.20,random_state=8)

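Logistic regression benefits from feature scaling: scikit-learn's LogisticRegression applies regularization by default, and its solver converges more reliably when the features are on a comparable scale. Here we scale only the independent variables.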
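
To make the effect of standardization concrete, here is a minimal sketch on a few made-up numbers (the arrays are illustrative assumptions). The scaler learns the mean and standard deviation from the training rows only and then applies those same statistics to the test rows, which is equivalent to computing `(x - mean) / std` by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values), two columns on very different scales
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training rows
test_scaled = scaler.transform(test)        # reuse the training statistics

# Same computation by hand (StandardScaler uses the population std, ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(test_scaled)
print(manual_test)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model.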

In [None]:
#feature Scaling  
from sklearn.preprocessing import StandardScaler    
st_x= StandardScaler()    
x_train= st_x.fit_transform(x_train)    
x_test= st_x.transform(x_test)  

### * Fitting Logistic Regression to the Training set:

The dataset is now prepared, so we can train the model on the training set. To do this, we use the LogisticRegression class from the sklearn library.

After importing the class, we create a logmodel object and fit it to the training data. Below is the code for it:

In [None]:
logmodel=LogisticRegression()
logmodel.fit(x_train,y_train)


### * Predicting the Test Result

The model is now trained on the training set, so we can predict the results for the test set. Below is the code for it:

In [None]:
predictions=logmodel.predict(x_test)
x_test

In [None]:
#Accuracy of test data
accuracy = logmodel.score(x_test, y_test)
print("Accuracy of Test set",accuracy*100,'%')

In [None]:
predictions

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)

In [None]:
# accuracy score

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

lr_acc = accuracy_score(y_test, logmodel.predict(x_test))
print(f"Accuracy Score of Training Data is {accuracy_score(y_train, logmodel.predict(x_train))}")
print(f"Accuracy Score of Test Data is {lr_acc}\n")

- - -

# K-Nearest Neighbor (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

Here we use the same data, i.e. the dependent variable is quality, together with the same training set and test set that we created above.

We fit KNN on the training set and evaluate it on the test set.

In [None]:
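# Feature scaling
# Note: x_train and x_test were already standardized in the logistic regression section,
# so applying StandardScaler again leaves them essentially unchanged.
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)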

In [None]:
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

In [None]:
#Accuracy of test data
#accuracy = knn.score(x_test, y_test)
#print("Accuracy score of Test set",accuracy*100,'%')
print(f"Accuracy score of Test Data is {(accuracy_score(y_test, knn.predict(x_test)))*100}",'%')
print(f"Accuracy score of Training Data is {(accuracy_score(y_train, knn.predict(x_train)))*100}",'%')

In [None]:
pred_knn = knn.predict(x_test)
pred_knn

In [None]:
print(classification_report(y_test, pred_knn))

In [None]:
confusion_matrix(y_test, pred_knn)

### Observation:- 
Using the KNN algorithm on the red wine quality data, with quality as the dependent variable, the accuracy on the test data is 60%.
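
Because KNN relies on distances, the scaling step and the classifier belong together. One possible way to keep them together, sketched here on synthetic data rather than the wine dataset, is a scikit-learn pipeline; the generated data and the name `knn_pipeline` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class data standing in for the wine features and quality labels
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_redundant=0, n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=8
)

# The pipeline fits the scaler on the training data and reuses it at prediction time
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipeline.fit(X_train, y_train)

test_accuracy = knn_pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy * 100:.1f} %")
```

With this arrangement the raw, unscaled split can be passed to every model, and the scaler never needs to be applied by hand.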

- - -

# SVC

In [None]:
from sklearn.svm import SVC

Here we use the same data, i.e. the dependent variable is quality, together with the same training set and test set that we created above.

With the data already scaled, we fit SVC on the training set and evaluate it on the test set.


In [None]:
# Support vector classifier
model_svc = SVC()
model_svc.fit(x_train, y_train)
y_train_pred = model_svc.predict(x_train)
y_test_pred = model_svc.predict(x_test)

# accuracy_score expects (y_true, y_pred)
print("Train set Accuracy : " + str(accuracy_score(y_train, y_train_pred)*100))
print("Test set Accuracy : " + str(accuracy_score(y_test, y_test_pred)*100))

### Observation:- 
Using the SVC algorithm on the red wine quality data, with quality as the dependent variable, the accuracy on the test data is 61%.

- - -

# Decision Tree

Here we use the same data, i.e. the dependent variable is quality, together with the same training set and test set that we created above.

With the data already scaled, we fit a decision tree on the training set and evaluate it on the test set.

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train, y_train)

In [None]:
# Accuracy score
print(f"Accuracy score of Test Data is {(accuracy_score(y_test, decision_tree.predict(x_test)))*100}",'%')
print(f"Accuracy score of Training Data is {(accuracy_score(y_train, decision_tree.predict(x_train)))*100}",'%')

- - -