### Loading Libraries
Not all Python capabilities are loaded into the working environment by default, even if they are already installed on your system. So we import each library that we want to use.


In data science, NumPy and pandas are among the most commonly used libraries. NumPy is used for numerical calculations such as means, medians and square roots. Pandas is used for data processing and data frames. We choose alias names for the libraries for convenience (numpy --> np and pandas --> pd).

In [None]:
# Importing the required libraries.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


        
# Ignoring Unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

### Loading Data
The pandas module is used for reading files. Our data is in '.csv' format, so we will use the 'read_csv()' function to load it.

Now what are the input and target variables in the wine quality prediction problem? Let’s have a look:

**Input variables:**

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

**Output variable:**

- quality (score between 0 and 10)

In [None]:
#data reading
data=pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")


### Head

To take a closer look at the data, we use the “.head()” function of the pandas library, which returns the first five observations of the data set. Similarly, “.tail()” returns the last five observations of the data set.

You’ll see that this is a great way to get an initial feeling of your data and maybe understand it a bit better already!

In [None]:
data.head()

The Pandas Profiling library is a useful and easy way to get a detailed overview and EDA of the data.

In [None]:
# detailed overview of the data
import pandas_profiling as pp
df=data
report=pp.ProfileReport(df,title = "Pandas Profile Report")
report

In [None]:
print("Columns names in the data :",data.columns)

### Input Variables:
One or more variables that are used to determine (or predict) the 'Target Variable' are known as input variables. They are sometimes called predictor variables as well.

In our example, the input variables are: 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', and 'alcohol'.

All of these will help us predict the quality of the wine.

#### Variables/Features:
Variables and features are the same thing, and the two terms are often used interchangeably. Each column in a dataset is a variable.

Find out the total number of rows and columns in the dataset using `.shape`.

In [None]:
# shape of data.
data.shape

## Observations:
- The dataset comprises 1599 observations (rows) and 12 features (columns).
- Out of the 12, one is the target variable and the remaining 11 are input variables.

In [None]:
#Data type for all of the columns in the data
data.dtypes

### Checking for Missing Values
Handling missing values is an essential part of the data cleaning and preparation process because almost all data in real life comes with some missing values.

Pandas provides the isnull() and isna() functions to detect missing values. Both of them do the same thing.

- df.isna() returns a dataframe of boolean values indicating which entries are missing. You can also use notna(), which is just the opposite of isna().
- df.isna().any() returns a boolean value for each column. If there is at least one missing value in that column, the result is True.
- df.isna().sum() returns the number of missing values in each column.
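
The calls listed above can be tried on a small frame before running them on the wine data. The sketch below uses a hypothetical three-row DataFrame (`toy`, not part of the dataset) with two deliberately missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the wine data) with two missing entries.
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.5],
    "alcohol": [9.4, 10.1, np.nan],
    "quality": [5, 6, 5],
})

mask = toy.isna()                      # boolean frame, True where a value is missing
has_missing = toy.isna().any()         # one boolean per column
missing_counts = toy.isna().sum()      # number of missing values per column
any_missing = toy.isna().values.any()  # a single boolean for the whole frame

print(has_missing)
print(missing_counts)
print(any_missing)
```

Here `pH` and `alcohol` each report one missing value, while `quality` reports none.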

In [None]:
#to check any missing value in the data.
data.isnull().values.any()

There is no missing value in any column

In [None]:
data.isna().sum()


### Checking for Duplicates
Duplicates might or might not affect the quality of data. Before deciding if they should be removed, it is essential to understand why they might have occurred in the first place.

Duplicates can be checked using the duplicated() method.

In [None]:
duplicate_entries = data[data.duplicated()]
duplicate_entries.shape

In [None]:
print("Number of duplicated rows :",data.duplicated().sum())

### Observations:
There are 240 duplicates. The quality ratings for the same/similar wine were given by different wine tasters so there is a possibility of similar reviews. We can thus keep these duplicates.

In [None]:
data[data.duplicated(keep = 'first')].shape
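
The `keep` argument of duplicated() controls which copies get flagged. A minimal sketch on a hypothetical four-row frame (`toy`, not the wine data) shows the difference, along with drop_duplicates() in case you decide to remove the repeats.

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.4, 10.0, 9.4],
    "quality": [5, 5, 6, 5],
})

flag_first = toy.duplicated(keep="first")  # flags every repeat after the first copy
flag_all = toy.duplicated(keep=False)      # flags every row that has a twin
deduped = toy.drop_duplicates()            # keeps only the first copy of each row

print(flag_first.sum(), flag_all.sum(), deduped.shape)
```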

### info()
`df.info()` prints information about the data frame, including the data type of each column, the number of non-null values in each column and the memory usage of the entire data.



In [None]:
#Information about the dataset
data.info()

### Observations:
The data has only float and integer values.

There are no missing values

### Get a Statistical Overview using Describe
The describe() function in pandas is very handy in getting various summary statistics. This function returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the data.

Let's explore different statistical measures that we have got from describe().
- count: total count of non-null values in the column
- mean: the average of all the values in that column
- min: the minimum value in the column
- max: the maximum value in the column
- 25%: first quartile in the column after we arrange those values in ascending order
- 50%: this is the median or the second quartile
- 75%: the third quartile
- std: this is the standard deviation, a measure of dispersion (how far the values spread around the mean)

Note: 25%, 50%, and 75% are simply the corresponding percentile values.
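
To see how these measures line up with one another, here is a small sketch on a hypothetical Series holding the numbers 1 to 10. It compares the rows returned by describe() with the individual pandas methods.

```python
import pandas as pd

# Hypothetical data: the integers 1..10.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
summary = s.describe()

q1 = s.quantile(0.25)   # same value as summary["25%"]
median = s.median()     # same value as summary["50%"]
q3 = s.quantile(0.75)   # same value as summary["75%"]
std = s.std()           # sample standard deviation, as reported by describe()

print(summary)
```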

In [None]:
data.describe()

### Observations:
Compare the mean with the median of each column; the median is the 50% row (50th percentile) in the index column. For most features here the mean is higher than the median, which indicates skewed distributions and suggests the presence of outliers. An extreme value pulls the mean towards itself while the median barely moves. For example, take a data set with the values 30, 31, 32, and 2. The mean (23.75) is lower than the median (30.5) because it is strongly affected by the extreme data point (2); an extreme high value would pull the mean above the median in the same way.

There is a notably large difference between the 75th percentile and the max values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”. This indicates that some values of these 3 variables lie much farther out than the general range of values (up to the 75th percentile).

Thus, the two observations above suggest that there are extreme values, i.e. outliers, in our dataset.
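
The mean-versus-median example can be checked directly. The sketch below uses Python's standard `statistics` module and adds a second, hypothetical list with a high outlier to show the pull in the other direction.

```python
import statistics

low_outlier = [30, 31, 32, 2]    # the example from the text
high_outlier = [30, 31, 32, 60]  # hypothetical variant with an extreme high value

mean_low = statistics.mean(low_outlier)
median_low = statistics.median(low_outlier)
mean_high = statistics.mean(high_outlier)
median_high = statistics.median(high_outlier)

print(mean_low, median_low)    # mean is dragged below the median
print(mean_high, median_high)  # mean is dragged above the median
```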

### Unique Values of Quality(Target Variable)

In [None]:
data["quality"].unique()

### Observations:
Few key insights just by looking at the target variable are as follows:

The target (dependent) variable is discrete and categorical in nature.

The “quality” score scale ranges from 0 to 10, with 0 being the poorest and 10 being the best.

The ratings 0, 1, 2, 9 and 10 do not occur in the data. The only scores observed are between 3 and 8.

### Frequency Counts of each Quality Value

In [None]:
data["quality"].value_counts()

### Observations:
This tells us the vote count of each quality score in descending order.

“quality” has most values concentrated in the categories 5, 6 and 7.

Only a few observations fall in the categories 3 and 8.

In [None]:
bins=[0,5,7,10]
labels=[0,1,2]
data['wine_quality']=pd.cut(data['quality'],bins=bins,labels=labels)

In [None]:

Counter(data["wine_quality"])

In [None]:
Counter(data["quality"])
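
The binning step above relies on pd.cut, which by default builds right-inclusive intervals, so the bins `[0,5,7,10]` mean (0, 5], (5, 7] and (7, 10]. A minimal sketch on a hypothetical Series holding the six observed scores shows which label each score receives.

```python
import pandas as pd

# Hypothetical input: one example of each observed quality score.
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Same bins and labels as in the notebook cell above.
grouped = pd.cut(scores, bins=[0, 5, 7, 10], labels=[0, 1, 2])

print(grouped.tolist())
```

Scores 3 to 5 land in group 0, scores 6 and 7 in group 1, and score 8 in group 2.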

### Renaming Columns

Let's rename the columns which contain spaces in their names and replace the spaces with underscores.

In [None]:
# Pass the mapping via columns=; without it, rename() targets the row index and the column names stay unchanged.
data.rename(columns={'fixed acidity':'fixed_acidity', 'volatile acidity':'volatile_acidity', 'citric acid':'citric_acid', 'residual sugar':'residual_sugar',
       'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},inplace=True)
data.columns
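
As an alternative to listing every column by hand, the spaces can be replaced in one step through the string methods of the column index. This is only a sketch on a hypothetical empty frame (`toy`) with three of the column names; it is not the approach used in the notebook.

```python
import pandas as pd

# Hypothetical frame with a few of the wine column names.
toy = pd.DataFrame(columns=["fixed acidity", "pH", "free sulfur dioxide"])

# Replace every space in every column name with an underscore.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

print(list(toy.columns))
```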


# Correlation Matrix with Heatmap
Correlation:
Correlation is a statistical measure of how strongly two variables move together. Correlation analysis is a way to understand the relationships between the features in your dataset.

Many data science projects depend on finding meaningful correlations between the input and target variables. However, more often than not, we overlook how crucial correlation analysis is.

It is recommended to perform correlation analysis both before and after the data gathering and transformation phases of a data science project.

There are three different types of correlation:

1. Positive Correlation: Two features (variables) are positively correlated when they move in the same direction: when the value of one variable increases, the value of the other also tends to increase, and when one decreases, so does the other.
Eg. The more time you spend running on a treadmill, the more calories you will burn.

2. Negative Correlation: Two features (variables) are negatively correlated when they move in opposite directions: when the value of one variable increases, the value of the other tends to decrease.
Eg. As the weather gets colder, heating costs increase.

3. No Correlation: Two features might not have any relationship with each other. In this case, a change in the value of one variable has no systematic effect on the value of the other.
Eg. There is no relationship between the amount of tea drunk and level of intelligence.

The strength of a correlation is expressed as a value between -1 and +1:

- A moderate to fairly strong positive correlation has a value such as 0.5 or 0.7.
- A very strong or perfect positive correlation is represented by a correlation score of 0.9 or 1.
- A strong negative correlation is represented by a value of -0.9 or -1.
- Values close to zero indicate no (linear) correlation.
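
The three cases can be reproduced on synthetic data. The sketch below builds a hypothetical frame (`toy`) in which one column rises with `x`, one falls as `x` rises, and one follows a symmetric U shape. The last column also shows that a correlation near zero only rules out a linear relationship, because `u_shape` is fully determined by `x`.

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)

# Hypothetical columns, chosen so the correlations are easy to predict.
toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,            # rises with x: perfect positive correlation
    "down": -3 * x + 5,         # falls as x rises: perfect negative correlation
    "u_shape": (x - 4.5) ** 2,  # symmetric around the mean of x: no linear correlation
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```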

We can check how each feature is related to the others using the corr() function.

 

In [None]:
data.corr()

From the above correlation matrix, we can observe that there is a relatively high positive correlation between **fixed_acidity** and **citric_acid**, **fixed_acidity** and **density**. 

Similarly we can observe there is a relatively high negative correlation between **fixed_acidity** and **pH**. There is relatively high positive correlation between alcohol presence and quality of the wines.



Creating a pictorial visualisation of the above correlation matrix using a heatmap helps in better understanding. We can do that using Seaborn's Heatmap function.

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(data=data.corr(),annot=True,cmap="bwr")
plt.show()

### Observations:
Alcohol has the highest positive correlation with wine quality, followed by sulphates and citric acid. Volatile acidity has the strongest negative correlation with quality.

There is a relatively high positive correlation between fixed_acidity and citric_acid, and between fixed_acidity and density.
There is a relatively high negative correlation between fixed_acidity and pH.

Density has a relatively strong positive correlation with fixed_acidity, whereas it has a relatively strong negative correlation with alcohol.
Citric acid and volatile acidity are negatively correlated.

Free sulfur dioxide and total sulfur dioxide are positively correlated.

### Graphical Relation between the data variables.

We can draw a scatterplot matrix to better understand the relationship between each pair of variables. It plots every numerical attribute against every other. The 'pairplot' function of seaborn helps to achieve this.

### Pair Plot
The pair plot builds on two basic figures, the histogram and the scatter plot. The histogram on the diagonal allows us to see the distribution of a single variable while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables.


In [None]:
sns.pairplot(data,hue='wine_quality')
plt.show()


The correlation between **fixed_acidity** and **citric_acid** is 0.67 (you can find this value in the correlation matrix above). Looking at the scatterplot for this pair of variables, we can see the positive linear correlation between them: there is an upward trend, and the points are not too dispersed.

## Histogram
Histograms use bars to visualize data as well. Many people may not even realize there is a difference between a histogram and a bar chart. They practically look the same from a distance.

The key is that a histogram looks solely at quantitative variables while a bar chart looks at categorical variables. That’s why the bars in a histogram are typically grouped together without spacing in between the bars.
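
The distinction can be seen in how the bar heights are computed, even without drawing anything. In the sketch below (hypothetical values, not taken from the dataset), the quantitative variable is split into numeric bins with np.histogram, while the categorical variable simply gets one count per distinct value.

```python
import numpy as np
import pandas as pd

alcohol = pd.Series([9.4, 9.8, 10.0, 10.5, 11.2, 12.8])  # quantitative (hypothetical)
quality = pd.Series([5, 5, 6, 5, 7, 6])                   # categorical scores (hypothetical)

# Histogram: cut the continuous range into adjacent bins and count values per bin.
counts, edges = np.histogram(alcohol, bins=2)

# Bar/count plot: one bar per category, no binning involved.
bar_heights = quality.value_counts().sort_index()

print(counts, edges)
print(bar_heights)
```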

Let's plot a histogram now! Calling the hist() method on a DataFrame gives a histogram for every numeric column (Series).

In [None]:
data.hist(bins=10,figsize=(12,10))
plt.show()

## Observations:
The distribution of the attribute “alcohol” is positively skewed, i.e. most values sit on the left and the tail extends to the right.
The attributes 'density' and 'pH' are approximately normally distributed.

Looking at the attribute quality, we can observe that wines of average quality (rating 5 to 7) are far more numerous than wines of bad (1-4) or good (8-10) quality.

### Count Plot
The variable quality is categorical in nature, and we can visualize such variables using a barplot or countplot.

A count plot is a graphical display that uses bars to show the number of occurrences (frequency) of each category.

In [None]:
sns.countplot(x="quality",data=data)
plt.show()

#### Observation:
Wines of average quality (5-7) are more numerous than wines of bad (1-4) or good (8-10) quality.

In [None]:
sns.countplot(x="wine_quality",data=data)

### Box Plot
A box plot is a great way to get a visual sense of an entire range of data. It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots divide data into quartiles. The “box” shows the part of the data set between the first and third quartiles.

The median gets drawn somewhere inside the box and then you see the most extreme non-outliers to finish the plot. Those lines are known as the “whiskers”. If there are any outliers then those can be plotted as well.

With box plots you can answer how diverse or uniform your data might be. You can identify what is normal and what is extreme. Box plots help give a shape to your data that is broad without sacrificing the ability to look at any piece and ask more questions.

A box plot displays the five-number summary of a set of data. The five-number summary is:

1. minimum
2. first quartile (Q1)
3. median
4. third quartile (Q3)
5. maximum
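
The five-number summary can be computed by hand, which also makes it clear where the outlier points come from. The sketch below uses hypothetical values and the common 1.5 × IQR rule for the whisker limits (the default in matplotlib and seaborn box plots); the variable names are illustrative.

```python
import numpy as np

# Hypothetical data with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```

In this example only the value 30 falls outside the fences, so it would appear as a separate point above the upper whisker.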

In [None]:
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
sns.boxplot(x='quality',y='alcohol',data=data,palette ='GnBu_d')
plt.title("Boxplot of quality and alcohol")
plt.show()

### Observation:
The above plot shows that alcohol content tends to increase with wine quality: the median alcohol level is generally higher for the higher quality ratings. This is consistent with the positive correlation between alcohol and quality seen earlier, although it does not by itself show that more alcohol makes a wine better.
Also, the points lying outside the whiskers (the lines extending from the rectangular box) are the outliers.

In [None]:
data.plot(kind='box',subplots=True,layout=(4,3),grid=True,figsize=(10,10))
plt.tight_layout()
plt.show()

In [None]:
# Box plot of every input feature against quality; stop once the target column is reached.
for i in data.columns:
    if i =="quality":
        break
    sns.boxplot(x="quality",y=i,data=data)
    plt.show()

In [None]:
# Bar plot (mean value per quality score) of every input feature; stop at the target column.
for i in data.columns:
    if i == "quality":
        break
    sns.barplot(x="quality",y=i,data=data)
    plt.show()

# barplot with matplotlib
    
"""
for i in data.columns:
    if i == "quality":
        break
    plt.bar(x='quality',height=i,data=data)
    plt.xlabel("quality")
    plt.ylabel(i)
    plt.show()
"""

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler,Normalizer, RobustScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
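# Features: every column except the two target columns.
X=data.drop(columns=['quality','wine_quality'])
# Target: the binned quality label (0, 1 or 2) created earlier with pd.cut.
Y=data['wine_quality']
X.head()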

**Scaling of the data**
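
With its default settings, RobustScaler centres each feature on its median and divides by the interquartile range (75th minus 25th percentile), so a few extreme values have little influence on the scaling. The sketch below reproduces that arithmetic with NumPy on a hypothetical single-column array; it illustrates the idea and is not a substitute for the scikit-learn class.

```python
import numpy as np

# Hypothetical feature column with one extreme value.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(X_toy, axis=0)                # centre of the data
q1, q3 = np.percentile(X_toy, [25, 75], axis=0)  # first and third quartiles
X_robust = (X_toy - median) / (q3 - q1)          # centre on median, scale by IQR

print(X_robust.ravel())
```

The four ordinary values end up between -1 and 0.5, while the extreme value stays clearly separated instead of compressing the rest of the column.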

In [None]:
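sc=RobustScaler()
# Keep the scaled features in their own variable so the `data` DataFrame is not overwritten.
# Note: the scaler is fitted on all rows before the split, so test rows influence the scaling statistics.
X_scaled=sc.fit_transform(X)
X_scaled[:5]

In [None]:
train_x,test_x,train_y,test_y=train_test_split(X_scaled,Y,test_size=0.15,random_state=2)
print("shape of train input data:",train_x.shape,"\nshape of train output data",train_y.shape,
      "\nshape of test input data ",test_x.shape,"\nshape of test output data",test_y.shape)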

### LogisticRegression Model

In [None]:
model=LogisticRegression()
model.fit(train_x,train_y)

In [None]:
pred=model.predict(test_x)
pred[:5]

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions.
print("Accuracy Score:",accuracy_score(test_y,pred))
print("classification Report:\n",classification_report(test_y,pred))
print("confusion Matrix:\n",confusion_matrix(test_y,pred))

### Support Vector Machine 

In [None]:
svc=SVC()
svc.fit(train_x,train_y)

In [None]:
pred=svc.predict(test_x)
pred[:5]

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions.
print("Accuracy Score:",accuracy_score(test_y,pred))
print("classification Report:\n",classification_report(test_y,pred))
print("confusion Matrix:\n",confusion_matrix(test_y,pred))


**Grid Search CV**

In [None]:
#parameters
param= {'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2,1.3,1.4,1.5],
       'kernel':["linear","rbf"],
       'gamma':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2,1.3,1.4,1.5]}


In [None]:
grid_svc= GridSearchCV(svc, param_grid=param, scoring='accuracy', cv=5)
grid_svc.fit(train_x,train_y)

In [None]:
grid_svc.best_params_
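
It helps to know how much work an exhaustive grid implies. The sketch below counts the candidates in the grid above with `itertools.product`. It also counts a hypothetical split grid (`split_grid`, an assumption of this example) in which `gamma` is searched only for the rbf kernel, since the linear kernel does not use `gamma`; GridSearchCV accepts such a list of dictionaries as `param_grid`.

```python
from itertools import product

values = [round(0.1 * k, 1) for k in range(1, 16)]  # 0.1, 0.2, ..., 1.5

# Same grid as in the notebook cell above.
grid = {"C": values, "kernel": ["linear", "rbf"], "gamma": values}
n_candidates = len(list(product(*grid.values())))
n_fits = n_candidates * 5  # cv=5 fits every candidate five times

# Hypothetical alternative: search gamma only where it has an effect.
split_grid = [
    {"kernel": ["linear"], "C": values},
    {"kernel": ["rbf"], "C": values, "gamma": values},
]
n_split = sum(len(list(product(*g.values()))) for g in split_grid)

print(n_candidates, n_fits, n_split)
```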

In [None]:
svc2=SVC(C= 1, gamma= 0.9, kernel='rbf')

svc2.fit(train_x,train_y)

In [None]:
pred=svc2.predict(test_x)
pred[:5]

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions.
print("Accuracy Score:",accuracy_score(test_y,pred))
print("classification Report:\n",classification_report(test_y,pred))
print("confusion Matrix:\n",confusion_matrix(test_y,pred))

### Cross validation

In [None]:
svc_eval = cross_val_score(estimator = svc2, X = train_x, y = train_y, cv =5,verbose =1)
svc_eval.mean()
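
Conceptually, 5-fold cross validation splits the training rows into five parts and lets each part serve once as the validation set while the model is trained on the other four. The sketch below shows only that index bookkeeping on ten hypothetical samples. It is a simplification: for a classifier with an integer `cv`, cross_val_score uses stratified folds, which additionally keep the class proportions similar in every fold.

```python
import numpy as np

n_samples, n_folds = 10, 5  # hypothetical sizes for illustration
indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)  # five consecutive chunks of indices

splits = []
for k in range(n_folds):
    val_idx = folds[k]  # fold k is held out for validation
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    splits.append((train_idx, val_idx))

for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

The score returned by cross_val_score is one value per fold, which is why the notebook takes the mean.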

## Random Forest Classifier

In [None]:
rf = RandomForestClassifier()
rf.fit(train_x,train_y)

In [None]:
pred_rf=rf.predict(test_x)
pred_rf[:5]

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions.
print("Accuracy Score:",accuracy_score(test_y,pred_rf))
print("classification Report:\n",classification_report(test_y,pred_rf))
print("confusion Matrix:\n",confusion_matrix(test_y,pred_rf))

In [None]:
#parameters
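# "auto" is omitted: for classifiers it meant the same as "sqrt" and newer scikit-learn versions reject it.
param= {'max_depth':range(3,10),
       'criterion':["gini","entropy"],
       'max_features':["sqrt", "log2"]}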

In [None]:
grid_rf= GridSearchCV(RandomForestClassifier(), param_grid=param, scoring='accuracy', cv=5)
grid_rf.fit(train_x,train_y)

In [None]:
grid_rf.best_score_

In [None]:
rf = grid_rf.best_estimator_
pred_rf=rf.predict(test_x)
pred_rf[:5]

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions.
print("Accuracy Score:",accuracy_score(test_y,pred_rf))
print("classification Report:\n",classification_report(test_y,pred_rf))
print("confusion Matrix:\n",confusion_matrix(test_y,pred_rf))

In [None]:
rf_eval = cross_val_score(estimator = rf, X = train_x, y = train_y, cv =5,verbose =1)
rf_eval.mean()

**Please upvote the notebook if you find it useful. Your comments are also welcome for improving the notebook.**