# Overview
Many factors affect the quality of wine: the kind of grape or fruit used, the fermentation process, storage and ageing methods, and the environment and region where the fruit was cultivated. Wines may be categorised by taste, scent, and colour, and their quality ranges from inexpensive table wines to high-end fine wines.

The goal of wine quality prediction is to build data-driven models that reliably predict a wine's quality, helping wine producers, wine growers, and merchants make decisions about their goods.

To find underlying patterns, we first analyse the train and original datasets. Models are then trained on the train dataset.

# Importing The Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

In [None]:
df_train=pd.read_csv('/kaggle/input/playground-series-season-3-episode-5/train.csv')
df_test=pd.read_csv('/kaggle/input/playground-series-season-3-episode-5/test.csv')
submission=pd.read_csv("/kaggle/input/playground-series-season-3-episode-5/sample_submission.csv")

In [None]:
df_train


Unnamed: 0,Id,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,0,8.0,0.50,0.39,2.20,0.073,30.0,39.0,0.99572,3.33,0.77,12.1,6
1,1,9.3,0.30,0.73,2.30,0.092,30.0,67.0,0.99854,3.32,0.67,12.8,6
2,2,7.1,0.51,0.03,2.10,0.059,3.0,12.0,0.99660,3.52,0.73,11.3,7
3,3,8.1,0.87,0.22,2.60,0.084,11.0,65.0,0.99730,3.20,0.53,9.8,5
4,4,8.5,0.36,0.30,2.30,0.079,10.0,45.0,0.99444,3.20,1.36,9.5,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2051,2051,6.6,0.31,0.13,2.00,0.056,29.0,42.0,0.99388,3.52,0.87,12.0,7
2052,2052,9.7,0.59,0.21,1.80,0.079,27.0,65.0,0.99745,3.14,0.58,9.4,5
2053,2053,7.7,0.43,0.42,1.70,0.071,19.0,37.0,0.99258,3.32,0.77,12.5,8
2054,2054,9.1,0.50,0.00,1.75,0.058,5.0,13.0,0.99670,3.22,0.42,9.5,5


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2056 entries, 0 to 2055
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         2056 non-null   float64
 1   volatile acidity      2056 non-null   float64
 2   citric acid           2056 non-null   float64
 3   residual sugar        2056 non-null   float64
 4   chlorides             2056 non-null   float64
 5   free sulfur dioxide   2056 non-null   float64
 6   total sulfur dioxide  2056 non-null   float64
 7   density               2056 non-null   float64
 8   pH                    2056 non-null   float64
 9   sulphates             2056 non-null   float64
 10  alcohol               2056 non-null   float64
 11  quality               2056 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 192.9 KB


**Insights:**

* There are no missing values in any of the 1143 samples in the original dataset.

* There are no missing values in any of the 2056 samples in the train dataset.

* There are 1372 non-null observations in the test dataset.

* The features found in these datasets include: fixed acidity (g/L), volatile acidity (g/L), citric acid (g/L), residual sugar (g/L), chlorides (g/L), free sulfur dioxide (mg/L), total sulfur dioxide (mg/L), density (g/mL), pH, sulphates (g/L), and alcohol content (% vol.).
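The missing-value counts in the insights above can be checked with a small helper. This is a minimal sketch on a toy frame with made-up values; `missing_report` and `toy` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a wine dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "fixed acidity": [8.0, 9.3, np.nan],
    "pH": [3.33, 3.32, 3.52],
    "quality": [6, 6, 7],
})

def missing_report(df):
    """Return the number of missing values in each column."""
    return df.isna().sum()

report = missing_report(toy)
total_missing = int(report.sum())
print(report)
print("total missing:", total_missing)
```

Calling `missing_report(df_train)` in the same way should give zero for every column if the dataset has no missing values.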

In [None]:
df_train.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0,2056.0
mean,8.365175,0.527601,0.265058,2.398881,0.081856,16.955982,49.236868,0.996748,3.310569,0.641308,10.414972,5.720817
std,1.70539,0.173164,0.188267,0.858824,0.023729,10.00971,32.961141,0.001827,0.142321,0.137942,1.028825,0.853146
min,5.0,0.18,0.0,1.2,0.012,1.0,7.0,0.99007,2.74,0.39,8.7,3.0
25%,7.2,0.39,0.09,1.9,0.071,8.0,22.0,0.9956,3.2,0.55,9.5,5.0
50%,7.95,0.52,0.25,2.2,0.079,16.0,44.0,0.9967,3.31,0.61,10.1,6.0
75%,9.2,0.64,0.42,2.6,0.09,24.0,65.0,0.9978,3.39,0.72,11.0,6.0
max,15.9,1.58,0.76,14.0,0.414,68.0,289.0,1.00369,3.78,1.95,14.0,8.0


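**Adding a ratio feature and dropping the columns which are not useful:**

In [None]:
# Ratio of fixed to volatile acidity as an extra feature (added to both train and test)
df_train['volatile_new_acidity']=df_train['fixed acidity']/df_train['volatile acidity']
df_test['volatile_new_acidity']=df_test['fixed acidity']/df_test['volatile acidity']
# Drop the identifier column, which carries no predictive information
df_train=df_train.drop(['Id'],axis=1)
df_train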

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,volatile_new_acidity
0,8.0,0.50,0.39,2.20,0.073,30.0,39.0,0.99572,3.33,0.77,12.1,6,16.000000
1,9.3,0.30,0.73,2.30,0.092,30.0,67.0,0.99854,3.32,0.67,12.8,6,31.000000
2,7.1,0.51,0.03,2.10,0.059,3.0,12.0,0.99660,3.52,0.73,11.3,7,13.921569
3,8.1,0.87,0.22,2.60,0.084,11.0,65.0,0.99730,3.20,0.53,9.8,5,9.310345
4,8.5,0.36,0.30,2.30,0.079,10.0,45.0,0.99444,3.20,1.36,9.5,6,23.611111
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2051,6.6,0.31,0.13,2.00,0.056,29.0,42.0,0.99388,3.52,0.87,12.0,7,21.290323
2052,9.7,0.59,0.21,1.80,0.079,27.0,65.0,0.99745,3.14,0.58,9.4,5,16.440678
2053,7.7,0.43,0.42,1.70,0.071,19.0,37.0,0.99258,3.32,0.77,12.5,8,17.906977
2054,9.1,0.50,0.00,1.75,0.058,5.0,13.0,0.99670,3.22,0.42,9.5,5,18.200000


In [None]:
X=df_train.drop(['quality'],axis=1)
y=df_train['quality']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [None]:
test_df=df_test

# MinMaxScaler
* MinMaxScaler subtracts the minimum value in the feature and then divides by the range. The range is the difference between the original maximum and original minimum.

* MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully change the information embedded in the original data.

* Note that MinMaxScaler doesn’t reduce the importance of outliers.
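The points above can be illustrated with a minimal sketch on toy data (the arrays below are made up, not taken from the wine dataset). It compares `MinMaxScaler` with the formula `(x - min) / (max - min)` computed by hand, and shows the usual pattern of fitting on the training rows and reusing that fit for new rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up): one feature, with an outlier (11.0) in the training rows.
train = np.array([[1.0], [2.0], [3.0], [11.0]])
test = np.array([[2.0], [6.0], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1 and max=11 from train only
test_scaled = scaler.transform(test)        # reuses the training min and range

# The same computation by hand: (x - min) / (max - min)
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
manual_train = (train - col_min) / col_range
manual_test = (test - col_min) / col_range

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because of the outlier, the other training values are squeezed into a narrow band near zero, and a test value above the training maximum maps to a value greater than 1.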

In [None]:
transf=MinMaxScaler()
new_scaled=pd.DataFrame(transf.fit_transform(X_train),columns=X_train.columns)
new_scaled

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,volatile_new_acidity
0,0.174312,0.271429,0.013158,0.117188,0.115789,0.373134,0.085106,0.606130,0.766667,0.096154,0.377358,0.129651
1,0.532110,0.071429,0.631579,0.070312,0.155263,0.074627,0.042553,0.263602,0.488889,0.121795,0.773585,0.567973
2,0.458716,0.000000,0.644737,0.046875,0.097368,0.074627,0.039007,0.671264,0.377778,0.217949,0.094340,0.851574
3,0.449541,0.150000,0.407895,0.218750,0.118421,0.044776,0.028369,0.609962,0.577778,0.141026,0.415094,0.347780
4,0.348624,0.228571,0.750000,0.117188,0.450000,0.164179,0.262411,0.449042,0.455556,0.346154,0.566038,0.217792
...,...,...,...,...,...,...,...,...,...,...,...,...
1537,0.155963,0.364286,0.105263,0.070312,0.231579,0.194030,0.049645,0.373946,0.588889,0.211538,0.150943,0.086048
1538,0.385321,0.471429,0.342105,0.078125,0.184211,0.761194,0.322695,0.563985,0.377778,0.147436,0.132075,0.106791
1539,0.293578,0.178571,0.289474,0.085938,0.126316,0.522388,0.294326,0.579310,0.566667,0.083333,0.245283,0.242335
1540,0.128440,0.178571,0.315789,0.039062,0.094737,0.074627,0.028369,0.380077,0.566667,0.160256,0.415094,0.172436


In [None]:
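# Caution: fit_transform refits the scaler on X_test, so this split is scaled with its own min/max;
# transf.transform(X_test) would instead reuse the min/max learned from X_train.
new_scaled_xtest=pd.DataFrame(transf.fit_transform(X_test),columns=X_test.columns)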
new_scaled_xtest

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,volatile_new_acidity
0,0.16,0.391304,0.136986,0.086957,0.292079,0.875000,0.568345,0.394516,0.692308,0.371795,0.24,0.092413
1,0.30,0.191304,0.369863,0.202899,0.227723,0.145833,0.093525,0.373953,0.461538,0.371795,0.64,0.254241
2,0.29,0.113043,0.369863,0.086957,0.227723,0.270833,0.201439,0.296268,0.413462,0.012821,0.48,0.347423
3,0.30,0.252174,0.410959,0.043478,0.272277,0.354167,0.733813,0.680122,0.442308,1.000000,0.08,0.203626
4,0.22,0.456522,0.178082,0.782609,0.316832,0.166667,0.122302,0.404417,0.740385,0.692308,0.48,0.087936
...,...,...,...,...,...,...,...,...,...,...,...,...
509,0.41,0.269565,0.383562,0.304348,0.346535,0.270833,0.690647,0.573496,0.519231,0.269231,0.06,0.229967
510,0.22,0.391304,0.369863,0.057971,0.341584,0.791667,0.431655,0.451637,0.759615,0.230769,0.18,0.108595
511,0.34,0.486957,0.095890,0.043478,0.227723,0.229167,0.280576,0.366337,0.413462,0.076923,0.40,0.107283
512,0.18,0.365217,0.109589,0.072464,0.257426,0.500000,0.381295,0.360244,0.682692,0.179487,0.10,0.106977


# Logistic Regression
Logistic regression takes its name from the logistic function, which is at the core of the technique. Statisticians first used this function to describe population growth. The logistic function is also known as the sigmoid function, and its inverse is the logit function.
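As a small illustration, the standard logistic (sigmoid) function and its inverse, the logit, can be sketched in plain Python. The function names `logistic` and `logit` are chosen here for clarity and are not taken from the notebook.

```python
import math

def logistic(x):
    """Standard logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic function: maps a probability in (0, 1) to the log-odds."""
    return math.log(p / (1.0 - p))

# Applying logit after logistic recovers the input, up to floating-point error.
for x in (-3.0, 0.0, 2.5):
    p = logistic(x)
    print(x, p, logit(p))
```

In logistic regression, a linear combination of the features plays the role of `x`, and `logistic(x)` is read as a class probability.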

In [None]:
lg=LogisticRegression()
lg.fit(new_scaled,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [None]:
y_pred=lg.predict(new_scaled_xtest)

In [None]:
lg.score(new_scaled, y_train)

0.5830090791180286

In [None]:
lg.score(new_scaled_xtest, y_test)

0.5214007782101168

In [None]:
metrics.accuracy_score(y_test,y_pred)

0.5214007782101168

# GridSearchCV
GridSearchCV searches a given grid of parameter values for the best combination, scoring each candidate with cross-validation. The model and the parameter grid are fed in, the best parameter values are extracted, and then the predictions are made.
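A minimal, self-contained sketch of this workflow on synthetic data is shown here. The data, the reduced grid, and the names `demo_pipe`, `demo_grid`, and `search` are assumptions for illustration only. Placing the scaler inside the pipeline is one possible design: it is then refit on the training portion of every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the wine features (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Scaler and classifier in one pipeline, so scaling is learned per training fold.
demo_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])

# 2 penalties x 5 values of C = 10 candidates.
demo_grid = {
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": np.logspace(-2, 2, 5),
}

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)  # best parameter combination found on the grid
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

Note that `best_score_` is a cross-validated score on the training data, so the selected model is not guaranteed to score higher on a separate held-out split.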

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('classifier' , LogisticRegression())])
param_grid = [
    {'classifier' : [LogisticRegression()],
     'classifier__penalty' : ['l1', 'l2'],
    'classifier__C' : np.logspace(-4, 4, 20),
    'classifier__solver' : ['liblinear']}
]

In [None]:
clf = GridSearchCV(pipe, param_grid = param_grid, cv = 5, verbose=True, n_jobs=-1)

# Fit on data

best_clf = clf.fit(new_scaled, y_train)
best_clf.best_estimator_

Fitting 5 folds for each of 40 candidates, totalling 200 fits


Pipeline(steps=[('classifier',
                 LogisticRegression(C=78.47599703514607, solver='liblinear'))])

**Accuracy**

In [None]:
lg1=LogisticRegression(C=78.47599703514607, solver='liblinear')
lg1.fit(new_scaled,y_train)
y_pred_new=lg1.predict(new_scaled_xtest)
metrics.accuracy_score(y_test,y_pred_new)

0.48249027237354086

In [None]:
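# Caution: a separate scaler is fitted on the competition test set here (Id included, then dropped),
# so these features are not scaled with the min/max learned from X_train.
trans=MinMaxScaler()
new_scaled_test=pd.DataFrame(trans.fit_transform(test_df),columns=test_df.columns)
new_scaled_test=new_scaled_test.drop(['Id'],axis=1)

# Predict with the default logistic regression model (lg), not the grid-searched lg1
y_pred1=lg.predict(new_scaled_test)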

In [None]:
submission['quality']=y_pred1
submission

Unnamed: 0,Id,quality
0,2056,5
1,2057,5
2,2058,5
3,2059,6
4,2060,6
...,...,...
1367,3423,5
1368,3424,6
1369,3425,5
1370,3426,5


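# Conclusion

On the held-out split, the default logistic regression (accuracy of about 0.521) performed better than the logistic regression tuned with GridSearchCV (accuracy of about 0.482).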

In [None]:
submission.to_csv("submission.csv",index=False)