An FMCG firm would like to inspect product quality. They have gathered information on various features that affect product quality, which is classified into multiple levels. Your task is to predict the quality from the features provided. Fit a logistic regression model and report a score for the fitted model. Do you see similarities or differences between two solvers based on the score? (You can use any two solvers.)

About the dataset:<br>
* product_quality: The outcome variable
* V1 to V11: Independent variables/features for which information is captured.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data
product = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Classification-Models-main/data/product_quality.csv')
product.sample(10)

index,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,product_quality
836,6.7,0.28,0.28,2.4,0.012,36.0,100.0,0.99064,3.26,0.39,11.7,4
15,8.9,0.62,0.19,3.9,0.17,51.0,148.0,0.9986,3.17,0.93,9.2,2
184,6.7,0.62,0.21,1.9,0.079,8.0,62.0,0.997,3.52,0.58,9.3,3
1171,7.1,0.59,0.0,2.2,0.078,26.0,44.0,0.99522,3.42,0.68,10.8,3
964,8.5,0.47,0.27,1.9,0.058,18.0,38.0,0.99518,3.16,0.85,11.1,3
624,6.8,0.69,0.0,5.6,0.124,21.0,58.0,0.9997,3.46,0.72,10.2,2
49,5.6,0.31,0.37,1.4,0.074,12.0,96.0,0.9954,3.32,0.58,9.2,2
1435,10.2,0.54,0.37,15.4,0.214,55.0,95.0,1.00369,3.18,0.77,9.0,3
1004,8.2,0.43,0.29,1.6,0.081,27.0,45.0,0.99603,3.25,0.54,10.3,2
424,7.7,0.96,0.2,2.0,0.047,15.0,60.0,0.9955,3.36,0.44,10.9,2


In [3]:
# Let's check the shape
product.shape

(1599, 12)

In [4]:
# Let's check column types and non-null counts
product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   V1               1599 non-null   float64
 1   V2               1599 non-null   float64
 2   V3               1599 non-null   float64
 3   V4               1599 non-null   float64
 4   V5               1599 non-null   float64
 5   V6               1599 non-null   float64
 6   V7               1599 non-null   float64
 7   V8               1599 non-null   float64
 8   V9               1599 non-null   float64
 9   V10              1599 non-null   float64
 10  V11              1599 non-null   float64
 11  product_quality  1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [5]:
# Missing values
product.isnull().sum()

V1                 0
V2                 0
V3                 0
V4                 0
V5                 0
V6                 0
V7                 0
V8                 0
V9                 0
V10                0
V11                0
product_quality    0
dtype: int64

There are no missing values in any column.

In [6]:
# Let's check the levels of product quality
product['product_quality'].unique()

array([2, 3, 4, 1, 5, 0], dtype=int64)

There are 6 distinct levels (0 to 5), each indicating a product-quality grade.
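`unique()` lists the levels but not how many rows fall into each one. A frequency count is a quick way to see whether some levels are rare, which can affect a multinomial fit. The sketch below uses a small made-up Series as a stand-in for `product['product_quality']`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for product['product_quality']; the values are made up.
quality = pd.Series([2, 3, 3, 2, 4, 1, 3, 2, 5, 0, 3, 2], name="product_quality")

# Absolute counts per level, ordered by level rather than by frequency.
counts = quality.value_counts().sort_index()

# Share of rows per level, useful for spotting rare classes.
shares = quality.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```

On the real data, the same two calls on `product['product_quality']` would show how balanced the six levels are.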

In [7]:
# Let's check the pairwise correlations among the features in X
X = product.drop('product_quality', axis=1)
X.corr()

feature,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
V1,1.0,-0.256131,0.671703,0.114777,0.093705,-0.153794,-0.113181,0.668047,-0.682978,0.183006,-0.061668
V2,-0.256131,1.0,-0.552496,0.001918,0.061298,-0.010504,0.07647,0.022026,0.234937,-0.260987,-0.202288
V3,0.671703,-0.552496,1.0,0.143577,0.203823,-0.060978,0.035533,0.364947,-0.541904,0.31277,0.109903
V4,0.114777,0.001918,0.143577,1.0,0.05561,0.187049,0.203028,0.355283,-0.085652,0.005527,0.042075
V5,0.093705,0.061298,0.203823,0.05561,1.0,0.005562,0.0474,0.200632,-0.265026,0.37126,-0.221141
V6,-0.153794,-0.010504,-0.060978,0.187049,0.005562,1.0,0.667666,-0.021946,0.070377,0.051658,-0.069408
V7,-0.113181,0.07647,0.035533,0.203028,0.0474,0.667666,1.0,0.071269,-0.066495,0.042947,-0.205654
V8,0.668047,0.022026,0.364947,0.355283,0.200632,-0.021946,0.071269,1.0,-0.341699,0.148506,-0.49618
V9,-0.682978,0.234937,-0.541904,-0.085652,-0.265026,0.070377,-0.066495,-0.341699,1.0,-0.196648,0.205633
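V10,0.183006,-0.260987,0.31277,0.005527,0.37126,0.051658,0.042947,0.148506,-0.196648,1.0,0.093595
V11,-0.061668,-0.202288,0.109903,0.042075,-0.221141,-0.069408,-0.205654,-0.49618,0.205633,0.093595,1.0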


In [8]:
# Check correlation with y
y = product['product_quality']
X.corrwith(y)

V1     0.124052
V2    -0.390558
V3     0.226373
V4     0.013732
V5    -0.128907
V6    -0.050656
V7    -0.185100
V8    -0.174919
V9    -0.057731
V10    0.251397
V11    0.476166
dtype: float64

**V11 has the strongest correlation with y (0.48), followed in magnitude by V2 (-0.39) and V10 (0.25)**

In [9]:
# Let's check multicollinearity using variance inflation factors (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V1       74.452265
V2       17.060026
V3        9.183495
V4        4.662992
V5        6.554877
V6        6.442682
V7        6.519699
V8     1479.287209
V9     1070.967685
V10      21.590621
V11     124.394866
dtype: float64

In [10]:
# Let's drop V8 (highest VIF) and recompute
X.drop('V8', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V1      40.216574
V2      17.058940
V3       9.149028
V4       4.662789
V5       6.017799
V6       6.390157
V7       6.096300
V9     158.025734
V10     21.552410
V11    121.980842
dtype: float64

In [11]:
# Let's drop V9 (highest remaining VIF) and recompute
X.drop('V9', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V1     37.557809
V2     15.651760
V3      8.636667
V4      4.660704
V5      5.935002
V6      6.357695
V7      5.987189
V10    21.218142
V11    37.137148
dtype: float64

In [12]:
# Let's drop V1 and retain V11 (V11 has the stronger correlation with y)
X.drop('V1', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V2     12.880017
V3      4.913939
V4      4.637831
V5      5.903420
V6      6.349669
V7      5.900754
V10    20.648712
V11    31.915246
dtype: float64

In [13]:
# Let's drop V2 to reduce collinearity (retain V11 and V10); note that V2's correlation with y (-0.39) is the second strongest in magnitude
X.drop('V2', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V3      3.429373
V4      4.589696
V5      5.220931
V6      6.175750
V7      5.503560
V10    20.446909
V11    18.092200
dtype: float64

In [14]:
# Let's drop V6
X.drop('V6', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V3      3.357480
V4      4.557597
V5      5.217760
V7      3.034367
V10    20.362169
V11    17.698264
dtype: float64

In [15]:
# Let's drop V5
X.drop('V5', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V3      3.329190
V4      4.527246
V7      3.028535
V10    16.706696
V11    17.650067
dtype: float64

In [16]:
# Now we will drop V10 (Still retain V11 due to better correlation)
X.drop('V10', axis=1, inplace=True)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

V3     3.034249
V4     4.521110
V7     2.963910
V11    6.292263
dtype: float64

**The VIFs have come down drastically (all are now below 7). We retain V11 because it has the strongest correlation with y. Let's train the model and look at the p-values.**
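The manual drop-and-recompute steps above can be approximated with a loop. The sketch below is one possible automation, not the exact rule followed in the cells above: it always drops the highest-VIF column that is not protected, whereas the notebook's choices were made by judgment. The function name `drop_high_vif`, the threshold of 10, and the `keep` argument are assumptions introduced for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif(features, threshold=10.0, keep=()):
    """Drop the highest-VIF column not in `keep` until droppable VIFs are <= threshold."""
    remaining = features.copy()
    dropped = []
    while remaining.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(remaining.values, i)
             for i in range(remaining.shape[1])],
            index=remaining.columns,
        )
        # Protected columns are never candidates for removal.
        candidates = vifs.drop(labels=list(keep), errors="ignore")
        if candidates.empty or candidates.max() <= threshold:
            break
        worst = candidates.idxmax()
        remaining = remaining.drop(columns=worst)
        dropped.append(worst)
    return remaining, dropped


# Synthetic data: x2 is nearly a copy of x1, x3 is independent (made-up values).
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.05 * rng.normal(size=300),
    "x3": rng.normal(size=300),
})

kept, dropped = drop_high_vif(demo, threshold=10.0, keep=("x1",))
print("dropped:", dropped)
print("kept:", list(kept.columns))
```

On the notebook's data, a call such as `drop_high_vif(X, keep=("V11",))` would mirror the decision to protect V11, although the resulting feature set may differ from the one chosen manually.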

In [17]:
# Train the model
mlogit_mod = sm.MNLogit(y, X)
mlogit_res = mlogit_mod.fit()
print(mlogit_res.summary())

Optimization terminated successfully.
         Current function value: 1.123618
         Iterations 9
                          MNLogit Regression Results                          
Dep. Variable:        product_quality   No. Observations:                 1599
Model:                        MNLogit   Df Residuals:                     1579
Method:                           MLE   Df Model:                           15
Date:                Thu, 16 Jun 2022   Pseudo R-squ.:                 0.05150
Time:                        22:34:31   Log-Likelihood:                -1796.7
converged:                       True   LL-Null:                       -1894.2
Covariance Type:            nonrobust   LLR p-value:                 2.080e-33
product_quality=1       coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
V3                   -0.3256      2.003     -0.163      0.871      -4.252       3.600
V4      

In [18]:
# Log-likelihood of the null model; it does not depend on the features in X
mlogit_res.llnull

-1894.2253774681112

### Alternate Solvers

In [19]:
# Refit the same model with the BFGS solver (fit() returns a results object)
bfgs_mod = mlogit_mod.fit(method='bfgs', maxiter=250)
print(bfgs_mod.summary())

Optimization terminated successfully.
         Current function value: 1.123618
         Iterations: 106
         Function evaluations: 112
         Gradient evaluations: 112
                          MNLogit Regression Results                          
Dep. Variable:        product_quality   No. Observations:                 1599
Model:                        MNLogit   Df Residuals:                     1579
Method:                           MLE   Df Model:                           15
Date:                Thu, 16 Jun 2022   Pseudo R-squ.:                 0.05150
Time:                        22:34:31   Log-Likelihood:                -1796.7
converged:                       True   LL-Null:                       -1894.2
Covariance Type:            nonrobust   LLR p-value:                 2.080e-33
product_quality=1       coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
V3                   

In [20]:
bfgs_mod.llnull

-1894.2253774681112

In [21]:
# Refit the same model with the derivative-free Powell solver
powell_mod = mlogit_mod.fit(method='powell', maxiter=250)
print(powell_mod.summary())

Optimization terminated successfully.
         Current function value: 1.130565
         Iterations: 13
         Function evaluations: 3406
                          MNLogit Regression Results                          
Dep. Variable:        product_quality   No. Observations:                 1599
Model:                        MNLogit   Df Residuals:                     1579
Method:                           MLE   Df Model:                           15
Date:                Thu, 16 Jun 2022   Pseudo R-squ.:                 0.04564
Time:                        22:34:33   Log-Likelihood:                -1807.8
converged:                       True   LL-Null:                       -1894.2
Covariance Type:            nonrobust   LLR p-value:                 6.379e-29
product_quality=1       coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
V3                    6.0534      3.168      1.911      

In [22]:
powell_mod.llnull

-1894.2253774681112

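**1. The llnull values are identical across solvers, which is expected: llnull is the log-likelihood of the null model and does not depend on the solver used for the full fit**<br>
**2. The default Newton solver and BFGS reach the same fit (log-likelihood -1796.7, pseudo R-squared 0.05150), while Powell stops at a slightly worse one (log-likelihood -1807.8, pseudo R-squared 0.04564), so its coefficients and p-values differ**<br>
**3. All the solvers used above estimate the parameters by maximum likelihood (MLE); they differ only in how they search for the maximum**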
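The wording of the task ("score", "solvers") also matches scikit-learn's `LogisticRegression`, where `solver` is a constructor argument and `score` returns mean accuracy. This notebook does not use scikit-learn, so the following is only a sketch of how that comparison could look. It runs on synthetic data from `make_classification`; the dataset settings and variable names are assumptions, and the printed accuracies say nothing about the product-quality data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (made-up data, not the product-quality dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

scores = {}
for solver in ("lbfgs", "newton-cg"):
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X_train, y_train)
    # score() returns mean accuracy on the given data.
    scores[solver] = clf.score(X_test, y_test)
    print(f"{solver:>9}: test accuracy = {scores[solver]:.3f}")
```

Both solvers minimise the same regularised objective, so their scores are expected to be very close when each converges.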