## Lesson 7. Linear Models for Classification

https://scikit-learn.ru/1-2-linear-and-quadratic-discriminant-analysis/

https://scikit-learn.ru/3-3-metrics-and-scoring-quantifying-the-quality-of-predictions/

In [21]:
#!pip install pingouin

In [31]:
import os
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
import pingouin as pg

### Logistic Regression

In [2]:
os.chdir("C:/Users/HP/Documents/analysis/Marketing/data/")

In [3]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)

In [4]:
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [5]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)

In [6]:
# Logistic Regression Classification
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
print(results.std())

0.7708646616541353
0.05090500786917546
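To make explicit what `cross_val_score` does with the `KFold` object, here is a minimal sketch that repeats the same procedure by hand. It runs on synthetic data from `make_classification` (an assumption, used only so the example is self-contained); the names `X_demo`, `y_demo` and `kfold_demo` are hypothetical and not part of the lesson's notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Pima data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=7)

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# Manual loop: fit on nine folds, score accuracy on the held-out fold.
manual_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    clf = LogisticRegression(solver='liblinear')
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in one call; for a classifier the default score is accuracy.
auto_scores = cross_val_score(LogisticRegression(solver='liblinear'),
                              X_demo, y_demo, cv=kfold_demo)

print(manual_scores.mean(), auto_scores.mean())
```

The two means should agree, which shows that the reported mean and standard deviation are taken over ten per-fold accuracies.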


In [7]:
model.fit(X,Y)

LogisticRegression(solver='liblinear')

In [8]:
print("Coefficient shape: ", model.coef_.shape)
print("Intercept shape: ", model.intercept_.shape)

Coefficient shape:  (1, 8)
Intercept shape:  (1,)


In [9]:
model.coef_

array([[ 1.16994476e-01,  2.83733435e-02, -1.68981359e-02,
         7.55145090e-04, -6.41407258e-04,  5.97201268e-02,
         6.76128123e-01,  7.23498971e-03]])

In [10]:
model.intercept_

array([-5.88679617])

In [11]:
x_new=np.array([[0,136,42,34,136,43,2,32],[0,137,40,35,168,43.1,2.288,33]])

In [12]:
model.predict(x_new)

array([1., 1.])

In [14]:
model.predict_proba(x_new)

array([[0.20547146, 0.79452854],
       [0.16759435, 0.83240565]])

For each observation, the first column gives the probability of class 0 and the second column the probability of class 1.
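These probabilities can be reproduced by hand from the fitted parameters. The sketch below copies the values printed for `model.coef_` and `model.intercept_` above and applies the logistic (sigmoid) function to the linear score; the variable names `z`, `p_class1` and `proba` are introduced here for illustration only.

```python
import numpy as np

# Coefficients and intercept copied from the model.coef_ and model.intercept_ outputs above.
coef = np.array([[1.16994476e-01, 2.83733435e-02, -1.68981359e-02,
                  7.55145090e-04, -6.41407258e-04, 5.97201268e-02,
                  6.76128123e-01, 7.23498971e-03]])
intercept = np.array([-5.88679617])

x_new = np.array([[0, 136, 42, 34, 136, 43, 2, 32],
                  [0, 137, 40, 35, 168, 43.1, 2.288, 33]])

# Linear score z = x . w + b, one value per observation.
z = x_new @ coef.T + intercept

# The sigmoid maps the score to P(class = 1).
p_class1 = 1.0 / (1.0 + np.exp(-z))

# Stack P(class = 0) and P(class = 1) in the same column order as predict_proba.
proba = np.hstack([1.0 - p_class1, p_class1])
print(proba)
```

Up to rounding of the copied coefficients, the result should match the `predict_proba` output; `predict` returns class 1 whenever the second column exceeds 0.5.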

### Support Vector Machines

https://scikit-learn.org/stable/modules/svm.html#svm-kernels

In [38]:
model = SVC(kernel='linear')
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
print(results.std())

0.7708133971291866
0.03832392703883875


In [39]:
model.fit(X,Y)

SVC(kernel='linear')

In [40]:
print("Coefficient shape: ", model.coef_.shape)
print("Intercept shape: ", model.intercept_.shape)

Coefficient shape:  (1, 8)
Intercept shape:  (1,)


In [41]:
model.coef_

array([[ 9.14692398e-02,  3.00467168e-02, -1.09051239e-02,
        -4.81652166e-03, -4.21253324e-04,  7.33928948e-02,
         7.15054906e-01,  7.26041287e-03]])

In [42]:
model.intercept_

array([-6.74051604])

In [30]:
model.predict(x_new)

array([1., 1.])
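For a linear SVM the prediction is determined by the sign of the linear score. The following sketch recomputes that score from the `coef_` and `intercept_` values printed above (copied by hand, so small rounding differences are possible); `scores` and `pred` are illustrative names.

```python
import numpy as np

# Values copied from the model.coef_ and model.intercept_ outputs of the linear SVC above.
coef = np.array([[9.14692398e-02, 3.00467168e-02, -1.09051239e-02,
                  -4.81652166e-03, -4.21253324e-04, 7.33928948e-02,
                  7.15054906e-01, 7.26041287e-03]])
intercept = np.array([-6.74051604])

x_new = np.array([[0, 136, 42, 34, 136, 43, 2, 32],
                  [0, 137, 40, 35, 168, 43.1, 2.288, 33]])

# Signed score for each observation: positive on the class-1 side of the hyperplane.
scores = (x_new @ coef.T + intercept).ravel()

# Class 1 if the score is positive, class 0 otherwise.
pred = (scores > 0).astype(float)
print(scores, pred)
```

Both scores are positive, which is consistent with the `array([1., 1.])` returned by `model.predict(x_new)`. Unlike logistic regression, `SVC` does not provide `predict_proba` unless it is created with `probability=True`.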

In [34]:
model = SVC(kernel='rbf')
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

0.760457963089542


In [35]:
model.fit(X,Y)

SVC()

In [37]:
model.predict(x_new)

array([0., 0.])
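The RBF model disagrees with the linear models on `x_new`. One possible reason (an assumption, not verified on the lesson's data) is that the RBF kernel depends on Euclidean distances, so features measured on large scales dominate. A common remedy is to standardize the features inside a pipeline, so that the scaler is refitted on each training fold. The sketch below uses synthetic data, and `X_demo`, `y_demo`, `kfold_demo` and `scaled_rbf` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Pima data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=7)
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# StandardScaler is fitted on the training part of each fold only,
# which avoids leaking information from the held-out fold.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

scores = cross_val_score(scaled_rbf, X_demo, y_demo, cv=kfold_demo)
print(scores.mean(), scores.std())
```

With the lesson's data, the same pipeline could be passed to `cross_val_score(scaled_rbf, X, Y, cv=kfold)` and compared with the unscaled result.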

### Linear Discriminant Analysis

Henze-Zirkler Multivariate Normality Test in Python
https://www.statology.org/multivariate-normality-test-python/

In [27]:
#perform the Henze-Zirkler Multivariate Normality Test
pg.multivariate_normality(X, alpha=.05)

HZResults(hz=4.1804860273838855, pval=0.0, normal=False)

Conclusion: we reject the null hypothesis that the discriminant variables are multivariate normal.

In [37]:
pg.normality(dataframe)

| | W | pval | normal |
|---|---|---|---|
| preg | 0.904278 | 1.608089e-21 | False |
| plas | 0.970104 | 1.986761e-11 | False |
| pres | 0.818921 | 1.584007e-28 | False |
| skin | 0.904627 | 1.751576e-21 | False |
| test | 0.722021 | 7.915248e-34 | False |
| mass | 0.949989 | 1.840562e-15 | False |
| pedi | 0.836519 | 2.477697e-27 | False |
| age | 0.874766 | 2.401947e-24 | False |
| class | 0.60251 | 1.292262e-38 | False |


Conclusion: according to the Shapiro-Wilk test, none of the features is normally distributed. (The `class` row refers to the binary label, not to a feature, and can be ignored.)
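The normality assumption of discriminant analysis concerns the distribution of the features within each class, so it can be informative to run the test per class as well. The sketch below does this with `scipy.stats.shapiro` on a single synthetic feature; the data and the names `labels`, `feature` and `results` are hypothetical and serve only as illustration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(7)

# Hypothetical data: class 0 is drawn from a normal distribution,
# class 1 from a strongly skewed (exponential) distribution.
labels = np.repeat([0, 1], 250)
feature = np.concatenate([rng.normal(size=250), rng.exponential(size=250)])

# Shapiro-Wilk test within each class separately.
results = {}
for cls in np.unique(labels):
    stat, pval = shapiro(feature[labels == cls])
    results[int(cls)] = (stat, pval)
    print(f"class {cls}: W={stat:.4f}, p={pval:.3g}, normal={pval > 0.05}")
```

For the lesson's data, the same loop could be applied to each column of `X` with `Y` as the label vector.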

Box’s M test https://pingouin-stats.org/generated/pingouin.box_m.html

In [34]:
pg.box_m(dataframe, dvs=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'], group='class')

AttributeError: module 'pingouin' has no attribute 'box_m'

The call raised an AttributeError in the local environment but ran successfully in Colab. Result: the covariance matrices of the classes are not equal.
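If `pg.box_m` is not available, the core of the test can be inspected directly. The sketch below computes per-class covariance matrices with NumPy and the Box's M statistic, M = (N - k) ln|S_pooled| - Σ (n_i - 1) ln|S_i|. It computes only the statistic, not the chi-square approximation or the p-value, and it uses synthetic data; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical two-class data: class 1 has three times the spread of class 0.
X0 = rng.normal(scale=1.0, size=(500, 3))
X1 = rng.normal(scale=3.0, size=(268, 3))
n0, n1 = len(X0), len(X1)

# Sample covariance matrix of each class (columns are variables).
cov0 = np.cov(X0, rowvar=False)
cov1 = np.cov(X1, rowvar=False)

# Pooled covariance matrix, weighted by degrees of freedom.
pooled = ((n0 - 1) * cov0 + (n1 - 1) * cov1) / (n0 + n1 - 2)

# Log-determinants; slogdet returns (sign, log|det|).
logdet0 = np.linalg.slogdet(cov0)[1]
logdet1 = np.linalg.slogdet(cov1)[1]
logdet_pooled = np.linalg.slogdet(pooled)[1]

# Box's M statistic: zero when all class covariance matrices coincide.
box_m_stat = (n0 + n1 - 2) * logdet_pooled - (n0 - 1) * logdet0 - (n1 - 1) * logdet1
print(logdet0, logdet1, box_m_stat)
```

A large value of the statistic indicates that the class covariance matrices differ; a formal decision still requires the p-value from a full implementation such as `pg.box_m`.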

In [35]:
# Check the linear independence of the discriminant variables
pg.pairwise_corr(dataframe, columns=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'], method='pearson')

| | X | Y | method | tail | n | r | CI95% | p-unc | BF10 | power |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | preg | plas | pearson | two-sided | 768 | 0.129459 | [0.06, 0.2] | 0.0003219491 | 28.791 | 0.949814 |
| 1 | preg | pres | pearson | two-sided | 768 | 0.141282 | [0.07, 0.21] | 8.541846e-05 | 100.059 | 0.975945 |
| 2 | preg | skin | pearson | two-sided | 768 | -0.081672 | [-0.15, -0.01] | 0.02360795 | 0.583 | 0.619944 |
| 3 | preg | test | pearson | two-sided | 768 | -0.073535 | [-0.14, -0.0] | 0.04162094 | 0.359 | 0.531407 |
| 4 | preg | mass | pearson | two-sided | 768 | 0.017683 | [-0.05, 0.09] | 0.6246376 | 0.051 | 0.077839 |
| 5 | preg | pedi | pearson | two-sided | 768 | -0.033523 | [-0.1, 0.04] | 0.3535346 | 0.069 | 0.152973 |
| 6 | preg | age | pearson | two-sided | 768 | 0.544341 | [0.49, 0.59] | 1.862813e-60 | 9.03e+56 | 1.0 |
| 7 | plas | pres | pearson | two-sided | 768 | 0.15259 | [0.08, 0.22] | 2.169507e-05 | 366.214 | 0.989169 |
| 8 | plas | skin | pearson | two-sided | 768 | 0.057328 | [-0.01, 0.13] | 0.1124141 | 0.159 | 0.35523 |
| 9 | plas | test | pearson | two-sided | 768 | 0.331357 | [0.27, 0.39] | 3.882624e-21 | 8.929e+17 | 1.0 |


Conclusion: the linear association between the discriminant variables is mostly weak. Among the pairs shown, the strongest correlation is between preg and age (r ≈ 0.54).
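A compact way to scan all pairs at once is the full correlation matrix. The following sketch uses `np.corrcoef` on a synthetic feature matrix in which one pair of columns is deliberately correlated; the data and the names `X_demo`, `corr` and `off_diag` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical feature matrix; column 2 is built to correlate with column 0 (r about 0.6).
X_demo = rng.normal(size=(768, 4))
X_demo[:, 2] = 0.6 * X_demo[:, 0] + 0.8 * X_demo[:, 2]

# Pearson correlation matrix; rowvar=False treats columns as variables.
corr = np.corrcoef(X_demo, rowvar=False)

# Zero out the diagonal and locate the strongest pairwise association.
off_diag = np.abs(corr - np.eye(corr.shape[0]))
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(f"strongest pair: columns {i} and {j}, r = {corr[i, j]:.3f}")
```

Applied to the lesson's `X`, correlations close to 1 in absolute value would signal near-collinear discriminant variables.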

Formally, linear discriminant analysis is not appropriate here because its assumptions are violated: the class covariance matrices are not equal, and the discriminant variables are not multivariate normal.
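When only the equal-covariance assumption fails, quadratic discriminant analysis (QDA) is a commonly used alternative, because it estimates a separate covariance matrix per class. It still assumes normality within each class, so it does not resolve the second violation. The sketch below compares the two models on synthetic data whose classes share a mean but differ in spread; the data and all names are hypothetical, and the result says nothing about the Pima data.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)

# Hypothetical data: same class means, but class 1 has three times the spread.
X_demo = np.vstack([rng.normal(scale=1.0, size=(500, 2)),
                    rng.normal(scale=3.0, size=(500, 2))])
y_demo = np.repeat([0, 1], 500)

kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# LDA can only draw a linear boundary; QDA can separate classes by spread.
lda_scores = cross_val_score(LinearDiscriminantAnalysis(), X_demo, y_demo, cv=kfold_demo)
qda_scores = cross_val_score(QuadraticDiscriminantAnalysis(), X_demo, y_demo, cv=kfold_demo)

print("LDA:", lda_scores.mean())
print("QDA:", qda_scores.mean())
```

On the lesson's data the same comparison could be run with `X`, `Y` and `kfold`.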

In [67]:
# LDA Classification
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
print(results.std())

0.7669685577580315
0.04796563054016723


In [68]:
model.fit(X,Y)

LinearDiscriminantAnalysis()

In [45]:
model.predict(x_new)

array([1., 1.])

In [46]:
model.predict_proba(x_new)

array([[0.13996295, 0.86003705],
       [0.10537183, 0.89462817]])

https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2

https://machinelearningmastery.com/linear-discriminant-analysis-with-python/

In [69]:
model.transform(X)[:5]

array([[ 1.38024702],
       [-1.56452399],
       [ 1.76618515],
       [-1.69058752],
       [ 2.20726804]])

In [70]:
# Values of the discriminant function. There is only one (number of classes minus 1): 2 - 1 = 1
X_lda=model.fit_transform(X,Y)
X_lda[:5]

array([[ 1.38024702],
       [-1.56452399],
       [ 1.76618515],
       [-1.69058752],
       [ 2.20726804]])

In [71]:
#We can access the following property to obtain the variance explained by each component.
model.explained_variance_ratio_

array([1.])
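With two classes there is a single discriminant function, so its explained variance ratio is trivially 1. To see the "number of classes minus 1" rule in a less trivial case, here is a minimal sketch with three classes on synthetic data from `make_classification`; the dataset and the names `X_demo`, `y_demo`, `lda` and `X_lda` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical three-class dataset with 8 features.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     n_redundant=0, n_classes=3,
                                     n_clusters_per_class=1, random_state=7)

lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X_demo, y_demo)

# Three classes give 3 - 1 = 2 discriminant functions, hence two columns.
print(X_lda.shape)

# Share of between-class variance captured by each discriminant function.
print(lda.explained_variance_ratio_)
```

The two ratios sum to 1, and their relative sizes show how much of the class separation the first discriminant function captures on its own.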