Adding a new feature:
- Build a bag-of-words dataset
- Train a classifier for each feature
- Extract the feature vectors (weights) of the initially given classifiers
- Reduce the matrix to triangular form
- Isolate the part that is not explained in the feature basis
- Apply single-component PCA
- Obtain a classifier for that component
- Compare it with the real classifier for the held-out feature
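
The steps above can be summarized in matrix form. This is only a sketch of the computation as it reads from the code in this notebook; the symbols $B$, $P$, $R$, and $v$ are notation introduced here for illustration and do not appear in the original.

```latex
\begin{aligned}
B &\in \mathbb{R}^{k \times d} && \text{rows are the weight vectors of the } k \text{ known classifiers} \\
P &= B^{+} B && \text{orthogonal projector onto the row space of } B \\
R &= X (I - P) && \text{part of the bag-of-words matrix } X \text{ not explained by } B \\
v &= \arg\max_{\lVert u \rVert = 1} \lVert R u \rVert^{2} && \text{top eigenvector of } R^{\top} R
\end{aligned}
```

Here $B^{+}$ is the Moore-Penrose pseudoinverse, and $v$ plays the role of the weight vector of the new, unnamed feature.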

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA, SparsePCA
from sklearn.model_selection import GridSearchCV
import numpy as np
from tqdm import tqdm
import joblib  # sklearn.externals.joblib was removed from scikit-learn
import scipy.sparse as sps
import scipy.sparse.linalg  # makes sps.linalg available

# Building the dataset

In [2]:
reviews = pd.read_csv("reviews_with_wine_features.csv")

In [3]:
import ast  # literal_eval is safer than eval for parsing stored list literals
reviews["filtered_description"] = reviews["filtered_description"].apply(lambda x: " ".join(ast.literal_eval(x)))

In [4]:
def get_bag_of_words(texts, vectorizer=None):
    # Fit a new vectorizer only when none is supplied, so a fitted one can be reused
    if vectorizer is None:
        vectorizer = CountVectorizer()
        vectorizer.fit(texts)
    transformed_texts = vectorizer.transform(texts)
    return transformed_texts, vectorizer
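
As an illustration of why the helper returns the vectorizer, here is a minimal sketch that reuses a fitted vocabulary on new texts. The two toy corpora are made up for this example and are not taken from the reviews dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_bag_of_words(texts, vectorizer=None):
    # Same logic as the helper above: fit only when no vectorizer is supplied
    if vectorizer is None:
        vectorizer = CountVectorizer()
        vectorizer.fit(texts)
    return vectorizer.transform(texts), vectorizer

train_texts = ["dry red wine", "sweet white wine"]  # toy data (assumption)
new_texts = ["dry white wine with oak"]             # toy data (assumption)

X_train, vectorizer = get_bag_of_words(train_texts)
X_new, _ = get_bag_of_words(new_texts, vectorizer)

# Both matrices share the vocabulary learned from train_texts;
# words unseen during fitting ("with", "oak") are ignored.
vocabulary = sorted(vectorizer.vocabulary_)
```

Passing the fitted vectorizer keeps the column order identical, which matters because the classifier weights are indexed by vocabulary position.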

In [5]:
features = [c for c in reviews.columns if "_feature" in c]
reviews = pd.get_dummies(reviews, columns=features)

In [6]:
features = [c for c in reviews.columns if "_feature" in c]
features = [c for c in features if "madeFromGrape" not in c]

In [7]:
# Keep only the features that are present in more than 1% of the reviews
big_enough_features = np.array(features)[(reviews[features].mean() > 0.01).values]

In [8]:
big_enough_features

array(['locatedIn_feature_Alsace', 'locatedIn_feature_Bordeaux',
       'locatedIn_feature_Bourgogne', 'locatedIn_feature_California',
       'locatedIn_feature_Italy', 'locatedIn_feature_Portugal',
       'locatedIn_feature_US', 'hasSugar_feature_Dry',
       'hasSugar_feature_OffDry', 'hasSugar_feature_Sweet',
       'hasBody_feature_Full', 'hasBody_feature_Light',
       'hasBody_feature_Medium', 'hasFlavor_feature_Delicate',
       'hasFlavor_feature_Moderate', 'hasFlavor_feature_Strong',
       'hasColor_feature_Red', 'hasColor_feature_White'],
      dtype='<U32')

In [9]:
selected_features = ['hasSugar_feature_Dry',
       'hasSugar_feature_OffDry', 'hasSugar_feature_Sweet',
       'hasBody_feature_Full', 'hasBody_feature_Light',
       'hasBody_feature_Medium', 'hasFlavor_feature_Delicate',
       'hasFlavor_feature_Moderate', 'hasFlavor_feature_Strong',
       'hasColor_feature_Red', 'hasColor_feature_White']

In [10]:
valid_classes = []
for feature in selected_features:
    valid_classes.append(reviews[feature])

# Training the classifiers

In [11]:
def train_classifier(X, y):
    # Search over the inverse regularization strength C with 3-fold CV, scored by ROC AUC
    grid = {
        'C': np.linspace(0.01, 1, 5)
    }
    classifier = LogisticRegression()
    search = GridSearchCV(estimator=classifier, param_grid=grid, cv=3, scoring='roc_auc', verbose=1)
    search.fit(X, y)
    return search

In [18]:
classifiers = []
X, vectorizer = get_bag_of_words(reviews["filtered_description"])

In [13]:
for y in tqdm(valid_classes):
    grid_search = train_classifier(X, y)
    print(grid_search.best_score_)
    classifiers.append(grid_search.best_estimator_)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
0.849040513917
0.875700409435
0.971809883513
0.828378039282
0.852603006123
0.79070285652
0.86636420522
0.7997471509
0.837659502629
0.905891657549
0.966083063716

In [14]:
joblib.dump(classifiers, 'classifiers.pkl')

['classifiers.pkl']

In [19]:
classifiers = joblib.load('classifiers.pkl')

# Building the basis

The red/white color features are held out: their classifiers (the last two) are excluded from the basis.

In [20]:
selected_classifiers = classifiers[:-2]

In [21]:
features_vectors = np.array([c.coef_[0] for c in selected_classifiers])

In [22]:
# NOTE: np.triu only zeroes the entries below the main diagonal; it does not perform elimination
features_basis = np.triu(features_vectors)
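
A caveat on the triangular form: `np.triu` zeroes the entries below the main diagonal without performing any elimination, so in general the rows of the result span a different subspace than the original weight vectors. If the intent is a triangular factor that preserves the span, a QR decomposition is one possible alternative. The sketch below uses a random toy matrix in place of `features_vectors`; the helper `projector` is hypothetical and introduced only for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 6))  # toy stand-in for features_vectors (k=3, d=6)

# np.triu zeroes entries below the main diagonal; no elimination is performed
truncated = np.triu(vectors)

# QR of the transpose: vectors.T = Q @ R, with R upper triangular.
# The columns of Q are orthonormal and span the row space of `vectors`.
Q, R = np.linalg.qr(vectors.T)

def projector(basis_rows):
    # Orthogonal projector onto the row space of basis_rows (hypothetical helper)
    return np.linalg.pinv(basis_rows) @ basis_rows

same_span_qr = np.allclose(projector(Q.T), projector(vectors))
same_span_triu = np.allclose(projector(truncated), projector(vectors))
```

On this toy input the QR basis reproduces the original projector while the truncated matrix does not, which suggests checking whether `np.triu` is really the intended operation.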

In [23]:
inverted_features_basis = np.linalg.pinv(features_basis)

In [24]:
unexplained_X = X - (X * sps.csr_matrix(features_basis).T) * sps.csr_matrix(inverted_features_basis).T
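
To make the projection step concrete, here is a small dense sketch of the same expression on toy data. The names `X_toy`, `basis`, and `basis_pinv` are illustrative assumptions. The residual should be orthogonal to every basis vector, which gives a quick sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(5, 4))  # 5 samples, 4 words (toy data)
basis = rng.normal(size=(2, 4))  # 2 known feature directions (toy data)

basis_pinv = np.linalg.pinv(basis)  # shape (4, 2)

# Same expression as in the cell above, written with dense arrays
residual = X_toy - (X_toy @ basis.T) @ basis_pinv.T

# Every residual row is orthogonal to every basis vector
dot_products = residual @ basis.T
```

Because `pinv(basis) @ basis` is the orthogonal projector onto the row space of `basis`, subtracting the projected part leaves only what the known features cannot express.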

# Building the new feature and comparing it with the true values

In [25]:
unexplained_X.shape

(129971, 19113)

In [None]:
covariance = unexplained_X.T * unexplained_X

In [None]:
# The matrix is symmetric, so eigsh applies and returns real values (eigs returns complex ones)
eigenvalues, eigenvectors = sps.linalg.eigsh(covariance, k=1)

In [None]:
# Share of the total (uncentered) variance captured by the first component
eigenvalues / covariance.diagonal().sum()
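
The ratio of the top eigenvalue to the trace can be checked on a small dense example. This sketch uses `np.linalg.eigh` instead of the sparse solver and a made-up matrix `R_toy` in place of `unexplained_X`; like the notebook, it does not center the data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy residual matrix whose first column dominates the variance
R_toy = rng.normal(size=(50, 5)) * np.array([5.0, 1.0, 1.0, 1.0, 1.0])

gram = R_toy.T @ R_toy  # uncentered, as in the notebook
eigenvalues, eigenvectors = np.linalg.eigh(gram)  # real output, ascending order
top_value = eigenvalues[-1]
top_vector = eigenvectors[:, -1]

# Fraction of the total squared norm captured by the top direction
share = top_value / np.trace(gram)

# Cross-check: squared singular values of R_toy are the eigenvalues of gram
singular_values = np.linalg.svd(R_toy, compute_uv=False)
```

The cross-check also shows that a truncated SVD of the residual matrix would give the same direction without forming the `d x d` product explicitly.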

In [None]:
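# Project every review onto the principal direction: (n, d) @ (d, 1) gives n values
new_feature_values = np.asarray(unexplained_X @ eigenvectors).ravel()
new_feature_coefficients = eigenvectors.T  # shape (1, d), the layout expected by coef_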

In [None]:
# Assumed intent: threshold at the midpoint of the widest gap between consecutive sorted values
sorted_values = np.sort(np.asarray(new_feature_values).ravel())
gap_index = np.diff(sorted_values).argmax()
new_feature_bias = -(sorted_values[gap_index] + sorted_values[gap_index + 1]) / 2  # w.x + b > 0 iff w.x > threshold
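
One reading of the bias computation is "place the decision threshold in the widest gap between projected values". The sketch below implements that reading on toy scores; the function name `largest_gap_threshold` is hypothetical, and whether this matches the original intent is an assumption.

```python
import numpy as np

def largest_gap_threshold(values):
    # Midpoint of the widest gap between consecutive sorted values
    sorted_values = np.sort(np.asarray(values, dtype=float))
    gap_index = np.diff(sorted_values).argmax()
    return (sorted_values[gap_index] + sorted_values[gap_index + 1]) / 2

scores = np.array([0.9, -1.1, 1.2, -0.8, 1.0, -1.0])  # toy projections, two clusters
threshold = largest_gap_threshold(scores)
labels = (scores > threshold).astype(int)
```

For a linear classifier with decision rule `w.x + b > 0`, the matching intercept would be `b = -threshold`.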

The comparison is made by classifying the reviews and inspecting the results.

In [None]:
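new_classifier = LogisticRegression()
# Set the fitted attributes by hand; predict() also needs classes_
new_classifier.coef_ = np.real(new_feature_coefficients).reshape(1, -1)
new_classifier.intercept_ = np.atleast_1d(new_feature_bias)
new_classifier.classes_ = np.array([0, 1])
new_feature_probabilities = new_classifier.predict(X)  # hard labels, not probabilities
real_feature_probabilities = classifiers[-1].predict(X)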

In [None]:
new_feature_probabilities - real_feature_probabilities
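
The element-wise difference is hard to read at this scale, so a single agreement score may be more convenient. Since the sign of a principal component is arbitrary, the new classifier may predict the inverted label; the sketch below accounts for that. The function `agreement` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def agreement(predicted, actual):
    # Fraction of matching labels, allowing for a flipped sign of the component
    predicted = np.asarray(predicted).astype(int)
    actual = np.asarray(actual).astype(int)
    accuracy = np.mean(predicted == actual)
    return max(accuracy, 1.0 - accuracy)

predicted = np.array([1, 1, 0, 0, 1, 0])  # toy labels from the new classifier
actual = np.array([0, 0, 1, 1, 0, 0])     # toy labels from the real classifier
score = agreement(predicted, actual)
```

A score near 0.5 would indicate that the recovered component is unrelated to the held-out feature, while a score near 1 would indicate a match up to sign.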