The data set is adult.csv (see the materials for this lesson).

The target variable is the income level, income (the rightmost column).

Feature descriptions are available at http://www.cs.toronto.edu/~delve/data/adult/adultDetail.html

Build a logistic regression model that predicts a person's income level. If possible, try to improve the prediction accuracy (the score method) by searching over feature subsets.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

In [2]:
data = pd.read_csv('adult.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age                48842 non-null int64
workclass          48842 non-null object
fnlwgt             48842 non-null int64
education          48842 non-null object
educational-num    48842 non-null int64
marital-status     48842 non-null object
occupation         48842 non-null object
relationship       48842 non-null object
race               48842 non-null object
gender             48842 non-null object
capital-gain       48842 non-null int64
capital-loss       48842 non-null int64
hours-per-week     48842 non-null int64
native-country     48842 non-null object
income             48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


info() reports no null values in any column.

In [4]:
data.head()

,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
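
Row 4 of the preview contains `?` in workclass and occupation, so missing values appear to be stored as a placeholder string that info() does not count as null. A minimal sketch of counting such placeholders, using a toy frame (the name `sample` and its values are illustrative, not taken from adult.csv):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few adult.csv columns; the values are illustrative only.
sample = pd.DataFrame({
    'workclass': ['Private', '?', 'Local-gov', '?'],
    'occupation': ['Sales', '?', 'Protective-serv', 'Adm-clerical'],
    'age': [25, 18, 28, 44],
})

# Count '?' placeholders per column; info() does not treat them as missing.
placeholder_counts = sample.eq('?').sum()

# Optionally convert the placeholders to real NaN so that isna() can see them.
cleaned = sample.replace('?', np.nan)
missing_counts = cleaned.isna().sum()

print(placeholder_counts)
```

Whether to drop these rows, impute them, or keep `?` as its own category (as the one-hot encoding later in this notebook effectively does) is a modelling choice.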


Encode the target feature

In [5]:
target_encoder = LabelEncoder()
target_encoder.fit(data.income)
target_encoder.classes_

array(['<=50K', '>50K'], dtype=object)

In [6]:
data['income'] = target_encoder.transform(data.income)
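
For reference, a small sketch of what LabelEncoder does with the two income labels: classes are sorted, then replaced by their position. The list `labels` is a made-up sample, not the real column.

```python
from sklearn.preprocessing import LabelEncoder

# A made-up sample of the two income labels.
labels = ['<=50K', '>50K', '<=50K', '>50K']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code.
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}

# inverse_transform recovers the original string labels.
decoded = list(encoder.inverse_transform(encoded))

print(mapping)
```

In the reports below, class 0 therefore corresponds to `<=50K` and class 1 to `>50K`.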

In [7]:
num_features_0 = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Check which features can be transformed with one-hot encoding

In [8]:
ohe_features_candidats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship',
                         'race', 'gender', 'capital-loss', 'native-country']

In [9]:
pd.DataFrame(data={'Unique values' : [len(data[column_name].unique()) for column_name in ohe_features_candidats]},
             index=ohe_features_candidats)

,Unique values
workclass,9
education,16
marital-status,7
occupation,15
relationship,6
race,5
gender,2
capital-loss,99
native-country,42


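gender is encoded as a binary value; the remaining features, except capital-loss (already numeric) and native-country, are one-hot encoded into new columns. Note that the cell below builds d2 only from num_features_0 and the dummy columns, so the encoded gender column does not reach the feature matrix.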

In [10]:
ohe_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']

In [11]:
d2 = data.copy()
gender_encoder = LabelEncoder()
d2.gender = gender_encoder.fit_transform(d2.gender)
d2 = d2[num_features_0].join(pd.get_dummies(d2[ohe_features])).join(d2.income)
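
A minimal sketch, on toy data, of combining a binary-encoded gender column with one-hot columns so that it stays in the final frame. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy data with one numeric, one binary and one multi-valued categorical column.
toy = pd.DataFrame({
    'age': [25, 38, 18],
    'gender': ['Male', 'Male', 'Female'],
    'race': ['Black', 'White', 'White'],
    'income': [0, 0, 0],
})

# Binary-encode gender by hand: 1 for 'Male', 0 for 'Female'.
toy['gender'] = (toy['gender'] == 'Male').astype(int)

# One-hot encode the multi-valued categorical column.
dummies = pd.get_dummies(toy[['race']])

# Listing 'gender' explicitly keeps it in the final feature frame.
features = toy[['age', 'gender']].join(dummies).join(toy['income'])

print(list(features.columns))
```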

Split into train and test sets and train the model

In [12]:
d2.columns

Index(['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss',
       'hours-per-week', 'workclass_?', 'workclass_Federal-gov',
       'workclass_Local-gov', 'workclass_Never-worked', 'workclass_Private',
       'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',
       'workclass_State-gov', 'workclass_Without-pay', 'education_10th',
       'education_11th', 'education_12th', 'education_1st-4th',
       'education_5th-6th', 'education_7th-8th', 'education_9th',
       'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors',
       'education_Doctorate', 'education_HS-grad', 'education_Masters',
       'education_Preschool', 'education_Prof-school',
       'education_Some-college', 'marital-status_Divorced',
       'marital-status_Married-AF-spouse', 'marital-status_Married-civ-spouse',
       'marital-status_Married-spouse-absent', 'marital-status_Never-married',
       'marital-status_Separated', 'marital-status_Widowed', 'occupation_?',
       'occupatio

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    d2.loc[:, ~d2.columns.isin(['income'])],
    d2['income'],
    test_size=0.2,
    random_state=42)
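
The classes are imbalanced (the test set below has 7479 rows of class 0 and 2290 of class 1). The split above is not stratified; as an optional variation, passing `stratify` keeps the class ratio equal in both parts. A sketch on synthetic labels (the 75/25 ratio is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 75% class 0, 25% class 1 (illustrative only).
y = np.array([0] * 750 + [1] * 250)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

train_share = y_tr.mean()
test_share = y_te.mean()

print(train_share, test_share)
```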

Baseline logistic regression

In [14]:
%%time
model0 = LogisticRegression(solver='liblinear')
model0.fit(X_train, y_train)
y_predict = model0.predict(X_test)

Wall time: 226 ms


Evaluate the model

In [15]:
r0 = pd.DataFrame(classification_report(y_test, y_predict, output_dict=True)).T
r0

,f1-score,precision,recall,support
0,0.883042,0.812423,0.967108,7479.0
1,0.392902,0.715935,0.270742,2290.0
accuracy,0.803869,0.803869,0.803869,0.803869
macro avg,0.637972,0.764179,0.618925,9769.0
weighted avg,0.768146,0.789805,0.803869,9769.0
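
The report shows reasonable accuracy alongside a low recall for class 1. A plain-Python sketch of how these metrics relate, on made-up labels (the helper `binary_metrics` and the toy vectors are hypothetical, not part of the notebook):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: the model finds only one of the four positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

precision, recall, f1 = binary_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, f1, accuracy)
```

Accuracy counts every correct prediction, so on imbalanced data it can stay high while most positives are missed.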


Try to improve the model by tuning the logistic regression hyperparameter C (the inverse of the regularization strength)

In [16]:
cvals = 10 ** np.linspace(-3,2,10)
cvals

array([1.00000000e-03, 3.59381366e-03, 1.29154967e-02, 4.64158883e-02,
       1.66810054e-01, 5.99484250e-01, 2.15443469e+00, 7.74263683e+00,
       2.78255940e+01, 1.00000000e+02])
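
The grid holds ten values spaced evenly on a log scale from 1e-3 to 1e2. As a side note, `np.logspace` builds the same array directly; the variable names here are illustrative:

```python
import numpy as np

# Ten exponents from -3 to 2, raised as powers of 10 (the form used above).
grid_from_power = 10 ** np.linspace(-3, 2, 10)

# np.logspace takes the same start/stop exponents and uses base 10 by default.
grid_from_logspace = np.logspace(-3, 2, 10)

same = bool(np.allclose(grid_from_power, grid_from_logspace))
print(same)
```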

In [17]:
%%time

model1 = GridSearchCV(LogisticRegression(solver='liblinear'), {'C': cvals}, scoring='accuracy', cv=5)
model1.fit(X_train, y_train)

Wall time: 13.6 s


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([1.00000000e-03, 3.59381366e-03, 1.29154967e-02, 4.64158883e-02,
       1.66810054e-01, 5.99484250e-01, 2.15443469e+00, 7.74263683e+00,
       2.78255940e+01, 1.00000000e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

In [18]:
pd.DataFrame(model1.cv_results_)[['mean_test_score','std_test_score','params']].sort_values(
    by='mean_test_score', ascending=False).head()

,mean_test_score,std_test_score,params
9,0.796739,0.002958,{'C': 100.0}
0,0.796688,0.002864,{'C': 0.001}
1,0.796688,0.002864,{'C': 0.003593813663804626}
2,0.796688,0.002864,{'C': 0.01291549665014884}
3,0.796688,0.002864,{'C': 0.046415888336127795}


In [19]:
best_C = model1.best_params_['C']

In [20]:
y_predict = model1.predict(X_test)

Evaluate the model

In [21]:
r1 = pd.DataFrame((classification_report(y_test, y_predict, output_dict=True))).T
r1

,f1-score,precision,recall,support
0,0.883042,0.812423,0.967108,7479.0
1,0.392902,0.715935,0.270742,2290.0
accuracy,0.803869,0.803869,0.803869,0.803869
macro avg,0.637972,0.764179,0.618925,9769.0
weighted avg,0.768146,0.789805,0.803869,9769.0
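
The grid search left the test metrics unchanged, and the cross-validation scores differ only in the fourth decimal place across all values of C. One possible explanation, which this notebook does not verify, is that unscaled columns with very large values (such as fnlwgt) dominate the fit. A sketch of tuning C inside a pipeline with standardization, on synthetic data; the data, the names `pipe` and `search`, and the three-value grid are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data plus one uninformative column on a much larger scale.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
big_noise = rng.normal(loc=200_000, scale=100_000, size=(300, 1))
X = np.hstack([X, big_noise])

# Scaling happens inside the pipeline, so each CV fold is scaled on its own
# training part only.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])

# The 'clf__' prefix routes the parameter to the LogisticRegression step.
search = GridSearchCV(pipe, {'clf__C': [0.01, 1.0, 100.0]},
                      scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```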


Try to improve the model further with feature selection

In [22]:
%%time

model2 = LogisticRegression(C=best_C, solver='liblinear')
skf = StratifiedKFold(n_splits=4)
sfs_forward = SequentialFeatureSelector(model2, 
                  k_features="best", 
                  forward=True, 
                  floating=False, 
                  verbose=1,
                  scoring='roc_auc',
                  cv=skf,
                  n_jobs=-1)
    
sfs_forward = sfs_forward.fit(X_train.values, y_train.values, custom_feature_names=X_train.columns)

X_train_filtered = sfs_forward.transform(X_train)
model2.fit(X_train_filtered, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.2s
[Parallel(n_jobs=-1)]: Done  64 out of  64 | elapsed:    6.2s finished
Features: 1/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Done  63 out of  63 | elapsed:    4.2s finished
Features: 2/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done  62 out of  62 | elapsed:    8.7s finished
Features: 3/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done  61 out of  61 | elapsed:   17.2s finished
Features: 4/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 

Features: 45/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  19 out of  19 | elapsed:   11.2s finished
Features: 46/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:   10.9s finished
Features: 47/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:   10.1s finished
Features: 48/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:    9.4s finished
Features: 49/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    9.4s finished
Features: 50/64[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 out of  14 | elapsed:    8.8s finished
Features: 51/64[Parallel(n_j

Wall time: 19min 27s


LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
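
To illustrate the idea behind sequential forward selection, here is a simplified greedy loop: at each step it adds the single feature that gives the best cross-validated score, then reports the best subset seen. This is a sketch on synthetic data, not mlxtend's implementation; `forward_select` and its arguments are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(model, X, y, max_features, cv=3):
    """Greedily add the feature that maximizes mean CV ROC AUC."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Score every candidate subset formed by adding one more feature.
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y,
                               cv=cv, scoring='roc_auc').mean()
            for j in remaining
        }
        best_j = max(scores, key=scores.get)
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((tuple(selected), scores[best_j]))
    return selected, history


X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression(solver='liblinear')
selected, history = forward_select(model, X, y, max_features=3)

# Keep the subset size with the highest score among all sizes tried.
best_subset, best_score = max(history, key=lambda item: item[1])
print(best_subset, best_score)
```

Each step refits the model once per remaining feature and per fold, which is why the full run above over 64 features took about 19 minutes.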

In [23]:
# list of selected features
list(sfs_forward.k_feature_names_)

['age',
 'educational-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'workclass_?',
 'workclass_Federal-gov',
 'workclass_Local-gov',
 'workclass_Never-worked',
 'workclass_Private',
 'workclass_Self-emp-inc',
 'workclass_Self-emp-not-inc',
 'workclass_State-gov',
 'workclass_Without-pay',
 'education_12th',
 'education_1st-4th',
 'education_5th-6th',
 'education_7th-8th',
 'education_9th',
 'education_Assoc-acdm',
 'education_Doctorate',
 'education_Preschool',
 'education_Prof-school',
 'education_Some-college',
 'marital-status_Married-AF-spouse',
 'marital-status_Married-civ-spouse',
 'marital-status_Married-spouse-absent',
 'marital-status_Never-married',
 'marital-status_Widowed',
 'occupation_Armed-Forces',
 'occupation_Exec-managerial',
 'occupation_Farming-fishing',
 'occupation_Handlers-cleaners',
 'occupation_Machine-op-inspct',
 'occupation_Other-service',
 'occupation_Priv-house-serv',
 'occupation_Prof-specialty',
 'occupation_Protective-serv',
 'occupation_S

In [24]:
y_predict = model2.predict(X_test[list(sfs_forward.k_feature_names_)])

Evaluate the model

In [25]:
r2 = pd.DataFrame((classification_report(y_test, y_predict, output_dict=True))).T
r2

,f1-score,precision,recall,support
0,0.90732,0.882167,0.933948,7479.0
1,0.655397,0.733117,0.592576,2290.0
accuracy,0.853926,0.853926,0.853926,0.853926
macro avg,0.781358,0.807642,0.763262,9769.0
weighted avg,0.848265,0.847228,0.853926,9769.0


#### Summary table

In [32]:
regressors = ['default LogisticRegression', 'GridSearch+LogisticRegression', 'SequentialFeatureSelector+LogisticRegression']
metrics = list(r0.index)
idx = pd.MultiIndex.from_product([regressors, metrics], names=['regressors', 'metrics'])

In [33]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the reports the same way
mt = pd.concat([r0, r1, r2])
mt.index = idx
mt

regressors,metrics,f1-score,precision,recall,support
default LogisticRegression,0,0.883042,0.812423,0.967108,7479.0
default LogisticRegression,1,0.392902,0.715935,0.270742,2290.0
default LogisticRegression,accuracy,0.803869,0.803869,0.803869,0.803869
default LogisticRegression,macro avg,0.637972,0.764179,0.618925,9769.0
default LogisticRegression,weighted avg,0.768146,0.789805,0.803869,9769.0
GridSearch+LogisticRegression,0,0.883042,0.812423,0.967108,7479.0
GridSearch+LogisticRegression,1,0.392902,0.715935,0.270742,2290.0
GridSearch+LogisticRegression,accuracy,0.803869,0.803869,0.803869,0.803869
GridSearch+LogisticRegression,macro avg,0.637972,0.764179,0.618925,9769.0
GridSearch+LogisticRegression,weighted avg,0.768146,0.789805,0.803869,9769.0
SequentialFeatureSelector+LogisticRegression,0,0.90732,0.882167,0.933948,7479.0
SequentialFeatureSelector+LogisticRegression,1,0.655397,0.733117,0.592576,2290.0
SequentialFeatureSelector+LogisticRegression,accuracy,0.853926,0.853926,0.853926,0.853926
SequentialFeatureSelector+LogisticRegression,macro avg,0.781358,0.807642,0.763262,9769.0
SequentialFeatureSelector+LogisticRegression,weighted avg,0.848265,0.847228,0.853926,9769.0
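
As an alternative to building the MultiIndex by hand, `pd.concat` can label each report through its `keys` argument. A sketch with two tiny placeholder reports; the numbers and the names `r_a`, `r_b` are illustrative, not results from this notebook:

```python
import pandas as pd

# Placeholder per-model reports; the values are made up for illustration.
r_a = pd.DataFrame({'precision': [0.8, 0.7], 'recall': [0.9, 0.3]},
                   index=['0', '1'])
r_b = pd.DataFrame({'precision': [0.9, 0.7], 'recall': [0.9, 0.6]},
                   index=['0', '1'])

# keys adds an outer index level naming the model each row came from.
summary = pd.concat([r_a, r_b], keys=['model A', 'model B'],
                    names=['regressors', 'metrics'])

# Rows can then be addressed by (model, metric) pairs.
value = summary.loc[('model B', '1'), 'recall']
print(summary)
```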


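<b>Summary:</b> the first and second models produced identical results on the test set (accuracy 0.804). The third model, trained on the selected features, improved accuracy to 0.854 and the class 1 F1-score from 0.393 to 0.655. The selected feature list omits fnlwgt.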