This assignment is a Kaggle competition:

https://www.kaggle.com/c/simplesentiment/data

In [1]:
import numpy as np
import pandas as pd 

Load and inspect the data

In [188]:
train_df = pd.read_csv('products_sentiment_train.tsv', sep='\t', header=None)

test_df = pd.read_csv('products_sentiment_test.tsv', sep='\t', index_col='Id')

sample_subm = pd.read_csv('products_sentiment_sample_submission.csv', sep=',', index_col='Id')

In [189]:
train_df.columns = ['text', 'target']

train_df.head()

   text                                               target
0  2 . take around 10,000 640x480 pictures .          1
1  i downloaded a trial version of computer assoc...  1
2  the wrt54g plus the hga7t is a perfect solutio...  1
3  i dont especially like how music files are uns...  0
4  i was using the cheapie pail ... and it worked...  1


Check the class balance in the training set

In [35]:
train_df[train_df.target==0].shape, train_df[train_df.target == 1].shape

((726, 2), (1274, 2))
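
With 726 negative and 1274 positive examples, the classes are moderately imbalanced. The sketch below uses only the two counts printed above to work out the accuracy a constant classifier would reach by always predicting the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the output above.
n_negative, n_positive = 726, 1274
n_total = n_negative + n_positive

# Share of each class in the training set.
share_positive = n_positive / n_total
share_negative = n_negative / n_total

# Always predicting the majority class is correct this often on the training data.
baseline_accuracy = max(share_positive, share_negative)

print(f"positive: {share_positive:.3f}, negative: {share_negative:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth submitting should score clearly above this baseline.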

Split the data into the feature (text) and the target variable (target)

In [111]:
from sklearn.model_selection import cross_val_score, GridSearchCV

X_train, y_train = train_df.text, train_df.target

The competition is scored by accuracy.
Let's find out which classifier and vectorizer work best and, along the way, tune the n-gram range.
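
To make the ngram_range parameter concrete, here is a rough pure-Python sketch of the word n-grams a vectorizer would extract for a given range. The helper word_ngrams and the example sentence are made up for illustration. The regular expression copies the default token_pattern of CountVectorizer, which keeps only tokens of two or more word characters.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams of `text` for every n in the inclusive range."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

example = "the battery life is not good"  # made-up review, not from the dataset
print(word_ngrams(example, (1, 1)))
print(word_ngrams(example, (1, 2)))
```

With unigrams only, "not" and "good" are separate features; with ngram_range=(1, 2) the bigram "not good" becomes a feature of its own, which is the kind of signal a sentiment model can use.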

In [115]:
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [116]:
%%time
vects = [CountVectorizer(), TfidfVectorizer()]
models = [LogisticRegression(), LinearSVC()]

# Grid over the n-gram range of the vectorizer step
# (the step is named 'count_vect' for both vectorizers).
params = {
    'count_vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)],
}

# Try every vectorizer/classifier pair with 5-fold cross-validation.
for vect in vects:
    for model in models:
        pipe = Pipeline([
            ('count_vect', vect),
            ('log_reg', model)
        ])

        cls = GridSearchCV(pipe, params, cv=5)
        cls.fit(X_train, y_train)
        print(vect, '\n', model, '\n', cls.best_params_, '\n', cls.best_score_)
        print('-' * 50)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None) 
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
 {'count_vect__ngram_range': (1, 2)} 
 0.7705
--------------------------------------------------
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_w

We select two models:

    TfidfVectorizer(ngram_range=(1,3)), LinearSVC
    CountVectorizer(ngram_range=(1,2)), LogisticRegression

In [117]:
import nltk

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Цифра-2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Tune their hyperparameters:

In [121]:
%%time
params = {
            'count_vect__stop_words' : [None, stop_words],
            'log_reg__C' : np.linspace(0.1, 10, 30),
        }

pipe_1 = Pipeline([
    ('count_vect', TfidfVectorizer(ngram_range=(1,4))),
    ('log_reg', LinearSVC())
    ])
        
cls_1 = GridSearchCV(pipe_1, params, cv=5)
cls_1.fit(X_train, y_train)
print(cls_1.best_params_, '\n', cls_1.best_score_)

pipe_2 = Pipeline([
    ('count_vect', CountVectorizer(ngram_range=(1,2))),
    ('log_reg', LogisticRegression())
    ])

cls_2 = GridSearchCV(pipe_2, params, cv=5)
cls_2.fit(X_train, y_train)
print(cls_2.best_params_, '\n', cls_2.best_score_)

{'count_vect__stop_words': None, 'log_reg__C': 4.196551724137931} 
 0.7945
{'count_vect__stop_words': None, 'log_reg__C': 2.831034482758621} 
 0.7765
Wall time: 2min 30s


We will use the best value of C found for each model.
Removing stop words does not improve quality: the search picked stop_words=None for both models.
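
As a quick sanity check on the search range, the sketch below rebuilds the grid np.linspace(0.1, 10, 30) in plain Python and locates the two reported optima on it. The helper nearest_index is hypothetical and used only for this illustration.

```python
# Rebuild the grid np.linspace(0.1, 10, 30) with plain Python:
# 30 evenly spaced values from 0.1 to 10 inclusive.
start, stop, num = 0.1, 10.0, 30
step = (stop - start) / (num - 1)
c_grid = [start + i * step for i in range(num)]

best_c_svc = 4.196551724137931      # reported for TfidfVectorizer + LinearSVC
best_c_logreg = 2.831034482758621   # reported for CountVectorizer + LogisticRegression

def nearest_index(grid, value):
    """Index of the grid point closest to `value`."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))

for name, c in [("LinearSVC", best_c_svc), ("LogisticRegression", best_c_logreg)]:
    i = nearest_index(c_grid, c)
    print(f"{name}: C={c:.4f} is grid point {i} of {num - 1} (step {step:.4f})")
```

Both optima lie in the interior of the grid rather than at either end, so the range from 0.1 to 10 does not look too narrow for these models.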

Tune the min_df parameter

In [125]:
min_df_arr = range(1, 5)

for min_df in min_df_arr:
    print('min_df=', min_df)
    
    pipe_1 = Pipeline([
        ('count_vect', TfidfVectorizer(min_df=min_df, ngram_range=(1,4))),
        ('log_reg', LinearSVC(C=4.196551724137931))
    ])
    
    score = np.mean(cross_val_score(pipe_1, X_train, y_train))
    print('model_1:', score)

    pipe_2 = Pipeline([
        ('count_vect', CountVectorizer(min_df=min_df, ngram_range=(1,2))),
        ('log_reg', LogisticRegression(C=2.831034482758621))
    ])

    score = np.mean(cross_val_score(pipe_2, X_train, y_train))
    print('model_2:', score)


min_df= 1
model_1: 0.783998391194793
model_2: 0.7680071375723551
min_df= 2
model_1: 0.7585006295650975
model_2: 0.7630066348207278
min_df= 3
model_1: 0.7455071263167216
model_2: 0.7595068831950392
min_df= 4
model_1: 0.7350071210640925
model_2: 0.7530086308197252


Tuning min_df did not improve quality: both models score highest at the default min_df=1.
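
For intuition about what min_df does, the sketch below builds a vocabulary from a made-up mini corpus and drops every term that occurs in fewer than min_df documents. The helper vocabulary and the four example reviews are assumptions for illustration; splitting on whitespace is a simplification of the real tokenizer.

```python
from collections import Counter

def vocabulary(docs, min_df=1):
    """Terms that occur in at least `min_df` documents (document frequency)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Made-up mini corpus, for illustration only.
docs = [
    "great camera great price",
    "terrible battery",
    "great battery life",
    "camera stopped working",
]

for min_df in range(1, 4):
    vocab = vocabulary(docs, min_df)
    print(f"min_df={min_df}: {len(vocab)} terms -> {vocab}")
```

Even a small increase in min_df removes most of the vocabulary here. On a training set of 2000 short reviews, long n-grams are likely to be rare, which may be why raising min_df hurts the scores above.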

Refit the models and check which one scores better on the test set (by submitting to Kaggle)

In [126]:
pipe_1 = Pipeline([
        ('count_vect', TfidfVectorizer(ngram_range=(1,4))),
        ('log_reg', LinearSVC(C=4.196551724137931))
])

pipe_2 = Pipeline([
        ('count_vect', CountVectorizer(ngram_range=(1,2))),
        ('log_reg', LogisticRegression(C=2.831034482758621))
])

np.mean(cross_val_score(pipe_1, X_train, y_train)),\
np.mean(cross_val_score(pipe_2, X_train, y_train))

(0.783998391194793, 0.7680071375723551)
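
These scores (0.784 and 0.768) are lower than the grid-search scores for the same settings (0.7945 and 0.7765). One likely reason is that cross_val_score is called here without cv, while GridSearchCV used cv=5. The estimator defaults printed earlier suggest an older scikit-learn release, in which cross_val_score fell back to 3 folds (newer releases use 5). The sketch below, with a hypothetical helper fold_sizes, shows how much training data each setting leaves per fold under that assumption. It uses plain k-fold arithmetic; the stratified splitting applied to classifiers gives similar sizes.

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes for a k-fold split: as even as possible."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_samples = 2000  # 726 + 1274 training rows

for n_splits in (3, 5):
    sizes = fold_sizes(n_samples, n_splits)
    train_sizes = [n_samples - s for s in sizes]
    print(f"{n_splits}-fold: validation sizes {sizes}, training sizes {train_sizes}")
```

With 3 folds each model is fit on roughly 1333 rows instead of 1600, so somewhat lower scores are expected. Passing cv=5 explicitly to cross_val_score would make the two sets of numbers directly comparable.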

In [209]:
cls_1 = pipe_1.fit(X_train, y_train)
test_df['y'] = cls_1.predict(test_df.text)

In [210]:
ans_1 = test_df[["y"]]

In [211]:
ans_1.to_csv('submission_1.csv')

In [212]:
sample = pd.read_csv('submission_1.csv', index_col="Id")

Current score: 0.80888

Now try the other model

In [213]:
cls_2 = pipe_2.fit(X_train, y_train)
test_df['y'] = cls_2.predict(test_df.text)
ans_2 = test_df[["y"]]
ans_2.to_csv('submission_2.csv')

Current score: 0.79111
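
To summarize, the sketch below puts the 5-fold grid-search scores next to the Kaggle scores reported in this notebook and checks whether cross-validation and the leaderboard agree on the better model. The dictionary layout is an illustrative choice; the numbers are the ones printed above.

```python
# Scores reported in this notebook.
results = {
    "TfidfVectorizer + LinearSVC": {"cv": 0.7945, "kaggle": 0.80888},
    "CountVectorizer + LogisticRegression": {"cv": 0.7765, "kaggle": 0.79111},
}

for name, scores in results.items():
    gap = scores["kaggle"] - scores["cv"]
    print(f"{name}: cv={scores['cv']:.4f}, kaggle={scores['kaggle']:.5f}, gap={gap:+.4f}")

best_by_cv = max(results, key=lambda name: results[name]["cv"])
best_by_kaggle = max(results, key=lambda name: results[name]["kaggle"])
print("same winner on both:", best_by_cv == best_by_kaggle)
```

Both estimates rank TfidfVectorizer with LinearSVC first, and both models score about 0.014 higher on Kaggle than in cross-validation.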