# <center>Sentiment Analysis of Movie Reviews
## <center>Part 1: Building Simple Models

### Data
We will use the standard movie review dataset from **nltk**. First, import the required modules:

In [105]:
from nltk.corpus import movie_reviews
import nltk

In [99]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

Get the ids of the negative and positive reviews:

In [4]:
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

Get the list of negative reviews (each review is a sequence of tokens):

In [5]:
negfeats = [movie_reviews.words(fileids=[f]) for f in negids]

Get the list of positive reviews:

In [6]:
posfeats = [movie_reviews.words(fileids=[f]) for f in posids]

Create a list with the texts of all available reviews, along with a list of the class labels our classifier will use:

    0 - for negative reviews and
    1 - for positive reviews.

In [7]:
allfeats = negfeats + posfeats  # list with the texts of all available reviews

In [8]:
class_list = [0] * len(negfeats) + [1] * len(posfeats)  # list of class labels

In [9]:
# helper for writing an answer to a file
def write_answer(answer, filename, path='answers'):
    name = path + '/' + filename
    with open(name, 'w') as file:
        file.write(str(answer))
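
`write_answer` assumes that the `answers` directory already exists; otherwise `open` raises `FileNotFoundError`. A minimal sketch of a variant that creates the directory first (the name `write_answer_safe` is hypothetical and not part of the original notebook):

```python
import os
import tempfile


def write_answer_safe(answer, filename, path='answers'):
    """Write str(answer) to path/filename, creating `path` if needed."""
    os.makedirs(path, exist_ok=True)   # no error if the directory already exists
    name = os.path.join(path, filename)
    with open(name, 'w') as file:
        file.write(str(answer))
    return name


# usage: write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    target = write_answer_safe(2000, 'answer1_1.txt', path=os.path.join(tmp, 'answers'))
    with open(target) as file:
        written = file.read()

print(written)
```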

In [10]:
write_answer(len(allfeats), 'answer1_1.txt')

In [11]:
proportion_1 = (np.array(class_list) == 1).sum() / len(class_list)
proportion_1

0.5

In [12]:
write_answer(proportion_1, 'answer1_2.txt')

Let's try out **CountVectorizer**.

In [20]:
vectorizer = CountVectorizer()

**CountVectorizer** expects a list of texts rather than a list of token lists, so we convert each token list into a pseudo-text by **joining the tokens with spaces**.

In [33]:
allfeats_join = [' '.join(row) for row in allfeats]
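
To make the join-then-vectorize step concrete, here is a small sketch on made-up token lists (the toy data and the names `token_lists` and `toy_vectorizer` are assumptions for illustration). With default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, so single-character tokens and punctuation do not end up in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy token lists standing in for the output of movie_reviews.words(...)
token_lists = [['a', 'great', 'movie', '!'], ['bad', ',', 'bad', 'plot']]
texts = [' '.join(tokens) for tokens in token_lists]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(texts)

vocab = sorted(toy_vectorizer.vocabulary_)   # columns follow alphabetical order
dense = counts.toarray().tolist()
print(vocab)   # ['bad', 'great', 'movie', 'plot']
print(dense)   # [[0, 1, 1, 0], [2, 0, 0, 1]]
```

Each row is one document and each column counts one vocabulary word; 'a', '!' and ',' were dropped by the default tokenizer.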

In [34]:
vectorizer.fit_transform(allfeats_join)

<2000x39659 sparse matrix of type '<class 'numpy.int64'>'
	with 666842 stored elements in Compressed Sparse Row format>

In [44]:
len(vectorizer.vocabulary_)

39659

In [45]:
write_answer(len(vectorizer.vocabulary_), 'answer1_3.txt')

Build a **pipeline** from CountVectorizer and LogisticRegression with default settings and use cross_val_score (also with default settings) to estimate the out-of-the-box quality in terms of **accuracy**.

In [51]:
%%time
vectorizer = CountVectorizer()
clf = LogisticRegression()
vectorizer_clf = Pipeline([('vectorizer', vectorizer), ('clf', clf)])
acc = cross_val_score(vectorizer_clf, allfeats_join, class_list, scoring='accuracy').mean()
print('accuracy =', acc)

accuracy = 0.836021650393
Wall time: 10.4 s
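
Wrapping the vectorizer and the classifier in a single `Pipeline` means that, for every split, the vocabulary is learned from the training folds only, so the held-out fold does not leak into feature extraction. The sketch below repeats the same pattern on made-up toy reviews (texts, labels and variable names are assumptions). Since the default number of folds in `cross_val_score` differs between scikit-learn releases, the sketch passes `cv` explicitly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# made-up toy reviews: 5 positive, 5 negative
toy_texts = [
    'good fun film', 'great fun plot', 'good great acting',
    'fun great cast', 'good fun story',
    'bad boring film', 'awful boring plot', 'bad awful acting',
    'boring awful cast', 'bad boring story',
]
toy_labels = [1] * 5 + [0] * 5

toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),   # refit on the training folds of each split
    ('clf', LogisticRegression()),
])

# one accuracy value per fold
fold_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                              cv=5, scoring='accuracy')
print(fold_scores.mean(), fold_scores.std())
```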


In [52]:
write_answer(acc, 'answer1_4.txt')

In the same way as for accuracy, estimate the quality in terms of **ROC AUC**.

In [54]:
%%time
vectorizer = CountVectorizer()
clf = LogisticRegression()
vectorizer_clf = Pipeline([('vectorizer', vectorizer), ('clf', clf)])
auc = cross_val_score(vectorizer_clf, allfeats_join, class_list, scoring='roc_auc').mean()
print('roc_auc =', auc)
write_answer(auc, 'answer1_5.txt')

roc_auc = 0.910776493783
Wall time: 7.22 s
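
ROC AUC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. A minimal pure-Python sketch of that pairwise definition (the function name `pairwise_auc` and the toy scores are assumptions; for real work `sklearn.metrics.roc_auc_score` is the usual choice):

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(y_true, scores)
print(auc_value)   # 0.75: three of the four (positive, negative) pairs are ordered correctly
```

Unlike accuracy, this metric depends only on the ordering of the scores, not on a decision threshold.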


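Train **logistic regression** on the entire available sample and print the **5 most important** features for the model, ranked by the absolute value of their coefficients. To map feature indices to words, use the `get_feature_names()` method (replaced by `get_feature_names_out()` in recent scikit-learn releases) or the `vocabulary_` attribute of `CountVectorizer`.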

In [56]:
features = vectorizer.fit_transform(allfeats_join)
logit = LogisticRegression()
logit.fit(features, class_list)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [83]:
important_5_ind = abs(logit.coef_[0, :]).argsort()[-5:]
important_5_ind

array([38417, 14159, 39195, 37056,  2954], dtype=int64)

In [84]:
important_5_words = np.array(vectorizer.get_feature_names())[important_5_ind]
important_5_words

array(['waste', 'fun', 'worst', 'unfortunately', 'bad'],
      dtype='<U58')
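
Ranking by absolute value mixes features that push towards either class: a positive coefficient votes for class 1 (positive review), a negative one for class 0. The sketch below shows the same `argsort` idea on made-up coefficients (all values and names are assumptions) and keeps the sign next to each word.

```python
import numpy as np

# made-up coefficients and feature names, for illustration only
coef = np.array([0.2, -1.5, 0.9, -0.1])
names = np.array(['plot', 'bad', 'fun', 'the'])

# indices of the 2 largest |coef|, strongest first
top = np.abs(coef).argsort()[::-1][:2]
top_words = names[top].tolist()
top_signs = np.sign(coef[top]).tolist()

print(top_words)   # ['bad', 'fun']
print(top_signs)   # [-1.0, 1.0]
```

Note that the notebook cell above uses `argsort()[-5:]` without reversing, so its result is ordered from weakest to strongest and the last elements are the most important ones.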

In [85]:
# write the 2 most important features
write_answer(' '.join(important_5_words[-2:]), 'answer1_6.txt')

## <center>Part 2: Parameter Tuning

From here on, quality is evaluated with **cross_val_score with cv=5** and all other parameters left at their defaults.

Estimate the mean quality ( .mean() ) and the standard deviation ( .std() ) across folds for:

    a) a pipeline of CountVectorizer() and LogisticRegression(),
    b) a pipeline of TfidfVectorizer() and LogisticRegression().

In [89]:
%%time
CountVect = CountVectorizer()
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CountVect_logit_acc = cross_val_score(CountVect_logit, allfeats_join, class_list, cv=5)
CVl_acc_mean = CountVect_logit_acc.mean()
CVl_acc_std = CountVect_logit_acc.std()
print('CVl_acc_mean =', CVl_acc_mean)
print('CVl_acc_std =', CVl_acc_std)

CVl_acc_mean = 0.841
CVl_acc_std = 0.0167779617356
Wall time: 13.3 s


In [92]:
%%time
TfidfVect = TfidfVectorizer()
logit = LogisticRegression()
TfidfVect_logit = Pipeline([('TfidfVect', TfidfVect), ('logit', logit)])
TfidfVect_logit_acc = cross_val_score(TfidfVect_logit, allfeats_join, class_list, cv=5)
TVl_acc_mean = TfidfVect_logit_acc.mean()
TVl_acc_std = TfidfVect_logit_acc.std()
print('TVl_acc_mean =', TVl_acc_mean)
print('TVl_acc_std =', TVl_acc_std)

TVl_acc_mean = 0.821
TVl_acc_std = 0.00406201920232
Wall time: 10.8 s
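
For reference, `TfidfVectorizer` differs from `CountVectorizer` in two ways under its default settings: each count is multiplied by an inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit L2 norm. A small sketch on made-up texts that checks both properties (the toy data and variable names are assumptions):

```python
import math

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['good film', 'bad film']
tfidf = TfidfVectorizer()   # defaults: smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(texts)

n_docs = len(texts)
idf_film = tfidf.idf_[tfidf.vocabulary_['film']]   # 'film' occurs in both documents
idf_good = tfidf.idf_[tfidf.vocabulary_['good']]   # 'good' occurs in one document

expected_film = math.log((1 + n_docs) / (1 + 2)) + 1   # = 1.0
expected_good = math.log((1 + n_docs) / (1 + 1)) + 1

row_norms = np.linalg.norm(weights.toarray(), axis=1)   # each row has length 1
print(idf_film, idf_good, row_norms)
```

Words that occur in every document get the smallest weight, while rarer words are boosted.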


In [93]:
write_answer(' '.join(np.array([CVl_acc_mean, CVl_acc_std, TVl_acc_mean, TVl_acc_std]).astype(str)), 'answer2_1.txt')

Let's try different values of **min_df** for **CountVectorizer**.

Estimate the quality of the classifier with **min_df=10** and with **min_df=50**.

In [95]:
%%time
CountVect = CountVectorizer()
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CVl_acc_means = []
for n in [10, 50]:
    CountVect_logit_acc = cross_val_score(CountVect_logit.set_params(CountVect__min_df=n), allfeats_join, class_list, cv=5)
    CVl_acc_mean = CountVect_logit_acc.mean()
    CVl_acc_means.append(CVl_acc_mean)
    print('min_df=', n, 'CVl_acc_mean =', CVl_acc_mean)

min_df= 10 CVl_acc_mean = 0.839
min_df= 50 CVl_acc_mean = 0.813
Wall time: 23.1 s


**CONCLUSION:**   
as min_df increases, quality **drops**: 0.841 with the default min_df=1, 0.839 with min_df=10 and 0.813 with min_df=50.
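
An integer `min_df` removes every word that occurs in fewer than `min_df` documents, so larger values shrink the vocabulary. A small sketch on made-up texts (toy data and variable names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['good film', 'good plot', 'bad film']

# min_df=1 keeps every word; min_df=2 keeps words found in at least 2 documents
full_vocab = sorted(CountVectorizer(min_df=1).fit(texts).vocabulary_)
pruned_vocab = sorted(CountVectorizer(min_df=2).fit(texts).vocabulary_)

print(full_vocab)     # ['bad', 'film', 'good', 'plot']
print(pruned_vocab)   # ['film', 'good']
```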

In [96]:
write_answer(' '.join(np.array(CVl_acc_means).astype(str)), 'answer2_2.txt')

Now let's try different **classifiers**.   
Compare the results for **LogisticRegression, LinearSVC and SGDClassifier**.

In [101]:
%%time
CountVect = CountVectorizer()
CV_clf_acc_means = []
for estimator in [LogisticRegression, LinearSVC, SGDClassifier]:
    clf = estimator()
    CountVect_clf = Pipeline([('CountVect', CountVect), ('clf', clf)])
    CV_clf_acc_mean = cross_val_score(CountVect_clf, allfeats_join, class_list, cv=5).mean()
    
    CV_clf_acc_means.append(CV_clf_acc_mean)
    print(str(clf),'\n', 'CV_clf_acc_mean =', CV_clf_acc_mean)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
 CV_clf_acc_mean = 0.841
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0) 
 CV_clf_acc_mean = 0.8325
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False) 
 CV_clf_acc_mean = 0.784
Wall time: 40.7 s
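
`SGDClassifier` is trained stochastically, so its score depends on the random seed and on how many passes over the data it makes; the repr above shows that the run used the library defaults for both. The sketch below repeats the comparison loop on made-up toy reviews with these settings fixed explicitly. The values `max_iter=1000`, `tol=1e-3` and `random_state=0` are assumptions chosen for illustration, not tuned settings, and the toy scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# made-up toy reviews: 5 positive, 5 negative
toy_texts = [
    'good fun film', 'great fun plot', 'good great acting',
    'fun great cast', 'good fun story',
    'bad boring film', 'awful boring plot', 'bad awful acting',
    'boring awful cast', 'bad boring story',
]
toy_labels = [1] * 5 + [0] * 5

candidates = {
    'LogisticRegression': LogisticRegression(),
    'LinearSVC': LinearSVC(),
    # explicit stopping criteria and seed make the SGD run repeatable
    'SGDClassifier': SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
}

mean_acc = {}
for name, clf in candidates.items():
    pipe = Pipeline([('vect', CountVectorizer()), ('clf', clf)])
    mean_acc[name] = cross_val_score(pipe, toy_texts, toy_labels, cv=5).mean()

print(mean_acc)
```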


In [103]:
write_answer(np.array(CV_clf_acc_means).min().astype(str), 'answer2_3.txt')

Prepare a list of **stop words** with **nltk.corpus.stopwords.words('english')**, look at its elements, and pass it to the **stop_words** parameter of CountVectorizer.

Estimate the quality of the classifier.

In [110]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [111]:
from nltk.corpus import stopwords

In [112]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [114]:
%%time
CountVect = CountVectorizer(stop_words=stopwords.words('english'))
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CountVect_logit_acc = cross_val_score(CountVect_logit, allfeats_join, class_list, cv=5)
CVl_acc_StopWordsNltk = CountVect_logit_acc.mean()
print('stop_words - from nltk', 'CVl_acc_StopWordsNltk =', CVl_acc_StopWordsNltk)

stop_words - from nltk CVl_acc_StopWordsNltk = 0.8415
Wall time: 11.8 s


**sklearn** also ships **its own list** of English stop words; to use it, set the stop_words parameter to the string 'english'.

Estimate the quality of the classifier.

In [115]:
%%time
CountVect = CountVectorizer(stop_words='english')
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CountVect_logit_acc = cross_val_score(CountVect_logit, allfeats_join, class_list, cv=5)
CVl_acc_StopWordsSklearn = CountVect_logit_acc.mean()
print('stop_words - from sklearn', 'CVl_acc_StopWordsSklearn =', CVl_acc_StopWordsSklearn)

stop_words - from sklearn CVl_acc_StopWordsSklearn = 0.839
Wall time: 11.3 s
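
In both variants the `stop_words` parameter simply removes the listed words from the vocabulary before counting. A small sketch with a made-up two-word stop list (toy data and variable names are assumptions), passed the same way as the nltk list above:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['the film is good', 'the plot is bad']

no_filter = sorted(CountVectorizer().fit(texts).vocabulary_)
with_stop_list = sorted(
    CountVectorizer(stop_words=['the', 'is']).fit(texts).vocabulary_
)

print(no_filter)        # ['bad', 'film', 'good', 'is', 'plot', 'the']
print(with_stop_list)   # ['bad', 'film', 'good', 'plot']
```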


In [116]:
write_answer(' '.join(np.array([CVl_acc_StopWordsNltk, CVl_acc_StopWordsSklearn]).astype(str)), 'answer2_4.txt')

Add bigrams to the single words in **CountVectorizer** and measure the quality of the model.

In [121]:
%%time
CountVect = CountVectorizer(ngram_range=(1, 2))
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CVl_acc_analyzerWord = cross_val_score(CountVect_logit, allfeats_join, class_list, cv=5).mean()
print('CVl_acc_analyzerWord =', CVl_acc_analyzerWord)

CVl_acc_analyzerWord = 0.8525
Wall time: 50.3 s
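
With `ngram_range=(1, 2)` the vocabulary contains every single word plus every pair of adjacent words, which lets the model see phrases such as "not good" as one feature. A small sketch on a made-up text (toy data and variable names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['not good film']

unigrams_and_bigrams = sorted(
    CountVectorizer(ngram_range=(1, 2)).fit(texts).vocabulary_
)
print(unigrams_and_bigrams)
# ['film', 'good', 'good film', 'not', 'not good']
```

The price is a much larger feature space, which is consistent with the longer wall time reported above.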


Build a model on the frequencies of character n-grams with n from 3 to 5 by setting the **ngram_range** parameter accordingly and passing **analyzer='char_wb'**.

In [119]:
%%time
CountVect = CountVectorizer(ngram_range=(3, 5), analyzer='char_wb')
logit = LogisticRegression()
CountVect_logit = Pipeline([('CountVect', CountVect), ('logit', logit)])
CVl_acc_analyzerChar_wb = cross_val_score(CountVect_logit, allfeats_join, class_list, cv=5).mean()
print('CVl_acc_analyzerChar_wb =', CVl_acc_analyzerChar_wb)

CVl_acc_analyzerChar_wb = 0.82
Wall time: 2min 15s
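
The `'char_wb'` analyzer builds character n-grams inside each word separately, padding the word with a space on both sides, whereas `'char'` slides over the whole text and therefore produces n-grams that span word boundaries. A small sketch with 3-grams on a made-up text (toy data and variable names are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['so bad']

wb_ngrams = sorted(
    CountVectorizer(analyzer='char_wb', ngram_range=(3, 3)).fit(texts).vocabulary_
)
char_ngrams = sorted(
    CountVectorizer(analyzer='char', ngram_range=(3, 3)).fit(texts).vocabulary_
)

print(wb_ngrams)     # [' ba', ' so', 'ad ', 'bad', 'so ']
print(char_ngrams)   # [' ba', 'bad', 'o b', 'so ']
```

Only the `'char'` variant contains `'o b'`, the n-gram that crosses the gap between the two words.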


In [122]:
write_answer(' '.join(np.array([CVl_acc_analyzerWord, CVl_acc_analyzerChar_wb]).astype(str)), 'answer2_5.txt')

## <center>SUMMARY

**Summary of the parameter tuning results (mean accuracy, cv=5)**

1. **vectorizer**: CountVectorizer() / TfidfVectorizer(): 0.841 / 0.821 --> **CountVectorizer()**
2. **min_df** of CountVectorizer: 1 (default) / 10 / 50: 0.841 / 0.839 / 0.813 --> **1 (default)**
3. **classifier**: LogisticRegression / LinearSVC / SGDClassifier: 0.841 / 0.833 / 0.784 --> **LogisticRegression**
4. **stop words**: nltk.corpus.stopwords.words('english') / 'english': 0.8415 / 0.839 --> **nltk.corpus.stopwords.words('english')**
5. **n-grams**: (1, 2), analyzer='word' (default) / (3, 5), analyzer='char_wb': 0.853 / 0.82 --> **(1, 2), analyzer='word'** (default)