# Homework 5 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments from the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

Along the way, you need to answer these ***[questions](https://forms.gle/r3pQShWbsBWSW3819)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [14]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [15]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('./data/train_toxic.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

In `sklearn` they are implemented as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

You can read more about them [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).

In [16]:
train_text = train['comment_text']

In [17]:
train_text

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object

# CountVectorizer

In [18]:
# Try different vectorizers, n-gram sizes, and cutoffs for rare and overly frequent words
#word_vectorizer = CountVectorizer(
#                    analyzer='word', # token = word
#                    ngram_range=(1,1), # only unigrams are used, (1,2) - unigrams/bigrams, ..., etc.
#                    max_df=0.6, # drop words that occur in more than 60% of the documents
#                    max_features=10000, # keep only the 10000 most frequent words as columns
#                    stop_words='english'
#                    )

In [19]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    stop_words='english',
    ngram_range=(1, 1),
    max_df=0.6,
    max_features=20000)

In [20]:
#word_vectorizer = TfidfVectorizer(
#    sublinear_tf=True,
#    strip_accents='unicode',
#    analyzer='char',
#    stop_words='english', # has no effect here: stop words only apply when analyzer='word'
#    ngram_range=(1, 3),
#    max_features=10000)

In [21]:
train_word_features = word_vectorizer.fit_transform(train_text)

In [7]:
counts = pd.DataFrame(train_word_features.toarray(), columns=word_vectorizer.get_feature_names_out())
counts

Unnamed: 0,00,000,000000,01,02,03,04,05,06,07,...,yup,zealand,zero,zionism,zionist,zionists,zoe,zone,zoo,zuck
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159566,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
159569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
max_fr = counts.sum(axis=0).sort_values(ascending=False).head(1)

In [9]:
print(f'most frequent word in train dataset is "{max_fr.index[0]}" with {max_fr.iloc[0]} occurrences')

most frequent word in train dataset is "article" with 57661 occurrences


In [None]:
#word_vectorizer.fit(train_text)
#train_word_features = word_vectorizer.transform(train_text)

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [22]:
classifier = LogisticRegression(penalty='l2',
                                tol=1e-4,
                                class_weight=None,
                                max_iter=1500,
                                C=0.01) # Try different parameters and find the optimal ones with cross-validation

We will train one classifier per class.

To validate model quality, we use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

In [23]:
scores = []

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9438904093489351
CV score for class severe_toxic is 0.9824762789870254
CV score for class obscene is 0.9751916042776061
CV score for class threat is 0.9767788848663468
CV score for class insult is 0.9631453884368278
CV score for class identity_hate is 0.9638023974095438
Total score is 0.9675474938877141
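
Note that in the loop above the vectorizer was fitted on the whole training set before cross-validation, so its vocabulary and IDF weights have already seen the validation folds. One way to avoid this, shown here as a sketch rather than as the required solution, is to wrap the vectorizer and the classifier in a pipeline so that both are refitted inside every fold. The data below is synthetic and trivially separable, so the resulting score says nothing about the real task.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny synthetic data set (an assumption, only to make the sketch runnable)
texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "what a stupid comment",
    "nice work on this article",
] * 5
labels = np.array([1, 0, 1, 0] * 5)

# The vectorizer is part of the estimator, so it is fitted on the training folds only
pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(C=1.0, max_iter=1500),
)

fold_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='roc_auc')
print('CV score is {}'.format(fold_scores.mean()))
```

For the homework data, `texts` would correspond to `train_text` and `labels` to `train[class_name]` for each class in turn.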


Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
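
One possible way to organize this search, sketched under the assumption that a small grid is enough to start with, is `GridSearchCV` over a pipeline. The step names `tfidf` and `clf`, the grid values, and the synthetic data are all choices made for this example and not recommended settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny synthetic data set (an assumption, only to make the sketch runnable)
texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "what a stupid comment",
    "nice work on this article",
] * 3
labels = np.array([1, 0, 1, 0] * 3)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True)),
    ('clf', LogisticRegression(max_iter=1500)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

On the real data the same search would be run once per class, and the grid could be extended with `min_df`, `max_df` or `max_features`.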
