# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

Along the way, you need to answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification, [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Both are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more detail on both, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
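
To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The three sentences and the `toy_docs` name are made up for illustration and are not part of the competition data; both vectorizers are used with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data.
toy_docs = [
    "you are great",
    "you are awful",
    "awful awful comment",
]

# Bag of words: one column per vocabulary term, each value is a raw count.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

# TF-IDF: counts are reweighted so that terms occurring in many documents
# contribute less; by default each row is then L2-normalized.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)

# Vocabulary terms ordered by their column index.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the first document, "great" occurs in only one document while "you" occurs in two, so TF-IDF gives "great" the larger weight even though both have a raw count of 1.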

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [None]:
# Number of labels set for each comment (columns 2-7 are the six class columns)
rowsums = train.iloc[:, 2:8].sum(axis=1)
train['rules_broken'] = rowsums

In [4]:
# Try different vectorizers, n-gram sizes, stop words, and cutoffs for rare and overly frequent words
word_vectorizer = TfidfVectorizer(max_features=1000,
                                  stop_words='english',
                                  ngram_range=(1, 2))  # TfidfVectorizer or CountVectorizer
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [None]:
weighted_target = pd.DataFrame()


In [None]:
samples = train[train['rules_broken'] > 0].sample(10000)
samples.shape
samples.head()
samples_text = samples['comment_text']

In [None]:
word_vectorizer.fit(samples_text)
word_vectorizer.vocabulary_

In [5]:
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)



For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [6]:
classifier = LogisticRegression(C=1,
                                penalty='l2',
                                solver='liblinear',
                                n_jobs=1)  # Try different parameters and find the best ones with cross-validation

We will train one classifier per class.

To validate the model's quality, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

In [8]:
scores = []

for class_name in class_names:
    
    train_target = train[class_name]
    classifier = LogisticRegression(C=1,
                                penalty='l2',
                               solver='liblinear',
                               n_jobs=1)
    
    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9361231486352647
CV score for class severe_toxic is 0.9753419834419832
CV score for class obscene is 0.9679720754718191
CV score for class threat is 0.9652977949450635
CV score for class insult is 0.9500264770559737
CV score for class identity_hate is 0.954564028980413
Total score is 0.9582209180884195


Try to find the best parameters for `word_vectorizer` and `classifier` by optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
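
One possible way to organize such a search is sketched below. It is not the notebook's own method: it assumes you wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and let `GridSearchCV` score each parameter combination with ROC AUC. The toy texts, labels, and grid values are hypothetical placeholders; with the real data you would pass `train['comment_text']` and one class column instead, once per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the comment text and one label column.
toy_texts = [
    "you are an idiot", "thanks for the helpful edit",
    "stupid idiot go away", "great article, well sourced",
    "what an awful stupid page", "please see the talk page",
    "idiot idiot idiot", "nice work on the references",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Putting the vectorizer inside the pipeline means it is refit on the
# training part of every fold, so its parameters can be tuned too.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Illustrative grid only; the values are assumptions, not recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

# cv=2 only because the toy data is tiny.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=2)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Other vectorizer settings such as `min_df`, `max_df`, or `stop_words` can be added to the grid in the same `tfidf__<name>` form. Note that this sketch fits the vectorizer on training folds only, whereas the notebook fits it once on `all_text`.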


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [9]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [11]:
for class_name in class_names:
    train_target = train[class_name]
    
    classifier.fit(train_word_features, train_target)
    submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]    

In [12]:
submission.to_csv('submission.csv', index=False)

In [13]:
submission.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.995359,0.144311,0.991977,0.040555,0.881297,0.193037
1,0000247867823ef7,0.010052,0.002673,0.004257,0.001602,0.009345,0.003195
2,00013b17ad220c46,0.022164,0.001471,0.008078,0.000642,0.011878,0.002377
3,00017563c3f7919a,0.01103,0.002682,0.00551,0.001319,0.005675,0.000667
4,00017695ad8997eb,0.079407,0.002064,0.009578,0.001848,0.019122,0.002338
