# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [2]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

In [3]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

Standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

More details about them can be found [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
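
To make the difference between the two vectorizers concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is an assumption for illustration and is not taken from the competition data). It prints the raw counts next to the TF-IDF weights for the same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up for illustration, not from the competition data)
docs = [
    "you are great",
    "you are rude",
    "great great article",
]

# Bag of words: each column is a word, each cell is a raw count
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())

# TF-IDF: same columns, but words that occur in many documents get a lower
# weight; with the default settings each row is L2-normalized
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
print(tfidf.toarray().round(2))
```

In the second document, "rude" occurs in one document only while "you" occurs in two, so "rude" should receive the larger TF-IDF weight even though both have the same raw count.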

In [4]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [5]:
all_text = all_text.apply(str.lower)
train_text = train_text.apply(str.lower)
test_text = test_text.apply(str.lower)

In [6]:
# TASK 1: FIND THE MOST POPULAR WORDS
# List the 20 most frequent whitespace-separated tokens
# (all_text was already lowercased in the previous cell)
print(pd.Series(' '.join(all_text).split()).value_counts()[:20].to_string())

the     902050
to      532403
of      406605
a       399980
and     397026
i       342269
you     333262
is      323691
that    267385
in      259696
it      202532
for     181740
this    162853
not     158934
"       157806
on      155849
be      150192
as      136236
are     128645
have    126235
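
The same count can be produced without joining all comments into one large string. The sketch below uses `collections.Counter` on a made-up `toy_text` Series that stands in for `all_text`. Like the cell above, it splits on whitespace only, so punctuation is counted as part of a token or as a token of its own (which is why `"` shows up in the list above).

```python
from collections import Counter

import pandas as pd

# Stand-in for the notebook's lowercased `all_text` Series (made-up data)
toy_text = pd.Series(["the cat and the dog", "the end"])

word_counts = Counter()
for comment in toy_text:
    # split() with no arguments splits on any run of whitespace
    word_counts.update(comment.split())

print(word_counts.most_common(3))
```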


In [26]:
# TASK 2: THE C PARAMETER IN LogisticRegression
print('From the scikit-learn documentation: ')
print('C : float, default: 1.0')
print('Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.')

From the scikit-learn documentation: 
C : float, default: 1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
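
The effect of `C` can be observed directly on the learned weights. The following sketch fits the same model with three values of `C` on synthetic data (generated with `make_classification`, an assumption made for illustration) and prints the norm of the coefficient vector, which should shrink as `C` gets smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    # Stronger regularization (smaller C) pulls the weights toward zero
    coef_norms[C] = np.linalg.norm(model.coef_)
    print('C = {:>6}: ||w|| = {:.3f}'.format(C, coef_norms[C]))
```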


In [43]:
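# GRID SEARCH FOR THE BEST PARAMETERS, WITH tfidf__max_features IN np.arange(1000, 35000, 8000)

total_score = []

# One row of per-class CV scores is added per parameter combination
df_res = pd.DataFrame(columns=class_names)

# Note: because of floating-point rounding, np.arange(0.8, 1.1, 0.1) also yields
# a value of about 1.1 (see the output below); newer scikit-learn releases may
# reject a float max_df above 1.0
for tfidf__max_df in np.arange(0.8, 1.1, 0.1):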
    for tfidf__max_features in np.arange(1000, 35000, 8000):
        for tfidf__ngram_range in ((1, 1), (1, 2)):
            for lr__C in [0.9,1,3,5,10]:
                word_vectorizer = TfidfVectorizer(
                    analyzer='word', 
                    ngram_range=tfidf__ngram_range, 
                    vocabulary=None, 
                    max_df=tfidf__max_df,
                    max_features=tfidf__max_features, 
                    smooth_idf=True,
                    norm='l2' 
                    )


                classifier = LogisticRegression(penalty = 'l2', C = lr__C,max_iter = 1000)
                
                word_vectorizer.fit(all_text)
                train_word_features = word_vectorizer.transform(train_text)
                test_word_features = word_vectorizer.transform(test_text)
                
                scores= []
                d={}
                #d = dict.fromkeys(['a', 'b'])
                for class_name in class_names:
                    train_target = train[class_name]

                    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
                    
                    
                    print('CV score for class {} is {}'.format(class_name, cv_score))
                    
                    d[class_name] = cv_score
                    
                    
                    scores.append(cv_score)
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                df_res = pd.concat([df_res, pd.DataFrame([d])], ignore_index=True)
                d.clear()
                #print(df_res)
                print('ITERATION:  ',len(total_score))
                print('Parameters  ',tfidf__max_df,'  ',tfidf__max_features,'   ',tfidf__ngram_range,'  ',lr__C)
                total_score.append(np.mean(scores))
                print('Total score is {}'.format(np.mean(scores)))
        
    
    



CV score for class toxic is 0.9488783969088379
CV score for class severe_toxic is 0.9787903595638028
CV score for class obscene is 0.9708751938217719
CV score for class threat is 0.9806301963198457
CV score for class insult is 0.9597272327776771
CV score for class identity_hate is 0.9570001310598428
ITERATION:   0
Parameters   0.8    1000     (1, 1)    0.9
Total score is 0.9659835850752964
CV score for class toxic is 0.9489345006635187
CV score for class severe_toxic is 0.9787558968210993
CV score for class obscene is 0.9708815563509644
CV score for class threat is 0.9808318267336373
CV score for class insult is 0.9596572687514041
CV score for class identity_hate is 0.9569165006691582
ITERATION:   1
Parameters   0.8    1000     (1, 1)    1
Total score is 0.9659962583316304
CV score for class toxic is 0.9488103271385269
CV score for class severe_toxic is 0.9773974603011654
CV score for class obscene is 0.9699542675884641
CV score for class threat is 0.9803719702662345
...

CV score for class toxic is 0.969761733858194
CV score for class severe_toxic is 0.9829536050196244
CV score for class obscene is 0.982003717422035
CV score for class threat is 0.9842271903074215
CV score for class insult is 0.9753241814834114
CV score for class identity_hate is 0.9718105388247834
ITERATION:   21
Parameters   0.8    17000     (1, 1)    1
Total score is 0.9776801611525783
CV score for class toxic is 0.971785665482568
CV score for class severe_toxic is 0.9824738479080665
CV score for class obscene is 0.9829691474637471
CV score for class threat is 0.9870091466466334
CV score for class insult is 0.9760907770002181
CV score for class identity_hate is 0.9725466924030483
ITERATION:   22
Parameters   0.8    17000     (1, 1)    3
Total score is 0.9788125461507136
CV score for class toxic is 0.97126276692366
CV score for class severe_toxic is 0.9813697419860041
CV score for class obscene is 0.9820720110515709
CV score for class threat is 0.9867919975913381
...

CV score for class toxic is 0.9723486092939314
CV score for class severe_toxic is 0.9827776849370092
CV score for class obscene is 0.9839356585724762
CV score for class threat is 0.9864528101804068
CV score for class insult is 0.9764931747512243
CV score for class identity_hate is 0.9736464986651194
ITERATION:   42
Parameters   0.8    33000     (1, 1)    3
Total score is 0.9792757394000279
CV score for class toxic is 0.9720965634583667
CV score for class severe_toxic is 0.9818063570994436
CV score for class obscene is 0.9833762905383386
CV score for class threat is 0.9863854190176017
CV score for class insult is 0.9756123658913628
CV score for class identity_hate is 0.9726729411450297
ITERATION:   43
Parameters   0.8    33000     (1, 1)    5
Total score is 0.9786583228583572
CV score for class toxic is 0.9704388740720686
CV score for class severe_toxic is 0.9793454925256398
CV score for class obscene is 0.9814088056067943
CV score for class threat is 0.985275127163468
...

CV score for class toxic is 0.969787148065275
CV score for class severe_toxic is 0.9814101030239474
CV score for class obscene is 0.9807582844813734
CV score for class threat is 0.9865286738509065
CV score for class insult is 0.9738452234031234
CV score for class identity_hate is 0.9697600348644801
ITERATION:   63
Parameters   0.9    9000     (1, 1)    5
Total score is 0.9770149112815177
CV score for class toxic is 0.9676849917652514
CV score for class severe_toxic is 0.9785745806902427
CV score for class obscene is 0.9779176896602378
CV score for class threat is 0.984785391955783
CV score for class insult is 0.970720198308749
CV score for class identity_hate is 0.9656728024259321
ITERATION:   64
Parameters   0.9    9000     (1, 1)    10
Total score is 0.9742259424676994
CV score for class toxic is 0.9660165489184193
CV score for class severe_toxic is 0.982274089221541
CV score for class obscene is 0.9793109188834942
CV score for class threat is 0.9838499646947136
...

CV score for class toxic is 0.9700381864748772
CV score for class severe_toxic is 0.979191802280286
CV score for class obscene is 0.9807396234911958
CV score for class threat is 0.9851412503740248
CV score for class insult is 0.972590536107707
CV score for class identity_hate is 0.9691532289442009
ITERATION:   84
Parameters   0.9    25000     (1, 1)    10
Total score is 0.9761424379453819
CV score for class toxic is 0.9671601450503825
CV score for class severe_toxic is 0.9824264614282431
CV score for class obscene is 0.9797058374251183
CV score for class threat is 0.9832867792618986
CV score for class insult is 0.9739477832515441
CV score for class identity_hate is 0.9691552868756075
ITERATION:   85
Parameters   0.9    25000     (1, 2)    0.9
Total score is 0.9759470488821324
CV score for class toxic is 0.9676516397183028
CV score for class severe_toxic is 0.9825045584164428
CV score for class obscene is 0.9801209906606404
CV score for class threat is 0.983799336204681
...

CV score for class toxic is 0.9375349835375028
CV score for class severe_toxic is 0.975938147438554
CV score for class obscene is 0.9581609184066527
CV score for class threat is 0.9676506191772171
CV score for class insult is 0.9489283485971217
CV score for class identity_hate is 0.9444568967391295
ITERATION:   105
Parameters   1.0    1000     (1, 2)    0.9
Total score is 0.955444985649363
CV score for class toxic is 0.9375892742131202
CV score for class severe_toxic is 0.9758660637131604
CV score for class obscene is 0.9581319001759256
CV score for class threat is 0.9679022169037476
CV score for class insult is 0.9488287236971663
CV score for class identity_hate is 0.9443186288723454
ITERATION:   106
Parameters   1.0    1000     (1, 2)    1
Total score is 0.9554394679292443
CV score for class toxic is 0.9375279852089894
CV score for class severe_toxic is 0.9740825844256955
CV score for class obscene is 0.9568955728062895
CV score for class threat is 0.9674444872727226
...

CV score for class severe_toxic is 0.9821452459388094
CV score for class obscene is 0.9803295798119568
CV score for class threat is 0.9838626817333257
CV score for class insult is 0.9743405656421533
CV score for class identity_hate is 0.9700603839092573
ITERATION:   126
Parameters   1.0    17000     (1, 2)    1
Total score is 0.9763898777289168
CV score for class toxic is 0.969155306368979
CV score for class severe_toxic is 0.9815975153828153
CV score for class obscene is 0.9814722266421776
CV score for class threat is 0.9866229032279169
CV score for class insult is 0.9747832622677324
CV score for class identity_hate is 0.9703553843228039
ITERATION:   127
Parameters   1.0    17000     (1, 2)    3
Total score is 0.977331099702071
CV score for class toxic is 0.9682028412778513
CV score for class severe_toxic is 0.9801909525000397
CV score for class obscene is 0.9804160250882775
CV score for class threat is 0.986462169446438
CV score for class insult is 0.9733109197301203
...

CV score for class severe_toxic is 0.9820699021594375
CV score for class obscene is 0.9818561032793057
CV score for class threat is 0.9867236924917577
CV score for class insult is 0.9754552918804035
CV score for class identity_hate is 0.9704699845453627
ITERATION:   147
Parameters   1.0    33000     (1, 2)    3
Total score is 0.977810239130049
CV score for class toxic is 0.9697459056373239
CV score for class severe_toxic is 0.9809961837223353
CV score for class obscene is 0.9811049668920164
CV score for class threat is 0.9867702187670618
CV score for class insult is 0.9743795142563267
CV score for class identity_hate is 0.9691938148565334
ITERATION:   148
Parameters   1.0    33000     (1, 2)    5
Total score is 0.9770317673552662
CV score for class toxic is 0.9676075013744603
CV score for class severe_toxic is 0.9782550106260134
CV score for class obscene is 0.9787945351673636
CV score for class threat is 0.9857422113660667
CV score for class insult is 0.9712878080603592
...

CV score for class severe_toxic is 0.9801094933301816
CV score for class obscene is 0.9779534181591986
CV score for class threat is 0.9863283420723379
CV score for class insult is 0.9710826298638926
CV score for class identity_hate is 0.9674695200216776
ITERATION:   168
Parameters   1.1    9000     (1, 2)    5
Total score is 0.9747819313886205
CV score for class toxic is 0.9630030871585888
CV score for class severe_toxic is 0.9764027933374001
CV score for class obscene is 0.9745189126958627
CV score for class threat is 0.9846722644219371
CV score for class insult is 0.9671673529570696
CV score for class identity_hate is 0.9627995320825766
ITERATION:   169
Parameters   1.1    9000     (1, 2)    10
Total score is 0.9714273237755724
CV score for class toxic is 0.9693160608986995
CV score for class severe_toxic is 0.9828935411520613
CV score for class obscene is 0.9816848736099733
CV score for class threat is 0.9837234196597638
CV score for class insult is 0.9750320030152837
...

CV score for class severe_toxic is 0.9778547329674087
CV score for class obscene is 0.9781589144254558
CV score for class threat is 0.9858946122381961
CV score for class insult is 0.9706672580007414
CV score for class identity_hate is 0.9654556559884959
ITERATION:   189
Parameters   1.1    25000     (1, 2)    10
Total score is 0.9740915109130684
CV score for class toxic is 0.9692539474373367
CV score for class severe_toxic is 0.9829949171669577
CV score for class obscene is 0.9819058553696567
CV score for class threat is 0.9827996116754542
CV score for class insult is 0.9749071837508723
CV score for class identity_hate is 0.9717737230430181
ITERATION:   190
Parameters   1.1    33000     (1, 1)    0.9
Total score is 0.9772725397405493
CV score for class toxic is 0.9697530516896403
CV score for class severe_toxic is 0.983067090711506
CV score for class obscene is 0.9822907525708624
CV score for class threat is 0.9833277801605855
CV score for class insult is 0.9752448783593701
...

In [71]:
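# Guard against NaN entries before taking the maximum
df_res = df_res.fillna(0)

# Best CV score per class over all parameter combinations
br = []
for i in df_res.columns:
    br.append(max(df_res[i]))

print(br)
print(np.mean(br))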

[0.97234860929393141, 0.98312732066229691, 0.98393565857247622, 0.98700914664663342, 0.97649317475122432, 0.97364649866511943]
0.979426734765
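
The list above gives the best score per class but not the parameter combination that produced it. One possible way to recover it is to keep the parameter sets in a list aligned with the rows of the results table and use `idxmax`. The sketch below uses made-up scores; `params` and `toy_res` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up scores for two classes and three parameter combinations
params = [
    {'max_features': 1000, 'C': 1},
    {'max_features': 9000, 'C': 1},
    {'max_features': 9000, 'C': 3},
]
toy_res = pd.DataFrame({
    'toxic':  [0.95, 0.97, 0.96],
    'threat': [0.98, 0.97, 0.99],
})

best_score = toy_res.max()      # best CV score per class (column-wise)
best_row = toy_res.idxmax()     # row index of that score per class

for class_name in toy_res.columns:
    print(class_name, best_score[class_name], params[best_row[class_name]])
```

Note that averaging the per-class maxima mixes different parameter sets, so the mean is only reachable if each class gets its own vectorizer and classifier settings.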


In [20]:
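# GRID SEARCH FOR THE BEST PARAMETERS, WITH tfidf__max_features IN np.arange(40000, 50000, 5000)

total_score = []

# One row of per-class CV scores is added per parameter combination
df_res = pd.DataFrame(columns=class_names)

# Note: because of floating-point rounding, np.arange(0.8, 1.1, 0.1) also yields
# a value of about 1.1 (see the output below); newer scikit-learn releases may
# reject a float max_df above 1.0
for tfidf__max_df in np.arange(0.8, 1.1, 0.1):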
    for tfidf__max_features in np.arange(40000, 50000, 5000):
        for tfidf__ngram_range in ((1, 1), (1, 2)):
            for lr__C in [0.9,1,3,5,10]:
                word_vectorizer = TfidfVectorizer(
                    analyzer='word', 
                    ngram_range=tfidf__ngram_range, 
                    vocabulary=None, 
                    max_df=tfidf__max_df,
                    max_features=tfidf__max_features, 
                    smooth_idf=True,
                    norm='l2' 
                    )


                classifier = LogisticRegression(penalty = 'l2', C = lr__C,max_iter = 1000)
                
                word_vectorizer.fit(all_text)
                train_word_features = word_vectorizer.transform(train_text)
                test_word_features = word_vectorizer.transform(test_text)
                
                scores= []
                d={}
                
                for class_name in class_names:
                    train_target = train[class_name]

                    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
                    
                    
                    print('CV score for class {} is {}'.format(class_name, cv_score))
                    
                    d[class_name] = cv_score
                    
                    
                    scores.append(cv_score)
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                df_res = pd.concat([df_res, pd.DataFrame([d])], ignore_index=True)
                d.clear()
                #print(df_res)
                print('ITERATION:  ',len(total_score))
                print('Parameters  ',tfidf__max_df,'  ',tfidf__max_features,'   ',tfidf__ngram_range,'  ',lr__C)
                total_score.append(np.mean(scores))
                print('Total score is {}'.format(np.mean(scores)))
        
    
    




CV score for class toxic is 0.9691492919755298
CV score for class severe_toxic is 0.9829615169479098
CV score for class obscene is 0.9818470654092436
CV score for class threat is 0.9825722563751427
CV score for class insult is 0.9747557613384066
CV score for class identity_hate is 0.9715294679713439
ITERATION:   0
Parameters   0.8    40000     (1, 1)    0.9
Total score is 0.9771358933362627
CV score for class toxic is 0.9696565427638314
CV score for class severe_toxic is 0.9830343593583223
CV score for class obscene is 0.9822356511359415
CV score for class threat is 0.9830972118746493
CV score for class insult is 0.9750978213109104
CV score for class identity_hate is 0.9719650524344831
ITERATION:   1
Parameters   0.8    40000     (1, 1)    1
Total score is 0.9775144398130231
CV score for class toxic is 0.9723321296665891
CV score for class severe_toxic is 0.9827843286885942
CV score for class obscene is 0.9839690824300448
CV score for class threat is 0.986317374089804
...

CV score for class toxic is 0.9696565427638314
CV score for class severe_toxic is 0.9830343593583223
CV score for class obscene is 0.9822356511359415
CV score for class threat is 0.9830972118746493
CV score for class insult is 0.9750978213109104
CV score for class identity_hate is 0.9719650524344831
ITERATION:   21
Parameters   0.9    40000     (1, 1)    1
Total score is 0.9775144398130231
CV score for class toxic is 0.9723321296665891
CV score for class severe_toxic is 0.9827843286885942
CV score for class obscene is 0.9839690824300448
CV score for class threat is 0.986317374089804
CV score for class insult is 0.9764095169177861
CV score for class identity_hate is 0.9734676809642896
ITERATION:   22
Parameters   0.9    40000     (1, 1)    3
Total score is 0.9792133521261847
CV score for class toxic is 0.9721242598192865
CV score for class severe_toxic is 0.9818609458068436
CV score for class obscene is 0.983457076752363
CV score for class threat is 0.986287344385648
...

CV score for class toxic is 0.9723321296665891
CV score for class severe_toxic is 0.9827843286885942
CV score for class obscene is 0.9839690824300448
CV score for class threat is 0.986317374089804
CV score for class insult is 0.9764095169177861
CV score for class identity_hate is 0.9734676809642896
ITERATION:   42
Parameters   1.0    40000     (1, 1)    3
Total score is 0.9792133521261847
CV score for class toxic is 0.9721242598192865
CV score for class severe_toxic is 0.9818609458068436
CV score for class obscene is 0.983457076752363
CV score for class threat is 0.986287344385648
CV score for class insult is 0.9755477264426493
CV score for class identity_hate is 0.9725304395164875
ITERATION:   43
Parameters   1.0    40000     (1, 1)    5
Total score is 0.9786346321205462
CV score for class toxic is 0.970543695074752
CV score for class severe_toxic is 0.9795318755682928
CV score for class obscene is 0.9815610615681427
CV score for class threat is 0.98523619355567
...

CV score for class toxic is 0.9721242598192865
CV score for class severe_toxic is 0.9818609458068436
CV score for class obscene is 0.983457076752363
CV score for class threat is 0.986287344385648
CV score for class insult is 0.9755477264426493
CV score for class identity_hate is 0.9725304395164875
ITERATION:   63
Parameters   1.1    40000     (1, 1)    5
Total score is 0.9786346321205462
CV score for class toxic is 0.970543695074752
CV score for class severe_toxic is 0.9795318755682928
CV score for class obscene is 0.9815610615681427
CV score for class threat is 0.98523619355567
CV score for class insult is 0.9727698898584739
CV score for class identity_hate is 0.9698070443825636
ITERATION:   64
Parameters   1.1    40000     (1, 1)    10
Total score is 0.9765749600013159
CV score for class toxic is 0.9671898996637077
CV score for class severe_toxic is 0.9822220725387076
CV score for class obscene is 0.9793670321299995
CV score for class threat is 0.9830586412225483
...

In [33]:
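# Guard against NaN entries before taking the maximum
df_res = df_res.fillna(0)

# Best CV score per class over all parameter combinations
br = []
for i in df_res.columns:
    br.append(max(df_res[i]))

print(br)
print(np.mean(br))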

[0.97233633862402391, 0.98303435935832228, 0.98397692700014128, 0.98728726484671137, 0.9764095169177861, 0.97346768096428959]
0.979418681285
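
The loop variable names used above (`tfidf__max_df`, `lr__C`) match scikit-learn's `Pipeline` naming convention, so the same kind of search could be expressed with `Pipeline` and `GridSearchCV`. The following is a minimal sketch on a made-up toy dataset (`texts` and `labels` are assumptions for illustration). Unlike the loops above, the pipeline refits the vectorizer on the training folds only, rather than on `all_text`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up dataset so the sketch runs quickly
texts = ["you are awful", "awful awful person", "what an awful idea", "you idiot",
         "thanks for the edit", "great article", "nice work on the page", "thank you"] * 5
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 5)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='word', norm='l2', smooth_idf=True)),
    ('lr', LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter of one pipeline step
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.8, 1.0],
    'lr__C': [1, 3, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data this would be run once per class, with `train_text` and `train[class_name]` in place of the toy inputs.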


For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We will train one classifier per class.

To validate the quality of the model, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.
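
As a minimal illustration of `cross_val_score` with the `roc_auc` scorer, the sketch below evaluates a logistic regression on a synthetic, imbalanced binary problem that stands in for a single label column (the data and the explicit `StratifiedKFold` settings are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for one label column
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

classifier = LogisticRegression(max_iter=1000)

# Stratified folds keep the share of positive examples similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(classifier, X, y, cv=cv, scoring='roc_auc')

print(fold_scores)
print('Mean ROC AUC: {:.4f}'.format(np.mean(fold_scores)))
```

When `cv` is an integer or left at its default and the estimator is a classifier, `cross_val_score` already uses stratified folds; passing the splitter explicitly mainly adds shuffling and a fixed seed.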

In [6]:
# TASK 3: CV WITH THE BEST PARAMETERS FOUND PREVIOUSLY

scores= []

for class_name in class_names:
    train_target = train[class_name]
    
    # Best (ngram_range, max_features, C) per class, found previously
    best_params = {
        'toxic':         ((1, 1), 33000, 3),
        'severe_toxic':  ((1, 1), 9000, 1),
        'obscene':       ((1, 1), 45000, 3),
        'threat':        ((1, 2), 40000, 5),
        'insult':        ((1, 1), 33000, 3),
        'identity_hate': ((1, 1), 40000, 3),
    }
    ngram_range, max_features, C = best_params[class_name]

    word_vectorizer = TfidfVectorizer(
        analyzer='word',
        ngram_range=ngram_range,
        vocabulary=None,
        max_df=0.8,
        max_features=max_features,
        smooth_idf=True,
        norm='l2'
    )

    classifier = LogisticRegression(penalty='l2', C=C, max_iter=5000)

    word_vectorizer.fit(all_text)
    train_word_features = word_vectorizer.transform(train_text)
    
    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

CV score for class toxic is 0.9723486092939314
CV score for class severe_toxic is 0.9831273206622969
CV score for class obscene is 0.9839769270001413
CV score for class threat is 0.9872872648467114
CV score for class insult is 0.9764931747512243
CV score for class identity_hate is 0.9734676809642896
Total score is 0.9794501629197659


Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
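
ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this interpretation against `roc_auc_score` on four made-up predictions (`y_true` and `y_score` are assumptions for illustration).

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, t in zip(y_score, y_true) if t == 1]
negatives = [s for s, t in zip(y_score, y_true) if t == 0]

# Share of (positive, negative) pairs ranked correctly; ties count as half
pairs = list(product(positives, negatives))
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(manual_auc)                       # 0.75
print(roc_auc_score(y_true, y_score))   # 0.75
```

Because only the ranking matters, the metric does not depend on a decision threshold, which suits the heavily imbalanced labels in this task.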


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [14]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [2]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

# stop_words is only applied when analyzer='word', so it is omitted here
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
submission = pd.DataFrame.from_dict({'id': test['id']})
for class_name in class_names:
    train_target = train[class_name]
    classifier = LogisticRegression(solver='sag')

    cv_score = np.mean(cross_val_score(classifier, train_features, train_target, cv=3, scoring='roc_auc'))
    scores.append(cv_score)
    print('CV score for class {} is {}'.format(class_name, cv_score))

    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]

print('Total CV score is {}'.format(np.mean(scores)))

submission.to_csv('submission.csv', index=False)

CV score for class toxic is 0.9783248627763198
CV score for class severe_toxic is 0.9885336069890597
CV score for class obscene is 0.9901305860590298
CV score for class threat is 0.9896536828692066
CV score for class insult is 0.9826307233013799
CV score for class identity_hate is 0.9823772016603595
Total CV score is 0.9852751106092258
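
The final model above stacks character n-gram features next to the word features. A minimal sketch of that step on a made-up corpus is shown below (`docs` and the smaller n-gram range are assumptions for illustration). Character n-grams can presumably still match misspelled or obfuscated words that the word vocabulary misses.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = ["you are great", "u r gr8", "great article"]

word_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))

word_features = word_vectorizer.fit_transform(docs)
char_features = char_vectorizer.fit_transform(docs)

# Columns are concatenated; rows (documents) must line up in both matrices.
# format='csr' returns a row-sliceable matrix instead of the default COO.
features = hstack([char_features, word_features], format='csr')

print(word_features.shape, char_features.shape, features.shape)
```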
