# Text classification challenge

## Description of the task:
Build a text classification service.
The model on which the service should be based on may be binary or multiclass/multilabel depending on the dataset of your choice.


## Typical applications of text classification
Text is one of the most common types of unstructured data. Analyzing, understanding, organizing, and sorting through text data is a hard and time-consuming task. 

<p> A few typical applications of text classification technology for companies include:</p>

*	Social media monitoring
*	Brand monitoring
*	Customer service
*   Email classification
*   Documents labeling

<br>
The fundamental tasks in Natural Language Processing (NLP) can be divided into:  
<ul>
<li>sentiment analysis (determining whether a text is positive, negative, or neutral), </li>
<li>topic labeling, </li>
<li>spam detection,</li>
<li>intent detection</li>
</ul>




##  Where can text classification be applied at Kuhne-Nagel

At the company’s web-page there are two text-input entry point:
* [Quote Request](https://onlineservices.kuehne-nagel.com/ac/login?dest=https%3A%2F%2Fonlineservices.kuehne-nagel.com%2Fecom%2Ffa%2Fquote-request)
* [Contact form](https://ee.kuehne-nagel.com/en_gb/other-links/contact-us/)



Classification of the topic in each of the entry points can create value to the company's performance. For example, defining information from application form such as type of goods to deliver, conditions and term provided by customer in text format will help to better process the application.

Based on that, I was searching for a dataset for topic classification. 

Among other open-source datasets the “news classification” dataset that is available in a Python [sklearn package](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html) that will be used in this task to create a topic classification service.

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic.



# Quality metrics
* [F1 score](https://en.wikipedia.org/wiki/F1_score)
* [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)

## Text preprocessing

The goal of preprocessing is to clean text from different kinds of noise and reduce the dimension of the data.

Here are some of the approaches of cleaning:

* remove contractions
* correct misspelled words
* remove special characters
* replace emoticons by meaning
* remove punctuation
* remove accents
* replace numbers


In [1]:
def multi_step_cleaning(df, col_name ):

#    to_remove = ['a','to','of','and']
    contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }
    punct = "/-'?!.,#$%'()*+-/:;<=>@[\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€×™√²—–&'
    punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', }
    mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 
                    'travelling': 'traveling', 'counselling': 'counseling', 
                    'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 
                    'organisation': 'organization', 'wwii': 'world war 2', 
                    'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 
                    'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 
                    'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 
                    'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 
                    'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 
                    'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', 
                    '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 
                    'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                    'demonitization': 'demonetization', 'demonetisation': 'demonetization',
                   'b4':'before', 'otw':'on the way', '2moro':'tomorrow', '2mrrw':'tomorrow', '2mrw':'tomorrow'
                   , 'tomrw':'tomorrow', '2morrow':'tomorrow', '4u': 'for you'}    
    
    def text_specific(x):
        x=str(x)
        #        remove text before Subject
        ind=x.find('Subject:')
        x=x[ind:]
        
            
        # remove emails
        match=re.findall('\S+@\S+', x)
        for mail_add in match:
            x = x.replace(mail_add, ' ')
        for mail_add in match:
            x = x.replace(mail_add, ' ')

        # remove white space    
        x=x.strip()
        return x  
    df[f'{col_name}_clean']=df[col_name].apply(lambda x: text_specific(x))
    
    def clean_contractions(x, mapping):
        specials = ["’", "‘", "´", "`"]
        for s in specials:
            x = x.replace(s, "'")
        x = ' '.join([mapping[t] if t in mapping else t for t in x.split(" ")])
        return x
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: clean_contractions(x, contraction_mapping))
  
    def clean_text(x):
    
        x = str(x)            
        for punct in "/-'":
            x = x.replace(punct, ' ')
        for punct in '&':
            x = x.replace(punct, f' {punct} ')
        for punct in "?!.,\"#$%'()*+-/:;<=>@[\]^_`{|}~' + '“”’":
            x = x.replace(punct, ' ')
            
        #remove white space    
        x=x.strip()
        return x  
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: clean_text(x))
    
  
    
    def clean_special_chars(x, punct, mapping):
        for p in mapping:
            x = x.replace(p, mapping[p])
        
        for p in punct:
            x = x.replace(p, f' {p} ')
        
        specials = {'u200b': ' ', '…': ' ... ', 'ufeff': '', 'करना': '', 'है': ''}  # Other special characters 
        for s in specials:
            x = x.replace(s, specials[s])
        
        return x
    
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))
    
    def correct_spelling(x, dic):
        for word in dic.keys():
            x = x.replace(word, dic[word])
        return x
    
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: correct_spelling(x, mispell_dict))
    
    def clean_numbers(x):
    
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
        return x
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: clean_numbers(x))
    
    def remove_accented_chars(x):
        import unidecode
        """remove accented characters from text, e.g. café"""
        x = unidecode.unidecode(x)
        return x
    df[f'{col_name}_clean']=df[f'{col_name}_clean'].apply(lambda x: clean_numbers(x))
    
    return df


## Importing dataset

In [2]:
import pandas as pd
import numpy as np
import pickle
import re
np.random.seed(42)
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

c_name='message'

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

def prepare(newsgroups_train):

    df=pd.DataFrame(newsgroups_train['data'])
    df.columns=[c_name]
    df['target']=newsgroups_train['target']
    di={}
    for i in range(20):
        di[i]=newsgroups_train['target_names'][i]
    df['target_name']=df['target'].map(di)
    
    df_d=pd.get_dummies(df['target_name'], dummy_na=False)
    
    df=pd.concat([df,df_d], axis=1)
    return df, di

X_test, di=prepare(newsgroups_test)
X_train, di=prepare(newsgroups_train)

targ=list(di.values())

#
print(f'X_train size: {X_train.shape[0]} \nX_test  size: {X_test.shape[0]}')
a=X_train['target_name'].value_counts().reset_index()
a.columns=['target_name', 'countInTrain']
b=X_test['target_name'].value_counts().reset_index()
b.columns=['target_name', 'countInTest']
a=pd.merge(a,b, on='target_name', how='left')
print('Distribution by category')
print(a)
del a, b

X_train size: 11314 
X_test  size: 7532
Distribution by category
                 target_name  countInTrain  countInTest
0           rec.sport.hockey           600          399
1     soc.religion.christian           599          398
2            rec.motorcycles           598          398
3         rec.sport.baseball           597          397
4                  sci.crypt           595          396
5                  rec.autos           594          396
6                    sci.med           594          396
7             comp.windows.x           593          395
8                  sci.space           593          394
9    comp.os.ms-windows.misc           591          394
10           sci.electronics           591          393
11  comp.sys.ibm.pc.hardware           590          392
12              misc.forsale           585          390
13             comp.graphics           584          389
14     comp.sys.mac.hardware           578          385
15     talk.politics.mideast           

In [3]:
#cleaning and preprocessing
X_train=multi_step_cleaning(X_train, c_name )
X_test=multi_step_cleaning(X_test, c_name )
# print sample
print('*'*20 ,'ORIGINAL', '*'*20)
print(X_train.loc[112, f'{c_name}'])
print('*'*20 ,'CLEANED', '*'*20)
print(X_train.loc[112, f'{c_name}_clean'])


******************** ORIGINAL ********************
From: keegan-edward@cs.yale.edu (Edward Keegan)
Subject: DEC MT 486, Adaptec SCSI, 3COMM conflict
Organization: Yale University Computer Science Dept., New Haven, CT 06520-2158
Lines: 14
Distribution: world
NNTP-Posting-Host: thumper.cf.cs.yale.edu


I have a DEC NT 486DX33 that has an Adaptec SCSI controller, hard disk
and cd-rom drive. When I add a 3COMM Ethernet card (3C503) and reboot
the system I receive an error message that a boot device cannot be
found. Pull the 3COMM card and reboot, everything is fine. I've moved
the controller and 3COMM card to various slots, different positions
(slot before the controller, slot after the controller) with the
same result. DEC hasn't responded to the problem yet. Any help would
be appreciated.
-- 
Edward T. Keegan, Facility Director             E-MAIL: keegan@cs.yale.edu
Yale University, Computer Science Department     PHONE: 1-203-432-1254
51 Prospect Street, Room 009                       F

## No Machine Learning Approach: Simple rule-based classifier

Let's create super simple rule-based classifier for one of categories as a baseline to assess quality

In [4]:
w_auto=['car', 'auto', 'automobile', 'vehicle', 'bus', 'truck', 'lorry',
        'transport']
#car types
w_auto+=['pickup','four-wheeler', 'convertible', 'suv', 'sedan','hatchback', 'limousine', 'limo',
         'van', 'taxi']
#car make
w_auto+=['daimler' ,'acura','alfa romeo','audi','bmw','bentley','buick','cadillac','chevrolet','chrysler','dodge','fiat','ford','gmc','genesis','honda','hyundai','infiniti','jaguar','jeep','kia','land rover','lexus','lincoln','lotus','maserati','mazda','mercedes benz','mercedes-benz','mercury','mini cooper','mitsubishi','nissan','polestar','pontiac','porsche','ram','rolls-royce','saab','saturn','scion','smart','subaru','suzuki','tesla','toyota','volkswagen','volvo','geely']
#driver 
w_auto+=['chauffeur','motorist','automobilist']
w_auto_s=[f'{x}s' for x in w_auto]
w_auto_es=[f'{x}es' for x in w_auto]
w_auto+=w_auto_s+w_auto_es
del w_auto_s,w_auto_es
print(f'Dictionary length: {len(w_auto)}')

a=X_test[[f'{c_name}_clean',  'rec.autos']].copy()

#predict
a['predict']=a[f'{c_name}_clean'].apply(lambda x: 1 if any((' ' + w + ' ') in x for w in w_auto) else 0 )


Dictionary length: 210


In [5]:
def cm2df(cm, labels=[0,1]):
    df = pd.DataFrame()
    # rows
    for i, row_label in enumerate(labels):
        rowdata={}
        # columns
        for j, col_label in enumerate(labels): 
            rowdata[col_label]=cm[i,j]
        df = df.append(pd.DataFrame.from_dict({row_label:rowdata}, orient='index'))
    return df[labels]

def qual(acm):
    prec=acm.iloc[1,1]/(acm.iloc[0,1]+acm.iloc[1,1]) #column
    recall=acm.iloc[1,1]/(acm.iloc[1,0]+acm.iloc[1,1]) #row
    f1=round(2*(prec*recall)/(prec+recall),4)
    auc=round(roc_auc_score(a['rec.autos'],a['predict']),4)
    return prec, recall, f1, auc

acm=cm2df(confusion_matrix(a['rec.autos'],a['predict']), [0,1])
prec, recall, f1, auc=qual(acm)
print( '*** Results of the classification ***')
print(f"Total cases from class:\nOriginal: {a['rec.autos'].sum()}\nPredicted: {a['predict'].sum()}")
print(acm,'\n\n', f'F1:  {round(f1,4)} \n AUC: {auc}' )

*** Results of the classification ***
Total cases from class:
Original: 396
Predicted: 811
      0    1
0  6606  530
1   115  281 

 F1:  0.4656 
 AUC: 0.8177


## Machine Learning 

Neural Networks with preprained Embeddings are the most common algorythms applied to text classification.
Neural network embeddings are useful because they can reduce the dimensionality of categorical variables and meaningfully represent categories in the transformed space (words with similar meaning will be represented with close values).

For the purpose to develope a classifier the tiniest embedding 'Glove' was chosen ([Link to Stanford](http://nlp.stanford.edu/data/glove.6B.zip))

## Embeddings
Preprocessing [link](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings)

Embeddings
* [GoogleNews-vectors-negative300](https://code.google.com/archive/p/word2vec/)
* [glove.840B.300d](https://nlp.stanford.edu/projects/glove/)
* [paragram_300_sl999](https://cogcomp.org/page/resource_view/106)
* [wiki-news-300d-1M](https://fasttext.cc/docs/en/english-vectors.html)

Or 

https://www.kaggle.com/c/quora-insincere-questions-classification/data

In [6]:
from keras.preprocessing import text, sequence
import warnings
warnings.filterwarnings('ignore')

import os
max_features = 30000
embed_size = 300
maxlen = 200


print("tokenize")
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train['message'].fillna("fillna") ) + list(X_test['message'].fillna("fillna").values))

filename = 'tokenizer.sav'
pickle.dump(tokenizer, open(filename, 'wb'))


X_tr = tokenizer.texts_to_sequences(X_train['message'])
X_te = tokenizer.texts_to_sequences(X_test['message'])

X_tr = sequence.pad_sequences(X_tr, maxlen=maxlen)
X_te = sequence.pad_sequences(X_te, maxlen=maxlen)

EMBEDDING_FILE = 'glove.6B.300d.txt'

def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in open(EMBEDDING_FILE,encoding='utf-8'))

print("create weights matrix")
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector



  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


tokenize
create weights matrix


## Model

In [7]:
#model
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, concatenate
from keras.layers import GRU, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers import LSTM, Dropout, Activation,  Conv1D, GlobalMaxPool1D

from keras.callbacks import Callback

class RocAucEvaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        super(Callback, self).__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            print("\n ROC-AUC - epoch: %d - score: %.6f \n" % (epoch+1, score))


def get_model(emb=True):
    inp = Input(shape=(maxlen, ))
    if emb==True:
        x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
    else:
        x = Embedding(max_features, embed_size)(inp)

    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(GRU(80, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    outp = Dense(1, activation="sigmoid")(conc)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


## Model description

- Layer 1: Input - As an input model is feeded by result of pad_sequences function. Pad_sequences applied to ensure that all sequences in a list have the same length. By default this is done by padding 0 in the beginning of each sequence until each sequence has the same length 
- Layer 2: Embedding. Another input layer to set initial weight to each token (word)
- Layer 3: Dropout layers are used to prevent overfitting
- Layer 4: Bidirectional (recurent layer) This layers are included because there is an evidence that the context of the whole context is used to interpret what is being "said" rather than a linear interpretation
- Layer 6: The result is a concatenation of two results of Pooling. Pooling is used to reduce the dimention by summarizing the presence of features in patches. Max pooling gets maximum from the patch. This also helps to reduce the impact of the location of the feature (words) in the input
- Layer 7: Output (Dense): classic fully connected neural network layer: each input node is connected to each output node. The result with activation "sigmoid" will be a probability that an input belongs to a given class 

## Training

There could be 

- one model trained for all classes 
- single model trained for each class 

First approach will be faster but second can provides better quality

There are two option to chose from: 

- single model for each class (train-validation split validation)
- Kfold validation (will create K models for each class, the resulting probability will be averaged, and the result will be in general more stable)

Both options are implemented but the result will be provided by default "single model per class".

In [8]:

batch_size = 32
epochs = 1

folds=5
s=4732
df_qual=pd.DataFrame()
df_threshold=pd.DataFrame()
for t in targ:
    print('*'*30)
    print(f'\nModel for {t}')
    fol=0
    y_tr=X_train[t]
    y_te=X_test[t]
    
    mdl = get_model(True)
    #print(model.summary())
    qq=[]
    thr=[]
    kfold_model=0
    if kfold_model==0:
        fol+=1
        print(fol)
        X_tra, X_val, y_tra, y_val = train_test_split(X_tr, y_tr, train_size=0.80, random_state=233, stratify=y_tr)
        
        RocAuc = RocAucEvaluation(validation_data=(X_val, y_val), interval=1)

        print("fit")
        mdll = mdl.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                         callbacks=[RocAuc]
        #                 , verbose=0
                         )
        print("predict")
        y_pred_val = mdl.predict(X_val, batch_size=1024)
        y_pred = mdl.predict(X_te, batch_size=1024)

        auc=round(roc_auc_score(y_val,y_pred_val),4)
        print(f'{t} fold {fol} AUC validation: {auc}')
        qq.append(auc)


        #f1=round(f1_score(y_te,y_pred),4)
        auc=round(roc_auc_score(y_te,y_pred),4)
        print(f'{t} fold {fol} AUC predict: {auc}')
        filename = f'finalized_model_{t}_{fol}.sav'
        pickle.dump(mdl, open(filename, 'wb'))


        fscor=0
        thresh_optima=0.5
        for thresh in np.arange(0.1, 0.701, 0.025):
            thresh = np.round(thresh, 3)
            fs=f1_score(y_val, (y_pred_val>thresh).astype(int))
            if fs>=fscor: 
                thresh_optima=thresh
                fscor=fs
        thr.append(thresh_optima)
        print(f"F1 score at threshold {thresh_optima} is {round(fscor,4)}")

        #store quality
        df_qual[t]=qq  
        #store threshold
        df_threshold[t]= thr  
    
    else:
    
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=s)
        sk_f=skf.split(X_tr,  y_tr)


        for train_index, test_index in sk_f:
            fol+=1
            print(fol)
            #    X_tra, X_val, y_tra, y_val = train_test_split(X_tr, y_tr, train_size=0.80, random_state=233, stratify=y_tr)

            X_tra=X_tr[train_index] 
            X_val=X_tr[test_index]

            y_tra=y_tr[train_index] 
            y_val=y_tr[test_index]

            RocAuc = RocAucEvaluation(validation_data=(X_val, y_val), interval=1)

            print("fit")
            mdll = mdl.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                             callbacks=[RocAuc]
            #                 , verbose=0
                             )
            print("predict")
            y_pred_val = mdl.predict(X_val, batch_size=1024)
            y_pred = mdl.predict(X_te, batch_size=1024)

            auc=round(roc_auc_score(y_val,y_pred_val),4)
            print(f'{t} fold {fol} AUC validation: {auc}')
            qq.append(auc)


            #f1=round(f1_score(y_te,y_pred),4)
            auc=round(roc_auc_score(y_te,y_pred),4)
            print(f'{t} fold {fol} AUC predict: {auc}')
            filename = f'finalized_model_{t}_{fol}.sav'
            pickle.dump(mdl, open(filename, 'wb'))


            fscor=0
            thresh_optima=0.5
            for thresh in np.arange(0.1, 0.701, 0.025):
                thresh = np.round(thresh, 3)
                fs=f1_score(y_val, (y_pred_val>thresh).astype(int))
                if fs>=fscor: 
                    thresh_optima=thresh
                    fscor=fs
            thr.append(thresh_optima)
            print(f"F1 score at threshold {thresh_optima} is {round(fscor,4)}")

        #store quality
        df_qual[t]=qq  
        #store threshold
        df_threshold[t]= thr  
    
df_threshold.to_csv("threshold.csv")
df_qual.to_csv("qual.csv")

******************************

Model for alt.atheism
1
fit
Train on 9051 samples, validate on 2263 samples
Epoch 1/1

 ROC-AUC - epoch: 1 - score: 0.975436 

predict
alt.atheism fold 1 AUC validation: 0.9754
alt.atheism fold 1 AUC predict: 0.955
F1 score at threshold 0.25 is 0.73
******************************

Model for comp.graphics
1
fit
Train on 9051 samples, validate on 2263 samples
Epoch 1/1

 ROC-AUC - epoch: 1 - score: 0.956297 

predict
comp.graphics fold 1 AUC validation: 0.9563
comp.graphics fold 1 AUC predict: 0.9455
F1 score at threshold 0.125 is 0.687
******************************

Model for comp.os.ms-windows.misc
1
fit
Train on 9051 samples, validate on 2263 samples
Epoch 1/1

 ROC-AUC - epoch: 1 - score: 0.973142 

predict
comp.os.ms-windows.misc fold 1 AUC validation: 0.9731
comp.os.ms-windows.misc fold 1 AUC predict: 0.9705
F1 score at threshold 0.3 is 0.6519
******************************

Model for comp.sys.ibm.pc.hardware
1
fit
Train on 9051 samples, validate on

In [9]:
df_threshold=pd.read_csv('threshold.csv')
df_qual=pd.read_csv('qual.csv')
df_threshold=df_threshold.T.reset_index()
df_qual=df_qual.T.reset_index()
if kfold_model==0:
    df_threshold['mean']=  df_threshold.iloc[:,1]
    df_qual['mean']=  df_qual.iloc[:,1]
else:
    df_threshold['mean']=  df_threshold.iloc[:,1:folds+1].mean(axis=1) 
    df_qual['mean']=  df_qual.iloc[:,1:folds+1].mean(axis=1) 

In [10]:
#print(df_threshold)
#print(df_qual)

## Text Classification Service

In [11]:
def prediction( my_text , targ=targ, folds=folds, maxlen=maxlen, kfold_model=0):
    if kfold_model==0: folds=1 
    rez_prob_vector={}
    #preprocessing
    X=pd.DataFrame()
    X[c_name]=[my_text]
    X=multi_step_cleaning(X, c_name)
    #print('cleaning done')
    # transformation
    tokenizer = pickle.load(open('tokenizer.sav', 'rb'))
    text_conv = tokenizer.texts_to_sequences(X[f'{c_name}_clean'])
    text_conv = sequence.pad_sequences(text_conv, maxlen=maxlen)
    #print('token done')  
    print(f'Prediction [{len(targ)}]: ', end='')
    df=pd.DataFrame(columns=['target','probability']) 
    for t in targ:
        print('|', end='')
        rez_prob_vector[t]=0
        for i in range(1,folds+1):
        #load model
            loaded_model = pickle.load(open(f'finalized_model_{t}_{i}.sav', 'rb'))
            v=loaded_model.predict(text_conv, batch_size=1024)
            rez_prob_vector[t]+=v[0][0]
            df.loc[df.shape[0]]=[t,v[0][0]]
        #print(rez_prob_vector[t])

    df=df.sort_values(['probability'], ascending=[False] )
    print (f"\n\nThe most probable topic is: {df.iloc[0,0]}")                  
    print('*'*40)      
    print ("\nPossible topics of the text are: ")
    rez=[]
    for t in targ:
        if kfold_model==0:
            trsh=0.30
        else:
            trsh=df_threshold.loc[df_threshold['index']==t, 'mean'].sum()
        if rez_prob_vector[t]>trsh: rez.append(t)
    
    if len(rez)==0:
        print (f'* general *', end = '')
    else:
        for i in rez:
            print (f'* {i} *', end = '')
    print("\n","*"*40,'\n')
 
    return df

In [12]:
from ipywidgets import widgets
from IPython.display import display

my_text=widgets.Text()
display(my_text)

def test_textf(mytext):
    print ("**", mytext, '**')
    
def handle_submit(sender):
    #print('start')
    df=prediction(my_text.value)
    print(df)
#    test_textf( my_text.value)
my_text.on_submit(handle_submit)

Prediction [20]: ||||||||||||||||||||

The most probable topic is: comp.sys.ibm.pc.hardware
****************************************

Possible topics of the text are: 
* comp.sys.ibm.pc.hardware *
 **************************************** 

                      target  probability
3   comp.sys.ibm.pc.hardware     0.650590
4      comp.sys.mac.hardware     0.097121
2    comp.os.ms-windows.misc     0.056079
6               misc.forsale     0.025849
12           sci.electronics     0.021547
13                   sci.med     0.003645
5             comp.windows.x     0.002233
19        talk.religion.misc     0.002187
11                 sci.crypt     0.001585
1              comp.graphics     0.001144
7                  rec.autos     0.001116
0                alt.atheism     0.001108
9         rec.sport.baseball     0.000941
8            rec.motorcycles     0.000755
17     talk.politics.mideast     0.000600
15    soc.religion.christian     0.000504
14                 sci.space     0.000444
16 

## Results

* By implementing Neural Network algorythms with pretrained embedding the task for topic classification can be solved with high accuracy
* Based on the quantity of samples in each class from chosen dataset we can expect that by adding extra text samples algorythm will be able to learn new category


## Further improvements

- find database to be inline with real company needs 
- extract Subject, Organization, etc. and use as a separate input
- test different Embeddings
- try other network structures
- find ways to make it work faster

## Usefull links & Datasets

* https://nlpprogress.com/english/text_classification.html

* https://www.kaggle.com/sbongo/tackling-toxic-problem-with-char-gram-cnn-lstm

* [Quora-insincere questions classification](https://www.kaggle.com/c/quora-insincere-questions-classification/overview)
* [Movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
* [News categorization](https://www.kaggle.com/yufengdev/bbc-text-categorization)
* [SMS spam](https://www.kaggle.com/kredy10/simple-lstm-for-text-classification)
* http://www.qizhexie.com/data/RACE_leaderboard.html)
* https://medium.com/@dataturks/rare-text-classification-open-datasets-9d340c8c508e
* https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da
* https://lionbridge.ai/datasets/the-best-25-datasets-for-natural-language-processing/
* https://aws.amazon.com/ru/datasets/google-books-ngrams/

## Thank you!