# Classifying short texts, using job titles as an example

## About me

**Rafal Pronko**

As a data scientist I have worked / work at:
- Centrum Zastosowan Matematyki i Inzynierii Systemow
- Webinterpret (NLP / churn prediction model)
- YND (NLP / image processing) / Private Matter - blockchain scientist
- CVTimeline (NLP)

Where you can meet me:
- DS / machine learning meetups
- Kariera IT - speaker
- WDI - speaker
- IT career summit - speaker

# NLP

An interdisciplinary field combining artificial intelligence and linguistics, concerned with automating the analysis, understanding, translation, and generation of natural language by computers (Wikipedia).

Main problems within NLP:
- summarization (producing short descriptions of long texts)
- sentiment analysis
- **text classification**
- information extraction from text


# Text classification

Assigning predefined categories to text written in natural language.

![title](img1.png)
https://developers.google.com/machine-learning/guides/text-classification/


## Text classification has many applications:
- Categorizing listings on marketplaces (Ebay / Amazon / Allegro ...)
- Detecting unwanted text: spam / hate speech ...
- Classifying articles: assigning categories / detecting false information
- Classifying job titles ...

## Job title classification

Why:
- HR
- Standardizing titles, e.g. for various kinds of statistics
- Predicting employee behavior
- Predicting length of employment
- Finding a suitable job for a given person
- Assigning the required skills
...

# CVTimeline

![img](img3.png)

![budowa modelu](img2.png)

https://developers.google.com/machine-learning/guides/text-classification/

The data comes from https://www.onetcenter.org/dictionary/23.1/excel/alternate_titles.html

In [6]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

In [7]:
df = dd.read_csv("data.csv", delimiter=";")
df = df.reset_index().set_index('index')

In [8]:
df.head()

| index | Title | Alternate Title | Short Title |
|---|---|---|---|
| 0 | Chief Executives | Aeronautics Commission Director | |
| 1 | Chief Executives | Agricultural Services Director | |
| 2 | Chief Executives | Alcohol and Drug Abuse Assistance Program Admi... | |
| 3 | Chief Executives | Arts and Humanities Council Director | |
| 4 | Chief Executives | Bakery Manager | |


In [9]:
len(df)

59583

In [27]:
res = df.Title.unique()

In [11]:
with ProgressBar():
    unique_titles = res.compute()

[########################################] | 100% Completed |  0.2s


In [12]:
unique_titles.head()

0                       Chief Executives
1          Chief Sustainability Officers
2        General and Operations Managers
3                            Legislators
4    Advertising and Promotions Managers
Name: Title, dtype: object

In [13]:
len(unique_titles)

1109

In [14]:
res = df.Title.str.lower()
with ProgressBar():
    df = df.assign(title_lower=res.compute())

[########################################] | 100% Completed |  0.2s


In [15]:
res = df["Alternate Title"].str.lower()
with ProgressBar():
    df = df.assign(al_title_lower=res.compute())

[########################################] | 100% Completed |  0.3s


In [16]:
df.head()

| index | Title | Alternate Title | Short Title | title_lower | al_title_lower |
|---|---|---|---|---|---|
| 0 | Chief Executives | Aeronautics Commission Director | | chief executives | aeronautics commission director |
| 1 | Chief Executives | Agricultural Services Director | | chief executives | agricultural services director |
| 2 | Chief Executives | Alcohol and Drug Abuse Assistance Program Admi... | | chief executives | alcohol and drug abuse assistance program admi... |
| 3 | Chief Executives | Arts and Humanities Council Director | | chief executives | arts and humanities council director |
| 4 | Chief Executives | Bakery Manager | | chief executives | bakery manager |


In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(df["al_title_lower"].compute(), df["title_lower"].compute())

In [15]:
X_train.shape

(44687,)

## Text representations for classification:
- one-hot encoding of whole texts - works only if we keep classifying the same texts over and over; it does not generalize
- bag of words - generalizes to unseen texts that share words with the training data
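The difference between the two representations can be seen on a toy example. This is a minimal sketch in plain Python, not the `CountVectorizer` used in the notebook; the titles and the helper `bag_of_words` are made up for illustration.

```python
from collections import Counter

# Hypothetical training titles (already lowercased).
train_titles = ["bakery manager", "arts and humanities council director"]

# One-hot over whole strings: a title is recognized only if seen verbatim.
one_hot_index = {title: i for i, title in enumerate(train_titles)}

# Vocabulary built from training titles only (sorted for a stable column order).
vocabulary = sorted({word for title in train_titles for word in title.split()})


def bag_of_words(title, vocabulary):
    """Count how often each vocabulary word occurs in the title."""
    counts = Counter(title.lower().split())
    return [counts[word] for word in vocabulary]


unseen = "senior bakery manager"
seen_verbatim = unseen in one_hot_index           # no one-hot column exists
unseen_vector = bag_of_words(unseen, vocabulary)  # still shares two words

print(seen_verbatim, unseen_vector)
```

The unseen title has no one-hot column at all, while its bag-of-words vector still overlaps with "bakery manager", which is what lets a classifier generalize.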

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

In [17]:
clf = Pipeline([("text_processing", CountVectorizer()), ("clf", MultinomialNB())])

In [18]:
param = {
    "text_processing": [CountVectorizer(), TfidfVectorizer()],
    "text_processing__ngram_range": [(1,1), (1,2), (1,3)],
    
}
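The grid above combines two vectorizers with three n-gram ranges. The sketch below only enumerates those combinations with `itertools.product`; it uses class names as plain strings instead of real estimator instances and assumes 5-fold cross-validation, as configured in the next cell.

```python
from itertools import product

# Names only; the real grid holds CountVectorizer() and TfidfVectorizer() instances.
vectorizers = ["CountVectorizer", "TfidfVectorizer"]
ngram_ranges = [(1, 1), (1, 2), (1, 3)]
cv_folds = 5  # assumption: matches cv=5 passed to GridSearchCV

# Every (vectorizer, ngram_range) pair is one candidate pipeline.
candidates = list(product(vectorizers, ngram_ranges))
total_fits = len(candidates) * cv_folds

for vectorizer, ngram_range in candidates:
    print(vectorizer, ngram_range)
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

Each candidate is trained once per fold, so the search cost grows multiplicatively with every parameter added to the grid.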

In [19]:
grid = GridSearchCV(clf, param_grid=param, verbose=5, cv=5, scoring="accuracy")

In [20]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1) 
[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1), score=0.1696466581524053, total=   3.1s
[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.1s remaining:    0.0s


[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1), score=0.16710353866317168, total=   4.1s
[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    9.2s remaining:    0.0s


[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1), score=0.16920492721164615, total=   3.3s
[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   13.5s remaining:    0.0s


[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1), score=0.17488532110091742, total=   3.1s
[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   17.6s remaining:    0.0s


[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 1), score=0.18114319387153802, total=   3.3s
[CV] text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 2) 
[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict'

[CV]  text_processing=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), text_processing__ngram_range=(1, 3), score=0.19198585739540366, total=  11.4s
[CV] text_processing=TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None), text_processing__ngram_range=(1, 1) 
[CV]  text_process

[CV]  text_processing=TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None), text_processing__ngram_range=(1, 2), score=0.07469204927211646, total=   6.7s
[CV] text_processing=TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        v

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  4.2min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('text_processing', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), p...nizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'text_processing': [CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=Non..., use_idf=True,
        vocabulary=None)], 'text_processing__ngram_range': [(1, 1), (1, 2), (1, 3)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
    

In [21]:
grid.best_score_

0.18470696175621545

In [22]:
grid.best_estimator_

Pipeline(memory=None,
     steps=[('text_processing', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [23]:
from sklearn.metrics import accuracy_score
pred_baseline = grid.predict(X_test)
print(accuracy_score(y_test, pred_baseline))

0.1871643394199785


Google's advice:

1. Compute the ratio: number of samples / number of words per sample
2. If the ratio is below 1500, tokenize the text as n-grams and classify with an MLP
    - tokenize at the word (or character) level
    - select about 20K of the most important n-grams
    - build the model
3. If the ratio is above 1500, tokenize the text as sequences and use a sepCNN
    - split the samples into words and keep the 20K most frequent words
    - convert the samples to a sequence representation
    - if the ratio from step 1 is below 15K, use a pre-trained embedding with the sepCNN
4. Tune the model to find the best hyperparameters
   
https://developers.google.com/machine-learning/guides/text-classification/step-2-5

In [41]:
number_words = [len(x.split(" ")) for x in X_train.values]
avg_words_per_sample = sum(number_words) / X_train.shape[0]

In [42]:
X_train.shape[0] / avg_words_per_sample

17343.025359770025
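The guideline quoted earlier can be written down as a small decision function. This is only a sketch of the rule as summarized in this notebook (thresholds 1500 and 15K); the function name `choose_model` and its return format are made up for illustration.

```python
def choose_model(num_samples, words_per_sample):
    """Return (ratio, model family, whether a pre-trained embedding is suggested)."""
    ratio = num_samples / words_per_sample
    if ratio < 1500:
        return ratio, "n-gram MLP", False
    # Sequence model; a pre-trained embedding is suggested only below 15K.
    return ratio, "sepCNN", ratio < 15000


# 44687 training samples; the average title length is back-computed from the
# ratio printed above, so it is approximate.
num_samples = 44687
words_per_sample = num_samples / 17343.025359770025
ratio, model_family, use_pretrained = choose_model(num_samples, words_per_sample)

print(round(ratio), model_family, use_pretrained)
```

For this dataset the ratio is above both thresholds, so the rule points to a sepCNN with an embedding trained from scratch.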

In [62]:
# code taken from https://developers.google.com/machine-learning/guides/text-classification/step-3
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.python.keras.preprocessing import text

# Vectorization parameters
# Limit on the number of features. We use the top 20K features.
TOP_K = 20000

# Limit on the length of text sequences. Sequences longer than this
# will be truncated.
MAX_SEQUENCE_LENGTH = 500

def sequence_vectorize(train_texts, val_texts):
    """Vectorizes texts as sequence vectors.

    1 text = 1 sequence vector with fixed length.

    # Arguments
        train_texts: list, training text strings.
        val_texts: list, validation text strings.

    # Returns
        x_train, x_val, word_index, max_length: vectorized training and
            validation texts, word index dictionary, and the sequence
            length used for padding.
    """
    # Create vocabulary with training texts.
    tokenizer = text.Tokenizer(num_words=TOP_K)
    tokenizer.fit_on_texts(train_texts)

    # Vectorize training and validation texts.
    x_train = tokenizer.texts_to_sequences(train_texts)
    x_val = tokenizer.texts_to_sequences(val_texts)

    # Get max sequence length.
    max_length = len(max(x_train, key=len))
    if max_length > MAX_SEQUENCE_LENGTH:
        max_length = MAX_SEQUENCE_LENGTH

    # Fix sequence length to max value. Sequences shorter than the length are
    # padded in the beginning and sequences longer are truncated
    # at the beginning.
    x_train = sequence.pad_sequences(x_train, maxlen=max_length)
    x_val = sequence.pad_sequences(x_val, maxlen=max_length)
    return x_train, x_val, tokenizer.word_index, max_length
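To make the steps inside `sequence_vectorize` concrete, here is a simplified plain-Python imitation of them. It is not the Keras implementation: it splits on whitespace only, breaks frequency ties alphabetically, and the helper names `fit_word_index`, `texts_to_sequences` and `pad_pre` are invented for this sketch.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Ties are broken alphabetically to keep the sketch deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def texts_to_sequences(texts, word_index, top_k):
    """Replace words with indices; drop unknown words and those outside top_k."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < top_k]
            for t in texts]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen items of longer sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


train = ["bakery manager", "sales manager", "arts council director"]
val = ["senior bakery director"]  # "senior" never occurs in the training texts

word_index = fit_word_index(train)
train_seq = texts_to_sequences(train, word_index, top_k=20000)
demo_max_length = max(len(s) for s in train_seq)
x_train_demo = pad_pre(train_seq, demo_max_length)
x_val_demo = pad_pre(texts_to_sequences(val, word_index, top_k=20000), demo_max_length)

print(x_train_demo, x_val_demo)
```

Index 0 is reserved for padding, and words missing from the training vocabulary are silently dropped from validation texts.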

In [5]:
x_train, x_test, idx, max_length = sequence_vectorize(X_train.values, X_test.values)


In [24]:
from tensorflow.python.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

In [25]:
label_encoder = LabelEncoder()
y_train_labs = label_encoder.fit_transform(y_train)
y_test_labs = label_encoder.transform(y_test)
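The label encoding step maps each distinct title to an integer, which is what the `sparse_categorical_crossentropy` loss used later expects. The sketch below imitates that mapping in plain Python with hypothetical helpers `fit_labels` and `encode`; it is not scikit-learn's implementation.

```python
def fit_labels(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode(labels, mapping):
    """Encode labels; a label unseen during fitting raises KeyError in this sketch."""
    return [mapping[label] for label in labels]


# Hypothetical target column.
y_demo = ["legislators", "chief executives", "legislators"]
mapping = fit_labels(y_demo)
encoded = encode(y_demo, mapping)

print(mapping, encoded)
```

Because the encoder is fitted on the training labels only, a test label that never occurs in the training split cannot be encoded; with a random split and more than a thousand classes this is worth checking.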

In [48]:
# code taken from https://developers.google.com/machine-learning/guides/text-classification/step-3
def _get_last_layer_units_and_activation(num_classes):
    """Gets the # units and activation function for the last network layer.

    # Arguments
        num_classes: int, number of classes.

    # Returns
        units, activation values.
    """
    if num_classes == 2:
        activation = 'sigmoid'
        units = 1
    else:
        activation = 'softmax'
        units = num_classes
    return units, activation

In [47]:
# code taken from https://developers.google.com/machine-learning/guides/text-classification/step-3
from tensorflow.python.keras import models
from tensorflow.python.keras import initializers
from tensorflow.python.keras import regularizers

from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.layers import Dropout
from tensorflow.python.keras.layers import Embedding
from tensorflow.python.keras.layers import SeparableConv1D
from tensorflow.python.keras.layers import MaxPooling1D
from tensorflow.python.keras.layers import GlobalAveragePooling1D

def sepcnn_model(blocks,
                 filters,
                 kernel_size,
                 embedding_dim,
                 dropout_rate,
                 pool_size,
                 input_shape,
                 num_classes,
                 num_features,
                 use_pretrained_embedding=False,
                 is_embedding_trainable=False,
                 embedding_matrix=None):
    """Creates an instance of a separable CNN model.

    # Arguments
        blocks: int, number of pairs of sepCNN and pooling blocks in the model.
        filters: int, output dimension of the layers.
        kernel_size: int, length of the convolution window.
        embedding_dim: int, dimension of the embedding vectors.
        dropout_rate: float, percentage of input to drop at Dropout layers.
        pool_size: int, factor by which to downscale input at MaxPooling layer.
        input_shape: tuple, shape of input to the model.
        num_classes: int, number of output classes.
        num_features: int, number of words (embedding input dimension).
        use_pretrained_embedding: bool, true if pre-trained embedding is on.
        is_embedding_trainable: bool, true if embedding layer is trainable.
        embedding_matrix: dict, dictionary with embedding coefficients.

    # Returns
        A sepCNN model instance.
    """
    op_units, op_activation = _get_last_layer_units_and_activation(num_classes)
    model = models.Sequential()

    # Add embedding layer. If pre-trained embedding is used add weights to the
    # embeddings layer and set trainable to input is_embedding_trainable flag.
    if use_pretrained_embedding:
        model.add(Embedding(input_dim=num_features,
                            output_dim=embedding_dim,
                            input_length=input_shape[0],
                            weights=[embedding_matrix],
                            trainable=is_embedding_trainable))
    else:
        model.add(Embedding(input_dim=num_features,
                            output_dim=embedding_dim,
                            input_length=input_shape[0]))

    for _ in range(blocks-1):
        model.add(Dropout(rate=dropout_rate))
        model.add(SeparableConv1D(filters=filters,
                                  kernel_size=kernel_size,
                                  activation='relu',
                                  bias_initializer='random_uniform',
                                  depthwise_initializer='random_uniform',
                                  padding='same'))
        model.add(SeparableConv1D(filters=filters,
                                  kernel_size=kernel_size,
                                  activation='relu',
                                  bias_initializer='random_uniform',
                                  depthwise_initializer='random_uniform',
                                  padding='same'))
        model.add(MaxPooling1D(pool_size=pool_size))

    model.add(SeparableConv1D(filters=filters * 2,
                              kernel_size=kernel_size,
                              activation='relu',
                              bias_initializer='random_uniform',
                              depthwise_initializer='random_uniform',
                              padding='same'))
    model.add(SeparableConv1D(filters=filters * 2,
                              kernel_size=kernel_size,
                              activation='relu',
                              bias_initializer='random_uniform',
                              depthwise_initializer='random_uniform',
                              padding='same'))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(rate=dropout_rate))
    model.add(Dense(op_units, activation=op_activation))
    return model

In [168]:
model = sepcnn_model(blocks=2,
                     filters=32,
                     kernel_size=5,
                     embedding_dim=300,
                     dropout_rate=0.3,
                     pool_size=3,
                     input_shape=(max_length,),
                     num_classes=1109,
                     num_features=TOP_K,
                     use_pretrained_embedding=False,
                     is_embedding_trainable=False,)

In [169]:
model.summary()

Model: "sequential_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_24 (Embedding)     (None, 24, 300)           6000000   
_________________________________________________________________
dropout_59 (Dropout)         (None, 24, 300)           0         
_________________________________________________________________
separable_conv1d_117 (Separa (None, 24, 32)            11132     
_________________________________________________________________
separable_conv1d_118 (Separa (None, 24, 32)            1216      
_________________________________________________________________
max_pooling1d_42 (MaxPooling (None, 8, 32)             0         
_________________________________________________________________
separable_conv1d_119 (Separa (None, 8, 64)             2272      
_________________________________________________________________
separable_conv1d_120 (Separa (None, 8, 64)           
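The parameter counts in the summary can be reproduced by hand. A separable convolution splits the work into a depthwise step (one kernel per input channel) and a pointwise 1x1 step that mixes channels. The sketch below assumes a depth multiplier of 1 and one bias per output filter; the function names are illustrative.

```python
def separable_conv1d_params(kernel_size, in_channels, filters):
    """Depthwise kernels + pointwise 1x1 mixing weights + biases."""
    depthwise = kernel_size * in_channels  # one kernel per input channel
    pointwise = in_channels * filters      # 1x1 convolution across channels
    bias = filters
    return depthwise + pointwise + bias


def conv1d_params(kernel_size, in_channels, filters):
    """Regular 1D convolution, for comparison."""
    return kernel_size * in_channels * filters + filters


embedding = 20000 * 300                        # TOP_K x embedding_dim
first = separable_conv1d_params(5, 300, 32)    # reads the 300-dim embedding
second = separable_conv1d_params(5, 32, 32)
third = separable_conv1d_params(5, 32, 64)
regular_first = conv1d_params(5, 300, 32)

print(embedding, first, second, third, regular_first)
```

The first three values match the `Param #` column above, and a regular convolution in place of the first separable layer would need more than four times as many weights.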

In [170]:
model.predict(x_train[:2])

array([[0.000902  , 0.00089847, 0.00089884, ..., 0.00090491, 0.00090141,
        0.00090136],
       [0.000902  , 0.00089847, 0.00089884, ..., 0.00090491, 0.00090141,
        0.00090136]], dtype=float32)
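The predictions of the untrained model are all close to 0.0009, which is what a near-uniform softmax over 1109 classes looks like. As a quick sanity check (a sketch, not part of the original notebook), the uniform probability and the corresponding cross-entropy can be computed directly:

```python
import math

num_classes = 1109

# A model that knows nothing spreads probability evenly over all classes.
uniform_probability = 1 / num_classes

# Cross-entropy of a uniform prediction, i.e. the loss to expect before training.
initial_loss = -math.log(uniform_probability)

print(uniform_probability, initial_loss)
```

A training loss that stays near this value over many epochs would suggest the model is not learning anything beyond chance.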

In [171]:
from tensorflow.python.keras.optimizers import Adam

In [172]:
model.compile(optimizer=Adam(lr=1e-3), loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [173]:
model.fit(x_train,
          y_train_labs,
          epochs=100,
          validation_data=(x_test, y_test_labs),
          batch_size=512)

Train on 44687 samples, validate on 14896 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/100
...
Epoch 100/100


<tensorflow.python.keras.callbacks.History at 0x7fc3546c44a8>

In [161]:
from tensorflow.python.keras.layers import Bidirectional
from tensorflow.python.keras.layers import GRU

In [167]:
def gru_model(num_features, embedding_dim, input_shape, num_classes):
    op_units, op_activation = _get_last_layer_units_and_activation(num_classes)
    model = models.Sequential()
    model.add(Embedding(input_dim=num_features,
                            output_dim=embedding_dim,
                            input_length=input_shape[0]))
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(Bidirectional(GRU(64, return_sequences=False)))
    model.add(Dropout(0.5))
    model.add(Dense(op_units, activation=op_activation))
    return model

In [163]:
model_gru = gru_model(num_features=TOP_K,embedding_dim=250,input_shape=(max_length,),num_classes=1109)

In [164]:
model_gru.summary()

Model: "sequential_23"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, 24, 250)           5000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 24, 256)           291072    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               123264    
_________________________________________________________________
dense_16 (Dense)             (None, 1109)              143061    
Total params: 5,557,397
Trainable params: 5,557,397
Non-trainable params: 0
_________________________________________________________________
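The parameter counts of the GRU model can also be checked by hand. The sketch below assumes a GRU with three gates and a single bias vector per gate, which is the variant that reproduces the numbers in this summary; GRU implementations with a separate recurrent bias give slightly larger counts. Function names are illustrative.

```python
def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and one bias."""
    return 3 * (units * (input_dim + units) + units)


def bidirectional_gru_params(input_dim, units):
    """Forward and backward GRU have independent weights."""
    return 2 * gru_params(input_dim, units)


embedding = 20000 * 250                          # TOP_K x embedding_dim
first = bidirectional_gru_params(250, 128)       # reads the embedding
second = bidirectional_gru_params(2 * 128, 64)   # reads both directions of layer 1
dense = (2 * 64) * 1109 + 1109                   # weights + biases of the output layer
total = embedding + first + second + dense

print(first, second, dense, total)
```

Almost all of the parameters sit in the embedding layer, which is typical when the vocabulary is large and the texts are short.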


In [165]:
model_gru.compile(optimizer=Adam(lr=1e-3), loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [166]:
model_gru.fit(x_train,
          y_train_labs,
          epochs=100,
          validation_data=(x_test, y_test_labs),
          batch_size=512)

Train on 44687 samples, validate on 14896 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/100
...


Epoch 56/100

KeyboardInterrupt: 

To use fastText, the input file must contain one example per line in the form

```__label__<X> <Text>```
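A small sketch of building and reading such lines, assuming the default `__label__` prefix and integer class ids as labels. The helper names `to_fasttext_line` and `parse_fasttext_line` are made up for illustration and are not part of the fastText library.

```python
LABEL_PREFIX = "__label__"  # assumption: fastText's default label prefix


def to_fasttext_line(label, text):
    """Format one training example as '__label__<label> <text>'."""
    return "{}{} {}".format(LABEL_PREFIX, label, text)


def parse_fasttext_line(line):
    """Split a formatted line back into (label, text)."""
    tag, _, text = line.rstrip("\n").partition(" ")
    return tag[len(LABEL_PREFIX):], text


line = to_fasttext_line(7, "bakery manager")
label, text = parse_fasttext_line(line)

print(line)
```

The label comes back as a string, so it has to be converted with `int` before comparing it against the encoded labels.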

In [185]:
with open('train.txt', 'w') as f:
    for i, a in enumerate(X_train):
        f.write("{}\n".format("__label__" + str(y_train_labs[i]) + " "+a))

In [186]:
with open('test.txt', 'w') as f:
    for i, a in enumerate(X_test):
        f.write("{}\n".format("__label__" + str(y_test_labs[i]) + " "+a))

In [2]:
import fasttext

In [3]:
classifier = fasttext.supervised('train.txt', 'text_classify', epoch=10, dim=200)

In [19]:
result = classifier.predict(X_test)

In [20]:
result = [int(r[0]) for r in result]

In [26]:
accuracy_score(y_test_labs, result)

0.29987916219119226

# Summary
Results (accuracy):

Baseline: 0.1871643394199785

Google baseline: 0.2136

GRU: 0.2716

FastText: 0.29987916219119226

# To do
- networks with attention
- Siamese networks

## Bibliography

1. Separable Convolution
    - [Xception: Deep Learning with Depthwise Separable Convolutions](https://arxiv.org/pdf/1610.02357.pdf)
    - [A Basic Introduction to Separable Convolutions](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)
    - [Network Decoupling: From Regular to Depthwise Separable Convolutions](https://arxiv.org/pdf/1808.05517.pdf)
    - [Depthwise Separable Convolutions for Neural Machine Translation](https://arxiv.org/pdf/1706.03059.pdf)
    - [Depthwise separable convolutions for machine learning](https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/)
2. Job title classification
    - [Semantic Similarity Strategies for Job Title Classification](https://arxiv.org/pdf/1609.06268v1.pdf)
    - [Learning Text Similarity with Siamese Recurrent Networks](http://www.aclweb.org/anthology/W/W16/W16-1617.pdf)
3. Text classification
    - [Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf)
    - [Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?](https://arxiv.org/pdf/1708.02657.pdf)
    - [Fully Convolutional Networks for Text Classification](https://export.arxiv.org/pdf/1902.05575)
4. Text classification with siamese
    - [Learning Text Similarity with Siamese Recurrent Networks](http://www.aclweb.org/anthology/W/W16/W16-1617.pdf)
    
    
[![Depthwise CNN](http://img.youtube.com/vi/T7o3xvJLuHk/0.jpg)](http://www.youtube.com/watch?v=T7o3xvJLuHk "Video Title")