## Imports

In [1]:
import mltlk
print(mltlk.__version__)
from mltlk import *
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten

0.1.13


2023-06-01 15:39:10.624150: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load data
Load the data, clean the text, and preprocess it for a Keras word embedding layer.

In [2]:
session = load_data("data/wikipedia_300.csv.gz", conf={
    "preprocess": "embeddings",
    "w2v_vector_size": 75,
    "stopwords": ["english", "stopwords/custom.csv"],
    "clean_text": "letters digits",
    "encode_labels": True,
    "max_length": 10000,
})

[1m[33mInfo: [0mClean texts keeping letters and digits
[1m[33mInfo: [0mLabels encoded
[1m[33mInfo: [0mLoad 180 stopwords from [36menglish, stopwords/custom.csv[0m
[1m[33mInfo: [0mVocabulary size is [34m53284[0m
[1m[33mInfo: [0m[34m97.3%[0m of sequences covered by max length [34m10000[0m
[1m[33mInfo: [0mLoaded [34m300[0m examples in [34m2[0m categories


#### Show data stats

In [3]:
data_stats(session)

| Category | No | % | Σ% |
|---|---|---|---|
| Games (0) | 150 | 50.0% | 50.0% |
| Programming (1) | 150 | 50.0% | 100.0% |

Examples: 300, Features: 10000, Categories: 2


## Define Keras model
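Define the architecture of the Keras model: an embedding layer, a flatten step, a dense hidden layer with dropout, and a softmax output over the two categories.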

In [4]:
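def get_model(session):
    model = Sequential()
    # Map each word index to an embedding vector
    model.add(Embedding(
        input_dim=session["vocab_size"],
        output_dim=session["embeddings_size"],
        input_length=session["max_length"],
    ))
    # Flatten the (max_length, embeddings_size) output into one vector
    model.add(Flatten())
    model.add(Dense(64, activation="relu"))
    model.add(Dropout(0.2))
    # One softmax output per category (Games, Programming)
    model.add(Dense(2, activation="softmax"))
    return model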
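
To get a feel for the size of this architecture, the weights per layer can be counted by hand. The sketch below assumes that `session["vocab_size"]` is 53284 and `session["max_length"]` is 10000 (the values printed when loading the data) and that `session["embeddings_size"]` equals the configured `w2v_vector_size` of 75. The session may hold slightly different values, so treat the totals as an estimate. `count_parameters` is a hypothetical helper, not part of mltlk or Keras.

```python
def count_parameters(vocab_size, embeddings_size, max_length,
                     hidden_units=64, n_classes=2):
    """Count weights for Embedding -> Flatten -> Dense -> Dropout -> Dense."""
    embedding = vocab_size * embeddings_size              # one vector per word
    flattened = max_length * embeddings_size              # Flatten has no weights
    dense_hidden = flattened * hidden_units + hidden_units  # weights + biases
    dense_out = hidden_units * n_classes + n_classes        # weights + biases
    return {
        "embedding": embedding,
        "dense_hidden": dense_hidden,
        "dense_out": dense_out,
        "total": embedding + dense_hidden + dense_out,
    }

# Assumed values, taken from the load_data configuration and log output
params = count_parameters(vocab_size=53284, embeddings_size=75, max_length=10000)
for name, n in params.items():
    print(f"{name}: {n:,}")
# embedding: 3,996,300
# dense_hidden: 48,000,064
# dense_out: 130
# total: 51,996,494
```

Under these assumptions almost all weights sit in the first dense layer, because flattening a 10000 x 75 embedding output yields 750,000 inputs.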

## Evaluate model using train-test split
Build the Keras model and evaluate it using a train-test split.

In [5]:
split_data(session, conf={
    "test_size": 0.1,
    "seed": 4,
    "stratify": True,
})

[1m[33mInfo: [0mSplit data using [34m90%[0m training data and [34m10%[0m test data with seed [34m4[0m and stratify


In [6]:
evaluate_model(get_model(session), session, reload=False, conf={
    "mode": "split",
    "categories": True,
    "seed": 42,
    "epochs": 8,
    "batch_size": 32,
    "loss": "categorical_crossentropy",
})

Building and evaluating model using train-test split took 38.07 sec



| Results | |
|---|---|
| Accuracy | 76.67% |
| F1-score | 76.64% |
| Precision | 76.79% |
| Recall | 76.67% |





| Category | Accuracy | n |
|---|---|---|
| Games (0) | 80.00% | 15 |
| misclassified as Programming (1) | 20.00% | 3 |
| Programming (1) | 73.33% | 15 |
| misclassified as Games (0) | 26.67% | 4 |
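
The summary metrics can be reproduced from the per-category output. Reading it as 12 of 15 Games articles and 11 of 15 Programming articles classified correctly, with 3 and 4 misclassified respectively, gives the confusion counts below. The sketch assumes macro averaging (an unweighted mean over the two categories), which is not stated in the output but matches the reported numbers.

```python
# Rows are true categories, columns are predicted categories.
confusion = {
    "Games":       {"Games": 12, "Programming": 3},
    "Programming": {"Games": 4,  "Programming": 11},
}
labels = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in labels) / total

precisions, recalls, f1s = [], [], []
for c in labels:
    tp = confusion[c][c]
    predicted = sum(confusion[t][c] for t in labels)  # column sum
    actual = sum(confusion[c].values())               # row sum
    p, r = tp / predicted, tp / actual
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

def macro(values):
    """Unweighted mean over categories."""
    return sum(values) / len(values)

print(f"Accuracy:  {accuracy:.2%}")           # 76.67%
print(f"F1-score:  {macro(f1s):.2%}")         # 76.64%
print(f"Precision: {macro(precisions):.2%}")  # 76.79%
print(f"Recall:    {macro(recalls):.2%}")     # 76.67%
```

Because both categories have 15 test examples, a support-weighted average would give the same values here.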





## Build final model and predict example
Build the final model on all data and predict the category of an unseen example.

In [7]:
build_model(get_model(session), session, conf={
    "seed": 42,
    "epochs": 8,
    "batch_size": 32,
    "loss": "categorical_crossentropy",
})
predict("This is an article about gamers - people who love playing games", session)

[1m[33mInfo: [0mBuilding final model on all data took [34m45.94[0m sec
[1m[33mInfo: [0mExample is predicted as [32mProgramming (1)[0m
