<img src=https://raw.githubusercontent.com/superwise-ai/elemeta/cedb93e339a61b6920231cb27f463c7dc6cb9da4/docs/images/superwise_slemeta_white_b.png alt="Elemeta + Superwise">

This notebook is a quickstart for extracting meta-features from plain text, and enriching datasets with them, using the open-source Elemeta package. It also introduces Elemeta's two main use cases:
* Engineering new features from extracted meta-features to build improved models.
* Monitoring NLP use cases (here, with Superwise).
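
To make "meta-features" concrete, here is a minimal standard-library sketch that computes a handful of them by hand. The function `simple_metafeatures` is hypothetical and is not part of Elemeta. The feature names are borrowed from the enriched column list shown later in this notebook, but the regular expressions are assumptions, so the values may differ from what Elemeta computes.

```python
import re


def simple_metafeatures(text: str) -> dict:
    """Hand-rolled stand-ins for a few meta-features (not Elemeta's code)."""
    words = re.findall(r"\w+", text)
    unique_words = {w.lower() for w in words}
    return {
        "text_length": len(text),
        "word_count": len(words),
        "hashtag_count": len(re.findall(r"#\w+", text)),
        "mention_count": len(re.findall(r"@\w+", text)),
        # Share of distinct (case-insensitive) words among all words
        "unique_word_ratio": len(unique_words) / len(words) if words else 0.0,
    }


features = simple_metafeatures("Happy #cake day to my friend @someone, happy happy day!")
print(features)
```

Each value is a plain number describing the text rather than its meaning, which is what makes such features usable both as model inputs and as monitoring signals.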

The notebook includes:
* [Installation](#installation)
* [Monitor NLP with Superwise and Elemeta](#monitor)
    * [Simulation preparation](#simulation_preperation)
    * [Create a project](#create_project)
    * [Training pipeline](#training_pipeline)
    * [Inference pipeline](#inference_pipeline)
    * [Ground truth pipeline](#ground_truth_pipeline)
---

# <a name="installation"></a>Installation

---

## Install packages
Install Elemeta and the Superwise SDK.

In [None]:
!pip install elemeta
!pip install superwise

## Restart the kernel

After installing everything, restart the notebook kernel so it can find the packages.


In [None]:
import os

# Automatically restart kernel after installs
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
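
Once the kernel is back, a quick check like the following can confirm that the packages are visible to it. This is a sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package names are the ones passed to `pip install` above.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("elemeta", "superwise"):
    print(name, installed_version(name) or "not installed")
```

If either line prints "not installed", the install cell probably ran in a different environment than the one the kernel uses.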

## Imports
Import the relevant packages into the session.

In [None]:
from elemeta.nlp.runners.metafeature_extractors_runner import MetafeatureExtractorsRunner
from elemeta.dataset.dataset import get_tweets_likes
import pandas as pd
import numpy as np
from superwise import Superwise
from superwise.models.project import Project
from superwise.models.model import Model
from superwise.models.version import Version
from superwise.models.dataset import Dataset
from superwise.resources.superwise_enums import NotifyUpon, ScheduleCron


## Read data


Read the Twitter tweets dataset. We will use only three fields: the tweet text, its timestamp, and the number of likes (used later as the label for our prediction task).

In [None]:
df_full = get_tweets_likes()
df_origin = df_full[:200]

# <a name="monitor"></a>Monitor NLP with Superwise and Elemeta
This is a quickstart example of how Elemeta can be used to monitor NLP use cases with Superwise Model Observability. Please ensure that you have an active Superwise account, and if you don't have one, [please create one](https://portal.superwise.ai/account/sign-up).

---



## <a name="simulation_preperation"></a>Simulation preparation
We will split the original Twitter data into three parts to simulate training, inference, and ground truth data pipelines.

In [None]:
from sklearn.model_selection import train_test_split
if "id" not in df_full.columns:
  df_full = df_full.reset_index().rename({"index":"id"},axis=1)
X_train, X_inference, y_train, y_inference = train_test_split(df_full, df_full.loc[:,'number_of_likes'], test_size=0.33, random_state=42)

sample_size = 200
train_sampled = X_train[:sample_size]
inference_sampled = X_inference[:sample_size]
ground_truth_sampled = pd.DataFrame(y_inference[:sample_size]).assign(id=inference_sampled["id"])

print(X_train.shape)
print(X_inference.shape)
print(ground_truth_sampled.shape)

(35203, 4)
(17339, 4)
(200, 2)


## <a name="create_project"></a>Create a project
We will programmatically create a project and model using the Superwise SDK.

### Generate tokens
Enter your Superwise client ID and secret below. The [authentication docs](https://docs.superwise.ai/docs/authentication) explain how to generate or import them.

In [None]:
import os
os.environ['SUPERWISE_CLIENT_ID'] = '[CLIENT_ID]'
os.environ['SUPERWISE_SECRET'] = '[SECRET]'
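
Hard-coding credentials in a notebook makes them easy to leak. As an alternative, the sketch below reads them from the environment and fails early if a value is missing or still a bracketed placeholder. `require_env` is a hypothetical helper, not part of the Superwise SDK, and the demo values exist only so the example runs on its own.

```python
import os


def require_env(name: str) -> str:
    """Return environment variable `name`; raise if unset or left as a placeholder."""
    value = os.environ.get(name, "")
    if not value or (value.startswith("[") and value.endswith("]")):
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return value


# Demo values only; setdefault leaves any real credentials already exported untouched.
os.environ.setdefault("SUPERWISE_CLIENT_ID", "demo-client-id")
os.environ.setdefault("SUPERWISE_SECRET", "demo-secret")

client_id = require_env("SUPERWISE_CLIENT_ID")
secret = require_env("SUPERWISE_SECRET")
print("credentials present:", bool(client_id and secret))
```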

### Create a new project

In [None]:
sw = Superwise()
project = Project(
    name="My NLP Project",
    description="Natural Language Processing"
)

project = sw.project.create(project)
print(f"New project Created - {project.id}")

### Create a new model

In [None]:
nlp_model = Model(
    project_id=project.id,
    name="Tweeter Likes NLP Model",
    description="Regression model with simulated data"
)

nlp_model = sw.model.create(nlp_model)

## <a name="training_pipeline"></a>Training pipeline
In order to predict the number of likes per tweet, we will train a regression model and log the training data into Superwise after Elemeta enrichment.

### Train a new model
Based on the training dataset, build a simple regression model pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

from sklearn.linear_model import SGDRegressor

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('sgdr',SGDRegressor(max_iter=3000))
    ])
pipe.fit(X_train["content"],y_train)

Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.


### Log training data
Prepare the training dataset in the format Superwise expects, enrich it with Elemeta, and send it to Superwise.

In [None]:
train_sampled

| | id | content | date_time | number_of_likes |
|---|---|---|---|---|
| 10299 | 10299 | I sincerely enjoy this and every moment I get ... | 12/11/2014 12:43 | 4828 |
| 49940 | 49940 | Like. Love. Affection. Romance. A double tap. ... | 14/02/2016 20:41 | 1545 |
| 45822 | 45822 | With the opening of these 2 centers, @Movimien... | 03/07/2015 16:02 | 1661 |
| 52156 | 52156 | @WValderrama I know this won't mean anything t... | 23/05/2015 15:33 | 12645 |
| 44558 | 44558 | ありがとうございます Summersonic Tokyo!! だいすき！ | 15/08/2015 09:55 | 32410 |
| ... | ... | ... | ... | ... |
| 12272 | 12272 | @__glitterDICK Happy #cake day to my glitter d... | 28/12/2012 18:31 | 224 |
| 17036 | 17036 | A night of magical moments, rocknroll and too ... | 05/05/2015 17:30 | 5858 |
| 764 | 764 | When you could go anywhere for your bday dinne... | 27/10/2015 19:37 | 9421 |
| 48415 | 48415 | Mmm...pasta. 🍝 Worldwide InstaMeet is coming s... | 10/09/2016 15:00 | 916 |


##### Preparation for Superwise format and Enrichment


In [None]:
train_sampled["predicted_number_of_likes"] = pipe.predict(train_sampled["content"]).astype(int)

# Enrich the training dataset with Elemeta
metafeature_extractors_runner = MetafeatureExtractorsRunner()
print("The original dataset had {} columns".format(train_sampled.shape[1]))

# The enrichment process
print("Processing...")
train_sampled = metafeature_extractors_runner.run_on_dataframe(dataframe=train_sampled,text_column='content')
print("The transformed dataset has {} columns".format(train_sampled.shape[1]))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


The original dataset had 5 columns
Processing...
The transformed dataset has 31 columns
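
The `SettingWithCopyWarning` in the output above appears because `train_sampled` was created by slicing `X_train`, and a new column was then assigned to that slice. One common way to avoid it is to take an explicit copy before adding columns. The sketch below uses a tiny made-up frame rather than the tweets data, and assumes pandas is installed.

```python
import pandas as pd

# Toy stand-in for X_train; the values are invented for illustration
df = pd.DataFrame({"content": ["a", "b", "c"], "number_of_likes": [1, 2, 3]})

# .copy() gives the sample its own data, so the assignment below is unambiguous
sample = df[:2].copy()
sample["predicted_number_of_likes"] = [10, 20]

print(list(df.columns))      # the parent frame is unchanged
print(list(sample.columns))  # the sample has the new column
```

Applied to this notebook, that would mean calling `.copy()` on the slices of `X_train` and `X_inference` when the samples are created.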


In [None]:
train_sampled.columns

Index(['id', 'content', 'date_time', 'number_of_likes',
       'predicted_number_of_likes', 'detect_langauge', 'emoji_count',
       'text_complexity', 'unique_word_ratio', 'unique_word_count',
       'word_regex_matches_count', 'number_count', 'out_of_vocabulary_count',
       'must_appear_words_ratio', 'sentence_count', 'sentence_avg_length',
       'word_count', 'avg_word_length', 'text_length', 'stop_words_count',
       'punctuation_count', 'special_chars_count', 'capital_letters_ratio',
       'regex_match_count', 'email_count', 'link_count', 'hashtag_count',
       'mention_count', 'syllable_count', 'acronym_count', 'date_count'],
      dtype='object')

In [None]:
from superwise.models.dataset import Dataset
from superwise.resources.superwise_enums import DataEntityRole,FeatureType


dataset = Dataset.generate_dataset_from_dataframe(
    name="Tweeter Likes Dataset",
    project_id=project.id,
    dataframe=train_sampled,
    roles={
        DataEntityRole.METADATA.value: ["content"],
        DataEntityRole.PREDICTION_VALUE.value: ["predicted_number_of_likes"],
        DataEntityRole.TIMESTAMP.value: "date_time",
        DataEntityRole.LABEL.value: ["number_of_likes"],
        DataEntityRole.ID.value: "id",
    },
)

# Create the dataset in Superwise, may take some time to process
dataset = sw.dataset.create(dataset)

new_version = Version(
    model_id=nlp_model.id,
    name="1.0.0",
    dataset_id=dataset.id
)

new_version = sw.version.create(new_version)
sw.version.activate(new_version.id)

## <a name="inference_pipeline"></a>Inference pipeline
Produce model inference predictions and log them to Superwise for monitoring. Inference logs will be sent in batches once Elemeta has enriched them.
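
The inference cell in this section also backdates the records so that they look like roughly a month of traffic: each row's position is integer-divided by `rows // 30` to get a day offset, which is subtracted from the current hour. The standard-library sketch below reproduces that arithmetic for 200 records. The names `n_records`, `rows_per_day`, and `day_offsets` are illustrative only.

```python
from datetime import datetime, timedelta

n_records = 200                  # sample_size used in this notebook
rows_per_day = n_records // 30   # integer division: 6 rows share each day offset

# Floor the current time to the hour, as pd.Timestamp.now().floor('h') does
now = datetime.now().replace(minute=0, second=0, microsecond=0)

day_offsets = [i // rows_per_day for i in range(n_records)]
timestamps = [now - timedelta(days=d) for d in day_offsets]

print(rows_per_day)          # 6
print(max(day_offsets))      # 33
print(len(set(timestamps)))  # 34
```

Because `200 // 30` rounds down to 6, the records span 34 distinct days rather than exactly 30.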

In [None]:
inference_sampled.loc[:,"predicted_number_of_likes"] = pipe.predict(inference_sampled["content"]).astype(int)

# prep for Superwise format
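# spread the records over past days; pd.to_timedelta avoids the `unit` argument
# of the TimedeltaIndex constructor, which newer pandas releases deprecate
prediction_time_vector = pd.Timestamp.now().floor('h') - \
    pd.to_timedelta(inference_sampled.reset_index(drop=True).index // int(inference_sampled.shape[0] // 30), unit='D')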

ongoing_predictions = inference_sampled.assign(
    date_time=prediction_time_vector,
)

#util function 
def chunks(df, n):
    """Yield successive n-sized chunks from df."""
    for i in range(0, df.shape[0], n):
        yield df[i:i + n]

# break the inference data into chunks
ongoing_predictions_chunks = chunks(ongoing_predictions, 50) # batches of 50

transaction_ids = list()
# for each chunk
for ongoing_predictions_chunk in ongoing_predictions_chunks:
    
    # enrich with Elemeta
    ongoing_predictions_chunk = metafeature_extractors_runner.run_on_dataframe(dataframe=ongoing_predictions_chunk,text_column="content")
    
    # send to Superwise
    transaction_id = sw.transaction.log_records(
        model_id=nlp_model.id, 
        version_id=new_version.id, 
        records=ongoing_predictions_chunk.to_dict(orient="records")
    )
    transaction_ids.append(transaction_id)
    print(transaction_id)
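
The `chunks` helper in the cell above is a generator that slices its input into fixed-size batches. The sketch below applies the same logic to a plain list so the batch sizes are easy to inspect. It uses `len(seq)` where the notebook uses `df.shape[0]`, and the record count of 205 is made up to show a short final batch.

```python
def chunks(seq, n):
    """Yield successive n-sized slices of seq; the last one may be shorter."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


records = list(range(205))
batch_sizes = [len(batch) for batch in chunks(records, 50)]
print(batch_sizes)  # [50, 50, 50, 50, 5]
```

Since a generator can be consumed only once, iterating over the batches a second time requires calling `chunks(...)` again.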

## <a name="ground_truth_pipeline"></a>Ground truth pipeline
Simulate ground truth collection and log it to Superwise for monitoring.

In [None]:
# prep for Superwise format
# ground_truth_sampled already holds the id and label columns, so no reshaping is needed
ongoing_labels = ground_truth_sampled.copy()

# break the label data into chunks
ongoing_labels_chunks = chunks(ongoing_labels, 50)

transaction_ids = list()
# for each chunk
for ongoing_labels_chunk in ongoing_labels_chunks:
    # send to Superwise
    transaction_id = sw.transaction.log_records(
        model_id=nlp_model.id, 
        version_id=new_version.id, 
        records=ongoing_labels_chunk.to_dict(orient="records")
    )
    transaction_ids.append(transaction_id)
    print(transaction_id)
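
The ground truth records carry only the label and the `id` of the matching inference record, so the `id` values must line up for a label to be paired with its prediction. The sketch below shows that pairing on invented data and computes a mean absolute error. It illustrates the idea only, and is not a description of how Superwise processes the records internally.

```python
# Invented example records; field names follow the columns used in this notebook
predictions = [
    {"id": 1, "predicted_number_of_likes": 120},
    {"id": 2, "predicted_number_of_likes": 4000},
    {"id": 3, "predicted_number_of_likes": 900},  # no label has arrived yet
]
labels = [
    {"id": 2, "number_of_likes": 3500},
    {"id": 1, "number_of_likes": 100},
]

# Index labels by id, then pair each prediction with its label when one exists
labels_by_id = {row["id"]: row["number_of_likes"] for row in labels}
matched = [
    (p["predicted_number_of_likes"], labels_by_id[p["id"]])
    for p in predictions
    if p["id"] in labels_by_id
]

mae = sum(abs(pred - actual) for pred, actual in matched) / len(matched)
print(len(matched), mae)  # 2 260.0
```

Predictions without a label yet (id 3 here) are simply left out of the error calculation.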