# Tagging Pipeline - A DataFrames NLP API

The `tagging_utils.tagging_utils` API is essentially a decorator framework built on top of `transformers.pipelines` from Hugging Face. The functionality added in `tagging_utils` lets the user run a pipeline over a DataFrame that contains one or more columns of text data.
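
To make the idea concrete, here is a minimal sketch of how a DataFrame wrapper around a text pipeline might look. It is not the `tagging_utils` implementation: `tag_columns` and `toy_sentiment` are hypothetical names, and the toy tagger stands in for a real `transformers` pipeline so the snippet runs without downloading a model. The `<column>_<suffix>` output naming mirrors the column names used later in this notebook.

```python
from typing import Callable, List, Union

import pandas as pd


def tag_columns(
    df: pd.DataFrame,
    columns: Union[str, List[str]],
    tagger: Callable[[List[str]], list],
    suffix: str,
) -> pd.DataFrame:
    """Run `tagger` over each text column and store results in `<column>_<suffix>`."""
    if isinstance(columns, str):
        columns = [columns]
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        # Missing values become empty strings so the tagger always sees text.
        texts = out[col].fillna("").astype(str).tolist()
        out[f"{col}_{suffix}"] = tagger(texts)
    return out


def toy_sentiment(texts: List[str]) -> list:
    """Hypothetical stand-in for a real pipeline: returns one label per input text."""
    return ["POSITIVE" if "good" in t.lower() else "NEGATIVE" for t in texts]


demo = pd.DataFrame({"body": ["Good earnings call", "This stock is tanking", None]})
tagged = tag_columns(demo, "body", toy_sentiment, suffix="sentiment")
print(tagged[["body", "body_sentiment"]])
```

Passing a list such as `['title', 'selftext']` as `columns` would add one tag column per text column, which is the shape of result the multi-column cell later in this notebook selects.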

In [1]:
import sys
sys.path.append('..')
from nlp_utils import tagging_utils as tu
import pandas as pd


## Sample Data 1 - Reddit Submissions

In [None]:
data = pd.read_csv('../scraped_data/wsb_comments.csv')
data.head()

In [None]:
data.isna().sum()

In [None]:
data['body'].str.len().plot.hist()

Very long texts have to be clipped because the underlying models accept only a limited number of tokens. In the future I hope to add the ability to chunk long texts dynamically and return per-chunk results.
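
As a rough illustration of what such chunking could look like, the sketch below splits each text into fixed-size character windows and produces one row per chunk. This is an assumption about one possible approach, not a feature of `tagging_utils`: `chunk_text` is a hypothetical helper, and character counts are only a proxy for token counts, so a real implementation would more likely split on tokenizer output.

```python
from typing import List

import pandas as pd


def chunk_text(text: str, size: int = 500) -> List[str]:
    """Split `text` into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


demo = pd.DataFrame({"id": [1, 2], "body": ["a" * 1200, "short comment"]})

# One row per chunk; the original index is repeated so chunks can be regrouped.
chunked = demo.assign(body=demo["body"].map(chunk_text)).explode("body")

# Number the chunks within each original row (0, 1, 2, ...).
chunked["chunk"] = chunked.groupby(level=0).cumcount()

print(chunked.assign(n_chars=chunked["body"].str.len())[["id", "chunk", "n_chars"]])
```

Each chunk could then be tagged separately and the per-chunk results grouped back by the original index.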

In [None]:
long = data[data['body'].str.len() > 1500]

In [None]:
data['body'] = data['body'].str.slice(0, 500)

## Summarization Pipeline

In [None]:
summarization = tu.tagging_pipline('summarization')
summ_df = summarization(long.sample(100), 'body')
summ_df['body_summarization'].head()

## Sentiment Analysis Pipeline

In [None]:
sentiment_pipe = tu.tagging_pipline('sentiment-analysis')  # further arguments: model, config, tokenizer, framework, **kwargs
sentiment_pipe.model

In [None]:
sentiment_tags = sentiment_pipe(long.sample(100), 'body')  # further arguments: tag_suffix: Optional[str], file: Optional[str], *args, **kwargs
sentiment_tags['body_sentiment']

In [None]:
many_sentiment_tags = sentiment_pipe(data, ['title', 'selftext'])
many_sentiment_tags[['title', 'selftext', 'title_sentiment', 'selftext_sentiment']]

In [None]:
ner_pipeline = tu.tagging_pipline('ner')
ner_pipeline.model

## Sample Data 2 - Reddit Comments

In [None]:
data = pd.read_csv('../comments_stream.csv')

In [None]:
ner_tags = ner_pipeline(data, 'body')
ner_tags[['body', 'body_nertag']]

In [None]:
featextr_pipe = tu.tagging_pipline('feature-extraction')
featextr_pipe.model

In [None]:
extraction_tags = featextr_pipe(data, 'body')
extraction_tags[['body', 'body_extracted']]

## Not Shown

* question-answering
* fill-mask
* translation_en_to_fr
* translation_en_to_de
* translation_en_to_ro
* text-generation