## Quick start

You can use a wide variety of out-of-the-box metadata extractors found in the library (the full list can be found here).

In [None]:
!pip install elemeta > /dev/null

##### Sentiment Polarity
* Polarity ranges over [-1, 1].
* -1 indicates a negative sentiment and 1 indicates a positive sentiment.
* Negation words reverse the polarity.

In [1]:
from elemeta.nlp.extractors.high_level.sentiment_polarity import SentimentPolarity

sp = SentimentPolarity()
print(sp("I love Superwise"))
print(sp("I hate haters"))
print(sp("This is not a super happy excited sentence"))

0.6369
-0.7845
-0.5337
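
Raw polarity scores are often easier to act on once they are bucketed into labels. The sketch below is a hypothetical helper, not part of elemeta: the name `polarity_label` and the 0.05 neutral band are assumptions chosen for illustration. It is applied to the three scores printed above.

```python
def polarity_label(score: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse label.

    `neutral_band` is an assumed cutoff; tune it for your own data.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"polarity must be in [-1, 1], got {score}")
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


# Scores copied from the output above.
scores = [0.6369, -0.7845, -0.5337]
labels = [polarity_label(s) for s in scores]
print(labels)
```

A wider band sends more borderline texts to `neutral`, which may be preferable when the labels feed an alerting rule.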


##### Detect Language
This extractor will automatically detect the language of the text.

In [2]:
from elemeta.nlp.extractors.high_level.detect_langauge_langdetect import DetectLangauge

ld = DetectLangauge()
print(ld("This text is in English"))
print(ld("הטקסט הזה בעברית"))
print(ld("Ce texte est en français"))
print(ld("这段文字是法语"))

en
he
fr
zh-cn
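
Some of the returned codes carry a region suffix (`zh-cn` above). When grouping texts by language, it can help to normalise the codes first. The following is a small sketch using only the standard library; `base_language` is a hypothetical helper and the list of codes is copied from the output above.

```python
from collections import Counter


def base_language(code: str) -> str:
    """Drop a region suffix, e.g. 'zh-cn' -> 'zh'; 'en' stays 'en'."""
    return code.split("-", 1)[0].lower()


# Language codes copied from the output above.
detected = ["en", "he", "fr", "zh-cn"]
counts = Counter(base_language(code) for code in detected)
print(counts)
```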


### Enrichment Suite

We can use `MetadataExtractorsRunner` to compute multiple metadata values at once.
We supply a list of the metadata extractors we want to run and get back a runner that can be applied to a text to produce a dictionary of metadata values.
To run all the extractors on a text, use the runner function `run(text: str) -> Dict[str, Union[str, float, int]]`.

In [3]:
from elemeta.nlp.metadata_extractor_runner import MetadataExtractorsRunner

metadata_extractor_runner = MetadataExtractorsRunner(metadata_extractors=[sp,ld])
metadata_extractor_runner.run("This is a text about how good life is :)")

{'sentiment_polarity': 0.7096, 'detect_langauge': 'en'}
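
To make the runner idea concrete, here is a conceptual sketch of the pattern in plain Python. It is not elemeta's implementation: `run_extractors`, the `(name, function)` pairs, and the two toy extractors are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple, Union

Value = Union[str, float, int]
# Simplification: each extractor is modelled as a (name, function) pair.
Extractor = Tuple[str, Callable[[str], Value]]


def run_extractors(text: str, extractors: List[Extractor]) -> Dict[str, Value]:
    """Apply every extractor to the text and key each result by extractor name."""
    return {name: fn(text) for name, fn in extractors}


toy_extractors: List[Extractor] = [
    ("text_length", len),
    ("whitespace_token_count", lambda t: len(t.split())),
]

result = run_extractors("This is a text about how good life is :)", toy_extractors)
print(result)
```

Keying results by extractor name is what lets the output be merged into a table later, one column per extractor.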

If no metadata extractors are supplied, a default set of extractors is used.

In [4]:
from elemeta.nlp.metadata_extractor_runner import MetadataExtractorsRunner

metadata_extractor_runner = MetadataExtractorsRunner()
metadata_extractor_runner.run("This is a text about how good life is :)")

{'emoji_count': 0,
 'text_complexity': 113.1,
 'unique_word_ratio': 0.875,
 'unique_word_count': 7,
 'word_regex_matches_count': 11,
 'number_count': 0,
 'out_of_vocabulary_count': 2,
 'must_appear_words_percentage': 0,
 'sentence_count': 1,
 'sentence_avg_length': 40.0,
 'word_count': 9,
 'avg_word_length': 3.2222222222222223,
 'text_length': 40,
 'stop_words_count': 6,
 'punctuation_count': 2,
 'special_chars_count': 0,
 'capital_letters_ratio': 0.034482758620689655,
 'regex_match_count': 1,
 'email_count': 0,
 'link_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'syllable_count': 9,
 'acronym_count': 0,
 'date_count': 0}

To add a new `MetadataExtractor` to an existing `MetadataExtractorsRunner`, use
`add_metadata_extractor(metadata_extractor: AbstractMetadataExtractor) -> None`

In [5]:
from elemeta.nlp.extractors.high_level.regex_match_count import RegexMatchCount

number_of_good_in_text = RegexMatchCount(name="number_of_good_in_text",regex="good|Good")
metadata_extractor_runner.add_metadata_extractor(number_of_good_in_text)
metadata_extractor_runner.run("This is a text about how good life is :)")

{'emoji_count': 0,
 'text_complexity': 113.1,
 'unique_word_ratio': 0.875,
 'unique_word_count': 7,
 'word_regex_matches_count': 11,
 'number_count': 0,
 'out_of_vocabulary_count': 2,
 'must_appear_words_percentage': 0,
 'sentence_count': 1,
 'sentence_avg_length': 40.0,
 'word_count': 9,
 'avg_word_length': 3.2222222222222223,
 'text_length': 40,
 'stop_words_count': 6,
 'punctuation_count': 2,
 'special_chars_count': 0,
 'capital_letters_ratio': 0.034482758620689655,
 'regex_match_count': 1,
 'email_count': 0,
 'link_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'syllable_count': 9,
 'acronym_count': 0,
 'date_count': 0,
 'number_of_good_in_text': 1}
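
The pattern `good|Good` is worth a closer look before relying on the count. The sketch below uses Python's `re` module directly rather than elemeta, so it only illustrates how the pattern itself behaves; `count_matches` is a hypothetical helper, and the whole-word, case-insensitive variant is an assumed alternative, not something the library prescribes.

```python
import re


def count_matches(pattern: str, text: str, flags: int = 0) -> int:
    """Count non-overlapping matches of `pattern` in `text`."""
    return len(re.findall(pattern, text, flags))


text = "This is a text about how good life is :)"
print(count_matches("good|Good", text))

# The plain alternation also matches inside longer words and misses "GOOD".
print(count_matches("good|Good", "goodbye"))
print(count_matches("good|Good", "GOOD, Good, good"))

# Assumed alternative: whole words only, any casing.
print(count_matches(r"\bgood\b", "goodbye", re.IGNORECASE))
print(count_matches(r"\bgood\b", "GOOD, Good, good", re.IGNORECASE))
```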

To run the extractors on a text column of a dataframe, use `run_on_dataframe(dataframe: DataFrame, text_column: str) -> DataFrame`.
Given a dataframe and the name of its text column, the function returns a new dataframe with all the metadata values added as new columns.

In [6]:
from elemeta.dataset.dataset import get_imdb_reviews
reviews = get_imdb_reviews()[:200]
print("The original dataset had {} columns".format(reviews.shape[1]))

# The enrichment process
print("Processing...")

reviews = metadata_extractor_runner.run_on_dataframe(dataframe=reviews,text_column='review')
print("The transformed dataset has {} columns".format(reviews.shape[1]))

The original dataset had 2 columns
Processing...
The transformed dataset has 28 columns
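
The column count grows by one per extractor. The same idea can be sketched without pandas by treating each row as a dictionary. This is a toy stand-in, not how elemeta works internally: `enrich_rows`, the two extractors, and the sample rows are all hypothetical.

```python
from typing import Callable, Dict, List, Union

Value = Union[str, float, int]


def enrich_rows(
    rows: List[Dict[str, Value]],
    text_column: str,
    extractors: Dict[str, Callable[[str], Value]],
) -> List[Dict[str, Value]]:
    """Return copies of the rows with one extra key per extractor."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        text = str(row[text_column])
        for name, fn in extractors.items():
            new_row[name] = fn(text)
        enriched.append(new_row)
    return enriched


# Toy rows for illustration only; not taken from the IMDB dataset.
rows = [
    {"review": "Great film, great cast!", "sentiment": "positive"},
    {"review": "Too long.", "sentiment": "negative"},
]
toy_extractors = {
    "text_length": len,
    "exclamation_count": lambda t: t.count("!"),
}

enriched = enrich_rows(rows, "review", toy_extractors)
print("columns before:", len(rows[0]), "after:", len(enriched[0]))
```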


In [7]:
reviews

| | review | sentiment | emoji_count | text_complexity | unique_word_ratio | unique_word_count | word_regex_matches_count | number_count | out_of_vocabulary_count | must_appear_words_percentage | ... | capital_letters_ratio | regex_match_count | email_count | link_count | hashtag_count | mention_count | syllable_count | acronym_count | date_count | number_of_good_in_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 0 | 70.02 | 0.726829 | 149 | 380 | 1 | 123 | 0 | ... | 0.031250 | 1 | 0 | 0 | 0 | 0 | 419 | 14 | 2 | 0 |
| 1 | A wonderful little production. <br /><br />The... | positive | 0 | 56.49 | 0.739130 | 85 | 201 | 0 | 75 | 0 | ... | 0.020177 | 1 | 0 | 0 | 0 | 0 | 251 | 4 | 0 | 0 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 0 | 60.28 | 0.776000 | 97 | 205 | 1 | 58 | 0 | ... | 0.031944 | 1 | 0 | 0 | 0 | 0 | 229 | 3 | 3 | 0 |
| 3 | Basically there's a family where a little boy ... | negative | 0 | 74.69 | 0.694737 | 66 | 175 | 2 | 64 | 0 | ... | 0.042403 | 1 | 0 | 0 | 0 | 0 | 183 | 5 | 1 | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 0 | 70.43 | 0.784314 | 120 | 283 | 0 | 93 | 0 | ... | 0.032946 | 1 | 0 | 0 | 0 | 0 | 327 | 3 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | Phantasm ....Class. Phantasm II.....awesome. P... | negative | 0 | 74.93 | 0.780142 | 110 | 247 | 3 | 106 | 0 | ... | 0.052632 | 1 | 0 | 0 | 0 | 0 | 224 | 5 | 2 | 1 |
| 196 | Ludicrous. Angelic 9-year-old Annakin turns in... | negative | 0 | 71.65 | 0.773973 | 113 | 264 | 2 | 104 | 0 | ... | 0.026578 | 1 | 0 | 0 | 0 | 0 | 287 | 4 | 5 | 0 |
| 197 | Scotty (Grant Cramer, who would go on to star ... | negative | 0 | 76.86 | 0.789855 | 109 | 233 | 1 | 92 | 0 | ... | 0.038835 | 1 | 0 | 0 | 0 | 0 | 223 | 4 | 1 | 1 |
| 198 | If you keep rigid historical perspective out o... | positive | 0 | 65.25 | 0.716783 | 205 | 606 | 1 | 163 | 0 | ... | 0.038009 | 1 | 0 | 0 | 0 | 0 | 701 | 13 | 0 | 1 |
