<img src=https://raw.githubusercontent.com/superwise-ai/elemeta/54825ed11287ac69d809a4590749d6b63404dd1d/docs/images/elemeta_cover_image_white_narrow.png alt="Elemeta">

## Quick start

Elemeta provides a wide variety of out-of-the-box metafeature extractors (the full list can be found here).

In [None]:
!pip install elemeta

##### Sentiment Polarity
* Polarity ranges from -1 to 1.
* -1 indicates a negative sentiment and 1 indicates a positive sentiment.
* Negation words reverse the polarity.

In [2]:
from elemeta.nlp.extractors.high_level.sentiment_polarity import SentimentPolarity

sp = SentimentPolarity()
print(sp("I love Superwise"))
print(sp("I hate haters"))
print(sp("This is not a super happy excited sentence"))

0.6369
-0.7845
-0.5337
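
The negation rule above can be illustrated with a toy scorer. This is only a sketch: the lexicon, the negation list, and the function name `toy_polarity` are made up for illustration and are not how `SentimentPolarity` is implemented.

```python
# Toy illustration only: a hand-made lexicon, not the model Elemeta uses.
TOY_LEXICON = {"love": 0.6, "happy": 0.5, "excited": 0.4, "hate": -0.6}
NEGATIONS = {"not", "never", "no"}


def toy_polarity(text: str) -> float:
    """Sum word scores, flipping the sign of every word after a negation."""
    sign = 1.0
    total = 0.0
    for word in text.lower().split():
        if word in NEGATIONS:
            sign = -sign  # a negation word reverses what follows
            continue
        total += sign * TOY_LEXICON.get(word, 0.0)
    # Clamp to the documented [-1, 1] range.
    return max(-1.0, min(1.0, total))


print(toy_polarity("This is a happy excited sentence"))
print(toy_polarity("This is not a happy excited sentence"))
```

The two sentences differ only by the word "not", and the toy score changes sign accordingly.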


##### Detect Language
This extractor will automatically detect the language of the text.

In [3]:
from elemeta.nlp.extractors.high_level.detect_langauge_langdetect import DetectLanguage

ld = DetectLanguage()
print(ld("This text is in English"))
print(ld("הטקסט הזה בעברית"))
print(ld("Ce texte est en français"))
print(ld("这段文字是法语"))

en
he
fr
zh-cn


### Enrichment Suite

`MetafeatureExtractorsRunner` computes multiple metafeature values at once.
We supply a list of the metafeature extractors we want to run and get back a runner that can be applied to a text, returning a dictionary of metafeature values keyed by extractor name.
To run all the extractors on a text, use the runner function `run(text: str) -> Dict[str, Union[str, float, int]]`.

In [4]:
from elemeta.nlp.runners.metafeature_extractors_runner import MetafeatureExtractorsRunner

metafeature_extractors_runner = MetafeatureExtractorsRunner(metafeature_extractors=[sp,ld])
metafeature_extractors_runner.run("This is a text about how good life is :)")

{'sentiment_polarity': 0.7096, 'detect_langauge': 'en'}
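
To make the runner idea concrete, here is a minimal standard-library sketch of the same pattern: several named extractors applied to one text, with the results collected in a dictionary. `ToyRunner`, `NamedExtractor`, and the two toy extractors are hypothetical names used only for illustration; this is not the code behind `MetafeatureExtractorsRunner`.

```python
from typing import Callable, Dict, List, Tuple, Union

Value = Union[str, float, int]
# Hypothetical stand-in for an extractor: a (name, function) pair.
NamedExtractor = Tuple[str, Callable[[str], Value]]


class ToyRunner:
    """Sketch of the runner idea; not Elemeta's MetafeatureExtractorsRunner."""

    def __init__(self, extractors: List[NamedExtractor]) -> None:
        self.extractors = list(extractors)

    def run(self, text: str) -> Dict[str, Value]:
        # Apply each extractor to the same text and key the result by name.
        return {name: extract(text) for name, extract in self.extractors}


toy_runner = ToyRunner([
    ("whitespace_token_count", lambda text: len(text.split())),
    ("text_length", len),
])
print(toy_runner.run("This is a text about how good life is :)"))
```

Keying results by extractor name is what lets the output be turned into dataframe columns later on.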

If no metafeature extractors are supplied, a default set of extractors is used.

In [5]:
from elemeta.nlp.runners.metafeature_extractors_runner import MetafeatureExtractorsRunner

metafeature_extractors_runner = MetafeatureExtractorsRunner()
metafeature_extractors_runner.run("This is a text about how good life is :)")

{'emoji_count': 0,
 'text_complexity': 113.1,
 'unique_word_ratio': 0.875,
 'unique_word_count': 7,
 'word_regex_matches_count': 11,
 'number_count': 0,
 'out_of_vocabulary_count': 0,
 'must_appear_words_percentage': 0,
 'sentence_count': 1,
 'sentence_avg_length': 40.0,
 'word_count': 9,
 'avg_word_length': 3.2222222222222223,
 'text_length': 40,
 'stop_words_count': 6,
 'punctuation_count': 2,
 'special_chars_count': 0,
 'capital_letters_ratio': 0.034482758620689655,
 'regex_match_count': 1,
 'email_count': 0,
 'link_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'syllable_count': 9,
 'acronym_count': 0,
 'date_count': 0}

To add a new `MetafeatureExtractor` to an existing `MetafeatureExtractorsRunner`, use
`add_metafeature_extractor(metafeature_extractor: AbstractTextMetafeatureExtractor) -> None`

In [6]:
from elemeta.nlp.extractors.high_level.regex_match_count import RegexMatchCount

number_of_good_in_text = RegexMatchCount(name="number_of_good_in_text",regex="good|Good")
metafeature_extractors_runner.add_metafeature_extractor(number_of_good_in_text)
metafeature_extractors_runner.run("This is a text about how good life is :)")

{'emoji_count': 0,
 'text_complexity': 113.1,
 'unique_word_ratio': 0.875,
 'unique_word_count': 7,
 'word_regex_matches_count': 11,
 'number_count': 0,
 'out_of_vocabulary_count': 0,
 'must_appear_words_percentage': 0,
 'sentence_count': 1,
 'sentence_avg_length': 40.0,
 'word_count': 9,
 'avg_word_length': 3.2222222222222223,
 'text_length': 40,
 'stop_words_count': 6,
 'punctuation_count': 2,
 'special_chars_count': 0,
 'capital_letters_ratio': 0.034482758620689655,
 'regex_match_count': 1,
 'email_count': 0,
 'link_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'syllable_count': 9,
 'acronym_count': 0,
 'date_count': 0,
 'number_of_good_in_text': 1}
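
As a sanity check on the `number_of_good_in_text` value, the same count can be reproduced with Python's `re` module. This sketch assumes the metafeature is the number of non-overlapping matches of the pattern, which agrees with the output above but is not confirmed by this notebook; `count_matches` is a hypothetical helper.

```python
import re


def count_matches(text: str, regex: str) -> int:
    """Count non-overlapping matches of `regex` in `text`."""
    # Assumption: this mirrors what a regex-count metafeature reports.
    return len(re.findall(regex, text))


print(count_matches("This is a text about how good life is :)", "good|Good"))
```

Note that the pattern `good|Good` does not match other capitalizations such as "GOOD"; with plain `re`, the `re.IGNORECASE` flag would cover those.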

To run the extractors on a text column of a dataframe, use `run_on_dataframe(dataframe: DataFrame, text_column: str) -> DataFrame`.
Given a dataframe and the name of its text column, this function returns a new dataframe with all the metafeature values added as new columns.

In [None]:
from elemeta.dataset.dataset import get_avengers_endgame_tweets
tweets = get_avengers_endgame_tweets()
print("The original dataset had {} columns".format(tweets.shape[1]))

# The enrichment process
print("Processing...")

# Running on all the data should take around a minute
tweets = metafeature_extractors_runner.run_on_dataframe(dataframe=tweets,text_column='text')
print("The transformed dataset has {} columns".format(tweets.shape[1]))
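
The enrichment step can be pictured as a row-wise loop: read the text column, run every extractor on it, and attach each result as a new column. The sketch below uses plain dictionaries instead of a dataframe so it runs without extra dependencies. `enrich_rows`, the toy extractors, and the two sample rows are made up for illustration and are not part of Elemeta or of the tweets dataset.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def enrich_rows(rows: List[Row], text_column: str,
                extractors: Dict[str, Callable[[str], object]]) -> List[Row]:
    """Return new rows that keep the original columns and add one per extractor."""
    enriched = []
    for row in rows:
        text = str(row[text_column])
        new_row = dict(row)  # copy so the input rows stay unchanged
        for name, extract in extractors.items():
            new_row[name] = extract(text)
        enriched.append(new_row)
    return enriched


# Made-up sample rows, standing in for a dataframe with a 'text' column.
rows = [{"id": 1, "text": "I love #AvengersEndgame"},
        {"id": 2, "text": "no spoilers please"}]
toy_extractors = {"text_length": len, "hashtag_count": lambda t: t.count("#")}
result = enrich_rows(rows, "text", toy_extractors)
print(len(rows[0]), "->", len(result[0]), "columns")
```

Each extractor contributes exactly one new column, so the column count grows by the number of extractors in the runner.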

In [8]:
tweets

Unnamed: 0,id,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id.1,replyToUID,...,capital_letters_ratio,regex_match_count,email_count,link_count,hashtag_count,mention_count,syllable_count,acronym_count,date_count,number_of_good_in_text
0,1,RT @mrvelstan: literally nobody:\r\nme:\r\n\r\...,False,0,,2019-04-23 10:43:30,False,,1120639328034676737,,...,0.135593,4,0,0,1,1,16,1,0,0
1,2,"RT @agntecarter: iï¿½m emotional, sorry!!\r\n\...",False,0,,2019-04-23 10:43:30,False,,1120639325199196160,,...,0.056338,5,0,0,2,1,25,1,2,0
2,3,saving these bingo cards for tomorrow \r\nï¿½\...,False,0,,2019-04-23 10:43:30,False,,1120639324683292674,,...,0.062500,3,0,0,1,0,16,1,0,0
3,4,RT @HelloBoon: Man these #AvengersEndgame ads ...,False,0,,2019-04-23 10:43:29,False,,1120639323328540672,,...,0.166667,1,0,0,1,1,18,2,0,0
4,5,"RT @Marvel: We salute you, @ChrisEvans! #Capta...",False,0,,2019-04-23 10:43:29,False,,1120639321571074048,,...,0.197368,1,0,0,2,2,18,2,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,14996,RT @natsdany: First time Last...,False,0,,2019-04-23 09:22:03,False,,1120618828918951937,,...,0.068966,5,0,0,1,1,16,1,0,0
14996,14997,RT @MTVNEWS: The #AvengersEndgame cast has see...,False,0,,2019-04-23 09:22:03,False,,1120618828038311936,,...,0.137615,1,0,0,1,2,30,2,0,0
14997,14998,@SPICinemas kindly announce the approximate ti...,False,0,SPICinemas,2019-04-23 09:22:02,False,,1120618823667920896,919079586.0,...,0.084034,2,0,0,2,2,29,0,0,0
14998,14999,"RT @Marvel: We salute you, @ChrisEvans! #Capta...",False,0,,2019-04-23 09:22:02,False,,1120618823600803840,,...,0.197368,1,0,0,2,2,18,2,0,0
