## 1. Basics of Textual Features

In supervised learning, for example in classification tasks, our goal is usually to find the best parametrized model in its class: <br><br> $A(X, \hat{w}) \simeq f(X), \quad \hat{w} = \operatorname*{arg\,min}_w \left\|A(X, w) - f(X)\right\|$

where $X \in \mathbb{R}^{n\times m}$ is the feature matrix ($n$ observations with $m$ features), $w \in \mathbb{R}^{m}$ is the vector of model parameters, and $\hat{w}$ denotes the "best" model parameters.

However, all we have as a candidate for $X$ <strong>is raw text input, and algorithms can't use it as is</strong>.

In order to apply machine learning on textual data, we first need to transform such content into some numerical format (to form feature vectors). 

In Natural Language Processing, automated feature extraction can be achieved in many ways <strong>(bag-of-words, word embeddings, graph-based representations etc.)</strong>

Today, we will dive into the details of the <strong>bag-of-words</strong> approach and the methods built on top of it in the scikit-learn library.

## 2. Bag-of-Words Approach

### 2.1 Intuition Behind the Model. Word Counters.

In the bag-of-words approach we work under the following assumptions:
* The text can be analyzed without taking into account the word/token order 
* We only need to know which words/tokens the text consists of and how many times each of them occurs 
* The more often a word/token appears in a document, the more important it is 

More formally, given the collection of texts $T_1, T_2, ... , T_n$, we extract unique tokens $w_1, w_2, ..., w_m$ to form a dictionary.

Thus, each text $T_i$ is represented by a feature vector $F_i = \{x_{ij},\ j \in [1,m]\}$, where $x_{ij}$ is the number of occurrences of word $w_j$ in text $T_i$.

Say, our corpus consists of only **2 texts**:

["I love data science", 
"A data scientist is often smarter than a data analyst"]

\* **As a preprocessing step, all letters are usually made lowercase, sometimes stemming/lemmatization is performed, as well as stop-words/punctuation removals, but that's not obligatory.**

Suppose our tokens are simple unigrams (words), therefore there are **11 unique words**: {i, love, data, science, a, scientist, is, often, smarter, than, analyst}

Then, our corpus is mapped to feature vectors $T_1=(1,1,1,1,0,0,0,0,0,0,0)$, $T_2=(0,0,2,0,2,1,1,1,1,1,1)$

|Text #|i|love|data|science|a|scientist|is  |often|smarter|than|analyst|
|------|------|------|------|------|------|------|------|------|------|------|------|
|$T_1$|1|1|1|1|0|0|0|0|0|0|0|
|$T_2$|0|0|2|0|2|1|1|1|1|1|1|
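
The table above can be reproduced by hand in a few lines. This is only a minimal sketch: it assumes a naive whitespace tokenizer and builds the dictionary in order of first appearance (the names `tokenized`, `vocabulary` and `vectors` are illustrative, not part of any library).

```python
from collections import Counter

corpus = [
    "I love data science",
    "A data scientist is often smarter than a data analyst",
]

# lowercase and split on whitespace (a deliberately naive tokenizer)
tokenized = [text.lower().split() for text in corpus]

# build the dictionary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# count occurrences of every dictionary word in every text
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

The two printed vectors correspond to the rows $T_1$ and $T_2$ of the table.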

How memory-efficient is this approach?
If n == 20k, this textual corpus might spawn a dictionary with around 100k elements. 
<br>Thus, storing X as a dense array of type int32 would require 20000 x 100000 x 4 bytes ~ **8GB in RAM**, which is barely manageable on today’s computers.

Fortunately, **most values in X will be zeros**, since a given document uses fewer than a couple of thousand (or even a few hundred) distinct words. For this reason we say that bags of words are **typically high-dimensional sparse datasets**. We can save a lot of memory by storing only the non-zero parts of the feature vectors.
Sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for them.
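
To get a feeling for the savings, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions: 200 distinct words per document on average (a made-up figure) and a CSR-like layout with 32-bit values and 32-bit indices.

```python
n_docs = 20_000
vocab_size = 100_000
bytes_per_value = 4  # int32

# dense storage: every cell of the matrix is stored
dense_bytes = n_docs * vocab_size * bytes_per_value

# ASSUMPTION: 200 distinct words per document on average (illustrative only)
avg_nonzeros_per_doc = 200
nnz = n_docs * avg_nonzeros_per_doc

# CSR-like layout: one value + one column index per non-zero, plus n_docs + 1 row pointers
sparse_bytes = nnz * (bytes_per_value + 4) + (n_docs + 1) * 4

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the sparse representation is a few hundred times smaller than the dense one.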

#### Pros
* Very intuitive approach, easy to use, understand and apply - you can code it yourself
* Built-in support in many scientific/NLP libraries
* Memory-efficient sparse format, accepted by most algorithms 
* Despite its simplicity, it works well and can yield good results
* Fast preprocessing, even on 1 core

#### Cons
* A huge corpus usually leads to a huge vocabulary (millions of words); even the sparse format won't help then (only the hashing trick will)
* Other approaches, such as word embeddings, can capture more details (semantics, relations, structure)
* A bag of words is an orderless representation: because spatial relationships between features are thrown away, the simplified model cannot distinguish between sentences built from the same words but having opposite meanings:
    * "New episodes **don't** feel like the first - watch it!" (positive)
    * "New episodes feel like the first - **don't** watch it!" (negative)
* **However, this can be partly mitigated by increasing the "length" of the token (unigrams $\rightarrow$ bigrams, n-grams etc.), gluing negative particles to the next word (not like $\rightarrow$ not_like), using character n-grams, skip-grams etc.** (see [this section for n-grams details](#3_5))
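
The orderless nature of the model is easy to check on the two reviews above. The sketch below assumes a simple regex tokenizer (lowercase letters and apostrophes); the helper `bag_of_words` is hypothetical and only serves the illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    # lowercase, keep letters and apostrophes, ignore punctuation
    return Counter(re.findall(r"[a-z']+", text.lower()))

positive = "New episodes don't feel like the first - watch it!"
negative = "New episodes feel like the first - don't watch it!"

# both reviews consist of exactly the same words, so their bags are equal
same_bag = bag_of_words(positive) == bag_of_words(negative)
print(same_bag)
```

Since the two bags are identical, any model fed only with these counts cannot tell the reviews apart.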

### 2.2 Capturing Dependencies. N-grams Recall

The simple Bag-of-Words (BoW) model, built on simple tokens (unigrams), is too simplistic and captures no spatial dependencies.
To deal with this, let's briefly recall what an **N-gram** is:
* An N-gram is a sequence of $N$ basic tokens. 
* N-grams can be defined in different ways, depending on the token definition ('word', 'char', 'char_wb' etc.)

1) **Word n-grams: (to catch more semantics)** 
* unigrams: 'I love data science' $\rightarrow$ [i, love, data, science]
* bigrams (2-grams): 'I love data science' $\rightarrow$ [i love, love data, data science]
* 3-grams: 'I love data science' $\rightarrow$ [i love data, love data science]
* ...

2) **Character n-grams: (allow us to catch features like ":)" and to deal, to some extent, with misspelled words like "looong" etc.)**
* 5-grams: 'I love data science' $\rightarrow$ ["i lov", " love", ... , "cienc", "ience"]
* ...

3) **Character-wb n-grams (character n-grams only inside word boundaries):**
* 5-grams: 'I love data science' $\rightarrow$ [" i ", " love", "love ", ... , "cienc", "ience"]
* ...

4) **Skip-n-grams or k-skip-n-grams (both character- and word-based, extend spatial dependencies)**
* A sequence of $N$ basic tokens with at most $K$ tokens skipped between neighbours
* 1-skip-2-grams: 'I love data science' $\rightarrow$ [i love, i data, love data, love science, data science]
* ...
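
The four flavours of n-grams listed above can be sketched in plain Python. All helper names here are hypothetical; the padding of each word with one space in `char_wb_ngrams` is an assumption loosely modelled on how scikit-learn's `char_wb` analyzer behaves, and `skip_bigrams` covers only the case $N = 2$.

```python
def word_ngrams(tokens, n):
    # all contiguous runs of n tokens
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    # all contiguous runs of n characters, spaces included
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def char_wb_ngrams(tokens, n):
    # character n-grams inside each word; every word is padded with one space on each side
    ngrams = []
    for token in tokens:
        padded = " " + token + " "
        if len(padded) <= n:
            ngrams.append(padded)  # short word: keep it whole
        else:
            ngrams.extend(char_ngrams(padded, n))
    return ngrams

def skip_bigrams(tokens, k):
    # pairs of tokens with at most k tokens skipped between them
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append(tokens[i] + " " + tokens[j])
    return pairs

text = "i love data science"
tokens = text.split()

print(word_ngrams(tokens, 2))
print(char_ngrams(text, 5)[:3])
print(char_wb_ngrams(tokens, 5)[:3])
print(skip_bigrams(tokens, 1))
```

Note that with "at most $K$ skipped tokens" the 1-skip-2-grams also include the ordinary bigrams, in addition to pairs such as "i data" and "love science".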
 



#### PROS

The same as in Bag-of-Words + more context can be captured

#### CONS

Don't forget that the vocabulary **grows rapidly** as the n-gram range increases!
<br>**|(1,1)-grams| << |(1,2)-grams| << |(1,3)-grams| << ...**
<br>where (1,1)-grams = unigrams, (1,2)-grams = unigrams AND bigrams, etc.
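
Even on the tiny two-text corpus from section 2.1 the growth is visible. The sketch below counts distinct word n-grams for increasing ranges; it assumes whitespace tokenization, and `ngram_vocabulary` is a hypothetical helper.

```python
corpus = [
    "I love data science",
    "A data scientist is often smarter than a data analyst",
]

def ngram_vocabulary(texts, min_n, max_n):
    # collect all distinct word n-grams with min_n <= n <= max_n
    vocabulary = set()
    for text in texts:
        tokens = text.lower().split()
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocabulary.add(" ".join(tokens[i:i + n]))
    return vocabulary

# vocabulary sizes for (1,1)-, (1,2)- and (1,3)-grams
sizes = [len(ngram_vocabulary(corpus, 1, max_n)) for max_n in (1, 2, 3)]
print(sizes)
```

On a real corpus the gap between the ranges is far more dramatic, because most longer n-grams occur only once.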

### 2.3  CountVectorizer

CountVectorizer in scikit-learn implements the aforementioned Bag-of-Words approach:

**Commonly used parameters:**
* **analyzer**={‘word’, ‘char’, ‘char_wb’} - what token to use (word, char-n-grams etc.)
* **ngram_range**=(min_n, max_n) - what N to use: say, ngram_range=(1,2) $\rightarrow$  use both unigrams and bigrams
* **stop_words**={‘english’, list_of_words, or None} - whether to filter stop-words or not
* **vocabulary**={None, your_own_dictionary} - whether to use given vocabulary or to build it from extracted tokens
* **max_features**={N, None} - to build a vocabulary that considers only the **top-N** terms ordered by term frequency (TF) across the corpus
* **max_df** – when building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
* **min_df** – when building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold
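
The effect of `min_df` and `max_df` can be illustrated with a hand-rolled document-frequency filter. This is a sketch, not the library's implementation: it assumes `min_df` is given as an absolute count and `max_df` as a proportion of documents (in scikit-learn each of them may be either an int or a float), and `filter_vocabulary` is a hypothetical helper.

```python
from collections import Counter

docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew",
]

# document frequency: in how many documents does each term appear?
df = Counter()
for doc in docs:
    df.update(set(doc.split()))

def filter_vocabulary(df, n_docs, min_df=1, max_df=1.0):
    # ASSUMPTION: min_df is an absolute count, max_df a proportion of documents
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count <= max_df * n_docs
    )

# drop terms seen in fewer than 2 documents or in more than 75% of documents
kept = filter_vocabulary(df, len(docs), min_df=2, max_df=0.75)
print(kept)
```

Here "the" is dropped as a corpus-specific stop word (it appears in every document), while "dog" and "bird" are dropped as too rare.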

In [None]:
# usage example

# import CountVectorizer from sklearn library
from sklearn.feature_extraction.text import CountVectorizer

# create CountVectorizer object
cv = CountVectorizer(
                    analyzer='word', # token = word
                    ngram_range=(1,1), # only unigrams are used, (1,2) - unigrams/bigrams, ..., etc.
                    stop_words=['my', 'stop', 'word', 'list'], # or stop_words='english'
                    vocabulary=None, # or vocabulary=your_own_dictionary
                    max_df=1.0, # don't filter words by their frequency
                    max_features=6 # only top-6 words will be used as columns
                    )

In [None]:
# We'll be using it as an example for the other feature extraction methods
# You can use iterables, numpy arrays, pandas DataFrames as an input.
texts = [
    'nobody can stop me', # "stop" will be filtered by stop_words list
    'word is a building blocks of a text', # "word" will be filtered by stop_words list
    'I like doing feature extraction on text',
    'I do not like digits in text like 12345'
    ]

In [None]:
# apply CountVectorizer to text corpus
transformed_texts_cv = cv.fit_transform(texts)
# convert sparse representation of transformed texts to dense format and explore it
print('Obtained feature matrix X:')
print(transformed_texts_cv.todense(), '\n')

In [None]:
# print dictionary (sorted by column index) to see mapping between indices/columns and words 
print('Dictionary:')
for k,v in sorted(cv.vocabulary_.items(), key=lambda kv: kv[1]):
    print('column index:{}, token: {}'.format(v,k))

In [None]:
# transform new sentences (having CountVectorizer trained)
new_text = ['i like feature extraction very much'] 
new_transformed = cv.transform(new_text)
# some words, like "very" and "much", were not used to build the dictionary, thus, they will be skipped
print('\nNew sentence (transformed):')
print(new_transformed.todense(), '\n')

### 2.4 TF-IDF Augmentation. TfidfVectorizer

In the TF-IDF approach (term frequency - inverse document frequency), one more assumption is added to those of the usual BoW model:
* The text can be analyzed without taking into account the word/token order
* We only need to know which words/tokens the text consists of and how many times each of them occurs
* The more often a word/token appears in a document, the more important it is
* **If a word/token appears in a document but rarely appears in other documents, it is important, and vice versa: <br>if it is common across most documents, we cannot rely on it to help us distinguish between texts** 

Thus, looking at the whole corpus, the usual word counters (term frequencies, TF) are weighted by the IDF multiplier:

$$  
    \begin{cases} TF(w,T)=n_{Tw} \\ IDF(w, T)= \log{\frac{N}{n_{w}}}\end{cases} \implies 
    TF\text{-}IDF(w, T) = n_{Tw}\ \log{\frac{N}{n_{w}}} \ \ \ \ \forall w \in W
$$

<br> where $T$ corresponds to the current document (text), 
<br>$w$ - selected word in document $T$, 
<br>$W$ - the dictionary (set of all words in the corpus), 
<br>$n_{Tw}$ - number of occurrences of $w$ in text $T$, 
<br>$n_{w}$ - number of documents containing word $w$, 
<br> $N$ - total number of documents in the corpus.
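
The formula can be evaluated by hand on the two-text corpus from section 2.1. This is a minimal sketch of the textbook definition given above, assuming whitespace tokenization and the natural logarithm; `tf_idf` is a hypothetical helper and must only be called for words that occur somewhere in the corpus.

```python
import math
from collections import Counter

corpus = [
    "i love data science",
    "a data scientist is often smarter than a data analyst",
]
tokenized = [text.split() for text in corpus]
n_docs = len(tokenized)  # N

# n_w: number of documents containing word w
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(word, tokens):
    tf = tokens.count(word)                  # n_Tw
    idf = math.log(n_docs / doc_freq[word])  # log(N / n_w)
    return tf * idf

print(tf_idf("love", tokenized[0]))  # appears in one document only
print(tf_idf("data", tokenized[0]))  # appears in every document
print(tf_idf("a", tokenized[1]))     # appears twice in one document
```

A word that occurs in every document ("data") gets zero weight. Keep in mind that `TfidfVectorizer` with default settings will not reproduce these exact numbers: it uses a smoothed variant, $\ln\frac{1+N}{1+n_w} + 1$, and then normalizes each row.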


**Commonly used parameters:**
* **analyzer**={‘word’, ‘char’, ‘char_wb’} - what token to use (word, char-n-grams etc.)
* **ngram_range**=(min_n, max_n) - what N to use: say, ngram_range=(1,2) $\rightarrow$  use both unigrams and bigrams
* **stop_words**={‘english’, list_of_words, or None} (default) - whether to filter stop-words or not
* **vocabulary**={None, your_own_dictionary} - whether to use given vocabulary or to build it from extracted tokens
* **max_features**={N, None} - to build a vocabulary that considers only the top N terms ordered by term frequency across the corpus
* **norm**={‘l1’, ‘l2’ or None, optional} - normalize feature vectors to unit norm ($L_2-$, $L_1-$ norms)
* **smooth_idf**={True, False} Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
* **max_df** – when building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
* **min_df** – when building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold

In [None]:
# usage example

# import TfidfVectorizer from sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer

# create TfidfVectorizer object
tv = TfidfVectorizer(
                    analyzer='word', # token = word
                    ngram_range=(1,1), # only unigrams are used, (1,2) - unigrams/bigrams, ..., etc.
                    stop_words=['my', 'stop', 'word', 'list'], # or stop_words='english'
                    vocabulary=None, # or vocabulary=your_own_dictionary
                    max_df=1.0, # don't filter words by their frequency
                    max_features=6, # only top-6 words will be used as columns,
                    smooth_idf=True,
                    norm='l2' # euclidean norm is used by default
                    )

In [None]:
# apply TfidfVectorizer to the same text corpus
transformed_texts_tv = tv.fit_transform(texts)
# convert sparse representation of transformed texts to dense format and explore it
print('Obtained feature matrix X (see, L2-norm is used):')
print(transformed_texts_tv.todense(), '\n')

In [None]:
# print dictionary (sorted by column index) to see mapping between indices/columns and words 
print('Dictionary:')
for k,v in sorted(tv.vocabulary_.items(), key=lambda kv: kv[1]):
    print('column index:{}, token: {}'.format(v,k))

In [None]:
# transform new sentences (having TfidfVectorizer trained)
new_text = ['i like extraction very much'] 
new_transformed = tv.transform(new_text)
# "very", "much" etc. were not used to build the dictionary, thus, they will be skipped
print('\nNew sentence (transformed):')
print(new_transformed.todense(), '\n')

### 2.5 Hashes. HashingVectorizer

A hash function is any function that **can be used to map data of arbitrary size to data of fixed size**. 
<br>The values returned by a hash function are called hash values, hash codes, or simply hashes.
<br>For example, $f(X) = X \bmod N$, $f: X \rightarrow \{0, 1, \dots, N-1\}$, which maps its input into a set of $N$ "buckets", is a hash function:

Say, $N = 2^k = 2^3 = 8$, then $\ f(15)=15 \bmod 8 = 7,\ f(9)=9 \bmod 8 = 1,\ ...$

The HashingVectorizer implementation uses the hashing trick to map a **token string name** to a **feature integer index**.
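
The idea can be sketched in a few lines. Note the assumptions: CRC32 from the standard library stands in for the real hash function purely for illustration, sign handling and normalization are left out, and `bucket` and `hash_vectorize` are hypothetical helpers.

```python
import zlib

def bucket(value, n_buckets):
    # f(X) = X mod N
    return value % n_buckets

def hash_vectorize(text, n_features):
    # ASSUMPTION: CRC32 stands in for the real hash function, purely for illustration
    vector = [0] * n_features
    for token in text.lower().split():
        index = bucket(zlib.crc32(token.encode("utf-8")), n_features)
        vector[index] += 1  # no vocabulary lookup: the hash IS the column index
    return vector

print(bucket(15, 8), bucket(9, 8))
vector = hash_vectorize("i love data science", n_features=8)
print(vector)
```

No dictionary is built or stored at any point: the column of a token depends only on the token itself and on `n_features`.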

#### PROS:

* **Very memory-scalable to large datasets** as there is no need to store a vocabulary dictionary in memory
* Fast to serialize/deserialize as it holds no state besides the constructor parameters
* Can be used in streaming (partial fit) settings and/or parallelized, as no state is computed during fit
* Can be used as a "silly" dimensionality reduction

#### CONS (vs Vectorizers with in-memory vocabulary): 

* There is no way to compute the inverse transform (to get from feature indices to string feature names) <br> which **can be a problem when trying to introspect which features are most important to a model**.
* There can be **collisions**: distinct tokens can be mapped to the same "bucket" (feature index). 
<br>However, in practice this is rarely an issue if the number of bins is large enough (e.g. $2^{18}$ for text classification problems)
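
With too few buckets, collisions are unavoidable: hashing 11 distinct tokens into 6 buckets must put at least two of them together (pigeonhole principle). The sketch below again uses CRC32 as a stand-in hash, which is an assumption made for illustration only.

```python
import zlib

tokens = ["i", "love", "data", "science", "a", "scientist",
          "is", "often", "smarter", "than", "analyst"]
n_features = 6

# group tokens by the bucket they fall into
buckets = {}
for token in tokens:
    index = zlib.crc32(token.encode("utf-8")) % n_features
    buckets.setdefault(index, []).append(token)

# buckets shared by more than one token are collisions
colliding = {index: words for index, words in buckets.items() if len(words) > 1}
print(colliding)
```

Tokens that share a bucket become indistinguishable to the model, which is why a large number of bins is recommended.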


\* The hash function used is the signed 32-bit version of Murmurhash3 (for those, who are really interested :)  )

**Commonly used parameters:**
* **analyzer**={‘word’, ‘char’, ‘char_wb’} - what token to use (word, char-n-grams etc.)
* **ngram_range**=(min_n, max_n) - what N to use: say, ngram_range=(1,2) $\rightarrow$  use both unigrams and bigrams
* **stop_words**={‘english’, list_of_words, or None} (default) - whether to filter stop-words or not
* **n_features**={N} - how many "buckets" to use
* **norm**={‘l1’, ‘l2’ or None, optional} - normalize feature vectors to unit norm ($L_2-$, $L_1-$ norms)

In [None]:
# usage example

# import HashingVectorizer from sklearn library
from sklearn.feature_extraction.text import HashingVectorizer

# create HashingVectorizer object
hv = HashingVectorizer(
                    analyzer='word', # token = word
                    ngram_range=(1,1), # only unigrams are used, (1,2) - unigrams/bigrams, ..., etc.
                    stop_words=['my', 'stop', 'word', 'list'], # or stop_words='english'
                    n_features=6, # only 6 bins will be used as columns, high probability of collisions!
                    norm=None
                    )

In [None]:
# apply HashingVectorizer to the same text corpus
transformed_texts_hv = hv.fit_transform(texts)
# convert sparse representation of transformed texts to dense format and explore it
print('Obtained feature matrix X (see, no norm is used):')
print(transformed_texts_hv.todense(), '\n')

In [None]:
# I see no dictionary ...
print('Dictionary:')
print('Oops, Hashing trick assumes no vocabulary will be used at all, online learning :)')
print("However, we won't be able to do reverse transform and to get exact words :( ")

In [None]:
  
# transform new sentences (HashingVectorizer is stateless, so no fitting is actually required)
new_text = ['i like extraction very much'] 
new_transformed = hv.transform(new_text)
# there is no dictionary here: unseen words like "very" and "much" are not skipped, they are hashed into the same 6 buckets
print('\nNew sentence (transformed):')
print(new_transformed.todense(), '\n')

## 3. Going Beyond: Feature Engineering

Usually, a specific domain comes with specific information hidden inside your data. 
You need to extract as much of it as possible. 

For example, if we want to run sentiment analysis (a classification task) on the IMDB dataset (movie reviews) and we suspect that **many reviews may contain explicit marks (say, in the form of x/xx)**, then we should check this and extract a useful custom feature:

["Average film, however, starring Matt Damon, 8/10", 1] $\rightarrow$ {"8/10"} $\rightarrow$ 8/10=0.8 ~ 1 $\rightarrow$ review is positive
<br>["2/10, there is nothing to add", 0] $\rightarrow$ {"2/10"} $\rightarrow$ 2/10=0.2 ~ 0 $\rightarrow$ review is negative.

However, beware of dates, outliers (with respect to this particular feature) and other surprises - always check your code / regular expressions:

Say, incorrect parsing of **'01/10/1999'** would lead to **{1/10, 10/1999} or {1/10}  ~ 0 (negative review?!)** errors.
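
One possible guard against such mistakes is to reject candidates that are glued to other digits or slashes. The patterns below are a sketch under that assumption, not a complete solution; the names `naive` and `strict` and the third sample sentence are made up for illustration.

```python
import re

# naive pattern: any "number/number" fragment counts as a rating
naive = re.compile(r'(\d{1,3})/(\d{1,2})')
# stricter pattern: reject candidates glued to other digits or slashes, e.g. parts of a date
strict = re.compile(r'(?<![\d/])(\d{1,3})/(\d{1,2})(?![\d/])')

samples = [
    "Average film, however, starring Matt Damon, 8/10",
    "2/10, there is nothing to add",
    "Released on 01/10/1999",
]
for s in samples:
    print(naive.findall(s), strict.findall(s))
```

The naive pattern extracts a bogus "01/10" from the date, while the stricter one returns nothing for it and still finds the genuine ratings. Other false positives (e.g. "24/7") would still need separate handling.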

### From here on, we discuss domain-specific features; they are not a panacea in general.

### 3.1 Token-based Level

We need to look at tokens (words, entities like smileys etc.) and try to extract meaningful features:

* positive smileys
* negative smileys
* explicit rating (marks)

In [None]:
import pandas as pd
import numpy as np
from textblob import Word, TextBlob
import re # for regular expressions

# download resources to be used by TextBlob wrapper (if not yet downloaded)
import nltk
nltk.download('punkt')
pass

In [None]:
# extract rating "candidates" (like "8/10") from a text s;
# note: this implementation does not handle the aforementioned problematic cases (dates etc.)
def get_rate(s):
    # search for possible "numerator/denominator" candidates
    candidates = re.findall(r'(\d{1,3})/(\d{1,2})', s)
    rates = []
    for num, den in candidates:
        if int(den) == 0:
            return 0 # zero denominator: treat the rating as 0
        # parse the fraction explicitly instead of calling eval(), which is unsafe on untrusted text
        rates.append(int(num) / int(den))
    return np.mean(rates) if rates else -1 # if there is more than 1 value, calculate mean

# bags of positive/negative smileys
# (hyphens are plain ASCII '-' so that emoticons typed on a keyboard can match)
positive_smiles = set([
":-)",":)",":-]",":]",":-3",":3",":->",":>","8-)","8)",":-}",":}",":o)",":c)",":^)","=]","=)",":-D",":D","8-D","8D",
"x-D","xD","X-D","XD","=D","=3","B^D",":-))",";-)",";)","*-)","*)",";-]",";]",";^)",":-,",";D",":-P",":P","X-P","XP",
"x-p","xp",":-p",":p",":-Þ",":Þ",":-þ",":þ",":-b",":b","d:","=p",">:P", ":'-)", ":')",  ":-*", ":*", ":×"
])
negative_smiles = set([
":-(",":(",":-c",":c",":-<",":<",":-[",":[",":-||",">:[",":{",":@",">:(","D-':","D:<","D:","D8","D;","D=","DX",":-/",
":/",":-.",'>:\\', ">:/", ":\\", "=/" ,"=\\", ":L", "=L",":S",":-|",":|","|-O","<:-|"
])

# function to extract token-level features from texts
def get_token_level_features(texts, visualize=True):
    
    # assume texts = pd.Series with review text
    print('extracting token-level features...')
    tdf = pd.DataFrame()
    tdf['text'] = texts # this is our review
    
    # 1. extract rating, like "great film. 9/10" will yield 0.9
    tdf['rating'] = tdf['text'].apply(get_rate).fillna(-1) # rating (if found in review, else substitute NaN's by -1)

    # 2. extract smiles and count positive/negative smiles per review
    tdf['positive_smiles'] = tdf.text.apply(lambda s: len([x for x in s.split() if x in positive_smiles]))
    tdf['negative_smiles'] = tdf.text.apply(lambda s: len([x for x in s.split() if x in negative_smiles]))
    
    if visualize:
        # this is used for visual clarity, return pd.DataFrame
        return tdf 
    else:
        # get correct (and sparse) representation of feature matrix F
        from scipy.sparse import csr_matrix 
        return csr_matrix(tdf[tdf.columns[1:]].values)

### 3.2 Sentence-based / Text-based Level

Now we move up to the sentence/text level.
<br><i>One could argue about which level these features belong to, but let us just put them here</i>
<br>Let's see what features we can look for:
* **Sentence count** (split the text into sentences, then take the length of the resulting list) 
* **Exclamation mark count** (integer) or presence (boolean) - catches emphasis (stress), especially if we use probabilistic output instead of binary classification
* **Question mark count** (integer) or presence (boolean) - can sometimes help in catching sarcasm
* **Uppercase word count** (of length > 1, to omit "A"s) - catches emphasis (stress) in a text, especially if we use probabilistic output instead of binary classification
* **Contrast conjunctions**, like {'instead','nevertheless','on the contrary','on the other hand'} - to catch possible changes of sentiment

Some information regarding text "edges" - first/last sentences in a review:
* **"polarity" of first/last sentence[s]**
* **"subjectivity" of first/last sentence[s]**
* **"purity" of first/last sentence[s] or the whole set of sentences** - to catch a change of sentiment
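
To make the "purity" idea concrete, here is a small worked example on hand-picked numbers. It is a sketch under stated assumptions: the sentence polarities are hypothetical (not produced by any sentiment tool), the absolute value is taken so that consistently negative reviews also score 1, and a review with all-zero polarities is defined to have purity 0.

```python
def purity(polarities):
    # |sum of polarities| / sum of |polarities|; defined as 0 when all polarities are 0
    total_abs = sum(abs(p) for p in polarities)
    if total_abs == 0:
        return 0.0
    return abs(sum(polarities)) / total_abs

# hypothetical sentence polarities, made up for illustration
consistent = [0.5, 0.8, 0.2]   # every sentence is positive
mixed = [0.6, -0.6]            # sentiment flips inside the review
negative = [-0.4, -0.7]        # every sentence is negative

for name, polarities in [("consistent", consistent), ("mixed", mixed), ("negative", negative)]:
    print(name, purity(polarities))
```

Values close to 1 mean the sentiment stays the same throughout the review, values close to 0 mean it changes.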

In [None]:
# let's continue...

# contrast conjunctions
contrast_conj = set([
'alternatively','anyway','but','by contrast','differ from','elsewhere','even so','however','in contrast','in fact',
'in other respects','in spite of','in that respect','instead','nevertheless','on the contrary','on the other hand',
'rather','though','whereas','yet'])

# to get review "purity" ~ shows same sentiment over review (~1) or changing sentiment (~0)
def purity(sentences):
    # obtain polarities across the sentences
    polarities = np.array([TextBlob(x).sentiment.polarity for x in sentences])
    # |sum(polarity)| / sum(|polarity|); yields NaN if all polarities are zero (handled by fillna later)
    return np.abs(polarities.sum()) / np.abs(polarities).sum()

# uppercase pattern
uppercase_pattern = re.compile(r'(\b[0-9]*[A-Z]+[0-9]*[A-Z]+[0-9]*\b)')

# regular expression to split a review into sentences; alternatively, use the TextBlob property: TextBlob(x).sentences
sentence_splitter = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\!|\?|\.)\s')
# you can use https://regex101.com/ for regex creation/checking, very convenient

# feature engineering
def get_text_level_features(texts, visualize=True):
    # assume text = pd.Series with review text
    print('extracting text-level features...')
    tdf = pd.DataFrame()
    tdf['text'] = texts # this is our review
    tdf['sentences'] = tdf.text.apply(lambda s: re.split(sentence_splitter, s)) # split it into sentences
    
    tdf['sentence_cnt'] = tdf['sentences'].apply(len) # sentence count
    tdf['exclamation_cnt'] = tdf.text.str.count(r'!') # exclamation mark count
    tdf['question_cnt'] = tdf.text.str.count(r'\?') # question mark count
    
    # uppercase words cnt (like HOLY JESUS!)
    tdf['upper_word_cnt'] = tdf.text.apply(lambda s: len(re.findall(uppercase_pattern, s)))
    
    # not so informative, but still - contrast conjunctions
    tdf['contrast_conj_cnt'] = tdf.text.apply(lambda s: len([c for c in contrast_conj if c in s]))
    
    # polarity of 1st sentence
    tdf['polarity_1st_sent'] = tdf.sentences.apply(lambda s: TextBlob(s[0]).sentiment.polarity)
    # subjectivity of 1st sentence
    tdf['subjectivity_1st_sent'] = tdf.sentences.apply(lambda s: TextBlob(s[0]).sentiment.subjectivity)
    # polarity of last sentence
    tdf['polarity_last_sent'] = tdf.sentences.apply(lambda s: TextBlob(s[-1]).sentiment.polarity)
    # subjectivity of last sentence
    tdf['subjectivity_last_sent'] = tdf.sentences.apply(lambda s: TextBlob(s[-1]).sentiment.subjectivity)
    # polarity of review itself
    tdf['polarity'] = tdf.text.apply(lambda s: TextBlob(s).sentiment.polarity)
    # "purity" of review, |sum(sentence polarity)| / sum(|sentence polarity|), ~ 1 is better, ~ 0 -> mixed
    tdf['purity'] = tdf.sentences.apply(purity)
    tdf['purity'] = tdf['purity'].fillna(0)
    
    if visualize:
        # this is used for visual clarity, return pd.DataFrame
        return tdf 
    else:
        # get correct (and sparse) representation of feature matrix F
        from scipy.sparse import csr_matrix 
        return csr_matrix(tdf[tdf.columns[2:]].values)

### BE CAREFUL: if you use LINEAR MODELS and have MOSTLY SHORT REVIEWS (1 sentence), then
### tdf['subjectivity_1st_sent'] ~ tdf['subjectivity_last_sent']: two nearly identical columns lead to multicollinearity!

In [None]:
# let's test custom features:

reviews = [
    "Waste of time :( 2/10 for the plot and 4/10 for acting!",
    'Awful film! Nobody can like it',
    'Wow! Am I impressed?? TOTALLY :D',
    '7/10'
]

# token-based
token_lf = get_token_level_features(reviews)
token_lf

In [None]:
# text-based
text_lf = get_text_level_features(reviews)
text_lf

## 4. Example

In [None]:
from sklearn.datasets import fetch_20newsgroups
from collections import Counter

In [None]:
train = fetch_20newsgroups()
test = fetch_20newsgroups(subset="test")

In [None]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [None]:
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('clf', LogisticRegression()),
])

In [None]:
params = dict(clf__C=[10, 1, 0.1, 0.01])
grid_search = GridSearchCV(pipeline, params, scoring="accuracy", cv=skf, n_jobs=-1)

In [None]:
grid_search.fit(train["data"], train["target"])

In [None]:
grid_search.best_score_, grid_search.best_estimator_

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('clf', LogisticRegression(C=1)),
])
pipeline.fit(train["data"], train["target"])

In [None]:
from sklearn.metrics import accuracy_score, classification_report

In [None]:
predictions = pipeline.predict(test["data"])
accuracy_score(test["target"], predictions)

In [None]:
print(classification_report(test["target"], predictions, target_names=test["target_names"]))