# Feature engineering in NLP

1. tokenization
2. stemming
3. lemmatization
4. word distances
5. text stats
6. readability scores 

## Data loading and preparation



In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")

#df = df.iloc[df.course.values != "other",:]

df.head()

## Simple features

- Number of words 
- Number of characters 
- Average length of words

You can calculate the character count as follows (the first cell only checks the letter filter that is reused below for word counting):

In [None]:
len("".join([i for i in "coronavirus = death" if i in "qwertyuiopasdfghjklzxcvbnm "]).split())

In [None]:
%%time
df['char_count'] = [len(i) for i in df['excerpt']]

In [None]:
%%time
df['char_count'] = df['excerpt'].apply(len)

In [None]:
%%time
df['char_count'] = df['excerpt'].str.len()

In [None]:
print(df['char_count'].mean())

In [None]:
df.head()

It is more interesting to estimate the number of words:

In [None]:
def count_words(string):
    string = "".join([i for i in string.lower() if i in "qwertyuiopasdfghjklzxcvbnmйцукенгшщзхїфивапролджєячсмітьбю "])
    words = string.split()
    return len(words)

df['word_count'] = df['excerpt'].apply(count_words)

print(df['word_count'].mean())
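The hard-coded alphabet in `count_words` drops every character that is not listed, including newlines, so two words separated only by a line break are merged into one. A minimal sketch of a Unicode-aware alternative, assuming it is acceptable to keep any letter and any whitespace character; `count_words_unicode` is a hypothetical helper, not part of the original notebook:

```python
import re

# Matches anything that is neither a word character nor whitespace,
# plus digits and underscores, so only letters and whitespace survive.
_NON_LETTERS = re.compile(r"[^\w\s]|[\d_]")

def count_words_unicode(text):
    """Count the whitespace-separated tokens left after removing non-letters."""
    cleaned = _NON_LETTERS.sub("", text.lower())
    return len(cleaned.split())

print(count_words_unicode("coronavirus = death"))
print(count_words_unicode("first line\nsecond line"))
```

It can be applied the same way as the original function, for example `df['excerpt'].apply(count_words_unicode)`.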

In [None]:
df['mean_word_len'] = df['char_count'] / df['word_count']

pd.concat([df.sort_values("mean_word_len").head(),
df.sort_values("mean_word_len").tail(10)])

In [None]:
df.loc[df['mean_word_len'] != np.inf,'mean_word_len'].hist()

Or count the occurrences of a specific word or character:

In [None]:
import matplotlib.pyplot as plt

def count_mentions(string):
    # Keep only letters and spaces, as in count_words above
    string = "".join([i for i in string.lower() if i in "qwertyuiopasdfghjklzxcvbnmйцукенгшщзхїфивапролджєячсмітьбю "])
    words = string.split()
    # Count exact matches; use word.startswith("data") instead to match a prefix
    mentions = [word for word in words if word == "react"]
    return len(mentions)

df['mention_count'] = df['excerpt'].apply(count_mentions)
df['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()

## Advanced text statistics

To compute these, you should install the textatistic package:

In [None]:
%%bash

pip install textatistic

Main measures of textatistic:

- **char_count** - number of non-space characters.
- **notdalechall_count** - number of words not on Dale-Chall list of words understood by 80% of 4th graders.
- **polysyblword_count** - number of words with three or more syllables.
- **sent_count** - number of sentences.
- **sybl_count** - number of syllables.
- **word_count** - number of words.
- **dalechall_score** - Dale-Chall score.
- **flesch_score** - Flesch Reading Ease score.
- **fleschkincaid_score** - Flesch-Kincaid score.
- **gunningfog_score** - Gunning Fog score.
- **smog_score** - SMOG score.
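Several of these scores are simple functions of the counts listed above. The sketch below uses the standard published formulas; it is an assumption that textatistic follows them exactly (in particular, Gunning Fog is defined on "complex words", which is approximated here by the polysyllable count). The counts in the example are made up for illustration:

```python
import math

def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier text
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate US school grade level
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, polysyllable_words):
    # Polysyllable words (3+ syllables) stand in for "complex words"
    return 0.4 * ((words / sentences) + 100 * (polysyllable_words / words))

def smog(sentences, polysyllable_words):
    return 1.0430 * math.sqrt(polysyllable_words * 30 / sentences) + 3.1291

# Made-up counts: 100 words, 5 sentences, 150 syllables, 10 polysyllable words
print(flesch_reading_ease(100, 5, 150))
print(flesch_kincaid_grade(100, 5, 150))
print(gunning_fog(100, 5, 10))
print(smog(5, 10))
```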

**Flesch score interpretation**

* 100.0–90.0 - 5th grade - Very easy to read. Easily understood by an average 11-year-old student.
* 90.0–80.0 - 6th grade - Easy to read. Conversational English for consumers.
* 80.0–70.0 - 7th grade - Fairly easy to read.
* 70.0–60.0 - 8th & 9th grade - Plain English. Easily understood by 13- to 15-year-old students.
* 60.0–50.0 - 10th to 12th grade - Fairly difficult to read.
* 50.0–30.0 - College - Difficult to read.
* 30.0–0.0 - College graduate - Very difficult to read. Best understood by university graduates.
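The table above can be turned into a small lookup, which is handy when a categorical reading-level feature is wanted instead of the raw score. This is a sketch with a hypothetical helper `flesch_band`; since the bands share their boundary values, it assumes a boundary score belongs to the easier band, and it puts scores below 0 into the hardest band:

```python
# (lower bound, school level, description), from easiest to hardest
FLESCH_BANDS = [
    (90.0, "5th grade", "Very easy to read"),
    (80.0, "6th grade", "Easy to read"),
    (70.0, "7th grade", "Fairly easy to read"),
    (60.0, "8th & 9th grade", "Plain English"),
    (50.0, "10th to 12th grade", "Fairly difficult to read"),
    (30.0, "College", "Difficult to read"),
]

def flesch_band(score):
    """Map a Flesch Reading Ease score to (school level, description)."""
    for lower_bound, level, description in FLESCH_BANDS:
        if score >= lower_bound:
            return level, description
    # Everything below 30, including negative scores
    return "College graduate", "Very difficult to read"

print(flesch_band(65.2))
```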





In [None]:
test_text = """
It is natural to suppose that, before philosophy enters upon its subject proper — namely, the actual knowledge of what truly is — 
it is necessary to come first to an understanding concerning knowledge, which is looked upon as the instrument by which to take 
possession of the Absolute, or as the means through which to get a sight of it. The apprehension seems legitimate, on the one hand 
that there may be various kinds of knowledge, among which one might be better adapted than another for the attainment of our purpose — 
and thus a wrong choice is possible: on the other hand again that, since knowing is a faculty of a definite kind and with a determinate 
range, without the more precise determination of its nature and limits we might take hold on clouds of error instead of the heaven of truth.
This apprehensiveness is sure to pass even into the conviction that the whole enterprise which sets out to secure for consciousness 
by means of knowledge what exists per se, is in its very nature absurd; and that between knowledge and the Absolute there lies a boundary 
which completely cuts off the one from the other. For if knowledge is the instrument by which to get possession of absolute Reality, 
the suggestion immediately occurs that the application of an instrument to anything does not leave it as it is for itself, but rather 
entails in the process, and has in view, a moulding and alteration of it. Or, again, if knowledge is not an instrument which we actively 
employ, but a kind of passive medium through which the light of the truth reaches us, then here, too, we do not receive it as it is in 
itself, but as it is through and in this medium. In either case we employ a means which immediately brings about the very opposite 
of its own end; or, rather, the absurdity lies in making use of any means at all. It seems indeed open to us to find in the knowledge 
of the way in which the instrument operates, a remedy for this parlous state; for thereby it becomes possible to remove from the result 
the part which, in our idea of the Absolute received through that instrument, belongs to the instrument, and thus to get the truth in 
its purity. But this improvement would, as a matter of fact, only bring us back to the point where we were before. If we take away again 
from a definitely formed thing that which the instrument has done in the shaping of it, then the thing (in this case the Absolute) stands 
before us once more just as it was previous to all this trouble, which, as we now see, was superfluous. If the Absolute were only to be 
brought on the whole nearer to us by this agency, without any change being wrought in it, like a bird caught by a limestick, it would 
certainly scorn a trick of that sort, if it were not in its very nature, and did it not wish to be, beside us from the start. For a trick 
is what knowledge in such a case would be, since by all its busy toil and trouble it gives itself the air of doing something quite 
different from bringing about a relation that is merely immediate, and so a waste of time to establish. Or, again, if the examination of 
knowledge, which we represent as a medium, makes us acquainted with the law of its refraction, it is likewise useless to eliminate 
this refraction from the result. For knowledge is not the divergence of the ray, but the ray itself by which the truth comes in contact 
with us; and if this be removed, the bare direction or the empty place would alone be indicated."""

In [None]:
from textatistic import Textatistic

# Build the statistics object once and reuse it
stats = Textatistic(test_text)
print(stats)

print("Char count:", stats.char_count)
print("Notdalechall_count:", stats.notdalechall_count)
print("Polysyblword_count:", stats.polysyblword_count)
print("Sent_count:", stats.sent_count)
print("Sybl_count:", stats.sybl_count)
print("Word_count:", stats.word_count)

In [None]:
Textatistic(test_text).scores

In [None]:
from textatistic import Textatistic

readability_scores = Textatistic(test_text).scores

print(readability_scores)

flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))

In [None]:
def flesch_score(text):
    try:
        return Textatistic(text).scores['flesch_score']
    except Exception:
        # If Textatistic fails on a text, fall back to 50,
        # an arbitrary mid-range placeholder
        return 50

df['flesch_score'] = df['excerpt'].apply(flesch_score)

In [None]:
df.head()

In [None]:
excerpts = df.excerpt.values[:10]

gunning_fog_scores = [Textatistic(excerpt).scores['gunningfog_score'] for excerpt in excerpts]
  
print(gunning_fog_scores)

In [None]:
df

## SpaCy package for text preprocessing

You should download the **xx_ent_wiki_sm** model for language processing:

In [None]:
%%bash

python3 -m spacy download xx_ent_wiki_sm

And run it:

In [None]:
import spacy
import xx_ent_wiki_sm

nlp = xx_ent_wiki_sm.load()
nlp

In [None]:
doc = nlp(df.excerpt.values[0])
doc

In [None]:
tokens = [token.text for token in doc]
print(tokens)

In [None]:
nlp.pipeline

In [None]:
nlp_eng = spacy.load('en_core_web_sm')
nlp_eng.pipeline

In [None]:
doc_eng = nlp_eng(df.excerpt.values[-2])

print(doc_eng)

In [None]:
doc_eng[2]

In [None]:
[token for token in doc_eng]

In [None]:
# A notebook cell only displays its last expression, so print both pairs
print(doc_eng[0], doc_eng[0].lemma_)
print(doc_eng[19], doc_eng[19].lemma_)

In [None]:
lemmas = [token.lemma_ for token in doc_eng] 
print(lemmas)

The **isalpha()** string method checks whether a string consists only of letters:

In [None]:
print("Dog".isalpha())
print("3dogs".isalpha())
print("WHO".isalpha())
print("kirichenko17roman@gmail.com".isalpha())
print("spaCy".isalpha())
print("#pretty_girl".isalpha())
print("@elonmusk".isalpha())
print("spaceboy:)".isalpha())
print("12347".isalpha())
print("!".isalpha())
print("?".isalpha())

The method above is useful for filtering lemmas after lemmatization:

In [None]:
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma == '-PRON-']
print(' '.join(a_lemmas))

In [None]:
doc_eng[4], doc_eng[4].like_num

We can also use the attributes of the tokens in the doc object directly:

In [None]:
print('Index: ', [token.i for token in doc_eng])
print('Text: ', [token.text for token in doc_eng])
print('is_alpha:', [token.is_alpha for token in doc_eng])
print('is_punct:', [token.is_punct for token in doc_eng]) 
print('like_num:', [token.like_num for token in doc_eng])

In [None]:
np.array([token.text for token in doc_eng])[[token.like_num for token in doc_eng]]

Stop words for text cleaning:

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
print(stopwords)
print(len(stopwords))

In [None]:
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]

print(' '.join(a_lemmas))

## POS tagging in spaCy

In [None]:
doc = nlp_eng(df.excerpt.values[-2])

In [None]:
pos = [(token.text, token.pos_) for token in doc] 
print(pos)

In [None]:
np.array([i == 'VERB' for i in [token.pos_ for token in doc]]).sum()

In [None]:
spacy.explain('AUX')

In [None]:
dep = [(token.text, token.dep_) for token in doc] 
print(dep)

## Named entity recognition in spaCy

In [None]:
doc.ents

In [None]:
ne = [(ent.text, ent.label_) for ent in doc.ents] 
print(ne)

## Bag of words approach

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer(stop_words='english', lowercase=True,  min_df = 5, max_df = 0.15) 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['excerpt'], df['target'], test_size=0.25, random_state = 0)

In [None]:
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
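To make the document-term matrix less of a black box, here is a standard-library sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented: the tokenization here is a plain whitespace split, and stop-word removal and the `min_df`/`max_df` filters are left out. `bag_of_words` is a hypothetical helper:

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in the corpus
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary term, which is why the matrix gets very wide and sparse on a real corpus.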

In [None]:
X_train_bow.shape

In [None]:
from sklearn.svm import SVR 

reg = SVR()
reg.fit(X_train_bow, y_train)

np.mean((reg.predict(X_test_bow) - y_test)**2)**0.5
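The last expression in the cell above is the root mean squared error (RMSE) of the predictions on the test set. Written out step by step in plain Python, with a hypothetical helper name `rmse`:

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([0.0, 0.0], [3.0, 4.0]))
```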

You can also use n-grams:

In [None]:
bigrams = CountVectorizer(ngram_range=(2,2), min_df = 5, max_df = 0.15) 

ngrams = CountVectorizer(ngram_range=(1,3), min_df = 5, max_df = 0.15)
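The `ngram_range=(min_n, max_n)` argument controls which word sequences become features. A small sketch of the idea on a pre-tokenized sentence, assuming whitespace tokenization; `word_ngrams` is a hypothetical helper, not a scikit-learn function:

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = "the quick brown fox".split()
print(word_ngrams(tokens, 2, 2))  # bigrams only
print(word_ngrams(tokens, 1, 3))  # unigrams, bigrams and trigrams
```

A wider range grows the vocabulary quickly, which is one reason the `min_df` filter matters more with n-grams.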

In [None]:
X_train_bow = ngrams.fit_transform(X_train)
X_test_bow = ngrams.transform(X_test)

In [None]:
print(X_test_bow[:10,:30].toarray())

In [None]:
X_test_bow.shape

In [None]:
reg = SVR()
reg.fit(X_train_bow, y_train)

np.mean((reg.predict(X_test_bow) - y_test)**2)**0.5

And a TF-IDF vectorizer:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer = TfidfVectorizer(stop_words='english', lowercase=True,  min_df = 5, max_df = 0.15)

tfidf_matrix = vectorizer.fit_transform(df['excerpt']) 

print(tfidf_matrix.toarray()[:10,:30])
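For intuition about the numbers in this matrix, the sketch below computes TF-IDF weights by hand. It follows the scheme that `TfidfVectorizer` uses with its default settings as I understand them (raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row), but it tokenizes by whitespace and skips stop words and the `min_df`/`max_df` filters, so the values will not match the cell above. `tfidf` is a hypothetical helper:

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

vocabulary, rows = tfidf(["the cat sat", "the dog sat", "the bird flew"])
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

In the first document, "cat" gets a larger weight than "the" because it appears in fewer documents.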

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True,  min_df = 0.05, max_df = 0.15) 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['excerpt'], df['target'], test_size=0.2, random_state = 0)

In [None]:
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [None]:
X_train_bow.shape

In [None]:
reg = SVR()
reg.fit(X_train_bow, y_train)

np.mean((reg.predict(X_test_bow) - y_test)**2)**0.5

## Similarity measures

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

A = (-1,0,1,2,3,4) 
B = (1,2,3,4,5,6)

score = cosine_similarity([A], [B])

print(score)
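Cosine similarity is the dot product of two vectors divided by the product of their lengths. For the vectors above the dot product is 49 and the squared lengths are 31 and 91. A plain-Python sketch of the same computation (`cosine` is a hypothetical helper):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

A = (-1, 0, 1, 2, 3, 4)
B = (1, 2, 3, 4, 5, 6)
print(cosine(A, B))
```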

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TfidfVectorizer object and build the document-term matrix
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True,  min_df = 0.01, max_df = 0.15)
tfidf_matrix = vectorizer.fit_transform(df['excerpt'])

In [None]:
tfidf_matrix.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Transposing gives a term-by-document matrix, so this is a
# term-to-term (not document-to-document) cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix.T, tfidf_matrix.T)

In [None]:
cosine_sim.shape

In [None]:
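# get_feature_names() was removed in newer scikit-learn releases;
# get_feature_names_out() is available from scikit-learn 1.0
vectorizer.get_feature_names_out()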

In [None]:
np.fill_diagonal(cosine_sim, 0)

In [None]:
import numpy as np

# Term most similar to 'war' (the diagonal was zeroed above, so a term cannot match itself)
features = np.array(vectorizer.get_feature_names_out())
features[cosine_sim[:, features == 'war'].argmax()]

In [None]:
np.max(cosine_sim[0,1:])

In [None]:
from sklearn.metrics.pairwise import linear_kernel

# TF-IDF rows are L2-normalized by default, so the dot product
# between documents equals their cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
cosine_sim

In [None]:
for token1 in doc_eng:
    for token2 in doc_eng:
        sim = token1.similarity(token2)
        if sim > 0.7 and sim != 1:
            print(token1.text, token2.text, sim)

In [None]:
sent1 = nlp_eng("I am happy") 
sent2 = nlp_eng("I am sad") 
sent3 = nlp_eng("I am joyous")

print(sent1.similarity(sent2))
print(sent1.similarity(sent3))
print(sent2.similarity(sent3))

## Pipelines


- nlp.add_pipe(component, last=True)
- nlp.add_pipe(component, first=True)
- nlp.add_pipe(component, before='ner')
- nlp.add_pipe(component, after='tagger')
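The four placement options above only differ in where the new component ends up in the list of pipeline steps. The toy model below mimics that on a plain list of names; it is an illustration of the semantics, not spaCy code, and `place_component` is a hypothetical helper:

```python
def place_component(pipe_names, name, first=False, before=None, after=None):
    """Return a new list with `name` inserted; the default is the end (last=True)."""
    names = list(pipe_names)
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(names)
    names.insert(index, name)
    return names

pipeline = ["tagger", "parser", "ner"]
print(place_component(pipeline, "custom"))
print(place_component(pipeline, "custom", first=True))
print(place_component(pipeline, "custom", before="ner"))
print(place_component(pipeline, "custom", after="tagger"))
```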

In [None]:
nlp_eng.pipeline

In [None]:
def custom_component(doc):
    return doc

nlp_eng.add_pipe(custom_component)

In [None]:
nlp_eng.pipeline

In [None]:
nlp = spacy.load('en_core_web_sm') 
def custom_component(doc):
    print('Doc length:', len(doc)) 
    return doc

nlp.add_pipe(custom_component, first=True) 
print('Pipeline:', nlp.pipe_names)

In [None]:
nlp = spacy.load('en_core_web_sm') 
def custom_component(doc):
    print('Doc length:', len(doc))
    return doc

nlp.add_pipe(custom_component, first=True) 
doc = nlp("Hello world!")

In [None]:
from spacy.tokens import Token

Token.set_extension('is_color', default=False, force=True) 
doc = nlp("The sky is blue.")

doc[3]._.is_color = True

In [None]:
from spacy.tokens import Token

def get_is_color(token):
    colors = ['red', 'yellow', 'blue'] 
    return token.text in colors

Token.set_extension('is_color', getter=get_is_color, force=True) 
doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

In [None]:
from spacy.tokens import Span

def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

Span.set_extension('has_color', getter=get_has_color, force=True)
doc = nlp("The sky is blue.") 
print(doc[1:4]._.has_color, '-', doc[1:4].text) 
print(doc[0:2]._.has_color, '-', doc[0:2].text)

In [None]:
from spacy.tokens import Doc

def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

Doc.set_extension('has_token', method=has_token, force=True)
doc = nlp("The sky is blue.") 
print(doc._.has_token('blue'), '- blue') 
print(doc._.has_token('cloud'), '- cloud')


BAD (processes the texts one at a time):
- docs = [nlp(text) for text in LOTS_OF_TEXTS] 

GOOD (processes the texts as a stream, in batches):
- docs = list(nlp.pipe(LOTS_OF_TEXTS))

In [None]:
data = [
  ('This is a text', {'id': 1, 'page_number': 15}), ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True): 
    print(doc.text, context['page_number'])

In [None]:
from spacy.tokens import Doc 

Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

data = [
('This is a text', {'id': 1, 'page_number': 15}), ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True): 
    doc._.id = context['id']
    doc._.page_number = context['page_number']

BAD (runs the whole pipeline when only a tokenized Doc is needed):
- doc = nlp("Hello world!") 

GOOD (only runs the tokenizer):
- doc = nlp.make_doc("Hello world!")

In [None]:
with nlp.disable_pipes('tagger', 'parser'):
    doc = nlp(df['excerpt'].values[0])
    print(doc.ents)