# Feature engineering

In this notebook I'd like to expand on [@eikedehling's search for features](https://www.kaggle.com/eikedehling/feature-engineering) and, in part, revise his findings.

In [None]:
import pandas as pd
import collections
import string
import seaborn as sns
import matplotlib.pyplot as plt

from nltk import pos_tag
from nltk.corpus import stopwords

Let's add a few constants:

In [None]:
COMMENT = 'comment_text'
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [None]:
train = pd.read_csv("../input/train.csv", encoding="utf-8")
print(train.head())
print("Train: %s samples" % len(train))

In the following I'll add two sets of manually engineered features. The first set takes into account the structure of the comment (lengths, punctuation, capitals, etc.). The second set investigates lexical categories.

The main assumption is that an angry/disgruntled/mad user will follow a particular pattern while writing a toxic comment. Let's see if we can find any.

In my experience, bare counts of occurrences will not help you much in visualizing patterns. Normalization is key here. I will add two normalized variants of each count: one divided by the comment length in characters, and one divided by the word count.
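
As a minimal sketch of the idea, assuming a hypothetical helper `capital_features` that is not part of the notebook, the following shows how one raw count turns into its two normalized variants for a single comment. The two example comments are made up for illustration.

```python
def capital_features(comment):
    """Return the raw capital count and its two normalized variants."""
    length = len(comment)
    words = len(comment.split())
    capitals = sum(1 for c in comment if c.isupper())
    return {
        'capitals': capitals,
        # Guarding against empty comments is an assumption added for this sketch.
        'capitals_vs_length': capitals / length if length else 0.0,
        'capitals_vs_words': capitals / words if words else 0.0,
    }


# Two made-up comments with the same number of capital letters.
shouting = capital_features("STOP IT")
calm = capital_features("I Think The Paris And Rome article is fine as it is")

print(shouting)
print(calm)
```

Both comments contain the same number of capitals, so the raw count cannot tell them apart, while both ratios are much higher for the shouted one.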

In [None]:
train['total_length'] = train[COMMENT].apply(len)

train['words'] = train[COMMENT].apply(lambda comment: len(comment.split()))
train['words_vs_length'] = train['words'] / train['total_length']

train['capitals'] = train[COMMENT].apply(lambda comment: sum(1 for c in comment if c.isupper()))
train['capitals_vs_length'] = train['capitals'] / train['total_length']
train['capitals_vs_words'] = train['capitals'] / train['words']

train['paragraphs'] = train[COMMENT].apply(lambda comment: comment.count('\n'))
train['paragraphs_vs_length'] = train['paragraphs'] / train['total_length']
train['paragraphs_vs_words'] = train['paragraphs'] / train['words']

eng_stopwords = set(stopwords.words("english"))
# Note: str.count matches substrings, so short stopwords such as "a" or "i"
# are also counted when they occur inside longer words.
train['stopwords'] = train[COMMENT].apply(lambda comment: sum(comment.count(w) for w in eng_stopwords))
train['stopwords_vs_length'] = train['stopwords'] / train['total_length']
train['stopwords_vs_words'] = train['stopwords'] / train['words']

train['exclamation_marks'] = train[COMMENT].apply(lambda comment: comment.count('!'))
train['exclamation_marks_vs_length'] = train['exclamation_marks'] / train['total_length']
train['exclamation_marks_vs_words'] = train['exclamation_marks'] / train['words']

train['question_marks'] = train[COMMENT].apply(lambda comment: comment.count('?'))
train['question_marks_vs_length'] = train['question_marks'] / train['total_length']
train['question_marks_vs_words'] = train['question_marks'] / train['words']

train['punctuation'] = train[COMMENT].apply(
    lambda comment: sum(comment.count(w) for w in string.punctuation))
train['punctuation_vs_length'] = train['punctuation'] / train['total_length']
train['punctuation_vs_words'] = train['punctuation'] / train['words']

train['unique_words'] = train[COMMENT].apply(
    lambda comment: len(set(comment.split())))
train['unique_words_vs_length'] = train['unique_words'] / train['total_length']
train['unique_words_vs_words'] = train['unique_words'] / train['words']

repeated_threshold = 15
def count_repeated(text):
    # Sum the occurrences of every word that appears more than repeated_threshold times.
    word_counts = collections.Counter(text.split())
    return sum(count for count in word_counts.values() if count > repeated_threshold)

train['repeated_words'] = train[COMMENT].apply(count_repeated)
train['repeated_words_vs_length'] = train['repeated_words'] / train['total_length']
train['repeated_words_vs_words'] = train['repeated_words'] / train['words']

train['mentions'] = train[COMMENT].apply(
    lambda comment: comment.count("User:"))
train['mentions_vs_length'] = train['mentions'] / train['total_length']
train['mentions_vs_words'] = train['mentions'] / train['words']


train['smilies'] = train[COMMENT].apply(
    lambda comment: sum(comment.count(w) for w in (':-)', ':)', ';-)', ';)')))
train['smilies_vs_length'] = train['smilies'] / train['total_length']
train['smilies_vs_words'] = train['smilies'] / train['words']

train['symbols'] = train[COMMENT].apply(
    lambda comment: sum(comment.count(w) for w in '*&#$%“”¨«»®´·º½¾¿¡§£₤‘’'))
train['symbols_vs_length'] = train['symbols'] / train['total_length']
train['symbols_vs_words'] = train['symbols'] / train['words']
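
One caveat about the `stopwords` feature above: it relies on `comment.count(w)`, and `str.count` counts substring occurrences, so a stopword like "a" is also counted inside "banana". The sketch below compares that behaviour with a token-based count. The helper names and the tiny `SAMPLE_STOPWORDS` set are assumptions made for illustration (a stand-in for NLTK's English list). The correlations reported later in this notebook come from the substring version, so switching to token counts may change them.

```python
import string

# Tiny stand-in for NLTK's English stopword list (an assumption for this sketch).
SAMPLE_STOPWORDS = {'a', 'i', 'is', 'the', 'to'}


def count_stopword_tokens(comment, stop_words=SAMPLE_STOPWORDS):
    """Count whole tokens that are stopwords, ignoring case and edge punctuation."""
    tokens = (t.strip(string.punctuation).lower() for t in comment.split())
    return sum(1 for t in tokens if t in stop_words)


def count_stopword_substrings(comment, stop_words=SAMPLE_STOPWORDS):
    """Mirror the notebook's approach: every substring occurrence counts."""
    return sum(comment.count(w) for w in stop_words)


text = "This is a banana"
print(count_stopword_tokens(text))      # whole-word matches only
print(count_stopword_substrings(text))  # also matches inside "This" and "banana"
```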

Ok, now let's see if the distributions of nouns, verbs and adjectives tell us anything worth the effort of tagging the comment corpus. For this I will use nltk's excellent support for part-of-speech tagging. This part, at least in this form, only applies to English.

In [None]:
def tag_part_of_speech(text):
    # Split on spaces, strip punctuation from each token and drop empty tokens.
    tokens = text.split(' ')
    tokens = [''.join(c for c in s if c not in string.punctuation) for s in tokens]
    tokens = [s for s in tokens if s]
    # pos_tag returns (token, Penn Treebank tag) pairs; it needs NLTK's tagger
    # model, which can be fetched with nltk.download.
    pos_list = pos_tag(tokens)
    noun_count = sum(1 for _, tag in pos_list if tag in ('NN', 'NNP', 'NNPS', 'NNS'))
    adjective_count = sum(1 for _, tag in pos_list if tag in ('JJ', 'JJR', 'JJS'))
    verb_count = sum(1 for _, tag in pos_list if tag in ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'))
    return [noun_count, adjective_count, verb_count]


train['nouns'], train['adjectives'], train['verbs'] = zip(
    *train[COMMENT].apply(tag_part_of_speech))
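
The counting step can also be looked at separately from the tagging. The sketch below is an illustration, not the notebook's code: `count_tag_groups` and `TAG_GROUPS` are hypothetical names, and the tags are written by hand so the example runs without NLTK. It groups the same Penn Treebank tags used in `tag_part_of_speech` and counts them in a single pass.

```python
from collections import Counter

# Penn Treebank tag groups, the same ones used in tag_part_of_speech.
TAG_GROUPS = {
    'nouns': {'NN', 'NNP', 'NNPS', 'NNS'},
    'adjectives': {'JJ', 'JJR', 'JJS'},
    'verbs': {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'},
}


def count_tag_groups(tagged_tokens):
    """Count nouns, adjectives and verbs in a list of (token, tag) pairs."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {group: sum(tag_counts[tag] for tag in tags)
            for group, tags in TAG_GROUPS.items()}


# Hand-written tags for illustration; in the notebook they come from nltk.pos_tag.
tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('fences', 'NNS')]
counts = count_tag_groups(tagged)
print(counts)
```

Keeping the groups in one dictionary makes it easy to try other categories (adverbs, pronouns) without touching the counting logic.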

Ok, let's apply normalization here as well.

In [None]:
train['nouns_vs_length'] = train['nouns'] / train['total_length']
train['adjectives_vs_length'] = train['adjectives'] / train['total_length']
train['verbs_vs_length'] = train['verbs'] / train['total_length']
train['nouns_vs_words'] = train['nouns'] / train['words']
train['adjectives_vs_words'] = train['adjectives'] / train['words']
train['verbs_vs_words'] = train['verbs'] / train['words']

Let's see if everything is ok...

In [None]:
train.head()

Now let's explore the correlation between the added features and the to-be-predicted columns. This should give an indication of whether a model could use these features:

In [None]:
features = ('total_length', 
            'words', 'words_vs_length',
            'capitals', 'capitals_vs_length', 'capitals_vs_words',
            'paragraphs', 'paragraphs_vs_length', 'paragraphs_vs_words',
            'stopwords', 'stopwords_vs_length', 'stopwords_vs_words',
            'exclamation_marks', 'exclamation_marks_vs_length', 'exclamation_marks_vs_words',
            'question_marks', 'question_marks_vs_length', 'question_marks_vs_words',
            'punctuation', 'punctuation_vs_length', 'punctuation_vs_words',
            'unique_words', 'unique_words_vs_length', 'unique_words_vs_words',
            'repeated_words', 'repeated_words_vs_length', 'repeated_words_vs_words',
            'mentions', 'mentions_vs_words', 'mentions_vs_length',
            'smilies', 'smilies_vs_length', 'smilies_vs_words',
            'symbols', 'symbols_vs_length', 'symbols_vs_words',
            'nouns', 'nouns_vs_words', 'nouns_vs_length', 
            'adjectives', 'adjectives_vs_words', 'adjectives_vs_length',
            'verbs', 'verbs_vs_words', 'verbs_vs_length',
           )
train['none'] = 1 - train[LABELS].max(axis=1)

In [None]:
columns = LABELS + ['none']

rows = [{c:train[f].corr(train[c]) for c in columns} for f in features]
train_correlations = pd.DataFrame(rows, index=features)
train_correlations
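
For reference, `Series.corr` computes the Pearson correlation coefficient by default. With a 0/1 label column this measures how strongly the feature value shifts between labelled and unlabelled comments. Below is a minimal pure-Python sketch of the same formula. The `pearson` helper and the toy values are assumptions made for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Made-up toy values: a ratio feature and a binary label.
capitals_ratio = [0.02, 0.03, 0.50, 0.80]
toxic = [0, 0, 1, 1]
r = pearson(capitals_ratio, toxic)
print(round(r, 3))
```

Pearson correlation only captures linear relationships, so a low value in the table does not prove that a feature is useless to a non-linear model.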

Pretty impressive, but still not useful without a proper visualization. Let's see it in heatmap form:

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(train_correlations, annot=True, vmin=-0.23, vmax=0.23, center=0.0, ax=ax)

I may be biased... but wow! :)

Wrapping up...

We can see lots of cells lighting up. In particular:
- number of words: severe toxic comments tend to have a high word count, and toxic comments a somewhat higher count than normal ones.
- capitals: toxic loves capitals. It is probably the most significant indicator we've got.
- paragraphs: probably not so useful, since the correlations are quite low. Still...
- stopwords: clean comments tend to have more of them than toxic ones. It may be worth not filtering them out.
- exclamation marks: toxic comments are quite full of them.
- repeated words: toxic (especially severe toxic) comments tend to have the same words repeated over and over (we used a threshold of 15 occurrences, but we could probably lower it and still get something useful).
- nouns: adjectives and verbs mean little, but nouns are used all over the place in toxic comments. Nice to know.
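
To experiment with a lower repetition threshold, as suggested in the list above, the threshold can be turned into a parameter. This is a minimal sketch: the `threshold` argument is an addition for illustration, while the notebook's `count_repeated` reads the module-level `repeated_threshold`.

```python
from collections import Counter


def count_repeated(text, threshold=15):
    """Sum the occurrences of every word that appears more than `threshold` times."""
    word_counts = Counter(text.split())
    return sum(count for count in word_counts.values() if count > threshold)


# Made-up comment in which each of the two words appears 10 times.
spam = "go away " * 10

print(count_repeated(spam))               # nothing exceeds the default threshold
print(count_repeated(spam, threshold=5))  # both words now count
```

With a parameter in place, the feature could be recomputed for a few thresholds and the resulting correlations compared.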

In addition:
- length vs words: normalizing by length seems to yield more meaningful correlations than normalizing by word count. That seems counterintuitive to me, so it's a good finding in my opinion.
- threat is an outlier: it's really difficult to find a correlation with any indicator I throw at it. It probably demands more attention.
- identity_hate: again difficult. A little help comes from capitals, but not much.
- clean comments are largely uncorrelated with the features that light up for the toxic labels. That is nice.

Ok, all of these may be useful hints for cleaning your corpus during data preparation. But the main point here is that counting occurrences in the comments is all well and good, yet some kind of normalization is needed to get meaningful results. Just look, for example, at the nouns or paragraphs rows: pretty insignificant until you reduce them to fractions.

Feedback or ideas are welcome. 

Thanks for your time!