# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning**: convert characters to lower case and remove special characters.
2. **Text Normalization**: lemmatize the text (stemming is discussed but skipped) and convert categories to numbers.
3. **Text Representation**: use TF-IDF scores to represent the text.

In [29]:
# data manipulation
import pickle
import pandas as pd
import numpy as np
import altair as alt
# text cleaning & normalization
import re
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# text representation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2

# get working directory
import os
cwd = os.getcwd()
parent = os.path.dirname(cwd) 

# Downloading the stop words list
nltk.download('stopwords')
# Downloading punkt, wordnet, averaged_perceptron_tagger from NLTK
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yeqiaoling/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Let us load the dataset!

In [4]:
# training data
path_df_train = parent + "/02. Exploratory Data Analysis/train_dataset.pickle"
with open(path_df_train, 'rb') as data:
    df_train = pickle.load(data)
# test data
path_df_test = parent + "/02. Exploratory Data Analysis/test_dataset.pickle"
with open(path_df_test, 'rb') as data:
    df_test = pickle.load(data)

In [7]:
df_train['Label'] = 'train'
df_test['Label'] = 'test'
columns = ['File_Name', 'Content', 'Category', 'Label']
df = pd.concat([df_train[columns], df_test[columns]])

In [8]:
df.head(10)

| | File_Name | Content | Category | Label |
|---|---|---|---|---|
| 2 | 51119 | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... | alt.atheism | train |
| 3 | 51120 | From: mathew <mathew@mantis.co.uk>\nSubject: R... | alt.atheism | train |
| 4 | 51121 | From: strom@Watson.Ibm.Com (Rob Strom)\nSubjec... | alt.atheism | train |
| 6 | 51123 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |
| 7 | 51124 | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... | alt.atheism | train |
| 8 | 51125 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |
| 9 | 51126 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |
| 10 | 51127 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |
| 11 | 51128 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |
| 12 | 51130 | From: keith@cco.caltech.edu (Keith Allan Schne... | alt.atheism | train |


In [9]:
df.iloc[6]['Content']

"From: keith@cco.caltech.edu (Keith Allan Schneider)\nSubject: Re: >>>>>>Pompous ass\nOrganization: California Institute of Technology, Pasadena\nLines: 9\nNNTP-Posting-Host: punisher.caltech.edu\n\nkmr4@po.CWRU.edu (Keith M. Ryan) writes:\n\n>>Then why do people keep asking the same questions over and over?\n>Because you rarely ever answer them.\n\nNope, I've answered each question posed, and most were answered multiple\ntimes.\n\nkeith"

## 1. Text cleaning 

### 1.1. Downcase

We lowercase all text, assuming that upper-case and lower-case forms of a word carry the same meaning.

In [10]:
df['Content_Parsed_1'] = df['Content'].str.lower()

### 1.2. Redundant character removal

We can remove the following redundant characters and words because they carry little predictive power.

1. **special text** (escape sequences and extra whitespace), such as ``\n``, ``\r`` and repeated spaces
2. **punctuation signs**, such as ``"``, ``.``, ``;``...
3. **stop words**, such as ``to``, ``its``, ...

#### 1.2.1 Remove special characters and possessive endings

In [11]:
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("'s", "")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("'", "")

#### 1.2.2 Remove punctuation signs and other redundant symbols

In [12]:
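# every character in this string is replaced by a space
special_symbols = list('<>{}[]?:!.,;()/~!@#$^&*_-+=|0123456789\\\"\'')
for special_symbol in special_symbols:
    # regex=False treats symbols such as '.', '(' or '\' literally, whatever the pandas default is
    df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace(special_symbol, ' ', regex=False)
    df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('  ', ' ', regex=False)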
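
The loop above makes one pass over the column per symbol. As an alternative sketch, the same symbols can be folded into a single regular-expression character class. The helper name `strip_symbols` is made up for this illustration, and unlike the loop it squeezes every run of spaces down to one.

```python
import re

# Same symbols as in the cell above (the duplicate '!' is harmless inside a class).
SPECIAL_SYMBOLS = '<>{}[]?:!.,;()/~!@#$^&*_-+=|0123456789\\"\''

# re.escape keeps characters such as '.', '[' or '\' literal inside the class.
_SYMBOL_RE = re.compile('[' + re.escape(SPECIAL_SYMBOLS) + ']')
_SPACES_RE = re.compile(r' {2,}')

def strip_symbols(text):
    """Replace each special symbol with a space, then squeeze repeated spaces."""
    text = _SYMBOL_RE.sub(' ', text)
    return _SPACES_RE.sub(' ', text)

print(strip_symbols('lines: 88 (see rule #3), e.g. a+b=c'))
```

On a pandas column this could be applied with `df['Content_Parsed_1'].apply(strip_symbols)`.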

#### 1.2.3 Remove stop words, single letters and other unneeded words

In [14]:
stop_words = list(stopwords.words('english'))
stop_word_regex = "\\b(" + "|".join(stop_words) + ")\\b"
df['Content_Parsed_1'] = df['Content_Parsed_1'].apply(lambda s : re.sub(stop_word_regex, '', s))

single_letters = 'abcdefghijklmnopqrstuvwxyz'
for single_letter in single_letters:
    df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace(' ' + single_letter + ' ', ' ')
    
redundant_words = 'et al'.split()
for redundant_word in redundant_words:
    df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace(' ' + redundant_word + ' ', ' ')
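
To see why the stop-word pattern is wrapped in ``\b`` word boundaries, here is a small self-contained sketch. It uses a four-word stand-in list (`toy_stop_words`, an assumption for this example) instead of the NLTK list, and it also tidies the whitespace left behind, which the cell above does not do.

```python
import re

# A tiny stand-in stop list; the notebook uses NLTK's English list instead.
toy_stop_words = ['to', 'its', 'the', 'is']

# re.escape guards against entries that contain regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, toy_stop_words)) + r')\b')

def remove_stop_words(text):
    """Delete whole-word matches only, then normalize the leftover whitespace."""
    return ' '.join(pattern.sub(' ', text).split())

print(remove_stop_words('the tomato is in its pot to ripen'))
```

Without the boundaries, `to` would also be cut out of `tomato`.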

In [17]:
df['Content_Parsed_1'][1]

' cfaehl vesta unm edu chris faehl subject  amusing atheists  agnostics organization university  new mexico albuquerque lines distribution world nntp posting host vesta unm edu  article timmbake mcl timmbake mcl ucsb edu clam bake timmons writes fallacy atheism   faith lo  hear  faq beckoning   wonderful rule deleted youre correct  didnt say anything   conspiracy correction hard atheism   faith yes rule dont mix apples  oranges    say   extermination   mongols  worse  stalin khan conquered people unsympathetic   cause   atrocious  stalin killed millions    people  loved  worshipped    atheist state   anyone  worse      explain     stalin  nothing   name  atheism whethe       atheist  irrelevant get  grip man  stalin example  brought     indictment  atheism  merely  another example   people  kill others   name  fit   occasion  look    never said   implication  pretty clear im sorry     respond   words   true meaning usenet   slippery medium deleted wrt  burden  proof  hard atheism  noth

In [18]:
df['Content'][1]

'From: cfaehl@vesta.unm.edu (Chris Faehl)\nSubject: Re: Amusing atheists and agnostics\nOrganization: University of New Mexico, Albuquerque\nLines: 88\nDistribution: world\nNNTP-Posting-Host: vesta.unm.edu\n\nIn article <timmbake.735265296@mcl>, timmbake@mcl.ucsb.edu ("Clam" Bake Timmons) writes:\n\n> \n> >Fallacy #1: Atheism is a faith. Lo! I hear the FAQ beckoning once again...\n> >[wonderful Rule #3 deleted - you\'re correct, you didn\'t say anything >about\n> >a conspiracy]\n> \n> Correction: _hard_ atheism is a faith.\n\nYes.\n \n> \n> >>Rule #4:  Don\'t mix apples with oranges.  How can you say that the\n> >>extermination by the Mongols was worse than Stalin?  Khan conquered people\n> >>unsympathetic to his cause.That was atrocious.But Stalin killed millions of\n> >>his own people who loved and worshipped _him_ and his atheist state!!How can\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^     \n> >>anyone be worse than that?\n> \n> >I will not explain thi

## 2. Text normalization

### 2.1. Stemming 
Stemming can produce output that is not a real word, and it modifies the text more aggressively than lemmatization. To save time, we skip this strategy for now.

### 2.2. Lemmatization

Lemmatization uses morphological analysis to return the ``lemma`` (dictionary form) of each word. We use it to normalize the text.

In [19]:
# map a Penn Treebank POS tag (as returned by pos_tag) to the matching WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
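
To check how this mapping behaves without downloading any NLTK data, the sketch below re-creates it with plain strings. It assumes the WordNet constants are the one-letter strings `'a'`, `'v'`, `'n'` and `'r'`, as NLTK defines them; the name `to_wordnet_pos` and the hand-written tagged tokens are made up for this example.

```python
# Stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV (assumed values).
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    prefixes = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}
    return prefixes.get(tag[:1])

# Hand-written (token, tag) pairs in the shape that pos_tag returns.
tagged = [('cats', 'NNS'), ('running', 'VBG'), ('quickly', 'RB'),
          ('big', 'JJ'), ('the', 'DT')]

# Unmapped tags fall back to NOUN, as in the lemmatization loop.
for token, tag in tagged:
    print(token, tag, to_wordnet_pos(tag) or NOUN)
```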

In [20]:
lemmatized_text_list = []
# create the lemmatizer once instead of once per document
wnl = WordNetLemmatizer()
# tokenize each document, tag each token, and lemmatize it with its POS
for document in df['Content_Parsed_1']:
    tokens = word_tokenize(document)
    tagged_tokens = pos_tag(tokens)
    lemmas = []
    for token, tag in tagged_tokens:
        # fall back to NOUN when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        lemmas.append(wnl.lemmatize(token, pos = wordnet_pos))
    lemmatized_text_list.append(' '.join(lemmas))
    

In [23]:
df['Content_Parsed_2'] = lemmatized_text_list

In [24]:
df['Content_Parsed_2'][2]

2    dbstu rz tu b de benedikt rosenau subject gosp...
2    mathew mathew mantis co uk subject yet rushdie...
Name: Content_Parsed_2, dtype: object

### Let us take a look at the text after cleaning and normalization!

In [25]:
list_columns = ['File_Name', 'Category', 'Content', 'Content_Parsed_2', 'Label']
df = df[list_columns]
df = df.rename(columns={'Content_Parsed_2': 'Content_Parsed'})

In [26]:
df.head()

| | File_Name | Category | Content | Content_Parsed | Label |
|---|---|---|---|---|---|
| 2 | 51119 | alt.atheism | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... | dbstu rz tu b de benedikt rosenau subject gosp... | train |
| 3 | 51120 | alt.atheism | From: mathew <mathew@mantis.co.uk>\nSubject: R... | mathew mathew mantis co uk subject university ... | train |
| 4 | 51121 | alt.atheism | From: strom@Watson.Ibm.Com (Rob Strom)\nSubjec... | strom watson ibm com rob strom subject soc mot... | train |
| 6 | 51123 | alt.atheism | From: keith@cco.caltech.edu (Keith Allan Schne... | keith cco caltech edu keith allan schneider su... | train |
| 7 | 51124 | alt.atheism | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... | dbstu rz tu b de benedikt rosenau subject anec... | train |


## 3. Text Representation


### 3.1 Label coding

First, let us convert the categorical variable to a numerical one, using a dictionary that maps each category to an integer code.

In [27]:
# generate category codes in order of first appearance
categories = df['Category'].unique()
category_codes = {cat: code for code, cat in enumerate(categories)}
# add code column
df['Category_Code'] = df['Category'].map(category_codes)
# split X and y
df_train, df_test = df[df['Label'] == 'train'], df[df['Label'] == 'test']
X_train, y_train = df_train['Content_Parsed'], df_train['Category_Code']
X_test, y_test = df_test['Content_Parsed'], df_test['Category_Code']
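
The label coding can be tried on a toy list first. This is a minimal sketch with made-up names (`toy_categories`, `toy_codes`, `code_to_category`); it also builds the inverse mapping, which is useful later for turning predicted codes back into category names.

```python
# Toy category column; the notebook uses df['Category'] instead.
toy_categories = ['alt.atheism', 'comp.graphics', 'alt.atheism', 'misc.forsale']

# Assign codes in order of first appearance (dict.fromkeys keeps that order).
toy_codes = {cat: code for code, cat in enumerate(dict.fromkeys(toy_categories))}

# Inverse mapping, handy for turning predictions back into names.
code_to_category = {code: cat for cat, code in toy_codes.items()}

encoded = [toy_codes[cat] for cat in toy_categories]
print(toy_codes)
print(encoded)
```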


Let us take a look at the distribution of the response variable in the training and test sets.

In [30]:
def bar_plot(df, TITLE = 'Category Histogram in Training Data', XVAL = 'Category_Code'):
    alt.data_transformers.disable_max_rows()
    bars = alt.Chart(df).mark_bar(size=20).encode(
        x = alt.X(XVAL),
        y = alt.Y('count():Q', axis = alt.Axis(title = 'Number of articles')),
        tooltip = [alt.Tooltip('count()', title = 'Number of articles'), XVAL],
        color = XVAL
    )

    text = bars.mark_text(
        align = 'center',
        baseline = 'bottom',
    ).encode(
        text = 'count()'
    )

    return (bars + text).interactive().properties(
        height = 300, 
        width = 700,
        title = TITLE,
    )
bar_plot(pd.DataFrame(y_train))

In [31]:
bar_plot(pd.DataFrame(y_test), TITLE = 'Category Histogram in Test Data')

### 3.2 Text representation

There are several options to represent text:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

Due to time constraints, we'll use **TF-IDF Vectors** as features. TF-IDF is short for ``Term Frequency-Inverse Document Frequency``. 
It converts text to vectors based on word frequencies, without taking the order of the words into account. 
* Let $TF_i(x)$ be the frequency of word $x$ in document $i$, $N$ the number of documents in the dataset, and $N(x)$ the number of documents that contain $x$. 
* The inverse document frequency is $IDF(x) = \log\left(\frac {N + 1} {N(x) + 1}\right) + 1$.
* The TF-IDF score of $x$ in document $i$ is $TF_i(x) \times IDF(x)$.
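
As a worked example of these formulas, the sketch below scores a three-document toy corpus by hand. It assumes the natural logarithm, takes the raw count of a word in a document as its term frequency, counts the documents containing a word for the IDF, and applies no further scaling; the corpus and the helper names (`idf`, `tfidf_scores`) are made up for illustration.

```python
import math
from collections import Counter

docs = [
    'god bible religion',
    'image file color',
    'god image',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency with natural log."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_scores(tokens):
    """Raw term count times IDF for every term in one document."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

scores = tfidf_scores(tokenized[0])
for term, score in sorted(scores.items()):
    print(f'{term}: {score:.3f}')
```

In the first document, `bible` scores higher than `god` because `god` also appears in another document and therefore has a lower IDF.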

We can define the following parameters:

* `ngram_range`: the range of n-gram sizes to extract. We use `(1, 2)` to consider both unigrams and bigrams.
* `max_df`: when building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: when building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: if not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.
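
The effect of these thresholds can be illustrated with a toy vocabulary builder. This is a simplified sketch and not scikit-learn's implementation: `build_vocabulary` is a made-up helper, `min_df` is treated as an absolute document count, `max_df` as a proportion of documents, and ties in term frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

docs = [
    'window driver file',
    'window file',
    'window scsi drive',
    'window mac',
]
tokenized = [doc.split() for doc in docs]

def build_vocabulary(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Toy vocabulary builder: min_df is an absolute count, max_df a proportion."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(t for tokens in tokenized_docs for t in set(tokens))
    term_freq = Counter(t for tokens in tokenized_docs for t in tokens)
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]
    # Keep the most frequent terms first; ties are broken alphabetically here.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return sorted(kept[:max_features])

print(build_vocabulary(tokenized, min_df=2))
print(build_vocabulary(tokenized, max_df=0.75))
print(build_vocabulary(tokenized, max_features=2))
```

With `min_df=2` only terms found in at least two documents survive, while `max_df=0.75` drops `window`, which appears in every document.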

See `TfidfVectorizer?` for further detail.

Note that the `norm` argument implicitly scales our data: with `norm = 'l2'`, each document vector is rescaled to unit Euclidean length.
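
A minimal sketch of what L2 normalization does to a document vector, using plain Python lists and a made-up helper `l2_normalize`. Two documents with the same word proportions but different lengths end up with the same normalized vector.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same word proportions, ten times longer

print(l2_normalize(short_doc))
print(l2_normalize(long_doc))
```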

#### 3.2.1. Parameter selection

We have chosen these values as a first approximation, and these values are subject to change. 

In [32]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 500

#### 3.2.2. Fit and transform training data (or just transform testing data)

We fit and then transform the training set, but only transform the test set.
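
The reason for this split is that the vocabulary (and, for TF-IDF, the document frequencies) should come from the training data only. The toy class below is a stand-in written for this example, not scikit-learn's vectorizer; it only counts words, but it shows that terms unseen during `fit` are dropped when the test set is transformed.

```python
from collections import Counter

class ToyCountVectorizer:
    """Minimal stand-in for a vectorizer, only to show the fit/transform split."""

    def fit(self, docs):
        # The vocabulary is learned from the documents passed to fit, and nowhere else.
        terms = sorted({t for doc in docs for t in doc.split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            # Terms missing from the fitted vocabulary are silently dropped.
            rows.append([counts.get(term, 0) for term in self.vocabulary_])
        return rows

train_docs = ['god bible', 'image file image']
test_docs = ['bible scsi', 'file file god']

vec = ToyCountVectorizer().fit(train_docs)
print(vec.vocabulary_)
print(vec.transform(train_docs))
print(vec.transform(test_docs))
```

The test word `scsi` gets no column, so both matrices have the same number of features.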

In [33]:
tfidf = TfidfVectorizer(encoding = 'utf-8',
                        ngram_range = ngram_range,
                        stop_words = None,
                        lowercase = False,
                        max_df = max_df,
                        min_df = min_df,
                        max_features = max_features,
                        norm = 'l2',
                        sublinear_tf = True)
                        
features_train = tfidf.fit_transform(X_train).toarray()  # numpy.ndarray
labels_train = y_train                                   # pandas.core.series.Series
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(11018, 500)
(7761, 500)


#### 3.2.3. Feature screening (a first check)

We can use the [chi-squared test](https://stattrek.com/chi-square-test/independence.aspx) to see which unigrams and bigrams are most strongly associated with each category.
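
To make the idea concrete, the sketch below computes a chi-squared score per feature for one category against all the others, comparing the observed per-class feature totals with the totals expected if the feature were independent of the class. It is our own pure-Python illustration on made-up counts, and it is not guaranteed to reproduce the output of `sklearn.feature_selection.chi2` exactly.

```python
def chi2_scores(features, is_target):
    """Chi-squared score per feature column for a binary class indicator.

    features: list of rows of non-negative values (counts or TF-IDF weights).
    is_target: list of booleans, True where the document belongs to the category.
    """
    n_docs = len(features)
    n_features = len(features[0])
    class_prob = {True: sum(is_target) / n_docs}
    class_prob[False] = 1 - class_prob[True]

    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in features)
        score = 0.0
        for cls in (True, False):
            observed = sum(row[j] for row, y in zip(features, is_target) if y == cls)
            expected = class_prob[cls] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Columns: counts of 'god', 'file', 'article' in four toy documents.
features = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 1],
    [0, 1, 1],
]
is_target = [True, True, False, False]   # the first two documents are in the category

terms = ['god', 'file', 'article']
toy_scores = chi2_scores(features, is_target)
for term, score in sorted(zip(terms, toy_scores), key=lambda pair: -pair[1]):
    print(term, round(score, 3))
```

A word spread evenly across classes, such as `article` here, scores zero, while words concentrated in one class score high. Note that a high score only says the word is unevenly distributed; `file` scores high because it is typical of the other class.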

In [35]:
for category, category_id in sorted(category_codes.items()):
    # one-vs-rest chi-squared scores: this category against all the others
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'alt.atheism' category:
  . Most correlated unigrams:
. cwru
. bible
. religion
. god
. keith
  . Most correlated bigrams:
. can not
. cwru edu

# 'comp.graphics' category:
  . Most correlated unigrams:
. program
. color
. file
. image
. graphic
  . Most correlated bigrams:
. newsreader tin
. ac uk

# 'comp.os.ms-windows.misc' category:
  . Most correlated unigrams:
. win
. file
. driver
. window
. windows
  . Most correlated bigrams:
. netcom com
. nasa gov

# 'comp.sys.ibm.pc.hardware' category:
  . Most correlated unigrams:
. pc
. drive
. mb
. card
. scsi
  . Most correlated bigrams:
. line article
. newsreader tin

# 'comp.sys.mac.hardware' category:
  . Most correlated unigrams:
. drive
. scsi
. monitor
. apple
. mac
  . Most correlated bigrams:
. article apr
. cwru edu

# 'comp.windows.x' category:
  . Most correlated unigrams:
. sun
. display
. application
. window
. mit
  . Most correlated bigrams:
. ac uk
. mit edu

# 'misc.forsale' category:
  . Most correlated unigrams:
. 

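As we can see, each category has its own characteristic keywords. Because we cap the vocabulary at 500 features, fewer bigrams than unigrams make it in. Several of the top bigrams (for example ``cwru edu``, ``ac uk`` and ``newsreader tin``) appear to come from e-mail addresses and header lines rather than from the topic itself.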

## Let's save the data!

In [37]:
# make sure the output folder exists
os.makedirs('Pickles', exist_ok=True)

# objects to save, keyed by file name (without extension)
objects_to_save = {
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'df': df,
    'df_train': df_train,
    'df_test': df_test,
    'features_train': features_train,
    'labels_train': labels_train,
    'features_test': features_test,
    'labels_test': labels_test,
    'tfidf': tfidf,   # fitted TF-IDF object
}
for name, obj in objects_to_save.items():
    with open('Pickles/' + name + '.pickle', 'wb') as output:
        pickle.dump(obj, output)
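
Later notebooks can read these files back with `pickle.load`. The sketch below shows the round trip with two made-up helpers (`save_pickle`, `load_pickle`) and a temporary folder, so it does not touch anything under `Pickles/`.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path in pickle format."""
    with open(path, 'wb') as output:
        pickle.dump(obj, output)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, 'rb') as data:
        return pickle.load(data)

# Round trip in a temporary folder, so nothing under Pickles/ is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'labels_train.pickle')
    save_pickle([0, 0, 1, 2], path)
    restored = load_pickle(path)

print(restored)
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.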