# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [40]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [41]:
path_df = "C:/Users/Sai Kaushik/Desktop/Feature Engineering/ICE 1/Latest-News-Classifier-master/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [42]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length
0,001.txt,Why the feds are investigating Teslaâ€™s Autop...,Tesla,001.txt-Tesla,1,5657
1,002.txt,Tesla Wants To Launch Full Self-Driving Public...,Tesla,002.txt-Tesla,1,2592
2,003.txt,How Good Is Tesla Full Self-Driving (Beta) Rig...,Tesla,003.txt-Tesla,1,6568
3,004.txt,Tesla Must Send Autopilot Data to Feds by Octo...,Tesla,004.txt-Tesla,1,3704
4,005.txt,Survey Reveals Tesla's Full Self-Driving Take ...,Tesla,005.txt-Tesla,1,3994


And look at the content of one sample article:

In [43]:
df.loc[1]['Content']

"Tesla Wants To Launch Full Self-Driving Public Beta In September\r\nAfter a few days ago Elon Musk announced that the public debut of Full Self-Driving Beta was just weeks away, now the Tesla CEO has tweeted the exact time and date when Version 10 Beta will be made available to beta testers. Musk has now officially set the date for next Friday and he expects that about two weeks after that (around September 25), an updated version, Beta 10.1 is expected to be good enough for public rollout.\r\n\r\nSo after being postponed several times and many months, it looks like all Tesla owners whose vehicles have the right hardware will be able to experience FSD. The software and the neural network behind it have been improved a lot recently and Tesla opted to eliminate the need for anything other than cameras with Version 9 of the system, what the manufacturer called 'Pure Vision.'\r\n\r\nThe way Elon phrased it in his tweet, though, itâ€™s clear heâ€™s not 100 percent confident that Version 10

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive endings (`government's` = `government\'s`)
* ``\`` before possessive endings on words ending in *s* (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [44]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the third and fourth bullets: although there appears to be a special character, it won't affect us because the backslash is only an escape sequence, not a *real* character in the string:

In [45]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [46]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
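
The steps of this subsection can be tried on a plain string without the DataFrame. The following is a minimal sketch, assuming a made-up sample sentence (`raw`); it collapses every run of whitespace with a regular expression, which is a slightly broader variant of replacing four spaces with one.

```python
import re

raw = "Musk said: \"it\'s weeks away\"\r\n\r\nSo after being postponed"

# Carriage returns and line feeds become spaces, quotation marks are dropped.
cleaned = raw.replace("\r", " ").replace("\n", " ")
cleaned = cleaned.replace('"', "")

# Collapse any run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", cleaned).strip()

print(cleaned)
print("\\" in raw)  # \' is only an escape sequence; no backslash is stored
```

The second `print` confirms the point made about the third and fourth bullets: the backslash never reaches the stored string.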

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [47]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [48]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: treat signs such as "." and "?" as literal characters
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)


Removing these signs also alters some numbers (for example, `2.5` becomes `25`), but that's not a problem since we don't expect numbers to have any predictive power.
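
Both the effect on numbers and the reason for treating the signs as literal characters can be checked on a plain string. This is a sketch using only the standard library; `sample` is a made-up sentence, not taken from the dataset.

```python
import re

punctuation_signs = list("?:!.,;")
sample = "Shares rose 2.5% to $1,200. Really?"

# Literal replacement, one sign at a time (same idea as the loop above).
cleaned = sample
for sign in punctuation_signs:
    cleaned = cleaned.replace(sign, "")
print(cleaned)

# The same result with a single regular expression; re.escape makes
# metacharacters such as "." and "?" match themselves.
pattern = "[" + re.escape("".join(punctuation_signs)) + "]"
cleaned_regex = re.sub(pattern, "", sample)

# Without escaping, "." is a wildcard and matches every character.
wiped = re.sub(".", "", sample)
print(repr(wiped))
```

The last substitution is the failure mode to avoid: an unescaped `.` used as a regular expression removes the whole text, not just the full stops.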

### 1.4. Possessive pronouns

We'll also remove the possessive `'s` endings:

In [49]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
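
The replacement above targets the straight apostrophe only. News text often uses the typographic apostrophe (`’`) instead, which that pattern does not match. The sketch below is not part of the original pipeline: `strip_possessive` is a hypothetical helper, and it assumes the text has been decoded correctly so that the typographic apostrophe appears as a single character.

```python
import re

def strip_possessive(text: str) -> str:
    """Remove a word-final 's written with a straight or typographic apostrophe."""
    # \u2019 is the typographic apostrophe; \b requires the "s" to end the word.
    return re.sub(r"['\u2019]s\b", "", text)

result = strip_possessive("tesla's autopilot and the government\u2019s probe")
print(result)
```

With pandas, the same pattern could be passed to `str.replace` with `regex=True`.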

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [50]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to C:\Users\Sai
[nltk_data]     Kaushik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Sai
[nltk_data]     Kaushik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [51]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [52]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [53]:
df['Content_Parsed_5'] = lemmatized_text_list

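Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below, where plural nouns such as `vehicles` are left unchanged because every word is lemmatized as a verb with `pos="v"`), it can be useful.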

### 1.6. Stop words

In [54]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Sai
[nltk_data]     Kaushik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [56]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that matches whole words only, as seen in the following example:

In [57]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'
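
To see why the `\b` word boundaries are needed, the same substitution can be run with and without them. This sketch reuses the example above; `re.escape` is an addition that keeps the pattern safe if a word ever contains a regular-expression metacharacter.

```python
import re

example = "me eating a meal"
word = "me"

# Without word boundaries the pattern also fires inside "meal".
without_boundaries = re.sub(word, "StopWord", example)

# With \b on both sides only the standalone word is replaced.
with_boundaries = re.sub(r"\b" + re.escape(word) + r"\b", "StopWord", example)

print(without_boundaries)
print(with_boundaries)
```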

We can now loop through all the stop words:

In [58]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True: the \b word boundaries must be interpreted as a regular expression
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)


We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
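
As an alternative to one replacement pass per stop word, all stop words can be combined into a single pattern, and the leftover spaces can be collapsed at the same time. This is a sketch under assumptions: the short `stop_words` list stands in for the NLTK list, and `remove_stop_words` is a hypothetical helper that is not used elsewhere in this notebook.

```python
import re

# Stand-in for the NLTK list, kept short so the sketch has no dependencies.
stop_words = ["the", "of", "a", "is", "in"]

# One alternation wrapped in word boundaries: \b(?:the|of|a|is|in)\b
stopword_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b"
)

def remove_stop_words(text: str) -> str:
    without_stop_words = stopword_pattern.sub("", text)
    # split() with no argument drops the runs of spaces left by the removals.
    return " ".join(without_stop_words.split())

result = remove_stop_words("the inside of a tesla vehicle is viewed in brooklyn")
print(result)
```

Words that merely contain a stop word, such as `inside`, are left intact because of the word boundaries.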

As an example, we'll show an original news article and its modifications throughout the process:

In [59]:
df.loc[5]['Content']

'Amid Teslaâ€™s Autopilot Probe, Nearly Half the Public Thinks Autonomous Vehicles Are Less Safe Than Normal Cars\r\nMore than 50% of Americans have not heard much or anything about the crashes involving Tesla vehicles using â€œAutopilotâ€\x9d or the federal governmentâ€™s investigation into the matter.\r\n\r\n17% believe autonomous vehicles are as safe as cars driven by humans, up from 8% in 2018.\r\n\r\n37% of U.S. adults said they may ride in an autonomous vehicle in the future and 34% said they would not.\r\n\r\nThe inside of a Tesla vehicle is viewed as it sits parked in a new Tesla showroom and service center in Red Hook, Brooklyn, on July 5, 2016, in New York City. The electric car company and its chief executive and founder, Elon Musk, have come under increasing scrutiny following a crash of one of its electric cars while using the Autopilot service. (Spencer Platt/Getty Images)\r\n\r\nAs the federal government investigates Tesla Inc. for crashes involving its vehicles using th

1. Special character cleaning

In [60]:
df.loc[5]['Content_Parsed_1']

'Amid Teslaâ€™s Autopilot Probe, Nearly Half the Public Thinks Autonomous Vehicles Are Less Safe Than Normal Cars  More than 50% of Americans have not heard much or anything about the crashes involving Tesla vehicles using â€œAutopilotâ€\x9d or the federal governmentâ€™s investigation into the matter. 17% believe autonomous vehicles are as safe as cars driven by humans, up from 8% in 2018. 37% of U.S. adults said they may ride in an autonomous vehicle in the future and 34% said they would not. The inside of a Tesla vehicle is viewed as it sits parked in a new Tesla showroom and service center in Red Hook, Brooklyn, on July 5, 2016, in New York City. The electric car company and its chief executive and founder, Elon Musk, have come under increasing scrutiny following a crash of one of its electric cars while using the Autopilot service. (Spencer Platt/Getty Images) As the federal government investigates Tesla Inc. for crashes involving its vehicles using the â€œAutopilotâ€\x9d feature, 

2. Upcase/downcase

In [61]:
df.loc[5]['Content_Parsed_2']

'amid teslaâ€™s autopilot probe, nearly half the public thinks autonomous vehicles are less safe than normal cars  more than 50% of americans have not heard much or anything about the crashes involving tesla vehicles using â€œautopilotâ€\x9d or the federal governmentâ€™s investigation into the matter. 17% believe autonomous vehicles are as safe as cars driven by humans, up from 8% in 2018. 37% of u.s. adults said they may ride in an autonomous vehicle in the future and 34% said they would not. the inside of a tesla vehicle is viewed as it sits parked in a new tesla showroom and service center in red hook, brooklyn, on july 5, 2016, in new york city. the electric car company and its chief executive and founder, elon musk, have come under increasing scrutiny following a crash of one of its electric cars while using the autopilot service. (spencer platt/getty images) as the federal government investigates tesla inc. for crashes involving its vehicles using the â€œautopilotâ€\x9d feature, 

3. Punctuation signs

In [62]:
df.loc[5]['Content_Parsed_3']

'amid teslaâ€™s autopilot probe nearly half the public thinks autonomous vehicles are less safe than normal cars  more than 50% of americans have not heard much or anything about the crashes involving tesla vehicles using â€œautopilotâ€\x9d or the federal governmentâ€™s investigation into the matter 17% believe autonomous vehicles are as safe as cars driven by humans up from 8% in 2018 37% of us adults said they may ride in an autonomous vehicle in the future and 34% said they would not the inside of a tesla vehicle is viewed as it sits parked in a new tesla showroom and service center in red hook brooklyn on july 5 2016 in new york city the electric car company and its chief executive and founder elon musk have come under increasing scrutiny following a crash of one of its electric cars while using the autopilot service (spencer platt/getty images) as the federal government investigates tesla inc for crashes involving its vehicles using the â€œautopilotâ€\x9d feature a new poll indica

4. Possessive pronouns

In [63]:
df.loc[5]['Content_Parsed_4']

'amid teslaâ€™s autopilot probe nearly half the public thinks autonomous vehicles are less safe than normal cars  more than 50% of americans have not heard much or anything about the crashes involving tesla vehicles using â€œautopilotâ€\x9d or the federal governmentâ€™s investigation into the matter 17% believe autonomous vehicles are as safe as cars driven by humans up from 8% in 2018 37% of us adults said they may ride in an autonomous vehicle in the future and 34% said they would not the inside of a tesla vehicle is viewed as it sits parked in a new tesla showroom and service center in red hook brooklyn on july 5 2016 in new york city the electric car company and its chief executive and founder elon musk have come under increasing scrutiny following a crash of one of its electric cars while using the autopilot service (spencer platt/getty images) as the federal government investigates tesla inc for crashes involving its vehicles using the â€œautopilotâ€\x9d feature a new poll indica

5. Stemming and Lemmatization

In [64]:
df.loc[5]['Content_Parsed_5']

'amid teslaâ€™s autopilot probe nearly half the public think autonomous vehicles be less safe than normal cars  more than 50% of americans have not hear much or anything about the crash involve tesla vehicles use â€œautopilotâ€\x9d or the federal governmentâ€™s investigation into the matter 17% believe autonomous vehicles be as safe as cars drive by humans up from 8% in 2018 37% of us adults say they may ride in an autonomous vehicle in the future and 34% say they would not the inside of a tesla vehicle be view as it sit park in a new tesla showroom and service center in red hook brooklyn on july 5 2016 in new york city the electric car company and its chief executive and founder elon musk have come under increase scrutiny follow a crash of one of its electric cars while use the autopilot service (spencer platt/getty images) as the federal government investigate tesla inc for crash involve its vehicles use the â€œautopilotâ€\x9d feature a new poll indicate much of the public have safet

6. Stop words

In [65]:
df.loc[5]['Content_Parsed_6']

'amid teslaâ€™ autopilot probe nearly half  public think autonomous vehicles  less safe  normal cars    50%  americans   hear much  anything   crash involve tesla vehicles use â€œautopilotâ€\x9d   federal governmentâ€™ investigation   matter 17% believe autonomous vehicles   safe  cars drive  humans   8%  2018 37%  us adults say  may ride   autonomous vehicle   future  34% say  would   inside   tesla vehicle  view   sit park   new tesla showroom  service center  red hook brooklyn  july 5 2016  new york city  electric car company   chief executive  founder elon musk  come  increase scrutiny follow  crash  one   electric cars  use  autopilot service (spencer platt/getty images)   federal government investigate tesla inc  crash involve  vehicles use  â€œautopilotâ€\x9d feature  new poll indicate much   public  safety concern  autonomous vehicles though americansâ€™ interest   cars  rise slightly compare    years ago forty-seven percent  us adults say   new morning consult poll   believe a

Finally, we can delete the intermediate columns:

In [66]:
df.head(1)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,001.txt,Why the feds are investigating Teslaâ€™s Autop...,Tesla,001.txt-Tesla,1,5657,Why the feds are investigating Teslaâ€™s Autop...,why the feds are investigating teslaâ€™s autop...,why the feds are investigating teslaâ€™s autop...,why the feds are investigating teslaâ€™s autop...,why the feds be investigate teslaâ€™s autopilo...,feds investigate teslaâ€™ autopilot mean...


In [67]:
list_columns = ["File_Name", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [68]:
df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed
0,001.txt,Tesla,001.txt-Tesla,Why the feds are investigating Teslaâ€™s Autop...,feds investigate teslaâ€™ autopilot mean...
1,002.txt,Tesla,002.txt-Tesla,Tesla Wants To Launch Full Self-Driving Public...,tesla want launch full self-driving public be...
2,003.txt,Tesla,003.txt-Tesla,How Good Is Tesla Full Self-Driving (Beta) Rig...,good tesla full self-driving (beta) right (...
3,004.txt,Tesla,004.txt-Tesla,Tesla Must Send Autopilot Data to Feds by Octo...,tesla must send autopilot data feds october ...
4,005.txt,Tesla,005.txt-Tesla,Survey Reveals Tesla's Full Self-Driving Take ...,survey reveal tesla full self-driving take rat...


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers on demand. For that reason, we need to account not only for the peculiarities of the training-set articles, but also for those that may appear in the newly gathered articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [69]:
category_codes = {
    'Tesla': 0,
    'Waymo': 1,
}

In [70]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

In [71]:
df.head()

Unnamed: 0,File_Name,Category,Complete_Filename,Content,Content_Parsed,Category_Code
0,001.txt,Tesla,001.txt-Tesla,Why the feds are investigating Teslaâ€™s Autop...,feds investigate teslaâ€™ autopilot mean...,0
1,002.txt,Tesla,002.txt-Tesla,Tesla Wants To Launch Full Self-Driving Public...,tesla want launch full self-driving public be...,0
2,003.txt,Tesla,003.txt-Tesla,How Good Is Tesla Full Self-Driving (Beta) Rig...,good tesla full self-driving (beta) right (...,0
3,004.txt,Tesla,004.txt-Tesla,Tesla Must Send Autopilot Data to Feds by Octo...,tesla must send autopilot data feds october ...,0
4,005.txt,Tesla,005.txt-Tesla,Survey Reveals Tesla's Full Self-Driving Take ...,survey reveal tesla full self-driving take rat...,0
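
The mapping is easy to check, and to invert, with plain dictionaries. In this minimal sketch the `categories` list is made up and `code_to_category` is a hypothetical name for the inverse lookup, which is handy for turning predicted codes back into category names.

```python
category_codes = {"Tesla": 0, "Waymo": 1}

# Inverse lookup: code -> category name.
code_to_category = {code: name for name, code in category_codes.items()}

categories = ["Tesla", "Waymo", "Tesla"]
codes = [category_codes[name] for name in categories]
decoded = [code_to_category[code] for code in codes]

print(codes)
print(decoded)
```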


## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll use cross-validation on the training set to tune the hyperparameters, and then measure performance on the unseen data of the test set.

In [72]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we don't have many observations (only 40, as the 34 + 6 rows printed by the TF-IDF cell below show), we'll choose a test set size of 15% of the full dataset.
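
The sizes of the two sets follow directly from `test_size`. The sketch below reproduces the arithmetic with the standard library only. The 40 rows are inferred from the shapes printed by the TF-IDF cell further down, and rounding up is an assumption chosen because it matches those shapes. The shuffle is not the one `train_test_split` performs, so the selected rows will differ.

```python
import math
import random

n_samples = 40       # 34 + 6, the row counts printed by the TF-IDF cell
test_size = 0.15

n_test = math.ceil(test_size * n_samples)   # rounding up is an assumption
n_train = n_samples - n_test

# Shuffle the row indices with a fixed seed, then cut once.
indices = list(range(n_samples))
random.Random(8).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```

Every row ends up in exactly one of the two sets, which is what keeps the test set unseen during training.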

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document
    frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top
    `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.
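
The effect of `min_df` and `max_df` can be illustrated without scikit-learn. The sketch below uses a made-up four-document corpus and a hypothetical helper, `filter_vocabulary`; it follows the convention that an integer threshold is a number of documents and a float threshold is a proportion of documents.

```python
from collections import Counter

docs = [
    "tesla autopilot crash",
    "tesla full self driving",
    "waymo driverless taxi",
    "waymo tesla rivalry",
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

def filter_vocabulary(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms with min_df <= df <= max_df (int = count, float = proportion)."""
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

print(filter_vocabulary(doc_freq, len(docs), min_df=2, max_df=1.0))
print(filter_vocabulary(doc_freq, len(docs), min_df=1, max_df=0.5))
```

Note the difference between `max_df=1.0` (terms may appear in up to 100% of the documents, so nothing is dropped) and `max_df=1` (terms may appear in at most one document).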

Note that we are implicitly scaling our data when representing it as TF-IDF features: the `norm` argument normalizes each document vector (here, to unit L2 length).
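
To make the weighting and the scaling concrete, here is a hand-rolled sketch on a made-up three-document corpus. It assumes sublinear term frequency (`1 + ln(tf)`), a smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`) and L2 normalization. It is an illustration of the idea, not scikit-learn's implementation, and `tfidf_vector` is a hypothetical helper.

```python
import math
from collections import Counter

docs = [["tesla", "autopilot", "tesla"], ["waymo", "taxi"], ["tesla", "waymo"]]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    counts = Counter(doc)
    weights = []
    for term in vocabulary:
        tf = counts[term]
        if tf == 0:
            weights.append(0.0)
            continue
        sublinear_tf = 1.0 + math.log(tf)                          # sublinear tf
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed idf
        weights.append(sublinear_tf * idf)
    length = math.sqrt(sum(w * w for w in weights))                # L2 norm
    return [w / length for w in weights]

vectors = [tfidf_vector(doc) for doc in docs]
for vector in vectors:
    print([round(w, 3) for w in vector])
```

After the final division every document vector has length 1, so long and short articles end up on the same scale.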

In [73]:
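# Parameter selection
ngram_range = (1,2)   # unigrams and bigrams
min_df = 10           # integer: absolute number of documents
max_df = 1.           # float: proportion of documents (1.0 = no upper cut-off)
max_features = 300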

We have chosen these values as a first approximation. Since the models that we develop later have very good predictive power, we'll stick to these values. However, different combinations could be tried in order to further improve the accuracy of the models.

In [74]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(34, 98)
(6, 98)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
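
The reason for this asymmetry is that the vocabulary must be learned from the training documents alone. The sketch below shows the idea with plain counts instead of TF-IDF weights; the documents are made up and `transform` is a hypothetical stand-in for the vectorizer's method.

```python
train_docs = ["tesla autopilot crash", "waymo driverless taxi"]
test_docs = ["tesla robotaxi launch"]

# "fit": learn the vocabulary from the training documents only.
train_terms = sorted({term for doc in train_docs for term in doc.split()})
vocabulary = {term: index for index, term in enumerate(train_terms)}

def transform(doc):
    """Count vector over the fitted vocabulary; unseen terms are dropped."""
    vector = [0] * len(vocabulary)
    for term in doc.split():
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector

test_vector = transform(test_docs[0])
print(test_vector)
```

Terms that appear only in the test document (`robotaxi`, `launch`) contribute nothing, so no information from the test set leaks into the features.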

We can use the Chi squared test in order to see what unigrams and bigrams are most correlated with each category:

In [75]:
from sklearn.feature_selection import chi2
import numpy as np

for Product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'Tesla' category:
  . Most correlated unigrams:
. waymo
. full
. phoenix
. part
. fully
  . Most correlated bigrams:
. autonomous vehicle
. autonomous vehicles

# 'Waymo' category:
  . Most correlated unigrams:
. waymo
. full
. phoenix
. part
. fully
  . Most correlated bigrams:
. autonomous vehicle
. autonomous vehicles



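As we can see, both categories show exactly the same terms. With only two categories this is expected: the chi-squared score measures how strongly a term depends on the label, not which category it favours, so scoring against `Tesla` and against `Waymo` gives the same ranking. The bigrams are also few. If we get the bigrams in our features: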

In [76]:
bigrams

['self driving', 'elon musk', 'autonomous vehicle', 'autonomous vehicles']

We can see there are only four. This means the unigrams have more correlation with the category than the bigrams, and since we're restricting the vocabulary (at most 300 features, each appearing in at least 10 documents), only a few bigrams are being considered.
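
The identical lists printed for both categories above can be reproduced on a toy matrix. The sketch below is written from the usual observed-versus-expected definition of the chi-squared statistic; it is an illustration, not scikit-learn's code. The feature values and `chi2_scores` are made up for the example.

```python
import math

def chi2_scores(X, is_category):
    """Chi-squared score per feature column for non-negative features."""
    n_samples = len(X)
    scores = []
    for j in range(len(X[0])):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for group in (False, True):
            group_prob = sum(1 for flag in is_category if flag == group) / n_samples
            observed = sum(row[j] for row, flag in zip(X, is_category) if flag == group)
            expected = group_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Two toy features (columns) for four documents; labels 0 = Tesla, 1 = Waymo.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.0, 0.9]]
labels = [0, 0, 1, 1]

scores_tesla = chi2_scores(X, [label == 0 for label in labels])
scores_waymo = chi2_scores(X, [label == 1 for label in labels])
print(scores_tesla)
print(scores_waymo)
```

Swapping which category counts as the positive one only swaps the two terms of the sum, so the scores are the same. To see which category a term favours, one would have to compare its average weight per category as well.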

Let's save the files we'll need in the next steps:

In [77]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
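
The repeated blocks above could also be written as one loop over a dictionary of names and objects. This is a sketch, not the notebook's code: `save_pickles` and `load_pickle` are hypothetical helpers, and the demo writes stand-in objects to a temporary folder so that it runs without the notebook's variables.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickles(objects, folder):
    """Write each object to <folder>/<name>.pickle, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, obj in objects.items():
        with open(folder / f"{name}.pickle", "wb") as output:
            pickle.dump(obj, output)

def load_pickle(folder, name):
    """Read back the object stored under <folder>/<name>.pickle."""
    with open(Path(folder) / f"{name}.pickle", "rb") as data:
        return pickle.load(data)

# Demo with stand-in objects and a temporary folder.
demo_objects = {"category_codes": {"Tesla": 0, "Waymo": 1}, "labels_test": [0, 1, 0]}
demo_folder = Path(tempfile.mkdtemp()) / "Pickles"
save_pickles(demo_objects, demo_folder)
print(load_pickle(demo_folder, "labels_test"))
```

In the notebook, the dictionary would hold the real objects, for example `{'X_train': X_train, 'tfidf': tfidf}`, and the folder would be `'Pickles'`.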