Data preprocessing
- Discuss how you split the dataset and why.
- Is your dataset IID?
- Does it have group structure?
- Is it time-series data?
- How should you split the dataset given your ML question to best mimic future use when you deploy the model?
- Apply MinMaxScaler or StandardScaler on the continuous features
- Apply OneHotEncoder or OrdinalEncoder on categorical features
- Describe why you chose a particular preprocessor for each feature.
- How many features do you have in the preprocessed data?

1. The dataset
- Discuss how you split the dataset and why.
- Is your dataset IID?
- Does it have group structure?
- Is it time-series data?

The dataset consists of records of online articles published by Mashable. Each row corresponds to a single article. There are a total of 39,644 articles. 

The target variable is 'popular' - this is a binary variable where 0 corresponds to not-popular and 1 corresponds to popular. This variable is derived from the 'shares' variable present in the original dataset. Shares is the number of shares for a given article, and we set a threshold = 1400, above which an article is considered popular. 
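As a minimal sketch of how such a label can be derived, assuming a pandas DataFrame with a `shares` column and reading "above" as strictly greater than the threshold. The helper name `add_popular_label` is hypothetical, and the five share counts are the ones visible in the first rows of the notebook output further down.

```python
import pandas as pd

POPULARITY_THRESHOLD = 1400  # threshold on shares stated in the text


def add_popular_label(df: pd.DataFrame, threshold: int = POPULARITY_THRESHOLD) -> pd.DataFrame:
    """Return a copy of df with a binary 'popular' column (1 if shares > threshold)."""
    out = df.copy()
    out["popular"] = (out["shares"] > threshold).astype(int)
    return out


# Share counts of the first five articles shown in the notebook output.
toy = pd.DataFrame({"shares": [593, 711, 1500, 1200, 505]})
labelled = add_popular_label(toy)
print(labelled["popular"].tolist())  # [0, 0, 1, 0, 0]
```

The resulting labels agree with the `popular` column displayed for those rows.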

The features describe various characteristics of the articles, including some NLP-based metrics, the number of tokens in the title and in the content, the number of images and videos, the topic, and the day of the week on which the article was published. The original dataset has 58 features. 

Since there is no group structure or time-series structure to the data, the data can be considered IID. 


2. How should you split the dataset given your ML question to best mimic future use when you deploy the model?

The goal of this project is to predict how popular an article will be *before publication*, and to make recommendations to increase popularity. 

Each of the feature variables is available to us prior to publication in a production environment, so we can safely split the dataset with a standard random train:validation:test split. Here, I use a split of 80:10:10, which gives us 31,715, 3,964 and 3,965 articles for the train, validation and test sets respectively. 
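A three-way split is usually built from two calls to `train_test_split`, which is what the notebook code below does with `train_size=0.8` followed by `train_size=0.5` on the remainder. The sketch below generalises that to arbitrary fractions. It runs on synthetic stand-in data, and the helper name `three_way_split` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, train_frac, val_frac, random_state=0):
    """Split (X, y) into train/val/test; the test set receives the remainder."""
    # First call: carve off the training set.
    X_train, X_other, y_train, y_other = train_test_split(
        X, y, train_size=train_frac, random_state=random_state
    )
    # Second call: val_frac is a fraction of the whole dataset, so rescale it
    # to a fraction of what is left after removing the training set.
    val_share_of_other = val_frac / (1.0 - train_frac)
    X_val, X_test, y_val, y_test = train_test_split(
        X_other, y_other, train_size=val_share_of_other, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (not the Mashable dataset).
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2
parts = three_way_split(X_demo, y_demo, train_frac=0.8, val_frac=0.1, random_state=7)
print([len(p) for p in parts[:3]])
```

Because the sizes are rounded to whole rows, the validation and test sets can differ by one row when the remainder is odd.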

3. Feature Encoding
- Apply MinMaxScaler or StandardScaler on the continuous features
- Apply OneHotEncoder or OrdinalEncoder on categorical features
- Describe why you chose a particular preprocessor for each feature.
- How many features do you have in the preprocessed data?

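On the continuous features, I apply StandardScaler to all features except the following, which are scaled with MinMaxScaler instead:

n_tokens_title, average_token_length, num_keywords, LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, abs_title_sentiment_polarity

The reason for applying StandardScaler to the remaining features is that their histograms are very fat-tailed, and a MinMaxScaler would clump most values into the leftmost part of the histogram. 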
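The clumping effect can be illustrated on synthetic data. The sketch below draws a log-normal sample as a stand-in for a fat-tailed feature (an assumption, not a column from the dataset) and measures how much of it min-max scaling pushes into the bottom 5% of the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(7)
# Synthetic heavy-tailed feature (log-normal), standing in for a count-like column.
x = rng.lognormal(mean=0.0, sigma=2.0, size=(10_000, 1))

x_minmax = MinMaxScaler().fit_transform(x)
x_standard = StandardScaler().fit_transform(x)

# Share of points that land in the lowest 5% of the min-max range.
frac_in_bottom = float(np.mean(x_minmax < 0.05))
print(f"min-max: {frac_in_bottom:.1%} of values fall below 0.05")
print(f"standard: mean={x_standard.mean():.3f}, std={x_standard.std():.3f}")
```

Standardisation does not remove the skew, but its scale is set by the mean and standard deviation of the whole sample rather than by the single largest value.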

For the categorical features, I use OneHotEncoder. For example, the data_channel feature (the topic column in the code below) tells us what the topic of the article is, and it is not possible to order topics. Similarly, I use one-hot encoding on the is_weekend feature and the day_of_week feature. 
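As a small illustration of one-hot encoding an unordered feature, the sketch below encodes a toy `topic` column. The rows are made up, and "Sports" is a hypothetical category chosen only to show what `handle_unknown="ignore"` does with a value that was not present at fit time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"topic": ["Business", "Tech", "Entertainment", "Tech"]})
new = pd.DataFrame({"topic": ["Tech", "Sports"]})  # "Sports" is unseen in train

encoder = OneHotEncoder(handle_unknown="ignore")
# The default output is a SciPy sparse matrix; densify it for display.
train_encoded = encoder.fit_transform(train).toarray()
new_encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # learned categories, sorted alphabetically
print(train_encoded)
print(new_encoded)  # the unseen category becomes an all-zero row
```

Each topic gets its own 0/1 column, so no artificial ordering between topics is introduced.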

In [2]:
from project_paths import *

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [3]:
data = pd.read_csv(data_csv_for_preprocessing)

print(data.shape)
data.head()

(39644, 51)


Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,day_of_week,topic,popular
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593,Monday,Entertainment,0
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,-0.125,-0.1,0.0,0.0,0.5,0.0,711,Monday,Business,0
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500,Monday,Business,1
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200,Monday,Entertainment,0
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505,Monday,Tech,0


In [4]:
RANDOM_STATE = 7

USING_BASIC_SPLIT = True

if USING_BASIC_SPLIT:
    non_predictive_columns = ['url', 'timedelta']
    target_column = 'popular'

    y = data[target_column]
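    # Note: X still contains 'shares' (the column 'popular' is derived from); the
    # ColumnTransformer below selects only the listed feature columns, so it is dropped there.
    X = data[[x for x in data.columns if x not in non_predictive_columns and x != target_column]]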

    # Splitting the data - BASIC TRAIN TEST SPLIT

    X_train, X_other, y_train, y_other = train_test_split(X, y, train_size=0.8, random_state=RANDOM_STATE)
    X_val, X_test, y_val, y_test = train_test_split(X_other, y_other, train_size=0.5, random_state=RANDOM_STATE)

    print(X_train.shape)
    print(X_val.shape)
    print(X_test.shape)

    print(y_train.shape)
    print(y_val.shape)
    print(y_test.shape)

(31715, 48)
(3964, 48)
(3965, 48)
(31715,)
(3964,)
(3965,)


In [8]:
if not USING_BASIC_SPLIT:
    # Identifying deciles (quantiles in multiples of 10%)
    groups_percentiles = data['shares'].quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]).to_dict()

    def get_group_given_shares(groups_percentiles, i):
        # Return the smallest quantile level whose threshold is >= i.
        # Values above the 90th percentile fall into the top group (1.0).
        for k in sorted(groups_percentiles.keys()):
            if i <= groups_percentiles[k]:
                return k
        return 1.0

    groups = []
    for i in data['shares']:
        groups.append(get_group_given_shares(groups_percentiles, i))

    non_predictive_columns = ['url', 'timedelta']
    target_column = 'popular'

    data['groups'] = groups
    y = data[target_column]
    X = data[[x for x in data.columns if x not in non_predictive_columns and x != target_column]]
    
    # Using stratification to make sure that articles of all levels of popularity are covered in all sets

    # First a stratified split into train and other
    groups = X['groups']
    X_train, X_other, y_train, y_other = train_test_split(X, y, train_size = 0.8, stratify=groups, random_state=RANDOM_STATE)

    # Then a stratified split into val and test
    groups_other = X_other["groups"]
    X_val, X_test, y_val, y_test = train_test_split(X_other, y_other, train_size = 0.5, stratify=groups_other, random_state=RANDOM_STATE)
    
    # Dropping the groups column since it was only used for stratification
    X_train = X_train.drop('groups', axis=1)
    X_val = X_val.drop('groups', axis=1)
    X_test = X_test.drop('groups', axis=1)
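The decile assignment in the cell above can also be written without a Python loop. The sketch below is one possible vectorised equivalent using `np.searchsorted`, shown on synthetic share counts. The function name `assign_quantile_groups` is hypothetical.

```python
import numpy as np


def assign_quantile_groups(values, levels):
    """Label each value with the smallest quantile level whose threshold is >= the value.

    Values above the highest threshold get the label 1.0.
    """
    values = np.asarray(values)
    levels = np.asarray(levels, dtype=float)
    thresholds = np.quantile(values, levels)
    # side="left" returns the index of the first threshold that is >= the value.
    idx = np.searchsorted(thresholds, values, side="left")
    labels = np.append(levels, 1.0)
    return labels[idx]


shares_demo = np.arange(1, 101)  # synthetic share counts 1..100
levels = np.arange(1, 10) / 10   # 0.1, 0.2, ..., 0.9
groups_demo = assign_quantile_groups(shares_demo, levels)
print(np.unique(groups_demo, return_counts=True))
```

The resulting labels can be passed to the `stratify` argument of `train_test_split` in the same way as the `groups` column above.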

In [9]:
print(X_train.shape)
print(y_train.shape)

print(X_val.shape)
print(y_val.shape)

print(X_test.shape)
print(y_test.shape)

(31715, 48)
(31715,)
(3964, 48)
(3964,)
(3965, 48)
(3965,)


In [12]:
# Defining Feature Types for encoding

onehot_ftrs = ['topic', 'is_weekend', 'day_of_week']
# ordinal_ftrs = ['day_of_week']
# ordinal_cats = [['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]
minmax_ftrs = ['n_tokens_title',
                'average_token_length',
                'num_keywords',
                'LDA_00',
                'LDA_01',
                'LDA_02',
                'LDA_03',
                'LDA_04',
                'title_subjectivity',
                'title_sentiment_polarity',
                'abs_title_subjectivity',
                'abs_title_sentiment_polarity']
standard_ftrs = ['n_tokens_content',
                'n_unique_tokens',
                'n_non_stop_words',
                'n_non_stop_unique_tokens',
                'num_hrefs',
                'num_self_hrefs',
                'num_imgs',
                'num_videos',
                'kw_min_min',
                'kw_max_min',
                'kw_avg_min',
                'kw_min_max',
                'kw_max_max',
                'kw_avg_max',
                'kw_min_avg',
                'kw_max_avg',
                'kw_avg_avg',
                'self_reference_min_shares',
                'self_reference_max_shares',
                'self_reference_avg_sharess',
                'global_subjectivity',
                'global_sentiment_polarity',
                'global_rate_positive_words',
                'global_rate_negative_words',
                'rate_positive_words',
                'rate_negative_words',
                'avg_positive_polarity',
                'min_positive_polarity',
                'max_positive_polarity',
                'avg_negative_polarity',
                'min_negative_polarity',
                'max_negative_polarity',]

# print([x for x in data.columns if x not in onehot_ftrs+ordinal_ftrs+minmax_ftrs+standard_ftrs])
save_list_to_pkl(onehot_ftrs, 'onehot_ftrs.pkl')
save_list_to_pkl(minmax_ftrs, 'minmax_ftrs.pkl')
save_list_to_pkl(standard_ftrs, 'standard_ftrs.pkl')

In [11]:
np.random.seed(RANDOM_STATE)

# collect all the encoders
# Note: from scikit-learn 1.2 onward, the `sparse` argument of OneHotEncoder is named `sparse_output`
preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'), onehot_ftrs),
                  ('minmax', MinMaxScaler(), minmax_ftrs),
                  ('standard', StandardScaler(), standard_ftrs),])

clf = Pipeline(steps=[('preprocessor', preprocessor)])

X_train_prep = clf.fit_transform(X_train)
X_val_prep = clf.transform(X_val)
X_test_prep = clf.transform(X_test)

print(X_train_prep.shape)
print(X_val_prep.shape)
print(X_test_prep.shape)
X_train_prep

(31715, 60)
(3964, 60)
(3965, 60)


array([[ 0.        ,  1.        ,  0.        , ...,  0.79044934,
         0.79389477,  0.37938288],
       [ 0.        ,  0.        ,  0.        , ..., -0.14082918,
        -1.64618344,  0.60437983],
       [ 1.        ,  0.        ,  0.        , ...,  0.86351945,
         0.76518797,  0.37938288],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -1.55291375,
        -0.95722018,  0.07938695],
       [ 0.        ,  0.        ,  0.        , ...,  0.66246029,
         0.76518797,  0.07938695],
       [ 0.        ,  0.        ,  0.        , ..., -0.43708197,
         0.07622471,  0.07938695]])

In [10]:
save_preprocessed_data(X_train_prep, preprocessed_data_path_train)
save_preprocessed_data(X_val_prep, preprocessed_data_path_val)
save_preprocessed_data(X_test_prep, preprocessed_data_path_test)

save_preprocessed_data(y_train, preprocessed_data_y_train)
save_preprocessed_data(y_val, preprocessed_data_y_val)
save_preprocessed_data(y_test, preprocessed_data_y_test)

In [11]:
feature_names = [t for t in clf['preprocessor'].transformers_]

In [38]:
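# feature_names[0] is the fitted ('onehot', encoder, columns) tuple.
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
oe_features = feature_names[0][1].get_feature_names()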
all_features_pp = list(oe_features) + minmax_ftrs + standard_ftrs
save_list_to_pkl(all_features_pp, 'processed_features_list.pkl')