Data preprocessing
- Discuss how you split the dataset and why.
- Is your dataset IID?
- Does it have group structure?
- Is it time-series data?
- How should you split the dataset given your ML question to best mimic future use when you deploy the model?
- Apply MinMaxScaler or StandardScaler on the continuous features
- Apply OneHotEncoder or OrdinalEncoder on categorical features
- Describe why you chose a particular preprocessor for each feature.
- How many features do you have in the preprocessed data?

1. The dataset
- Discuss how you split the dataset and why.
- Is your dataset IID?
- Does it have group structure?
- Is it time-series data?

The dataset consists of records of online articles published by Mashable. Each row corresponds to a single article. There are a total of 39,644 articles. 

The target variable is 'popular', a binary variable where 0 means not popular and 1 means popular. It is derived from the 'shares' variable in the original dataset, which is the number of shares for a given article. We set a threshold of 1400 shares, above which an article is considered popular.
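
The labeling rule can be sketched as follows. This is a minimal illustration on made-up share counts, not the code that produced the file used in this notebook. The column name `shares` comes from the original dataset, and the strict `>` comparison is an assumption based on the word "above".

```python
import pandas as pd

SHARES_THRESHOLD = 1400  # threshold stated in the text

# Toy share counts (made up for illustration)
toy = pd.DataFrame({"shares": [593, 1400, 1401, 5000]})

# 1 = popular (strictly more than 1400 shares), 0 = not popular
toy["popular"] = (toy["shares"] > SHARES_THRESHOLD).astype(int)

print(toy)
```

If the intended rule is "1400 or more", the comparison would be `>=` instead; the text does not say which side the boundary falls on.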

The features describe various characteristics of the articles, including some NLP-based metrics, the number of tokens in the title and in the content, the number of images and videos, the topic, and the day of the week on which the article was published. There are 58 features in the original dataset; the file used here has 47 feature columns.

Since there is no group structure or time-series structure to the data, the data can be considered IID. 


2. How should you split the dataset given your ML question to best mimic future use when you deploy the model?

The goal of this project is to predict how popular an article will be *before publication*, and to make recommendations to increase popularity.

Every feature is available to us before publication in a production environment, and the data is treated as IID, so we can safely use a standard random train:validation:test split. Here, I use a split of 70:20:10, which gives us 27750, 7929 and 3965 articles for the train, validation and test sets respectively.
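
The split sizes can be checked without the real data, since they depend only on the number of rows. The sketch below assumes the same two-stage use of `train_test_split` as in the notebook and uses a plain index array as a stand-in for the 39,644 articles.

```python
import numpy as np
from sklearn.model_selection import train_test_split

N_ARTICLES = 39644  # number of rows reported for the dataset
RANDOM_STATE = 7

# Stand-in for the real rows: only the number of rows matters here
rows = np.arange(N_ARTICLES)

# Stage 1: 70% train, 30% held out
train, other = train_test_split(rows, train_size=0.7, random_state=RANDOM_STATE)

# Stage 2: two thirds of the held-out rows for validation, the rest for test
val, test = train_test_split(other, train_size=2 / 3, random_state=RANDOM_STATE)

sizes = (len(train), len(val), len(test))
print(sizes)
```

The three parts are disjoint and together cover every row, so no article is used in more than one set.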

3. Feature Encoding
- Apply MinMaxScaler or StandardScaler on the continuous features
- Apply OneHotEncoder or OrdinalEncoder on categorical features
- Describe why you chose a particular preprocessor for each feature.
- How many features do you have in the preprocessed data?

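On the continuous features, I am applying StandardScaler for all features except the following, which get MinMaxScaler: n_tokens_title, average_token_length, num_keywords, LDA_00 to LDA_04, title_subjectivity, title_sentiment_polarity, abs_title_subjectivity and abs_title_sentiment_polarity.

The reason for applying StandardScaler to the remaining features is that their histograms are very fat-tailed, and a MinMaxScaler would clump most values into the leftmost part of the range.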
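
The effect can be seen on a toy fat-tailed feature. The values below are made up for illustration. Note that both scalers are linear, so neither changes the shape of the histogram; what differs is how the scale is set. MinMaxScaler lets the single largest value define the range, while StandardScaler centers on the mean and divides by the standard deviation, leaving the range unbounded.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy fat-tailed feature: nine small values and one extreme value (made up)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x).ravel()
x_standard = StandardScaler().fit_transform(x).ravel()

# MinMaxScaler maps [min, max] to [0, 1], so the outlier pins the bulk near 0
frac_near_zero = float(np.mean(x_minmax < 0.01))
print("fraction of min-max values below 0.01:", frac_near_zero)

# StandardScaler gives zero mean and unit variance; the outlier stays far out
print("standardized mean:", round(float(x_standard.mean()), 6))
print("standardized std: ", round(float(x_standard.std()), 6))
print("standardized max: ", round(float(x_standard.max()), 3))
```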

Of the categorical features, I use OneHotEncoder on data_channel, since it tells us the topic of the article, and it is not possible to order topics. Similarly, I use one-hot encoding on the is_weekend feature.
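
A minimal sketch of one-hot encoding an unordered topic column is shown here. The training topics are taken from the sample rows in this notebook; "Sports" is a made-up unseen value, included to show what `handle_unknown='ignore'` does at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy topics for illustration only
train = pd.DataFrame({"data_channel": ["Business", "Tech", "Entertainment", "Tech"]})
new = pd.DataFrame({"data_channel": ["Tech", "Sports"]})  # "Sports" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; .toarray() makes it dense for printing
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)
```

Each topic gets its own 0/1 column, so no ordering is implied, and a topic that was not seen during fitting is encoded as a row of zeros rather than raising an error.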

The last remaining categorical feature is day_of_week, for which I use OrdinalEncoder since the days of the week occur in a sequence. After preprocessing, the data has 54 features.
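
For the day of the week, the order has to be passed to the encoder explicitly; otherwise the categories are sorted alphabetically and Friday would come first. The sketch below uses a made-up three-row column and the same Monday-first order as the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, Monday first
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

toy = pd.DataFrame({"day_of_week": ["Wednesday", "Monday", "Sunday"]})

encoder = OrdinalEncoder(categories=[DAYS])
codes = encoder.fit_transform(toy).ravel()

print(codes)  # position of each day in DAYS, as floats
```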

In [26]:
# Path to the input CSV, defined in the local project_paths module
from project_paths import data_csv_for_preprocessing

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [27]:
data = pd.read_csv(data_csv_for_preprocessing)

print(data.shape)
data.head()

(39644, 50)


Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,day_of_week,data_channel,popular
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,Monday,Entertainment,0
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,Monday,Business,0
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,Monday,Business,1
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,Monday,Entertainment,0
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,Monday,Tech,0


In [28]:
RANDOM_STATE = 7

non_predictive_columns = ['url', 'timedelta']
target_column = 'popular'

y = data[target_column]
# Keep every column except the non-predictive ones and the target
X = data.drop(columns=non_predictive_columns + [target_column])

In [29]:
# Splitting the data: 70% train first, then the remaining 30% is split
# 2:1 into validation (20% overall) and test (10% overall)

X_train, X_other, y_train, y_other = train_test_split(X, y, train_size=0.7, random_state=RANDOM_STATE)
X_val, X_test, y_val, y_test = train_test_split(X_other, y_other, train_size=0.66667, random_state=RANDOM_STATE)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(27750, 47)
(7929, 47)
(3965, 47)
(27750,)
(7929,)
(3965,)


In [30]:
# Defining Feature Types for encoding

onehot_ftrs = ['data_channel', 'is_weekend']
ordinal_ftrs = ['day_of_week']
ordinal_cats = [['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']]
minmax_ftrs = ['n_tokens_title',
                'average_token_length',
                'num_keywords',
                'LDA_00',
                'LDA_01',
                'LDA_02',
                'LDA_03',
                'LDA_04',
                'title_subjectivity',
                'title_sentiment_polarity',
                'abs_title_subjectivity',
                'abs_title_sentiment_polarity']
standard_ftrs = ['n_tokens_content',
                'n_unique_tokens',
                'n_non_stop_words',
                'n_non_stop_unique_tokens',
                'num_hrefs',
                'num_self_hrefs',
                'num_imgs',
                'num_videos',
                'kw_min_min',
                'kw_max_min',
                'kw_avg_min',
                'kw_min_max',
                'kw_max_max',
                'kw_avg_max',
                'kw_min_avg',
                'kw_max_avg',
                'kw_avg_avg',
                'self_reference_min_shares',
                'self_reference_max_shares',
                'self_reference_avg_sharess',
                'global_subjectivity',
                'global_sentiment_polarity',
                'global_rate_positive_words',
                'global_rate_negative_words',
                'rate_positive_words',
                'rate_negative_words',
                'avg_positive_polarity',
                'min_positive_polarity',
                'max_positive_polarity',
                'avg_negative_polarity',
                'min_negative_polarity',
                'max_negative_polarity',]

# print([x for x in data.columns if x not in onehot_ftrs+ordinal_ftrs+minmax_ftrs+standard_ftrs])
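
The commented-out line above hints at a useful sanity check: every column should appear in exactly one of the feature lists, because ColumnTransformer drops unlisted columns by default (`remainder='drop'`). A small sketch of such a check follows. The helper `check_feature_assignment` is hypothetical, and the lists are shortened, made-up versions of the ones above.

```python
from collections import Counter


def check_feature_assignment(columns, groups):
    """Return (unassigned, duplicated) column names for a dict of feature lists."""
    counts = Counter(name for names in groups.values() for name in names)
    unassigned = [c for c in columns if c not in counts]
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unassigned, duplicated


# Toy example with made-up lists; 'num_imgs' is deliberately left out
columns = ["data_channel", "day_of_week", "n_tokens_title", "num_hrefs", "num_imgs"]
groups = {
    "onehot": ["data_channel"],
    "ordinal": ["day_of_week"],
    "minmax": ["n_tokens_title"],
    "standard": ["num_hrefs"],
}

unassigned, duplicated = check_feature_assignment(columns, groups)
print("unassigned:", unassigned)
print("duplicated:", duplicated)
```

With the real lists, both results should be empty before the preprocessor is fitted.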

In [40]:
np.random.seed(RANDOM_STATE)

# collect all the encoders
# Note: `sparse_output` replaces the `sparse` argument from scikit-learn 1.2 on;
# use `sparse=False` instead on older versions
preprocessor = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), onehot_ftrs),
                  ('ordinal', OrdinalEncoder(categories=ordinal_cats), ordinal_ftrs),
                  ('minmax', MinMaxScaler(), minmax_ftrs),
                  ('standard', StandardScaler(), standard_ftrs)])

clf = Pipeline(steps=[('preprocessor', preprocessor)])

X_train_prep = clf.fit_transform(X_train)
X_val_prep = clf.transform(X_val)
X_test_prep = clf.transform(X_test)

print(X_train_prep.shape)
print(X_val_prep.shape)
print(X_test_prep.shape)
X_train_prep

(27750, 54)
(7929, 54)
(3965, 54)


array([[ 0.        ,  0.        ,  0.        , ...,  1.0935568 ,
         0.7643038 ,  0.60840291],
       [ 0.        ,  0.        ,  0.        , ..., -0.02324474,
        -0.95633221,  0.60840291],
       [ 0.        ,  0.        ,  0.        , ..., -0.98890021,
        -1.30045941,  0.07790326],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -1.55803048,
        -0.95633221,  0.07790326],
       [ 0.        ,  0.        ,  0.        , ...,  0.66251059,
         0.7643038 ,  0.07790326],
       [ 0.        ,  0.        ,  0.        , ..., -0.43959619,
         0.07604939,  0.07790326]])