<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided that the authors and the source are credited.

## <center>Assignment #6
### <center> Beating benchmarks in "How good is your Medium article?"
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline".

In [3]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
import scipy.sparse 
from sklearn.preprocessing import OneHotEncoder
import matplotlib
import matplotlib.pyplot as plt

The following code strips all HTML tags from an article's content.

In [4]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser: it calls reset() and decodes character references.
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
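
As a quick sanity check, here is a minimal, self-contained sketch of the same idea on a made-up HTML snippet. `TagStripper` and `sample_html` are illustrative names, not part of the assignment code; the class simply mirrors `MLStripper` above.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


sample_html = '<p>Hello, <b>Medium</b> &amp; friends!</p>'  # made-up input
stripper = TagStripper()
stripper.feed(sample_html)
stripper.close()
plain_text = stripper.get_text()
print(plain_text)
```

Tags are dropped while character references are decoded, so the text that reaches the vectorizers contains no markup.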

A helper function that reads a JSON line without crashing on invalid escape or control characters: it replaces the offending character with a space and retries.

In [5]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the parser:
        idx_to_replace = e.pos
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
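
The same retry idea can also be written as a loop. The sketch below is an illustrative variant, not part of the assignment: the name `loads_lenient` and the `max_fixes` cap are assumptions. It relies on `json.JSONDecodeError.pos`, which holds the index of the character the parser rejected.

```python
import json


def loads_lenient(line, max_fixes=100):
    """Parse one JSON line, blanking out characters the parser rejects."""
    for _ in range(max_fixes):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            # Replace the rejected character with a space and try again
            line = line[:err.pos] + ' ' + line[err.pos + 1:]
    raise ValueError('too many invalid characters in line')


# Made-up input: a raw tab inside a JSON string is an invalid control character
broken_line = '{"title": "Tab\there"}'
parsed = loads_lenient(broken_line)
print(parsed)
```

Capping the number of repairs avoids unbounded recursion on a line that cannot be fixed this way, for example one that is truncated.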

Extract the features `content`, `published`, `title` and `author`, and write them to separate files for the train and test sets.

In [6]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
   
    totals = {'test': 34645, 'train': 62313}
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file, total=totals[prefix]):
            json_data = read_json_line(line)
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content_no_html_tags = strip_tags(content)
            title = json_data['title'].replace('\n', ' ').replace('\r', ' ')
            print(content_no_html_tags, file=feature_files[0])
            print(json_data['published']['$date'], file=feature_files[1])
            print(title, file=feature_files[2])
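            print(json_data['meta_tags']['author'], file=feature_files[3])

    # Close the output files so that buffered data is flushed to disk.
    for f in feature_files:
        f.close()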


In [7]:
PATH_TO_DATA = '/Users/lucky/.kaggle/competitions/how-good-is-your-medium-article' # modify this if you need to

In [8]:
def feature_file(prefix, feature):
    return os.path.join(PATH_TO_DATA, '{}_{}.txt'.format(prefix, feature))

In [9]:
if not os.path.exists(feature_file('train', 'content')):
    extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [10]:
if not os.path.exists(feature_file('test', 'content')):
    extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

**Add the following groups of features:**

- Tf-Idf of article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf of article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day or night, whether it's a weekend
- Bag of authors (i.e. one-hot-encoded author names)

In [11]:
%%time
X_train_content_sparse, X_train_title_sparse, X_train_author_sparse, X_train_time_features_sparse = (None, None, None, None)
X_test_content_sparse, X_test_title_sparse, X_test_author_sparse, X_test_time_features_sparse = (None, None, None, None)

# Hash word 1- to 3-grams of the article content into 100000 columns.
# Note: `non_negative` was removed in scikit-learn 0.21; newer releases use `alternate_sign=False` instead.
hv = HashingVectorizer(non_negative=True, norm=None, n_features=100000, ngram_range=(1,3))
tfidf = TfidfTransformer()

with open(feature_file('train', 'content'), encoding='utf-8') as input_train_file:
    X_train_content_sparse = hv.transform(input_train_file)
    print('[-] Hashing of train dataset has been completed')
#    X_train_content_sparse = tfidf.fit_transform(train_vectors)
#    print('[+] TFxIDF transform is done')

with open(feature_file('test', 'content'), encoding='utf-8') as input_test_file:
    X_test_content_sparse = hv.transform(input_test_file)
    print('[-] Hashing of test dataset has been completed')
#    X_test_content_sparse = tfidf.transform(test_vectors)
#    print('[+] TFxIDF transform is done')

# Tf-Idf of word 1- and 2-grams in the titles.
tv = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
with open(feature_file('train', 'title'), encoding='utf-8') as input_train_file:
    X_train_title_sparse = tv.fit_transform(input_train_file)

with open(feature_file('test', 'title'), encoding='utf-8') as input_test_file:
    X_test_title_sparse = tv.transform(input_test_file)

# Bag of authors: each line of the file (one author name) is treated as a single binary token.
# The files were written as UTF-8, so they are read back with the same encoding.
tv1 = TfidfVectorizer(binary=True, use_idf=False, norm=None, token_pattern='[^\n]+')
with open(feature_file('train', 'author'), 'rt', encoding='utf-8') as input_train_file:
    X_train_author_sparse = tv1.fit_transform(input_train_file)

with open(feature_file('test', 'author'), 'rt', encoding='utf-8') as input_test_file:
    X_test_author_sparse = tv1.transform(input_test_file)

[-] Hashing of train dataset has been completed
[-] Hashing of test dataset has been completed


In [12]:
def get_time_features(filename):
    x = pd.read_csv(filename, header=None)
    x.columns = ['published']
    tf = pd.DataFrame()
    x.published = x.published.astype('datetime64[ns]')
    tf['hour_of_day'] = x.published.dt.hour
    tf['day_of_week'] = x.published.dt.dayofweek
    tf['day_of_year'] = x.published.dt.dayofyear
    tf['is_friday_saturday'] = tf.day_of_week.apply(lambda d: 1 if d==4 or d==5 else 0)
    tf['part_of_day'] = tf['hour_of_day'].apply(lambda h: 1 if h > 5 and h <= 11 else (
                                                     2 if h > 11 and h <= 17 else (
                                                     3 if h > 17 and h <= 23 
                                                         else 4)))
    tf['top_hours'] = tf['hour_of_day'].apply(lambda h: 1 if h in (1, 5, 9, 14) else 0)
    return tf

X_train_time_features = get_time_features(feature_file('train', 'published'))
X_test_time_features = get_time_features(feature_file('test', 'published'))

# `categorical_features='all'` was the default; recent scikit-learn releases no longer accept this argument.
encoder = OneHotEncoder()
X_train_time_features_sparse = encoder.fit_transform(X_train_time_features)
X_test_time_features_sparse = encoder.transform(X_test_time_features)
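
To see what these features look like for a single article, here is a small standard-library sketch applied to a made-up timestamp. The morning/day/evening/night labels for the `part_of_day` codes are an interpretation of the hour ranges used in `get_time_features`, and `datetime.weekday()` follows the same Monday=0 convention as pandas' `dayofweek`.

```python
from datetime import datetime


def part_of_day(hour):
    """Map an hour (0-23) to 1=morning, 2=day, 3=evening, 4=night."""
    if 5 < hour <= 11:
        return 1
    if 11 < hour <= 17:
        return 2
    if 17 < hour <= 23:
        return 3
    return 4


# Made-up publication time; 30 June 2017 was a Friday
published = datetime.fromisoformat('2017-06-30T18:20:13')

features = {
    'hour_of_day': published.hour,
    'day_of_week': published.weekday(),            # Monday=0 ... Sunday=6
    'day_of_year': published.timetuple().tm_yday,
    'is_friday_saturday': int(published.weekday() in (4, 5)),
    'part_of_day': part_of_day(published.hour),
    'top_hours': int(published.hour in (1, 5, 9, 14)),
}
print(features)
```

Each of these integer codes is then expanded into indicator columns by the one-hot encoder, which is why the time block ends up far wider than six columns.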

**Join all sparse matrices.**

In [13]:
X_train_content_sparse.shape, X_train_title_sparse.shape, X_train_author_sparse.shape, X_train_time_features_sparse.shape
#X_train_sparse = csr_matrix(hstack([X_train_content_sparse, X_train_title_sparse,
                                    #X_train_author_sparse]))

((62313, 100000), (62313, 100000), (62313, 31319), (62313, 405))

In [14]:
X_train_sparse = csr_matrix(hstack([X_train_content_sparse, X_train_title_sparse,
                                    X_train_author_sparse, X_train_time_features_sparse]))
X_test_sparse = csr_matrix(hstack([X_test_content_sparse, X_test_title_sparse,
                                    X_test_author_sparse, X_test_time_features_sparse]))

**Read train target and split data for validation.**

In [15]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

Let's look at how some of the time features relate to the target by comparing the mean number of recommendations within each group:

In [16]:
df = pd.concat([X_train_time_features, train_target['log_recommends'].apply(np.expm1)], axis=1)
df.groupby('hour_of_day').log_recommends.mean(), df.groupby('day_of_week').log_recommends.mean(), \
df.groupby('is_friday_saturday').log_recommends.mean(), df.groupby('top_hours').log_recommends.mean()

(hour_of_day
 0.0     194.080199
 1.0     461.213947
 2.0     305.840884
 3.0     339.406281
 4.0     155.343964
 5.0     583.033277
 6.0     255.599988
 7.0     295.258059
 8.0     369.778356
 9.0     566.212346
 10.0    252.296197
 11.0    279.795048
 12.0    267.701192
 13.0    313.730345
 14.0    427.061511
 15.0    363.566415
 16.0    276.849522
 17.0    335.731069
 18.0    359.708586
 19.0    241.422629
 20.0    363.857726
 21.0    316.952048
 22.0    318.374608
 23.0    321.751181
 Name: log_recommends, dtype: float64, day_of_week
 0.0    328.316343
 1.0    322.094256
 2.0    338.987363
 3.0    291.908646
 4.0    350.506525
 5.0    386.893680
 6.0    307.986354
 Name: log_recommends, dtype: float64, is_friday_saturday
 0.0    319.370722
 1.0    363.392023
 Name: log_recommends, dtype: float64, top_hours
 0.0    303.209582
 1.0    486.138853
 Name: log_recommends, dtype: float64)

In [17]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [18]:
ridge = Ridge(random_state=17)
ridge.fit(X_train_part_sparse, y_train_part);
ridge_pred = ridge.predict(X_valid_sparse)
valid_mae = mean_absolute_error(y_valid, ridge_pred)
valid_mae, np.expm1(valid_mae)

(1.2472784149842058, 2.4808566071786915)
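
One way to read the second number: the target is `log1p` of the recommendation count, so the MAE is measured on a log scale. The sketch below, on made-up counts, checks that `exp(MAE)` equals the geometric mean of the ratios between `1 + predicted` and `1 + true` (larger over smaller), so `expm1(MAE)` is that typical ratio minus one rather than an error in recommendations. All variable names here are illustrative.

```python
import math

# Made-up recommendation counts, for illustration only
true_counts = [0, 10, 100, 1000]
pred_counts = [1, 20, 50, 900]

# The competition target is log1p(count)
y_true = [math.log1p(c) for c in true_counts]
y_pred = [math.log1p(c) for c in pred_counts]

# Mean absolute error on the log1p scale
mae_log = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-article ratio of (1 + count) values, always larger over smaller
ratios = [max(1 + t, 1 + p) / min(1 + t, 1 + p)
          for t, p in zip(true_counts, pred_counts)]
geo_mean_ratio = math.prod(ratios) ** (1 / len(ratios))

print(mae_log, math.exp(mae_log), geo_mean_ratio)
```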

* with time features: validation `(1.3055035609096739, 2.6895465374084075)`; score 1.62456 (was 1.62666)
* with `is_friday_saturday`, `day_of_year`, `top_hours`: validation `(1.305855113415952, 2.6908438347613699)`; score 1.62321 (was 1.62456)

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [19]:
%%time
ridge.fit(X_train_sparse, y_train);
ridge_test_pred = ridge.predict(X_test_sparse)

CPU times: user 40min 44s, sys: 16.6 s, total: 41min
Wall time: 41min 15s


In [20]:
def write_submission_file(prediction, filename,
                          path_to_sample='sample_submission.csv'):
    submission = pd.read_csv(os.path.join(PATH_TO_DATA, path_to_sample), index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [21]:
write_submission_file(ridge_test_pred, 'assignment6_medium_submission.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [22]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      'medium_all_zeros_submission.csv')

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [23]:
kaggle_test_mean = 4.33328
ridge_test_pred_modif = ridge_test_pred + (kaggle_test_mean - ridge_test_pred.mean())
ridge_test_pred.mean(), ridge_test_pred_modif.mean()

(3.125392289172281, 4.3332799999999994)
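
The reasoning behind this shift, as a minimal sketch on made-up numbers: `log1p` of a non-negative count is itself non-negative, so the MAE of an all-zero submission equals the mean of the true targets. The constant `kaggle_test_mean` above presumably plays the role of `mae_of_zeros` here; `hidden_targets` and `preds` are hypothetical stand-ins.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error of two equally long sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


hidden_targets = [2.0, 4.0, 6.0]   # made-up stand-in for the unknown test targets
preds = [1.0, 2.5, 4.0]            # made-up model predictions

# For non-negative targets, the MAE of all zeros is just the target mean
mae_of_zeros = mae(hidden_targets, [0.0] * len(hidden_targets))

# Shift every prediction by the same constant so the means agree
shift = mae_of_zeros - mean(preds)
shifted = [p + shift for p in preds]

print(mae_of_zeros, mean(shifted))
print(mae(hidden_targets, preds), mae(hidden_targets, shifted))
```

The shift only corrects a constant offset, and the leaderboard value is an estimate of the test mean, so it helps to the extent that the scored part of the test set is representative.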

In [25]:
write_submission_file(ridge_test_pred_modif, 
                      'assignment6_medium_submission_with_hack.csv')