<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat "A6 baseline" (for now, only the public LB is considered).
 2. Name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). Consider this part of the assignment. You get 16 credits for beating the baseline and naming your team correctly.

In [1]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs= True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ' '.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
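
To see what the tag stripper does, here is a minimal, self-contained sketch of the same idea. The names `TagStripper` and `strip_tags_demo` are hypothetical and exist only for this illustration; the sketch calls the parent constructor instead of `reset()`, which is an assumption about style rather than the notebook's exact code.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every piece of text found between tags
        self.chunks.append(data)

    def get_text(self):
        return ' '.join(self.chunks)


def strip_tags_demo(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


sample = '<p>Hello <b>Medium</b> readers &amp; writers</p>'
stripped = strip_tags_demo(sample)
print(stripped)
```

Because the text chunks are joined with a space, the result can contain doubled spaces; the default tokenizer of `TfidfVectorizer` ignores them.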

A helper function that reads a JSON line without crashing on bad escape characters: it blanks out the offending character and retries.

In [3]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
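
The helper above extracts the offending index by parsing the exception message. As a hedged alternative, `json.JSONDecodeError` exposes that index directly as the `pos` attribute. The sketch below is a hypothetical variant (`read_json_line_pos` and `max_fixes` are names made up for this example), not the course's implementation.

```python
import json


def read_json_line_pos(line, max_fixes=100):
    """Parse one JSON line, blanking out characters the decoder rejects.

    Hypothetical variant of read_json_line: it reads the offending index
    from JSONDecodeError.pos instead of parsing the error message, and
    gives up after max_fixes repairs instead of recursing without limit.
    """
    for _ in range(max_fixes):
        try:
            return json.loads(line)
        except json.JSONDecodeError as err:
            if err.pos >= len(line):
                raise  # nothing left to blank out, e.g. a truncated line
            # Replace the rejected character with a space and retry
            line = line[:err.pos] + ' ' + line[err.pos + 1:]
    raise ValueError('line still invalid after {} fixes'.format(max_fixes))


# A raw tab inside a JSON string is an invalid control character.
broken = '{"title": "bad\tchar"}'
parsed = read_json_line_pos(broken)
print(parsed)
```

The loop with an explicit limit avoids hitting Python's recursion limit on lines that contain many bad characters.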

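Extract the features `content`, `published`, `title`, `author` and `tags` for the train and test sets. In this solution they are collected in module-level lists; the lines that would write them to separate files are commented out.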

In [52]:
publishes = []
titles = []
authors = []
contents = []
tags = []

In [53]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]

    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            # Your code here
            publishes.append(json_data['published']['$date'])
            titles.append(json_data['title'])
            authors.append(json_data['meta_tags']['author'])
            contents.append(strip_tags(json_data['content']))
            tags.append(' '.join(json_data['tags']))

    # Writing the features to files is disabled; they stay in the global lists.
    #feature_files[0].write('\n'.join(contents))
    #feature_files[1].write('\n'.join(publishes))
    #feature_files[2].write('\n'.join(titles))
    #feature_files[3].write('\n'.join(authors))

    for f in feature_files:
        f.close()
            

In [6]:
PATH_TO_DATA = './' # modify this if you need to

In [54]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [55]:
train_publish = publishes
train_title = titles
train_author = authors
train_content = contents
train_tag = tags

In [56]:
len(train_publish), len(train_title), len(train_author), len(train_content), len(train_tag)

(62313, 62313, 62313, 62313, 62313)

In [57]:
publishes = []
titles = []
authors = []
contents = []
tags = []

In [58]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

In [59]:
test_publish = publishes
test_title = titles
test_author = authors
test_content = contents
test_tag = tags

In [60]:
len(test_publish), len(test_title), len(test_author), len(test_content), len(test_tag)

(34645, 34645, 34645, 34645, 34645)

**Add the following groups of features:**

- Tf-Idf with article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf with article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
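
The time features from the list above can be sketched with the pandas `.dt` accessor. The two timestamps and the column names below are made up for illustration; the hour boundaries (7-11 morning, 12-18 day, 19-23 evening, 0-6 night) are one possible choice, not a requirement of the assignment.

```python
import pandas as pd

# Two illustrative timestamps in the same format as the 'published' field.
stamps = pd.Series(pd.to_datetime(['2012-08-13T22:54:53.510Z',
                                   '2012-08-18T09:15:00.000Z']))

hour = stamps.dt.hour
time_features = pd.DataFrame({
    'hour': hour,
    # dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday
    'is_weekend': (stamps.dt.dayofweek >= 5).astype(int),
    # Series.between is inclusive on both ends by default
    'is_morning': hour.between(7, 11).astype(int),
    'is_day': hour.between(12, 18).astype(int),
    'is_evening': hour.between(19, 23).astype(int),
    'is_night': hour.between(0, 6).astype(int),
})
print(time_features)
```

The vectorized `.dt` calls avoid a Python-level `apply` over every row, which matters little here but scales better.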

In [61]:
print(train_publish[0])
print(pd.to_datetime(train_publish[0]))

2012-08-13T22:54:53.510Z
2012-08-13 22:54:53.510000


In [62]:
# convert the extracted lists to DataFrames

In [63]:
train = pd.DataFrame()
train['publish'] = pd.to_datetime(train_publish, format='%Y-%m-%dT%H:%M:%S.%fZ')
train['title'] = train_title
train['author'] = train_author
train['content'] = train_content
train['tag'] = train_tag

In [64]:
test = pd.DataFrame()
test['publish'] = pd.to_datetime(test_publish, format='%Y-%m-%dT%H:%M:%S.%fZ')
test['title'] = test_title
test['author'] = test_author
test['content'] = test_content
test['tag'] = test_tag

In [18]:
train_weekend = train['publish'].apply(lambda x: x.dayofweek in [5, 6]).astype('int')
train_hour = train['publish'].apply(lambda x: x.hour)
train_morning = ((train_hour >= 7) & (train_hour <= 11)).astype('int')
train_day = ((train_hour >= 12) & (train_hour <= 18)).astype('int')
train_everning = ((train_hour >= 19) & (train_hour <= 23)).astype('int')
train_night = ((train_hour >= 0) & (train_hour <= 6)).astype('int')

In [65]:
train_tag_count = train['tag'].apply(lambda x: len(x.split()))

In [19]:
test_weekend = test['publish'].apply(lambda x: x.dayofweek in [5, 6]).astype('int')
test_hour = test['publish'].apply(lambda x: x.hour)
test_morning = ((test_hour >= 7) & (test_hour <= 11)).astype('int')
test_day = ((test_hour >= 12) & (test_hour <= 18)).astype('int')
test_everning = ((test_hour >= 19) & (test_hour <= 23)).astype('int')
test_night = ((test_hour >= 0) & (test_hour <= 6)).astype('int')

In [66]:
test_tag_count = test['tag'].apply(lambda x: len(x.split()))

In [20]:
%%time
train_content_vect = TfidfVectorizer(ngram_range=(1,2), max_features=100000)
X_train_content_sparse = train_content_vect.fit_transform(train['content'])

Wall time: 10min 40s


In [21]:
%%time
train_title_vect = TfidfVectorizer(ngram_range=(1,2), max_features=100000)
X_train_title_sparse = train_title_vect.fit_transform(train['title'])

Wall time: 4.05 s


In [68]:
%%time
train_tag_vect = TfidfVectorizer(ngram_range=(1,1))
X_train_tag_sparse = train_tag_vect.fit_transform(train['tag'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [22]:
%%time
# Reuse the vectorizer fitted on train so that test columns match the train vocabulary
X_test_content_sparse = train_content_vect.transform(test['content'])

Wall time: 4min 11s


In [23]:
%%time
# Reuse the vectorizer fitted on train so that test columns match the train vocabulary
X_test_title_sparse = train_title_vect.transform(test['title'])

Wall time: 1.94 s
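
A note on vectorizers: a linear model assumes that column `j` means the same thing in the train and test matrices, so the vocabulary is usually learned once on the training texts and then reused. The toy documents below are invented; the sketch only illustrates that two independently fitted vectorizers assign different column indices to the same word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['deep learning for text', 'text mining basics']
test_docs = ['advanced cooking basics', 'learning to cook']

# Learn the vocabulary and IDF weights on the training documents only...
vect = TfidfVectorizer()
X_train_demo = vect.fit_transform(train_docs)
# ...and reuse them for the test documents; unseen words are dropped.
X_test_demo = vect.transform(test_docs)

# A vectorizer fitted on the test documents assigns different columns.
other_vocab = TfidfVectorizer().fit(test_docs).vocabulary_

print(X_train_demo.shape, X_test_demo.shape)
print(vect.vocabulary_['basics'], other_vocab['basics'])
```

With `max_features` set, two separately fitted vectorizers can produce matrices of the same width, so the mismatch does not raise an error and is easy to miss.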


In [None]:
%%time
test_tag_vect = TfidfVectorizer(ngram_range=(1,1), max_features=50000)
X_test_tag_sparse = test_tag_vect.fit_transform(test['tag'])

In [None]:
## convert authors

In [24]:
df_full = pd.concat([train, test])

In [25]:
#authors_flatten = df_full['author'].values.flatten()

In [26]:
authors_set = set(df_full['author'].values)

In [29]:
#authors_set
#print(len(authors_set))
#print(authors_flatten.shape)

In [30]:
# Map each author to a 1-based integer id (column 0 is dropped when the sparse matrix is built)
authors_dict = {author: idx for idx, author in enumerate(authors_set, start=1)}


In [31]:
#authors_dict
train['author_encode'] = train['author'].apply(lambda x: authors_dict[x]).astype('int')
test['author_encode'] = test['author'].apply(lambda x: authors_dict[x]).astype('int')

In [32]:
df_full['author_encode'] = df_full['author'].apply(lambda x: authors_dict[x]).astype('int')

In [33]:
authors_flatten = df_full['author_encode'].values.flatten()

In [34]:
full_authors_sparse = csr_matrix(([1] * authors_flatten.shape[0], authors_flatten, range(0, authors_flatten.shape[0] + 1, 1)))[:, 1:]

In [35]:
idx_split = len(train)

In [36]:
X_train_author_sparse = full_authors_sparse[:idx_split, :]

In [37]:
X_test_author_sparse = full_authors_sparse[idx_split:, :]
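
As a hedged alternative to building the CSR matrix by hand, scikit-learn's `OneHotEncoder` yields the same kind of bag-of-authors matrix. Unlike the cells above, this sketch fits on the training authors only; the author names and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy author columns; the names are made up for illustration.
train_authors = np.array(['alice', 'bob', 'alice']).reshape(-1, 1)
test_authors = np.array(['bob', 'carol']).reshape(-1, 1)

# handle_unknown='ignore' encodes authors unseen in train as all-zero rows.
encoder = OneHotEncoder(handle_unknown='ignore')
train_author_ohe = encoder.fit_transform(train_authors)  # sparse matrix
test_author_ohe = encoder.transform(test_authors)

print(train_author_ohe.toarray())
print(test_author_ohe.toarray())
```

Authors that appear only in the test set get no column here, which costs nothing for a linear model: a column that is all zeros in train would receive a zero weight anyway.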

In [38]:
## good feature

In [39]:
train_content_len = np.log(train['content'].apply(len))
train_title_len = np.log(train['title'].apply(len))

In [40]:
#train_title_len.head()
#np.log(train_content_len).min()
#np.log(train_title_len).min()

In [41]:
test_content_len = np.log(test['content'].apply(len))
test_title_len = np.log(test['title'].apply(len))
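
One caveat with log-length features: `np.log(0)` is `-inf`, so an empty title or body would put an infinite value into the feature matrix. A small sketch of a safer variant, assuming `np.log1p` is an acceptable substitute (the sample strings are invented):

```python
import numpy as np
import pandas as pd

texts = pd.Series(['', 'short title', 'a somewhat longer piece of text'])
lengths = texts.str.len()

# log1p(x) = log(1 + x), so an empty string maps to 0 instead of -inf.
log_len = np.log1p(lengths)
print(log_len.round(3).tolist())
```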

**Join all sparse matrices.**

In [42]:
X_train_sparse = hstack([
    X_train_content_sparse,
    X_train_title_sparse,
    X_train_author_sparse,
    train_weekend.values.reshape(-1, 1),
    train_morning.values.reshape(-1, 1),
    train_day.values.reshape(-1, 1),
    train_everning.values.reshape(-1, 1),
    train_night.values.reshape(-1, 1),
    train_content_len.values.reshape(-1, 1),
    train_title_len.values.reshape(-1, 1),
]).tocsr()

In [43]:
X_test_sparse = hstack([
    X_test_content_sparse,
    X_test_title_sparse,
    X_test_author_sparse,
    test_weekend.values.reshape(-1, 1),
    test_morning.values.reshape(-1, 1),
    test_day.values.reshape(-1, 1),
    test_everning.values.reshape(-1, 1),
    test_night.values.reshape(-1, 1),
    test_content_len.values.reshape(-1, 1),
    test_title_len.values.reshape(-1, 1),
]).tocsr()

**Read train target and split data for validation.**

In [44]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [45]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [46]:
# Your code here
simple_ridege = Ridge()
simple_ridege.fit(X_train_part_sparse, y_train_part)


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [47]:
simple_ridege_pred = simple_ridege.predict(X_valid_sparse)
mean_absolute_error(y_valid, simple_ridege_pred)

1.0572284116357322

In [48]:
ridge_log = Ridge(random_state=17)

In [49]:
%%time
ridge_log.fit(X_train_part_sparse, np.log1p(y_train_part))

Wall time: 2min 35s


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [50]:
%%time
ridge_log_pred = np.expm1(ridge_log.predict(X_valid_sparse))

Wall time: 172 ms


In [51]:
mean_absolute_error(y_valid, ridge_log_pred)

1.0494639636432592

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [74]:
# Your code here
simple_ridege.fit(X_train_sparse, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [75]:
ridge_test_pred = simple_ridege.predict(X_test_sparse)

In [76]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [77]:
write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA,
                                                    'assignment6_medium_submission.csv'))

In [98]:
# use ridgeCV
n_alphas = 10
ridge_alphas = np.logspace(-2, 6, n_alphas)

In [100]:
from sklearn.linear_model import RidgeCV

In [101]:
#RidgeCV?

In [109]:
#import sklearn
#sorted(sklearn.metrics.SCORERS.keys())

In [110]:
# RidgeCV has no parallelism argument, so the invalid `njob` keyword is dropped
ridge_cv = RidgeCV(alphas=ridge_alphas, 
                   scoring='neg_mean_absolute_error',
                   cv=3)


In [111]:
%%time
ridge_cv.fit(X_train_sparse, y_train)

Wall time: 1h 27min 53s


RidgeCV(alphas=array([1.00000e-02, 7.74264e-02, 5.99484e-01, 4.64159e+00, 3.59381e+01,
       2.78256e+02, 2.15443e+03, 1.66810e+04, 1.29155e+05, 1.00000e+06]),
    cv=3, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring='neg_mean_absolute_error', store_cv_values=False)

In [113]:
ridge_cv_test_pred = ridge_cv.predict(X_test_sparse)

In [114]:
write_submission_file(ridge_cv_test_pred, os.path.join(PATH_TO_DATA,
                                                    'assignment6_medium_submission1.csv'))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [78]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      os.path.join(PATH_TO_DATA,
                                   'medium_all_zeros_submission.csv'))

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [None]:
ridge_test_pred_modif = ridge_test_pred  # Your code here

In [None]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment6_medium_submission_with_hack.csv'))
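
One way to read the all-zeros hint: the target is non-negative, so the MAE of an all-zero submission equals the mean of the hidden target on the scored part of the test set. A minimal sketch on synthetic data (all arrays and names below are invented stand-ins, not competition data) that shifts predictions so their mean matches that leaked value:

```python
import numpy as np

rng = np.random.default_rng(17)

# Synthetic stand-ins: a non-negative "hidden" target and biased predictions.
y_hidden = rng.exponential(scale=3.0, size=1000)
preds = y_hidden * 0.5 + rng.normal(scale=0.3, size=1000)

# For non-negative targets, the MAE of an all-zero submission is mean(y).
mae_all_zeros = np.mean(np.abs(y_hidden - 0.0))

# Shift predictions so that their mean matches the leaked target mean.
preds_shifted = preds + (mae_all_zeros - preds.mean())

mae_before = np.mean(np.abs(y_hidden - preds))
mae_after = np.mean(np.abs(y_hidden - preds_shifted))
print(mae_before, mae_after)
```

In the competition, `mae_all_zeros` would be the public LB score of the all-zero submission. A shift like this is not guaranteed to help, and it is tuned to the public part of the test set only.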

That's it for the assignment. Many more credits will be given to the winners of this competition; check the [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition: don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You don't have to ignore the HTML: you can extract some features from it
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we kept only 50k features and used Ridge's default regularization strength (`alpha=1`); both can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But you are free (though not obliged) to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>