<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Assignment #2. Spring 2019
## <center>  Competition 2. Predicting Medium articles popularity with Ridge Regression <br>(beating baselines in the "Medium" competition)
    
<img src='../../img/medium_claps.jpg' width=40% />


In this [competition](https://www.kaggle.com/c/how-good-is-your-medium-article) we are predicting Medium article popularity based on its features like content, title, author, tags, reading time etc. 

Before working on the assignment, it is worth reviewing the corresponding course material:
 1. [Classification, Decision Trees and k Nearest Neighbors](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), the same as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn) (basics of machine learning are covered here)
 2. Linear classification and regression in 5 parts: 
    - [ordinary least squares](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-1-ols)
    - [linear classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification)
    - [regularization](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-3-regularization)
    - [logistic regression: pros and cons](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit)
    - [validation](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-5-validation)
 3. You can also practice with demo assignments, which are simpler and already shared with solutions: 
    - "Sarcasm detection with logistic regression": [assignment](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit) + [solution](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution)
    - "Linear regression as optimization": [assignment](https://www.kaggle.com/kashnitsky/a4-demo-linear-regression-as-optimization/edit) (solution cannot be officially shared)
    - "Exploring OLS, Lasso and Random Forest in a regression task": [assignment](https://www.kaggle.com/kashnitsky/a6-demo-linear-models-and-rf-for-regression) + [solution](https://www.kaggle.com/kashnitsky/a6-demo-regression-solution)
 4. Baseline with Ridge regression and "bag of words" for article content, [Kernel](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline)
 5. Other [Kernels](https://www.kaggle.com/c/how-good-is-your-medium-article/kernels?sortBy=voteCount&group=everyone&pageSize=20&competitionId=8673) in this competition. You can share yours as well, but not high-performing ones (Public LB MAE shall be > 1.5). Please don't spoil the competitive spirit.  
 6. If that's still not enough, watch two videos (Linear regression and regularization) from [mlcourse.ai/video](https://mlcourse.ai/video). The second one, on LTV prediction, is something that you won't typically find in a MOOC: real problem, real metrics, real data.

**Your task:**
 1. "Freeride". Come up with good features to beat the baselines "A2 baseline (10 credits)" and "A2 strong baseline (20 credits)". As names suggest, you'll get 10 more credits for beating the first one, and 10 more (20 in total) for beating the second one. You need to name your [team](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/team) (out of 1 person) in full accordance with the [course rating](https://docs.google.com/spreadsheets/d/1LAy1eK8vIONzIWgcCEaVmhKPSj579zK5lrECf_tQT60/edit?usp=sharing) (for newcomers: you need to name your team with your real full name). You can think of it as a part of the assignment.
 2. If you've beaten "A2 baseline (10 credits)" or performed better, you need to upload your solution as described in [course roadmap](https://mlcourse.ai/roadmap) ("Kaggle Inclass Competition Medium"). For all baselines that you see on Public Leaderboard, it's OK to beat them on Public LB as well. But 10 winners will be defined according to the private LB, which will be revealed by @yorko on March 11. 
 
 ### <center> Deadline for A2: 2019 March 10, 20:59 GMT (London time)
 
### How to get help
In [ODS Slack](https://opendatascience.slack.com) (if you still don't have access, fill in the [form](https://docs.google.com/forms/d/1BMqcUc-hIQXa0HB_Q2Oa8vWBtGHXk8a6xo5gPnMKYKA/edit) mentioned on the mlcourse.ai main page), we have a channel **#mlcourse_ai_news** with announcements from the course team.
You can discuss the course content freely in the **#mlcourse_ai** channel (we still have a huge Russian-speaking group, they have a separate channel **#mlcourse_ai_rus**).

Please stick to this special thread for your questions:
 - [#a2_medium](https://opendatascience.slack.com/archives/C91N8TL83/p1549882568052400) 
 
Help each other without sharing actual code. Our TA Artem @datamove is there to help (only in the mentioned thread, do not write to him directly).

In [10]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge, SGDRegressor, RidgeCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV

In [11]:
PATH_TO_DATA = './data'
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

The following code strips all HTML tags from an article's content.

In [12]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
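
As a quick sanity check of the tag-stripping idea, the sketch below runs a made-up HTML snippet through an equivalent parser. The names `TagStripper`, `strip_tags_demo` and the sample string are illustrative assumptions, not part of the assignment; only `html.parser.HTMLParser` from the standard library is used.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every piece of text found between tags
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags_demo(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser may still be buffering
    return stripper.get_text()


sample_html = '<p>Ridge &amp; <b>Lasso</b> regression</p>'
print(strip_tags_demo(sample_html))
```

Tags are dropped and the `&amp;` entity is decoded, so only the readable text is left for the vectorizer.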

In [17]:
#from nltk.corpus import stopwords
#stop_words = set(stopwords.words('english'))
#def remove_stop_words(words):
#    words2 = [word for word in words if not word in stop_words]
#    return words2

from nltk.tokenize import word_tokenize

def remove_punctuation(text):
    
    # Tokenize the string into words
    #tokens = word_tokenize(text)
    tokens = text.split()

    # Keep only purely alphabetic tokens, lowercased. Note that a token with
    # punctuation attached (e.g. "word,") is dropped entirely, not cleaned.
    words = [word.lower() for word in tokens if word.isalpha()]
    return ' '.join(words)
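
To make the effect of the `isalpha()` filter concrete, here is a small sketch on a made-up sentence. `keep_alpha_tokens` mirrors the logic of `remove_punctuation`, while `strip_punctuation_variant` is a hypothetical alternative (not used in this notebook) that extracts runs of letters instead of discarding whole tokens.

```python
import re


def keep_alpha_tokens(text):
    # Same logic as remove_punctuation: split on whitespace, keep alphabetic tokens only
    return ' '.join(word.lower() for word in text.split() if word.isalpha())


def strip_punctuation_variant(text):
    # Hypothetical alternative: extract runs of letters, so "world!" becomes "world"
    return ' '.join(re.findall(r'[^\W\d_]+', text.lower()))


sample = 'Hello, world! This is 2019'
print(keep_alpha_tokens(sample))
print(strip_punctuation_variant(sample))
```

Which variant works better for the Tf-Idf features is an empirical question; the point is only that the two are not equivalent.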


Supplementary function to read a JSON line without crashing on escape characters.

In [18]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
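
The function above recovers the position of the offending character by parsing the exception message. A possible variation, sketched below under the assumption that only `json.JSONDecodeError` needs handling, reads the position from the exception's `pos` attribute and bounds the number of repair attempts. The name `read_json_line_by_pos`, the `max_attempts` parameter and the sample line are made up for illustration.

```python
import json


def read_json_line_by_pos(line, max_attempts=100):
    """Hypothetical variant of read_json_line that uses JSONDecodeError.pos."""
    for _ in range(max_attempts):
        try:
            return json.loads(line)
        except json.JSONDecodeError as e:
            if e.pos >= len(line):
                # Nothing left to blank out (e.g. the line is truncated)
                raise
            # Blank out the character at the reported error position and retry
            line = line[:e.pos] + ' ' + line[e.pos + 1:]
    raise ValueError('could not repair the line in {} attempts'.format(max_attempts))


# A raw tab inside a string is not valid JSON and triggers the repair path
broken_line = '{"title": "tab\there"}'
print(read_json_line_by_pos(broken_line))
```

The loop replaces the recursion, so a line that cannot be repaired ends with an exception rather than with unbounded recursion.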

#### Fields available in the json_data:

```js
['_id', '_timestamp', '_spider', 'url', 'domain', 'published',
 'title', 'content', 'author', 'image_url', 'tags', 'link_tags', 'meta_tags']
```

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [19]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open(os.path.join(path_to_data,
                                       '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), encoding='utf-8') as inp_json_file:
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            # Your code here
            
            
            content = strip_tags(json_data['content']).replace('\n', '').replace('\r', '')
            content = remove_punctuation(content)
            feature_files[0].write(content + '\n')
            #feature_files[0].write(strip_tags(json_data['content']).replace('\n', '').replace('\r', '') + '\n')
            #feature_files[0].write(json_data['content'].replace('\n', '') + '\n')
            
            feature_files[1].write(json_data['published']['$date'] + '\n')
            #print(json_data['published'])
            # {'$date': '2015-08-03T07:44:50.331Z'}
            
            feature_files[2].write(json_data['title'].replace('\n', '').replace('\r', '') + '\n')
            

            
            # print(json_data['author'])
            # {'name': None, 'url': 'https://medium.com/@Medium', 'twitter': '@Medium'}
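            feature_files[3].write(json_data['author']['url'] + '\n') # TODO get the author from the end of the url?

    # Close the output files so that all buffered lines are flushed to disk
    for feature_file in feature_files:
        feature_file.close()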

            #import dateutil.parser as dp
            #t = '1984-06-02T19:05:00.000Z'
            #parsed_t = dp.parse(t)
            #print(parsed_t)
            #datetime.datetime(1984, 6, 2, 19, 5, tzinfo=tzutc())

In [20]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)





In [21]:
!wc -l ./data/train_content.txt

   62313 ./data/train_content.txt


In [22]:
!wc -l ./data/train_published.txt

   62313 ./data/train_published.txt


In [23]:
!wc -l ./data/train_title.txt

   62313 ./data/train_title.txt


In [24]:
!wc -l ./data/train_author.txt

   62313 ./data/train_author.txt


In [25]:
#!head -5 ./data/train_content.txt

In [26]:
!head -5 ./data/train_published.txt

2012-08-13T22:54:53.510Z
2015-08-03T07:44:50.331Z
2017-02-05T13:08:17.410Z
2017-05-06T08:16:30.776Z
2017-06-04T14:46:25.772Z


In [27]:
!head -5 ./data/train_title.txt

Medium Terms of Service – Medium Policy – Medium
Amendment to Medium Terms of Service Applicable to U.S. Government Users
走入山與海之間：閩東大刀會和兩岸走私 – Yun-Chen Chien（簡韻真） – Medium
How fast can a camera get? – What comes to mind – Medium
A game for the lonely fox – What comes to mind – Medium


In [28]:
!head -5 ./data/train_author.txt

https://medium.com/@Medium
https://medium.com/@Medium
https://medium.com/@aelcenganda
https://medium.com/@vaibhavkhulbe
https://medium.com/@vaibhavkhulbe


In [29]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)





**Add the following groups of features:**
 - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000, but you can try adding more)
 - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000, but you can try adding more)
 - Time features: publication hour, whether it's morning, day or night, and whether it's a weekend
 - Bag of authors (i.e. One-Hot-Encoded author names)

In [16]:
# Your code here

In [17]:
#content_df = pd.read_csv('./data/train_content.txt', header=None, names=['content'])
#content_df.head()

#### Add feature: bag of authors (i.e. One-Hot-Encoded author names)

In [33]:
train_author_df = pd.read_csv('./data/train_author.txt', header=None, names=['author'])
train_author_df.head(3)

Unnamed: 0,author
0,https://medium.com/@Medium
1,https://medium.com/@Medium
2,https://medium.com/@aelcenganda


In [34]:
test_author_df = pd.read_csv('./data/test_author.txt', header=None, names=['author'])
test_author_df.head(3)

Unnamed: 0,author
0,https://medium.com/@HITRECORD.org
1,https://medium.com/@mariabustillos
2,https://medium.com/@HITRECORD.org


Create a combined dataframe of train and test authors, so that one-hot encoding produces the same columns for both sets.

In [35]:
# United dataframe of the author data 
full_author_df = pd.concat([train_author_df, test_author_df])

# Index to split the training and test data sets
idx_split = train_author_df.shape[0]

In [36]:
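onehotencoder = OneHotEncoder(categories='auto')
# NB: .toarray() turns the encoder's sparse output into a dense NumPy array, so despite
# its name X_full_author_sparse is dense and needs a lot of RAM (96958 x 45374 values)
X_full_author_sparse = onehotencoder.fit_transform(full_author_df).toarray()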
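
For comparison, the sketch below shows the same fit-on-everything, split-by-row-index idea on toy data while keeping the encoder output sparse. The author handles and all `toy_*` names are made up for illustration; note that `len()` is not defined for SciPy sparse matrices, so `.shape` is used instead.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

# Made-up author handles standing in for the real train/test author files
toy_train_authors = pd.DataFrame({'author': ['@ann', '@bob', '@ann']})
toy_test_authors = pd.DataFrame({'author': ['@cat', '@ann']})

# Fit on the concatenation so that train and test share one set of columns
toy_full_authors = pd.concat([toy_train_authors, toy_test_authors])
toy_split = toy_train_authors.shape[0]

toy_encoder = OneHotEncoder(categories='auto')
X_toy_full = toy_encoder.fit_transform(toy_full_authors).tocsr()  # stays sparse

# Row slicing works on CSR matrices, so the split is the same as for dense arrays
X_toy_train = X_toy_full[:toy_split, :]
X_toy_test = X_toy_full[toy_split:, :]
print(issparse(X_toy_train), X_toy_train.shape, X_toy_test.shape)
```

With three distinct authors, both parts end up with three columns, and each row stores a single non-zero value.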

In [37]:
print(len(X_full_author_sparse))
print(len(X_full_author_sparse[0]))

96958
45374


Split the authors back into training and test sets 

In [38]:
X_train_author_sparse = X_full_author_sparse[:idx_split,:]

In [39]:
print(len(X_train_author_sparse))
print(len(X_train_author_sparse[0]))
X_train_author_sparse.shape

62313
45374


(62313, 45374)

In [41]:
X_test_author_sparse = X_full_author_sparse[idx_split:,:]

In [42]:
print(len(X_test_author_sparse))
print(len(X_test_author_sparse[0]))
X_test_author_sparse.shape

34645
45374


(34645, 45374)

In [16]:
# clear this out to save RAM since nothing is using it anymore?
# X_full_author_sparse = None

#### Add Title Tf-Idf feature

In [43]:
#foo = pd.read_csv('./data/test_title.txt', header=None, names=['title'])
#len(foo)

In [66]:
%%time
cv = TfidfVectorizer(ngram_range=(1, 2), max_features=100000, analyzer="word", stop_words="english")

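# The files were written as UTF-8, so read them back with the same encoding
with open('./data/train_title.txt', encoding='utf-8') as inp_train_file:
    X_train_title_sparse = cv.fit_transform(inp_train_file)
with open('./data/test_title.txt', encoding='utf-8') as inp_test_file:
    X_test_title_sparse = cv.transform(inp_test_file)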

print(X_train_title_sparse.shape, X_test_title_sparse.shape)

(62313, 100000) (34645, 100000)
CPU times: user 3.98 s, sys: 53.2 ms, total: 4.03 s
Wall time: 2.25 s
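
The fit-on-train, transform-on-test pattern used for the titles can be seen on a toy corpus. The sketch below uses made-up titles and `toy_*` names; it only illustrates how `ngram_range=(1, 2)` builds a vocabulary of unigrams and bigrams from the training texts and how words unseen in training are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles; the real notebook reads them line by line from text files
toy_train_titles = ['how to learn python', 'learn machine learning fast']
toy_test_titles = ['python machine learning', 'unseen words only']

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_toy_train = toy_vectorizer.fit_transform(toy_train_titles)  # learns the vocabulary
X_toy_test = toy_vectorizer.transform(toy_test_titles)        # reuses that vocabulary

print(sorted(toy_vectorizer.vocabulary_))
print(X_toy_train.shape, X_toy_test.shape)
```

Both matrices share the same columns, which is what makes it possible to stack them with the other feature blocks later on.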


#### Add Content Tf-Idf feature

In [45]:
%%time
cv = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)

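# The files were written as UTF-8, so read them back with the same encoding
with open('./data/train_content.txt', encoding='utf-8') as inp_train_file:
    X_train_content_sparse = cv.fit_transform(inp_train_file)
with open('./data/test_content.txt', encoding='utf-8') as inp_test_file:
    X_test_content_sparse = cv.transform(inp_test_file)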

print(X_train_content_sparse.shape, X_test_content_sparse.shape)

(62313, 100000) (34645, 100000)
CPU times: user 3min 59s, sys: 6.07 s, total: 4min 6s
Wall time: 4min 3s


#### Add time features

In [46]:
train_published_df = pd.read_csv('./data/train_published.txt',
                           header=None, names=['published'], parse_dates=['published'])
test_published_df = pd.read_csv('./data/test_published.txt',
                           header=None, names=['published'], parse_dates=['published'])
#train_published_df.head()
#test_published_df.head()

In [47]:
# publication hour, whether it's morning, day, night, whether it's a weekend

In [48]:
def add_time_features(df):    
    hour = df['published'].apply(lambda ts: ts.hour)
    
    morning = ((hour >= 7) & (hour <= 11)).astype('int')
    day = ((hour >= 12) & (hour <= 18)).astype('int')
    evening = ((hour >= 19) & (hour <= 23)).astype('int')
    night = ((hour >= 0) & (hour <= 6)).astype('int')
    
    # dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday
    dayofweek = df['published'].apply(lambda ts: ts.dayofweek)
    is_weekend = ((dayofweek == 5) | (dayofweek == 6)).astype('int')

    # Fix the categories to hours 0..23 so that train and test always get the same 24 columns
    onehotencoder = OneHotEncoder(categories=[list(range(24))])
    hour_published_hot_encoded = onehotencoder.fit_transform(hour.values.reshape(-1, 1)).toarray()

    # The month encoding is computed here but left out of X below
    month_published = df['published'].apply(lambda ts: ts.month).astype('int')
    onehotencoder = OneHotEncoder(categories='auto')
    month_published_hot_encoded = onehotencoder.fit_transform(month_published.values.reshape(-1, 1)).toarray()

    
    empty_sparse_matrix = csr_matrix((df.shape[0], 0))
    X = hstack([empty_sparse_matrix,
                morning.values.reshape(-1, 1),
                day.values.reshape(-1, 1),
                evening.values.reshape(-1, 1),
                night.values.reshape(-1, 1),
                is_weekend.values.reshape(-1,1),
                #month_published_hot_encoded # ignore for now - the test set does not have data for all 12 months
                hour_published_hot_encoded  
                ]).tocsr()
    return X

In [49]:
%%time
X_train_time_features_sparse = add_time_features(train_published_df)
X_test_time_features_sparse = add_time_features(test_published_df)

CPU times: user 2.73 s, sys: 38.8 ms, total: 2.77 s
Wall time: 2.77 s


In [50]:
X_train_time_features_sparse.shape, X_test_time_features_sparse.shape

((62313, 29), (34645, 29))

### Join all sparse matrices

In [78]:
%%time
#X_train_sparse = X_train_author_sparse
X_train_sparse = hstack([X_train_author_sparse,
                         X_train_time_features_sparse,
                         X_train_title_sparse
                         #X_train_content_sparse
                        ]).tocsr()

CPU times: user 27.1 s, sys: 17.1 s, total: 44.2 s
Wall time: 47 s


In [79]:
%%time
#X_test_sparse = X_test_author_sparse
X_test_sparse = hstack([X_test_author_sparse,
                        X_test_time_features_sparse,
                        X_test_title_sparse
                        #X_test_content_sparse
                       ]).tocsr()

CPU times: user 15.2 s, sys: 9.44 s, total: 24.7 s
Wall time: 26.1 s


### Read train target and split data for validation

In [80]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [81]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [82]:
#lr = SGDRegressor(random_state=17)

In [83]:
lr = Ridge(random_state=17, alpha=1.0) # TODO - tune the params

In [84]:
%%time
lr.fit(X_train_part_sparse, y_train_part)

CPU times: user 1.72 s, sys: 17.1 ms, total: 1.74 s
Wall time: 900 ms


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [85]:
%%time
y_pred = lr.predict(X_valid_sparse)

CPU times: user 1.55 ms, sys: 1.07 ms, total: 2.62 ms
Wall time: 1.75 ms


In [86]:
score = mean_absolute_error(y_valid, y_pred)
print(score) # 1.082827440142141  # ACTUAL LB 1.73588

1.157637379513177


In [34]:
# Alpha values  # ACTUAL LB 1.72882
# 0.1    1.1171688614045647
# 1.0    1.0756046224566074
# 10.0   1.1238285918537352
# 0.001  1.1422441780065717
# 0.01   1.1386397887882853
# 0.0001

In [35]:
# Default values for alphas is alphas=(0.1, 1.0, 10.0)
# TODO look at setting this param too cv : int, cross-validation generator or an iterable, optional
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html
#modelCV = RidgeCV(alphas = [1.0, 0.1, 0.01], store_cv_values = True)

In [None]:
#%%time
#modelCV.fit(X_train_part_sparse, y_train_part)

In [None]:
#print(np.mean(modelCV.cv_values_, axis=0))

Make a submission file (no cross validation)

In [76]:
%%time
# Make a prediction for test data set
ridge_test_pred = lr.predict(X_test_sparse)

CPU times: user 66.1 ms, sys: 883 µs, total: 67 ms
Wall time: 65.7 ms


In [88]:
## Leaderboard probing

In [89]:
mean_test_target = 4.33328

In [90]:
write_submission_file(ridge_test_pred + mean_test_target - y_train.mean(), './submissions/02-assignment2_medium_submission-probing.csv')
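
The shift applied here rests on a simple observation: the target is `log1p` of a count, so it is never negative, and the MAE of an all-zero submission therefore equals the mean of the test target. The toy numbers below are made up purely to illustrate that arithmetic; `mae` is a hypothetical helper written in plain Python.

```python
def mae(y_true, y_pred):
    # Mean absolute error written out by hand
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up non-negative targets standing in for the hidden test labels
toy_test_target = [0.0, 1.5, 3.0, 5.5]
toy_zeros = [0.0] * len(toy_test_target)

# The score of the all-zero submission reveals the mean of the hidden target
toy_mean_test_target = mae(toy_test_target, toy_zeros)

# Made-up predictions and train mean: the model is centred on the train mean,
# so the predictions are shifted by the gap between the two means
toy_pred = [1.0, 1.0, 2.0, 4.0]
toy_train_mean = 2.0
toy_shifted = [p + toy_mean_test_target - toy_train_mean for p in toy_pred]

print(toy_mean_test_target)
print(mae(toy_test_target, toy_pred), mae(toy_test_target, toy_shifted))
```

Whether the shift helps on the real leaderboard depends on how far the train and test means actually differ.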

In [None]:
# write_submission_file(ridge_test_pred + mean_test_target - y_train.mean(), 'submissions/03-ridge_submission.csv')

In [77]:
# in theory may want to use this as the baseline and do the grid search later (since it is slow)
#write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA, 'assignment2_medium_submission.csv'))
write_submission_file(ridge_test_pred, './submissions/02-assignment2_medium_submission.csv')

## GridSearchCV

In [36]:
lr_sgd = SGDRegressor(random_state=17)

In [46]:
# TODO add scoring = 'neg_mean_squared_error' to the GridSearchCV - or the not squared one?
# TODO tune the param_grid below

# TODO pass fewer options

#(0.4211899487138652,
# {'alpha': 1e-05,
#  'learning_rate': 'optimal',
#  'loss': 'huber',
#  'penalty': 'l2'})

In [39]:
%%time
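param_grid = {
    'alpha': 10.0 ** -np.arange(1, 7),
    # NB: 'squared_loss' was renamed to 'squared_error' in scikit-learn 1.0
    'loss': ['squared_loss', 'huber', 'epsilon_insensitive'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'learning_rate': ['constant', 'optimal', 'invscaling'],
}
# No `scoring` is passed, so best_score_ is the regressor's default score (R^2), not MAE
clf_sgd = GridSearchCV(lr_sgd, param_grid, cv=3) # TODO cv to 5?
# The search must be fitted before best_score_ becomes available
clf_sgd.fit(X_train_sparse, y_train)
print("Best score: " + str(clf_sgd.best_score_))
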
Best score: 0.4211899487138652
CPU times: user 39min 16s, sys: 4min 34s, total: 43min 50s
Wall time: 27min 26s


In [40]:
clf_sgd.best_score_, clf_sgd.best_params_

(0.4211899487138652,
 {'alpha': 1e-05,
  'learning_rate': 'optimal',
  'loss': 'huber',
  'penalty': 'l2'})

In [42]:
%%time
# Make a prediction for test data set
sgd_test_pred = clf_sgd.predict(X_test_sparse)

CPU times: user 95.3 ms, sys: 814 µs, total: 96.1 ms
Wall time: 95 ms


In [44]:
# aaascore---
write_submission_file(sgd_test_pred, './submissions/01-sgd-grid-assignment2_medium_submission-grid.csv')

In [45]:
# VALUES FOR SGDRegressor
#%%time
#param_grid = {
#    'alpha': 10.0 ** -np.arange(1, 7),
#    'loss': ['squared_loss', 'huber', 'epsilon_insensitive'],
#    'penalty': ['l2', 'l1', 'elasticnet'],
#    'learning_rate': ['constant', 'optimal', 'invscaling'],
#}
#clf = GridSearchCV(lr, param_grid, cv=3) # TODO cv to 5?
#clf.fit(X_train_sparse, y_train)
#print("Best score: " + str(clf.best_score_))

In [46]:
#clf.best_score_, clf.best_params_  #

In [47]:
# The SGDRegressor instance refitted with the best hyperparameters is stored in clf_sgd.best_estimator_.
# Its coef_ and intercept_ are the fitted parameters of that best model.

### TODO look at the params for the GridSearchCV and see if they can be tuned

### TODO scoring should be neg_mean_absolute_error instead?
It takes 30 mins (twice as long) to use neg_mean_absolute_error instead of neg_mean_squared_error?
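
Since the competition metric is MAE, one option is to let the grid search optimise it directly. The sketch below does that on synthetic data; all `toy_*` names, the data and the alpha grid are made up for illustration and say nothing about the run time or the best alpha on the real features.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for the demo, not the Medium data)
rng = np.random.RandomState(17)
X_toy = rng.rand(200, 5)
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(200)

toy_alphas = [0.01, 0.1, 1.0, 10.0]
toy_search = GridSearchCV(Ridge(), {'alpha': toy_alphas},
                          cv=3, scoring='neg_mean_absolute_error')
toy_search.fit(X_toy, y_toy)

# scikit-learn maximises scores, so the MAE is reported with a minus sign
toy_best_mae = -toy_search.best_score_
print(toy_search.best_params_, toy_best_mae)
```

The sign flip is worth remembering when comparing `best_score_` with the MAE computed on the hold-out part.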

In [48]:
%%time
params={'alpha': [25,10,4,2,1.0,0.8,0.5,0.3,0.2,0.1,0.05,0.02,0.01]}
#rdg_reg = Ridge()
clf = GridSearchCV(lr,params,cv=2,verbose = 1, scoring = 'neg_mean_squared_error')

CPU times: user 86 µs, sys: 244 µs, total: 330 µs
Wall time: 334 µs


In [49]:
#%%time
#params={'alpha': [25,10,4,2,1.0,0.8,0.5,0.3,0.2,0.1,0.05,0.02,0.01]}
##rdg_reg = Ridge()
#clf = GridSearchCV(lr,params,cv=2,verbose = 1, scoring = 'neg_mean_absolute_error')

In [50]:
#%%time
#clf.fit(X_train_sparse, y_train)

In [51]:
#clf.best_params_

In [52]:
%%time
clf.fit(X_train_sparse, y_train)

Fitting 2 folds for each of 13 candidates, totalling 26 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed: 106.3min finished


CPU times: user 23min 18s, sys: 37.8 s, total: 23min 56s
Wall time: 1h 47min 19s


GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': [25, 10, 4, 2, 1.0, 0.8, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=1)

In [53]:
clf.best_params_

{'alpha': 1.0}

In [54]:
#%%time
# run grid search
#model = SGDClassifier(penalty='elasticnet',n_iter = np.ceil(10**6 / n),shuffle=True)
#param_grid = [{'alpha' : 10.0**-np.arange(1,7),'l1_ratio':[.05, .15, .5, .7, .9, .95, .99, 1]}]
#gs = grid_search.GridSearchCV(lr,param_grid,n_jobs=8,verbose=1)
#gs.fit(X_train, y_train)

In [55]:
# run grid search
#model = SGDClassifier(penalty='elasticnet',n_iter = np.ceil(10**6 / n),shuffle=True)
#param_grid = [{'alpha' : 10.0**-np.arange(1,7),'l1_ratio':[.05, .15, .5, .7, .9, .95, .99, 1]}]
#gs = grid_search.GridSearchCV(model,param_grid,n_jobs=8,verbose=1)
#gs.fit(X_train, y_train)

In [56]:
%%time
# Make a prediction for test data set
ridge_test_pred = clf.predict(X_test_sparse)

CPU times: user 105 ms, sys: 1.08 ms, total: 106 ms
Wall time: 105 ms


In [57]:
# 1.71760
#write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA, '01-assignment2_medium_submission-grid.csv'))
write_submission_file(ridge_test_pred, './submissions/02-assignment2_medium_submission-grid.csv')

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [58]:
# Your code here

In [59]:
#def write_submission_file(prediction, filename,
#                          path_to_sample=os.path.join(PATH_TO_DATA, 
#                                                      'sample_submission.csv')):
#    submission = pd.read_csv(path_to_sample, index_col='id')
#    
#    submission['log_recommends'] = prediction
#    submission.to_csv(filename)

In [60]:
#write_submission_file(ridge_test_pred, os.path.join(PATH_TO_DATA, 'assignment2_medium_submission.csv'))

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [61]:
write_submission_file(np.zeros_like(ridge_test_pred), 
                      os.path.join(PATH_TO_DATA,
                                   'medium_all_zeros_submission.csv'))

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [62]:
ridge_test_pred_modif = ridge_test_pred # Your code here

In [63]:
write_submission_file(ridge_test_pred_modif, 
                      os.path.join(PATH_TO_DATA,
                                   'assignment2_medium_submission_with_hack.csv'))

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You don't have to ignore HTML: some features can be extracted from it as well
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But you are not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>