 In this competition, you will predict the popularity of a set of New York Times blog articles from the time period September 2014-December 2014.
 
 Many blog articles are published each day, and the New York Times has to decide which articles should be featured. In this competition, we challenge you to develop an analytics model that will help the New York Times understand the features of a blog post that make it popular.

# Importing Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import datetime

plt.rcParams['figure.figsize'] = (15,5)

# Dataset Description

File Descriptions
The data provided for this competition is split into two files:

NYTimesBlogTrain.csv = the training data set. It consists of 6532 articles.
NYTimesBlogTest.csv = the testing data set. It consists of 1870 articles.  
We have also provided a sample submission file, SampleSubmission.csv. This file gives an example of the format of submission files (see the Evaluation page for more information). The data for this competition comes from the New York Times website.

# Variable Descriptions
The dependent variable in this problem is the variable Popular, which labels if an article had 25 or more comments in its online comment section (equal to 1 if it did, and 0 if it did not). The dependent variable is provided in the training data set, but not the testing dataset. This is an important difference from what you are used to - you will not be able to see how well your model does on the test set until you make a submission on Kaggle.

The independent variables consist of 8 pieces of article data available at the time of publication, and a unique identifier:

NewsDesk = the New York Times desk that produced the story (Business, Culture, Foreign, etc.)
SectionName = the section the article appeared in (Opinion, Arts, Technology, etc.)
SubsectionName = the subsection the article appeared in (Education, Small Business, Room for Debate, etc.)
Headline = the title of the article
Snippet = a small portion of the article text
Abstract = a summary of the blog article, written by the New York Times
WordCount = the number of words in the article
PubDate = the publication date, in the format "Year-Month-Day Hour:Minute:Second"
UniqueID = a unique identifier for each article

In [3]:
from google.colab import files
uploaded=files.upload()

Saving NY_train.csv to NY_train.csv


In [4]:
from google.colab import files
uploaded=files.upload()

Saving NY_test.csv to NY_test.csv


In [5]:
NYT_train_raw = pd.read_csv("NY_train.csv")
NYT_test_raw = pd.read_csv("NY_test.csv")

Join the data for preprocessing

In [6]:
print('Max train ID: %d. Max test ID: %d' % (np.max(NYT_train_raw['UniqueID']), np.max(NYT_test_raw['UniqueID'])))
joined = NYT_train_raw.merge(NYT_test_raw, how = 'outer')

Max train ID: 6532. Max test ID: 8402
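
To make the effect of the outer merge concrete, here is a small sketch on made-up data. The two toy frames and their values are hypothetical, not taken from the competition files. Because the test frame has no Popular column, its rows come out with Popular = NaN, and UniqueID can later be used to split the two sets apart again.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test sets
toy_train = pd.DataFrame({'UniqueID': [1, 2, 3],
                          'Headline': ['a', 'b', 'c'],
                          'Popular': [0, 1, 0]})
toy_test = pd.DataFrame({'UniqueID': [4, 5],
                         'Headline': ['d', 'e']})

# With no "on" argument, merge joins on all shared columns (UniqueID, Headline).
# how='outer' keeps every row from both frames.
toy_joined = toy_train.merge(toy_test, how='outer')

# Test rows have no label, so Popular is NaN there
n_unlabelled = int(toy_joined['Popular'].isnull().sum())

# The largest training ID separates the two sets again
max_train_id = toy_train['UniqueID'].max()
toy_train_back = toy_joined[toy_joined['UniqueID'] <= max_train_id]
toy_test_back = toy_joined[toy_joined['UniqueID'] > max_train_id]

print(len(toy_joined), n_unlabelled)
```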


Create additional features:

"QorE": question or exclamation mark in the headline
"Q&A": "Q. and A." phrase in the headline (I don't think it was valuable, but it is left over from my previous attempts)

In [7]:
joined['QorE'] = joined['Headline'].str.contains(r'\!|\?').astype(int)
joined['Q&A'] = joined['Headline'].str.contains(r'Q\. and A\.').astype(int)

Convert "PubDate" into two columns: Weekday and Hour:

In [8]:
joined['PubDate'] = pd.to_datetime(joined['PubDate'])
joined['Weekday'] = joined['PubDate'].dt.weekday
joined['Hour'] = joined['PubDate'].dt.hour

In [9]:
print("At the moment, we have %d entries with NewsDesk=Nan." % len(joined.loc[joined['NewsDesk'].isnull()]))

At the moment, we have 2408 entries with NewsDesk=Nan.


# More features and gap filling

Below are the results of one day of searching for meaningful patterns in the data. There are a few easily identifiable features, most of which lead to zero popularity. They are:

"History": article headings always started with a year. None of them were popular in the training set
"Daily rubric": I added this new NewsDesk category for types of articles that appeared regularly (not necessarily daily): "Daily Clip Report", "Today in Politics", "What we're reading", "First Draft", "Pictures of the day", "Week in pictures". They also were not popular.
Now, as ask788 pointed out in this thread, the problem with data is often their structure, not the models we use on them. I agree that ideally this feature engineering should have been done automatically, but I am a novice, and had to tediously plod through the rows of data manually.

You can browse individual features that I selected by printing the head() of a subset, like so:

In [10]:
joined.loc[(joined['NewsDesk'] == 'Foreign') & (joined['SectionName'].isnull())].head()

Unnamed: 0,NewsDesk,SectionName,SubsectionName,Headline,Snippet,Abstract,WordCount,PubDate,Popular,UniqueID,QorE,Q&A,Weekday,Hour
11,Foreign,,,1939: German Troops Invade Poland,Highlights from the International Herald Tribu...,Highlights from the International Herald Tribu...,97,2014-09-01 14:39:43,0.0,12,0,0,0,14
20,Foreign,,,1914: Russian Army Scores Victory,Highlights from the International Herald Tribu...,Highlights from the International Herald Tribu...,108,2014-09-01 09:30:14,0.0,21,0,0,0,9
67,Foreign,,,1914: City Prepares for War Wounded,Highlights from the International Herald Tribu...,Highlights from the International Herald Tribu...,101,2014-09-02 13:34:59,0.0,68,0,0,1,13
81,Foreign,,,1889: British Traders in East Africa,Highlights from the International Herald Tribu...,Highlights from the International Herald Tribu...,122,2014-09-02 10:48:08,0.0,82,0,0,1,10
184,Foreign,,,1939: War on Germany Declared,Highlights from the International Herald Tribu...,Highlights from the International Herald Tribu...,79,2014-09-03 07:41:26,0.0,185,0,0,2,7
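
The claim that the year-prefixed "History" headlines were never popular can be checked by averaging Popular within each group. This is a minimal sketch on made-up rows: the toy frame, its labels and the column name starts_with_year are hypothetical. On the real data the same groupby would be run on the training rows of joined.

```python
import pandas as pd

# Hypothetical toy rows; Popular is 1 for 25 or more comments, else 0
toy = pd.DataFrame({
    'Headline': ['1939: German Troops Invade Poland',
                 '1914: Russian Army Scores Victory',
                 'Daily Clip Report',
                 'Why I Quit My Job',
                 'Is This the End of Email?'],
    'Popular': [0, 0, 0, 1, 1],
})

# Flag headlines that start with a four-digit year beginning with 1
toy['starts_with_year'] = toy['Headline'].str.contains(r'^1[0-9]{3}')

# The mean of a 0/1 column is the share of popular articles in each group
popular_share = toy.groupby('starts_with_year')['Popular'].mean()
print(popular_share)
```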


In [11]:
joined.loc[(joined['NewsDesk'] == 'Styles') & (joined['SectionName'].isnull()), 'NewsDesk'] = 'TStyle'
joined.loc[(joined['NewsDesk'] == 'Foreign') & (joined['SectionName'].isnull()), 'NewsDesk'] = 'History'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'^1[0-9]{3}')), 'NewsDesk'] = 'History'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'] == 'Daily Clip Report'), 'NewsDesk'] = 'Daily Rubric'
joined.loc[joined['NewsDesk'] == 'Daily Rubric', 'SectionName'] = 'Clip Report'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'] == 'Today in Politics'), 'SectionName'] = 'Today in Politics'
joined.loc[joined['SectionName'] == 'Today in Politics', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'what we\'re reading', case=False)), 'SectionName'] = 'What we\'re reading'
joined.loc[joined['SectionName'] == 'What we\'re reading', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['Headline'].str.contains(r'first draft', case=False)), 'SectionName'] = 'First draft'
joined.loc[joined['SectionName'] == 'First draft', 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['NewsDesk'].isnull()) & (joined['SubsectionName'] == 'Education'), 'NewsDesk'] = 'Daily Rubric'
joined.loc[(joined['Headline'].str.contains('pictures of the day|week in pictures', case=False)), 'NewsDesk'] = 'Daily Rubric'

Filling the gaps in NewsDesk, SectionName and SubsectionName.

In [12]:
section_to_newsdesk = {'Business Day': 'Business', 'Crosswords/Games': 'Business', 'Technology': 'Business',
     'Arts': 'Culture',
     'World': 'Foreign',
     'Magazine': 'Magazine',
     'N.Y. / Region': 'Metro',
     'Opinion': 'OpEd',
     'Travel': 'Travel',
     'Multimedia': 'Multimedia',
     'Open': 'Open'}

section_to_subsection = {'Crosswords/Games': 'Crosswords/Games',
                        'Technology': 'Technology'}

newsdesk_to_section = {'TStyle': 'TStyle',
                      'Culture': 'Arts',
                      'OpEd': 'Opinion',
                      'History': 'History'}

newsdesk_to_subsection = {'TStyle': 'TStyle',
                         'Culture': 'Arts',
                         'Daily Rubric': 'Rubric',
                         'Magazine': 'Magazine',
                         'Metro': 'Metro',
                         'Multimedia': 'Multimedia',
                         'OpEd': 'OpEd',
                         'Science': 'Science',
                         'Sports': 'Sports',
                         'Styles': 'Styles',
                         'Travel': 'Travel',
                         'History': 'History'}

# Fill missing values through boolean masks. Chained fillna(inplace=True) on a
# selected column is deprecated in recent pandas and stops updating the frame
# under copy-on-write.
def fill_from(source_col, target_col, mapping):
    for source_value, target_value in mapping.items():
        mask = (joined[source_col] == source_value) & joined[target_col].isnull()
        joined.loc[mask, target_col] = target_value

# Sections first, then news desks, in the same order as the mappings above
fill_from('SectionName', 'NewsDesk', section_to_newsdesk)
fill_from('SectionName', 'SubsectionName', section_to_subsection)
fill_from('NewsDesk', 'SectionName', newsdesk_to_section)
fill_from('NewsDesk', 'SubsectionName', newsdesk_to_subsection)

Filling even more gaps with some clustering. I created a TF-IDF matrix and did Ward clustering on words. I thought four of the six clusters were meaningful and fitted well into the existing NewsDesk/SectionName/SubsectionName categories.

In [13]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

nans = joined.loc[joined['NewsDesk'].isnull()]
words = list(nans.apply(lambda x:'%s' % (x['Abstract']),axis=1))
tfv = TfidfVectorizer(min_df=0.005,  max_features=None, 
        strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
        ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
        stop_words = 'english')
X_tr = tfv.fit_transform(words)

ward = AgglomerativeClustering(n_clusters=6,
        linkage='ward').fit(X_tr.toarray())

joined.loc[joined['NewsDesk'].isnull(), 'cluster'] = ward.labels_
cluster_to = {}
cluster_to['NewsDesk'] = {4: 'Metro', 3: 'National', 2: 'Foreign', 1: 'National'}
cluster_to['SectionName'] = {4: 'N.Y. / Region', 3: 'U.S.', 2: 'Not_Asia', 1: 'U.S.'}
cluster_to['SubsectionName'] = {4: 'NYT', 3: 'Politics', 2: 'Not_Asia', 1: 'Politics'}

# Rows outside `nans` have cluster = NaN, so comparing the label is enough to select them
for key in cluster_to:
    for key2 in cluster_to[key]:
        joined.loc[joined['cluster'] == key2, key] = cluster_to[key][key2]



You can see what these clusters look like by typing:

In [14]:
joined.loc[joined['cluster'] == 3].head()

Unnamed: 0,NewsDesk,SectionName,SubsectionName,Headline,Snippet,Abstract,WordCount,PubDate,Popular,UniqueID,QorE,Q&A,Weekday,Hour,cluster
1510,National,U.S.,Politics,Congress to Weigh In on White House Security,The House Committee on Oversight and Governmen...,The House Committee on Oversight and Governmen...,98,2014-09-22 16:50:03,0.0,1511,0,0,0,16,3.0
1540,National,U.S.,Politics,White House Will Use Locks More Often,White House Not Changing the Locks Just Yet,White House Not Changing the Locks Just Yet,304,2014-09-22 12:49:19,0.0,1541,0,0,0,12,3.0
1569,National,U.S.,Politics,"The White House Front Door, When Entered Properly",A brief history of the White Houses North Port...,A brief history of the White Houses North Port...,220,2014-09-22 10:18:46,0.0,1570,0,0,0,10,3.0
1672,National,U.S.,Politics,Lunchtime Laughs: Jumper at 1600,The Daily Shows take on the disturbing details...,The Daily Shows take on the disturbing details...,114,2014-09-23 12:40:54,0.0,1673,0,0,1,12,3.0
1776,National,U.S.,Politics,White House Reporters Working on Pool End-Around,The White House Correspondents Association is ...,The White House Correspondents Association is ...,210,2014-09-24 14:46:28,0.0,1777,0,0,2,14,3.0


Finally, use a few (6) obvious keywords to categorise the data even more. After this, about 950 entries remain with NewsDesk, SectionName and SubsectionName all NaN, and I had no idea how to deal with them.

In [15]:
joined.drop('cluster', axis=1, inplace=True)

In [16]:
keywords = {}
keywords['clinton|white house|obama'] = {'NewsDesk': 'National', 'SectionName': 'U.S.', 'SubsectionName': 'Politics'}
keywords['isis|iraq'] = {'NewsDesk': 'Foreign', 'SectionName': 'Not_Asia', 'SubsectionName': 'Not_Asia'}
keywords['york'] = {'NewsDesk': 'Metro', 'SectionName': 'N.Y. / Region', 'SubsectionName': 'N.Y. / Region'}

for key in keywords:
    indices = (joined['NewsDesk'].isnull()) & (joined['Abstract'].str.contains(key, case=False))
    for sec in keywords[key]:
        joined.loc[indices, sec] = keywords[key][sec]

In [17]:
print("Now we have %d entries with NewsDesk=Nan." % len(joined.loc[joined['NewsDesk'].isnull()]))

Now we have 948 entries with NewsDesk=Nan.


# Categorical (factor) columns

First, turn the categorical data into 0/1 binary columns. Yes, it's more painful in Python than in R.

In [18]:
from sklearn.feature_extraction import DictVectorizer

def categorizeDF(df):
    old_columns = df.columns
    cat_cols = ['NewsDesk', 'SectionName', 'SubsectionName']
    temp_dict = df[cat_cols].to_dict(orient="records")
    vec = DictVectorizer()
    vec_arr = vec.fit_transform(temp_dict).toarray()
    
    new_df = pd.DataFrame(vec_arr).convert_dtypes(convert_integer=True)
    new_df.index = df.index
    new_df.columns = vec.get_feature_names_out()
    columns_to_add = [col for col in old_columns if col not in cat_cols]
    new_df[columns_to_add] = df[columns_to_add]
    new_df.drop(cat_cols, inplace=True, axis=1)
    return new_df

joined_cat = categorizeDF(joined)
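
For comparison, pandas has a built-in helper, pd.get_dummies, that produces the same kind of 0/1 columns in one call. This is an alternative sketch on a made-up frame, not the approach used above, and the toy values are hypothetical. One thing to keep in mind: by default, rows with a missing category get zeros in every dummy column.

```python
import pandas as pd

# Hypothetical toy frame with a missing category value
toy = pd.DataFrame({'NewsDesk': ['Business', 'Culture', None, 'Business'],
                    'WordCount': [508, 285, 1211, 97]})

# One 0/1 column per category; prefix_sep='=' gives names like "NewsDesk=Business"
toy_cat = pd.get_dummies(toy, columns=['NewsDesk'], prefix_sep='=', dtype=int)
print(toy_cat.columns.tolist())
```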

# Recover train and test sets

In [19]:
train = joined_cat[joined_cat['UniqueID'] <= 6532]
test = joined_cat[joined_cat['UniqueID'] > 6532]

# Random Forest

The parameters for the random forest were optimised with the GridSearchCV function from sklearn.

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

Xcols = train.columns
Xcols = [x for x in Xcols if x not in ('Headline', 'Snippet', 'Abstract', 'PubDate', 'UniqueID', 'Popular', 'Q&A')]

y = train['Popular']

forest = RandomForestClassifier(n_estimators=7000, max_features=0.1, min_samples_split=24, random_state=33, n_jobs=3)
forest.fit(train[Xcols], y)

probsRF = forest.predict_proba(test[Xcols])[:,1]

print("10 Fold CV Score: ", np.mean(cross_val_score(forest, train[Xcols], y, cv=10, scoring='roc_auc')))

10 Fold CV Score:  0.9462052194546713
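
The grid search itself is not shown in this notebook. As a rough sketch of what such a search could look like, here is GridSearchCV on synthetic data. The parameter grid, the small forest size and the data from make_classification are all assumptions chosen to keep the example fast, not the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train[Xcols] and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=33)

# Hypothetical grid over two of the parameters set in the notebook
param_grid = {'max_features': [0.1, 0.3],
              'min_samples_split': [2, 24]}

search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=33),
                      param_grid, cv=3, scoring='roc_auc')
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated ROC AUC
print(search.best_params_)
print(search.best_score_)
```

Scoring with 'roc_auc' keeps the search consistent with the cross-validation score printed above.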


# Gradient Boosting Method

In [21]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

Xcols = train.columns
Xcols = [x for x in Xcols if x not in ('Headline', 'Snippet', 'Abstract', 'PubDate', 'UniqueID', 'Popular', 'Q&A')]

y = train['Popular']

est = GradientBoostingClassifier(n_estimators=3000,
                                 learning_rate=0.005,
                                 max_depth=4,
                                 max_features=0.3,
                                 min_samples_leaf=9,
                                 random_state=33)
est.fit(train[Xcols], y)

probsGBC = est.predict_proba(test[Xcols])[:,1]

print("10 Fold CV Score: ", np.mean(cross_val_score(est, train[Xcols], y, cv=10, scoring='roc_auc')))

10 Fold CV Score:  0.9455640054930132
