# **Predicting Annual Spikes w/ "This Day in History"**

> *When this competition started, I entered with much enthusiasm, but about halfway through, I got caught up with something that prohibited me from giving it my all. With that said, I've decided to publish an approach that I was pursuing that I believe could have [some] value at this point to those who are still working hard on cracking this. With over 145k individual time series, it likely won't make enough of a difference to move spots on the leaderboard, but could prove valuable in Stage 2 for those seeking to push the limits of their models.*

I noticed early in the competition that the main approaches that competitors were taking involved ignoring spikes - deeming them unreliable, unpredictable, and out of the scope of forecasting. It seems that the baseline method involves forecasting with a lagged median and then informing it with all kinds of domain insight such as weekends, holidays, etc.

The purpose of this module is to demonstrate that annual "spikes" in the traffic on certain (but not all) pages can indeed be identified and forecasted. The code below is not intended to help one formulate a model from scratch, but merely to supplement an existing approach with additional information. For getting started with a baseline approach, there is a wealth of great kernels to work through.

Because this kernel involves web scraping from Wikipedia, it cannot be run in its entirety on Kaggle. I will comment out and make notes on the areas where a web connection is required. You will need to run the notebook locally, in a connected environment, to gather and apply the data that the code generates.

Let's do our necessary imports, fetch the training data, and get started:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

import re
import requests
import pandas as pd
import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from datetime import datetime
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 15, 6

def get_language(page):
    res = re.search(r'[a-z][a-z]\.wikipedia\.org', page)  # escape the dots so they match literally
    if res:
        return res.group()[:2]
    return 'na'

In [None]:
train = pd.read_csv('../input/train_1.csv').fillna(0)
for col in train.columns[1:]:
    train[col] = pd.to_numeric(train[col],downcast='integer')
train['lang'] = train.Page.map(get_language)

The motivation for this module is based on the assumption that if annual "spikes" in traffic can be observed to occur on the same date, or at least in the same date range, for two or more years, then there's good reason to believe that a spike will occur in the following year.\*\* Since SMAPE heavily punishes under-forecasting, it doesn't make sense to just neglect spike trends and forecast normal medians or something to that effect. Furthermore, it's doubtful that any statistical or ML model, e.g. ARIMA, RNN, XGBoost, etc., could adequately predict these spikes with such limited data. It's best to go with common sense and manually add them in after your baseline forecasts have been established.

\*\* *Spikes going back two years cannot be validated in the range of interest for Stage 1 because data only for the first quarter of 2016 is available. However, spikes for the past two years can be validated and used for prediction in Stage 2, because data for the third and fourth quarters of 2015 and 2016 is available.*
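
To make the SMAPE asymmetry concrete, here is a small sketch. It assumes the common definition SMAPE = mean(200 * |F - A| / (|A| + |F|)), with a term counted as 0 when both values are 0; the helper name `smape` and the toy numbers are mine and not part of the competition code.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error on a 0-200 scale."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    diff = np.abs(forecast - actual)
    # Treat 0/0 as a perfect forecast (assumed convention)
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 200.0 * ratio.mean()

# A spike day with 1000 actual views (toy numbers)
under = smape([1000], [100])    # miss the spike by 900 views
over = smape([1000], [1900])    # overshoot by the same 900 views
print(round(under, 1), round(over, 1))  # 163.6 62.1
```

For the same absolute error, the under-forecast costs far more than the over-forecast, which is why missing a real spike hurts so much.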



There are several outstanding events in the Stage 2 date range (09/10/2017 - 11/10/2017) that we can all agree should have predictable spikes:

1. September 11th Attacks (September 11th)
2. Halloween (October 31st)
3. Release of iPhone 7s (or iPhone 8) (previous models were typically released in mid to late September)

However, there are a few major events that garner attention that one may not be aware of. For example, how about the final release of John F. Kennedy's assassination documents on October 26th 2017, which on a side note also happens to fall within the anniversary range of the Cuban Missile Crisis? In any case, while some events are obvious, there are too many to keep track of, so it is inevitable that some will "slip through the cracks." 

Thankfully, using Wikipedia pages for each day of the year, we can generate dictionaries that capture many events that could be of recurring interest year after year. Each page can be thought of as "This Day in History", as it contains links to key anniversaries, events, births, deaths, holidays, etc. that fall on the particular date that a page covers. Of course, not all events are significant enough to generate recurring traffic year after year, so this kernel is at best a starting point for further investigation into spike forecasting.

Before going further, let's establish that certain articles exist that have very clear and predictable traffic spikes:

In [None]:
def range_analysis(article, start, end):
    # Try the 'all-access' series first; some topics don't have one, so fall
    # back to 'mobile-web' (which, very rarely, can be missing as well).
    for access in ('all-access', 'mobile-web'):
        page = '{0}_en.wikipedia.org_{1}_all-agents'.format(article, access)
        rows = train[train['Page'] == page]
        if not rows.empty:
            break
    else:
        raise KeyError('No English series found for {!r}'.format(article))
    # Plot the same calendar window for both years
    for year in ('2015', '2016'):
        window = rows.loc[:, '{0}-{1}'.format(year, start):'{0}-{1}'.format(year, end)]
        window.transpose().plot(kind='bar', title='{} Pattern'.format(year))
    plt.show()

In [None]:
range_analysis('Halloween', '09-10','11-10')

In [None]:
# 'Casualties_of_the_September_11_attacks' exists in the data as well
# Time series shows similar spike pattern
range_analysis('September_11_attacks', '09-10','11-10')

As we can see from these two examples, the spikes from 2015 and 2016 are nearly a mirror image of each other. The build-up and fade have a very similar structure as well. Now are you sure you wouldn't want to at least attempt adding in a spike for 2017, versus naively predicting an informed median or going with what your model says?

Now let's investigate the structure of a typical Wikipedia date page URL so that we can automate a web scraping process that will fetch topics of interest for every day of the year: 
> https://en.wikipedia.org/wiki/January_1

This `https://<language>.wikipedia.org/wiki/<month>_<day>` structure holds for the Wikipedia article on every date of the year, provided you can spell out every day and every month in all seven languages for which time series are available. The order of day and month, and the separator between them, differ from language to language.
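
As a quick illustration of that URL pattern, the sketch below builds date-page URLs for a few languages. The helper name `date_page_url` is hypothetical, and the page titles are simply the formats this kernel generates for January 1st; `urllib.parse.quote` is used so that non-ASCII titles end up percent-encoded.

```python
from urllib.parse import quote

def date_page_url(language, title):
    """Build the URL of a Wikipedia date page from a language code and page title."""
    # quote() percent-encodes non-ASCII characters; '_' and '.' are left untouched
    return 'https://{0}.wikipedia.org/wiki/{1}'.format(language, quote(title))

for language, title in [('en', 'January_1'), ('fr', '1er_janvier'),
                        ('de', '1._Januar'), ('ja', '1月1日')]:
    print(date_page_url(language, title))
```

`requests` will normally encode such URLs on its own, so this is only meant to show what the final addresses look like.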

I've assembled dictionaries below that perform such mapping:

In [None]:
ENGLISH_MONTHS = {
          '01': 'January',
          '02': 'February',
          '03': 'March',
          '04': 'April',
          '05': 'May',
          '06': 'June',
          '07': 'July',
          '08': 'August',
          '09': 'September',
          '10': 'October',
          '11': 'November',
          '12': 'December'
         }

FRENCH_MONTHS = {
          '01': 'janvier',
          '02': 'février',
          '03': 'mars',
          '04': 'avril',
          '05': 'mai',
          '06': 'juin',
          '07': 'juillet',
          '08': 'août',
          '09': 'septembre',
          '10': 'octobre',
          '11': 'novembre',
          '12': 'décembre'
         }

JAPANESE_MONTHS = {
          '01': '1月',
          '02': '2月',
          '03': '3月',
          '04': '4月',
          '05': '5月',
          '06': '6月',
          '07': '7月',
          '08': '8月',
          '09': '9月',
          '10': '10月',
          '11': '11月',
          '12': '12月'
         }

CHINESE_MONTHS = {
          '01': '1月',
          '02': '2月',
          '03': '3月',
          '04': '4月',
          '05': '5月',
          '06': '6月',
          '07': '7月',
          '08': '8月',
          '09': '9月',
          '10': '10月',
          '11': '11月',
          '12': '12月'
         }

RUSSIAN_MONTHS = {
          '01': 'января',
          '02': 'февраля',
          '03': 'марта',
          '04': 'апреля',
          '05': 'мая',
          '06': 'июня',
          '07': 'июля',
          '08': 'августа',
          '09': 'сентября',
          '10': 'октября',
          '11': 'ноября',
          '12': 'декабря'
         }

GERMAN_MONTHS = {
          '01': 'Januar',
          '02': 'Februar',
          '03': 'März',
          '04': 'April',
          '05': 'Mai',
          '06': 'Juni',
          '07': 'Juli',
          '08': 'August',
          '09': 'September',
          '10': 'Oktober',
          '11': 'November',
          '12': 'Dezember'
         }

SPANISH_MONTHS = {
          '01': 'enero',
          '02': 'febrero',
          '03': 'marzo',
          '04': 'abril',
          '05': 'mayo',
          '06': 'junio',
          '07': 'julio',
          '08': 'agosto',
          '09': 'septiembre',
          '10': 'octubre',
          '11': 'noviembre',
          '12': 'diciembre'
         }

Now let's use the month dictionaries along with some helper functions to create date ranges for 366 days and 7 languages. Note that the pandas date range uses the year 2016; the year itself can be ignored, as we only need the month and day to request and scrape articles. Discrete functions for each language are necessary because each language has its own quirks when it comes to dates. French, for example, appends "er" to the first day of every month, resulting in `1er_janvier`.

In [None]:
# English
def fetch_dates_en(start, end):
    en_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        en_dates.append(ENGLISH_MONTHS[month_day[0]] + '_{}'.format(str(int(month_day[1]))))
    return en_dates

# French
def fetch_dates_fr(start, end):
    fr_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        if int(month_day[1]) == 1:
            fr_dates.append('{}er_'.format(str(int(month_day[1]))) + FRENCH_MONTHS[month_day[0]])
        else:
            fr_dates.append('{}_'.format(str(int(month_day[1]))) + FRENCH_MONTHS[month_day[0]])
    return fr_dates

# Japanese
def fetch_dates_ja(start, end):
    ja_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        ja_dates.append(JAPANESE_MONTHS[month_day[0]] + '{}日'.format(str(int(month_day[1]))))
    return ja_dates

# Chinese
def fetch_dates_zh(start, end):
    zh_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        zh_dates.append(CHINESE_MONTHS[month_day[0]] + '{}日'.format(str(int(month_day[1]))))
    return zh_dates

# Russian
def fetch_dates_ru(start, end):
    ru_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        ru_dates.append('{}_'.format(str(int(month_day[1]))) + RUSSIAN_MONTHS[month_day[0]])
    return ru_dates

# German
def fetch_dates_de(start, end):
    de_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        de_dates.append('{}._'.format(str(int(month_day[1]))) + GERMAN_MONTHS[month_day[0]])
    return de_dates

# Spanish
def fetch_dates_es(start, end):
    es_dates = []
    date_range = pd.date_range(start,end)
    for date in date_range:
        month_day = datetime.strftime(date, '%Y-%m-%d')[-5:].split('-')
        es_dates.append('{}_de_'.format(str(int(month_day[1]))) + SPANISH_MONTHS[month_day[0]])
    return es_dates

start_date, end_date = '2016-01-01', '2016-12-31'

# There are 366 dates in each list because 2016 was a leap year (includes Feb. 29th)

english_dates = fetch_dates_en(start_date, end_date)
assert len(english_dates) == 366

french_dates = fetch_dates_fr(start_date, end_date)
assert len(french_dates) == 366

japanese_dates = fetch_dates_ja(start_date, end_date)
assert len(japanese_dates) == 366

chinese_dates = fetch_dates_zh(start_date, end_date)
assert len(chinese_dates) == 366

russian_dates = fetch_dates_ru(start_date, end_date)
assert len(russian_dates) == 366

german_dates = fetch_dates_de(start_date, end_date)
assert len(german_dates) == 366

spanish_dates = fetch_dates_es(start_date, end_date)
assert len(spanish_dates) == 366

Before scraping the Wikipedia date pages looking for articles of interest, we need to extract the article information from the dataset in each language group, so we can search for their corresponding hyperlinks in the source HTML.

In [None]:
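def get_article(page):
    # Greedy match up to the 'wiki' in the domain, e.g. 'Halloween_en.wiki'
    res = re.search(r'.*_*.wiki', page)
    # Drop the 8-character '_xx.wiki' tail to keep the article info only.
    # This assumes a two-letter language code, so results for 'na' pages are not meaningful.
    return res.group()[:-8]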

english_articles = train[train['lang'] == 'en']['Page'].map(get_article).tolist()
french_articles = train[train['lang'] == 'fr']['Page'].map(get_article).tolist()
japanese_articles = train[train['lang'] == 'ja']['Page'].map(get_article).tolist()
chinese_articles = train[train['lang'] == 'zh']['Page'].map(get_article).tolist()
russian_articles = train[train['lang'] == 'ru']['Page'].map(get_article).tolist()
german_articles = train[train['lang'] == 'de']['Page'].map(get_article).tolist()
spanish_articles = train[train['lang'] == 'es']['Page'].map(get_article).tolist()
non_language_articles = train[train['lang'] == 'na']['Page'].map(get_article).tolist()

assert len(english_articles) + len(chinese_articles) + len(japanese_articles) \
       + len(german_articles) + len(french_articles) + len(russian_articles) \
       + len(spanish_articles) + len(non_language_articles) == train.shape[0]
        
print('Sample English articles: {0}, {1}, {2}'.format(*english_articles[:3]))
print('Sample Chinese articles: {0}, {1}, {2}'.format(*chinese_articles[:3]))
print('Sample Russian articles: {0}, {1}, {2}'.format(*russian_articles[:3]))
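
For readers who want every part of a page name, not just the article, here is a minimal sketch. It assumes names follow the `<article>_<project>_<access>_<agent>` layout seen in the examples above; `PAGE_PATTERN` and `split_page` are hypothetical names, not part of the original kernel.

```python
import re

# Assumed layout: <article>_<project>_<access>_<agent>
PAGE_PATTERN = re.compile(
    r'^(?P<article>.+)_(?P<project>[a-z]+\.[a-z]+\.org)'
    r'_(?P<access>[^_]+)_(?P<agent>[^_]+)$'
)

def split_page(page):
    """Split a page name into its parts, or return None if it doesn't fit the layout."""
    match = PAGE_PATTERN.match(page)
    return match.groupdict() if match else None

parts = split_page('September_11_attacks_en.wikipedia.org_all-access_all-agents')
print(parts['article'], parts['project'], parts['access'], parts['agent'])
```

Because the project is captured as a whole domain, this also works for pages that `get_language` labels `'na'`.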

Finally, let's make a function that will allow us to scrape all 366 date pages in all 7 languages, searching for articles that are both present in the dataset and on the date pages in the form of hyperlinks.

In [None]:
### The function below is responsible for the actual requesting and requires
### an Internet connection to call. We can define it in Kaggle without calling though.

def _request_parse_html(language, date):
    html = requests.get('https://{0}.wikipedia.org/wiki/{1}'.format(language,date))
    soup = BeautifulSoup(html.text, "html5lib")
    lists = soup.find_all('ul')
    all_articles = []
    for ul in lists:
        links = ul.find_all('a')
        for link in links:
            if language == 'ru':
                # Russian hrefs are percent-encoded, so we can't match them directly
                # and take the link title instead. In rare cases the article titles
                # and links differ, so matches will fail. Not as reliable as the
                # regex below, but the best we can do here.
                try:
                    article = '_'.join(link['title'].split())
                    all_articles.append(article)
                except (AttributeError, KeyError, IndexError):
                    continue
            elif language in {'ja', 'zh'}:
                # Japanese and Chinese hrefs are percent-encoded as well. No need to
                # split and rejoin the title because these languages don't use spaces.
                try:
                    article = link['title']
                    all_articles.append(article)
                except (AttributeError, KeyError, IndexError):
                    continue
            else:
                # English, French, German, Spanish: extract the article name directly
                # from the href. Titles with non-ASCII characters may be
                # percent-encoded in the href and then fail to match.
                try:
                    article = re.search(r'(?<=/wiki/).*', link['href']).group()
                    all_articles.append(article)
                except (AttributeError, KeyError, IndexError):
                    continue
    return all_articles

def find_key_articles(language, date, article_list):
    key_articles = set()
    # A set makes each membership test O(1) instead of scanning a list
    date_articles = set(_request_parse_html(language, date))
    for article in article_list:
        if article in date_articles:
            key_articles.add(article)
    return key_articles
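
The hrefs that `_request_parse_html` avoids for Russian, Japanese, and Chinese are most likely percent-encoded UTF-8, which `urllib.parse.unquote` from the standard library can decode. The sketch below is an untested-against-live-pages suggestion, and `article_from_href` is a hypothetical helper name; if it holds up, the same href-based extraction could be used for every language, including accented French, German, and Spanish titles.

```python
import re
from urllib.parse import unquote

def article_from_href(href):
    """Return the decoded article name from a '/wiki/...' href, or None."""
    # Stop at '#' or '?' so fragments and query strings are not included
    match = re.search(r'(?<=/wiki/)[^#?]+', href)
    if match is None:
        return None
    return unquote(match.group())

print(article_from_href('/wiki/1_%D1%8F%D0%BD%D0%B2%D0%B0%D1%80%D1%8F'))  # 1_января
print(article_from_href('/wiki/1er_f%C3%A9vrier'))                        # 1er_février
```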

This is as far as this kernel can run, on Kaggle at least. The last step is to uncomment and run the following 7 blocks on your machine to get the final output. Since 2,562 (366 days * 7 languages) requests will be made, this could take a long time, so go ahead and work on fine-tuning other parts of your model while waiting. I like to run individual blocks by language in case I have a connection issue halfway through, but you can also combine them all into one block and run it overnight if needed.
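
If connection issues are a concern, one option is a loop that can be resumed and that pauses between requests. This is only a sketch: `collect_key_articles` and its parameters are hypothetical, and the fetch step is passed in as a function so the example runs here with a stub instead of a real request.

```python
import time

def collect_key_articles(dates, keys, fetch, results=None, pause=1.0):
    """Store fetch(date) under the matching key for every date not collected yet."""
    results = {} if results is None else results
    for key, date in zip(keys, dates):
        if key in results:          # already fetched in a previous run
            continue
        try:
            results[key] = fetch(date)
        except Exception as exc:    # e.g. a dropped connection
            print('Stopped at {}: {}'.format(date, exc))
            break
        time.sleep(pause)           # be polite to the server
    return results

# Demo with a stub standing in for the real network call
calls = []
def fake_fetch(date):
    calls.append(date)
    return {'article_for_' + date}

partial = {'January_1': {'cached'}}
out = collect_key_articles(['1er_janvier', '2_janvier'], ['January_1', 'January_2'],
                           fake_fetch, results=partial, pause=0)
print(sorted(out))  # ['January_1', 'January_2']
```

Locally, `fetch` could be something like `lambda d: find_key_articles('fr', d, french_articles)`, with `french_dates` as `dates` and `english_dates` as `keys`. Calling the function again with the same `results` dictionary picks up where the last run stopped.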

The final output for each block will consist of a dictionary with 366 keys - one for each day of a leap year. The keys are English dates of the form `Month_day` with no zero-padding, e.g. `October_7`, regardless of the language being scraped. Each value is a set containing articles in the time series dataset that are present on the Wikipedia page for its respective date. When it's all said and done, there should be the following number of articles of interest for each language:

Some redundant articles can be found in each set, such as `Special:MyContributions`, `Special:RecentChanges`, and `Wikipedia:Contact_us`. Although these articles are present in the time series data, they are not relevant for our purposes, so they can be ignored and/or filtered out.
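
One simple way to filter them out is by namespace prefix. In this sketch, `NAMESPACE_PREFIXES` and `drop_namespace_pages` are hypothetical names, and the prefix list only covers the two namespaces mentioned above; other namespaces and the localized prefixes used by non-English Wikipedias would need to be added by hand.

```python
# Prefixes taken from the examples above; extend as needed (assumption)
NAMESPACE_PREFIXES = ('Special:', 'Wikipedia:')

def drop_namespace_pages(key_articles, prefixes=NAMESPACE_PREFIXES):
    """Return a copy of the date -> articles dictionary without namespace pages."""
    return {
        date: {a for a in articles if not a.startswith(prefixes)}
        for date, articles in key_articles.items()
    }

sample = {'October_31': {'Halloween', 'Special:RecentChanges', 'Wikipedia:Contact_us'}}
print(drop_namespace_pages(sample))  # {'October_31': {'Halloween'}}
```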

In [None]:
### Requires Internet connection

# key_english_articles = dict()

# for i, date in enumerate(english_dates):
#     key_articles = find_key_articles('en', date, english_articles)
#     key_english_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_french_articles = dict()

# for i, date in enumerate(french_dates):
#     key_articles = find_key_articles('fr', date, french_articles)
#     key_french_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_japanese_articles = dict()

# for i, date in enumerate(japanese_dates):
#     key_articles = find_key_articles('ja', date, japanese_articles)
#     key_japanese_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_chinese_articles = dict()

# for i, date in enumerate(chinese_dates):
#     key_articles = find_key_articles('zh', date, chinese_articles)
#     key_chinese_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_russian_articles = dict()

# for i, date in enumerate(russian_dates):
#     key_articles = find_key_articles('ru', date, russian_articles)
#     key_russian_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_german_articles = dict()

# for i, date in enumerate(german_dates):
#     key_articles = find_key_articles('de', date, german_articles)
#     key_german_articles[english_dates[i]] = key_articles

In [None]:
### Requires Internet connection

# key_spanish_articles = dict()

# for i, date in enumerate(spanish_dates):
#     key_articles = find_key_articles('es', date, spanish_articles)
#     key_spanish_articles[english_dates[i]] = key_articles

### **Optional**

If you'd like to serialize and store locally the key article dictionaries you've just spent all this time gathering, uncomment and run the following:

In [None]:
# dest = 'pkl_objects'
# os.makedirs(dest, exist_ok=True)  # create the folder only if it doesn't exist yet
# collections = {
#     'english': key_english_articles,
#     'french': key_french_articles,
#     'japanese': key_japanese_articles,
#     'chinese': key_chinese_articles,
#     'russian': key_russian_articles,
#     'german': key_german_articles,
#     'spanish': key_spanish_articles,
# }
# for name, collection in collections.items():
#     # The 'with' block makes sure each file is closed after writing
#     with open(os.path.join(dest, '{}_collection.pkl'.format(name)), 'wb') as f:
#         pickle.dump(collection, f, protocol=4)

### **Where to go from here**:

- Although we've identified relevant articles from the dataset that could attract annual interest for each day of the year, this won't translate directly to better predictive capacity for our models. What is needed at this point is further filtering to pinpoint the articles that display recurring spike patterns on the same days for 2015 and 2016. This could be accomplished with a mean-aware approach such as Z-Scores that considers a score greater than 3.0 or 4.0, for example, to be a "spike." Alternatively, a mean-naive approach could be used that qualifies spikes using percentiles such as 98%, 99%, etc.

- The articles that this cross-referencing step discovers can be examined and one can use such information to generate spikes in predicted traffic, using a method of choice such as averaging spikes between 2015 and 2016. There's a bit of manual work involved upfront, but with a few helper functions, the whole process could be easily streamlined for all key articles.

- In this module, we've only found articles that are matched between the dataset and the date pages. A further path of investigation could involve looking into how traffic "spills" over from one article to others that are related. One could start by examining traffic for the first five or ten hyperlinks on each article of interest. This could generate further insight and allow more predictable spikes to be discovered. With regard to `September_11_attacks`, here's a related spike:

In [None]:
range_analysis('World_Trade_Center_(1973–2001)', '09-10','11-10')
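
The first two suggestions above can be sketched as follows. This is one possible reading, not a tested recipe: the function names, the 3.0 threshold, the one-day tolerance, and the toy data are all assumptions. Spikes are flagged by Z-score within a date window, kept only if they recur within a day of the same calendar date one year later, and then summarized as an average multiple of the window median that could be applied to a baseline forecast.

```python
import pandas as pd

def spike_dates(series, z_thresh=3.0):
    """Dates (as Timestamps) whose traffic Z-score exceeds z_thresh."""
    values = series.astype(float)
    std = values.std()
    if not std > 0:               # flat or single-point series: no spikes
        return []
    z = (values - values.mean()) / std
    return [pd.Timestamp(label) for label in z[z > z_thresh].index]

def recurring_spikes(earlier, later, z_thresh=3.0, tolerance_days=1):
    """Spikes in `later` that fall within tolerance_days of an `earlier` spike, one year on."""
    shifted = [d + pd.DateOffset(years=1) for d in spike_dates(earlier, z_thresh)]
    return [d for d in spike_dates(later, z_thresh)
            if any(abs((d - s).days) <= tolerance_days for s in shifted)]

def spike_multiplier(series, label):
    """Traffic on `label` as a multiple of the window median."""
    median = series.median()
    return float(series[label]) / median if median > 0 else 1.0

def toy_series(year):
    # Toy data: mild weekly wiggle plus one large spike on October 31st
    days = pd.date_range('{}-10-01'.format(year), '{}-11-10'.format(year))
    values = [100 + i % 7 for i in range(len(days))]
    series = pd.Series(values, index=days.strftime('%Y-%m-%d'))
    series['{}-10-31'.format(year)] = 2000
    return series

s2015, s2016 = toy_series(2015), toy_series(2016)
matches = recurring_spikes(s2015, s2016)
uplift = (spike_multiplier(s2015, '2015-10-31')
          + spike_multiplier(s2016, '2016-10-31')) / 2
print(matches, round(uplift, 1))
```

With the real data, each series would be a single row of `train` restricted to the date columns of interest, e.g. `rows.loc[:, '2015-09-10':'2015-11-10'].iloc[0]`. A 2017 spike estimate would then be the baseline forecast for that day multiplied by `uplift`.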