# Wikipedia Web Traffic Prediction 
## Libraries and Configs

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import display

+ `train_*.csv` - contains traffic data. 
    1. This is a CSV file in which each row corresponds to a particular article and each column corresponds to a particular date. 
    2. Some entries are missing data. The page names contain the Wikipedia project (e.g. en.wikipedia.org), the type of access (e.g. desktop) and the type of agent (e.g. spider). 
    3. In other words, each article name has the format 'name_project_access_agent' (e.g. 'AKB48_zh.wikipedia.org_all-access_spider').  
+ `key_*.csv` - gives the mapping between the page names and the shortened Id column used for prediction.  
+ `sample_submission_*.csv` - a submission file showing the correct format.  
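
The naming format described above can be taken apart with a right-hand split. This is a minimal sketch, not part of the original notebook: `split_page_name` is a hypothetical helper, and it assumes that the project, access and agent parts never contain an underscore, while the article name may.

```python
def split_page_name(page):
    """Split 'name_project_access_agent' into its four parts.

    Splitting from the right (maxsplit=3) keeps underscores that belong
    to the article name itself. Assumes the last three parts contain
    no underscore.
    """
    name, project, access, agent = page.rsplit('_', 3)
    return name, project, access, agent


print(split_page_name('AKB48_zh.wikipedia.org_all-access_spider'))
print(split_page_name('"Awaken,_My_Love!"_en.wikipedia.org_all-access_all-agents'))
```

On a whole column, `train_1.Page.str.rsplit('_', n=3, expand=True)` should give the same four parts as separate columns.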

For the prediction task, we need to forecast page views for 60 days for each page.

## Read in data 

In [2]:
%time key_1 = pd.read_csv('../input/key_1.csv')
%time submission = pd.read_csv('../input/sample_submission_1.csv')
%time train_1 = pd.read_csv('../input/train_1.csv')

CPU times: user 12.3 s, sys: 1.36 s, total: 13.7 s
Wall time: 13.7 s
CPU times: user 3.79 s, sys: 542 ms, total: 4.34 s
Wall time: 4.35 s
CPU times: user 10.2 s, sys: 865 ms, total: 11.1 s
Wall time: 11.1 s


## Basic Stats of Dataset

In [8]:
print(train_1.shape)
display(train_1.head())
print(key_1.shape)
display(key_1.head())
print(key_1.head().Page.values)
print(submission.shape)
display(submission.head())

(145063, 551)


| | Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | ... | 2016-12-22 | 2016-12-23 | 2016-12-24 | 2016-12-25 | 2016-12-26 | 2016-12-27 | 2016-12-28 | 2016-12-29 | 2016-12-30 | 2016-12-31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 18.0 | 11.0 | 5.0 | 13.0 | 14.0 | 9.0 | 9.0 | 22.0 | 26.0 | ... | 32.0 | 63.0 | 15.0 | 26.0 | 14.0 | 20.0 | 22.0 | 19.0 | 18.0 | 20.0 |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | 11.0 | 14.0 | 15.0 | 18.0 | 11.0 | 13.0 | 22.0 | 11.0 | 10.0 | ... | 17.0 | 42.0 | 28.0 | 15.0 | 9.0 | 30.0 | 52.0 | 45.0 | 26.0 | 20.0 |
| 2 | 3C_zh.wikipedia.org_all-access_spider | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 4.0 | 0.0 | 3.0 | 4.0 | ... | 3.0 | 1.0 | 1.0 | 7.0 | 4.0 | 4.0 | 6.0 | 3.0 | 4.0 | 17.0 |
| 3 | 4minute_zh.wikipedia.org_all-access_spider | 35.0 | 13.0 | 10.0 | 94.0 | 4.0 | 26.0 | 14.0 | 9.0 | 11.0 | ... | 32.0 | 10.0 | 26.0 | 27.0 | 16.0 | 11.0 | 17.0 | 19.0 | 10.0 | 11.0 |
| 4 | 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 48.0 | 9.0 | 25.0 | 13.0 | 3.0 | 11.0 | 27.0 | 13.0 | 36.0 | 10.0 |


(8703780, 2)


| | Page | Id |
|---|---|---|
| 0 | !vote_en.wikipedia.org_all-access_all-agents_2... | bf4edcf969af |
| 1 | !vote_en.wikipedia.org_all-access_all-agents_2... | 929ed2bf52b9 |
| 2 | !vote_en.wikipedia.org_all-access_all-agents_2... | ff29d0f51d5c |
| 3 | !vote_en.wikipedia.org_all-access_all-agents_2... | e98873359be6 |
| 4 | !vote_en.wikipedia.org_all-access_all-agents_2... | fa012434263a |


['!vote_en.wikipedia.org_all-access_all-agents_2017-01-01'
 '!vote_en.wikipedia.org_all-access_all-agents_2017-01-02'
 '!vote_en.wikipedia.org_all-access_all-agents_2017-01-03'
 '!vote_en.wikipedia.org_all-access_all-agents_2017-01-04'
 '!vote_en.wikipedia.org_all-access_all-agents_2017-01-05']
(8703780, 2)


| | Id | Visits |
|---|---|---|
| 0 | bf4edcf969af | 0 |
| 1 | 929ed2bf52b9 | 0 |
| 2 | ff29d0f51d5c | 0 |
| 3 | e98873359be6 | 0 |
| 4 | fa012434263a | 0 |
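
The key and submission tables share the `Id` column, and each key `Page` is a training page name with a date appended. The sketch below shows one way a per-page forecast could be mapped onto submission rows. It is an illustration on toy data, not the notebook's method: the `forecast` values are made up, and `train_page` and `toy_submission` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for key_1, using the first two rows shown above.
key = pd.DataFrame({
    'Page': ['!vote_en.wikipedia.org_all-access_all-agents_2017-01-01',
             '!vote_en.wikipedia.org_all-access_all-agents_2017-01-02'],
    'Id': ['bf4edcf969af', '929ed2bf52b9'],
})

# Hypothetical forecast: one made-up constant per training page.
forecast = pd.Series({'!vote_en.wikipedia.org_all-access_all-agents': 3.0})

# Drop the trailing '_YYYY-MM-DD' to recover the training page name.
key['train_page'] = key.Page.str.rsplit('_', n=1).str[0]

# Look up each row's forecast and keep the two submission columns.
key['Visits'] = key.train_page.map(forecast)
toy_submission = key[['Id', 'Visits']]
print(toy_submission)
```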


*What types of page are there?*

In [14]:
train_1.Page.map(lambda x: x.split('.org_')[-1]).value_counts()

all-access_all-agents    39402
mobile-web_all-agents    35939
all-access_spider        34913
desktop_all-agents       34809
Name: Page, dtype: int64

*Does all-access_all-agents equal the sum of the other series? And how strongly are the series correlated?*

In [57]:
# Select the four access/agent variants of the 2NE1 (zh) article once.
mask = train_1.Page.str.contains('2NE1_zh.wikipedia.org_', regex=False)
display(train_1.loc[mask].iloc[:, :10])
print(np.corrcoef(train_1.loc[mask].iloc[:, 1:].values))

| | Page | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 18.0 | 11.0 | 5.0 | 13.0 | 14.0 | 9.0 | 9.0 | 22.0 | 26.0 |
| 27965 | 2NE1_zh.wikipedia.org_all-access_all-agents | 785.0 | 725.0 | 641.0 | 704.0 | 690.0 | 657.0 | 670.0 | 698.0 | 791.0 |
| 60570 | 2NE1_zh.wikipedia.org_desktop_all-agents | 540.0 | 514.0 | 400.0 | 458.0 | 467.0 | 432.0 | 441.0 | 490.0 | 573.0 |
| 105110 | 2NE1_zh.wikipedia.org_mobile-web_all-agents | 238.0 | 211.0 | 239.0 | 244.0 | 219.0 | 223.0 | 226.0 | 203.0 | 213.0 |


[[ 1.          0.62924733  0.6092952   0.6473105 ]
 [ 0.62924733  1.          0.99495919  0.98860308]
 [ 0.6092952   0.99495919  1.          0.96853068]
 [ 0.6473105   0.98860308  0.96853068  1.        ]]
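
The first half of the question can be checked directly on the nine days displayed above. The sketch below hard-codes those values and compares `all-access_all-agents` with `desktop_all-agents` plus `mobile-web_all-agents`. It assumes that "the sum of all" means these two series; the variable names are illustrative.

```python
import numpy as np

# 2015-07-01 to 2015-07-09 for the 2NE1 (zh) article, copied from the table above.
all_access = np.array([785, 725, 641, 704, 690, 657, 670, 698, 791])
desktop = np.array([540, 514, 400, 458, 467, 432, 441, 490, 573])
mobile_web = np.array([238, 211, 239, 244, 219, 223, 226, 203, 213])

# Views counted under all-access but not under desktop or mobile-web.
gap = all_access - (desktop + mobile_web)
rel_gap = gap / all_access

print(gap)
print(rel_gap.max())
```

On these nine days the gap lies between 0 and 7 views, which is under 1% of the all-access count, so the series is close to the sum but not exactly equal to it. The remainder presumably comes from access types that are not listed separately in this dataset.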


*How many days of visits do we need to predict for each page?*

In [12]:
key_1.Page.map(lambda x: x.split('_')[-1]).value_counts().value_counts()

145063    60
Name: Page, dtype: int64
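
The chained `value_counts()` call above is compact but easy to misread. The first call counts key rows per date, and the second counts how many dates share each row count. The output therefore says that 60 distinct dates each appear 145063 times. The toy example below reproduces the pattern on three hypothetical pages and two dates; the page names are made up.

```python
import pandas as pd

# Toy stand-in for key_1.Page: 3 hypothetical pages x 2 forecast dates.
pages = pd.Series([
    'A_en.wikipedia.org_desktop_all-agents_2017-01-01',
    'A_en.wikipedia.org_desktop_all-agents_2017-01-02',
    'B_en.wikipedia.org_desktop_all-agents_2017-01-01',
    'B_en.wikipedia.org_desktop_all-agents_2017-01-02',
    'C_en.wikipedia.org_desktop_all-agents_2017-01-01',
    'C_en.wikipedia.org_desktop_all-agents_2017-01-02',
])

# The date is whatever follows the last underscore.
dates = pages.map(lambda x: x.split('_')[-1])

rows_per_date = dates.value_counts()      # date -> number of key rows
summary = rows_per_date.value_counts()    # rows per date -> number of dates
print(summary)
print(dates.nunique())                    # number of distinct dates, read directly
```

Here `summary` has the single entry 3 -> 2: two dates, each covering three pages.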

In [5]:


# Count key rows per page by cutting each key name at the first '2017'.
# Caveat: a title that itself contains '2017' is cut too early (see the '' group below).
rows_per_page = key_1.groupby(key_1.Page.map(lambda x: x.split('2017')[0])).size()
display(rows_per_page.head())
print(rows_per_page.value_counts())
print(rows_per_page.value_counts().sum())

Page
                                                              6600
!vote_en.wikipedia.org_all-access_all-agents_                   60
!vote_en.wikipedia.org_all-access_spider_                       60
!vote_en.wikipedia.org_desktop_all-agents_                      60
"Awaken,_My_Love!"_en.wikipedia.org_all-access_all-agents_      60
dtype: int64

60      144346
240         90
180         32
480          2
300          2
1260         1
360          1
720          1
6600         1
840          1
3240         1
420          1
660          1
dtype: int64
144480
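
The total of 144480 groups is lower than the 145063 pages, and one group has an empty name with 6600 rows. This suggests that splitting on `'2017'` merges pages whose titles contain "2017". The sketch below uses hypothetical key names to contrast that split with a split on the last underscore, which removes only the date suffix.

```python
from collections import Counter

# Hypothetical key names; two of the titles start with '2017'.
keys = [
    'Foo_en.wikipedia.org_desktop_all-agents_2017-01-01',
    'Foo_en.wikipedia.org_desktop_all-agents_2017-01-02',
    '2017_in_film_en.wikipedia.org_desktop_all-agents_2017-01-01',
    '2017_in_film_en.wikipedia.org_desktop_all-agents_2017-01-02',
    '2017_in_music_en.wikipedia.org_desktop_all-agents_2017-01-01',
    '2017_in_music_en.wikipedia.org_desktop_all-agents_2017-01-02',
]

# Cutting at the first '2017' collapses both '2017_...' pages into ''.
by_first_2017 = Counter(k.split('2017')[0] for k in keys)

# Cutting at the last underscore strips only the '_YYYY-MM-DD' suffix.
by_last_underscore = Counter(k.rsplit('_', 1)[0] for k in keys)

print(by_first_2017)
print(by_last_underscore)
```

With the right-hand split every page keeps its own group of two rows. Applied to `key_1`, the same idea would be `key_1.Page.str.rsplit('_', n=1).str[0]` as the grouping key.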
