# README

This README documents how to gather, combine, and analyze a dataset of monthly Wikipedia page traffic.


The steps involved are described below.

## Gather the data
We need to collect monthly data from five sources, split across two APIs:

Legacy Pagecounts API
1. Desktop data
2. Mobile data

Pageviews API
1. Desktop data
2. Mobile web data
3. Mobile app traffic data

### Imports
Before any processing, we import the required Python libraries. `requests` and `pandas` are available from PyPI; `json` and `datetime` are part of the standard library.

In [115]:
import json
import requests
import pandas as pd
from datetime import datetime

### Constants and common definitions
The code below declares constants such as the endpoints and request headers, and lists the access points supported by each endpoint.

In [70]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/{access}/monthly/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/{access}/{agent}/monthly/{start}/{end}'
headers = {
    'User-Agent': 'https://github.com/vikrantb',
    'From': 'vikrantb@uw.edu'
}

legacy_access_points = ['desktop-site', 'mobile-site']
page_view_access_points = ['desktop', 'mobile-app', 'mobile-web']
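
To make the role of the `{access}`, `{agent}`, `{start}` and `{end}` placeholders concrete, here is a minimal sketch of how the templates expand into request URLs. The templates are copied from the constants above; `build_url` is a hypothetical helper used only for illustration, and the timestamps assume the `YYYYMMDDHH` form that the helper functions in this README build.

```python
# Endpoint templates copied from the constants above.
ENDPOINT_LEGACY = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/'
                   'aggregate/en.wikipedia.org/{access}/monthly/{start}/{end}')
ENDPOINT_PAGEVIEWS = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/'
                      'aggregate/en.wikipedia.org/{access}/{agent}/monthly/{start}/{end}')


def build_url(template, **parameters):
    """Hypothetical helper: fill a URL template with request parameters."""
    return template.format(**parameters)


legacy_url = build_url(ENDPOINT_LEGACY, access='desktop-site',
                       start='2008010100', end='2016070100')
pageviews_url = build_url(ENDPOINT_PAGEVIEWS, access='mobile-web', agent='user',
                          start='2015070100', end='2020090100')

print(legacy_url)
print(pageviews_url)
```

No request is sent here; the sketch only shows the URL shape that would be passed to `requests.get`.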

### Common helper functions

Below are the helper functions used throughout, each with a short explanation.

#### api_call
This function calls the given endpoint with the given parameters. It can optionally return only the `items` list from the response.

#### get_wiki_df
This function reads a saved JSON file and returns its pagecounts/pageviews information as a dataframe.

#### get_from_legacy
This function is a wrapper around `api_call` for the legacy endpoint.

#### get_from_page_views
This function is a wrapper around `api_call` for the pageviews endpoint.


In [284]:
def api_call(endpoint, parameters, extract_items=False):
    # Fill the URL template and send the request with the shared headers.
    call = requests.get(endpoint.format(**parameters), headers=headers)
    call.raise_for_status()  # fail early on HTTP errors
    response = call.json()

    return response['items'] if extract_items else response

def get_from_legacy(access, start, end, save_file=False):
    # The API expects YYYYMMDDHH timestamps, so append day 01 and hour 00.
    params = {"access": access,
              "start": "{start}0100".format(start=start),
              "end": "{end}0100".format(end=end)}
    data = api_call(endpoint=endpoint_legacy, parameters=params)
    if save_file:
        file_name = "pagecounts_{access}_{start}-{end}.json".format(access=access, start=start, end=end)
        with open(file_name, 'w') as f:
            json.dump(data, f)
    return data


def get_from_page_views(access, start, end, save_file=False):
    # agent="user" selects the "user" agent type of the pageviews endpoint.
    params = {"access": access,
              "agent": "user",
              "start": "{start}0100".format(start=start),
              "end": "{end}0100".format(end=end)}
    data = api_call(endpoint=endpoint_pageviews, parameters=params)
    if save_file:
        file_name = "pageviews_{access}_{start}-{end}.json".format(access=access, start=start, end=end)
        with open(file_name, 'w') as f:
            json.dump(data, f)
    return data

def get_wiki_df(file_path):
    # Load the saved API response and build a dataframe from its 'items' list.
    with open(file_path) as f:
        items = json.load(f)['items']
    df = pd.DataFrame(items)
    # Timestamps look like YYYYMMDDHH; keep only the year and month parts.
    df['timestamp'] = df['timestamp'].astype(str)
    df['year'] = df['timestamp'].str[:4]
    df['month'] = df['timestamp'].str[4:6]
    # The pageviews API reports 'views'; the legacy API reports 'count'.
    df['num_views'] = df['views'] if 'views' in df.columns else df['count']
    return df[['year', 'month', 'num_views']]
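
As a rough illustration of what `get_wiki_df` returns, the sketch below applies the same column logic to a tiny in-memory payload instead of a saved file. The sample records are made up, and `items_to_df` is a hypothetical stand-in that takes the `items` list directly.

```python
import pandas as pd


def items_to_df(items):
    """Hypothetical stand-in for get_wiki_df that skips the file read."""
    df = pd.DataFrame(items)
    df['timestamp'] = df['timestamp'].astype(str)
    df['year'] = df['timestamp'].str[:4]
    df['month'] = df['timestamp'].str[4:6]
    # The pageviews API reports 'views'; the legacy API reports 'count'.
    df['num_views'] = df['views'] if 'views' in df.columns else df['count']
    return df[['year', 'month', 'num_views']]


# Made-up records shaped like the two kinds of API response.
legacy_items = [{'timestamp': '2008010100', 'count': 100},
                {'timestamp': '2008020100', 'count': 120}]
pageviews_items = [{'timestamp': '2015070100', 'views': 300}]

legacy_df = items_to_df(legacy_items)
pageviews_df = items_to_df(pageviews_items)
print(legacy_df)
print(pageviews_df)
```

Both frames end up with the same three columns, which is what lets the later steps concatenate legacy and pageviews data.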


## Step 1: Gathering the data

In this step we gather monthly Wikipedia traffic data.
The legacy endpoint provides data from January 2008 to July 2016.
The pageviews endpoint provides data from July 2015 to September 2020.

The two ranges overlap from July 2015 to July 2016.
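
The overlap can be checked with a small sketch that enumerates the months in each range. It only uses the date ranges stated above; `month_range` is a hypothetical helper, and the sketch does not verify what the APIs actually return.

```python
def month_range(start, end):
    """Return every 'YYYYMM' string from start to end, inclusive."""
    year, month = int(start[:4]), int(start[4:6])
    end_year, end_month = int(end[:4]), int(end[4:6])
    months = []
    while (year, month) <= (end_year, end_month):
        months.append('{:04d}{:02d}'.format(year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


legacy_months = month_range('200801', '201607')
pageviews_months = month_range('201507', '202009')

# Months that appear in both ranges.
overlap = sorted(set(legacy_months) & set(pageviews_months))
print(overlap[0], overlap[-1], len(overlap))
```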

After the step below runs, five JSON files are created:

- pagecounts_desktop-site_200801-201607.json - desktop site data from the legacy API, January 2008 to July 2016
- pagecounts_mobile-site_200801-201607.json - mobile site data from the legacy API, January 2008 to July 2016
- pageviews_desktop_201507-202009.json - desktop data from the pageviews API, July 2015 to September 2020
- pageviews_mobile-app_201507-202009.json - mobile app data from the pageviews API, July 2015 to September 2020
- pageviews_mobile-web_201507-202009.json - mobile web data from the pageviews API, July 2015 to September 2020
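
The file names follow the pattern used by the helper functions when they save a response. A minimal sketch of that naming, with `output_file_name` as a hypothetical helper that is not part of the original code:

```python
legacy_access_points = ['desktop-site', 'mobile-site']
page_view_access_points = ['desktop', 'mobile-app', 'mobile-web']


def output_file_name(prefix, access, start, end):
    """Hypothetical helper mirroring the naming pattern of the saved files."""
    return '{prefix}_{access}_{start}-{end}.json'.format(
        prefix=prefix, access=access, start=start, end=end)


file_names = [output_file_name('pagecounts', access, '200801', '201607')
              for access in legacy_access_points]
file_names += [output_file_name('pageviews', access, '201507', '202009')
               for access in page_view_access_points]

for name in file_names:
    print(name)
```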

In [285]:
legacy_start_date = '200801'
legacy_end_date = '201607'
for access_type in legacy_access_points:
    get_from_legacy(access=access_type, start=legacy_start_date, end=legacy_end_date, save_file=True)

pageviews_start_date = '201507'
pageviews_end_date = '202009'
for access_type in page_view_access_points:
    get_from_page_views(access=access_type, start=pageviews_start_date, end=pageviews_end_date, save_file=True)

## Step 2: Processing the data
The objective of this step is to combine the data from all the JSON files collected in the gathering step into a single CSV file on which we will run the analysis.

For the pageviews API, we sum the counts from mobile-app and mobile-web to get a single value for mobile traffic.

For all the data, we use only the `YYYYMM` part of the timestamp and ignore the `DDHH` part.
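
The two rules above can be sketched in plain Python, without pandas, to show the intent. The records are made up for illustration, and `split_timestamp` is a hypothetical helper; the notebook code itself does the same thing with `pd.concat` and `groupby`.

```python
from collections import defaultdict

# Made-up sample records; the real ones come from the saved JSON files.
mobile_app_items = [{'timestamp': '2015070100', 'views': 10},
                    {'timestamp': '2015080100', 'views': 20}]
mobile_web_items = [{'timestamp': '2015070100', 'views': 5},
                    {'timestamp': '2015080100', 'views': 7}]


def split_timestamp(timestamp):
    """Keep the YYYYMM part of a YYYYMMDDHH timestamp and drop DDHH."""
    return timestamp[:4], timestamp[4:6]


# Sum mobile-app and mobile-web views per (year, month).
mobile_totals = defaultdict(int)
for item in mobile_app_items + mobile_web_items:
    mobile_totals[split_timestamp(item['timestamp'])] += item['views']

for (year, month), views in sorted(mobile_totals.items()):
    print(year, month, views)
```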

In [288]:
pagecounts_desktop_fp = 'pagecounts_desktop-site_200801-201607.json'
pagecounts_mobile_fp = 'pagecounts_mobile-site_200801-201607.json'
pageviews_desktop_fp = 'pageviews_desktop_201507-202009.json'
pageviews_mobile_app_fp = 'pageviews_mobile-app_201507-202009.json'
pageviews_mobile_web_fp = 'pageviews_mobile-web_201507-202009.json'

pagecounts_desktop_df = get_wiki_df(pagecounts_desktop_fp)
pagecounts_mobile_df = get_wiki_df(pagecounts_mobile_fp)
pageviews_desktop_df = get_wiki_df(pageviews_desktop_fp)
pageviews_mobile_app_df = get_wiki_df(pageviews_mobile_app_fp)
pageviews_mobile_web_df = get_wiki_df(pageviews_mobile_web_fp)

In [289]:
pageviews_mobile_df = pd.concat([pageviews_mobile_app_df,pageviews_mobile_web_df])
pageviews_mobile_df = pageviews_mobile_df.groupby(['year', 'month']).sum().reset_index()
pagecount_all_df = pd.concat([pagecounts_desktop_df, pagecounts_mobile_df])
pagecount_all_df = pagecount_all_df.groupby(['year', 'month']).sum().reset_index()

pageview_all_df = pd.concat([pageviews_desktop_df, pageviews_mobile_df])
pageview_all_df = pageview_all_df.groupby(['year', 'month']).sum().reset_index()

In [295]:
all_of_all_df = pd.concat([pagecount_all_df, pageview_all_df])[['year', 'month']]


In [296]:
all_of_all_df.head(3)

|     | year | month |
| --- | --- | --- |
| 0 | 2008 | 1 |
| 1 | 2008 | 2 |
| 2 | 2008 | 3 |


In [299]:
all_of_all_df.groupby(['year', 'month']).sum().reset_index()

|     | year | month |
| --- | --- | --- |
| 0 | 2008 | 01 |
| 1 | 2008 | 02 |
| 2 | 2008 | 03 |
| 3 | 2008 | 04 |
| 4 | 2008 | 05 |
| ... | ... | ... |
| 147 | 2020 | 04 |
| 148 | 2020 | 05 |
| 149 | 2020 | 06 |
| 150 | 2020 | 07 |
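
One possible way to reach the single CSV file that Step 2 aims for is an outer merge on `year` and `month`, so that months covered by only one API still get a row. This is a sketch with made-up numbers, not the notebook's own method: the column names `pagecount_all_views` and `pageview_all_views`, and the choice to fill missing months with 0, are assumptions.

```python
import pandas as pd

# Made-up monthly totals standing in for pagecount_all_df and pageview_all_df.
pagecount_all = pd.DataFrame({'year': ['2015', '2015'],
                              'month': ['06', '07'],
                              'num_views': [100, 110]})
pageview_all = pd.DataFrame({'year': ['2015', '2015'],
                             'month': ['07', '08'],
                             'num_views': [90, 95]})

# Outer merge keeps months that exist in only one of the two sources.
combined = pd.merge(
    pagecount_all.rename(columns={'num_views': 'pagecount_all_views'}),
    pageview_all.rename(columns={'num_views': 'pageview_all_views'}),
    on=['year', 'month'],
    how='outer',
)
# Assumption: a month missing from one API is recorded as 0 views.
combined = combined.fillna(0).sort_values(['year', 'month']).reset_index(drop=True)

# With no path argument, to_csv returns the CSV text; pass a file path to write it to disk.
csv_text = combined.to_csv(index=False)
print(csv_text)
```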
