# Process Data

This script reformats and aggregates the mobile and desktop data in the 'raw_data' folder in order to create the visualizations described in the ReadMe:
- calculates the average number of pageviews for each article and access type
- calculates the peak number of pageviews for each article and access type
- calculates the number of months of available data for each article and access type
- identifies what articles to include for each visualization
- creates timeseries data for each visualization, ready to be plotted
- saves processed data to 'processed_data' folder

## Import Packages

In [1]:
import pandas as pd
from statistics import mean

## Load Data

In [2]:
mobile_df = pd.read_json('../raw_data/dino_monthly_mobile_201501-202209.json', orient='index')
desktop_df = pd.read_json('../raw_data/dino_monthly_desktop_201501-202209.json', orient='index')

setting the orient parameter to 'index' is necessary because we want the index of the dataframe to be the keys of the dictionary (the article titles), with one row per article
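As a rough illustration of what orient='index' does, the sketch below loads a tiny in-memory JSON document. The article names and counts are made up, and the layout (article title mapping to an 'items' list) is an assumption based on how the data is used later in this script.

```python
import io
import pandas as pd

# Hypothetical stand-in for one raw file: article title -> {'items': [...]}
toy_json = io.StringIO(
    '{"Tyrannosaurus": {"items": [{"timestamp": "2015070100", "views": 10}]},'
    ' "Velociraptor": {"items": [{"timestamp": "2015070100", "views": 5}]}}'
)

toy_df = pd.read_json(toy_json, orient='index')

# Each dictionary key becomes a row label; 'items' becomes a column of lists
print(toy_df.index.tolist())
print(toy_df.loc['Tyrannosaurus', 'items'])
```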

## Functions

### Function to aggregate pageviews for an article using a given function

Inputs:
- article: a list of dictionaries for a specific article
    - dictionary keys must include 'views'
- func: a function to apply to the list of viewcounts; it must reduce the list to a single value (for example, mean or max)

Output:
- a single value of the aggregated viewcount

In [3]:
def aggregate_dino(article, func):
    # Collect the monthly view counts, then reduce them with func (e.g. mean or max)
    views = [month['views'] for month in article]
    return func(views)
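A small usage sketch on made-up records may help show what the function returns. The function body is repeated in a compact form so the example runs on its own, and the toy_article values are hypothetical.

```python
from statistics import mean

def aggregate_dino(article, func):
    # Same logic as the function above, repeated so this sketch is self-contained
    return func([month['views'] for month in article])

# Hypothetical monthly records for a single article
toy_article = [
    {'timestamp': '2015070100', 'views': 10},
    {'timestamp': '2015080100', 'views': 30},
    {'timestamp': '2015090100', 'views': 20},
]

avg_views = aggregate_dino(toy_article, mean)   # average of 10, 30, 20
peak_views = aggregate_dino(toy_article, max)   # largest single month
n_months = aggregate_dino(toy_article, len)     # any list-reducing function works
```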

### Function to turn dictionary data into timeseries

Inputs:
- article: a list of dictionaries for a specific article
    - dictionary keys must include 'views' and 'timestamp'
    - timestamp must be in the format YYYYMMDD00
- name: a string of what to name the timeseries (for the purpose of labeling in a legend)

Output:
- a dataframe with the following columns
    - month: a datetime object with the month and year of the data
    - views: the number of pageviews in that month for the input article
    - article: a name for the purpose of labeling in a legend

In [4]:
def get_timeseries(article, name):
    # Build all rows in one step; DataFrame.append was removed in pandas 2.0
    temp = pd.DataFrame(
        [{'month': month['timestamp'], 'views': month['views']} for month in article],
        columns=['month', 'views'],
    )
    # Timestamps look like YYYYMMDD00, so the trailing '00' is matched literally
    temp['month'] = pd.to_datetime(temp['month'], format='%Y%m%d00')
    temp['views'] = temp['views'].astype(int)
    temp['article'] = name
    return temp
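The timestamp handling is the least obvious step, so here is a minimal sketch of it in isolation. The two sample strings are invented and only assume the YYYYMMDD00 layout described above.

```python
import pandas as pd

# Hypothetical raw timestamps in the YYYYMMDD00 layout
raw = pd.Series(['2015070100', '2015080100'])

# The trailing '00' in the format string is treated as literal text
months = pd.to_datetime(raw, format='%Y%m%d00')
print(months)
```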

### Function to create timeseries data for a subset of articles

Inputs:
- filt_mobile: a filtered list of article names for the mobile access type
- filt_desktop: a filtered list of article names for the desktop access type

Output:
- a dataframe with the following columns
    - month: a datetime object with the month and year of the data
    - views: the number of pageviews in that month for the input article
    - article: the name of the article
    - access_type: the access type (mobile or desktop)

In [5]:
def combine_timeseries(filt_mobile, filt_desktop):
    # Relies on the module-level mobile_df and desktop_df loaded above
    frames = []
    for article in filt_mobile:
        temp = get_timeseries(mobile_df.loc[article]['items'], article)
        temp['access_type'] = 'mobile'
        frames.append(temp)
    for article in filt_desktop:
        temp = get_timeseries(desktop_df.loc[article]['items'], article)
        temp['access_type'] = 'desktop'
        frames.append(temp)
    # Concatenate once at the end; return an empty frame if nothing was selected
    return pd.concat(frames) if frames else pd.DataFrame()
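The tag-then-stack pattern used here can be sketched with two hand-built frames. The values are hypothetical, and ignore_index=True is an optional choice made only for this illustration; the function above keeps each frame's original index.

```python
import pandas as pd

# Hypothetical one-row timeseries for the same article on each access type
mobile_part = pd.DataFrame({
    'month': pd.to_datetime(['2015-07-01']),
    'views': [10],
    'article': ['Tyrannosaurus'],
})
mobile_part['access_type'] = 'mobile'

desktop_part = pd.DataFrame({
    'month': pd.to_datetime(['2015-07-01']),
    'views': [25],
    'article': ['Tyrannosaurus'],
})
desktop_part['access_type'] = 'desktop'

# Stack the labeled frames into one long table
combined = pd.concat([mobile_part, desktop_part], ignore_index=True)
print(combined)
```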

## Process Data

### Aggregate data

get average viewcount for each article and each access type

In [6]:
mobile_df['mean'] = mobile_df['items'].apply(aggregate_dino, args=(mean,))
desktop_df['mean'] = desktop_df['items'].apply(aggregate_dino, args=(mean,))

get highest viewcount in a single month for each article and each access type

In [7]:
mobile_df['peak'] = mobile_df['items'].apply(aggregate_dino, args=(max,))
desktop_df['peak'] = desktop_df['items'].apply(aggregate_dino, args=(max,))

get the number of months there was data available for each article and each access type

In [8]:
mobile_df['num_months'] = mobile_df['items'].apply(len)
desktop_df['num_months'] = desktop_df['items'].apply(len)
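To see what these three columns look like, the sketch below runs the same apply calls on a two-row toy frame. The article names and counts are invented, and aggregate_dino is repeated in a compact form so the example is self-contained.

```python
from statistics import mean
import pandas as pd

def aggregate_dino(article, func):
    # Compact copy of the function defined earlier
    return func([month['views'] for month in article])

# Hypothetical frame: one row per article, 'items' holds the monthly records
toy_df = pd.DataFrame(
    {'items': [
        [{'views': 10}, {'views': 30}],
        [{'views': 4}],
    ]},
    index=['Tyrannosaurus', 'Velociraptor'],
)

# args passes the aggregating function through as the second argument
toy_df['mean'] = toy_df['items'].apply(aggregate_dino, args=(mean,))
toy_df['peak'] = toy_df['items'].apply(aggregate_dino, args=(max,))
toy_df['num_months'] = toy_df['items'].apply(len)
print(toy_df[['mean', 'peak', 'num_months']])
```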

### Identify relevant data for visualizations

identify articles with the smallest and largest average viewcount for each access type

In [9]:
mobile_min = mobile_df['mean'].idxmin()
mobile_max = mobile_df['mean'].idxmax()
desktop_min = desktop_df['mean'].idxmin()
desktop_max = desktop_df['mean'].idxmax()
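idxmin and idxmax return the row label rather than the value, which is why the results can be used directly as article names. A minimal sketch with made-up averages:

```python
import pandas as pd

# Hypothetical average viewcounts keyed by article title
toy_means = pd.Series(
    {'Tyrannosaurus': 900.0, 'Velociraptor': 450.0, 'Stegosaurus': 120.0},
    name='mean',
)

least_viewed = toy_means.idxmin()  # label of the smallest average
most_viewed = toy_means.idxmax()   # label of the largest average
print(least_viewed, most_viewed)
```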

identify the 10 articles with the highest viewcount in a single month for each access type

In [10]:
peak_mobile = mobile_df.nlargest(10, 'peak').index
peak_desktop = desktop_df.nlargest(10, 'peak').index

identify the 10 articles with the fewest months of available data for each access type

In [11]:
newest_mobile = mobile_df.nsmallest(10, 'num_months').index
newest_desktop = desktop_df.nsmallest(10, 'num_months').index
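The selection logic can be sketched on a toy frame, here picking 2 rows instead of 10. All values are hypothetical. Both methods default to keep='first', so when several articles tie at the cutoff, the ones that appear earlier in the frame are kept.

```python
import pandas as pd

# Hypothetical per-article aggregates
toy_df = pd.DataFrame(
    {'peak': [500, 900, 100, 700], 'num_months': [93, 12, 93, 40]},
    index=['Tyrannosaurus', 'Velociraptor', 'Stegosaurus', 'Triceratops'],
)

# Labels of the rows with the largest peaks, in descending order
top_peak = toy_df.nlargest(2, 'peak').index

# Labels of the rows with the fewest months, in ascending order
fewest_months = toy_df.nsmallest(2, 'num_months').index
print(list(top_peak), list(fewest_months))
```

Passing keep='all' instead would return every tied row, at the cost of possibly getting more rows than requested.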

### Create timeseries data

create and consolidate timeseries data for articles with the smallest and largest average viewcount for each access type

In [17]:
mobile_min_df = get_timeseries(mobile_df.loc[mobile_min]['items'], mobile_min + ' - mobile min')
mobile_max_df = get_timeseries(mobile_df.loc[mobile_max]['items'], mobile_max + ' - mobile max')
desktop_min_df = get_timeseries(desktop_df.loc[desktop_min]['items'], desktop_min + ' - desktop min')
desktop_max_df = get_timeseries(desktop_df.loc[desktop_max]['items'], desktop_max + ' - desktop max')
mean_df = pd.concat([mobile_min_df, mobile_max_df, desktop_min_df, desktop_max_df])

create and consolidate timeseries data for the 10 articles with the highest viewcount in a single month for each access type

In [13]:
peak_df = combine_timeseries(peak_mobile, peak_desktop)

create and consolidate timeseries data for articles with the fewest months of available data for each access type

In [14]:
newest_df = combine_timeseries(newest_mobile, newest_desktop)

### Save data

In [18]:
mean_df.to_csv('../processed_data/average_viewcount.csv', index=False)
peak_df.to_csv('../processed_data/peak_viewcount.csv', index=False)
newest_df.to_csv('../processed_data/fewest_months_viewcount.csv', index=False)
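CSV stores the month column as text, so a downstream plotting script would likely need to parse it again on load. The sketch below shows one way to do that using an in-memory buffer in place of the real files; the toy frame is hypothetical and only mirrors the column names used above.

```python
import io
import pandas as pd

# Hypothetical processed frame with the same columns as mean_df
toy = pd.DataFrame({
    'month': pd.to_datetime(['2015-07-01', '2015-08-01']),
    'views': [10, 30],
    'article': ['Tyrannosaurus', 'Tyrannosaurus'],
})

# Write without the index, as in the cell above, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores 'month' to a datetime column
reloaded = pd.read_csv(buffer, parse_dates=['month'])
print(reloaded.dtypes)
```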