# Step 2: Analysis
You will conduct a very basic visual analysis. The analysis for this homework is to graph specific subsets of the data as time series. You will produce three different graphs.

## 2.i. Maximum Average and Minimum Average
The first graph should contain time series for the articles that have the highest average monthly page requests and the lowest average monthly page requests for desktop access and mobile access. Your graph should have four lines (max desktop, min desktop, max mobile, min mobile).
- We start by importing the necessary libraries.

In [1]:
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

- Next, we define the constants used throughout the notebook.

In [2]:
DATA = "data"
MOBILE_ACCESS_DATA_FILENAME = "academy_monthly_mobile_201507-202309.json"
DESKTOP_ACCESS_DATA_FILENAME = "academy_monthly_desktop_201507-202309.json"
RELATIVE_PATH_NOTATION = ".."
RESULTS = "results"

- We read and load JSON data files for desktop and mobile access, then convert them into dataframes for ease of analysis.

In [3]:
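# Open each file in a context manager so the file handle is closed after loading.
with open(os.path.join(RELATIVE_PATH_NOTATION, DATA, DESKTOP_ACCESS_DATA_FILENAME)) as desktop_data:
    desktop_json = json.load(desktop_data)
with open(os.path.join(RELATIVE_PATH_NOTATION, DATA, MOBILE_ACCESS_DATA_FILENAME)) as mobile_data:
    mobile_json = json.load(mobile_data)

# Flatten the JSON records into dataframes.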
desktop_df = pd.json_normalize(desktop_json)
mobile_df = pd.json_normalize(mobile_json)
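
To make the expected input shape concrete, here is a minimal sketch of how `pd.json_normalize` turns a list of records into a dataframe. The two records are made up for illustration; the field names `article`, `timestamp`, and `views` are assumed from the columns used later in this notebook, and the real files may contain additional fields.

```python
import pandas as pd

# Hypothetical records, assumed to mirror the structure of the JSON files.
sample_records = [
    {"article": "Example_Film", "timestamp": "2015070100", "views": 1200},
    {"article": "Example_Film", "timestamp": "2015080100", "views": 950},
]

sample_df = pd.json_normalize(sample_records)

# Timestamps in the YYYYMMDDHH layout parse with the '%Y%m%d%H' format.
sample_df["timestamp"] = pd.to_datetime(sample_df["timestamp"], format="%Y%m%d%H")
print(sample_df.dtypes)
```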

- We define a function that computes the mean monthly views for each article and identifies the articles with the highest and lowest averages. The access type is used only to label the resulting rows; the function does not filter on it.

- The function keeps only the essential columns: timestamp, article title, and page views.

In [4]:
def maximum_and_minimum_avg(df, access):
    """
    Calculate the mean views for articles in a DataFrame and identify the articles with the maximum and minimum mean views.
    
    Parameters:
    - df (DataFrame): The input DataFrame containing columns 'timestamp', 'article', and 'views'.
    - access (str): The access type (e.g., 'desktop' or 'mobile') for labeling data points.

    Returns:
    - result_df (DataFrame): A DataFrame containing data points for articles with maximum and minimum mean views,
      labeled with their respective access type and article name.
    """

    # Parse the YYYYMMDDHH timestamps in place so that later cells also see datetimes.
    df['timestamp'] = pd.to_datetime(df['timestamp'].astype('str'), format='%Y%m%d%H')

    # Mean monthly views per article; idxmax/idxmin return the article names.
    mean_views = df.groupby('article')['views'].mean()
    article_max = mean_views.idxmax()
    article_min = mean_views.idxmin()

    # Copy the selections so the label columns can be added without a SettingWithCopyWarning.
    columns = ['timestamp', 'article', 'views']
    data_max = df.loc[df['article'] == article_max, columns].copy()
    data_min = df.loc[df['article'] == article_min, columns].copy()
    data_max['min_max'] = 'max_' + access
    data_max['label'] = data_max['min_max'] + '_' + data_max['article']
    data_min['min_max'] = 'min_' + access
    data_min['label'] = data_min['min_max'] + '_' + data_min['article']
    result_df = pd.concat([data_max, data_min])
    return result_df
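
As a quick illustration of the selection step inside the function, the sketch below applies the same group-by-mean logic to a tiny made-up dataset. The article names and view counts are hypothetical and chosen only so the averages are easy to check by hand.

```python
import pandas as pd

# Made-up monthly views for three hypothetical articles.
toy = pd.DataFrame({
    "article": ["A", "A", "B", "B", "C", "C"],
    "views":   [100, 300, 10, 30, 50, 70],
})

# Mean views per article: A = 200, B = 20, C = 60.
mean_views = toy.groupby("article")["views"].mean()

# idxmax/idxmin return the index label, which here is the article name.
highest = mean_views.idxmax()
lowest = mean_views.idxmin()
print(highest, lowest)
```

Selecting the `views` column before calling `mean()` keeps the result a Series indexed by article, so no extra lookup is needed to recover the article name.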

- We call this function for both the desktop and mobile data, concatenate the results, and set the timestamp as the index.

In [5]:
desktop_main = maximum_and_minimum_avg(desktop_df, access='desktop')
mobile_main = maximum_and_minimum_avg(mobile_df, access='mobile')
df_main = pd.concat([desktop_main, mobile_main])
df_main = df_main.set_index('timestamp')

- To plot the data in the "df_main" dataframe, we split it into one dataframe per line on the graph:
    1. "df_1a": the article with the highest average monthly views for desktop access.
    2. "df_1b": the article with the lowest average monthly views for desktop access.
    3. "df_2a": the article with the highest average monthly views for mobile access.
    4. "df_2b": the article with the lowest average monthly views for mobile access.

- Finally, we save the figure as a .png file in the results folder.

In [6]:
# Create the figure that the four lines are drawn on.
fig2a = plt.figure(figsize = (25,10), dpi = 720)
df_1a = df_main[df_main['min_max']=='max_desktop']
df_1a['views'].plot(label = df_1a['label'].unique()[0], color = 'teal')
df_1b = df_main[df_main['min_max']=='min_desktop']
df_1b['views'].plot(label = df_1b['label'].unique()[0], color = 'turquoise')
df_2a = df_main[df_main['min_max']=='max_mobile']
df_2a['views'].plot(label = df_2a['label'].unique()[0], color = 'maroon')
df_2b = df_main[df_main['min_max']=='min_mobile']
df_2b['views'].plot(label = df_2b['label'].unique()[0], color = 'lightpink')
plt.title('Maximum Average and Minimum Average')
plt.xlabel('Time')
plt.ylabel('Views')
plt.legend()
plt.savefig(os.path.join(RELATIVE_PATH_NOTATION, RESULTS, "Max_Min_Average_plot.png"))
plt.close(fig2a)

## 2.ii. Top 10 Peak Page Views

The second graph should contain time series for the top 10 article pages by largest (peak) page views over the entire time by access type. You first find the month for each article that contains the highest (peak) page views, and then order the articles by these peak values. Your graph should contain the top 10 for desktop and top 10 for mobile access (20 lines).
- For each article, we find its peak (highest) monthly page views.

- We then rank the articles by these peak values and keep the top 10 for each access type, giving 20 lines in total.

In [7]:
def top_10_peak_page_views(df, access):
    """
    Extracts the top 10 articles with the highest views for a given access type and
    returns a modified dataframe containing these top articles with additional information.

    Parameters:
        df (DataFrame): The input dataframe containing article data.
        access (str): The access type for which the top articles should be extracted.

    Returns:
        DataFrame: A dataframe containing the top 10 articles for the specified access type
        along with additional information like the timestamp, access type, and views.
    """

    # Peak (maximum) monthly views per article, ranked from highest to lowest.
    group_df = df.groupby('article')['views'].max().reset_index()
    group_df = group_df.sort_values(by='views', ascending=False)
    only_top10 = group_df.head(10)[['article']].reset_index(drop=True)
    only_top10['rank'] = np.arange(1, len(only_top10) + 1)

    # assign() returns a new dataframe, so the caller's dataframe is not modified.
    df = df.assign(access=access)[['timestamp', 'article', 'access', 'views']]
    df = df.merge(only_top10, on='article', how='inner')
    df['label'] = 'top_' + df['rank'].astype('str') + '_' + access + '_' + df['article']
    return df
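
The sketch below shows, on made-up data, how ranking by peak views differs from ranking by average views, and how the month of each peak can be recovered with `idxmax`. The `month` column and all values are hypothetical; the function above ranks by peak value only and does not return the peak month.

```python
import pandas as pd

# Made-up data: B has the higher average, but A has the higher single-month peak.
toy = pd.DataFrame({
    "article": ["A", "A", "B", "B", "C", "C"],
    "month":   ["2020-01", "2020-02", "2020-01", "2020-02", "2020-01", "2020-02"],
    "views":   [100, 900, 700, 600, 50, 70],
})

# Row index of each article's peak month, then the full rows for those peaks.
peak_index = toy.groupby("article")["views"].idxmax()
peak_rows = toy.loc[peak_index].sort_values("views", ascending=False)

# Keep the top 2 (instead of 10) and rank them starting from 1.
top2 = peak_rows.head(2).reset_index(drop=True)
top2["rank"] = list(range(1, len(top2) + 1))
print(top2)
```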

- We invoke the function provided above with "mobile" and "desktop" as access types, followed by consolidating both datasets into a single dataframe to facilitate more convenient visualization.

In [8]:
top_desktop = top_10_peak_page_views(desktop_df, 'desktop')
top_mobile = top_10_peak_page_views(mobile_df, 'mobile')
df_final = pd.concat([top_desktop, top_mobile])
df_final = df_final.set_index('timestamp')

- We loop over ranks 1 to 10 and plot the desktop and mobile article at each rank, giving 20 lines in total.

- Finally, we save the figure as a .png file in the results folder.

In [9]:
# Create the figure that the 20 lines are drawn on.
fig2b = plt.figure(figsize = (25,10), dpi = 720)
for i in range(1,11):
    # Plot the desktop and then the mobile article at this rank.
    for access in ('desktop', 'mobile'):
        line = df_final[(df_final['rank'] == i) & (df_final['access'] == access)]
        line['views'].plot(label=line['label'].iloc[0])
plt.title('Top 10 Peak Page Views')
plt.xlabel('Time')
plt.ylabel('Views')
plt.legend()
plt.savefig(os.path.join(RELATIVE_PATH_NOTATION, RESULTS, "Top_10_Peak_Page_Views_plot.png"))
plt.close(fig2b)

## 2.iii. Fewest Months of Data
The third graph should show pages that have the fewest months of available data. These will all be relatively short time series and should contain a set of the most recent academy award winners. Your graph should show the 10 articles with the fewest months of data for desktop access and the 10 articles with the fewest months of data for mobile access.
- We begin by counting the months of available data for each article, separately for each access type.

- Next, we take the ten articles with the lowest counts, sorted in ascending order, and assign ranks based on that order.

- Finally, we add a label column to use in the plot legend.

In [10]:
def fewest_df(df, access):
    """
    Select the 10 articles with the fewest months of available data
    for a specific access type and join them with the original DataFrame.

    Parameters:
    - df (DataFrame): The input DataFrame containing article view data.
    - access (str): The access type (e.g., 'desktop' or 'mobile') used to label the output.

    Returns:
    - df (DataFrame): The rows for the 10 articles with the fewest months of data
      for the specified access type, including 'access', 'rank', and 'label' columns.
    """

    # Each row is assumed to be one month of data, so the row count per article is its number of months.
    few_df = df.groupby("article")['views'].count().reset_index()
    few_df = few_df.sort_values(by='views', ascending=True)
    top_10 = few_df.head(10)[['article']].reset_index(drop=True)
    top_10['rank'] = np.arange(1, len(top_10) + 1)

    # assign() returns a new dataframe, so the caller's dataframe is not modified.
    df = df.assign(access=access)[['timestamp', 'article', 'access', 'views']]
    df = df.merge(top_10, on='article', how='inner')
    df['label'] = 'fewest_' + df['rank'].astype('str') + '_' + access + '_' + df['article']
    return df
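
For intuition, the counting step can be sketched on a small made-up dataset. The articles, timestamps, and view counts below are hypothetical, and `nsmallest` is used here as a compact alternative to sorting and taking the head; it is not what the function above calls.

```python
import pandas as pd

# Made-up data: A has three months, B has one month, C has two months.
toy = pd.DataFrame({
    "article":   ["A", "A", "A", "B", "C", "C"],
    "timestamp": ["2023-07", "2023-08", "2023-09", "2023-09", "2023-08", "2023-09"],
    "views":     [10, 20, 30, 40, 50, 60],
})

# Number of distinct months of data available per article.
months_available = toy.groupby("article")["timestamp"].nunique()

# The two articles with the fewest months, in ascending order.
fewest = months_available.nsmallest(2)
print(fewest)
```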

- We invoke the function above with "mobile" and "desktop" as access types, subsequently consolidating them into a single dataframe to facilitate more convenient data visualization.

In [11]:
top_desktop = fewest_df(desktop_df, 'desktop')
top_mobile = fewest_df(mobile_df, 'mobile')
df_final = pd.concat([top_desktop, top_mobile])
df_final = df_final.set_index('timestamp')

- We loop over ranks 1 to 10 and plot the desktop and mobile article at each rank, so the graph shows the 10 articles with the fewest months of data for each access type (20 lines in total).
- Finally, we save the figure as a .png file in the results folder.

In [12]:
# Create the figure that the 20 lines are drawn on.
fig2c = plt.figure(figsize = (25,10), dpi = 720)
for i in range(1,11):
    # Plot the desktop and then the mobile article at this rank.
    for access in ('desktop', 'mobile'):
        line = df_final[(df_final['rank'] == i) & (df_final['access'] == access)]
        line['views'].plot(label=line['label'].iloc[0])
plt.title('Fewest Months of Data')
plt.xlabel('Time')
plt.ylabel('Views')
plt.legend()
plt.savefig(os.path.join(RELATIVE_PATH_NOTATION, RESULTS, "Fewest_months_plot.png"))
plt.close(fig2c)