# Data Analysis
### Samuel Mendez

This journal defines functions that extract information from the clean data created in the preprocessing journal. Each function can filter by subscription type, so 365 Data Science can see the most prominent page-visit patterns for each subscription type. The following functions are defined:  
`Page count` is the most fundamental metric; it counts how many times each page appears across all user journeys.  
`Page presence` is similar to `Page count` but counts each page at most once per journey; it shows how many journeys contain each page.  
`Page destination` shows the most frequent follow-ups after every page. It looks at every page and counts which pages come next. Anyone interested in what users do after visiting page X can consult this metric.  
`Page sequences` finds the most popular sequence of pages for a sequence length the user specifies.  
`Journey length` is the average length of a user journey, measured in pages.  
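
To make the difference between `Page count` and `Page presence` concrete, here is a minimal sketch on a single made-up journey. The journey string and the variable names are illustrative assumptions, not values from the data set.

```python
from collections import Counter

# Hypothetical journey, made up for illustration
journey = "Homepage-Log in-Homepage-Checkout"
pages = journey.split("-")

# Page count: every visit is counted
toy_count = Counter(pages)

# Page presence: each page is counted at most once per journey
toy_presence = Counter(set(pages))

print(toy_count["Homepage"])     # 2
print(toy_presence["Homepage"])  # 1
```

`Homepage` is visited twice, so it adds 2 to the page count but only 1 to the page presence.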

In [65]:
# import libraries
import pandas as pd
import numpy as np
from collections import Counter

In [200]:
# import clean data
data = pd.read_csv('clean_data.csv')

In [201]:
data.head(10)

Unnamed: 0,user_id,subscription_type,user_journey
0,1516,Annual,Homepage-Log in-Other-Sign up-Log in-Homepage-...
1,3395,Annual,Other-Pricing-Sign up-Log in-Homepage-Pricing-...
2,10107,Annual,Homepage-Career tracks-Homepage-Career tracks-...
3,11145,Monthly,Homepage-Log in-Homepage-Log in-Homepage-Log i...
4,12400,Monthly,Homepage-Career tracks-Sign up-Log in-Other-Ca...
5,13082,Monthly,Checkout-Homepage-Sign up-Log in-Checkout
6,14415,Monthly,Pricing-Sign up-Pricing-Sign up-Homepage-Log i...
7,15630,Annual,Log in-Checkout-Pricing-Checkout-Other-Homepag...
8,16589,Quarterly,Homepage-Career tracks-Homepage-Career tracks-...
9,19458,Annual,Homepage-Sign up-Log in-Other-Homepage-Pricing...


In [202]:
# check for missing values
data.isnull().sum()

user_id              0
subscription_type    0
user_journey         0
dtype: int64

In [203]:
# define a function that counts how many times each page appears across all user journeys

def page_count(df, subtype = 'All', target_column = 'user_journey'):
    df_copy = df.copy()
    # drop missing values 
    df_copy = df_copy.dropna()
    
    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]

            
    # split the pages at the hyphen
    pages_list = df_copy[target_column].str.split('-')
    # flatten the rows of list
    pages = [page for sublist in pages_list for page in sublist]
    # create a series to count the values  
    pages_count = pd.Series(pages).value_counts()
        
        
    
    return pages_count

In [204]:
page_count(data, subtype = 'Quarterly')

Homepage                    109
Log in                       94
Sign up                      65
Checkout                     63
Career tracks                36
Courses                      36
Pricing                      33
Other                        26
Career track certificate     18
Course certificate           10
Coupon                        8
Resources center              8
Success stories               4
Upcoming courses              2
About us                      1
Name: count, dtype: int64

In [166]:
# define a function that counts how many user journeys contain each page

def page_presence(df, subtype = 'All', target_column = 'user_journey'):
    df_copy = df.copy()
    # drop missing values 
    df_copy = df_copy.dropna()
        
    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]
        
    # split the pages at the hyphen
    pages_list = df_copy[target_column].str.split('-')
    
    # convert each list to a set to remove duplicates
    pages_list = pages_list.apply(lambda x: list(set(x)) if isinstance(x, list) else x)
    
    # flatten the rows of list
    pages = [page for sublist in pages_list for page in sublist]
        
    # create a series to count the values  
    pages_count = pd.Series(pages).value_counts()

    
    return pages_count

In [205]:
page_presence(data, subtype = 'Quarterly')

Checkout                    38
Sign up                     36
Homepage                    32
Log in                      32
Pricing                     16
Other                       16
Courses                     15
Career tracks               13
Career track certificate     8
Course certificate           7
Coupon                       6
Resources center             4
Success stories              2
Upcoming courses             2
About us                     1
Name: count, dtype: int64

In [177]:
# define a function that shows the most frequent follow-ups after every page

def page_destination(df, subtype = 'All', target_column = 'user_journey'):
    df_copy = df.copy()
    # drop missing values 
    df_copy = df_copy.dropna()
        
    # choose membership type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]
    
    # split each row at the hyphen
    pages_list = df_copy[target_column].str.split('-')
    
    
    # count the following page after every page
    page_pair_count = Counter()
    
    for pair in pages_list:
        if isinstance(pair, list):
            for i in range(len(pair)-1):
                page_pair = (pair[i], pair[i+1])
                page_pair_count[page_pair] += 1
    
    return page_pair_count


In [206]:
page_destination(data, subtype = 'All')

Counter({('Homepage', 'Log in'): 953,
         ('Log in', 'Homepage'): 817,
         ('Log in', 'Checkout'): 701,
         ('Homepage', 'Pricing'): 449,
         ('Sign up', 'Log in'): 394,
         ('Homepage', 'Career tracks'): 357,
         ('Career tracks', 'Courses'): 353,
         ('Resources center', 'Other'): 344,
         ('Homepage', 'Sign up'): 341,
         ('Other', 'Resources center'): 295,
         ('Pricing', 'Checkout'): 291,
         ('Courses', 'Career tracks'): 290,
         ('Other', 'Log in'): 290,
         ('Sign up', 'Homepage'): 279,
         ('Homepage', 'Courses'): 246,
         ('Sign up', 'Checkout'): 222,
         ('Log in', 'Other'): 213,
         ('Courses', 'Sign up'): 213,
         ('Career tracks', 'Sign up'): 198,
         ('Checkout', 'Homepage'): 196,
         ('Career tracks', 'Homepage'): 182,
         ('Courses', 'Homepage'): 175,
         ('Career track certificate', 'Career tracks'): 173,
         ('Log in', 'Coupon'): 170,
         ('Pricing'
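
The full `Counter` is long, so answering "what do users do after page X?" usually means filtering it by the first element of each pair. A possible helper is sketched below. `top_destinations` and the example counts are hypothetical and not part of the original journal; the input only needs to have the same shape as the `Counter` returned by `page_destination`.

```python
from collections import Counter

def top_destinations(pair_counts, page, n=3):
    """Return the n most frequent pages that directly follow `page`."""
    followers = Counter()
    for (source, destination), count in pair_counts.items():
        if source == page:
            followers[destination] += count
    return followers.most_common(n)

# Hypothetical pair counts, shaped like the output of page_destination
example_pairs = Counter({
    ("Homepage", "Log in"): 5,
    ("Homepage", "Pricing"): 3,
    ("Log in", "Checkout"): 4,
    ("Homepage", "Sign up"): 1,
})

print(top_destinations(example_pairs, "Homepage", n=2))
# [('Log in', 5), ('Pricing', 3)]
```

With the real data, the call would look like `top_destinations(page_destination(data), 'Homepage')`.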

In [66]:
# define a function that returns the most popular sequence for a given number of pages per sequence

def page_sequence(df, sequence_size, subtype = 'All', target_column = 'user_journey'):
    df_copy = df.copy()
    # drop na
    df_copy = df_copy.dropna()
    
    # filter by subscription type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]
    
    # split strings at hyphen
    pages_list = df_copy[target_column].str.split('-')
    
    # count every occurrence of each page sequence across all journeys
    sequence_count = Counter()
    
    for sequence in pages_list:
        if len(sequence) >= sequence_size:
            for i in range(len(sequence)-sequence_size +1):
                sequence_group = tuple(sequence[i:i + sequence_size])
                sequence_count[sequence_group] += 1
    
    top_sequence, count = sequence_count.most_common(1)[0]
    
    return top_sequence, count


In [196]:
page_sequence(data, sequence_size = 4)

(('Log in', 'Homepage', 'Log in', 'Checkout'), 49)
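
Note that `page_sequence` slides a window over each journey and counts every occurrence, so a sequence repeated within one journey is counted more than once. If a once-per-journey count is preferred, one possible variant collects the windows of each journey into a set first. This is a sketch under that assumption; `sequences_once_per_journey` and the toy journeys are hypothetical.

```python
from collections import Counter

def sequences_once_per_journey(journeys, sequence_size):
    """Count each page sequence at most once per journey."""
    counts = Counter()
    for journey in journeys:
        pages = journey.split("-")
        # A set removes repeats of the same window within one journey
        windows = {
            tuple(pages[i:i + sequence_size])
            for i in range(len(pages) - sequence_size + 1)
        }
        counts.update(windows)
    return counts

# Hypothetical journeys, made up for illustration
toy_journeys = [
    "Homepage-Log in-Homepage-Log in",
    "Homepage-Log in-Checkout",
]
toy_sequences = sequences_once_per_journey(toy_journeys, 2)
print(toy_sequences[("Homepage", "Log in")])  # 2
```

Counting every occurrence would give 3 for `('Homepage', 'Log in')` here, because the pair appears twice in the first journey.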

In [193]:
# define a function that computes the average number of pages per user journey

def avg_journey(df, subtype = 'All', target_column = 'user_journey'):
    df_copy = df.copy()
    
    # drop missing values
    df_copy = df_copy.dropna()
    
    # filter by subscription type
    if subtype != 'All':
        df_copy = df_copy[df_copy['subscription_type'].str.contains(subtype)]
    
    # separate strings at hyphen
    pages_list = df_copy[target_column].str.split('-')
    
    # count the length of every row
    journey_length = pages_list.apply(len)
    
    # Average journey 
    avg_visits = pd.DataFrame(journey_length).mean().round(1)
    
    print("Average journey for", subtype, "subscribers:", avg_visits)


In [194]:
avg_journey(data, subtype = 'All')

Average journey for All subscribers: user_journey    3.6
dtype: float64
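
The output above is a pandas Series because `DataFrame.mean()` returns one value per column. If a plain number is easier to reuse, one option is to call `mean()` directly on the Series of journey lengths. The sketch below uses a made-up three-row frame, and `avg_journey_value` is a hypothetical name, not a function from this journal.

```python
import pandas as pd

def avg_journey_value(df, target_column="user_journey"):
    """Return the mean journey length (in pages) as a plain float."""
    lengths = df[target_column].dropna().str.split("-").str.len()
    return round(float(lengths.mean()), 1)

# Hypothetical data, made up for illustration
toy = pd.DataFrame({
    "subscription_type": ["Annual", "Annual", "Monthly"],
    "user_journey": [
        "Homepage-Log in",
        "Homepage-Pricing-Checkout-Log in",
        "Checkout",
    ],
})

toy_avg = avg_journey_value(toy)
print(toy_avg)  # 2.3
```

The three journeys have 2, 4, and 1 pages, so the mean is 7 / 3, which rounds to 2.3.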


End of journal.