# Week 1 Skills Assessment - Case Study


## Data Notes
This data challenge requires the use of two datasets (below). The accounts data reflects profile information about each user of an exercise-logging app, and the exercise-log data represents events logged by those users. The exercise-log dataset reflects some of the event data collected while testing 2 different software versions of the app (app_version). This exercise is simply meant to get you acquainted with the first few steps of using data for information & decision making. You will answer a few questions using the data provided, which you will import.
    
- accounts data: **user_accounts_data.csv**
    - user_id: unique identifier for each customer
    - churned: whether or not the customer account was considered churned at the time of this snapshot
    - app_version: the version of the app presented to the user when logging exercises
    - gender: user gender
    - primary_device: which device type the user was logging exercise events with
    - n_users_on_acct: how many "profiles" exist on the account (think Netflix)

<br />

- exercise logs: **user_exercise_log_data**
    - user_id: unique identifier for each customer
    - app_version: the version of the app presented to the user when logging exercises
    - start_timestamp: timestamp when the exercise event logging began
    - end_timestamp: timestamp when the exercise event logging concluded
    - satisfaction_score: a composite score computed from a few questions users answer when they conclude the exercise log

Both datasets for *this* part of the assessment can be found in the subdirectory: *case_study/data/*

Start by running the *Imports* cells, and the rest is on you! Good luck!

# Imports

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import pandas as pd  # ensure pandas version is 0.24; if not, run: !pip install --upgrade pandas
import numpy as np
import datetime as dt

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import Colormap,LinearSegmentedColormap
from matplotlib.gridspec import GridSpec
import matplotlib.patches as mpatches

In [5]:
%%capture
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in newer tqdm releases
tqdm.pandas()  # registers progress_apply on pandas objects

In [6]:
# Custom RGB palette, scaled from 0-255 to matplotlib's 0-1 range
myColors = [
    (88/255., 166/255., 24/255.),
    (233/255., 131/255., 0/255.),
    (0/255., 117/255., 191/255.),
    (147/255., 38/255., 143/255.),
]

name = 'dexmap'
cm = LinearSegmentedColormap.from_list(name, myColors, N=4)

In [7]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [8]:
sns.set_style('darkgrid')

In [9]:
sns.set_context('talk')

# Data Import & Preprocessing

Complete the following:
- Import 'user_accounts_data.csv' as a pandas dataframe
- Import 'user_exercise_logs_data.csv' as a pandas dataframe
- Determine if any columns have nulls in either dataset
- Create the following new columns:
    - 'start_date': a date-column representing the date the exercise log was initiated
    - 'end_date': a date-column representing the date the exercise log was completed
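
The steps above can be sketched as follows. This is only an illustration, not the expected solution: a tiny inline CSV with invented rows stands in for the real exercise-log file so the snippet runs on its own, and the column names are taken from the Data Notes. In the notebook you would pass the file path under *case_study/data/* to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the exercise-log CSV (values are illustrative only).
sample_csv = io.StringIO(
    "user_id,app_version,start_timestamp,end_timestamp,satisfaction_score\n"
    "1,v1,2019-01-01 08:00:00,2019-01-01 08:30:00,7.5\n"
    "2,v2,2019-01-01 23:50:00,2019-01-02 00:20:00,\n"
)

# parse_dates converts the timestamp columns to datetime64 on load.
logs = pd.read_csv(sample_csv, parse_dates=["start_timestamp", "end_timestamp"])

# Count nulls per column; any non-zero entry flags a column with missing values.
null_counts = logs.isnull().sum()

# .dt.date keeps only the calendar date of each timestamp.
logs["start_date"] = logs["start_timestamp"].dt.date
logs["end_date"] = logs["end_timestamp"].dt.date

print(null_counts)
print(logs[["start_date", "end_date"]])
```

The same `isnull().sum()` check applies unchanged to the accounts dataframe.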

# Basic Computations

## Answer each of the following questions about account data, adding your answer under the question

1. How many unique users are there?
2. What proportion of users are male vs. female?
3. What is the most common type of primary-device?
4. What is the average number of users per account?
5. Which gender, on average, adds more users to their account?
6. Which app-version saw a greater proportion of users churned?
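
As a rough guide to the pandas calls these questions map onto, here is a sketch on a made-up accounts table. The rows are invented, and coding `churned` as 0/1 is an assumption (the real file may use booleans or strings), so treat it as a pattern rather than an answer key.

```python
import pandas as pd

# Made-up accounts table using the column names from the Data Notes.
accounts = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "churned": [1, 0, 0, 1, 1, 0],  # assumption: churn is coded 0/1
    "app_version": ["v1", "v1", "v1", "v2", "v2", "v2"],
    "gender": ["F", "M", "F", "M", "F", "F"],
    "primary_device": ["phone", "phone", "watch", "phone", "tablet", "watch"],
    "n_users_on_acct": [1, 2, 3, 1, 2, 3],
})

n_unique_users = accounts["user_id"].nunique()                   # Q1
gender_share = accounts["gender"].value_counts(normalize=True)   # Q2: proportions
top_device = accounts["primary_device"].mode()[0]                # Q3: most frequent value
avg_profiles = accounts["n_users_on_acct"].mean()                # Q4
profiles_by_gender = accounts.groupby("gender")["n_users_on_acct"].mean()  # Q5
churn_by_version = accounts.groupby("app_version")["churned"].mean()       # Q6: churn rate

print(n_unique_users, top_device, avg_profiles)
print(gender_share, profiles_by_gender, churn_by_version, sep="\n")
```

Taking the mean of a 0/1 column gives the proportion of ones, which is why `groupby(...).mean()` yields a churn rate per app version.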


## Answer each of the following questions about exercise-logs data, adding your answer under the question

1. What percent of the users in this cohort had exercise events logged?
2. What is the average satisfaction score?
3. What is the average duration of an event log?
4. Which app version had the longer average event duration?
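
One possible way to approach these, sketched on invented data: the `duration_min` column is a hypothetical helper (not part of the dataset), and measuring duration in minutes is an arbitrary choice.

```python
import pandas as pd

# Made-up data; only the column names come from the Data Notes.
accounts = pd.DataFrame({"user_id": [1, 2, 3, 4]})
logs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "app_version": ["v1", "v1", "v2"],
    "start_timestamp": pd.to_datetime(
        ["2019-01-01 08:00", "2019-01-02 08:00", "2019-01-01 09:00"]),
    "end_timestamp": pd.to_datetime(
        ["2019-01-01 08:30", "2019-01-02 08:50", "2019-01-01 10:00"]),
    "satisfaction_score": [7.0, 8.0, 6.0],
})

# Q1: share of account holders who appear at least once in the logs.
pct_with_logs = accounts["user_id"].isin(logs["user_id"]).mean() * 100

# Q2: mean() skips nulls by default.
avg_satisfaction = logs["satisfaction_score"].mean()

# Q3: subtracting timestamps gives a timedelta; convert it to minutes.
logs["duration_min"] = (
    logs["end_timestamp"] - logs["start_timestamp"]
).dt.total_seconds() / 60
avg_duration = logs["duration_min"].mean()

# Q4: average duration per app version.
duration_by_version = logs.groupby("app_version")["duration_min"].mean()

print(pct_with_logs, avg_satisfaction, avg_duration)
print(duration_by_version)
```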

# Basic Plotting

1. Plot the distribution of event log durations
2. Create a `countplot` of unique exercise-logging events by gender
3. Use one of seaborn's categorical plots (e.g. violin) to look at primary device type vs. average event duration
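
A minimal sketch of the three plots, assuming the logs have already been merged with the accounts data so that each event row carries `gender` and `primary_device`. The table below is invented and `duration_min` is a hypothetical helper column. The `matplotlib.use("Agg")` line is only there so the snippet runs without a display; drop it inside the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up merged event table: one row per logged exercise event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 5, 6],
    "gender": ["F", "F", "M", "F", "F", "M", "F", "M"],
    "primary_device": ["phone", "phone", "watch", "watch",
                       "watch", "phone", "phone", "watch"],
    "duration_min": [30.0, 45.0, 20.0, 60.0, 50.0, 25.0, 40.0, 35.0],
})

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# 1. Distribution of event durations.
axes[0].hist(events["duration_min"], bins=5)
axes[0].set_xlabel("duration (min)")

# 2. Number of logged events per gender.
sns.countplot(data=events, x="gender", ax=axes[1])

# 3. Average duration per user, compared across device types.
per_user = events.groupby(
    ["user_id", "primary_device"], as_index=False
)["duration_min"].mean()
sns.violinplot(data=per_user, x="primary_device", y="duration_min", ax=axes[2])

fig.tight_layout()
```

Averaging per user before plotting means each user contributes one value to the violin, rather than heavy loggers dominating the shape.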

# Challenge Questions

**Easier-challenge questions**
1. Create a bar plot of the average event duration by gender 
2. Add a "hue" to your plot from (2) to break it out further by primary device type
3. What is the interquartile range of exercise log duration per primary device type?
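
For the interquartile range, one approach is to take the difference between the 75th and 25th percentiles within each group. The sketch below uses invented durations, and `duration_min` is again a hypothetical helper column.

```python
import pandas as pd

# Made-up durations for two device types.
events = pd.DataFrame({
    "primary_device": ["phone"] * 5 + ["watch"] * 5,
    "duration_min": [10, 20, 30, 40, 50, 5, 10, 15, 20, 25],
})

grouped = events.groupby("primary_device")["duration_min"]

# IQR = Q3 - Q1, computed separately for each device type.
iqr_by_device = grouped.quantile(0.75) - grouped.quantile(0.25)

print(iqr_by_device)
```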


**Harder-challenge questions**
4. Plot the distribution of event log durations broken out by gender (on the same plot or figure)
5. Is there a statistically significant difference between males & females in terms of % churned?
6. Are there any outliers in the exercise-log data? What makes them outliers?
7. Is there a correlational/statistical relationship between exercise-log duration and:
    - satisfaction score?
    - churn?