# Fitbit Data Exploration: Making a prediction for the two weeks of missing data

By: Norrick McGee and Eric Escalante  
January 19, 2019  

In this Jupyter Notebook, we use time series methods and Fitbit data collected from 04/26/2018 to 12/06/2018 to predict two weeks' worth of missing data; we will save our predicted data in a separate CSV file.

## Imports
**Import the necessary packages and their use cases for this project:**
> pandas: data frames and data manipulation  
> numpy: summary statistics  
> matplotlib: used for visualizations  
> seaborn: fancy visualizations  
> datetime: turn the dates into datetime objects / get day of week  
> warnings: used to ignore Python warnings

In [60]:
from acquire import acquire_fitbit

import numpy as np
import pandas as pd

import os
from datetime import datetime
import itertools
 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## Table of contents
1. [Project Planning](#project-planning)
1. [Acquisition](#acquisition)
1. [Preparation](#preparation)
1. [Exploration](#exploration)
1. [Modeling](#modeling)

## Project Planning <a name="project-planning"></a>

### Goals  
> Goals for the Project are:  
1. We will predict the missing two weeks of Fitbit data using different time series methodologies
2. We will show the reasoning behind our predictions with visualizations and statistical findings
3. Finally, we will create a viewable CSV file with those predictions so that you can see the entire dataset
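
As a rough illustration of the first and third goals, the sketch below builds three naive baseline forecasts (last value, simple average, moving average) for a 14-day horizon. The `history` series is made-up data standing in for one Fitbit column, and the column names and the output file name are assumptions for illustration only, not the models this project settles on.

```python
import pandas as pd

# Hypothetical daily values standing in for one Fitbit column (e.g. steps).
history = pd.Series(
    [8000.0, 9500.0, 7000.0, 10000.0, 8500.0, 9000.0, 11000.0],
    index=pd.date_range("2018-11-30", periods=7, freq="D"),
)

# The 14 days immediately after the last observed day.
horizon = pd.date_range(
    history.index[-1] + pd.Timedelta(days=1), periods=14, freq="D"
)

# Each baseline is a single number repeated across the whole horizon.
forecasts = pd.DataFrame(
    {
        "last_value": history.iloc[-1],               # repeat the final observation
        "simple_average": history.mean(),             # mean of all observed days
        "moving_average_3d": history.tail(3).mean(),  # mean of the last 3 days
    },
    index=horizon,
)
forecasts.index.name = "date"

# Writing the predictions out would look like this (file name is hypothetical):
# forecasts.to_csv("predictions.csv")
print(forecasts.head())
```

Baselines like these give a floor that any more elaborate time series model should beat before it is trusted.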

### Deliverables

**_TODO:_**
> Summarization of the data:

### Data Dictionary & Domain Knowledge

Our dataset has sixteen columns. Here is what each one means.

> **activity_calories:** amount of calories burned throughout the day   
> **calories:** amount of calories consumed throughout the day  
> **carbs:** amount of carbs consumed throughout the day  
> **distance:** distance traveled  
> **fat:** amount of fat consumed throughout the day  
> **fiber:** amount of fiber consumed throughout the day  
> **floors:** floors climbed  
> **mins_active_high:** minutes of high activity in the day   
> **mins_active_light:** minutes of light activity in the day  
> **mins_active_med:** minutes of moderate activity in the day  
> **mins_sedentary:** minutes the Fitbit assumes you are sitting down  
> **protien:** amount of protein consumed throughout the day  
> **sodium:** amount of sodium consumed throughout the day  
> **steps:** total amount of steps taken in a day  
> **total_calories:** total calories consumed throughout the day  
> **water:** tracked water intake (or a set consumption goal)

### Hypotheses

**_TODO:_**  
> Hypotheses:

### Thoughts & Questions

**_TODO:_**
> Thoughts:  
> Questions:

## Prepare the Environment

## Preparation <a name="preparation"></a>

In [51]:
df = pd.concat(acquire_fitbit())

In [58]:
def nulls_by_col(df):
    '''
    Function used to find missing information in each column
    '''
    num_missing = df.isnull().sum()
    rows = df.shape[0]
    pct_missing = num_missing/rows
    cols_missing = pd.DataFrame({'num_rows_missing': num_missing, 'pct_rows_missing': pct_missing})
    return cols_missing

def nulls_by_row(df):
    '''
    Function used to find missing information in each row
    '''
    num_cols_missing = df.isnull().sum(axis=1)
    pct_cols_missing = df.isnull().sum(axis=1)/df.shape[1]*100
    rows_missing = pd.DataFrame({'num_cols_missing': num_cols_missing, 'pct_cols_missing': pct_cols_missing})\
                     .reset_index().groupby(['num_cols_missing','pct_cols_missing']).count()\
                     .rename(index=str, columns={'index': 'num_rows'}).reset_index()
    return rows_missing

def df_summary(df):
    '''
    Function summarizes our created data frame with information on: shape, type & null info, describe function from pandas library, 
    null values by column, null values by row, and value counts
    '''
    print('--- Shape: {}'.format(df.shape))
    print('--- Info')
    df.info()
    print('--- Descriptions')
    print(df.describe(include='all'))
    print('--- Nulls By Column')
    print(nulls_by_col(df))
    print('--- Nulls By Row')
    print(nulls_by_row(df))
    print('--- Value Counts')

In [53]:
df_summary(df)

--- Shape: (711, 16)
--- Info
<class 'pandas.core.frame.DataFrame'>
Index: 711 entries, 2018-04-26 to 20181228
Data columns (total 16 columns):
activity_calories    225 non-null object
calories             486 non-null object
carbs                239 non-null object
distance             225 non-null object
fat                  239 non-null object
fiber                239 non-null object
floors               225 non-null object
mins_active_high     225 non-null object
mins_active_light    225 non-null object
mins_active_med      225 non-null object
mins_sedentary       225 non-null object
protien              239 non-null object
sodium               239 non-null object
steps                225 non-null object
total_calories       225 non-null object
water                239 non-null object
dtypes: object(16)
memory usage: 94.4+ KB
--- Descriptions
       activity_calories calories carbs distance  fat fiber floors  \
count                225      486   239      225  239   239    225   

### Handle Missing Values

**_TODO:_**
> How are we going to handle all the missing values?
1. One option is to fill the gaps with zero, on the reasoning that the person simply forgot to log the information
2. Another is to use the column average: if these people are too busy for side conversations, they may well eat and do the same things each day
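
To compare the two options, here is a minimal sketch on a made-up `water` series with two unlogged days. It assumes the column has already been converted to a numeric dtype, since the summary above shows every column as `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two unlogged days.
water = pd.Series([64.0, np.nan, 32.0, np.nan, 48.0], name="water")

# Option 1: treat unlogged days as zero intake.
zero_filled = water.fillna(0)

# Option 2: assume an unlogged day looked like a typical logged day.
mean_filled = water.fillna(water.mean())

# Zero-filling pulls the column mean down; mean-filling leaves it unchanged.
print(zero_filled.mean(), mean_filled.mean())
```

The choice matters for modeling: zero-filling changes the level of the series, while mean-filling keeps the level but flattens day-to-day variation.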

### Handle Duplicates

**_TODO:_**
> Do we have duplicated data?
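
One way to check is sketched below. The summary above lists index labels in two formats (`2018-04-26` and `20181228`), so this example assumes the same day can appear once under each format; that is a guess about the data, not something the output confirms. The `raw` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: the same day appears under two date formats.
raw = pd.DataFrame(
    {"steps": [1000, 1000, 2500]},
    index=["2018-12-28", "20181228", "2018-12-29"],
)

# Normalize both formats to YYYYMMDD, then parse into real datetimes.
normalized = raw.index.str.replace("-", "", regex=False)
raw.index = pd.to_datetime(normalized, format="%Y%m%d")

# Flag every repeat of a date after its first occurrence.
dupes = raw.index.duplicated(keep="first")
deduped = raw[~dupes]

print(dupes.sum(), "duplicated dates")
```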

### Fix Data Types

**_TODO:_**
> Need to decide which columns we want to change from an object to an int or float
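
A minimal sketch of the conversion follows, assuming the values are stored as strings and that some contain thousands separators such as `"2,635"` (an assumption about the raw export). The helper name `to_numeric_columns` and the sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical object-typed columns like the ones in the summary above.
raw = pd.DataFrame(
    {"steps": ["2,635", "10,000", None], "distance": ["1.25", "4.8", None]}
)

def to_numeric_columns(df, cols):
    """Return a copy of df with the given string columns converted to numbers."""
    out = df.copy()
    for col in cols:
        # Drop thousands separators; missing entries stay missing.
        cleaned = out[col].str.replace(",", "", regex=False)
        # Anything that still cannot be parsed becomes NaN instead of raising.
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out

converted = to_numeric_columns(raw, ["steps", "distance"])
print(converted.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on stray text, at the cost of silently turning bad values into missing ones, so the null counts are worth re-checking afterwards.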

### Handle Outliers

**_TODO:_**
> Personally do not want to remove outliers here
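
If outliers are kept, they can still be flagged for later inspection. The sketch below uses the common 1.5 × IQR rule on made-up step counts; the rule and the data are illustrative assumptions, not a decision made in this notebook.

```python
import pandas as pd

# Hypothetical daily step counts with one unusually active day.
steps = pd.Series([8000, 8500, 9000, 9500, 10000, 30000], name="steps")

# Interquartile range of the observed values.
q1, q3 = steps.quantile([0.25, 0.75])
iqr = q3 - q1

# Flag values outside the 1.5 * IQR fences without dropping any rows.
is_outlier = (steps < q1 - 1.5 * iqr) | (steps > q3 + 1.5 * iqr)

print(steps[is_outlier])
```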

### Re-Check Missing Values

In [64]:
df_summary(df)

--- Shape: (711, 16)
--- Info
<class 'pandas.core.frame.DataFrame'>
Index: 711 entries, 2018-04-26 to 20181228
Data columns (total 16 columns):
activity_calories    225 non-null object
calories             486 non-null object
carbs                239 non-null object
distance             225 non-null object
fat                  239 non-null object
fiber                239 non-null object
floors               225 non-null object
mins_active_high     225 non-null object
mins_active_light    225 non-null object
mins_active_med      225 non-null object
mins_sedentary       225 non-null object
protien              239 non-null object
sodium               239 non-null object
steps                225 non-null object
total_calories       225 non-null object
water                239 non-null object
dtypes: object(16)
memory usage: 94.4+ KB
--- Descriptions
       activity_calories calories carbs distance  fat fiber floors  \
count                225      486   239      225  239   239    225   

## Exploration  <a name="exploration"></a>

In [61]:
def summary_stat(df, col_name):
    '''
    Function to provide mean, median, min, and max of data in column -> returns a new small dataframe with the info
    '''
    # Build the result under its own name so the input dataframe is not shadowed
    stats = pd.DataFrame({'Mean' : str(round(np.mean(df[col_name]),2)),
         'Median' : str(np.median(df[col_name])),
         'Min' : str(np.min(df[col_name])),
         'Max' : str(np.max(df[col_name]))}, index=[0])
    return stats

def bin_feature(df, col, newcol, bin_cuts=[]):
    '''
    Function we will use to bin different columns during our exploration -> returns the dataframe with the column binned 
    '''
    labs = list(range(len(bin_cuts)))[1:]
    df[newcol] = pd.cut(df[col], bin_cuts, labels=labs, include_lowest=False)
    return df
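
The short example below shows how the labels in `bin_feature` line up with the cut points. It calls `pd.cut` directly on made-up step counts so that it runs on its own; the cut points are hypothetical.

```python
import pandas as pd

# Hypothetical step counts and cut points.
steps = pd.DataFrame({"steps": [0, 2500, 5000, 7500, 15000]})
bin_cuts = [0, 5000, 10000, 20000]

# Four edges define three bins, labeled 1..3 as in bin_feature.
labels = list(range(1, len(bin_cuts)))

# Bins are open on the left: (0, 5000], (5000, 10000], (10000, 20000].
steps["steps_bin"] = pd.cut(
    steps["steps"], bin_cuts, labels=labels, include_lowest=False
)

print(steps)
```

Because `include_lowest=False`, a value equal to the first cut point (here 0) falls outside every bin and comes back as missing, which matters for days with zero recorded activity.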

### Train-Test Split

### Visualizations

### Statistical Tests

In [62]:
def calc_r2(Actual_Y, Estimated_Y):
    '''
    Function that calculates R squared
    '''
    return float(1 - sum((Actual_Y-Estimated_Y)**2 ) / sum((Actual_Y-Actual_Y.mean(axis=0))**2))

def calc_rmse(Actual_Y, Estimated_Y):
    '''
    Function that calculates Root Mean Squared Error
    '''
    # np.sqrt is used because numpy is imported above, while math is not
    return float(np.sqrt(sum((Actual_Y-Estimated_Y)**2) / Actual_Y.shape[0]))
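
As a sanity check on the two formulas, here is a small worked example on made-up values. It recomputes R squared and RMSE inline instead of calling the functions above, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical actual and estimated values.
actual = pd.Series([3.0, 5.0, 7.0, 9.0])
estimated = pd.Series([2.0, 5.0, 8.0, 9.0])

# Sum of squared errors: 1 + 0 + 1 + 0 = 2
sse = ((actual - estimated) ** 2).sum()

# Total sum of squares around the mean (6): 9 + 1 + 1 + 9 = 20
sst = ((actual - actual.mean()) ** 2).sum()

r2 = 1 - sse / sst                 # 1 - 2/20 = 0.9
rmse = np.sqrt(sse / len(actual))  # sqrt(2/4), roughly 0.707

print(round(r2, 3), round(rmse, 3))
```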

### Summarize Conclusions

## Modeling <a name="modeling"></a>

### Feature Engineering & Selection

In [63]:
def new_features(df):
    '''
    Function used to create new features -> returns the original dataframe with new features added
    '''
    return df

### Train & Test Models

### Summarize Conclusions