# EDA - Lift Opening Times
Start Date: 2021.10.25<br>
William Matthews

### Report Objectives

This report has two primary purposes. The first is to load our cleaned dataset from PowDay and confirm that our data cleaning process accomplished what we wanted. The second is to complete EDA and a time-series analysis to determine the following:
- The balance of our target data set
- Basic insights into our target data set
- Determine if our time series is stationary or not and transform as needed
- Explore autocorrelation for lifts within a season and across seasons
- Explore partial autocorrelation for lifts within a season and across seasons
- Calculate sample entropy to determine how hard of a problem we have signed up for
- Perform Granger Causality test to determine if any one lift is useful in predicting if another lift will open.

### Load and Confirm Data

In [19]:
# imports for data management
import pandas as pd
import datetime

In [3]:
# load data
lift_df = pd.read_csv('./Data/PowDayHistoryClean.csv')

In [7]:
# first looks; info() prints its summary first and returns None, which display() then shows last
display(lift_df.shape, lift_df.head(), lift_df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26264 entries, 0 to 26263
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   time        26264 non-null  object
 1   lift        26264 non-null  object
 2   day_status  26264 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 615.7+ KB


(26264, 3)

| | time | lift | day_status |
|---|---|---|---|
| 0 | 2014-12-31 00:00:00 | Whistler Village Gondola | 0 |
| 1 | 2014-12-31 00:00:00 | Jersey Cream | 0 |
| 2 | 2014-12-31 00:00:00 | Crystal | 0 |
| 3 | 2014-12-31 00:00:00 | 7th Heaven | 0 |
| 4 | 2014-12-31 00:00:00 | Glacier | 0 |


None

The number of rows and columns match the last check before creating the csv file, so that is great!  It looks like `time` needs to be transformed back into a `datetime` object.  Let's do that first.

In [8]:
# transform to datetime object
lift_df['time'] = pd.to_datetime(lift_df['time'])

# confirm it worked
lift_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26264 entries, 0 to 26263
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   time        26264 non-null  datetime64[ns]
 1   lift        26264 non-null  object        
 2   day_status  26264 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 615.7+ KB


Our transformation above worked. Let's use it to confirm the starting and ending dates of the data set.

In [10]:
lift_df['time'].agg(['min', 'max'])

min   2014-12-31 00:00:00
max   2021-03-29 10:30:14
Name: time, dtype: datetime64[ns]

The above dates match what we expected to see, that our data set ranges from the last day of 2014 through the end of the 2020/2021 season.

It looks like our data is all as we expected, so let's move on to EDA.
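
If we wanted to make these confirmation checks repeatable, a minimal sketch might look like the following. The helper name `check_lift_records` and the toy frame are hypothetical, and the "one record per lift per day" rule is an assumption about what the cleaning step was meant to guarantee, not something confirmed above.

```python
import pandas as pd

def check_lift_records(df):
    """Hypothetical helper: run a few simple sanity checks on a lift-status frame."""
    # calendar day of each record, ignoring the time of day
    day = df["time"].dt.normalize()
    return {
        # day_status should only ever be 0 or 1
        "binary_status": bool(df["day_status"].isin([0, 1]).all()),
        # no missing values in any column
        "no_nulls": bool(df.notna().all().all()),
        # assumption: at most one record per lift per calendar day
        "one_record_per_lift_day": not df.assign(day=day)
                                         .duplicated(subset=["lift", "day"])
                                         .any(),
    }

# toy data standing in for lift_df
toy_df = pd.DataFrame({
    "time": pd.to_datetime(["2014-12-31", "2014-12-31", "2015-01-01"]),
    "lift": ["Crystal", "Glacier", "Crystal"],
    "day_status": [0, 1, 1],
})

checks = check_lift_records(toy_df)
print(checks)
```

Calling `check_lift_records(lift_df)` on the real frame would flag any check that comes back `False` before moving on.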

### EDA - Determining Balance of Target Data

The goal of the model is to predict if a given lift will open on a given day. The user of the output is going to be interested in both cases being predicted accurately. That means the only metric of real interest to us is the accuracy of the model. We are going to establish a baseline against which this accuracy will be judged by calculating the average probability that a lift is open/closed on any given day.

In [17]:
# group by chair and get average of open/closed days
# (select day_status before aggregating so the time column is not averaged too)
mean_prob_df = lift_df.groupby('lift')[['day_status']].mean()

# rename day_status column for readability
mean_prob_df.rename(columns={'day_status': 'Open'}, inplace = True)

# create inverse probability
mean_prob_df['Closed'] = (1 - mean_prob_df['Open'])

# check it worked and rank by Open
mean_prob_df.sort_values('Open', ascending = False)

| lift | Open | Closed |
|---|---|---|
| Emerald | 0.962687 | 0.037313 |
| Red Chair | 0.962687 | 0.037313 |
| Whistler Village Gondola (Lower) | 0.957356 | 0.042644 |
| Creekside Gondola | 0.955224 | 0.044776 |
| Whistler Village Gondola | 0.954158 | 0.045842 |
| Jersey Cream | 0.943497 | 0.056503 |
| Excalibur Gondola (Lower) | 0.938166 | 0.061834 |
| Excelerator | 0.938166 | 0.061834 |
| Excalibur Gondola | 0.934968 | 0.065032 |
| Catskinner | 0.928571 | 0.071429 |


The above looks like we expected based on our experience skiing Whistler/Blackcomb. From it we can draw the following insights:
- For the mountain access gondolas and the mid-mountain lifts, the threshold for providing accurate predictions is incredibly high. If we cannot achieve near 100% accuracy in predictions for these lifts, we will not be providing any information of value.
- The alpine lifts (`7th Heaven, Glacier, Harmony, Peak, Symphony`) are the primary lifts of interest for prediction purposes. As a group they are open between 67% and 78% of the time. That is a much more manageable target. We will still have to be careful in the modelling stage, though, as models are generally easier to train on evenly balanced classes.
- `Franz's` lift is a mid-mountain lift with openings/closings that have been a complete mystery to just about everyone. Most people don't even understand why it was built. There is a chance we may drop it from our target set if its openings/closings are too difficult to predict. This would not be an issue as no one ever really rides it!
- The 'ski areas' in our list each have an opening probability lower than the lifts that service them (`Glacier/Spanky's, Showcase/Blackcomb Glacier, Symphony/Flute Bowl`) as expected.
- `Symphony` has a lower opening frequency than `Harmony` as expected. `Harmony` is the access route to `Symphony`. This means the probability of `Harmony` opening should be part of the predictor set used to determine if `Symphony` will open.
- `Emerald` and `Red Chair` have the highest opening percentages. This might seem counterintuitive since they are mid-mountain lifts and you need the mountain access gondolas to be open in order to access them. The key is that when at least one of `Whistler Village Gondola` or `Creekside Gondola`, but not necessarily both, is open you are able to access `Emerald` and `Red Chair`. The conclusion we can draw from `Emerald` and `Red Chair` having higher opening percentages is that they are (just) more reliable from a mechanical standpoint.
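
The `Harmony`/`Symphony` dependency noted in the list above could be checked directly by comparing how often `Symphony` opens on days when `Harmony` is open versus closed. The following is only a sketch on made-up toy records (the dates and statuses are invented for illustration); with the real data, `toy_df` would be replaced by `lift_df`.

```python
import pandas as pd

# toy records: four days, two lifts per day (values invented for illustration)
toy_df = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01"] * 2 + ["2021-01-02"] * 2 +
                           ["2021-01-03"] * 2 + ["2021-01-04"] * 2),
    "lift": ["Harmony", "Symphony"] * 4,
    "day_status": [1, 1, 1, 0, 0, 0, 1, 1],
})

# calendar day of each record, ignoring the time of day
toy_df["day"] = toy_df["time"].dt.normalize()

# reshape to one row per day and one column per lift
daily = toy_df.pivot_table(index="day", columns="lift",
                           values="day_status", aggfunc="max")

# share of days Symphony was open, split by whether Harmony was open (1) or closed (0)
symphony_given_harmony = daily.groupby("Harmony")["Symphony"].mean()
print(symphony_given_harmony)
```

A large gap between the two group means would support including `Harmony` as a predictor for `Symphony`.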

To summarize, when we get to the modelling stage, we will have to use additional techniques to ensure the imbalance in our target data does not adversely affect our model predictions.
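
To make the baseline and the imbalance concrete, here is a small sketch using two open rates copied from the table above. The baseline is the accuracy of a naive model that always predicts the more common class. The class weights use the common "balanced" heuristic, `n_samples / (n_classes * n_samples_in_class)`; this is one possible rebalancing technique, not necessarily the one we will end up using.

```python
import pandas as pd

# open rates copied from the table above for two lifts
open_rate = pd.Series({"Emerald": 0.962687, "Catskinner": 0.928571}, name="Open")

# accuracy of a naive model that always predicts the more common class
baseline_accuracy = open_rate.where(open_rate >= 0.5, 1 - open_rate)

# 'balanced' class weights expressed through class frequencies:
# n_samples / (n_classes * n_samples_in_class) == 1 / (2 * class_frequency)
class_weights = pd.DataFrame({
    "open": 1 / (2 * open_rate),
    "closed": 1 / (2 * (1 - open_rate)),
})

print(baseline_accuracy)
print(class_weights.round(2))
```

Any model for these lifts has to beat the baseline accuracy to be worth using, and the much larger weight on the closed class shows how rare closed days are.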

### EDA - Basic Insights

Let's start taking a look at our data on a season-by-season basis. The first thing to do is define our seasons. Stealing from our data cleaning session:

In [21]:
# list to store tuples of (open, close)
seasons = []

# add all open close dates
seasons.append((datetime.datetime(2014, 12, 31), datetime.datetime(2015, 4, 17, 23, 59, 59)))
seasons.append((datetime.datetime(2015, 11, 19), datetime.datetime(2016, 4, 17, 23, 59, 59)))
seasons.append((datetime.datetime(2016, 11, 23), datetime.datetime(2017, 4, 17, 23, 59, 59)))
seasons.append((datetime.datetime(2017, 11, 17), datetime.datetime(2018, 4, 17, 23, 59, 59)))
seasons.append((datetime.datetime(2018, 11, 22), datetime.datetime(2019, 4, 17, 23, 59, 59)))
seasons.append((datetime.datetime(2019, 11, 26), datetime.datetime(2020, 3, 14, 23, 59, 59)))
seasons.append((datetime.datetime(2020, 11, 26), datetime.datetime(2021, 3, 29, 23, 59, 59)))
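
One way a list of `(open, close)` tuples like this could be used is to tag every record with the season it falls in, so that later steps can simply group by season. The helper `label_seasons` and the toy frame below are hypothetical and only sketch the idea; records outside every season are left unlabelled.

```python
import datetime
import pandas as pd

def label_seasons(df, season_dates):
    """Hypothetical helper: return a Series of season labels aligned with df."""
    labels = pd.Series(None, index=df.index, dtype="object")
    for opened, closed in season_dates:
        # between() is inclusive on both ends by default
        in_season = df["time"].between(opened, closed)
        labels[in_season] = f"{opened.year}/{closed.year}"
    return labels

# first two seasons from the list above
toy_seasons = [
    (datetime.datetime(2014, 12, 31), datetime.datetime(2015, 4, 17, 23, 59, 59)),
    (datetime.datetime(2015, 11, 19), datetime.datetime(2016, 4, 17, 23, 59, 59)),
]

# toy records: the middle one falls in the off-season
toy_df = pd.DataFrame({
    "time": pd.to_datetime(["2015-01-10", "2015-07-01", "2015-12-25"]),
    "lift": ["Crystal"] * 3,
    "day_status": [1, 0, 1],
})

toy_df["season"] = label_seasons(toy_df, toy_seasons)
print(toy_df)
```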

Let's get a look at the open/closed frequencies for each chair by season.

# Left Off Here - Log 1.5 hrs to this point

In [42]:
def open_closed_lift_days(df, open_close):
    """
    Takes a data frame of lift records and a list of season dates. Returns how many
    days each chair lift was open and closed in each season.
    _________________

    Parameters:
                df: pandas dataFrame object, columns[time, lift, day_status]
                open_close: a list of 2-tuples storing datetime objects in the form (opening day, closing day)
    ________________

    Returns:
           pandas dataframe object with lifts as rows, seasons as columns with sub columns for
           open/closed, and days open/closed as values

    """
    # dict to store {lift: {(season, 'open'/'closed'): num_days}}
    lift_dict = {}

    # for each chair
    for lift in df['lift'].unique():

        # dict to store {(season, 'open'/'closed'): num_days}
        season_dict = {}

        # for each season - utilize season dates stored earlier
        for opened, closed in open_close:

            # records for this lift that fall inside the season
            season_records = df[(df['lift'] == lift) &
                                (df['time'] >= opened) &
                                (df['time'] <= closed)]

            # assumes one record per lift per day, with day_status 1 = open and 0 = closed
            days_open = int(season_records['day_status'].sum())
            days_closed = season_records.shape[0] - days_open

            # tuple keys become (season, open/closed) sub columns in the final frame
            season_label = f"{opened.year}/{closed.year}"
            season_dict[(season_label, 'open')] = days_open
            season_dict[(season_label, 'closed')] = days_closed
        lift_dict[lift] = season_dict

    # create and return dataFrame with lifts as rows
    return pd.DataFrame(data = lift_dict).T

Next let's see if there are any major patterns in a given lift's opening/closing frequency. (To do: heat map for each chair by season.)
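
As a rough sketch of what could feed such a heat map, the open rate per lift and season can be arranged as a matrix with lifts as rows and seasons as columns. The toy values and the pre-computed `season` column below are assumptions for illustration; the plotting step itself is left open.

```python
import pandas as pd

# toy records with a hypothetical, pre-computed season label
toy_df = pd.DataFrame({
    "lift": ["Harmony"] * 4 + ["Peak"] * 4,
    "season": ["2018/2019", "2018/2019", "2019/2020", "2019/2020"] * 2,
    "day_status": [1, 1, 1, 0, 1, 0, 0, 0],
})

# rows = lifts, columns = seasons, values = share of recorded days the lift was open
open_rate_matrix = toy_df.pivot_table(index="lift", columns="season",
                                      values="day_status", aggfunc="mean")
print(open_rate_matrix)
```

Each cell of `open_rate_matrix` would become one coloured tile in the heat map.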