## A notebook for empirical work on the `libData.csv` peer effects data

The first cell sets up the notebook: it imports numpy, pandas, seaborn, datetime, and matplotlib, and configures the plotting style.

In [1]:
# Run this cell to set up the notebook.

# These lines import the numpy, pandas, seaborn, datetime, and matplotlib modules.
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
import matplotlib
import matplotlib.pyplot as plt

# Importing plotting libraries and styles
%matplotlib inline
plt.style.use('fivethirtyeight')

# Suppress FutureWarning messages (e.g. those raised by pandas)
import warnings
warnings.simplefilter('ignore', FutureWarning)

The function below computes summary statistics.

**Format:** Person, Table, Library, Arrival (time), Departure (time), Num_Breaks, Break_Start (list of times), Break_End (list of times), fromStart, tillEnd, Neighbours

Todo (see the use of `.describe` in Econ 141 PSET 2): transform the dataframe and call `.describe` on the result.
* Average duration of stay (percentiles as well) DONE
* Average number of breaks DONE
* Average duration of break DONE
* Average work-time between breaks (frequency of breaks) todo: compute on paper DONE
* Average number of neighbors DONE
* Number of people per library DONE
* Number of people there fromStart DONE
* Number of people there tillEnd DONE
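
As a minimal sketch of the `.describe` approach noted in the todo, the snippet below summarises a toy dataframe. The column values are made up for illustration and are not rows from the data; the `percentiles` argument is optional and only shows how the percentile items in the list could be covered.

```python
import pandas as pd

# Hypothetical values (illustrative only): durations of stay in seconds
# and number of breaks for four people
toy = pd.DataFrame({
    "duration": [60.0, 120.0, 180.0, 240.0],
    "num_breaks": [0, 1, 1, 2],
})

# describe() reports count, mean, std, min, max and the requested
# percentiles for every numeric column in one call
summary = toy.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

Counts such as the number of people per library are not produced by `.describe` and still need to be computed separately.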

In [109]:
libData = pd.read_csv('libDataTest.csv')
libData

| | Person | Table | Library | Arrival | Departure | Num_Breaks | Break_Start | Break_End | fromStart | tillEnd | Neighbours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | A | Moffit | 2019-03-09 23:02:11.492717 | 2019-03-09 23:02:53.851422 | 0 | [] | [] | 0 | 0 | [5] |
| 1 | 5 | A | Moffit | 2019-03-09 23:02:11.492950 | 2019-03-10 01:30:38.018536 | 1 | [datetime.datetime(2019, 3, 9, 23, 2, 11, 6495... | [datetime.datetime(2019, 3, 9, 23, 2, 49, 8223... | 0 | 1 | [2] |
| 2 | 1 | B | Stacks | 2019-03-09 23:02:13.029750 | 2019-03-09 23:03:26.732945 | 1 | [datetime.datetime(2019, 3, 9, 23, 2, 14, 4168... | [datetime.datetime(2019, 3, 9, 23, 3, 26, 7325... | 1 | 0 | [3] |
| 3 | 3 | B | Stacks | 2019-03-09 23:02:13.726172 | 2019-03-10 01:30:38.018536 | 2 | [datetime.datetime(2019, 3, 9, 23, 2, 14, 4169... | [datetime.datetime(2019, 3, 9, 23, 3, 26, 7328... | 0 | 1 | [1] |
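
The list-valued columns (`Break_Start`, `Break_End`, `Neighbours`) come back from `pd.read_csv` as plain strings, so they have to be parsed before use. The sketch below shows one way to do that; the two cell strings are hypothetical values in the same format, not entries copied from the table.

```python
import ast
import datetime

# Hypothetical cell contents in the same string format as the CSV columns
neighbours_cell = "[5]"
break_start_cell = "[datetime.datetime(2019, 1, 1, 12, 0, 0, 500000)]"

# A list of plain literals can be parsed safely with ast.literal_eval
neighbours = ast.literal_eval(neighbours_cell)

# datetime.datetime(...) is a call, not a literal, so literal_eval rejects it
try:
    ast.literal_eval(break_start_cell)
    literal_eval_worked = True
except ValueError:
    literal_eval_worked = False

# eval can parse it, here with a namespace restricted to the datetime module.
# eval executes arbitrary code, so it should only be used on trusted data.
break_starts = eval(break_start_cell, {"__builtins__": {}, "datetime": datetime})

print(neighbours, break_starts, literal_eval_worked)
```

This is why the function further down needs the `datetime` module to be imported under that exact name before it calls `eval` on the break columns.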


In [110]:
# This function takes in the libData.csv dataframe and outputs some summary statistics about the data.

def summaryStats(libData):
    libStats = pd.DataFrame()
    # Parse the arrival and departure timestamp strings
    departureDatetime = pd.to_datetime(libData["Departure"], format="%Y-%m-%d %H:%M:%S.%f")
    arrivalDatetime = pd.to_datetime(libData["Arrival"], format="%Y-%m-%d %H:%M:%S.%f")
    # Duration of stay in seconds
    duration = (departureDatetime - arrivalDatetime).dt.total_seconds()
    libStats['duration'] = duration
    libStats['num_breaks'] = libData["Num_Breaks"]
    libStats['fromStart'] = libData["fromStart"]
    libStats['tillEnd'] = libData["tillEnd"]
    libStats['num_neighbors'] = libData["Neighbours"].apply(eval).apply(len)
    
    # Parse the stringified lists of break start/end datetimes.
    # eval relies on the datetime module imported above; only use it on trusted data.
    breakEnds = libData["Break_End"].apply(eval)
    breakStarts = libData["Break_Start"].apply(eval)
    # Number of completed breaks for each person
    maxBreaks = breakEnds.apply(len)
    
    # Array of average durations for each row which will be appended into libStats
    breakDurationArray = []
    
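    for rowIndex in range(len(libData)):
        # Durations (in seconds) of each of this person's breaks
        avgDuration = []
        for breakIndex in range(maxBreaks[rowIndex]):
            breakDuration = breakEnds[rowIndex][breakIndex] - breakStarts[rowIndex][breakIndex]
            avgDuration.append(breakDuration.total_seconds())
        # A person with no breaks gets NaN explicitly, which avoids
        # numpy's "Mean of empty slice" warning
        breakDurationArray.append(np.mean(avgDuration) if avgDuration else np.nan)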
    
    libStats['average_break_duration'] = breakDurationArray
    
    # Computing the average working time between breaks
    betweenBreakDuration = []
    # Number of break starts for each person
    maxStartBreaks = breakStarts.apply(len)
    
    for rowIndex in range(len(libData)):
        avgBetweenDuration = []
        
        # Including the time from arrival to breakStart
        if len(breakStarts[rowIndex])>0:
            avgBetweenDuration.append((breakStarts[rowIndex][0] - arrivalDatetime[rowIndex]).total_seconds())
        
        for j in range(1, maxStartBreaks[rowIndex]):
            # Adding the difference between new breakStart time and old breakEnd time
            avgBetweenDuration.append((breakStarts[rowIndex][j] - breakEnds[rowIndex][j-1]).total_seconds())
            
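        # NaN for people who took no breaks (avoids the mean of an empty list)
        betweenBreakDuration.append(np.mean(avgBetweenDuration) if avgBetweenDuration else np.nan)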
    
    libStats['time_between_breaks'] = betweenBreakDuration
    
    print("Number of people who were there at the start", np.count_nonzero(libData["fromStart"]))
    print("Number of people who stayed till the end", np.count_nonzero(libData["tillEnd"]))
    peoplePerLibrary = libData.groupby("Library").count()["Person"]
    
    for i in peoplePerLibrary.index:
        print("Number of people in "+ i + " is: " + str(peoplePerLibrary[i]))
    
    return libStats.describe()
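
To check the break arithmetic by hand, the sketch below works through a single made-up visit using the same rules as `summaryStats`: the first work spell runs from arrival to the first break start, and each later spell runs from the previous break end to the next break start. The timestamps are hypothetical. As in the function, the spell between the last break and departure is not counted.

```python
import datetime
import numpy as np

# Hypothetical visit: arrive at 10:00, breaks at 10:30-10:40 and 11:10-11:15
arrival = datetime.datetime(2019, 1, 1, 10, 0)
break_starts = [datetime.datetime(2019, 1, 1, 10, 30),
                datetime.datetime(2019, 1, 1, 11, 10)]
break_ends = [datetime.datetime(2019, 1, 1, 10, 40),
              datetime.datetime(2019, 1, 1, 11, 15)]

# Work spell before the first break: arrival -> first break start
work_spells = [(break_starts[0] - arrival).total_seconds()]
# Later work spells: previous break end -> next break start
for j in range(1, len(break_starts)):
    work_spells.append((break_starts[j] - break_ends[j - 1]).total_seconds())

# Break lengths: each break end minus the matching break start
break_lengths = [(end - start).total_seconds()
                 for start, end in zip(break_starts, break_ends)]

print(np.mean(work_spells))    # 1800.0 seconds: two spells of 30 minutes
print(np.mean(break_lengths))  # 450.0 seconds: breaks of 10 and 5 minutes
```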

In [111]:
summaryStats(libData)

Number of people who were there at the start 1
Number of people who stayed till the end 2
Number of people in Moffit is: 2
Number of people in Stacks is: 2


| | duration | num_breaks | fromStart | tillEnd | num_neighbors | average_break_duration | time_between_breaks |
|---|---|---|---|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 |
| mean | 4481.719963 | 1.0 | 0.25 | 0.5 | 1.0 | 49.50204 | 1478.016359 |
| std | 5108.052194 | 0.816497 | 0.5 | 0.57735 | 0.0 | 19.757317 | 2542.762071 |
| min | 42.358705 | 0.0 | 0.0 | 0.0 | 1.0 | 38.017648 | 1.387122 |
| 25% | 65.867072 | 0.75 | 0.0 | 0.0 | 1.0 | 38.095236 | 9.960337 |
| 50% | 4488.99778 | 1.0 | 0.0 | 0.5 | 1.0 | 38.172824 | 18.533552 |
| 75% | 8904.850669 | 1.25 | 0.25 | 1.0 | 1.0 | 55.244236 | 2216.330977 |
| max | 8906.525586 | 2.0 | 1.0 | 1.0 | 1.0 | 72.315649 | 4414.128403 |
