In [None]:
'''
Get going by asking the following questions and looking for the answers with some code and plots:
    Can you count something interesting?
    Can you find some trends (high, low, increase, decrease, anomalies)?
    Can you make a bar plot or a histogram?
    Can you compare two related quantities?
    Can you make a scatterplot?
    Can you make a time-series plot?

Having made these plots:
    What are some insights you get from them? 
    Do you see any correlations? 
    Is there a hypothesis you would like to investigate further? 
    What other questions do they lead you to ask?

By now you’ve asked a bunch of questions, and found some neat insights. 
    Is there an interesting narrative, a way of presenting the insights using text and plots from the above, 
        that tells a compelling story? 
    As you work out this story, what are some other trends/relationships you think will make it more complete?

'''

In [3]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from glob import glob


In [8]:
file_path_slug = '../../datasets/bayareabikeshare/*_station_data.csv'

# glob all station files; sort so the load order is deterministic
file_list = sorted(glob(file_path_slug))

stations = pd.DataFrame()

try:
    frames = []

    for file in file_list:
        print('Reading file \t ' + str(file))

        # read the file in chunks, then stitch the chunks into one DataFrame
        station_reader = pd.read_csv(file, chunksize=1000, iterator=True)
        frames.append(pd.concat(station_reader))

    # concat the per-file DataFrames into stations in a single pass
    stations = pd.concat(frames, ignore_index=True)

    print('data loaded successfully!')
except (OSError, ValueError) as err:
    # report what actually failed instead of hiding it
    print('oops... something went wrong loading the data :( ' + repr(err))


Reading file 	 ../../datasets/bayareabikeshare/201402_station_data.csv
Reading file 	 ../../datasets/bayareabikeshare/201408_station_data.csv
Reading file 	 ../../datasets/bayareabikeshare/201508_station_data.csv
Reading file 	 ../../datasets/bayareabikeshare/201608_station_data.csv
data loaded successfully!
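
The loading cell above reads each file in chunks and then concatenates the results. A minimal sketch of that pattern on in-memory data, assuming two tiny hypothetical CSV strings (`csv_a`, `csv_b`) in place of the real station files, shows why a later de-duplication step is useful when the same station appears in more than one file:

```python
import io

import pandas as pd

# Hypothetical sample data; only the column names mirror this notebook.
csv_a = "station_id,dockcount\n2,27\n3,15\n4,11\n"
csv_b = "station_id,dockcount\n4,11\n5,19\n"

frames = []
for text in (csv_a, csv_b):
    # with chunksize set, read_csv returns an iterator of DataFrames
    reader = pd.read_csv(io.StringIO(text), chunksize=2)
    # pd.concat accepts that iterator and stitches the chunks together
    frames.append(pd.concat(reader))

# one concat over the per-file frames, with a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(combined.duplicated().sum())  # rows that repeat an earlier row exactly
```

Collecting the frames in a list and concatenating once avoids re-copying the growing DataFrame on every loop iteration.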


In [5]:
#   drop empty rows
stations.dropna(how="all", inplace=True)

# create a lat_long column
stations['lat_long'] = stations['lat'].astype(str) + ',' + stations['long'].astype(str)

#   convert station IDs to strings (via int first, so an ID parsed as 2.0 becomes '2', not '2.0')
stations['station_id'] = stations['station_id'].astype(int).astype(str)

#   convert dockcount to int, no such thing as a partial dock
stations['dockcount'] = stations['dockcount'].astype(int)

#   convert installation to datetime
stations['installation'] = pd.to_datetime(stations['installation'])

#   drop duplicate rows and reindex
stations = stations.drop_duplicates(keep='first')
stations.reset_index(inplace=True, drop=True)
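
With the data cleaned, the first question in the list ("Can you count something interesting?") can be approached with a group-by. The sketch below is illustrative only: `sample` is a hypothetical stand-in for `stations`, its values are made up, and the `%m/%d/%Y` date format is an assumption about how the `installation` column is written.

```python
import pandas as pd

# Hypothetical rows standing in for the loaded station data.
sample = pd.DataFrame({
    'station_id': [2.0, 3.0, 3.0, 84.0],
    'dockcount': [27.0, 15.0, 15.0, 15.0],
    'installation': ['8/6/2013', '8/5/2013', '8/5/2013', '4/9/2014'],
})

# same cleaning steps as the cell above, applied to the sample
sample['station_id'] = sample['station_id'].astype(int).astype(str)
sample['dockcount'] = sample['dockcount'].astype(int)
sample['installation'] = pd.to_datetime(sample['installation'], format='%m/%d/%Y')
sample = sample.drop_duplicates(keep='first').reset_index(drop=True)

# count distinct stations and total docks per installation year
per_year = sample.groupby(sample['installation'].dt.year).agg(
    stations=('station_id', 'nunique'),
    docks=('dockcount', 'sum'),
)

print(per_year)
```

Calling `per_year['stations'].plot(kind='bar')` on the result would be one way to turn the count into a bar plot.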