# Seoul Bike Sharing Lab

### Introduction

Ok, so now let's return to our Seoul bike sharing lab.  Where we last left off, we had explored the initial data: identifying the grain of the data, checking for overlapping attributes, and then exploring the completeness of the dataset and the time period it covers.

Let's pick that back up and continue to work with our data.

### Loading our data

Let's read in the data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/python-fundamentals-jigsaw/review-datatypes/main/SeoulBikeData.csv"
df = pd.read_csv(url, encoding='unicode_escape')
df[:2]

,Date,Rented Bike Count,Hour,Temperature(°C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(°C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes


And then we can convert this to a list of dictionaries, one dictionary per row.

In [2]:
bike_hours = df.to_dict('records')

### Remembering our data

Ok, so let's begin by taking another look at the attributes of our bike data.

In [3]:
first_record = bike_hours[0]

print(first_record.keys())

dict_keys(['Date', 'Rented Bike Count', 'Hour', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 'Seasons', 'Holiday', 'Functioning Day'])


Now let's create a list of dictionaries that has just a subset of these attributes:

*  `'Rented Bike Count'`, `'Hour'`, `'Temperature(°C)'`, `'Rainfall(mm)'`, `'Seasons'`, `'Holiday'` and `'Formatted Date'`.

> We'll get you started.

In [104]:
attrs = ['Rented Bike Count', 'Hour', 'Temperature(°C)', 'Rainfall(mm)', 'Seasons', 'Holiday']
selected_bike_hours = []
for bike_hour in bike_hours:
    selected_attrs = [] 
    for k, v in bike_hour.items():
        if k in attrs:
            selected_attrs.append((k, v))
    selected_bike_hours.append(dict(selected_attrs))
print(selected_bike_hours[:2])

# Here we use a nested loop, because otherwise, we would have to zip together 
# many different lists, not just one or two

[{'Rented Bike Count': 254, 'Hour': 0, 'Temperature(°C)': -5.2, 'Rainfall(mm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday'}, {'Rented Bike Count': 204, 'Hour': 1, 'Temperature(°C)': -5.5, 'Rainfall(mm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday'}]
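
As an aside, the same selection can also be written as a dictionary comprehension nested inside a list comprehension. The sketch below is only an illustration: `sample_hours` and `sample_attrs` are made-up names standing in for `bike_hours` and `attrs`, and it assumes every record contains all of the listed keys.

```python
# Hypothetical sample standing in for bike_hours
sample_hours = [
    {'Date': '01/12/2017', 'Rented Bike Count': 254, 'Hour': 0, 'Seasons': 'Winter'},
    {'Date': '01/12/2017', 'Rented Bike Count': 204, 'Hour': 1, 'Seasons': 'Winter'},
]
sample_attrs = ['Rented Bike Count', 'Hour']

# For each record, keep only the keys listed in sample_attrs
selected_sample = [
    {k: record[k] for k in sample_attrs}
    for record in sample_hours
]
print(selected_sample)
```

Unlike the nested loop, this version raises a `KeyError` if a record is missing one of the requested keys, which may or may not be what you want.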


Now the only thing missing is the formatted dates that we created in our previous Seoul lab.

In [105]:
formatted_dates = []
for bike_hour in bike_hours:
    date = bike_hour['Date']
    formatted_date = '/'.join(list(reversed(date.split('/'))))
    formatted_dates.append(formatted_date)
formatted_dates[:2]

['2017/12/01', '2017/12/01']
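
If you would rather not reverse the pieces by hand, the standard library's `datetime` module can do the same conversion. This is a small sketch of an alternative, not the approach the lab requires; it assumes the `'Date'` strings are always in day/month/year order, as they appear above.

```python
from datetime import datetime

raw_date = '01/12/2017'  # day/month/year, as in the 'Date' column

# Parse the string into a datetime, then write it back out year-first
parsed = datetime.strptime(raw_date, '%d/%m/%Y')
reformatted = parsed.strftime('%Y/%m/%d')
print(reformatted)
```

One benefit of the year-first format is that the strings sort in chronological order.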

So use `zip` to iterate through `selected_bike_hours` and `formatted_dates` together, and add the matching formatted date to each dictionary under the key `'Formatted Date'`.

Remember, you can assign a new key-value pair to a dictionary with the following:

In [4]:
original_dict = {'name': 'Fred'}

original_dict['hometown'] = 'NYC'
original_dict

{'name': 'Fred', 'hometown': 'NYC'}

Ok, we'll let you get started.  To repeat, use `zip` to iterate through `selected_bike_hours` and `formatted_dates`, and add the formatted date to each dictionary.

In [107]:
print(selected_bike_hours[:1])

# [{'Rented Bike Count': 254, 'Hour': 0, 'Temperature(°C)': -5.2, 'Rainfall(mm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday', 'Formatted Date': '2017/12/01'}]

[{'Rented Bike Count': 254, 'Hour': 0, 'Temperature(°C)': -5.2, 'Rainfall(mm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday', 'Formatted Date': '2017/12/01'}]
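
If the `zip` pattern is unfamiliar, here is a minimal sketch on made-up data. The names `sample_records` and `sample_dates` are hypothetical stand-ins for `selected_bike_hours` and `formatted_dates`, and the sketch assumes the two lists are the same length and in the same order.

```python
# Hypothetical sample data standing in for selected_bike_hours and formatted_dates
sample_records = [{'Hour': 0}, {'Hour': 1}]
sample_dates = ['2017/12/01', '2017/12/01']

# zip pairs the first record with the first date, the second with the second, and so on
for record, date in zip(sample_records, sample_dates):
    record['Formatted Date'] = date  # mutates the dictionary in place

print(sample_records)
```

Because each dictionary is changed in place, there is no need to build a new list.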


In [108]:
len(selected_bike_hours)

# 8760

8760

Ok, so now we have a dataset with a reduced set of key-value pairs.

### Looking for missing data

Next let's look for missing data.  Let's take a look at one of our dictionaries.

In [109]:
first_hour = selected_bike_hours[0]

print(first_hour)

{'Rented Bike Count': 254, 'Hour': 0, 'Temperature(°C)': -5.2, 'Rainfall(mm)': 0.0, 'Seasons': 'Winter', 'Holiday': 'No Holiday', 'Formatted Date': '2017/12/01'}


Now for all of our dictionaries, iterate through the key-value pairs, and if the value is `None` or an empty string, append the key to a list.

In [110]:
missing_keys = []
for selected_hour in selected_bike_hours:
    for k, v in selected_hour.items():
        if v is None or v == '':
            missing_keys.append(k)

missing_keys[:3]

[]

Ok, so no missing values were found.  That's good news.
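
One caveat: when pandas reads a CSV, it typically represents missing numeric cells as the float `nan` rather than `None` or an empty string, so the check above would not catch them. Below is a minimal sketch of a broader check. The `is_missing` helper and the sample records are made up for illustration, not part of the lab.

```python
import math

def is_missing(value):
    """Hypothetical helper: True for None, empty strings, and float NaN."""
    if value is None or value == '':
        return True
    return isinstance(value, float) and math.isnan(value)

# Made-up records; the second one has a NaN temperature
sample_records = [
    {'Hour': 0, 'Temperature(°C)': -5.2},
    {'Hour': 1, 'Temperature(°C)': float('nan')},
]

# Collect the key of every value that counts as missing
sample_missing_keys = [
    key
    for record in sample_records
    for key, value in record.items()
    if is_missing(value)
]
print(sample_missing_keys)
```

Note that `nan == nan` is `False` in Python, which is why the helper uses `math.isnan` instead of an equality check.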

### Data Exploration

Okay, so now let's calculate the total number of bike rentals per hour.  We'll initialize a new dictionary for you using some fancy dictionary comprehension -- whatever that is.

In [111]:
total_bike_rentals_per_hour = {hour: 0 for hour in range(24)}
print(total_bike_rentals_per_hour)

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0}


So you can see that we created a dictionary that has a key for each hour, setting the value to zero.

Now populate that dictionary with the total number of bike rentals for each hour.

In [113]:

# {0: 197633, 1: 155557, 2: 110095, 3: 74216,
# 4: 48396, 5: 50765, 6: 104961, 7: 221192, 8: 370731, 9: 235784, 10: 192655, 11: 219311, 12: 255296, 13: 267635, 14: 276971, 15: 302653, 16: 339677, 17: 415556, 18: 548568, 19: 436229, 20: 390172, 21: 376479, 22: 336821, 23: 244961}

{0: 197633, 1: 155557, 2: 110095, 3: 74216, 4: 48396, 5: 50765, 6: 104961, 7: 221192, 8: 370731, 9: 235784, 10: 192655, 11: 219311, 12: 255296, 13: 267635, 14: 276971, 15: 302653, 16: 339677, 17: 415556, 18: 548568, 19: 436229, 20: 390172, 21: 376479, 22: 336821, 23: 244961}
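
The accumulation pattern here can be sketched on a tiny made-up sample. The records and the name `sample_totals` below are hypothetical; the idea is just to add each record's count to the running total for its hour.

```python
# Made-up records standing in for selected_bike_hours
sample_records = [
    {'Hour': 0, 'Rented Bike Count': 254},
    {'Hour': 1, 'Rented Bike Count': 204},
    {'Hour': 0, 'Rented Bike Count': 328},
]

# Start every hour in the sample at zero, then add each record's count to its hour
sample_totals = {hour: 0 for hour in range(2)}
for record in sample_records:
    sample_totals[record['Hour']] += record['Rented Bike Count']

print(sample_totals)
```

Initializing the keys up front means the `+=` never hits a missing key.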


So these are the total number of bikes rented in each hour of the day, summed across every day in the dataset.  Next, find the average number of bikes rented in each hour.  To do that, create another dictionary that counts how many times each hour was seen in the dataset.

This time, produce that initial dictionary yourself.

In [8]:
count_bike_rentals_per_hour = None

print(count_bike_rentals_per_hour)
# {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0}

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0}


And then count up each time the hour was seen.

In [116]:

count_bike_rentals_per_hour

# {0: 365, 1: 365, 2: 365, 3: 365, 4: 365, 5: 365, 6: 365, 7: 365, 8: 365, 9: 365, 10: 365, 11: 365, 12: 365, 13: 365, 14: 365, 15: 365, 16: 365, 17: 365,
#  18: 365, 19: 365, 20: 365, 21: 365, 22: 365, 23: 365}

{0: 365, 1: 365, 2: 365, 3: 365, 4: 365, 5: 365, 6: 365, 7: 365, 8: 365, 9: 365, 10: 365, 11: 365, 12: 365, 13: 365, 14: 365, 15: 365, 16: 365, 17: 365, 18: 365, 19: 365, 20: 365, 21: 365, 22: 365, 23: 365}


Ok, so it looks like they were each seen 365 times.  Confirm that by showing there is only one distinct value in the dictionary: 365.

In [117]:


# {365}

{365}
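
One way to check for a single distinct value is to pass the dictionary's values to `set`, which drops duplicates. A small sketch on made-up counts (`sample_counts` is a hypothetical name):

```python
sample_counts = {0: 365, 1: 365, 2: 365}  # made-up counts for three hours

# A set drops duplicates, so a single element means every value is the same
unique_counts = set(sample_counts.values())
print(unique_counts)
```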

Ok, so now divide each of the values in `total_bike_rentals_per_hour` by 365 to produce `avg_bike_rentals`, and round to the nearest whole number.

In [9]:
avg_bike_rentals = {}

print(avg_bike_rentals)

# {0: 541, 1: 426, 2: 302, 3: 203, 4: 133, 5: 139, 6: 288, 7: 606, 8: 1016, 9: 646, 10: 528, 11: 601, 12: 699, 13: 733, 14: 759, 15: 829, 16: 931, 17: 1139,
#  18: 1503, 19: 1195, 20: 1069, 21: 1031, 22: 923, 23: 671}

{}
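
As a hint, the division and rounding can be done in a single dictionary comprehension. The sketch below uses only the totals for hours 0 and 1 from the output above; `sample_totals`, `days_observed`, and `sample_avgs` are hypothetical names.

```python
sample_totals = {0: 197633, 1: 155557}  # totals for hours 0 and 1, from the output above
days_observed = 365

# Divide each total by the number of days and round to a whole number
sample_avgs = {
    hour: round(total / days_observed)
    for hour, total in sample_totals.items()
}
print(sample_avgs)
```

Keep in mind that Python's built-in `round` sends exact halves to the nearest even number, so `round(0.5)` is `0` and `round(1.5)` is `2`.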


Can we plot it? Let's use a bit of plotly.

> You may need to uncomment and run the following two lines.

In [11]:
# !pip install plotly
# !jupyter labextension install jupyterlab-plotly

In [12]:
hours = list(avg_bike_rentals.keys())
avg_rentals = list(avg_bike_rentals.values())

In [15]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Scatter(x=hours, y=avg_rentals))
fig

### More analysis

Ok, now let's see what else we can do with our data.  For example, remember that we also have information on each season.  So for each season, find the total number of bikes rented.

> Try not to look at your work above.

In [148]:
seasons = ['Winter', 'Spring', 'Summer', 'Autumn']
# selected_bike_hours
total_rentals_per_season = {}

In [149]:
print(total_rentals_per_season)

# {'Winter': 487169, 'Spring': 1611909, 'Summer': 2283234, 'Autumn': 1790002}

{'Winter': 487169, 'Spring': 1611909, 'Summer': 2283234, 'Autumn': 1790002}


Then **count** the number of bike hours per season.

In [150]:
seasons = ['Winter', 'Spring', 'Summer', 'Autumn']
# selected_bike_hours

num_hours_per_season = {}


In [151]:
num_hours_per_season
# {'Winter': 2160, 'Spring': 2208,
# 'Summer': 2208, 'Autumn': 2184}

{'Winter': 2160, 'Spring': 2208, 'Summer': 2208, 'Autumn': 2184}

And from here, return a dictionary of the average bike rentals per hour for each season.

In [153]:
avg_rentals_per_hour_season = {}

# {'Winter': 226, 'Spring': 730, 'Summer': 1034, 'Autumn': 820}

{'Winter': 226, 'Spring': 730, 'Summer': 1034, 'Autumn': 820}
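
For comparison, the totals, counts, and averages can also be gathered in a single pass with `collections.defaultdict`, which starts missing keys at zero so no initial dictionary is needed. This is a sketch on made-up records with hypothetical names, not the lab's expected solution.

```python
from collections import defaultdict

# Made-up records standing in for selected_bike_hours
sample_records = [
    {'Seasons': 'Winter', 'Rented Bike Count': 200},
    {'Seasons': 'Winter', 'Rented Bike Count': 250},
    {'Seasons': 'Summer', 'Rented Bike Count': 1000},
]

season_totals = defaultdict(int)  # missing keys start at 0
season_counts = defaultdict(int)
for record in sample_records:
    season = record['Seasons']
    season_totals[season] += record['Rented Bike Count']
    season_counts[season] += 1

# Average rentals per bike hour in each season, rounded to a whole number
season_avgs = {
    season: round(season_totals[season] / season_counts[season])
    for season in season_totals
}
print(season_avgs)
```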

Ok, that's good enough for now.

### Summary

In this lesson, we moved through formatting our data by selecting relevant attributes and adding in our formatted dates.  From there, we checked to see if there was any missing data.

Then we moved on to some analysis, initializing our dictionaries with a dictionary comprehension.

In [155]:
print({hour: 0 for hour in range(24)})

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0}


We then performed calculations to find the average rentals per hour and per season.