![](https://upload.wikimedia.org/wikipedia/en/1/1b/CapitalBikeshare_Logo.jpg)

## Forecast use of a city bikeshare system

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/3948/media/bikes.png)

##### [Kaggle Reference](https://www.kaggle.com/c/bike-sharing-demand)

You are provided with hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
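The split rule above can be expressed as a simple predicate on the day of the month. The following is a minimal sketch, not part of the competition materials: the function name `in_training_window` is hypothetical, and the sample timestamps are arbitrary values chosen for illustration rather than rows from the dataset.

```python
from datetime import datetime

# Per the competition description: days 1-19 of each month are training data.
TRAIN_LAST_DAY = 19


def in_training_window(timestamp):
    """Return True if the timestamp falls in the first 19 days of its month."""
    return timestamp.day <= TRAIN_LAST_DAY


# Arbitrary example timestamps (assumed for illustration only).
hours = [
    datetime(2011, 1, 19, 23),
    datetime(2011, 1, 20, 0),
    datetime(2011, 2, 1, 0),
]
train_hours = [h for h in hours if in_training_window(h)]
test_hours = [h for h in hours if not in_training_window(h)]
```

Because the test hours sit at the end of every month rather than at the end of the whole two-year span, a local validation scheme that holds out the last few training days of each month may mimic the competition layout more closely than a random split.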

### Attributes

* **datetime** - hourly date + timestamp
* **season** -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
* **holiday** - whether the day is considered a holiday
* **workingday** - whether the day is neither a weekend nor holiday
* **weather** - 
    * 1: Clear, Few clouds, Partly cloudy
    * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
    * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp** - temperature in Celsius
* **atemp** - "feels like" or apparent temperature in Celsius
* **humidity** - relative humidity
* **windspeed** - wind speed
* **casual** - number of non-registered user rentals initiated
* **registered** - number of registered user rentals initiated
* **count** - number of total rentals
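As a quick sanity check on these definitions, the sketch below decodes the `season` code and tests whether `count` equals `casual + registered`, which the descriptions above suggest but do not state outright. The helper names `season_label` and `counts_are_consistent` are hypothetical, and the sample record holds invented values for illustration, not a row from the dataset.

```python
# Mapping taken from the attribute list above.
SEASON_LABELS = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}


def season_label(code):
    """Translate a season code (int or numeric string) into its label."""
    return SEASON_LABELS[int(code)]


def counts_are_consistent(record):
    """Check the assumed identity count == casual + registered for one record."""
    return int(record["casual"]) + int(record["registered"]) == int(record["count"])


# Invented example record; values arrive as strings when read from a CSV file.
sample = {"season": "1", "casual": "3", "registered": "13", "count": "16"}
label = season_label(sample["season"])
consistent = counts_are_consistent(sample)
```

Note that `casual` and `registered` are only known after the rental period, so they can serve as alternative training targets but not as input features for the test set.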

### Evaluation

Submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated as

$$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } $$

Where:

* $n$ is the number of hours in the test set
* $p_i$ is your predicted count
* $a_i$ is the actual count
* $\log(x)$ is the natural logarithm
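The formula above translates directly into a few lines of Python. This is a minimal sketch using only the standard library; the function name `rmsle` and its argument names are choices made here, not something defined by the competition.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one observation is required")
    # math.log1p(x) computes log(x + 1) accurately for small x.
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the error is taken on the log scale, the metric penalizes relative rather than absolute differences: predicting 1 when the actual count is 0 costs the same as predicting 3 when the actual count is 1, since both differ by $\log 2$. Predictions must be greater than $-1$ for the logarithm to be defined, so negative model outputs should be clipped before scoring.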

## Planning

In [1]:
import base64

In [4]:
answer = 'CjEuICoqRXhwbG9yYXRpb24qKgogICAgMS4gRGlzdHJpYnV0aW9ucyAoVW5pdmFyaWF0ZSkKICAg\nIDEuIENvcnJlbGF0aW9ucyAoQml2YXJpYXRlKQogICAgMS4gUGxvdHMgKE11bHRpdmFyaWF0ZSkK\nMi4gKipBbmFseXNpcyoqCiAgICAxLiBXcml0ZSB0aGUgU2NvcmluZyBNZXRob2QKICAgIDEuIEJ1\naWxkIGEgJ21lYW4gdmFsdWUnIGJhc2VsaW5lIG1vZGVsIGZvciByZWZlcmVuY2UKICAgIDEuIFNl\ndCB1cCBjcm9zc3ZhbGlkYXRpb24gcGlwZWxpbmUKICAgIDEuIEJ1aWxkIG91ciBmaXJzdCByZWdy\nZXNzaW9uIG1vZGVsCiAgICAxLiBGZWF0dXJlIEVuZ2luZWVyaW5nCiAgICAxLiBUdW5lIHBhcmFt\nZXRlcnMgdG8gaW1wcm92ZSB0aGUgbW9kZWwKMy4gKipTdWJtaXNzaW9uKioKICAgIDEuIFN1Ym1p\ndCBvdXIgcHJlZGljdGlvbnMgdG8gS2FnZ2xlLgo=\n'
# b64decode works on both Python 2 and 3; decodestring was removed in Python 3.9
for line in base64.b64decode(answer).decode('utf-8').split('\n'):
    print(line)


1. **Exploration**
    1. Distributions (Univariate)
    1. Correlations (Bivariate)
    1. Plots (Multivariate)
2. **Analysis**
    1. Write the Scoring Method
    1. Build a 'mean value' baseline model for reference
    1. Set up crossvalidation pipeline
    1. Build our first regression model
    1. Feature Engineering
    1. Tune parameters to improve the model
3. **Submission**
    1. Submit our predictions to Kaggle.
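The 'mean value' baseline from the analysis steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the notebook's actual implementation: the function names are hypothetical, and the training counts are invented. It also includes a log-space variant, since the constant that minimizes RMSLE on the data it is fitted to is the mean of $\log(a_i + 1)$ transformed back, rather than the arithmetic mean.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined in the Evaluation section."""
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mean_baseline(train_counts):
    """Constant prediction: the arithmetic mean of the training counts."""
    return sum(train_counts) / float(len(train_counts))


def log_mean_baseline(train_counts):
    """Constant prediction: mean of log(count + 1), mapped back to count scale."""
    mean_log = sum(math.log1p(c) for c in train_counts) / float(len(train_counts))
    return math.expm1(mean_log)


# Invented counts for illustration only.
train_counts = [1, 10, 100]
mean_pred = mean_baseline(train_counts)
log_pred = log_mean_baseline(train_counts)

mean_score = rmsle([mean_pred] * len(train_counts), train_counts)
log_score = rmsle([log_pred] * len(train_counts), train_counts)
```

Any regression model built later should beat both constants under cross-validation; if it does not, the features or the pipeline deserve a second look.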



### Versions

In [5]:
import sys
import sklearn
import pandas as pd
import numpy as np

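# str.format keeps the output identical on Python 2 and Python 3
print('{} {}'.format(pd.__name__, pd.__version__))
print('{} {}'.format(np.__name__, np.__version__))
print('{} {}'.format(sklearn.__name__, sklearn.__version__))
print(sys.version)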

pandas 0.20.3
numpy 1.13.3
sklearn 0.19.0
2.7.14 |Anaconda, Inc.| (default, Oct 16 2017, 17:29:19) 
[GCC 7.2.0]
