# Divvy Bikes Trips Forecast for Given Weather Conditions


### Project Description :
Divvy is Chicago's bike share system, with more than 5,800 bikes available at over 580 stations across the region. Divvy bikes are provided under the Chicago Department of Transportation. We want to help the department manage the city's transportation system by identifying patterns and frequencies of Divvy bike usage as a function of city weather. Our analysis of bike movements and trip frequencies can inform better placement of Divvy kiosks in the future, taking into account city dynamics and mobility with respect to time and weather conditions. Combining Divvy's daily trip data with Chicago's weather data gives an indicator of people's commuting habits and of how bike usage over a given period changes with the weather. These usage preferences and frequencies may in turn offer a glimpse of broader indicators of city transportation and mobility.


More details about Divvy can be found @ https://www.cityofchicago.org/city/en/depts/cdot.html

Divvy bike trips data @ http://www.divvybikes.com/data

Chicago's weather History @ https://www.wunderground.com/history/airport/KORD/

### Approach and Analysis

The data has several continuous and categorical features, so we handle it in two separate notebooks.
Launch and execute each notebook to see its output.

Preprocessing includes dropping features with more than 50% missing values, scaling continuous features, binarizing categorical features, and removing outliers from the target data.
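
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code from the notebooks: the helper name `preprocess`, the z-score rule used for outlier removal, and the `Wind_Gust`, `Events`, and `trips` column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, target, continuous, categorical,
               missing_threshold=0.5, z_max=3.0):
    """Hypothetical helper that mirrors the four preprocessing steps."""
    # 1. Drop features where more than 50% of the values are missing.
    missing_share = df.isna().mean()
    keep = [c for c in df.columns
            if c == target or missing_share[c] <= missing_threshold]
    df = df[keep].copy()
    continuous = [c for c in continuous if c in df.columns]
    categorical = [c for c in categorical if c in df.columns]

    # 2. Remove outliers from the target (assumption: a simple z-score rule).
    z = (df[target] - df[target].mean()) / df[target].std()
    df = df[z.abs() <= z_max]

    # 3. Scale the continuous features to zero mean and unit variance.
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(df[continuous]),
        columns=continuous, index=df.index,
    )

    # 4. Binarize the categorical features (one 0/1 column per category).
    dummies = pd.get_dummies(df[categorical], dtype=float)

    return pd.concat([scaled, dummies], axis=1), df[target]


# Synthetic stand-in for the merged Divvy + weather data.
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "TemperatureF": rng.normal(60, 15, n),
    "Humidity": rng.uniform(30, 100, n),
    "Wind_Gust": np.where(rng.random(n) < 0.8, np.nan, 20.0),  # mostly missing
    "Events": rng.choice(["None", "Rain", "Fog"], n),
    "trips": rng.poisson(9, n).astype(float),
})
raw.loc[0, "trips"] = 500.0  # artificial outlier in the target

X, y = preprocess(raw, "trips",
                  continuous=["TemperatureF", "Humidity", "Wind_Gust"],
                  categorical=["Events"])
print(X.columns.tolist(), len(y))
```

In this sketch the mostly-empty `Wind_Gust` column is dropped and the single artificial outlier row is removed before scaling.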

We ran many experiments, and our findings indicate that the continuous features alone give better results than the continuous and categorical features taken together.

#### There are three main reasons for the above observation:

1. Many values of the categorical features are missing in the original data set. This compelled us to fill them in with the most probable values, just to study and analyse their effect.

2. The sampled data is limited: we analysed data for only one quarter, which does not capture all the information required to predict the rides correctly.

3. The target variable (Divvy bike trip counts) may be affected by special events (such as ball games or other events in the vicinity), which can distort the correlation between the predictions and the real data.

We used two baseline regression models as our basic models: one always predicts the mean of the target, the other always predicts the median.
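
A mean or median baseline can be sketched with scikit-learn's `DummyRegressor`. The data here is synthetic and the `results` dictionary is only an illustrative name; the notebooks may implement the baselines differently.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in: the baselines ignore X and only look at y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = rng.poisson(9, size=300).astype(float)

results = {}
for strategy in ("mean", "median"):
    baseline = DummyRegressor(strategy=strategy).fit(X, y)
    pred = baseline.predict(X)  # the same constant for every row
    results[strategy] = {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "r2": float(r2_score(y, pred)),
    }
    print(strategy, results[strategy])
```

On the data it was fitted to, the mean baseline has an R^2 of exactly zero by definition, and the median baseline can never have a lower RMSE than the mean baseline, because the mean minimizes squared error.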

#### Other models that we have tried are:

1) Linear regression

2) Ridge regression

3) Lasso regression


4) Linear regression (polynomial features of degree 2)

5) Ridge regression (polynomial features of degree 2)

6) Lasso regression (polynomial features of degree 2)


7) Linear regression (polynomial features of degree 3)

8) Ridge regression (polynomial features of degree 3)

9) Lasso regression (polynomial features of degree 3)


10) Ridge and Lasso with different values of alpha for each model.
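
The model grid listed above could be assembled along these lines. This is a sketch under assumptions: `build_models` is a hypothetical helper, and placing `StandardScaler` before `PolynomialFeatures` is one possible ordering, not necessarily the one used in the notebooks.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def build_models(alpha=0.5):
    """Return one pipeline per (model name, polynomial degree) pair."""
    models = {}
    for degree in (1, 2, 3):
        # Fresh estimator instances are created for every degree.
        estimators = (
            ("linear", LinearRegression()),
            ("ridge", Ridge(alpha=alpha)),
            ("lasso", Lasso(alpha=alpha)),
        )
        for name, estimator in estimators:
            models[(name, degree)] = make_pipeline(
                StandardScaler(),
                PolynomialFeatures(degree=degree, include_bias=False),
                estimator,
            )
    return models


# Usage on synthetic data with a quadratic term in the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

models = build_models(alpha=0.5)
ridge_deg2 = models[("ridge", 2)].fit(X, y)
print(len(models), round(ridge_deg2.score(X, y), 3))
```

Degree 1 corresponds to the plain linear, ridge, and lasso models; trying other values of alpha only requires calling `build_models` with a different argument.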

#### Experimenting with continuous and categorical features together sometimes predicts a negative target value. To avoid this, we tried two possible approaches:

1) Set every negative predicted value to zero.

2) Transform the target variable to a logarithmic scale and scale back after prediction.

Both approaches gave only a slight improvement in model performance, so we dropped them.
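
For reference, the two approaches can be sketched as below on synthetic data. The use of `TransformedTargetRegressor` with `log1p`/`expm1` is an assumption chosen for illustration (it tolerates zero counts); the notebooks may have applied the logarithm differently.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic non-negative target, so a plain linear fit dips below zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.maximum(0.0, 3.0 + 4.0 * X[:, 0] + rng.normal(scale=1.0, size=300))

# Plain linear regression: some predictions are negative.
raw_pred = LinearRegression().fit(X, y).predict(X)

# Approach 1: set every negative prediction to zero.
clipped_pred = np.clip(raw_pred, 0.0, None)

# Approach 2: fit on log(1 + y) and map predictions back with exp(.) - 1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
).fit(X, y)
log_pred = log_model.predict(X)

print(raw_pred.min(), clipped_pred.min(), log_pred.min())
```

Note that with `log1p`/`expm1` the back-transformed predictions are bounded below by -1 rather than by 0, so the second approach limits negative predictions but does not strictly eliminate them.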

We are able to beat the baseline performance using only the continuous features (this can be verified with Notebook 1).

#### Our experiments looked for the correlation between weather and Divvy bike trip counts. Because the data comes from different sources that may not be completely correlated, X cannot fully predict Y all the time. Still, we were able to beat the baseline performance.


#### Description of the Notebooks:

Notebook 1 : It deals only with the continuous features in the data set. [Continuous Features](approach/evaluation_continuous.ipynb)

Notebook 2 : It deals with continuous as well as categorical features. [Continuous and Categorical Features](approach/evaluation_continuous_categorical.ipynb)


Both notebooks cover all the points mentioned on Blackboard; the only difference is the input data.

## Major Tasks:


1. Baseline Regression

2. Ridge Regression

3. Data Sampling for Divvy Bikes

4. Performance indicator

5. Polynomial features implementation

6. Linear Regression

7. Data collection for weather

8. Features scaling

9. Outliers removal

10. Feature experiments with continuous data

11. Lasso Regression

12. Data consolidation and preparation for divvy bikes and weather

13. Binarizer for categorical values

14. Feature experimentation for categorical features

15. Target variable scaling

### Performance Evaluation : 

#### Legend: "Train" means the whole data set, and "CV10" means 10-fold cross-validation.

#### Mean of target is : [9.3422459893]

We tried different values of alpha; a snapshot for alpha = 0.5 is below.
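
The "Train" and "CV10" figures reported in this section can be computed along the following lines. This is a sketch on synthetic data: the `evaluate` helper is hypothetical, and scoring the pooled out-of-fold predictions from shuffled folds is an assumption; the notebooks may instead average per-fold scores, which gives slightly different numbers.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict


def evaluate(model, X, y, n_splits=10, seed=0):
    """Return RMSE and R^2 on the whole data and under k-fold CV."""
    def scores(pred):
        return {
            "rmse": float(np.sqrt(mean_squared_error(y, pred))),
            "r2": float(r2_score(y, pred)),
        }

    # "Train": fit on the whole data and score on the same data.
    train_pred = model.fit(X, y).predict(X)

    # "CV10": every row is predicted by a model that never saw it.
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_pred = cross_val_predict(model, X, y, cv=cv)

    return {"train": scores(train_pred), "cv10": scores(cv_pred)}


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

report = evaluate(Ridge(alpha=0.5), X, y)
print(report)
```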

1) Baseline Mean :

[Train] : [RMSE : 12.9547857905], [R^2 : 0.000]

[CV10]  : [RMSE : 12.9573388654], [R^2 : -0.292]


2) Baseline Median :

[Train] : [RMSE : 14.4239578156], [R^2 : -0.240]

[CV10]  : [RMSE : 14.4239578156], [R^2 : -0.302]


3) Linear regression :

[Train] : [RMSE : 9.39246473384], [R^2 : 0.474]

[CV10]  : [RMSE : 9.42400000000], [R^2 : 0.404]


4) Ridge regression : [alpha = 0.5]

[Train] : [RMSE : 9.45096993372], [R^2 : 0.468]

[CV10]  : [RMSE : 9.46150000000], [R^2 : 0.392]


5) Lasso regression : [alpha = 0.5]

[Train] : [RMSE : 10.1684547211], [R^2 : 0.384]

[CV10]  : [RMSE : 9.98130000000], [R^2 : 0.355]


#### Polynomial Features of Degree 2

6) Linear regression :

[Train] : [RMSE : 9.18191236453], [R^2 : 0.498]

[CV10]  : [RMSE : 9.36410000000], [R^2 : 0.3466]

7) Ridge regression : [alpha = 2.712] [Best Model]

[Train] : [RMSE : 9.313], [R^2 : 0.483]

[CV10]  : [RMSE : 9.364], [R^2 : 0.4045]

8) Lasso regression : [alpha = 0.5]

[Train] : [RMSE : 10.1130], [R^2 : 0.3905]

[CV10]  : [RMSE : 10.1211], [R^2 : 0.3126]



#### Polynomial Features of Degree 3 [alpha = 0.5]

9) Linear regression :

[Train] : [RMSE : 8.78000504516], [R^2 : 0.541]

[CV10]  : [RMSE : 11.2402], [R^2 : -0.524]

10) Ridge regression :

[Train] : [RMSE : 9.1249131893], [R^2 : 0.504]

[CV10]  : [RMSE : 11.240200000], [R^2 : 0.395]

11) Lasso regression :

[Train] : [RMSE : 10.113000000], [R^2 : 0.390]

[CV10]  : [RMSE : 10.136900000], [R^2 : 0.3126]

12) Ridge and Lasso with different values of alpha for each model :

Please refer to the notebooks for the results with various values of alpha.

### Conclusion And Learning : 
Linear features are simple and less prone to overfitting. Moving to a higher polynomial degree gives better performance on the training data, but the cross-validation error keeps growing because the model overfits the training data.
Graphs and performance indicators helped us detect overfitting and identify the best model.
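
The overfitting pattern described above can be reproduced on a small synthetic data set. The sketch below is illustrative only: the data is random, the sample is deliberately small so the effect is visible, and none of the numbers it prints correspond to the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic data set with a purely linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(scale=1.0, size=60)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
train_rmse, cv_rmse = {}, {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # Error on the data the model was fitted to.
    train_pred = model.fit(X, y).predict(X)
    # Error on held-out folds (10-fold cross-validation).
    cv_pred = cross_val_predict(model, X, y, cv=cv)
    train_rmse[degree] = float(np.sqrt(mean_squared_error(y, train_pred)))
    cv_rmse[degree] = float(np.sqrt(mean_squared_error(y, cv_pred)))
    print(degree, train_rmse[degree], cv_rmse[degree])
```

The training error can only go down as the degree increases, because each higher-degree feature set contains the lower-degree one, while the cross-validated error is what reveals the overfitting.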

# Best Model Performance Summary : (Continuous Features)

#### Best Model : Ridge Regression polynomial degree 2, for alpha = 2.712

#### Table for RMSE:

|Model|     Baseline(Mean)    |     Baseline(Median)    |   Best model   |
|-----|-----------------------|-------------------------|----------------|
|Train|     12.954            |       14.423            |    9.3133      |
|CV10 |     12.957            |       14.424            |    9.3802      |


#### Table for R2 Score

|Model|     Baseline(Mean)    |     Baseline(Median)    |   Best model   |
|-----|-----------------------|-------------------------|----------------|
|Train|       0.000           |       -0.240            |     0.400      |
|CV10 |      -0.292           |       -0.302            |     0.404      |



### Important Features  (In descending order of coefficients) :
|Coefficient|     Feature Name    |
|-----------|---------------------|
|13.277     |Dew_PointF           |
|3.319      |WindDirDegrees       |
|1.266      |Wind_SpeedMPH        |
|0.000      |TemperatureF         |
|-0.252     |VisibilityMPH        |
|-1.265     |Humidity             |
|-7.558     |Sea_Level_PressureIn |
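
A coefficient ranking like the one above can be obtained from a fitted linear model roughly as follows. This sketch reuses the feature names from the table but fits a plain `Ridge` model on synthetic data with arbitrary weights, so its printed coefficients are unrelated to the values reported here; how the notebooks extracted the coefficients is not shown in this README.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

feature_names = [
    "Dew_PointF", "WindDirDegrees", "Wind_SpeedMPH", "TemperatureF",
    "VisibilityMPH", "Humidity", "Sea_Level_PressureIn",
]

# Synthetic data with arbitrary, made-up weights (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))
weights = np.array([3.0, 1.0, 0.5, 0.0, -0.2, -1.0, -2.0])
y = X @ weights + rng.normal(scale=0.5, size=200)

# Scale features first so that coefficient magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)
model = Ridge(alpha=0.5).fit(X_scaled, y)

# Pair each coefficient with its feature name and sort in descending order.
ranking = pd.Series(model.coef_, index=feature_names).sort_values(ascending=False)
print(ranking)
```

Coefficients are only comparable across features when the features share a common scale, which is why the sketch standardizes them before fitting.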
