# Bikeshare Submission

## Packages Used

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport  # required by the profile report cells below
import os

You are provided hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
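
The day-of-month split described above can be illustrated on synthetic timestamps. This is a minimal sketch: the `hours` frame is made up for illustration and is not the competition data.

```python
import pandas as pd

# Synthetic hourly timestamps for January 2011 (illustrative, not the real data).
hours = pd.DataFrame(
    {"datetime": pd.date_range("2011-01-01 00:00", "2011-01-31 23:00", freq="h")}
)

# Days 1-19 of the month belong to the training set; day 20 onward is the test set.
is_train = hours["datetime"].dt.day <= 19
train_part = hours[is_train]
test_part = hours[~is_train]

print(len(train_part), len(test_part))
```

Because the test rows are the last days of every month, each test period is surrounded by training data from the months before and after it.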

Data Fields
- datetime - hourly date + timestamp  
- season 
    - 1 = spring
    - 2 = summer
    - 3 = fall
    - 4 = winter
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals <- PREDICTION

# Extracting the Train and Test Data

In [3]:
train_df = pd.read_csv('../data/train.csv')

# Dropping casual and registered columns
train_df.drop(['casual', 'registered'], axis=1, inplace=True)

train_df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1
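
Why `casual` and `registered` are dropped can be sketched on toy rows. The field descriptions suggest that `count` is the sum of the two, and the test file does not contain them, so keeping them would hand the model the label during training only. The `casual` and `registered` values below are hypothetical, chosen so that they sum to the counts.

```python
import pandas as pd

# Hypothetical rows: the casual/registered values are made up for illustration.
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# If the two components always sum to the label, they reveal it exactly.
leaks_label = bool((toy["casual"] + toy["registered"] == toy["count"]).all())

# Drop them without mutating the original frame.
features = toy.drop(columns=["casual", "registered"])
print(leaks_label, list(features.columns))
```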


In [4]:
test_df = pd.read_csv('../data/test.csv')
test_df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0
3,2011-01-20 03:00:00,1,0,1,1,10.66,12.88,56,11.0014
4,2011-01-20 04:00:00,1,0,1,1,10.66,12.88,56,11.0014


# Generate the profile report

In [4]:
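# ignore_index=True gives the combined frame a unique 0..N-1 index; without it the
# train and test labels overlap, the likely cause of the "duplicate labels" warning.
profile_df = pd.concat([train_df.drop(columns=["count"]), test_df], axis=0, ignore_index=True)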

In [5]:
profile = ProfileReport(profile_df, title='Pandas Profiling Report', explorative=True)

In [6]:
profile.to_file("./reports/eda-profile.html")
_ = os.system("open ./reports/eda-profile.html")

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')



# Training without Feature Engineering

In [5]:
from autogluon.tabular import TabularPredictor


## - Running Tabular Predictor

In [6]:
predictor = TabularPredictor(
        label='count',
        path='autogluon',
        eval_metric='root_mean_squared_error',
    ).fit(
        train_df,
        time_limit=600,
        presets='best_quality'
    )

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.11.8
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000
CPU Count:          8
Memory Avail:       6.05 GB / 16.00 GB (37.8%)
Disk Space Avail:   51.70 GB / 460.43 GB (11.2%)
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
	This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fit on subsets of the data. Then holdout validation data is used to detect stacked 

[1000]	valid_set's rmse: 129.692
[2000]	valid_set's rmse: 128.562
[3000]	valid_set's rmse: 128.461


	Ran out of time, early stopping on iteration 3482. Best iteration is:
	[2833]	valid_set's rmse: 128.314
	Time limit exceeded... Skipping LightGBMXT_BAG_L1.
Fitting model: LightGBM_BAG_L1 ... Training model for up to 81.28s of the 130.75s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy


[1000]	valid_set's rmse: 129.274
[1000]	valid_set's rmse: 129.285
[1000]	valid_set's rmse: 135.098
[1000]	valid_set's rmse: 124.896
[1000]	valid_set's rmse: 134.058
[1000]	valid_set's rmse: 134.479
[1000]	valid_set's rmse: 136.511


	-131.8496	 = Validation score   (-root_mean_squared_error)
	25.18s	 = Training   runtime
	0.1s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 55.73s of the 105.2s of remaining time.
	-119.5502	 = Validation score   (-root_mean_squared_error)
	3.02s	 = Training   runtime
	0.28s	 = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 52.26s of the 101.73s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
	Ran out of time, early stopping on iteration 2232.
	Ran out of time, early stopping on iteration 2650.
	Ran out of time, early stopping on iteration 3312.
	Ran out of time, early stopping on iteration 3440.
	Ran out of time, early stopping on iteration 3596.
	Ran out of time, early stopping on iteration 4016.
	-131.835	 = Validation score   (-root_mean_squared_error)
	49.64s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ..

[1000]	valid_set's rmse: 68.4396
[1000]	valid_set's rmse: 70.4375
[1000]	valid_set's rmse: 78.3282
[1000]	valid_set's rmse: 72.9126
[2000]	valid_set's rmse: 72.8003
[1000]	valid_set's rmse: 75.2908
[1000]	valid_set's rmse: 76.558
[1000]	valid_set's rmse: 70.1196
[1000]	valid_set's rmse: 74.4144
[2000]	valid_set's rmse: 73.768


	-73.1365	 = Validation score   (-root_mean_squared_error)
	38.89s	 = Training   runtime
	0.17s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 9.08s of the 9.06s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
	Ran out of time, early stopping on iteration 290. Best iteration is:
	[288]	valid_set's rmse: 62.5063
	Ran out of time, early stopping on iteration 303. Best iteration is:
	[143]	valid_set's rmse: 65.2556
	Ran out of time, early stopping on iteration 309. Best iteration is:
	[140]	valid_set's rmse: 70.5065
	Ran out of time, early stopping on iteration 321. Best iteration is:
	[180]	valid_set's rmse: 68.5704
	Ran out of time, early stopping on iteration 352. Best iteration is:
	[184]	valid_set's rmse: 67.7337
	Ran out of time, early stopping on iteration 375. Best iteration is:
	[111]	valid_set's rmse: 71.169
	Ran out of time, early stopping on iteration 414. Best iteration is:
	[145]	va

[1000]	valid_set's rmse: 131.684
[2000]	valid_set's rmse: 130.67
[3000]	valid_set's rmse: 130.626
[1000]	valid_set's rmse: 135.592
[1000]	valid_set's rmse: 133.481
[2000]	valid_set's rmse: 132.323
[3000]	valid_set's rmse: 131.618
[4000]	valid_set's rmse: 131.443
[5000]	valid_set's rmse: 131.265
[6000]	valid_set's rmse: 131.277
[7000]	valid_set's rmse: 131.443
[1000]	valid_set's rmse: 128.503
[2000]	valid_set's rmse: 127.654


	Ran out of time, early stopping on iteration 2986. Best iteration is:
	[2813]	valid_set's rmse: 127.244


[1000]	valid_set's rmse: 134.135
[2000]	valid_set's rmse: 132.272
[3000]	valid_set's rmse: 131.286
[4000]	valid_set's rmse: 130.752
[5000]	valid_set's rmse: 130.363
[6000]	valid_set's rmse: 130.509
[1000]	valid_set's rmse: 136.168
[2000]	valid_set's rmse: 135.138
[3000]	valid_set's rmse: 135.029
[1000]	valid_set's rmse: 134.061
[2000]	valid_set's rmse: 133.034
[3000]	valid_set's rmse: 132.182
[4000]	valid_set's rmse: 131.997
[5000]	valid_set's rmse: 131.643
[6000]	valid_set's rmse: 131.504
[7000]	valid_set's rmse: 131.574
[1000]	valid_set's rmse: 132.912
[2000]	valid_set's rmse: 131.703
[3000]	valid_set's rmse: 131.117
[4000]	valid_set's rmse: 130.82
[5000]	valid_set's rmse: 130.673
[6000]	valid_set's rmse: 130.708


	-131.4948	 = Validation score   (-root_mean_squared_error)
	255.91s	 = Training   runtime
	0.61s	 = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 41.7s of the 191.5s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
	Ran out of time, early stopping on iteration 360. Best iteration is:
	[360]	valid_set's rmse: 131.436
	Ran out of time, early stopping on iteration 698. Best iteration is:
	[695]	valid_set's rmse: 133.551
	Ran out of time, early stopping on iteration 675. Best iteration is:
	[675]	valid_set's rmse: 131.674
	Ran out of time, early stopping on iteration 874. Best iteration is:
	[771]	valid_set's rmse: 126.67
	Ran out of time, early stopping on iteration 777. Best iteration is:
	[766]	valid_set's rmse: 131.739
	Ran out of time, early stopping on iteration 865. Best iteration is:
	[758]	valid_set's rmse: 133.592
	Ran out of time, early stopping on iteration 913. Best iteration is:
	[898]

[1000]	valid_set's rmse: 130.62


	-131.4324	 = Validation score   (-root_mean_squared_error)
	39.87s	 = Training   runtime
	0.09s	 = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 1.51s of the 151.3s of remaining time.
	-116.542	 = Validation score   (-root_mean_squared_error)
	1.87s	 = Training   runtime
	0.24s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 149.0s of remaining time.
	Ensemble Weights: {'KNeighborsDist_BAG_L1': 1.0}
	-84.1464	 = Validation score   (-root_mean_squared_error)
	0.01s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting 106 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 148.99s of the 148.97s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy


[1000]	valid_set's rmse: 60.597
[2000]	valid_set's rmse: 59.5748
[1000]	valid_set's rmse: 61.4527
[2000]	valid_set's rmse: 60.1446


	Ran out of time, early stopping on iteration 2307. Best iteration is:
	[2052]	valid_set's rmse: 60.0915


[1000]	valid_set's rmse: 64.1794
[2000]	valid_set's rmse: 63.0111


	Ran out of time, early stopping on iteration 2395. Best iteration is:
	[2304]	valid_set's rmse: 62.9171


[1000]	valid_set's rmse: 64.2891
[2000]	valid_set's rmse: 62.9486


	Ran out of time, early stopping on iteration 2918. Best iteration is:
	[2453]	valid_set's rmse: 62.7896


[1000]	valid_set's rmse: 59.0302
[2000]	valid_set's rmse: 57.5905
[1000]	valid_set's rmse: 62.8332
[2000]	valid_set's rmse: 61.2736
[1000]	valid_set's rmse: 63.068
[2000]	valid_set's rmse: 62.3601
[3000]	valid_set's rmse: 62.3717
[1000]	valid_set's rmse: 58.4155


	-60.5704	 = Validation score   (-root_mean_squared_error)
	125.64s	 = Training   runtime
	0.28s	 = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 22.47s of the 22.46s of remaining time.
	Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
	Ran out of time, early stopping on iteration 411. Best iteration is:
	[154]	valid_set's rmse: 53.9719
	Ran out of time, early stopping on iteration 415. Best iteration is:
	[339]	valid_set's rmse: 52.7536
	Ran out of time, early stopping on iteration 372. Best iteration is:
	[249]	valid_set's rmse: 54.7015
	Ran out of time, early stopping on iteration 383. Best iteration is:
	[199]	valid_set's rmse: 55.3363
	Ran out of time, early stopping on iteration 417. Best iteration is:
	[306]	valid_set's rmse: 54.765
	Ran out of time, early stopping on iteration 444. Best iteration is:
	[255]	valid_set's rmse: 56.0249
	-54.5864	 = Validation score   (-root_mean_squared_error)
	20.63s	 = Traini

## - Showing the Leaderboard

In [9]:
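try:
    display(predictor.leaderboard(silent=True))
except Exception:
    # Fall back to the saved predictor, e.g. when `predictor` is not defined in a fresh kernel
    predictor = TabularPredictor.load('autogluon')
    display(predictor.leaderboard(silent=True))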

Unnamed: 0,model,score_val,eval_metric,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L3,-54.569714,root_mean_squared_error,1.292876,443.94709,0.000248,0.011255,3,True,9
1,LightGBM_BAG_L2,-54.586355,root_mean_squared_error,1.010519,318.299719,0.039332,20.634286,2,True,8
2,LightGBMXT_BAG_L2,-60.570375,root_mean_squared_error,1.253296,423.301549,0.282109,125.636116,2,True,7
3,KNeighborsDist_BAG_L1,-84.146423,root_mean_squared_error,0.014645,0.009696,0.014645,0.009696,1,True,2
4,WeightedEnsemble_L2,-84.146423,root_mean_squared_error,0.014847,0.021625,0.000202,0.011929,2,True,6
5,KNeighborsUnif_BAG_L1,-101.588176,root_mean_squared_error,0.014143,0.006946,0.014143,0.006946,1,True,1
6,RandomForestMSE_BAG_L1,-116.542032,root_mean_squared_error,0.241228,1.870387,0.241228,1.870387,1,True,5
7,LightGBM_BAG_L1,-131.432362,root_mean_squared_error,0.088231,39.867387,0.088231,39.867387,1,True,4
8,LightGBMXT_BAG_L1,-131.494795,root_mean_squared_error,0.61294,255.911016,0.61294,255.911016,1,True,3
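
The `score_val` column is negative because the training log reports the validation score as `-root_mean_squared_error`, a higher-is-better form. A small sketch of reading it as plain RMSE, using three rows copied from the leaderboard above; the `board` frame and the `rmse` column are names introduced here for illustration.

```python
import pandas as pd

# Three rows copied from the leaderboard output above.
board = pd.DataFrame({
    "model": ["WeightedEnsemble_L3", "LightGBM_BAG_L2", "KNeighborsDist_BAG_L1"],
    "score_val": [-54.569714, -54.586355, -84.146423],
})

# Negate the higher-is-better score to recover the usual lower-is-better RMSE.
board["rmse"] = -board["score_val"]
best = board.sort_values("rmse").iloc[0]
print(best["model"], round(best["rmse"], 2))
```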


## - Predictions on Test Data

In [10]:
predictions = predictor.predict(test_df)
predictions.head()

0    34.273094
1    43.230057
2    46.590622
3    50.979973
4    51.777607
Name: count, dtype: float32

In [12]:
# Identifying negative predictions
predictions.describe()

count    6493.000000
mean       98.326591
std        88.796036
min       -10.711480
25%        16.453503
50%        62.550030
75%       169.153320
max       370.938568
Name: count, dtype: float64

In [15]:
# Counting negative predictions
negative_prediction_count = (predictions < 0).sum()

print(f"There are {negative_prediction_count} negative predictions to set to zero.")

There are 9 negative predictions to set to zero.


In [16]:
# Setting negative predictions to zero
predictions[predictions < 0] = 0

predictions.describe()

count    6493.000000
mean       98.332489
std        88.789307
min         0.000000
25%        16.453503
50%        62.550030
75%       169.153320
max       370.938568
Name: count, dtype: float64
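
An equivalent way to floor the predictions at zero is `Series.clip`, which returns a new Series instead of modifying the original in place. A minimal sketch on toy values (not the actual predictions):

```python
import pandas as pd

# Toy predictions; the values are illustrative only.
toy_preds = pd.Series([34.27, -10.71, 0.0, 169.15, -0.5], name="count")

n_negative = int((toy_preds < 0).sum())

# clip returns a new Series and leaves toy_preds untouched.
clipped = toy_preds.clip(lower=0)
print(n_negative, clipped.min())
```

Keeping the unclipped Series around makes it easy to check later how many values were changed.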

## - Setting up for Submission

In [25]:
submission_df = pd.read_csv('../data/sampleSubmission.csv')
display(submission_df.head())
submission_df.shape

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,0
1,2011-01-20 01:00:00,0
2,2011-01-20 02:00:00,0
3,2011-01-20 03:00:00,0
4,2011-01-20 04:00:00,0


(6493, 2)

In [27]:
submission_df['count'] = predictions
submission_df.to_csv('submission.csv', index=False)
display(submission_df.head())
submission_df.shape

Unnamed: 0,datetime,count
0,2011-01-20 00:00:00,34.273094
1,2011-01-20 01:00:00,43.230057
2,2011-01-20 02:00:00,46.590622
3,2011-01-20 03:00:00,50.979973
4,2011-01-20 04:00:00,51.777607


(6493, 2)
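
Assigning a Series to a DataFrame column aligns on index labels, not on row position. In the cell above both objects carry the default 0..6492 index, so the two coincide. The sketch below, on toy stand-ins with a deliberately shuffled index, shows what would happen otherwise; `sub`, `preds` and `by_position` are names made up for this example.

```python
import pandas as pd

# Toy stand-ins for the sample submission and the prediction Series.
sub = pd.DataFrame({"datetime": ["t0", "t1", "t2"], "count": [0, 0, 0]})
preds = pd.Series([5.0, 7.0, 9.0], index=[2, 0, 1])  # deliberately shuffled index

# Column assignment matches values to rows by index label.
sub["count"] = preds

# Assigning the raw values ignores the index and uses row order instead.
sub["by_position"] = preds.to_numpy()

print(sub)
```

If the index of the predictions ever differs from that of the submission frame, checking `len(predictions) == len(submission_df)` and assigning `predictions.to_numpy()` is one way to keep row order.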

## - Submitting Initial Predictions

In [32]:
# Getting the best model name
best_model = predictor.model_best
best_model

'WeightedEnsemble_L3'

In [34]:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first submission with {best_model}"

100%|█████████████████████████████████████████| 188k/188k [00:00<00:00, 436kB/s]
Successfully submitted to Bike Sharing Demand

In [35]:
!kaggle competitions submissions -c bike-sharing-demand

fileName        date                 description                                status    publicScore  privateScore  
--------------  -------------------  -----------------------------------------  --------  -----------  ------------  
submission.csv  2024-07-07 21:14:45  first submission with WeightedEnsemble_L3  complete  1.84764      1.84764       
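
The public score of 1.84764 is not on the same scale as the validation RMSE of about 54.57. Assuming the competition scores submissions with root mean squared logarithmic error (RMSLE), as stated on its Kaggle evaluation page and not shown in this notebook, the metric can be sketched with the standard library as follows. The `rmsle` helper and the sample values are illustrative only.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only, not the competition data.
actual = [16, 40, 32, 13, 1]
perfect = rmsle(actual, actual)

# log1p(e - 1) = 1 and log1p(0) = 0, so this single-row error is 1.
off_by_one_log_unit = rmsle([math.e - 1], [0])
```

Since `log1p` is undefined for values at or below -1, negative predictions have to be handled before a metric of this kind can be computed.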


# Training with Feature Engineering
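
This section has no cells yet. One possible starting point, offered as an assumption and not as the notebook's method, is to split the `datetime` column into calendar parts so that tree models can pick up hourly and seasonal patterns. The helper name `add_datetime_features` and the chosen columns are hypothetical.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hour, weekday, month and year columns added."""
    out = df.copy()
    stamps = pd.to_datetime(out["datetime"])
    out["hour"] = stamps.dt.hour
    out["weekday"] = stamps.dt.dayofweek  # Monday = 0
    out["month"] = stamps.dt.month
    out["year"] = stamps.dt.year
    return out

# Two timestamps taken from the train and test previews above.
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-20 04:00:00"]})
features = add_datetime_features(sample)
print(features)
```

The same function would have to be applied to both the training and the test frame so that they keep identical columns.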

# END OF NOTEBOOK