# Bike Share Challenge

## Part I:

**Goal of challenge**:
   Predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
   
   Since only the total is required, the target is the number of casual rides plus the number of registered rides.
   
   **The dataset**:<br>
The data consists of hourly rentals spanning two years. The training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month.

**Data fields:**

- datetime - hourly date + timestamp<br>
- season -  1 = spring, 2 = summer, 3 = fall, 4 = winter <br>
- holiday - whether the day is considered a holiday<br>
- workingday - whether the day is neither a weekend nor holiday<br>
- weather - 1: Clear, Few clouds, Partly cloudy<br>
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist<br>
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds<br>
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog <br>
- temp - temperature in Celsius<br>
- atemp - "feels like" temperature in Celsius<br>
- humidity - relative humidity<br>
- windspeed - wind speed<br>
- casual - number of non-registered user rentals initiated<br>
- registered - number of registered user rentals initiated<br>
- count - number of total rentals

**Initial thoughts:**<br>
- Bike share demand is usually higher on weekends (free floating); since we are only given the first 19 days of each month, we can create a variable counting how many weekend days fall within the days provided;<br>
- Bike share demand is strongly correlated with good weather (our weather indicators are: season, weather (1-2-3-4), temp, atemp, humidity, windspeed);
- Casual users: they usually ride more on weekends, and their numbers tend to be higher during tourist seasons
- Registered users: they tend to account for more rides at all times
- Total rentals: usually cyclical over the seasons
- Holidays: higher number of riders
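
The weekend idea from the list above could be sketched roughly as follows. This is only an illustration on three made-up timestamps; the column names `is_weekend` and `date` and the variable `weekend_days` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the competition data (timestamps are illustrative only)
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2011-01-01 10:00:00",  # a Saturday
    "2011-01-02 10:00:00",  # a Sunday
    "2011-01-03 10:00:00",  # a Monday
])})

# dayofweek runs Monday=0 ... Sunday=6, so 5 and 6 mark the weekend
df["is_weekend"] = (df["datetime"].dt.dayofweek >= 5).astype(int)

# Count the distinct weekend dates available per (year, month)
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["date"] = df["datetime"].dt.date
weekend_days = (df[df["is_weekend"] == 1]
                .groupby(["year", "month"])["date"].nunique())
print(weekend_days)
```

On the real data the same lines would run on `train` and `test` once `datetime` has been parsed.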

**Load relevant libraries:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rc("font", size=18)
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

In [None]:
train = pd.read_csv('../input/bike-sharing-demand/train.csv')
test = pd.read_csv('../input/bike-sharing-demand/test.csv')
train.head()

**Unique values per variable:**

In [None]:
unique_values = {}
for i in range(1, len(train.columns)-3):
    unique_values[train.columns[i]] = train[train.columns[i]].unique()
unique_values

In [None]:
train.describe()

In [None]:
train.isnull().sum()

## Part II: EDA and Visualisation

**Preparing Our Data for Visualisation:**

In [None]:
#Datetime:

datasets = [train, test]

for dataset in datasets:
    dataset['datetime'] = pd.to_datetime(dataset['datetime'])
    # The vectorised .dt accessor extracts the same parts without a Python-level lambda
    dataset['hour'] = dataset['datetime'].dt.hour
    dataset['day'] = dataset['datetime'].dt.day
    dataset['weekday'] = dataset['datetime'].dt.weekday  # Monday=0 ... Sunday=6
    dataset['month'] = dataset['datetime'].dt.month
    dataset['year'] = dataset['datetime'].dt.year

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
#Names for categorical data:
train_c = train.copy()
train_c['weather'] = train_c['weather'].map({1: 'Good', 2: 'Medium', 3: 'Bad', 4: 'Very Bad'})
train_c['weekday'] = train_c['weekday'].map({0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thur', 4: 'Fri',
                                            5: 'Sat', 6: 'Sun'})
train_c['month'] = train_c['month'].map({1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 
                                        6: 'Jun', 7: 'July', 8: 'Aug', 9: 'Sept', 10: 'Oct',
                                         11: 'Nov', 12: 'Dec'})
train_c['season'] = train_c['season'].map({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})
train_c['workingday'] = train_c['workingday'].map({0: 'No', 1: 'Yes'})
train_c['holiday'] = train_c['holiday'].map({0: 'No', 1: 'Yes'})

**Season EDA:**

In [None]:
from numpy import mean
fig, ax = plt.subplots(nrows=2, ncols=2, figsize = (12,8))
sns.barplot(x = 'season', y = 'count', data = train_c, ci=None, color='salmon',
            hue = 'year', estimator = mean, ax =ax[0,0])
ax[0,0].set_title('Mean Count by Season hue: Year')
sns.barplot(x = 'season', y = 'count', data = train_c, ci=None, 
            color = 'salmon', hue = 'weather', estimator = mean, ax = ax[0,1])
ax[0,1].set_title('Mean Count by Season hue: Weather')
sns.barplot(x = 'month', y = 'count', data = train_c, ci=None, 
            color = 'indigo', hue = 'year', estimator = mean, ax = ax[1,0])
ax[1,0].set_title('Mean Count by Month hue: Year')
sns.barplot(x = 'month', y = 'count', data = train_c, ci=None, 
            color = 'indigo', hue = 'weather', estimator = mean, ax = ax[1,1])
ax[1,1].set_title('Mean Count by Month hue: Weather')
plt.tight_layout()

Preliminary observations:
* We can see a big shift up from 2011 to 2012
* The very bad weather occurs in January, which this dataset labels as Spring (the classification of the winter months as Spring is interesting).
* The months / seasons with worse weather indicators have a lower count of rides
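
To check how months map onto the season codes, a cross-tabulation is a quick option. The sketch below uses a made-up frame with one row per month and assumes the season code follows calendar quarters (1 = Jan-Mar), which is consistent with January showing up under Spring above; the name `season_by_month` is hypothetical.

```python
import pandas as pd

# Toy frame: one timestamp per month of 2011
toy = pd.DataFrame({"datetime": pd.date_range("2011-01-01", periods=12, freq="MS")})
toy["month"] = toy["datetime"].dt.month

# Assumption for illustration: season codes follow calendar quarters
toy["season"] = (toy["month"] - 1) // 3 + 1

# Rows are months, columns are season codes, cells count the observations
season_by_month = pd.crosstab(toy["month"], toy["season"])
print(season_by_month)
```

On the real data, `pd.crosstab(train['month'], train['season'])` would show the actual mapping.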

**Humidity, Temperature:**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=4, figsize = (16,5))
sns.distplot(train_c['windspeed'], ax=ax[0])
ax[0].set_title('Distplot windspeed')
sns.distplot(train_c['temp'], ax=ax[1])
ax[1].set_title('Distplot temperature')
sns.distplot(train_c['atemp'], ax=ax[2])
ax[2].set_title('Distplot atemperature')
sns.distplot(train_c['humidity'], ax=ax[3])
ax[3].set_title('Distplot humidity')
plt.tight_layout()

Comments:
* For atemp ("feels like" temperature) we can see some spikes around the 30 °C mark;
* For temp we can see spikes around the 16 °C mark;
* Other than that, the two distplots for temperature show a *relatively* normal distribution;
* Windspeed would also look roughly normal were it not for the spike at 0, which looks like an anomaly; let's look into these distributions by examining their outliers a bit more closely.
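
One quick way to size up the spike at 0 is to measure what share of readings are exactly zero and compare that with the rest of the distribution. The sketch below uses synthetic wind speeds (the gamma parameters and the 10% block of zeros are made up); whether the zeros in the real data are calm hours or missing readings is something we cannot tell from the plot alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic wind speeds: 900 positive readings plus 100 exact zeros
windspeed = pd.Series(np.concatenate([
    rng.gamma(shape=2.0, scale=6.0, size=900),
    np.zeros(100),
]))

zero_share = (windspeed == 0).mean()                  # fraction of exact zeros
median_nonzero = windspeed[windspeed > 0].median()    # typical non-zero reading
print(f"share of zeros: {zero_share:.1%}, median of non-zero values: {median_nonzero:.2f}")
```

On the real data, `(train['windspeed'] == 0).mean()` gives the corresponding share.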

In [None]:
fig, ax = plt.subplots(nrows=4, ncols=1, figsize = (12,12))
sns.boxplot(x='season',y='windspeed', hue= 'weather', data=train_c, palette='winter', ax = ax[0])
ax[0].set_title('Boxplot Windspeed by Season: Hue Weather')
sns.boxplot(x='season',y='temp', hue= 'weather', data=train_c, palette='winter', ax = ax[1])
ax[1].set_title('Boxplot Temperature by Season: Hue Weather')
sns.boxplot(x='season',y='atemp', hue= 'weather', data=train_c, palette='winter', ax = ax[2])
ax[2].set_title('Boxplot ATemperature by Season: Hue Weather')
sns.boxplot(x='season',y='humidity', hue= 'weather', data=train_c, palette='winter', ax = ax[3])
ax[3].set_title('Boxplot Humidity by Season: Hue Weather')
plt.tight_layout()

*Comments:*
- For humidity, the points outside the whiskers tend to sit at the lower end and fall under bad weather; they are particularly prevalent in the Summer and Fall seasons;
- Winter seems to have the fewest outliers for temp, atemp and humidity;
- For temperature we see the highest number of outliers in the Fall season under good weather; the same holds for atemp.

**Day of week and times:**

In [None]:
fig, ax = plt.subplots(1, figsize = (12,8))
# Select several columns with a list (double brackets); tuple-style selection is rejected by recent pandas
grouped_hours = train_c.groupby('hour', sort=True)[['casual', 'registered', 'count']].mean()
grouped_hours.plot(ax=ax)
ax.set_xticks(grouped_hours.index.to_list())
ax.set_xticklabels(grouped_hours.index)
plt.xticks(rotation=45)
plt.title('Avg Count by Hour')

*Preliminary observations:*
- We can see that registered users follow commuter patterns, whilst casual users do not - they have higher peaks during the afternoon (potentially from weekend use);

**Let's look at by day of week:**

In [None]:
fig, ax = plt.subplots(1, figsize = (12,8))
sns.barplot(x = 'weekday', y = 'count', data = train_c, ci=None, 
            color = 'indigo', estimator = mean, ax = ax)
ax.set_title('Avg Count by Weekday')

* Usage is similar across days overall, suggesting that higher leisure usage at weekends compensates for the commuter usage seen on weekdays

**Can we see a commuter trend?**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize = (15,8))

workingday = train_c.loc[train_c.workingday == 'Yes']
not_workingday = train_c.loc[train_c.workingday == 'No']
grouped_workingday = pd.DataFrame(workingday.groupby(['hour'], sort=True)['count'].mean())
grouped_notworkingday = pd.DataFrame(not_workingday.groupby(['hour'], sort=True)['count'].mean())

grouped_workingday.plot(ax=ax[0])
ax[0].set_xticks(grouped_workingday.index.to_list())
ax[0].set_xticklabels(grouped_workingday.index)
ax[0].tick_params(labelrotation=45)
ax[0].set_title('Avg Count by Hour - Working Day')

grouped_notworkingday.plot(ax=ax[1])
ax[1].set_xticks(grouped_notworkingday.index.to_list())
ax[1].set_xticklabels(grouped_notworkingday.index)
ax[1].tick_params(labelrotation=45)
ax[1].set_title('Avg Count by Hour - Not Working Day')

*Preliminary observations:*
- The two patterns are very clear here: commuter peaks on working days and leisure use on non-working days, where most usage is during the afternoon (especially as these are mostly weekends)

## Part III: Any Outliers?

In [None]:
sns.set(style="ticks")

In [None]:
sns.pairplot(data=train_c,
                  y_vars=['count'],
                  x_vars=['temp', 'atemp', 'humidity', 'windspeed'])

In [None]:
sns.pairplot(data=train_c,
                  y_vars=['registered'],
                  x_vars=['temp', 'atemp', 'humidity', 'windspeed'])

In [None]:
sns.pairplot(data=train_c,
                  y_vars=['casual'],
                  x_vars=['temp', 'atemp', 'humidity', 'windspeed'])

In general, temperature seems to be a bigger driver for casual users than for registered ones.

**Creating a training set without outliers:**

In [None]:
# Drop the raw timestamp first so that the quantiles cover exactly the columns being filtered
train = train.drop(['datetime'], axis = 1)

In [None]:
Q1 = train.quantile(0.25)
Q3 = train.quantile(0.75)
IQR = Q3 - Q1

In [None]:
train_without_outliers =train[~((train < (Q1 - 1.5 * IQR)) |(train > (Q3 + 1.5 * IQR))).any(axis=1)]

In [None]:
print("train original shape", train.shape[0])
print("train_without_outliers observations", train_without_outliers.shape[0])

Now let's review some of the outliers we saw with our boxplots in the previous analysis of temperature, windspeed and humidity:

In [None]:
fig, ax = plt.subplots(nrows=4, ncols=2, figsize = (12,12))

sns.boxplot(x='season',y='windspeed', data=train, palette='winter', ax = ax[0,0])
ax[0,0].set_title('Boxplot Windspeed by Season WITH OUTLIERS')
sns.boxplot(x='season',y='windspeed', data=train_without_outliers, palette='winter', ax = ax[0,1])
ax[0,1].set_title('Boxplot Windspeed by Season WITHOUT OUTLIERS')

sns.boxplot(x='season',y='temp', data=train, palette='winter', ax = ax[1,0])
ax[1,0].set_title('Boxplot Temperature by Season WITH OUTLIERS')
sns.boxplot(x='season',y='temp', data=train_without_outliers, palette='winter', ax = ax[1,1])
ax[1,1].set_title('Boxplot Temperature by Season WITHOUT OUTLIERS')


sns.boxplot(x='season',y='atemp', data=train, palette='winter', ax = ax[2,0])
ax[2,0].set_title('Boxplot ATemperature by Season WITH OUTLIERS')
sns.boxplot(x='season',y='atemp', data=train_without_outliers, palette='winter', ax = ax[2,1])
ax[2,1].set_title('Boxplot ATemperature by Season WITHOUT OUTLIERS')

sns.boxplot(x='season',y='humidity', data=train, palette='winter', ax = ax[3,0])
ax[3,0].set_title('Boxplot Humidity by Season WITH OUTLIERS')
sns.boxplot(x='season',y='humidity', data=train_without_outliers, palette='winter', ax = ax[3,1])
ax[3,1].set_title('Boxplot Humidity by Season WITHOUT OUTLIERS')

plt.tight_layout()

**Comments:**
* Here we can see that the simple method for removing outliers has worked well with windspeed;
* Its effect on temp and atemp, however, has not been very successful;
* This comes down to the fact that *whilst we are able to remove outliers at an aggregate level of, say, windspeed, this may not carry over to a more granular level (such as looking at it season by season)*
* Note: arguably, if one were to apply an outlier-removal method, it would make sense to do so season by season for weather-related variables;
* We will therefore run the regression analysis on the full dataset first, and then consider the outlier-free set to see whether it makes any great difference.
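
The season-by-season idea in the note above could look roughly like this. It is a sketch on made-up numbers, not the method used in this notebook; the helper `iqr_mask` and the frame `toy` are hypothetical names.

```python
import pandas as pd

def iqr_mask(values, k=1.5):
    """True where a value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)

# Made-up wind speeds: season 1 is calm apart from one gust, season 3 is windier overall
toy = pd.DataFrame({
    "season":    [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "windspeed": [5, 6, 7, 6, 40, 20, 21, 22, 21, 23],
})

# Apply the IQR rule inside each season rather than on the pooled column
keep = toy.groupby("season")["windspeed"].transform(iqr_mask).astype(bool)
toy_filtered = toy[keep]

# For comparison: the pooled rule treats every value, including the gust, as an inlier
pooled_keep = iqr_mask(toy["windspeed"])
print(len(toy_filtered), int(pooled_keep.sum()))
```

In this toy case the gust of 40 is only flagged once the rule is applied within its own season, which is the effect described in the comments above.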

**How to deal with outliers?**

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

Summary of scalers:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

**MinMaxScaler:**
- Rescales the data set such that all feature values are in the range [0, 1]
- MinMaxScaler is very sensitive to the presence of outliers.

**MaxAbsScaler:**
- Differs from the previous scaler such that the absolute values are mapped in the range [0, 1]. On positive only data, this scaler behaves similarly to MinMaxScaler and therefore also suffers from the presence of large outliers.

**RobustScaler:**
- The centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers.

**PowerTransformer:**
- PowerTransformer applies a power transformation to each feature to make the data more Gaussian-like
- Currently, PowerTransformer implements the Yeo-Johnson and Box-Cox transforms.
- The power transform finds the optimal scaling factor to stabilize variance and minimize skewness through maximum likelihood estimation.
- By default, PowerTransformer also applies zero-mean, unit-variance normalization to the transformed output. Note that Box-Cox can only be applied to strictly positive data; if zero or negative values are present, the Yeo-Johnson transform is to be preferred. (The scikit-learn example this summary is based on uses income and number of households, which happen to be strictly positive.)

**QuantileTransformer (Gaussian output):**
- Has an additional output_distribution parameter that allows matching a Gaussian distribution instead of a uniform one. Note that this non-parametric transformer introduces saturation artifacts for extreme values.

**QuantileTransformer (uniform output):**
- QuantileTransformer applies a non-linear transformation such that the probability density function of each feature will be mapped to a uniform distribution. In this case, all the data will be mapped in the range [0, 1], even the outliers which cannot be distinguished anymore from the inliers.
- As RobustScaler, QuantileTransformer is robust to outliers in the sense that adding or removing outliers in the training set will yield approximately the same transformation on held out data. But contrary to RobustScaler, QuantileTransformer will also automatically collapse any outlier by setting them to the a priori defined range boundaries (0 and 1).

**Normalizer:**
- The Normalizer rescales the vector for each sample to have unit norm, independently of the distribution of the samples.

For this project, given the presence of outliers, we will consider the use of RobustScaler().
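
As a small illustration of how RobustScaler behaves, the sketch below scales a made-up feature that contains one extreme value. The scaler is fitted on the training split only and then reused on the test split; the arrays and variable names are ours and purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme value in the training split
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[2.5], [50.0]])

scaler = RobustScaler()                          # centres on the median, scales by the IQR
X_train_scaled = scaler.fit_transform(X_train)   # statistics are learned from the training split
X_test_scaled = scaler.transform(X_test)         # the same statistics are reused on the test split

print(scaler.center_, scaler.scale_)
print(X_test_scaled.ravel())
```

The extreme value barely moves the median and the interquartile range, so the bulk of the data keeps a sensible scale while the outlier simply ends up far from zero.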

### Conclusions from overall EDA:

- Casual users tend to ride on non-working days
- On non-working days usage peaks in the early afternoon rather than at commuter hours
- Weather is an important factor for usage, and plays a stronger role for casual users
- Usage by weekday stays roughly the same
- It does change by season, though, with Spring (which has the most severe weather) being the season with the fewest rides

## Part IV: Correlations

In [None]:
corr = train.corr()
# Mask the upper triangle (above the diagonal) so that each pair appears only once
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
fig,ax= plt.subplots()
fig.set_size_inches(30,15)
sns.heatmap(corr, mask = mask, vmax = 0.8, square=True, annot=True, center = 0, 
            cmap="RdBu_r", linewidths=.5)

* For casual and registered: weather seems to be the most contributing factor
* With casual we also see strong associations with working day (negative)

### Identifying the Most Important Factor:

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

In [None]:
X= train.drop(['count', 'casual', 'registered'], axis = 1)
y = train['count']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3, random_state=5)

rf = RandomForestRegressor(n_estimators=100, random_state=2)

In [None]:
rf.fit(X_train, Y_train)

**Graphical representation of most important factors:**

In [None]:
%matplotlib inline
import matplotlib as mp
plt.subplots(figsize=(15,10))
core_variables = pd.Series(rf.feature_importances_, index=X.columns)
core_variables = core_variables.nlargest(8)

# Colour the bars by their importance score:
likeability_scores = np.array(core_variables)
 
data_normalizer = mp.colors.Normalize()
color_map = mp.colors.LinearSegmentedColormap(
    "my_map",
    {
        "red": [(0, 1.0, 1.0),
                (1.0, .5, .5)],
        "green": [(0, 0.5, 0.5),
                  (1.0, 0, 0)],
        "blue": [(0, 0.50, 0.5),
                 (1.0, 0, 0)]
    }
)

plt.title('Most Important Features')

#make the plot
core_variables.plot(kind='barh', color=color_map(data_normalizer(likeability_scores)))

**Now let's do it with the without outliers dataset:**

In [None]:
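from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
continuous_features = ['temp','atemp', 'humidity', 'windspeed']
# Take an explicit copy so the assignments below do not trigger SettingWithCopyWarning
train_without_outliers = train_without_outliers.copy()
for col in continuous_features:
    transf = train_without_outliers[col].values.reshape(-1,1)
    scaler = StandardScaler().fit(transf)
    train_without_outliers[col] = scaler.transform(transf).ravel()
# reset_index returns a new frame, so assign it back for the change to be kept
train_without_outliers = train_without_outliers.reset_index(drop=True)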

In [None]:
X= train_without_outliers.drop(['count', 'casual', 'registered'], axis = 1)
y = train_without_outliers['count']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3, random_state=5)

rf_without_outliers = RandomForestRegressor(n_estimators=100, random_state=2)

In [None]:
rf_without_outliers.fit(X_train, Y_train)

**Feature Importance Without Outliers:**

In [None]:
plt.subplots(figsize=(15,10))
core_variables_without_outliers = pd.Series(rf_without_outliers.feature_importances_, index=X.columns)
core_variables_without_outliers = core_variables_without_outliers.nlargest(8)

# Colour the bars by the importance scores of the model being plotted:
likeability_scores = np.array(core_variables_without_outliers)
 
data_normalizer = mp.colors.Normalize()
color_map = mp.colors.LinearSegmentedColormap(
    "my_map",
    {
        "red": [(0, 1.0, 1.0),
                (1.0, .5, .5)],
        "green": [(0, 0.5, 0.5),
                  (1.0, 0, 0)],
        "blue": [(0, 0.50, 0.5),
                 (1.0, 0, 0)]
    }
)

#make the plot
core_variables_without_outliers.plot(kind='barh', color=color_map(data_normalizer(likeability_scores)))

**Now let's compare the two plots:**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize = (20,10))
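# Normalise each series on its own range so the bar colours match the panel they belong to
core_variables.plot(kind='barh', color=color_map(mp.colors.Normalize()(np.array(core_variables))), ax=ax[0])
ax[0].set_title('With outliers significance plot')
core_variables_without_outliers.plot(kind='barh', color=color_map(mp.colors.Normalize()(np.array(core_variables_without_outliers))), ax=ax[1])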
ax[1].set_title('Without outliers significance plot')

**Conclusion:**
* With this random forest feature importance we can see that, with or without outliers in the training dataset, the most important feature by far is the *hour* (time of day).
* A notable change is the relative importance of weekday and working day, with the latter being more important for the original dataset.
* For this reason we will continue with the complete dataset; however, in our training below we will select only the most important variables.

### Random Forest Regressor: Comparing Scalers

**Using Robust Scaler:**

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
X = train.drop(['count', 'casual', 'registered'], axis =1)
y = train['count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=2)

In [None]:
# Fit the feature scaler on the training split only and reuse it on the test split
transformer = RobustScaler().fit(X_train)
rescaled_X_train = transformer.transform(X_train)
rescaled_X_test = transformer.transform(X_test)

y_train= y_train.values.reshape(-1,1)
y_test= y_test.values.reshape(-1,1)

# Same for the target: one scaler, fitted on the training target
transformer = RobustScaler().fit(y_train)
rescaled_y_train = transformer.transform(y_train)
rescaled_y_test = transformer.transform(y_test)

In [None]:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(rescaled_X_train, rescaled_y_train)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn import metrics
rf_prediction = rf.predict(rescaled_X_test)
print('MSE:', metrics.mean_squared_error(rescaled_y_test, rf_prediction))

In [None]:
plt.scatter(rescaled_y_test,rf_prediction)

**Using MinMax Scaler:**

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
X = train.drop(['count', 'casual', 'registered'], axis =1)
y = train['count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=2)

In [None]:
y_train= y_train.values.reshape(-1,1)
y_test= y_test.values.reshape(-1,1)

sc_X = MinMaxScaler()
sc_y = MinMaxScaler()
# Fit each scaler on the training split only, then reuse it on the test split
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

In [None]:
rf_prediction = rf.predict(X_test)
print('MSE:', metrics.mean_squared_error(y_test, rf_prediction))

In [None]:
plt.scatter(y_test,rf_prediction)

**Using Standard Scaler**:

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
train.head(2)

In [None]:
continuous_features= ['temp','atemp', 'humidity', 'windspeed', 'count']
train_copy = train.copy()
for col in continuous_features:
    transf = train_copy[col].values.reshape(-1,1)
    scaler = preprocessing.StandardScaler().fit(transf)
    train_copy[col] = scaler.transform(transf)

In [None]:
X = train_copy.drop(['count', 'casual', 'registered'], axis =1)
y = train_copy['count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=2)

In [None]:
# Fit the feature scaler on the training split only and reuse it on the test split
transformer = StandardScaler().fit(X_train)
standard_X_train = transformer.transform(X_train)
standard_X_test = transformer.transform(X_test)

y_train= y_train.values.reshape(-1,1)
y_test= y_test.values.reshape(-1,1)

# Same for the target: one scaler, fitted on the training target
transformer = StandardScaler().fit(y_train)
standard_y_train = transformer.transform(y_train)
standard_y_test = transformer.transform(y_test)

In [None]:
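rf = RandomForestRegressor(n_estimators=100)
# Train on the standardised arrays so that training and prediction use the same scale
rf.fit(standard_X_train, standard_y_train.ravel())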

In [None]:
rf_prediction = rf.predict(standard_X_test)
print('MSE:', metrics.mean_squared_error(standard_y_test, rf_prediction))

In [None]:
plt.scatter(standard_y_test,rf_prediction)

## Submission 1:

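For this, therefore, we will use MinMaxScaler, as it gave the lowest MSE and the tightest scatter plot of the three. Bear in mind that each MSE above is measured in the scaled units of its own scaler, so the figures are only a rough guide. This time, let's also try the dataset without outliers: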
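
To compare models on an equal footing, predictions can be mapped back to rental counts before scoring. The sketch below does this on made-up numbers and also computes RMSLE, which, to the best of our knowledge, is the metric this competition is scored on. The helper `rmsle` and all values are ours and purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; log1p keeps zero counts valid."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up hourly counts and made-up predictions expressed in scaled units
y_true = np.array([[10.0], [50.0], [200.0]])
sc_y = MinMaxScaler().fit(y_true)
pred_scaled = sc_y.transform(np.array([[12.0], [45.0], [210.0]]))

# Map the predictions back to rental counts before scoring
pred_counts = sc_y.inverse_transform(pred_scaled)
mse_counts = np.mean((pred_counts - y_true) ** 2)
score = rmsle(y_true, pred_counts)
print(f"MSE in counts: {mse_counts:.1f}, RMSLE: {score:.4f}")
```

Scores computed this way are in the same units whichever scaler was used during training.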

In [None]:
X = train_without_outliers[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp','humidity', 'year', 
                            'month', 'day', 'hour', 'weekday','windspeed']]
y = train_without_outliers['count']

Let's also decrease the test size:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1)

In [None]:
y_train= y_train.values.reshape(-1,1)
y_test= y_test.values.reshape(-1,1)

sc_X = MinMaxScaler()
sc_y = MinMaxScaler()
# Features go through sc_X and the target through sc_y; both are fitted on the training split only
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)

In [None]:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

In [None]:
rf_prediction = rf.predict(X_test)

In [None]:
print('MSE:', metrics.mean_squared_error(y_test, rf_prediction))

In [None]:
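# Note: this refits the MinMaxScaler on the test features instead of reusing the training fit.
# The continuous columns of train_without_outliers were standardised in Part IV while test holds
# raw values, so the two only share a common [0, 1] range because each is scaled separately.
test[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp','humidity', 'year', 'month', 'day', 'hour',
     'weekday','windspeed']] = sc_X.fit_transform(test[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp','humidity', 
                                                          'year', 'month', 'day', 'hour', 'weekday','windspeed']])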

In [None]:
test_pred= rf.predict(test[['season', 'holiday', 'workingday', 
                            'weather', 'temp', 'atemp','humidity', 'year', 'month', 'day', 
                            'hour', 'weekday','windspeed']])

In [None]:
test_pred=test_pred.reshape(-1,1)
test_pred.shape

In [None]:
test_pred

In [None]:
test_pred = sc_y.inverse_transform(test_pred)

In [None]:
test_pred

In [None]:
test_pred = pd.DataFrame(test_pred, columns=['count'])

In [None]:
submission1 = pd.concat([test['datetime'], test_pred],axis=1)

In [None]:
submission1.head()

In [None]:
submission1.dtypes

In [None]:
submission1['count'] = submission1['count'].astype('int')

In [None]:
submission1.to_csv('submission1.csv', index=False)

**Score:** 0.49759 (private leaderboard for now)

## Submission 2:

In [None]:
X = train_without_outliers[['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp','humidity', 'year', 
                            'month', 'day', 'hour', 'weekday','windspeed']]
y = train_without_outliers['count']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1)

In [None]:
x_scaler = RobustScaler()
y_scaler = RobustScaler()

# Fit on the training split only and reuse the fitted scalers on the held-out split
rescaled_X_train = x_scaler.fit_transform(X_train)
rescaled_X_test = x_scaler.transform(X_test)

y_train= y_train.values.reshape(-1,1)
y_test= y_test.values.reshape(-1,1)

rescaled_y_train = y_scaler.fit_transform(y_train)
rescaled_y_test = y_scaler.transform(y_test)

In [None]:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(rescaled_X_train, rescaled_y_train.ravel())

In [None]:
rf_prediction = rf.predict(rescaled_X_test)

In [None]:
feature_cols = ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp','humidity', 'year',
                'month', 'day', 'hour', 'weekday','windspeed']

# The test features were overwritten with MinMax-scaled values for Submission 1, so x_scaler
# cannot be reused on them directly. RobustScaler output does not change under a prior linear
# rescaling of a column, so a RobustScaler fitted on the test features puts them on the same
# kind of scale the forest was trained on (not on the MinMax scale used before).
test_scaled = RobustScaler().fit_transform(test[feature_cols])

test_pred= rf.predict(test_scaled)

In [None]:
test_pred=test_pred.reshape(-1,1)
test_pred.shape

In [None]:
# Undo the target scaling with the scaler that was fitted on the training target
test_pred = y_scaler.inverse_transform(test_pred)

In [None]:
test_pred

## Some Comments:

* Doing well with this bike demand prediction model seems to come down to how we treat the outliers.
* This could be the nature of demand data, in which outliers can be present and extremely influential in our models.
* We looked at different preprocessing options and found that MinMaxScaler gave the best results in our runs.
* With RobustScaler, for example, we got values that were very far off.
* It would be interesting to find out in more detail why RobustScaler was so much less effective than MinMaxScaler here. One thing to rule out first is the scaling workflow itself: the features must be transformed with the scaler fitted on the training features, and predictions must be inverse-transformed with the scaler fitted on the training target.
* The code/idea for creating an outlier-free training set came from other submissions; I will look to update this in future commits to see if there are better ways of doing it.
<br>

**In any case, future work on this will need to pay more attention to outliers and how to deal with them effectively.**
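
One way to keep the scaling steps consistent in future iterations is to let scikit-learn manage them. The sketch below wraps a feature scaler and the forest in a pipeline and uses `TransformedTargetRegressor` to scale the target and undo that scaling at prediction time. The data are synthetic and the estimator settings are arbitrary choices of ours, so this is an outline of the idea rather than the approach used above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 40, size=(300, 2))                    # synthetic stand-ins for two weather features
y = 50 + 5 * X[:, 0] + rng.normal(0, 5, size=300)        # synthetic "count"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

model = TransformedTargetRegressor(
    # Feature scaling is fitted on the training data inside the pipeline
    regressor=make_pipeline(MinMaxScaler(),
                            RandomForestRegressor(n_estimators=50, random_state=2)),
    # Target scaling is fitted on y_train and inverted automatically in predict()
    transformer=MinMaxScaler(),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)                              # already back in original units
print(pred[:5])
```

With this set-up each scaler is fitted once on the training data, and predictions always come back in the units of the original target.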