# BIKESHARE EDA & PREPROCESSING

You are provided with hourly rental data spanning two years. For this competition, the training set comprises the first 19 days of each month, while the test set covers the 20th through the end of each month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
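
The day-of-month split described above can be sketched on synthetic timestamps. This is only an illustration: the `hours` frame below is made up and is not the competition data, which arrives already split.

```python
import pandas as pd

# Synthetic hourly timestamps for one 31-day month (illustrative, not the real files)
hours = pd.DataFrame(
    {"datetime": pd.date_range("2011-01-01", "2011-01-31 23:00", freq="h")}
)

# Days 1-19 belong to the training period, day 20 onward to the test period
is_train = hours["datetime"].dt.day <= 19
train_part = hours[is_train]
test_part = hours[~is_train]

print(len(train_part), len(test_part))
```

Because the test period always follows the training period within each month, any feature built from the training rows of a month uses only information from earlier in that month.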

Data Fields
- datetime - hourly date + timestamp  
- season 
    - 1 = spring, 
    - 2 = summer, 
    - 3 = fall, 
    - 4 = winter 
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals <- PREDICTION

## Notebook Imports

In [None]:
import sys
sys.path.append('..')

In [None]:
import pandas as pd
from src import DataLoader

## Extracting the Train and Test Data

In [None]:
loader = DataLoader()

train_df, test_df = loader.get_train_test_data()

In [None]:
# Copy the columns needed for the trend plots so later edits do not touch train_df
trend_df = train_df[['datetime', 'casual', 'registered', 'count', 'season']].copy()

In [None]:
# Drop the casual and registered columns
train_df.drop(['casual', 'registered'], axis=1, inplace=True)

train_df.head()

In [None]:
# Saving checkpoint for the no feature engineering autogluon model predictions
loader.save_feature_engineered_data(train_df, test_df, "no_data_engineering")

## Inspecting Null Values

No null values found.

In [None]:
test_df.isnull().sum()
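
The cell above checks only the test set. A minimal sketch of checking both frames side by side follows; the `toy_train` and `toy_test` frames are hypothetical stand-ins and deliberately contain missing values so that the output is not all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for train_df and test_df, with a few missing values
toy_train = pd.DataFrame({"temp": [9.8, np.nan, 13.1], "humidity": [81, 80, 75]})
toy_test = pd.DataFrame({"temp": [10.7, 11.5], "humidity": [56, np.nan]})

# One column of null counts per frame, indexed by feature name
null_counts = pd.DataFrame(
    {
        "train": toy_train.isnull().sum(),
        "test": toy_test.isnull().sum(),
    }
)
print(null_counts)
```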

## Inspecting Distribution and Correlation

The features most strongly correlated with the count of bike rentals are the temperature and the humidity. `temp` and `atemp` (the "feels like" temperature) are highly correlated with each other, so one of them is redundant. Since `atemp` is the less correlated with `count`, it is the one we drop. `windspeed` and `humidity` are also correlated, but less strongly than `temp` and `atemp`.
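
As a hedged illustration of the redundancy check, the sketch below finds highly correlated pairs on synthetic data. It masks the correlation matrix to its strict upper triangle so each pair is reported once. This is a variant of, not a copy of, the approach in the cells that follow, and the toy columns only borrow the dataset's names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(20, 8, 500)

# Synthetic features: atemp is built to track temp closely, humidity is independent
toy = pd.DataFrame(
    {
        "temp": temp,
        "atemp": temp * 1.1 + rng.normal(0, 1, 500),
        "humidity": rng.uniform(0, 100, 500),
    }
)

corr = toy.corr().abs()

# Keep the strict upper triangle so (a, b) appears once and the diagonal is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```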

In [None]:
_ = train_df.drop(columns=["count"]).hist(figsize=(20, 10), bins=20, alpha=0.5)

In [None]:
# Create heatmap of correlations
correlation_df = train_df.drop(columns=["datetime"]).corr()
correlation_df.style.background_gradient(cmap='coolwarm')

In [None]:
# Identify correlation higher than 0.9
correlation_threshold = 0.9
correlation_matrix = correlation_df.abs() > correlation_threshold

# Find pairs of features with high correlation
correlation_matrix = correlation_matrix.stack()
correlation_matrix = correlation_matrix[correlation_matrix]
correlation_matrix = correlation_matrix.reset_index()
correlation_matrix.columns = ["Feature 1", "Feature 2", "Correlation"]

# Where Feature 1 is not equal to Feature 2
correlation_matrix = correlation_matrix[correlation_matrix["Feature 1"] != correlation_matrix["Feature 2"]]
display(correlation_matrix)

# Identify which feature correlates more with count
correlation_with_count = correlation_df["count"].abs().sort_values(ascending=False)
correlation_with_count.to_frame().style.background_gradient(cmap='hot')

In [None]:
# Dropping atemp since it is highly correlated with temp
train_df.drop(columns=["atemp"], inplace=True)
test_df.drop(columns=["atemp"], inplace=True)

## Inspecting Trend of Data

The data show seasonality, along with an upward trend and increasing variability over time.
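
A small sketch of the two smoothing tools used in the plot below, on a synthetic series rather than the rental data. It assumes a linear trend plus a 24-hour cycle plus noise, and it swaps in `np.polyfit` for the least-squares line where the notebook uses scikit-learn's `LinearRegression`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
t = np.arange(n)

# Synthetic hourly series: upward trend + daily cycle + noise (illustrative only)
counts = pd.Series(
    50 + 0.2 * t + 30 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 10, n)
)

# Rolling mean over 100 observations; the first 99 values are NaN
smoothed = counts.rolling(window=100).mean()

# Degree-1 least-squares fit; polyfit returns coefficients highest degree first
slope, intercept = np.polyfit(t, counts, deg=1)
trend = intercept + slope * t

print(f"estimated slope per hour: {slope:.3f}")
```

The rolling mean averages out the daily cycle, while the fitted line summarizes the overall direction with a single slope.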

In [None]:
# Plot count per season, setting count to zero for rows that belong to other seasons
season_styles = {
    1: ("Spring", "green"),
    2: ("Summer", "orange"),
    3: ("Fall", "goldenrod"),
    4: ("Winter", "cornflowerblue"),
}

for season_value, (label, color) in season_styles.items():
    # Keep count where the season matches, otherwise use 0
    season_count = train_df["count"].where(train_df["season"] == season_value, 0)
    season_count.plot(figsize=(20, 10), label=label, color=color)


# Smoothing the trend
trend_df.set_index("datetime")["count"].rolling(window=100).mean().plot(figsize=(20, 10), label="Count Smoothed", color="blueviolet")

# Least-squares trend line of count
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(len(trend_df)).reshape(-1, 1)
y = trend_df["count"]

model = LinearRegression()

model.fit(X, y)

# assign returns a new DataFrame, which avoids writing into a slice of train_df
trend_df = trend_df.assign(trend=model.predict(X))

trend_df.set_index("datetime")["trend"].plot(figsize=(20, 10), color="red", label="Trend", linestyle="--")

# Add legend to the plot
import matplotlib.pyplot as plt

_ = plt.legend()


## Inspecting Category Types

`weather` and `season` are categorical variables stored as integer codes. They need to be either converted to dummy variables or cast to the `category` dtype. Here we cast them to `category`.
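
The two options can be compared on a toy frame. This sketch uses plain pandas calls and does not reproduce what `loader.set_as_category` does internally, which is not shown in this notebook.

```python
import pandas as pd

# Toy frame with integer-coded categorical columns (illustrative values)
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "weather": [1, 1, 2, 3, 2]})

# Option 1: keep one column per feature and mark it as categorical
as_category = toy.astype({"season": "category", "weather": "category"})
print(as_category.dtypes)

# Option 2: expand each feature into one indicator column per observed level
as_dummies = pd.get_dummies(toy, columns=["season", "weather"])
print(as_dummies.columns.tolist())
```

The `category` dtype keeps the frame narrow and leaves the encoding to the model, while dummy variables make the encoding explicit at the cost of more columns.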

In [None]:
train_df.head()

In [None]:
train_df.dtypes

In [None]:
# Testing the function to use on the bikeshare notebook
loader.set_as_category(columns=["season", "weather"])

In [None]:
train_df.dtypes

## Separating the datetime column into date, hour and day of week

In [None]:
train_df

In [None]:
# Create day-of-week feature (Monday = 0, Sunday = 6)
train_df["datetime"] = pd.to_datetime(train_df["datetime"])
train_df["day_of_week"] = train_df["datetime"].dt.dayofweek.astype("category")

test_df["datetime"] = pd.to_datetime(test_df["datetime"])
test_df["day_of_week"] = test_df["datetime"].dt.dayofweek.astype("category")

# Separating hour and date from datetime
train_df["hour"] = train_df["datetime"].dt.hour
train_df["date"] = train_df["datetime"].dt.date

test_df["hour"] = test_df["datetime"].dt.hour
test_df["date"] = test_df["datetime"].dt.date

# Dropping datetime column
train_df.drop(columns=["datetime"], inplace=True)
test_df.drop(columns=["datetime"], inplace=True)

train_df.head()

## Saving checkpoint for hyperparameter tuning

In [None]:
loader.save_feature_engineered_data(train_df, test_df, checkpoint_name="hyperparameter_tuning")

## Engineering New Features

### Segmenting Into Day Periods

In [None]:
train_df.hour.describe()

In [None]:
def create_dummy_bins(train_df, test_df, column, bins, labels):
    """
    Bin a numeric column and create dummy variables for the bins in both DataFrames.
    :param train_df: Training DataFrame
    :param test_df: Test DataFrame, binned with the same edges as train_df
    :param column: Column to bin
    :param bins: Bin edges passed to `pd.cut` (right-inclusive, lowest edge included)
    :param labels: Labels for the bins
    :return: Tuple (train_df, test_df) with one dummy column per bin

    Sets a global variable trend_df to the DataFrame with categories for the trend graphs.
    Use `show_line_trend()` to show the trend graph
    """
    train_df[f"{column}_bins"] = pd.cut(train_df[column], bins=bins, labels=labels, include_lowest=True)
    test_df[f"{column}_bins"] = pd.cut(test_df[column], bins=bins, labels=labels, include_lowest=True)

    global trend_df
    # Copy so that adding columns to trend_df does not write into a slice of train_df
    trend_df = train_df[["date", f"{column}_bins", "count"]].copy()
    trend_df["date"] = pd.to_datetime(trend_df["date"]).dt.date
    train_df = pd.get_dummies(train_df, columns=[f"{column}_bins"], drop_first=False)
    test_df = pd.get_dummies(test_df, columns=[f"{column}_bins"], drop_first=False)

    return train_df, test_df

In [None]:
train_df, test_df = create_dummy_bins(train_df, test_df, "hour", bins=[0, 6, 12, 18, 24], labels=["night", "morning", "afternoon", "evening"])
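
It helps to see which hours fall into which period with these edges. The sketch below applies the same `pd.cut` arguments to the 24 possible hour values. The left-closed variant at the end is an alternative shown for contrast, not the choice made in this notebook.

```python
import pandas as pd

hours = pd.Series(range(24), name="hour")
edges = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]

# Same arguments as the notebook: right-inclusive bins with the lowest edge included,
# so the intervals are [0, 6], (6, 12], (12, 18], (18, 24]
right_closed = pd.cut(hours, bins=edges, labels=labels, include_lowest=True)
print(right_closed.value_counts(sort=False))

# Alternative for contrast: left-closed bins [0, 6), [6, 12), [12, 18), [18, 24)
left_closed = pd.cut(hours, bins=edges, labels=labels, right=False)
print(left_closed.value_counts(sort=False))
```

With right-inclusive bins the periods are uneven: night covers hours 0 to 6 (seven hours) and evening covers hours 19 to 23 (five hours). The left-closed variant gives six hours per period.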

In [None]:
# heat map of the correlation
def show_correlation_heatmap():
    """
    Show the correlation heatmap of the train_df DataFrame, dropping the date column but not inplace.
    :params: None
    :return: None
    """
    correlation_df = train_df.drop(columns=["date"]).corr()
    correlation_df = correlation_df[["count"]].abs().sort_values("count", ascending=False).style.background_gradient(cmap='coolwarm')
    
    display(correlation_df)

show_correlation_heatmap()

In [None]:
# seaborn line plot of day_period
import seaborn as sns

def show_line_trend(hue):
    """
    Show the trend of the count column in the trend_df DataFrame.

    :param hue: Column to use as hue in the seaborn lineplot
    :return: None
    """

    plt.figure(figsize=(20, 5))
    sns.lineplot(data=trend_df, x="date", y="count", hue=hue)

    # Smoothing the count
    trend_df["count_smoothed"] = trend_df["count"].rolling(window=100).mean()

    plt.figure(figsize=(20, 5))
    sns.lineplot(data=trend_df, x="date", y="count_smoothed", hue=hue)

    plt.figure(figsize=(20, 5))
    sns.pointplot(data=trend_df.groupby(['date', hue]).agg({"count": "sum"}).reset_index().query("count > 0"), x="date", y="count", hue=hue)

    plt.show()

show_line_trend("hour_bins")

### Categorizing Temperature

In [None]:
train_df.temp.describe()

In [None]:
train_df, test_df = create_dummy_bins(train_df, test_df, "temp", bins=[0, 21, 26, 50], labels=["cold", "warm", "hot"])
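
Two properties of binning with fixed edges are worth checking: values outside the edges, and bins that never occur in one of the frames. The sketch below uses hypothetical temperatures, including a below-zero value that may not occur in the real data, with the same edges and labels as the cell above.

```python
import pandas as pd

bins, labels = [0, 21, 26, 50], ["cold", "warm", "hot"]

# Hypothetical values: the test frame has one reading below the lowest edge
# and no readings in the "warm" or "hot" ranges
toy_train = pd.DataFrame({"temp": [5.0, 23.0, 30.0]})
toy_test = pd.DataFrame({"temp": [-2.0, 10.0]})

for frame in (toy_train, toy_test):
    frame["temp_bins"] = pd.cut(frame["temp"], bins=bins, labels=labels, include_lowest=True)

train_dummies = pd.get_dummies(toy_train, columns=["temp_bins"])
test_dummies = pd.get_dummies(toy_test, columns=["temp_bins"])

# pd.cut returns a categorical that carries every label, so both frames get the same columns
print(list(train_dummies.columns) == list(test_dummies.columns))

# The out-of-range value becomes NaN and ends up with all indicator columns off
print(test_dummies)
```

An all-off row is indistinguishable from a missing value after encoding, so it is worth confirming with `describe()` that the edges cover the observed range.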

In [None]:
show_correlation_heatmap()

In [None]:
trend_df.temp_bins.value_counts()

In [None]:
show_line_trend("temp_bins")

### Categorizing Wind Speed by Buckets

In [None]:
train_df.windspeed.describe()

In [None]:
train_df, test_df = create_dummy_bins(train_df, test_df, "windspeed", bins=[0, 20, 30, 60], labels=["low", "medium", "high"])

In [None]:
show_correlation_heatmap()

In [None]:
show_line_trend("windspeed_bins")

### Categorizing Humidity by Buckets
- Low Humidity: 0-62
- High Humidity: above 62, up to 100

In [None]:
train_df.humidity.describe()

In [None]:
train_df, test_df = create_dummy_bins(train_df, test_df, "humidity", bins=[0, 62, 100], labels=["low", "high"])

In [None]:
show_correlation_heatmap()

In [None]:
show_line_trend("humidity_bins")

## Saving extra feature engineered data

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
loader.help()

In [None]:
loader.save_feature_engineered_data(train_df, test_df, checkpoint_name="extra_feature_engineering")

In [None]:
train_df.head()

In [None]:
_ = train_df.drop(columns=["count"]).hist(figsize=(20, 10), bins=20, alpha=0.5)

# END OF NOTEBOOK