# Purpose

In this notebook I created an XGB Regressor model that predicts the number of bikes rented in a given hour of a given day. After some exploratory data analysis, I engineered a few features for dayofweek, month, and hour, which helped with predictions. The XGB Regressor is fairly easy to set up, and I hope it helps others get started. Let me know if you have any suggestions or questions!

# Initial Setup

In [None]:
### Functions used in notebook

def correlation_table(df, width, height):
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    # Correlation matrix of the numeric columns only (datetime/object columns are skipped)
    corr = df.select_dtypes(include='number').corr()
    # Plot figsize
    fig, ax = plt.subplots(figsize=(width, height))
    # Mask the upper triangle, including the diagonal of self-correlations
    dropSelf = np.zeros_like(corr, dtype=bool)
    dropSelf[np.triu_indices_from(dropSelf)] = True

    # Generate heat map, allow annotations and place floats in map
    sns.heatmap(corr, cmap="RdBu", annot=True, fmt=".2f", mask=dropSelf,
                xticklabels=corr.columns, yticklabels=corr.columns, ax=ax,
                linewidths=.5, cbar_kws={"shrink": .7},
                vmin=-1, vmax=1, center=0)
    plt.title('Correlation HeatMap', fontsize=14)
    plt.show()
    
    
def unistats(df):
    import pandas as pd
    output_df = pd.DataFrame(columns=['Numeric', 'Count', 'Unique', 'Missing', 'Mean', 'Mode',
                                      'Min', 'Max', 'Stdev', 'Q1', 'Median', 'Q3', 'Skew', 'Kurt'])

    for col in df:
        numeric = pd.api.types.is_numeric_dtype(df[col])
        if numeric:
            # Full set of summary statistics for numeric columns
            output_df.loc[col] = [True, df[col].count(), df[col].nunique(), df[col].isnull().sum(),
                                  df[col].mean(), df[col].mode().values[0],
                                  df[col].min(), df[col].max(), df[col].std(), df[col].quantile(.25),
                                  df[col].quantile(.5), df[col].quantile(.75), df[col].skew(), df[col].kurt()]
        else:
            # Non-numeric columns only get counts and the mode; '-' marks statistics that do not apply
            output_df.loc[col] = [False, df[col].count(), df[col].nunique(), df[col].isnull().sum(),
                                  '-', df[col].mode().values[0],
                                  '-', '-', '-', '-', '-', '-', '-', '-']

    return output_df.sort_values(by=['Numeric', 'Skew', 'Unique'], ascending=False)


def get_outlier_minmax(col):
    import pandas as pd
    # `lower`/`upper` avoid shadowing the built-in min() and max()
    if pd.api.types.is_numeric_dtype(col):
        if col.skew() > 1 or col.skew() < -1:
            # Skewed column: Tukey fences at 1.5 * IQR beyond the quartiles
            q1 = col.quantile(.25)
            q3 = col.quantile(.75)
            lower = q1 - (1.5 * (q3 - q1))
            upper = q3 + (1.5 * (q3 - q1))
            theory = 'Tukey 1.5IQR'
        else:
            # Roughly symmetric column: three standard deviations from the mean
            lower = col.mean() - (col.std() * 3)
            upper = col.mean() + (col.std() * 3)
            theory = '3 σ from μ'
        lower_count = (col < lower).sum()
        upper_count = (col > upper).sum()
    else:
        # Non-numeric column: report the extreme values and how often they occur
        lower = col.min()
        upper = col.max()
        lower_count = (col == lower).sum()
        upper_count = (col == upper).sum()
        theory = "Categorical"

    return lower, lower_count, upper, upper_count, theory



def detect_outliers(df, method='auto'):
    # Note: `method` is currently unused; the rule is picked per column from its skewness
    import pandas as pd

    summary_table = pd.DataFrame(columns=['total values', 'outlier min', 'count below', 'outlier max', 'count above', 'method'])

    # Loop through each column in the dataframe that is numeric, not binary, and not empty
    for col in df:
        if pd.api.types.is_numeric_dtype(df[col]) and (len(df[col].value_counts()) > 0) and not all(df[col].value_counts().index.isin([0, 1])):
            # Get the bounds, the counts outside them, and the rule used
            lower, lower_count, upper, upper_count, theory = get_outlier_minmax(df[col])
            # Place them in a summary df together with the number of non-null values
            summary_table.loc[col] = (df[col].count(), lower, lower_count, upper, upper_count, theory)
    return summary_table

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from category_encoders import TargetEncoder


test = pd.read_csv('../input/bike-sharing-demand/test.csv', parse_dates=['datetime'])
train = pd.read_csv('../input/bike-sharing-demand/train.csv', parse_dates=['datetime'])
train.drop(columns=['registered','casual','atemp'], inplace=True)
test.drop(columns=['atemp'],inplace=True)
train['workingday'] = train['workingday'].astype(object)
test['workingday'] = test['workingday'].astype(object)

train['weather'] = train['weather'].astype(object)
test['weather'] = test['weather'].astype(object)

# Exploratory Data Analysis


Above I dropped `registered` and `casual` (they are not available in the test set) and `atemp`, none of which I expected to be all that helpful in prediction.

Then I made a correlation table, which tells me that my remaining variables are not too collinear.

In [None]:
correlation_table(train, 10, 10)

Next I used the pandas describe() function, as well as my own univariate stats function that gives me an overview of the data.

Holiday is highly skewed, but that is expected for a binary flag. All the other features are fairly normally distributed.

In [None]:
train.describe()

In [None]:
unistats(train) 

In [None]:
unistats(test)

A simple bar plot showing bike count by season, using the dataset's season codes:

 1 = Spring
 
 2 = Summer
 
 3 = Fall
 
 4 = Winter
 

In [None]:
bars = ['Spring', 'Summer', 'Fall', 'Winter']
x_ticks = np.arange(len(bars))
plt.bar(train['season'], train['count'])
plt.xticks(1+x_ticks, bars)


plt.show()
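Because `plt.bar` is given one row per hour here, bars for the same season are drawn on top of each other, so the visible height is the largest hourly count in that season. Here is a minimal sketch of an aggregated alternative, assuming the mean hourly count per season is the quantity of interest. The toy frame and the `season_names` mapping are illustrative stand-ins for `train`, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the training data (made-up values)
toy = pd.DataFrame({
    'season': [1, 1, 2, 2, 3, 3, 4, 4],
    'count':  [10, 30, 100, 140, 200, 220, 90, 110],
})

# Hypothetical mapping from season code to label
season_names = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}

# Average hourly count per season, indexed by season name
season_mean = (
    toy.groupby('season')['count']
       .mean()
       .rename(index=season_names)
)
print(season_mean)
```

With the real data, `plt.bar(season_mean.index, season_mean.values)` would then draw exactly one bar per season.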


# Feature engineering

I extracted dayofweek, month, and hour from the datetime column.



In [None]:
test['dayofweek'] = test['datetime'].dt.day_name() # weekday name, e.g. 'Monday'
train['dayofweek'] = train['datetime'].dt.day_name()


test['month'] = test['datetime'].dt.month_name() # month name, e.g. 'January'
train['month'] = train['datetime'].dt.month_name()

test['hour'] = test['datetime'].dt.hour.astype('object')
train['hour'] = train['datetime'].dt.hour.astype('object')
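To make the output of these accessors concrete: `day_name()` and `month_name()` return strings, while `hour` and `dayofweek` return integers (with `dayofweek`, Monday = 0 and Sunday = 6). Here is a small sketch on two made-up timestamps. The `dayofweek_num` column is only for comparison and is not used in this notebook.

```python
import pandas as pd

# Two illustrative timestamps (not taken from the dataset)
stamps = pd.Series(pd.to_datetime(['2011-01-01 05:00:00', '2011-07-04 17:00:00']))

features = pd.DataFrame({
    'dayofweek': stamps.dt.day_name(),     # weekday name as a string
    'month': stamps.dt.month_name(),       # month name as a string
    'hour': stamps.dt.hour,                # integer hour, 0-23
    'dayofweek_num': stamps.dt.dayofweek,  # integer, Monday = 0, Sunday = 6
})
print(features)
```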


dayofweek and month turned out to be less helpful than I hoped, so I dropped them and kept only hour.


In [None]:
test.drop(columns=['dayofweek','month'],inplace=True)
train.drop(columns=['dayofweek','month'],inplace=True)

# Data Cleaning

I have a function that flags outliers based on the distribution of each column: Tukey's 1.5 × IQR fences for skewed columns, and three standard deviations from the mean otherwise.

After identifying outliers, I removed them and got a much worse score, so I decided to keep the outliers in the data.

In [None]:
detect_outliers(train) 

In [None]:
detect_outliers(test)
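The removal step itself is not shown in this notebook. Here is a minimal sketch of how rows outside the reported bounds could be dropped, assuming the same rule as `get_outlier_minmax` (Tukey fences when the absolute skew exceeds 1, otherwise mean ± 3 standard deviations). `outlier_bounds` and `drop_outliers` are hypothetical helpers and the toy data is made up.

```python
import pandas as pd

def outlier_bounds(col):
    """Return (lower, upper) using the same rule as get_outlier_minmax."""
    if abs(col.skew()) > 1:
        q1, q3 = col.quantile(.25), col.quantile(.75)
        iqr = q3 - q1
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return col.mean() - 3 * col.std(), col.mean() + 3 * col.std()

def drop_outliers(df, columns):
    """Keep only rows that fall inside the bounds for every listed column."""
    keep = pd.Series(True, index=df.index)
    for name in columns:
        lower, upper = outlier_bounds(df[name])
        keep &= df[name].between(lower, upper)
    return df[keep]

# Toy column with one obviously extreme value (made-up numbers)
toy = pd.DataFrame({'windspeed': [8, 9, 10, 11, 12, 9, 10, 11, 10, 90]})
cleaned = drop_outliers(toy, ['windspeed'])
print(len(toy), len(cleaned))
```

Only the training data would be filtered this way, since every test row still needs a prediction.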

I tried one-hot encoding (kept commented out 2 cells below), but it didn't work out too well, so I switched to target encoding with TargetEncoder.

In [None]:
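encoder = TargetEncoder()
train['season'] = encoder.fit_transform(train['season'], train['count'])
# Reuse the encoder fitted on train so test gets the same season encoding
test['season'] = encoder.transform(test['season'])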

In [None]:
# train = pd.get_dummies(train, prefix = ['season'], columns=['season'], drop_first = True)
# train = pd.get_dummies(train, prefix = ['hour'], columns=['hour'], drop_first = True)

# test = pd.get_dummies(test, prefix = ['season'], columns=['season'], drop_first = True)
# test = pd.get_dummies(test, prefix = ['hour'], columns=['hour'], drop_first = True)
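Target encoding replaces each category with a statistic of the target for that category, learned on the training data and then applied unchanged to the test data. Here is a minimal sketch with plain pandas, assuming an unsmoothed per-category mean. The `TargetEncoder` from `category_encoders` additionally blends each mean with the global mean, so its numbers will differ. Its documentation also says that when `cols` is not given only string columns are transformed, so it may be worth passing `cols=['season']` explicitly for an integer column. The toy frames are made up.

```python
import pandas as pd

toy_train = pd.DataFrame({
    'season': [1, 1, 2, 2, 3, 3],
    'count':  [10, 30, 100, 140, 200, 220],
})
toy_test = pd.DataFrame({'season': [1, 3, 4]})  # season 4 is unseen in toy_train

# Learn the mapping on the training data only
season_means = toy_train.groupby('season')['count'].mean()
global_mean = toy_train['count'].mean()

# Apply the same mapping to both frames; unseen categories fall back to the global mean
toy_train['season_te'] = toy_train['season'].map(season_means)
toy_test['season_te'] = toy_test['season'].map(season_means).fillna(global_mean)
print(toy_test)
```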

# Model Tuning and Performance

In [None]:
X_train = train.drop(columns=['count'])
y_train = np.log1p(train['count'])

X_test = test


In [None]:
xgb_model = xgb.XGBRegressor()

In [None]:
X_train['workingday'] = pd.to_numeric(X_train['workingday'])
X_train['weather'] = pd.to_numeric(X_train['weather'])
X_train['hour'] = pd.to_numeric(X_train['hour'])

test['workingday'] = pd.to_numeric(test['workingday'])
test['weather'] = pd.to_numeric(test['weather'])
test['hour'] = pd.to_numeric(test['hour'])

In [None]:
xgb_model.fit(X_train.drop(columns=['datetime']), y_train)

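Scoring the model on the training data. For a regressor, `score` returns R², here computed on the log1p-transformed counts, so it measures how well the model fits the data it has already seen rather than how well it generalizes.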

In [None]:
score = xgb_model.score(X_train.drop(columns=['datetime']),y_train)
print(score)

In [None]:
test['Prediction'] = np.expm1(xgb_model.predict(test.drop(columns=['datetime']))).clip(0)
filename = 'submission.csv'

In [None]:
pd.DataFrame({'datetime': test['datetime'], 'Count': test['Prediction']}).to_csv(filename, index=False)


# Score

With this submission I achieved a score of 0.43352 (RMSLE, the competition metric; lower is better).
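The Bike Sharing Demand leaderboard is scored with root mean squared logarithmic error (RMSLE), which is why the model is trained on `log1p(count)`: ordinary RMSE on the log-transformed target is the same quantity. Here is a minimal sketch of the metric, assuming the score above is that RMSLE. The helper name `rmsle` and the numbers are made up for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up actual and predicted hourly counts
actual = np.array([10, 100, 250])
predicted = np.array([12, 90, 300])
print(rmsle(actual, predicted))
```

Since no labels exist for the test set, this can only be computed locally on a held-out slice of `train`.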

# Alternative Approaches

In the future, I'd try thinking up more features to engineer as well as do more thorough encoding of categorical variables. 

Additionally, I could probably tune the model's hyperparameters more carefully and achieve better results.

Overall, I had fun with this competition and I hope you do too!