### Explore train data

You will work with another Kaggle competition called "Store Item Demand Forecasting Challenge". In this competition, you are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items in 10 different stores.

To begin, let's explore the train data for this competition. For faster performance, you will work with a subset of the train data containing only a single month of history.

Your initial goal is to read the input data and take a first look at it.

In [1]:
# Import pandas
import pandas as pd

# Read train data
train = pd.read_csv('data/train.csv')

# Look at the shape of the data
print('Train shape:', train.shape)

# Look at the head() of the data
print(train.head())

Train shape: (913000, 4)
         date  store  item  sales
0  2013-01-01      1     1     13
1  2013-01-02      1     1     11
2  2013-01-03      1     1     14
3  2013-01-04      1     1     13
4  2013-01-05      1     1     10
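
Beyond `shape` and `head()`, it can help to check how much history the data covers and how many stores and items it contains. The sketch below does this on a tiny hand-made frame (hypothetical values standing in for `data/train.csv`, with the same four columns).

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, same column layout)
toy_train = pd.DataFrame({
    'date': ['2013-01-01', '2013-01-02', '2013-01-01', '2013-01-02'],
    'store': [1, 1, 2, 2],
    'item': [1, 1, 1, 1],
    'sales': [13, 11, 12, 16],
})

# Convert date strings to datetime so min/max are chronological
toy_train['date'] = pd.to_datetime(toy_train['date'])

summary = {
    'first_date': toy_train['date'].min(),
    'last_date': toy_train['date'].max(),
    'n_stores': toy_train['store'].nunique(),
    'n_items': toy_train['item'].nunique(),
}
print(summary)
```

The same four lines should work on the real `train` DataFrame once its `date` column has been converted.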


### Explore test data

Having looked at the train data, let's explore the test data in the "Store Item Demand Forecasting Challenge". Remember that the test dataset generally lacks one column that the train dataset has: the target.

This column, together with the output format, is presented in the sample submission file. Before making any progress in the competition, you should get familiar with the expected output.

So, let's look at the columns of the test dataset and compare them to the train columns. Additionally, let's explore the format of the sample submission. The train DataFrame is available in your workspace.

In [2]:
import pandas as pd

# Read the test data
test = pd.read_csv('data/test.csv')
# Print train and test columns
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())

# Read the sample submission file
sample_submission = pd.read_csv('data/sample_submission.csv')

# Look at the head() of the sample submission
print(sample_submission.head())

Train columns: ['date', 'store', 'item', 'sales']
Test columns: ['id', 'date', 'store', 'item']
   id  sales
0   0     52
1   1     52
2   2     52
3   3     52
4   4     52
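
The comparison of train and test columns can also be done programmatically. A minimal sketch, using the two column lists exactly as printed above (the variable names are illustrative):

```python
# Column lists as printed by the cell above
train_columns = ['date', 'store', 'item', 'sales']
test_columns = ['id', 'date', 'store', 'item']

# Columns present only in train: candidates for the target variable
target_candidates = [c for c in train_columns if c not in test_columns]

# Columns present only in test: usually the row identifier for the submission
id_candidates = [c for c in test_columns if c not in train_columns]

print('Only in train:', target_candidates)
print('Only in test:', id_candidates)
```

Here the column only in train (`sales`) and the column only in test (`id`) are exactly the two columns of the sample submission.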


### Train a simple model

As you determined, you are dealing with a regression problem, so you are now ready to build a model for a subsequent submission. Instead of building the simplest Linear Regression model as in the slides, let's build an out-of-the-box Random Forest model.

You will use the RandomForestRegressor class from the scikit-learn library.

Your objective is to train a Random Forest model with default parameters on the "store" and "item" features.

In [3]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the train data
train = pd.read_csv('data/train.csv')

# Create a Random Forest object
rf = RandomForestRegressor()

# Train a model
rf.fit(X=train[['store', 'item']], y=train['sales'])



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
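
With only "store" and "item" as features, a tree-based model cannot tell apart rows that share the same store-item pair, so its predictions should land close to the mean sales of each pair. The sketch below computes that per-pair mean directly with pandas on a toy frame (hypothetical values). It is meant as a sanity-check baseline under that assumption, not as a replacement for the Random Forest above.

```python
import pandas as pd

# Toy training data (hypothetical values)
toy_train = pd.DataFrame({
    'store': [1, 1, 1, 2, 2],
    'item':  [1, 1, 2, 1, 1],
    'sales': [10, 14, 20, 30, 34],
})

# Mean sales for each (store, item) pair
pair_means = (toy_train
              .groupby(['store', 'item'], as_index=False)['sales']
              .mean()
              .rename(columns={'sales': 'mean_sales'}))

# Attach the baseline prediction to new rows by merging on the two keys
toy_test = pd.DataFrame({'store': [1, 2], 'item': [1, 1]})
baseline = toy_test.merge(pair_means, on=['store', 'item'], how='left')
print(baseline)
```

If a model's predictions differ wildly from this baseline on the same features, that is usually worth investigating.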

### Prepare a submission

You've already built a model on the training data from the Kaggle Store Item Demand Forecasting Challenge. Now, it's time to make predictions on the test data and create a submission file in the specified format.

Your goal is to read the test data, make predictions, and save them in the format specified in the "sample_submission.csv" file. The rf object you created in the previous exercise is available in your workspace.

Note that from now on and for the rest of the course, the pandas library will always be imported for you and can be accessed as pd.

In [4]:
# Read test and sample submission data
test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')

# Show the head() of the sample_submission
print(sample_submission.head())

# Get predictions for the test set
test['sales'] = rf.predict(test[['store', 'item']])

# Write test predictions using the sample_submission format
test[['id', 'sales']].to_csv('kaggle_submission.csv', index=False)

   id  sales
0   0     52
1   1     52
2   2     52
3   3     52
4   4     52
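
Before uploading, it can be worth checking that the file you wrote really matches the sample submission layout. The helper below is a hypothetical sketch (`check_submission` is not part of pandas or Kaggle tooling); it compares column names, row count, and missing values on toy frames.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the layouts match."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append('column names or order differ')
    if len(submission) != len(sample):
        problems.append('row count differs')
    if submission.isna().any().any():
        problems.append('missing values present')
    return problems

# Toy frames (hypothetical values) in the sample submission layout
sample = pd.DataFrame({'id': [0, 1, 2], 'sales': [52, 52, 52]})
good = pd.DataFrame({'id': [0, 1, 2], 'sales': [13.2, 11.0, 14.5]})
bad = pd.DataFrame({'id': [0, 1], 'sale': [13.2, None]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In practice you would pass `pd.read_csv('kaggle_submission.csv')` and `sample_submission` to the same function.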


### Train XGBoost models

Every machine learning method can potentially overfit. You will see this in an example with XGBoost. Again, you are working with the Store Item Demand Forecasting Challenge. The train DataFrame is available in your workspace.

First, let's train multiple XGBoost models with different sets of hyperparameters using XGBoost's learning API. The single hyperparameter you will change is:

max_depth - maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.

In [5]:
import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])

# Define xgboost parameters
params = {'objective': 'reg:linear',
          'max_depth': 2,
          'silent': 1}

# Train xgboost model
xg_depth_2 = xgb.train(params=params, dtrain=dtrain)

# Define xgboost parameters
params = {'objective': 'reg:linear',
          'max_depth': 8,
          'silent': 1}

# Train xgboost model
xg_depth_8 = xgb.train(params=params, dtrain=dtrain)

# Define xgboost parameters
params = {'objective': 'reg:linear',
          'max_depth': 15,
          'silent': 1}

# Train xgboost model
xg_depth_15 = xgb.train(params=params, dtrain=dtrain)



### Explore overfitting XGBoost

Having trained 3 XGBoost models with different maximum depths, you will now evaluate their quality. To do so, you will measure the quality of each model on both the train data and the test data. As you know by now, the train data is the data the models were trained on. The test data is the next month's sales data, which the models have never seen.

The goal of this exercise is to determine whether any of the trained models is overfitting. To measure the quality of the models, you will use Mean Squared Error (MSE). It is available in sklearn.metrics as the mean_squared_error() function, which takes two arguments: true values and predicted values.

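The train and test DataFrames, together with the 3 trained models (xg_depth_2, xg_depth_8, xg_depth_15), are available in your workspace. Note that the test DataFrame must contain true sales values for the test MSE to be meaningful: if it still holds the sales column assigned from rf.predict() in the submission exercise, the test MSE only measures agreement with the Random Forest predictions.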

In [6]:
from sklearn.metrics import mean_squared_error

dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# For each of 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.predict(dtrain)     
    test_pred = model.predict(dtest)          
    
    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)                  
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))

MSE Train: 607.487. MSE Test: 355.254
MSE Train: 294.466. MSE Test: 42.633
MSE Train: 255.334. MSE Test: 3.645
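
Because the competition's `test.csv` has no `sales` column, one common way to get an honest out-of-sample estimate is to hold out the most recent part of the train data and score the models on it. A minimal sketch, assuming a `date` column as in the train data above; the toy frame and the cutoff date are hypothetical.

```python
import pandas as pd

# Toy daily sales for one store-item pair (hypothetical values)
toy = pd.DataFrame({
    'date': pd.date_range('2017-11-29', periods=6, freq='D'),
    'store': [1] * 6,
    'item': [1] * 6,
    'sales': [20, 22, 19, 25, 24, 23],
})

# Everything before the cutoff is used for fitting, the rest for evaluation
cutoff = pd.Timestamp('2017-12-01')
fit_part = toy[toy['date'] < cutoff]
holdout_part = toy[toy['date'] >= cutoff]

print('Fit rows:', len(fit_part), 'Holdout rows:', len(holdout_part))
```

Splitting by time rather than at random keeps the evaluation closer to the actual task of predicting future sales.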


### Define a competition metric

The competition metric is used by Kaggle to evaluate your submissions. You also need it to measure the performance of different models on a local validation set.

For now, your goal is to manually implement a couple of competition metrics in case they are not available in sklearn.metrics.

In particular, you will define:

Mean Squared Error (MSE) for the regression problem:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Logarithmic Loss (LogLoss) for the binary classification problem:

$$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\right)$$
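
As a quick check of both definitions, consider a tiny hand-picked example (the numbers are illustrative and not taken from the competition data):

```latex
\begin{align*}
y &= (3, 5, 2), \quad \hat{y} = (2, 5, 4): &
\mathrm{MSE} &= \tfrac{1}{3}\left[(3-2)^2 + (5-5)^2 + (2-4)^2\right] = \tfrac{5}{3} \approx 1.667 \\
y &= (1, 0), \quad p = (0.8, 0.4): &
\mathrm{LogLoss} &= -\tfrac{1}{2}\left[\ln 0.8 + \ln(1 - 0.4)\right] = -\tfrac{1}{2}\ln 0.48 \approx 0.367
\end{align*}
```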

In [7]:
import numpy as np

# Import MSE from sklearn
from sklearn.metrics import mean_squared_error

# Define your own MSE function
def own_mse(y_true, y_pred):
    # Raise differences to the power of 2
    squares = np.power(y_true - y_pred, 2)
    # Find mean over all observations
    err = np.mean(squares)
    return err

y_regression_true = np.array([0.69646919,0.28613933,0.22685145,0.55131477,0.71946897,0.42310646,
0.9807642,0.68482974,0.4809319,0.39211752,0.34317802,0.72904971,
0.43857224,0.0596779,0.39804426,0.73799541,0.18249173,0.17545176,
0.53155137,0.53182759,0.63440096,0.84943179,0.72445532,0.61102351,
0.72244338,0.32295891,0.36178866,0.22826323,0.29371405,0.63097612,
0.09210494,0.43370117,0.43086276,0.4936851,0.42583029,0.31226122,
0.42635131,0.89338916,0.94416002,0.50183668,0.62395295,0.1156184,
0.31728548,0.41482621,0.86630916,0.25045537,0.48303426,0.98555979,
0.51948512,0.61289453,0.12062867,0.8263408,0.60306013,0.54506801,
0.34276383,0.30412079,0.41702221,0.68130077,0.87545684,0.51042234,
0.66931378,0.58593655,0.6249035,0.67468905,0.84234244,0.08319499,
0.76368284,0.24366637,0.19422296,0.57245696,0.09571252,0.88532683,
0.62724897,0.72341636,0.01612921,0.59443188,0.55678519,0.15895964,
0.15307052,0.69552953,0.31876643,0.6919703,0.55438325,0.38895057,
0.92513249,0.84167,0.35739757,0.04359146,0.30476807,0.39818568,
0.70495883,0.99535848,0.35591487,0.76254781,0.59317692,0.6917018,
0.15112745,0.39887629,0.2408559,0.34345601])

y_regression_pred = np.array([0.51312815,0.66662455,0.10590849,0.13089495,0.32198061,0.66156434,
0.84650623,0.55325734,0.85445249,0.38483781,0.3167879,0.35426468,
0.17108183,0.82911263,0.33867085,0.55237008,0.57855147,0.52153306,
0.00268806,0.98834542,0.90534158,0.20763586,0.29248941,0.52001015,
0.90191137,0.98363088,0.25754206,0.56435904,0.80696868,0.39437005,
0.73107304,0.16106901,0.60069857,0.86586446,0.98352161,0.07936579,
0.42834727,0.20454286,0.45063649,0.54776357,0.09332671,0.29686078,
0.92758424,0.56900373,0.457412,0.75352599,0.74186215,0.04857903,
0.7086974,0.83924335,0.16593788,0.78099794,0.28653662,0.30646975,
0.66526147,0.11139217,0.66487245,0.88785679,0.69631127,0.44032788,
0.43821438,0.7650961,0.565642,0.08490416,0.58267109,0.8148437,
0.33706638,0.92757658,0.750717,0.57406383,0.75164399,0.07914896,
0.85938908,0.82150411,0.90987166,0.1286312,0.08178009,0.13841557,
0.39937871,0.42430686,0.56221838,0.12224355,0.2013995,0.81164435,
0.46798757,0.80793821,0.00742638,0.55159273,0.93193215,0.58217546,
0.20609573,0.71775756,0.37898585,0.66838395,0.02931972,0.63590036,
0.03219793,0.74478066,0.472913,0.12175436])

print('Sklearn MSE: {:.5f}. '.format(mean_squared_error(y_regression_true, y_regression_pred)))
print('Your MSE: {:.5f}. '.format(own_mse(y_regression_true, y_regression_pred)))



Sklearn MSE: 0.15418. 
Your MSE: 0.15418. 


In [8]:
import numpy as np

# Import log_loss from sklearn
from sklearn.metrics import log_loss

# Define your own LogLoss function
def own_logloss(y_true, prob_pred):
    # Find loss for each observation
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    # Find mean over all observations
    err = np.mean(terms) 
    return -err

y_classification_true = np.array([1,1,0,1,0,1,1,1,0,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,0,0,1,
0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,1,0,1,0,1,1,1,0,1,0,0,0,0,1,1,1,
0,0,0,1,1,1,0,0,0,0,1,0,1,1,0,1,0,0,1,0,0,0,1,0,0,0])

y_classification_pred = np.array([0.2082483,0.4433677,0.71560128,0.41051979,0.19100696,0.96749431
,0.65075037,0.86545985,0.02524236,0.26690581,0.5020711,0.06744864
,0.99303326,0.2364624,0.37429218,0.21401191,0.10544587,0.23247979
,0.30061014,0.63444227,0.28123478,0.36227676,0.00594284,0.36571913
,0.53388598,0.16201584,0.59743311,0.29315247,0.63205049,0.02619661
,0.88759346,0.01611863,0.12695803,0.77716246,0.04589523,0.71099869
,0.97104614,0.87168293,0.71016165,0.95850974,0.42981334,0.87287891
,0.35595767,0.92976365,0.14877766,0.94002901,0.8327162,0.84605484
,0.12392301,0.5964869,0.01639248,0.72118437,0.00773751,0.08482228
,0.22549841,0.87512453,0.36357632,0.53995994,0.56810321,0.22546336
,0.57214677,0.6609518,0.29824539,0.41862686,0.45308892,0.93235066
,0.58749375,0.94825237,0.55603475,0.50056142,0.00353221,0.48088904
,0.927455,0.19836569,0.05209113,0.40677889,0.37239648,0.85715306
,0.02661112,0.92014923,0.680903,0.90422599,0.60752907,0.81195331
,0.33554387,0.34956623,0.38987423,0.75479708,0.36929117,0.24221981
,0.93766836,0.90801108,0.34879732,0.63463807,0.27384221,0.20611513
,0.33633953,0.32709989,0.8822761,0.82230381])

print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))
print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))

Sklearn LogLoss: 1.10801
Your LogLoss: 1.10801
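
One caveat with the hand-written version: if any predicted probability is exactly 0 or 1, `np.log` receives 0 and the result becomes `nan` or infinite. A common guard is to clip the probabilities slightly away from the boundaries first. The sketch below assumes a clipping constant `eps=1e-15`, which is an illustrative choice, and `clipped_logloss` is a hypothetical name.

```python
import numpy as np

def clipped_logloss(y_true, prob_pred, eps=1e-15):
    # Keep probabilities strictly inside (0, 1) so np.log never receives 0
    prob = np.clip(prob_pred, eps, 1 - eps)
    # Per-observation log-likelihood terms, as in the LogLoss definition
    terms = y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob)
    return -np.mean(terms)

y_true = np.array([1, 0, 1])
# Two fully confident (and correct) predictions plus one 50/50 guess
prob_pred = np.array([1.0, 0.0, 0.5])

loss = clipped_logloss(y_true, prob_pred)
print('Clipped LogLoss is finite:', np.isfinite(loss))
```

On inputs that stay strictly between 0 and 1, such as the arrays used above, clipping leaves the result unchanged.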


### EDA statistics


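As mentioned in the slides, you'll work with New York City taxi fare prediction data. You'll start by computing some basic statistics about the data. Then you'll move on to plotting some dependencies and generating hypotheses from them.

The train and test DataFrames are expected to hold the taxi fare data. In the run shown below, the workspace still held the store-item DataFrames from the previous exercises, which is why the cell ends with an AttributeError.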

In [10]:
# Shapes of train and test data
print('Train shape:', train.shape)
print('Test shape:', test.shape)

# Train head()
print(train.head())

# Describe the target variable
print(train.fare_amount.describe())

# Train distribution of passengers within rides
print(train.passenger_count.value_counts())

Train shape: (913000, 4)
Test shape: (45000, 5)
         date  store  item  sales
0  2013-01-01      1     1     13
1  2013-01-02      1     1     11
2  2013-01-03      1     1     14
3  2013-01-04      1     1     13
4  2013-01-05      1     1     10


AttributeError: 'DataFrame' object has no attribute 'fare_amount'
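
The AttributeError above comes from `train` still having the store-item columns, so `fare_amount` does not exist. A small guard like the one sketched below (`missing_columns` is a hypothetical helper) makes this kind of mix-up visible before any statistics are computed; the toy frame mimics the store-item layout.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that df does not contain."""
    return [col for col in required if col not in df.columns]

# Toy frame with the store-item layout shown above
store_like = pd.DataFrame({
    'date': ['2013-01-01'],
    'store': [1],
    'item': [1],
    'sales': [13],
})

# Columns the taxi EDA cell relies on
required = ['fare_amount', 'passenger_count']
print('Missing columns:', missing_columns(store_like, required))
```

An empty list would indicate that the DataFrame has the columns the EDA cell needs.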