There are a lot of missing values in the dataset. How are they spread across different timestamps, and are there any patterns? We'll look at the percentage of missing values in different time periods for each feature.

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv('../input/train.csv')
print(train.shape)
test = pd.read_csv('../input/test.csv')
print(test.shape)
macro = pd.read_csv('../input/macro.csv')
print(macro.shape)

Combine the train and test sets into one dataframe

In [None]:
full = pd.concat([train.drop('price_doc', axis=1), test])
print(full.shape)

Extract year and month (the first 7 characters, `YYYY-MM`) from the timestamp

In [None]:
full['yearmonth'] = full.timestamp.map(lambda x: x[:7])
train['yearmonth'] = train.timestamp.map(lambda x: x[:7])
test['yearmonth'] = test.timestamp.map(lambda x: x[:7])
macro['yearmonth'] = macro.timestamp.map(lambda x: x[:7])

The number of features with missing values differs between train and test

In [None]:
print('features with missing values in train', (train.isnull().sum()>0).sum())
print('features with missing values in test', (test.isnull().sum()>0).sum())

Columns with missing values either in train or in test

In [None]:
cols_missing = full.columns[full.isnull().sum()>0].tolist()
print(len(cols_missing))

Aggregating by month and calculating fraction of missing values for each feature

In [None]:
miss_train = train.groupby('yearmonth')[cols_missing].agg(lambda x: x.isnull().mean()).T
miss_test = test.groupby('yearmonth')[cols_missing].agg(lambda x: x.isnull().mean()).T
miss_train.head()
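
To make the aggregation concrete, here is a minimal sketch on toy data. The frame `toy` and its values are made up for illustration and are not taken from the competition files. It relies on the fact that the mean of a boolean mask is the fraction of `True` values, so it should give the same kind of table as the `agg` call above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'timestamp': ['2011-08-20', '2011-08-23', '2011-09-01', '2011-09-05'],
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, np.nan],
})
toy['yearmonth'] = toy['timestamp'].str[:7]

# isnull() gives a boolean mask; its mean within each month
# is the fraction of missing values in that month
toy_miss = toy[['a', 'b']].isnull().groupby(toy['yearmonth']).mean().T
print(toy_miss)
```

Rows are features and columns are months, which is the layout the heatmap expects.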

Plot heatmaps with month on x-axis and features on y-axis for train and test.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[3, 1]},
                               sharey=True, figsize=(12,14))
sns.heatmap(miss_train, ax=ax1, cbar=False, vmax=1, vmin=0)
sns.heatmap(miss_test, ax=ax2, vmax=1, vmin=0)

Patterns of missing values are very different in the train set and the test set. Some features are fully absent during the first half of the train set. Others have no missing values at the beginning but acquire them later. The last part of the train set looks quite similar to the test set.
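
The visual impression that the end of train resembles test could be checked numerically. The sketch below is one possible way to do it, not part of the original analysis: `missing_gap` is a hypothetical helper, and the window of `n_last` months is an arbitrary assumption. It takes tables shaped like `miss_train` and `miss_test` (features in rows, months in columns) and averages the monthly fractions without weighting by the number of rows per month.

```python
import pandas as pd

def missing_gap(miss_train, miss_test, n_last=6):
    """Hypothetical helper: per-feature absolute difference between the mean
    missing fraction over the last n_last train months and over all test months.
    Monthly fractions are averaged unweighted (row counts per month are ignored)."""
    late_train = miss_train.iloc[:, -n_last:].mean(axis=1)
    test_mean = miss_test.mean(axis=1)
    return (late_train - test_mean).abs().sort_values(ascending=False)

# Toy tables (made-up values): rows are features, columns are months
toy_train = pd.DataFrame(
    {'2015-04': [1.0, 0.0], '2015-05': [0.2, 0.0], '2015-06': [0.0, 0.1]},
    index=['f1', 'f2'])
toy_test = pd.DataFrame(
    {'2015-07': [0.0, 0.1], '2015-08': [0.2, 0.1]},
    index=['f1', 'f2'])

gap = missing_gap(toy_train, toy_test, n_last=2)
print(gap)
```

On the real data this would be called as `missing_gap(miss_train, miss_test)`; features near the top of the result are the ones whose missingness differs most between late train and test.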

## Doing the same for macro features

In [None]:
cols_missing_macro = macro.columns[macro.isnull().sum()>0].tolist()
print('features with missing values in macro', len(cols_missing_macro))
miss_macro = macro.groupby('yearmonth')[cols_missing_macro].agg(lambda x: x.isnull().mean()).T

plt.figure(figsize=(12,18))
sns.heatmap(miss_macro)

Not good news: there are a lot of missing values at the end of the period, which corresponds to the test set. But maybe these features will turn out to be unimportant.
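
To see which macro features are affected in the test period specifically, one option is to restrict the rows to a date range and compute the missing fraction per column. This is a sketch under assumptions: `missing_in_period` is a hypothetical helper, and it assumes timestamps are ISO-formatted strings (`YYYY-MM-DD`), so plain string comparison orders them chronologically.

```python
import numpy as np
import pandas as pd

def missing_in_period(df, start, end, date_col='timestamp'):
    """Hypothetical helper: fraction of missing values per column for rows
    whose date_col lies in [start, end] (both ends inclusive).
    Assumes ISO date strings, which sort chronologically as text."""
    in_period = df[date_col].between(start, end)
    subset = df.loc[in_period].drop(columns=[date_col])
    return subset.isnull().mean().sort_values(ascending=False)

# Toy macro table (made-up values, for illustration only)
toy_macro = pd.DataFrame({
    'timestamp': ['2015-06-30', '2015-07-01', '2015-07-02', '2015-07-03'],
    'x': [1.0, np.nan, np.nan, 2.0],
    'y': [1.0, 2.0, 3.0, 4.0],
})

frac = missing_in_period(toy_macro, '2015-07-01', '2015-07-03')
print(frac)
```

With the real frames, a call such as `missing_in_period(macro, test.timestamp.min(), test.timestamp.max())` would list the macro features that are most often missing over the test dates.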