# Exploration of Sberbank Housing Data, Part I

This notebook partially documents my thought process as I do some data cleaning and basic initial feature engineering.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('../input/train.csv')
macro = pd.read_csv('../input/macro.csv')
test = pd.read_csv('../input/test.csv')

dfa = pd.concat([train, test])
dfa = dfa.merge(macro, on='timestamp', suffixes=['','_macro'])

I've combined the training and test sets, because any data cleaning we do will have to be done the same way in both sets, and it will have to deal with any problems that occur in either set. I'll make use of the following function (based in part on code from [Mark Waddoups](https://www.kaggle.com/mwaddoups/sberbank-russian-housing-market/i-regression-workflow-various-models)) to examine variables:

In [None]:
def describe(varname="price_doc", df=dfa, minval=-1e20, maxval=1e20,
             addtolog=1, nlo=8, nhi=8, dohist=True, showmiss=True):
    """Print summary information and plot histograms for one column of df.

    Only values in [minval, maxval] are listed and plotted. addtolog is added
    before taking the logarithm so that zeros do not produce -inf.
    """
    var = df[varname]

    print("DESCRIPTION OF ", varname, "\n")
    if showmiss:
        print("Fraction missing = ", var.isnull().mean(), "\n")

    # Keep only values inside the requested range (this also drops NaNs)
    var = var[(var <= maxval) & (var >= minval)]
    if nlo > 0:
        print("Lowest values:\n", var.sort_values().head(nlo).values, "\n")
    if nhi > 0:
        print("Highest values:\n", var.sort_values().tail(nhi).values, "\n")

    if dohist:
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))

        print("Histograms of raw values and logarithm")
        var.plot(ax=axes[0], kind='hist', bins=100)
        # Right-hand panel: log(x + addtolog), drawn in green
        np.log(var + addtolog).plot(ax=axes[1], kind='hist', bins=100,
                                    color='green', secondary_y=True)
        plt.show()

In [None]:
print(dfa.shape)
dfa.head()

In [None]:
print("The names of all 392 columns:\n\n", dfa.columns.values)

That's a lot.  Let's start with the target variable.

In [None]:
describe()

So price_doc looks pretty well-behaved.  There are spikes at round numbers (e.g., 13.8 is the natural log of 1e6), but that's not likely to be a problem.  I will use a logarithm (and I also intend to normalize by some macro variable like CPI, but that's a later issue).
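As a minimal sketch of the log transform mentioned above: the prices below are made-up stand-ins for price_doc, and adding 1 before the log is an assumption that mirrors the addtolog=1 default in describe(). The CPI normalization is left out because it hasn't been worked out yet.

```python
import numpy as np
import pandas as pd

# Hypothetical prices standing in for price_doc (not taken from the data)
prices = pd.Series([1_000_000, 2_000_000, 5_500_000, 13_000_000],
                   name="price_doc")

# Same transform that describe() plots: log(x + 1)
log_price = np.log(prices + 1)

# Invert the transform to get back to the original scale,
# e.g. when turning model predictions into prices
recovered = np.exp(log_price) - 1

print(log_price.round(2).values)
print(recovered.round(0).values)
```

The first value is the log of one million, which is where the spike near 13.8 comes from.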

In [None]:
describe("full_sq")

There are ones and zeros, which are nonsense in principle but may have special meanings that we don't know about.  There's also one extreme outlier on the high end.  Let's see what the variable looks like without those strange values.

In [None]:
describe("full_sq", minval=1.5, maxval=1000, showmiss=False)

I'm going to draw a line between 2 and 5.  Can you live in 2 square meters?  Maybe, but the main cluster of values seems to begin with 5, so I'm going to assume the 2 is a nonsense value.  Hence...

In [None]:
describe("full_sq", minval=3, maxval=1000, nhi=0, showmiss=False)

That logarithm histogram looks pretty nice.  Wrap it up.  I'll take it.  But what's that weird spike?

In [None]:
describe("full_sq", minval=25, maxval=60, nhi=0, nlo=0, showmiss=False)
print("Mode is ", dfa.full_sq.mode().values[0])

Apparently the real answer to the ultimate question of life, the universe, and everything is 38.  And it's in units of square meters.  Other than that, I have no idea what it means.  But there's no particular reason for special processing of it.  On to the next variable.

In [None]:
describe("life_sq")

A lot of zeros (maybe commercial property with no living space?).  Also a lot of ones, as you can see from the log histogram.  (The function is actually log(x+1), and the natural logarithm of 2 is 0.69.)  Those make no sense, and the meaning is unclear, so I'll give them a dummy category.  There's also one extreme outlier on the right.  So let's look at the good part of the distribution.

In [None]:
describe("life_sq", minval=1.5, maxval=1000, nhi=0, showmiss=False)

Again there's a 2, and I'm going to draw a line above it, so...

In [None]:
describe("life_sq", minval=2.5, maxval=1000, nhi=0, nlo=0, showmiss=False)

Again, the log chart looks pretty reasonable but with some strange spikes.  Looking more closely,

In [None]:
describe("life_sq", minval=15, maxval=35, nhi=0, nlo=0, showmiss=False)

19 and 30.  Who knows what they mean?  Ignore them.  On to the floor variable.

In [None]:
describe("floor")

Generally looks reasonable.  Not sure what the zeros mean.  (There aren't enough of them for zero to mean "not an apartment.")  They'll get a dummy.  Also one is kind of a special case, so maybe should have its own dummy.  And 77 is not unreasonable, but it's far from the rest of the pack, so I'll give it a dummy too.  It's not obvious from the histograms, but I think I will use a logarithm for floor:  intuitively, the difference between floor 2 and floor 3 is a lot more significant than the difference between floor 32 and floor 33.
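Here is a rough sketch of what those floor features might look like. The function name floor_features and the column names are hypothetical, the sample values are made up, and using log(floor + 1) rather than a plain log is an assumption (it keeps the zeros finite and matches what describe() plots).

```python
import numpy as np
import pandas as pd

def floor_features(floor):
    """Build dummy and log features from a floor Series (illustrative only)."""
    out = pd.DataFrame(index=floor.index)
    # Dummies for the special cases discussed above; NaN compares as False
    out["floor_is_zero"] = (floor == 0).astype(int)
    out["floor_is_one"] = (floor == 1).astype(int)
    out["floor_is_77"] = (floor == 77).astype(int)
    # Assumed transform: log(x + 1), so floor 0 maps to 0 instead of -inf
    out["floor_log"] = np.log(floor + 1)
    return out

# Made-up values, including a missing one
floor = pd.Series([0, 1, 2, 3, 32, 33, 77, np.nan])
feats = floor_features(floor)

# On the log scale, 2 -> 3 is a bigger step than 32 -> 33
gap_low = feats.loc[3, "floor_log"] - feats.loc[2, "floor_log"]
gap_high = feats.loc[5, "floor_log"] - feats.loc[4, "floor_log"]
print(feats)
print(gap_low, gap_high)
```

Missing floors stay missing in floor_log and get zeros in all three dummies, so they would still need their own handling.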

Next is max_floor.

In [None]:
describe("max_floor")

Looks pretty reasonable, except for the zeros.  Again one is a special case and should get a dummy.  And logs are clearly better than raw values.  There's also a big gap between 57 and 99, so maybe the last 4 points should get their own dummy.  (Also, 99 is a suspicious number, given that there are 3 of them.  Does it just mean "a lot"?  Is it a missing value code?)
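One way to look at the points above the gap is to count them by value, then flag them. This is only a sketch: the values below are invented to mimic the pattern described (three 99s plus one larger value), and the names GAP_CUTOFF, max_floor_high and max_floor_is_99 are hypothetical.

```python
import pandas as pd

GAP_CUTOFF = 57  # last value before the gap described above

# Made-up values standing in for dfa.max_floor
max_floor = pd.Series([0, 1, 5, 9, 17, 25, 57, 99, 99, 99, 117])

# How many points sit above the gap, and at which values?
above_gap = max_floor[max_floor > GAP_CUTOFF]
print(above_gap.value_counts().sort_index())

# One dummy for everything above the gap, and a separate one for 99
# in case it turns out to be a missing-value code
max_floor_high = (max_floor > GAP_CUTOFF).astype(int)
max_floor_is_99 = (max_floor == 99).astype(int)
```

Keeping the 99 flag separate leaves the question of whether it is a sentinel open for later.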

Next num_room.

In [None]:
describe("num_room")

Looks good, except for the zeros, which will get a dummy.  Not clear whether raw values or logs are better, but my intuition says 1 room vs. 2 rooms is a bigger difference than 8 rooms vs. 9 rooms, so I'll go with logs.

Next is kitch_sq.

In [None]:
describe("kitch_sq")

That's an ugly one.  Let's first take a closer look at the low end.

In [None]:
describe("kitch_sq", maxval=12, nhi=0, nlo=0, showmiss=False)

Lots of zeros and ones but nothing else strange.  One square meter is not implausibly small for a kitchenette, but it does look like a special case, so it will get a dummy.  Let's look at the high end.

In [None]:
describe("kitch_sq", minval=30, nhi=20, nlo=0, showmiss=False)

Nothing completely insane here but the top part of the distribution is weird.  I'll put in a dummy for values greater than 400.  Let's look at the middle part of the distribution.

In [None]:
describe("kitch_sq", minval=4, maxval=150, nhi=0, nlo=0, showmiss=False)

Still looks weird.  Let's look at the part where it gets weird.

In [None]:
describe("kitch_sq", minval=15, maxval=70, nhi=0, nlo=0, showmiss=False)

So the weirdness is not just a few unusual cases.  Even with the log transformation, the overall distribution is quite skewed.  I'm tempted to take the cube root of the log.  Here's what that looks like for the full distribution and for just the non-extreme range of values (2 to 400).

In [None]:
var = dfa.kitch_sq
var2 = var[((var>1)&(var<400))]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,3))
np.cbrt(np.log(var+1)).plot(ax=axes[0], kind='hist', bins=100, color='green', secondary_y=True)
np.cbrt(np.log(var2+1)).plot(ax=axes[1], kind='hist', bins=100, color='green', secondary_y=True)
plt.show()
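To put a number on "skewed" rather than judging by eye, one option is to compare sample skewness under each transform. This is a sketch on synthetic data, not on kitch_sq itself; the helper name compare_skew is hypothetical, and the lognormal sample is just a stand-in for a right-skewed variable.

```python
import numpy as np
import pandas as pd

def compare_skew(var, addtolog=1):
    """Return sample skewness of the raw, log, and cube-root-of-log values."""
    logged = np.log(var + addtolog)
    return {
        "raw": var.skew(),
        "log": logged.skew(),
        "cbrt_log": np.cbrt(logged).skew(),
    }

# Synthetic right-skewed values (NOT the real kitch_sq column)
rng = np.random.default_rng(0)
fake = pd.Series(np.round(rng.lognormal(mean=2.0, sigma=0.6, size=5000)))

# Same non-extreme range as in the cell above
fake = fake[(fake > 1) & (fake < 400)]

result = compare_skew(fake)
print(result)
```

Each extra concave transform pushes the skewness down, so the thing to check on the real column is whether the cube root lands closer to zero than the log alone or overshoots into left skew.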

To be continued...