# Task
For the given dataset of Seattle house prices, `house.scv`, construct two models: one predicting the price as a categorical value (`price_bin`) and one predicting it as a continuous value (`price`).

The data is summarized in the following table:

|    | column              | kind        | data type   |   unique items |   nans |
|:--:|:-------------------:|:-----------:|:-----------:|:--------------:|:------:|
|  0 | id                  | index       | int64       |          21436 |      0 |
|  1 | date                | date        | object      |            372 |      0 |
|  2 | price               | numerical   | float64     |           3625 |      0 |
|  3 | price_bin           | categorical | int64       |              2 |      0 |
|  4 | bedrooms            | categorical | int64       |             13 |      0 |
|  5 | bathrooms           | categorical | float64     |             30 |      0 |
|  6 | sqft_living         | numerical   | int64       |           1038 |      0 |
|  7 | sqft_lot            | numerical   | int64       |           9782 |      0 |
|  8 | floors              | categorical | float64     |              6 |      0 |
|  9 | waterfront          | categorical | int64       |              2 |      0 |
| 10 | view                | categorical | int64       |              5 |      0 |
| 11 | condition           | categorical | int64       |              5 |      0 |
| 12 | grade               | categorical | int64       |             12 |      0 |
| 13 | sqft_above          | numerical   | int64       |            946 |      0 |
| 14 | sqft_basement       | numerical   | int64       |            306 |      0 |
| 15 | yr_built            | year        | int64       |            116 |      0 |
| 16 | yr_renovated        | year        | int64       |             70 |      0 |
| 17 | zipcode             | categorical | int64       |             70 |      0 |
| 18 | lat                 | geospatial  | float64     |           5034 |      0 |
| 19 | long                | geospatial  | float64     |            752 |      0 |
| 20 | sqft_living15       | numerical   | int64       |            777 |      0 |
| 21 | sqft_lot15          | numerical   | int64       |           8689 |      0 |


In [None]:
import importlib
import utils 

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns; sns.set()

%matplotlib inline

In [None]:
importlib.reload(utils)

df = utils.PrepareData().df

df.head()

## Data preparation

### Date feature

The data covers only one year, from May 2014 to May 2015. This span is likely too short either to model seasonal effects on price or to build a time series for forecasting future prices.

![Sales per month](figs/sales_per_month.png)

The next plot shows that the monthly sales time series of the individual zip codes are essentially uncorrelated with each other.

![zipcode corr per month](figs/zipcode_corr_per_month.png)

The notebook used to produce the figures above is [time_series.ipynb](time_series.ipynb).

As the figures above show, the time series correlation is quite weak; therefore the date feature is ignored in the rest of this study.
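
A per-zipcode comparison of this kind can be set up roughly as follows. This is only a minimal sketch on a tiny made-up frame (`toy` and its values are assumptions for illustration); the actual analysis is the one in `time_series.ipynb`, and the real `date` strings may need an explicit `format=` argument when parsed.

```python
import pandas as pd

# Toy stand-in for the real data frame (values are made up for illustration).
toy = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-20", "2014-06-11", "2014-06-15",
             "2014-07-03", "2014-05-09", "2014-06-21", "2014-07-08",
             "2014-07-19", "2014-07-30"],
    "zipcode": [98001, 98001, 98001, 98001, 98001,
                98002, 98002, 98002, 98002, 98002],
})

# Parse the date strings and reduce each sale to its calendar month.
toy["month"] = pd.to_datetime(toy["date"]).dt.to_period("M")

# Rows: months, columns: zip codes, cells: number of sales in that month.
sales_per_month = toy.groupby(["month", "zipcode"]).size().unstack(fill_value=0)

# Pearson correlation between the monthly sales series of each pair of zip codes.
zipcode_corr = sales_per_month.corr()

print(sales_per_month)
print(zipcode_corr)
```

With only twelve or thirteen monthly points per zip code in the real data, such correlations are noisy, which is one more reason to treat the date feature with caution.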

### yr_renovated feature

In [None]:
ax = plt.axes()
df['yr_renovated'].hist(ax=ax)
ax.set_title('Renovation year distribution')

In [None]:
# Houses with yr_renovated == 0, i.e. with no recorded renovation.
n_not_renovated = (df["yr_renovated"] == 0).sum()
perc_not_renovated = n_not_renovated * 100. / df.shape[0]
n_not_renovated, perc_not_renovated

In [None]:
df[['yr_built', 'yr_renovated']].corr()

In [None]:
ndf = df.copy()
# 1 if the house has a renovation year, 0 otherwise (vectorized instead of row-wise apply).
ndf['renovated_bin'] = (ndf['yr_renovated'] != 0).astype(int)
ndf[['renovated_bin', 'price_bin']].corr()

The number of entries with a renovation year equal to 0 is 20699, or 95.77% of all data.
The correlation with the construction year is also weak. The price may be related to renovation for some recently renovated houses. However, at this point it seems that this feature can be omitted.

### yr_built feature

In [None]:
ax = plt.axes()
df['yr_built'].hist(ax=ax)
ax.set_title('Construction year distribution')

This feature should be transformed into the age of the house.
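
One simple way to do this is to subtract the construction year from a fixed reference year. The sketch below assumes 2015 (the last sale year in the data) as that reference; `REFERENCE_YEAR`, `house_age` and the toy values are illustrative names and numbers, not part of the original code.

```python
import pandas as pd

# Assumption: ages are measured relative to 2015, the last sale year in the data.
REFERENCE_YEAR = 2015

# Toy construction years (made up for illustration).
toy = pd.DataFrame({"yr_built": [1955, 1987, 2014]})

# Age of each house, in years, at the reference year.
toy["house_age"] = REFERENCE_YEAR - toy["yr_built"]

print(toy)
```

An alternative would be to measure the age at the sale date of each row, at the cost of keeping the `date` column around.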


### Bathrooms feature

The values in this column look implausible. They are not integers, and they do not scale sensibly with the number of bedrooms: for instance, the house with 33 bedrooms has 1.75 bathrooms, while one with 6 bedrooms has 8 bathrooms. This is likely the outcome of an error made when the data was compiled. Therefore this feature is omitted in the rest of the study.

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df[['bedrooms','bathrooms']], figsize=(8,8))

As the scatter matrix above shows, the joint distribution of bathrooms and bedrooms looks implausible: there are houses with more bathrooms than bedrooms, and the non-integer values are hard to interpret. Thus this feature is omitted from further investigation.
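
The two observations can also be checked numerically. The following is a minimal sketch on a toy frame: the 33/1.75 and 6/8 rows mirror the examples quoted above, the remaining rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "bedrooms":  [3, 6, 33, 2],
    "bathrooms": [2.25, 8.0, 1.75, 1.0],
})

# Rows where the bathroom count exceeds the bedroom count.
more_baths = toy[toy["bathrooms"] > toy["bedrooms"]]

# Rows whose bathroom count is not a whole number.
fractional = toy[toy["bathrooms"] % 1 != 0]

print(len(more_baths), len(fractional))
```

Applied to the real `df`, the same two filters give the number of affected entries.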

### 33 bedrooms

The entry with 33 bedrooms is not representative, so it is removed from the dataset.
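
Such a filter could look like the sketch below, shown on a toy frame with made-up values; where exactly the removal happens in the project code is not shown here.

```python
import pandas as pd

# Toy frame with one outlier row (values are made up for illustration).
toy = pd.DataFrame({"id": [1, 2, 3], "bedrooms": [3, 33, 4]})

# Keep every row except the one with 33 bedrooms, then renumber the index.
cleaned = toy[toy["bedrooms"] != 33].reset_index(drop=True)

print(cleaned)
```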

### Criminal activities

A public database with records going back to 2008 was used. For each house, the number of all criminal acts recorded in the nearby area before May 2015 was added to the data frame as the `criminal_activities` column.
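
How "nearby" is defined is not stated here. One possible reading is a fixed radius around each house, sketched below. Everything in this block is an assumption for illustration: the 1 km radius, the `lat`/`long` columns of the crime table, the helper names `haversine_km` and `count_nearby`, and the toy coordinates.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def count_nearby(houses, crimes, radius_km=1.0):
    """Number of crime records within radius_km of each house (assumed rule)."""
    # Shape (n_houses, n_crimes): distance from every house to every record.
    dist = haversine_km(
        houses["lat"].to_numpy()[:, None], houses["long"].to_numpy()[:, None],
        crimes["lat"].to_numpy()[None, :], crimes["long"].to_numpy()[None, :],
    )
    return (dist <= radius_km).sum(axis=1)


# Toy coordinates (made up): two houses and three crime records.
houses = pd.DataFrame({"lat": [47.60, 47.70], "long": [-122.33, -122.20]})
crimes = pd.DataFrame({"lat": [47.601, 47.602, 47.70],
                       "long": [-122.331, -122.329, -122.50]})

houses["criminal_activities"] = count_nearby(houses, crimes)
print(houses)
```

The full pairwise distance matrix is fine for a toy example but grows quickly with the number of records; for the real data a spatial index, or processing the houses in chunks, would likely be needed.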

In [None]:
from sklearn.pipeline import Pipeline
import base
# Reload before importing the names so that edits to base.py are picked up.
importlib.reload(base)
from base import DropColumns, YearTransformer

transform_pipeline = Pipeline([
    ('clean', DropColumns(columns=['id', 'date', 'yr_renovated'])),
    ('yr_built_transformer', YearTransformer(column='yr_built')),
])


In [None]:
ndf = transform_pipeline.transform(df)
ndf.head()
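
The `base` module is not shown in this notebook. As a rough idea of how such pipeline steps can be written, here is a minimal sketch of two stateless scikit-learn transformers. The class names `DropColumnsSketch` and `YearToAgeSketch`, the `reference_year` parameter and the toy frame are hypothetical and are not the project's actual `DropColumns` and `YearTransformer`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropColumnsSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for base.DropColumns: removes the given columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return X.drop(columns=self.columns)


class YearToAgeSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for base.YearTransformer: year -> age in years."""

    def __init__(self, column, reference_year=2015):
        self.column = column
        self.reference_year = reference_year  # assumed reference year

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X[self.column] = self.reference_year - X[self.column]
        return X


# Toy frame (values are made up for illustration).
toy = pd.DataFrame({"id": [1, 2], "yr_built": [1990, 2010], "price": [3e5, 5e5]})

sketch_pipeline = Pipeline([
    ("clean", DropColumnsSketch(columns=["id"])),
    ("age", YearToAgeSketch(column="yr_built")),
])
result = sketch_pipeline.fit_transform(toy)
print(result)
```

Because both steps implement `fit` and `transform`, they can later be combined with scalers and estimators in a single `Pipeline`.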

# Visualization


In [None]:
import folium
from IPython.display import display
import branca.colormap as cm

In [None]:
# Color scale limits taken from the observed range of the feature.
v_max, v_min = ndf['criminal_activities'].max(), ndf['criminal_activities'].min()

colormap = cm.LinearColormap(colors=['yellow', 'green', 'red'],
                           vmin=v_min,
                           vmax=v_max)
house_map = folium.Map(location=[ndf['lat'].mean(), ndf['long'].mean()],
                       min_lat=ndf['lat'].min(), 
                       max_lat=ndf['lat'].max(), 
                       min_lon=ndf['long'].min(), 
                       max_lon=ndf['long'].max(), 
                    #    zoom_control=False,
                       scrollWheelZoom=False,
                    #    dragging=False,
                    )

# Iterate over ndf itself so the loop always matches the frame being plotted.
for lat, lon, value in zip(ndf['lat'], ndf['long'], ndf['criminal_activities']):
    folium.CircleMarker([lat, lon],
                        radius=1,
                        color=colormap(value),
                        fill=True,
                       ).add_to(house_map)

display(house_map)

In [None]:
# Price per square foot of living area.
ndf['price_per_sqft'] = ndf['price'] / ndf['sqft_living']
v_max, v_min = ndf['price_per_sqft'].max(), ndf['price_per_sqft'].min()
colormap = cm.LinearColormap(colors=['yellow', 'green', 'red'],
                             vmin=v_min,
                             vmax=v_max)
house_map = folium.Map(location=[ndf['lat'].mean(), ndf['long'].mean()],
                       min_lat=ndf['lat'].min(), 
                       max_lat=ndf['lat'].max(), 
                       min_lon=ndf['long'].min(), 
                       max_lon=ndf['long'].max(), 
                    #    zoom_control=False,
                       scrollWheelZoom=False,
                    #    dragging=False,
                    )

# Iterate over ndf itself so the loop always matches the frame being plotted.
for lat, lon, value in zip(ndf['lat'], ndf['long'], ndf['price_per_sqft']):
    folium.CircleMarker([lat, lon],
                        radius=1,
                        color=colormap(value),
                        fill=True,
                       ).add_to(house_map)

display(house_map)

# Binary Model

A detailed evaluation of the logistic regression and clustering models is in [classification_bin_unskewed.ipynb](classification_bin_unskewed.ipynb); the neural network model is in [dl.ipynb](dl.ipynb).

# Continuous Model

A detailed evaluation of a linear regression model is in [ml.ipynb](ml.ipynb).