## House sale price prediction

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Work plan

* #### Analyzing the target
* #### Filling missing values 
* #### Feature Engineering
* #### Converting categorical to numerical
* #### Modeling and predicting



#### Import required libraries 

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno


%matplotlib inline
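# The 'seaborn-darkgrid' matplotlib style name is gone in recent matplotlib releases;
# seaborn's own darkgrid style works across versions.
sns.set_style('darkgrid')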
palette = plt.get_cmap('Set2')

#### Read train and test datasets

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

#### Display first 10 rows from train and test datasets

In [None]:
train.head(10)

In [None]:
test.head(10)

#### Display information on values in columns

In [None]:
train.info()

#### Display a graph of missing values

##### First group (first 40 columns)

In [None]:
msno.bar(train.iloc[:, :40])

##### Second group (remaining columns)

In [None]:
msno.bar(train.iloc[:, 40:])

#### As we can see, some columns have a lot of missing values; in the emptiest ones the non-missing values do not exceed the 30 mark. We'll fix this shortly!

#### Display the description of the values in the columns

In [None]:
train.iloc[:, :40].describe()

In [None]:
train.iloc[:, 40:-1].describe()

In [None]:
pd.DataFrame(train['SalePrice'].describe())

In [None]:
plt.figure(figsize=(12, 7))

# distplot is deprecated in seaborn; histplot with kde=True draws the same kind of plot
sns.histplot(train['SalePrice'], kde=True).set(ylabel=None, xlabel=None)
plt.title('House price distribution histogram', fontsize=18)
plt.show()

#### As we can see, the distribution is positively skewed. We fix this by taking log(1 + x) of the target.

In [None]:
train['SalePrice'] = np.log1p(train['SalePrice'])

In [None]:
plt.figure(figsize=(12, 7))

# distplot is deprecated in seaborn; histplot with kde=True draws the same kind of plot
sns.histplot(train['SalePrice'], kde=True)
plt.title('House price distribution histogram after fix', fontsize=18)
plt.show()

#### It looks fine now: the distribution is much more symmetric
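
The effect of the log transform can be checked on numbers as well as on the plot. Below is a minimal sketch on synthetic, log-normally distributed "prices" (made-up data, not the competition set). It shows that `np.log1p` reduces the skew and that `np.expm1` restores the original scale, which is what the final prediction step relies on.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the competition data
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

log_prices = np.log1p(prices)     # the scale the model is trained on
restored = np.expm1(log_prices)   # back to the original price scale

print('skew before:', prices.skew())
print('skew after: ', log_prices.skew())
print('round trip ok:', np.allclose(prices, restored))
```

Because the target is transformed, every prediction has to go through `np.expm1` before it is written to the submission file.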

#### Let's build a Pearson correlation matrix

In [None]:
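# Pearson correlation is only defined for numeric columns;
# recent pandas versions raise an error if text columns are passed to corr()
corr_train = train.select_dtypes(include='number').corr()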

colormap = plt.cm.RdBu

plt.figure(figsize=(14,12))
plt.title('Pearson correlation matrix between features', y=1, size=15)
sns.heatmap(corr_train, vmax=.8, square=True, cmap=colormap)
plt.show()

In [None]:
train.head()

#### The matrix shows how the features correlate with each other. Now let's look only at the features that correlate strongly with the house price ("SalePrice")

In [None]:
highest_corr_features = corr_train.index[
    abs(corr_train['SalePrice']) > 0.5
    ]

plt.figure(figsize=(14,12))
plt.title('Pearson correlation matrix between features and "SalePrice"', y=1, size=15)
sns.heatmap(train[highest_corr_features].corr(), linewidths=0.1, vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()

### Let's note

* #### SalePrice is highly correlated with OverallQual
* #### GarageArea, logically, is strongly related to GarageCars
* #### YearBuilt and TotRmsAbvGrd have the weakest relationship
* #### 1stFlrSF and TotalBsmtSF are also highly correlated
* #### TotRmsAbvGrd is highly correlated with GrLivArea
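
The selection rule used above (keep the features whose absolute correlation with the target exceeds 0.5) can be sketched on a toy frame. The column names and values below are invented for illustration; only the rule itself comes from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
quality = rng.normal(size=n)

# Toy data: 'Quality' drives 'Price', 'Noise' does not, 'Street' is text
toy = pd.DataFrame({
    'Quality': quality,
    'Noise': rng.normal(size=n),
    'Street': ['Pave'] * n,
    'Price': 2.0 * quality + rng.normal(scale=0.5, size=n),
})

# Correlate numeric columns only, then keep those with |corr| > 0.5 against the target
corr = toy.select_dtypes(include='number').corr()
strong = corr.index[corr['Price'].abs() > 0.5]
print(list(strong))
```

The target itself always passes the threshold, since its correlation with itself is 1.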

In [None]:
SalePrice = pd.DataFrame(corr_train['SalePrice'].sort_values(ascending=False))
SalePrice

#### Let's take only strongly related features

In [None]:
features = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

sns.pairplot(train[features])
plt.show()

### Good job! Now we know important features

### Let's find and fill in the missing data

#### Let's combine training and test datasets for convenience

In [None]:
y_train = train['SalePrice']
test_id = test['Id']
data = pd.concat([train, test], axis=0, sort=False)
data = data.drop(['Id', 'SalePrice'], axis=1)

In [None]:
Total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() / data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([Total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

#### We drop every feature with more than 5 missing values, on the assumption that these features are not important and are not highly correlated with the price.

In [None]:
data.drop((missing_data[missing_data['Total'] > 5]).index, axis=1, inplace=True)
print(data.isnull().sum().max())
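
The drop-by-threshold step can be followed on a small example. This is a sketch on a made-up frame; the threshold `max_missing` is a hypothetical name, and it is set to 1 here only because the toy frame is tiny (the notebook uses 5).

```python
import numpy as np
import pandas as pd

# Toy frame: 'PoolQC' is mostly empty, 'GarageCars' has a single gap
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'GarageCars': [2.0, np.nan, 1.0, 2.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

missing = toy.isnull().sum()   # number of missing values per column
max_missing = 1                # hypothetical threshold for this toy example
to_drop = missing[missing > max_missing].index
reduced = toy.drop(columns=to_drop)

print(list(reduced.columns))
print(reduced.isnull().sum().max())
```

Columns at or below the threshold survive with their gaps, which is why a separate filling step is still needed afterwards.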

In [None]:
# numeric data
numeric_missed = ['BsmtFinSF1',
                  'BsmtFinSF2',
                  'BsmtUnfSF',
                  'TotalBsmtSF',
                  'BsmtFullBath',
                  'BsmtHalfBath',
                  'GarageArea',
                  'GarageCars']

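# A missing value here means the house has no basement or garage, so 0 is a natural fill.
# Assign the result back: fillna(inplace=True) on a selected column may not update `data`.
for feature in numeric_missed:
    data[feature] = data[feature].fillna(0)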

In [None]:
# categorical data
categorical_missed = ['Exterior1st',
                      'Exterior2nd',
                      'SaleType',
                      'MSZoning',
                      'Electrical',
                      'KitchenQual']

# Fill each categorical gap with the most frequent value of its column
for feature in categorical_missed:
    data[feature] = data[feature].fillna(data[feature].mode()[0])

In [None]:
data['Functional'] = data['Functional'].fillna('Typ')

In [None]:
data.isnull().sum().max() 
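
The two filling strategies used above (0 for numeric gaps, the most frequent value for categorical gaps) can be sketched on a toy frame. The values are invented; the sketch assigns the result of `fillna` back to the column, which behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'GarageArea': [548.0, np.nan, 608.0],
    'KitchenQual': ['TA', 'TA', np.nan],
})

# Numeric gap: treat "no value" as "no garage", i.e. zero area
toy['GarageArea'] = toy['GarageArea'].fillna(0)

# Categorical gap: use the mode (most frequent value) of the column
most_common = toy['KitchenQual'].mode()[0]
toy['KitchenQual'] = toy['KitchenQual'].fillna(most_common)

print(toy)
```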

#### Almost done! That is enough

### Feature Engineering

#### Fix The Skewness in the other features

In [None]:
from scipy.stats import skew

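In [None]:
# Build the total area from the raw square footage, before any log transform;
# adding up log-transformed areas would not give a total area
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']

In [None]:
numeric = data.select_dtypes(include='number').columns
skewed = data[numeric].apply(lambda col: skew(col)).sort_values(ascending=False)
skewed = skewed[abs(skewed) > 0.5]

# Log-transform every numeric feature whose absolute skewness exceeds 0.5
for feature in skewed.index:
    data[feature] = np.log1p(data[feature])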
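
The skewness rule can be tried on synthetic columns. This sketch uses pandas' own `DataFrame.skew()` instead of `scipy.stats.skew` (an assumption made to keep the example dependency-free; the two estimators differ slightly), with invented data that mimics one skewed and one symmetric feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'LotArea': rng.lognormal(mean=9.0, sigma=0.8, size=1000),  # right-skewed
    'YearBuilt': rng.normal(loc=1970, scale=20, size=1000),    # roughly symmetric
})

skewness = toy.skew()
skewed_cols = skewness[skewness.abs() > 0.5].index

# Only the columns above the threshold are transformed
toy[skewed_cols] = np.log1p(toy[skewed_cols])

print(list(skewed_cols))
print(toy.skew())
```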

### Converting the categorical to numerical.

#### The simplest approach is to use the function pd.get_dummies()

In [None]:
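# dtype=float keeps the whole frame numeric; recent pandas versions otherwise
# produce boolean dummy columns, which do not mix well with float features
data = pd.get_dummies(data, dtype=float)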
data

In [None]:
x_train = data[:len(y_train)]
x_test = data[len(y_train):]
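
Encoding the combined frame and only then splitting it back is what keeps the train and test columns aligned. A minimal sketch with an invented `RoofStyle` column shows what goes wrong when the two parts are encoded separately:

```python
import pandas as pd

train_toy = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable']})
test_toy = pd.DataFrame({'RoofStyle': ['Gable', 'Flat']})  # 'Flat' never occurs in train

# Encoded separately: each part only gets columns for the categories it contains
separate_train = pd.get_dummies(train_toy, dtype=float)
separate_test = pd.get_dummies(test_toy, dtype=float)
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoded together, then split by position: identical columns in both parts
combined = pd.get_dummies(pd.concat([train_toy, test_toy], axis=0), dtype=float)
x_train_toy = combined[:len(train_toy)]
x_test_toy = combined[len(train_toy):]
print(list(x_train_toy.columns))
```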

In [None]:
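# Hold out the rows after the first 1168 for validation, so that the
# validation set does not overlap with the rows the model is trained on
x_valid, y_valid = x_train[1168:], y_train[1168:]
x_train, y_train = x_train[:1168], y_train[:1168]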
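
A hold-out split can also be drawn at random rather than by position. The sketch below is one possible way to do it with NumPy only; the names `x_toy`, `y_toy` and the 20% share are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

x_toy = pd.DataFrame({'feature': np.arange(10, dtype=float)})
y_toy = pd.Series(np.arange(10, dtype=float) * 2)

# Shuffle the row positions once, then cut the shuffled order into two parts
rng = np.random.default_rng(42)
order = rng.permutation(len(x_toy))
n_valid = int(0.2 * len(x_toy))
valid_idx, train_idx = order[:n_valid], order[n_valid:]

x_tr, y_tr = x_toy.iloc[train_idx], y_toy.iloc[train_idx]
x_va, y_va = x_toy.iloc[valid_idx], y_toy.iloc[valid_idx]

print(len(x_tr), len(x_va))
```

The important property is that no row appears in both parts; otherwise the validation score says little about unseen houses.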

### We cleaned the data very well, good job!

### Modeling and predicting

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import tensorflow as tf

In [None]:
!pip install xgboost

In [None]:
x_train.shape, y_train.shape, x_valid.shape, y_valid.shape, x_test.shape

In [None]:
model = tf.keras.Sequential()

# Take the input width from the data instead of hard-coding it
model.add(tf.keras.Input(shape=(x_train.shape[1],)))
model.add(tf.keras.layers.Dense(512, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
# Regression output: a single linear unit. A sigmoid would squash predictions
# into (0, 1), while the log-transformed prices are far outside that range.
model.add(tf.keras.layers.Dense(1))

# Mean squared error is a regression loss; binary cross-entropy is for 0/1 labels
loss = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['mae'])

history = model.fit(x_train.astype('float32'), y_train,
                    validation_data=(x_valid.astype('float32'), y_valid),
                    epochs=150, batch_size=128)
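
The competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so it helps to look at the same quantity on the validation rows. Below is a minimal sketch with a hypothetical helper `rmse` and invented prices; on the real data it would be applied to `y_valid` and the model's validation predictions, both of which are already on the log scale.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented prices in dollars, moved to the log scale before scoring
actual_prices = np.array([200000.0, 150000.0, 310000.0])
predicted_prices = np.array([210000.0, 140000.0, 300000.0])

score = rmse(np.log1p(actual_prices), np.log1p(predicted_prices))
print(score)
```

On the log scale an error counts in relative terms: being 5% off on a cheap house weighs about the same as being 5% off on an expensive one.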

In [None]:
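# Undo the log1p transform of the target; ravel() turns the (n, 1) output into a 1-D array
y_predict = np.floor(np.expm1(model.predict(x_test.astype('float32')))).ravel()

sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = y_predict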
sub.to_csv('submission.csv',index=False)