# Start here if...

You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.

# Competition Description

![title](img/housesbanner.png)

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

# Practice Skills

- Creative feature engineering 
- Advanced regression techniques like random forest and gradient boosting

# Acknowledgements

The [Ames Housing dataset](http://jse.amstat.org/v19n3/decock.pdf) was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

==========================================================================================================

### Imports

In [24]:
#Importing packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./input\data_description.txt
./input\sample_submission.csv
./input\test.csv
./input\train.csv


### Processing the dataset

In [31]:
#Load dataset
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
#train.info() #train.isnull().sum()

print('Train:{} Test:{}'.format(train.shape, test.shape))

Train:(1460, 81) Test:(1459, 80)


In [32]:
#Feature to predict: the only column in train that is missing from test
target = list(set(train.columns)-set(test.columns))
target = target[0] #the property's sale price in dollars (the variable to predict)
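
The set difference above relies on `train.csv` having exactly one column that `test.csv` lacks. A slightly more defensive sketch checks that assumption explicitly. The toy frames and the names `toy_train`, `toy_test` and `toy_target` are made up for illustration and are not part of the notebook:

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are illustrative only)
toy_train = pd.DataFrame({'Id': [1, 2],
                          'LotArea': [8450, 9600],
                          'SalePrice': [208500, 181500]})
toy_test = pd.DataFrame({'Id': [3], 'LotArea': [11250]})

# Columns present in train but not in test
extra_cols = toy_train.columns.difference(toy_test.columns)

# Fail loudly if there is not exactly one extra column
assert len(extra_cols) == 1, f'expected one target column, found {list(extra_cols)}'
toy_target = extra_cols[0]
print(toy_target)
```

If a future version of the data added another train-only column, the assertion would flag it instead of silently picking an arbitrary element of the set.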

### X/Y datasets

In [33]:
X_train = train.drop(columns=[target, 'Id']) #drop() returns a new DataFrame, so train is untouched
Y_train = train[target]

#Keep the test ids before dropping the Id column
X_test_id = test["Id"]
X_test = test.drop(columns='Id') #a new DataFrame, not an alias of test

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

### Fill NaN values

In [34]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [35]:
def fillNa_df(df):
    #Fill missing values in place: 'None' for text columns, 0 for numeric columns
    obj_col = df.columns[df.dtypes == 'object'].values
    num_col = df.columns[df.dtypes != 'object'].values
    df[obj_col] = df[obj_col].fillna('None')
    df[num_col] = df[num_col].fillna(0)
    return df

X_train = fillNa_df(X_train)
X_test = fillNa_df(X_test)
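
To make the effect of `fillNa_df` concrete, the sketch below applies the same rule to a toy frame: missing text values become the string `'None'` and missing numbers become `0`. The helper name `fill_na_sketch` and the toy values are hypothetical. It also differs from the cell above in two ways that are assumptions of this sketch: it works on a copy, and it picks numeric columns with `select_dtypes` instead of comparing dtypes to `'object'`.

```python
import numpy as np
import pandas as pd


def fill_na_sketch(df):
    """Return a copy with text NaNs set to 'None' and numeric NaNs set to 0."""
    df = df.copy()  # leave the caller's frame untouched
    num_col = df.select_dtypes(include='number').columns
    obj_col = df.columns.difference(num_col)  # everything that is not numeric
    df[obj_col] = df[obj_col].fillna('None')
    df[num_col] = df[num_col].fillna(0)
    return df


toy = pd.DataFrame({
    'Alley': ['Grvl', np.nan, 'Pave'],     # text column with a gap
    'LotFrontage': [65.0, np.nan, 80.0],   # numeric column with a gap
})
filled = fill_na_sketch(toy)
print(filled)
```

Because the helper copies its input, `toy` still contains its original gaps afterwards, which can be handy when comparing imputation strategies.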

### Encoding ordinal/categorical features

In [37]:
from sklearn.preprocessing import OneHotEncoder
#def oneHotEncoding(df_train, df_test):
#    obj_col = df_train.columns[df_train.dtypes == 'object'].values
    

In [3]:
# drop every column that has missing values in test, plus 'Electrical', from both frames
missing = test.isnull().sum()
missing = missing[missing>0]
train.drop(missing.index, axis=1, inplace=True)
train.drop(['Electrical'], axis=1, inplace=True)

test.dropna(axis=1, inplace=True)
test.drop(['Electrical'], axis=1, inplace=True)
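
The cell above removes the columns with gaps in `test`, plus `Electrical`, from both frames. The same idea can be sketched on toy data by collecting every column that has a gap in either frame and dropping the whole set in one step. The frames and the names `cols_with_gaps`, `clean_train` and `clean_test` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frames (illustrative values): one gap in each frame, in different columns
toy_train = pd.DataFrame({'LotArea': [8450, 9600],
                          'Electrical': ['SBrkr', np.nan],
                          'GarageArea': [548.0, 460.0]})
toy_test = pd.DataFrame({'LotArea': [11250, 9550],
                         'Electrical': ['SBrkr', 'FuseA'],
                         'GarageArea': [np.nan, 642.0]})

# Columns with at least one missing value in either frame
cols_with_gaps = [c for c in toy_test.columns
                  if toy_test[c].isnull().any() or toy_train[c].isnull().any()]

# drop() returns new frames, so the originals stay intact
clean_train = toy_train.drop(columns=cols_with_gaps)
clean_test = toy_test.drop(columns=cols_with_gaps)
print(cols_with_gaps)
```

This avoids naming individual columns by hand, though it discards a whole column even when only a single value is missing.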