### Intro
I hope you like this code. I have also prepared more notebooks for this competition and will be glad to share them with you:

1. [COMPREHENSIVE DATA EXPLORATION WITH PYTHON](https://www.kaggle.com/andrej0marinchenko/comprehensive-data-exploration-with-python-upd)
2. [Data ScienceTutorial for Beginners ](https://www.kaggle.com/andrej0marinchenko/data-sciencetutorial-for-beginners-house-prices)
3. [House Price Calculation methods for beginnners](https://www.kaggle.com/andrej0marinchenko/house-price-calculation-methods-for-beginnners)
4. [Start: Introduction for beginners ](https://www.kaggle.com/andrej0marinchenko/start-introduction-for-beginners-house-prices)
5. [EDA + Data Analytics For beginners](https://www.kaggle.com/andrej0marinchenko/eda-data-analytics-for-beginners-house-prices)
6. [1 step for beginners linear model](https://www.kaggle.com/andrej0marinchenko/1-step-for-beginners-linear-model-house-prices)
7. [Universal notebook 4 data analysis](https://www.kaggle.com/andrej0marinchenko/universal-notebook-4-data-analysis)


### Imports
We are using a typical data science stack: numpy, pandas, sklearn, matplotlib, and seaborn.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

### Read in Data
First, we can list all the available data files. The competition data includes one main file for training (with the target), one main file for testing (without the target), and an example submission file.

In [None]:
# List files available
print(os.listdir("../input/house-prices-advanced-regression-techniques/"))

In [None]:
# Training data
app_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

In [None]:
# Testing data features
app_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()

In [None]:
# Dictionary of fitted label encoders, keyed by column name
le = dict()
le_count = 0

def encode_transform(app, col, le):
    # Replace the column with the integer codes from its fitted encoder
    app[col] = le[col].transform(app[col])
    return app, col, le

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories (a missing value counts as a category here)
        if len(list(app_train[col].unique())) <= 2:
            le[col] = LabelEncoder()
            # Train on the training data
            le[col].fit(app_train[col])
            encode_transform(app_train, col, le)
            # The test set is not transformed (the call is left commented out):
            # encode_transform(app_test, col, le)
            # As a result, get_dummies one-hot encodes these columns in app_test,
            # and the inner align further down removes them from both frames.

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
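
As a small illustration of the two encodings used above, here is a minimal sketch on a made-up dataframe. The values in `toy` are invented for the example and are not taken from the competition files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],  # two categories
    "Zone": ["RL", "RM", "FV", "RL"],            # three categories
    "Area": [8450, 9600, 11250, 9550],           # already numeric
})

# Label encoding: each category becomes one integer in a single column
encoder = LabelEncoder()
toy["Street"] = encoder.fit_transform(toy["Street"])

# One-hot encoding: each remaining text category gets its own indicator column
encoded = pd.get_dummies(toy)

print(list(encoder.classes_))
print(list(encoded.columns))
```

Label encoding keeps a two-category column as a single 0/1 column, while one-hot encoding adds one column per category, which is why the feature count grows after `pd.get_dummies`.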

In [None]:
train_labels = app_train['SalePrice']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['SalePrice'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
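
To see why the target has to be saved before aligning, consider this minimal sketch with two invented frames. The column names `Roof_Clay` and `Roof_Tile` are hypothetical stand-ins for one-hot columns.

```python
import pandas as pd

# Toy one-hot encoded frames (hypothetical columns, for illustration only)
train = pd.DataFrame({
    "Area": [100, 120],
    "Roof_Clay": [1, 0],   # category that never occurs in the test frame
    "Roof_Tile": [0, 1],
    "SalePrice": [200000, 250000],
})
test = pd.DataFrame({"Area": [110, 90], "Roof_Tile": [1, 1]})

# Save the target first: the inner join drops it, since test has no SalePrice
target = train["SalePrice"]

# Keep only the columns that appear in both frames
train, test = train.align(test, join="inner", axis=1)
print(list(train.columns))

# Restore the target on the training side
train["SalePrice"] = target
print(list(train.columns))
```

After the inner join both frames share exactly the same feature columns, so a model fitted on one can predict on the other.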

In [None]:
# Fill missing values based on probability of occurrence: each gap is filled
# with a value drawn from the observed values of the same column,
# weighted by how often each value occurs
def fill_missing_by_frequency(app):
    for column in app.columns:
        missing = app[column].isna()
        if not missing.any():
            continue
        a, b = np.unique(app.loc[~missing, column], return_counts = True)
        app.loc[missing, column] = np.random.choice(a, missing.sum(), p = b / b.sum())
    return app

app_train = fill_missing_by_frequency(app_train)
app_test = fill_missing_by_frequency(app_test)
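
The idea of filling gaps in proportion to how often each value occurs can be sketched on a single toy column. This is an illustration under stated assumptions, not the notebook's exact code: the helper name `fill_by_frequency` and the values are hypothetical, and a seeded generator is used so the fill is repeatable.

```python
import numpy as np
import pandas as pd

def fill_by_frequency(series, rng):
    """Fill NaNs by sampling from the observed values of the same column,
    weighted by how often each value occurs."""
    missing = series.isna()
    values, counts = np.unique(series[~missing], return_counts=True)
    filled = series.copy()
    filled[missing] = rng.choice(values, size=missing.sum(), p=counts / counts.sum())
    return filled

# Toy column (hypothetical values): 60 is observed three times, 80 once
toy = pd.Series([60.0, 60.0, np.nan, 80.0, 60.0, np.nan])

# A seeded generator makes the random fill repeatable between runs
rng = np.random.default_rng(0)
filled = fill_by_frequency(toy, rng)
print(filled.tolist())
```

Each gap here is filled with 60 with probability 0.75 and with 80 with probability 0.25. Without a fixed seed the filled values, and therefore any model trained on them, differ from run to run.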

### Modelling
I will perform a simple linear regression on the dataset to predict house prices. In order to train our regression model, we first need to split the data into X, which contains the features to train on, and y, which holds the target variable, in this case the SalePrice column.

In [None]:
from sklearn.model_selection import train_test_split

X = app_train.drop(['SalePrice'], axis = 1)
y = app_train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

The scikit-learn train_test_split function splits the data into a training set and a testing set. We are using 80% of the data for training and 20% for testing. train_test_split() returns four objects:

- X_train: the subset of our features used for training
- X_test: the subset which will be our ‘hold-out’ set – what we’ll use to test the model
- y_train: the target variable SalePrice which corresponds to X_train
- y_test: the target variable SalePrice which corresponds to X_test

Now we will import the linear regression class and create an object of that class, which is the linear regression model.

In [None]:
from sklearn import linear_model

lr = linear_model.LinearRegression()

Then we use the fit method to "fit" the model to the dataset. This makes the regressor "study" the data and "learn" from it.

In [None]:
model = lr.fit(X_train, y_train)
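
To make the "learning" concrete: for a linear regression, fitting means estimating one coefficient per feature plus an intercept. The sketch below uses invented data generated from a known rule, so the fitted values can be compared with the ones that produced the data. The names `x`, `y` and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from a known rule: y = 3 * x + 2 (no noise)
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0

toy_model = LinearRegression().fit(x, y)

# "Learning" here means estimating the slope and the intercept from the data
print(toy_model.coef_, toy_model.intercept_)
```

The fitted model exposes the same `coef_` and `intercept_` attributes after `lr.fit(X_train, y_train)`, with one coefficient for each feature column.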

R-squared measures how close the data are to the fitted regression line: it is the proportion of the variance in SalePrice that the model explains, usually expressed on a convenient 0 – 100% scale. On a hold-out set it can turn out negative when the model fits worse than always predicting the mean.

In [None]:
# make predictions based on model
predictions = model.predict(X_test)
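
The R-squared described above can be computed on a hold-out set in two equivalent ways. The following is a minimal sketch on synthetic data rather than the competition files; all the variable names in it are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the competition files): a linear signal plus noise
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([5.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_ho, y_tr, y_ho = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
toy_model = LinearRegression().fit(X_tr, y_tr)

# Two equivalent ways to get R-squared on the hold-out set
r2_from_score = toy_model.score(X_ho, y_ho)
r2_from_metric = r2_score(y_ho, toy_model.predict(X_ho))
print(r2_from_score, r2_from_metric)
```

In this notebook the analogous call would be `model.score(X_test, y_test)`, which evaluates the fitted model on the 20% of rows held out by `train_test_split`.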

In [None]:
submission = pd.DataFrame()
submission['Id'] = app_test['Id'].astype(int)
# Predict on the aligned test features (same columns the model was trained on)
predictions = model.predict(app_test)

In [None]:
submission['SalePrice'] = predictions

In [None]:
submission.to_csv('submission.csv', index = False)

Each time you run this code, the resulting score on the leaderboard will change, because filling in the missing values involves a random component.
- LATEST SCORE is 0.47626 - 0.35608 


- BEST SCORE is 0.35608 

