# An Important Note
I worked on this project during my studies for the Dataquest online Data Science Bootcamp. It belongs to the "Linear Regression For Machine Learning" part of the bootcamp.

# Predicting House Sale Prices
In this project, I'll work with housing data for the city of Ames, Iowa, United States, from 2006 to 2010, and I'll predict house sale prices using Linear Regression. The reason why the data was collected is described at https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627 .

# Importing the Libraries and Packages Which I Need
As the first step, I'll import all the Python libraries and packages which I need in this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Reading In And Exploring The Data

In [2]:
houses = pd.read_csv("AmesHousing.tsv", delimiter="\t")

In [3]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

As can be seen, there are columns with missing values. So I need to do some Feature Engineering to be able to predict sale prices better.

# Feature Engineering
Now I'll handle missing values in the following way:
1. For all columns:
I'll drop any column with more than 5% missing values for now.
2. Text columns:
I'll drop any column with 1 or more missing values for now.
3. Numerical columns:
I'll fill in the columns with missing values, with the mean value of that column.
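
The three rules above can be sketched on a small toy dataframe. The column names and values below are made up for illustration and are not part of the Ames data set. Text columns are selected here as "everything that is not numeric", which is an assumption that happens to hold for this kind of data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the Ames data set).
n = 40
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 30 + [1.0] * 10,   # 75% missing
    "text_with_gap": [None] + ["b"] * (n - 1),       # 2.5% missing
    "text_complete": ["x"] * n,                      # no missing values
    "numeric_with_gap": [np.nan] + [2.0] * (n - 1),  # 2.5% missing
})

# Rule 1: drop every column with more than 5% missing values.
missing = toy.isnull().sum()
toy = toy.drop(columns=missing[missing > len(toy) * 0.05].index)

# Rule 2: drop every text (non-numeric) column with at least one missing value.
text_missing = toy.select_dtypes(exclude=["number"]).isnull().sum()
toy = toy.drop(columns=text_missing[text_missing > 0].index)

# Rule 3: fill the remaining numerical gaps with the mean of each column.
numeric_cols = toy.select_dtypes(include=["number"]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(list(toy.columns))
print(toy.isnull().sum().sum())
```

Only "text_complete" and "numeric_with_gap" survive, and no missing values remain.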

# First Step : 
For all columns: Drop any column with more than 5% missing values for now.

In [4]:
# To see the total number of missing values in each column
total_of_missing = houses.isnull().sum()

# To identify the columns which will be dropped
columns_to_drop = total_of_missing[(total_of_missing > len(houses)*0.05)].sort_values()

# Dropping the columns with more than 5 % of missing values
houses = houses.drop(columns_to_drop.index, axis=1)
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 71 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Area           2930 non-null int64
Street             2930 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       29

The first step is accomplished. Now I have 71 columns in total instead of 82.

# Second Step :
Text columns: Drop any column with 1 or more missing values for now.

In [5]:
# To see the total number of missing values in text columns
text_total_of_missing = houses.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)

# To identify the text columns which will be dropped
columns_to_drop_text = text_total_of_missing[(text_total_of_missing >= 1)].sort_values()

# Dropping the text columns with 1 or more missing values
houses = houses.drop(columns_to_drop_text.index, axis=1)
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 64 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Area           2930 non-null int64
Street             2930 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       29

The second step is accomplished. Now I have 64 columns in total instead of 82.

# Third Step:
Numerical columns:
I'll fill in the columns with missing values, with the mean value of that column.

In [6]:
# To see the total number of missing values in numerical columns
num_total_of_missing = houses.select_dtypes(include=['int', 'float']).isnull().sum()
num_total_of_missing

# To identify the numerical columns which will be fixed
columns_to_be_fixed = num_total_of_missing[(num_total_of_missing < len(houses)*0.05) & (num_total_of_missing > 0)].sort_values()
columns_to_be_fixed

# To compute the mean of each column to be fixed in to a dictionary
fixing_dict = houses[columns_to_be_fixed.index].mean().to_dict()
fixing_dict

# To fill in the missing values and confirm the filling in operation
houses = houses.fillna(fixing_dict)
houses.isnull().sum().value_counts()

0    64
dtype: int64

In [7]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 64 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Area           2930 non-null int64
Street             2930 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       29

Now, I have accomplished the Feature Engineering tasks for missing values. As a result, I have a total of 64 columns instead of the 82 columns in the original data set, and every column has 2930 values, which means there is no missing value in any column.

# Creating New Columns
What new features can I create that better capture the information in some of the existing features? I'll create two new columns which I think will be more useful for prediction: "Years Before Sale" and "Years Since Remod".

In [8]:
years_sold = houses['Yr Sold'] - houses['Year Built']
years_since_remod = houses['Yr Sold'] - houses['Year Remod/Add']

In [9]:
houses["Years Before Sale"] = years_sold
houses["Years Since Remod"] = years_since_remod

Two new columns are created. Let me check if there is any negative value in these two columns. Since they should not have any negative values, I'll remove the row/rows with a negative value in these columns.

In [10]:
print(houses[houses["Years Before Sale"] < 0])

      Order        PID  MS SubClass MS Zoning  Lot Area Street Lot Shape  \
2180   2181  908154195           20        RL     39290   Pave       IR1   

     Land Contour Utilities Lot Config        ...         Screen Porch  \
2180          Bnk    AllPub     Inside        ...                    0   

     Pool Area Misc Val Mo Sold Yr Sold Sale Type  Sale Condition  SalePrice  \
2180         0    17000      10    2007       New         Partial     183850   

      Years Before Sale  Years Since Remod  
2180                 -1                 -2  

[1 rows x 66 columns]


In [11]:
print(houses[houses["Years Since Remod"] < 0])

      Order        PID  MS SubClass MS Zoning  Lot Area Street Lot Shape  \
1702   1703  528120010           60        RL     16659   Pave       IR1   
2180   2181  908154195           20        RL     39290   Pave       IR1   
2181   2182  908154205           60        RL     40094   Pave       IR1   

     Land Contour Utilities Lot Config        ...         Screen Porch  \
1702          Lvl    AllPub     Corner        ...                    0   
2180          Bnk    AllPub     Inside        ...                    0   
2181          Bnk    AllPub     Inside        ...                    0   

     Pool Area Misc Val Mo Sold Yr Sold Sale Type  Sale Condition  SalePrice  \
1702         0        0       6    2007       New         Partial     260116   
2180         0    17000      10    2007       New         Partial     183850   
2181         0        0      10    2007       New         Partial     184750   

      Years Before Sale  Years Since Remod  
1702                  0         

There is only 1 row with a negative value in the "Years Before Sale" column, and there are three rows with a negative value in the "Years Since Remod" column. The row which is negative in "Years Before Sale" is also negative in "Years Since Remod". So the indexes of the rows to be dropped are:
1. 1702
2. 2180
3. 2181

Now, let me drop these rows.

In [12]:
houses = houses.drop([1702, 2180, 2181], axis=0)
houses.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 66 columns):
Order                2927 non-null int64
PID                  2927 non-null int64
MS SubClass          2927 non-null int64
MS Zoning            2927 non-null object
Lot Area             2927 non-null int64
Street               2927 non-null object
Lot Shape            2927 non-null object
Land Contour         2927 non-null object
Utilities            2927 non-null object
Lot Config           2927 non-null object
Land Slope           2927 non-null object
Neighborhood         2927 non-null object
Condition 1          2927 non-null object
Condition 2          2927 non-null object
Bldg Type            2927 non-null object
House Style          2927 non-null object
Overall Qual         2927 non-null int64
Overall Cond         2927 non-null int64
Year Built           2927 non-null int64
Year Remod/Add       2927 non-null int64
Roof Style           2927 non-null object
Roof Matl          
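
As an alternative to typing the index labels by hand, the same kind of rows can be selected with a boolean mask. This is a minimal sketch on a toy dataframe with made-up years; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the notebook's.
toy = pd.DataFrame({
    "Yr Sold": [2007, 2008, 2007, 2010],
    "Year Built": [2008, 1990, 2000, 2005],
    "Year Remod/Add": [2009, 1995, 2008, 2005],
})
toy["Years Before Sale"] = toy["Yr Sold"] - toy["Year Built"]
toy["Years Since Remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# True for rows where either derived column is negative.
negative = (toy["Years Before Sale"] < 0) | (toy["Years Since Remod"] < 0)

# Keep only the rows that are not flagged.
cleaned = toy[~negative]
print(cleaned.index.tolist())  # [1, 3]
```

The mask keeps working if the data changes, whereas hard-coded labels only fit one specific file.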

# Dropping The Columns Which Are Not Useful
Now, I'll drop the columns which are useless for machine learning.

Columns to be dropped:
1. The three columns which I used for creating the two more useful columns, because I do not need them any more
2. The columns that aren't useful for ML
3. The columns that leak data about the final sale

## First Group of Columns To Be Dropped
The first group contains the columns which are used for creating two new columns. These three columns are "Year Built", "Year Remod/Add" and "Yr Sold"

In [13]:
# To drop the First Group of The Columns
not_needed_columns = ["Year Built", "Year Remod/Add", "Yr Sold"]
houses = houses.drop(not_needed_columns, axis=1)

## Second Group of Columns To Be Dropped
The second group contains the columns which are not useful for machine learning. These columns are "PID" and "Order".

In [14]:
# To drop the Second Group of The Columns
not_useful_columns = ["PID", "Order"]
houses = houses.drop(not_useful_columns, axis=1)

## Third Group of Columns To Be Dropped
The third group contains the columns which leak information about the final sale. These columns are "Mo Sold", "Sale Condition" and "Sale Type".

In [15]:
# To drop the Third Group of The Columns
leaking_columns = ["Mo Sold", "Sale Condition", "Sale Type"]
houses = houses.drop(leaking_columns, axis=1)

In [16]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 58 columns):
MS SubClass          2927 non-null int64
MS Zoning            2927 non-null object
Lot Area             2927 non-null int64
Street               2927 non-null object
Lot Shape            2927 non-null object
Land Contour         2927 non-null object
Utilities            2927 non-null object
Lot Config           2927 non-null object
Land Slope           2927 non-null object
Neighborhood         2927 non-null object
Condition 1          2927 non-null object
Condition 2          2927 non-null object
Bldg Type            2927 non-null object
House Style          2927 non-null object
Overall Qual         2927 non-null int64
Overall Cond         2927 non-null int64
Roof Style           2927 non-null object
Roof Matl            2927 non-null object
Exterior 1st         2927 non-null object
Exterior 2nd         2927 non-null object
Mas Vnr Area         2927 non-null float64
Exter Qual    

As can be seen, my Feature Engineering tasks are accomplished and now I have 58 columns which are ready to be used for machine learning.

# Writing A Function Which Does The Feature Engineering For Me
After accomplishing all these tasks for the Feature Engineering, I think that I can write a function which can do all these steps for me.

In [17]:
def transform_features(df):
    # To see the total number of missing values in each column
    total_of_missing = df.isnull().sum()
    # To identify the columns which will be dropped
    columns_to_drop = total_of_missing[(total_of_missing > len(df)*0.05)].sort_values()
    # Dropping the columns with more than 5 % of missing values
    df = df.drop(columns_to_drop.index, axis=1)
    
    # To see the total number of missing values in text columns
    text_total_of_missing = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)
    # To identify the text columns which will be dropped
    columns_to_drop_text = text_total_of_missing[(text_total_of_missing >= 1)].sort_values()
    # Dropping the text columns with 1 or more missing values
    df = df.drop(columns_to_drop_text.index, axis=1)
    
    # To see the total number of missing values in numerical columns
    num_total_of_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    # To identify the numerical columns which will be fixed
    columns_to_be_fixed = num_total_of_missing[(num_total_of_missing < len(df)*0.05) & (num_total_of_missing > 0)].sort_values()
    # To compute the mean of each column to be fixed in to a dictionary
    fixing_dict = df[columns_to_be_fixed.index].mean().to_dict()
    # To fill in the missing values 
    df = df.fillna(fixing_dict)
    
    
    years_sold = df['Yr Sold'] - df['Year Built']
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
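    # Drop the rows where either new column is negative (rows 1702, 2180 and 2181 in this data set)
    df = df.drop(df[(df['Years Before Sale'] < 0) | (df['Years Since Remod'] < 0)].index, axis=0)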

    df = df.drop(["Yr Sold", "PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)
    return df

In [18]:
h_df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
h_df = transform_features(h_df)
h_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Data columns (total 58 columns):
MS SubClass          2927 non-null int64
MS Zoning            2927 non-null object
Lot Area             2927 non-null int64
Street               2927 non-null object
Lot Shape            2927 non-null object
Land Contour         2927 non-null object
Utilities            2927 non-null object
Lot Config           2927 non-null object
Land Slope           2927 non-null object
Neighborhood         2927 non-null object
Condition 1          2927 non-null object
Condition 2          2927 non-null object
Bldg Type            2927 non-null object
House Style          2927 non-null object
Overall Qual         2927 non-null int64
Overall Cond         2927 non-null int64
Roof Style           2927 non-null object
Roof Matl            2927 non-null object
Exterior 1st         2927 non-null object
Exterior 2nd         2927 non-null object
Mas Vnr Area         2927 non-null float64
Exter Qual    

I tested my function "transform_features()" by reading in the same data set under a different name, "*h_df*", and I got the same results as when doing the Feature Engineering step by step on the "houses" dataframe. So, my function works as expected.
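
Comparing two "info()" outputs by eye is easy to get wrong, so a programmatic check can help. The sketch below uses a toy dataframe and a hypothetical stand-in function, "add_total()", to show the idea with "pandas.testing.assert_frame_equal".

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_total(df):
    """Hypothetical stand-in for a multi-step transformation function."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out


raw = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Step-by-step version of the same transformation.
step_by_step = raw.copy()
step_by_step["total"] = step_by_step["a"] + step_by_step["b"]

# Function version.
from_function = add_total(raw)

# Raises an AssertionError describing the first difference, if there is one.
assert_frame_equal(step_by_step, from_function)

# DataFrame.equals gives a plain True/False answer instead.
frames_match = step_by_step.equals(from_function)
print(frames_match)  # True
```

In this notebook, the equivalent check at this point would presumably be "assert_frame_equal(houses, h_df)".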

# Feature Selection
To be able to select the best features and get the best prediction, I'll try to answer the following questions.
1. Which numerical columns should I select?
2. Which categorical columns should I keep? To be able to answer this question, I should also answer the following questions.
   - Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?
   - If a categorical column has hundreds of unique values (or categories), should I keep it? When I dummy code this column, hundreds of columns will need to be added back to the data frame.

## Finding The Answer Of The First Question
Now, I'll try to get the answer of the first question by checking the correlations between the numerical columns and the "SalePrice" column.

In [19]:
# Filtering the numerical columns only
numerical_df = houses.select_dtypes(include=['int', 'float'])

In [20]:
# Finding the absolute values of the correlations between the numerical columns and the "SalePrice" column
abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
abs_corr_coeffs

BsmtFin SF 2         0.006000
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035874
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182248
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276329
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.438928
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.510611
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641675
Total Bsmt SF        0.643601
Garage Cars          0.648411
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: Sale

Let me keep only the columns with a correlation coefficient larger than 0.4. (This cutoff is arbitrary and worth experimenting with later!)

In [21]:
abs_corr_coeffs[abs_corr_coeffs > 0.4]

BsmtFin SF 1         0.438928
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.510611
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641675
Total Bsmt SF        0.643601
Garage Cars          0.648411
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

Let me drop columns with less than 0.4 correlation with "SalePrice" column from the "houses" dataframe.

In [22]:
houses = houses.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)

In [23]:
houses.select_dtypes(include=['int', 'float'])

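| | Overall Qual | Mas Vnr Area | BsmtFin SF 1 | Total Bsmt SF | 1st Flr SF | Gr Liv Area | Full Bath | TotRms AbvGrd | Fireplaces | Garage Cars | Garage Area | SalePrice | Years Before Sale | Years Since Remod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 112.0 | 639.0 | 1080.0 | 1656 | 1656 | 1 | 7 | 2 | 2.0 | 528.0 | 215000 | 50 | 50 |
| 1 | 5 | 0.0 | 468.0 | 882.0 | 896 | 896 | 1 | 5 | 0 | 1.0 | 730.0 | 105000 | 49 | 49 |
| 2 | 6 | 108.0 | 923.0 | 1329.0 | 1329 | 1329 | 1 | 6 | 0 | 1.0 | 312.0 | 172000 | 52 | 52 |
| 3 | 7 | 0.0 | 1065.0 | 2110.0 | 2110 | 2110 | 2 | 8 | 2 | 2.0 | 522.0 | 244000 | 42 | 42 |
| 4 | 5 | 0.0 | 791.0 | 928.0 | 928 | 1629 | 2 | 6 | 1 | 2.0 | 482.0 | 189900 | 13 | 12 |
| 5 | 6 | 20.0 | 602.0 | 926.0 | 926 | 1604 | 2 | 7 | 1 | 2.0 | 470.0 | 195500 | 12 | 12 |
| 6 | 8 | 0.0 | 616.0 | 1338.0 | 1338 | 1338 | 2 | 6 | 0 | 2.0 | 582.0 | 213500 | 9 | 9 |
| 7 | 8 | 0.0 | 263.0 | 1280.0 | 1280 | 1280 | 2 | 5 | 0 | 2.0 | 506.0 | 191500 | 18 | 18 |
| 8 | 8 | 0.0 | 1180.0 | 1595.0 | 1616 | 1616 | 2 | 5 | 1 | 2.0 | 608.0 | 236500 | 15 | 14 |
| 9 | 7 | 0.0 | 0.0 | 994.0 | 1028 | 1804 | 2 | 7 | 1 | 2.0 | 442.0 | 189000 | 11 | 11 |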
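
Since the 0.4 cutoff is arbitrary, one way to experiment is to list which features survive at several thresholds. The sketch below uses synthetic data; the feature names "strong", "weak" and "noise" are hypothetical, and only "SalePrice" mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: three features with different strengths of relation to the target.
rng = np.random.default_rng(0)
n = 200
price = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": price + rng.normal(scale=0.3, size=n),  # closely follows the target
    "weak": price + rng.normal(scale=3.0, size=n),    # loosely follows the target
    "noise": rng.normal(size=n),                      # unrelated to the target
    "SalePrice": price,
})

# Absolute correlation of every feature with the target (the target itself is removed).
abs_corr = toy.corr()["SalePrice"].abs().drop("SalePrice")

# Show which features would be kept at each candidate threshold.
for threshold in (0.2, 0.4, 0.6):
    kept = abs_corr[abs_corr > threshold].index.tolist()
    print(threshold, kept)
```

Each threshold gives a different feature set, which can then be scored with the same RMSE procedure to pick a cutoff.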


## Finding The Answer Of The Second Question
Now, I'll make a list of column names from documentation that are *meant* to be categorical. After that, I'll work on this list to get the answer of the questions.

In [24]:
# Making a list of the columns which are "categorical"
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

Note that I made this list from the documentation of the original data set, but I have dropped some of these columns along the way. So, I should find which of the columns in the "nominal_features" list are still in the "houses" dataframe.

In [25]:
houses_cat_cols = []
for col in nominal_features:
    if col in houses.columns:
        houses_cat_cols.append(col)
houses_cat_cols

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Foundation',
 'Heating',
 'Central Air']

Since the number of unique values in the categorical columns is important for me, I'll find the number of unique values in each categorical column of the dataframe.

In [26]:
uniqueness_counts = houses[houses_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
uniqueness_counts

Street           2
Central Air      2
Land Contour     4
Lot Config       5
Bldg Type        5
Roof Style       6
Foundation       6
Heating          6
MS Zoning        7
Condition 2      8
House Style      8
Roof Matl        8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

Having a large number of unique values in a categorical column may be a problem, because each unique value becomes its own dummy column, so I need a cutoff point. For now, I'll choose "10" as the cutoff and drop the columns which have more than 10 unique values. This cutoff is worth experimenting with later.

In [27]:
drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > 10].index
houses = houses.drop(drop_nonuniq_cols, axis=1)
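
For columns without missing values, "DataFrame.nunique()" gives the same counts as "len(col.value_counts())". A minimal sketch of the cutoff on toy data follows; the category values and the cutoff of 3 are made up to suit the small example.

```python
import pandas as pd

# Toy categorical data (hypothetical values).
toy = pd.DataFrame({
    "Street": ["A", "B", "A", "A"],
    "Neighborhood": ["A", "B", "C", "D"],
})

# Number of distinct values per column, smallest first.
counts = toy.nunique().sort_values()

# Hypothetical cutoff, kept small to suit the toy data.
uniq_threshold = 3
high_cardinality = counts[counts > uniq_threshold].index
reduced = toy.drop(columns=high_cardinality)

print(counts.to_dict())       # {'Street': 2, 'Neighborhood': 4}
print(list(reduced.columns))  # ['Street']
```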

Up to now, I have completed the selection of numerical and categorical features. But I still have text columns to deal with. Now, I'll convert these text columns into categorical columns by using the "df.astype()" and "pd.get_dummies()" functions. After that, I'll drop the original text columns from the dataframe.

In [28]:
# Selecting and converting the "text" columns into "categorical" columns.
text_cols = houses.select_dtypes(include=['object'])
for col in text_cols:
    houses[col] = houses[col].astype('category')

In [29]:
# Creating "dummy" columns and adding them back to the dataframe. After that dropping the "text" columns.
houses = pd.concat([
    houses, 
    pd.get_dummies(houses.select_dtypes(include=['category']))
], axis=1).drop(text_cols,axis=1)

In [30]:
houses.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Columns: 130 entries, Overall Qual to Paved Drive_Y
dtypes: float64(5), int64(9), uint8(116)
memory usage: 674.6 KB
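
To make the effect of "pd.get_dummies()" concrete, here is a minimal sketch on a toy dataframe with made-up values. Each category becomes its own indicator column named "<column>_<category>".

```python
import pandas as pd

# Toy data (hypothetical values); the column names mirror the notebook's.
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "SalePrice": [200000, 120000, 180000],
})

# One indicator column per category of "Central Air".
dummies = pd.get_dummies(toy[["Central Air"]])

# Replace the text column with its indicator columns.
encoded = pd.concat([toy.drop(columns=["Central Air"]), dummies], axis=1)

print(list(encoded.columns))  # ['SalePrice', 'Central Air_N', 'Central Air_Y']
```

The output above lists the dummy columns as "uint8". Newer pandas versions return boolean dummy columns by default, so the dtype summary may look different when the notebook is re-run.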


# Writing A Function Which Does The Feature Selecting For Me
After accomplishing all these tasks for the Feature Selecting, I think that I can write a function which can do all these steps for me.

In [31]:
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    abs_corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values()
    df = df.drop(abs_corr_coeffs[abs_corr_coeffs < coeff_threshold].index, axis=1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)

    uniqueness_counts = df[transform_cat_cols].apply(lambda col: len(col.value_counts())).sort_values()
    drop_nonuniq_cols = uniqueness_counts[uniqueness_counts > uniq_threshold].index
    df = df.drop(drop_nonuniq_cols, axis=1)
    
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols,axis=1)
    
    return df

In [32]:
h_df = select_features(h_df)
h_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2927 entries, 0 to 2929
Columns: 130 entries, Overall Qual to Paved Drive_Y
dtypes: float64(5), int64(9), uint8(116)
memory usage: 674.6 KB


I tested my function "select_features()" on the "*h_df*" dataframe and I got the same results as when doing the Feature Selecting step by step on the "houses" dataframe. So, my function works as expected.

# Creating The Linear Regression Model To Predict  Sale Prices
After writing two functions for "Feature Engineering" and "Feature Selecting", I'll now write another function which creates the Linear Regression model, predicts the sale prices and returns the average Root Mean Squared Error (RMSE).

In [33]:
def train_and_test(df, k=0):
    numeric_df = df.select_dtypes(include=['integer', 'float'])
    features = numeric_df.columns.drop("SalePrice")
    lr = linear_model.LinearRegression()
    
    if k == 0:
        train = df[:1460]
        test = df[1460:]

        lr.fit(train[features], train["SalePrice"])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)

        return rmse
    
    if k == 1:
        # Randomize *all* rows (frac=1) from `df`, then split the shuffled rows into two parts
        shuffled_df = df.sample(frac=1)
        train = shuffled_df[:1460]
        test = shuffled_df[1460:]
        
        lr.fit(train[features], train["SalePrice"])
        predictions_one = lr.predict(test[features])        
        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        lr.fit(test[features], test["SalePrice"])
        predictions_two = lr.predict(train[features])        
       
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        avg_rmse = np.mean([rmse_one, rmse_two])
        print(rmse_one)
        print(rmse_two)
        return avg_rmse
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index, in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse
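
As a reminder of what the function measures, RMSE is the square root of the mean of the squared prediction errors. A minimal sketch with made-up prices and predictions, computed by hand with NumPy:

```python
import numpy as np

# Hypothetical actual prices and predictions.
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0])

# Mean of the squared errors, then its square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(rmse)  # every prediction is off by 10000, so the RMSE is 10000.0
```

Because RMSE is in the same unit as the target, a value of about 29000 reads as a typical error of about 29000 dollars per house.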

# Predicting The Sale Prices And Calculating The Average Root Mean Squared Error
Now, it is time to predict the "SalePrice". I'll do it in three different ways.
1. By using the "train_and_test()" function on the "houses" dataframe, which I prepared for prediction step by step.
2. By using the "train_and_test()" function on the "h_df" dataframe, which was prepared for prediction by the functions.
3. By reading in the same data set under a different name to be able to use the three functions together.

## First Way
I'll use the "train_and_test()" function on the "houses" dataframe.

In [34]:
train_and_test(houses, k=4)

[25137.50790272022, 28731.10463146569, 37024.624712659286, 26319.184885512903]


29303.105533089525

## Second Way
I'll use the "train_and_test()" function on the "h_df" dataframe.

In [35]:
train_and_test(h_df, k=4)

[28503.430122903304, 24673.466170039697, 26045.5696346681, 36324.24552276996]


28886.677862595265

## Third Way
I'll use three functions together.

In [36]:
data = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transformed_features = transform_features(data)
selected_features = select_features(transformed_features)
rmse = train_and_test(selected_features, k=4)
rmse

[27921.29246392514, 23741.37824369794, 26765.53099622498, 36668.17673186331]


28774.09460892784
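
The three averages differ slightly even though the data was prepared in the same way. A likely reason is that "KFold(shuffle=True)" draws a new random split on every call when no "random_state" is given. The sketch below uses synthetic data and a hypothetical helper, "average_rmse()", to show that fixing the seed makes the result repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (not the Ames data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=120)


def average_rmse(X, y, k=4, seed=1):
    """Hypothetical helper: average RMSE over k shuffled folds with a fixed seed."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], predictions)))
    return float(np.mean(rmses))


first = average_rmse(X, y)
second = average_rmse(X, y)
print(first == second)  # True: the same seed gives the same folds
```

With a fixed seed, differences between runs come from the features or the model rather than from the random split.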