## Machine Learning with Housing Regression from Kaggle 2: Advanced Techniques

### In this part, we identify the columns that contain NaN (null) values, then drop those columns so that no NaN values remain.

#### 1: Import pandas and our training/testing data files.

#### 2: Create a list of the column names and create an empty list to store the names of columns that have NaN values.

#### 3: Use a for loop to go through the list of column names, check each column for NaN values, and add the name of every column that has any to the empty list.

#### 4: Print the list of columns that have empty values.

#### 5: Create a new DataFrame with the same shape as data but with True and False replacing the actual values, True marking a NaN value.

#### 6: Create a Series (a one-dimensional array with axis labels, here the column names) that holds the number of True values in each column, then print it.

#### 7: Print only the columns that are known to have NaN values, along with their counts, by combining both methods above.

#### 8: Drop the columns that have NaN values and print the new column list for both the training and testing data.


In [1]:
import pandas as pd #imports pandas for data manipulation
train_file_path = 'train.csv' #declares a file path for the training data that will be used
test_file_path = 'test.csv'
data = pd.read_csv(train_file_path) # creates a dataframe object using the filepath
testdata = pd.read_csv(test_file_path) # creates a dataframe object for the testing data

columns = data.columns #creates a list of columns (as strings)
columns_with_empty = [] #empty list of columns that have empty values
for i in columns: #for each "i" in the list of columns
    if data[i].isnull().any(): #if that column (data[i]) has any NaN values, add it to the columns_with_empty list
        columns_with_empty.append(i)
print('These are the columns that have empty values:\n\n'+ str(columns_with_empty) + '\n') #print all columns with empty values

dataNaN = data.isnull() #creates a new dataframe identical in size to 
#data but with true and false replacing the actual values in each data frame (true being a missing value)

columnSum = dataNaN.sum() #creates a series (One-dimensional ndarray with axis labels) of 
#every column and its sum (number of true, or previously missing values)

print('These are the numbers of \'NaN\' values in for every column in the data:\n\n' + str(columnSum) + '\n')
#help(columnSum) used to determine what kind of object it is and how it works, what can be done with it etc.

print(data[columns_with_empty].isnull().sum()) #uses both methods above to print a series of only the columns with an empty
#value and the number of empty values in each of those columns

data_without_nan = data.dropna(axis = 'columns') #drops every column that has a NaN value; data.dropna() with no arguments drops the rows that have a NaN value instead
print('These are the new data columns with the columns that had \'NaN\' values dropped\n\n' + str(data_without_nan.columns)+'\n') 
# prints the new data columns without the columns that had null values

testdata_without_nan = testdata.drop(columns_with_empty, axis = 'columns') # drops the columns from test data too, then prints it below
print(testdata_without_nan.columns)



These are the columns that have empty values:

['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

These are the numbers of 'NaN' values in for every column in the data:

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
Ext
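The loop in steps 2 to 4 can also be written without a loop, and it may be worth checking the testing data for NaN values of its own, since it can be missing values in columns that are complete in the training data. Below is a minimal sketch of that idea. The toy DataFrames, their values, and the helper name `columns_with_nan` are made up for illustration and are not part of the Kaggle files.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and testing data (values are invented)
toy_train = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'LotFrontage': [65.0, np.nan, 68.0],
    'GarageCars': [2, 2, 1],
})
toy_test = pd.DataFrame({
    'LotArea': [11622, 14267, 13830],
    'LotFrontage': [80.0, 81.0, 74.0],
    'GarageCars': [1.0, np.nan, 2.0],
})

def columns_with_nan(frame):
    """Return the names of the columns that hold at least one NaN value."""
    nan_counts = frame.isnull().sum()  # Series: column name -> number of NaN values
    return nan_counts[nan_counts > 0].index.tolist()

train_nan = columns_with_nan(toy_train)
test_nan = columns_with_nan(toy_test)

# Drop every column that has a NaN value in either dataset,
# so both end up with the same columns
to_drop = sorted(set(train_nan) | set(test_nan))
toy_train_clean = toy_train.drop(columns=to_drop)
toy_test_clean = toy_test.drop(columns=to_drop)

print(to_drop)
print(list(toy_train_clean.columns))
```

Dropping the union of both lists keeps the two datasets aligned, which matters once a model trained on one has to predict on the other.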

### Imputation: Imputation 'infers' a value to fill in for the missing value so that the column can still be used. 

#### 9: Import imputer for later use

#### 10: Create a copy of the data and remove everything except the number-only columns to help imputation, then remove the SalePrice column as well (we are going to be predicting it).

#### 11: Impute the data, storing it into a new DataFrame
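To make "infers a value" concrete, here is a small sketch of mean imputation on a toy DataFrame. It uses `SimpleImputer` from `sklearn.impute`, which is the imputer class that current scikit-learn versions provide. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing LotFrontage value (values are invented)
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0],
    'LotArea': [8000, 9000, 10000],
})

imputer = SimpleImputer(strategy='mean')  # 'mean' is also the default strategy
# fit_transform returns a NumPy array, so wrap it to get the column names back
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The missing LotFrontage value is replaced with the mean of the values that are present in that column, (60 + 80) / 2 = 70, and the LotArea column is left unchanged.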

### Now we are doing an extension on imputation:

#### 12: Once again, find the columns with missing values and store them in a list. Then, for each of those columns, create a new column that records whether each value was NaN (and therefore imputed).

#### 13: Impute the data.

In [2]:
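from sklearn.impute import SimpleImputer as Imputer #sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it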

newdata = data.copy() #creates a copy of the data called newdata
newdata = newdata.select_dtypes(exclude=['object']) #keeps only the numeric columns (integer and float) for simplification
newdata = newdata.drop(['SalePrice'], axis = 'columns') #drops the saleprice column as that's what we're PREDICTING

my_imputer = Imputer() #creates an imputer
#help(my_imputer) tells us parameters and other useful info about the imputers
data_with_imputed_values = my_imputer.fit_transform(newdata) #has the imputer impute for the NaN values

#Imputation with an extension:

cols_with_missing = [] #defines an empty list for the columns with missing values
for i in newdata.columns:
    if newdata[i].isnull().any(): #if any value in the given column is null add it to the list
        cols_with_missing.append(i)
for col in cols_with_missing: #going through each column that had a missing value, 
    #create a new column to show which values will be imputed
    newdata[col + '_was_missing'] = newdata[col].isnull()
    
print(newdata.columns) #print all the columns (it will show the new _was_missing columns and not the dropped ones)
print(newdata.LotFrontage) #print the lotFrontage data to show the NaN values
print(newdata.LotFrontage_was_missing) #print the wasmissing for lotfrontage to show how they match up

newdata = my_imputer.fit_transform(newdata) #imputes the data (the result is a NumPy array, not a DataFrame)


Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'LotFrontage_was_missing',
       'MasVnrArea_was_missing', 'GarageYrBlt_was_missing'],
      dtype='object')
0        65.0
1        80.0
2        68.0
3        60.0
4        84.0
5        85.0
6        75.0
7         NaN
8        51.0
9        50.0
10       70.0
11       85.0
12        NaN
13       91.0
14        NaN
15       51.0
16        NaN
17       72.0
18       66.0
19       70.0
20      101.0
21       57.0
22       75.0
23    
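The hand-made `_was_missing` columns have a built-in counterpart in newer scikit-learn versions: `SimpleImputer` accepts `add_indicator=True`, which appends one indicator column for every feature that had missing values when the imputer was fit. The sketch below shows this on a toy DataFrame with invented values. It is an alternative to the loop in the cell above, not what that cell does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: only LotFrontage has a missing value (values are invented)
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0],
    'LotArea': [8000, 9000, 10000],
})

# add_indicator=True appends a 0/1 column per feature that had NaN values at fit time
imputer = SimpleImputer(strategy='mean', add_indicator=True)
result = imputer.fit_transform(toy)

# Column positions (in the input) that received an indicator column
flagged = [toy.columns[i] for i in imputer.indicator_.features_]

print(result)
print(flagged)
```

The output array holds the two imputed columns first, followed by one indicator column for LotFrontage. LotArea gets no indicator because it had nothing missing.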

### Bringing it all together for some final model analysis:

#### 14: Import necessary packages.

#### 15: Import the dataset as a DataFrame, and identify the y value.

#### 16: Drop the columns that are 'objects' (not numbers) so that the imputation and model can run correctly, and drop the SalePrice column as well. Store the result as newtraindata, which will be used for the later models, and create a copy called x for the current model (dropping columns).

#### 17: Identify every column that has a NaN value, store its name in a list, and then drop those columns from x.

#### 18: Split the data into training and validation sets, then create and train a random forest model and calculate the mean absolute error (MAE).

#### 19: Create a new object, imputeddata, as a copy of newtraindata, which has the SalePrice column and the non-number columns dropped.

#### 20: Apply the imputation, split the data, train the model, and print the MAE.

#### 21: Finally, create another copy of newtraindata, imputeddata2, and for every column that had a missing value add a new boolean column that records whether or not each value was missing.

#### 22: Create a new model and print the MAE using this dataset.

### Overall, we can see that removing the columns actually had a slightly better effect on the MAE than the imputation or the enhanced imputation, but the enhanced imputation was very slightly better than the regular one.
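One thing to keep in mind when comparing these approaches: if the imputer is fit on the full dataset before the split, the column means it fills in are computed partly from the validation rows. A common way around that is to put the imputer and the model in a scikit-learn pipeline, so the imputer is fit on the training split only. The sketch below does this on synthetic data. The helper name `validation_mae`, the toy data, and the smaller `n_estimators` value are assumptions chosen to keep the example short and fast, and they are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def validation_mae(x, y, add_indicator=False):
    """Split, then impute and train inside a pipeline, and return the validation MAE."""
    train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
    model = make_pipeline(
        SimpleImputer(add_indicator=add_indicator),  # fit on train_x only
        RandomForestRegressor(n_estimators=50, random_state=0),
    )
    model.fit(train_x, train_y)
    return mean_absolute_error(val_y, model.predict(val_x))

# Synthetic stand-in for the housing data (not the Kaggle file)
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
toy_y = 3 * toy_x['a'] - 2 * toy_x['b'] + rng.normal(scale=0.1, size=200)
toy_x.loc[rng.random(200) < 0.2, 'c'] = np.nan  # knock out roughly 20% of column c

mae_plain = validation_mae(toy_x, toy_y)
mae_flagged = validation_mae(toy_x, toy_y, add_indicator=True)
print(mae_plain, mae_flagged)
```

Setting `random_state` on the random forest also makes the scores repeatable, which helps when the differences between approaches are small.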

In [18]:
import pandas as pd
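from sklearn.impute import SimpleImputer as Imputer #sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it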
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split #imports

trainpath = 'train.csv'
traindata = pd.read_csv(trainpath)
y = traindata.SalePrice #creates the traindata DataFrame from csv and the y values

newtraindata = traindata.select_dtypes(exclude=['object']) #removes any non number values from the traindata and stores it into a new value, newtraindata
newtraindata = newtraindata.drop(['SalePrice'], axis = 'columns') #removes the SalePrice column(y values) from the x values
x = newtraindata.copy() #x is a copy of newtraindata to be used for the first model


columns_with_empty = [] #defines an empty list of columns that have empty values
print(x.columns) #print the columns that are still in the data 

for i in x.columns:
    if x[i].isnull().any():
        columns_with_empty.append(i) #goes through every column and if it has an NaN value, appends it to columns_with_empty
x = x.drop(columns_with_empty, axis = 'columns') #from x specifically for the first model, removes any columns that have empty values 
print(columns_with_empty) #print the columns that were just removed       
print(x.columns) #print the columns that are left

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 0) #splits the data into training and validation sets

RFmodel1 = RandomForestRegressor(n_estimators = 2000)
RFmodel1.fit(train_x, train_y)
predicted = RFmodel1.predict(val_x)
#creates and trains the model, we've done this before
print(mean_absolute_error(val_y, predicted)) #print the MAE of actual y values and the model predicted ones

imputeddata = newtraindata.copy() #creates another copy of newtraindata for the second model
myimputer = Imputer() #creates the imputer
imputeddata = myimputer.fit_transform(imputeddata) #imputes the data, filling in the values that are NaN

train_x, val_x, train_y, val_y = train_test_split(imputeddata, y, random_state=0) #splits the data *after* imputing

RFmodel2 = RandomForestRegressor(n_estimators = 2000)
RFmodel2.fit(train_x, train_y)
predicted = RFmodel2.predict(val_x)

print(mean_absolute_error(val_y, predicted)) #print the MAE of actual y values and the model predicted ones

imputeddata2 = newtraindata.copy() #creates a final copy of newtraindata for the third model

for i in columns_with_empty:
    imputeddata2[i + '_was_empty'] = imputeddata2[i].isnull() #adds columns for the columns that had missing values and which ones were missing with booleans

imputeddata2 = myimputer.fit_transform(imputeddata2) #impute the data

train_x, val_x, train_y, val_y = train_test_split(imputeddata2, y, random_state = 0) #splits the data that has the extra _was_empty columns

RFmodel3 = RandomForestRegressor(n_estimators = 2000)
RFmodel3.fit(train_x, train_y)
predicted = RFmodel3.predict(val_x)

print(mean_absolute_error(val_y, predicted)) #print the MAE of actual y values and the model predicted ones

print("""As you can see, removing the categories was actually much more effective for this dataset, 
likely due to the low amount of columns that had missing data. You can also see that the Imputation 
with extension slightly improved upon regular Imputation.""")



    
        

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold'],
      dtype='object')
['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
Index(['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
    