In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('files/train.csv')
df = df.sample(frac = 1)

In [3]:
count_nan = len(df) - df.count()

In [4]:
df.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


# Check for NaN columns..


In [5]:
for cols in df.columns:
    if df[cols].isnull().any():
        print(cols)

LotFrontage
Alley
MasVnrType
MasVnrArea
BsmtQual
BsmtCond
BsmtExposure
BsmtFinType1
BsmtFinType2
Electrical
FireplaceQu
GarageType
GarageYrBlt
GarageFinish
GarageQual
GarageCond
PoolQC
Fence
MiscFeature


# Let's go through how I might want to treat each column, before I get into dealing with NaN's...

## MSSubClass

The values it takes are kind of weird considering its categorical data. If I decide to use a decision tree-based algorithm then it's fine. If not I'll need to at least normailize this one.

## MSZoning 
Should be one-hot encoded.

## LotFrontage, LotArea
Numerical data. We're dealing with a bunch of different units here, and if I wasn't working with a decision tree I'd really need to be careful to normalize and standardize the data.

## Street, Alley, LandContour, LotConfig, Neighborhood, Condition1, Condition2, BldgType, HouseStyle
Categorical. To be one-hot encoded.

## OverallCond and OverallQual
Numerical.

## YearBuilt, YearRemodAdd
Numerical data. Need to think about how I should treat this. Could maybe base them off of how many days ago they were built / remodeled.

## RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType
All categorical.

## MasVnrArea
Numerical.

## LotShape, Utilities, LandSlope, ExterQual, ExterCond
Categorical, but admits an ordered set, with Excellent > Good > Average/Typical etc.. so can convert to numerical to save on columns

## Foundation
Categorical.

## BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1
Categorical, but again admitting an ordered set, so should be converted into numerical data to save on columns.

## BsmtFinSF1
Numerical.

## BsmtFinType2
Categorical, but admitting an ordered set -- convert to numeric.

## BsmtFinSF2, BsmtUnfSF, TotalBsmtSF
Numerical.

## Heating
Categorical.

## HeatingQC
Categorical, but ordered set so convert to numeric.

## CentralAir, Electrical
Categorical.

## 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath,  BsmtHalfBath, FullBath, HalfBath, Bedroom, Kitchen
Numerical

## KitchenQual
Categoric, but admits ordered set, so convert to numerical.

## TotRmsAbvGrd
Numerical.

## Functional
Categorical, but admits ordered set. Convert to numerical.

## Fireplaces
Numerical.

## FireplaceQu
Categorical -- convertable to numerical.

## GarageType
Categorical.

## GarageFinish
Categorical -- convertable to numerical.

## GarageYrBlt
Possibly convert to 'how many days ago'.

## GarageCars, GarageArea
Numerical.

## GarageQual, GarageCond
Categorical -- convertable to numerical.

## PavedDrive
I think this admits an ordered set in terms of 'paved-ness'. So possibly convertable to numerical, although could be safe and just keep it categorical -- only like 3 unique values.

## WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea
Numerical.

## PoolQC, Fence
Categorical -- convertable to numerical.

## MiscFeature
Categorical.

## MiscVal
Numerical.

## MoSold , YrSold
MoSold might not be worth the effort to include. I won't want to one-hot encode 12 columns for each month, and I doubt the month something was sold can be worth the 12 columns I'd be costing the training data. YrSold I can express in terms of 'days since the present'. I could maybe express MoSold as numerical data of values between 1 and 12. But I don't think that's justifiable because it's weird to say that month 2 > month 1 or something.

## SaleType, SaleCondition
Categorical.

# Okay, with that, here are my next steps:

There isn't a whole lot of rows to this data, so if I add too many columns we start worrying about big-p, little-n issues. I need to try and impute my NaN's instead of deleting those rows so that the little-n doesn't get even smaller. I'll also really want to favor converting categorical data that admits an ordered set into numerical data so I can save up on columns.
1. Create a function to deal with missing NaN values sample-by-sample by selecting a subset of the data that shares as many characteristics as possible with a sample and picking the most likely value to impute given this. 
2. Create a function that converts categorical-but-numerical-convertable columns into numeric columns.
3. Deal with time-series columns.

It might be worth trying to find similar columns by one-hot encoding and normalizing data, and seeing which samples have the highest cosine similarity to the one in question.. I'd need to normalize numerical data, or else different magnitudes would have disproportionate influence on the the cosine similarity..

# One-hot encoding categorical data

Also going to drop the MoSold column.

In [6]:
df = df.drop(columns = ['MoSold'])

In [7]:
to_one_hot = [
    'MiscFeature',
    'SaleType',
    'SaleCondition',
    'CentralAir', 
    'Electrical',
    'Heating',
    'Foundation',
    'RoofStyle', 
    'RoofMatl', 
    'Exterior1st', 
    'Exterior2nd', 
    'MasVnrType',
    'Street', 
    'Alley', 
    'LandContour', 
    'LotConfig', 
    'Neighborhood', 
    'Condition1', 
    'Condition2', 
    'BldgType', 
    'HouseStyle',
    'MSZoning',
    'MSSubClass',
    'GarageType',
    'PavedDrive'
    ]

In [8]:
time_series_columns = ['YrSold','YearBuilt','YearRemodAdd','GarageYrBlt']

And now for the data I want to make numeric.

In [9]:
numerical = [
    'MiscVal',
    'WoodDeckSF', 
    'OpenPorchSF', 
    'EnclosedPorch', 
    '3SsnPorch', 
    'ScreenPorch', 
    'PoolArea',
    'GarageCars', 
    'GarageArea',
    'Fireplaces',
    'TotRmsAbvGrd',
    '1stFlrSF', 
    '2ndFlrSF', 
    'LowQualFinSF', 
    'GrLivArea', 
    'BsmtFullBath',  
    'BsmtHalfBath', 
    'FullBath', 
    'HalfBath', 
    'BedroomAbvGr', 
    'KitchenAbvGr',
    'BsmtFinSF2', 
    'BsmtUnfSF', 
    'TotalBsmtSF',
    'BsmtFinSF1',
    'MasVnrArea',
    'LotFrontage', 
    'LotArea',
    'OverallCond',
    'OverallQual',
      
]

In [10]:
columns_so_far = to_one_hot + time_series_columns + numerical

In [11]:
full_columns = list(df.columns)

In [12]:
to_numerical = list(set(columns_so_far).symmetric_difference(full_columns))

In [13]:
to_numerical.remove('SalePrice')
to_numerical.remove("Id")

In [14]:
to_numerical

['LandSlope',
 'GarageFinish',
 'ExterQual',
 'GarageCond',
 'ExterCond',
 'Utilities',
 'FireplaceQu',
 'KitchenQual',
 'BsmtQual',
 'HeatingQC',
 'PoolQC',
 'BsmtFinType2',
 'BsmtFinType1',
 'Functional',
 'GarageQual',
 'Fence',
 'LotShape',
 'BsmtExposure',
 'BsmtCond']

## One-hot encode the categorical columns

In [15]:
df = pd.get_dummies(df,columns = to_one_hot)

## Convert the to_numerical data to numerical

In [16]:
import json 
with open('files/to_numerical_json.json', 'r') as f:   
    ordered_categories = json.load(f)

In [17]:
ordered_categories

{'Fence': ['GdPrv', 'MnPrv', 'GdWo', 'MnWw', 'NA'],
 'LotShape': ['Reg', 'IR1', 'IR2', 'IR3'],
 'Utilities': ['AllPub', 'NoSewr', 'NoSeWa', 'ELO'],
 'LandSlope': ['Gtl', 'Mod', 'Sev'],
 'BsmtQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'],
 'BsmtCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'],
 'BsmtExposure': ['Gd', 'Av', 'Mn', 'No', 'NA'],
 'BsmtFinType1': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'NA'],
 'BsmtFinType2': ['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'NA'],
 'HeatingQC': ['Ex', 'Gd', 'TA', 'Fa', 'Po'],
 'KitchenQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po'],
 'Functional': ['Typ', 'Min1', 'Min2', 'Mod', 'Maj1', 'Maj2', 'Sev', 'Sal'],
 'FireplaceQu': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'],
 'GarageFinish': ['Fin', 'RFn', 'Unf', 'NA'],
 'ExterCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po'],
 'ExterQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po'],
 'GarageQual': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'],
 'GarageCond': ['Ex', 'Gd', 'TA', 'Fa', 'Po', 'NA'],
 'PoolQC': ['Ex', 'Gd', 'TA', 'Fa', 'NA']}

In [18]:
from funcs.conv_to_numerical import conv_to_numerical

In [19]:
df = conv_to_numerical(df, ordered_categories)

## Convert dates into 'days ago'
Going to assume all houses listed for a given year were listed on January 1st of that year. Because I'm omitting month, the time during the year won't matter.

In [20]:
from funcs.get_days_ago import get_days_ago

In [21]:
for cols in time_series_columns:
    df[cols] = [get_days_ago(year) for year in df[cols]]

# Normalize the numerical columns for the cosine similarity steps
Normalizing is not generally a good idea for a decision tree-based model, so I'll probably make it separate to df. I'll also add the time-series columns to the numerical data now that I've converted it to numerical data.

In [22]:
all_numerical = to_numerical + numerical + time_series_columns

In [23]:
y = df['SalePrice']
df = df.drop(columns = ['SalePrice'])

In [24]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
#cos_sim_df = cos_sim_df.drop(columns = numerical)
df[all_numerical] = min_max_scaler.fit_transform(df[all_numerical])

## Now to implement the cosine similarity function to find a subset of data most similar to a sample with one or more NaN values. Just need to find the cosine similarity between the sample with NaN of interest with all other columns, dealing with NaN's in the cosine similarity (either make the similarity zero or NaN or something) and get a subset of the data which has a cosine similarity score above a certain threshold. From that subset, take the mean/median of the values of the subset to fill in the NaN for that sample.

In [25]:
from funcs.get_most_similar_rows import get_most_similar_rows

results = get_most_similar_rows(df)
    

[nan nan nan]
[nan nan]
[nan nan nan nan nan nan]


ValueError: One of the values is not NaN

## My NaN-detecting code is not working properly. Need to make sure this works 

In [None]:
with  open("files/most_sim_rows.json", "w") as f:
    json.dump(results, f)

In [None]:
'33.0' in list(results.keys())

True

In [None]:
df.isnull().sum(axis = 1).sum()


4100

In [None]:
'617.0' in list(results.keys())

True

In [None]:
# In order to get the 'best' subset, take the top nth (maybe 70th?) percentile of cos_sim scores as the subset. Then, just form a subset of df 
# with those indices that pass.
from scipy.stats import normaltest

cutoff = 0.75
alpha = 1e-3


for indices in df['Id']:
    #print(indices in list(results.keys()))
    indices = float(indices)
    #print(results[indices])
    #try:
    #print(indices in list(results.keys()))
    to_compare = [int(float(x)) for x in list(results.keys())]
    if indices in to_compare:
        #print(results_indices)
        indices_as_str = str(indices)
        most_similar = [x[0] for x in results[indices_as_str] if x[1] >= cutoff]
        columns_to_fill = [x[2] for x in results[indices_as_str] if x[1] >= cutoff]
        subset = df.iloc[most_similar]
        #print(columns_to_fill)
        columns_to_fill = list(set(tuple(row) for row in columns_to_fill))
        #print('columns to fill:', columns_to_fill)
        for col_list in columns_to_fill: # Don't want to use 'mean' on categorical data
            for col in col_list:
                print(indices, col)
                col_name = df.columns[col]
                print('before at index',indices,'and column', col, col_name,'sample is ',df.loc[indices, col_name])
                if not pd.isnull(df.loc[indices, col_name]):
                    raise ValueError(df.loc[indices, col_name], 'should be null')
                subset_test = subset[col_name]
                if col in all_numerical:
                    if len(subset_test) >= 8: # normaltest requires at least 8 samples
                        results = normaltest(subset_test)
                        if results[1] > alpha: #null hypothesis cannot be rejected that distribution is normal. Take mean of gaussian
                            df.loc[indices, col_name] = np.nanmean(subset_test)
                        elif results[1] < alpha:
                            df.loc[indices, col_name] = np.nanmedian(subset_test)
                    else:
                        print('less than 8 samples')
                        mean = np.nanmean(subset_test)
                        median = np.nanmedian(subset_test)
                        print(mean, median)
                        if np.abs((mean-median)/mean)*100 < 10: # Mean and median are close together -- can pick either, will pick mean
                            df.loc[indices, col_name] = mean
                        else: # Pick median -- not close together possibly due to outliers
                            df.loc[indices, col_name] = media
                else:
                    df.loc[indices, col_name] = np.nanmedian(subset_test)
                print('after at index',indices,'and column', col,'sample is ',df.loc[indices, col_name])
    #except KeyError:
    #    print('KeyError')
    #    #print('No NaN for index', indices, 'passing..')
    #    pass

1128.0 50
before at index 1128.0 and column 50 PoolQC sample is  nan
after at index 1128.0 and column 50 sample is  nan
535.0 50
before at index 535.0 and column 50 PoolQC sample is  nan
after at index 535.0 and column 50 sample is  0.6666666666666667
535.0 51
before at index 535.0 and column 51 Fence sample is  nan
after at index 535.0 and column 51 sample is  0.6666666666666667
242.0 50
before at index 242.0 and column 50 PoolQC sample is  nan
after at index 242.0 and column 50 sample is  nan
1093.0 50
before at index 1093.0 and column 50 PoolQC sample is  nan
after at index 1093.0 and column 50 sample is  0.6666666666666667
1093.0 51
before at index 1093.0 and column 51 Fence sample is  0.6666666666666667


  r, k = function_base._ureduce(a, func=_nanmedian, axis=axis, out=out,


ValueError: (0.6666666666666667, 'should be null')

In [None]:
df.isnull().any().sum()

14

## The method above runs the risk of getting values like 0.5 for categorical columns. Can fix this by just enforcing the median always. 