# Cleaning and One-Hot Encoding Categorical Features

_This notebook is dedicated to cleaning, choosing, and one-hot encoding categorical data._

In [28]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LassoCV

_Reading in two copies of the train and test data sets so that a column can be added back later._  
_Note: Working directly on the "test" DataFrame is probably bad practice, but I thought I might get more confused later otherwise._

In [2]:
test = pd.read_csv('../datasets/test_clean.csv')
train = pd.read_csv('../datasets/train_clean.csv')
test1 = pd.read_csv('../datasets/test_clean.csv')
train1 = pd.read_csv('../datasets/train_clean.csv')
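
_One alternative to reading each file twice is to keep an untouched copy with `DataFrame.copy()`. The sketch below uses a toy DataFrame with made-up values; the name `train_full` is an assumption for illustration and does not appear elsewhere in this notebook._

```python
import pandas as pd

# Toy stand-in for the cleaned training data (values are made up).
train = pd.DataFrame({
    "MS SubClass": [20, 60, 20],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Keep an untouched copy instead of reading the file a second time.
train_full = train.copy()

# Dropping a column from the working frame leaves the copy intact,
# so the column can be restored from it later.
train = train.drop(columns=["MS SubClass"])
train["MS SubClass"] = train_full["MS SubClass"]
```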

In [3]:
test.shape

(879, 80)

In [4]:
train.shape

(2051, 81)

In [5]:
# Getting the non-numerical columns in the features
non_num_col_train = [col for col in train.columns if col not in train._get_numeric_data().columns]
non_num_col_test = [col for col in test.columns if col not in test._get_numeric_data().columns]

_Newly updated 'train' and 'test' DataFrames with only non-numerical columns._

In [6]:
train = pd.DataFrame(train[non_num_col_train])
test = pd.DataFrame(test[non_num_col_test])

In [7]:
test.shape

(879, 42)
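
_The same selection can probably be written with the public `DataFrame.select_dtypes` method instead of the private `_get_numeric_data`. This is a minimal sketch on a toy DataFrame; the names `toy` and `non_numeric_cols` are illustrative only._

```python
import pandas as pd

# Toy frame mixing one numeric and two text columns (values are made up).
toy = pd.DataFrame({
    "Lot Area": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "Exter Qual": ["Gd", "TA"],
})

# Exclude every numeric dtype and keep the remaining column names.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
toy_non_numeric = toy[non_numeric_cols]
```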

_I realized that a few columns could essentially be converted to numerical columns (e.g., Exterior Quality is rated from Poor to Excellent and can be converted to a numerical scale). Therefore, I created a few functions to help with that process._

In [8]:
def one_to_five_scale(x):
    """Map quality codes to 1-5. Any value other than Ex/Gd/TA/Fa returns 1."""
    if x == 'Ex':
        return 5
    elif x == 'Gd':
        return 4
    elif x == 'TA':
        return 3
    elif x =='Fa':
        return 2
    else:
        return 1

def fin_type(x):
    """Map basement finish types to 1-6. Any other value returns 0."""
    if x == 'GLQ':
        return 6
    elif x == 'ALQ':
        return 5
    elif x == 'BLQ':
        return 4
    elif x == 'Rec':
        return 3
    elif x == 'LwQ':
        return 2
    elif x == 'Unf':
        return 1
    else:
        return 0

def functionally(x):
    """Map home functionality ratings to 1-7. Any other value returns 0."""
    if x == 'Typ':
        return 7
    elif x == 'Min1':
        return 6
    elif x == 'Min2':
        return 5
    elif x == 'Mod':
        return 4
    elif x == 'Maj1':
        return 3
    elif x == 'Maj2':
        return 2
    elif x == 'Sev':
        return 1
    else:
        return 0

def garage_finish(x):
    """Map garage finish codes to 1-3. Any other value returns 0."""
    if x == 'Fin':
        return 3
    elif x == 'RFn':
        return 2
    elif x == 'Unf':
        return 1
    else:
        return 0
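
_The if/elif chains above could also be written as lookup tables with `Series.map`. The sketch below does this for the quality scale only. Note one deliberate difference that is my assumption, not the notebook's behavior: missing or unknown codes are sent to a separate `missing_value` (0 here) instead of sharing the value 1 with 'Po'. The names `QUALITY_SCALE` and `map_quality` are hypothetical._

```python
import pandas as pd

# Same ordering as one_to_five_scale, written as a lookup table.
QUALITY_SCALE = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

def map_quality(series, missing_value=0):
    """Map quality codes to integers; unknown or missing codes get missing_value."""
    # Codes absent from the dict become NaN, which fillna then replaces.
    return series.map(QUALITY_SCALE).fillna(missing_value).astype(int)

# Toy column with one missing entry (values are made up).
toy = pd.Series(["Ex", "TA", None, "Po"])
encoded = map_quality(toy)
```

_Passing `missing_value=1` would reproduce the behavior of `one_to_five_scale`._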

_I applied the functions to the columns, saved each result as a separate Series, then created a DataFrame to hold all of the "Non-Categorical" features. I did this for both the train and the test data, and finally saved each DataFrame to its own csv file._

In [9]:
exterqual = train['Exter Qual'].apply(one_to_five_scale)
extercond = train['Exter Cond'].apply(one_to_five_scale)
bsmtqual  = train['Bsmt Qual'].apply(one_to_five_scale)
bsmtcond  = train['Bsmt Cond'].apply(one_to_five_scale)
bsmtfintype1 = train['BsmtFin Type 1'].apply(fin_type)
bsmtfintype2 = train['BsmtFin Type 2'].apply(fin_type)
heatingqc = train['Heating QC'].apply(one_to_five_scale)
kitchenqual = train['Kitchen Qual'].apply(one_to_five_scale)
functional = train['Functional'].apply(functionally)
fireplacequ = train['Fireplace Qu'].apply(one_to_five_scale)
garagefinish = train['Garage Finish'].apply(garage_finish)
garagequal = train['Garage Qual'].apply(one_to_five_scale)
garagecond = train['Garage Cond'].apply(one_to_five_scale)
poolqc = train['Pool QC'].apply(one_to_five_scale)

In [10]:
Not_categorical_train = pd.DataFrame()

In [11]:
Not_categorical_train['Exter Qual'] = exterqual
Not_categorical_train['Exter Cond'] = extercond
Not_categorical_train['Bsmt Qual'] = bsmtqual
Not_categorical_train['Bsmt Cond'] = bsmtcond
Not_categorical_train['BsmtFin Type 1'] = bsmtfintype1
Not_categorical_train['BsmtFin Type 2'] = bsmtfintype2
Not_categorical_train['Heating QC'] = heatingqc
Not_categorical_train['Kitchen Qual'] = kitchenqual
Not_categorical_train['Functional'] = functional
Not_categorical_train['Fireplace Qu'] = fireplacequ
Not_categorical_train['Garage Finish'] = garagefinish
Not_categorical_train['Garage Qual'] = garagequal
Not_categorical_train['Garage Cond'] = garagecond
Not_categorical_train['Pool QC'] = poolqc

In [12]:
Not_categorical_train.to_csv('../datasets/noncat_train.csv', index = False)

In [13]:
Not_categorical_train.shape

(2051, 14)

In [14]:
exterqual = test['Exter Qual'].apply(one_to_five_scale)
extercond = test['Exter Cond'].apply(one_to_five_scale)
bsmtqual  = test['Bsmt Qual'].apply(one_to_five_scale)
bsmtcond  = test['Bsmt Cond'].apply(one_to_five_scale)
bsmtfintype1 = test['BsmtFin Type 1'].apply(fin_type)
bsmtfintype2 = test['BsmtFin Type 2'].apply(fin_type)
heatingqc = test['Heating QC'].apply(one_to_five_scale)
kitchenqual = test['Kitchen Qual'].apply(one_to_five_scale)
functional = test['Functional'].apply(functionally)
fireplacequ = test['Fireplace Qu'].apply(one_to_five_scale)
garagefinish = test['Garage Finish'].apply(garage_finish)
garagequal = test['Garage Qual'].apply(one_to_five_scale)
garagecond = test['Garage Cond'].apply(one_to_five_scale)
poolqc = test['Pool QC'].apply(one_to_five_scale)

In [15]:
Not_categorical_test = pd.DataFrame()

In [16]:
Not_categorical_test['Exter Qual'] = exterqual
Not_categorical_test['Exter Cond'] = extercond
Not_categorical_test['Bsmt Qual'] = bsmtqual
Not_categorical_test['Bsmt Cond'] = bsmtcond
Not_categorical_test['BsmtFin Type 1'] = bsmtfintype1
Not_categorical_test['BsmtFin Type 2'] = bsmtfintype2
Not_categorical_test['Heating QC'] = heatingqc
Not_categorical_test['Kitchen Qual'] = kitchenqual
Not_categorical_test['Functional'] = functional
Not_categorical_test['Fireplace Qu'] = fireplacequ
Not_categorical_test['Garage Finish'] = garagefinish
Not_categorical_test['Garage Qual'] = garagequal
Not_categorical_test['Garage Cond'] = garagecond
Not_categorical_test['Pool QC'] = poolqc

In [17]:
Not_categorical_test.shape

(879, 14)

In [18]:
Not_categorical_test.to_csv('../datasets/noncat_test.csv', index = False)
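
_The train and test cells above repeat the same 14 assignments. One way to avoid the repetition might be a table of column names and encoder functions applied by a small helper. This is only a sketch: `ENCODERS` and `encode_ordinals` are hypothetical names, the two encoders are compact stand-ins for the functions defined earlier, and only two of the 14 columns are shown._

```python
import pandas as pd

# Compact stand-ins for the notebook's encoder functions.
def one_to_five_scale(x):
    return {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2}.get(x, 1)

def garage_finish(x):
    return {"Fin": 3, "RFn": 2, "Unf": 1}.get(x, 0)

# Hypothetical column -> encoder table (two of the 14 columns).
ENCODERS = {
    "Exter Qual": one_to_five_scale,
    "Garage Finish": garage_finish,
}

def encode_ordinals(df, encoders):
    """Return a new DataFrame with each listed column passed through its encoder."""
    return pd.DataFrame({col: df[col].apply(func) for col, func in encoders.items()})

# Toy train and test frames (values are made up).
toy_train = pd.DataFrame({"Exter Qual": ["Gd", "Po"], "Garage Finish": ["Fin", None]})
toy_test = pd.DataFrame({"Exter Qual": ["Ex"], "Garage Finish": ["Unf"]})

# The same helper handles both data sets, so they cannot drift apart.
noncat_train = encode_ordinals(toy_train, ENCODERS)
noncat_test = encode_ordinals(toy_test, ENCODERS)
```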

_Here I dropped a few columns that I thought would have no role in predicting SalePrice, along with the columns that I saved earlier as "Non-Categorical" features._

In [19]:
# Columns judged unhelpful for SalePrice, plus those already saved as "Non-Categorical" features
cols_to_drop = [
    'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Land Slope',
    'Roof Style', 'Roof Matl', 'Exter Qual', 'Exter Cond', 'Foundation',
    'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2',
    'Heating QC', 'Kitchen Qual', 'Functional', 'Fireplace Qu',
    'Garage Finish', 'Garage Qual', 'Garage Cond', 'Pool QC',
]
train.drop(cols_to_drop, axis=1, inplace=True)
test.drop(cols_to_drop, axis=1, inplace=True)

_I added the "MS SubClass" column back from the untouched copies, since even though the values in the column are numerical, they actually represent categorical data. I did this for both the train and the test data._

In [20]:
train['MS SubClass'] = train1['MS SubClass']
test['MS SubClass'] = test1['MS SubClass']

In [21]:
train['MS SubClass'] = train['MS SubClass'].astype(str)
test['MS SubClass'] = test['MS SubClass'].astype(str)

*Used `pd.get_dummies()` on the entire DataFrame to one-hot encode the remaining categorical columns.*

In [22]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
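
_The cast of "MS SubClass" to `str` matters because `pd.get_dummies` only expands object, string, and category columns by default and passes numeric columns through unchanged. A minimal sketch on a toy DataFrame with made-up values:_

```python
import pandas as pd

toy = pd.DataFrame({
    "MS SubClass": [20, 60, 20],
    "Street": ["Pave", "Grvl", "Pave"],
})

# The integer column is left as-is; only 'Street' is expanded.
as_numbers = pd.get_dummies(toy)

# After casting to str, each sub-class code gets its own indicator column.
toy["MS SubClass"] = toy["MS SubClass"].astype(str)
as_categories = pd.get_dummies(toy)
```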

In [23]:
train.describe()

| | MS Zoning_A (agr) | MS Zoning_C (all) | MS Zoning_FV | MS Zoning_I (all) | MS Zoning_RH | MS Zoning_RL | MS Zoning_RM | Utilities_AllPub | Utilities_NoSeWa | Utilities_NoSewr | ... | MS SubClass_30 | MS SubClass_40 | MS SubClass_45 | MS SubClass_50 | MS SubClass_60 | MS SubClass_70 | MS SubClass_75 | MS SubClass_80 | MS SubClass_85 | MS SubClass_90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | ... | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 | 2051.0 |
| mean | 0.000975 | 0.009264 | 0.049244 | 0.000488 | 0.006826 | 0.779132 | 0.154071 | 0.999025 | 0.000488 | 0.000488 | ... | 0.049244 | 0.00195 | 0.005363 | 0.096538 | 0.192101 | 0.043881 | 0.007801 | 0.041931 | 0.013652 | 0.036568 |
| std | 0.03122 | 0.095825 | 0.21643 | 0.022081 | 0.082357 | 0.414933 | 0.361105 | 0.03122 | 0.022081 | 0.022081 | ... | 0.21643 | 0.04413 | 0.073055 | 0.2954 | 0.394048 | 0.20488 | 0.088 | 0.20048 | 0.116069 | 0.187743 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


_I realized that some of the columns in the train data are not in the test data and vice versa. I decided to use only the features that exist in both._

In [24]:
[x for x in train.columns if x not in test.columns]

['MS Zoning_A (agr)',
 'Utilities_NoSeWa',
 'Neighborhood_GrnHill',
 'Neighborhood_Landmrk',
 'Condition 2_Artery',
 'Condition 2_RRAe',
 'Condition 2_RRAn',
 'Condition 2_RRNn',
 'Exterior 1st_CBlock',
 'Exterior 1st_ImStucc',
 'Exterior 1st_Stone',
 'Exterior 2nd_Stone',
 'Heating_OthW',
 'Heating_Wall',
 'Electrical_Mix',
 'Misc Feature_Elev',
 'Misc Feature_TenC',
 'MS SubClass_150']

In [25]:
[x for x in test.columns if x not in train.columns]

['Exterior 1st_PreCast',
 'Exterior 2nd_Other',
 'Exterior 2nd_PreCast',
 'Mas Vnr Type_CBlock',
 'Heating_Floor',
 'Sale Type_VWD']

In [26]:
train.drop([x for x in train.columns if x not in test.columns], axis=1, inplace = True)

In [27]:
test.drop([x for x in test.columns if x not in train.columns], axis=1,inplace=True)
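
_The two drops above leave both DataFrames with the same set of columns. An equivalent way to express this, sketched on toy frames with made-up values, is to compute the shared columns once and select them from both; `common` is an illustrative name._

```python
import pandas as pd

# Toy one-hot frames whose columns only partly overlap.
toy_train = pd.DataFrame({"Heating_GasA": [1, 0], "Heating_Wall": [0, 1], "Sale Type_WD": [1, 1]})
toy_test = pd.DataFrame({"Heating_GasA": [1], "Heating_Floor": [0], "Sale Type_WD": [1]})

# Columns present in both frames, kept in the train frame's order.
test_cols = set(toy_test.columns)
common = [col for col in toy_train.columns if col in test_cols]

# Selecting the same list from both guarantees identical column order.
toy_train = toy_train[common]
toy_test = toy_test[common]
```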

_Saved the categorical columns into .csv files._

In [None]:
train.to_csv('../datasets/cat_train.csv', index=False)
test.to_csv('../datasets/cat_test.csv', index=False)

_At this point in the notebook, I tried to use lasso iteratively to reduce the number of categorical columns that I wanted in my final model._

In [29]:
X = train
y = train1['SalePrice']

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

In [31]:
lasso = LassoCV(n_alphas = 200)
lasso.fit(X_train, y_train)
lasso.score(X_test,y_test)



0.677841409647286

In [64]:
lasso.alpha_

35.952714782700525

In [46]:
X_train.shape

(1538, 147)

_I obtained all the coefficients that were not equal to 0._

In [50]:
not_zero = [index for index, val in enumerate(lasso.coef_) if val != 0]

*I made a new X_train and X_test that kept only the features whose coefficients were not equal to 0.*

In [60]:
X_train_it_1 = X_train.iloc[:, not_zero]
X_test_it_1 = X_test.iloc[:, not_zero]
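
_The index list plus `iloc` selection can also be expressed as a boolean mask over the coefficient array. The sketch below uses made-up coefficients in place of `lasso.coef_` and a toy feature matrix; `coef`, `X_toy`, and `keep` are illustrative names._

```python
import numpy as np
import pandas as pd

# Made-up coefficients standing in for lasso.coef_.
coef = np.array([0.0, 1250.0, 0.0, -310.5])

# Toy feature matrix with one column per coefficient.
X_toy = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 0, 1]],
    columns=["MS Zoning_RL", "MS Zoning_RM", "Lot Config_FR2", "Lot Config_Inside"],
)

# True where the coefficient survived, False where Lasso zeroed it out.
keep = coef != 0
X_toy_reduced = X_toy.loc[:, keep]
```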

In [68]:
X_train_it_1.columns

Index(['MS Zoning_C (all)', 'MS Zoning_RH', 'MS Zoning_RM',
       'Lot Config_CulDSac', 'Lot Config_FR2', 'Lot Config_Inside',
       'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_IDOTRR', 'Neighborhood_MeadowV',
       'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NWAmes',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown',
       'Neighborhood_SWISU', 'Neighborhood_Sawyer', 'Neighborhood_SawyerW',
       'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Neighborhood_Timber',
       'Neighborhood_Veenker', 'Condition 1_Feedr', 'Condition 1_Norm',
       'Condition 1_PosA', 'Condition 1_PosN', 'Condition 1_RRAe',
       'Condition 1_RRAn', 'Condition 2_Norm', 'Condition 2_PosA',
       'Condition 2_PosN', 'Bldg Type_1Fam',

_I fit a new Lasso model to the reduced X_train data._

In [61]:
lasso_it_1 = LassoCV(n_alphas = 200)
lasso_it_1.fit(X_train_it_1, y_train)
lasso_it_1.score(X_test_it_1, y_test)



0.6774970913208453
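
_The fit, filter, and refit steps above could be wrapped in a loop that stops once a round removes no further features. This is a rough sketch under stated assumptions: it uses synthetic data from `make_regression` in place of the one-hot encoded housing features, default `LassoCV` settings with 5-fold cross-validation, and a hypothetical helper named `iterate_lasso`. It is not the procedure used to produce the scores in this notebook._

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data standing in for the encoded features and SalePrice.
X_arr, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                           noise=10.0, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

def iterate_lasso(X, y, max_rounds=5):
    """Refit LassoCV on the surviving columns until a round drops none."""
    cols = list(X.columns)
    for _ in range(max_rounds):
        model = LassoCV(cv=5).fit(X[cols], y)
        kept = [c for c, w in zip(cols, model.coef_) if w != 0]
        # Stop when nothing was removed, or when everything would be removed.
        if len(kept) == len(cols) or not kept:
            break
        cols = kept
    return cols

selected = iterate_lasso(X, y)
```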

_This time, it seemed that Lasso did not zero out any more coefficients. Instead, I pulled the features whose coefficients in the first model were large in magnitude (greater than 40000 or less than -40000)._

In [74]:
high_coef = [index for index,val in enumerate(lasso.coef_) if (val > 40000) or (val < -40000)]

In [88]:
X_train.iloc[0:, high_coef].columns

Index(['Neighborhood_BrkSide', 'Neighborhood_Edwards', 'Neighborhood_IDOTRR',
       'Neighborhood_MeadowV', 'Neighborhood_NAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_OldTown', 'Neighborhood_Sawyer',
       'Neighborhood_StoneBr', 'Neighborhood_Veenker', 'Condition 2_PosA',
       'Condition 2_PosN', 'House Style_2.5Fin', 'Exterior 1st_CemntBd',
       'MS SubClass_75'],
      dtype='object')
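
_Pairing the coefficients with their column names in a Series makes the magnitude filter easier to read and sort. The sketch below uses made-up coefficient values, not the ones fitted in this notebook; `coef_by_feature`, `high`, and `high_sorted` are illustrative names._

```python
import numpy as np
import pandas as pd

# Made-up coefficients and matching feature names.
coef = np.array([52000.0, -45500.0, 1200.0, 0.0])
columns = ["Neighborhood_StoneBr", "Neighborhood_IDOTRR",
           "Lot Config_Inside", "Lot Config_FR2"]

# Label each coefficient with its feature name.
coef_by_feature = pd.Series(coef, index=columns)

# Keep coefficients whose absolute value exceeds the 40000 threshold.
high = coef_by_feature[coef_by_feature.abs() > 40000]

# Sort the survivors from largest to smallest magnitude.
high_sorted = high.sort_values(key=lambda s: s.abs(), ascending=False)
```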

_It seems that continuing to iterate the Lasso neither improved the score nor zeroed out any additional features. I stopped here and noted the features to use later. Interestingly, the features that Lasso didn't remove were mainly neighborhood features._