# Processing and Transforming Features

We'll work with a dataset of houses sold in Ames, Iowa. Each row describes the properties of a single house and the price it sold for. Here are some of the columns:

- `Lot Area`: Lot size in square feet.
- `Overall Qual`: Rates the overall material and finish of the house.
- `Overall Cond`: Rates the overall condition of the house.
- `Year Built`: Original construction date.
- `Low Qual Fin SF`: Low quality finished square feet (all floors).
- `Full Bath`: Full bathrooms above grade.
- `Fireplaces`: Number of fireplaces.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
data.head()

,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [3]:
# Copy the slices so later column assignments modify independent DataFrames
train = data[0:1460].copy()
test = data[1460:].copy()

### Selecting columns with no missing values

In [4]:
null_counts = train.isnull().sum()
null_counts

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       249
Lot Area             0
Street               0
Alley             1351
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        11
Mas Vnr Area        11
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu       717
Garage Type         74
Garage Yr Blt       75
Garage Finish       75
Garage Cars          0
Garage Area          0
Garage Qual

In [5]:
# Keep only the columns that have no missing values
df_no_mv = train[null_counts[null_counts == 0].index]
df_no_mv.head()

,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,...,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,31770,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,0,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,11622,Pave,Reg,Lvl,AllPub,Inside,...,0,0,120,0,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,14267,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,0,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,11160,Pave,Reg,Lvl,AllPub,Corner,...,0,0,0,0,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,13830,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,3,2010,WD,Normal,189900
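
The null-count filter above can also be written with `DataFrame.dropna`. The sketch below uses a small made-up frame (the values and the name `toy` are illustrative, not taken from the Ames file) to show that both approaches select the same columns.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "Lot Area": [31770, 11622, 14267],
    "Lot Frontage": [141.0, None, 81.0],
    "Alley": [None, None, "Pave"],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Approach used above: keep columns whose null count is zero.
null_counts = toy.isnull().sum()
by_counts = toy[null_counts[null_counts == 0].index]

# Alternative: drop every column that contains at least one null.
by_dropna = toy.dropna(axis=1)

print(list(by_counts.columns))
print(by_counts.equals(by_dropna))
```

The explicit null-count version is still handy when the counts themselves are needed later, for example to apply a missing-value threshold.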


### Categorical Features

In [6]:
train['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [7]:
train['Street'].value_counts()

Pave    1455
Grvl       5
Name: Street, dtype: int64

In [8]:
train['House Style'].value_counts()

1Story    743
2Story    440
1.5Fin    160
SLvl       60
SFoyer     35
2.5Unf     11
1.5Unf      8
2.5Fin      3
Name: House Style, dtype: int64

### Handling categorical features

In [9]:
import warnings
warnings.filterwarnings('ignore')

train['Utilities'] = train['Utilities'].astype('category')
train['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [10]:
train['Utilities'].cat.codes[:10]

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int8

In [11]:
cat_cols = df_no_mv.select_dtypes(include=['object']).columns
cat_cols

Index(['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
       'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Electrical',
       'Kitchen Qual', 'Functional', 'Paved Drive', 'Sale Type',
       'Sale Condition'],
      dtype='object')

In [12]:
for col in cat_cols:
    train[col] = train[col].astype('category')

In [13]:
train['Utilities'].cat.codes.value_counts()

0    1457
2       2
1       1
dtype: int64

In [14]:
train['House Style'].cat.codes.value_counts()

2    743
5    440
0    160
7     60
6     35
4     11
1      8
3      3
dtype: int64
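
The integer codes are easier to read next to the labels they stand for. A minimal sketch, assuming a small hand-made series (`styles` is a hypothetical name) rather than the full `House Style` column:

```python
import pandas as pd

styles = pd.Series(["1Story", "2Story", "1.5Fin", "1Story", "SLvl"]).astype("category")

# When strings are converted to a categorical, the categories are sorted,
# and each value's code is its position in that sorted list.
code_to_label = dict(enumerate(styles.cat.categories))

print(code_to_label)
print(styles.cat.codes.tolist())
```

Because the codes only reflect sort order, they carry no numeric meaning of their own.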

### Dummy Coding

In [15]:
train.shape

(1460, 82)

In [16]:
# Replace each categorical column with one indicator column per category
for col in cat_cols:
    train = pd.concat([train, pd.get_dummies(train[col])], axis=1)
    del train[col]

In [17]:
train.shape

(1460, 236)

In [18]:
train.head()

,Order,PID,MS SubClass,Lot Frontage,Lot Area,Alley,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,ConLI,ConLw,New,Oth,WD,Abnorml,Alloca,Family,Normal,Partial
0,1,526301100,20,141.0,31770,,6,5,1960,1960,...,0,0,0,0,1,0,0,0,1,0
1,2,526350040,20,80.0,11622,,5,6,1961,1961,...,0,0,0,0,1,0,0,0,1,0
2,3,526351010,20,81.0,14267,,6,6,1958,1958,...,0,0,0,0,1,0,0,0,1,0
3,4,526353030,20,93.0,11160,,7,5,1968,1968,...,0,0,0,0,1,0,0,0,1,0
4,5,527105010,60,74.0,13830,,5,5,1997,1998,...,0,0,0,0,1,0,0,0,1,0
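
The loop above names each new column after the bare category label (`WD`, `Normal`, and so on). Columns that share labels, which is likely for rating columns such as `Exter Qual` and `Exter Cond`, would then produce duplicate column names. The sketch below shows one possible alternative on a made-up frame: passing the column list to `pd.get_dummies` so that each indicator is prefixed with its source column. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up frame: two categorical columns that share the labels "TA" and "Gd".
toy = pd.DataFrame({
    "Exter Qual": ["TA", "Gd", "TA"],
    "Exter Cond": ["TA", "TA", "Gd"],
    "Lot Area": [31770, 11622, 14267],
})

# `columns=` encodes the listed columns in one call and prefixes every new
# column with the name of the column it came from, so names stay unique.
# `dtype=int` requests 0/1 values; newer pandas versions default to booleans.
encoded = pd.get_dummies(toy, columns=["Exter Qual", "Exter Cond"], dtype=int)

print(list(encoded.columns))
```

The original columns are dropped automatically, so no separate `del` step is needed.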


### Handling numerical features that aren't categorical

In [21]:
train[['Year Remod/Add', 'Year Built']].head(10)

,Year Remod/Add,Year Built
0,1960,1960
1,1961,1961
2,1958,1958
3,1968,1968
4,1998,1997
5,1998,1998
6,2001,2001
7,1992,1992
8,1996,1995
9,1999,1999


In [23]:
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
train['years_until_remod'].head(10)

0    0
1    0
2    0
3    0
4    1
5    0
6    0
7    0
8    1
9    0
Name: years_until_remod, dtype: int64
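
A derived feature like `years_until_remod` is worth a quick sanity check, since a remodel year earlier than the build year would indicate a bad record. The following is a sketch on made-up rows (the third row is deliberately inconsistent); whether the Ames data contains such rows is not shown here.

```python
import pandas as pd

# Made-up rows; the last one has a remodel year before the build year.
toy = pd.DataFrame({
    "Year Built": [1960, 1997, 2002],
    "Year Remod/Add": [1960, 1998, 2001],
})
toy["years_until_remod"] = toy["Year Remod/Add"] - toy["Year Built"]

# Negative differences flag rows that need inspection before modeling.
suspect = toy[toy["years_until_remod"] < 0]

print(toy["years_until_remod"].tolist())
print(suspect.index.tolist())
```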

### Handling missing values

In [26]:
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

In [27]:
null_counts = train.isnull().sum()
null_counts

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       249
Lot Area             0
Street               0
Alley             1351
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        11
Mas Vnr Area        11
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu       717
Garage Type         74
Garage Yr Blt       75
Garage Finish       75
Garage Cars          0
Garage Area          0
Garage Qual

In [35]:
# Keep columns with some missing values, but fewer than 584 (40% of the 1,460 training rows)
df_missing_values = train[null_counts[(null_counts > 0) & (null_counts < 584)].index]
df_missing_values.head()

,Lot Frontage,Mas Vnr Type,Mas Vnr Area,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Garage Type,Garage Yr Blt,Garage Finish,Garage Qual,Garage Cond
0,141.0,Stone,112.0,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,1.0,0.0,Attchd,1960.0,Fin,TA,TA
1,80.0,,0.0,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,0.0,0.0,Attchd,1961.0,Unf,TA,TA
2,81.0,BrkFace,108.0,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,0.0,0.0,Attchd,1958.0,Unf,TA,TA
3,93.0,,0.0,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,1.0,0.0,Attchd,1968.0,Fin,TA,TA
4,74.0,,0.0,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,0.0,0.0,Attchd,1997.0,Fin,TA,TA


In [36]:
df_missing_values.shape

(1460, 19)

In [39]:
# Imputation: replace each missing float value with its column's mean

float_cols = df_missing_values.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.mean())
float_cols.head()

,Lot Frontage,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Garage Yr Blt
0,141.0,112.0,639.0,0.0,441.0,1080.0,1.0,0.0,1960.0
1,80.0,0.0,468.0,144.0,270.0,882.0,0.0,0.0,1961.0
2,81.0,108.0,923.0,0.0,406.0,1329.0,0.0,0.0,1958.0
3,93.0,0.0,1065.0,0.0,1045.0,2110.0,1.0,0.0,1968.0
4,74.0,0.0,791.0,0.0,137.0,928.0,0.0,0.0,1997.0


In [41]:
float_cols.isnull().sum()

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Yr Blt     0
dtype: int64
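
The cell above imputes the training rows only. If the held-out `test` rows are imputed as well, one common choice is to reuse the means computed on the training set, so that no information from the test set feeds into preprocessing. A minimal sketch with made-up frames (`train_toy` and `test_toy` are hypothetical names):

```python
import pandas as pd

# Made-up stand-ins for the float columns of the train and test splits.
train_toy = pd.DataFrame({"Lot Frontage": [80.0, None, 100.0]})
test_toy = pd.DataFrame({"Lot Frontage": [None, 60.0]})

# Compute the column means on the training data only.
train_means = train_toy.mean()

# Fill both splits with the training means.
train_filled = train_toy.fillna(train_means)
test_filled = test_toy.fillna(train_means)

print(train_filled["Lot Frontage"].tolist())
print(test_filled["Lot Frontage"].tolist())
```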