<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Dataset-Class" data-toc-modified-id="Dataset-Class-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Dataset Class</a></span><ul class="toc-item"><li><span><a href="#Basic-data-manipulation-functions" data-toc-modified-id="Basic-data-manipulation-functions-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Basic data manipulation functions</a></span><ul class="toc-item"><li><span><a href="#Load-data-from-dataframe" data-toc-modified-id="Load-data-from-dataframe-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Load data from dataframe</a></span></li><li><span><a href="#Access-feature-(column)-names" data-toc-modified-id="Access-feature-(column)-names-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Access feature (column) names</a></span></li><li><span><a href="#Replace-NA" data-toc-modified-id="Replace-NA-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Replace NA</a></span></li></ul></li></ul></li></ul></div>

# Dataset Class

This class collects the helper methods used throughout the lessons, specifically for data preparation and basic feature engineering.

To start using it, import it with:

    from src.dataset import Dataset

In [8]:
# imports
import numpy as np
import pandas as pd
import statsmodels.api as sm
import warnings

from src.dataset import Dataset

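# Each call inserts its filter at the front of the filter list, so the 'once'
# filter added last takes precedence: warnings are shown once, not ignored.
warnings.simplefilter(action='ignore')
warnings.filterwarnings(action='once')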

In [9]:
houses = Dataset('./data/houseprices_prepared.csv.gz')
houses.set_target('SalePrice')
houses.describe()


Available types: [dtype('int64') dtype('O') dtype('float64')]
79 Features
43 categorical features
36 numerical features
16 categorical features with NAs
0 numerical features with NAs
63 Complete features
--
Target: SalePrice
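
The counts reported by `describe()` can be approximated with plain pandas. The following is a minimal sketch on a toy dataframe (the values are made up for illustration). It assumes that "categorical" means any non-numeric column, that "numerical" means any numeric dtype, and that the target is excluded from the feature count; the actual `Dataset` implementation may classify columns differently.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the house prices file (hypothetical values).
df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Alley': ['Grvl', np.nan, np.nan],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'SalePrice': [208500, 181500, 223500],
})

# Assumption: the target column is not counted as a feature.
features = df.drop(columns='SalePrice')

# Assumption: non-numeric columns are categorical, numeric ones are numerical.
categorical = features.select_dtypes(exclude='number')
numerical = features.select_dtypes(include='number')

summary = {
    'features': features.shape[1],
    'categorical': categorical.shape[1],
    'numerical': numerical.shape[1],
    # A column counts as "with NAs" if at least one value is missing.
    'categorical_na': int(categorical.isna().any().sum()),
    'numerical_na': int(numerical.isna().any().sum()),
    # A column is complete if none of its values is missing.
    'complete': int(features.notna().all().sum()),
}
print(summary)
```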


## Basic data manipulation functions

### Load data from dataframe

To load data from an existing dataframe into this class, use:

In [12]:
my_existing_dataframe = pd.read_csv('./data/houseprices_prepared.csv.gz')
del houses  # discard the previous instance before rebuilding it

houses = Dataset.from_dataframe(my_existing_dataframe)
houses.set_target('SalePrice')
houses.describe()


Available types: [dtype('int64') dtype('O') dtype('float64')]
79 Features
43 categorical features
36 numerical features
16 categorical features with NAs
0 numerical features with NAs
63 Complete features
--
Target: SalePrice


### Access feature (column) names

The `table()` method prints the names of the features of a given kind in a compact table. The example below uses `categorical_na`, which lists the categorical features that contain NAs. Other options are:

  - all (default)
  - features
  - target
  - complete
  - numerical
  - numerical_na
  - categorical

To display features of any kind in table format, which is more convenient when there are many of them, use:

In [13]:
houses.table('categorical_na')

-----------------------------------------------------------------------------
Alley        MasVnrType   BsmtQual     BsmtCond     BsmtExposure BsmtFinType1 
BsmtFinType2 Electrical   FireplaceQu  GarageType   GarageFinish GarageQual   
GarageCond   PoolQC       Fence        MiscFeature  
-----------------------------------------------------------------------------
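
A selection like `categorical_na` and the column layout above can be sketched in plain pandas. The helpers `categorical_na_names` and `format_table` below are hypothetical names introduced for illustration, not part of the `Dataset` class, and the sketch assumes that "categorical" means any non-numeric column.

```python
import numpy as np
import pandas as pd


def categorical_na_names(frame):
    """Return the non-numeric columns that hold at least one missing value."""
    non_numeric = frame.select_dtypes(exclude='number')
    return [name for name in non_numeric.columns if non_numeric[name].isna().any()]


def format_table(names, per_row=6):
    """Lay the names out in left-aligned columns, `per_row` names per line."""
    width = max((len(name) for name in names), default=0) + 1
    rows = [
        ''.join(name.ljust(width) for name in names[start:start + per_row])
        for start in range(0, len(names), per_row)
    ]
    return '\n'.join(rows)


# Toy data with hypothetical values; only 'Alley' and 'Fence' contain NAs.
df = pd.DataFrame({
    'Alley': ['Grvl', np.nan, np.nan],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'Fence': [np.nan, 'MnPrv', np.nan],
    'LotArea': [8450, 9600, 11250],
})

names = categorical_na_names(df)
print(format_table(names))
```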


### Replace NA

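Replace the NAs with new values in all `categorical_na` features. `Electrical` is a special case: its NAs are replaced with 'Unknown', while the NAs in the remaining features are replaced with 'None'. `Electrical` is filled first, so the second call has nothing left to replace in it. Note that `replace_na` accepts either a single column name or a list of column names.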

To obtain a list of names from the dataset for each type of feature, we use `dataset.names(kind)`.

In [4]:
houses.replace_na(column='Electrical', value='Unknown')
houses.replace_na(column=houses.names('categorical_na'), value='None')
houses.table('categorical_na')

Now describe the dataset again to check that no NAs remain among the categorical variables.

In [5]:
houses.describe()


Available types: [dtype('int64') dtype('O') dtype('float64')]
79 Features
43 categorical features
36 numerical features
0 categorical features with NAs
0 numerical features with NAs
79 Complete features
--
Target: SalePrice
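
For readers working without the `Dataset` helper, the same two-step replacement can be approximated with pandas `fillna`. This is a sketch on a toy dataframe with made-up values. The `replace_na` function below is a hypothetical stand-in that accepts a single name or a list of names, mirroring the calls shown earlier; it is not the class's actual implementation.

```python
import numpy as np
import pandas as pd


def replace_na(frame, column, value):
    """Fill missing values in one column or in a list of columns (in place)."""
    columns = [column] if isinstance(column, str) else list(column)
    frame[columns] = frame[columns].fillna(value)
    return frame


# Toy data standing in for the house prices file (hypothetical values).
df = pd.DataFrame({
    'Electrical': ['SBrkr', np.nan, 'FuseA'],
    'Alley': [np.nan, 'Grvl', np.nan],
    'LotArea': [8450, 9600, 11250],
})

# Step 1: the special case gets its own replacement value.
replace_na(df, 'Electrical', 'Unknown')

# Step 2: every non-numeric column that still has NAs gets 'None'.
# 'Electrical' was filled in step 1, so it is no longer selected here.
remaining = [name for name in df.select_dtypes(exclude='number') if df[name].isna().any()]
replace_na(df, remaining, 'None')

print(df)
```

The order of the two calls matters: running the list-based replacement first would also fill `Electrical` with 'None'.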
