# Data Preprocessing



See also:
https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_preliminaries/pandas.ipynb


## 2.2.1. Reading the dataset

In [9]:
# Create CSV file to play with

import os
os.makedirs(os.path.join('.', 'data'), exist_ok=True)
data_file = os.path.join('.', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')


In [10]:
# Read CSV file

import pandas as pd

data = pd.read_csv(data_file)
data


| | NumRooms | RoofType | Price |
|---|---|---|---|
| 0 | NaN | NaN | 127500 |
| 1 | 2.0 | NaN | 106000 |
| 2 | 4.0 | Slate | 178100 |
| 3 | NaN | NaN | 140000 |
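By default, `pd.read_csv` treats the string `NA` (among other markers) as a missing value, which is why the `NA` entries written to the file show up as `NaN`. The following is a minimal sketch to confirm this and count the missing entries per column. It reads the same rows from an in-memory string instead of `data_file`; the names `csv_text` and `missing_per_column` are illustrative only.

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, kept in memory for a self-contained check.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# NumRooms is read as float64 because NaN cannot be stored
# in a plain NumPy integer column.
print(data.dtypes)
```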


## 2.2.2. Data preparation

**Steps in training data processing:**

1. Separate the input columns from the target values (labels). Columns can be selected either by name or by integer-location-based indexing (`iloc`).

In [11]:
inputs = data.iloc[:, 0:2]
inputs

| | NumRooms | RoofType |
|---|---|---|
| 0 | NaN | NaN |
| 1 | 2.0 | NaN |
| 2 | 4.0 | Slate |
| 3 | NaN | NaN |


In [12]:
targets = data.iloc[:, 2]
targets


0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64
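The cells above select by position with `iloc`. As a sketch of the name-based alternative mentioned in step 1, the following rebuilds the same small frame in memory and checks that both approaches return the same data. The variable names are illustrative, not part of the original notebook.

```python
import pandas as pd

nan = float("nan")
data = pd.DataFrame({
    "NumRooms": [nan, 2.0, 4.0, nan],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Position-based selection, as in the cells above.
inputs_by_position = data.iloc[:, 0:2]
targets_by_position = data.iloc[:, 2]

# Name-based selection: keeps working if the column order changes.
inputs_by_name = data[["NumRooms", "RoofType"]]
targets_by_name = data["Price"]

# DataFrame.equals treats NaNs in the same position as equal.
print(inputs_by_name.equals(inputs_by_position))
print(targets_by_name.equals(targets_by_position))
```

Name-based selection is usually the safer choice when the CSV layout may change, since `iloc` silently picks whatever column happens to sit at a given position.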

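2. Handle missing data.
- Categorical columns: one-hot encode them, creating a separate indicator column for each category. With `dummy_na=True`, missing values get their own indicator column (`RoofType_nan`).
- Numerical columns: impute missing values by replacing NaNs with the column mean.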



In [13]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs


| | NumRooms | RoofType_Slate | RoofType_nan |
|---|---|---|---|
| 0 | NaN | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | NaN | False | True |


In [14]:
inputs = inputs.fillna(inputs.mean())
inputs # returns pd.DataFrame

| | NumRooms | RoofType_Slate | RoofType_nan |
|---|---|---|---|
| 0 | 3.0 | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | 3.0 | False | True |
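The dummy columns above are booleans, so the frame mixes `float64` and `bool` columns. As a hedged variant of the same two steps, `pd.get_dummies` accepts a `dtype` argument, which can be used to produce `0.0`/`1.0` indicators so that every column ends up numeric. The sketch below rebuilds the inputs in memory; the names `encoded` and `imputed` are illustrative.

```python
import pandas as pd

nan = float("nan")
inputs = pd.DataFrame({
    "NumRooms": [nan, 2.0, 4.0, nan],
    "RoofType": [None, None, "Slate", None],
})

# One-hot encode the categorical column; dtype=float yields 0.0/1.0
# indicators instead of the default booleans.
encoded = pd.get_dummies(inputs, dummy_na=True, dtype=float)

# Replace the remaining NaNs (here only in NumRooms) with the column mean.
imputed = encoded.fillna(encoded.mean())

print(imputed)
print(imputed.dtypes)
```

With all columns stored as `float64`, the frame converts to a single homogeneous NumPy array rather than an `object` array.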


## 2.2.3. Conversion to tensor format

In [18]:
inputs.values # returns np.ndarray

array([[3.0, False, True],
       [2.0, False, True],
       [4.0, True, False],
       [3.0, False, True]], dtype=object)

In [19]:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
X, y


TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint64, uint32, uint16, uint8, and bool.
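The `TypeError` occurs because `inputs` mixes a `float64` column with boolean dummy columns, so `inputs.values` falls back to dtype `object`, which `torch.tensor` cannot convert. One possible fix, sketched here on a rebuilt copy of the preprocessed frame, is to cast everything to a single numeric dtype with `to_numpy(dtype=float)` before creating the tensors.

```python
import pandas as pd
import torch

# Rebuild the preprocessed inputs and targets from the cells above.
inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "RoofType_Slate": [False, False, True, False],
    "RoofType_nan": [True, True, False, True],
})
targets = pd.Series([127500, 106000, 178100, 140000], name="Price")

# Cast every column to float so NumPy returns a float64 array, not object.
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))

print(X.dtype, X.shape)
print(y.dtype, y.shape)
```

Both tensors come out as `torch.float64`. Since PyTorch layers default to `float32`, calling `.float()` on `X` and `y` may be needed before feeding them to a model.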