# 2.2. Data Preprocessing

Pandas is one of the most widely used data preprocessing packages in Python, and it works seamlessly with tensors. In this notebook we briefly walk through the steps for preprocessing raw data with pandas and converting it into the tensor format.

## Table of Contents

[2.2.1 Reading the Dataset](#d)

[2.2.2 Handling Missing Data](#missing)

[2.2.3 Conversion to the Tensor Format](#conv)

[2.2.4 Summary](#summary)

## 2.2.1. Reading the Dataset <a name="d"></a>

In [4]:
# Create synthetic data
import os

os.makedirs(os.path.join('..','data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

In [5]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
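Before deciding how to treat the missing entries, it can help to count them per column. The following is a minimal sketch that rebuilds the same four rows in memory (the `csv_text` and `missing_per_column` names are illustrative, not part of the original notebook) and relies on `read_csv` parsing the string `NA` as a missing value by default.

```python
import io

import pandas as pd

# Same four rows as house_tiny.csv, held in memory so no file is needed.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Here `NumRooms` has two missing entries, `Alley` has three, and `Price` has none.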


## 2.2.2. Handling Missing Data <a name="missing"></a>

Typical methods for handling missing data include **imputation** and **deletion**: imputation replaces missing values with substituted ones, while deletion discards the rows or columns that contain missing values. Here we will consider imputation.
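To make the difference between the two strategies concrete, here is a small sketch on a hypothetical two-column frame called `toy` (the frame and the variable names are assumptions made for illustration). It uses `dropna` for deletion and `fillna` with the column mean for imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing NumRooms values.
toy = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Deletion: drop every row whose NumRooms value is missing.
deleted = toy.dropna(subset=["NumRooms"])

# Imputation: replace missing NumRooms values with the column mean.
imputed = toy.fillna({"NumRooms": toy["NumRooms"].mean()})

print(len(deleted), len(imputed))
```

Deletion keeps only the two complete rows, whereas imputation keeps all four rows and fills the gaps with 3.0, the mean of 2.0 and 4.0. On a dataset this small, deletion would discard half of the examples.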

In [6]:
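# Split the columns into inputs (NumRooms, Alley) and outputs (Price)
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# Fill missing numeric entries with the column mean; numeric_only=True skips the
# non-numeric Alley column, which would otherwise raise a TypeError in pandas 2.0+
inputs = inputs.fillna(inputs.mean(numeric_only=True))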
print(inputs)

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN


The mean is defined only for numeric columns, so **Alley**, a **categorical** column, is left untouched and still contains NaN. For categorical columns we can instead apply one-hot encoding, treating NaN as a category of its own.

In [9]:
# Create one-hot columns, treating NA in categorical columns as a valid category;
# dtype=int yields the 0/1 indicators shown below (pandas 2.0+ defaults to bool)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
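Conceptually, each one-hot column is an indicator of whether the row belongs to that category. A rough manual equivalent for the `Alley` column might look like the sketch below; it is written only to illustrate the idea, and the `alley` and `manual` names are assumptions rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

# The Alley column from the example above.
alley = pd.Series(["Pave", np.nan, np.nan, np.nan], name="Alley")

# Build one indicator column per category, with missing values as their own category.
manual = pd.DataFrame({
    "Alley_Pave": (alley == "Pave").astype(int),
    "Alley_nan": alley.isna().astype(int),
})

# Exactly one indicator is set in every row.
print(manual.sum(axis=1).tolist())
```

Because every row falls into exactly one category, the indicators in each row sum to 1.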


## 2.2.3. Conversion to the Tensor Format <a name="conv"></a>

Since all the entries in `inputs` and `outputs` are now numerical, they can be converted to the tensor format.

In [10]:
import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
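Note the `dtype=torch.float64` in the output: `torch.tensor` keeps the dtype of the NumPy array it is given, whereas PyTorch's default floating-point type is `float32`. If single precision is preferred, one option is to request it explicitly, as in this sketch (the `features`, `X64`, and `X32` names are illustrative):

```python
import numpy as np
import torch

# A float64 array, standing in for the first two rows of inputs.values.
features = np.array([[3.0, 1.0, 0.0], [2.0, 0.0, 1.0]])

# By default the tensor keeps the array's float64 dtype.
X64 = torch.tensor(features)

# Passing dtype converts the values to float32 during construction.
X32 = torch.tensor(features, dtype=torch.float32)

print(X64.dtype, X32.dtype)
```

Whether float32 is the right choice depends on the model being trained; it is shown here only as one possible adjustment.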

## 2.2.4. Summary  <a name="summary"></a>

Like many other extension packages in Python's vast ecosystem, pandas works well together with tensors.

Imputation and deletion can be used to handle missing data.