# Data Preprocessing
:label:`sec_pandas`

## Section Summary
This section explains how to preprocess data with the Python library pandas. The process starts by reading the data from a CSV file and separating it into inputs and targets. Missing values can be handled either by imputation or by deletion. Categorical columns are converted into dummy (indicator) variables, with a missing entry treated as a category of its own, and missing numerical values are replaced with the mean of the corresponding column. Once every entry is numerical, the data can be loaded into PyTorch tensors for further processing. Real-world data processing is often more complicated: it can involve many data types, such as text, images, and audio, and outliers, faulty measurements, and recording errors must also be addressed before the data is fed into any model.




## Reading the Dataset



In [1]:
import os

# Create the data directory, then write a tiny CSV file whose missing entries are marked NA
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [2]:
import pandas as pd

# read_csv parses the string NA as a missing value (NaN) by default
data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
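
Before choosing between imputation and deletion, it can help to count how many entries are missing in each column. The following is a minimal sketch of that check, not part of the original notebook; it reads the same toy data from an in-memory string (the name `csv_text` is an assumption made for illustration) so that it runs without touching the file system.

```python
import io

import pandas as pd

# Same toy data as above, read from an in-memory string instead of a file
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# isna() marks each missing entry; sum() counts them per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Here `RoofType` has the most missing entries (three of four rows) and `Price` has none.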


## Data Preparation


In [3]:
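# Split the columns into inputs (first two) and targets (last)
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
# One-hot encode RoofType; dummy_na=True adds an indicator column for missing values.
# dtype=int keeps the 0/1 output shown below (recent pandas versions default to bool).
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)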
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1


In [4]:
# Replace each remaining NaN with the mean of its column (NumRooms: (2 + 4) / 2 = 3)
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1
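
The summary mentions deletion as an alternative to imputation, but the cells above only impute. The following is a minimal sketch of two common deletion strategies on the same toy data; the variable names (`rows_kept`, `worst_column`, `cols_kept`) are illustrative assumptions and do not come from the original notebook.

```python
import io

import pandas as pd

# Same toy data as above, read from an in-memory string
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))
inputs = data.iloc[:, 0:2]

# Option 1: drop every row that has at least one missing input
rows_kept = inputs.dropna(axis=0)

# Option 2: drop the column with the most missing values
worst_column = inputs.isna().sum().idxmax()
cols_kept = inputs.drop(columns=worst_column)

print(rows_kept)
print(worst_column)
print(cols_kept)
```

On a dataset this small, dropping rows leaves a single example, which illustrates why imputation is often preferred when data is scarce. If rows are dropped from the inputs, the matching rows of the targets must be dropped as well.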


## Conversion to the Tensor Format


In [5]:
import torch

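# to_numpy(dtype=float) yields a purely numerical array even if some columns are not float
X, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(targets.values)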
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
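
Note that `X` is stored as `float64` and `y` as integers, whereas PyTorch's default floating-point type is `float32`. The following sketch, which is an addition and not part of the original notebook, shows one way to cast both tensors before passing them to a model; the names `X32` and `y32` are assumptions made for illustration, and the tensors are rebuilt by hand so the block runs on its own.

```python
import torch

# Tensors with the same values and dtypes as the output above
X = torch.tensor([[3., 0., 1.],
                  [2., 0., 1.],
                  [4., 1., 0.],
                  [3., 0., 1.]], dtype=torch.float64)
y = torch.tensor([127500, 106000, 178100, 140000])

# Cast to float32 and give the targets an explicit column dimension
X32 = X.to(torch.float32)
y32 = y.to(torch.float32).reshape(-1, 1)

print(X32.dtype, y32.dtype, y32.shape)
```

Whether the cast is needed depends on the model that consumes the tensors; it is shown here only as a common follow-up step.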