# Data Preprocessing

To apply deep learning in the wild we must extract messy data stored in arbitrary formats, and preprocess it to suit our needs.

## Reading the dataset

### Creating an artificial dataset

In [26]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')

with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
,,127500
2,,106000
4,Slate,178100
,,140000''')

- **os.makedirs()**: creates a directory recursively. If any intermediate-level directory on the way to the leaf directory is missing, os.makedirs() creates it as well.

    Parameters:

1. path: A path-like object representing a file system path. A path-like object is a string, a bytes object, or an object implementing os.PathLike.
2. mode (optional): An integer value representing the mode of the newly created directory. If this parameter is omitted, the default value 0o777 is used.
3. exist_ok (optional): Defaults to False. If the target directory already exists, an OSError (specifically FileExistsError) is raised when exist_ok is False. When it is True, no error is raised and the existing directory is left unaltered.
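
The behaviour of exist_ok can be checked with a small sketch. It works inside a temporary directory (an assumption made here so that nothing is left on disk), and the names `nested`, `created` and `raised` are only illustrative.

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b", "c")

    # One call creates a, a/b and a/b/c.
    os.makedirs(nested)
    created = os.path.isdir(nested)

    # Calling again with the default exist_ok=False raises FileExistsError,
    # which is a subclass of OSError.
    try:
        os.makedirs(nested)
        raised = False
    except FileExistsError:
        raised = True

    # With exist_ok=True the existing directory is simply left as it is.
    os.makedirs(nested, exist_ok=True)

print(created, raised)
```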

### Loading the csv with pandas

In [35]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


## Data Preparation

- In supervised learning, we separate the columns corresponding to input values from those corresponding to target values. We can select columns by name or via integer-location based indexing (iloc).
- NaN values are missing values. They can be handled via imputation or deletion.
    - imputation: replaces missing values with estimates of their values.
    - deletion: discards either the rows or the columns that contain missing values.
- For categorical input fields we can treat NaN as a category.
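
The cells in this section use imputation. As a minimal sketch of how the two approaches differ, the snippet below rebuilds the same four rows by hand (the `toy` DataFrame is only a stand-in for the CSV) and applies deletion by row, deletion by column, and mean imputation.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the tiny housing table, used only for illustration.
toy = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "RoofType": [np.nan, np.nan, "Slate", np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Deletion by row: keep only rows without any missing value.
rows_kept = toy.dropna(axis=0)

# Deletion by column: keep only columns without any missing value.
cols_kept = toy.dropna(axis=1)

# Imputation: replace missing NumRooms values with the column mean.
imputed = toy["NumRooms"].fillna(toy["NumRooms"].mean())

print(rows_kept.shape, list(cols_kept.columns), imputed.tolist())
```

On a table this small, deletion throws away most of the data (one row or one column survives), which is one reason imputation is often preferred when missing values are frequent.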

In [36]:
inputs = data.iloc[:, 0:2]
target = data.iloc[:,2]
target, inputs

(0    127500
 1    106000
 2    178100
 3    140000
 Name: Price, dtype: int64,
    NumRooms RoofType
 0       NaN      NaN
 1       2.0      NaN
 2       4.0    Slate
 3       NaN      NaN)

In [37]:
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


In [38]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True


## Conversion to Tensor Format

In [39]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(target.to_numpy(dtype=float))

X,y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))

## Discussion

Data processing can get hairy. For example, rather than arriving in a single CSV file, our dataset might be spread across multiple files extracted from a relational database. For instance, in an e-commerce application, customer addresses might live in one table and purchase data in another. Moreover, practitioners face myriad data types beyond categorical and numeric, for example, text strings, images, audio data, and point clouds. Oftentimes, advanced tools and efficient algorithms are required in order to prevent data processing from becoming the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to computer vision and natural language processing. Finally, we must pay attention to data quality. Real-world datasets are often plagued by outliers, faulty measurements from sensors, and recording errors, which must be addressed before feeding the data into any model.
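
As a small illustration of the multi-table case, the sketch below joins two hypothetical tables with pandas. The table names, columns and values (`customers`, `purchases`, `customer_id`, and so on) are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical tables, as they might be exported from a relational database.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Lima", "Oslo", "Pune"],
})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [20.0, 35.5, 12.0],
})

# Left join: keep every purchase and attach the buyer's city to it.
joined = purchases.merge(customers, on="customer_id", how="left")
print(joined)
```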

## Exercises

### Exercise 1

In [41]:
#1
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
abalone = fetch_ucirepo(id=1) 
  
# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 
  
# metadata 
print(abalone.metadata) 
  
# variable information 
print(abalone.variables) 

{'uci_id': 1, 'name': 'Abalone', 'repository_url': 'https://archive.ics.uci.edu/dataset/1/abalone', 'data_url': 'https://archive.ics.uci.edu/static/public/1/data.csv', 'abstract': 'Predict the age of abalone from physical measurements', 'area': 'Biology', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Tabular'], 'num_instances': 4177, 'num_features': 8, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['Rings'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C55C7W', 'creators': ['Warwick Nash', 'Tracy Sellers', 'Simon Talbot', 'Andrew Cawthorn', 'Wes Ford'], 'intro_paper': None, 'additional_info': {'summary': 'Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- 

In [46]:
X, y

(     Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  \
 0      M   0.455     0.365   0.095        0.5140          0.2245   
 1      M   0.350     0.265   0.090        0.2255          0.0995   
 2      F   0.530     0.420   0.135        0.6770          0.2565   
 3      M   0.440     0.365   0.125        0.5160          0.2155   
 4      I   0.330     0.255   0.080        0.2050          0.0895   
 ...   ..     ...       ...     ...           ...             ...   
 4172   F   0.565     0.450   0.165        0.8870          0.3700   
 4173   M   0.590     0.440   0.135        0.9660          0.4390   
 4174   M   0.600     0.475   0.205        1.1760          0.5255   
 4175   F   0.625     0.485   0.150        1.0945          0.5310   
 4176   M   0.710     0.555   0.195        1.9485          0.9455   
 
       Viscera_weight  Shell_weight  
 0             0.1010        0.1500  
 1             0.0485        0.0700  
 2             0.1415        0.2100  
 3             0.1

In [57]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole_weight    4177 non-null   float64
 5   Shucked_weight  4177 non-null   float64
 6   Viscera_weight  4177 non-null   float64
 7   Shell_weight    4177 non-null   float64
dtypes: float64(7), object(1)
memory usage: 261.2+ KB


In [53]:
X.isna().sum()/X.shape[0]

Sex               0.0
Length            0.0
Diameter          0.0
Height            0.0
Whole_weight      0.0
Shucked_weight    0.0
Viscera_weight    0.0
Shell_weight      0.0
dtype: float64

In [58]:
y.isna().sum()/y.shape[0]

Rings    0.0
dtype: float64

- There are no missing values.
- There is only one categorical input (Sex); the rest are numerical, so 1/8 of the variables are categorical and 7/8 are numerical.

In [65]:
n_categoricals = 0

for column in X.columns:
    if X[column].dtype == "object":
        n_categoricals += 1

ratio_categoricals = n_categoricals/len(list(X.columns))
ratio_numericals = 1 - ratio_categoricals

ratio_categoricals, ratio_numericals

(0.125, 0.875)

### Exercise 2

In [69]:
dimensions = X[["Length","Diameter","Height"]]
dimensions.head(5)

   Length  Diameter  Height
0   0.455     0.365   0.095
1   0.350     0.265   0.090
2   0.530     0.420   0.135
3   0.440     0.365   0.125
4   0.330     0.255   0.080


In [75]:
weights = X[[column for column in X.columns if column.endswith("weight")]]
weights.head(5)

   Whole_weight  Shucked_weight  Viscera_weight  Shell_weight
0        0.5140          0.2245          0.1010         0.150
1        0.2255          0.0995          0.0485         0.070
2        0.6770          0.2565          0.1415         0.210
3        0.5160          0.2155          0.1140         0.155
4        0.2050          0.0895          0.0395         0.055


In [76]:
sex = X["Sex"]
sex.head(5)

0    M
1    M
2    F
3    M
4    I
Name: Sex, dtype: object

### Exercise 3

- The size of the dataset that I can load this way depends on the amount of RAM I have available.
- The limitations are the RAM required, which depends on the size of the file, and the time needed to process the file, which depends on its format.
- I am not sure about this question. I guess the difference depends on the network connection to the server, where the file format can probably have a major impact. Another big difference is the capacity of the laptop relative to that of the server: the server can probably handle much larger volumes of data than the laptop, but network limitations may slow down the workflow if data needs to be uploaded or downloaded.

**STRATEGIES**

**Load Less Data**

Suppose I want to load only some columns of the dataset, not all of them. We have two options:

In [103]:
# Option 1: Load all and then filter what we need

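# Build the path with os.path.join: backslashes inside a normal string are
# treated as escape characters and the path would only work on Windows.
df = pd.read_csv(os.path.join("..", "data", "house_tiny.csv"))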

columns_target = ["NumRooms","Price"]

df[columns_target]

   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000


In [105]:
# Option 2: Only load the columns we request

# usecols tells the parser to keep only the listed columns while reading.
df = pd.read_csv(os.path.join("..", "data", "house_tiny.csv"), usecols=columns_target)

df

   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000
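
A related option, for when every row is needed but the file does not fit in memory at once, is to read it in chunks with the `chunksize` parameter of `pd.read_csv`. The sketch below is only an illustration: it reads the tiny housing table from an in-memory string (an assumption made so the example is self-contained) and uses a chunk size of 2, far smaller than anything one would use in practice.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file on disk.
csv_text = (
    "NumRooms,RoofType,Price\n"
    ",,127500\n"
    "2,,106000\n"
    "4,Slate,178100\n"
    ",,140000\n"
)

total_price = 0
n_rows = 0

# With chunksize set, read_csv yields DataFrames of at most 2 rows each,
# so only one chunk has to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_price += chunk["Price"].sum()
    n_rows += len(chunk)

mean_price = total_price / n_rows
print(n_rows, mean_price)
```

This works for statistics that can be accumulated chunk by chunk, such as sums and counts. Operations that need all rows at once, like sorting, need a different approach.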


**Use efficient datatypes**

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.

In [77]:
X.memory_usage(deep=True)

Index                128
Sex               242266
Length             33416
Diameter           33416
Height             33416
Whole_weight       33416
Shucked_weight     33416
Viscera_weight     33416
Shell_weight       33416
dtype: int64

In [114]:
X_2 = X.copy()

X_2["Sex"] = X_2["Sex"].astype("category")

X_2.memory_usage(deep=True)

Index               128
Sex                4459
Length            33416
Diameter          33416
Height            33416
Whole_weight      33416
Shucked_weight    33416
Viscera_weight    33416
Shell_weight      33416
dtype: int64

We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [128]:
numerical_columns = X_2.iloc[:,1:].columns

X_2[numerical_columns] = X_2[numerical_columns].apply(pd.to_numeric, downcast="float")

X_2.dtypes

Sex               category
Length             float32
Diameter           float32
Height             float32
Whole_weight       float32
Shucked_weight     float32
Viscera_weight     float32
Shell_weight       float32
dtype: object

In [129]:
X_2.memory_usage(deep=True)

Index               128
Sex                4459
Length            16708
Diameter          16708
Height            16708
Whole_weight      16708
Shucked_weight    16708
Viscera_weight    16708
Shell_weight      16708
dtype: int64

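Summing the values above, we've reduced the in-memory footprint of this dataset from 476,306 bytes to 121,543 bytes, roughly a quarter of its original size.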

Source:
- [Scaling to large datasets](https://pandas.pydata.org/docs/user_guide/scale.html)
- [List of libraries implementing a DataFrame API](https://pandas.pydata.org/community/ecosystem.html) 
- [Enhancing performance](https://pandas.pydata.org/docs/user_guide/enhancingperf.html)

### Exercise 4

- If I have many categories, I would analyze how the categorical variables relate to each other and drop those that are highly correlated, keeping only the most important variable of each correlated group.
- One way to work with high-cardinality data is probably to group the categories in a more meaningful way, or to leave the variable out altogether.
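
One possible way to do such a grouping, sketched here under assumptions, is to merge rare categories into a single "other" bucket before one-hot encoding. The `colors` Series and the `min_count` threshold are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical categorical column with several rare values.
colors = pd.Series(
    ["red", "red", "red", "blue", "blue", "green", "teal", "plum"],
    name="color",
)

min_count = 2  # assumed threshold: categories seen fewer times are merged

# Find the categories that occur fewer than min_count times.
counts = colors.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories and replace the rare ones with "other".
grouped = colors.where(~colors.isin(rare), "other")

# One-hot encoding now yields 3 columns instead of 5.
encoded = pd.get_dummies(grouped)
print(encoded.columns.tolist())
```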

### Exercise 5

- [PIL](https://pillow.readthedocs.io/en/stable/handbook/tutorial.html)
- [numpy.load](https://numpy.org/doc/stable/reference/generated/numpy.load.html)
- [Dask](https://docs.dask.org/en/stable/)
- [Ibis](https://ibis-project.org/tutorials/getting_started)
- [Koalas](https://koalas.readthedocs.io/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [Odo](https://odo.pydata.org/en/latest/)
- [Pandarallel](https://github.com/nalepae/pandarallel)
- [Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.from_pandas.html)
- [Vaex](https://vaex.io/)
- [Hail](https://hail.is/)
- [NTV-pandas](https://github.com/loco-philippe/ntv-pandas)
- [bcpandas](https://github.com/yehoshuadimarsky/bcpandas)