# Data Preprocessing - Standardization
This tutorial explains how to preprocess data using the `pandas` library. Preprocessing is a preliminary pass over the data that transforms it into a standard, normalized format. Preprocessing involves the following aspects:
* missing values
* data formatting
* data normalization
* data standardization
* data binning

In this tutorial we deal only with standardization. Standardization is often confused with normalization, but the two refer to different things. Normalization adjusts values measured on different scales to a common scale, while standardization transforms data to have a mean of zero and a standard deviation of 1.
Standardization is done through a z-score transformation, where the new value is calculated as the difference between the current value and the average value, divided by the standard deviation.
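
The transformation described above can be written as a formula. The symbols are notation introduced here for illustration: $x_i$ is a value in a column of $N$ values, $\mu$ is the column mean and $\sigma$ its standard deviation. The population form of the standard deviation is assumed, which is what `scipy.stats.zscore` uses by default.

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
$$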

The z-score is a statistical measure that specifies how many standard deviations a single data point lies from the mean of the dataset. As highlighted by Mahbubul Alam in [his article](https://towardsdatascience.com/z-score-for-anomaly-detection-d98b0006f510), the z-score can be used to detect outliers in a dataset.

The z-score can be calculated manually as described in [my previous post](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-3-normalisation-5b5392d27673). However, in this tutorial I will show you how to calculate the z-score using some functions from the `scipy.stats` module.
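
To make the link between the manual calculation and `scipy.stats` concrete, here is a minimal sketch on a small, invented series (the values are hypothetical and not taken from the dataset). It assumes the population standard deviation (`ddof=0`), which is the default of `zscore()`.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy values, used only for illustration
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual z-score: subtract the mean, divide by the population
# standard deviation (ddof=0), matching the default of scipy.stats.zscore
manual = (values - values.mean()) / values.std(ddof=0)

# Same computation through scipy, on the underlying NumPy array
with_scipy = zscore(values.to_numpy())

print(np.allclose(manual, with_scipy))  # prints True
```

Note that `pandas` uses `ddof=1` by default in `Series.std()`, so the explicit `ddof=0` is needed for the two results to agree.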

In this tutorial we consider two types of standardization:
* z-score
* z-map

## Data Import
As an example dataset, we use the data provided by the Italian Protezione Civile on the number of COVID-19 cases registered since the beginning of the pandemic. The dataset is updated daily and can be downloaded from [this link](https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv).

First of all, we import the Python `pandas` library and read the dataset with the `read_csv()` function. Then we drop every column that contains at least one `NaN` value, using the `dropna()` function with `axis=1`.

In [6]:
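import pandas as pd
import numpy as np

# Load the regional COVID-19 dataset published by the Protezione Civile
df = pd.read_csv('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv')
# Drop every column that contains at least one NaN value
df.dropna(axis=1, inplace=True)
df.tail(10)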

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
27395,2023-09-20T17:00:00,ITA,21,P.A. Bolzano,46.499335,11.356624,16,1,17,0,17,-1,32,295466,1626,297109,5612164
27396,2023-09-20T17:00:00,ITA,22,P.A. Trento,46.068935,11.121231,13,0,13,320,333,1,33,245489,1657,247479,3064220
27397,2023-09-20T17:00:00,ITA,1,Piemonte,45.073274,7.680687,195,6,201,32587,32788,187,459,1699970,13866,1746624,22033260
27398,2023-09-20T17:00:00,ITA,16,Puglia,41.125596,16.867367,60,3,63,4139,4202,161,281,1635165,9846,1649213,14268081
27399,2023-09-20T17:00:00,ITA,20,Sardegna,39.215312,9.110616,143,1,144,6755,6899,23,126,511680,2970,521549,5515313
27400,2023-09-20T17:00:00,ITA,19,Sicilia,38.115697,13.362357,189,9,198,4723,4921,26,28,1814264,12875,1832060,16938303
27401,2023-09-20T17:00:00,ITA,9,Toscana,43.769231,11.255889,204,8,212,2601,2813,-31,356,1601718,12048,1616579,17010366
27402,2023-09-20T17:00:00,ITA,10,Umbria,43.106758,12.388247,64,2,66,1339,1405,103,146,443389,2506,447300,5105951
27403,2023-09-20T17:00:00,ITA,2,Valle d'Aosta,45.737503,7.320149,8,0,8,47,55,0,6,50498,574,51127,597040
27404,2023-09-20T17:00:00,ITA,5,Veneto,45.434905,12.338452,236,10,246,18595,18841,227,774,2713135,16969,2748945,38410471


## z-score
The z-score of a value is calculated as the difference between the current value and the average value, divided by the standard deviation. For example, we can calculate the z-score of the column `deceduti` with the `zscore()` function of the `scipy.stats` module. By default, `zscore()` uses the population standard deviation (`ddof=0`).

In [7]:
from scipy.stats import zscore
df['zscore-deceduti'] = zscore(df['deceduti'])
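
A quick sanity check on a standardized column is to confirm that its mean is close to 0 and its standard deviation is close to 1. The sketch below does this on a hypothetical toy dataframe: the values are invented and only the column names mirror the ones used in this tutorial.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy data; only the column names match the tutorial
toy = pd.DataFrame({'deceduti': [10, 20, 30, 40, 100]})
toy['zscore-deceduti'] = zscore(toy['deceduti'].to_numpy())

# After standardization the column should have mean ~0 and
# population standard deviation (ddof=0) ~1
mean_z = toy['zscore-deceduti'].mean()
std_z = toy['zscore-deceduti'].std(ddof=0)

print(np.isclose(mean_z, 0.0), np.isclose(std_z, 1.0))  # prints True True
```

If the check is done with the `pandas` default (`ddof=1`), the standard deviation comes out slightly larger than 1, because `zscore()` divided by the population standard deviation.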

## z-map
The new value is calculated as the difference between the current value and the average value of a comparison array, divided by the standard deviation of that comparison array. For example, we can calculate the z-map of the column `deceduti`, using the column `terapia_intensiva` as the comparison array. We can use the `zmap()` function of the `scipy.stats` module.

In [8]:
from scipy.stats import zmap
zmap(df['deceduti'], df['terapia_intensiva'])

0         -0.436605
1         -0.436605
2         -0.436605
3         -0.436605
4         -0.436605
            ...    
27400    139.179620
27401    130.211650
27402     26.738405
27403      5.787839
27404    183.574868
Name: deceduti, Length: 27405, dtype: float64
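
To see what `zmap()` computes, the following sketch reproduces it by hand on two small, invented arrays (`scores` and `compare` are hypothetical names and values). It assumes the default `ddof=0`, so the standard deviation of the comparison array is the population one.

```python
import numpy as np
from scipy.stats import zmap

# Hypothetical toy arrays, invented for illustration
scores = np.array([1.0, 2.0, 3.0])
compare = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

# zmap standardizes `scores` using the mean and the population
# standard deviation (ddof=0) of `compare`, not of `scores` itself
manual = (scores - compare.mean()) / compare.std()

print(np.allclose(zmap(scores, compare), manual))  # prints True
```

This also helps to read the output above: in the rows shown, `deceduti` takes much larger values than `terapia_intensiva`, so the resulting values are far from 0.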

## Detect outliers
Standardization can be used to detect and remove outliers. To do so, we define a threshold on the z-score beyond which a value is considered an outlier. In this example, we set `threshold = 2`. We add a new column to the dataframe, called `outliers`, which is set to `True` if the z-score is less than `-2` or greater than `2`, and to `False` otherwise. We use the `numpy` function `where()` to build the column.

In [9]:
threshold = 2

df['outliers'] = np.where((df['zscore-deceduti'] - threshold > 0), True, np.where(df['zscore-deceduti'] + threshold < 0, True, False)) 
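
The same flag can also be expressed as a single comparison on the absolute z-score. The sketch below checks the equivalence on hypothetical z-scores (invented values, chosen to include the boundary cases `-2.0` and `2.0`); it is an alternative formulation, not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scores, invented for illustration
toy = pd.DataFrame({'zscore-deceduti': [-2.5, -2.0, -0.3, 0.0, 1.9, 2.0, 3.1]})
threshold = 2

# Nested np.where, written as in the cell above
nested = np.where((toy['zscore-deceduti'] - threshold > 0), True,
                  np.where(toy['zscore-deceduti'] + threshold < 0, True, False))

# Equivalent boolean mask: absolute z-score strictly above the threshold
mask = toy['zscore-deceduti'].abs() > threshold

print((nested == mask.to_numpy()).all())  # prints True
print(mask.tolist())  # prints [True, False, False, False, False, False, True]
```

Values exactly equal to `-2` or `2` are not flagged in either version, since both comparisons are strict.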

Now we can remove the outliers using the `drop()` function.

In [10]:
df.drop(df[df['outliers']].index, inplace=True)

In [11]:
df.shape

(26383, 19)
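
In the run shown above, the dataframe went from 27405 rows to 26383, so 27405 - 26383 = 1022 rows were flagged and removed. If the original dataframe should be kept intact, one option is to filter with a boolean mask instead of dropping in place. A minimal sketch on a hypothetical toy dataframe (invented values, with `cleaned` as an illustrative name):

```python
import pandas as pd

# Hypothetical toy data with a precomputed boolean `outliers` column
toy = pd.DataFrame({
    'deceduti': [1, 2, 3, 400],
    'outliers': [False, False, False, True],
})

# Keep only the rows that are not flagged; `toy` itself is left unchanged
cleaned = toy[~toy['outliers']].copy()

print(toy.shape, cleaned.shape)  # prints (4, 2) (3, 2)
```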