# Data Preprocessing


## Mean Computation

mean() is a function of Python's built-in statistics module that computes the arithmetic mean (average) of a sequence of numbers, such as a list. To use it, import the statistics module and pass the data to statistics.mean(). The following example also computes the median, the mode, the sample standard deviation and the sample variance of the same list.

In [1]:
import statistics

data = [11, 21, 11, 19, 46, 21, 19, 29, 21, 18, 3, 11, 11]

x = statistics.mean(data)
print(x)

y = statistics.median(data)
print(y)

z = statistics.mode(data)
print(z)

a = statistics.stdev(data)
print(a)

b = statistics.variance(data)
print(b)

18.53846153846154
19
11
10.611435534486562
112.6025641025641
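
As a cross-check, the mean, variance and standard deviation printed above can be reproduced by hand. The following is a minimal sketch that reuses the same data list; the `manual_*` names are introduced here only for illustration. It relies on the fact that statistics.variance() and statistics.stdev() are the sample versions, which divide by n - 1.

```python
import math
import statistics

data = [11, 21, 11, 19, 46, 21, 19, 29, 21, 18, 3, 11, 11]

# Mean: sum of the values divided by how many there are.
manual_mean = sum(data) / len(data)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_variance = sum((v - manual_mean) ** 2 for v in data) / (len(data) - 1)

# Sample standard deviation: square root of the sample variance.
manual_stdev = math.sqrt(manual_variance)

print(math.isclose(manual_mean, statistics.mean(data)))
print(math.isclose(manual_variance, statistics.variance(data)))
print(math.isclose(manual_stdev, statistics.stdev(data)))
```

If the population versions are needed instead, the statistics module also offers pvariance() and pstdev().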


### Using the numpy.mean() function
The numpy.mean() function returns the average of the array elements. By default the average is taken over the flattened array; otherwise it is taken over the specified axis.

NumPy is a commonly used library for working with large multi-dimensional arrays. It also has an extensive collection of mathematical functions that operate on arrays. Note that numpy.mean() also accepts a plain Python list and returns the average of its elements.
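
The effect of the axis argument is easiest to see on a two-dimensional array. The sketch below uses a small made-up 2 x 3 array (not part of the original notebook) and compares the default behaviour with axis=0 and axis=1.

```python
import numpy as np

# A small 2 x 3 array, made up for illustration.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

overall = np.mean(matrix)              # mean of all six values
per_column = np.mean(matrix, axis=0)   # one mean per column
per_row = np.mean(matrix, axis=1)      # one mean per row

print(overall)      # 3.5
print(per_column)   # [2.5 3.5 4.5]
print(per_row)      # [2. 5.]
```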


In [2]:
from numpy import mean
number_list = [19, 21, 46, 11, 18]
avg = mean(number_list)
print("The average of List is ", round(avg, 2))

The average of List is  23.0


In [5]:
print("The average of data is: ", round(mean(data), 2))

The average of data is:  18.54


## Imputation 

### Examples

In [4]:
import numpy as np
import pandas as pd
df = pd.read_csv('dataset/NaNDataset.csv')
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,'Good'
1,4,,6.0,'Good'
2,7,,9.0,'Excellent'
3,10,11.0,12.0,
4,13,14.0,15.0,'Excellent'
5,16,17.0,,'Fair'
6,19,12.0,12.0,'Excellent'
7,20,11.0,23.0,'Fair'


In [6]:
print(df['B'].mean())
df['B'] = df['B'].fillna(df['B'].mean())
df

11.166666666666666


Unnamed: 0,A,B,C,D
0,1,2.0,3.0,'Good'
1,4,11.166667,6.0,'Good'
2,7,11.166667,9.0,'Excellent'
3,10,11.0,12.0,
4,13,14.0,15.0,'Excellent'
5,16,17.0,,'Fair'
6,19,12.0,12.0,'Excellent'
7,20,11.0,23.0,'Fair'


In [10]:
print(df['D'].value_counts())
print(df['D'].value_counts().index)
print(df['D'].value_counts().index[0])
df['D'] = df['D'].fillna(df['D'].value_counts().index[0])
df

'Excellent'    4
'Good'         2
'Fair'         2
Name: D, dtype: int64
Index(["'Excellent'", "'Good'", "'Fair'"], dtype='object')
'Excellent'


Unnamed: 0,A,B,C,D
0,1,2.0,3.0,'Good'
1,4,11.166667,6.0,'Good'
2,7,11.166667,9.0,'Excellent'
3,10,11.0,12.0,'Excellent'
4,13,14.0,15.0,'Excellent'
5,16,17.0,,'Fair'
6,19,12.0,12.0,'Excellent'
7,20,11.0,23.0,'Fair'
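
The two imputation steps above (column mean for a numeric column, most frequent value for a categorical column) can be sketched on a small in-memory dataframe. The frame `toy` and its column names are made up for illustration, and the sketch uses mode() instead of value_counts().index[0] to find the most frequent value; both approaches should agree when there is a single most frequent value.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "score": [2.0, np.nan, 14.0, 17.0, np.nan, 11.0],
    "grade": ["Good", "Good", "Excellent", None, "Excellent", "Excellent"],
})

# Numeric column: replace NaN with the column mean (mean() skips NaN).
toy["score"] = toy["score"].fillna(toy["score"].mean())

# Categorical column: replace missing values with the most frequent value.
most_frequent = toy["grade"].mode()[0]
toy["grade"] = toy["grade"].fillna(most_frequent)

print(toy)
```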


## Data Preprocessing - Binning


Data binning (or bucketing) groups data into bins (or buckets): it replaces the values that fall within a small interval with a single representative value for that interval. Sometimes binning improves the accuracy of predictive models.

Data binning is a type of data preprocessing, which also includes dealing with missing values, formatting, normalization and standardization.

Binning can be applied to convert numeric values to categorical ones or to sample (quantise) numeric values:

- converting numeric values to categorical ones includes binning by distance and binning by frequency
- reducing numeric values includes quantisation (or sampling).

Binning is also a technique for data smoothing, which is employed to remove noise from data. Three techniques for data smoothing are:

- binning
- regression
- outlier analysis
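
To illustrate the idea of replacing values with a single representative value per interval, the following minimal sketch performs smoothing by bin means. The values are made up, and the choice of three equal-width bins is an assumption for the example rather than something prescribed by the text.

```python
import pandas as pd

# Made-up values, used only to illustrate smoothing by bin means.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split the value range into 3 equal-width intervals (integer bin ids 0, 1, 2).
bin_ids = pd.cut(values, bins=3, labels=False)

# Replace each value with the mean of the bin it falls into.
smoothed = values.groupby(bin_ids).transform("mean")

print(pd.DataFrame({"value": values, "bin": bin_ids, "smoothed": smoothed}))
```

After smoothing, the twelve original values collapse to three distinct values, one per bin.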


In [13]:
import pandas as pd
df = pd.read_csv('dataset/cupcake.csv')
df.head()

Unnamed: 0,Mese,Cupcake
0,2004-01,5
1,2004-02,5
2,2004-03,4
3,2004-04,6
4,2004-05,5


## Binning by distance

In this case we define the edges of each bin. In pandas, binning by distance is achieved by means of the cut() function.
We group the values of the column Cupcake into three groups: small, medium and big. To do so, we need to calculate the interval within which each group falls. We calculate the overall range as the difference between the maximum and the minimum value, and then we split this range into three equal parts, one for each group. We use the min() and max() functions of the dataframe column to calculate the minimum and the maximum value of the column Cupcake.

In [19]:
min_value = df['Cupcake'].min()
max_value = df['Cupcake'].max()
print(min_value)
print(max_value)

4
100


Now we can calculate the range of each interval, i.e. the minimum and maximum value of each interval. Since we have 3 groups, we need 4 edges of intervals (bins):

small — (edge1, edge2)

medium — (edge2, edge3)

big — (edge3, edge4)

We can use the linspace() function of the numpy package to calculate the 4 equally spaced bin edges.


In [20]:
import numpy as np
bins = np.linspace(min_value,max_value,4)
bins

array([  4.,  36.,  68., 100.])

In [21]:
labels = ['small', 'medium', 'big']

We can use the cut() function to convert the numeric values of the column Cupcake into categorical values. We need to specify the bins and the labels. In addition, we set the parameter include_lowest to True so that the minimum value is also included in the first bin.

In [22]:
df['bins'] = pd.cut(df['Cupcake'], bins=bins, labels=labels, include_lowest=True)
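
The role of include_lowest can be seen on a tiny made-up series whose minimum sits exactly on the first edge. This is only a sketch: the names `values`, `edges` and `names` are illustrative and stand in for the Cupcake column, the bins and the labels used above.

```python
import numpy as np
import pandas as pd

# Made-up values; the minimum (4) sits exactly on the first bin edge.
values = pd.Series([4, 5, 40, 70, 100])
edges = np.linspace(values.min(), values.max(), 4)  # 4, 36, 68, 100
names = ['small', 'medium', 'big']

# Default: intervals are open on the left, so the minimum gets no bin (NaN).
without_lowest = pd.cut(values, bins=edges, labels=names)

# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(values, bins=edges, labels=names, include_lowest=True)

print(pd.DataFrame({'value': values,
                    'default': without_lowest,
                    'include_lowest': with_lowest}))
```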

The distribution of values can be plotted with the hist() function of the matplotlib package. Here we display the resulting dataframe, which now contains the new bins column.

In [24]:
df

Unnamed: 0,Mese,Cupcake,bins
0,2004-01,5,small
1,2004-02,5,small
2,2004-03,4,small
3,2004-04,6,small
4,2004-05,5,small
...,...,...,...
199,2020-08,47,medium
200,2020-09,44,medium
201,2020-10,49,medium
202,2020-11,44,medium
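
The text mentions hist() without showing a call, so here is a minimal sketch of how such a plot could be produced. The values are made up and stand in for df['Cupcake'], and the Agg backend is selected only so that the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for df['Cupcake'].
values = np.array([4, 5, 6, 30, 40, 45, 50, 70, 90, 100])
edges = np.linspace(values.min(), values.max(), 4)

# One bar per distance-based bin; hist() also returns the count in each bin.
counts, _, _ = plt.hist(values, bins=edges, edgecolor="black")
plt.xlabel("Cupcake")
plt.ylabel("Number of observations")

print(counts)
```

Note that hist() treats bins as closed on the left, while cut() by default treats them as closed on the right, so values lying exactly on an inner edge may be counted in different bins by the two functions.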


In [66]:
df.groupby(['bins']).count()

Unnamed: 0_level_0,Mese,Cupcake
bins,Unnamed: 1_level_1,Unnamed: 2_level_1
small,68,68
medium,74,74
big,62,62


## Binning by frequency

Binning by frequency calculates the edges of each bin so that each bin contains (almost) the same number of observations, while the width of the bins varies. We can use the pandas qcut() function. We can set the precision parameter to define the number of decimal places used for the bin edges.

In [27]:
df['bin_qcut'] = pd.qcut(df['Cupcake'], q=3, precision=1, labels=labels)
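
To see that the bin edges are no longer equally spaced, qcut() can be asked to return them through its retbins parameter. The sketch below uses made-up, skewed values instead of the Cupcake column; the variable names are illustrative.

```python
import pandas as pd

# Made-up, skewed values: most are small, a few are large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 60, 80, 100])
names = ['small', 'medium', 'big']

# retbins=True also returns the computed bin edges.
by_frequency, edges = pd.qcut(values, q=3, labels=names, retbins=True)

print(by_frequency.value_counts(sort=False))  # 4 observations in each bin
print(edges)                                  # edges are not equally spaced
```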

In [28]:
df

Unnamed: 0,Mese,Cupcake,bins,bin_qcut
0,2004-01,5,small,small
1,2004-02,5,small,small
2,2004-03,4,small,small
3,2004-04,6,small,small
4,2004-05,5,small,small
...,...,...,...,...
199,2020-08,47,medium,medium
200,2020-09,44,medium,medium
201,2020-10,49,medium,medium
202,2020-11,44,medium,medium


In [29]:
df.groupby(['bin_qcut']).count()

Unnamed: 0_level_0,Mese,Cupcake,bins
bin_qcut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
small,68,68,68
medium,68,68,68
big,68,68,68


## Data Preprocessing - Normalization

As already mentioned in my previous tutorial on data normalization, data normalization involves adjusting values measured on different scales to a common scale.

Normalization applies only to columns containing numeric values. Five methods of normalization exist:

- single feature scaling
- min max
- z-score
- log scaling
- clipping

In this tutorial, we use the scikit-learn library to perform normalization, while in my previous tutorial I dealt with data normalization using the pandas library. The scikit-learn library can also be used to deal with missing values, as explained in my previous post.

All the scikit-learn operations described in this tutorial follow these steps:

1. select a preprocessing methodology
2. fit it through the fit() function
3. apply it to the data through the transform() function.

The scikit-learn scalers expect a two-dimensional input of shape (n_samples, n_features), so a single dataframe column, which is one-dimensional, must be converted first. This can be achieved through the numpy.array() function, which receives the dataframe column as input. In addition, the fit() function receives as input an array of arrays, each representing a sample of the dataset. Thus the reshape() function can be used to convert a flat array to an array of arrays.
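
The three steps and the shapes involved can be sketched on a made-up column. The dataframe `toy` is hypothetical, and MinMaxScaler is used here only as one example of a preprocessing methodology.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up column, used only to show the shapes involved.
toy = pd.DataFrame({'value': [10, 20, 30, 40]})

column = np.array(toy['value'])   # shape (4,): a flat array
X = column.reshape(-1, 1)         # shape (4, 1): one row per sample, one feature

scaler = MinMaxScaler()           # 1. select a preprocessing methodology
scaler.fit(X)                     # 2. fit it (here: learn the min and max)
X_scaled = scaler.transform(X)    # 3. apply it to the data

toy['scaled'] = X_scaled.ravel()  # back to a flat array for the dataframe
print(column.shape, X.shape)
print(toy)
```

When the same data is used for fitting and transforming, scaler.fit_transform(X) combines steps 2 and 3 in one call.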

## Data Import
As an example dataset, in this tutorial we consider the dataset provided by the Italian Protezione Civile, related to the number of COVID-19 cases registered since the beginning of the COVID-19 pandemic. The dataset is updated daily and can be downloaded from the URL used in the code below.

First of all, we need to import the Python pandas library and read the dataset through the read_csv() function. Then we can drop all the columns with NaN values. This is done through the dropna() function with axis=1.

In [31]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv')
df.dropna(axis=1,inplace=True)
df.head(10)

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148
5,2020-02-24T18:00:00,ITA,6,Friuli Venezia Giulia,45.649435,13.768136,0,0,0,0,0,0,0,0,0,0,58
6,2020-02-24T18:00:00,ITA,12,Lazio,41.89277,12.483667,1,1,2,0,2,0,2,1,0,3,124
7,2020-02-24T18:00:00,ITA,7,Liguria,44.411493,8.932699,0,0,0,0,0,0,0,0,0,0,1
8,2020-02-24T18:00:00,ITA,3,Lombardia,45.466794,9.190347,76,19,95,71,166,0,166,0,6,172,1463
9,2020-02-24T18:00:00,ITA,11,Marche,43.61676,13.518875,0,0,0,0,0,0,0,0,0,0,16



## Single Feature Scaling
Single Feature Scaling converts every value of a column into a number between 0 and 1 (for non-negative data, as in our case). The new value is calculated as the current value divided by the maximum absolute value of the column. This can be done through the MaxAbsScaler class. We apply the scaler to the tamponi column, which must be converted to an array and reshaped.

In [32]:
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array(df['tamponi']).reshape(-1,1)
scaler = MaxAbsScaler()

Now we can fit the scaler and then apply the transformation. We convert the result back to its original flat shape by applying reshape() again, and we store it in a new column of the dataframe df.

In [34]:
scaler.fit(X)
X_scaled = scaler.transform(X)
df['single feature scaling'] = X_scaled.reshape(1,-1)[0]
df

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,1.360436e-07
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.000000e+00
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,2.720872e-08
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.250850,0,0,0,0,0,0,0,0,0,0,10,2.720872e-07
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,4.026891e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16774,2022-05-02T17:00:00,ITA,19,Sicilia,38.115697,13.362357,804,46,850,115493,116343,-1,1204,990096,10605,1117044,12618270,3.433270e-01
16775,2022-05-02T17:00:00,ITA,9,Toscana,43.769231,11.255889,612,23,635,46898,47533,-6337,730,1043305,9867,1100705,13791333,3.752445e-01
16776,2022-05-02T17:00:00,ITA,10,Umbria,43.106758,12.388247,219,5,224,11563,11787,-47,384,255778,1834,269399,4211267,1.145832e-01
16777,2022-05-02T17:00:00,ITA,2,Valle d'Aosta,45.737503,7.320149,25,0,25,1355,1380,-203,18,33216,533,35129,509263,1.385639e-02


The scikit-learn library also provides the inverse_transform() function to restore the original values, given the transformation. This function also works for the transformations described later in this article. First, we check that the scaled values range from 0 to 1.

In [37]:
print(df['single feature scaling'].min())
print(df['single feature scaling'].max())

0.0
1.0


In [38]:
scaler.inverse_transform(X_scaled)

array([[5.0000000e+00],
       [0.0000000e+00],
       [1.0000000e+00],
       ...,
       [4.2112670e+06],
       [5.0926300e+05],
       [2.9113193e+07]])
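
As a sanity check, both the scaling and its inverse can be reproduced on a tiny made-up array. This is only a sketch: the values stand in for a non-negative column such as tamponi, and `manual` is a name introduced here for the by-hand computation.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up, non-negative values standing in for a column such as tamponi.
X = np.array([0.0, 5.0, 10.0, 20.0]).reshape(-1, 1)

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

# By hand: each value divided by the largest absolute value of the column.
manual = X / np.abs(X).max()

print(np.allclose(X_scaled, manual))                       # scaler matches the formula
print(np.allclose(scaler.inverse_transform(X_scaled), X))  # round trip restores the data
```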

## Min Max
Similarly to Single Feature Scaling, Min Max converts every value of a column into a number between 0 and 1. The new value is calculated as the difference between the current value and the min value, divided by the range of the column values. In scikit-learn we use the MinMaxScaler class. For example, we can apply the min max method to the column totale_casi.

In [39]:
from sklearn.preprocessing import MinMaxScaler
X = np.array(df['totale_casi']).reshape(-1,1)
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
df['min max'] = X_scaled.reshape(1,-1)[0]
df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling,min max
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,1.360436e-07,0.0
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,2.720872e-08,0.0
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10,2.720872e-07,0.0
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,4.026891e-06,6e-06


In [40]:
print(df['min max'].min())
print(df['min max'].max())

0.0
1.0
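
The min max formula can be verified by hand on a small made-up array. The sketch below assumes nothing beyond numpy and MinMaxScaler; `manual` is an illustrative name for the by-hand result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values, used only to check the formula.
X = np.array([2.0, 4.0, 6.0, 10.0]).reshape(-1, 1)

X_scaled = MinMaxScaler().fit_transform(X)

# By hand: (value - min) / (max - min).
manual = (X - X.min()) / (X.max() - X.min())

print(X_scaled.ravel())               # [0.   0.25 0.5  1.  ]
print(np.allclose(X_scaled, manual))
```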



## z-score
Z-Score converts every value of a column into a number around 0. Typical values obtained by a z-score transformation range from -3 to 3. The new value is calculated as the difference between the current value and the average value, divided by the standard deviation. In scikit-learn we can use the StandardScaler class. For example, we can calculate the z-score of the column deceduti.

In [42]:
from sklearn.preprocessing import StandardScaler

X = np.array(df['deceduti']).reshape(-1,1)
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
df['z score'] = X_scaled
df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,totale_casi,tamponi,single feature scaling,min max,z score
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,0,5,1.360436e-07,0.0,-0.659144
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,-0.659144
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,0,1,2.720872e-08,0.0,-0.659144
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,0,10,2.720872e-07,0.0,-0.659144
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,18,148,4.026891e-06,6e-06,-0.659144


In [43]:
df['z score'].mean()

2.7102172411136784e-17
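
The z-score formula can also be checked by hand on a small made-up array. One detail worth keeping in mind: StandardScaler divides by the population standard deviation (ddof=0), which is what numpy's std() computes by default, whereas the pandas std() method defaults to the sample version (ddof=1) and would give slightly different values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values, used only to check the formula.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

X_scaled = StandardScaler().fit_transform(X)

# By hand: (value - mean) / population standard deviation (ddof=0).
manual = (X - X.mean()) / X.std()

print(np.allclose(X_scaled, manual))
print(np.isclose(X_scaled.mean(), 0.0))  # scaled values are centred on 0
print(np.isclose(X_scaled.std(), 1.0))   # and have unit standard deviation
```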

For more details, have a look at <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html">this link</a>.