# Data Cleaning and Preparation

In [1]:
import pandas as pd
import numpy as np

## Data Transformation

### General

#### Modification

#### Removal

### Specific

## Handling 

In this section we discuss general data-handling problems that nonetheless arise in most data analysis projects. We cover:
- **Handling Missing Data**
- **String Handling**
- **Handling Categorical Data**

### Handling Missing Data

Missing data can be handled in many different ways; replacing missing values with substitute values is known as **imputation**. Here we will see how to:
- Drop the entries along an axis that contain *null* values,
- Replace null values with a specific value.
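Before choosing between the two strategies, it often helps to measure how much data is actually missing. The following is a minimal sketch of that check; the frame `example` and its column names are invented for illustration and are not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for this illustration
example = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values in each column
missing_per_column = example.isna().sum()

# Fraction of missing values in each row
missing_per_row = example.isna().mean(axis="columns")

print(missing_per_column)
print(missing_per_row)
```

Columns or rows with a high share of nulls are natural candidates for dropping, while sparse gaps are usually better filled.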

In [2]:
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [4]:
string_data = pd.Series(['aaardvak', np.nan, None, 'avocado'])
string_data

0    aaardvak
1         NaN
2        None
3     avocado
dtype: object

In [9]:
data = pd.DataFrame(
    [[1., 6.5, 3.], [1., np.nan, np.nan],
    [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]]
)
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [3]:
# boolean mask identifying the null entries
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
# both NaN and None are recognized as sentinels for a null value
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [6]:
# drop the entries with null values
string_data.dropna()

0    aaardvak
3     avocado
dtype: object

In [10]:
data.dropna()

     0    1    2
0  1.0  6.5  3.0


In [11]:
data.dropna(how = 'all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [12]:
# no column is entirely null here, so no column is dropped
data.dropna(how = 'all', axis = 'columns')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [8]:
# filtering out the null values with a boolean mask
string_data[string_data.notna()]

0    aaardvak
3     avocado
dtype: object

To drop only the entries that have fewer than a given number of non-missing values, pass that minimum as the `thresh` argument of `dropna`.

In [13]:
np.random.seed(42)
df = pd.DataFrame(np.random.standard_normal((7,3)))
df

          0         1         2
0  0.496714 -0.138264  0.647689
1  1.523030 -0.234153 -0.234137
2  1.579213  0.767435 -0.469474
3  0.542560 -0.463418 -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [17]:
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

          0         1         2
0  0.496714       NaN       NaN
1  1.523030       NaN       NaN
2  1.579213       NaN -0.469474
3  0.542560       NaN -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [18]:
df.dropna(thresh = 2)

          0         1         2
2  1.579213       NaN -0.469474
3  0.542560       NaN -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649
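Note that `thresh` counts the *non-missing* values a row needs in order to be kept. If the rule is easier to state as "tolerate at most k missing values per row", it can be converted as in the sketch below. The frame `toy` and the variable `max_missing` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for this illustration
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [1.0, 2.0, np.nan, 4.0],
    "z": [1.0, np.nan, np.nan, np.nan],
})

# Assumed policy: tolerate at most one missing value per row
max_missing = 1

# thresh is the minimum number of non-missing values required to keep a row,
# so "at most k missing" becomes "at least (number of columns - k) present"
kept = toy.dropna(thresh=toy.shape[1] - max_missing)
print(kept)
```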


To replace the missing values with a specific value instead, pandas provides the `fillna` method.

In [None]:
# replace every missing value with 0
df.fillna(0)

          0         1         2
0  0.496714  0.000000  0.000000
1  1.523030  0.000000  0.000000
2  1.579213  0.000000 -0.469474
3  0.542560  0.000000 -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [20]:
# replace missing values with a different value for each column
df.fillna({1:0.5, 2:0})

          0         1         2
0  0.496714  0.500000  0.000000
1  1.523030  0.500000  0.000000
2  1.579213  0.500000 -0.469474
3  0.542560  0.500000 -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [21]:
np.random.seed(42)
df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

          0         1         2
0  0.496714       NaN       NaN
1  1.523030       NaN       NaN
2  1.579213       NaN -0.469474
3  0.542560       NaN -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [None]:
# backward fill: each missing value takes the next valid value in its column
df.bfill()

          0         1         2
0  0.496714 -1.913280 -0.469474
1  1.523030 -1.913280 -0.469474
2  1.579213 -1.913280 -0.469474
3  0.542560 -1.913280 -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649


In [25]:
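# forward fill: the missing values here are at the top of their columns,
# so there is no earlier value to propagate and they remain NaN
df.ffill()

          0         1         2
0  0.496714       NaN       NaN
1  1.523030       NaN       NaN
2  1.579213       NaN -0.469474
3  0.542560       NaN -0.465730
4  0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831  0.314247
6 -0.908024 -1.412304  1.465649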


In general, the `fillna()` method can handle missing values in many different ways. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) for further details.
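One common variation is to fill each column with a statistic computed from that same column, such as its mean. The sketch below assumes a hypothetical frame `scores`; whether the mean is an appropriate substitute depends on the data and is not something this notebook establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for this illustration
scores = pd.DataFrame({
    "math": [8.0, np.nan, 6.0],
    "physics": [np.nan, 7.0, 9.0],
})

# DataFrame.mean() skips NaN by default and returns one mean per column;
# fillna then aligns that Series with the frame on the column labels
imputed = scores.fillna(scores.mean())
print(imputed)
```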

### String Handling

Python has built-in string methods that make string handling quick. A complete list can be found at the following [link](https://www.w3schools.com/python/python_ref_string.asp).
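As a brief illustration of these built-in methods, the sketch below chains a few of the most common ones (`split`, `strip`, `join`, `find`, `replace`). The input string `raw` is made up for the example.

```python
# Hypothetical input string, used only for this illustration
raw = "  a, b ,  guido "

# split on commas, then strip the surrounding whitespace from each piece
pieces = [piece.strip() for piece in raw.split(",")]

# join the cleaned pieces with a new delimiter
joined = "::".join(pieces)

# find returns the index of the first occurrence, or -1 if it is absent
position = joined.find("guido")

# replace substitutes every occurrence of a substring
replaced = joined.replace("::", ", ")

print(pieces, joined, position, replaced)
```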

A widely used tool for working with strings is regular expressions (*regex*). Python's `re` module makes working with regular expressions straightforward. They can be used for:
1. pattern matching
2. substitution
3. splitting

In [None]:
import re 

# splitting
text = 'foo bar\t baz \tqux'
re.split(r'\s+', text) # \s+ matches one or more whitespace characters

['foo', 'bar', 'baz', 'qux']

In [27]:
# compiling the regex
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [None]:
# pattern matching: a list of all substrings matching the regex
regex.findall(text)

[' ', '\t ', ' \t']

In [None]:
# strict matching: returns a match only if it starts at the beginning of the string (here the result is None)
regex.match(text)

In [30]:
# first match: returns only the first match found anywhere in the string
regex.search(text)

<re.Match object; span=(3, 4), match=' '>

In [None]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9._]+\.[A-Z]{2,4}" # e-mail regex
regex = re.compile(pattern, flags = re.IGNORECASE) # case-insensitive regex

regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [33]:
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [34]:
print(regex.match(text))

None


In [35]:
# substitution 
print(regex.sub("REDACTED", text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



In [42]:
# segmentation: parentheses define capture groups
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9._]+)\.([A-Z]{2,4})" # e-mail regex split into three groups
regex = re.compile(pattern, flags = re.IGNORECASE)

In [43]:
m = regex.match("wesm@bright.net")
m.groups()

('wesm', 'bright', 'net')

In [44]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [45]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



Let us now turn to the native *pandas* functions for string handling.

In [46]:
data = {
    'Dave': 'dave@google.com',
    'Steve': 'steve@gmail.com',
    'Rob': 'rob@gmail.com',
    'Wes': np.nan
}

In [49]:
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [50]:
data.isna()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [51]:
# matching 
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [52]:
data_as_string_text = data.astype('string')
data_as_string_text.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean

In [54]:
# combining pandas with re
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9._]+)\.([A-Z]{2,4})" 
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [None]:
# using the str accessor to take the first match of each list (vectorized indexing)
matches = data.str.findall(pattern, flags = re.IGNORECASE).str[0]
matches

Dave     (dave, google, com)
Steve    (steve, gmail, com)
Rob        (rob, gmail, com)
Wes                      NaN
dtype: object

In [56]:
matches.str.get(1)

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object

In [None]:
# extract returns a DataFrame with one column per capture group
data.str.extract(pattern, flags = re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN


For more information on using the `str` accessor of a `Series`, see the [documentation](https://pandas.pydata.org/docs/user_guide/text.html).
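The same accessor also exposes vectorized counterparts of many ordinary string methods. The short sketch below shows a few of them on a hypothetical series `emails` that mirrors the data used above; missing values are propagated rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical series, mirroring the e-mail data used in this section
emails = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan,
})

# element-wise upper-casing
upper = emails.str.upper()

# split each address at '@' and expand the pieces into separate columns
parts = emails.str.split("@", expand=True)

# slicing is applied element-wise as well
prefix = emails.str[:4]

print(upper, parts, prefix, sep="\n\n")
```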

### Handling Categorical Data

In this section we discuss the *pandas* `Categorical` type. The library provides utility functions for handling categorical data that help, for example, with *one-hot encoding*.
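As a minimal sketch of one-hot encoding, `pd.get_dummies` can turn a column of labels into one indicator column per distinct value. The frame `baskets` below is invented for illustration, and the choice of `prefix` and of an integer `dtype` is an assumption about the desired output, not a requirement.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration
baskets = pd.DataFrame({
    "fruit": ["apple", "orange", "apple", "banana"],
    "count": [3, 5, 2, 7],
})

# one indicator column per distinct value of 'fruit'
dummies = pd.get_dummies(baskets["fruit"], prefix="fruit", dtype=int)

# attach the indicator columns to the remaining column
encoded = baskets[["count"]].join(dummies)
print(encoded)
```

Each row has exactly one indicator set to 1, which is what makes the encoding "one-hot".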

In [59]:
values = pd.Series(['apple', 'orange', 'apple', 'apple']*2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [60]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [61]:
# Series.value_counts replaces the deprecated top-level pd.value_counts
values.value_counts()

apple     6
orange    2
Name: count, dtype: int64

In [62]:
# dimension-table representation, as used in data warehousing
values = pd.Series([0, 1, 0, 0]*2)
dim = pd.Series(['apple', 'orange'])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [None]:
# decoding: map the integer codes back to their labels
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

In [66]:
# Categorical type for handling encoding
fruits = dim.take(values)
fruits

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

In [82]:
N = len(fruits)
rng = np.random.default_rng(seed = 42)
df = pd.DataFrame({
    'fruit':fruits,
    'basket_ID':np.arange(N),
    'count':rng.integers(3,15, size = N),
    'weight': rng.uniform(0,4, size = N)},
    columns = ['basket_ID', 'fruit', 'count', 'weight']
)
df

   basket_ID   fruit  count    weight
0          0   apple      4  0.376709
1          1  orange     12  3.902489
0          2   apple     10  3.044559
0          3   apple      8  3.144257
0          4   apple      8  0.512455
1          5  orange     13  1.801544
0          6   apple      4  1.483192
0          7   apple     11  3.707060


In [None]:
# category object
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [None]:
# the underlying Categorical is exposed through the array attribute
c = fruit_cat.array
type(c)

pandas.core.arrays.categorical.Categorical

In [None]:
# categories are stored as an Index attribute
c.categories

Index(['apple', 'orange'], dtype='object')

In [76]:
# the integer codes, relative to the categories index, are stored in another attribute
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [None]:
# mapping from integer code to category
dict(enumerate(c.categories))

{0: 'apple', 1: 'orange'}

In [None]:
# casting a column to the category dtype gives access to the categorical utilities
df['fruit'] = df['fruit'].astype('category')
df['fruit']

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [85]:
# creating a Categorical
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [86]:
# from_codes constructor
categories = ['foo', 'baz', 'bar']
codes = [0,1,2,0,0,1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

['foo', 'baz', 'bar', 'foo', 'foo', 'baz']
Categories (3, object): ['foo', 'baz', 'bar']

In [None]:
# ordered constructor 
my_cats_3 = pd.Categorical.from_codes(codes, categories, ordered = True)
my_cats_3

['foo', 'baz', 'bar', 'foo', 'foo', 'baz']
Categories (3, object): ['foo' < 'baz' < 'bar']
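Declaring an order matters because comparisons, `min` and `max` then follow the category order rather than the alphabetical one. The sketch below uses hypothetical size labels to make the ordering intuitive, and also shows `as_ordered()` for converting an existing unordered categorical.

```python
import pandas as pd

# Hypothetical size labels, used only for this illustration
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# comparisons follow the declared category order, not alphabetical order
is_bigger_than_small = sizes > "small"

# min and max are defined for ordered categoricals
smallest, largest = sizes.min(), sizes.max()

# an unordered categorical can be converted with as_ordered()
unordered = pd.Categorical(["foo", "baz", "bar"], categories=["foo", "baz", "bar"])
now_ordered = unordered.as_ordered()

print(is_bigger_than_small, smallest, largest, now_ordered.ordered)
```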