# PREPROCESSING

- Ingestion (files, API, web scraping, DB)
- EDA - exploratory data analysis - pre-visualization
- **Data preparation**
-- Integration
-- Cleaning
-- Normalization
-- Transformation
- Visualization
- Data reduction
-- Dimensionality reduction (PCA / SVD)
-- Sample reduction
-- Discretization
- Modeling
- Packaging
- DevOps - CI/CD

## Integration

Data integration consists of gathering all the data needed for the analysis, which often comes from different sources, into a single dataset. It has to deal with problems such as removing redundant attributes, detecting duplicate tuples and identifying inconsistencies. Both redundant attributes and duplicate tuples increase the storage space and computation time needed to process the data, and they can also be a source of inconsistencies.

The dataset was created with the collaboration of several people. Although they all recorded the same information, they used different nomenclature to describe the wind direction. Let's see how to unify the nomenclature they used.

In [1]:
# Import the pandas library.
import pandas as pd

# Load the data from the file "weather_dataset_edited.csv" into a dataframe.
data = pd.read_csv("https://raw.githubusercontent.com/marcusRB/IDbootcamps_DataScience_student_PT_10201/master/dataset/weather_dataset_edited.csv")

# Show a basic description of the loaded data.
print(type(data))
print(len(data))
data.head(n=5)

<class 'pandas.core.frame.DataFrame'>
43824


Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,jan,1,0,,-21,-11.0,1021.0,Nw,1.79,0,0
1,2,2010,jan,1,1,,-21,-12.0,1020.0,nw,4.92,0,0
2,3,2010,jan,1,2,,-21,-11.0,1019.0,nw,6.71,0,0
3,4,2010,jan,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,jan,1,4,,-20,-12.0,1018.0,nW,12.97,0,0


In [2]:
# Show the different abbreviations used.
set(data["cbwd"])

{'NE', 'NW', 'Nw', 'SE', 'Se', 'nW', nan, 'ne', 'nw', 'sE', 'se'}

In [3]:
# Unify the nomenclature so that only uppercase is used.
data.loc[data.cbwd == "ne", "cbwd"] = "NE"
data.loc[data.cbwd.isin(["Nw", "nW", "nw"]), "cbwd"] = "NW"
data.loc[data.cbwd.isin(["Se", "sE", "se"]), "cbwd"] = "SE"

In [4]:
# Check that the replacement was applied correctly.
set(data["cbwd"])

{'NE', 'NW', 'SE', nan}
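The same unification can be sketched more compactly with the pandas string accessor. The example below is a minimal illustration on a small hypothetical series (`wind` is not part of the dataset); `Series.str.upper()` leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the mixed-case labels in the "cbwd" column.
wind = pd.Series(["Nw", "nw", "NW", "sE", "ne", np.nan], name="cbwd")

# Upper-case every label; NaN entries are left untouched.
wind_unified = wind.str.upper()

print(sorted(wind_unified.dropna().unique()))
```

This avoids listing every spelling variant by hand, at the cost of assuming that case is the only difference between the variants.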

In [5]:
# Check whether the temperatures are in degrees Celsius or degrees Fahrenheit.
import numpy as np
grouped = data.groupby("year")
grouped.aggregate({"TEMP": "mean"})

year,TEMP
2010,11.63242
2011,54.617534
2012,11.967441
2013,12.399201
2014,13.679566


In [8]:
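def fah_to_celsius(x):
    return (x - 32) * 5 / 9

# The 2011 mean (about 54.6) suggests that year was recorded in Fahrenheit.
# Convert each 2011 reading individually, keeping the row-level values.
is_2011 = data.year == 2011
data.loc[is_2011, "TEMP"] = data.loc[is_2011, "TEMP"].apply(fah_to_celsius)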

In [9]:
grouped = data.groupby("year")
grouped.aggregate({"TEMP": "mean"})

year,TEMP
2010,11.63242
2011,12.565297
2012,11.967441
2013,12.399201
2014,13.679566
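As a quick sanity check on the figures above: the conversion is linear, so converting the 2011 yearly mean directly should give the same value as averaging the converted readings. A minimal sketch, using hypothetical readings (`temps_f`) rather than the dataset:

```python
import pandas as pd

def fah_to_celsius(x):
    return (x - 32) * 5 / 9

# Hypothetical Fahrenheit readings, for illustration only.
temps_f = pd.Series([14.0, 32.0, 50.0, 68.0, 86.0])

# Arithmetic on a Series is applied element-wise, so no loop is needed.
temps_c = fah_to_celsius(temps_f)

# Because the conversion is linear, both quantities agree.
print(temps_c.mean(), fah_to_celsius(temps_f.mean()))

# Converting the 2011 mean reported before the correction (54.617534).
print(round(fah_to_celsius(54.617534), 6))
```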


## Cleaning - handling null values

Once an integrated dataset is available, a cleaning process must be applied. This process deals with missing values and erroneous (or noisy) data, which can appear because of errors in data entry, transmission, or the data processing systems themselves.


**Missing Values**

Many real-world datasets contain missing values for various reasons. They are often encoded as NaNs, blanks or other placeholders. Training a model on a dataset with many missing values can drastically degrade the model's quality. Some algorithms, such as many scikit-learn estimators, assume that all values are numerical and that every value is meaningful.

One way to handle this problem is to drop the observations that have missing data. However, you risk losing data points with valuable information. A better strategy is often to impute the missing values, that is, to infer them from the existing part of the data.

We will focus on 6 popular approaches to data imputation for cross-sectional datasets:

- Do Nothing
- Imputation Using (Mean/Median) Values
- Imputation Using (Most Frequent) or (Zero/Constant) Values
- Imputation Using k-NN / K-means / Regression or Classification
- Imputation Using Multivariate Imputation by Chained Equation (MICE)
- Imputation Using Deep Learning (Datawig)

We can try each of them and evaluate the resulting model for every approach listed.

**a. Do Nothing**

That's an easy one. You just let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction (e.g. XGBoost). Others have an option to ignore them (e.g. LightGBM with use_missing=false). However, other algorithms will throw an error complaining about the missing values (e.g. scikit-learn's LinearRegression). In that case, you need to handle the missing data and clean it before feeding it to the algorithm.

**b. Imputation Using (Mean/Median) Values**

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values in that column, separately and independently for each column. It can only be used with numeric data.

Pros: Easy and fast. Works well with small numerical datasets.

Cons: Doesn't factor in the correlations between features, since it only works at the column level. Gives poor results on encoded categorical features (do NOT use it on categorical features). Not very accurate. Doesn't account for the uncertainty in the imputations.
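A minimal sketch of mean and median imputation with scikit-learn's `SimpleImputer`, on a small hypothetical dataframe (the column names `a` and `b` are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing entries, for illustration only.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 100.0],
})

# Each column is imputed independently with its own statistic.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)
median_imputed = SimpleImputer(strategy="median").fit_transform(df)

print(mean_imputed)
print(median_imputed)
```

Column `b` contains one large value, so its mean and median differ noticeably; the median is usually the safer choice when a column is skewed.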

**c. Imputation Using (Most Frequent) or (Zero/Constant) Values**

Most Frequent is another statistical strategy to impute missing values, and it does work with categorical features (strings or numerical representations): it replaces missing data with the most frequent value within each column. Zero/Constant imputation replaces missing values with zero or another fixed value.

Pros: Works well with categorical features.

Cons: It also doesn't factor in the correlations between features. It can introduce bias in the data.
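Both strategies can be sketched with scikit-learn's `SimpleImputer`. The dataframe below is hypothetical, and the fill value `"unknown"` is an arbitrary choice made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with missing entries.
df = pd.DataFrame({"cbwd": ["NW", "NW", np.nan, "SE", np.nan]}, dtype=object)

# Replace missing entries with the most frequent category ("NW" here).
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(df)

# Replace missing entries with a fixed placeholder value.
constant = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(df)

print(most_frequent.ravel())
print(constant.ravel())
```

A constant placeholder keeps the fact that the value was missing visible to later steps, whereas the most frequent value hides it.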

**d. Imputation Using k-NN**

The k-nearest neighbours algorithm is commonly used for simple classification. It uses 'feature similarity' to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is useful for imputation: find the k closest neighbours of the observation with missing data, then impute the missing values from the non-missing values in that neighbourhood. The Impyute library provides a simple and easy way to use KNN for imputation.

Pros: Can be much more accurate than the mean, median or most frequent imputation methods (it depends on the dataset).

Cons: Computationally expensive, since KNN works by storing the whole training dataset in memory. k-NN is also quite sensitive to outliers in the data (unlike SVM).
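A minimal sketch of k-NN imputation. It uses scikit-learn's `KNNImputer` rather than Impyute, and the array `X` is hypothetical:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric matrix with missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled with the average of that feature over the
# 2 nearest rows, where distance is computed on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The number of neighbours is a tuning parameter: small values follow local structure closely, larger values smooth the imputed values towards the column mean.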

**e. Imputation Using Multivariate Imputation by Chained Equation (MICE)**

This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values more faithfully. The chained equations approach is also very flexible and can handle variables of different data types (e.g. continuous or binary) as well as complexities such as bounds or survey skip patterns. For more information on the algorithm mechanics, you can refer to the research paper.
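A rough sketch of the chained-equations idea using scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features in a round-robin fashion. This is an assumption-laden stand-in rather than a full MICE implementation: the estimator is still marked experimental in scikit-learn and, as used here, returns a single imputed dataset instead of several. The array `X` is hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# Hypothetical data in which the second column is roughly twice the first.
X = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

# Iteratively regress each column on the other and refill the missing entries.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(np.round(X_imputed, 2))
```

Running the imputer several times with `sample_posterior=True` and different `random_state` values is one way to obtain multiple imputed datasets.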

**f. Imputation Using Deep Learning (Datawig)**

This method works very well with categorical and non-numerical features. Datawig is a library that trains machine learning models based on deep neural networks to impute missing values in a dataframe. It supports both CPU and GPU for training.

Pros: Quite accurate compared to other methods. It has some functions that can handle categorical data (Feature Encoder). It supports CPUs and GPUs.

Cons: Single-column imputation. Can be quite slow with large datasets. You have to specify the columns that contain information about the target column to be imputed.

**Other Imputation Methods**

Stochastic regression imputation:

It is quite similar to regression imputation: it predicts the missing values by regressing them on other related variables in the same dataset, and then adds a random residual value to each prediction.

Extrapolation and Interpolation:

Interpolation estimates values within the range of a discrete set of known data points; extrapolation estimates values beyond that range.

Hot-Deck imputation:

Works by randomly choosing the missing value from a set of related and similar observations.
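Of these methods, interpolation is the simplest to sketch with pandas. The example below applies linear interpolation to a hypothetical, evenly spaced series; it assumes the rows are ordered (for instance, hourly readings), which is what makes neighbouring values informative.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Fill each gap on a straight line between the surrounding known values.
filled = s.interpolate(method="linear")

print(filled.tolist())
```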


## Normalization

One way to normalize the data is to center the values so that the attribute's mean is close to zero and to scale them so that the variance is 1. Let's see how to apply this process to the attribute that contains the atmospheric pressure.


In [10]:
data["PRES"].describe()

count    43824.000000
mean      1016.447654
std         10.268698
min        991.000000
25%       1008.000000
50%       1016.000000
75%       1025.000000
max       1046.000000
Name: PRES, dtype: float64

In [12]:
from sklearn.preprocessing import StandardScaler
# Note: this is an alias, not a copy, so scaling PRES below also changes `data`.
data_transf = data
# Use StandardScaler to normalize the air pressure data.
data_transf.loc[:, ['PRES']] = StandardScaler().fit_transform(data_transf.loc[:, ["PRES"]])

In [14]:
data_transf.PRES.describe()

count    4.382400e+04
mean     2.664485e-15
std      1.000011e+00
min     -2.478206e+00
25%     -8.226701e-01
50%     -4.359456e-02
75%      8.328654e-01
max      2.877939e+00
Name: PRES, dtype: float64
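The standard deviation reported above is 1.000011 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the scaling by hand on a few hypothetical pressure readings to illustrate this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical pressure readings, for illustration only.
pres = pd.DataFrame({"PRES": [1021.0, 1020.0, 1019.0, 1019.0, 1018.0]})

# Manual z-score using the population standard deviation (ddof=0).
manual = (pres["PRES"] - pres["PRES"].mean()) / pres["PRES"].std(ddof=0)

# The same transformation with StandardScaler.
scaled = StandardScaler().fit_transform(pres).ravel()

print(np.allclose(manual, scaled))
# ddof=0 gives 1; ddof=1 (what describe() reports) is slightly larger.
print(pd.Series(scaled).std(ddof=0), pd.Series(scaled).std(ddof=1))
```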

## Data transformation

Other kinds of transformations can also be applied to the data, generating new attributes from the existing ones. For example, it can be useful to create a new attribute that aggregates information contained in other attributes, or to transform a nominal attribute into several binary attributes (which makes it possible to apply models that can only work with numeric attributes).

The attributes month and cbwd contain character strings as values and represent categorical variables, so some types of data mining algorithms cannot work with them. We will therefore transform them into a set of binary attributes (one attribute for each possible category).

In [15]:
print(list(data))

['No', 'year', 'month', 'day', 'hour', 'pm2.5', 'DEWP', 'TEMP', 'PRES', 'cbwd', 'Iws', 'Is', 'Ir']


In [16]:
# Create new binary attributes for the categories used in the "month" and "cbwd" columns.
data_transf = pd.get_dummies(data, columns=['month','cbwd'], dummy_na=True)
data_transf

Unnamed: 0,No,year,day,hour,pm2.5,DEWP,TEMP,PRES,Iws,Is,Ir,month_apr,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sept,month_nan,cbwd_NE,cbwd_NW,cbwd_SE,cbwd_nan
0,1,2010,1,0,,-21,-11.0,0.443328,1.79,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1,2,2010,1,1,,-21,-12.0,0.345943,4.92,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,3,2010,1,2,,-21,-11.0,0.248559,6.71,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
3,4,2010,1,3,,-21,-14.0,0.248559,9.84,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
4,5,2010,1,4,,-20,-12.0,0.151174,12.97,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43819,43820,2014,31,19,8.0,-23,-2.0,1.709325,231.97,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
43820,43821,2014,31,20,10.0,-22,-3.0,1.709325,237.78,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
43821,43822,2014,31,21,10.0,-22,-3.0,1.709325,242.70,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0
43822,43823,2014,31,22,8.0,-22,-4.0,1.709325,246.72,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [17]:
print(list(data_transf))

['No', 'year', 'day', 'hour', 'pm2.5', 'DEWP', 'TEMP', 'PRES', 'Iws', 'Is', 'Ir', 'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sept', 'month_nan', 'cbwd_NE', 'cbwd_NW', 'cbwd_SE', 'cbwd_nan']
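To make the effect of `dummy_na=True` easier to see, here is a minimal sketch on a hypothetical four-row dataframe. The `dtype=int` argument is an addition made for this example so that the indicators print as 0/1.

```python
import numpy as np
import pandas as pd

# Hypothetical wind-direction column with one missing value.
df = pd.DataFrame({"cbwd": ["NW", "SE", np.nan, "NE"]})

# One binary column per category, plus an extra column flagging missing values.
encoded = pd.get_dummies(df, columns=["cbwd"], dummy_na=True, dtype=int)

print(encoded)
```

Every row has exactly one indicator set to 1, and the row with the missing value is marked in `cbwd_nan` instead of being dropped.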


## Data cleaning

First, we identify the attributes that have at least one NaN value:

In [18]:
def any_is_null(x):
  return any(pd.isnull(x))

In [19]:
print(data_transf.apply(any_is_null))

No            False
year          False
day           False
hour          False
pm2.5          True
DEWP          False
TEMP          False
PRES          False
Iws           False
Is            False
Ir            False
month_apr     False
month_aug     False
month_dec     False
month_feb     False
month_jan     False
month_jul     False
month_jun     False
month_mar     False
month_may     False
month_nov     False
month_oct     False
month_sept    False
month_nan     False
cbwd_NE       False
cbwd_NW       False
cbwd_SE       False
cbwd_nan      False
dtype: bool


In [24]:
data_transf.isnull().sum()

No               0
year             0
day              0
hour             0
pm2.5         2120
DEWP             0
TEMP             0
PRES             0
Iws              0
Is               0
Ir               0
month_apr        0
month_aug        0
month_dec        0
month_feb        0
month_jan        0
month_jul        0
month_jun        0
month_mar        0
month_may        0
month_nov        0
month_oct        0
month_sept       0
month_nan        0
cbwd_NE          0
cbwd_NW          0
cbwd_SE          0
cbwd_nan         0
dtype: int64

In [23]:
data_transf.isnull().sum()/data_transf.shape[0]*100

No            0.000000
year          0.000000
day           0.000000
hour          0.000000
pm2.5         4.837532
DEWP          0.000000
TEMP          0.000000
PRES          0.000000
Iws           0.000000
Is            0.000000
Ir            0.000000
month_apr     0.000000
month_aug     0.000000
month_dec     0.000000
month_feb     0.000000
month_jan     0.000000
month_jul     0.000000
month_jun     0.000000
month_mar     0.000000
month_may     0.000000
month_nov     0.000000
month_oct     0.000000
month_sept    0.000000
month_nan     0.000000
cbwd_NE       0.000000
cbwd_NW       0.000000
cbwd_SE       0.000000
cbwd_nan      0.000000
dtype: float64
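The percentage of missing values can also be computed as the mean of the boolean mask, since the mean of True/False values is the fraction of True. A minimal sketch on a hypothetical dataframe, followed by a check of the figure reported above (2120 missing values out of 43824 rows):

```python
import numpy as np
import pandas as pd

# Hypothetical data: half of the "pm2.5" values are missing.
df = pd.DataFrame({
    "pm2.5": [np.nan, 129.0, np.nan, 181.0],
    "TEMP": [-11.0, -12.0, -11.0, -14.0],
})

# Two equivalent ways of computing the percentage of missing values per column.
pct_a = df.isnull().sum() / df.shape[0] * 100
pct_b = df.isnull().mean() * 100

print(pct_b)
print(round(2120 / 43824 * 100, 6))
```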

In [25]:
data_transf['pm2.5'].describe()

count    41704.000000
mean        98.653439
std         92.084440
min          0.000000
25%         29.000000
50%         72.000000
75%        137.000000
max        994.000000
Name: pm2.5, dtype: float64

In [26]:
# Replace the missing values with the column mean.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
data_transf['pm2.5'] = imp.fit_transform(data_transf[['pm2.5']]).ravel()

In [27]:
data_transf['pm2.5'].describe()

count    43824.000000
mean        98.653439
std         89.829472
min          0.000000
25%         31.000000
50%         78.000000
75%        132.000000
max        994.000000
Name: pm2.5, dtype: float64
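Comparing the two summaries, the mean is unchanged but the standard deviation drops from about 92.08 to about 89.83. This is an expected side effect of mean imputation: the filled values add nothing to the sum of squared deviations, while the number of rows grows. A small sketch with a hypothetical series shows the same effect:

```python
import numpy as np
import pandas as pd

# Hypothetical series: 4 observed values and 2 missing ones.
s = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])
filled = s.fillna(s.mean())

# The mean is preserved by construction.
print(s.mean(), filled.mean())

# The sample standard deviation shrinks by sqrt((n_observed - 1) / (n_total - 1)).
print(s.std(), filled.std())
print(s.std() * np.sqrt((4 - 1) / (6 - 1)))
```

Applying the same factor to the figures above, 92.084440 * sqrt(41703 / 43823) gives approximately 89.83, which is consistent with the standard deviation reported after imputation.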

In [28]:
data_transf.isnull().sum()

No            0
year          0
day           0
hour          0
pm2.5         0
DEWP          0
TEMP          0
PRES          0
Iws           0
Is            0
Ir            0
month_apr     0
month_aug     0
month_dec     0
month_feb     0
month_jan     0
month_jul     0
month_jun     0
month_mar     0
month_may     0
month_nov     0
month_oct     0
month_sept    0
month_nan     0
cbwd_NE       0
cbwd_NW       0
cbwd_SE       0
cbwd_nan      0
dtype: int64

**Exercise 1**

Load the data from the file [*bank_edited.csv*](https://raw.githubusercontent.com/marcusRB/IDbootcamps_DataScience_student_PT_10201/master/dataset/bank_edited.csv) into a dataframe. This dataset contains information about a marketing campaign run by a Portuguese bank. The original dataset can be found in the [UC Irvine Machine Learning data repository](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), but the version we will use has some modifications. Note: check the documentation of the read_csv function to see which parameters are available to adjust the data loading process. The values of the marital status (attribute marital) contain typos and use several different nomenclatures. In this exercise we will unify the nomenclature of the values of this variable.

- a) How many different values does the attribute marital have in the dataset? Show these values.

In [29]:
# answer
import pandas as pd
import numpy as np

# np.float was removed in recent NumPy versions; the built-in float is equivalent.
data = pd.read_csv("https://raw.githubusercontent.com/marcusRB/IDbootcamps_DataScience_student_PT_10201/master/dataset/bank_edited.csv",
                   sep=";", dtype={"balance": float})

In [30]:
data.sample(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
2615,45,blue-collar,married,secondary,no,776.0,yes,no,cellular,6.0,feb,232.0,4,253,1,failure,no
1444,57,technician,divorced,primary,no,13.0,yes,no,cellular,8.0,may,323.0,2,368,2,failure,no
1133,26,services,single,secondary,no,127.0,yes,yes,cellular,23.0,jul,85.0,3,-1,0,unknown,no
2443,29,technician,single,tertiary,no,218.0,yes,no,cellular,12.0,apr,169.0,1,-1,0,unknown,no
4459,31,admin.,single,secondary,no,223.0,yes,no,cellular,17.0,apr,508.0,1,315,11,success,no
3756,42,services,married,secondary,no,96.0,yes,no,cellular,5.0,may,238.0,3,340,2,failure,no
474,41,management,married,tertiary,no,666.0,yes,no,unknown,21.0,may,253.0,3,-1,0,unknown,no
110,21,student,sing,secondary,no,2488.0,no,no,cellular,30.0,jun,258.0,6,169,3,success,yes
2104,38,self-employed,divorced,tertiary,no,1513.0,no,no,cellular,7.0,may,330.0,1,342,1,failure,no
1907,59,management,divorced,tertiary,no,7813.0,yes,no,cellular,21.0,nov,75.0,1,-1,0,unknown,no


In [32]:
data.marital.unique()

array(['married', 'single', 'marrid', 'divorced', 'maried', 'sing',
       'Married', 'MARRIED', 'DIVORCED', 'Single', 'SINGLE'], dtype=object)

- b) Unify the values of the attribute marital into: "single", "married" or "divorced".

In [33]:
# answer
data.loc[(data.marital == "Married") | (data.marital == "maried") | 
         (data.marital == "MARRIED")  | (data.marital == 'marrid'), "marital"] = "married"

In [34]:
data.loc[(data.marital == "Single") | (data.marital == "SINGLE") | (data.marital == "sing"), "marital"] = "single"
data.loc[(data.marital == "DIVORCED"), "marital"] = "divorced"

In [35]:
data.marital.unique()

array(['married', 'single', 'divorced'], dtype=object)

- c) Which columns contain missing values?

In [36]:
# answer
data.isna().any()

age          False
job          False
marital      False
education    False
default      False
balance       True
housing      False
loan         False
contact      False
day           True
month        False
duration      True
campaign     False
pdays        False
previous     False
poutcome     False
y            False
dtype: bool

In [37]:
data.isnull().any()

age          False
job          False
marital      False
education    False
default      False
balance       True
housing      False
loan         False
contact      False
day           True
month        False
duration      True
campaign     False
pdays        False
previous     False
poutcome     False
y            False
dtype: bool

- d) Compute the first and third quartiles of the attribute "balance".

In [39]:
# answer
print(data.balance.quantile(.25))
print(data.balance.quantile(.75))

68.0
1476.0


**Exercise 2**

The attribute poutcome contains information on whether the bank's client took out a deposit. Compute the correlation between the attribute poutcome and the rest of the attributes (use the 'corr' function). Which variable shows the highest correlation with poutcome?

In [1]:
# Answer
 

**Exercise 3**

The sklearn package includes several example datasets in the sklearn.datasets module. These datasets are stored in Bunch format, which is specific to sklearn. A Bunch is a dictionary-like object whose interesting attributes are: data, with the raw data; target, usually with the classification labels or target labels; target_names, the meaning of the labels; feature_names, the meaning of the features or attributes; and DESCR, the full description of the dataset. Import the iris dataset from sklearn. Store the data of this dataset as a pandas object, with the corresponding variable names. Add the target variable to the dataframe under the attribute name Species, with the values set to the species of each sample.

In [None]:
# Answer
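One possible approach to Exercise 3, offered as a sketch rather than a reference solution. The variable name `iris_df` is chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset as a Bunch object.
iris = load_iris()

# Build a dataframe from the raw data, using the feature names as column names.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map each numeric target label to its species name and add it as "Species".
iris_df["Species"] = iris.target_names[iris.target]

print(iris_df.head())
```

Indexing `target_names` with the integer `target` array translates each label (0, 1, 2) into the corresponding species name in one step.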