# Data Quality Exercise - Profiling
## Assessing the data quality of product sales

The goal is to assess the data quality of the sales (`sales`) and payments (`payments`) datasets. This requires analyzing the following points:
- Data quality
- Primary key selection
- Cardinality identification
- Computing the mean, variance, standard deviation, covariance, and correlation
- Improving data quality
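
As a quick orientation for the statistics listed above, here is a minimal sketch on a toy DataFrame. The column names mirror the sales dataset, but the values are invented for illustration and do not come from `sales.csv`. Note that pandas computes the sample variance and standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from sales.csv).
toy = pd.DataFrame({
    "quantityOrdered": [10, 20, 30, 40],
    "priceEach": [5.0, 5.0, 10.0, 10.0],
})
toy["sales_amount"] = toy["quantityOrdered"] * toy["priceEach"]

mean_ = toy.mean()                   # per-column mean
var_ = toy.var()                     # sample variance (ddof=1)
std_ = toy.std()                     # sample standard deviation (ddof=1)
cov_ = toy.cov()                     # pairwise covariance matrix
corr_ = toy.corr(method="pearson")   # pairwise Pearson correlation matrix

print(pd.DataFrame({"mean": mean_, "var": var_, "std": std_}))
print(cov_)
print(corr_)
```

The diagonal of the covariance matrix holds each column's variance, and the diagonal of the correlation matrix is always 1.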

**Reference**: “Estadística Descriptiva con Python y Pandas” (Descriptive Statistics with Python and Pandas): https://coderhook.github.io/Descriptive%20Statistics

- Columns in sales: orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

- Columns in payments: customerNumber, checkNumber, paymentDate, amount

## Loading

In [15]:
import pandas as pd
import numpy as np
from tabulate import tabulate

In [16]:
sales_df = pd.read_csv('https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/sales.csv')

In [17]:
payments_df = pd.read_csv('https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/payments.csv')

## Quality

### Sales

In [None]:
# Columns: assign explicit names, then inspect dtypes and non-null counts
sales_df.columns = ['orderNumber', 'orderLineNumber', 'orderDate', 'shippedDate', 'requiredDate', 'customerNumber',
                    'employeeNumber', 'productCode', 'status', 'comments', 'quantityOrdered', 'priceEach', 'sales_amount', 'origin']
sales_df.info()

In [None]:
sales_df.head()

In [None]:
sales_df.tail()

In [None]:
sales_df.sample(20)

In [None]:
sales_df['orderDate'].unique()

In [None]:
sales_df['orderDate'].value_counts()

In [None]:
# corrupted columns: drop them from the working copy
sales_df_clean = sales_df.drop(columns=['orderDate', 'shippedDate', 'requiredDate','comments'])
sales_df_clean.info()

In [None]:
# nulls: count of missing values per column

sales_df_clean.isna().sum()
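
Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on invented toy data (the values are assumptions, not taken from the dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "customerNumber": [103, 112, 114, 119],
    "comments": [np.nan, "check stock", np.nan, np.nan],
    "priceEach": [10.5, np.nan, 30.0, 12.0],
})

is_null = toy.isna()
null_report = pd.DataFrame({
    "n_null": is_null.sum(),            # absolute number of missing values
    "pct_null": is_null.mean() * 100,   # share of rows that are missing
}).sort_values("pct_null", ascending=False)

print(null_report)
```

A column that is mostly null, like `comments` in this toy example, is a candidate for removal rather than imputation.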

In [None]:
# outliers (z-scores are only defined for numeric columns)

num_df = sales_df_clean.select_dtypes(include='number')
z_scores = (num_df - num_df.mean()) / num_df.std()
z_scores_abs = z_scores.abs()
print(tabulate(z_scores_abs, headers='keys'))

In [None]:
umbral = 3

# True wherever the absolute z-score exceeds the threshold
out_mask = z_scores_abs > umbral
print('\nOutliers per column:\n')
print(out_mask.sum())
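
The z-score rule used above can be sketched end to end on a small example. The data below is invented (twenty ordinary quantities plus one extreme value), and the threshold of 3 is the same one the notebook uses. Rows are removed by index label, not by value.

```python
import pandas as pd

# Toy data, invented for illustration: 20 ordinary values and one extreme one.
toy = pd.DataFrame({"quantityOrdered": list(range(20, 40)) + [500]})

num = toy.select_dtypes(include="number")   # z-scores need numeric columns
z = (num - num.mean()) / num.std()          # sample std (ddof=1)
mask = z.abs() > 3                          # True where |z| exceeds the threshold

outlier_rows = toy[mask.any(axis=1)]        # rows with at least one flagged value
toy_clean = toy.drop(index=outlier_rows.index)

print(outlier_rows)
print(toy_clean.shape)
```

With the sample standard deviation, the largest possible |z| in a column of n rows is (n - 1) / sqrt(n), so a column needs at least 11 rows before any value can exceed 3.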

In [None]:
sales_df_clean.describe()

In [None]:
outliers = sales_df_clean['quantityOrdered'][out_mask['quantityOrdered']]
print('Outliers:\n', outliers)

In [None]:
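# drop by row label: `outliers` holds quantity values, so pass its index, not the Series itself
sales_df_clean.drop(index=outliers.index, inplace=True)
sales_df_clean.shape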

In [None]:
# duplicates
sales_df_clean.duplicated().sum()

In [None]:
sales_df_clean[sales_df_clean.duplicated()]

In [40]:
sales_df_clean['complete_order_number'] = sales_df_clean['orderNumber'].astype('str')+'-'+sales_df_clean['orderLineNumber'].astype('str')

In [None]:
sales_df_clean.head()

In [None]:
dup_ordnums = sales_df_clean[sales_df_clean.duplicated()]['complete_order_number']
dup_ordnums.values

In [None]:
sales_df_clean[sales_df_clean['complete_order_number'].isin(dup_ordnums.values)]

In [None]:
sales_df_clean.drop_duplicates(inplace=True)
sales_df_clean[sales_df_clean.duplicated()]
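
The duplicate check also helps with primary key selection: a set of columns can serve as a key only if it has no nulls and no repeated combinations. The sketch below assumes a hypothetical helper, `is_candidate_key`, which is not part of the notebook, and uses invented toy rows.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "orderNumber": [10100, 10100, 10101, 10101],
    "orderLineNumber": [1, 2, 1, 2],
    "productCode": ["S18_1749", "S18_2248", "S18_2325", "S18_1749"],
})


def is_candidate_key(df, cols):
    """Hypothetical helper: True when `cols` has no nulls and no repeated combinations."""
    subset = df[cols]
    no_nulls = subset.notna().all().all()
    no_repeats = not subset.duplicated().any()
    return bool(no_nulls and no_repeats)


for cols in (["orderNumber"], ["productCode"], ["orderNumber", "orderLineNumber"]):
    print(cols, is_candidate_key(toy, cols))
```

In this toy example only the pair `orderNumber` plus `orderLineNumber` qualifies, which is the same combination the notebook concatenates into `complete_order_number`.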

In [None]:
# inconsistencies: check that each column has a sensible dtype
sales_df_clean.info()

In [None]:
sales_df_clean[['productCode', 'status', 'origin']] = sales_df_clean[['productCode', 'status', 'origin']].astype('category')
sales_df_clean.info()

In [None]:
# cardinality
print(sales_df_clean['status'].unique())
print(sales_df_clean['status'].value_counts())

In [56]:
def calc_cardinalidad(adf):
    """Return a dict mapping each column to its number of distinct values (NaN included)."""
    result = {}
    for col in adf.columns:
        card = len(adf[col].unique())
        print('\n- Unique values for "{0}"'.format(col), '\n')
        # print(adf[col].unique())
        print('Number of unique values: ', card)
        result[col] = card

    return result

In [None]:
sales_card = calc_cardinalidad(sales_df_clean)
print(sales_card)
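
The same cardinality figures can be obtained with the built-in `DataFrame.nunique`, and dividing by the row count makes them comparable across columns. This is a minimal sketch on invented toy data; `dropna=False` is used so that the counts match `len(col.unique())`, which includes NaN.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "complete_order_number": ["10100-1", "10100-2", "10101-1", "10101-2"],
    "status": ["Shipped", "Shipped", "Cancelled", "Shipped"],
    "origin": ["web", "web", "web", "web"],
})

card = toy.nunique(dropna=False)   # distinct values per column, NaN counted
ratio = card / len(toy)            # 1.0 means every row has a different value

summary = pd.DataFrame({"n_unique": card, "unique_ratio": ratio})
print(summary)
```

A ratio of 1.0 points to an identifier-like column, while a cardinality of 1 marks a constant column that carries no information.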

In [None]:
# frequencies
for col in sales_df_clean.columns:
    print('\n- Frequencies for "{0}"'.format(col), '\n')
    print(sales_df_clean[col].value_counts())

In [None]:
sales_df_clean.describe()

In [None]:
sales_df_clean.describe(include='category')

In [None]:
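# correlation (numeric columns only; text and category columns cannot be correlated)
sales_corr = sales_df_clean.corr(method='pearson', numeric_only=True)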
sales_corr

In [None]:
sales_corr[(np.abs(sales_corr) >= 0.7) & (np.abs(sales_corr) != 1)]
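
Filtering the full matrix leaves many NaN cells and shows every pair twice. One alternative, sketched below on invented toy data, keeps only the upper triangle and lists each strongly correlated pair once. The 0.7 cutoff is the one used above.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "quantityOrdered": [10, 20, 30, 40, 50],
    "priceEach": [9.0, 3.0, 7.0, 2.0, 8.0],
    "sales_amount": [105.0, 190.0, 310.0, 405.0, 490.0],
})

corr = toy.corr(method="pearson", numeric_only=True)

# Boolean mask for the strict upper triangle: drops the diagonal and mirrored pairs.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()   # Series indexed by (column_a, column_b)

strong = pairs[pairs.abs() >= 0.7]
print(strong)
```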

In [None]:
# skewness
sales_skw = sales_df_clean.skew(numeric_only=True)
sales_skw

In [None]:
sales_skw[np.abs(sales_skw) > 2]

In [None]:
# kurtosis
sales_kurt = sales_df_clean.kurt(numeric_only=True)
sales_kurt

In [None]:
sales_kurt[np.abs(sales_kurt) > 3]
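
When reading these values, keep in mind that pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0, not 3. The sketch below contrasts a symmetric column with a right-tailed one; both are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tailed": [1, 1, 1, 2, 2, 3, 50],
})

skw = toy.skew()   # 0 for symmetric data, positive for a long right tail
kur = toy.kurt()   # excess kurtosis: negative for flat data, positive for heavy tails

print(pd.DataFrame({"skew": skw, "excess_kurtosis": kur}))
```

The single extreme value in `right_tailed` drives both statistics up, which is why highly skewed columns are worth cross-checking against the outlier analysis.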

In [None]:
sales_df_clean.columns

In [None]:
sales_df_clean.groupby('customerNumber').agg(num=('complete_order_number','count'), tot_amount=('sales_amount','sum')).sort_values('num', ascending=False)