# Data Quality Exercise - Profiling
## Assessing the data quality of product sales

The goal is to assess the data quality of the sales (`sales`) and payments (`payments`) datasets. This requires analysing the following points:
- Data quality
- Primary key selection
- Cardinality identification
- Mean, variance, standard deviation, covariance, and correlation
- Quality improvement
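
As a rough illustration of the statistics listed above, the following minimal sketch computes them with pandas on a small made-up DataFrame. The values are invented for illustration and are not taken from `sales.csv`; only the column names mirror the dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not from sales.csv)
toy = pd.DataFrame({
    'quantityOrdered': [10, 20, 30, 40],
    'priceEach': [1.0, 2.0, 3.0, 4.0],
})
toy['sales_amount'] = toy['quantityOrdered'] * toy['priceEach']

mean = toy.mean()            # arithmetic mean per column
var = toy.var()              # sample variance (ddof=1 by default)
std = toy.std()              # sample standard deviation (ddof=1 by default)
cov = toy.cov()              # pairwise sample covariance matrix
corr = toy.corr('pearson')   # pairwise Pearson correlation matrix

print(mean, var, std, cov, corr, sep='\n\n')
```

Note that pandas uses the sample estimators (`ddof=1`) for variance and standard deviation, whereas NumPy's `np.var` and `np.std` default to `ddof=0`.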

**Reference**: “Estadística Descriptiva con Python y Pandas”: https://coderhook.github.io/Descriptive%20Statistics

- `sales` columns: orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

- `payments` columns: customerNumber, checkNumber, paymentDate, amount

## Loading

In [83]:
import pandas as pd
import numpy as np
from tabulate import tabulate

In [84]:
sales_df = pd.read_csv(
    'https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/sales.csv')

In [85]:
payments_df = pd.read_csv(
    'https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/payments.csv')

## Quality

### Sales

In [None]:
# columns
sales_df.columns = ['orderNumber', 'orderLineNumber', 'orderDate', 'shippedDate', 'requiredDate', 'customerNumber',
                    'employeeNumber', 'productCode', 'status', 'comments', 'quantityOrdered', 'priceEach', 'sales_amount', 'origin']
sales_df.info()

In [None]:
sales_df.head(5)

In [None]:
sales_df.tail(5)

In [None]:
sales_df.sample(20)

In [None]:
sales_df.shape

In [91]:
sales_df_clean = sales_df.drop(columns=['comments', 'orderDate',
                                        'shippedDate', 'requiredDate'])

In [None]:
sales_df_clean.info()

In [None]:
# nulls
sales_df_clean.isna().sum()
# sales_df_clean.dropna(inplace=True)
# sales_df_clean.isna().sum()

In [None]:
# outliers (z-score), computed on the numeric columns only
sales_num = sales_df_clean.select_dtypes(include='number')
z_scores = (sales_num - sales_num.mean()) / sales_num.std()
z_scores_abs = z_scores.abs()
print(tabulate(z_scores_abs, headers='keys'))

In [None]:
umbral = 3  # z-score threshold

# True wherever |z| exceeds the threshold
out_mask = z_scores_abs > umbral
print('\nOutliers per column:\n')
print(out_mask.sum())

In [None]:
outliers = sales_df_clean['quantityOrdered'][out_mask['quantityOrdered']]
print('Outliers:\n', outliers)

In [None]:
sales_df_clean['quantityOrdered'].describe()

In [None]:
# drop() expects index labels, so pass the outliers' index rather than their values
sales_df_clean.drop(outliers.index, inplace=True)
sales_df_clean.shape
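
`DataFrame.drop` interprets its first argument as index labels, so dropping flagged rows needs the index of the flagged Series rather than the Series itself. The sketch below shows the difference on made-up data; the index labels and the `> 100` rule are assumptions chosen only to make the effect visible.

```python
import pandas as pd

# Toy data, invented for illustration; the index is deliberately not 0..n-1
toy = pd.DataFrame({'quantityOrdered': [5, 7, 6, 500]},
                   index=[10, 11, 12, 13])

# Pretend the last row was flagged as an outlier
flagged = toy['quantityOrdered'][toy['quantityOrdered'] > 100]

# Passing flagged.index removes exactly the flagged row
cleaned = toy.drop(flagged.index)
print(cleaned)

# Passing the Series itself uses its values (here 500) as row labels
try:
    toy.drop(flagged)
except KeyError as err:
    print('KeyError:', err)
```

With a default integer index, the values of a column such as `quantityOrdered` can happen to match existing row labels, in which case unrelated rows would be dropped without any error.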

In [None]:
# duplicates
sales_df_clean[sales_df_clean.duplicated()]

In [100]:
sales_df_clean['complete_order_number'] = sales_df_clean['orderNumber'].astype(
    'str')+'-'+sales_df_clean['orderLineNumber'].astype('str')
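
Since primary key selection is one of the goals of the exercise, a candidate key such as `orderNumber` plus `orderLineNumber` can be checked for uniqueness before relying on it. The sketch below is one way to do that; `is_candidate_key` is a hypothetical helper, not part of pandas, and the data is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; the last two rows are exact duplicates
toy = pd.DataFrame({
    'orderNumber': [100, 100, 101, 101],
    'orderLineNumber': [1, 2, 1, 1],
})


def is_candidate_key(df, cols):
    """Hypothetical helper: True if `cols` together are unique and non-null."""
    no_nulls = df[cols].notna().all().all()
    unique = not df.duplicated(subset=cols).any()
    return bool(no_nulls and unique)


print(is_candidate_key(toy, ['orderNumber', 'orderLineNumber']))

# After removing exact duplicate rows, the composite key becomes unique
deduped = toy.drop_duplicates()
print(is_candidate_key(deduped, ['orderNumber']))
print(is_candidate_key(deduped, ['orderNumber', 'orderLineNumber']))
```

Passing the column list to `duplicated(subset=...)` avoids building a concatenated string column just for the check, although a single key column can still be convenient for later lookups.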

In [None]:
sales_df_clean.head()

In [None]:
sales_df_clean.info()

In [None]:
dup_ordnums = sales_df_clean.loc[sales_df_clean.duplicated(),
                                 'complete_order_number']

dup_ordnums.values

In [None]:
sales_df_clean[sales_df_clean['complete_order_number'].isin(
    dup_ordnums.values)]

In [None]:
sales_df_clean.drop_duplicates(inplace=True)
sales_df_clean[sales_df_clean.duplicated()]

In [None]:
# inconsistencies
sales_df_clean.info()

In [None]:
sales_df_clean['status'].unique()

In [None]:
sales_df_clean['productCode'].unique()

In [None]:
# cardinality
def calc_cardinalidad(adf):
    result = {}
    for col in adf.columns:
        print('\n- Unique values for "{0}"'.format(col), '\n')
        # print(adf[col].unique())
        card = len(adf[col].unique())  # NaN counts as a distinct value here
        print('Number of unique values: ', card)
        result[col] = card

    return result


sales_card = calc_cardinalidad(sales_df_clean)
print(sales_card)

In [None]:
sales_df_clean.columns

In [111]:
sales_df_clean[['productCode', 'status', 'origin']] = sales_df_clean[[
    'productCode', 'status', 'origin']].astype('category')

In [None]:
sales_df_clean.info()

In [None]:
sales_df_clean.describe()

In [None]:
sales_df_clean.describe(include='category')

In [None]:
# frequencies
for col in sales_df_clean.columns:

    print('\n- Frequencies for "{0}"'.format(col), '\n')

    print(sales_df_clean[col].value_counts())

In [None]:
sales_df_clean.columns

In [None]:
# correlation
sales_corr = sales_df_clean.corr('pearson', numeric_only=True)
sales_corr

In [None]:
sales_corr[np.abs(sales_corr) >= 0.7]
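
Filtering the full correlation matrix keeps the diagonal (always 1) and shows every pair twice. As a possible alternative, the sketch below lists each strongly correlated pair once by keeping only the upper triangle. The data and column names are invented for illustration; the 0.7 threshold follows the cell above.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: b is roughly 2*a, c is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 9.9],
    'c': [3.0, 1.0, 4.0, 1.0, 5.0],
})
corr = toy.corr('pearson')

# Keep the strict upper triangle (k=1 skips the diagonal); the rest becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (column, row) and drop the masked entries
pairs = upper.unstack().dropna()
strong = pairs[pairs.abs() >= 0.7]
print(strong)
```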

In [None]:
# skewness

sales_skw = sales_df_clean.skew(numeric_only=True)
sales_skw

In [None]:
sales_skw[np.abs(sales_skw) > 2]

In [None]:
# kurtosis
sales_kurt = sales_df_clean.kurt(numeric_only=True)
sales_kurt

In [None]:
sales_kurt[np.abs(sales_kurt) > 3]
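
When choosing a kurtosis threshold, it helps to remember that pandas' `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0 rather than 3. The sketch below illustrates this on synthetic samples; the seed, sample size, and the Student's t comparison are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples, for illustration only
normal = pd.Series(rng.normal(size=100_000))
heavy = pd.Series(rng.standard_t(df=5, size=100_000))  # heavier tails

normal_kurt = normal.kurt()  # excess kurtosis, near 0 for normal data
heavy_kurt = heavy.kurt()    # clearly positive for heavy-tailed data

print('normal:', normal_kurt)
print('heavy-tailed:', heavy_kurt)
```

Under this definition, a cutoff of 3 flags columns whose tails are considerably heavier than a normal distribution's.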

### Payments

In [None]:
payments_df.columns = ['customerNumber',
                       'checkNumber', 'paymentDate', 'amount']


payments_df.info()

In [None]:
payments_df.isna().sum()

In [None]:
# outliers (IQR rule)
amount_col = payments_df['amount']


q1 = np.percentile(amount_col, 25)
q3 = np.percentile(amount_col, 75)
iqr = q3 - q1
print('iqr:\n', iqr)

umbra_sup = q3+1.5*iqr
umbra_inf = q1-1.5*iqr

print('lower threshold:\n', umbra_inf)
print('\nupper threshold:\n', umbra_sup)

In [None]:
am_outliers = amount_col[((amount_col < umbra_inf) | (amount_col > umbra_sup))]
am_outliers

In [None]:
amount_col.describe()

In [None]:
payments_df.drop(am_outliers.index, inplace=True)
payments_df.shape

In [None]:
# duplicates
payments_df.duplicated().sum()

In [None]:
payments_df[payments_df.duplicated()]

In [None]:
payments_df['customer-check'] = payments_df['customerNumber'].astype(
    str)+'-'+payments_df['checkNumber'].astype(str)
payments_df.info()

In [None]:
payments_df[payments_df.duplicated()]

In [None]:
cust_check_ids = payments_df[payments_df.duplicated()]['customer-check'].values
cust_check_ids

In [None]:
payments_df[payments_df['customer-check'].isin(cust_check_ids)]

In [None]:
def doNothing(x):
    # Despite the name, this keeps the first value of each group
    return list(x)[0]


added_payments_df = payments_df.groupby('customer-check').agg(
    {'amount': 'sum', 'customerNumber': doNothing, 'checkNumber': doNothing, 'paymentDate': doNothing}).reset_index()
added_payments_df[added_payments_df['customer-check'].isin(cust_check_ids)]

In [None]:
# inconsistencies
added_payments_df.info()

In [137]:
added_payments_df['paymentDate'] = pd.to_datetime(
    added_payments_df['paymentDate'])

added_payments_df['checkNumber'] = added_payments_df['checkNumber'].astype(
    'category')
added_payments_df['customer-check'] = added_payments_df['customer-check'].astype(
    'category')

In [None]:
added_payments_df.info()

In [None]:
payments_card = calc_cardinalidad(added_payments_df)
print(payments_card)

In [None]:
# frequencies
for col in added_payments_df.columns:
    # print('\n- Frequencies for "{0}"'.format(col), '\n')
    print(added_payments_df[col].value_counts())

In [None]:
# correlation
payments_corr = added_payments_df.corr('pearson', numeric_only=True)
payments_corr

In [None]:
payments_corr[np.abs(payments_corr) >= 0.7]

In [None]:
# skewness

payments_skw = added_payments_df.skew(numeric_only=True)
payments_skw

In [None]:
payments_skw[np.abs(payments_skw) > 2]

In [None]:
# kurtosis
payments_kurt = added_payments_df.kurt(numeric_only=True)
payments_kurt

In [None]:
payments_kurt[np.abs(payments_kurt) > 3]

## Merging the data

In [None]:
sales_df_clean.info()

In [None]:
added_payments_df.info()

In [None]:
merged_df = pd.merge(sales_df_clean, added_payments_df,
                     on='customerNumber', how='left')
merged_df.info()
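
One thing worth checking after a merge on `customerNumber` alone is row fan-out: when a customer has several sales lines and several payments, every sale is paired with every payment, and sums over the merged frame count values more than once. The sketch below shows the effect on invented data and, as one possible alternative (an assumption, not the approach used in this notebook), aggregates payments per customer before merging.

```python
import pandas as pd

# Toy data, invented for illustration: one customer, 2 sales, 3 payments
sales = pd.DataFrame({'customerNumber': [1, 1],
                      'sales_amount': [100.0, 50.0]})
payments = pd.DataFrame({'customerNumber': [1, 1, 1],
                         'amount': [60.0, 40.0, 50.0]})

merged = pd.merge(sales, payments, on='customerNumber', how='left')
print(len(merged))                   # 2 sales x 3 payments
print(merged['sales_amount'].sum())  # each sale is counted once per payment

# Alternative: one payment total per customer, then a many-to-one merge
pay_tot = payments.groupby('customerNumber', as_index=False)['amount'].sum()
safe = pd.merge(sales, pay_tot, on='customerNumber', how='left',
                validate='many_to_one')
print(len(safe))
print(safe['sales_amount'].sum())
```

The `validate` argument makes pandas raise an error if the right-hand keys are not unique. Even in the aggregated version the customer's payment total repeats on every sale row, so it should be read per customer rather than summed across rows.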

In [None]:
merged_df.head()

In [None]:
merged_df.tail()

#### Insights by Sales and payments

In [None]:
customer_sales_pays = merged_df.groupby('customerNumber').agg(num=('complete_order_number', 'count'), tot_sale=(
    'sales_amount', 'sum'), tot_ammount=('amount', 'sum')).reset_index()

customer_sales_pays

In [None]:
print('# top ten by number of purchases')
customer_sales_pays.sort_values('num', ascending=False)[
    ['customerNumber', 'num']].head(10)

In [None]:
print('# top ten by purchase amount')
customer_sales_pays.sort_values('tot_sale', ascending=False)[
    ['customerNumber', 'tot_sale']].head(10)

In [None]:
print('# top ten by payment amount')
customer_sales_pays.sort_values('tot_ammount', ascending=False)[
    ['customerNumber', 'tot_ammount']].head(10)

#### Insights by origin

In [None]:
by_origin = merged_df.groupby('origin').agg(num=('complete_order_number', 'count'), tot_sale=(
    'sales_amount', 'sum'), tot_amount=('amount', 'sum')).reset_index()

by_origin

#### Insights by date

In [None]:
paymentDate = merged_df['paymentDate']

by_date = merged_df.groupby([paymentDate.dt.year, paymentDate.dt.month]).agg(num=(
    'orderNumber', 'count'), tot_sale=('sales_amount', 'sum'), tot_ammount=('amount', 'sum'))

by_date.index.names = ['year', 'month']

by_date

In [None]:
print('# top years by number of purchases')
by_date.groupby('year').agg({'num': 'sum'}).sort_values('num', ascending=False)

In [None]:
print('# top months by number of purchases')
by_date.groupby('month').agg({'num': 'sum'}).sort_values(
    'num', ascending=False).head(3)

In [None]:
merged_df_corr = merged_df.corr('pearson', numeric_only=True)
merged_df_corr

In [None]:
merged_df_corr[(np.abs(merged_df_corr) > 0.7) & (merged_df_corr != 1)]

## Conclusions

**sales:**
- nulls: 3 columns dropped; no nulls remain afterwards
- anomalies: 17 outliers removed from quantityOrdered
- duplicates: 5 duplicates removed
- inconsistencies: data types adjusted
- cardinality: imbalance in origin: 21 - 1 (spain-japan), and in status (shipped accounts for over 90%)
- descriptive statistics: correlation between 'sales_amount' and 'priceEach'. No significant skew.

**payments:**
- nulls: none
- anomalies: 6 outliers removed from amount
- duplicates: 6 duplicates merged
- inconsistencies: data types adjusted
- cardinality: no imbalances observed
- descriptive statistics: no strong correlation. No skew.

**final data:**
- The top 4 customers are the same across the top-ten rankings
- Most records originate from spain, but the largest total amount comes from japan
- Year with the most sales: 2004
- Months with the most sales: 11, 12, 5

## Save

In [174]:
merged_df.to_csv(
    '../../data/company_sales/output/merged_lean_df.csv', index=False)

In [177]:
merged_df.to_pickle(
    '../../data/company_sales/output/merged_lean_df.pkl')

In [None]:
# ! pip install fastparquet

In [None]:
merged_df.to_parquet(
    '../../data/company_sales/output/merged_lean_df.parquet')
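
The output format affects whether the type adjustments made earlier survive: CSV stores plain text, so categorical and datetime columns come back as generic types unless they are restored at read time, whereas a pickle keeps the pandas dtypes. The sketch below demonstrates the CSV case with an in-memory buffer; the data is invented, and only the column names echo this notebook.

```python
import io
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'status': pd.Categorical(['Shipped', 'Cancelled', 'Shipped']),
    'paymentDate': pd.to_datetime(['2004-01-05', '2004-02-10', '2004-03-15']),
})

# Round trip through CSV using an in-memory buffer instead of a file
buf = io.StringIO()
toy.to_csv(buf, index=False)

buf.seek(0)
back = pd.read_csv(buf)
print(back.dtypes)  # category and datetime information is gone

# The dtypes can be restored explicitly when reading
buf.seek(0)
restored = pd.read_csv(buf, dtype={'status': 'category'},
                       parse_dates=['paymentDate'])
print(restored.dtypes)
```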