# Data Quality Exercise - Profiling
## Assessing the data quality of product sales

We want to assess the data quality of the sales (sales) and payments (payments) datasets. This requires analyzing the following points:
- Data quality
- Primary key selection
- Cardinality identification
- Computing the mean, variance, standard deviation, covariance, and correlation
- Improving the quality
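
As a rough orientation for the statistics listed above, the following sketch computes them with pandas on a small toy frame. The frame `toy` and its values are illustrative assumptions, not the real dataset; only the column names mirror the sales table.

```python
import pandas as pd

# Hypothetical toy data (assumption): column names mirror the sales table,
# values are only illustrative.
toy = pd.DataFrame({
    "quantityOrdered": [49, 50, 30, 22, 26],
    "priceEach": [35.29, 55.09, 136.00, 75.46, 167.06],
})

# Mean, sample variance and sample standard deviation (pandas uses ddof=1).
summary = toy.agg(["mean", "var", "std"])

# Pairwise covariance and Pearson correlation between the numeric columns.
cov_matrix = toy.cov()
corr_matrix = toy.corr()

print(summary)
print(cov_matrix)
print(corr_matrix)
```

Note that pandas returns the sample variance and standard deviation by default; pass `ddof=0` to `var()` or `std()` if the population versions are wanted.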

**Reference**: "Estadística Descriptiva con Python y Pandas" (Descriptive Statistics with Python and Pandas): https://coderhook.github.io/Descriptive%20Statistics

- Columns in sales: orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

- Columns in payments: customerNumber, checkNumber, paymentDate, amount

In [2]:
import pandas as pd
import numpy as np
from tabulate import tabulate

In [4]:
sales_df = pd.read_csv('https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/sales.csv')

In [5]:
payments_df = pd.read_csv('https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/payments.csv')

## Quality

### Sales

In [8]:
# Assign explicit column names, then inspect dtypes and non-null counts
sales_df.columns = ['orderNumber', 'orderLineNumber', 'orderDate', 'shippedDate', 'requiredDate', 'customerNumber',
                    'employeeNumber', 'productCode', 'status', 'comments', 'quantityOrdered', 'priceEach', 'sales_amount', 'origin']
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   orderNumber      3001 non-null   int64  
 1   orderLineNumber  3001 non-null   int64  
 2   orderDate        3001 non-null   object 
 3   shippedDate      2859 non-null   object 
 4   requiredDate     3001 non-null   object 
 5   customerNumber   3001 non-null   int64  
 6   employeeNumber   3001 non-null   int64  
 7   productCode      3001 non-null   object 
 8   status           3001 non-null   object 
 9   comments         759 non-null    object 
 10  quantityOrdered  3001 non-null   int64  
 11  priceEach        3001 non-null   float64
 12  sales_amount     3001 non-null   float64
 13  origin           3001 non-null   object 
dtypes: float64(2), int64(5), object(7)
memory usage: 328.4+ KB
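
The `info()` output above shows that shippedDate has 2859 of 3001 values present (about 4.7% missing) and comments only 759 of 3001 (about 74.7% missing). A minimal sketch for getting those percentages directly is shown below; the frame `toy` is a hypothetical stand-in for `sales_df`, with made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales_df (assumption): a few rows with gaps.
toy = pd.DataFrame({
    "shippedDate": ["0000-00-00", np.nan, "0000-00-00", "0000-00-00"],
    "comments": [np.nan, np.nan, "Check on availability.", np.nan],
    "status": ["Shipped", "In Process", "Shipped", "Shipped"],
})

# isna().mean() gives the fraction of missing values per column.
null_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(null_pct)
```

Applied to the real frame, the same expression would be `sales_df.isna().mean() * 100`.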


In [10]:
sales_df.head()

Unnamed: 0,orderNumber,orderLineNumber,orderDate,shippedDate,requiredDate,customerNumber,employeeNumber,productCode,status,comments,quantityOrdered,priceEach,sales_amount,origin
0,10100,1,0000-00-00,0000-00-00,0000-00-00,363,1216,S24_3969,Shipped,,49,35.29,1729.21,spain
1,10100,2,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_2248,Shipped,,50,55.09,2754.5,spain
2,10100,3,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_1749,Shipped,,30,136.0,4080.0,spain
3,10100,4,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_4409,Shipped,,22,75.46,1660.12,spain
4,10101,1,0000-00-00,0000-00-00,0000-00-00,128,1504,S18_2795,Shipped,Check on availability.,26,167.06,4343.56,spain


In [12]:
sales_df.tail()

Unnamed: 0,orderNumber,orderLineNumber,orderDate,shippedDate,requiredDate,customerNumber,employeeNumber,productCode,status,comments,quantityOrdered,priceEach,sales_amount,origin
2996,10425,9,0000-00-00,,0000-00-00,119,1370,S24_2300,In Process,,49,127.79,6261.71,spain
2997,10425,10,0000-00-00,,0000-00-00,119,1370,S18_2432,In Process,,19,48.62,923.78,spain
2998,10425,11,0000-00-00,,0000-00-00,119,1370,S32_1268,In Process,,41,83.79,3435.39,spain
2999,10425,12,0000-00-00,,0000-00-00,119,1370,S10_4962,In Process,,38,131.49,4996.62,spain
3000,10425,13,0000-00-00,,0000-00-00,119,1370,S18_4600,In Process,,38,107.76,4094.88,spain


In [14]:
sales_df.sample(20)

Unnamed: 0,orderNumber,orderLineNumber,orderDate,shippedDate,requiredDate,customerNumber,employeeNumber,productCode,status,comments,quantityOrdered,priceEach,sales_amount,origin
2352,10350,10,0000-00-00,0000-00-00,0000-00-00,141,1370,S24_3816,Shipped,,25,77.15,1928.75,spain
302,10135,2,0000-00-00,0000-00-00,0000-00-00,124,1165,S24_3856,Shipped,,47,139.03,6534.41,spain
1939,10308,8,0000-00-00,0000-00-00,0000-00-00,319,1323,S24_4278,Shipped,Customer requested that FedEx Ground is used f...,44,71.73,3156.12,spain
1131,10219,3,0000-00-00,0000-00-00,0000-00-00,487,1165,S24_2840,Shipped,,21,31.12,653.52,spain
1976,10311,6,0000-00-00,0000-00-00,0000-00-00,141,1370,S24_1046,Shipped,Difficult to negotiate with customer. We need ...,26,70.55,1834.3,spain
2900,10414,5,0000-00-00,,0000-00-00,362,1216,S24_3151,On Hold,Customer credit limit exceeded. Will ship when...,60,72.58,4354.8,spain
424,10147,8,0000-00-00,0000-00-00,0000-00-00,379,1188,S12_3990,Shipped,,21,74.21,1558.41,spain
2792,10399,6,0000-00-00,0000-00-00,0000-00-00,496,1612,S10_4698,Shipped,,22,156.86,3450.92,spain
2748,10394,4,0000-00-00,0000-00-00,0000-00-00,141,1370,S32_3207,Shipped,,30,55.93,1677.9,spain
2769,10398,1,0000-00-00,0000-00-00,0000-00-00,353,1337,S72_1253,Shipped,,34,41.22,1401.48,spain


In [16]:
sales_df['orderDate'].unique()

array(['0000-00-00', '2038-09-00'], dtype=object)

In [18]:
sales_df['orderDate'].value_counts()

0000-00-00    2998
2038-09-00       3
Name: orderDate, dtype: int64
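
Neither of the two values found in orderDate is a valid calendar date: `0000-00-00` has no valid year, month, or day, and `2038-09-00` has day zero. One way to quantify this, sketched here on a toy series, is to parse with `pd.to_datetime(..., errors="coerce")` and count the values that become `NaT`. The third value, `2004-05-17`, is a hypothetical valid date added only for contrast.

```python
import pandas as pd

# The first two values come from the column above; the third is a
# hypothetical valid date (assumption) added for contrast.
raw = pd.Series(["0000-00-00", "2038-09-00", "2004-05-17"], name="orderDate")

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Invalid = present in the raw data but not parseable as a date.
invalid_mask = parsed.isna() & raw.notna()
print(invalid_mask.sum(), "invalid dates out of", len(raw))
```

The `raw.notna()` term keeps genuinely missing values (such as the empty shippedDate entries) from being counted as invalid dates.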

In [23]:
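# Drop the date columns (they show placeholder values such as '0000-00-00')
# and the sparsely populated comments column
sales_df_clean = sales_df.drop(columns=['orderDate', 'shippedDate', 'requiredDate', 'comments'])
sales_df_clean.info()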

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   orderNumber      3001 non-null   int64  
 1   orderLineNumber  3001 non-null   int64  
 2   customerNumber   3001 non-null   int64  
 3   employeeNumber   3001 non-null   int64  
 4   productCode      3001 non-null   object 
 5   status           3001 non-null   object 
 6   quantityOrdered  3001 non-null   int64  
 7   priceEach        3001 non-null   float64
 8   sales_amount     3001 non-null   float64
 9   origin           3001 non-null   object 
dtypes: float64(2), int64(5), object(3)
memory usage: 234.6+ KB


In [26]:
sales_df_clean.isna().sum()

orderNumber        0
orderLineNumber    0
customerNumber     0
employeeNumber     0
productCode        0
status             0
quantityOrdered    0
priceEach          0
sales_amount       0
origin             0
dtype: int64
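
The absence of nulls does not guarantee that the values are consistent with each other. In the rows shown earlier, sales_amount appears to equal quantityOrdered × priceEach (for example, 30 × 136.0 = 4080.0). Assuming that this relationship is meant to hold for every row, which the source does not state, a check could be sketched as follows. The frame `toy` is hypothetical and its last amount is deliberately wrong.

```python
import numpy as np
import pandas as pd

# Hypothetical rows (assumption); the last sales_amount is deliberately wrong.
toy = pd.DataFrame({
    "quantityOrdered": [49, 50, 30],
    "priceEach": [35.29, 55.09, 136.00],
    "sales_amount": [1729.21, 2754.50, 9999.00],
})

# Assumed business rule: sales_amount == quantityOrdered * priceEach.
expected = toy["quantityOrdered"] * toy["priceEach"]

# Compare with a one-cent tolerance to absorb floating-point rounding.
mismatch = ~np.isclose(toy["sales_amount"], expected, atol=0.01)
print(toy[mismatch])
```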

In [28]:
sales_df_clean.duplicated().sum()

5

In [29]:
sales_df_clean[sales_df_clean.duplicated()]

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,productCode,status,quantityOrdered,priceEach,sales_amount,origin
28,10104,2,141,1370,S50_1514,Shipped,32,53.31,1705.92,spain
2861,10410,2,357,1612,S18_3136,Shipped,34,84.82,2883.88,spain
2895,10413,6,175,1323,S32_3207,Shipped,24,56.55,1357.2,spain
2945,10419,1,382,1401,S18_1589,Shipped,37,100.8,3729.6,spain
2990,10425,3,119,1370,S18_2238,In Process,28,147.36,4126.08,spain
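
The five rows above are exact repeats of earlier rows. A possible next step, sketched below on toy data, is to inspect both copies of each duplicate, drop the repeats, and then test whether a candidate key is unique. Treating (orderNumber, orderLineNumber) as the candidate primary key is an assumption here, and the product code `S12_3148` is a hypothetical value.

```python
import pandas as pd

# Hypothetical rows (assumption): the second and third rows are identical.
toy = pd.DataFrame({
    "orderNumber": [10104, 10104, 10104, 10410],
    "orderLineNumber": [1, 2, 2, 2],
    "productCode": ["S12_3148", "S50_1514", "S50_1514", "S18_3136"],
})

# keep=False marks every copy of a duplicated row, not just the repeats.
both_copies = toy[toy.duplicated(keep=False)]

# Remove exact repeats, keeping the first occurrence.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Candidate key (assumption): unique if no key combination repeats.
key = ["orderNumber", "orderLineNumber"]
is_unique_key = not deduped.duplicated(subset=key).any()

print(both_copies)
print("rows after dedup:", len(deduped), "| key unique:", is_unique_key)
```

Checking the key only after removing exact duplicates separates two different problems: repeated rows, and distinct rows that share the same key.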
