# Ejercicio Data Quality - Perfilado
## Evaluar la calidad de datos de las ventas de productos

Se quiere hacer una evaluación de calidad de datos sobre las ventas (sales) y pagos (payments). Para ello se requiere hacer un análisis de los siguientes puntos:
- Calidad de los datos
- Selección de clave principal
- Identificación de cardinalidad
- Obtener media, varianza y desviacion Estandar, covarianza, correlacion
- Mejorar la calidad.

**Referencia**: “Estadística Descriptiva con Python y Pandas”: https://coderhook.github.io/Descriptive%20Statistics

- Columnas sales:, orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

- Columnas payments:, customerNumber, checkNumber, paymentDate, amount

## Carga

In [None]:
import pandas as pd
import numpy as np
from tabulate import tabulate

In [None]:
sales_df = pd.read_csv(
    'https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/sales.csv')

In [None]:
payments_df = pd.read_csv(
    'https://github.com/ricardoahumada/DataScienceBasics/raw/refs/heads/main/data/company_sales/payments.csv')

## Calidad

### Sales

In [None]:
# columnas
sales_df.columns = ['orderNumber', 'orderLineNumber', 'orderDate', 'shippedDate', 'requiredDate', 'customerNumber',
                    'employeeNumber', 'productCode', 'status', 'comments', 'quantityOrdered', 'priceEach', 'sales_amount', 'origin']
sales_df.info()

In [None]:
sales_df.head(5)

In [None]:
sales_df.tail(5)

In [None]:
sales_df.sample(20)

In [None]:
sales_df.shape

In [None]:
sales_df_clean = sales_df.drop(columns=['comments', 'orderDate',
                                        'shippedDate', 'requiredDate'])

In [None]:
sales_df_clean.info()

In [None]:
# nulos
sales_df_clean.isna().sum()
# sales_df_clean.dropna(inplace=True)
# sales_df_clean.isna().sum()

In [None]:
# extremos
z_scores = (sales_df_clean-sales_df_clean.mean(numeric_only=True)) / \
    sales_df_clean.std(numeric_only=True)
z_scores_abs = z_scores.apply(np.abs)
print(tabulate(z_scores_abs, headers='keys'))

In [None]:
umbral = 3

out_mask = ~z_scores[z_scores_abs > umbral].isna()
print('\nOutliers per column:\n')
print(out_mask.sum())

In [None]:
outliers = sales_df_clean['quantityOrdered'][out_mask['quantityOrdered']]
print('Outliers:\n', outliers)

In [None]:
sales_df_clean['quantityOrdered'].describe()

In [None]:
sales_df_clean.drop(outliers, inplace=True)
sales_df_clean.shape

In [None]:
# duplicados
sales_df_clean[sales_df_clean.duplicated()]

In [None]:
sales_df_clean['complete_order_number'] = sales_df_clean['orderNumber'].astype(
    'str')+'-'+sales_df_clean['orderLineNumber'].astype('str')

In [None]:
sales_df_clean.head()

In [None]:
sales_df_clean.info()

In [None]:
dup_ordnums = sales_df_clean[sales_df_clean.duplicated(
)]['complete_order_number']


dup_ordnums.values

In [None]:
sales_df_clean[sales_df_clean['complete_order_number'].isin(
    dup_ordnums.values)]

In [None]:
sales_df_clean.drop_duplicates(inplace=True)
sales_df_clean[sales_df_clean.duplicated()]

In [None]:
# incoherencias
sales_df_clean.info()

In [None]:
sales_df_clean['status'].unique()

In [None]:
sales_df_clean['productCode'].unique()

In [None]:
# cardinalidad
def calc_cardinalidad(adf):
    result = {}
    for col in adf.columns:
        print('\n- Valores únicos para "{0}"'.format(col), '\n')
        # print(adf[col].unique())
        card = len(adf[col].unique())
        print('Num valores únicos: ', len(adf[col].unique()))
        result[col] = card

    return result


sales_card = calc_cardinalidad(sales_df_clean)
print(sales_card)

In [None]:
sales_df_clean.columns

In [None]:
sales_df_clean[['productCode', 'status', 'origin']] = sales_df_clean[[
    'productCode', 'status', 'origin']].astype('category')

In [None]:
sales_df_clean.info()

In [None]:
sales_df_clean.describe()

In [None]:
sales_df_clean.describe(include='category')

In [None]:
# frecuencias
for col in sales_df_clean.columns:

    print('\n- Frecuencias para "{0}"'.format(col), '\n')

    print(sales_df_clean[col].value_counts())

In [None]:
sales_df_clean.columns

In [None]:
# correlación
sales_corr = sales_df_clean.corr('pearson', numeric_only=True)
sales_corr

In [None]:
sales_corr[np.abs(sales_corr) >= 0.7]

In [None]:
# sesgo

sales_skw = sales_df_clean.skew(numeric_only=True)
sales_skw

In [None]:
sales_skw[np.abs(sales_skw) > 2]

In [None]:
# kurtosis
sales_kurt = sales_df_clean.kurt(numeric_only=True)
sales_kurt

In [None]:
sales_kurt[sales_kurt > np.abs(3)]

### payments

In [36]:
payments_df.columns = ['customerNumber',
                       'checkNumber', 'paymentDate', 'amount']


payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerNumber  278 non-null    int64  
 1   checkNumber     278 non-null    object 
 2   paymentDate     278 non-null    object 
 3   amount          278 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 8.8+ KB


In [37]:
payments_df.isna().sum()

customerNumber    0
checkNumber       0
paymentDate       0
amount            0
dtype: int64

In [38]:
# extremos
amount_col = payments_df['amount']


q1 = np.percentile(amount_col, 25)
q3 = np.percentile(amount_col, 75)
iqr = q3 - q1
print('iqr:\n', iqr)

umbra_sup = q3+1.5*iqr
umbra_inf = q1-1.5*iqr

print('umbrales inf:\n', umbra_inf)
print('\numbrales sup:\n', umbra_sup)

iqr:
 29892.835000000003
umbrales inf:
 -29695.117500000004

umbrales sup:
 89876.2225


In [39]:
am_outliers = amount_col[((amount_col < umbra_inf) | (amount_col > umbra_sup))]
am_outliers

17    101244.59
23    111654.40
41    116208.40
43    120166.58
61    105743.00
Name: amount, dtype: float64

In [40]:
amount_col.describe()

count       278.000000
mean      31827.944281
std       21096.143249
min         615.450000
25%       15144.135000
50%       31369.150000
75%       45036.970000
max      120166.580000
Name: amount, dtype: float64

In [None]:
payments_df.drop(am_outliers.index, inplace=True)
payments_df.shape

In [41]:
#duplicados
payments_df.duplicated().sum()

5

In [42]:
payments_df[payments_df.duplicated()]

Unnamed: 0,customerNumber,checkNumber,paymentDate,amount
32,129,ID449593,2003-12-11,13923.93
86,175,CITI3434344,2005-05-19,14500.78
144,260,IO164641,2004-08-30,13527.58
215,381,GB117430,2005-02-03,7379.9
269,487,AH612904,2003-09-28,14997.09


In [43]:
payments_df['customer-check'] = payments_df['customerNumber'].astype(
    str)+'-'+payments_df['checkNumber'].astype(str)
payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerNumber  278 non-null    int64  
 1   checkNumber     278 non-null    object 
 2   paymentDate     278 non-null    object 
 3   amount          278 non-null    float64
 4   customer-check  278 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 11.0+ KB


In [44]:
payments_df[payments_df.duplicated()]

Unnamed: 0,customerNumber,checkNumber,paymentDate,amount,customer-check
32,129,ID449593,2003-12-11,13923.93,129-ID449593
86,175,CITI3434344,2005-05-19,14500.78,175-CITI3434344
144,260,IO164641,2004-08-30,13527.58,260-IO164641
215,381,GB117430,2005-02-03,7379.9,381-GB117430
269,487,AH612904,2003-09-28,14997.09,487-AH612904


In [45]:
cust_check_ids = payments_df[payments_df.duplicated()]['customer-check'].values
cust_check_ids

array(['129-ID449593', '175-CITI3434344', '260-IO164641', '381-GB117430',
       '487-AH612904'], dtype=object)

In [46]:
payments_df[payments_df['customer-check'].isin(cust_check_ids)]

Unnamed: 0,customerNumber,checkNumber,paymentDate,amount,customer-check
31,129,ID449593,2003-12-11,13923.93,129-ID449593
32,129,ID449593,2003-12-11,13923.93,129-ID449593
85,175,CITI3434344,2005-05-19,14500.78,175-CITI3434344
86,175,CITI3434344,2005-05-19,14500.78,175-CITI3434344
143,260,IO164641,2004-08-30,13527.58,260-IO164641
144,260,IO164641,2004-08-30,13527.58,260-IO164641
214,381,GB117430,2005-02-03,7379.9,381-GB117430
215,381,GB117430,2005-02-03,7379.9,381-GB117430
268,487,AH612904,2003-09-28,14997.09,487-AH612904
269,487,AH612904,2003-09-28,14997.09,487-AH612904


In [47]:
def doNothing(x):
    return list(x)[0]


added_payments_df = payments_df.groupby('customer-check').agg(
    {'amount': 'sum', 'customerNumber': doNothing, 'checkNumber': doNothing, 'paymentDate': doNothing}).reset_index()
added_payments_df[added_payments_df['customer-check'].isin(cust_check_ids)]

Unnamed: 0,customer-check,amount,customerNumber,checkNumber,paymentDate
31,129-ID449593,27847.86,129,ID449593,2003-12-11
84,175-CITI3434344,29001.56,175,CITI3434344,2005-05-19
141,260-IO164641,27055.16,260,IO164641,2004-08-30
211,381-GB117430,14759.8,381,GB117430,2005-02-03
264,487-AH612904,29994.18,487,AH612904,2003-09-28


In [48]:
# incoherencias
added_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customer-check  273 non-null    object 
 1   amount          273 non-null    float64
 2   customerNumber  273 non-null    int64  
 3   checkNumber     273 non-null    object 
 4   paymentDate     273 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 10.8+ KB


In [49]:
added_payments_df['paymentDate'] = pd.to_datetime(
    added_payments_df['paymentDate'])

added_payments_df['checkNumber'] = added_payments_df['checkNumber'].astype(
    'category')
added_payments_df['customer-check'] = added_payments_df['customer-check'].astype(
    'category')

In [50]:
added_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   customer-check  273 non-null    category      
 1   amount          273 non-null    float64       
 2   customerNumber  273 non-null    int64         
 3   checkNumber     273 non-null    category      
 4   paymentDate     273 non-null    datetime64[ns]
dtypes: category(2), datetime64[ns](1), float64(1), int64(1)
memory usage: 28.1 KB


In [51]:
payments_card = calc_cardinalidad(added_payments_df)
print(payments_card)


- Valores únicos para "customer-check" 

Num valores únicos:  273

- Valores únicos para "amount" 

Num valores únicos:  273

- Valores únicos para "customerNumber" 

Num valores únicos:  98

- Valores únicos para "checkNumber" 

Num valores únicos:  273

- Valores únicos para "paymentDate" 

Num valores únicos:  232
{'customer-check': 273, 'amount': 273, 'customerNumber': 98, 'checkNumber': 273, 'paymentDate': 232}


In [52]:
# frecuencias
for col in added_payments_df.columns:
    # print('\n- Frecuencias para "{0}"'.format(col), '\n')
    print(added_payments_df[col].value_counts())

103-HQ336336    1
333-HL209210    1
339-AP286625    1
334-LF737277    1
334-HH517378    1
               ..
187-KL124726    1
189-BO711618    1
189-NM916675    1
198-FI192930    1
496-MN89921     1
Name: customer-check, Length: 273, dtype: int64
6066.78     1
23936.53    1
23333.06    1
28394.54    1
29716.86    1
           ..
48425.69    1
17359.53    1
32538.74    1
9658.74     1
52166.00    1
Name: amount, Length: 273, dtype: int64
141    13
124     9
398     4
381     4
323     4
       ..
357     2
450     1
415     1
211     1
239     1
Name: customerNumber, Length: 98, dtype: int64
AB661578    1
JPMR4544    1
KH910279    1
KG644125    1
KF480160    1
           ..
FI192930    1
FN155234    1
FN640986    1
FP170292    1
PT550181    1
Name: checkNumber, Length: 273, dtype: int64
2004-06-21    3
2003-12-09    3
2003-11-24    3
2003-11-18    3
2004-12-06    2
             ..
2003-01-30    1
2005-03-10    1
2004-10-21    1
2004-11-03    1
2003-07-16    1
Name: paymentDate, Length: 2

In [54]:
# correlación
payments_corr = added_payments_df.corr('pearson')
payments_corr

Unnamed: 0,amount,customerNumber
amount,1.0,-0.195275
customerNumber,-0.195275,1.0


In [55]:
payments_corr[np.abs(payments_corr) >= 0.7]

Unnamed: 0,amount,customerNumber
amount,1.0,
customerNumber,,1.0


In [56]:
# sesgo

payments_skw = added_payments_df.skew(numeric_only=True)
payments_skw

amount            1.116880
customerNumber    0.340642
dtype: float64

In [57]:
payments_skw[np.abs(payments_skw) > 2]

Series([], dtype: float64)

In [58]:
# kurtosis
payments_kurt = added_payments_df.kurt(numeric_only=True)
payments_kurt

amount            2.485139
customerNumber   -1.195111
dtype: float64

In [59]:
payments_kurt[payments_kurt > np.abs(3)]

Series([], dtype: float64)

## Mezclado de datos

In [60]:
sales_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2988 entries, 0 to 3000
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   orderNumber            2988 non-null   int64   
 1   orderLineNumber        2988 non-null   int64   
 2   customerNumber         2988 non-null   int64   
 3   employeeNumber         2988 non-null   int64   
 4   productCode            2988 non-null   category
 5   status                 2988 non-null   category
 6   quantityOrdered        2988 non-null   int64   
 7   priceEach              2988 non-null   float64 
 8   sales_amount           2988 non-null   float64 
 9   origin                 2988 non-null   category
 10  complete_order_number  2988 non-null   object  
dtypes: category(3), float64(2), int64(5), object(1)
memory usage: 224.1+ KB


In [61]:
added_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   customer-check  273 non-null    category      
 1   amount          273 non-null    float64       
 2   customerNumber  273 non-null    int64         
 3   checkNumber     273 non-null    category      
 4   paymentDate     273 non-null    datetime64[ns]
dtypes: category(2), datetime64[ns](1), float64(1), int64(1)
memory usage: 28.1 KB


In [62]:
merged_df = pd.merge(sales_df_clean, added_payments_df,
                     on='customerNumber', how='left')
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11991 entries, 0 to 11990
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   orderNumber            11991 non-null  int64         
 1   orderLineNumber        11991 non-null  int64         
 2   customerNumber         11991 non-null  int64         
 3   employeeNumber         11991 non-null  int64         
 4   productCode            11991 non-null  category      
 5   status                 11991 non-null  category      
 6   quantityOrdered        11991 non-null  int64         
 7   priceEach              11991 non-null  float64       
 8   sales_amount           11991 non-null  float64       
 9   origin                 11991 non-null  category      
 10  complete_order_number  11991 non-null  object        
 11  customer-check         11991 non-null  category      
 12  amount                 11991 non-null  float64       
 13  c

In [63]:
merged_df.head()

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,productCode,status,quantityOrdered,priceEach,sales_amount,origin,complete_order_number,customer-check,amount,checkNumber,paymentDate
0,10100,1,363,1216,S24_3969,Shipped,49,35.29,1729.21,spain,10100-1,363-HL575273,50799.69,HL575273,2004-11-17
1,10100,1,363,1216,S24_3969,Shipped,49,35.29,1729.21,spain,10100-1,363-IS232033,10223.83,IS232033,2003-01-16
2,10100,1,363,1216,S24_3969,Shipped,49,35.29,1729.21,spain,10100-1,363-PN238558,55425.77,PN238558,2003-12-05
3,10100,2,363,1216,S18_2248,Shipped,50,55.09,2754.5,spain,10100-2,363-HL575273,50799.69,HL575273,2004-11-17
4,10100,2,363,1216,S18_2248,Shipped,50,55.09,2754.5,spain,10100-2,363-IS232033,10223.83,IS232033,2003-01-16


In [64]:
merged_df.tail()

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,productCode,status,quantityOrdered,priceEach,sales_amount,origin,complete_order_number,customer-check,amount,checkNumber,paymentDate
11986,10425,12,119,1370,S10_4962,In Process,38,131.49,4996.62,spain,10425-12,119-LN373447,47924.19,LN373447,2004-08-08
11987,10425,12,119,1370,S10_4962,In Process,38,131.49,4996.62,spain,10425-12,119-NG94694,49523.67,NG94694,2005-02-22
11988,10425,13,119,1370,S18_4600,In Process,38,107.76,4094.88,spain,10425-13,119-DB933704,19501.82,DB933704,2004-11-14
11989,10425,13,119,1370,S18_4600,In Process,38,107.76,4094.88,spain,10425-13,119-LN373447,47924.19,LN373447,2004-08-08
11990,10425,13,119,1370,S18_4600,In Process,38,107.76,4094.88,spain,10425-13,119-NG94694,49523.67,NG94694,2005-02-22


#### Insights by Sales and payments

In [70]:
customer_sales_pays = merged_df.groupby('customerNumber').agg(num=('complete_order_number', 'count'), tot_sale=('sales_amount', 'sum'), tot_amount=('amount', 'sum')).reset_index()

customer_sales_pays

Unnamed: 0,customerNumber,num,tot_sale,tot_amount
0,103,21,66943.08,156200.52
1,112,87,240542.94,2325248.42
2,114,220,722340.28,9932178.85
3,119,159,475719.36,6198333.04
4,121,128,416899.16,3335193.28
...,...,...,...,...
93,486,66,223295.61,1709984.98
94,487,30,85140.74,638511.90
95,489,24,59172.30,355033.80
96,495,36,131083.48,1179751.32


In [72]:
print('# top ten por número de compras')
customer_sales_pays.sort_values('num', ascending=False)[
    ['customerNumber', 'num']].head(10)

# top ten por número de compras


Unnamed: 0,customerNumber,num
9,141,3367
5,124,1620
2,114,220
14,151,192
58,323,184
47,276,184
13,148,172
67,353,164
3,119,159
26,187,153


In [73]:
print('# top ten por monto de compras')
customer_sales_pays.sort_values('tot_sale', ascending=False)[
    ['customerNumber', 'tot_sale']].head(10)

# top ten por monto de compras


Unnamed: 0,customerNumber,tot_sale
9,141,10668964.02
5,124,5326446.06
2,114,722340.28
14,151,711655.8
13,148,624924.0
58,323,618488.32
47,276,548136.88
11,145,516340.48
67,353,507932.76
3,119,475719.36


In [76]:
print('# top ten por monto de pagos')
customer_sales_pays.sort_values('tot_amount', ascending=False)[
    ['customerNumber', 'tot_amount']].head(10)

# top ten por monto de pagos


Unnamed: 0,customerNumber,tot_amount
9,141,185376400.0
5,124,105153900.0
2,114,9932179.0
14,151,8539870.0
26,187,7568915.0
58,323,7112616.0
13,148,6718794.0
47,276,6303574.0
3,119,6198333.0
48,278,5738836.0


#### Insights by origin

In [77]:
by_origin = merged_df.groupby('origin').agg(num=('complete_order_number', 'count'), tot_sale=(
    'sales_amount', 'sum'), tot_amount=('amount', 'sum')).reset_index()

by_origin

Unnamed: 0,origin,num,tot_sale,tot_amount
0,japan,428,1467969.0,14170510.0
1,spain,11563,37023663.46,509480500.0


#### Insights by date

In [78]:
paymentDate = merged_df['paymentDate']

by_date = merged_df.groupby([paymentDate.dt.year, paymentDate.dt.month]).agg(num=(
    'orderNumber', 'count'), tot_sale=('sales_amount', 'sum'), tot_ammount=('amount', 'sum'))

by_date.index.names = ['year', 'month']

by_date

Unnamed: 0_level_0,Unnamed: 1_level_0,num,tot_sale,tot_ammount
year,month,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2003,1,81,264884.69,717057.94
2003,2,327,1053999.45,13962943.76
2003,3,189,564074.4,7656653.14
2003,4,314,1066448.75,5603591.46
2003,5,239,765701.4,6398322.22
2003,6,135,442817.16,6024837.25
2003,7,468,1470009.14,13198441.27
2003,8,333,1065266.12,24853485.99
2003,9,134,396909.32,4782035.76
2003,10,515,1656871.86,19166694.82


In [79]:
print('# top años por número de compras')
by_date.sort_values('num', ascending=False).groupby('year').agg({'num': sum}).sort_values('num', ascending=False)

# top años por número de compras


Unnamed: 0_level_0,num
year,Unnamed: 1_level_1
2004,5722
2003,4139
2005,2130


In [80]:
print('# top meses por número de compras')
by_date.groupby('month').agg({'num': sum}).sort_values(
    'num', ascending=False).head(3)

# top meses por número de compras


Unnamed: 0_level_0,num
month,Unnamed: 1_level_1
12,1814
11,1719
3,1471


In [82]:
merged_df_corr = merged_df.corr('pearson')
merged_df_corr

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount,amount
orderNumber,1.0,-0.044374,-0.053594,0.09072,0.060507,-0.00368,0.0348,0.074286
orderLineNumber,-0.044374,1.0,-0.04628,-0.025341,-0.032029,0.004692,-0.023035,0.067127
customerNumber,-0.053594,-0.04628,1.0,0.05205,-0.007862,-0.008469,-0.007947,-0.31482
employeeNumber,0.09072,-0.025341,0.05205,1.0,-0.012939,-0.026228,-0.029114,-0.025593
quantityOrdered,0.060507,-0.032029,-0.007862,-0.012939,1.0,0.024957,0.567646,0.015247
priceEach,-0.00368,0.004692,-0.008469,-0.026228,0.024957,1.0,0.80799,-0.003561
sales_amount,0.0348,-0.023035,-0.007947,-0.029114,0.567646,0.80799,1.0,0.004488
amount,0.074286,0.067127,-0.31482,-0.025593,0.015247,-0.003561,0.004488,1.0


In [83]:
merged_df_corr[(merged_df_corr > 0.7) & (merged_df_corr != 1)]

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount,amount
orderNumber,,,,,,,,
orderLineNumber,,,,,,,,
customerNumber,,,,,,,,
employeeNumber,,,,,,,,
quantityOrdered,,,,,,,,
priceEach,,,,,,,0.80799,
sales_amount,,,,,,0.80799,,
amount,,,,,,,,


## Conclusiones

**sales:**
- nulos: eliminadas 3 columnas. Luego no nulos
- anomalías: 17 outliers elimiandos de quantityOrdered
- duplicados: 5 duplicados elimiandos
- incoherencias: ajustados tipos
- cardinalidad: descompensación en origen: 21 - 1 (spain-japan) y en status (shipped +90%)
- estadística descriptiva: correlación entre 'sales_amount' y 'priceEach'. No Sesgo significativo.

**payments:**
- nulos: no nulos
- anomalías: 6 outliers elimiandos en amount
- duplicados: 6 duplicados mezclados
- incoherencias: ajustados tipos
- cardinalidad: no se observan descompensaciones
- estadística descriptiva: no correlación fuerte. no sesgo.

**datos finales:**
- 4 primeros clientes son los mismos en los top tens
- Origen mayoritario de spain, pero monto total de japan
- Año de más ventas 2004
- Meses de más ventas: 11, 12, 5

## Guardar

In [85]:
merged_df.to_csv(
    '../data/company_sales/output/merged_lean_df.csv', index=False)

In [87]:
merged_df.to_pickle(
    '../data/company_sales/output/merged_lean_df.pkl')

In [None]:
# ! pip install fastparquet

In [86]:
merged_df.to_parquet(
    '../data/company_sales/output/merged_lean_df.parquet')