# Exploratory analysis of the datasets (9 files).

## Information about each dataset:
1. customers: Customer information, such as ID, city, and state.
2. geolocation: Geolocation information, such as latitude and longitude.
3. order_items: Individual items in each order, including price and freight value.
4. order_payments: Payment information for the orders.
5. order_reviews: Customer reviews.
6. orders: Order information, such as status and dates.
7. products: Product information, including category and dimensions.
8. sellers: Seller information, such as location.
9. product_category_name_translation: Translates product category names into English.

## Summary of values/columns to be treated:

1. customers dataset: OK

2. **geolocation dataset: Handle duplicate rows.**

3. **order_items dataset: Column shipping_limit_date, convert object to datetime.**

4. order_payments dataset: OK

5. **order_reviews dataset: Columns review_comment_title and review_comment_message, null (NaN) values.** <br>
**Columns review_creation_date and review_answer_timestamp, convert object to datetime.**

6. **orders dataset: Columns order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, order_estimated_delivery_date, convert object to datetime.** <br>
**Columns order_approved_at, order_delivered_carrier_date and order_delivered_customer_date, null values.**

7. **products dataset: Columns product_category_name, product_name_lenght, product_description_lenght and product_photos_qty, null values.**

8. sellers dataset: OK

9. product_category_name_translation dataset: OK
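
The checks behind this summary (data types, null counts, duplicate rows) are repeated by hand for each of the nine files. A minimal sketch of a helper that bundles them follows; the function name `audit` and the toy dataframe are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Collect the basic checks that the notebook runs for every file."""
    return {
        "rows": len(df),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isnull().sum().to_dict(),
        # duplicated() flags rows that repeat an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for one of the CSV files (made-up values).
toy = pd.DataFrame({
    "order_id": ["a", "b", "b", "c"],
    "price": [10.0, 20.0, 20.0, None],
})

report = audit(toy)
print(report)
```

Calling `audit` on each dataframe, for example inside a loop over a dict of dataframes, would produce one comparable report per file.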

## Installing the libraries:

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn

## Importing the libraries:

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create dataframes from the .csv files and run the initial analyses:

## File 01, "olist_customers_dataset.csv":

### Create the dataframe and view the first rows.

In [12]:
df_customers = pd.read_csv('./dataset/olist_customers_dataset.csv', sep=",")
df_customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


### Check the dataframe's columns and data types:

In [6]:
df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


### Check for null values in the dataframe:

In [7]:
df_customers.isnull().sum()

customer_id                 0
customer_unique_id          0
customer_zip_code_prefix    0
customer_city               0
customer_state              0
dtype: int64

### Check for duplicate rows:

In [20]:
duplicate_value_customers = df_customers.duplicated()
print(f"Total duplicate rows: {duplicate_value_customers.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text columns of the dataframe:

In [14]:
df_customers.describe(include=['object'])

Unnamed: 0,customer_id,customer_unique_id,customer_city,customer_state
count,99441,99441,99441,99441
unique,99441,96096,4119,27
top,06b8999e2fba1a1fbc88172c00ba8bc7,8d50f5eadf50201ccdcedfb9e2ac8455,sao paulo,SP
freq,1,17,15540,41746


## File 02, "olist_geolocation_dataset.csv":

### Create the dataframe and view the first rows.

In [17]:
df_geolocation = pd.read_csv('./dataset/olist_geolocation_dataset.csv', sep=",")
df_geolocation.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


### Check the dataframe's columns and data types:

In [18]:
df_geolocation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB


### Check for null values in the dataframe:

In [19]:
df_geolocation.isnull().sum()

geolocation_zip_code_prefix    0
geolocation_lat                0
geolocation_lng                0
geolocation_city               0
geolocation_state              0
dtype: int64

### Check for duplicate rows:

In [21]:
duplicate_value_geolocation = df_geolocation.duplicated()
print(f"Total duplicate rows: {duplicate_value_geolocation.sum()}")

Total duplicate rows: 261831


*** HANDLE DUPLICATE ROWS.
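
One possible way to handle the duplicates flagged above is `DataFrame.drop_duplicates`, which keeps the first occurrence of each fully identical row. The sketch below uses a small made-up frame with the same column names as the geolocation file. The second step, collapsing to one row per zip prefix with mean coordinates, is an assumption about what a later analysis might want, not something the notebook does.

```python
import pandas as pd

# Toy rows with the geolocation column names (coordinates are made up).
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046, 1046],
    "geolocation_lat": [-23.5456, -23.5461, -23.5461, -23.5470],
    "geolocation_lng": [-46.6393, -46.6448, -46.6448, -46.6430],
    "geolocation_city": ["sao paulo"] * 4,
    "geolocation_state": ["SP"] * 4,
})

# Drop rows that are identical in every column, keeping the first one.
geo_dedup = geo.drop_duplicates().reset_index(drop=True)

# Optional (assumption): one row per zip prefix with mean coordinates.
geo_by_zip = (
    geo_dedup
    .groupby("geolocation_zip_code_prefix", as_index=False)
    .agg(
        geolocation_lat=("geolocation_lat", "mean"),
        geolocation_lng=("geolocation_lng", "mean"),
    )
)

print(len(geo), len(geo_dedup), len(geo_by_zip))
```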

### Check summary statistics for the text columns of the dataframe:

In [29]:
df_geolocation.describe(include=['object'])

Unnamed: 0,geolocation_city,geolocation_state
count,1000163,1000163
unique,8011,27
top,sao paulo,SP
freq,135800,404268


## File 03, "olist_order_items_dataset.csv":

### Create the dataframe and view the first rows.

In [23]:
df_order_items = pd.read_csv('./dataset/olist_order_items_dataset.csv', sep=",")
df_order_items.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


### Check the dataframe's columns and data types:

In [24]:
df_order_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  object 
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  object 
 3   seller_id            112650 non-null  object 
 4   shipping_limit_date  112650 non-null  object 
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 6.0+ MB


*** HANDLE COLUMN shipping_limit_date, OBJECT to DATETIME.
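
A minimal sketch of that conversion with `pd.to_datetime`. It uses two rows copied from the `head()` output above rather than the full file, and the explicit `format` string assumes every value follows the `YYYY-MM-DD HH:MM:SS` pattern seen there.

```python
import pandas as pd

# Two rows taken from the head() output; the real frame has 112650 rows.
items = pd.DataFrame({
    "order_id": [
        "00010242fe8c5a6d1ba2dd792cb16214",
        "00018f77f2f0320c557190d7a144bdd3",
    ],
    "shipping_limit_date": ["2017-09-19 09:45:35", "2017-05-03 11:05:13"],
})

# Parse the text column into datetime64 values.
items["shipping_limit_date"] = pd.to_datetime(
    items["shipping_limit_date"], format="%Y-%m-%d %H:%M:%S"
)

print(items.dtypes)
```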

### Check for null values in the dataframe:

In [25]:
df_order_items.isnull().sum()

order_id               0
order_item_id          0
product_id             0
seller_id              0
shipping_limit_date    0
price                  0
freight_value          0
dtype: int64


### Check for duplicate rows:

In [26]:
duplicate_value_order_items = df_order_items.duplicated()
print(f"Total duplicate rows: {duplicate_value_order_items.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text and numeric columns of the dataframe:

In [27]:
df_order_items.describe()

Unnamed: 0,order_item_id,price,freight_value
count,112650.0,112650.0,112650.0
mean,1.197834,120.653739,19.99032
std,0.705124,183.633928,15.806405
min,1.0,0.85,0.0
25%,1.0,39.9,13.08
50%,1.0,74.99,16.26
75%,1.0,134.9,21.15
max,21.0,6735.0,409.68


In [28]:
df_order_items.describe(include=['object'])

Unnamed: 0,order_id,product_id,seller_id,shipping_limit_date
count,112650,112650,112650,112650
unique,98666,32951,3095,93318
top,8272b63d03f5f79c56e9e4120aec44ef,aca2eb7d00ea1a7b8ebd4e68314663af,6560211a19b47992c3666cc44a7e94c0,2017-07-21 18:25:23
freq,21,527,2033,21


## File 04, "olist_order_payments_dataset.csv":

### Create the dataframe and view the first rows.

In [30]:
df_order_payments = pd.read_csv('./dataset/olist_order_payments_dataset.csv', sep=",")
df_order_payments.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


### Check the dataframe's columns and data types:

In [31]:
df_order_payments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


### Check for null values in the dataframe:

In [32]:
df_order_payments.isnull().sum()

order_id                0
payment_sequential      0
payment_type            0
payment_installments    0
payment_value           0
dtype: int64

### Check for duplicate rows:

In [None]:
duplicate_value_order_payments = df_order_payments.duplicated()
print(f"Total duplicate rows: {duplicate_value_order_payments.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text and numeric columns of the dataframe:

In [34]:
df_order_payments.describe()

Unnamed: 0,payment_sequential,payment_installments,payment_value
count,103886.0,103886.0,103886.0
mean,1.092679,2.853349,154.10038
std,0.706584,2.687051,217.494064
min,1.0,0.0,0.0
25%,1.0,1.0,56.79
50%,1.0,1.0,100.0
75%,1.0,4.0,171.8375
max,29.0,24.0,13664.08


In [35]:
df_order_payments.describe(include=['object'])

Unnamed: 0,order_id,payment_type
count,103886,103886
unique,99440,5
top,fa65dad1b0e818e3ccc5cb0e39231352,credit_card
freq,29,76795


## File 05, "olist_order_reviews_dataset.csv":

### Create the dataframe and view the first rows.

In [36]:
df_order_reviews = pd.read_csv('./dataset/olist_order_reviews_dataset.csv', sep=",")
df_order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53


*** HANDLE COLUMNS review_comment_title and review_comment_message, NaN VALUES.

### Check the dataframe's columns and data types:

In [38]:
df_order_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     11568 non-null  object
 4   review_comment_message   40977 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB


*** HANDLE COLUMNS review_creation_date and review_answer_timestamp, OBJECT to DATETIME.
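
An alternative to converting after loading is to let `pd.read_csv` parse these columns directly through its `parse_dates` argument. The sketch below reads a tiny in-memory CSV built from two rows of the `head()` output so that it runs without the dataset on disk. The derived column `answer_delay` is a hypothetical name added only to show what the datetime type makes possible.

```python
import io

import pandas as pd

# In-memory stand-in for the reviews file (two rows from the head() output).
csv_text = (
    "review_id,review_creation_date,review_answer_timestamp\n"
    "7bc2406110b926393aa56f80a40eba40,2018-01-18 00:00:00,2018-01-18 21:46:59\n"
    "80e641a11e56f04c1ad469d5645fdfde,2018-03-10 00:00:00,2018-03-11 03:05:13\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
reviews = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["review_creation_date", "review_answer_timestamp"],
)

# Hypothetical derived column: time between review creation and answer.
reviews["answer_delay"] = (
    reviews["review_answer_timestamp"] - reviews["review_creation_date"]
)

print(reviews.dtypes)
```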

### Check for null values in the dataframe:

In [39]:
df_order_reviews.isnull().sum()

review_id                      0
order_id                       0
review_score                   0
review_comment_title       87656
review_comment_message     58247
review_creation_date           0
review_answer_timestamp        0
dtype: int64

*** HANDLE COLUMNS review_comment_title and review_comment_message, NULL VALUES.
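
Because most reviews carry no title or message, dropping those rows would discard most of the scores. One hedged option is to record whether a comment exists and then fill the gaps with an empty string. The flag column `has_comment` and the empty-string placeholder are assumptions, and the three rows below are toy data built from strings that appear in the outputs above.

```python
import pandas as pd

# Toy rows; None marks a missing title or message.
reviews = pd.DataFrame({
    "review_score": [4, 5, 5],
    "review_comment_title": [None, None, "Recomendo"],
    "review_comment_message": [
        None,
        "Recebi bem antes do prazo estipulado.",
        "Muito bom",
    ],
})

text_cols = ["review_comment_title", "review_comment_message"]

# Keep the information "no comment was written" before filling the gaps.
reviews["has_comment"] = reviews["review_comment_message"].notna()

# Replace missing text with an empty string so string operations work.
reviews[text_cols] = reviews[text_cols].fillna("")

print(reviews)
```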

### Check for duplicate rows:

In [40]:
duplicate_value_order_reviews = df_order_reviews.duplicated()
print(f"Total duplicate rows: {duplicate_value_order_reviews.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text and numeric columns of the dataframe:

In [41]:
df_order_reviews.describe(include=['object'])

Unnamed: 0,review_id,order_id,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
count,99224,99224,11568,40977,99224,99224
unique,98410,98673,4527,36159,636,98248
top,7b606b0d57b078384f0b58eac1d41d78,c88b1d1b157a9999ce368f218a407141,Recomendo,Muito bom,2017-12-19 00:00:00,2017-06-15 23:21:05
freq,3,3,423,230,463,4


In [42]:
df_order_reviews.describe()

Unnamed: 0,review_score
count,99224.0
mean,4.086421
std,1.347579
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


## File 06, "olist_orders_dataset.csv":

### Create the dataframe and view the first rows.

In [44]:
df_orders = pd.read_csv('./dataset/olist_orders_dataset.csv', sep=",")
df_orders.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


### Check the dataframe's columns and data types:

In [45]:
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB


*** HANDLE COLUMNS order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, order_estimated_delivery_date, OBJECT TO DATETIME.
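
Several columns need the same conversion here, so it can be applied to all of them in one step. The sketch below is an illustration on toy data: the first row comes from the `head()` output, the second (a canceled order with missing dates) is made up. `errors="coerce"` turns unparseable or missing values into `NaT`, and `delivery_days` is a hypothetical derived column.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_status": ["delivered", "canceled"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-01-05 08:00:00"],
    "order_approved_at": ["2017-10-02 11:07:15", None],
    "order_delivered_carrier_date": ["2017-10-04 19:55:00", None],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
    "order_estimated_delivery_date": ["2017-10-18 00:00:00", "2018-01-20 00:00:00"],
})

# Every column except the status holds a timestamp stored as text.
date_cols = [c for c in orders.columns if c != "order_status"]

# Convert column by column; missing values become NaT.
orders[date_cols] = orders[date_cols].apply(pd.to_datetime, errors="coerce")

# Hypothetical derived column: whole days from purchase to delivery.
orders["delivery_days"] = (
    orders["order_delivered_customer_date"] - orders["order_purchase_timestamp"]
).dt.days

print(orders.dtypes)
```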

### Check for null values in the dataframe:

In [46]:
df_orders.isnull().sum()

order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

*** HANDLE COLUMNS order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, NULL VALUES.
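
Before filling or dropping these nulls, it may help to check whether they are tied to `order_status`, since an order that was never delivered would naturally lack a delivery date. That link is an assumption to verify on the real data, not a finding of this notebook. The sketch below counts missing delivery dates per status on made-up rows.

```python
import pandas as pd

# Toy rows (made up) mixing delivered and undelivered orders.
orders = pd.DataFrame({
    "order_status": ["delivered", "delivered", "shipped", "canceled"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13",
        "2018-08-07 15:27:45",
        None,
        None,
    ],
})

# Count missing delivery dates within each order status.
missing_by_status = (
    orders["order_delivered_customer_date"]
    .isna()
    .groupby(orders["order_status"])
    .sum()
)

print(missing_by_status)
```

If the nulls on the real data concentrate in statuses other than `delivered`, leaving them as missing is likely more honest than imputing a date.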

### Check for duplicate rows:

In [47]:
duplicate_value_orders = df_orders.duplicated()
print(f"Total duplicate rows: {duplicate_value_orders.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text columns of the dataframe:

In [48]:
df_orders.describe(include=['object'])

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
count,99441,99441,99441,99441,99281,97658,96476,99441
unique,99441,99441,8,98875,90733,81018,95664,459
top,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2018-04-11 10:48:14,2018-02-27 04:31:10,2018-05-09 15:48:00,2018-05-08 23:38:46,2017-12-20 00:00:00
freq,1,1,96478,3,9,47,3,522


## File 07, "olist_products_dataset.csv":

### Create the dataframe and view the first rows.

In [49]:
df_products = pd.read_csv('./dataset/olist_products_dataset.csv', sep=",")
df_products.head()

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0


### Check the dataframe's columns and data types:

In [50]:
df_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB


### Check for null values in the dataframe:

In [51]:
df_products.isnull().sum()

product_id                      0
product_category_name         610
product_name_lenght           610
product_description_lenght    610
product_photos_qty            610
product_weight_g                2
product_length_cm               2
product_height_cm               2
product_width_cm                2
dtype: int64

*** HANDLE COLUMNS product_category_name, product_name_lenght, product_description_lenght, product_photos_qty, NULL VALUES.
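
The four columns show the same null count (610), which suggests, but does not prove, that the same rows are missing all four values. The sketch below shows one way to test that on a toy frame and then label the missing category. The placeholder `"unknown"` is an assumption, and only three of the four columns are included to keep the example short.

```python
import pandas as pd

# Toy rows (made-up ids); the second product lacks all descriptive fields.
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", None, "artes"],
    "product_name_lenght": [40.0, None, 44.0],
    "product_photos_qty": [1.0, None, 1.0],
})

meta_cols = ["product_category_name", "product_name_lenght", "product_photos_qty"]

# If "any missing" and "all missing" select the same rows,
# the nulls always occur together.
any_missing = products[meta_cols].isna().any(axis=1)
all_missing = products[meta_cols].isna().all(axis=1)
nulls_co_occur = any_missing.equals(all_missing)

# Label the missing category; numeric gaps are left as NaN on purpose.
products["product_category_name"] = (
    products["product_category_name"].fillna("unknown")
)

print(nulls_co_occur)
print(products)
```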

### Check for duplicate rows:

In [52]:
duplicate_value_products = df_products.duplicated()
print(f"Total duplicate rows: {duplicate_value_products.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text and numeric columns of the dataframe:

In [53]:
df_products.describe(include=['object'])

Unnamed: 0,product_id,product_category_name
count,32951,32341
unique,32951,73
top,1e9e8ef04dbcff4541ed26657ea517e5,cama_mesa_banho
freq,1,3029


In [55]:
df_products.describe().round(1)

Unnamed: 0,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
count,32341.0,32341.0,32341.0,32949.0,32949.0,32949.0,32949.0
mean,48.5,771.5,2.2,2276.5,30.8,16.9,23.2
std,10.2,635.1,1.7,4282.0,16.9,13.6,12.1
min,5.0,4.0,1.0,0.0,7.0,2.0,6.0
25%,42.0,339.0,1.0,300.0,18.0,8.0,15.0
50%,51.0,595.0,1.0,700.0,25.0,13.0,20.0
75%,57.0,972.0,3.0,1900.0,38.0,21.0,30.0
max,76.0,3992.0,20.0,40425.0,105.0,105.0,118.0



## File 08, "olist_sellers_dataset.csv":

### Create the dataframe and view the first rows.

In [56]:
df_sellers = pd.read_csv('./dataset/olist_sellers_dataset.csv', sep=",")
df_sellers.head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### Check the dataframe's columns and data types:

In [57]:
df_sellers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3095 entries, 0 to 3094
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   seller_id               3095 non-null   object
 1   seller_zip_code_prefix  3095 non-null   int64 
 2   seller_city             3095 non-null   object
 3   seller_state            3095 non-null   object
dtypes: int64(1), object(3)
memory usage: 96.8+ KB


### Check for null values in the dataframe:

In [58]:
df_sellers.isnull().sum()

seller_id                 0
seller_zip_code_prefix    0
seller_city               0
seller_state              0
dtype: int64

### Check for duplicate rows:

In [59]:
duplicate_value_sellers = df_sellers.duplicated()
print(f"Total duplicate rows: {duplicate_value_sellers.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text columns of the dataframe:

In [60]:
df_sellers.describe(include=['object'])

Unnamed: 0,seller_id,seller_city,seller_state
count,3095,3095,3095
unique,3095,611,23
top,3442f8959a84dea7ee197c632cb2df15,sao paulo,SP
freq,1,694,1849



## File 09, "product_category_name_translation.csv":

### Create the dataframe and view the first rows.

In [61]:
df_product_category_name_translation = pd.read_csv('./dataset/product_category_name_translation.csv', sep=",")
df_product_category_name_translation.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


### Check the dataframe's columns and data types:

In [62]:
df_product_category_name_translation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 2 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   product_category_name          71 non-null     object
 1   product_category_name_english  71 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


### Check for null values in the dataframe:

In [63]:
df_product_category_name_translation.isnull().sum()

product_category_name            0
product_category_name_english    0
dtype: int64

### Check for duplicate rows:

In [64]:
duplicate_value_product_category_name_translation = df_product_category_name_translation.duplicated()
print(f"Total duplicate rows: {duplicate_value_product_category_name_translation.sum()}")

Total duplicate rows: 0


### Check summary statistics for the text columns of the dataframe:

In [65]:
df_product_category_name_translation.describe(include=['object'])

Unnamed: 0,product_category_name,product_category_name_english
count,71,71
unique,71,71
top,beleza_saude,health_beauty
freq,1,1
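
The translation table lists 71 categories, while the products file reports 73 unique category names, so some product categories may have no English name. A hedged sketch of how to find them with a left merge and `indicator=True` follows; the frames are toy data, and `categoria_sem_traducao` is an invented category used only to trigger the mismatch.

```python
import pandas as pd

# Toy products; the last category is invented and has no translation.
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": [
        "beleza_saude",
        "automotivo",
        "categoria_sem_traducao",
    ],
})

# Two rows taken from the translation head() output.
translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "automotivo"],
    "product_category_name_english": ["health_beauty", "auto"],
})

# indicator=True adds a _merge column telling where each row was found.
merged = products.merge(
    translation, on="product_category_name", how="left", indicator=True
)

# Categories present in products but absent from the translation table.
mask = merged["_merge"] == "left_only"
untranslated = sorted(set(merged.loc[mask, "product_category_name"]))

print(untranslated)
```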
