# Initial Data Analysis for database modeling
First, let's quickly analyze the provided data so we can identify weak design points and decide how to normalize it better.

In [1]:
import pandas as pd

In [2]:
df_seller = pd.read_csv('../datasets/olist_sellers_dataset.csv')
df_product = pd.read_csv('../datasets/olist_products_dataset.csv')
df_category_translation = pd.read_csv('../datasets/product_category_name_translation.csv')
df_customers = pd.read_csv('../datasets/olist_customers_dataset.csv')
df_geolocation = pd.read_csv('../datasets/olist_geolocation_dataset.csv')
df_order = pd.read_csv('../datasets/olist_orders_dataset.csv')
df_order_item = pd.read_csv('../datasets/olist_order_items_dataset.csv')
df_order_payment = pd.read_csv('../datasets/olist_order_payments_dataset.csv')
# The reviews file mixes quoted and unquoted string fields, which breaks the import.
# Read every column as a string for now so basic checks can still run.
df_order_review = pd.read_csv('../datasets/olist_order_reviews_dataset.csv', dtype=str, skipinitialspace=True, quotechar='"')


# Basic column redundancy check

In [3]:
df_seller.head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


**It seems odd to have the zip code, city name and state all in here. The geolocation dataset might give new insights, but these columns surely belong somewhere else**

In [4]:
df_product.head()

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0


**There are some things out of the norm here.**
- product_name_lenght and product_description_lenght depend on columns other than the primary key, which is not advisable; they might as well be generated at runtime
- product_category_name has high data redundancy and should be a FK, which will improve performance as well as data normalization
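
A minimal sketch of how the category column could be moved into a lookup table, using the five sample rows shown above. The names `df_product_sample`, `category` and `category_id` are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Values copied from the head() output above.
df_product_sample = pd.DataFrame({
    "product_id": [
        "1e9e8ef04dbcff4541ed26657ea517e5",
        "3aa071139cb16b67ca9e5dea641aaa2f",
        "96bd76ec8810374ed1b65e291975717f",
        "cef67bcfe19066a932b7673e239eb23d",
        "9dc1a7de274444849c219cff195d0b71",
    ],
    "product_category_name": [
        "perfumaria", "artes", "esporte_lazer", "bebes", "utilidades_domesticas",
    ],
})

# factorize() gives each distinct value an integer code, in order of appearance.
# Missing values get the code -1, which this sketch does not handle.
codes, uniques = pd.factorize(df_product_sample["product_category_name"])

# Hypothetical lookup table: one row per category, ids starting at 1.
category = pd.DataFrame({"id": range(1, len(uniques) + 1), "name": uniques})

# The product table keeps only the id that points at the lookup table.
product = (
    df_product_sample
    .drop(columns="product_category_name")
    .assign(category_id=codes + 1)
)
```

On the full dataset the same category appears on many products, so `category` would be much shorter than `product`.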

In [5]:
df_category_translation.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


**This could be greatly improved**
- Product category should be its own dataset, composed only of the category identifiers
- Introduce a new dataset for languages. This opens the application to more than a single translation
- The translation dataset would then hold:
    - the category id as a FK
    - the language id as a FK
    - the translation string for the given language. This allows internationalization into more than one language while staying normalized
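
The split described above could look roughly like this. It is only a sketch: the table names `category`, `language` and `translation`, the language code `"en"`, and the assumption that every `product_category_name` appears once in the translation file are all mine.

```python
import pandas as pd

# First rows of the translation file, copied from the output above.
df_tr = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios", "automotivo"],
    "product_category_name_english": ["health_beauty", "computers_accessories", "auto"],
})

# Category table: an id plus the original (Portuguese) name.
category = pd.DataFrame({
    "id": range(1, len(df_tr) + 1),
    "name": df_tr["product_category_name"],
})

# Language table: a single language for now, more rows can be added later.
language = pd.DataFrame({"id": [1], "code": ["en"]})

# Translation table: one row per (category, language) pair.
translation = pd.DataFrame({
    "category_id": category["id"],
    "language_id": 1,
    "text": df_tr["product_category_name_english"],
})
```

Supporting a second language then only means adding one row to `language` and the matching rows to `translation`; no column has to be added anywhere.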


In [6]:
df_customers.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


**Again zip code, city and state. This is asking for data normalization**
 
***TODO*** - check the difference between customer ID and customer unique ID. If they have a parent-child relation, this could be a candidate for normalization as well
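
One way the TODO could be approached is to count how many `customer_id` values share each `customer_unique_id`. The rows below are invented toy data, used only to show the shape of the check; they say nothing about what the real dataset contains.

```python
import pandas as pd

# Toy rows, invented for illustration (not taken from the dataset).
df_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

# Number of distinct customer_id values per customer_unique_id.
ids_per_unique = df_toy.groupby("customer_unique_id")["customer_id"].nunique()

# True if at least one unique id owns several customer ids,
# which would suggest a parent-child relation.
has_parent_child = bool((ids_per_unique > 1).any())
```

Running the same two lines against `df_customers` would answer the question for the real data.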

In [7]:
df_geolocation.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


**And again the city name and state. This is not good at all. I'll create new tables for city and state, and make them FKs here. Actually, I'll make only the city a FK here, and state will be a FK on city**

Also, the zip prefix is expected to repeat. If it does, we might create a zip_prefix table as well and reference it by FK. Let's check for repeated occurrences:


In [8]:
df_geolocation.geolocation_zip_code_prefix.value_counts(dropna=False)

24220    1146
24230    1102
38400     965
35500     907
11680     879
         ... 
73990       1
87307       1
72450       1
24877       1
38198       1
Name: geolocation_zip_code_prefix, Length: 19015, dtype: int64

**Yes, they do repeat. I'll create a table for it, put its id in geolocation as a FK, and move the city FK into the new zip_prefix table**
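
Moving the city FK into a zip_prefix table only works if each prefix maps to a single city. A quick sketch of that check, run here on the five sample rows shown earlier (the variable names are mine):

```python
import pandas as pd

# Values copied from the geolocation head() output above.
df_geo_sample = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1046, 1046, 1041, 1035],
    "geolocation_city": ["sao paulo"] * 5,
    "geolocation_state": ["SP"] * 5,
})

# Distinct city names seen for each zip prefix.
cities_per_prefix = (
    df_geo_sample
    .groupby("geolocation_zip_code_prefix")["geolocation_city"]
    .nunique()
)

# Prefixes that map to more than one city would need cleaning first.
ambiguous = cities_per_prefix[cities_per_prefix > 1]
```

On the full `df_geolocation`, a non-empty `ambiguous` would most likely point at spelling variants of the same city that need to be unified before the FK can be created.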

In [10]:
df_order.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00


**The order_status column seems to be a candidate for normalization. Let's check all unique values**

In [11]:
df_order.order_status.unique()

array(['delivered', 'invoiced', 'shipped', 'processing', 'unavailable',
       'canceled', 'created', 'approved'], dtype=object)

**Yep, for data integrity, performance, analysis and all, it is advisable to create a new table for order statuses**
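
To illustrate the data integrity point, here is a small sketch using an in-memory SQLite database. The choice of SQLite and the exact column definitions are assumptions for the example; the status values are the ones listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE order_status ("
    "id INTEGER PRIMARY KEY, description TEXT NOT NULL UNIQUE)"
)
# "order" is a reserved word in SQL, so the table name is quoted.
conn.execute(
    'CREATE TABLE "order" ('
    "id TEXT PRIMARY KEY, "
    "status_id INTEGER NOT NULL REFERENCES order_status (id))"
)

statuses = ["delivered", "invoiced", "shipped", "processing",
            "unavailable", "canceled", "created", "approved"]
conn.executemany(
    "INSERT INTO order_status (description) VALUES (?)",
    [(s,) for s in statuses],
)

# A valid status id is accepted (1 is 'delivered', the first row inserted).
conn.execute('INSERT INTO "order" VALUES (?, ?)',
             ("e481f51cbdc54678b7cc49136f2d6af7", 1))

# An unknown status id is rejected by the database itself.
try:
    conn.execute('INSERT INTO "order" VALUES (?, ?)', ("some-order", 99))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With a free-text status column, a misspelled status would be stored silently; with the lookup table, the insert fails instead.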

Let me also just double check if the customer foreign key points to customer id or customer unique id. I'll check using the value of the first row

In [33]:
df_customers[df_customers.customer_unique_id == '9ef432eb6251297304e76186b10a928d']

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state


In [32]:
df_customers[df_customers.customer_id == '9ef432eb6251297304e76186b10a928d']

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
70296,9ef432eb6251297304e76186b10a928d,7c396fd4830fd04220f754e42b4e5bff,3149,sao paulo,SP


**It uses the customer_id, as expected**
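
Checking a single row is a reasonable first look, but the same question can be asked of every order at once. A sketch using only the two rows shown above (the frame names `orders` and `customers` are mine):

```python
import pandas as pd

# First order row and its matching customer row, copied from the outputs above.
orders = pd.DataFrame({
    "order_id": ["e481f51cbdc54678b7cc49136f2d6af7"],
    "customer_id": ["9ef432eb6251297304e76186b10a928d"],
})
customers = pd.DataFrame({
    "customer_id": ["9ef432eb6251297304e76186b10a928d"],
    "customer_unique_id": ["7c396fd4830fd04220f754e42b4e5bff"],
})

# Share of orders whose customer_id is found in each candidate column.
share_in_customer_id = orders["customer_id"].isin(customers["customer_id"]).mean()
share_in_unique_id = orders["customer_id"].isin(customers["customer_unique_id"]).mean()
```

Applied to `df_order` and `df_customers`, a share of 1.0 for `customer_id` would confirm the foreign key holds for the whole dataset and not just the first row.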


In [12]:
df_order_item.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.0,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.9,18.14


**The order_item dataset seems ok**

In [13]:
df_order_payment.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


**It seems we have the same case here as with order status. The payment type is a candidate for a new table. Let's check the unique values**


In [14]:
df_order_payment.payment_type.unique()

array(['credit_card', 'boleto', 'voucher', 'debit_card', 'not_defined'],
      dtype=object)

**Yep, new table it is**

In [15]:
df_order_review.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18 00:00:00,2018-01-18 21:46:59,,,,,,
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10 00:00:00,2018-03-11 03:05:13,,,,,,
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17 00:00:00,2018-02-18 14:36:24,,,,,,
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21 00:00:00,2017-04-21 22:02:06,,,,,,
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela I...,2018-03-01 00:00:00,2018-03-02 10:26:53,,,,,,


**The order_review dataset seems ok in terms of normalization. However, it was badly extracted: some string fields are double-quoted and others are not quoted at all, so the CSV import breaks. This should be handled during import into the database. I expect it will require some arduous regex work, so initially I may just drop the rows with more than 7 columns so I can start coding, and treat this specific case later if the time frame allows.**
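
The "drop rows with more than 7 columns" fallback could be sketched with the standard `csv` module as below. The two data rows are invented to reproduce the problem (the second one has an unquoted comma inside the comment); only the header matches the real file.

```python
import csv
import io

EXPECTED_COLS = 7

# Invented sample text: the second data row contains an unquoted comma
# in the message, so it splits into 8 fields instead of 7.
raw = (
    "review_id,order_id,review_score,review_comment_title,"
    "review_comment_message,review_creation_date,review_answer_timestamp\n"
    "r1,o1,5,,Recebi bem antes do prazo estipulado.,"
    "2017-04-21 00:00:00,2017-04-21 22:02:06\n"
    "r2,o2,4,,bom, mas demorou,2018-01-18 00:00:00,2018-01-18 21:46:59\n"
)

reader = csv.reader(io.StringIO(raw), skipinitialspace=True)
header = next(reader)

good_rows, bad_rows = [], []
for row in reader:
    # Keep rows with the expected width, set the rest aside for later repair.
    if len(row) == EXPECTED_COLS:
        good_rows.append(row)
    else:
        bad_rows.append(row)
```

Keeping `bad_rows` instead of discarding them leaves the door open to repairing those rows later.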


--------------


# Proposed changes

Many of the datasets do not follow normalization standards. Normalization matters because it reduces redundancy and improves data integrity, not to mention the performance and data analysis possibilities it opens up.

Main change points:
- Data redundancy
- Dataset names in plural form should be singular
- Column names redundantly repeat the table name
- Columns that depend on columns other than the primary key within the same dataset should be split out, or even removed if they can be generated at runtime at low performance cost.

- Model enhancements
    - The string internationalization can be expanded to support more languages, and to serve other tables in the future
    
- Possible enhancements, which should only be made if the business would benefit from them. Otherwise, these columns should be kept denormalized. From general marketplace knowledge I do not think they are necessary, but I'll keep this here as a possibility:
    - The product weight and size column names specify the unit of measure. If more units come to be supported in the future, a new table for units (weight and size) should be created, and FKs to it should be placed in the product table.

Proposal for relational database model:

- Table: state
    - columns
        - id
        - name
        
        
- Table: city
    - columns
        - id
        - name
        - state_id  (FK to state table)
        
        
- Table: zip
    - columns
        - id
        - prefix
        - city_id  (FK to city table)
        
        
- Table: geolocation
    - columns
        - id
        - latitude
        - longitude
        - zip_id  (FK to zip table)
        
        
- Table: seller
    - columns
        - id
        - zip_id  (FK to zip table)
        
        
- Table: customer
    - columns
        - id
        - unique_id
        - zip_id  (FK to zip table)
        
        
- Table: category
    - columns
        - id
        - name
        
        
- Table: language
    - columns
        - id
        - code
        
        
- Table: internationalization
    - columns
        - id
        - reference_table
        - reference_id 
        - language_id  (FK to language table)
        - text
        
        
- Table: product
    - columns
        - id
        - category_id  (FK to category table)
        - weight_g
        - length_cm
        - height_cm
        - width_cm
        - photos_qty
        
        
- Table: order_status
    - columns
        - id
        - description
        
        
- Table: order
    - columns
        - id
        - customer_id  (FK to customer table)
        - status_id  (FK to order_status table)
        - purchase_timestamp
        - approved_at
        - delivered_carrier_date
        - delivered_customer_date
        - estimated_delivery_date
        
        
- Table: order_item
    - columns
        - id
        - order_id (FK to order table)
        - item_id
        - product_id  (FK to product table)
        - seller_id  (FK to seller table)
        - shipping_limit_date
        - price
        - freight_value
        
        
- Table: payment_type
    - columns
        - id
        - description
        
        
- Table: order_payment
    - columns
        - id
        - order_id  (FK to order table)
        - payment_type_id  (FK to payment_type table)
        - sequential
        - installments
        - value
        
        
- Table: order_review
    - columns
        - id
        - order_id  (FK to order table)
        - score
        - comment_title
        - comment_message
        - creation_date
        - answer_timestamp
