Project 1 - File Format Converter Handout
Overview


The objective of this project is to develop a solution based on the design provided. In this case,
the source data was exported from a MySQL database as CSV files. To improve the efficiency of our
data engineering pipelines, we need to convert these CSV files into JSON files, because JSON records
carry their field names and are easier for downstream applications to consume than headerless CSV files.
The scope of this project is limited to converting the CSV files into JSON files.
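As a minimal sketch of what the conversion means for a single file, the snippet below turns headerless CSV rows into JSON records using only the standard library. The column names and rows are hypothetical sample values, not taken from the project data, and the notebook itself uses pandas rather than this approach.

```python
import csv
import io
import json

# Hypothetical column names and rows, for illustration only.
column_names = ['department_id', 'department_name']
csv_text = "2,Fitness\n3,Footwear\n"

# Pair each headerless CSV row with the column names to build one record per row.
reader = csv.reader(io.StringIO(csv_text))
records = [dict(zip(column_names, row)) for row in reader]

# Serialize the list of records as a JSON array.
json_text = json.dumps(records)
print(json_text)
```

Note that the `csv` module does not infer types, so every value here stays a string; pandas, used in the rest of this handout, infers numeric columns on read.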

In [123]:
import glob

In [124]:
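# Match every data file one level below each table directory
# (recursive=True is dropped because it only affects '**' patterns)
src_files = glob.glob('data/retail_db/*/*')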

In [125]:
import pandas as pd

In [126]:
for file_name in src_files:
    # The source files have no header row
    df = pd.read_csv(file_name, header=None)
    print(f'Shape of the file {file_name} is {df.shape}')

Shape of the file data/retail_db/customers/part-00000 is (12435, 9)
Shape of the file data/retail_db/products/part-00000 is (1345, 6)
Shape of the file data/retail_db/departments/part-00000 is (6, 2)
Shape of the file data/retail_db/order_items/part-00000 is (172198, 6)
Shape of the file data/retail_db/orders/part-00000 is (68883, 4)
Shape of the file data/retail_db/categories/part-00000 is (58, 3)


In [127]:
# Return the column names of a table, optionally ordered by sortKey
def get_column_names(schemas, schemaName, sortKey=None):
    columns = schemas.get(schemaName)
    if sortKey is not None:
        # Order the column definitions by the requested key
        columns = sorted(columns, key=lambda col: col.get(sortKey))
    return [col['column_name'] for col in columns]
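To illustrate the effect of `sortKey`, here is a small sketch that calls the function on a hand-written schema. The function is restated so the block runs on its own. The schema contents and the `column_position` key are assumptions made for illustration; the source only confirms that each column definition has a `column_name` entry.

```python
def get_column_names(schemas, schemaName, sortKey=None):
    columns = schemas.get(schemaName)
    if sortKey is not None:
        columns = sorted(columns, key=lambda col: col.get(sortKey))
    return [col['column_name'] for col in columns]

# Hypothetical schema entry, deliberately listed out of order.
# 'column_position' is an assumed key, not confirmed by the source.
sample_schemas = {
    'departments': [
        {'column_name': 'department_name', 'column_position': 2},
        {'column_name': 'department_id', 'column_position': 1},
    ]
}

# Without sortKey, names come back in the order they are stored.
unsorted_names = get_column_names(sample_schemas, 'departments')
# With sortKey, names follow the value of that key.
sorted_names = get_column_names(sample_schemas, 'departments', sortKey='column_position')
print(unsorted_names)
print(sorted_names)
```

If the schema file does not guarantee that columns are stored in file order, passing a sort key like this keeps the names aligned with the CSV columns.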

In [128]:
import json

In [129]:
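# Use a context manager so the schema file is closed after loading
with open('./data/retail_db/schemas.json') as schema_file:
    schemas = json.load(schema_file)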

In [130]:
order_columns = get_column_names(schemas,'orders')
order_columns

['order_id', 'order_date', 'order_customer_id', 'order_status']

In [131]:
orders = pd.read_csv('./data/retail_db/orders/part-00000', names=order_columns)
src_files[0].split('/')

['data', 'retail_db', 'customers', 'part-00000']
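The split above shows that the second-to-last path component is the table name. Splitting on `'/'` assumes POSIX-style separators, so a possible alternative is to take the parent directory name with `pathlib`. The helper name `collection_name_from_path` is hypothetical and is not part of the notebook.

```python
from pathlib import Path

def collection_name_from_path(file_name):
    # Hypothetical helper: the parent directory of the data file
    # is treated as the table (collection) name.
    return Path(file_name).parent.name

name = collection_name_from_path('data/retail_db/customers/part-00000')
print(name)
```

This avoids depending on the separator character that `glob` returns on a given operating system.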

In [132]:
import os


In [134]:

for file_name in src_files:
    # The parent directory name is the table (collection) name
    collection_name = file_name.split('/')[-2]
    os.makedirs(f'data/retail_db_json/{collection_name}', exist_ok=True)
    print(f'Processing file : {file_name}')
    column_names = get_column_names(schemas, collection_name)
    # Read the headerless CSV, assigning column names from the schema
    collection = pd.read_csv(file_name, names=column_names)
    # Write all rows as a single JSON array of objects
    collection.to_json(f'data/retail_db_json/{collection_name}/part-00000', orient='records')

Processing file : data/retail_db/customers/part-00000
Processing file : data/retail_db/products/part-00000
Processing file : data/retail_db/departments/part-00000
Processing file : data/retail_db/order_items/part-00000
Processing file : data/retail_db/orders/part-00000
Processing file : data/retail_db/categories/part-00000
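One way to sanity-check the conversion is to read a converted file back and compare it with its source. The sketch below does this on a tiny temporary file rather than the real data. The two sample rows are illustrative values in the same headerless layout, and the `.json` file name is an assumption made here for clarity.

```python
import json
import os
import tempfile

import pandas as pd

# Illustrative sample rows in the headerless layout of the orders files.
sample_csv = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)
columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'part-00000')
    json_path = os.path.join(tmp_dir, 'part-00000.json')
    with open(csv_path, 'w') as f:
        f.write(sample_csv)

    # Same conversion steps as the loop above, applied to the sample file.
    source = pd.read_csv(csv_path, names=columns)
    source.to_json(json_path, orient='records')

    # Read the JSON back with the standard library to inspect the records.
    with open(json_path) as f:
        records = json.load(f)

# Every source row should appear as one record with the schema's field names.
row_count_matches = len(records) == len(source)
keys_match = all(set(record) == set(columns) for record in records)
print(row_count_matches, keys_match)
```

The same two checks could be run per table after the real conversion, comparing against the shapes printed earlier.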
