![ecommerce_analytics-1224x532](ecommerce_analytics-1224x532.jpg)


You are a Data Engineer at Voltmart, an electronics e-commerce company. A peer Machine Learning team has asked you to clean last year's order data, which they plan to use to build a demand forecasting model. They have shared their requirements for the format of the output table.

An analyst shared a parquet file called `"orders_data.parquet"` for you to clean and preprocess. 

You can see the dataset schema below along with the **cleaning requirements**:

## `orders_data.parquet`

| column | data type | description | cleaning requirements | 
|--------|-----------|-------------|-----------------------|
| `order_date` | `timestamp` | Date and time when the order was made | _Modify: Remove orders placed between 12am and 5am (inclusive); convert from timestamp to date_ |
| `time_of_day` | `string` | Period of the day when the order was made | _New column containing (lower bound inclusive, upper bound exclusive): "morning" for orders placed 5am-12pm, "afternoon" for orders placed 12pm-6pm, and "evening" for orders placed 6pm-12am_ |
| `order_id` | `long` | Order ID | _N/A_ |
| `product` | `string` | Name of a product ordered | _Remove rows containing "TV" as the company has stopped selling this product; ensure all values are lowercase_ |
| `product_ean` | `double` | Product ID | _N/A_ |
| `category` | `string` | Broader category of a product | _Ensure all values are lowercase_ |
| `purchase_address` | `string` | Address line where the order was made ("House Street, City, State Zipcode") | _N/A_ |
| `purchase_state` | `string` | US State of the purchase address | _New column containing: the State that the purchase was ordered from_ |
| `quantity_ordered` | `long` | Number of product units ordered | _N/A_ |
| `price_each` | `double` | Price of a product unit | _N/A_ |
| `cost_price` | `double` | Cost of production per product unit | _N/A_ |
| `turnover` | `double` | Total amount paid for a product (quantity x price) | _N/A_ |
| `margin` | `double` | Profit made by selling a product (turnover - cost) | _N/A_ |

<br>

In [1]:
from pyspark.sql import (
    SparkSession,
    types,
    functions as F,
)

spark = (
    SparkSession
    .builder
    .appName('cleaning_orders_dataset_with_pyspark')
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/07/23 16:11:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
orders_data = spark.read.parquet('orders_data.parquet')
orders_data.toPandas().head()

                                                                                

|   | order_date | order_id | product | product_id | category | purchase_address | quantity_ordered | price_each | cost_price | turnover | margin |
|---|------------|----------|---------|------------|----------|------------------|------------------|------------|------------|----------|--------|
| 0 | 2023-01-22 21:25:00 | 141234 | iPhone | 5638009000000.0 | Vêtements | 944 Walnut St, Boston, MA 02215 | 1 | 700.0 | 231.0 | 700.0 | 469.0 |
| 1 | 2023-01-28 14:15:00 | 141235 | Lightning Charging Cable | 5563320000000.0 | Alimentation | 185 Maple St, Portland, OR 97035 | 1 | 14.95 | 7.475 | 14.95 | 7.475 |
| 2 | 2023-01-17 13:33:00 | 141236 | Wired Headphones | 2113973000000.0 | Vêtements | 538 Adams St, San Francisco, CA 94016 | 2 | 11.99 | 5.995 | 23.98 | 11.99 |
| 3 | 2023-01-05 20:33:00 | 141237 | 27in FHD Monitor | 3069157000000.0 | Sports | 738 10th St, Los Angeles, CA 90001 | 1 | 149.99 | 97.4935 | 149.99 | 52.4965 |
| 4 | 2023-01-25 11:59:00 | 141238 | Wired Headphones | 9692681000000.0 | Électronique | 387 10th St, Austin, TX 73301 | 1 | 11.99 | 5.995 | 11.99 | 5.995 |


In [3]:
# Start here, using as many cells as you require
# order_date
# Modify: Remove orders placed between 12am and 5am (inclusive); convert from timestamp to date

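# Temporarily store the order hour (0-23) in `time_of_day`; it is replaced by a label later
clean_order_date = orders_data.withColumn('time_of_day', F.hour('order_date'))
# Drop night orders: hours 0 through 5 are removed, hour 6 onwards is kept
clean_order_date = clean_order_date.filter(clean_order_date.time_of_day > 5)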
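
The filter above keeps rows whose hour is greater than 5, so it reads "12am to 5am (inclusive)" as covering every order stamped 00:00 through 05:59. The sketch below mirrors that rule in plain Python so the boundary hours can be checked without a Spark session. `is_kept` and `NIGHT_LAST_HOUR` are hypothetical names introduced here for illustration, and the sample timestamps are made up apart from the last one, which comes from the preview.

```python
from datetime import datetime

# Assumption: "inclusive" means the whole 05:xx hour counts as night.
NIGHT_LAST_HOUR = 5


def is_kept(order_ts: datetime) -> bool:
    """Plain-Python mirror of the Spark filter `F.hour('order_date') > 5`."""
    return order_ts.hour > NIGHT_LAST_HOUR


samples = [
    datetime(2023, 1, 1, 0, 0),     # midnight, dropped
    datetime(2023, 1, 1, 4, 59),    # dropped
    datetime(2023, 1, 1, 5, 30),    # dropped under this reading of "inclusive"
    datetime(2023, 1, 1, 6, 0),     # first hour that is kept
    datetime(2023, 1, 22, 21, 25),  # evening order from the preview, kept
]

kept = [ts for ts in samples if is_kept(ts)]
print([ts.strftime("%H:%M") for ts in kept])  # ['06:00', '21:25']
```

Note that the `time_of_day` requirement describes "morning" as starting at 5 with an inclusive lower bound, so the two rules disagree about the 05:xx hour. If the ML team intends orders from 5:00 onwards to survive, the comparison would need to be `>= 5` instead; that is worth confirming with them.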

In [4]:
clean_order_date.select('time_of_day').show()

+-----------+
|time_of_day|
+-----------+
|         21|
|         14|
|         13|
|         20|
|         11|
|         20|
|         12|
|         12|
|         10|
|         21|
|         11|
|         10|
|         18|
|         19|
|         21|
|         17|
|         11|
|          8|
|          9|
|         11|
+-----------+
only showing top 20 rows



In [4]:
# time_of_day
# New column containing (lower bound inclusive, upper bound exclusive): "morning" for orders placed 5am-12pm, "afternoon" for orders placed 12pm-6pm, and "evening" for 6pm-12am

# `time_of_day` still holds the hour here; night hours were already filtered out
clean_order_date = clean_order_date.withColumn(
    'time_of_day',
    F.when(clean_order_date.time_of_day < 12, "morning")
    .when(clean_order_date.time_of_day < 18, "afternoon")
    .otherwise("evening"),
)
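
The chained `F.when(...)` expression is evaluated top to bottom, and the first matching branch wins. A minimal plain-Python sketch of the same bucketing makes the boundaries easy to test. The function name `time_of_day_label` is hypothetical and is not part of the notebook.

```python
def time_of_day_label(hour: int) -> str:
    """Bucket an hour (0-23) the same way the chained when/otherwise does."""
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Check the hours on either side of each boundary.
for hour in (6, 11, 12, 17, 18, 23):
    print(hour, time_of_day_label(hour))
```

The first branch only tests `hour < 12`, so it relies on the night hours having been removed beforehand. Applied to an unfiltered hour such as 3, this logic would return "morning".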

In [6]:
clean_order_date.toPandas().head()

|   | order_date | order_id | product | product_id | category | purchase_address | quantity_ordered | price_each | cost_price | turnover | margin | time_of_day |
|---|------------|----------|---------|------------|----------|------------------|------------------|------------|------------|----------|--------|-------------|
| 0 | 2023-01-22 21:25:00 | 141234 | iPhone | 5638009000000.0 | Vêtements | 944 Walnut St, Boston, MA 02215 | 1 | 700.0 | 231.0 | 700.0 | 469.0 | evening |
| 1 | 2023-01-28 14:15:00 | 141235 | Lightning Charging Cable | 5563320000000.0 | Alimentation | 185 Maple St, Portland, OR 97035 | 1 | 14.95 | 7.475 | 14.95 | 7.475 | afternoon |
| 2 | 2023-01-17 13:33:00 | 141236 | Wired Headphones | 2113973000000.0 | Vêtements | 538 Adams St, San Francisco, CA 94016 | 2 | 11.99 | 5.995 | 23.98 | 11.99 | afternoon |
| 3 | 2023-01-05 20:33:00 | 141237 | 27in FHD Monitor | 3069157000000.0 | Sports | 738 10th St, Los Angeles, CA 90001 | 1 | 149.99 | 97.4935 | 149.99 | 52.4965 | evening |
| 4 | 2023-01-25 11:59:00 | 141238 | Wired Headphones | 9692681000000.0 | Électronique | 387 10th St, Austin, TX 73301 | 1 | 11.99 | 5.995 | 11.99 | 5.995 | morning |


In [5]:
# Convert order_date from timestamp to date; the hour is already captured in time_of_day
clean_order_date = clean_order_date.withColumn("order_date", clean_order_date["order_date"].cast(types.DateType()))

In [8]:
clean_order_date.toPandas().head()

|   | order_date | order_id | product | product_id | category | purchase_address | quantity_ordered | price_each | cost_price | turnover | margin | time_of_day |
|---|------------|----------|---------|------------|----------|------------------|------------------|------------|------------|----------|--------|-------------|
| 0 | 2023-01-22 | 141234 | iPhone | 5638009000000.0 | Vêtements | 944 Walnut St, Boston, MA 02215 | 1 | 700.0 | 231.0 | 700.0 | 469.0 | evening |
| 1 | 2023-01-28 | 141235 | Lightning Charging Cable | 5563320000000.0 | Alimentation | 185 Maple St, Portland, OR 97035 | 1 | 14.95 | 7.475 | 14.95 | 7.475 | afternoon |
| 2 | 2023-01-17 | 141236 | Wired Headphones | 2113973000000.0 | Vêtements | 538 Adams St, San Francisco, CA 94016 | 2 | 11.99 | 5.995 | 23.98 | 11.99 | afternoon |
| 3 | 2023-01-05 | 141237 | 27in FHD Monitor | 3069157000000.0 | Sports | 738 10th St, Los Angeles, CA 90001 | 1 | 149.99 | 97.4935 | 149.99 | 52.4965 | evening |
| 4 | 2023-01-25 | 141238 | Wired Headphones | 9692681000000.0 | Électronique | 387 10th St, Austin, TX 73301 | 1 | 11.99 | 5.995 | 11.99 | 5.995 | morning |


In [6]:
# product
# Remove rows containing "TV" as the company has stopped selling this product; ensure all values are lowercase

# Filter on "TV" before lowercasing: contains() is a case-sensitive substring match
clean_product = clean_order_date.filter(~clean_order_date.product.contains("TV"))
clean_product = clean_product.withColumn("product", F.lower("product"))
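
The order of the two steps matters, because the "TV" match is case-sensitive. The plain-Python sketch below illustrates this with a small hand-written list. The product names are illustrative sample values (only some appear in the preview above), and both helper functions are hypothetical.

```python
def clean_products(products):
    """Drop names containing "TV" (case-sensitive), then lowercase the rest."""
    return [name.lower() for name in products if "TV" not in name]


def clean_products_wrong_order(products):
    """Lowercase first, then filter on "TV": the filter no longer matches."""
    lowered = [name.lower() for name in products]
    return [name for name in lowered if "TV" not in name]


# Illustrative sample; "Flatscreen TV" is an assumed product name.
sample = ["iPhone", "Flatscreen TV", "27in FHD Monitor", "Wired Headphones"]

print(clean_products(sample))
print(clean_products_wrong_order(sample))
```

With the steps reversed, "flatscreen tv" would slip through. Keeping the filter first, as the cell does, avoids that. A substring match would also drop any other product whose name happens to contain "TV", which may or may not be desired.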

In [17]:
clean_product.select("product").show()

+--------------------+
|             product|
+--------------------+
|              iphone|
|lightning chargin...|
|    wired headphones|
|    27in fhd monitor|
|    wired headphones|
|aaa batteries (4-...|
|27in 4k gaming mo...|
|usb-c charging cable|
|bose soundsport h...|
|apple airpods hea...|
|apple airpods hea...|
|  macbook pro laptop|
|aaa batteries (4-...|
|    27in fhd monitor|
|    27in fhd monitor|
|     vareebadd phone|
|apple airpods hea...|
|usb-c charging cable|
|aaa batteries (4-...|
|usb-c charging cable|
+--------------------+
only showing top 20 rows



In [7]:
# category
# Ensure all values are lowercase
clean_category = clean_product.withColumn("category", F.lower("category"))

In [19]:
clean_category.select("category").show()

+------------+
|    category|
+------------+
|   vêtements|
|alimentation|
|   vêtements|
|      sports|
|électronique|
|alimentation|
|   vêtements|
|   vêtements|
|électronique|
|électronique|
|   vêtements|
|   vêtements|
|   vêtements|
|   vêtements|
|alimentation|
|alimentation|
|alimentation|
|      sports|
|électronique|
|alimentation|
+------------+
only showing top 20 rows



In [9]:
# purchase_state
# New column containing: the State that the purchase was ordered from

# The state is the second-to-last space-separated token ("House Street, City, State Zipcode")
clean_purchase_state = clean_category.withColumn(
    "purchase_state",
    F.element_at(F.split("purchase_address", " "), -2),
)
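
The cell above takes the second-to-last space-separated token of the address. Because it counts from the end, city names with spaces (such as "San Francisco") do not shift the result. The sketch below checks that token rule in plain Python against addresses from the preview and compares it with a stricter regex alternative. The regex assumes a two-letter uppercase state followed by a five-digit zip code, which the source only describes as "State Zipcode". Both helper names are hypothetical.

```python
import re
from typing import Optional

# Assumption: addresses end with ", <2-letter state> <5-digit zip>".
STATE_ZIP = re.compile(r",\s*([A-Z]{2})\s+\d{5}$")


def state_by_token(address: str) -> str:
    """Mirror of the notebook logic: second-to-last space-separated token."""
    return address.split(" ")[-2]


def state_by_regex(address: str) -> Optional[str]:
    """Stricter alternative: returns None when the address has another shape."""
    match = STATE_ZIP.search(address)
    return match.group(1) if match else None


addresses = [
    "944 Walnut St, Boston, MA 02215",
    "538 Adams St, San Francisco, CA 94016",
    "387 10th St, Austin, TX 73301",
]

for address in addresses:
    print(state_by_token(address), state_by_regex(address))
```

The token rule silently returns whatever sits in that position, while the regex returns `None` for malformed input. In Spark, a comparable stricter check could be built with `F.regexp_extract`, at the cost of a slightly less readable expression.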

In [11]:
clean_purchase_state.select("purchase_state").show()

+--------------+
|purchase_state|
+--------------+
|            MA|
|            OR|
|            CA|
|            CA|
|            TX|
|            CA|
|            CA|
|            CA|
|            CA|
|            CA|
|            CA|
|            CA|
|            CA|
|            CA|
|            OR|
|            CA|
|            MA|
|            CA|
|            WA|
|            CA|
+--------------+
only showing top 20 rows



In [12]:
n_states = clean_purchase_state.select("purchase_state").distinct().count()

In [None]:
# Overwrite any previous export so the cell can be re-run safely
clean_purchase_state.write.parquet('orders_data_clean.parquet', mode='overwrite')
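
Before handing the file over, it may help to compare the output columns against the requirements table. The sketch below does this with two hand-copied lists: `required` follows the schema table, and `actual` follows the column headers in the previews above plus `purchase_state`. In a live session the second list could come from `clean_purchase_state.columns` instead. `compare_columns` is a hypothetical helper.

```python
def compare_columns(required, actual):
    """Return (missing, unexpected) column names, preserving input order."""
    missing = [name for name in required if name not in actual]
    unexpected = [name for name in actual if name not in required]
    return missing, unexpected


# Copied from the requirements table.
required = [
    "order_date", "time_of_day", "order_id", "product", "product_ean",
    "category", "purchase_address", "purchase_state", "quantity_ordered",
    "price_each", "cost_price", "turnover", "margin",
]

# Copied from the previewed output, plus the derived purchase_state column.
actual = [
    "order_date", "order_id", "product", "product_id", "category",
    "purchase_address", "quantity_ordered", "price_each", "cost_price",
    "turnover", "margin", "time_of_day", "purchase_state",
]

missing, unexpected = compare_columns(required, actual)
print("missing:", missing)        # missing: ['product_ean']
print("unexpected:", unexpected)  # unexpected: ['product_id']
```

The check surfaces one naming mismatch: the requirements list `product_ean`, while the data carries `product_id`. Whether the column should be renamed is a question for the ML team, since the notebook itself does not say.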