
#### General Instructions <br /><br />

- All relevant data sets are pre-loaded in the Jupyter Notebook environment.
- The data sets for the challenge will be provided when you start the assessment. Load them as required.
- All files are in CSV format, and none of them have headers.
  - CSV files: categories.csv, customers.csv, departments.csv, orders.csv, order_items.csv, products.csv

#### Question 1

**Fetch the top 10 categories with highest percentage of 'pending orders' in the year 2014**

- Datasets to be used to solve this task are
  - order_items.csv, orders.csv, products.csv, categories.csv
- An order is considered pending if its order status is either 'PENDING' or 'PENDING_PAYMENT' (in orders.csv)
- Columns to be fetched are: **category, total_orders, pending_orders, percentage_pending_orders**
- Round the **percentage_pending_orders** to one decimal place
- Sort the data in the DESCENDING order of **percentage_pending_orders**
- The output should have 10 rows, excluding the header
- Save the output as a single CSV file with header in **question1** directory
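
The counting rule above is easy to get wrong once orders are joined to order items, because one order then appears on several rows. A minimal plain-Python sketch on made-up rows, assuming both `total_orders` and `pending_orders` count distinct orders per category; the rows and the `pending_percentages` helper are hypothetical and only illustrate the arithmetic, not the PySpark solution.

```python
from collections import defaultdict

PENDING_STATUSES = {"PENDING", "PENDING_PAYMENT"}

# Hypothetical joined rows, one per order item: (category, order_id, order_status)
rows = [
    ("Cleats", 1, "PENDING"),
    ("Cleats", 1, "PENDING"),          # second item of the same order
    ("Cleats", 2, "COMPLETE"),
    ("Cleats", 3, "PENDING_PAYMENT"),
    ("Golf Balls", 4, "CLOSED"),
    ("Golf Balls", 5, "PENDING"),
]


def pending_percentages(item_rows, top_n=10):
    """Return (category, total_orders, pending_orders, percentage) tuples."""
    all_orders = defaultdict(set)
    pending = defaultdict(set)
    for category, order_id, status in item_rows:
        # Sets de-duplicate order ids that repeat across item rows
        all_orders[category].add(order_id)
        if status in PENDING_STATUSES:
            pending[category].add(order_id)

    result = []
    for category, order_ids in all_orders.items():
        total = len(order_ids)
        n_pending = len(pending[category])
        result.append((category, total, n_pending, round(n_pending / total * 100, 1)))

    # Highest percentage first, then keep the top rows
    result.sort(key=lambda r: r[3], reverse=True)
    return result[:top_n]


summary = pending_percentages(rows)
```

Counting item rows instead of distinct orders would give 3 pending "orders" for Cleats here, which is why the sketch collects order ids in sets.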

In [58]:
#Write your import
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType,StringType, DoubleType, DateType
from pyspark.sql.functions import col,year,when, countDistinct, sum, round, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top 10 categories with highest percentage_pending_orders").getOrCreate()

# File paths 
categories_file = "categories.csv"
customers_file = "customers.csv"
departments_file = "departments.csv"
order_items_file = "order_items.csv"
orders_file = "orders.csv"
products_file = "products.csv"

#Write your code

#Defining schema
orderSchema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("order_status", StringType(), True)
])

orderItemSchema = StructType([
    StructField("order_item_id", IntegerType(), True),
    StructField("order_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("total_price", DoubleType(), True),
    StructField("unit_price", DoubleType(), True),
])

productSchema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("category_id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("type", StringType(), True),
    StructField("unit_price", DoubleType(), True)
])

categoriesSchema = StructType([
    StructField("category_id", IntegerType(), True),
    StructField("department_id", IntegerType(), True),
    StructField("category", StringType(), True)
])

customerSchema =  StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("address", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("zip", StringType(), True),
])

departmentSchema = StructType([
    StructField("department_id", IntegerType(), True),
    StructField("department", StringType(), True)
])

#Reading the csv files
orders = spark.read.csv(orders_file,schema = orderSchema)
order_items = spark.read.csv(order_items_file,schema = orderItemSchema)
products = spark.read.csv(products_file,schema = productSchema)
categories = spark.read.csv(categories_file,schema = categoriesSchema)

orders_2014 = orders.filter(year(col('order_date'))==2014)

# Join the 2014 orders to their items, products and categories
df_joined = orders_2014.join(order_items,'order_id').join(products,'product_id').join(categories,'category_id')
                                 
# Count distinct orders, not order-item rows, so pending_orders is comparable to total_orders
is_pending = col('order_status').isin('PENDING', 'PENDING_PAYMENT')
result_1 = df_joined.groupBy('category') \
    .agg(countDistinct('order_id').alias('total_orders'),
         countDistinct(when(is_pending, col('order_id'))).alias('pending_orders')) \
    .withColumn('percentage_pending_orders',
                round(col('pending_orders') / col('total_orders') * 100, 1)) \
    .orderBy(col('percentage_pending_orders').desc()) \
    .limit(10)

# result_1.show()

#saving the output 
result_1.coalesce(1).write.mode('overwrite').option('header','true').csv('question1')



#### Question 2

**Fetch two products with lowest 'total_sale_value' in 'Basketball', 'Football' and 'Soccer' categories in 'Sports' department in the year 2014. Consider orders with status as ‘COMPLETE’ only.** <br/><br/>

- Datasets to be used to solve this task are
  - order_items.csv, orders.csv, products.csv, departments.csv, categories.csv
- Analyze and understand the data in all the datasets as per the ER diagram given above
- ***total_sale_value*** is computed as the ***sum of total_price of all items*** in ***order_items*** dataset. 
- Consider only orders with ***COMPLETE*** as status in orders dataset.
- Round the *total_sale_value* to nearest integer value. 
- Columns to be fetched are: **year, category, product, total_quantity, total_sale_value**
- Sort the data in the ASCENDING order of category and ASCENDING order of total_sale_value.
- The output should have 6 rows, excluding the header
- Save the output as a single CSV file with header in **question2** directory

In [63]:
#Write your code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType,StringType, DoubleType, DateType
from pyspark.sql.functions import col,year,when, countDistinct, sum, round, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("products with lowest total_sale_value").getOrCreate()

# Dataset to be used order_items.csv, orders.csv, products.csv, departments.csv, categories.csv
orderSchema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("order_status", StringType(), True)
])

orderItemSchema = StructType([
    StructField("order_item_id", IntegerType(), True),
    StructField("order_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("total_price", DoubleType(), True),
    StructField("unit_price", DoubleType(), True),
])

productSchema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("category_id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("type", StringType(), True),
    StructField("unit_price", DoubleType(), True)
])

categoriesSchema = StructType([
    StructField("category_id", IntegerType(), True),
    StructField("department_id", IntegerType(), True),
    StructField("category", StringType(), True)
])

departmentSchema = StructType([
    StructField("department_id", IntegerType(), True),
    StructField("department", StringType(), True)
])

#Reading the csv files
orders = spark.read.csv(orders_file,schema = orderSchema)
order_items = spark.read.csv(order_items_file,schema = orderItemSchema)
products = spark.read.csv(products_file,schema = productSchema)
categories = spark.read.csv(categories_file,schema = categoriesSchema)
departments = spark.read.csv(departments_file,schema = departmentSchema)

# Filtering the order with completed status

completed_orders = orders.filter((year(col('order_date')) == 2014) & (col('order_status') == "COMPLETE"))
# completed_orders.show()

df_joined = completed_orders.join(order_items, 'order_id') \
    .join(products,'product_id') \
    .join(categories, 'category_id') \
    .join(departments, 'department_id')


df_filter = df_joined.filter((col('department') =='Sports') & 
                  (col('category').isin('Basketball', 'Football', 'Soccer')))
agg_df = df_filter.groupBy(year(col('order_date')).alias('year'),
                             col('category'),
                             col('product')
                          ).agg(sum('quantity').alias('total_quantity'),
                                     round(sum('total_price')).alias('total_sale_value'))
windSpec = Window.partitionBy('category').orderBy(col('total_sale_value').asc())

result_2 = agg_df.withColumn('rn', row_number().over(windSpec)).filter(col('rn') <=2).drop('rn')\
.orderBy(col('category').asc(), col('total_sale_value').asc())

result_2.coalesce(1).write.mode('overwrite').option('header','true').csv('question2')

                                                                                

#### Question 3

**Find top two worst performing states which recorded highest percentage of drop in 'total sales' in the year 2014 compared to year 2013 in 'Sports', 'Footwear', 'Fitness' and 'Golf' departments. Consider orders with status as ‘COMPLETE’ only.** <br /><br />

- Datasets to be used to solve this task are
  - order_items.csv, orders.csv, products.csv, departments.csv, categories.csv, customers.csv
- Analyze and understand the data in all the datasets as per the ER diagram given above 
- **total_sales** is computed as **sum of total_price** in **order_items** dataset for that product.
- Consider only COMPLETED orders (order_status in orders should be COMPLETE only)
- Fetch the data related to the top 2 states which recorded the maximum percentage drop in total sale value in 2014 compared to 2013 in the specified departments.
- Discard the data if there are no sales in either 2013 or 2014 in any state for any department
- Round the **total_sales** values to nearest integer value
- Round the **drop%** to two decimal places
- Columns to be fetched are: **department, state, 2014_sales, 2013_sales, drop%**
- The output should have 8 rows, excluding the header
- Save the output as a single CSV file with header in **question3** directory
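
The bullets above leave the drop% formula and the grouping implicit. A minimal plain-Python sketch on made-up rows, assuming drop% = (2013_sales - 2014_sales) / 2013_sales * 100 and that "top 2 states" means two states per department (which would match the 8 expected rows for 4 departments). The rows, the state codes and the `worst_states` helper are hypothetical; this illustrates the arithmetic only and is not the PySpark solution.

```python
from collections import defaultdict

# Hypothetical pre-aggregated rows: (department, state, year, total_sales)
sales = [
    ("Golf", "CA", 2013, 1000.0), ("Golf", "CA", 2014, 800.0),
    ("Golf", "TX", 2013, 500.0),  ("Golf", "TX", 2014, 250.0),
    ("Golf", "NY", 2013, 400.0),  ("Golf", "NY", 2014, 380.0),
    ("Golf", "WA", 2013, 300.0),  # no 2014 sales, so this pair is discarded
]


def worst_states(rows, top_n=2):
    """Return (department, state, 2014_sales, 2013_sales, drop%) tuples."""
    # Pivot the yearly totals so each (department, state) holds both years
    by_key = defaultdict(dict)
    for department, state, yr, total in rows:
        by_key[(department, state)][yr] = total

    per_department = defaultdict(list)
    for (department, state), years in by_key.items():
        s13, s14 = years.get(2013), years.get(2014)
        if not s13 or not s14:
            continue  # discard pairs with no sales in either year
        drop = round((s13 - s14) / s13 * 100, 2)  # assumed formula
        per_department[department].append(
            (department, state, round(s14), round(s13), drop)
        )

    # Keep the top_n largest drops within each department
    out = []
    for department in sorted(per_department):
        ranked = sorted(per_department[department], key=lambda r: r[4], reverse=True)
        out.extend(ranked[:top_n])
    return out


worst = worst_states(sales)
```

In PySpark the same shape would typically come from pivoting on year and ranking within a window partitioned by department, but that mapping is left to the solution cell.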


In [None]:
#Write your code





#### Question 4

**Find out all the products that recorded more than 30% drop in total number of units sold (based on ‘quantity’ column in ‘order_items’ dataset) in year 2014 compared to year 2013 in 'Sports' and 'Fitness' departments. Consider orders with status as ‘COMPLETE’ only.** <br /><br />

- Datasets to be used to solve this task are
  - order_items.csv, orders.csv, products.csv, departments.csv, categories.csv
- Quantity of each product is computed as the **sum of quantity** in the order_items dataset for that product
- Columns to be fetched are: **department, product, 2013_qty, 2014_qty, drop%**
    - 2013_qty: total quantity of the product in the year 2013
    - 2014_qty: total quantity of the product in the year 2014
    - drop%: drop in 2014 over 2013
- Sort the data in the **ASCENDING order of department** and in the **DESCENDING order of drop%**.
- Round the drop% to nearest integer value.
- The output should have 4 rows, excluding the header
- Save the output as a single CSV file with header in **question4** directory
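
The filter and sort rules above can be sketched on toy numbers. This plain-Python example assumes drop% = (2013_qty - 2014_qty) / 2013_qty * 100 and that the 30% threshold is applied before rounding; the product names, quantities and the `big_drops` helper are hypothetical, and this is not the PySpark solution.

```python
# Hypothetical per-product totals: (department, product, 2013_qty, 2014_qty)
totals = [
    ("Fitness", "Yoga Mat", 200, 120),     # 40% drop
    ("Fitness", "Dumbbell", 100, 90),      # 10% drop, below the threshold
    ("Sports", "Soccer Ball", 300, 150),   # 50% drop
    ("Sports", "Helmet", 80, 100),         # quantity increased
    ("Sports", "Racket", 50, 30),          # 40% drop
]


def big_drops(rows, threshold=30.0):
    """Return (department, product, 2013_qty, 2014_qty, drop%) tuples."""
    out = []
    for department, product, q13, q14 in rows:
        if q13 <= 0:
            continue  # a drop relative to zero units is undefined
        drop = (q13 - q14) / q13 * 100  # assumed formula
        if drop > threshold:
            out.append((department, product, q13, q14, round(drop)))
    # Department ascending, then drop% descending
    out.sort(key=lambda r: (r[0], -r[4]))
    return out


dropped = big_drops(totals)
```

Python's built-in `round` rounds exact halves to the nearest even number, while Spark's `round` rounds them away from zero, so results can differ on values that end in exactly .5.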

In [None]:
#Write your code