## Requirement

```
REQUIREMENT:
    hdfs://192.168.93.128:9000/input/e-commerce/data.csv
    Spark Session using SparkConf,
        use 4 executor cores, 
        max 4 executor cores,
        use Spark Cluster
    Write the result to hdfs 
        mostPopularMoviesDf.write.mode('overwrite')\
                          .csv("hdfs:....../output/top-movies.csv")

My Understanding:
    Read from HDFS

    spark-config
    4 executor cores
    4 max cores

    Write back to HDFS
```

In [1]:
import findspark
findspark.init()

In [2]:


"""
Since Spark 2.0, SparkSession is the unified entry point for the
DataFrame, Dataset, and SQL APIs.
SparkSession uses SparkContext internally.
"""

from pyspark.conf import SparkConf

config = SparkConf()
config.setMaster("spark://192.168.11.77:7077").setAppName("E-COMMERCE:CLUSTER")


<pyspark.conf.SparkConf at 0x1fb62b7e448>

In [3]:
"""
Configure before creating SparkSession
"""

conf = \
(
    config
    .set("spark.executor.memory", "2g")
    .set("spark.executor.cores", 4)
    .set("spark.cores.max", 4)
    .set("spark.driver.memory", "2g")
)
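
On a standalone cluster, `spark.cores.max` caps the total cores the application may take and `spark.executor.cores` sets the cores per executor, so together they bound how many executors can start. A minimal plain-Python sketch of that arithmetic follows. The helper name `max_executors` is hypothetical, and the sketch ignores worker memory and per-worker core limits, either of which can lower the number further.

```python
def max_executors(cores_max: int, executor_cores: int) -> int:
    """Upper bound on executors implied by the two core settings (assumed model)."""
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    # Each executor needs a full block of executor_cores out of the cores_max budget.
    return cores_max // executor_cores


# Values used in this notebook: spark.cores.max=4, spark.executor.cores=4
print(max_executors(4, 4))  # 1
```

With both settings at 4, the application gets at most one executor that owns all four cores.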


In [4]:
from pyspark.sql import SparkSession

ss = SparkSession.builder.config(conf=conf).getOrCreate()

In [5]:
ss

<br><br>

## Read e-commerce data

In [159]:

"""
Read CSV from HDFS
"""

import datetime as dt
from pyspark.sql.types import StructType, IntegerType, DoubleType, StringType, DateType
# Note: importing `sum` here shadows Python's built-in `sum` for the rest of the notebook.
from pyspark.sql.functions import col, asc, desc, count, sum, avg, to_date, to_timestamp

schema_ecomm = (
    StructType()
    .add("InvoiceNo", StringType(), True)
    .add("StockCode", StringType(), True)
    .add("Description", StringType(), True)
    .add("Quantity", IntegerType(), True)
    .add("InvoiceDate", DateType(), True)
    .add("UnitPrice", DoubleType(), True)
    .add("CustomerId", StringType(), True)
    .add("Country", StringType(), True)
)

df_ecomm_full = (
    ss.read
    .format("csv")
    .option("header", True)
    .option("dateFormat", "MM/dd/yyyy HH:mm")
    .schema(schema_ecomm)
    .load("hdfs://192.168.93.128:9000/input/e-commerce/data.csv")
)


"""
Keep only the columns needed for the analysis.
Alternative: .drop('column_name', 'column_name') to remove the unwanted ones.
"""
df_ecomm_full = df_ecomm_full.select("Country", "CustomerId", "Quantity", "UnitPrice")

In [139]:
df_ecomm_full.show(2)

+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6| 2010-12-01|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6| 2010-12-01|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
only showing top 2 rows



In [21]:
import pyspark
help(pyspark.sql.types)

Help on module pyspark.sql.types in pyspark.sql:

NAME
    pyspark.sql.types

CLASSES
    builtins.object
        DataType
            ArrayType


<br><br>

### Troubleshooting

In [140]:
# df_ecomm_full = (
#     ss.read
#     .format("csv")
#     .option("header", True)
#     .schema(schema_ecomm)
#     .load("hdfs://192.168.93.128:9000/input/e-commerce/data.csv")
# )
# df_ecomm_full.show(2)

# from pyspark.sql.functions import expr

# (
#     df_ecomm_full
#     .select(col("InvoiceDate"))
#     .withColumn(
#         "InvDate",
#         # Reference the column itself; a quoted 'InvoiceDate' is a string literal and parses to null
#         expr("to_timestamp(InvoiceDate, 'MM/dd/yyyy HH:mm')"))
#     .show(10)
# )

In [120]:
for s in df_ecomm_full.schema:
    print(s)

StructField(InvoiceNo,StringType,true)
StructField(StockCode,StringType,true)
StructField(Description,StringType,true)
StructField(Quantity,IntegerType,true)
StructField(InvoiceDate,TimestampType,true)
StructField(UnitPrice,DoubleType,true)
StructField(CustomerId,StringType,true)
StructField(Country,StringType,true)


In [121]:
df_ecomm_full.show(2)

+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|       null|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|       null|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
only showing top 2 rows
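
The `null` values in `InvoiceDate` are consistent with a parse pattern that does not match the raw text: the read in this section has no date or timestamp format option, and the earlier attempt used dashes where the working `dateFormat` uses slashes. As a plain-Python analogy (this is not Spark's parser), the sketch below returns `None` when the pattern and the text disagree. The sample value and the helper `parse_or_none` are assumptions for illustration, not taken from the file.

```python
from datetime import datetime
from typing import Optional


def parse_or_none(raw: str, pattern: str) -> Optional[datetime]:
    """Return the parsed datetime, or None when the pattern does not match."""
    try:
        return datetime.strptime(raw, pattern)
    except ValueError:
        return None


sample = "12/1/2010 8:26"  # assumed raw value, for illustration only

print(parse_or_none(sample, "%m/%d/%Y %H:%M"))  # 2010-12-01 08:26:00
print(parse_or_none(sample, "%m-%d-%Y %H:%M"))  # None
```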



<br><br>

## Orders by Country

In [142]:

count_country = df_ecomm_full.select("Country").distinct().count()
print(f"{count_country} Unique Countries")


38 Unique Countries


In [76]:
print("-- Total Orders by Countries --")

df_ecomm_country_ordertotal = \
(
    df_ecomm_full
    .groupby(col("Country"))
    .count().withColumnRenamed("count", "total_orders")
    .sort(desc("total_orders"))  # sort by the renamed column
)
df_ecomm_country_ordertotal.show(2)


-- Total Orders by Countries --
+--------------+------------+
|       Country|total_orders|
+--------------+------------+
|United Kingdom|      495478|
|       Germany|        9495|
+--------------+------------+
only showing top 2 rows
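
`groupby("Country").count()` counts rows, and one invoice can span several rows (both sample rows shown earlier share `InvoiceNo` 536365), so `total_orders` is closer to a line-item count than to a count of invoices. A plain-Python sketch of the difference follows. Only the first two rows mirror the sample output; the remaining rows are made up.

```python
from collections import Counter, defaultdict

# (InvoiceNo, Country) pairs
rows = [
    ("536365", "United Kingdom"),
    ("536365", "United Kingdom"),
    ("536366", "United Kingdom"),  # hypothetical row
    ("536370", "France"),          # hypothetical row
]

# Row count per country: what groupby(...).count() computes.
rows_per_country = Counter(country for _, country in rows)

# Distinct invoices per country: one order counted once.
invoice_sets = defaultdict(set)
for invoice_no, country in rows:
    invoice_sets[country].add(invoice_no)
invoices_per_country = {c: len(s) for c, s in invoice_sets.items()}

print(rows_per_country.most_common())
print(invoices_per_country)
```

In PySpark the distinct variant would typically use `countDistinct("InvoiceNo")` from `pyspark.sql.functions`, which requires keeping `InvoiceNo` in the DataFrame.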



<br><br>


## Customers by highest orders

In [157]:
df_ecomm_full.show(2)

+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|InvoiceDate|UnitPrice|CustomerId|       Country|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6| 2010-12-01|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6| 2010-12-01|     3.39|     17850|United Kingdom|
+---------+---------+--------------------+--------+-----------+---------+----------+--------------+
only showing top 2 rows



In [224]:
df_ecomm_customer_ordertotal = \
(
    df_ecomm_full
    # SQL-style alternative: .filter("CustomerId IS NOT NULL AND Quantity > 0 AND UnitPrice > 0")
    .filter(col("CustomerId").isNotNull() & (col("Quantity") > 0) & (col("UnitPrice") > 0))
    .withColumn("total_order", col("Quantity") * col("UnitPrice"))
    .groupby("CustomerId")
    .agg(sum("total_order").alias("total_order"))
    .sort(desc("total_order"))
)

df_ecomm_customer_ordertotal.count(), df_ecomm_customer_ordertotal.show(2)

+----------+------------------+
|CustomerId|       total_order|
+----------+------------------+
|     14646| 280206.0199999998|
|     18102|259657.29999999993|
+----------+------------------+
only showing top 2 rows



(4338, None)
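
The trailing digits in `280206.0199999998` are binary floating-point rounding noise from adding `DoubleType` values, not extra precision in the data. A small plain-Python sketch of the effect, with two common ways to tidy it (rounding for display, or decimal arithmetic); the 0.1 and 0.2 values are made up.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)           # False
print(round(float_total, 2))        # 0.3

# Decimal arithmetic keeps exact decimal fractions.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))  # True

# Rounding the reported total to two decimal places for display.
print(round(280206.0199999998, 2))  # 280206.02
```

On the Spark side, comparable options would be rounding the aggregated column or declaring `UnitPrice` as `DecimalType` in the schema.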

<br><br>

## Write to HDFS

In [217]:
(
    df_ecomm_country_ordertotal
    .coalesce(1)
    .write.mode('overwrite')
    .option("header", True)
    .csv("hdfs://192.168.93.128:9000/output/e-commerce/country_totalorders")
)

(
    df_ecomm_customer_ordertotal
    .coalesce(1)
    .write.mode('overwrite')
    .option("header", True)
    .csv("hdfs://192.168.93.128:9000/output/e-commerce/customer_totalorders")
)


<br><br>

### How many partitions did I coalesce?

In [188]:
df_ecomm_country_ordertotal.rdd.getNumPartitions(), df_ecomm_customer_ordertotal.rdd.getNumPartitions()

(38, 200)
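
So `coalesce(1)` merged 38 and 200 partitions respectively into one before each write; the 200 matches the default value of `spark.sql.shuffle.partitions`. As a simplified plain-Python model (not Spark's actual implementation; `coalesce_partitions` is a hypothetical helper), coalescing can be pictured as concatenating whole partitions into fewer buckets without redistributing individual rows.

```python
from typing import List, Sequence


def coalesce_partitions(partitions: Sequence[Sequence[int]], n: int) -> List[List[int]]:
    """Merge whole partitions into at most n buckets, keeping their order."""
    if n <= 0:
        raise ValueError("n must be positive")
    n = min(n, len(partitions))  # this model never increases the partition count
    merged: List[List[int]] = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        # Map contiguous runs of input partitions onto the same output bucket.
        merged[index * n // len(partitions)].extend(part)
    return merged


country_parts = [[i] for i in range(38)]  # 38 partitions of one row each (made up)
single = coalesce_partitions(country_parts, 1)
print(len(single), len(single[0]))  # 1 38
```

Funnelling everything through one partition is convenient for small result sets like these, but it removes write parallelism, so it is usually reserved for outputs that fit comfortably on one executor.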

In [None]:
# ss.stop()

In [220]:
df_ecomm = df_ecomm_full[["Country", "CustomerId", "Quantity", "UnitPrice"]]

In [221]:
df_ecomm.schema

StructType(List(StructField(Country,StringType,true),StructField(CustomerId,StringType,true),StructField(Quantity,IntegerType,true),StructField(UnitPrice,DoubleType,true)))

In [226]:
"""
EXPLAIN PLAN
"""

df_ecomm.explain(True)

== Parsed Logical Plan ==
'Project [unresolvedalias('Country, None), unresolvedalias('CustomerId, None), unresolvedalias('Quantity, None), unresolvedalias('UnitPrice, None)]
+- Relation[InvoiceNo#2230,StockCode#2231,Description#2232,Quantity#2233,InvoiceDate#2234,UnitPrice#2235,CustomerId#2236,Country#2237] csv

== Analyzed Logical Plan ==
Country: string, CustomerId: string, Quantity: int, UnitPrice: double
Project [Country#2237, CustomerId#2236, Quantity#2233, UnitPrice#2235]
+- Relation[InvoiceNo#2230,StockCode#2231,Description#2232,Quantity#2233,InvoiceDate#2234,UnitPrice#2235,CustomerId#2236,Country#2237] csv

== Optimized Logical Plan ==
Project [Country#2237, CustomerId#2236, Quantity#2233, UnitPrice#2235]
+- Relation[InvoiceNo#2230,StockCode#2231,Description#2232,Quantity#2233,InvoiceDate#2234,UnitPrice#2235,CustomerId#2236,Country#2237] csv

== Physical Plan ==
*(1) Project [Country#2237, CustomerId#2236, Quantity#2233, UnitPrice#2235]
+- *(1) FileScan csv [Quantity#2233,UnitP