**Query the customer_number from the orders table for the customer who has placed the largest number of orders.**

It is guaranteed that exactly one customer will have placed more orders than any other customer.

The **orders** table is defined as follows:

| Column | Type |
|-------------------|-----------|
| order_number (PK) | int |
| customer_number | int |
| order_date | date |
| required_date | date |
| shipped_date | date |
| status | char(15) |
| comment | char(200) |

Sample **Input**:

| order_number | customer_number | order_date | required_date | shipped_date | status | comment |
|--------------|-----------------|------------|---------------|--------------|--------|---------|
| 1 | 1 | 2017-04-09 | 2017-04-13 | 2017-04-12 | Closed | |
| 2 | 2 | 2017-04-15 | 2017-04-20 | 2017-04-18 | Closed | |
| 3 | 3 | 2017-04-16 | 2017-04-25 | 2017-04-20 | Closed | |
| 4 | 3 | 2017-04-18 | 2017-04-28 | 2017-04-25 | Closed | |

Sample **Output**:

| customer_number |
|-----------------|
| 3 |

Explanation:
- Customer '3' has two orders, more than customers '1' and '2', who have one order each. So the result is customer_number '3'.

**Follow up**: What if more than one customer has the largest number of orders? Can you find all such customer_number values in this case?
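
Before turning to Spark, the expected result for both the main question and the follow-up can be pinned down with a minimal plain-Python sketch. This is not the notebook's method: `top_customers` is a hypothetical helper introduced only for illustration, and the tie case uses one made-up extra order for customer 1.

```python
from collections import Counter


def top_customers(customer_numbers):
    """Return every customer_number tied for the highest order count, sorted ascending."""
    counts = Counter(customer_numbers)  # orders per customer
    if not counts:
        return []
    best = max(counts.values())  # the largest number of orders
    return sorted(c for c, n in counts.items() if n == best)


# customer_number column of the sample input (orders 1 through 4)
sample = [1, 2, 3, 3]
print(top_customers(sample))        # [3]

# Hypothetical tie: one extra order for customer 1 (not part of the sample data)
print(top_customers(sample + [1]))  # [1, 3]
```

With a single winner the list has one element, which matches the sample output; with a tie it contains every tied customer, which is what the follow-up asks for.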

In [0]:
from pyspark.sql.functions import col, count
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, DateType

schema = StructType([
  StructField("order_number", IntegerType(), True),
  StructField("customer_number", IntegerType(), True),
  StructField("order_date", StringType(), True),
  StructField("required_date", StringType(), True),
  StructField("shipped_date", StringType(), True),
  StructField("status", StringType(), True),
  StructField("comment", StringType(), True)
])

data = [(1, 1, "2017-04-09", "2017-04-13", "2017-04-12", "Closed", None),
        (2, 2, "2017-04-15", "2017-04-20", "2017-04-18", "Closed", None),
        (3, 3, "2017-04-16", "2017-04-25", "2017-04-20", "Closed", None),
        (4, 3, "2017-04-18", "2017-04-28", "2017-04-25", "Closed", None)]

df = spark.createDataFrame(data, schema)

df = (df
      .withColumn("order_date", col("order_date").cast(DateType()))
      .withColumn("required_date", col("required_date").cast(DateType()))
      .withColumn("shipped_date", col("shipped_date").cast(DateType()))
      )

display(df)

order_number,customer_number,order_date,required_date,shipped_date,status,comment
1,1,2017-04-09,2017-04-13,2017-04-12,Closed,
2,2,2017-04-15,2017-04-20,2017-04-18,Closed,
3,3,2017-04-16,2017-04-25,2017-04-20,Closed,
4,3,2017-04-18,2017-04-28,2017-04-25,Closed,


If exactly one customer has placed more orders than any other customer:

In [0]:
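# Count the orders placed by each customer.
df_order_cnt = df.groupBy("customer_number").agg(count("*").alias("cnt"))

# Sort by count, descending, and keep the top row; this relies on the guarantee of a single winner.
df_top_order = df_order_cnt.orderBy(col("cnt").desc()).select("customer_number").limit(1)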

display(df_top_order)

customer_number
3


If more than one customer has the largest number of orders:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

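# No partitionBy: Spark ranks all rows in a single partition, which is fine for a small aggregate.
windowSpec = Window.orderBy(col("cnt").desc())
df_top_orders = (df_order_cnt
                 # dense_rank gives every customer tied for the highest count a rank of 1.
                 .withColumn("rank", F.dense_rank().over(windowSpec))
                 .filter(col("rank") == 1)
                 .select("customer_number")
                 )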

df_top_orders.display()

customer_number
3
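
Since the task is phrased as a query, the same tie-aware logic can also be sketched in SQL. The example below is an assumption-laden cross-check rather than part of the notebook: it swaps Spark for Python's built-in `sqlite3` module, keeps only the two columns the query needs, and the helper `top_customers_sql` and the extra order for customer 1 are hypothetical.

```python
import sqlite3

# Keep every customer whose order count equals the maximum count.
TOP_CUSTOMERS_SQL = """
SELECT customer_number
FROM orders
GROUP BY customer_number
HAVING COUNT(*) = (
    SELECT MAX(cnt)
    FROM (SELECT COUNT(*) AS cnt FROM orders GROUP BY customer_number) AS t
)
ORDER BY customer_number
"""


def top_customers_sql(rows):
    """Load (order_number, customer_number) rows into in-memory SQLite and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (order_number INTEGER PRIMARY KEY, customer_number INTEGER)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return [row[0] for row in conn.execute(TOP_CUSTOMERS_SQL)]
    finally:
        conn.close()


# (order_number, customer_number) pairs from the sample input
sample_rows = [(1, 1), (2, 2), (3, 3), (4, 3)]
print(top_customers_sql(sample_rows))             # [3]

# Hypothetical tie: a fifth order placed by customer 1
print(top_customers_sql(sample_rows + [(5, 1)]))  # [1, 3]
```

Comparing each group's count against the maximum count avoids a window function altogether, at the cost of aggregating the table twice.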
