## Problem 02

Weightage: 25

Get the number of phones associated with each member. **If a member has no phones, the phone count should be zero**.
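
The zero-count rule can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming each record is a dictionary shaped like the input JSON. The helper name `phone_count` and the sample records are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence


def phone_count(phone_numbers: Optional[Sequence[str]]) -> int:
    """Return the number of phones, treating a missing (None) list as zero."""
    if phone_numbers is None:
        return 0
    return len(phone_numbers)


# Hypothetical sample records (not taken from /public/addresses).
records = [
    {"id": 1, "phone_numbers": ["555-0100"]},
    {"id": 2, "phone_numbers": []},
    {"id": 3, "phone_numbers": None},
    {"id": 4},  # key absent altogether
]

# r.get(...) returns None when the key is absent, so both cases map to 0.
counts = {r["id"]: phone_count(r.get("phone_numbers")) for r in records}
print(counts)
```

An empty list, an explicit null, and an absent field all yield a count of zero here, which is the behaviour the problem statement asks for.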

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements

* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem02/solution
```
* Use text file format with | (pipe) as the delimiter to save the output. The output should be saved in 2 files and compressed using gzip.
* The files should contain a header with the column names.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- id: long
 |-- first_name: string
 |-- last_name: string
 |-- email: string
 |-- phone_count: long
```
* Data should be sorted in ascending order by id.
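
To make the required file layout concrete, here is a small standard-library sketch (no Spark involved) that writes two gzip-compressed, pipe-delimited files with a header, sorted by id. The rows, the `write_part` helper, and the `part-0000N.csv.gz` file names are assumptions made for illustration; the real files are produced by Spark and carry Spark's own naming.

```python
import csv
import gzip
import os
import tempfile

HEADER = ["id", "first_name", "last_name", "email", "phone_count"]


def write_part(path, rows):
    """Write rows to a gzip-compressed, pipe-delimited text file with a header."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="|", lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)


# Hypothetical rows, deliberately out of order to show the sort step.
rows = [
    (2, "Bob", "Jones", "bob@example.com", 0),
    (1, "Ann", "Smith", "ann@example.com", 2),
    (3, "Cy", "Lee", "cy@example.com", 1),
]
rows.sort(key=lambda r: r[0])  # ascending by id

# Split the sorted rows into two files, each with its own header line.
out_dir = tempfile.mkdtemp()
half = (len(rows) + 1) // 2
part_paths = []
for i, chunk in enumerate((rows[:half], rows[half:])):
    path = os.path.join(out_dir, f"part-{i:05d}.csv.gz")
    write_part(path, chunk)
    part_paths.append(path)

# Read the first file back to inspect its contents.
with gzip.open(part_paths[0], "rt", encoding="utf-8") as fh:
    first_file_lines = fh.read().splitlines()
print(first_file_lines)
```

Each file repeats the header, which matches how Spark writes one header per part file when the `header` option is enabled.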

## Validation

Here are the self-validation steps:
* Run the following command to validate that the files are compressed. The extension should be gz.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem02/solution
```
* Run the following code to create a data frame.
```
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem02/solution'
data = spark. \
    read. \
    csv(path,
        sep='|',
        header=True,
        inferSchema=True
       )
```
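* Get the schema by running `data.printSchema()`. The output should be as below. The numeric columns show as `integer` rather than `long` because the types are inferred from the text files. Ignore nullability if it does not match exactly.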
```
root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone_count: integer (nullable = true)
```
* Get count by running `data.count()`. It should return **1,000,000**.
* Run `data.orderBy('id').show()` to validate the data. Output should be like this.

| id|first_name|   last_name|               email|phone_count|
|---|----------|------------|--------------------|-----------|
|  1|    Corrie|Van den Oord|cvandenoord0@etsy...|          1|
|  2|  Nikolaus|     Brewitt|nbrewitt1@dailyma...|          4|
|  3|    Orelie|      Penney|openney2@vistapri...|          5|
|  4|     Ashby|    Maddocks|  amaddocks3@home.pl|          4|
|  5|      Kurt|        Rome|krome4@shutterfly...|          1|
|  6|    Idelle|      Dorsey|idorsey5@artistee...|          5|
|  7|      Levy|       Pacey|lpacey6@bloglovin...|          5|
|  8|   Hershel|       Kneal|hkneal7@engadget.com|          3|
|  9|     Kelly|  Gatheridge|kgatheridge8@mysp...|          1|
| 10|     Aksel|       Ewles| aewles9@samsung.com|          1|
| 11| Millicent|    Whitwell| mwhitwella@army.mil|          3|
| 12|      Levy|    Fennelow|lfennelowb@so-net...|          4|
| 13|     Bucky|       Harle|   bharlec@europa.eu|          1|
| 14|     Randy|   Kleinmann|rkleinmannd@frien...|          4|
| 15|   Eveleen|     Lanaway|elanawaye@blinkli...|          5|
| 16|  Eleonore|      Cordle|ecordlef@printfri...|          0|
| 17|     Monte|     Sidaway|msidawayg@unicef.org|          3|
| 18|     Heddi|      Sackes|hsackesh@business...|          0|
| 19|    Tabina|     Olivari|    tolivarii@goo.gl|          2|
| 20|Rutherford|   Josephson|rjosephsonj@sprin...|          2|
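
The checks above can also be expressed as a small script. The sketch below uses only the standard library and works on the decompressed text of one output file; the function name `validate_text` and both sample strings are hypothetical. It checks the header, the field count, ascending ids, and that no `phone_count` is negative (Spark's `size` can return -1 for a null array under its legacy default, so a negative count would suggest the null case was not handled).

```python
import csv
import io

EXPECTED_HEADER = ["id", "first_name", "last_name", "email", "phone_count"]


def validate_text(text):
    """Return a list of problems found in pipe-delimited text; empty means OK."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    header = next(reader, None)
    problems = []
    if header != EXPECTED_HEADER:
        # Without the right header the remaining checks are meaningless.
        problems.append(f"unexpected header: {header}")
        return problems
    previous_id = None
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"line {line_no}: expected 5 fields, got {len(row)}")
            continue
        current_id = int(row[0])
        if previous_id is not None and current_id < previous_id:
            problems.append(f"line {line_no}: id {current_id} is out of order")
        if int(row[4]) < 0:
            problems.append(f"line {line_no}: negative phone_count")
        previous_id = current_id
    return problems


# Hypothetical file contents: one well-formed, one missing the email column.
good = (
    "id|first_name|last_name|email|phone_count\n"
    "1|Ann|Smith|ann@example.com|2\n"
    "2|Bob|Jones|bob@example.com|0\n"
)
bad = "id|first_name|last_name|phone_count\n1|Ann|Smith|2\n"

print(validate_text(good))
print(validate_text(bad))
```

An empty list means the text passed every check; a file written without the `email` column is reported through the header check.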

## Solution

In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 02 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', 2)

In [3]:
addressPath="/public/addresses/Address-*.json"

In [4]:
addressDf=spark.read.json(addressPath)

In [20]:
from pyspark.sql.functions import size, col, when

In [25]:
users=addressDf. \
    select("id","first_name","last_name","email","phone_numbers"). \
    withColumn("phone_count", \
               when(col("phone_numbers").isNull(),0). \
               otherwise(size(col("phone_numbers")))). \
    orderBy("id")

In [30]:
# email is one of the required output columns
usersDf=users.select('id',"first_name","last_name","email","phone_count")

In [31]:
usersDf.show()

+---+----------+------------+--------------------+-----------+
| id|first_name|   last_name|               email|phone_count|
+---+----------+------------+--------------------+-----------+
|  1|    Corrie|Van den Oord|cvandenoord0@etsy...|          1|
|  2|  Nikolaus|     Brewitt|nbrewitt1@dailyma...|          4|
|  3|    Orelie|      Penney|openney2@vistapri...|          5|
|  4|     Ashby|    Maddocks|  amaddocks3@home.pl|          4|
|  5|      Kurt|        Rome|krome4@shutterfly...|          1|
|  6|    Idelle|      Dorsey|idorsey5@artistee...|          5|
|  7|      Levy|       Pacey|lpacey6@bloglovin...|          5|
|  8|   Hershel|       Kneal|hkneal7@engadget.com|          3|
|  9|     Kelly|  Gatheridge|kgatheridge8@mysp...|          1|
| 10|     Aksel|       Ewles| aewles9@samsung.com|          1|
| 11| Millicent|    Whitwell| mwhitwella@army.mil|          3|
| 12|      Levy|    Fennelow|lfennelowb@so-net...|          4|
| 13|     Bucky|       Harle|   bharlec@europa.eu|          1|
| 14|     Randy|   Kleinmann|rkleinmannd@frien...|          4|
| 15|   Eveleen|     Lanaway|elanawaye@blinkli...|          5|
| 16|  Eleonore|      Cordle|ecordlef@printfri...|          0|
| 17|     Monte|     Sidaway|msidawayg@unicef.org|          3|
| 18|     Heddi|      Sackes|hsackesh@business...|          0|
| 19|    Tabina|     Olivari|    tolivarii@goo.gl|          2|
| 20|Rutherford|   Josephson|rjosephsonj@sprin...|          2|
+---+----------+------------+--------------------+-----------+

In [32]:
usersDf.coalesce(2). \
    write.mode('overwrite'). \
    option('compression','gzip'). \
    option('header','true'). \
    option('sep','|'). \
    format('csv'). \
    save(f'/user/{username}/mock_test_02/problem02/solution')

## Validations

In [33]:
%%sh
hdfs dfs -ls /user/`whoami`/mock_test_02/problem02/solution

Found 3 items
-rw-r--r--   3 itv001477 supergroup          0 2021-12-10 01:05 /user/itv001477/mock_test_02/problem02/solution/_SUCCESS
-rw-r--r--   3 itv001477 supergroup    5443846 2021-12-10 01:05 /user/itv001477/mock_test_02/problem02/solution/part-00000-e06b7b39-9d0b-4d14-99d9-7002ec3b4d5a-c000.csv.gz
-rw-r--r--   3 itv001477 supergroup    5623763 2021-12-10 01:05 /user/itv001477/mock_test_02/problem02/solution/part-00001-e06b7b39-9d0b-4d14-99d9-7002ec3b4d5a-c000.csv.gz

In [34]:
path = f'/user/{username}/mock_test_02/problem02/solution'
data = spark. \
  read. \
  csv(path,
      sep='|',
      header=True,
      inferSchema=True
     )

In [35]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone_count: integer (nullable = true)

In [36]:
data.count()

1000000

In [37]:
data.orderBy('id').show()

+---+----------+------------+--------------------+-----------+
| id|first_name|   last_name|               email|phone_count|
+---+----------+------------+--------------------+-----------+
|  1|    Corrie|Van den Oord|cvandenoord0@etsy...|          1|
|  2|  Nikolaus|     Brewitt|nbrewitt1@dailyma...|          4|
|  3|    Orelie|      Penney|openney2@vistapri...|          5|
|  4|     Ashby|    Maddocks|  amaddocks3@home.pl|          4|
|  5|      Kurt|        Rome|krome4@shutterfly...|          1|
|  6|    Idelle|      Dorsey|idorsey5@artistee...|          5|
|  7|      Levy|       Pacey|lpacey6@bloglovin...|          5|
|  8|   Hershel|       Kneal|hkneal7@engadget.com|          3|
|  9|     Kelly|  Gatheridge|kgatheridge8@mysp...|          1|
| 10|     Aksel|       Ewles| aewles9@samsung.com|          1|
| 11| Millicent|    Whitwell| mwhitwella@army.mil|          3|
| 12|      Levy|    Fennelow|lfennelowb@so-net...|          4|
| 13|     Bucky|       Harle|   bharlec@europa.eu|          1|
| 14|     Randy|   Kleinmann|rkleinmannd@frien...|          4|
| 15|   Eveleen|     Lanaway|elanawaye@blinkli...|          5|
| 16|  Eleonore|      Cordle|ecordlef@printfri...|          0|
| 17|     Monte|     Sidaway|msidawayg@unicef.org|          3|
| 18|     Heddi|      Sackes|hsackesh@business...|          0|
| 19|    Tabina|     Olivari|    tolivarii@goo.gl|          2|
| 20|Rutherford|   Josephson|rjosephsonj@sprin...|          2|
+---+----------+------------+--------------------+-----------+