## Problem 01

Weightage: 10

Get the details of all members who do not have phone numbers.

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements

* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem01/solution
```
* Use the parquet file format to save the output. The output should be saved in 2 files.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- id: long
 |-- first_name: string
 |-- last_name: string
 |-- email: string
```
* Data should be sorted in ascending order by id.
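
The output requirements above can be sketched as a small pair of helpers. This is only one possible approach, not necessarily the intended solution: the function names `solution_path` and `write_sorted_in_two_files` are hypothetical, and `repartitionByRange` followed by `sortWithinPartitions` is swapped in here as a way to get exactly two files whose rows are in ascending `id` order without depending on the `spark.sql.shuffle.partitions` setting.

```python
def solution_path(username):
    """Build the HDFS output directory named in the problem statement."""
    return f'/user/{username}/mock_test_02/problem01/solution'


def write_sorted_in_two_files(df, username):
    """Write `df` as parquet in two files, sorted by id.

    Assumes `df` is a Spark DataFrame that already contains only the
    id, first_name, last_name and email columns.
    """
    # Range-partition on id into 2 partitions: the first gets the lower
    # ids, the second the higher ids. Each partition becomes one file.
    # Sorting within each partition then gives ascending order overall.
    (df.repartitionByRange(2, 'id')
       .sortWithinPartitions('id')
       .write
       .parquet(solution_path(username), mode='overwrite'))
```

With a data frame `users` and `username = getpass.getuser()`, the call would be `write_sorted_in_two_files(users, username)`.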

## Validation

Here are the self-validation steps:
* Run the following code to create a data frame.
```
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem01/solution'
data = spark. \
    read. \
    parquet(path)
```
* Get the schema by running `data.printSchema()`. The output should be as below. Ignore nullability if it does not match exactly.
```
root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
```
* Get count by running `data.count()`. It should return **258160**.
* Run `data.orderBy('id').show()` to validate the data. Output should be like this.

| id|first_name|last_name|               email|
|---|----------|---------|--------------------|
| 16|  Eleonore|   Cordle|ecordlef@printfri...|
| 18|     Heddi|   Sackes|hsackesh@business...|
| 23|       Zak|    Rigts| zrigtsm@cornell.edu|
| 25|     Wiatt|     Wane|    wwaneo@tmall.com|
| 26|    Aubrie| Ashworth|aashworthp@networ...|
| 28|    Lindsy|  Kellart|lkellartr@istockp...|
| 30|    Harman|   Birley|hbirleyt@deliciou...|
| 33|     Randa|   Eberst|   reberstw@tamu.edu|
| 34|    Stinky| Penniall|spenniallx@domain...|
| 35|     Marya|   Rahlof|mrahlofy@oaic.gov.au|
| 42|     Peder|  Harring|pharring15@list-m...|
| 54|       Row|    Anker|ranker1h@squidoo.com|
| 57|    Morgun|      Loy|mloy1k@deviantart...|
| 60|  Geoffrey|Ashbridge|gashbridge1n@wufo...|
| 61|     Nance|  Gladdis|ngladdis1o@weathe...|
| 62|     Allyn|    Monni| amonni1p@devhub.com|
| 64|     Kleon|  Tolchar|ktolchar1r@angelf...|
| 66|  Georgena|    Ingre|gingre1t@marriott...|
| 69|   Belicia|    Trigg|   btrigg1w@army.mil|
| 79|  Courtnay|  Umpleby|cumpleby26@trelli...|
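
The schema check above can also be done programmatically instead of by eye. The sketch below compares the list returned by `data.dtypes` against the expected columns; `EXPECTED_DTYPES` and `schema_mismatches` are hypothetical names introduced for illustration. Note that a `long` column in `printSchema()` is reported as `bigint` in `DataFrame.dtypes`.

```python
# Expected (column name, type) pairs, in the required column order.
EXPECTED_DTYPES = [
    ('id', 'bigint'),
    ('first_name', 'string'),
    ('last_name', 'string'),
    ('email', 'string'),
]


def schema_mismatches(dtypes):
    """Return a list of differences between `dtypes` and EXPECTED_DTYPES.

    `dtypes` is a list of (name, type) pairs such as `data.dtypes`.
    An empty result means names, order and types all match.
    """
    problems = []
    if len(dtypes) != len(EXPECTED_DTYPES):
        problems.append(
            f'expected {len(EXPECTED_DTYPES)} columns, found {len(dtypes)}'
        )
    for position, (actual, expected) in enumerate(zip(dtypes, EXPECTED_DTYPES)):
        if tuple(actual) != expected:
            problems.append(
                f'column {position}: expected {expected}, found {tuple(actual)}'
            )
    return problems
```

After loading the output with `spark.read.parquet(path)`, running `schema_mismatches(data.dtypes)` should return an empty list if the schema matches.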

In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 01 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [3]:
spark.conf.set('spark.sql.shuffle.partitions', 2)

In [2]:
addressPath="/public/addresses/Address-*.json"

In [4]:
addressDf=spark.read.json(addressPath)

In [5]:
addressDf.show()

+--------------------+--------------------+----------+------+------+---------------+------------+--------------------+
|             address|               email|first_name|gender|    id|     ip_address|   last_name|       phone_numbers|
+--------------------+--------------------+----------+------+------+---------------+------------+--------------------+
|[Honolulu, 96840,...|lbreyt0@tripadvis...|  L;urette|Female|900001|  80.24.165.223|       Breyt|[213-896-1319, 21...|
|[Charlotte, 28278...|    nkilrow1@last.fm|     Nixie|Female|900002| 169.186.205.65|      Kilrow|[801-204-0578, 60...|
|[Los Angeles, 900...| zeliaz2@storify.com|     Zelig|  Male|900003|    85.93.47.94|       Eliaz|                null|
|[Pensacola, 32520...|bbrimblecombe3@li...|     Brook|Female|900004| 229.246.203.59|Brimblecombe|      [228-516-3927]|
|[San Antonio, 782...|fmylechreest4@ele...|      Fabe|  Male|900005| 147.180.88.217| Mylechreest|[404-484-7154, 28...|
|[Syracuse, 13217,...|rince5@purevolume...|     

In [14]:
addressDf.select('phone_numbers').show()

+--------------------+
|       phone_numbers|
+--------------------+
|[213-896-1319, 21...|
|[801-204-0578, 60...|
|                null|
|      [228-516-3927]|
|[404-484-7154, 28...|
|[714-428-9292, 72...|
|[305-120-1075, 21...|
|[480-768-1034, 64...|
|                null|
|[801-927-5543, 21...|
|      [215-211-9823]|
|      [619-732-6649]|
|[832-477-2553, 40...|
|[916-497-7931, 85...|
|[573-176-6702, 51...|
|[561-112-5164, 90...|
|[801-893-9003, 70...|
|      [907-343-8039]|
|[650-643-8600, 32...|
|      [908-210-8009]|
+--------------------+
only showing top 20 rows



In [7]:
from pyspark.sql.functions import col

In [20]:
# Filter on phone_numbers first, then keep only the required columns
users = addressDf. \
      filter(col('phone_numbers').isNull()). \
      select('id', 'first_name', 'last_name', 'email'). \
      orderBy('id')
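
The filter above treats only a null `phone_numbers` value as "no phone numbers", which is what produces the expected count. If a dataset also stored empty arrays for such members, a broader predicate would be needed. The sketch below is an assumption-laden variant, not part of the original solution: `has_no_phone` and `members_without_phone` are hypothetical names, and whether this dataset contains empty arrays at all is not known from the notebook.

```python
def has_no_phone(phone_numbers):
    """Plain-Python mirror of the predicate, for spot checks on collected rows.

    True when the value is missing (None) or an empty list.
    """
    return phone_numbers is None or len(phone_numbers) == 0


def members_without_phone(address_df):
    """Return id, first_name, last_name, email for rows whose
    phone_numbers is null or an empty array.

    `address_df` is assumed to be a Spark DataFrame with the schema
    shown in the data description.
    """
    # Imported here so has_no_phone stays usable without Spark installed.
    from pyspark.sql.functions import col, size

    no_phone = col('phone_numbers').isNull() | (size(col('phone_numbers')) == 0)
    return address_df. \
        filter(no_phone). \
        select('id', 'first_name', 'last_name', 'email')
```

If this variant returns a different count than the null-only filter, the data contains empty arrays and the expected count of 258160 would no longer apply to it.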

In [21]:
users

| id|first_name|last_name|               email|
|---|----------|---------|--------------------|
| 16|  Eleonore|   Cordle|ecordlef@printfri...|
| 18|     Heddi|   Sackes|hsackesh@business...|
| 23|       Zak|    Rigts| zrigtsm@cornell.edu|
| 25|     Wiatt|     Wane|    wwaneo@tmall.com|
| 26|    Aubrie| Ashworth|aashworthp@networ...|
| 28|    Lindsy|  Kellart|lkellartr@istockp...|
| 30|    Harman|   Birley|hbirleyt@deliciou...|
| 33|     Randa|   Eberst|   reberstw@tamu.edu|
| 34|    Stinky| Penniall|spenniallx@domain...|
| 35|     Marya|   Rahlof|mrahlofy@oaic.gov.au|


In [22]:
users.coalesce(2).write.parquet(f'/user/{username}/mock_test_02/problem01/solution',mode='overwrite')
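
The requirement that the output be saved in 2 files is not covered by the validation steps, so it may be worth checking after the write. A minimal sketch, assuming `DataFrame.inputFiles()` is available (PySpark 3.1 or later); the helper name `count_parquet_files` is hypothetical.

```python
def count_parquet_files(paths):
    """Count the paths whose file name ends with .parquet.

    Marker files such as _SUCCESS are ignored.
    """
    return sum(
        1 for path in paths
        if path.rsplit('/', 1)[-1].endswith('.parquet')
    )
```

For the directory written above, `count_parquet_files(spark.read.parquet(f'/user/{username}/mock_test_02/problem01/solution').inputFiles())` should return 2 when the output meets the requirement.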

# VALIDATION

In [23]:
path = f'/user/{username}/mock_test_02/problem01/solution'
data = spark. \
  read. \
  parquet(path)

In [24]:
data.printSchema()

root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)



In [25]:
data.count()

258160

In [26]:
data.orderBy('id').show()

+---+----------+---------+--------------------+
| id|first_name|last_name|               email|
+---+----------+---------+--------------------+
| 16|  Eleonore|   Cordle|ecordlef@printfri...|
| 18|     Heddi|   Sackes|hsackesh@business...|
| 23|       Zak|    Rigts| zrigtsm@cornell.edu|
| 25|     Wiatt|     Wane|    wwaneo@tmall.com|
| 26|    Aubrie| Ashworth|aashworthp@networ...|
| 28|    Lindsy|  Kellart|lkellartr@istockp...|
| 30|    Harman|   Birley|hbirleyt@deliciou...|
| 33|     Randa|   Eberst|   reberstw@tamu.edu|
| 34|    Stinky| Penniall|spenniallx@domain...|
| 35|     Marya|   Rahlof|mrahlofy@oaic.gov.au|
| 42|     Peder|  Harring|pharring15@list-m...|
| 54|       Row|    Anker|ranker1h@squidoo.com|
| 57|    Morgun|      Loy|mloy1k@deviantart...|
| 60|  Geoffrey|Ashbridge|gashbridge1n@wufo...|
| 61|     Nance|  Gladdis|ngladdis1o@weathe...|
| 62|     Allyn|    Monni| amonni1p@devhub.com|
| 64|     Kleon|  Tolchar|ktolchar1r@angelf...|
| 66|  Georgena|    Ingre|gingre1t@marri