## Problem 06

Weightage: 25

Get the cities with the top ten female member counts in each state. Cities with equal counts share the same rank, so a state can contribute more than ten cities. Include every city whose count falls within the top ten ranks for its state.

In [None]:
[5, 4, 4, 3, 3, 2]
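
To make the tie rule concrete, here is a small plain-Python sketch that ranks the counts listed above in two ways. The helper names `dense_rank_desc` and `rank_desc` are hypothetical; they only imitate what the SQL window functions `dense_rank` and `rank` return for a descending order.

```python
def dense_rank_desc(values):
    """Dense rank in descending order: ties share a rank and no rank is skipped."""
    distinct = sorted(set(values), reverse=True)
    position = {value: index + 1 for index, value in enumerate(distinct)}
    return [position[value] for value in values]


def rank_desc(values):
    """Plain rank in descending order: 1 + number of strictly greater values."""
    return [1 + sum(1 for other in values if other > value) for value in values]


counts = [5, 4, 4, 3, 3, 2]
dense = dense_rank_desc(counts)
plain = rank_desc(counts)
print(dense)  # [1, 2, 2, 3, 3, 4]
print(plain)  # [1, 2, 2, 4, 4, 6]
```

With dense ranking a tie does not use up the next rank, so a filter on rank <= 10 can keep more than ten rows per state. This is why Arizona has 11 rows in the final output.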

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem06/solution
```
* Use CSV and save the output to exactly one file. Make sure to preserve the header.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- state: string
 |-- city: string
 |-- female_count: long
```
* Data should be sorted in ascending order by `state` and then in descending order by `female_count`.

## Validation

Here are the self-validation steps:
* Run the following to check the number of files. There should be exactly one `part` file.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem06/solution
```
* Run the following to validate the data. Review the data to see if it is sorted in ascending order by state and then in descending order by count.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*
```
* Run this command to get the line count, including the header. The result should be 320.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l
```


In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 06 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', 2)

In [3]:
addressPath="/public/addresses/Address-*.json"
addressDf=spark.read.json(addressPath)

In [5]:
from pyspark.sql.functions import col,dense_rank,count
from pyspark.sql.window import Window

In [13]:
spec=Window.partitionBy("state"). \
            orderBy(col("female_count").desc())

In [8]:
addressDf.select('gender').distinct()

gender
Female
Male


In [9]:
filteredDf=addressDf.filter(col("gender")=="Female")

In [10]:
countDf=filteredDf.groupBy("address.state","address.city"). \
                agg(count("id").alias("female_count"))

In [19]:
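# Rank cities within each state by female_count (ties share a rank), keep ranks 1 to 10,
# then sort by state ascending and female_count descending.
resultDf=countDf.withColumn("rnk",dense_rank().over(spec)). \
                filter(col("rnk")<=10). \
                select("state","city","female_count"). \
                orderBy(col("state"),col("female_count").desc())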
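
To sanity-check this logic without a cluster, the same steps (filter on gender, count per state and city, dense rank within each state, keep the top ranks) can be imitated in plain Python. This is only a sketch: the helper name `top_n_cities` and the sample records are made up for illustration and are not taken from `/public/addresses`.

```python
from collections import Counter, defaultdict


def top_n_cities(records, n):
    """Return (state, city, female_count) rows whose dense rank within the state is <= n."""
    # Keep female members only and count them per (state, city).
    counts = Counter(
        (r["state"], r["city"]) for r in records if r["gender"] == "Female"
    )
    # Group the per-city counts by state.
    by_state = defaultdict(list)
    for (state, city), cnt in counts.items():
        by_state[state].append((city, cnt))

    rows = []
    for state in sorted(by_state):
        # The n largest distinct counts correspond to dense ranks 1..n.
        distinct = sorted({cnt for _, cnt in by_state[state]}, reverse=True)
        keep = set(distinct[:n])
        # Count descending; city name breaks ties only to make the output deterministic.
        cities = sorted(by_state[state], key=lambda item: (-item[1], item[0]))
        rows.extend((state, city, cnt) for city, cnt in cities if cnt in keep)
    return rows


# Made-up sample records.
records = [
    {"state": "Arizona", "city": "Peoria", "gender": "Female"},
    {"state": "Arizona", "city": "Chandler", "gender": "Female"},
    {"state": "Arizona", "city": "Phoenix", "gender": "Female"},
    {"state": "Arizona", "city": "Phoenix", "gender": "Female"},
    {"state": "Arizona", "city": "Tempe", "gender": "Male"},
    {"state": "Alaska", "city": "Juneau", "gender": "Female"},
]
top_one = top_n_cities(records, n=1)
top_two = top_n_cities(records, n=2)
print(top_one)
print(top_two)
```

With `n=2`, both Peoria and Chandler are kept for Arizona because they tie at the second rank.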

In [20]:
resultDf.show()

+-------+---------------+------------+
|  state|           city|female_count|
+-------+---------------+------------+
|Alabama|     Birmingham|        3980|
|Alabama|     Montgomery|        2132|
|Alabama|         Mobile|        2093|
|Alabama|     Huntsville|        1036|
|Alabama|     Tuscaloosa|         544|
|Alabama|       Anniston|         253|
|Alabama|        Gadsden|         248|
| Alaska|      Anchorage|        1389|
| Alaska|      Fairbanks|         563|
| Alaska|         Juneau|         267|
|Arizona|        Phoenix|        4361|
|Arizona|         Tucson|        2614|
|Arizona|           Mesa|         779|
|Arizona|     Scottsdale|         773|
|Arizona|       Glendale|         521|
|Arizona|       Prescott|         278|
|Arizona|         Peoria|         274|
|Arizona|       Chandler|         274|
|Arizona|        Gilbert|         273|
|Arizona|Apache Junction|         260|
+-------+---------------+------------+
only showing top 20 rows



In [21]:
resultDf.count()

319

In [22]:
resultDf.coalesce(1).write. \
    option("header","true"). \
    csv(f"/user/{username}/mock_test_02/problem06/solution")
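
Running the write cell a second time fails if the target directory already exists, because Spark's default save mode raises an error in that case. A minimal sketch of a re-runnable variant follows; the helper name `write_single_csv` is hypothetical, and choosing `overwrite` assumes that replacing the previous output is what you want.

```python
def write_single_csv(df, path):
    """Write a Spark DataFrame as one CSV part file with a header, replacing old output."""
    (
        df.coalesce(1)               # one partition, so exactly one part file
        .write.mode("overwrite")     # assumption: old output at `path` may be replaced
        .option("header", "true")    # keep the column names as the first line
        .csv(path)
    )
```

It would be called as `write_single_csv(resultDf, f"/user/{username}/mock_test_02/problem06/solution")`.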

## Validations

In [23]:
%%sh
hdfs dfs -ls /user/`whoami`/mock_test_02/problem06/solution

Found 2 items
-rw-r--r--   3 itv001477 supergroup          0 2021-12-11 02:27 /user/itv001477/mock_test_02/problem06/solution/_SUCCESS
-rw-r--r--   3 itv001477 supergroup       7643 2021-12-11 02:27 /user/itv001477/mock_test_02/problem06/solution/part-00000-a56be6c3-0ed0-407f-a072-195ec8b57b83-c000.csv


In [24]:
%%sh
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*

state,city,female_count
Alabama,Birmingham,3980
Alabama,Montgomery,2132
Alabama,Mobile,2093
Alabama,Huntsville,1036
Alabama,Tuscaloosa,544
Alabama,Anniston,253
Alabama,Gadsden,248
Alaska,Anchorage,1389
Alaska,Fairbanks,563
Alaska,Juneau,267
Arizona,Phoenix,4361
Arizona,Tucson,2614
Arizona,Mesa,779
Arizona,Scottsdale,773
Arizona,Glendale,521
Arizona,Prescott,278
Arizona,Chandler,274
Arizona,Peoria,274
Arizona,Gilbert,273
Arizona,Apache Junction,260
Arizona,Tempe,259
Arkansas,Little Rock,1320
Arkansas,Fort Smith,553
Arkansas,North Little Rock,546
Arkansas,Hot Springs National Park,268
California,Sacramento,5500
California,Los Angeles,5344
California,San Diego,4187
California,San Francisco,3729
California,San Jose,3221
California,Fresno,3176
California,Pasadena,1861
California,Oakland,1569
California,Long Beach,1364
California,Bakersfield,1332
Colorado,Denver,4066
Colorado,Colorado Springs,2923
Colorado,Littleton,812
Colorado,Aurora,802
Colorado,Pueblo,763
Colorado,Boulder,748
Colorado,Gr

In [25]:
%%sh
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l

320
