## Repair table to refresh partitions

Let us understand how to keep the folder structure of a table and its metadata in the metastore in sync.
* The data, processing engine and catalog (metastore) are decoupled in Spark.
* Because they are decoupled, we can work with the data files directly, without going through the table.
* If we write data directly into the folders of a metastore table, the metadata in the catalog (or metastore) can go out of sync with the files.
* We can use `MSCK REPAIR TABLE` to bring the metadata in the catalog back in sync with the folder structure.
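
As a rough mental model of what "out of sync" means, the sketch below compares the partition folders on disk with the partitions registered in the catalog. It is a conceptual illustration in plain Python, not how Spark implements the repair; the function name `partitions_missing_from_catalog` and the sample values are assumptions made for this example.

```python
def partitions_missing_from_catalog(folder_partitions, catalog_partitions):
    """Return partitions that exist as folders but are not registered in the catalog."""
    return sorted(set(folder_partitions) - set(catalog_partitions))


# Illustrative values: partition folders found under the table location
folder_partitions = ['year=2021/month=01/day=13', 'year=2021/month=01/day=14']
# Illustrative values: partitions the metastore currently knows about
catalog_partitions = ['year=2021/month=01/day=13']

missing = partitions_missing_from_catalog(folder_partitions, catalog_partitions)
print(missing)
```

Any partition in the resulting list is one that a repair would need to register before queries against the table can see its data.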

In [None]:
%%sh

hdfs dfs -ls /user/${USER}/itvgithub/landing

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Analyze GitHub Archive Data'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', 8)

* As part of the previous topic, we have already loaded the data for one day. Let us review the data that was added when the table was created using `saveAsTable`.

In [3]:
# There is only one partition at this time
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
+-------------------------+



In [4]:
# You can see the count for the day to which the partition belongs.
spark.sql(f'''
    SELECT substring(created_at, 1, 10) AS created_dt, count(1)
    FROM {username}_ghraw_db.ghactivity
    GROUP BY created_dt
    ORDER BY created_dt
'''). \
    show()

+----------+--------+
|created_dt|count(1)|
+----------+--------+
|2021-01-13| 2829111|
+----------+--------+



* Now let us load the data for **2021-01-14**. Here we write the data directly to the table's folder using its path. There is no reference to the table anywhere.

In [5]:
from pyspark.sql.functions import substring, col
process_dt = '2021-01-14'
spark.conf.set('spark.sql.shuffle.partitions', 8)
ghdata = spark. \
    read. \
    json(f'/user/{username}/itvgithub/landing/{process_dt}-*.json.gz')
ghdata = ghdata. \
    withColumn('year', substring('created_at', 1, 4)). \
    withColumn('month', substring('created_at', 6, 2)). \
    withColumn('day', substring('created_at', 9, 2))
ghdata. \
    write. \
    mode('append'). \
    partitionBy('year', 'month', 'day'). \
    parquet(f'/user/{username}/warehouse/{username}_ghraw_db.db/ghactivity')
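
The `substring` calls in the cell above use 1-based positions, so `substring('created_at', 6, 2)` takes two characters starting at the sixth one. The plain-Python sketch below mirrors that mapping with 0-based slices to show which partition folder a record ends up in. It assumes `created_at` is a string that starts with `YYYY-MM-DD`, and the function name `partition_dir` is hypothetical.

```python
def partition_dir(created_at):
    """Mirror substring(created_at, 1, 4), (6, 2) and (9, 2) using 0-based slices."""
    year = created_at[0:4]    # characters 1-4 in Spark's 1-based terms
    month = created_at[5:7]   # characters 6-7
    day = created_at[8:10]    # characters 9-10
    return f'year={year}/month={month}/day={day}'


# Sample timestamp, assumed to follow the YYYY-MM-DD... layout
print(partition_dir('2021-01-14T00:00:00Z'))
```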

* You can run `SHOW PARTITIONS` to list the current partitions based on the metadata in the metastore.
* At this time you will see only one partition, as we have written the data directly into the target folder without any reference to the table.
* The metadata in the metastore and the folder structure of **ghactivity** are not in sync.

In [6]:
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
+-------------------------+



* You can confirm that the target location has folders for both `year=2021/month=01/day=13` and `year=2021/month=01/day=14` by running the HDFS command below.

In [7]:
%%sh

hdfs dfs -ls -R /user/${USER}/warehouse/${USER}_ghraw_db.db/ghactivity

-rw-r--r--   3 itv001477 supergroup          0 2021-12-06 01:31 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/_SUCCESS
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 00:55 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 01:31 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01
drwxr-xr-x   - itv001477 supergroup          0 2021-12-06 01:04 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13
-rw-r--r--   3 itv001477 supergroup  172701438 2021-12-06 00:56 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13/part-00000-6e194789-615a-4cdf-b0dd-68f0e575d685.c000.snappy.parquet
-rw-r--r--   3 itv001477 supergroup  176429914 2021-12-06 00:56 /user/itv001477/warehouse/itv001477_ghraw_db.db/ghactivity/year=2021/month=01/day=13/part-00001-6e194789-615a-4cdf-b0dd-68f0e575d685.c000.snappy.parquet
-rw-r--r--

* You can run the `MSCK REPAIR TABLE` command to refresh the partition metadata in the metastore so that it matches the folder structure.

In [8]:
spark.sql(f'''
    MSCK REPAIR TABLE {username}_ghraw_db.ghactivity
''')
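
`MSCK REPAIR TABLE` scans the whole table location. When you already know which partition was added, registering just that partition with `ALTER TABLE ... ADD PARTITION` is a possible alternative. The sketch below only builds the statement as a string; the helper name `add_partition_sql` and the database name `demo_ghraw_db` are hypothetical, and the values are quoted on the assumption that the partition columns are strings.

```python
def add_partition_sql(table, year, month, day):
    """Build an ALTER TABLE statement that registers a single partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}')"
    )


# 'demo_ghraw_db' is a placeholder database name for this example
stmt = add_partition_sql('demo_ghraw_db.ghactivity', '2021', '01', '14')
print(stmt)
# With an active SparkSession this could be submitted as: spark.sql(stmt)
```

The Spark SQL documentation also lists `ALTER TABLE ... RECOVER PARTITIONS` as another way to recover partitions.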

* Now we can run `SHOW PARTITIONS` again, and we should see the partitions for both days.

In [9]:
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
|year=2021/month=01/day=14|
+-------------------------+



* You can also run the query below to see the counts for each day. You should see counts for both dates.

|created_dt|count(1)|
|----------|--------|
|2021-01-13| 2829111|
|2021-01-14| 2857818|


In [10]:
spark.sql(f'''
    SELECT substring(created_at, 1, 10) AS created_dt, count(1)
    FROM {username}_ghraw_db.ghactivity
    GROUP BY created_dt
    ORDER BY created_dt
'''). \
    show()

+----------+--------+
|created_dt|count(1)|
+----------+--------+
|2021-01-13| 2829111|
|2021-01-14| 2857818|
+----------+--------+



### Task - Run for additional day

As we already have data related to **2021-01-15**, perform the tasks below to understand the relevance of managing partitions.
* Validate whether you have data for **2021-01-15**.
* Run the code to read the JSON files for **2021-01-15** and write them to the respective folders in the target location.
* Check whether the new partition is visible.
* Repair the table and validate by reviewing the partitions as well as by running the count by `created_dt`.
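
The first task, checking that files exist for the day, can also be sketched in plain Python by applying the same `{process_dt}-*.json.gz` pattern that the load code uses. The helper name `files_for_day` and the file names in `landing_files` are made up for illustration; in practice the names would come from a listing of the landing folder.

```python
from fnmatch import fnmatch


def files_for_day(file_names, process_dt):
    """Return the file names that match the pattern used when reading a day's data."""
    pattern = f'{process_dt}-*.json.gz'
    return [name for name in file_names if fnmatch(name, pattern)]


# Hypothetical file names standing in for a listing of the landing folder
landing_files = [
    '2021-01-14-23.json.gz',
    '2021-01-15-0.json.gz',
    '2021-01-15-1.json.gz',
]

matched = files_for_day(landing_files, '2021-01-15')
print(matched)
```

An empty result would mean there is nothing to load for that day.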

In [None]:
%%sh

hdfs dfs -ls /user/${USER}/itvgithub/landing/*2021-01-15*.json.gz

In [2]:
from pyspark.sql.functions import substring, col
process_dt = '2021-01-15'
spark.conf.set('spark.sql.shuffle.partitions', 8)
ghdata = spark. \
    read. \
    json(f'/user/{username}/itvgithub/landing/{process_dt}-*.json.gz')
ghdata = ghdata. \
    withColumn('year', substring('created_at', 1, 4)). \
    withColumn('month', substring('created_at', 6, 2)). \
    withColumn('day', substring('created_at', 9, 2))
ghdata. \
    write. \
    mode('append'). \
    partitionBy('year', 'month', 'day'). \
    parquet(f'/user/{username}/warehouse/{username}_ghraw_db.db/ghactivity')

In [3]:
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
|year=2021/month=01/day=14|
+-------------------------+



In [4]:
%%sh

hdfs dfs -ls -R /user/${USER}/warehouse/${USER}_ghraw_db.db/ghactivity/year=2021/month=01/day=15

-rw-r--r--   3 itversity itversity  151059311 2021-06-30 03:48 /user/itversity/warehouse/itversity_ghraw_db.db/ghactivity/year=2021/month=01/day=15/part-00000-8e2a6f57-ac21-4943-a3f1-3a5829d2d920.c000.snappy.parquet
-rw-r--r--   3 itversity itversity  151059311 2021-06-30 03:04 /user/itversity/warehouse/itversity_ghraw_db.db/ghactivity/year=2021/month=01/day=15/part-00000-a32e20a4-2deb-4e1e-bcf4-b1703bdc0b6f.c000.snappy.parquet
-rw-r--r--   3 itversity itversity  149518106 2021-06-30 03:48 /user/itversity/warehouse/itversity_ghraw_db.db/ghactivity/year=2021/month=01/day=15/part-00001-8e2a6f57-ac21-4943-a3f1-3a5829d2d920.c000.snappy.parquet
-rw-r--r--   3 itversity itversity  149518106 2021-06-30 03:04 /user/itversity/warehouse/itversity_ghraw_db.db/ghactivity/year=2021/month=01/day=15/part-00001-a32e20a4-2deb-4e1e-bcf4-b1703bdc0b6f.c000.snappy.parquet
-rw-r--r--   3 itversity itversity  147916250 2021-06-30 03:49 /user/itversity/warehouse/itversity_ghraw_db.db/ghactivity/year=2021/mont

In [5]:
spark.sql(f'''
    MSCK REPAIR TABLE {username}_ghraw_db.ghactivity
''')

In [6]:
spark.sql(f'SHOW PARTITIONS {username}_ghraw_db.ghactivity').show(truncate=False)

+-------------------------+
|partition                |
+-------------------------+
|year=2021/month=01/day=13|
|year=2021/month=01/day=14|
|year=2021/month=01/day=15|
+-------------------------+



In [8]:
spark.sql(f'''
    SELECT substring(created_at, 1, 10) AS created_dt, count(1)
    FROM {username}_ghraw_db.ghactivity
    GROUP BY created_dt
    ORDER BY created_dt
'''). \
    show()

+----------+--------+
|created_dt|count(1)|
+----------+--------+
|2021-01-13| 2829111|
|2021-01-14| 2857818|
|2021-01-15| 5305800|
+----------+--------+
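
The count for **2021-01-15** is noticeably higher than for the other days, and the HDFS listing for `day=15` shows pairs of part files with identical sizes but different identifiers in their names. This is consistent with the append having run twice for that day, since `mode('append')` adds new files on every run. The sketch below groups part files by the identifier embedded in the file name, on the assumption that each write produces its own identifier; the regular expression and the helper name `write_jobs` are part of this example, not a Spark API.

```python
import re
from collections import Counter

# Matches names such as part-00000-<36-character id>.c000.snappy.parquet
PART_FILE = re.compile(r'part-(\d{5})-([0-9a-f-]{36})\.c\d+\.snappy\.parquet$')


def write_jobs(paths):
    """Count part files per identifier embedded in the file names."""
    jobs = Counter()
    for path in paths:
        match = PART_FILE.search(path)
        if match:
            jobs[match.group(2)] += 1
    return jobs


# File names copied from the complete lines of the day=15 listing
paths = [
    'part-00000-8e2a6f57-ac21-4943-a3f1-3a5829d2d920.c000.snappy.parquet',
    'part-00000-a32e20a4-2deb-4e1e-bcf4-b1703bdc0b6f.c000.snappy.parquet',
    'part-00001-8e2a6f57-ac21-4943-a3f1-3a5829d2d920.c000.snappy.parquet',
    'part-00001-a32e20a4-2deb-4e1e-bcf4-b1703bdc0b6f.c000.snappy.parquet',
]

jobs = write_jobs(paths)
print(len(jobs))
```

More than one identifier in a partition that was meant to be loaded once is a hint to check for duplicate rows before relying on the counts.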

