
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Table Management

Apache Spark&trade; and Databricks&reg; allow you to access and optimize data in managed and unmanaged tables.

## In this lesson you:
* Write to managed and unmanaged tables
* Explore the effect of dropping tables on the metadata and underlying data

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/gyw20pwx6j?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/gyw20pwx6j?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Optimization of Data Storage with Managed and Unmanaged Tables

A **managed table** is a table for which Spark manages both the data and the metadata. In this case, a `DROP TABLE` command removes both the metadata for the table and the data itself.

**Unmanaged tables** manage only the metadata for a table, such as the schema and the data location. The data itself sits in a different location, often backed by a blob store such as Azure Blob Storage or Amazon S3. Dropping an unmanaged table drops only the metadata associated with the table, while the data itself remains in place.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/managed-and-unmanaged-tables.png" style="height: 400px; margin: 20px"/></div>
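
The difference in `DROP TABLE` behavior can be sketched without a cluster. The example below is a toy model only, not how Spark or the Hive metastore is implemented: `ToyMetastore`, `create_table`, and `drop_table` are hypothetical names, and a temporary local directory stands in for both the warehouse and the external blob store.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Illustrative stand-in for a metastore: it holds metadata only."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}  # table name -> {"type": ..., "location": ...}

    def create_table(self, name, location=None):
        # No explicit location: the table is managed and lives in the warehouse.
        managed = location is None
        if managed:
            location = os.path.join(self.warehouse_dir, name.lower())
        os.makedirs(location, exist_ok=True)
        with open(os.path.join(location, "part-00000"), "w") as f:
            f.write("data")
        self.tables[name.lower()] = {
            "type": "MANAGED" if managed else "EXTERNAL",
            "location": location,
        }

    def drop_table(self, name):
        # Dropping always removes the metadata entry...
        entry = self.tables.pop(name.lower())
        # ...but only a managed table's files are deleted with it.
        if entry["type"] == "MANAGED":
            shutil.rmtree(entry["location"])
        return entry["location"]


root = tempfile.mkdtemp()
store = ToyMetastore(os.path.join(root, "warehouse"))
store.create_table("myTableManaged")
store.create_table(
    "myTableUnManaged",
    location=os.path.join(root, "external", "myTableUnManaged"),
)

managed_path = store.drop_table("myTableManaged")
unmanaged_path = store.drop_table("myTableUnManaged")

print(os.path.exists(managed_path))    # False: data removed with the metadata
print(os.path.exists(unmanaged_path))  # True: data left in place
```

The rest of this lesson demonstrates the same two outcomes with real Spark tables.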

### Writing to a Managed Table

Managed tables allow access to data using the Spark SQL API.

Run the cell below to mount the data.

In [7]:
%run "./Includes/Classroom-Setup"

Create a DataFrame.

In [9]:
df = spark.range(1, 100)

display(df)

id
1
2
3
4
5
6
7
8
9
10


Register the table.

In [11]:
df.write.mode("OVERWRITE").saveAsTable("myTableManaged")

Use `DESCRIBE EXTENDED` to view the table's schema and metadata.  Scroll down to see the table `Type`.

Notice that the location is `dbfs:/user/hive/warehouse/< your database >.db/mytablemanaged`.

In [13]:
%sql
DESCRIBE EXTENDED myTableManaged

col_name,data_type,comment
id,bigint,
,,
# Detailed Table Information,,
Database,shashank_rao_rhsmith_umd_edu_dbp,
Table,mytablemanaged,
Owner,root,
Created Time,Mon Feb 24 17:16:04 UTC 2020,
Last Access,Thu Jan 01 00:00:00 UTC 1970,
Created By,Spark 2.4.4,
Type,MANAGED,
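
The `Database` and `Table` rows above are enough to work out where the managed table's files live. The helper below is a small sketch of that path pattern. `managed_table_location` is a hypothetical name, and the default warehouse root is an assumption taken from the paths shown in this lesson; other workspaces may be configured differently.

```python
def managed_table_location(database, table, warehouse="dbfs:/user/hive/warehouse"):
    """Build the default directory for a managed table.

    Assumes the layout seen in this lesson: <warehouse>/<database>.db/<table>,
    with the table name lowercased as it is stored in the metastore.
    """
    return "{}/{}.db/{}".format(warehouse.rstrip("/"), database, table.lower())


print(managed_table_location("my_database", "myTableManaged"))
# dbfs:/user/hive/warehouse/my_database.db/mytablemanaged
```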


### Writing to an Unmanaged Table

Write to an unmanaged table by adding an `.option()` that includes a path.

In [15]:
df.write.mode("OVERWRITE").option('path', userhome+'/myTableUnManaged').saveAsTable("myTableUnManaged")
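
Since the only difference between the two writes is the `path` option, both cases can be folded into one helper. This is a minimal sketch: `save_as_table` is a hypothetical name, and it relies only on the `DataFrameWriter` calls already used in this lesson (`mode`, `option`, `saveAsTable`).

```python
def save_as_table(df, table_name, path=None):
    """Save df as a managed table, or as an unmanaged one when path is given."""
    writer = df.write.mode("overwrite")
    if path is not None:
        # Supplying a path is what makes the table unmanaged (external).
        writer = writer.option("path", path)
    writer.saveAsTable(table_name)
```

In a notebook, `save_as_table(df, "myTableManaged")` would correspond to the managed write earlier in this lesson, and `save_as_table(df, "myTableUnManaged", path=userhome + "/myTableUnManaged")` to the unmanaged write above.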

Now examine the table type and location of the data.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> An external table is the same as an unmanaged table.

In [17]:
%sql
DESCRIBE EXTENDED myTableUnManaged

col_name,data_type,comment
id,bigint,
,,
# Detailed Table Information,,
Database,shashank_rao_rhsmith_umd_edu_dbp,
Table,mytableunmanaged,
Owner,root,
Created Time,Mon Feb 24 17:23:25 UTC 2020,
Last Access,Thu Jan 01 00:00:00 UTC 1970,
Created By,Spark 2.4.4,
Type,EXTERNAL,
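
Rather than scrolling through the output, the table type can be pulled out programmatically. The sketch below works on plain `(col_name, data_type, comment)` tuples, matching the three columns shown above. `table_type` is a hypothetical helper, and the sample rows are copied from this lesson's output.

```python
def table_type(describe_rows):
    """Return the value of the `Type` row from DESCRIBE EXTENDED output.

    describe_rows is an iterable of (col_name, data_type, comment) tuples.
    Only rows after the "# Detailed Table Information" marker are considered,
    so a data column that happens to be named "Type" is not mistaken for it.
    """
    in_details = False
    for col_name, data_type, _comment in describe_rows:
        name = col_name or ""
        if name.startswith("# Detailed Table Information"):
            in_details = True
        elif in_details and name == "Type":
            return data_type
    return None


rows = [
    ("id", "bigint", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Table", "mytableunmanaged", ""),
    ("Type", "EXTERNAL", ""),
]
print(table_type(rows))  # EXTERNAL
```

In a notebook, the tuples could be built from the query result, for example with `[(r.col_name, r.data_type, r.comment) for r in spark.sql("DESCRIBE EXTENDED myTableUnManaged").collect()]`.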


### Dropping Tables

Take a look at how dropping tables operates differently in the two cases below.

Look at the files backing the managed table.

In [20]:
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/" + databaseName + ".db/mytablemanaged"))

path,name,size
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/_SUCCESS,_SUCCESS,0
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/_committed_6549717446597494238,_committed_6549717446597494238,824
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/_started_6549717446597494238,_started_6549717446597494238,0
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00000-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-424-1-c000.snappy.parquet,part-00000-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-424-1-c000.snappy.parquet,494
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00001-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-425-1-c000.snappy.parquet,part-00001-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-425-1-c000.snappy.parquet,494
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00002-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-426-1-c000.snappy.parquet,part-00002-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-426-1-c000.snappy.parquet,499
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00003-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-427-1-c000.snappy.parquet,part-00003-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-427-1-c000.snappy.parquet,494
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00004-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-428-1-c000.snappy.parquet,part-00004-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-428-1-c000.snappy.parquet,493
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00005-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-429-1-c000.snappy.parquet,part-00005-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-429-1-c000.snappy.parquet,499
dbfs:/user/hive/warehouse/shashank_rao_rhsmith_umd_edu_dbp.db/mytablemanaged/part-00006-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-430-1-c000.snappy.parquet,part-00006-tid-6549717446597494238-27bc55c2-2619-45c2-ad83-6d4cba6adbe7-430-1-c000.snappy.parquet,494


Drop the table.

In [22]:
%sql
DROP TABLE myTableManaged

Next look at the underlying data.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Listing the directory throws an error, which the `try`/`except` block catches and prints.

In [24]:
try:
  display(dbutils.fs.ls("dbfs:/user/hive/warehouse/" + databaseName + ".db/mytablemanaged"))
  
except Exception as e:
  print(e)

The data was deleted along with the table, so Spark cannot find the underlying files. Perform the same operation with the unmanaged table.

In [26]:
display(dbutils.fs.ls("dbfs:/user/" + username + "/myTableUnManaged"))

path,name,size
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_SUCCESS,_SUCCESS,0
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_committed_7501568886850465961,_committed_7501568886850465961,824
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_started_7501568886850465961,_started_7501568886850465961,0
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00000-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-434-1-c000.snappy.parquet,part-00000-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-434-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00001-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-435-1-c000.snappy.parquet,part-00001-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-435-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00002-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-436-1-c000.snappy.parquet,part-00002-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-436-1-c000.snappy.parquet,499
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00003-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-437-1-c000.snappy.parquet,part-00003-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-437-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00004-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-438-1-c000.snappy.parquet,part-00004-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-438-1-c000.snappy.parquet,493
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00005-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-439-1-c000.snappy.parquet,part-00005-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-439-1-c000.snappy.parquet,499
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00006-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-440-1-c000.snappy.parquet,part-00006-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-440-1-c000.snappy.parquet,494


Drop the unmanaged table.

In [28]:
%sql
DROP TABLE myTableUnManaged

See if the data is still there.

In [30]:
display(dbutils.fs.ls("dbfs:/user/" + username + "/myTableUnManaged"))

path,name,size
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_SUCCESS,_SUCCESS,0
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_committed_7501568886850465961,_committed_7501568886850465961,824
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/_started_7501568886850465961,_started_7501568886850465961,0
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00000-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-434-1-c000.snappy.parquet,part-00000-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-434-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00001-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-435-1-c000.snappy.parquet,part-00001-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-435-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00002-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-436-1-c000.snappy.parquet,part-00002-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-436-1-c000.snappy.parquet,499
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00003-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-437-1-c000.snappy.parquet,part-00003-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-437-1-c000.snappy.parquet,494
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00004-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-438-1-c000.snappy.parquet,part-00004-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-438-1-c000.snappy.parquet,493
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00005-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-439-1-c000.snappy.parquet,part-00005-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-439-1-c000.snappy.parquet,499
dbfs:/user/shashank.rao@rhsmith.umd.edu/myTableUnManaged/part-00006-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-440-1-c000.snappy.parquet,part-00006-tid-7501568886850465961-ce35ce7c-2162-488e-8202-4588025e159e-440-1-c000.snappy.parquet,494
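
The before-and-after checks in this section can be wrapped in a small existence test. This is a sketch, not part of the course material: `path_exists` is a hypothetical helper that takes the listing function as an argument, so it can be tried locally with `os.listdir` and used in a notebook with `dbutils.fs.ls`.

```python
import os
import tempfile


def path_exists(list_files, path):
    """Return True if list_files(path) succeeds, False if it raises."""
    try:
        list_files(path)
        return True
    except Exception:
        # Any failure is treated as "not there"; this also hides
        # unrelated errors such as missing permissions.
        return False


existing = tempfile.mkdtemp()
print(path_exists(os.listdir, existing))                        # True
print(path_exists(os.listdir, os.path.join(existing, "gone")))  # False
```

In a notebook, `path_exists(dbutils.fs.ls, "dbfs:/user/" + username + "/myTableUnManaged")` would be expected to return `True` both before and after dropping the unmanaged table, while the same check on the managed table's directory would return `False` after the drop.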


## Review
**Question:** What happens to the original data when I drop a managed table?  What about an unmanaged table?  
**Answer:** Dropping a managed table deletes both the metadata and the data itself. Dropping an unmanaged table deletes only the metadata; the original data stays where it is.

**Question:** What is a metastore?  
**Answer:** A metastore is a repository of metadata, such as the schema and the location of the data. A metastore does not include the data itself.

## Next Steps

Start the next lesson, [Capstone Project]($./08-Capstone-Project ).

**At the end of this course, please complete the <a href="https://www.surveymonkey.com/r/VYGM9TD" target="_blank">short feedback survey</a>.  Your input is extremely important and shapes future course development.**

## Additional Topics & Resources

**Q:** Where can I find out more about connecting to my own metastore?  
**A:** Take a look at the <a href="https://docs.databricks.com/user-guide/advanced/external-hive-metastore.html" target="_blank">Databricks documentation for more details</a>.

**Q:** Where can I find out more about Spark Tables?  
**A:** Take a look at the <a href="https://docs.databricks.com/user-guide/tables.html" target="_blank">Databricks documentation for more details</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>