In [0]:
%sql
create database my_db;

In [0]:
%sql
create or replace table my_db.persons
(
  id int,
  name string,
  city string,
  marks int
) using delta;

In [0]:
%sql
create table my_db.students
(
  studid string,
  studname string,
  studcity string,
  studmarks int
) using delta;

In [0]:
%sql
insert into my_db.students values(1,'A','Tirupati',90);
insert into my_db.students values(2,'B','Chennai',95);
insert into my_db.students values(3,'C','Bengaluru',50);
insert into my_db.students values(4,'D','Mumbai',70);

num_affected_rows,num_inserted_rows
1,1


In [0]:
%sql
select * from my_db.students

studid,studname,studcity,studmarks
3,C,Bengaluru,50
1,A,Tirupati,90
2,B,Chennai,95
4,D,Mumbai,70


## CASE 1:
* The students table has the columns studid, studname, studcity, studmarks.
* The persons table has the columns id, name, city, marks.

## SCENARIOS TO TEST:
* Insert the data with the columns in the correct position.
* Insert data into the target table with fewer columns and with more columns than it has.


* Scenario 1

In [0]:
%sql
insert into my_db.persons
select studid,studname,studcity,studmarks
from my_db.students

num_affected_rows,num_inserted_rows
4,4


In [0]:
%sql
select * from my_db.persons

id,name,city,marks
3,C,Bengaluru,50
1,A,Tirupati,90
2,B,Chennai,95
4,D,Mumbai,70


* Data inserted successfully.
### KEY

* When we load data into a Delta table with an INSERT statement, the statement is validated by column position only, not by column names.
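
To make the positional rule concrete, here is a toy model in plain Python. It is only an illustration of the idea, not Delta's implementation; the helper names `match_by_position` and `match_by_name` are made up for this sketch.

```python
# Toy model of positional vs. by-name column matching (not Delta's own code).
target_columns = ["id", "name", "city", "marks"]
source_row = {"studid": "1", "studname": "A", "studcity": "Tirupati", "studmarks": 90}


def match_by_position(row, target_cols):
    """Pair the i-th source value with the i-th target column, ignoring source names."""
    values = list(row.values())
    if len(values) != len(target_cols):
        raise ValueError(
            f"target has {len(target_cols)} column(s) but the data has {len(values)}"
        )
    return dict(zip(target_cols, values))


def match_by_name(row, target_cols):
    """Pair values with target columns only where the names are identical."""
    return {col: row[col] for col in target_cols if col in row}


by_position = match_by_position(source_row, target_columns)
by_name = match_by_name(source_row, target_columns)

print(by_position)  # {'id': '1', 'name': 'A', 'city': 'Tirupati', 'marks': 90}
print(by_name)      # {} because no source name equals a target name
```

This is why Scenario 1 works even though none of the students column names exist in persons.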

* Scenario 2: insert with fewer columns

In [0]:
%sql
insert into my_db.persons
select studid,studname,studcity
from my_db.students

com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_INSERT_COLUMN_ARITY_MISMATCH] Cannot write to 'spark_catalog.my_db.persons', not enough data columns; target table has 4 column(s) but the inserted data has 3 column(s)
	at com.databricks.sql.transaction.tahoe.DeltaErrorsBase.notEnoughColumnsInInsert(DeltaErrors.scala:775)
	at com.databricks.sql.transaction.tahoe.DeltaErrorsBase.notEnoughColumnsInInsert$(DeltaErrors.scala:768)
	at com.databricks.sql.transaction.tahoe.DeltaErrors$.notEnoughColumnsInInsert(DeltaErrors.scala:3573)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis.com$databricks$sql$transaction$tahoe$DeltaAnalysis$$needsSchemaAdjustmentByOrdinal(DeltaAnalysis.scala:1571)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:170)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:115)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$ano

* Insert failed.
### KEY:
* When we insert into a Delta table with fewer columns than the target table defines, Delta's schema validation rejects the insert.
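
One possible workaround is to pad the select list with typed NULLs so that the column count matches the target. The sketch below only builds the SQL text; `build_padded_insert` is a hypothetical helper, and it assumes the missing columns are the trailing ones of the target table.

```python
def build_padded_insert(target_table, target_dtypes, source_table, source_columns):
    """Build an INSERT ... SELECT that pads missing trailing target columns with typed NULLs.

    target_dtypes is a list of (name, type) pairs, in the same shape as df.dtypes.
    """
    if len(source_columns) > len(target_dtypes):
        raise ValueError("source has more columns than the target")
    select_items = list(source_columns)
    # Every target column without a source counterpart gets a typed NULL.
    for name, dtype in target_dtypes[len(source_columns):]:
        select_items.append(f"CAST(NULL AS {dtype.upper()}) AS {name}")
    return (
        f"INSERT INTO {target_table}\n"
        f"SELECT {', '.join(select_items)}\n"
        f"FROM {source_table}"
    )


persons_dtypes = [("id", "int"), ("name", "string"), ("city", "string"), ("marks", "int")]
padded_sql = build_padded_insert(
    "my_db.persons", persons_dtypes, "my_db.students", ["studid", "studname", "studcity"]
)
print(padded_sql)
```

The generated statement selects four expressions, so it should pass the arity check that failed above; it has not been run as part of this notebook.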

* Scenario 3: insert more columns than the target table has

In [0]:
%sql
alter table my_db.students add columns (empstate string)

In [0]:
%sql
select * from my_db.students

studid,studname,studcity,studmarks,empstate
3,C,Bengaluru,50,
1,A,Tirupati,90,
2,B,Chennai,95,
4,D,Mumbai,70,


In [0]:
%sql
insert into my_db.persons
select studid,studname,studcity,studmarks,empstate
from my_db.students

com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [_LEGACY_ERROR_TEMP_DELTA_0007] A schema mismatch detected when writing to the Delta table (Table ID: 41c63924-4980-4f16-82c8-f8c30a206727).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.

Table schema:
root
-- id: integer (nullable = true)
-- name: string (nullable = true)
-- city: string (nullable = true)
-- marks: integer (nullable = true)


Data schema:
root
-- id: integer (nullable = true)
-- name: string (nullable = true)
-- city: string (nullable = true)
-- marks: integer (nullable = true)
-- empstate: string (nullable = true)

         
	at com.databricks.sql.transaction.tahoe.MetadataMismatchErrorBuilder.finalizeAndThrow(DeltaErrors.scala:3772)
	at com.databricks.sql.transaction.t

#### KEY
Schema validation failed the insert because the source supplied more columns than the target table has.
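
A small pre-flight check can spot surplus source columns before the insert is attempted. This is a plain-Python sketch over column-name lists; `split_extra_columns` is a hypothetical helper and it assumes columns are compared by position, as in the INSERT scenarios above.

```python
def split_extra_columns(source_columns, target_columns):
    """Return (columns that fit the target by position, surplus columns)."""
    n = len(target_columns)
    return source_columns[:n], source_columns[n:]


students_columns = ["studid", "studname", "studcity", "studmarks", "empstate"]
persons_columns = ["id", "name", "city", "marks"]

kept, extra = split_extra_columns(students_columns, persons_columns)
print(kept)   # ['studid', 'studname', 'studcity', 'studmarks']
print(extra)  # ['empstate']
```

If `extra` is not empty, either select only the `kept` columns, or first add the new column to the target with `alter table ... add columns`, as was done for my_db.students above.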

In [0]:
%sql
create table my_db.students_new
(
  studid string,
  studname string,
  studcity string,
  studmarks string
) using delta;

In [0]:
%sql
insert into my_db.students_new values('1','A','Tirupati','90');
insert into my_db.students_new values('2','B','Chennai','95');
insert into my_db.students_new values('3','C','Bengaluru','50');
insert into my_db.students_new values('4','D','Mumbai','70');

### TESTING THE DATA TYPE VALIDATION IN DELTA LAKE
* In the scenario below, I'm going to check whether Delta Lake accepts an insert into a column whose data type is INT when the source column's data type is STRING.

In [0]:
%sql
insert into my_db.persons
select studid,studname,studcity,studmarks
from my_db.students_new

In [0]:
%sql
select * from my_db.persons

* The test succeeded.
* Internally, Spark casts the source data to INT.
![logs](/FileStore/tables/delta_lke.png)
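
Instead of relying on the implicit cast, the cast can be written out in the select list so the intent is visible in the query. The helper below only generates the SQL fragment; `build_cast_select` is a hypothetical name, and pairing the columns by position is an assumption carried over from the INSERT scenarios.

```python
def build_cast_select(source_columns, target_dtypes):
    """Pair columns by position and wrap each source column in an explicit CAST."""
    if len(source_columns) != len(target_dtypes):
        raise ValueError("column counts differ")
    return ", ".join(
        f"CAST({src} AS {dtype.upper()}) AS {name}"
        for src, (name, dtype) in zip(source_columns, target_dtypes)
    )


persons_dtypes = [("id", "int"), ("name", "string"), ("city", "string"), ("marks", "int")]
select_list = build_cast_select(
    ["studid", "studname", "studcity", "studmarks"], persons_dtypes
)
print(f"INSERT INTO my_db.persons SELECT {select_list} FROM my_db.students_new")
```

Strings that cannot be parsed as integers may become NULL or raise an error depending on the session's ANSI settings; that case is not tested in this notebook.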

## Writing data from a DataFrame to a Delta table

* Scenario: writing data to a Delta table with different column names but the same number of columns

In [0]:
df = spark.table('my_db.students')

In [0]:
df.dtypes

[('studid', 'string'),
 ('studname', 'string'),
 ('studcity', 'string'),
 ('studmarks', 'int'),
 ('empstate', 'string')]

In [0]:
# checking the column names of target delta table
spark.table('my_db.persons').dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string'), ('marks', 'int')]

In [0]:
# Writing the data from df to target delta table

df.write\
    .format('delta')\
    .mode('append')\
    .saveAsTable('my_db.persons')

### KEY
* Whenever we write data from a DataFrame to a Delta table, we should always make sure that the source column names match the column names of the target table.
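
When the source and target have the same number of columns in the same order, the rename mapping can be derived instead of typed by hand. This is a plain-Python sketch; `positional_rename` is a hypothetical helper and the column lists are copied from the tables in this notebook.

```python
def positional_rename(source_columns, target_columns):
    """Map each source column name to the target name at the same position."""
    if len(source_columns) != len(target_columns):
        raise ValueError("column counts differ, cannot rename by position")
    return dict(zip(source_columns, target_columns))


rename_map = positional_rename(
    ["studid", "studname", "studcity", "studmarks"],
    ["id", "name", "city", "marks"],
)
print(rename_map)
# {'studid': 'id', 'studname': 'name', 'studcity': 'city', 'studmarks': 'marks'}
```

On a real DataFrame the mapping could be applied with `df.select([df[src].alias(dst) for src, dst in rename_map.items()])`. Renaming does not change data types, so studid would still be a string afterwards.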

* HANDLING THIS

In [0]:
# Rename each source column to its target name; cast studid to int to match persons.id
df.select(df.studid.cast('int').alias('id'),\
    df.studname.alias('name'),\
    df.studcity.alias('city'),\
    df.studmarks.alias('marks'))\
    .write.format('delta').mode('overwrite').saveAsTable('my_db.persons')

In [0]:
%sql
select * from my_db.persons

### Loading only partial columns to target delta table.

In [0]:
df.select(df.studid.alias('id'),\
    df.studname.alias('name'))\
    .write.format('delta').mode('append').saveAsTable('my_db.persons')

com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_INSERT_COLUMN_ARITY_MISMATCH] Cannot write to 'spark_catalog.my_db.persons', not enough data columns; target table has 4 column(s) but the inserted data has 2 column(s)
	at com.databricks.sql.transaction.tahoe.DeltaErrorsBase.notEnoughColumnsInInsert(DeltaErrors.scala:775)
	at com.databricks.sql.transaction.tahoe.DeltaErrorsBase.notEnoughColumnsInInsert$(DeltaErrors.scala:768)
	at com.databricks.sql.transaction.tahoe.DeltaErrors$.notEnoughColumnsInInsert(DeltaErrors.scala:3573)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis.com$databricks$sql$transaction$tahoe$DeltaAnalysis$$needsSchemaAdjustmentByOrdinal(DeltaAnalysis.scala:1571)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:170)
	at com.databricks.sql.transaction.tahoe.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:115)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$ano

In [0]:
%sql
select * from my_db.persons

### KEY (VIMP)
* Loading partial columns into a Delta table with an INSERT statement gives an error.
* Loading partial columns into a Delta table from a DataFrame loads the data, and the missing columns are populated as null.
* We cannot load source columns that are not part of the target table.
* When writing from a DataFrame, the source and target data types should match; no internal cast will happen.
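
The DataFrame-side points above can be checked before writing by comparing the two `dtypes` lists by column name. The function below is a plain-Python sketch, not part of Spark or Delta; `find_write_issues` is a hypothetical name and the messages are made up for illustration.

```python
def find_write_issues(source_dtypes, target_dtypes):
    """Compare (name, type) pairs by column name and list potential write problems."""
    source = dict(source_dtypes)
    target = dict(target_dtypes)
    issues = []
    for name, dtype in source_dtypes:
        if name not in target:
            issues.append((name, "not in target"))
        elif target[name] != dtype:
            issues.append((name, f"type mismatch: {dtype} vs {target[name]}"))
    for name, _ in target_dtypes:
        if name not in source:
            issues.append((name, "missing from source"))
    return issues


# Shape of df.select(studid as id, studname as name).dtypes without a cast
partial_dtypes = [("id", "string"), ("name", "string")]
persons_dtypes = [("id", "int"), ("name", "string"), ("city", "string"), ("marks", "int")]

issues = find_write_issues(partial_dtypes, persons_dtypes)
for column, problem in issues:
    print(column, "->", problem)
```

Here the check reports the uncast `id` column as a type mismatch and lists `city` and `marks` as missing from the source.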