Creating a pipeline that uses Spark to fetch data from a MySQL table, process it, and store the processed data in Snowflake every 5 minutes, with Apache Airflow automating the process
First we need to create a Spark connection in Airflow so it can submit the Spark job. Hover over the "Admin" tab and select the "Connections" option
Now click on the "+" symbol to create a new connection
Enter the Connection ID (it can be anything you want), then select "Spark" as the Connection Type
Now enter "local" or "local[*]" in the Host field, "client" in the Deploy mode field, and "spark-submit" in the Spark binary field
Now click on the "Save" button on the bottom left
Write a DAG program in the "~/airflow/dags/" directory to run a Spark job every 5 minutes (you can find this code in the dags directory of this repository)
cd ~/airflow/dags/
vi mysqlToSnowflakeWithSparkAndAirflow.py
Now, to run the Spark job, we need to include the MySQL and Snowflake dependencies in the DAG program, so download the required connector JARs into the home directory and include their paths in the DAG program
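The actual DAG code is in this repository's dags directory; the snippet below is only a minimal sketch of what such a DAG can look like, assuming Airflow 2.x with the apache-airflow-providers-apache-spark package installed, a connection ID of "spark_conn", and placeholder JAR file names and paths under /home/user/ — replace these with the connection ID you created above and the JARs you actually downloaded.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Paths to the downloaded dependency JARs -- placeholder file names and
# versions; use whatever you actually downloaded to your home directory
JARS = ",".join([
    "/home/user/mysql-connector-java-8.0.30.jar",
    "/home/user/snowflake-jdbc-3.13.30.jar",
    "/home/user/spark-snowflake_2.12-2.12.0-spark_3.3.jar",
])

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

with DAG(
    dag_id="mysql_to_snowflake_with_spark",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/5 * * * *",  # run the Spark job every 5 minutes
    catchup=False,
) as dag:
    # Submit the Spark job (spark.py, written in the next step) through the
    # Spark connection configured in the Airflow UI
    run_spark_job = SparkSubmitOperator(
        task_id="mysql_to_snowflake",
        application="/home/user/spark.py",
        conn_id="spark_conn",  # the connection ID you created above
        jars=JARS,
        verbose=True,
    )
```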
Write a Spark job that will fetch the data from MySQL, process that data, and then store the processed data in Snowflake
cd ~
vi spark.py
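Again, the real job is in this repository; the sketch below only illustrates the shape of such a job, assuming a placeholder MySQL database and table (source_db.employees), placeholder credentials, a placeholder Snowflake account, warehouse and target table, and a simple aggregation standing in for the processing step — replace all of these with your own values.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mysqlToSnowflake").getOrCreate()

# 1. Fetch the data from MySQL over JDBC (needs the MySQL connector JAR)
mysql_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/source_db")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "employees")
    .option("user", "mysql_user")
    .option("password", "mysql_password")
    .load()
)

# 2. Process the data -- a simple aggregation as an illustration
processed_df = mysql_df.groupBy("department").agg(
    F.count("*").alias("employee_count"),
    F.avg("salary").alias("avg_salary"),
)

# 3. Store the processed data in Snowflake (needs the Snowflake JDBC and
#    spark-snowflake connector JARs)
sf_options = {
    "sfURL": "<your_account>.snowflakecomputing.com",
    "sfUser": "snowflake_user",
    "sfPassword": "snowflake_password",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

(
    processed_df.write.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEE_SUMMARY")
    .mode("overwrite")  # each run replaces the table's contents
    .save()
)

spark.stop()
```

The write in this sketch uses mode("overwrite"), which matches the behaviour described at the end of this guide: each scheduled run replaces the Snowflake table with the current contents of the MySQL table.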
After writing the DAG and Spark programs, go to the Airflow web UI homepage and click the "pause/unpause DAG" toggle or "Trigger DAG" to run the DAG
For more information, click on the DAG name
Click on "Graph" to check the status of the tasks
Congrats! The DAG ran successfully: the table was created on the MySQL server and the records were inserted into it
If more records are inserted into the MySQL table, the next scheduled run (at most 5 minutes later) overwrites the Snowflake table with the full contents of the MySQL table, so both the old and the new records end up in Snowflake