# Ingest constructors.json file

### Step 1 - Read the JSON file using the Spark DataFrame reader

In [0]:
# DDL-formatted schema string; passing it to the reader skips schema inference
constructors_schema = "constructorId INT, constructorRef STRING, name STRING, nationality STRING, url STRING"

In [0]:
constructor_df = spark.read.schema(constructors_schema).json("/FileStore/tables/constructors.json")

In [0]:
# display(constructor_df)
constructor_df.show(5)

+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef|      name|nationality|                 url|
+-------------+--------------+----------+-----------+--------------------+
|            1|       mclaren|   McLaren|    British|http://en.wikiped...|
|            2|    bmw_sauber|BMW Sauber|     German|http://en.wikiped...|
|            3|      williams|  Williams|    British|http://en.wikiped...|
|            4|       renault|   Renault|     French|http://en.wikiped...|
|            5|    toro_rosso|Toro Rosso|    Italian|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
only showing top 5 rows



### Step 2 - Drop unwanted columns

In [0]:
constructor_dropped_df = constructor_df.drop("url")

### Step 3 - Rename columns and add ingestion date

In [0]:
from pyspark.sql.functions import current_timestamp

In [0]:
constructor_final_df = (
    constructor_dropped_df
    .withColumnRenamed("constructorId", "constructor_id")
    .withColumnRenamed("constructorRef", "constructor_ref")
    .withColumn("ingestion_date", current_timestamp())
)
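Both renames in this step follow the same camelCase to snake_case pattern, so they could be derived rather than listed by hand. The sketch below is one possible way to do that; `camel_to_snake` and `rename_columns_to_snake_case` are hypothetical helpers, not functions from PySpark or from the original notebook, and the regex assumes simple names like the ones in this file.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase column name to snake_case (hypothetical helper)."""
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def rename_columns_to_snake_case(df):
    """Rename every column of a DataFrame using withColumnRenamed."""
    for column in df.columns:
        df = df.withColumnRenamed(column, camel_to_snake(column))
    return df


# Possible use in this notebook (assumption, replaces the two explicit renames):
# constructor_renamed_df = rename_columns_to_snake_case(constructor_dropped_df)
```

For a file with only two columns to rename, the explicit `withColumnRenamed` calls are arguably clearer; a helper like this mainly pays off when the same convention applies across many files.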

In [0]:
# display(constructor_final_df)
constructor_final_df.show(5)

+--------------+---------------+----------+-----------+--------------------+
|constructor_id|constructor_ref|      name|nationality|      ingestion_date|
+--------------+---------------+----------+-----------+--------------------+
|             1|        mclaren|   McLaren|    British|2023-06-02 12:00:...|
|             2|     bmw_sauber|BMW Sauber|     German|2023-06-02 12:00:...|
|             3|       williams|  Williams|    British|2023-06-02 12:00:...|
|             4|        renault|   Renault|     French|2023-06-02 12:00:...|
|             5|     toro_rosso|Toro Rosso|    Italian|2023-06-02 12:00:...|
+--------------+---------------+----------+-----------+--------------------+
only showing top 5 rows



### Step 4 - Write output to a Parquet file

In [0]:
constructor_final_df.write.mode("overwrite").parquet("/FileStore/tables/constructors.parquet")
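A quick way to confirm the write is to read the output back and check that the expected columns are present. This is only a sketch: `missing_output_columns` is a hypothetical helper, and the example call assumes the `spark` session that the notebook environment provides.

```python
def missing_output_columns(spark, path, expected_columns):
    """Read Parquet output back and list the expected columns that are absent."""
    written_df = spark.read.parquet(path)
    return [c for c in expected_columns if c not in written_df.columns]


# Example call in the notebook (assumes `spark` is already defined):
# missing_output_columns(
#     spark,
#     "/FileStore/tables/constructors.parquet",
#     ["constructor_id", "constructor_ref", "name", "nationality", "ingestion_date"],
# )
```

An empty list would indicate that every expected column was written; `url` should not appear in the output because it was dropped in Step 2.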