# Key Notes on cast() and printSchema()

In PySpark, cast() is a Column method that changes the data type of a column within a DataFrame. This is helpful when you need to standardize column data types for data processing, schema consistency, or compatibility with other operations. printSchema() prints the DataFrame's schema as a tree, which makes it easy to confirm the result of a cast.

* Basic Syntax for cast()

```
from pyspark.sql.functions import col

# Cast a single column in place
df = df.withColumn("column_name", col("column_name").cast("target_data_type"))

# Cast multiple columns with select
# Note: select keeps only the listed columns, so list every column you want to retain
cast_expr = [
    col("column1_name").cast("target_data_type1"),
    col("column2_name").cast("target_data_type2"),
    # More columns and data types as needed
]

df = df.select(*cast_expr)
```



In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# Reuse the active Spark session or create a new one
spark = SparkSession.builder.getOrCreate()


### Example
Let's create a dataset in which every column is stored as a string, then apply cast() to change the data types of multiple columns.

In [5]:
# Define the schema for the dataset
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),  # Stored as StringType initially
    StructField("height", StringType(), True)  # Stored as StringType initially
])

# Create a sample dataset
data = [
    ("Alice", "25", "160.5"),
    ("Bob", "30", "175.2"),
    ("Charlie", "22", "180")
]

# Create the DataFrame
df = spark.createDataFrame(data, schema)

# Print the schema and display the data
df.printSchema()
df.show()





root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- height: string (nullable = true)

+-------+---+------+
|   name|age|height|
+-------+---+------+
|  Alice| 25| 160.5|
|    Bob| 30| 175.2|
|Charlie| 22|   180|
+-------+---+------+



In [6]:
# Define cast expressions for multiple columns
cast_expr = [
    col("name").cast("string"),  # already a string; listed so select keeps the column
    col("age").cast("int"),  # cast age to IntegerType
    col("height").cast("double"),  # cast height to DoubleType
]

# Apply the cast expressions to the DataFrame
df = df.select(*cast_expr)

# Print the new schema and show the result
df.printSchema()
df.show()



root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: double (nullable = true)

+-------+---+------+
|   name|age|height|
+-------+---+------+
|  Alice| 25| 160.5|
|    Bob| 30| 175.2|
|Charlie| 22| 180.0|
+-------+---+------+



### Advantages of Using cast()
* Schema Alignment: Ensures data types in different tables or DataFrames are compatible for joining or union operations.
* Data Consistency: Ensures all columns conform to expected data types for downstream data processing.
* Error Reduction: Minimizes issues arising from mismatched data types in computations or transformations.