Spark: "contains invalid character(s)" errors
Vaquar Khan edited this page Nov 4, 2022
Invalid characters in string values can be stripped with `regexp_replace`, keeping only alphanumerics, underscores, and hyphens:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

app_name = "PySpark regex_replace Example"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
        [2, 'ABCDE!!!']]
df = spark.createDataFrame(data, ['id', 'str'])
df.show(truncate=False)

# Remove every character that is not a digit, letter, underscore, or hyphen.
# Note the raw string (r"..."), so the backslash reaches the regex engine intact.
df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "").alias('replaced_str'))
df.show()
```
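The character class in that call keeps only digits, letters, underscores, and hyphens. The same pattern can be checked locally with Python's `re` module before running it through Spark:

```python
import re

# Same pattern as the regexp_replace call above, written as a raw string.
pattern = re.compile(r"[^0-9a-zA-Z_\-]+")

print(pattern.sub("", "ABCDEDF!@#$%%^123456qwerty"))  # ABCDEDF123456qwerty
print(pattern.sub("", "ABCDE!!!"))                    # ABCDE
```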
A related problem is invalid characters in column names: Parquet doesn't allow column names containing characters such as spaces, commas, semicolons, braces, parentheses, tabs, newlines, or equals signs. It's better engineering practice to follow a naming convention yourself and fix the names than to have some system arbitrarily fix them for you.
```python
# Strip semicolons from every column name before writing to Parquet.
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(";", ""))
```
Or, equivalently in Scala:
```scala
// Fold over the column list, renaming each column in turn.
df.columns
  .foldLeft(df) { (newdf, colname) =>
    newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
  }
  .show()
```
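Both snippets above special-case one or two characters. A more general helper can replace everything Parquet rejects in one pass. This is a sketch: `sanitize_column_name` is a hypothetical name, and the rejected-character set is assumed to be `" ,;{}()\n\t="`, as reported in Spark's error message.

```python
import re

# Characters assumed to be rejected by Spark's Parquet writer: " ,;{}()\n\t="
_INVALID = re.compile(r"[ ,;{}()\n\t=]")

def sanitize_column_name(name: str) -> str:
    """Replace every rejected character with an underscore."""
    return _INVALID.sub("_", name)

# Applying it to an existing DataFrame `df` (hypothetical):
# df = df.toDF(*[sanitize_column_name(c) for c in df.columns])
```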