
Spark "contains invalid character" error

Vaquar Khan edited this page Nov 4, 2022 · 2 revisions

Remove Special Characters from Column in PySpark DataFrame

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    app_name = "PySpark regexp_replace Example"
    master = "local"

    spark = SparkSession.builder \
        .appName(app_name) \
        .master(master) \
        .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
            [2, 'ABCDE!!!']]

    df = spark.createDataFrame(data, ['id', 'str'])

    df.show(truncate=False)

    # Keep only digits, letters, underscore, and hyphen.
    # Use a raw string so the "\-" escape is not rejected by Python.
    df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "")
                   .alias('replaced_str'))

    df.show()
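The character class above can be checked without a Spark session: `regexp_replace` uses Java regex syntax, but for this simple class Python's `re` module behaves the same, so a quick sketch of what survives the replacement:

    import re

    # Same pattern as the regexp_replace call above: strip every run of
    # characters that is not a digit, letter, underscore, or hyphen.
    pattern = r"[^0-9a-zA-Z_\-]+"

    print(re.sub(pattern, "", "ABCDEDF!@#$%%^123456qwerty"))  # ABCDEDF123456qwerty
    print(re.sub(pattern, "", "ABCDE!!!"))                    # ABCDE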

Parquet doesn't allow such characters in column names, which is what triggers the "contains invalid character" error when writing. It's better engineering practice to follow a naming convention and fix the names yourself rather than having some system arbitrarily fix them for you.

    for c in df.columns:
        df = df.withColumnRenamed(c, c.replace(";", ""))

Or, in Scala, fold the rename over all columns:

    df.columns
      .foldLeft(df) { (newdf, colname) =>
        newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
      }
      .show()
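Since this page is about PySpark, the Scala `foldLeft` translates naturally to `functools.reduce`. A minimal sketch, assuming a hypothetical `sanitize` helper that mirrors the replacements in the Scala snippet; the rename logic is shown on a plain list of names so it runs without a Spark session:

    from functools import reduce

    def sanitize(name):
        # Hypothetical helper: replace spaces and dots (both rejected by
        # Parquet in column names) with underscores.
        return name.replace(" ", "_").replace(".", "_")

    # With a live DataFrame this would be:
    # df = reduce(lambda d, c: d.withColumnRenamed(c, sanitize(c)), df.columns, df)

    # Demonstrated on plain column names:
    columns = ["first name", "total.amount", "id"]
    print([sanitize(c) for c in columns])  # ['first_name', 'total_amount', 'id']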