
Spark "contains invalid character" error

Vaquar Khan edited this page Nov 4, 2022 · 2 revisions

Remove Special Characters from Column in PySpark DataFrame

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    app_name = "PySpark regexp_replace Example"
    master = "local"

    spark = SparkSession.builder \
        .appName(app_name) \
        .master(master) \
        .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    data = [[1, 'ABCDEDF!@#$%%^123456qwerty'],
            [2, 'ABCDE!!!']]

    df = spark.createDataFrame(data, ['id', 'str'])

    df.show(truncate=False)

    # Keep only digits, letters, underscore, and hyphen.
    # Use a raw string so the "\-" escape is not rejected by Python.
    df = df.select("id", regexp_replace("str", r"[^0-9a-zA-Z_\-]+", "")
                   .alias('replaced_str'))

    df.show()
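The character class above can be checked without a Spark session: `regexp_replace` uses Java regex syntax, but for this simple class Python's `re` module behaves the same, so a quick sketch of what survives the replacement:

    import re

    # Same pattern as the regexp_replace call above: strip every run of
    # characters that is not a digit, letter, underscore, or hyphen.
    pattern = r"[^0-9a-zA-Z_\-]+"

    print(re.sub(pattern, "", "ABCDEDF!@#$%%^123456qwerty"))  # ABCDEDF123456qwerty
    print(re.sub(pattern, "", "ABCDE!!!"))                    # ABCDE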

Parquet doesn't allow such characters in column names, which is what triggers the "contains invalid character" error when writing. It's better engineering practice to follow a naming convention and fix the names yourself rather than having some system arbitrarily fix them for you.

    for c in df.columns:
        df = df.withColumnRenamed(c, c.replace(";", ""))

Or, in Scala, fold the rename over all columns:

    df.columns
      .foldLeft(df) { (newdf, colname) =>
        newdf.withColumnRenamed(colname, colname.replace(" ", "_").replace(".", "_"))
      }
      .show()
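Since this page is about PySpark, the Scala `foldLeft` translates naturally to `functools.reduce`. A minimal sketch, assuming a hypothetical `sanitize` helper that mirrors the replacements in the Scala snippet; the rename logic is shown on a plain list of names so it runs without a Spark session:

    from functools import reduce

    def sanitize(name):
        # Hypothetical helper: replace spaces and dots (both rejected by
        # Parquet in column names) with underscores.
        return name.replace(" ", "_").replace(".", "_")

    # With a live DataFrame this would be:
    # df = reduce(lambda d, c: d.withColumnRenamed(c, sanitize(c)), df.columns, df)

    # Demonstrated on plain column names:
    columns = ["first name", "total.amount", "id"]
    print([sanitize(c) for c in columns])  # ['first_name', 'total_amount', 'id']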