
Unable to load sasdata set into Spark #48

Closed
pathri-pk opened this issue Apr 16, 2019 · 6 comments

Comments

@pathri-pk

I'm using the code below to load a sample SAS data set into Spark and getting a timeout error. I tried increasing the timeout with the 'metadataTimeout' option, but it still times out while reading the metadata. Any help is appreciated.

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /Spark/spark-2.4.1-bin-hadoop2.7/jars/spark-sas7bdat-2.1.0-s_2.11.jar pyspark-shell'
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Spark App").getOrCreate()
df = spark.read.format("com.github.saurfang.sas.spark").load("airline.sas7bdat", forceLowercaseNames=True, inferLong=True)

Error:
Py4JJavaError: An error occurred while calling o226.load.
: java.util.concurrent.TimeoutException: Timed out after 60 sec while reading file metadata, file might be corrupt. (Change timeout with 'metadataTimeout' paramater)

@thesuperzapper
Collaborator

@pathri-pk are you able to post that SAS file? (You should be able to drag it into a reply to this issue, if it's not huge.)


jene4ekjene4ek commented Mar 27, 2020

Hi all. I have the same problem with this file. The file is not corrupted, because I can read it with pandas.
airline.zip


mazenora commented Apr 17, 2020

I had the same problem, so I read the file with pandas and then converted it to a Spark DataFrame.

import pandas as pd
from pyspark.sql.types import StructType

df = pd.read_sas('YOUR_FILE', format='sas7bdat')
mySchema = StructType([])  # fill in a suitable schema structure for Spark
spark_df = spark.createDataFrame(df, schema=mySchema)

@jene4ekjene4ek

I've solved this issue by using the parso and spark-sas7bdat jars.

@mazenora

Could you please share the code?

@jene4ekjene4ek

In my project folder I have a jars folder containing parso-2.0.10.jar and spark-sas7bdat-2.1.0-s_2.11.jar (screenshot attached), and I run my script as: spark-submit --jars jars/spark-sas7bdat-2.1.0-s_2.11.jar,jars/parso-2.0.10.jar script.py

spark = (SparkSession.builder
         .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11")
         .getOrCreate())

df = (spark.read.format('com.github.saurfang.sas.spark')
      .load("airline.sas7bdat"))

