Large Files and Remote SAS via IOM on Jupyter #151
Comments
Hey William! So, the code you're using in these examples uses neither saspy nor SAS. It's using pandas to read a SAS data set into python as a pandas DataFrame. If I remember right, python (jupyter) is outside your container and on the client side? I can't say why pandas isn't producing output when you read a larger table, but is there enough memory for the python process where it's running?
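One way to check whether memory is the culprit is to read the file in pieces rather than all at once: `pd.read_sas` accepts a `chunksize` argument and returns an iterator of DataFrames. The helper below is a sketch (the path is the one from the thread; the chunk size is illustrative):

```python
import pandas as pd

def count_rows_chunked(chunks):
    """Total the rows across an iterable of DataFrames without ever
    holding the whole table in memory at once."""
    return sum(len(chunk) for chunk in chunks)

# For the real file (path from the thread, chunk size illustrative):
# reader = pd.read_sas('/home/jovyan/work/trauma.sas7bdat',
#                      format='sas7bdat', chunksize=50_000)
# print(count_rows_chunked(reader))
```

If the chunked read gets through the whole 250 MB file while a plain `pd.read_sas` call does not, that points at memory in the python process rather than anything on the SAS side.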
Ah yeah, thanks for the clarification. Right now we are trying to figure out how to copy files over to the SAS container for the workload, since it is a separate container. I'm hoping I am at least using the right approach to start? I will take a look at the python process, as that makes sense ^_^
Ah, so you're looking to read the SAS data set into python and then use df2sd() (dataframe2sasdata()) to transfer it to SAS? That is a way to do it. With the IOM access method, that will generate a data step in SAS and stream the rows from the data frame over to SAS, writing them out as the SAS data set. Both of your cases (b and c) would also work, and it would be more performant to have the SAS session directly access the data set. But I know you were trying to avoid putting saspy inside the container. Tom
Hey @tomweber-sas would you mind providing more info on how that works? Can't seem to find docs showing an example. Sorry to be a pain ^_^
import saspy
import pandas as pd
df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format = 'sas7bdat')
hr = pd.df2sd(df)
You're no pain William. df2sd() is a saspy method of the SASsession object. Here's a sequence (hand typed in, so correct any typos :) ):
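A minimal sketch of that kind of sequence (the configuration name and table name here are illustrative, not from the thread; the key point is that df2sd() is called on the SASsession object, not on pandas):

```python
import saspy
import pandas as pd

# read the SAS data set into a pandas DataFrame on the client side
df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format='sas7bdat')

# start a SAS session; 'iomj' is an illustrative IOM configuration
# name from your sascfg_personal.py
sas = saspy.SASsession(cfgname='iomj')

# df2sd() streams the DataFrame's rows over to SAS and writes them
# out as a SAS data set (WORK.trauma by default)
hr = sas.df2sd(df, table='trauma')
```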
The doc for saspy is here: https://sassoftware.github.io/saspy/api.html Let me know if this makes sense or if you have any problems!
Oh, and you mentioned the sas_kernel. The code above is python, so it runs in a python kernel. If you want to access that data via the sas_kernel, you would still need to move the data over with saspy directly, and write it to a permanent library, not WORK. Then you could connect with a sas_kernel and access it directly; well, that is, if there's a common storage location all of the docker images can get at. A connection from the sas_kernel would be to a different docker instance, right? So then the SAS data set wouldn't be in that container, right?
Make sense?
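Writing to a permanent library instead of WORK might look like the sketch below (the libref name and storage path are illustrative; the path must be visible inside the SAS container):

```python
import saspy
import pandas as pd

sas = saspy.SASsession()   # uses your default configuration

# assign a permanent libref pointing at shared storage
# ('/shared/data' is an illustrative path, not from the thread)
sas.saslib('mydata', path='/shared/data')

df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format='sas7bdat')

# write to the permanent library instead of WORK, so a later
# sas_kernel session with the same storage mounted can read it
sd = sas.df2sd(df, table='trauma', libref='mydata')
```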
Ah that makes so much sense! Thanks for walking me through this; I will try it tonight. I was also made aware of a potential third option, where we store all of our data in Azure Blob Storage (which we want to do anyway) and I can just have both containers use: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux So I have a range of options :) Thanks so much for taking the time; as always you are super helpful and it is greatly appreciated :)
Hi there @tomweber-sas! :)
I hope you are well! My SAS container in jupyter with SASpy is working great, but I have noticed an issue with larger datasets and I'm curious what mitigations there are.
Basically, if I run the following, it works (file size is 5 KB):
import saspy
import pandas as pd
df = pd.read_sas('/home/jovyan/work/airline.sas7bdat', format='sas7bdat')
df.describe()
Then if I run on an 8 MB file:
Oddly, though, if I then run on a much larger file (file size is 250 MB), I get no output at all.
Potential Mitigations
a) Find a proper solution to pass large files to the SAS session
b) Run SASPy + SAS together in same container
c) Give both containers a shared volume mount in k8s
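For option (c), a shared mount between the two containers could be sketched roughly like this (all names, images, and the PersistentVolumeClaim are illustrative, not from the thread):

```yaml
# Sketch: both containers in one pod mounting the same volume,
# so a data set written by one is visible to the other.
apiVersion: v1
kind: Pod
metadata:
  name: sas-jupyter            # illustrative name
spec:
  volumes:
    - name: sasdata
      persistentVolumeClaim:
        claimName: sasdata-pvc # illustrative claim
  containers:
    - name: jupyter
      image: jupyter/minimal-notebook
      volumeMounts:
        - name: sasdata
          mountPath: /home/jovyan/work
    - name: sas
      image: my-sas-image      # illustrative image
      volumeMounts:
        - name: sasdata
          mountPath: /shared/data
```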