
Large Files and Remote SAS via IOM on Jupyter #151

Closed · sylus opened this issue Jul 31, 2018 · 7 comments

Comments
@sylus commented Jul 31, 2018

Hi there @tomweber-sas! :)

I hope you are well! My SAS container in Jupyter with SASpy is working great, but I've noticed an issue with larger datasets and am curious what mitigations are available.

Basically, if I run the following on a 5 KB file, it works:

import saspy
import pandas as pd

df = pd.read_sas('/home/jovyan/work/airline.sas7bdat', format='sas7bdat')
df.describe()

(screenshot: df.describe() output)

Then if I run it on an 8 MB file, it also works:

import saspy
import pandas as pd

df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format='sas7bdat')
df.describe()

(screenshot: df.describe() output)

Oddly, though, if I then run it on a much larger file (250 MB), I get no output at all.

Potential Mitigations

a) Find a proper way to pass large files to the SAS session
b) Run SASPy + SAS together in the same container
c) Give both containers a shared volume mount in k8s

@sylus sylus changed the title Large Files and Remote SAS via IOM Large Files and Remote SAS via IOM on Jupyter Jul 31, 2018
@tomweber-sas (Contributor) commented Jul 31, 2018

Hey William! So, the code you're using in these examples uses neither saspy nor SAS. It's using pandas to read a SAS data set into python as a pandas DataFrame. If I remember right, python (Jupyter) is outside your container, on the client side? I can't say why pandas isn't producing output when you read a larger table, but is there enough memory for the python process where it's running?
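
One quick way to test whether memory is the bottleneck (a hedged sketch, not from the thread; the chunk size is illustrative): pandas' read_sas accepts a chunksize argument, in which case it returns an iterator of DataFrames instead of loading the whole file at once.

import pandas as pd

# with chunksize set, read_sas returns an iterator rather than one big DataFrame
reader = pd.read_sas('/home/jovyan/work/trauma.sas7bdat',
                     format='sas7bdat', chunksize=100_000)

rows = 0
for chunk in reader:    # each chunk is a DataFrame of up to 100,000 rows
    rows += len(chunk)  # do per-chunk work here instead of holding everything
print(rows)

If the chunked loop completes but the single read_sas call produces nothing, the python process is likely running out of memory.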

@sylus (Author) commented Jul 31, 2018

Ah yeah, thanks for the clarification. Right now we are trying to figure out how to copy files over to the SAS container for the workload, since it is a separate container. I'm hoping I'm at least using the right approach to start?

I will take a look at the python process, as that makes sense ^_^

@tomweber-sas (Contributor) commented Jul 31, 2018

Ah, so you're looking to read the SAS data set into python and then use df2sd() (dataframe2sasdata()) to transfer it to SAS? That is a way to do it. With the IOM access method, that will generate a data step in SAS and stream the rows from the data frame over to SAS, writing them out as the SAS data set.

Both of your cases (b and c) would also work, and it would be more performant to have the SAS session access the data set directly. But I know you were trying to avoid putting saspy inside the container. Mounting a volume for the container would allow you to keep python on the outside and still give the SAS session direct access to the data. I guess it just depends upon your constraints.

Tom

@sylus (Author) commented Jul 31, 2018

Hey @tomweber-sas, would you mind providing more info on how that works? I can't seem to find docs showing an example. Sorry to be a pain ^_^

  1. I launch a python kernel and run the following steps:

import saspy
import pandas as pd

df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format='sas7bdat')
hr = pd.df2sd(df)

  2. What do I now need to do in the SAS kernel to call the file?

@tomweber-sas (Contributor) commented Jul 31, 2018

You're no pain, William. df2sd() is a method of the saspy SASsession object, not of pandas. Here's a sequence (hand-typed, so correct any typos :) ):

import saspy
import pandas as pd

# use pandas module to read a local SAS data set into a data frame.
df = pd.read_sas('/home/jovyan/work/trauma.sas7bdat', format='sas7bdat')

# connect to SAS server; sas is a SASsession object with all of its methods
sas = saspy.SASsession()

# use saspy to transfer the data frame to a SAS data set on the SAS server we're connected to
hr = sas.df2sd(df, 'tablename', 'libref-or-WORK-is-the-default')

# hr is a SASdata object and all of its methods are available ...
hr.head()
hr.describe()
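
As a quick round-trip check (a hedged sketch, not from the thread), sd2df() is the inverse of df2sd() and pulls a SAS data set back down into a pandas DataFrame:

# pull the SAS data set back into a pandas DataFrame to verify the transfer
df2 = sas.sd2df('tablename', 'work')
print(df2.shape)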

The doc for saspy is here: https://sassoftware.github.io/saspy/api.html
It's not a cookbook, but if you look through the methods available, you'll see some patterns. There's also a sort of walkthrough that should help too, here: https://github.com/sassoftware/saspy/blob/master/saspy_example_github.ipynb

Let me know if this makes sense or if you have any problems!
Tom

@tomweber-sas (Contributor) commented Jul 31, 2018

Oh, and you mentioned the sas_kernel. The code above is python, so it runs in a python kernel. If you want to access that data via the sas_kernel, you would still need to move the data over with saspy directly, and write it to a permanent library, not WORK. Then you could connect with a sas_kernel and access it directly; well, that is if there's a common storage location all of the docker images can get at. A connection from the sas_kernel would be to a different docker instance, right? So the SAS data set wouldn't be in that container, right?
If there's a common location that the docker images can all see, then this would work. You would need to assign a libref pointing to that storage location to write/read that SAS data set. Here's the extra line in saspy:

# connect to SAS server; sas is a SASsession object with all of its methods
sas = saspy.SASsession()

# assign a permanent libref to write the data to; the path goes in the
# path= keyword (saslib's second positional argument is the engine)
sas.saslib('perm', path='/perm/storage/path')

# use saspy to transfer the data frame to a SAS data set on the SAS server we're connected to
hr = sas.df2sd(df, 'tablename', 'perm')
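
On the sas_kernel side, a hedged sketch (the libref and table name are the illustrative ones above): a sas_kernel cell would run the SAS statements directly, and the same statements can be tested from python with saspy's submit(), which returns the SAS log and listing output.

# the SAS code a sas_kernel cell would run, submitted here through saspy;
# submit() returns a dict with the 'LOG' and 'LST' (listing) output
result = sas.submit("""
libname perm '/perm/storage/path';
proc means data=perm.tablename; run;
""")
print(result['LST'])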

Make sense?
Tom

@sylus (Author) commented Jul 31, 2018

Ah that makes so much sense!

Thanks for walking me through this and I will try this tonight.

I was also made aware of a potential third option: store all of our data in the Azure Blob Storage service (which we want to do anyway) and have both containers mount it, per: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-how-to-mount-container-linux

So I have a range of options :) Thanks so much for taking the time; as always you are super helpful and it is greatly appreciated :)
