Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Py3 #287

Merged
merged 85 commits into from
Dec 7, 2018
Merged

Py3 #287

merged 85 commits into from
Dec 7, 2018

Conversation

sheagcraig
Copy link
Contributor

@sheagcraig sheagcraig commented Oct 4, 2018

Once more, with feeling.

@sheagcraig
Copy link
Contributor Author

There's a couple of unicode vs. legacy python "strings" stuff to unravel in here still.

  1. We have a server.text_utils.safe_bytes function that is used in a a dozen places to take an input text type and ensure that the resulting output is a bytestring object (legacy python "string"). Django natively handles unicode, and as long as the database is configured to handle it (it is: docker/setup_db.sh, you can just push unicode into the database. So one of the next steps here is to clean up all of the coercion from bytes to unicode and vice versa for all Django and database code. This will almost certainly be tedious. I think it's pretty common to just speculatively try adding .decode() or .encode() to text in python 2 until things work when debugging "strange character" issues (guilty...), so they can grow like weeds throughout a codebase. The resolution here is to grep through all sal code looking for decode, encode, casting to str, bytes, or unicode, and probably any time the string 'utf-8' gets bandied about.
  2. One exception is URL construction. URLs have to be in ASCII, so we have to make sure all URLs are encoded before we are done with them. For example, links for applications in the app inventory could potentially have accents in them due to app name or version (nëthäck).
  3. Despite all this, postgres cannot store a null character in a TextField, which is the database field type used for the majority of Sal's stored data. This is a limitation in C, which treats a null character (python chr(0) == b'\x00' == '\x00'). Why do we care? Well, plists can include data of a variety of types: int, date, float, string, and data. Python's plistlib deserializes the plist <data> into a bystestring, and it can most certainly contain a null character. For example, the manufacturer data returned from the battery plugin, which is one of many pieces of data returned, can (and does) include null characters, causing postgres to FREAK OUT! This data is not being analyzed, and frankly, is not really needed. But the battery plugin is just an example; what needs to be done is to protect Sal from trying to write those rows in the first place. I'm going to just replace null characters with an empty string or maybe a smiling poo emoji so that the database can save the rest of the data.

@sheagcraig
Copy link
Contributor Author

Another potential breakage point is CSV generation. Python happily handles unicode in csv. Google sheets handles it just fine. Excel 2019 just treats it as garbage. I'm sure there's a way to get Excel to handle it, but I'm not feeling inclined to figure it out.

Unless someone feels very strongly about encoding all text going out from Sal into utf-8 encoded bytestrings for csvs, I'm going to leave the data as unicode and let the csv writer handle it as is.

@sheagcraig
Copy link
Contributor Author

sheagcraig commented Nov 8, 2018

Text TODOs

The current py3 tip has been running error free for me all day, but I don't really consider this done until we do a text audit, so here's what I'm up to:

  • Create replacement for server.text_utils.safe_bytes which returns unicode strings instead. This is needed for preventing null chars getting into data for saving into postgres.
  • Refactor all uses of safe_bytes to use new text utility (safe_text).
  • Audit for all uses of str.encode
  • Audit for all uses of bytes.decode
  • Audit for all uses of casting to str
  • Audit for all uses of casting to bytes
  • Audit for all uses of casting to unicode
  • Audit for all file open / writing modes; look for binary mode.
  • Look at text_utils.decode_to_string; this is more correctly now decode_to_bytes, but it's b64 decoding and possibly bzipping as well; it's not decoding a str or bytes to str!
  • Look at all plistlib ops; I've found several incorrect attempts to open a file handle with a string of plist data and then plistlib.load that, masked by an except Exception that always catches everything.
  • Audit for all subprocess usage for bytes vs str input and output.
  • Look at csv output.
  • Look for u'' strings

Submission TODOs

  • Write tests for main checkin.
  • Write tests for inventory submissions
  • Write tests for catalog submissions
  • Write tests for installlog submissions
  • Any time we store a dumped plist, it should be BinaryFIeld. (Confirm that it comes out as bytes)

@clburlison
Copy link
Contributor

Based on the docker images 3ef8d9fa9dcc (should be the last commit 4b51c76) I am having issues with the client checkin.

https://gist.github.com/clburlison/d5730d6c4e4edc9193ea9981d18d877a

@sheagcraig
Copy link
Contributor Author

Thanks @clburlison!

I think that's probably the #298 issue in another form. I'm hoping to merge in the fix for that soon so we can rebase the py3 branch off of it. I'll let you know once I do so, so you can give it another go.

@sheagcraig sheagcraig changed the base branch from master to version-4 November 30, 2018 14:45
@sheagcraig sheagcraig added the V4 To be added to version 4 label Nov 30, 2018
@sheagcraig
Copy link
Contributor Author

There's a crazy circeci /pip issue going on that wasn't here before:
ModuleNotFoundError: No module named 'pip._internal'

Not sure what to do about that one...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V4 To be added to version 4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants