Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise error if file format is incorrect #73

Merged
merged 7 commits into from Oct 30, 2020

Conversation

dominiquesydow
Copy link
Contributor

@dominiquesydow dominiquesydow commented Oct 22, 2020

Code of Conduct

Description

Raise ValueError if an incorrect file format is loaded using the mol2 or pdb modules.

Two checks per mol2/pdb module would be optimal:

  1. When loading DataFrames from file: Check if extension is correct (.mol2/.mol2.gz or .pdb/pdb.gz, respectively).
  2. When loading DataFrames from text: Check for certain tags in the text?

This PR suggests the following changes:

  • pdb from file: Raise ValueError in PandasPdb._read_pdb if file extension not .pdb/.pdb.gz
  • TBD: pdb from text: Add len(dfs['ATOM'] == 0 to check if any atomic data was loaded? However, I am not sure if you want
  • mol2 from text: Raise ValueError in PandasPdb._get_atomsection if format is not mol2 (@<TRIPOS>ATOM not found)
    this behaviour in general.
  • TBD: mol2 from file: Raise error if file extension is incorrect in mol2.mol2_io.split_multimol2. I tried to add:
    # mol2_io.py
    def split_multimol2(mol2_path):
        if mol2_path.endswith('.mol2'):
            open_file = open
            read_mode = 'r'
        elif mol2_path.endswith('mol2.gz'):
            open_file = gzip.open
            read_mode = 'rb'
        else:
            raise ValueError('Wrong file format; allowed file formats are .mol2 and .mol2.gz.')
        # rest of the function's code

However, the error is not thrown, which I don't understand right now.

Related issues or pull requests

Suggests changes for #71.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./biopandas/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under biopandas/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./biopandas -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./biopandas/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./biopandas

@rasbt
Copy link
Member

rasbt commented Oct 22, 2020

Thanks for the PR!

Right now, the unit tests are based on nosetest which is why importing pytest failed the unit tests. I moved another project to pytest and like it overall, however, for biopandas, this is best done in a separate PR (in the future). So, I made some adjustments to the unittests here.

TBD: mol2 from file: Raise error if file extension is incorrect in mol2.mol2_io.split_multimol2. I tried to add:

Huh, yeah, that was a weird one! It took me a bit to figure it out but then I remembered that I implemented split_multimol2 as a generator. So, split_multimol2('40_mol2_files.pdb') didn't do anything in the unit test and it required a change to next(split_multimol2('40_mol2_files.pdb')).

pdb from text: Add len(dfs['ATOM'] == 0 to check if any atomic data was loaded? However, I am not sure if you want

I generally think that's a good idea. Maybe raising a warning would be a good compromise to notify users but allow them to investigate PDB files that don't have atom entries (but only hetatm/anisou entries etc).

@coveralls
Copy link

coveralls commented Oct 22, 2020

Coverage Status

Coverage increased (+0.3%) to 94.558% when pulling 7020057 on dominiquesydow:raise-invalid-file-error into 8da42c2 on rasbt:master.

@dominiquesydow
Copy link
Contributor Author

Thanks, excellent!!
My apologies for the pytest - nosetest confusion and thanks for fixing it.
I will look into the len(dfs['ATOM'] == 0) check later (latest beginning of next week).

@rasbt
Copy link
Member

rasbt commented Oct 26, 2020

I appreciate the PR, and no worries at all!

@pep8speaks
Copy link

pep8speaks commented Oct 30, 2020

Hello @dominiquesydow! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-30 14:30:40 UTC

@dominiquesydow
Copy link
Contributor Author

pdb from text: Add len(dfs['ATOM'] == 0 to check if any atomic data was loaded? However, I am not sure if you want

I generally think that's a good idea. Maybe raising a warning would be a good compromise to notify users but allow them to investigate PDB files that don't have atom entries (but only hetatm/anisou entries etc).

Agreed! I added the warning (only checking for missing ATOM entries).

Is there anything else I should address in this PR that I might have missed?

@rasbt
Copy link
Member

rasbt commented Oct 30, 2020

This looks great. Thanks a lot. It's good to merge!

@rasbt rasbt merged commit 061eb70 into BioPandas:master Oct 30, 2020
nozomu-y pushed a commit to nozomu-y/biopandas that referenced this pull request Jan 9, 2023
…e-error

Raise error if file format is incorrect
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants