feat: Automatic sanitization of sensitive data in the output (#1842)

* feat: Automatic output sanitization to obscure sensitive data by default
  Ref: #1794
* test: more stable test
* test: CLI test
* test: reuse setup code
* test: more tests
* chore: support pytest
* chore: mask config
* chore: mask config
* chore: do not use hooks
* docs: update
* chore: naming
schemathesis · Oct 18, 2023 · bde2ec7
1 parent 34f2bcd

Showing 23 changed files with 926 additions and 98 deletions.
diff --git a/docs/api.rst b/docs/api.rst
@@ -107,6 +107,16 @@ Loaders
 .. autofunction:: schemathesis.graphql.from_url
 .. autofunction:: schemathesis.graphql.from_wsgi
 
+Sanitizing Output
+~~~~~~~~~~~~~~~~~
+
+.. autoclass:: schemathesis.sanitization.Config()
+
+  .. automethod:: with_keys_to_sanitize
+  .. automethod:: without_keys_to_sanitize
+  .. automethod:: with_sensitive_markers
+  .. automethod:: without_sensitive_markers
+
 Schema
 ~~~~~~
 

diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -13,8 +13,9 @@ Changelog
 - Automatic FastAPI fixup injecting for ASGI loaders, eliminating the need for manual setup. `#1797`_
 - Support for ``body`` hooks in GraphQL schemas, enabling custom filtering or modification of queries and mutations. `#1464`_
 - New ``filter_operations`` hook to conditionally include or exclude specific API operations from being tested.
-- Introduced a new CLI option ``--experimental=openapi-3.1`` for experimental support of OpenAPI 3.1. This enables compatible JSON Schema validation for responses, while data generation remains OpenAPI 3.0-compatible. `#1820`_
 - Added ``contains`` method to ``ParameterSet`` for easier parameter checks in hooks. `#1789`_
+- Automatic sanitization of sensitive data in the output is now enabled by default. This feature can be disabled using the ``--sanitize-output=false`` CLI option. For more advanced customization, use ``schemathesis.sanitization.configure()``. `#1794`_
+- ``--experimental=openapi-3.1`` CLI option for experimental support of OpenAPI 3.1. This enables compatible JSON Schema validation for responses, while data generation remains OpenAPI 3.0-compatible. `#1820`_
 
 **Note**: Experimental features can change or be removed in any minor version release.
 
@@ -3476,6 +3477,7 @@ Deprecated
 .. _#1802: https://github.com/schemathesis/schemathesis/issues/1802
 .. _#1801: https://github.com/schemathesis/schemathesis/issues/1801
 .. _#1797: https://github.com/schemathesis/schemathesis/issues/1797
+.. _#1794: https://github.com/schemathesis/schemathesis/issues/1794
 .. _#1789: https://github.com/schemathesis/schemathesis/issues/1789
 .. _#1788: https://github.com/schemathesis/schemathesis/issues/1788
 .. _#1783: https://github.com/schemathesis/schemathesis/issues/1783

diff --git a/docs/index.rst b/docs/index.rst
@@ -217,6 +217,7 @@ User's Guide
    contrib
    stateful
    how
+   sanitizing
    compatibility
    examples
    graphql

diff --git a/docs/sanitizing.rst b/docs/sanitizing.rst
@@ -0,0 +1,45 @@
+.. _sanitizing-output:
+
+Sanitizing Output
+=================
+
+Schemathesis automatically sanitizes sensitive data in both the generated test case and the received response to prevent accidental exposure of sensitive information.
+This feature replaces certain headers, cookies, and other fields that could contain sensitive data with the string ``[Filtered]``.
+
+.. note::
+   Schemathesis does not sanitize sensitive data in response bodies due to the challenge of preserving the original formatting of the payload.
+
+You can control this feature through the ``--sanitize-output`` CLI option:
+
+.. code-block:: bash
+
+   schemathesis run --sanitize-output=false ...
+
+Or in Python tests:
+
+.. code-block:: python
+
+    schema = schemathesis.from_dict({...}, sanitize_output=False)
+
+Disabling this option will turn off the automatic sanitization of sensitive data in the output.
+
+For more advanced customization of the sanitization process, you can define your own sanitization configuration and pass it to the ``configure`` function.
+Here's how you could do it:
+
+.. code-block:: python
+
+    import schemathesis
+
+    # Create a custom config
+    custom_config = (
+        schemathesis.sanitization.Config(replacement="[Custom]")
+        .with_keys_to_sanitize("X-Customer-ID")
+        .with_sensitive_markers("address")
+    )
+
+    # Configure Schemathesis to use your custom sanitization configuration
+    schemathesis.sanitization.configure(custom_config)
+
+This will sanitize the ``X-Customer-ID`` headers (case-insensitive), and any fields containing the substring "address" (case-insensitive) in their names, with the string "[Custom]" in the generated test case and the received response.
+
+This will sanitize the ``X-Customer-ID`` headers, and any fields containing the substring "address" in their names, with the string "[Custom]" in the generated test case and the received response.
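
Note on the matching rules described in docs/sanitizing.rst: the sketch below illustrates one way exact keys (case-insensitive) and substring markers (case-insensitive) could be applied to a header mapping. It is a standalone illustration, not the code from this commit; `KEYS_TO_SANITIZE`, `SENSITIVE_MARKERS`, `is_sensitive`, and `sanitize_mapping` are hypothetical names, and the sets hold example values rather than the library's defaults.

```python
from typing import Dict, FrozenSet

# Hypothetical example values, stored lowercase so lookups are case-insensitive.
KEYS_TO_SANITIZE: FrozenSet[str] = frozenset({"authorization", "x-customer-id"})
SENSITIVE_MARKERS: FrozenSet[str] = frozenset({"token", "address"})


def is_sensitive(name: str) -> bool:
    """Return True if the name is an exact key match or contains a marker."""
    lowered = name.lower()
    return lowered in KEYS_TO_SANITIZE or any(marker in lowered for marker in SENSITIVE_MARKERS)


def sanitize_mapping(data: Dict[str, str], replacement: str = "[Custom]") -> None:
    """Replace the values of sensitive keys in place; other keys are untouched."""
    for key in data:
        if is_sensitive(key):
            data[key] = replacement


headers = {"X-Customer-ID": "42", "Billing-Address": "1 Main St", "Accept": "application/json"}
sanitize_mapping(headers)
```

After the call, only the `Accept` value keeps its original content: `X-Customer-ID` matches an exact key and `Billing-Address` contains the marker "address".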
diff --git a/docs/service.rst b/docs/service.rst
@@ -122,14 +122,32 @@ Each failure is accompanied by a cURL snippet you can use to reproduce the issue
 
 .. image:: https://raw.githubusercontent.com/schemathesis/schemathesis/master/img/service_server_error.png
 
-Alternatively, you can use the **Replay** button on the failure page.
-
-What data is sent?
+What Data is Sent?
 ------------------
 
-CLI sends info to Schemathesis.io in the following cases:
+The following data is included in the reports sent to Schemathesis.io by the CLI:
+
+- **Metadata**:
+
+  - Information about your host machine to help us understand our users better.
+  - Collected data includes your Python interpreter version, implementation, system/OS name, and release.
+
+- **Test Runs**:
+
+  - Most of the Schemathesis runner's events are included, encompassing all generated data and explicitly passed headers.
+  - Sensitive data within the generated test cases and received responses is automatically sanitized by default, replaced with the string ``[Filtered]`` to prevent accidental exposure.
+  - Further information on what is considered sensitive and how it is sanitized can be found at :ref:`Sanitizing Output <sanitizing-output>`.
+
+- **Environment Variables**:
+
+  - Some environment variables specific to CI providers are included.
+  - These are used to comment on pull requests.
+
+- **Command-Line Options**:
+
+  - Command-line options without free-form values are sent to help us understand how you use the CLI.
+  - Rest assured, any sensitive data passed through command-line options is sanitized by default.
+
+For more details on our data handling practices, please refer to our `Privacy Policy <https://schemathesis.io/legal/privacy>`_. If you have further questions or concerns about data handling, feel free to contact us at `support@schemathesis.io <mailto:support@schemathesis.io>`_.
 
-- Authentication. Metadata about your host machine, that helps us to understand our users better. We collect your Python interpreter version, implementation, system/OS name and release. For more information look at ``service/metadata.py``
-- Test runs. Most of Schemathesis runner's events, including all generated data and explicitly passed headers. For more information look at ``service/serialization.py``
-- Some environment variables specific to CI providers. We use them to comment on pull requests.
-- Command-line options without free-form values. It helps us to understand how you use the CLI.
+For information on data access, retention, and deletion, please refer to the `FAQ section <https://docs.schemathesis.io/faq>`_ in our SaaS documentation.
diff --git a/src/schemathesis/cli/__init__.py b/src/schemathesis/cli/__init__.py
@@ -55,6 +55,7 @@
 from .handlers import EventHandler
 from .junitxml import JunitXMLHandler
 from .options import CsvChoice, CsvEnumChoice, CustomHelpMessageChoice, NotSet, OptionalInt
+from .sanitization import SanitizationHandler
 
 try:
     from yaml import CSafeLoader as SafeLoader
@@ -501,6 +502,13 @@ class ReportToService:
     help="Force Schemathesis to parse the input schema with the specified spec version.",
     type=click.Choice(["20", "30"]),
 )
+@click.option(
+    "--sanitize-output",
+    type=bool,
+    default=True,
+    show_default=True,
+    help="Enable or disable automatic output sanitization to obscure sensitive data.",
+)
 @click.option(
     "--contrib-unique-data",
     "contrib_unique_data",
@@ -665,6 +673,7 @@ def run(
     stateful: Optional[Stateful] = None,
     stateful_recursion_limit: int = DEFAULT_STATEFUL_RECURSION_LIMIT,
     force_schema_version: Optional[str] = None,
+    sanitize_output: bool = True,
     contrib_unique_data: bool = False,
     contrib_openapi_formats_uuid: bool = False,
     hypothesis_database: Optional[str] = None,
@@ -838,6 +847,7 @@ def run(
         code_sample_style=code_sample_style,
         data_generation_methods=data_generation_methods,
         debug_output_file=debug_output_file,
+        sanitize_output=sanitize_output,
         host_data=host_data,
         client=client,
         report=report,
@@ -1137,6 +1147,7 @@ def execute(
     code_sample_style: CodeSampleStyle,
     data_generation_methods: Tuple[DataGenerationMethod, ...],
     debug_output_file: Optional[click.utils.LazyFile],
+    sanitize_output: bool,
     host_data: service.hosts.HostData,
     client: Optional[service.ServiceClient],
     report: Optional[Union[ReportToService, click.utils.LazyFile]],
@@ -1190,6 +1201,8 @@ def execute(
             cassettes.CassetteWriter(cassette_path, preserve_exact_body_bytes=cassette_preserve_exact_body_bytes)
         )
     handlers.append(get_output_handler(workers_num))
+    if sanitize_output:
+        handlers.insert(0, SanitizationHandler())
     execution_context = ExecutionContext(
         hypothesis_settings=hypothesis_settings,
         workers_num=workers_num,

diff --git a/src/schemathesis/cli/sanitization.py b/src/schemathesis/cli/sanitization.py
@@ -0,0 +1,15 @@
+from dataclasses import dataclass
+
+from ..runner import events
+from ..sanitization import sanitize_serialized_check, sanitize_serialized_interaction
+from .handlers import EventHandler, ExecutionContext
+
+
+@dataclass
+class SanitizationHandler(EventHandler):
+    def handle_event(self, context: ExecutionContext, event: events.ExecutionEvent) -> None:
+        if isinstance(event, events.AfterExecution):
+            for check in event.result.checks:
+                sanitize_serialized_check(check)
+            for interaction in event.result.interactions:
+                sanitize_serialized_interaction(interaction)
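
Note on handler ordering: `execute` registers the sanitizer with `handlers.insert(0, SanitizationHandler())`, and the handler mutates checks and interactions in place. Assuming handlers receive each event in list order, every later handler would then see already-masked data. The sketch below illustrates that idea with stand-in types; `FakeInteraction`, `FakeAfterExecution`, `MaskingHandler`, `RecordingHandler`, and `dispatch` are hypothetical and not part of Schemathesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeInteraction:
    """Stand-in for a serialized interaction; only headers are modeled."""
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeAfterExecution:
    """Stand-in for an event emitted after a test has run."""
    interactions: List[FakeInteraction] = field(default_factory=list)


class MaskingHandler:
    """Masks the Authorization header in place on matching events."""

    def handle_event(self, event: object) -> None:
        if isinstance(event, FakeAfterExecution):
            for interaction in event.interactions:
                for name in interaction.headers:
                    if name.lower() == "authorization":
                        interaction.headers[name] = "[Filtered]"


class RecordingHandler:
    """Records a copy of the headers it observes, like an output writer would."""

    def __init__(self) -> None:
        self.seen: List[Dict[str, str]] = []

    def handle_event(self, event: object) -> None:
        if isinstance(event, FakeAfterExecution):
            for interaction in event.interactions:
                self.seen.append(dict(interaction.headers))


def dispatch(handlers: list, event: object) -> None:
    """Assumed dispatch model: each handler sees the event in list order."""
    for handler in handlers:
        handler.handle_event(event)


event = FakeAfterExecution([FakeInteraction({"Authorization": "Bearer secret", "Accept": "*/*"})])
recorder = RecordingHandler()
handlers = [recorder]
handlers.insert(0, MaskingHandler())  # the sanitizer goes first, as in the diff
dispatch(handlers, event)
```

Because the masking handler runs first and edits the shared objects, the recorder never observes the raw `Authorization` value.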
diff --git a/src/schemathesis/lazy.py b/src/schemathesis/lazy.py
@@ -50,6 +50,7 @@ class LazySchema:
     data_generation_methods: Union[DataGenerationMethodInput, NotSet] = NOT_SET
     code_sample_style: CodeSampleStyle = CodeSampleStyle.default()
     rate_limiter: Optional[Limiter] = None
+    sanitize_output: bool = True
 
     def hook(self, hook: Union[str, Callable]) -> Callable:
         return self.hooks.register(hook)
@@ -116,6 +117,7 @@ def wrapped_test(request: FixtureRequest) -> None:
                     code_sample_style=_code_sample_style,
                     app=self.app,
                     rate_limiter=self.rate_limiter,
+                    sanitize_output=self.sanitize_output,
                 )
                 fixtures = get_fixtures(test, request, given_kwargs)
                 # Changing the node id is required for better reporting - the method and path will appear there
@@ -276,6 +278,7 @@ def get_schema(
     data_generation_methods: Union[DataGenerationMethodInput, NotSet] = NOT_SET,
     code_sample_style: CodeSampleStyle,
     rate_limiter: Optional[Limiter],
+    sanitize_output: bool,
 ) -> BaseSchema:
     """Loads a schema from the fixture."""
     schema = request.getfixturevalue(name)
@@ -296,6 +299,7 @@ def get_schema(
         data_generation_methods=data_generation_methods,
         code_sample_style=code_sample_style,
         rate_limiter=rate_limiter,
+        sanitize_output=sanitize_output,
     )
 
 

diff --git a/src/schemathesis/models.py b/src/schemathesis/models.py
@@ -57,6 +57,7 @@
 )
 from .hooks import GLOBAL_HOOK_DISPATCHER, HookContext, HookDispatcher, dispatch
 from .parameters import Parameter, ParameterSet, PayloadAlternatives
+from .sanitization import sanitize_request, sanitize_response
 from .serializers import Serializer, SerializerContext
 from .types import Body, Cookies, FormData, Headers, NotSet, PathParameters, Query
 from .utils import (
@@ -471,6 +472,9 @@ def validate_response(
                 else self.operation.schema.code_sample_style
             )
             verify = getattr(response, "verify", True)
+            if self.operation.schema.sanitize_output:
+                sanitize_request(response.request)
+                sanitize_response(response)
             code_message = self._get_code_message(code_sample_style, response.request, verify=verify)
             payload = get_response_payload(response)
             raise exception_cls(