
Feat/untangling imports #161

Closed · wants to merge 11 commits

Conversation

svaningelgem (Contributor)

@tools4origins: you're not going to like this one when merging :(. Sorry about that!

Cleanup of imports:

- Renamed `stat_counter.py` to `statcounter.py`, as that's the name used in pyspark.
- Moved as much as I could into 'terminals': modules that do _not_ depend on anything else in pysparkling, only on external packages. There's still a lot to do here!
- Moved code out of the pysparkling root into pysparkling.sql, because the SQL code just does not belong in the root!
- The most glaring example of this is the `toDF` method, which moved to `sql/session.py` and is monkey-patched onto the RDD, just like pyspark does it (don't re-invent the wheel :)). See the sketch after this description.
- Moved pysparkling-internal code that is not defined in pyspark into private modules (modules starting with `_`). This helped to reduce the complexity a lot.
- Added a script to compare pyspark and pysparkling to see what still needs to be implemented.

What I'm trying to achieve here is to reduce the dependency hell of the "SQL" module.
Not much code has been changed (some code was copied from pyspark to make things easier); mostly, methods were moved around into different files.
I also started to make a distinction between pysparkling internals and the pyspark interface. By that I mean: pyspark has a certain file structure, which I kept as rigorously as possible (it was already largely that way anyhow), but pysparkling-specific methods I tried to move into "private" modules (`_types`, `_casts`, `_expressions`, ...).
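For reference, the monkey-patching pattern mentioned above mirrors what pyspark does in its own `pyspark/sql/session.py`. A minimal sketch, assuming pysparkling exposes an `RDD` class and a session object with a pyspark-style `createDataFrame`:

```python
# Sketch of the monkey-patching pattern (after pyspark's _monkey_patch_RDD
# in pyspark/sql/session.py). The pysparkling names used here are
# assumptions, not necessarily the exact code in this PR.
from pysparkling import RDD


def _monkey_patch_rdd(session):
    def toDF(self, schema=None, sampleRatio=None):
        """Convert this RDD into a DataFrame through the given session."""
        return session.createDataFrame(self, schema, sampleRatio)

    # Attaching toDF from the SQL side keeps the RDD module itself
    # free of any imports from pysparkling.sql.
    RDD.toDF = toDF
```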

svaningelgem (Contributor, Author) commented Mar 1, 2021

Report:

I will commit this in a follow-up commit, but maybe we can discuss what the direction should be?

For me, I'd like the public interface of pysparkling to be the same as pyspark's, with any internal details handled in `_name.py` files. That shows clearly what is pysparkling-internal code (`_...`) and what is not.

The "problem" I see is what happens when pyspark changes, and whether we want to focus our efforts on maintaining an all-inclusive library.

For me, I need compatibility with pyspark 2.3.2 at the very least.

Files that should not be exposed:

  • cache_manager.py
  • exceptions.py
  • fileio\codec\bz2.py
  • fileio\codec\codec.py
  • fileio\codec\gz.py
  • fileio\codec\lzma.py
  • fileio\codec\sevenz.py
  • fileio\codec\tar.py
  • fileio\codec\zip.py
  • fileio\file.py
  • fileio\fs\file_system.py
  • fileio\fs\gs.py
  • fileio\fs\hdfs.py
  • fileio\fs\http.py
  • fileio\fs\local.py
  • fileio\fs\s3.py
  • fileio\textfile.py
  • partition.py
  • samplers.py
  • sql\internal_utils\column.py
  • sql\internal_utils\joins.py
  • sql\internal_utils\options.py
  • sql\internal_utils\readers\common.py
  • sql\internal_utils\readers\csvreader.py
  • sql\internal_utils\readers\jsonreader.py
  • sql\internal_utils\readers\textreader.py
  • sql\internal_utils\readers\utils.py
  • sql\internal_utils\readwrite.py
  • sql\internal_utils\writers.py
  • sql\internals.py
  • streaming\filestream.py
  • streaming\queuestream.py
  • streaming\tcpstream.py
  • task_context.py
  • utils.py

REPORT: pyspark vs pysparkling

  • pyspark.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • Accumulator
    • AccumulatorParam
    • BarrierTaskContext
    • BarrierTaskInfo
    • BasicProfiler
    • MarshalSerializer
    • PickleSerializer
    • Profiler
    • RDDBarrier
    • SparkConf
    • SparkContext
    • SparkFiles
    • SparkJobInfo
    • SparkStageInfo
    • StatusTracker
    • TaskContext
    pysparkling has these extra:
    • CacheManager
    • Context
    • Row
    • StatCounter
    • TimedCacheManager
    • exceptions
    • fileio
    • streaming
  • pyspark._globals
    --> CANNOT LOAD pysparkling._globals
  • pyspark.accumulators
    ==> All ok
  • pyspark.broadcast
    ==> All ok
  • pyspark.cloudpickle
    --> CANNOT LOAD pysparkling.cloudpickle
  • pyspark.conf
    --> CANNOT LOAD pysparkling.conf
  • pyspark.context
    __all__ is not defined in pysparkling! Going to check the symbols anyway.
    pysparkling is still missing:
    • SparkContext
  • pyspark.daemon
    --> CANNOT IMPORT pyspark.daemon
  • pyspark.files
    --> CANNOT LOAD pysparkling.files
  • pyspark.find_spark_home
    --> CANNOT LOAD pysparkling.find_spark_home
  • pyspark.heapq3
    --> CANNOT LOAD pysparkling.heapq3
  • pyspark.java_gateway
    --> CANNOT LOAD pysparkling.java_gateway
  • pyspark.join
    --> CANNOT LOAD pysparkling.join
  • pyspark.ml.__init__
    --> CANNOT LOAD pysparkling.ml.__init__
  • pyspark.ml.base
    --> CANNOT LOAD pysparkling.ml.base
  • pyspark.ml.classification
    --> CANNOT LOAD pysparkling.ml.classification
  • pyspark.ml.clustering
    --> CANNOT LOAD pysparkling.ml.clustering
  • pyspark.ml.common
    --> CANNOT LOAD pysparkling.ml.common
  • pyspark.ml.evaluation
    --> CANNOT LOAD pysparkling.ml.evaluation
  • pyspark.ml.feature
    --> CANNOT LOAD pysparkling.ml.feature
  • pyspark.ml.fpm
    --> CANNOT LOAD pysparkling.ml.fpm
  • pyspark.ml.functions
    --> CANNOT LOAD pysparkling.ml.functions
  • pyspark.ml.image
    --> CANNOT LOAD pysparkling.ml.image
  • pyspark.ml.linalg.__init__
    --> CANNOT LOAD pysparkling.ml.linalg.__init__
  • pyspark.ml.param.__init__
    --> CANNOT LOAD pysparkling.ml.param.__init__
  • pyspark.ml.param._shared_params_code_gen
    --> CANNOT LOAD pysparkling.ml.param._shared_params_code_gen
  • pyspark.ml.param.shared
    --> CANNOT LOAD pysparkling.ml.param.shared
  • pyspark.ml.pipeline
    --> CANNOT LOAD pysparkling.ml.pipeline
  • pyspark.ml.recommendation
    --> CANNOT LOAD pysparkling.ml.recommendation
  • pyspark.ml.regression
    --> CANNOT LOAD pysparkling.ml.regression
  • pyspark.ml.stat
    --> CANNOT LOAD pysparkling.ml.stat
  • pyspark.ml.tree
    --> CANNOT LOAD pysparkling.ml.tree
  • pyspark.ml.tuning
    --> CANNOT LOAD pysparkling.ml.tuning
  • pyspark.ml.util
    --> CANNOT LOAD pysparkling.ml.util
  • pyspark.ml.wrapper
    --> CANNOT LOAD pysparkling.ml.wrapper
  • pyspark.mllib.__init__
    --> CANNOT LOAD pysparkling.mllib.__init__
  • pyspark.mllib.classification
    --> CANNOT LOAD pysparkling.mllib.classification
  • pyspark.mllib.clustering
    --> CANNOT LOAD pysparkling.mllib.clustering
  • pyspark.mllib.common
    --> CANNOT LOAD pysparkling.mllib.common
  • pyspark.mllib.evaluation
    --> CANNOT LOAD pysparkling.mllib.evaluation
  • pyspark.mllib.feature
    --> CANNOT LOAD pysparkling.mllib.feature
  • pyspark.mllib.fpm
    --> CANNOT LOAD pysparkling.mllib.fpm
  • pyspark.mllib.linalg.__init__
    --> CANNOT LOAD pysparkling.mllib.linalg.__init__
  • pyspark.mllib.linalg.distributed
    --> CANNOT LOAD pysparkling.mllib.linalg.distributed
  • pyspark.mllib.random
    --> CANNOT LOAD pysparkling.mllib.random
  • pyspark.mllib.recommendation
    --> CANNOT LOAD pysparkling.mllib.recommendation
  • pyspark.mllib.regression
    --> CANNOT LOAD pysparkling.mllib.regression
  • pyspark.mllib.stat.__init__
    --> CANNOT LOAD pysparkling.mllib.stat.__init__
  • pyspark.mllib.stat._statistics
    --> CANNOT LOAD pysparkling.mllib.stat._statistics
  • pyspark.mllib.stat.distribution
    --> CANNOT LOAD pysparkling.mllib.stat.distribution
  • pyspark.mllib.stat.KernelDensity
    --> CANNOT LOAD pysparkling.mllib.stat.KernelDensity
  • pyspark.mllib.stat.test
    --> CANNOT LOAD pysparkling.mllib.stat.test
  • pyspark.mllib.tree
    --> CANNOT LOAD pysparkling.mllib.tree
  • pyspark.mllib.util
    --> CANNOT LOAD pysparkling.mllib.util
  • pyspark.profiler
    --> CANNOT LOAD pysparkling.profiler
  • pyspark.python.pyspark.shell
    21/03/01 20:30:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    --> CANNOT LOAD pysparkling.python.pyspark.shell
  • pyspark.rdd
    ==> All ok
  • pyspark.rddsampler
    --> CANNOT LOAD pysparkling.rddsampler
  • pyspark.resource
    --> CANNOT LOAD pysparkling.resource
  • pyspark.resultiterable
    --> CANNOT LOAD pysparkling.resultiterable
  • pyspark.serializers
    --> CANNOT LOAD pysparkling.serializers
  • pyspark.shell
    --> CANNOT LOAD pysparkling.shell
  • pyspark.shuffle
    --> CANNOT LOAD pysparkling.shuffle
  • pyspark.sql.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • Catalog
    • HiveContext
    • PandasCogroupedOps
    • UDFRegistration
    • Window
    • WindowSpec
  • pyspark.sql.avro.__init__
    --> CANNOT LOAD pysparkling.sql.avro.__init__
  • pyspark.sql.avro.functions
    --> CANNOT LOAD pysparkling.sql.avro.functions
  • pyspark.sql.catalog
    --> CANNOT LOAD pysparkling.sql.catalog
  • pyspark.sql.column
    ==> All ok
  • pyspark.sql.conf
    pysparkling is still missing:
    • _NoValue
    • _test
    • basestring
    • ignore_unicode_prefix
    • since
    • sys
    pysparkling has these extra:
    • _sentinel
  • pyspark.sql.context
    __all__ is not defined in pysparkling! Going to check the symbols anyway.
    pysparkling is still missing:
    • HiveContext
  • pyspark.sql.dataframe
    ==> All ok
  • pyspark.sql.functions
    __all__ is not the same:
    pysparkling is still missing:
    • PandasUDFType
    • approxCountDistinct
    • basestring
    • overlay
    • toDegrees
    • toRadians
    • to_str
    pysparkling has these extra:
    • callUDF
    • math
    • parse
    • typedLit
  • pyspark.sql.group
    ==> All ok
  • pyspark.sql.pandas.__init__
  • pyspark.sql.pandas.conversion
    --> CANNOT LOAD pysparkling.sql.pandas.conversion
  • pyspark.sql.pandas.functions
    --> CANNOT LOAD pysparkling.sql.pandas.functions
  • pyspark.sql.pandas.group_ops
    --> CANNOT LOAD pysparkling.sql.pandas.group_ops
  • pyspark.sql.pandas.map_ops
    --> CANNOT LOAD pysparkling.sql.pandas.map_ops
  • pyspark.sql.pandas.serializers
    --> CANNOT LOAD pysparkling.sql.pandas.serializers
  • pyspark.sql.pandas.typehints
    --> CANNOT LOAD pysparkling.sql.pandas.typehints
  • pyspark.sql.pandas.types
    --> CANNOT LOAD pysparkling.sql.pandas.types
  • pyspark.sql.pandas.utils
    pysparkling is still missing:
    • require_minimum_pyarrow_version
    pysparkling has these extra:
    • parse_pandas_version
  • pyspark.sql.readwriter
    ==> All ok
  • pyspark.sql.session
    ==> All ok
  • pyspark.sql.streaming
    --> CANNOT LOAD pysparkling.sql.streaming
  • pyspark.sql.types
    ==> All ok
  • pyspark.sql.udf
    --> CANNOT LOAD pysparkling.sql.udf
  • pyspark.sql.utils
    pysparkling is still missing:
    • SparkContext
    • py4j
    • raise_from
    • sys
    • unicode
  • pyspark.sql.window
    --> CANNOT LOAD pysparkling.sql.window
  • pyspark.statcounter
  • pyspark.status
    --> CANNOT LOAD pysparkling.status
  • pyspark.storagelevel
    ==> All ok
  • pyspark.streaming.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • StreamingListener
  • pyspark.streaming.context
    ==> All ok
  • pyspark.streaming.dstream
    ==> All ok
  • pyspark.streaming.kinesis
    --> CANNOT LOAD pysparkling.streaming.kinesis
  • pyspark.streaming.listener
    --> CANNOT LOAD pysparkling.streaming.listener
  • pyspark.streaming.util
    --> CANNOT LOAD pysparkling.streaming.util
  • pyspark.taskcontext
    --> CANNOT LOAD pysparkling.taskcontext
  • pyspark.traceback_utils
    --> CANNOT LOAD pysparkling.traceback_utils
  • pyspark.util
    --> CANNOT LOAD pysparkling.util
  • pyspark.version
    --> CANNOT LOAD pysparkling.version
  • pyspark.worker
    --> CANNOT LOAD pysparkling.worker
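For reference, the comparison behind this report boils down to importing each pyspark module, importing the pysparkling module at the same dotted path, and diffing the exported symbols. A minimal sketch of that idea (`compare_module` is a hypothetical name; the actual script added in this PR may differ):

```python
import importlib


def compare_module(pyspark_name):
    """Diff the public symbols of a pyspark module against its pysparkling twin."""
    reference = importlib.import_module(pyspark_name)
    own_name = pyspark_name.replace('pyspark', 'pysparkling', 1)
    try:
        own = importlib.import_module(own_name)
    except ImportError:
        print(f'    --> CANNOT LOAD {own_name}')
        return
    # Prefer the declared public API; fall back to dir() when __all__ is absent.
    expected = set(getattr(reference, '__all__', dir(reference)))
    actual = set(getattr(own, '__all__', dir(own)))
    if expected == actual:
        print('    ==> All ok')
        return
    for symbol in sorted(expected - actual):
        print(f'    missing: {symbol}')
    for symbol in sorted(actual - expected):
        print(f'    extra:   {symbol}')
```

Called in a loop over pyspark's module list, this produces output in the shape of the report above.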

svaningelgem (Contributor, Author)

I'm also thinking: should the private interface be implemented or not? I would say no, because it is subject to change at any time. But there are some things we do at our company that can only be done via the private interface... so that would make my life more difficult there.

I think I'll just concentrate on the public one for now.

svaningelgem (Contributor, Author)

Because of the untangled imports, I could now also add my long-standing dream of auto-injection. A proof of concept has been added in the latest commit (29a7d99).

The program itself does not change one bit, but the execution happens either via pyspark or via pysparkling:

Finished this run in : 11.449s (Injector was on: False)

vs

Finished this run in : 0.917s (Injector was on: True)
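For the curious: one minimal way such an injector can work is to alias pysparkling under the pyspark names in `sys.modules` before the program imports pyspark. This is only a sketch of the idea, not necessarily what commit 29a7d99 does:

```python
import importlib
import sys


def inject_pysparkling():
    """Make later `import pyspark ...` statements resolve to pysparkling.

    Sketch only: the real proof of concept may hook imports differently,
    e.g. via a meta path finder that covers every submodule.
    """
    for own_name in ('pysparkling', 'pysparkling.sql'):
        module = importlib.import_module(own_name)
        sys.modules[own_name.replace('pysparkling', 'pyspark', 1)] = module


inject_pysparkling()
import pyspark  # now resolves to pysparkling via the sys.modules cache
```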



```python
class SparkContext:
    """Accepts the same initialization parameters as pyspark, but redirects everything to Context."""
```
Collaborator

Why not subclass Context instead?

Contributor Author

Because I actually wanted to make SparkContext a singleton, and as far as I can see, Context is not a singleton.

Plus, I went a long way trying to copy the pyspark API into SparkContext, but finally ended up with the current implementation.

Long story short: inheriting and checking for single use in `__new__` should be better :-). I updated the code as such.
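For illustration, a minimal sketch of the "inherit and guard in `__new__`" pattern described above (the `_active` attribute and the error message are assumptions, not the exact code in this commit):

```python
from pysparkling import Context


class SparkContext(Context):
    """pyspark-compatible facade; only one instance may exist at a time (sketch)."""

    _active = None  # hypothetical guard attribute

    def __new__(cls, *args, **kwargs):
        if SparkContext._active is not None:
            raise ValueError('Only one SparkContext may be active at once.')
        instance = super().__new__(cls)
        SparkContext._active = instance
        return instance
```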

@@ -19,6 +20,8 @@

log = logging.getLogger(__name__)

__all__ = ['SparkContext']
Collaborator

It is a rather big change to the pysparkling API which we may want to discuss. It may not be aligned with the previous vision, where the API was deliberately different from pyspark's (namely here: it is not a SparkContext that is created, it's a pysparkling Context).

Contributor Author

OK, that needs to be discussed then. For me, I want it to be a drop-in replacement for pyspark (during tests). As such it would need to be 100% aligned with the pyspark API.

@svenkreiss, @tools4origins: what is your view on this?

I also refer to the auto-injector: https://github.com/svenkreiss/pysparkling/pull/161/files#diff-b6988b98630fdeab52d583b68acee7ba2d41304b72a201045c0c141c7fd502dc

@svaningelgem svaningelgem deleted the feat/untangling_imports branch March 11, 2021 07:47