
Feat/untangling imports #161

Closed · wants to merge 11 commits

Conversation

svaningelgem (Contributor)

@tools4origins: you're not going to like this one when merging :(. Sorry about that!

Cleanup of imports:

- Renamed `stat_counter.py` to `statcounter.py`, as that's the name used in pyspark.
- Moved as much as I could into 'terminals': modules that do _not_ depend on anything else in pysparkling, only on external packages. There's still a lot to do here!
- Moved code out of the pysparkling root into pysparkling.sql, because the SQL code just does not belong in the root!
- The most glaring example of this is the `toDF` method, which moved to `sql/session.py` and is monkey-patched onto the RDD, just like pyspark does it (don't re-invent the wheel :)). See the sketch after this description.
- Moved pysparkling-internal code that is not defined in pyspark into private modules (modules starting with `_`). This helped to reduce the complexity a lot.
- Added a script to compare pyspark and pysparkling to see what still needs to be implemented.

What I'm trying to achieve here is to reduce the dependency hell of the "SQL" module.
Not much code has been changed (some code was copied from pyspark to make things easier); mostly, methods were moved around into different files.
I also started to make a distinction between pysparkling internals and the pyspark interface. By that I mean: pyspark has a certain file structure, which I kept as rigorously as possible (it was already largely that way anyhow), but pysparkling-specific methods I tried to move into "private" modules (`_types`, `_casts`, `_expressions`, ...).
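For reference, the monkey-patching pattern mentioned above mirrors what pyspark does in its own `pyspark/sql/session.py`. A minimal sketch, assuming pysparkling exposes an `RDD` class and a session object with a pyspark-style `createDataFrame`:

```python
# Sketch of the monkey-patching pattern (after pyspark's _monkey_patch_RDD
# in pyspark/sql/session.py). The pysparkling names used here are
# assumptions, not necessarily the exact code in this PR.
from pysparkling import RDD


def _monkey_patch_rdd(session):
    def toDF(self, schema=None, sampleRatio=None):
        """Convert this RDD into a DataFrame through the given session."""
        return session.createDataFrame(self, schema, sampleRatio)

    # Attaching toDF from the SQL side keeps the RDD module itself
    # free of any imports from pysparkling.sql.
    RDD.toDF = toDF
```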

svaningelgem (Contributor, Author) commented Mar 1, 2021

Report:

I will commit this in a follow-up commit, but maybe we can discuss what the direction should be?

For me, I'd like the public interface of pysparkling to be the same as pyspark's, with any internal details handled in `_name.py` files. That shows clearly what is pysparkling-internal code (`_...`) and what is not.

The "problem" I see is what happens when pyspark changes, and whether we want to focus our efforts on maintaining an all-inclusive library.

For me, I need compatibility with pyspark 2.3.2 at the very least.

Files that should not be exposed:

  • cache_manager.py
  • exceptions.py
  • fileio\codec\bz2.py
  • fileio\codec\codec.py
  • fileio\codec\gz.py
  • fileio\codec\lzma.py
  • fileio\codec\sevenz.py
  • fileio\codec\tar.py
  • fileio\codec\zip.py
  • fileio\file.py
  • fileio\fs\file_system.py
  • fileio\fs\gs.py
  • fileio\fs\hdfs.py
  • fileio\fs\http.py
  • fileio\fs\local.py
  • fileio\fs\s3.py
  • fileio\textfile.py
  • partition.py
  • samplers.py
  • sql\internal_utils\column.py
  • sql\internal_utils\joins.py
  • sql\internal_utils\options.py
  • sql\internal_utils\readers\common.py
  • sql\internal_utils\readers\csvreader.py
  • sql\internal_utils\readers\jsonreader.py
  • sql\internal_utils\readers\textreader.py
  • sql\internal_utils\readers\utils.py
  • sql\internal_utils\readwrite.py
  • sql\internal_utils\writers.py
  • sql\internals.py
  • streaming\filestream.py
  • streaming\queuestream.py
  • streaming\tcpstream.py
  • task_context.py
  • utils.py

REPORT: pyspark vs pysparkling

  • pyspark.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • Accumulator
    • AccumulatorParam
    • BarrierTaskContext
    • BarrierTaskInfo
    • BasicProfiler
    • MarshalSerializer
    • PickleSerializer
    • Profiler
    • RDDBarrier
    • SparkConf
    • SparkContext
    • SparkFiles
    • SparkJobInfo
    • SparkStageInfo
    • StatusTracker
    • TaskContext
    pysparkling has these extra:
    • CacheManager
    • Context
    • Row
    • StatCounter
    • TimedCacheManager
    • exceptions
    • fileio
    • streaming
  • pyspark._globals
    --> CANNOT LOAD pysparkling._globals
  • pyspark.accumulators
    ==> All ok
  • pyspark.broadcast
    ==> All ok
  • pyspark.cloudpickle
    --> CANNOT LOAD pysparkling.cloudpickle
  • pyspark.conf
    --> CANNOT LOAD pysparkling.conf
  • pyspark.context
    __all__ is not defined in pysparkling! Going to check the symbols anyway.
    pysparkling is still missing:
    • SparkContext
  • pyspark.daemon
    --> CANNOT IMPORT pyspark.daemon
  • pyspark.files
    --> CANNOT LOAD pysparkling.files
  • pyspark.find_spark_home
    --> CANNOT LOAD pysparkling.find_spark_home
  • pyspark.heapq3
    --> CANNOT LOAD pysparkling.heapq3
  • pyspark.java_gateway
    --> CANNOT LOAD pysparkling.java_gateway
  • pyspark.join
    --> CANNOT LOAD pysparkling.join
  • pyspark.ml.__init__
    --> CANNOT LOAD pysparkling.ml.__init__
  • pyspark.ml.base
    --> CANNOT LOAD pysparkling.ml.base
  • pyspark.ml.classification
    --> CANNOT LOAD pysparkling.ml.classification
  • pyspark.ml.clustering
    --> CANNOT LOAD pysparkling.ml.clustering
  • pyspark.ml.common
    --> CANNOT LOAD pysparkling.ml.common
  • pyspark.ml.evaluation
    --> CANNOT LOAD pysparkling.ml.evaluation
  • pyspark.ml.feature
    --> CANNOT LOAD pysparkling.ml.feature
  • pyspark.ml.fpm
    --> CANNOT LOAD pysparkling.ml.fpm
  • pyspark.ml.functions
    --> CANNOT LOAD pysparkling.ml.functions
  • pyspark.ml.image
    --> CANNOT LOAD pysparkling.ml.image
  • pyspark.ml.linalg.__init__
    --> CANNOT LOAD pysparkling.ml.linalg.__init__
  • pyspark.ml.param.__init__
    --> CANNOT LOAD pysparkling.ml.param.__init__
  • pyspark.ml.param._shared_params_code_gen
    --> CANNOT LOAD pysparkling.ml.param._shared_params_code_gen
  • pyspark.ml.param.shared
    --> CANNOT LOAD pysparkling.ml.param.shared
  • pyspark.ml.pipeline
    --> CANNOT LOAD pysparkling.ml.pipeline
  • pyspark.ml.recommendation
    --> CANNOT LOAD pysparkling.ml.recommendation
  • pyspark.ml.regression
    --> CANNOT LOAD pysparkling.ml.regression
  • pyspark.ml.stat
    --> CANNOT LOAD pysparkling.ml.stat
  • pyspark.ml.tree
    --> CANNOT LOAD pysparkling.ml.tree
  • pyspark.ml.tuning
    --> CANNOT LOAD pysparkling.ml.tuning
  • pyspark.ml.util
    --> CANNOT LOAD pysparkling.ml.util
  • pyspark.ml.wrapper
    --> CANNOT LOAD pysparkling.ml.wrapper
  • pyspark.mllib.__init__
    --> CANNOT LOAD pysparkling.mllib.__init__
  • pyspark.mllib.classification
    --> CANNOT LOAD pysparkling.mllib.classification
  • pyspark.mllib.clustering
    --> CANNOT LOAD pysparkling.mllib.clustering
  • pyspark.mllib.common
    --> CANNOT LOAD pysparkling.mllib.common
  • pyspark.mllib.evaluation
    --> CANNOT LOAD pysparkling.mllib.evaluation
  • pyspark.mllib.feature
    --> CANNOT LOAD pysparkling.mllib.feature
  • pyspark.mllib.fpm
    --> CANNOT LOAD pysparkling.mllib.fpm
  • pyspark.mllib.linalg.__init__
    --> CANNOT LOAD pysparkling.mllib.linalg.__init__
  • pyspark.mllib.linalg.distributed
    --> CANNOT LOAD pysparkling.mllib.linalg.distributed
  • pyspark.mllib.random
    --> CANNOT LOAD pysparkling.mllib.random
  • pyspark.mllib.recommendation
    --> CANNOT LOAD pysparkling.mllib.recommendation
  • pyspark.mllib.regression
    --> CANNOT LOAD pysparkling.mllib.regression
  • pyspark.mllib.stat.__init__
    --> CANNOT LOAD pysparkling.mllib.stat.__init__
  • pyspark.mllib.stat._statistics
    --> CANNOT LOAD pysparkling.mllib.stat._statistics
  • pyspark.mllib.stat.distribution
    --> CANNOT LOAD pysparkling.mllib.stat.distribution
  • pyspark.mllib.stat.KernelDensity
    --> CANNOT LOAD pysparkling.mllib.stat.KernelDensity
  • pyspark.mllib.stat.test
    --> CANNOT LOAD pysparkling.mllib.stat.test
  • pyspark.mllib.tree
    --> CANNOT LOAD pysparkling.mllib.tree
  • pyspark.mllib.util
    --> CANNOT LOAD pysparkling.mllib.util
  • pyspark.profiler
    --> CANNOT LOAD pysparkling.profiler
  • pyspark.python.pyspark.shell
    21/03/01 20:30:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    --> CANNOT LOAD pysparkling.python.pyspark.shell
  • pyspark.rdd
    ==> All ok
  • pyspark.rddsampler
    --> CANNOT LOAD pysparkling.rddsampler
  • pyspark.resource
    --> CANNOT LOAD pysparkling.resource
  • pyspark.resultiterable
    --> CANNOT LOAD pysparkling.resultiterable
  • pyspark.serializers
    --> CANNOT LOAD pysparkling.serializers
  • pyspark.shell
    --> CANNOT LOAD pysparkling.shell
  • pyspark.shuffle
    --> CANNOT LOAD pysparkling.shuffle
  • pyspark.sql.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • Catalog
    • HiveContext
    • PandasCogroupedOps
    • UDFRegistration
    • Window
    • WindowSpec
  • pyspark.sql.avro.__init__
    --> CANNOT LOAD pysparkling.sql.avro.__init__
  • pyspark.sql.avro.functions
    --> CANNOT LOAD pysparkling.sql.avro.functions
  • pyspark.sql.catalog
    --> CANNOT LOAD pysparkling.sql.catalog
  • pyspark.sql.column
    ==> All ok
  • pyspark.sql.conf
    pysparkling is still missing:
    • _NoValue
    • _test
    • basestring
    • ignore_unicode_prefix
    • since
    • sys
    pysparkling has these extra:
    • _sentinel
  • pyspark.sql.context
    __all__ is not defined in pysparkling! Going to check the symbols anyway.
    pysparkling is still missing:
    • HiveContext
  • pyspark.sql.dataframe
    ==> All ok
  • pyspark.sql.functions
    __all__ is not the same:
    pysparkling is still missing:
    • PandasUDFType
    • approxCountDistinct
    • basestring
    • overlay
    • toDegrees
    • toRadians
    • to_str
    pysparkling has these extra:
    • callUDF
    • math
    • parse
    • typedLit
  • pyspark.sql.group
    ==> All ok
  • pyspark.sql.pandas.__init__
  • pyspark.sql.pandas.conversion
    --> CANNOT LOAD pysparkling.sql.pandas.conversion
  • pyspark.sql.pandas.functions
    --> CANNOT LOAD pysparkling.sql.pandas.functions
  • pyspark.sql.pandas.group_ops
    --> CANNOT LOAD pysparkling.sql.pandas.group_ops
  • pyspark.sql.pandas.map_ops
    --> CANNOT LOAD pysparkling.sql.pandas.map_ops
  • pyspark.sql.pandas.serializers
    --> CANNOT LOAD pysparkling.sql.pandas.serializers
  • pyspark.sql.pandas.typehints
    --> CANNOT LOAD pysparkling.sql.pandas.typehints
  • pyspark.sql.pandas.types
    --> CANNOT LOAD pysparkling.sql.pandas.types
  • pyspark.sql.pandas.utils
    pysparkling is still missing:
    • require_minimum_pyarrow_version
    pysparkling has these extra:
    • parse_pandas_version
  • pyspark.sql.readwriter
    ==> All ok
  • pyspark.sql.session
    ==> All ok
  • pyspark.sql.streaming
    --> CANNOT LOAD pysparkling.sql.streaming
  • pyspark.sql.types
    ==> All ok
  • pyspark.sql.udf
    --> CANNOT LOAD pysparkling.sql.udf
  • pyspark.sql.utils
    pysparkling is still missing:
    • SparkContext
    • py4j
    • raise_from
    • sys
    • unicode
  • pyspark.sql.window
    --> CANNOT LOAD pysparkling.sql.window
  • pyspark.statcounter
  • pyspark.status
    --> CANNOT LOAD pysparkling.status
  • pyspark.storagelevel
    ==> All ok
  • pyspark.streaming.__init__
    __all__ is not the same:
    pysparkling is still missing:
    • StreamingListener
  • pyspark.streaming.context
    ==> All ok
  • pyspark.streaming.dstream
    ==> All ok
  • pyspark.streaming.kinesis
    --> CANNOT LOAD pysparkling.streaming.kinesis
  • pyspark.streaming.listener
    --> CANNOT LOAD pysparkling.streaming.listener
  • pyspark.streaming.util
    --> CANNOT LOAD pysparkling.streaming.util
  • pyspark.taskcontext
    --> CANNOT LOAD pysparkling.taskcontext
  • pyspark.traceback_utils
    --> CANNOT LOAD pysparkling.traceback_utils
  • pyspark.util
    --> CANNOT LOAD pysparkling.util
  • pyspark.version
    --> CANNOT LOAD pysparkling.version
  • pyspark.worker
    --> CANNOT LOAD pysparkling.worker
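For reference, the comparison behind this report boils down to importing each pyspark module, importing the pysparkling module at the same dotted path, and diffing the exported symbols. A minimal sketch of that idea (`compare_module` is a hypothetical name; the actual script added in this PR may differ):

```python
import importlib


def compare_module(pyspark_name):
    """Diff the public symbols of a pyspark module against its pysparkling twin."""
    reference = importlib.import_module(pyspark_name)
    own_name = pyspark_name.replace('pyspark', 'pysparkling', 1)
    try:
        own = importlib.import_module(own_name)
    except ImportError:
        print(f'    --> CANNOT LOAD {own_name}')
        return
    # Prefer the declared public API; fall back to dir() when __all__ is absent.
    expected = set(getattr(reference, '__all__', dir(reference)))
    actual = set(getattr(own, '__all__', dir(own)))
    if expected == actual:
        print('    ==> All ok')
        return
    for symbol in sorted(expected - actual):
        print(f'    missing: {symbol}')
    for symbol in sorted(actual - expected):
        print(f'    extra:   {symbol}')
```

Called in a loop over pyspark's module list, this produces output in the shape of the report above.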

svaningelgem (Contributor, Author)

I'm also thinking: should the private interface be implemented or not? I would say no, because it is subject to change at any time. But there are some things we do at our company that can only be done via the private interface... so that would make my life more difficult there.

I think I'll just concentrate on the public one for now.

svaningelgem (Contributor, Author)

Because of the untangled imports, I could now also add my long-standing dream of auto-injection. A proof of concept has been added in the latest commit (29a7d99).

The program itself does not change one bit, but the execution happens either via pyspark or via pysparkling:

Finished this run in : 11.449s (Injector was on: False)

vs

Finished this run in : 0.917s (Injector was on: True)
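For the curious: one minimal way such an injector can work is to alias pysparkling under the pyspark names in `sys.modules` before the program imports pyspark. This is only a sketch of the idea, not necessarily what commit 29a7d99 does:

```python
import importlib
import sys


def inject_pysparkling():
    """Make later `import pyspark ...` statements resolve to pysparkling.

    Sketch only: the real proof of concept may hook imports differently,
    e.g. via a meta path finder that covers every submodule.
    """
    for own_name in ('pysparkling', 'pysparkling.sql'):
        module = importlib.import_module(own_name)
        sys.modules[own_name.replace('pysparkling', 'pyspark', 1)] = module


inject_pysparkling()
import pyspark  # now resolves to pysparkling via the sys.modules cache
```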



```python
class SparkContext:
    """Accepts the same initialization parameters as pyspark, but redirects everything to Context."""
```
Collaborator

Why not subclass Context instead?

Contributor Author

Because I actually wanted to make SparkContext a singleton, and as far as I can see, Context is not a singleton.

Plus, I went a long way trying to copy the pyspark API into SparkContext, but finally ended up with the current implementation.

Long story short: inheriting and checking for single use in `__new__` should be better :-). I updated the code as such.
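For illustration, a minimal sketch of the "inherit and guard in `__new__`" pattern described above (the `_active` attribute and the error message are assumptions, not the exact code in this commit):

```python
from pysparkling import Context


class SparkContext(Context):
    """pyspark-compatible facade; only one instance may exist at a time (sketch)."""

    _active = None  # hypothetical guard attribute

    def __new__(cls, *args, **kwargs):
        if SparkContext._active is not None:
            raise ValueError('Only one SparkContext may be active at once.')
        instance = super().__new__(cls)
        SparkContext._active = instance
        return instance
```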

@@ -19,6 +20,8 @@

log = logging.getLogger(__name__)

__all__ = ['SparkContext']
Collaborator

It is a rather big change to the pysparkling API which we may want to discuss. It may not be aligned with the previous vision, where the API was deliberately different from pyspark's (namely here: it is not a SparkContext that is created, it's a pysparkling Context).

Contributor Author

OK, that needs to be discussed then. For me, I want it to be a drop-in replacement for pyspark (during tests). As such it would need to be 100% aligned with the pyspark API.

@svenkreiss, @tools4origins: what is your view on this?

I also refer to the auto-injector: https://github.com/svenkreiss/pysparkling/pull/161/files#diff-b6988b98630fdeab52d583b68acee7ba2d41304b72a201045c0c141c7fd502dc

@svaningelgem svaningelgem deleted the feat/untangling_imports branch March 11, 2021 07:47