A collection of machine learning examples using PySpark
To clone and run the tutorials, install Anaconda Python along with PySpark
and the other required packages.
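
Before opening the tutorials, it can help to confirm that the environment actually has the packages available. The snippet below is a minimal sketch using only the Python standard library. The helper name `missing_packages` and the package list are assumptions made for illustration; this README does not say which packages beyond `pyspark` are needed, so adjust the list to match the tutorials you run.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: only pyspark is named in this README; the rest are guesses.
REQUIRED = ["pyspark", "numpy", "pandas"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any package reported as missing can be installed into the active Anaconda environment before launching the notebooks.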
To get through the basics quickly, I recommend an online course from Udemy and this great GitHub repo.
Watch as John Hogue walks through a practical example of a data pipeline to feed textual data for tagging with PySpark and ML. Learn to leverage great existing Python libraries in Spark such as NLTK and how to use some of Spark’s newer features. A GitHub Repo of source code, training and test sets of data will be provided for attendees to explore and play with.