shah-zeb-naveed/distributed-computing-pyspark

Distributed Computing - PySpark

This repository contains mini-projects on distributed computing using Spark in Python.

  1. Text Analytics: Point-wise Mutual Information in PySpark

Calculates the pointwise mutual information (PMI) for a token or a pair of tokens, over all the words occurring in a text file.
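As a rough illustration of the quantity being computed, PMI(x, y) = log(p(x, y) / (p(x) p(y))). The sketch below is plain Python rather than the repository's PySpark code. It assumes co-occurrence means "appears on the same line", that tokens are lowercased and split on whitespace, and that the logarithm is base 2. The name `pmi_table` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(lines):
    """Return {(x, y): PMI(x, y)} for token pairs that share a line.

    Assumptions (not taken from the repository): probabilities are
    line-level frequencies and the logarithm is base 2.
    """
    n_lines = 0
    token_counts = Counter()  # number of lines containing each token
    pair_counts = Counter()   # number of lines containing both tokens
    for line in lines:
        tokens = sorted(set(line.lower().split()))
        if not tokens:
            continue  # blank lines are not counted
        n_lines += 1
        token_counts.update(tokens)
        # Sorted input means each pair is emitted once, as (x, y) with x < y.
        pair_counts.update(combinations(tokens, 2))
    # p(x, y) / (p(x) * p(y)) simplifies to c * N / (c_x * c_y).
    return {
        (x, y): math.log2(c * n_lines / (token_counts[x] * token_counts[y]))
        for (x, y), c in pair_counts.items()
    }


corpus = ["spark rdd", "spark sql", "rdd map", "sql"]
table = pmi_table(corpus)
# "map" is on 1 of 4 lines, "rdd" on 2, and they share 1 line:
# log2(1 * 4 / (1 * 2)) = 1.0
print(table[("map", "rdd")])
```

In a Spark job, the two counters would typically be built with `flatMap` followed by `reduceByKey`, with the final ratio computed after joining or broadcasting the single-token counts.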

  2. Graph/Network Analysis: Personalized PageRank Algorithm in PySpark

Implements a modified version of the PageRank algorithm in which ranking is performed relative to a given source node. The modifications are two-fold: (A) random jumps go only to the source node, and (B) mass lost at dangling nodes is transferred entirely to the source node instead of being redistributed over the entire graph.
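The two modifications can be sketched in a few lines of plain Python. This is not the repository's PySpark implementation. The function name `personalized_pagerank`, the jump probability `alpha=0.15`, the fixed iteration count, and the adjacency-dict graph format are all assumptions made for illustration.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Power iteration for PageRank personalized to `source`.

    graph: dict mapping node -> list of out-neighbours (assumed format).
    alpha: probability of a random jump (assumed value).
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    nodes.add(source)
    # All mass starts on the source node.
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0

    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        to_source = 0.0
        for u in nodes:
            nbrs = graph.get(u, [])
            if nbrs:
                # Follow an out-link with probability 1 - alpha.
                share = (1.0 - alpha) * rank[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
                # (A) the random jump goes only to the source.
                to_source += alpha * rank[u]
            else:
                # (B) a dangling node sends all of its mass to the source.
                to_source += rank[u]
        nxt[source] += to_source
        rank = nxt
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
ranks = personalized_pagerank(graph, source="a")
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 4))
```

Because mass only ever enters the graph at the source, a node that cannot be reached from the source (here `d`) keeps a rank of zero, and the total mass stays at 1 in every iteration. In a Spark version, the inner loop would usually become a `flatMap` over adjacency lists followed by `reduceByKey`.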

  3. Querying TPC-H with Spark DataFrames and Spark SQL

  4. Stochastic Gradient Descent from scratch in Spark for classifying emails as spam or ham.

  5. Analysis of live robot movement data using Spark Streaming.
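For the stochastic gradient descent project listed above, the core update can be sketched in plain Python. The README does not say which model or features the project uses, so this sketch assumes logistic regression over binary "token present" features with no bias term. The names `train_sgd` and `predict_spam`, the learning rate, and the toy emails are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(examples, epochs=20, lr=0.5, seed=0):
    """examples: list of (tokens, label) pairs with label 1 = spam, 0 = ham."""
    rng = random.Random(seed)
    data = list(examples)
    weights = {}  # sparse weight vector: token -> weight
    for _ in range(epochs):
        rng.shuffle(data)
        for tokens, label in data:
            feats = set(tokens)
            score = sum(weights.get(t, 0.0) for t in feats)
            # Gradient of the log-likelihood for one example; each active
            # feature has value 1, so every active weight moves by lr * error.
            error = label - sigmoid(score)
            for t in feats:
                weights[t] = weights.get(t, 0.0) + lr * error
    return weights


def predict_spam(weights, tokens):
    score = sum(weights.get(t, 0.0) for t in set(tokens))
    return sigmoid(score) > 0.5


emails = [
    (["win", "cash", "now"], 1),
    (["cheap", "cash", "offer"], 1),
    (["meeting", "at", "noon"], 0),
    (["project", "meeting", "notes"], 0),
]
weights = train_sgd(emails)
print(predict_spam(weights, ["cash", "offer"]))
print(predict_spam(weights, ["meeting", "notes"]))
```

A distributed variant would typically compute these per-example gradients on each partition and combine them before updating the weights, which is a design decision this sketch does not cover.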
