Load StackExchange Data as Spark Dataframes and run analysis over it.
All the StackExchange sites use a common schema for their data. So a common pipeline (to load data) can be shared across different sites in the StackExchange Ecosystem.
- Download Spark 1.6.1.
- Set the `$SPARK_HOME` variable to point to your Spark installation.
- Download StackExchange Data.
- Clone this repo: `git clone https://github.com/shagunsodhani/iota.git`
- Build the project: `mvn clean install`
- Start the Spark shell with the built jar: `$SPARK_HOME/bin/spark-shell --jars target/uber-iota-0.0.1-SNAPSHOT.jar`
- Run `import com.shagunsodhani.iota.utils.DataFrameUtility` in the Scala REPL.
- Run `val userDF = DataFrameUtility.getUserDataFrame(sc, PATH_TO_USER_FILE)` to load the user DataFrame. Other available functions are
For now, only the user and post schemas are mapped. Other data (Badges, PostHistory, etc.) can be mapped easily.
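
As a rough illustration, the REPL steps above might look like the following session. This is a sketch only: the file path is a placeholder, and it assumes that `getUserDataFrame` returns a regular Spark `DataFrame`, so the standard inspection methods apply.

```scala
// Run inside spark-shell (Spark 1.6.x) started with the iota jar,
// where the SparkContext `sc` is already defined.
import com.shagunsodhani.iota.utils.DataFrameUtility

// Hypothetical location of the users file from a StackExchange data dump.
val PATH_TO_USER_FILE = "/path/to/stackexchange/Users.xml"

// Parse the XML file into a DataFrame (assumed return type: DataFrame).
val userDF = DataFrameUtility.getUserDataFrame(sc, PATH_TO_USER_FILE)

// Standard DataFrame calls for a first look at the data.
userDF.printSchema()      // column names and types as mapped by the utility
userDF.show(5)            // first five rows
println(userDF.count())   // total number of user records
```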
Check out the sample Jupyter notebooks:
- [using StackOverflow Data](notebook/Exploratory Analysis for StackOverflow User Data.ipynb).
- Parsing XML files to load the data is inefficient. An application should create DataFrames from the XML files only once and save them as Parquet files using the `write.parquet` method. The methods used to parse the XML files can also read the Parquet files if given the path to the Parquet file instead of the XML file. Once the data is converted to Parquet, it can be loaded directly in PySpark/R. Another way to load the Stack Exchange XML data into these languages would be to implement the `getUserDataFrame` method and the other methods as data sources, the way Spark-CSV is implemented. While such an implementation is not tricky, it would still be less efficient than reading Parquet files.
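
The one-time XML-to-Parquet conversion could be sketched as follows. Both paths are placeholders, and the snippet assumes that `getUserDataFrame` returns a standard Spark `DataFrame`. It reads the Parquet file back with Spark's built-in `sqlContext.read.parquet`, not with the repository's own loader.

```scala
// Run inside spark-shell (Spark 1.6.x) started with the iota jar,
// where `sc` and `sqlContext` are already defined.
import com.shagunsodhani.iota.utils.DataFrameUtility

// Hypothetical paths; adjust to your environment.
val xmlPath = "/path/to/stackexchange/Users.xml"
val parquetPath = "/path/to/stackexchange/users.parquet"

// First run only: parse the XML and persist the result as Parquet.
// mode("overwrite") replaces existing output; without it, the write
// fails if parquetPath already exists.
val userDF = DataFrameUtility.getUserDataFrame(sc, xmlPath)
userDF.write.mode("overwrite").parquet(parquetPath)

// Later runs: skip XML parsing and read the Parquet file directly.
val cachedUserDF = sqlContext.read.parquet(parquetPath)
cachedUserDF.printSchema()
```

Parquet stores the schema alongside the data, so the reloaded DataFrame needs no schema mapping, and the same file can be opened from PySpark or SparkR through their own Parquet readers.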