StackOverflow Data Analysis Using Hadoop and Spark

StackOverflow (https://stackoverflow.com/) is one of the most popular question-and-answer websites among programmers and generates a large amount of data every day, which makes it a good example of big data. The site's data dump is available online (https://archive.org/details/stackexchange), and we use it for analytics. For this project, we use three different infrastructures: standalone, a custom cluster with Cloudera, and a one-click cluster.

Functionalities

EXTRACTING DATASETS:
To gather the datasets for analysis and meet our project goals, we proceed as follows:
- The data dump is a set of XML files compressed in .7z format.
- Even compressed, the largest file is 18GB.
- Extracted, this file would be around 92GB, which is not easy to handle on a local machine.
- We use Python to extract a subset of around 6GB and prepare the data for our analysis. The code can be found on GitHub: https://github.com/santosh-burada/Hadoop/tree/main/Data-Extraction
- The schema of the dataset can be viewed at: https://data.stackexchange.com/stackoverflow/query/472607/
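The subset extraction described above can be sketched as follows. This is a minimal illustration, not the code in the Data-Extraction folder: it assumes the dump XML has a single root element whose children are `<row>` elements, and the function name `extract_subset` and its parameters are hypothetical. It streams the input, so the full file is never held in memory.

```python
import xml.etree.ElementTree as ET


def extract_subset(src, out, max_rows):
    """Copy the first `max_rows` <row> elements from `src` to `out`.

    `src` is a binary file object holding dump XML; `out` is a text file object.
    Returns the number of rows written.
    """
    context = ET.iterparse(src, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    out.write("<%s>\n" % root.tag)
    written = 0
    if max_rows > 0:
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                # Serialize the row as-is; attribute values are re-escaped by tostring.
                out.write("  " + ET.tostring(elem, encoding="unicode").strip() + "\n")
                written += 1
                root.clear()  # drop processed rows so memory use stays flat
                if written >= max_rows:
                    break
    out.write("</%s>\n" % root.tag)
    return written
```

Because `src` is any binary file object, the function is not tied to a fully extracted file on disk; it stops reading as soon as `max_rows` rows have been copied. Limiting by row count rather than by byte size is a simplification made for this sketch.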
PREPARING DATASETS:
- We convert the data from XML to a DataFrame and then write the DataFrame in Parquet file format. The code can be found in prepare_data.py: https://github.com/santosh-burada/Hadoop/blob/main/prepare_data.py
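As a rough illustration of the XML to DataFrame to Parquet step, the sketch below parses `<row>` attributes into typed records and writes them with PySpark. It is not the content of prepare_data.py. The chosen columns are an assumed subset of the Posts schema, the names `row_to_record`, `iter_records`, and `xml_to_parquet` are hypothetical, and the rows are materialized on the driver, which only suits small samples.

```python
import xml.etree.ElementTree as ET

# Assumed subset of the Posts attributes; adjust to the columns the analysis needs.
INT_FIELDS = ("Id", "PostTypeId", "ParentId", "AcceptedAnswerId",
              "OwnerUserId", "ViewCount", "Score")
STR_FIELDS = ("CreationDate", "Tags", "Title")
FIELDS = INT_FIELDS + STR_FIELDS


def row_to_record(elem):
    """Turn one <row> element into a dict; missing attributes become None."""
    rec = {}
    for name in INT_FIELDS:
        value = elem.get(name)
        rec[name] = int(value) if value is not None else None
    for name in STR_FIELDS:
        rec[name] = elem.get(name)
    return rec


def iter_records(src):
    """Stream records from a binary file object containing dump XML."""
    for _event, elem in ET.iterparse(src, events=("end",)):
        if elem.tag == "row":
            yield row_to_record(elem)
            elem.clear()  # free the attributes of rows already converted


def xml_to_parquet(xml_path, parquet_path):
    """Load the XML subset into a Spark DataFrame and write it as Parquet."""
    # Imported here so the parsing helpers above work without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema = StructType(
        [StructField(n, LongType(), True) for n in INT_FIELDS]
        + [StructField(n, StringType(), True) for n in STR_FIELDS]
    )
    spark = SparkSession.builder.appName("prepare_data_sketch").getOrCreate()
    with open(xml_path, "rb") as src:
        rows = [tuple(rec[n] for n in FIELDS) for rec in iter_records(src)]
    df = spark.createDataFrame(rows, schema=schema)
    # Parse the ISO-style date strings into timestamps for later time arithmetic.
    df = df.withColumn("CreationDate", F.to_timestamp("CreationDate"))
    df.write.mode("overwrite").parquet(parquet_path)
```

Declaring the schema explicitly avoids type inference on columns that are often null, such as ParentId on questions or AcceptedAnswerId on answers.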
ANALYSIS:
- We read the Parquet files and run a Spark script to perform analyses such as:
  - Number of questions in the dataset
  - Number of answers in the dataset
  - Number of distinct users
  - Questions that have been answered
  - Most viewed questions
  - Response time between a question and its answer
  - Hourly data (how many questions were answered per hour)
  - Evolution of the number of questions and answers over time
  - Number of tags
  - Most popular tags by frequency
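
A few of the analyses listed above could look roughly like this in PySpark. This is a sketch under stated assumptions, not the repository's analysis.py: it assumes a `posts` DataFrame with the columns Id, PostTypeId, AcceptedAnswerId, OwnerUserId, ViewCount, CreationDate, Tags, and Title; that "answered" means the question has an accepted answer; and that Tags is stored as `<tag1><tag2>`. The function names are hypothetical.

```python
# PostTypeId values in the Stack Exchange schema.
QUESTION = 1
ANSWER = 2

# Assumed tag encoding: "<python><apache-spark>".
TAG_TRIM = r"^<|>$"  # strips the outer angle brackets
TAG_SEP = "><"       # separator left between tags after trimming


def summarize(posts):
    """Basic counts: questions, answers, distinct users, answered questions."""
    from pyspark.sql import functions as F

    questions = posts.filter(F.col("PostTypeId") == QUESTION)
    answers = posts.filter(F.col("PostTypeId") == ANSWER)
    users = posts.filter(F.col("OwnerUserId").isNotNull()).select("OwnerUserId")
    return {
        "questions": questions.count(),
        "answers": answers.count(),
        "distinct_users": users.distinct().count(),
        # Assumption: a question counts as answered if it has an accepted answer.
        "answered_questions": questions.filter(
            F.col("AcceptedAnswerId").isNotNull()).count(),
    }


def most_viewed(posts, n=10):
    """Top `n` questions by view count."""
    from pyspark.sql import functions as F

    return (posts.filter(F.col("PostTypeId") == QUESTION)
            .orderBy(F.desc("ViewCount"))
            .select("Id", "Title", "ViewCount")
            .limit(n))


def response_times(posts):
    """Seconds between each question and its accepted answer."""
    from pyspark.sql import functions as F

    q = posts.filter(F.col("PostTypeId") == QUESTION).alias("q")
    a = posts.filter(F.col("PostTypeId") == ANSWER).alias("a")
    joined = q.join(a, F.col("q.AcceptedAnswerId") == F.col("a.Id"))
    # Casting a timestamp to long yields seconds since the epoch.
    asked = F.col("q.CreationDate").cast("timestamp").cast("long")
    replied = F.col("a.CreationDate").cast("timestamp").cast("long")
    return joined.select(F.col("q.Id").alias("QuestionId"),
                         (replied - asked).alias("ResponseSeconds"))


def popular_tags(posts, n=10):
    """Top `n` tags by how many posts carry them."""
    from pyspark.sql import functions as F

    tags = (posts.filter(F.col("Tags").isNotNull())
            .select(F.explode(F.split(
                F.regexp_replace("Tags", TAG_TRIM, ""), TAG_SEP)).alias("Tag")))
    return tags.groupBy("Tag").count().orderBy(F.desc("count")).limit(n)
```

With a SparkSession named `spark`, the input would typically be loaded with `spark.read.parquet(path)` and passed to these functions. If "answered" is instead meant as "has at least one answer", the accepted-answer filter would need to be replaced accordingly.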

