santosh-burada/Hadoop

StackOverflow Data Analysis Using Hadoop and Spark

StackOverflow (https://stackoverflow.com/) is one of the most popular question-and-answer websites among programmers and generates very large amounts of data every day, which makes it an excellent example of big data. The site’s data dump is available online (https://archive.org/details/stackexchange), and we use it for analytics. For this project, we use three different infrastructures: standalone, a custom cluster with Cloudera, and a one-click cluster.

Functionalities

  • EXTRACT DATASETS
    To gather the datasets for analysis and meet our project goals, we follow this path:
  • PREPARING DATASETS:
  • ANALYSIS:
    • Reading the data in Parquet format and running a Spark script to perform analyses such as:
      • Number of questions in the dataset
      • Number of answers in the dataset
      • Distinct number of users
      • Questions which are answered
      • Most viewed questions
      • Computing the response time, i.e. how long a question takes to receive an answer.
      • Computing hourly data (how many questions are answered per hour)
      • Evolution of the number of questions and answers over time.
      • Computing the number of tags
      • Getting the most popular tags based on their frequency.
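
The metrics listed above can be illustrated with a minimal plain-Python sketch. It is not the project's Spark script: it only shows one possible definition of each metric on a few made-up rows. The field names (PostTypeId, ParentId, CreationDate, OwnerUserId, ViewCount, Tags) follow the attributes of Posts.xml in the Stack Exchange data dump, where PostTypeId 1 is a question and 2 is an answer. The sample rows, the `summarize` function, the `<tag>` encoding of the Tags field, and the choice of "time to first answer" as the response time are assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical sample rows, shaped like rows of Posts.xml in the data dump.
POSTS = [
    {"Id": 1, "PostTypeId": 1, "CreationDate": "2020-01-01T10:00:00",
     "OwnerUserId": 10, "ViewCount": 250, "Tags": "<python><pandas>"},
    {"Id": 2, "PostTypeId": 2, "ParentId": 1,
     "CreationDate": "2020-01-01T10:30:00", "OwnerUserId": 11},
    {"Id": 3, "PostTypeId": 1, "CreationDate": "2020-01-01T11:00:00",
     "OwnerUserId": 12, "ViewCount": 900, "Tags": "<python><apache-spark>"},
    {"Id": 4, "PostTypeId": 2, "ParentId": 1,
     "CreationDate": "2020-01-01T12:15:00", "OwnerUserId": 10},
]


def summarize(posts):
    """Compute the README's metrics for a list of post dictionaries."""
    parse = datetime.fromisoformat
    questions = [p for p in posts if p["PostTypeId"] == 1]
    answers = [p for p in posts if p["PostTypeId"] == 2]

    # Earliest answer timestamp per question id.
    first_answer = {}
    for a in answers:
        t = parse(a["CreationDate"])
        qid = a["ParentId"]
        if qid not in first_answer or t < first_answer[qid]:
            first_answer[qid] = t

    # Assumption: response time = minutes from question to its first answer.
    response_minutes = {
        q["Id"]: (first_answer[q["Id"]] - parse(q["CreationDate"])).total_seconds() / 60
        for q in questions
        if q["Id"] in first_answer
    }

    # Assumption: tags are encoded as "<tag1><tag2>".
    tag_counts = Counter(
        tag for q in questions for tag in re.findall(r"<([^>]+)>", q.get("Tags", ""))
    )

    return {
        "questions": len(questions),
        "answers": len(answers),
        "distinct_users": len({p["OwnerUserId"] for p in posts}),
        "answered_questions": len(response_minutes),
        "most_viewed_id": max(questions, key=lambda q: q["ViewCount"])["Id"],
        "response_minutes": response_minutes,
        # Answers grouped by the hour of day in which they were posted.
        "answers_per_hour": Counter(parse(a["CreationDate"]).hour for a in answers),
        "tag_counts": tag_counts,
    }


summary = summarize(POSTS)
print(summary["tag_counts"].most_common(1))
```

In Spark, each of these steps would presumably map onto a DataFrame filter, join, or groupBy over the Parquet data; the sketch only pins down what each number means.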
