This project uses big data tools to answer a series of analysis questions about Wikipedia datasets. Each question is answered with Hive or MapReduce, whichever suits the context of the question better.
- Hadoop
- HDFS
- Python
- Hive
- MapReduce
- YARN
- Git + GitHub
- Find, organize, and format pageviews on any given day.
- Determine relative popularity of page access methods.
- Compare yearly popularity of pages.
- Explore different ways to find the series of articles, starting from Hotel California, that keeps the largest fraction of readers following internal links.
Problem Statements 1, 4, and 5 require the dataset for January 20, 2021.
e.g.: fr.b 1-Naphthol 1 737
domain_code: fr.b
page_title: 1-Naphthol
count_views: 1
total_response_size: 737
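To make the field layout concrete, here is a minimal Python sketch that splits one line into the four fields listed above. It assumes the fields are separated by whitespace and that page titles contain no spaces; `PageviewRecord` and `parse_pageview_line` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class PageviewRecord(NamedTuple):
    """One line of the daily dataset (hypothetical helper type)."""
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_pageview_line(line: str) -> PageviewRecord:
    # Assumption: exactly four whitespace-separated fields per line.
    # A malformed line raises ValueError from the unpacking below.
    domain_code, page_title, count_views, total_response_size = line.split()
    return PageviewRecord(
        domain_code, page_title, int(count_views), int(total_response_size)
    )


# The example line shown above.
record = parse_pageview_line("fr.b 1-Naphthol 1 737")
print(record)
```

In a MapReduce mapper the same split would typically run once per input line, skipping lines that fail to parse.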
wikibooks: ".b"
wiktionary: ".d"
foundationwiki: ".f"
mobile sites: ".m"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
wikivoyage: ".voy"
mediawikiwiki: ".w"
wikidatawiki: ".wd"
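The suffix table above can be turned into a small lookup. The sketch below assumes a domain code has the shape `<language>[.m][.<suffix>]` and that a code without a project suffix (such as `en` or `en.m`) refers to Wikipedia itself; both are assumptions, and `describe_domain` is a hypothetical helper name.

```python
# Project suffixes from the list above (without the leading dot).
PROJECT_SUFFIXES = {
    "b": "wikibooks",
    "d": "wiktionary",
    "f": "foundationwiki",
    "n": "wikinews",
    "q": "wikiquote",
    "s": "wikisource",
    "v": "wikiversity",
    "voy": "wikivoyage",
    "w": "mediawikiwiki",
    "wd": "wikidatawiki",
}


def describe_domain(domain_code):
    """Return (language, project, is_mobile) for a domain code."""
    language, *rest = domain_code.split(".")
    is_mobile = "m" in rest  # ".m" marks a mobile site
    project = "wikipedia"    # assumption: no project suffix means Wikipedia
    for part in rest:
        if part in PROJECT_SUFFIXES:
            project = PROJECT_SUFFIXES[part]
    return language, project, is_mobile


print(describe_domain("fr.b"))
print(describe_domain("en.m"))
```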
Problem Statements 2 and 3 require the monthly dataset for January 2021.
e.g.: other-search Camp_Tawonga external 183
prev: the result of mapping the referrer URL to the fixed set of values described above
curr: the title of the article the client requested
type: describes (prev, curr)
n: the number of occurrences of the (referrer, resource) pair
The type field takes one of the following values:
link: if the referrer and request are both articles and the referrer links to the request
external: if the referrer host is not en(.m)?.wikipedia.org
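As a rough illustration of how these fields can feed the internal-link questions, the Python sketch below estimates, for each article, the share of arrivals that continue via an internal link. It rests on several assumptions: the file is tab-separated, the sum of `n` over rows where `curr` is the article approximates its number of readers, and only rows with type `link` count as internal clicks. The second sample row and the name `internal_link_fractions` are made up for illustration.

```python
from collections import defaultdict


def internal_link_fractions(lines):
    """Map article -> (clicks leaving via type 'link') / (arrivals at article)."""
    arrivals = defaultdict(int)     # sum of n where curr == article
    link_clicks = defaultdict(int)  # sum of n where prev == article, type == "link"
    for line in lines:
        # Assumption: four tab-separated fields (prev, curr, type, n).
        prev, curr, link_type, n = line.rstrip("\n").split("\t")
        count = int(n)
        arrivals[curr] += count
        if link_type == "link":
            link_clicks[prev] += count
    return {
        article: link_clicks[article] / arrivals[article]
        for article in link_clicks
        if arrivals[article] > 0
    }


sample = [
    "other-search\tCamp_Tawonga\texternal\t183",       # example row from above
    "Camp_Tawonga\tHypothetical_Article\tlink\t20",    # made-up row
]
fractions = internal_link_fractions(sample)
print(fractions)
```

A Hive version of the same idea would group once by `prev` and once by `curr`, then join the two aggregates on the article title.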
- Which English Wikipedia article got the most traffic on January 20, 2021?
- Which English Wikipedia article has the largest fraction of its readers follow an internal link to another Wikipedia article?
- What series of Wikipedia articles, starting with Hotel California, keeps the largest fraction of its readers clicking on internal links?
- Find an example of an English Wikipedia article that is relatively more popular in the Americas than elsewhere. There is no location data associated with the Wikipedia pageviews data, but there are timestamps. You'll need to make some assumptions about internet usage over the hours of the day.
- What is the difference between the total views of 'en' and 'en.m' pages?
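The first and last questions above reduce to simple aggregations over the daily lines. The Python sketch below is one possible approach, not necessarily how the Hive or MapReduce jobs in this project are written. It assumes that "English Wikipedia" means the `en` and `en.m` domain codes, and that desktop and mobile views of the same title are added together. `summarize_english_views` is a hypothetical name, and all sample rows except the `fr.b` one are made up.

```python
from collections import Counter


def summarize_english_views(lines):
    """Return ((top_title, views), en_total - en_m_total)."""
    per_title = Counter()
    per_domain = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        domain_code, page_title, count_views, _size = parts
        if domain_code in ("en", "en.m"):
            views = int(count_views)
            per_title[page_title] += views
            per_domain[domain_code] += views
    top = per_title.most_common(1)
    return (top[0] if top else None), per_domain["en"] - per_domain["en.m"]


sample = [
    "en Main_Page 10 0",          # made-up row
    "en.m Main_Page 25 0",        # made-up row
    "en Hypothetical_Page 5 0",   # made-up row
    "fr.b 1-Naphthol 1 737",      # ignored: not en or en.m
]
top_article, en_minus_mobile = summarize_english_views(sample)
print(top_article, en_minus_mobile)
```

A negative difference means the mobile site received more views than the desktop site in the sampled lines.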