# Data 

In [1]:
import os
from tqdm import tqdm_notebook as tqdm

### Download data
Let's download the Wikimedia page counts data for 2016-08-01.

In [2]:
DATA_PATH = "/home/ubuntu/data"

In [3]:
# create dir data
! mkdir "$DATA_PATH"

os.chdir(DATA_PATH)

# download one day wikimedia data, date=2016-08-01
prefix = "https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-08/pagecounts-20160801-"
for i in tqdm(range(24)):
    url = "".join([prefix, "{:02}0000".format(i), ".gz"])
    ! wget "$url"

mkdir: /Users/mohamed-aminezghal/Documents/streaming_algorithms/code/data: File exists


100%|██████████| 24/24 [00:00<00:00, 113232.05it/s]
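The download loop above can also be written in pure Python, which avoids depending on `wget` being installed. This is a minimal sketch, not the notebook's method: the helper names `hourly_urls` and `download_day` are hypothetical, and only the standard library is used. The URL prefix is the one from the cell above.

```python
import os
from urllib.request import urlretrieve

# Same prefix as in the download cell above
PREFIX = "https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-08/pagecounts-20160801-"


def hourly_urls(hours=range(24)):
    """Build one URL per hour, e.g. ...pagecounts-20160801-000000.gz (hypothetical helper)."""
    return ["{}{:02}0000.gz".format(PREFIX, hour) for hour in hours]


def download_day(target_dir):
    """Download every hourly archive that is not already present in target_dir."""
    os.makedirs(target_dir, exist_ok=True)  # no error if the directory already exists
    for url in hourly_urls():
        path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(path):
            urlretrieve(url, path)
```

Calling `download_day(DATA_PATH)` would fetch the 24 archives and skip any file that was already downloaded by an earlier run.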


### Explore data

In [4]:
! ls "$DATA_PATH"

pagecounts-20160801-000000.gz pagecounts-20160801-120000.gz
pagecounts-20160801-010000.gz pagecounts-20160801-130000.gz
pagecounts-20160801-020000.gz pagecounts-20160801-140000.gz
pagecounts-20160801-030000.gz pagecounts-20160801-150000.gz
pagecounts-20160801-040000.gz pagecounts-20160801-160000.gz
pagecounts-20160801-050000.gz pagecounts-20160801-170000.gz
pagecounts-20160801-060000.gz pagecounts-20160801-180000.gz
pagecounts-20160801-070000.gz pagecounts-20160801-190000.gz
pagecounts-20160801-080000.gz pagecounts-20160801-200000.gz
pagecounts-20160801-090000.gz pagecounts-20160801-210000.gz
pagecounts-20160801-100000.gz pagecounts-20160801-220000.gz
pagecounts-20160801-110000.gz pagecounts-20160801-230000.gz


__We have downloaded 24 files. Each file contains the page counts for one hour of 2016-08-01 and is named using the following format:__

pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz

Choose a file number (equivalent to choosing an hour of the day between 0 and 23)

In [5]:
file_number = 0

Let's have a look at the first lines of the file

In [6]:
prefix = "pagecounts-20160801-"
# fname stores the filename without the ".gz"
fname = prefix + "{:02}0000".format(file_number)
# unzip the file, while keeping the compressed file
! gunzip -k "$fname".gz
# count the lines with the shell command wc -l and store the result in lines_count
lines_count = ! wc -l "$fname"
# get the first lines of the file
! head "$fname"
# Delete the uncompressed file
! rm "$fname"

aa File:Sleeping_lion.jpg 1 8030
aa Main_Page 1 78261
aa Special:Statistics 1 20493
aa Special:WhatLinksHere/File:Crystal_Clear_app_email.png 1 5412
aa Special:WhatLinksHere/File:Wikipedia-logo-fr.png 1 5370
aa Steward_requests/Bot_status 1 4733
aa Translation_teams/ru 1 4718
aa User:%E5%8F%B8%E5%BE%92%E4%BC%AF%E9%A2%9C 2 20096
aa User:149.62.201.0/24 1 4802
aa User:191.101.30.0/24 1 4806


The first field is ```domain_name``` (ex: aa), the second field is ```page_title``` (ex: Main_Page), the third field is ```count_views``` (ex: 1) and the last field is ```total_response_size``` (ex: 78261). So for example ```en Main_Page 42 50043``` means 42 requests to en.wikipedia.org/wiki/Main_Page, which accounted in total for 50043 response bytes.
More info: https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw
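To make the four fields concrete, here is a small sketch that turns one line into a named record. The names `PageCount` and `parse_line` are hypothetical, and the sketch assumes every line has exactly four whitespace-separated fields, as in the sample above.

```python
from collections import namedtuple

# Hypothetical record type mirroring the four fields described above
PageCount = namedtuple(
    "PageCount",
    ["domain_name", "page_title", "count_views", "total_response_size"],
)


def parse_line(line):
    """Split one pagecounts line into typed fields (assumes exactly four fields)."""
    domain_name, page_title, count_views, total_response_size = line.split()
    return PageCount(domain_name, page_title, int(count_views), int(total_response_size))


record = parse_line("en Main_Page 42 50043\n")
print(record.domain_name, record.page_title, record.count_views, record.total_response_size)
```

Using `split()` without an argument also discards the trailing newline, so the last field converts cleanly to an integer.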

In [7]:
lines_count

[' 6270942 pagecounts-20160801-000000']

The file pagecounts-20160801-000000 contains __6 270 942__ lines
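The same preview and line count can be obtained without writing the uncompressed file to disk, by streaming the archive with the standard `gzip` module. This is a sketch under that assumption, not what the cell above does. The function names `head` and `count_lines` are hypothetical.

```python
import gzip
from itertools import islice


def head(gz_path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def count_lines(gz_path):
    """Count newline-terminated lines, reading in binary mode like wc -l."""
    with gzip.open(gz_path, "rb") as f:
        return sum(1 for _ in f)
```

For example, `head("pagecounts-20160801-000000.gz")` and `count_lines("pagecounts-20160801-000000.gz")` would replace the `gunzip`, `head`, `wc -l` and `rm` steps, at the cost of decompressing the file once per call.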

Can you answer the following questions:
- Are the lines ordered in a certain manner? Yes. Alphanumerical order.
- What is the most viewed (```domain_name```, ```page_title```) couple in the file pagecounts-20160801-000000?
- Which (```domain_name```, ```page_title```) couple in the file pagecounts-20160801-000000 has the largest ```total_response_size```?
- What is the most visited ```domain_name```?
- Which files have the highest and lowest number of lines?
- What is the total number of lines across all files?
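The answer to the first question can be checked programmatically. The sketch below assumes that "alphanumerical order" means ordering by (```domain_name```, ```page_title```) using Python's default string comparison. The function name `is_sorted` is hypothetical.

```python
def is_sorted(lines):
    """Return True if lines are in non-decreasing (domain_name, page_title) order."""
    previous = None
    for line in lines:
        domain_name, page_title = line.split()[:2]
        key = (domain_name, page_title)
        if previous is not None and key < previous:
            return False  # found a line that sorts before its predecessor
        previous = key
    return True


sample = [
    "aa File:Sleeping_lion.jpg 1 8030",
    "aa Main_Page 1 78261",
    "aa Special:Statistics 1 20493",
]
print(is_sorted(sample))
```

The function accepts any iterable of lines, so it can be fed an open file object and stops at the first out-of-order line.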

__What is the most viewed (```domain_name```, ```page_title```) couple in the file pagecounts-20160801-000000?__

In [8]:
file_number = 0
prefix = "pagecounts-20160801-"
fname = prefix + "{:02}0000".format(file_number)
! gunzip -k "$fname".gz
max_count_views, max_domain_name, max_page_title = 0, None, None
# Read the file line by line and extract the fields; the with block closes the file
with open(fname, "r", encoding='utf-8') as file:
    for line in file:
        domain_name, page_title, count_views, total_response_size = line.split(' ')
        count_views = int(count_views)
        # keep the line with the highest view count seen so far
        if count_views > max_count_views:
            max_count_views = count_views
            max_domain_name = domain_name
            max_page_title = page_title
! rm "$fname"

print("({domain_name}, {page_title}) is the most viewed (domain_name, page_title) with {count_views} views".format(
    domain_name=max_domain_name,
    page_title=max_page_title,
    count_views=max_count_views))

(en.mw, en) is the most viewed (domain_name, page_title) with 5336925 views
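A single maximum says little about what comes next in the ranking. As a possible extension, the sketch below keeps the `k` most viewed couples in one pass with `heapq.nlargest` from the standard library. The function name `top_pages` is hypothetical, and the sketch assumes four whitespace-separated fields per line.

```python
import heapq


def top_pages(lines, k=5):
    """Return the k (domain_name, page_title, count_views) triples with the most views."""
    def records():
        for line in lines:
            domain_name, page_title, count_views, _size = line.split()
            yield domain_name, page_title, int(count_views)

    # nlargest streams the input and keeps only k candidates in memory
    return heapq.nlargest(k, records(), key=lambda record: record[2])


print(top_pages(["en A 5 100", "en B 9 50", "fr C 7 10"], k=2))
```

Because `lines` can be an open file object, the whole file never has to be held in memory.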


__Which (```domain_name```, ```page_title```) couple in the file pagecounts-20160801-000000 has the largest ```total_response_size```?__

In [9]:
file_number = 0
prefix = "pagecounts-20160801-"
fname = prefix + "{:02}0000".format(file_number)
! gunzip -k "$fname".gz
max_total_response_size, max_domain_name, max_page_title = 0, None, None
# Read the file line by line and extract the fields; the with block closes the file
with open(fname, "r", encoding='utf-8') as file:
    for line in file:
        domain_name, page_title, count_views, total_response_size = line.split(' ')
        total_response_size = int(total_response_size)
        # keep the line with the largest response size seen so far
        if total_response_size > max_total_response_size:
            max_total_response_size = total_response_size
            max_domain_name = domain_name
            max_page_title = page_title
! rm "$fname"

print("({domain_name}, {page_title}) couple has the largest total_response_size: {total_response_size} bytes".format(
    domain_name=max_domain_name,
    page_title=max_page_title,
    total_response_size=max_total_response_size))

(en.mw, en) couple has the largest total_response_size: 122044585824 bytes


__What is the most visited ```domain_name```?__

In [10]:
from collections import defaultdict

file_number = 0
prefix = "pagecounts-20160801-"
fname = prefix + "{:02}0000".format(file_number)
! gunzip -k "$fname".gz
domain_name_counts = defaultdict(int)
# Read the file line by line and sum the views per domain; the with block closes the file
with open(fname, "r", encoding='utf-8') as file:
    for line in file:
        domain_name, page_title, count_views, total_response_size = line.split(' ')
        domain_name_counts[domain_name] += int(count_views)
    
max_count_views, max_domain_name = 0, None
for domain_name, count_views in domain_name_counts.items():
    if count_views > max_count_views:
        max_count_views = count_views
        max_domain_name = domain_name
    
! rm "$fname"

print("{domain_name} is the most viewed domain_name with {count_views} views".format(
    domain_name=max_domain_name,
    count_views=max_count_views))

en is the most viewed domain_name with 6949620 views
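The aggregation above can be expressed more compactly with `collections.Counter`, whose `most_common` method returns the domains ranked by views. This is an alternative sketch, not the cell's implementation. The function name `views_per_domain` is hypothetical.

```python
from collections import Counter


def views_per_domain(lines):
    """Sum count_views per domain_name over an iterable of pagecounts lines."""
    counts = Counter()
    for line in lines:
        domain_name, _page_title, count_views, _size = line.split()
        counts[domain_name] += int(count_views)
    return counts


counts = views_per_domain(["en A 5 100", "fr C 7 10", "en B 9 50"])
print(counts.most_common(1))
```

`most_common(n)` also gives the top `n` domains directly, which the manual loop over `items()` would need extra code for.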


In [11]:
files_lines_count = []
for file_number in tqdm(range(24)):
    prefix = "pagecounts-20160801-"
    fname = prefix + "{:02}0000".format(file_number)
    ! gunzip -k "$fname".gz
    lines_count = ! wc -l "$fname"
    # store lines_count
    files_lines_count.extend(lines_count)
    ! rm "$fname"

100%|██████████| 24/24 [00:46<00:00,  1.95s/it]


__Which files have the highest and lowest number of lines?__

In [12]:
max_lines, max_file = 0, None
min_lines, min_file = float("inf"), None
for lines_count in files_lines_count:
    # split() ignores any leading spaces, which vary between wc implementations
    nb_lines_str, file_name = lines_count.split()
    nb_lines = int(nb_lines_str)
    if nb_lines > max_lines:
        max_lines = nb_lines
        max_file = file_name
    if nb_lines < min_lines:
        min_lines = nb_lines
        min_file = file_name
        
print("File {file_name} has the maximum number of lines {nb_lines}".format(
    file_name=max_file,
    nb_lines=max_lines))
print("File {file_name} has the minimum number of lines {nb_lines}".format(
    file_name=min_file,
    nb_lines=min_lines))

File pagecounts-20160801-150000 has the maximum number of lines 7846121
File pagecounts-20160801-020000 has the minimum number of lines 5838509
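The same result can be obtained with the built-in `max` and `min` functions once the `wc -l` output is parsed into a dictionary. This is a sketch with hypothetical helper names (`parse_wc_line`, `extremes`); it assumes each entry has the form `<count> <file name>`, with or without leading spaces.

```python
def parse_wc_line(wc_line):
    """Parse one line of `wc -l` output into (file_name, number_of_lines)."""
    nb_lines_str, file_name = wc_line.split()
    return file_name, int(nb_lines_str)


def extremes(wc_lines):
    """Return ((max_file, max_lines), (min_file, min_lines))."""
    counts = dict(parse_wc_line(wc_line) for wc_line in wc_lines)
    largest = max(counts, key=counts.get)
    smallest = min(counts, key=counts.get)
    return (largest, counts[largest]), (smallest, counts[smallest])


print(extremes([
    " 6270942 pagecounts-20160801-000000",
    " 7846121 pagecounts-20160801-150000",
    " 5838509 pagecounts-20160801-020000",
]))
```

With this approach no initial value has to be chosen for the minimum or the maximum.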


__What is the total number of lines across all files?__

In [13]:
total_lines_count = sum(int(lines_count.split()[0]) for lines_count in files_lines_count)
print("There are {total_lines_count} lines across all files".format(total_lines_count=total_lines_count))

There are 164306459 lines across all files
