# Wikipedia Edits

## Week 6. Practice Programming Assignment 1

Let's look at the five most populous countries of the European Union: [Germany](https://en.wikipedia.org/w/index.php?title=Germany), [France](https://en.wikipedia.org/w/index.php?title=France), [Italy](https://en.wikipedia.org/w/index.php?title=Italy), [Spain](https://en.wikipedia.org/w/index.php?title=Spain), and [Poland](https://en.wikipedia.org/w/index.php?title=Poland).


In this assignment you will examine the Wikipedia pages of these countries, specifically each page's edit history (click 'View history' at the top right of the article page). You may use any of the scraping tools we have covered to answer the questions below.
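
Each article's history view can also be reached directly by URL, which is convenient for scraping. The following is a minimal sketch, assuming the `title`, `action=history` and `limit` query parameters of `index.php`; the helper name `history_url` is hypothetical and not part of the assignment.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/w/index.php"


def history_url(title, limit=50):
    """Build the URL of an article's revision-history page (hypothetical helper)."""
    # urlencode escapes characters that are not URL-safe, e.g. spaces in titles
    query = urlencode({"title": title, "action": "history", "limit": limit})
    return f"{BASE_URL}?{query}"


for country in ["Germany", "France", "Italy", "Spain", "Poland"]:
    print(history_url(country, limit=3000))
```

A larger `limit` returns more revisions per request, so fewer pages have to be fetched to cover a whole year.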

<br><br><br><br>

### Coding part

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
def parse_history_change_to_df(urls, tags, attrs, year=None):
    """
    Parse the edit history of Wikipedia pages into a pandas DataFrame.

        :param urls: dict mapping each history-page URL to its country name
        :param tags: list of tag names to search for
        :param attrs: dict of attributes (here, CSS classes) to match
        :param year: optional year to keep, as a four-character string (e.g. '2019')

        :returns: pandas DataFrame with columns 'country', 'date' (year only), 'user'
    """
    columns = ['country', 'date', 'user']
    frames = []

    # Loop through the history pages
    for url, country in urls.items():
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')

        # Matching elements in document order; this assumes that each revision
        # contributes a date element followed by a user element
        data = soup.find_all(tags, attrs=attrs)

        # Rows (country, date, user) for the DataFrame
        collection = []

        # Pair each date element with the user element that follows it
        for i in range(0, len(data) - 1, 2):
            # find_all matches an element if any one of its classes is listed, so
            # elements that also carry 'history-deleted' are returned too; skip them
            if 'history-deleted' not in data[i].get('class', []):
                collection.append([country, data[i].text, data[i + 1].text])

        # Create a DataFrame and keep only the year (last four characters of the date)
        temp_df = pd.DataFrame(collection, columns=columns)
        temp_df['date'] = temp_df['date'].str[-4:]

        # Filter by year if specified
        if year:
            temp_df = temp_df[temp_df['date'] == year]

        # Sanity check
        print(f"Country: {country}; status code {r.status_code}\n\tTotal edits: {len(temp_df)}\n\tUnique users: {temp_df['user'].nunique()}\n")

        frames.append(temp_df)

    # Concatenate the per-country DataFrames (empty result if no URLs were given)
    return pd.concat(frames) if frames else pd.DataFrame(columns=columns)
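
The function keeps only the last four characters of each timestamp, which is enough to filter by year. If full dates were needed, for example to count edits per calendar day, the timestamp text could be parsed instead. The following is a minimal sketch, assuming the history page renders timestamps like `18:42, 5 December 2019`; a different site language or date preference would need a different format string. `parse_history_timestamp` is a hypothetical helper, not part of the assignment code.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "18:42, 5 December 2019"
HISTORY_TS_FORMAT = "%H:%M, %d %B %Y"


def parse_history_timestamp(text):
    """Parse a history-page timestamp string into a datetime (hypothetical helper)."""
    return datetime.strptime(text.strip(), HISTORY_TS_FORMAT)


ts = parse_history_timestamp("18:42, 5 December 2019")
print(ts.year, ts.date())
```

Note that `%B` matches month names in the current locale, so this sketch assumes an English (or default C) locale.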


In [4]:
# params
COUNTRIES = ['Germany', 'France', 'Italy', 'Spain', 'Poland']
TAGS = ['a', 'span']
CLASSES = {'class': ['mw-changeslist-date', 'mw-userlink']}
LIMIT = 3000
YEAR = '2019'
URLS = [f'https://en.wikipedia.org/w/index.php?title={c}&action=history&offset=&limit={LIMIT}' for c in COUNTRIES]

# dict of urls and country names for convenience
URL_COUNTRY_DICT = dict(zip(URLS, COUNTRIES))

df = parse_history_change_to_df(urls=URL_COUNTRY_DICT, tags=TAGS, attrs=CLASSES, year=YEAR)

Country: Germany; status code 200
	Total edits: 296
	Unique users: 135

Country: France; status code 200
	Total edits: 401
	Unique users: 152

Country: Italy; status code 200
	Total edits: 601
	Unique users: 139

Country: Spain; status code 200
	Total edits: 375
	Unique users: 125

Country: Poland; status code 200
	Total edits: 530
	Unique users: 96



In [5]:
df

| | country | date | user |
|---|---|---|---|
| 1544 | Germany | 2019 | Zutt |
| 1545 | Germany | 2019 | Monkbot |
| 1546 | Germany | 2019 | InternetArchiveBot |
| 1547 | Germany | 2019 | DDWilliams1 |
| 1548 | Germany | 2019 | Bender the Bot |
| ... | ... | ... | ... |
| 2721 | Poland | 2019 | PrimaPrime |
| 2722 | Poland | 2019 | Haribanshnp |
| 2723 | Poland | 2019 | Radom1967 |
| 2724 | Poland | 2019 | Powertranz |


<br><br>

### Questions

<br><br>

**Question 1.** How many edits overall were made on pages of all the countries in 2019? 

In [23]:
answer_part_1 = len(df)
answer_part_1

2203

<br>

**Question 2.** What is the highest number of edits per country in 2019 among all countries present? 

In [24]:
answer_part_2 = df.groupby('country').size().max()
answer_part_2

601

<br>

**Question 3.** What is the lowest number of edits per country in 2019 among all countries present? 

In [32]:
answer_part_3 = min(df.groupby('country').count()['date'])
answer_part_3

296

<br>

**Question 4.** How many users overall made edits on the countries' pages in 2019? 

In [33]:
answer_part_4 = df.user.nunique()
answer_part_4

480

<br>

**Question 5.** What is the highest number of users who made edits to a single country's page in 2019, among all countries present?

In [34]:
answer_part_5 = df.groupby('country').aggregate({'user': 'nunique'})['user'].max()
answer_part_5

152

<br><br>

**Question 6.** Which user made the most edits?

In [35]:
answer_part_6 = df.groupby('user').count()['date'].idxmax()
answer_part_6

'Merangs'

<br>

**Question 7.** What is the average number of edits per day in 2019?

In [36]:
# 'date' holds only the year, so every row is one 2019 edit; 2019 had 365 days
answer_part_7 = len(df) / 365
answer_part_7

6.035616438356165
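
As a cross-check, the same figure follows from the per-country totals printed by the sanity check in the coding part. The following is a small sketch using only those printed numbers; the dictionary is copied from that output, not scraped again.

```python
import calendar

# Per-country 2019 edit counts, copied from the sanity-check output
edits_2019 = {"Germany": 296, "France": 401, "Italy": 601, "Spain": 375, "Poland": 530}

# 2019 is not a leap year, so it has 365 days
days_in_2019 = 366 if calendar.isleap(2019) else 365

total_edits = sum(edits_2019.values())
busiest = max(edits_2019, key=edits_2019.get)

print(total_edits)                                   # 2203
print(total_edits / days_in_2019)                    # about 6.04 edits per day
print(busiest, edits_2019[busiest] / days_in_2019)   # Italy, about 1.65 edits per day
```

The total matches the answer to Question 1, and the per-day figure for the busiest country is the quantity asked for in Question 9.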

<br>

**Question 8.** What is the average number of edits per user in 2019?

In [37]:
answer_part_8 = df.groupby('user').count()['date'].mean()
answer_part_8

4.589583333333334

<br>

**Question 9.** What is the average number of edits per day in the country with the most edits in 2019?

In [38]:

country_w_most_edits = df.groupby('country').size().idxmax()
# Boolean indexing keeps only that country's rows; 2019 had 365 days
answer_part_9 = len(df[df['country'] == country_w_most_edits]) / 365
answer_part_9

1.6465753424657534

<br>

**Question 10.** What is the average number of edits per user in the country with the most edits in 2019?

In [39]:
answer_part_10 = df[df['country'] == country_w_most_edits].groupby('user').size().mean()
answer_part_10

4.323741007194244

<br>
<br>

#### Submit your answers