Create a CSV file with the following Wikipedia activity metrics for the pages of Forbes 100 celebrities:
- edits = no. of edits
- size = total size of edits (sum of the `size` column over all edits)
- users = no. of unique contributing users

### Import libraries

In [1]:
import pandas as pd
import numpy as np

### Define output file path & filename

In [2]:
fn_out = 'output/wikipedia_metric.csv'

### Read in data from CSV file

In [3]:
fn = 'input/wikipedia_edits.csv.zip'
df = pd.read_csv(fn)
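
`pd.read_csv` infers the compression from the `.zip` extension, so the archive does not need to be unpacked first. A minimal sketch of that behaviour on a made-up file (the `toy` frame, the temporary path and the archive name are hypothetical, not part of the real input):

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the real edits file.
toy = pd.DataFrame({"title": ["A", "B"], "revid": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_edits.csv.zip")
    # Write a zip archive that holds a single CSV file.
    toy.to_csv(
        path,
        index=False,
        compression={"method": "zip", "archive_name": "toy_edits.csv"},
    )
    # compression="infer" is the default, so the .zip suffix is enough.
    loaded = pd.read_csv(path)

print(loaded)
```

This only works when the archive contains exactly one file.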

Count no. of rows

In [4]:
len(df)

789343

Inspect first rows

In [5]:
df.head()

,title,parentid,revid,timestamp,user,userid,size,recipient,year,rank,country,career,tied
0,50_Cent,858765697,858858564,2018-09-10T02:52:25Z,ProspectIV,33893830.0,132452,50 Cent,2006,8,United States,Musician,0
1,50_Cent,858673799,858765697,2018-09-09T13:11:00Z,ProspectIV,33893830.0,132476,50 Cent,2006,8,United States,Musician,0
2,50_Cent,858673121,858673799,2018-09-08T21:46:42Z,ProspectIV,33893830.0,132444,50 Cent,2006,8,United States,Musician,0
3,50_Cent,858673016,858673121,2018-09-08T21:39:37Z,ProspectIV,33893830.0,132345,50 Cent,2006,8,United States,Musician,0
4,50_Cent,858469414,858673016,2018-09-08T21:38:24Z,ProspectIV,33893830.0,132345,50 Cent,2006,8,United States,Musician,0


### Aggregate using groupby

1. Get the total no. of Wikipedia edits per celebrity

In [6]:
df_edits = df.groupby('title')['revid'].agg([len])

In [7]:
df_edits.head()

title,len
50_Cent,13066
Adele,6725
Angelina_Jolie,7336
Backstreet_Boys,10270
Ben_Affleck,9263


Rename column to edits

In [8]:
df_edits.columns = ['edits']
df_edits.head()

title,edits
50_Cent,13066
Adele,6725
Angelina_Jolie,7336
Backstreet_Boys,10270
Ben_Affleck,9263


Inspect top celebrities by no. of edits - this replicates what we got earlier from df['title'].value_counts()

In [9]:
df_edits.sort_values(by='edits', ascending=False).head(5)

title,edits
Roger_Federer,24641
Britney_Spears,24640
The_Beatles,23342
Beyonce,20811
Eminem,19899
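
The equivalence with `value_counts()` can be checked on a small made-up frame. This is only a sketch: the `toy` data are hypothetical, and only the `title` and `revid` column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per revision, as in the real input.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "A", "C", "B"],
    "revid": [11, 12, 21, 13, 31, 22],
})

# Same aggregation as above: number of rows per title.
by_groupby = toy.groupby("title")["revid"].agg([len])
by_groupby.columns = ["edits"]

# value_counts() counts the rows per title directly.
by_value_counts = toy["title"].value_counts()

# Compare the two results as plain {title: count} dictionaries.
same = by_groupby["edits"].to_dict() == by_value_counts.to_dict()
print(same)
```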


2. Get the total size of Wikipedia edits per celebrity

In [10]:
df_size = df.groupby('title')['size'].agg(['sum'])

In [11]:
df_size.columns = ['size']

In [12]:
df_size.head()

title,size
50_Cent,518204129
Adele,456912673
Angelina_Jolie,531496219
Backstreet_Boys,412497115
Ben_Affleck,906521569


Inspect top celebrities by total size of edits

In [13]:
df_size.sort_values(by='size', ascending=False).head(5)

title,size
Roger_Federer,2545161596
Cristiano_Ronaldo,2351953387
Paul_McCartney,2059061790
The_Beatles,2055056731
Lionel_Messi,1977764924
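
Note on interpretation: assuming `size` holds the page length in bytes after each revision (an assumption about the input, not something verified here), the sum grows with both the number of edits and the length of the page. The sketch below contrasts it with the bytes changed per revision on hypothetical toy data. The `delta` column and the `bytes_changed` metric are illustrative only and are not written to the output file.

```python
import pandas as pd

# Hypothetical toy data: page size after each revision.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "timestamp": [
        "2018-01-01T00:00:00Z", "2018-01-02T00:00:00Z", "2018-01-03T00:00:00Z",
        "2018-01-01T00:00:00Z", "2018-01-02T00:00:00Z",
    ],
    "size": [100, 120, 110, 50, 80],
})

# Metric used in this notebook: sum of the size column per title.
total_size = toy.groupby("title")["size"].sum()

# Illustrative alternative: absolute change in size between consecutive
# revisions of the same page (the first revision of each page has no delta).
toy = toy.sort_values(["title", "timestamp"])
toy["delta"] = toy.groupby("title")["size"].diff()
bytes_changed = toy["delta"].abs().groupby(toy["title"]).sum()

print(total_size)
print(bytes_changed)
```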


3. Get no. of (unique) Wikipedia users who contributed edits per celebrity

In [14]:
df_users = df.groupby(['title'])['user'].nunique()
df_users = pd.DataFrame(df_users)
df_users.columns = ['users']

In [15]:
df_users.head()

title,users
50_Cent,4557
Adele,2672
Angelina_Jolie,2902
Backstreet_Boys,4050
Ben_Affleck,3655


Inspect top celebrities by no. of contributing users

In [16]:
df_users.sort_values(by='users', ascending=False).head(5)

title,users
Roger_Federer,8019
Eminem,7247
The_Beatles,6971
Britney_Spears,6896
Dwayne_Johnson,6727
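
`nunique()` ignores missing values by default, so edits without a recorded `user` (if there are any in the input) do not add to the count. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing user name.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "user": ["ann", "bob", "ann", np.nan, "cat"],
})

# Default behaviour (as used above): NaN is not counted as a user.
users = toy.groupby("title")["user"].nunique()

# With dropna=False, NaN is counted as one additional distinct value.
users_incl_missing = toy.groupby("title")["user"].nunique(dropna=False)

print(users)
print(users_incl_missing)
```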


### Merge results into 1 dataframe

First, reset index of each dataframe, then merge on title column

In [17]:
df_edits.reset_index(inplace=True)
df_size.reset_index(inplace=True)
df_users.reset_index(inplace=True)

In [18]:
df_edits.head()

,title,edits
0,50_Cent,13066
1,Adele,6725
2,Angelina_Jolie,7336
3,Backstreet_Boys,10270
4,Ben_Affleck,9263


In [19]:
df_out = pd.merge(df_edits, df_size, on='title')

In [21]:
df_out = pd.merge(df_out, df_users, on='title')
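
As an alternative to three separate groupbys followed by merges, the same table could probably be built in one pass with named aggregation. This is a sketch on hypothetical toy data, not the approach used in this notebook; the `metrics` name is made up.

```python
import pandas as pd

# Hypothetical toy data with the columns used for the three metrics.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "revid": [1, 2, 3, 4, 5],
    "size": [100, 120, 110, 50, 80],
    "user": ["ann", "bob", "ann", "cat", "cat"],
})

metrics = (
    toy.groupby("title")
       .agg(
           edits=("revid", "size"),      # rows per title
           size=("size", "sum"),         # sum of the size column
           users=("user", "nunique"),    # distinct non-missing users
       )
       .reset_index()                    # turn the title index into a column
)

print(metrics)
```

Each keyword names an output column, and the tuple gives the input column and the aggregation to apply to it.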

### Save dataframe to CSV file

In [23]:
df_out.to_csv(fn_out, index=False)
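
`to_csv` does not create missing directories, so the `output/` folder has to exist before saving. A minimal sketch that creates the folder and then reads the file back as a sanity check (the temporary path and the `toy_out` frame are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_out.
toy_out = pd.DataFrame({
    "title": ["A", "B"],
    "edits": [3, 2],
    "size": [330, 130],
    "users": [2, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output", "wikipedia_metric.csv")
    # Create the output folder if it is not there yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    toy_out.to_csv(path, index=False)
    # Read the file back to confirm the columns and row count.
    round_trip = pd.read_csv(path)

print(round_trip.shape)
print(list(round_trip.columns))
```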