Our goal is to build a list of participants across IETF groups. With that list, we can evaluate patterns of participation: how many people participate, in which groups, and how affiliation, gender, RFC authorship, or other characteristics relate to levels of participation, among other related questions.

Start by importing the necessary libraries.

In [7]:
%matplotlib inline
import bigbang.mailman as mailman
import bigbang.graph as graph
import bigbang.process as process
from bigbang.parse import get_date
from bigbang.archive import Archive
import bigbang.utils as utils
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import math
import pytz
import pickle
import os
import csv
import re
import scipy
import scipy.cluster.hierarchy as sch
import email

Let's start with a single IETF mailing list. (Later, we can expand to all current groups, or all IETF lists ever.)

In [32]:
list_url = 'https://www.ietf.org/mail-archive/text/perpass/' # perpass happens to be one that I subscribe to

ietf_archives_dir = '../../ietf-archives' # relative location of the ietf-archives directory/repo

list_archive = mailman.open_list_archives(list_url, ietf_archives_dir)
activity = Archive(list_archive).get_activity()

Opening 43 archive files


In [33]:
people = pd.DataFrame(activity.sum(0), columns=['perpass']) # total messages per sender, summed over all dates

In [34]:
people.describe()

statistic,perpass
count,261.0
mean,8.015326
std,18.733961
min,1.0
25%,1.0
50%,2.0
75%,7.0
max,231.0


Split the From header we started with into an email address and a display name.

In [35]:
# froms = pd.Series(people.index)
# emails = froms.apply(lambda x: email.utils.parseaddr(x)[1])
# emails.index = people.index
# names = froms.apply(lambda x: email.utils.parseaddr(x)[0])
# names.index = people.index
# people['email'] = emails
# people['name'] = names
# people
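
The same split can be tried outside the DataFrame. The following is a minimal sketch using only the standard library, not BigBang's own method: the helper `split_from` and the small `counts` list are hypothetical, with header strings and counts copied from the perpass data. It also shows why the address is a better key than the full header: two headers that differ only by a trailing space in the display name collapse into one participant.

```python
from collections import defaultdict
from email.utils import parseaddr

# (From header, message count) pairs taken from the perpass data.
counts = [
    ('"Adrian Farrel" <adrian@olddog.co.uk>', 3),
    ('"Christian Huitema " <huitema@huitema.net>', 1),
    ('"Christian Huitema" <huitema@huitema.net>', 39),
]

def split_from(header):
    """Return (display name, lowercased address) for one From header."""
    name, address = parseaddr(header)
    return name.strip(), address.lower()

# Add up message counts per address rather than per raw header.
totals = defaultdict(int)
for header, n in counts:
    _, address = split_from(header)
    totals[address] += n

for address in sorted(totals):
    print(address, totals[address])
```

Lowercasing the address is a simplifying assumption; it treats addresses that differ only by case as the same person.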

**Warning: long-running step.** Now repeat, parsing the archives and collecting the activities for all the mailing lists in the corpus.

In [37]:
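# Read one list URL per line, dropping trailing newlines and blank lines.
with open('ietf_lists_normalized.txt', 'r') as f:
    ietf_lists = [line.strip() for line in f if line.strip()]

list_archives = []

for list_url in ietf_lists:
    try:
        archives = mailman.open_list_archives(list_url, ietf_archives_dir)
        list_archives.append((list_url, archives))
    except Exception as e:
        # Report the failure and continue with the remaining lists.
        print(str(e))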

Opening 46 archive files
x-unknown unknown encoding in message <Pine.GSO.4.64.0906040915000.16060@tin>, using UTF-8 instead
x-unknown unknown encoding in message <Pine.GSO.4.64.0906101255320.9230@tin>, using UTF-8 instead
Opening 54 archive files
Opening 105 archive files
windows-874 unknown encoding in message <BAY124-W4FAEF231B7F26B42380CCE44B0@phx.gbl>, using UTF-8 instead
windows-874 unknown encoding in message <BAY124-W4FAEF231B7F26B42380CCE44B0@phx.gbl>, using UTF-8 instead
x-windows-949 unknown encoding in message <43EA2A09.1070600@eng.sun.com>, using UTF-8 instead
Opening 170 archive files
cp 1252 unknown encoding in message <1513056.oYYGoBVvp1@linne>, using UTF-8 instead
Opening 33 archive files
Opening 58 archive files
x-unknown unknown encoding in message <Pine.LNX.4.64.1506101537480.3552@ece.iisc.ernet.in>, using UTF-8 instead
Opening 10 archive files
Opening 79 archive files
Opening 3 archive files
Opening 47 archive files
Opening 36 archive files
Opening 19 archive files


In [38]:
len(list_archives)

341

In [54]:
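import bigbang.archive  # bind the `bigbang` name so the module can be reloaded
reload(bigbang.archive)  # Python 2 builtin; in Python 3 use importlib.reload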

<module 'bigbang.archive' from '/Users/nick/code/mailing-list-analysis/bigbang/bigbang/archive.py'>

In [55]:
activity_frames = []
for (list_url, list_archive) in list_archives:
    activity = bigbang.archive.Archive(list_archive).get_activity()
    list_name = mailman.get_list_name(list_url)
    activity_frame = pd.DataFrame(activity.sum(0), columns=[list_name])
    activity_frames.append(activity_frame)


OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1601-01-01 00:39:27
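
This error appears because pandas, at its default nanosecond resolution, stores timestamps as 64-bit nanosecond counts around 1970, which covers roughly the years 1677 to 2262. A message with a bogus 1601 date falls outside that range. The sketch below uses only the standard library to compute approximate bounds and test a date against them. The helper `fits_nanosecond_timestamp` is hypothetical, it assumes naive (timezone-free) datetimes, and where such a check would belong inside BigBang is left open.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
# Largest offset a signed 64-bit nanosecond count can hold, in microseconds.
MAX_OFFSET = datetime.timedelta(microseconds=(2**63 - 1) // 1000)
LOWER = EPOCH - MAX_OFFSET
UPPER = EPOCH + MAX_OFFSET

def fits_nanosecond_timestamp(dt):
    """True if a naive datetime fits in a 64-bit nanosecond timestamp."""
    return LOWER <= dt <= UPPER

bad = datetime.datetime(1601, 1, 1, 0, 39, 27)   # the date from the error
good = datetime.datetime(2015, 6, 10)

print(LOWER.year, UPPER.year)
print(fits_nanosecond_timestamp(bad), fits_nanosecond_timestamp(good))
```

Messages that fail the check could be dropped or set aside for inspection before the activity table is built.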

In [31]:
# `acme` is assumed to be a per-sender frame for the acme list, built the same way as `people`
pd.merge(people, acme, how='outer', left_index=True, right_index=True)

From,perpass,email,name,acme
"""Adrian Farrel"" <adrian@olddog.co.uk>",3.0,adrian@olddog.co.uk,Adrian Farrel,
"""Andreas Kuckartz"" <a.kuckartz@ping.de>",8.0,a.kuckartz@ping.de,Andreas Kuckartz,
"""Andy Ligg"" <andy@startssl.com>",,,,2.0
"""Bernie Volz (volz)"" <volz@cisco.com>",1.0,volz@cisco.com,Bernie Volz (volz),
"""Carl S. Gutekunst"" <cgutekunst@sonicwall.com>",3.0,cgutekunst@sonicwall.com,Carl S. Gutekunst,
"""Christian Huitema "" <huitema@huitema.net>",1.0,huitema@huitema.net,Christian Huitema,
"""Christian Huitema"" <huitema@huitema.net>",39.0,huitema@huitema.net,Christian Huitema,
"""Cullen Jennings (fluffy)"" <fluffy@cisco.com>",3.0,fluffy@cisco.com,Cullen Jennings (fluffy),
"""Dickson, Brian"" <bdickson@verisign.com>",1.0,bdickson@verisign.com,"Dickson, Brian",
"""Dr. Pala"" <director@openca.org>",,,,3.0

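
The outer merge keeps every sender seen on either list and leaves a missing value where a sender never posted to one of them. A small sketch with made-up data shows the effect. The addresses and counts here are hypothetical placeholders, not values from the archives.

```python
import pandas as pd

# Per-list message counts, indexed by sender (hypothetical data).
perpass = pd.DataFrame({'perpass': [3, 8]},
                       index=['alice@example.org', 'bob@example.org'])
acme = pd.DataFrame({'acme': [2, 3]},
                    index=['bob@example.org', 'carol@example.org'])

# Outer join on the index: senders missing from one list get NaN there.
merged = pd.merge(perpass, acme, how='outer',
                  left_index=True, right_index=True)

# Treat "never posted" as zero messages when totalling across lists.
totals = merged.fillna(0).sum(axis=1)

print(merged)
print(totals)
```

Filling the gaps with zero is a choice made for the total only; keeping the missing values in `merged` preserves the difference between "posted nothing" and "not on the list".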