#Class 3 - Processing Files with Python
###DATSF 19
####Justin Breucop - 12/7/2015

For a lot of data files in class we'll use functionality from various libraries to process data very quickly. However, for custom files, raw text, and data that is configured in a non-standard way, it is important to be able to extract data in a customized fashion. We'll go through this exercise using only libraries that come with the default python distribution. The first step will be to open the file in sublime.

Let's say that we are curious about the latest release of ScikitLearn, since we are (or soon will be) frequent users. Our goal is to take the raw commits, sort our authors alphabetically and also count the number of contributions they made. Let's first look at the file. You can do this via the command line but for simplicity's sake we can use the Jupyter cell magic.

In [None]:
# For Max/Linux users:
! more ../data/raw_commits.txt

# For windows users:
# ! more ..\data\raw_commits.txt

We see that each commit has an Author and a date. We need to be able to read the file line by line and add to a list of authors. Remember to use `with open('<filename>') as <variable>` where `<filename>` is the full path to the file and the `<variable>` is any identifier (such as `f`).

##### Lines of file -> List of Strings

In [3]:
print "hello w"

hello w


In [60]:
# Open the file and try printing all lines that start with author

authors = []

with open('../data/raw_commits.txt', 'rw') as filename:
    for line in filename:
        if line.split(" ")[0][:6].lower() == "author":
            authors.append(line[8:].split(' <')[0])
        else:
            pass
        

print authors

            
            
# Make sure to append the author name to the list. You'll need to use string manipulation techniques.




['Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'KamalakerDadi', 'Andreas Mueller', 'Graham Clenaghan', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'trevorstephens', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'Andreas Mueller', 'trevorstephens', 'Olivier Grisel', 'TomDLT', 'Lo\xc3\xafc Est\xc3\xa8ve', 'Andreas Mueller', 'Ganiev Ibraim', 'Giorgio Patrini', 'MartinBpr', 'Andreas Mueller', 'Arthur Mensch', 'Andreas Mueller', 'Jeffrey04', 'MaryanMorel', 'Arnaud Rachez', 'Olivier Grisel', 'giorgiop', 'Gilles Louppe', 'Andreas Mueller', 'MechCoder', 'Olivier Grisel', 'Olivier Grisel', 'Andreas Mueller', 'giorgiop', 'Olivier Grisel', 'Olivier Grisel', 'Joel Nothman', 'Alexandre Gramfort', 'Andreas Mueller', 'Olivi

In [61]:
#Sort the authors to find the first and last authors, alphabetically. Make sure your data is clean! (No username should begin with an = sign, for example)

#####List of Strings -> Sorted unique list


sorted_authors = list(set(sorted([x for x in authors if ''.join(x.split()).isalnum()])))
print sorted_authors


['Lars', 'sinhrks', 'Rohan Ramanath', 'Gryllos Prokopis', 'Steven Seguin', 'Lilian Besson', 'Thomas Unterthiner', 'Daniel Kronovet', 'Dan Blanchard', 'Andrew Lamb', 'Dougal Sutherland', 'Alexey Grigorev', 'maheshakya', 'Skipper Seabold', 'Yucheng Low', 'Vincent', 'tokoroten', 'Ando Saabas', 'Alexandre Gramfort', 'Vighnesh Birodkar', 'Boyuan Deng', 'scls19fr', 'Peter Fischer', 'Ganiev Ibraim', 'Kyler Brown', 'Christopher Erick Moody', 'MaryanMorel', 'Tian Wang', 'Stephen Hoover', 'Joshua Loyal', 'Jaidev Deshpande', 'Cindy Sridharan', 'Allen Riddell', 'Ari Rouvinen', 'Zac Stewart', 'John Wittenauer', 'Eric Martin', 'Matti Lyra', 'Donne Martin', 'Martin Ku', 'Frank Zalkow', 'edson duarte', 'Jacob Schreiber', 'Joel Nothman', 'mbillinger', 'Manoj Kumar', 'martinosorb', 'Christof Angermueller', 'jfraj', 'John Kirkham', 'Danny Sullivan', 'SimonPL', 'Joseph', 'Varoquaux', 'Preston Parry', 'Jan Hendrik Metzen', 'Tim Head', 'Mathieu Blondel', 'Louis Tiao', 'Andreas Mueller', 'Fernando Carrillo',

In [None]:
# Think of what data types you can take advantage of

To count out our data, we can loop over our list and construct a dictionary where the key is the commit author and the value increases whenever we match a key.
#####List -> Dictionary

In [39]:
author_dict = {}
for y in sorted_authors:
    author_dict[y] = authors.count(y)
print author_dict

{'Lars': 13, 'sinhrks': 2, 'Rohan Ramanath': 1, 'Gryllos Prokopis': 3, 'Steven Seguin': 4, 'Lilian Besson': 3, 'Thomas Unterthiner': 26, 'Daniel Kronovet': 3, 'Dan Blanchard': 1, 'Andrew Lamb': 5, 'Dougal Sutherland': 1, 'Alexey Grigorev': 1, 'maheshakya': 3, 'Raghav R V': 35, 'Jeremy': 1, 'Sam Zhang': 1, 'David Dotson': 2, 'Ando Saabas': 1, 'Alexandre Gramfort': 60, 'Vighnesh Birodkar': 7, 'Boyuan Deng': 3, 'Eric Larson': 1, 'Peter Fischer': 1, 'Ganiev Ibraim': 5, 'Kyler Brown': 1, 'Christopher Erick Moody': 1, 'MaryanMorel': 1, 'Tian Wang': 1, 'Stephen Hoover': 3, 'David D Lowe': 3, 'Jack Martin': 2, 'Cindy Sridharan': 3, 'Allen Riddell': 1, 'Ari Rouvinen': 1, 'Zac Stewart': 1, 'John Wittenauer': 2, 'Eric Martin': 1, 'Matti Lyra': 2, 'Donne Martin': 1, 'Martin Ku': 2, 'Lars Buitinck': 26, 'edson duarte': 1, 'Jacob Schreiber': 16, 'Joel Nothman': 37, 'mhg': 25, 'Manoj Kumar': 5, 'martinosorb': 5, 'Christof Angermueller': 1, 'jfraj': 3, 'John Kirkham': 1, 'Danny Sullivan': 1, 'SimonPL'

Find the contributor with the highest number of commits. Useful dictionary method: `dict.get()`

#####Dictionary -> Specific String

In [51]:
max(author_dict, key = author_dict.get)


'Andreas Mueller'

Bonus: how do you handle a tie? Can you pull all authors with the lowest number of commits (without hardcoding the minimum).

In [59]:
lowest_authors = []
for name, count in author_dict.iteritems():
    if count == min(author_dict.values()):
        lowest_authors.append(name)
        
print '\n'.join(lowest_authors)
        

Rohan Ramanath
Dan Blanchard
Dougal Sutherland
Alexey Grigorev
Jeremy
Sam Zhang
Ando Saabas
Eric Larson
Peter Fischer
Kyler Brown
Christopher Erick Moody
MaryanMorel
Tian Wang
Allen Riddell
Ari Rouvinen
Zac Stewart
Eric Martin
Donne Martin
edson duarte
Christof Angermueller
John Kirkham
Danny Sullivan
Robert Layton
Joseph
Varoquaux
Preston Parry
Fernando Carrillo
Arnaud Rachez
akitty
Jeffrey04
Yury Zhauniarovich
David
santi
Vincent Michel
Aaron Schumacher
Kashif Rasul
Nicolas
Yucheng Low
Timothy Hopper
Jaidev Deshpande
Giorgio Patrini
Ali Baharev
Tom DLT
Omer Katz
Masafumi Oyamada
Shivan Sornarajah
Konstantin Shmelkov
Nikolay Mayorov
Ishank Gulati
Erich Schubert
Ankur Ankan
Keith Goodman
Pauli Virtanen
Dmitry Spikhalskiy
Vincent
Eduardo Caro
KamalakerDadi
Tiago Freitas Pereira
Jean Kossaifi
Theodore Vasiloudis
MartinBpr
banilo
benjaminirving
sseg
Valentin Stolbunov
Raghav
JeanKossaifi
Jake Vanderplas
Anish Shah
Jiali Mei
Rob Zinkov
Brian McFee
Christoph Gohlke
Jungkook Park
