# Using Python's stdlib for data analysis
##### By Yotam Vaknin

### Pre-code : Writing the PrintableList class and loading the data:
In the actual talk I won't show this code until the very end.
1. PrintableList is a list that implements the `_repr_html_` method, so it renders as a table in the Jupyter notebook. I use the pandas implementation only to save time; implementing this part without pandas is straightforward, so I won't go into the details.
2. The second block of code loads the data as a PrintableList of namedtuples, which behaves just like a list of objects.
3. We will look at the links connecting different Wikipedia articles. The original data can be found [here](http://snap.stanford.edu/data/wikispeedia.html).




In [82]:
from pandas import DataFrame
class PrintableList(list):
    def to_data_frame(self):
        if self and hasattr(self[0], "_fields"):  # namedtuples keep their field names in the _fields attribute
            return DataFrame(list(self), columns=self[0]._fields)
        return DataFrame(list(self))
    def _repr_html_(self):
        """
        An HTML representation of a list. Jupyter looks for this method before falling back to __repr__.
        """
        return self.to_data_frame()._repr_html_()
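
For readers curious about the pandas-free route mentioned above, here is a minimal sketch. The class name `PlainPrintableList`, the `Point` sample type, and the bare `<table>` markup are assumptions made for illustration; this is not the code used in the talk.

```python
from collections import namedtuple
from html import escape


class PlainPrintableList(list):
    """Hypothetical pandas-free variant: renders the list as a bare HTML table."""

    def _repr_html_(self):
        if not self:
            return "<table></table>"
        # namedtuples expose their column names through _fields
        fields = getattr(self[0], "_fields", None)
        parts = ["<table>"]
        if fields:
            header = "".join(f"<th>{escape(str(name))}</th>" for name in fields)
            parts.append(f"<tr>{header}</tr>")
        for row in self:
            # Anything that is not a tuple or list becomes a single cell
            cells = row if isinstance(row, (tuple, list)) else [row]
            body = "".join(f"<td>{escape(str(cell))}</td>" for cell in cells)
            parts.append(f"<tr>{body}</tr>")
        parts.append("</table>")
        return "".join(parts)


# Made-up sample rows
Point = namedtuple("Point", ["x", "y"])
table_html = PlainPrintableList([Point(1, 2), Point(3, 4)])._repr_html_()
print(table_html)
```

Escaping every cell with `html.escape` keeps values such as `a<b` from being interpreted as markup by the notebook.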

In [3]:
import csv
from collections import namedtuple
with open("wikipedia_relation_data.csv", encoding="utf8") as f:
    reader = csv.reader(f)
    headers, *data = reader
line = namedtuple("line", headers)
ls = PrintableList(line(*ln) for ln in data)
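
To try the loading pattern without the data file, the same steps can be run on an in-memory CSV. This is only a sketch: the two-column sample and the names `sample`, `Link`, and `links` are made up for illustration.

```python
import csv
import io
from collections import namedtuple

# Made-up sample standing in for the real CSV file
sample = io.StringIO(
    "article_name,linked_name\n"
    "Guqin,China\n"
    "Guqin,Cello\n"
)

reader = csv.reader(sample)
headers, *rows = reader              # first row -> headers, the rest -> rows
Link = namedtuple("Link", headers)   # header names become attribute names
links = [Link(*row) for row in rows]

print(links[0].article_name)         # attribute access by column name
print(links[0][1])                   # still an ordinary tuple underneath
```

Note that `namedtuple` requires every header to be a valid Python identifier, so this pattern assumes clean column names.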

In [17]:
ls

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Kuwait_City,Geography.Geography_of_the_Middle_East,15th_Marine_Expeditionary_Unit,History.Military_History_and_War
1,Kuwait_City,Geography.Geography_of_the_Middle_East,Asia,Geography.Geography_of_Asia
2,Kuwait_City,Geography.Geography_of_the_Middle_East,Kuwait,Geography.Geography_of_the_Middle_East.Middle_...
3,Kuwait_City,Geography.Geography_of_the_Middle_East,Arabic_language,Language_and_literature.Languages
4,Kuwait_City,Geography.Geography_of_the_Middle_East,Capital,Citizenship.Politics_and_government
5,Kuwait_City,Geography.Geography_of_the_Middle_East,City,Geography.General_Geography
6,Kuwait_City,Geography.Geography_of_the_Middle_East,Emirate,Citizenship.Politics_and_government
7,Kuwait_City,Geography.Geography_of_the_Middle_East,Iraq,Geography.Geography_of_the_Middle_East.Middle_...
8,Kuwait_City,Geography.Geography_of_the_Middle_East,Kuwait,Geography.Geography_of_the_Middle_East.Middle_...
9,Kuwait_City,Geography.Geography_of_the_Middle_East,Persian_Gulf,Geography.General_Geography


####  Some data sampling:
Proving that there is no magic here: 'ls' is a list that contains objects.

In [20]:
print(ls[0].article_name)
print(ls[-1].linked_name)
PrintableList(ls[:-4:-1])

Kuwait_City
Water_resources


Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Drinking_water,Everyday_life.Drink,Water_resources,Geography.Geology_and_geophysics
1,Drinking_water,Everyday_life.Drink,Water_purification,Everyday_life.Drink
2,Drinking_water,Everyday_life.Drink,Water,Geography.Climate_and_the_Weather


### Filtering and Counting (5 min)
* The simplest way to filter is a list comprehension.
* We will look at 3 different ways to count and try to decide which one is the best.
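
As a self-contained sketch of filtering with a list comprehension, the snippet below uses a few made-up rows (the `Link` type and its values are assumptions, not taken from the data set) and compares the result with the built-in `filter`:

```python
from collections import namedtuple

# Made-up rows for illustration
Link = namedtuple("Link", ["article_name", "article_category"])
links = [
    Link("Guqin", "Music.Musical_Instruments"),
    Link("Kuwait_City", "Geography.Geography_of_the_Middle_East"),
    Link("Jazz", "Music.Musical_genres_styles_eras_and_events"),
]

# The condition after `if` decides which items are kept
music_links = [link for link in links if link.article_category.startswith("Music")]

# filter() selects the same items, but returns a lazy iterator
same_links = list(filter(lambda link: link.article_category.startswith("Music"), links))

print([link.article_name for link in music_links])
```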


In [107]:
#Filter (showing only the first 5 results for readability)
PrintableList([link for link in ls if "Musical_Instruments" in link.article_category][:5])

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Guqin,Music.Musical_Instruments,Aesthetics,Science.Biology.Health_and_medicine
1,Guqin,Music.Musical_Instruments,China,Countries
2,Guqin,Music.Musical_Instruments,Go_(board_game),Everyday_life.Games
3,Guqin,Music.Musical_Instruments,Beijing,Geography.Geography_of_Asia
4,Guqin,Music.Musical_Instruments,Cello,Music.Musical_Instruments


In [105]:
#The 3 basic ways to Count:
#Naive way:
data = "asdkjbasndkjsdankjsad"

d1 = {}
for i in data:
    if i in d1:
        d1[i] +=1
    else:
        d1[i] = 1

#The short way(using the get method)
d2 = {}
for i in data:
    d2[i] = d2.get(i, 0) + 1

    
#Using defaultdict:
from collections import defaultdict
d3 = defaultdict(int)
for i in data:
    d3[i] += 1
    

#And it's basically all the same:
print(d1 == d2 == d3)

True


#### Which one should we use?
None!

In [106]:
from collections import Counter
print(Counter(data) == d1 == d2 == d3)

True
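
Counter is a dict subclass, which is why the comparison above holds. A short sketch of what it adds on top of a plain dict, using the same sample string (the variable names here are illustrative):

```python
from collections import Counter

counts = Counter("asdkjbasndkjsdankjsad")

print(counts["a"])           # count of a key that was seen
print(counts["z"])           # missing keys count as 0 instead of raising KeyError
print(counts.most_common(2)) # the two highest counts as (item, count) pairs

# Counters can be added together, which sums the counts per key
combined = Counter("ab") + Counter("bc")
print(combined)
```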


In [108]:
Counter([link.linked_name for link in ls]).most_common(5)

[('United_States', 1845),
 ('United_Kingdom', 1140),
 ('Europe', 1092),
 ('France', 1044),
 ('England', 923)]

In [32]:
#This result can look better with PrintableList:
PrintableList(_)

Unnamed: 0,0,1
0,United_States,1845
1,United_Kingdom,1140
2,Europe,1092
3,France,1044
4,England,923


### Sets (5 min)
Sets are "unordered collections of unique elements". So sets have 2 different properties:
* unique
* unordered

Both are very useful!

In [59]:
# We can create sets in 3 different ways, just like lists and dictionaries:
set([1,0,2,3]) == {0,1,1,2,3} == {i for i in range(4)}
# As you can see: unique, unordered

True
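
Beyond construction, the operations that make sets useful are membership tests and the set algebra operators. The sketch below uses two small made-up sets of article names:

```python
# Made-up sets for illustration
music = {"Guitar", "Cello", "Jazz", "Reggae"}
instruments = {"Guitar", "Cello", "Trumpet"}

print("Jazz" in music)        # membership test
print(music & instruments)    # intersection: in both sets
print(music | instruments)    # union: in either set
print(music - instruments)    # difference: in music but not in instruments
print(music ^ instruments)    # symmetric difference: in exactly one set
```

The printed order of the elements may differ between runs, which is the "unordered" property in action.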

In [109]:
#Let's look for all the articles in the Music category:
music_articles = {link.article_name for link in ls if link.article_category.startswith("Music")}
print(music_articles)

{'I_Want_to_Hold_Your_Hand', 'Medieval_music', 'Tone_cluster', 'McFly_(band)', 'Reggae', "Sgt._Pepper's_Lonely_Hearts_Club_Band", 'Himno_Nacional_Mexicano', 'Music_of_the_trecento', 'Rapping', 'American_popular_music', 'Eurovision_Song_Contest', 'Musical_instrument', 'Salsa_music', 'Italo_disco', 'The_Rite_of_Spring', 'Double_bass', 'Ukulele', 'AC_DC', "You're_Still_the_One", 'Bluegrass_music', 'Music_of_Martinique_and_Guadeloupe', 'Alternative_rock', 'Music_of_Thailand', 'Music_of_Albania', 'Music_of_Antigua_and_Barbuda', 'Trumpet', 'Gregorian_chant', 'Where_Did_Our_Love_Go', 'Duran_Duran', 'The_Temptations', 'Van_Halen', 'Oasis_(band)', 'Music_of_Hungary', 'National_Anthem_of_Russia', 'Cello', 'Drum_and_bass', 'Layla', 'Nirvana_(band)', 'Garage_(dance_music)', 'Arctic_Monkeys', 'A_cappella', 'Music_of_New_Zealand', 'Synthesizer', 'U2', 'Music_of_the_Bahamas', 'Ska', 'Nine_Million_Bicycles', 'The_Beatles', 'Guitar', 'Rhythm_and_blues', 'Brass_instrument', 'The_Beatles_discography', 'M

In [110]:
#Musical articles that aren't about an instrument:
print(music_articles - {link.article_name for link in ls if "Musical_Instruments" in link.article_category})

{'I_Want_to_Hold_Your_Hand', 'Medieval_music', 'U2', 'Tone_cluster', 'Music_of_the_Bahamas', 'Ska', 'Nine_Million_Bicycles', 'McFly_(band)', 'Reggae', 'The_Beatles', 'Rhythm_and_blues', "Sgt._Pepper's_Lonely_Hearts_Club_Band", 'The_Beatles_discography', 'Mixtape', 'Renaissance_music', 'Himno_Nacional_Mexicano', 'Jazz', 'Soukous', 'Music_of_the_trecento', 'Rapping', 'Ragtime', 'American_popular_music', 'Beatles_for_Sale', 'Eurovision_Song_Contest', 'Music_of_Barbados', 'Ray_of_Light', 'Music_of_Trinidad_and_Tobago', 'Salsa_music', 'Classic_female_blues', 'Italo_disco', 'The_Rite_of_Spring', 'The_Rolling_Stones', 'Music_of_Dominica', 'AC_DC', "You're_Still_the_One", 'Bluegrass_music', 'Music_of_Martinique_and_Guadeloupe', 'Alternative_rock', 'Music_of_Thailand', 'Music_of_Albania', 'Music_of_Antigua_and_Barbuda', 'Gregorian_chant', 'Music_of_the_United_States', 'Hey_Jude', 'Bohemian_Rhapsody', 'Glastonbury_Festival', 'Where_Did_Our_Love_Go', 'Duran_Duran', 'The_Temptations', 'Van_Halen',

#### We will now use the fact that sets are unordered, and count the connections between categories:

In [112]:
Counter(set([link.article_category, link.linked_category]) for link in ls)

#### The code above will not work: sets are mutable and therefore unhashable, so they cannot be Counter keys. We will use the immutable version of sets: frozenset
The difference between a set and a frozenset is the same as the difference between a list and a tuple.
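
The difference is easy to check directly. In this sketch (variable names are illustrative), hashing a set raises `TypeError`, while a frozenset with the same elements works as a dictionary key:

```python
pair = {"Countries", "Music"}

# Mutable sets cannot be hashed, so they cannot be dict or Counter keys
try:
    hash(pair)
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
print("set hashable:", set_is_hashable)

# A frozenset with the same elements is hashable
frozen_pair = frozenset(pair)
lookup = {frozen_pair: "seen"}

# Element order does not matter when looking the key up again
print(lookup[frozenset(["Music", "Countries"])])
```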

In [113]:
counter = Counter(frozenset([link.article_category, link.linked_category]) for link in ls )
counter.most_common(5)

[(frozenset({'Countries'}), 3664),
 (frozenset({'Science.Chemistry.Chemical_elements'}), 2768),
 (frozenset({'Geography.European_Geography.European_Countries'}), 2260),
 (frozenset({'Countries', 'Geography.European_Geography.European_Countries'}),
  2064),
 (frozenset({'Citizenship.Politics_and_government', 'Countries'}), 2026)]
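
The single-element frozensets in this output are links whose two categories are identical: duplicates collapse, so only one element remains. A toy sketch with made-up category pairs shows both effects, direction-independent counting and collapsing:

```python
from collections import Counter

# Made-up (article_category, linked_category) pairs
pairs = [
    ("Countries", "Music"),
    ("Music", "Countries"),      # same connection, opposite direction
    ("Countries", "Countries"),  # link within a single category
]

pair_counts = Counter(frozenset(pair) for pair in pairs)
print(pair_counts)
```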

### Sorting (5 min)

We can sort lists not only by their "natural" order, but also by a specific key or property of each object. For example, we can order the links from Music articles by the name of the linked article.

The key keyword argument of the sorted function accepts a function that returns, for each item, the value to sort by. In the example below it's the name of the linked article.
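
As an alternative to writing a lambda, the standard library's `operator.attrgetter` builds such key functions, and it can take several attribute names to sort by more than one field. The sketch below runs on a few made-up rows (the `Link` type and its values are assumptions):

```python
from collections import namedtuple
from operator import attrgetter

# Made-up rows for illustration
Link = namedtuple("Link", ["article_name", "linked_name"])
links = [
    Link("Guitar", "Spain"),
    Link("Cello", "Italy"),
    Link("Guitar", "10th_century"),
]

# Equivalent to key=lambda x: x.linked_name
by_linked = sorted(links, key=attrgetter("linked_name"))

# Two attributes: sort by article_name first, then by linked_name
by_both = sorted(links, key=attrgetter("article_name", "linked_name"))

print([link.linked_name for link in by_linked])
print(by_both)
```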

In [114]:
PrintableList(sorted([link for link in ls if link.article_category.startswith("Music")],
                    key = lambda x:x.linked_name)[:5])

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Guitar,Music.Musical_Instruments,10th_century,History.General_history
1,Music_of_Hungary,Music.Musical_genres_styles_eras_and_events,11th_century,History.General_history
2,Medieval_music,Music.Musical_genres_styles_eras_and_events,12th_century,History.General_history
3,Trobairitz,Music.Musical_genres_styles_eras_and_events,12th_century,History.General_history
4,Music_of_the_trecento,Music.Musical_genres_styles_eras_and_events,13th_century,History.General_history


In [115]:
# The key can also be a more complicated function, for example:
counter = Counter(link.article_name for link in ls)
def count_links(article):
    return counter[article.article_name]

PrintableList(sorted(ls, key = count_links, reverse=True ))

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,United_States,Geography.North_American_Geography,€2_commemorative_coins,Business_Studies.Currency
1,United_States,Geography.North_American_Geography,15th_Marine_Expeditionary_Unit,History.Military_History_and_War
2,United_States,Geography.North_American_Geography,1896_Summer_Olympics,Everyday_life.Sports_events
3,United_States,Geography.North_American_Geography,18th_century,History.General_history
4,United_States,Geography.North_American_Geography,1928_Okeechobee_Hurricane,Geography.Storms
5,United_States,Geography.North_American_Geography,1973_oil_crisis,History.Recent_History
6,United_States,Geography.North_American_Geography,1980_eruption_of_Mount_St._Helens,Geography.Geology_and_geophysics
7,United_States,Geography.North_American_Geography,1997_Pacific_hurricane_season,Geography.Natural_Disasters
8,United_States,Geography.North_American_Geography,19th_century,History.General_history
9,United_States,Geography.North_American_Geography,2-6-0,Design_and_Technology.Railway_transport


In [93]:
#We can also use the same keyword for min and max. In this case we get back a single object
#(here: a link whose source article has the most outgoing links)
print("Most linked article is : {}".format(
        max(ls, key = count_links ).article_name))

Most linked article is : United_States
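
When only the name is needed, `max` can also run over the Counter's keys, using the counter's own `get` method as the key function. This is a sketch on made-up names, not a replacement for the cell above:

```python
from collections import Counter

# Made-up article names for illustration
names = ["United_States", "Europe", "United_States", "France", "Europe", "United_States"]
counts = Counter(names)

# Iterating a Counter yields its keys; counts.get maps each key to its count
most_frequent = max(counts, key=counts.get)
least_frequent = min(counts, key=counts.get)

print(most_frequent, least_frequent)

# most_common(1) gives the same top item together with its count
print(counts.most_common(1))
```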


### groupby (3 min)

Just as in SQL, in Python we can group objects by a property (or by calculating some other value).

groupby only merges consecutive items that share a key, so the data must already be sorted by the group key. By now this should be very easy to do.
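
To see why the sorting matters, compare `itertools.groupby` on unsorted and sorted input. The category list below is made up for illustration:

```python
from itertools import groupby

# Made-up category values, deliberately not sorted
categories = ["Music", "Art", "Music", "Art", "Art"]

# Without sorting, a new group starts every time the key changes
unsorted_groups = [(key, len(list(group))) for key, group in groupby(categories)]
print(unsorted_groups)

# After sorting, each key appears exactly once
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(categories))]
print(sorted_groups)
```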

In [95]:
#Ordering the data
def get_category(x):
    return x.article_category
sorted_data = sorted(ls, key = get_category)

In [96]:
from itertools import groupby
PrintableList([[key, len(list(group))] 
               for key,group in groupby(sorted_data, get_category)])

Unnamed: 0,0,1
0,Art.Art,1475
1,Art.Artists,23
2,Business_Studies.Business,1046
3,Business_Studies.Companies,510
4,Business_Studies.Currency,1342
5,Business_Studies.Economics,1140
6,Citizenship.Animal_and_Human_Rights,415
7,Citizenship.Community_organisations,284
8,Citizenship.Conflict_and_Peace,267
9,Citizenship.Culture_and_Diversity,812
