# Using Python's stdlib for data analysis
##### By Yotam Vaknin

### Pre-code: Writing the PrintableList class and loading the data
In the actual talk I won't show this code until the very end.
* PrintableList is a list container that looks nice in the Jupyter notebook. I don't actually implement the rendering myself (though I sadly did the first time I wrote it); it is delegated to pandas. Since the details of how to implement this part without pandas are trivial, I don't bother.
* The second block of code loads the data as a PrintableList of namedtuples.

The original data can be found here:
http://snap.stanford.edu/data/wikispeedia.html
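
For readers who want to avoid the pandas import, here is a minimal sketch of what a stdlib-only version could look like. The class name `HtmlList` and the `Point` example are hypothetical and not part of the talk; the only thing assumed about Jupyter is that it calls `_repr_html_` on objects that define it.

```python
from collections import namedtuple
from html import escape


class HtmlList(list):
    """Hypothetical pandas-free variant of PrintableList."""

    def _repr_html_(self):
        # Treat every item as a row: tuples (including namedtuples) as-is,
        # anything else as a single-cell row.
        rows = [item if isinstance(item, tuple) else (item,) for item in self]
        # namedtuples expose their column names through the _fields attribute.
        fields = getattr(self[0], "_fields", ()) if self else ()
        head = "".join(f"<th>{escape(str(name))}</th>" for name in fields)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
            for row in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"


# Made-up example data, only to exercise the class.
Point = namedtuple("Point", ["x", "y"])
html = HtmlList([Point(1, 2), Point(3, 4)])._repr_html_()
```

Unlike the pandas version, this sketch shows no index column and does no truncation of long cells.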

In [5]:
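from pandas import DataFrame

class PrintableList(list):
    """A list that Jupyter renders as an HTML table, via pandas."""
    def to_data_frame(self):
        # namedtuples expose their field names in _fields; use them as column headers
        if self and hasattr(self[0], "_fields"):
            return DataFrame(list(self), columns=self[0]._fields)
        return DataFrame(list(self))
    def _repr_html_(self):
        # Jupyter calls _repr_html_ to display the object as rich output
        return self.to_data_frame()._repr_html_()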

In [3]:
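import csv
from collections import namedtuple
# newline="" is what the csv module documentation recommends for files passed to csv.reader
with open("wikipedia_relation_data.csv", encoding="utf8", newline="") as f:
    reader = csv.reader(f)
    headers, *data = reader  # first row holds the column names, the rest are data rows
line = namedtuple("line", headers)
ls = PrintableList(line(*ln) for ln in data)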

In [17]:
ls

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Kuwait_City,Geography.Geography_of_the_Middle_East,15th_Marine_Expeditionary_Unit,History.Military_History_and_War
1,Kuwait_City,Geography.Geography_of_the_Middle_East,Asia,Geography.Geography_of_Asia
2,Kuwait_City,Geography.Geography_of_the_Middle_East,Kuwait,Geography.Geography_of_the_Middle_East.Middle_...
3,Kuwait_City,Geography.Geography_of_the_Middle_East,Arabic_language,Language_and_literature.Languages
4,Kuwait_City,Geography.Geography_of_the_Middle_East,Capital,Citizenship.Politics_and_government
5,Kuwait_City,Geography.Geography_of_the_Middle_East,City,Geography.General_Geography
6,Kuwait_City,Geography.Geography_of_the_Middle_East,Emirate,Citizenship.Politics_and_government
7,Kuwait_City,Geography.Geography_of_the_Middle_East,Iraq,Geography.Geography_of_the_Middle_East.Middle_...
8,Kuwait_City,Geography.Geography_of_the_Middle_East,Kuwait,Geography.Geography_of_the_Middle_East.Middle_...
9,Kuwait_City,Geography.Geography_of_the_Middle_East,Persian_Gulf,Geography.General_Geography


####  Some data sampling:
To show that there is no magic here: `ls` is an ordinary list that contains namedtuple objects, so indexing, slicing and attribute access all work as usual:

In [20]:
print(ls[0].article_name)
print(ls[-1].linked_name)
PrintableList(ls[:-4:-1])

Kuwait_City
Water_resources


Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Drinking_water,Everyday_life.Drink,Water_resources,Geography.Geology_and_geophysics
1,Drinking_water,Everyday_life.Drink,Water_purification,Everyday_life.Drink
2,Drinking_water,Everyday_life.Drink,Water,Geography.Climate_and_the_Weather
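
The cell above relies on namedtuples behaving like ordinary tuples with named fields. A small sketch of that behaviour, using a hypothetical two-column `Row` type and made-up rows rather than the real `line` type built from the CSV headers:

```python
from collections import namedtuple

# Hypothetical stand-in for the `line` type; the real one has more columns.
Row = namedtuple("Row", ["article_name", "linked_name"])
rows = [Row("Kuwait_City", "Asia"), Row("Kuwait_City", "Kuwait"), Row("Guqin", "China")]

first = rows[0]
by_name = first.article_name      # attribute access
by_index = first[0]               # positional access, as with any tuple
name, linked = first              # unpacking works too
last_two_reversed = rows[:-3:-1]  # negative-step slice: the last two rows, last one first
```

The slice `ls[:-4:-1]` in the notebook works the same way, returning the last three rows in reverse order.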


### Filtering and Counting (7 min)
* The simplest way to filter is a list comprehension.
* We will look at 3 different ways to count and consider which one is best.

In [33]:
#filters (Only 5 objects for simple viewing)
PrintableList([article for article in ls if "Musical_Instruments" in article.article_category][:5])

Unnamed: 0,article_name,article_category,linked_name,linked_category
0,Guqin,Music.Musical_Instruments,Aesthetics,Science.Biology.Health_and_medicine
1,Guqin,Music.Musical_Instruments,China,Countries
2,Guqin,Music.Musical_Instruments,Go_(board_game),Everyday_life.Games
3,Guqin,Music.Musical_Instruments,Beijing,Geography.Geography_of_Asia
4,Guqin,Music.Musical_Instruments,Cello,Music.Musical_Instruments
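
The filter above builds the full list of matches and only then slices off the first 5. As an optional variation (not shown in the talk), a generator expression combined with `itertools.islice` stops as soon as enough matches are found. The data below is made up for illustration.

```python
from itertools import islice

# Stand-in data: the notebook filters namedtuple rows, here we filter plain strings.
categories = [
    "Music.Musical_Instruments",
    "Countries",
    "Music.Musical_Instruments",
    "Everyday_life.Games",
    "Music.Musical_Instruments",
]

# Generator expression: nothing is evaluated until items are requested.
matches = (c for c in categories if "Musical_Instruments" in c)

# islice pulls only the first 2 matches and leaves the rest unconsumed.
first_two = list(islice(matches, 2))
```

For a dataset of this size the difference is negligible; the list comprehension remains the simplest choice.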


In [34]:
#The 3 basic ways to Count:
#Naïve way:
d1, d2, data = {},{},"asdkjbasndkjsdankjsad"
for i in data:
    if i in d1:
        d1[i] +=1
    else:
        d1[i] = 1

#The cool way (using the get method)
for i in data:
    d2[i] = d2.get(i, 0) + 1

    
#The complicated way:
from collections import defaultdict
d3 = defaultdict(int)
for i in data:
    d3[i] += 1
    

#And it's all basically the same:
print(d1 == d2 == d3)

True


#### Which one should I use?
None!

In [35]:
from collections import Counter
print(Counter(data) == d1 == d2 == d3)

True


In [31]:
Counter([article.linked_name for article in ls]).most_common(5)

[('United_States', 1845),
 ('United_Kingdom', 1140),
 ('Europe', 1092),
 ('France', 1044),
 ('England', 923)]

In [32]:
#We can make it even cooler using PrintableList:
PrintableList(_)

Unnamed: 0,0,1
0,United_States,1845
1,United_Kingdom,1140
2,Europe,1092
3,France,1044
4,England,923
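
Beyond `most_common`, `Counter` has a few other conveniences that the hand-written dictionaries lack. A short sketch on made-up data:

```python
from collections import Counter

# Made-up link names, not taken from the dataset.
links = ["Europe", "France", "Europe", "England", "Europe", "France"]
counts = Counter(links)

top_two = counts.most_common(2)   # the two most frequent items with their counts
missing = counts["Atlantis"]      # missing keys count as 0 instead of raising KeyError

counts.update(["England", "England"])       # add more observations in place
combined = Counter("aab") + Counter("abc")  # counters add element-wise
```

Looking up a missing key does not insert it, so `counts` is not polluted by queries.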


### Sets (5 min)
Sets are "unordered collections of unique elements", which gives us 2 different properties:
* unique
* unordered

Both are useful!

In [59]:
# We can create sets in 3 different ways, just like lists and dictionaries:
set([1,0,2,3]) == {0,1,1,2,3} == {i for i in range(4)}
# As you can see: unique, unordered

True
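
The following cells use set operators. As a quick reference, here is a sketch of the main ones on two small made-up sets (not the real article sets):

```python
# Made-up sets for illustration only.
music = {"Guitar", "Cello", "Jazz", "Reggae"}
instruments = {"Guitar", "Cello", "Trumpet"}

both = music & instruments             # intersection: in both sets
either = music | instruments           # union: in at least one set
not_instruments = music - instruments  # difference: in music but not in instruments
only_one = music ^ instruments         # symmetric difference: in exactly one set

# Duplicates disappear and the original order is not part of the value.
deduplicated = set([1, 0, 1, 2, 3, 3])
```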

In [51]:
#Let's look for all the articles in the Music category:
music_articles = {i.article_name for i in ls if i.article_category.startswith("Music")}
print(music_articles)

{'I_Want_to_Hold_Your_Hand', 'Medieval_music', 'Tone_cluster', 'McFly_(band)', 'Reggae', "Sgt._Pepper's_Lonely_Hearts_Club_Band", 'Himno_Nacional_Mexicano', 'Music_of_the_trecento', 'Rapping', 'American_popular_music', 'Eurovision_Song_Contest', 'Musical_instrument', 'Salsa_music', 'Italo_disco', 'The_Rite_of_Spring', 'Double_bass', 'Ukulele', 'AC_DC', "You're_Still_the_One", 'Bluegrass_music', 'Music_of_Martinique_and_Guadeloupe', 'Alternative_rock', 'Music_of_Thailand', 'Music_of_Albania', 'Music_of_Antigua_and_Barbuda', 'Trumpet', 'Gregorian_chant', 'Where_Did_Our_Love_Go', 'Duran_Duran', 'The_Temptations', 'Van_Halen', 'Oasis_(band)', 'Music_of_Hungary', 'National_Anthem_of_Russia', 'Cello', 'Drum_and_bass', 'Layla', 'Nirvana_(band)', 'Garage_(dance_music)', 'Arctic_Monkeys', 'A_cappella', 'Music_of_New_Zealand', 'Synthesizer', 'U2', 'Music_of_the_Bahamas', 'Ska', 'Nine_Million_Bicycles', 'The_Beatles', 'Guitar', 'Rhythm_and_blues', 'Brass_instrument', 'The_Beatles_discography', 'M

In [52]:
#Musical articles that aren't about an instrument:
print(music_articles - {i.article_name for i in ls if "Musical_Instruments" in i.article_category})

{'I_Want_to_Hold_Your_Hand', 'Medieval_music', 'U2', 'Tone_cluster', 'Music_of_the_Bahamas', 'Ska', 'Nine_Million_Bicycles', 'McFly_(band)', 'Reggae', 'The_Beatles', 'Rhythm_and_blues', "Sgt._Pepper's_Lonely_Hearts_Club_Band", 'The_Beatles_discography', 'Mixtape', 'Renaissance_music', 'Himno_Nacional_Mexicano', 'Jazz', 'Soukous', 'Music_of_the_trecento', 'Rapping', 'Ragtime', 'American_popular_music', 'Beatles_for_Sale', 'Eurovision_Song_Contest', 'Music_of_Barbados', 'Ray_of_Light', 'Music_of_Trinidad_and_Tobago', 'Salsa_music', 'Classic_female_blues', 'Italo_disco', 'The_Rite_of_Spring', 'The_Rolling_Stones', 'Music_of_Dominica', 'AC_DC', "You're_Still_the_One", 'Bluegrass_music', 'Music_of_Martinique_and_Guadeloupe', 'Alternative_rock', 'Music_of_Thailand', 'Music_of_Albania', 'Music_of_Antigua_and_Barbuda', 'Gregorian_chant', 'Music_of_the_United_States', 'Hey_Jude', 'Bohemian_Rhapsody', 'Glastonbury_Festival', 'Where_Did_Our_Love_Go', 'Duran_Duran', 'The_Temptations', 'Van_Halen',

#### We will now use the fact that sets are unordered, and count the connections between categories:

In [None]:
Counter(set([i.article_category, i.linked_category]) for i in ls)

#### The code above will not work: sets are mutable and therefore unhashable, and Counter needs hashable elements. We will use their immutable version: frozenset
The difference between set and frozenset is exactly like the difference between list and tuple.
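
A minimal sketch of the failure and the fix, using one made-up pair of categories. The variable names are only for illustration.

```python
pair = ["Countries", "Geography.Geography_of_Asia"]

try:
    hash(set(pair))
    set_is_hashable = True
except TypeError:
    # set defines no hash, so it cannot be a dict key or a Counter element
    set_is_hashable = False

# frozenset is hashable, so it can be used as a dictionary key.
frozen = frozenset(pair)
lookup = {frozen: 1}

# Order does not matter: the reversed pair is the same key.
same_pair_reversed = frozenset(reversed(pair))
```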

In [63]:
counter = Counter(frozenset([i.article_category, i.linked_category]) for i in ls )
counter.most_common(5)

[(frozenset({'Countries'}), 3664),
 (frozenset({'Science.Chemistry.Chemical_elements'}), 2768),
 (frozenset({'Geography.European_Geography.European_Countries'}), 2260),
 (frozenset({'Countries', 'Geography.European_Geography.European_Countries'}),
  2064),
 (frozenset({'Citizenship.Politics_and_government', 'Countries'}), 2026)]
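
Note the single-element entries such as `frozenset({'Countries'})`: when an article and the article it links to share a category, the two equal values collapse into one element. The sketch below shows both effects (direction is ignored, equal pairs collapse) on made-up pairs rather than the real data.

```python
from collections import Counter

# Made-up (article_category, linked_category) pairs, not taken from the dataset.
pairs = [("Countries", "Music"), ("Music", "Countries"), ("Countries", "Countries")]

unordered = Counter(frozenset(p) for p in pairs)  # A->B and B->A share one key
ordered = Counter(pairs)                          # tuples keep the direction
```

If the direction of the link matters for the analysis, counting plain tuples as in `ordered` is the alternative.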