## Example: Winter Olympics, Sochi 2014

Using data about the Olympic Games, we will get to know the tabular (attribute-value) representation of data in the Orange package. We will also try out some common ways of displaying data graphically.

### Data presentation

This time we are dealing with the athletes who competed at the 2014 Winter Olympics in Sochi, a Russian resort on the Black Sea.

The following data (attributes) are available for each athlete:

* first and last name,
* age in years,
* date of birth,
* gender,
* height,
* body weight,
* number of gold medals won,
* number of silver medals won,
* number of bronze medals won,
* total number of medals won,
* sport,
* country represented.

##### Question 2-2-1

Which data type would you use to represent each of the attributes?

[Odgovor](202-2.ipynb#Odgovor-2-2-1)

[Answer](202-2.ipynb#Answer-2-2-1)

So far, we have learned ways to store numerical data, such as integers and decimal numbers. Non-numerical data, such as the country and the name of the competitor, cannot easily be represented in numerical form. We will use the `Orange` library, which supports the following attribute types:

* **[c]ontinuous** attributes represent numerical data (including integers),
* **[d]iscrete** attributes take their values from a finite set. For example, gender is an element of the set `{male, female}`, and ice-cream flavor an element of `{chocolate, vanilla, strawberry}`. Note that, unlike numbers, the elements of such sets have no inherent order.
* **[s]tring** attributes store character strings of arbitrary (finite) length.
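The three attribute types can also be mirrored in `pandas`, the library used later in this notebook. The following is only a minimal sketch under assumptions: the `sample` table is a hand-typed excerpt of three athletes, not the full file, and the mapping (numeric dtype for continuous, `category` for discrete, plain text for string) is one reasonable choice rather than what Orange does internally.

```python
import pandas as pd

# Hand-typed excerpt of three athletes (hypothetical stand-in for the full table).
sample = pd.DataFrame({
    "name": ["Aaron Blunck", "Aaron March", "Zina Kocher"],  # string
    "height": [1.72, 1.85, 1.70],                            # continuous
    "gender": ["Male", "Male", "Female"],                    # discrete
})

# A discrete attribute: a categorical column with a finite set of values.
sample["gender"] = sample["gender"].astype("category")

print(sample.dtypes)
print(list(sample["gender"].cat.categories))
```

The categories are stored sorted alphabetically for convenience only; the category is still unordered unless an order is requested explicitly.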

##### Question 2-2-2

Which of the three data types listed above would you use for each of the athletes' attributes? You can find the solution by looking at the first few lines of the file [`athletes.tab`](podatki/athletes.tab).

[Odgovor](202-2.ipynb#Odgovor-2-2-2)

[Answer](202-2.ipynb#Answer-2-2-2)

### The `pandas` library

We load the data into a variable of type `DataFrame`. The data types of the attributes are inferred automatically from their values.

In [1]:
import pandas as pd
data = pd.read_table('podatki/athletes.tab', skiprows=[1])
data

Unnamed: 0,age,birthdate,gender,height,name,weight,gold_medals,silver_medals,bronze_medals,total_medals,sport,country
0,17,1996-04-12,Male,1.72,Aaron Blunck,68,0,0,0,0,Freestyle Skiing,United States
1,27,1986-05-14,Male,1.85,Aaron March,85,0,0,0,0,Snowboard,Italy
2,21,1992-06-30,Male,1.78,Abzal Azhgaliyev,68,0,0,0,0,Short Track,Kazakhstan
3,21,1992-07-30,Male,1.86,Adam Barwood,82,0,0,0,0,Alpine Skiing,New Zealand
4,21,1992-12-18,Male,1.75,Adam Cieslar,57,0,0,0,0,Nordic Combined,Poland
...,...,...,...,...,...,...,...,...,...,...,...,...
2471,28,1985-04-30,Male,1.93,Ziga Pavlin,98,0,0,0,0,Ice Hockey,Slovenia
2472,31,1982-12-05,Female,1.70,Zina Kocher,60,0,0,0,0,Biathlon,Canada
2473,28,1985-06-14,Female,1.68,Zoe Gillings,65,0,0,0,0,Snowboard,Great Britain
2474,22,1991-03-01,Male,1.76,Zongyang Jia,68,0,0,1,1,Freestyle Skiing,China


In [2]:
type(data)

pandas.core.frame.DataFrame

The domain is the set of column names.

In [3]:
data.columns

Index(['age', 'birthdate', 'gender', 'height', 'name', 'weight', 'gold_medals',
       'silver_medals', 'bronze_medals', 'total_medals', 'sport', 'country'],
      dtype='object')

Let's check the types of the individual attributes.

In [4]:
data.dtypes

age                int64
birthdate         object
gender            object
height           float64
name              object
weight             int64
gold_medals        int64
silver_medals      int64
bronze_medals      int64
total_medals       int64
sport             object
country           object
dtype: object
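The listing above shows `birthdate` stored as generic `object` (text), so date arithmetic is not available on it directly. A possible way to convert it is sketched below; the `sample` table is a hand-typed excerpt of the first three rows, and the date format string is an assumption based on the values shown.

```python
import pandas as pd

# Hand-typed excerpt of the first three rows (hypothetical stand-in for `data`).
sample = pd.DataFrame({
    "name": ["Aaron Blunck", "Aaron March", "Abzal Azhgaliyev"],
    "birthdate": ["1996-04-12", "1986-05-14", "1992-06-30"],
})

# Parse the text dates into datetime64 values.
sample["birthdate"] = pd.to_datetime(sample["birthdate"], format="%Y-%m-%d")

# The .dt accessor exposes date components such as the year.
birth_years = sample["birthdate"].dt.year
print(sample.dtypes)
print(birth_years.tolist())
```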

For discrete attributes we can list the set of values that occur.

In [5]:
pd.unique(data['sport'])

array(['Freestyle Skiing', 'Snowboard', 'Short Track', 'Alpine Skiing',
       'Nordic Combined', 'Cross-Country', 'Biathlon', 'Luge',
       'Ice Hockey', 'Bobsleigh', 'Speed Skating', 'Skeleton',
       'Ski Jumping', 'Curling'], dtype=object)
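Besides listing the distinct values, it is often useful to know how often each one occurs. A small sketch, assuming a hand-typed series of five sports taken from rows shown earlier (not the full column):

```python
import pandas as pd

# Hand-typed excerpt of the `sport` column (hypothetical stand-in for data['sport']).
sports = pd.Series(
    ["Freestyle Skiing", "Snowboard", "Short Track", "Alpine Skiing", "Snowboard"],
    name="sport",
)

n_distinct = sports.nunique()   # number of distinct values
counts = sports.value_counts()  # occurrences of each value, most frequent first

print(n_distinct)
print(counts)
```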

We can access individual rows:

In [6]:
print(data.iloc[0])
print()
print(data.iloc[3:5])

age                            17
birthdate              1996-04-12
gender                       Male
height                       1.72
name                 Aaron Blunck
weight                         68
gold_medals                     0
silver_medals                   0
bronze_medals                   0
total_medals                    0
sport            Freestyle Skiing
country             United States
Name: 0, dtype: object

   age   birthdate gender  height          name  weight  gold_medals  \
3   21  1992-07-30   Male    1.86  Adam Barwood      82            0   
4   21  1992-12-18   Male    1.75  Adam Cieslar      57            0   

   silver_medals  bronze_medals  total_medals            sport      country  
3              0              0             0    Alpine Skiing  New Zealand  
4              0              0             0  Nordic Combined       Poland  


We can also access individual attributes of a row.
The following two ways of accessing the sport of the athlete in the first row are equivalent:

In [7]:
print(data.iloc[0, 10])
print(data.loc[0, 'sport'])

Freestyle Skiing
Freestyle Skiing
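Hard-coding a column position such as `10` breaks as soon as the column order changes. One way to avoid this, sketched here on a hand-typed two-row excerpt (so `sport` sits at position 1 rather than 10), is to look the position up by name:

```python
import pandas as pd

# Hand-typed two-row excerpt (hypothetical stand-in for `data`).
sample = pd.DataFrame({
    "name": ["Aaron Blunck", "Aaron March"],
    "sport": ["Freestyle Skiing", "Snowboard"],
    "country": ["United States", "Italy"],
})

sport_pos = sample.columns.get_loc("sport")  # integer position of the column

by_position = sample.iloc[0, sport_pos]  # position-based access
by_label = sample.loc[0, "sport"]        # label-based access
by_at = sample.at[0, "sport"]            # label-based access to a single value

print(sport_pos, by_position, by_label, by_at)
```

Note that `loc[0, ...]` matches the row label `0`, which coincides with the first position only because the index here is the default `0, 1, 2, ...`.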


We can also access several columns at once:

In [8]:
print(data.loc[0, ['name','sport', 'country']])
print()
print(data.iloc[0, [4,10,11]])

name           Aaron Blunck
sport      Freestyle Skiing
country       United States
Name: 0, dtype: object

name           Aaron Blunck
sport      Freestyle Skiing
country       United States
Name: 0, dtype: object


### Selecting a subset of rows

To select a subset of rows we use a filter. We build a condition on a column and apply it to the data, which keeps only the rows that satisfy it.

In [9]:
data.loc[data['sport'] == 'Alpine Skiing']

Unnamed: 0,age,birthdate,gender,height,name,weight,gold_medals,silver_medals,bronze_medals,total_medals,sport,country
3,21,1992-07-30,Male,1.86,Adam Barwood,82,0,0,0,0,Alpine Skiing,New Zealand
5,18,1995-04-22,Male,1.70,Adam Lamhamedi,76,0,0,0,0,Alpine Skiing,Morocco
6,23,1990-09-13,Male,1.78,Adam Zampa,80,0,0,0,0,Alpine Skiing,Slovakia
7,21,1992-09-28,Female,1.62,Adeline Baud,56,0,0,0,0,Alpine Skiing,France
10,29,1984-09-18,Male,1.82,Adrien Theaux,80,0,0,0,0,Alpine Skiing,France
...,...,...,...,...,...,...,...,...,...,...,...,...
2420,19,1994-12-20,Male,1.81,Yohan Goncalves Goutt,78,0,0,0,0,Alpine Skiing,Timor-Leste
2425,16,1997-07-19,Female,1.68,Young-seo Kang,60,0,0,0,0,Alpine Skiing,Korea
2451,24,1989-02-20,Male,1.82,Yuxin Zhang,71,0,0,0,0,Alpine Skiing,China
2458,21,1992-11-15,Male,1.76,Zan Kranjec,77,0,0,0,0,Alpine Skiing,Slovenia
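To make the filtering step more concrete, the sketch below shows that the condition itself is a Boolean series (one `True`/`False` per row), and that `DataFrame.query` offers an alternative way to write the same selection. The `sample` table is a hand-typed excerpt of four athletes, not the full data.

```python
import pandas as pd

# Hand-typed excerpt of four athletes (hypothetical stand-in for `data`).
sample = pd.DataFrame({
    "name": ["Adam Barwood", "Adam Cieslar", "Adeline Baud", "Adrien Theaux"],
    "sport": ["Alpine Skiing", "Nordic Combined", "Alpine Skiing", "Alpine Skiing"],
    "country": ["New Zealand", "Poland", "France", "France"],
})

# The condition evaluates to one Boolean value per row.
mask = sample["sport"] == "Alpine Skiing"
print(mask.tolist())

# Applying the mask keeps only the rows where it is True.
alpine = sample.loc[mask]

# The same selection written as a query string.
alpine_q = sample.query("sport == 'Alpine Skiing'")
print(alpine)
```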


For numeric columns we can obtain basic summary statistics.

In [10]:
data.describe()

Unnamed: 0,age,height,weight,gold_medals,silver_medals,bronze_medals,total_medals
count,2476.0,2476.0,2476.0,2476.0,2476.0,2476.0,2476.0
mean,25.974152,1.757371,73.113893,0.029887,0.029887,0.030291,0.090065
std,4.976159,0.090514,13.673625,0.194665,0.177284,0.180603,0.342843
min,15.0,1.5,40.0,0.0,0.0,0.0,0.0
25%,22.0,1.69,62.0,0.0,0.0,0.0,0.0
50%,26.0,1.76,72.0,0.0,0.0,0.0,0.0
75%,29.0,1.83,83.0,0.0,0.0,0.0,0.0
max,55.0,2.06,120.0,3.0,2.0,2.0,3.0
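Each row of the summary corresponds to an ordinary column statistic, which can also be computed on its own. A short sketch on a hand-typed excerpt of the first five athletes (the numbers therefore differ from the full table above):

```python
import pandas as pd

# Ages and weights of the first five athletes (hand-typed excerpt).
sample = pd.DataFrame({
    "age": [17, 27, 21, 21, 21],
    "weight": [68, 85, 68, 82, 57],
})

summary = sample.describe()

# Individual statistics that describe() collects into one table.
mean_age = sample["age"].mean()
median_age = sample["age"].median()  # reported as the "50%" row

print(summary)
print(mean_age, median_age)
```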


With `crosstab()` we obtain the number of occurrences of each pair of values of two attributes.

In [11]:
pd.crosstab(data['country'],data['sport'])

country,Alpine Skiing,Biathlon,Bobsleigh,Cross-Country,Curling,Freestyle Skiing,Ice Hockey,Luge,Nordic Combined,Short Track,Skeleton,Ski Jumping,Snowboard,Speed Skating
Albania,2,0,0,0,0,0,0,0,0,0,0,0,0,0
Andorra,4,1,0,0,0,0,0,0,0,0,0,0,1,0
Armenia,1,0,0,3,0,0,0,0,0,0,0,0,0,0
Australia,5,2,6,4,0,21,0,1,0,2,3,0,11,1
Austria,20,9,6,8,0,10,25,10,5,1,3,7,17,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ukraine,2,11,0,8,0,4,0,6,1,1,0,0,2,0
United States,20,10,14,14,2,24,46,10,4,8,5,7,23,17
Uzbekistan,2,0,0,0,0,0,0,0,0,0,0,0,0,0
Venezuela,1,0,0,0,0,0,0,0,0,0,0,0,0,0
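A cross-tabulation is a count of rows per pair of values, so the same table can be built by grouping on both columns and counting. The sketch below illustrates this on a hand-typed excerpt of five athletes; it is meant only to show what `crosstab()` computes, not how it is implemented.

```python
import pandas as pd

# Hand-typed excerpt of five athletes (hypothetical stand-in for `data`).
sample = pd.DataFrame({
    "country": ["France", "France", "Slovenia", "Slovenia", "China"],
    "sport": ["Alpine Skiing", "Alpine Skiing", "Alpine Skiing",
              "Ice Hockey", "Freestyle Skiing"],
})

# Count of athletes for each (country, sport) pair.
table = pd.crosstab(sample["country"], sample["sport"])

# The same counts via groupby: count rows per pair, then pivot sports into columns.
by_group = sample.groupby(["country", "sport"]).size().unstack(fill_value=0)

print(table)
```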


Using conditions, we can display only selected rows.

In [12]:
subset = data[(data['gold_medals'] > 0 ) & (data['silver_medals'] > 0 )]
subset

Unnamed: 0,age,birthdate,gender,height,name,weight,gold_medals,silver_medals,bronze_medals,total_medals,sport,country
214,24,1989-06-18,Female,1.65,Anna Fenninger,64,1,1,0,2,Alpine Skiing,Austria
424,26,1987-07-22,Female,1.62,Charlotte Kalla,59,1,2,0,3,Cross-Country,Sweden
911,27,1986-04-01,Female,1.68,Irene Wust,63,1,2,0,3,Speed Skating,Netherlands
1062,33,1980-03-19,Male,1.81,Johan Olsson,68,1,1,0,2,Cross-Country,Sweden
1407,28,1985-11-25,Male,1.83,Marcus Hellner,75,1,1,0,2,Cross-Country,Sweden
1419,29,1984-11-24,Female,1.82,Maria Hoefl-riesch,78,1,1,0,2,Alpine Skiing,Germany
1482,25,1988-09-14,Male,1.85,Martin Fourcade,75,2,1,0,3,Biathlon,France
2173,17,1997-01-30,Female,1.75,Sukhee Shim,56,1,1,0,2,Short Track,Korea
2183,27,1986-04-23,Male,1.87,Sven Kramer,83,1,1,0,2,Speed Skating,Netherlands


We can also apply grouping and aggregate functions to the table.
How many gold, silver, and bronze medals did each country win?

In [13]:
medals_by_country = data[['country', 'gold_medals', 'silver_medals', 'bronze_medals']].groupby('country').sum()
medals_by_country

country,gold_medals,silver_medals,bronze_medals
Albania,0,0,0
Andorra,0,0,0
Armenia,0,0,0
Australia,0,2,1
Austria,1,10,1
...,...,...,...
Ukraine,0,0,1
United States,4,4,10
Uzbekistan,0,0,0
Venezuela,0,0,0


The same can be achieved with the `aggregate()` function.

In [14]:
medals_by_country = data[['country', 'gold_medals', 'silver_medals', 'bronze_medals']].groupby('country').aggregate('sum')
medals_by_country

country,gold_medals,silver_medals,bronze_medals
Albania,0,0,0
Andorra,0,0,0
Armenia,0,0,0
Australia,0,2,1
Austria,1,10,1
...,...,...,...
Ukraine,0,0,1
United States,4,4,10
Uzbekistan,0,0,0
Venezuela,0,0,0


Or with its alias `agg()`:

In [15]:
medals_by_country = data[['country', 'gold_medals', 'silver_medals', 'bronze_medals']].groupby('country').agg('sum')
medals_by_country

country,gold_medals,silver_medals,bronze_medals
Albania,0,0,0
Andorra,0,0,0
Armenia,0,0,0
Australia,0,2,1
Austria,1,10,1
...,...,...,...
Ukraine,0,0,1
United States,4,4,10
Uzbekistan,0,0,0
Venezuela,0,0,0
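`agg()` becomes more useful than `sum()` once several different aggregates are needed at the same time. The following sketch uses named aggregation on a hand-typed excerpt of five medal winners; the output column names `gold`, `silver` and `athletes` are chosen here for illustration. The sums are per-athlete totals, as in the cells above.

```python
import pandas as pd

# Hand-typed excerpt of five medal winners (hypothetical stand-in for `data`).
sample = pd.DataFrame({
    "country": ["Austria", "Sweden", "Sweden", "Sweden", "France"],
    "gold_medals": [1, 1, 1, 1, 2],
    "silver_medals": [1, 2, 1, 1, 1],
})

# Named aggregation: output column = (input column, aggregate function).
totals = sample.groupby("country").agg(
    gold=("gold_medals", "sum"),
    silver=("silver_medals", "sum"),
    athletes=("gold_medals", "count"),
)

# Rank countries by gold medals, breaking ties by silver medals.
ranking = totals.sort_values(["gold", "silver"], ascending=False)
print(ranking)
```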
