# Data analysis with Pandas

[Pandas quick-start guide](http://pandas.pydata.org/pandas-docs/stable/10min.html)  
[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)  
[Lecture notes on pandas](../predavanja/Analiza podatkov s knjižnico Pandas.ipynb)


### Loading pandas and our data

In [43]:
# load the package
import pandas as pd

# we will be working with large tables, so always display at most 10 rows
pd.options.display.max_rows = 10

# load matplotlib's pyplot interface for plotting
import matplotlib.pyplot as plt

# load the table we will be working with
filmi = pd.read_csv('../predavanja/obdelani-podatki/filmi.csv', index_col='id')

Let's take a look at the data.

In [44]:
filmi.head(10)

id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,opis
12349,The Kid,68,1921,8.3,,90443,5450000.0,"The Tramp cares for an abandoned child, but ev..."
13442,"Nosferatu, simfonija groze",94,1922,8.0,,77975,,Vampire Count Orlok expresses interest in a ne...
15864,Zlata mrzlica,95,1925,8.2,,85136,5450000.0,A prospector goes to the Klondike in search of...
17136,Metropolis,153,1927,8.3,98.0,136601,26435.0,In a futuristic city sharply divided between t...
17925,General,67,1926,8.2,,68196,,When Union spies steal an engineer's beloved l...
21749,Luči velemesta,87,1931,8.5,,138228,19181.0,"With the aid of a wealthy erratic tippler, a d..."
22100,M - mesto isce morilca,117,1931,8.4,,121443,28877.0,When the police in a German city are unable to...
24216,King Kong,100,1933,7.9,90.0,71806,10000000.0,A film crew goes to a tropical island for an e...
25316,Zgodilo se je neke noci,105,1934,8.1,87.0,81390,,A spoiled heiress running away from her family...
27977,Moderni časi,87,1936,8.5,96.0,179725,163245.0,The Tramp struggles to live in modern industri...


## Inspecting the data

Sort the data by rating.

In [45]:
filmi.sort_values('ocena')

id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,opis
5988370,Reis,108,2017,1.8,,71685,,A drama about the early life of Recep Tayyip E...
1213644,Disaster Movie,87,2008,1.9,15.0,80828,14190901.0,"Over the course of one evening, an unsuspectin..."
799949,Velik film,86,2007,2.3,17.0,96182,39739367.0,A comedic satire of films that are large in sc...
185183,Battlefield Earth,118,2000,2.4,9.0,71149,21471685.0,It's the year 3000 A.D.; the Earth is lost to ...
1098327,Dragonball Evolution,85,2009,2.6,45.0,63966,9353573.0,The young warrior Son Goku sets out on a quest...
...,...,...,...,...,...,...,...,...
71562,"Boter, II. del",202,1974,9.0,90.0,950252,57300000.0,The early life and career of Vito Corleone in ...
468569,Vitez teme,152,2008,9.0,84.0,1972591,534858444.0,When the menace known as the Joker emerges fro...
68646,Boter,175,1972,9.2,100.0,1372528,134966411.0,The aging patriarch of an organized crime dyna...
111161,Kaznilnica odrešitve,142,1994,9.3,80.0,2003395,28341469.0,Two imprisoned men bond over a number of years...


Extract the 'ocena' column.

In [48]:
filmi.ocena

id
12349      8.3
13442      8.0
15864      8.2
17136      8.3
17925      8.2
          ... 
5813916    9.4
5988370    1.8
6294822    7.2
6644200    7.7
7784604    7.3
Name: ocena, Length: 2500, dtype: float64

There is a difference between `filmi['ocena']` and `filmi[['ocena']]`:

In [None]:
print(type(filmi['ocena']))
print(type(filmi[['ocena']]))

The columns of a dataframe are `Series`. Single brackets extract a `Series` (think: a vector with no further structure), while double brackets extract a sub-`DataFrame`. Most of the operations we perform (grouping, joining, plotting, filtering, ...) operate on dataframes.

A `Series` is used, for example, when we want to add a column to a dataframe.

Round the extracted rating series to the nearest integer using the `round()` function.

Add the rounded values to the `filmi` dataframe as a new column.

Remove the newly added column using the `.drop()` method with a `columns=` argument.
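
One possible way to work through these three steps is sketched below. It assumes a small stand-in for `filmi` built from a few rows of the output above, and the column name `zaokrozena_ocena` is a hypothetical choice. The sketch uses the `Series.round()` method rather than the built-in `round()` function.

```python
import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Metropolis", "King Kong", "Boter"],
        "leto": [1921, 1927, 1933, 1972],
        "ocena": [8.3, 8.3, 7.9, 9.2],
    },
    index=pd.Index([12349, 17136, 24216, 68646], name="id"),
)

# Rounding acts on the whole Series at once and returns a new Series.
zaokrozena = filmi["ocena"].round()

# Assigning a Series to a new column name adds the column (aligned on the index).
filmi["zaokrozena_ocena"] = zaokrozena

# drop() returns a new dataframe; `filmi` itself still has the column.
brez_stolpca = filmi.drop(columns="zaokrozena_ocena")
```

To remove the column from `filmi` itself, the result of `.drop()` can be assigned back to `filmi`.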

### Side-note: slices

Selecting a sub-dataframe creates a "slice".
A slice may be a view defined by reference to a different dataframe,
and should not be altered directly: depending on the pandas version, the change may trigger a `SettingWithCopyWarning` or silently fail to affect the original data.
Instead, we create an independent copy of the portion selected by the slice by calling the `.copy()` method on it, and then alter that copy.


Select the slice corresponding to the columns `naslov`, `leto`, and `glasovi` from `filmi`, and add a column with the rounded rating to it.
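
A minimal sketch of this exercise, assuming a small stand-in for `filmi` built from rows shown earlier; the names `izbor` and `zaokrozena_ocena` are hypothetical.

```python
import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Metropolis", "King Kong"],
        "leto": [1921, 1927, 1933],
        "ocena": [8.3, 8.3, 7.9],
        "glasovi": [90443, 136601, 71806],
    },
    index=pd.Index([12349, 17136, 24216], name="id"),
)

# Double brackets select a sub-dataframe; .copy() makes it independent of `filmi`.
izbor = filmi[["naslov", "leto", "glasovi"]].copy()

# The new column is added to the copy only; values are aligned on the shared index.
izbor["zaokrozena_ocena"] = filmi["ocena"].round()
```

Because `izbor` is a copy, `filmi` keeps its original four columns.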

### Filtering data

Create a filter that selects films from before 1930, and one for films from after 2017.
Combine them to select films from before 1930 or after 2017.
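
The filters could be built roughly as follows. This is a sketch on a small stand-in for `filmi`; none of the rows shown above is from after 2017, so the second filter selects nothing in this sample. The variable names are hypothetical.

```python
import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Metropolis", "King Kong", "Reis"],
        "leto": [1921, 1927, 1933, 2017],
    },
    index=pd.Index([12349, 17136, 24216, 5988370], name="id"),
)

# A comparison on a column gives a boolean Series: one True/False per row.
pred_1930 = filmi["leto"] < 1930
po_2017 = filmi["leto"] > 2017

# `|` combines filters element-wise with "or" (`&` would be "and").
stari_ali_novi = filmi[pred_1930 | po_2017]
```

Note that Python's `or` and `and` keywords do not work on boolean `Series`; the element-wise operators `|` and `&` are needed.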

Define a function that splits a string into words and tests whether the number of words is at most two. Then select the films whose title is at most two words long and whose rating is greater than 8.

Hint: Use the `.apply()` method to create a filter from the `naslov` column.
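
One way to follow the hint is sketched below on a small stand-in for `filmi`. The function name `najvec_dve_besedi` is hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": [
            "The Kid",
            "Nosferatu, simfonija groze",
            "M - mesto isce morilca",
            "King Kong",
            "Boter",
        ],
        "ocena": [8.3, 8.0, 8.4, 7.9, 9.2],
    },
    index=pd.Index([12349, 13442, 22100, 24216, 68646], name="id"),
)


def najvec_dve_besedi(niz):
    """Return True if the string has at most two whitespace-separated words."""
    return len(niz.split()) <= 2


# .apply() calls the function on every title and returns a boolean Series.
kratek_naslov = filmi["naslov"].apply(najvec_dve_besedi)

# Parentheses are needed around the comparison because `&` binds tighter than `>`.
kratki_in_dobri = filmi[kratek_naslov & (filmi["ocena"] > 8)]
```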

### Histograms: Counting frequencies of values

Group the films by rating, then count the number of occurrences of each rating.

Create a bar plot of this data.
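
A sketch of the counting and plotting steps, assuming a small stand-in for `filmi` and a non-interactive matplotlib backend so that it runs outside a notebook. The name `stevilo_po_ocenah` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Small stand-in for `filmi`, built from the first rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Nosferatu, simfonija groze", "Zlata mrzlica",
                   "Metropolis", "General", "Luči velemesta"],
        "ocena": [8.3, 8.0, 8.2, 8.3, 8.2, 8.5],
    },
    index=pd.Index([12349, 13442, 15864, 17136, 17925, 21749], name="id"),
)

# groupby() collects the rows that share a rating; size() counts the rows per group.
stevilo_po_ocenah = filmi.groupby("ocena").size()

# The result is a Series indexed by rating, which can be plotted as bars directly.
ax = stevilo_po_ocenah.plot.bar()
```

In this sample the ratings 8.2 and 8.3 each occur twice, and 8.0 and 8.5 once.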

Dataframes have a built-in `.hist()` method that allows creating histograms for each column. Use this method to create a corresponding plot for the simplified data.
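
The following sketch assumes that "the simplified data" means the ratings rounded to the nearest integer; that reading, the stand-in dataframe, and the variable names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {"ocena": [8.3, 8.0, 8.2, 1.8, 1.9, 9.2]},
    index=pd.Index([12349, 13442, 15864, 5988370, 1213644, 68646], name="id"),
)

# Double brackets keep a dataframe, so .hist() draws one histogram per column.
zaokrozene = filmi[["ocena"]].round()
osi = zaokrozene.hist()
```

`.hist()` returns an array of matplotlib axes, one per plotted column.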

### Plot the average film length by year
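
A possible approach, sketched on a small stand-in for `filmi` with a non-interactive backend; the name `povprecna_dolzina` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Luči velemesta", "M - mesto isce morilca", "King Kong"],
        "dolzina": [68, 87, 117, 100],
        "leto": [1921, 1931, 1931, 1933],
    },
    index=pd.Index([12349, 21749, 22100, 24216], name="id"),
)

# Group by year, pick the length column, and average within each year.
povprecna_dolzina = filmi.groupby("leto")["dolzina"].mean()

# A Series indexed by year plots as a line with the years on the x-axis.
ax = povprecna_dolzina.plot()
```

In this sample the two films from 1931 average to 102 minutes.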

### Plot the sum of the revenues by year
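
The same grouping pattern should work here with `.sum()`. The sketch below uses a small stand-in for `filmi`, including one film with a missing revenue; the name `zasluzek_po_letih` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import numpy as np
import pandas as pd

# Small stand-in for `filmi`, built from rows shown in the output above.
filmi = pd.DataFrame(
    {
        "naslov": ["The Kid", "Nosferatu, simfonija groze", "Luči velemesta",
                   "M - mesto isce morilca", "King Kong"],
        "leto": [1921, 1922, 1931, 1931, 1933],
        "zasluzek": [5450000.0, np.nan, 19181.0, 28877.0, 10000000.0],
    },
    index=pd.Index([12349, 13442, 21749, 22100, 24216], name="id"),
)

# sum() skips missing values, so a year with only NaN revenues sums to 0.
zasluzek_po_letih = filmi.groupby("leto")["zasluzek"].sum()

ax = zasluzek_po_letih.plot()
```

Since missing revenues are skipped, a year with no known revenue shows up as 0 rather than as a gap; passing `min_count=1` to `.sum()` keeps it as `NaN` instead.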