<hr><hr>

# Data Science Summer School - Split '16

## Day 1 - Python for data analysis fundamentals 
### *Numpy, Pandas, Matplotlib*

(c) 2016 Damir Pintar

*version: 0.1* 


`kernel: Python 2.7`

<hr> <hr>

# Part 2 - Pandas

*Pandas* is an open-source library which aims to facilitate data analysis tasks in the Python programming language. The name *pandas* is often read as "Python Data Analysis", although it actually originates from "panel data", a term commonly used in economics for multidimensional datasets.

While *Numpy* offers plenty of tools for manipulating numerical vectors or matrices, it is not very well suited for dealing with more complex data formats. The *Pandas* library provides data structures specifically aimed at facilitating analysis of tabular data where columns depict attributes and rows describe observations - similar to `data.frame` objects used in the R programming language, tables in a relational database, or perhaps typical Excel sheets.

In our brief *Pandas* introduction we will focus on two flagship objects in the *Pandas* library:
- `Series` and
- `DataFrame`

Most of our work in *Pandas* will revolve around manipulating objects of the above classes. Hence, let's briefly describe what these objects are all about, and then show how to deal with them in practice.

It's easiest to consider `Series` as a one-dimensional *Numpy* array whose elements carry labels (an *index*) in addition to their integer positions. Essentially it trades *efficiency* for *flexibility*. While it's relatively easy to convert between a *Pandas* `Series` and a *Numpy* array (`ndarray`), in practice you will probably stick with `Series` unless you require computational efficiency or want to perform some lower-level operations.

`DataFrame`, on the other hand, can be viewed as a data structure which allows storing data in tabular form, with each column describing a certain attribute belonging to a certain domain, and each row depicting a tuple or observation. You can also see a `DataFrame` as a collection of `Series` objects.

Or, to put it simply, a `DataFrame` is our dataset in tabular form, and each column is a `Series`. Analyzing our data often boils down to manipulating our `DataFrame` objects in various ways by reshaping, transforming, slicing and aggregating them, and using them as input for various functions and methods.

## Pandas Series

First, let's import the *Pandas* package using the common convention of abbreviating it as *pd*:

In [None]:
import numpy as np
import pandas as pd

We can create a `Series` object in a similar fashion as we did with *Numpy* `Array`, by calling a default constructor (`Series`) and providing a list argument:

In [None]:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# create a Numpy Array called np_a and a Pandas Series called pd_s from the above list
###
###

# print out np_a and pd_s
###
###


Notice the differences between the `Series` and the *Numpy* array. First, the information is printed out in a more informative fashion, with even the data type provided. More importantly, *Pandas* has automatically assigned indexes to our `Series` - in this case integers which correspond to the default indexing used by Python lists and *Numpy* arrays. However, if you try to treat it as if it were a *Numpy* array, you might soon run into certain problems.
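As a minimal sketch of this difference (the names `values`, `arr` and `ser` are illustrative and not part of the exercise), the following builds both objects from one list and inspects the index that *Pandas* attaches automatically:

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]

arr = np.array(values)     # plain Numpy array: values only
ser = pd.Series(values)    # Pandas Series: values plus an index

print(arr)
print(ser)

# the automatically assigned labels are the integers 0, 1, 2
print(list(ser.index))

# the values attribute gives back a Numpy array without the index
back = ser.values
print(type(back))
```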

In [None]:
# slice pd_s from 2nd to 4th element and store it in a variable called pd_slice
###


# print out pd_slice
###


# print out the first element of pd_slice
###


Did you try to get the first element of `pd_slice` by indexing it with 0? If you did, you might have been surprised by an unexpected error. What might be the cause of it?

Here, we have stumbled upon a slightly controversial issue in *Pandas* `Series` indexing. By design, `Series` recognizes three types of indexing: **boolean-based**, **integer-based** and **label-based**. Integer-based indexing means we are asking for elements by their positions - i.e. integer locations - in a `Series`. Label-based indexing means we identify our elements by labels (or keys). Label-based indexing often takes priority, and things may get confusing when (as in our case) labels and integer locations seem interchangeable. When we used a range of integers, *Pandas* supposed that we wanted to use integer-based indexing (as we did), but when we asked for a single element, *Pandas* assumed we were asking for an entry with that specific key, which the slice does not contain.
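The behaviour described above can be reproduced with a short sketch. The flag `lookup_failed` is an illustrative name introduced here, and the snippet assumes a default integer index as in the exercise:

```python
import pandas as pd

pd_s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# an integer slice is interpreted by position: 2nd to 4th element
pd_slice = pd_s[1:4]
print(list(pd_slice.index))   # the slice keeps the original labels

# a single integer is interpreted as a label, and the slice has no label 0
try:
    pd_slice[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# the label 1 does exist and refers to the first element of the slice
print(pd_slice[1])
```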

To avoid this issue, we may opt to use the `Series` indexers called `iloc` and `loc`, which are hardcoded to use integer-based and label-based indexing respectively. Both are used with square brackets rather than parentheses.

So integer-based location indexing *is* possible, but you need to go through the `iloc` indexer of the `Series` object:

In [None]:
# print the first element of pd_slice using the iloc method.
###


# now print the same element using the loc method
###


Things may get slightly more confusing when we reveal that the range operator `:` also works just fine for label-based indexing!

In [None]:
# print the first two elements of pd_slice using the iloc method
###

# print the last two elements of pd_slice using the loc method
###


If the last exercise left you with a feeling that something is not quite right, you are correct. There is a slight inconsistency in label-based indexing compared to Python's usual syntax - **label-based indexing is inclusive from both ends**! Even though this might be considered somewhat unpythonic and more like something we would expect to see in a less strict language like `R`, it is actually an agreed-upon compromise. Otherwise users would be inconvenienced by always having to look up the label of the row after the last one they actually want to retrieve, or by using the clumsy `+1` semantics which doesn't really make sense when applied to labels.
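A small sketch makes the two slicing conventions visible side by side. The `Series` `s` and the result names are illustrative; the labels 1 to 3 mimic a slice taken from a longer `Series`:

```python
import pandas as pd

s = pd.Series([2, 3, 4], index=[1, 2, 3])

# iloc: positions 0 and 1 are returned, position 2 is excluded
by_position = s.iloc[0:2]
print(by_position)

# loc: labels 2 and 3 are returned, the end label is included
by_label = s.loc[2:3]
print(by_label)
```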

Let's make our life easier by assigning our own index to the `Series`:

In [None]:

pd_s = pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e'])


# print the element with the index 'c' from pd_s.
###


# print all elements from the one indexed with 'b' to the one indexed with 'e'
# use slice syntax (start:end)
###


Notice that in the last exercise you could have used the `loc` indexer, but it wasn't necessary; *Pandas* realized you wanted to use label-based indexing simply by checking out the index arguments.

Even though the `Series` behaves a lot like a dictionary, there are still scenarios where we can treat it just as if we were dealing with a *Numpy* array. One example of this is using relational operators. Let's try to filter `Series` elements by using methods similar to those we used in the *Numpy* lecture:

In [None]:
# see what happens if you print the following expression:  pd_s > 2
###


# now try to print all elements from pd_s which are larger than 2
###




As you can see, instead of actual indexes we can provide a `boolean Series` which will then slice our `Series` in an expected fashion. 
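To make the mechanics explicit, here is a minimal sketch of boolean filtering (the names `mask` and `filtered` are illustrative). The comparison yields a boolean `Series` with the same index, which is then used as the selector:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

mask = s > 2          # boolean Series sharing the index of s
print(mask)

filtered = s[mask]    # keeps only the entries where the mask is True
print(filtered)
```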

Also similar to arrays, we can perform arithmetical operations with scalars and get the expected results.

In [None]:
# multiply pd_s by 10 and print out the result
###


Performing arithmetic operations between `Series` objects will work similarly to *Numpy* arrays, i.e. it will be executed in vectorized fashion. However, there is one crucial difference - *Pandas* will take into account `Series` indexing and will align the two objects accordingly.

In [None]:
pd_a = pd.Series([1, 2, 3, 4, 5], index = ('a', 'b', 'c', 'd', 'e'))
pd_b = pd.Series([1, 2, 3, 4, 5], index = ('c', 'e', 'd', 'f', 'b'))


# print the result of pd_a + pd_b and explain the result
###


As you see, *Pandas* will first align rows by indexes and then perform the operation. 

If one or both `Series` operands contain indexes which aren't present in the other operand, we get `NaN` as the result. Sometimes this isn't what we want. *Pandas* allows us to "fill in" a default value for these cases, but then we cannot use the operator (since we don't have a way to input additional parameters). This is why *Pandas* offers special methods of `Series` objects - such as `add` or `multiply` (`mul`) - which perform the same thing as the operator but offer us a chance to input additional parameters.
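The following sketch contrasts the plain operator with the method variants. The result names are illustrative, and the fill values (0 for addition, 1 for multiplication) are chosen here as the neutral elements of each operation:

```python
import pandas as pd

pd_a = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
pd_b = pd.Series([1, 2, 3, 4, 5], index=['c', 'e', 'd', 'f', 'b'])

# labels 'a' and 'f' exist in only one operand, so the operator gives NaN there
plain_sum = pd_a + pd_b
print(plain_sum)

# a missing operand is replaced by 0 before adding
filled_sum = pd_a.add(pd_b, fill_value=0)
print(filled_sum)

# a missing operand is replaced by 1 before multiplying
filled_product = pd_a.mul(pd_b, fill_value=1)
print(filled_product)
```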

In [None]:
# add pd_a and pd_b using the add method of the Series object. Add a fill_value parameter set to 0
# simply print the result
###


# multiply pd_a and pd_b using the Series method multiply (or mul). Use an appropriate fill-in value to avoid NaNs.
# print out the result
###



*Pandas* offers a huge selection of attributes and functions which offer various ways to manipulate `Series` elements, execute various unary or binary operations or compute values. You may find it worthwhile to check the official API documentation for Series at <a href="#1">[1]</a>. Amongst other things, you can try:
- `shape` - (attribute) returns dimensions of underlying data
- `index` - (attribute) returns the index, i.e. the labels of the Series
- `values` - (attribute) returns an array of values
- `size` - (attribute) returns the number of elements
- `sort_values` - sorts Series by values
- `sort_index` - sorts Series by indexes
- `add, sub, mul, div, mod, pow` - performs arithmetic operations
- `min, max, mean, median, describe` - calculates statistics
- `isnull, notnull, eq, lt, gt` - returns boolean Series based on chosen condition
- `unique` - returns unique values 
- `str.*` - a family of functions involving string manipulation
- etc.

Let's experiment with a few of these methods:

In [None]:
# initializing a Series first - notice how index keys do not need to be unique!
pd_a = pd.Series([4, 2, 9, 11, 4, 18, 2, 10], index = ['b', 'a', 'd', 'c', 'a', 'e', 'f', 'c'])

# print just the indexes of the above series
###


In [None]:
# print just the values
###


In [None]:
# print the number of elements
###


In [None]:
# print the series sorted by indexes
###


In [None]:
# print the series sorted by values
###


In [None]:
# print the unique values of the Series
###


In [None]:
# print the arithmetic mean of the series
###


In [None]:
# print all the statistics by using the "describe" method
###


Finally, we are not constrained to provided functions. We can apply any function we want to a `Series` object by using the `apply` method of a `Series` object and providing the function as a parameter (and `args` as a set of additional parameters, if needed).
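A minimal sketch of the three ways of calling `apply` mentioned above. Note one substitution: the positional-argument example uses Python's built-in `round` instead of *Numpy*'s, purely for illustration; the result names are assumptions as well:

```python
import math
import numpy as np
import pandas as pd

s = pd.Series([1. / 3, math.pi, 0.123456789])

# a function without extra parameters
logs = s.apply(np.log)

# positional parameters go into args: this calls round(x, 2) for each element
rounded = s.apply(round, args=(2,))

# keyword parameters are passed on directly: np.round(x, decimals=2)
rounded_np = s.apply(np.round, decimals=2)

# a custom function
squares = s.apply(lambda x: x ** 2)

print(logs)
print(rounded)
print(squares)
```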

In [None]:
# initialize a Series containing three floating point numbers
import math
pd_series = pd.Series([1./3, math.pi, 0.123456789])

# print a Series of natural logarithms of the elements of pd_series
# use np.log function from the numpy package
###

# round the elements of pd_series to 2 decimals using numpy's round function
# remember that you can put positional arguments in an 'args' list 
# or keyword arguments directly
# print out the results
###

# you can use custom functions too!
# apply custom square function to the series by using lambda x: x ** 2 as a function argument
###


<hr><hr>

## Pandas DataFrame

`DataFrame` is a data structure most commonly used for storing 2-dimensional tabular data with columns of potentially different types. It is easiest to compare it to a table in a relational database, an Excel spreadsheet, or a `data.frame` object in the `R` programming language. It can also be thought of as a collection of `Series` objects which all share the same index ("row indices"), but are also indexed by the `DataFrame` itself externally ("column indices"). If you prefer Python lingo, if `Series` objects are like a dictionary, `DataFrame` is like a dictionary of dictionaries (in fact, a dictionary of dictionaries is one type of argument the `DataFrame` constructor accepts by default!).
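The dictionary-of-dictionaries view can be sketched as follows. The variable names are illustrative; the outer keys become column labels and the inner keys become row labels:

```python
import pandas as pd

nested = {
    'population': {'Zagreb': 790017, 'Split': 167121},
    'tax': {'Zagreb': 18, 'Split': 10},
}

df = pd.DataFrame(nested)
print(df)

# each column is a Series which shares the row index of the data frame
tax = df['tax']
print(type(tax))
print(tax['Split'])
```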

### Constructing a DataFrame

You most commonly create `DataFrame` objects in one of two ways:
- programmatically using one of the default constructors
- by reading data from a file (.TXT, .CSV etc.)

Both of these ways have plenty of available options. For clarity, we will demonstrate only the two most common ones here. For additional options, be sure to check the official documentation at <a href="http://pandas.pydata.org/pandas-docs/stable/api.html#series">[1]</a>.

In [None]:
# creating a DataFrame programmatically
zipcode = [10000, 51000, 21000, 31000, 20000]
dataset = { 'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'], 
           'avg_salary': [6359, 5418, 5170, 4892, 5348], 
           'population': [790017, 128384, 167121, 84104, 28434], 
           'tax': [18, 15, 10, 13, 10]}
    

# and send this dictionary to the DataFrame constructor
cities_df = pd.DataFrame(dataset, index = zipcode, columns = ['name', 'avg_salary', 'population', 'tax'])
cities_df.index.name = 'zipcode'   #otherwise it will stay nameless

# try printing out the above data frame
###


We have first created a list of indexes, and then a dictionary holding our data columns. Then we have used one of the appropriate `DataFrame` constructors which takes exactly the format of arguments that we have prepared. The `columns` parameter wasn't strictly necessary, but since dictionaries in Python 2.7 do not have a defined ordering (they preserve insertion order only from Python 3.7 onward), the order of columns would ultimately be arbitrary, which usually isn't what we want. Additionally, you might have noticed that we explicitly stated the index name by setting the `index.name` attribute of the `DataFrame` object. This is not really required, but when we are using a data column for a key, we usually prefer it having its own header.

NOTE: if you explicitly used the `print` function to print out the data frame, revisit the code segment again and print out `cities_df` using autoprint (just put the name of the variable as the last command in the code block). You will see that as an added bonus, Jupyter Notebook prints out dataframes as HTML tables. 

There is a myriad of other ways to construct a `DataFrame` programmatically. We avoid showing these for reasons of brevity, but feel free to check out the official API documentation at <a href="#1">[1]</a> for a particular method you may find convenient.

Now let's try to create the same data frame by reading it from the CSV file. Ensure that there is a `CroCities.csv` file in the home directory of this notebook. Open it with a text editor (*Notepad* on Windows, *vi* or similar on Unix) and analyze its structure. It should look similar to the following: 

**CroCities.csv**

`zipcode,name,avg_salary,population,tax
10000,Zagreb,6359,790017,18
51000,Rijeka,5418,128384,15
21000,Split,5170,167121,10
31000,Osijek,4892,84104,13
20000,Dubrovnik,5348,28434,10`

Let's create our data frame:

In [None]:
# creating a dataframe by reading from a CSV file

cities_df = pd.read_csv('CroCities.csv', index_col = 'zipcode')


# print out the above data frame
###



As you saw, this was pretty painless. In practice, you will want to check the `read_csv` API for all available options so your imported data frame contains correct data in the type or form that you require. Things you need to watch out for are the separator symbol (default is `,`, set your own with the `sep` parameter if needed), whether the file has a header or not (control it with the `header` parameter, which takes the row number of the header line, or `None` if the file has no header), custom column names if needed (the `names` parameter), character encoding (the `encoding` parameter) etc. In our case we have explicitly stated that `zipcode` is our index column, otherwise *Pandas* would have provided its own key (a usual range of integer values).
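As a hedged illustration of these options, the sketch below parses a small headerless, semicolon-separated text. The content of `raw` is a hypothetical variant of the cities file, and `io.StringIO` is used only so that the example does not depend on a file on disk:

```python
import io
import pandas as pd

# hypothetical content: semicolon-separated, no header row
raw = u"10000;Zagreb;18\n51000;Rijeka;15\n21000;Split;10\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                           # non-default separator
    header=None,                       # the data has no header line
    names=['zipcode', 'name', 'tax'],  # so we supply the column names
    index_col='zipcode',               # and use zipcode as the row index
)
print(df)
```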

If you want to read data from another source, such as an Excel file, relational database, JSON file, HTML file etc. check out the official `DataFrame` API at <a href="#1">[1]</a> for available functions. *Pandas* offers plenty of options out-of-the-box and there is a great chance that the format you need has its own accompanying *Pandas* function. If not, remember that you can almost always use a CSV file as a proxy between your data source and *Pandas*.

### Selecting rows / columns

One of the most common things you will do with a data frame is filtering out specific rows and columns based on certain criteria. *Pandas* is very flexible with what you can do with your data frames, however you need to understand the functionalities it offers first.

First of all, remember that a `DataFrame` is basically a dictionary of `Series` objects. If you want to extract a specific `Series` from a `DataFrame`, just reference it by name in a typical dictionary fashion (i.e. `df['name']`), or - even simpler - use regular attribute syntax (`df.name`). If you want more columns, put their names in a list and send it as an argument.

In [None]:
# print the name column from cities_df using dictionary syntax, print it out on the screen
###

#  print the population column using attribute syntax
###

# print the name and tax columns together by using dictionary syntax with a list of keys as an argument. 
#Notice the result is not a Series anymore, but a DataFrame
###


Similar to indexing issues we have encountered with the `Series` object, we also need to be aware of certain indexing inconsistencies when using `DataFrame`s. 

When we use **single-argument label-based** indexing with a `DataFrame` object, it will default to selecting **columns**. This is what you were doing in the last exercise. However, if you put a **boolean list** or a **slice** (such as the integer range `1:3`) as the single indexing argument, you will be selecting **rows**. Note that a plain list of integers does not work this way - it is treated as a list of column labels.
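The different meanings of the square brackets can be sketched on a reduced copy of the cities data (the frame `cities` and the result names are illustrative):

```python
import pandas as pd

cities = pd.DataFrame(
    {'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'],
     'avg_salary': [6359, 5418, 5170, 4892, 5348],
     'tax': [18, 15, 10, 13, 10]},
    index=[10000, 51000, 21000, 31000, 20000],
    columns=['name', 'avg_salary', 'tax'])

one_column = cities['name']              # a label selects a column (Series)
two_columns = cities[['name', 'tax']]    # a list of labels selects columns (DataFrame)
rows_slice = cities[1:3]                 # a slice selects rows by position
well_paid = cities[cities['avg_salary'] > 5200]   # a boolean mask selects rows

print(rows_slice)
print(well_paid)
```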

In [None]:
# print the second and third row from cities_df
###


In [None]:
# print all cities with the average salary greater than 5200 Kn
###


The reason why the *Pandas* designers implemented this slightly confusing feature is simply that this type of row-wise selection is extremely common, so the inconsistency is a deliberate compromise between ease of use and the "cleanliness" of the API. Again, this is a common feature of the `R` programming language, but not something that we usually see in Python.

For a more 'pythonic' approach, you should use the `iloc` and `loc` indexers, which you have met when dealing with indexing `Series` objects. They will expect two arguments - first for rows, second for columns - and will expect integer-based and label-based indexing respectively. Interestingly, both will readily accept boolean lists for either of the arguments. Also, if you want to mix integer- and label-based indexing (for example you want to index rows by their integer order but columns by their label), you can use the `ix` indexer in the *Pandas* version used here. It is a hybrid of `iloc` and `loc` and it will try to determine what type of indexing you want by the arguments you have provided. Note that `ix` was deprecated in *Pandas* 0.20 and removed in 1.0, so with newer versions you have to combine `loc` and `iloc` instead.
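A short sketch of two-argument indexing on the cities data follows. The result names are illustrative, and the last line shows one possible way to mix positions and labels without `ix`, by translating row positions into labels first:

```python
import pandas as pd

cities = pd.DataFrame(
    {'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'],
     'avg_salary': [6359, 5418, 5170, 4892, 5348],
     'population': [790017, 128384, 167121, 84104, 28434],
     'tax': [18, 15, 10, 13, 10]},
    index=[10000, 51000, 21000, 31000, 20000],
    columns=['name', 'avg_salary', 'population', 'tax'])

# iloc: rows and columns by integer position (end excluded)
block = cities.iloc[2:4, 0:2]

# loc: a boolean mask for rows, labels for columns
small = cities.loc[cities['population'] < 200000, ['name', 'tax']]

# conditions are combined with & and each one needs its own parentheses
picked = cities.loc[(cities['tax'] < 15) & (cities['population'] > 100000), 'name']

# rows by position, columns by label: look up the row labels first
mixed = cities.loc[cities.index[2:4], ['name', 'tax']]

print(block)
print(small)
print(picked)
```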

In [None]:
# print the 3rd and 4th row and the 1st and 2nd column from cities_df
###


In [None]:
# print the name and tax of cities where the population is under 200000
###


In [None]:
# print the name of all cities with tax less than 15 (percent) and population higher than 100000
###


### Adding rows / columns to a DataFrame

Adding columns to a data frame is simple - you most commonly perform it using dictionary syntax for adding a new key (`df['new'] = 'value'`). Your new column can be based on operations performed on existing columns, or you can add a completely new column with its own indexes. The indexes do not need to match the existing data completely - *Pandas* aligns the new values to the existing row index, so rows without a matching label get `NaN`, and labels which are not present in the data frame are ignored.
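Both kinds of new columns can be sketched like this (the frame `cities` and the `Series` `areas` are illustrative names; the area figures are the ones quoted in the exercise below):

```python
import pandas as pd

cities = pd.DataFrame(
    {'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'],
     'tax': [18, 15, 10, 13, 10]},
    index=[10000, 51000, 21000, 31000, 20000],
    columns=['name', 'tax'])

# a column derived from an existing one
cities['is_tax_high'] = cities['tax'] > 13

# a column with its own (incomplete) index: values are matched by label
areas = pd.Series({10000: 641, 31000: 169, 20000: 22})
cities['area'] = areas

print(cities)
```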

In [None]:
# create a new boolean column in cities_df called is_tax_high which will show whether the tax is over 13 percent
# use dictionary syntax  (a['key'] = values)
###

# create a new column in cities_df called "area" which will store city areas in km2
# imagine you are only aware of the following data: Zagreb has 641 km2, Osijek 169 km2, Dubrovnik 22 km2
###

# print cities_df
###


If you made a mistake, go up and reload `cities_df` again. Alternatively, if you only want to get rid of some columns, simply delete them with Python's `del` statement.

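Adding rows usually boils down to "concatenating" two `DataFrame` objects using the `append` method. Note that `append` was deprecated in *Pandas* 1.4 and removed in 2.0; with newer versions use the `pd.concat` function for the same purpose.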
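For readers on a *Pandas* version without `append`, here is a minimal sketch of the same row-wise concatenation using `pd.concat` instead. This is a swapped-in alternative, not the method used in the exercise, and the small frames are illustrative:

```python
import pandas as pd

cities = pd.DataFrame(
    {'name': ['Zagreb', 'Split'], 'avg_salary': [6359, 5170], 'tax': [18, 10]},
    index=[10000, 21000])

# the second frame lacks avg_salary and brings a new column, area
more = pd.DataFrame(
    {'name': ['Sisak'], 'tax': [7], 'area': [423]},
    index=[44000])

# rows are stacked; columns missing on either side are filled with NaN
combined = pd.concat([cities, more])

# selecting the columns in a chosen order fixes the column layout
combined = combined[['name', 'avg_salary', 'tax', 'area']]
print(combined)
```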

In [None]:
# we are creating a new data frame with a few more cities for our dataset
# notice we are missing some columns!
more_cities_zipcode = [44000, 33000]
more_cities_data = {'name': ['Sisak', 'Virovitica'], 
           'population': [44322, 14688], 
           'tax': [7, 10],
           'area': [423, 179]}
more_cities_df = pd.DataFrame(more_cities_data, index = more_cities_zipcode)
more_cities_df.index.name = 'zipcode'  

# append more_cities_df to cities_df. Store it in a variable called new_cities_df
# use the append method
###

# print out new_cities_df
###


It seems all data is there - but the columns have reordered themselves! Don't worry, we can fix this easily - let's just select all data from our data frame, but put columns in proper order. We can put the result of that selection back to `new_cities_df`.

In [None]:
# Reorder new_cities_df so columns follow this order: name, population, avg_salary, tax, is_tax_high, area
###

# print new_cities_df again
###


If you did everything right, your final `new_cities_df` data frame should look like this:


`zipcode        name  population  avg_salary  tax is_tax_high area
10000        Zagreb      790017      6359.0   18      True  641.0
51000        Rijeka      128384      5418.0   15      True    NaN
21000         Split      167121      5170.0   10     False    NaN
31000        Osijek       84104      4892.0   13     False  169.0
20000     Dubrovnik       28434      5348.0   10     False   22.0
44000         Sisak       44322         NaN    7       NaN  423.0
33000    Virovitica       14688         NaN   10       NaN  179.0`

We like the order now, but the data frame seems to be infested with missing values. Let's learn how to deal with them in the next chapter.

<hr><hr>

### Dealing with missing values

Missing values are very common in real world datasets. When a value is missing it can mean a lot of things - maybe there was a mistake in the original data which caused a parsing problem, maybe this data was never collected, was withheld or perhaps is not even applicable for this particular observation.

When we notice that we have missing values in our datasets, we need to carefully consider the strategy of dealing with such values. The problem with them is that they can skew or invalidate our calculations, or cause havoc when we manipulate our data further. Missing values in expressions commonly cause more missing values, and even if computations are programmed in such a way that they ignore the missing values and only take into account observations or attribute values which are actually present, the result may not be representative of what we wanted to calculate in the first place.

Handling of missing values is actually a very complex issue, with many potential strategies at our disposal, some simple, some more advanced. Here, we will focus only on a few of the basic methods, but readers are encouraged to explore and research the more advanced techniques, especially if they expect that handling missing values will be a common task they will have to confront in the future.

Some of the most basic guidelines are:
- always check for the missing data. Is there any missing data? In which rows / columns? How prevalent is it?
- try to discover the reasons why the data is missing. Is it missing at random or are there specific patterns?
- should you ignore the missing data, remove it, or change it to another value?
 - if you are ignoring missing data - are the functions you plan to use able to gracefully accept missing data without errors? Will the results be representative?
 - if you are removing missing data - should you remove rows, or attributes? Is this removal affecting the nature of the dataset? Is the data you are removing telling you some important information which you need to explore further?
 - if you are replacing missing data with other values - which values do you choose? Should it be 0, a mean or median of this attribute, a mean or median of only a subset of observations which the observation with missing data belongs to, or something else?

Without delving too deep into these questions, let's move on to the methods you need to start getting your hands dirty with missing data. Let's go back to our `new_cities_df` data frame which, as you remember, has some missing data itself.

Let's first learn how to spot if missing data even exists. For that we can use a data frame method called `isnull`.

In [None]:
# try out the isnull method on new_cities_df
###


We received a boolean data frame as a result. Now even though we can explore it visually and check for cells which have `True`, this is not really practical for larger datasets. One neat trick we can do is simply calculate the sum of all elements in each column of this data frame, since `True` counts as 1 and `False` as 0. *Pandas* has a method for that called `DataFrame.sum`.
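A minimal sketch of this counting trick on a small illustrative frame (the frame `df` and the result names are assumptions made for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Zagreb', 'Rijeka', 'Sisak'],
                   'avg_salary': [6359, 5418, np.nan],
                   'area': [641, np.nan, 423]})

# True counts as 1, so the column sums are the numbers of missing values
missing_per_column = df.isnull().sum()
print(missing_per_column)

# summing once more gives the total for the whole data frame
total_missing = missing_per_column.sum()
print(total_missing)
```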

In [None]:
# print out the sum of new_cities_df.isnull()
###


Now we clearly see which columns have missing values so we can start planning our strategy.

One of the simplest things we can do is remove all rows which contain missing values. The data frame method for that is called `dropna()`. If we only want to remove rows which have missing values in a certain subset of columns, we can use the `subset` argument and provide a list of column labels; only those columns are then checked for missing values. There are other arguments to consider which you can check in the documentation. For now, let's simply annihilate all rows with missing values.
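The effect of the `subset` argument can be sketched on a small illustrative frame (`df`, `all_complete` and `has_salary` are names chosen for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Zagreb', 'Rijeka', 'Sisak'],
                   'avg_salary': [6359, 5418, np.nan],
                   'area': [641, np.nan, 423]})

# drop every row which has at least one missing value
all_complete = df.dropna()
print(all_complete)

# drop only the rows where avg_salary is missing; area is not checked
has_salary = df.dropna(subset=['avg_salary'])
print(has_salary)
```

Note that `dropna` returns a new data frame and leaves `df` itself unchanged.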

In [None]:
# drop all missing values from new_cities_df and store it in a new variable: new_cities_df_scrubbed
###

# print out new_cities_df_scrubbed
###


This method is OK if we have a large dataset with very few rows which have missing values, especially if we checked those rows and deemed them truly insignificant. For `new_cities_df`, however, this operation was devastating - we lost more than half of our rows.

Let's think of an alternate strategy. Print out `new_cities_df` again and try to establish a strategy for dealing with missing values.

In [None]:
# print out new_cities_df
###


Suppose this was our chosen strategy:
- column `is_tax_high` is derived from `tax` column which doesn't have missing values; this means we can simply recalculate it
- we can calculate the median of the `avg_salary` column and impute those values instead of the missing ones 
- we can try to find out city areas from an alternate data source - in this example googling them is simple and easy enough, even though in practice we would probably need to converse with domain experts for the availability of data we're missing
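The first two points of such a strategy can be sketched as follows, under the assumption of a reduced frame holding only the salary and tax columns (the frame `df` is illustrative; the figures are those of the cities dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'avg_salary': [6359, 5418, 5170, 4892, 5348, np.nan, np.nan],
     'tax': [18, 15, 10, 13, 10, 7, 10]},
    index=[10000, 51000, 21000, 31000, 20000, 44000, 33000])

# a derived column can simply be recalculated from its source column
df['is_tax_high'] = df['tax'] > 13

# median of the known salaries, ignoring NaNs
median_salary = np.nanmedian(df['avg_salary'].values)

# boolean mask of the rows where the salary is missing
missing_salaries = df['avg_salary'].isnull()

# write the median into exactly those cells
df.loc[missing_salaries, 'avg_salary'] = median_salary
print(df)
```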

Let's fix those missing values!

In [None]:
# copy the new_cities_df data frame into a new variable called new_cities_df_clean
# use the copy method of a DataFrame object
###

# print out new_cities_df_clean
###


In [None]:
# recalculate the is_tax_high column
###

# print out new_cities_df_clean
###


In [None]:
# suppose we procured additional data for the area column: area of Rijeka is 44 km2, and Split 79 km2
# update those values in the data frame 
###
###

# print out new_cities_df_clean
###


In [None]:
# finally, change the missing values in the avg_salary column 
# first calculate the median of this column with the help of Numpy's "nanmedian" function (median which ignores NaNs)
# you can round the result to the nearest integer using Numpy's "round" function
# call this value `median_salary`
###

# put indexes of cells where avg_salary is NaN in a boolean array called missing_salaries
###

# finally, using the loc method impute the median into appropriate cells
###

# print out new_cities_df_clean
###


There. We have rid ourselves of missing values completely. Be aware though that the fix involving the salaries might bite us in the long run; we have effectively narrowed down the distribution of salary values as well as entered potentially wrong values for these observations (which would be especially problematic if these observations have values which are in fact very far away from the median). The most important thing to remember is that handling missing values should always be performed with care and forethought, and even the perceived best solution can still be a compromise of sorts which might affect the results of our analysis in the long run.

<hr><hr>

### Sorting, grouping and aggregation

For our last segment, we will briefly demonstrate how to do two very common tasks when dealing with data frames - sorting by a certain key and aggregating values from our data frame, which may or may not involve splitting our data frame first into logical groupings of observations.

The easiest way to sort a data frame is to use the `sort_values` method of the `DataFrame` object. This method takes a number of arguments out of which we can make do with just two: `by`, which takes a column name (or a list of column names) by which we want to sort, and `ascending`, which takes a boolean value (or a list of boolean values) which determines whether the sorting by a specific column should be done in an ascending or descending fashion.
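A minimal sketch of sorting by two keys with different directions (the frame `df` is an illustrative reduced copy of the cities data):

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'],
     'avg_salary': [6359, 5418, 5170, 4892, 5348],
     'tax': [18, 15, 10, 13, 10]},
    columns=['name', 'avg_salary', 'tax'])

# tax ascending first; ties are broken by avg_salary in descending order
sorted_df = df.sort_values(by=['tax', 'avg_salary'], ascending=[True, False])
print(sorted_df)
```

The two lists are matched position by position, so the first boolean applies to `tax` and the second to `avg_salary`.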

In [None]:
# print the sorted variant of the new_cities_df_clean data frame
# sort first by tax (ascending) and then average salary (descending)

###


Grouping and aggregation is a process more commonly known as S-A-C: *Split - Apply - Combine*. It basically means:
- *Split* - cut up the data frame based on some criteria
- *Apply* - perform an operation on each of the sections gained by splitting
- *Combine* - collect all results from the previous step in one resulting data frame

Users of SQL will immediately recognize this principle, since it is one of the most common operations done on relational tables, using the GROUP BY clause and aggregate functions.
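The three steps can be sketched on a reduced copy of the cities data, here grouped by the `tax` column purely for illustration (roughly what `SELECT tax, AVG(avg_salary) ... GROUP BY tax` would do in SQL). The names `grouped`, `counts` and `mean_salary` are assumptions of this example:

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Zagreb', 'Rijeka', 'Split', 'Osijek', 'Dubrovnik'],
     'avg_salary': [6359, 5418, 5170, 4892, 5348],
     'tax': [18, 15, 10, 13, 10]},
    columns=['name', 'avg_salary', 'tax'])

# split: one group of rows per distinct tax value
grouped = df.groupby('tax')
print(sorted(grouped.groups.keys()))

# apply + combine: one aggregated value per group, collected in a Series
counts = grouped['name'].count()
mean_salary = grouped['avg_salary'].mean()
print(counts)
print(mean_salary)
```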

Our `new_cities_df_clean` data frame is pretty small, but still usable to demonstrate grouping and aggregating. To achieve this, we will first add another column called `size`, which will contain a categorical variable with the following values:
- "Small" - if the city has a population of less than 50,000
- "Medium" - if the city has a population between 50,000 and 100,000 (inclusive range from both ends)
- "Large" - if the city has a population of more than 100,000

Let's create this column first.

In [None]:
# create the column called `size` in new_cities_df_clean and fill it with values according to the above criteria
###
###
###

# print out new_cities_df_clean
###


Now we will create a so-called `group by` object by calling the `DataFrame` method `groupby` and providing the column we want to group on.

In [None]:
# call the method groupby on new_cities_df_clean data frame
# provide column 'size' as argument
# store this object as a variable called size_groupby

###

# print out the attribute `groups` from this object
###


We can see that this object basically consists of groupings of indexes based on our grouping criteria.

Now we are free to perform aggregation (the apply-combine steps of the S-A-C principle). We simply call one of the provided methods of the `groupby` object, optionally stating which columns we are interested in. 

In [None]:
# find out how many members each group has by calling the count method of the group by object
# and selecting just the first column of the result
###


In [None]:
# calculate the average area of cities grouped by size by calling the mean method of the groupby object
# and selecting the 'area' column of the result

###


With this we conclude our brief introduction to *Pandas*. In our final notebook we will give a quick presentation of how to perform exploratory data analysis on larger datasets using `pandas` and `matplotlib` packages.

<hr> <hr> <hr>
## <font color = "blue">Exercises</font>

Instead of providing our own exercises, we will use the opportunity to point to a venture similar to the previously mentioned "100 Numpy exercises", called **100 Pandas exercises**, available here: <a href = "#5">[5]</a>. While this seems to be a work in progress (and the number of exercises isn't even near one hundred yet), the exercises there are a great way to review some of the concepts mentioned above and learn something new. Also, do not forget to check out *Part 3*, which expands on the subject of *Pandas* DataFrames.

<hr> <hr> <hr>

## Additional resources

<a name="1"></a><A href = "http://pandas.pydata.org/pandas-docs/stable/api.html#series">[1]</a> *Series and Data Frame API Reference*, official Pandas documentation, last accessed 2016/09/06

<a name="2"></a><a href = "http://synesthesiam.com/posts/an-introduction-to-pandas.html">[2]</a> *An Introduction to Pandas* by Michael Hansen, last accessed 2016/09/06

<a name="3"></a><a href = "http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/">[3]</a> *Intro to Pandas data structures* by Greg Reda, last accessed 2016/09/07

<a name="4"></a><a href = "http://pandas.pydata.org/pandas-docs/stable/10min.html">[4]</a> *10 minutes to Pandas*, official *Pandas* 0.18.1 documentation, last accessed 2016/09/07

<a name="5"></a><a href = "https://github.com/ajcr/100-pandas-puzzles">[5]</a> *100 Pandas exercises*, last accessed 2016/09/22