# DNDS6013 Scientific Python: 11th Class
## Central European University, Winter 2019/2020

Instructor: Márton Pósfai, TA: Luis Natera Orozco

Emails: posfaim@ceu.edu, natera_luis@phd.ceu.edu



<style>
div.cell, div.text_cell_render{
  max-width:760px;
  margin-left:auto;
  margin-right:auto;
}

.rendered_html
{
  font-size: 130%;
  }

.rendered_html li
{
  line-height: 1.;
  }

.rendered_html h1, h2 {
  font-family:"Charis SIL", serif;
}

img { 
    max-width: 200% !important;
    height: auto !important;
}

.input_prompt, .CodeMirror-lines, .output_area
{
  font-family: Consolas, monospace;
  font-size: 120%;
}
</style>

[Intro video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=4aa0894d-686a-42d6-9f07-ab8a00d1e2bf)

To complete the class, go through this notebook and
* Study the code examples
* Solve the exercises in the notebook. Try looking at the solutions only when you are done.
* Follow the links to videos for short verbal explanations if needed
* Complete the final task and upload your result to Moodle. Upload only **a single pdf figure; do not upload your code**.

If you have any questions or get stuck on one of the exercises, I will be available on the [slack channel](http://sp2020winter.slack.com). I will be online during regular class hours; outside of those hours I will try to get back to you as soon as possible.


## Today:

Introduce and explore [Pandas](http://pandas.pydata.org/), a library for tabular data manipulation and analysis. It implements common tasks, making the following (and more) very straightforward:
- Reading in and cleaning tabular data
- Merging data from different sources
- Basic analysis and plotting

Pandas stands for **Pan**el **Da**ta (an expression borrowed from econometrics). It brings tools and ideas from Excel, R, and SQL to Python.

We'll have a look at the basic data structures: Series and Dataframes. Then we'll look at a dataset about the passengers of the Titanic.

# Part I: Pandas and its data structures
## Series and Dataframes



In [None]:
#we typically import pandas with the alias pd (a la import numpy as np)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Pandas has two primary data structures: series and dataframes. Series are similar to Python lists or numpy vectors: they are one dimensional. 

Series can contain mixed types:

In [None]:
#create a series from a list
my_list = [3.0,2,1,'shoe',1.5,['apple','banana'],2,3.0,100]
my_first_series = pd.Series(my_list)
print(my_first_series)

This series contains integers, floats, a string, and even a list. (Note that when we print out a series, it ends with displaying the data type contained in the series. If it is mixed, it says the general `object`.)
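
To see how the reported data type depends on the contents, here is a minimal sketch. The variable names `ints`, `floats`, and `mixed` are made up for this illustration.

```python
import pandas as pd

# Homogeneous numeric data gets a specific numeric dtype
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5, 3.0])

# Mixing numbers and strings falls back to the generic object dtype
mixed = pd.Series([1, 'shoe', 2.5])

print(ints.dtype)    # an integer dtype (the exact width depends on the platform)
print(floats.dtype)  # float64
print(mixed.dtype)   # object
```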

When we print the series, notice that each row is also labeled with a number. This is the index of our series. 

In [None]:
#access the values
print(my_first_series.values)
print(type(my_first_series.values))

#access indices
print(my_first_series.index)

The indexing works similarly to lists or numpy arrays.

Looking up a single value:

In [None]:
print(my_first_series[4])

We can also use the slicing syntax that we know:

In [None]:
print(my_first_series[2:5])

We can also use masks the same way we did with numpy arrays. For example, we can filter based on values:

In [None]:
just_threes = my_first_series[my_first_series == 3]
print(just_threes)
print(just_threes.index)

Notice that `just_threes` has only two elements, but it inherited the indices from `my_first_series`. The indices are not always consecutive and don't always mean the position.

### Exercise

Try to decipher how indexing and slicing works if the indices are not consecutive by changing the next cell:
<details><summary><u>Hint</u></summary>
<p>

Try the following lines out

```python
print(just_threes[1])
print(just_threes[7])
print(just_threes[0:2])
```
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
#accessing by index name, e.g.,
print(just_threes[7])

#slicing always works by position
print(just_threes[0:2])
```
    
</p>
</details>
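
If you want to be explicit about whether you mean an index name or a position, pandas offers the `.loc` (by name) and `.iloc` (by position) accessors, which we will meet again with dataframes. A minimal sketch with made-up data (the names `s` and `big` are only for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
big = s[s > 25]               # keeps the original index names 2, 3 and 4

print(big.iloc[0])            # by position: the first element, 30
print(big.loc[4])             # by index name: the element named 4, 50

# there is no element with index name 0 in the filtered series
try:
    big.loc[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```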

### Applying functions to series

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=a345cc48-3de3-4d62-b54b-ab8a00d80565)

In most use cases, our series will contain only one data type, so let's look at examples like that.

In [None]:
a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1.0, 1.0, 1.0])

print(a)

Similarly to numpy, basic operations are applied to the series elementwise. For example, multiplying a series by 2 doubles each element:

In [None]:
2*a

Or adding two series is calculated elementwise:

In [None]:
a+b
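
One difference from numpy is worth keeping in mind: when two series are combined, pandas matches the elements by index name, not by position. The sketch below uses two made-up series, `left` and `right`, whose indices only partly overlap.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
right = pd.Series([10.0, 20.0, 30.0], index=['z', 'y', 'w'])

# elements are paired by index name; names present in only one series give NaN
total = left + right
print(total)
```

Here only `y` and `z` appear in both series, so only those two rows get a number.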

We can even apply a numpy function to the entire series:

In [None]:
np.sin(a)

If we want to apply a general function to the series elementwise, we have to use the `apply()` method. This is important: we will use it often to create new columns in tables.

`apply()` does for a series what a list comprehension does for a list. Let's look at a few examples:

In [None]:
print(my_list)
new_list = [ type(x) for x in my_list]
print(new_list)

The same using series and `apply()`:

In [None]:
print(type(my_first_series))
new_series = my_first_series.apply(type)
print(new_series)

We can define our own functions:

In [None]:
#list
L = ["apple!","pear!","watermelon!"]
L2 = [x.replace("!","") for x in L]
print(L2)

#series
s = pd.Series(L)

def func(x):
    return x.replace("!","")

s2 = s.apply(func)
print(s2)

Or we can do the same thing with lambda functions:

In [None]:
s = pd.Series(L)
s2 = s.apply(lambda x: x.replace("!",""))
print(s2)

### Exercise

Take the following list and list comprehension and
* Convert the list into a pandas series
* Use `apply()` to do the same operation as the list comprehension

In [None]:
L = ["APPLE","PEAR","WATERMELON"]
L2 = [ x.lower()+"!" for x in L]

<details><summary><u>Solution.</u></summary>
<p>
    
```python
L = ["APPLE","PEAR","WATERMELON"]
L2 = [ x.lower()+"!" for x in L]

s= pd.Series(L)
s2 = s.apply(lambda x: x.lower()+"!")
print(s2)
```
    
</p>
</details>

### From dictionaries to series

We converted a list to create `my_first_series`. We can also use a Python dictionary to create a pandas series.

Let's say that I have my ratings of my favorite actors in a dictionary and let's convert this into a series: 

In [None]:
#let's create a dict of the average movie ratings from 1 to 10 of actors.
actor_rating_dict = {'Nicolas Cage':3,'Robert Redford':5,'Julianne Moore':8,
                     'Jeff Bridges':7, 'Idris Elba':8,'Meryl Streep':9,
                     'Pam Grier':9, 'Dorottya Udvaros':7.5}
actor_rating_series = pd.Series(actor_rating_dict)
print(actor_rating_series)

Note that the keys of the dictionary are mapped to indices. The index name is not always an integer.

We can now look up values based on index name or by position:

In [None]:
#look up by index name
print(actor_rating_series['Idris Elba'])
#look up by position (iloc makes it explicit that 4 is a position, not an index name)
print(actor_rating_series.iloc[4])

Look, I have another dictionary! This dictionary stores the number of movies I have seen with the actors in them:

In [None]:
#creating another series: this time how many movies I have seen with each actor.
actor_frequency_dict = {'Nicolas Cage':20,'Robert Redford':6,
                        'Julianne Moore':10, 'Jeff Bridges':2,
                        'Idris Elba':14,'Mr. Bean':3,'Meryl Streep':7,
                        'Pam Grier':11,'Dorottya Udvaros':5}

actor_frequency_series = pd.Series(actor_frequency_dict)
actor_frequency_series

### Dataframe

We can combine any number of series with the `concat()` function. What is returned is a dataframe.

In [None]:
df = pd.concat([actor_rating_series, actor_frequency_series], axis=1, sort=True)
df

What does Mr. Bean's `NaN` mean? I have seen three movies with Mr. Bean, but for some reason I didn't rate him. If the `concat()` function encounters an index that is missing from one of the series, it substitutes the missing value with the special value `NaN`, which stands for not-a-number. 
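
The same behavior can be reproduced on a tiny example. This is only a sketch with made-up names and numbers (`ratings`, `counts`, `table`); the `isna()` method marks the entries that are missing.

```python
import pandas as pd

ratings = pd.Series({'Ann': 8, 'Bob': 6})
counts = pd.Series({'Ann': 3, 'Bob': 5, 'Cy': 2})

# 'Cy' has no rating, so concat() fills that cell with NaN
table = pd.concat([ratings, counts], axis=1)
table.columns = ['rating', 'count']
print(table)

# isna() returns True where a value is missing
missing = table['rating'].isna()
print(missing)
```

Note that the `rating` column is printed as floats: an integer column cannot hold `NaN`, so pandas converts it.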

### Tiny exercise
What happens if we exclude `axis=1` from the concat command? Try it out!

We can rename the columns to have more descriptive labels:

In [None]:
df.columns = ['Average_Rating','Number_of_Movies']
df

We can access a column by its name; this returns a series:

In [None]:
print(df['Number_of_Movies'])

We can access rows and elements by index name using `loc`:

In [None]:
#element: [row_name,column_name]
print(df.loc['Mr. Bean','Number_of_Movies'])
print('-------------')

#row: [row_name]
print(df.loc['Mr. Bean'])
print('-------------')

#column: [:,column_name]
print(df.loc[:,'Number_of_Movies'])

Or by position using `iloc`:

In [None]:
#element
print(df.iloc[5,1])
print('-------------')

#row
print(df.iloc[5])
print('-------------')

#column
print(df.iloc[:,1])

You can also do slicing a la numpy arrays.
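
A short sketch of slicing with both accessors, using a made-up dataframe `toy`. One detail to watch for: a position slice with `iloc` excludes its end point, as in numpy, while a name slice with `loc` includes it.

```python
import pandas as pd

toy = pd.DataFrame({'rating': [3, 5, 8, 7], 'movies': [20, 6, 10, 2]},
                   index=['a', 'b', 'c', 'd'])

by_position = toy.iloc[1:3]   # rows at positions 1 and 2 (end excluded)
by_label = toy.loc['b':'d']   # rows 'b', 'c' and 'd' (end included)
corner = toy.iloc[:2, 0]      # first two rows of the first column

print(by_position)
print(by_label)
print(corner)
```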

### Dealing with missing values

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=58cfe757-1aed-4d44-ba11-ab8a00dc9c7e)

A common task in data processing is to deal with missing data. For example, we don't have a rating for Mr. Bean. One possibility is that we make an educated guess, and say that Mr. Bean has the same rating as the average of everyone else:

In [None]:
#df['Average_Rating'] returns a series corresponding to the column
#and series has a method to calculate its mean
avg = df['Average_Rating'].mean() 
print('Average of Average Rating = %g' % (avg))

#We can change the elements directly:
df.loc['Mr. Bean','Average_Rating'] = avg

df

This is such a common task that pandas has a built-in method called `.fillna()` to locate and replace `NaN`.

In [None]:
# set Mr. Bean's average rating back to np.nan
df.loc['Mr. Bean','Average_Rating'] = np.nan

# override the Average_Rating column with a version that has the nan's replaced by the average
# of the non-nan entries.
df['Average_Rating'] = df['Average_Rating'].fillna(df['Average_Rating'].mean())
df
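
The cell above assigns the result of `fillna()` back to the column. That assignment matters: `fillna()` returns a new series and leaves the original untouched. A minimal sketch with a made-up series `ratings`:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([3.0, np.nan, 9.0])

# mean() skips NaN, so the average is (3 + 9) / 2 = 6
filled = ratings.fillna(ratings.mean())

print(filled.tolist())        # the NaN is replaced in the new series
print(ratings.isna().sum())   # the original still has one missing value
```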

### Using apply() with a dataframe

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=dd8d9677-fa4a-4493-9047-ab8a00e03354)

We have seen before that you can get a column as a series, so you can use `apply()` like we did before:

In [None]:
favorites = df['Average_Rating'].apply(lambda x: 1 if x>7. else 0)
print(favorites)

And we can use this to create a new column:

In [None]:
df['favorites']=favorites
df

However, we might want to use more than one column as input. For example, I would like to create a new column that will indicate actors that I don't like, yet I've seen many times:

In [None]:
#define a function that we will use with apply
def func(row):
    if row['Average_Rating']<=5 and row['Number_of_Movies']>=10:
        return 1
    else:
        return 0

love_to_hate = df.apply(func, axis=1)
df['love_to_hate'] = love_to_hate
df

The setting `axis=1` tells `apply()` to iterate through rows, while `axis=0` iterates through columns. Take a look:

In [None]:
df.apply(np.mean, axis=0)
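
To make the difference between the two settings concrete, here is a small sketch on a made-up dataframe `toy`. The same function, the range `max - min`, is applied once per column and once per row.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the function receives each column as a series
col_range = toy.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a series
row_range = toy.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)   # one value per column
print(row_range)   # one value per row
```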

### Exercise

Create a new column `love_to_love` that contains a `1` for actors that have a rating of at least `7` and that I have seen in at least `10` movies.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
#define a function that we will use with apply
def func(row):
    if row['Average_Rating']>=7 and row['Number_of_Movies']>=10:
        return 1
    else:
        return 0

love_to_love= df.apply(func, axis=1)
df['love_to_love'] = love_to_love
df
```
    
</p>
</details>

### Plotting

Pandas has built-in plotting to quickly take care of common plots. It sits 'on top of' matplotlib and so can be customized in the same way.

The pandas `plot()` function returns the matplotlib `Axes` object of the figure that it created. You can use this object to customize your figure.

Creating an automatically labelled bar chart is very simple:

In [None]:
axis = df.plot(kind='bar', use_index=True, y='Average_Rating')
axis.set_title('Average Rating by Actor', size=15);

Or a histogram:

In [None]:
ax=df.plot(kind='hist', y='Number_of_Movies', legend=True)

ax.set_title('Number of Movies Histogram');

So now we know the basics, let's do something more complex.

# Part II: Surviving the Titanic

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=41bee325-1d33-4211-a829-ab8a00e2e714)

Pandas has great data manipulation abilities. Let's finally consider some real data: passenger data from the Titanic, which sank on its maiden voyage. Of 2,224 passengers and crew, more than 1,500 died.

We have a data file containing data on some of the passengers. This is what it looks like:

In [None]:
!head titanic.csv

Reading from a csv file in pandas is very easy! `read_csv` is very flexible: it can read `.csv`, `.txt`, and other delimited plain-text files.

In [None]:
df = pd.read_csv('titanic.csv', header=0, sep=',')

`header=0` indicates that the first row is the header. In this case it is not necessary. `sep` is the column separator; other common examples include tabs `\t`, white space, and `|`.

Note that pandas automatically guesses the datatype of each column and converts it appropriately. This usually works, but in some unusual cases it might fail, for example, phone numbers might be converted to numbers instead of kept as strings. In these cases the `dtype` argument can be used to specify the data type by hand.
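
The phone number case can be sketched without any file by reading from a string. The data below is made up, and `io.StringIO` only stands in for a file on disk.

```python
import io
import pandas as pd

csv_text = "name,phone\nAnna,0612345678\nBela,0698765432\n"

# default: pandas guesses that phone is an integer and drops the leading zero
guessed = pd.read_csv(io.StringIO(csv_text))

# dtype tells pandas to keep the phone column as text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'phone': str})

print(guessed['phone'].iloc[0])
print(as_text['phone'].iloc[0])
```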

What does our dataframe look like? We have too many rows to print out the whole table, but we can use the `head()` method to show only the first five rows:

In [None]:
df.head()

The `tail()` method shows the last five rows:

In [None]:
df.tail()

### Some more information about the data:
- Pclass: passenger class. 
- SibSp: number of siblings+spouses aboard
- Parch: number of parents+children aboard
- Fare: cost of ticket
- Cabin: room ID, if passenger had a room
- Embarked: port of departure (C= Cherbourg; Q= Queenstown; S=Southampton)

### Let's check out a few data exploration techniques

Basic statistics are printed out by the `describe()` method for numeric columns.

In [None]:
df.describe()

We can also visualize pairs of variables quickly. The `pd.plotting.scatter_matrix()` function creates a matrix of plots: the diagonals contain histograms, and the off-diagonal plots are scatter plots.

In [None]:
pd.plotting.scatter_matrix(df[['Parch','Age','Fare']],figsize=(8,8));

###  Grouping

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=c74c32a0-0766-4867-90e7-ab8a00e52015)

Pandas has powerful grouping methods that allow us to group entries based on a column value.

For example, we can ask: does the average survival rate depend on the passenger's ticket class? For this we group passengers based on the column `Pclass`; this creates three groups. Then we calculate the average survival rate for each group separately by calculating the mean of the `Survived` column. (Remember that this column is 1 if the passenger survived and 0 if they died; therefore the mean is the survival rate!)

These steps are done easily with pandas:

In [None]:
df.groupby('Pclass')['Survived'].mean()
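
To see what `groupby()` does behind the scenes, the sketch below compares it with doing the same calculation by hand for one group. The dataframe `toy` contains made-up values, not the real Titanic data.

```python
import pandas as pd

toy = pd.DataFrame({'Pclass':   [1, 1, 3, 3, 3, 3],
                    'Survived': [1, 1, 0, 1, 0, 0]})

# one mean per group
rate = toy.groupby('Pclass')['Survived'].mean()
print(rate)

# the same number for class 3, calculated by hand with a mask
manual_third = toy[toy['Pclass'] == 3]['Survived'].mean()
print(manual_third)
```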

### Exercise

What about "women and children first"? Does sex correlate with survival rate? Calculate the average survival rate for men and women separately!

<details><summary><u>Hint</u></summary>
<p>

Do the same thing as in the previous example, only this time group by `Sex` instead of `Pclass`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
df.groupby(['Sex'])['Survived'].mean()
```
    
</p>
</details>

We can do even more refined grouping. Let's combine the two: group by both class and sex, and calculate the survival rates.

In [None]:
survived_by_class_and_sex = df.groupby(['Pclass','Sex'])['Survived'].mean()
survived_by_class_and_sex

Let's plot these survival rates!

In [None]:
survived_by_class_and_sex.unstack(1).plot(kind='bar', title='Survival Probability by Sex and Class');

### Exercise

In the previous example we used the `unstack()` method. What does this do to grouped data? Try `unstack(0)` and `unstack(1)` in the next cell, and also try the plot without using `unstack()`. Explain what the function does!

<details><summary><u>Hint</u></summary>
<p>

In addition to trying out plots, print out `survived_by_class_and_sex`, `survived_by_class_and_sex.unstack(0)`, and `survived_by_class_and_sex.unstack(1)`. What is their type? What are the indices?

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
The call `df.groupby(['Pclass','Sex'])['Survived'].mean()` returns a series object with [hierarchical indices](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), where `Pclass` is the level 0 index and `Sex` is the subindex at level 1. The method `unstack()` transforms the one-dimensional series with hierarchical indices into a two-dimensional dataframe: the chosen index level becomes the column names and the remaining level stays as the row names.

Try these lines in separate cells:

```python
survived_by_class_and_sex
survived_by_class_and_sex.unstack(1).plot(kind='bar', title='Survival Probability by Sex and Class')
survived_by_class_and_sex.unstack(1)
```
    
</p>
</details>
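
The effect of `unstack()` can also be checked on a small hand-built series. This is a sketch with made-up survival rates; `pd.MultiIndex.from_product` is used here only to build the two-level index by hand.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], ['female', 'male']],
                                   names=['Pclass', 'Sex'])
rates = pd.Series([0.9, 0.4, 0.8, 0.2], index=index)

sex_as_columns = rates.unstack(1)     # rows: Pclass, columns: Sex
class_as_columns = rates.unstack(0)   # rows: Sex, columns: Pclass

print(sex_as_columns)
print(class_as_columns)
```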

### Subsetting

Filtering the data based on some condition is also simple. If you remember, this is similar to using boolean masks in numpy.

We can subset the data to only include passengers below 30:

In [None]:
under_30=df[df['Age']<30]
under_30.head()

### Exercise

Subset the data to only include passengers that paid less than the average fare.

<details><summary><u>Hint</u></summary>
<p>

Calculate the mean fare using `df['Fare'].mean()`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
avg_fare = df['Fare'].mean()
print("The average fare is", round(avg_fare,3))
below_average_fare = df[df['Fare']<avg_fare]
below_average_fare.head()
```
    
</p>
</details>

We can do more complicated filtering using logical operators. (Remember `&`=and and `|`=or)

For example, to select the male passengers who got on the Titanic in France (Cherbourg is in France) we can write:

In [None]:
males_from_france = df[(df['Embarked']=='C') & (df['Sex']=='male')]
males_from_france.head()
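
The parentheses around each condition are needed because `&` and `|` bind more tightly than `==` in Python. A minimal sketch on a made-up dataframe `toy`; the `~` operator (not) is an addition that was not used above.

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                    'Embarked': ['C', 'C', 'S', 'Q']})

# each comparison is wrapped in parentheses before combining
both = toy[(toy['Embarked'] == 'C') & (toy['Sex'] == 'male')]
either = toy[(toy['Embarked'] == 'C') | (toy['Sex'] == 'male')]

# ~ negates a mask
not_c = toy[~(toy['Embarked'] == 'C')]

print(len(both), len(either), len(not_c))
```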

### Exercise

Subset the data to include only passengers who survived and whose age is unknown! Take a look at the `isnull()` method.

<details><summary><u>Hint</u></summary>
<p>

To test if `Age` is `NaN`, write `df['Age'].isnull()`. You also have to check if `Survived` is equal to 1.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
survivors_unknown_age=df[df['Age'].isnull() & (df['Survived']==1)]
survivors_unknown_age.head()
```
    
</p>
</details>

### New columns

We can also create new columns! Let's count the reverends on board.

First we define a helper function that takes a name as input and returns 1 if they are a reverend and 0 if they are a layman. 

In [None]:
def is_rev(input_name):
    # they are a reverend if their name contains the 'Rev.' title
    if 'Rev.' in input_name:
        return 1  
    else:
        return 0

To test the function, we can apply it elementwise to the `Name` column and count the number of reverends on board:

In [None]:
sum(df['Name'].apply(is_rev))

To define a new column we can simply write:

In [None]:
df['is_reverend'] = df['Name'].apply(is_rev)

#check the columns
df.columns
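
As an aside, for simple text checks like this one pandas also offers vectorized string methods under `.str`, which can replace `apply()`. The sketch below uses made-up names and compares the two approaches; `regex=False` makes the dot in `'Rev.'` a literal character.

```python
import pandas as pd

names = pd.Series(['Doe, Rev. John', 'Smith, Mr. Adam', 'Roe, Miss. Jane'])

# elementwise with apply(), as in the cell above
with_apply = names.apply(lambda name: 1 if 'Rev.' in name else 0)

# vectorized alternative: True/False mask converted to 1/0
with_str = names.str.contains('Rev.', regex=False).astype(int)

print(with_apply.tolist())
print(with_str.tolist())
```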

We can also use `apply()` for a function that takes multiple columns as input. For example, we can count the number of reverends that are older than 50: 

In [None]:
sum(df.apply(lambda row: 1 if 'Rev.' in row['Name'] and row['Age']>50 else 0, axis=1))

Note the argument `axis=1`: this tells `apply()` to apply the function to each row, while `axis=0` would apply it to each column.

## Exercises

### Age histogram

What is the distribution of ages? Plot the ages in a histogram.

Advanced: try to combine the histogram with its kernel density estimate using `kind='kde'`.

<details><summary><u>Hint</u></summary>
<p>

Scroll back to the first section, and check out how we plotted histograms for the actor ratings.

Advanced:
* Be sure to set `density=True` for the histogram, so that it matches with the kernel density estimate.
* To plot both graphs on the same axes, use the `ax` argument.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
ax = df.plot(kind='hist',y='Age', alpha=.5, density=True);
df.plot(kind='kde', y='Age', ax=ax);
```
    
</p>
</details>

### New column

Create one of the following new columns (or all of them if you would like to practice)
- Add a new column, `fancy_title`, by writing a function that checks the passenger name for a fancy title like "Master" or "Colonel" or "Count".
- Create a new column, `family_on_board`, by considering both the `SibSp` and `Parch` columns.

<details><summary><u>Hint</u></summary>
<p>

* Look at the example where we added a new column to indicate reverends. We have to do the same here, only this time you have to check whether any of the possible titles appears in the `Name` column.
* For the second problem, you have to consider multiple columns, try using
```python
df.apply(some_function, axis=1)
```

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
fancy_title = df['Name'].apply(lambda x: 1 if "Master" in x or "Colonel" in x or "Count" in x else 0)
df['fancy_title'] = fancy_title

family_on_board = df.apply(lambda x: 1 if x['SibSp']>0 or x['Parch']>0 else 0, axis=1)
df['family_on_board']=family_on_board
```
    
</p>
</details>

### Grouping

Figure out how to aggregate multiple columns with multiple aggregation functions: i.e., use `groupby('Pclass')` and calculate the mean and variance of `Fare` and the number of reverends (the sum of `is_reverend`).


<details><summary><u>Hint</u></summary>
<p>
    
Look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) of `agg()` or this [stackoverflow question](https://stackoverflow.com/questions/12589481/multiple-aggregations-of-the-same-column-using-pandas-groupby-agg).

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
df.groupby(['Pclass'])[['Fare','is_reverend']].agg({'Fare':['mean','var'],'is_reverend':'sum'})
```
    
</p>
</details>
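
When `agg()` is given several functions, the resulting dataframe has two-level column names, and a single cell is addressed with a `(column, function)` pair. A sketch on made-up data, assuming the same column names as in the exercise:

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 1, 2, 2],
                    'Fare': [80.0, 100.0, 10.0, 30.0],
                    'is_reverend': [0, 1, 0, 0]})

summary = toy.groupby('Pclass').agg({'Fare': ['mean', 'var'],
                                     'is_reverend': 'sum'})
print(summary)

# look up one cell: the mean fare in class 1
mean_fare_first = summary.loc[1, ('Fare', 'mean')]
print(mean_fare_first)
```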

### If you have time: Correlations
Check out the documentation of the `corr()` method. Use it on the data and make some hypotheses: besides class and gender, what predicted survival? Plot the correlation between each column and `Survived` using a bar chart.

<details><summary><u>Hint</u></summary>
<p>

The `df.corr()` function returns a symmetric dataframe, where the row and column names are the columns of `df`. To plot the correlation between `Survived` and every other column in `df`, just plot the `Survived` column of `df.corr()`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
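#restrict to numeric columns: recent pandas versions raise an error for text columns
corr_df = df.select_dtypes(include='number').corr()
#advanced: leave out the correlation of 'Survived' with itself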
corr_df[corr_df.index!="Survived"].plot(kind='bar',y='Survived');
corr_df
```
    
</p>
</details>

### If you have time: Missing values and grouping

Some ages are missing from the data. Are they missing at random?
* Create a new column `unknown_age` that is `True` if the age is unknown and `False` if it is not missing.
* Count the number of missing values in the `Age` column. 
* Calculate the probability that age is missing for the different passenger classes `Pclass`.
* Fill in the missing values with an educated guess: set the age to be the average.

Advanced: set the missing age to be the average of the passenger's class.

<details><summary><u>Hint</u></summary>
<p>

Scroll back almost to the beginning of the notebook, where we made an educated guess for Mr. Bean's rating. 

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
#count unknown age
df['unknown_age']= df["Age"].isnull()
print("Number of passengers with unknown age:", df['unknown_age'].sum() )

#count the probability for each group
print(df.groupby('Pclass')['unknown_age'].mean())

#calculate average age
avg_age = df['Age'].mean()
print(avg_age)
#fill NaNs with average (fillna() returns a new series; assign it back to df['Age'] to keep it)
df['Age'].fillna(avg_age)
```
    
</p>
</details>

<details><summary><u>Solution advanced.</u></summary>
<p>
    
```python

#calculate average age in each class
avg_age_by_pclass = df.groupby('Pclass')['Age'].mean()
print(avg_age_by_pclass[1])

#create a series with guessed age for each unknown
age_guess = df[df['unknown_age']]['Pclass'].apply(lambda x: avg_age_by_pclass[x])
#pass this series to the fillna() function (assign the result back to df['Age'] to keep it)
df['Age'].fillna(age_guess)
```
    
</p>
</details>

## Final exercise

Pick **one** of the following problems and create a figure. Upload only a **single figure as a pdf** to Moodle, do not upload your code. Successfully completing this task counts as attendance.

The lottery craze is sweeping the nation! Calculate some statistics from the historical winning numbers and prizes!
Load the file `hun_lotto.csv` containing data of all draws of the Hungarian 5-out-of-90 lotto. Investigate the output of `df.head()`. What do the columns mean?

Data concerning the prize and number of winners is missing before 1998 and substituted by 0. So for plots involving these fields, you should subset the data so only 1998 and the years after are included.

1. The simplest option: plot a histogram of the number of winners with x matches where you select x.
2. Plot the histogram of prize money for one of the winning categories (Prize_match_x).
    * The prize money is stored as a string. Use `apply()` and string operations to convert it to a number.
    * Plot the histogram.
3. Which year was the luckiest?
    * Use `groupby()` and `sum()` to calculate the total number of jackpots in each year
    * Plot this using a bar chart.
4. Do they cheat when drawing the numbers? Plot the histogram of winning numbers! This is a bit tricky:
    * Select the last five columns, for example using `iloc[]` and slicing.
    * Use `unstack()` to create a series that contains all winning numbers.
    * Plot!
5. Come up with an interesting plot yourself!
6. Find an interesting csv file that you have **not** worked with before and plot something interesting!