# Data analysis and visualization with pandas


## Exploratory data analysis in Jupyter

`pandas` is a Python package for data manipulation and analysis:
- a powerful high-level tool for data exploration
- built around two fundamental data structures which can be applied to many types of data: `Series` and `DataFrame`
- has support for missing data (`NaN`, `NaT`)
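
As a minimal sketch of these three points, the cell below builds a `Series` and a `DataFrame` by hand. The names and values are hypothetical and are not taken from the Nobel dataset.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array; np.nan marks a missing number.
ages = pd.Series([34.0, np.nan, 51.0], name="age")

# A DataFrame is a two-dimensional table whose columns are Series.
people = pd.DataFrame({
    "name": ["Ada", "Bo", "Cy"],  # hypothetical names
    "age": ages,
    # None is converted to NaT, the missing-value marker for datetimes
    "born": pd.to_datetime(["1990-01-01", None, "1973-05-17"]),
})

print(people)
# Missing values are skipped by default when computing statistics.
print(people["age"].mean())
```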

We will download and process a dataset on Nobel prizes. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

For later reproducibility, print the versions of the packages used.

In [None]:
print(f'numpy version: {np.__version__}')
print(f'pandas version: {pd.__version__}')
print(f'matplotlib version: {mpl.__version__}')

pandas provides a `read_csv` function for reading CSV files. Given a URL instead of a file path, it downloads and parses the file and returns a `DataFrame` object. Here we read a local copy of the dataset with the default options; the date columns are converted to datetimes later on.

In [None]:
# old dataset from http://oppnadata.se/en/dataset/nobel-prizes/resource/f3da8ba9-a17f-4911-9003-4bcef93619cc
# new dataset from https://ckan.oppnadata.se/dataset/nobel-prizes/resource/cafde48c-586d-4731-95f8-2e91091222d9
nobel = pd.read_csv("data/nobels_new.csv")
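
Dates can also be parsed while reading, using the `parse_dates` option of `read_csv`. The sketch below uses a small hand-written CSV string (hypothetical rows that only mimic the column names used in this notebook) so that it runs without the data file.

```python
import io
import pandas as pd

# Hypothetical sample; only the column names mirror the notebook's dataset.
csv_text = """surname,born,year
Laureate A,1867-11-07,1903
Laureate B,1879-03-14,1921
Laureate C,1885-10-07,1922
"""

# parse_dates lists the columns that should be converted to datetimes on load.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["born"])
print(sample.dtypes)
```

This works well for cleanly formatted dates. For messy columns, an explicit conversion after loading gives more control over how invalid entries are handled.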

The `nobel` variable now contains a `DataFrame` object, a pandas data structure that holds 2D tabular data. The `head(n)` method displays the first `n` rows of this table (5 by default).

In [None]:
nobel.head()

Each column of the `DataFrame` is a `Series`, and so is each row when it is selected on its own. A column can be accessed by its name as follows.

In [None]:
nobel["year"]

In [None]:
type(nobel["year"])

A `Series` object can produce statistical information about the data in it.

In [None]:
nobel["share"].describe()

`describe()` adapts to the type of data it sees, so it can summarize non-numerical data as well (count, number of unique values, most frequent value and its frequency).

In [None]:
nobel["bornCountryCode"].describe()

If you call a method such as `count` on the DataFrame, it is applied to each column (each `Series`) in turn.

In [None]:
nobel.count()

`count()` reports the number of non-missing values per column, so the dataset is clearly not quite complete, especially in the death statistics. Possibly because the laureates are still alive?
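
The number of missing entries can also be counted directly with `isna().sum()`. This is a minimal sketch on a hypothetical toy table, not on the Nobel data.

```python
import pandas as pd

# Hypothetical toy table: two of the three "died" entries are missing.
toy = pd.DataFrame({
    "surname": ["A", "B", "C"],
    "died": ["1950-01-01", None, None],
})

non_missing = toy.count()     # non-missing values per column
missing = toy.isna().sum()    # missing values per column

print(pd.DataFrame({"non_missing": non_missing, "missing": missing}))
```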

We can also use the `describe()` method to request statistics for the entire DataFrame, but by default it only gives statistics for the numerical columns.

In [None]:
nobel.describe()
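
To include the non-numerical columns as well, `describe` accepts an `include` argument. A sketch on a hypothetical toy table:

```python
import pandas as pd

# Hypothetical toy table with one numerical and one text column.
toy = pd.DataFrame({
    "share": [1, 2, 2, 4],
    "category": ["physics", "peace", "physics", "physics"],
})

numeric_only = toy.describe()            # default: numerical columns only
everything = toy.describe(include="all")  # all columns; cells that do not apply are NaN

print(numeric_only)
print(everything)
```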

To calculate some more elaborate statistics, we first add a column (one Nobel prize per laureate). This will add the column "number" to the dataframe with the value 1 for each row.

In [None]:
nobel["number"] = 1
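
A column of ones is convenient because summing it within groups yields counts. The following sketch shows the idea on a hypothetical toy table; the category values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table: one row per laureate.
toy = pd.DataFrame({"category": ["physics", "chemistry", "physics"]})
toy["number"] = 1  # the same scalar is broadcast to every row

# Summing the ones per group gives the number of rows in each category.
per_category = toy.groupby("category")["number"].sum()
print(per_category)
```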

### Age statistics

Let's first look at statistics based on the age of prize recipients.  
We need to convert the "born" column to datetime format: `read_csv` does not parse dates unless asked to, so the column was read as plain strings.

In [None]:
type(nobel["born"][0])

In [None]:
# errors="coerce" turns entries that cannot be parsed into NaT instead of raising;
# this is necessary because the data is a bit messy
nobel["born"] = pd.to_datetime(nobel["born"], errors="coerce")
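
The effect of `errors="coerce"` is easiest to see on a few hand-written strings. The values below are hypothetical examples of messy input, not entries from the dataset.

```python
import pandas as pd

# Hypothetical input: one valid date, one non-date, one impossible date.
raw = pd.Series(["1901-05-12", "not a date", "1950-00-00"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.isna().sum(), "entries could not be parsed")
```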

In [None]:
type(nobel["born"][0])

In [None]:
nobel["born"].dt.year

We can now add a column to the DataFrame with the age at which the prize was received, approximated as the difference between the prize year and the birth year.

In [None]:
nobel["age"] = nobel["year"] - nobel["born"].dt.year
nobel[["surname","age"]].head(10)
#print(nobel["age"].to_string())
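
Because only the years are subtracted, the result does not depend on the day of birth within the year. A small sketch with hypothetical values makes this visible:

```python
import pandas as pd

# Hypothetical rows: same prize year, birthdays at opposite ends of 1950.
toy = pd.DataFrame({
    "year": [2000, 2000],
    "born": pd.to_datetime(["1950-01-15", "1950-12-31"]),
})

# Year difference: both rows get the same age regardless of the birthday.
toy["age"] = toy["year"] - toy["born"].dt.year
print(toy)
```

Rows where the birth date is `NaT` would get `NaN` as their age.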

We can now plot a histogram of the age at which laureates receive their prize, using the matplotlib support built into pandas.

In [None]:
nobel.plot?

In [None]:
nobel["age"].plot.hist(bins=[20,30,40,50,60,70,80,90,100],alpha=0.6);

To extract the numbers, use the `value_counts` method.

In [None]:
nobel["age"].value_counts(bins=[20,30,40,50,60,70,80,90,100])
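
By default `value_counts` orders the result by count, largest first. To list the bins in ascending order instead, as on the histogram axis, the result can be sorted by its index. A sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages, chosen only to populate every bin.
ages = pd.Series([25, 37, 42, 48, 55, 61, 63, 88])
bins = [20, 40, 60, 80, 100]

# sort_index() orders the intervals from lowest to highest.
counts = ages.value_counts(bins=bins).sort_index()
print(counts)
```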

An alternative plot that is better for comparing distributions is the box plot.

The `by` keyword specifies the column by whose values the observations should be **grouped**.

In [None]:
nobel.boxplot(column="age", by="category")
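
The numbers behind such a box plot (median and quartiles per group) can be obtained with `groupby` followed by `describe`. This is a sketch on a hypothetical toy table, not the Nobel data.

```python
import pandas as pd

# Hypothetical toy table with two categories.
toy = pd.DataFrame({
    "category": ["physics", "physics", "peace", "peace"],
    "age": [40, 60, 30, 50],
})

# One row of summary statistics per category.
summary = toy.groupby("category")["age"].describe()
print(summary[["count", "mean", "25%", "50%", "75%"]])
```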

## Exercises

- Discuss together: how can you make sure the analysis is reproducible a year later? Are you working with datasets containing missing data?
- If you are interested in learning more, go to the `more_of_pandas.ipynb` notebook.
- You can also go back to earlier exercises you did not have time to finish before.