# Python for Data Analysis using Pandas (part 1 of 2)

> The latest version of this notebook is always found at [github.com/tommyod/awesome-pandas](https://github.com/tommyod/awesome-pandas).   
  Improvements, corrections or suggestions? **Please submit a [Pull Request](https://github.com/tommyod/awesome-pandas/pulls).**
  
  ![](pandas_vs_excel_vs_sas.png)

# Table of contents

- **(1) Setup**
  - (1.1) Installing Python and packages
  - (1.2) Importing packages
- **(2) Importing data**
  - (2.1) Importing .csv files
  - (2.2) Other ways of creating DataFrames
  - (2.3) Changing names and data types
- **(3) Summarizing data**
  - (3.1) Peeking at the data
  - (3.2) Null values and summary statistics
  - (3.3) Unique values, value counts and sorting
  - (3.4) Basic visualizations
- **(4) Selecting and computing new columns**
  - (4.1) Accessing index, columns and data
  - (4.2) Selecting subsets of columns
  - (4.3) Selecting subsets of rows
  - (4.4) Selecting subsets of rows *and* columns
  - (4.5) Creating new columns
  - (4.6) Applying functions
  
  
**In the next video:** filtering and sorting, split-apply-combine, plotting, time series, machine learning, ...

---------------------------------

# (1) Setup

## (1.1) Installing Python and packages

### Python and Anaconda

If you haven't done it yet, you need to install Python.
I recommend the [Anaconda Distribution](https://www.anaconda.com/download/), and you should install version `3.X`.
- If you're on Windows, you will get a program called *Anaconda Prompt*. Open it and run `conda --version` to verify that everything works.
- If you're on Linux, open a terminal and run `conda --version`.

Type `conda list` to see every installed package, and `conda update --all` to update every package. Type `python` to open an interactive Python terminal, and `exit()` to leave.

### NumPy, matplotlib and Pandas

![](https://indranilsinharoy.files.wordpress.com/2013/01/scientificpythonecosystemsi.png?w=584&h=442)

*Image source: https://indranilsinharoy.com/2013/01/06/python-for-scientific-computing-a-collection-of-resources/*

To install individual packages, run `conda install <package>`.   
The Anaconda distribution comes with 3 packages which this tutorial requires, namely [pandas](https://pandas.pydata.org/), [NumPy](http://www.numpy.org/) and [matplotlib](https://matplotlib.org/).
We'll also briefly use [sklearn](https://scikit-learn.org/stable/).

- **NumPy** implements $n$-dimensional arrays in Python for efficient numerical computations. See the [arXiv](https://arxiv.org/pdf/1102.1523.pdf) paper for a nice introduction. To learn basic NumPy, consider doing these [100 NumPy exercises](https://github.com/rougier/numpy-100). For an in-depth look at NumPy and vectorization to speed up scientific computing, see [From Python to Numpy](https://www.labri.fr/perso/nrougier/from-python-to-numpy/).
- **Matplotlib** is the most popular library for plotting in Python. See the beautiful [gallery](https://matplotlib.org/gallery.html) to get an overview of the capabilities of matplotlib. Read the [Matplotlib tutorial](http://www.labri.fr/perso/nrougier/teaching/matplotlib/matplotlib.html) by Nicolas P. Rougier for an introduction.
- **Pandas** is a library for data analysis based on two objects, the [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) and the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

### Jupyter

A [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/) is an environment for running Python code interactively, displaying graphs and working with data. Think of it as a tool with capabilities somewhere between a simple terminal and a full-fledged IDE. Move to a directory using the `cd` command in the terminal, then run `jupyter notebook` to start up a notebook. A video introduction to JupyterLab is [JupyterLab: Building Blocks for Interactive Computing](https://www.youtube.com/watch?v=Ejh0ftSjk6g). See also this list of [28 Jupyter Notebook tips](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/).

## (1.2) Importing packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import KDEpy
import sklearn
%matplotlib inline

### Package versions

To make this Jupyter Notebook more easily reproducible, we list versions of the libraries we will be using.

In [None]:
import datetime

print('Today is', datetime.datetime.now(datetime.timezone.utc))  # Timezone-aware UTC timestamp
print('-'*2**6)

for lib in [pd, np, matplotlib, KDEpy, sklearn]:
    print(f'{lib.__name__.ljust(12)} version {lib.__version__}')

### Using Jupyter Notebooks

- Useful shortcuts: `SHIFT + TAB`, `SHIFT + ENTER`, `ESC`, `ENTER`, `E`, `A`, `D,D`, `I, I`. Type `H` to see all shortcuts.
- Executing terminal commands from within the notebook using `!`.
- Using markdown and $\LaTeX{}$.
- Timing cells using `%%timeit` and other built-in magic commands.
- Pitfalls when using notebooks: state, order of execution, tidiness.

Example of $\LaTeX{}$ usage in notebooks:

$$0 \leq
  \left[\operatorname{tr}(\mathbf{A} \mathbf{B})\right]^2 \leq
  \operatorname{tr}\left(\mathbf{A}^2\right) \operatorname{tr}\left(\mathbf{B}^2\right) \leq
  \left[\operatorname{tr}(\mathbf{A})\right]^2 \left[\operatorname{tr}(\mathbf{B})\right]^2$$
$$\varphi_X(t) = \operatorname{E}\left[\exp \left ({i\int_\mathbf{R} t(s)X(s)ds} \right ) \right]$$

# (2) Importing data

Starting a cell with a `!` lets us use terminal commands. The UNIX `head` command shows the first rows of the file.

## (2.1) Importing `.csv` files

In [None]:
!tree . -L 2

In [None]:
!head -n 2 data/movie_metadata.csv

> **Interested in learning UNIX commands?** The book [The Linux Command Line](https://www.amazon.com/Linux-Command-Line-Complete-Introduction-ebook/dp/B006X2QEQS) gives a detailed introduction, and [Data Science at the Command Line](https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853) shows how basic data manipulation may be done using the command line only.

The file has many columns, so we'll only load a couple into a pandas DataFrame.
To familiarize ourselves with [magic commands](http://ipython.readthedocs.io/en/stable/interactive/magics.html), we'll use `%%time` to time the execution of the cell below.

In [None]:
%%time

cols_to_use = ['movie_title', 'director_name', 'country', 'content_rating', 'imdb_score', 'gross']
df = pd.read_csv(r'data/movie_metadata.csv', sep=',', usecols=cols_to_use)
print(f'Loaded data of size {df.shape} into memory.')

In [None]:
df.head(2)  # Show the top 2 rows

The `df.shape` attribute gives the number of rows and columns of the DataFrame as a `(rows, columns)` tuple.

In [None]:
df.shape  # Alternatively, use len(df) for row count
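
As a small illustration on made-up data (the `toy` DataFrame below is hypothetical and not part of the movie dataset), `shape` can be unpacked like any other tuple:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate `shape`
toy = pd.DataFrame({'title': ['A', 'B', 'C'], 'score': [7.1, 6.4, 8.0]})

n_rows, n_cols = toy.shape  # A tuple: (number of rows, number of columns)
print(n_rows, n_cols)
print(len(toy))  # Row count only
```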

## (2.2) Other ways of creating DataFrames

**The DataFrame**

> Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with **labeled axes** (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a **dict-like
container for Series objects**. The primary pandas data structure.

**Creating a DataFrame from scratch**

In [None]:
pd.DataFrame({'name':['Max', 'Mark', 'Mia'], 'age':[31, 25, 38]})

**Reading an HTML table from the web**

In [None]:
# Read HTML tables into a list of DataFrame objects.
url = r'https://en.wikipedia.org/wiki/List_of_Germans_by_net_worth'
tables = pd.read_html(url, header=0)

df_net_worth = tables[0]

# Asserts can be useful for sanity checks
assert len(df_net_worth) > 0 
assert df_net_worth.Name.is_unique


df_net_worth.head()

**Reading from databases is also possible.**

For example, data can be read from Microsoft SQL Server using `pyodbc` to create a connection, and then calling `pd.read_sql(sql_code, connection)`.
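
A minimal sketch of the same pattern is shown below. It assumes an in-memory SQLite database (via the standard library module `sqlite3`) as a stand-in for a real database server, so that it runs anywhere. The table `people` and its contents are made up for illustration.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for a real server
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE people (name TEXT, age INTEGER)')
connection.executemany('INSERT INTO people VALUES (?, ?)',
                       [('Max', 31), ('Mark', 25), ('Mia', 38)])
connection.commit()

# The query result is returned as a DataFrame
sql_code = 'SELECT name, age FROM people WHERE age > 30 ORDER BY age'
df_people = pd.read_sql(sql_code, connection)
connection.close()

print(df_people)
```

With `pyodbc`, only the line creating `connection` should need to change; the call to `pd.read_sql` stays the same.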

---------

> **Gotcha.** Methods on DataFrames **return a new instance** by default. In other words, they behave like methods on *immutable* Python objects, and not like methods on *mutable* objects.

In [None]:
# Lists are MUTABLE
scores = [6, 2, 4, 9, 1]
scores.sort()  # Changes the object in-place, returns None
print(scores)

# Strings are IMMUTABLE
my_name = 'tommy'
my_name = my_name.capitalize()  # A new instance is returned
print(my_name)
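
The same behaviour can be sketched for a DataFrame. The `toy` data below is hypothetical: calling `sort_values` leaves the original object untouched, so the result must be assigned to a name to be kept.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'score': [6, 2, 4]})

# A new, sorted DataFrame is returned; `toy` itself is not changed
sorted_toy = toy.sort_values('score')

print(toy['score'].tolist())         # Original order
print(sorted_toy['score'].tolist())  # Sorted order
```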

## (2.3) Changing names and data types

In [None]:
# Alter axes labels
df_net_worth = (df_net_worth
                .rename(columns={'Net worth (USD)': 'net_worth',
                                'World ranking': 'world_ranking',
                                'Sources of wealth': 'wealth_source'}))

df.rename(columns=str.capitalize).head(2)

The data type of each column may be listed using `df.dtypes`. Automatic conversion is possible via `pd.to_numeric` and `pd.to_datetime`.

In [None]:
df.dtypes
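
To illustrate the conversion functions mentioned above, here is a small sketch on made-up data (the `raw` DataFrame is hypothetical). Passing `errors='coerce'` turns values that cannot be parsed into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical data where every column was read as text
raw = pd.DataFrame({'amount': ['1.5', '2', 'n/a'],
                    'date': ['2018-01-31', '2018-02-28', '2018-03-31']})

converted = raw.assign(
    amount=pd.to_numeric(raw['amount'], errors='coerce'),  # 'n/a' becomes NaN
    date=pd.to_datetime(raw['date']),
)

print(raw.dtypes)
print(converted.dtypes)
```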

In [None]:
df_net_worth['net_worth_num'] = (df_net_worth['net_worth']
                             .str.replace(' billion', '')
                             .apply(float))

df_net_worth.head(2)

### Getting help

There are many ways to get help on objects and methods.

- Use `SHIFT + TAB` in a notebook.
- Use question marks in the notebook, e.g. `?df.sum`.
- Use the built-in Python function `help`, e.g. `help(df.sum)`.

In [None]:
# df.sum?

# (3) Summarizing data

This section shows some important methods related to summarizing data.

## (3.1) Peeking at the data

Three methods that are useful when peeking at the data are `df.head`, `df.tail` and `df.sample`.
Head and tail are $\mathcal{O}(1)$ operations, while sample is $\mathcal{O}(n)$, where $n$ is the number of rows.
For small datasets, this makes no difference in practice. We'll use `df.sample` here.

In [None]:
# Return the first `n` rows.
df.head(n=2)  # df.tail(n=2) returns the last rows

In [None]:
df.sample(n=2, replace=False, weights=None, random_state=None)
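
Since `df.sample` is random, the rows change on every call. A minimal sketch on hypothetical data shows how fixing `random_state` to an integer makes the sample reproducible:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'x': range(100)})

# The same integer seed yields the same rows
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

print(first.equals(second))
```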

## (3.2) Null values and summary statistics

We should make sure the data types are correct. To do so, we can use `df.dtypes`, or `df.info()` for some more information.

In [None]:
# Print a concise summary of a DataFrame
# (on pandas < 1.2, use `null_counts=True` instead of `show_counts=True`)
df.info(verbose=True, memory_usage=True, show_counts=True)

We have some null values. Let's count them by chaining `df.isnull()` and `df.sum()`.

In [None]:
# Detect missing values -> sum down the rows, giving one count per column
null_values = df.isnull().sum(axis=0)
null_values #  / len(df)

The result of the above is not a DataFrame object, but a Series.

In [None]:
type(null_values)

![alt text](https://www.mathsisfun.com/algebra/images/scalar-vector-matrix.svg)
*Image source:* https://www.mathsisfun.com/algebra/scalar-vector-matrix.html


We can make the output prettier by converting the `null_values` Series to a DataFrame using the `to_frame()` method, then transposing using `.T`, and finally renaming the first index.

In [None]:
null_values.to_frame().T.rename(index={0: 'Missing values'})

The above is called *method chaining*, and can be written like so:

In [None]:
(df
    .isnull()    # Figure out whether every entry is null (missing), or not
    .sum(axis=0) # Sum over each column, axis=0 is the default
    .to_frame()  # The result is a Series, convert to DataFrame
    .T           # Transpose (switch rows and columns)
    .rename(index={0:'Missing values'}) # Rename the index and show it
)

A tour of summarization would not be complete without `df.describe()`.  
Calling `df.count()`, `df.nunique()`, `df.mean()`, `df.std()`, `df.min()`, `df.quantile()`, `df.max()` is also possible.

> **Gotcha.** There are 200+ methods defined on a DataFrame. See the [API Reference](https://pandas.pydata.org/pandas-docs/stable/api.html) for an exhaustive list.

In [None]:
# dir(pd.DataFrame)

In [None]:
df.describe(percentiles=[0.5], include='all').fillna('')

In [None]:
df.shape

In [None]:
df.dropna(axis=0, how='any').shape

In [None]:
df.drop_duplicates(subset=None).shape  # Use df[df.duplicated()] to see rows

In [None]:
df = df.dropna(axis=0, how='any').drop_duplicates(subset=None)
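
The effect of the two cleaning steps can be sketched on a small, made-up DataFrame (the `toy` data is hypothetical), where one row has a missing value and one row is an exact duplicate:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row duplicates the first, the third has a NaN
toy = pd.DataFrame({'title': ['A', 'A', 'B', 'C'],
                    'gross': [1.0, 1.0, np.nan, 3.0]})

no_nulls = toy.dropna(axis=0, how='any')     # Drops the row with the NaN
no_dupes = toy.drop_duplicates(subset=None)  # Drops the repeated row
cleaned = toy.dropna(axis=0, how='any').drop_duplicates(subset=None)

print(toy.shape, no_nulls.shape, no_dupes.shape, cleaned.shape)
```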

## (3.3) Unique values, value counts and sorting

In [None]:
df.content_rating.unique()  # Not the same as: df.content_rating.is_unique

In [None]:
print(df.content_rating.drop_duplicates().tolist())

In [None]:
df.content_rating.value_counts()

In [None]:
df[['movie_title', 'gross']].nlargest(3, 'gross')

In [None]:
# Sort by country, then by IMDB_score. Put NA values last
df.sort_values(by=['country', 'imdb_score'], 
               ascending=[True, False], 
               na_position='last').head()

## (3.4) Basic visualizations

Some quick visualizations.

In [None]:
df.corr(method='pearson', numeric_only=True)  # Text columns are skipped (pandas >= 1.5)

In [None]:
(df.content_rating
 .value_counts()
 .to_frame()  # Below are the default values, except `figsize`
 .plot.barh(subplots=False, sharex=None, sharey=False, layout=None, 
            figsize=(10, 5), use_index=True, title=None, grid=None, legend=True, 
            style=None, logx=False, logy=False, loglog=False, xticks=None, 
            yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, 
            colormap=None, table=False, yerr=None, xerr=None, 
            secondary_y=False));

In [None]:
df.gross.plot.kde(bw_method=0.1, grid=True, title='Gross', lw=3, figsize=(10, 5));

In [None]:
plot = pd.plotting.scatter_matrix(df, alpha=0.5, figsize=(10, 5))

# (4) Selecting and computing new columns

This section is about selecting subsets of a dataset, or creating new data from existing data, i.e.:

- Selecting a **single column**, or a **subset of columns**.
- Selecting a **subset of rows**, i.e. filtering.
- Chaining and/or combining the above operations to accomplish both.


## (4.1) Accessing index, columns and data

In [None]:
df.index

In [None]:
df.columns

In [None]:
# This is very useful when passing data to other libraries
df.to_numpy()

In [None]:
df.gross.dropna().to_numpy()

## (4.2) Selecting subsets of columns

In [None]:
print(df.columns.tolist())  # Get the columns

In [None]:
df.director_name.head(3)  # Alternatively, use df['director_name'].head(3)

Selecting two or more columns.

In [None]:
df[['movie_title', 'country']].head(2)

The most useful selection function is `df.loc[[row1, row2, ...], [col1, col2, ...]]`.

- `df.loc[:, [col1, col2]]` selects every row, and columns `[col1, col2]`
- `df.loc[[row1, row2], :]` selects rows `[row1, row2]`, and every column
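
The two patterns above, and their combination, can be sketched on hypothetical data with explicit row labels (`r1`, `r2`, `r3` are made up for the example):

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'country': ['USA', 'UK', 'France'],
                    'score': [7.1, 6.4, 8.0]},
                   index=['r1', 'r2', 'r3'])

every_row = toy.loc[:, ['title', 'score']]          # All rows, two columns
two_rows = toy.loc[['r1', 'r3'], :]                 # Two rows, all columns
both = toy.loc[['r1', 'r3'], ['title', 'score']]    # Two rows, two columns

print(both)
```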

In [None]:
df.loc[:, ['movie_title', 'country']].head(2)

In [None]:
a = df.loc[:, 'gross']  # Returns a Series
b = df.loc[:, ['gross']]  # Returns a DataFrame

print(type(a))
print(type(b))

Instead of selecting a subset of columns to *keep*, we can select a subset to *drop*.

In [None]:
# Drop specified labels from rows or columns
df.drop(columns=['director_name', 'gross', 'movie_title']).head(3)

In [None]:
# Integer-location based indexing
df.iloc[1:3, [0, 1, 2]]
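
The difference between `iloc` and `loc` matters once the index labels no longer match the row positions, for instance after dropping rows. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data where the labels (10, 20, ...) differ from the positions (0, 1, ...)
toy = pd.DataFrame({'title': ['A', 'B', 'C', 'D']}, index=[10, 20, 30, 40])

by_position = toy.iloc[1:3]  # Positions 1 and 2, the end point is excluded
by_label = toy.loc[20:30]    # Labels 20 through 30, the end point is included

print(by_position.equals(by_label))
```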

## (4.3) Selecting subsets of rows

In [None]:
# Return the first `n` rows
df.head(n=1)

In [None]:
# Access a group of rows and columns by label(s) or a boolean array
df.loc[[0], :]

In [None]:
df.loc[[0]]

In [None]:
# Top three movies / TV-series not from the USA
df[df.country != 'USA'].nlargest(3, 'imdb_score')

In [None]:
# Best non-American films, with content rating PG-13
mask = (df.country != 'USA') & (df.content_rating == 'PG-13')
df[mask].nlargest(3, 'imdb_score')
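
Boolean masks are combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses, since these operators bind more tightly than `==` and `!=`. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'country': ['USA', 'UK', 'France', 'UK', 'USA'],
                    'content_rating': ['PG-13', 'PG-13', 'R', 'PG', 'R'],
                    'imdb_score': [7.0, 8.1, 6.5, 7.7, 5.0]})

not_usa = (toy.country != 'USA')
pg13 = (toy.content_rating == 'PG-13')

both = toy[not_usa & pg13]        # Both conditions hold
either = toy[not_usa | pg13]      # At least one condition holds
neither = toy[~(not_usa | pg13)]  # No condition holds

print(len(both), len(either), len(neither))
```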

## (4.4) Selecting subsets of rows *and* columns

In [None]:
# Above average movies, with the title containing 'ring'
row_mask = ((df.imdb_score > df.imdb_score.mean()) & 
             df.movie_title.str.contains('ring'))
df.loc[row_mask, ['director_name', 'movie_title', 'country']]

In [None]:
# Columns containing an underscore
cols = [c for c in df.columns if '_' in c]
df.loc[:, cols].head()

In [None]:
# Numerical (float) columns. The alias `np.float` was removed in NumPy 1.24
numeric_cols = df.dtypes[df.dtypes == np.float64].index.tolist()
df.loc[:, numeric_cols].head(n=2)
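
An alternative way to select columns by data type is the `select_dtypes` method. The sketch below uses hypothetical data; `'number'` matches both integer and float columns.

```python
import pandas as pd

# Hypothetical toy data with one text, one float and one integer column
toy = pd.DataFrame({'title': ['A', 'B'], 'gross': [1.5, 2.5], 'votes': [10, 20]})

float_cols = toy.select_dtypes(include='float64').columns.tolist()
number_cols = toy.select_dtypes(include='number').columns.tolist()

print(float_cols)
print(number_cols)
```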

## (4.5) Creating new columns

In [None]:
temp = df.copy()  # Copy the DataFrame

# Create a new column - based on the gross income
temp['log_gross'] = temp['gross'].apply(np.log10)

temp.head(2)

In [None]:
temp.plot.scatter(x='imdb_score', y='log_gross', alpha=0.2, s=15, figsize=(10, 5));

In [None]:
# Assign new columns to a DataFrame, returning a new object
# (a copy) with the new columns added to the original ones.
(temp.assign(log_gross=lambda df:df.gross.apply(np.log10))).head()

# One advantage is that method chaining can be used
(temp
     .assign(log_gross=lambda df:df.gross.apply(np.log10)) # Create a new column
     .loc[:, ['country', 'content_rating', 'log_gross']] # Filter
     .groupby(['country', 'content_rating']) # Group by and mean
     .mean()
     .reset_index() # Reset the index to sort
     .sort_values(['country', 'log_gross'], ascending=[True, False]) # Sort the results
     .set_index(['country', 'content_rating']) # Re-index
     .assign(log_gross=lambda df:df.log_gross.round(2)) # Re-define the column and round it
     .head(5)
)

## (4.6) Applying functions

On a `pd.Series`:

- `pd.Series.map` applies an elementwise $f: \mathbb{R} \to \mathbb{R}$ function (e.g. `str`, or `float`)
- `pd.Series.apply` applies a vectorized $f: \mathbb{R}^n \to \mathbb{R}^n$ function  (e.g. `log`, or `sin`)
- `pd.Series.aggregate` applies an aggregation $f: \mathbb{R}^n \to \mathbb{R}$ function  (e.g. `mean`, or `std`)

On a `pd.DataFrame`:

- `pd.DataFrame.applymap` applies an elementwise $f: \mathbb{R} \to \mathbb{R}$ function to every element (called `pd.DataFrame.map` since pandas 2.1)
- `pd.DataFrame.apply` applies a vectorized $f: \mathbb{R}^n \to \mathbb{R}^n$ function to every column (or to every row, with `axis=1`)
- `pd.DataFrame.aggregate` applies an aggregation $f: \mathbb{R}^n \to \mathbb{R}$ function over an axis
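
The three kinds of functions listed above can be sketched on small, hypothetical objects (`s` and `frame` are made up for the example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 10.0, 100.0])

as_text = s.map(lambda x: f'{x:.0f}')  # Elementwise: one value in, one value out
logged = s.apply(np.log10)             # Vectorized: n values in, n values out
total = s.aggregate('sum')             # Aggregation: n values in, one value out

frame = pd.DataFrame({'a': [1.0, 4.0], 'b': [9.0, 16.0]})

# On a DataFrame, `apply` calls the function once per column (axis=0)
col_range = frame.apply(lambda col: col.max() - col.min(), axis=0)

print(as_text.tolist(), logged.tolist(), total)
print(col_range)
```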

### Functions on Series

In [None]:
# Map values of Series using input correspondence (a dict, Series, or function).
df.gross.map(float).head(2)

In [None]:
# Dictionaries are also maps, but brittle since missing keys map to NaN
(df.content_rating
    .map({'PG-13':'inappropriate for children under 13', 
          'PG': 'may not be suitable for children'}, na_action='ignore')
    .value_counts(dropna=False)
    .to_frame())
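
The brittleness can be seen in a minimal sketch on hypothetical data. One possible workaround, shown here as an assumption rather than a recommendation from the original notebook, is to fall back to the original value with `fillna`.

```python
import pandas as pd

# Hypothetical ratings and a dictionary that does not cover 'R'
ratings = pd.Series(['PG-13', 'PG', 'R'])
labels = {'PG-13': 'teens', 'PG': 'guidance'}

mapped = ratings.map(labels)                  # 'R' is not a key, so it becomes NaN
filled = ratings.map(labels).fillna(ratings)  # Fall back to the original value

print(mapped.tolist())
print(filled.tolist())
```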

The `apply` method accepts a NumPy [ufunc](https://docs.scipy.org/doc/numpy-1.15.1/reference/ufuncs.html), which is then applied to all values at once.

> A *universal function* (or ufunc for short) is a function that operates on ndarrays in an **element-by-element** fashion.
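
As a short sketch on hypothetical values, a ufunc such as `np.log10` can also be called directly on a Series. The result matches an explicit Python loop over the elements, without writing the loop by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical values
values = pd.Series([1.0, 10.0, 100.0])

vectorized = np.log10(values)  # The ufunc works on the whole Series and returns a Series
looped = pd.Series([np.log10(v) for v in values])  # Same result, one element at a time

print(np.allclose(vectorized, looped))
```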

In [None]:
# Invoke function on values of Series. Can be ufunc (a NumPy function
# that applies to the entire Series) or a Python function that only works
# on single values
df.gross.apply(np.log10).head(2)

In [None]:
# Aggregate using one or more operations over the specified axis.
df.gross.aggregate(np.mean, axis=0)

---------------

### Functions on DataFrames

In [None]:
df.loc[:, ['gross', 'imdb_score']].apply(np.log).head(2)

In [None]:
df.loc[:, ['gross', 'imdb_score']].applymap(int).head(2)

In [None]:
df.loc[:, ['gross', 'imdb_score']].mean().head(2)

In [None]:
# Or specify your own aggregation function
def spread(array):
    return np.max(array) - np.min(array)

df.loc[:, ['gross', 'imdb_score']].aggregate(spread, axis=0)

## Next up ...

**In the next video:** filtering and sorting, split-apply-combine, plotting, time series, machine learning, ...