<a href="https://colab.research.google.com/github/worldbank/dec-python-course/blob/session3/1-foundations/3-numpy-and-pandas/foundations-s3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Libraries
In Python, a package is a collection of modules, and a library is a collection of packages. In practice, "Python library" and "Python package" are used interchangeably to refer to a reusable chunk of code. Using libraries allows us to "stand on the shoulders of giants".

## Examples of Python libraries
- [NumPy](https://numpy.org/) stands for Numerical Python. It is the fundamental Python package for scientific computing.
- [pandas](https://pandas.pydata.org/) is a Python package for fast and efficient processing of tabular data, time series, matrix data, etc.
- [Matplotlib](https://matplotlib.org/)  is a comprehensive library for creating data visualizations in Python. 

## How do I get some?

In [131]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print(np.pi)

3.141592653589793


`as` is optional; it is usually used to alias the library name to a shorthand or for disambiguation. The aliases above are the conventional ones for these libraries. If you `import numpy` without aliasing, be sure to use `numpy` instead of `np` when calling the library's functions later.
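To make the aliasing point concrete, here is a minimal sketch showing that an alias is just a second name for the same module. The `from ... import ...` form at the end is an additional import style not used above, included only for comparison.

```python
import numpy            # no alias: refer to the library by its full name
import numpy as np      # conventional alias

# Both names point to the same module object
print(numpy is np)
print(numpy.pi == np.pi)

# Import a single name directly (an alternative style, shown for comparison)
from math import sqrt
print(sqrt(16))
```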

Common libraries are pre-installed on Google Colab, so you can `import` them just as you would a built-in Python module (e.g. `import math`). To see which libraries are pre-installed and their versions:

In [132]:
# Note: ! in google colab executes a bash command
# `| more` shortens the output content for immediate display
!pip freeze | more

absl-py==1.2.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.2.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arviz==0.12.1
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
atari-py==0.2.9
atomicwrites==1.4.1
attrs==21.4.0
audioread==2.1.9
autograd==1.4
Babel==2.10.3
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==5.0.1
blis==0.7.8
bokeh==2.3.3
branca==0.5.0

To check if a library you want to use is already installed:

In [133]:
!pip freeze | grep pandas

pandas==1.3.5
pandas-datareader==0.9.0
pandas-gbq==0.13.3
pandas-profiling==1.4.1
sklearn-pandas==1.8.0


[pip](https://pip.pypa.io/) is the de facto Python package manager. You can use it to view the currently installed packages, to install a new library, or to upgrade an existing library:

In [134]:
# install/upgrade to the latest stable version of a package
!pip install pandas --upgrade 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [135]:
# install a specific version of a package
!pip install pandas==1.3.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
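Besides `pip freeze`, the installed version can also be checked from Python itself. The sketch below assumes NumPy is installed; `importlib.metadata` is part of the standard library in Python 3.8 and later.

```python
from importlib.metadata import version

import numpy as np

# Version as recorded in the package metadata (what pip reports)
print(version("numpy"))

# Many libraries also expose the version as a `__version__` attribute
print(np.__version__)
```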


## Exercises
The following exercises are meant to be completed step by step. Upon successful completion of all steps, the last cell should execute without any `AssertionError`.

Step 1. Run a command to find out the currently installed version of `matplotlib`:

In [136]:
# hint: use ! and bash command `pip freeze` and pipe the results to bash command `grep matplotlib`

Step 2. Install version 3.5.2 of `matplotlib` for this notebook. Upon successful installation, click on "RESTART RUNTIME" in the code cell output.

In [137]:
# hint: use ! and bash command `pip install`

Step 3. Import `matplotlib`'s `pyplot` module and alias it as `plt`

In [138]:
# hint: use `import ... as ...`

# === Do not modify code below ===
fig = plt.figure(figsize=(10, 80))
assert hasattr(fig, "subfigures"),\
 "If correct version of matplotlib were installed you should not see this message"

print("Well done!")

AssertionError: ignored

<Figure size 720x5760 with 0 Axes>

If you have successfully completed the above exercises and are feeling adventurous, head over to `foundations-s3-bonus.ipynb` – it contains some bonus content and corresponding exercises on this topic.

# NumPy

NumPy (**Numerical Python**) is an open source Python library that’s used in almost every field of science and engineering. It’s the **universal standard** for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. It serves as the foundation for popular data science and scientific Python packages, such as pandas, SciPy, Matplotlib, and scikit-learn.



## ndarray
The NumPy library provides **powerful functionalities** for numerical operations and does so **efficiently**. Its basic building block is `ndarray`, a homogeneous n-dimensional array object, with methods to efficiently operate on it. `ndarray` is a simple and flexible data structure that can represent vectors (1-D arrays), matrices (2-D arrays), and tensors (3-D or higher dimensional arrays).




### Wait, what's an array?

In computer science, an array is a data structure consisting of a collection of elements, each identified by an index. 

### How does a NumPy `ndarray` compare to a Python `list` or `array`?
In Python, arrays are most often represented using `list`, which allows different data types within a single list. The Python standard library also comes with its own `array` module, which requires every element of an array to have the same data type. Python's `array` is more compact (takes less storage space) than `list`, but it offers only basic sequence operations. NumPy's `ndarray` is like Python's `array` on steroids in terms of functionality.

In [139]:
from array import array
import numpy as np

py_list = [1, 2, '3']
py_array = array('i', [1, 2, 3]) # i indicates integer
np_array = np.array([1, 2, 3])
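A short sketch of how the three containers behave differently in practice. The values are made up for illustration; the point is that `*` repeats a `list` but multiplies an `ndarray` elementwise, and that a typed `array` rejects an element of the wrong type.

```python
from array import array

import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# `*` on a list repeats the list
print(py_list * 2)    # [1, 2, 3, 1, 2, 3]

# `*` on an ndarray multiplies each element
print(np_array * 2)   # [2 4 6]

# A typed array refuses elements of another type
try:
    array('i', [1, 2, '3'])
except TypeError as err:
    print("TypeError:", err)
```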

### How to create a NumPy `ndarray`?

In [140]:
# give me 3 zeros
np.zeros(3)

array([0., 0., 0.])

In [141]:
# give me 4 ones
np.ones(4)

array([1., 1., 1., 1.])

In [142]:
# give me 5 numbers randomly sampled from [0.0, 1.0)
np.random.random(5)

array([0.11488624, 0.00615753, 0.09633504, 0.87404725, 0.9838823 ])

In [143]:
# give me 4 consecutive numbers
# starts at 0 by default, stop is excluded
one_d_array = np.arange(4)
one_d_array

array([0, 1, 2, 3])

In [144]:
# start, stop, step
np.arange(1, 9, 2)

array([1, 3, 5, 7])

In [145]:
# 2-D array / matrix
two_d_array = np.array([[1, 2], [3, 4]])
two_d_array

array([[1, 2],
       [3, 4]])

In [146]:
# 3-D array
three_d_array = np.array([[[1.0, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
three_d_array

array([[[ 1.,  2.],
        [ 3.,  4.]],

       [[ 5.,  6.],
        [ 7.,  8.]],

       [[ 9., 10.],
        [11., 12.]]])

### How to find properties of a `ndarray`?

In [147]:
# dimension
two_d_array.ndim

2

In [148]:
three_d_array.ndim

3

In [149]:
# shape
two_d_array.shape

(2, 2)

In [150]:
three_d_array.shape

(3, 2, 2)

In [151]:
# how many elements total?
three_d_array.size

12

In [152]:
# what's the data type of the elements?
three_d_array.dtype

dtype('float64')
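Because an `ndarray` is homogeneous, a single float in the input is enough to make every element a float, which is why `three_d_array` reports `float64`. A minimal sketch with made-up values; the `dtype` argument shown at the end is one way to request a specific type explicitly.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> integer dtype
mixed = np.array([1.0, 2, 3])    # one float -> every element becomes a float

print(ints.dtype)    # an integer dtype (the exact width depends on the platform)
print(mixed.dtype)   # float64

# Request a dtype explicitly when creating an array
zeros_as_int = np.zeros(3, dtype=int)
print(zeros_as_int)  # [0 0 0]
```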

In [153]:
print(f'''Array dimension: {one_d_array.ndim},
      shape: {one_d_array.shape}, 
      size: {one_d_array.size},
      data type: {one_d_array.dtype}''')

Array dimension: 1,
      shape: (4,), 
      size: 4,
      data type: int64


## Manipulating and operating on `ndarray`
With a basic understanding of `ndarray`, let's go ahead and do some maths with it.

### How to do maths and (even) linear algebra with ndarrays?

In [154]:
one_d_array.min()

0

In [155]:
one_d_array.max()

3

In [156]:
one_d_array.sum()

6

In [157]:
# broadcasting
one_d_array * 2

array([0, 2, 4, 6])

In [158]:
# arithmetic operations
one_d_array + one_d_array * 2

array([0, 3, 6, 9])

In [159]:
one_d_array - two_d_array

ValueError: ignored
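The subtraction above fails because the shapes `(4,)` and `(2, 2)` are not compatible for broadcasting. As a rough sketch of the rule: shapes are compared from the last dimension backwards, and each pair of dimensions must either be equal or contain a 1. The `row` array below is a made-up example of a shape that is compatible.

```python
import numpy as np

one_d_array = np.arange(4)                 # shape (4,)
two_d_array = np.array([[1, 2], [3, 4]])   # shape (2, 2)

# Last dimensions are 4 and 2: neither equal nor 1, so this fails
try:
    one_d_array - two_d_array
except ValueError as err:
    print("ValueError:", err)

# Shape (2,) against (2, 2): last dimensions match, so `row` is applied to every row
row = np.array([10, 20])
print(two_d_array + row)
```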

### How to reshape a `ndarray`?
Use the `reshape` function with the new shape as arguments. Note that the new shape must have the same number of elements as the original array.

<img src="https://numpy.org/doc/stable/_images/np_reshape.png" />

Image credit: https://numpy.org/doc/stable/user/absolute_beginners.html#transposing-and-reshaping-a-matrix

In [160]:
one_d_array.reshape(2, 2) - two_d_array

array([[-1, -1],
       [-1, -1]])

In [161]:
one_d_array.reshape(1, 4) - two_d_array.reshape(1, 4)

array([[-1, -1, -1, -1]])
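A small sketch of two details of `reshape`, using a made-up six-element array: passing `-1` for one dimension lets NumPy infer it from the total size, and asking for a shape with a different number of elements raises a `ValueError`.

```python
import numpy as np

six = np.arange(6)

print(six.reshape(2, 3))          # 2 rows, 3 columns
print(six.reshape(3, -1).shape)   # -1 is inferred as 2, giving (3, 2)

# 4 * 2 = 8 elements are needed, but only 6 are available
try:
    six.reshape(4, 2)
except ValueError as err:
    print("ValueError:", err)
```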

To transpose:

In [162]:
one_d_array.transpose()

array([0, 1, 2, 3])

In [163]:
# or use short T attribute on the array
one_d_array.T

array([0, 1, 2, 3])

In [164]:
one_d_array.T - two_d_array.reshape(1, 4)

array([[-1, -1, -1, -1]])
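Note that transposing a 1-D array returns it unchanged, because there is only one axis to permute; the subtraction in the last cell works through broadcasting rather than through the transpose. A small sketch of the difference between 1-D and 2-D transposes, with made-up variable names:

```python
import numpy as np

vector = np.arange(4)                 # shape (4,)
matrix = np.array([[1, 2], [3, 4]])   # shape (2, 2)

print(vector.T.shape)   # (4,): unchanged
print(matrix.T)         # rows and columns swapped

# To get a column vector, reshape explicitly
column = vector.reshape(-1, 1)
print(column.shape)     # (4, 1)
```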

To flatten any multidimensional array to a 1-d array:

In [165]:
two_d_array.flatten()

array([1, 2, 3, 4])

In [166]:
one_d_array - two_d_array.flatten()

array([-1, -1, -1, -1])

### How to stack and split ndarrays?

Multiple arrays can be stacked together along different axes:

In [167]:
another_two_d_array = two_d_array * 2
print(two_d_array)
print(another_two_d_array)

[[1 2]
 [3 4]]
[[2 4]
 [6 8]]


In [186]:
# stack row wise
vstacked = np.vstack((two_d_array, another_two_d_array))
vstacked

array([[1, 2],
       [3, 4],
       [2, 4],
       [6, 8]])

In [169]:
# stack column wise
hstacked = np.hstack((two_d_array, another_two_d_array))
hstacked

array([[1, 2, 2, 4],
       [3, 4, 6, 8]])

An array can be split into multiple arrays, along the specified axes:

In [170]:
vstacked

array([[1, 2],
       [3, 4],
       [2, 4],
       [6, 8]])

In [171]:
# split to 2 arrays row wise
vsplit1, vsplit2 = np.vsplit(vstacked, 2)
print(vsplit1)
print(vsplit2)

[[1 2]
 [3 4]]
[[2 4]
 [6 8]]


In [172]:
# split to 2 arrays column wise
hsplit1, hsplit2 = np.hsplit(vstacked, 2)
print(hsplit1)
print(hsplit2)

[[1]
 [3]
 [2]
 [6]]
[[2]
 [4]
 [4]
 [8]]
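For 2-D arrays, `vstack` and `hstack` can be seen as shorthands for `np.concatenate` along axis 0 and axis 1, and `vsplit` and `hsplit` as shorthands for `np.split` along those axes. A minimal sketch with made-up variable names:

```python
import numpy as np

first = np.array([[1, 2], [3, 4]])
second = first * 2

rows_joined = np.concatenate((first, second), axis=0)   # same result as np.vstack
cols_joined = np.concatenate((first, second), axis=1)   # same result as np.hstack

print(np.array_equal(rows_joined, np.vstack((first, second))))
print(np.array_equal(cols_joined, np.hstack((first, second))))

# np.split with an explicit axis mirrors vsplit / hsplit
top, bottom = np.split(rows_joined, 2, axis=0)
print(top)
print(bottom)
```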


## Exercises

Task: Calculate the dot product of vector `(1, 2, 3)` with itself.

The following exercises are meant to be completed step by step. Upon successful completion of all steps, the last cell should execute without any `AssertionError`.

Step 1. Represent vector `(1, 2, 3)` as a 1-d NumPy array of shape `(3,)` and store it in variable `a`:

In [173]:
# hint: use np.array or np.arange
# Your code here

# === Do not modify code below ===
assert a.shape == (3, )
assert a[0] == 1
assert a[1] == 2

AssertionError: ignored

Step 2. Implement the dot product calculation and store the result in variable `result`:

In [None]:
# hint: use transpose or reshape, then sum or np.sum
# Your code here

# === Do not modify code below ===
assert result == 14


## Universal functions

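In NumPy, universal functions (`ufunc`) are functions that operate elementwise on an array, producing an array as output.

Here are some commonly used NumPy functions. Some are ufuncs in the strict sense (for example `ceil`, `floor`, `maximum`, and `minimum`); others (for example `sum`, `sort`, and `where`) are ordinary functions that reduce, reorder, or select from an array: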

[`all`](https://numpy.org/doc/stable/reference/generated/numpy.all.html#numpy.all),
[`any`](https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any),
[`apply_along_axis`](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html#numpy.apply_along_axis),
[`argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html#numpy.argmax),
[`argmin`](https://numpy.org/doc/stable/reference/generated/numpy.argmin.html#numpy.argmin),
[`argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html#numpy.argsort),
[`average`](https://numpy.org/doc/stable/reference/generated/numpy.average.html#numpy.average),
[`bincount`](https://numpy.org/doc/stable/reference/generated/numpy.bincount.html#numpy.bincount),
[`ceil`](https://numpy.org/doc/stable/reference/generated/numpy.ceil.html#numpy.ceil),
[`clip`](https://numpy.org/doc/stable/reference/generated/numpy.clip.html#numpy.clip),
[`conj`](https://numpy.org/doc/stable/reference/generated/numpy.conj.html#numpy.conj),
[`corrcoef`](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html#numpy.corrcoef),
[`cov`](https://numpy.org/doc/stable/reference/generated/numpy.cov.html#numpy.cov),
[`cross`](https://numpy.org/doc/stable/reference/generated/numpy.cross.html#numpy.cross),
[`cumprod`](https://numpy.org/doc/stable/reference/generated/numpy.cumprod.html#numpy.cumprod),
[`cumsum`](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html#numpy.cumsum),
[`diff`](https://numpy.org/doc/stable/reference/generated/numpy.diff.html#numpy.diff),
[`floor`](https://numpy.org/doc/stable/reference/generated/numpy.floor.html#numpy.floor),
[`inner`](https://numpy.org/doc/stable/reference/generated/numpy.inner.html#numpy.inner),
[`invert`](https://numpy.org/doc/stable/reference/generated/numpy.invert.html#numpy.invert),
[`lexsort`](https://numpy.org/doc/stable/reference/generated/numpy.lexsort.html#numpy.lexsort),
[`max`](https://docs.python.org/3/library/functions.html#max),
[`maximum`](https://numpy.org/doc/stable/reference/generated/numpy.maximum.html#numpy.maximum),
[`mean`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html#numpy.mean),
[`median`](https://numpy.org/doc/stable/reference/generated/numpy.median.html#numpy.median),
[`min`](https://docs.python.org/3/library/functions.html#min),
[`minimum`](https://numpy.org/doc/stable/reference/generated/numpy.minimum.html#numpy.minimum),
[`nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.nonzero.html#numpy.nonzero),
[`outer`](https://numpy.org/doc/stable/reference/generated/numpy.outer.html#numpy.outer),
[`prod`](https://numpy.org/doc/stable/reference/generated/numpy.prod.html#numpy.prod),
[`re`](https://docs.python.org/3/library/re.html#module-re),
[`round`](https://docs.python.org/3/library/functions.html#round),
[`sort`](https://numpy.org/doc/stable/reference/generated/numpy.sort.html#numpy.sort),
[`std`](https://numpy.org/doc/stable/reference/generated/numpy.std.html#numpy.std),
[`sum`](https://numpy.org/doc/stable/reference/generated/numpy.sum.html#numpy.sum),
[`trace`](https://numpy.org/doc/stable/reference/generated/numpy.trace.html#numpy.trace),
[`transpose`](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html#numpy.transpose),
[`var`](https://numpy.org/doc/stable/reference/generated/numpy.var.html#numpy.var),
[`vdot`](https://numpy.org/doc/stable/reference/generated/numpy.vdot.html#numpy.vdot),
[`vectorize`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html#numpy.vectorize),
[`where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html#numpy.where)
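As a small illustration of elementwise behaviour, the sketch below applies a few functions to a made-up array and uses `isinstance(..., np.ufunc)` to check whether a given NumPy function is a ufunc in the strict sense.

```python
import numpy as np

values = np.array([1.2, 2.5, 3.7])

# Each function is applied to every element and returns an array of the same shape
print(np.floor(values))          # [1. 2. 3.]
print(np.ceil(values))           # [2. 3. 4.]
print(np.maximum(values, 2.0))   # [2.  2.5 3.7]

# Checking whether a function is a ufunc
print(isinstance(np.floor, np.ufunc))   # True
print(isinstance(np.sort, np.ufunc))    # False
```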

Most of these functions accept not only NumPy arrays (`ndarray`) but also Python lists, arrays, or any object convertible to a NumPy array:

In [174]:
# cross product of two vectors
np.cross([3, 0, 2], [-1, 4, 2])

array([-8, -8, 12])

More usage examples:

In [175]:
# reverse an array
np.flip(one_d_array)

array([3, 2, 1, 0])

In [176]:
# sort an array, default ascending
np.sort(np.random.random(5))

array([0.16869387, 0.34193238, 0.79661081, 0.83152845, 0.9564924 ])

In [177]:
# get the index of the max element
one_to_nine = np.arange(1, 10)
np.argmax(one_to_nine)

8

In [178]:
# look up the element by index
one_to_nine[np.argmax(one_to_nine)]

9

In [179]:
# check if any of the elements evaluates to true
np.any([False, 0])

False

In [180]:
# check if all of the elements evaluate to true (note: nan counts as true)
np.all([True, 3, np.nan])

True

In [181]:
# pair-wise element comparison
np.greater([1, 2, 3], [1, 2, 1])

array([False, False,  True])

In [182]:
# Do any of the pair-wise comparisons evaluate to true?
np.any(np.greater([1, 2, 3], [1, 2, 1]))

True

In [183]:
# select the indexes of elements that evaluate to true
a = np.array([1, 2, 3, 4])
np.where(a % 2) # find odd numbers

(array([0, 2]),)

In [184]:
# keep the elements that evaluate to true, replace the others with nan
np.where(a % 2, a, np.nan)

array([ 1., nan,  3., nan])
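Another common way to select elements, sketched here as an alternative to `np.where`, is to index the array with a boolean mask. Only the elements whose mask value is `True` are returned. The name `is_odd` is made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Elementwise comparison produces a boolean array of the same shape
is_odd = a % 2 == 1
print(is_odd)      # [ True False  True False]

# Indexing with the mask keeps only the elements marked True
print(a[is_odd])   # [1 3]
```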

## Exercise

Task: Calculate the dot product of vector `(1, 2, 3)` with itself **using one of NumPy's universal functions**. Store the result in variable `dot_result`:

In [185]:
# Your code here

# === Do not modify code below ===
assert dot_result == 14

NameError: ignored

# Pandas

Datasets we often work with come in tabular form – think Excel/Google spreadsheets – and with mixed data types, some numerical, some categorical, some textual. Before we can perform fancy mathematical operations and meaningful analysis on such data using NumPy and other Python libraries, we usually need to understand and preprocess the data first. This process involves operations such as cleaning, reshaping, filtering, and subsetting. This is where Pandas comes in.

Just as `ndarray` is the basic building block of NumPy, `Series` and `DataFrame` are the basic building blocks of Pandas.

## Series

`Series` is a one-dimensional **labeled** array capable of holding **any data type**. The axis labels are collectively referred to as the index. 

### Can I create Series from ndarrays?

In [1]:
import pandas as pd
import numpy as np

# specify the labels through the `index` argument
labelled_series = pd.Series(np.arange(1, 4), index=["r1", "r2", "r3"])
labelled_series

r1    1
r2    2
r3    3
dtype: int64

In [11]:
# by default, integer sequence starting from 0 is used as labels
default_series = pd.Series(np.arange(1, 4))
default_series

0    1
1    2
2    3
dtype: int64

In [10]:
# Individual elements can be looked up using labels
labelled_series["r2"]

2

In [13]:
default_series[2]

3
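Square-bracket lookup on a `Series` goes by label. When labels and positions could be confused, the explicit accessors `.loc` (by label) and `.iloc` (by position) remove the ambiguity. A small sketch using the same values as above, with a made-up variable name:

```python
import numpy as np
import pandas as pd

labelled = pd.Series(np.arange(1, 4), index=["r1", "r2", "r3"])

print(labelled.loc["r2"])                 # by label -> 2
print(labelled.iloc[0])                   # by position -> 1
print(labelled.loc["r1":"r2"].tolist())   # label slice, end included -> [1, 2]
```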

### Can I recover the ndarray from a series?

In [14]:
labelled_series.to_numpy()

array([1, 2, 3])

In [15]:
# alternatively
labelled_series.values

array([1, 2, 3])

## DataFrame

`DataFrame` is a 2-dimensional **labeled** data structure with **columns of potentially different types**. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

### Can I create a DataFrame from Series?

In [27]:
df_from_series = pd.DataFrame({"c1": labelled_series, "c2": labelled_series * 2})
df_from_series

|    | c1 | c2 |
|----|----|----|
| r1 | 1  | 2  |
| r2 | 2  | 4  |
| r3 | 3  | 6  |


In [28]:
df_from_series.columns

Index(['c1', 'c2'], dtype='object')

In [29]:
df_from_series.index

Index(['r1', 'r2', 'r3'], dtype='object')

### Can I create a DataFrame from an ndarray?

In [20]:
raw_values = np.arange(1, 7).reshape(2, -1)
raw_values

array([[1, 2, 3],
       [4, 5, 6]])

In [31]:
df_from_ndarray = pd.DataFrame(raw_values, columns=['A', 'B', 'C'])
df_from_ndarray

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |


To recover the raw values from a `DataFrame`:

In [32]:
df_from_ndarray.values

array([[1, 2, 3],
       [4, 5, 6]])

### Can I create a DataFrame from lists?

In [26]:
pd.DataFrame({"ID": [1, 2], "Name": ["dog", "cat"], "Needs Walking": [True, False]})

|   | ID | Name | Needs Walking |
|---|----|------|---------------|
| 0 | 1  | dog  | True          |
| 1 | 2  | cat  | False         |


### Can I create a DataFrame from a file?

In [39]:
# TODO: update url after repo is public
file_url = 'https://raw.githubusercontent.com/weilu/colab-sandbox/main/data/Singapore_Annual_New_Car_Registrations_by_make_type.csv'
singapore_cars = pd.read_csv(file_url)
singapore_cars

|      | year | make | fuel | type | number |
|------|------|------|------|------|--------|
| 0    | 2015 | ALFA ROMEO | Petrol | Hatchback | 29.0 |
| 1    | 2015 | ASTON MARTIN | Petrol | Hatchback | 3.0 |
| 2    | 2015 | AUDI | Petrol | Hatchback | 262.0 |
| 3    | 2015 | AUSTIN | Petrol | Hatchback | 1.0 |
| 4    | 2015 | B.M.W. | Petrol | Hatchback | 408.0 |
| ...  | ...  | ... | ... | ... | ... |
| 2805 | 2021 | TOYOTA | Others | Coupe/ Convertible | NaN |
| 2806 | 2021 | VOLKSWAGEN | Petrol | Coupe/ Convertible | 2.0 |
| 2807 | 2021 | VOLKSWAGEN | Others | Coupe/ Convertible | 1.0 |
| 2808 | 2021 | VOLVO | Petrol | Coupe/ Convertible | 2.0 |


To save a `DataFrame` as a csv file:

In [37]:
singapore_cars.to_csv('singapore_cars.csv')

In Colab, navigate to the folder icon in the left pane to find and download the exported csv file. 
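By default `to_csv` also writes the index as an extra, unnamed first column. If the index carries no information, passing `index=False` leaves it out. The sketch below uses a small made-up `DataFrame` and an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Made-up data for illustration
pets = pd.DataFrame({"ID": [1, 2], "Name": ["dog", "cat"]})

# Write to an in-memory text buffer without the index column
buffer = io.StringIO()
pets.to_csv(buffer, index=False)

# Read it back: only the original columns are present
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())   # ['ID', 'Name']
```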

### Inspecting the data

Now let's perform some basic inspection to understand our dataset.

#### How many rows and columns?

In [40]:
singapore_cars.shape

(2810, 5)

In [49]:
nrow, ncol = singapore_cars.shape
print(f'nrow={nrow}, ncol={ncol}')

nrow=2810, ncol=5


#### What are the column names?

In [41]:
singapore_cars.columns

Index(['year', 'make', 'fuel', 'type', 'number'], dtype='object')

#### What does the data look like?

From the top:

In [43]:
singapore_cars.head()

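|   | year | make | fuel | type | number |
|---|------|------|------|------|--------|
| 0 | 2015 | ALFA ROMEO | Petrol | Hatchback | 29.0 |
| 1 | 2015 | ASTON MARTIN | Petrol | Hatchback | 3.0 |
| 2 | 2015 | AUDI | Petrol | Hatchback | 262.0 |
| 3 | 2015 | AUSTIN | Petrol | Hatchback | 1.0 |
| 4 | 2015 | B.M.W. | Petrol | Hatchback | 408.0 |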


From the bottom:

In [46]:
# Optionally specify the exact number of rows you want
singapore_cars.tail(3)

|      | year | make | fuel | type | number |
|------|------|------|------|------|--------|
| 2807 | 2021 | VOLKSWAGEN | Others | Coupe/ Convertible | 1.0 |
| 2808 | 2021 | VOLVO | Petrol | Coupe/ Convertible | 2.0 |
| 2809 | 2021 | VOLVO | Others | Coupe/ Convertible | NaN |


#### What type of data does each column currently hold and how many missing values?


In [47]:
singapore_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2810 entries, 0 to 2809
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    2810 non-null   int64  
 1   make    2810 non-null   object 
 2   fuel    2810 non-null   object 
 3   type    2810 non-null   object 
 4   number  1432 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 109.9+ KB
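The `Non-Null Count` column of `info()` shows how many values are present; the number of missing values can also be counted directly. A minimal sketch on a small made-up `DataFrame`, using `isna()` to mark missing entries:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in `number`
cars = pd.DataFrame({"make": ["AUDI", "VOLVO", "TOYOTA"],
                     "number": [262.0, np.nan, 27.0]})

# Missing values per column
print(cars.isna().sum())

# Missing and present values in a single column
print(cars["number"].isna().sum())    # 1
print(cars["number"].notna().sum())   # 2
```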


### Subsetting the data

A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### How to select columns?

To select a single column (returns a `Series`):

In [51]:
singapore_cars.make

0         ALFA ROMEO
1       ASTON MARTIN
2               AUDI
3             AUSTIN
4             B.M.W.
            ...     
2805          TOYOTA
2806      VOLKSWAGEN
2807      VOLKSWAGEN
2808           VOLVO
2809           VOLVO
Name: make, Length: 2810, dtype: object

In [52]:
# Alternatively
singapore_cars['make']

0         ALFA ROMEO
1       ASTON MARTIN
2               AUDI
3             AUSTIN
4             B.M.W.
            ...     
2805          TOYOTA
2806      VOLKSWAGEN
2807      VOLKSWAGEN
2808           VOLVO
2809           VOLVO
Name: make, Length: 2810, dtype: object

The dictionary-style accessor (square brackets) is useful when the column name is not a valid Python identifier (for example, when it contains a space), or when selecting multiple columns:

In [55]:
singapore_cars[['make', 'fuel', 'type']]

|      | make | fuel | type |
|------|------|------|------|
| 0    | ALFA ROMEO | Petrol | Hatchback |
| 1    | ASTON MARTIN | Petrol | Hatchback |
| 2    | AUDI | Petrol | Hatchback |
| 3    | AUSTIN | Petrol | Hatchback |
| 4    | B.M.W. | Petrol | Hatchback |
| ...  | ... | ... | ... |
| 2805 | TOYOTA | Others | Coupe/ Convertible |
| 2806 | VOLKSWAGEN | Petrol | Coupe/ Convertible |
| 2807 | VOLKSWAGEN | Others | Coupe/ Convertible |
| 2808 | VOLVO | Petrol | Coupe/ Convertible |


#### How to select rows?
If the index consists of the default sequential integers, slice the rows as you would slice an array:

In [56]:
singapore_cars[100:200]

|     | year | make | fuel | type | number |
|-----|------|------|------|------|--------|
| 100 | 2015 | MERCEDES BENZ | Petrol | Multi-purpose Vehicle/Station-wagon | 61.0 |
| 101 | 2015 | MERCEDES BENZ | Diesel | Multi-purpose Vehicle/Station-wagon | 12.0 |
| 102 | 2015 | NISSAN | Petrol | Multi-purpose Vehicle/Station-wagon | 103.0 |
| 103 | 2015 | NISSAN | Others | Multi-purpose Vehicle/Station-wagon | 21.0 |
| 104 | 2015 | OPEL | Petrol | Multi-purpose Vehicle/Station-wagon | 25.0 |
| ... | ...  | ... | ... | ... | ... |
| 195 | 2016 | B.M.W. | Diesel | Hatchback | 949.0 |
| 196 | 2016 | B.M.W. | Others | Hatchback | 6.0 |
| 197 | 2016 | BYD | Others | Hatchback | 7.0 |
| 198 | 2016 | CHERY | Petrol | Hatchback | 6.0 |


#### How to select by columns and rows (aka indexing)?

Sometimes we want a specific "section" of the `DataFrame`, say the first 3 rows with just the middle 3 columns:

In [66]:
singapore_cars.iloc[0:3, 1:4]

|   | make | fuel | type |
|---|------|------|------|
| 0 | ALFA ROMEO | Petrol | Hatchback |
| 1 | ASTON MARTIN | Petrol | Hatchback |
| 2 | AUDI | Petrol | Hatchback |


Above, we used `iloc` to select by row and column positions; use `loc` to select by row and column names/labels:

In [63]:
singapore_cars.loc[10:15, ['year', 'number']]

|    | year | number |
|----|------|--------|
| 10 | 2015 | 10.0   |
| 11 | 2015 | 4.0    |
| 12 | 2015 | 1.0    |
| 13 | 2015 | 2.0    |
| 14 | 2015 | 155.0  |
| 15 | 2015 | 1068.0 |


In this case the rows' positions happen to be the same as the rows' names/labels. 

**Important**: when selecting by position (`iloc`) only the start point is included, the end point is excluded. When selecting by name/label (`loc`), both start and end points are included.
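The difference in end-point handling is easiest to see on a frame where positions and labels coincide. A minimal sketch on a small made-up `DataFrame` with the default integer index:

```python
import pandas as pd

# Made-up data; the default index gives labels 0..4 that equal the positions
frame = pd.DataFrame({"year": [2015, 2016, 2017, 2018, 2019],
                      "number": [10.0, 4.0, 1.0, 2.0, 155.0]})

by_position = frame.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = frame.loc[0:2]       # labels 0, 1 and 2; label 2 is included

print(len(by_position))   # 2
print(len(by_label))      # 3
```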

#### How to filter rows based on conditions?

This is useful when we want to zoom in to explore a subset of the data satisfying some condition(s). For example, to get only 2021 data on petrol cars:

In [65]:
singapore_cars[(singapore_cars.year == 2021) & (singapore_cars.fuel == 'Petrol')]

|      | year | make | fuel | type | number |
|------|------|------|------|------|--------|
| 2375 | 2021 | ALFA ROMEO | Petrol | Hatchback | NaN |
| 2377 | 2021 | ALPINE | Petrol | Hatchback | NaN |
| 2378 | 2021 | ASTON MARTIN | Petrol | Hatchback | NaN |
| 2379 | 2021 | AUDI | Petrol | Hatchback | 34.0 |
| 2381 | 2021 | AUSTIN | Petrol | Hatchback | 1.0 |
| ...  | ...  | ... | ... | ... | ... |
| 2798 | 2021 | SUBARU | Petrol | Coupe/ Convertible | NaN |
| 2800 | 2021 | SUZUKI | Petrol | Coupe/ Convertible | NaN |
| 2803 | 2021 | TOYOTA | Petrol | Coupe/ Convertible | 27.0 |
| 2806 | 2021 | VOLKSWAGEN | Petrol | Coupe/ Convertible | 2.0 |


**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
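The parentheses matter because `&` and `|` bind more tightly than `==`. The sketch below, on a small made-up `DataFrame`, shows the three bitwise operators in use and what happens when the logical operator `and` is used instead.

```python
import pandas as pd

# Made-up data for illustration
cars = pd.DataFrame({"year": [2020, 2021, 2021],
                     "fuel": ["Petrol", "Petrol", "Diesel"]})

is_2021 = cars.year == 2021
is_petrol = cars.fuel == "Petrol"

print(cars[is_2021 & is_petrol])   # 2021 AND petrol: 1 row
print(cars[is_2021 | is_petrol])   # 2021 OR petrol: 3 rows
print(cars[~is_petrol])            # NOT petrol: 1 row

# `and` asks for a single True/False value for the whole Series, which is ambiguous
try:
    cars[is_2021 and is_petrol]
except ValueError as err:
    print("ValueError:", err)
```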

This mode of row selection/filtering is called "boolean indexing" because the selection is based on the specified condition evaluating to `True` or `False` (aka boolean). Rows that satisfy the condition are selected; those that do not are filtered out. You can see that the condition for the above example gets evaluated for each row as `True`/`False`:

In [67]:
(singapore_cars.year == 2021) & (singapore_cars.fuel == 'Petrol')

0       False
1       False
2       False
3       False
4       False
        ...  
2805    False
2806     True
2807    False
2808     True
2809    False
Length: 2810, dtype: bool

Think of `True` as 1 and `False` as 0: summing along the resulting `Series` gives us the number of rows that satisfy our condition, which matches our boolean indexing results above:

In [68]:
sum((singapore_cars.year == 2021) & (singapore_cars.fuel == 'Petrol'))

235
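The same count can be obtained in a few equivalent ways. A minimal sketch on a small made-up `DataFrame`, comparing the built-in `sum`, the `Series.sum` method, and the length of the filtered frame:

```python
import pandas as pd

# Made-up data for illustration
cars = pd.DataFrame({"year": [2020, 2021, 2021, 2021],
                     "fuel": ["Petrol", "Petrol", "Diesel", "Petrol"]})

mask = (cars.year == 2021) & (cars.fuel == "Petrol")

print(sum(mask))         # built-in sum over the boolean values
print(mask.sum())        # Series method, same result
print(len(cars[mask]))   # number of rows kept by boolean indexing
```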