## Lesson 5 - Python Functions, Pandas

This lesson introduces functions, Pandas, and Matplotlib. Pandas uses DataFrames (tables, much like R data frames) and Series (the columns of a DataFrame), and supports powerful SQL-like queries.



### Table of Contents

* [Functions](#functions)
* [Pandas](#pandas)

<a id="functions"></a>
### Functions

A function is a block of code which only runs when it is called.

You can pass data, known as parameters, into a function.

A function can return data as a result.

$f(x) = print(x)$

In [1]:
def call_me(s):
    print(s)

In [2]:
call_me("Yo Man!")

Yo Man!


In [3]:
# call function multiple times
call_me("David")
call_me("Hippo")
call_me("Emily")

David
Hippo
Emily


#### Return Values

To let a function return a value, use the return statement:

$f(x) = x$

In [4]:
# modify call_me function, let function return s
def call_me(s):
    return s

In [5]:
res = call_me("David")
res

'David'

In [6]:
def get_ntd_dollar(usd):
    return 32 * usd

print(get_ntd_dollar(100))

3200


In [7]:
# note: this definition shadows Python's built-in sum() for the rest of the session
def sum(a, b):
    s = a + b
    return s

sum(1, 2)

3

In [8]:
# once a function is defined, we can reuse it anywhere in the Python code
sum(1.3, 2.9)

4.2

Additionally, you can define functions to take `*x` and `**y` arguments. This allows a function to accept any number of positional and/or named arguments that aren't specifically named in the declaration. 

Example with `*` (positional arguments):

In [9]:
def sum(*values):
    s = 0
    for v in values:
        s = s + v
    return s

sum(1, 2, 3, 4, 5)

15
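The text above also mentions `**` for named arguments, so here is a minimal sketch of that case. The function name `total` and the `scale` option are hypothetical, chosen only for illustration: positional arguments arrive as a tuple, named arguments arrive as a dictionary.

```python
def total(*values, **options):
    # values is a tuple of all positional arguments
    # options is a dict of all named arguments
    scale = options.get("scale", 1)  # "scale" is a made-up option for this sketch
    s = 0
    for v in values:
        s = s + v
    return s * scale

plain = total(1, 2, 3)             # no named arguments: scale defaults to 1
scaled = total(1, 2, 3, scale=10)  # the named argument lands in options
print(plain, scaled)
```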

### LAB

<a id="pandas"></a>

### Pandas

#### What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

#### Library features

* DataFrame object for data manipulation with integrated indexing
* Tools for reading and writing data between in-memory data structures and different file formats
* Data alignment and integrated handling of missing data
* Reshaping and pivoting of data sets
* Label-based slicing, fancy indexing, and subsetting of large data sets
* Data structure column insertion and deletion
* Group-by engine allowing split-apply-combine operations on data sets
* Data set merging and joining
* Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure
* Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging

The library is highly optimized for performance, with critical code paths written in Cython or C.

#### Install packages

Install pandas and matplotlib using conda if you haven't already. If you're not sure whether they are installed, type `conda list` at a terminal prompt.

```
conda install pandas
conda install matplotlib
```

#### Import modules

In [10]:
import pandas
pandas.__version__

'0.25.3'

Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:

In [11]:
import pandas as pd

### Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

We will start our code sessions with the standard NumPy and Pandas imports:

In [12]:
import numpy as np
import pandas as pd

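### Basic Pandas data structures

Pandas has two basic data structures:

* <b style="color:red;">DataFrame</b>: can be thought of as a table.
* <b style="color:red;">Series</b>: a single row or column of that table; essentially the list or array we have used before.

A DataFrame has an `index` (the row labels) and `columns` (the column labels).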

#### DataFrame

![Structure of a DataFrame](images/indexcol.png)

#### The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [13]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [14]:
data = pd.Series(["David", "Emily", "Sean", "Kendy"])
data

0    David
1    Emily
2     Sean
3    Kendy
dtype: object

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [15]:
data.values

array(['David', 'Emily', 'Sean', 'Kendy'], dtype=object)

The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [16]:
data.index

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [17]:
data[1]

'Emily'

In [18]:
data[1:3]

1    Emily
2     Sean
dtype: object

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.

#### ``Series`` as generalized NumPy array

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [19]:
data = pd.Series(["David", "Emily", "Sean", "Kendy"],
                 index=['2019', '2020', '2021', '2022'])
data

2019    David
2020    Emily
2021     Sean
2022    Kendy
dtype: object

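And the item access works as expected:

In [20]:
data["2019"]

'David'

Asking for a label that is not in the index raises a ``KeyError``, which is why the next cell is commented out:

In [21]:
# data['2023']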

We can even use non-contiguous or non-sequential indices:

In [22]:
data = pd.Series(["David", "Emily", "Sean", "Kendy"],
                 index=[2, 5, 3, 7])
data

2    David
5    Emily
3     Sean
7    Kendy
dtype: object

In [23]:
data[3]

'Sean'

### LAB

### Series as specialized dictionary

In this way, you can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.

The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:

In [24]:
# population
population_dict = {'台北': 38332521,
                   '台中': 26448193,
                   '高雄': 19651127,
                   '基隆': 19552860,
                   '台南': 12882135}
population = pd.Series(population_dict)
population

台北    38332521
台中    26448193
高雄    19651127
基隆    19552860
台南    12882135
dtype: int64

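By default, a ``Series`` will be created where the index is drawn from the dictionary keys. Current pandas versions keep the keys in insertion order, as the output above shows (older releases sorted them).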
From here, typical dictionary-style item access can be performed:

In [25]:
population['台北']

38332521

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [26]:
# label-based slice: starts at index '台北' and ends at index '台中' (both included)
population['台北':'台中']

台北    38332521
台中    26448193
dtype: int64

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [27]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [28]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys in insertion order:

In [29]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In each case, the index can be explicitly set if a different result is preferred:

In [30]:
data = pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
data

3    c
2    a
dtype: object

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
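The reverse case is also worth a quick look. In this small sketch (the variable names are made up), the requested index contains a label that is not a key of the dictionary, and pandas marks the missing value as ``NaN`` instead of raising an error:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# label 4 is not a key of the dictionary, so its value becomes NaN
s = pd.Series(letters, index=[3, 4])
print(s)
```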

### The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

#### DataFrame as a generalized NumPy array
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five cities discussed in the previous section:

In [31]:
# area
area_dict = {'台北': 423967, '台中': 695662, '高雄': 141297,
             '基隆': 170312, '台南': 149995}
area = pd.Series(area_dict)
area

台北    423967
台中    695662
高雄    141297
基隆    170312
台南    149995
dtype: int64

Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [32]:
states = pd.DataFrame({'人口': population,
                       '面積': area})
states

Unnamed: 0,人口,面積
台北,38332521,423967
台中,26448193,695662
高雄,19651127,141297
基隆,19552860,170312
台南,12882135,149995


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [33]:
states.index

Index(['台北', '台中', '高雄', '基隆', '台南'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [34]:
states.columns

Index(['人口', '面積'], dtype='object')

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
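As a small illustration of what "aligned" means, the sketch below builds a ``DataFrame`` from two ``Series`` whose labels appear in a different order. The names `height` and `weight` and their values are invented for this example. Pandas matches the values by label, not by position:

```python
import pandas as pd

# same labels, different order (hypothetical values)
height = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
weight = pd.Series({'c': 30.0, 'a': 10.0, 'b': 20.0})

df = pd.DataFrame({'height': height, 'weight': weight})

# each row holds the values that share a label
print(df.loc['a'])
print(df.loc['c'])
```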

### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, indexing with ``'人口'`` (population) or ``'面積'`` (area) returns the ``Series`` object containing the values we saw earlier:

In [35]:
# pandas.core.series.Series
states['人口']

台北    38332521
台中    26448193
高雄    19651127
基隆    19552860
台南    12882135
Name: 人口, dtype: int64

In [36]:
# pandas.core.series.Series
states['面積']

台北    423967
台中    695662
高雄    141297
基隆    170312
台南    149995
Name: 面積, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
We'll explore more flexible means of indexing ``DataFrame``s in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
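The row-versus-column difference can be checked directly. This is a minimal sketch with a made-up 3x2 array; the column names `col0` and `col1` are placeholders:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]       # NumPy: a single index selects the first row
first_col = df['col0']   # DataFrame: a single key selects a column

print(first_row)
print(first_col)
```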

### LAB

### Constructing DataFrame objects

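A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A single-column ``DataFrame`` can be constructed from a single ``Series``: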

In [37]:
pd.DataFrame(population, columns=['人口1'])

Unnamed: 0,人口1
台北,38332521
台中,26448193
高雄,19651127
基隆,19552860
台南,12882135


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll write out a small list by hand to create some data:

In [38]:
lst = [{"興趣":"打球", "年齡":"21", "性別":"男生"},
       {"興趣":"電影", "年齡":"23", "性別":"女生", "地區":"台北"}]
pd.DataFrame(lst)

Unnamed: 0,興趣,年齡,性別,地區
0,打球,21,男生,
1,電影,23,女生,台北


Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [39]:
lst = [{"年齡":"21", "性別":"男生"},
       {"興趣":"電影", "性別":"女生"}]
pd.DataFrame(lst)

Unnamed: 0,年齡,性別,興趣
0,21,男生,
1,,女生,電影


#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [40]:
# population and area are both Series
pd.DataFrame({'人口': population,
              '面積': area})

Unnamed: 0,人口,面積
台北,38332521,423967
台中,26448193,695662
高雄,19651127,141297
基隆,19552860,170312
台南,12882135,149995


#### From a two-dimensional array (or NumPy array)

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [41]:
person1 = ["男生","1980","籃球"]
person2 = ["女生","1998","羽球"]
person3 = ["男生","1991","足球"]

person4 = ["女生","1997","桌球"]
person5 = ["女生","2000","板球"]
person6 = ["男生","1982","游泳"]

classroom = []
group1 =[]
group1.append(person1)
group1.append(person2)
group1.append(person3)

group2 = []
group2.append(person4)
group2.append(person5)
group2.append(person6)
# append list of list
classroom.append(group1)
classroom.append(group2)

df1 = pd.DataFrame(group1,
                   columns=['性別', '生日', '專長'],
                   index=['a', 'b', 'c'])

df2 = pd.DataFrame(group2,
                   columns=['性別', '生日', '專長'],
                   index=['a', 'b', 'c'])

In [42]:
df1

Unnamed: 0,性別,生日,專長
a,男生,1980,籃球
b,女生,1998,羽球
c,男生,1991,足球


In [43]:
df2

Unnamed: 0,性別,生日,專長
a,女生,1997,桌球
b,女生,2000,板球
c,男生,1982,游泳
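The cells above always pass both `columns` and `index`. As a minimal sketch of the case where they are omitted, assuming a small made-up NumPy array, both the row index and the column labels default to integers:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

# no columns= and no index=: both default to 0, 1, 2, ...
df_default = pd.DataFrame(values)
print(df_default)
```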


In [44]:
df = df1 + df2

In [45]:
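# This is not what we want: + combines the cells element-wise (string concatenation) instead of stacking the rows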
df

Unnamed: 0,性別,生日,專長
a,男生女生,19801997,籃球桌球
b,女生女生,19982000,羽球板球
c,男生男生,19911982,足球游泳


In [46]:
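# The index now has the duplicate labels a, b, c, so using DataFrame.append() is also problematic
# (DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([df1, df2]))
df = df1.append(df2)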
df

Unnamed: 0,性別,生日,專長
a,男生,1980,籃球
b,女生,1998,羽球
c,男生,1991,足球
a,女生,1997,桌球
b,女生,2000,板球
c,男生,1982,游泳


In [47]:
# So when building the DataFrames, keep the index labels unique before combining the two
df1 = pd.DataFrame(group1,
                   columns=['性別', '生日', '專長'],
                   index=['a', 'b', 'c'])

df2 = pd.DataFrame(group2,
                   columns=['性別', '生日', '專長'],
                   index=['d', 'e', 'f'])

In [48]:
# Use append to add the rows of df2 (pandas >= 2.0 no longer has append; use pd.concat there)
df = df1.append(df2)
df

Unnamed: 0,性別,生日,專長
a,男生,1980,籃球
b,女生,1998,羽球
c,男生,1991,足球
d,女生,1997,桌球
e,女生,2000,板球
f,男生,1982,游泳


In [49]:
# Rather than growing a DataFrame by repeated appends, prefer pd.concat() to combine two DataFrames
df3 = pd.concat([df1, df2])
df3

Unnamed: 0,性別,生日,專長
a,男生,1980,籃球
b,女生,1998,羽球
c,男生,1991,足球
d,女生,1997,桌球
e,女生,2000,板球
f,男生,1982,游泳
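When the inputs do share labels, `pd.concat` has two options that help. The sketch below uses two made-up frames, `left` and `right`: `ignore_index=True` renumbers the rows, and `verify_integrity=True` raises a `ValueError` when labels overlap.

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'x': [3, 4]}, index=['a', 'b'])   # same labels on purpose

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([left, right], ignore_index=True)

# verify_integrity=True refuses to build a result with duplicate labels
try:
    pd.concat([left, right], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(renumbered)
print(overlap_detected)
```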


### LAB

## Data Indexing and Selection

### Data Selection in Series

As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [50]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [51]:
# string 'b' is Index
data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [52]:
'a' in data

True

In [53]:
# list the index labels that currently exist
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [54]:
# convert series to Lists of Tuples
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [55]:
# Add new index and value to Series
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [56]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [57]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [58]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [59]:
# fancy indexing, by []
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.
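A quick way to see the two conventions side by side is to compare the labels each slice returns. This is only a sketch on a fresh ``Series`` named `s`, so it does not depend on the cells above:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s['a':'c']     # explicit labels: the end label 'c' is included
by_position = s[0:2]      # implicit positions: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```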

### Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [60]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [61]:
# explicit index when indexing
data[1]

'a'

In [62]:
# slicing uses implicit positions: 1:10 skips position 0, and the stop of 10 is clipped to the 3 available elements
data[1:10]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [63]:
data.loc[1]

'a'

In [64]:
# only one element, the one with index label 1, falls within the label range 1:2
data.loc[1:2]

1    a
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [65]:
print(data.iloc[0])
print(data.iloc[1])
print(data.iloc[2])

a
b
c


In [66]:
data.shape[0]

3

In [67]:
data.index

Int64Index([1, 3, 5], dtype='int64')

In [68]:
# iloc[3] doesn't exist, this cell will cause exception
print(data.iloc[3])

IndexError: single positional indexer is out-of-bounds

In [69]:
# does not include iloc[0]
data.iloc[1:9]

3    b
5    c
dtype: object

A third indexing attribute, ``ix``, was a hybrid of the two, and for ``Series`` objects was equivalent to standard ``[]``-based indexing.
It was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use ``loc`` and ``iloc`` instead.

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using them both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.
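To make the recommendation concrete, here is a short sketch on an integer-indexed ``Series`` like the one used above (rebuilt here as `s` so the block stands alone). The same numbers mean different things to ``loc`` and ``iloc``:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = s.loc[3]        # the element whose label is 3
position_item = s.iloc[1]    # the element at position 1

label_slice = s.loc[1:3]     # labels 1 through 3, end label included
position_slice = s.iloc[1:3] # positions 1 and 2, end position excluded

print(label_item, position_item)
print(label_slice.tolist(), position_slice.tolist())
```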

### Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

#### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of cities:

In [70]:
area = pd.Series({'台北': 423967, '基隆': 695662,
                  '台中': 141297, '台南': 170312,
                  '高雄': 149995})
pop = pd.Series({'台北': 38332521, '基隆': 26448193,
                 '台中': 19651127, '台南': 19552860,
                 '高雄': 12882135})
data = pd.DataFrame({'面積':area, '人口':pop})
data

Unnamed: 0,面積,人口
台北,423967,38332521
基隆,695662,26448193
台中,141297,19651127
台南,170312,19552860
高雄,149995,12882135


In [71]:
# Add an area entry with index '宜蘭', but with no population for that city
area = pd.Series({'台北': 423967, '基隆': 695662,
                  '台中': 141297, '台南': 170312,
                  '高雄': 149995,
                  '宜蘭': 323000})
pop = pd.Series({'台北': 38332521, '基隆': 26448193,
                 '台中': 19651127, '台南': 19552860,
                 '高雄': 12882135})
data = pd.DataFrame({'面積':area, '人口':pop})
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [72]:
data['面積']

台中    141297
台北    423967
台南    170312
基隆    695662
宜蘭    323000
高雄    149995
Name: 面積, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [73]:
data.面積

台中    141297
台北    423967
台南    170312
基隆    695662
宜蘭    323000
高雄    149995
Name: 面積, dtype: int64

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [74]:
data.面積 is data['面積']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.
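For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` always points to this method and never to a column. Our population column is named ``'人口'``, so attribute access to it still works, but had it been named ``"pop"``, ``data.pop`` would return the method rather than the column: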

In [75]:
data.人口

台中    19651127.0
台北    38332521.0
台南    19552860.0
基隆    26448193.0
宜蘭           NaN
高雄    12882135.0
Name: 人口, dtype: float64

In [76]:
data['人口']

台中    19651127.0
台北    38332521.0
台南    19552860.0
基隆    26448193.0
宜蘭           NaN
高雄    12882135.0
Name: 人口, dtype: float64

In [77]:
data.pop is data['人口']

False

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
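The pitfall can be sketched on a tiny made-up frame (`df`, with hypothetical column names `a`, `b`, `c`). Assigning through an attribute stores a plain Python attribute on the object and does not create a column; pandas issues a warning about it, which is silenced here to keep the output short:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    df.b = [4, 5, 6]                  # sets an attribute, NOT a column

df['c'] = [7, 8, 9]                   # dictionary-style assignment adds a column

print(list(df.columns))
```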

Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [78]:
data['密度'] = data['人口'] / data['面積']
data

Unnamed: 0,面積,人口,密度
台中,141297,19651127.0,139.076746
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121
基隆,695662,26448193.0,38.01874
宜蘭,323000,,
高雄,149995,12882135.0,85.883763


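When it comes to indexing, however, a ``DataFrame`` does not behave like a two-dimensional array: passing a single index to a NumPy array accesses a row, whereas passing a single "index" to a ``DataFrame`` accesses a column: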

In [79]:
data['面積']

台中    141297
台北    423967
台南    170312
基隆    695662
宜蘭    323000
高雄    149995
Name: 面積, dtype: int64

Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc`` and ``iloc`` indexers mentioned earlier.
Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [80]:
# take row positions 0:3 and column positions 0:2
data.iloc[:3, :2]

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0


Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [81]:
# the first slot selects rows by index label, the second slot selects columns by name
data.loc[:'台北', '人口':'密度']

Unnamed: 0,人口,密度
台中,19651127.0,139.076746
台北,38332521.0,90.413926


Keep in mind that the ``ix`` indexer (removed in pandas 1.0) was subject to the same potential sources of confusion with integer indices as discussed for integer-indexed ``Series`` objects; prefer ``loc`` and ``iloc``.

Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:

In [82]:
data.loc[data.密度 > 100, ['人口', '密度']]

Unnamed: 0,人口,密度
台中,19651127.0,139.076746
台南,19552860.0,114.806121


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [83]:
# update the value at row position 0, column position 2 (密度)
data.iloc[0, 2] = 91
data

Unnamed: 0,面積,人口,密度
台中,141297,19651127.0,91.0
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121
基隆,695662,26448193.0,38.01874
宜蘭,323000,,
高雄,149995,12882135.0,85.883763


To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.
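As one possible starting point for that exploration, the sketch below sets values through ``loc`` on a small invented frame (`scores` and its numbers are hypothetical), once with a boolean mask and once with a single label:

```python
import pandas as pd

scores = pd.DataFrame({'score': [55, 72, 90]}, index=['x', 'y', 'z'])

# masking + column name: raise every score below 60 up to 60
scores.loc[scores['score'] < 60, 'score'] = 60

# single row label + column name: overwrite one cell
scores.loc['z', 'score'] = 100

print(scores)
```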

### Additional indexing conventions

There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

In [84]:
data['台北':'台南']

Unnamed: 0,面積,人口,密度
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121


Such slices can also refer to rows by number rather than by index:

In [85]:
data[1:3]

Unnamed: 0,面積,人口,密度
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [86]:
# syntax 1: attribute-style column access
data[data.密度 > 100]

Unnamed: 0,面積,人口,密度
台南,170312,19552860.0,114.806121


In [87]:
# syntax 2: dictionary-style column access
data[data["密度"] > 100]

Unnamed: 0,面積,人口,密度
台南,170312,19552860.0,114.806121


These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
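The three behaviours of plain ``[]`` on a ``DataFrame`` can be compared in one place. This is a sketch on a made-up frame; the column names `area` and `pop` and the values are placeholders:

```python
import pandas as pd

df = pd.DataFrame({'area': [10, 20, 30], 'pop': [100, 400, 900]},
                  index=['a', 'b', 'c'])

col = df['area']                           # a single key selects a column
rows = df['a':'b']                         # a slice selects rows (labels, inclusive)
dense = df[df['pop'] / df['area'] > 15]    # a boolean mask selects rows

print(type(col).__name__)
print(list(rows.index))
print(list(dense.index))
```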

### Add a new city to a DataFrame

In [88]:
new_city_list = [{"面積":100, "人口":200, "密度":2}]
df_new_city = pd.DataFrame(new_city_list, index=['台北'])
df_new_city.head()

Unnamed: 0,面積,人口,密度
台北,100,200,2


In [89]:
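# append returns a new DataFrame, so keep the result
# (pandas >= 2.0: data = pd.concat([data, df_new_city]))
data = data.append(df_new_city)
data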

Unnamed: 0,面積,人口,密度
台中,141297,19651127.0,91.0
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121
基隆,695662,26448193.0,38.01874
宜蘭,323000,,
高雄,149995,12882135.0,85.883763
台北,100,200.0,2.0


### Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

In [90]:
data = data.loc[~data.index.duplicated(keep='first')]
data

Unnamed: 0,面積,人口,密度
台中,141297,19651127.0,91.0
台北,423967,38332521.0,90.413926
台南,170312,19552860.0,114.806121
基隆,695662,26448193.0,38.01874
宜蘭,323000,,
高雄,149995,12882135.0,85.883763
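To see what the mask from `index.duplicated` looks like, here is a minimal sketch on a made-up frame with one repeated label. The `keep` argument decides which occurrence counts as the original:

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3]}, index=['a', 'b', 'a'])

# True marks a label that already appeared earlier
mask = df.index.duplicated(keep='first')

first = df.loc[~mask]                               # keeps the first 'a'
last = df.loc[~df.index.duplicated(keep='last')]    # keeps the last 'a'

print(mask)
print(first)
print(last)
```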


### Convert List of Dictionaries to DataFrame

In [91]:
# create three dictionaries, one per student
class_list = []
student1 = {'國文': 92, '英文': 82, '數學': 88}
student2 = {'國文': 87, '英文': 89, '數學': 98}
student3 = {'國文': 96, '英文': 73, '數學': 94}

class_list.append(student1)
class_list.append(student2)
class_list.append(student3)

In [92]:
class_list

[{'國文': 92, '數學': 88, '英文': 82},
 {'國文': 87, '數學': 98, '英文': 89},
 {'國文': 96, '數學': 94, '英文': 73}]

In [93]:
df_class = pd.DataFrame(class_list)
df_class.head()

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94


### Add new student score dictionary to DataFrame

In [94]:
# create a list holding the new student's score dictionary
new_student_list = [{"國文":100, "英文":91, "數學":87}]

# append to DataFrame df_class
df_class.append(new_student_list)

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
0,100,91,87


As you can see, this result is not what we want: because no index was specified for the new row, it starts again at 0 and duplicates an existing label.

There are two solutions:
1. Append the new student's score dictionary to class_list, then reconstruct the DataFrame.
2. Create a new DataFrame from the new student's dictionary, then concatenate the two DataFrames.

#### 1. Append new student's score dictionary, then re-construct DataFrame

In [95]:
student4 = {"國文":100, "英文":91, "數學":87}

class_list.append(student4)
class_list

[{'國文': 92, '數學': 88, '英文': 82},
 {'國文': 87, '數學': 98, '英文': 89},
 {'國文': 96, '數學': 94, '英文': 73},
 {'國文': 100, '數學': 87, '英文': 91}]

In [96]:
df_class = pd.DataFrame(class_list)
df_class.head()

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
3,100,91,87


#### 2. Create a new DataFrame from the new student's dictionary, then concatenate the two DataFrames

In [97]:
new_student_list = [{"國文":99, "英文":89, "數學":77}]
df_new_student = pd.DataFrame(new_student_list)
df_new_student.head()

Unnamed: 0,國文,英文,數學
0,99,89,77


In [98]:
# merge two DataFrame by using pd.concat()
df_class_new = pd.concat([df_class, df_new_student])
df_class_new

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
3,100,91,87
0,99,89,77


The result has the same problem: we again have two rows whose index equals 0.

In [99]:
# when using pd.concat, we can reset index by calling reset_index()
df_class_new = pd.concat([df_class, df_new_student]).reset_index(drop=True)
df_class_new

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
3,100,91,87
4,99,89,77


### Actually, reset_index applies to any DataFrame as well

In [100]:
# create a list holding the new student's score dictionary
new_student_list = [{"國文":100, "英文":91, "數學":87}]

# append to df_class_new, reset the index, and keep the result
# (pandas >= 2.0 no longer has append; use pd.concat with pd.DataFrame(new_student_list) there)
df_class_new = df_class_new.append(new_student_list).reset_index(drop=True)
df_class_new

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
3,100,91,87
4,99,89,77
5,100,91,87


### Drop duplicate rows in Python Pandas

df_class_new now has duplicate rows at index 3 and 5, so it's time to remove the extra copy.

In [101]:
df_class_new = df_class_new.drop_duplicates(subset=['國文', '英文', '數學'], keep='first')
df_class_new

Unnamed: 0,國文,英文,數學
0,92,82,88
1,87,89,98
2,96,73,94
3,100,91,87
4,99,89,77


A quick review of the `keep` parameter of `drop_duplicates`:

keep : {'first', 'last', False}, default 'first'

- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates, keeping no copy at all.

So `keep='first'` gives us the desired answer here; `keep=False` would have removed both row 3 and row 5.
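The three `keep` settings are easy to compare on a tiny frame. In this sketch (the frame and its column names `k` and `v` are made up), rows 0 and 2 are identical:

```python
import pandas as pd

df = pd.DataFrame({'k': [1, 2, 1], 'v': ['x', 'y', 'x']})

kept_first = df.drop_duplicates(keep='first')   # keeps row 0, drops row 2
kept_last = df.drop_duplicates(keep='last')     # drops row 0, keeps row 2
kept_none = df.drop_duplicates(keep=False)      # drops both row 0 and row 2

print(list(kept_first.index))
print(list(kept_last.index))
print(list(kept_none.index))
```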

### LAB

### Bonus round: deal with NaN value (missing data) in DataFrame

Remember that we have two ``Series``, population and area. We added a new city, `宜蘭`, without a population, which leaves a NaN value in the DataFrame. Let's find out how to deal with this situation.

In [102]:
# Add an area entry with index '宜蘭', but with no population for that city
area = pd.Series({'台北': 423967, '基隆': 695662,
                  '台中': 141297, '台南': 170312,
                  '高雄': 149995,
                  '宜蘭': 323000})
pop = pd.Series({'台北': 38332521, '基隆': 26448193,
                 '台中': 19651127, '台南': 19552860,
                 '高雄': 12882135})
data = pd.DataFrame({'面積':area, '人口':pop})
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


#### Working with missing data

In this section, we will discuss missing (also referred to as NA or NaN) values in pandas.

To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects:

In [103]:
# Find out whether there are any missing values in a specific column.
pd.isna(data['人口'])

台中    False
台北    False
台南    False
基隆    False
宜蘭     True
高雄    False
Name: 人口, dtype: bool

In [104]:
# Check which values in the column are not missing.
data['人口'].notna()

台中     True
台北     True
台南     True
基隆     True
宜蘭    False
高雄     True
Name: 人口, dtype: bool
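Boolean results like these are often summarised rather than read row by row. A minimal sketch, assuming a small made-up frame with one missing value, counts the missing entries per column and pulls out the affected rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': [1.0, 2.0, 3.0],
                   'pop': [10.0, np.nan, 30.0]})

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = df.isna().sum()

# keep only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```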

#### Copy (backup) DataFrame

In [105]:
# back up the original data in case we mess it up
# to recover from the backup later: data = data_backup.copy()
data_backup = data.copy()
data_backup

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


#### Filling missing values: fillna

In [106]:
# fillna returns a new Series; the result is not assigned here, so data itself is unchanged
data['人口'].fillna('missing')
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


In [107]:
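# inplace=True modifies the original data directly
# (on recent pandas versions this chained in-place call may not update data;
#  data['人口'] = data['人口'].fillna(0) is the safer form)
data['人口'].fillna(0, inplace=True)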
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,0.0
高雄,149995,12882135.0


In [108]:
# The result above is unreasonable: a population of 0. If the data were a time series, we could fill with the preceding value instead
# Restore data
data  = data_backup.copy()
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


In [109]:
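# forward fill: copy the value from the previous row (newer pandas versions spell this data.ffill(inplace=True))
data.fillna(method='ffill', inplace=True)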
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,26448193.0
高雄,149995,12882135.0
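For comparison, here is a short sketch of several filling strategies on a made-up ``Series`` `s`, written in the assignment style that returns new objects. Filling with the mean is shown only as one further option, not as the right choice for the city data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

filled_zero = s.fillna(0)          # constant value
filled_prev = s.ffill()            # previous valid value (forward fill)
filled_mean = s.fillna(s.mean())   # mean of the non-missing values

# each call returned a new Series; s still contains its NaN
print(filled_zero.tolist(), filled_prev.tolist(), filled_mean.tolist())
```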


In [110]:
# But judging by the area-to-population ratio, this population for 宜蘭 is wrong, because this is not a time-series problem
# Restore data
data  = data_backup.copy()
data

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
宜蘭,323000,
高雄,149995,12882135.0


In [111]:
# drop a row only when the specified column is NaN
data1 = data.dropna(subset=['人口'])
data1

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
高雄,149995,12882135.0


In [112]:
# drop every row that contains NaN in any column, so that incomplete rows do not skew the numbers
data2 = data.dropna(how='any')
data2

Unnamed: 0,面積,人口
台中,141297,19651127.0
台北,423967,38332521.0
台南,170312,19552860.0
基隆,695662,26448193.0
高雄,149995,12882135.0


In [113]:
data1 == data2

Unnamed: 0,面積,人口
台中,True,True
台北,True,True
台南,True,True
基隆,True,True
高雄,True,True
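The element-wise `==` works here because neither frame contains `NaN` any more. As a sketch of why that matters (using a made-up frame `a` and its copy `b`), `NaN == NaN` is `False`, while the `equals` method treats missing values in the same position as equal:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [1.0, np.nan]})
b = a.copy()

elementwise = (a == b)   # the NaN cell compares as False
same = a.equals(b)       # whole-frame comparison that treats NaN as equal to NaN

print(elementwise)
print(same)
```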


### LAB