In [1]:
from IPython.core.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))

# 3 Work with DataFrame

This section explores essential functions of pandas.

## 3.1 Common attributes/methods of DataFrame

| seq |     name     |                description                                        |
| --- | ------------ | ----------------------------------------------------------------- |
| 01  |  dtypes      | returns the data type of each column                              |
| 02  |  index       | returns the row labels (the index)                                |
| 03  |  columns     | returns the column labels                                         |
| 04  |  ndim        | returns the number of dimensions (2 for a DataFrame)              |
| 05  |  shape       | returns the number of rows and columns as a tuple                 |
| 06  |  size        | returns the total number of values, including NaN                 |
| 07  |  count()     | returns a Series with the count of non-NaN values per column      |
| 08  |  head()      | returns the first n rows, default 5                               |
| 09  |  tail()      | returns the last n rows, default 5                                |
| 10  |  sample()    | returns a random sample of rows, default 1                        |
| 11  |  nunique()   | returns a Series with the count of unique values per column       |
| 12  |  max()       | returns a Series with the maximum value of each column            |
| 13  |  min()       | returns a Series with the minimum value of each column            |
| 14  |  nlargest()  | returns the n rows with the largest values in given columns       |
| 15  |  nsmallest() | returns the n rows with the smallest values in given columns      |
| 16  |  sum()       | returns a Series with the sum of each column                      |
| 17  |  mean()      | returns a Series with the mean of each column                     |
| 18  |  median()    | returns a Series with the median of each column                   |
| 19  |  std()       | returns a Series with the standard deviation of each column       |
| 20  |  info()      | prints a summary: dtypes, non-null counts, memory usage           |
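
To make a few of these concrete, here is a minimal sketch on a tiny DataFrame. The data is hypothetical (made-up names and numbers, not taken from the NBA dataset), and one salary is deliberately missing to show how `size` and `count()` differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and numbers are invented for illustration.
df = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C"],
        "Team": ["Team X", "Team Y", "Team X"],
        "Salary": [100.0, np.nan, 300.0],
    }
)

print(df.shape)      # (number of rows, number of columns)
print(df.ndim)       # number of dimensions
print(df.size)       # rows * columns, missing values included
print(df.count())    # non-missing values per column
print(df.nunique())  # distinct values per column, NaN ignored by default
```

Because `size` counts every cell while `count()` skips missing ones, comparing the two is a quick way to spot columns with gaps.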


## 3.2 Import the NBA dataset

In [2]:
import pandas as pd

### 3.2.1 plain import

In [None]:
nba1 = pd.read_csv("nba.csv")
nba1.info()

### 3.2.2 parse date columns

In [None]:
nba2 = pd.read_csv("nba.csv", parse_dates=['Birthday'])
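
If `nba.csv` is not at hand, the effect of `parse_dates` can be checked on an in-memory CSV. The sketch below assumes a hypothetical two-row file with the same column names used in this section; the rows themselves are invented.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for nba.csv.
csv_text = """Name,Team,Birthday,Salary
Player A,Team X,1990-01-15,100
Player B,Team Y,1985-06-30,200
"""

# Without parse_dates the Birthday column is read as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to a datetime type.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Birthday"])

print(plain["Birthday"].dtype)
print(parsed["Birthday"].dtype)

# Datetime columns expose the .dt accessor, e.g. to pull out the year.
print(parsed["Birthday"].dt.year.tolist())
```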

### 3.2.3 Top 10 high-pay players

In [None]:
nba2.nlargest(10, columns="Salary")

### 3.2.4 Top 5 oldest players

In [None]:
nba2.nsmallest(5, columns="Birthday")
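
The two calls above can be tried on a small, hypothetical DataFrame (invented players and salaries). The sketch also illustrates why `nsmallest` on `Birthday` gives the oldest players: the earliest birth dates belong to the oldest people.

```python
import pandas as pd

# Hypothetical toy data; names, dates and salaries are invented.
df = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C", "Player D"],
        "Birthday": pd.to_datetime(
            ["1990-01-15", "1985-06-30", "1998-03-02", "1993-11-20"]
        ),
        "Salary": [100, 400, 200, 300],
    }
)

# The two highest-paid players.
top_paid = df.nlargest(2, columns="Salary")

# The two oldest players, i.e. the two earliest birthdays.
oldest = df.nsmallest(2, columns="Birthday")

print(top_paid)
print(oldest)
```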

## 3.3 Sort

We can use the `sort_values()` method to sort the rows of a DataFrame. As in SQL, we can sort by multiple columns and choose the sort order of each column separately.

### 3.3.1 Sort by player income in descending order

In [None]:
nba2.sort_values(by="Salary", ascending=False)

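### 3.3.2 Sort by player age in ascending order

A later birthday means a younger player, so sorting `Birthday` in descending order lists the youngest players first.
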
In [None]:
nba2.sort_values(by="Birthday", ascending=False)

### 3.3.3 Sort by team and income

In [None]:
nba2.sort_values(by=["Team", "Salary"], ascending=[True, False])
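
A minimal sketch of the same multi-column sort on hypothetical data (invented teams and salaries) makes the ordering easy to verify by eye: teams appear alphabetically, and within each team the highest salary comes first.

```python
import pandas as pd

# Hypothetical toy data; names and numbers are invented.
df = pd.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C", "Player D"],
        "Team": ["Team Y", "Team X", "Team Y", "Team X"],
        "Salary": [100, 400, 200, 300],
    }
)

# Team ascending, then Salary descending within each team.
by_team = df.sort_values(by=["Team", "Salary"], ascending=[True, False])
print(by_team)

# sort_values returns a new DataFrame; the original order is unchanged.
print(df["Name"].tolist())
```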

### 3.3.4 Sort columns

We can also sort columns by using `sort_index()` and specifying the `axis` parameter. For example, to sort the columns in reverse alphabetical order (omit `ascending=False` for alphabetical order):

In [None]:
nba2.sort_index(axis="columns", ascending=False)
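
As a quick check on hypothetical data (invented column values), the sketch below sorts the column labels in both directions. Only the order of the columns changes; the rows stay where they are.

```python
import pandas as pd

# Hypothetical toy data with columns in no particular order.
df = pd.DataFrame(
    {
        "Team": ["Team X", "Team Y"],
        "Name": ["Player A", "Player B"],
        "Salary": [100, 200],
    }
)

# Column labels in alphabetical order.
alphabetical = df.sort_index(axis="columns")

# Column labels in reverse alphabetical order.
reverse = df.sort_index(axis="columns", ascending=False)

print(list(alphabetical.columns))
print(list(reverse.columns))
```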

## 3.5 Change index

Besides a numeric index, we can also use strings or dates as the index. We change the index of an existing DataFrame by calling the `set_index()` method, which returns a new DataFrame. Alternatively, we can specify `index_col` to set the index when we construct a DataFrame by importing from a file.

In [None]:
nba3 = nba2.set_index(keys="Name")
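
The `index_col` route mentioned above can be sketched with an in-memory CSV. The file content is hypothetical (invented rows using the column names from this section); both approaches should end up with `Name` as the index.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for nba.csv.
csv_text = """Name,Team,Salary
Player A,Team X,100
Player B,Team Y,200
"""

# Option 1: read first, then move the Name column into the index.
via_set_index = pd.read_csv(io.StringIO(csv_text)).set_index(keys="Name")

# Option 2: set the index while reading the file.
via_index_col = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(via_index_col.index.name)
print(via_set_index.equals(via_index_col))
```

Setting the index at read time saves a step and avoids holding an intermediate DataFrame with the default numeric index.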

## 3.6 Select columns
Selecting columns is similar to projection in SQL. With pandas we can select a single column or multiple columns.

### 3.6.1 Select one column

In [None]:
nba3.Salary

### 3.6.2 Select multiple columns

In [None]:
nba3[["Team","Birthday", "Salary"]]
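
One detail worth knowing is the type of the result. The sketch below, on hypothetical data (invented players), suggests that selecting one column gives a Series, while passing a list of column names gives a DataFrame, even when the list has a single entry.

```python
import pandas as pd

# Hypothetical toy data indexed by player name.
df = pd.DataFrame(
    {
        "Name": ["Player A", "Player B"],
        "Team": ["Team X", "Team Y"],
        "Salary": [100, 200],
    }
).set_index("Name")

# Attribute access and bracket access return the same Series.
salary_attr = df.Salary
salary_key = df["Salary"]

# A list of names returns a DataFrame.
subset = df[["Team", "Salary"]]
one_col_frame = df[["Salary"]]

print(type(salary_key).__name__)
print(type(one_col_frame).__name__)
```

Bracket access is the safer habit: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.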

### 3.6.3 Select based on data type

In [None]:
nba3.select_dtypes(include=["object", "datetime"])
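
`select_dtypes()` also accepts an `exclude` argument and the generic `"number"` selector. The sketch below uses hypothetical data (invented values) to split a DataFrame into numeric and non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data with text, datetime and integer columns.
df = pd.DataFrame(
    {
        "Team": ["Team X", "Team Y"],
        "Birthday": pd.to_datetime(["1990-01-15", "1985-06-30"]),
        "Salary": [100, 200],
    }
)

# Keep only numeric columns.
numeric = df.select_dtypes(include="number")

# Keep everything that is not numeric.
non_numeric = df.select_dtypes(exclude="number")

# Keep only datetime columns.
dates = df.select_dtypes(include="datetime")

print(list(numeric.columns))
print(list(non_numeric.columns))
print(list(dates.columns))
```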

## 3.7 Select rows

There are multiple ways to select a specific row or a collection of rows.



### 3.7.1 Using index label

To extract specific rows or values, we use the following accessors:

- `loc`
- `iloc`
- `at`
- `iat`

The first two can extract a single row or a range of rows. The last two extract a single value (one row and one column) and are faster for that purpose.
The `i` variants accept only integer positions, while `loc` and `at` accept index labels.

Let's start with `loc`.

In [None]:
nba3.loc["LeBron James"]

We can specify multiple labels by passing a list:

In [None]:
nba3.loc[["Chris Chiozza","Admiral Schofield"]]

We can also specify a range of labels using the colon syntax. Note that the upper boundary is included.

In [None]:
nba3.sort_index().loc["Admiral Schofield":"Chris Chiozza"]

By default, `loc` returns all columns. You can pick the columns you need by passing a second argument to the `loc` accessor as follows:

In [None]:
nba3.sort_index().loc["Admiral Schofield":"Chris Chiozza", ['Team', 'Salary']]
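
The position-based accessors `iloc` and `iat` work the same way but take integer positions instead of labels. A minimal sketch on hypothetical data (invented players) also shows one difference: a label slice with `loc` includes its upper bound, whereas a position slice with `iloc` excludes it, like ordinary Python slicing.

```python
import pandas as pd

# Hypothetical toy data indexed by player name.
df = pd.DataFrame(
    {
        "Team": ["Team X", "Team Y", "Team X", "Team Y"],
        "Salary": [100, 400, 200, 300],
    },
    index=pd.Index(
        ["Player A", "Player B", "Player C", "Player D"], name="Name"
    ),
)

# Label slice: both endpoints are included (3 rows).
by_label = df.loc["Player A":"Player C"]

# Position slice: the end position is excluded (2 rows).
by_position = df.iloc[0:2]

# Single value by label and by position (row 1, column 1).
value_label = df.at["Player B", "Salary"]
value_pos = df.iat[1, 1]

print(len(by_label), len(by_position))
print(value_label, value_pos)
```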

### 3.7.2 Row accessors compared

Let's compare the speed of accessing a single value using `loc` and `at`.

In [96]:
%%timeit
nba3.loc["LeBron James", "Team"]

3.43 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [95]:
%%timeit
nba3.at["LeBron James", "Team"]

1.7 µs ± 7.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
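
Outside a notebook, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard `timeit` module. The data here is hypothetical (invented players), and the absolute numbers will vary by machine and pandas version, so treat the output as indicative only.

```python
import timeit

import pandas as pd

# Hypothetical toy data indexed by player name.
df = pd.DataFrame(
    {"Team": ["Team X", "Team Y"], "Salary": [100, 200]},
    index=pd.Index(["Player A", "Player B"], name="Name"),
)

runs = 1000

# Time the same single-value lookup through both accessors.
loc_seconds = timeit.timeit(lambda: df.loc["Player B", "Team"], number=runs)
at_seconds = timeit.timeit(lambda: df.at["Player B", "Team"], number=runs)

# Convert total seconds to microseconds per call.
print(f"loc: {loc_seconds / runs * 1e6:.2f} µs per call")
print(f"at:  {at_seconds / runs * 1e6:.2f} µs per call")
```

Both accessors return the same value; `at` skips some of the general-purpose handling that `loc` needs for slices and lists, which is the likely source of the difference.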
