# `pandas` - Single Table Verbs

__Contents__:
1. Select rows and columns
1. Rename columns
1. Modify columns
1. Sort rows
1. Sample rows
1. Filter rows

The summarization and grouping verbs are described in the `Summarization` notebook.

Related/useful documentation:
- http://pandas.pydata.org/pandas-docs/stable/index.html
- https://pandas.pydata.org/pandas-docs/stable/dsintro.html

### Load libraries

In [5]:
import pandas  as pd
import numpy   as np
(pd.__version__,
 np.__version__
)

The most common way to create a DataFrame is to use the `read_csv` (pandas) function to read a CSV file. We will do that in notebook `?????`.

Another common technique is to use the `DataFrame` constructor, which has three main parameters:
1. `data`, which is a numpy array, a dictionary or another DataFrame
1. `index`, which is a list of the names of the rows
1. `columns`, which is a list of the names of the columns

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
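
The three kinds of `data` listed above can be sketched as follows. This is a minimal illustration only; the frame names (`from_array`, `from_dict`, `from_frame`) and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# 1. From a 2-D numpy array: rows and columns are named explicitly.
from_array = pd.DataFrame(data=np.array([[1, 2], [3, 4]]),
                          index=['row_1', 'row_2'],
                          columns=['col_a', 'col_b'])

# 2. From a dictionary: keys become column names, values become the columns.
from_dict = pd.DataFrame(data={'col_a': [1, 3], 'col_b': [2, 4]},
                         index=['row_1', 'row_2'])

# 3. From another DataFrame: `columns` selects and orders existing columns.
from_frame = pd.DataFrame(data=from_dict, columns=['col_b', 'col_a'])

print(from_array)
print(from_dict)
print(from_frame)
```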

In [7]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type',
                'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
                'engine-location', 'wheel-base', 'length', 'width',
                'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
                'engine-size', 'fuel-system', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']
import_df = pd.read_csv('/dbfs/mnt/datalab-datasets/file-samples/imports-85.csv',
                        names=[string.replace('-','_') for string in column_names],
                        na_values=['?']
                       )
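
Because the cell above depends on a file path that may not exist outside the original environment, here is a small self-contained sketch of the same `read_csv` options. The two-row CSV text is hypothetical and only imitates the headerless layout of `imports-85.csv`.

```python
import io
import pandas as pd

# Hypothetical sample: no header row, and '?' marks a missing value.
csv_text = "3,?,alfa-romero,gas\n1,164,audi,gas\n"
raw_names = ['symboling', 'normalized-losses', 'make', 'fuel-type']

sample_df = pd.read_csv(io.StringIO(csv_text),
                        # names= supplies the column names; hyphens become underscores
                        names=[name.replace('-', '_') for name in raw_names],
                        # na_values= lists the strings to read as missing (NaN)
                        na_values=['?'])

print(sample_df.columns.tolist())
print(sample_df['normalized_losses'].isna().tolist())
```

Replacing the hyphens matters later on: names such as `city_mpg` can be used with dot notation and inside `query` strings, while `city-mpg` cannot.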

## Select Rows and Columns

Create a sample dataframe for the demonstrations below.

In [10]:
df_col = pd.DataFrame([[100, 200, 300, 400],
                       [101, 201, 301, 401],
                       [102, 202, 302, 402]], 
                      columns=['col_a', 'col_b', 'col_c', 'col_d']
                     )
df_col

__Columns can be accessed using square brackets or dot (attribute) notation.__

In [12]:
x = df_col.col_b
print(type(x))
x

In [13]:
x = df_col['col_b']
print(type(x))
x

In [14]:
col_name = 'col_b'
x = df_col[col_name]
print(type(x))
x

In [15]:
x = df_col[['col_b']]
print(type(x))
x

In [16]:
x = df_col[['col_b','col_c']]
print(type(x))
x

In [17]:
col_list = ['col_b','col_c']
x = df_col[col_list]
print(type(x))
x

In [18]:
x = df_col.iloc[0,:]
print(type(x))
x

__Rows and columns can be accessed using the `iloc` indexer.__

`iloc` specifies rows and columns by their integer position.

In [20]:
x = df_col.iloc[:,0]
print(type(x))
x

In [21]:
x = df_col.iloc[0:2,:]
print(type(x))
x

__Rows and columns can be accessed using the `loc` indexer.__

`loc` specifies rows and columns by their label (name).

In [23]:
x = df_col.loc[:,'col_b']
print(type(x))
x

In [24]:
x = df_col.loc[:,['col_b']]
print(type(x))
x

In [25]:
x = df_col.loc[:,['col_b','col_a']]
print(type(x))
x

In [26]:
x = df_col.loc[:,'col_a':'col_b']
print(type(x))
x
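
One difference between the two indexers is easy to miss: an `iloc` slice excludes its end position, while a `loc` slice includes its end label. A minimal sketch on a small frame built for this example:

```python
import pandas as pd

df = pd.DataFrame([[100, 200, 300, 400],
                   [101, 201, 301, 401]],
                  columns=['col_a', 'col_b', 'col_c', 'col_d'])

by_position = df.iloc[:, 0:2]          # positions 0 and 1; position 2 is excluded
by_label = df.loc[:, 'col_a':'col_c']  # col_a through col_c; col_c is included

print(by_position.columns.tolist())
print(by_label.columns.tolist())
```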

__The following demonstrates accessing rows from a table with a character index.__

In [28]:
df_row_col = pd.DataFrame([[100, 200, 300, 400],
                           [101, 201, 301, 401],
                           [102, 202, 302, 402]], 
                          columns=['col_a', 'col_b', 'col_c', 'col_d'],
                          index  =['row_1', 'row_2', 'row_3']
                         )
df_row_col

In [29]:
df_row_col.iloc[0:2,1:3]

In [30]:
df_row_col.loc['row_2','col_b':'col_d']

In [31]:
df_dt_col = pd.DataFrame(data=[[100, 200, 300, 400],
                               [101, 201, 301, 401],
                               [102, 202, 302, 402]], 
                         columns=['col_a', 'col_b', 'col_c', 'col_d'],
                         index  =pd.date_range(pd.to_datetime('20180203', 
                                                              format='%Y%m%d'),
                                               periods=3,
                                               freq='D')
                        )
df_dt_col

In [32]:
df_dt_col.iloc[0:2,1:3]

In [33]:
x = df_dt_col.loc['2018-02-03',['col_b','col_d']]
print(type(x))
x

In [34]:
x = df_dt_col.loc['2018-02-04':'2018-02-05',['col_b','col_d']]
print(type(x))
x

__Exercise__: try changing the previous cell so that `loc` accepts a list of row names.
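
One possible approach to the exercise, offered as a sketch and not the only answer: pass a list-like of row labels in place of the slice. Here the labels are converted with `pd.to_datetime` so that they match the date index exactly; the name `df_dt` and the choice of rows are made up for this example.

```python
import pandas as pd

df_dt = pd.DataFrame([[100, 200, 300, 400],
                      [101, 201, 301, 401],
                      [102, 202, 302, 402]],
                     columns=['col_a', 'col_b', 'col_c', 'col_d'],
                     index=pd.date_range('2018-02-03', periods=3, freq='D'))

# A list of row labels (not a slice), so non-adjacent rows can be picked.
row_labels = pd.to_datetime(['2018-02-03', '2018-02-05'])
x = df_dt.loc[row_labels, ['col_b', 'col_d']]
print(type(x))
print(x)
```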

## Rename Columns
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

In [37]:
rename_df = import_df.rename(columns={'city_mpg'   : 'mpg_city',
                                      'highway_mpg': 'mpg_highway'
                                     })
rename_df.columns
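
`rename` returns a new DataFrame and leaves the original unchanged. The sketch below shows this on a two-column frame with made-up values, so it runs without the imports data set.

```python
import pandas as pd

df = pd.DataFrame({'city_mpg': [21, 19], 'highway_mpg': [27, 26]})

# The dictionary maps old column names to new ones.
renamed = df.rename(columns={'city_mpg': 'mpg_city',
                             'highway_mpg': 'mpg_highway'})

print(renamed.columns.tolist())  # new names
print(df.columns.tolist())       # original names are untouched
```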

## Modify Columns

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html

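> Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.

The quotation above describes older pandas releases. Since pandas 0.23 (running on Python 3.6 or later), a later keyword argument can refer to a column created earlier in the same `assign` call, provided it is passed as a callable.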

In [40]:
df_col.assign(apb = df_col.col_a    + df_col.col_b, 
              ctd = df_col['col_c'] * df_col['col_d'])
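
When a new column needs to build on another column created in the same call, a callable (for example a `lambda`) can be passed instead of a precomputed Series. This sketch assumes pandas 0.23 or later on Python 3.6 or later; the column name `apb_doubled` is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'col_a': [100, 101], 'col_b': [200, 201]})

# Each lambda receives the frame as built so far, so the second one can see `apb`.
result = df.assign(apb=lambda d: d.col_a + d.col_b,
                   apb_doubled=lambda d: d.apb * 2)

print(result)
print(df.columns.tolist())  # assign returns a copy; df itself is unchanged
```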

## Sort Rows

- http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.sort_values.html

In [44]:
import_df.sort_values(by='horsepower',axis=0, ascending=True)[['horsepower','make','city_mpg','highway_mpg']].head()
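
`sort_values` also accepts several sort keys, a direction per key, and a position for missing values. A minimal sketch on made-up data (the `cars` frame is hypothetical, not the imports data set):

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({'make': ['audi', 'bmw', 'toyota', 'mazda'],
                     'horsepower': [102.0, np.nan, 62.0, 102.0],
                     'city_mpg': [24, 23, 35, 26]})

# Sort by horsepower ascending, break ties by city_mpg descending,
# and place rows with a missing horsepower first.
sorted_cars = cars.sort_values(by=['horsepower', 'city_mpg'],
                               ascending=[True, False],
                               na_position='first')
print(sorted_cars)
```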

## Sample Rows
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html

See also the `frac`, `replace` and `weights` parameters.

In [46]:
import_df.sample(n=10)[['horsepower','make','city_mpg','highway_mpg']]
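
The other `sample` parameters can be sketched on a ten-row frame. The data and the weights are made up for the example, and `random_state` is set only so that repeated runs draw the same rows.

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})

# frac: sample a fraction of the rows instead of a fixed count.
half = df.sample(frac=0.5, random_state=0)

# replace=True: rows may repeat, so n can exceed the number of rows.
boot = df.sample(n=20, replace=True, random_state=0)

# weights: relative sampling probabilities; rows with weight 0 are never drawn.
weighted = df.sample(n=3, weights=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], random_state=0)

print(len(half), len(boot))
print(weighted)
```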

## Filter Rows
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html
- https://pythonspot.com/pandas-filter/

In [48]:
import_df[import_df.make=="toyota"][['make','body_style','city_mpg','highway_mpg']].head()

In [49]:
import_df.query('make=="toyota"')[['make','body_style','city_mpg','highway_mpg']].head()
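
Filters often combine several conditions. The sketch below shows three common patterns on a small made-up frame (`cars` and `min_mpg` are hypothetical names, not part of the imports data set).

```python
import pandas as pd

cars = pd.DataFrame({'make': ['toyota', 'audi', 'toyota', 'mazda'],
                     'body_style': ['sedan', 'sedan', 'wagon', 'hatchback'],
                     'city_mpg': [35, 24, 27, 30]})

# Combine boolean masks with & (and) or | (or); each condition needs parentheses.
mask_result = cars[(cars.make == 'toyota') & (cars.city_mpg > 30)]

# isin() tests membership in a list of values.
isin_result = cars[cars.make.isin(['audi', 'mazda'])]

# query() can reference a Python variable with the @ prefix.
min_mpg = 30
query_result = cars.query('make == "toyota" and city_mpg > @min_mpg')

print(mask_result)
print(isin_result)
print(query_result)
```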

__The End__