# DS-SF-23 | Codealong 02 | Introduction to `pandas`

> ## Importing `pandas` (and `NumPy`) into the environment

Most (if not all) of the course notebooks will start by importing `pandas` into the environment.  Because much of `pandas` is built on `NumPy`, `NumPy` is imported alongside it.

We follow the widely used convention of importing `pandas` under the alias `pd` and `NumPy` under the alias `np`, so that their members are referenced with the `pd.` and `np.` prefixes.

In [2]:
import os
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 6)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)
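
The `pd.set_option` calls above change how `pandas` prints objects, not the data itself. As a minimal sketch (the value 20 below is an arbitrary choice for illustration), a setting can be read back with `pd.get_option` and overridden temporarily with `pd.option_context`:

```python
import pandas as pd

# Limit printed output to 6 rows and 10 columns, as in the cell above.
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 10)

# Read a setting back to confirm it took effect.
max_rows = pd.get_option('display.max_rows')
print(max_rows)

# Temporarily override a setting; the previous value is restored on exit.
# The value 20 is arbitrary and only for illustration.
with pd.option_context('display.max_rows', 20):
    inside = pd.get_option('display.max_rows')
after = pd.get_option('display.max_rows')
print(inside, after)
```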

> ## Loading data from files and the Web

`pandas` provides powerful facilities for easy retrieval of data from a variety of data sources.  In particular, it provides built-in support for loading data in `.csv` format, a common means of storing structured data in text files.

In [3]:
df = pd.read_csv(os.path.join('..', 'datasets', 'zillow-02-start.csv'))
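
`pd.read_csv` accepts a file path, a URL, or any file-like object. The sketch below reads made-up CSV text from memory (it is not the Zillow dataset, and the name `sample` is hypothetical) to show how the header row becomes column names and how an empty field becomes a missing value:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not the Zillow dataset).
csv_text = """ID,Address,BedCount
101,1 Example St,2
102,2 Example St,
103,3 Example St,3
"""

# io.StringIO wraps the text in a file-like object that read_csv can consume.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # BedCount is stored as float because of the empty field
```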

In [4]:
type(df)

pandas.core.frame.DataFrame

The result is a `DataFrame`.  A `DataFrame` stores tabular data:

In [5]:
df

Unnamed: 0,ID,Address,Latitude,Longitude,DateOfSale,...,Size,SizeUnit,LotSize,LotSizeUnit,BuiltInYear
0,15063471,"55 Vandewater St APT 9, San Francisco, CA",37805103,-122412856,12/4/15,...,550.0,sqft,,,1980.0
1,15063505,"740 Francisco St, San Francisco, CA",37804420,-122417389,11/30/15,...,1430.0,sqft,2435.0,sqft,1948.0
2,15063609,"819 Francisco St, San Francisco, CA",37803728,-122419055,11/12/15,...,2040.0,sqft,3920.0,sqft,1976.0
...,...,...,...,...,...,...,...,...,...,...,...
997,2128308939,"33 Santa Cruz Ave, San Francisco, CA",37709136,-122465332,12/10/15,...,1738.0,sqft,2299.0,sqft,1976.0
998,2131957929,"1821 Grant Ave, San Francisco, CA",37803760,-122408531,12/15/15,...,1048.0,sqft,,,1975.0
999,2136213970,"1200 Gough St, San Francisco, CA",37784770,-122424100,1/10/16,...,900.0,sqft,,,1966.0


> ## Selecting columns of a `DataFrame`

Selecting data in specific columns of a `DataFrame` is performed by using the `[]` operator.

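In the version of `pandas` this notebook was written for, passing a list of integers to `[]` fell back to a location-based lookup of the columns.  Current versions treat the integers as column labels and raise a `KeyError` when no such labels exist, so location-based lookups are written with `.iloc[:, ...]` instead.

E.g., columns 5 and 6:

In [None]:
df.iloc[:, [5, 6]]

In [6]:
type(df.iloc[:, [5, 6]])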

pandas.core.frame.DataFrame
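
Selecting columns by position with `.iloc[:, ...]` does not depend on how `[]` interprets integers. A minimal sketch on a small made-up frame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Small made-up frame; the values are illustrative only.
toy = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St'],
    'SalePrice': [1.2, 0.9],
    'BedCount': [2.0, 1.0],
})

# A list of positions returns a DataFrame, even for a single position.
two_cols = toy.iloc[:, [0, 2]]
one_col_frame = toy.iloc[:, [1]]

# A bare integer position returns a Series instead.
one_col_series = toy.iloc[:, 1]

print(list(two_cols.columns))
print(type(one_col_frame).__name__, type(one_col_series).__name__)
```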

The list can contain a single integer.

E.g., column 7 only:

In [7]:
df.iloc[:, [7]]

Unnamed: 0,IsAStudio
0,False
1,False
2,False
...,...
997,False
998,False
999,False


In [8]:
type(df.iloc[:, [7]])

pandas.core.frame.DataFrame

If the values passed to `[]` are not integers, the `DataFrame` matches them against the labels in its `columns` index.

In [None]:
df.columns

In [None]:
df[ ['SalePrice', 'SalePriceUnit'] ]

However, you cannot mix column names and integer positions in the same list; the following raises an error:

In [None]:
df[ ['SalePrice', 6] ]
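
To see the failure without interrupting a notebook, the lookup can be wrapped in `try`/`except`. This sketch uses a made-up two-column frame (the name `toy` is hypothetical) and assumes a recent `pandas` version, where the integer is treated as a missing column label:

```python
import pandas as pd

# Made-up frame with two named columns; values are illustrative only.
toy = pd.DataFrame({'SalePrice': [1.2, 0.9], 'SalePriceUnit': ['$M', '$M']})

try:
    # 1 is not a column label, so the lookup fails.
    toy[['SalePrice', 1]]
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)
```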

Passing a single value rather than a list always results in a label-based lookup of the column:

In [None]:
df['Address']

And the result is a `Series`:

In [None]:
type(df['Address'])
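
The difference between a bare label and a one-element list is easy to miss. A minimal sketch on a made-up frame (the name `toy` is hypothetical) comparing the two results:

```python
import pandas as pd

# Made-up frame; values are illustrative only.
toy = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St'],
    'BedCount': [2.0, 1.0],
})

as_series = toy['Address']     # bare label: one-dimensional Series
as_frame = toy[['Address']]    # list of labels: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```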

Columns can also be retrieved using "attribute access": a `DataFrame` exposes each column as a property named after the column.  Note that this does not work for columns whose names contain spaces or dots, or whose names clash with an existing `DataFrame` attribute or method.

In [None]:
df.Address
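
The limits of attribute access can be sketched on a made-up frame. The column names `Sale Price` and `size` below are hypothetical and chosen only to trigger the two problem cases: a space in the name, and a clash with the built-in `DataFrame.size` attribute.

```python
import pandas as pd

# Made-up frame; 'Sale Price' and 'size' are hypothetical column names.
toy = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St'],
    'Sale Price': [1.2, 0.9],
    'size': [550.0, 1430.0],
})

# For a simple name, attribute access and [] return the same column.
same = toy.Address.equals(toy['Address'])

# A name containing a space can only be reached with [].
prices = toy['Sale Price']

# toy.size is the built-in element count (rows * columns), not the column.
n_cells = toy.size
sizes = toy['size']

print(same, n_cells, list(sizes))
```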

The columns index again...

In [None]:
df.columns

To find the zero-based location of a column, use the `.get_loc()` method of the `columns` index.  E.g.,

In [None]:
df.columns.get_loc('BedCount')

In [None]:
df.iloc[:, [df.columns.get_loc('BedCount')]]

In [9]:
df[ ['BedCount'] ]

Unnamed: 0,BedCount
0,1.0
1,
2,2.0
...,...
997,3.0
998,2.0
999,1.0
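
As a quick check that the position returned by `.get_loc()` and the column label refer to the same data, the following sketch compares both lookups on a made-up frame (the name `toy` is hypothetical):

```python
import pandas as pd

# Made-up frame; values are illustrative only.
toy = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St'],
    'SalePrice': [1.2, 0.9],
    'BedCount': [2.0, 1.0],
})

# Zero-based position of the column within the columns index.
pos = toy.columns.get_loc('BedCount')

by_position = toy.iloc[:, [pos]]   # lookup by location
by_label = toy[['BedCount']]       # lookup by label

print(pos, by_position.equals(by_label))
```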


> ## Selecting rows and values of a `DataFrame` using the index

### Slicing using the `[]` operator

E.g., first five rows:

In [None]:
df[:5]
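
A slice inside `[]` selects rows by position, following the usual Python slicing rules (start included, stop excluded, negative values counted from the end). A minimal sketch on a made-up six-row frame:

```python
import pandas as pd

# Made-up single-column frame with the default 0..5 index.
toy = pd.DataFrame({'BedCount': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

first_three = toy[:3]    # positions 0, 1, 2
middle = toy[2:4]        # positions 2, 3
last_two = toy[-2:]      # the final two rows

print(list(first_three.index), list(middle.index), list(last_two.index))
```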

> ## Selecting rows by index label and location: `.loc[]` and `.iloc[]`

Until now, the index of the `DataFrame` has been a sequence of integers starting from 0, but you can specify which column(s) should serve as the index.  E.g., `ID`:

In [11]:
df = df.set_index('ID')

In [None]:
df

E.g., row with index 15063505:

In [None]:
df.loc[15063505]

E.g., rows with indices 15063505 and 15064044:

In [None]:
df.loc[ [15063505, 15064044] ]

E.g., the rows at positions 1 and 3 (zero-based):

In [None]:
df.iloc[ [1, 3] ]
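
Once the index holds identifiers rather than 0, 1, 2, ..., labels and positions no longer coincide. The sketch below uses made-up `ID` values (501 to 504) to show `.loc[]` and `.iloc[]` reaching the same rows in two different ways:

```python
import pandas as pd

# Made-up frame indexed by hypothetical ID values.
toy = pd.DataFrame({
    'ID': [501, 502, 503, 504],
    'Address': ['1 Example St', '2 Example St', '3 Example St', '4 Example St'],
}).set_index('ID')

by_label = toy.loc[[502, 504]]     # rows whose index labels are 502 and 504
by_position = toy.iloc[[1, 3]]     # rows at zero-based positions 1 and 3

# A single label returns the row as a Series.
one_row = toy.loc[502]

# 1 is a valid position but not a valid label in this index.
print(by_label.equals(by_position), 1 in toy.index)
```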

> ## Scalar lookup by label or location using `.at[]` and `.iat[]`

Scalar values can be looked up by label using `.at[]`, passing the row label followed by the column name.  E.g.,

In [12]:
df.at[15064044, 'DateOfSale']

'12/11/15'

Scalar values can also be looked up by location using `.iat[]`, passing the row position followed by the column position.  E.g.,

In [None]:
df.iat[3, 3]
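
Both accessors return a single cell. As a minimal sketch on a made-up frame (the IDs and dates are illustrative, not taken from the dataset), the same value can be reached by label with `.at[]` and by position with `.iat[]`:

```python
import pandas as pd

# Made-up frame indexed by hypothetical ID values.
toy = pd.DataFrame({
    'ID': [501, 502, 503],
    'Address': ['1 Example St', '2 Example St', '3 Example St'],
    'DateOfSale': ['12/4/15', '11/30/15', '11/12/15'],
}).set_index('ID')

by_label = toy.at[502, 'DateOfSale']   # row label 502, column 'DateOfSale'
by_position = toy.iat[1, 1]            # second row, second column

print(by_label, by_position)
```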

> ## Selecting rows of a `DataFrame` by Boolean selection

Rows can also be selected by using Boolean selection, using an array calculated from the result of applying a logical condition on the values in any of the columns.  This allows us to build more complicated selections than those based simply upon index labels or positions.

E.g., which homes were built before 1900?

In [13]:
df.BuiltInYear < 1900

ID
15063471      False
15063505      False
15063609      False
              ...  
2128308939    False
2131957929    False
2136213970    False
Name: BuiltInYear, dtype: bool

This results in a Boolean `Series` that can be used to select the rows where the value is `True`:

In [None]:
df[ df.BuiltInYear < 1900 ]

Multiple conditions can be put together.  E.g.,

In [None]:
df[ (df.BuiltInYear < 1900) & (df.Size > 1500) ]
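
Conditions are combined element-wise with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`. A sketch on made-up values, including one missing year to show that a comparison against a missing value yields `False`:

```python
import pandas as pd

# Made-up values; the last BuiltInYear is deliberately missing.
toy = pd.DataFrame({
    'BuiltInYear': [1890.0, 1895.0, 1950.0, None],
    'Size': [2000.0, 900.0, 1800.0, 1600.0],
})

old = toy['BuiltInYear'] < 1900          # missing year compares as False
both = old & (toy['Size'] > 1500)        # old AND large
either = old | (toy['Size'] > 1500)      # old OR large
not_old = ~old                           # logical negation

print(list(both), list(either), list(not_old))
```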

At the same time, it is possible to select a subset of the columns.  E.g.,

In [None]:
df[ (df.BuiltInYear < 1900) & (df.Size > 1500) ][ ['Address'] ]
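
The chained form above filters the rows first and then picks the columns from the result. An alternative is to pass the Boolean mask and the column list to `.loc[]` in a single step. A sketch on a made-up frame (the names `toy` and `mask` are hypothetical) showing that both forms return the same result:

```python
import pandas as pd

# Made-up frame; values are illustrative only.
toy = pd.DataFrame({
    'Address': ['1 Example St', '2 Example St', '3 Example St'],
    'BuiltInYear': [1890.0, 1895.0, 1950.0],
    'Size': [2000.0, 900.0, 1800.0],
})

mask = (toy['BuiltInYear'] < 1900) & (toy['Size'] > 1500)

chained = toy[mask][['Address']]       # two successive selections
single = toy.loc[mask, ['Address']]    # rows and columns in one call

print(chained.equals(single), list(single['Address']))
```

The single `.loc[]` call is also the form to prefer when assigning values to the selection, since assigning through a chained selection may modify a temporary copy rather than the original frame.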