#### Introduction to Statistical Learning, Lab 2.2

# Data Access: Introduction to Pandas & ISLPy

Pandas is the Python library of choice for accessing and manipulating data sets. It provides I/O in various file formats and a powerful interface to data sets. 

You can always refer to the latest [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html). You should work through the excellent [10 minutes to pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) if you have not done so already. This will likely take more than 10 minutes, though.

Here we will only review some basic operations to get us started. More features will be
introduced in later labs when the need arises. We will also introduce the `islpwf` library that provides efficient access to the data sets used in this course.


Pandas is conventionally imported like this:

In [None]:
import pandas as pd

We will also need NumPy and the os module (for file path manipulation):

In [None]:
import numpy as np
import os

#### Loading Data into Pandas

When we load data into pandas it will create a data frame. 
Data frames are the core structure of pandas. We use data frames to access and manipulate data sets. You can think of a pandas data frame as a table or spreadsheet. Typically, the
rows are the observations in a data set and the columns are the variables (features).

Pandas can load data from a [variety of file formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). 

Load the `Auto` data set CSV file into pandas:

In [None]:
datasets_dir = '../../datasets'
auto_path = os.path.join(datasets_dir, 'Auto.csv')
df = pd.read_csv(auto_path)
type(df)
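
`pd.read_csv` accepts file-like objects as well as paths, which makes it easy to experiment without a file on disk. The following is a minimal sketch using a small made-up CSV string (the values are illustrative and not taken from `Auto.csv`). It also shows the `na_values` parameter, which tells pandas which strings to treat as missing values.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (illustrative values only)
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
31.5,4,68,car c
"""

# read_csv accepts file-like objects; na_values marks '?' as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(toy.shape)                       # number of rows and columns
print(toy.dtypes)                      # column types inferred by pandas
print(toy['horsepower'].isna().sum())  # count of missing entries
```

Without `na_values`, a column containing a stray `?` would be read as strings rather than numbers, so it is worth checking `dtypes` after loading.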

Load the `Khan` data set file into pandas. This file has *many* columns.

In [None]:
# this loads only the training features
khan_path = os.path.join(datasets_dir, 'Khan_xtrain.csv')
df = pd.read_csv(khan_path)

#### The ISLPy Library

To speed things up and remove the need to set the correct file paths, we provide the `islpwf` library. It facilitates access to the data sets used in this course and loads the larger data sets a little faster.

Let's load the `Khan` data set using the `islpwf.datasets` module.

In [None]:
from islpwf import datasets
# conveniently loads the entire data set
xtrain, ytrain, xtest, ytest = datasets.Khan()

We can get help about any data set in the `islpwf.datasets` module:

In [None]:
help(datasets.Khan)

Or for the whole module, as usual:

In [None]:
help(datasets)

The following examples assume the `Auto` data set, so we reload it through the `islpwf.datasets` module.

In [None]:
help(datasets.Auto)

In [None]:
df = datasets.Auto()

#### Inspecting, Selecting & Rearranging Data

The data in data frames can be accessed, selected and rearranged in various ways. We explore a few of them on the `Auto` data set.

In [None]:
df.head()  # the first five rows

In [None]:
df.tail()  # the last five rows

In [None]:
df.head(2)  # the first two rows

In [None]:
df.tail(11)  # the last 11 rows

In [None]:
df.describe()  # produce an overview of the data set

In [None]:
c = df['mpg']  # we can use the column names as keys
c.head()

In [None]:
type(c)  # a data frame column is a pandas series object

In [None]:
# we can ask for column properties
print(df['mpg'].median(), df['mpg'].quantile(0.25))

In [None]:
df.mpg.head()  # we can also use column names like attributes

In [None]:
df.keys()  # all column names/keys

In [None]:
s = df[['mpg', 'name']]  # we can select multiple columns
s.head()

In [None]:
df[3:5]  # we can select rows by slicing (just like Python lists)

In [None]:
df[3:5][['mpg', 'name']]  # the selection methods can be combined

In [None]:
# we can create a numpy array
x = df[['mpg', 'weight', 'acceleration']].to_numpy()
x[2:7]
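
The data type of the resulting array depends on the selected columns. As a small sketch on a made-up frame (illustrative values only): purely numeric columns give a numeric array, while mixing in a text column forces the generic `object` type, which is usually not what numerical code expects.

```python
import pandas as pd

# Made-up frame with two numeric columns and one text column
toy = pd.DataFrame({
    'mpg': [18.0, 25.0],
    'cylinders': [8, 4],
    'name': ['car a', 'car b'],
})

numeric = toy[['mpg', 'cylinders']].to_numpy()  # floats and ints share a float array
mixed = toy.to_numpy()                          # numbers and text need an object array

print(numeric.dtype)
print(mixed.dtype)
```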

In [None]:
df[(df.cylinders > 4) & (df.mpg > 30)]  # we can make selection cuts
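
A selection cut is indexing with a boolean series. The sketch below, on a made-up frame (illustrative values only), builds the mask explicitly. Note that the element-wise operators `&`, `|` and `~` are used instead of `and`, `or` and `not`, and that each comparison needs its own parentheses.

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    'mpg': [18.0, 32.0, 35.5, 24.0],
    'cylinders': [8, 6, 4, 4],
})

# One boolean value per row: True where both conditions hold
mask = (toy['cylinders'] > 4) & (toy['mpg'] > 30)
print(mask.tolist())

selected = toy[mask]   # rows where the mask is True
rejected = toy[~mask]  # ~ inverts the mask
print(selected)
print(rejected)
```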

In [None]:
df.T  # transpose of the data frame

In [None]:
# sort rows by values in a column
df.sort_values(by='mpg', ascending=False).head()

In [None]:
df.iloc[:, 2:4]  # use the iloc property for NumPy-style positional indexing

In [None]:
df.iloc[:, 2:4].to_numpy()  # we can convert to raw numpy arrays
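
`iloc` selects by integer position and excludes the end of a slice, as in NumPy. Its label-based counterpart is `loc`, which selects by index and column names and includes the end of a slice. A small sketch on a made-up frame (illustrative values and index labels):

```python
import pandas as pd

# Made-up frame with string row labels
toy = pd.DataFrame(
    {'mpg': [18.0, 32.0, 35.5],
     'weight': [3504, 2130, 1985],
     'year': [70, 80, 81]},
    index=['a', 'b', 'c'],
)

by_position = toy.iloc[0:2, 1:3]              # rows 0-1, columns 1-2 (end excluded)
by_label = toy.loc['a':'b', 'weight':'year']  # rows a-b, columns weight-year (end included)

print(by_position.equals(by_label))
```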

#### Column and index names

It is often useful to access and manipulate the column and row names of a data frame. In particular, we often want to use a specific column as a row index or name.

In [None]:
df.columns  # access the column names.

In [None]:
df.index.values  # access the row indices or names

In [None]:
df.rename({'cylinders': 'pistons', 'name': 'model'},
          axis='columns', inplace=True)
df.head()

In [None]:
df.set_index('model', inplace=True)
df.head()
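
Once a column serves as the index, rows can be looked up by name with `loc`. The sketch below uses a made-up frame (illustrative values only) and the form without `inplace=True`, which returns a new frame and leaves the original untouched; `reset_index` moves the index back into an ordinary column.

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    'model': ['car a', 'car b', 'car c'],
    'mpg': [18.0, 25.0, 31.5],
})

indexed = toy.set_index('model')    # returns a new frame; toy is unchanged
print(indexed.loc['car b', 'mpg'])  # look up a value by row name and column name

restored = indexed.reset_index()    # 'model' becomes a regular column again
print(restored.columns.tolist())
```

Avoiding `inplace=True` also means a notebook cell can be re-run safely, since the original frame still contains the column that is being turned into the index.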