#### Introduction to Statistical Learning, Lab 1.2

# Data Access: Introduction to Pandas & ISLPy

Pandas is the Python library of choice for accessing and manipulating datasets. It provides I/O in various file formats and a powerful interface to datasets. 

You can always refer to the latest [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html). If you have not done so already, you should work through the excellent [10 minutes to pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html). This will likely take more than 10 minutes, though.

Here we will only review some basic operations to get us started. More features will be
introduced in later labs when the need arises. We will also introduce the `islpy` library that provides efficient access to the datasets used in this course.


Pandas is conventionally imported like this:

In [None]:
import pandas as pd

We will also need NumPy and the `os` module (for file path manipulation):

In [None]:
import numpy as np
import os

#### Loading Data into Pandas

When we load data into pandas it will create a data frame. 
Data frames are the core structure of pandas. We use data frames to access and manipulate datasets. You can think of a pandas data frame as a table or spreadsheet. Typically, the
rows are the observations in a dataset and the columns are the variables (features).
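
To make the table analogy concrete, here is a minimal sketch that builds a small data frame by hand. The name `toy` and all values in it are made up for illustration; they are not taken from any course dataset.

```python
import pandas as pd

# A toy data frame: each dict key becomes a column (a variable),
# and each position in the lists becomes a row (an observation).
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 36.0],
    'cylinders': [8, 8, 4],
    'name': ['car a', 'car b', 'car c'],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the variable names
```

Each column holds values of a single type, while different columns may have different types, which is what distinguishes a data frame from a plain NumPy array.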

Pandas can load data from a [variety of file formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). 
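
As a rough sketch of how the I/O functions fit together, the cell below reads CSV text from memory and writes it back out. The CSV content and the names `csv_text`, `toy` and `roundtrip` are made up for illustration, so the cell runs without any files on disk.

```python
import io
import pandas as pd

# Made-up CSV content held in memory, standing in for a file on disk.
csv_text = "mpg,cylinders,name\n18.0,8,car a\n36.0,4,car b\n"

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.dtypes)  # column types are inferred from the text

# Writing uses the matching to_* method; without a path, to_csv returns a string.
csv_again = toy.to_csv(index=False)
roundtrip = pd.read_csv(io.StringIO(csv_again))
print(roundtrip)
```

The other formats follow the same pattern of a `pd.read_*` function paired with a `DataFrame.to_*` method.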

Load the `Auto` dataset csv file into pandas:

In [None]:
datasets_dir = '../../datasets'
auto_path = os.path.join(datasets_dir, 'Auto.csv')
df = pd.read_csv(auto_path)
type(df)

Load the `Khan` dataset file into pandas. This file has *many* columns.

In [None]:
khan_path = os.path.join(datasets_dir, 'Khan_xtrain.csv')
df = pd.read_csv(khan_path)

#### The ISLPy Library

To speed things up and remove the need to set the correct file paths, we provide the `islpy` library. It facilitates access to the datasets used in this course and loads the larger datasets a little faster.

Let's load the `Khan` dataset using the `islpy.datasets` module.

In [None]:
from islpy import datasets
xtrain, ytrain, xtest, ytest = datasets.Khan()

We can get help about any dataset in the `islpy.datasets` module:

In [None]:
help(datasets.Khan)

Or for the whole module, as usual:

In [None]:
help(datasets)

The following examples assume the `Auto` dataset, so we reload it through the `islpy.datasets` module.

In [None]:
help(datasets.Auto)

In [None]:
df = datasets.Auto()

#### Inspecting, Selecting & Rearranging Data

The data in data frames can be accessed, selected and rearranged in various ways. We explore a few of them on the `Auto` dataset.

In [None]:
df.head() # the first five rows

In [None]:
df.tail() # the last five rows

In [None]:
df.head(2) # the first two rows

In [None]:
df.tail(11) # the last 11 rows

In [None]:
df.describe() # summary statistics for the numeric columns

In [None]:
c = df['mpg'] # we can use the column names as keys
c.head()

In [None]:
type(c) # a data frame column is a pandas series object

In [None]:
print(df['mpg'].median(), df['mpg'].quantile(0.25)) # we can ask for column properties

In [None]:
df.mpg.head() # we can also use column names like attributes

In [None]:
df.keys() # all column names/keys

In [None]:
s = df[['mpg', 'name']] # we can select multiple columns
s.head()

In [None]:
df[3:5] # we can select rows by slicing (end-exclusive, just like Python lists)

In [None]:
df[3:5][['mpg', 'name']] # the selection methods can be combined
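
The following sketch looks a little closer at what a combined row and column selection returns. It uses a made-up data frame `toy` (illustrative values, not the `Auto` data) and compares the result with `iloc`, the explicit position-based indexer in pandas.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 36.0, 24.0, 31.0, 20.0],
    'name': ['car a', 'car b', 'car c', 'car d', 'car e', 'car f'],
})

# Row slice first, then column selection.
chained = toy[3:5][['mpg', 'name']]
print(list(chained.index))  # the rows keep their original index labels

# The same rows selected with the explicit positional indexer.
direct = toy.iloc[3:5][['mpg', 'name']]
print(chained.equals(direct))
```

The slice selects rows by position and excludes the end position, so two rows come back, still carrying the index labels they had in `toy`.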

In [None]:
x = df[['mpg', 'weight', 'acceleration']].to_numpy() # we can create a numpy array
x[2:7]

In [None]:
df[(df.cylinders > 4) & (df.mpg > 30)] # we can make selection cuts
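
To see how such a selection works, the sketch below splits it into two steps on a made-up data frame `toy` (illustrative values, not the `Auto` data): first build a boolean mask, then use the mask to pick rows.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'mpg': [18.0, 32.0, 36.0, 24.0],
    'cylinders': [8, 6, 4, 6],
})

# Each comparison yields a boolean Series with one entry per row.
# & combines them element-wise; the parentheses are needed because
# & binds more tightly than the comparison operators.
mask = (toy.cylinders > 4) & (toy.mpg > 30)
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
selected = toy[mask]
print(selected)
```

Use `|` for an element-wise "or" and `~` to negate a mask; the Python keywords `and`, `or` and `not` do not work element-wise on a Series.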

In [None]:
df.T # transpose of the data frame

In [None]:
df.sort_values(by='mpg', ascending=False).head() # sort rows by values in a column 
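
One detail worth knowing about sorting: by default `sort_values` returns a new data frame and leaves the original untouched. A minimal sketch on a made-up data frame `toy` (illustrative values only):

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame({
    'mpg': [18.0, 36.0, 24.0],
    'name': ['car a', 'car b', 'car c'],
})

by_mpg = toy.sort_values(by='mpg', ascending=False)
print(list(by_mpg.index))  # rows are reordered, index labels travel with them
print(list(toy.index))     # the original frame is unchanged

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... if that is preferred.
renumbered = by_mpg.reset_index(drop=True)
print(renumbered)
```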