## Overview 

Today I will talk about some basic functions and usage of `Python`, `Numpy` and `Pandas`, which are popular tools for a wide range of applications. Because of the time limit, the goal of today's lab is to give an overview and familiarize you with some of the most basic usage. For a more in-depth understanding of their functionality, there are many great resources. Some content in this lab is adapted from the ISLP (Introduction to Statistical Learning in Python) lab: [https://github.com/intro-stat-learning/ISLP_labs/blob/stable/Ch02-statlearn-lab.ipynb](https://github.com/intro-stat-learning/ISLP_labs/blob/stable/Ch02-statlearn-lab.ipynb). I will share this notebook on Canvas after class.

# 1. Introduction to Python

`Python` tutorial: [docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/). 

## 1.1 Comments

Comments in Python start with the **#** symbol and are used to explain code or make notes. Comments are ignored by the Python interpreter.

In [20]:
# This is a comment
print("hello")

hello


## 1.2 Variables 

Variables are used to store data. In Python, you don’t need to declare the data type of a variable explicitly. Python will automatically infer the data type based on the value assigned to the variable.

In [22]:
a = 10 
a

10
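
As a small illustration of type inference, the built-in `type()` function reports the type Python assigned to a value. The variable names below are made up for this example.

```python
# Python infers the type from the assigned value
count = 10        # an integer
price = 2.5       # a float
label = "ten"     # a string

count_type = type(count).__name__
price_type = type(price).__name__
label_type = type(label).__name__
print(count_type, price_type, label_type)

# A name can later be rebound to a value of a different type
count = "now a string"
rebound_type = type(count).__name__
print(rebound_type)
```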

## 1.3 Data types

Python supports various data types, including integers, floats, strings, lists, tuples, dictionaries, and more.

Integers: Whole numbers without decimals. \
Floats: Numbers with decimals. \
Strings: Text enclosed in single or double quotes. \
Lists: Ordered collections of items. \
Tuples: Immutable collections of items. \
Dictionaries: Key-value pairs. \
Sets: Unordered collections of unique items.

In [27]:
list_1 = [1, "A", True]
list_1[0]
list_1[:2]
list_1.append(2) 

In [28]:
tup = (0, 1)
tup.append(2)

AttributeError: 'tuple' object has no attribute 'append'

In [30]:
set_1 = {1, 2, 3, 1}
set_1

{1, 2, 3}

In [34]:
dict_1 = {"a": 1, 
         "b": 2}
dict_1.keys()
dict_1.values()
dict_1["a"] 

1

## 1.4 Indentation

Python uses indentation to define blocks of code, such as loops and functions. Use four spaces for indentation. Incorrect indentation can lead to syntax errors.

In [35]:
def greet(name):
print("hello", "name")


IndentationError: expected an indented block after function definition on line 1 (2551259297.py, line 2)
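
For comparison, here is one way the function could be written with a correctly indented body. This is only a sketch: it returns the greeting instead of printing it, and it uses the `name` argument rather than the literal string `"name"`.

```python
def greet(name):
    # The body is indented by four spaces, so it belongs to the function
    return "hello " + name

message = greet("world")
print(message)
```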

## 1.5 Operators 

Python supports various operators, including arithmetic, comparison, logical, and assignment operators.

Arithmetic operators: +, -, *, /, %, ** (exponentiation), // (floor division). \
Comparison operators: ==, !=, <, >, <=, >=.  \
Logical operators: and, or, not. \
Assignment operators: =, +=, -=, *=, /=, %=, **=, //=. \
Bitwise operators: &, |, ^, ~, <<, >>. \
Strings: Strings can be enclosed in single or double quotes. You can use the + operator to concatenate strings.

In [None]:
a = 0 
a = a + 1
a *= 1
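
The short sketch below tries out a few of the operators listed above. The variable names are chosen only for illustration.

```python
a = 7
b = 2

quotient = a / b          # true division always returns a float
floor_quotient = a // b   # floor division drops the fractional part
remainder = a % b         # remainder after division
power = a ** b            # exponentiation

is_larger = a > b                    # comparison returns a Boolean
both_positive = (a > 0) and (b > 0)  # logical operator on two Booleans
greeting = "hello" + " " + "world"   # + concatenates strings

a += 1                    # shorthand for a = a + 1
```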

## 1.6 Control flow

Python supports various control flow structures, such as if-else statements, loops, and more.

### 1.6.1 if-else statement
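
A minimal sketch of an if-else statement, with an optional `elif` branch in between. Exactly one of the indented blocks runs, depending on which condition is true first. The variable names are made up for this example.

```python
x = -3

if x > 0:
    sign = "positive"
elif x < 0:
    sign = "negative"
else:
    sign = "zero"

print(sign)
```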

### 1.6.2 for loops

for var in iterable:
    # statements
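
As a concrete sketch of this pattern, the loops below iterate over a list and over `range()`. The names `total` and `squares` are chosen only for illustration.

```python
# Sum the items of a list
total = 0
for value in [1, 2, 3, 4]:
    total += value
print(total)

# range(3) yields 0, 1, 2
squares = []
for i in range(3):
    squares.append(i ** 2)
print(squares)
```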

### 1.6.3 while loops

while expression:
    statement(s)
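
A small sketch of a while loop: the body repeats for as long as the condition stays true, so the body must eventually make the condition false. The variable names are illustrative.

```python
n = 10
halvings = 0

# Keep halving n (with floor division) until it reaches 1
while n > 1:
    n = n // 2
    halvings += 1

print(halvings)
```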

## 1.7 Functions 

Functions are blocks of code that perform a specific task. You can define your own functions using the def keyword.
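
As a sketch, the function below takes one required argument and one optional argument with a default value. The function name `mean` and its `default` parameter are made up for this example.

```python
def mean(values, default=0.0):
    """Return the arithmetic mean of `values`, or `default` if it is empty."""
    if len(values) == 0:
        return default
    return sum(values) / len(values)

avg = mean([1, 2, 3, 4])
empty_avg = mean([], default=-1.0)
print(avg, empty_avg)
```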

# 2. Numpy

The `numpy` *library*, or *package*, is widely used in the field of machine learning and data science. A package is a collection of modules that are not necessarily included in the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*.

To access `numpy`, we must first `import` it. In the cell below, we import the `numpy` *module* under the name `np`, an abbreviation for easier referencing.

In `numpy`, an *array* is a generic term for a multidimensional set of numbers. The *ndarray* is at the core of the Numpy package. 

In [37]:
import numpy as np

### 2.1 Array creation
The simplest way to create an array is from a Python list. We use the `np.array()` function to define `x` and `y`, which are one-dimensional arrays, i.e. vectors.

In [41]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

We could also create an array using the `np.arange()` function.

In [42]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In `numpy`, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. (It is also possible to create matrices using `np.matrix()`.)
We can create a two-dimensional array as follows.

In [44]:
x = np.array([[1,2], [3,4]])
x

array([[1, 2],
       [3, 4]])

The object `x` has several 
*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`
with the name of the attribute. 
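For instance, we can access the `ndim`, `shape` and `dtype` attributes of `x` as follows. Only the value of the last expression in a cell is displayed, so the output below is that of `x.dtype`.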

In [47]:
x.ndim
x.shape
x.dtype

dtype('int64')

### 2.2 Basic operations
Arithmetic operators on arrays apply *elementwise*. A new array is created and filled with the result.

In [53]:
a = np.array([1,2,3,4])
b = np.arange(4)
a - b 

array([1, 1, 1, 1])

A *method* is a function that is associated with an
object. 
For instance, given an array `x`, the expression
`x.sum()` sums all of its elements, using the `sum()`
method for arrays. 
The call `x.sum()` automatically provides `x` as the
first argument to its `sum()` method.

In [54]:
a.sum()

10

We could also sum the elements of `a` by passing in `a` as an argument to the `np.sum()` function.

In [55]:
np.sum(a)

10

As another example, the
`reshape()` method returns a new array with the same elements as
the original array, but a different shape.
We do this by passing in a `tuple` in our call to
`reshape()`, in this case `(2, 3)`. This tuple specifies that we would like to create a two-dimensional array with
$2$ rows and $3$ columns. (Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.)

In [57]:
a = np.array([1,2,3,4,5,6])
a.reshape((2,3))

array([[1, 2, 3],
       [4, 5, 6]])

The previous output reveals that `numpy` arrays are specified as a sequence
of *rows*. This is  called *row-major ordering*, as opposed to *column-major ordering*. 
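
To make the difference between the two orderings concrete, the sketch below reshapes the same six values both ways. It relies on the `order` argument of `reshape()`, where `"C"` (the default) means row-major and `"F"` means column-major; the variable names are made up for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Default row-major ("C") order: fill the first row, then the second
row_major = a.reshape((2, 3))

# Column-major ("F") order: fill the first column, then the next one
col_major = a.reshape((2, 3), order="F")

print(row_major)
print(col_major)
```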

`Python` (and hence `numpy`) uses 0-based
indexing. This means that to access the top left element of the reshaped array,
we index it with `[0,0]`.

In [58]:
a.reshape((2,3))[0,0]

1

### Breakout exercise

Take 2-3 minutes to create some arrays and look at their shapes. Test out the built-in functions and inspect the attributes. You can go to Numpy's help
page to look at a comprehensive list of functions and attributes.


### 2.3 Generating random values
The `np.random.normal()` function generates a vector of random
normal variables. The function's arguments are `loc`, `scale`, and `size`. These are *keyword* arguments, which means that when they are passed into the function, they can be referred to by name (in any order). (`Python` also uses *positional* arguments. Positional arguments do not need to use a keyword. To see an example, type in `np.sum?`. We see that `a` is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, `axis` and `dtype` are keyword arguments: the position in which these arguments are entered into `np.sum()` does not matter.)
By default, this function will generate random normal variable(s) with mean (`loc`) $0$ and standard deviation (`scale`) $1$; furthermore,
a single random variable will be generated unless the argument to `size` is changed.

We first look at the function's documentation, and then generate 50 independent random variables from a $N(0,1)$ distribution.

In [69]:
np.random.normal?

[0;31mDocstring:[0m
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distributions occurs often in nature.  For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

.. note::
    New code should use the ``normal`` method of a ``default_rng()``
    instance instead; please see the :ref:`random-quick-start`.

Parameters
----------
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution. Must be
    non-negative.
size : int or tuple of ints, optio

In [60]:
a = np.random.normal(size=50)

In [61]:
np.mean(a)

-0.06964949261835635

In [62]:
np.var(a)

0.6819225079951642
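
The docstring above notes that new code should use the `normal` method of a `default_rng()` instance. A minimal sketch of that approach follows; the seed value `0` and the variable names are arbitrary choices for this example. Seeding the generator makes the draws reproducible from run to run.

```python
import numpy as np

# Create a random number generator with a fixed (arbitrary) seed
rng = np.random.default_rng(seed=0)

# Draw 50 independent N(0, 1) values
draws = rng.normal(loc=0.0, scale=1.0, size=50)
sample_mean = draws.mean()
sample_var = draws.var()
print(sample_mean, sample_var)

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=0)
same_draws = rng_again.normal(size=50)
```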

### Slicing and Indexing

Slice notation is used to index sequences such as lists, tuples and arrays.
Suppose we want to retrieve the fourth through sixth (inclusive) entries
of a string. We obtain a slice of the string using the indexing  notation  `[3:6]`.

In [63]:
"hello world"[3:6]

'lo '
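
The same slice notation works on lists, as the following sketch shows. The list `letters` is made up for this example. A slice `start:stop:step` includes `start` but excludes `stop`, and any of the three parts may be omitted.

```python
letters = ["a", "b", "c", "d", "e", "f", "g"]

middle = letters[3:6]       # fourth through sixth items
first_two = letters[:2]     # omitted start means "from the beginning"
last_two = letters[-2:]     # negative indices count from the end
every_other = letters[::2]  # a step of 2 skips every second item

print(middle, first_two, last_two, every_other)
```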

### Indexing Rows, Columns, and Submatrices
To select multiple rows at a time, we can pass in a *list*
specifying our selection. For instance, indexing with `[1,3]` retrieves the second and fourth rows. We first create a matrix to index into:

In [68]:
# Let's first create a 4 by 4 matrix
A = np.arange(16).reshape((4,4))
A

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
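
With `A` defined as above, the row selection described earlier can be sketched as follows. The variable name `rows_1_3` is made up for this example.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))

# Passing a list of row indices selects the second and fourth rows
rows_1_3 = A[[1, 3]]
print(rows_1_3)
```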

To select the first and third *columns*, we pass in  `[0,2]` as the second argument in the square brackets.
In this case we need to supply the first argument `:` 
which selects all rows.

In [69]:
A[:,[0,2]]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14]])

### Indexing Submatrices

Now, suppose that we want to select the submatrix made up of the first and fourth
rows as well as the first and third columns. This is where
indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:

In [70]:
A[[0,3],[0,2]]

array([ 0, 14])

We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.

One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.


In [72]:
A[[0,3]][:,[0,2]]

array([[ 0,  2],
       [12, 14]])

There are more efficient ways of achieving the same result.

The *convenience function* `np.ix_()` allows us  to extract a submatrix
using lists, by creating an intermediate *mesh* object.

In [74]:
idx = np.ix_([1,3], [0,2,3])
A[idx]

array([[ 4,  6,  7],
       [12, 14, 15]])

Alternatively, we can subset matrices efficiently using slices.
  
The slice
`1:4:2` captures the second and fourth items of a sequence, while the slice `0:3:2` captures
the first and third items (the third element in a slice sequence is the step size).

In [75]:
A[1:4:2, 0:3:2]

array([[ 4,  6],
       [12, 14]])

### Boolean indexing

In `numpy`, a *Boolean* is a type  that equals either   `True` or  `False` (also represented as $1$ and $0$, respectively).
The next line creates a vector of $0$'s, represented as Booleans, of length equal to the first dimension of `A`. 

In [78]:
keep_rows = np.zeros(4, bool)
keep_rows

array([False, False, False, False])

We now set the second and fourth entries of `keep_rows` to `True`, so that it selects the second and fourth rows of `A`, i.e. the rows for which the Boolean equals `True`.

In [80]:
keep_rows[[1,3]] = True 
keep_rows

array([False,  True, False,  True])

We again make use of the `np.ix_()` function
to create a mesh containing the second and fourth rows, and the first, third, and fourth columns. This time, the rows are specified with Booleans
rather than a list.

In [82]:
idx_bool = np.ix_(keep_rows, [0,2,3])
A[idx_bool]

array([[ 4,  6,  7],
       [12, 14, 15]])

Note that `np.ix_()` accepts a mix of index types: the call above combines an array of Booleans (`keep_rows`) with a list (`[0,2,3]`).
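
As a further sketch, Booleans can be used for the columns as well, and a list of row indices can be combined with a Boolean array for the columns. The name `keep_cols` is an assumption introduced for this example and is not defined elsewhere in the lab.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))

# Boolean selectors: True marks the rows/columns to keep
keep_rows = np.zeros(A.shape[0], bool)
keep_rows[[1, 3]] = True
keep_cols = np.zeros(A.shape[1], bool)
keep_cols[[0, 2, 3]] = True

# Booleans for both rows and columns
sub_bool = A[np.ix_(keep_rows, keep_cols)]

# A list for the rows mixed with Booleans for the columns
sub_mixed = A[np.ix_([1, 3], keep_cols)]

print(sub_bool)
print(sub_mixed)
```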

# 3. Pandas

In [84]:
# first import the pandas package 
import pandas as pd

### 3.1 Viewing the data

In [85]:
# load the data from the ISLP website

# URL of the CSV file
url = 'https://raw.githubusercontent.com/intro-stat-learning/ISLP_labs/stable/Auto.csv'

# Load the dataset into a pandas DataFrame
Auto = pd.read_csv(url)

In [86]:
# Look at the first few rows of the data set 
Auto.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [87]:
# Look at the last few rows of the data set 
Auto.tail()

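| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 387 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 388 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 389 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 390 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
| 391 | 31.0 | 4 | 119.0 | 82 | 2720 | 19.4 | 82 | 1 | chevy s-10 |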


In [88]:
# get the summary statistics for the data 
Auto.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 23.445918 | 5.471939 | 194.41199 | 104.469388 | 2977.584184 | 15.541327 | 75.979592 | 1.576531 |
| std | 7.805007 | 1.705783 | 104.644004 | 38.49116 | 849.40256 | 2.758864 | 3.683737 | 0.805518 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.0 | 4.0 | 105.0 | 75.0 | 2225.25 | 13.775 | 73.0 | 1.0 |
| 50% | 22.75 | 4.0 | 151.0 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 275.75 | 126.0 | 3614.75 | 17.025 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


In [89]:
# get data description 
Auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB


The `Auto.shape`  attribute tells us that the data has 392
observations, or rows, and nine variables, or columns.

In [90]:
# get the shape of the data
Auto.shape

(392, 9)

### 3.2 Selection

We can use `Auto.columns`  to check the variable names.

In [91]:
# Look at the columns of the data set 
Auto.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. 
Recall that the first argument to the `[]` method
is always applied to the rows of the array.  
Similarly, 
passing in a slice to the `[]` method creates a data frame whose *rows* are determined by the slice:

In [92]:
# select rows by slicing 
Auto[:3]

| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |


Similarly, an array of Booleans can be used to subset the rows:

In [96]:
# select rows by Booleans 
idx_80 = Auto['year'] > 80
Auto[idx_80];

However, if we pass  in a list of strings to the `[]` method, then we obtain a data frame containing the corresponding set of *columns*. 

In [98]:
# select multiple columns using list of strings 
Auto[["mpg","year"]];

Since we did not specify an *index* column when we loaded our data frame, the rows are labeled using integers
0 to 391.

In [99]:
# look at the data index 
Auto.index

RangeIndex(start=0, stop=392, step=1)

We can use the
`set_index()` method to re-name the rows using the contents of `Auto['name']`. 

In [102]:
# set row index using 'name' column
Auto_re = Auto.set_index("name")

In [104]:
# check column names again 
Auto_re.columns
Auto_re

| name | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ford mustang gl | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 |
| vw pickup | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 |
| dodge rampage | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 |
| ford ranger | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 |


Now that the index has been set to `name`, we can access rows of the data
frame by `name` using the `loc[]` method of
`Auto_re`:

In [105]:
# select rows based on row names
rows = ["amc rebel sst", "ford torino"]
Auto_re.loc[rows]

| name | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |


As an alternative to using the index name, we could retrieve the 4th and 5th rows of `Auto_re` using the `iloc[]` method:

In [108]:
# use `iloc` method to select rows
Auto_re.iloc[[3,4]]

| name | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |


We can also use it to retrieve the 1st, 4th and 5th columns of `Auto_re`:

In [109]:
# use `iloc` method to select columns 
Auto_re.iloc[:, [0,3,4]]

| name | mpg | horsepower | weight |
|---|---|---|---|
| chevrolet chevelle malibu | 18.0 | 130 | 3504 |
| buick skylark 320 | 15.0 | 165 | 3693 |
| plymouth satellite | 18.0 | 150 | 3436 |
| amc rebel sst | 16.0 | 150 | 3433 |
| ford torino | 17.0 | 140 | 3449 |
| ... | ... | ... | ... |
| ford mustang gl | 27.0 | 86 | 2790 |
| vw pickup | 44.0 | 52 | 2130 |
| dodge rampage | 32.0 | 84 | 2295 |
| ford ranger | 28.0 | 79 | 2625 |


We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using
a single call to `iloc[]`:

In [110]:
# use `iloc` to select both rows and columns; this wouldn't work in Numpy! 
Auto_re.iloc[[3,4], [0,2,3]]

| name | mpg | displacement | horsepower |
|---|---|---|---|
| amc rebel sst | 16.0 | 304.0 | 150 |
| ford torino | 17.0 | 302.0 | 140 |


Suppose now that we want to create a data frame consisting of the `weight` and `origin` of the subset of cars with
`year` greater than 80, i.e. those built after 1980.
To do this, we first create a Boolean array that indexes the rows.
The `loc[]` method allows for Boolean entries as well as strings:

In [111]:
# select rows based on boolean values 
idx_80 = Auto_re['year'] > 80
Auto_re.loc[idx_80, ['weight', 'origin']]

| name | weight | origin |
|---|---|---|
| plymouth reliant | 2490 | 1 |
| buick skylark | 2635 | 1 |
| dodge aries wagon (sw) | 2620 | 1 |
| chevrolet citation | 2725 | 1 |
| plymouth reliant | 2385 | 1 |
| toyota starlet | 1755 | 3 |
| plymouth champ | 1875 | 1 |
| honda civic 1300 | 1760 | 3 |
| subaru | 2065 | 3 |
| datsun 210 mpg | 1975 | 3 |


To do this more concisely, we can use an anonymous function called a `lambda`: 

In [112]:
# select rows using `lambda` function 
Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']]

| name | weight | origin |
|---|---|---|
| plymouth reliant | 2490 | 1 |
| buick skylark | 2635 | 1 |
| dodge aries wagon (sw) | 2620 | 1 |
| chevrolet citation | 2725 | 1 |
| plymouth reliant | 2385 | 1 |
| toyota starlet | 1755 | 3 |
| plymouth champ | 1875 | 1 |
| honda civic 1300 | 1760 | 3 |
| subaru | 2065 | 3 |
| datsun 210 mpg | 1975 | 3 |


In summary, a powerful set of operations is available to index the rows and columns of data frames. For integer based queries, use the `iloc[]` method. For string and Boolean
selections, use the `loc[]` method. For functional queries that filter rows, use the `loc[]` method
with a function (typically a `lambda`) in the rows argument.
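
The summary above can be tried out on a tiny data frame that needs no download. The data frame `cars`, its values, and its row labels are made up purely for illustration and are not taken from the `Auto` data.

```python
import pandas as pd

# A small, hypothetical data frame with string row labels
cars = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 44.0, 32.0],
        "weight": [3504, 3693, 2130, 2295],
        "year": [70, 70, 82, 82],
    },
    index=["car_a", "car_b", "car_c", "car_d"],
)

# Integer-based query: rows 0 and 2, columns 0 and 1
by_position = cars.iloc[[0, 2], [0, 1]]

# String-based query: rows and columns selected by label
by_label = cars.loc[["car_b", "car_d"], ["mpg", "year"]]

# Boolean query: rows where the condition holds
by_mask = cars.loc[cars["year"] > 80, ["weight"]]

# Functional query: the lambda receives the data frame itself
by_function = cars.loc[lambda df: df["year"] > 80, ["weight"]]

print(by_position)
print(by_label)
print(by_mask)
```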