## Importing numpy

In [1]:
import numpy as np

## Array: The Fundamental Data Structure in Numpy

Numpy is fundamentally based on arrays, N-dimensional data structures. Here we mainly stay with one- and two-dimensional structures (vectors and matrices) but the arrays can also have higher dimension (called tensors). Besides arrays, numpy also provides a plethora of functions that operate on the arrays, including vectorized mathematics and logical operations.

Arrays can be created with np.array. For instance, we can create a 1-D vector of numbers from 1 to 4 by feeding a list of the desired numbers to np.array:




In [2]:
a = np.array([1,2,3,4])
print("a:\n", a)

a:
 [1 2 3 4]


Note that it is printed in brackets like a list, but unlike a list, it does not have commas separating the components.

If we want to create a matrix (two-dimensional array), we can feed np.array with a list of lists, one sublist for each row of the matrix:

In [3]:
b = np.array([[1,2], [3,4]])
print("b:\n", b)

b:
 [[1 2]
 [3 4]]


The output does not have the best formatting but it is clear enough.

One of the fundamental properties of an array is its dimensions, called shape in numpy. Shape is the array’s size along each of its dimensions. It can be queried with the attribute .shape, which returns the sizes in the form of a tuple:

In [4]:
a.shape

(4,)

In [5]:
b.shape

(2, 2)

One can see that vector a has a single dimension of size 4, and matrix b has two dimensions, both of size 2 (remember: (4,) is a tuple of length 1!).

One can also reshape arrays, i.e. change their shape into another compatible shape. This can be achieved with the .reshape() method, which takes one argument, the new shape of the array (as a tuple). For instance, we can reshape the length-4 vector into a 2x2 matrix as

In [6]:
a.reshape((2,2))

array([[1, 2],
       [3, 4]])

and we can “straighten” matrix b into a vector with:

In [7]:
b.reshape((4,))

array([1, 2, 3, 4])
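
A new shape is compatible only if it holds the same number of elements as the original array. The following is a minimal sketch of this rule (the variable names are made up for illustration): one reshape that works, one where numpy infers a dimension marked with -1, and one that fails:

```python
import numpy as np

v = np.arange(6)                 # 6 elements: 0, 1, ..., 5

m = v.reshape((2, 3))            # fine: 2 * 3 == 6
inferred = v.reshape((3, -1))    # -1 asks numpy to work out this dimension

try:
    v.reshape((4, 2))            # 4 * 2 == 8, not 6: incompatible shape
    failed = False
except ValueError:
    failed = True

print(m.shape, inferred.shape, failed)
```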

## Creating Arrays

Sometimes it is practical to create arrays manually as we did above, but more often we need to create them by computation. Below we list a few options.

np.arange creates sequences, quite a bit like range, but the result will be a numpy vector. If needed, we can reshape the vector into a desired format:

In [8]:
np.arange(10)  # vector of length 10

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
np.arange(10).reshape((2,5))  # 2x5 matrix

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

np.zeros and np.ones create arrays filled with zeros and ones respectively:

In [10]:
np.zeros((5,))

array([0., 0., 0., 0., 0.])

In [11]:
np.ones((2,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Arrays can be combined in different ways, e.g. np.column_stack combines them as columns (next to each other), and np.row_stack (an alias of np.vstack) combines them as rows (underneath each other). For instance, we can combine a column of ones and two columns of zeros as follows:

In [12]:
oneCol = np.ones((5,))  # a single vector of ones
zeroCols = np.zeros((5,2))  # two columns of zeros
np.column_stack((oneCol, zeroCols))  # 5x3 matrix

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])

Note that column_stack expects all arrays to be passed as a single tuple (or list).
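
Stacking rows works in the same way. The sketch below uses np.vstack, the function that np.row_stack is an alias for; the array names are invented for illustration:

```python
import numpy as np

top = np.ones((2, 3))       # two rows of ones
bottom = np.zeros((1, 3))   # one row of zeros

# Rows are placed underneath each other; the result is a 3x3 matrix
stacked = np.vstack((top, bottom))
print(stacked)
```

As with column_stack, the arrays are passed as a single tuple, and they must agree in the dimension that is not being stacked (here the number of columns).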

## Vectorized Functions (Universal Functions)

It is possible to use loops to do computations with numpy objects in exactly the same way as when working with lists. However, one should use vectorized operations instead whenever possible. Vectorized operations are easier to code, easier to read, and result in faster code.

Numpy offers a plethora of vectorized functions and operators, called universal functions. Many of these work as expected, for instance the mathematical operations. We create a matrix, then add 100 to it, and then raise 2 to the power of its values:

In [13]:
a = np.arange(12).reshape((3,4))
print(a)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [14]:
print(100 + a, "\n")

[[100 101 102 103]
 [104 105 106 107]
 [108 109 110 111]] 



In [15]:
print(2**a, "\n")  # remember: exponent with **, not with ^

[[   1    2    4    8]
 [  16   32   64  128]
 [ 256  512 1024 2048]] 



Both of these mathematical operations, + and ** are performed elementwise for every single element of the matrix.
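
To see what “elementwise” means, we can compare the vectorized expression with an explicit loop. This is only an illustrative sketch; the names looped and vectorized are made up:

```python
import numpy as np

a = np.arange(12).reshape((3, 4))

# Loop version: visit every element explicitly
looped = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = 100 + a[i, j]

# Vectorized version: a single expression, no explicit loop
vectorized = 100 + a

print(np.array_equal(looped, vectorized))
```

Both versions give the same matrix, but the vectorized one is a single line and leaves the looping to numpy.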

Create a matrix of even numbers:

In [16]:
2 + 2*np.arange(20).reshape(4,5)

array([[ 2,  4,  6,  8, 10],
       [12, 14, 16, 18, 20],
       [22, 24, 26, 28, 30],
       [32, 34, 36, 38, 40]])

Comparison operators are vectorized too:

In [17]:
a > 6

array([[False, False, False, False],
       [False, False, False,  True],
       [ True,  True,  True,  True]])

In [18]:
a == 7

array([[False, False, False, False],
       [False, False, False,  True],
       [False, False, False, False]])

As comparison operators are vectorized, one might expect that the other logical operators, and, or and not, are also vectorized. But this is not the case. There are vectorized logical operators, but they differ from the base python version. These are more similar to corresponding operators in R or C, namely & for logical and, | for logical or, and ~ for logical not:

In [19]:
(a < 3) | (a > 8)  # logical or

array([[ True,  True,  True, False],
       [False, False, False, False],
       [False,  True,  True,  True]])

In [None]:
(a > 4) & (a < 7)  # logical and

In [None]:
~(a > 6)  # logical not

There is no vectorized multi-way comparison like 1 < x < 2.
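
The usual workaround is to combine two comparisons with &. The sketch below (the array x is made up) also shows that the chained form fails, because python evaluates it with the non-vectorized and:

```python
import numpy as np

x = np.array([0.5, 1.5, 2.5])

try:
    1 < x < 2            # python treats this as (1 < x) and (x < 2)
    chained_ok = True
except ValueError:       # "and" cannot decide the truth value of a whole array
    chained_ok = False

between = (x > 1) & (x < 2)   # vectorized replacement
print(chained_ok, between)
```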

## Array Indexing and Slicing

Indexing refers to extracting elements based on their position or certain criteria. This is one of the fundamental operations with arrays. There are two ways to extract elements: based on position, and based on logical criteria. Unfortunately, this also makes indexing somewhat confusing, and it takes some time to become familiar with.

### Extracting elements based on position

Array indexing is very similar to list indexing. We start with a one-dimensional vector, where a single index or slice is enough:

In [None]:
a = np.arange(12)
a

In [None]:
print(a[::2])  # every second element

However, unlike lists, one can do vectorized assignments in numpy:

In [None]:
a[5:11] = -1  # assign multiple elements
a

One can also extract multiple elements from a vector:

In [None]:
a[[4,5,7]]  # extract 3 elements in one go

When working with matrices (2-D arrays), we need two indices, separated by a comma:

In [None]:
c = np.arange(12).reshape((3,4))
c

In [None]:
c[1,2]  # 2nd row, 3rd column

In [None]:
c[1] # 2nd row

A comma can separate not just two indices but also two slices, so we can write

In [None]:
c[:,2]  # all rows, 3rd column

In [None]:
c[:2]  # 1st, 2nd row

In [None]:
c[:2, :3]  # 1st, 2nd row, first three columns

Create a matrix and access its rows and columns:

create a 4x5 array of even numbers: 10, 12, 14, …
extract the third column
set the fourth row to 1,2,3,4,5

In [None]:
a = 10 + 2*np.arange(20).reshape((4,5))  # create a 4x5 array of even numbers: 10, 12, 14, …
a

In [None]:
a[:,2]  # extract the third column

In [None]:
a[3] = [1,2,3,4,5]  # set the fourth row to 1,2,3,4,5

In [None]:
a

### Boolean indexing

An extremely widely used approach is to extract elements of an array based on a logical criterion. Fundamentally, it is just using a logical vector for indexing. The vector must be of the same length as the array in question, and the result contains only those elements that correspond to True in the indexing vector. Here is an example of how we can do this manually:

In [None]:
a = np.array([1,2,7,8])
i = np.array([True, False, True, False])
a[i]  # 1, 7

The previous example, manually creating a logical index vector of trues and falses, is hardly ever useful. Almost always we use logical operations instead. For instance, we can extract all elements of a that are greater than 5:

In [None]:
i = a > 5
i

In [None]:
a[i]

This is often written in a more compact manner by skipping explicit logical vector i:

In [None]:
a[a > 5]

New users of numpy (and other languages that support logical indexing) sometimes forget that the logical condition does not have to be related to the same array that we are attempting to extract. For instance, we can extract all results for a certain person:

In [None]:
names = np.array(["Cyrus", "Darius", "Xerxes", "Artaxerxes", "Cyrus", "Darius"])
results = np.array([17, 14, 20, 18, 13, 15])
results[names == "Darius"]

Here the index vector is based on the variable names only and is not directly related to results. However, we use it to extract values from the latter.

Finally, we also can extract rows (or columns) from a 2-D array in a fairly similar fashion:

In [None]:
names = np.array(["Cyrus", "Darius", "Xerxes"])
results = np.array([[17, 14], [20, 18], [13, 15]])
results

In [None]:
results[names == "Darius",:]
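
Columns can be selected in the same way by putting the logical vector after the comma. In this sketch the column labels in rounds are hypothetical, invented only to have something to compare against:

```python
import numpy as np

results = np.array([[17, 14], [20, 18], [13, 15]])   # 3 rows, 2 columns
rounds = np.array(["first", "second"])               # hypothetical column labels

# All rows, but only the columns where the condition is True
second = results[:, rounds == "second"]
print(second)
```

Note that the result is still a 2-D array (here with a single column), as the logical vector could have selected more than one column.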

Logical indexing can also be used on the left-hand-side of the expression, in order to replace elements. Below is an example where we replace all the negative elements of a with zero.

In [None]:
a = np.random.randn(2,3)
a

In [None]:
a[a < 0] = 0
a

When replacing elements in this fashion, we need to supply either a single replacement value (all selected elements are replaced by 0 in the example above), or a replacement vector of exactly the right length. As the number of positive elements left in a is random, we count them first and then replace them with 1, 2, 3, …:

In [None]:
n = np.sum(a > 0)  # number of positive elements
a[a > 0] = np.arange(1, n + 1)  # replacement vector must have length n
a

In [None]:
names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
score = np.array([126, 115, 130, 141, 132])

In [None]:
score[score < 130]  # extract all test scores that are smaller than 130

In [None]:
score[names == "Statira"]  # extract all test scores by Statira

In [None]:
score[names == "Roxana"] + 10  # add 10 points to Roxana’s scores (extract them first)
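
Adding 10 to the extracted scores creates a new array and leaves score unchanged. If the goal is to update the scores themselves, one possible approach is to use the logical index on the left-hand side. A small sketch (bumped is a made-up name):

```python
import numpy as np

names = np.array(["Roxana", "Statira", "Roxana", "Statira", "Roxana"])
score = np.array([126, 115, 130, 141, 132])

bumped = score[names == "Roxana"] + 10   # new array, score is unchanged
score[names == "Roxana"] += 10           # modifies score in place
print(bumped)
print(score)
```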

## Random numbers

Numpy offers a large set of random number generators. These can be invoked as np.random.generator(params, size), where generator stands for the name of the generator in question. For instance, np.random.choice(N) can be used to create random numbers from 0 to N−1. size determines the shape of the resulting object.

The argument is size, not shape, although it determines the output shape!

In [None]:
x = np.random.choice(6, size=5)
x

But maybe we prefer not to label the results as 0..5 but 1..6. So we can just add one to the result. Here is an example that creates 2-D array of die rolls:

In [None]:
1 + np.random.choice(6, size=(2,4))

Numpy can generate many other kinds of random values. Here we list a few more:

#### Random elements from list

In [None]:
nucleotides = ["A", "G", "C", "T"]
dna = np.random.choice(nucleotides, 20)
"".join(dna)

As the example demonstrates, np.random.choice picks random elements with replacement (use the replace argument to change this behavior).
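
For comparison, here is a small sketch of sampling without replacement. With replace=False every element can be picked only once, so we cannot ask for more elements than the list contains (the names shuffled and too_many are made up):

```python
import numpy as np

nucleotides = ["A", "G", "C", "T"]

# Four draws without replacement: each letter appears exactly once
shuffled = np.random.choice(nucleotides, 4, replace=False)
print("".join(shuffled))

try:
    np.random.choice(nucleotides, 5, replace=False)  # 5 distinct out of 4 is impossible
    too_many = False
except ValueError:
    too_many = True
print(too_many)
```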

#### Random normals

random.normal(loc, scale, size) generates normally distributed random numbers. The distribution is centered at loc and its standard deviation (not variance) is scale:

In [None]:
np.random.normal(1000, 100, size=10)
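
A quick empirical check of what loc and scale mean: with a large sample, the sample mean should be close to loc and the sample standard deviation close to scale. This is only a sketch; the seed value 42 is an arbitrary choice to make the run repeatable:

```python
import numpy as np

np.random.seed(42)   # arbitrary seed, only to make the sketch repeatable
x = np.random.normal(1000, 100, size=100_000)

print(x.mean())   # close to loc = 1000
print(x.std())    # close to scale = 100, the standard deviation
print(x.var())    # close to scale**2 = 10000, the variance
```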

#### Binomial random numbers

random.binomial(n, p, size) creates binomial random numbers where the probability of success is p and the number of trials is n:


In [None]:
np.random.binomial(2, 0.5, size=(2,4))

We can describe a coin toss as Binomial(1, 0.5) where 1 refers to the fact that we toss a single coin, and 0.5 means it has probability 0.5 to come heads up. So such random variables are sequences of zeros and ones. But how can we get a sequence of -1 and 1 instead? Demonstrate it on computer!

In [None]:
2*np.random.binomial(1, 0.5, size=10) - 1

#### Uniform random numbers

random.uniform(low, high, size) creates uniformly distributed random numbers in the half-open interval [low, high), i.e. low may occur but high is excluded:



In [None]:
np.random.uniform(-1, 1, size=(3,4))  # random numbers in [-1, 1)

#### Repeating the exact same random sequence

The random numbers are often called pseudorandom as they are not truly random: they are computed based on a well-defined algorithm, so when feeding the same initial values to the algorithm, one always gets the same random numbers. However, normally the initial values are taken from certain hard-to-control parameters outside of the program's control, such as time in microseconds and hard disk serial number, so in practice it is impossible to replicate the same sequence.

However, if you need to replicate your results exactly, you have to set the initial values explicitly using random.seed(value). This re-initializes the random number generator to the given initial state:

In [None]:
np.random.seed(1)
np.random.uniform(size=5)  # 1st batch of numbers

In [None]:
np.random.uniform(size=5)  # 2nd batch is different

In [None]:
np.random.seed(1)
np.random.uniform(size=5)  # repeat the 1st batch
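
Newer versions of numpy also provide generator objects, created with np.random.default_rng(seed), as an alternative to the global seed. The sketch below is not part of the workflow above, just an illustration that two generators with the same seed produce the same numbers (rng1, rng2 and dice are made-up names):

```python
import numpy as np

rng1 = np.random.default_rng(1)   # generator object with its own seed
rng2 = np.random.default_rng(1)   # second generator, same seed

first = rng1.uniform(size=5)
second = rng2.uniform(size=5)
print(np.array_equal(first, second))   # same seed, same numbers

# Die rolls 1..6: the upper bound 7 is excluded
dice = rng1.integers(1, 7, size=(2, 4))
print(dice)
```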

## Statistical functions

Numpy offers a set of basic statistical functions, including sum, mean, and standard deviation std. These can be applied to the array as a whole, or separately to rows or columns. In the latter case one has to specify the argument axis: axis=0 means to apply the operation along the rows, so the rows are collapsed and we get one result per column, and axis=1 means to apply the operation along the columns, so the columns are collapsed and we get one result per row. Here is an example:


In [None]:
a = np.arange(12).reshape((3,4))
a  # 3 rows, 4 columns

In [None]:
a.sum()  # total sum

In [None]:
a.sum(axis=0)  # add rows, preserve columns

In [None]:
a.sum(axis=1)  # add columns, preserve rows

The functions come in two forms: as a method x.sum(), and as a separate function np.sum(x). These two ways are pretty much equivalent.
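
As a small illustration of both forms and of the axis argument (col_sums and row_means are made-up names):

```python
import numpy as np

a = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

col_sums = a.sum(axis=0)         # method form: one sum per column (length 4)
row_means = np.mean(a, axis=1)   # function form: one mean per row (length 3)

print(col_sums)
print(row_means)
print(np.array_equal(a.sum(axis=0), np.sum(a, axis=0)))   # both forms agree
```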

By default, a missing value in an array causes the function to return a missing value:

In [None]:
a = a.astype(float)  # as np.nan is float, need a float array
a[1,2] = np.nan
a

In [None]:
np.sum(a)

This differs from the corresponding functionality in pandas where missing values are ignored by default!
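
If we want numpy to skip the missing values, we can use the nan-aware variants of the functions, such as np.nansum and np.nanmean. A minimal sketch with the same matrix as above:

```python
import numpy as np

a = np.arange(12).reshape((3, 4)).astype(float)
a[1, 2] = np.nan   # the same missing value as above

total = np.sum(a)                  # nan: the missing value propagates
total_skip = np.nansum(a)          # sum of the 11 non-missing values
col_means = np.nanmean(a, axis=0)  # column means, skipping the missing value

print(total, total_skip)
print(col_means)
```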

The other statistical functions include

mean for average
median for median
var for variance
std for standard deviation
np.percentile and np.quantile for quantiles
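
A short sketch of how these functions relate to each other, using a small made-up vector with one outlier. Note that np.percentile expects values on the 0..100 scale while np.quantile expects values on the 0..1 scale:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 100])

print(np.mean(x), np.median(x))    # the outlier pulls the mean, not the median
print(np.percentile(x, 50))        # 50th percentile is the median
print(np.quantile(x, 0.5))         # the same on the 0..1 scale
print(np.var(x), np.std(x) ** 2)   # variance is the squared standard deviation
```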

-----------------------------------

# Pandas

Pandas is the standard python library to work with dataframes.  It is typically imported as pd:

In [22]:
import pandas as pd

Pandas contains two central data types: Series and DataFrame. Series is often treated as a second-class citizen, just a single variable (column) in a data frame. But it can also be used as a vectorized dict that links keys (indices) to values. DataFrame is broadly similar to data frames as implemented in R or Spark. When you extract its individual columns and rows you normally get those in the form of Series. So it is extremely useful to know the basics of Series when working with data frames. Both DataFrame and Series include an index, a glorified row name, which is very useful for extracting information based on names, or for merging different variables into a data frame.

### Series

Series is a one-dimensional positional column (or row) of values. It is in some sense similar to list, but from another point of view it is more like a dict, as it contains index, and you can look up values based on index as a key. So it allows not only positional access but also index-based (key-based) access. In terms of internal structure, it is implemented with vectorized operations in mind, so it supports vectorized arithmetic, and vectorized logical, string, and other operations. Unlike dicts, it also supports multi-element extraction.

Let’s create a simple series:

In [23]:
s = pd.Series([1,2,5,6])
s

0    1
1    2
2    5
3    6
dtype: int64

Series is printed in two columns. The first one is the index, the second one is the value. In this example, the index is essentially just the row number and it is not very useful. This is because we did not provide any specific index and hence pandas picked just the row number. Underneath the two columns, you can also see the data type, in this case a 64-bit integer, the default data type pandas uses for integers.

Now let’s make another example with a more informative index:

In [24]:
pop = pd.Series( [ 38, 26, 19, 19],
                 index = ['ca', 'tx', 'ny', 'fl'])
# population, in millions
pop

ca    38
tx    26
ny    19
fl    19
dtype: int64

Now the index is helpful: we are looking at state populations, and the index tells us which state is in which row. Another advantage of possessing an index is that even when we filter and manipulate the series, its index will still retain the original row labels. So we know that index “fl” will always correspond to Florida. But if we have removed a few cases, or re-ordered the series, then Florida may not be in the fourth position any more.

Create a series of 4 capital cities where the index is the name of corresponding country.

In [25]:
cities = pd.Series(["Brazzaville", "Libreville", "Malabo", "Yaoundé"],
                   index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])
cities

Congo                Brazzaville
Gabon                 Libreville
Equatorial Guinea         Malabo
Cameroon                 Yaoundé
dtype: object

We can extract values and index using the corresponding attributes:

In [26]:
pop.values

array([38, 26, 19, 19], dtype=int64)

In [27]:
pop.index

Index(['ca', 'tx', 'ny', 'fl'], dtype='object')

In [28]:
cities.values

array(['Brazzaville', 'Libreville', 'Malabo', 'Yaoundé'], dtype=object)

In [29]:
cities.index

Index(['Congo', 'Gabon', 'Equatorial Guinea', 'Cameroon'], dtype='object')

Note that the values are returned as a numpy array, and the index is a special index object. If desired, the index can be converted to a list:

In [30]:
list(pop.index)

['ca', 'tx', 'ny', 'fl']

Series also supports vectorized mathematical and logical operations, e.g. we can do comparisons like:

In [31]:
pop > 20

ca     True
tx     True
ny    False
fl    False
dtype: bool

The result is another series, here of logical values, as indicated by the “bool” data type.
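
The sketch below puts a few of these features together: vectorized arithmetic, key-based lookup as with a dict, and filtering with a logical series (in_thousands and large are made-up names):

```python
import pandas as pd

pop = pd.Series([38, 26, 19, 19],
                index=["ca", "tx", "ny", "fl"])   # population, in millions

in_thousands = pop * 1000   # arithmetic is applied to every value
large = pop[pop > 20]       # logical series used as a filter

print(in_thousands["tx"])   # key-based lookup, like a dict
print(list(large.index))    # the index labels survive filtering
```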

### DataFrame

DataFrame is the central data structure for holding 2-dimensional rectangular data. It is in many ways similar to R data frames. However, it also shares a number of features with Series, in particular the index, so you can imagine a data frame as just a number of series stacked next to each other. Also, extracting single rows or columns from a DataFrame typically results in a series.

#### Creating data frames

One way to create a data frame is from a dict of lists: each key becomes a column name and each list holds the values of that column:

In [36]:
df = {'ca': [35, 37, 38], 'tx': [23, 24, 26], 'md': [5,5,6]}
pop = pd.DataFrame(df)
print('population:\n', pop, '\n')

population:
    ca  tx  md
0  35  23   5
1  37  24   5
2  38  26   6 



The data frame is printed as four columns. Exactly as in the case of series, the first column is the index. In the example above we did not specify the index and hence pandas picked just row numbers. But we can provide an explicit index, for instance the year of observation:

In [35]:
pop = pd.DataFrame(df, index = [2010,2012,2014])
print('population:\n', pop, '\n')

population:
       ca  tx  md
2010  35  23   5
2012  37  24   5
2014  38  26   6 



In this case the index is rather useful.

Create a dataframe of (at least 4) countries, with 2 variables: population and capital. Country name should be the index.

Hint: feel free to invent populations!
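
One possible sketch of a solution, re-using the capital cities from the series example above. The population numbers are invented placeholders, as the hint allows, and should not be taken as real data:

```python
import pandas as pd

# Populations are invented placeholder numbers (in millions), not real data
countries = pd.DataFrame(
    {"population": [5.0, 2.0, 1.5, 25.0],
     "capital": ["Brazzaville", "Libreville", "Malabo", "Yaoundé"]},
    index=["Congo", "Gabon", "Equatorial Guinea", "Cameroon"])

print(countries)
print(countries.shape)   # 4 countries, 2 variables
```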

#### Read data from file

Creating data frames manually is useful for testing and debugging, but in real applications we typically read data from a file. This can be done with pd.read_csv, which takes the file name as the first argument and also supports many other options. In the example below, we read data about G. W. Bush’s approval rate in fall 2001. pd.read_csv assumes files are comma-separated by default, but as this example file is tab-separated we have to declare it using sep="\t" as an extra argument. We also read only the first 10 rows for demonstration:

In [43]:
approval = pd.read_csv('https://bitbucket.org/otoomet/lecturenotes/raw/master/data/gwbush-approval.csv',
                       sep='\t', nrows=10)
approval

             date  approve  disapprove  dontknow
0  2001 Dec 14-16       86          11         3
1    2001 Dec 6-9       86          10         4
2  2001 Nov 26-27       87           8         5
3   2001 Nov 8-11       87           9         4
4    2001 Nov 2-4       87           9         4
5  2001 Oct 19-21       88           9         3
6  2001 Oct 11-14       89           8         3
7    2001 Oct 5-6       87          10         3
8  2001 Sep 21-22       90           6         4
9  2001 Sep 14-15       86          10         4

If we forget to specify the separator, the result looks quite different:


In [45]:
a = pd.read_csv('https://bitbucket.org/otoomet/lecturenotes/raw/master/data/gwbush-approval.csv')  # wrong separator
a.shape

(31, 1)

In [46]:
a.head(2)

   date\tapprove\tdisapprove\tdontknow
0            2001 Dec 14-16\t86\t11\t3
1              2001 Dec 6-9\t86\t10\t4


Two problems are immediately visible: first, the data frame contains a single column only (because read_csv does not consider tab symbols as separators), and second, the two lines we printed look weird. If you ask for variable names, you can also see that all variable names are combined together into a single weird name:

In [None]:
a.columns

The tab markers \t in printout give strong hints that the correct separator is tab.
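
The effect of the separator can be reproduced without downloading anything. The sketch below reads a tiny made-up tab-separated text from memory using io.StringIO; the two data rows only mimic the layout of the approval data:

```python
import io
import pandas as pd

# A tiny made-up tab-separated "file" held in memory
text = "date\tapprove\tdisapprove\n2001 Dec 14-16\t86\t11\n2001 Dec 6-9\t86\t10\n"

wrong = pd.read_csv(io.StringIO(text))             # default separator: comma
right = pd.read_csv(io.StringIO(text), sep="\t")   # tab declared explicitly

print(wrong.shape)           # everything ends up in a single column
print(right.shape)           # three proper columns
print(list(right.columns))
```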

It may initially be quite confusing to understand how to specify the file name. If you load data in a jupyter notebook, then the working directory is normally the same directory where the notebook is located. The notebook also lets you complete file names with the TAB key. But in any case, the working directory can be found with os.getcwd (get current working directory):

In [38]:
import os
os.getcwd()

'C:\\Users\\DELL\\Documents\\MSC_Computer_Science\\Machine-Learning-Zoomcamp\\Introduction to Machine Learning\\1.7 Introduction to NumPy'

This helps to specify the relative path if your data file is not located in the same place as your code. You can also find which files python finds in a given folder with os.listdir. The path is an ordinary string, so folder names that contain spaces need no extra quotes inside it:

In [40]:
files = os.listdir("../1.7 Introduction to NumPy/")
files[:5]
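
os.listdir raises FileNotFoundError if the path does not point to an existing folder, so it can be worth checking the path first. A minimal sketch, where the temporary folder and the file name example.csv are stand-ins invented for illustration:

```python
import os
import tempfile

# A temporary folder stands in for a real data directory in this sketch
with tempfile.TemporaryDirectory() as folder:
    # Create one empty file so that there is something to list
    open(os.path.join(folder, "example.csv"), "w").close()

    missing = os.path.join(folder, "no such folder")
    folder_exists = os.path.isdir(folder)
    missing_exists = os.path.isdir(missing)

    # List the folder only if it really exists
    files = os.listdir(folder) if folder_exists else []

print(folder_exists, missing_exists)
print(files)
```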

As another complication, notebooks are often run on a separate server or in a docker container. These may have no access to the files on your computer (a remote server), or only limited access (a docker container). In such a case, you should upload the file to the notebook environment, even if it is running on your own computer!

# Indexing data frames and series

Indexing refers to selecting data from data frames and series based on variable names, logical conditions, and position. It is a complex task with many different methods, and unfortunately also with many caveats. Below, the topic is split into several subsections:

Select variables explains how to select desired variables from a data frame.
Filter observations with logical operations describes how to filter rows.
Positional indexing of Series introduces positional indexing (indexing based on row number) and how to do it with series.
Positional indexing of data frames explains positional indexing (indexing based on both row and column numbers) for data frames.
Modifying data frames discusses the slight differences between modifying data and extracting it.
Indexing: summary and comparison provides a summary of all methods.

Fortunately, Series and data frames behave in a broadly similar way, e.g. selecting cases by logical conditions, by index, and by location works in a rather similar fashion. As series do not have columns, we cannot access their elements by column name or by column position though.

In [41]:
# Download CSV with read_csv
df = pd.read_csv('https://bitbucket.org/otoomet/lecturenotes/raw/master/data/gwbush-approval.csv',
                 sep='\t')

In [42]:
df.head(2)

             date  approve  disapprove  dontknow
0  2001 Dec 14-16       86          11         3
1    2001 Dec 6-9       86          10         4


In [47]:
approval.head(4)

             date  approve  disapprove  dontknow
0  2001 Dec 14-16       86          11         3
1    2001 Dec 6-9       86          10         4
2  2001 Nov 26-27       87           8         5
3   2001 Nov 8-11       87           9         4


### Select variables

To begin with, data frames have variable names. We can extract a single variable either with ["varname"] or with the shorthand attribute .varname (note: replace varname with the name of the relevant variable):



In [48]:
approval["approve"]  # approval, as series

0    86
1    86
2    87
3    87
4    87
5    88
6    89
7    87
8    90
9    86
Name: approve, dtype: int64

In [49]:
approval.approve  # the same, as series

0    86
1    86
2    87
3    87
4    87
5    88
6    89
7    87
8    90
9    86
Name: approve, dtype: int64

These constructs return the column as a series. If we prefer to get a single-column data frame, we can wrap the variable name into a list:

In [50]:
approval[["approve"]]  # approval, as data frame

   approve
0       86
1       86
2       87
3       87
4       87
5       88
6       89
7       87
8       90
9       86


The attribute shorthand is usually the easier way, but it does not work if you need to use an indirect variable name (a variable name that is stored in another variable) or if the variable name contains spaces or other special characters. It also does not work for creating new variables in the data frame.

The previous example where we extracted a single column as a data frame instead of Series also hints how to extract more than one variable: just wrap all the required variable names into a list:

In [51]:
variables = ["date", "approve"]  # avoid the name vars, it is a python built-in
approval[variables]

             date  approve
0  2001 Dec 14-16       86
1    2001 Dec 6-9       86
2  2001 Nov 26-27       87
3   2001 Nov 8-11       87
4    2001 Nov 2-4       87
5  2001 Oct 19-21       88
6  2001 Oct 11-14       89
7    2001 Oct 5-6       87
8  2001 Sep 21-22       90
9  2001 Sep 14-15       86


There are no attribute shortcuts to extract multiple columns.
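
The limitations of the attribute shorthand mentioned above can be illustrated with a small sketch. The data frame polls and the new column net are made up for this example:

```python
import pandas as pd

polls = pd.DataFrame({"approve": [86, 87, 90], "disapprove": [11, 8, 6]})

varname = "approve"         # variable name stored in another variable
indirect = polls[varname]   # works; polls.varname would look for a column called "varname"

# Creating a new column requires the bracket syntax
polls["net"] = polls["approve"] - polls["disapprove"]

print(indirect.tolist())
print(list(polls.columns))
```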

### Filter observations with logical operations

Filtering refers to extracting only a subset of rows from the dataframe based on certain conditions. The conditions are logical operations that can be either true or false, depending on the values in each row. Filtering produces a sub-dataframe where only those observations that meet the selection criteria are present. Here is an example:

In [52]:
approval[approval.approve > 88]

             date  approve  disapprove  dontknow
6  2001 Oct 11-14       89           8         3
8  2001 Sep 21-22       90           6         4


Note that we have to refer to data variables as approval.approve, not just approve, unlike in R dplyr where one can just write approve. This is somewhat harder to write but it is less ambiguous and produces fewer hard-to-find bugs.

Obviously we can use more complex selection conditions, for instance we can look for very low or very high approval rates as follows:

In [53]:
approval[(approval.approve < 86) | (approval.approve > 89)]

             date  approve  disapprove  dontknow
8  2001 Sep 21-22       90           6         4


Note that we are using the vectorized “or” operator |, not the base python or. We also need to wrap both the “less than” and “greater than” parts in parentheses, because | binds more tightly than the comparison operators.

How many polls in the data show the president’s approval rate of at least 88%? On which dates were those polls conducted?

In [54]:
approval[approval.approve >= 88][["date", "approve"]]  # only print date and approval rate

             date  approve
5  2001 Oct 19-21       88
6  2001 Oct 11-14       89
8  2001 Sep 21-22       90


In [55]:
approval[approval.approve >= 88].shape[0]  # number of polls with at least 88% approval

3
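
An alternative way to count the matching rows is to sum the logical series directly, as True counts as 1 and False as 0. A sketch on a small made-up data frame (polls and high are illustrative names):

```python
import pandas as pd

polls = pd.DataFrame({"approve": [86, 86, 87, 88, 89, 90]})   # made-up values

high = polls.approve >= 88     # logical series
n_sum = int(high.sum())        # True counts as 1, so this is the number of matches
n_filter = len(polls[high])    # the same count via filtering

print(n_sum, n_filter)
```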

The filtered object is not guaranteed to be an independent new data frame; pandas may treat it as derived from the original one. This may give you warnings (e.g. SettingWithCopyWarning) and errors later when you attempt to modify the filtered data. If you intend to do that, make a deep copy of the data using the .copy method.
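
A minimal sketch of the recommended pattern, with a small made-up data frame: we filter, copy, and then modify the copy while the original stays untouched:

```python
import pandas as pd

polls = pd.DataFrame({"approve": [86, 89, 90], "disapprove": [11, 8, 6]})

high = polls[polls.approve > 88].copy()   # independent copy of the filtered rows
high["approve"] = 0                        # modify the copy only

print(polls.approve.tolist())   # the original is untouched
print(high.approve.tolist())
```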

### Positional indexing of Series

Besides selecting variables and filtering by logical conditions, we occasionally need to access elements by index, or by position (location). Here we demonstrate positional indexing using a series object; positional indexing of data frames is discussed below:

In [56]:
pop = pd.Series([32.7, 267.7, 15.3],  # in millions
                index=["MY", "ID", "KH"])
pop

MY     32.7
ID    267.7
KH     15.3
dtype: float64

We can access a series’ values in two ways: by position, and by index. In order to access elements by position, we have to use the attribute .iloc[], where the i in iloc refers to “integer”. Unlike most other methods, .iloc expects its arguments in square brackets. A single number in brackets returns the element itself (e.g. a single number); if the brackets contain a list (this looks like double brackets), it returns a series, potentially containing only a single element. So in order to extract the 2nd and 3rd elements of the population series, we can write:

In [57]:
pop.iloc[1]  # extract 2nd element as a number

267.7

In [58]:
pop.iloc[[1,2]]   # extract 2nd, 3rd as a series

ID    267.7
KH     15.3
dtype: float64

Alternatively, we can also extract the elements by index. This works in a similar fashion, except we have to use .loc[] instead of .iloc[]. The rules for single and double brackets apply in the same way as in the case of positional access.

In [59]:
pop.loc["ID"]  # extract Indonesian population as a number

267.7

In [60]:
pop.loc[["ID", "MY"]]  # extract Indonesian and Malaysian population
# as a series

ID    267.7
MY     32.7
dtype: float64

One can also drop the .loc[] syntax and just use square brackets, so instead of writing pop.loc[["ID", "MY"]], one can just write pop[["ID", "MY"]].

The fact that there are several ways to extract positional data causes a lot of confusion for beginners. It is not helped by the common habit of not using indices and just relying on the automatic row-numbers. In this case positional access by .iloc[] produces exactly the same results as the index access by .loc[], and one can conveniently forget about the index and use whatever feels easier. But sometimes the index changes as a result of certain operations and that may lead to errors or unexpected results. For instance, we can create an alternative population series without explicit index:

In [61]:
pop1 = pd.Series([np.nan, 26, 19, 13])  # index is 0, 1, ...
pop1

0     NaN
1    26.0
2    19.0
3    13.0
dtype: float64

In this example, position and index are equivalent and hence it is easy to forget that .loc[] is index-based access, not positional access! So one may freely mix both methods (and remember, .loc is not needed):

In [62]:
pop1.loc[2]

19.0

In [63]:
pop1.iloc[2]

19.0

In [64]:
pop1[2]

19.0

This becomes a problem if a numeric index is not equivalent to row number any more, for instance after we drop missings:

In [65]:
pop2 = pop1.dropna()  # remove missings
pop2  # note: the first row has index 1

1    26.0
2    19.0
3    13.0
dtype: float64

In [66]:
pop2.iloc[2]  # this is by position

13.0

In [67]:
pop2.loc[2]  # this is by index

19.0

In [68]:
pop2[2]  # also by index

19.0

Additionally, if pop2 for some reason turns into a numpy array, then pop2[2] is based on position, as arrays do not have an index!
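
The difference can be seen by converting the series to a numpy array explicitly with .to_numpy(). A small sketch (arr is a made-up name):

```python
import numpy as np
import pandas as pd

pop2 = pd.Series([np.nan, 26, 19, 13]).dropna()   # index is now 1, 2, 3

arr = pop2.to_numpy()   # plain numpy array, the index is gone

by_index = pop2.loc[2]       # index label 2
by_position = pop2.iloc[2]   # third element
from_array = arr[2]          # arrays know position only

print(by_index, by_position, from_array)
```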

### Positional indexing of data frames

In [69]:
countries = pd.DataFrame({"capital":["Kuala Lumpur", "Jakarta", "Phnom Penh"],
                     "population":[32.7, 267.7, 15.3]},  # in millions
                    index=["MY", "ID", "KH"])
countries

         capital  population
MY  Kuala Lumpur        32.7
ID       Jakarta       267.7
KH    Phnom Penh        15.3


(MY is Malaysia, ID is Indonesia, and KH is Cambodia.)