# Querying Series

A pandas Series can be queried either by index position or by index label. If you don't give the Series an index when creating it, the position and the label are effectively the same values.  
To query by numeric location, starting at zero, use the **iloc** attribute.  
To query by the index label, use the **loc** attribute.

In [1]:
import pandas as pd
student_subject = {'Alice':'Physics', 'John':'Maths', 
                   'Sam':'English', 'Shubham':'Science'}
s = pd.Series(student_subject)
s

Alice      Physics
John         Maths
Sam        English
Shubham    Science
dtype: object

In [2]:
s.iloc[3]

'Science'

In [3]:
s.loc['John']

'Maths'

Keep in mind that iloc and loc are not methods; they are attributes. So you don't use parentheses to query them, but square brackets instead, which is called the **indexing operator**. In Python this calls `__getitem__` or `__setitem__` on the object, depending on whether you are reading or assigning an item.
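To make the get/set distinction concrete, here is a minimal sketch. The Series is a cut-down copy of the one above, and calling `__getitem__` by hand is done only for illustration; in normal code you would always write the square brackets.

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'John': 'Maths'})

# Reading with square brackets calls __getitem__ on the loc indexer
subject = s.loc['John']
same_subject = s.loc.__getitem__('John')  # equivalent, spelled out by hand

# Assigning with square brackets calls __setitem__ on the loc indexer
s.loc['John'] = 'Chemistry'
updated_subject = s.loc['John']
```

Both reads return the same value, and the assignment changes the Series in place.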

In [4]:
s[3]  # no integer labels here, so pandas falls back to position; deprecated in newer versions, prefer s.iloc[3]

'Science'

In [5]:
s['John']  # label-based lookup, same result as s.loc['John']

'Maths'

So what happens if your index is a list of integers? This is a bit complicated, because the indexing operator alone doesn't say whether you intend to query by index position or by index label. With an integer index, pandas treats the value as a label, which may not be what you expect. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

In [6]:
myindex = list(range(5))
myindex.reverse()
s1 = pd.Series(list(range(5,15,2)), index = myindex)
s1

4     5
3     7
2     9
1    11
0    13
dtype: int64

In [7]:
s1[0]

13

In [8]:
s1.iloc[0]

5

In [9]:
s1.loc[0]

13

In [10]:
class_code = {99: 'Physics', 100: 'Chemistry', 101: 'Maths', 98: 'EVS'}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101        Maths
98           EVS
dtype: object

In [11]:
# s[0] # KeyError

So that didn't call s.iloc[0] underneath as one might expect; instead it raises a KeyError, because there's no item in the class_code Series with an index label of zero. We have to call iloc explicitly if we want the first item.

In [12]:
s.iloc(0)  # parentheses return the indexer object itself, not the first item

<pandas.core.indexing._iLocIndexer at 0x1d9159de5e0>

In [13]:
s.iloc[0]

'Physics'
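The behaviour above can be checked in a single cell. This is a small sketch using a copy of the class_code data; the variable names `codes`, `lookup_failed`, `first_by_position` and `by_label` are introduced here only for illustration.

```python
import pandas as pd

codes = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'Maths', 98: 'EVS'})

# With an integer index, the indexing operator looks for the label 0
try:
    codes[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

first_by_position = codes.iloc[0]  # first row, whatever its label is
by_label = codes.loc[99]           # row whose label is 99
```

Being explicit with iloc and loc removes the guesswork about which kind of lookup is happening.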

Speeding up using parallelization (vectorization) -

In [14]:
grades = pd.Series([90, 70, 80, 60])

# calculating average (slowly) -
total = 0 
for grade in grades:
    total += grade
total/len(grades)

75.0

In [15]:
# calculating average (quickly, using numpy or pandas) -
import numpy as np
total = np.sum(grades)
total/len(grades)

75.0

Proof using the cell magic function %%timeit -

In [16]:
numbers = pd.Series(np.random.randint(1,1000,10000))
numbers.head()

0    117
1    371
2    969
3    653
4    678
dtype: int32

In [17]:
len(numbers)

10000

Note: by default, timeit chooses a number of loops large enough to give a reliable measurement. If you want to set it to 100 yourself, you can do so with %%timeit -n 100

In [18]:
%%timeit
total = 0
for num in numbers:
    total += num
total/len(numbers)

4.45 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
%%timeit
total = np.sum(numbers)
total/len(numbers)

214 µs ± 39.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
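Outside a notebook the %%timeit magic isn't available, but the standard-library `timeit` module can make a similar comparison. The sketch below is only an approximation of the cells above: the helper names `loop_mean` and `vectorized_mean` and the choice of 20 repetitions are assumptions made for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(1, 1000, 10000))

def loop_mean(values):
    # Plain Python loop: one addition per element
    total = 0
    for value in values:
        total += value
    return total / len(values)

def vectorized_mean(values):
    # numpy sums the whole array in compiled code
    return np.sum(values) / len(values)

# timeit.timeit returns the total seconds taken for `number` calls
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=20)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=20)

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```

Both functions compute the same average; only the time they take differs.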


This demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms. Put more simply, **vectorization** is the ability of a computer to apply one operation to many values at once, and with high-performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of operations in parallel.

A related feature in pandas and numpy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random number by 2, that can be done using the += operator directly on the Series object.

In [20]:
numbers.head()

0    117
1    371
2    969
3    653
4    678
dtype: int32

In [21]:
numbers += 2
numbers.head()

0    119
1    373
2    971
3    655
4    680
dtype: int32

This can also be done using items() and at[label] (iteritems() was removed in pandas 2.0) -

In [22]:
for label, number in numbers.items():
    numbers.at[label]+=2
numbers.head()

0    121
1    375
2    973
3    657
4    682
dtype: int32

But iteration in pandas is rarely efficient and should be avoided where possible. Let's measure the speed as we did previously -

In [28]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, number in s.items():
    s.loc[label] = number + 2

1.34 s ± 170 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [29]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2

1.17 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
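Broadcasting isn't limited to +=. As a rough sketch on the small grades data used earlier (the names `curved` and `tenths` are made up for this example), other arithmetic operators broadcast a scalar in the same way, and the plain operators return a new Series instead of changing the original:

```python
import pandas as pd

grades = pd.Series([90, 70, 80, 60])

curved = grades + 5   # new Series; grades itself is unchanged
tenths = grades / 10  # division broadcasts too and gives floats

grades += 2           # the augmented form updates grades itself
```

Each line replaces what would otherwise be an explicit loop over the labels.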


If we assign a value to a label that doesn't exist in the index, that label is added to the index and a new entry is created -

In [31]:
s = pd.Series([31,28,35])
s

0    31
1    28
2    35
dtype: int64

In [33]:
s['shubham'] = 6
s

0          31
1          28
2          35
shubham     6
dtype: int64
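The same enlargement works through loc, and it is worth seeing what it does to the index. A minimal sketch, where the name `index_before` and the replacement value 30 are chosen only for illustration:

```python
import pandas as pd

s = pd.Series([31, 28, 35])
index_before = s.index.tolist()

s.loc['shubham'] = 6  # label not in the index: a new entry is appended
s.loc[1] = 30         # label already in the index: the value is overwritten

index_after = s.index.tolist()
```

After the assignment the index holds a mix of integers and a string, so later lookups are best done with loc and the exact label.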

Index values don't have to be unique; duplicates are allowed too -

In [41]:
student_subject

{'Alice': 'Physics', 'John': 'Maths', 'Sam': 'English', 'Shubham': 'Science'}

In [42]:
s = pd.Series(student_subject)
s

Alice      Physics
John         Maths
Sam        English
Shubham    Science
dtype: object

In [43]:
kelly_subjects = pd.Series(['EVS', 'PT', 'Arts'], index=['Kelly']*3)
kelly_subjects

Kelly     EVS
Kelly      PT
Kelly    Arts
dtype: object

In [44]:
all_students = pd.concat([s, kelly_subjects])  # Series.append was removed in pandas 2.0
all_students

Alice      Physics
John         Maths
Sam        English
Shubham    Science
Kelly          EVS
Kelly           PT
Kelly         Arts
dtype: object

In [45]:
s

Alice      Physics
John         Maths
Sam        English
Shubham    Science
dtype: object

In [46]:
all_students['Kelly']

Kelly     EVS
Kelly      PT
Kelly    Arts
dtype: object
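Because labels can repeat, the type returned by a label lookup depends on the data. The sketch below uses a shortened version of the students data (the variable names are assumptions for this example) to show the difference and one way to check for duplicates up front:

```python
import pandas as pd

students = pd.Series(['Physics', 'EVS', 'PT', 'Arts'],
                     index=['Alice', 'Kelly', 'Kelly', 'Kelly'])

unique_result = students.loc['Alice']     # one match: a single value
duplicate_result = students.loc['Kelly']  # several matches: a Series

# is_unique tells us whether any label is repeated
has_unique_labels = students.index.is_unique
```

Checking `index.is_unique` before a lookup is a simple guard when later code expects a single value.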