In [2]:
import pandas as pd

# open the Series documentation in the pager pane at the bottom of the browser window
pd.Series?

**Series** = cross between list + dictionary

* items are stored in order = **values**
* can retrieve items w/ labels = **index**

When passing a list of values to pd.Series(), pandas creates a Series w/ a default integer index starting at 0 (one label per list element) + leaves the name of the Series as None

Under the hood, pandas stores Series values in a typed array using the NumPy library --> offers significant speed-up when
processing data vs. traditional Python lists. 
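
A minimal sketch of the points above, assuming a small numeric list (the variable names are illustrative only): the default index, the missing name, and the NumPy array behind the values can all be inspected directly.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

print(s.name)              # None: no name was supplied
print(list(s.index))       # [0, 1, 2]: default integer labels
print(type(s.to_numpy()))  # the values are exposed as a numpy.ndarray
```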

In [3]:
# create list
animals = ['Tiger', 'Bear', 'Moose']

# convert the list to a Series to see its indices
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

In [4]:
# can pass in numbers type
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

For missing data in Python, we have the None type to indicate a lack of data

In [5]:
# can pass in None type
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

0    Tiger
1     Bear
2     None
dtype: object

Underneath, Pandas does some type conversion here --> for a list of strings, Pandas inserts the None as-is + uses the type "object" for the underlying array. 

If we create a list of *numbers (integers or floats)* + put in a None type, Pandas automatically converts the None to a
special floating point value, **NaN** = not a number, so the whole Series becomes float64. 

In [6]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

***NaN is not None*** 

* An equality test of NaN against itself is always False.
* Need to use special functions to test for the presence of NaN, e.g. np.isnan(). 
* Keep in mind when you see NaN, its meaning is similar to None, but it's a *numeric* value + is treated differently for efficiency reasons.

In [9]:
import numpy as np
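np.nan == None    # False, but not displayed: a cell only echoes its last expression
np.nan == np.nan  # also False, NaN never equals anything, including itself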

False

In [11]:
np.isnan(np.nan)

True
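
The same check can be applied to a whole Series. This is a small sketch, assuming the `numbers` Series from earlier, that uses the `isna()` method to build a boolean mask rather than comparing against `np.nan`:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# Equality never matches NaN, so this comparison finds nothing
print((numbers == np.nan).any())  # False

# isna() flags the missing entry instead
mask = numbers.isna()
print(mask.tolist())              # [False, False, True]
```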

Often we have labeled data we want to manipulate

A series can be created from dictionary data --> index is automatically assigned to the keys of the dictionary

In [10]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

When we create this series, note that, since it was string data, Pandas set the data type of the series to object, used the countries as the values of the series, + used the dictionary keys as the index labels

Once the series has been created, we can get the index object using the **.index** attribute.

In [12]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

Can also separate your index creation from the data by passing in the index as a list *explicitly to the series* 

In [14]:
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object

If the list of values passed as the index is not aligned w/ the keys in a dictionary, Pandas overrides the automatic
index creation + uses all of, and only, the index values provided

It will ignore all dictionary keys which are not in your index argument + will add None or NaN values for any label in the provided index that isn't a key in the dictionary

In [17]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s

Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object

<h2> Querying a Series </h2>

Pandas Series can be queried by index position or index label (if you don't give the Series an index, position + label are effectively the same values)

In [20]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)

# get the value of the 4th item (position 3) w/ the .iloc attribute (square brackets, not parentheses)
s.iloc[3]

'South Korea'

In [21]:
# do the same by label w/ the .loc attribute
s.loc['Golf']

'Scotland'

In [22]:
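# don't need .iloc attribute here: an integer on a non-integer index falls back to position
# (recent pandas versions deprecate this fallback, so prefer s.iloc[3])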
s[3]

'South Korea'

In [23]:
# don't need .loc attribute
s['Golf']

'Scotland'

In [24]:
s[0] # the index holds strings, so the integer 0 is treated as a position, same as s.iloc[0]

'Bhutan'

Don't use parentheses to query, but square brackets --> the **indexing operator.**

If you pass in an integer, the operator will behave as if you want to query by position via the iloc attribute. If you pass in any other object (e.g. a string), it will query as if you wanted to use the label-based loc attribute

If index = a list of integers, Pandas can't determine automatically whether you're intending to query by position or label. 

Safer option is to be more explicit + use iloc or loc 

In [25]:
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0] #This won't call s.iloc[0] as one might expect, it generates an error instead

KeyError: 0
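
A short sketch of the explicit alternative, assuming the same integer-keyed dictionary (the variable names `by_label` and `by_position` are illustrative): with `loc` and `iloc` there is no ambiguity about which kind of lookup is meant.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

by_label = s.loc[99]      # look up the entry whose label is 99
by_position = s.iloc[0]   # look up the first entry, whatever its label

print(by_label, by_position)  # Bhutan Bhutan
```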

In [26]:
# sum up Series elements via loop
s = pd.Series([100.00, 120.00, 101.00, 3.00])
total = 0
for item in s:
    total+=item
print(total)

324.0


In [27]:
# do so w/ numpy
import numpy as np

total = np.sum(s)
print(total)

324.0


Loop works, but is slow.

Modern CPUs can do many tasks simultaneously, especially, but not only, tasks involving mathematics. 

Pandas + its underlying NumPy libraries support a method of computation called **vectorization** which works with most of the functions in NumPy.

Both methods above create the same value, but is one actually faster? 

Jupyter Notebook has a function to help see.

In [37]:
# creates a large series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()

0    203
1    574
2    869
3    990
4    662
dtype: int32

In [31]:
%%timeit -n 100
summary = 0
for item in s:
    summary+=item

100 loops, best of 3: 1.1 ms per loop


In [32]:
%%timeit -n 100
summary = np.sum(s)

100 loops, best of 3: 111 µs per loop


Used a **cell magic function** --> starts w/ 2 %'s + modifies/**wraps** the code in the current Jupyter cell. 

Function we use = **timeit** --> runs the code many times (here 100 loops, repeated 3 times) + reports the best time per loop. 

Vectorization gives a shocking difference in the speed + demonstrates why data scientists need to be aware of parallel computing features + start thinking in functional programming terms. 
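
The `%%timeit` magic only exists inside IPython/Jupyter. Outside a notebook, a rough equivalent can be sketched w/ the standard-library `timeit` module. The helper name `loop_sum` and the repeat count of 20 are assumptions made for illustration, and the absolute timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum(series):
    """Sum the values one at a time in a Python loop."""
    total = 0
    for item in series:
        total += item
    return total

# Time 20 runs of each approach; timeit returns total elapsed seconds
loop_time = timeit.timeit(lambda: loop_sum(s), number=20)
vector_time = timeit.timeit(lambda: np.sum(s), number=20)

print(f"loop: {loop_time:.4f}s   vectorized: {vector_time:.4f}s")
```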

A related feature in Pandas + NumPy is called **broadcasting** = apply an operation to every value in a Series, changing the Series.

In [36]:
# add 2 to each item in s using broadcasting
s += 2 
s.head()

0    572
1    423
2    508
3    234
4    805
dtype: int32

Procedural way of doing this would be to iterate through all items in the series + increase the values directly. 

* Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily. 
* But if you find yourself iterating through a series, question whether you're doing things in the best possible way.

In [40]:
%%timeit -n 10

s = pd.Series(np.random.randint(0,1000,10000))

# items() yields (label, value) pairs; the older iteritems() was removed in pandas 2.0
for label, value in s.items():
    s.loc[label]= value + 2

10 loops, best of 3: 774 ms per loop


In [39]:
%%timeit -n 10

s = pd.Series(np.random.randint(0,1000,10000))

s += 2

10 loops, best of 3: 479 µs per loop


Not only is broadcasting significantly faster, it's also more concise + arguably easier to read. 

The typical mathematical operations you would expect are vectorized, + NumPy documentation outlines what it takes to create vectorized functions of your own. 
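
As a hedged illustration, one simple way to get a vectorized function of your own is to build it purely out of operations that are already vectorized. The function `celsius_to_fahrenheit` below is a made-up example, not something from pandas or NumPy.

```python
import pandas as pd

def celsius_to_fahrenheit(celsius):
    """Works on a scalar or a whole Series, since * / + are all vectorized."""
    return celsius * 9 / 5 + 32

temps_c = pd.Series([0.0, 25.0, 100.0])
temps_f = celsius_to_fahrenheit(temps_c)   # no explicit loop needed

print(temps_f.tolist())  # [32.0, 77.0, 212.0]
```

NumPy also offers `np.vectorize` for wrapping a scalar-only Python function, but its documentation describes it as a convenience rather than a performance tool, so it shouldn't be expected to match the speed-ups shown above.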

<h4> 1 last note on using the indexing operators to access series data:</h4>

.loc attribute lets you not only modify data **in place**, but also **add** new data as well. 

If a value you pass in as the index doesn't exist, a new entry is added

Also keep in mind that indices can have mixed types. 

* While it's important to be aware of the type underneath, Pandas will automatically change underlying NumPy types as appropriate.

In [41]:
# add string index + value to numeric Series + see type switch to object
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s

0             1
1             2
2             3
Animal    Bears
dtype: object

Mixed types for data values or index labels are no problem for Pandas

Can also have index values that are NOT unique, + this makes Series + data frames conceptually different from a relational database (RDB)

In [42]:
# create a series
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
# create a new series
cricket_loving_countries = pd.Series(['Australia', 'Barbados','Pakistan','England'], 
                                   index=['Cricket','Cricket','Cricket','Cricket'])

# concatenate Series (pd.concat replaces Series.append, which was removed in pandas 2.0)
all_countries = pd.concat([original_sports, cricket_loving_countries])

In [44]:
print(original_sports,'\n\n',cricket_loving_countries,'\n\n',all_countries)

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object 

 Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object 

 Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object


Pandas takes the 2nd series + tries to infer the best data types to use (everything is a string, here, so no problem)

Concatenating (append in older pandas, pd.concat in pandas 2.0+, where Series.append was removed) doesn't actually *change* the underlying series but instead returns a *NEW* series made up of the 2 joined together
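
A small sketch to confirm this, using shortened versions of the two series (the name `cricket` is an illustrative stand-in): after concatenating, both inputs still have their original lengths.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

all_countries = pd.concat([original_sports, cricket])

# The inputs are untouched; only the returned series holds all 4 entries
print(len(original_sports), len(cricket), len(all_countries))  # 2 2 4
```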

In [45]:
all_countries.loc['Cricket']

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

Notice when we query the appended series for those who have cricket as a national sport, we don't get a single value, but a series. 

This is actually very common, and if you have an RDB background, it's very similar to every table query resulting in a **return set**, which, itself, is a table.
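
To make the difference concrete, here is a minimal sketch on a cut-down series (the variable names are assumptions for illustration): a label that occurs once comes back as a single value, while a repeated label comes back as a Series.

```python
import pandas as pd

all_countries = pd.Series(['Bhutan', 'Australia', 'England'],
                          index=['Archery', 'Cricket', 'Cricket'])

unique_hit = all_countries.loc['Archery']    # label occurs once
repeated_hit = all_countries.loc['Cricket']  # label occurs twice

print(type(unique_hit).__name__)      # str
print(type(repeated_hit).__name__)    # Series
print(all_countries.index.is_unique)  # False
```

Checking `index.is_unique` first is one way to know in advance which kind of result a label lookup can return.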