# Intro to Pandas with Series
**Series**: a cross between a list and a dictionary, where the first column is the index and the second is your actual data. The data column has its own label, which can be accessed with the `.name` attribute.

In [1]:
import pandas as pd

# initialize series with array-like object

students = ['Alice', 'Jack', 'Molly']

pd.Series(students) # default data type is object

0    Alice
1     Jack
2    Molly
dtype: object
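
A minimal sketch of the two parts of a `Series` described above: the index and the labeled data column. The label `'students'` is an illustrative assumption, not a name used elsewhere in these notes.

```python
import pandas as pd

# 'students' is a hypothetical label chosen for this example
s = pd.Series(['Alice', 'Jack', 'Molly'], name='students')

print(s.name)          # label of the data column
print(list(s.index))   # default integer index: [0, 1, 2]
```

When no index is passed, Pandas numbers the rows from zero.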

Pandas stores `Series` values in a typed array using the NumPy library.

In [3]:
numbers = [1,2,3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
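
To see the typed array behind a numeric series, one option is `Series.to_numpy()`. This is only a small sketch; the variable name `arr` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
arr = s.to_numpy()            # the values as a NumPy array

print(type(arr))              # <class 'numpy.ndarray'>
print(arr.dtype == s.dtype)   # True: the Series reports the array's dtype
```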

## Missing Data
Python uses the `None` type to indicate a lack of data. When one element of a list of strings is `None`, Pandas keeps it as `None` and uses type `object` for the underlying array.

In [4]:
students = ['Alice', 'Jack', None]

pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

Pandas converts a `None` in a list of numbers (integers or floats) to `NaN`:

In [5]:
numbers = [1,2,None]

pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

The result is float, not int, because Pandas represents `NaN` as a floating point number, and since integers can be typecast to floats, Pandas simply converts the integers to floats. This is a helpful signal: if you have a list of integers and it gets typecast to float, you probably have missing data.

However, `NaN` != `None` when you're checking for missing data:

In [6]:
import numpy as np

np.nan == None

False

... you can't even do an equality test of `NaN` to itself.

In [7]:
np.nan == np.nan

False

Therefore, you need to use special functions to check for the presence of not a number:

In [8]:
np.isnan(np.nan)

True

All in all, `NaN`'s meaning is similar to `None`, but it's a numeric value and treated differently for efficiency reasons.
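
Since equality tests don't work for missing values, a common alternative is `pd.isna()` (or the `.isna()` method), which treats both `None` and `NaN` as missing. A short sketch, with the variable names `names` and `scores` assumed for illustration:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Alice', 'Jack', None])
scores = pd.Series([1, 2, None])

print(names.isna().tolist())            # [False, False, True]
print(scores.isna().tolist())           # [False, False, True]
print(pd.isna(np.nan), pd.isna(None))   # True True
```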

## How else can Series be created?
From dictionary data, if you have labeled data. The index of the `Series` is automatically assigned from the dictionary keys:

In [9]:
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}
s = pd.Series(students_scores)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

Pandas sets string data to 'object'.

Once the series has been created, get the index using the `.index` attribute:

In [10]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

## Object `dtype`
Not just for strings but also for arbitrary objects, like a list of tuples.

In [11]:
students = [('Alice', 'Brown'), ('Jack', 'White'), ('Molly', 'Green')]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

## Automatic Index Creation
Separate index creation from data by passing in a list explicitly:

In [12]:
s = pd.Series(['Physics', 'Chemistry', 'English'], index = ['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

When you pass both a dictionary and an explicit index, Pandas overrides the automatic index creation and uses exactly the index values you provide. It ignores every dictionary key that is not in your index, and adds a `None` or `NaN` value for every index value that is not among the dictionary keys.

In [14]:
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}

s = pd.Series(students_scores, index = ['Alice', 'Molly', 'Sam'])
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object
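
The same rule applies to numeric data, where the missing entry also forces the float conversion described in the Missing Data section. A sketch with made-up grades (the `scores` dictionary is hypothetical):

```python
import pandas as pd

scores = {'Alice': 90, 'Jack': 80, 'Molly': 70}   # hypothetical grades
s = pd.Series(scores, index=['Alice', 'Molly', 'Sam'])

print(s)
# 'Jack' is dropped because it is not in the index,
# 'Sam' becomes NaN, and the integers are converted to float64
```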

## Querying a Series
Query either by index position using `.iloc` or by index label using `.loc`. If no index was provided when the series was created, the position and label are the same values.

### Query by Position

In [15]:
import pandas as pd


In [41]:
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}

s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [17]:
s.iloc[3]

'History'

### Query by Label

In [18]:
s.loc['Molly']

'English'

### Smart Syntax using Index Operator Directly
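If you pass in an **integer** parameter to `[]` and the index labels are not integers themselves, the operator will act like the `.iloc` attribute. Recent pandas releases deprecate this positional fallback, so prefer `.iloc` in new code: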

In [21]:
s[3]

'History'

In [22]:
s[3] == s.iloc[3]

True

If you pass in a **label** (an object such as a string) to `[]`, the operator will act like the `.loc` attribute:

In [23]:
s['Molly']

'English'

In [24]:
s['Molly'] == s.loc['Molly']

True

In [25]:
class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'History',
              102: 'English'}
s = pd.Series(class_code)

Try calling `s[0]` if we want the first item:

In [26]:
s[0]

KeyError: 0

This gives us a `KeyError` because there is no item in the `class_code` dictionary with a key of zero, since the index labels are the class codes themselves. Therefore, we have to use `.iloc` explicitly if we want the first item:

In [27]:
s.iloc[0]

'Physics'
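
Putting the two lookups side by side for an integer-labeled series makes the difference clearer. This sketch reuses the class codes from above:

```python
import pandas as pd

s = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'History', 102: 'English'})

print(s.loc[99])      # label lookup    -> 'Physics'
print(s.iloc[0])      # position lookup -> 'Physics'
print(s[100])         # with an integer index, [] looks up the label -> 'Chemistry'
print(0 in s.index)   # False, which is why s[0] raises KeyError
```

Being explicit with `.loc` and `.iloc` avoids having to remember which behavior `[]` picks.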

## Working with Data
Types of tasks: finding a certain value, summarizing data, transforming data.

The typical approach is to iterate over all items in the series and invoke an operation.
Example: find the average grade for a Series of student grades.

In [28]:
# slow way
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


In [34]:
# vectorized way using Pandas and NumPy
import numpy as np

total = np.sum(grades)
print(total/len(grades))

75.0
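
For this particular task the sum and division can probably be skipped altogether, since both libraries offer a vectorized mean. A small sketch using the same grades:

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

print(grades.mean())                # 75.0
print(np.mean(grades))              # 75.0
print(grades.sum() / len(grades))   # 75.0, same as the manual version
```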


### How to check which way was faster?
Jupyter Notebook has magic functions, which are prefixed with `%`.

In [30]:
# create big series of 10k random integers from 0 to 999 (the upper bound 1000 is exclusive)
numbers = pd.Series(np.random.randint(0,1000,10000))

numbers.head()

0    463
1    960
2    460
3    338
4     88
dtype: int64

In [31]:
len(numbers)

10000

Use a **cell magic function**, written with 2 `%` signs, called `timeit`. In a Jupyter Notebook, it must be the first line of the cell.

In [32]:
%%timeit -n 100 # pass in 100 for the number of loops
total = 0
for number in numbers:
    total += number
    
total/len(numbers)

1.09 ms ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [33]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

51.5 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
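
Outside a notebook the `%%timeit` magic isn't available, but the standard-library `timeit` module can run a similar comparison. A rough sketch; the helper names `loop_mean` and `vectorized_mean` and the repeat count of 20 are assumptions for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # slow way: iterate over every value in Python
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # vectorized way: let NumPy do the summation
    return np.sum(numbers) / len(numbers)

# total seconds for 20 calls of each function
loop_time = timeit.timeit(loop_mean, number=20)
vector_time = timeit.timeit(vectorized_mean, number=20)
print(f"loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s")
```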


### Broadcasting
Broadcasting applies an operation to every value in the series at once, changing the series.

In [35]:
numbers.head()

0    463
1    960
2    460
3    338
4     88
dtype: int64

In [36]:
numbers += 2
numbers.head()

0    465
1    962
2    462
3    340
4     90
dtype: int64
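
Broadcasting isn't limited to `+=`. As a sketch on a tiny series (the values are just the first three from the output above), other operators broadcast the same way, and the non-assigning forms return a new series:

```python
import pandas as pd

numbers = pd.Series([463, 960, 460])

bumped = numbers + 2              # new Series; numbers is unchanged
print(bumped.tolist())            # [465, 962, 462]
print(numbers.tolist())           # [463, 960, 460]

numbers += 2                      # numbers now holds the updated values
print(numbers.tolist())           # [465, 962, 462]
print((numbers * 2).tolist())     # [930, 1924, 924]
print((numbers > 500).tolist())   # [False, True, False]
```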

The procedural way is to iterate through all items in the series and increase the values directly using `.items()` (called `.iteritems()` before pandas 2.0, where it was removed). Pandas supports iterating through a series much like a dictionary.

In [37]:
# .items() replaces .iteritems(), which was removed in pandas 2.0
for label, value in numbers.items():
    #numbers.set_value(label, value+2)    # old way, removed in pandas 1.0
    numbers.at[label] = value+2
    
numbers.head()

0    467
1    964
2    464
3    342
4     92
dtype: int64

### Should You Be Iterating?
If you find yourself iterating any time in pandas, you should question whether you're doing things in the best possible way.

In [38]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000))

for label, value in s.items():  # .iteritems() was removed in pandas 2.0
    s.loc[label] = value + 2

38.7 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000))
s+=2

155 µs ± 45.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### One Last Note
The `.loc` attribute lets you modify data in place and add new data. If the value you pass in as the index doesn't exist, the new entry will be added. Indices can have mixed types. Pandas will automatically change the underlying NumPy types as appropriate.

In [40]:
s = pd.Series([1,2,3])

s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64
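
A slightly longer sketch of the same idea, showing both behaviors of `.loc` assignment and what happens to the index type once labels are mixed:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[1] = 20              # existing label: the value is modified in place
s.loc['History'] = 102     # new label: a new entry is added

print(s.index.tolist())    # [0, 1, 2, 'History']
print(s.index.dtype)       # object, because the labels now mix ints and a string
print(s.tolist())          # [1, 20, 3, 102]
```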

### Non-Unique Indices

In [43]:
students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})

students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [44]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index = ['Kelly','Kelly','Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [45]:
# Series.append() was removed in pandas 2.0; pd.concat() is the replacement
all_students_classes = pd.concat([students_classes, kelly_classes])
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

When you append two series together, Pandas will infer the best data type to use. The operation doesn't change the underlying series `students_classes`; it returns a new Series made up of the two. This holds for both `Series.append()` (removed in pandas 2.0) and its replacement, `pd.concat()`. By default, Pandas returns a new object rather than modifying the existing one in place.
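
The type inference is easier to see with numeric data. A sketch using `pd.concat()`; the series `ints` and `floats` are made up for illustration:

```python
import pandas as pd

ints = pd.Series([1, 2], index=['a', 'b'])
floats = pd.Series([0.5], index=['c'])

combined = pd.concat([ints, floats])

print(combined.dtype)   # float64: wide enough to hold both inputs
print(ints.dtype)       # int64: the original series is untouched
```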

In [46]:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [47]:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
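
One consequence of non-unique indices is that the return type of `.loc` depends on how many rows share the label. A minimal sketch on a shortened version of the data above:

```python
import pandas as pd

s = pd.Series(['Physics', 'Philosophy', 'Arts'],
              index=['Alice', 'Kelly', 'Kelly'])

print(s.index.is_unique)         # False
print(type(s.loc['Alice']))      # a single value: <class 'str'>
print(type(s.loc['Kelly']))      # several matches: a Series
print(s.loc['Kelly'].tolist())   # ['Philosophy', 'Arts']
```

Checking `.index.is_unique` first is one way to know which case to expect.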

# Topics Discussed:
- how to query
- `.loc` and `.iloc`
- `Series` is an indexed data structure
- how to combine two `Series` (`append()` in older pandas, `pd.concat()` in pandas 2.0+)
- vectorization