# Agenda: The full course

1. Getting started
    - What is data analytics?
    - What is Pandas?
    - Descriptive statistics
    - Pandas series 
    - Retrieving values
    - Setting values
    - Broadcasting operations
    - Mask arrays (boolean arrays) for retrieving selected values
    - Indexes
    - Some useful methods
2. Data frames, for 2-dimensional data
3. Real-world data
4. Text data and date/time data
5. Visualization

# Jupyter

I see Jupyter as my Python laboratory -- I can try things, and then if they don't work, I can modify my experiment and try again.

Everything in Jupyter is done in a "cell." We have two types of cells:

- Code cells, in which we can run Python code
- Markdown cells, in which we have Markdown text, which turns into HTML very nicely

When I press Enter, I go down one line. But when I press Shift+Enter, the cell executes: a code cell runs its Python, and a Markdown cell gets formatted.

# Two modes in Jupyter

- Edit mode -- this allows us to type into Jupyter. The cell has a green outline. We can get into edit mode by clicking in the cell or by pressing ENTER.
- Command mode -- this allows us to give Jupyter commands, typically one character long. The cell has a blue outline in this case. We can get into command mode by clicking to the left of the cell or pressing ESC.

What can I do in command mode?

- `m` -- turn the cell into a Markdown cell, for text
- `y` -- turn the cell into a Python code cell, for programming
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the copied/cut cell
- `h` -- get help, a full list of commands
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one

In [2]:
# this is Python code

print('Hello!')   # shift+enter sends the cell's contents to the back-end Python process, where it runs

Hello!


# What is data science?

There is no one definition! (So everyone can choose the definition that they like.)

I define data science as data analytics + (data engineering +) machine learning.

- Data analytics -- we have a bunch of data. We want to understand it.  For example: Who bought from our store? What time do people use the train? How much do people spend on taxis?
- Data engineering -- the data all exists in one place, and is very messy.  How can we easily and efficiently get it from its current location to our systems, cleaning it up along the way?
- Machine learning -- let's use this existing data to make projections/predictions about the future.  How many people will buy at our Black Friday sale? How many toll collectors do we need at 6 a.m. on Sunday?  If we hear a certain soundwave, can we predict the meaning? If we see a picture, can we accurately predict the animal in the picture?

In this class, we'll be talking about how to use Python and Pandas for data analytics.

Data science is all about asking questions and using scientific or scientific-like methods to answer our questions.

# Exercise: 

1. What kinds of data does Amazon have?
2. What sorts of questions can they ask of that data?

# Python and Pandas

Python has been around for about 30 years. It's a fantastic high-level programming language.

But it is *not* very fast or efficient, certainly not compared with C, Java, and C#.  It takes far more time, and it uses far more memory than these other languages.

So: Why would we use Python?  Just because it's fun and easy to use?

No: A library called NumPy provides us with a Python interface to C-language data structures. So we have the speed and efficiency of C, but the ease of use of Python.

NumPy is still very popular and very useful. But it can be a bit low level for many people. Enter Pandas, which is a wrapper around NumPy that makes it easier to use and more convenient.

I call Pandas the automatic transmission to NumPy's stick shift.

We're going to use Pandas, but we will occasionally see hints of NumPy under the hood.
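Here is a minimal sketch of what "C-language data structures" means in practice. The variable names are mine, and I ask for `int64` explicitly so that the numbers are the same on every platform. A NumPy array stores its values in one block of fixed-size integers, and its methods do their looping in compiled code.

```python
import numpy as np

# One million integers, stored two ways
py_list = list(range(1_000_000))                    # a list of separate Python int objects
np_array = np.arange(1_000_000, dtype=np.int64)     # one block of 64-bit integers

print(np_array.dtype)      # the C-style type shared by every element
print(np_array.itemsize)   # bytes used per element
print(np_array.nbytes)     # total bytes used for the data

# Both approaches give the same answer; NumPy's sum runs in compiled code
total_list = sum(py_list)
total_array = int(np_array.sum())
print(total_list == total_array)
```

If you want to see the speed difference for yourself, Jupyter's `%timeit` magic in front of each `sum` is an easy way to measure it.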

In [3]:
# we have to load Pandas (and also NumPy)

# I very strongly encourage you to load NumPy and Pandas and give them these very standard aliases

import numpy as np
import pandas as pd
from pandas import Series   # this allows me to avoid saying pd.Series; I can just say Series

In [4]:
# I'll create a Pandas series with 5 integers
# I do this by creating a new Series, passing it a Python list of integers

s = Series([10, 20, 30, 40, 50])

In [5]:
# we see that s isn't a list, but rather a Pandas series:

type(s)

pandas.core.series.Series

In [6]:
# let's take a look at s
# note: in Jupyter, I don't have to use print to see something. The final line of a cell,
# if it returns a value, is displayed automatically.

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# Series

A series contains values, and also has an index. By default, the index consists of integers that start at 0 and go up to the length of the series - 1.

We can have almost any data types we want in the series, but we'll soon see that we normally stick to a limited set, matching C's data types.

Series are always displayed in this way, with the index and values in two parallel columns. At the end, we see the "dtype," describing the type of data that's in our series.  For now, we'll mostly (not always) have 64-bit integers, known as int64.
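The three parts described above (index, values, and dtype) can each be pulled out of a series. A small sketch, using the same five integers as before:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

print(s.index)        # the default index, starting at 0
print(s.to_numpy())   # the values, as a NumPy array
print(s.dtype)        # the type shared by all of the values
print(s.size)         # the number of elements
```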

In [7]:
# in some ways, a series is like a Python list

s[0]

10

In [8]:
s[1]

20

In [10]:
# can I get the final element with s[-1]?
# no... not quite like a list

s[-1]

KeyError: -1

In [11]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [12]:
s[0] + 10

20

In [13]:
s + s  # can I add a series to itself?  And if so, what do I get back?

0     20
1     40
2     60
3     80
4    100
dtype: int64

In [14]:
# the result of adding a series to itself is a new series,
# one with the same length (and thus index) as before, but in which
# the values are doubled from before.

# what about adding two different series together?

s1 = Series([10, 20, 30, 40, 50])
s2 = Series([100, 200, 300, 400, 500])

s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [15]:
# what if the two series are not the same length?

s3 = Series([123, 456, 789])

s1 + s3  # it works... but we get NaN ("not a number") back

0    133.0
1    476.0
2    819.0
3      NaN
4      NaN
dtype: float64
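Why the NaN values, and why did the dtype change to float64? The addition matches values up by index, and indexes 3 and 4 exist only in `s1`. NaN is a floating-point value, so the whole result becomes floats. If you would rather treat a missing value as 0, one option is the `add` method with `fill_value`. A sketch:

```python
from pandas import Series

s1 = Series([10, 20, 30, 40, 50])
s3 = Series([123, 456, 789])

# Plain + matches values by index; indexes 3 and 4 exist only in s1, so we get NaN there
plain = s1 + s3

# The add method does the same matching, but fill_value=0 stands in for a missing value
filled = s1.add(s3, fill_value=0)
print(filled)
```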

In [16]:
# all operations in Pandas are *vectorized*
# meaning: When I perform an operation on a series, I'm not running it on a single value in the series
# the operation is repeated for every single element.

# if I say s1 + s2, I get a new series back based on s1 and s2, but not modifying them
s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [17]:
# I can, of course, assign the sum back to s1 (or any other variable, including a new one)

series_sum = s1 + s2
series_sum

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [18]:
# I can use all of our favorite Python operators on two series

s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [19]:
s1 - s2

0    -90
1   -180
2   -270
3   -360
4   -450
dtype: int64

In [20]:
s1 * s2

0     1000
1     4000
2     9000
3    16000
4    25000
dtype: int64

In [21]:
s1 / s2   # notice that division always returns a float... so we'll get back floating-point numbers (and dtype)

0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
dtype: float64

In [22]:
# Python has a // operator, which returns an int from division, ignoring the remainder

s1 // s2

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [23]:
s1 ** s2   # s1 to the s2 power; the true results are far too big for int64, so they overflow and display as 0

0    0
1    0
2    0
3    0
4    0
dtype: int64
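Those zeros are not the real answers. A 64-bit integer has a fixed range, and when a result doesn't fit, it silently wraps around. Here is a small sketch of the same effect with numbers that are easier to check by hand (the series here are my own examples, not part of the data above):

```python
from pandas import Series

base = Series([2], dtype='int64')

small = base ** Series([10], dtype='int64')      # 1024 fits easily in 64 bits
too_big = base ** Series([64], dtype='int64')    # 2**64 doesn't fit, so the result wraps around

# Converting to floats first trades exactness for a much larger range
as_float = base.astype('float64') ** Series([64], dtype='int64')

print(small.iloc[0], too_big.iloc[0], as_float.iloc[0])
```

Floats have limits, too: a value as large as 50 to the 500th power would come back as `inf` rather than a number.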

In [25]:
s1 % s2    # divide s1 by s2, return the remainder

0    10
1    20
2    30
3    40
4    50
dtype: int64

# Exercise: Temperature differences

1. Find a web site with the 10-day forecast for your city, including high and low temperatures.
2. Create a series with the high temperatures
3. Create another series with the low temperatures
4. Create a new series showing how much higher the high temp is each day.

In [26]:
# we're working on this exercise for about 5 minutes... that's why it's quiet... but you can ask in the Q&A!

In [27]:
high_temps = Series([18, 20, 21, 23, 25, 25, 26, 23, 23, 24])
low_temps = Series([14, 14, 13, 13, 14, 15, 15, 14, 14, 14])


In [28]:
high_temps

0    18
1    20
2    21
3    23
4    25
5    25
6    26
7    23
8    23
9    24
dtype: int64

In [29]:
low_temps

0    14
1    14
2    13
3    13
4    14
5    15
6    15
7    14
8    14
9    14
dtype: int64

In [30]:
high_temps - low_temps

0     4
1     6
2     8
3    10
4    11
5    10
6    11
7     9
8     9
9    10
dtype: int64

# Next up

1. Descriptive statistics
2. Aggregate methods we can run on a series
3. Setting and retrieving values with .loc, fancy indexing, and slices

# Random data

We will use some manually entered data in this course, especially today and next week.  But sometimes I'll just want to show you some basic data, and using random data is useful.

NumPy (the lower-level library) can create an array of random integers very easily.  We can say:

    np.random.randint(0, 100, 5) 
    
The above code returns 5 integers, from 0 up to and not including 100. We can hand that to `Series`:



In [35]:
np.random.seed(0)    # reset the random-number system to a known state

s = Series(np.random.randint(0, 100, 5))
s

0    44
1    47
2    64
3    67
4    67
dtype: int64

In [36]:
# let's create a series of 10 numbers

np.random.seed(0)    # seed the pseudo-random function with a known starting point, so that we'll know what numbers it'll give us
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64
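To see what the seed is doing, here is a short sketch. Reseeding with the same number replays the same sequence, while asking again without reseeding just continues it. The variable names are mine.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
first = Series(np.random.randint(0, 100, 5))

np.random.seed(0)                                # back to the same starting point
second = Series(np.random.randint(0, 100, 5))    # same numbers as first

third = Series(np.random.randint(0, 100, 5))     # no reseeding: the sequence just continues

print(first.equals(second))
print(first.equals(third))
```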

# Descriptive statistics

We see our numbers in `s`, but how would we describe them? What can we say about our data that would be useful to someone who wants to understand it better? Remember that this is a collection of numbers, so anything general we say will be a little wrong, but it will help with overall understanding.

We could list:
- The lowest number
- The highest number
- The middle number
- How much do the numbers spread out from the middle?

Indeed, these numbers are exactly what we're going to use to describe our data. They are actually known as "descriptive statistics," giving us a numeric picture of our data.

In Pandas, we can get many descriptive statistics with *methods*, functions that we can call on our series:

- Lowest number -- call `min`
- Highest number -- call `max`
- Middle number -- if you want the mean, call `mean` (same as sum / length)
- Middle number -- if you want the median, call `median` (the middle number after lining values up from lowest to highest)
- Spread -- the standard deviation, call `std` -- a std of 0 means that all values == the mean value, but a std of 50 means that they vary quite a lot.
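If you're curious what `std` is calculating, here is a sketch that computes it by hand. By default, Pandas calculates the sample standard deviation, which divides by the number of values minus 1. The intermediate variable names are mine.

```python
import numpy as np
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

mean = s.sum() / s.size                   # same as s.mean()
squared_diffs = (s - mean) ** 2           # distance of each value from the mean, squared
sample_std = np.sqrt(squared_diffs.sum() / (s.size - 1))   # divide by n - 1, as s.std() does by default

print(mean, sample_std, s.std())
```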

In [37]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [38]:
s.min()

9

In [39]:
s.max()

87

In [40]:
s.mean()

52.5

In [41]:
# we can calculate the same mean ourselves with the "sum" method and the "size" attribute
s.sum() / s.size

52.5

In [42]:
s.std()   # return the standard deviation

25.67424130654432

In [43]:
s.mean() - s.std()   # what is a std below the mean

26.82575869345568

In [44]:
s.mean() + s.std()    # what is a std above the mean

78.17424130654432

In [45]:
# if you want the "middle" without being thrown off by very big and very small numbers,
# use the median instead

s.median()

55.5

In [46]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90])
s.mean()

50.0

In [47]:
# s has 9 values, an odd number, so the median is simply the middle value.
# (if there is an even number of values, then the median is the mean of the innermost two.)

s.median()  

50.0

In [48]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 900_000])

In [49]:
s.mean()

100040.0

In [50]:
s.median()

50.0

In [51]:
# these descriptive statistics are so useful that Pandas provides us with a method 
# to get all of them at once: describe!

s.describe()

count         9.000000
mean     100040.000000
std      299985.000875
min          10.000000
25%          30.000000
50%          50.000000
75%          70.000000
max      900000.000000
dtype: float64

In [52]:
# if I line the values of s up in order, from smallest to largest,
# then the value at the 50% mark is the median.

# the values at the 25% and 75% mark are also useful to get a sense of how regularly the 
# values increase. 

# moreover, the value at 75% - the value at 25% is known as the "inter-quartile range," or IQR.
# And often the IQR can be used to establish what is an "outlier" value, too small or too large
# for us to care about.
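Here is one way the IQR can be turned into an outlier test. The 1.5 multiplier is a common convention that I'm assuming here, not something built into Pandas. The final line uses a mask index, which we'll get to later in the course.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70, 80, 900_000])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything more than 1.5 * IQR beyond the quartiles is an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = s.loc[(s < lower) | (s > upper)]
print(outliers)
```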

In [53]:
s.quantile(.25)  

30.0

In [54]:
s.quantile(0.5)  # median

50.0

In [55]:
s.quantile(0.75)  # 75% mark

70.0

In [56]:
# I often use describe when encountering data for the first time,
# because it allows me to get a quick picture of the data's highs, lows,
# and shape.

# Exercise: Descriptive weather statistics

1. Run `describe` on the low temperatures you have from before.
2. Run `describe` on the high temperatures from before.
3. Run `describe` on the difference between them.
4. What conclusions can you draw from looking at these?
5. Which is more useful to you, the mean or the median, in describing the next 10 days of temperatures with one number?

Take 5 minutes...

In [57]:
from pandas import Series   # I do this so that I don't have to say pd.Series

In [59]:
low_temps.describe()

count    10.000000
mean     14.000000
std       0.666667
min      13.000000
25%      14.000000
50%      14.000000
75%      14.000000
max      15.000000
dtype: float64

In [60]:
high_temps.describe()

count    10.000000
mean     22.800000
std       2.485514
min      18.000000
25%      21.500000
50%      23.000000
75%      24.750000
max      26.000000
dtype: float64

In [62]:
# run describe on the series I got back from the difference

(high_temps - low_temps).describe()

count    10.000000
mean      8.800000
std       2.250926
min       4.000000
25%       8.250000
50%       9.500000
75%      10.000000
max      11.000000
dtype: float64

In [65]:
# create a new s with random integers

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [66]:
# how can I retrieve from the series? 
# we've already seen that we can use [], as with a list (although it's not quite the same)

# a far better way to do it is with the .loc accessor
# meaning: say s.loc[INDEX]

s.loc[2]

64

In [67]:
s.loc[7]

21

In [69]:
# you can get away with just using [], not .loc[]
# but once we start to use data frames, that won't be the case.
# I suggest getting used to working with .loc

s.loc[[3, 5]]     # fancy indexing -- ask for two values via a list of integers

3    67
5     9
dtype: int64

In [70]:
s.loc[[2,4,6,8]]

2    64
4    67
6    83
8    36
dtype: int64

In [71]:
s.loc[[2,2,4,4,6,6]]

2    64
2    64
4    67
4    67
6    83
6    83
dtype: int64

In [72]:
# can I set data this way? YES!

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [73]:

s.loc[4] = 999
s

0     44
1     47
2     64
3     67
4    999
5      9
6     83
7     21
8     36
9     87
dtype: int64

In [74]:
s.loc[3] = 888
s

0     44
1     47
2     64
3    888
4    999
5      9
6     83
7     21
8     36
9     87
dtype: int64

In [75]:
s.loc[[3, 4]]

3    888
4    999
dtype: int64

In [76]:
# assign two values to those two elements
s.loc[[3, 4]] = [222,333]

In [77]:
s

0     44
1     47
2     64
3    222
4    333
5      9
6     83
7     21
8     36
9     87
dtype: int64

In [78]:
# I can also retrieve via a slice, as with a list
# but with one big difference!

# normally, slices in Python are [a:b], starting at a, until (not including) b
# but in Pandas, we *do* include the end index!

s.loc[3:5] 

3    222
4    333
5      9
dtype: int64

In [79]:
s.loc[:5]

0     44
1     47
2     64
3    222
4    333
5      9
dtype: int64

In [80]:
s.loc[5:]

5     9
6    83
7    21
8    36
9    87
dtype: int64

In [81]:
# when you retrieve a single value, you get that single value back

# but if you retrieve multiple values -- via a slice or fancy index -- you get a series back,
# a subset of the original series

In [82]:
s.loc[9:]

9    87
dtype: int64
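The inclusive end applies to `.loc`, which works with index labels. Its position-based counterpart, `.iloc`, follows the usual Python rule and stops before the end. A small sketch comparing the two on the values we have in `s` right now:

```python
from pandas import Series

s = Series([44, 47, 64, 222, 333, 9, 83, 21, 36, 87])

by_label = s.loc[3:5]          # labels 3, 4, and 5: the end is included
by_position = s.iloc[3:5]      # positions 3 and 4 only, like a Python list slice
plain_list = s.tolist()[3:5]   # an actual Python list, for comparison

print(by_label.tolist(), by_position.tolist(), plain_list)
```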

# Exercise: Retrieving weather parts

1. Find the mean and max high temperatures in the next 5 days.
2. Find the mean and min low temperatures in the final 3 days of your data.
3. Retrieve the first, 4th, and final values from the high temps, and get descriptive statistics for them.



In [84]:
# this creates a series with a single element, the NumPy array that np.random.randint returned

Series([np.random.randint(0, 100, 10)])

0    [81, 37, 25, 77, 72, 9, 20, 80, 69, 79]
dtype: object

In [85]:
# you want to hand Series a list or similar object
# np.random.randint returns a NumPy array, which is sort of like a list
# don't use square brackets, and then it'll work

Series(np.random.randint(0, 100, 10))

0    47
1    64
2    82
3    99
4    88
5    49
6    29
7    19
8    19
9    14
dtype: int64

In [91]:
# if I get a slice of the next 5 days' high temps:

high_temps.loc[:4].mean()

21.4

In [92]:
high_temps.loc[:4].max()

25

In [97]:
# mean and min for the final 3 days' low temps

low_temps.loc[7:].mean()

14.0

In [98]:
low_temps.loc[7:].min()

14

In [99]:
low_temps.loc[7:].describe()

count     3.0
mean     14.0
std       0.0
min      14.0
25%      14.0
50%      14.0
75%      14.0
max      14.0
dtype: float64

In [100]:
# Retrieve the first, 4th, and final values from the high temps, and get descriptive statistics for them.

high_temps.loc[[0, 3, 9]]   # double square brackets -- the outer pair retrieves, the inner pair is a list for fancy indexing

0    18
3    23
9    24
dtype: int64

In [101]:
high_temps.loc[[0, 3, 9]].describe()

count     3.000000
mean     21.666667
std       3.214550
min      18.000000
25%      20.500000
50%      23.000000
75%      23.500000
max      24.000000
dtype: float64

In [103]:
# how can we retrieve the final element if we don't know its index?
# 2 options. The first calculates the final index from the size, which works with the default index:

high_temps.loc[high_temps.size - 1]

24

In [104]:
# another option is to use .iloc, rather than .loc
# .iloc always goes based on the position
# we can use -1 to indicate "the final element"

high_temps.iloc[-1]

24

# Next up:

1. Broadcasting
2. Operations with broadcasting
3. Comparisons with broadcasting
4. Boolean/mask indexes 

In [105]:
np.random.seed(0)

s1 = Series(np.random.randint(0, 100, 5))
s2 = Series(np.random.randint(0, 100, 5))

s1 + s2  # return a new series with the same index as s1 and s2, adding the values at each index


0     53
1    130
2     85
3    103
4    154
dtype: int64

In [106]:
s1

0    44
1    47
2    64
3    67
4    67
dtype: int64

In [107]:
s2

0     9
1    83
2    21
3    36
4    87
dtype: int64

In [108]:
# what happens if we add a scalar value to s1, rather than another series?

s1 + 5   # it adds 5 to each element of s1 -- this is known as "broadcasting"

0    49
1    52
2    69
3    72
4    72
dtype: int64

In [109]:
# this is how I can actually change s1, rather than get a new (anonymous) series back
s1 = s1 + 5

In [110]:
s1

0    49
1    52
2    69
3    72
4    72
dtype: int64

In [111]:
s1 -= 5    # we can use Python syntax to broadcast a similar operation
s1

0    44
1    47
2    64
3    67
4    67
dtype: int64

In [112]:
s1 * 3

0    132
1    141
2    192
3    201
4    201
dtype: int64

In [113]:
s1 / 4

0    11.00
1    11.75
2    16.00
3    16.75
4    16.75
dtype: float64

In [114]:
s1 ** 2

0    1936
1    2209
2    4096
3    4489
4    4489
dtype: int64

In [115]:
s1 % 3

0    2
1    2
2    1
3    1
4    1
dtype: int64

# Vector operations

- If we run an operation on two series, then the result is a new series with the index from the two input series. Each value is the result of running the operation on the original values at that index.
- If we run an operation on a series and a scalar, then the result is a new series in which each element is the result of running the operation on the original value at that index and the scalar. The scalar is broadcast to every element.
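Both rules in one short sketch. The prices and quantities are made-up example data.

```python
from pandas import Series

prices = Series([10, 20, 30])        # hypothetical example data
quantities = Series([1, 2, 3])

totals = prices * quantities         # series with series: values are paired up by index
with_fee = totals + 5                # series with scalar: 5 is broadcast to every element

print(totals.tolist(), with_fee.tolist())
print(prices.tolist())               # the inputs are unchanged; each operation returned a new series
```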

# Exercise: Temperature conversion

Take your high-temperature series, and convert all of its values to the other scale (Celsius to Fahrenheit, or Fahrenheit to Celsius).

- To convert C to F: `c * 1.8 + 32`
- To convert F to C: `(f - 32) / 1.8`

In [117]:
high_temps * 1.8 + 32

0    64.4
1    68.0
2    69.8
3    73.4
4    77.0
5    77.0
6    78.8
7    73.4
8    73.4
9    75.2
dtype: float64

In [118]:
f_high_temps = high_temps * 1.8 + 32
f_high_temps

0    64.4
1    68.0
2    69.8
3    73.4
4    77.0
5    77.0
6    78.8
7    73.4
8    73.4
9    75.2
dtype: float64

In [119]:
(f_high_temps - 32) / 1.8

0    18.0
1    20.0
2    21.0
3    23.0
4    25.0
5    25.0
6    26.0
7    23.0
8    23.0
9    24.0
dtype: float64

In [120]:
high_temps.mean()

22.8

In [121]:
high_temps - high_temps.mean()   # how far is each value from the mean?

0   -4.8
1   -2.8
2   -1.8
3    0.2
4    2.2
5    2.2
6    3.2
7    0.2
8    0.2
9    1.2
dtype: float64

In [123]:
# we've seen that we can use broadcasting with math operations

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [125]:
s + 30

0     74
1     77
2     94
3     97
4     97
5     39
6    113
7     51
8     66
9    117
dtype: int64

In [126]:
# we've also seen fancy indexing

s[[3, 5, 7]]

3    67
5     9
7    21
dtype: int64

In [127]:
# there's another way to retrieve the items at indexes 3, 5, and 7: a mask index
# if we give s[] a list of boolean (True/False) values, only the values that correspond to True will be returned

#   0      1      2      3     4       5    6      7    8      9
s[[False, False, False, True, False, True, False, True, False, False]]

3    67
5     9
7    21
dtype: int64

In [128]:
# what if I don't use a usual math operation, but rather I use a comparison operation?

s < 50

0     True
1     True
2    False
3    False
4    False
5     True
6    False
7     True
8     True
9    False
dtype: bool

In [129]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

# The story so far

1. If we apply a list of `True`/`False` values to a series, we get back only those values that correspond to `True`
2. If we apply a comparison operator and a scalar to a series, we get back a series of `True`/`False` values

So... what if we apply the boolean series we got back in step 2 as a mask index as per step 1?

In [133]:
# this is known as a "mask index"
# this is how we retrieve selected values in Pandas

# first, we create a boolean series based on a comparison
# then we apply that boolean series in [] on the series
# the result is a new series containing only those values in s where the comparison was True

s.loc[s<50]  # returns all values of s that are < 50

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [134]:
# always look INSIDE of the square brackets, and figure out what boolean index
# was created, before looking outside

# you're applying that boolean index to the series on the outside

s.loc[s>s.mean()]   # returns all values of s that are > s's mean

2    64
3    67
4    67
6    83
9    87
dtype: int64

# When do we use `[]`, and when `[[]]`?

We use `[]` to retrieve from a series:

- If you want to retrieve a single element, then you provide a single value in the `[]`, as in `[5]`.
- If you want to retrieve multiple elements, then you have to provide a non-scalar value.  That can be a Python list, in which case you'll have `[[]]`.  But if it's a Pandas series, or a NumPy array, then it already contains multiple values, and thus you don't need the second set of `[]`.

In [138]:
s.loc[5]   # retrieve one value from s

9

In [137]:
s.loc[[3, 5, 7]]  # retrieve multiple values from s, using fancy indexing

3    67
5     9
7    21
dtype: int64

In [139]:
s.loc[3:7]  # the 3:7 is transformed into a "slice" object in Python, which means multiple values

3    67
4    67
5     9
6    83
7    21
dtype: int64

In [141]:
np.random.seed(0)
s.loc[np.random.randint(0, 5, 3)]  # get 3 randomly chosen elements from among the first 5 in s

4    67
0    44
3    67
dtype: int64

In [143]:
# double square brackets -- we're retrieving based on a list of booleans
s.loc[    [True, False, True, True, False, False, True, True, False, False]  ]

0    44
2    64
3    67
6    83
7    21
dtype: int64

In [145]:
s.loc[s < 40]   # s<40 returns a Pandas series, so no double [] is needed

5     9
7    21
8    36
dtype: int64

# Exercise: Boolean indexes

1. Retrieve low temperatures that are less than the mean.
2. Define a series of 100 random numbers, from 0 up to (but not including) 100.  Find those numbers that are less than the mean - 1 standard deviation.
3. Find the mean of the numbers > mean + 1 standard deviation.

In [147]:
# to find low temperatures that are less than the mean, I need to:

# (1) find the mean, low_temps.mean()
# (2) find low_temps < low_temps.mean() -- which will return a boolean series (of True/False values)
# (3) apply that boolean series as a mask index to low_temps


# step (1)
low_temps.mean()

14.0

In [148]:
# step (2) 
low_temps < low_temps.mean()

0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [149]:
# step(3)
low_temps.loc[low_temps < low_temps.mean()]

2    13
3    13
dtype: int64

In [150]:
# Define a series of 100 random numbers, from 0-100.

np.random.seed(0)
s = Series(np.random.randint(0, 100, 100))
s

0     44
1     47
2     64
3     67
4     67
      ..
95    23
96    79
97    13
98    85
99    48
Length: 100, dtype: int64

In [None]:
# Find those numbers that are less than the mean - 1 standard deviation.

# (1) find the mean
# (2) find the std
# (3) find the mean - 1 std
# (4) find numbers < mean - 1 std, getting a boolean index
# (5) apply that boolean index back to the series

In [151]:
# (1) find the mean
s.mean()

48.23

In [152]:
# (2) find the std
s.std()

28.37229228837476

In [153]:
# (3) find the mean - 1 std
s.mean() - s.std()

19.85770771162524

In [154]:
# (4) find numbers < mean - 1 std, getting a boolean index

s < s.mean() - s.std()


0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97     True
98    False
99    False
Length: 100, dtype: bool

In [155]:
# (5) apply that boolean index back to the series

s.loc[s < s.mean() - s.std()]


5      9
13    12
25     9
37    19
38    19
39    14
43     9
54     0
55     0
58     5
60    17
62     4
66     1
71    11
75     0
76    14
79    12
84     6
87     3
91    15
97    13
dtype: int64

In [None]:
# Find the mean of the numbers > mean + 1 standard deviation.

# (1) find the mean
# (2) find the std
# (3) find the mean + 1 std
# (4) get a boolean series, numbers > mean + 1 std
# (5) apply that boolean series as a mask index

In [156]:
# (1) find the mean
s.mean()

48.23

In [157]:
# (2) find the std
s.std()

28.37229228837476

In [158]:
# (3) find the mean + 1 std

s.mean() + s.std()

76.60229228837476

In [159]:
# (4) get a boolean series, numbers > mean + 1 std

s > s.mean() + s.std()


0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96     True
97    False
98     True
99    False
Length: 100, dtype: bool

In [160]:
# (5) apply that boolean series as a mask index

s[s > s.mean() + s.std()]

6     83
9     87
11    88
12    88
17    87
19    88
20    81
23    77
27    80
29    79
32    82
33    99
34    88
61    79
73    82
74    91
77    99
81    84
90    78
93    99
96    79
98    85
dtype: int64

In [164]:
s[s > s.mean() + s.std()].mean()

85.5909090909091

# What if I want to combine two conditions?

For example, I can find even numbers by looking at `s % 2 == 0`.

And I can find numbers smaller than the mean with `s<s.mean()`.

How can I check for both of these?

Answer: Don't use `and` or `or`, from regular Python.  Instead, you have to use the special operators `&` and `|`, which are normally used in bitwise operations.

Then you also need to put each of the conditions in `()`, to avoid problems with operator precedence.
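A sketch of both operators, and of what goes wrong with `and`. I've given each condition its own variable name here, which is another way to avoid the precedence problem. Without names or parentheses, `&` would be applied before `==` and `<`.

```python
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

is_even = s % 2 == 0
below_mean = s < s.mean()

both = s.loc[is_even & below_mean]       # even AND below the mean
either = s.loc[is_even | below_mean]     # even OR below the mean

# Python's "and" asks for a single True/False from a whole series, which Pandas refuses to give
and_failed = False
try:
    s.loc[is_even and below_mean]
except ValueError as e:
    and_failed = True
    print('ValueError:', e)

print(both.tolist(), either.tolist())
```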

In [163]:
# the & takes a boolean series on its left and another on its right, and it compares the True/False
# values at each index. It returns a new series with the same index and size as its two arguments.

# Wherever both boolean series are True at a given index, the result is True there.  Otherwise, it's False.
# In the case of "or", using |, the result is True wherever either of the series has a True value.

s.loc[(s % 2 == 0) & (s < s.mean())]

0     44
8     36
13    12
18    46
26    20
39    14
41    32
45    32
52    28
53    34
54     0
55     0
56    36
59    38
62     4
63    42
72    46
75     0
76    14
79    12
80    42
84     6
92    20
99    48
dtype: int64

In [166]:
# chaining the conditions: apply the first mask, then apply the second mask to the result

s.loc[s % 2 == 0].loc[s < s.mean()]

0     44
8     36
13    12
18    46
26    20
39    14
41    32
45    32
52    28
53    34
54     0
55     0
56    36
59    38
62     4
63    42
72    46
75     0
76    14
79    12
80    42
84     6
92    20
99    48
dtype: int64

# Next up

1. Indexes -- non-numeric, loc/iloc, non-unique, setting
2. Useful methods to know



In [167]:
s = Series([10, 20, 30, 40, 50])

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [168]:
# we can look at the index, by asking for .index on our series
s.index

RangeIndex(start=0, stop=5, step=1)

In [169]:
# If I select values from s, the index doesn't reset to be 0-(max)

s.loc[[1, 3, 4]]


1    20
3    40
4    50
dtype: int64

In [170]:
# what if I assign a new index to s?

s.index = [92, 876, 135, 2, 81]

In [171]:
s

92     10
876    20
135    30
2      40
81     50
dtype: int64

In [173]:
# how do I retrieve an item from s now?

s.loc[92]

10

In [174]:
s.loc[2]

40

In [175]:
s.loc[[876, 81]]

876    20
81     50
dtype: int64

In [177]:
# I can use a slice here, too!

s.loc[876:2]

876    20
135    30
2      40
dtype: int64

In [None]:
# an alternative to .loc is .iloc
# .loc always uses the index that was defined on our series
# .iloc always uses the numeric position, starting at 0 and going up to s.size - 1
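With a non-default index, the difference between the two becomes visible. A sketch using the same index we just assigned:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=[92, 876, 135, 2, 81])

print(s.loc[2])     # the value whose index label is 2
print(s.iloc[2])    # the value in position 2, meaning the third element
print(s.iloc[-1])   # the final element, whatever its label
```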

# Why set the index?

Many reasons:

1. It's easier to retrieve values via the index than a mask index.
2. If you have usernames, user IDs, filenames, etc., those can be really useful in an index.

The index can contain any values at all, but it's typical for it to only have numbers, strings, and date/time values.
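As a sketch of the second reason, here is a series of login counts indexed by username. The names and numbers are made up for illustration.

```python
from pandas import Series

# Hypothetical data: the number of logins per user, indexed by username
logins = Series([12, 3, 27], index=['alice', 'bob', 'carol'])

print(logins.loc['bob'])                   # look up one user directly by name
print(logins.loc[['alice', 'carol']])      # or several users, via fancy indexing

busy_users = logins.loc[logins > 10].index.tolist()   # the index also labels whatever a mask returns
print(busy_users)
```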

In [179]:
# When we create a series, we can pass the argument index= and a list/series of values.

np.random.seed(0)
s = Series(np.random.randint(0, 100, 5), index=list('abcde'))
s

a    44
b    47
c    64
d    67
e    67
dtype: int64

In [180]:
s.loc['a']

44

In [181]:
s.loc[['b', 'c']]

b    47
c    64
dtype: int64

In [183]:
s.loc['b':'d']   # slices on strings!

b    47
c    64
d    67
dtype: int64

In [186]:
# what if I have two series with the same textual index?

np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 5), index=list('abcde'))
s2 = Series(np.random.randint(0, 100, 5), index=list('abcde'))

In [188]:
s1

a    44
b    47
c    64
d    67
e    67
dtype: int64

In [189]:
s2

a     9
b    83
c    21
d    36
e    87
dtype: int64

In [191]:
s1 + s2   # add by the index!

a     53
b    130
c     85
d    103
e    154
dtype: int64

In [192]:
# what if s1's index is abcde and s2's index is edcba?

np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 5), index=list('abcde'))
s2 = Series(np.random.randint(0, 100, 5), index=list('edcba'))

In [194]:
# we still use the index, even if it's reversed!
s1 + s2  

a    131
b     83
c     85
d    150
e     76
dtype: int64

In [195]:
# what about if the indexes don't quite match?


np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 5), index=list('abcde'))  # only s1 has a, b
s2 = Series(np.random.randint(0, 100, 5), index=list('cdefg'))  # only s2 has f, g

In [197]:
s1 + s2   # when we add these together, matching indexes are added, while others result in NaN

a      NaN
b      NaN
c     73.0
d    150.0
e     88.0
f      NaN
g      NaN
dtype: float64

In [198]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 5), index=list('abcde'))  
s2 = Series(np.random.randint(0, 100, 5), index=list('abcab'))  

In [199]:
s1

a    44
b    47
c    64
d    67
e    67
dtype: int64

In [200]:
s2

a     9
b    83
c    21
a    36
b    87
dtype: int64

In [201]:
# what happens when we retrieve from s2?

s2.loc['a']

a     9
a    36
dtype: int64

In [202]:
s2.loc['b']

b    83
b    87
dtype: int64

In [203]:
s2.loc['c']

21

In [204]:
s1 + s2

a     53.0
a     80.0
b    130.0
b    134.0
c     85.0
d      NaN
e      NaN
dtype: float64
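Repeated index values are allowed, but they change what you get back, so it can be worth checking for them. A sketch of two ways to check, using the same `s2`:

```python
from pandas import Series

s2 = Series([9, 83, 21, 36, 87], index=list('abcab'))

print(s2.index.is_unique)                # False, because 'a' and 'b' each appear twice
print(s2.index.duplicated().tolist())    # marks the second and later appearances of a label

# Same kind of request, different kinds of results
print(type(s2.loc['a']))   # a series, because two elements have the label 'a'
print(type(s2.loc['c']))   # a single value, because only one element has the label 'c'
```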

# Exercise: Family members

1. Create a series in which the values will be the ages of people in your family (or friends, if you prefer), and the index will contain strings, the people's names.
2. Retrieve all of the people below the mean age.
3. Retrieve yourself and your spouse/friend/partner (as long as it's two people), by specifying a fancy index.

In [206]:
s = Series([52, 21, 19, 17],
          index='Reuven Atara Shikma Amotz'.split())
s

Reuven    52
Atara     21
Shikma    19
Amotz     17
dtype: int64

In [207]:
s.loc['Reuven']

52

In [208]:
s.loc['Amotz']

17

In [209]:
s.mean()

27.25

In [210]:
s < s.mean()

Reuven    False
Atara      True
Shikma     True
Amotz      True
dtype: bool

In [211]:
s[s < s.mean()]

Atara     21
Shikma    19
Amotz     17
dtype: int64

In [213]:
# retrieve two people via fancy indexing
s.loc[['Reuven', 'Atara']]

Reuven    52
Atara     21
dtype: int64

In [214]:
# could I have done all of this via .iloc, rather than .loc?  YES!
s.iloc[[0, 1]]

Reuven    52
Atara     21
dtype: int64

# Assignments via indexes

Just as we can retrieve values via indexes, we can assign to them, also.

We previously saw that we can assign via a fancy index.
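Every way of retrieving with `.loc` also works on the left side of an assignment. A compact sketch on a small series of my own:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=list('abcde'))

s.loc['a'] = 11                  # one label, one value
s.loc[['b', 'c']] = [22, 33]     # fancy index, one value per label
s.loc['d':'e'] = 0               # slice (end included), with one value given to every label in it
s.loc[s > 20] = 99               # mask index, with one value given wherever the mask is True

print(s)
```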

In [217]:
np.random.seed(0)
s = Series(np.random.randint(0, 1000, 10), index=list('abcdefghij'))
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [218]:
# assignment via fancy indexing
s.loc[['c', 'f', 'j']] = [10, 20, 30]
s

a    684
b    559
c     10
d    192
e    835
f     20
g    707
h    359
i      9
j     30
dtype: int64

In [219]:
# what if I assign a scalar value?
# we broadcast the assignment to all of the indexes named on the left!
s.loc[['c', 'f', 'j']] = 999

In [220]:
s

a    684
b    559
c    999
d    192
e    835
f    999
g    707
h    359
i      9
j    999
dtype: int64

In [221]:
# find the even numbers
s%2 == 0

a     True
b    False
c    False
d     True
e    False
f    False
g    False
h    False
i    False
j    False
dtype: bool