# Agenda

1. Getting started with data analysis
    - Jupyter (as a tool)
    - What is data analysis?
    - Pandas -- what is it?
    - Pandas series
    - Descriptive statistics
    - Aggregate methods
    - Setting and retrieving values
    - Broadcasting
    - Mask arrays
2. Data frames (for 2D data)
    - Indexes vs. columns
    - Retrieving
    - Setting
    - Querying with mask arrays
    - Reading CSV data from external files
3. Real-world data
    - Sorting
    - Grouping
    - Cleaning
    - Pivot table
4. Text and time
    - Working with text data
    - Textual statistics
    - Dates and times
    - Time deltas
    - Time series
    - Resampling
5. Visualization
    - Line plots
    - Bar plots
    - Pie plots
    - Scatter plots


# Data analysis -- what is it?

1. Collecting data
2. Understanding the data
3. Making decisions based on that data

# Where does data come from?

- Phones
- Vehicles
- Temperature sensors

# Data science -- what is it?

1. Data analytics -- take data from the past, and understand it
2. Data engineering -- the logistics of working with data, even when it's huge or hard to work with
3. Machine learning -- take the data from the past, and predict the future with it

Data science is these three things together, coupled with a scientific approach -- we ask a question, and we try to get an answer that we believe is justifiable with the data.

# Why are we using Python?

Python is known as a language that uses lots of memory and is slow to execute.

Data analytics needs to work with lots of data, and we want fast results.

The answer: NumPy, which provides C-style data and speed, but Python-style access.

NumPy is a little low level for many people. That's where Pandas comes in.  Pandas is a high-level data analysis system that uses NumPy behind the scenes (for speed and efficiency) but gives us very easy analysis in Python.

Pandas is sort of like Excel, inside of Python.

# What Python will we *not* be using?

- `if`/`else`
- `for` or `while` loops
- `def` to define new functions

# What will we be using?

- Basic data structures (ints, floats, strings, and lists) to communicate with Pandas.
- Comparison operators (e.g., `==` and `<`)
- Variable assignment
- Pandas object methods and constructors

# Quick Jupyter tutorial

Jupyter is the tool I'm using to type this, and which I'll use for the entire course. It lets us run code and write documentation in the browser. Jupyter is super popular among data scientists.

Learn more at https://Jupyter.org/.

Every rectangle in Jupyter is known as a "cell." To execute a cell, you press shift+Enter.




I just pressed Enter a few times, and moved down those lines.  But if I press shift+Enter...

# Jupyter has two modes

- Edit mode -- the outline of the cell is green. Enter edit mode by clicking inside of the cell or pressing ENTER. You can then type to enter text.
- Command mode -- the outline of the cell is blue. Enter command mode by clicking to the left of the cell or pressing ESC. Anything you type is interpreted as a command to Jupyter.

When you're in command mode, here are some commands you can use:
- `h` -- help
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the cell you copied or cut
- `m` -- set cell type to Markdown (for documentation)
- `y` -- set cell type to Python code (for coding, this is the default)
- `a` -- create a new cell above the current one
- `b` -- create a new cell below the current one

In [1]:
print('Hello!')

Hello!


In [2]:
# to use Pandas, I have to load it!

import numpy as np           # easier access to NumPy things (the lower-level library)
import pandas as pd          # import the Pandas library, but give it an alias, "pd"
from pandas import Series    # this way, we don't have to say pd.Series

In [3]:
x = 1   # regular Python code

In [4]:
# how much memory does that 1 take up?
import sys
sys.getsizeof(x)   # how many bytes does this 1 take?

28

In [5]:
# I create a Pandas series
# I use the name Series that I just imported
# I hand Series a Python list of integers
# behind the scenes, Pandas turns this into a NumPy array, and assigns the series to the variable s

s = Series([10, 20, 30, 40, 50])

In [6]:
# what kind of data does s contain?
type(s)

pandas.core.series.Series

In [7]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What is the `dtype`?

In pure Python, we have regular integers. We don't need to think about how many bits our integer contains. But in Pandas, because we're using these low-level C types, we need to think about how many bits each value contains.

By default, integers in Pandas are all 64-bit ints.  We see that, whenever we have data in a series.
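Here's a small sketch of what "64-bit" means in practice: we can ask a series for its dtype, and ask the dtype how many bytes each value takes. I'm passing `dtype='int64'` explicitly here (my own choice for this example), so that the result doesn't depend on the platform's default.

```python
from pandas import Series

# Request 64-bit integers explicitly, so the result is the same on every platform
s = Series([10, 20, 30, 40, 50], dtype='int64')

bytes_per_value = s.dtype.itemsize      # how many bytes each value takes
bits_per_value = bytes_per_value * 8    # there are 8 bits in a byte

print(s.dtype, bytes_per_value, bits_per_value)
```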

In [8]:
# what operations can I run on s?

s[0]  # get the first item

10

In [9]:
s[1]   # get the second item

20

In [10]:
len(s)   # how many items are in s?

5

In [13]:
# can I modify one of these items using the index? Yes!

s[2] = 999

In [14]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [15]:
s[-1]    # normally, Python allows us to retrieve the final element with index -1 ... but not here!

KeyError: -1

In [16]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [17]:
s2 = Series([2, 4, 6, 8, 10])

In [18]:
# what happens if I add s and s2?

s + s2

0      12
1      24
2    1005
3      48
4      60
dtype: int64

# What happened here?

When you add two series together, Pandas gives you a new series back.

That series is of the same length as the two series you added, and the value at each index is the result of adding the two inputs at that index.

Any math operation you run on two series will actually be run on the values at each index.

In [19]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [20]:
s2

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [21]:
s + s2

0      12
1      24
2    1005
3      48
4      60
dtype: int64

In [22]:
s - s2

0      8
1     16
2    993
3     32
4     40
dtype: int64

In [23]:
s * s2

0      20
1      80
2    5994
3     320
4     500
dtype: int64

In [24]:
s / s2

0      5.0
1      5.0
2    166.5
3      5.0
4      5.0
dtype: float64

In [25]:
s // s2    # this returns integer division, ignoring any remainder

0      5
1      5
2    166
3      5
4      5
dtype: int64

In [26]:
s ** s2    # exponentiate 

0                   100
1                160000
2    994014980014994001
3         6553600000000
4     97656250000000000
dtype: int64

In [27]:
s % s2   # modulus

0    0
1    0
2    3
3    0
4    0
dtype: int64

In [28]:
s2

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [29]:
s2[3] = 8.5

In [30]:
s2

0     2.0
1     4.0
2     6.0
3     8.5
4    10.0
dtype: float64

# Lists vs. series

A Python list traditionally contains only one type of value. But that's a convention; lists can contain any number of values, and each value can be of any type. 

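In Pandas, a series cannot contain more than one type. Every single value must be of the same dtype. This means that if you assign a float to one element of a series that previously contained only integers, the dtype will change to reflect that. (Recent versions of Pandas warn about this kind of implicit change, and may eventually reject it, so it's safer to convert the series explicitly first.)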
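If you know that a series of integers will need to hold a float, one option is to change the dtype yourself before assigning. This is just a sketch; `.astype` returns a new series, which I assign back to the same variable.

```python
from pandas import Series

s2 = Series([2, 4, 6, 8, 10], dtype='int64')

# Convert to floats explicitly, rather than relying on an implicit dtype change
s2 = s2.astype('float64')
s2.loc[3] = 8.5

print(s2.dtype)
```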

In [31]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [32]:
s ** 100000  # this goes far beyond the ability of 64-bit integers!

0                      0
1                      0
2   -2115034505054000383
3                      0
4                      0
dtype: int64
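Notice that we didn't get an error, just wrong answers. Here's a minimal sketch of the same effect with smaller numbers: NumPy's `np.iinfo` reports the range that an `int64` can hold, and a result outside that range silently wraps around.

```python
import numpy as np
from pandas import Series

limits = np.iinfo(np.int64)      # the smallest and largest values an int64 can hold
print(limits.min, limits.max)

s = Series([2 ** 62], dtype='int64')
wrapped = s * 4                  # 2**64 doesn't fit in 64 bits, so the result wraps around

print(wrapped.iloc[0])
```

If you expect values this large, floats (which trade precision for range) are one possible alternative.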

# Exercise: Weather forecast

1. Create a series with 10 elements, the high temperatures forecast for your city in the next 10 days.
2. Create a second series with 10 elements, the low temperatures forecast for your city in the next 10 days.
3. Find out the difference between the high and low temps that your city will experience.

In [34]:
highs = Series([19, 19, 23, 27, 27, 25, 21, 18, 19, 20])
lows = Series([10, 9, 8, 10, 13, 11, 14, 11, 11, 11])

In [35]:
highs - lows

0     9
1    10
2    15
3    17
4    14
5    14
6     7
7     7
8     8
9     9
dtype: int64

In [36]:
highs - s   # what happens if we subtract two series of different lengths?

0      9.0
1     -1.0
2   -976.0
3    -13.0
4    -23.0
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
dtype: float64
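What happened? Pandas matches values by index, not by position. Indexes 5 through 9 exist only in `highs`, so there is nothing to subtract, and we get `NaN`. Because `NaN` is a float, the whole result becomes `float64`. Here's a sketch of that, along with the `.sub` method, whose `fill_value` argument stands in for whatever is missing on one side. Whether 0 is a sensible stand-in depends on your data; I chose it only for illustration.

```python
from pandas import Series

highs = Series([19, 19, 23, 27, 27, 25, 21, 18, 19, 20])
s = Series([10, 20, 999, 40, 50])

# The operator matches values by index; indexes 5-9 exist only in highs, so we get NaN there
diff = highs - s

# The method form accepts fill_value, used in place of a value missing from one side
filled = highs.sub(s, fill_value=0)

print(diff.isna().sum())    # how many NaN values did we get?
print(filled.loc[5])        # highs at index 5, minus the fill value
```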

# Next up

- Methods we can run on our series
- Aggregate methods
- Descriptive statistics

# Methods in Python

Methods are object-oriented functions -- we can invoke them via a `.` and then the name of the method.

Some examples from regular Python:

```python
name = 'reuven'
name.upper()     # this returns a new string, 'REUVEN', based on name
```

# Methods in Pandas

We can invoke a large number of different methods on a series. If you're in Jupyter, then you can always see what methods are available on an object by naming the object, writing a `.`, and then pressing `TAB`.

In [37]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [39]:
# some methods we'll want to use

s.count()   # returns the number of values in our series (ignoring NaN)

5

In [41]:
# another way to find the size of our series is to ask the size

s.size    # how many values are there, including NaN?  NOTICE -- NOT A METHOD

5

In [42]:
# what if we want the average, or mean, of the numbers?

s = Series([10, 20, 30, 40, 50])
s.mean()

30.0

In [43]:
# what is the mean? It's the total of all values, divided by the size

# it turns out that there's a .sum() method, which returns the total
s.sum()

150

In [44]:
# this is the same as calling s.mean()
s.sum() / s.size

30.0
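One caveat: `s.sum() / s.size` matches `s.mean()` only when the series has no `NaN` values. Here's a small sketch with a made-up series that includes one `NaN`, showing how `.count()` and `.size` differ, and which of them `.mean()` agrees with.

```python
import numpy as np
from pandas import Series

s = Series([10, np.nan, 30])

print(s.size)                # every slot, including the NaN
print(s.count())             # only the non-NaN values
print(s.mean())              # NaN is skipped by default
print(s.sum() / s.count())   # dividing by the count matches the mean
print(s.sum() / s.size)      # dividing by the size does not
```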

In [45]:
# the issue with calculating the mean is that it can be skewed by one or two large outliers
# sometimes, the mean isn't the best way to understand our data

# instead, we'll want to find the median -- we line the values up from smallest to biggest, and find
# the value at that halfway point

s.median()

30.0

In [46]:
# if the mean and median are the same (or close), our data is roughly symmetric --
# that hints at, but doesn't prove, a normal distribution

In [47]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [48]:
s[4] = 1000000

In [49]:
s.median()   # this hasn't changed!

30.0

In [50]:
s.mean()    # this has changed a bit

200020.0

In [51]:
# what are the minimum and maximum values?

s.min()

10

In [52]:
s.max()

1000000

In [54]:
s.mode()   # which value(s) appear most?

0         10
1         20
2         30
3         40
4    1000000
dtype: int64

In [55]:
s[4] = 40    # make 40 appear twice in our series

In [57]:
s.mode()     # show the value(s) that appear most

0    40
dtype: int64

In [58]:
# we have the min, the median, and the max
# sometimes we want to know what the values are along the way
# it's common to ask for the 0.25 and 0.75 quantiles (the 25th and 75th percentiles)

s.quantile(0.25)

20.0

In [59]:
s.quantile(0.75)

40.0

In [60]:
# some data analysts want to know the IQR -- the inter-quartile range
# this is the 0.75 quantile minus the 0.25 quantile
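I didn't calculate the IQR above, so here is a quick sketch of how it might be done, using the same five values that `s` contains at this point. The variable names `q1`, `q3`, and `iqr` are my own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 40])

q1 = s.quantile(0.25)   # the value one quarter of the way through the sorted data
q3 = s.quantile(0.75)   # the value three quarters of the way through
iqr = q3 - q1           # the inter-quartile range

print(iqr)
```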

In [61]:
# all of these are useful in various ways
# together, they are known as "descriptive statistics"

# John Tukey popularized summaries like these in his work on exploratory data analysis

# Pandas has a method that gives us all of this information about a series -- it's called "describe"

In [62]:
s.describe()

count     5.000000
mean     28.000000
std      13.038405
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      40.000000
dtype: float64

In [63]:
# standard deviation -- how much does the data fluctuate/differ from the mean?
# a std of 0 means -- all values in the data set are the same
# a big standard deviation means that the values vary a lot from the mean
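To make the standard deviation less mysterious, here's a sketch that calculates it by hand and compares the result with `.std()`. By default, Pandas calculates the *sample* standard deviation, which divides by the number of values minus 1. The names `deviations`, `variance`, and `manual_std` are my own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 40])

deviations = s - s.mean()                              # how far is each value from the mean?
variance = (deviations ** 2).sum() / (s.count() - 1)   # average squared distance, dividing by n-1
manual_std = variance ** 0.5                           # the square root brings us back to the original units

print(manual_std)
print(s.std())
```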

In [64]:
# when we invoke describe, we are getting back... a series!
# the series has a text index
# the dtype is float64

# we can retrieve one part of this with square brackets

s.describe()['min']    # don't do this!  just use the .min() method

10.0

# Exercise: Weather analysis

1. Create a series containing the high-temperature forecast for your city in the next 10 days.
2. Create a series containing the low-temperature forecast for your city in the next 10 days.
3. Use the `.describe` method to get descriptive statistics for each of those.  What do you see? Does the standard deviation seem to describe the variation in temperature you'll see?
4. Get the difference between highs and lows, and then run `.describe` on that.  

In [65]:
highs

0    19
1    19
2    23
3    27
4    27
5    25
6    21
7    18
8    19
9    20
dtype: int64

In [66]:
lows

0    10
1     9
2     8
3    10
4    13
5    11
6    14
7    11
8    11
9    11
dtype: int64

In [67]:
highs.describe()

count    10.000000
mean     21.800000
std       3.457681
min      18.000000
25%      19.000000
50%      20.500000
75%      24.500000
max      27.000000
dtype: float64

In [69]:
lows.describe()

count    10.00000
mean     10.80000
std       1.75119
min       8.00000
25%      10.00000
50%      11.00000
75%      11.00000
max      14.00000
dtype: float64

In [71]:
# option 1: assign the difference to a variable
temp_diffs = highs - lows

In [72]:
temp_diffs.describe()

count    10.000000
mean     11.000000
std       3.651484
min       7.000000
25%       8.250000
50%       9.500000
75%      14.000000
max      17.000000
dtype: float64

In [73]:
# option 2: use parentheses and then run it directly, without a variable

(highs - lows).describe()

count    10.000000
mean     11.000000
std       3.651484
min       7.000000
25%       8.250000
50%       9.500000
75%      14.000000
max      17.000000
dtype: float64

# Getting results printed

In normal Python programs, nothing is printed without using the `print` function.

But in Jupyter, if the final line in your cell returns a result, then that result is displayed right after the cell.

In [75]:
x = 5
y = 10

x + y     # final line of the cell, and it returns a value (15), so that is displayed

15

In [76]:
# what if I have three things in the same cell?  Only the final one will be displayed.

x + y
x * y
x - y

-5

# Setting and retrieving values

We've seen that we can set and retrieve values with `[]`.  

In [77]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [78]:
s[3]  # retrieve at index 3

40

In [79]:
s[7]  # retrieve at index 7

80

In [80]:
# what if we try an index that doesn't exist?

s[100]

KeyError: 100

In [81]:
s[5] = 9876

In [82]:
s

0      10
1      20
2      30
3      40
4      50
5    9876
6      70
7      80
8      90
9     100
dtype: int64

# Use `.loc` and `.iloc`

It's true that you can use `[]` to set and retrieve values in a series. From now on, I ask you *not* to do that.

Instead, I want you to use `.loc` and `.iloc`.

In [83]:
# to retrieve from our series at index i, we can say s.loc[i]  (square brackets!)

s.loc[0]   # same as s[0]

10

In [84]:
s.loc[5]   # same as s[5]

9876

In [85]:
# if we have .loc, why do we also have .iloc? What's the difference?
# answer: .loc will work with whatever index we have, even if the index uses text or weird numbers
# .iloc will always start at 0 and go to len(s)-1.

s.iloc[0]   # same as s[0]

10

In [86]:
s.iloc[5]  # same as s[5]

9876
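With the default index, `.loc` and `.iloc` give the same answers, which hides the difference between them. Here's a sketch with a made-up series whose index is 10, 20, and 30 (set via the `index=` keyword argument, which we'll discuss later), so that labels and positions no longer match.

```python
from pandas import Series

s = Series([100, 200, 300], index=[10, 20, 30])

print(s.loc[10])      # the value whose index label is 10
print(s.iloc[0])      # the value at position 0 -- the same element
print(s.loc[30])      # the value whose index label is 30
print(s.iloc[2])      # the value at position 2 -- the same element
print(0 in s.index)   # there's no label 0, so s.loc[0] would raise a KeyError
```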

In [87]:
s.loc[5] = 12345

In [88]:
s

0       10
1       20
2       30
3       40
4       50
5    12345
6       70
7       80
8       90
9      100
dtype: int64

In [89]:
# what is s at index 2?
s.loc[2]

30

In [90]:
# what is s at index 7?
s.loc[7]

80

In [91]:
# I want the items at both 2 and 7
# for this I'll use FANCY INDEXING -- meaning, we pass a list of indexes!

s.loc[[2, 7]]  # outer [] are for .loc, inner [] are because we're passing a list

2    30
7    80
dtype: int64

# Fancy indexing

To use fancy indexing, just pass a list of indexes to `.loc` (or `.iloc`).

The result will be a series whose elements have the indexes that you named.

Don't forget to use two sets of square brackets with `.loc`.
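The examples here all use `.loc`, so here's a sketch of the same idea with `.iloc`. Because `.iloc` works with positions, negative numbers count from the end, just as they do in a regular Python list.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

by_label = s.loc[[2, 7]]        # the elements whose index labels are 2 and 7
by_position = s.iloc[[0, -1]]   # the first and last elements, by position

print(by_label)
print(by_position)
```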

In [93]:
s.loc[[2,7,2,7]]   # can we repeat index values? YES!

2    30
7    80
2    30
7    80
dtype: int64

In [94]:
# can we assign using fancy indexing? YES!

s.loc[[2,7]]

2    30
7    80
dtype: int64

In [95]:
s.loc[[2,7]] = [100, 200]   # assign to more than one index!

In [96]:
s

0       10
1       20
2      100
3       40
4       50
5    12345
6       70
7      200
8       90
9      100
dtype: int64

In [98]:
# slices -- (almost) just like in regular Python

s.loc[2:7]     # [from:until]  -- it's up to *AND INCLUDING* the final index

2      100
3       40
4       50
5    12345
6       70
7      200
dtype: int64

In [100]:
# what about slices on .iloc? Can I do that?

s.iloc[2:7]  # now it's up to and NOT including, just like in traditional Python

2      100
3       40
4       50
5    12345
6       70
dtype: int64
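To make the inclusive/exclusive difference concrete, here's a sketch comparing the two kinds of slices on a series with the default index. With this index, you need to go one further with `.iloc` to get the same elements that `.loc` gives you.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

label_slice = s.loc[2:7]       # labels 2 through 7, including 7
position_slice = s.iloc[2:7]   # positions 2 through 6, not including 7

print(len(label_slice), len(position_slice))
print(s.iloc[2:8].equals(label_slice))   # ending at 8 makes the .iloc slice match
```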

In [101]:
s.loc[[2]]  # fancy indexing with a single value

2    100
dtype: int64

In [102]:
s.loc[2]   # here, I'm asking .loc to retrieve a single element. I thus get back the integer

100

In [103]:
s.loc[[2]]   # here, I'm asking for a list of elements... the list just happens to have one element


2    100
dtype: int64

# Exercise: More with weather!

1. Create a series showing how many mm of rain/snow will fall in your city in the next 10 days.
2. Get the descriptive statistics for the rain forecast.
3. What is the mean expected between days 4 and 7, inclusive?
4. What is the max expected between days 2 and 5, exclusive? (Meaning: Don't include the final day.)
5. Modify day 3's rainfall to be twice the current value.
6. How does that change the mean between days 4 and 7, inclusive?

In [104]:
rainfall = Series([0.8, 0.6, 0, 0, 0, 0, 0.9, 2.3, 1.1, 0])
rainfall.size

10

In [105]:
rainfall.describe()

count    10.000000
mean      0.570000
std       0.749889
min       0.000000
25%       0.000000
50%       0.300000
75%       0.875000
max       2.300000
dtype: float64

In [107]:
# get days 4-7 inclusive

rainfall.loc[4:7].mean()

0.7999999999999999

In [110]:
# max from days 2-5, exclusive

rainfall.iloc[2:5].max()

0.0

In [114]:
# modify day 1's rainfall to be twice the current value  (I changed from 3 so it'll be non-zero)

rainfall.loc[1] = rainfall.loc[1] * 2

In [116]:
# how did that change our values from days 4-7?
rainfall.loc[4:7].mean()

0.7999999999999999

In [117]:
rainfall.loc[4:7].describe()

count    4.000000
mean     0.800000
std      1.086278
min      0.000000
25%      0.000000
50%      0.450000
75%      1.250000
max      2.300000
dtype: float64

In [118]:
rainfall.loc[5] = rainfall.loc[1] * 2

In [119]:
rainfall.loc[4:7].describe()

count    4.000000
mean     1.400000
std      1.157584
min      0.000000
25%      0.675000
50%      1.600000
75%      2.325000
max      2.400000
dtype: float64

# Next up

1. Broadcasting operations
2. Mask arrays



In [120]:
s = Series([10, 20, 30, 40, 50])
s2 = Series([2, 4, 6, 8, 10])

In [121]:
s + s2   # vectorized -- the operation is performed on each index, and we get back a new series

0    12
1    24
2    36
3    48
4    60
dtype: int64

In [122]:
# what happens if I perform the operation with a series and a scalar (single) value?

s + 3    # +3 operation is "broadcast" to each of the elements of s, and we get back a new series

0    13
1    23
2    33
3    43
4    53
dtype: int64

In [123]:
s - 3

0     7
1    17
2    27
3    37
4    47
dtype: int64

In [124]:
s * 3

0     30
1     60
2     90
3    120
4    150
dtype: int64

In [125]:
s / 3

0     3.333333
1     6.666667
2    10.000000
3    13.333333
4    16.666667
dtype: float64

In [126]:
s // 3   # floordiv

0     3
1     6
2    10
3    13
4    16
dtype: int64

In [127]:
s ** 3

0      1000
1      8000
2     27000
3     64000
4    125000
dtype: int64

In [128]:
s % 3

0    1
1    2
2    0
3    1
4    2
dtype: int64

In [129]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [131]:
# let's assume that s contains prices, and that the VAT on products is 17%
# can I get the product price including VAT?

prices_including_vat = s * 1.17   # VAT in Israel is 17%, so I multiply by 1.17 to get the price + VAT
prices_including_vat

0    11.7
1    23.4
2    35.1
3    46.8
4    58.5
dtype: float64

In [134]:
# how can we get random numbers in Pandas?
# we'll actually use NumPy to do that, asking for either random integers or random floats

np.random.seed(0)                           # start the random number generator from a known state
s = Series(np.random.randint(0, 100, 10))   # give me 10 random ints, each from 0 through 99 (100 is excluded) -- then pass it to Series
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [135]:
s * 10

0    440
1    470
2    640
3    670
4    670
5     90
6    830
7    210
8    360
9    870
dtype: int64

In [136]:
s.describe()

count    10.000000
mean     52.500000
std      25.674241
min       9.000000
25%      38.000000
50%      55.500000
75%      67.000000
max      87.000000
dtype: float64

In [138]:
# how can I get random floats?
# NumPy provides me with a solution here, too -- I can call np.random.rand(10)
# that'll return 10 random floats, each at least 0 and less than 1

np.random.seed(0)     # set the random number system to start at a known state
np.random.rand(10) * 100

array([54.88135039, 71.51893664, 60.27633761, 54.4883183 , 42.36547993,
       64.58941131, 43.75872113, 89.17730008, 96.36627605, 38.34415188])
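NumPy also offers a newer interface for random numbers, in which you create a generator object with `np.random.default_rng` and call methods on it. I'm not using it in this course, but here's a sketch of what the same ideas might look like. Note that it produces different numbers from `np.random.seed(0)`, even with the same seed.

```python
import numpy as np
from pandas import Series

rng = np.random.default_rng(0)                 # a generator object, starting from a known state

ints = Series(rng.integers(0, 100, size=10))   # 10 random ints, each from 0 through 99
floats = Series(rng.random(10) * 100)          # 10 random floats, each at least 0 and less than 100

# A second generator with the same seed produces the same numbers
same_ints = Series(np.random.default_rng(0).integers(0, 100, size=10))
print(ints.equals(same_ints))
```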

In [141]:
# if my dtype is int64
# each int is 8 bytes
# if I create a series with 10m values, 
# that'll take up 80 MB

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10_000_000))

In [142]:
s.describe()

count    1.000000e+07
mean     4.948936e+01
std      2.886807e+01
min      0.000000e+00
25%      2.400000e+01
50%      4.900000e+01
75%      7.400000e+01
max      9.900000e+01
dtype: float64
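We can check the "8 bytes per value" claim with the `.nbytes` attribute. Here's a sketch that uses 1m values rather than 10m, just to keep it quick, and then shows one way to save memory: all of our values are from 0 through 99, so they fit in 8-bit integers. The name `small` is my own.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(0, 100, 1_000_000), dtype='int64')

print(s.nbytes)             # 8 bytes for each of the 1m values

small = s.astype('int8')    # an int8 holds -128 through 127, enough for 0 through 99
print(small.nbytes)         # 1 byte for each of the 1m values

print((small == s).all())   # the values themselves are unchanged
```

This only works because we know the range of our data; a value outside of the int8 range would not survive the conversion.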

# Exercise: Convert Celsius to Fahrenheit (and back)

1. Create a series from the forecast for high temperatures in your city in the next 10 days.
2. Convert those temperatures to Fahrenheit.
3. Convert them back to Celsius.

How to convert:
- °F = (°C × 9/5) + 32 
- °C = (°F − 32) × 5/9

In [143]:
highs

0    19
1    19
2    23
3    27
4    27
5    25
6    21
7    18
8    19
9    20
dtype: int64

In [147]:
f_highs = (highs * (9/5)) + 32
f_highs

0    66.2
1    66.2
2    73.4
3    80.6
4    80.6
5    77.0
6    69.8
7    64.4
8    66.2
9    68.0
dtype: float64

In [149]:
c_highs = (f_highs - 32) * (5/9)
c_highs

0    19.0
1    19.0
2    23.0
3    27.0
4    27.0
5    25.0
6    21.0
7    18.0
8    19.0
9    20.0
dtype: float64
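The round trip looks perfect in the output, but float arithmetic can introduce tiny errors that the display hides. Here's a sketch of two ways to handle that: comparing with a tolerance via NumPy's `np.allclose`, and rounding before converting back to integers. The name `restored` is my own.

```python
import numpy as np
from pandas import Series

highs = Series([19, 19, 23, 27, 27, 25, 21, 18, 19, 20], dtype='int64')

f_highs = (highs * (9/5)) + 32
c_highs = (f_highs - 32) * (5/9)

# Compare with a tolerance, rather than asking whether the floats are exactly equal
print(np.allclose(c_highs, highs))

# Rounding removes any leftover noise, and astype gives us integers again
restored = c_highs.round().astype('int64')
print(restored.equals(highs))
```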

In [153]:
# Pandas supports a "round" method, which takes the number of post-decimal digits as an argument

(highs / 1.23).round(2)

0    15.45
1    15.45
2    18.70
3    21.95
4    21.95
5    20.33
6    17.07
7    14.63
8    15.45
9    16.26
dtype: float64

# Broadcasting comparisons

Python (and Pandas) support a number of numeric comparisons:

- `==` (equal)
- `!=` (not equal)
- `<` (less than)
- `>` (greater than)
- `<=` (less than or equal)
- `>=` (greater than or equal)

In [154]:
# let's create a new series of random integers

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [155]:
s == 67

0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [157]:
# let's create a simple series

s = Series([10, 20, 30])

# we can retrieve with an index
s.loc[1]

20

In [158]:
# we can retrieve with a fancy index -- a list of integers
s.loc[[0, 2]]

0    10
2    30
dtype: int64

In [159]:
# we can also retrieve with a *mask index*, meaning a list of True/False values
# wherever the value is True, the original value "leaks out"

s.loc[[True, False, True]]   # return the items at indexes 0 and 2

0    10
2    30
dtype: int64

In [160]:
s.loc[[True, True, False]]

0    10
1    20
dtype: int64

In [161]:
s.loc[[False, False, False]]

Series([], dtype: int64)

In [162]:
# no one wants to create a mask index manually
# instead, we'll do it automatically -- via the output from a broadcast comparison

In [163]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [164]:
s + 3   # this is broadcast -- we run +3 on each element of s, and get a series back

0    47
1    50
2    67
3    70
4    70
5    12
6    86
7    24
8    39
9    90
dtype: int64

In [165]:
s == 67   # this is also broadcast -- we run == 67 on each element of s, and get a series of booleans back

0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
8    False
9    False
dtype: bool

In [166]:
# now to put it all together...

s.loc[s == 67]      # this returns all elements of s that are equal to 67

3    67
4    67
dtype: int64

# How to create and use a mask index

1. We run a comparison operator, broadcast against a series
2. That produces a boolean series
3. We apply the boolean series as a mask index to a series via `.loc`
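Here's a sketch of those three steps, with the boolean series assigned to a variable (I've called it `mask`) so that we can look at it before applying it. Because `True` counts as 1 and `False` as 0, running `.sum()` on the mask tells us how many elements will match.

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

mask = s > 50            # steps 1 and 2: the broadcast comparison gives us a boolean series
print(mask.sum())        # True counts as 1, so this is the number of matches

matches = s.loc[mask]    # step 3: apply the boolean series via .loc
print(matches)
```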

In [167]:
s < 67    # this will tell us which values are <67

0     True
1     True
2     True
3    False
4    False
5     True
6    False
7     True
8     True
9    False
dtype: bool

In [168]:
s.loc[s < 67]   # show me all values of s that are <67

0    44
1    47
2    64
5     9
7    21
8    36
dtype: int64

In [169]:
s.loc[s > s.mean()]  # calculate the mean, compare each value in s with the mean... apply to s

2    64
3    67
4    67
6    83
9    87
dtype: int64

In [170]:
s.index

RangeIndex(start=0, stop=10, step=1)

In [171]:
s = Series([10, 20, 30])

s.loc[[True, False, True]]   # we'll get the values that correspond with True, and ignore those with False

0    10
2    30
dtype: int64

In [172]:
s < 25  # get a series of booleans back

0     True
1     True
2    False
dtype: bool

In [173]:
s.loc[s < 25]  # this means: show me elements of s that are less than 25

0    10
1    20
dtype: int64

In [176]:
# you can use .index on a series to get its index back
s.loc[s < 25].index

Int64Index([0, 1], dtype='int64')

# Mask indexing combines several ideas

1. If you pass `.loc` a list of `True`/`False` values, it'll return the elements at the positions where the list has `True`.
2. We can generate a series of `True`/`False` values with a comparison operator and a series.
3. Mask indexing involves:
    - Using a comparison operator to create a new series containing boolean values
    - Applying that new series to our original series with `.loc`
    - The result is a new series containing only those elements for which the comparison is `True`

# Exercises 

1. Create a series of 50 random floats from 0-1,000.
2. Find numbers smaller than the mean.
3. Find any numbers that are larger than the mean + 1 standard deviation.

In [177]:
np.random.seed(0)
s = Series(np.random.rand(50) * 1000)
s

0     548.813504
1     715.189366
2     602.763376
3     544.883183
4     423.654799
5     645.894113
6     437.587211
7     891.773001
8     963.662761
9     383.441519
10    791.725038
11    528.894920
12    568.044561
13    925.596638
14     71.036058
15     87.129300
16     20.218397
17    832.619846
18    778.156751
19    870.012148
20    978.618342
21    799.158564
22    461.479362
23    780.529176
24    118.274426
25    639.921021
26    143.353287
27    944.668917
28    521.848322
29    414.661940
30    264.555612
31    774.233689
32    456.150332
33    568.433949
34     18.789800
35    617.635497
36    612.095723
37    616.933997
38    943.748079
39    681.820299
40    359.507901
41    437.031954
42    697.631196
43     60.225472
44    666.766715
45    670.637870
46    210.382561
47    128.926298
48    315.428351
49    363.710771
dtype: float64

In [179]:
# to find numbers smaller than the mean, I need to:

# 1. find the mean
s.mean()

537.965118275541

In [180]:
# 2. get a boolean series showing which elements are less than the mean (with True)

s < s.mean()

0     False
1     False
2     False
3     False
4      True
5     False
6      True
7     False
8     False
9      True
10    False
11     True
12    False
13    False
14     True
15     True
16     True
17    False
18    False
19    False
20    False
21    False
22     True
23    False
24     True
25    False
26     True
27    False
28     True
29     True
30     True
31    False
32     True
33    False
34     True
35    False
36    False
37    False
38    False
39    False
40     True
41     True
42    False
43     True
44    False
45    False
46     True
47     True
48     True
49     True
dtype: bool

In [181]:
# 3. apply that boolean series as a mask index

s.loc[s < s.mean()]    # this will return elements of s that are < s's mean

4     423.654799
6     437.587211
9     383.441519
11    528.894920
14     71.036058
15     87.129300
16     20.218397
22    461.479362
24    118.274426
26    143.353287
28    521.848322
29    414.661940
30    264.555612
32    456.150332
34     18.789800
40    359.507901
41    437.031954
43     60.225472
46    210.382561
47    128.926298
48    315.428351
49    363.710771
dtype: float64

In [183]:
# Find any numbers that are larger than the mean + 1 standard deviation.

# 1. find the mean
s.mean()

537.965118275541

In [184]:
# 2. find the standard deviation
s.std()

275.0300138190335

In [185]:
# 3. find numbers larger than mean + std, a boolean series

s > s.mean() + s.std()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17     True
18    False
19     True
20     True
21    False
22    False
23    False
24    False
25    False
26    False
27     True
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38     True
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
dtype: bool

In [186]:
# 4. apply the boolean series to .loc, and get the filtered/masked result

s.loc[s > s.mean() + s.std()]

7     891.773001
8     963.662761
13    925.596638
17    832.619846
19    870.012148
20    978.618342
27    944.668917
38    943.748079
dtype: float64

# Check this out in the Pandas Tutor

1. https://pandastutor.com/vis.html#code=import%20numpy%20as%20np%0Aimport%20pandas%20as%20pd%0Afrom%20pandas%20import%20Series%0A%0Anp.random.seed%280%29%0As%20%3D%20Series%28np.random.rand%2850%29%20*%201000%29%0As.loc%5Bs%20%3C%20s.mean%28%29%5D%20&d=2023-03-08&lang=py&v=v1

2. https://pandastutor.com/vis.html#code=import%20numpy%20as%20np%0Aimport%20pandas%20as%20pd%0Afrom%20pandas%20import%20Series%0A%0Anp.random.seed%280%29%0As%20%3D%20Series%28np.random.rand%2850%29%20*%201000%29%0As.loc%5Bs%20%3E%20s.mean%28%29%20%2B%20s.std%28%29%5D%20&d=2023-03-08&lang=py&v=v1

# Next up

1. Indexes 
2. Useful methods to know

In [187]:
s = Series([10, 20, 30, 40, 50, 60, 70])

In [188]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [190]:
# it turns out that in Pandas, we can assign *ANYTHING* to be our index values

s = Series([10, 20, 30, 40, 50, 60, 70],
          index=list('abcdefg'))    # when you create a series, you can set the index to be a list/series

In [191]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64

In [192]:
# we can, using .loc, treat our index just like before

s.loc['a']

10

In [193]:
s.loc['c']

30

In [194]:
# fancy indexing
s.loc[['a', 'c']]

a    10
c    30
dtype: int64

In [195]:
s.index  # what kind of index is this?

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

In [197]:
# slice? Of course!

s.loc['b':'f']   # up to and including

b    20
c    30
d    40
e    50
f    60
dtype: int64

In [198]:
# assign

s.loc['d'] = 999
s

a     10
b     20
c     30
d    999
e     50
f     60
g     70
dtype: int64

In [199]:
s = Series([10, 20, 30, 40, 50, 60, 70],
          index=list('abcdabc'))  

In [200]:
s

a    10
b    20
c    30
d    40
a    50
b    60
c    70
dtype: int64

In [201]:
s.loc['a']  # get back a series, because 'a' appears twice in the index

a    10
a    50
dtype: int64

In [203]:
s.loc['d']  # get back an integer, because 'd' appears once in the index

40

In [204]:
# fancy indexing

s.loc[['a', 'b', 'd']]

a    10
a    50
b    20
b    60
d    40
dtype: int64

In [205]:
# slicing?

s.loc['b':'d']

KeyError: "Cannot get left slice bound for non-unique label: 'b'"

In [206]:
# this is why we also have .iloc!  That way, we can describe what elements we want via
# their positions, ignoring the index

s.iloc[3:6]

d    40
a    50
b    60
dtype: int64
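If you really do want a label-based slice on an index with repeated values, one possible approach is to sort the index first. Here's a sketch: once the index is sorted, the repeated labels sit next to one another, so Pandas can work out where the slice starts and ends. The name `sorted_s` is my own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70], index=list('abcdabc'))

print(s.index.is_unique)   # do the labels appear only once each?

# sort_index returns a new series, ordered by index label
sorted_s = s.sort_index()

print(sorted_s.loc['b':'d'])   # every 'b', 'c', and 'd' -- up to and including
```

Note that sorting changes the order of the elements, so this only makes sense when the original order doesn't matter to you.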

In [208]:
s.loc['a'] = 888    # updated multiple elements -- assignment is also broadcast!
s

a    888
b     20
c     30
d     40
a    888
b     60
c     70
dtype: int64

In [212]:
# let's find all of the elements less than the mean, and add 5 to them

s.loc[s < s.mean()] += 5

In [213]:
s

a    888
b     25
c     35
d     45
a    888
b     65
c     75
dtype: int64

In [214]:
x = 100
x += 5    # same as x = x + 5

x

105

# Choose your index carefully!

1. The default index is numbered from 0 through `len(s) - 1`.  You can always use this with `.iloc`
2. You can set the index when you create your series, by passing the `index=` keyword argument.
3. It's common to use strings - either individual characters or entire words.
4. Pandas knows about:
    - `RangeIndex` -- when your index starts