# Agenda 

1. Getting started and series
    - Why Python?
    - What is Pandas?
    - Pandas series -- defining and working with them
    - Broadcasting
    - Boolean/mask indexes
    - Indexes
    - Dtypes
3. Data frames
4. Analyzing data 

# Jupyter notebook

Jupyter is a REPL -- read, eval, print loop -- in the browser. You don't have to use Jupyter in this course, though! If you prefer to use VSCode or PyCharm, that's totally OK.

If you *do* want to use Jupyter:

- Inside of VSCode and PyCharm, you can create/work with notebooks
- You can also use Jupyter Lite (https://jupyter.org/try-jupyter/lab/)



# A 5-minute introduction to Jupyter

We type into *cells* in Jupyter. Each cell contains code (Python) or documentation (Markdown). Each cell has two "modes":

- Edit mode -- when I type, the text goes into the cell. I'm in edit mode right now! You can enter edit mode by pressing `ENTER` or by clicking inside of a cell.
- Command mode -- when I type, typically one character, that character is a command to Jupyter. The character is not entered into the cell, but rather tells Jupyter to do something. You can enter command mode by pressing `ESC` or by clicking to the left of a cell.

What commands could we use?
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the most recently copied/cut cell
- `a` -- add a new cell *above* the current one
- `b` -- add a new cell *below* the current one
- `y` -- make a cell in Python mode
- `m` -- make a cell in Markdown (documentation) mode

In edit mode, you can always press `ENTER` to go down a line. To execute the code in a cell, press shift+`ENTER` together.

# What is data science?

This is a huge, new term in the last decade or two. I divide it into several parts:

- *Data science* as a whole is everything having to do with retrieving, analyzing, and forecasting with data. It's the overall umbrella term.
- *Data analytics* is making sense of data that we've already collected. That's what we're going to be doing in this course.
- *Data engineering* is about getting data from one place to another, so that we can analyze it.
- *Modeling* and *machine learning* is about making predictions, or forecasts, based on existing data.

For example, if I'm a big company:
- My data engineers will move data from our various databases into a central location so that we can analyze it
- The data analysts will look it over and understand how many widgets we sold last year, and in which regions, and which salespeople did the best job.
- The ML specialists will use the data to predict how well we'll sell our widgets next year, and which regions and types of customers we should target.

All of these disciplines now use Python as their main language of choice.

That seems **SUPER WEIRD** if you know anything about Python. Python doesn't run quickly and uses lots of memory.

The biggest reason that Python is the #1 language for data science is NumPy -- which gives us the speed and size of C data, but with a Python layer over it. That makes it easy to work with but also very efficient. NumPy has long been favored by scientists and engineers for working with data. There's even a SciPy package which has lots of libraries for various scientific and engineering fields.

Part of the reason NumPy works so well is that it's *vectorized*, meaning that we don't work with individual data points, but rather with groups of numbers. NumPy is optimized to do things in that sort of way.
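
As a rough sketch of what "vectorized" means (the array contents and variable names here are made up for illustration), compare a Python loop with the equivalent NumPy operation:

```python
import numpy as np

numbers = np.array([10, 20, 30, 40, 50])

# The plain-Python way: visit each value, one at a time
doubled_with_loop = [one_number * 2 for one_number in numbers]

# The vectorized way: one operation on the whole array at once
doubled = numbers * 2

print(doubled)
```

Both approaches produce the same values, but the vectorized version hands the whole job to NumPy, rather than running Python code once per element.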

I could use NumPy for everything! But it's a bit low level for many people's tastes.

That's where Pandas comes in: Pandas is a wrapper around NumPy that provides a ton of additional functionality. Using a Pandas series is very similar to using a NumPy array, but you have hundreds of additional methods. Also, you can work with many more data types (e.g., dates and times, and also strings), and you can work with many different file types and formats.

Pandas has been around for about 17 years now, and it continues to be really, really popular -- about 40m downloads every month.

Lots of organizations are moving from Matlab, R, or Excel to use Pandas instead.

In many ways, you can think of Pandas as providing all of the functionality of Excel, but inside of Python. This means that you can perform all of those calculations, but you'll do it without needing to sit in front of your computer.

There is no way to remember all of the Pandas functionality! My goal is to teach you how to think the way that Pandas wants you to think, and thus be able to find and understand the right documentation to do what you want.

In [2]:
# before you can "import pandas", you need to install it using either "pip" or "uv" or "conda"

# I type quickly, and avoided using the "as pd" for the first year or two I used Pandas. A big mistake!

import numpy as np
import pandas as pd
from pandas import Series

# Installing packages

The traditional way to install Python packages, and still the "official" way, is with the `pip` command. It's not a Python function you run inside of Jupyter or a program. Rather, you run it on the command line, just as you run a Python program.

    pip install pandas

That's the most standard way to do it. And that installs Pandas into your `site-packages` directory.

Another package installer is `conda`, which is used by people using Anaconda distributions of Python and Pandas. I think you would say

    conda install pandas

The newest way to install packages in Python is "uv", from a company called Astral. The simplest way to install packages with uv is to say

    uv pip install pandas

uv is far, far faster than pip. It also has a lot of functionality that pip doesn't have -- it basically replaces venv, pyenv, Poetry, and a number of other package managers.

You can learn more at https://uvcrashcourse.com .

# Series

The core data structure in Pandas is the "series," which is a 1D data structure. It's similar to a list in Python, but it also has some big differences. We can create a new series with `pd.Series`, handing it a Python list of values. (Or if you prefer, a NumPy array of values.)

In [3]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

# Let's take apart what we just did

1. We created a new Series value. This is not a list, but rather a totally different data structure. However, to get the data into Pandas, we needed to use a list.
2. Our series, as we'll see, needs to have values that are all of the same type -- in this case, all integers.
3. The series can have any number of values.
4. The series actually, behind the scenes, creates a NumPy array. Pandas is a Python shell around NumPy values.
5. When we print our series, you can see that we get all of the values on the right column, and their indexes on the left column. Our initial series has the default index, which you probably expect in Python, starting with 0 and going through the length - 1.
6. We can see the `dtype` of the series, which describes what types of values are in it. This series has `int64`, meaning 64-bit integers.

Already that should surprise you! Python doesn't have 64-bit ints, or 32-bit ints, or 128-bit ints. That's because these are not being stored as Python ints, but rather as C ints via NumPy.
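
Here's a small sketch of that difference. It uses `np.iinfo`, a NumPy function that reports the limits of an integer dtype; the variable names are just for illustration:

```python
import numpy as np

# Python ints grow as large as needed
big = 2 ** 100
print(big)

# NumPy's int64 has a fixed size, and thus fixed limits
limits = np.iinfo(np.int64)
print(limits.min, limits.max)
```

A Python int can hold `2 ** 100` without any trouble, whereas an `int64` cannot go beyond the limits that `np.iinfo` reports.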

In [5]:
# Python list
mylist = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# What happens if I add this list to itself? I get a list of 20 elements:
mylist + mylist

[10,
 20,
 30,
 40,
 50,
 60,
 70,
 80,
 90,
 100,
 10,
 20,
 30,
 40,
 50,
 60,
 70,
 80,
 90,
 100]

In [6]:
# what happens if I add my series to itself?
s + s

0     20
1     40
2     60
3     80
4    100
5    120
6    140
7    160
8    180
9    200
dtype: int64

When you add (or multiply, or subtract) two series in Pandas, the operation is *vectorized*. That means: The operation is done to the values at index 0, at index 1, at index 2, etc.

In [7]:
# like a list, you can retrieve from a series

s[0]

np.int64(10)

In [8]:
s[1]

np.int64(20)

In [11]:
s[2:7]   # I get a slice back from this series! The index is 2, 3, 4, 5, 6

2    30
3    40
4    50
5    60
6    70
dtype: int64

In [10]:
mylist[2:7]  # I get a list back from slicing mylist, and the index is 0, 1, 2, 3, 4

[30, 40, 50, 60, 70]

In [13]:
s[2] = 999   # series are mutable, as well!
s

0     10
1     20
2    999
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

# Exercise: Basic series operations

1. `import pandas as pd` into your environment.
2. Create a Pandas series containing 5 integers.
3. Grab the first integer (at index 0)
4. Grab the third integer (at index 2)
5. What happens when you add the series to itself?

In [14]:
import pandas as pd

s = pd.Series([15, 7, 100, 2, 8])
s

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [15]:
s[0]

np.int64(15)

In [16]:
s[2]

np.int64(100)

In [17]:
s + s

0     30
1     14
2    200
3      4
4     16
dtype: int64

In [18]:
s - s 

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [19]:
s / s

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

In [20]:
s[-1]

KeyError: -1
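
Why the `KeyError`? In a list, `-1` means "the last item." With plain `[]` on a series whose index contains integers, Pandas looks for `-1` in the index, and there's no such index here. As a small sketch, `.iloc` (retrieval by position, which we'll discuss later in this session) is one way to get the final value:

```python
import pandas as pd

s = pd.Series([15, 7, 100, 2, 8])

try:
    s[-1]               # looks for the index -1, which doesn't exist
except KeyError:
    print('No index -1 in this series')

last = s.iloc[-1]       # position -1, meaning the final value
print(last)
```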

# Useful methods

Pandas provides us with a lot of methods that we use to analyze our data. Some of these are taken from NumPy, but others are special to Pandas.

In [21]:
s

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [22]:
s.sum()   # returns the sum of the values in the series

np.int64(132)

In [23]:
s.count()  # how many values (not NaN, or "not a number") are there in the series?

np.int64(5)

In [24]:
s.mean()  # what is the arithmetic mean of the series?

np.float64(26.4)

In [25]:
s.sum() / s.count()

np.float64(26.4)

In [26]:
s.std()   # standard deviation

np.float64(41.40410607657168)

In [28]:
(s / s).std()

np.float64(0.0)

In [29]:
# SS: is there a difference between len(s) and s.count()?
# yes!

len(s)  # this returns the number of elements, no matter what they are, in the series

5

In [30]:
s.count()  # this returns the number of non-missing values in the series

np.int64(5)
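
With our series, `len` and `count` give the same answer, because nothing is missing. Here's a minimal sketch with a made-up series (`with_missing` is just an illustrative name) that contains one `NaN`:

```python
import numpy as np
import pandas as pd

with_missing = pd.Series([10, np.nan, 30])

total = len(with_missing)            # counts every element, including NaN
non_missing = with_missing.count()   # counts only the non-NaN values

print(total, non_missing)
```

Here, `len` reports 3 elements, while `count` reports only the 2 that aren't missing.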

In [32]:
s.min()  # what's the smallest value?

np.int64(2)

In [33]:
s.max()  # what's the biggest number?

np.int64(100)

In [34]:
s.median()  # what is the median?

np.float64(8.0)

In [35]:
s.describe()  # this gives me all of the values we just saw, and then some!

count      5.000000
mean      26.400000
std       41.404106
min        2.000000
25%        7.000000
50%        8.000000
75%       15.000000
max      100.000000
dtype: float64

In [36]:
s.head()  # this returns the first 5 elements of a series

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [37]:
s.head(2)  # now give me the first 2

0    15
1     7
dtype: int64

In [38]:
# similarly, I can use s.tail() -- gives 5 bottom/final values by default

s.tail(3)

2    100
3      2
4      8
dtype: int64

In [39]:
s = Series([10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 40, 30, 20, 10, 50])

# how many times does each value appear in s?
# we can find out with value_counts

# the result of value_counts is a series -- one in which the values from the original series (s)
# are the index, and the results are the number of times each value appears

# also, the result is sorted, from most common to least common.

s.value_counts()

10    6
20    4
30    3
40    3
50    1
Name: count, dtype: int64

# Exercise: Series calculations

1. Create a series of 10 numbers. It's OK for some of them to repeat.
2. Find the largest and smallest numbers in two ways -- using individual methods and then using `describe`.
3. Which number occurs the most often in your series?

In [40]:
s = Series([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])



In [41]:
s.min()

np.int64(10)

In [42]:
s.max()

np.int64(50)

In [43]:
s.describe()

count    10.000000
mean     28.000000
std      13.165612
min      10.000000
25%      20.000000
50%      30.000000
75%      37.500000
max      50.000000
dtype: float64

In [44]:
s.value_counts()

30    3
10    2
20    2
40    2
50    1
Name: count, dtype: int64

In [46]:
# this gives us the maximum value, but not the index associated with it
# if we want to know *which* number appeared most often, this won't do it.
s.value_counts().max()

np.int64(3)

In [47]:
s.value_counts().idxmax()   # this returns the index associated with the most common value

np.int64(30)

In [48]:
s.value_counts().head(1)

30    3
Name: count, dtype: int64

# Pandas has a limited set of tools

When we get results back from Pandas, they will almost always be:

- Single NumPy values (e.g., `np.int64`)
- Pandas series
- Pandas data frames (the 2D data structures we'll see next week)

# Next up

- Setting and retrieving values
- Broadcasting

In [49]:
s

0    10
1    20
2    30
3    20
4    30
5    40
6    30
7    40
8    50
9    10
dtype: int64

In [50]:
s[0]

np.int64(10)

In [51]:
s.value_counts()[0]

KeyError: 0

In [52]:
s.value_counts()

30    3
10    2
20    2
40    2
50    1
Name: count, dtype: int64

In [54]:
# what are the values in my series s?
# I get a NumPy array back

s.values

array([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])

In [55]:
s.index

RangeIndex(start=0, stop=10, step=1)

In [56]:
# what are the values from value_counts()?
s.value_counts().values

array([3, 2, 2, 2, 1])

In [57]:
# what is the index from value_counts()?
s.value_counts().index

Index([30, 10, 20, 40, 50], dtype='int64')

In [58]:
s = Series(range(10, 101, 10))
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [59]:
# can I retrieve from my series using []?  We've already seen that the answer is "yes"

s[3]

np.int64(40)

In [60]:
s[6]

np.int64(70)

Don't use `[]`! From this point on, try to avoid using `[]` on your series. Instead, you should use `.loc[]`.

`.loc[]` is a special Pandas accessor. It allows us to retrieve values from a series using the index.





In [61]:
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [62]:
s.loc[4]

np.int64(50)

In [63]:
s.loc[9]

np.int64(100)

What's wrong with `[]` by themselves?

1. When we move to data frames (i.e., 2D tables) next week, `[]` by themselves will refer to the columns. That is hard to digest if you've been using `[]` to retrieve values from a series, which are similar to the rows in a data frame.
2. `.loc` has a ton of special functionality that regular `[]` don't have.
3. If you want to retrieve via the position, and not the index, you can use `.iloc`, which only takes integer values.
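
With the default index, the index and the position are the same, so `.loc` and `.iloc` appear to do the same thing. A quick sketch with a made-up, shuffled integer index (the name `shuffled` is just for illustration; we'll look at setting the index ourselves later on) shows the difference:

```python
import pandas as pd

shuffled = pd.Series([10, 20, 30], index=[2, 0, 1])

by_index = shuffled.loc[0]       # the value whose index is 0
by_position = shuffled.iloc[0]   # the value in the first position

print(by_index, by_position)
```

Here `.loc[0]` returns 20 (the value labeled `0`), while `.iloc[0]` returns 10 (the first value).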

In [65]:
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [66]:
s = Series([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])
s

0    10
1    20
2    30
3    20
4    30
5    40
6    30
7    40
8    50
9    10
dtype: int64

In [69]:
s.loc[5]  # retrieve via the index

np.int64(40)

In [70]:
s.iloc[5]  # retrieve via the position

np.int64(40)

In [71]:
s.value_counts()

30    3
10    2
20    2
40    2
50    1
Name: count, dtype: int64

In [72]:
s.value_counts().loc[30]  # this returns the value associated with the index 30

np.int64(3)

In [73]:
s.value_counts().iloc[0]  # this returns the value in position 0

np.int64(3)

Notice that both `.loc` and `.iloc` use `[]`, not `()`, even though they're basically methods. Why is this? Because we can use *slices* when we retrieve!

In [74]:
s

0    10
1    20
2    30
3    20
4    30
5    40
6    30
7    40
8    50
9    10
dtype: int64

In [75]:
s.loc[3:6]  # from index 3 to index 6 -- loc returns up to *and including* the endpoint in our slice

3    20
4    30
5    40
6    30
dtype: int64

In [77]:
s.iloc[3:6] # from position 3 up to position 6 -- iloc does the traditional Python thing, of up to and *not* including

3    20
4    30
5    40
dtype: int64

In [78]:
s.describe()  # this returns a series

count    10.000000
mean     28.000000
std      13.165612
min      10.000000
25%      20.000000
50%      30.000000
75%      37.500000
max      50.000000
dtype: float64

In [79]:
s.describe().loc['min']  # retrieve the minimum value

np.float64(10.0)

In [80]:
s.describe().iloc[3]  # also retrieve the minimum value

np.float64(10.0)

In [82]:
s.describe().loc['mean':'50%'] # up to and including

mean    28.000000
std     13.165612
min     10.000000
25%     20.000000
50%     30.000000
dtype: float64

In [83]:
s.describe().iloc[1:5]  # up to and not including

mean    28.000000
std     13.165612
min     10.000000
25%     20.000000
dtype: float64

In [84]:
s

0    10
1    20
2    30
3    20
4    30
5    40
6    30
7    40
8    50
9    10
dtype: int64

In [85]:
# can I retrieve from index 4?
s.loc[4]

np.int64(30)

In [86]:
# can I retrieve from index 2?
s.loc[2]

np.int64(30)

In [87]:
# "fancy indexing"

# I want both indexes 4 and 2, in that order
s.loc[ [4,2]  ] # we have a list of indexes inside of the []!

4    30
2    30
dtype: int64

In [88]:
s.loc[[4,2,3,4,3,2]]

4    30
2    30
3    20
4    30
3    20
2    30
dtype: int64

When you retrieve from a series, depending on what you put in `loc[]`, you might get a single value, and you might get a series of values. The series you get back is 100% a series, on which you can run series methods.

In [89]:
s.loc[[4,2,3,4,3,2]].mean()

np.float64(26.666666666666668)

In [90]:
s.loc[[4,2,3,4,3,2]].value_counts()

30    4
20    2
Name: count, dtype: int64

In [91]:
# method chaining syntax

# I might like to put each of my methods on a line by itself, to be easier to read

s
.loc[[4,2,3,4,3,2]]
.value_counts()

SyntaxError: invalid syntax (3806846548.py, line 6)

In [92]:
# if we have open parentheses, then Python treats everything as if it's on one line

(
    s
    .loc[[4,2,3,4,3,2]]
    .value_counts()
)

30    4
20    2
Name: count, dtype: int64

# Exercises with series

1. Find the 10-day forecast for wherever you live (or wish to live)
2. Define a series containing the high temps for each of the next 10 days.
3. Define a second series containing the low temps for each of the same 10 days.
4. What is the most common high temp expected in the next 10 days?
5. What is the mean difference between high and low expected in the next 10 days?
6. What is the mean of the high temp for the next 3 days, and for the final 3 days? Which will be higher?

In [93]:
high_temps = Series([27, 28, 30, 29, 27, 26, 26, 23, 22])
low_temps = Series([17, 17, 17, 17, 16, 15, 15, 14, 13])

In [94]:
(
    high_temps        # get high temps
    .value_counts()   # how often does each appear?
    .head(1)          # grab the most common
)

27    2
Name: count, dtype: int64

In [95]:
high_temps.mode()

0    26
1    27
dtype: int64

In [96]:
high_temps.value_counts()

27    2
26    2
28    1
30    1
29    1
23    1
22    1
Name: count, dtype: int64

In [99]:
temp_diff = high_temps - low_temps
temp_diff.mean()

np.float64(10.777777777777779)

In [101]:
high_temps.head(3).mean()

np.float64(28.333333333333332)

In [102]:
high_temps.tail(3).mean()

np.float64(23.666666666666668)

In [103]:
# I can use just Series because I also said
from pandas import Series

# otherwise, just use pd.Series

# Broadcasting

This is one of the most important ideas we have in Pandas. We've seen that if we have two series, we can run a number of different operations on them.

In [104]:
high_temps - low_temps

0    10
1    11
2    13
3    12
4    11
5    11
6    11
7     9
8     9
dtype: int64

In [106]:
# what if, instead of subtracting low_temps (a series) from high_temps, I were to subtract a number?
# this "broadcasts" the operation of "-10" to each of the elements in high_temps. We get back a new series
# in which the index matches high_temps, but the values are each high_temps value, - 10.


high_temps - 10

0    17
1    18
2    20
3    19
4    17
5    16
6    16
7    13
8    12
dtype: int64

In [108]:
s * 100

0    1000
1    2000
2    3000
3    2000
4    3000
5    4000
6    3000
7    4000
8    5000
9    1000
dtype: int64

In [109]:
s / 2

0     5.0
1    10.0
2    15.0
3    10.0
4    15.0
5    20.0
6    15.0
7    20.0
8    25.0
9     5.0
dtype: float64

# Who cares?

This is core to working with Pandas: We run an operation on a series and a scalar value, and we get back a series that has been processed with the scalar.

This is why we never, *ever* run a `for` loop on a series in Pandas.
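
To make that concrete, here's a rough sketch comparing the two approaches on a few made-up temperatures (the names `temps`, `with_loop`, and `with_broadcast` are just for illustration). Both give the same numbers, but the broadcast version is shorter, and on large series it's typically much faster:

```python
import pandas as pd

temps = pd.Series([27, 28, 30])

# The slow, non-Pandas way: a Python "for" loop
with_loop = []
for one_temp in temps:
    with_loop.append(one_temp - 10)

# The Pandas way: broadcast the subtraction to every value
with_broadcast = temps - 10

print(with_broadcast)
```

Notice also that the loop gives us a plain Python list, whereas broadcasting gives us a series, with the original index intact.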

# Exercise: Broadcasting (and indexing)

1. Keep your two series from the last exercise, in which you have the 10-day forecast for highs and lows.
2. Retrieve the first, third, and fifth high temperatures.
3. Convert each of the series to use Fahrenheit (if you're using Celsius) or vice versa. Recalculate the mean difference using the other measurement system.

- F2C: (f - 32) * 5/9
- C2F: (c * 9/5) + 32

In [110]:
high_temps

0    27
1    28
2    30
3    29
4    27
5    26
6    26
7    23
8    22
dtype: int64

In [113]:
low_temps

0    17
1    17
2    17
3    17
4    16
5    15
6    15
7    14
8    13
dtype: int64

In [115]:
high_temps.loc[[0, 2, 4]]

0    27
2    30
4    27
dtype: int64

In [117]:
high_temps.loc[0:4:2]

0    27
2    30
4    27
dtype: int64

In [119]:
high_temps.iloc[0:5:2]

0    27
2    30
4    27
dtype: int64

In [121]:
(high_temps * 9/5 + 32)

0    80.6
1    82.4
2    86.0
3    84.2
4    80.6
5    78.8
6    78.8
7    73.4
8    71.6
dtype: float64

In [122]:
(low_temps * 9/5 + 32)

0    62.6
1    62.6
2    62.6
3    62.6
4    60.8
5    59.0
6    59.0
7    57.2
8    55.4
dtype: float64

In [124]:
((high_temps * 9/5 + 32) - (low_temps * 9/5 + 32)).mean()

np.float64(19.399999999999995)

# Next up

1. Mask arrays -- how to filter data
2. Custom indexes

In [126]:
# get the elements of high_temps at indexes 0, 2, and 4
high_temps.loc[[0, 2, 4]]

0    27
2    30
4    27
dtype: int64

In [127]:
high_temps.loc[[0, 1]]

0    27
1    28
dtype: int64

In [129]:
high_temps.loc[:1]   #use a slice to get from the start through index 1

0    27
1    28
dtype: int64

In [130]:
high_temps.iloc[:2]  # use a slice with iloc to get from the start up to and not including 2

0    27
1    28
dtype: int64

In [131]:
high_temps.head(2)

0    27
1    28
dtype: int64

# Boolean/mask array

We've seen that we can apply an operator to a series, and we'll get a series back. The operation will be "broadcast" to every single value in the series.

We can apply this technique in a different way, to filter our values.

First, let's look at simple boolean/mask indexing.

In [132]:
s = Series([10, 20, 30, 35, 45, 55])
s

0    10
1    20
2    30
3    35
4    45
5    55
dtype: int64

In [133]:
# I can retrieve one item
s.loc[3]

np.int64(35)

In [134]:
# I can use fancy indexing for more than one item
s.loc[[3, 5]]

3    35
5    55
dtype: int64

In [135]:
# But what happens if, instead of passing a list (or series) of integers inside of the [], 
# I pass a list (or series) of boolean values -- True and False?

s.loc[[True, False, False, True, True, False]] 

0    10
3    35
4    45
dtype: int64

Pandas matches up our boolean series/list with `s`. Wherever there is a `True` value, we get the index/value from `s`. Where there is a `False` value, we don't. It just doesn't come through the "mask."

The boolean series and `s` must be the same length, otherwise this doesn't work.
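
Here's a small sketch of what happens when the lengths don't match. As far as I know, recent versions of Pandas raise an `IndexError` in this case, but treat the exact exception and its message as an assumption about your installed version:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 35, 45, 55])

error_message = None
try:
    s.loc[[True, False, True]]   # only 3 booleans for 6 values
except IndexError as e:
    error_message = str(e)
    print(f'IndexError: {error_message}')
```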

No one, but *no one*, wants to actually write out a boolean series by hand. Fortunately, we don't have to. We can use broadcasting.

In [136]:
s

0    10
1    20
2    30
3    35
4    45
5    55
dtype: int64

In [137]:
# let's broadcast the comparison 
s < 40

0     True
1     True
2     True
3     True
4    False
5    False
dtype: bool

In [139]:
# If I put s < 40 inside of .loc[], then I'll be filtering the elements of s
# I only get the elements of s that are < 40

s.loc[s<40]

0    10
1    20
2    30
3    35
dtype: int64

# What's going on here? How does it work?

1. The first thing that is calculated is the value inside of the `[]`. There, we calculate `s<40`. That's a broadcast operation, so we get back a new series whose index is identical to `s`, but whose values are all booleans (`True` and `False`).
2. We then apply that boolean series to `s.loc`.
3. Wherever there is a `True` value, we get a result. 
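
The steps above can be sketched by splitting the one-liner into two lines. The names `mask` and `filtered` are just for illustration:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 35, 45, 55])

mask = s < 40            # step 1: broadcast the comparison, get a boolean series
filtered = s.loc[mask]   # steps 2-3: apply the mask, keeping only the True rows

print(mask.dtype)
print(filtered)
```

Assigning the mask to a variable isn't necessary, but it can make a complex filter easier to read and debug.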

In [140]:
# I can get even more complex

s.loc[s<s.mean()]   # now I want only those values that are less than the mean!

0    10
1    20
2    30
dtype: int64

In [143]:
# what if I want only the odd values?

s.loc[s%2 == 1]

3    35
4    45
5    55
dtype: int64

# Exercises

1. On how many days will the forecast high be higher than the mean high temp? Compare this with the median. Are they the same? Why or why not?
2. Is the mean odd high forecast temp higher or lower than the even high forecast temp?

In [147]:
# find me all elements of high_temps
# where the value is greater than the mean, high_temps.mean()

high_temps.loc[ high_temps > high_temps.mean() ]

0    27
1    28
2    30
3    29
4    27
dtype: int64

In [148]:
high_temps.loc[ high_temps > high_temps.median() ]

1    28
2    30
3    29
dtype: int64

In [149]:
# mean -- total / length
high_temps.mean()

np.float64(26.444444444444443)

In [150]:
# median -- it's the middle value, when we line them up from smallest to largest
high_temps.median()

np.float64(27.0)

Why aren't the mean and median the same?

The mean can be pulled up or down by a small number of very large or very small values. The median, by contrast, is just the middle value.

In many fields, the median is considered more reliable, because a few extreme values barely affect it.
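
A quick sketch with made-up numbers (not our forecast data) shows how a single extreme value affects the mean, but not the median:

```python
import pandas as pd

typical = pd.Series([10, 20, 30, 40, 50])
with_outlier = pd.Series([10, 20, 30, 40, 1000])

print(typical.mean(), typical.median())             # 30.0 30.0
print(with_outlier.mean(), with_outlier.median())   # 220.0 30.0
```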

In [153]:
# is the mean odd high forecast temp higher or lower than the even high forecast temp?

high_temps.loc[high_temps % 2 == 1].mean() - high_temps.loc[high_temps % 2 == 0].mean()

np.float64(0.10000000000000142)

# Indexes

We can assign any values we want to be the index of a series. By default, we get integers, starting at 0 and going up to the length - 1. But we can, when we create a series, assign any list or series to be the index. The number of items in the index must match the number of items in the series. You can use integers, floats, strings.. anything, basically.

In [155]:
s = Series([10, 20, 30, 40, 50],
           index=list('abcde'))

s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [156]:
# to retrieve things via .loc, I need to specify the index (string)

s.loc['b']

np.int64(20)

In [157]:
s.loc['d']

np.int64(40)

In [158]:
s.loc['b':'d']

b    20
c    30
d    40
dtype: int64

In [160]:
# I can still use iloc to retrieve by position
s.iloc[1:4]

b    20
c    30
d    40
dtype: int64

In [161]:
# fancy indexing?

s.loc[['a', 'c', 'b']]

a    10
c    30
b    20
dtype: int64

In [162]:
# can I change the index? Yes! If I assign a new list/series to s.index

s.index = [100, 200, 300, 400, 500]

In [163]:
s

100    10
200    20
300    30
400    40
500    50
dtype: int64

In [164]:
s.loc[400]

np.int64(40)

In [165]:
s = Series([10, 20, 30, 40, 50],
           index=list('abcab'))
s

a    10
b    20
c    30
a    40
b    50
dtype: int64

Allowing for the index to repeat means that you can have the same datetime on multiple rows. Or user IDs on multiple rows. Or names on multiple rows.

In [166]:
s.loc['c']  # this returns a single value, because only one has an index of 'c'

np.int64(30)

In [167]:
s.loc['a'] # this returns a series, both items with the index 'a'

a    10
a    40
dtype: int64

In [168]:
s.loc[['c']]  # this guarantees I'll get a series back

c    30
dtype: int64

In [169]:
s.loc[['a']]

a    10
a    40
dtype: int64

In [170]:
s.describe()

count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64

In [171]:
s.describe().loc['mean']

np.float64(30.0)

In [173]:
# fancy indexing!
s.describe().loc[['mean', '50%']]

mean    30.0
50%     30.0
dtype: float64

# Exercise

1. Create a series whose values are the forecast high temps in your city for the next 10 days. The index should be the day of the week (you can use just the first three letters, if you want.)
2. What is the mean temperature on Mondays?
3. What is the mean temperature on Sundays and Mondays?
4. What is the max temperature on the first three days?

In [175]:
high_temps.values

array([27, 28, 30, 29, 27, 26, 26, 23, 22])

In [176]:
high_temps = pd.Series([27, 28, 30, 29, 27, 26, 26, 23, 22],
                       index='Fri Sat Sun Mon Tue Wed Thu Fri Sat'.split())
high_temps

Fri    27
Sat    28
Sun    30
Mon    29
Tue    27
Wed    26
Thu    26
Fri    23
Sat    22
dtype: int64

In [178]:
high_temps.loc['Mon'].mean()

np.float64(29.0)

In [181]:
high_temps.loc['Fri'].mean()

np.float64(25.0)

In [183]:
high_temps.loc[['Sun', 'Mon']].mean()

np.float64(29.5)

In [184]:
high_temps.iloc[:3]

Fri    27
Sat    28
Sun    30
dtype: int64

In [186]:
high_temps = pd.Series([27, 28, 30, 29, 27, 26, 26, 23, 22],
                       index=['Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
high_temps

Fri    27
Sat    28
Sun    30
Mon    29
Tue    27
Wed    26
Thu    26
Fri    23
Sat    22
dtype: int64

In [185]:
# list() on a string returns its individual characters, which isn't what we want for an index
list('Fri Sat Sun Mon Tue Wed Thu Fri Sat')

['F',
 'r',
 'i',
 ' ',
 'S',
 'a',
 't',
 ' ',
 'S',
 'u',
 'n',
 ' ',
 'M',
 'o',
 'n',
 ' ',
 'T',
 'u',
 'e',
 ' ',
 'W',
 'e',
 'd',
 ' ',
 'T',
 'h',
 'u',
 ' ',
 'F',
 'r',
 'i',
 ' ',
 'S',
 'a',
 't']

# Next up

- dtypes
- `NaN` -- "not a number" -- but what is?!?!?

In [187]:
s


a    10
b    20
c    30
a    40
b    50
dtype: int64

In [188]:
s.loc['e'] = 999
s

a     10
b     20
c     30
a     40
b     50
e    999
dtype: int64

In [189]:
s.loc['e'] = 888
s

a     10
b     20
c     30
a     40
b     50
e    888
dtype: int64

# dtypes

In a Python list, we can have any Python values we want. The convention is to just have values of one type in a list, but there's no technical reason we need to do this. However, in a series, we are constrained -- we can only have values of one type. Moreover, that type is the underlying NumPy type, which reflects C types, *not* Python types.

The dtype of a series describes the type of data that it contains. All values must be of that type!

In [190]:
s = Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [191]:
s.dtype  # this returns the dtype

dtype('int64')

# Specifying a dtype

There are a few ways to specify a dtype. In the case of int64, we can say:

- `'int64'`
- `np.dtype('int64')`
- `np.int64`
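
As a quick sketch, all three spellings can be passed to the `dtype` parameter when creating a series, and they give the same result. The names `s1`, `s2`, and `s3` are just for illustration:

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]

s1 = pd.Series(values, dtype='int64')
s2 = pd.Series(values, dtype=np.dtype('int64'))
s3 = pd.Series(values, dtype=np.int64)

print(s1.dtype, s2.dtype, s3.dtype)
```

The string version is the shortest to type, and it works even if you haven't imported NumPy.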

# What dtypes are available?

- integers (positive and negative)
    - `int8`
    - `int16`
    - `int32`
    - `int64`  -- default if you give integers
- integers (unsigned, meaning only 0 and positive numbers)
    - `uint8`
    - `uint16`
    - `uint32`
    - `uint64`
- floats
    - `float16`
    - `float32`
    - `float64` -- default if you give floats

Why choose something other than the default dtype? You can save memory, which (a) allows you to process more data and (b) often reduces the time needed to perform calculations.

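If you have 1m numbers in a series, then 64-bit integers take up 8 MB (64 bits is 8 bytes per value, times 1m values). With 8-bit integers, the same number of values takes only 1 MB.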
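
You can check the memory used by the values in a series with its `nbytes` attribute. A minimal sketch (the names `big` and `small` are just for illustration):

```python
import numpy as np
import pandas as pd

# One million zeros, stored as 64-bit integers
big = pd.Series(np.zeros(1_000_000, dtype='int64'))
small = big.astype('int8')   # the same values, as 8-bit integers

print(big.nbytes)     # bytes used by the int64 values
print(small.nbytes)   # bytes used by the int8 values
```

Each `int64` value takes 8 bytes and each `int8` value takes 1 byte, so the second series needs one eighth of the memory for its values.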

You should thus choose the smallest dtype you can.  But... be careful!

In [192]:
# 8-bit numbers -- go from -128 to +127
s = Series([10, 20, 30, 40, 50], dtype='int8')
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [193]:
2 ** 8

256

In [194]:
s + s

0     20
1     40
2     60
3     80
4    100
dtype: int8

In [195]:
s * s

0    100
1   -112
2   -124
3     64
4    -60
dtype: int8

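If Pandas has to make a calculation, and there aren't enough bits to hold the result, it'll do its "best," meaning that it'll roll over as needed within its constraints.

For example, 20 * 20 = 400, which doesn't fit into an `int8` (whose maximum is 127). The result wraps around: 400 - 256 = 144, and 144 - 256 = -112, which is what we see at index 1.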

How can you avoid this? Make sure your dtypes are big enough. 
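
One way to check whether a dtype is big enough is `np.iinfo`, which reports the smallest and largest values an integer dtype can hold. Here's a rough sketch that uses `astype` to get a copy of the series with a wider dtype before multiplying (the names `wider` and `squares` are just for illustration):

```python
import numpy as np
import pandas as pd

limits = np.iinfo('int8')
print(limits.min, limits.max)   # the smallest and largest int8 values

s = pd.Series([10, 20, 30, 40, 50], dtype='int8')

wider = s.astype('int16')   # int16 goes from -32,768 to +32,767
squares = wider * wider     # 2,500 fits comfortably in int16

print(squares)
```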

In [196]:
s * 1000

OverflowError: Python integer 1000 out of bounds for int8

# What if I want to change the dtype?

You cannot assign to the `dtype` attribute. If you try, you'll get an error. That's good, because it doesn't make sense.

However, you can use the `astype` method, specifying what new type you want, and you'll get a new series back, one with the new dtype.

In [197]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [198]:
s.astype('int32')

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [199]:
s = s.astype('int32')
s

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [200]:
s = Series([10, 20, 30, 40, 50])  # Pandas sees our integers, and assumes we want int64
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [201]:
# what if I assign a float value into there?

s.loc[3] = 12.34  # in NumPy, this would be turned into an integer, truncating the decimal.. not in Pandas, though!
s

0    10.00
1    20.00
2    30.00
3    12.34
4    50.00
dtype: float64

In [202]:
# What about strings? What if I have string data?

s = Series('this is a test'.split())
s

0    this
1      is
2       a
3    test
dtype: object

What does a dtype of `object` mean? That Pandas isn't storing the values themselves in a NumPy array. Rather, each value is a Python object, and the series refers to those Python objects. Most often, `object` means that you have strings.
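
As a small sketch, we can retrieve one value and confirm that it's a regular Python string. (The dtype that Pandas reports for string data may differ in newer versions, so this example only checks the retrieved value, not the dtype.)

```python
import pandas as pd

s = pd.Series('this is a test'.split())

first = s.iloc[0]
print(type(first))      # a regular Python string
print(first.upper())    # ... with all of the usual string methods
```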