# Agenda 

1. Getting started and series
    - Why Python?
    - What is Pandas?
    - Pandas series -- defining and working with them
    - Broadcasting
    - Boolean/mask indexes
    - Indexes
    - Dtypes
2. Data frames
3. Analyzing data

# Jupyter notebook

Jupyter is a REPL -- a read, eval, print loop -- that runs in the browser. You don't have to use Jupyter in this course, though! If you prefer to use VSCode or PyCharm, that's totally OK.

If you *do* want to use Jupyter:

- Inside of VSCode and PyCharm, you can create/work with notebooks
- You can also use Jupyter Lite (https://jupyter.org/try-jupyter/lab/)



# A 5-minute introduction to Jupyter

We type into *cells* in Jupyter. Each cell contains code (Python) or documentation (Markdown). Each cell has two "modes":

- Edit mode -- when I type, the text goes into the cell. I'm in edit mode right now! You can enter edit mode by pressing `ENTER` or by clicking inside of a cell.
- Command mode -- when I type, typically one character, that character is a command to Jupyter. The character is not entered into the cell, but rather tells Jupyter to do something. You can enter command mode by pressing `ESC` or by clicking to the left of a cell.

What commands could we use?
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the most recently copied/cut cell
- `a` -- add a new cell *above* the current one
- `b` -- add a new cell *below* the current one
- `y` -- make a cell in Python mode
- `m` -- make a cell in Markdown (documentation) mode

In edit mode, pressing `ENTER` always goes down a line. To execute the code in a cell, press shift+`ENTER` together.

# What is data science?

This is a huge, new term in the last decade or two. I divide it into several parts:

- *Data science* as a whole is everything having to do with retrieving, analyzing, and forecasting with data. It's the overall umbrella term.
- *Data analytics* is making sense of data that we've already collected. That's what we're going to be doing in this course.
- *Data engineering* is about getting data from one place to another, so that we can analyze it.
- *Modeling* and *machine learning* is about making predictions, or forecasts, based on existing data.

For example, if I'm a big company:
- My data engineers will move data from our various databases into a central location so that we can analyze it.
- The data analysts will look it over and understand how many widgets we sold last year, and in which regions, and which salespeople did the best job.
- The ML specialists will use the data to predict how well we'll sell our widgets next year, and which regions and types of customers we should target.

All of these disciplines now use Python as their main language of choice.

That seems **SUPER WEIRD** if you know anything about Python. Python doesn't run quickly and uses lots of memory.

The biggest reason that Python is the #1 language for data science is NumPy -- which gives us the speed and size of C data, but with a Python layer over it. That makes it easy to work with but also very efficient. NumPy has long been favored by scientists and engineers for working with data. There's even a SciPy package which has lots of libraries for various scientific and engineering fields.

Part of the reason NumPy works so well is that it's *vectorized*, meaning that we don't work with individual data points, but rather with groups of numbers. NumPy is optimized to do things in that sort of way.
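
To make "vectorized" a bit more concrete, here is a minimal sketch. The `prices` array is made-up data, used only for illustration. The loop handles one value at a time, whereas the vectorized version applies a single expression to the whole array.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
prices = np.array([10, 20, 30, 40, 50])

# Non-vectorized: a Python loop that handles one value at a time
doubled_loop = [int(price * 2) for price in prices]

# Vectorized: one expression applied to the whole array at once
doubled_vectorized = prices * 2

print(doubled_loop)                  # [20, 40, 60, 80, 100]
print(doubled_vectorized.tolist())   # [20, 40, 60, 80, 100]
```

Both give the same numbers, but in the vectorized version the looping happens inside NumPy rather than in Python.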

I could use NumPy for everything! But it's a bit low level for many people's tastes.

That's where Pandas comes in: Pandas is a wrapper around NumPy that provides a ton of additional functionality. Using a Pandas series is very similar to using a NumPy array, but you have hundreds of additional methods. Also, you can work with many more data types (e.g., dates and times, and also strings), and you can work with many different file types and formats.

Pandas has been around for about 17 years now, and it continues to be really, really popular -- about 40m downloads every month.

Lots of organizations are moving from Matlab, R, or Excel to use Pandas instead.

In many ways, you can think of Pandas as providing all of the functionality of Excel, but inside of Python. This means that you can perform all of those calculations, but you'll do it without needing to sit in front of your computer.

There is no way to remember all of the Pandas functionality! My goal is to teach you how to think the way that Pandas wants you to think, and thus be able to find and understand the right documentation to do what you want.

In [2]:
# before you can "import pandas", you need to install it using either "pip" or "uv" or "conda"

# I type quickly, and avoided using the "as pd" for the first year or two I used Pandas. A big mistake!

import numpy as np
import pandas as pd
from pandas import Series

# Installing packages

The traditional way to install Python packages, and still the "official" way, is with the `pip` command. It's not a Python function you run inside of Jupyter or a program. Rather, you run it on the command line, just as you run a Python program.

    pip install pandas

That's the most standard way to do it. And that installs Pandas into your `site-packages` directory.

Another package installer is `conda`, which is used by people using Anaconda distributions of Python and Pandas. I think you would say

    conda install pandas

The newest way to install packages in Python is "uv", from a company called Astral. The simplest way to install packages with uv is to say

    uv pip install pandas

uv is far, far faster than pip. It also has a lot of functionality that pip doesn't have -- it basically replaces venv, pyenv, Poetry, and a number of other tools.

You can learn more at https://uvcrashcourse.com .
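
Whichever installer you use, one quick way to check that it worked is to import Pandas and print its version. This is just a sanity check, and the version number you see will depend on what you installed.

```python
import pandas as pd

# If this import succeeds, Pandas is installed in the Python environment
# that is running this code
print(pd.__version__)
```

If you get `ModuleNotFoundError` instead, the package was probably installed into a different Python environment from the one you're running.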

# Series

The core data structure in Pandas is the "series," which is a 1D data structure. It's similar to a list in Python, but it also has some big differences. We can create a new series with `pd.Series`, handing it a Python list of values. (Or if you prefer, a NumPy array of values.)

In [3]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

# Let's take apart what we just did

1. We created a new Series value. This is not a list, but rather a totally different data structure. However, to get the data into Pandas, we handed it a list.
2. Our series, as we'll see, needs to have values that are all of the same type -- in this case, all integers.
3. The series can have any number of values.
4. The series actually, behind the scenes, creates a NumPy array. Pandas is a Python shell around NumPy values.
5. When we print our series, you can see that we get all of the values in the right column, and their indexes in the left column. Our initial series has the default index, which is what you would probably expect in Python: it starts at 0 and goes through the length - 1.
6. We can see the `dtype` of the series, which describes what types of values are in it. This series has `int64`, meaning 64-bit integers.

Already that should surprise you! Python doesn't have 64-bit ints, or 32-bit ints, or 128-bit ints. That's because these are not being stored as Python ints, but rather as C ints via NumPy.
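
Here is a small sketch of that relationship between Pandas and NumPy. The variable names are mine, and I set the `dtype` explicitly so that the result doesn't depend on your platform.

```python
import numpy as np
import pandas as pd

# A series can be built from a NumPy array as well as from a list
arr = np.array([10, 20, 30], dtype=np.int64)
s_from_array = pd.Series(arr)
print(s_from_array.dtype)   # int64

# to_numpy() hands back the values as a NumPy array
values = s_from_array.to_numpy()
print(type(values))

# A plain Python int has no fixed size, so it can grow far beyond 64 bits
big = 2 ** 100
print(big > 2 ** 63)        # True
```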

In [5]:
# Python list
mylist = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# What happens if I add this list to itself? I get a list of 20 elements:
mylist + mylist

[10,
 20,
 30,
 40,
 50,
 60,
 70,
 80,
 90,
 100,
 10,
 20,
 30,
 40,
 50,
 60,
 70,
 80,
 90,
 100]

In [6]:
# what happens if I add my series to itself?
s + s

0     20
1     40
2     60
3     80
4    100
5    120
6    140
7    160
8    180
9    200
dtype: int64

When you add (or multiply, or subtract) two series in Pandas, the operation is *vectorized*. That means the operation is applied to the values at index 0, then to the values at index 1, then at index 2, etc., and we get back a new series with the results.
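
A minimal sketch with two different series shows the pairing more clearly than adding a series to itself. The series `a` and `b` are made up for this example.

```python
import pandas as pd

a = pd.Series([10, 20, 30])
b = pd.Series([1, 2, 3])

# Each operation pairs up the values that share an index
total = a + b
product = a * b

# With plain lists, we would need to write the pairing ourselves
list_total = [x + y for x, y in zip([10, 20, 30], [1, 2, 3])]

print(total.tolist())    # [11, 22, 33]
print(product.tolist())  # [10, 40, 90]
print(list_total)        # [11, 22, 33]
```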

In [7]:
# like a list, you can retrieve from a series

s[0]

np.int64(10)

In [8]:
s[1]

np.int64(20)

In [11]:
s[2:7]   # I get a slice back from this series! The index is 2, 3, 4, 5, 6

2    30
3    40
4    50
5    60
6    70
dtype: int64

In [10]:
mylist[2:7]  # I get a list back from slicing mylist, and the index is 0, 1, 2, 3, 4

[30, 40, 50, 60, 70]
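
One consequence of that difference, sketched here with a fresh copy of the series: the slice keeps the index it had in the original series, so its first element is still at index 2, not index 0.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
part = s[2:7]

print(part.index.tolist())  # [2, 3, 4, 5, 6]
print(part[2])              # 30 -- the first element keeps its original index, 2
```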

In [13]:
s[2] = 999   # series are mutable, as well!
s

0     10
1     20
2    999
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

# Exercise: Basic series operations

1. `import pandas as pd` into your environment.
2. Create a Pandas series containing 5 integers.
3. Grab the first integer (at index 0)
4. Grab the third integer (at index 2)
5. What happens when you add the series to itself?

In [14]:
import pandas as pd

s = pd.Series([15, 7, 100, 2, 8])
s

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [15]:
s[0]

np.int64(15)

In [16]:
s[2]

np.int64(100)

In [17]:
s + s

0     30
1     14
2    200
3      4
4     16
dtype: int64

In [18]:
s - s 

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [19]:
s / s

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
dtype: float64

In [20]:
s[-1]   # unlike a list, this fails: -1 is treated as an index label, not as "the last element"

KeyError: -1
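
The `KeyError` happens because square brackets on a series with the default index look for an index *label*, and there is no label -1. One way to retrieve by position instead is the `.iloc` accessor, which isn't used elsewhere in these notes, so take this as a sketch rather than the only approach.

```python
import pandas as pd

s = pd.Series([15, 7, 100, 2, 8])

# .iloc retrieves by position, so negative numbers count from the end,
# just as they do with a Python list
last = s.iloc[-1]
print(last)   # 8

# Square brackets look for the label -1, which doesn't exist here
try:
    s[-1]
    raised = False
except KeyError:
    raised = True

print(raised)  # True
```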

# Useful methods

Pandas provides us with a lot of methods that we use to analyze our data. Some of these are taken from NumPy, but others are special to Pandas.

In [21]:
s

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [22]:
s.sum()   # returns the sum of the values in the series

np.int64(132)

In [23]:
s.count()  # how many values (not NaN, or "not a number") are there in the series?

np.int64(5)

In [24]:
s.mean()  # what is the arithmetic mean of the series?

np.float64(26.4)

In [25]:
s.sum() / s.count()

np.float64(26.4)

In [26]:
s.std()   # standard deviation

np.float64(41.40410607657168)

In [28]:
(s / s).std()

np.float64(0.0)

In [29]:
# SS: is there a difference between len(s) and s.count()?
# yes!

len(s)  # this returns the number of elements, no matter what they are, in the series

5

In [30]:
s.count()  # this returns the number of non-missing values in the series

np.int64(5)
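
In our series the two numbers are the same, because nothing is missing. Here is a sketch with a made-up series that contains one `NaN`, where they differ.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
with_missing = pd.Series([10, 20, np.nan, 40])

print(len(with_missing))      # 4 -- every element, including the NaN
print(with_missing.count())   # 3 -- only the non-missing values
```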

In [32]:
s.min()  # what's the smallest value?

np.int64(2)

In [33]:
s.max()  # what's the biggest number?

np.int64(100)

In [34]:
s.median()  # what is the median?

np.float64(8.0)

In [35]:
s.describe()  # this gives me all of the values we just saw, and then some!

count      5.000000
mean      26.400000
std       41.404106
min        2.000000
25%        7.000000
50%        8.000000
75%       15.000000
max      100.000000
dtype: float64

In [36]:
s.head()  # this returns the first 5 elements of a series

0     15
1      7
2    100
3      2
4      8
dtype: int64

In [37]:
s.head(2)  # now give me the first 2

0    15
1     7
dtype: int64

In [38]:
# similarly, I can use s.tail() -- gives 5 bottom/final values by default

s.tail(3)

2    100
3      2
4      8
dtype: int64

In [39]:
s = Series([10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 40, 40, 40, 30, 20, 10, 50])

# how many times does each value appear in s?
# we can find out with value_counts

# the result of value_counts is a series -- one in which the values from the original series (s)
# are the index, and the results are the number of times each value appears

# also, the result is sorted, from most common to least common.

s.value_counts()

10    6
20    4
30    3
40    3
50    1
Name: count, dtype: int64
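
To see that structure directly, here is a small sketch with made-up string data (and no ties, so the order is unambiguous). The index of the result holds the original values, and the values of the result hold the counts.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'a', 'c', 'a', 'b'])
counts = letters.value_counts()

print(counts.index.tolist())  # ['a', 'b', 'c'] -- values from the original series
print(counts.tolist())        # [3, 2, 1] -- how often each one appears
```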

# Exercise: Series calculations

1. Create a series of 10 numbers. It's OK for some of them to repeat.
2. Find the largest and smallest numbers in two ways -- using individual methods and then using `describe`.
3. Which number occurs the most often in your series?

In [40]:
s = Series([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])



In [41]:
s.min()

np.int64(10)

In [42]:
s.max()

np.int64(50)

In [43]:
s.describe()

count    10.000000
mean     28.000000
std      13.165612
min      10.000000
25%      20.000000
50%      30.000000
75%      37.500000
max      50.000000
dtype: float64
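
Because the result of `describe` is itself a series, with strings as its index, we can pull individual numbers out of it with square brackets. A minimal sketch, using the same values as above:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])
summary = s.describe()

smallest = summary['min']
largest = summary['max']
print(smallest, largest)  # 10.0 50.0
```

Note that the values come back as floats, because the series returned by `describe` has a dtype of `float64`.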

In [44]:
s.value_counts()

30    3
10    2
20    2
40    2
50    1
Name: count, dtype: int64

In [46]:
# this gives us the maximum value, but not the index associated with it
# if we want to know *which* number appeared most often, this won't do it.
s.value_counts().max()

np.int64(3)

In [47]:
s.value_counts().idxmax()   # this returns the index associated with the highest count, i.e., the most common value

np.int64(30)

In [48]:
s.value_counts().head(1)

30    3
Name: count, dtype: int64
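
Another option, which I didn't use above, is the `mode` method. It returns a series rather than a single value, because more than one value can be tied for most common. A sketch with the same data:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 20, 30, 40, 30, 40, 50, 10])

most_common = s.mode()
print(most_common.tolist())  # [30]
```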

# Pandas has a limited set of tools

When we get results back from Pandas, they will almost always be one of:

- Single NumPy values (e.g., `np.int64`)
- Pandas series
- Pandas data frames (the 2D data structures we'll see next week)
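
We can check this with `type` on a few of the methods we've already used. This sketch only covers the first two kinds of result, since we haven't created any data frames yet.

```python
import numpy as np
import pandas as pd

s = pd.Series([15, 7, 100, 2, 8])

print(type(s.sum()))       # a single NumPy value
print(type(s.head(2)))     # a Pandas series
print(type(s.describe()))  # also a Pandas series
```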

# Next up

- Setting and retrieving values
- Broadcasting