# Agenda 

1. Getting started and series
    - Why Python?
    - What is Pandas?
    - Pandas series -- defining and working with them
    - Broadcasting
    - Boolean/mask indexes
    - Indexes
    - Dtypes
2. Data frames
3. Analyzing data

# Jupyter notebook

Jupyter is a REPL -- a read-eval-print loop -- that runs in the browser. You don't have to use Jupyter in this course, though! If you prefer to use VSCode or PyCharm, that's totally OK.

If you *do* want to use Jupyter:

- Inside of VSCode and PyCharm, you can create/work with notebooks
- You can also use Jupyter Lite (https://jupyter.org/try-jupyter/lab/)



# A 5-minute introduction to Jupyter

We type into *cells* in Jupyter. Each cell contains code (Python) or documentation (Markdown). Each cell has two "modes":

- Edit mode -- when I type, the text goes into the cell. I'm in edit mode right now! You can enter edit mode by pressing `ENTER` or by clicking inside of a cell.
- Command mode -- when I type, typically one character, that character is a command to Jupyter. The character is not entered into the cell, but rather tells Jupyter to do something. You can enter command mode by pressing `ESC` or by clicking to the left of a cell.

What commands could we use?
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the most recently copied/cut cell
- `a` -- add a new cell *above* the current one
- `b` -- add a new cell *below* the current one
- `y` -- make a cell in Python mode
- `m` -- make a cell in Markdown (documentation) mode

In edit mode, pressing `ENTER` always goes down a line. To execute the code in a cell, press shift+`ENTER` together.

# What is data science?

This is a huge term that has emerged in the last decade or two. I divide it into several parts:

- *Data science* as a whole is everything having to do with retrieving, analyzing, and forecasting with data. It's the overall umbrella term.
- *Data analytics* is making sense of data that we've already collected. That's what we're going to be doing in this course.
- *Data engineering* is about getting data from one place to another, so that we can analyze it.
- *Modeling* and *machine learning* are about making predictions, or forecasts, based on existing data.

For example, if I'm a big company:
- My data engineers will move data from our various databases into a central location so that we can analyze it.
- The data analysts will look it over and understand how many widgets we sold last year, and in which regions, and which salespeople did the best job.
- The ML specialists will use the data to predict how well we'll sell our widgets next year, and which regions and types of customers we should target.

All of these disciplines now use Python as their main language of choice.

That seems **SUPER WEIRD** if you know anything about Python. Python doesn't run quickly and uses lots of memory.

The biggest reason that Python is the #1 language for data science is NumPy -- which gives us the speed and size of C data, but with a Python layer over it. That makes it easy to work with but also very efficient. NumPy has long been favored by scientists and engineers for working with data. There's even a SciPy package which has lots of libraries for various scientific and engineering fields.

Part of the reason NumPy works so well is that it's *vectorized*, meaning that we don't work with individual data points, but rather with groups of numbers all at once. NumPy is optimized to do things in that sort of way.
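Here's a minimal sketch of what "vectorized" looks like in practice. The `prices` array and its values are made up for illustration. The point is just that one expression applies to the whole array, without our writing a loop.

```python
import numpy as np

# A small array of made-up numbers, just for illustration
prices = np.array([10, 20, 30, 40, 50])

# Non-vectorized: handle one data point at a time in a Python loop
doubled_loop = [p * 2 for p in prices.tolist()]

# Vectorized: a single expression operates on the entire array
doubled_vec = prices * 2

# Aggregations work on the whole array, too
total = prices.sum()

print(doubled_vec)
print(total)
```

Both approaches give the same numbers here. On large arrays, the vectorized version is typically much faster, because the looping happens inside NumPy's compiled code rather than in Python.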

I could use NumPy for everything! But it's a bit low level for many people's tastes.

That's where Pandas comes in: Pandas is a wrapper around NumPy that provides a ton of additional functionality. Using a Pandas series is very similar to using a NumPy array, but you have hundreds of additional methods. Also, you can work with many more data types (e.g., dates and times, and also strings), and you can work with many different file types and formats.
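To make that comparison a bit more concrete, here's a small sketch that puts a NumPy array next to a Pandas series. All of the values and variable names are made up for illustration, and the methods shown (`describe`, `.str.upper`, `.dt.month`) are just a tiny sample of what a series offers.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30, 40, 50]
a = np.array(values)    # NumPy array
s = pd.Series(values)   # Pandas series

# The same vectorized arithmetic works on both
a_plus = a + 5
s_plus = s + 5

# A series has many additional methods, such as a quick statistical summary
summary = s.describe()

# A series can also hold strings...
names = pd.Series(["alice", "bob"])
upper = names.str.upper()

# ...and dates and times
dates = pd.to_datetime(pd.Series(["2024-01-15", "2024-02-20"]))
months = dates.dt.month

print(summary)
print(upper)
print(months)
```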

Pandas has been around for about 17 years now, and it continues to be really, really popular -- about 40m downloads every month.

Lots of organizations are moving from Matlab, R, or Excel to use Pandas instead.

In many ways, you can think of Pandas as providing all of the functionality of Excel, but inside of Python. This means that you can perform all of those calculations, but you'll do it without needing to sit in front of your computer.

There is no way to remember all of the Pandas functionality! My goal is to teach you how to think the way that Pandas wants you to think, and thus be able to find and understand the right documentation to do what you want.

In [2]:
# before you can "import pandas", you need to install it using either "pip" or "uv" or "conda"

# I type quickly, and avoided using the "as pd" for the first year or two I used Pandas. A big mistake!

import numpy as np
import pandas as pd
from pandas import Series

# Installing packages

The traditional way to install Python packages, and still the "official" way, is with the `pip` command. It's not a Python function you run inside of Jupyter or a program. Rather, you run it on the command line, just as you run a Python program.

    pip install pandas

That's the most standard way to do it. And that installs Pandas into your `site-packages` directory.

Another package installer is `conda`, which is used by people using Anaconda distributions of Python and Pandas. I think you would say

    conda install pandas

The newest way to install packages in Python is `uv`, from a company called Astral. The simplest way to install packages with `uv` is to say

    uv pip install pandas

`uv` is far, far faster than `pip`. It also has a lot of functionality that `pip` doesn't have -- it basically replaces venv, pyenv, Poetry, and a number of other packaging tools.
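Whichever installer you use, you might want to check from inside Python that the package actually landed in your environment. Here's one possible sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is my own invention for this example, not part of any installer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it isn't installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if Pandas is installed, or None if it isn't
print(installed_version("pandas"))
```

If this prints `None`, then the package isn't installed in the environment that this Python is running in, which may not be the environment where you ran the installer.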



# Series

The core data structure in Pandas is the "series," which is a 1D data structure. It's similar to a list in Python, but it also has some big differences. We can create a new series with `pd.Series`, handing it a Python list of values. (Or if you prefer, a NumPy array of values.)

In [3]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64
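As a rough sketch of how a series differs from a list, and of creating one from a NumPy array instead, consider the following. The variable names and values are just for illustration.

```python
import numpy as np
from pandas import Series

values = [10, 20, 30]
s = Series(values)

# Multiplying a list repeats its elements...
list_times_two = values * 2

# ...whereas multiplying a series multiplies each value
series_times_two = s * 2

# A series can also be created from a NumPy array
from_array = Series(np.array(values))

# And a series has methods that a list doesn't, such as mean
average = s.mean()

print(list_times_two)
print(series_times_two)
print(from_array)
print(average)
```

The list version gives back six elements, while the series version gives back three values, each doubled.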