# Agenda: Data analysis in 3 weeks

1. Getting started + series
    - What is Pandas?
    - Getting started with Pandas
    - What is a series?
    - Useful methods
    - Setting and retrieving values
    - Broadcasting
    - Mask arrays ("boolean arrays")
    - Indexes
    - Dtypes, including `NaN`
2. Data frames -- 2D data
    - Creating data frames
    - Adding and removing data
    - Querying with boolean indexes
    - Using `.loc` and `.iloc`
    - Reading CSV files
    - Reading HTML, Excel
3. Analyzing our data
    - Sorting
    - Grouping
    - Joining
    - Basic plotting and visualization

# What is data analysis? What does Python have to do with it?

You might have heard about "data science." I define things as follows:

- Data analysis: Learning about the past, based on data you've collected
- Machine learning (or AI): Given past experience, how would we predict the future?
- Data science is the combination of these two things

If we have data, then we can analyze it! We can learn about it, which is good for:

- Us
- Our company
- Our organization
- People in general (medical, scientific progress)

We can get data from all over:

- Our phones produce (and collect) lots of data
- Web sites we browse
- Social media
- When we buy things online

## Why Python?

Python is easy to learn and use, but it isn't known for being fast, for handling numbers efficiently, or for being frugal with memory. So how is it that data analytics, which involves lots of calculations on large data sets, is dominated by Python?

If we used regular Python data structures, performance would indeed be poor. But we aren't going to use Python's integers and floats. Instead, we're going to use NumPy, which provides a thin layer of Python over arrays of numbers implemented in C. Pandas, in turn, gives us a layer over NumPy that is easier to work with and adds a lot of extra functionality.
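To make that difference concrete, here is a minimal sketch (not a benchmark) comparing a Python list with a NumPy array holding the same numbers. The variable names are made up for this example, and the size of a Python integer depends on your interpreter, so I don't show a number for it.

```python
import sys
import numpy as np

py_numbers = [10, 20, 30, 40, 50]
np_numbers = np.array(py_numbers, dtype="int64")

# Each Python int is a full object, with its own bookkeeping overhead
print(sys.getsizeof(py_numbers[0]))   # size in bytes; varies by Python build

# A NumPy array stores raw 64-bit integers side by side: 8 bytes each
print(np_numbers.itemsize)            # 8
print(np_numbers.nbytes)              # 40 (5 values * 8 bytes)

# Calculations run in compiled code over the whole array at once
print(np_numbers.sum())               # 150
```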

You get the best of all worlds:

- Access to Python's relatively easy learning curve
- Python's extensive library
- Speed of NumPy
- Convenience of Pandas

## What's the bad news here?

If you're used to standard Python data types, such as ints, floats, and strings, you're going to learn a lot of new ways to think about data. 

- You won't want to run `for` loops on your Pandas series and data frames.
- You won't want to use `if` to make decisions in Pandas.
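As a small preview of what replaces the loop, here is a sketch that sums a series both ways. The loop works, but the method hands the whole job to Pandas in a single call. The name `temps` is made up for this example, and the comparison at the end is only a taste of the mask arrays we'll get to later.

```python
import pandas as pd

temps = pd.Series([10, 20, 30, 40, 50])

# The "plain Python" habit: visit each value ourselves
total = 0
for one_value in temps:
    total += one_value

# The Pandas habit: ask the series to do the work
print(total)          # 150
print(temps.sum())    # 150

# Instead of `if` inside a loop, we compare the whole series at once
# and get back one True/False value per element
print((temps > 25).tolist())   # [False, False, True, True, True]
```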



# How can we use Pandas?

(I'm going to assume that you know how to install things from PyPI with `pip` or `uv`.)

We can load Pandas with a simple `import` statement:

In [1]:
import pandas as pd    # everyone, but *everyone*, defines the "pd" alias

In [2]:
# what version of Pandas am I running?

pd.__version__

'2.2.3'

# Today's focus is *series*

A Pandas series is a one-dimensional data structure, similar in many ways to a Python list. The series is the cornerstone of everything we do in Pandas.

A data frame is a 2D table in Pandas. Each of its columns is a Pandas series.
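We'll get to data frames next week, but here is a quick sketch of that relationship. The column names and numbers are hypothetical, chosen only for illustration.

```python
import pandas as pd

# hypothetical data, just to show how a data frame relates to a series
df = pd.DataFrame({
    "high": [27, 29, 32],
    "low": [18, 19, 21],
})

one_column = df["high"]                    # retrieve a single column
print(isinstance(one_column, pd.Series))   # True
print(one_column.max())                    # 32
```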

In [3]:
# a simple series

# we create a series by giving it (among other things) a list of Python integers

s = pd.Series([10, 20, 30, 40, 50])

In [4]:
# let's take a look at s!

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [5]:
type(s)

pandas.core.series.Series

# Some basic operations on a series

Most operations are written as methods. For example: 

- `.mean()` gives us the numeric mean of the values
- `.min()` and `.max()` give us the minimum and maximum values
- `.std()` gives us the standard deviation, a measure of how far the numbers typically "wiggle" away from the mean
- `.sum()` sums the numbers
- `.count()` tells us how many values there are

In [6]:
s.mean()

np.float64(30.0)

In [7]:
s.min()

np.int64(10)

In [8]:
s.max()

np.int64(50)

In [9]:
s.std()

np.float64(15.811388300841896)

In [10]:
s.sum()

np.int64(150)

In [11]:
s.count()

np.int64(5)
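One detail worth knowing: `std` in Pandas calculates the *sample* standard deviation by default, dividing by `n - 1` rather than `n`. The sketch below recalculates it by hand for the series above. It relies on subtraction and `**` working on every value at once, which we'll discuss properly when we get to broadcasting.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

squared_diffs = (s - s.mean()) ** 2                # 400, 100, 0, 100, 400
variance = squared_diffs.sum() / (s.count() - 1)   # 1000 / 4 = 250.0
by_hand = variance ** 0.5                          # square root of the variance

print(by_hand)    # about 15.81
print(s.std())    # the same value, up to floating-point rounding
```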

# Exercise: Weather forecast

1. Create a series with the forecast high temps for your city in the next 10 days.
2. What will be the highest temperature?
3. What will be the mean temperature?
4. What will be the lowest temperature?

In [12]:
s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [13]:
s.max()

np.int64(37)

In [14]:
s.mean()

np.float64(32.0)

In [15]:
s.min()

np.int64(27)

In [16]:
s.std()

np.float64(3.3333333333333335)

In [17]:
s.mean() - s.std()

np.float64(28.666666666666668)

In [18]:
s.mean() + s.std()

np.float64(35.333333333333336)

In [19]:
# most of the temps in the coming 10 days will be within one standard deviation of the mean, from about 28.7 to 35.3

In [20]:
# you might remember that there is *another* kind of average we can calculate:
# the *median*. We sort all of the values in the series from smallest to largest
# and take the middle one. With an even number of values, as we have here,
# the median is the mean of the two middle values.

s.median()   # same as s.quantile(0.5)

np.float64(31.5)

# Why the median?

It's easy for a few outliers to skew the mean, either higher or lower. By using the median, we know that we're getting a "middle" value: half of the values are at or below it, and half are at or above it.
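Here is a sketch of that effect, assuming the forecast from my solution above plus a hypothetical data-entry mistake: the last value was typed as 300 instead of 30.

```python
import pandas as pd

temps = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

# same data, but the last value was mistyped as 300 instead of 30
with_typo = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 300])

print(temps.mean(), temps.median())           # 32.0 31.5
print(with_typo.mean(), with_typo.median())   # 59.0 32.0
```

One bad value nearly doubles the mean, while the median barely moves.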

You might also be interested in the first-quartile and third-quartile values, meaning the 25% mark and the 75% mark.

In [21]:
s.quantile(0.25)

np.float64(30.0)

In [22]:
s.quantile(0.75)

np.float64(35.0)

In [23]:
# you can even calculate the IQR -- inter-quartile range -- which tells us the distance between
# the 25% mark and the 75% mark

s.quantile(0.75) - s.quantile(0.25)

np.float64(5.0)

A famous statistician, John Tukey, loved the median and the quartiles. He described the "five-number summary" for a data set, which gives us a good "picture" of its behavior:

- min
- first quartile (the 25% mark)
- median
- third quartile (the 75% mark)
- max
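If you only want the minimum, the quartiles, the median, and the maximum, one way to get them in a single call is to pass a list of fractions to `quantile`. This is a sketch using the forecast series from my solution above; the variable name `five` is my own.

```python
import pandas as pd

s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

# quantile() accepts a list, and returns a series indexed by the fractions we asked for
five = s.quantile([0, 0.25, 0.5, 0.75, 1])

print(five.tolist())   # [27.0, 30.0, 31.5, 35.0, 37.0]
```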

In [24]:
# in Pandas, we can get all of these (more or less), plus a few more, with the "describe" method

s.describe()

count    10.000000
mean     32.000000
std       3.333333
min      27.000000
25%      30.000000
50%      31.500000
75%      35.000000
max      37.000000
dtype: float64

# Some other really useful methods for working with your data

- Get the first few values with `head`
- Get the last few values with `tail`

In both cases, we get 5 values by default, but can pass any number we want.

In [25]:
s.head(5)

0    27
1    29
2    32
3    36
4    37
dtype: int64

In [26]:
s.tail(5)

5    36
6    30
7    32
8    31
9    30
dtype: int64

In [27]:
s.head(1)

0    27
dtype: int64

In [28]:
s.tail(3)

7    32
8    31
9    30
dtype: int64

# My favorite method: `value_counts`

This counts how many times each value appears in a series. It returns a new series in which your original values are the index, and the values are integers indicating how many times each one appeared. By default, the result is sorted from most common to least common.

In [29]:
s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [30]:
s.value_counts()   # how often did each value in s appear?

32    2
36    2
30    2
27    1
29    1
37    1
31    1
Name: count, dtype: int64
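Because the result is itself a series, everything we've seen so far works on it. In this sketch I look up one temperature using `.loc`, which we haven't covered yet; treat it as a preview of retrieving a value by its index label.

```python
import pandas as pd

s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])
counts = s.value_counts()

# the original values are now the index, so we can look one up
print(counts.loc[32])   # 2

# the counts add up to the number of values in s
print(counts.sum())     # 10

# series methods work on the result, too
print(counts.max())     # 2
```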

# Exercise: More with temperatures

1. Compare the mean and median temperatures forecast for your city. Are they the same or similar? Why or why not?
2. Calculate the IQR for your temperatures.
3. What are the three most common temperatures forecast in your city?
4. Using `head` and `tail`, get the temperatures forecast from the 4th day through the 8th day.

In [34]:
s.mean()

np.float64(32.0)

In [35]:
s.median()

np.float64(31.5)

In [36]:
s.quantile(0.75) - s.quantile(0.25)

np.float64(5.0)

In [38]:
s.value_counts().head(3) 

32    2
36    2
30    2
Name: count, dtype: int64

In [41]:
s.tail(7).head(5)

3    36
4    37
5    36
6    30
7    32
dtype: int64
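Chaining `tail` and `head` works, but it takes a moment to read. As a preview of `.iloc`, which is on the agenda for later, here is a sketch of the same selection written as a slice of positions. Positions start at 0, so the 4th day is position 3.

```python
import pandas as pd

s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

via_head_tail = s.tail(7).head(5)
via_slice = s.iloc[3:8]    # positions 3 through 7; the end position is excluded

print(via_slice.tolist())                # [36, 37, 36, 30, 32]
print(via_slice.equals(via_head_tail))   # True
```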

# Next up

- Setting and retrieving values
- Broadcasting