# Agenda: Data analysis in 3 weeks

1. Getting started + series
    - What is Pandas?
    - Getting started with Pandas
    - What is a series?
    - Useful methods
    - Setting and retrieving values
    - Broadcasting
    - Mask arrays ("boolean arrays")
    - Indexes
    - Dtypes, including `NaN`
2. Data frames -- 2D data
    - Creating data frames
    - Adding and removing data
    - Querying with boolean indexes
    - Using `.loc` and `.iloc`
    - Reading CSV files
    - Reading HTML, Excel
3. Analyzing our data
    - Sorting
    - Grouping
    - Joining
    - Basic plotting and visualization

# What is data analysis? What does Python have to do with it?

You might have heard about "data science." I define things as follows:

- Data analysis: Learning about the past, based on data you've collected
- Machine learning (or AI): Given past experience, how would we predict the future?
- Data science is the combination of these two things

If we have data, then we can analyze it! We can learn about it, which is good for:

- Us
- Our company
- Our organization
- People in general (medical, scientific progress)

We can get data from all over:
- Our phones produce (and collect) lots of data
- Web sites we browse
- Social media
- When we buy things online

## Why Python?

Python is a really easy-to-use language, but it isn't known for being very fast, very good with numbers, or memory efficient. How can it be that data analysis, which involves lots of calculations with large data sets, is somehow dominated by Python?

If we were to use regular Python data structures, our code would be slow and would use far too much memory. But we aren't going to use Python's integers and floats. Rather, we're going to use NumPy, which provides a thin layer of Python over a C-language implementation of numeric arrays. Pandas provides us with a layer over NumPy which is easier to work with and provides a lot of extra functionality.

You get the best of all worlds:

- Access to Python's relatively easy learning curve
- Python's extensive library
- Speed of NumPy
- Convenience of Pandas

## What's the bad news here?

If you're used to standard Python data types, such as ints, floats, and strings, you're going to learn a lot of new ways to think about data. 

- You won't want to run `for` loops with your Pandas series and data frames.
- You won't want to use `if` to make decisions in Pandas.
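
To make those two points concrete, here is a minimal sketch that does the same work twice: once with a plain Python `for` loop and `if`, and once the Pandas way, with an operator applied to the whole series and a comparison used as a filter. The series `prices` and the threshold of 100 are made up for illustration.

```python
import pandas as pd

prices = pd.Series([80, 120, 95, 150])   # hypothetical data

# Plain-Python style: loop over the values and decide with `if`
doubled_loop = []
expensive_loop = []
for one_price in prices:
    doubled_loop.append(one_price * 2)
    if one_price > 100:
        expensive_loop.append(one_price)

# Pandas style: no loop, no `if`
doubled = prices * 2                    # the operation is applied to every value
expensive = prices.loc[prices > 100]    # the comparison selects the matching values

print(doubled.tolist())      # [160, 240, 190, 300]
print(expensive.tolist())    # [120, 150]
```

Both versions produce the same numbers; the Pandas version is shorter and hands the looping to NumPy.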



# How can we use Pandas?

(I'm going to assume that you know how to install things from PyPI with `pip` or `uv`.)

We can load Pandas with a simple `import` statement:

In [1]:
import pandas as pd    # everyone, but *everyone*, defines the "pd" alias

In [2]:
# what version of Pandas am I running?

pd.__version__

'2.2.3'

# Today's focus is *series*

A Pandas series is a one-dimensional data structure, similar in many ways to a Python list.  This is the cornerstone of everything we do in Pandas. 

A data frame is a 2D table in Pandas. Each of its columns is a Pandas series.
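
As a quick preview of that relationship, the sketch below builds a tiny data frame (the column names `high` and `low` and their values are invented) and checks that pulling out one column gives back a series with all of the usual series methods.

```python
import pandas as pd

# A small, made-up data frame with two columns
df = pd.DataFrame({'high': [27, 29, 32],
                   'low': [18, 19, 21]})

high = df['high']     # retrieve a single column

print(type(high))     # <class 'pandas.core.series.Series'>
print(high.max())     # 32
```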

In [3]:
# a simple series

# we create a series by giving it (among other things) a list of Python integers

s = pd.Series([10, 20, 30, 40, 50])

In [4]:
# let's take a look at s!

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [5]:
type(s)

pandas.core.series.Series

# Some basic operations on a series

Most operations are written as methods. For example: 

- `.mean()` gives us the arithmetic mean of the values
- `.min()` and `.max()` give us the minimum and maximum values
- `.std()` gives us the standard deviation, meaning how much the numbers "wiggle" around the mean
- `.sum()` sums the numbers
- `.count()` tells us how many (non-missing) values there are

In [6]:
s.mean()

np.float64(30.0)

In [7]:
s.min()

np.int64(10)

In [8]:
s.max()

np.int64(50)

In [9]:
s.std()

np.float64(15.811388300841896)

In [10]:
s.sum()

np.int64(150)

In [11]:
s.count()

np.int64(5)

# Exercise: Weather forecast

1. Create a series with the forecast high temps for your city in the next 10 days.
2. What will be the highest temperature?
3. What will be the mean temperature?
4. What will be the lowest temperature?

In [12]:
s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [13]:
s.max()

np.int64(37)

In [14]:
s.mean()

np.float64(32.0)

In [15]:
s.min()

np.int64(27)

In [16]:
s.std()

np.float64(3.3333333333333335)

In [17]:
s.mean() - s.std()

np.float64(28.666666666666668)

In [18]:
s.mean() + s.std()

np.float64(35.333333333333336)

In [19]:
# most of the temps in the coming 10 days will be between 28.7 and 35.3

In [20]:
# you might remember that there is *another* kind of average we can calculate:
# the *median*. We calculate it by sorting all of the values in a series from
# smallest to largest and taking the middle one. (With an even number of values,
# it's the mean of the two middle ones.)

s.median()   # same as s.quantile(0.5)

np.float64(31.5)

# Why the median?

It's easy for a few outliers to skew the mean, either higher or lower. By using the median, we know that we're getting a "middle" value: half of the values are at or below it, and half are at or above it.

You might also be interested in the first-quartile and third-quartile values, meaning the 25% mark and the 75% mark.
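
Here is a small sketch of that skewing effect. The salary figures are made up; the last one is deliberately extreme, to show how far a single outlier drags the mean while the median barely notices.

```python
import pandas as pd

# Made-up monthly salaries; the last value is an extreme outlier
salaries = pd.Series([3000, 3200, 3100, 2900, 50000])

print(salaries.mean())      # 12440.0 -- pulled way up by the outlier
print(salaries.median())    # 3100.0 -- still a typical value
```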

In [21]:
s.quantile(0.25)

np.float64(30.0)

In [22]:
s.quantile(0.75)

np.float64(35.0)

In [23]:
# you can even calculate the IQR -- inter-quartile range -- which tells us the distance between
# the 25% mark and the 75% mark

s.quantile(0.75) - s.quantile(0.25)

np.float64(5.0)

A famous statistician, John Tukey, loved the use of median and IQR. He described the "five-number summary" for a data set, so that we can have a good "picture" of its behavior:

- min
- first quartile (the 25% mark)
- median
- third quartile (the 75% mark)
- max

The IQR is the distance between the two quartiles.
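
A minimal sketch of collecting such a summary by hand: `quantile` accepts a list of fractions, so the minimum (0), the quartiles, the median (0.5), and the maximum (1) come back together in one series, and the IQR is one subtraction away. The data is the forecast series used above; the variable names `summary` and `iqr` are just for illustration.

```python
import pandas as pd

s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

# Passing a list of fractions returns a series, indexed by those fractions
summary = s.quantile([0, 0.25, 0.5, 0.75, 1])
print(summary)

iqr = summary.loc[0.75] - summary.loc[0.25]
print(iqr)    # 5.0
```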

In [24]:
# in Pandas, we can get all of these (more or less), plus a few more, with the "describe" method

s.describe()

count    10.000000
mean     32.000000
std       3.333333
min      27.000000
25%      30.000000
50%      31.500000
75%      35.000000
max      37.000000
dtype: float64

# Some other really useful methods for working with your data

- Get the first few values with `head`
- Get the last few values with `tail`

In both cases, we get 5 values by default, but can pass any number we want.

In [25]:
s.head(5)

0    27
1    29
2    32
3    36
4    37
dtype: int64

In [26]:
s.tail(5)

5    36
6    30
7    32
8    31
9    30
dtype: int64

In [27]:
s.head(1)

0    27
dtype: int64

In [28]:
s.tail(3)

7    32
8    31
9    30
dtype: int64

# My favorite method: `value_counts`

This counts how many times every value appears in a series. It returns a new series, one in which your original values are now the index, and the values in the new series are integers -- how many times did each value appear.
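
One way to picture what `value_counts` computes is Python's own `collections.Counter`, which builds a dict-like mapping from each value to the number of times it appears. The sketch below runs both on the same data and compares them; the comparison with `Counter` is only an analogy, not a description of how Pandas implements the method.

```python
from collections import Counter

import pandas as pd

s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

counts = s.value_counts()        # a series: original values in the index, counts as values
by_hand = Counter(s.tolist())    # plain-Python analogue: value -> count

print(counts.loc[32])    # 2 -- the value 32 appears twice
print(by_hand[32])       # 2
```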

In [29]:
s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [30]:
s.value_counts()   # how often did each value in s appear?

32    2
36    2
30    2
27    1
29    1
37    1
31    1
Name: count, dtype: int64

# Exercise: More with temperatures

1. Compare the mean and median temperatures forecast for your city. Are they the same or similar? Why or why not?
2. Calculate the IQR for your temperatures.
3. What are the three most common temperatures forecast in your city?
4. Using `head` and `tail`, get the temperatures forecast from the 4th day through the 8th day.

In [34]:
s.mean()

np.float64(32.0)

In [35]:
s.median()

np.float64(31.5)

In [36]:
s.quantile(0.75) - s.quantile(0.25)

np.float64(5.0)

In [38]:
s.value_counts().head(3) 

32    2
36    2
30    2
Name: count, dtype: int64

In [41]:
s.tail(7).head(5)

3    36
4    37
5    36
6    30
7    32
dtype: int64

# Next up

- Setting and retrieving values
- Broadcasting

In [42]:
# CC

s[4:9] # is this pythonic? borrowing your lingo from the programming class

4    37
5    36
6    30
7    32
8    31
dtype: int64

# How can we retrieve values from our series?

- Yes, we can use `[]`, as we do with Python lists and other values
- But it's more common to use `.loc` and `.iloc`

In [43]:
s = pd.Series([10, 15, 28, 30, 12, 74, 52, 36, 41])
s

0    10
1    15
2    28
3    30
4    12
5    74
6    52
7    36
8    41
dtype: int64

In [44]:
s[5]

np.int64(74)

In [45]:
# I can retrieve more than one value using a slice

s[3:6]   # from index 3 up to and not including index 6

3    30
4    12
5    74
dtype: int64

In [46]:
# In Pandas, we can do "fancy indexing," meaning that we request more than one index in []

s[4]

np.int64(12)

In [47]:
s[7]

np.int64(36)

In [49]:
s[   [4,7]   ]   # I put a list inside of the []  -- I get back a series, whose indexes are 4 and 7

4    12
7    36
dtype: int64

In [50]:
# you can even repeat indexes!

s[[2,4,5,2,4,5]]

2    28
4    12
5    74
2    28
4    12
5    74
dtype: int64

# Single values vs. series

When we're retrieving from a series, we might get a single value and we might get more than one value.

If we get more than one value, then we have a series, which means index + values.
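
The difference is easy to check for yourself. In this sketch, a single index gives back a bare value, while a list of indexes, even a list with only one element, gives back a series with its own index.

```python
import pandas as pd

s = pd.Series([10, 15, 28, 30, 12, 74, 52, 36, 41])

one_value = s[5]       # a single index -> a single value
sub_series = s[[5]]    # a list of indexes (even a one-element list) -> a series

print(one_value)                    # 74
print(type(sub_series))             # <class 'pandas.core.series.Series'>
print(sub_series.index.tolist())    # [5]
```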

# Another way: `.loc`

While it's tempting to use `[]` to retrieve from a series, you should try to get into the habit of using `.loc` for doing so. That's because (a) `.loc` is more flexible and (b) `.loc` will also be useful when we work with data frames.

`.loc` looks, feels, and acts like a method, except that it uses `[]` rather than `()`.

In [51]:
s.loc[3]

np.int64(30)

In [52]:
s.loc[[2, 5]]

2    28
5    74
dtype: int64

In [53]:
s.loc[4:8]  # notice: when we use a slice with .loc, the end index is INCLUDED!

4    12
5    74
6    52
7    36
8    41
dtype: int64

# Assigning to a series

Just as we can retrieve from a series, we can also assign to it, changing values. Anything that can retrieve from a series can also be used to assign.

In [54]:
s

0    10
1    15
2    28
3    30
4    12
5    74
6    52
7    36
8    41
dtype: int64

In [55]:
s.loc[3] = 999
s

0     10
1     15
2     28
3    999
4     12
5     74
6     52
7     36
8     41
dtype: int64

In [56]:
s.loc[[2, 5]] = 888   # yes, we can assign to a fancy index -- all of those values will be replaced!
s

0     10
1     15
2    888
3    999
4     12
5    888
6     52
7     36
8     41
dtype: int64

In [57]:
s.loc[3:6] = 777
s

0     10
1     15
2    888
3    777
4    777
5    777
6    777
7     36
8     41
dtype: int64

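The one thing to be careful about is changing a series' length! Assigning to existing index labels swaps the values of existing elements. Assigning with `.loc` to a label that isn't in the index doesn't raise an error; it adds a new element. Removing elements (for example, with `drop`) gives you back a new series, rather than shrinking the original.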
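
A short sketch of how length changes behave in practice, using a made-up three-element series. Assigning to an existing label swaps the value; assigning with `.loc` to a label that isn't in the index appends a new element instead of raising an error; and `drop` hands back a new, shorter series while leaving the original alone.

```python
import pandas as pd

s = pd.Series([10, 15, 28])

s.loc[1] = 999         # existing label: the value is replaced, length stays 3
s.loc[3] = 40          # new label: pandas adds an element, length becomes 4

shorter = s.drop(0)    # returns a NEW series without label 0; s itself is unchanged

print(len(s))          # 4
print(len(shorter))    # 3
```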

# Exercise: Retrieving + setting weather

1. Retrieve the forecast high temperature, using `.loc`, for the 3rd day in your series.
2. Retrieve the high temp for days 3, 5, and 7 in your series. What is the mean of those?
3. What is the lowest high temp expected in the latter half of your series?
4. Change the first value in the series to be the max value. Does that change the mean, the median, or both?

In [58]:
s

0     10
1     15
2    888
3    777
4    777
5    777
6    777
7     36
8     41
dtype: int64

In [59]:
s = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])

s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [60]:
# Retrieve the forecast high temperature, using .loc, for the 3rd day in your series.

s.loc[2]

np.int64(32)

In [62]:
# Retrieve the high temp for days 3, 5, and 7 in your series. What is the mean of those?

s.loc[[2, 4, 6]].mean()

np.float64(33.0)

In [65]:
# What is the lowest high temp expected in the latter half of your series?

s.loc[5:].min()

np.int64(30)

In [66]:
s.tail().min()

np.int64(30)

In [67]:
# Change the first value in the series to be the max value. Does that change the mean, the median, or both?

s

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [68]:
s.mean()

np.float64(32.0)

In [69]:
s.median()

np.float64(31.5)

In [70]:
s.loc[0] = s.max()
s

0    37
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [71]:
s.mean()

np.float64(33.0)

In [72]:
s.median()

np.float64(32.0)

# Broadcasting

In [73]:
# let's create a Python list

mylist = [10, 20, 30]

mylist + mylist 

[10, 20, 30, 10, 20, 30]

In [74]:
# what if I add 5 to mylist?

mylist + 5

TypeError: can only concatenate list (not "int") to list

In [75]:
s = pd.Series([10, 20, 30])

s + s  # we get a series back of the same length, with the values added on a per-index basis

0    20
1    40
2    60
dtype: int64

In [76]:
s + 5   # what will happen here?  

0    15
1    25
2    35
dtype: int64

"Broadcasting" means that if we have a series, and we apply a math operation to it, then the operation is applied to all of the values in the series, and we get a new series back whose index is identical, but whose values represent the operation's output.

In [77]:
s * 3

0    30
1    60
2    90
dtype: int64

In [78]:
s / 10

0    1.0
1    2.0
2    3.0
dtype: float64

In [79]:
s ** 4

0     10000
1    160000
2    810000
dtype: int64
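
Broadcasting, as described above, can be thought of as a loop that Pandas writes for us. The sketch below spells that loop out by hand and checks that it gives the same result as `s + 5`. It is only a mental model (the real work happens in NumPy's compiled code), and it also shows that the original series is left untouched.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

broadcast = s + 5

# Roughly what broadcasting does for us, spelled out by hand
by_hand = pd.Series([one_value + 5 for one_value in s],
                    index=s.index)

print(broadcast.equals(by_hand))    # True
print(s.tolist())                   # [10, 20, 30] -- the original is untouched
```

To keep the result, assign it to a variable, as in `broadcast = s + 5`.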

# Exercise: Convert from F -> C (or in the other direction)

Using your high-temperature series, create a new series with the temperatures in the other measurement system.

- F = (C*1.8) + 32
- C = (F-32) * 5/9
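
As a sketch of how these formulas combine with broadcasting, the two helper functions below (`c_to_f` and `f_to_c` are names invented here) work on a single number or on a whole series, because every operator inside them is broadcast. The Celsius values are made up, and rounding is used because floats don't always come back exact.

```python
import pandas as pd

def c_to_f(celsius):
    """Convert Celsius to Fahrenheit; works on a number or a whole series."""
    return celsius * 1.8 + 32

def f_to_c(fahrenheit):
    """Convert Fahrenheit to Celsius; works on a number or a whole series."""
    return (fahrenheit - 32) * 5 / 9

temps_c = pd.Series([0, 37, 100])    # made-up Celsius values
temps_f = c_to_f(temps_c)

print(temps_f.round(1).tolist())            # [32.0, 98.6, 212.0]
print(f_to_c(temps_f).round(1).tolist())    # [0.0, 37.0, 100.0] -- back where we started
```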

In [80]:
s

0    10
1    20
2    30
dtype: int64

In [81]:
high_temps = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30])


In [83]:
high_temps * 1.8 + 32

0    80.6
1    84.2
2    89.6
3    96.8
4    98.6
5    96.8
6    86.0
7    89.6
8    87.8
9    86.0
dtype: float64

In [88]:
import numpy as np

np.random.rand(10)  # this gives me 10 random numbers between 0 and 1

array([0.74851177, 0.28124988, 0.73443505, 0.16467216, 0.37422706,
       0.34021064, 0.11266713, 0.16065707, 0.3836314 , 0.12099456])

In [89]:
np.random.rand(10) * 100   # now, using broadcasting, I get them between 0-100!

array([37.96821267, 69.59860794,  7.52244048, 44.63886175, 98.9883686 ,
        2.79516043, 96.99143814, 54.57876132,  7.55605421, 71.8067026 ])

In [90]:
# create a series based on this NumPy array, and then use broadcasting

pd.Series(np.random.rand(10)) * 100

0    56.047383
1    35.752172
2    34.171235
3    58.544005
4    24.935056
5    94.864622
6    31.266376
7    92.177712
8    45.020467
9    38.975684
dtype: float64

# CM
Sorry, I am still a bit confused on the original vs the new series. In your definition of Broadcasting, it says " ... we get back a new series ...".  Unfortunately,  I am not set up to actually work the examples today, or I would test it. 

In [94]:
high_temps_f = high_temps * 1.8 + 32

high_temps_f

0    80.6
1    84.2
2    89.6
3    96.8
4    98.6
5    96.8
6    86.0
7    89.6
8    87.8
9    86.0
dtype: float64

In [95]:
high_temps

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [97]:
# create a series of random numbers from 0-100

s = pd.Series(np.random.rand(10)) * 100

s.describe()

count    10.000000
mean     46.681874
std      38.038156
min       4.101851
25%      19.113310
50%      30.247713
75%      86.177897
max      98.225961
dtype: float64

# Exercise:

1. Multiply your high-temp forecast series by a series of random numbers of the same length. Each random number should be between 0 and 10.
2. What are the mean and median of the result?

In [100]:
high_temps * pd.Series(np.random.rand(10)) * 10

0     77.233711
1     69.171719
2    218.062164
3     72.049774
4    283.622743
5    292.418819
6    294.553869
7      6.629147
8    190.053903
9    138.553970
dtype: float64

In [101]:
result = high_temps * pd.Series(np.random.rand(10)) * 10
result.describe()

count     10.000000
mean     165.681894
std       94.400841
min       25.083403
25%      108.925009
50%      140.473676
75%      223.862699
max      319.456991
dtype: float64

In [102]:
(high_temps * pd.Series(np.random.rand(10)) * 10).describe()

count     10.000000
mean     118.121951
std       91.450707
min        8.976001
25%       40.196066
50%      106.900405
75%      171.001494
max      264.390168
dtype: float64

# Next up

1. Mask/boolean arrays
2. Indexes

In [103]:
# 1. We've seen that we can use operators with broadcast

s = pd.Series([10, 15, 20, 30, 37, 38, 42, 45])
s

0    10
1    15
2    20
3    30
4    37
5    38
6    42
7    45
dtype: int64

In [104]:
s + 10

0    20
1    25
2    30
3    40
4    47
5    48
6    52
7    55
dtype: int64

In [105]:
# 2. We know that we can use fancy indexing to retrieve multiple values  -- that is, pass a list of indexes
# inside of [], rather than just one. I can do this with .loc (as well as plain ol' [])

s.loc[[2, 4, 6]]

2    20
4    37
6    42
dtype: int64

In [106]:
# 3. It turns out that if you don't pass a list of integers, but rather a list of boolean (True/False)
# values, you get a series back that matches the index/values of the original series *BUT* only where
# it got a True value. Wherever there's a False value, that element is silently dropped.

s.loc[ [True, False, True, False, True, False, True, False] ]

0    10
2    20
4    37
6    42
dtype: int64

In [107]:
s.loc[ [True, False, True, True, True, True, True, False] ]

0    10
2    20
3    30
4    37
5    38
6    42
dtype: int64

In [108]:
# this is a really tedious way to retrieve certain values!
# there is a better way -- using broadcasting along with comparison operators

s == 30  # this operation will be broadcast to every value in s... and we'll get a boolean back for each one

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7    False
dtype: bool

In [109]:
# always look *inside* of the [] first
#   that returns a boolean series
#   we apply it to s via loc as a mask index
#   we'll get back only those values == 30

s.loc[s == 30]   # notice that s is both inside the [], and outside of the []!

3    30
dtype: int64

In [111]:
s.loc[s > 30]   # get the values of s > 30

4    37
5    38
6    42
7    45
dtype: int64

In [112]:
s.loc[s > s.mean()]   # only return values of s that are greater than its mean!

3    30
4    37
5    38
6    42
7    45
dtype: int64

In [114]:
# what if I want only the odd numbers from s?

s.loc[s % 2 == 1]   # s % 2 is 1 for odd values and 0 for even ones; == 1 keeps the odd ones

1    15
4    37
7    45
dtype: int64
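
It can help to break a mask expression like that into its separate steps. This sketch names each intermediate result (the names `remainders`, `is_odd`, and `odds` are just for illustration), so that you can see the two broadcasts and then the filtering.

```python
import pandas as pd

s = pd.Series([10, 15, 20, 30, 37, 38, 42, 45])

remainders = s % 2          # broadcast: 0 for even values, 1 for odd ones
is_odd = remainders == 1    # broadcast again: a boolean series
odds = s.loc[is_odd]        # keep only the elements where the mask is True

print(remainders.tolist())  # [0, 1, 0, 0, 1, 0, 0, 1]
print(is_odd.tolist())      # [False, True, False, False, True, False, False, True]
print(odds.tolist())        # [15, 37, 45]
```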

# Exercise: Working with mask indexes

1. Find all of the high temperatures that are more than the mean.
2. Find all of the high temperatures that are more than the mean + 1 std.
3. Create a series whose values are the ages of people in your family. Find all of the ages that are below the mean.

In [116]:
high_temps.loc[high_temps > high_temps.mean()]

3    36
4    37
5    36
dtype: int64

In [117]:
high_temps.loc[high_temps > high_temps.mean() + high_temps.std()]

3    36
4    37
5    36
dtype: int64

In [118]:
ages = pd.Series([19, 22, 24, 53, 54])
ages

0    19
1    22
2    24
3    53
4    54
dtype: int64

In [120]:
# show the mean of ages
ages.mean()

np.float64(34.4)

In [121]:
# return a boolean series, where ages < the mean
ages < ages.mean()

0     True
1     True
2     True
3    False
4    False
dtype: bool

In [122]:
ages.loc[ages < ages.mean()]

0    19
1    22
2    24
dtype: int64

In [125]:
ages.sum() / 5  # this is the mean

np.float64(34.4)

In [126]:
ages.mean()

np.float64(34.4)

In [127]:
ages.median()

np.float64(24.0)

In [128]:
s = pd.Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [129]:
# I can set the index, so it's not 0-4!
# the index can be *ANY* values I want!
# I set it, when I create the series, with the "index" keyword argument

s = pd.Series([10, 20, 30, 40, 50],
              index=list('abcde'))
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [130]:
# I can now still retrieve via the index, if I use .loc!

s.loc['a']

np.int64(10)

In [131]:
s.loc['c']

np.int64(30)

In [132]:
s.loc[['a', 'c']]    # fancy index

a    10
c    30
dtype: int64

In [133]:
s.loc['b':'d']    # this retrieves a slice... up to and including, because it's with .loc

b    20
c    30
d    40
dtype: int64

In [134]:
# wait -- what if I still want to retrieve via the position?

s.loc[0]

KeyError: 0

In [135]:
# to retrieve via the position, we use .iloc

s.iloc[0]

np.int64(10)

In [136]:
s.iloc[2]

np.int64(30)
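
The label-vs-position distinction matters most when the labels are themselves integers. In this sketch, the index is a made-up, shuffled set of integers, so `.loc[0]` and `.iloc[0]` point at different elements.

```python
import pandas as pd

# Made-up series whose labels are integers, but not in the order 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 0, 1])

print(s.loc[0])     # 20 -- the value whose LABEL is 0
print(s.iloc[0])    # 10 -- the value at POSITION 0
```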

In [137]:
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [138]:
# what if the index repeats? 

s = pd.Series([10, 20, 30, 40, 50],
              index=list('abcab'))
s

a    10
b    20
c    30
a    40
b    50
dtype: int64

In [139]:
s.loc['a']

a    10
a    40
dtype: int64

In [140]:
s.loc['c']

np.int64(30)

In [141]:
s.loc['a'] = 999
s.loc['c'] = 888
s

a    999
b     20
c    888
a    999
b     50
dtype: int64

# CC

why am i getting different values from here

In [146]:
np.random.seed(0)
family = pd.Series(np.random.randint(low=30, high=40, size=9))
family

0    35
1    30
2    33
3    33
4    37
5    39
6    33
7    35
8    32
dtype: int64

In [147]:
family.loc[family > family.mean()] 

0    35
4    37
5    39
7    35
dtype: int64

In [148]:
np.random.seed(0)
family = pd.Series(np.random.randint(low=30, high=40, size=9))

family[family > family.mean()]

0    35
4    37
5    39
7    35
dtype: int64

In [149]:
s

a    999
b     20
c    888
a    999
b     50
dtype: int64

In [151]:
s.loc[s > 500]

a    999
c    888
a    999
dtype: int64

# Exercise: High temps

1. Re-create your series with 10 forecast high temps, but the index should consist of day names (Sun, Mon, Tue, etc.)
2. What is the mean temperature forecast for Tuesdays and Thursdays?
3. What is the max temperature found from position 3 until (not including) position 7?
4. What is the mean of all even temperatures?

In [152]:
high_temps

0    27
1    29
2    32
3    36
4    37
5    36
6    30
7    32
8    31
9    30
dtype: int64

In [153]:
high_temps = pd.Series([27, 29, 32, 36, 37, 36, 30, 32, 31, 30],
                       index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu'])
high_temps

Tue    27
Wed    29
Thu    32
Fri    36
Sat    37
Sun    36
Mon    30
Tue    32
Wed    31
Thu    30
dtype: int64

In [154]:
high_temps.describe()

count    10.000000
mean     32.000000
std       3.333333
min      27.000000
25%      30.000000
50%      31.500000
75%      35.000000
max      37.000000
dtype: float64

In [157]:
# What is the mean temperature forecast for Tuesdays and Thursdays?

high_temps.loc[['Tue', 'Thu']].mean()

np.float64(30.25)

In [159]:
# What is the max temperature found from position 3 until (not including) position 7?

high_temps.iloc[3:7]  # .loc is up to and including... but .iloc doesn't include the endpoint (as per Python tradition)

Fri    36
Sat    37
Sun    36
Mon    30
dtype: int64
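
A small sketch of the endpoint difference, using a made-up one-week series called `week` whose labels are all unique. The position slice stops before its endpoint, the label slice includes it, and chaining `.max()` onto either one turns the selection into a single answer.

```python
import pandas as pd

week = pd.Series([27, 29, 32, 36, 37, 36, 30],
                 index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon'])

by_position = week.iloc[3:6]        # positions 3, 4, 5 -- position 6 is excluded
by_label = week.loc['Fri':'Mon']    # labels Fri through Mon -- 'Mon' is included

print(by_position.index.tolist())   # ['Fri', 'Sat', 'Sun']
print(by_label.index.tolist())      # ['Fri', 'Sat', 'Sun', 'Mon']
print(by_position.max())            # 37
```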

In [163]:
# What is the mean of all even temperatures?

high_temps.loc[high_temps % 2 == 0].mean()

np.float64(32.666666666666664)

In [165]:
high_temps.loc[high_temps % 2 == 0].head(1)

Thu    32
dtype: int64

In [171]:
# we got, in that exercise, the records with even temps
# what if we wanted the even-numbered positions?

high_temps.iloc[[0, 2, 4, 6, 8]]

Tue    27
Thu    32
Sat    37
Mon    30
Wed    31
dtype: int64

In [172]:
high_temps.iloc[::2]   # this uses advanced slice syntax, where I say, "From the start, to the end, every other position"

Tue    27
Thu    32
Sat    37
Mon    30
Wed    31
dtype: int64

# Next up

- dtypes (what kinds of data can we store?)
- `NaN` (not a number)

In Python, a list can, in theory, contain any values we want. By convention, we put only one type of value in a list, whereas mixing types in a tuple is considered perfectly OK.

Regardless of what we want, a series can only contain values of ONE type. That type is known as the "dtype", and is usually one of the NumPy-defined types. We've already seen `np.int64` and `np.float64`. There are a bunch of other dtypes, too.
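
A quick sketch of what the one-type rule means when the input values aren't all the same type. The three series here are made up. Pandas picks a single dtype that can hold everything, which for truly mixed values is `object`, a catch-all that stores ordinary Python objects.

```python
import pandas as pd

all_ints = pd.Series([10, 20, 30])
ints_and_float = pd.Series([10, 20, 30.5])
mixed = pd.Series([10, 'twenty', 30.5])

print(all_ints.dtype)          # int64
print(ints_and_float.dtype)    # float64 -- the integers were converted to floats
print(mixed.dtype)             # object -- a catch-all for arbitrary Python objects
```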

In [174]:
s = pd.Series([10, 20, 30, 40, 50])
s.dtype

dtype('int64')

In [175]:
s = pd.Series([10, 20, 30, 40, 50], 
             dtype=np.int8)   # this means: 8-bit integers
s

0    10
1    20
2    30
3    40
4    50
dtype: int8
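
Why bother with a smaller dtype? The sketch below compares the memory used by the values of the same five numbers stored as `int64` and as `int8`, and uses NumPy's `iinfo` to show the price we pay: a much narrower range of values that can be stored.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30, 40, 50]

default = pd.Series(values)                 # int64 by default
small = pd.Series(values, dtype=np.int8)    # 8-bit integers

print(default.nbytes)    # 40 -- 5 values * 8 bytes each
print(small.nbytes)      # 5 -- 5 values * 1 byte each

limits = np.iinfo(np.int8)
print(limits.min, limits.max)    # -128 127
```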