# Agenda

1. Getting started
    - What is data analysis?
    - Python and Pandas
    - Descriptive statistics and aggregate methods
    - Setting and retrieving values
    - Broadcasting operations
    - Mask arrays
    - Indexes
    - Some additional useful methods
2. Data types and data frames
    - Data types and `NaN` ("not a number")
    - Data frames (2D data)
    - Adding and removing data
    - Retrieving data
    - Queries with mask indexes 
3. Real-world data
    - Working with CSV files
    - Sorting data
    - Grouping data
    - Pivot tables
    - Joining
4. Text and dates
    - Working with text data
    - Working with dates and times
    - Time series (where datetime values are our index)
5. Visualization
    - Plots via the Pandas interface
    - Scatter plots
    - What next?

# I'm using Jupyter

You can install Jupyter on your own computer if you have Python and Pandas -- just install it from PyPI. 

    pip install jupyter
    
If you don't have Jupyter on your computer, then you can use one of the online systems, such as Google Colab, PythonAnywhere, or Replit, or https://try.jupyter.org.

# Jupyter intro

Jupyter is divided into "cells," like the one I'm typing into right now. Cells can be in one of two modes at any time:

- Edit mode: When you type into it, the text appears (like right now). It has a green outline. Enter edit mode by clicking inside of the cell, or by pressing ENTER.
- Command mode: When you type, you're giving commands to the Jupyter notebook itself. It has a blue outline. Enter command mode by clicking to the left of the cell or by pressing ESC.

In command mode, you can type a bunch of keys and get Jupyter to do things:

- `h` -- help
- `c` -- copy the current cell
- `v` -- paste the cell you most recently copied or cut
- `x` -- cut the current cell
- `z` -- undo the last action
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one


# What is data? What is data analytics?

Data is everywhere in our world -- cellphones, computers, servers, vehicles, refrigerators.

How do we work with it? How can we use it to answer questions that are useful and/or interesting?

I say: Data science = data analytics + machine learning.

# (Thought) Exercise: Amazon's data

1. What sort of data does Amazon have?
2. What sorts of questions can Amazon ask of that data?
3. What sorts of things can Amazon do once they have answers to those questions?

# We have data. How can we analyze it?

There are many tools: SQL databases. NoSQL databases. Programming languages like R or Julia. Or even Java or C#.

Python has become the #1 language for data science over the last decade. That seems really weird!

- Python is not super efficient
- Python's numbers are not small (each one is a full object that takes up a fair amount of memory)

How did this happen? Answer: NumPy. NumPy is a library written in C, and thus runs at C speeds, *but* it has an interface in Python that lets us use it in Python.

We get the benefits of Python's ease of use, but C's efficiency.
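To make the point about the size of numbers concrete, here is a minimal sketch comparing one Python integer object with one element of a NumPy array. The variable names are mine, and the exact size of a Python `int` depends on your Python build (it is typically 28 bytes on 64-bit CPython), so treat the printed numbers as illustrative.

```python
import sys
import numpy as np

# size of a single Python int object, in bytes (varies by Python build)
py_int_bytes = sys.getsizeof(10)

# a NumPy array of 64-bit integers stores each element in exactly 8 bytes
arr = np.array([10, 20, 30, 40, 50], dtype=np.int64)
np_int_bytes = arr.itemsize

print(py_int_bytes, np_int_bytes)
```

The Python object carries extra bookkeeping along with the number itself, which is why it is so much larger than the 8 bytes NumPy needs per element.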

NumPy is a bit low level for many people to work with. Pandas is (mostly) a wrapper around NumPy that makes it easier to work with, and feels like a higher-level system.

Pandas is a Python package (on PyPI), and is the main way that people analyze data in the Python world nowadays.

In [2]:
# I can load Pandas into memory by saying

import pandas as pd      # import it, and assign it the alias "pd"
from pandas import Series, DataFrame   # these are useful shortcuts, so we don't have to say pd.Series / pd.DataFrame

In [3]:
# series

# a series is a Pandas data structure containing 1D data
# it's kind of, sort of, like a list  -- but it isn't one

s = Series([10, 20, 30, 40, 50])  # creating a series with a Python list

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s   # show me the series

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What are we seeing?

1. The series contains two parallel sets of elements:
    - The index, which currently contains the integers 0-4
    - The values, which currently contain the integers 10, 20, 30, 40, and 50
2. The series will always present itself with both index and values
3. At the bottom, we see the `dtype` of the series, i.e., the type of data that it contains. Here it's `int64`, which is a way of saying that every integer in our series is stored as a 64-bit integer.  This data type comes from NumPy.
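The three parts described above can also be inspected one at a time. This is a small sketch using the `index`, `values`, and `dtype` attributes of a series; it rebuilds the same series so that it can run on its own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

print(s.index)    # the labels, here the default 0-4
print(s.values)   # the underlying NumPy array of values
print(s.dtype)    # int64
```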

In [6]:
# what happens if I add the series to itself?
# I get a new series of the same length... at each index, the values have been added

s + s

0     20
1     40
2     60
3     80
4    100
dtype: int64

In [7]:
# I can retrieve values with []

s[2]

30

In [8]:
s[4]

50

In [9]:
s[100]

KeyError: 100

In [10]:
# some methods that I can call on my series

s.sum()

150

In [11]:
s.mean()

30.0

In [12]:
s.std()   # standard deviation

15.811388300841896

In [13]:
s.min()  

10

In [14]:
s.max()

50

In [15]:
s.median()

30.0

In [16]:
s.mode()

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [17]:
s2 = Series([10.5, 20.5, 30.5, 40.5, 50.5])

s + s2

0     20.5
1     40.5
2     60.5
3     80.5
4    100.5
dtype: float64

# Exercise: Weather

1. Find a Web site with your city's weather forecast for the next 10 days.
2. Create a series, called `lows`, and put the forecast low temperatures there.
3. Create a series, called `highs`, and put the forecast high temperatures there.
4. Subtract `lows` from `highs`, and show the difference in temperature for each day.

In [18]:
lows = Series([16, 16, 17, 15, 15, 16, 17, 17, 17, 18])
highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29])

highs - lows

0    13
1    17
2    10
3     8
4     9
5    13
6    13
7    13
8    12
9    11
dtype: int64

In [19]:
highs[0] - lows[0]

13

In [20]:
highs[1] - lows[1]

17

In [21]:
extra_long_highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29, 30, 30])

extra_long_highs - lows

0     13.0
1     17.0
2     10.0
3      8.0
4      9.0
5     13.0
6     13.0
7     13.0
8     12.0
9     11.0
10     NaN
11     NaN
dtype: float64
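The `NaN` values above appear because Pandas matches values by index label, not by position: labels 10 and 11 exist in only one of the two series, so there is nothing to subtract. Here is a minimal sketch of the same behavior with made-up numbers. It uses the `index=` argument, which is covered later, just to give the two series deliberately mismatched labels.

```python
from pandas import Series

a = Series([1, 2, 3], index=[0, 1, 2])
b = Series([10, 20, 30], index=[2, 1, 3])   # same length, different labels

total = a + b   # values are paired up by label, not by position

print(total)
print(total.isna().sum())   # how many labels had no partner in the other series
```

Labels 1 and 2 appear in both series, so they get real sums (22 and 13). Labels 0 and 3 appear in only one, so they come back as `NaN`. Since `NaN` is a float, the whole result becomes `float64`.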

To create a series, I need to do several things:

1. Import Pandas
2. I need a Python list of values (for now, just integers)
3. I need to call `Series` and pass it the Python list

Notice that

- `Series` has a capital `S`!
- If you want to use `Series` by itself, you need to have the line `from pandas import Series`. Otherwise, you need to say `pd.Series`.

In [23]:
s = Series([10, 20, 30])

In [24]:
s

0    10
1    20
2    30
dtype: int64

# Next up:

1. Descriptive statistics
2. Five-number summary
3. Mean and standard deviation
4. Setting and retrieving values
5. `.loc`
6. Fancy indexing
7. Slicing



In [26]:
# I want to create a series of random numbers
# the easiest way to do that is with NumPy

import numpy as np                          # this is the standard way to do it
np.random.seed(0)                           # this sets the random-number system to a known point
s = Series(np.random.randint(0, 100, 10))   # give me 10 random integers, each 0-99 (the upper bound, 100, is excluded)

s   # this series contains 10 random ints, all 0-99

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [27]:
# let's get some descriptive statistics -- numbers that describe our series, so that we can better
# understand it.

s.mean()   # get the average

52.5

In [28]:
s.std()    # standard deviation

25.67424130654432

In [29]:
# often, I'll want all of these descriptive statistics together
# for that reason, it's great that Pandas provides a "describe" method

s.describe()

count    10.000000
mean     52.500000
std      25.674241
min       9.000000
25%      38.000000
50%      55.500000
75%      67.000000
max      87.000000
dtype: float64

In [30]:
# if I want the quantiles, I can call a method for that

s.quantile(0.25)   # this is the 25% mark

38.0

In [31]:
s.quantile(0.5)  # this is the median; you can also call s.median()

55.5

In [32]:
s.median()

55.5
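If you want the whole five-number summary (minimum, 25%, median, 75%, maximum) in one call, one option is to pass a list of quantiles to `quantile`. This sketch types in the values of the series above by hand, so it doesn't depend on the random seed.

```python
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

# 0 is the minimum, 1 is the maximum, and 0.5 is the median
five = s.quantile([0, 0.25, 0.5, 0.75, 1])

print(five)
```

The result is itself a series, whose index contains the quantiles we asked for.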

# Exercise: Weather statistics

1. Find the mean high temperature in your city over the next 10 days.
2. Find the median high temperature.
3. Explain, looking at the full series of high temperatures, why this makes sense. (Why are they close, or not?)
4. Get the mean difference between the high and low temps.

In [33]:
highs

0    29
1    33
2    27
3    23
4    24
5    29
6    30
7    30
8    29
9    29
dtype: int64

In [34]:
highs.mean()

28.3

In [35]:
highs.median()

29.0

In [36]:
highs.describe()

count    10.000000
mean     28.300000
std       2.945807
min      23.000000
25%      27.500000
50%      29.000000
75%      29.750000
max      33.000000
dtype: float64

In [39]:
(highs - lows).mean()   # what is the average difference in temperature between highs and lows?

11.9

# Retrieving from our series

We've already seen that we can retrieve from our series with `[]`, just like a list (or tuple or dict).

It turns out that there are more sophisticated ways to retrieve from a series than just that.

In [40]:
# first: I want to see the value at index 5
s[5]

9

In [41]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [42]:
# I want to see the value at index 1
s[1]

47

In [44]:
# I want to see the values at indexes 5 and 1, in that order

# "fancy indexing"

s[ [5, 1] ]  # outer [] mean: I want to retrieve something; inner [] mean: here is a list of the indexes I want

5     9
1    47
dtype: int64

In [46]:
# slice

s[3:7]  # slices give us results: starting_at:ending_at -- from 3 until (not including) 7

3    67
4    67
5     9
6    83
dtype: int64

In [47]:
# don't use []!
# instead, use `.loc[]`.

s.loc[3]

67

In [49]:
s.loc[8]

36

In [50]:
s.loc[8] = 2345  # we can assign to a series via .loc!
s

0      44
1      47
2      64
3      67
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64

In [51]:
# I can use fancy indexing to assign to more than one element

s.loc[ [2, 3] ] = 10   # this assigns 10 to the values at indexes 2 and 3


In [52]:
s

0      44
1      47
2      10
3      10
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64

In [54]:
s   # my series is..a series, containing many values. 

0      44
1      47
2      10
3      10
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64

In [56]:
s.loc[7]    # this returns a single value

21

# Exercise: Weather

1. Retrieve the high-temp values at indexes 2, 5, and 7.
2. Find the mean of these temps.
3. Assign the mean of these temps back to the indexes themselves.

In [57]:
highs

0    29
1    33
2    27
3    23
4    24
5    29
6    30
7    30
8    29
9    29
dtype: int64

In [63]:
highs[ [2,5,7] ]

2    27
5    29
7    30
dtype: int64

In [59]:
highs[ 2 ]  # I asked for a single index, and got one scalar integer back

27

In [62]:
highs[ [2] ]  # I asked for a list of indexes, but that list contains a single integer... I get back a series

2    27
dtype: int64

In [64]:
highs[ [2,5,7] ].mean()

28.666666666666668

In [65]:
highs[[2,5,7]]

2    27
5    29
7    30
dtype: int64

In [66]:
highs.mean()

28.3

In [67]:
# assign highs.mean() into those three indexes

highs[[2,5,7]] = highs.mean()

In [68]:
highs

0    29.0
1    33.0
2    28.3
3    23.0
4    24.0
5    28.3
6    30.0
7    28.3
8    29.0
9    29.0
dtype: float64

# Broadcasting

We've seen already that if I have two series, and I run an arithmetic operation on them, then Pandas will try to match up the indexes on the two, and will execute that operator on each pair of parallel values.

What if I have a series and a single, scalar value?  Can I run a mathematical operation on them together?

Yes - this is broadcasting, where the operator is applied to the scalar value and each value in the series, one at a time.

In [70]:
s = Series([10, 20, 30, 40, 50])  

s + 3   # we get back a series

0    13
1    23
2    33
3    43
4    53
dtype: int64

In [71]:
s - 3

0     7
1    17
2    27
3    37
4    47
dtype: int64

In [72]:
s * 3

0     30
1     60
2     90
3    120
4    150
dtype: int64

In [73]:
s / 3

0     3.333333
1     6.666667
2    10.000000
3    13.333333
4    16.666667
dtype: float64

In [74]:
s ** 3  # s to the 3rd power


0      1000
1      8000
2     27000
3     64000
4    125000
dtype: int64

In [75]:
s % 3  # asking for the remainder from this division

0    1
1    2
2    0
3    1
4    2
dtype: int64

# Exercises: Weather and broadcasting

If you entered the temps in C, convert to Fahrenheit, and vice versa.


In [76]:
highs

0    29.0
1    33.0
2    28.3
3    23.0
4    24.0
5    28.3
6    30.0
7    28.3
8    29.0
9    29.0
dtype: float64

In [78]:
# conversion formula: c * (9/5) + 32

highs * (9/5) + 32

0    84.20
1    91.40
2    82.94
3    73.40
4    75.20
5    82.94
6    86.00
7    82.94
8    84.20
9    84.20
dtype: float64
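For the "vice versa" part of the exercise, the conversion can be run in the other direction with broadcasting, too. This is a sketch using the original integer temperatures; `highs_c`, `highs_f`, and `back_to_c` are names I've made up for the example.

```python
from pandas import Series

highs_c = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29])

highs_f = highs_c * (9/5) + 32       # Celsius -> Fahrenheit
back_to_c = (highs_f - 32) * (5/9)   # Fahrenheit -> Celsius

print(back_to_c)
```

Because these are floating-point calculations, the round trip may differ from the original values by a tiny amount.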

In [79]:
highs

0    29.0
1    33.0
2    28.3
3    23.0
4    24.0
5    28.3
6    30.0
7    28.3
8    29.0
9    29.0
dtype: float64

In [80]:
highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29])
highs

0    29
1    33
2    27
3    23
4    24
5    29
6    30
7    30
8    29
9    29
dtype: int64

In [81]:
highs + 0.5

0    29.5
1    33.5
2    27.5
3    23.5
4    24.5
5    29.5
6    30.5
7    30.5
8    29.5
9    29.5
dtype: float64

In [82]:
# we've seen how we can generate random integers

np.random.seed(0)
s = Series(np.random.randint(0, 100, 20))

s


0     44
1     47
2     64
3     67
4     67
5      9
6     83
7     21
8     36
9     87
10    70
11    88
12    88
13    12
14    58
15    65
16    39
17    87
18    46
19    88
dtype: int64

In [84]:
# what if I want random floating point numbers?
# then I have to use np.random.rand, which returns as many floats
# as we want between 0-1.

# we can multiply the resulting series by any number we want, typically a power of 10

s = Series(np.random.rand(10)) * 100
s  # now they're all between 0-100

0    52.047748
1    67.887953
2    72.063265
3    58.201979
4    53.737323
5    75.861562
6    10.590761
7    47.360042
8    18.633234
9    73.691818
dtype: float64

# Exercise: Random numbers

1. Define a series containing 10 random floats between 0-1,000.
2. Find whether the mean and median are close to one another.
3. Modify 3 floats to be very big or very small.
4. How much did this affect the mean? How much the median?

In [87]:
np.random.seed(0)
s = Series(np.random.rand(10) * 1000)  # here, we indicate how many we want (10), then multiply by the factor
s

0    548.813504
1    715.189366
2    602.763376
3    544.883183
4    423.654799
5    645.894113
6    437.587211
7    891.773001
8    963.662761
9    383.441519
dtype: float64

In [88]:
s.describe()

count     10.000000
mean     615.766283
std      194.453613
min      383.441519
25%      464.411204
50%      575.788440
75%      697.865553
max      963.662761
dtype: float64

In [89]:
s.loc[  [3, 5, 7]  ] = 1.0
s

0    548.813504
1    715.189366
2    602.763376
3      1.000000
4    423.654799
5      1.000000
6    437.587211
7      1.000000
8    963.662761
9    383.441519
dtype: float64

In [90]:
s.describe()

count     10.000000
mean     407.811254
std      326.523415
min        1.000000
25%       96.610380
50%      430.621005
75%      589.275908
max      963.662761
dtype: float64
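The same effect is easier to see with small, fixed numbers. In this sketch (values chosen by me for illustration), a single extreme value drags the mean far away, while the median doesn't move at all.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])
mean_before = s.mean()       # 30.0
median_before = s.median()   # 30.0

s.loc[4] = 5000              # replace the largest value with an outlier
mean_after = s.mean()        # 1020.0
median_after = s.median()    # still 30.0

print(mean_before, mean_after)
print(median_before, median_after)
```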

In [91]:
s = Series([10, 20, 30, 40, 50])

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [92]:
# let's modify one of the elements

s.loc[3] = 12.34  # this is clearly a floating-point number
s

0    10.00
1    20.00
2    30.00
3    12.34
4    50.00
dtype: float64

In [93]:
s = Series([10, 20, 30, 40, 50])
s.loc[3] = 12.00   # this is a float, but it loses nothing in becoming an int

s

0    10
1    20
2    30
3    12
4    50
dtype: int64
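Depending on your version of Pandas, assigning a float into an integer series (as in the `12.34` example above) may produce a warning rather than a silent conversion. If you know you'll need floats, one option is to convert the series explicitly first with `astype`. A minimal sketch:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50]).astype('float64')   # convert before assigning

s.loc[3] = 12.34   # no dtype change is needed now; the series already holds floats

print(s)
```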

# Next up

1. Conditionals
2. Mask indexes
3. Combining conditionals
4. Indexes
    - Setting
    - Retrieving

In [94]:
# Conditionals

# In Python, we check things with == , and get back a True/False value
# in Pandas, we want to check all of the values in a series with == (or another comparison operator)

# it's going to work with *broadcasting*



In [95]:
s = Series([10, 20, 30, 40, 50])

s == 30   # the result of running this broadcast comparison is a boolean series!

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [96]:
s < 30

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [97]:
s < s.mean()   # first, calculate s.mean() and get its number. Then compare everything in s with that number

0     True
1     True
2    False
3    False
4    False
dtype: bool

# Boolean indexes aka mask indexes

We've already seen that we can pass a list of integers to `.loc[]`, and get back multiple values. That's known as "fancy indexing."

If, however, we pass a list of *booleans* to `.loc[]`, then that list (or series) functions as a mask, as a filter.  Any elements of the series that correspond to `True` values are returned. Those corresponding to `False` values are ignored/dropped.

In [98]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [99]:
s.loc[ [True, False, False, True, True]  ]   # mask index

0    10
3    40
4    50
dtype: int64

# Mask indexes, the grownup version

1. In order to filter elements of a series, we can use a boolean list/series. Only those items corresponding to `True` values will be returned.
2. We can create a boolean series by running a comparison operator against the series. Only those items for which the comparison is `True` will be returned.

In [100]:
s < 30

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [101]:
s.loc[s < 30]  # show me the elements of s that are < 30

0    10
1    20
dtype: int64

In [102]:
s.loc[s < s.mean()]  # show me the elements of s that are < s.mean()

0    10
1    20
dtype: int64

In [103]:
s % 2 == 0    # which elements of s are even? That is, dividing by 2 and comparing the remainder to 0 is True?

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [105]:
s.loc[s%2 == 0]  # actually, they're all even!

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [106]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

s.loc[s%2 == 0]  # get even numbers in s

0    44
2    64
8    36
dtype: int64
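A mask index works for assignment as well as retrieval, in the same way that we assigned via `.loc` with a list of indexes earlier. This sketch, with values of my own choosing, sets every element below 30 to 0:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

s.loc[s < 30] = 0   # only the elements where the mask is True are changed

print(s)
```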

# Exercises: Retrieving values

1. Create a series of 50 floats between 0-1,000.
2. Find those numbers that are even.
3. Find those numbers that are less than the mean, and get their descriptive statistics.
4. Find those numbers that are less than the mean - 1 standard deviation.

In [107]:
np.random.seed(0)
s = Series(np.random.rand(50) * 1000)

s

0     548.813504
1     715.189366
2     602.763376
3     544.883183
4     423.654799
5     645.894113
6     437.587211
7     891.773001
8     963.662761
9     383.441519
10    791.725038
11    528.894920
12    568.044561
13    925.596638
14     71.036058
15     87.129300
16     20.218397
17    832.619846
18    778.156751
19    870.012148
20    978.618342
21    799.158564
22    461.479362
23    780.529176
24    118.274426
25    639.921021
26    143.353287
27    944.668917
28    521.848322
29    414.661940
30    264.555612
31    774.233689
32    456.150332
33    568.433949
34     18.789800
35    617.635497
36    612.095723
37    616.933997
38    943.748079
39    681.820299
40    359.507901
41    437.031954
42    697.631196
43     60.225472
44    666.766715
45    670.637870
46    210.382561
47    128.926298
48    315.428351
49    363.710771
dtype: float64

In [110]:
# how many are even? None! (random floats have fractional parts, so they almost never divide evenly by 2)

s.loc[s % 2 == 0]

Series([], dtype: float64)

In [113]:
# Find those numbers that are less than the mean, and get their descriptive statistics.

s.loc[ s < s.mean() ].describe()

count     22.000000
mean     283.013118
std      173.941680
min       18.789800
25%      120.937394
50%      337.468126
75%      433.687665
max      528.894920
dtype: float64

In [116]:
# 4. Find those numbers that are less than the mean - 1 standard deviation.

s.loc[s < s.mean() - s.std()]

14     71.036058
15     87.129300
16     20.218397
24    118.274426
26    143.353287
34     18.789800
43     60.225472
46    210.382561
47    128.926298
dtype: float64

# Can we combine conditionals?
# Yes!

In Pandas, if we want to combine two conditions, we can do that, but *not* by using the standard Python boolean operators, `and`, `or`, and `not`. Rather, in Pandas, we'll use the symbols `&`, `|`, and `~`. (These are in Python for use as bitwise operators, but they're borrowed by Pandas, too.)

If I have two conditions, and I want to find those elements that meet both of their criteria, I can `&` them together. Note that each of the condition clauses should be in `()` to avoid parsing problems.

In [120]:
# Let's find all even numbers less than the mean

np.random.seed(0)
s = Series(np.random.randint(0, 1000, 20))

s.loc[(s < s.mean()) 
    & (s % 2 == 0)]

3     192
14     70
15    472
17    396
18    314
dtype: int64
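The example above uses `&`. Here is a short sketch of the other two operators, `|` ("or") and `~` ("not"), on a small series with values I picked for illustration. As before, each condition goes in its own `()`.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

# elements that are either below 20 or above 40
either = s.loc[(s < 20) | (s > 40)]

# elements that are NOT below 30
not_small = s.loc[~(s < 30)]

print(either)
print(not_small)
```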

# Indexes in Pandas

In regular Python data structures, indexes aren't anything special. They start at 0, and go up to the length of the data structure (string, list, tuple) - 1.

In Pandas, we can assign *any* values to our series index. They can be integers, they can be strings, and they can be tuples.

Some questions:
1. How to set this when we create a series?
2. How to modify this after a series is created?
3. How to use it for setting and retrieving?

In [121]:
s = Series([10, 20, 30, 40, 50],
          index=list('abcde'))   # this trick creates ['a', 'b', 'c', 'd', 'e']

s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [122]:
# retrieve via the index!

s.loc['a']

10

In [123]:
s.loc['e']

50

In [124]:
s.loc[['a', 'e']]  # fancy indexing!

a    10
e    50
dtype: int64

In [126]:
# slice
s.loc['b':'d']  # when using .loc and slices, we get up to *AND INCLUDING* the endpoint

b    20
c    30
d    40
dtype: int64

In [127]:
# how about updating/changing the index?

# retrieve the index from a series with .index
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [128]:
# set the index by assigning a new list or series of the right length
s.index = 'this is a new index'.split()

In [129]:
s

this     10
is       20
a        30
new      40
index    50
dtype: int64

In [130]:
s.loc['a']

30

In [132]:
s.loc['new']

40

In [133]:
# index elements can repeat!

s = Series([10, 20, 30, 40, 50],
          index=list('abcab'))

s

a    10
b    20
c    30
a    40
b    50
dtype: int64

In [134]:
# retrieve items with index 'a'
s.loc['a']

a    10
a    40
dtype: int64

In [135]:
# retrieve items with index 'c'
s.loc['c']

30

In [136]:
s = Series([10, 20, 30, 40, 50],
          index=list('abcba'))

s

a    10
b    20
c    30
b    40
a    50
dtype: int64

In [137]:
s.loc['a':'c']

KeyError: "Cannot get left slice bound for non-unique label: 'a'"
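The slice fails because the repeated labels aren't next to one another, so Pandas can't tell where `'a'` begins. One possible workaround, sketched here, is to sort the series by its index first with `sort_index` (sorting is covered in more depth later). Once identical labels are grouped together, the slice works.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50],
           index=list('abcba'))

sorted_s = s.sort_index()        # index is now a, a, b, b, c
subset = sorted_s.loc['a':'c']   # with a sorted index, the slice works

print(subset)
```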

# What if I want to ignore the index, and get items positionally?

For that, we have the `.iloc` attribute.  `.loc` uses the index, but `.iloc` uses the position, starting at position 0.

In [138]:
s.iloc[2]

30

In [141]:
s.iloc[4]

50
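The difference between `.loc` and `.iloc` is easiest to see when the index contains integers that don't match the positions. In this sketch (index values chosen by me to be deliberately out of order), the same number `0` retrieves two different elements:

```python
from pandas import Series

s = Series([10, 20, 30],
           index=[2, 0, 1])

by_label = s.loc[0]       # the element whose index label is 0
by_position = s.iloc[0]   # the element in the first position

print(by_label, by_position)
```

Here `.loc[0]` returns 20, because 20 is stored under the label 0, while `.iloc[0]` returns 10, because 10 comes first.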

# Exercise: Weather days

1. Define a series with the 10 days of high temperatures forecast for your city. The values should be the temps, and the index should contain the names of the days (Tuesday, Wednesday, etc.).
2. What is the average temperature forecast for Wednesdays?
3. What is the average temperature forecast for Tuesdays and Thursdays?
4. What is the mean of the items with even-numbered positional indexes?


In [None]:
highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29],
              index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())   # assumption: the forecast starts on a Tuesday
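Here is one possible way to work through this exercise. It is a sketch, not the only solution. It assumes the forecast starts on a Tuesday and uses abbreviated day names (my choice), so adjust the index to match your own data.

```python
from pandas import Series

# assumption: the 10-day forecast starts on a Tuesday
highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29],
               index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())

# repeated index labels: .loc['Wed'] returns every Wednesday
wed_mean = highs.loc['Wed'].mean()

# fancy indexing with two labels returns all Tuesdays and Thursdays
tue_thu_mean = highs.loc[['Tue', 'Thu']].mean()

# .iloc ignores the labels and works with positions
even_pos_mean = highs.iloc[[0, 2, 4, 6, 8]].mean()

print(wed_mean, tue_thu_mean, even_pos_mean)
```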