# Agenda

1. Getting started with data analysis
    - Jupyter (as a tool)
    - What is data analysis?
    - Pandas -- what is it?
    - Pandas series
    - Descriptive statistics
    - Aggregate methods
    - Setting and retrieving values
    - Broadcasting
    - Mask arrays
2. Data frames (for 2D data)
    - Indexes vs. columns
    - Retrieving
    - Setting
    - Querying with mask arrays
    - Reading CSV data from external files
3. Real-world data
    - Sorting
    - Grouping
    - Cleaning
    - Pivot table
4. Text and time
    - Working with text data
    - Textual statistics
    - Dates and times
    - Time deltas
    - Time series
    - Resampling
5. Visualization
    - Line plots
    - Bar plots
    - Pie plots
    - Scatter plots


# Data analysis -- what is it?

1. Collecting data
2. Understanding the data
3. Making decisions based on that data

# Where does data come from?

- Phones
- Vehicles
- Temperature sensors

# Data science -- what is it?

1. Data analytics -- take data from the past, and understand it
2. Data engineering -- the logistics of working with data, even when it's huge or otherwise hard to work with
3. Machine learning -- take the data from the past, and predict the future with it

Data science is these three things together, coupled with a scientific approach -- we ask a question, and we try to get an answer that we believe is justifiable with the data.

# Why are we using Python?

Python is known as a language that uses lots of memory and is slow to execute.

Data analytics needs to work with lots of data, and we want fast results.

The answer: NumPy, which provides C-style data and speed, but Python-style access.

NumPy is a little low level for many people. That's where Pandas comes in.  Pandas is a high-level data analysis system that uses NumPy behind the scenes (for speed and efficiency) but gives us very easy analysis in Python.

Pandas is sort of like Excel, inside of Python.

# What Python will we *not* be using?

- `if`/`else`
- `for` or `while` loops
- `def` to define new functions

# What will we be using?

- Basic data structures (ints, floats, strings, and lists) to communicate with Pandas.
- Comparison operators (e.g., `==` and `<`)
- Variable assignment
- Pandas object methods and constructors

# Quick Jupyter tutorial

Jupyter is the tool I'm using to type this, and which I'll use for the entire course. It lets us run code and write documentation in the browser. Jupyter is super popular among data scientists.

Learn more at https://Jupyter.org/.

Every rectangle in Jupyter is known as a "cell." To execute a cell, you press shift+Enter.




I just pressed Enter a few times, and moved down a few lines.  But if I press shift+Enter...

# Jupyter has two modes

- Edit mode -- the outline of the cell is green. Enter edit mode by clicking inside of the cell or pressing ENTER. You can then type to enter text.
- Command mode -- the outline of the cell is blue. Enter command mode by clicking to the left of the cell or pressing ESC. Anything you type is interpreted as a command to Jupyter.

When you're in command mode, here are some commands you can use:
- `h` -- help
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the copied (or cut) cell below the current one
- `m` -- set cell type to Markdown (for documentation)
- `y` -- set cell type to Python code (for coding, this is the default)
- `a` -- create a new cell above the current one
- `b` -- create a new cell below the current one

In [1]:
print('Hello!')

Hello!


In [2]:
# to use Pandas, I have to load it!

import numpy as np           # easier access to NumPy things (the lower-level library)
import pandas as pd          # import the Pandas library, but give it an alias, "pd"
from pandas import Series    # this way, we don't have to say pd.Series

In [3]:
x = 1   # regular Python code

In [4]:
# how much memory does that 1 take up?
import sys
sys.getsizeof(x)   # how many bytes does this 1 take?

28
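
To put that 28 in perspective, here's a small sketch comparing one Python integer with the space a NumPy array uses per value. The names `python_int_bytes`, `arr`, `bytes_per_element`, and `total_bytes` are made up for this illustration, and the exact `sys.getsizeof` result can differ between Python versions and platforms.

```python
import sys
import numpy as np

x = 1
python_int_bytes = sys.getsizeof(x)     # size of one Python int object, in bytes

# a NumPy array of five 64-bit integers
arr = np.array([10, 20, 30, 40, 50], dtype=np.int64)
bytes_per_element = arr.itemsize        # 8 bytes (64 bits) per value
total_bytes = arr.nbytes                # 5 values * 8 bytes = 40 bytes of data

print(python_int_bytes, bytes_per_element, total_bytes)
```

The array has some fixed overhead of its own, so the savings only show up when there are many values.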

In [5]:
# I create a Pandas series
# I use the name Series that I just imported
# I hand Series a Python list of integers
# behind the scenes, Pandas turns this into a NumPy array, and assigns the series to the variable s

s = Series([10, 20, 30, 40, 50])

In [6]:
# what kind of data does s contain?
type(s)

pandas.core.series.Series

In [7]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What is the `dtype`?

In pure Python, we have regular integers. We don't need to think about how many bits our integer contains. But in Pandas, because we're using these low-level C types, we need to think about how many bits it contains.

By default, integers in Pandas are all 64-bit ints.  We see that, whenever we have data in a series.
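
If 64 bits is more than the data needs, we can ask for a different dtype. This is a minimal sketch, assuming values small enough to fit in the chosen type; `small` and `converted` are names I made up for the example.

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50], dtype=np.int64)      # ask for 64-bit ints explicitly
small = Series([10, 20, 30, 40, 50], dtype=np.int8)   # 8-bit ints: values from -128 to 127
converted = s.astype(np.int32)     # astype returns a new series; s itself is unchanged

print(s.dtype, small.dtype, converted.dtype)
```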

In [8]:
# what operations can I run on s?

s[0]  # get the first item

10

In [9]:
s[1]   # get the second item

20

In [10]:
len(s)   # how many items are in s?

5

In [13]:
# can I modify one of these items using the index? Yes!

s[2] = 999

In [14]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [15]:
s[-1]    # normally, Python allows us to retrieve the final element with index -1 ... but not here!

KeyError: -1

In [16]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [17]:
s2 = Series([2, 4, 6, 8, 10])

In [18]:
# what happens if I add s and s2?

s + s2

0      12
1      24
2    1005
3      48
4      60
dtype: int64

# What happened here?

When you add two series together, Pandas gives you a new series back.

That series is of the same length as the two series you added, and the value at each index is the result of adding the two inputs at that index.

Any math operation you run on two series will actually be run on each index.
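
"At each index" means that Pandas pairs values by their index labels, not by where they happen to sit in the series. The sketch below uses two made-up series, `a` and `b`, whose labels are the same but in a different order.

```python
from pandas import Series

a = Series([10, 20, 30], index=['x', 'y', 'z'])
b = Series([1, 2, 3], index=['z', 'y', 'x'])   # same labels, different order

total = a + b    # values are paired by label: x with x, y with y, z with z
print(total)
```

With the default index (0, 1, 2, ...), matching by label and matching by position give the same answer, which is why we haven't noticed the difference so far.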

In [19]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [20]:
s2

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [21]:
s + s2

0      12
1      24
2    1005
3      48
4      60
dtype: int64

In [22]:
s - s2

0      8
1     16
2    993
3     32
4     40
dtype: int64

In [23]:
s * s2

0      20
1      80
2    5994
3     320
4     500
dtype: int64

In [24]:
s / s2

0      5.0
1      5.0
2    166.5
3      5.0
4      5.0
dtype: float64

In [25]:
s // s2    # this returns integer division, ignoring any remainder

0      5
1      5
2    166
3      5
4      5
dtype: int64

In [26]:
s ** s2    # exponentiate 

0                   100
1                160000
2    994014980014994001
3         6553600000000
4     97656250000000000
dtype: int64

In [27]:
s % s2   # modulus

0    0
1    0
2    3
3    0
4    0
dtype: int64

In [28]:
s2

0     2
1     4
2     6
3     8
4    10
dtype: int64

In [29]:
s2[3] = 8.5

In [30]:
s2

0     2.0
1     4.0
2     6.0
3     8.5
4    10.0
dtype: float64

# Lists vs. series

A Python list traditionally contains only one type of value. But that's a convention; lists can contain any number of values, and each value can be of any type. 

In Pandas, a series cannot contain more than one type. Every single value must be of the same dtype. This means that if you assign a float to one element of a series that previously contained only integers, the dtype will change to reflect that.
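
Some newer Pandas releases warn when an assignment would silently change a series's dtype like this, so it can be clearer to convert first. A minimal sketch, assuming you know ahead of time that you'll need floats; `as_floats` is a name I made up.

```python
from pandas import Series

s2 = Series([2, 4, 6, 8, 10])
as_floats = s2.astype('float64')    # explicit conversion, returned as a new series
as_floats[3] = 8.5                  # the float now fits without changing the dtype

print(as_floats.dtype)
```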

In [31]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [32]:
s ** 100000  # this goes far beyond the ability of 64-bit integers!

0                      0
1                      0
2   -2115034505054000383
3                      0
4                      0
dtype: int64
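
Notice that we didn't get an error, just wrong numbers. Here's a smaller sketch of the same problem, comparing a plain Python int (which grows as needed) with a 64-bit integer in a series. The names `biggest`, `python_result`, `wrapped`, and `as_float` are made up for the example.

```python
import numpy as np
from pandas import Series

biggest = np.iinfo(np.int64).max      # largest value a 64-bit integer can hold
python_result = 10 ** 19              # plain Python ints grow as needed

s = Series([10], dtype=np.int64)
wrapped = s ** 19                     # too big for int64, so the stored value is wrong
as_float = s.astype('float64') ** 19  # floats trade exactness for a much larger range

print(biggest, python_result, wrapped[0], as_float[0])
```

If you expect very large results, converting to floats first is one option, as long as you can live with approximate values.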

# Exercise: Weather forecast

1. Create a series with 10 elements, the high temperatures forecast for your city in the next 10 days.
2. Create a second series with 10 elements, the low temperatures forecast for your city in the next 10 days.
3. Find out the difference between the high and low temps that your city will experience.

In [34]:
highs = Series([19, 19, 23, 27, 27, 25, 21, 18, 19, 20])
lows = Series([10, 9, 8, 10, 13, 11, 14, 11, 11, 11])

In [35]:
highs - lows

0     9
1    10
2    15
3    17
4    14
5    14
6     7
7     7
8     8
9     9
dtype: int64

In [36]:
highs - s   # what happens if we subtract two series of different lengths?

0      9.0
1     -1.0
2   -976.0
3    -13.0
4    -23.0
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
dtype: float64
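
Pandas matched the values by index. Indexes 5 through 9 exist only in `highs`, so there was nothing to subtract, and the result there is `NaN` ("not a number"). Because `NaN` is a float, the whole result becomes `float64`. If you'd rather treat the missing values as 0, the `.sub` method accepts a `fill_value`. The sketch below assumes that 0 is a sensible stand-in, which won't always be true.

```python
from pandas import Series

highs = Series([19, 19, 23, 27, 27, 25, 21, 18, 19, 20])
s = Series([10, 20, 999, 40, 50])

diff = highs - s                      # indexes 5-9 are missing from s, so we get NaN
filled = highs.sub(s, fill_value=0)   # treat a missing value as 0 before subtracting

print(diff.isna().sum())    # how many results are NaN
print(filled)
```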

# Next up

- Methods we can run on our series
- Aggregate methods
- Descriptive statistics

# Methods in Python

Methods are object-oriented functions -- we can invoke them via a `.` and then the name of the method.

Some examples from regular Python:

```python
name = 'reuven'
name.upper()     # this returns a new string, 'REUVEN', based on name
```

# Methods in Pandas

We can invoke a large number of different methods on a series. If you're in Jupyter, then you can always see what methods are available on an object by naming the object, writing a `.`, and then pressing `TAB`.

In [37]:
s

0     10
1     20
2    999
3     40
4     50
dtype: int64

In [39]:
# some methods we'll want to use

s.count()   # returns the number of values in our series (ignoring NaN)

5

In [41]:
# another way to find out how many values are in our series is the .size attribute

s.size    # how many values are there, including NaN?  NOTICE -- NOT A METHOD

5
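
Here both gave us 5, because `s` has no missing values. The difference only shows up when the series contains `NaN`. A small sketch with a made-up series, `readings`:

```python
import numpy as np
from pandas import Series

readings = Series([10, np.nan, 30, np.nan, 50])   # two values are missing

print(readings.count())   # 3 -- NaN values are skipped
print(readings.size)      # 5 -- every slot is counted
```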

In [42]:
# what if we want the average, or mean, of the numbers?

s = Series([10, 20, 30, 40, 50])
s.mean()

30.0

In [43]:
# what is the mean? It's the total of all values, divided by the size

# it turns out that there's a .sum() method, which returns the total
s.sum()

150

In [44]:
# this is the same as calling s.mean()
s.sum() / s.size

30.0

In [45]:
# the issue with calculating the mean is that it can be skewed with one or two large outliers
# sometimes, we aren't going to want the mean to understand our data

# instead, we'll want to find the median -- we line the values up from smallest to biggest, and find
# the value at that halfway point

s.median()

30.0

In [46]:
# if the mean and median are the same (or close), our data is roughly symmetric, without a strong skew
# that alone doesn't tell us that it's normally distributed
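
One caveat about comparing the mean and median: matching values suggest that the data is balanced around its center, but plenty of shapes besides the bell curve have that property. Here's a small sketch with a made-up sample, `two_clusters`, whose values pile up at both ends.

```python
from pandas import Series

# made-up data: values piled up at both ends, with almost nothing in the middle
two_clusters = Series([1, 1, 1, 5, 9, 9, 9])

print(two_clusters.mean())     # 5.0
print(two_clusters.median())   # 5.0
```

The mean and median agree, but this data looks nothing like a bell curve.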

In [47]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [48]:
s[4] = 1000000

In [49]:
s.median()   # this hasn't changed!

30.0

In [50]:
s.mean()    # this has changed a lot!

200020.0

In [51]:
# what are the minimum and maximum values?

s.min()

10

In [52]:
s.max()

1000000

In [54]:
s.mode()   # which value(s) appear most?

0         10
1         20
2         30
3         40
4    1000000
dtype: int64

In [55]:
s[4] = 40    # make 40 appear twice in our series

In [57]:
s.mode()     # show the value(s) that appear most

0    40
dtype: int64

In [58]:
# we have the min, the median, and the max
# sometimes we want to know what the values are along the way
# it's common to ask for the 0.25 and 0.75 quantiles (the 25th and 75th percentiles)

s.quantile(0.25)

20.0

In [59]:
s.quantile(0.75)

40.0

In [60]:
# some data analysts want to know the IQR -- the inter-quartile range
# this is the 0.75 quantile minus the 0.25 quantile
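
As far as I know there's no dedicated IQR method on a series, but it's a one-line subtraction. A minimal sketch using the same values that `s` has at this point; `q1`, `q3`, and `iqr` are names I made up.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 40])

q1 = s.quantile(0.25)    # 20.0
q3 = s.quantile(0.75)    # 40.0
iqr = q3 - q1            # 20.0

print(iqr)
```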

In [61]:
# all of these are useful in various ways
# together, they are known as "descriptive statistics"

# John Tukey

# Pandas has a method that gives us all of this information about a series -- it's called "describe"

In [62]:
s.describe()

count     5.000000
mean     28.000000
std      13.038405
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      40.000000
dtype: float64

In [63]:
# standard deviation -- how much does the data fluctuate/differ from the mean?
# a std of 0 means -- all values in the data set are the same
# a big standard deviation means that the values vary a lot from the mean
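
To see where that number comes from, here's the calculation spelled out step by step. By default, the `.std` method in Pandas divides by n - 1 (the "sample" standard deviation), so this sketch does the same. The intermediate names (`deviations`, `squared`, `variance`, `manual_std`) are made up for the example.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 40])

deviations = s - s.mean()                  # distance of each value from the mean
squared = deviations ** 2                  # squaring makes every distance positive
variance = squared.sum() / (s.size - 1)    # divide by n - 1 (the sample variance)
manual_std = variance ** 0.5               # square root of the variance

print(manual_std, s.std())
```

Both numbers should match the `std` row we got from `describe`.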

In [64]:
# when we invoke describe, we are getting back... a series!
# the series has a text index
# the dtype is float64

# we can retrieve one part of this with square brackets

s.describe()['min']    # don't do this!  just use the .min() method

10.0

# Exercise: Weather analysis

1. Create a series containing the high-temperature forecast for your city in the next 10 days.
2. Create a series containing the low-temperature forecast for your city in the next 10 days.
3. Use the `.describe` method to get descriptive statistics for each of those.  What do you see? Does the standard deviation seem to describe the variation in temperature you'll see?
4. Get the difference between highs and lows, and then run `.describe` on that.  

In [65]:
highs

0    19
1    19
2    23
3    27
4    27
5    25
6    21
7    18
8    19
9    20
dtype: int64

In [66]:
lows

0    10
1     9
2     8
3    10
4    13
5    11
6    14
7    11
8    11
9    11
dtype: int64

In [67]:
highs.describe()

count    10.000000
mean     21.800000
std       3.457681
min      18.000000
25%      19.000000
50%      20.500000
75%      24.500000
max      27.000000
dtype: float64

In [69]:
lows.describe()

count    10.00000
mean     10.80000
std       1.75119
min       8.00000
25%      10.00000
50%      11.00000
75%      11.00000
max      14.00000
dtype: float64

In [71]:
# option 1: assign the difference to a variable
temp_diffs = highs - lows

In [72]:
temp_diffs.describe()

count    10.000000
mean     11.000000
std       3.651484
min       7.000000
25%       8.250000
50%       9.500000
75%      14.000000
max      17.000000
dtype: float64

In [73]:
# option 2: use parentheses and then run it directly, without a variable

(highs - lows).describe()

count    10.000000
mean     11.000000
std       3.651484
min       7.000000
25%       8.250000
50%       9.500000
75%      14.000000
max      17.000000
dtype: float64

# Getting results printed

In normal Python programs, nothing is printed without using the `print` function.

But in Jupyter, if the final line in your cell returns a result, then that result is displayed right after the cell.

In [75]:
x = 5
y = 10

x + y     # final line of the cell, and it returns a value (15), so that is displayed

15

In [76]:
# what if I have three things in the same cell?  Only the final one will be displayed.

x + y
x * y
x - y

-5

# Setting and retrieving values

We've seen that we can set and retrieve values with `[]`.  

In [77]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

s

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [78]:
s[3]  # retrieve at index 3

40

In [79]:
s[7]  # retrieve at index 7

80

In [80]:
# what if we try an index that doesn't exist?

s[100]

KeyError: 100

In [81]:
s[5] = 9876

In [82]:
s

0      10
1      20
2      30
3      40
4      50
5    9876
6      70
7      80
8      90
9     100
dtype: int64

# Use `.loc` and `.iloc`

It's true that you can use `[]` to set and retrieve values in a series. As of now, I ask you to *not do that*.

Instead, I want you to use `.loc` and `.iloc`.

In [83]:
# to retrieve from our series at index i, we can say s.loc[i]  (square brackets!)

s.loc[0]   # same as s[0]

10

In [84]:
s.loc[5]   # same as s[5]

9876

In [85]:
# if we have .loc, why do we also have .iloc? What's the difference?
# answer: .loc will work with whatever index we have, even if the index uses text or weird numbers
# .iloc will always start at 0 and go to len(s)-1.

s.iloc[0]   # same as s[0]

10

In [86]:
s.iloc[5]  # same as s[5]

9876
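
With the default index, `.loc` and `.iloc` look identical. The difference shows up as soon as the index isn't 0, 1, 2, and so on. Here's a small sketch with two made-up series, `by_name` and `by_number`:

```python
from pandas import Series

by_name = Series([10, 20, 30], index=['a', 'b', 'c'])       # text index
by_number = Series([10, 20, 30], index=[100, 200, 300])     # "weird" numeric index

print(by_name.loc['b'])       # 20 -- look up by label
print(by_name.iloc[1])        # 20 -- look up by position
print(by_number.loc[200])     # 20 -- 200 is a label here, not a position
print(by_number.iloc[-1])     # 30 -- positions support negative numbers
```

Because `.iloc` works with positions, it also accepts -1 for the final element, which is what plain `[]` refused to do earlier.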

In [87]:
s.loc[5] = 12345

In [88]:
s

0       10
1       20
2       30
3       40
4       50
5    12345
6       70
7       80
8       90
9      100
dtype: int64

In [89]:
# what is s at index 2?
s.loc[2]

30

In [90]:
# what is s at index 7?
s.loc[7]

80

In [91]:
# I want the items at both 2 and 7
# for this I'll use FANCY INDEXING -- meaning, we pass a list of indexes!

s.loc[[2, 7]]  # outer [] are for .loc, inner [] are because we're passing a list

2    30
7    80
dtype: int64

# Fancy indexing

To use fancy indexing, just pass a list of indexes to `.loc` (or `.iloc`).

The result will be a series whose elements have the indexes that you named.

Don't forget to use two sets of square brackets with `.loc`.

In [93]:
s.loc[[2,7,2,7]]   # can we repeat index values? YES!

2    30
7    80
2    30
7    80
dtype: int64

In [94]:
# can we assign using fancy indexing? YES!

s.loc[[2,7]]

2    30
7    80
dtype: int64

In [95]:
s.loc[[2,7]] = [100, 200]   # assign to more than one index!

In [96]:
s

0       10
1       20
2      100
3       40
4       50
5    12345
6       70
7      200
8       90
9      100
dtype: int64

In [98]:
# slices -- (almost) just like in regular Python

s.loc[2:7]     # [from:until]  -- it's up to *AND INCLUDING* the final index

2      100
3       40
4       50
5    12345
6       70
7      200
dtype: int64

In [100]:
# what about slices on .iloc? Can I do that?

s.iloc[2:7]  # now it's up to and NOT including, just like in traditional Python

2      100
3       40
4       50
5    12345
6       70
dtype: int64
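
One way to make sense of the difference: with text labels, there's no obvious "one past the end," so including the final label is the more practical choice for `.loc`. A small sketch with a made-up series, `letters`:

```python
from pandas import Series

letters = Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = letters.loc['b':'d']      # labels b, c, and d -- the end label is included
by_position = letters.iloc[1:3]      # positions 1 and 2 -- the end position is excluded

print(by_label)
print(by_position)
```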

In [101]:
s.loc[[2]]  # fancy indexing with a single value

2    100
dtype: int64

In [102]:
s.loc[2]   # here, I'm asking .loc to retrieve a single element. I thus get back the integer

100

In [103]:
s.loc[[2]]   # here, I'm asking for a list of elements... the list just happens to have one element


2    100
dtype: int64

# Exercise: More with weather!

1. Create a series showing how many mm of rain/snow will fall in your city in the next 10 days.
2. Get the descriptive statistics for the rain forecast.
3. What is the mean expected between days 4 and 7, inclusive?
4. What is the max expected between days 2 and 5, exclusive? (Meaning: Don't include the final day.)
5. Modify day 3's rainfall to be twice the current value.
6. How does that change the mean between days 4 and 7, inclusive?

In [104]:
rainfall = Series([0.8, 0.6, 0, 0, 0, 0, 0.9, 2.3, 1.1, 0])
rainfall.size

10

In [105]:
rainfall.describe()

count    10.000000
mean      0.570000
std       0.749889
min       0.000000
25%       0.000000
50%       0.300000
75%       0.875000
max       2.300000
dtype: float64

In [107]:
# get days 4-7 inclusive

rainfall.loc[4:7].mean()

0.7999999999999999

In [110]:
# max from days 2-5, exclusive

rainfall.iloc[2:5].max()

0.0

In [114]:
# modify day 1's rainfall to be twice the current value  (I changed from 3 so it'll be non-zero)

rainfall.loc[1] = rainfall.loc[1] * 2

In [116]:
# how did that change our values from days 4-7?
rainfall.loc[4:7].mean()

0.7999999999999999

In [117]:
rainfall.loc[4:7].describe()

count    4.000000
mean     0.800000
std      1.086278
min      0.000000
25%      0.000000
50%      0.450000
75%      1.250000
max      2.300000
dtype: float64

In [118]:
rainfall.loc[5] = rainfall.loc[1] * 2

In [119]:
rainfall.loc[4:7].describe()

count    4.000000
mean     1.400000
std      1.157584
min      0.000000
25%      0.675000
50%      1.600000
75%      2.325000
max      2.400000
dtype: float64

# Next up

1. Broadcasting operations
2. Mask arrays



In [120]:
s = Series([10, 20, 30, 40, 50])
s2 = Series([2, 4, 6, 8, 10])

In [121]:
s + s2   # vectorized -- the operation is performed on each index, and we get back a new series

0    12
1    24
2    36
3    48
4    60
dtype: int64

In [122]:
# what happens if I perform the operation with a series and a scalar (single) value?

s + 3    # +3 operation is "broadcast" to each of the elements of s, and we get back a new series

0    13
1    23
2    33
3    43
4    53
dtype: int64
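
One way to think about broadcasting is that the scalar behaves as if it were a series of the same length, with the same value at every index. Pandas doesn't necessarily build such a series behind the scenes, but the result is the same. A sketch, with made-up names `broadcast` and `spelled_out`:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

broadcast = s + 3                            # the scalar is applied to every element
spelled_out = s + Series([3, 3, 3, 3, 3])    # the same thing, written the long way

print(broadcast.equals(spelled_out))
```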

In [123]:
s - 3

0     7
1    17
2    27
3    37
4    47
dtype: int64

In [124]:
s * 3

0     30
1     60
2     90
3    120
4    150
dtype: int64

In [125]:
s / 3

0     3.333333
1     6.666667
2    10.000000
3    13.333333
4    16.666667
dtype: float64

In [126]:
s // 3   # floordiv

0     3
1     6
2    10
3    13
4    16
dtype: int64

In [127]:
s ** 3

0      1000
1      8000
2     27000
3     64000
4    125000
dtype: int64

In [128]:
s % 3

0    1
1    2
2    0
3    1
4    2
dtype: int64

In [129]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [131]:
# let's assume that s contains prices, and that the VAT on products is 17%
# can I get the product price including VAT?

prices_including_vat = s * 1.17
prices_including_vat

0    11.7
1    23.4
2    35.1
3    46.8
4    58.5
dtype: float64

In [132]:
# how can we get random numbers in Pandas?
# we'll actually use NumPy to do that, asking for either random integers or random floats

Series(np.random.randint(0, 100, 10))   # give me 10 random ints, each from 0 to 99 (100 is excluded) -- then pass it to Series

0    76
1    99
2    29
3    66
4    41
5    30
6    24
7    39
8    14
9    10
dtype: int64
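
Random numbers will be different every time you run the cell. If you want the same "random" values each time (for example, so that your results match someone else's), you can give NumPy a seed. This sketch uses NumPy's newer generator interface, `np.random.default_rng`, rather than the `np.random.randint` call above. The seed 42 is arbitrary, and the names `rng`, `first`, and `second` are made up.

```python
import numpy as np
from pandas import Series

rng = np.random.default_rng(42)             # 42 is an arbitrary seed
first = Series(rng.integers(0, 100, 10))    # 10 ints from 0 up to (not including) 100

rng_again = np.random.default_rng(42)       # same seed, so the same sequence
second = Series(rng_again.integers(0, 100, 10))

print(first.equals(second))
```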