# Agenda

1. Getting started
    - What is data analysis?
    - Python and Pandas
    - Descriptive statistics and aggregate methods
    - Setting and retrieving values
    - Broadcasting operations
    - Mask arrays
    - Indexes
    - Some additional useful methods
2. Data types and data frames
    - Data types and `NaN` ("not a number")
    - Data frames (2D data)
    - Adding and removing data
    - Retrieving data
    - Queries with mask indexes 
3. Real-world data
    - Working with CSV files
    - Sorting data
    - Grouping data
    - Pivot tables
    - Joining
4. Text and dates
    - Working with text data
    - Working with dates and times
    - Time series (where datetime values are our index)
5. Visualization
    - Plots via the Pandas interface
    - Scatter plots
    - What next?

# I'm using Jupyter

You can install Jupyter on your own computer if you have Python and Pandas -- just install it from PyPI. 

    pip install jupyter
    
If you don't have Jupyter on your computer, then you can use one of the online systems, such as Google Colab, PythonAnywhere, or Replit. Or try https://try.jupyter.org.

# Jupyter intro

Jupyter is divided into "cells," like the one I'm typing into right now. Cells can be in one of two modes at any time:

- Edit mode: When you type into it, the text appears (like right now). It has a green outline. Enter edit mode by clicking inside of the cell, or by pressing ENTER.
- Command mode: When you type, you're giving commands to the Jupyter notebook itself. It has a blue outline. Enter command mode by clicking to the left of the cell or by pressing ESC.

In command mode, you can type a bunch of keys and get Jupyter to do things:

- `h` -- help
- `c` -- copy the current cell
- `v` -- paste the copied (or cut) cell below the current one
- `x` -- cut the current cell
- `z` -- undo the last action
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one


# What is data? What is data analytics?

Data is everywhere in our world -- cellphones, computers, servers, vehicles, refrigerators.

How do we work with it? How can we use it to answer questions that are useful and/or interesting?

I say: Data science = data analytics + machine learning.

# (Thought) Exercise: Amazon's data

1. What sort of data does Amazon have?
2. What sorts of questions can Amazon ask of that data?
3. What sorts of things can Amazon do once they have answers to those questions?

# We have data. How can we analyze it?

There are many tools: SQL databases. NoSQL databases. Programming languages like R or Julia. Or even Java or C#.

Python has become the #1 language for data science over the last decade. That seems really weird!

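- Python is not super efficient: it runs far more slowly than compiled languages like C
- Python's numbers are not small: every integer is a full object that uses much more memory than a raw 64-bit value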

How did this happen? Answer: NumPy. NumPy is a library written in C, and thus runs at C speeds, *but* it has an interface in Python that lets us use it in Python.

We get the benefits of Python's ease of use, but C's efficiency.

NumPy is a bit low level for many people to work with. Pandas is (mostly) a wrapper around NumPy that makes it easier to work with, and feels like a higher-level system.

Pandas is a Python package (on PyPI), and is the main way that people analyze data in the Python world nowadays.
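To make the efficiency point concrete, here is a minimal sketch comparing a plain Python integer with the elements of a NumPy array. The variable names are just for illustration, and the exact size reported for the Python integer depends on your Python build.

```python
import sys
import numpy as np

py_int = 10                                    # an ordinary Python integer object
arr = np.array([10, 20, 30], dtype=np.int64)   # three 64-bit integers stored in one C-style array

print(sys.getsizeof(py_int))   # size of one Python int object, in bytes (well over 8 on CPython)
print(arr.itemsize)            # 8 bytes per element, i.e., exactly 64 bits each

doubled = arr * 2              # one expression; the loop over the elements runs in compiled code
print(doubled)
```

The array stores only the raw numbers, which is where the memory savings come from; the arithmetic happens inside NumPy rather than in a Python `for` loop.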

In [2]:
# I can load Pandas into memory by saying

import pandas as pd      # import it, and assign it the alias "pd"
from pandas import Series, DataFrame   # these are useful shortcuts, so we don't have to say pd.Series / pd.DataFrame

In [3]:
# series

# a series is a Pandas data structure containing 1D data
# it's kind of, sort of, like a list  -- but it isn't one

s = Series([10, 20, 30, 40, 50])  # creating a series with a Python list

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s   # show me the series

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What are we seeing?

1. The series contains two parallel sets of elements:
    - The index, which currently contains the integers 0-4
    - The values, which currently contain the integers 10, 20, 30, 40, and 50
2. The series will always present itself with both index and values
3. At the bottom, we see the `dtype` of the series, i.e., the type of data that it contains. Here it's `int64`, which is a way of saying that every integer in our series is stored as a 64-bit integer.  This data type comes from NumPy.
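The three parts described above can also be retrieved separately. A minimal sketch, assuming the same five-element series; `index`, `values`, and `dtype` are attributes of a series.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

print(s.index)    # the index: by default, the integers 0 through 4
print(s.values)   # the values, as a NumPy array
print(s.dtype)    # the dtype shared by every value in the series
```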

In [6]:
# what happens if I add the series to itself?
# I get a new series of the same length... at each index, the values have been added

s + s

0     20
1     40
2     60
3     80
4    100
dtype: int64

In [7]:
# I can retrieve values with []

s[2]

30

In [8]:
s[4]

50

In [9]:
s[100]   # there is no index 100 in our series, so we get an error

KeyError: 100

In [10]:
# some methods that I can call on my series

s.sum()

150

In [11]:
s.mean()

30.0

In [12]:
s.std()   # standard deviation

15.811388300841896
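One detail that can surprise people: by default, Pandas calculates the *sample* standard deviation (dividing by n - 1), whereas NumPy's `std` divides by n. A minimal sketch of the difference, using the `ddof` ("delta degrees of freedom") parameter on the same five values:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50])

print(s.std())            # sample standard deviation (ddof=1), the Pandas default
print(s.std(ddof=0))      # population standard deviation
print(np.std(s.values))   # NumPy's default is ddof=0, so this matches the previous line
```

If your numbers from Pandas and NumPy disagree slightly, this default is a likely reason.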

In [13]:
s.min()  

10

In [14]:
s.max()

50

In [15]:
s.median()

30.0

In [16]:
s.mode()   # every value appears exactly once, so all five are returned as modes

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [17]:
s2 = Series([10.5, 20.5, 30.5, 40.5, 50.5])

s + s2

0     20.5
1     40.5
2     60.5
3     80.5
4    100.5
dtype: float64

# Exercise: Weather

1. Find a Web site with your city's weather forecast for the next 10 days.
2. Create a series, called `lows`, and put the forecast low temperatures there.
3. Create a series, called `highs`, and put the forecast high temperatures there.
4. Subtract `lows` from `highs`, and show the difference in temperature for each day.

In [18]:
lows = Series([16, 16, 17, 15, 15, 16, 17, 17, 17, 18])
highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29])

highs - lows

0    13
1    17
2    10
3     8
4     9
5    13
6    13
7    13
8    12
9    11
dtype: int64

In [19]:
highs[0] - lows[0]

13

In [20]:
highs[1] - lows[1]

17

In [21]:
extra_long_highs = Series([29, 33, 27, 23, 24, 29, 30, 30, 29, 29, 30, 30])

extra_long_highs - lows

0     13.0
1     17.0
2     10.0
3      8.0
4      9.0
5     13.0
6     13.0
7     13.0
8     12.0
9     11.0
10     NaN
11     NaN
dtype: float64
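Why did we get `NaN` here? A minimal sketch of what is going on, with made-up series: Pandas matches up values by index, not by position, and any index that appears in only one of the series gets `NaN` in the result. Because `NaN` is a float, the dtype of the result becomes `float64`.

```python
from pandas import Series

a = Series([10, 20, 30], index=[0, 1, 2])
b = Series([1, 2], index=[1, 0])    # note the order: index 1 comes first

result = a + b
print(result)   # index 0 -> 10 + 2, index 1 -> 20 + 1, index 2 -> NaN (no partner in b)
```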

To create a series, I need to do several things:

1. Import Pandas
2. Create a Python list of values (for now, just numbers)
3. Call `Series` and pass it the Python list

Notice that

- `Series` has a capital `S`!
- If you want to use `Series` by itself, you need to have the line `from pandas import Series`. Otherwise, you need to say `pd.Series`.
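The two spellings mentioned above can be sketched side by side; the variable names here are just for illustration.

```python
import pandas as pd
from pandas import Series

via_pd = pd.Series([10, 20, 30])   # works with just "import pandas as pd"
via_name = Series([10, 20, 30])    # works only because of "from pandas import Series"

print(type(via_pd) is type(via_name))   # both are the same class
print(via_pd.equals(via_name))          # and they hold the same index and values
```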

In [23]:
s = Series([10, 20, 30])

In [24]:
s

0    10
1    20
2    30
dtype: int64

# Next up:

1. Descriptive statistics
2. Five-number summary
3. Mean and standard deviation
4. Setting and retrieving values
5. `.loc`
6. Fancy indexing
7. Slicing



In [26]:
# I want to create a series of random numbers
# the easiest way to do that is with NumPy

import numpy as np                          # this is the standard way to do it
np.random.seed(0)                           # this sets the random-number system to a known point
s = Series(np.random.randint(0, 100, 10))   # give me 10 random integers, each 0-99 (the upper bound, 100, is excluded)

s   # this series contains 10 random ints, all 0-99

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [27]:
# let's get some descriptive statistics -- numbers that describe our series, so that we can better
# understand it.

s.mean()   # get the average

52.5

In [28]:
s.std()    # standard deviation

25.67424130654432

In [29]:
# often, I'll want all of these descriptive statistics together
# for that reason, it's great that Pandas provides a "describe" method

s.describe()

count    10.000000
mean     52.500000
std      25.674241
min       9.000000
25%      38.000000
50%      55.500000
75%      67.000000
max      87.000000
dtype: float64

In [30]:
# if I want the quantiles, I can call a method for that

s.quantile(0.25)   # this is the 25% mark

38.0

In [31]:
s.quantile(0.5)  # this is the median; you can also call s.median()

55.5

In [32]:
s.median()

55.5
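If I only want the five-number summary (minimum, 25% mark, median, 75% mark, maximum), one way to sketch it is to pass a list of quantiles to `quantile`, which then returns a series rather than a single value. The series here is typed in by hand, with the same values as the random series above.

```python
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])   # same values as the random series above

five_numbers = s.quantile([0, 0.25, 0.5, 0.75, 1])    # min, 25%, median, 75%, max
print(five_numbers)
```

The index of the result contains the quantiles I asked for, so I can retrieve any one of them afterwards.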

# Exercise: Weather statistics

1. Find the mean high temperature in your city over the next 10 days.
2. Find the median high temperature.
3. Explain, looking at the full series of high temperatures, why this makes sense. (Why are they close, or not?)
4. Get the mean difference between the high and low temps.

In [33]:
highs

0    29
1    33
2    27
3    23
4    24
5    29
6    30
7    30
8    29
9    29
dtype: int64

In [34]:
highs.mean()

28.3

In [35]:
highs.median()

29.0

In [36]:
highs.describe()

count    10.000000
mean     28.300000
std       2.945807
min      23.000000
25%      27.500000
50%      29.000000
75%      29.750000
max      33.000000
dtype: float64

In [39]:
(highs - lows).mean()   # what is the average difference in temperature between highs and lows?

11.9

# Retrieving from our series

We've already seen that we can retrieve from our series with `[]`, just like a list (or tuple or dict).

It turns out that there are more sophisticated ways to retrieve from a series than just that.

In [40]:
# first: I want to see the value at index 5
s[5]

9

In [41]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [42]:
# I want to see the value at index 1
s[1]

47

In [44]:
# I want to see the values at indexes 5 and 1, in that order

# "fancy indexing"

s[ [5, 1] ]  # outer [] mean: I want an element, and inner [] mean: I want these two indexes

5     9
1    47
dtype: int64

In [46]:
# slice

s[3:7]  # slices give us results: starting_at:ending_at -- from 3 until (not including) 7

3    67
4    67
5     9
6    83
dtype: int64

In [47]:
# don't use []!
# instead, use `.loc[]`.

s.loc[3]

67

In [49]:
s.loc[8]

36
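One difference to watch for when moving from `[]` to `.loc`: slices. A minimal sketch, assuming the same ten values as above: a slice in plain `[]` works like a list slice and excludes the end, but a slice in `.loc` works with index labels and includes the end.

```python
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

print(s[3:7])       # positions 3, 4, 5, 6 -- the end (7) is excluded, as with a list
print(s.loc[3:7])   # labels 3 through 7 -- the end (7) is included
```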

In [50]:
s.loc[8] = 2345  # we can assign to a series via .loc!
s

0      44
1      47
2      64
3      67
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64

In [51]:
# I can use fancy indexing to assign to more than one element

s.loc[ [2, 3] ] = 10   # this assigns 10 to the values at indexes 2 and 3


In [52]:
s

0      44
1      47
2      10
3      10
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64
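The assignment above gave both indexes the same value. As a small sketch with made-up values, `.loc` also accepts a list of values that is the same length as the list of indexes, and assigns them one by one.

```python
from pandas import Series

s = Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

s.loc[[2, 3]] = 10            # one value, assigned to both indexes
s.loc[[5, 7]] = [500, 700]    # two values, matched up with the two indexes in order

print(s)
```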

In [54]:
s   # my series is... a series, containing many values

0      44
1      47
2      10
3      10
4      67
5       9
6      83
7      21
8    2345
9      87
dtype: int64

In [56]:
s.loc[7]    # this returns a single value

21

# Exercise: Weather

1. Retrieve the high-temp values at indexes 2, 5, and 7.
2. Find the mean of these temps.
3. Assign the mean of these temps back to the indexes themselves.

In [57]:
highs

0    29
1    33
2    27
3    23
4    24
5    29
6    30
7    30
8    29
9    29
dtype: int64

In [58]:
highs[ [2,5,7] ]

2    27
5    29
7    30
dtype: int64

In [59]:
highs[ 2 ]  # I asked for a single index, and got one scalar integer back

27

In [60]:
highs[ [2] ]  # I asked for a list of indexes (with just one index in it), and got a series back

2    27
dtype: int64