# Agenda: The full course

1. Getting started
    - What is data analytics?
    - What is Pandas?
    - Descriptive statistics
    - Pandas series 
    - Retrieving values
    - Setting values
    - Broadcasting operations
    - Mask arrays (boolean arrays) for retrieving selected values
    - Indexes
    - Some useful methods
2. Data frames, for 2-dimensional data
3. Real-world data
4. Text data and date/time data
5. Visualization

# Jupyter

I see Jupyter as my Python laboratory -- I can try things, and then if they don't work, I can modify my experiment and try again.

Everything in Jupyter is done in a "cell." We have two types of cells:

- Code cells, in which we can run Python code
- Markdown cells, in which we have Markdown text, which turns into HTML very nicely

When I press Enter, I go down one line. But when I press shift+Enter together, the cell executes -- which, if it's a code cell, actually runs the Python.  And if it's a Markdown cell, then it gets formatted.

# Two modes in Jupyter

- Edit mode -- this allows us to type into Jupyter. The cell has a green outline. We can get into edit mode by clicking in the cell or by pressing ENTER.
- Command mode -- this allows us to give Jupyter commands, typically one character long. The cell has a blue outline in this case. We can get into command mode by clicking to the left of the cell or pressing ESC.

What can I do in command mode?

- `m` -- turn the cell into a Markdown cell, for text
- `y` -- turn the cell into a Python code cell, for programming
- `c` -- copy the current cell
- `x` -- cut the current cell
- `v` -- paste the copied/cut cell
- `h` -- get help, a full list of commands
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one

In [2]:
# this is Python code

print('Hello!')   # shift+enter sends the cell's contents to the back-end Python process, where it runs

Hello!


# What is data science?

There is no one definition! (So everyone can choose the definition that they like.)

I define data science as data analytics + (data engineering +) machine learning.

- Data analytics -- we have a bunch of data. We want to understand it.  For example: Who bought from our store? What time do people use the train? How much do people spend on taxis?
- Data engineering -- the data all exists in one place, and is very messy.  How can we easily and efficiently get it from its current location to our systems, cleaning it up along the way?
- Machine learning -- let's use this existing data to make projections/predictions about the future.  How many people will buy at our Black Friday sale? How many toll collectors do we need at 6 a.m. on Sunday?  If we hear a certain soundwave, can we predict the meaning? If we see a picture, can we accurately predict the animal in the picture?

In this class, we'll be talking about how to use Python and Pandas for data analytics.

Data science is all about asking questions and using scientific or scientific-like methods to answer our questions.

# Exercise: 

1. What kinds of data does Amazon have?
2. What sorts of questions can they ask of that data?

# Python and Pandas

Python has been around for about 30 years. It's a fantastic high-level programming language.

But it is *not* very fast or efficient, certainly not compared with C, Java, and C#.  It takes far more time, and it uses far more memory than these other languages.

So: Why would we use Python?  Just because it's fun and easy to use?

No: A library called NumPy provides us with a Python interface to C-language data structures. So we have the speed and efficiency of C, but the ease of use of Python.

NumPy is still very popular and very useful. But it can be a bit low level for many people. Enter Pandas, which is a wrapper around NumPy that makes it easier to use and more convenient.

I call Pandas the automatic transmission to NumPy's stick shift.

We're going to use Pandas, but we will occasionally see hints of NumPy under the hood.

In [3]:
# we have to load Pandas (and also NumPy)

# I very strongly encourage you to load NumPy and Pandas and give them these very standard aliases

import numpy as np
import pandas as pd
from pandas import Series   # this allows me to avoid saying pd.Series; I can just say Series

In [4]:
# I'll create a Pandas series with 5 integers
# I do this by creating a new Series, passing it a Python list of integers

s = Series([10, 20, 30, 40, 50])

In [5]:
# we see that s isn't a list, but rather a Pandas series:

type(s)

pandas.core.series.Series

In [6]:
# let's take a look at s
# note: in Jupyter, I don't have to use print to see something. The final line of a cell,
# if it returns a value, is displayed automatically.

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# Series

A series contains values, and also has an index. By default, the index consists of integers that start at 0 and go up to the series length - 1.

We can have almost any data types we want in the series, but we'll soon see that we normally will stick to a limited set, matching C's data types.

Series are always displayed in this way, with the index and values in two parallel columns. At the end, we see the "dtype," describing the type of data that's in our series.  For now, we'll mostly (not always) have 64-bit integers, known as int64.
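
Here's a small sketch of the pieces described above. The variable name `example` is just for illustration; it isn't used elsewhere in these notes. A series lets us look at its index, its values, and its dtype separately:

```python
import numpy as np
import pandas as pd

example = pd.Series([10, 20, 30, 40, 50])

print(example.index)       # the default index, counting from 0 to length - 1
print(example.to_numpy())  # the values, as a NumPy array
print(example.dtype)       # the type of data stored in the series
```

Notice that the values come back as a NumPy array. This is one of those hints of NumPy under the hood.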

In [7]:
# in some ways, a series is like a Python list

s[0]

10

In [8]:
s[1]

20

In [10]:
# can I get the final element with s[-1]?
# no... not quite like a list

s[-1]

KeyError: -1
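
Why the error? With square brackets, Pandas went looking for -1 in the index, and our index only contains 0 through 4. As a preview (we haven't covered it yet, so treat this as a sketch), the `.iloc` accessor retrieves by position rather than by index value, and it does understand negative numbers:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

last = s.iloc[-1]   # .iloc works by position, so -1 means "the final element"
print(last)         # prints 50
```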

In [11]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [12]:
s[0] + 10

20

In [13]:
s + s  # can I add a series to itself?  And if so, what do I get back?

0     20
1     40
2     60
3     80
4    100
dtype: int64

In [14]:
# the result of adding a series to itself is a new series,
# one with the same length (and thus index) as before, but in which
# the values are doubled from before.

# what about adding two different series together?

s1 = Series([10, 20, 30, 40, 50])
s2 = Series([100, 200, 300, 400, 500])

s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [15]:
# what if the two series are not the same length?

s3 = Series([123, 456, 789])

s1 + s3  # it works... but we get NaN ("not a number") wherever an index exists in only one of the two series

0    133.0
1    476.0
2    819.0
3      NaN
4      NaN
dtype: float64
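
What happened here? Pandas matches values up by index, not by where they happen to sit. Indexes 3 and 4 exist only in `s1`, so there is nothing to add to them, and we get NaN. And because NaN is a float, the whole result becomes float64. If we would rather treat the missing values as 0, one option is the `add` method with `fill_value`. This is a sketch of that alternative; the names `total` and `filled` are just for illustration:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])
s3 = pd.Series([123, 456, 789])

# plain +: indexes 3 and 4 have no partner in s3, so the result there is NaN
total = s1 + s3

# the add method does the same job as +, but fill_value=0 substitutes 0
# wherever a value is missing on one side
filled = s1.add(s3, fill_value=0)
print(filled)
```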

In [16]:
# arithmetic operations on a series are *vectorized*
# meaning: When I apply an operator to a series, I'm not running it on a single value in the series
# the operation is repeated for every single element.

# if I say s1 + s2, I get a new series back based on s1 and s2, but not modifying them
s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64
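
To make "vectorized" a bit more concrete, here is a sketch comparing the one-line version with the loop we would otherwise have to write ourselves. The names `looped` and `vectorized` are just for illustration. The same idea applies when one side is a single number: it is applied to every element.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])
s2 = pd.Series([100, 200, 300, 400, 500])

# the do-it-yourself version: walk through both series, adding one pair at a time
looped = pd.Series([a + b for a, b in zip(s1, s2)])

# the vectorized version: one expression, applied to every pair of elements
vectorized = s1 + s2

print(looped.tolist() == vectorized.tolist())

# a single number is applied to every element, too
print(s1 + 5)
```

Neither `s1` nor `s2` is changed by any of this; each expression gives back a new series.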

In [17]:
# I can, of course, assign the sum back to s1 (or any other variable, including a new one)

series_sum = s1 + s2
series_sum

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [18]:
# I can use all of our favorite Python operators on two series

s1 + s2

0    110
1    220
2    330
3    440
4    550
dtype: int64

In [19]:
s1 - s2

0    -90
1   -180
2   -270
3   -360
4   -450
dtype: int64

In [20]:
s1 * s2

0     1000
1     4000
2     9000
3    16000
4    25000
dtype: int64

In [21]:
s1 / s2   # notice that division always returns a float... so we'll get back floating-point numbers (and dtype)

0    0.1
1    0.1
2    0.1
3    0.1
4    0.1
dtype: float64

In [22]:
# Python has a // operator ("floor division"), which divides and throws away the remainder, giving us ints back here

s1 // s2

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [23]:
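s1 ** s2   # s1 to the s2 power
# note: the true results (such as 10 ** 100) are far too big for int64, so they overflow; the zeros below are not the real answers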

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [25]:
s1 % s2    # divide s1 by s2, return the remainder

0    10
1    20
2    30
3    40
4    50
dtype: int64
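
In the cells above, every value in `s1` is smaller than its partner in `s2`, so `//` gives us all zeros and `%` hands `s1` right back. Here is a sketch with made-up numbers (the series `a` and `b` are just for illustration) where the two operators have more to do:

```python
import pandas as pd

a = pd.Series([7, 20, 35])
b = pd.Series([2, 6, 4])

quotient = a // b     # the whole-number part of each division
remainder = a % b     # what is left over from each division

print(quotient.tolist())    # prints [3, 3, 8]
print(remainder.tolist())   # prints [1, 2, 3]

# multiplying back and adding the remainder recovers the original values
print((quotient * b + remainder).tolist() == a.tolist())   # prints True
```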

# Exercise: Temperature differences

1. Find a web site with the 10-day forecast for your city, including high and low temperatures.
2. Create a series with the high temperatures
3. Create another series with the low temperatures
4. Create a new series showing how much higher the high temp is each day.

In [26]:
# we're working on this exercise for about 5 minutes... that's why it's quiet... but you can ask in the Q&A!

In [27]:
high_temps = Series([18, 20, 21, 23, 25, 25, 26, 23, 23, 24])
low_temps = Series([14, 14, 13, 13, 14, 15, 15, 14, 14, 14])


In [28]:
high_temps

0    18
1    20
2    21
3    23
4    25
5    25
6    26
7    23
8    23
9    24
dtype: int64

In [29]:
low_temps

0    14
1    14
2    13
3    13
4    14
5    15
6    15
7    14
8    14
9    14
dtype: int64

In [30]:
high_temps - low_temps

0     4
1     6
2     8
3    10
4    11
5    10
6    11
7     9
8     9
9    10
dtype: int64

# Next up

1. Descriptive statistics
2. Aggregate methods we can run on a series
3. Setting and retrieving values with .loc, fancy indexing, and slices

# Random data

We will use some manually entered data in this course, especially today and next week.  But sometimes I'll just want to show you some basic data, and using random data is useful.

NumPy (the lower-level library) can create an array of random integers very easily.  We can say:

    np.random.randint(0, 100, 5) 
    
The above code returns 5 integers, from 0 up to and not including 100. We can hand that to `Series`:



In [35]:
np.random.seed(0)    # reset the random-number system to a known state

s = Series(np.random.randint(0, 100, 5))
s

0    44
1    47
2    64
3    67
4    67
dtype: int64

In [36]:
# let's create a series of 10 numbers

np.random.seed(0)    # seed the pseudo-random function with a known starting point, so that we'll know what numbers it'll give us
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64
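
A side note: NumPy also has a newer way to get random numbers, in which we create our own generator object with `np.random.default_rng` and ask it for integers. We won't rely on it in these notes, so take this as a sketch; the names `rng` and `s_rng` are just for illustration. It uses a different algorithm from `np.random.seed`, so even with the same seed, expect different numbers from the ones above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                # a generator with its own seed

# 10 integers, from 0 up to and not including 100
s_rng = pd.Series(rng.integers(0, 100, 10))
print(s_rng)
```

As with `np.random.seed(0)`, creating the generator again with the same seed gives the same sequence again.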

# Descriptive statistics

We see our numbers in `s`, but how would we describe them? What can we say about our data that would be useful to someone who wants to understand it better? Remember that this is a collection of numbers. So anything general we say will be a little wrong, but it will help with overall understanding.

We could list:
- The lowest number
- The highest number
- The middle number
- How far the numbers spread out from the middle

Indeed, these numbers are exactly what we're going to use to describe our data. They are actually known as "descriptive statistics," giving us a numeric picture of our data.
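
As a first sketch, each of the items in that list has a matching method on a series. I'm making two assumptions here: that "the middle number" means the median (the mean is another reasonable choice), and that "spread" means the standard deviation. The values are typed in by hand, and are the same ten numbers we saw in `s` above.

```python
import pandas as pd

s = pd.Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

print(s.min())      # the lowest number: 9
print(s.max())      # the highest number: 87
print(s.median())   # the middle number: 55.5
print(s.mean())     # another kind of "middle," the average: 52.5
print(s.std())      # one measure of spread: the standard deviation
```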