# Agenda

1. Getting started
    - What is data analysis?
    - Python and Pandas
    - Descriptive statistics and aggregate methods
    - Setting and retrieving values
    - Broadcasting operations
    - Mask arrays
    - Indexes
    - Some additional useful methods
2. Data types and data frames
    - Data types and `NaN` ("not a number")
    - Data frames (2D data)
    - Adding and removing data
    - Retrieving data
    - Queries with mask indexes 
3. Real-world data
    - Working with CSV files
    - Sorting data
    - Grouping data
    - Pivot tables
    - Joining
4. Text and dates
    - Working with text data
    - Working with dates and times
    - Time series (where datetime values are our index)
5. Visualization
    - Plots via the Pandas interface
    - Scatter plots
    - What next?

# I'm using Jupyter

If you already have Python and Pandas on your own computer, you can install Jupyter from PyPI:

    pip install jupyter
    
If you don't have Jupyter on your computer, then you can use one of the online systems, such as Google Colab, PythonAnywhere, or Replit. You can also try https://try.jupyter.org.

# Jupyter intro

Jupyter is divided into "cells," like the one I'm typing into right now. Cells can be in one of two modes at any time:

- Edit mode: When you type into it, the text appears (like right now). It has a green outline. Enter edit mode by clicking inside of the cell, or by pressing ENTER.
- Command mode: When you type, you're giving commands to the Jupyter notebook itself. It has a blue outline. Enter command mode by clicking to the left of the cell or by pressing ESC.

In command mode, you can type a bunch of keys and get Jupyter to do things:

- `h` -- help
- `c` -- copy the current cell
- `v` -- paste the copied (or cut) cell below the current one
- `x` -- cut the current cell
- `z` -- undo the last cell operation (such as deleting a cell)
- `a` -- create a new cell *above* the current one
- `b` -- create a new cell *below* the current one


# What is data? What is data analytics?

Data is everywhere in our world -- cellphones, computers, servers, vehicles, refrigerators.

How do we work with it? How can we use it to answer questions that are useful and/or interesting?

I say: Data science = data analytics + machine learning.

# (Thought) Exercise: Amazon's data

1. What sort of data does Amazon have?
2. What sorts of questions can Amazon ask of that data?
3. What sorts of things can Amazon do once they have answers to those questions?

# We have data. How can we analyze it?

There are many tools: SQL databases. NoSQL databases. Programming languages like R or Julia. Or even Java or C#.

Python has become the #1 language for data science over the last decade. That seems really weird!

- Python is not super efficient
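- Python's numbers are not small (each integer is a full object, typically 28 bytes in 64-bit CPython, versus 8 bytes for a raw 64-bit integer)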

How did this happen? Answer: NumPy. NumPy is a library written (mostly) in C, and thus runs at C speeds, *but* it has a Python interface, so we can use it from Python code.

We get the benefits of Python's ease of use, along with C's efficiency.
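As a rough illustration of the size and speed points above, here is a small sketch comparing a Python list of integers with a NumPy array holding the same values. The variable names are made up for this example, and the exact size of a Python integer depends on your Python build, so the code prints it rather than assuming it.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))                 # one million separate Python int objects
np_array = np.arange(n, dtype=np.int64)  # one contiguous block of 64-bit integers

# A single Python int is a full object, so it takes more than 8 bytes
bytes_per_python_int = sys.getsizeof(py_list[-1])

# Each int64 element in the NumPy array takes exactly 8 bytes
bytes_per_numpy_int = np_array.itemsize

print(bytes_per_python_int, bytes_per_numpy_int)

# Both approaches give the same total; NumPy runs the loop in compiled code
assert sum(py_list) == int(np_array.sum())
```

If you want to see the speed difference for yourself, you can time `sum(py_list)` and `np_array.sum()` with the `%timeit` magic command in Jupyter.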

NumPy is a bit low level for many people to work with. Pandas is (mostly) a wrapper around NumPy that makes it easier to work with, and feels like a higher-level system.

Pandas is a Python package (on PyPI), and is the main way that people analyze data in the Python world nowadays.

In [2]:
# I can load Pandas into memory by saying

import pandas as pd      # import it, and assign it the alias "pd"
from pandas import Series, DataFrame   # these are useful shortcuts, so we don't have to say pd.Series / pd.DataFrame

In [3]:
# series

# a series is a Pandas data structure containing 1D data
# it's kind of, sort of, like a list  -- but it isn't one

s = Series([10, 20, 30, 40, 50])  # creating a series with a Python list
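To make the "like a list, but it isn't one" point a bit more concrete, here is a minimal sketch. It peeks ahead at broadcasting, which is covered later in the agenda, and the name `numbers` is just for illustration.

```python
from pandas import Series

numbers = [10, 20, 30, 40, 50]
s = Series(numbers)

# Like a list: s[0] retrieves the first value, and len() works
assert s[0] == numbers[0]
assert len(s) == len(numbers)

# Unlike a list: * 2 doubles every value instead of repeating the sequence
doubled_list = numbers * 2      # a 10-element list: the original, twice
doubled_series = s * 2          # a 5-element series: 20, 40, 60, 80, 100

print(doubled_list)
print(doubled_series)
```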

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s   # show me the series

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What are we seeing?

1. The series contains two parallel sets of elements:
    - The index, which currently contains the integers 0-4
    - The values, which currently contain the integers 10, 20, 30, 40, and 50
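
If you want to look at the two parts separately, a minimal sketch along these lines should work. It recreates the same series so that the cell stands on its own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

print(s.index)       # the index: a RangeIndex covering 0 through 4
print(s.to_numpy())  # the values, as a NumPy array

# The two run in parallel: each index label is paired with one value
for label, value in s.items():
    print(label, value)
```

The values come back as a NumPy array, which fits the earlier point that Pandas is (mostly) a wrapper around NumPy.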
    