# Welcome and agenda

1. Getting started
    - What is data analysis?
    - What is Pandas?
    - Descriptive statistics
    - Using Pandas series
    - Broadcasting 
    - Mask arrays
    - Useful methods
2. Data types and data frames
    - dtypes (different types of data we can store)
    - `NaN` ("not a number")
    - Data frames
    - Querying with boolean/mask indexes
    - Reading CSV data
3. Real-world data
    - CSV
    - Online data
    - Sorting
    - Grouping
    - Pivot tables
    - Joining
    - Cleaning data
4. Text and dates
    - Text
    - Dates
5. Visualization
    - Line plots
    - Bar plots
    - Histograms
    - Pie plots
    - Scatter plots
    - Boxplots
    




# What is data?

If we're talking about data analysis, then we should really know what data is!

The nouns in the computer world are data:

- Files (lots of types of files)
- Logs (information about who did what, and when, on which computer)
- Preferences (e.g., Netflix)
- Store inventories and purchasing histories
    
Thanks to mobile devices, and computers, and companies all getting interconnected, there is a **LOT** of data out there.  

The problem isn't finding data. The problem is understanding what our data really means, and doing something useful with it.

The scientific method means: I ask a question, and I try to answer that question as best as possible, using techniques that others have demonstrated are reliable.  Data science is all about applying that method to the world of data.

I want to be able to ask a question, and then answer it.  I'll use data in order to do that.

Some examples:

- What products are selling best?
- What products are selling best in each country?  In each age demographic? In each country vs. each age demographic?
- Which employees are bringing in the most sales?
- Which universities' graduates earn the most money 10 years after graduating?
- Which stocks/bonds/investments have done best over the last 10 years? 50 years?

# Data science

The idea is: Use scientific principles to ask questions and answer them with data.

I divide the world of data science into three pieces:

- Data analysis: use the data we've collected to understand the past and present
- Data engineering: there's so much data out there, in so many different formats and from so many different sources, that getting it to our team's computers in a timely, organized way is hard. Data engineers solve these problems.
- Machine learning: learn from the past, to make predictions about the future


How are we going to gather the data, and ask questions?  We're going to read it into data structures on our computer, and then use methods on those data structures to create queries.  We'll be using Python and Pandas in this course.
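
As a rough sketch of that workflow, here is a tiny series of made-up sales figures (the product names and numbers are invented purely for illustration), queried with two ordinary Series methods:

```python
import pandas as pd

# Hypothetical units sold per product; these figures are invented for illustration
sales = pd.Series([120, 340, 95], index=['widget', 'gadget', 'gizmo'])

best_product = sales.idxmax()   # index label of the largest value
total_sold = sales.sum()        # sum of all the values

print(best_product, total_sold)
```

Real questions involve far more data, but the pattern stays the same: load the data into a structure, then call methods on it.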

# Exercise: Think about data

What data does Amazon have about its products? What data does Amazon have about you? How does Amazon use this data in its business?

# Python -- why is this a language for data analytics?

In many ways, Python is a very good choice:
- Easy to read
- Easy to learn
- Lots of support
- Open source (cheap or free)

In other ways, it's a very bad choice:
- It runs slowly (relative to many other languages)
- It uses lots of memory

Data analytics uses *lots* of data, often many hundreds of megabytes -- or more!  It's not at all unusual to have data sets that are several GB in size.  If the same data takes 10x more memory in Python than in C, then you can handle far more data in C, even though it's a harder language to work with.

So why is Python so popular for data analysis anyway?  The reason is something called "NumPy." This is a Python package, written mostly in C.  It lets us store data in C format (i.e., very small, and very fast to process), with a very thin layer of Python on top of it.  NumPy is fast and memory-efficient, yet it still lets us write friendly Python code.

The best of both worlds!  (Almost)
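
A small sketch of the size difference, assuming one million 64-bit integers. The variable names are only illustrative, and the exact byte count for the Python list varies with Python version and platform:

```python
import sys
import numpy as np

n = 1_000_000

# A Python list stores pointers to separate int objects,
# so we count the list itself plus every object it points to
py_list = list(range(n))
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy array stores the raw 64-bit values side by side, as C would
arr = np.arange(n, dtype=np.int64)
array_bytes = arr.nbytes        # 8 bytes per element

print(f"list: {list_bytes:,} bytes; array: {array_bytes:,} bytes")
```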

NumPy can still be a bit low level and hard to work with.  We, in this class, will be using Pandas.  Pandas is (mostly) a wrapper around NumPy, making it far friendlier and easier to work with.

The Pandas library alone is estimated to have between 5 and 10 million users.  People at companies and organizations around the world are using Pandas more and more to analyze their data:

- E-commerce companies
- Manufacturers
- Banks and financial institutions
- Marketing companies



# Data structures in Pandas

Pandas mostly ignores Python data structures, in favor of its own:

- Series (1-dimensional data)
- Data frame (2-dimensional data)
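
A minimal sketch of both structures, using made-up values (the variable and column names are only for illustration):

```python
import pandas as pd

# A series: one dimension of values, each paired with an index label
values = pd.Series([10, 20, 30])

# A data frame: two dimensions, with rows and named columns
table = pd.DataFrame({'x': [1, 2, 3],
                      'y': [4.5, 5.5, 6.5]})

print(values.shape)   # a 1-element tuple: the number of values
print(table.shape)    # a 2-element tuple: (rows, columns)

# Selecting a single column of a data frame gives back a series
column = table['x']
print(type(column))
```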

Assuming that you have installed Pandas and Jupyter on your computer, you can run the following in a notebook:

In [4]:
import numpy as np          # we will use, occasionally, some of the low-level NumPy functionality via "np"
import pandas as pd         # load the Pandas module into memory, and make it available via the "pd" namespace
from pandas import Series   # I also want to use the Series name by itself, rather than pd.Series

In [5]:
# Let's say I want to create a series

# I use a Python list of integers to create my series
s = Series([10, 20, 30, 40, 50, 60, 70])

In [6]:
type(s)   # in Jupyter, if an expression is on the final line of a cell, we get the value back w/o print

pandas.core.series.Series

In [7]:
# A series works a lot like a list, in many ways

s[0]

10

In [8]:
s[5]

60

In [9]:
t = Series([50, 40, 20, 30, 88, 22, 16])

In [11]:
# what happens when I add together two Python lists?
mylist1 = [10, 20, 30]
mylist2 = [40, 50, 60]

mylist1 + mylist2  # we get a new list -- all elements of mylist1, followed by all elements of mylist2

[10, 20, 30, 40, 50, 60]

In [12]:
# let's look at s and t

In [13]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [14]:
t

0    50
1    40
2    20
3    30
4    88
5    22
6    16
dtype: int64

In [15]:
# what happens when I add s and t together?

# we get a new series -- the new series has 7 elements, with the index 0-6
# the new series, at index 0, is s[0] + t[0]
# the new series, at index 1, is s[1] + t[1]

# the addition took place element by element, as with vectors!

s + t

0     60
1     60
2     50
3     70
4    138
5     82
6     86
dtype: int64
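
To make the contrast with lists concrete, here is a sketch of what `s + t` computes, written out by hand with `zip`. It assumes both series have the default 0-6 index, as they do here; the names `by_hand` and `vectorized` are just for illustration.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70])
t = Series([50, 40, 20, 30, 88, 22, 16])

# The plain-Python equivalent: pair up the elements and add each pair
by_hand = [a + b for a, b in zip(s, t)]

# The vectorized version does the same work in one expression
vectorized = s + t

print(by_hand)
print(vectorized.tolist())
```

Both approaches produce the same seven sums, but the vectorized form is shorter and runs in compiled code rather than in a Python loop.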

In [16]:
s - t

0   -40
1   -20
2    10
3    10
4   -38
5    38
6    54
dtype: int64

In [17]:
s * t

0     500
1     800
2     600
3    1200
4    4400
5    1320
6    1120
dtype: int64

In [18]:
s / t

0    0.200000
1    0.500000
2    1.500000
3    1.333333
4    0.568182
5    2.727273
6    4.375000
dtype: float64

In [20]:
s % t  # remainder from dividing each element of s by the matching element of t

0    10
1    20
2    10
3    10
4    50
5    16
6     6
dtype: int64
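
The `%` operator pairs naturally with `//` (floor division), which also works element by element. A brief sketch, reusing the same values as above: at each position, the quotient times the divisor, plus the remainder, gives back the original value.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70])
t = Series([50, 40, 20, 30, 88, 22, 16])

quotient = s // t      # whole-number part of each division
remainder = s % t      # what is left over from each division

# Rebuild s from the two pieces, one element at a time
rebuilt = quotient * t + remainder

print(rebuilt.equals(s))
```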