# Welcome and agenda

1. Getting started
    - What is data analysis?
    - What is Pandas?
    - Descriptive statistics
    - Using Pandas series
    - Broadcasting 
    - Mask arrays
    - Useful methods
2. Data types and data frames
    - dtypes (different types of data we can store)
    - `NaN` ("not a number")
    - Data frames
    - Querying with boolean/mask indexes
    - Reading CSV data
3. Real-world data
    - CSV
    - Online data
    - Sorting
    - Grouping
    - Pivot tables
    - Joining
    - Cleaning data
4. Text and dates
    - Text
    - Dates
5. Visualization
    - Line plots
    - Bar plots
    - Histograms
    - Pie plots
    - Scatter plots
    - Boxplots
    




# What is data?

If we're talking about data analysis, then we should really know what data is!

The nouns in the computer world are data.

- Files (lots of types of files)
- Logs (information about who did what, and when, on which computer)
- Preferences (e.g., Netflix)
- Store inventories and purchasing histories
    
Thanks to mobile devices, and computers, and companies all getting interconnected, there is a **LOT** of data out there.  

The problem isn't finding data. The problem is understanding what our data really means, and doing something useful with it.

The scientific method means: I ask a question, and I try to answer that question as best as possible, using techniques that others have demonstrated are reliable.  Data science is all about applying that method to the world of data.

I want to be able to ask a question, and then answer it.  I'll use data in order to do that.

Some examples:

- What products are selling best?
- What products are selling best in each country?  In each age demographic? In each country vs. each age demographic?
- Which employees are bringing in the most sales?
- Which universities' graduates earn the most money 10 years after graduating?
- Which stocks/bonds/investments have done best over the last 10 years? 50 years?

# Data science

The idea is: Use scientific principles to ask questions and answer them with data.

I divide the world of data science into three pieces:

- Data analysis — use the data we've collected to understand the past and present
- Data engineering — there's so much data out there, in so many different formats and sources, and getting it in a timely, organized way to our team's computers is hard -- data engineers solve these problems
- Machine learning — learn from the past, to make predictions about the future


How are we going to gather the data, and ask questions?  We're going to read it into data structures on our computer, and then use methods on those data structures to create queries.  We'll be using Python and Pandas in this course.

# Exercise: Think about data

What data does Amazon have about its products? What data does Amazon have about you? How does Amazon use this data in its business?

# Python -- why is this a language for data analytics?

In some ways, it's a very good choice:
- Easy to read
- Easy to learn
- Lots of support
- Open source (cheap or free)

In other ways, it's a very bad choice:
- It runs slowly (relative to many other languages)
- It uses lots of memory

Data analytics uses *lots* of data, often many hundreds of megabytes -- or more!   It's not at all unusual to have data sets that are several GB in size.  If the same data in Python is 10x bigger than the data in C, then you can handle more data in C, even if it's a harder language to work with.

So why is Python so popular for data analysis anyway?  The reason is something called "NumPy." This is a Python module, written mostly in C.  It lets us store data the way C does (compact in memory and fast to process), with a very thin layer of Python on top of it.  NumPy is fast and memory-efficient, but it still lets us write friendly Python code.

The best of both worlds!  (Almost)
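
To get a feel for the memory difference, here's a rough sketch (not a benchmark) comparing a Python list of integers with a NumPy array holding the same values. The size of 1,000 elements is an arbitrary choice of mine, and `sys.getsizeof` reports CPython-specific numbers, so treat the output as illustrative only.

```python
import sys
import numpy as np

n = 1_000                                  # arbitrary size, chosen for illustration
py_list = list(range(n))                   # a plain Python list of int objects
np_array = np.arange(n, dtype=np.int64)    # the same values in a NumPy array

# A list stores pointers to separate int objects, so count the list itself
# plus every int object it points to.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A NumPy int64 array stores the raw 8-byte integers back to back.
array_bytes = np_array.nbytes

print(f"Python list: {list_bytes:,} bytes")
print(f"NumPy array: {array_bytes:,} bytes")
```

The exact ratio depends on your Python version and the values involved, but the array's data should come out considerably smaller.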

NumPy can still be a bit low level and hard to work with.  We, in this class, will be using Pandas.  Pandas is (mostly) a wrapper around NumPy, making it far friendlier and easier to work with.

The Pandas library alone has an estimated 5 to 10 million users.  People at companies and organizations around the world are using Pandas more and more to analyze their data:

- E-commerce companies
- Manufacturers
- Banks and financial institutions
- Marketing companies



# Data structures in Pandas

Pandas mostly ignores Python data structures, in favor of its own:

- Series (1-dimensional data)
- Data frame (2-dimensional data)

Assuming that you have installed Pandas and Jupyter on your computer, you can say:

In [4]:
import numpy as np          # we will use, occasionally, some of the low-level NumPy functionality via "np"
import pandas as pd         # load the Pandas module into memory, and make it available via the "pd" namespace
from pandas import Series   # I also want to use the Series name by itself, rather than pd.Series

In [5]:
# Let's say I want to create a series

# I use a Python list of integers to create my series
s = Series([10, 20, 30, 40, 50, 60, 70])

In [6]:
type(s)   # in Jupyter, if an expression is on the final line of a cell, we get the value back w/o print

pandas.core.series.Series

In [7]:
# A series works a lot like a list, in many ways

s[0]

10

In [8]:
s[5]

60

In [9]:
t = Series([50, 40, 20, 30, 88, 22, 16])

In [11]:
# what happens when I add together two Python lists?
mylist1 = [10, 20, 30]
mylist2 = [40, 50, 60]

mylist1 + mylist2  # we get a new list -- all elements of mylist1, followed by all elements of mylist2

[10, 20, 30, 40, 50, 60]

In [12]:
# let's look at s and t

In [13]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [14]:
t

0    50
1    40
2    20
3    30
4    88
5    22
6    16
dtype: int64

In [15]:
# what happens when I add s and t together?

# we get a new series -- the new series has 7 elements, with the index 0-6
# the new series, at index 0, is s[0] + t[0]
# the new series, at index 1, is s[1] + t[1]

# the addition took place as vectors, element by element!

s + t

0     60
1     60
2     50
3     70
4    138
5     82
6     86
dtype: int64

In [16]:
s - t

0   -40
1   -20
2    10
3    10
4   -38
5    38
6    54
dtype: int64

In [17]:
s * t

0     500
1     800
2     600
3    1200
4    4400
5    1320
6    1120
dtype: int64

In [18]:
s / t

0    0.200000
1    0.500000
2    1.500000
3    1.333333
4    0.568182
5    2.727273
6    4.375000
dtype: float64

In [20]:
s % t  # remainder from dividing s/t

0    10
1    20
2    10
3    10
4    50
5    16
6     6
dtype: int64

# Exercise: Series operations

1. Create a series containing three elements — your birth year, month, and day.
2. Create a second series containing three elements -- birth year, month, and day -- from someone you know.
3. Show the difference between your two ages in years, months, and days.

In [22]:
me = Series([1970, 7, 14])
sis = Series([1971, 8, 22])

In [24]:
sis - me

0    1
1    1
2    8
dtype: int64

In [25]:
me = Series([1970, 07, 14])


SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers (2946636120.py, line 1)

# Different number bases

If you want to enter numbers in a different number base, you can do that in Python:

- base 2 (binary): Use 0b, as in 0b101010
- base 16 (hex): Use 0x, as in 0xab12cd34
- base 8 (octal): Use 0o, as in 0o12345
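
Here's a small sketch of those prefixes in action. The variable names are my own. It also shows one way around the leading-zero error: if a value like `07` arrives as text (say, from a date string), `int()` converts it without complaint.

```python
# Integer literals in other bases all produce ordinary Python ints.
binary = 0b101010          # base 2
hexadecimal = 0xab12cd34   # base 16
octal = 0o12345            # base 8

print(binary, hexadecimal, octal)   # all displayed in base 10

# A string with a leading zero is fine, because int() parses it as base 10.
month = int("07")
print(month)
```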

# Next up:

- Descriptive statistics
- Aggregate methods
- Five-number summary
- Mean and standard deviation

# Lists vs. arrays

Python normally uses lists.  Many people say to me, "Oh, those are just arrays, right?"  No, because there are two basic differences between lists and arrays:

1. Arrays have a set length that cannot be changed once they're created.
2. All elements in an array must be of the same type.

Since Pandas series are based on NumPy arrays, you might think that they cannot have different types in them.  But it turns out that they can: a series with a dtype of `object` stores references to ordinary Python objects, which can be of any type.

In [26]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [27]:
s = Series([10.5, 20.3, 30.4, 40, 50])
s

0    10.5
1    20.3
2    30.4
3    40.0
4    50.0
dtype: float64

In [28]:
# a dtype of "object" means: Something in the Python world, I don't know what
s = Series(['a', 'bc', 'defg', 'hi'])
s

0       a
1      bc
2    defg
3      hi
dtype: object

In [29]:
# once we have a dtype of object, anything is possible
# we can have a mix, then, and Pandas will live with it
# but we should try to avoid it

s = Series([10, 20, 'abc', 'def', 30.5, 60.8, [1,2,3,4]])

In [30]:
s

0              10
1              20
2             abc
3             def
4            30.5
5            60.8
6    [1, 2, 3, 4]
dtype: object

In [31]:
# you can install NumPy using pip, just as you did for Pandas:
# pip install numpy

# this is not a Python command -- you need to run it on the command line

In [33]:
# ages of people in my family

# how can we describe this?

s = Series([52, 50, 21, 19, 16])

In [35]:
# How can we describe a bunch of data points, in a way that will make sense?

# one possibility: mean (average) -- sum, divided by the number of data points

# Pandas series have a bunch of methods for this, and other *descriptive statistics*.

s.mean()   # mean is great... but also flawed, because it is easily skewed with one or two outliers

31.6

In [36]:
# very often, we don't use the mean.  Rather, we use the *median*.
# the median is calculated in this way: Take all values, from the smallest to the largest,
# and line them up, sorted.  Take the middle value or (if there is an even number of values)
# the average (mean) of the two middle values

In [37]:
s.median()

21.0
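
To make the description of the median concrete, here's a sketch that follows those steps by hand. The function name `median_by_hand` and the four-element list are my own, added only for illustration. In practice, `s.median()` is the way to go.

```python
from pandas import Series

def median_by_hand(values):
    """Sort the values, then take the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                      # odd count: a single middle value
        return float(ordered[middle])
    # even count: mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

ages = [52, 50, 21, 19, 16]
even_ages = [52, 50, 21, 19]            # hypothetical: the same ages, minus one

print(median_by_hand(ages))             # matches Series(ages).median()
print(median_by_hand(even_ages))        # mean of 21 and 50
```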

In [38]:
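# why are mean and median floats, when the numbers in s are integers?
# the mean divides a sum by a count, and in Python 3, division with / always returns a float,
# even if the numbers are all integers; Pandas returns the median as a float, too
# if you want an integer result, use //, known as "floordiv," which rounds down (toward negative infinity)
# so 7 // 2 is 3, but -7 // 2 is -4 (not -3)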
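
A quick sketch of how `/` and `//` behave in plain Python, including a negative number, where flooring and simply chopping off the decimals give different answers. The variable names are mine.

```python
true_div = 7 / 2          # / always returns a float
floor_pos = 7 // 2        # // rounds down to an integer
floor_neg = -7 // 2       # "down" means toward negative infinity
truncated = int(-7 / 2)   # int() chops off the decimals, moving toward zero

print(true_div, floor_pos, floor_neg, truncated)
```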

In [39]:
# how do I get help on things in Pandas?
# My favorite way: help(function)

help(s.median)

Help on method median in module pandas.core.generic:

median(axis: 'int | None | lib.NoDefault' = <no_default>, skipna=True, level=None, numeric_only=None, **kwargs) method of pandas.core.series.Series instance
    Return the median of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    scalar or Series (if level specified)



In [40]:
# standard deviation tells us how much the data varies from the mean
# a standard deviation of 0 means: all data points are the same
# often we talk about the range from (mean - 1 sd) to (mean + 1 sd)

s.std()   # calculate the standard deviation on our data

17.812916661793487
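
If you're curious where that number comes from, here's a sketch of the calculation done by hand. By default, `s.std()` computes the *sample* standard deviation, meaning that it divides by the number of values minus 1. The intermediate variable names are my own.

```python
import math
from pandas import Series

s = Series([52, 50, 21, 19, 16])

mean = s.mean()
squared_diffs = (s - mean) ** 2                        # squared distance of each value from the mean
sample_variance = squared_diffs.sum() / (len(s) - 1)   # divide by n - 1, not n
manual_std = math.sqrt(sample_variance)

print(manual_std)
print(s.std())     # should agree with the manual calculation
```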

In [41]:
s.mean()

31.6

In [42]:
s.mean() + s.std()

49.41291666179349

In [43]:
s.mean() - s.std()

13.787083338206514

In [44]:
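# if the data were normally distributed, about 68% of the data points would fall
# between 13.8 and 49.4 (mean - 1 sd and mean + 1 sd)
# with only five values, we shouldn't expect an exact match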
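
One way to check this against our actual data is to count how many values fall within one standard deviation of the mean. This is only a sketch, and it uses a boolean comparison on the series, which we'll discuss properly later on. The names `low`, `high`, and `within_one_std` are mine. Keep in mind that the 68% figure describes normally distributed data, so a tiny series like this one won't necessarily match it.

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16])

low = s.mean() - s.std()
high = s.mean() + s.std()

# True wherever a value lies between low and high, False elsewhere
within_one_std = (s >= low) & (s <= high)

print(within_one_std.sum())    # how many values are in the range
print(within_one_std.mean())   # what fraction of the values are in the range
```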

In [45]:
s

0    52
1    50
2    21
3    19
4    16
dtype: int64

In [46]:
s.mean() + 2 * s.std()

67.22583332358698

In [47]:
s.mean() - 2 * s.std()

-4.025833323586973

# Descriptive statistics

We can now understand our data in a variety of ways:

- Mean, which tells us the arithmetic average — easy to understand, and often useful, and the reference point for the standard deviation. However, it's easy to skew with outliers.
- Standard deviation, which tells us how much the data spreads out from the mean. A high std means that the data is all over the place, whereas a low one means that it's concentrated around the mean.
- Median, which is the middle value.

You can also think of the median as the 50% point in our data, if we line things up from lowest to highest.  We can similarly calculate the 25% mark and the 75% mark in our data.  These are known as the first and third quartiles, often written Q1 and Q3.  Pandas lets us calculate these pretty easily:

In [48]:
s.quantile(0.25)

19.0

In [49]:
s.quantile(0.5)  # median

21.0

In [50]:
s.quantile(0.75)

50.0

In [51]:
s

0    52
1    50
2    21
3    19
4    16
dtype: int64

In [52]:
# we can also see how much the data is spread out using the IQR (inter-quartile range)
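
The IQR is the distance between the 75% mark and the 25% mark. As far as I know, there's no dedicated method for it on a series, so this sketch just subtracts one quantile from the other. The names `q1`, `q3`, and `iqr` are mine.

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16])

q1 = s.quantile(0.25)   # 25% mark
q3 = s.quantile(0.75)   # 75% mark
iqr = q3 - q1           # the middle half of the data spans this distance

print(iqr)
```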

In [53]:
# we might also want to know the minimum and maximum values in our data:

s.min()

16

In [54]:
s.max()

52

# Descriptive statistics summary

For a series `s`, we can call:

- `s.mean()`
- `s.std()`
- `s.quantile(0.25)`
- `s.quantile(0.50)` or `s.median()`
- `s.quantile(0.75)`
- `s.min()`
- `s.max()`

# Exercise: Weather forecast

1. Go to a Web site that shows the 10-day forecast for your area. Create a series containing the high temperatures for each day in the next 10 days.
2. Calculate each of the descriptive statistics.  Are there any obvious outliers (very hot or very cold) in the coming 10 days?

In [56]:
help(s.quantile)

Help on method quantile in module pandas.core.series:

quantile(q=0.5, interpolation='linear') method of pandas.core.series.Series instance
    Return value at the given quantile.
    
    Parameters
    ----------
    q : float or array-like, default 0.5 (50% quantile)
        The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
    interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
        This optional parameter specifies the interpolation method to use,
        when the desired quantile lies between two data points `i` and `j`:
    
            * linear: `i + (j - i) * fraction`, where `fraction` is the
              fractional part of the index surrounded by `i` and `j`.
            * lower: `i`.
            * higher: `j`.
            * nearest: `i` or `j` whichever is nearest.
            * midpoint: (`i` + `j`) / 2.
    
    Returns
    -------
    float or Series
        If ``q`` is an array, a Series will be returned where the
        index is ``q

In [57]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32])

In [58]:
s.min()

32

In [59]:
s.max()

34

In [60]:
s.mean()

33.2

In [61]:
s.median()

33.5

In [62]:
s.std()

0.9189365834726815

In [63]:
s.mean() - s.std()

32.281063416527324

In [64]:
s.mean() + s.std()

34.11893658347268

In [65]:
# we can get all of these descriptive statistics with the "describe" method

s.describe()

count    10.000000
mean     33.200000
std       0.918937
min      32.000000
25%      32.250000
50%      33.500000
75%      34.000000
max      34.000000
dtype: float64

In [66]:
# I'm going to use NumPy to generate some random numbers, and use those to create a series

np.random.seed(0)   # always start random numbers from the same place, so we all get the same values
s = Series(np.random.randint(0, 100, 10))   # get 10 random ints from 0 to 99 (the upper bound, 100, is excluded), and create a series from them

In [67]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [68]:
# to retrieve a value at index n, I could use s[n]
# but it's much better to use s.loc[n]

s.loc[5]

9

In [70]:
# note the double square brackets!
# outer ones are for s.loc
# inner ones mean: I'm asking for 3 values

s.loc[[3, 5, 7]]  # this is a "fancy" index, where I ask for three different values

3    67
5     9
7    21
dtype: int64

In [72]:
# I can also use a slice (similar to Python, with one important difference)

s.loc[4:7]  # starting at 4, up to *and including* 7 -- unlike a Python slice, .loc includes the end

4    67
5     9
6    83
7    21
dtype: int64
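
Here's a sketch comparing the two kinds of slices side by side: a Python list slice stops just before its end position, while a `.loc` slice includes its end label. The `values` list is my own name, holding the same numbers as the series above.

```python
from pandas import Series

values = [44, 47, 64, 67, 67, 9, 83, 21, 36, 87]
s = Series(values)

list_slice = values[4:7]    # positions 4, 5, and 6 -- the end is excluded
loc_slice = s.loc[4:7]      # labels 4, 5, 6, and 7 -- the end is included

print(list_slice)
print(loc_slice)
```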

In [73]:
# can I set values in this way? YES!

s.loc[5] = 999
s

0     44
1     47
2     64
3     67
4     67
5    999
6     83
7     21
8     36
9     87
dtype: int64

In [74]:
# if I assign a single value to a fancy index, all of the selected elements are changed to the new value

s.loc[[3,5,7]] = 888


In [75]:
s

0     44
1     47
2     64
3    888
4     67
5    888
6     83
7    888
8     36
9     87
dtype: int64

In [76]:
# this is the series I created with the high temps in Modi'in (my city)

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32])

In [77]:
s

0    33
1    34
2    34
3    34
4    34
5    34
6    33
7    32
8    32
9    32
dtype: int64

In [78]:
# I don't want to have an index that's 0-9.  I want an index that is easier to understand.
# For example, MMDD

# Pandas allows for this!

# pass index = and a list of values (strings, integers, anything!)
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

In [79]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [80]:
# what will the weather be like on August 9th?

s.loc['0809']

32

In [81]:
# what will the high temperature be on August 8th through 10th?

s.loc['0808':'0810']  # use a slice

0808    33
0809    32
0810    32
dtype: int64

In [82]:
# or, if I prefer, just state them explicitly with a fancy index
s.loc[['0808', '0809', '0810']]

0808    33
0809    32
0810    32
dtype: int64

In [83]:
# I can also assign a different value to each of the selected elements
s.loc[['0808', '0809', '0810']] = 98, 99, 100

In [84]:
s

0802     33
0803     34
0804     34
0805     34
0806     34
0807     34
0808     98
0809     99
0810    100
0811     32
dtype: int64

In [85]:
# restore the original forecast values
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

# Retrieving via indexes

We've now seen that we can use `.loc` to retrieve one or more values from our series, using the index.  With the default index (0, 1, 2, and so on), the index values happen to match the positions.  With a custom index, whether strings or integers, `.loc` looks values up by the index values we provided.

But what if we want to retrieve by position, even though we have our own string index? We can do that with the `.iloc` accessor.  It works much like `.loc`, but uses the numeric positions (starting at 0), rather than our own custom index.  One difference: a slice with `.iloc` excludes its end position, just like a regular Python slice.
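
Here's a short sketch of `.iloc` on the forecast series, next to the equivalent `.loc` calls. The variable names on the left-hand side are mine, added so the results can be compared.

```python
from pandas import Series

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
           index=['0802', '0803', '0804', '0805', '0806',
                  '0807', '0808', '0809', '0810', '0811'])

by_position = s.iloc[7]             # the value at position 7
by_label = s.loc['0809']            # the same value, retrieved via the index

fancy = s.iloc[[6, 7, 8]]           # fancy indexing works with positions, too

iloc_slice = s.iloc[6:9]            # positions 6, 7, 8 (end excluded)
loc_slice = s.loc['0808':'0810']    # labels '0808' through '0810' (end included)

print(by_position, by_label)
print(fancy)
print(iloc_slice)
```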