# Agenda: Week 2

1. Recap
2. Dtypes (the types of data that we can use in Pandas)
3. `NaN` ("not a number")
4. DataFrames (2D data structures that we use in Pandas)
5. Adding and removing data in our data frames
6. Useful methods in our data frames
7. Querying with boolean indexes
8. Querying with `loc`
9. Reading CSV data (meaning: Real-world data into our data frames)

# Last week, quick recap

1. Pandas has two main data structures
    - series (1D)
    - data frame (2D)
2. We can create a series by passing a Python list (or a similar iterable, such as a NumPy array)
3. We can assign an index of our choosing to our series
    - By default, it numbers the elements starting at 0, just like a list/string/tuple
    - If we assign our own index, we can use integers, floats, or even strings
4. Retrieving
    - Retrieve via the index using `.loc`
    - Retrieve via the numeric position (starting at 0) with `.iloc`
    - We can use slices
5. Broadcasting
    - If we apply an operator to a series and a single value, that value is "broadcast" to every element of the series.  We get back a new series, containing the result of applying the operator to each element and that value.
    - If we use broadcasting with `==` or other comparison operators, we get back a "boolean series," containing only `True` and `False` values
6. Boolean indexing
    - If we apply a boolean index to an existing series, then only those items that match a `True` value are returned.  Those matching a `False` value are ignored.
    - This is why we don't need to use `if` or `for` in Pandas.  We use boolean indexes to find all of the values that match what we want.
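
As a quick refresher, here is a minimal sketch of broadcasting and boolean indexing together. The series `demo` and its values are made up just for this recap.

```python
import pandas as pd

# Hypothetical data, just for this recap
demo = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

doubled = demo * 2       # broadcasting: every element is multiplied by 2
mask = demo > 25         # broadcasting a comparison gives a boolean series
selected = demo[mask]    # boolean indexing keeps only the True positions

print(doubled)
print(mask)
print(selected)
```

Notice that there is no `for` loop and no `if` statement anywhere; the comparison and the selection both work on the whole series at once.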

In [1]:
import numpy as np   # not critical, but might come in handy
import pandas as pd  # uses the standard alias "pd"
from pandas import Series, DataFrame  # this lets me use the names without pd. before them

In [3]:
s = Series([10, 20, 30, 20, 30, 40, 20, 30, 40, 50]) # default, numeric index
s

0    10
1    20
2    30
3    20
4    30
5    40
6    20
7    30
8    40
9    50
dtype: int64

In [4]:
s.mean()  # what is the mean of our series?

29.0

In [6]:
s.sum() / s.count()

29.0

In [7]:
# to retrieve an item, I can use either .loc or .iloc  (they'll be the same here)
s.loc[4]

30

In [8]:
s.iloc[8]

40

In [9]:
# If I want, I can define my own index on the series
s.index = list('abcdefghij')  # 10 elements require 10 index items

In [10]:
s

a    10
b    20
c    30
d    20
e    30
f    40
g    20
h    30
i    40
j    50
dtype: int64

In [11]:
s.loc['c']    # .loc uses the index

30

In [13]:
s.iloc[6]      # .iloc uses the position

20

In [14]:
s.head()   # show me the first 5 elements in s

a    10
b    20
c    30
d    20
e    30
dtype: int64

In [15]:
s.tail()   # show me the final 5 elements in s

f    40
g    20
h    30
i    40
j    50
dtype: int64

In [16]:
s.value_counts()  # how often is each value found in s?

20    3
30    3
40    2
10    1
50    1
dtype: int64

In [17]:
s.describe()   # gives me the descriptive statistics for s

count    10.00000
mean     29.00000
std      11.97219
min      10.00000
25%      20.00000
50%      30.00000
75%      37.50000
max      50.00000
dtype: float64

# Dtypes 

We see a Pandas series, and we think of it as a Python data structure. But really, we're seeing only half of the actual data structure. That half exists in Python, and it's the half we use.  The actual data is in the other half, sort of like the submerged part of an iceberg, and it is implemented in the C language.  Why? Because C is fast and efficient.

So the data isn't being stored as Python objects at all!  We don't get to use our familiar Python types, such as `int` or `float`.

Instead, we need to think like C programmers, at least a little bit.  In C, every integer type has a fixed size: there are 8-bit ints, 16-bit ints, 32-bit ints, etc.  The more bits, the wider the range of values (larger positive numbers, and more negative ones) that we can handle.  But the more bits, the more memory each value takes up.

In Pandas, we think about these as "dtypes".

In [18]:
s = Series([10, 20, 30, 40, 50])  # Pandas sees only integers, so it guesses what dtype we want

In [19]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [20]:
# The highest number we can store in a signed 64-bit int is 2**63 - 1
# (one bit is used for the sign, so that we can have negative + positive numbers)
2**63 - 1

9223372036854775807

# How to decide what dtype to use

- How big/small will the numbers be that you'll be dealing with?
- The bigger the dtype, the more numbers you can handle
- But the bigger the dtype, the more memory it'll use

If I have 1 billion integers in my series:
- 64-bit integers (8 bytes each) will need about 8 GB of memory to store this series
- 32-bit integers (4 bytes each) will need about 4 GB
- 8-bit integers (1 byte each) will need only about 1 GB
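
To see how the dtype translates into memory, we can ask a series for its `nbytes` attribute. This is a small sketch with made-up data (100 integers); the point is the number of bytes per element, which scales up to any series size.

```python
import numpy as np
import pandas as pd

values = list(range(100))   # made-up data: 100 small integers

sizes = {}
for dtype in (np.int64, np.int32, np.int8):
    s = pd.Series(values, dtype=dtype)
    # nbytes counts only the stored values, not the index
    sizes[str(s.dtype)] = s.nbytes
    print(s.dtype, s.nbytes, 'bytes')
```

With 100 elements, we should see 8, 4, and 1 bytes per element, respectively.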

In [21]:
# how can I specify the dtype, if I don't like what Pandas is choosing by default?

s = Series([10, 20, 30, 40, 50], dtype=np.int32)  # notice: dtypes are from NumPy

In [22]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [23]:
s = Series([10, 20, 30, 40, 50], dtype=np.int8) 

In [24]:
# 8-bit integers

s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [25]:
# what happens if I do this?
# int8 can only hold -128 through 127, so results outside that range silently wrap around

s * 10

0    100
1    -56
2     44
3   -112
4    -12
dtype: int8

In [26]:
s.loc[0] = 127
s

0    127
1     20
2     30
3     40
4     50
dtype: int8

In [27]:
s + 1

0   -128
1     21
2     31
3     41
4     51
dtype: int8
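
Before choosing a small dtype, it can help to check its range, so that we aren't surprised by this kind of wraparound. NumPy's `np.iinfo` reports the minimum and maximum for any integer dtype; the following is just a quick sketch of how we might use it.

```python
import numpy as np

limits = {}
for dtype in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(dtype)          # integer dtype information
    limits[dtype.__name__] = (info.min, info.max)
    print(dtype.__name__, info.min, info.max)
```

If the values we expect (including the results of calculations) don't fit between the minimum and maximum, we need a bigger dtype.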

# What dtypes are available?

int, signed integers (both positive and negative)
- `np.int8`  (or `np.dtype('int8')`)
- `np.int16`
- `np.int32`
- `np.int64`

uint, unsigned integers (zero and positive only)
- `np.uint8`
- `np.uint16`
- `np.uint32`
- `np.uint64`

Floating-point numbers
- `np.float16`
- `np.float32`
- `np.float64`
- `np.float128` (only available on some platforms)

In [28]:
# how can I find out the dtype of a series?  Just ask it:

s.dtype

dtype('int8')

In [29]:
s

0    127
1     20
2     30
3     40
4     50
dtype: int8

In [30]:
# I now realize that I'm going to need a bigger dtype for my numbers in s.
# what can I do?

# option 1: assign to the dtype -- not possible in Pandas!
s.dtype = np.int16

AttributeError: can't set attribute 'dtype'

In [32]:
# option 2: Create a new series, based on the existing one, with a new dtype
# the way we do this is with the "astype" method

# just run "astype" on a series, passing as an argument the dtype you want
# this returns a new series -- it doesn't change s!
s.astype(np.int16)

0    127
1     20
2     30
3     40
4     50
dtype: int16

In [33]:
# this is how we really change the dtype
s = s.astype(np.int16)

In [34]:
s

0    127
1     20
2     30
3     40
4     50
dtype: int16

In [36]:
# Pandas automatically detects that it needs to use floating-point numbers
# because some of the input list's elements are floats

# So it chooses float64 -- not a bad choice!
s = Series([10, 20, 30.5, 40, 50.8])
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.8
dtype: float64

In [38]:
# I want to remove the floating-point parts from s
# this basically runs "int" on every element, so values are truncated, not rounded (50.8 becomes 50)
s.astype(np.int64)

0    10
1    20
2    30
3    40
4    50
dtype: int64
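
Because `astype` truncates toward zero, we might prefer to round first. Here is a small sketch comparing the two approaches; the series `measurements` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

measurements = pd.Series([-1.7, 1.7, 50.8])   # made-up values

truncated = measurements.astype(np.int64)          # drops the fractional part
rounded = measurements.round().astype(np.int64)    # rounds to the nearest integer first

print(truncated)
print(rounded)
```

Which one we want depends on the data, but it's worth making the choice deliberately.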

# What happens if I create a series of strings?



In [39]:
# this created a series of *strings*
# the dtype here is ... object -- meaning, Python objects, handled by Python, not NumPy

s = Series('this is the most amazing course ever!'.split())
s

0       this
1         is
2        the
3       most
4    amazing
5     course
6      ever!
dtype: object

In [40]:
s = Series('10 20 30'.split())
s

0    10
1    20
2    30
dtype: object

In [41]:
s + s   # what will I get back?

0    1010
1    2020
2    3030
dtype: object

In [42]:
# get integers from our strings, using astype

s.astype(np.int8)

0    10
1    20
2    30
dtype: int8

In [45]:
# to use astype with object, use "object" not "np.object" (which was removed from recent NumPy versions)
s.astype(object)

0    10
1    20
2    30
dtype: object
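
`astype` works when every string looks like a number, but it raises an error if even one doesn't. For messier input, one option is `pd.to_numeric` with `errors='coerce'`, which turns anything it can't parse into `NaN` (which we'll discuss below). This is a sketch with made-up input, not the only way to handle bad values.

```python
import pandas as pd

raw = pd.Series(['10', '20', 'n/a', '30'])   # made-up input with one bad value

# errors='coerce' replaces unparseable strings with NaN instead of raising an error
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)
```

Because `NaN` is a float, the result here has a floating-point dtype rather than an integer one.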

# Exercise: Calculating from text input

1. Define a Python string containing integers, separated by spaces.
2. Define a Pandas series whose elements are strings, based on that string.
3. Calculate the mean of the numbers in the series.

In [46]:
text = '10 20 30 20 30 40 50 60'

s = Series(text.split())
s

0    10
1    20
2    30
3    20
4    30
5    40
6    50
7    60
dtype: object

In [53]:
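# option 1: tell it the dtype at creation time, by passing dtype= to Series
# here we haven't passed one, so the elements are still strings (dtype object)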
s = Series(text.split())
s

0    10
1    20
2    30
3    20
4    30
5    40
6    50
7    60
dtype: object

In [57]:
# option 2 (and better?) is to use astype, to convert from one dtype to another
s = s.astype(np.int32)
s

0    10
1    20
2    30
3    20
4    30
5    40
6    50
7    60
dtype: int32
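
To finish the exercise, we still need the mean. Putting the steps together, a compact version of the solution might look like this (the variable name `numbers` is just for illustration):

```python
import numpy as np
import pandas as pd

text = '10 20 30 20 30 40 50 60'

# split into strings, convert to integers, then calculate the mean
numbers = pd.Series(text.split()).astype(np.int32)
result = numbers.mean()
print(result)
```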

In [58]:
s = Series([10, 20, 30.5, 40, 50])    # because there is a single float value, the dtype will be np.float64
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [62]:
# what if I start with an int dtype?

s = Series([10, 20, 30, 40 ,50])  # int64 dtype, because that's the default
s.loc[3] = 40.5                   # set an element to be a float

# what's the dtype now?  float64 -- pandas changes the dtype, to accommodate what we want to add
s.dtype


dtype('float64')

In [63]:
s

0    10.0
1    20.0
2    30.0
3    40.5
4    50.0
dtype: float64

In [64]:
s.loc[1] = 'hello'  # what will happen to the dtype now?


In [65]:
s

0     10.0
1    hello
2     30.0
3     40.5
4     50.0
dtype: object
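
Letting an assignment change the dtype behind our backs can hide mistakes, and some newer versions of Pandas warn about assignments like these. A more explicit approach, sketched here, is to convert the series with `astype` before assigning a value that wouldn't otherwise fit.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])   # integer dtype by default
s = s.astype(np.float64)              # convert deliberately, before assigning a float
s.loc[3] = 40.5                       # the new value now matches the dtype

print(s.dtype)
print(s)
```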

# NaN

`NaN` is a float, and it represents a missing value: a number that we don't have, or one that we can't (or won't want to) use in calculations.
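
A few properties of `NaN` are worth seeing directly. This short sketch uses `np.nan` along with `pd.isna`, which is the reliable way to test for it; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

nan_type = type(np.nan)            # NaN really is a float
equal_to_itself = np.nan == np.nan # NaN is never equal to anything, even itself
detected = pd.isna(np.nan)         # so we use isna to check for it
total = np.nan + 1                 # arithmetic with NaN gives NaN

print(nan_type, equal_to_itself, detected, total)
```

Because `==` doesn't work for finding `NaN`, we can't use a regular comparison to search for it in a series.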

In [70]:
# if we want to take the mean of five test scores, or four if they were absent for one test,
# then how do we record the fifth score if they were absent?

scores = Series([95, 89, 92, 86, 0])     # 0 is a bad idea
scores.mean()

72.4

In [71]:
# a better idea is to use np.nan (older NumPy versions also accepted the alias np.NaN)
# this value means "not a number" -- we can't calculate with it

scores = Series([95, 89, 92, 86, np.nan]) 

In [72]:
scores

0    95.0
1    89.0
2    92.0
3    86.0
4     NaN
dtype: float64

In [74]:
scores.mean()  # will this work?  Yes -- by default, most methods in Pandas ignore NaN values

90.5

In [75]:
# let's ask the mean method what it does

help(scores.mean)    # notice -- I'm not running the method, so no () after its name

Help on method mean in module pandas.core.generic:

mean(axis: 'int | None | lib.NoDefault' = <no_default>, skipna=True, level=None, numeric_only=None, **kwargs) method of pandas.core.series.Series instance
    Return the mean of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    scalar or Series (if level specified)



In [76]:
# what if I want to be very strict, and keep all NaN values when I'm calculating the mean?
scores.mean(skipna=False)

nan

# No matter what you do, you have to deal with `NaN`

Nearly any data set you have will include `NaN`:
- Faulty measurements
- Lost records
- People didn't respond
- People didn't respond on time
- Things are being processed
- The data set was changed, and earlier records lack certain values

Some ways to deal with `NaN`:
1. Ignore it. Many methods (e.g., `mean`) skip `NaN` values by default.
2. Remove it.  The `dropna` method returns a new series, just like the original, but without any `NaN` values.
3. Replace it. The `fillna` method returns a new series, just like the original, but with `NaN` values replaced by something else.  It's often a good idea to use the mean here, but you don't have to.
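
Before choosing among these options, it often helps to know how many values are missing. Here's a sketch using the `isna` and `notna` methods, which return boolean series; the series `readings` is made up for illustration.

```python
import numpy as np
import pandas as pd

readings = pd.Series([10, np.nan, 30, np.nan, 50])   # made-up data

missing = readings.isna()            # boolean series: True wherever there's a NaN
n_missing = missing.sum()            # True counts as 1, so this counts the NaN values
share_missing = missing.mean()       # fraction of the values that are NaN
present = readings[readings.notna()] # boolean indexing: the rows that dropna() would keep

print(n_missing, share_missing)
print(present)
```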

In [77]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

s.loc[[2, 5, 7, 9]] = np.nan

In [78]:
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
5     NaN
6    70.0
7     NaN
8    90.0
9     NaN
dtype: float64

In [79]:
s.count()  # how many non-NaN values do we have?

6

In [80]:
s.describe()

count     6.000000
mean     46.666667
std      30.110906
min      10.000000
25%      25.000000
50%      45.000000
75%      65.000000
max      90.000000
dtype: float64

In [82]:
# to remove it, we can use s.dropna

s.dropna()  # returns a new series, with the index of s, and without s's NaN values

0    10.0
1    20.0
3    40.0
4    50.0
6    70.0
8    90.0
dtype: float64

In [83]:
# to change it, we can use s.fillna

s.fillna(3)

0    10.0
1    20.0
2     3.0
3    40.0
4    50.0
5     3.0
6    70.0
7     3.0
8    90.0
9     3.0
dtype: float64

In [84]:
# very common to use the mean for filling -- it's often less wrong than other values would be
s.fillna(s.mean())

0    10.000000
1    20.000000
2    46.666667
3    40.000000
4    50.000000
5    46.666667
6    70.000000
7    46.666667
8    90.000000
9    46.666667
dtype: float64

In [85]:
# if I run s.fillna(n), then we get back a new series, just like s,
# but in which all values of NaN have been replaced by n

s.fillna(999)

0     10.0
1     20.0
2    999.0
3     40.0
4     50.0
5    999.0
6     70.0
7    999.0
8     90.0
9    999.0
dtype: float64

# Exercise: Descriptive statistics with missing data

1. Create a series with 10 elements, whose values are the high temperatures forecast for the next 10 days where you live.
2. Assign `NaN` to several of these elements.
3. Replace `NaN` with the mean of the remaining elements.
4. After you replace `NaN` with the mean, has the mean shifted? 

In [86]:
s = Series([32, 33, 32, 32, 32, 32, 32, 33, 34, 35, 35, 36, 35, 35])
s

0     32
1     33
2     32
3     32
4     32
5     32
6     32
7     33
8     34
9     35
10    35
11    36
12    35
13    35
dtype: int64

In [87]:
# fancy indexing to set NaN on a bunch of elements
s.loc[[3, 7, 8, 12]] = np.nan
s

0     32.0
1     33.0
2     32.0
3      NaN
4     32.0
5     32.0
6     32.0
7      NaN
8      NaN
9     35.0
10    35.0
11    36.0
12     NaN
13    35.0
dtype: float64

In [89]:
s.mean()   # mean of the non-NaN elements

33.4

In [90]:
s.fillna(s.mean())   # replace the NaN values with the mean 

0     32.0
1     33.0
2     32.0
3     33.4
4     32.0
5     32.0
6     32.0
7     33.4
8     33.4
9     35.0
10    35.0
11    36.0
12    33.4
13    35.0
dtype: float64

In [91]:
s = s.fillna(s.mean())   # replace the s (with NaN values) with a new s (without NaN values)

In [93]:
s.mean()

33.4
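
So the mean hasn't shifted. But that doesn't mean nothing has changed: filling with the mean adds values with no deviation from the mean, so the standard deviation shrinks. Here's a sketch using the same temperatures as above, with the names `temps` and `filled` chosen just for this example.

```python
import numpy as np
import pandas as pd

temps = pd.Series([32, 33, 32, np.nan, 32, 32, 32,
                   np.nan, np.nan, 35, 35, 36, np.nan, 35])
filled = temps.fillna(temps.mean())

print(temps.mean(), filled.mean())   # the mean is (essentially) unchanged
print(temps.std(), filled.std())     # the standard deviation gets smaller
```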

# Next up:

- Data frames!
- Adding and removing data in our data frames

# Data frames

Data frames are 2-dimensional

- They have rows, with each row labeled by an index, the same kind of index we have used for our series to date
- They have columns, with each column labeled as well.  The column labels are stored in the same kind of index object, known as the "columns."

Each column in a data frame is a series object.  Meaning: Each column can have its own distinct dtype.
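
To make that concrete, here is a minimal sketch of a data frame whose columns have different dtypes. The frame `df_demo`, its column names, and its values are all made up for illustration; it's built from a dict, which is one of several ways to create a data frame.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column
df_demo = pd.DataFrame({'name': ['x', 'y', 'z'],
                        'qty': [1, 2, 3],
                        'price': [1.5, 2.5, 3.5]})

print(df_demo.dtypes)            # one dtype per column
print(type(df_demo['price']))    # retrieving a column gives us a series
```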

In [94]:
# let's create a simple data frame, using a list of lists

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]])
df

    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


In [95]:
# let's create our data frame again, labeling the rows and the columns

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
              index=list('xyz'),
              columns=list('abcd'))
df

    a    b    c    d
x  10   20   30   40
y  50   60   70   80
z  90  100  110  120


In [96]:
# get the index from df
df.index

Index(['x', 'y', 'z'], dtype='object')