# Agenda

1. Overview of data science
2. Jupyter
3. NumPy + NumPy arrays
4. Pandas
    - Series (1-dimensional data)
    - Data frames (2-dimensional data)
    - time and Pandas
    - visualization + plots
5. Machine learning
    - Classification models
    - Regression models
    - Clustering
    - testing our models
    - a few different algorithms

In [1]:
print('Hello, out there!')  # shift + enter to execute the cell

Hello, out there!


In [2]:
x = 100   # enter == go down one line
y = 200   # shift+enter == execute the cell

In [3]:
x + y     # variables persist across cells... also, the final line of a cell, if an expression, is returned

300

# Jupyter modes

- If you click in the cell, and you have a green outline, you're in "edit mode." Typing will go into the cell. Use shift+enter to execute/finalize the cell. From command mode, you can also press [ENTER] to get into edit mode.
- If you click to the left of the cell, and you have a blue outline, you're in "command mode." Typing will go to Jupyter, which will take your keystrokes as commands. You can also press [ESC] to get into command mode.

In [4]:
x * y

20000

In [5]:
x / y

0.5

In [6]:
# Magic commands -- they all start with %

%whos

Variable   Type    Data/Info
----------------------------
x          int     100
y          int     200


In [7]:
%ls

'Cisco - 2021-12Dec-13-datascience.ipynb'


In [8]:
%pwd

'/Users/reuven/Courses/Current/Cisco-2021-12Dec-13-datascience'

In [9]:
# if you see a * between [] in the "In" of your cell, Jupyter might be
# wedged and need a kick/interrupt or even to restart.

# This is available in the Kernel menu at the top.


x + y  

300

# NumPy

NumPy provides us with an efficient array type, known as an `ndarray`, short for `n-dimensional array`. It can handle any number of dimensions, and is thus very popular in math, science, and engineering.

In [10]:
x = 0
import sys
sys.getsizeof(x)  # how many bytes does our 0 consume?

24

In [11]:
x = 10
sys.getsizeof(x)

28
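
To put those per-object sizes in context, here is a minimal sketch comparing them with the number of bytes a NumPy array uses per element. The variable names are illustrative, and the exact `sys.getsizeof` figure depends on the Python version and platform, so no output is shown.

```python
import sys
import numpy as np

numbers = list(range(1000))
arr = np.array(numbers, dtype=np.int64)

# each Python int is a full object, with its own per-object overhead
bytes_per_python_int = sys.getsizeof(numbers[10])

# each array element is a raw 8-byte value
bytes_per_array_element = arr.itemsize

# total bytes used by the array's data buffer
total_array_bytes = arr.nbytes

print(bytes_per_python_int, bytes_per_array_element, total_array_bytes)
```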

In [1]:
# deprecated 
# %pylab inline   

import numpy as np   # covered by %pylab inline
from numpy import *

import matplotlib.pyplot as plt

%matplotlib inline

In [16]:
plt

<module 'matplotlib.pyplot' from '/usr/local/lib/python3.10/site-packages/matplotlib/pyplot.py'>

In [2]:
np.ndarray

numpy.ndarray

In [4]:
type(np.ndarray)  # a new data structure, a new type!

type

In [5]:
# normally, in Python, we create a new object by calling (invoking) its type
# don't do that with np.ndarray!
# rather, call np.array, and pass it an iterable (list) of numbers

a = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90,100])

In [6]:
# this is the repr (inherent object display) for a
a

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [7]:
type(a)

numpy.ndarray

In [8]:
# retrieve by index
a[0]

10

In [9]:
a[5]

60

In [10]:
a[7]

80

In [11]:
a[0:5]  # Python slice -- from 0 up to (and not including) 5

array([10, 20, 30, 40, 50])

In [12]:
# python3 -m pip install numpy

# Arrays vs. lists

Arrays have two properties:
- Their size cannot change, once they are created
- They can only contain a single data type
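
A minimal sketch of both properties, using illustrative names. `np.append` does not grow an array in place but returns a new one, and mixing ints with a float produces a single common dtype.

```python
import numpy as np

a = np.array([10, 20, 30])

# np.append returns a new, longer array; a itself keeps its length
b = np.append(a, 40)
print(a.size, b.size)

# a single float forces every element to be stored as a float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)
```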



In [13]:
# upgrades an existing version
# pip install -U numpy 

In [14]:
a

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [15]:
# How else can we create NumPy arrays?

np.arange(5, 20)   # returns a NumPy array from 5 until (not including) 20

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [16]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [17]:
np.arange(5, 20, 3)  # from 5 to 20 (not included), skipping by 3

array([ 5,  8, 11, 14, 17])

In [24]:
# random integers!

np.random.seed(0)              # guarantees that random numbers will be predictable
np.random.randint(0, 100, 5)   # give me 5 random integers from 0 up to (not including) 100

array([44, 47, 64, 67, 67])

In [25]:
# random floats

np.random.rand(5)  # returns 5 floats between 0 and 1

array([0.84725174, 0.6235637 , 0.38438171, 0.29753461, 0.05671298])

In [26]:
# NumPy arrays are mutable -- we *can* change them.  We cannot change the length, but we can change the content
a

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [27]:
# indexing is 0 based, just like in Python

a[3] = 70
a[5] = 25


In [28]:
a

array([ 10,  20,  30,  70,  50,  25,  70,  80,  90, 100])

In [30]:
# help(np.random)

# Exercises: Simple NumPy manipulations

1. Create a NumPy array with three elements -- the integers from your birthday! It should contain the year, month, and day (all in numbers).
2. Retrieve the year.
3. Replace the year with the current year, and print the array.
4. Create a NumPy array with every 3rd number from 567 to 890. What is the 8th number (i.e., at index 7) in this array? 

In [31]:
# pip install -U matplotlib 

In [32]:
a = np.array([1970, 7, 14])

a[0]

1970

In [33]:
a[0] = 2021
a

array([2021,    7,   14])

In [35]:
a = np.arange(567, 890, 3)
a

array([567, 570, 573, 576, 579, 582, 585, 588, 591, 594, 597, 600, 603,
       606, 609, 612, 615, 618, 621, 624, 627, 630, 633, 636, 639, 642,
       645, 648, 651, 654, 657, 660, 663, 666, 669, 672, 675, 678, 681,
       684, 687, 690, 693, 696, 699, 702, 705, 708, 711, 714, 717, 720,
       723, 726, 729, 732, 735, 738, 741, 744, 747, 750, 753, 756, 759,
       762, 765, 768, 771, 774, 777, 780, 783, 786, 789, 792, 795, 798,
       801, 804, 807, 810, 813, 816, 819, 822, 825, 828, 831, 834, 837,
       840, 843, 846, 849, 852, 855, 858, 861, 864, 867, 870, 873, 876,
       879, 882, 885, 888])

In [36]:
a[7]

588

In [37]:
# Python list!
mylist = [10, 20, 30]

mylist + mylist  # add two lists together

[10, 20, 30, 10, 20, 30]

In [38]:
mylist + 4

TypeError: can only concatenate list (not "int") to list

In [41]:
# NumPy array!
a = np.array([10, 20, 30])
a + a                      # all operations on an array are vectorized

array([20, 40, 60])

In [42]:
a + np.array([2,4,6,8,10])

ValueError: operands could not be broadcast together with shapes (3,) (5,) 

In [43]:
a

array([10, 20, 30])

In [44]:
a + 4 # broadcasting -- vector + scalar can have operations run on them, the scalar is used with each element

array([14, 24, 34])
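
Broadcasting is not limited to a scalar: it also works between arrays of different dimensions, as long as the shapes are compatible. This sketch (names are illustrative) adds a 1-dimensional array to each row of a 2-dimensional one; 2-dimensional arrays are covered later in this notebook.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])        # shape (2, 3)
offsets = np.array([10, 20, 30])    # shape (3,)

# offsets is applied to each row of grid
result = grid + offsets
print(result)
```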

In [45]:
a

array([10, 20, 30])

In [46]:
# some basic methods for working with NumPy arrays

a = np.array([10, 20, 30, 35, 15, -8, 20, 50])

a.mean()

21.5

In [47]:
a.sum()  

172

In [50]:
a.size  # this is not a method!

8

In [51]:
a.sum() / a.size

21.5

In [52]:
a.std()   # standard deviation -- how much do the values vary from the mean?

16.263455967290593

In [53]:
a.min()

-8

In [54]:
a.max()

50

In [56]:
np.random.rand(5)  # 5 numbers between 0 and 1

array([0.27265629, 0.47766512, 0.81216873, 0.47997717, 0.3927848 ])

In [57]:
np.random.rand(5) * 100    # 5 numbers between 0 and 100

array([83.60787635, 33.73961604, 64.81718721, 36.82415398, 95.7155159 ])

In [58]:
# another method: np.ones
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [59]:
np.ones(10) * 5

array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])

In [60]:
# also: np.zeros 
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [61]:
np.zeros(10) + 5

array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])

# Exercise: Temperatures + more

1. Look up the 10-day forecast for wherever you live.
2. Create a 10-element NumPy array with the predicted high temperatures.
3. Find the mean and standard deviation of those.
4. Create a 20-element NumPy array of random ints between 0 and 100.  What is the mean?  Do you expect the mean to change if you have 200 elements?  How should it change?  How about 2,000 elements?

In [62]:
a = np.array([23, 22, 18, 17, 16, 15, 17, 18, 17, 16])
a.mean()


17.9

In [63]:
a.std()

2.467792535850613

In [64]:
a = np.random.randint(0, 100, 20)
a.mean()

47.55

In [66]:
a = np.random.randint(0, 100, 2000)
a.mean()

49.5745

In [67]:
help(np.random.randint)

Help on built-in function randint:

randint(...) method of numpy.random.mtrand.RandomState instance
    randint(low, high=None, size=None, dtype=int)
    
    Return random integers from `low` (inclusive) to `high` (exclusive).
    
    Return random integers from the "discrete uniform" distribution of
    the specified dtype in the "half-open" interval [`low`, `high`). If
    `high` is None (the default), then results are from [0, `low`).
    
    .. note::
        New code should use the ``integers`` method of a ``default_rng()``
        instance instead; please see the :ref:`random-quick-start`.
    
    Parameters
    ----------
    low : int or array-like of ints
        Lowest (signed) integers to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is one above the
        *highest* such integer).
    high : int or array-like of ints, optional
        If provided, one above the largest (signed) integer to be drawn
        from the distributi
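
The note in that help text points to NumPy's newer random API. Here is a minimal sketch of the equivalent calls, assuming NumPy 1.17 or later. The seed value and names are arbitrary, and the numbers produced will differ from those of `np.random.seed` plus `np.random.randint`.

```python
import numpy as np

rng = np.random.default_rng(0)        # a seeded, independent generator

ints = rng.integers(0, 100, size=20)  # 20 ints from 0 up to (not including) 100
floats = rng.random(5)                # 5 floats from 0 up to (not including) 1

print(ints.mean())
print(floats)
```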

In [69]:
# Indexing

a = np.array([23, 22, 18, 17, 16, 15, 17, 18, 17, 16])
a

array([23, 22, 18, 17, 16, 15, 17, 18, 17, 16])

In [70]:
a[0]

23

In [71]:
a[1]

22

In [72]:
# fancy indexing
a[[0, 1]]   # give me an array back with the elements of a at both 0 and 1

array([23, 22])

In [73]:
a[[2, 4,6, 2,4,6]]

array([18, 16, 17, 18, 16, 17])

In [74]:
# boolean indexing

a = np.array([10, 20, 30, 40])
a

array([10, 20, 30, 40])

In [77]:
# use booleans (True/False) as elements of our selection index!
# note: if we do this, we must have the same number of booleans as elements in a

# "boolean index" or a "mask index"
a[[True, False, True, False]]    # only get the elements back that match True

array([10, 30])

In [78]:
a[[True, True, True, True]]

array([10, 20, 30, 40])

In [79]:
a[[False, False, False, False]]

array([], dtype=int64)

In [80]:
# remember that all operators are vectorized
# if we use an array and a scalar, then the operation is broadcast

a + 5

array([15, 25, 35, 45])

In [81]:
a > 20

array([False, False,  True,  True])

In [83]:
# first: create a boolean array by comparing a>20
# then: apply that boolean array as a boolean index on a 
# thus: get all of the elements of a that are > 20

a[a>20]

array([30, 40])

In [85]:
a

array([10, 20, 30, 40])

In [87]:
# always think about the [] as saying, "Where this is the case"
a[a == 30]

array([30])

In [88]:
# show me all elements of a that are greater than the mean
a[a>a.mean()]

array([30, 40])

In [89]:
a = np.array([10, 15, 20, 25, 30])

a

array([10, 15, 20, 25, 30])

In [90]:
# I want to find all of the odd elements
a%2

array([0, 1, 0, 1, 0])

In [92]:
# let's get the odd values!

# this doesn't work the way we expect because we're getting
# integers (and thus a fancy index with elements at indexes 0 and 1)

# we are *not* getting True and False values, which would return
# only those elements that are odd.
a[a%2]

array([10, 15, 10, 15, 10])

In [93]:
# how can/should we do it?
a[a%2==1]

array([15, 25])

In [94]:
# get the even elements
a[a%2==0]

array([10, 20, 30])

# Exercises: Boolean indexes

1. Create an array of 20 random integers from 0 - 100.
2. Find the largest even number.
3. Find the mean of the odd numbers.
4. Create an array of 20 floats from 0 - 1,000.
5. Find the items that are less than the mean.
6. Find the items that are less than the mean-std.

In [97]:
np.random.seed(0)
a = np.random.randint(0, 100, 20)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [101]:
a[a%2==0].max()

88

In [103]:
a[a%2==1].mean()

57.2

In [104]:
np.random.seed(0)
a = np.random.rand(20) * 1000
a

array([548.81350393, 715.18936637, 602.76337607, 544.883183  ,
       423.65479934, 645.89411307, 437.58721126, 891.77300078,
       963.6627605 , 383.44151883, 791.72503808, 528.89491975,
       568.04456109, 925.59663829,  71.0360582 ,  87.1292997 ,
        20.21839744, 832.61984555, 778.15675095, 870.01214825])

In [107]:
a[a < a.mean()]

array([548.81350393, 544.883183  , 423.65479934, 437.58721126,
       383.44151883, 528.89491975, 568.04456109,  71.0360582 ,
        87.1292997 ,  20.21839744])

In [109]:
a[a < a.mean()-a.std()]

array([71.0360582 , 87.1292997 , 20.21839744])

In [110]:
a[a > a.mean()+a.std()]

array([891.77300078, 963.6627605 , 925.59663829, 870.01214825])

In [111]:
a[a < a.mean()-2*a.std()]

array([20.21839744])

In [112]:
a[a > a.mean()+2*a.std()]

array([], dtype=float64)

In [113]:
# how can we check multiple criteria?
# for example: show me the even numbers that are < mean

a[a%2==0 and a<a.mean()]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [119]:
# in normal Python, any 'if' (as well as a few other things, such as and/or/not)
# put our data in a "boolean context"

# anything in Python in a boolean context is True except for: 
# None, False, 0, and anything empty ('', [], (), {})

# NumPy arrays are the exception.
# a 1-element NumPy array works in this way
# with more than one element, it's not clear what we should get back!

# Python has bitwise operators. They can be overloaded -- and NumPy did exactly that.
# & is our "and"
# | is our "or"
# ~ is our "not"

a = np.random.randint(0, 100, 20)
a[(a%2==0) &
  (a<a.mean())]

array([28,  0,  0,  4])
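
The parentheses around each comparison matter, because `&` and `|` bind more tightly than `==` and `<`. Here is a small sketch with a fixed array (chosen for illustration) showing all three operators; naming each boolean array first avoids the precedence problem.

```python
import numpy as np

a = np.array([10, 15, 20, 25, 30])

is_even = a % 2 == 0
is_small = a < a.mean()          # the mean is 20.0

print(a[is_even & is_small])     # even and below the mean
print(a[is_even | is_small])     # even or below the mean
print(a[~is_even])               # not even, i.e. odd
```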

# Exercise: Complex criteria

1. Create a NumPy array of 20 random ints from 0-100.
2. What's the smallest even number that's also greater than the mean?
3. Show all numbers that are either < mean-std or > mean+std.
4. Show odd numbers < mean and even numbers > mean.

In [122]:
np.random.seed(0)
a = np.random.randint(0, 100, 20)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [125]:
a[(a%2==0) &            # even
  (a>a.mean())].min()   # greater than the mean

64

In [126]:
a%2==0

array([ True, False,  True, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False, False, False,
        True,  True])

In [127]:
a>a.mean()

array([False, False,  True,  True,  True, False,  True, False, False,
        True,  True,  True,  True, False, False,  True, False,  True,
       False,  True])

In [131]:
# numbers either < mean-std  or >mean+std

a[(a<a.mean()-a.std()) |
  (a>a.mean()+a.std())]

array([ 9, 21, 87, 88, 88, 12, 87, 88])

In [128]:
a<a.mean()-a.std()

array([False, False, False, False, False,  True, False,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False])

In [130]:
a>a.mean()+a.std()

array([False, False, False, False, False, False, False, False, False,
        True, False,  True,  True, False, False, False, False,  True,
       False,  True])

In [133]:
a[(a%2==1) & (a<a.mean())  |
  (a%2==0) & (a>a.mean())]

array([47, 64,  9, 21, 70, 88, 88, 39, 88])

# dtypes

In [134]:
a = np.array([10, 20, 30])
a

array([10, 20, 30])

In [135]:
type(a)

numpy.ndarray

In [136]:
# to find out the type of data that a NumPy array contains, I need to ask its *dtype*

a.dtype

dtype('int64')

When I create a new NumPy array, NumPy tries to guess the best dtype that it can for my data. It'll typically choose one of 3-4 different options:

- `np.int64` -- 64-bit ints
- `np.float64` -- 64-bit floats
- Unicode strings and/or Python objects

The names of the different NumPy types are a bit odd, and there are several ways to describe each type. For example, the default int type is `np.int64`, which we can write as:

- `np.int64`
- `np.dtype('int64')`

If you have `%pylab inline` in place (or, as in this notebook, `from numpy import *`), then you can also say

- `int64`
- `dtype('int64')`

I tend to prefer the former, slightly longer ones, and especially like saying `np.int64`.

What else do we have?

- Integers: `np.int8`, `np.int16`, `np.int32`, `np.int64`
- Unsigned integers: `np.uint8`, `np.uint16`, `np.uint32`, `np.uint64`
- Floats: `np.float16`, `np.float32`, `np.float64`, `np.float128` (the last is not available on every platform)



In [137]:
# 8 bits per byte
# 64-bit ints are 8-byte ints
# if I have 1 billion elements in my array, then 64-bit ints will take 8 GB. But 32-bit ints will take 4 GB.
# if your numbers are all between 1-10, don't use 64 bit ints!
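
The memory arithmetic can be checked on a smaller array with the `nbytes` attribute. This sketch uses one million elements rather than one billion:

```python
import numpy as np

n = 1_000_000

big = np.zeros(n, dtype=np.int64)
small = np.zeros(n, dtype=np.int8)

print(big.nbytes)     # 8 bytes per element
print(small.nbytes)   # 1 byte per element
```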

In [138]:
a = np.array([10, 20, 30], dtype=np.int8)
a

array([10, 20, 30], dtype=int8)

In [139]:
a ** 2

array([ 100, -112, -124], dtype=int8)

In [140]:
a = np.array([10, 20, 30], dtype=np.uint8)
a

array([10, 20, 30], dtype=uint8)

In [141]:
a ** 2

array([100, 144, 132], dtype=uint8)
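
Those surprising results come from integer overflow: a result that doesn't fit in 8 bits wraps around. This sketch uses `np.iinfo` to show each dtype's range, and works out the wraparound by hand for 20 ** 2:

```python
import numpy as np

print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)     # range of int8
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)   # range of uint8

# 20 ** 2 is 400, which does not fit in 8 bits
wrapped_unsigned = 400 % 256             # what uint8 keeps
wrapped_signed = wrapped_unsigned - 256  # the same bits, read as int8

print(wrapped_unsigned, wrapped_signed)
```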

In [142]:
a

array([10, 20, 30], dtype=uint8)

In [143]:
a = np.array([10, 20, 30 ,40, 50])
a

array([10, 20, 30, 40, 50])

In [144]:
a[2] = 32
a

array([10, 20, 32, 40, 50])

In [145]:
a[2] = 12.34  # dtype is int, so we run int() on the value before assigning it

In [146]:
a

array([10, 20, 12, 40, 50])

In [147]:
a[2] = 'abcd'

ValueError: invalid literal for int() with base 10: 'abcd'

In [148]:
a[2] = '123'  # runs int() on the string, and gets an integer which we can use

In [149]:
a

array([ 10,  20, 123,  40,  50])

In [150]:
# even a single float in my array will force it to be np.float64 (by default)
a = np.array([1.5, 2.5, 3.5, 4])

In [151]:
a

array([1.5, 2.5, 3.5, 4. ])

In [152]:
a.dtype

dtype('float64')

In [156]:
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float16)
a

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float16)

In [157]:
a + 0.1

array([0.2   , 0.2998, 0.4   , 0.5   , 0.6   , 0.7   ], dtype=float16)

In [158]:
a + 0.2

array([0.2998, 0.4   , 0.5   , 0.5996, 0.7   , 0.8   ], dtype=float16)

In [159]:
a + 0.3

array([0.4   , 0.5   , 0.6   , 0.7   , 0.8   , 0.9004], dtype=float16)

In [162]:
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float64)
a

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

In [163]:
a+0.1

array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

In [164]:
0.1 + 0.2

0.30000000000000004
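
Because of this kind of rounding, comparing floats with `==` is fragile. A sketch using `np.isclose`, which compares within a tolerance:

```python
import numpy as np

total = 0.1 + 0.2

print(total == 0.3)             # exact comparison
print(np.isclose(total, 0.3))   # comparison within a small tolerance

a = np.array([0.1, 0.2, 0.3])
print(np.isclose(a + 0.1, [0.2, 0.3, 0.4]))
```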

In [165]:
x = 1.

In [166]:
type(x)

float

In [167]:
a = np.array([1,2,3,4.5])

In [168]:
a

array([1. , 2. , 3. , 4.5])

In [171]:
a = np.array([10, 20, 30 ,40, 50, 60], dtype=np.int8)
a

array([10, 20, 30, 40, 50, 60], dtype=int8)

In [172]:
a.dtype

dtype('int8')

In [173]:
# I want to change the dtype! Let's make it 16 bits instead

# NEVER EVER EVER EVER DO THIS!
a.dtype = np.int16

In [174]:
a

array([ 5130, 10270, 15410], dtype=int16)

In [177]:
a = np.array([10, 20, 30, 40, 50])
a.dtype = np.int32  

In [178]:
a

array([10,  0, 20,  0, 30,  0, 40,  0, 50,  0], dtype=int32)

In [179]:
a.dtype = np.int8

In [180]:
a

array([10,  0,  0,  0,  0,  0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0, 30,
        0,  0,  0,  0,  0,  0,  0, 40,  0,  0,  0,  0,  0,  0,  0, 50,  0,
        0,  0,  0,  0,  0,  0], dtype=int8)

In [181]:
a = np.array([10.5, 20.5, 30.5])
a.dtype

dtype('float64')

In [183]:
a.dtype = np.float32
a

array([0.       , 2.578125 , 0.       , 2.8203125, 0.       , 2.9765625],
      dtype=float32)

In [184]:
a.dtype = np.int64
a

array([4622100592565682176, 4626463454704697344, 4629278204471803904])

In [185]:
# how *should* we change the dtype of an array?
# You don't. You create a new array with the new dtype, based on the old array.
# if you want, you can assign the new one back to the old one.

# the way to do this is with .astype(NEWTYPE)

a.dtype = np.float64
a

array([10.5, 20.5, 30.5])

In [186]:
a.astype(np.int8) # get me a new array, whose dtype is np.int8, based on a

array([10, 20, 30], dtype=int8)

In [187]:
a = a.astype(np.int8)
a

array([10, 20, 30], dtype=int8)
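
The difference between the two approaches can be made explicit. Assigning to `.dtype` reinterprets the existing bytes, which is also what the `.view` method does without modifying the original, while `.astype` converts each value. A sketch with illustrative names:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60], dtype=np.int8)

converted = a.astype(np.int16)    # same six values, now 2 bytes each
reinterpreted = a.view(np.int16)  # same 6 bytes, read as three 2-byte ints

print(converted)
print(reinterpreted.size)
print(a)                          # a itself is unchanged by both calls
```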

In [188]:
a = np.array('this is a bunch of words'.split())  # dtype will be "<U5", 5 Unicode characters
a

array(['this', 'is', 'a', 'bunch', 'of', 'words'], dtype='<U5')

In [189]:
a[0] = 'vwxyz'
a

array(['vwxyz', 'is', 'a', 'bunch', 'of', 'words'], dtype='<U5')

In [191]:
a[0] = 'uvwxyz'  # silently truncated
a

array(['uvwxy', 'is', 'a', 'bunch', 'of', 'words'], dtype='<U5')
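
If you know that longer strings are coming, one option is to ask for a wider string dtype up front. A sketch, assuming 10 characters is enough:

```python
import numpy as np

words = 'this is a bunch of words'.split()

a = np.array(words, dtype='<U10')   # room for up to 10 characters per element
a[0] = 'uvwxyz'

print(a)
```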

In [192]:
# we've seen that we can modify an array by assigning to it
a

array(['uvwxy', 'is', 'a', 'bunch', 'of', 'words'], dtype='<U5')

In [193]:
a = np.array([10, 20, 30, 40, 50])
a

array([10, 20, 30, 40, 50])

In [194]:
a[3] = 22
a

array([10, 20, 30, 22, 50])

In [195]:
# I can also assign to an array based on (a) fancy indexing and (b) boolean indexing

In [196]:
a[[2,4]] = 99  # assign the value 99 to both indexes 2 and 4
a

array([10, 20, 99, 22, 99])

In [197]:
a[a%2==0]  # find even numbers in a

array([10, 20, 22])

In [198]:
a[a%2==0] = 44  # assign 44 to all even numbers in a
a

array([44, 44, 99, 44, 99])

# Exercises: dtypes

1. Create an array of 40 random integers from 0-100.
2. Find all numbers that are within 1 standard deviation of the mean, and set all of them to be equal to the mean. Has the mean changed? Has the std changed?
3. Create a new array of 10 random integers, from 0-100.
4. Set the items at even indexes to be equal to the items at the odd indexes. So the item at index 0 will get the value at index 1, etc.

In [203]:
np.random.seed(0)
a = np.random.randint(0, 100, 40)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88, 81, 37, 25, 77, 72,  9, 20, 80, 69, 79, 47, 64, 82, 99,
       88, 49, 29, 19, 19, 14])

In [204]:
a.mean()

55.625

In [205]:
# all numbers that are > mean-std  and < mean+std
a[(a>a.mean()-a.std()) &
  (a<a.mean()+a.std())]

array([44, 47, 64, 67, 67, 36, 70, 58, 65, 39, 46, 81, 37, 77, 72, 80, 69,
       79, 47, 64, 82, 49, 29])

In [206]:
a[(a>a.mean()-a.std()) &
  (a<a.mean()+a.std())] = a.mean()
a

array([55, 55, 55, 55, 55,  9, 83, 21, 55, 87, 55, 88, 88, 12, 55, 55, 55,
       87, 55, 88, 55, 55, 25, 55, 55,  9, 20, 55, 55, 55, 55, 55, 55, 99,
       88, 55, 55, 19, 19, 14])

In [207]:
a.mean()

53.025

In [211]:
np.random.seed(0)
a = np.random.randint(0, 100, 40)
a = a.astype(np.float64)
a

array([44., 47., 64., 67., 67.,  9., 83., 21., 36., 87., 70., 88., 88.,
       12., 58., 65., 39., 87., 46., 88., 81., 37., 25., 77., 72.,  9.,
       20., 80., 69., 79., 47., 64., 82., 99., 88., 49., 29., 19., 19.,
       14.])

In [212]:
a.mean()

55.625

In [213]:
a.std()

26.965429256735373

In [214]:
a[(a>a.mean()-a.std()) &
  (a<a.mean()+a.std())] = a.mean()


In [215]:
a.mean()

53.384375

In [216]:
a.std()

23.803137718258384

In [217]:
np.random.seed(0)
a = np.random.randint(0, 100, 10)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87])

In [218]:
# I want the items at the odd indexes -- 1,3,5,7,9

np.arange(1,10,2)

array([1, 3, 5, 7, 9])

In [219]:
a[np.arange(1,10,2)]

array([47, 67,  9, 21, 87])

In [222]:
# items at even indexes
a[np.arange(0,10,2)]

array([44, 64, 67, 83, 36])

In [223]:
# assign from the odd indexes to the even indexes
a[np.arange(0,10,2)] = a[np.arange(1,10,2)]
a

array([47, 47, 67, 67,  9,  9, 21, 21, 87, 87])
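
The same assignment can be written with slices, using a step size of 2. A sketch using the same ten values as above:

```python
import numpy as np

a = np.array([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

# a[1::2] is the odd indexes, a[0::2] is the even indexes
a[0::2] = a[1::2]

print(a)
```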

# Next up

1. Missing data with `nan`
2. 2-dimensional arrays
3. Pandas series

# Exercise: dtypes (really, this time!)

1. Create a NumPy array with 20 random floats from 0-100.
2. Replace those numbers whose int portion is even with the mean of all the numbers. So you would replace 20.5 (because 20 is even), but not 21.6 (because 21 is odd).

In [224]:
np.random.seed(0)
a = np.random.rand(20) * 100
a

array([54.88135039, 71.51893664, 60.27633761, 54.4883183 , 42.36547993,
       64.58941131, 43.75872113, 89.17730008, 96.36627605, 38.34415188,
       79.17250381, 52.88949198, 56.80445611, 92.55966383,  7.10360582,
        8.71292997,  2.02183974, 83.26198455, 77.81567509, 87.00121482])

In [232]:
a[a.astype(np.int8)%2==0] = a.mean()
a

array([58.15548245, 71.51893664, 58.15548245, 58.15548245, 58.15548245,
       58.15548245, 43.75872113, 89.17730008, 58.15548245, 58.15548245,
       79.17250381, 58.15548245, 58.15548245, 58.15548245,  7.10360582,
       58.15548245, 58.15548245, 83.26198455, 77.81567509, 87.00121482])
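
Converting to `np.int8` works here because every value is below 128. For larger non-negative values, here is a sketch of an alternative that takes the integer portion with `np.floor` (the sample values are made up for illustration):

```python
import numpy as np

a = np.array([20.5, 21.6, 300.25, 301.75])

even_int_part = np.floor(a) % 2 == 0   # True where the int portion is even
a[even_int_part] = a.mean()            # the mean is computed before assigning

print(a)
```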

# `nan` value

In [233]:
a = np.array([95, 90, 92, 85, 92])
a.mean()

90.8

In [234]:
a = np.array([95, 90, 92, 85, 0])
a.mean()

72.4

In [235]:
# we need a value that's distinct from 0 (and thus won't mess up the calculations),
# but that lets us know that there's something different here, that it's a missing value

# that value is "nan" , short for "not a number"

np.nan

nan

In [236]:
nan

nan

In [237]:
type(nan)

float

In [238]:
nan > 0

False

In [239]:
nan < 0

False

In [240]:
nan == 0

False

In [241]:
nan == nan

False

In [242]:
a = np.array([95, 90, 92, 85, nan])
a.mean()

nan

In [243]:
2 + nan

nan

In [244]:
3 * nan

nan

In [245]:
# in order to work with a (which has a nan value), I'll need to
# remove the nan from a

In [246]:
a.dtype

dtype('float64')

In [247]:
a

array([95., 90., 92., 85., nan])

In [248]:
# remove nan by finding everything not equal to it

a[a!=nan]  # not going to help -- because nan≠nan

array([95., 90., 92., 85., nan])

In [249]:
# we can use the "np.isnan" function, which returns True when something is nan
np.isnan(a)

array([False, False, False, False,  True])

In [252]:
# get all elements of a that aren't nan, then get their mean
a[~np.isnan(a)].mean()

90.5
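
NumPy also provides aggregation functions that skip `nan` values for you, such as `np.nanmean`. A sketch with the same scores:

```python
import numpy as np

a = np.array([95, 90, 92, 85, np.nan])

print(a.mean())        # nan, because one element is nan
print(np.nanmean(a))   # the mean of the non-nan elements
```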

In [253]:
a = np.array([10, 20, 30, 40, 50])
a[3] = nan

ValueError: cannot convert float NaN to integer

In [254]:
a = np.array([10, 20, 30, 40, 50])
a = a.astype(np.float64)
a[3] = nan

In [255]:
a

array([10., 20., 30., nan, 50.])

# Exercises: Working with `nan`

1. Create a NumPy array with 30 random ints from 0-1,000.
2. Find the numbers that are < mean-std or > mean+std, and set them to be `nan`.
3. Change those `nan` values to be the mean of the array's remaining numbers.

In [259]:
np.random.seed(0)
a = np.random.randint(0, 1000, 30).astype(np.float64)
a

array([684., 559., 629., 192., 835., 763., 707., 359.,   9., 723., 277.,
       754., 804., 599.,  70., 472., 600., 396., 314., 705., 486., 551.,
        87., 174., 600., 849., 677., 537., 845.,  72.])

In [260]:
a[(a < a.mean()-a.std())   |
  (a > a.mean()+a.std())] = np.nan

In [261]:
a

array([684., 559., 629.,  nan,  nan, 763., 707., 359.,  nan, 723., 277.,
       754.,  nan, 599.,  nan, 472., 600., 396., 314., 705., 486., 551.,
        nan,  nan, 600.,  nan, 677., 537.,  nan,  nan])

In [268]:
# assign the mean of non-nan values to all of the nan values
a[np.isnan(a)] = a[~np.isnan(a)].mean()

In [269]:
a

array([684. , 559. , 629. , 569.6, 569.6, 763. , 707. , 359. , 569.6,
       723. , 277. , 754. , 569.6, 599. , 569.6, 472. , 600. , 396. ,
       314. , 705. , 486. , 551. , 569.6, 569.6, 600. , 569.6, 677. ,
       537. , 569.6, 569.6])

# Multidimensional arrays in NumPy

We're using the `ndarray` class, which stands for `n-dimensional array`.  So we know that it can be used for more than 1 dimension. We're just going to talk about 2 dimensions.

In [270]:
# I can create a 2D NumPy array by passing a list of lists to np.array

a = np.array([[10, 20, 30, 40],
             [50, 60, 70, 80],
             [90, 100, 110, 120]])



In [271]:
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [272]:
# what is the shape of a?
a.shape  # (rows, columns)

(3, 4)

In [273]:
a[0]   

array([10, 20, 30, 40])

In [274]:
a[2]

array([ 90, 100, 110, 120])

In [275]:
# what if I want both index 0 and index 2?
# fancy indexing!
a[[0, 2]]

array([[ 10,  20,  30,  40],
       [ 90, 100, 110, 120]])

In [276]:
# I can also use a slice
a[0:3:2]    # from 0 until (but not including) 3, step size 2

array([[ 10,  20,  30,  40],
       [ 90, 100, 110, 120]])

In [277]:
# what about retrieving individual elements?
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [278]:
# I want the item at row index 1, column index 2
# DO NOT DO THIS! (it works, but it first retrieves row 1 as an intermediate array, then indexes into that)
a[1][2]

70

In [280]:
# instead, I should do this as follows:
a[1,2]   # we're passing the tuple (1,2) to a's [] 

70

In [281]:
a[[0, 2], 3]   # rows 0+2, column 3

array([ 40, 120])

In [290]:
a[[0, 2]]

array([[ 10,  20,  30,  40],
       [ 90, 100, 110, 120]])

In [282]:
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [285]:
# I can specify which row(s) before the comma, and which column(s) after the comma
# you can specify with [] fancy indexing or (often better) with a slice

a[0:3:2, 1:4]

array([[ 20,  30,  40],
       [100, 110, 120]])

In [286]:
# what if I want everything in column 3
a[:, 3]

array([ 40,  80, 120])

In [288]:
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [289]:
a.shape

(3, 4)

In [291]:
# what if I want to change the shape of an array?

# two ways to do this:
# (1) assign to .shape.  
# (2) BETTER FOR SURE: use the .reshape method, which returns a new array based on the old one.

a.reshape(2, 6)

array([[ 10,  20,  30,  40,  50,  60],
       [ 70,  80,  90, 100, 110, 120]])

In [292]:
a.reshape(9, 100)

ValueError: cannot reshape array of size 12 into shape (9,100)

In [293]:
a

array([[ 10,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [294]:
b = a.reshape(2, 6)  # get a new array of a different shape back from a
b

array([[ 10,  20,  30,  40,  50,  60],
       [ 70,  80,  90, 100, 110, 120]])

In [295]:
a[0, 0] = 999
a

array([[999,  20,  30,  40],
       [ 50,  60,  70,  80],
       [ 90, 100, 110, 120]])

In [297]:
b # b has changed, too!

array([[999,  20,  30,  40,  50,  60],
       [ 70,  80,  90, 100, 110, 120]])
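
`b` changed because `reshape` returned a view onto the same data rather than a copy (it does so whenever it can). This sketch shows how to check for shared memory with `np.shares_memory`, and how to get an independent array with `.copy()`:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a.reshape(2, 6)                # shares data with a
independent = a.reshape(2, 6).copy()  # has its own data

a[0, 0] = 999

print(np.shares_memory(a, view))         # True
print(view[0, 0], independent[0, 0])     # 999 0
```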

# Exercise: Working with 2D arrays

1. Create a 2-dimensional 5x9 array of 45 random integers from 0-100. 
2. Retrieve the items at row index 2.
3. Retrieve the items at column index 3.
4. Retrieve the items at row indexes 1+4.
5. Retrieve the items at column indexes 1 and 4.
6. Get the mean of the even numbers in row index 4.
7. Get the mean of the odd numbers in column index 4.



In [299]:
np.random.seed(0)
a = np.random.randint(0, 100, 45).reshape(5, 9)
a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [300]:
np.random.seed(0)
a = np.random.randint(0, 100, [5,9])
a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [303]:
(np.random.random([5,9]) * 100).astype(np.int64)

array([[58, 88, 69, 72, 50, 95, 64, 42, 60],
       [ 1, 30, 66, 29, 61, 42, 13, 29, 56],
       [59, 57, 65, 65, 43, 89, 36, 43, 89],
       [80, 70, 10, 91, 71, 99, 14, 86, 16],
       [61, 12, 84, 80, 56, 40,  6, 69, 45]])

In [304]:
a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [305]:
# items at row index 2
a[2]

array([46, 88, 81, 37, 25, 77, 72,  9, 20])

In [306]:
# items at column index 3
a[:, 3]

array([67, 88, 37, 47, 14])

In [307]:
# row indexes 1 + 4
a[[1, 4]]

array([[87, 70, 88, 88, 12, 58, 65, 39, 87],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [310]:
# retrieve the items at column indexes 1+4
a[:, [1,4]]

array([[47, 67],
       [70, 12],
       [88, 25],
       [69, 64],
       [19, 39]])

In [315]:
# mean of the even numbers in row index 4

a[4][a[4] % 2 == 0].mean()

23.0

In [318]:
# odd numbers in column index 4 (add .mean() to get their mean)

a[:, 4][a[:, 4] % 2 == 1]

array([67, 25, 39])

In [319]:
a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [320]:
a.sum()

2427

In [321]:
# if I want to sum the columns, getting a "new row"
a.sum(axis=0)

array([286, 293, 331, 253, 207, 258, 384, 166, 249])

In [322]:
# I can also sum the rows
a.sum(axis=1)

array([438, 594, 455, 657, 283])

In [323]:
# this is true for all aggregation methods -- mean, sum, std, min, max
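
One way to remember the `axis` argument is that it names the dimension that gets collapsed. A sketch with a small 2x3 array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)

print(a.sum(axis=0))   # collapses the rows: one value per column
print(a.sum(axis=1))   # collapses the columns: one value per row
print(a.mean(axis=0))
```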

# Next up:

1. Pandas series!

# Pandas

Pandas is the Python data analysis library. I call it "Excel in Python." 

In [325]:
# to load Pandas, you have to say

import pandas as pd
from pandas import Series, DataFrame