## Some Python if you need a quick refresher

Let's look at some basic data structures in Python:

   - str (string, immutable)
   - list (mutable, order maintained)
   - tuple (immutable, order maintained)
   - set (unordered, unique items)
   - dict (dictionary: mutable, key-value pairs)

    


In [1]:
#this is a string (str)
hackathon = "spectra"
print('length of the string:', len(hackathon))
print('first character of the string:', hackathon[0])
print('last character of the string:', hackathon[-1])
print('capitalize string: ', hackathon.capitalize())
print('upper case string: ', hackathon.upper())
print('find tra in the string: ', hackathon.find('tra')) #start counting from 0
#Many more methods available. Read the docs - https://docs.python.org/3/library/stdtypes.html#string-methods

length of the string: 7
first character of the string: s
last character of the string: a
capitalize string:  Spectra
upper case string:  SPECTRA
find tra in the string:  4


In [2]:
#this is a list
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"]
print('number of states:', len(states))
print('first state: ', states[0])
print('last state: ', states[-1])
print('list contains California?', 'California' in states)
print('where is California in the list?', states.index('California'))
print('list first three states:', states[:3])
print('list last three states:', states[-3:])

#more advanced
print('filter list to only states starting with C:', list(filter(lambda x: x.startswith('C'),states)))
print('filter list to only states starting with C:', [state for state in states if state.startswith('C')])
#Many more methods available. Read the docs - https://docs.python.org/3/tutorial/datastructures.html

number of states: 9
first state:  Alabama
last state:  Florida
list contains California? True
where is California in the list? 4
list first three states: ['Alabama', 'Alaska', 'Arizona']
list last three states: ['Connecticut', 'Delaware', 'Florida']
filter list to only states starting with C: ['California', 'Colorado', 'Connecticut']
filter list to only states starting with C: ['California', 'Colorado', 'Connecticut']


In [3]:
#this is a dictionary. Since Python 3.7 dicts keep insertion order, but you look things up by key, not by position, so first and last make little sense
states = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA' : 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida'
}
print('keys in states:', states.keys())
print('values in states:', states.values())
print('# states in dict:', len(states))
print('find ca:', states['CA'])
print('is FL a key in the dict:', 'FL' in states)
print('is Florida a value in the dict: ', 'Florida' in states.values())
#Many more methods available. Read the docs - https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries

keys in states: dict_keys(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL'])
values in states: dict_values(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida'])
# states in dict: 9
find ca: California
is FL a key in the dict: True
is Florida a value in the dict:  True
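
The list at the top also mentions tuples and sets, which the cells above do not demonstrate. Here is a minimal sketch of both; the variable names (`point`, `letters`, `vowels`) are made up for illustration.

```python
# A tuple is an ordered, immutable sequence.
point = (3, 4)
x, y = point                      # tuples unpack neatly into variables
print('x:', x, 'y:', y)

try:
    point[0] = 10                 # item assignment is not allowed on a tuple
except TypeError as err:
    print('tuples are immutable:', err)

# A set keeps only unique items and has no defined order.
letters = set("spectra" + "spectra")
print('unique letters:', sorted(letters))   # sorted only to get a stable display
print('is "s" in the set?', 's' in letters)

# Sets support set algebra such as intersection (&).
vowels = set("aeiou")
print('vowels in spectra:', sorted(letters & vowels))
```

Reach for a tuple when the values belong together and should not change, and for a set when you only care about membership or uniqueness.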


## Data structures in Pandas - Series

In [5]:
import pandas as pd
import numpy as np

In [6]:
#This is a series. Think of this as a list with an index!
states = pd.Series(["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"])
states

0        Alabama
1         Alaska
2        Arizona
3       Arkansas
4     California
5       Colorado
6    Connecticut
7       Delaware
8        Florida
dtype: object

In [7]:
#you can give the data your own index!
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"]
abb = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL']
states = pd.Series(data=states, index=abb)
states


AL        Alabama
AK         Alaska
AZ        Arizona
AR       Arkansas
CA     California
CO       Colorado
CT    Connecticut
DE       Delaware
FL        Florida
dtype: object

In [8]:
#slice and dice
print('first state:', states.iloc[0]) #use iloc for position; plain states[0] is deprecated when the index holds labels
print('first index:', states.index[0])
print('\n\n')
print('first three states:')
print(states[0:3])
print('\n\n')
print('last three states:')
print(states[-3:])

first state: Alabama
first index: AL



first three states:
AL    Alabama
AK     Alaska
AZ    Arizona
dtype: object



last three states:
CT    Connecticut
DE       Delaware
FL        Florida
dtype: object
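
The slices above are positional. Because this series has a label index, you can also look things up by label. The sketch below rebuilds a three-state version of the series so it runs on its own; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

states = pd.Series(
    ["Alabama", "Alaska", "Arizona"],
    index=["AL", "AK", "AZ"],
)

by_label = states.loc["AK"]       # look up by index label
by_position = states.iloc[1]      # look up by 0-based position
print(by_label, by_position)

# The `in` operator checks the index labels, not the values.
print("AK" in states)
print("Alaska" in states)             # False: "Alaska" is a value, not a label
print("Alaska" in states.tolist())    # check the values explicitly

# Label slices include both endpoints; positional slices exclude the end.
print(states.loc["AL":"AK"])
print(states.iloc[0:1])
```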


In [9]:
#you can create series out of comprehensions and anything else that returns an iterable
fours = pd.Series([x**4 for x in range(6)])
fours

0      0
1      1
2     16
3     81
4    256
5    625
dtype: int64

In [10]:
#numpy has a wonderful random number generator that is useful in creating dummy data. Returns np.array
rand = pd.Series(np.random.rand(5))
rand


0    0.855780
1    0.351379
2    0.343122
3    0.582148
4    0.279583
dtype: float64
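
Random dummy data changes on every run, which makes results hard to compare. One common option is to seed the generator. This is a sketch using NumPy's `default_rng`; the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Two generators built from the same seed yield the same numbers.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = pd.Series(rng_a.random(5))    # 5 floats in [0, 1)
second = pd.Series(rng_b.random(5))

print(first.equals(second))
```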

## Data structures in Pandas - Dataframes

In [11]:
#a dataframe is like a table
state_data = {
    'abb':['AL','AK','AZ','AR','CA','CO','CT','DE','FL'],
    'name':["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"],
    'temp': np.random.rand(9)
}
df_state = pd.DataFrame(state_data)
df_state

Unnamed: 0,abb,name,temp
0,AL,Alabama,0.78938
1,AK,Alaska,0.402384
2,AZ,Arizona,0.693051
3,AR,Arkansas,0.11322
4,CA,California,0.949887
5,CO,Colorado,0.316077
6,CT,Connecticut,0.641843
7,DE,Delaware,0.856061
8,FL,Florida,0.541673


In [12]:
df = pd.DataFrame(data = np.random.randn(2,2), columns=['Temp', 'Humidity'])
df

Unnamed: 0,Temp,Humidity
0,0.175592,2.010558
1,0.25732,-0.70862


In [31]:
#the very cool thing about dataframes is that you can operate on them as a single unit. NO NEED TO ITERATE EXPLICITLY!
df = df * 10
df

Unnamed: 0,Temp,Humidity
0,17.559187,201.055765
1,25.732014,-70.861997
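
To see what "no need to iterate" saves you, here is a small sketch that compares the one-line version with an explicit loop. The frame uses fixed made-up numbers instead of random ones so the result is predictable.

```python
import pandas as pd

df = pd.DataFrame({"Temp": [1.0, 2.0], "Humidity": [3.0, 4.0]})

# Vectorized: one expression covers every cell.
vectorized = df * 10

# Explicit loop: same result, more code, and much slower on large frames.
looped = df.copy()
for column in looped.columns:
    for row in looped.index:
        looped.loc[row, column] = looped.loc[row, column] * 10

print(vectorized.equals(looped))
```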


In [32]:
#you can use the apply method to perform a more complex operation.
#uncomment the line below and press shift+tab+tab+tab inside the apply parentheses to bring up the documentation. Yes, three tabs. See what each one does!

#df.apply()

In [33]:
def add100(x):
    return x + 100

In [34]:
#let's apply this function to every column of the dataframe
df.apply(add100)

Unnamed: 0,Temp,Humidity
0,117.559187,301.055765
1,125.732014,29.138003


In [35]:
#notice that the original dataframe does not change. This can get tricky.
#If you want to change the df, assign the result back to it (df = df.apply(add100)). Some methods also accept inplace=True to modify the object in place
df

Unnamed: 0,Temp,Humidity
0,17.559187,201.055765
1,25.732014,-70.861997
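
Here is a small sketch of the two ways to make a change stick, using a made-up one-column frame. The column name `temp_c` is hypothetical, and `rename` is just one example of a method that accepts `inplace=True`.

```python
import pandas as pd

df = pd.DataFrame({"Temp": [1.0, 2.0]})

df.apply(lambda x: x + 100)            # result is discarded, df is unchanged
before = df["Temp"].tolist()
print(before)

df = df.apply(lambda x: x + 100)       # assigning the result back keeps the change
after = df["Temp"].tolist()
print(after)

# Some methods can modify the object directly via inplace=True.
df.rename(columns={"Temp": "temp_c"}, inplace=True)
print(df.columns.tolist())
```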


In [36]:
#you can also use lambda or anonymous functions to do the same
df.apply(lambda x: x+100)

Unnamed: 0,Temp,Humidity
0,117.559187,301.055765
1,125.732014,29.138003
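
By default `apply` hands your function one column at a time. Passing `axis=1` hands it one row at a time instead, which helps when a calculation needs several columns. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"Temp": [10.0, 20.0], "Humidity": [30.0, 50.0]})

# axis=0 (default): the function receives one column (a Series) at a time.
column_max = df.apply(lambda col: col.max())

# axis=1: the function receives one row (a Series) at a time.
row_sum = df.apply(lambda row: row["Temp"] + row["Humidity"], axis=1)

print(column_max)
print(row_sum)
```

For simple arithmetic like this, `df["Temp"] + df["Humidity"]` gives the same result and is usually faster; row-wise `apply` is worth it when the logic does not vectorize easily.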


## Let's load in our data files!

In [37]:
casts_url = "https://ibm.box.com/shared/static/569iue5znz5angfxaaojbd7olgegk0bz.csv"
release_dates_url = "https://ibm.box.com/shared/static/fxu6rhfktvjs0uvgtbhjsp5g5k9qgjh1.csv"
titles_url = "https://ibm.box.com/shared/static/cw3wqtzuljiyqz4kbuk26ojrrm9rzfow.csv"
film_locations_url = "https://ibm.box.com/shared/static/kcot1vu0r1tusff85m5shrr7ehsee8np.csv"

In [38]:
casts = pd.read_csv(casts_url)

In [39]:
release = pd.read_csv(release_dates_url)

In [40]:
titles = pd.read_csv(titles_url)

In [41]:
locations = pd.read_csv(film_locations_url)
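
These files are large, so it can help to load only a slice while experimenting. The sketch below uses the `nrows` and `usecols` parameters of `read_csv`. To stay runnable offline it reads from an in-memory string (a stand-in for the URL) holding a few rows copied from the casts output shown later in this notebook.

```python
import io
import pandas as pd

# Stand-in for a remote file: a handful of rows in the same layout as casts.
csv_text = """title,year,name,type,character,n
Closet Monster,2015,Buffy #1,actor,Buffy 4,31.0
Suuri illusioni,1985,Homo $,actor,Guests,22.0
Battle of the Sexes,2017,$hutter,actor,Bobby Riggs Fan,10.0
Secret in Their Eyes,2015,$hutter,actor,2002 Dodger Fan,
"""

# nrows limits how many data rows are parsed; usecols keeps only the named columns.
preview = pd.read_csv(io.StringIO(csv_text), nrows=3, usecols=["title", "year", "n"])
print(preview)
print(preview.shape)
```

With a real file you would pass the URL or path in place of `io.StringIO(csv_text)`.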

In [43]:
#whoa! that was smooth! Let's explore the casts dataset

#how big is this dataset? shape returns a tuple. The first number is the number of rows and the second is the number of columns.
#almost 3.8M rows! not a bad start!
print(casts.shape)

(3786176, 6)


In [86]:
#what are the different columns in this dataframe
casts.columns

Index(['title', 'year', 'name', 'type', 'character', 'n'], dtype='object')

In [88]:
#cool! What are the datatypes of the columns? Note: the object dtype usually means strings
casts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3786176 entries, 0 to 3786175
Data columns (total 6 columns):
title        object
year         int64
name         object
type         object
character    object
n            float64
dtypes: float64(1), int64(1), object(4)
memory usage: 173.3+ MB


In [89]:
#are there any null values? describe() takes a little time to go over almost 3.8M rows! By default it only summarizes numerical columns; compare each count with the total number of rows to spot missing values
casts.describe()

Unnamed: 0,year,n
count,3786176.0,2327603.0
mean,1988.911,16.90599
std,27.89225,31.70679
min,1894.0,1.0
25%,1970.0,5.0
50%,2001.0,10.0
75%,2012.0,21.0
max,2115.0,33613.0
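
If you also want a summary of the text columns, `describe` accepts an `include` argument. This is a sketch on a tiny made-up frame (not the real casts data), assuming `include="all"` is what you want:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Foxtrot", "XL", "Foxtrot"],
    "year": [1988, 2013, 1988],
    "n": [24.0, None, 3.0],          # one missing value on purpose
})

numeric_only = toy.describe()                 # default: numerical columns only
everything = toy.describe(include="all")      # adds unique/top/freq for text columns

print(numeric_only.loc["count"])
print(everything.loc[["count", "unique", "top"]])
```

The `count` row reports non-null values, so a count lower than the number of rows points to missing data in that column.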


In [92]:
#lots of missing values for the n column! There is another method to find the number of missing values
casts.n.isnull()

0          False
1          False
2          False
3           True
4           True
5           True
6           True
7           True
8          False
9           True
10         False
11         False
12          True
13          True
14          True
15          True
16         False
17         False
18          True
19         False
20          True
21          True
22          True
23          True
24          True
25         False
26         False
27         False
28         False
29         False
           ...  
3786146     True
3786147     True
3786148     True
3786149     True
3786150     True
3786151     True
3786152     True
3786153     True
3786154     True
3786155     True
3786156    False
3786157     True
3786158    False
3786159     True
3786160     True
3786161    False
3786162    False
3786163    False
3786164    False
3786165    False
3786166     True
3786167     True
3786168     True
3786169    False
3786170    False
3786171    False
3786172    False
3786173    False

In [93]:
#whoa! what happened? Were you expecting a number? isnull() goes through all the rows and returns True if the value is missing and False otherwise.
#so how do we count? In python, False counts as 0 and True counts as 1. So we can just call the sum method!
casts.n.isnull().sum()

1458573
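
The same trick works on a whole dataframe, and the mean of the booleans gives you the fraction that is missing. A sketch on a small made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "character": ["Guests", "Dopeman", "Thug 1", "Extra"],
    "n": [22.0, None, None, 50.0],
})

missing_per_column = toy.isnull().sum()    # True counts as 1, False as 0
missing_share = toy["n"].isnull().mean()   # mean of booleans = fraction that is True

print(missing_per_column)
print(missing_share)
```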

## Querying

In [50]:
#does slicing work here? well YES!
print('first 10 rows:')
casts[:10]

first 10 rows:


Unnamed: 0,title,year,name,type,character,n
0,Closet Monster,2015,Buffy #1,actor,Buffy 4,31.0
1,Suuri illusioni,1985,Homo $,actor,Guests,22.0
2,Battle of the Sexes,2017,$hutter,actor,Bobby Riggs Fan,10.0
3,Secret in Their Eyes,2015,$hutter,actor,2002 Dodger Fan,
4,Steve Jobs,2015,$hutter,actor,1988 Opera House Patron,
5,Straight Outta Compton,2015,$hutter,actor,Club Patron,
6,Straight Outta Compton,2015,$hutter,actor,Dopeman,
7,For Thy Love 2,2009,Bee Moe $lim,actor,Thug 1,
8,"Lapis, Ballpen at Diploma, a True to Life Journey",2014,Jori ' Danilo' Jurado Jr.,actor,Jaime (young),9.0
9,Desire (III),2014,Syaiful 'Ariffin,actor,Actor Playing Eteocles from 'Antigone',


In [52]:
print('last 10 rows:')
casts[-10:]

last 10 rows:


Unnamed: 0,title,year,name,type,character,n
3786166,Foreldrar,2007,Lilja Gu?r?n ?orvaldsd?ttir,actress,Katrin Eldri,
3786167,Rokland,2011,Lilja Gu?r?n ?orvaldsd?ttir,actress,A?albj?rg - Dagga's Mother,
3786168,XL,2013,Lilja Gu?r?n ?orvaldsd?ttir,actress,Tengdamamma,
3786169,Niceland (Population. 1.000.002),2004,Steinunn ?orvaldsd?ttir,actress,Factory Worker,21.0
3786170,Stuttur Frakki,1993,Sveinbj?rg ??rhallsd?ttir,actress,Flugfreyja,24.0
3786171,Foxtrot,1988,Lilja ??risd?ttir,actress,D?ra,24.0
3786172,Niceland (Population. 1.000.002),2004,Sigr??ur J?na ??risd?ttir,actress,Woman in Bus,26.0
3786173,Skammdegi,1985,Dalla ??r?ard?ttir,actress,Hj?krunarkona,9.0
3786174,U.S.S.S.S...,2003,Krist?n Andrea ??r?ard?ttir,actress,Afgr.dama ? bens?nst??,17.0
3786175,Bye Bye Blue Bird,1999,Rosa ? R?gvu,actress,Pensionatv?rtinde,


In [53]:
#cool deal, but there are other methods that are more popular to take a quick peek at the data
casts.head()

Unnamed: 0,title,year,name,type,character,n
0,Closet Monster,2015,Buffy #1,actor,Buffy 4,31.0
1,Suuri illusioni,1985,Homo $,actor,Guests,22.0
2,Battle of the Sexes,2017,$hutter,actor,Bobby Riggs Fan,10.0
3,Secret in Their Eyes,2015,$hutter,actor,2002 Dodger Fan,
4,Steve Jobs,2015,$hutter,actor,1988 Opera House Patron,


In [54]:
casts.tail()

Unnamed: 0,title,year,name,type,character,n
3786171,Foxtrot,1988,Lilja ??risd?ttir,actress,D?ra,24.0
3786172,Niceland (Population. 1.000.002),2004,Sigr??ur J?na ??risd?ttir,actress,Woman in Bus,26.0
3786173,Skammdegi,1985,Dalla ??r?ard?ttir,actress,Hj?krunarkona,9.0
3786174,U.S.S.S.S...,2003,Krist?n Andrea ??r?ard?ttir,actress,Afgr.dama ? bens?nst??,17.0
3786175,Bye Bye Blue Bird,1999,Rosa ? R?gvu,actress,Pensionatv?rtinde,


In [55]:
casts.sample()

Unnamed: 0,title,year,name,type,character,n
3676417,2 Fast 2 Furious,2003,Phuong Tuyet Vo,actress,Suki's Girl,27.0


In [56]:
casts.sample(5)

Unnamed: 0,title,year,name,type,character,n
3569088,Dear Me,2008,Sarah Shoup,actress,Blonder Betty,
2270961,Der Weg ins Freie,1941,Jakob Tiedtke,actor,Director der Oper Bergamo,26.0
3060661,Sucker Punch,2003,Laurie (IV) Jackson,actress,Extra,50.0
2482212,Replicate,2017,Calum Worthy,actor,Randy Foster,2.0
1349059,All the Winners,1920,Sam Livesey,actor,Pedro Darondary,3.0


In [58]:
#this is great, but I want to get specific rows and columns. There are two very important methods to look up something in a dataframe

## loc and iloc

In [None]:
#loc is used to look up by label (the values in the index and the column names)
#iloc is used to look up by integer position (0-based)

In [59]:
#let's use iloc first. Get the first row with iloc. The general syntax for iloc is dataframe.iloc[rows,columns]
casts.iloc[0]

title        Closet Monster
year                   2015
name               Buffy #1
type                  actor
character           Buffy 4
n                        31
Name: 0, dtype: object

In [60]:
#notice that returned a series. Another way to check is
type(casts.iloc[0])

pandas.core.series.Series

In [62]:
#now let's get the rows at positions 5 and 6. Remember the numerical index is 0-based (so these are the 6th and 7th rows) and the end of an iloc slice is excluded
casts.iloc[5:7]

Unnamed: 0,title,year,name,type,character,n
5,Straight Outta Compton,2015,$hutter,actor,Club Patron,
6,Straight Outta Compton,2015,$hutter,actor,Dopeman,


### Exercise 1

In [None]:
#great, can you get the 100th and 101st rows using the iloc method?

-----

In [65]:
#that's wonderful. How do we get only the columns we need using iloc? Let's say we only want the title column for the rows at positions 5 and 6
casts.iloc[5:7,0]

5    Straight Outta Compton
6    Straight Outta Compton
Name: title, dtype: object

In [68]:
#notice we again got a series. To get a dataframe, we can specify a range as the second parameter.
casts.iloc[5:7, 0:1]

Unnamed: 0,title
5,Straight Outta Compton
6,Straight Outta Compton


In [72]:
#What if we want title and year?
casts.iloc[5:7,0:2]

Unnamed: 0,title,year
5,Straight Outta Compton,2015
6,Straight Outta Compton,2015


In [73]:
#what if we want name and character?
casts.iloc[5:7,[2,4]]

Unnamed: 0,name,character
5,$hutter,Club Patron
6,$hutter,Dopeman
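
To recap the iloc patterns in one place, here is a sketch on a tiny made-up frame so you can check each result by eye:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2001, 2002, 2003, 2004],
    "n": [1.0, 2.0, 3.0, 4.0],
})

one_cell = toy.iloc[0, 0]            # single row, single column: a scalar
one_column = toy.iloc[1:3, 0]        # row slice, single column: a Series
as_frame = toy.iloc[1:3, 0:1]        # row slice, column slice: a DataFrame
picked = toy.iloc[[0, 3], [0, 2]]    # lists pick non-adjacent rows and columns

print(one_cell)
print(type(one_column).__name__, type(as_frame).__name__)
print(picked)
```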


### Exercise 2

In [None]:
#can you get title, year and character for the 200th and 300th rows?

----

In [74]:
#let's try loc now. The general syntax remains the same dataframe.loc[rows,columns]. We now use labels instead of location or numerical index
#let's get the first row again
casts.loc[0]

title        Closet Monster
year                   2015
name               Buffy #1
type                  actor
character           Buffy 4
n                        31
Name: 0, dtype: object

In [76]:
#whoa! why did that work? It just so happens that in this case, the label (the index) is the same as the numerical location
casts.index

RangeIndex(start=0, stop=3786176, step=1)

In [78]:
#try it for the columns? It will give you an error. 
# casts.loc[0,0]

In [82]:
#now, let's get the title for the first two rows. Notice that with loc the end of the slice is inclusive! Labels are not necessarily numbers, so there is no obvious "one past the end" label to stop at
casts.loc[0:1,'title']

0     Closet Monster
1    Suuri illusioni
Name: title, dtype: object
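
The difference between loc and iloc is easier to see when the labels do not match the positions. A sketch on a made-up frame whose index is 10, 20, 30:

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "year": [2001, 2002, 2003]},
    index=[10, 20, 30],              # labels deliberately differ from positions
)

by_label = toy.loc[10:20, "title"]   # labels 10 and 20, end included
by_position = toy.iloc[0:1, 0]       # position 0 only, end excluded

print(by_label.tolist())
print(by_position.tolist())
```

Here `toy.loc[0]` would raise a `KeyError` because no row is labelled 0, while `toy.iloc[0]` still returns the first row.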

### Exercise 3

In [None]:
#grab the 200th and 201st rows with character and the n value

<hr/>

## Conditional Lookups!

## Let's chart!

In [22]:
%matplotlib inline