## Some Python if you need a quick refresher

Let's look at some basic data structures in Python:

- String (`str`; immutable sequence of characters)
- List (mutable, order maintained)
- Tuple (immutable, order maintained)
- Set (mutable, unordered, unique items)
- Dictionary (mutable, key-value pairs)

    


In [59]:
#this is a string (str)
hackathon = "spectra"
print('length of the string:', len(hackathon))
print('first character of the string:', hackathon[0])
print('last character of the string:', hackathon[-1])
print('capitalize string: ', hackathon.capitalize())
print('upper case string: ', hackathon.upper())
print('find tra in the string: ', hackathon.find('tra')) #start counting from 0
#Many more methods available. Read the docs - https://docs.python.org/3/library/stdtypes.html#string-methods

length of the string: 7
first character of the string: s
last character of the string: a
capitalize string:  Spectra
upper case string:  SPECTRA
find tra in the string:  4


In [60]:
#this is a list
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"]
print('number of states:', len(states))
print('first state: ', states[0])
print('last state: ', states[-1])
print('list contains California?', 'California' in states)
print('where is California in the list?', states.index('California'))
print('list first three states:', states[:3])
print('list last three states:', states[-3:])

#more advanced
print('filter list to only states starting with C:', list(filter(lambda x: x.startswith('C'),states)))
print('filter list to only states starting with C:', [state for state in states if state.startswith('C')])
#Many more methods available. Read the docs - https://docs.python.org/3/tutorial/datastructures.html

number of states: 9
first state:  Alabama
last state:  Florida
list contains California? True
where is California in the list? 4
list first three states: ['Alabama', 'Alaska', 'Arizona']
list last three states: ['Connecticut', 'Delaware', 'Florida']
filter list to only states starting with C: ['California', 'Colorado', 'Connecticut']
filter list to only states starting with C: ['California', 'Colorado', 'Connecticut']


In [61]:
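#this is a dictionary. Values are looked up by key, not by position, so first and last make little sense
#(dictionaries do keep insertion order since Python 3.7, but you rarely rely on it)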
states = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA' : 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida'
}
print('keys in states:', states.keys())
print('values in states:', states.values())
print('# states in dict:', len(states))
print('find ca:', states['CA'])
print('is FL in the dict:', 'FL' in states)
print('is Florida in the dict: ', 'Florida' in states.values())
#Many more methods available. Read the docs - https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries

keys in states: dict_keys(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL'])
values in states: dict_values(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida'])
# states in dict: 9
find ca: California
is FL in the dict: True
is Florida in the dict:  True
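
The overview list also names tuples and sets, which have no cell of their own. Here is a minimal sketch of both. The values (`point`, `regions`) are made up for illustration and are not part of the notebook's data.

```python
# A tuple is an ordered, immutable sequence
point = ("Alabama", "AL")
name, abb = point  # unpack the two items into separate variables
print('tuple length:', len(point))
print('first item:', point[0])

try:
    point[0] = "Alaska"  # item assignment is not allowed on a tuple
except TypeError as err:
    print('tuples cannot be modified:', err)

# A set keeps only unique items and has no guaranteed order
regions = {"South", "West", "West", "South", "Northeast"}
print('number of unique regions:', len(regions))
print('sorted regions:', sorted(regions))
print('is West in the set?', 'West' in regions)

# Set operators: | is union, & is intersection
print('union:', sorted(regions | {"Midwest"}))
print('intersection:', sorted(regions & {"West", "Midwest"}))
```

Sorting the set before printing gives a stable display order, since a set itself does not promise one.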


## Data structures in Pandas - Series

In [62]:
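#imports needed from here on (skip if already imported earlier in the notebook)
import numpy as np
import pandas as pd

#This is a series. Think of this as a list with an index!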
states = pd.Series(["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"])
states

0        Alabama
1         Alaska
2        Arizona
3       Arkansas
4     California
5       Colorado
6    Connecticut
7       Delaware
8        Florida
dtype: object

In [63]:
#you can give the data your own index!
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"]
abb = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL']
states = pd.Series(data=states, index=abb)
states


AL        Alabama
AK         Alaska
AZ        Arizona
AR       Arkansas
CA     California
CO       Colorado
CT    Connecticut
DE       Delaware
FL        Florida
dtype: object

In [77]:
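#slice and dice. Use .iloc for positional access; plain states[0] is deprecated on a label-indexed series
print('first state:', states.iloc[0])
print('first index:', states.index[0])
print('\n\n')
print('first three states:')
print(states.iloc[0:3])
print('\n\n')
print('last three states:')
print(states.iloc[-3:])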

first state: Alabama
first index: AL



first three states:
AL    Alabama
AK     Alaska
AZ    Arizona
dtype: object



last three states:
CT    Connecticut
DE       Delaware
FL        Florida
dtype: object
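
Once a series has its own index, there are two ways to reach an item: by label or by position. The sketch below uses a shortened three-state series (an assumption made to keep the example small) to contrast `.loc` and `.iloc`.

```python
import pandas as pd

states = pd.Series(
    ["Alabama", "Alaska", "Arizona"],
    index=["AL", "AK", "AZ"],
)

by_label = states.loc["AK"]     # look up by index label
by_position = states.iloc[1]    # look up by integer position
print(by_label, by_position)

# Label slices include BOTH endpoints; position slices exclude the end
label_slice = states.loc["AL":"AK"]
position_slice = states.iloc[0:2]
print(label_slice.equals(position_slice))
```

Both slices select the same two rows here, but note the different end-point rules: `"AL":"AK"` includes `AK`, while `0:2` stops before position 2.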


In [79]:
#you can create series out of comprehensions and anything else that returns an iterable
fours = pd.Series([x**4 for x in range(6)])
fours

0      0
1      1
2     16
3     81
4    256
5    625
dtype: int64

In [89]:
#numpy has a random number generator that is handy for creating dummy data. np.random.rand returns an np.ndarray
rand = pd.Series(np.random.rand(5))
rand


0    0.028778
1    0.082763
2    0.221035
3    0.956849
4    0.146512
dtype: float64
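
Random dummy data changes on every run, which can make a notebook hard to reproduce. One option, sketched here, is NumPy's seeded `Generator` from `np.random.default_rng`. The seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same sequence every time it is created
rng = np.random.default_rng(seed=42)
rand = pd.Series(rng.random(5))  # 5 floats in the half-open interval [0, 1)

# Re-creating the generator with the same seed reproduces the values
again = pd.Series(np.random.default_rng(seed=42).random(5))
print(rand.equals(again))
```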

## Data structures in Pandas - Dataframes

In [93]:
#a dataframe is like a table
state_data = {
    'abb':['AL','AK','AZ','AR','CA','CO','CT','DE','FL'],
    'name':["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida"],
    'temp': np.random.rand(9)
}
df_state = pd.DataFrame(state_data)
df_state

   abb         name      temp
0   AL      Alabama  0.632735
1   AK       Alaska  0.699852
2   AZ      Arizona  0.622987
3   AR     Arkansas  0.852903
4   CA   California  0.415096
5   CO     Colorado  0.530582
6   CT  Connecticut  0.910025
7   DE     Delaware  0.769109
8   FL      Florida  0.614724
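
Since a dataframe behaves like a table, the most common first steps are picking columns and filtering rows. A minimal sketch follows, using a three-row frame with made-up `temp` values rather than the random ones above.

```python
import pandas as pd

df_state = pd.DataFrame({
    'abb': ['AL', 'AK', 'AZ'],
    'name': ['Alabama', 'Alaska', 'Arizona'],
    'temp': [0.63, 0.70, 0.62],  # made-up values for illustration
})

names = df_state['name']             # a single column comes back as a Series
subset = df_state[['abb', 'temp']]   # a list of columns comes back as a DataFrame

# A boolean condition on a column keeps only the rows where it is True
warm = df_state[df_state['temp'] > 0.625]

print(type(names).__name__)
print(subset.shape)
print(warm['abb'].tolist())
```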


In [100]:
df = pd.DataFrame(data = np.random.randn(2,2), columns=['Temp', 'Humidity'])
df

       Temp   Humidity
0  1.815043   0.137917
1  1.586126  -0.788654


In [101]:
#the very cool thing about dataframes is that you can operate on them as a single unit. NO NEED TO ITERATE EXPLICITLY!
df = df * 10
df

        Temp   Humidity
0  18.150432   1.379165
1  15.861262  -7.886544
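
To see what "no need to iterate" saves you, the sketch below compares the one-line multiplication with an explicit nested loop over every cell. The small frame and its values are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({'Temp': [1.5, 2.0], 'Humidity': [0.5, -1.0]})

# Vectorized: the multiplication is applied to every cell at once
scaled = df * 10

# The explicit-loop equivalent, for comparison only
looped = df.copy()
for col in looped.columns:
    for row in looped.index:
        looped.loc[row, col] = looped.loc[row, col] * 10

print(scaled.equals(looped))

# Whole columns can also be combined element by element
df['Sum'] = df['Temp'] + df['Humidity']
print(df['Sum'].tolist())
```

The vectorized form is shorter and, on large frames, usually much faster than looping in Python.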


In [108]:
#you can use the apply method to perform a more complex operation. Run this cell to see how apply works!
#Essentially, it takes a function as an argument and applies it to every column (or to every row with axis=1)
?df.apply()

Signature: df.apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds)
Docstring:
Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is
either the DataFrame's index (``axis=0``) or the DataFrame's columns
(``axis=1``). By default (``result_type=None``), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the `result_type` argument.

Parameters
----------
func : function
    Function to apply to each column or row.
axis : {0 or 'index', 1 or 'columns'},

In [105]:
def add100(x):
    return x + 100

In [106]:
#let's apply this function to every column of the dataframe
df.apply(add100)

         Temp    Humidity
0  118.150432  101.379165
1  115.861262   92.113456


In [109]:
#notice that the original dataframe does not change. This can get tricky.
#If you want to change the df, assign the result back to it. Some methods also accept inplace=True to modify the object in place
df

        Temp   Humidity
0  18.150432   1.379165
1  15.861262  -7.886544
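
The two ways of making a change stick can be sketched as follows. The frame uses rounded stand-in values, and `sort_values` is just one example of a method that accepts `inplace=True`.

```python
import pandas as pd

df = pd.DataFrame({'Temp': [18.0, 15.0], 'Humidity': [1.0, -7.0]})

# 1) Calling apply without assignment: the result is discarded, df is unchanged
df.apply(lambda x: x + 100)
unchanged = df['Temp'].tolist()

# 2) Assigning the result back replaces df with the new frame
df = df.apply(lambda x: x + 100)

# 3) inplace=True modifies df directly; the method then returns None
result = df.sort_values('Temp', inplace=True)

print(unchanged)
print(df['Temp'].tolist())
print(result)
```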


In [111]:
#you can also use a lambda (anonymous) function to do the same
df.apply(lambda x: x+100)

         Temp    Humidity
0  118.150432  101.379165
1  115.861262   92.113456
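
The `axis` parameter from the docstring decides what the function receives: each column (`axis=0`, the default) or each row (`axis=1`). A small sketch with stand-in values:

```python
import pandas as pd

df = pd.DataFrame({'Temp': [18.0, 15.0], 'Humidity': [1.0, -7.0]})

# axis=0 (default): the function is called once per column
col_max = df.apply(lambda col: col.max())

# axis=1: the function is called once per row, with the row as a Series
row_spread = df.apply(lambda row: row['Temp'] - row['Humidity'], axis=1)

print(col_max.to_dict())
print(row_spread.tolist())
```

For simple arithmetic like this, `df['Temp'] - df['Humidity']` gives the same row result without `apply`; row-wise `apply` is mainly useful when the logic cannot be expressed with column operations.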


## Let's load in our data files!

In [112]:
casts_url = "https://ibm.box.com/shared/static/569iue5znz5angfxaaojbd7olgegk0bz.csv"
release_dates_url = "https://ibm.box.com/shared/static/fxu6rhfktvjs0uvgtbhjsp5g5k9qgjh1.csv"
titles_url = "https://ibm.box.com/shared/static/cw3wqtzuljiyqz4kbuk26ojrrm9rzfow.csv"
film_locations_url = "https://ibm.box.com/shared/static/kcot1vu0r1tusff85m5shrr7ehsee8np.csv"

In [None]:
casts = pd.read_csv(casts_url)

In [None]:
release = pd.read_csv(release_dates_url)

In [None]:
titles = pd.read_csv(titles_url)

In [None]:
locations = pd.read_csv(film_locations_url)
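
After loading, it helps to check what actually arrived before doing any analysis. The sketch below runs the usual checks on a tiny in-memory CSV so it works without a network connection. The column names `title` and `year` are hypothetical and are not taken from the real files.

```python
import io
import pandas as pd

# A stand-in CSV; pd.read_csv accepts file-like objects as well as URLs
csv_text = "title,year\nExample A,1999\nExample B,2005\n"
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
print(sample.head())   # the first rows (up to 5 by default)
```

The same three calls can be run on `casts`, `release`, `titles`, and `locations` once they are loaded.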