In [1]:
import pandas as pd
from pandas import Series, DataFrame
# We can explicitly import Series and DataFrame; why might we do this?

###  Series Review


#### Series from `list`

In [2]:
scores_list = [54, 22, 19, 73, 80]
scores_series = Series(scores_list)
scores_series

# what name do we call the  0, 1, 2, ... ??       A:  index
# what name do we call the  54, 22, 19, .... ??   A:  value

0    54
1    22
2    19
3    73
4    80
dtype: int64

#### Selecting certain scores.
What are all the scores `> 50`?

In [3]:
scores_series[scores_series > 50]

0    54
3    73
4    80
dtype: int64

**Answer:** Boolean indexing. Try the following...

In [4]:
scores_series[[True, True, False, False, True]] # often called a "mask"

0    54
1    22
4    80
dtype: int64

We are really writing a "mask" for our data.

In [5]:
scores_series > 50

0     True
1    False
2    False
3     True
4     True
dtype: bool

In [6]:
scores_series[scores_series > 50]

0    54
3    73
4    80
dtype: int64
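A minimal sketch of the same idea, assuming `scores_series` is rebuilt locally: the comparison produces a Boolean Series (the mask), and `~` inverts it. The names `high` and `low_scores` are made up for illustration.

```python
import pandas as pd

scores_series = pd.Series([54, 22, 19, 73, 80])

# A comparison yields a Boolean Series (the "mask") on the same index
high = scores_series > 50

# ~ flips every True/False in the mask
low_scores = scores_series[~high]

print(high.tolist())
print(low_scores.tolist())
```

The mask has one entry per row of the Series, which is why it can be used to pick rows.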

#### Series from `dict`

In [7]:
# Imagine we hire students and track their weekly hours
week1 = Series({"Rita":5, "Therese":3, "Janice": 6})
week2 = Series({"Rita":3, "Therese":7, "Janice": 4})
week3 = Series({"Therese":5, "Janice":5, "Rita": 8}) # Wrong order! Will this matter?
print(week1)
print(week2)
print(week3)

Rita       5
Therese    3
Janice     6
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64


####  For everyone in Week 1, add 3 to their hours 

In [8]:
week1 = week1 + 3
week1

Rita       8
Therese    6
Janice     9
dtype: int64

#### Total up everyone's hours

In [9]:
total_hours = week1 + week2 + week3
total_hours

Janice     18
Rita       19
Therese    18
dtype: int64

#### What is week1 / week3 ?

In [10]:
week1 / week3
# Notice that we didn't have to worry about the order of indices

Janice     1.8
Rita       1.0
Therese    1.2
dtype: float64
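Arithmetic matches entries by label, not by position. The sketch below shows what may happen when the two Series do not have the same labels. The Series `week4` and the name "Omar" are hypothetical and not part of the lecture data.

```python
import pandas as pd

week1 = pd.Series({"Rita": 8, "Therese": 6, "Janice": 9})
# Hypothetical extra week: Janice is missing, Omar (made up) is new
week4 = pd.Series({"Therese": 2, "Rita": 1, "Omar": 4})

# Labels are matched by name; a label missing on either side gives NaN
plain_sum = week1 + week4

# Series.add with fill_value treats a missing entry as 0 instead
filled_sum = week1.add(week4, fill_value=0)

print(plain_sum)
print(filled_sum)
```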

#### What type of values are stored in  week1 > week2?

In [11]:
print(week1)
print(week2)
week1 > week2 # indices are ordered the same

Rita       8
Therese    6
Janice     9
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64


Rita        True
Therese    False
Janice      True
dtype: bool

####  What is week1 > week3?

In [12]:
print(week1)
print(week3)
# week1 > week3 # raises ValueError: labels are not in the same order
# week1.sort_index() > week3.sort_index() # proper way: line up the labels first

Rita       8
Therese    6
Janice     9
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64
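A sketch of the two commented-out lines above, assuming a recent pandas version in which comparing Series with differently ordered labels raises a `ValueError`. The names `raised`, `by_sorting`, and `by_reindex` are made up for illustration.

```python
import pandas as pd

week1 = pd.Series({"Rita": 8, "Therese": 6, "Janice": 9})
week3 = pd.Series({"Therese": 5, "Janice": 5, "Rita": 8})

# Comparing Series whose labels are in a different order is expected to fail
try:
    week1 > week3
    raised = False
except ValueError:
    raised = True

# Option 1: sort both indexes so the labels line up
by_sorting = week1.sort_index() > week3.sort_index()

# Option 2: reorder week3 to follow week1's labels
by_reindex = week1 > week3.reindex(week1.index)

print(raised)
print(by_sorting)
print(by_reindex)
```

Unlike `+` and `/`, comparison operators do not align the labels for us, so we have to line them up before comparing.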



# Lecture 28:  Pandas 2 - DataFrames


Learning Objectives:
- Create a DataFrame from 
 - a dictionary of Series, lists, or dicts
 - a list of Series, lists, dicts
- Select a column, row, cell, or rectangular region of a DataFrame
- Convert CSV files into DataFrames and DataFrames into CSV Files
- Access the head or tail of a DataFrame

**Big Idea**: DataFrames store 2-dimensional data in tables! A DataFrame is a collection of Series that share the same index.
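To make the "collection of Series" idea concrete, here is a small sketch using made-up player data: pulling one column out of a DataFrame gives back a Series with the same row index.

```python
import pandas as pd

df = pd.DataFrame({
    "Player name": ["Alice", "Bob", "Cindy", "Dan"],
    "Score": [6, 7, 8, 9],
})

# Selecting one column gives back a Series
score_column = df["Score"]

print(type(score_column).__name__)
print(df.shape)  # (number of rows, number of columns)
# Every column shares the DataFrame's row index
print(score_column.index.equals(df.index))
```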

## You can create a DataFrame in a variety of ways!
### From a dictionary of Series

In [13]:
names = Series(["Alice", "Bob", "Cindy", "Dan"])
scores = Series([6, 7, 8, 9])

# to make a dictionary of Series, need to write column names for the keys
DataFrame({
    "Player name": names,
    "Score": scores
})

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### From a dictionary of lists

In [14]:
name_list = ["Alice", "Bob", "Cindy", "Dan"]
score_list = [6, 7, 8, 9]
# this is the same as above, reminding us that Series act like lists
DataFrame({
    "Player name": name_list,
    "Score": score_list
})

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### From a dictionary of dictionaries
The keys of each inner dict become the row index, so we need to make up keys that match across the columns.

In [15]:
data = {
    "Player name": {0: "Alice", 1: "Bob", 2: "Cindy", 3: "Dan"},
    "Score": {0: 6, 1: 7, 2: 8, 3: 9}
}
DataFrame(data)

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### From a list of lists
We have to supply the column names ourselves, using `columns=[name1, name2, ...]`.

In [16]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data
DataFrame(data, columns = ["Player name", "Score"])

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### From a list of dicts

In [17]:
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data
DataFrame(data)

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### Explicitly naming the indices
We can use `index=[name1, name2, ...]` to give each row its own index label.

In [18]:
# same data as before, but now each row gets an explicit label
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data
DataFrame(data, index=["A", "B", "C", "D"]) # must have a name for each row

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


### Explicitly naming the columns

In [19]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
DataFrame(data, columns=["Player name", "Score"])


Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


In [20]:
# You try: 
# Make a DataFrame of 4 people you know with different ages
# Give names to both the columns and rows
ages = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
DataFrame(ages, index=["A", "B", "C", "D"], columns=["Name", "Age"])

# Share how you did this with your neighbor
# If you both did it the same way, try it a different way.

Unnamed: 0,Name,Age
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


## Select a column, row, cell, or rectangular region of a DataFrame
### Data lookup: Series
- `s.loc[X]`   <- lookup by pandas index
- `s.iloc[X]`  <- lookup by integer position

In [21]:
hours = Series({"Alice":6, "Bob":7, "Cindy":8, "Dan":9})
hours

Alice    6
Bob      7
Cindy    8
Dan      9
dtype: int64

In [22]:
# Lookup Bob's hours by pandas index.
hours.loc["Bob"]

7

In [23]:
# Lookup Cindy's hours by integer position (Bob is at position 1, Cindy at position 2).
hours.iloc[2]

8

In [24]:
# Lookup Cindy's hours by pandas index.
hours.loc["Cindy"]

8
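The difference between `.loc` and `.iloc` is easiest to see when the labels are integers that do not match the positions. The Series `s` below is a made-up example for that purpose.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not match positions
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[20]      # the entry whose label is 20
by_position = s.iloc[0]   # the first entry, whatever its label is

print(by_label, by_position)
```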

###  Data lookup: DataFrame


- `d.loc[r]`     lookup ROW by pandas ROW index
- `d.iloc[r]`    lookup ROW by ROW integer position
- `d[c]`         lookup COL by pandas COL index
- `d.loc[r, c]`  lookup by pandas ROW index and pandas COL index
- `d.iloc[r, c]`  lookup by ROW integer position and COL integer position

In [25]:
# We often call the object that we make df
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
df = DataFrame(data, index=["A", "B", "C", "D"], columns = ["Player name", "Score"])
df

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


### What are 3 different ways of accessing row D? 

In [26]:
#df["D"] # Nope!
print(df.loc["D"])
print(df.iloc[3])
print(df.iloc[-1])

Player name    Dan
Score            9
Name: D, dtype: object
Player name    Dan
Score            9
Name: D, dtype: object
Player name    Dan
Score            9
Name: D, dtype: object


### How about accessing a column?

In [27]:
df

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


In [28]:
#df[0] # Nope!
print(df["Player name"])

A    Alice
B      Bob
C    Cindy
D      Dan
Name: Player name, dtype: object


### What are 3 different ways to access a single cell?

In [29]:
df

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


In [30]:
# How to access Cindy?
#print(df["C", "Player name"]) # Nope!
print(df.loc["C", "Player name"])
print(df["Player name"].loc["C"])
print(df.iloc[2, 0])

Cindy
Cindy
Cindy


## How to set values for a specific entry?

- `d.loc[r, c] = new_val`
- `d.iloc[r, c] = new_val`

In [31]:
#change player D's name
df.loc["D", "Player name"] = "Bianca"
df

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Bianca,9


In [32]:
# then add 3 to that player's score using .loc
df.loc["D","Score"] += 3
df

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Bianca,12


In [33]:
# add 7 to a different player's score using .iloc
df.iloc[0, 1] += 7
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,7
C,Cindy,8
D,Bianca,12


### Find the max score and the mean score

In [34]:
# find the max and mean of the "Score" column
print(df["Score"].max(), df["Score"].mean())

13 10.0


### Find the highest scoring player

In [35]:
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,7
C,Cindy,8
D,Bianca,12


In [36]:
highest_scorer = df["Score"].idxmax()
df["Player name"].loc[highest_scorer]

'Alice'

##  Slicing a DataFrame

- `df.iloc[ROW_SLICE, COL_SLICE]` <- make a rectangular slice from the DataFrame using integer positions
- `df.loc[ROW_SLICE, COL_SLICE]` <- make a rectangular slice from the DataFrame using index

In [37]:
df.iloc[1:3, 0:2]

Unnamed: 0,Player name,Score
B,Bob,7
C,Cindy,8


In [38]:
df.loc["B":"C", "Player name":"Score"] # notice that this way is inclusive of endpoints

Unnamed: 0,Player name,Score
B,Bob,7
C,Cindy,8


## Set values for sliced DataFrame

- `d.loc[ROW_SLICE, COL_SLICE] = new_val` <- set value by ROW INDEX and COL INDEX
- `d.iloc[ROW_SLICE, COL_SLICE] = new_val` <- set value by ROW Integer position and COL Integer position

In [39]:
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,7
C,Cindy,8
D,Bianca,12


In [40]:
df.loc["B":"C", "Score"] += 5
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,12
C,Cindy,13
D,Bianca,12


### Pandas allows selecting non-contiguous rows (or columns) with a list of labels

In [41]:
# just get Player name for Index B and D
df.loc[["B", "D"],"Player name"]

B       Bob
D    Bianca
Name: Player name, dtype: object

In [42]:
# add 2 to the people in rows B and D
df.loc[["B", "D"],"Score"] += 2
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,14
C,Cindy,13
D,Bianca,14
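A short sketch of list-based selection on both axes, assuming the same table as above rebuilt locally. Passing a list for the columns as well returns a DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame(
    [["Alice", 13], ["Bob", 14], ["Cindy", 13], ["Bianca", 14]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

# Lists of labels pick rows and columns that are not next to each other
by_label = df.loc[["A", "D"], ["Score"]]

# The same selection by integer position
by_position = df.iloc[[0, 3], [1]]

print(by_label)
print(by_label.equals(by_position))
```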


## Boolean indexing on a DataFrame

- `d[BOOL SERIES]`  <- makes a new DataFrame containing only the rows where the Boolean Series is True

In [43]:
df

Unnamed: 0,Player name,Score
A,Alice,13
B,Bob,14
C,Cindy,13
D,Bianca,14


### Make a Series of Booleans based on Score >= 15

In [44]:
b = df["Score"] >= 15
b


A    False
B    False
C    False
D    False
Name: Score, dtype: bool

### Use b to slice the DataFrame
If b is True for a row, that row is included in the new DataFrame. Here no score is >= 15, so the result is empty.

In [45]:
df[b]

Unnamed: 0,Player name,Score


### do the last two things in a single step

In [46]:
df[df["Score"] >= 15]

Unnamed: 0,Player name,Score
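Since no score reaches 15, the results above are empty. The sketch below uses a threshold that does match some rows and combines two conditions. The second condition is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["Alice", 13], ["Bob", 14], ["Cindy", 13], ["Bianca", 14]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

# Each condition needs its own parentheses; use & for "and", | for "or"
mask = (df["Score"] >= 14) & (df["Player name"] != "Bob")
selected = df[mask]

print(selected)
```

Python's `and` / `or` keywords do not work element by element on Series, which is why `&` and `|` are used here.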


## Creating DataFrame from csv

In [47]:
# it's that easy!  
df = pd.read_csv("IMDB-Movie-Data.csv")
df

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


###   View the first few lines of the DataFrame
- `.head(n)` gets the first n lines, 5 is the default

In [48]:
df.head()

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02


### get the first 2 rows

In [49]:
df.head(2)

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M


###   View the last few lines of the DataFrame
- `.tail(n)` gets the last n lines, 5 is the default

In [50]:
df.tail()

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0.0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0.0
997,997,Nine Lives,"Comedy,Family,Fantasy",Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,19.64


In [51]:
df.tail(3)

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0.0
997,997,Nine Lives,"Comedy,Family,Fantasy",Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,19.64


### What is the last year in our DataFrame?

In [52]:
df["Year"].max()

2016

In [53]:
### What are the rows that correspond to movies whose title contains "Harry" ? 
df[df["Title"].str.contains("Harry")]

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
114,114,Harry Potter and the Deathly Hallows: Part 2,"Adventure,Drama,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, M...",2011,130,8.1,380.96
314,314,Harry Potter and the Order of the Phoenix,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, B...",2007,138,7.5,292.0
417,417,Harry Potter and the Deathly Hallows: Part 1,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, B...",2010,146,7.7,294.98
472,472,Harry Potter and the Half-Blood Prince,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, M...",2009,153,7.5,301.96


### What is the movie at index 6 ? 

In [54]:
df.iloc[6]

Index                                                       6
Title                                              La La Land
Genre                                      Comedy,Drama,Music
Director                                      Damien Chazelle
Cast        Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....
Year                                                     2016
Runtime                                                   128
Rating                                                    8.3
Revenue                                               151.06M
Name: 6, dtype: object

In [55]:
df

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


## Notice that there are two index columns
- This happens because when pandas writes a DataFrame to a CSV file, it also writes the index as an extra column
- When that file is read back in, pandas creates a new index, so you end up with two index columns
- Let's fix that problem
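The effect can be reproduced in memory with a small made-up table. This is only a sketch: it uses `io.StringIO` in place of a real file, and in it the leftover column is called `Unnamed: 0`, whereas in the movie file it is labeled `Index`.

```python
import io
import pandas as pd

df = pd.DataFrame({"Title": ["Split", "Sing"], "Year": [2016, 2016]})

# With no path, to_csv returns the CSV text; the row index is written by default
csv_text = df.to_csv()

# Reading it back creates a fresh index and keeps the old one as a column
doubled = pd.read_csv(io.StringIO(csv_text))

# Fix 1: tell read_csv that the first column is the index
fixed_on_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Fix 2: leave the index out when writing
fixed_on_write = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(doubled.columns))
print(list(fixed_on_read.columns))
print(list(fixed_on_write.columns))
```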

### How can you use slicing to get just columns with Title and Year?

In [56]:
df2 = df[["Title", "Year"]]
df2
# notice that this does not have the 'index' column

Unnamed: 0,Title,Year
0,Guardians of the Galaxy,2014
1,Prometheus,2012
2,Split,2016
3,Sing,2016
4,Suicide Squad,2016
...,...,...
993,Secret in Their Eyes,2015
994,Hostel: Part II,2007
995,Step Up 2: The Streets,2008
996,Search Party,2014


### How can you use slicing to get rid of the first column?

In [57]:
df = df.iloc[:, 1:] #all the rows, not column 0
df

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...
993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


### Write a df to a csv file

In [58]:
df.to_csv("better_movies.csv", index = False)

## Practice on your own: Data Analysis with DataFrames


In [59]:
# What are all the movies that have above average run time? 
long_movies = df[df["Runtime"] > df["Runtime"].mean()]
long_movies

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
6,La La Land,"Comedy,Drama,Music",Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,151.06M
...,...,...,...,...,...,...,...,...
977,The Skin I Live In,"Drama,Thriller",Pedro Almodóvar,"Antonio Banderas, Elena Anaya, Jan Cornet,Mari...",2011,120,7.6,3.19
979,Annie,"Comedy,Drama,Family",Will Gluck,"Quvenzhané Wallis, Cameron Diaz, Jamie Foxx, R...",2014,118,5.3,85.91
980,Across the Universe,"Drama,Fantasy,Musical",Julie Taymor,"Evan Rachel Wood, Jim Sturgess, Joe Anderson, ...",2007,133,7.4,24.34
987,Selma,"Biography,Drama,History",Ava DuVernay,"David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain...",2014,128,7.5,52.07


In [60]:
# of these movies, what was the min rating? 
min_rating = long_movies["Rating"].min()
min_rating

3.2

In [61]:
# Which movies had this min rating?
long_movies[long_movies["Rating"] == min_rating]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
646,Tall Men,"Fantasy,Horror,Thriller",Jonathan Holbrook,"Dan Crisafulli, Kay Whitney, Richard Garcia, P...",2016,133,3.2,0


### What are all long_movies with someone in the cast named "Emma" ? 

In [62]:
long_movies[long_movies["Cast"].str.contains("Emma")]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
6,La La Land,"Comedy,Drama,Music",Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,151.06M
92,The Help,Drama,Tate Taylor,"Emma Stone, Viola Davis, Octavia Spencer, Bryc...",2011,146,8.1,169.71M
114,Harry Potter and the Deathly Hallows: Part 2,"Adventure,Drama,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, M...",2011,130,8.1,380.96
157,"Crazy, Stupid, Love.","Comedy,Drama,Romance",Glenn Ficarra,"Steve Carell, Ryan Gosling, Julianne Moore, Em...",2011,118,7.4,84.24
253,The Amazing Spider-Man 2,"Action,Adventure,Sci-Fi",Marc Webb,"Andrew Garfield, Emma Stone, Jamie Foxx, Paul ...",2014,142,6.7,202.85
314,Harry Potter and the Order of the Phoenix,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, B...",2007,138,7.5,292
367,The Amazing Spider-Man,"Action,Adventure",Marc Webb,"Andrew Garfield, Emma Stone, Rhys Ifans, Irrfa...",2012,136,7.0,262.03
417,Harry Potter and the Deathly Hallows: Part 1,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, B...",2010,146,7.7,294.98
472,Harry Potter and the Half-Blood Prince,"Adventure,Family,Fantasy",David Yates,"Daniel Radcliffe, Emma Watson, Rupert Grint, M...",2009,153,7.5,301.96
609,Beautiful Creatures,"Drama,Fantasy,Romance",Richard LaGravenese,"Alice Englert, Viola Davis, Emma Thompson,Alde...",2013,124,6.2,19.45


In [63]:
# What is the title of the shortest movie?
df[df["Runtime"] == df["Runtime"].min()]["Title"]

792    Ma vie de Courgette
Name: Title, dtype: object

In [64]:
# What movie had the highest revenue?
# df["Revenue"].max() did not work
# we need to clean our data

def format_revenue(revenue):
    """Convert a revenue entry (in millions, with an optional trailing "M") to a float."""
    if isinstance(revenue, float): # already converted; needed if we run this code multiple times
        return revenue
    elif revenue[-1] == 'M': # some entries have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6
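The same cleanup can also be sketched without a helper function by using the `.str` methods on the whole column. This is an alternative, not the approach the lecture uses, and `raw` is a small made-up sample shaped like the Revenue column.

```python
import pandas as pd

# Small made-up sample in the same shape as the Revenue column
raw = pd.Series(["333.13", "126.46M", "0"])

# Strip a trailing "M" where present, convert to float, then scale from millions
revenue = raw.str.rstrip("M").astype(float) * 1e6

print(revenue)
print(revenue.max())
```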

In [65]:
# What movie had the highest revenue?
revenue = df["Revenue"].apply(format_revenue) # apply a function to a column
print(revenue.head())
max_revenue = revenue.max()

# make a copy of our df
rev_df = df.copy()
rev_df["Rev as fl"] = revenue
rev_df

0    333130000.0
1    126460000.0
2    138120000.0
3    270320000.0
4    325020000.0
Name: Revenue, dtype: float64


Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Rev as fl
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13,333130000.0
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M,126460000.0
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M,138120000.0
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32,270320000.0
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02,325020000.0
...,...,...,...,...,...,...,...,...,...
993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0,0.0
994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54,17540000.0
995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01,58010000.0
996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0,0.0


In [66]:
# Now we can answer the question!
rev_df[rev_df["Rev as fl"] == max_revenue]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Rev as fl
50,Star Wars: Episode VII - The Force Awakens,"Action,Adventure,Fantasy",J.J. Abrams,"Daisy Ridley, John Boyega, Oscar Isaac, Domhna...",2015,136,8.1,936.63,936630000.0


In [67]:
# Or more generally...
rev_df.sort_values(by="Rev as fl", ascending=False)

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue,Rev as fl
50,Star Wars: Episode VII - The Force Awakens,"Action,Adventure,Fantasy",J.J. Abrams,"Daisy Ridley, John Boyega, Oscar Isaac, Domhna...",2015,136,8.1,936.63,936630000.0
87,Avatar,"Action,Adventure,Fantasy",James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...",2009,162,7.8,760.51,760510000.0
85,Jurassic World,"Action,Adventure,Sci-Fi",Colin Trevorrow,"Chris Pratt, Bryce Dallas Howard, Ty Simpkins,...",2015,124,7.0,652.18,652180000.0
76,The Avengers,"Action,Sci-Fi",Joss Whedon,"Robert Downey Jr., Chris Evans, Scarlett Johan...",2012,143,8.1,623.28,623280000.0
54,The Dark Knight,"Action,Crime,Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,533.32,533320000.0
...,...,...,...,...,...,...,...,...,...
888,The Intent,"Crime,Drama",Femi Oyeniran,"Dylan Duffus, Scorcher,Shone Romulus, Jade Asha",2016,104,3.5,0,0.0
392,Whisky Galore,"Comedy,Romance",Gillies MacKinnon,"Tim Pigott-Smith, Naomi Battrick, Ellie Kendri...",2016,98,5.0,0,0.0
478,Macbeth,"Drama,War",Justin Kurzel,"Michael Fassbender, Marion Cotillard, J...",2015,113,6.7,0,0.0
823,Man Down,"Drama,Thriller",Dito Montiel,"Shia LaBeouf, Jai Courtney, Gary Oldman, Kate ...",2015,90,5.8,0,0.0


In [68]:
df

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...
993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


In [69]:
# What is the average runtime for movies by "Francis Lawrence"?
fl_movies = df[df["Director"] == "Francis Lawrence"]
fl_movies["Runtime"].mean()

126.75

### More complicated questions...

In [70]:
# which director had the highest average rating? 

# one way is to make a python dict of director, list of ratings
director_dict = dict()

# make the dictionary: key is director, value is list of ratings
for i in range(len(df)):
    director = df.loc[i, "Director"]
    rating = df.loc[i, "Rating"]
    #print(i, director, rating)
    if director not in director_dict:
        director_dict[director] = []
    director_dict[director].append(rating)

# make a ratings dict: key is director, value is average rating
# only include directors with more than 4 movies
ratings_dict = {k:sum(v)/len(v) for (k,v) in director_dict.items() if len(v) > 4}

#sort a dict by values
dict(sorted(ratings_dict.items(), key=lambda t:t[-1], reverse=True))
    

{'Christopher Nolan': 8.680000000000001,
 'Martin Scorsese': 7.92,
 'David Fincher': 7.8199999999999985,
 'Denis Villeneuve': 7.76,
 'J.J. Abrams': 7.58,
 'David Yates': 7.433333333333334,
 'Danny Boyle': 7.42,
 'Antoine Fuqua': 7.040000000000001,
 'Zack Snyder': 7.040000000000001,
 'Woody Allen': 7.019999999999999,
 'Peter Berg': 6.860000000000001,
 'Ridley Scott': 6.85,
 'Justin Lin': 6.82,
 'Michael Bay': 6.483333333333334,
 'Paul W.S. Anderson': 5.766666666666666,
 'M. Night Shyamalan': 5.533333333333332}

In [71]:
# FOR DEMONSTRATION PURPOSES ONLY
# We haven't learned (and will not learn) about "groupby"
# Pandas has many operations which will be helpful!

# Consider what you already know, and what Pandas can solve
# when formulating your solutions.
rating_groups = df.groupby("Director")["Rating"]
rating_groups.mean()[rating_groups.count() > 4].sort_values(ascending=False)

Director
Christopher Nolan     8.680000
Martin Scorsese       7.920000
David Fincher         7.820000
Denis Villeneuve      7.760000
J.J. Abrams           7.580000
David Yates           7.433333
Danny Boyle           7.420000
Antoine Fuqua         7.040000
Zack Snyder           7.040000
Woody Allen           7.020000
Peter Berg            6.860000
Ridley Scott          6.850000
Justin Lin            6.820000
Michael Bay           6.483333
Paul W.S. Anderson    5.766667
M. Night Shyamalan    5.533333
Name: Rating, dtype: float64
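For the curious, the same pattern can be sketched on a tiny made-up table, where the numbers are easy to check by hand. The director names and ratings below are invented and are not from the movie data.

```python
import pandas as pd

# Made-up ratings, only to show the mechanics
df = pd.DataFrame({
    "Director": ["Ann", "Ann", "Ann", "Ben", "Ben", "Cy"],
    "Rating": [8.0, 7.0, 9.0, 6.0, 5.0, 9.5],
})

# One row per director, with the mean rating and the number of movies
summary = df.groupby("Director")["Rating"].agg(["mean", "count"])

# Keep directors with at least 2 movies, best average first
ranked = summary[summary["count"] >= 2].sort_values("mean", ascending=False)

print(ranked)
```

Cy has the highest rating but only one movie, so the count filter removes that row, just as the `> 4` filter does above.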