# Outline for Monday, April 5
## Pandas 2 - DataFrame

Core ideas:
 - DataFrames aka tables (topic for Monday)
     - built from series
     - each series will be a column in the table

Remember to go back and watch the last Pandas 1 - Series video!

## Data alignment review (element-wise operation: series op series)

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [None]:
x = Series({"A":10, "B":100})
s1 = Series({"A":2, "B":3})
s2 = Series({"B":3, "A":2})
print(x)
print(s1)
print(s2)

## What is x * s1?

## What is x * s2?

## What is x < s1?

## What is x < s2?
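A minimal sketch answering the four questions above, reusing x, s1, and s2 from the previous cell. Arithmetic aligns on index labels, while the comparison operators require identically labeled Series and raise a ValueError otherwise. The variable names prod1, prod2, less1, and comparison_error are only for illustration.

```python
from pandas import Series

x = Series({"A": 10, "B": 100})
s1 = Series({"A": 2, "B": 3})
s2 = Series({"B": 3, "A": 2})  # same labels as s1, different order

# Arithmetic aligns on index labels, so the label order does not matter.
prod1 = x * s1
prod2 = x * s2

# Comparison operators work when both indexes are identical...
less1 = x < s1

# ...but raise ValueError when the labels are ordered differently.
try:
    x < s2
    comparison_error = None
except ValueError as e:
    comparison_error = e

print(prod1, prod2, less1, repr(comparison_error), sep="\n\n")
```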

## Oops, let's try series.lt(series)

## How would you apply greater than?

## What about equal comparison?

## ge (>=), le (<=), and ne (!=) are the other options
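As a sketch of the comparison methods listed above: unlike the operators, these methods align the two Series by label first, so they also work on x and s2. The result names (less, greater, and so on) are only for illustration.

```python
from pandas import Series

x = Series({"A": 10, "B": 100})
s2 = Series({"B": 3, "A": 2})

# Each method compares values that share the same index label.
less = x.lt(s2)           # 10 < 2 and 100 < 3
greater = x.gt(s2)        # 10 > 2 and 100 > 3
equal = x.eq(s2)          # 10 == 2 and 100 == 3
greater_equal = x.ge(s2)
less_equal = x.le(s2)
not_equal = x.ne(s2)

print(less, greater, equal, sep="\n\n")
```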

## Can we fix the ordering of s2 series?
### Try series.sort_index()
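One possible way to do this, sketched with the same x and s2 as above: sort_index() returns a new Series ordered by label, after which the two indexes match and the plain comparison operator works. The name s2_sorted is only for illustration.

```python
from pandas import Series

x = Series({"A": 10, "B": 100})
s2 = Series({"B": 3, "A": 2})

# sort_index() returns a new Series; s2 itself is left unchanged.
s2_sorted = s2.sort_index()

# The indexes are now identical, so the operator no longer raises.
result = x < s2_sorted
print(s2_sorted)
print(result)
```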

## DataFrame can be created from:
1. dict of Series
2. dict of lists
3. list of lists
4. dict of dicts
5. list of dicts

### DataFrame from dictionary of Series

In [3]:
col1 = Series(["Alice", "Bob", "Cindy", "Dan"])
col2 = Series([6, 7, 8, 9])
DataFrame({
    "Player name":col1,
    "Score":col2
})

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


In [4]:
col1 = Series({"A":"Alice","C":"Cindy","D":"Dan"})
col2 = Series({"A":6,"B":7,"C":8})
DataFrame({
    "Player name":col1,
    "Score":col2
})

Unnamed: 0,Player name,Score
A,Alice,6.0
B,,7.0
C,Cindy,8.0
D,Dan,
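The output above follows from label alignment: the row index is the union of both Series indexes, and any label missing from one Series shows up as a missing value (NaN). A short sketch that checks this on the same data; the name df is only for illustration.

```python
import pandas as pd
from pandas import Series, DataFrame

col1 = Series({"A": "Alice", "C": "Cindy", "D": "Dan"})
col2 = Series({"A": 6, "B": 7, "C": 8})
df = DataFrame({"Player name": col1, "Score": col2})

# Row labels are the union of both indexes.
print(list(df.index))

# "B" has no player name and "D" has no score, so both cells are missing.
print(df.isna())

# A missing value forces the integer scores to be stored as floats.
print(df["Score"].dtype)
```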


### DataFrame from dictionary of lists

In [5]:
DataFrame({
    "Player name": ["Alice", "Bob", "Cindy", "Dan"],
    "Score":[6, 7, 8, 9]
})

Unnamed: 0,Player name,Score
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### DataFrame from list of lists

In [6]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data
DataFrame(data)

Unnamed: 0,0,1
0,Alice,6
1,Bob,7
2,Cindy,8
3,Dan,9


### DataFrame from dictionary of dicts

In [8]:
data = {
    "Player name": {"A": "Alice", "B": "Bob", 2: "Cindy", 3: "Dan"},
    "Score": {"A": 6, "B": 7, 2: 8, 3: 9}
}
data
DataFrame(data)

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
2,Cindy,8
3,Dan,9


### DataFrame from list of dicts

In [10]:
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "SCORE": 8},
    {"Player name": "Dan", "Score": 9}
]
data
DataFrame(data)

Unnamed: 0,Player name,Score,SCORE
0,Alice,6.0,
1,Bob,7.0,
2,Cindy,,8.0
3,Dan,9.0,


### Setting the row index and column names

In [13]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data
d = DataFrame(data, index = ["A","B","C","D"], columns = ["Player name","Score"])
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


## Data lookup

### Series
- s.loc[X]   <- lookup by pandas index
- s.iloc[X]  <- lookup by integer position
- s[X]       <- depends (first try index, use integer position only if the index has no integer labels; newer pandas versions deprecate this fallback, so prefer .loc or .iloc)

In [None]:
col1 = Series({"Alice":6, "Bob":7, "Cindy":8, "Dan":9})
col1

In [None]:
col1.loc["Bob"] #Series index

In [None]:
col1.iloc[2] #Series integer position

In [None]:
col1["Cindy"] #Series index

In [None]:
col1[1] #Series integer position 
#No conflict between index and integer position in this example!
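For contrast, here is a sketch of a case where the index and the integer positions do conflict. The Series s and its values are hypothetical. When the index labels are themselves integers, s[X] is treated as a label lookup, so .loc and .iloc are the unambiguous choices.

```python
from pandas import Series

# Hypothetical Series whose integer labels differ from the positions.
s = Series(["x", "y", "z"], index=[2, 0, 1])

by_label = s.loc[0]      # row whose label is 0
by_position = s.iloc[0]  # first row, whatever its label
by_brackets = s[0]       # integer index, so this is a label lookup

print(by_label, by_position, by_brackets)
```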

### DataFrame
- d.loc[X]    <- lookup ROW by pandas ROW index
- d.iloc[X]   <- lookup ROW by ROW integer position
- d[X]        <- lookup COL by COL index
- d.loc[X, Y] <- lookup by ROW index and COL index
- d.iloc[X, Y] <- lookup by ROW integer position and COL integer position

In [14]:
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,7
C,Cindy,8
D,Dan,9


In [15]:
d.loc["A"]

Player name    Alice
Score              6
Name: A, dtype: object

In [16]:
type(d.loc["A"])

pandas.core.series.Series

In [17]:
d.iloc[1]

Player name    Bob
Score            7
Name: B, dtype: object

In [18]:
d.iloc[-1]

Player name    Dan
Score            9
Name: D, dtype: object

In [19]:
d["Score"] #COLUMN INDEX

A    6
B    7
C    8
D    9
Name: Score, dtype: int64

In [20]:
d["Scores"] #KeyError - this column label does not exist

KeyError: 'Scores'

In [21]:
d.loc["B","Player name"]

'Bob'

In [22]:
type(d.loc["B","Player name"]) #ROW index, COL index

str

In [23]:
d.iloc[-1,0] #ROW integer position, COL integer position

'Dan'

## How to set values for a specific entry?

- d.loc[X, Y] = new_val <- set value by ROW index and COL index
- d.iloc[X, Y] = new_val <- set value by ROW integer position and COL integer position

In [25]:
d.loc["B","Score"] = 12
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,12
C,Cindy,8
D,Dan,9


In [27]:
d.loc["B","Score"] += 3 #output shows 18 rather than 15, presumably because this cell was run twice (In [26] is missing)
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,18
C,Cindy,8
D,Dan,9


In [28]:
d.iloc[-1,1] += 2
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,18
C,Cindy,8
D,Dan,11


## How to compute max score?

In [29]:
d["Score"].max()

18

## How to compute mean score?

In [31]:
d["Score"].mean()

10.75

## Slicing DataFrame

- df.iloc[ROW_SLICE, COL_SLICE] <- take a rectangular slice from the DataFrame using integer positions (end position excluded, like list slicing)
- df.loc[ROW_SLICE, COL_SLICE] <- take a rectangular slice from the DataFrame using index labels (end label included)
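One detail worth checking by hand: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A small sketch on a fresh copy of the player table with the original scores; the names by_label, by_position, and one_row are only for illustration.

```python
from pandas import DataFrame

d = DataFrame(
    [["Alice", 6], ["Bob", 7], ["Cindy", 8], ["Dan", 9]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

by_label = d.loc["B":"C", :]   # rows B and C: the end label is included
by_position = d.iloc[1:3, :]   # positions 1 and 2: position 3 is excluded
one_row = d.iloc[1:2, :]       # only position 1

print(by_label)
print(by_position)
print(one_row)
```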

In [32]:
d.iloc[1:3,1:]

Unnamed: 0,Score
B,18
C,8


In [33]:
d.loc["B":"C", :]

Unnamed: 0,Player name,Score
B,Bob,18
C,Cindy,8


## How to set values for sliced DataFrame?

- d.loc[ROW_SLICE, COL_SLICE] = new_val <- set values by ROW index slice and COL index slice
- d.iloc[ROW_SLICE, COL_SLICE] = new_val <- set values by ROW integer position slice and COL integer position slice

In [34]:
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,18
C,Cindy,8
D,Dan,11


In [36]:
d.loc["B":"C", "Score"] += 5
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,23
C,Cindy,13
D,Dan,11


## Instead of a slice, you could use a list of indexes or integer positions.

In [37]:
d.loc[["B","D"],"Player name"]

B    Bob
D    Dan
Name: Player name, dtype: object

In [38]:
d.loc[["B","D"],"Score"] += 2
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,25
C,Cindy,13
D,Dan,13


## Boolean indexing

### Series
- s[BOOL SERIES]  <- gets all s values lined up with True
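A short sketch of the Series case, using a hypothetical Series of scores keyed by player name. The names scores, mask, and high are only for illustration.

```python
from pandas import Series

scores = Series({"Alice": 6, "Bob": 25, "Cindy": 13, "Dan": 13})

# Comparing a Series with a number gives a boolean Series with the same index.
mask = scores >= 13

# Indexing with the boolean Series keeps only the entries lined up with True.
high = scores[mask]
print(high)
```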

### DataFrame
- d[BOOL SERIES]  <- pulls out rows lined up with True

In [39]:
d

Unnamed: 0,Player name,Score
A,Alice,6
B,Bob,25
C,Cindy,13
D,Dan,13


In [43]:
b = d["Score"] >= 13
b

A    False
B     True
C     True
D     True
Name: Score, dtype: bool

In [44]:
d[b]

Unnamed: 0,Player name,Score
B,Bob,25
C,Cindy,13
D,Dan,13


## Creating DataFrame from csv

In [45]:
d = pd.read_csv("IMDB-Movie-Data.csv")
d

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


## How to see first few lines of the DataFrame?

In [46]:
d.head()

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02


In [47]:
d.head(2)

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M


## How to see last few lines of the DataFrame?

In [48]:
d.tail()

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0.0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0.0
997,997,Nine Lives,"Comedy,Family,Fantasy",Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,19.64


In [49]:
d.tail(7)

Unnamed: 0,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
991,991,Resident Evil: Afterlife,"Action,Adventure,Horror",Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,60.13
992,992,Project X,Comedy,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,54.72
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0.0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0.0
997,997,Nine Lives,"Comedy,Family,Fantasy",Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,19.64


In [50]:
d["Year"]

0      2014
1      2012
2      2016
3      2016
4      2016
       ... 
993    2015
994    2007
995    2008
996    2014
997    2016
Name: Year, Length: 998, dtype: int64

In [51]:
d["Rating"]

0      8.1
1      7.0
2      7.3
3      7.2
4      6.2
      ... 
993    6.2
994    5.5
995    6.2
996    5.6
997    5.3
Name: Rating, Length: 998, dtype: float64

## Notice that there are two index columns
- This happens because to_csv writes the DataFrame's row index as an extra column by default
- When that file is read back with read_csv, the saved index becomes an ordinary column next to a new default index, so you see two index columns
- Let's fix that problem

### How can you use slicing to get rid of the first column?

In [52]:
df2 = d.iloc[:, 1:]
df2.head()

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02


### Wrong way to write a df to a csv file

In [53]:
df2.to_csv("wrong_movies.csv")

### Correct way to write a df to a csv file

In [54]:
df2.to_csv("better_movies.csv", index = False)
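The difference between the two calls can be seen without touching any files. The sketch below uses a made-up two-row DataFrame and io.StringIO in place of a file on disk; calling to_csv without a path returns the CSV text as a string. Passing index_col=0 to read_csv is an alternative fix when the file was already written with its index.

```python
import io
import pandas as pd

# Made-up data, only for illustration.
df = pd.DataFrame({"Title": ["Movie A", "Movie B"], "Rating": [7.5, 6.0]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the real columns are written

reloaded_wrong = pd.read_csv(io.StringIO(with_index))
reloaded_right = pd.read_csv(io.StringIO(without_index))

# Alternative when the file already contains the index column:
reloaded_fixed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reloaded_wrong.columns))
print(list(reloaded_right.columns))
print(list(reloaded_fixed.columns))
```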

## What is the highest rated movie that had an above average runtime?

In [55]:
df2

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...
993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


In [56]:
df2["Runtime"].mean()

113.17034068136273

In [57]:
b = df2["Runtime"] > df2["Runtime"].mean()
b

0       True
1       True
2       True
3      False
4       True
       ...  
993    False
994    False
995    False
996    False
997    False
Name: Runtime, Length: 998, dtype: bool

In [58]:
long_movies = df2[b]

In [59]:
long_movies["Rating"].max()

9.0

In [60]:
long_movies[long_movies["Rating"] == long_movies["Rating"].max()]

Unnamed: 0,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
54,The Dark Knight,"Action,Crime,Drama",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,533.32


## DataFrame.describe()
- provides a lot of useful stats
- by default, summarizes only columns with numeric values
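A minimal sketch of describe() on a made-up four-row table (the titles and numbers are not from the movie dataset). By default only the numeric columns appear in the result; passing include="all" is one way to summarize the text column as well.

```python
from pandas import DataFrame

# Made-up data, only for illustration.
df = DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Runtime": [90, 100, 110, 120],
    "Rating": [5.0, 6.0, 7.0, 8.0],
})

stats = df.describe()                   # numeric columns only
all_stats = df.describe(include="all")  # also summarizes the Title column

print(stats)
print(stats.loc["mean", "Runtime"])
```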

### How to get median runtime of all the movies?
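Two possible answers, sketched on made-up runtimes rather than the movie dataset: call median() on the column directly, or read the 50% row from describe(), which is the same statistic. With the course data, the column would be df2["Runtime"].

```python
from pandas import DataFrame

# Made-up runtimes, only for illustration.
df = DataFrame({"Runtime": [90, 100, 110, 120]})

median_direct = df["Runtime"].median()

# The 50% row of describe() is the median.
median_from_describe = df.describe().loc["50%", "Runtime"]

print(median_direct, median_from_describe)
```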