# Monday, Nov 15
## Pandas 2 - DataFrames

Core ideas:
 - DataFrames are tables
     - meant to be similar to Spreadsheets
     - each column in the table is a Series


## Outline
### A. Series Review and more operations
### B. How to make a DataFrame
### C. Rules of accessing DataFrame elements
### D. Reading a CSV file into a DataFrame
### E. Analyzing Movies
### F. Describe

## A. Series Review
## Data alignment  (element-wise operation: series op series)

In [1]:
import pandas as pd
from pandas import Series, DataFrame

# importing both ways lets us write either pd.Series or just Series

## Volunteer Hours

In [2]:
# We can make a Series from a Python dict
week1= Series({"Rita":5, "Therese":3, "Janice": 6})
week2 = Series({"Rita":3, "Therese":7, "Janice": 4})
week3 = Series({"Therese":5, "Janice":5, "Rita": 8})   # same labels, different order!
print(week1)
print(week2)
print(week3)

Rita       5
Therese    3
Janice     6
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64


## Give everyone 3 more hours in Week 1

In [4]:
week1 = week1 + 3   # the scalar is added to every element
week1

Rita       8
Therese    6
Janice     9
dtype: int64

## What is week1 + week2 + week3?

week1 + week2 + week3   # values are matched by label before adding

Janice     18
Rita       19
Therese    18
dtype: int64

## What is week1 / week2 ?

In [34]:

week1 / week2
# notice that we didn't have to worry about the order of indices

Rita       2.666667
Therese    0.857143
Janice     2.250000
dtype: float64
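As a small sketch of what "data alignment" means for arithmetic: values are paired up by label, not by position. The Series `a` and `b` below are hypothetical (one volunteer is deliberately missing from `b`) to show what happens when a label has no partner.

```python
from pandas import Series

# Hypothetical data: same volunteers, different order, one label missing in b
a = Series({"Rita": 8, "Therese": 6, "Janice": 9})
b = Series({"Janice": 4, "Rita": 3})

total = a + b                      # labels are matched first, then added
print(total)                       # Therese has no partner, so her result is NaN

filled = a.add(b, fill_value=0)    # treat a missing label as 0 instead
print(filled)
```

The `add` method does the same job as `+`, but accepts `fill_value` for labels that appear in only one of the two Series.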

## What is week1 > week2?

In [35]:
print(week1)
print(week2)
week1 > week2 # indices are ordered the same

Rita       8
Therese    6
Janice     9
dtype: int64
Rita       3
Therese    7
Janice     4
dtype: int64


Rita        True
Therese    False
Janice      True
dtype: bool

## What is week1 > week3?

In [36]:
print(week1)
print(week3)
week1 > week3 # indices not in same order

Rita       8
Therese    6
Janice     9
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64


ValueError: Can only compare identically-labeled Series objects

## There is a method called .gt, for Greater Than

In [170]:
week1
week3
week1.gt(week3) # very helpful for answering questions about our data

Janice      True
Rita       False
Therese     True
dtype: bool

## eq (==), ne (!=), ge (>=), le (<=), and lt (<) are the other options

In [40]:
print(week1)
print(week3)
print(week1.eq(week3))
print(week1.ne(week3))
print(week1.ge(week3))
print(week1.le(week3))

Rita       8
Therese    6
Janice     9
dtype: int64
Therese    5
Janice     5
Rita       8
dtype: int64
Janice     False
Rita        True
Therese    False
dtype: bool
Janice      True
Rita       False
Therese     True
dtype: bool
Janice     True
Rita       True
Therese    True
dtype: bool
Janice     False
Rita        True
Therese    False
dtype: bool
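The difference between the `>` operator and the `.gt` method can be sketched as follows. Reordering one Series with `reindex` before using the operator is an alternative that is not part of the lecture cells above; it is shown here only as one possible workaround.

```python
from pandas import Series

week1 = Series({"Rita": 8, "Therese": 6, "Janice": 9})
week3 = Series({"Therese": 5, "Janice": 5, "Rita": 8})

try:
    result = week1 > week3         # same labels, but in a different order
except ValueError as err:
    result = None
    print("operator failed:", err)

by_method = week1.gt(week3)                       # aligns the labels first
by_reindex = week1 > week3.reindex(week1.index)   # reorder week3, then use >
print(by_method)
print(by_reindex)
```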



# DataFrames store 2-dimensional data in tables

## B. A DataFrame can be created from:
1. dict of Series
2. dict of lists
3. list of lists
4. dict of dicts
5. list of dicts

### DataFrame from dictionary of Series

In [48]:
col1 = Series(["Alice", "Bob", "Cindy", "Dan"])
col2 = Series([6, 7, 8, 9])
# in a dictionary of Series, the keys become the column names
DataFrame({
    "Player name": col1,
    "Score": col2
})

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


### DataFrame from dictionary of lists

In [53]:
name_list = ["Alice", "Bob", "Cindy", "Dan"]
score_list = [6, 7, 8, 9]
# this is the same as above, reminding us that Series act like lists
DataFrame({
    "Player name": name_list,
    "Score": score_list
})

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


### DataFrame from list of lists

In [59]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data
DataFrame(data)
# notice this DataFrame has no column names, so pandas numbers them 0 and 1

       0  1
0  Alice  6
1    Bob  7
2  Cindy  8
3    Dan  9


In [56]:
# Reminder:  A series can be made from a dict
Series({0: "Alice", 1: "Bob", 2: "Cindy", 3: "Dan"})

0    Alice
1      Bob
2    Cindy
3      Dan
dtype: object

### DataFrame from dictionary of dicts

In [57]:
# do you see how this is just like the previous examples?

data = {
    "Player name": {0: "Alice", 1: "Bob", 2: "Cindy", 3: "Dan"},
    "Score": {0: 6, 1: 7, 2: 8, 3: 9}
}
DataFrame(data)

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


### DataFrame from list of dicts

In [60]:
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data
DataFrame(data)

  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9
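To check that the five construction styles really describe the same table, a quick sketch is to build all five and compare their rows. The helper names `names`, `scores`, `frames`, and `records` are introduced here only for illustration.

```python
from pandas import Series, DataFrame

names = ["Alice", "Bob", "Cindy", "Dan"]
scores = [6, 7, 8, 9]
cols = ["Player name", "Score"]

frames = [
    DataFrame({"Player name": Series(names), "Score": Series(scores)}),  # dict of Series
    DataFrame({"Player name": names, "Score": scores}),                  # dict of lists
    DataFrame([[n, s] for n, s in zip(names, scores)], columns=cols),    # list of lists
    DataFrame({"Player name": dict(enumerate(names)),
               "Score": dict(enumerate(scores))}),                       # dict of dicts
    DataFrame([{"Player name": n, "Score": s}
               for n, s in zip(names, scores)]),                         # list of dicts
]

# Convert each frame to a list of row dicts and compare with the first one
records = [f.to_dict("records") for f in frames]
same = all(r == records[0] for r in records)
print(same)
```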


In [64]:
# The index keyword argument sets the label of each row
data = [
    {"Player name": "Alice", "Score": 6},
    {"Player name": "Bob", "Score": 7},
    {"Player name": "Cindy", "Score": 8},
    {"Player name": "Dan", "Score": 9}
]
data
DataFrame(data, index=["A", "B", "C", "D"]) # must have a name for each row

  Player name  Score
A       Alice      6
B         Bob      7
C       Cindy      8
D         Dan      9


### Naming the Columns

In [65]:
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
DataFrame(data, columns=["Player name", "Score"])


  Player name  Score
0       Alice      6
1         Bob      7
2       Cindy      8
3         Dan      9


In [66]:
# Give names to both the columns and rows
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
data

[['Alice', 6], ['Bob', 7], ['Cindy', 8], ['Dan', 9]]
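One way to name both the rows and the columns, sketched under the assumption that the row labels "A" to "D" are wanted, is to pass both keyword arguments in the same call. The variable name `named` is only illustrative.

```python
from pandas import DataFrame

data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9],
]

# index labels the rows, columns labels the columns
named = DataFrame(data, index=["A", "B", "C", "D"], columns=["Player name", "Score"])
print(named)
```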

## C. Rules of Data Lookup
### This is "required reading"
### It will be good to summarize this on your exam sheet

## Data lookup: Series
- `s.loc[X]   lookup by pandas index`
- `s.iloc[X]  lookup by integer position`
- `s[X]       depends (first try index, use integer position if necessary)`

In [67]:
col1 = Series({"Alice":6, "Bob":7, "Cindy":8, "Dan":9})
col1

Alice    6
Bob      7
Cindy    8
Dan      9
dtype: int64

In [68]:
col1.loc["Bob"] #Series index

7

In [69]:
col1.iloc[2] #Series integer position

8

In [70]:
col1["Cindy"] #Series index

8

In [71]:
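col1[1] # falls back to integer position, since 1 is not a label here
# No conflict between index and integer position in this example!
# Recent pandas versions deprecate this fallback; col1.iloc[1] is the explicit form.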

7
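The "depends" rule for `s[X]` matters most when the index itself holds integers. The Series `s` below is a hypothetical example whose integer labels deliberately do not match the positions, so label lookup and position lookup give different answers.

```python
from pandas import Series

# Hypothetical Series: the integer labels are 2, 0, 1 (not 0, 1, 2)
s = Series(["x", "y", "z"], index=[2, 0, 1])

by_label = s.loc[0]       # the element whose label is 0
by_position = s.iloc[0]   # the first element
plain = s[0]              # integer index present, so this is a label lookup
print(by_label, by_position, plain)
```

Writing `.loc` or `.iloc` explicitly avoids having to remember which rule `s[X]` applies.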

## Data lookup

### Series
- `s.loc[X]     lookup by pandas index`
- `s.iloc[X]    lookup by integer position`
- `s[X]         depends (first try index, use integer position if necessary)`

### DataFrame

- `d.loc[r]     lookup ROW by pandas ROW index`
- `d.iloc[r]    lookup ROW by ROW integer position`
- `d[c]         lookup COL by COL index`
- `d.loc[r, c]  lookup by ROW index and COL index`
- `d.iloc[r, c] lookup by ROW integer position and COL integer position`

In [75]:
# we often call the object that we make df
data = [
    ["Alice", 6],
    ["Bob", 7],
    ["Cindy", 8],
    ["Dan", 9]
]
df = DataFrame(data, index=["A", "B", "C", "D"], columns = ["Player name", "Score"])
df

  Player name  Score
A       Alice      6
B         Bob      7
C       Cindy      8
D         Dan      9


### What are all the different ways of accessing row D?

In [97]:
#df["D"] # Nope! df[...] looks up a column, and there is no column named "D"
print(df.loc["D"])
print(df.iloc[3])
print(df.iloc[-1])

Player name    Dan
Score            9
Name: D, dtype: object
Player name    Dan
Score            9
Name: D, dtype: object
Player name    Dan
Score            9
Name: D, dtype: object


In [86]:
df

  Player name  Score
A       Alice      6
B         Bob      7
C       Cindy      8
D         Dan      9


### How do you access a column?

In [98]:
#df[0] # Nope! df[...] expects a column name, not an integer position
print(df["Player name"])

A    Alice
B      Bob
C    Cindy
D      Dan
Name: Player name, dtype: object
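Bracket lookup by column name is the usual way to pull out a column. As a sketch, `.loc` and `.iloc` with a full row slice (`:`) reach the same column; the variable names below are illustrative.

```python
from pandas import DataFrame

df = DataFrame(
    [["Alice", 6], ["Bob", 7], ["Cindy", 8], ["Dan", 9]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

by_name = df["Player name"]          # column label in brackets
by_loc = df.loc[:, "Player name"]    # all rows, column label
by_iloc = df.iloc[:, 0]              # all rows, column position 0
print(by_name.equals(by_loc), by_name.equals(by_iloc))
```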


### What are the ways to access a single cell?

In [87]:
df

  Player name  Score
A       Alice      6
B         Bob      7
C       Cindy      8
D         Dan      9


In [94]:
# How to access Cindy?
#print(df["C", "Player name"]) # Nope! df[...] treats the pair as a single column label
print(df.loc["C", "Player name"])
print(df.iloc[2, 0])
print(df.iloc[-2, 0])

Cindy
Cindy
Cindy


## How to set values for a specific entry?

- d.loc[r, c] = new_val
- d.iloc[r, c] = new_val 

In [102]:
df.loc["D", "Player name"] = "Bianca"
df

  Player name  Score
A       Alice      6
B         Bob      7
C       Cindy      8
D      Bianca      9


In [105]:
df.loc["B","Score"] += 3
df

  Player name  Score
A       Alice      6
B         Bob     10
C       Cindy      8
D      Bianca      9


In [106]:
df.iloc[-1,1] += 2
df

  Player name  Score
A       Alice      6
B         Bob     10
C       Cindy      8
D      Bianca     11


## How to compute the max and mean of a column?

In [113]:
print(df["Score"].max(), df["Score"].mean())


11 8.75
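A natural follow-up question is which row holds the maximum. A minimal sketch, assuming the DataFrame is in the state shown just above, uses `idxmax` to get the row label and then `.loc` to read another column from that row.

```python
from pandas import DataFrame

df = DataFrame(
    [["Alice", 6], ["Bob", 10], ["Cindy", 8], ["Bianca", 11]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

top_row = df["Score"].idxmax()                 # row label of the largest score
top_player = df.loc[top_row, "Player name"]    # name stored in that row
print(top_row, top_player, df["Score"].max(), df["Score"].mean())
```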


## Slicing DataFrame

- df.iloc[ROW_SLICE, COL_SLICE] <- take a rectangular slice from the DataFrame using integer positions
- df.loc[ROW_SLICE, COL_SLICE] <- take a rectangular slice from the DataFrame using index

In [114]:
df.iloc[1:3, 0:2]

  Player name  Score
B         Bob     10
C       Cindy      8


In [117]:
df.loc["B":"C", "Player name":"Score"] # notice that .loc slices include both endpoints

  Player name  Score
B         Bob     10
C       Cindy      8
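The two slicing styles treat the end of the slice differently, which is easy to miss. The sketch below uses the same four-row table (with its original scores) and compares a position slice with a label slice.

```python
from pandas import DataFrame

df = DataFrame(
    [["Alice", 6], ["Bob", 7], ["Cindy", 8], ["Dan", 9]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

by_position = df.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = df.loc["B":"D"]    # labels B, C and D; the end label is included
print(list(by_position.index), list(by_label.index))
```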


## How to set values for sliced DataFrame?

- d.loc[ROW_SLICE, COL_SLICE] = new_val <- set value by ROW INDEX and COL INDEX
- d.iloc[ROW_SLICE, COL_SLICE] = new_val <- set value by ROW Integer position and COL Integer position

In [118]:
df

  Player name  Score
A       Alice      6
B         Bob     10
C       Cindy      8
D      Bianca     11


In [120]:

df.loc["B":"C", "Score"] += 5
df

  Player name  Score
A       Alice      6
B         Bob     15
C       Cindy     13
D      Bianca     11


## Instead of a slice, you could use a list of indexes or integer positions.

In [121]:

df.loc[["B", "D"],"Player name"]

B       Bob
D    Bianca
Name: Player name, dtype: object

In [123]:

df.loc[["B", "D"],"Score"] += 2

## Boolean indexing

### Series
- s[BOOL SERIES]  <- gets all s values lined up with True

### DataFrame
- d[BOOL SERIES]  <- pulls out rows lined up with True

In [126]:
df

  Player name  Score
A       Alice      6
B         Bob     17
C       Cindy     13
D      Bianca     13


In [127]:
b = df["Score"] >= 15
b


A    False
B     True
C    False
D    False
Name: Score, dtype: bool

In [128]:
df[b]

  Player name  Score
B         Bob     17


In [129]:
df[df["Score"] >= 15]

  Player name  Score
B         Bob     17
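Boolean Series can also be combined before indexing. This goes slightly beyond the cells above, so treat it as a sketch: `&` means "and", `|` means "or", `~` means "not", and each comparison needs its own parentheses.

```python
from pandas import DataFrame

df = DataFrame(
    [["Alice", 6], ["Bob", 17], ["Cindy", 13], ["Bianca", 13]],
    index=["A", "B", "C", "D"],
    columns=["Player name", "Score"],
)

in_range = (df["Score"] >= 10) & (df["Score"] < 15)   # both conditions hold
mid = df[in_range]
not_mid = df[~in_range]                               # rows where the mask is False
either = df[(df["Score"] < 10) | (df["Player name"] == "Bob")]
print(list(mid.index), list(not_mid.index), list(either.index))
```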


## D. Creating a DataFrame from a CSV file

In [7]:
# it's that easy!
df = pd.read_csv("IMDB-Movie-Data.csv")
df

,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


## How to see first few lines of the DataFrame?

In [8]:
# just the first two rows

## How to see last few lines of the DataFrame?

In [9]:
# just the last 3 rows

In [10]:
# Print out the max year and the min year in our dataframe


In [11]:
# find the row for which the Title is "La La Land"
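One possible way to approach the four exercises above, sketched on a small stand-in table because the CSV file is not available here. The `movies` DataFrame copies five rows from the output shown earlier, and the title lookup uses "Split" rather than "La La Land" since that row is not among them.

```python
from pandas import DataFrame

# Stand-in for the movie data: five rows taken from the table printed above
movies = DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Year": [2014, 2012, 2016, 2016, 2016],
    "Rating": [8.1, 7.0, 7.3, 7.2, 6.2],
})

first_two = movies.head(2)                                    # first 2 rows
last_three = movies.tail(3)                                   # last 3 rows
year_range = (movies["Year"].min(), movies["Year"].max())     # earliest and latest year
split_row = movies[movies["Title"] == "Split"]                # boolean indexing on Title
print(first_two, last_three, year_range, split_row, sep="\n")
```

With no argument, `head()` and `tail()` return five rows.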

In [12]:
df

,Index,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...,...
993,993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


## Notice that there are two index columns
- By default, writing a DataFrame to a CSV file also writes its row index as an extra column
- When that file is read back in, the old index becomes an ordinary column and pandas adds a fresh row index, so you see two index columns
- Let's fix that problem

### How can you use slicing to get rid of the first column?

In [13]:
df2 = df.iloc[:, 1:]  # all the rows, every column except column 0
df2

,Title,Genre,Director,Cast,Year,Runtime,Rating,Revenue
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael ...",2012,124,7.0,126.46M
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,138.12M
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,325.02
...,...,...,...,...,...,...,...,...
993,Secret in Their Eyes,"Crime,Drama,Mystery",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,0
994,Hostel: Part II,Horror,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,17.54
995,Step Up 2: The Streets,"Drama,Music,Romance",Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,58.01
996,Search Party,"Adventure,Comedy",Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,0


### Wrong way to write a df to a csv file

In [14]:
df2.to_csv("wrong_movies.csv")  # the row index is written as an extra first column

### Correct way to write a df to a csv file

In [15]:
df2.to_csv("better_movies.csv", index=False)  # the row index is left out of the file
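The effect of `index=False` can be seen by writing and re-reading a tiny table. This sketch writes to in-memory `io.StringIO` buffers instead of real files so that it leaves nothing on disk; the two-row DataFrame is a stand-in.

```python
import io
import pandas as pd

df = pd.DataFrame({"Title": ["Split", "Sing"], "Year": [2016, 2016]})

with_index = io.StringIO()
df.to_csv(with_index)                    # default: the row index is written too
with_index.seek(0)
reread_wrong = pd.read_csv(with_index)

without_index = io.StringIO()
df.to_csv(without_index, index=False)    # only the real columns are written
without_index.seek(0)
reread_right = pd.read_csv(without_index)

print(list(reread_wrong.columns))        # has an extra, unnamed first column
print(list(reread_right.columns))
```

pandas labels a column that has no header as `Unnamed: 0`, which is the telltale sign of a file saved with its index.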

## E. Data Analysis with DataFrames


In [32]:
# make a DataFrame called long_movies with the movies whose Runtime is > the mean Runtime
long_movies = None

In [33]:
# find the value of the 99th percentile of Ratings
top1percent = None

In [35]:
# find all movies in long_movies with a rating at or above the 90th percentile


In [36]:
# find the cast of a certain movie title


In [37]:
# find all movies with Jennifer Lawrence in the cast


In [38]:
# if a column contains all strings, there are Series methods that work on strings
# Pandas Series string methods: 
# https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html 

# Series.str.contains(target), .lower(), .upper(), .strip(), .split()
# long_movies[long_movies["Title"].str.contains("Hunger")]


## F. DataFrame.describe()
- provides a lot of useful stats
- by default, summarizes only the columns with numeric values

In [159]:
stats = df.describe()
stats

              Year     Runtime      Rating
count   998.000000  998.000000  998.000000
mean   2012.779559  113.170341    6.723447
std       3.207549   18.828877    0.945682
min    2006.000000   66.000000    1.900000
25%    2010.000000  100.000000    6.200000
50%    2014.000000  111.000000    6.800000
75%    2016.000000  123.000000    7.400000
max    2016.000000  191.000000    9.000000


### How to get Quartile 1 of runtime of all the movies?

In [39]:
stats.loc["25%", "Runtime"]

100.0
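The rows of the `describe()` output are themselves indexed by label, so `.loc` works on them like on any DataFrame. As a sketch on a small stand-in table (five rows from the movie output above), the "25%" row should agree with calling `quantile(0.25)` on the column directly.

```python
from pandas import DataFrame

# Stand-in data: five rows from the movie table
movies = DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Runtime": [121, 124, 117, 108, 123],
    "Rating": [8.1, 7.0, 7.3, 7.2, 6.2],
})

stats = movies.describe()                      # Title is skipped: it is not numeric
q1_from_describe = stats.loc["25%", "Runtime"]
q1_direct = movies["Runtime"].quantile(0.25)
print(list(stats.columns), q1_from_describe, q1_direct)
```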