# Data wrangling with Pandas

## Learning outcomes
- Inspect a dataframe with `df.head()` and `df.tail()`.
- Obtain dataframe summaries with `df.info()` and `df.describe()`.
- Rename columns of a dataframe using the `df.rename()` function or by accessing the `df.columns` attribute.
- Use `df.melt()` and `df.pivot()` to reshape dataframes, specifically to make tidy dataframes.
- Combine dataframes using `df.merge()` and `pd.concat()` and know when to use these different methods.
- Apply functions to a dataframe using `df.apply()` and `df.applymap()`.
- Perform grouping and aggregating operations using `df.groupby()` and `df.agg()`.
- Perform aggregating methods on grouped or ungrouped objects such as finding the minimum, maximum and sum of values in a dataframe using `df.agg()`.
- Remove or fill missing values in a dataframe with `df.dropna()` and `df.fillna()`.

In [2]:
import numpy as np
import pandas as pd

## DataFrame characteristics
---

- Last lecture we looked at how we can create dataframes
- Let's now look at some helpful ways we can view our dataframe

### Head/Tail

- The `.head()` and `.tail()` methods allow you to view the top/bottom *n* (default 5) rows of a dataframe
- Let's load in the [IMDB movie dataset](https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows) and try them out:

In [4]:
df = pd.read_csv('data/imdb.csv')
df.head()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0


- The default return value is 5 rows, but we can pass in any number we like
- For example, let's take a look at the top 10 rows:

In [None]:
df.head(10)

- Or the bottom 5 rows:

In [None]:
df.tail()
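
- As a minimal sketch on a made-up toy dataframe (`toy` and its values are assumptions for illustration, not part of the IMDB data), *n* can also be negative, in which case `.head()`/`.tail()` return everything except the last/first |*n*| rows:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate head/tail
toy = pd.DataFrame({"x": range(10)})

first_three = toy.head(3)          # rows 0, 1, 2
last_two = toy.tail(2)             # rows 8, 9
all_but_last_two = toy.head(-2)    # rows 0 to 7
all_but_first_two = toy.tail(-2)   # rows 2 to 9
```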

### DataFrame summaries

- Three very helpful attributes/functions for getting high-level summaries of your dataframe are:
    - `.shape`
    - `.info()`
    - `.describe()`

- `.shape` is just like the ndarray attribute we've seen previously
- It gives the shape (rows, cols) of your dataframe:

In [5]:
df.shape

(1000, 15)

- `.info()` prints information about the dataframe itself, such as dtypes, memory usage, non-null counts, etc:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   int64  
 2   Certificate    899 non-null    object 
 3   Runtime        1000 non-null   object 
 4   Genre          1000 non-null   object 
 5   IMDB_Rating    1000 non-null   float64
 6   Overview       1000 non-null   object 
 7   Meta_score     843 non-null    float64
 8   Director       1000 non-null   object 
 9   Star1          1000 non-null   object 
 10  Star2          1000 non-null   object 
 11  Star3          1000 non-null   object 
 12  Star4          1000 non-null   object 
 13  No_of_Votes    1000 non-null   int64  
 14  Gross          831 non-null    float64
dtypes: float64(3), int64(2), object(10)
memory usage: 117.3+ KB


- `.describe()` provides summary statistics of the values within a dataframe:

In [7]:
df.describe()

Unnamed: 0,Released_Year,IMDB_Rating,Meta_score,No_of_Votes,Gross
count,1000.0,1000.0,843.0,1000.0,831.0
mean,1992.221,7.9493,77.97153,273692.9,68034750.0
std,39.746924,0.275491,12.376099,327372.7,109750000.0
min,1920.0,7.6,28.0,25088.0,1305.0
25%,1976.0,7.7,70.0,55526.25,3253559.0
50%,1999.0,7.9,79.0,138548.5,23530890.0
75%,2009.0,8.1,87.0,374161.2,80750890.0
max,3010.0,9.3,100.0,2343110.0,936662200.0


- By default, `.describe()` only prints summaries of numeric features
- We can force it to give summaries on all features using the argument `include='all'` (although they may not make sense!):

In [8]:
df.describe(include='all')

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
count,1000,1000.0,899,1000,1000,1000.0,1000,843.0,1000,1000,1000,1000,1000,1000.0,831.0
unique,999,,16,140,202,,1000,,548,660,841,891,939,,
top,Drishyam,,U,100 min,Drama,,Two imprisoned men bond over a number of years...,,Alfred Hitchcock,Tom Hanks,Emma Watson,Rupert Grint,Michael Caine,,
freq,2,,234,23,85,,1,,14,12,7,5,4,,
mean,,1992.221,,,,7.9493,,77.97153,,,,,,273692.9,68034750.0
std,,39.746924,,,,0.275491,,12.376099,,,,,,327372.7,109750000.0
min,,1920.0,,,,7.6,,28.0,,,,,,25088.0,1305.0
25%,,1976.0,,,,7.7,,70.0,,,,,,55526.25,3253559.0
50%,,1999.0,,,,7.9,,79.0,,,,,,138548.5,23530890.0
75%,,2009.0,,,,8.1,,87.0,,,,,,374161.2,80750890.0


### Displaying DataFrames

- Displaying your dataframes effectively can be an important part of your workflow
- If a dataframe has more than 60 rows, Pandas will only display the first 5 and last 5 rows:

In [9]:
pd.DataFrame(np.random.rand(100))

Unnamed: 0,0
0,0.030260
1,0.645142
2,0.973557
3,0.290801
4,0.836502
...,...
95,0.178665
96,0.712136
97,0.706546
98,0.667451


- For dataframes with 60 rows or fewer, Pandas will print the whole dataframe:

In [10]:
pd.DataFrame(np.random.rand(25))

Unnamed: 0,0
0,0.882468
1,0.72602
2,0.766781
3,0.177564
4,0.768787
5,0.452322
6,0.547723
7,0.754078
8,0.449264
9,0.956303


### Views vs copies

- In previous lectures we've discussed views ("looking" at a part of an existing object) and copies (making a new copy of the object in memory)
- These things get a little abstract with Pandas and "...it’s very hard to predict whether it will return a view or a copy" (that's a quote straight [from a dedicated section in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy))
- Basically, it depends on the operation you are trying to perform, your dataframe's structure and the memory layout of the underlying array
- But don't worry, let me tell you all you need to know
- Firstly, the most common warning you'll encounter in Pandas is the `SettingWithCopyWarning`. Pandas raises it to warn you that you might not be doing what you think you're doing
- Let's see an example: one of the movies in our dataframe has an incorrect value in the `Released_Year` field:

In [11]:
df[df['Released_Year'] > 2021]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
8,Inception,3010,UA,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,74.0,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,2067042,292576195.0


- Imagine we wanted to change this to `2010`
- You'd probably do the following:

In [12]:
df[df['Released_Year'] > 2021]['Released_Year'] = 2010

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['Released_Year'] > 2021]['Released_Year'] = 2010


- Ah, there's that warning
- Did our dataframe get changed?

In [13]:
df[df['Released_Year'] > 2021]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
8,Inception,3010,UA,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,74.0,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,2067042,292576195.0


- No, it didn't, even though you probably thought it would
- What happened above is that `df[df['Released_Year'] > 2021]` was executed first and returned a copy of the dataframe. We can confirm that it is a different object by using `id()`:

In [14]:
print(f"The id of the original dataframe is: {id(df)}")
print(f" The id of the indexed dataframe is: {id(df[df['Released_Year'] > 2021])}")

The id of the original dataframe is: 2416625116608
 The id of the indexed dataframe is: 2416636322688


- We then tried to set a value on this new object by appending `['Released_Year'] = 2010`
- Pandas is warning us that we are doing that operation on a copy of the original dataframe, which is probably not what we want
- To fix this, you need to index in a single go, using `.loc[]` for example:

In [19]:
df.loc[df['Released_Year'] > 2021, 'Released_Year'] = 2010

- No warning this time! And let's confirm the change:

In [20]:
df[df['Released_Year'] > 2021]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross


- The second thing you need to know is that if you're ever in doubt about whether something is a view or a copy, you can just use the `.copy()` method to force a copy of a dataframe
- Just like this:

In [21]:
df2 = df[df['Released_Year'] > 2021].copy()

- That way, you're guaranteed a copy that you can modify as you wish
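
- The whole pattern can be sketched on a small made-up dataframe (`movies` and its values are assumptions for illustration). The exact warning class and text vary between Pandas versions, so the sketch silences warnings and simply checks which assignments reach the original:

```python
import warnings

import pandas as pd

# Hypothetical toy data with one obviously wrong year
movies = pd.DataFrame({"title": ["A", "B", "C"], "year": [1999, 3010, 2005]})

# Chained indexing: the assignment lands on a temporary copy
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    movies[movies["year"] > 2021]["year"] = 2010
year_after_chained = movies.loc[1, "year"]   # still 3010

# Single .loc call: rows and column are selected in one go
movies.loc[movies["year"] > 2021, "year"] = 2010
year_after_loc = movies.loc[1, "year"]       # now 2010

# Explicit copy: changes here never touch `movies`
old_movies = movies[movies["year"] < 2000].copy()
old_movies["year"] = 0
```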

## Basic DataFrame manipulations
---

### Renaming columns

- We can rename columns two ways:
    1. Using `.rename()` (to selectively change column names)
    2. By setting the `.columns` attribute (to change all column names at once)

In [23]:
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- Let's give it a go:

In [24]:
df.rename(columns={"Released_Year": "Year",
                   "Overview": "Synopsis"})
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- Wait? What happened? Nothing changed?
- In the code above we did rename the columns, but not in place: `.rename()` returned a modified copy that we never assigned to anything, so `df` itself is unchanged (more on that later)
- There are generally two options for making permanent dataframe changes:
    1. Use the argument `inplace=True`, e.g., `df.rename(..., inplace=True)`, available in most functions/methods
    2. Re-assign, e.g., `df = df.rename(...)`
- The Pandas team recommends **Method 2 (re-assign)**, for a [few reasons](https://www.youtube.com/watch?v=hK6o_TDXXN8&t=700) (mostly to do with how memory is allocated under the hood)
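
- A minimal sketch of the three behaviours on a made-up two-row dataframe (`toy` is an assumption for illustration; only the column names are borrowed from the IMDB data):

```python
import pandas as pd

toy = pd.DataFrame({"Released_Year": [1994, 1972], "Overview": ["a", "b"]})

# Result is discarded, so `toy` keeps its old column names
toy.rename(columns={"Released_Year": "Year"})
columns_after_discard = list(toy.columns)

# Method 2: re-assign the returned dataframe
toy = toy.rename(columns={"Released_Year": "Year"})
columns_after_reassign = list(toy.columns)

# Method 1: modify the existing object (returns None)
toy.rename(columns={"Overview": "Synopsis"}, inplace=True)
columns_after_inplace = list(toy.columns)
```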

In [25]:
df = df.rename(columns={"Released_Year": "Year",
                   "Overview": "Synopsis"})
df

Unnamed: 0,Series_Title,Year,Certificate,Runtime,Genre,IMDB_Rating,Synopsis,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- If you wish to change all of the columns of a dataframe, you can do so by setting the `.columns` attribute

In [26]:
df.columns = [f"Column {i}" for i in range(df.shape[1])]
df

Unnamed: 0,Column 0,Column 1,Column 2,Column 3,Column 4,Column 5,Column 6,Column 7,Column 8,Column 9,Column 10,Column 11,Column 12,Column 13,Column 14
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- You can change the index labels of a dataframe in 4 main ways:
    1. `.set_index()` to make one of the columns of the dataframe the index
    2. Directly modify `df.index.name` to change the index name
    3. `.reset_index()` to move the current index into a column and reset the index to integer labels starting from 0
    4. Directly assign to the `.index` attribute
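
- The index operations listed above can be sketched on a small made-up dataframe (`toy`, its values and the new labels are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Alien", "Heat", "Up"], "year": [1979, 1995, 2009]})

# 1. Promote a column to the index (returns a new dataframe)
by_title = toy.set_index("title")

# 2. Rename the index itself
by_title.index.name = "movie"

# 3. Move the index back into a column and restore 0, 1, 2, ...
restored = by_title.reset_index()

# 4. Overwrite the index labels directly (length must match the number of rows)
relabelled = toy.copy()
relabelled.index = ["a", "b", "c"]
```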

In [27]:
df

Unnamed: 0,Column 0,Column 1,Column 2,Column 3,Column 4,Column 5,Column 6,Column 7,Column 8,Column 9,Column 10,Column 11,Column 12,Column 13,Column 14
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


### Adding/Removing columns

- There are two main ways to add/remove columns of a dataframe
    1. Use `[]` to add columns
    2. Use `.drop()` to drop columns
- Let's re-read in a fresh copy of the IMDB movie dataset:

In [28]:
df = pd.read_csv('data/imdb.csv')
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- We can add a new column to a dataframe by simply using `[]` with a new column name and value(s)

In [29]:
df['RottenTomato_score'] = 0

- We can remove columns with `.drop()`, passing the column names to the `columns` argument:

In [30]:
df = df.drop(columns=['Star3', 'Star4'])
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,No_of_Votes,Gross,RottenTomato_score
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,2343110,28341469.0,0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,1620367,134966411.0,0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,2303232,534858444.0,0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,1129952,57300000.0,0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,689845,4360000.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,166544,,0
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,34075,,0
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,43374,30500000.0,0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,26471,,0
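
- As a further sketch on made-up data (`toy` and its columns are assumptions for illustration), a new column can also be computed from existing ones, and `.drop()` accepts either `columns=` or `axis=1`:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Alien", "Heat"],
                    "votes": [900, 700],
                    "gross": [80.0, 67.0]})

toy["source"] = "imdb"                                # a scalar is repeated for every row
toy["gross_per_vote"] = toy["gross"] / toy["votes"]   # computed from existing columns

slim = toy.drop(columns=["gross", "votes"])           # returns a new dataframe
same = toy.drop(["gross", "votes"], axis=1)           # equivalent spelling
```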


### Adding/Removing rows

- You won't often be adding rows to a dataframe manually (you'll usually add rows through concatenating/joining - that's coming up next)
- You can add/remove rows of a dataframe in two ways:
    1. Use `pd.concat()` to add rows (`.append()` was deprecated in Pandas 1.4 and removed in 2.0)
    2. Use `.drop()` to drop rows
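
- A minimal sketch of both operations on a made-up dataframe (`toy`, `new_row` and the index labels are assumptions for illustration). It assumes Pandas 2.x, where `DataFrame.append()` is no longer available, so rows are added with `pd.concat()`:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Alien", "Heat", "Up"], "year": [1979, 1995, 2009]},
                   index=[10, 20, 30])
new_row = pd.DataFrame({"title": ["Dune"], "year": [2021]}, index=[40])

# Add a row: stack the two dataframes on top of each other
extended = pd.concat([toy, new_row])

# Drop rows by index label
by_label = extended.drop(index=[10, 40])

# Drop rows by position: convert positions to labels first
by_position = extended.drop(index=extended.index[2:])
```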

In [31]:
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,No_of_Votes,Gross,RottenTomato_score
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,2343110,28341469.0,0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,1620367,134966411.0,0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,2303232,534858444.0,0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,1129952,57300000.0,0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,689845,4360000.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,166544,,0
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,34075,,0
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,43374,30500000.0,0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,26471,,0


- Let's add a new row to the bottom of this dataframe

In [32]:
another_row = pd.DataFrame(
    [
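        [
            "Zone 414",
            2021,  # numeric, to match the int64 dtype of Released_Year
            "R",
            "98 min",
            "Sci-Fi",
            6.5,  # numeric, to match the float64 dtype of IMDB_Rating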
            "Set in the near future in a colony of state-of-the-art humanoid robots.",
            75.0,
            "Andrew Baird",
            "Travis Fimmel",
            "Guy Pearce",
            12343,
            1229123,
            0
        ]
    ],
    columns=df.columns,
    index=[1100]
)
# DataFrame.append() was removed in Pandas 2.0; pd.concat() is its replacement
df = pd.concat([df, another_row])
df


Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,No_of_Votes,Gross,RottenTomato_score
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,2343110,28341469.0,0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,1620367,134966411.0,0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,2303232,534858444.0,0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,1129952,57300000.0,0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,689845,4360000.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,34075,,0
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,43374,30500000.0,0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,26471,,0
999,The 39 Steps,1935,,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,51853,,0


- We can drop all rows beyond the first 5 using `.drop()`, which takes index labels (hence `df.index[5:]`):

In [33]:
df.drop(df.index[5:], axis=0)

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,No_of_Votes,Gross,RottenTomato_score
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,2343110,28341469.0,0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,1620367,134966411.0,0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,2303232,534858444.0,0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,1129952,57300000.0,0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,689845,4360000.0,0


- We can also drop rows indirectly by slicing the original dataframe and assigning the result to a new one:

In [35]:
df2 = df.iloc[:5, :]
df2

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,No_of_Votes,Gross,RottenTomato_score
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,2343110,28341469.0,0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,1620367,134966411.0,0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,2303232,534858444.0,0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,1129952,57300000.0,0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,689845,4360000.0,0
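
- Both `.drop()` and `.iloc` slicing return a new dataframe and leave the original untouched unless you assign the result back. Here is a minimal sketch of that behaviour; the `toy` dataframe and its values are made up for illustration and are not part of the IMDB data:

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"title": list("abcdefg"), "rating": [9, 8, 7, 6, 5, 4, 3]})

dropped = toy.drop(toy.index[5:], axis=0)  # drop every row from position 5 onwards
sliced = toy.iloc[:5, :]                   # keep the first 5 rows by position

print(dropped["title"].tolist())  # rows kept by .drop()
print(sliced["title"].tolist())   # rows kept by slicing
print(len(toy))                   # the original still has all of its rows
```

- If you want to keep the shortened dataframe, assign it to a name (e.g. `toy = toy.iloc[:5, :]`)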


## DataFrame reshaping
---

### Tidy data

- [Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) is about "linking the structure of a dataset with its semantics (its meaning)"
- You've already looked at tidy data in DSCI 523
- It is defined by:
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
- Often you'll need to reshape a dataframe to make it tidy (or for some other purpose)
    
![](img/lecture7/tidy.png)

Source: [r4ds](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure)

### Melt and pivot

- Pandas `.melt()`, `.pivot()` and `.pivot_table()` can help reshape dataframes
    - `.melt()`: make wide data long (like `melt()` in R)
    - `.pivot()`: make long data wide (like `cast()` in R)
    - `.pivot_table()`: like `.pivot()`, but can handle multiple indexes and duplicate entries (by aggregating them)
    
![](img/lecture7/melt_pivot.gif)

Source: [Garrick Aden-Buie's GitHub](https://github.com/gadenbuie/tidyexplain#spread-and-gather)

- The below data shows how many courses different instructors taught across different years:

In [36]:
df = pd.DataFrame({"Name": ["Arman", "Mike", "Tiffany", "Varada", "Joel"],
                   "2018": [1, 3, 4, 5, 3],
                   "2019": [2, 4, 3, 2, 1],
                   "2020": [5, 2, 4, 4, 3]})
df

Unnamed: 0,Name,2018,2019,2020
0,Arman,1,2,5
1,Mike,3,4,2
2,Tiffany,4,3,4
3,Varada,5,2,4
4,Joel,3,1,3


- Let's try to tidy the data with `.melt()`. We first have to know what exactly `.melt()` does, so let's apply it to our dataframe without any arguments:

In [37]:
df.melt()

Unnamed: 0,variable,value
0,Name,Arman
1,Name,Mike
2,Name,Tiffany
3,Name,Varada
4,Name,Joel
5,2018,1
6,2018,3
7,2018,4
8,2018,5
9,2018,3


- Think of `.melt()` as trying to make everything look like `key: value` pairs
- By default, `.melt()` takes each column name as a key and binds it with all column values
- Here we're interested in questions about each instructor, so we want the rows to be identified with instructor names
- This can be done using the `id_vars` argument, which determines which column should be the "identifier", i.e. the "key":

In [38]:
df.melt(id_vars="Name")

Unnamed: 0,Name,variable,value
0,Arman,2018,1
1,Mike,2018,3
2,Tiffany,2018,4
3,Varada,2018,5
4,Joel,2018,3
5,Arman,2019,2
6,Mike,2019,4
7,Tiffany,2019,3
8,Varada,2019,2
9,Joel,2019,1


Much better!

- The `value_vars` argument allows us to select which specific variables we want to "melt" (if you don't specify `value_vars`, all non-identifier columns will be used)
- For example, below I'm only using the `2020` column:

In [None]:
df.melt(id_vars="Name", value_vars=["2020"])
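
- `value_vars` also accepts several columns at once. As a small sketch (the name `courses` is my own choice for the instructor data shown earlier), melting only two of the three year columns looks like this:

```python
import pandas as pd

# Same instructor data as above, stored under the illustrative name `courses`
courses = pd.DataFrame({"Name": ["Arman", "Mike", "Tiffany", "Varada", "Joel"],
                        "2018": [1, 3, 4, 5, 3],
                        "2019": [2, 4, 3, 2, 1],
                        "2020": [5, 2, 4, 4, 3]})

# Melt only the 2019 and 2020 columns; 2018 does not appear in the result
melted = courses.melt(id_vars="Name", value_vars=["2019", "2020"])
print(melted)
print(melted.shape)  # one row per (instructor, melted column) pair
```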

- The `variable` column can be renamed by passing a name to the `var_name` argument:

In [39]:
df_melt = df.melt(id_vars="Name", var_name="Year")
df_melt

Unnamed: 0,Name,Year,value
0,Arman,2018,1
1,Mike,2018,3
2,Tiffany,2018,4
3,Varada,2018,5
4,Joel,2018,3
5,Arman,2019,2
6,Mike,2019,4
7,Tiffany,2019,3
8,Varada,2019,2
9,Joel,2019,1


- The `value` column can be renamed by passing a name to the `value_name` argument:

In [40]:
df_melt = df.melt(id_vars="Name", var_name="Year", value_name="Course")
df_melt

Unnamed: 0,Name,Year,Course
0,Arman,2018,1
1,Mike,2018,3
2,Tiffany,2018,4
3,Varada,2018,5
4,Joel,2018,3
5,Arman,2019,2
6,Mike,2019,4
7,Tiffany,2019,3
8,Varada,2019,2
9,Joel,2019,1


- Sometimes, you want to make long data wide, which we can do with `.pivot()`
- When using `.pivot()` we need to specify the `index` to pivot on, the `columns` whose values will become the column labels of the wider dataframe, and the `values` that will fill it:

In [41]:
df_pivot = df_melt.pivot(index="Name",
                         columns="Year",
                         values="Course"
                        )
df_pivot

Year,2018,2019,2020
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arman,1,2,5
Joel,3,1,3
Mike,3,4,2
Tiffany,4,3,4
Varada,5,2,4


- You'll notice that Pandas set our specified `index="Name"` argument as the index of the new dataframe, and preserved the label of the columns
- We can easily remove these names and reset the index to make our dataframe look like it originally did

In [42]:
df_pivot.columns

Index(['2018', '2019', '2020'], dtype='object', name='Year')

In [43]:
df_pivot.columns.name = None
df_pivot = df_pivot.reset_index()
df_pivot

Unnamed: 0,Name,2018,2019,2020
0,Arman,1,2,5
1,Joel,3,1,3
2,Mike,3,4,2
3,Tiffany,4,3,4
4,Varada,5,2,4
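
- `.melt()` and `.pivot()` are roughly inverses of each other. The sketch below does a full round trip on the instructor data and checks that we get the original values back; the names `wide`, `long_df`, `restored` and `expected` are just illustrative:

```python
import pandas as pd

wide = pd.DataFrame({"Name": ["Arman", "Mike", "Tiffany", "Varada", "Joel"],
                     "2018": [1, 3, 4, 5, 3],
                     "2019": [2, 4, 3, 2, 1],
                     "2020": [5, 2, 4, 4, 3]})

# Wide -> long
long_df = wide.melt(id_vars="Name", var_name="Year", value_name="Course")

# Long -> wide again
restored = long_df.pivot(index="Name", columns="Year", values="Course")
restored.columns.name = None       # drop the "Year" label on the columns
restored = restored.reset_index()  # turn "Name" back into a regular column

# pivot() sorts rows by the index, so sort the original the same way before comparing
expected = wide.sort_values("Name").reset_index(drop=True)
print(restored)
print(expected)
```

- The only difference from the original is the row order, because `.pivot()` sorts by the index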


- `.pivot()` will often get you what you want, but it won't work if you want to:
    - Use multiple indexes (next lecture)
    - Have duplicate index/column labels
- In these cases you'll have to use `.pivot_table()`
- I won't focus on it too much here because I'd rather you learn about `pivot()` first

In [44]:
df = pd.DataFrame(
    {
        "Name": ["Arman", "Arman", "Mike", "Mike"],
        "Department": ["CS", "STATS", "CS", "STATS"],
        "2018": [1, 2, 3, 1],
        "2019": [2, 3, 4, 2],
        "2020": [5, 1, 2, 2],
    }
)
df

Unnamed: 0,Name,Department,2018,2019,2020
0,Arman,CS,1,2,5
1,Arman,STATS,2,3,1
2,Mike,CS,3,4,2
3,Mike,STATS,1,2,2


In [45]:
df = df.melt(
    id_vars=["Name", "Department"],
    var_name="Year",
    value_name="Courses"
)
df

Unnamed: 0,Name,Department,Year,Courses
0,Arman,CS,2018,1
1,Arman,STATS,2018,2
2,Mike,CS,2018,3
3,Mike,STATS,2018,1
4,Arman,CS,2019,2
5,Arman,STATS,2019,3
6,Mike,CS,2019,4
7,Mike,STATS,2019,2
8,Arman,CS,2020,5
9,Arman,STATS,2020,1


- In the above case, each `Name`/`Year` combination appears more than once (one row per department), so `pivot()` won't work
- It will throw us a `ValueError: Index contains duplicate entries, cannot reshape`:

In [46]:
df.pivot(index="Name",
         columns="Year",
         values="Courses")

ValueError: Index contains duplicate entries, cannot reshape

- In such a case, we'd use `.pivot_table()`
- It will apply an aggregation function to our duplicates, in this case, we'll `sum()` them up:

In [47]:
df.pivot_table(index="Name", columns='Year', values='Courses', aggfunc='sum')

Year,2018,2019,2020
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Arman,3,5,6
Mike,4,6,4


- If we wanted to keep the numbers per department, we could specify both `Name` and `Department` as multiple indexes:

In [48]:
df.pivot_table(index=["Name", "Department"], columns='Year', values='Courses')

Unnamed: 0_level_0,Year,2018,2019,2020
Name,Department,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Arman,CS,1,2,5
Arman,STATS,2,3,1
Mike,CS,3,4,2
Mike,STATS,1,2,2


- The result above is a multi-index or "hierarchically indexed" dataframe (more on those next lecture)
- If you ever have a need to use it, you can read more about `pivot_table()` in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#pivot-tables)
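
- To make the difference between `.pivot()` and `.pivot_table()` concrete, here is a small sketch on a cut-down version of the data above (one year only; the name `long_df` is illustrative). Note that when no `aggfunc` is given, `.pivot_table()` uses the mean:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Name": ["Arman", "Arman", "Mike", "Mike"],
    "Department": ["CS", "STATS", "CS", "STATS"],
    "Year": ["2018"] * 4,
    "Courses": [1, 2, 3, 1],
})

# pivot() refuses to reshape because each (Name, Year) pair appears twice
pivot_failed = False
try:
    long_df.pivot(index="Name", columns="Year", values="Courses")
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# pivot_table() collapses the duplicates with the chosen aggregation
totals = long_df.pivot_table(index="Name", columns="Year", values="Courses", aggfunc="sum")
# With no aggfunc, pivot_table() falls back to the mean
averages = long_df.pivot_table(index="Name", columns="Year", values="Courses")

print(totals)
print(averages)
```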

## More DataFrame operations
---

### Applying custom functions

- There will be times when you want to apply a function that is not built-in to Pandas
- For this, we also have methods:
    - `df.apply()`, applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)
    - `df.applymap()`, applies a function element-wise (for functions that accept/return single values at a time); since pandas 2.1 this method is called `df.map()` and `applymap()` is deprecated
    - `series.apply()`/`series.map()`, same as above but for Pandas series
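
- As a rough sketch of the column-wise, row-wise and element-wise cases, consider the made-up `temps` dataframe below (its column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical temperature data, used only for illustration
temps = pd.DataFrame({"min_c": [1.0, 4.0, 7.0], "max_c": [6.0, 11.0, 15.0]})

# Column-wise (default, axis=0): the function receives each column as a Series
col_means = temps.apply(lambda col: col.mean())

# Row-wise (axis=1): the function receives each row as a Series
row_ranges = temps.apply(lambda row: row["max_c"] - row["min_c"], axis=1)

# Element-wise on a single column: the function receives one value at a time
is_warm = temps["max_c"].map(lambda v: v > 10)

print(col_means)
print(row_ranges)
print(is_warm)
```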

- For example, say you want to use a numpy function on a column in your dataframe

In [49]:
df = pd.read_csv('data/YVR_weather_data.csv', usecols=range(0, 4))
df[['Mean Max Temp (°C)']].apply(np.log)

Unnamed: 0,Mean Max Temp (°C)
0,-0.510826
1,1.648659
2,2.459589
3,2.476538
4,2.791165
...,...
912,2.054124
913,2.351375
914,2.549445
915,2.839078


- Or you may want to apply your own custom function

In [50]:
def convert_to_fahrenheit(x):
    return x * (9 / 5) + 32


df[['Mean Max Temp (°C)']].apply(convert_to_fahrenheit)

Unnamed: 0,Mean Max Temp (°C)
0,33.08
1,41.36
2,53.06
3,53.42
4,61.34
...,...
912,46.04
913,50.90
914,55.04
915,62.78


- This may have been better as a lambda function...

In [51]:
df[['Mean Max Temp (°C)']].apply(lambda x: x * (9 / 5) + 32)

Unnamed: 0,Mean Max Temp (°C)
0,33.08
1,41.36
2,53.06
3,53.42
4,61.34
...,...
912,46.04
913,50.90
914,55.04
915,62.78


- You can even use functions that require additional arguments
- Just specify the arguments in `.apply()`

In [52]:
def convert_temperature(x, to="absolute"):
    if to == "absolute":
        return x + 273.15
    elif to == "fahrenheit":
        return x * (9 / 5) + 32
    else:
        raise ValueError(f"Unknown target unit: {to!r}")


df[['Mean Max Temp (°C)']].apply(convert_temperature, to="absolute")

Unnamed: 0,Mean Max Temp (°C)
0,273.75
1,278.35
2,284.85
3,285.05
4,289.45
...,...
912,280.95
913,283.65
914,285.95
915,290.25
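
- Extra arguments can be passed to `.apply()` either by keyword or positionally through its `args` parameter. A minimal sketch, using a made-up `temps` dataframe instead of the weather data:

```python
import pandas as pd


def convert_temperature(x, to="absolute"):
    if to == "absolute":
        return x + 273.15
    elif to == "fahrenheit":
        return x * (9 / 5) + 32


# Hypothetical data, used only for illustration
temps = pd.DataFrame({"temp_c": [0.0, 10.0, 25.0]})

# Keyword arguments are forwarded to the function as-is
by_keyword = temps[["temp_c"]].apply(convert_temperature, to="fahrenheit")

# Positional arguments go in a tuple passed to `args`
by_position = temps[["temp_c"]].apply(convert_temperature, args=("fahrenheit",))

print(by_keyword)
print(by_position)
```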


### Grouping

- Often we are interested in examining specific groups in our data
- `df.groupby()` allows us to group our data based on a variable(s)
- Analogous to the `group_by()` function in R

In [53]:
df = pd.read_csv('data/imdb.csv')
df

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000.0
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


- Let's group this dataframe on the column `Genre`

In [54]:
dfg = df.groupby(by='Genre')
dfg

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000232ACFB7EE0>

- What is a `DataFrameGroupBy` object?
- It contains information about the groups of the dataframe

- The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to

In [55]:
dfg.groups

{'Action, Adventure': [63, 72, 155, 168, 840], 'Action, Adventure, Biography': [540], 'Action, Adventure, Comedy': [177, 320, 325, 339, 348, 473, 532, 722, 730, 887], 'Action, Adventure, Crime': [909], 'Action, Adventure, Drama': [5, 10, 13, 31, 39, 59, 343, 496, 625, 642, 709, 821, 898, 944], 'Action, Adventure, Family': [927], 'Action, Adventure, Fantasy': [16, 29, 109, 376, 623, 645], 'Action, Adventure, History': [507], 'Action, Adventure, Horror': [535], 'Action, Adventure, Mystery': [914], 'Action, Adventure, Romance': [564], 'Action, Adventure, Sci-Fi': [8, 60, 106, 223, 262, 357, 477, 479, 482, 493, 502, 582, 583, 634, 677, 737, 746, 749, 807, 839, 982], 'Action, Adventure, Thriller': [368, 725, 751, 861, 963], 'Action, Adventure, War': [854, 856], 'Action, Adventure, Western': [543, 865], 'Action, Biography, Crime': [142, 702, 985], 'Action, Biography, Drama': [57, 216, 217, 351, 659, 889, 924], 'Action, Comedy, Crime': [140, 160, 161, 294, 569, 908], 'Action, Comedy, Fantasy'

- We can also access a group using the `.get_group()` method

In [56]:
dfg.get_group('Action, Adventure')

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
63,The Dark Knight Rises,2012,UA,164 min,"Action, Adventure",8.4,Eight years after the Joker's reign of anarchy...,78.0,Christopher Nolan,Christian Bale,Tom Hardy,Anne Hathaway,Gary Oldman,1516346,448139099.0
72,Raiders of the Lost Ark,1981,A,115 min,"Action, Adventure",8.4,"In 1936, archaeologist and adventurer Indiana ...",85.0,Steven Spielberg,Harrison Ford,Karen Allen,Paul Freeman,John Rhys-Davies,884112,248159971.0
155,Batman Begins,2005,UA,140 min,"Action, Adventure",8.2,"After training with his mentor, Batman begins ...",70.0,Christopher Nolan,Christian Bale,Michael Caine,Ken Watanabe,Liam Neeson,1308302,206852432.0
168,Indiana Jones and the Last Crusade,1989,U,127 min,"Action, Adventure",8.2,"In 1938, after his father Professor Henry Jone...",65.0,Steven Spielberg,Harrison Ford,Sean Connery,Alison Doody,Denholm Elliott,692366,197171806.0
840,First Blood,1982,A,93 min,"Action, Adventure",7.7,A veteran Green Beret is forced by a cruel She...,61.0,Ted Kotcheff,Sylvester Stallone,Brian Dennehy,Richard Crenna,Bill McKinney,226541,47212904.0
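
- A groupby object can also be iterated over, which yields one `(group name, sub-dataframe)` pair per group. A small sketch on a made-up `movies` dataframe (the values are illustrative, not taken from the IMDB file):

```python
import pandas as pd

# Hypothetical data, used only for illustration
movies = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Action", "Crime"],
    "IMDB_Rating": [9.3, 9.2, 8.8, 9.0, 9.0],
})
grouped = movies.groupby(by="Genre")

# Each iteration gives the group label and the rows belonging to that group
sizes = {}
for genre, group in grouped:
    sizes[genre] = len(group)

print(sizes)
print(grouped.size())  # the same counts, computed by pandas directly
```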


- The usual thing to do, however, is to apply aggregation functions to the groupby object

In [57]:
dfg[['IMDB_Rating']].mean()  # select the numeric column first; text columns have no mean

Unnamed: 0_level_0,IMDB_Rating
Genre,Unnamed: 1_level_1
"Action, Adventure",8.180000
"Action, Adventure, Biography",7.900000
"Action, Adventure, Comedy",7.910000
"Action, Adventure, Crime",7.600000
"Action, Adventure, Drama",8.150000
...,...
"Mystery, Romance, Thriller",8.300000
"Mystery, Sci-Fi, Thriller",7.800000
"Mystery, Thriller",7.977778
Thriller,7.800000


- We can apply multiple functions using `.aggregate()`

In [58]:
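numeric_cols = ['Released_Year', 'IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross']  # 'mean' and 'sum' need numeric data
dfg[numeric_cols].aggregate(['mean', 'sum', 'count'])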


Unnamed: 0_level_0,Released_Year,Released_Year,Released_Year,IMDB_Rating,IMDB_Rating,IMDB_Rating,Meta_score,Meta_score,Meta_score,No_of_Votes,No_of_Votes,No_of_Votes,Gross,Gross,Gross
Unnamed: 0_level_1,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
"Action, Adventure",1993.800000,9969,5,8.180000,40.9,5,71.800000,359.0,5,925533.400000,4627667,5,2.295072e+08,1.147536e+09,5
"Action, Adventure, Biography",1972.000000,1972,1,7.900000,7.9,1,,0.0,0,52397.000000,52397,1,,0.000000e+00,0
"Action, Adventure, Comedy",1999.200000,19992,10,7.910000,79.1,10,66.857143,468.0,7,456076.600000,4560766,10,2.133793e+08,1.920414e+09,9
"Action, Adventure, Crime",2009.000000,2009,1,7.600000,7.6,1,,0.0,0,63882.000000,63882,1,,0.000000e+00,0
"Action, Adventure, Drama",1997.285714,27962,14,8.150000,114.1,14,80.461538,1046.0,13,663989.928571,9295859,14,2.224030e+08,2.668836e+09,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Mystery, Romance, Thriller",1958.000000,1958,1,8.300000,8.3,1,100.000000,100.0,1,364368.000000,364368,1,3.200000e+06,3.200000e+06,1
"Mystery, Sci-Fi, Thriller",1996.500000,3993,2,7.800000,15.6,2,70.000000,140.0,2,383185.000000,766370,2,3.575990e+07,7.151979e+07,2
"Mystery, Thriller",1987.000000,17883,9,7.977778,71.8,9,78.600000,393.0,5,341362.888889,3072266,9,3.320600e+07,1.992360e+08,6
Thriller,1967.000000,1967,1,7.800000,7.8,1,81.000000,81.0,1,27733.000000,27733,1,1.755074e+07,1.755074e+07,1


- And even apply different functions to different columns

In [59]:
def num_range(x):
    return x.max() - x.min()

dfg.aggregate({"Meta_score": ['max', 'min', 'mean', num_range], 
               "Gross": ['sum']})

Unnamed: 0_level_0,Meta_score,Meta_score,Meta_score,Meta_score,Gross
Unnamed: 0_level_1,max,min,mean,num_range,sum
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
"Action, Adventure",85.0,61.0,71.800000,24.0,1.147536e+09
"Action, Adventure, Biography",,,,,0.000000e+00
"Action, Adventure, Comedy",76.0,60.0,66.857143,16.0,1.920414e+09
"Action, Adventure, Crime",,,,,0.000000e+00
"Action, Adventure, Drama",98.0,61.0,80.461538,37.0,2.668836e+09
...,...,...,...,...,...
"Mystery, Romance, Thriller",100.0,100.0,100.000000,0.0,3.200000e+06
"Mystery, Sci-Fi, Thriller",74.0,66.0,70.000000,8.0,7.151979e+07
"Mystery, Thriller",100.0,52.0,78.600000,48.0,1.992360e+08
Thriller,81.0,81.0,81.000000,0.0,1.755074e+07
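
- If you would rather have flat, descriptive column names than a two-level column index, pandas also supports "named aggregation", where each output column is given as `new_name=(column, function)`. A minimal sketch on made-up data (the output names `best_score`, `score_range` and `total_gross` are my own):

```python
import pandas as pd


def num_range(x):
    return x.max() - x.min()


# Hypothetical data, used only for illustration
movies = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Crime"],
    "Meta_score": [80.0, 100.0, 84.0],
    "Gross": [10.0, 20.0, 5.0],
})

summary = movies.groupby("Genre").agg(
    best_score=("Meta_score", "max"),
    score_range=("Meta_score", num_range),
    total_gross=("Gross", "sum"),
)
print(summary)
```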


- By the way, you can use aggregate for non-grouped dataframes too
- This is pretty much what `df.describe()` does under the hood

In [60]:
df.agg(['mean', 'min', 'count', num_range])


Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
mean,,1992.221,,,,7.9493,,77.97153,,,,,,273692.911,68034750.0
min,(500) Days of Summer,1920.0,,100 min,"Action, Adventure",7.6,"""Documentary"" about a man who can look and act...",28.0,Aamir Khan,Aamir Khan,Adesh Prasad,Aamir Khan,Aamir Bashir,25088.0,1305.0
count,1000,1000.0,899.0,1000,1000,1000.0,1000,843.0,1000,1000,1000,1000,1000,1000.0,831.0
num_range,,1090.0,,,,1.7,,72.0,,,,,,2318022.0,936660900.0


### Dealing with missing values

- Missing values are typically denoted with `NaN`
- We can use `df.isna()` or `df.isnull()` to find missing values in a dataframe (both functions do exactly the same thing!)
- It returns a boolean for each element in the dataframe

In [61]:
df.isna()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
996,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
997,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
998,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True


- But it's usually more helpful to get this information by row or by column using the `.any()` or `.info()` method

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   int64  
 2   Certificate    899 non-null    object 
 3   Runtime        1000 non-null   object 
 4   Genre          1000 non-null   object 
 5   IMDB_Rating    1000 non-null   float64
 6   Overview       1000 non-null   object 
 7   Meta_score     843 non-null    float64
 8   Director       1000 non-null   object 
 9   Star1          1000 non-null   object 
 10  Star2          1000 non-null   object 
 11  Star3          1000 non-null   object 
 12  Star4          1000 non-null   object 
 13  No_of_Votes    1000 non-null   int64  
 14  Gross          831 non-null    float64
dtypes: float64(3), int64(2), object(10)
memory usage: 117.3+ KB


In [63]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
18,Hamilton,2020,PG-13,160 min,"Biography, Drama, History",8.6,The real life of one of America's foremost fou...,90.0,Thomas Kail,Lin-Manuel Miranda,Phillipa Soo,Leslie Odom Jr.,Renée Elise Goldsberry,55291,
20,Soorarai Pottru,2020,U,153 min,Drama,8.6,"Nedumaaran Rajangam ""Maara"" sets out to make t...",,Sudha Kongara,Suriya,Madhavan,Paresh Rawal,Aparna Balamurali,54995,
30,Seppuku,1962,,133 min,"Action, Drama, Mystery",8.6,When a ronin requesting seppuku at a feudal lo...,85.0,Masaki Kobayashi,Tatsuya Nakadai,Akira Ishihama,Shima Iwashita,Tetsurô Tanba,42004,
32,It's a Wonderful Life,1946,PG,130 min,"Drama, Family, Fantasy",8.6,An angel is sent from Heaven to help a despera...,89.0,Frank Capra,James Stewart,Donna Reed,Lionel Barrymore,Thomas Mitchell,405801,
46,Hotaru no haka,1988,U,89 min,"Animation, Drama, War",8.5,A young boy and his little sister struggle to ...,94.0,Isao Takahata,Tsutomu Tatsumi,Ayano Shiraishi,Akemi Yamaguchi,Yoshiko Shinohara,235231,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,Blowup,1966,A,111 min,"Drama, Mystery, Thriller",7.6,A fashion photographer unknowingly captures a ...,82.0,Michelangelo Antonioni,David Hemmings,Vanessa Redgrave,Sarah Miles,John Castle,56513,
995,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
998,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,
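
- Because `True` counts as 1 when summed, chaining `.sum()` onto `.isna()` gives a quick count of missing values per column or per row. A small sketch on a made-up `toy` dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
toy = pd.DataFrame({"Certificate": ["A", None, "U", None],
                    "Meta_score": [80.0, np.nan, 96.0, 90.0],
                    "Votes": [10, 20, 30, 40]})

missing_per_column = toy.isna().sum()        # count of missing values in each column
missing_per_row = toy.isna().sum(axis=1)     # count of missing values in each row
rows_with_any = toy[toy.isna().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(missing_per_row)
print(rows_with_any)
```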


- When we have missing values, we usually either drop them or impute them
- You'll learn more about imputing in DSCI 562/571/573
- For now, you can drop missing values using `df.dropna()`

In [64]:
df.dropna()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469.0
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,Giù la testa,1971,PG,157 min,"Drama, War, Western",7.6,A low-life bandit and an I.R.A. explosives exp...,77.0,Sergio Leone,Rod Steiger,James Coburn,Romolo Valli,Maria Monti,30144,696690.0
991,Kelly's Heroes,1970,GP,144 min,"Adventure, Comedy, War",7.6,A group of U.S. soldiers sneaks across enemy l...,50.0,Brian G. Hutton,Clint Eastwood,Telly Savalas,Don Rickles,Carroll O'Connor,45338,1378435.0
992,The Jungle Book,1967,U,78 min,"Animation, Adventure, Family",7.6,Bagheera the Panther and Baloo the Bear have a...,65.0,Wolfgang Reitherman,Phil Harris,Sebastian Cabot,Louis Prima,Bruce Reitherman,166409,141843612.0
994,A Hard Day's Night,1964,U,87 min,"Comedy, Music, Musical",7.6,"Over two ""typical"" days in the life of The Bea...",96.0,Richard Lester,John Lennon,Paul McCartney,George Harrison,Ringo Starr,40351,13780024.0
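
- By default `df.dropna()` drops every row that has at least one missing value, which can throw away a lot of data. The sketch below shows a few arguments that make it more selective, using a made-up `toy` dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
toy = pd.DataFrame({"Certificate": ["A", None, "U", None],
                    "Meta_score": [80.0, np.nan, 96.0, 90.0],
                    "Votes": [10, 20, 30, 40]})

drop_any = toy.dropna()                        # drop rows with any missing value
drop_subset = toy.dropna(subset=["Meta_score"])  # only consider Meta_score
drop_cols = toy.dropna(axis=1)                 # drop columns with any missing value
keep_thresh = toy.dropna(thresh=2)             # keep rows with at least 2 non-missing values

print(drop_any.index.tolist())
print(drop_subset.index.tolist())
print(drop_cols.columns.tolist())
print(keep_thresh.index.tolist())
```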


- Or you can fill them using `.fillna()`
- This method has various options for filling: you can use a fixed value, the mean of the column, the previous non-NaN value, etc.
- You'll use this method more in the machine learning courses

In [65]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,,,,5
3,,3.0,,4


In [66]:
df.fillna(0)  # fill with 0

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0
1,3.0,4.0,0.0,1
2,0.0,0.0,0.0,5
3,0.0,3.0,0.0,4


In [67]:
df.fillna(df.mean())  # fill with the mean

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,3.0,3.0,,5
3,3.0,3.0,,4


In [68]:
df.bfill()  # backward (upwards) propagate the next valid value; fillna(method='bfill') is deprecated since pandas 2.1

Unnamed: 0,A,B,C,D
0,3.0,2.0,,0
1,3.0,4.0,,1
2,,3.0,,5
3,,3.0,,4


In [69]:
df.ffill()  # forward (downward) propagate the last valid value; fillna(method='ffill') is deprecated since pandas 2.1

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,,1
2,3.0,4.0,,5
3,3.0,3.0,,4
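
- Two more filling options that are sometimes handy: a dictionary gives each column its own fill value, and `limit` caps how far a forward fill propagates. A minimal sketch on the same small dataframe (stored here under the illustrative name `gaps`):

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame([[np.nan, 2, np.nan, 0],
                     [3, 4, np.nan, 1],
                     [np.nan, np.nan, np.nan, 5],
                     [np.nan, 3, np.nan, 4]],
                    columns=list('ABCD'))

# Different fill value per column; columns not in the dict are left alone
per_column = gaps.fillna({"A": 0, "B": gaps["B"].median()})

# Forward fill, but carry each valid value down by at most one row
limited = gaps.ffill(limit=1)

print(per_column)
print(limited)
```

- Column `C` stays empty in both results: it is not in the dictionary, and it has no valid value to propagate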


## Deliverables

Now that you have had a whirlwind introduction to Pandas, complete the [Finding Pandas quiz on Canvas](https://canvas.ubc.ca/courses/106515/quizzes/579039).
While this quiz is graded for participation and not performance, your performance is a strong indicator of how well you understand the basics of Pandas.
We will use Pandas regularly in this course, so going over these modules multiple times is beneficial.
