# The pandas DataFrame

We will use the following convention for pandas: `import pandas as pd`

In [79]:
import pandas as pd 

Whenever you see `pd.` in code, it's referring to pandas.

The two main pandas data structures are the `DataFrame` and the `Series`.

- **Series**: A one-dimensional array-like object containing a sequence of values of a single type and associated labels, called an index.

- **DataFrame**: A rectangular table of data with an ordered collection of columns, each of which can have a different type. 
It has both row and column labels.
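
As a minimal sketch of the two definitions above, the following builds a small Series and a small DataFrame by hand. The names `scores` and `toy`, and the index labels `'a'`, `'b'`, `'c'`, are made up for illustration and are not part of the movies dataset.

```python
import pandas as pd

# A Series: one-dimensional values of a single type, plus an index of labels
scores = pd.Series([7.9, 7.1, 6.8], index=['a', 'b', 'c'], name='imdb_score')

# A DataFrame: a table whose columns can hold different types
toy = pd.DataFrame({
    'title': ['Avatar', 'Spectre'],   # strings
    'imdb_score': [7.9, 6.8],         # floats
})

print(scores.index.tolist())     # the labels of the Series
print(toy.shape)                 # (rows, columns)
print(type(toy['imdb_score']))   # a single column is itself a Series
```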

**Table of Contents:**

- [Loading a DataFrame](#1-Loading-a-DataFrame)
- [Selecting a pandas column from a DataFrame](#2.-Selecting-a-pandas-column-from-a-DataFrame)
- [Renaming columns in a pandas DataFrame](#3.-Renaming-columns-in-a-pandas-DataFrame)
- [Removing columns from a pandas DataFrame](#4.-Removing-columns-from-a-pandas-DataFrame)
- [Selecting multiple rows and columns from a pandas DataFrame](#5.-Selecting-multiple-rows-and-columns-from-a-pandas-DataFrame)
- [Practice exercises](#6-Practice-exercises)

## 1. Loading a DataFrame from a CSV (comma-separated values) file

Documentation for [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [80]:
url = 'movies.csv'
movies = pd.read_csv(url)
movies

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000
4,,Doug Walker,,...,7.1,,0
...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,...,7.7,,84
4912,Color,,43.0,...,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,...,6.3,,16
4914,Color,Daniel Hsia,14.0,...,6.3,2.35,660


The image below provides a labeled diagram of the major components of a DataFrame.

<img src="dataframe_anatomy.png" alt="drawing" width="700"/>

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p>pandas uses NaN (not a number) to represent missing values.</p>
</div>
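
To see how missing values behave, here is a small sketch on a hand-made DataFrame (the name `toy` and its contents are illustrative assumptions, not the movies data). It uses `isna()` to flag missing entries and then counts them per column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column
toy = pd.DataFrame({
    'color': ['Color', None, 'Color'],
    'aspect_ratio': [1.78, np.nan, 2.35],
})

missing_mask = toy.isna()             # True wherever a value is missing
missing_per_column = toy.isna().sum() # number of missing values per column

print(missing_mask)
print(missing_per_column)
```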

You can use the `.columns`, `.index` and `.values` attributes to access, respectively, the columns, the index and the data of a DataFrame.

In [36]:
# dataframe columns
movies.columns

Index(['color', 'director name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [37]:
# dataframe index
movies.index

RangeIndex(start=0, stop=4916, step=1)

In [38]:
# dataframe data
movies.values

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

You can use the `.dtypes` attribute to display each column name along with its **data type**.

In [39]:
movies.dtypes

color                         object
director name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

**Common pandas data types:**

| Type | Description |
| --- | :-- |
| `float64` | NumPy **float** (decimal) type |
| `int64` | NumPy **integer** type |
| `object` | NumPy type for storing **strings** (and other Python objects) |
| `category` | pandas **categorical** type |
| `bool` | NumPy **Boolean** type |
| `datetime64[ns]` | NumPy **date and time** type |
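
The sketch below shows how several of these types can appear in one DataFrame. The DataFrame `toy`, the `is_color` and `added_on` columns, and the dates are invented for illustration; the exact datetime resolution shown by `dtypes` may differ between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_score': [7.9, 7.1],             # decimals      -> float64
    'num_voted_users': [886204, 471220],  # whole numbers -> int64
    'is_color': [True, False],            # booleans      -> bool
    'content_rating': ['PG-13', 'PG-13'],
})

# Convert a text column to the pandas categorical type
toy['content_rating'] = toy['content_rating'].astype('category')

# Parse text into a datetime column (hypothetical dates)
toy['added_on'] = pd.to_datetime(['2020-01-01', '2020-01-02'])

print(toy.dtypes)
```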

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p>In broad terms, data can be classified as either continuous or categorical.</p>
<p> <b>Continuous</b> data represents measurements, such as height or temperature.
Continuous data can take on an infinite number of possible values.</p>
<p> <b>Categorical</b> data takes on a finite number of discrete values, such as car color or movie genre.</p>
</div>

In [40]:
# examine the first rows
movies.head(10)

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000
4,,Doug Walker,,...,7.1,,0
5,Color,Andrew Stanton,462.0,...,6.6,2.35,24000
6,Color,Sam Raimi,392.0,...,6.2,2.35,0
7,Color,Nathan Greno,324.0,...,7.8,1.85,29000
8,Color,Joss Whedon,635.0,...,7.5,2.35,118000
9,Color,David Yates,375.0,...,7.5,2.35,10000


In [41]:
# examine the bottom rows
movies.tail(7)

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
4909,Color,Anthony Vallone,,...,7.8,,4
4910,Color,Edward Burns,14.0,...,6.4,,413
4911,Color,Scott Smith,1.0,...,7.7,,84
4912,Color,,43.0,...,7.5,16.0,32000
4913,Color,Benjamin Roberds,13.0,...,6.3,,16
4914,Color,Daniel Hsia,14.0,...,6.3,2.35,660
4915,Color,Jon Gunn,43.0,...,6.6,1.85,456


In [42]:
# use python len function to get the number of rows
len(movies)

4916

In [43]:
# get size of the dataframe: rows x columns
movies.shape

(4916, 28)

## 2. Selecting a pandas column from a DataFrame

Selecting a single column from a DataFrame returns a **pandas Series** (that has the same index as the DataFrame).
A column in a DataFrame can be selected as a Series by **dict-like (bracket) notation or by attribute (dot notation)**:

In [44]:
# select the 'imdb_score' column using bracket notation
movies['imdb_score']

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [45]:
# or equivalently, use dot notation
movies.imdb_score

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [46]:
# elements in a Series can be selected by index label (using bracket notation)
scores = movies.imdb_score
scores[5]

6.6

To select more than one column, pass a list of column names inside the brackets. The result is a DataFrame rather than a Series.

In [83]:
movies[['director name','movie title']]

Unnamed: 0,director name,movie title
0,James Cameron,Avatar
1,Gore Verbinski,Pirates of the Caribbean: At World's End
2,Sam Mendes,Spectre
3,Christopher Nolan,The Dark Knight Rises
4,Doug Walker,Star Wars: Episode VII - The Force Awakens
...,...,...
4911,Scott Smith,Signed Sealed Delivered
4912,,The Following
4913,Benjamin Roberds,A Plague So Pleasant
4914,Daniel Hsia,Shanghai Calling


<div class="alert alert-block alert-danger"> 
<p><b>Warning</b></p>
<p>The bracket notation will always work, whereas the dot notation has some limitations:</p> 
<ul>
  <li> The dot notation doesn't work if there are spaces in the column name (see Example 1 below)</li>
  <li> The dot notation doesn't work if the column has the same name as a DataFrame method or attribute (like 'head' or 'shape')</li>
  <li> The dot notation can't be used to define the name of a new column (see Example 2 below) </li>
</ul>
</div>
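
The second limitation is easy to miss because it raises no error. A small sketch on a made-up DataFrame (`toy`, with a column deliberately named `'shape'`) shows that dot notation returns the DataFrame attribute, not the column:

```python
import pandas as pd

# Hypothetical data: one column name clashes with the DataFrame attribute `shape`
toy = pd.DataFrame({
    'shape': ['round', 'square'],
    'movie title': ['A', 'B'],
})

print(toy.shape)            # the attribute (rows, columns), not the column
print(toy['shape'])         # bracket notation returns the column
print(toy['movie title'])   # names containing spaces also need brackets
```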

**Example 1**

In [48]:
movies['director name']

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director name, Length: 4916, dtype: object

In [49]:
movies.director name

SyntaxError: invalid syntax (<ipython-input-49-a19716366549>, line 1)

**Example 2:** There are several columns that contain data on the number of Facebook likes.

In [81]:
movies[['actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes','director_facebook_likes']]

Unnamed: 0,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,director_facebook_likes
0,1000.0,936.0,855.0,0.0
1,40000.0,5000.0,1000.0,563.0
2,11000.0,393.0,161.0,0.0
3,27000.0,23000.0,23000.0,22000.0
4,131.0,12.0,,131.0
...,...,...,...,...
4911,637.0,470.0,318.0,2.0
4912,841.0,593.0,319.0,
4913,0.0,0.0,0.0,0.0
4914,946.0,719.0,489.0,0.0


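Let's add up all the actor and director Facebook like columns and assign the result to a new `total_likes` column. First we try dot notation, which does not create the column, and then bracket notation, which does.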

In [85]:
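# dot notation cannot create a column: this only sets an attribute on the object, and pandas emits a warning
movies.total_likes = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes + movies.director_facebook_likes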

  movies.total_likes = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes + movies.director_facebook_likes


In [86]:
movies['total_likes'] = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes + movies.director_facebook_likes

In [87]:
movies.head()

Unnamed: 0,color,director name,num_critic_for_reviews,...,aspect_ratio,movie_facebook_likes,total_likes
0,Color,James Cameron,723.0,...,1.78,33000,2791.0
1,Color,Gore Verbinski,302.0,...,2.35,0,46563.0
2,Color,Sam Mendes,602.0,...,2.35,85000,11554.0
3,Color,Christopher Nolan,813.0,...,2.35,164000,95000.0
4,,Doug Walker,,...,,0,
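
The difference between the two assignments can be checked directly. This is a minimal sketch on a toy DataFrame (the names `toy`, `a`, `b` and `total` are illustrative); the warning is silenced only to keep the output short.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # hide the warning pandas emits here
    toy.total = toy.a + toy.b         # sets a plain attribute, not a column

created_by_dot = 'total' in toy.columns
print(created_by_dot)                 # no 'total' column exists yet

toy['total'] = toy['a'] + toy['b']    # bracket notation creates the column
created_by_bracket = 'total' in toy.columns
print(created_by_bracket)
```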


**Homework:** Subtract `budget` from `gross` columns and assign the result to the `profit` column

In [None]:
# your code here


## 3. Renaming columns in a pandas DataFrame

Documentation for [`rename`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

In [52]:
# examine the column names
movies.columns

Index(['color', 'director name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

Let's rename the columns `'director name'` and `'movie title'` using the `rename` method.

In [53]:
# create a dictionary with the new names
new_column_names = {'director name':'director_name', 'movie title':'movie_title'}

# rename columns
movies.rename(columns=new_column_names, inplace=True)
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000
4,,Doug Walker,,...,7.1,,0
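
As a hedged alternative to `inplace=True`, `rename` can also return a new DataFrame and leave the original untouched. The sketch below uses a one-row toy DataFrame with placeholder values; it also shows one possible way to replace spaces in every column name at once through the string methods of the column index.

```python
import pandas as pd

# Placeholder values; only the column names matter here
toy = pd.DataFrame({'director name': ['X'], 'movie title': ['Y']})

# Without inplace=True, rename returns a new DataFrame
renamed = toy.rename(columns={'director name': 'director_name',
                              'movie title': 'movie_title'})
print(toy.columns.tolist())       # original names are unchanged
print(renamed.columns.tolist())

# Replace spaces in all column names in one step
cleaned = toy.copy()
cleaned.columns = cleaned.columns.str.replace(' ', '_')
print(cleaned.columns.tolist())
```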


## 4. Removing columns and/or rows from a pandas DataFrame

Documentation for [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [54]:
# remove a single column (axis=1 refers to columns)
movies.drop('director_name', axis=1, inplace=True) 
movies.head()

Unnamed: 0,color,num_critic_for_reviews,duration,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,723.0,178.0,...,7.9,1.78,33000
1,Color,302.0,169.0,...,7.1,2.35,0
2,Color,602.0,148.0,...,6.8,2.35,85000
3,Color,813.0,164.0,...,8.5,2.35,164000
4,,,,...,7.1,,0


In [56]:
# remove multiple columns at once
movies.drop(['color', 'duration'], axis=1, inplace=True)
movies.head()

Unnamed: 0,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,...,imdb_score,aspect_ratio,movie_facebook_likes
0,723.0,0.0,855.0,...,7.9,1.78,33000
1,302.0,563.0,1000.0,...,7.1,2.35,0
2,602.0,0.0,161.0,...,6.8,2.35,85000
3,813.0,22000.0,23000.0,...,8.5,2.35,164000
4,,131.0,,...,7.1,,0


In [57]:
# remove multiple rows at once (axis=0 refers to rows)
movies.drop([0, 3], axis=0, inplace=True)
movies.head()

Unnamed: 0,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,...,imdb_score,aspect_ratio,movie_facebook_likes
1,302.0,563.0,1000.0,...,7.1,2.35,0
2,602.0,0.0,161.0,...,6.8,2.35,85000
4,,131.0,,...,7.1,,0
5,462.0,475.0,530.0,...,6.6,2.35,24000
6,392.0,0.0,4000.0,...,6.2,2.35,0
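
Instead of `axis=1` and `axis=0`, `drop` also accepts the `columns=` and `index=` keywords, which some readers find easier to remember. A small sketch on a toy DataFrame (the name `toy` is an assumption), this time without `inplace=True` so the original is kept:

```python
import pandas as pd

toy = pd.DataFrame({
    'color': ['Color', 'Color', 'Color'],
    'duration': [178.0, 169.0, 148.0],
    'imdb_score': [7.9, 7.1, 6.8],
})

fewer_columns = toy.drop(columns=['color', 'duration'])  # same as axis=1
fewer_rows = toy.drop(index=[0, 2])                      # same as axis=0

print(fewer_columns.columns.tolist())
print(fewer_rows.index.tolist())
print(toy.shape)   # the original DataFrame is unchanged
```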


## 5. Selecting multiple rows and columns from a pandas DataFrame

With ``loc`` and ``iloc`` you can do practically any data selection operation on a DataFrame. ``loc`` is label-based: you specify rows and columns by their labels, and slices include the end label. ``iloc`` is integer-position-based: you specify rows and columns by their integer position, and slices exclude the end position, as in regular Python slicing.

- [The loc attribute](#5.1.-The-loc-attribute)
- [The iloc attribute](#5.2.-The-iloc-attribute)
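
The difference in slicing behaviour is easiest to see on a DataFrame whose index is not the default 0, 1, 2, ... The following sketch uses a made-up DataFrame `toy` with string labels:

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = toy.loc['a':'c']    # label slice: the end label 'c' is included
by_position = toy.iloc[0:2]    # position slice: position 2 is excluded

print(by_label)
print(by_position)
```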

In [58]:
# reload the movies dataframe
movies = pd.read_csv('movies.csv')
movies.head(5)

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000
4,,Doug Walker,,...,7.1,,0


### 5.1. The loc attribute

Documentation for [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)

The ``loc`` attribute is for **filtering rows and selecting columns by label (by their names)**

In [60]:
# row 0, all columns
movies.loc[0,:] 

color                                                                    Color
director name                                                    James Cameron
num_critic_for_reviews                                                     723
duration                                                                   178
director_facebook_likes                                                      0
actor_3_facebook_likes                                                     855
actor_2_name                                                  Joel David Moore
actor_1_facebook_likes                                                    1000
gross                                                              7.60506e+08
genres                                         Action|Adventure|Fantasy|Sci-Fi
actor_1_name                                                       CCH Pounder
movie title                                                             Avatar
num_voted_users                                     

In [64]:
# rows 0, 1, 2 and 3, all columns
movies.loc[[0,1,2,3],:] 

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000


In [66]:
# a more concise way: slice by label (with loc, the end label is included)
movies.loc[0:3,:] #rows 0 through 3, all columns

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000


In [68]:
# all rows, column 'color'
movies.loc[:,'color']

0       Color
1       Color
2       Color
3       Color
4         NaN
        ...  
4911    Color
4912    Color
4913    Color
4914    Color
4915    Color
Name: color, Length: 4916, dtype: object

In [71]:
# all rows, columns 'director name' and 'movie title'
movies.loc[:,['director name','movie title']] 

Unnamed: 0,director name,movie title
0,James Cameron,Avatar
1,Gore Verbinski,Pirates of the Caribbean: At World's End
2,Sam Mendes,Spectre
3,Christopher Nolan,The Dark Knight Rises
4,Doug Walker,Star Wars: Episode VII - The Force Awakens
...,...,...
4911,Scott Smith,Signed Sealed Delivered
4912,,The Following
4913,Benjamin Roberds,A Plague So Pleasant
4914,Daniel Hsia,Shanghai Calling


In [73]:
# all rows, columns 'movie title' through 'budget'
movies.loc[:,'movie title':'budget'] 

Unnamed: 0,movie title,num_voted_users,cast_total_facebook_likes,...,country,content_rating,budget
0,Avatar,886204,4834,...,USA,PG-13,237000000.0
1,Pirates of the Caribbean: At World's End,471220,48350,...,USA,PG-13,300000000.0
2,Spectre,275868,11700,...,UK,PG-13,245000000.0
3,The Dark Knight Rises,1144337,106759,...,USA,PG-13,250000000.0
4,Star Wars: Episode VII - The Force Awakens,8,143,...,,,
...,...,...,...,...,...,...,...
4911,Signed Sealed Delivered,629,2283,...,Canada,,
4912,The Following,73839,1753,...,USA,TV-14,
4913,A Plague So Pleasant,38,0,...,USA,,1400.0
4914,Shanghai Calling,1255,2386,...,USA,PG-13,


In [75]:
# rows 0 through 5, columns 'movie title' through 'budget'
movies.loc[0:5,'movie title':'budget']   

Unnamed: 0,movie title,num_voted_users,cast_total_facebook_likes,...,country,content_rating,budget
0,Avatar,886204,4834,...,USA,PG-13,237000000.0
1,Pirates of the Caribbean: At World's End,471220,48350,...,USA,PG-13,300000000.0
2,Spectre,275868,11700,...,UK,PG-13,245000000.0
3,The Dark Knight Rises,1144337,106759,...,USA,PG-13,250000000.0
4,Star Wars: Episode VII - The Force Awakens,8,143,...,,,
5,John Carter,212204,1873,...,USA,PG-13,263700000.0


### 5.2. The iloc attribute

Documentation for [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)

The ``iloc`` attribute is for **filtering rows and selecting columns by integer position**

In [76]:
# all rows, columns 0 and 3
movies.iloc[:,[0,3]] 

Unnamed: 0,color,duration
0,Color,178.0
1,Color,169.0
2,Color,148.0
3,Color,164.0
4,,
...,...,...
4911,Color,87.0
4912,Color,43.0
4913,Color,76.0
4914,Color,100.0


In [77]:
# all rows, columns 0 through 3 (with iloc, the end position 4 is excluded)
movies.iloc[:,0:4] 

Unnamed: 0,color,director name,num_critic_for_reviews,duration
0,Color,James Cameron,723.0,178.0
1,Color,Gore Verbinski,302.0,169.0
2,Color,Sam Mendes,602.0,148.0
3,Color,Christopher Nolan,813.0,164.0
4,,Doug Walker,,
...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0
4912,Color,,43.0,43.0
4913,Color,Benjamin Roberds,13.0,76.0
4914,Color,Daniel Hsia,14.0,100.0


In [78]:
# rows 0 through 2, all columns
movies.iloc[0:3,:] 

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
