# The pandas DataFrame

We will use the following convention for pandas: `import pandas as pd`

In [1]:
import pandas as pd 

Whenever you see `pd.` in code, it's referring to pandas.

The two main pandas data structures are `Series` and `DataFrame`:

- **`Series`**: A one-dimensional array-like object containing a sequence of values of a single type and associated labels, called an index.

- **`DataFrame`**: A rectangular table of data with an ordered collection of columns, each of which can have a different type. 
It has both row and column labels.

**Table of Contents:**

- [Series introduction](#1.-Series-Introduction)
- [DataFrames introduction](#2.-DataFrame-Introduction)
- [Selecting a Series from a DataFrame](#3.-Selecting-a-Series-from-a-DataFrame)
- [Renaming columns in a pandas DataFrame](#4.-Renaming-columns-in-a-pandas-DataFrame)
- [Removing columns and/or rows from a pandas DataFrame](#5.-Removing-columns-and/or-rows-from-a-pandas-DataFrame)
- [Selecting multiple rows and columns from a pandas DataFrame](#6.-Selecting-multiple-rows-and-columns-from-a-pandas-DataFrame)
- [Handling missing values in pandas](#7.-Handling-missing-values-in-pandas)

## 1. Series Introduction

Series are used to model one-dimensional data. 
We can create a `Series` from an array of data.

In [28]:
s1 = pd.Series([3,10,0,1,20])
s1

0     3
1    10
2     0
3     1
4    20
dtype: int64

The leftmost column is the `index`. By default, the index consists of monotonically increasing integers starting at 0. 
The rightmost column contains the values of the Series.
The image below provides a labeled diagram of the major components of a `Series`.

<img src="images/series_anatomy.png" alt="drawing" width="700"/>

You can use the `.name`, `.index`, `.values`, and `.dtype` attributes to access, respectively, the name, the index, the data, and the data type of a `Series`.

In [29]:
s2 = pd.Series([3,10,0,1,20],
               index=['a','b','c','d','e'],
               name='my first series')
s2

a     3
b    10
c     0
d     1
e    20
Name: my first series, dtype: int64

In [30]:
s2.name

'my first series'

In [31]:
s2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [32]:
s2.values

array([ 3, 10,  0,  1, 20], dtype=int64)

In [33]:
s2.dtype

dtype('int64')

**Common `pandas` data types:**

| Type | Description |
| --- | :-- |
| `float64` | NumPy **float** (decimal) type |
| `int64` | NumPy **integer** type |
| `object` | NumPy type for storing **strings** |
| `category` | pandas **categorical** type |
| `bool` | NumPy **Boolean** type |
| `datetime64[ns]` | NumPy **date** type |
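
As a quick illustration of the table above, the following sketch builds one small `Series` per data type and prints its `dtype`. The sample values are made up, and the data types are requested explicitly where the default could vary between platforms or pandas versions.

```python
import pandas as pd

# Made-up sample values, one Series per common dtype
examples = {
    'float64': pd.Series([1.5, 2.0, 3.25]),
    'int64': pd.Series([1, 2, 3], dtype='int64'),
    'object': pd.Series(['a', 'b', 'c'], dtype='object'),
    'category': pd.Series(['red', 'blue', 'red'], dtype='category'),
    'bool': pd.Series([True, False, True]),
    'datetime64[ns]': pd.Series(['2020-01-01', '2020-06-15'], dtype='datetime64[ns]'),
}

for label, series in examples.items():
    print(label, '->', series.dtype)
```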

We can use index labels to select a single value or a set of values:

In [34]:
s2

a     3
b    10
c     0
d     1
e    20
Name: my first series, dtype: int64

In [35]:
s2['c']

0

In [36]:
s2[['a','d']]

a    3
d    1
Name: my first series, dtype: int64

In [37]:
s2[['e','d','c']]

e    20
d     1
c     0
Name: my first series, dtype: int64

**Loading a Series from a CSV (comma-separated values) file:** The pandas `read_csv` function reads a file and parses its content into a `DataFrame`, which can be squeezed into a `Series` when it has a single column.
See the documentation for [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

In [27]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/incidents_MT.csv'
count = pd.read_csv(url,
                    index_col='county')   # read_csv returns a DataFrame
count = count.squeeze('columns')          # single-column DataFrame -> Series (the squeeze=True keyword was removed in pandas 2.0)
count

county
YELLOWSTONE        10095
MISSOULA            6195
CASCADE             5740
FLATHEAD            3876
GALLATIN            3690
LEWIS AND CLARK     3246
SILVER BOW          2534
LAKE                1385
HILL                1165
RAVALLI             1040
ROOSEVELT            580
CUSTER               499
DEER LODGE           489
PARK                 409
RICHLAND             409
LINCOLN              378
GLACIER              325
CARBON               290
JEFFERSON            265
FERGUS               230
BIG HORN             226
VALLEY               224
STILLWATER           205
TOOLE                188
DAWSON               184
POWELL               153
MADISON              152
BROADWATER           145
SANDERS              142
PHILLIPS             141
BEAVERHEAD           129
MUSSELSHELL          112
SWEET GRASS          107
TETON                104
ROSEBUD               99
WHEATLAND             52
FALLON                45
MEAGHER               39
PONDERA               35
SHERIDAN          
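
In recent pandas versions the `squeeze` keyword of `read_csv` is no longer available (to the best of our knowledge, it was deprecated in 1.4 and removed in 2.0). The sketch below shows the same idea offline: it reads a two-column CSV from an in-memory string and converts the resulting one-column DataFrame to a Series with the `DataFrame.squeeze` method. The three rows are copied from the output above; the `incidents` column name is an assumption, since the actual header of the file is not shown.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file; 'incidents' is an assumed column name
csv_text = """county,incidents
YELLOWSTONE,10095
MISSOULA,6195
CASCADE,5740
"""

df = pd.read_csv(io.StringIO(csv_text), index_col='county')  # one-column DataFrame
count_small = df.squeeze('columns')                           # DataFrame -> Series

print(type(df).__name__, '->', type(count_small).__name__)
print(count_small)
```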

## 2. DataFrame Introduction

In [2]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/movies.csv'
movies = pd.read_csv(url)
movies

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


The image below provides a labeled diagram of the major components of a DataFrame.

<img src="dataframe_anatomy.png" alt="drawing" width="700"/>

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p>pandas uses NaN (not a number) to represent missing values.</p>
</div>
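
A minimal sketch of how missing values behave, using a made-up Series rather than the movies data: `None` in numeric data is stored as `NaN`, `isna` flags the missing entries, and reductions such as `sum` skip them by default.

```python
import numpy as np
import pandas as pd

# Made-up values; None is converted to NaN in a float Series
s = pd.Series([1.0, None, 3.0, np.nan])

print(s.isna())        # True where the value is missing
print(s.isna().sum())  # number of missing values
print(s.sum())         # NaN is skipped by default
```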

You can use the `.columns`, `.index` and `.values` attributes to access, respectively, the columns, the index and the data of a DataFrame.

In [4]:
# dataframe columns
movies.columns

Index(['color', 'director name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [5]:
# dataframe index
movies.index

RangeIndex(start=0, stop=4916, step=1)

In [6]:
# dataframe data
movies.values

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

You can use the `.dtypes` attribute to display each column name along with its **data type**.

In [7]:
movies.dtypes

color                         object
director name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p>In broad terms, data can be classified as either continuous or categorical.</p>
<p><b>Continuous</b> data represents measurements, such as height or temperature.
Continuous data can take on an infinite number of possible values.</p>
<p><b>Categorical</b> data takes one of a finite number of discrete values, such as car color or movie genre.</p>
</div>
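
To connect this distinction to pandas data types, the sketch below uses a made-up DataFrame (not the movies data). Numeric columns are picked out with `select_dtypes`, and a text column with few distinct values is converted to the `category` dtype. Treat it as one possible approach, not a rule: whether a numeric column is really continuous still depends on what it measures.

```python
import pandas as pd

# Made-up data: one continuous column and one categorical column
cars = pd.DataFrame({
    'weight_kg': [1200.5, 980.0, 1510.2, 1100.0],
    'color': ['red', 'blue', 'red', 'white'],
})

continuous = cars.select_dtypes(include='number')  # numeric columns only
cars['color'] = cars['color'].astype('category')   # store as categorical

print(list(continuous.columns))
print(cars['color'].cat.categories)
print(cars.dtypes)
```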

In [8]:
# examine the first rows
movies.head()

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [9]:
# examine the last rows
movies.tail()

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.0,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
4915,Color,Jon Gunn,43.0,90.0,16.0,16.0,Brian Herzlinger,86.0,85222.0,Documentary,...,84.0,English,USA,PG,1100.0,2004.0,23.0,6.6,1.85,456


In [11]:
# display a random sample of rows
movies.sample(5)

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
1656,Color,Warren Beatty,110.0,108.0,631.0,95.0,Kirk Baltz,631.0,26525834.0,Comedy|Drama|Romance,...,227.0,English,USA,R,30000000.0,1998.0,199.0,6.8,1.85,0
246,Color,Josh Trank,369.0,100.0,128.0,78.0,Reg E. Cathey,596.0,56114221.0,Action|Adventure|Sci-Fi,...,695.0,English,USA,PG-13,120000000.0,2015.0,360.0,4.3,2.35,41000
2473,Color,Chris Robinson,54.0,105.0,49.0,104.0,Adam Boyer,680.0,21160089.0,Comedy|Crime|Drama|Music|Romance,...,92.0,English,USA,PG-13,,2006.0,503.0,6.0,2.35,874
2793,Color,Renny Harlin,102.0,99.0,212.0,54.0,Rodney Eastman,130.0,49369900.0,Fantasy|Horror|Thriller,...,260.0,English,USA,R,7000000.0,1988.0,125.0,5.7,1.85,0
4116,Color,James Dodson,22.0,106.0,8.0,315.0,Anupam Kher,611.0,115504.0,Comedy|Drama|Romance,...,26.0,English,UK,PG-13,14000000.0,2008.0,397.0,6.2,2.35,0


In [42]:
# use Python's built-in len function to get the number of rows
len(movies)

4916

In [43]:
# get size of the dataframe: rows x columns
movies.shape

(4916, 28)

## 3. Selecting a Series from a DataFrame

Selecting a single column from a DataFrame returns a **pandas Series** (that has the same index as the DataFrame).
A column in a DataFrame can be selected as a Series by **dictionary-like (bracket) notation or by attribute (dot notation)**:

In [45]:
# select the 'imdb_score' column using dot notation
movies.imdb_score

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [43]:
# or equivalently, use bracket notation
movies['imdb_score']

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

We can select more than one column by passing a list of column names; the result is a DataFrame:

In [42]:
movies[['director name','movie title','imdb_score']]

Unnamed: 0,director name,movie title,imdb_score
0,James Cameron,Avatar,7.9
1,Gore Verbinski,Pirates of the Caribbean: At World's End,7.1
2,Sam Mendes,Spectre,6.8
3,Christopher Nolan,The Dark Knight Rises,8.5
4,Doug Walker,Star Wars: Episode VII - The Force Awakens,7.1
...,...,...,...
4911,Scott Smith,Signed Sealed Delivered,7.7
4912,,The Following,7.5
4913,Benjamin Roberds,A Plague So Pleasant,6.3
4914,Daniel Hsia,Shanghai Calling,6.3


<div class="alert alert-block alert-danger"> 
<p><b>Warning</b></p>
<p>The bracket notation will always work, whereas the dot notation has some limitations:</p> 
<ul>
  <li> The dot notation doesn't work if there are spaces in the column name (see Example 1 below)</li>
  <li> The dot notation doesn't work if the column has the same name as a DataFrame method or attribute (like 'head' or 'shape')</li>
  <li> The dot notation can't be used to create a new column (see Example 2 below) </li>
</ul>
</div>
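
The limitations listed in the warning can be reproduced on a small, made-up DataFrame. The column names here are chosen purely for illustration.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'first name': ['Ana', 'Bo'], 'shape': ['round', 'square']})

# 1. Spaces: `df.first name` is a SyntaxError, so bracket notation is required
print(df['first name'])

# 2. Name clash: df.shape is the (rows, columns) attribute, not the column
print(df.shape)     # tuple of dimensions
print(df['shape'])  # the actual column

# 3. New columns: attribute assignment does not add a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about this assignment
    df.initials = ['A', 'B']
print('initials' in df.columns)      # attribute assignment added no column

df['initials'] = ['A', 'B']
print('initials' in df.columns)      # bracket assignment did
```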

**Example 1**

In [44]:
movies['director name']

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director name, Length: 4916, dtype: object

In [45]:
movies.director name

SyntaxError: invalid syntax (<ipython-input-45-a19716366549>, line 1)

**Example 2:** There are several columns that contain data on the number of Facebook likes.

In [81]:
movies[['actor_1_facebook_likes','actor_2_facebook_likes','actor_3_facebook_likes','director_facebook_likes']]

Unnamed: 0,actor_1_facebook_likes,actor_2_facebook_likes,actor_3_facebook_likes,director_facebook_likes
0,1000.0,936.0,855.0,0.0
1,40000.0,5000.0,1000.0,563.0
2,11000.0,393.0,161.0,0.0
3,27000.0,23000.0,23000.0,22000.0
4,131.0,12.0,,131.0
...,...,...,...,...
4911,637.0,470.0,318.0,2.0
4912,841.0,593.0,319.0,
4913,0.0,0.0,0.0,0.0
4914,946.0,719.0,489.0,0.0


Let's add up the actor and director Facebook likes columns and assign the result to a new `total_likes` column.

In [85]:
movies.total_likes = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes + movies.director_facebook_likes

This assignment does not create a column. pandas issues a `UserWarning` because columns cannot be created through a new attribute name; bracket notation is required:


In [86]:
movies['total_likes'] = movies.actor_1_facebook_likes + movies.actor_2_facebook_likes + movies.actor_3_facebook_likes + movies.director_facebook_likes

In [87]:
movies.head()

Unnamed: 0,color,director name,num_critic_for_reviews,...,aspect_ratio,movie_facebook_likes,total_likes
0,Color,James Cameron,723.0,...,1.78,33000,2791.0
1,Color,Gore Verbinski,302.0,...,2.35,0,46563.0
2,Color,Sam Mendes,602.0,...,2.35,85000,11554.0
3,Color,Christopher Nolan,813.0,...,2.35,164000,95000.0
4,,Doug Walker,,...,,0,
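
Row 4 has `NaN` in `total_likes` because one of its inputs is missing, and `+` propagates missing values. If you would rather treat missing counts as zero, one option is a row-wise `sum`, which skips `NaN` by default. The sketch below uses a two-row frame with the same column names (values copied from rows 0 and 4 above) to show the difference.

```python
import numpy as np
import pandas as pd

likes = pd.DataFrame({
    'actor_1_facebook_likes': [1000.0, 131.0],
    'actor_2_facebook_likes': [936.0, 12.0],
    'actor_3_facebook_likes': [855.0, np.nan],
    'director_facebook_likes': [0.0, 131.0],
})

# Column-by-column addition: any NaN makes the row total NaN
added = (likes['actor_1_facebook_likes'] + likes['actor_2_facebook_likes']
         + likes['actor_3_facebook_likes'] + likes['director_facebook_likes'])

# Row-wise sum: NaN values are skipped (skipna=True is the default)
summed = likes.sum(axis=1)

print(added)
print(summed)
```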


**Challenge:** Subtract `budget` from `gross` and assign the result to the `profit` column

In [89]:
# your code here
movies['profit'] = movies.gross-movies.budget
movies.head()

Unnamed: 0,color,director name,num_critic_for_reviews,...,movie_facebook_likes,total_likes,profit
0,Color,James Cameron,723.0,...,33000,2791.0,523505847.0
1,Color,Gore Verbinski,302.0,...,0,46563.0,9404152.0
2,Color,Sam Mendes,602.0,...,85000,11554.0,-44925825.0
3,Color,Christopher Nolan,813.0,...,164000,95000.0,198130642.0
4,,Doug Walker,,...,0,,


**Extra:** set the DataFrame index using existing columns

In [46]:
movies.set_index('movie title', 
                 inplace=True)
movies.head()

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [47]:
movies.index

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='movie title', length=4916)

In [48]:
movies['movie title']

KeyError: 'movie title'

We can use the `reset_index` method to restore the default index of monotonically increasing integers.

In [49]:
movies.reset_index(inplace=True,
                   drop=False) # insert the old index into the DataFrame as a column

In [50]:
movies.index

RangeIndex(start=0, stop=4916, step=1)

In [51]:
movies['movie title']

0                                           Avatar
1         Pirates of the Caribbean: At World's End
2                                          Spectre
3                            The Dark Knight Rises
4       Star Wars: Episode VII - The Force Awakens
                           ...                    
4911                       Signed Sealed Delivered
4912                                 The Following
4913                          A Plague So Pleasant
4914                              Shanghai Calling
4915                             My Date with Drew
Name: movie title, Length: 4916, dtype: object
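
Both methods also work without `inplace=True`, in which case they return a new DataFrame and leave the original untouched. A minimal sketch on a made-up three-row frame (titles and scores copied from the outputs in this notebook):

```python
import pandas as pd

films = pd.DataFrame({
    'movie title': ['Avatar', 'Spectre', 'Tangled'],
    'imdb_score': [7.9, 6.8, 7.8],
})

by_title = films.set_index('movie title')  # 'movie title' becomes the index
print(by_title.loc['Spectre', 'imdb_score'])

restored = by_title.reset_index()          # index goes back to a regular column
print(restored.columns.tolist())
print(films.index)                         # the original frame is unchanged
```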

## 4. Renaming columns in a pandas DataFrame

In [15]:
# reload the movies dataframe
movies = pd.read_csv('https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/movies.csv')
movies.head(5)

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


Documentation for [`rename`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

In [16]:
# examine the column names
movies.columns

Index(['color', 'director name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

Let's rename the columns `'director name'` and `'movie title'` using the `rename` method:

In [53]:
# create a dictionary with the new names
new_column_names = {'director name':'director_name', 'movie title':'movie_title'}

# rename columns
movies.rename(columns=new_column_names, inplace=True)
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,...,8.5,2.35,164000
4,,Doug Walker,,...,7.1,,0
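
When many columns need the same fix, `rename` also accepts a function that is applied to every column name. The sketch below, on a made-up one-row frame, replaces spaces with underscores in all column names at once and returns a new DataFrame instead of modifying the original in place.

```python
import pandas as pd

df = pd.DataFrame({'director name': ['Sam Mendes'],
                   'movie title': ['Spectre'],
                   'imdb_score': [6.8]})

# The function receives each column label and returns the new label
renamed = df.rename(columns=lambda name: name.replace(' ', '_'))

print(renamed.columns.tolist())
print(df.columns.tolist())  # original names are untouched
```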


## 5. Removing columns and/or rows from a pandas DataFrame

Documentation for [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [54]:
# remove a single column (axis=1 refers to columns)
movies.drop('director_name', axis=1, inplace=True) 
movies.head()

Unnamed: 0,color,num_critic_for_reviews,duration,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,723.0,178.0,...,7.9,1.78,33000
1,Color,302.0,169.0,...,7.1,2.35,0
2,Color,602.0,148.0,...,6.8,2.35,85000
3,Color,813.0,164.0,...,8.5,2.35,164000
4,,,,...,7.1,,0


In [56]:
# remove multiple columns at once
movies.drop(['color', 'duration'], axis=1, inplace=True)
movies.head()

Unnamed: 0,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,...,imdb_score,aspect_ratio,movie_facebook_likes
0,723.0,0.0,855.0,...,7.9,1.78,33000
1,302.0,563.0,1000.0,...,7.1,2.35,0
2,602.0,0.0,161.0,...,6.8,2.35,85000
3,813.0,22000.0,23000.0,...,8.5,2.35,164000
4,,131.0,,...,7.1,,0


In [57]:
# remove multiple rows at once (axis=0 refers to rows)
movies.drop([0, 3], axis=0, inplace=True)
movies.head()

Unnamed: 0,num_critic_for_reviews,director_facebook_likes,actor_3_facebook_likes,...,imdb_score,aspect_ratio,movie_facebook_likes
1,302.0,563.0,1000.0,...,7.1,2.35,0
2,602.0,0.0,161.0,...,6.8,2.35,85000
4,,131.0,,...,7.1,,0
5,462.0,475.0,530.0,...,6.6,2.35,24000
6,392.0,0.0,4000.0,...,6.2,2.35,0
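
`drop` also accepts the `columns=` and `index=` keywords, which some readers find clearer than `axis=1` and `axis=0`. A small sketch on made-up data, without `inplace=True`, so that each call returns a new DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'color': ['Color', 'Color', None],
                   'duration': [178.0, 169.0, 148.0],
                   'imdb_score': [7.9, 7.1, 6.8]})

no_cols = df.drop(columns=['color', 'duration'])  # same as axis=1
no_rows = df.drop(index=[0, 2])                   # same as axis=0

print(no_cols.columns.tolist())
print(no_rows.index.tolist())
print(df.shape)  # the original is unchanged
```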


## 6. Selecting multiple rows and columns from a pandas DataFrame

With ``loc`` and ``iloc`` you can do practically any data selection operation on DataFrames you can think of. ``loc`` is label-based, which means that you have to specify rows and columns by their row and column labels. ``iloc`` is integer-position based, so you have to specify rows and columns by their integer positions.

- [The loc attribute](#6.1.-The-loc-attribute)
- [The iloc attribute](#6.2.-The-iloc-attribute)
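
One practical difference is worth seeing early: a label slice with `loc` includes both endpoints, whereas an integer slice with `iloc` excludes the end position, like ordinary Python slicing. A minimal sketch on a made-up frame indexed by title (values copied from the outputs in this notebook):

```python
import pandas as pd

films = pd.DataFrame(
    {'title_year': [2009, 2015, 2012, 2010],
     'imdb_score': [7.9, 6.8, 8.5, 7.8]},
    index=['Avatar', 'Spectre', 'The Dark Knight Rises', 'Tangled'],
)

by_label = films.loc['Spectre':'Tangled', 'imdb_score']  # both endpoints included
by_position = films.iloc[1:3, 1]                         # position 3 excluded

print(by_label)
print(by_position)
```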

In [12]:
# reload the movies dataframe
movies = pd.read_csv('https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/movies.csv', index_col='movie title')
movies.head()

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


### 6.1. The loc attribute

Documentation for [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)

The ``loc`` attribute is for **filtering rows and selecting columns by label (by their names)**

In [7]:
# select Avatar, The Avengers and Toy Story, and all columns
movies.loc[['Avatar','The Avengers','Toy Story'],:] 

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
The Avengers,Color,Joss Whedon,703.0,173.0,0.0,19000.0,Robert Downey Jr.,26000.0,623279547.0,Action|Adventure|Sci-Fi,...,1722.0,English,USA,PG-13,220000000.0,2012.0,21000.0,8.1,1.85,123000
Toy Story,Color,John Lasseter,166.0,74.0,487.0,802.0,John Ratzenberger,15000.0,191796233.0,Adventure|Animation|Comedy|Family|Fantasy,...,391.0,English,USA,G,30000000.0,1995.0,1000.0,8.3,1.85,0


In [13]:
movies.head(10)

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


In [16]:
# movies Spectre through Harry Potter and the Half-Blood Prince
movies.loc['Spectre':'Harry Potter and the Half-Blood Prince',:] 

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
John Carter,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
Spider-Man 3,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
Tangled,Color,Nathan Greno,324.0,100.0,15.0,284.0,Donna Murphy,799.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,...,387.0,English,USA,PG,260000000.0,2010.0,553.0,7.8,1.85,29000
Avengers: Age of Ultron,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
Harry Potter and the Half-Blood Prince,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000


In [68]:
# all rows, column 'color'
movies.loc[:,'color']

0       Color
1       Color
2       Color
3       Color
4         NaN
        ...  
4911    Color
4912    Color
4913    Color
4914    Color
4915    Color
Name: color, Length: 4916, dtype: object

In [71]:
# all rows, columns 'director name' and 'movie title'
movies.loc[:,['director name','movie title']] 

Unnamed: 0,director name,movie title
0,James Cameron,Avatar
1,Gore Verbinski,Pirates of the Caribbean: At World's End
2,Sam Mendes,Spectre
3,Christopher Nolan,The Dark Knight Rises
4,Doug Walker,Star Wars: Episode VII - The Force Awakens
...,...,...
4911,Scott Smith,Signed Sealed Delivered
4912,,The Following
4913,Benjamin Roberds,A Plague So Pleasant
4914,Daniel Hsia,Shanghai Calling


In [73]:
# all rows, columns 'movie title' through 'budget'
movies.loc[:,'movie title':'budget'] 

Unnamed: 0,movie title,num_voted_users,cast_total_facebook_likes,...,country,content_rating,budget
0,Avatar,886204,4834,...,USA,PG-13,237000000.0
1,Pirates of the Caribbean: At World's End,471220,48350,...,USA,PG-13,300000000.0
2,Spectre,275868,11700,...,UK,PG-13,245000000.0
3,The Dark Knight Rises,1144337,106759,...,USA,PG-13,250000000.0
4,Star Wars: Episode VII - The Force Awakens,8,143,...,,,
...,...,...,...,...,...,...,...
4911,Signed Sealed Delivered,629,2283,...,Canada,,
4912,The Following,73839,1753,...,USA,TV-14,
4913,A Plague So Pleasant,38,0,...,USA,,1400.0
4914,Shanghai Calling,1255,2386,...,USA,PG-13,


In [75]:
# rows 0 through 5, columns 'movie title' through 'budget'
movies.loc[0:5,'movie title':'budget']   

Unnamed: 0,movie title,num_voted_users,cast_total_facebook_likes,...,country,content_rating,budget
0,Avatar,886204,4834,...,USA,PG-13,237000000.0
1,Pirates of the Caribbean: At World's End,471220,48350,...,USA,PG-13,300000000.0
2,Spectre,275868,11700,...,UK,PG-13,245000000.0
3,The Dark Knight Rises,1144337,106759,...,USA,PG-13,250000000.0
4,Star Wars: Episode VII - The Force Awakens,8,143,...,,,
5,John Carter,212204,1873,...,USA,PG-13,263700000.0
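
The slicing rules used with `loc` above can be checked on a small frame. The following is a minimal sketch on made-up toy data (the `toy` frame and its shortened titles are assumptions for illustration, not the `movies` dataset): label slices include both endpoints, and a list of labels returns those labels in the order given.

```python
import pandas as pd

# Hypothetical toy data, not the movies dataset used in this notebook
toy = pd.DataFrame(
    {
        "year": [2009, 2007, 2015, 2012],
        "score": [7.9, 7.1, 6.8, 8.5],
        "country": ["USA", "USA", "UK", "USA"],
    },
    index=["Avatar", "Pirates", "Spectre", "Dark Knight"],
)

# Label slices include BOTH endpoints, for rows and for columns
rows = toy.loc["Pirates":"Spectre", :]
cols = toy.loc[:, "year":"score"]

# A list of labels selects exactly those labels, in the order given
picked = toy.loc[["Spectre", "Avatar"], ["country", "year"]]

print(rows)
print(cols)
print(picked)
```

Because the end label is included, `toy.loc["Pirates":"Spectre", :]` returns two rows, which matches the behavior of `movies.loc[0:5, ...]` returning six rows.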


### 6.2. The iloc attribute

Documentation for ['iloc'](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)

The `iloc` attribute selects rows and columns by integer position. Unlike `loc`, a slice in `iloc` excludes its end position, so `0:4` selects positions 0 through 3.

In [76]:
# all rows, columns 0 and 3
movies.iloc[:,[0,3]] 

Unnamed: 0,color,duration
0,Color,178.0
1,Color,169.0
2,Color,148.0
3,Color,164.0
4,,
...,...,...
4911,Color,87.0
4912,Color,43.0
4913,Color,76.0
4914,Color,100.0


In [77]:
# all rows, columns 0 through 3
movies.iloc[:,0:4] 

Unnamed: 0,color,director name,num_critic_for_reviews,duration
0,Color,James Cameron,723.0,178.0
1,Color,Gore Verbinski,302.0,169.0
2,Color,Sam Mendes,602.0,148.0
3,Color,Christopher Nolan,813.0,164.0
4,,Doug Walker,,
...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0
4912,Color,,43.0,43.0
4913,Color,Benjamin Roberds,13.0,76.0
4914,Color,Daniel Hsia,14.0,100.0


In [78]:
# rows 0 through 2, all columns
movies.iloc[0:3,:] 

Unnamed: 0,color,director name,num_critic_for_reviews,...,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,...,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,...,7.1,2.35,0
2,Color,Sam Mendes,602.0,...,6.8,2.35,85000
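
To make the difference between positions and labels concrete, here is a small sketch on a made-up frame whose index labels deliberately do not start at 0 (the `toy` frame is an assumption for illustration). It contrasts an `iloc` slice, which excludes the end position, with a `loc` slice, which includes the end label.

```python
import pandas as pd

# Hypothetical toy frame with a non-default integer index
toy = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5], "c": list("wxyz")},
    index=[100, 101, 102, 103],
)

# Positions 0 and 1 on each axis; the end position is excluded
by_position = toy.iloc[0:2, 0:2]

# Labels 100 through 102 and columns 'a' through 'b'; the end label is included
by_label = toy.loc[100:102, "a":"b"]

# Negative positions count from the end, as with Python lists
last_row = toy.iloc[-1, :]

print(by_position.shape)  # (2, 2)
print(by_label.shape)     # (3, 2)
print(last_row["c"])      # z
```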


## 7. Handling missing values in pandas

- [Dropping rows/columns with missing values](#7.1.-Dropping-rows/columns-with-missing-values)
- [Filling in missing values](#7.2.-Filling-in-missing-values)

What does "NaN" mean?

- "NaN" is not a string, rather it's a special value: numpy.nan.
- It stands for "Not a Number" and indicates a **missing value**.
- `read_csv` detects missing values by default when reading the file (for example, empty fields) and replaces them with this special value.
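
The points above can be verified with a short sketch. The CSV text below is made up for illustration (it is not the movies file); it relies on the default behavior of `read_csv`, which treats empty fields and strings such as `NA` as missing.

```python
import io

import numpy as np
import pandas as pd

# NaN is a float, and it never compares equal to anything, including itself
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False

# Hypothetical CSV text: an empty field and the string "NA" are both read as missing
csv_text = "title,budget,country\nA,100,USA\nB,,NA\n"
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy["budget"].dtype)             # float64
print(pd.isna(toy.loc[1, "country"]))  # True
```

Since `NaN` does not equal itself, test for missing values with `isna`/`notna` rather than with `==`.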

In [18]:
movies

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Signed Sealed Delivered,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
The Following,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
A Plague So Pleasant,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
Shanghai Calling,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


The `isna` method returns a DataFrame of booleans (True if missing, False if not missing).

In [17]:
movies.isna()

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Pirates of the Caribbean: At World's End,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Spectre,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
The Dark Knight Rises,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Star Wars: Episode VII - The Force Awakens,True,False,True,True,False,True,False,False,True,False,...,True,True,True,True,True,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Signed Sealed Delivered,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,True,False,False,False,True,False
The Following,False,True,False,False,True,False,False,False,True,False,...,False,False,False,False,True,True,False,False,False,False
A Plague So Pleasant,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,False,False,False,True,False
Shanghai Calling,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


The `notna` method returns the opposite of `isna` (True if not missing, False if missing).

In [19]:
movies.notna()

movie title,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
Pirates of the Caribbean: At World's End,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
Spectre,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
The Dark Knight Rises,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
Star Wars: Episode VII - The Force Awakens,False,True,False,False,True,False,True,True,False,True,...,False,False,False,False,False,False,True,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Signed Sealed Delivered,True,True,True,True,True,True,True,True,False,True,...,True,True,True,False,False,True,True,True,False,True
The Following,True,False,True,True,False,True,True,True,False,True,...,True,True,True,True,False,False,True,True,True,True
A Plague So Pleasant,True,True,True,True,True,True,True,True,False,True,...,True,True,True,False,True,True,True,True,False,True
Shanghai Calling,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,False,True,True,True,True,True


Documentation for [isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) and [notna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html)

We can chain the `sum` method onto `isna` to count the number of missing values in each column.

In [20]:
movies.isna().sum()

color                         19
director name                102
num_critic_for_reviews        49
duration                      15
director_facebook_likes      102
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        862
genres                         0
actor_1_name                   7
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                152
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               300
budget                       484
title_year                   106
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 326
movie_facebook_likes           0
dtype: int64

This calculation works because:

- The sum method for a DataFrame operates on axis=0 by default (and thus produces column sums).
- In order to add boolean values, pandas converts True to 1 and False to 0.
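
Both points can be seen on a tiny frame. This is a minimal sketch with made-up data (the `toy` frame is an assumption for illustration): summing the boolean mask along `axis=0` counts missing values per column, `axis=1` counts them per row, and `mean` gives the fraction missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing values
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [1, 2, 3],
    }
)

mask = toy.isna()               # DataFrame of booleans

per_column = mask.sum()         # axis=0 (default): one count per column
per_row = mask.sum(axis=1)      # axis=1: one count per row
total = int(mask.sum().sum())   # overall number of missing values
fraction = mask.mean()          # True counts as 1, so the mean is the fraction missing

print(per_column)
print(per_row)
print(total)  # 3
print(fraction)
```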

**How to handle missing values** depends on the dataset as well as the nature of your analysis. 
Here are some options:

### 7.1. Dropping rows/columns with missing values

Documentation for [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html). The `inplace` parameter of `dropna` is False by default, so `dropna` returns a new DataFrame and leaves `movies` unchanged unless the result is assigned back.
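
As a quick check of this behavior, the sketch below uses a made-up three-row frame (the `toy` frame and its column names are assumptions for illustration). It shows that calling `dropna` does not change the original frame, and that the result has to be assigned to a name to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has no director, row 2 has no budget
toy = pd.DataFrame(
    {"director": ["Ann", None, "Cy"], "budget": [1.0, 2.0, np.nan]}
)

# dropna returns a new DataFrame; toy itself is left untouched
cleaned = toy.dropna()
original_len = len(toy)
print(original_len, len(cleaned))  # 3 1

# To keep the result, assign it to a name (here a new one, so toy stays available)
kept = toy.dropna(subset=["director"])
print(len(kept))  # 2
```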

**Example 1:** if 'all' values are missing in a row, then drop that row

In [11]:
movies.dropna(axis=0, how='all')

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


**Example 2:** if 'any' values are missing in a row, then drop that row

In [12]:
movies.dropna(axis=0, how='any')

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4899,Color,Olivier Assayas,81.0,110.0,107.0,45.0,Béatrice Dalle,576.0,136007.0,Drama|Music|Romance,...,39.0,French,France,R,4500.0,2004.0,133.0,6.9,2.35,171
4900,Color,Jafar Panahi,64.0,90.0,397.0,0.0,Nargess Mamizadeh,5.0,673780.0,Drama,...,26.0,Persian,Iran,Not Rated,10000.0,2000.0,0.0,7.5,1.85,697
4906,Color,Shane Carruth,143.0,77.0,291.0,8.0,David Sullivan,291.0,424760.0,Drama|Sci-Fi|Thriller,...,371.0,English,USA,PG-13,7000.0,2004.0,45.0,7.0,1.85,19000
4908,Color,Robert Rodriguez,56.0,81.0,0.0,6.0,Peter Marquardt,121.0,2040920.0,Action|Crime|Drama|Romance|Thriller,...,130.0,Spanish,USA,R,7000.0,1992.0,20.0,6.9,1.37,0


**Example 3:** if 'any' values are missing in a row (considering only `director name` and `country`), then drop that row

In [25]:
movies.dropna(axis=0, how='any',subset=['director name', 'country'])

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4910,Color,Edward Burns,14.0,95.0,0.0,133.0,Caitlin FitzGerald,296.0,4584.0,Comedy|Drama,...,14.0,English,USA,Not Rated,9000.0,2011.0,205.0,6.4,,413
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


**Example 4:** if 'all' values are missing in a row (considering only `director name` and `country`), then drop that row

In [26]:
movies.dropna(axis=0, how='all',subset=['director name', 'country'])

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


**Example 5:** if 'any' values are missing in a column, then drop that column

In [13]:
movies.dropna(axis=1, how='any')

Unnamed: 0,genres,movie title,num_voted_users,cast_total_facebook_likes,movie_imdb_link,imdb_score,movie_facebook_likes
0,Action|Adventure|Fantasy|Sci-Fi,Avatar,886204,4834,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,7.9,33000
1,Action|Adventure|Fantasy,Pirates of the Caribbean: At World's End,471220,48350,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,7.1,0
2,Action|Adventure|Thriller,Spectre,275868,11700,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,6.8,85000
3,Action|Thriller,The Dark Knight Rises,1144337,106759,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,8.5,164000
4,Documentary,Star Wars: Episode VII - The Force Awakens,8,143,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,7.1,0
...,...,...,...,...,...,...,...
4911,Comedy|Drama,Signed Sealed Delivered,629,2283,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,7.7,84
4912,Crime|Drama|Mystery|Thriller,The Following,73839,1753,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,7.5,32000
4913,Drama|Horror|Thriller,A Plague So Pleasant,38,0,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,6.3,16
4914,Comedy|Drama|Romance,Shanghai Calling,1255,2386,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,6.3,660


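**Example 6 (advanced):** drop a column only if more than 15% of its values are missing. The boolean Series below is True for the columns to keep (those with less than 15% missing values).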

In [19]:
100*movies.isnull().sum()/movies.shape[0]<15

color                         True
director name                 True
num_critic_for_reviews        True
duration                      True
director_facebook_likes       True
actor_3_facebook_likes        True
actor_2_name                  True
actor_1_facebook_likes        True
gross                        False
genres                        True
actor_1_name                  True
movie title                   True
num_voted_users               True
cast_total_facebook_likes     True
actor_3_name                  True
facenumber_in_poster          True
plot_keywords                 True
movie_imdb_link               True
num_user_for_reviews          True
language                      True
country                       True
content_rating                True
budget                        True
title_year                    True
actor_2_facebook_likes        True
imdb_score                    True
aspect_ratio                  True
movie_facebook_likes          True
dtype: bool

In [27]:
movies = movies.loc[:,100*movies.isnull().sum()/movies.shape[0]<15]
movies

Unnamed: 0,color,director name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,genres,actor_1_name,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,Action|Adventure|Fantasy,Johnny Depp,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,Action|Adventure|Thriller,Christoph Waltz,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,Action|Thriller,Tom Hardy,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,Documentary,Doug Walker,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,Comedy|Drama,Eric Mabius,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,Crime|Drama|Mystery|Thriller,Natalie Zea,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,Drama|Horror|Thriller,Eva Boehnke,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,Comedy|Drama|Romance,Alan Ruck,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660
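
The same filter can be written a little more compactly, because the mean of a boolean column is the fraction of True values. The following sketch applies that idea to made-up data (the `toy` frame and its column names are assumptions for illustration); it is one possible formulation, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows
toy = pd.DataFrame(
    {
        "mostly_full": [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],   # 10% missing
        "sparse": [1, np.nan, np.nan, 4, 5, 6, 7, 8, 9, 10],  # 20% missing
        "full": list(range(10)),                              # 0% missing
    }
)

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Keep only the columns with less than 15% missing values
keep = missing_fraction < 0.15
trimmed = toy.loc[:, keep]

print(missing_fraction)
print(list(trimmed.columns))  # ['mostly_full', 'full']
```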


### 7.2. Filling in missing values

Documentation for [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

In [24]:
movies['director name'].isna().sum()

102

In [25]:
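# fill in missing values with a specified value and assign the result back;
# calling fillna(..., inplace=True) on a selected column may not update movies in recent pandas versions
movies['director name'] = movies['director name'].fillna(value='unknown')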

In [26]:
movies['director name'].isna().sum()

0
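
`fillna` also accepts a dictionary that maps column names to fill values, which allows a different replacement per column. The sketch below uses a made-up frame (the `toy` frame and its column names are assumptions for illustration) and fills a text column with a placeholder and a numeric column with its mean; whether the mean is a sensible replacement depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame(
    {"director": ["Ann", None, "Cy"], "budget": [100.0, np.nan, 300.0]}
)

# One fill value per column; mean() skips missing values by default
filled = toy.fillna(
    value={"director": "unknown", "budget": toy["budget"].mean()}
)

print(filled)
print(int(filled.isna().sum().sum()))  # 0
print(int(toy.isna().sum().sum()))     # 2 (toy is unchanged)
```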