## Summary of my work on Chapter 02 of "Pandas 1.x Cookbook" by Harrison & Petrou.
### In this notebook I'll cover:
1. Selecting Multiple DataFrame Columns;
2. Selecting Columns with methods;
3. Ordering Column Names;
4. Summarizing a DataFrame;
5. Chaining DataFrame Methods;
6. DataFrame operations;
7. Comparing Missing Values;
8. Transposing the direction of a DataFrame operation;
9. Determining College Campus Diversity.


In [55]:
import pandas as pd
import numpy as np

### 01 - Selecting Multiple DataFrame columns:

In [56]:
movies = pd.read_csv("../input/pandas-cookbook-data/data/movie.csv")
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [57]:
cols = [
    "actor_1_name",
    "actor_2_name",
    "actor_3_name",
    "director_name"
]
movies_actor_director = movies[cols]
movies_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


Look at the subtle difference between the following two cells:

In [58]:
type(movies[["director_name"]]) # if I pass a list to the indexing operator, it returns a DataFrame...

pandas.core.frame.DataFrame

In [59]:
type(movies["director_name"]) # if I pass just a string to the indexing operator, it returns a Series...

pandas.core.series.Series

We can also use `.loc[]` to pull out a column by name. We get a DataFrame if we pass a list, and a Series if we pass a string. The `:` (colon) selects all the rows:

In [60]:
print(type(movies.loc[:, ["director_name"]]))
movies.loc[:, ["director_name"]].head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker


In [61]:
print(type(movies.loc[:, "director_name"]))
movies.loc[:, "director_name"].head()

<class 'pandas.core.series.Series'>


0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object
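
To convince myself that the list-versus-string rule is not specific to the movies data, here is a minimal sketch on a toy DataFrame. The column names `director` and `score` are made up for illustration and are not part of `movie.csv`.

```python
import pandas as pd

# Toy data; these column names are hypothetical, not taken from movie.csv.
df = pd.DataFrame({
    "director": ["A", "B", "C"],
    "score": [7.9, 7.1, 6.8],
})

as_frame = df[["director"]]            # list of labels -> DataFrame
as_series = df["director"]             # single label   -> Series
loc_frame = df.loc[:, ["director"]]    # same rule applies to .loc
loc_series = df.loc[:, "director"]

print(type(as_frame).__name__, as_frame.shape)    # two-dimensional: (3, 1)
print(type(as_series).__name__, as_series.shape)  # one-dimensional: (3,)
```

The shapes make the difference visible: the DataFrame keeps a column axis even with a single column, while the Series has only the row axis.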

### 02 - Selecting Columns with methods:
`.select_dtypes()` and `.filter()`:

First, let's rename a bunch of columns at once:

In [62]:
def shorten(col):
    #print("col type:", type(col), "col:", col)
    return(
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

movies = movies.rename(columns=shorten)
movies.head()

Unnamed: 0,color,director_name,num_critic,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_user,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
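
A small sketch of what passing a function to `.rename(columns=...)` does, assuming a tiny frame that reuses three of the original column names. The function is called once per column label, and `.rename()` returns a new DataFrame instead of changing the original one.

```python
import pandas as pd

# One-row toy frame reusing three column names from movie.csv.
df = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0],
    "num_critic_for_reviews": [723.0],
    "gross": [760505847.0],
})

def shorten(col):
    # Called once for every column label.
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

renamed = df.rename(columns=shorten)

print(list(renamed.columns))  # shortened labels
print(list(df.columns))       # the original frame keeps its labels
```

Labels that contain neither substring (here `gross`) pass through unchanged.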


Counting how many columns there are of each data type:

In [63]:
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

#### Now using the `.select_dtypes()` method to select only the columns with the desired type:

In [64]:
movies.select_dtypes(include="int").head()

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0


In [65]:
movies.select_dtypes(include="number").head()

Unnamed: 0,num_critic,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_user,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


In [66]:
movies.select_dtypes(include=["int", "object"]).head(3)

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000


.. or we can exclude some types:

In [67]:
movies.select_dtypes(exclude="float").head(3)

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
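
Here is a minimal sketch of `include` and `exclude` on a toy frame with one text, one integer and one float column (the column names are my own). I spell the integer type as `"int64"` in the sketch on the assumption that this is less platform dependent than `"int"`.

```python
import pandas as pd

# Toy frame with one column per kind of dtype; names are hypothetical.
df = pd.DataFrame({
    "title": ["a", "b"],
    "votes": pd.Series([10, 20], dtype="int64"),
    "score": pd.Series([7.9, 7.1], dtype="float64"),
})

numeric = df.select_dtypes(include="number")   # integers and floats
ints = df.select_dtypes(include="int64")       # integers only
no_float = df.select_dtypes(exclude="float")   # everything except floats

print(list(numeric.columns))
print(list(ints.columns))
print(list(no_float.columns))
```

The selected columns always keep the order they have in the original DataFrame.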


#### Now using the `.filter()` method:
The `like` parameter is checking for substrings in column names:

In [68]:
movies.filter(like="fb").head(3)

Unnamed: 0,director_fb,actor_3_fb,actor_1_fb,cast_total_fb,actor_2_fb,movie_fb
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000


.. or using the `items` parameter I can pass a list of column names:

In [69]:
# the 'cols' list is defined above...
movies.filter(items=cols).head(3)

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes


.. or using the `regex` parameter, for using regular expressions:

In [70]:
movies.filter(regex=r"\d").head(3) ## selecting the columns that have a digit somewhere in their name.

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
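
The three `.filter()` parameters side by side, as a rough sketch on a toy frame. The label `not_a_column` is a deliberately missing name I added to show that `items` skips names that are not in the DataFrame instead of raising an error.

```python
import pandas as pd

# One-row toy frame; only the column labels matter here.
df = pd.DataFrame({
    "actor_1_fb": [1000.0],
    "actor_1_name": ["CCH Pounder"],
    "movie_fb": [33000],
    "gross": [760505847.0],
})

by_substring = df.filter(like="fb")    # labels containing "fb"
by_items = df.filter(items=["gross", "actor_1_name", "not_a_column"])
by_regex = df.filter(regex=r"\d")      # labels containing a digit

print(list(by_substring.columns))
print(list(by_items.columns))
print(list(by_regex.columns))
```

Note that `.filter()` looks only at the labels, never at the values stored in the columns.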


### 03 - Ordering Column Names:
"[...] The following is a guideline to order columns:
* Classify each column as either categorical or continuous
* Group common columns within the categorical and continuous columns
* Place the most important groups of columns first with categorical columns before continuous ones [...]"

In [71]:
movies.columns

Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

Organizing the names sensibly into lists so that the above guideline is followed:

In [72]:
## cat: categorical
## cont: continuous
cat_core = [ #1
    "movie_title",
    "title_year",
    "content_rating",
    "genres"
]
cat_people = [ #2
    "director_name",
    "actor_1_name",
    "actor_2_name",
    "actor_3_name"
]
cat_other = [ #3
    "color",
    "country",
    "language",
    "plot_keywords",
    "movie_imdb_link"
]
cont_fb = [ #4
    "director_fb",
    "actor_1_fb",
    "actor_2_fb",
    "actor_3_fb",
    "cast_total_fb",
    "movie_fb"
]
cont_finance = [ #5
    "budget",
    "gross"
]
cont_num_reviews = [ #6
    "num_voted_users",
    "num_user",
    "num_critic"
]
cont_other = [ #7
    "imdb_score",
    "duration",
    "aspect_ratio",
    "facenumber_in_poster"
]

In [73]:
new_col_order = (
    cat_core
    + cat_people
    + cat_other
    + cont_fb
    + cont_finance
    + cont_num_reviews
    + cont_other
)

.. checking that the new order contains exactly the same column names as the old one:

In [74]:
set(movies.columns) == set(new_col_order)

True
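
One caveat about the set comparison: sets drop duplicates, so a name listed twice in the new order would still compare as equal. The helper below is my own sketch (the name `check_column_order` is hypothetical and not from the book) and reports missing, extra and duplicated names separately.

```python
def check_column_order(old_cols, new_cols):
    """Compare two collections of column names.

    Returns three sorted lists: names missing from the new order,
    names that are not in the old columns, and names listed more than once.
    """
    old, new = list(old_cols), list(new_cols)
    missing = sorted(set(old) - set(new))
    extra = sorted(set(new) - set(old))
    duplicated = sorted({c for c in new if new.count(c) > 1})
    return missing, extra, duplicated


# Toy example: "b" was forgotten and "a" was typed twice.
missing, extra, duplicated = check_column_order(["a", "b", "c"], ["c", "a", "a"])
print(missing, extra, duplicated)

# The plain set comparison cannot see a duplicate on its own:
print(set(["a", "b"]) == set(["a", "a", "b"]))
```

With the movies data it could be called as `check_column_order(movies.columns, new_col_order)`, and three empty lists would mean the new order is safe to use.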

Now I just need to pass the new column order to the indexing operator of the DataFrame:

In [75]:
movies = movies[new_col_order]
movies.head(3)

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,...,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,...,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,...,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0


### 04 - Summarizing a DataFrame:

In [76]:
movies.shape

(4916, 28)

In [77]:
movies.size

137648

In [78]:
movies.ndim

2

In [88]:
len(movies) # when a DataFrame is passed to the built-in len() function it returns the number of rows...

4916

In [94]:
movies.count().head()

movie_title       4916
title_year        4810
content_rating    4616
genres            4916
director_name     4814
dtype: int64
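
These attributes are related: `size` is rows times columns (above, 4916 × 28 = 137648), `len()` is the number of rows, and `.count()` counts only the non-missing values in each column. A minimal sketch on a toy frame with made-up column names:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in "year"; names are hypothetical.
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "year": [2009.0, np.nan, 2015.0],
})

n_rows, n_cols = df.shape
non_missing = df.count()   # per-column count that ignores missing values

print(df.size == n_rows * n_cols)
print(len(df) == n_rows)
print(non_missing)
```

So a column whose count is lower than `len(df)` must contain missing values.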

In [90]:
movies.select_dtypes(include="number").min() # I could also use .max() .mean() .median() .std()

title_year              1916.00
director_fb                0.00
actor_1_fb                 0.00
actor_2_fb                 0.00
actor_3_fb                 0.00
cast_total_fb              0.00
movie_fb                   0.00
budget                   218.00
gross                    162.00
num_voted_users            5.00
num_user                   1.00
num_critic                 1.00
imdb_score                 1.60
duration                   7.00
aspect_ratio               1.18
facenumber_in_poster       0.00
dtype: float64

In [95]:
movies.select_dtypes(include="number").min(skipna=False).head() ## with skipna=False, any numeric column that has missing values returns NaN; only columns without missing values produce a result.

title_year    NaN
director_fb   NaN
actor_1_fb    NaN
actor_2_fb    NaN
actor_3_fb    NaN
dtype: float64
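
A small sketch of the `skipna` behaviour on a toy frame, where one column is complete and the other has a gap (both column names are my own):

```python
import numpy as np
import pandas as pd

# One complete column and one column with a missing value.
df = pd.DataFrame({
    "complete": [3.0, 1.0, 2.0],
    "with_gap": [5.0, np.nan, 4.0],
})

default_min = df.min()             # skipna=True: missing values are ignored
strict_min = df.min(skipna=False)  # a single NaN makes the column's result NaN

print(default_min)
print(strict_min)
```

In the strict version only `complete` yields a number, which matches the NaN results shown above for the movie columns that contain missing values.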

In [92]:
movies.describe().T.head(3)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
title_year,4810.0,2002.447609,12.453977,1916.0,1999.0,2005.0,2011.0,2016.0
director_fb,4814.0,691.014541,2832.954125,0.0,7.0,48.0,189.75,23000.0
actor_1_fb,4909.0,6494.488491,15106.986884,0.0,607.0,982.0,11000.0,640000.0


In [93]:
movies.describe(percentiles=[0.01, 0.3, 0.99]).T.head(3) ## using the 'percentiles' parameter

Unnamed: 0,count,mean,std,min,1%,30%,50%,99%,max
title_year,4810.0,2002.447609,12.453977,1916.0,1951.0,2000.0,2005.0,2016.0,2016.0
director_fb,4814.0,691.014541,2832.954125,0.0,0.0,11.0,48.0,16000.0,23000.0
actor_1_fb,4909.0,6494.488491,15106.986884,0.0,6.08,694.0,982.0,44920.0,640000.0


### 05 - Chaining DataFrame Methods:

In [96]:
movies.isnull().head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,True,True,False,False,False,False,True,True,True,...,False,True,True,False,True,True,False,True,True,False


In [97]:
movies.isna().head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,True,True,False,False,False,False,True,True,True,...,False,True,True,False,True,True,False,True,True,False


In [100]:
movies.isnull().sum().head(10) # counts the number of missing values in each column

movie_title         0
title_year        106
content_rating    300
genres              0
director_name     102
actor_1_name        7
actor_2_name       13
actor_3_name       23
color              19
country             5
dtype: int64

In [101]:
movies.isnull().sum().sum()

2654

In [103]:
movies.isnull().any().any()

True

In [105]:
movies.isnull().dtypes.value_counts()

bool    28
dtype: int64
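
The chains above work because each step returns a new pandas object: `.isnull()` gives a DataFrame of booleans, the first `.sum()` or `.any()` collapses it to a Series, and the second collapses that Series to a single value. A minimal sketch on toy data (hypothetical column names); the `.mean()` step at the end is my own addition and gives the fraction of missing values per column, since `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Toy frame with three missing values in total.
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "year": [2009.0, np.nan, 2015.0],
    "gross": [np.nan, np.nan, 1.0],
})

missing_per_column = df.isna().sum()    # DataFrame -> Series
total_missing = df.isna().sum().sum()   # Series -> scalar
any_missing = df.isna().any().any()     # is there at least one missing value?
share_missing = df.isna().mean()        # fraction missing per column

print(missing_per_column)
print(total_missing, any_missing)
print(share_missing)
```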

In [109]:
(
    movies
    .select_dtypes(["object"])
    .columns
    .to_list()
)

['movie_title',
 'content_rating',
 'genres',
 'director_name',
 'actor_1_name',
 'actor_2_name',
 'actor_3_name',
 'color',
 'country',
 'language',
 'plot_keywords',
 'movie_imdb_link']

## ... I'm still working on it. The notebook will be finished by tomorrow. I hope you find the above content helpful in some way.