## Summary of my work on Chapter 02 of "Pandas 1.x Cookbook" by Harrison & Petrou
### In this notebook I'll cover:
1. Selecting Multiple DataFrame Columns;
2. Selecting Columns with Methods;
3. Ordering Column Names;
4. Summarizing a DataFrame;
5. Chaining DataFrame Methods;
6. DataFrame operations;
7. Comparing Missing Values;
8. Transposing the direction of a DataFrame operation;
9. Determining College Campus Diversity.


In [12]:
import pandas as pd
import numpy as np

### 1 - Selecting Multiple DataFrame columns:

In [13]:
movies = pd.read_csv("../input/pandas-cookbook-data/data/movie.csv")
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [14]:
cols = [
    "actor_1_name",
    "actor_2_name",
    "actor_3_name",
    "director_name"
]
movies_actor_director = movies[cols]
movies_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


Note the subtle difference between the following two cells:

In [15]:
type(movies[["director_name"]])  # passing a list to the index operator returns a DataFrame

pandas.core.frame.DataFrame

In [16]:
type(movies["director_name"])  # passing a single string to the index operator returns a Series

pandas.core.series.Series
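To make the difference concrete, here is a minimal sketch on a tiny made-up frame (the `df` below is hypothetical toy data, not the movie dataset). It also shows a common slip: forgetting the inner brackets makes pandas look up a tuple as a single column label, which raises a `KeyError` on a frame with ordinary (non-MultiIndex) columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "actor_1_name": ["CCH Pounder", "Christoph Waltz"],
})

as_frame = df[["director_name"]]   # list of labels -> DataFrame (2-D)
as_series = df["director_name"]    # single label   -> Series (1-D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# Without the inner brackets, the two names are treated as one tuple key
tuple_lookup_failed = False
try:
    df["director_name", "actor_1_name"]
except KeyError as err:
    tuple_lookup_failed = True
    print("KeyError:", err)
```

The one-column DataFrame keeps its column axis, so it can be passed to code that expects a table, while the Series is the natural choice for element-wise work on a single column.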

We can also use `.loc[]` to pull out a column by name. We get a DataFrame if we pass a list, and a Series if we pass a string. The `:` (colon) selects all the rows:

In [17]:
print(type(movies.loc[:, ["director_name"]]))
movies.loc[:, ["director_name"]].head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker


In [18]:
print(type(movies.loc[:, "director_name"]))
movies.loc[:, "director_name"].head()

<class 'pandas.core.series.Series'>


0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object
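As a small sketch of how `.loc[]` relates to plain indexing, the cell below uses a hypothetical three-row frame (again toy data, not `movies`). It checks that the `.loc[]` results match the bracket results, and then selects rows and columns in one step. Note that label slices in `.loc[]` include both endpoints.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
    "actor_1_name": ["CCH Pounder", "Johnny Depp", "Christoph Waltz"],
})

frame_loc = df.loc[:, ["director_name"]]   # list   -> DataFrame
series_loc = df.loc[:, "director_name"]    # string -> Series

# Both spellings return the same data as plain bracket indexing
print(frame_loc.equals(df[["director_name"]]))
print(series_loc.equals(df["director_name"]))

# Rows and columns together: labels 0 through 1, inclusive on both ends
subset = df.loc[0:1, ["director_name", "actor_1_name"]]
print(subset.shape)
```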

### 2 - Selecting Columns with Methods:
`.select_dtypes()` and `.filter()`:

First, let's rename several columns at once:

In [22]:
def shorten(col):
    # Shorten verbose column names, e.g. "director_facebook_likes" -> "director_fb"
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

movies = movies.rename(columns=shorten)
movies.head()

Unnamed: 0,color,director_name,num_critic,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_user,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
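When `.rename()` receives a function for `columns`, it calls that function on each column label and uses the return value as the new name. The sketch below illustrates this on an empty frame with three hypothetical column names, and builds the equivalent dictionary mapping for comparison.

```python
import pandas as pd

def shorten(col):
    # Same helper as above, repeated so this cell runs on its own
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

# Hypothetical column names on an empty frame; only the labels matter here
df = pd.DataFrame(columns=["director_facebook_likes", "num_critic_for_reviews", "gross"])

renamed = df.rename(columns=shorten)
print(list(renamed.columns))  # ['director_fb', 'num_critic', 'gross']

# Equivalent explicit mapping: old label -> new label
mapping = {col: shorten(col) for col in df.columns}
renamed_via_dict = df.rename(columns=mapping)
print(list(renamed_via_dict.columns) == list(renamed.columns))
```

Labels that the function returns unchanged (here `gross`) simply keep their name, so the helper is safe to apply to every column.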


Counting how many columns have each data type:

In [25]:
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

#### Now using the `.select_dtypes()` method to select only the columns with the desired type:

In [30]:
movies.select_dtypes(include="int").head()

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0
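`.select_dtypes()` also accepts broader selectors and an `exclude` argument. The following is a minimal sketch on a hypothetical three-column frame (toy data, not `movies`): `"number"` picks up both integer and float columns, and `exclude="number"` keeps everything else.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "score": [7.9, 6.8],   # float column
    "votes": [10, 20],     # integer column
})

numeric = df.select_dtypes(include="number")       # ints and floats
integers = df.select_dtypes(include="int")         # integer columns only
non_numeric = df.select_dtypes(exclude="number")   # everything that is not numeric

print(list(numeric.columns))
print(list(integers.columns))
print(list(non_numeric.columns))
```

Using `exclude="number"` for the text columns avoids depending on which dtype a given pandas version assigns to strings.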
