# Chapter 1: Pandas Foundations

## Recipes
* [Dissecting the anatomy of a DataFrame](#Dissecting-the-anatomy-of-a-DataFrame)
* [Accessing the main DataFrame components](#Accessing-the-main-DataFrame-components)
* [Understanding data types](#Understanding-data-types)
* [Selecting a single column of data as a Series](#Selecting-a-single-column-of-data-as-a-Series)
* [Calling Series methods](#Calling-Series-methods)
* [Working with operators on a Series](#Working-with-operators-on-a-Series)
* [Chaining Series methods together](#Chaining-Series-methods-together)
* [Making the index meaningful](#Making-the-index-meaningful)
* [Renaming row and column names](#Renaming-row-and-column-names)
* [Creating and deleting columns](#Creating-and-deleting-columns)

In [1]:
import numpy as np
import pandas as pd

# Dissecting the anatomy of a DataFrame

#### Change options to get specific output for book

In [2]:
# Limit how many columns and rows pandas displays
pd.set_option('display.max_columns', 8, 'display.max_rows', 10)

In [3]:
movie = pd.read_csv("data/movie.csv")

In [4]:
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,...,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,...,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,...,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,...,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,...,23000.0,8.5,2.35,164000
4,,Doug Walker,,,...,12.0,7.1,,0


In [14]:
# Columns run along axis=1 (labels such as 'color' and 'director_name')
# Rows run along axis=0 (labeled by the index)

# Accessing the main DataFrame components

Each of the DataFrame components (the index, the columns, and the data) may be accessed directly from a DataFrame.

Each of these components is itself a Python object with its own unique attributes and methods.

In [16]:
index = movie.index

In [17]:
index

RangeIndex(start=0, stop=4916, step=1)

In [18]:
columns = movie.columns

In [23]:
columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

In [24]:
data = movie.values

In [25]:
data

array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
       ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
       ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
       ...,
       ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
       ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
       ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)

In [26]:
type(index)

pandas.core.indexes.range.RangeIndex

In [27]:
type(columns)

pandas.core.indexes.base.Index

In [28]:
type(data)

numpy.ndarray

Interestingly, the types of the index and the columns appear to be closely related. The built-in issubclass function checks whether RangeIndex is indeed a subclass of Index:


In [32]:
issubclass(pd.RangeIndex,pd.Index)

True
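
The same three components can be pulled out of any DataFrame. The following is a minimal sketch on a hypothetical three-row frame (the name `toy` and its values are made up for illustration and are not part of the movie dataset); it uses `to_numpy`, which returns the same kind of array as the `values` attribute:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the movie dataset
toy = pd.DataFrame({"title": ["A", "B", "C"], "score": [7.9, 7.1, 6.8]})

toy_index = toy.index        # row labels (a RangeIndex by default)
toy_columns = toy.columns    # column labels (an Index)
toy_data = toy.to_numpy()    # underlying values as a NumPy array

print(type(toy_index).__name__)    # RangeIndex
print(type(toy_columns).__name__)  # Index
print(toy_data.shape)              # (3, 2)
```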

# Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and
represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of
possibilities. Categorical data, on the other hand, represents a discrete, finite set of values such as car color,
type of poker hand, or brand of cereal. Pandas does not broadly classify data as either continuous or categorical.



In [38]:
# Use the dtypes attribute to display each column along with its data type
movie.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

In [50]:
#Find out the types of all columns in the dataframe
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      4897 non-null   object 
 1   director_name              4814 non-null   object 
 2   num_critic_for_reviews     4867 non-null   float64
 3   duration                   4901 non-null   float64
 4   director_facebook_likes    4814 non-null   float64
 5   actor_3_facebook_likes     4893 non-null   float64
 6   actor_2_name               4903 non-null   object 
 7   actor_1_facebook_likes     4909 non-null   float64
 8   gross                      4054 non-null   float64
 9   genres                     4916 non-null   object 
 10  actor_1_name               4909 non-null   object 
 11  movie_title                4916 non-null   object 
 12  num_voted_users            4916 non-null   int64  
 13  cast_total_facebook_likes  4916 non-null   int64

A column whose values all share the same type is said to hold homogeneous data. DataFrames as a whole may
contain heterogeneous data, with different data types for different columns.
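
As a small illustration, the sketch below builds a hypothetical frame (the column names and values are invented) in which each column is homogeneous but the frame as a whole is not. It also shows that a missing value in an otherwise integer column leads pandas to store that column as float64 by default:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],        # text column
        "votes": [10, 20, 30],          # integers, no missing values
        "likes": [1000, np.nan, 131],   # a missing value forces float64
    }
)

print(toy.dtypes)
```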

# Selecting a single column of data as a Series

In [53]:
#A Series is a single column of data from a DataFrame. It is a single dimension of data, composed of just an index and the data.

movie['director_name']

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object

In [55]:
# You can also use dot notation to accomplish the same task
movie.director_name

0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel Hsia
4915             Jon Gunn
Name: director_name, Length: 4916, dtype: object
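
Dot notation does not work for every column name. The sketch below uses a hypothetical frame whose column names were chosen to trigger the two common problems: a name that collides with an existing DataFrame method, and a name containing a space. In both cases the indexing operator still works:

```python
import pandas as pd

toy = pd.DataFrame({"count": [3, 5], "first name": ["Ann", "Bob"]})

print(callable(toy.count))          # True: this is the DataFrame.count method
print(toy["count"].tolist())        # [3, 5]: the column itself
print(toy["first name"].tolist())   # names with spaces need the brackets
```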

Python has several built-in objects for containing data, such as lists, tuples, and dictionaries. All three of these 
objects use the indexing operator to select their data. DataFrames are more powerful and complex containers of data,
but they too use the indexing operator as the primary means to select data. Passing a single string to the DataFrame 
indexing operator returns a Series.

The visual output of the Series is less stylized than the DataFrame. It represents a single column of data. Along with the index and values, the output displays the name, length, and data type of the Series. 

Yet another reason to be aware of the dot notation is the proliferation of its use online at the popular question and 
answer site Stack Overflow. Also, notice that the old column name is now the name of the Series and has actually become 
an attribute:


In [58]:
director=movie['director_name']
director.name

'director_name'

In [61]:
# It is possible to turn this Series into a one-column DataFrame with the to_frame method.
# This method will use the Series name as the new column name:

director.to_frame()

Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker
...,...
4911,Scott Smith
4912,
4913,Benjamin Roberds
4914,Daniel Hsia
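
The difference between selecting a Series and selecting a one-column DataFrame can be sketched on a hypothetical two-row frame (names and values invented for illustration). Passing a single string returns a Series, while passing a one-element list returns a DataFrame, which matches what `to_frame` produces:

```python
import pandas as pd

toy = pd.DataFrame({"director": ["X", "Y"], "score": [7.9, 7.1]})

as_series = toy["director"]          # single string -> Series
as_frame = toy[["director"]]         # list with one string -> DataFrame
round_trip = as_series.to_frame()    # Series name becomes the column name

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, list(as_frame.columns))
print(round_trip.equals(as_frame))
```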


# Calling Series methods

Both Series and DataFrames have a tremendous amount of power. We can use the dir function to uncover all the attributes 
and methods of a Series. Additionally, we can find the number of attributes and methods common to both Series and DataFrames.
Both of these objects share the vast majority of attribute and method names:


In [63]:
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)

434

In [65]:
df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)

431

In [66]:
len(s_attr_methods & df_attr_methods)

378

## How to do it...

In [67]:
director=movie['director_name']
actor_1_fb_likes=movie['actor_1_facebook_likes']

In [68]:
# Inspect the head of each series 
director.head()

0        James Cameron
1       Gore Verbinski
2           Sam Mendes
3    Christopher Nolan
4          Doug Walker
Name: director_name, dtype: object

In [70]:
actor_1_fb_likes.head()

0     1000.0
1    40000.0
2    11000.0
3    27000.0
4      131.0
Name: actor_1_facebook_likes, dtype: float64

The data type of the Series usually determines which of the methods will be the most useful. For instance, one of the most
useful methods for the object data type Series is value_counts, which counts all the occurrences of each unique value:


In [71]:
director.value_counts()

Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Ridley Scott        16
                    ..
Menno Meyjes         1
Jaco Booyens         1
Daniel Schechter     1
Fabián Bielinsky     1
Joon-ho Bong         1
Name: director_name, Length: 2397, dtype: int64

The value_counts method is typically more useful for Series with object data types but can occasionally provide insight
into numeric Series as well. Used with actor_1_fb_likes, it appears that higher numbers have been rounded to the nearest 
thousand as it is unlikely that so many movies received exactly 1,000 likes:


In [74]:
actor_1_fb_likes.value_counts()

1000.0     436
11000.0    206
2000.0     189
3000.0     150
12000.0    131
          ... 
362.0        1
216.0        1
859.0        1
225.0        1
334.0        1
Name: actor_1_facebook_likes, Length: 877, dtype: int64

In [78]:
# Counting the number of elements in the Series may be done with the size or shape attribute or the len function:
director.size

4916

In [79]:
director.shape


(4916,)

In [81]:
len(director)

4916

In [82]:
# Additionally, there is the useful but confusing count method that returns the number of non-missing values

In [83]:
director.count()


4814

In [84]:
actor_1_fb_likes.count()

4909

In [85]:
# Basic summary statistics may be yielded with the min, max, mean, median, std, and sum methods

In [91]:
actor_1_fb_likes.min()

0.0

In [92]:
actor_1_fb_likes.max()   

640000.0

In [93]:
actor_1_fb_likes.mean()

6494.488490527602

In [94]:
actor_1_fb_likes.median()  

982.0

In [95]:
actor_1_fb_likes.std()

15106.986883848309

In [96]:
actor_1_fb_likes.sum()

31881444.0

To simplify the previous steps, you may use the describe method to return both the summary statistics and a few of the quantiles at once.
When describe is used with an object data type column, a completely different output is returned:

In [98]:
actor_1_fb_likes.describe()

count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

In [99]:
director.describe()

count                 4814
unique                2397
top       Steven Spielberg
freq                    26
Name: director_name, dtype: object

In [101]:
# The quantile method exists to calculate an exact quantile of numeric data:
actor_1_fb_likes.quantile(.2)

510.0

Since the count method returned a value less than the total number of Series elements, we know that there are missing
values in each Series. The isnull method may be used to determine whether each individual value is missing or not.
The result will be a Series of booleans the same length as the original Series:


In [102]:
director.size

4916

In [103]:
director.count()

4814

In [104]:
director.isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
4911    False
4912     True
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool

In [117]:
# It is possible to replace all missing values within a Series with the fillna method:

actor_1_fb_likes_filled=actor_1_fb_likes.fillna(0)

In [118]:
actor_1_fb_likes_filled.count()

4916

In [119]:
# To remove the Series elements with missing values, use dropna:
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()

In [120]:
actor_1_fb_likes_dropped.size

4909
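
The relationship between `size`, `count`, `isnull`, `fillna`, and `dropna` can be checked on a tiny example. This is a sketch on a hypothetical four-element Series (values invented), not on the movie data:

```python
import numpy as np
import pandas as pd

likes = pd.Series([1000.0, np.nan, 131.0, np.nan], name="likes")

n_total = likes.size                # counts every element
n_present = likes.count()           # skips missing values
n_missing = likes.isnull().sum()    # each True counts as 1

filled = likes.fillna(0)            # same length, no missing values
dropped = likes.dropna()            # shorter, no missing values

print(n_total, n_present, n_missing)   # 4 2 2
print(filled.count(), dropped.size)    # 4 2
```

The number of present values plus the number of missing values always equals the size of the Series.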

The value_counts method is one of the most informative Series methods and is heavily used during exploratory analysis,
especially with categorical columns. It defaults to returning the counts, but by setting the normalize parameter to True,
the relative frequencies are returned instead, which provides another view of the distribution:

In [126]:
director.value_counts(normalize=True)

Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
Ridley Scott        0.003324
                      ...   
Menno Meyjes        0.000208
Jaco Booyens        0.000208
Daniel Schechter    0.000208
Fabián Bielinsky    0.000208
Joon-ho Bong        0.000208
Name: director_name, Length: 2397, dtype: float64
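
A short sketch on a hypothetical five-element Series (values invented) shows how `normalize` relates to the raw counts. It also uses the `dropna` parameter of `value_counts`, which is not used elsewhere in this recipe, to include missing values in the tally:

```python
import numpy as np
import pandas as pd

directors = pd.Series(["X", "Y", "X", np.nan, "X"], name="director")

counts = directors.value_counts()                     # missing values excluded
shares = directors.value_counts(normalize=True)       # fractions of non-missing
with_missing = directors.value_counts(dropna=False)   # missing values counted too

print(counts["X"], shares["X"])   # 3 0.75
print(with_missing.sum())         # 5
```

Note that the relative frequencies are computed over the non-missing values only, so they sum to 1 even when the Series contains missing data.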

In [127]:
# There exists a complement of isnull: the notnull method, which returns True for all the non-missing values:
director.notnull()

0        True
1        True
2        True
3        True
4        True
        ...  
4911     True
4912    False
4913     True
4914     True
4915     True
Name: director_name, Length: 4916, dtype: bool

# Working with operators on a Series

There exist a vast number of operators in Python for manipulating objects. Operators are not objects themselves,
but rather syntactical structures and keywords that force an operation to occur on an object. For instance, when 
the plus operator is placed between two integers, Python will add them together. See more examples of operators in 
the following code:


In [129]:
5 + 9  # plus operator example: adds 5 and 9

14

In [130]:
4 ** 2  # exponentiation operator: raises 4 to the second power

16

In [131]:
a=10

In [132]:
5<=9

True

In [133]:
# Operators can work for any type of object, not just numerical data. These examples show different objects being operated on:

'abcde'+'fg'

'abcdefg'

In [134]:
not(5<=9)

False

In [135]:
7 in [1,2,4]

False

In [140]:
set([1,2,3]) & set([2,3,4])

{2, 3}

In [141]:
imdb_score=movie['imdb_score']

In [142]:
imdb_score

0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

In [143]:
# Use the plus operator to add 1 to each Series element:
imdb_score+1

0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

The other basic arithmetic operators minus (-), multiplication (*), division (/), and exponentiation (**) work similarly
with scalar values. In this step, we will multiply the series by 2.5:


In [144]:
imdb_score*2.5

0       19.75
1       17.75
2       17.00
3       21.25
4       17.75
        ...  
4911    19.25
4912    18.75
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

Python uses two consecutive division operators (//) for floor division and the percent sign (%) for the modulus operator,
which returns the remainder after a division. Series use these the same way:


In [145]:
imdb_score//7

0       1.0
1       1.0
2       0.0
3       1.0
4       1.0
       ... 
4911    1.0
4912    1.0
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64

There exist six comparison operators, greater than (>), less than (<), greater than or equal to (>=), less than or equal to
(<=), equal to (==), and not equal to (!=). Each comparison operator turns each value in the Series to True or False based on 
the outcome of the condition:


In [146]:
imdb_score>7

0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this 
would require a for-loop to iterate through each of the items in the sequence before applying the operation. Pandas 
relies heavily on the NumPy library, which allows for vectorized computations, or the ability to operate on entire
sequences of data without the explicit writing of for loops. Each operation returns a Series with the same index,
but with values that have been modified by the operator.
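
To make the contrast concrete, here is a minimal sketch on a hypothetical three-element Series (values invented) that performs the same comparison twice: once with an explicit Python loop and once as a single vectorized expression:

```python
import pandas as pd

scores = pd.Series([7.9, 7.1, 6.8])

# Plain-Python approach: visit each element explicitly
looped = [value > 7 for value in scores]

# Vectorized approach: one expression, no explicit loop
vectorized = scores > 7

print(looped)                # [True, True, False]
print(vectorized.tolist())   # [True, True, False]
```

Both approaches give the same values, but only the vectorized version returns a Series that keeps the original index.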


## There's more...

All of the operators used in this recipe have method equivalents that produce the exact same result. 
For instance, in step 1, imdb_score + 1 may be reproduced with the add method. Check the following code to 
see the method version of each step in the recipe:


In [147]:
imdb_score.add(1)              # imdb_score + 1 
imdb_score.mul(2.5)            # imdb_score * 2.5 
imdb_score.floordiv(7)         # imdb_score // 7  
imdb_score.gt(7)               # imdb_score > 7 
director.eq('James Cameron')   # director == 'James Cameron' 

0        True
1       False
2       False
3       False
4       False
        ...  
4911    False
4912    False
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool
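
One practical reason to know the method equivalents is that they accept extra parameters that the operators cannot. The sketch below uses two hypothetical Series (values invented) and the `fill_value` parameter of `add`, which substitutes a value for a missing operand before the addition takes place:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

same = (a + 1).equals(a.add(1))     # operator and method agree
plain = a + b                       # NaN wherever either side is missing
filled = a.add(b, fill_value=0)     # a missing operand is treated as 0

print(same)              # True
print(plain.tolist())    # [11.0, nan, nan]
print(filled.tolist())   # [11.0, 20.0, 3.0]
```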

# Chaining Series methods together

In Python, every variable is an object, and all objects have attributes and methods that refer to or return more objects. 
The sequential invocation of methods using the dot notation is referred to as method chaining. Pandas is a library that 
lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which
more methods can be called. 

In [149]:
# One of the most common methods to append to the chain is the head method, which suppresses long output.
# For shorter chains, there isn't as great a need to place each method on a different line:

director.value_counts().head()

Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Ridley Scott        16
Name: director_name, dtype: int64

In [150]:
# A common way to count the number of missing values is to chain the sum method after isnull:

actor_1_fb_likes.isnull().sum()

7

All the non-missing values of actor_1_fb_likes should be integers, as it is impossible to have a partial Facebook like.
By default, pandas stores a numeric column that contains missing values as float, because NaN is itself a float. If we fill
the missing values in actor_1_fb_likes with zeros, we can then convert it to an integer type with the astype method:

In [152]:
actor_1_fb_likes.fillna(0)\
.astype(int)\
.head()

0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int32
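
Longer chains are often easier to read when the whole expression is wrapped in parentheses instead of ending each line with a backslash. The following is a sketch of that style on a hypothetical three-element Series (values invented); the behavior is the same as the backslash version:

```python
import numpy as np
import pandas as pd

likes = pd.Series([1000.0, np.nan, 131.0], name="likes")

result = (
    likes
    .fillna(0)       # replace missing values
    .astype(int)     # now safe to convert to integers
    .head(2)         # keep the output short
)

print(result.tolist())   # [1000, 0]
```

The parentheses also make it possible to comment on, or temporarily comment out, individual steps of the chain.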

# Making the index meaningful

The index of a DataFrame provides a label for each of the rows. If no index is explicitly provided upon DataFrame creation,
then by default, a RangeIndex is created with labels as integers from 0 to n-1, where n is the number of rows.

In [158]:
movie2=movie.set_index('movie_title')

In [157]:
movie2.head(3)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


In [159]:
# Alternatively, it is possible to choose a column as the index upon initial read with the index_col parameter of the read_csv function:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')

In [160]:
movie.head(3)

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000


In [None]:
# reset_index moves the index back into a regular column and returns a new DataFrame
movie.reset_index()
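
The round trip between `set_index` and `reset_index` can be sketched on a hypothetical two-row frame (names and values invented). Neither method modifies the original DataFrame; each returns a new one:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "score": [7.9, 7.1]})

by_title = toy.set_index("title")     # returns a new DataFrame
restored = by_title.reset_index()     # moves the index back to a column

print(by_title.loc["B", "score"])     # label-based lookup by title
print(restored.columns.tolist())      # ['title', 'score']
print(toy.columns.tolist())           # original is unchanged
```

With a meaningful index in place, rows can be looked up by label, as the `loc` call shows.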

# Renaming row and column names

One of the most basic and common operations on a DataFrame is to rename the row or column names. Good column names are 
descriptive, brief, and follow a common convention with respect to capitalization, spaces, underscores, and other features.


In [163]:
# Read in the movie dataset with the movie title as the index:
movie = pd.read_csv('data/movie.csv', index_col='movie_title')

In [168]:
# The rename DataFrame method accepts dictionaries that map the old value to the new value. Let's create one for the rows and another for the columns:

In [166]:
idx_rename={'Avatar':'Ratava', 'Spectre': 'Ertceps'} 
col_rename = {'director_name':'Director Name', 
              'num_critic_for_reviews': 'Critical Reviews'} 

Pass the dictionaries to the rename method, and display the head of the result:


In [170]:
movie.rename(index=idx_rename, 
             columns=col_rename).head()

movie_title,color,Director Name,Critical Reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Ratava,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Ertceps,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


## There's more...

There are multiple ways to rename row and column labels. It is possible to reassign the index and columns attributes directly to
a Python list. This assignment works when the list has the same number of elements as the row and column labels. The following
code uses the tolist method on each Index object to create a Python list of labels. It then modifies a couple of values in the
list and reassigns the list to the attributes index and columns:
  




In [172]:
index = movie.index
columns = movie.columns

In [173]:
index_list = index.tolist()
column_list = columns.tolist()

In [174]:
# rename the row and column labels with list assignments 
index_list[0] = 'Ratava' 
index_list[2] = 'Ertceps' 
column_list[1] = 'Director Name' 
column_list[2] = 'Critical Reviews'


In [175]:
print(index_list)

['Ratava', "Pirates of the Caribbean: At World's End", 'Ertceps', 'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz the Great and Powerful', 'The Amazing Spider-

In [176]:
print(column_list)

['color', 'Director Name', 'Critical Reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']


In [177]:
# finally reassign the index and columns

In [178]:
movie.index = index_list

In [179]:
movie.columns = column_list
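
The two renaming approaches differ in one important way, which the sketch below illustrates on a hypothetical two-row frame (names and values invented): `rename` returns a new object and, by default, ignores dictionary keys that match no label, while assigning a list to `index` or `columns` changes the DataFrame in place and requires a list of exactly the right length:

```python
import pandas as pd

toy = pd.DataFrame({"director_name": ["X", "Y"]}, index=["Avatar", "Spectre"])

# rename returns a new object and ignores keys it cannot find
renamed = toy.rename(
    index={"Avatar": "Ratava", "Missing": "ignored"},
    columns={"director_name": "Director Name"},
)

# direct assignment changes the object in place and needs a full-length list
labels = toy.index.tolist()
labels[1] = "Ertceps"
toy.index = labels

print(renamed.index.tolist(), renamed.columns.tolist())
print(toy.index.tolist(), toy.columns.tolist())
```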

# Creating and deleting columns

During a data analysis, it is extremely likely that you will need to create new columns to represent new variables. 
Commonly, these new columns will be created from previous columns already in the dataset. Pandas has a few different ways 
to add new columns to a DataFrame.

The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into 
the indexing operator. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the 
movie. We will assign zero for every value. By default, new columns are appended to the end:


In [286]:
movie = pd.read_csv('data/movie.csv')

In [287]:
movie['has_seen']=0
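
If the new column should not go at the end, the `insert` method of a DataFrame places it at a chosen integer position. The sketch below uses a hypothetical two-row frame and an invented `rank` column; unlike most DataFrame methods, `insert` modifies the DataFrame in place:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "score": [7.9, 7.1]})

toy["has_seen"] = 0                 # appended as the last column
toy.insert(0, "rank", [1, 2])       # placed at position 0, in place

print(toy.columns.tolist())   # ['rank', 'title', 'score', 'has_seen']
```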

There are several columns that contain data on the number of Facebook likes. Let's add up all the actor and director
Facebook likes and assign them to the actor_director_facebook_likes column:


In [288]:
movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] +
                                          movie['actor_2_facebook_likes'] +
                                          movie['actor_3_facebook_likes'] +
                                          movie['director_facebook_likes'])

From the Calling Series methods recipe in this chapter, we know that this dataset contains missing values. When numeric
columns are added to one another with the plus operator, as in the preceding step, a missing value in any one of the
columns makes the total for that row missing as well. Let's check if there are missing values in
our new column and fill them with 0:


In [289]:
movie['actor_director_facebook_likes'].isnull().sum()

122

In [290]:
movie['actor_director_facebook_likes'] = movie['actor_director_facebook_likes'].fillna(0)
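
How missing values flow through an addition is easy to check on a small example. The sketch below uses a hypothetical three-row frame (column names and values invented) and compares the plus operator with the `sum` method applied across columns, which skips missing values by default:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "actor_1": [10.0, np.nan, np.nan],
        "actor_2": [5.0, 7.0, np.nan],
    }
)

plus_total = toy["actor_1"] + toy["actor_2"]   # NaN if any operand is NaN
row_total = toy.sum(axis=1)                    # skips NaN by default

print(plus_total.tolist())   # [15.0, nan, nan]
print(row_total.tolist())    # [15.0, 7.0, 0.0]
```

With the plus operator, a single missing operand is enough to make the row total missing, which is why a `fillna` step follows the addition.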

There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of 
this column comes from our newly created column, actor_director_facebook_likes. Before we create our percentage column, let's
do some basic data validation. Let's ensure that cast_total_facebook_likes is greater than or equal to
actor_director_facebook_likes:

In [291]:
movie['is_cast_likes_more'] = (movie['cast_total_facebook_likes'] >=
                               movie['actor_director_facebook_likes'])

In [292]:
# is_cast_likes_more is now a column of boolean values. We can check whether all the values of this column are True with
# the all Series method:

In [293]:
movie['is_cast_likes_more'].all()

False

It turns out that there is at least one movie with more actor_director_facebook_likes than cast_total_facebook_likes.
It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete column 
actor_director_facebook_likes:


In [294]:
movie = movie.drop('actor_director_facebook_likes',axis='columns') 

In [295]:
# Let's recreate a column of just the total actor likes

In [296]:
movie['actor_total_facebook_likes'] = (movie['actor_1_facebook_likes'] +
                                       movie['actor_2_facebook_likes'] +
                                       movie['actor_3_facebook_likes'])

movie['actor_total_facebook_likes'] = movie['actor_total_facebook_likes'].fillna(0)

In [297]:
# Check again whether all the values in cast_total_facebook_likes are greater than or equal to those in actor_total_facebook_likes:


In [298]:
movie['is_cast_likes_more'] = (movie['cast_total_facebook_likes'] >=
                               movie['actor_total_facebook_likes'])
movie['is_cast_likes_more'].all()

True

In [299]:
# Finally, let's calculate the percentage of the cast_total_facebook_likes that come from actor_total_facebook_likes:


In [300]:
movie['pct_actor_cast_like'] = (movie['actor_total_facebook_likes'] /
                                movie['cast_total_facebook_likes'])

In [301]:
# Let's validate that the min and max of this column fall between 0 and 1:

In [302]:
movie['pct_actor_cast_like'].min(), movie['pct_actor_cast_like'].max()

(0.0, 1.0)
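
One caveat with this check is that min and max skip missing values, so a ratio that could not be computed goes unnoticed. The sketch below, on a hypothetical three-row frame (column names and values invented), divides zero by zero in one row and then applies a stricter check with the `between` method, which treats a missing value as failing the test:

```python
import pandas as pd

toy = pd.DataFrame({"actor_total": [50, 0, 30], "cast_total": [100, 0, 30]})

toy["pct"] = toy["actor_total"] / toy["cast_total"]   # 0 / 0 gives NaN

print(toy["pct"].tolist())                  # [0.5, nan, 1.0]
print(toy["pct"].min(), toy["pct"].max())   # 0.5 1.0
print(toy["pct"].between(0, 1).all())       # False, because of the NaN row
```

Whether a missing ratio should count as a validation failure depends on the analysis; the point here is only that min and max alone will not reveal it.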

We can then output this column as a Series. First, we need to set the index to the movie title so we can properly identify 
each value.


In [303]:
movie.set_index('movie_title')['pct_actor_cast_like'].head()

movie_title
Avatar                                        0.577369
Pirates of the Caribbean: At World's End      0.951396
Spectre                                       0.987521
The Dark Knight Rises                         0.683783
Star Wars: Episode VII - The Force Awakens    0.000000
Name: pct_actor_cast_like, dtype: float64