# Data Wrangling and Transformation with Pandas

Working with tabular data is unavoidable in most enterprises, where the majority of data lives in relational databases and flat files. This mini-project is adapted from the excellent pandas tutorial by Brandon Rhodes. In this lab, we will explore movie data from IMDb.


### Please make sure you have one of the more recent versions of Pandas

In [None]:
#!pip install --upgrade pip

In [None]:
!pip install py4j==0.10.7

In [None]:
#!pip install pandas==0.23

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
pd.__version__

## Taking a look at the Movies dataset
This data lists movies by title and year of release.

In [None]:
#movies = pd.read_csv('../input/md-imdb/titles.csv', compression='bz2')

movies = pd.read_csv('../input/md-imdb/titles.csv')
movies.info()

In [None]:
movies.head()

## Taking a look at the Cast dataset

This data shows the cast (actors, actresses, supporting roles) for each movie.

- The attribute `n` gives the rank of the cast role: the lower the number, the more important the role.
- Supporting cast usually have no value for `n`.

In [None]:
cast = pd.read_csv('../input/md-imdb/cast.csv')
cast.info()

In [None]:
cast.head(10)

## Taking a look at the Release dataset

This data shows the date on which each movie was released in each country.

In [None]:
release_dates = pd.read_csv('../input/md-imdb/release_dates.csv', parse_dates=['date'], infer_datetime_format=True)
release_dates.info()

In [None]:
release_dates.head()

# Section I - Basic Querying, Filtering and Transformations

### What is the total number of movies?

In [None]:
len(movies)

### List all Batman movies ever made

In [None]:
batman_df = movies[movies.title == 'Batman']
print('Total Batman Movies:', len(batman_df))
batman_df

### List all Batman movies ever made - the right approach

In [None]:
batman_df = movies[movies.title.str.contains('Batman', case=False)]
print('Total Batman Movies:', len(batman_df))
batman_df.head(10)
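
The difference between the two approaches above can be seen on a small table. This is a minimal sketch: `toy_movies` and its rows are made up for illustration and are not part of the IMDb data. It also passes `na=False`, an optional argument that treats missing titles as non-matches.

```python
import pandas as pd

# Toy stand-in for the real `movies` table (hypothetical rows).
toy_movies = pd.DataFrame({
    'title': ['Batman', 'Batman Returns', 'BATMAN BEGINS', 'Superman', None],
    'year': [1989, 1992, 2005, 1978, 2001],
})

# Exact equality only matches rows whose title is exactly "Batman".
exact = toy_movies[toy_movies.title == 'Batman']

# Substring match that ignores case; na=False counts missing titles as non-matches.
partial = toy_movies[toy_movies.title.str.contains('Batman', case=False, na=False)]

print(len(exact), len(partial))
```

Note that `str.contains` interprets the pattern as a regular expression by default, so titles containing characters such as `(` or `.` may need `regex=False`.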

### Display the top 15 Batman movies in the order they were released

In [None]:
batman_df.sort_values(by=['year'], ascending=True).iloc[:15]

### Section I - Q1 : List all the 'Harry Potter' movies from the most recent to the earliest

In [None]:
harry_df = movies[movies.title.str.contains('Harry Potter', case=False)]
print('Total Harry Potter Movies:', len(harry_df))
harry_df.head(10)

In [None]:
harry_df.sort_values(by=['year'], ascending=False)

### How many movies were made in the year 2017?

In [None]:
len(movies[movies.year == 2017])

### Section I - Q2 : How many movies were made in the year 2015?

In [None]:
len(movies[movies.year == 2015])

### Section I - Q3 : How many movies were made from 2000 till 2018?
- You can chain multiple conditions using OR (`|`) as well as AND (`&`) depending on the condition

In [None]:
#(movies[(movies.year >= 2000)&(movies.year <= 2018)])
len(movies[(movies.year >= 2000) & (movies.year <= 2018)])
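
As a small illustration of chaining conditions, the sketch below uses a made-up `toy_movies` table (an assumption, not the real dataset). It shows `&`, `|`, and `Series.between` as an alternative for inclusive ranges.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'year': [1995, 2000, 2010, 2019],
})

# Each comparison needs its own parentheses because & and | bind
# more tightly than >= and <= in Python.
in_range = toy_movies[(toy_movies.year >= 2000) & (toy_movies.year <= 2018)]

# between() is inclusive on both ends by default, so it selects the same rows.
in_range_between = toy_movies[toy_movies.year.between(2000, 2018)]

# OR keeps rows that satisfy either condition.
outside = toy_movies[(toy_movies.year < 2000) | (toy_movies.year > 2018)]
```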

### Section I - Q4: How many movies are titled "Hamlet"?

In [None]:
hamlet_df = movies[movies.title=='Hamlet']
len(hamlet_df)

### Section I - Q5: List all movies titled "Hamlet" 
- The movies should only have been released on or after the year 2000
- Display the movies based on the year they were released (earliest to most recent)

In [None]:
hamlet_df[hamlet_df.year >= 2000].sort_values(by=['year'], ascending=True)

### Section I - Q6: How many roles in the movie "Inception" are of the supporting cast (extra credits)
- supporting cast are NOT ranked by an "n" value (NaN)
- check for how to filter based on nulls

In [None]:
sup_cast = cast[(cast.title == 'Inception') & (pd.isnull(cast.n))]
len(sup_cast)
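
A short sketch of null filtering, using a hypothetical `toy_cast` table rather than the real data. It shows why comparing against `NaN` with `==` does not work, and uses the `isna()` and `notna()` methods, which behave like `pd.isnull` and `pd.notnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical cast rows: two ranked roles and two unranked (NaN) roles.
toy_cast = pd.DataFrame({
    'title': ['Inception'] * 4,
    'name': ['Actor A', 'Actor B', 'Actor C', 'Actor D'],
    'n': [1.0, 2.0, np.nan, np.nan],
})

# NaN never compares equal to anything, including itself,
# so this mask is False everywhere and selects no rows.
wrong = toy_cast[toy_cast.n == np.nan]

# isna() / notna() test for missing values directly.
supporting = toy_cast[toy_cast.n.isna()]
ranked = toy_cast[toy_cast.n.notna()]
```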

### Section I - Q7: How many roles in the movie "Inception" are of the main cast
- main cast always have an 'n' value

In [None]:
main_cast = cast[(cast.title == 'Inception') & ~(pd.isnull(cast.n))]
len(main_cast)

### Section I - Q8: Show the top ten cast (actors/actresses) in the movie "Inception" 
- main cast always have an 'n' value
- remember to sort!

In [None]:
top_cast = (cast[(cast.title == 'Inception') & ~(pd.isnull(cast.n))].sort_values(by='n', ascending=True)).iloc[:10]
top_cast

### Section I - Q9:

(A) List all movies where there was a character 'Albus Dumbledore' 

(B) Now modify the above to show only the actors who played the character 'Albus Dumbledore'
- For Part (B) remember the same actor might play the same role in multiple movies

In [None]:
cast[cast.character == 'Albus Dumbledore']

In [None]:
cast[cast.character == 'Albus Dumbledore'][['name']].drop_duplicates()

### Section I - Q10:

(A) How many roles has 'Keanu Reeves' played throughout his career?

(B) List the leading roles that 'Keanu Reeves' played on or after 1999 in order by year.

In [None]:
len(cast[cast.name=='Keanu Reeves'])


In [None]:
cast[(cast.name == 'Keanu Reeves') & (cast.n == 1) & (cast.year >= 1999)].sort_values(by='year', ascending=True)


### Section I - Q11: 

(A) List the total number of actor and actress roles available from 1950 - 1960

(B) List the total number of actor and actress roles available from 2007 - 2017

In [None]:
cast.head()

In [None]:
(cast[cast.year.between(1950, 1960)][['type', 'name']]
.groupby('type')
.count()
.reset_index()
.rename({'name': 'freq'}, axis=1))

In [None]:
(cast[cast.year.between(2007, 2017)][['type', 'name']]
.groupby('type')
.count()
.reset_index()
.rename({'name': 'freq'}, axis=1))
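
The same kind of per-category tally can also be written more compactly. The sketch below runs on a made-up `toy_cast` table (an assumption for illustration) and compares `value_counts()` with `groupby(...).size()`; both count rows, whereas `count()` counts only non-null values in each column.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy_cast = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'type': ['actor', 'actress', 'actor', 'actor', 'actress'],
    'year': [1950, 1955, 1960, 1961, 1965],
})

in_window = toy_cast[toy_cast.year.between(1950, 1960)]

# value_counts tallies rows per category, most frequent first.
by_type = in_window['type'].value_counts()

# groupby(...).size() gives the same counts, indexed by sorted group key.
by_type_size = in_window.groupby('type').size()
```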

### Section I - Q12: 

(A) List the total number of leading roles available from 2000 to present

(B) List the total number of non-leading roles available from 2000 - present (exclude support cast)

(C) List the total number of support/extra-credit roles available from 2000 - present

In [None]:
len(cast[(cast.year >= 2000) & (cast.n == 1)])

In [None]:
len(cast[(cast.year >= 2000) & (cast.n != 1) & ~(pd.isnull(cast.n))])

In [None]:
len(cast[(cast.year >= 2000) & (pd.isnull(cast.n))])

# Section II - Aggregations, Transformations and Visualizations

## What are the top ten most common movie names of all time?


In [None]:
top_ten = movies.title.value_counts()[:10]
top_ten

### Plot the top ten common movie names of all time

In [None]:
top_ten.plot(kind='barh')

### Section II - Q1:  Which years in the 2000s saw the most movies released? (Show top 3)

In [None]:
movies[movies.year // 10 == 200]['year'].value_counts()[:3]

### Section II - Q2: Plot the total number of films released per-decade (1890, 1900, 1910,....)
- Hint: Dividing the year and multiplying with a number might give you the decade the year falls into!
- You might need to sort before plotting

In [None]:
(movies.year // 10 * 10).value_counts().sort_index().plot(kind='bar')
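
The decade arithmetic from the hint can be checked on a handful of years. This is a minimal sketch; the `years` values are made up for illustration.

```python
import pandas as pd

# Hypothetical release years.
years = pd.Series([1894, 1900, 1909, 1999, 2000, 2017])

# Floor division by 10 drops the last digit; multiplying by 10 restores the scale,
# so 1894 -> 189 -> 1890.
decades = years // 10 * 10

# value_counts orders by frequency, so sort_index puts the decades
# back in chronological order before plotting.
per_decade = decades.value_counts().sort_index()
```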

### Section II - Q3: 

(A) What are the top 10 most common character names in movie history?

(B) Who are the top 10 people most often credited as "Herself" in movie history?

(C) Who are the top 10 people most often credited as "Himself" in movie history?

In [None]:
cast.character.value_counts()[:10]

In [None]:
cast[cast.character == 'Herself']['name'].value_counts()[:10]

In [None]:
cast[cast.character == 'Himself']['name'].value_counts()[:10]

### Section II - Q4: 

(A) What are the top 10 most frequent roles that start with the word "Zombie"?

(B) What are the top 10 most frequent roles that start with the word "Police"?

- Hint: The `startswith()` function might be useful

In [None]:
cast[cast.character.str.startswith('Zombie')].character.value_counts().head(10)

In [None]:
cast[cast.character.str.startswith('Police')].character.value_counts().head(10)
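
If the `character` column contains missing values, the mask returned by `startswith()` contains missing values too, and boolean indexing with it can fail. A hedged sketch of one way to guard against that, using a made-up `characters` series and the `na=False` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical character names, including one missing value.
characters = pd.Series(['Zombie', 'Zombie Horde', 'Police Officer', 'Ex-Zombie', np.nan])

# na=False counts missing values as non-matches, so the mask is purely boolean.
# 'Ex-Zombie' does not match because the prefix must be at the start.
mask = characters.str.startswith('Zombie', na=False)

zombie_roles = characters[mask].value_counts().head(10)
```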

### Section II - Q5: Plot how many roles 'Keanu Reeves' has played in each year of his career.

In [None]:
cast[cast.name=='Keanu Reeves'].year.value_counts().sort_index().plot(kind='barh')

### Section II - Q6: Plot the cast positions (n-values) of Keanu Reeves's roles through his career over the years.


In [None]:
keanu = cast[(cast.name == 'Keanu Reeves') & (pd.notnull(cast.n))][['year', 'n']].sort_values('year')
keanu.plot(x='year', y='n', kind='scatter')

### Section II - Q7: Plot the number of "Hamlet" films made by each decade

In [None]:
hamlet = (movies[movies.title == 'Hamlet']
          .groupby(movies.year // 10 * 10)
          .count()
          .rename({'title': 'count'}, axis=1))['count']
hamlet.plot(kind='bar')

### Section II - Q8: 

(A) How many leading roles were available to both actors and actresses, in the 1960s (1960-1969)?

(B) How many leading roles were available to both actors and actresses, in the 2000s (2000-2009)?

- Hint: A specific value of n might indicate a leading role

In [None]:
(cast[(cast.year.between(1960, 1969)) & (cast.n == 1)]
.groupby(['year', 'type'])
.count()[['title']]
.rename({'title': 'count'}, axis=1))

In [None]:
(cast[(cast.year.between(2000, 2009)) & (cast.n == 1)]
.groupby(['year', 'type'])
.count()[['title']]
.rename({'title': 'count'}, axis=1))

### Section II - Q9: List, in order by year, each of the films in which Frank Oz has played more than 1 role.

In [None]:
frank = (cast[cast.name == 'Frank Oz']
         .groupby(['year', 'title'])
         .count()[['name']]
         .rename({'name': 'freq'}, axis=1)
         .sort_values(by=['year'], ascending=True))
frank[frank.freq > 1]

### Section II - Q10: List each of the characters that Frank Oz has portrayed at least twice

In [None]:
frank = (cast[cast.name == 'Frank Oz']
         .groupby(['character'])
         .count()[['name']]
         .rename({'name': 'freq'}, axis=1)
         .sort_values(by=['freq'], ascending=False))
frank[frank.freq > 1]

# Section III - Advanced Merging, Querying and Visualizations

## Make a bar plot with the following conditions
- Count the movies with "Christmas" in their title
- Include only releases in the USA
- Show the counts by month

In [None]:
christmas = release_dates[(release_dates.title.str.contains('Christmas')) & (release_dates.country == 'USA')]
christmas.date.dt.month.value_counts().sort_index().plot(kind='bar')
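
The `.dt` accessor used above exposes the components of a datetime column. The sketch below runs on made-up dates (not the real release data) and also shows one optional way, via `reindex`, to keep months with zero releases on the axis, which `sort_index()` alone does not do.

```python
import pandas as pd

# Hypothetical release dates.
dates = pd.Series(pd.to_datetime(['2016-12-25', '2017-07-04', '2018-12-07']))

months = dates.dt.month        # 1 = January ... 12 = December
weekdays = dates.dt.dayofweek  # 0 = Monday ... 6 = Sunday

# Count per month, then reindex over all twelve months so that
# months without any release appear with a count of 0.
monthly = months.value_counts().reindex(range(1, 13), fill_value=0)
```

Calling `monthly.plot(kind='bar')` would then draw one bar per calendar month.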

### Section III - Q1: Make a bar plot with the following conditions
- Count the movies with "Summer" in their title
- Include only releases in the USA
- Show the counts by month

In [None]:
summer = release_dates[(release_dates.title.str.contains('Summer')) & (release_dates.country == 'USA')]
summer.date.dt.month.value_counts().sort_index().plot(kind='bar')

### Section III - Q2: Make a bar plot with the following conditions
- Count the movies with "Action" in their title
- Include only releases in the USA
- Show the counts by week

In [None]:
action = release_dates[(release_dates.title.str.contains('Action')) & (release_dates.country == 'USA')]
# ISO week of the year (1-53); dt.isocalendar() requires pandas 1.1 or later
action.date.dt.isocalendar().week.value_counts().sort_index().plot(kind='bar')

### Section III - Q3: Show all the movies in which Keanu Reeves has played the lead role, along with their release date in the USA, sorted by the date of release
- Hint: You might need to join or merge two datasets!

In [None]:
us = release_dates[release_dates.country == 'USA']
keanu = cast[(cast.name == 'Keanu Reeves') & (cast.n == 1)]
(keanu.merge(us, how='inner', on=['title', 'year'])
      .sort_values('date')) 
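
To see what the merge does, here is a minimal sketch on two made-up tables (`toy_cast` and `toy_releases` are hypothetical and only mirror the column names of the real data).

```python
import pandas as pd

# Hypothetical cast and release rows.
toy_cast = pd.DataFrame({
    'title': ['Film X', 'Film Y', 'Film Z'],
    'year': [1999, 2003, 2008],
    'name': ['Lead Actor'] * 3,
})
toy_releases = pd.DataFrame({
    'title': ['Film X', 'Film X', 'Film Y'],
    'year': [1999, 1999, 2003],
    'country': ['USA', 'UK', 'USA'],
    'date': pd.to_datetime(['1999-03-31', '1999-06-11', '2003-05-15']),
})

us = toy_releases[toy_releases.country == 'USA']

# Joining on both title and year keeps remakes that share a title apart.
# An inner join drops 'Film Z', which has no USA release row.
merged = toy_cast.merge(us, how='inner', on=['title', 'year']).sort_values('date')
```

Filtering to one country before merging keeps the result at one row per film here; merging against all countries would repeat each cast row once per release.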

### Section III - Q4: Make a bar plot showing the months in which movies with Keanu Reeves tend to be released in the USA

In [None]:
us = release_dates[release_dates.country == 'USA']
keanu = cast[(cast.name == 'Keanu Reeves')]
keanu = (keanu.merge(us, how='inner', on=['title', 'year'])
              .sort_values('date'))
keanu.date.dt.month.value_counts().sort_index().plot(kind='bar')

### Section III - Q5: Make a bar plot showing the years in which movies with Ian McKellen tend to be released in the USA

In [None]:
us = release_dates[release_dates.country == 'USA']
ian = cast[(cast.name == 'Ian McKellen')]
ian = (ian.merge(us, how='inner', on=['title', 'year'])
          .sort_values('date'))
ian.date.dt.year.value_counts().sort_index().plot(kind='bar')