# Project 4: Merging & Cleaning & Transforming Data (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 4 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways: even if you reach the same conclusions, it's very unlikely that your code will match ours exactly.

## Introduction / Getting the Datasets

In [85]:
import pandas as pd
import numpy as np
import ast

1. __Load__ and __inspect__ the datasets "movies_clean.csv" and "credits.csv". __Identify__ stringified/nested __json columns__ in the __credits__ dataset.

In [86]:
movies = pd.read_csv("movies_clean.csv")
credit = pd.read_csv("credits.csv")
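
One way to spot stringified columns is to test whether a cell is a string that parses into a list or dict. The sketch below is only an illustration: `credits_demo` is a made-up stand-in for credits.csv (assumed here to have the columns cast, crew and id), and `looks_stringified` is a hypothetical helper, not part of pandas.

```python
import ast
import pandas as pd

# Made-up stand-in for credits.csv (assumed columns: cast, crew, id).
credits_demo = pd.DataFrame({
    "cast": ["[{'name': 'Tom Hanks'}]", "[]"],
    "crew": ["[{'job': 'Director', 'name': 'John Lasseter'}]", "[]"],
    "id": [862, 8844],
})

def looks_stringified(value):
    """Return True if value is a str that parses into a list or dict."""
    if not isinstance(value, str):
        return False
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return False
    return isinstance(parsed, (list, dict))

# Only the first row of each column is checked, which is a heuristic.
nested_cols = [col for col in credits_demo.columns
               if looks_stringified(credits_demo[col].iloc[0])]
print(nested_cols)
```

On the real data, `credit.head()` and `credit.info()` give the same hint visually: nested columns have dtype object and cells that start with `[{`.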

## Preparing the Data for Merge

2. __Drop Duplicates__ in the credits dataset. (similar to Project 3)

In [87]:
credit.drop_duplicates(inplace = True)
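
After dropping duplicates it may be worth confirming that each movie id now appears only once, since `drop_duplicates()` without arguments removes only rows that are identical in every column. A minimal sketch on made-up data (`credits_demo` is a hypothetical stand-in for the credits dataset):

```python
import pandas as pd

# Made-up stand-in: id 862 appears twice with identical content.
credits_demo = pd.DataFrame({
    "cast": ["[]", "[]", "[]"],
    "crew": ["[]", "[]", "[]"],
    "id": [862, 862, 8844],
})

deduped = credits_demo.drop_duplicates()

# Row counts before and after, then a uniqueness check on the key column.
print(len(credits_demo), "->", len(deduped))
print(deduped["id"].is_unique)
```

If `is_unique` were still False on the real data, `drop_duplicates(subset="id")` would be one option, at the cost of discarding rows that share an id but differ elsewhere.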

## Merging the Data

3. __Merge/Join__ the datasets movies_clean and credits. -> Add the features __cast__ and __crew__ to the movies_clean dataset.

In [88]:
df = movies.merge(credit, how = 'left', on = 'id')
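
The left join keeps every row of movies_clean and attaches cast and crew where an id matches. As a hedged illustration on made-up frames (`movies_demo` and `credits_demo` are hypothetical), the `validate` and `indicator` parameters of `merge` can be used to check the result:

```python
import pandas as pd

movies_demo = pd.DataFrame({
    "id": [862, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})
credits_demo = pd.DataFrame({
    "id": [862, 8844],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

# validate raises a MergeError if an id is duplicated on either side;
# indicator adds a "_merge" column that marks unmatched rows as "left_only".
merged = movies_demo.merge(credits_demo, how="left", on="id",
                           validate="one_to_one", indicator=True)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are movies without credits; their cast and crew values are NaN after the merge.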

## Cleaning and Transforming the new "Cast" Column

4.  __Evaluate__ Python Expressions in the stringified column "cast" and __remove quotes__ ("") where possible.

In [89]:
df['cast'] = df.cast.apply(lambda x : ast.literal_eval(x))
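
`ast.literal_eval` expects a string, so it raises a ValueError if a cell holds NaN, which can happen when the left join finds no credits for a movie. The following sketch assumes such gaps may exist and treats them as an empty cast; `parse_cell` is a hypothetical helper and the data is made up.

```python
import ast
import numpy as np
import pandas as pd

def parse_cell(value):
    """Parse a stringified list; treat anything that is not a str as empty."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return []

cast_demo = pd.Series([
    "[{'name': 'Tom Hanks'}, {'name': 'Tim Allen'}]",
    "[]",
    np.nan,  # movie without a matching credits row
])

parsed = cast_demo.apply(parse_cell)
print(parsed.apply(len).tolist())
```

Whether an empty list or NaN is the better placeholder depends on how cast_size should read for movies without credits; the empty list yields a size of 0.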

5. __Determine__ the __cast size__ for all movies (number of actors) and add the additional column "cast_size".

In [90]:
df['cast_size'] = df.cast.apply(lambda x: len(x))

6. __Extract__ all __actor names__ from the column "cast" and __overwrite__ "cast". If a movie has more than one actor, __separate names by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wallace Shawn|John Ratzenberger|Annie Potts|John Morris|Erik von Detten|Laurie Metcalf|R. Lee Ermey|Sarah Freeman|Penn Jillette'.

In [91]:
df['cast'] = df.cast.apply(lambda x: "|".join(i['name'] for i in x))
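
To see what the join does on different inputs, here is a small sketch on made-up, already evaluated cast lists (`cast_demo` is hypothetical and keeps only the "name" key):

```python
import pandas as pd

cast_demo = pd.Series([
    [{"name": "Tom Hanks"}, {"name": "Tim Allen"}],
    [{"name": "Robin Williams"}],
    [],  # movie with no listed actors
])

# Join all names of one movie with a pipe; an empty list joins to "".
names = cast_demo.apply(lambda members: "|".join(m["name"] for m in members))
print(names.tolist())
```

Note that a movie with an empty cast list ends up as an empty string rather than a missing value.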

7. __Inspect__ cast with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [92]:
df.cast.value_counts()

                                                                                                                                                  2189
Georges Méliès                                                                                                                                      24
Louis Theroux                                                                                                                                       15
Mel Blanc                                                                                                                                           12
Jimmy Carr                                                                                                                                           9
                                                                                                                                                  ... 
Aida Elkashef|Sohum Shah|Neeraj Kabi|Vinay Shukla|Sameer Khurana|Vipul Binjola|Faraz Khan     

In [93]:
df['cast'] = df['cast'].replace("", np.nan)

## Cleaning and Transforming the new "Crew" Column

8.  __Evaluate__ Python Expressions in the stringified column "crew" and __remove quotes__ ("") where possible.

In [94]:
df['crew'] = df.crew.apply(lambda x : ast.literal_eval(x))

9. __Determine__ the __crew size__ for all movies (size of the crew) and add the additional column "crew_size".

In [95]:
df['crew_size'] = df.crew.apply(lambda x : len(x))

10. __Extract__ the __director name__ from the column "crew" and create the new column "director". <br> For example: The value in the first row (Toy Story) should be 'John Lasseter'.

In [96]:
df.crew.apply(lambda x : "|".join(i['job'] for i in x))

0        Director|Screenplay|Screenplay|Screenplay|Scre...
1        Executive Producer|Screenplay|Original Music C...
2               Director|Characters|Writer|Sound Recordist
3        Director|Screenplay|Producer|Producer|Producer...
4        Original Music Composer|Director of Photograph...
                               ...                        
44693    Director|Producer|Camera Supervisor|Script|Edi...
44694    Director|Writer|Production Design|Music|Editor...
44695    Director|Screenplay|Screenplay|Original Music ...
44696                                    Director|Producer
44697                                             Director
Name: crew, Length: 44698, dtype: object

In [97]:
df.crew[0]

[{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Screenplay',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f55',
  'department': 'Writing',
  'gender': 2,
  'id': 7,
  'job': 'Screenplay',
  'name': 'Andrew Stanton',
  'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f5b',
  'department': 'Writing',
  'gender': 2,
  'id': 12892,
  'job': 'Screenplay',
  'name': 'Joel Cohen',
  'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f61',
  'department': 'Writing',
  'gender': 0,
  'id': 12893,
  'job': 'Screenplay',
  'name': 'Alec Sokolow',
  'profile_path': '/v79vlRYi94BZUQnkkyzn

In [246]:
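def get_director(x):
    # Collect the names of all crew members whose job is exactly 'Director'.
    a = []
    for i in x:
        if i['job'] == 'Director':
            a.append(i['name'])
    # Join several directors with a pipe; return NaN if there is none.
    if len(a) != 0:
        return "|".join(a)
    else:
        return np.nan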

In [247]:
a = [{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Director',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}]
b = [{'credit_id': '52fe4284c3a36847f8024f49',
  'department': 'Directing',
  'gender': 2,
  'id': 7879,
  'job': 'Director2',
  'name': 'John Lasseter',
  'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'},
 {'credit_id': '52fe4284c3a36847f8024f4f',
  'department': 'Writing',
  'gender': 2,
  'id': 12891,
  'job': 'Director4',
  'name': 'Joss Whedon',
  'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}]

In [249]:
get_director(a)

'John Lasseter|Joss Whedon'

In [236]:
df.crew[1]

[{'credit_id': '52fe44bfc3a36847f80a7cd1',
  'department': 'Production',
  'gender': 2,
  'id': 511,
  'job': 'Executive Producer',
  'name': 'Larry J. Franco',
  'profile_path': None},
 {'credit_id': '52fe44bfc3a36847f80a7c89',
  'department': 'Writing',
  'gender': 2,
  'id': 876,
  'job': 'Screenplay',
  'name': 'Jonathan Hensleigh',
  'profile_path': '/l1c4UFD3g0HVWj5f0CxXAvMAGiT.jpg'},
 {'credit_id': '52fe44bfc3a36847f80a7cdd',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner',
  'profile_path': '/oLOtXxXsYk8X4qq0ud4xVypXudi.jpg'},
 {'credit_id': '52fe44bfc3a36847f80a7c7d',
  'department': 'Directing',
  'gender': 2,
  'id': 4945,
  'job': 'Director',
  'name': 'Joe Johnston',
  'profile_path': '/fok4jaO62v5IP6hkpaaAcXuw2H.jpg'},
 {'credit_id': '52fe44bfc3a36847f80a7cd7',
  'department': 'Editing',
  'gender': 2,
  'id': 4951,
  'job': 'Editor',
  'name': 'Robert Dalva',
  'profile_path': None},
 {'credit_id': '57352

In [155]:
for crew in df.crew[:10]:
    print(get_director(crew))

John Lasseter
Joe Johnston
Howard Deutch
Forest Whitaker
Charles Shyer
Michael Mann
Sydney Pollack
Peter Hewitt
Peter Hyams
Martin Campbell


In [237]:
df["director"] = df.crew.apply(get_director)
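
The same logic can be written more compactly with a list comprehension. This is just an alternative sketch, not the reference solution: `get_director_compact` is a hypothetical name, `crew_demo` is made up, and `.get("job")` is used on the assumption that a crew entry could lack a "job" key.

```python
import numpy as np
import pandas as pd

def get_director_compact(crew):
    """Join the names of all crew members whose job is exactly 'Director'."""
    names = [member["name"] for member in crew
             if member.get("job") == "Director"]
    return "|".join(names) if names else np.nan

crew_demo = pd.Series([
    [{"job": "Director", "name": "John Lasseter"},
     {"job": "Screenplay", "name": "Joss Whedon"}],
    [{"job": "Director", "name": "John Lasseter"},
     {"job": "Director", "name": "Joss Whedon"}],   # two directors
    [{"job": "Editor", "name": "Robert Dalva"}],     # no director
    [],                                              # empty crew
])

directors = crew_demo.apply(get_director_compact)
print(directors.tolist())
```

Matching on the exact string 'Director' deliberately skips jobs such as 'Director of Photography'.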

In [238]:
df.director.value_counts(dropna = False).head(50)

NaN                         731
John Ford                    63
Michael Curtiz               61
Werner Herzog                52
Alfred Hitchcock             52
Georges Méliès               49
Woody Allen                  47
Sidney Lumet                 45
Charlie Chaplin              43
William A. Wellman           41
Richard Thorpe               40
Henry Hathaway               40
Ingmar Bergman               39
Raoul Walsh                  38
Fritz Lang                   37
Martin Scorsese              36
John Huston                  36
Robert Altman                36
George Cukor                 36
Mervyn LeRoy                 36
Clint Eastwood               35
Claude Chabrol               35
Jean-Luc Godard              35
J. Lee Thompson              35
Robert Wise                  35
Takashi Miike                35
Richard Fleischer            33
Michael Apted                33
Roger Corman                 33
Norman Taurog                33
Spike Lee                    32
Henry Ko

## Final Steps

11. __Drop__ the column "crew" and __save__ the dataset in a csv-file.

In [81]:
df.drop(labels = 'crew', axis = 1, inplace = True)
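
The cell above only drops the column; saving could look like the sketch below. The file name "movies_complete.csv" is an assumption made for illustration, and `df_demo` is a made-up one-row frame written to a temporary directory so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

df_demo = pd.DataFrame({
    "id": [862],
    "title": ["Toy Story"],
    "crew": [[]],
    "director": ["John Lasseter"],
})
df_demo = df_demo.drop(columns="crew")

# Hypothetical file name, written to a temporary directory for the demo.
out_path = os.path.join(tempfile.mkdtemp(), "movies_complete.csv")

# index=False avoids an extra "Unnamed: 0" column when the file is reloaded.
df_demo.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```

In the notebook itself the equivalent call would be `df.to_csv(...)` with whatever file name you choose.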

# +++++++++ See some Hints below +++++++++++++

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,...,vote_average,popularity,runtime,overview,spoken_languages,poster_path,cast,cast_size,crew_size,director
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,...,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,<img src='http://image.tmdb.org/t/p/w185//rhIR...,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,13,106,Mickie McGowan
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,...,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,<img src='http://image.tmdb.org/t/p/w185//vzmL...,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,26,16,Jim Strain
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,...,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,<img src='http://image.tmdb.org/t/p/w185//6ksm...,Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...,7,4,Jack Keller
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,...,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,<img src='http://image.tmdb.org/t/p/w185//16XO...,Whitney Houston|Angela Bassett|Loretta Devine|...,10,10,Caron K
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,...,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,<img src='http://image.tmdb.org/t/p/w185//e64s...,Steve Martin|Diane Keaton|Martin Short|Kimberl...,12,7,Adam Bernardi
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44693,439050,Subdue,Rising and falling between a man and woman,,Drama|Family,,fa,,,,...,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,<img src='http://image.tmdb.org/t/p/w185//jlds...,Leila Hatami|Kourosh Tahami|Elham Korda,3,9,Homayoun Shajarian
44694,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,...,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,<img src='http://image.tmdb.org/t/p/w185//xZkm...,Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...,11,6,Lav Diaz
44695,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,...,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,<img src='http://image.tmdb.org/t/p/w185//d5bX...,Erika Eleniak|Adam Baldwin|Julie du Page|James...,15,5,João Fernandes
44696,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,...,,0.003503,87.0,"In a small town live two brothers, one a minis...",,<img src='http://image.tmdb.org/t/p/w185//aorB...,Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...,5,2,Joseph N. Ermolieff


# ++++++++++++++++ Hints++++++++++++++++++++

__Hints for 2.__<br>
There cannot be two or more movies with the same movie id.

__Hints for 3.__<br>
You can use a left join with movies_clean as left dataset and credits as right dataset.

__Hints for 4.__<br>
This is very similar to Question 3 in Project 3.

__Hints for 5.__<br> 
Apply an appropriate lambda function to all column elements.

__Hints for 6.__<br>
This is very similar to Questions 4-8 in Project 3.

__Hints for 7.__<br>
This is very similar to Question 9 in Project 3.

__Hints for 10.__<br> 
Apply an appropriate user-defined function (a bit more complex) to all column elements.