# Assignment 2 - Data Analyst Course
##### In this section you'll find the code and extra notes for analysing the 'Movie' dataset for the Udacity Data Analyst Nanodegree.
##### Changes to the project are tracked with the version control tool Git. The project will also be posted on GitHub as an extra exercise.
##### I've set up Jupyter Notebook locally as an extra exercise.
##### Below you can find my assignment.

<a id='intro'></a>
## Introduction

> **Tip**: In this section of the report, provide a brief introduction to the dataset you've selected for analysis. At the end of this section, describe the questions that you plan on exploring over the course of the report. Try to build your report around the analysis of at least one dependent variable and three independent variables. If you're not sure what questions to ask, then make sure you familiarize yourself with the dataset, its variables and the dataset context for ideas of what to explore.

In [106]:
# first task: import the packages and load the data
# the data can be found in 'tmdb-movies.csv'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sn

# we want the visualizations to be displayed in the notebook, therefore we need the next line
%matplotlib inline

movies=pd.read_csv('tmdb-movies.csv')

# print the head of the data to get a first glimpse of the dataset
# head is a method, so it needs parentheses to return the first five rows
print(movies.head())

       id    imdb_id  popularity     budget     revenue  \
0  135397  tt0369610   32.985763  150000000  1513528810   
1   76341  tt1392190   28.419936  150000000   378436354   
2  262500  tt2908446   13.112507  110000000   295238201   
3  140607  tt2488496   11.173104  200000000  2068178225   
4  168259  tt2820852    9.335014  190000000  1506249360   

                 original_title  \
0                Jurassic World   
1            Mad Max: Fury Road   
2                     Insurgent   
3  Star Wars: The Force Awakens   
4                 

<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you document your steps carefully and justify your cleaning decisions.

### General Properties

In [107]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

print(movies.dtypes)

id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object


In [108]:
# now that we know the datatype of each column, we can check whether every column has the right type
# the release date is still stored as a string, so it must be converted to datetime
from datetime import datetime
print(movies['release_date'])
# the strings are formatted as %m/%d/%y (two-digit year), for example 6/9/15
# to_datetime accepts months and days without leading zeros, so no string manipulation is needed
movies['release_date']=pd.to_datetime(movies['release_date'],format="%m/%d/%y")


0          6/9/15
1         5/13/15
2         3/18/15
3        12/15/15
4          4/1/15
           ...   
10861     6/15/66
10862    12/21/66
10863      1/1/66
10864     11/2/66
10865    11/15/66
Name: release_date, Length: 10866, dtype: object
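
One caveat with `%y`: two-digit years are expanded around a fixed pivot, so `66` is read as 2066 instead of 1966. The frame displayed after dropping `homepage` shows this (`2066-06-15` next to `release_year` 1966). Below is a minimal sketch of one possible correction, assuming `release_year` can be trusted. The small `sample` frame is made up for illustration and is not the real dataset.

```python
import pandas as pd

# hypothetical sample with the same two columns as the movies dataset
sample = pd.DataFrame({
    "release_date": ["6/9/15", "6/15/66", "1/1/66"],
    "release_year": [2015, 1966, 1966],
})

parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# rows whose parsed year lies after the recorded release_year were pushed a century ahead
wrong_century = parsed.dt.year > sample["release_year"]

# keep the correct rows as they are and shift only the wrong ones back by 100 years
sample["release_date"] = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))
print(sample)
```

The same comparison against `release_year` could be applied to `movies` directly. Whether that column is reliable for every row is an assumption worth checking first.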


##### NO LONGER NEEDED: I passed the wrong `format=` argument in my earlier attempt, which is why the to_datetime function did not work.
##### At that time I found a workaround by manipulating the strings.
##### I keep that workaround below for reference, but it is no longer needed.
##### -------------------------------------------------------------------------------
##### I tried to use the to_datetime function, but the dates are not zero-padded, so I assumed my format would not work.
##### I could not find a solution with the datetime package, so I added the zeros by manipulating the strings.
##### I must admit this is not the best method, but I could not find a better way to convert the strings to a date type.

```
# split every date string into its [month, day, year] parts
datum=movies['release_date'].str.split("/")
right_format_date=[]
for dat in datum:
    month=dat[0]
    day=dat[1]
    year=dat[2]
    # pad month and day separately so both get a leading zero when needed
    if len(month)==1:
        month="0"+month
    if len(day)==1:
        day="0"+day
    right_format_date.append(month+"/"+day+"/"+year)
movies['release_date']=right_format_date
movies['release_date']=pd.to_datetime(movies['release_date'],format="%m/%d/%y")
```


In [95]:
# checking for missing values (NaN) per column
print(movies.isna().sum())

# below you'll find the missing values as a fraction of the total number of rows
print(movies.isna().sum()/movies.shape[0])

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
id                      0.000000
imdb_id                 0.000920
popularity              0.000000
budget                  0.000000
revenue                 0.000000
original_title          0.000000
cast                    0.006994
homepage                0.729799
director                0.004049
tagline                 0.259893
keywords                0.137401
overview       
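
`isna()` only counts real NaN values. Rows such as 10861 to 10864 have a `budget` and `revenue` of 0, which most likely means "unknown" rather than a real amount. That reading is an assumption. A minimal sketch of how such zeros could be counted, using a made-up `sample` frame:

```python
import pandas as pd

# hypothetical sample; the values only imitate the shape of the movies dataset
sample = pd.DataFrame({
    "budget": [150000000, 0, 0, 19000],
    "revenue": [1513528810, 0, 0, 0],
    "runtime": [124, 95, 176, 80],
})

money_columns = ["budget", "revenue"]

# comparing with 0 gives a boolean frame; summing counts the True values per column
zero_counts = (sample[money_columns] == 0).sum()

# the same counts as a fraction of all rows
zero_share = zero_counts / len(sample)

print(zero_counts)
print(zero_share)
```

If the zeros do turn out to be placeholders, they could be replaced with NaN before computing averages so they do not pull the results down.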

In [92]:
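# about 73 % of the homepages are missing, so we drop this column for further analysis
# drop returns a new DataFrame, so the result is assigned back to movies to keep the change
movies=movies.drop('homepage',axis=1)
movies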

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,2015-06-09,5562,6.5,2015,1.379999e+08,1.392446e+09
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,2015-05-13,6185,7.1,2015,1.379999e+08,3.481613e+08
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,2015-03-18,2480,6.3,2015,1.012000e+08,2.716190e+08
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,2015-12-15,5292,7.5,2015,1.839999e+08,1.902723e+09
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,2015-04-01,2947,7.3,2015,1.747999e+08,1.385749e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10861,21,tt0060371,0.080598,0,0,The Endless Summer,Michael Hynson|Robert August|Lord 'Tally Ho' B...,Bruce Brown,,surfer|surfboard|surfing,"The Endless Summer, by Bruce Brown, is one of ...",95,Documentary,Bruce Brown Films,2066-06-15,11,7.4,1966,0.000000e+00,0.000000e+00
10862,20379,tt0060472,0.065543,0,0,Grand Prix,James Garner|Eva Marie Saint|Yves Montand|Tosh...,John Frankenheimer,Cinerama sweeps YOU into a drama of speed and ...,car race|racing|formula 1,Grand Prix driver Pete Aron is fired by his te...,176,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,2066-12-21,20,5.7,1966,0.000000e+00,0.000000e+00
10863,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...,Eldar Ryazanov,,car|trolley|stealing car,An insurance agent who moonlights as a carthie...,94,Mystery|Comedy,Mosfilm,2066-01-01,11,6.5,1966,0.000000e+00,0.000000e+00
10864,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...,Woody Allen,WOODY ALLEN STRIKES BACK!,spoof,"In comic Woody Allen's film debut, he took the...",80,Action|Comedy,Benedict Pictures Corp.,2066-11-02,22,5.4,1966,0.000000e+00,0.000000e+00
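
The same decision can be made for all columns at once instead of by eye. The sketch below drops every column whose share of missing values exceeds a cut-off. The `threshold` value and the `sample` frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# hypothetical sample: homepage is mostly empty, tagline only partly
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": ["http://example.com", np.nan, np.nan, np.nan],
    "tagline": ["x", np.nan, "y", "z"],
})

threshold = 0.5  # assumed cut-off; choose what suits the analysis

# mean of a boolean column equals the fraction of True values
missing_share = sample.isna().mean()
too_sparse = missing_share[missing_share > threshold].index.tolist()

# drop returns a new frame, so the result is assigned back
sample = sample.drop(columns=too_sparse)
print(too_sparse)
print(sample.columns.tolist())
```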


In [119]:
# find duplicates
movies.duplicated().sum()

1
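
Counting duplicates is only the first step. A minimal sketch of how the duplicated rows could be inspected and removed, shown on a made-up `sample` frame rather than on `movies`:

```python
import pandas as pd

# hypothetical sample with one fully duplicated row
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["A", "B", "B", "C"],
})

# keep=False marks every copy, which makes it easy to compare the duplicated rows
print(sample[sample.duplicated(keep=False)])

# keep the first occurrence of each row and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduplicated))
```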

In [100]:
# some movies have no tagline; we replace the NaN with 'No Tagline'
movies['tagline']=movies['tagline'].fillna('No Tagline')

In [101]:
movies['tagline'].value_counts()

No Tagline                                                                            2824
Based on a true story.                                                                   5
Two Films. One Love.                                                                     3
Be careful what you wish for.                                                            3
Some houses are born bad.                                                                2
                                                                                      ... 
The past never dies. It kills.                                                           1
Can two friends sleep together and still love each other in the morning?                 1
Living in Hollywood can make you famous. Dying in Hollywood can make you a legend.       1
He just doesn't know it yet.                                                             1
Everybody's dreamgirl. One girl's nightmare.                                             1
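
If the placeholder is only needed to see how many taglines are missing, `value_counts` can also count NaN directly without changing the column. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# hypothetical taglines, three of them missing
taglines = pd.Series([
    "Based on a true story.",
    np.nan,
    "Based on a true story.",
    np.nan,
    np.nan,
])

# dropna=False keeps NaN as its own entry in the result
counts = taglines.value_counts(dropna=False)
print(counts)
```

Filling with a string such as 'No Tagline' is still the better choice when the column is used later as text, since NaN would otherwise have to be handled in every string operation.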