# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, your code is very unlikely to match ours exactly.

## First Steps 

In [5]:
import pandas as pd
import numpy as np

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified json data.

In [6]:
movies = pd.read_csv('movies_metadata.csv')#, index_col = 'id')
movies.head(n =2)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
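One way to spot the nested columns without reading every cell by eye is to check which columns hold strings that begin with `{` or `[`. The sketch below is only an illustration: it uses a tiny hand-made DataFrame instead of the real file, and the helper name `looks_stringified` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for movies_metadata.csv; the values are illustrative only.
sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
})

def looks_stringified(series):
    """True if any non-missing value is a string that starts with '{' or '['."""
    return any(
        isinstance(value, str) and value.lstrip().startswith(("{", "["))
        for value in series.dropna()
    )

# Collect the names of all columns that appear to hold stringified dicts/lists.
nested_cols = [col for col in sample.columns if looks_stringified(sample[col])]
print(nested_cols)
```

Applied to the full dataset, the same check should point at the columns listed in step 3, although that is an expectation rather than something verified here.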


## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [7]:
movies.drop(labels =['adult','original_title','imdb_id','video', 'homepage'], axis = 1, inplace = True)
movies.head(n=2)

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0


## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [8]:
import ast

In [9]:
movies['belongs_to_collection'] = movies.belongs_to_collection.map(ast.literal_eval,na_action ='ignore' )

In [10]:
movies.genres =movies.genres.map(ast.literal_eval, na_action = 'ignore')

In [11]:
movies.production_countries = movies.production_countries.map(ast.literal_eval, na_action = 'ignore')

In [12]:
movies.spoken_languages = movies.spoken_languages.map(ast.literal_eval, na_action = 'ignore')
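The cells above use `ast.literal_eval` rather than a JSON parser because the stored strings are Python literals with single quotes, which is not valid JSON. A minimal sketch on a single hand-written cell value (the string is illustrative, not read from the file):

```python
import ast
import json

cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads rejects single-quoted keys and strings.
try:
    json.loads(cell)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates the Python literal into a list of dicts.
parsed = ast.literal_eval(cell)
names = [entry["name"] for entry in parsed]
print(json_ok, names)
```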

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

### first way (better)

In [20]:
movies['belongs_to_collection'] = movies['belongs_to_collection'].apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)

### second way

In [220]:
# remove weird id
#movies = movies[~movies['id'].str.contains('-')]

In [15]:
#movies['belongs_to_collection']=movies.belongs_to_collection.map(lambda x: x['name'], na_action = 'ignore')

TypeError: 'float' object is not subscriptable
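A likely reason for this `TypeError` is that `na_action='ignore'` only skips real missing values, so any cell that evaluated to an ordinary float still reaches the lambda. The sketch below reproduces that situation on a hand-made Series; the stray value `0.065736` is an assumed example of such a malformed cell, not something confirmed from the file.

```python
import numpy as np
import pandas as pd

# One proper dict, one NaN, and one stray float (assumed malformed cell).
col = pd.Series(
    [{"id": 10194, "name": "Toy Story Collection"}, np.nan, 0.065736],
    dtype=object,
)

# na_action='ignore' skips NaN only; the stray float is still passed to the lambda.
try:
    col.map(lambda x: x["name"], na_action="ignore")
    map_failed = False
except TypeError:
    map_failed = True

# The isinstance guard handles NaN and any other non-dict value in one step.
names = col.apply(lambda x: x["name"] if isinstance(x, dict) else np.nan)
print(map_failed)
print(names)
```

This is why the first way, with the explicit `isinstance(x, dict)` check, is the more robust of the two.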

In [222]:
#movies.head(n=2)

id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,overview,popularity,poster_path,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
862,Toy Story Collection,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
8844,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

### first way (better)

In [22]:
movies.genres[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [26]:
movies['genres'] = movies.genres.apply(lambda x: "|".join(i['name'] for i in x))
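The one-liner above assumes every cell is a list. As a hedged variant, the sketch below adds a guard so that missing values pass through as NaN instead of raising an error; the helper name `join_names` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def join_names(cell, sep="|"):
    """Join the 'name' of each dict in a list cell; return NaN for non-list cells."""
    if isinstance(cell, list):
        return sep.join(d["name"] for d in cell if isinstance(d, dict) and "name" in d)
    return np.nan

genres = pd.Series(
    [
        [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}, {"id": 10751, "name": "Family"}],
        [],      # a movie without any genre yields an empty string
        np.nan,  # a missing cell stays missing
    ],
    dtype=object,
)

flat = genres.apply(join_names)
print(flat)
```

Note that an empty list still produces an empty string, which is exactly the kind of value step 9 asks you to look for.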

### second way

In [223]:
def flatten(df, col):
    """Join the 'name' of every dict in each cell of df[col] with '|'."""
    flattened = []
    for cell in df[col]:
        if isinstance(cell, float):
            # Missing values (NaN is a float) become empty strings
            flattened.append('')
        else:
            # Collect the names of all entries in this cell
            flattened.append('|'.join(item['name'] for item in cell))
    return flattened

In [224]:
def test1(df, col):
    return df[col][0]

In [225]:
test1(movies, 'genres')

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [226]:
movies.genres

id
862       [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
8844      [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
15602     [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
31357     [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
11862                        [{'id': 35, 'name': 'Comedy'}]
                                ...                        
439050    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
111109                        [{'id': 18, 'name': 'Drama'}]
67758     [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
227506                                                   []
461257                                                   []
Name: genres, Length: 45463, dtype: object

In [227]:
movies['genres'] = flatten(movies, 'genres')
movies.genres

id
862        Animation|Comedy|Family
8844      Adventure|Fantasy|Family
15602               Romance|Comedy
31357         Comedy|Drama|Romance
11862                       Comedy
                    ...           
439050                Drama|Family
111109                       Drama
67758        Action|Drama|Thriller
227506                            
461257                            
Name: genres, Length: 45463, dtype: object

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [228]:
movies.spoken_languages

id
862                [{'iso_639_1': 'en', 'name': 'English'}]
8844      [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
15602              [{'iso_639_1': 'en', 'name': 'English'}]
31357              [{'iso_639_1': 'en', 'name': 'English'}]
11862              [{'iso_639_1': 'en', 'name': 'English'}]
                                ...                        
439050               [{'iso_639_1': 'fa', 'name': 'فارسی'}]
111109                    [{'iso_639_1': 'tl', 'name': ''}]
67758              [{'iso_639_1': 'en', 'name': 'English'}]
227506                                                   []
461257             [{'iso_639_1': 'en', 'name': 'English'}]
Name: spoken_languages, Length: 45463, dtype: object

In [229]:
movies['spoken_languages'] = flatten(movies, 'spoken_languages')

In [230]:
movies.spoken_languages

id
862                English
8844      English|Français
15602              English
31357              English
11862              English
                ...       
439050               فارسی
111109                    
67758              English
227506                    
461257             English
Name: spoken_languages, Length: 45463, dtype: object

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [231]:
movies['production_countries'] = flatten(movies,'production_countries')

In [232]:
movies.production_countries

id
862       United States of America
8844      United States of America
15602     United States of America
31357     United States of America
11862     United States of America
                    ...           
439050                        Iran
111109                 Philippines
67758     United States of America
227506                      Russia
461257              United Kingdom
Name: production_countries, Length: 45463, dtype: object

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

In [233]:
movies.production_companies = movies.production_companies.map(ast.literal_eval, na_action = 'ignore')

In [234]:
movies['production_companies'] = flatten(movies,'production_companies')

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [235]:
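# Assign the result back: an in-place replace on a selected column may not update the DataFrame in newer pandas versions
movies['production_countries'] = movies['production_countries'].replace("", np.nan)
movies['production_companies'] = movies['production_companies'].replace("", np.nan)
movies['spoken_languages'] = movies['spoken_languages'].replace("", np.nan)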
movies.budget.value_counts()

0           36573
5000000       286
10000000      259
20000000      243
2000000       242
            ...  
12902809        1
24500           1
123690          1
3684600         1
40600000        1
Name: budget, Length: 1223, dtype: int64

In [236]:
np.where(movies.production_companies == "")

(array([], dtype=int64),)
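To see why empty strings are worth replacing, compare `value_counts()` before and after on a small example. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

languages = pd.Series(["English", "", "English|Français", ""])

# Before: the empty string shows up as a regular (and frequent) category.
print(languages.value_counts())

# After: empty strings are real missing values and are excluded by default.
cleaned = languages.replace("", np.nan)
print(cleaned.value_counts())
print(cleaned.isna().sum())
```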

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [237]:
movies['budget'] = pd.to_numeric(movies.budget, errors = 'coerce')

In [238]:
movies['id']= pd.to_numeric(movies.index, errors = 'coerce')
movies.set_index(keys='id')

id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,overview,popularity,poster_path,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
862,Toy Story Collection,30000000,Animation|Comedy|Family,http://toystory.disney.com/toy-story,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,...,1995-10-30,373554033.0,81.0,English,Released,,Toy Story,False,7.7,5415.0
8844,,65000000,Adventure|Fantasy|Family,,tt0113497,en,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,...,1995-12-15,262797249.0,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
15602,Grumpy Old Men Collection,0,Romance|Comedy,,tt0113228,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,...,1995-12-22,0.0,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
31357,,16000000,Comedy|Drama|Romance,,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,...,1995-12-22,81452156.0,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
11862,Father of the Bride Collection,0,Comedy,,tt0113041,en,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,...,1995-02-10,76578911.0,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439050,,0,Drama|Family,http://www.imdb.com/title/tt6209470/,tt6209470,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,...,,0.0,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
111109,,0,Drama,,tt2028550,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,...,2011-11-17,0.0,360.0,,Released,,Century of Birthing,False,9.0,3.0
67758,,0,Action|Drama|Thriller,,tt0303758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,...,2003-08-01,0.0,90.0,English,Released,A deadly game of wits.,Betrayal,False,3.8,6.0
227506,,0,,,tt0008536,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,...,1917-10-21,0.0,87.0,,Released,,Satan Triumphant,False,0.0,0.0


In [239]:
movies['popularity']= pd.to_numeric(movies.popularity, errors = 'coerce')
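The effect of `errors='coerce'` is easiest to see on a column that mixes numbers with text. A small sketch, assuming a shifted row left a file name in the budget column (the `.jpg` value is an illustrative assumption):

```python
import pandas as pd

raw_budget = pd.Series(
    ["30000000", "0", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg", None],
    dtype=object,
)

# Values that cannot be parsed as numbers become NaN instead of raising an error.
budget = pd.to_numeric(raw_budget, errors="coerce")
print(budget)
print(budget.dtype)
```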

11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__!

In [240]:
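# Assign the result back: an in-place replace on a selected column may not update the DataFrame in newer pandas versions
movies['budget'] = movies.budget.replace(0, np.nan)
movies['runtime'] = movies.runtime.replace(0, np.nan)
movies['revenue'] = movies.revenue.replace(0, np.nan)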

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [241]:
movies['budget'] = movies.budget / 1000000
movies['revenue'] = movies.revenue / 1000000
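Steps 11 and 12 can also be combined into one vectorized expression. The sketch below uses three made-up budget values to show that a 0 becomes NaN before the conversion, so it never turns into a misleading "0.0 million":

```python
import numpy as np
import pandas as pd

budget = pd.Series([30000000, 0, 65000000], dtype="float64")

# Treat 0 as "unknown", then convert USD to million USD.
budget_musd = budget.replace(0, np.nan) / 1_000_000
print(budget_musd)
```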

13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [242]:
sum(movies[movies.vote_count == 0].vote_average != 0)

0

In [243]:
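# Assign the result back: an in-place replace on a selected column may not update the DataFrame in newer pandas versions
movies['vote_count'] = movies.vote_count.replace(0, np.nan)
movies['vote_average'] = movies.vote_average.replace(0, np.nan)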
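An alternative, sketched here on made-up numbers, is to blank out both columns only in the rows where nobody voted. This is a different technique from the global `replace` above: it leaves a genuine average of 0 untouched if the movie did receive votes.

```python
import numpy as np
import pandas as pd

votes = pd.DataFrame({
    "vote_count": [5415.0, 0.0, 3.0],
    "vote_average": [7.7, 0.0, 9.0],
})

# Rows without any votes carry no rating information at all.
no_votes = votes["vote_count"] == 0
votes.loc[no_votes, ["vote_count", "vote_average"]] = np.nan
print(votes)
```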

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as missing (for datetime columns, pandas shows missing values as NaT instead of NaN).

In [244]:
movies.release_date = pd.to_datetime(movies.release_date, errors = 'coerce')

In [245]:
movies.release_date.dtypes

dtype('<M8[ns]')
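A quick way to check what `errors='coerce'` does to unparseable dates is to run it on a few hand-written strings. The values below are illustrative only:

```python
import pandas as pd

raw_dates = pd.Series(["1995-10-30", "1995-12-15", "not a date", None], dtype=object)

# Strings that cannot be parsed become NaT, the datetime counterpart of NaN.
dates = pd.to_datetime(raw_dates, errors="coerce")
print(dates)
print(dates.isna().sum())
```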

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

In [246]:
movies[movies.overview.str.contains("No Data", na = False)]

id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,overview,popularity,poster_path,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,id
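The empty result above only shows that the literal text "No Data" does not occur; other placeholders may still be hiding in the column. The sketch below shows one possible way to turn placeholders and whitespace-only cells into NaN. The placeholder list is an assumption for illustration and would need to be checked against `value_counts()` on the real data.

```python
import pandas as pd

overview = pd.Series(["Led by Woody...", "No overview found.", " ", None], dtype=object)

# Hypothetical placeholder texts; extend this list after inspecting the data.
placeholders = ["No overview found.", "No Data"]

# Strip whitespace first so that cells like " " count as empty.
stripped = overview.str.strip()
# mask() sets every cell to NaN where the condition is True.
cleaned = stripped.mask(stripped.isin(placeholders + [""]))
print(cleaned)
```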


## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [309]:
movies[movies.duplicated(subset = 'id',keep = False)].sort_index(ascending = True)

id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,overview,popularity,poster_path,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,id
4912,,30.0,Comedy|Crime|Drama|Romance|Thriller,,tt0270288,en,"Television made him famous, but his biggest hi...",7.645827,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,...,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,False,6.6,281.0,4912
4912,,30.0,Comedy|Crime|Drama|Romance|Thriller,,tt0270288,en,"Television made him famous, but his biggest hi...",11.331072,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,...,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,False,6.6,281.0,4912
10991,Pokémon Collection,16.0,Adventure|Fantasy|Animation|Action|Family,http://movies.warnerbros.com/pk3/,tt0235679,ja,When Molly Hale's sadness of her father's disa...,10.264597,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,...,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,False,6.0,143.0,10991
10991,Pokémon Collection,16.0,Adventure|Fantasy|Animation|Action|Family,http://movies.warnerbros.com/pk3/,tt0235679,ja,When Molly Hale's sadness of her father's disa...,6.480376,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,...,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,False,6.0,144.0,10991
12600,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,http://www.pokemon.com/us/movies/movie-pokemon...,tt0287635,ja,"All your favorite Pokémon characters are back,...",6.080108,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,...,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,False,5.7,82.0,12600
12600,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,http://www.pokemon.com/us/movies/movie-pokemon...,tt0287635,ja,"All your favorite Pokémon characters are back,...",7.072301,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,...,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,False,5.7,82.0,12600
13209,,0.0025,Drama|Comedy|Foreign,,tt0499537,fa,"Since women are banned from soccer matches, Ir...",1.529879,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,...,,93.0,فارسی,Released,,Offside,False,6.7,27.0,13209
13209,,0.0025,Drama|Comedy|Foreign,,tt0499537,fa,"Since women are banned from soccer matches, Ir...",1.52896,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,...,,93.0,فارسی,Released,,Offside,False,6.7,27.0,13209
14788,,1.6,Drama|Crime|Mystery,http://www.bubblethefilm.com/,tt0454792,en,Set against the backdrop of a decaying Midwest...,3.185256,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,...,,73.0,English,Released,,Bubble,False,6.4,36.0,14788
14788,,1.6,Drama|Crime|Mystery,http://www.bubblethefilm.com/,tt0454792,en,Set against the backdrop of a decaying Midwest...,3.008299,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,...,,73.0,English,Released,,Bubble,False,6.4,36.0,14788


In [310]:
movies.drop_duplicates(subset = 'id', inplace = True)

In [311]:
print(len(pd.unique(movies.id)))
len(movies.id)

45433


45433
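The duplicate rows shown above differ only in volatile fields such as popularity, which is why deduplicating on `id` alone is reasonable. A minimal sketch of the same two calls on a three-row example (values copied from the output above, everything else omitted):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [4912, 4912, 862],
    "title": ["Confessions of a Dangerous Mind", "Confessions of a Dangerous Mind", "Toy Story"],
    "popularity": [7.645827, 11.331072, 21.946943],
})

# keep=False marks every member of a duplicate group, which is handy for inspection.
dupes = df[df.duplicated(subset="id", keep=False)]

# keep='first' (the default) retains the first row of each group.
deduped = df.drop_duplicates(subset="id", keep="first")
print(len(dupes), len(deduped), deduped["id"].is_unique)
```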

In [248]:
unique_id = pd.DataFrame(pd.unique(movies.id), columns = ['id']).set_index(keys = 'id')

In [266]:
movies.index = pd.to_numeric(movies.index)

In [272]:
unique_movies = unique_id.merge(movies, how = 'left', left_index=True,right_index=True, indicator= True)

In [277]:
unique_movies['_merge']

id
2         both
3         both
5         both
6         both
11        both
          ... 
465044    both
467731    both
468343    both
468707    both
469172    both
Name: _merge, Length: 45463, dtype: category
Categories (3, object): [left_only, right_only, both]

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [313]:
movies.drop('id', axis = 1, inplace = True)

In [321]:
movies[movies.index.isna()]

id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,overview,popularity,poster_path,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [318]:
movies = movies[movies.title.notna()]

In [319]:
movies.shape

(45430, 21)

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [372]:
movies.loc[:,"not_na"] = np.sum(movies.notna(), axis = 1)

In [374]:
movies = movies[movies.not_na >= 10]

In [377]:
movies.shape

(45406, 22)
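pandas also offers this filter directly through the `thresh` parameter of `dropna`, which avoids the helper column. A small sketch with a threshold of 2 instead of 10, on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, np.nan],
})

# Keep only rows that have at least 2 non-missing values.
kept = df.dropna(thresh=2)

# Same result as counting non-missing values per row by hand.
manual = df[df.notna().sum(axis=1) >= 2]
print(kept.equals(manual), kept.shape)
```

With this approach no `not_na` column has to be dropped later on.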

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [379]:
movies = movies[movies.status == "Released"]

In [380]:
movies.shape

(44966, 22)
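Filtering and dropping the column can be chained in one statement. A minimal sketch on made-up rows (the status value "Rumored" is only an example of a non-released movie):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "status": ["Released", "Rumored", None],
})

# Rows with a missing status compare as False and are filtered out as well.
released = df[df["status"] == "Released"].drop(columns="status")
print(released)
```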

20. The order of the columns should be as follows (the target order is the list `new_col_list` in cell `In [413]`):

In [410]:
movies.rename(columns = {"budget":"budget_musd", "revenue":"revenue_musd"}, inplace = True)

In [417]:
list_columns = movies.columns.tolist()

In [412]:
list_columns

['belongs_to_collection',
 'budget_musd',
 'genres',
 'homepage',
 'imdb_id',
 'original_language',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue_musd',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count',
 'not_na']

In [413]:
new_col_list = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [420]:
list(set(new_col_list) - set(list_columns))  # required columns that are not yet present ('id' is still the index)

['id']

In [415]:
movies.drop(labels = ['video', 'status', 'not_na', 'imdb_id', 'homepage'], axis =1, inplace = True)

In [424]:
movies.reset_index(inplace = True)
movies

Unnamed: 0,id,belongs_to_collection,budget_musd,genres,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,tagline,title,vote_average,vote_count
0,862,Toy Story Collection,30.0,Animation|Comedy|Family,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,,Toy Story,7.7,5415.0
1,8844,,65.0,Adventure|Fantasy|Family,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,15602,Grumpy Old Men Collection,,Romance|Comedy,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,31357,,16.0,Comedy|Drama|Romance,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,11862,Father of the Bride Collection,,Comedy,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44961,439050,,,Drama|Family,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Rising and falling between a man and woman,Subdue,4.0,1.0
44962,111109,,,Drama,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,,Century of Birthing,9.0,3.0
44963,67758,,,Action|Drama|Thriller,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,A deadly game of wits.,Betrayal,3.8,6.0
44964,227506,,,,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,,Satan Triumphant,,


In [425]:
movies = movies[new_col_list]

In [426]:
movies

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,United States of America,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44961,439050,Subdue,Rising and falling between a man and woman,NaT,Drama|Family,,fa,,,,Iran,1.0,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,/jldsYflnId4tTWPx8es3uzsB1I8.jpg
44962,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,Philippines,3.0,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg
44963,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,United States of America,6.0,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg
44964,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,Russia,,,0.003503,87.0,"In a small town live two brothers, one a minis...",,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg


21. __Reset__ the Index and create a __RangeIndex__.
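Reordering columns and resetting the index can be done together. A minimal sketch, assuming a two-column frame with an arbitrary leftover index:

```python
import pandas as pd

df = pd.DataFrame(
    {"title": ["Toy Story", "Jumanji"], "id": [862, 8844]},
    index=[10, 20],  # leftover labels from earlier filtering
)

# Selecting with a list of names sets the column order;
# drop=True discards the old labels instead of keeping them as a column.
ordered = df[["id", "title"]].reset_index(drop=True)
print(ordered.index)
```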

22. __Save__ the cleaned dataset in a __csv-file__.

In [427]:
movies.to_csv("movies_clean_practice2.csv", index = False)

In [428]:
pd.read_csv("movies_clean_practice2.csv")

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,United States of America,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44961,439050,Subdue,Rising and falling between a man and woman,,Drama|Family,,fa,,,,Iran,1.0,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,/jldsYflnId4tTWPx8es3uzsB1I8.jpg
44962,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,Philippines,3.0,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg
44963,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,United States of America,6.0,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg
44964,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,Russia,,,0.003503,87.0,"In a small town live two brothers, one a minis...",,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg
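Keep in mind that a CSV file stores no datatypes, so "release_date" comes back as plain text unless it is parsed again on loading. The sketch below demonstrates the round trip with an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "id": [862],
    "title": ["Toy Story"],
    "release_date": pd.to_datetime(["1995-10-30"]),
})

# Write to an in-memory buffer; index=False avoids an extra unnamed column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the data back.
restored = pd.read_csv(buffer, parse_dates=["release_date"])
print(restored.dtypes)
```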


# +++++++++ See some Hints below +++++++++++++

# ++++++++++++++++ Hints++++++++++++++++++++

__Hints for 3.__ <br>
Apply ast.literal_eval() to all stringified elements (you have to import ast):

In [None]:
# example:
df.stringified_column = df.stringified_column.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

__Hints for 4., 5., 6., 7., 8.__<br> 
Apply an appropriate lambda function to all column elements.

__Hints for 9.__<br>
Replace all __""__ (empty strings) in the above columns by NaN (__np.nan__)

__Hints for 10.__<br>
Use pd.to_numeric() and "coerce" errors

__Hints for 11.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 13.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 14.__<br>
Use pd.to_datetime() and "coerce" errors

__Hints for 16.__<br>
There cannot be two or more movies with the same movie id.