## Data Cleaning

- After analyzing the data, we found some irregularities in our data.
- To build a robust model, we need to fix those irregularities.
#### In this notebook we will perform the transformations below
    - Convert features stored as lists of dictionaries into plain lists
    - Merge both datasets (credits and movies)
    - Drop unnecessary features
    - Remove missing and duplicated rows
    - Merge features into a single feature
    - Remove hyphens
    - Remove apostrophes
    - Replace everything except letters with a space
    - Remove extra whitespace
    - Convert text to lowercase
    - Remove stopwords
    - Save the processed dataset

### Import Libraries and Dataset

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 300)

In [4]:
credits=pd.read_csv('tmdb_5000_credits.csv')
tmdb=pd.read_csv('tmdb_5000_movies.csv')

In [5]:
credits.head(1)   # let's have a look at our dataset

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""c...","[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54..."


In [6]:
tmdb.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science Fiction""}]",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""}, {""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"": 3801, ""name"": ""space travel""}, {""id"": 9685, ""name"": ""futuristic""}, {""id"": 9840, ""name"": ""romance""}, {""id"": 9882...",en,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289}, {""name"": ""Twentieth Century Fox Film Corporation"", ""id"": 306}, {""name"": ""Dune Entertainment"", ""id"": 444}, {""name"": ""Lightstorm Entertainment"", ""id"": 574}]","[{""iso_3166_1"": ""US"", ""name"": ""United States of America""}, {""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [7]:
# Let's merge both the datasets

movies= tmdb.merge(credits, on='title')
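
Joining on `title` assumes that titles are unique in both tables. When two films share a title, the join returns one row for every matching pair. The toy frames below (made-up values, not taken from the dataset) sketch that effect, together with one possible alternative: matching `id` from one table against `movie_id` from the other. Whether that alternative is the better choice for this dataset is an assumption that this notebook does not verify.

```python
import pandas as pd

# Hypothetical toy data: two different films that happen to share one title.
left = pd.DataFrame({"id": [1, 2], "title": ["Remake", "Remake"]})
right = pd.DataFrame({"movie_id": [1, 2],
                      "title": ["Remake", "Remake"],
                      "cast": ["a", "b"]})

# Join on title: every "Remake" on the left pairs with every "Remake" on the right.
by_title = left.merge(right, on="title")

# Join on the numeric identifiers instead: one row per film.
by_id = left.merge(right.drop(columns="title"), left_on="id", right_on="movie_id")

print(len(by_title))  # 4
print(len(by_id))     # 2
```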

In [8]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science Fiction""}]",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""}, {""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"": 3801, ""name"": ""space travel""}, {""id"": 9685, ""name"": ""futuristic""}, {""id"": 9840, ""name"": ""romance""}, {""id"": 9882...",en,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289}, {""name"": ""Twentieth Century Fox Film Corporation"", ""id"": 306}, {""name"": ""Dune Entertainment"", ""id"": 444}, {""name"": ""Lightstorm Entertainment"", ""id"": 574}]",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""c...","[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54..."


In [9]:
# Dropping features that are not required
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]

In [10]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {""id"": 878, ""name"": ""Science Fiction""}]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""space war""}, {""id"": 3388, ""name"": ""space colony""}, {""id"": 3679, ""name"": ""society""}, {""id"": 3801, ""name"": ""space travel""}, {""id"": 9685, ""name"": ""futuristic""}, {""id"": 9840, ""name"": ""romance""}, {""id"": 9882...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""c...","[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54..."


In [11]:
# Checking total rows and columns in dataset

movies.shape

(4809, 7)

In [12]:
# Checking if there are any missing values in our dataset
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

- Only 3 values are missing, all in the overview feature, so we can drop those rows.

In [13]:
# Dropping rows with missing values

movies=movies.dropna()

In [14]:
# Checking again that no missing values remain in our dataset

movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [15]:
# Checking if there are any duplicated rows in our dataset

movies.duplicated().sum()

0

In [16]:
# Checking the first row of the genres feature

movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

- We can see that the genres value is a string holding a list of dictionaries, not a usable list
- We will convert such features into plain lists of names

In [17]:
import ast

def convert(text):
    # Parse the stringified list of dicts and keep only each 'name' value
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L
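
As a quick check of the parsing step, the sketch below applies the same logic to a shortened copy of the genres string shown above. The helper name `names_from` and its `parse` parameter are hypothetical, introduced only for illustration. Because this sample is also valid JSON, `json.loads` is shown as a possible alternative parser; that swap is an assumption and not what the notebook uses.

```python
import ast
import json

# Shortened copy of the genres value from the first row.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from(text, parse=ast.literal_eval):
    """Return the 'name' value of every dict in a stringified list of dicts."""
    return [item["name"] for item in parse(text)]

print(names_from(sample))                    # ['Action', 'Adventure']
print(names_from(sample, parse=json.loads))  # ['Action', 'Adventure']
```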

In [18]:
columns = ['genres', 'keywords']
for i in columns:
    movies[i] = movies[i].apply(convert)

In [19]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, space, alien, tribe, alien planet, cgi, marine, soldier, battle, love affair, anti war, power relations, mind and soul, 3d]","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""c...","[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54..."


In [20]:
movies.iloc[0].cast

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [21]:
movies['cast'][0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [22]:
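# Note: convert3 (first three names) is not applied below; the next cells
# use convert followed by a slice that keeps four names.
def convert3(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L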

In [23]:
movies['cast'] = movies['cast'].apply(convert)
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, space, alien, tribe, alien planet, cgi, marine, soldier, battle, love affair, anti war, power relations, mind and soul, 3d]","[Sam Worthington, Zoe Saldana, Sigourney Weaver, Stephen Lang, Michelle Rodriguez, Giovanni Ribisi, Joel David Moore, CCH Pounder, Wes Studi, Laz Alonso, Dileep Rao, Matt Gerald, Sean Anthony Moran, Jason Whyte, Scott Lawrence, Kelly Kilgour, James Patrick Pitt, Sean Patrick Murphy, Peter Dillon...","[{""credit_id"": ""52fe48009251416c750aca23"", ""department"": ""Editing"", ""gender"": 0, ""id"": 1721, ""job"": ""Editor"", ""name"": ""Stephen E. Rivkin""}, {""credit_id"": ""539c47ecc3a36810e3001f87"", ""department"": ""Art"", ""gender"": 2, ""id"": 496, ""job"": ""Production Design"", ""name"": ""Rick Carter""}, {""credit_id"": ""54..."


In [24]:
movies['cast'] = movies['cast'].apply(lambda x:x[0:4])   # keep only the first four cast members
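
The two cast steps (extract every name, then keep the first four) could also be folded into a single helper. The sketch below is one way to do that; `top_names` and its `n` parameter are hypothetical names, and the sample string is a hand-shortened version of the cast value shown earlier.

```python
import ast
from itertools import islice

def top_names(text, n=4):
    """Parse a stringified list of dicts and return the first n 'name' values."""
    return [item["name"] for item in islice(ast.literal_eval(text), n)]

# Hand-shortened sample with only the fields needed here.
sample = ('[{"name": "Sam Worthington", "order": 0}, '
          '{"name": "Zoe Saldana", "order": 1}, '
          '{"name": "Sigourney Weaver", "order": 2}]')

print(top_names(sample, n=2))  # ['Sam Worthington', 'Zoe Saldana']
print(len(top_names(sample)))  # 3 (fewer than n entries are returned as-is)
```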

In [25]:
def fetch_director(text):
    # Keep only the names of crew members whose job is 'Director'
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L
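
To see what the director filter returns, the sketch below runs an equivalent function on a hand-made crew string. The sample keeps only the `job` and `name` fields and is an illustration, not a row copied from the dataset.

```python
import ast

def fetch_director(text):
    """Return the names of all crew entries whose job is 'Director'."""
    return [member["name"] for member in ast.literal_eval(text)
            if member["job"] == "Director"]

# Hand-made sample with only the fields the function reads.
sample = ('[{"job": "Editor", "name": "Stephen E. Rivkin"}, '
          '{"job": "Director", "name": "James Cameron"}]')

print(fetch_director(sample))  # ['James Cameron']
```

The result is a list rather than a single string, so a film with no matching crew entry yields an empty list.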

In [26]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [27]:
movies.sample(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
4137,13075,Sherrybaby,"After serving time in prison, former drug addict Sherry Swanson returns home to reclaim her young daughter from family members who have been raising the child. Sherry's family, especially her sister-in-law, doubt Sherry's ability to be a good mother, and Sherry finds her resolve to stay clean sl...",[Drama],"[independent film, mother daughter relationship, woman director]","[Maggie Gyllenhaal, Michelle Hurst, Sandra Rodríguez, Danny Trejo]",[Laurie Collyer]
3215,1547,The Lost Boys,A mother and her two teenage sons move to a seemingly nice and quiet small coastal California town yet soon find out that it's overrun by bike gangs and vampires. A couple of teenage friends take it upon themselves to hunt down the vampires that they suspect of a few mysterious murders and resto...,"[Horror, Comedy]","[street gang, small town, vampire, comic book, boardwalk, single, amusement park, mother son relationship]","[Jason Patric, Corey Haim, Corey Feldman, Dianne Wiest]",[Joel Schumacher]
2311,340611,Indignation,"In 1951, Marcus Messner, a working-class Jewish student from New Jersey, attends a small Ohio college, where he struggles with anti-Semitism, sexual repression, and the ongoing Korean War.",[Drama],"[based on novel, jewish life, ohio, 1950s]","[Logan Lerman, Sarah Gadon, Tracy Letts, Linda Emond]",[James Schamus]
1492,7220,The Punisher,"When undercover FBI agent Frank Castle's wife and son are slaughtered, he becomes 'the Punisher' -- a ruthless vigilante willing to go to any length to avenge his family.","[Action, Crime, Drama]","[chain, submachine gun, undercover, smuggling, twin brother, marvel comic, one man army, massacre, extreme violence, family reunion, pier]","[Thomas Jane, John Travolta, Will Patton, Roy Scheider]",[Jonathan Hensleigh]
1164,37799,The Social Network,"On a fall night in 2003, Harvard undergrad and computer programming genius Mark Zuckerberg sits down at his computer and heatedly begins working on a new idea. In a fury of blogging and programming, what begins in his dorm room as a small site among friends soon becomes a global social network a...",[Drama],"[hacker, hacking, creator, frat party, social network, deposition, intellectual property, entrepreneur, arrogance, young entrepreneur, facebook]","[Jesse Eisenberg, Andrew Garfield, Justin Timberlake, Armie Hammer]",[David Fincher]


In [28]:
def collapse(L):
    # Remove the spaces inside each entry, e.g. "Sam Worthington" -> "SamWorthington"
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
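
The notebook does not say why spaces are removed. One plausible reason is that each multi-word name then survives as a single token once the tags are split on whitespace. The sketch below illustrates that reading with two names taken from the outputs in this notebook.

```python
def collapse(names):
    """Remove the spaces inside each entry of a list of strings."""
    return [name.replace(" ", "") for name in names]

a = collapse(["Sam Worthington"])
b = collapse(["Sam Mendes"])
print(a, b)  # ['SamWorthington'] ['SamMendes']

# Without collapsing, splitting on whitespace gives both people a shared "Sam" token.
shared_before = set("Sam Worthington".split()) & set("Sam Mendes".split())
shared_after = set(a) & set(b)
print(shared_before)  # {'Sam'}
print(shared_after)   # set()
```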

In [29]:
columns = ['cast', 'crew', 'genres', 'keywords']
for i in columns:
    movies[i] = movies[i].apply(collapse)

In [30]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, society, spacetravel, futuristic, romance, space, alien, tribe, alienplanet, cgi, marine, soldier, battle, loveaffair, antiwar, powerrelations, mindandsoul, 3d]","[SamWorthington, ZoeSaldana, SigourneyWeaver, StephenLang]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems.","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatradingcompany, loveofone'slife, traitor, shipwreck, strongwoman, ship, alliance, calypso, afterlife, fighter, pirate, swashbuckler, aftercreditsstinger]","[JohnnyDepp, OrlandoBloom, KeiraKnightley, StellanSkarsgård]",[GoreVerbinski]
2,206647,Spectre,"A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret service alive, Bond peels back the layers of deceit to reveal the terrible truth behind SPECTRE.","[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, britishsecretservice, unitedkingdom]","[DanielCraig, ChristophWaltz, LéaSeydoux, RalphFiennes]",[SamMendes]
3,49026,The Dark Knight Rises,"Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's crimes to protect the late attorney's reputation and is subsequently hunted by the Gotham City Police Department. Eight years later, Batman encounters the mysterious Selina Kyle and the villainous Bane...","[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretidentity, burglar, hostagedrama, timebomb, gothamcity, vigilante, cover-up, superhero, villainess, tragichero, terrorism, destruction, catwoman, catburglar, imax, flood, criminalunderworld, batman]","[ChristianBale, MichaelCaine, GaryOldman, AnneHathaway]",[ChristopherNolan]
4,49529,John Carter,"John Carter is a war-weary, former military captain who's inexplicably transported to the mysterious and exotic planet of Barsoom (Mars) and reluctantly becomes embroiled in an epic conflict. It's a world on the brink of collapse, and Carter rediscovers his humanity when he realizes the survival...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, princess, alien, steampunk, martian, escape, edgarriceburroughs, alienrace, superhumanstrength, marscivilization, swordandplanet, 19thcentury, 3d]","[TaylorKitsch, LynnCollins, SamanthaMorton, WillemDafoe]",[AndrewStanton]


In [31]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [32]:
movies['tags']= movies['overview']+ movies['genres']+ movies['keywords']+ movies['cast']+ movies['crew']
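
Adding columns whose cells are Python lists concatenates those lists row by row, which is how `tags` ends up holding every token for a film. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

# Made-up one-row frame; each cell holds a Python list.
toy = pd.DataFrame({
    "overview": [["a", "marine"]],
    "genres": [["Action"]],
    "cast": [["SamWorthington"]],
})

# "+" on object columns is applied element by element, so the lists are joined.
toy["tags"] = toy["overview"] + toy["genres"] + toy["cast"]

print(toy["tags"][0])  # ['a', 'marine', 'Action', 'SamWorthington']
```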

In [33]:
df= movies[['movie_id','title','tags']]

In [34]:
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marine, is, dispatched, to, the, moon, Pandora, on, a, unique, mission,, but, becomes, torn, between, following, orders, and, protecting, an, alien, civilization., Action, Adventure, Fantasy, ScienceFiction, cultureclash, future, spacewar, spacecolony, so..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, dead,, has, come, back, to, life, and, is, headed, to, the, edge, of, the, Earth, with, Will, Turner, and, Elizabeth, Swann., But, nothing, is, quite, as, it, seems., Adventure, Fantasy, Action, ocean, drugabuse, exoticisland, eastindiatradingcompany,..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, sends, him, on, a, trail, to, uncover, a, sinister, organization., While, M, battles, political, forces, to, keep, the, secret, service, alive,, Bond, peels, back, the, layers, of, deceit, to, reveal, the, terrible, truth, behind, SPECTRE., Action, Adven..."
3,49026,The Dark Knight Rises,"[Following, the, death, of, District, Attorney, Harvey, Dent,, Batman, assumes, responsibility, for, Dent's, crimes, to, protect, the, late, attorney's, reputation, and, is, subsequently, hunted, by, the, Gotham, City, Police, Department., Eight, years, later,, Batman, encounters, the, mysteriou..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, military, captain, who's, inexplicably, transported, to, the, mysterious, and, exotic, planet, of, Barsoom, (Mars), and, reluctantly, becomes, embroiled, in, an, epic, conflict., It's, a, world, on, the, brink, of, collapse,, and, Carter, rediscovers, hi..."


In [36]:
# function for text cleaning

import re

def clean_text(text):
    text = re.sub("-", "", text)    # remove hyphens
    text = re.sub("'", "", text)    # remove straight apostrophes
    text = re.sub("[^a-zA-Z]"," ",text)    # replace everything except letters with a space
    text = ' '.join(text.split())    # collapse repeated whitespace
    text = text.lower()     # convert text to lowercase
    return text
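
The order of the substitutions matters. Hyphens and straight apostrophes are deleted first, so `war-weary` becomes `warweary` and `Dent's` becomes `dents`. Every remaining non-letter, including digits and the curly apostrophe `’`, is then replaced by a space. The sketch below repeats the same steps in a compact form on two short made-up inputs.

```python
import re

def clean_text(text):
    text = re.sub("-", "", text)           # delete hyphens
    text = re.sub("'", "", text)           # delete straight apostrophes
    text = re.sub("[^a-zA-Z]", " ", text)  # any other non-letter becomes a space
    return " ".join(text.split()).lower()  # collapse whitespace, then lowercase

print(clean_text("In the 22nd century, Dent's war-weary..."))
# in the nd century dents warweary

print(clean_text("who’s guilty"))
# who s guilty
```

Digits are dropped as well, which is why `22nd` shows up as `nd` and `1980s` as `s` in the cleaned output.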

In [37]:
df['tags']= df['tags'].apply(lambda x: " ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['tags']= df['tags'].apply(lambda x: " ".join(x))
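
This is pandas' `SettingWithCopyWarning`. It appears because `df` was created by selecting columns from `movies`, so pandas cannot tell whether the assignment is meant to affect the original frame. A common way to avoid it is to take an explicit `.copy()` when creating the subset. The sketch below shows this on a made-up frame; `movies_toy` and `df_toy` are hypothetical names.

```python
import pandas as pd

movies_toy = pd.DataFrame({"title": ["Avatar"],
                           "tags": [["alien", "marine"]],
                           "extra": [1]})

# .copy() makes df_toy an independent frame, so later column assignments are unambiguous.
df_toy = movies_toy[["title", "tags"]].copy()
df_toy["tags"] = df_toy["tags"].apply(" ".join)

print(df_toy["tags"][0])      # alien marine
print(movies_toy["tags"][0])  # ['alien', 'marine'] (the original frame is unchanged)
```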


In [38]:
df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance ..."


In [39]:
df['clean_tags']=df['tags'].apply(lambda x: clean_text(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['clean_tags']=df['tags'].apply(lambda x: clean_text(x))


In [40]:
df[['tags', 'clean_tags']].sample(5)

Unnamed: 0,tags,clean_tags
3329,A group of print workers in 1980s London club together to buy a race horse. Comedy racehorse BobHoskins JennyAgutter PhilDavis VincentRegan SachaBennett,a group of print workers in s london club together to buy a race horse comedy racehorse bobhoskins jennyagutter phildavis vincentregan sachabennett
3341,Diamonds are stolen only to be sold again in the international market. James Bond infiltrates a smuggling mission to find out who’s guilty. The mission takes him to Las Vegas where Bond meets his archenemy Blofeld. Adventure Action Thriller spy fight secretorganization satellite secretagent plas...,diamonds are stolen only to be sold again in the international market james bond infiltrates a smuggling mission to find out who s guilty the mission takes him to las vegas where bond meets his archenemy blofeld adventure action thriller spy fight secretorganization satellite secretagent plastic...
1383,"In 1993, the Search for Extra Terrestrial Intelligence Project receives a transmission detailing an alien DNA structure, along with instructions on how to splice it with human DNA. The result is Sil, a sensual but deadly creature who can change from a beautiful woman to an armour-plated killing ...",in the search for extra terrestrial intelligence project receives a transmission detailing an alien dna structure along with instructions on how to splice it with human dna the result is sil a sensual but deadly creature who can change from a beautiful woman to an armourplated killing machine in...
4566,"Shipped off to her American dad's ranch for the summer, a teen and her horse Lucky Lad compete for a spot at the National Youth Rodeo. Family KevinSorbo SophieBolen DerekBrandon CarrieBradstreet JoelPaulReisig",shipped off to her american dads ranch for the summer a teen and her horse lucky lad compete for a spot at the national youth rodeo family kevinsorbo sophiebolen derekbrandon carriebradstreet joelpaulreisig
2512,"In 1971, air-conditioner repairman and boat enthusiast Jim McCormick entertains his desire to 'go down' as a legend in the record books when the Gold Cup hydroplane boat race improbably comes to his small town of Madison, Indiana. Immediately, Jim seizes his opportunity to enter the contest. Wit...",in airconditioner repairman and boat enthusiast jim mccormick entertains his desire to go down as a legend in the record books when the gold cup hydroplane boat race improbably comes to his small town of madison indiana immediately jim seizes his opportunity to enter the contest with a motley cr...


In [42]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(text):
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tanmay.dwivedi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
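
To illustrate what the stop-word filter does without downloading the NLTK corpus, the sketch below uses a tiny hand-picked stop-word set. That set is an assumption made for the example and is much smaller than NLTK's English list.

```python
# Tiny hand-picked stop-word set, used only so this sketch runs offline.
stop_words = {"in", "the", "a", "is", "to", "on"}

def remove_stopwords(text):
    """Drop every whitespace-separated token that is in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stopwords("a marine is dispatched to the moon"))
# marine dispatched moon
```

The comparison is case-sensitive, so the filter relies on the text having been lowercased beforehand.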


In [45]:
df['clean_tags'] = df['clean_tags'].apply(lambda x: remove_stopwords(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['clean_tags'] = df['clean_tags'].apply(lambda x: remove_stopwords(x))


In [46]:
df.head()

Unnamed: 0,movie_id,title,tags,clean_tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance ...",nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier...
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the Earth with Will Turner and Elizabeth Swann. But nothing is quite as it seems. Adventure Fantasy Action ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongw...",captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofoneslife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler a...
2,206647,Spectre,"A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. While M battles political forces to keep the secret service alive, Bond peels back the layers of deceit to reveal the terrible truth behind SPECTRE. Action Adventure Crime spy basedonnovel secretagent seq...",cryptic message bond past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre action adventure crime spy basedonnovel secretagent sequel mi britishsecretservice unitedkingdom danielcraig c...
3,49026,The Dark Knight Rises,"Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's crimes to protect the late attorney's reputation and is subsequently hunted by the Gotham City Police Department. Eight years later, Batman encounters the mysterious Selina Kyle and the villainous Bane...",following death district attorney harvey dent batman assumes responsibility dents crimes protect late attorneys reputation subsequently hunted gotham city police department eight years later batman encounters mysterious selina kyle villainous bane new terrorist leader overwhelms gothams finest d...
4,49529,John Carter,"John Carter is a war-weary, former military captain who's inexplicably transported to the mysterious and exotic planet of Barsoom (Mars) and reluctantly becomes embroiled in an epic conflict. It's a world on the brink of collapse, and Carter rediscovers his humanity when he realizes the survival...",john carter warweary former military captain whos inexplicably transported mysterious exotic planet barsoom mars reluctantly becomes embroiled epic conflict world brink collapse carter rediscovers humanity realizes survival barsoom people rests hands action adventure sciencefiction basedonnovel ...


In [48]:
# errors='ignore' keeps this cell safe to re-run once 'tags' has already been dropped
df = df.drop(columns='tags', errors='ignore')

In [49]:
df.head()

Unnamed: 0,movie_id,title,clean_tags
0,19995,Avatar,nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization action adventure fantasy sciencefiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier...
1,285,Pirates of the Caribbean: At World's End,captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems adventure fantasy action ocean drugabuse exoticisland eastindiatradingcompany loveofoneslife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler a...
2,206647,Spectre,cryptic message bond past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre action adventure crime spy basedonnovel secretagent sequel mi britishsecretservice unitedkingdom danielcraig c...
3,49026,The Dark Knight Rises,following death district attorney harvey dent batman assumes responsibility dents crimes protect late attorneys reputation subsequently hunted gotham city police department eight years later batman encounters mysterious selina kyle villainous bane new terrorist leader overwhelms gothams finest d...
4,49529,John Carter,john carter warweary former military captain whos inexplicably transported mysterious exotic planet barsoom mars reluctantly becomes embroiled epic conflict world brink collapse carter rediscovers humanity realizes survival barsoom people rests hands action adventure sciencefiction basedonnovel ...


In [50]:
df.to_csv('tags.csv')
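
By default, `to_csv` writes the row index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, passing `index=False` is one option. The sketch below compares both on a made-up frame and writes to an in-memory buffer rather than to `tags.csv`.

```python
import io
import pandas as pd

out = pd.DataFrame({"movie_id": [19995],
                    "title": ["Avatar"],
                    "clean_tags": ["alien marine"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
out.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'movie_id', 'title', 'clean_tags']

# index=False leaves the index out of the file.
without_index = io.StringIO()
out.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['movie_id', 'title', 'clean_tags']
```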