<h1 style = "font-family: garamond; font-size: 50px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Movie Plot Summarizer</h1>
<h1 style = "font-family: garamond; font-size: 30px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Wrangling</h1>

<h1 style = "font-family: garamond; font-size: 45px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Table of Contents</h1>


* [1. Introduction](#1)
    * [1.1 Problem Statement](#1.1)
    * [1.2 Questions we want to answer](#1.2)
    * [1.3 Libraries](#1.3)
* [2. Data Wrangling](#2)
    * [2.1 Data Loading](#2.1)
    * [2.2 Data Cleaning and Merging](#2.2)
* [3. Summary](#3)

<a id = '1'></a>
<h1 style = "font-family: garamond; font-size: 45px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Introduction</h1>

Over the past few years, Natural Language Processing (NLP) capabilities and applications have grown rapidly, and text summarization is a big part of that growth. Text summarization is the task of generating accurate, coherent summaries of long pieces of text. There are two fundamental approaches: extractive and abstractive. The extractive approach selects exact words, phrases and sentences from the original text to build a summary. The abstractive approach learns an internal representation of the text and generates new sentences for the summary.
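
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top-scoring sentence verbatim. The function name `extractive_summary`, the scoring rule and the toy text are illustrative assumptions, not the method used later in this project.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text (lowercased, letters only)
    word_counts = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the best ones, restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return ' '.join(sentences[i] for i in chosen)

# Made-up toy text, only for illustration
plot = ("A bartender works at a saloon. "
        "A group of women storm the saloon and smash the saloon fixtures. "
        "The police arrive.")
summary = extractive_summary(plot, num_sentences=1)
print(summary)
```

Every sentence in the output is copied from the input, which is the defining property of an extractive summary. An abstractive model would instead generate new wording.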

<a id = '1.1'></a>
<h3 style = "font-family:garamond; font-size:35px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Problem Statement</h3>

The goal is to determine which approach, abstractive or extractive, gives a better summary. One objective measure we will use to compare the approaches is the cosine similarity between the generated summary and the actual summary. Another is the ROUGE score. Unfortunately, neither measure captures what matters most: how coherent and accurate the generated summaries are. At this point, humans are simply better at evaluating summaries, so we will also rely on our subjective judgment of the accuracy and coherence of the generated text.
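
As a rough illustration of the two objective measures, the sketch below computes cosine similarity over bag-of-words counts and a hand-rolled ROUGE-1 recall. Both are simplifying assumptions: the project may use a different vectorization (for example TF-IDF or embeddings) and a dedicated ROUGE package, and the helper names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs only
    return re.findall(r'[a-z0-9]+', text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def rouge1_recall(generated, reference):
    """Share of reference unigrams (with multiplicity) also found in generated."""
    gen, ref = Counter(tokenize(generated)), Counter(tokenize(reference))
    overlap = sum(min(count, gen[w]) for w, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Made-up sentences, only for illustration
reference = "the toys live happily in the room"
generated = "the toys live in the room"
cos = cosine_similarity(generated, reference)
r1 = rouge1_recall(generated, reference)
print(round(cos, 3), round(r1, 3))
```

Both scores reward word overlap only, so a summary can score well while being incoherent, which is why the subjective review is still needed.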

<a id = '1.2'></a>
<h3 style = "font-family:garamond; font-size:35px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Questions/Problems We Want to Explore</h3>

1. Which approach produces better objective measures?
2. Which approach seems subjectively better?

<a id = '1.3'></a>
<h3 style = "font-family:garamond; font-size:35px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Importing Libraries</h3>

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import re
import ast
import statistics

<a id = '2'></a>
<h1 style = "font-family: garamond; font-size: 45px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Data Wrangling</h1>

<a id = '2.1'></a>
<h3 style = "font-family:garamond; font-size:35px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Loading the Data</h3>

In [2]:
movie_data1 = pd.read_csv('../raw_data/wiki_movie_plots_deduped.csv')
movie_data2 = pd.read_csv('../raw_data/movies_metadata.csv')


<h3 style = "font-family:garamond; font-size:25px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Data Definition and Understanding the Data</h3>

In [3]:
movie_data1.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [4]:
movie_data2.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [5]:
movie_data1.dtypes

Release Year         int64
Title               object
Origin/Ethnicity    object
Director            object
Cast                object
Genre               object
Wiki Page           object
Plot                object
dtype: object

In [6]:
movie_data2.dtypes

adult                     object
belongs_to_collection     object
budget                    object
genres                    object
homepage                  object
id                        object
imdb_id                   object
original_language         object
original_title            object
overview                  object
popularity                object
poster_path               object
production_companies      object
production_countries      object
release_date              object
revenue                  float64
runtime                  float64
spoken_languages          object
status                    object
tagline                   object
title                     object
video                     object
vote_average             float64
vote_count               float64
dtype: object

<a id = '2.2'></a>
<h3 style = "font-family:garamond; font-size:35px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Data Cleaning and Merging</h3>

<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Unpack the genres column of movie_data2</h3>

In [7]:
ast.literal_eval(movie_data2['genres'][0])

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [8]:
movie_data2 = movie_data2.drop(['adult', 'belongs_to_collection', 'original_language', 'tagline', 'budget', 'homepage','id','imdb_id', 'original_title', 'poster_path', 'production_companies', 'production_countries', 'revenue','runtime','spoken_languages', 'status','video'], axis=1)
movie_data2 = movie_data2.rename(columns={'title':'Title'})

In [9]:
def genres_to_string(raw):
    '''Turn a stringified list of {'id', 'name'} dicts into "Name1 Name2"; empty lists are left as-is.'''
    parsed = ast.literal_eval(raw)
    if not parsed:
        return raw
    return ' '.join(item['name'] for item in parsed)

# Assign the whole column at once instead of chained indexing, which triggered SettingWithCopyWarning
movie_data2['genres'] = movie_data2['genres'].apply(genres_to_string)


<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Change release date to year only</h3>

In [10]:
movie_data2 = movie_data2.rename(columns={'genres':'Genre', 'overview':'Overview', 'popularity':'Popularity', 'release_date':'Release_Date', 'vote_average':'Vote_Average', 'vote_count':'Vote_Count'})

In [11]:
movie_data1['Release Year'] = pd.to_datetime(movie_data1['Release Year'], format= '%Y', errors='coerce').dt.strftime('%Y')
movie_data2['Release_Date'] = pd.to_datetime(movie_data2['Release_Date'], errors='coerce').dt.strftime('%Y')
movie_data2 = movie_data2.rename(columns={'Release_Date':'Release Year'})
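
The conversion above can be checked on a few toy values. This is a small sketch with made-up inputs showing that `errors='coerce'` turns unparseable or missing dates into missing values, and that `strftime('%Y')` returns the year as a string rather than an integer.

```python
import pandas as pd

# Made-up values: one valid date, one unparseable string, one missing value
dates = pd.Series(['1995-10-30', 'not a date', None])
years = pd.to_datetime(dates, errors='coerce').dt.strftime('%Y')
print(years.tolist())

# For a column that already holds integer years, a plain string cast
# gives the same kind of result without going through datetime
int_years = pd.Series([1901, 1902])
year_strings = int_years.astype(str)
print(year_strings.tolist())
```

Keeping both year columns as strings matters later, because the fill step copies values from one column into the other.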

<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Merge the two dataframes</h3>

In [12]:
movie_data1.columns, movie_data2.columns

(Index(['Release Year', 'Title', 'Origin/Ethnicity', 'Director', 'Cast',
        'Genre', 'Wiki Page', 'Plot'],
       dtype='object'),
 Index(['Genre', 'Overview', 'Popularity', 'Release Year', 'Title',
        'Vote_Average', 'Vote_Count'],
       dtype='object'))

In [13]:
df = pd.merge(movie_data1, movie_data2, how='outer', on='Title', suffixes=('_wiki', '_other'))
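
A toy version of this merge shows what `how='outer'` and `suffixes` do. The two small frames are made up for illustration. The `indicator=True` argument and the second merge on both title and year are optional extras, not something this notebook uses. Matching on `Title` alone pairs films that share a name but not a year, so whether to match on year as well is a design decision worth checking against the data.

```python
import pandas as pd

# Made-up frames standing in for the two datasets
wiki = pd.DataFrame({'Title': ['Jack and the Beanstalk', 'Only In Wiki'],
                     'Release Year': ['1902', '1901']})
other = pd.DataFrame({'Title': ['Jack and the Beanstalk', 'Only In Other'],
                      'Release Year': ['1952', '1995']})

# Same call shape as above: overlapping column names get the suffixes
by_title = pd.merge(wiki, other, how='outer', on='Title',
                    suffixes=('_wiki', '_other'), indicator=True)
print(by_title)

# Stricter alternative: a row only matches when title and year agree
by_title_and_year = pd.merge(wiki, other, how='outer',
                             on=['Title', 'Release Year'], indicator=True)
print(by_title_and_year)
```

The `_merge` column reports whether each row came from the left frame, the right frame, or both, which is a quick way to count how many titles actually matched.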

In [14]:
df.head()

Unnamed: 0,Release Year_wiki,Title,Origin/Ethnicity,Director,Cast,Genre_wiki,Wiki Page,Plot,Genre_other,Overview,Popularity,Release Year_other,Vote_Average,Vote_Count
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",,,,,,
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",,,,,,
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",,,,,,
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,,,,,,
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,Comedy Family Fantasy,Abbott and Costello's version of the famous fa...,1.55627,1952.0,6.0,12.0


<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Fill missing values in _wiki columns with _other columns and drop _other columns</h3>

In [15]:
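def fill_left_null(df, left, right):
    '''Fill missing or 'unknown' values in column `left` with values from column `right`.

    df: the dataframe to modify in place.
    left: the column whose missing values should be filled.
    right: the column that supplies the replacement values.
    '''
    # Rows where `left` needs a value and `right` has a usable one
    needs_fill = df[left].isna() | (df[left] == 'unknown')
    has_value = df[right].notna() & (df[right] != '[]')
    mask = needs_fill & has_value
    # A single .loc assignment avoids chained indexing and SettingWithCopyWarning
    df.loc[mask, left] = df.loc[mask, right].astype(str)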
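
The fill rule can be traced on a toy frame. This sketch uses made-up values and applies the same conditions with boolean masks, to show which rows get filled and which stay missing.

```python
import pandas as pd

# Made-up rows covering each case of the fill rule
toy = pd.DataFrame({
    'Genre_wiki':  ['drama', 'unknown',       None, None],
    'Genre_other': ['Comedy', 'Comedy Family', '[]', None],
})

# A row is filled only if the left value is missing or 'unknown'
# and the right value is present and not the empty-list marker '[]'
needs_fill = toy['Genre_wiki'].isna() | (toy['Genre_wiki'] == 'unknown')
has_value = toy['Genre_other'].notna() & (toy['Genre_other'] != '[]')
mask = needs_fill & has_value
toy.loc[mask, 'Genre_wiki'] = toy.loc[mask, 'Genre_other'].astype(str)
print(toy)
```

Only the second row changes. The first already has a genre, and the last two have nothing usable on the right, which is why some nulls remain after the fill.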

In [16]:
df['Release Year_wiki'].isnull().sum()

28352

In [17]:
fill_left_null(df, 'Release Year_wiki', 'Release Year_other')


In [18]:
df['Release Year_wiki'].isnull().sum()

87

In [19]:
df['Genre_wiki'].isnull().sum()

28352

In [20]:
fill_left_null(df, 'Genre_wiki', 'Genre_other')


In [21]:
df['Genre_wiki'].isnull().sum()

2123

In [22]:
df.head()

Unnamed: 0,Release Year_wiki,Title,Origin/Ethnicity,Director,Cast,Genre_wiki,Wiki Page,Plot,Genre_other,Overview,Popularity,Release Year_other,Vote_Average,Vote_Count
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr...",,,,,,
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov...",,,,,,
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed...",,,,,,
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...,,,,,,
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,Comedy Family Fantasy,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,Comedy Family Fantasy,Abbott and Costello's version of the famous fa...,1.55627,1952.0,6.0,12.0


In [23]:
df = df.drop(['Origin/Ethnicity', 'Director', 'Cast', 'Wiki Page', 'Genre_other', 'Release Year_other', 'Popularity', 'Vote_Average', 'Vote_Count'], axis=1)
df = df.rename(columns={'Genre_wiki':'Genre', 'Release Year_wiki': 'Release Year'})
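
The genre clean-up in the next cell is a long chain of `str.replace` calls that mixes literal strings with regular expressions. In pandas 2.0 and later, `str.replace` treats the pattern as a literal string unless `regex=True` is passed, so patterns such as `' \(film genre\)'` only behave as intended when the mode is stated. The sketch below shows one way to make this explicit with a rule table. The `RULES` list is a small hypothetical subset and `apply_rules` is an illustrative helper, not the function used here.

```python
import pandas as pd

# Hypothetical subset of rules: (pattern, replacement, is_regex)
RULES = [
    (' / ', '|', False),
    (', ', '|', False),
    ('sci-fi', 'science_fiction', False),
    (r' \(film genre\)', '', True),   # escaped parentheses need regex mode
    (r'( based).+', '', True),        # drop everything from " based" onward
    ('.', '', False),                 # literal period, not the regex wildcard
]

def apply_rules(genres, rules):
    """Apply each replacement in order, stating regex vs. literal explicitly."""
    cleaned = genres.str.strip()
    for pattern, replacement, is_regex in rules:
        cleaned = cleaned.str.replace(pattern, replacement, regex=is_regex)
    return cleaned

# Made-up genre labels, only for illustration
sample = pd.Series(['comedy / drama', 'sci-fi, horror',
                    'western (film genre)', 'drama based on a novel.'])
cleaned = apply_rules(sample, RULES)
print(cleaned.tolist())
```

Rule order still matters in this form, just as it does in the chained version: a later rule sees the output of every earlier one.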

In [24]:
def Data_Cleaning(Genre):
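    '''
    Normalize the free-text genre labels into a new 'Genre_improved' column.
    Works on the module-level dataframe df, starting from a copy of df['Genre'].
    Adapted from the Kaggle kernel https://www.kaggle.com/aminejallouli/genre-classification-based-on-wiki-movies-plots
    '''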
    df['Genre_improved'] = df['Genre']
    df['Genre_improved']=df['Genre_improved'].str.strip()
    df['Genre_improved']=df['Genre_improved'].str.replace(' - ', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace(' / ', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace('/', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace(' & ', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace(', ', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace('; ', '|')
    df['Genre_improved']=df['Genre_improved'].str.replace('bio-pic', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('biopic', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('biographical', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('biodrama', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('bio-drama', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('biographic', 'biography')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(film genre\)', '')
    df['Genre_improved']=df['Genre_improved'].str.replace('animated','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('anime','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('children\'s','children')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedey','comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('\[not in citation given\]','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' set 4,000 years ago in the canadian arctic','')
    df['Genre_improved']=df['Genre_improved'].str.replace('historical','history')
    df['Genre_improved']=df['Genre_improved'].str.replace('romantic','romance')
    df['Genre_improved']=df['Genre_improved'].str.replace('3-d','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('3d','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('viacom 18 motion pictures','')
    df['Genre_improved']=df['Genre_improved'].str.replace('sci-fi','science_fiction')
    df['Genre_improved']=df['Genre_improved'].str.replace('ttriller','thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('.','', regex=False)
    df['Genre_improved']=df['Genre_improved'].str.replace('based on radio serial','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' on the early years of hitler','')
    df['Genre_improved']=df['Genre_improved'].str.replace('sci fi','science_fiction')
    df['Genre_improved']=df['Genre_improved'].str.replace('science fiction','science_fiction')
    df['Genre_improved']=df['Genre_improved'].str.replace(' (30min)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('16 mm film','short')
    df['Genre_improved']=df['Genre_improved'].str.replace('\[140\]','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('\[144\]','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' for ','')
    df['Genre_improved']=df['Genre_improved'].str.replace('adventures','adventure')
    df['Genre_improved']=df['Genre_improved'].str.replace('kung fu','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('kung-fu','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('martial arts','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('world war ii','war')
    df['Genre_improved']=df['Genre_improved'].str.replace('world war i','war')
    df['Genre_improved']=df['Genre_improved'].str.replace('biography about montreal canadiens star|maurice richard','biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('bholenath movies|cinekorn entertainment','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(volleyball\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('spy film','spy')
    df['Genre_improved']=df['Genre_improved'].str.replace('anthology film','anthology')
    df['Genre_improved']=df['Genre_improved'].str.replace('biography fim','biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('avant-garde','avant_garde')
    df['Genre_improved']=df['Genre_improved'].str.replace('biker film','biker')
    df['Genre_improved']=df['Genre_improved'].str.replace('buddy cop','buddy')
    df['Genre_improved']=df['Genre_improved'].str.replace('buddy film','buddy')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedy 2-reeler','comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('films','')
    df['Genre_improved']=df['Genre_improved'].str.replace('film','')
    df['Genre_improved']=df['Genre_improved'].str.replace('biography of pioneering american photographer eadweard muybridge','biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('british-german co-production','')
    df['Genre_improved']=df['Genre_improved'].str.replace('bruceploitation','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedy-drama adaptation of the mordecai richler novel','comedy-drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('movies by the mob\|knkspl','')
    df['Genre_improved']=df['Genre_improved'].str.replace('movies','')
    df['Genre_improved']=df['Genre_improved'].str.replace('movie','')
    df['Genre_improved']=df['Genre_improved'].str.replace('coming of age','coming_of_age')
    df['Genre_improved']=df['Genre_improved'].str.replace('coming-of-age','coming_of_age')
    df['Genre_improved']=df['Genre_improved'].str.replace('drama about child soldiers','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('(( based).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('(( co-produced).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('(( adapted).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('(( about).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('musical b','musical')
    df['Genre_improved']=df['Genre_improved'].str.replace('animationchildren','animation|children')
    df['Genre_improved']=df['Genre_improved'].str.replace(' period','period')
    df['Genre_improved']=df['Genre_improved'].str.replace('drama loosely','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(aquatics|swimming\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace("yogesh dattatraya gosavi's directorial debut \[9\]",'')
    df['Genre_improved']=df['Genre_improved'].str.replace("war-time","war")
    df['Genre_improved']=df['Genre_improved'].str.replace("wartime","war")
    df['Genre_improved']=df['Genre_improved'].str.replace("ww1","war")
    df['Genre_improved']=df['Genre_improved'].str.replace('unknown','')
    df['Genre_improved']=df['Genre_improved'].str.replace("wwii","war")
    df['Genre_improved']=df['Genre_improved'].str.replace('psychological','psycho')
    df['Genre_improved']=df['Genre_improved'].str.replace('rom-coms','romance')
    df['Genre_improved']=df['Genre_improved'].str.replace('true crime','crime')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|007','')
    df['Genre_improved']=df['Genre_improved'].str.replace('slice of life','slice_of_life')
    df['Genre_improved']=df['Genre_improved'].str.replace('computer animation','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('gun fu','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('j-horror','horror')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(shogi|chess\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('afghan war drama','war drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|6 separate stories','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(30min\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' (road bicycle racing)','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' v-cinema','')
    df['Genre_improved']=df['Genre_improved'].str.replace('tv miniseries','tv_miniseries')
    df['Genre_improved']=df['Genre_improved'].str.replace('|docudrama','|documentary|drama', regex=False)
    df['Genre_improved']=df['Genre_improved'].str.replace(' in animation','|animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('((adaptation).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('((adaptated).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('((adapted).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('(( on ).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('american football','sports')
    df['Genre_improved']=df['Genre_improved'].str.replace('dev\|nusrat jahan','sports')
    df['Genre_improved']=df['Genre_improved'].str.replace('television miniseries','tv_miniseries')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(artistic\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \|direct-to-dvd','')
    df['Genre_improved']=df['Genre_improved'].str.replace('history dram','history drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('martial art','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('psycho thriller,','psycho thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|1 girl\|3 suitors','')
    df['Genre_improved']=df['Genre_improved'].str.replace(' \(road bicycle racing\)','')
    filterE = df['Genre_improved']=="ero"
    df.loc[filterE,'Genre_improved']="adult"
    filterE = df['Genre_improved']=="music"
    df.loc[filterE,'Genre_improved']="musical"
    filterE = df['Genre_improved']=="-"
    df.loc[filterE,'Genre_improved']=''
    filterE = df['Genre_improved']=="comedy–drama"
    df.loc[filterE,'Genre_improved'] = "comedy|drama"
    filterE = df['Genre_improved']=="comedy–horror"
    df.loc[filterE,'Genre_improved'] = "comedy|horror"
    
    df['Genre_improved']=df['Genre_improved'].str.replace(' ','|')
    df['Genre_improved']=df['Genre_improved'].str.replace(',','|')
    df['Genre_improved']=df['Genre_improved'].str.replace('-','')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionadventure','action|adventure')
    df['Genre_improved']=df['Genre_improved'].str.replace('actioncomedy','action|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('actiondrama','action|drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionlove','action|love')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionmasala','action|masala')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionchildren','action|children')

    df['Genre_improved']=df['Genre_improved'].str.replace('fantasychildren\|','fantasy|children')
    df['Genre_improved']=df['Genre_improved'].str.replace('fantasycomedy','fantasy|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('fantasyperiod','fantasy|period')
    df['Genre_improved']=df['Genre_improved'].str.replace('cbctv_miniseries','tv_miniseries')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramacomedy','drama|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramacomedysocial','drama|comedy|social')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramathriller','drama|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedydrama','comedy|drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramathriller','drama|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedyhorror','comedy|horror')
    df['Genre_improved']=df['Genre_improved'].str.replace('sciencefiction','science_fiction')
    df['Genre_improved']=df['Genre_improved'].str.replace('adventurecomedy','adventure|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('animationdrama','animation|drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|\|','|')
    df['Genre_improved']=df['Genre_improved'].str.replace('muslim','religious')
    df['Genre_improved']=df['Genre_improved'].str.replace('thriler','thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('crimethriller','crime|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('fantay','fantasy')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionthriller','action|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedysocial','comedy|social')
    df['Genre_improved']=df['Genre_improved'].str.replace('martialarts','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|\(children\|poker\|karuta\)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('epichistory','epic|history')

    df['Genre_improved']=df['Genre_improved'].str.replace('erotica','adult')
    df['Genre_improved']=df['Genre_improved'].str.replace('erotic','adult')

    df['Genre_improved']=df['Genre_improved'].str.replace('((\|produced\|).+)','')
    df['Genre_improved']=df['Genre_improved'].str.replace('chanbara','chambara')
    df['Genre_improved']=df['Genre_improved'].str.replace('comedythriller','comedy|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('biblical','religious')
    df['Genre_improved']=df['Genre_improved'].str.replace('colour\|yellow\|productions\|eros\|international','')
    df['Genre_improved']=df['Genre_improved'].str.replace('\|directtodvd','')
    df['Genre_improved']=df['Genre_improved'].str.replace('liveaction','live|action')
    df['Genre_improved']=df['Genre_improved'].str.replace('melodrama','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('superheroes','superheroe')
    df['Genre_improved']=df['Genre_improved'].str.replace('gangsterthriller','gangster|thriller')

    df['Genre_improved']=df['Genre_improved'].str.replace('heistcomedy','comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('heist','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('historic','history')
    df['Genre_improved']=df['Genre_improved'].str.replace('historydisaster','history|disaster')
    df['Genre_improved']=df['Genre_improved'].str.replace('warcomedy','war|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('westerncomedy','western|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('ancientcostume','costume')
    df['Genre_improved']=df['Genre_improved'].str.replace('computeranimation','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramatic','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('familya','family')
    df['Genre_improved']=df['Genre_improved'].str.replace('familya','family')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramedy','drama|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('dramaa','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('famil\|','family')

    df['Genre_improved']=df['Genre_improved'].str.replace('superheroe','superhero')
    df['Genre_improved']=df['Genre_improved'].str.replace('biogtaphy','biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('devotionalbiography','devotional|biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('docufiction','documentary|fiction')

    df['Genre_improved']=df['Genre_improved'].str.replace('familydrama','family|drama')

    df['Genre_improved']=df['Genre_improved'].str.replace('espionage','spy')
    df['Genre_improved']=df['Genre_improved'].str.replace('supeheroes','superhero')
    df['Genre_improved']=df['Genre_improved'].str.replace('romancefiction','romance|fiction')
    df['Genre_improved']=df['Genre_improved'].str.replace('horrorthriller','horror|thriller')

    df['Genre_improved']=df['Genre_improved'].str.replace('suspensethriller','suspense|thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('musicaliography','musical|biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('triller','thriller')

    df['Genre_improved']=df['Genre_improved'].str.replace('\|\(fiction\)','|fiction')

    df['Genre_improved']=df['Genre_improved'].str.replace('romanceaction','romance|action')
    df['Genre_improved']=df['Genre_improved'].str.replace('romancecomedy','romance|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('romancehorror','romance|horror')

    df['Genre_improved']=df['Genre_improved'].str.replace('romcom','romance|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('rom\|com','romance|comedy')
    df['Genre_improved']=df['Genre_improved'].str.replace('satirical','satire')

    df['Genre_improved']=df['Genre_improved'].str.replace('science_fictionchildren','science_fiction|children')
    df['Genre_improved']=df['Genre_improved'].str.replace('homosexual','adult')
    df['Genre_improved']=df['Genre_improved'].str.replace('sexual','adult')

    df['Genre_improved']=df['Genre_improved'].str.replace('mockumentary','documentary')
    df['Genre_improved']=df['Genre_improved'].str.replace('periodic','period')
    df['Genre_improved']=df['Genre_improved'].str.replace('romanctic','romance')
    df['Genre_improved']=df['Genre_improved'].str.replace('politics','political')
    df['Genre_improved']=df['Genre_improved'].str.replace('samurai','martial_arts')
    df['Genre_improved']=df['Genre_improved'].str.replace('tv_miniseries','series')
    df['Genre_improved']=df['Genre_improved'].str.replace('serial','series')

    filterE = df['Genre_improved']=="musical–comedy"
    df.loc[filterE,'Genre_improved'] = "musical|comedy"

    filterE = df['Genre_improved']=="roman|porno"
    df.loc[filterE,'Genre_improved'] = "adult"


    filterE = df['Genre_improved']=="action—masala"
    df.loc[filterE,'Genre_improved'] = "action|masala"


    filterE = df['Genre_improved']=="horror–thriller"
    df.loc[filterE,'Genre_improved'] = "horror|thriller"

    df['Genre_improved']=df['Genre_improved'].str.replace('family','children')
    df['Genre_improved']=df['Genre_improved'].str.replace('martial_arts','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('horror','thriller')
    df['Genre_improved']=df['Genre_improved'].str.replace('war','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('adventure','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('science_fiction','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('western','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('noir','black')
    df['Genre_improved']=df['Genre_improved'].str.replace('spy','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('superhero','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('social','')
    df['Genre_improved']=df['Genre_improved'].str.replace('suspense','action')
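    # handle 'sexploitation' before the shorter 'sex' so it is not left as 'adultploitation'
    df['Genre_improved']=df['Genre_improved'].str.replace('sexploitation','adult')
    df['Genre_improved']=df['Genre_improved'].str.replace('sex','adult')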


    filterE = df['Genre_improved']=="drama|romance|adult|children"
    df.loc[filterE,'Genre_improved'] = "drama|romance|adult"

    df['Genre_improved']=df['Genre_improved'].str.replace(r'\|–\|','|', regex=True)
    # str.strip takes a set of characters, not a regex, so the pipe needs no escaping
    df['Genre_improved']=df['Genre_improved'].str.strip(to_strip='|')
    df['Genre_improved']=df['Genre_improved'].str.replace('actionner','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('love','romance')
    df['Genre_improved']=df['Genre_improved'].str.replace('crime','mystery')
    df['Genre_improved']=df['Genre_improved'].str.replace('kids','children')
    df['Genre_improved']=df['Genre_improved'].str.replace('boxing','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('buddy','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('cartoon','animation')
    df['Genre_improved']=df['Genre_improved'].str.replace('cinema','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('religious','supernatural')
    df['Genre_improved']=df['Genre_improved'].str.replace('christian','supernatural')
    df['Genre_improved']=df['Genre_improved'].str.replace('lgbtthemed','romance')
    df['Genre_improved']=df['Genre_improved'].str.replace('detective','mystery')
    df['Genre_improved']=df['Genre_improved'].str.replace('nature','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('fiction','drama')
    # replace the longer token first; otherwise 'musical' turns into 'artisticical'
    df['Genre_improved']=df['Genre_improved'].str.replace('musical','artistic')
    df['Genre_improved']=df['Genre_improved'].str.replace('music','artistic')
    df['Genre_improved']=df['Genre_improved'].str.replace('short','artistic')
    df['Genre_improved']=df['Genre_improved'].str.replace('mythology','supernatural')
    df['Genre_improved']=df['Genre_improved'].str.replace('mythological','supernatural')
    df['Genre_improved']=df['Genre_improved'].str.replace('masala','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('military','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('sexploitation','adult')
    df['Genre_improved']=df['Genre_improved'].str.replace('tragedy','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('murder','mystery')
    df['Genre_improved']=df['Genre_improved'].str.replace('disaster','drama')
    df['Genre_improved']=df['Genre_improved'].str.replace('documentary','biography')
    df['Genre_improved']=df['Genre_improved'].str.replace('dance','artistic')
    df['Genre_improved']=df['Genre_improved'].str.replace('cowboy','action')
    df['Genre_improved']=df['Genre_improved'].str.replace('anthology','artistic')
    df['Genre_improved']=df['Genre_improved'].str.replace('artistical','artistic')
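    # skip 'art' where it already starts 'artistic', so existing values are not expanded to 'artisticistic'
    df['Genre_improved']=df['Genre_improved'].str.replace(r'art(?!istic)','artistic', regex=True)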
    df['Genre_improved']=df['Genre_improved'].str.strip()
    return df['Genre_improved']

In [25]:
df['Genre_improved'] = Data_Cleaning(df['Genre'])
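
The chain of `str.replace` calls in the cleaning function is order-sensitive: a short pattern applied early can alter text that a later, longer pattern expects. One possible alternative, sketched here under the assumption that genres are lowercase tokens separated by `|`, is to split each string and map whole tokens through a dictionary. The names `GENRE_MAP` and `normalize_genres`, and the few mappings shown, are illustrative only and are not the notebook's actual function.

```python
import pandas as pd

# Hypothetical whole-token mapping; the entries mirror a few replacements above.
GENRE_MAP = {
    "music": "artistic",
    "musical": "artistic",
    "art": "artistic",
    "horror": "thriller",
    "sexploitation": "adult",
    "romcom": "romance|comedy",
}

def normalize_genres(genres: pd.Series) -> pd.Series:
    """Map each pipe-separated token through GENRE_MAP, dropping repeats."""
    def fix(value):
        if pd.isna(value):
            return value
        out = []
        for token in str(value).lower().split("|"):
            token = token.strip()
            if not token:
                continue
            # A mapped value may itself expand to several tokens.
            for mapped in GENRE_MAP.get(token, token).split("|"):
                if mapped not in out:
                    out.append(mapped)
        return "|".join(out)
    return genres.map(fix)

sample = pd.Series(["musical|horror", "art", "romcom|music", None])
cleaned = normalize_genres(sample)
```

Because whole tokens are matched, the order of the dictionary entries does not matter, and a value such as `artistic` is never rewritten a second time.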

In [26]:
# take explicit copies so later changes to df cannot affect the saved frames
df_rec = df.copy()
df_full = df.copy()

df_rec is ready to be used for other purposes (e.g. a recommendation system or a genre predictor). df_full keeps the full list of movies, not only those that pass the filters applied below.

In [27]:
# save the data to a new csv file
df_rec.to_csv(r'C:\Users\rotzn\gitProjects\DS program\DS projects\cap 3\data\df_rec.csv', header=True, index=False)
df_full.to_csv(r'C:\Users\rotzn\gitProjects\DS program\DS projects\cap 3\data\df_full.csv', header=True, index=False)

<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Find only movies with both 'Plot' and 'Overview'</h3>

In [28]:
df = df[(pd.notna(df['Plot'])) & (pd.notna(df['Overview']))]

In [29]:
df = df.rename(columns={'Genre_wiki':'Genre', 'Release Year_wiki': 'Release Year'})

In [30]:
df.head()

Unnamed: 0,Release Year,Title,Genre,Plot,Overview,Genre_improved
4,1902,Jack and the Beanstalk,Comedy Family Fantasy,The earliest known adaptation of the classic f...,Abbott and Costello's version of the famous fa...,Comedy|Family|Fantasy
5,1902,Jack and the Beanstalk,Adventure Comedy Family Fantasy,The earliest known adaptation of the classic f...,A fairy tale character who is about to flunk o...,Adventure|Comedy|Family|Fantasy
6,1902,Jack and the Beanstalk,Adventure Family Fantasy,The earliest known adaptation of the classic f...,Porter's sequential continuity editing links s...,Adventure|Family|Fantasy
7,1902,Jack and the Beanstalk,Adventure Animation Fantasy,The earliest known adaptation of the classic f...,Innovative and enchanting adaptation of this w...,Adventure|Animation|Fantasy
8,1952,Jack and the Beanstalk,comedy,Mr. Dinkle and Jack (Abbott and Costello) look...,Abbott and Costello's version of the famous fa...,comedy


<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Find and drop duplicate rows</h3>

In [31]:
df.duplicated().sum()

62

In [32]:
df = df.drop_duplicates()

In [33]:
df.duplicated(subset=['Title', 'Release Year']).sum()

4356

In [34]:
df = df.drop_duplicates(subset=['Title', 'Release Year'])

In [35]:
df.duplicated(subset=['Title', 'Release Year']).sum()

0
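
`drop_duplicates(subset=...)` keeps the first row of each `(Title, Release Year)` group by default and discards the rest, so which 'Overview' survives depends on row order. The toy frame below uses invented data and is only a small sketch of that behavior.

```python
import pandas as pd

# Invented rows: the first two entries share the same title and year.
toy = pd.DataFrame({
    "Title": ["Film A", "Film A", "Film B"],
    "Release Year": ["1902", "1902", "1952"],
    "Overview": ["first overview", "second overview", "other overview"],
})

# Count rows that repeat an earlier (Title, Release Year) pair.
n_dupes = toy.duplicated(subset=["Title", "Release Year"]).sum()

# keep="first" is the default: the earliest row in each group survives.
deduped = toy.drop_duplicates(subset=["Title", "Release Year"]).reset_index(drop=True)
```

If a different row should win (for example the longest 'Overview'), sorting the frame before calling `drop_duplicates` is one way to control which row comes first.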

In [36]:
df = df.reset_index(drop=True)

In [37]:
df.head()

Unnamed: 0,Release Year,Title,Genre,Plot,Overview,Genre_improved
0,1902,Jack and the Beanstalk,Comedy Family Fantasy,The earliest known adaptation of the classic f...,Abbott and Costello's version of the famous fa...,Comedy|Family|Fantasy
1,1952,Jack and the Beanstalk,comedy,Mr. Dinkle and Jack (Abbott and Costello) look...,Abbott and Costello's version of the famous fa...,comedy
2,1903,Alice in Wonderland,Animation Adventure Family Fantasy,"Alice follows a large white rabbit down a ""Rab...","On a golden afternoon, young Alice follows a W...",Animation|Adventure|Family|Fantasy
3,1933,Alice in Wonderland,fantasy,Left alone with a governess one snowy afternoo...,"On a golden afternoon, young Alice follows a W...",fantasy
4,1951,Alice in Wonderland,animation,"On a riverbank, Alice spots a White Rabbit in ...","On a golden afternoon, young Alice follows a W...",animation


<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Drop unnecessary columns</h3>

In [38]:
df = df.drop(['Genre', 'Genre_improved'], axis=1)

<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Drop rows with a duplicate 'Overview' or 'Plot'</h3>

In [39]:
df = df.drop_duplicates(subset=['Overview'])
df = df.drop_duplicates(subset=['Plot'])

<h3 style = "font-family:garamond; font-size:20px; background-color: white; color : royalblue; border-radius: 100px 100px; text-align:left">Explore the merged dataframe</h3>

In [40]:
df.dtypes

Release Year    object
Title           object
Plot            object
Overview        object
dtype: object

In [41]:
df.shape

(14664, 4)

In [42]:
df.describe()

Unnamed: 0,Release Year,Title,Plot,Overview
count,14664,14664,14664,14664
unique,114,14664,14664,14664
top,2013,The Third Man,Casey Cooke is an emotionally withdrawn teenag...,"In the early part of this century, Maddelena a..."
freq,404,1,1,1


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14664 entries, 0 to 16214
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Release Year  14664 non-null  object
 1   Title         14664 non-null  object
 2   Plot          14664 non-null  object
 3   Overview      14664 non-null  object
dtypes: object(4)
memory usage: 572.8+ KB


In [44]:
# save the data to a new csv file
df.to_csv(r'C:\Users\rotzn\gitProjects\DS program\DS projects\cap 3\data\df.csv', header=True, index=False)

<a id = '3'></a>
<h1 style = "font-family: garamond; font-size: 45px; font-style: normal; letter-spacing: 3px; background-color: #f6f5f5; color :royalblue; border-radius: 100px 100px; text-align:center " >Summary</h1>

In [45]:
df

Unnamed: 0,Release Year,Title,Plot,Overview
0,1902,Jack and the Beanstalk,The earliest known adaptation of the classic f...,Abbott and Costello's version of the famous fa...
2,1903,Alice in Wonderland,"Alice follows a large white rabbit down a ""Rab...","On a golden afternoon, young Alice follows a W..."
8,1903,The Great Train Robbery,The film opens with two bandits breaking into ...,The clerk at the train station is assaulted an...
9,1905,The Night Before Christmas,Scenes are introduced using lines of the poem....,A cartoon based on the works of Nikolay Gogol....
11,1906,Dream of a Rarebit Fiend,The Rarebit Fiend gorges on Welsh rarebit at a...,Adapted from Winsor McCay's films and comics o...
...,...,...,...,...
16210,2010,Five Minarets in New York,The film follows two anti-terror officers from...,Two Turkish anti-terrorist agents are sent to ...
16211,2011,Love Likes Coincidences,"One September morning in 1977 in Ankara, a you...","Year 1977, a September morning in Ankara... Yi..."
16212,2011,Once Upon a Time in Anatolia,"Through the night, three cars carry a small gr...",A group of men set out in search of a dead bod...
16213,2014,Winter Sleep,"Aydın, a former actor, owns a mountaintop hote...","Aydin, a retired actor, owns a small hotel in ..."


First, we loaded the data from two different movie databases and merged them. We cleaned up the Genre column in case we want to pursue other goals later (genre predictor, movie recommendations, etc.). We then dropped duplicate rows, rows that don't contain both 'Plot' and 'Overview', and columns that won't be useful. We are left with a pandas dataframe of 14,664 rows and 4 columns (Release Year, Title, Plot, Overview).
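
The filtering steps summarized above can be condensed into a single function. The following is a sketch on invented data, assuming the column names shown in this notebook; `wrangle` is a hypothetical helper name, not something defined earlier.

```python
import pandas as pd

def wrangle(merged: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with both texts, drop duplicates, keep four columns."""
    out = merged.dropna(subset=["Plot", "Overview"])
    out = out.drop_duplicates()
    out = out.drop_duplicates(subset=["Title", "Release Year"])
    out = out[["Release Year", "Title", "Plot", "Overview"]]
    out = out.drop_duplicates(subset=["Overview"])
    out = out.drop_duplicates(subset=["Plot"])
    # Reset the index so its last label matches the row count minus one.
    return out.reset_index(drop=True)

# Invented rows covering each filter: a missing plot, a repeated
# (Title, Release Year) pair, and a repeated overview.
toy = pd.DataFrame({
    "Release Year": ["1902", "1902", "1952", "1960"],
    "Title": ["Film A", "Film A", "Film A", "Film B"],
    "Genre": ["comedy", "fantasy", "comedy", "drama"],
    "Plot": ["plot one", "plot one", "plot two", None],
    "Overview": ["overview x", "overview y", "overview x", "overview z"],
})
result = wrangle(toy)
```

Unlike the cells above, the sketch resets the index after the last filter, so the highest index label cannot be mistaken for the number of rows.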