# Why multi-tables

### Introduction

### The problem

Let's imagine that we would like to study movies. One thing we could do is take a look at the [Rotten Tomatoes data](https://www.kaggle.com/ayushkalla1/rotten-tomatoes-movie-database).

In [143]:
import pandas as pd
df = pd.read_csv('./all_movie.csv')

In [144]:
df.columns

Index(['Cast 1', 'Cast 2', 'Cast 3', 'Cast 4', 'Cast 5', 'Cast 6',
       'Description', 'Director 1', 'Director 2', 'Director 3', 'Genre',
       'Rating', 'Release Date', 'Runtime', 'Studio', 'Title', 'Writer 1',
       'Writer 2', 'Writer 3', 'Writer 4', 'Year'],
      dtype='object')

In [159]:
movie_cols = ['Title', 'Studio', 'Runtime', 'Description', 'Release Date', 'Year']

In [160]:
movie_df = df[movie_cols]

In [161]:
movie_df.index = range(1, len(movie_df) + 1)

In [162]:
movie_df[:2]

| | Title | Studio | Runtime | Description | Release Date | Year |
|---|---|---|---|---|---|---|
| 1 | The Mummy: Tomb of the Dragon Emperor | Universal Pictures | 112 minutes | The Fast and the Furious director Rob Cohen co... | 7/24/2008 | 2008 |
| 2 | The Masked Saint | Freestyle Releasing | 111 minutes | The journey of a professional wrestler who bec... | 1/8/2016 | 2016 |


In [190]:
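# lower-case the names and replace spaces so 'Release Date' becomes 'release_date'
col_names = [col_name.lower().replace(' ', '_') for col_name in movie_df.columns]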

In [192]:
col_names[-1] = 'year'

In [193]:
col_names

['title', 'studio', 'runtime', 'description', 'release_date', 'year']

In [194]:
movie_df.columns = col_names

In [208]:
movie_df['runtime'] = movie_df['runtime'].str[:3]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [222]:
runtime_col = pd.to_numeric(movie_df['runtime'],errors='coerce')

In [223]:
movie_df['runtime'] = runtime_col



In [226]:
movie_release = pd.to_datetime(movie_df['release_date'],infer_datetime_format=True)

In [228]:
movie_df['release_date'] = movie_release



In [229]:
movie_df.dtypes

title                   object
studio                  object
runtime                float64
description             object
release_date    datetime64[ns]
year                     int64
dtype: object
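The cleaning steps above can be sketched as one function that works on an explicit copy, which is one common way to avoid the `SettingWithCopyWarning` shown earlier. This is only a sketch under assumptions: the two-row `raw` frame is a hypothetical stand-in for `all_movie.csv`, `clean_movies` is a name invented here, the date format `%m/%d/%Y` is inferred from the two sample rows, and the runtime is pulled out with a regular expression rather than the `.str[:3]` slice used above.

```python
import pandas as pd

# Hypothetical two-row sample standing in for all_movie.csv
raw = pd.DataFrame({
    'Title': ['The Mummy: Tomb of the Dragon Emperor', 'The Masked Saint'],
    'Runtime': ['112 minutes', '111 minutes'],
    'Release Date': ['7/24/2008', '1/8/2016'],
    'Year': [2008, 2016],
})

def clean_movies(frame):
    # .copy() gives an independent DataFrame, so the assignments
    # below never write into a slice of `frame`
    out = frame.copy()
    out.columns = [c.lower().replace(' ', '_') for c in out.columns]
    # pull the leading digits out of strings such as '112 minutes'
    minutes = out['runtime'].str.extract(r'(\d+)', expand=False)
    out['runtime'] = pd.to_numeric(minutes, errors='coerce')
    # assumed month/day/year format; unparseable dates become NaT
    out['release_date'] = pd.to_datetime(
        out['release_date'], format='%m/%d/%Y', errors='coerce')
    # 1-based index so it can double as the SQL id column
    out.index = range(1, len(out) + 1)
    return out

sample_movies = clean_movies(raw)
```

Because the function returns a new frame, `raw` keeps its original column names and string values.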

In [230]:
import sqlite3
conn = sqlite3.connect('films.db')
cursor = conn.cursor()

In [231]:
movie_df.to_sql('movies', conn, index = True, index_label = 'id')

In [232]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('movies',)]

In [233]:
cursor.execute('PRAGMA table_info(movies);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0),
 (1, 'title', 'TEXT', 0, None, 0),
 (2, 'studio', 'TEXT', 0, None, 0),
 (3, 'runtime', 'REAL', 0, None, 0),
 (4, 'description', 'TEXT', 0, None, 0),
 (5, 'release_date', 'TIMESTAMP', 0, None, 0),
 (6, 'year', 'INTEGER', 0, None, 0)]
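The write-then-inspect pattern above can be tried without touching `films.db` by using an in-memory database. The sketch below is an illustration only: `demo_conn` and `demo_movies` are hypothetical names, and the frame holds just two columns taken from the sample rows.

```python
import sqlite3
import pandas as pd

# In-memory database, so nothing is written to disk
demo_conn = sqlite3.connect(':memory:')

demo_movies = pd.DataFrame(
    {'title': ['The Mummy: Tomb of the Dragon Emperor', 'The Masked Saint'],
     'year': [2008, 2016]},
    index=[1, 2],
)

# The DataFrame index becomes the `id` column of the table
demo_movies.to_sql('movies', demo_conn, index=True, index_label='id')

# List the tables, then the column names of `movies`
table_names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
column_names = [row[1] for row in demo_conn.execute(
    'PRAGMA table_info(movies)')]
```

Each row returned by `PRAGMA table_info` starts with the column position followed by the column name, which is why the second element is picked.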

In [235]:
cursor.execute('select * from movies limit 1')
cursor.fetchall()

[(1,
  'The Mummy: Tomb of the Dragon Emperor',
  'Universal Pictures',
  112.0,
  "The Fast and the Furious director Rob Cohen continues the tale set into motion by director Stephen Sommers with this globe-trotting adventure that finds explorer Rick O'Connell and son attempting to thwart a resurrected emperor's (Jet Li) plan to enslave the entire human race. It's been 2,000 years since China's merciless Emperor Han and his formidable army were entombed in terra cotta clay by a double-dealing sorceress (Michelle Yeoh), but now, after centuries in suspended animation, an ancient curse is about to be broken. Thanks to his childhood adventures alongside father Rick (Brendan Fraser) and mother Evelyn (Maria Bello), dashing young archeologist Alex O'Connell (Luke Ford) is more than familiar with the power of the supernatural. After he is tricked into awakening the dreaded emperor from his eternal slumber, however, the frightened young adventurer is forced to seek out the wisdom of his paren

### Cast

In [247]:
cast_cols = list(df.columns[0:6])
cast_cols

['Cast 1', 'Cast 2', 'Cast 3', 'Cast 4', 'Cast 5', 'Cast 6']

In [248]:
cast_cols.append('Title')

In [249]:
cast_df = df[cast_cols]

In [250]:
cursor = conn.cursor()
cursor.execute('select id, title from movies;')
movie_ids = cursor.fetchall()

In [251]:
movie_id_df = pd.DataFrame(movie_ids, columns = ['id', 'name'])

In [252]:
movie_id_df[0:3]

| | id | name |
|---|---|---|
| 0 | 1 | The Mummy: Tomb of the Dragon Emperor |
| 1 | 2 | The Masked Saint |
| 2 | 3 | Spy Hard |


In [253]:
cast_df[0:2]

| | Cast 1 | Cast 2 | Cast 3 | Cast 4 | Cast 5 | Cast 6 | Title |
|---|---|---|---|---|---|---|---|
| 0 | Brendan Fraser | John Hannah | Maria Bello | Michelle Yeoh | Jet Li | Russell Wong | The Mummy: Tomb of the Dragon Emperor |
| 1 | Brett Granstaff | Diahann Carroll | Lara Jean Chorostecki | Roddy Piper | T.J. McGibbon | James Preston Rogers | The Masked Saint |


In [254]:
combined_cast_df = cast_df.join(movie_id_df)

In [255]:
combined_cast_df = combined_cast_df.rename(columns = {'id': 'movie_id'})

In [256]:
combined_cast_df[0:2]

| | Cast 1 | Cast 2 | Cast 3 | Cast 4 | Cast 5 | Cast 6 | Title | movie_id | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Brendan Fraser | John Hannah | Maria Bello | Michelle Yeoh | Jet Li | Russell Wong | The Mummy: Tomb of the Dragon Emperor | 1 | The Mummy: Tomb of the Dragon Emperor |
| 1 | Brett Granstaff | Diahann Carroll | Lara Jean Chorostecki | Roddy Piper | T.J. McGibbon | James Preston Rogers | The Masked Saint | 2 | The Masked Saint |


In [257]:
condensed_df = pd.melt(combined_cast_df, id_vars=['movie_id', 'name'], value_vars=['Cast 1', 'Cast 2', 'Cast 3', 'Cast 4', 'Cast 5', 'Cast 6'])

In [258]:
condensed_df[:2]

| | movie_id | name | variable | value |
|---|---|---|---|---|
| 0 | 1 | The Mummy: Tomb of the Dragon Emperor | Cast 1 | Brendan Fraser |
| 1 | 2 | The Masked Saint | Cast 1 | Brett Granstaff |
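To make the wide-to-long reshaping concrete, here is a small sketch of `pd.melt` on a toy frame. The frame is an assumption for illustration: it keeps only two cast columns, and the `Cast Not Available` placeholder in the second row is placed there artificially so the filtering step has something to remove. The names `billing` and `actor` are also chosen here, not taken from the notebook.

```python
import pandas as pd

# Toy wide frame: one row per movie, one column per cast slot
wide = pd.DataFrame({
    'movie_id': [1, 2],
    'Cast 1': ['Brendan Fraser', 'Brett Granstaff'],
    'Cast 2': ['John Hannah', 'Cast Not Available'],  # placeholder placed here for the demo
})

# melt stacks the cast columns: one row per (movie, cast slot) pair
long_cast = pd.melt(
    wide,
    id_vars=['movie_id'],
    value_vars=['Cast 1', 'Cast 2'],
    var_name='billing',
    value_name='actor',
)

# drop the placeholder rows and renumber
long_cast = long_cast[long_cast['actor'] != 'Cast Not Available']
long_cast = long_cast.reset_index(drop=True)
```

`melt` emits all `Cast 1` rows first and then all `Cast 2` rows, which is why the first rows of `condensed_df` above are all `Cast 1` entries.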


In [399]:
selected_actors = condensed_df[condensed_df['value'] != 'Cast Not Available']
selected_actors_array = selected_actors[['name', 'value']].to_numpy()

In [400]:
selected_actors_array

array([['The Mummy: Tomb of the Dragon Emperor', 'Brendan Fraser'],
       ['The Masked Saint', 'Brett Granstaff'],
       ['Spy Hard', 'Leslie Nielsen'],
       ...,
       ['Zardoz', 'Niall Buggy'],
       ['Supernova', 'Robert Forster'],
       ['Battle: Los Angeles', 'Michelle Rodriguez']], dtype=object)

In [401]:
import numpy as np
arr = np.array([])

In [408]:
np.append(np.array([1, 2, 3]), arr)

array([1., 2., 3.])

In [439]:
selected_actors_array.shape

(172424, 2)

In [440]:
172424/4

43106.0

In [431]:
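# Look up the ids for the first 10 (movie, actor) pairs.
# This relies on the `actors` table, which is created in the cells shown further down (In [272]).
actor_ids = []
for movie_name, actor_name in selected_actors_array[:10]:
    # id of the actor with this name
    cursor.execute('select id from actors where name = ?', (actor_name,))
    actor_id = cursor.fetchone()[0]
    # id of the movie with this title
    cursor.execute('select id from movies where title = ?', (movie_name,))
    movie_id = cursor.fetchone()[0]
    data_row = np.array([actor_name, actor_id, movie_name, movie_id])
    actor_ids.append(data_row)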

In [432]:
movie_actor_rows = np.array(actor_ids)

In [433]:
movie_actor_rows.shape

(10, 4)

In [437]:
movie_actor_df = pd.DataFrame(data = movie_actor_rows, index = range(1, 11), columns = ['actor_name', 'actor_id', 'movie_name', 'movie_id'])

In [438]:
movie_actor_df

| | actor_name | actor_id | movie_name | movie_id |
|---|---|---|---|---|
| 1 | Brendan Fraser | 1 | The Mummy: Tomb of the Dragon Emperor | 1 |
| 2 | Brett Granstaff | 2 | The Masked Saint | 2 |
| 3 | Leslie Nielsen | 3 | Spy Hard | 3 |
| 4 | Martina Gedeck | 4 | Der Baader Meinhof Komplex (The Baader Meinhof... | 4 |
| 5 | Martin Sheen | 5 | Apocalypse Now | 5 |
| 6 | Johnny Depp | 6 | Mortdecai | 6 |
| 7 | Jeremy Renner | 7 | The Hurt Locker | 7 |
| 8 | Jim Carter | 8 | The Little Vampire 3D | 8 |
| 9 | Jack Nicholson | 9 | The Fortune | 9 |
| 10 | Robert Mitchum | 10 | Heaven Knows Mr. Allison | 10 |
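The loop above issues two queries per row, which adds up over roughly 172,000 pairs. One possible alternative, sketched here under assumptions, is to do the same name-to-id lookup with two pandas merges. The frames `pairs`, `movies_lookup`, and `actors_lookup` are hypothetical, and the actor ids are made up for the demo.

```python
import pandas as pd

# (movie title, actor name) pairs, as in selected_actors_array
pairs = pd.DataFrame({
    'movie_name': ['The Mummy: Tomb of the Dragon Emperor',
                   'The Mummy: Tomb of the Dragon Emperor',
                   'The Masked Saint'],
    'actor_name': ['Brendan Fraser', 'Jet Li', 'Brett Granstaff'],
})

# Lookup frames mirroring the `movies` and `actors` tables (ids are illustrative)
movies_lookup = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['The Mummy: Tomb of the Dragon Emperor', 'The Masked Saint'],
})
actors_lookup = pd.DataFrame({
    'actor_id': [1, 2, 3],
    'name': ['Brendan Fraser', 'Brett Granstaff', 'Jet Li'],
})

# Two inner merges replace the per-row SELECT statements
movie_actor_ids = (
    pairs
    .merge(movies_lookup, left_on='movie_name', right_on='title')
    .merge(actors_lookup, left_on='actor_name', right_on='name')
    [['movie_id', 'actor_id']]
)
```

Like the loop, this matches on names, so two movies sharing a title would produce extra rows; joining on `movie_id` where it is already available avoids that.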


### List of actor movie joins

In [366]:
import numpy as np
actor_ids_np = np.array(actor_ids)

In [377]:
movie_ids_arr = selected_actors_array[:, 0]

In [386]:
movie_actors_arr = np.stack((movie_ids_arr,actor_ids_np)).T

In [397]:
movie_actor_df = pd.DataFrame(movie_actors_arr, columns = ['movie_id', 'actor_id'], index = range(1, 172424 + 1))

In [392]:
cursor.execute('select * from movies where id = ?', ('29810',))

<sqlite3.Cursor at 0x11671f3b0>

In [393]:
cursor.fetchall()

[(29810,
  'Battle: Los Angeles',
  'Sony Pictures/Columbia Pictures',
  116.0,
  "For years, there have been documented cases of UFO sightings around the world - Buenos Aires, Seoul, France, Germany, China. But in 2011, what were once just sightings will become a terrifying reality when Earth is attacked by unknown forces. As people everywhere watch the world's great cities fall, Los Angeles becomes the last stand for mankind in a battle no one expected. It's up to a Marine staff sergeant (Aaron Eckhart) and his new platoon to draw a line in the sand as they take on an enemy unlike any they've ever encountered before. -- (C) Sony",
  '2011-03-11 00:00:00',
  2011)]

In [394]:
cursor.execute('select * from actors where id = ?', ('12728',))

<sqlite3.Cursor at 0x11671f3b0>

In [395]:
cursor.fetchall()

[(12728, 'Bob Gunton')]

In [420]:
# movie_actor_df

In [261]:
selected_cast_df = condensed_df[condensed_df['value'] != 'Cast Not Available']

In [264]:
selected_cast_df.index = range(1, len(selected_cast_df) + 1)

In [267]:
unique_actors = selected_cast_df['value'].unique()

In [270]:
unique_actors_df = pd.DataFrame(unique_actors, columns = ['name'], index = range(1, len(unique_actors) + 1))

In [272]:
unique_actors_df.to_sql('actors', conn, index = True, index_label='id')

In [273]:
cursor.execute('PRAGMA table_info(actors);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0), (1, 'name', 'TEXT', 0, None, 0)]

In [274]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('movies',), ('actors',)]
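With `movies` and `actors` stored separately, the many-to-many link between them can live in a third table. The sketch below shows that idea on an in-memory database. The table name `movie_actors`, the connection name `demo`, and the ids are assumptions for illustration and are not part of `films.db`.

```python
import sqlite3

demo = sqlite3.connect(':memory:')

# Two entity tables plus a join table holding one row per (movie, actor) pair
demo.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movie_actors (
    movie_id INTEGER REFERENCES movies(id),
    actor_id INTEGER REFERENCES actors(id)
);
""")

demo.executemany('INSERT INTO movies VALUES (?, ?)', [
    (1, 'The Mummy: Tomb of the Dragon Emperor'),
    (2, 'The Masked Saint'),
])
demo.executemany('INSERT INTO actors VALUES (?, ?)', [
    (1, 'Brendan Fraser'), (2, 'Brett Granstaff'), (3, 'Jet Li'),
])
demo.executemany('INSERT INTO movie_actors VALUES (?, ?)', [
    (1, 1), (1, 3), (2, 2),
])

# All actors of one movie, found by walking through the join table
cast_of_mummy = [row[0] for row in demo.execute(
    '''SELECT actors.name
       FROM actors
       JOIN movie_actors ON movie_actors.actor_id = actors.id
       JOIN movies ON movies.id = movie_actors.movie_id
       WHERE movies.title = ?
       ORDER BY actors.name''',
    ('The Mummy: Tomb of the Dragon Emperor',))]
```

Each actor name is stored once in `actors`, and an actor who appears in several movies simply gets several rows in `movie_actors`.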

### Actors

In [61]:
cast_condensed = condensed_df[['value', 'movie_id']]

In [65]:
cast_condensed.columns = ['name', 'movie_id']
cast_condensed[0:2]

| | name | movie_id |
|---|---|---|
| 0 | Brendan Fraser | 0 |
| 1 | Brett Granstaff | 1 |


In [None]:
# combined_cast_df['Cast 6'].value_counts()



* Remove the rows where the name is `Cast Not Available`

In [286]:
cast_condensed = cast_condensed[cast_condensed['name'] != "Cast Not Available"]

In [288]:
# cast_condensed

In [84]:
cursor = conn.cursor()
cursor.execute('DROP TABLE IF EXISTS actors;')

<sqlite3.Cursor at 0x11b00b2d0>

In [85]:
cast_condensed.to_sql('actors', conn, index=True, index_label = 'id')

In [86]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('movies',), ('actors',)]

In [87]:
cursor.execute('PRAGMA table_info(actors);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0),
 (1, 'name', 'TEXT', 0, None, 0),
 (2, 'movie_id', 'INTEGER', 0, None, 0)]

### Directors

In [275]:
director_columns = ['Director 1', 'Director 2', 'Director 3', 'Title']

In [276]:
director_df = df[director_columns]

In [277]:
combined_director_df = director_df.join(movie_id_df)

In [278]:
combined_director_df[0:2]

| | Director 1 | Director 2 | Director 3 | Title | id | name |
|---|---|---|---|---|---|---|
| 0 | Rob Cohen | Simon Duggan | Director Not Available | The Mummy: Tomb of the Dragon Emperor | 1 | The Mummy: Tomb of the Dragon Emperor |
| 1 | Warren P. Sonoda | Director Not Available | Director Not Available | The Masked Saint | 2 | The Masked Saint |


In [279]:
condensed_director_df = pd.melt(combined_director_df, id_vars=['id', 'name'], value_vars=['Director 1', 'Director 2', 'Director 3'])

In [280]:
condensed_director_df[0:2]

| | id | name | variable | value |
|---|---|---|---|---|
| 0 | 1 | The Mummy: Tomb of the Dragon Emperor | Director 1 | Rob Cohen |
| 1 | 2 | The Masked Saint | Director 1 | Warren P. Sonoda |


In [289]:
condensed_director_df = condensed_director_df[condensed_director_df['value'] != "Director Not Available"]

In [421]:
# actor_ids = []
# actor_names = []
# for movie_name, actor_name in selected_actors_array:
#     cursor.execute('select id from actors where name = ?', (actor_name,))
#     actor_id = cursor.fetchone()[0]
#     cursor.execute('select id from movies where title = ?', (movie_name,))
#     movie_id = cursor.fetchone()[0]
#     data_row = np.array([actor_name, actor_id, movie_name, movie_id])
#     actor_ids.append(data_row)

In [291]:
# condensed_director_df['value']

In [292]:
director_df = condensed_director_df[['id', 'value']]

In [293]:
director_df.columns = ['movie_id', 'director']

In [294]:
director_df_renamed = director_df[['director', 'movie_id']]

In [295]:
director_df_renamed.columns = ['name', 'movie_id']

In [297]:
director_names_df = director_df_renamed[['name']]

In [304]:
unique_directors = director_names_df['name'].unique()

In [305]:
unique_directors_df = pd.DataFrame(unique_directors, columns = ['name'], index = range(1, len(unique_directors) + 1))

In [306]:
conn.execute('DROP TABLE IF EXISTS directors;')

<sqlite3.Cursor at 0x1203167a0>

In [307]:
unique_directors_df.to_sql('directors', conn, index=True, index_label = 'id')

In [311]:
# unique_directors_df['name'].value_counts()

In [308]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('movies',), ('actors',), ('directors',)]

In [309]:
cursor.execute('PRAGMA table_info(directors);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0), (1, 'name', 'TEXT', 0, None, 0)]

### Writers

But do all of these columns belong in our `movies` table? Some of them clearly do: a movie should have a year, a runtime, a studio, a rating, and a release date. What about the other columns?

The writer columns describe people rather than the movie itself, so we move them into their own `writers` table, just as we did for the cast and the directors.

In [312]:
writer_df = df[['Title', 'Writer 1', 'Writer 2', 'Writer 3', 'Writer 4']].join(movie_id_df)

In [313]:
writer_df_condensed = pd.melt(writer_df, id_vars=['id', 'name'], value_vars=['Writer 1', 'Writer 2', 'Writer 3', 'Writer 4'])

In [314]:
writer_df_condensed.columns = ['movie_id', 'title', 'variable', 'name']

In [315]:
writers_df = writer_df_condensed[['name', 'movie_id']]

In [317]:
unique_writers = writer_df_condensed['name'].unique()

In [318]:
unique_writers_df = pd.DataFrame(unique_writers, columns = ['name'], index = range(1, len(unique_writers) + 1))

In [319]:
unique_writers_df.to_sql('writers', conn, index=True, index_label = 'id')

In [320]:
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
cursor.fetchall()

[('movies',), ('actors',), ('directors',), ('writers',)]

In [321]:
cursor.execute('PRAGMA table_info(writers);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0), (1, 'name', 'TEXT', 0, None, 0)]
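The actors, directors, and writers sections all repeat the same pattern: melt the numbered columns, drop placeholder values, keep the unique names, and index them from 1. A helper along these lines could capture it. This is a sketch only: `unique_names` and its parameters are names invented here, and the toy frame reuses the two director rows shown earlier.

```python
import pandas as pd

def unique_names(frame, name_cols, placeholder=None):
    """Return a 1-indexed frame of the distinct names found in name_cols."""
    # stack the numbered columns into a single 'name' column
    long = frame.melt(value_vars=name_cols, value_name='name')
    names = long['name'].dropna()
    if placeholder is not None:
        # drop sentinel values such as 'Director Not Available'
        names = names[names != placeholder]
    unique = names.unique()
    return pd.DataFrame({'name': unique}, index=range(1, len(unique) + 1))

# Toy frame based on the first two rows of the director columns
toy_directors = pd.DataFrame({
    'Director 1': ['Rob Cohen', 'Warren P. Sonoda'],
    'Director 2': ['Simon Duggan', 'Director Not Available'],
})

directors_lookup = unique_names(
    toy_directors,
    ['Director 1', 'Director 2'],
    placeholder='Director Not Available',
)
```

The result can be passed to `to_sql` with `index=True, index_label='id'`, as in the cells above.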

### Movies

### A better way

In [141]:
cursor.execute('PRAGMA table_info(movies);')
cursor.fetchall()

[(0, 'id', 'INTEGER', 0, None, 0),
 (1, 'Cast 1', 'TEXT', 0, None, 0),
 (2, 'Cast 2', 'TEXT', 0, None, 0),
 (3, 'Cast 3', 'TEXT', 0, None, 0),
 (4, 'Cast 4', 'TEXT', 0, None, 0),
 (5, 'Cast 5', 'TEXT', 0, None, 0),
 (6, 'Cast 6', 'TEXT', 0, None, 0),
 (7, 'Description', 'TEXT', 0, None, 0),
 (8, 'Director 1', 'TEXT', 0, None, 0),
 (9, 'Director 2', 'TEXT', 0, None, 0),
 (10, 'Director 3', 'TEXT', 0, None, 0),
 (11, 'Genre', 'TEXT', 0, None, 0),
 (12, 'Rating', 'TEXT', 0, None, 0),
 (13, 'Release Date', 'TEXT', 0, None, 0),
 (14, 'Runtime', 'TEXT', 0, None, 0),
 (15, 'Studio', 'TEXT', 0, None, 0),
 (16, 'Title', 'TEXT', 0, None, 0),
 (17, 'Writer 1', 'TEXT', 0, None, 0),
 (18, 'Writer 2', 'TEXT', 0, None, 0),
 (19, 'Writer 3', 'TEXT', 0, None, 0),
 (20, 'Writer 4', 'TEXT', 0, None, 0),
 (21, 'Year', 'INTEGER', 0, None, 0)]

* Still to do: insert the cast names, then join on the cast name to attach a `cast_id` to each `movie_id`

In [138]:
cast_condensed[0:3]

| | name | movie_id |
|---|---|---|
| 0 | Brendan Fraser | 0 |
| 1 | Brett Granstaff | 1 |
| 2 | Leslie Nielsen | 2 |


* Still to do: insert the writer names, then join the writer id onto the DataFrame that holds `movie_id`

In [134]:
writers_df.columns

Index(['name', 'movie_id'], dtype='object')

In [140]:
writers_df[:5]

| | name | movie_id |
|---|---|---|
| 0 | Alfred Gough | 0 |
| 1 | Scott Crowell | 1 |
| 2 | Rick Friedberg | 2 |
| 3 | Uli Edel | 3 |
| 4 | John Milius | 4 |


In [139]:
director_df_renamed[:5]

| | name | movie_id |
|---|---|---|
| 0 | Rob Cohen | 0 |
| 1 | Warren P. Sonoda | 1 |
| 2 | Rick Friedberg | 2 |
| 3 | Uli Edel | 3 |
| 4 | Francis Ford Coppola | 4 |


* https://www.kaggle.com/rounakbanik/the-movies-dataset#movies_metadata.csv