# Project 3 - Part 3

## Business problem

- Produce a MySQL database on movies from a subset of IMDb's publicly available dataset.
- Use this database to analyze what makes a movie successful.
- Provide recommendations to the stakeholder on how to make a successful movie.

## Requirements

- Create a new MySQL database after preparing the data for a relational database.
- Export your database to a .sql file in your repository using MySQL Workbench.

### Database Specifications
#### Title Basics:
- Movie ID (tconst)
- Primary Title
- Start Year
- Runtime (in Minutes)
- Genres

#### Title Ratings:
- Movie ID (tconst)
- Average Movie Rating
- Number of Votes

#### The TMDB API Results (multiple files):
- Movie ID
- Revenue
- Budget
- Certification (MPAA Rating)


- Normalize the tables as best you can before adding them to your new database.
- Keep all of the data from the TMDB API in one table together (even though it will not be perfectly normalized).
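
Keeping the TMDB results in one table amounts to stacking the per-year files. The following is a minimal sketch with pandas, assuming each file shares the same columns; the column names (`imdb_id`, `revenue`, `budget`, `certification`) and the sample rows are invented for illustration and may differ from the real API extracts.

```python
import pandas as pd

# Hypothetical yearly API extracts; column names and values are invented.
tmdb_2000 = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "revenue": [1000000.0, 0.0],
    "budget": [500000.0, 20000.0],
    "certification": ["PG", None],
})
tmdb_2001 = pd.DataFrame({
    "imdb_id": ["tt0000003"],
    "revenue": [250000.0],
    "budget": [100000.0],
    "certification": ["R"],
})

# Stack the yearly files into one table with a fresh 0..n-1 index.
tmdb_data = pd.concat([tmdb_2000, tmdb_2001], ignore_index=True)

# Guard against a movie appearing in more than one file.
tmdb_data = tmdb_data.drop_duplicates(subset="imdb_id").reset_index(drop=True)
```

Dropping duplicates on the movie ID is an optional safeguard; it matters only if the same movie can appear in more than one yearly file.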

### Required Transformation Steps

#### Title Basics:
- Normalize Genre: Convert the single string of genres from title basics into two new tables: title_genres (columns tconst and genre_id) and genres (columns genre_id and genre_name).
- Discard unnecessary information by dropping the following columns: original_title, isAdult, titleType, genres, and any other variants of genre.
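
The genre normalization described above can be sketched on a toy table. This is an illustration under assumptions rather than the notebook's actual data: the three rows are invented, and the helper column name `genre_name` is chosen here for readability.

```python
import pandas as pd

# Toy stand-in for title_basics; the rows are invented for illustration.
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "genres": ["Comedy,Drama", "Drama", None],
})

# One row per (movie, genre) pair; movies without a genre are dropped.
exploded = (
    basics.assign(genre_name=basics["genres"].str.split(","))
    .explode("genre_name")
    .dropna(subset=["genre_name"])
    .reset_index(drop=True)
)

# Lookup table: one integer id per distinct genre name.
unique_genres = sorted(exploded["genre_name"].unique())
genres = pd.DataFrame({
    "genre_id": list(range(len(unique_genres))),
    "genre_name": unique_genres,
})

# Joiner table: replace each genre name with its id.
genre_map = dict(zip(genres["genre_name"], genres["genre_id"]))
title_genres = exploded[["tconst"]].copy()
title_genres["genre_id"] = exploded["genre_name"].map(genre_map)
```

A movie with two genres yields two rows in `title_genres`, which is why that table cannot use `tconst` alone as a primary key.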

#### Title AKAS:
- Do not include the title_akas table in your SQL database.

### MySQL Database Requirements
- Use sqlalchemy with Pandas to execute your SQL queries inside your notebook
- Create a new database on your MySQL server and call it "movies"
- Make sure to have the following tables in your "movies" database:
  - title_basics
  - title_ratings
  - title_genres
  - genres
  - tmdb_data
- Set a primary key for each table that isn't a joiner table (e.g., title_genres is a joiner table).
- After creating each table, show the first 5 rows of that table using a SQL query.
- Run the "SHOW TABLES" SQL query at the end of your notebook to show that all required tables have been created.
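
The key requirements above can be previewed without a MySQL server. The sketch below swaps in an in-memory SQLite database from Python's standard library purely so that it runs anywhere; the project itself targets MySQL, the column types are simplified, and the sample rows are invented. `LIMIT 5` works the same way in MySQL, whereas `SHOW TABLES` is MySQL-specific and is replaced here by a query on SQLite's catalog.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (
    tconst TEXT PRIMARY KEY,
    primaryTitle TEXT,
    startYear REAL,
    runtimeMinutes INTEGER
);
CREATE TABLE genres (
    genre_id INTEGER PRIMARY KEY,
    genre_name TEXT
);
-- Joiner table: no primary key of its own, only ids from the two tables above.
CREATE TABLE title_genres (
    tconst TEXT,
    genre_id INTEGER
);
""")

# Invented sample rows.
conn.executemany(
    "INSERT INTO title_basics VALUES (?, ?, ?, ?)",
    [(f"tt{i:07d}", f"Movie {i}", 2000.0, 90 + i) for i in range(1, 8)],
)
conn.executemany("INSERT INTO genres VALUES (?, ?)", [(0, "Comedy"), (1, "Drama")])
conn.executemany(
    "INSERT INTO title_genres VALUES (?, ?)",
    [("tt0000001", 0), ("tt0000001", 1)],
)

# First five rows of a table.
first_five = conn.execute("SELECT * FROM title_basics LIMIT 5;").fetchall()

# SQLite has no SHOW TABLES; this catalog query is its closest equivalent.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    )
]
```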

## Imports

In [2]:
import pandas as pd
import pymysql
from sqlalchemy import create_engine
from sqlalchemy.types import Float, Integer, String, Text

## Code

In [4]:
## Load data
basics_df = pd.read_csv("Data/title_basics.csv.gz", low_memory=False)
ratings_df = pd.read_csv("Data/title_ratings.csv.gz", low_memory=False)
tmdb_2000 = pd.read_csv("Data/movies_2000_final.csv.gz", low_memory=False)
tmdb_2001 = pd.read_csv("Data/movies_2001_final.csv.gz", low_memory=False)

In [3]:
## Transformation: Title basics
## Normalize genre

## Create a column with a list of genres
basics_df['genres_split'] = basics_df['genres'].str.split(',')

## Explode so that each row holds one (tconst, genre) pair
exploded = basics_df.explode('genres_split')

## Get the sorted list of unique genres
unique_genres = sorted(exploded['genres_split'].dropna().unique())

## Create a new table title_genres with columns tconst and genre_id
title_genres = exploded[['tconst', 'genres_split']].copy()

## Make a genre mapper dictionary
genre_ints = range(len(unique_genres))
genre_map = dict(zip(unique_genres, genre_ints))

## Add genre_id and drop genres
title_genres['genre_id'] = title_genres['genres_split'].map(genre_map)
title_genres = title_genres.drop(columns='genres_split')

## Create a new table genres with columns genre_id and genre_name
genres = pd.DataFrame({'genre_id': list(genre_map.values()),
                       'genre_name': list(genre_map.keys())})

In [None]:
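## Drop columns, including the genres_split helper column created above
## IMDb's raw file names the title column originalTitle; errors='ignore' skips absent names
cols_todrop = ['original_title', 'originalTitle', 'isAdult', 'titleType', 'genres', 'genres_split']
basics_df = basics_df.drop(columns=cols_todrop, errors='ignore')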

In [4]:
## Transformation: TMDB data
## Join tmdb data

tmdb_data = pd.concat([tmdb_2000, tmdb_2001], axis=0)

In [4]:
## Export as .csv file
basics_df.to_csv("Data/new_basics.csv", index=False)
title_genres.to_csv("Data/title_genres.csv", index=False)
genres.to_csv("Data/genres.csv", index=False)
tmdb_data.to_csv("Data/tmdb_data.csv", index=False)

In [None]:
## Setup
pymysql.install_as_MySQLdb()

## Create connection
username = 'root'
password = 'root'
db_name = 'movies'
connection = f'mysql+pymysql://{username}:{password}@localhost/{db_name}'

## Create engine
engine = create_engine(connection)

In [None]:
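## Create the database from a server-level connection, since "movies" may not exist yet
## (the engine above already selects "movies", so no USE statement is needed)
from sqlalchemy import text
server_engine = create_engine(f'mysql+pymysql://{username}:{password}@localhost')
with server_engine.begin() as conn:
    conn.execute(text("CREATE DATABASE IF NOT EXISTS movies"))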

In [None]:
## Create table title_basics
from sqlalchemy import text

## Calculate max string lengths for object columns
key_len = int(basics_df['tconst'].fillna('').map(len).max())
title_len = int(basics_df['primaryTitle'].fillna('').map(len).max())

## Create schema dictionary
basics_df_schema = {
    'tconst': String(key_len+1),
    'primaryTitle': Text(title_len+1),
    'startYear': Float(),
    'endYear': Float(),
    'runtimeMinutes': Integer()}

## Save with the schema so tconst is stored as VARCHAR, which MySQL can use as a primary key
basics_df.to_sql('title_basics', engine, dtype=basics_df_schema, index=False, if_exists='replace')

## Set the primary key
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE title_basics ADD PRIMARY KEY (`tconst`);"))

## Show first five rows (MySQL uses LIMIT, not TOP)
q1 = pd.read_sql("SELECT * FROM title_basics LIMIT 5;", engine)
q1

In [None]:
## Run SHOW TABLES
q6 = pd.read_sql("SHOW TABLES IN movies;", engine)
q6