<span style="font-size: 36px;">W4111_Fall_2024_003 - Introduction to Databases:<br>Homework 5 - Both Tracks
<br>Game of Thrones Data and IMDB</span>

# Homework Overview and Definition

## HW5/Project Reminder

| <img src="../../../assets/overall-system.jpg"> |
| :---: |
| __Overall System__ |

The preceding diagram provides an overview of HW 5/Project. We discussed this diagram in the 18-April lecture. There are three parts to the homework/project:
1. Both the programming and non-programming tracks implement "common data engineering" tasks. The tasks implement simple extract-transform-load/extract-load-transform functions on the IMDB and Game of Thrones datasets. This involves:
    1. Reading CSV and JSON input files.
    2. Loading data into MySQL, MongoDB and Neo4j.
    3. Some simple transformation and cleanup of the loaded data.
2. The non-programming track implements a Jupyter notebook that builds a data warehouse/data lake, and writes queries against the warehouse that produce input data for visualizations.
3. The programming track implements additional functions in the template full-stack web application to provide a [RESTful API](https://restfulapi.net/) to the loaded data.

This notebook defines the HW common to both tracks and is the template for submission. See Ed for full submission instructions.

## HW5/Project Both Tracks -- Overview

This section is a succinct overview of the assignment for both tracks. Sections below provide detailed instructions and an implementation template for the specific tasks that comprise the HW.

Students complete the implementation of this notebook to build a data lake. The data lake uses 3 databases:
1. MySQL
2. MongoDB
3. Neo4j

For MongoDB and Neo4j, the students can use either a local instance of the database management systems (if they have installed one) or use a cloud version (i.e. Aura for Neo4j and Atlas for MongoDB).

The HW is a series of logical, simple ETL/ELT pipelines. There is one pipeline for each of the loaded files. Since Professor Ferguson copied the input data into the HW directory, he implemented the "Extract" phase of ETL/ELT. 
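The shape of one such pipeline can be sketched as follows. This is a minimal illustration, not the homework solution: it assumes a tiny inline CSV instead of a real input file, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a database server. The sample rows, the table name `name_basics_demo` and the column names are made up for the example.

```python
import csv
import io
import sqlite3

# Extract: a tiny inline CSV standing in for a file such as name_basics.csv.
# The rows and column names are made up for illustration.
raw_csv = """nconst,primaryName,birthYear
nm_demo_1,Actor One,1970
nm_demo_2,Actor Two,
"""


def transform(row):
    """Clean one row: strip whitespace and turn empty strings into None."""
    cleaned = {key: (value.strip() or None) for key, value in row.items()}
    # Convert the year to an integer when it is present.
    if cleaned["birthYear"] is not None:
        cleaned["birthYear"] = int(cleaned["birthYear"])
    return cleaned


rows = [transform(r) for r in csv.DictReader(io.StringIO(raw_csv))]

# Load: sqlite3 is only a stand-in for MySQL here.
con = sqlite3.connect(":memory:")
con.execute(
    "create table name_basics_demo "
    "(nconst text primary key, primaryName text, birthYear integer)"
)
con.executemany(
    "insert into name_basics_demo values (:nconst, :primaryName, :birthYear)",
    rows,
)
con.commit()

loaded = con.execute(
    "select nconst, primaryName, birthYear from name_basics_demo order by nconst"
).fetchall()
print(loaded)
```

In the notebook itself the load step targets MySQL, MongoDB or Neo4j, but the extract, clean and load stages follow the same order.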

| <img src="../../../assets/common-etl.jpg" width="900px"> |
| :---: |
| __ETL/ELT Notebook and Pipelines__ |

The files that the notebook transforms/loads are:
- IMDB: All of these files are CSV files and the pipelines load the data into MySQL.
    - name_basics.csv: basic IMDB information about actors in Game of Thrones.
    - title_basics.csv: basic IMDB information about _all_ TV episodes, movies, etc. in which the actors have appeared. This includes non-Game of Thrones movies and episodes.
    - title_principals.csv: IMDB information that is logically an _associative entity_ connecting entities in _name\_basics_ and _title\_basics_.
    - title_ratings.csv: IMDB information on viewer ratings for the episodes.
- Game of Thrones: All of these files are JSON files. Some data goes into MongoDB and some data goes into Neo4j.
    - characters.json: information about characters in Game of Thrones. This file also includes relationships/links between characters, e.g. _siblings, killedBy, parents, married, ... ..._ The basic character information goes into MongoDB and Neo4j. The relationships become edges in Neo4j.
    - groups.json: groups in Game of Thrones (e.g. _Stark, Lannister, Tully, ... ..._) and characters in the groups. This information goes into Neo4j.
    - locations.json: information about locations and sublocations, e.g. _(The North, Winterfell)._ This information goes into MongoDB.
    - episodes.json: information about episodes and scenes, including which characters appear in which scenes.
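
Splitting a character record into basic properties (for MongoDB and Neo4j nodes) and relationships (for Neo4j edges) might look like the sketch below. The sample record is made up, and the field names `characterName` and `houseName` are assumptions about the layout of characters.json. Check the actual file before relying on them.

```python
import json

# A made-up record shaped like the description above. The field names
# characterName and houseName are assumptions about characters.json.
sample = json.loads("""
{
  "characterName": "Character A",
  "houseName": "House X",
  "siblings": ["Character B", "Character C"],
  "killedBy": ["Character D"]
}
""")

# Fields treated as relationships; everything else is a node property.
RELATIONSHIP_FIELDS = ("siblings", "killedBy", "parents", "married")


def split_character(record):
    """Return (node_properties, edges) for one character record."""
    properties = {
        key: value for key, value in record.items()
        if key not in RELATIONSHIP_FIELDS
    }
    edges = []
    for field in RELATIONSHIP_FIELDS:
        # A character may lack some relationship fields entirely.
        for other in record.get(field, []):
            edges.append((record["characterName"], field, other))
    return properties, edges


properties, edges = split_character(sample)
print(properties)
print(edges)
```

Each `(source, relationship, target)` triple corresponds to one edge to create in Neo4j, while `properties` is the document that would go into MongoDB.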

| <img src="../../../assets/conceptual_data_model.jpg"> |
| :---: |
| __Data Lake Conceptual Model__ |

The preceding diagram depicts the conceptual data model (entity types and relationships). Relationships in MySQL are _foreign keys_ or _associative entities._ Relationships in Neo4j are edges with labels. Relationships in MongoDB are "logical" and live in the entity data. That is, the entity has a data value that acts as a foreign key, but MongoDB does not enforce foreign key constraints. Relationships that span databases are also logical.
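
To make the idea of an associative entity concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for MySQL. The column names `nconst`, `tconst`, `primaryName`, `primaryTitle` and `category`, and all of the rows, are assumptions chosen for illustration. The homework schema may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
con.execute("pragma foreign_keys = on")

con.executescript("""
create table name_basics  (nconst text primary key, primaryName text);
create table title_basics (tconst text primary key, primaryTitle text);

-- Associative entity: one row per (person, title) pairing.
create table title_principals (
    nconst   text not null references name_basics(nconst),
    tconst   text not null references title_basics(tconst),
    category text,
    primary key (nconst, tconst)
);

insert into name_basics  values ('nm_demo_1', 'Actor One');
insert into title_basics values ('tt_demo_1', 'Episode One');
insert into title_principals values ('nm_demo_1', 'tt_demo_1', 'actor');
""")

# A row that points at a title that does not exist is rejected.
try:
    con.execute(
        "insert into title_principals values ('nm_demo_1', 'tt_missing', 'actor')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The database refuses the dangling reference. A "logical" relationship in MongoDB, or one that spans databases, gets no such check, so the pipeline code has to keep the references consistent.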

# Initialize Environment

__Execute the cells in this section. The cells are the same as from the data loading notebook I prepared for you.__

## General Python Packages

In [1]:
import copy

In [2]:
import json

In [3]:
import pandas

In [4]:
# You should have installed the packages for previous homework assignments
#
import pymysql
import sqlalchemy

In [5]:
# You may have to do %pip installs to use the packages below.
#
# %pip install "pymongo[srv]"
#
import pymongo

In [6]:
# You may have to do %pip installs to use the packages below.
#
# %pip install neo4j
#
import neo4j

In [7]:
import numpy

## MySQL

### ipython-sql

In [8]:
# You have installed and configured ipython-sql for previous assignments.
# https://pypi.org/project/ipython-sql/
#
%load_ext sql

In [9]:
# Make sure that you set these values to the correct values for your installation and 
# configuration of MySQL
#
db_user = "root"
db_password = "dbuserdbuser"

In [10]:
# Create the URL for connecting to the database.
# Do not worry about the local_infile=1, I did that for wizard reasons that you should not have to use.
#
db_url = f"mysql+pymysql://{db_user}:{db_password}@localhost?local_infile=1"
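
If your password contains characters such as `@`, `/` or `#`, the URL will not parse correctly unless those characters are percent-encoded. A minimal sketch of one way to handle this, using only the standard library; `demo_user` and `demo_password` are hypothetical values, not real credentials:

```python
from urllib.parse import quote_plus

# Hypothetical credentials; the password contains characters that are
# special inside a URL.
demo_user = "root"
demo_password = "p@ss/word#1"

# Percent-encode the user and password before placing them in the URL.
demo_url = (
    f"mysql+pymysql://{quote_plus(demo_user)}:{quote_plus(demo_password)}"
    "@localhost?local_infile=1"
)
print(demo_url)
```

You only need this if the connection fails with a password that works in other tools.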

In [11]:
# Initialize ipython-sql
#
%sql $db_url

In [14]:
# This is a hack to fix a version incompatibility problem.
#
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

In [15]:
# This assumes you used the name "db_book" for the sample database.
#
%sql select * from db_book.student

 * mysql+pymysql://root:***@localhost?local_infile=1
15 rows affected.


ID,name,dept_name,tot_cred
128,Zhang,Comp. Sci.,102.0
12345,Shankar,Comp. Sci.,32.0
19991,Brandt,History,80.0
23121,Chavez,Finance,110.0
44553,Peltier,Physics,56.0
45678,Levy,Physics,46.0
54321,Williams,Comp. Sci.,54.0
55739,Sanchez,Music,38.0
70557,Snow,Physics,0.0
76543,Brown,Comp. Sci.,58.0


### PyMySQL

In [16]:
default_mysql_conn = pymysql.connect(
    user=db_user,
    password=db_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True
)

In [17]:
def get_mysql_con(
    user=db_user,
    password=db_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True):

    # Pass the caller's arguments through instead of repeating the defaults.
    result = pymysql.connect(
        user=user,
        password=password,
        host=host,
        port=port,
        cursorclass=cursorclass,
        autocommit=autocommit
    )

    return result

In [20]:
con = get_mysql_con()
cur = con.cursor()

# execute() returns only the row count; fetchall() returns the rows.
cur.execute("select * from db_book.student")
result = cur.fetchall()
con.close()

result_df = pandas.DataFrame(result)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102.0
1,12345,Shankar,Comp. Sci.,32.0
2,19991,Brandt,History,80.0
3,23121,Chavez,Finance,110.0
4,44553,Peltier,Physics,56.0
5,45678,Levy,Physics,46.0
6,54321,Williams,Comp. Sci.,54.0
7,55739,Sanchez,Music,38.0
8,70557,Snow,Physics,0.0
9,76543,Brown,Comp. Sci.,58.0


### SQLAlchemy

In [21]:
from sqlalchemy import create_engine

In [23]:
default_engine = create_engine(db_url)
result_df = pandas.read_sql(
    "select * from db_book.student", con=default_engine
)
result_df

Unnamed: 0,ID,name,dept_name,tot_cred
0,128,Zhang,Comp. Sci.,102.0
1,12345,Shankar,Comp. Sci.,32.0
2,19991,Brandt,History,80.0
3,23121,Chavez,Finance,110.0
4,44553,Peltier,Physics,56.0
5,45678,Levy,Physics,46.0
6,54321,Williams,Comp. Sci.,54.0
7,55739,Sanchez,Music,38.0
8,70557,Snow,Physics,0.0
9,76543,Brown,Comp. Sci.,58.0


## MongoDB Atlas

In [24]:
# If you use MongoDB Atlas, set the information below to the correct IP address, user ID and password.
#
# You will need to replace with your settings.
#
mongodb_user = "dff9"
mongodb_pw = "8MAEOqViwkW0Eo3H"
mongo_db_url = f"mongodb+srv://{mongodb_user}:{mongodb_pw}@cluster0.t8qdk.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"

In [29]:
# If you use local MongoDB, run this cell.
# With no URL, MongoClient connects to localhost on the default port 27017.
#
mongo_db_url = None

In [30]:
mongo_db_url

In [31]:
mongo_client = pymongo.MongoClient(mongo_db_url)

In [32]:
# Your list of databases will be different.
#
list(mongo_client.list_databases())

[{'name': 'Classic_Models', 'sizeOnDisk': 131072, 'empty': False},
 {'name': 'F24_GoT', 'sizeOnDisk': 2203648, 'empty': False},
 {'name': 'S24_GoT', 'sizeOnDisk': 1093632, 'empty': False},
 {'name': 'admin', 'sizeOnDisk': 40960, 'empty': False},
 {'name': 'config', 'sizeOnDisk': 49152, 'empty': False},
 {'name': 'local', 'sizeOnDisk': 98304, 'empty': False}]

## Neo4j

In [37]:
# TODO Student -- This should be the same as what you provided in the data loading notebook

# You need to have created a Neo4j Aura DB with a user ID and information.
# Please make sure you copied the information for connecting.
# You can download this information when you create your instance.
# The download will be a text file with the information below.
# You will have to modify the file you download to wrap the strings with "
#
# Wait 60 seconds before connecting using these details, or login to https://console.neo4j.io to validate the Aura Instance is available
NEO4J_URI="neo4j+s://377ecc70.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="S-1t6bUmGMOst4YI2M3Flsll8sE61hiD7b79KD8eXWQ"
AURA_INSTANCEID="377ecc70"
AURA_INSTANCENAME="Instance02"




In [40]:
from neo4j import GraphDatabase

# URI examples: "neo4j://localhost", "neo4j+s://xxx.databases.neo4j.io"
URI = NEO4J_URI
AUTH = (NEO4J_USERNAME, NEO4J_PASSWORD)

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("I got here without an exception. So, it worked.")

I got here without an exception. So, it worked.


__Note:__ You should have already created a Neo4j account and loaded the Movie Database.

In [41]:
# The following code assumes that you followed the tutorial for the Movie Database and loaded the data.
#
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # driver.verify_connectivity()
    
    records, summary, keys = driver.execute_query(
        "MATCH (p:Person) where p.name='Tom Hanks' RETURN p.name AS name, p.born as birth_year ",
    )
    
    # Convert each result record to a plain dictionary.
    #
    person_records = [dict(person) for person in records]
    
    # Summary information
    print("The query `{query}` returned {records_count} records in {time} ms.".format(
        query=summary.query, records_count=len(records),
        time=summary.result_available_after,
    ))

The query `MATCH (p:Person) where p.name='Tom Hanks' RETURN p.name AS name, p.born as birth_year ` returned 1 records in 44 ms.


In [42]:
results_df = pandas.DataFrame(person_records)
results_df.head(10)

Unnamed: 0,name,birth_year
0,Tom Hanks,1956


# IMDB Data and Pipelines

## Create the Database

In [43]:
%sql drop schema if exists hw5_common;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [45]:
%sql create database hw5_common;

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

## Check Path and Files

In [51]:
imdb_dir = "../../../data/IMDB"
%ls $imdb_dir

got_title_basics.csv         title_basics.csv
name_basics.csv              title_principals.csv
name_basics_professions.csv  title_ratings.csv
professions.csv


- Task details will go here.

# Non-Programming

## Build the Data Warehouse

- I will give you the SQL schema to build.

- You will write queries and scripts that load the schema from IMDB data, data in MongoDB and data in Neo4j.

In [None]:
episode_info_df.to_sql(
    "episodes_characters",
    schema="f24_project_solution",
    index=False,
    if_exists="replace",
    con=default_engine
)

In [None]:
%sql use f24_project_solution

## Some Visualization

### Overview

- I will give you a set of subsections that produce visualization.

- I will also give you examples of what the input Pandas data frames will look like.

- You will write SQL queries to produce the data frames.

- You can then use the code I provide to produce the visualizations.

### GoT Episode Ratings

In [None]:
%%sql


In [None]:
%%sql



In [None]:
episodes_ratings.DataFrame()

In [None]:
%%sql

bar_chart_1 <<
select
    season_num,
    episode_num,
    concat(season_num, '.', episode_num) as season_episode,
    avg_rating
from
    got_episode_title_basics join title_ratings
    using(tconst)
order by season_num, episode_num;

In [None]:
df = pandas.DataFrame(bar_chart_1)
df

In [None]:
# Assign colors to each season
season_colors = {
    1: 'blue',
    2: 'green',
    3: 'red',
    4: 'purple',
    5: 'orange',
    6: 'cyan',
    7: 'magenta',
    8: 'brown'
}

# Map colors to the seasons in the DataFrame
df['color'] = df['season_num'].map(season_colors)

df

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
# Plot the data, using the df and color column prepared in the cells above
plt.figure(figsize=(15, 6))
plt.bar(df['season_episode'], df['avg_rating'], color=df['color'], width=0.6)

# Add labels and title
plt.xlabel('Season.Episode', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.title('Average Ratings by Episode', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()

What is a conclusion we could derive from this graph?

My observation is that:
- The last two or three episodes of a season were the highest rated.<br><br>
- Except for season 8, where the episodes were not well rated and got worse over the season.

Season 8 was very highly anticipated. I can remember the excitement. All I can say now is,

| <img src="epic_fail.jpg" width="700px;"> |
| :---: |
| __Game of Thrones Season 8__ |

## Seasons, Episodes and Locations

In [None]:

df_graph_data

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors

# Extract valid locations dynamically from the DataFrame shown in the previous cell
valid_locations = df_graph_data['sceneLocation'].unique()

# Create the graph
G = nx.Graph()

# Add one edge per (season, scene location) row in the DataFrame
for _, row in df_graph_data.iterrows():
    G.add_edge(f"Season {row['seasonNum']}", row['sceneLocation'])

# Remove nodes with degree 0
nodes_to_remove = [node for node, degree in G.degree() if degree == 0]
G.remove_nodes_from(nodes_to_remove)

# Compute node degrees
degree_dict = dict(G.degree())

# Create a colormap for degrees
degree_cmap = cm.viridis
degree_norm = mcolors.Normalize(vmin=min(degree_dict.values()), vmax=max(degree_dict.values()))

# Assign colors to nodes and classify them as season or location
season_nodes = [node for node in G.nodes if node.startswith("Season")]
location_nodes = [node for node in G.nodes if node in valid_locations]

season_node_colors = [degree_cmap(degree_norm(degree_dict[node])) for node in season_nodes]
location_node_colors = [degree_cmap(degree_norm(degree_dict[node])) for node in location_nodes]

# Draw the graph
plt.figure(figsize=(14, 10))
pos = nx.spring_layout(G)

# Draw season nodes (circles)
nx.draw_networkx_nodes(
    G, pos,
    nodelist=season_nodes,
    node_size=1000,  # Set node size to 1000
    node_color=season_node_colors,
    node_shape='o',  # Circle
    label='Seasons'
)

# Draw location nodes (triangles)
nx.draw_networkx_nodes(
    G, pos,
    nodelist=location_nodes,
    node_size=1000,  # Set node size to 1000
    node_color=location_node_colors,
    node_shape='^'  # Triangle
)

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color="gray")

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=10, font_weight="bold")

# Add a unified colorbar for node degrees
sm = cm.ScalarMappable(cmap=degree_cmap, norm=degree_norm)
sm.set_array([])
cbar = plt.colorbar(sm, ax=plt.gca())
cbar.set_label("Node Degree", fontsize=12)

plt.title("Node-Edge Chart: Seasons and Scene Locations", fontsize=16)
plt.show()


### Scene Locations

The following code produces a nested pie chart of scene time by location and sublocation. You supply the SQL query.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Activate ipython-sql magic and connect to the MySQL database.
# Use db_url from the MySQL section; a URL with a masked "***" password cannot connect.
%load_ext sql
%sql $db_url

# SQL query to fetch the data
query = """

"""

# Execute the query and load it into a pandas DataFrame
df = %sql $query
df = df.DataFrame()  # Convert the ResultSet into a pandas DataFrame

# Aggregate total time per location for the inner pie
location_totals = df.groupby('location')['total_time'].sum()

# Prepare data for the inner and outer pie charts
outer_labels = df['sublocation']
outer_sizes = df['total_time']
inner_labels = location_totals.index
inner_sizes = location_totals.values

# Define colors for consistency between outer and inner pie charts
colors = plt.cm.tab20.colors

# Plot the nested pie chart
fig, ax = plt.subplots(figsize=(12, 12))
ax.axis('equal')  # Equal aspect ratio ensures the pie chart is circular

# Inner pie (location totals with radius 0.9)
inner_pie, texts = ax.pie(
    inner_sizes, 
    radius=0.9,  # Set radius to 0.9
    labels=inner_labels,  # Show location labels outside
    labeldistance=1.15, 
    colors=colors[:len(inner_labels)],
    textprops={'fontsize': 10, 'weight': 'bold'}  # Set label properties
)

# Set label colors to match the pie slices
for text, color in zip(texts, colors[:len(inner_labels)]):
    text.set_color(color)

# Outer pie (sublocation breakdowns)
outer_pie, _ = ax.pie(
    outer_sizes, 
    radius=1, 
    labels=None,  # No labels on the outer pie
    colors=colors[:len(outer_labels)],
    wedgeprops=dict(width=0.3, edgecolor='w')  # Wedge width for the outer pie
)

# Create a custom legend for sublocations in the format location.sublocation
custom_labels = df['location'] + "." + df['sublocation']
legend_patches = [
    Patch(facecolor=colors[i % len(colors)], label=custom_labels.iloc[i]) 
    for i in range(len(custom_labels))
]
ax.legend(
    handles=legend_patches,
    title="Locations.Sublocations",
    loc="center left",
    bbox_to_anchor=(1.05, 0, 0.4, 1),
    fontsize=8,  # Smaller font size for the legend
    title_fontsize=10
)

# Title and display
plt.title("Nested Pie Chart: Scene Location and Sublocation Times", fontsize=16)
plt.show()
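
One detail worth checking in the chart code above: the inner ring is built with `groupby('location')`, which orders locations by key, while the outer ring follows the row order of the query result. The two rings line up only if the rows arrive sorted by location. The sketch below illustrates this with plain Python and made-up rows; the column names `location`, `sublocation` and `total_time` come from the plotting code, and everything else is an assumption.

```python
from collections import defaultdict

# Made-up rows shaped like the query result (location, sublocation, total_time).
rows = [
    {"location": "Location B", "sublocation": "Sub 1", "total_time": 40},
    {"location": "Location A", "sublocation": "Sub 2", "total_time": 25},
    {"location": "Location A", "sublocation": "Sub 3", "total_time": 35},
]

# Sort the way `order by location, sublocation` would in SQL.
rows.sort(key=lambda row: (row["location"], row["sublocation"]))

# Inner ring: total per location, in sorted key order like groupby().sum().
totals = defaultdict(int)
for row in rows:
    totals[row["location"]] += row["total_time"]
inner = sorted(totals.items())

# Outer ring: one wedge per row, in row order.
outer = [(row["location"], row["total_time"]) for row in rows]

print(inner)
print(outer)
```

With the rows sorted, the outer wedges for each location sit directly over that location's inner wedge. Ending the query with `order by location, sublocation` is one way to get that ordering.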


### Time on Screen Together

In [None]:
# %pip install seaborn

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Activate ipython-sql magic and connect to the MySQL database.
# Use db_url from the MySQL section; a URL with a masked "***" password cannot connect.
%load_ext sql
%sql $db_url

# SQL query to fetch the data
query = """

"""

# Execute the query and load it into a pandas DataFrame
df = %sql $query
df = df.DataFrame()  # Convert the ResultSet into a pandas DataFrame

# Ensure total_screen_time is numeric
df['total_screen_time'] = pd.to_numeric(df['total_screen_time'], errors='coerce')

# Pivot the DataFrame to create a heatmap-compatible format
heatmap_data = df.pivot(index='characterName1', columns='characterName2', values='total_screen_time')

# Fill missing values with 0 (optional)
heatmap_data = heatmap_data.fillna(0)

# Create the heatmap using seaborn
plt.figure(figsize=(12, 10))
sns.heatmap(
    heatmap_data,
    cmap="YlGnBu",  # Color palette
    annot=True,  # Display values on the heatmap
    fmt=".0f",  # Integer format for annotation
    cbar_kws={"label": "Total Screen Time (seconds)"}
)

# Title and axis labels
plt.title("Character Pair Total Screen Time Heatmap", fontsize=16)
plt.xlabel("Character Name 2", fontsize=12)
plt.ylabel("Character Name 1", fontsize=12)

# Show the heatmap
plt.tight_layout()
plt.show()
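
The pivot step in the code above turns "long" rows (one row per character pair) into a "wide" matrix. A minimal sketch of the same reshaping in plain Python, with made-up names and times, shows the data frame shape the query needs to produce:

```python
# Long-format rows shaped like the query result; the names and times are made up.
rows = [
    {"characterName1": "Character A", "characterName2": "Character B", "total_screen_time": 120},
    {"characterName1": "Character A", "characterName2": "Character C", "total_screen_time": 45},
    {"characterName1": "Character B", "characterName2": "Character C", "total_screen_time": 30},
]

row_labels = sorted({row["characterName1"] for row in rows})
col_labels = sorted({row["characterName2"] for row in rows})

# Start with zeros, which plays the role of fillna(0).
matrix = {r: {c: 0 for c in col_labels} for r in row_labels}
for row in rows:
    matrix[row["characterName1"]][row["characterName2"]] = row["total_screen_time"]

for label in row_labels:
    print(label, [matrix[label][c] for c in col_labels])
```

Note that `DataFrame.pivot` raises a `ValueError` if the same (index, column) pair appears more than once, so the query should return exactly one row per character pair.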


## Terrifying Networks

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

def display_graph(paths):
    """
    Display a graph image based on the paths retrieved from Neo4j.
    
    :param paths: List of paths, where each path is a Neo4j path object.
    """
    # Initialize a NetworkX graph
    G = nx.DiGraph()  # Use DiGraph for directed relationships like 'killed'

    # Process each path to extract nodes and relationships
    for path in paths:
        # Extract nodes and relationships from the path
        for i in range(len(path.nodes) - 1):
            source = path.nodes[i]["characterName"]
            target = path.nodes[i + 1]["characterName"]
            relationship = path.relationships[i].type  # e.g., 'killed'

            # Add nodes and edges to the graph
            G.add_node(source, label=source)
            G.add_node(target, label=target)
            G.add_edge(source, target, label=relationship)

    # Draw the graph
    plt.figure(figsize=(12, 8))
    pos = nx.spring_layout(G)  # Layout for the graph

    # Draw nodes
    nx.draw_networkx_nodes(G, pos, node_size=700, node_color="lightblue", edgecolors="black")
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, arrowstyle='-|>', arrowsize=20)

    # Draw labels for nodes
    nx.draw_networkx_labels(G, pos, font_size=10, font_color="black", font_weight="bold")

    # Draw edge labels for relationships
    edge_labels = nx.get_edge_attributes(G, "label")
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_size=8, font_color="red")

    # Title and show plot
    plt.title("Graph Visualization of Character Relationships", fontsize=16)
    plt.axis("off")
    plt.show()

# Example usage
# Assuming 'paths' contains the result from Neo4j (replace with actual paths data)
display_graph(paths)
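
To see what `display_graph` expects without connecting to Neo4j, the sketch below builds a stand-in path object and extracts the same `(source, relationship, target)` triples. The helper `path_to_edges` and the object `fake_path` are hypothetical names introduced for this illustration; real paths come from a Cypher query that returns path values.

```python
from types import SimpleNamespace


def path_to_edges(path):
    """Return (source, relationship, target) triples for one path-like object."""
    edges = []
    for i in range(len(path.nodes) - 1):
        edges.append((
            path.nodes[i]["characterName"],
            path.relationships[i].type,
            path.nodes[i + 1]["characterName"],
        ))
    return edges


# A hypothetical two-hop path: A -killed-> B -killed-> C.
fake_path = SimpleNamespace(
    nodes=[
        {"characterName": "Character A"},
        {"characterName": "Character B"},
        {"characterName": "Character C"},
    ],
    relationships=[
        SimpleNamespace(type="killed"),
        SimpleNamespace(type="killed"),
    ],
)

edges = path_to_edges(fake_path)
print(edges)
```

Like `display_graph`, this follows the order in which the path lists its nodes. If your Cypher pattern is undirected, that order may not match the stored direction of each relationship.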


# Programming Track

- You will implement a set of resources (and paths) for the conceptual data model above.

