# Motivation
I want to get a little familiar with MySQL, and part of that is using Python to interface with the server that I've set up. This notebook contains some of my initial experimentation.

# Setup
The cells below will help to set up the rest of the notebook. 

I'll start by changing my working directory to the root of the repo. 

In [1]:
%cd ..

d:\data\programming\neural-needle-drop


Next, I'll import the Python modules I'll need.

In [2]:
# Import statements
import json
import os
import traceback
from tqdm import tqdm
import pandas as pd
import mysql.connector
from pathlib import Path

## Setting Up MySQL Connector
I also want to set up the MySQL Connector. 

In [3]:
# Set up the connection to the MySQL server
cnx = mysql.connector.connect(
    user='root', password=os.getenv("MYSQL_PASSWORD"), 
    host='localhost', database='neural-needle-drop')

# Create a cursor 
cursor = cnx.cursor()

# Populating Table
One of the first things I want to do is populate a table within the MySQL database. I'm going to follow [this tutorial](https://dev.mysql.com/doc/connector-python/en/connector-python-example-cursor-transaction.html) on inserting data into a table.

First things first: I need to load in some of the data. I'll load everything into a DataFrame.

In [4]:
# Load in all of the data from theneedledrop_scraping folder
tnd_data_df_records = []
for child_dir in tqdm(list(Path("data/theneedledrop_scraping/").iterdir())):
    if child_dir.is_dir():
        enriched_details_path = child_dir / "enriched_details.json"
        if enriched_details_path.exists():
            with open(enriched_details_path, "r") as json_file:
                tnd_data_df_records.append(json.load(json_file))
tnd_data_df = pd.DataFrame.from_records(tnd_data_df_records)
tnd_data_df.head(1)

100%|██████████| 3974/3974 [00:00<00:00, 4966.58it/s]


Unnamed: 0,videoId,title,lengthSeconds,keywords,channelId,isOwnerViewing,shortDescription,isCrawlable,thumbnail,allowRatings,viewCount,author,isPrivate,isUnpluggedCorpus,isLiveContent,publish_date,inferred_video_type,inferred_review_score,spotify_linkages
0,--YnsRamkzc,Joey Bada$$- 1999 ALBUM REVIEW,418,"[joey badass, joey bada$$, new york, album, 19...",UCt7fwAhXDy3oNFTAzF2o8Pw,False,Listen: http://theneedledrop.com/2012/06/joey-...,True,{'thumbnails': [{'url': 'https://i.ytimg.com/v...,True,365830,theneedledrop,False,False,False,2012-06-18 00:00:00,album_review,7.0,{'album': [{'review_album': '5ra51AaWF3iVebyhl...


Nice - with this data in hand, I should be able to load some of it into the SQL server.

In [5]:
# The query text is identical for every row, so define it once up front.
# REPLACE INTO swaps out any existing row that shares a primary or unique key.
insert_query = """
    REPLACE INTO video_details
        (id, title, length, channel_id, description, 
        view_ct, channel_name, publish_date, url)
        VALUES (%(id)s, %(title)s, %(length)s, %(channel_id)s, 
        %(description)s, %(view_ct)s, %(channel_name)s, %(publish_date)s,
        %(url)s)
"""

# Iterate through each of the rows in the tnd_data_df
for row in tqdm(list(tnd_data_df.itertuples())):

    # We'll specify the data that we'll be injecting into the table
    insert_data = {
        "id": row.videoId,
        "title": row.title,
        "length": row.lengthSeconds,
        "channel_id": row.channelId,
        "description": row.shortDescription,
        "view_ct": row.viewCount,
        "channel_name": row.author,
        "publish_date": row.publish_date,
        "url": f"https://www.youtube.com/watch?v={row.videoId}"
    }

    # Execute this query 
    cursor.execute(insert_query, insert_data)

# Now that we're done with adding all of the data, we'll commit it 
cnx.commit()

100%|██████████| 3974/3974 [00:00<00:00, 5150.05it/s]
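
The loop above sends one `execute` call per row. As a rough sketch of an alternative, the DB-API `executemany` method (which MySQL Connector cursors also provide) takes the query once plus a sequence of parameter sets. The `build_params` and `insert_videos` helpers and the `RecordingCursor` stand-in below are hypothetical names for illustration. The stand-in only records what it is given so the sketch runs without a server, and the column list is trimmed to keep it short.

```python
INSERT_QUERY = """
REPLACE INTO video_details (id, title, length, url)
VALUES (%(id)s, %(title)s, %(length)s, %(url)s)
"""


def build_params(record):
    """Map one scraped record (a dict) to the query's named parameters."""
    return {
        "id": record["videoId"],
        "title": record["title"],
        "length": int(record["lengthSeconds"]),
        "url": f"https://www.youtube.com/watch?v={record['videoId']}",
    }


def insert_videos(cursor, records):
    """Send every record in one executemany call; return how many were sent."""
    params = [build_params(record) for record in records]
    cursor.executemany(INSERT_QUERY, params)
    return len(params)


class RecordingCursor:
    """Hypothetical stand-in for a real cursor: it only records the call."""

    def __init__(self):
        self.calls = []

    def executemany(self, query, seq_of_params):
        self.calls.append((query, list(seq_of_params)))


# Made-up sample records, not taken from the real data
records = [
    {"videoId": "abc123", "title": "Example ALBUM REVIEW", "lengthSeconds": "418"},
    {"videoId": "def456", "title": "Another REVIEW", "lengthSeconds": 245},
]
fake_cursor = RecordingCursor()
sent = insert_videos(fake_cursor, records)
print(sent, fake_cursor.calls[0][1][0]["url"])
```

With a real connection, `insert_videos(cursor, records)` would still need to be followed by `cnx.commit()`.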


# Querying the Server
Now that I've got some data in the table, I want to test querying it. With the help of ChatGPT, I've written the function below.

In [6]:
def query_to_df(query, print_error=False):
    '''Query the active MySQL database and return results in a DataFrame'''

    # Try to return the results as a DataFrame
    try:
        # Execute the query
        cursor.execute(query)

        # Fetch the results 
        res = cursor.fetchall()

        # Return a DataFrame
        return pd.DataFrame(res, columns=[i[0] for i in cursor.description])

    # If we run into an Exception, return None
    except Exception as e:
        if (print_error):
            print(f"Ran into the following error:\n{e}\nStack trace:")
            print(traceback.format_exc())
        return None

Now we can test this function out! The code below will grab *all* of the video details:

In [7]:
# This query will grab all of the data from the table 
all_video_details_query = """SELECT * FROM video_details"""

# Execute the above query 
all_video_details_df = query_to_df(all_video_details_query)

# Show the first 3 rows
all_video_details_df.head(3)

Unnamed: 0,id,title,length,channel_id,description,view_ct,channel_name,publish_date,url
0,__MzJoGgYLk,Jack White - Lazaretto ALBUM REVIEW,391,UCt7fwAhXDy3oNFTAzF2o8Pw,Listen: http://youtu.be/sRbnAxrS3EM\n\nJack Wh...,219457,theneedledrop,2014-06-11,https://www.youtube.com/watch?v=__MzJoGgYLk
1,__PxaWntvhg,Metá Metá - MM3 ALBUM REVIEW,245,UCt7fwAhXDy3oNFTAzF2o8Pw,Listen: https://www.youtube.com/watch?v=FNXUOG...,36095,theneedledrop,2016-12-10,https://www.youtube.com/watch?v=__PxaWntvhg
2,_-DrFVGXHyA,Dave - We're All Alone in This Together ALBUM ...,558,UCt7fwAhXDy3oNFTAzF2o8Pw,Listen: https://www.youtube.com/watch?v=oFqVvj...,298074,theneedledrop,2021-07-26,https://www.youtube.com/watch?v=_-DrFVGXHyA
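
`query_to_df` takes a finished SQL string, which gets awkward once a query depends on a Python value. Below is a minimal sketch of passing parameters separately instead. To keep it runnable without a MySQL server, it uses the standard library's `sqlite3` as a stand-in: both drivers follow the Python DB-API, so `cursor.execute(query, params)` and `cursor.description` behave the same way, but the placeholder syntax differs (`:name` in `sqlite3`, `%(name)s` in MySQL Connector). The `query_to_records` helper, the `demo_` names, and the sample rows are made up for illustration.

```python
import sqlite3


def query_to_records(cursor, query, params=None):
    """Run a parameterized query and return the rows as a list of dicts."""
    cursor.execute(query, params or {})
    column_names = [column[0] for column in cursor.description]
    return [dict(zip(column_names, row)) for row in cursor.fetchall()]


# In-memory stand-in database with a cut-down video_details table
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()
demo_cursor.execute(
    "CREATE TABLE video_details (id TEXT PRIMARY KEY, title TEXT, length INTEGER)"
)
demo_cursor.executemany(
    "INSERT INTO video_details VALUES (:id, :title, :length)",
    [
        {"id": "a", "title": "Short clip", "length": 23},
        {"id": "b", "title": "Album review", "length": 418},
        {"id": "c", "title": "Podcast", "length": 6808},
    ],
)

# The value is passed separately rather than formatted into the SQL string
long_videos = query_to_records(
    demo_cursor,
    "SELECT title, length FROM video_details WHERE length > :min_length ORDER BY length",
    {"min_length": 400},
)
print(long_videos)

demo_cursor.close()
demo_connection.close()
```

Taking the cursor as an argument, rather than reading a global one, also makes the helper easier to try against a throwaway database like this.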


What about a query that'll do a little analysis? What's the longest video I have? 

In [8]:
# This query will determine the title of the longest video
longest_video_query = """
SELECT
    title,
    length
FROM
    video_details
WHERE length = (SELECT MAX(length) FROM video_details)
"""

# Execute the above query
query_to_df(longest_video_query, print_error=True)

Unnamed: 0,title,length
0,TND Podcast #42 ft. Digibro,6808


How about the shortest video? 

In [9]:
# This query will determine the title of the shortest video
shortest_video_query = """
SELECT 
    title,
    length
FROM
    video_details
WHERE length = (SELECT MIN(length) FROM video_details)
"""

# Execute the above query
query_to_df(shortest_video_query, print_error=True)

Unnamed: 0,title,length
0,???,23


What about something a little more complicated: "what's the longest video released in 2018 that is at most 5 minutes (300 seconds) long?"

In [10]:
# This query will determine the video described above
complicated_query = """
WITH only_2018 AS (
    SELECT id, length
    FROM video_details
    WHERE YEAR(publish_date) = 2018
)
SELECT 
    video_details.id,
    video_details.title, 
    video_details.length,
    video_details.url
FROM only_2018
JOIN video_details ON video_details.id = only_2018.id
WHERE video_details.length = (SELECT MAX(length) FROM only_2018 WHERE length <= 300)
"""

# Execute the above query
query_to_df(complicated_query, print_error=True)

Unnamed: 0,id,title,length,url
0,7v0mYN2k16g,"Valee - GOOD Job, You Found Me EP REVIEW",299,https://www.youtube.com/watch?v=7v0mYN2k16g
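
To double-check what the query is doing, the same selection can be sketched in plain Python: keep the videos published in the target year, find the largest length that doesn't exceed the cap, then return every video from that year with that length. The `longest_capped_video` function and the sample records are hypothetical and not taken from the real table.

```python
from datetime import date


def longest_capped_video(videos, year, max_length):
    """Return the videos from `year` whose length is the largest one <= max_length."""
    # Equivalent of the only_2018 CTE
    in_year = [video for video in videos if video["publish_date"].year == year]

    # Equivalent of SELECT MAX(length) ... WHERE length <= max_length
    capped_lengths = [
        video["length"] for video in in_year if video["length"] <= max_length
    ]
    if not capped_lengths:
        return []
    target_length = max(capped_lengths)

    # Equivalent of the outer WHERE: any ties all come back, as in the SQL
    return [video for video in in_year if video["length"] == target_length]


# Made-up sample records
videos = [
    {"id": "v1", "length": 299, "publish_date": date(2018, 3, 1)},
    {"id": "v2", "length": 301, "publish_date": date(2018, 5, 9)},
    {"id": "v3", "length": 300, "publish_date": date(2017, 1, 2)},
    {"id": "v4", "length": 120, "publish_date": date(2018, 7, 7)},
]
result = longest_capped_video(videos, 2018, 300)
print([video["id"] for video in result])
```

Here `v2` is dropped for being over the cap and `v3` for being from the wrong year, which leaves `v1` as the longest remaining video.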


# Closing the Cursor and Connection
Once we're finished with things, we ought to close out both the cursor and the connection.

In [11]:
cursor.close()
cnx.close()
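
If a cell raises an error before reaching this point, the `close()` calls never run. One way to guard against that is `contextlib.closing` from the standard library, which calls `close()` on exit even when an exception occurs. This is only a sketch: `FakeResource` is a hypothetical stand-in for the connection and cursor so the example runs without a server.

```python
from contextlib import closing


class FakeResource:
    """Hypothetical stand-in for a connection or cursor; it only tracks close()."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


demo_cnx = FakeResource("connection")
demo_cursor = FakeResource("cursor")

try:
    # Context managers exit in reverse order: the cursor closes before the connection
    with closing(demo_cnx), closing(demo_cursor):
        raise RuntimeError("simulated failure in the middle of the work")
except RuntimeError as error:
    print(f"Caught: {error}")

print(demo_cursor.closed, demo_cnx.closed)
```

With the real objects, the same pattern would wrap the `mysql.connector.connect(...)` call from the setup cell and the `cnx.cursor()` call in `closing(...)`.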