<h1 style='font-family: "SF Atarian System";'>ytscrapeqt.py</h1>

<h3>A YouTube transcript scraping program and its associated MariaDB / MySQL database:</h3>
<h2 style="font-family: 'SF Atarian System';"><b>ytscrape@192.168.1.250:3306</b></h2>

<p>This Jupyter Notebook documents the research and development of that database, and of the utilities (Jupyter Notebooks, PHP, and Python 3 with PyQt5) used to manage and query it.</p>

In [1]:
###
# This is a JuPyTer Notebook for `ytscrapedb` a database created and grown by `ytscrapeqt` a python3 based
# Selenium WebDriver script & GUI, with shell based .LOG file. 
###

import os
import pymysql
import pandas as pd

# Read the connection settings from the environment, falling back to this project's defaults
host = os.getenv('MYSQL_HOST', '192.168.1.250')
port = int(os.getenv('MYSQL_PORT', '3306'))
user = os.getenv('MYSQL_USER', 'pyscrape')
password = os.getenv('MYSQL_PASSWORD', 'ytscrape')
database = os.getenv('MYSQL_DATABASE', 'ytscrape')

conn = pymysql.connect(
    host=host,
    port=port,
    user=user,
    passwd=password,
    db=database,
    charset='utf8mb4')

## This loads every Transcripts row for one channel into a DataFrame
df = pd.read_sql_query("SELECT * FROM Transcripts WHERE `Channel Name` = 'theoria apophasis'", conn)
df

Unnamed: 0,idTranscripts,Video URL,Video Title,Video Description,Video Transcript,Channel Name
0,521,https://www.youtube.com/watch?v=vlNRe57hNvE,Giant Monster Magnet: Painful? Dangerous? Funn...,Giant Monster Magnet: Painful? Dangerous? Funn...,mmm funny or dangerous or both now I was doing...,Theoria Apophasis
1,522,https://www.youtube.com/watch?v=gDQlCvsJnK8,Never seen Before: BENDING LIGHT from Precisio...,Never seen Before: BENDING LIGHT from Precisio...,Skipped,Theoria Apophasis
2,523,https://www.youtube.com/watch?v=R7NUj_qDuzQ,:thinking face: MAGNET CRASH? :grinning face w...,"IF YOU LIKE THESE VIDEOS, YOU CAN MAKE A SMALL...",okay I'm actually using dental floss two inch ...,Theoria Apophasis
3,524,https://www.youtube.com/watch?v=1ZeCIejT2NY,VIDEO 111 UNCOVERING SECRETS OF MAGNETISM. Mag...,VIDEO 111 UNCOVERING SECRETS OF MAGNETISM. Ma...,okay a little mystery here for you let's see i...,Theoria Apophasis
4,525,https://www.youtube.com/watch?v=82uRtaJSw54,Magnet Awesome!! Magnet Magic!! FREE Coolest T...,Magnet Awesome!! Magnet Magic!! FREE Coolest T...,okay I've got all sorts of magnetic tips and t...,Theoria Apophasis
...,...,...,...,...,...,...
3467,6734,https://www.youtube.com/watch?v=wncF5gsE-io,Happy Photographer: Shipped out free 18-200 Ni...,Happy Photographer: Shipped out free 18-200 Ni...,hello greetings I was gonna date you when you'...,theoria apophasis
3468,6735,https://www.youtube.com/watch?v=DQQUDX0OYwA,Angry Photographer: Close up look at the Carl ...,Angry Photographer: Close up look at the Carl ...,okay here we go more than a few people asked m...,theoria apophasis
3469,6736,https://www.youtube.com/watch?v=d_5eyKG03fk,:winking face with tongue:Photography? SEX SEL...,:winking face with tongue:Photography? SEX SEL...,talk about something that pisses me off actual...,theoria apophasis
3470,6737,https://www.youtube.com/watch?v=uu72icBaiiE,The Angry Photographer: $40 MUST BUY to improv...,The Angry Photographer: $40 MUST BUY to improv...,yo another video from the angry photographer h...,theoria apophasis


<p style='font-family: "SF Atarian System"; font-size: 22px;'>This is a markdown cell... Here HTML and CSS can be used to organize and style text. The cells above and below are 'code' cells.</p>

<h3>Step One is to Query the Database: </h3>
<h2>ytscrape@192.168.1.250:3306</h2>
<p>The table we want is called <b>Transcripts</b></p>
<p><b>Transcripts</b> (<i>or <b>ytscrape.Transcripts</b> in project namespace</i>), is described by 6 columns:
    <ol>
        <li>`idTranscripts`</li>
        <li><i><b>`Video Title`</b></i></li>
        <li>`Video URL`</li>
        <li>`Channel Name`</li>
        <li><i><b>`Video Description`</b></i></li>
        <li><i><b>`Video Transcript`</b></i></li>
    </ol>
</p>
<p>Columns marked in bold italic are fulltext indexed and can accept Queries using fulltext index search functionality; e.g. using "MATCH(`Column Name`) AGAINST('search term')" syntax</p>
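
<p>The fulltext search described above can be sketched as a small query builder. This is a minimal sketch, not the project's own code: the helper name build_fulltext_query and the search term "magnet" are assumptions made for illustration, and the %s placeholder assumes the pymysql driver used in this notebook.</p>

```python
# Columns the notebook describes as fulltext indexed.
FULLTEXT_COLUMNS = {"Video Title", "Video Description", "Video Transcript"}


def build_fulltext_query(column, search_term):
    """Return (sql, params) for a MATCH ... AGAINST search on one indexed column.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if column not in FULLTEXT_COLUMNS:
        raise ValueError(f"{column!r} is not a fulltext indexed column")
    # The column name is checked against the set above; the search term is
    # passed as a parameter so the database driver escapes it.
    sql = f"SELECT * FROM Transcripts WHERE MATCH(`{column}`) AGAINST(%s)"
    return sql, (search_term,)


sql, params = build_fulltext_query("Video Title", "magnet")
print(sql)
print(params)

# With a live connection (not run here):
# df_hits = pd.read_sql_query(sql, conn, params=params)
```

<p>Passing the search term through params rather than formatting it into the string keeps user input out of the SQL text.</p>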

In [None]:
## The Query Syntax is df = pd.read_sql_query("SQL QUERY", conn)


<p>A list of dataframes can be made, each one the result of a query to Transcripts or even a cross-table query.</p>
<p>Nested queries using <b>set</b> logic, e.g. "WHERE `Column Name` IN( ... )", can be applied here, with search inputs and other parameters drawn from the user, the filesystem, devices or scrapes.</p>
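
<p>One way to build such an IN( ... ) query from a list of inputs is sketched below. The helper name build_in_query and the second channel name are hypothetical, and the column name is assumed to come from trusted code rather than from user input.</p>

```python
def build_in_query(column, values):
    """Return (sql, params) selecting rows whose column equals any of the values.

    Hypothetical helper: `column` is interpolated directly, so it must be a
    trusted, known column name. Only the values are parameterized.
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    # One %s placeholder per value, e.g. "%s, %s" for two values.
    placeholders = ", ".join(["%s"] * len(values))
    sql = f"SELECT * FROM Transcripts WHERE `{column}` IN ({placeholders})"
    return sql, tuple(values)


# "example channel" is a made-up name used only to show a second value.
channels = ["theoria apophasis", "example channel"]
sql, params = build_in_query("Channel Name", channels)
print(sql)
print(params)

# With a live connection (not run here), a list of dataframes could be built as:
# queries = [build_in_query("Channel Name", [name]) for name in channels]
# frames = [pd.read_sql_query(q, conn, params=p) for q, p in queries]
```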

<p>The following line prints the number of rows in the dataframe returned by the SQL query sent to ytscrape.Transcripts above.</p>

In [49]:
print(len(df))

3141


<p>The following script pulls all the unique word tokens from a set of all the titles present in the original query result dataframe from above: <b>df</b>.</p>
<h2 style="font-family: 'SF Atarian System';">HYPOTHESIS: </h2>
<h3 style="font-family: 'SF Atarian System';">These words, when ranked by number of occurrences, can help to partition the total set of searchable videos differently, for example within the 500-result limit observed experimentally in results returned from YouTube.</h3>

In [105]:
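## regular expressions for naive text processing, TIP use '\w+'
import re
## NOTE: requires the nltk package (pip install nltk) and its stopwords corpus (nltk.download('stopwords'))
from nltk.corpus import stopwords

# create stopwords from nltk library
stop_words = set(stopwords.words('english'))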

## concatenate every title into one string, separated by spaces
## (`Video Title` is column 2 of the query result)
titles = " ".join(df['Video Title'].astype(str))
## Remove anything between colons (the emoji descriptions)
titles = re.sub(r":.*?:", " ", titles)
## All text to lowercase
titles = titles.lower()

## Clean up the numbers
titles = re.sub(r'\w\d\w', ' ', titles)
titles = re.sub(r'\d', ' ', titles)

## Clean up the symbols. Punctuation, special characters, other alphabets / languages
## NOTE: the words "face" and "with" are the most frequent tokens after cleaning, so they are candidates for extra stop words.
## They come from demoji: they are the emoji description text generated by that Python library.
titles = titles.translate(str.maketrans({
    "'" : " ",    "." : " ",    "," : " ",    '"' : ' ',    "!" : " ",    "?" : " ",    "`" : " ",    ":" : " ",
    "(" : " ",    ")" : " ",    "-" : " ",    "\\" : " ",    "/" : " ",    "~" : " ",    "…" : " ",    "$" : " ",
    "*" : " ",    "=" : " ",    "&" : " ",    "#" : " ",    "@" : " ",    "%" : " ",    "^" : " ",    "_" : " ",
    "+" : " ",    "{" : " ",    "}" : " ",    "[" : " ",    "]" : " ",    "|" : " ",    "<" : " ",    ">" : " ",
    ";" : " ",    "‘" : " ",    "’" : " ",    "€" : " ",    "“" : " ",    "”" : " ",    "ε" : " ",    "ὐ" : " ",
    "δ" : " ",    "α" : " ",    "ι" : " ",    "μ" : " ",    "改" : " ",    "善" : " ",    "σ" : " ",    "τ" : " ",
    "ή" : " ",    "☽" : " ",    "○" : " ",    "☾" : " ",    "ἐ" : " ",    "π" : " ",    "η" : " ",    "γ" : " ",
    "ν" : " ",    "ῶ" : " ",    "ö" : " ",    "ς" : " ",    "ü" : " ",    "ž" : " "
}))

## Additional Considerations: These tokens entered as utf8mb4 valid with mycursor.execute() function
## 改, 善, ☽, ○, ☾, ε, ὐ, δ, α, ι, μ, ο, ν, ί, α, σ, τ, ή, ἐ, π, η, ö, ς, ü, ž

#print(titles)

titles_tokens = titles.split()
#print(titles_tokens)

## Keep only tokens longer than N characters (drops words of N letters or fewer)
N = 3
titles_large_tokens = [word_token for word_token in titles_tokens if len(word_token) > N]

print("Length of titles tokens: " + str(len(titles_large_tokens)))

## deduplicate the tokens with a set (the original order is not preserved)
unique_titles_tokens = list(set(titles_large_tokens))

print("Length of unique titles tokens: " + str(len(unique_titles_tokens)))
print(unique_titles_tokens)

ModuleNotFoundError: No module named 'nltk'
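
<p>The cleaning steps in the cell above can also be sketched as one function built on regular expressions, without nltk. This is an alternative sketch rather than the notebook's method: the function name, the sample title and the stricter emoji pattern are assumptions. The stricter pattern assumes that emoji descriptions contain only lowercase letters, spaces and hyphens and sit directly between two colons. The sample also shows that the lazy pattern ":.*?:" can pair a title's own colon with the opening colon of an emoji description, which leaves the description text in place and may be one reason "face" ranks so high.</p>

```python
import re

# Assumed shape of a demoji description, e.g. ":thinking face:".
EMOJI_DESCRIPTION = re.compile(r":[a-z](?:[a-z \-]*[a-z])?:")


def clean_title_tokens(titles, min_length=4):
    """Return lowercase ASCII word tokens of at least min_length letters."""
    text = EMOJI_DESCRIPTION.sub(" ", titles)  # drop :emoji description: spans
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)       # digits and symbols act as separators
    return [token for token in tokens if len(token) >= min_length]


# Made-up sample assembled from fragments of titles shown earlier.
sample = "Giant Monster Magnet: Painful? :thinking face: MAGNET CRASH in VIDEO 111"

# The lazy pattern pairs the colon after "Magnet" with the emoji's opening colon.
naive = re.sub(r":.*?:", " ", sample)
print(naive)

print(clean_title_tokens(sample))
```

<p>Unlike the cell above, this sketch drops every non-ASCII character and every digit, so its token list will not match the notebook's output exactly.</p>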

<p style="font-family: 'SF Atarian System';">
Each of the preceding terms is a reasonably clean word token, and each appears exactly once in the set. Many words occur several times in the original corpus, so every element of this reduced set of unique cleaned tokens has a frequency count of at least 1 in the original document.
</p>

<h4>It is interesting to note that the symbols
    <ul>
        <li>改</li>
        <li>善</li>
        <li>☽</li>
        <li>○</li>
        <li>☾</li>
    </ul>
    did not cause an error with MariaDB's utf8mb4 character set encoding as they were pulled from <b>ytscrape.Transcripts</b>.
</h4>
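
<p>A quick check of how many bytes each of these symbols needs in UTF-8 is sketched below. All five lie in the Basic Multilingual Plane and encode to three bytes each. Characters outside that plane, such as most emoji, need four bytes, and those are the ones that depend on utf8mb4. The emoji in the example is an assumption chosen for contrast and is not taken from the table.</p>

```python
symbols = ["改", "善", "☽", "○", "☾"]

# Number of bytes each symbol occupies once encoded as UTF-8.
byte_lengths = {symbol: len(symbol.encode("utf-8")) for symbol in symbols}
print(byte_lengths)

# An emoji outside the Basic Multilingual Plane (U+1F914, thinking face).
emoji = "\U0001F914"
emoji_length = len(emoji.encode("utf-8"))
print(emoji_length)
```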

<p>Additionally, eliminating two-character and possibly three-character words should shorten the list even further.</p>

In [90]:
unique_titles_ranks = []
for word in unique_titles_tokens:
    word_count = 0
    for rawword in titles_large_tokens:
        if rawword == word:
            word_count += 1
    unique_titles_ranks.append(word_count)
    

In [92]:
# print each unique token next to its count
for word, count in zip(unique_titles_tokens, unique_titles_ranks):
    print(str(word) + ": " + str(count))

redundancy: 1
aspects: 1
americas: 15
demanded: 1
owned: 1
special: 3
education: 16
missing: 57
more: 39
vectors: 4
bengigi: 1
genius: 3
supposed: 1
adjusting: 1
alter: 5
vriii: 1
advantages: 1
daguerreotype: 1
output: 3
jump: 1
place: 1
pinouts: 1
when: 15
junkie: 1
felicities: 1
knob: 1
quackery: 4
freezer: 1
shadow: 3
iphone: 3
guys: 1
ontology: 1
quit: 1
ferrofluid: 3
zone: 3
content: 2
variable: 1
examined: 1
amuck: 1
ruined: 1
creativity: 1
rendering: 1
consider: 2
dropping: 1
lazy: 1
atman: 1
lawyers: 1
aperture: 2
relates: 1
user: 2
faux: 1
jpegs: 1
europe: 1
drools: 1
culture: 1
saturate: 1
exclusive: 2
refuting: 2
damage: 1
official: 3
healing: 1
anti: 9
finalized: 1
darren: 2
transmission: 1
hanging: 1
hodgenville: 1
echo: 1
insect: 1
option: 2
prior: 1
short: 6
hidden: 5
dismissing: 2
late: 4
sell: 1
sacred: 1
simplex: 4
unsharp: 1
daunting: 1
contents: 2
response: 3
demon: 1
cannot: 3
underlying: 1
accepted: 1
ignorance: 3
youre: 6
ball: 2
thousand: 1
confuses: 1
illogical

In [96]:
most_frequent_word = 0
counter = 0
for word in unique_titles_tokens:
    if unique_titles_ranks[counter] > unique_titles_ranks[most_frequent_word]:
        most_frequent_word = counter
    counter += 1
print(unique_titles_tokens[most_frequent_word] + " " + str(unique_titles_ranks[most_frequent_word]))

face 1227
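
<p>The counting and most-frequent-word cells above can be sketched with collections.Counter from the standard library, which counts every token in a single pass instead of rescanning the list for each unique word. The sample token list below is a made-up stand-in for titles_large_tokens.</p>

```python
from collections import Counter

# Hypothetical stand-in for titles_large_tokens from the cells above.
sample_tokens = ["magnet", "face", "light", "face", "magnet", "face"]

# Count every token in one pass.
token_counts = Counter(sample_tokens)

# most_common(1) returns a list holding the single highest (word, count) pair.
most_common_word, most_common_count = token_counts.most_common(1)[0]
print(most_common_word, most_common_count)

# Print every token with its count, highest count first.
for word, count in token_counts.most_common():
    print(f"{word}: {count}")
```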


In [99]:
###
# From: https://stackoverflow.com/questions/27488446/how-do-i-get-word-frequency-in-a-corpus-using-scikit-learn-countvectorizer
###
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv_fit = cv.fit_transform(titles_large_tokens)
print(cv.vocabulary_)



In [104]:
## Further down the post
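## NOTE: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
word_list = cv.get_feature_names_out()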
count_list = cv_fit.toarray().sum(axis=0)    
word_count = dict(zip(word_list,count_list))
print(word_count)



In [8]:
import re
import string

# This is row 0 of the `Video Transcript` column (column 4 of the dataframe): the first transcript in the table
text = str(df.loc[0, 'Video Transcript']).lower()

# This is a PyOhio regexp punctuation cleaning technique
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

print(text)



In [12]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
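# fit on the transcript column; passing the whole DataFrame would iterate over its column names only
text_cv = cv.fit_transform(df['Video Transcript'].astype(str))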

In [16]:
# tokenize with split()
text_tokens = text.split()
print(text_tokens)
print(str(len(text_tokens)))

7939


In [17]:
# now the text is tokenized. a unique word list can be made, where repeated words are mapped to one element of a
# basis set. 

###
# This method converts the list of word tokens into a set, then iterates over it into a list (f o g())
###
## Ref: https://www.freecodecamp.org/news/python-unique-list-how-to-get-all-the-unique-values-in-a-list-or-array/
unique_text_tokens = list(set(text_tokens))
print(unique_text_tokens)
print(str(len(unique_text_tokens)))

1485


In [None]:
## Here we can see that the transcript above contains only 1485 unique word tokens out of 7939 in total,
## so we can say its lexicon is 1485 words in size
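
<p>The lexicon size can be put in proportion to the transcript length with a type-token ratio, which is the number of unique tokens divided by the total number of tokens. For the transcript above that is 1485 / 7939, or about 0.19. The sketch below uses a made-up sample sentence, and the helper name lexicon_stats is an assumption.</p>

```python
def lexicon_stats(tokens):
    """Return (total tokens, unique tokens, type-token ratio) for a token list."""
    total = len(tokens)
    unique = len(set(tokens))
    # Avoid dividing by zero for an empty token list.
    ratio = unique / total if total else 0.0
    return total, unique, ratio


# Made-up sample text, tokenized with split() as in the cells above.
sample_tokens = "okay a little mystery here for you okay a little".split()
total, unique, ratio = lexicon_stats(sample_tokens)
print(total, unique, ratio)
```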