# User Guide

This is an experimental user guide to **Youcreep**.

If you have any questions, please contact [steveflyer](mailto:steveflyer@gmail.com).


In [1]:
# auto reload modules
%load_ext autoreload
%autoreload 2
import logging

import nest_asyncio
import pandas as pd

from youcreep import YoutubeVideoInfoCrawler, YoutubeCommentCrawler
from youcreep.config.filter_enum import FilterSection, FilterLengthOption, FilterOrderByOption

nest_asyncio.apply()

## 1. VideoInfo Crawler

In [None]:
# configure parameters
n_target = 50               # number of videos to crawl at most
search_term = "Python"      # search term
filter_config = {           # sort the search results by view count and keep only medium-length videos
    FilterSection.ORDER_BY: FilterOrderByOption.VIEW_COUNT,
    FilterSection.LENGTH: FilterLengthOption.MEDIUM,
}

# run the crawl; "async with" makes sure the crawler is closed after use
async with YoutubeVideoInfoCrawler(headless=False) as video_crawler:
    video_info_list = await video_crawler.crawl(
        search_term=search_term,
        n_target=n_target,
        filter_options=filter_config,
        crawl_details=True
    )

video_info_list[:3]

The crawl returns a list of video-info objects. Call `to_dict()` on each one to convert the list to a pandas DataFrame.

In [3]:
video_df = pd.DataFrame([video_info.to_dict() for video_info in video_info_list])

video_df

Unnamed: 0,video_id,title,video_url,is_short,view_count,publish_time,channel_name,channel_url,desc_text
0,dVRhRzE_AkQ,Python的实惠美洲鳄鱼,https://www.youtube.com/watch?v=dVRhRzE_AkQ,False,1.5亿次观看,11年前,ojatro,https://www.youtube.com/@ojatro,网址：http://Ojatro.com\nFacebook的：https://www.fa...
1,fr6WJG3Uw3M,OMG! Capuchin Monkey Save Mouse From Banded Kr...,https://www.youtube.com/watch?v=fr6WJG3Uw3M,False,1.5亿次观看,4年前,Life Of Big Cat,https://www.youtube.com/@lifeofbigcat2374,OMG! Capuchin Monkey Save Mouse From Banded Kr...
2,XcxwOVNyfFY,OMG! Giant Python Hunt Leopard Cubs When Mothe...,https://www.youtube.com/watch?v=XcxwOVNyfFY,False,1.4亿次观看,3年前,Life Of Big Cat,https://www.youtube.com/@lifeofbigcat2374,OMG! Giant Python Hunt Leopard Cubs When Mothe...
3,c98BHS8rSTg,Easy Snake Trap - Build Underground Python Tra...,https://www.youtube.com/watch?v=c98BHS8rSTg,False,1.4亿次观看,3年前,da ra,https://www.youtube.com/@dara-ee4uv,Easy Snake Trap - Build Underground Python Tra...
4,adUlm2m7RrU,Stop Motion ASMR - Big Python eats Alligator G...,https://www.youtube.com/watch?v=adUlm2m7RrU,False,1.1亿次观看,2年前,Unusual Cooking,https://www.youtube.com/@UnusualCooking,"Hello my dear viewers, Unbelievable Fishing at..."
5,dw5eILfy-c8,"Python is too aggressive, Lion Cub mistakes wh...",https://www.youtube.com/watch?v=dw5eILfy-c8,False,8086万次观看,4年前,Alan Tours,https://www.youtube.com/@Alantours4u,"NAP , #NAPanimal Python is too aggressive, Lio..."
6,JgKN3BuvC3E,"Python, Honey Badger & Jackal Fight Each Other",https://www.youtube.com/watch?v=JgKN3BuvC3E,False,6479万次观看,3年前,Caters Video,https://www.youtube.com/@CatersVideo,ID: 3331461 MANDATORY ONSCREEN CREDIT - Rosely...
7,rLycYSY2V6Y,10 Biggest Snakes Ever Discovered,https://www.youtube.com/watch?v=rLycYSY2V6Y,False,5184万次观看,2年前,FactFile,https://www.youtube.com/@FactFile,There are over 3000 species of snake on the pl...
8,5ADMawsgV5M,蜘蛛侠和斗牛犬对抗一只巨蟒,https://www.youtube.com/watch?v=5ADMawsgV5M,False,3767万次观看,1年前,King Of Survival,https://www.youtube.com/@KingOfSurvivalVlogs,蜘蛛侠和斗牛犬对抗一只巨蟒
9,MRnORGSYGCQ,Primitive Survival 4K Video - Build PYTHON Hou...,https://www.youtube.com/watch?v=MRnORGSYGCQ,False,3503万次观看,3年前,Survival Skills Asia,https://www.youtube.com/@survivalskillsasia,Primitive Survival 4K Video - Build PYTHON Hou...
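
The `view_count` column above holds localized display strings such as `1.5亿次观看` ("150 million views") rather than integers. Below is a minimal sketch of converting them, assuming the only unit suffixes are 万 (10^4) and 亿 (10^8). `parse_view_count` is a hypothetical helper written for this guide, not part of Youcreep.

```python
import re
from decimal import Decimal
from typing import Optional

# Multipliers for the Chinese unit suffixes seen in the table above (assumed to be the only ones).
_UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_view_count(text: str) -> Optional[int]:
    """Hypothetical helper: turn a string like '1.5亿次观看' into an int, or None if no number is found."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([万亿]?)", text)
    if match is None:
        return None
    # Decimal avoids float rounding when scaling values such as 1.5 by 10^8.
    number = Decimal(match.group(1).replace(",", ""))
    multiplier = _UNITS.get(match.group(2), 1)
    return int(number * multiplier)


print(parse_view_count("1.5亿次观看"))
print(parse_view_count("8086万次观看"))
```

With the DataFrame from the previous cell, this could be applied as `video_df["view_count"].map(parse_view_count)`. Other interface languages use different suffixes, so the unit table would need extending for them.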


## 2. Comment Crawler

In [21]:
from gembox.debug_utils import ConsoleDebugger

# configure parameters
# test_video_url = video_df.video_url[0]
test_video_url = "https://www.youtube.com/watch?v=gZlvpP97qdU"
n_target = None

# run the crawl; this time we use headless mode to avoid opening a browser window,
# and we enable detailed logging to trace the crawling process
comment_crawler = YoutubeCommentCrawler(headless=True, debug_tool=ConsoleDebugger(level=logging.INFO))
await comment_crawler.start()
comment_list = await comment_crawler.crawl(
    video_url=test_video_url,
    n_target=n_target,
)
comment_df = pd.DataFrame([comment.to_dict() for comment in comment_list])

comment_df

2023-08-24 16:01:07,732 - gembox.debug_utils_c7f6_logger - INFO - Browser: Go to https://www.youtube.com/
2023-08-24 16:01:09,554 - gembox.debug_utils_c7f6_logger - INFO - Going to the video page: https://www.youtube.com/watch?v=gZlvpP97qdU
2023-08-24 16:01:09,555 - gembox.debug_utils_c7f6_logger - INFO - Browser: Go to https://www.youtube.com/watch?v=gZlvpP97qdU
2023-08-24 16:01:11,339 - gembox.debug_utils_c7f6_logger - INFO - Scrolling to the bottom of the page for 1 time(s) for loading the page...
2023-08-24 16:01:11,735 - gembox.debug_utils_c7f6_logger - INFO - Scrolling to the bottom of the page for 2 time(s) for loading the page...
2023-08-24 16:01:12,131 - gembox.debug_utils_c7f6_logger - INFO - Scrolling to the bottom of the page for 3 time(s) for loading the page...
2023-08-24 16:01:12,916 - gembox.debug_utils_c7f6_logger - INFO - Scrolling to the bottom of the page for 4 time(s) for loading the page...
2023-08-24 16:01:13,420 - gembox.debug_utils_c7f6_logger - INFO - Scrollin

NetworkError: Protocol error Runtime.evaluate: Target closed.

In [19]:

await comment_crawler._page_parser.parse_video_page_meta_info()


{'view_count': 628020, 'comment_count': 2343}

In [17]:
from gembox.re_utils import search_comma_sep_num

from youcreep.page_parser.selectors.video_page import view_count_sel
view_count_str = (await comment_crawler._browser_agent.get_texts(view_count_sel))[0].strip()
view_count = search_comma_sep_num(view_count_str)

view_count

628020

In [16]:
import re

text = "628,020次观看"
pattern = r"(\d{1,3}(?:,\d{3})*\d*)"

match = re.search(pattern, text)
match.group(1)

'628,020'

In [6]:
dismiss_premium_btn = await comment_crawler._browser_agent.page_interactor.get_element("#dismiss-button > yt-button-renderer > yt-button-shape > button")

await dismiss_premium_btn.click()

In [5]:
from youcreep.page_parser.selectors.video_page import view_count_sel, comment_count_sel

view_count_sel, comment_count_sel

('#count .view-count.ytd-video-view-count-renderer',
 '#count.ytd-comments-header-renderer')

In [15]:
comment_count = (await comment_crawler._browser_agent.get_texts(comment_count_sel))[0].strip()
view_count = (await comment_crawler._browser_agent.get_texts(view_count_sel))[0].strip()
comment_count, view_count

('9,741 条评论', '147,359,887次观看')

In [2]:
from youcreep.page_parser.selectors.video_page import view_count_sel

await comment_crawler._browser_agent.page_interactor.get_element(view_count_sel)
await comment_crawler._browser_agent.get_texts(view_count_sel)

## 3. Multiprocessing Crawler

You can easily crawl multiple videos or comments in parallel using multiprocessing.

`parallel_crawl` takes a list of parameter dictionaries, one per crawl task, and `n_workers` sets the number of parallel processes.
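
To make the role of `n_workers` concrete, here is a minimal sketch of one common way to split a task list among workers (round-robin). It only illustrates the idea: this guide does not show how Youcreep actually schedules its tasks, and `partition_tasks` is a hypothetical helper.

```python
from typing import Any, Dict, List


def partition_tasks(param_dict_list: List[Dict[str, Any]], n_workers: int) -> List[List[Dict[str, Any]]]:
    """Hypothetical helper: deal tasks out to n_workers buckets in round-robin order."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    buckets: List[List[Dict[str, Any]]] = [[] for _ in range(n_workers)]
    for index, params in enumerate(param_dict_list):
        # Task i goes to worker i mod n_workers, so bucket sizes differ by at most one.
        buckets[index % n_workers].append(params)
    return buckets


tasks = [{"search_term": term, "n_target": 20} for term in ["OpenAI", "Deep Learning", "Machine Learning"]]
for worker_id, bucket in enumerate(partition_tasks(tasks, n_workers=2)):
    print(worker_id, [params["search_term"] for params in bucket])
```

Setting `n_workers` higher than the number of tasks would leave some buckets empty, so there is little to gain from more workers than tasks.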

### Crawl video info from a list of search terms

In [4]:
# configure parameters
search_terms = ["OpenAI", "Deep Learning", "Machine Learning", "Artificial Intelligence"]
output_dir = "./output/search"
n_target = 20

param_dict_list = []
for search_term in search_terms:
    param_dict_list.append({
        "search_term": search_term,
        "n_target": n_target,
        "filter_options": {
            FilterSection.ORDER_BY: FilterOrderByOption.VIEW_COUNT,
            FilterSection.LENGTH: FilterLengthOption.MEDIUM,
        },
        "output_dir": output_dir,
    })

# call the crawler to crawl
await YoutubeVideoInfoCrawler.parallel_crawl(
    param_dict_list=param_dict_list,
    n_workers=4,                        # number of workers, i.e. number of parallel processes
    headless=True,
    verbose=True,
)

#### Check the output

In [3]:
from gembox.io import list_dir

output_dir = "./output/search"
video_csv = list_dir(output_dir)[-1]  # takes the last file listed; point this at the file you just crawled
video_df = pd.read_csv(video_csv, index_col=0)

video_df

Unnamed: 0,video_id,title,video_url,is_short,view_count,publish_time,channel_name,channel_url,desc_text
0,Lu56xVlZ40M,OpenAI Plays Hide and Seek…and Breaks The Game! 🤖,https://www.youtube.com/watch?v=Lu56xVlZ40M,False,收看次數：9.3M 次,3 年前,Two Minute Papers,https://www.youtube.com/@TwoMinutePapers,We would like to thank our generous Patreon su...
1,SX08NT55YhA,A.I. Learns to Drive From Scratch in Trackmania,https://www.youtube.com/watch?v=SX08NT55YhA,False,收看次數：7.1M 次,1 年前,Yosh,https://www.youtube.com/@yoshtm,I made an A.I. that teaches itself to drive in...
2,v3UBlEJDXR0,AI Learns to Escape (deep reinforcement learning),https://www.youtube.com/watch?v=v3UBlEJDXR0,False,收看次數：5.9M 次,9 個月前,AI Warehouse,https://www.youtube.com/@aiwarehouse,AI Teaches Itself How to Escape! In this video...
3,L_4BPjLBF4E,AI Learns to Walk (deep reinforcement learning),https://www.youtube.com/watch?v=L_4BPjLBF4E,False,收看次數：5.5M 次,3 個月前,AI Warehouse,https://www.youtube.com/@aiwarehouse,AI Teaches Itself to Walk! In this video an AI...
4,zpRM25pUD8w,"Smart, seductive, dangerous AI robots. Beyond ...",https://www.youtube.com/watch?v=zpRM25pUD8w,False,收看次數：4.8M 次,6 個月前,Digital Engine,https://www.youtube.com/@DigitalEngine,Sources: Future of Life Institute AI discussio...
5,E_9TLmBeSIU,معاكم ابوفله بعد 5 سنين🕠؟؟ | OpenAI GPT-4,https://www.youtube.com/watch?v=E_9TLmBeSIU,False,收看次數：3.4M 次,4 個月前,AboFlah,https://www.youtube.com/@AboFlah,"شاركوووووووووووا وبإذن الله الفوز, البوست نزل ..."
6,JYtZ2zsdE_s,10X Your Excel Skills with ChatGPT 🚀,https://www.youtube.com/watch?v=JYtZ2zsdE_s,False,收看次數：2.9M 次,7 個月前,Kevin Stratvert,https://www.youtube.com/@KevinStratvert,"In this step-by-step tutorial, learn how you c..."
7,a2ZBEC16yH4,Elon Musk tells Tucker potential dangers of hy...,https://www.youtube.com/watch?v=a2ZBEC16yH4,False,收看次數：2.8M 次,4 個月前,Fox News,https://www.youtube.com/@FoxNews,Tesla and Twitter CEO Elon Musk joins 'Tucker ...
8,kQPUWryXwag,Bring ChatGPT INSIDE Excel to Solve ANY Proble...,https://www.youtube.com/watch?v=kQPUWryXwag,False,收看次數：2.6M 次,6 個月前,Leila Gharani,https://www.youtube.com/@LeilaGharani,OpenAI inside Excel? How can you use an API ke...
9,0vbk1wG7gqs,НА ЧТО СПОСОБЕН ИСКУССТВЕННЫЙ ИНТЕЛЛЕКТ ОТ OPE...,https://www.youtube.com/watch?v=0vbk1wG7gqs,False,收看次數：2.6M 次,3 年前,Kosmo Story,https://www.youtube.com/@KosmoStory,Вся музыка взята с библиотеки Epidemic Sound P...


### Crawl comments from a list of video urls

In [2]:
# video URLs to crawl comments from (hardcoded here; you can also take them from video_df.video_url)
video_urls = ["https://www.youtube.com/watch?v=gZlvpP97qdU"]
n_target = 10000
output_dir = "./output/comment"

param_dict_list = []
for video_url in video_urls:
    param_dict_list.append({
        "video_url": video_url,
        "n_target": n_target,
        "output_dir": output_dir,
    })

# call the crawler to crawl
await YoutubeCommentCrawler.parallel_crawl(
    param_dict_list=param_dict_list,
    n_workers=1,                        # number of workers, i.e. number of parallel processes
)

#### Check the output

In [11]:
comment_csv = "youtube_output/comment/X9MM9vctavM_30_comment.csv"  # change the file name to the one you just crawled
comment_df = pd.read_csv(comment_csv, index_col=0)

comment_df

Unnamed: 0,comment_id,is_reply,author_name,author_url,publish_time,parent_comment_id,video_id,author_thumbnail,content_text,like_count
0,Ugy-pt3ecAMZdKhnJHd4AaABAg,False,@Zahirah12,/channel/UCpAZNibWG4EPdje0CYczw4A,16 小時前,,X9MM9vctavM,https://yt3.ggpht.com/eUX9dK2-xAF7RZMnCNFwKVu1...,Tak tau nk ckp mcm mana. The best reaction yg ...,347
1,UgzeFm1mx0ET4ZD66rB4AaABAg,False,@kritta_v1025,/channel/UCiehePtwUNvp-X6J4GaQwHg,15 小時前,,X9MM9vctavM,https://yt3.ggpht.com/KxmPP0H70aXTyXETiWBCewp7...,"ikut terharu liatnya, persahabatan mereka kuat...",104
2,UgxP3_hLXxgF5wijC4B4AaABAg,False,@itsyaff7280,/channel/UCkKvXhWqvlyJ37b2L4qu_3A,1 小時前,,X9MM9vctavM,https://yt3.ggpht.com/q1tY6lZaOgTbFqwV3xfSikUW...,suka tengok friendship diorang. dari zaman buj...,9
3,Ugx5R5h4PqvSgdiV6tl4AaABAg,False,@fatinnajiha5788,/channel/UCs5nhKcc4ng9OCxKhH5YM7g,7 小時前 (已編輯),,X9MM9vctavM,https://yt3.ggpht.com/ytc/AOPolaTIrHTpGYdhipCY...,Alhamdulilah semoga boy & nisa dimurahkn rezek...,20
4,UgxFzQ0-x5NjVmZoXR54AaABAg,False,@fansaliefirfanteam,/channel/UCz9EX3Eo4CRFIfk5HqyeXjQ,17 小時前,,X9MM9vctavM,https://yt3.ggpht.com/LDzLxxGe32JZjjo81yQjYaHZ...,Congrats Boy & Nisa Semoga Sentiasa Dipermud...,66
5,UgxUQqICQYyiIzvdil94AaABAg,False,@boseychannel8024,/channel/UCAHM0yG3HQrU2aTd3Yi5_bw,16 小時前 (已編輯),,X9MM9vctavM,https://yt3.ggpht.com/ytc/AOPolaTtXiCdh__RL-xi...,"YES!! konten yg dinantikan, reaction drpd ai t...",105
6,UgwHBx478Btb6BP_uqF4AaABAg,False,@muhdsany5929,/channel/UC6DmM9ovCd0k9yoDg7tY-hg,16 小時前,,X9MM9vctavM,https://yt3.ggpht.com/ytc/AOPolaQX-6rcdUAycs_K...,Alhamdulillah …tahniah boy&nisa…semoga Allah p...,44
7,Ugwfv3zWfN30Xs_gjOF4AaABAg,False,@asmiemie87,/channel/UC-DkbMs-Q2aTC7rn7mSlbLA,17 小時前,,X9MM9vctavM,https://yt3.ggpht.com/pIdq8uRu7vqWHYIr5K2RNTKn...,Tahniah boy dan Nisa..Alhamdulillah doa² kami ...,62
8,Ugz0CpvHwIAEH7IQBsF4AaABAg,False,@nuramirazhar984,/channel/UCz9-KgGjidyuQFpqlkWufzA,5 小時前,,X9MM9vctavM,https://yt3.ggpht.com/ytc/AOPolaSM9spboeovun6h...,"Alhamdulillah , terharu tengok persahabatan di...",11
9,Ugw2bJhtV37zZcofJ-l4AaABAg,False,@Lesslayrafim134,/channel/UCK2WpI5zAPvAUegMKwKG2BQ,17 小時前,,X9MM9vctavM,https://yt3.ggpht.com/amZRuTJ-RAaUEmtojJi9PeQc...,Terharu tgk persahabatan ai team A dari dulu s...,131
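
Each row above carries a `video_id`, and the example CSV filename starts with one as well. Below is a minimal sketch of deriving that ID from a video URL using only the standard library. It assumes the two URL shapes that appear in this guide, `/watch?v=<id>` and `/shorts/<id>`. `extract_video_id` is a hypothetical helper, not a Youcreep function.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(video_url: str) -> Optional[str]:
    """Hypothetical helper: return the video ID from a watch or shorts URL, or None."""
    parsed = urlparse(video_url)
    if parsed.path == "/watch":
        # Regular videos keep the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    if parsed.path.startswith("/shorts/"):
        # Shorts keep the ID as the path segment after /shorts/.
        video_id = parsed.path[len("/shorts/"):].strip("/")
        return video_id or None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=gZlvpP97qdU"))
print(extract_video_id("https://www.youtube.com/shorts/3c_Gufxc734"))
```

This can help match an output CSV back to the URL it was crawled from.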


In [6]:
from gembox.io.pandas_io import PandasIO
from youcreep.crawler import YoutubeCommentCrawler

view_csv_path = "C:\\Users\\Steve\\PycharmProjects\\sentiment_analysis_shanghai\\data\\merge_video_info.csv"
view_df = PandasIO.read(view_csv_path)
video_urls = view_df['video_url'].to_list()
# video_urls = ['https://www.youtube.com/watch?v=vTGR2tUt0m0',
#               'https://www.youtube.com/watch?v=IRLNpeDLhLw',
#               'https://www.youtube.com/shorts/3c_Gufxc734',
#               'https://www.youtube.com/watch?v=L553YKyQ0zo',
#               'https://www.youtube.com/watch?v=0bs3ikY5wbU',
#               'https://www.youtube.com/watch?v=HTS2plAOXDM',
#               'https://www.youtube.com/watch?v=gZlvpP97qdU',
#               'https://www.youtube.com/shorts/qYvKpOyUlYU',
#               'https://www.youtube.com/watch?v=kYpeA_kVcI8',
#               'https://www.youtube.com/watch?v=8T044v8EG5E',
# ]
param_dict_list = []
for video_url in video_urls:
    param_dict_list.append({
        'video_url': video_url,
        'n_target': None,
        'save_dir': './output/video_pages',
        })
param_dict_list = param_dict_list[:]  # narrow this slice (e.g. [:10]) to crawl only a subset

In [None]:
await YoutubeCommentCrawler.parallel_crawl(
    crawl_args_list=param_dict_list,
    log_dir='./output/video_pages/logs',
    n_workers=5,
    verbose=True,
)