The repository of conference talks videos #7

zverok · 2018-02-23T17:01:46Z

Project

Gather (automatically and semi-automatically) as many talk videos from Ruby conferences as exist, and provide a nice browsing interface for them.

Plan

Go through http://rubyconferences.org/past/
For each conference, define in Ruby (with the help of wombat?) a parser for its site (or a web.archive.org copy of it) that extracts a structured list of talks, speakers, topics and so on;
For each conference, find the list of talk videos (on the site, or in a linked YouTube/Confreaks playlist), and define an extractor for it and a matcher against the conference's program;
Define (with the help of Jekyll or another nice static site renderer) the rendering of this data, and provide extensive navigation by year, by speaker, by talk topic, keyword and so on, joining the same talk given at several conferences into "versions".
Publish to GitHub Pages
Make sure to have a "framework" of scripts to update the site after future conferences
Provide a way to collaboratively edit talk classifications.
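
A minimal sketch of the navigation and "versions" idea from the plan, using only the Ruby standard library. The Talk struct, its fields, the sample records and the helper names are all assumptions made for illustration; nothing here is an agreed data model for the project.

```ruby
# Hypothetical talk records; in the real project these would come from
# parsed conference data rather than being written inline.
Talk = Struct.new(:title, :speaker, :conference, :year, keyword_init: true)

TALKS = [
  Talk.new(title: "Example Talk A", speaker: "Speaker One", conference: "Conf X", year: 2016),
  Talk.new(title: "Example talk A!", speaker: "Speaker One", conference: "Conf Y", year: 2017),
  Talk.new(title: "Example Talk B", speaker: "Speaker Two", conference: "Conf X", year: 2016)
].freeze

# Reduce a title to a comparison key: lowercase, punctuation dropped,
# whitespace collapsed. ASCII-only, which is a simplification.
def title_key(title)
  title.downcase.gsub(/[^a-z0-9\s]/, "").split.join(" ")
end

# Build a navigation index over any attribute (:year, :speaker, ...).
def index_by(talks, attribute)
  talks.group_by { |talk| talk.public_send(attribute) }
end

# Group the same talk given at several conferences into "versions":
# here, "same" means same speaker and same normalized title.
def versions(talks)
  talks.group_by { |talk| [talk.speaker, title_key(talk.title)] }.values
end

index_by(TALKS, :year).each do |year, talks|
  puts "#{year}: #{talks.map(&:title).join(', ')}"
end

versions(TALKS).each do |group|
  puts "#{group.first.title}: #{group.map(&:conference).join(', ')}"
end
```

Treating "same speaker plus same normalized title" as one talk is only a first guess; retitled talks would need manual classification, which is where the collaborative editing step comes in.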

Importance

There is a lot of good material out there, gathered over 25 years of Ruby, and a huge part of it is available online, offering learning, discovery and historical interest. Somebody should just do something about it.

Skills and domains

You will do a lot of automated web scraping, and site generation, as well as generalizing a lot of non-generic data in a bearable way.

subratrout · 2018-03-01T01:57:28Z

Hi Victor, I am interested in working on this project. I am a Ruby developer with 1 year of Ruby experience. Am I eligible to work on this project? Please let me know. Thank you.

zverok · 2018-03-01T14:49:03Z

Am I eligible to work on this project?

Of course! Do you have any ideas of where to start, or do you want me to give you some initial steps and hints?

subratrout · 2018-03-01T21:34:43Z

It would be great if you could give me some initial guidance on how to approach it.

zverok · 2018-03-02T08:15:16Z

OK then! As I see it, the very first steps could be these:

1. Go to http://rubyconferences.org/past/ and select some conferences that have videos published...
2. For completeness, select:
- one with videos and slides right on the site, e.g. RubyKaigi: http://rubykaigi.org/2017
- one whose site contains only talk announcements, with the videos in YouTube playlists, e.g. EuRuKo: https://www.euruko2017.org/schedule/ (the videos can be found through Twitter: https://www.youtube.com/playlist?list=PLCrwBqiOtwpIRdk4VMfD6hbFVHZoKOyMg and https://www.youtube.com/playlist?list=PLCrwBqiOtwpINmaC2eBX8QHuLYngECD19&disable_polymer=true)
- (for later) say, one whose site is preserved only in the Web Archive: http://web.archive.org/web/20150927215820/http://rubyconf.org/program with videos on Confreaks.tv: http://confreaks.tv/events/rubyconf2015
3. Now, from all of the examples above, we need to produce some YAML (one file per conference) with the list of talks, containing, probably:
- title
- announcement
- link to slides, if available
- link to video, wherever it is
- speaker name
- speaker bio/about
- link to speaker's photo, if available
- talk's date/time (for historical reasons)
- some conference meta information (title, date, place, description)
4. I recommend trying to write the conference site scrapers with wombat (https://github.com/felipecsl/wombat); it seems the most appropriate tool.
- For YouTube, it would probably be wise to use the official YouTube Ruby client library (and you'll need to match the titles of talks on YouTube with the titles on the conference site, so be careful).
5. Then, when you have YAML with all the data, you can try writing renderers that turn it into a static site, probably with the help of middleman (https://github.com/middleman/middleman). For the first iteration, nothing fancy is needed, just pages like /conferences/ (links to all), /conferences/<conferenceid> (meta information about the conference + list of talks), conferences/<id>/talks/<talkid> (a particular talk with its info, links to slides, embedded video).
6. The next step would be to cover all conferences in existence with parsers (sometimes this will also require digging around the web for their long-gone sites, but the videos are mostly on YouTube and still present) and to have a HUGE bunch of parsed YAML files.
- Probably, some small DSLs would emerge to DRY up this task, like "This conference has a site at that URL, use these Wombat definitions and fetch videos from those YouTube playlists, and match their titles to the conference program with this regexp" or something, we'll see.
7. And then the "site crafting" stage comes, which we'll probably discuss and experiment with more when the time comes.
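
To make the YAML step a bit more concrete, here is a rough sketch of what one per-conference file could hold and how it round-trips through Ruby's standard YAML library. The key names (slides_url, video_url, starts_at and so on) and all the values are placeholders made up for illustration, not a settled schema.

```ruby
require "yaml"

# All values below are placeholders, not real conference data.
# The key names are assumptions and easy to rename later.
CONFERENCE = {
  "title" => "Example Conf 2017",
  "date" => "2017-01-01",
  "place" => "Example City",
  "description" => "Placeholder description",
  "talks" => [
    {
      "title" => "Example Talk",
      "announce" => "Short abstract of the talk",
      "slides_url" => nil, # not every talk has slides
      "video_url" => "https://example.org/videos/1",
      "speaker" => {
        "name" => "Example Speaker",
        "bio" => "Placeholder bio",
        "photo_url" => nil
      },
      "starts_at" => "2017-01-01 10:00"
    }
  ]
}.freeze

# Serialize the structure; this string is what would be saved as the
# conference's YAML file.
CONFERENCE_YAML = CONFERENCE.to_yaml
puts CONFERENCE_YAML

# Reading it back gives the renderer plain hashes, arrays and strings.
loaded = YAML.safe_load(CONFERENCE_YAML)
puts loaded["talks"].first["speaker"]["name"]
```

Keeping dates as plain strings means the file loads with YAML.safe_load without permitting extra classes; whether that is the right trade-off is open.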

How does it sound? Let me know if something is unclear in this plan (especially the first stages of it; the later stages are intentionally left generic, and we can discuss them in more detail later).
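
On the "match titles of talks on YouTube with titles on conference site" part: one possible approach, sketched with the standard library only, is to compare word sets and accept a video when it contains nearly all words of the program title. The function names, the 0.8 threshold and the sample titles are assumptions for illustration; real playlists will likely need per-conference tweaks.

```ruby
require "set"

# Hypothetical sample data: program titles from a conference site and
# video titles as they might appear in a playlist.
PROGRAM_TITLES = ["Example Talk About Parsing", "Another Talk"].freeze
VIDEO_TITLES = [
  "ExampleConf 2017 - Example talk about parsing, by Example Speaker",
  "ExampleConf 2017 - Lightning talks"
].freeze

# Split a title into a set of lowercase word tokens.
def tokens(title)
  title.downcase.scan(/[a-z0-9]+/).to_set
end

# Share of the program title's words that also occur in the video title.
def coverage(program_title, video_title)
  wanted = tokens(program_title)
  return 0.0 if wanted.empty?

  (wanted & tokens(video_title)).size.to_f / wanted.size
end

# For each program title pick the best-scoring video title, or nil when
# nothing reaches the threshold (those are left for manual review).
def match_titles(program_titles, video_titles, threshold: 0.8)
  program_titles.to_h do |program_title|
    best = video_titles.max_by { |video_title| coverage(program_title, video_title) }
    score = best ? coverage(program_title, best) : 0.0
    [program_title, score >= threshold ? best : nil]
  end
end

match_titles(PROGRAM_TITLES, VIDEO_TITLES).each do |program_title, video_title|
  puts "#{program_title} => #{video_title || 'no confident match'}"
end
```

Returning nil below the threshold is deliberate: a wrong video attached to a talk is worse than a talk flagged for a human to check.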

subratrout · 2018-03-03T16:07:15Z

Thank you, Victor, for the detailed instructions. I believe I can start working on it. Where do I create the repo? Or will you create a repo with a basic folder and file structure and give me access to work on it? Please let me know.


zverok · 2018-03-03T21:42:27Z

Where do I create the repo?

Just do it in your personal GitHub for now. Then, when the work is close to publishing, we'll create a separate GitHub organization and you'll transfer the repo there (GitHub allows this). A separate organization (something like ruby-talks) is needed because GitHub will provide a nice org-name.github.io domain name.

zverok · 2018-03-03T21:44:22Z

Or, in fact, let's create the org right away :) On it.

zverok · 2018-03-03T21:45:31Z

https://github.com/ruby-talks-archive/ruby-talks Here we go. You should have access to it.

subratrout · 2018-03-05T08:02:32Z

Hi Victor, I just pushed my first commit to ruby-talks, in a repo named "videotalks". The first script, videos.rb, is working. Please advise why creating a class and including Wombat::Crawler is not working. Thank you.

KrassCodes · 2020-02-16T17:55:55Z

Hi Victor, has this progressed since the last update? I would be interested to contribute if I can. Thank you for organising these projects!

zverok · 2020-02-17T12:08:43Z

@Krass101 Nope, nothing changed for the project since the discussion in this issue 🤷‍♀️

I'd be glad if you contributed your work to this project, and I am ready to answer any questions and provide other help whenever you need it.

KrassCodes · 2020-02-17T23:28:07Z

@zverok thanks for your offer to help! I have started my own repo and will make small updates every day. I am still learning Ruby, wombat and other gems. My goal is to use a TDD approach, and as a first step I have written the basic code to scrape the talk titles of the RubyKaigi 2018 conference. My repo is here: https://github.com/Krass101/ruby-talks

Do you have a preferred way for me to ask you questions? I can collect them somewhere and either send to you at once or maintain a list in the repo?

zverok · 2020-02-20T10:56:00Z

@Krass101 We can discuss things at https://gitter.im/molybdenum-99/rubytalks (it is a chat service that is easy to log in to from GitHub, and easy to discuss code in), or you can drop me an email at zverok.offline@gmail.com (but Gitter is more convenient).

As for starting the work: I believe the very first step could be to just save some page(s) of the conference of interest locally and write some specs saying "I can extract this and that data from the page". This would be a good start that is easy to achieve (and it is also head-first, so you'll see whether you love working on problems like this).
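
A rough idea of what such a first spec could look like, written with Minitest so it runs without extra gems. The saved page, its markup and class names, and the extract_talks helper are all invented for this sketch; the regular expression is only a dependency-free stand-in, and a real extractor would use an HTML parser such as wombat.

```ruby
require "minitest/autorun"

# Stand-in for a page saved locally (in practice, read with File.read).
# The markup and class names are made up for this sketch.
SAVED_PAGE = <<~HTML
  <ul class="talks">
    <li class="talk"><span class="title">Example Talk One</span> <span class="speaker">Speaker A</span></li>
    <li class="talk"><span class="title">Example Talk Two</span> <span class="speaker">Speaker B</span></li>
  </ul>
HTML

# Matches one title/speaker pair in the invented markup above.
TALK_PATTERN = %r{<span class="title">(.*?)</span>\s*<span class="speaker">(.*?)</span>}m

# Hypothetical extractor: returns one hash per talk found in the page.
def extract_talks(html)
  html.scan(TALK_PATTERN).map do |title, speaker|
    { title: title.strip, speaker: speaker.strip }
  end
end

# The specs state what can be extracted from the saved page.
class ExtractTalksTest < Minitest::Test
  def test_extracts_every_talk
    assert_equal 2, extract_talks(SAVED_PAGE).size
  end

  def test_extracts_title_and_speaker
    expected = { title: "Example Talk One", speaker: "Speaker A" }
    assert_equal expected, extract_talks(SAVED_PAGE).first
  end
end
```

Because the specs only talk to extract_talks, the regular expression can later be swapped for a proper parser without touching them.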

zverok added the idea (Idea of the project) and site (Ruby-related website rendering) labels on Feb 23, 2018

zverok added the has candidate (Somebody tries to start the work) label on Mar 2, 2018
