huffduff-video

Extracts the audio from videos on YouTube, Vimeo, and many more sites and sends it to Huffduffer.

See huffduff-video.snarfed.org for bookmarklet and usage details.

Uses youtube-dl to download the video and extract its audio track. Stores the resulting MP3 file in S3.
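
The core flow boils down to something like this (a rough Python sketch, not the actual app.py; the bucket name, temp paths, and key naming are illustrative assumptions):

# Sketch only: download the audio track with youtube-dl, then upload the MP3
# to S3 with boto. Bucket name and file naming are illustrative, not app.py's.
import boto
import youtube_dl

def audio_to_s3(video_url, bucket_name='huffduff-video'):
    opts = {
        'format': 'bestaudio/best',
        'outtmpl': '/tmp/%(id)s.%(ext)s',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',   # needs ffmpeg on the path
            'preferredcodec': 'mp3',
        }],
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(video_url, download=True)

    # after the postprocessor runs, the file on disk is the MP3
    filename = '/tmp/%s.mp3' % info['id']
    key_name = '%s.mp3' % info['id']

    bucket = boto.connect_s3().get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(filename, policy='public-read')
    return 'https://%s.s3.amazonaws.com/%s' % (bucket_name, key_name)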

License: this project is placed in the public domain. Alternatively, you may use it under the CC0 license.

Related projects

  • youtube-dl-api-server is a web front-end that uses youtube-dl to extract and return a video's metadata.
  • Flask webapp and Chrome extension for using youtube-dl to download a video to local disk.
  • iOS workflow that does the same thing as huffduff-video, except all client side: downloads a YouTube video, converts it to MP3, uploads the MP3 to Dropbox, and passes it to Huffduffer.

Storage

The aws command line tool is nice, but the man page isn't very useful. Here's the online reference, here's aws s3 (high level but minimal), and here's aws s3api (much more powerful).

Run this to see the current usage (from http://serverfault.com/a/644795/274369):

aws --profile=default s3api list-objects --bucket huffduff-video \
  --query "[sum(Contents[].Size), length(Contents[])]"

Our S3 bucket lifecycle is in s3_lifecycle.json. I ran these commands to set a lifecycle that deletes files after 30d. (Config docs, put-bucket-lifecycle docs.)

# show an example lifecycle template
aws s3api put-bucket-lifecycle --generate-cli-skeleton

# set the lifecycle
aws s3api put-bucket-lifecycle --bucket huffduff-video \
  --lifecycle-configuration "`json_pp -json_opt loose <s3_lifecycle.json`"

# check that it's there
aws s3api get-bucket-lifecycle --bucket huffduff-video
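
For reference, a 30-day expiration rule looks roughly like this (a sketch: the rule ID is made up, and s3_lifecycle.json in the repo is the authoritative version):

{
  "Rules": [
    {
      "ID": "expire-mp3s",
      "Prefix": "",
      "Status": "Enabled",
      "Expiration": {"Days": 30}
    }
  ]
}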

As of 3/10/2015, users are putting roughly 2GB/day into S3, ie 180GB steady state for the lifecycle period of 90d. At $.03/GB/month, that costs $5.40/month. I could use RRS (Reduced Redundancy Storage), which costs $.024/GB/month ie $4.32/month, but that's not a big difference.

Monitoring

I set up CloudWatch to monitor and alarm on EC2 instance system checks, billing thresholds, HTTP logs, and application level exceptions. When alarms fire, it emails and SMSes me.

The monitoring alarms are in us-west-2 (Oregon), but the billing alarms have to be in us-east-1 (Virginia). Each region has its own SNS topic for notifications, one in us-east-1 and one in us-west-2.
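
For example, a billing alarm can be created from the CLI along these lines (a sketch: the $30 threshold and SNS topic ARN are placeholders, not the real alarm config; billing metrics only exist in us-east-1):

aws --region us-east-1 cloudwatch put-metric-alarm \
  --alarm-name billing-above-threshold \
  --namespace AWS/Billing --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum --period 21600 --evaluation-periods 1 \
  --threshold 30 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts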

System metrics

To get system-level custom metrics for memory, swap, and disk space, I set up Amazon's custom monitoring scripts.

sudo yum install perl-DateTime perl-Sys-Syslog perl-LWP-Protocol-https
wget http://aws-cloudwatch.s3.amazonaws.com/downloads/CloudWatchMonitoringScripts-1.2.1.zip
unzip CloudWatchMonitoringScripts-1.2.1.zip
rm CloudWatchMonitoringScripts-1.2.1.zip
cd aws-scripts-mon

cp awscreds.template awscreds.conf
# fill in awscreds.conf
./mon-put-instance-data.pl --aws-credential-file ~/aws-scripts-mon/awscreds.conf --mem-util --swap-util --disk-space-util --disk-path=/ --verify

crontab -e
# add this line:
# * * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --aws-credential-file ~/aws-scripts-mon/awscreds.conf --mem-util --swap-util --disk-space-util --disk-path=/ --from-cron
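
Once the cron job has run a few times, the new metrics should show up under the System/Linux custom namespace (the namespace the monitoring scripts publish to by default, as far as I can tell):

aws --region us-west-2 cloudwatch list-metrics --namespace System/Linux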

Log collection

To set up HTTP and application level monitoring, I had to:

  • add an IAM policy
  • install the logs agent with sudo yum install awslogs
  • add my IAM credentials to /etc/awslogs/awscli.conf and set region to us-west-2
  • add these lines to /etc/awslogs/awslogs.conf:
[/var/log/httpd/access_log]
file = /var/log/httpd/access_log*
log_group_name = /var/log/httpd/access_log
log_stream_name = {instance_id}
datetime_format = %d/%b/%Y:%H:%M:%S %z

[/var/log/httpd/error_log]
file = /var/log/httpd/error_log*
log_group_name = /var/log/httpd/error_log
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S %Y

# WSGI writes Python exception stack traces to this log file across multiple
# lines. I'd love to collect them with multi_line_start_pattern or something
# similar, but each line is prefixed with the same timestamp + severity + etc.
# prefix as other lines, so I can't.
  • start the agent and restart it on boot:
sudo service awslogs start
sudo service awslogs status
sudo chkconfig awslogs on
  • wait a while, then check that the logs are flowing:
aws --region us-west-2 logs describe-log-groups
aws --region us-west-2 logs describe-log-streams --log-group-name /var/log/httpd/access_log
aws --region us-west-2 logs describe-log-streams --log-group-name /var/log/httpd/error_log
  • define a few metric filters so we can graph and query HTTP status codes, error messages, etc:
aws logs put-metric-filter --region us-west-2 \
  --log-group-name /var/log/httpd/access_log \
  --filter-name HTTPRequests \
  --filter-pattern '[ip, id, user, timestamp, request, status, bytes]' \
  --metric-transformations metricName=count,metricNamespace=huffduff-video,metricValue=1

aws logs put-metric-filter --region us-west-2 \
  --log-group-name /var/log/httpd/error_log \
  --filter-name PythonErrors \
  --filter-pattern '[timestamp, error_label, prefix = "ERROR:root:ERROR:", ...]' \
  --metric-transformations metricName=errors,metricNamespace=huffduff-video,metricValue=1

aws --region us-west-2 logs describe-metric-filters --log-group-name /var/log/httpd/access_log
aws --region us-west-2 logs describe-metric-filters --log-group-name /var/log/httpd/error_log
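
To sanity-check a filter pattern against a sample log line before creating (or after changing) a filter, test-metric-filter is handy. The access_log line below is made up:

aws --region us-west-2 logs test-metric-filter \
  --filter-pattern '[ip, id, user, timestamp, request, status, bytes]' \
  --log-event-messages '1.2.3.4 - - [25/Jul/2016:01:23:45 +0000] "GET /get?url=... HTTP/1.1" 200 12345'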

Understanding bandwidth usage

As of 2015-04-29, huffduff-video is serving ~257 GB/mo (via S3), which costs ~$24/mo in bandwidth alone. I'm ok with that, but I think it could be lower.

As always, measure first, then optimize. To learn a bit more about who's downloading these files, I turned on S3 access logging, waited 24h, then ran these commands to collect and aggregate the logs:

aws --profile personal s3 sync s3://huffduff-video/logs .
grep REST.GET.OBJECT 2015-* | grep ' 200 ' | cut -d' ' -f8,20- \
  | sort | uniq -c | sort -n -r > user_agents

This gave me some useful baseline numbers. Over a 24h period, there were 482 downloads, 318 of which came from bots. (That's 2/3!) Out of the six top user agents by downloads, five were bots. The one exception was the Overcast podcast app.

(Side note: Googlebot-Video is polite and sends If-None-Match (ETag) or If-Modified-Since when it refetches files. It sent 68 requests, but exactly half of those resulted in an empty 304 response. Thanks Googlebot-Video!)

I switched huffduff-video to use S3 URLs on the huffduff-video.s3.amazonaws.com virtual host, added a robots.txt file that blocks all bots, waited 24h, and then measured again. The vast majority of huffduff-video links on Huffduffer are still on the s3.amazonaws.com domain, which doesn't serve my robots.txt, so I didn't expect a big difference...but I was wrong. Twitterbot had roughly the same number, but the rest were way down:

(Googlebot-Video was way farther down the chart with just 4 downloads.)

This may have been due to the fact that my first measurement was Wed-Thurs, and the second was Fri-Sat, which are slower social media and link sharing days. Still, I'm hoping some of it was due to robots.txt. Fingers crossed the bots will eventually go away altogether!

To update the robots.txt file:

aws --profile personal s3 cp --acl=public-read ~/src/huffduff-video/s3_robots.txt s3://huffduff-video/robots.txt

I put this in a cron job to run every 30d. I had to run aws configure first and give it the key id and secret.
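
The crontab entry amounts to something along these lines (a sketch: the monthly schedule is my approximation of "every 30d", not copied from the live crontab):

0 0 1 * * aws --profile personal s3 cp --acl=public-read ~/src/huffduff-video/s3_robots.txt s3://huffduff-video/robots.txt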

System setup

Currently on an EC2 t2.micro instance.

I started it originally on a t2.micro. I migrated it to a t2.nano on 2016-03-24, but usage outgrew the nano's CPU quota, so I migrated back to a t2.micro on 2016-05-25.

I did both migrations by making a snapshot of the old instance's EBS volume, making an AMI from the snapshot, then launching a new instance of the target type from that AMI. Details.
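
In CLI terms, that dance is roughly the following (a sketch with placeholder volume, snapshot, AMI, key, and security group IDs; the linked details describe the actual steps):

# snapshot the old instance's root EBS volume
aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "huffduff-video root"

# register an AMI backed by that snapshot
aws ec2 register-image --name huffduff-video-migration \
  --architecture x86_64 --virtualization-type hvm \
  --root-device-name /dev/xvda \
  --block-device-mappings '[{"DeviceName": "/dev/xvda", "Ebs": {"SnapshotId": "snap-xxxxxxxx"}}]'

# launch a new instance of the target type from the AMI
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.micro \
  --key-name my-key --security-group-ids sg-xxxxxxxx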

Here's how I originally set it up:

sudo yum install git httpd-devel mod_wsgi python-devel python27-pip tcsh telnet
sudo update-alternatives --set python /usr/bin/python2.7
sudo yum groupinstall 'Web Server' 'PHP Support'
sudo pip install boto webob youtube-dl

# Amazon Linux AMI has mod_wsgi 3.2, but we need 3.4 to prevent this error when
# running youtube-dl under WSGI:
# AttributeError: 'mod_wsgi.Log' object has no attribute 'isatty'
curl -o mod_wsgi-3.4.tar.gz https://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar xvzf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
sudo yum install httpd-devel -y
./configure
sudo make install

# add these lines to /etc/httpd/conf/httpd.conf
#
# # for huffduff-video
# LoadModule wsgi_module /usr/lib64/httpd/modules/mod_wsgi.so
# Options FollowSymLinks
# WSGIScriptAlias /get /var/www/cgi-bin/app.py
# LogLevel info
#
# # tune number of prefork server processes
# # see http://fuscata.com/kb/set-maxclients-apache-prefork etc.
# StartServers       8
# MinSpareServers    2
# MaxSpareServers    4
# ServerLimit        12
# MaxClients         12
# MaxRequestsPerChild  4000

# start apache
sudo service httpd start
sudo chkconfig httpd on

# install ffmpeg
wget http://johnvansickle.com/ffmpeg/releases/ffmpeg-release-64bit-static.tar.xz
cd /usr/local/bin
sudo tar xJf ~/ffmpeg-release-64bit-static.tar.xz
cd /usr/bin
sudo ln -s /usr/local/bin/ffmpeg-2.5.4-64bit-static/ffmpeg
sudo ln -s /usr/local/bin/ffmpeg-2.5.4-64bit-static/ffprobe

# clone huffduff-video repo and install for apache
cd ~
mkdir src
chmod a+rx ~/src
cd src
git clone git@github.com:snarfed/huffduff-video.git
# create and fill in aws_key_id and aws_secret_key files

cd /var/www/cgi-bin
sudo ln -s ~/src/huffduff-video/app.py
cd /var/www/html
sudo ln -s ~/src/huffduff-video/static/index.html

# clean up /tmp every hour
touch ~/crontab
echo "0 * * * *\tfind /tmp/ -user apache -not -newermt yesterday | xargs rm" >> ~/crontab
crontab crontab