Extract the audio from videos on YouTube, Vimeo, and other sites and send it to Huffduffer.
Python HTML JavaScript
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
static
.gitignore
.gitmodules
README.md
app.py
s3_lifecycle.json
s3_robots.txt

README.md

huffduff-video

Extracts the audio from videos on YouTube, Vimeo, and many more sites and sends it to Huffduffer.

See huffduff-video.snarfed.org for bookmarklet and usage details.

Uses youtube-dl to download the video and extract its audio track. Stores the resulting MP3 file in S3.

License: this project is placed in the public domain. Alternatively, you may use it under the CC0 license.

Related projects

  • youtube-dl-api-server is a web front-end that uses youtube-dl to extract and return a video's metadata.
  • Flask webapp and Chrome extension for using youtube-dl to download a video to local disk.
  • iOS workflow that does the same thing as huffduff-video, except all client side: downloads a YouTube video, converts it to MP3, uploads the MP3 to Dropbox, and passes it to Huffduffer.

Cost and storage

I track monthly costs here. They come from this AWS billing page.

As for determining storage usage, the aws command line tool is nice, but the man page isn't very useful. Here's the online reference, here's aws s3 (high level but minimal), and here's aws s3api (much more powerful).

Run this see the current usage (from http://serverfault.com/a/644795/274369):

aws --profile=personal s3api list-objects --bucket huffduff-video \
  --query "[sum(Contents[].Size), length(Contents[])]"

Our S3 bucket lifecycle is in s3_lifecycle.json. I ran these commands to set a lifecycle that deletes files after 30d. (Config docs, put-bucket-lifecycle docs.)

# show an example lifecycle template
aws s3api put-bucket-lifecycle --generate-cli-skeleton

# set the lifecycle
aws s3api put-bucket-lifecycle --bucket huffduff-video \
  --lifecycle-configuration "`json_pp -json_opt loose <s3_lifecycle.json`"

# check that it's there
aws s3api get-bucket-lifecycle --bucket huffduff-video

As of 3/10/2015, users are putting roughly 2GB/day into S3, ie 180GB steady state for the lifecycle period of 90d. At $.03/GB/month, that costs $5.40/month. I could use RRS (Reduced Redundancy Storage), which costs $.024/GB/month ie $4.32/month, but that's not a big difference.

There are definitely cheaper alternatives outside AWS. Backblaze B2, for example, is < 1/4 S3's price. Worth a look if S3 gets too expensive.

Monitoring

I set up CloudWatch to monitor and alarm on EC2 instance system checks, billing thresholds, HTTP logs, and application level exceptions. When alarms fire, it emails and SMSes me.

The monitoring alarms are in us-west-2 (Oregon), but the billing alarms have to be in us-east-1 (Virginia). Each region has its own SNS topic for notifications: us-east-1 us-west-2

System metrics

To get system-level custom metrics for memory, swap, and disk space, I set up Amazon's custom monitoring scripts.

sudo yum install perl-DateTime perl-Sys-Syslog perl-LWP-Protocol-https
wget http://aws-cloudwatch.s3.amazonaws.com/downloads/CloudWatchMonitoringScripts-1.2.1.zip
unzip CloudWatchMonitoringScripts-1.2.1.zip
rm CloudWatchMonitoringScripts-1.2.1.zip
cd aws-scripts-mon

cp awscreds.template awscreds.conf
# fill in awscreds.conf
./mon-put-instance-data.pl --aws-credential-file ~/aws-scripts-mon/awscreds.conf --mem-util --swap-util --disk-space-util --disk-path=/ --verify

crontab -e
# add this line:
# * * * * *	./mon-put-instance-data.pl --aws-credential-file ~/aws-scripts-mon/awscreds.conf --mem-util --swap-util --disk-space-util --disk-path=/ --from-cron

Log collection

To set up HTTP and application level monitoring, I had to:

  • add an IAM policy
  • install the logs agent with sudo yum install awslogs
  • add my IAM credentials to /etc/awslogs/awscli.conf and set region to us-west-2
  • add these lines to /etc/awslogs/awslogs.conf:
[/var/log/httpd/access_log]
file = /var/log/httpd/access_log*
log_group_name = /var/log/httpd/access_log
log_stream_name = {instance_id}
datetime_format = %d/%b/%Y:%H:%M:%S %z

[/var/log/httpd/error_log]
file = /var/log/httpd/error_log*
log_group_name = /var/log/httpd/error_log
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S %Y

# WSGI writes Python exception stack traces to this log file across multiple
# lines, and I'd love to collect them multi_line_start_pattern or something
# similar, but each line is prefixed with the same timestamp + severity + etc
# prefix as other lines, so I can't.
  • start the agent and restart it on boot:
sudo service awslogs start
sudo service awslogs status
sudo chkconfig awslogs on
  • wait a while, then check that the logs are flowing:
aws --region us-west-2 logs describe-log-groups
aws --region us-west-2 logs describe-log-streams --log-group-name /var/log/httpd/access_log
aws --region us-west-2 logs describe-log-streams --log-group-name /var/log/httpd/error_log
  • define a few metric filters so we can graph and query HTTP status codes, error messages, etc:
aws logs put-metric-filter --region us-west-2 \
  --log-group-name /var/log/httpd/access_log \
  --filter-name HTTPRequests \
  --filter-pattern '[ip, id, user, timestamp, request, status, bytes]' \
  --metric-transformations metricName=count,metricNamespace=huffduff-video,metricValue=1

aws logs put-metric-filter --region us-west-2 \
  --log-group-name /var/log/httpd/error_log \
  --filter-name PythonErrors \
  --filter-pattern '[timestamp, error_label, prefix = "ERROR:root:ERROR:", ...]' \
  --metric-transformations metricName=errors,metricNamespace=huffduff-video,metricValue=1

aws --region us-west-2 logs describe-metric-filters --log-group-name /var/log/httpd/access_log
aws --region us-west-2 logs describe-metric-filters --log-group-name /var/log/httpd/error_log

Understanding bandwidth usage

As of 2015-04-29, huffduff-video is serving ~257 GB/mo (via S3), which costs ~$24/mo in bandwidth alone. I'm ok with that, but I think it could be lower.

As always, measure first, then optimize. To learn a bit more about who's downloading these files, I turned on S3 access logging, waited 24h, then ran these commands to collect and aggregate the logs:

aws --profile personal s3 sync s3://huffduff-video/logs .
grep -R REST.GET.OBJECT . | grep ' 200 ' | grep -vE 'robots.txt|logs/20' \
  | cut -d' ' -f8,20- | sort | uniq -c | sort -n -r > user_agents
grep -R REST.GET.OBJECT . | grep ' 200 ' | grep -vE 'robots.txt|logs/20' \
  | cut -d' ' -f5 | sort | uniq -c | sort -n -r > ips

This gave me some useful baseline numbers. Over a 24h period, there were 482 downloads, 318 of which came from bots. (That's 2/3!) Out of the six top user agents by downloads, five were bots. The one exception was the Overcast podcast app.

(Side note: Googlebot-Video is polite and includes Etag or If-Modified-Since when it refetches files. It sent 68 requests, but exactly half of those resulted in an empty 304 response. Thanks Googlebot-Video!)

I switched huffduff-video to use S3 URLs on the huffduff-video.s3.amazonaws.com virtual host, added a robots.txt file that blocks all bots, waited 24h, and then measured again. The vast majority of huffduff-video links on Huffduffer are still on the s3.amazonaws.com domain, which doesn't serve my robots.txt, so I didn't expect a big difference...but I was wrong. Twitterbot had roughly the same number, but the rest were way down:

(Googlebot-Video was way farther down the chart with just 4 downloads.)

This may have been due to the fact that my first measurement was Wed-Thurs, and the second was Fri-Sat, which are slower social media and link sharing days. Still, I'm hoping some of it was due to robots.txt. Fingers crossed the bots will eventually go away altogether!

To update the robots.txt file:

aws --profile personal s3 cp --acl=public-read ~/src/huffduff-video/s3_robots.txt s3://huffduff-video/robots.txt

I put this in a cron job to run every 30d. I had to run aws configure first and give it the key id and secret.

To find a specific bot's IPs:

$ grep -R FlipboardProxy . | cut -d' ' -f5 |sort |uniq
34.207.219.235
34.229.167.12
34.229.216.231
52.201.0.135
52.207.240.171
54.152.58.154
54.210.190.43
54.210.24.16

...and then to block them, add them to the bucket policy:

{
  "Version": "2012-10-17",
  "Id": "Block IPs",
  "Statement": [
    {
      "Sid": "Block FlipboardProxy (IPs collected 1/25-26/2017)",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::huffduff-video/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "34.207.219.235/32",
            "34.229.167.12/32",
            "34.229.216.231/32",
            "52.201.0.135/32",
            "52.207.240.171/32",
            "54.152.58.154/32",
            "54.210.190.43/32",
            "54.210.24.16/32"
          ]
        }
      }
    }
  ]
}

While doing this, I discovered something a bit interesting: Huffduffer itself seems to download a copy of every podcast that gets huffduffed, ie the full MP3 file. It does this with no user agent, from 146.185.159.94, which reverse DNS resolves to huffduffer.com.

I can't tell that any Huffduffer feature is based on the actual audio from each podcast, so I wonder why they download them. I doubt they keep them all. Jeremy probably knows why!

Something also downloads a lot from 54.154.42.3 (on Amazon EC2) with user agent Ruby. No reverse DNS there though.

Memory tuning

t2.micros only have 1GB of memory, so sometimes the system runs out. Lines like these show up in /var/log/httpd/error_log:

[Mon Jul 03 11:20:09.050893 2017] [mpm_prefork:error] [pid 26214] (12)Cannot allocate memory: AH00159: fork: Unable to fork new process
[Mon Jul 03 11:20:19.471164 2017] [reqtimeout:info] [pid 26962] [client 220.253.163.157:54651] AH01382: Request header read timeout
[Mon Jul 03 12:37:27.462868 2017] [mpm_prefork:info] [pid 26214] AH00162: server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 0 idle, and 7 total children
[Sun Jul 02 16:54:30.038240 2017] [:error] [pid 5039] [client 174.127.212.155:54658] ImportError: /usr/lib64/python2.7/lib-dynload/_functoolsmodule.so: failed to map segment from shared object: Cannot allocate memory

I made a 4GB swap partition on 2017-07-04 with:

sudo dd if=/dev/zero of=/var/swapfile bs=1M count=4096
sudo chmod 600 /var/swapfile
sudo mkswap /var/swapfile
sudo swapon /var/swapfile

System setup

Currently on EC2 t2.micro instance.

I started it originally on a t2.micro. I migrated it to a t2.nano on 2016-03-24, but usage outgrew the nano's CPU quota, so I migrated back to a t2.micro on 2016-05-25.

I did both migrations by making an snapshot of the t2.micro's EBS volume, making an AMI from the snapshot, then launching a new t2.nano instance using that AMI. Details.

Here's how I set it up:

sudo yum remove httpd httpd-tools  # uninstall apache 2.2 before installing 2.4
sudo yum install git httpd24 httpd24-tools httpd24-devel mod24_wsgi-python27 python27-devel python27-pip tcsh telnet
sudo update-alternatives --set python /usr/bin/python2.7
sudo yum groupinstall 'Web Server' 'PHP Support'
sudo pip install boto webob youtube-dl

# Check that mod_wsgi is at least version 3.4! We need 3.4 to prevent this error when
# running youtube-dl under WSGI:
# AttributeError: 'mod_wsgi.Log' object has no attribute 'isatty'
#
# *If* it's not, build 3.4 from scratch (but check that it's also python 2.7!):
curl -o mod_wsgi-3.4.tar.gz https://modwsgi.googlecode.com/files/mod_wsgi-3.4.tar.gz
tar xvzf mod_wsgi-3.4.tar.gz
cd mod_wsgi-3.4
sudo yum install httpd-devel -y
./configure
sudo make install

# add these lines to /etc/httpd/conf/httpd.conf
#
# # for huffduff-video
# LoadModule wsgi_module /usr/lib64/httpd/modules/mod_wsgi27.so
# Options FollowSymLinks
# WSGIScriptAlias /get /var/www/cgi-bin/app.py
# LogLevel info
#
# # tune number of prefork server processes
# # see http://fuscata.com/kb/set-maxclients-apache-prefork etc.
# StartServers       8
# MinSpareServers    2
# MaxSpareServers    4
# ServerLimit        12
# MaxClients         12
# MaxRequestsPerChild  4000

# start apache
sudo service httpd start
sudo chkconfig httpd on

# install ffmpeg
wget http://johnvansickle.com/ffmpeg/releases/ffmpeg-release-64bit-static.tar.xz
cd /usr/local/bin
sudo tar xJf ~/ffmpeg-release-64bit-static.tar.xz
cd /usr/bin
sudo ln -s ffmpeg-2.5.4-64bit-static/ffmpeg
sudo ln -s ffmpeg-2.5.4-64bit-static/ffprobe

# clone huffduff-video repo and install for apache
cd ~
mkdir src
chmod a+rx ~/src
cd src
git clone git@github.com:snarfed/huffduff-video.git
# create and fill in aws_key_id and aws_secret_key files

cd /var/www/cgi-bin
sudo ln -s ~/src/huffduff-video/app.py
cd /var/www/html
sudo ln -s ~/src/huffduff-video/static/index.html
sudo ln -s ~/src/huffduff-video/static/robots.txt
sudo ln -s ~/src/huffduff-video/static/util.js

touch ~/crontab
# clean up /tmp every hour
echo "0 * * * *\tfind /tmp/ -user apache -not -newermt yesterday | xargs rm" >> ~/crontab
# auto upgrade youtube-dl daily
echo "10 10 * * *	sudo pip install -U youtube-dl; sudo service httpd restart" >> ~/crontab
# recopy robots.txt to S3 since our bucket expiration policy deletes it monthly
echo "1 2 3 * *	aws s3 cp --acl=public-read ~/src/huffduff-video/s3_robots.txt s3://huffduff-video/robots.txt"
crontab crontab