---
title: "Looking into YouTube Free Movies"
subtitle: "You get what you (don't) pay for"
date: 2020-05-10
categories: 
  - Python
tags: 
  - munging
  - cleaning
  - movies
slug: "youtube-movies"
---

## Looking into YouTube Movies

I wanted to see what kind of ratings the free movies on YouTube are getting.

In [167]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [168]:
url = 'https://www.youtube.com/feed/storefront?bp=kgEmCGQSIlBMSFBUeFR4dEMwaWJWWnJUMl9XS1dVbDJTQXhzS3VLd3iiBQIoAg%3D%3D'
page = requests.get(url)


In [169]:
soup = BeautifulSoup(page.text, 'html.parser')

In [170]:
print(soup.prettify()[:200])


<!DOCTYPE html>
<html data-cast-api-enabled="true" lang="en">
 <head>
  <style name="www-roboto">
   @font-face{font-family:'Roboto';font-style:normal;font-weight:500;src:local('Roboto Medium'),local(


In [171]:
html_films = soup.find_all(class_="yt-lockup-title")

for film in html_films[:5]:
    print(film.get_text())

Dino King - Duration: 1:28:47.
Snow Queen - Duration: 1:16:07.
Beyond Beyond - Duration: 1:19:24.
Fair Game - Duration: 1:47:43.
Sleepover - Duration: 1:29:29.


In [172]:
movies = [film.get_text() for film in html_films]

In [173]:
movies[:6]

['Dino King - Duration: 1:28:47.',
 'Snow Queen - Duration: 1:16:07.',
 'Beyond Beyond - Duration: 1:19:24.',
 'Fair Game - Duration: 1:47:43.',
 'Sleepover - Duration: 1:29:29.',
 'The Magic of Belle Isle - Duration: 1:49:19.']

In [174]:
df = pd.DataFrame(movies)
df.rename(columns={0: 'movie'}, inplace=True)

In [175]:
df[df.movie.str.contains(' - Duration: ')].head()

| | movie |
|---|---|
| 0 | Dino King - Duration: 1:28:47. |
| 1 | Snow Queen - Duration: 1:16:07. |
| 2 | Beyond Beyond - Duration: 1:19:24. |
| 3 | Fair Game - Duration: 1:47:43. |
| 4 | Sleepover - Duration: 1:29:29. |


In [176]:
df = df.movie.str.split(' - Duration: ', expand=True)

In [177]:
df[1] = df[1].str.rstrip('.')

In [178]:
df = df.reset_index()

In [179]:
df.rename(columns={0: 'yt_title', 1: 'yt_duration', 'index': 'yt_id'}, inplace=True)

In [180]:
df.head()

| | yt_id | yt_title | yt_duration |
|---|---|---|---|
| 0 | 0 | Dino King | 1:28:47 |
| 1 | 1 | Snow Queen | 1:16:07 |
| 2 | 2 | Beyond Beyond | 1:19:24 |
| 3 | 3 | Fair Game | 1:47:43 |
| 4 | 4 | Sleepover | 1:29:29 |
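The split, strip, and rename cells above could also be collapsed into one regular-expression step. This is only a sketch on two sample strings copied from the output above; the `pattern` and `parsed` names are hypothetical, and it assumes every scraped string ends in ` - Duration: H:MM:SS.`.

```python
import pandas as pd

# Sample strings in the same shape as the scraped titles above.
raw = pd.Series([
    'Dino King - Duration: 1:28:47.',
    'Snow Queen - Duration: 1:16:07.',
])

# Named groups become column names; the trailing period stays outside the groups.
pattern = r'^(?P<yt_title>.*) - Duration: (?P<yt_duration>[\d:]+)\.?$'
parsed = raw.str.extract(pattern)

# Mirror the notebook's yt_id column, which is just the row position.
parsed.insert(0, 'yt_id', range(len(parsed)))
print(parsed)
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed titles easy to spot.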


### Convert Duration to minutes

In [181]:
def split_time(x):
    # x looks like 'H:MM:SS'; keep hours and minutes, drop the seconds
    numbers = x.split(':')
    time = int(numbers[0]) * 60 + int(numbers[1])
    return time

In [182]:
df['yt_minutes'] = df['yt_duration'].apply(split_time)
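`split_time` reads the first field as hours, so a video shorter than an hour (listed as `MM:SS`) would come out wrong. A minimal sketch of a more forgiving version follows; `to_minutes` is a hypothetical helper, and the `45:10` sample is made up to show the short case.

```python
import pandas as pd

durations = pd.Series(['1:28:47', '1:16:07', '45:10'])

def to_minutes(text):
    """Convert 'H:MM:SS' or 'MM:SS' to whole minutes, dropping the seconds."""
    parts = [int(p) for p in text.split(':')]
    # Left-pad with zeros so 'MM:SS' is treated as 0 hours.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, _seconds = parts
    return hours * 60 + minutes

minutes = durations.apply(to_minutes)
print(minutes.tolist())  # [88, 76, 45]
```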

## IMDb Data

In [183]:
imdb_ratings = pd.read_csv('/Users/zachbogart/Downloads/title.ratings.tsv', sep='\t')
imdb_basics = pd.read_csv('/Users/zachbogart/Downloads/title.basics.tsv', sep='\t')
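The IMDb files mark missing values with `\N` (it shows up in the `endYear` column of the `head()` output further down). A hedged sketch of reading them so those markers become real NaN values is below. It uses a tiny in-memory stand-in rather than the actual download, and `low_memory=False` is an optional choice, not something the original cells used.

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv: tab-separated, with \N for missing values.
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\t45\n"
    "tt0000335\tmovie\tSoldiers of the Cross\t1900\t\\N\n"
)

basics = pd.read_csv(
    io.StringIO(sample),
    sep='\t',
    na_values='\\N',   # treat IMDb's \N marker as NaN
    low_memory=False,  # read in one pass so column dtypes are inferred consistently
)
print(basics.dtypes)
print(basics['runtimeMinutes'].isna().sum())
```

With the markers converted, a later `dropna()` actually removes rows that lack a runtime or year instead of keeping the literal string `\N`.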



In [184]:
imdb = imdb_basics.merge(imdb_ratings, how='left', on='tconst')

In [185]:
imdb.shape

(6831547, 11)

- Let's just look at movies (we get what we get)

In [186]:
imdb.titleType.value_counts()

tvEpisode       4869408
short            741081
movie            551301
video            265727
tvSeries         184466
tvMovie          121175
tvMiniSeries      31078
tvSpecial         29209
videoGame         25548
tvShort           12554
Name: titleType, dtype: int64

In [187]:
imdb = imdb.loc[imdb.titleType == 'movie']

In [188]:
imdb.shape

(551301, 11)

In [189]:
imdb.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | tt0000009 | movie | Miss Jerry | Miss Jerry | 0 | 1894 | \N | 45 | Romance | 5.9 | 153.0 |
| 145 | tt0000147 | movie | The Corbett-Fitzsimmons Fight | The Corbett-Fitzsimmons Fight | 0 | 1897 | \N | 20 | Documentary,News,Sport | 5.2 | 346.0 |
| 332 | tt0000335 | movie | Soldiers of the Cross | Soldiers of the Cross | 0 | 1900 | \N | \N | Biography,Drama | 6.1 | 40.0 |
| 499 | tt0000502 | movie | Bohemios | Bohemios | 0 | 1905 | \N | 100 | \N | 3.8 | 6.0 |
| 571 | tt0000574 | movie | The Story of the Kelly Gang | The Story of the Kelly Gang | 0 | 1906 | \N | 70 | Biography,Crime,Drama | 6.1 | 574.0 |


## Try Joining

In [190]:
joined = df.merge(imdb, how='left', left_on='yt_title', right_on='primaryTitle')

In [191]:
# drop rows with any null value, then keep the first 30 rows sorted by title
joined = joined.dropna().sort_values('primaryTitle').head(30)

Some titles match more than one IMDb entry. Let's not deal with that overlap and focus on just the movies that have exactly one match.

In [192]:
joined.yt_title.value_counts()

Aftershock                       3
Bandits                          3
Bad Trip                         3
A Girl Like Her                  2
A Little Bit of Heaven           2
All We Had                       2
Back in Time                     1
Alien Code                       1
Apartment 1303                   1
Alcatraz                         1
4 Minute Mile                    1
8 Assassins                      1
A Cowgirl's Story                1
17 Miracles                      1
Arthur & Merlin                  1
Almost Adults                    1
Atlas Shrugged II: The Strike    1
American Ninja                   1
2036 Origin Unknown              1
Alien Arrival                    1
Alex & The List                  1
Name: yt_title, dtype: int64
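One way to keep only the unambiguous titles is to flag every row whose `yt_title` appears more than once and drop those rows. This is a sketch on a toy frame: the titles come from the counts above, but the `tconst` values are made up, and `single_match` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for `joined`; the tconst ids are placeholders, not real IMDb ids.
joined = pd.DataFrame({
    'yt_title': ['Aftershock', 'Aftershock', 'Alcatraz',
                 'Bandits', 'Bandits', 'Alien Code'],
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5', 'tt6'],
})

# duplicated(keep=False) marks every row whose title occurs more than once.
is_ambiguous = joined['yt_title'].duplicated(keep=False)
single_match = joined.loc[~is_ambiguous].reset_index(drop=True)
print(single_match)
```

Here `Aftershock` and `Bandits` are dropped entirely, leaving `Alcatraz` and `Alien Code`.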

#### Resources
- https://docs.python-guide.org/scenarios/scrape/
- https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3
- https://www.geeksforgeeks.org/split-a-text-column-into-two-columns-in-pandas-dataframe/


#### Image Credit
integrated system by Zach Bogart from the [Noun Project](https://thenounproject.com/search/?q=integrated%20system&creator=4129988&i=3169228) 