Movie Dialogue and Closed Caption Data #44

Open
endremborza opened this issue Dec 30, 2021 · 1 comment
Labels
new project Creation of a new data project

Comments

@endremborza
Member

No description provided.

@endremborza endremborza added the new project Creation of a new data project label Dec 30, 2021
@kbenya
Contributor

kbenya commented Jan 12, 2022

Screenplay, Closed Caption collection

Screenplay

There is no comprehensive online database of screenplays. A few projects cover a particular series or film in a data format we can work with, such as Sex and the City or the original Star Wars trilogy.

Besides those, some websites try to collect screenplays. The most popular one is IMSDb. At first glance, a few factors make these scripts harder to use, such as differing file formats, inconsistent formatting, and multiple script versions, but it's workable.
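To give a sense of what "workable" could look like, here is a minimal sketch that pulls (speaker, line) pairs out of a plain-text screenplay. It assumes a common layout where speaker cues are all-caps lines indented by at least ten spaces and a blank line ends each speech; real scripts from the sites above may not follow this, so the regex and the `min_indent` value are assumptions to adjust per source. The function name `extract_dialogue` and the sample text are made up for illustration.

```python
import re


def extract_dialogue(script_text, min_indent=10):
    """Return (speaker, line) pairs from a plain-text screenplay.

    Assumption: a speaker cue is an all-caps line indented by at least
    `min_indent` spaces, optionally followed by "(V.O.)" or similar.
    """
    cue = re.compile(
        r"^\s{%d,}([A-Z][A-Z0-9 .'\-]*?)(?:\s*\(.*\))?\s*$" % min_indent
    )
    pairs = []
    speaker, buffer = None, []

    def flush():
        # Store the speech collected so far, if there is any.
        if speaker and buffer:
            pairs.append((speaker, " ".join(buffer)))

    for raw in script_text.splitlines():
        match = cue.match(raw)
        stripped = raw.strip()
        if match:
            flush()
            speaker, buffer = match.group(1).strip(), []
        elif not stripped:
            # A blank line ends the current speech.
            flush()
            speaker, buffer = None, []
        elif speaker:
            # Skip parentheticals such as "(quietly)".
            if not (stripped.startswith("(") and stripped.endswith(")")):
                buffer.append(stripped)
    flush()
    return pairs


# Made-up sample text, not taken from any real script.
sample = """
INT. KITCHEN - DAY

Alice walks in.

                    ALICE
          I'm ready.

                    BOB (O.S.)
               (quietly)
          Be careful,
          Alice.
"""
print(extract_dialogue(sample))
```

Pairs like these would be enough to start on either NLP or a character co-occurrence network, though scene boundaries would need separate handling.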

Closed Caption

It is much easier to access the captions of films and series. We could scrape websites that store subtitles (e.g. YIFY). However, those are not official, so it may be better (and cooler) to use more official sources.

I checked three streaming services: Netflix, Amazon Prime and HBO GO. Netflix's and Amazon's captions seem to be relatively easy to access, but further examination is needed.

Where to look in the browser: Developer tools -> Network -> caption file.
Amazon stores its captions in Timed Text Markup Language 2 (TTML2) format, so you only need to find that file among the requests.
Netflix caption files show up with names that start with "?o=".
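Once a TTML file has been saved from the Network tab, reading it could look roughly like the sketch below. It assumes the captions sit in `<p>` elements in the standard TTML namespace with `begin` and `end` attributes; I have not verified that Amazon's files follow exactly this layout, and the `parse_ttml` name and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Standard TTML namespace; assumed to apply to the downloaded file.
TTML_NS = "{http://www.w3.org/ns/ttml}"


def parse_ttml(xml_text):
    """Return a list of {"begin", "end", "text"} dicts, one per <p> cue."""
    root = ET.fromstring(xml_text)
    cues = []
    for p in root.iter(TTML_NS + "p"):
        # itertext() also picks up text after <br/> or inside <span>;
        # split/join collapses the whitespace between the pieces.
        text = " ".join(" ".join(p.itertext()).split())
        cues.append({"begin": p.get("begin"), "end": p.get("end"), "text": text})
    return cues


# Made-up sample, only to show the expected shape of the input.
SAMPLE_TTML = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body><div>
    <p begin="00:00:01.000" end="00:00:02.500">Hello there.</p>
    <p begin="00:00:03.000" end="00:00:04.000">Two<br/>lines</p>
  </div></body>
</tt>"""

for cue in parse_ttml(SAMPLE_TTML):
    print(cue["begin"], cue["end"], cue["text"])
```

Only the standard library is used here, so the same function could run inside a scraper without extra dependencies.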

@kbenya kbenya closed this as completed Jan 14, 2022
@endremborza endremborza changed the title from "Collect screenplay data for either nlp or character networks for films / TV series" to "Movie Dialogue and Closed Caption Data" May 23, 2022
@endremborza endremborza reopened this May 23, 2022
Projects
Status: Todo
Development

No branches or pull requests

2 participants