
I used transcripts of television episodes to look for patterns between shows using the Latent Dirichlet Allocation (LDA) model, which clusters similar shows based on whether they use similar words at similar frequencies.


salice/television_show_recommender


Television Show Recommender System


Executive Summary


I analyzed the transcripts of 117,937 television episodes from 4,667 different television shows using Latent Dirichlet Allocation (LDA) to find clusters of common language between shows, and then used those similarities to build a content-based recommender for television shows.
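To make the recommendation step concrete, here is a minimal sketch of how per-show topic mixtures (the kind of output an LDA model produces) could be turned into recommendations. It is an illustration only, not the code in final_code: the show names, the three-topic vectors, the helper names cosine_similarity and recommend, and the choice of cosine similarity as the distance measure are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def recommend(show, topic_vectors, top_n=3):
    """Return the top_n shows whose topic mixture is closest to `show`."""
    target = topic_vectors[show]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in topic_vectors.items()
        if other != show  # never recommend the show itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical topic mixtures (each row sums to 1), standing in for the
# per-show topic distributions an LDA model would assign.
topic_vectors = {
    "Hospital Drama A": [0.80, 0.10, 0.10],
    "Hospital Drama B": [0.70, 0.20, 0.10],
    "Space Opera": [0.05, 0.90, 0.05],
    "Cooking Contest": [0.10, 0.05, 0.85],
}

if __name__ == "__main__":
    for name, score in recommend("Hospital Drama A", topic_vectors):
        print(f"{name}: {score:.3f}")
```

Because the recommender only compares topic mixtures, it needs no viewing history or ratings, which is what makes it content-based.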

System Requirements


  • Python==3.7.3
  • gensim==3.8.1
  • Flask==1.1.1
  • nltk==3.4.5
  • pandas==0.25.2
  • matplotlib==3.1.1
  • numpy==1.17.2
  • spacy==2.2.1
  • spacy-langdetect==0.1.2
  • beautifulsoup4==4.8.0

For Google Cloud Virtual Instance:

  • a virtual machine with at least 104 GB of RAM
  • google-api-core==1.14.3
  • google-auth==1.7.1
  • google-auth-oauthlib==0.4.1
  • google-cloud==0.34.0
  • google-cloud-core==1.0.3
  • google-cloud-storage==1.23.0
  • google-pasta==0.1.8
  • google-resumable-media==0.5.0

How to Use this Repository


All final production code is in the final_code folder. The development_code folder contains other code written during the project that was not used to create the final result. The notebooks and Python scripts are listed in chronological order. None of my final data is posted because of its size (2.6 GB), but please contact me if you would like a copy!
