This was the final project of four Le Wagon Data Science students. We sourced and processed a wide range of stand-up scripts to analyse comedians and generate our own comedy.
- Constructed a scraper to source and clean 3.6 million words of stand-up comedy from a variety of online sources using BeautifulSoup, Requests, and Pandas.
- Carried out a machine learning analysis using Latent Dirichlet Allocation (LDA) topic modelling on the processed dataset and used word clouds to visualise the results.
- Fine-tuned a bot using GPT-2 (gpt-2-simple, credit: Max Woolf) on the subset of female comedian scripts and generated entirely original stand-up comedy material from only a few words of text input. Created an API for the bot using FastAPI, Docker and Google Cloud Run.
- Integrated and published the completed project on a public site using Heroku and Streamlit.
- Scraps From the Loft - Scripts and Year
- Subsaga - Scripts
- TMDB - The Movie Database - Artist Age and Gender
- 555 individual transcripts
- 268 comedians
- 19 million characters
- 3.6 million words
- Python
- Jupyter Notebook
- Requests + BeautifulSoup
- Pandas
- NLTK
- Gensim
- GPT-2
- Docker
- Google Cloud AI Platform
- Google Cloud Run
- Heroku
- Streamlit
A full list of Python packages can be found in the requirements.txt file.
- Reinis Melbardis @rmelbardis
- Yuqing Wang @yuqingwwang
- James Farrell @jfazz9
- Catriona Beamish @beamishc
- 0.1
- Initial Release