# Discussion of results
---
![](_img/header_bild_02.jpg)
<sub><sup>Image credit: Melanie Pongratz, unsplash.com</sup></sub>

# Summary

- I have acquired a comprehensive data set, cleaned and prepared the data and applied a range of modelling and labeling techniques.
- This allowed me to enrich the existing textual metadata of ~7.6k podcasts and 462k episodes.
- Simpler modelling approaches with LDA or Doc2Vec proved to be robust and fast to train. 
- State-of-the-art methods like Top2Vec, entity recognition with spaCy or zero shot classification with Transformer models yielded even more meaningful and creative ways to refine the given metadata.
- I examined the quality of all models by assessing their ability to classify the primary genre, a label that is given in the data set and that I used as a proxy metric. Top2Vec with a support vector classifier yielded the best results.
- I am confident that the resulting labels, vectors, embeddings, clusterings and classifications are suitable to enhance podcast search and would allow users to discover audio content in novel ways.

# Results from a data product perspective

## Data acquisition and preparation

- Data acquisition is unfortunately cumbersome. For this project I focused on iTunes alone and even so had to retrieve data from various sources and thoroughly harmonize it.
- A search and discovery data product would make much more sense if it tied in podcasts from all major platforms like Apple, Spotify, Google, Amazon etc.
- RSS feeds are very diverse in content and structure. For a data product I'd have to establish data validation and normalization to assure consistent quality.
- For a data product it would make sense to extend the data set with podcasts from Switzerland and Austria. More effort would be needed to filter out foreign language content (especially French, Italian and probably Swiss German) or to accommodate these in a multi-language approach.
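
The validation and normalization step mentioned above could start out as small as the following sketch. It assumes plain RSS 2.0 input and uses only the Python standard library. The function name `normalize_feed` and the choice of required fields are hypothetical and not part of the project code.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimum set of fields for this sketch.
REQUIRED_FIELDS = ("title", "description")


def normalize_feed(xml_text):
    """Parse an RSS 2.0 document and return (podcast, problems).

    `podcast` is a dict with whitespace-normalized text fields,
    `problems` is a list of human-readable validation messages.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return None, ["missing <channel>"]

    def text_of(node, tag):
        # A missing or empty tag becomes ""; runs of whitespace collapse.
        value = node.findtext(tag, default="") or ""
        return " ".join(value.split())

    problems = []
    podcast = {field: text_of(channel, field) for field in REQUIRED_FIELDS}
    podcast["language"] = text_of(channel, "language").lower()
    for field in REQUIRED_FIELDS:
        if not podcast[field]:
            problems.append(f"podcast: empty {field}")

    podcast["episodes"] = []
    for i, item in enumerate(channel.findall("item")):
        episode = {field: text_of(item, field) for field in REQUIRED_FIELDS}
        if not episode["title"]:
            problems.append(f"episode {i}: empty title")
        podcast["episodes"].append(episode)
    return podcast, problems
```

The normalized `language` field could then serve as a first, coarse filter for foreign language feeds, although feed-declared languages are not necessarily reliable.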

## EDA

- It appears that we get sufficient data for a data product from the majority of podcasts.
- Although the data contains promotional content and sponsor messages, this did not appear to be a problem at any point during the project.
- If we used the data as such, podcasts with many available episodes would have an advantage, simply because there is so much more information to work with. If we wanted to give podcasts with less textual content equal chances of discovery in a data product, we would have to counterbalance this.
- For a data product we would also need to look into the imbalance caused by a few dominant genres, e.g. `Bildung`, `Gesellschaft und Kultur` or `Comedy`.
- It makes sense to at least give the user the option to focus on podcasts that are still actively maintained, e.g. by filtering out all podcasts that weren't updated in the last 12 months.
- To increase serendipity in search, it makes sense to strike a balance between surfacing successful top-listed / chart podcasts and long-tail offerings.
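
Two of the points above, the 12-month activity filter and the counterbalancing of episode-rich podcasts, can be sketched in a few lines. This is only one possible approach: the dictionary keys `last_update` and `episodes`, the cap of 50 episodes, and the assumption that episodes are sorted newest first are all made up for illustration.

```python
from datetime import datetime, timedelta


def active_podcasts(podcasts, now, max_age_days=365):
    """Keep podcasts whose last update is at most `max_age_days` old."""
    cutoff = now - timedelta(days=max_age_days)
    return [p for p in podcasts if p["last_update"] >= cutoff]


def cap_episodes(podcasts, max_episodes=50):
    """Limit the episode texts per podcast so large catalogs dominate less.

    Assumes `episodes` is sorted newest first. Returns new dicts and
    leaves the input untouched.
    """
    return [{**p, "episodes": p["episodes"][:max_episodes]} for p in podcasts]
```

Capping is the bluntest way to even out the amount of text per podcast. Sampling episodes or weighting them would be alternatives worth testing.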

## Modelling with LDA, Doc2Vec and Top2Vec

- All three approaches were fast to train and yielded meaningful results.
- We have no way of comparing the results exactly. However, Top2Vec has the unique advantage of creating a unified vector space that contains embeddings for words, topics and documents. This allows for much more fine-grained recommendations and more flexible search.
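
To illustrate why a unified vector space is useful for search, here is a toy sketch in plain Python. It does not use the Top2Vec API. The two-dimensional vectors and all names are invented, and a real model would have a few hundred dimensions. The point is only that a word vector can be used to query documents, and a document vector to query topics, with the same similarity function.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, items, top_n=3):
    """Return the names of the `top_n` items most similar to `query_vec`."""
    ranked = sorted(items.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Invented toy embeddings that all live in the same space.
words = {"fussball": [0.9, 0.1], "politik": [0.1, 0.9]}
topics = {"topic_sport": [1.0, 0.0], "topic_politics": [0.0, 1.0]}
documents = {"podcast_sport": [0.8, 0.2], "podcast_news": [0.2, 0.8]}

# Search documents by keyword, then look up the topic of a document.
hits = nearest(words["fussball"], documents, top_n=1)
topic = nearest(documents["podcast_news"], topics, top_n=1)
```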

## Entity recognition with spaCy

- I perceive entities as an interesting addition to the functionality of a potential data product. 
- Recognition was fast and yielded meaningful results.
- However, we also get a lot of noise that has to be filtered out.
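
A simple noise filter could look like the sketch below. It assumes the entities were already extracted as `(text, label)` pairs, e.g. via `[(ent.text, ent.label_) for ent in doc.ents]`. The label set `PER`, `ORG`, `LOC` and both thresholds are assumptions for illustration and would need tuning on the real data.

```python
from collections import Counter

# Assumed set of entity labels worth keeping.
KEEP_LABELS = {"PER", "ORG", "LOC"}


def filter_entities(entities, min_count=2, min_length=3):
    """Count entities and drop likely noise.

    An entity is dropped if its label is not in KEEP_LABELS, if it is
    shorter than `min_length`, if it contains no letter, or if it occurs
    fewer than `min_count` times.
    """
    counts = Counter()
    for text, label in entities:
        cleaned = " ".join(text.split())  # collapse stray whitespace
        if label not in KEEP_LABELS:
            continue
        if len(cleaned) < min_length or not any(ch.isalpha() for ch in cleaned):
            continue
        counts[(cleaned, label)] += 1
    return {key: n for key, n in counts.items() if n >= min_count}
```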

## Zero shot classification

- My impression is that zero shot classification opens up the most creative and novel ways of creating metadata for podcasts. It is only up to our imagination and experimentation to «ask» for the right labels to enhance discovery.
- For a data product it would be interesting to collect unusual candidate labels from searches, especially from ones where traditional approaches did not satisfy the user's query. We could then iteratively feed these to a Transformer model, retrieve the zero shot classification and update the existing metadata with the results.
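
The iterative update described above could be sketched as follows. The classifier is passed in as a function so that the loop itself stays independent of any model. The function name `update_labels`, the `description` and `labels` keys and the threshold of 0.8 are hypothetical. The result format (a dict with `labels` and `scores`) mirrors what the Hugging Face zero shot pipeline returns.

```python
def update_labels(metadata, classify, candidate_labels, threshold=0.8):
    """Attach every candidate label that scores >= `threshold`.

    `classify(text, labels)` must return a dict with parallel lists
    under the keys "labels" and "scores".
    """
    for podcast in metadata:
        result = classify(podcast["description"], candidate_labels)
        for label, score in zip(result["labels"], result["scores"]):
            if score >= threshold:
                podcast.setdefault("labels", set()).add(label)
    return metadata


# With Hugging Face Transformers, `classify` could be built like this
# (not executed here, since it downloads a model):
#
#   from transformers import pipeline
#   zero_shot = pipeline("zero-shot-classification")
#   classify = lambda text, labels: zero_shot(text, labels, multi_label=True)
```

Using `multi_label=True` scores each candidate label independently, which fits the idea of adding several new labels to the same podcast.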

## Comparing the results quantitatively

- To compare the various approaches quantitatively I chose the primary genre as a ground truth label, trained various classifiers on the vectors (LDA, Doc2Vec, Top2Vec) and measured their performance by a weighted F1 score.
- Since zero shot classification doesn't need any training I could directly infer the labels and again measure the F1 score. 
- An SVC trained on Top2Vec vectors performed best. The other vector sets and zero shot classification performed somewhat worse but came close.
- It makes intuitive sense that the most advanced embedding used – Top2Vec – yields the best results. 
- Analyzing the prediction errors of the zero shot model in the confusion matrix revealed that a substantial share of them are understandable, and some even seem closer to a sensible genre than the actual label given by the creator.
- I take these findings as a plausible indication that all approaches can be used successfully to generate metadata and thereby improve the discovery of podcasts.
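
For readers unfamiliar with the metric: the weighted F1 score is the per-genre F1 score averaged with each genre's share of the true labels as weight, which matters here because the genres are imbalanced. The following plain Python sketch shows the computation. It is an illustration of the metric, not the evaluation code used in the project. As far as I know, scikit-learn's `f1_score(y_true, y_pred, average="weighted")` computes the same quantity.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


# Tiny made-up example with two genres.
y_true = ["Comedy", "Comedy", "Bildung", "Bildung"]
y_pred = ["Comedy", "Bildung", "Bildung", "Bildung"]
example_score = weighted_f1(y_true, y_pred)
```

In the example, `Comedy` reaches an F1 of 2/3 and `Bildung` an F1 of 0.8. Both genres have the same support, so the weighted score is their plain mean.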

## Personal notes

- Working through all the steps took a lot of time and brought many challenges. However, I am honestly surprised how robustly and quickly the tools work and how interesting, varied and meaningful the results are.
- During the course of the project I had the opportunity to substantially extend my knowledge of NLP techniques and look into methods I hadn't used before (like zero shot classification).
- As seems to be the case for almost all Data Science projects, data acquisition and preparation, debugging and setting up work environments took by far most of my time. Modelling, in comparison, was usually quite fast to set up and perform.
- I enjoyed applying a spectrum of tools from older ones (bag-of-words, LDA) to modern ones (spaCy, Hugging Face, Transformer models). Though the toolkit has gotten so much bigger, my impression is that all the approaches have their place and can still be used with good results.