# Bumper 

In [1]:
# Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from scipy import stats

# Keybert
from keybert import KeyBERT

In [2]:
doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

In [3]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]

In [4]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]

In [5]:
keywords = kw_model.extract_keywords(doc, highlight=True)

## Test with Kidscreen 

In [6]:
doc = """
        ABC Australia is looking to entertain and inspire the 4.4 million Australian children between two and 14 through a 
        mixed genre output of light entertainment, drama, comedy, preschool and factual, says head of children’s content Libbie Doherty. 
        ABC spans across free channels ABC Kids (two- to six-years-old) and ABC ME (six to 12s) and the VOD platform ABC iview, which 
        carries the broadcaster’s linear content, as well as exclusive kids content commissioned specifically for iview. Doherty primarily 
        develops content from Australian independent producers, and commissions to international producers are rare—but occasionally deals
        will occur when an Australian producer is involved. For co-productions, Australia needs to be represented in the production, if not
        also the story. The broadcaster’s catalogue should reflect the diverse and rich Australian identity, and help guide kids through 
        the big and small transitions of childhood, she says. It should also connect city and regional kids to each other and empower 
        children to speak up and participate within their communities. ABC is looking for content that fits six criteria: It should be 
        bold, brave and takes creative risks; it should always take an inclusive lens, giving children content that they can see 
        themselves in because it creates a sense of belonging in an expanding national identity; it should make the audience laugh
        and remember to have fun; it should focus on accuracy while pushing the boundaries of stories and topics and balancing trust 
        and risk; it should experiment with new formats and approaches to content development; and it should feature diversity in 
        front of and behind the camera, with a focus on under-represented groups from culturally and linguistically diverse groups, 
        Indigenous and disabled communities, as well as building on the broadcaster’s 50/50 female cast and crew targets. In short, 
        the pubcasters wants shows with kind, big-hearted characters and epic locations, which also helps kids explore, investigate
        and make sense of the world around them. This may seem like a very open content purview, but the pubcaster has focused on 
        building out its catalogue with inclusive content, including Epic Film’s live-action series First Day (four x 24-minutes), 
        about the transgender character Hannah as she copes with high school and transitioning into becoming a girl. One of the first
        children’s series to explicitly follow the life of a transgender youth, First Day provides a clearer picture of the type of 
        content the broadcaster is looking for, Doherty says. On top of this, ABC is working to push the boundaries beyond typical 
        protagonists, and picked up Paper Owl Films’ animated series Pablo, which revolves around a five-year-old boy on the autism
        spectrum who uses magical crayons to start adventures with the characters he creates. The channel has also filled its catalogue
        with a variety of content across styles, including animated educational-focused series, including international 
        preschool-skewing titles Daniel Tiger’s Neighbourhood, The Day Henry Met... and Bing, mixed-media series Becca’s Bunch,
        Dino Dana and live-action shows Molly and Mack and Detention Adventure. ABC children’s content is meant to build a life-long 
        connection to the bigger brand, and as a result, content for younger audiences needs to be crafted with an age-appropriate 
        pace and style, says Doherty. To reach a broader audience, Doherty is seeking content that features a range of production 
        techniques, including factual, drama, live-action, puppets, songs and animation, which balances learning with entertainment. 
        Productions for iview should experiment with storytelling and length, since they do not need to be constrained by traditional
        broadcast schedules, she adds.

      """
    
kw_model_bumper = KeyBERT()
keywords_bumper = kw_model_bumper.extract_keywords(doc)

In [7]:
bumper_keywords = kw_model_bumper.extract_keywords(doc, highlight=True)

In [8]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(2, 2), stop_words=None)

CPU times: user 2.55 s, sys: 315 ms, total: 2.87 s
Wall time: 1.88 s


[('abc children', 0.6575),
 ('abc kids', 0.651),
 ('doherty abc', 0.5979),
 ('abc australia', 0.595),
 ('australian children', 0.5533)]

When run on the initial "looking for" text, KeyBERT returns several keyphrases that are variations on the same word. In the extraction above, for example, 'abc' appears in four of the five keyphrases. For our purposes it is better to capture such a word only once and fill the remaining slots with the next most relevant keywords, ranked by cosine similarity to the document.

There are two ways *(that I know of)* to reduce the repetition of words such as 'abc':
1. Max Sum Similarity
2. Maximal Marginal Relevance

## Max Sum Similarity

To diversify the results, we first take the `nr_candidates` words/phrases most similar to the document (for example 2 x `top_n`; the calls below use `nr_candidates=20` with `top_n=5`). Then we consider every combination of `top_n` candidates from that pool and keep the combination whose members are the least similar to each other by cosine similarity.
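The selection step described above can be sketched in plain Python. This is a minimal illustration, not KeyBERT's own code: the function name `max_sum_select` and the hand-picked 2-D "embeddings" are assumptions made for the example, whereas real embeddings come from the sentence-transformer model.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_sum_select(doc_vec, candidates, top_n, nr_candidates):
    """Pick top_n of the nr_candidates best phrases with the lowest mutual similarity.

    candidates maps each word/phrase to its (toy) embedding vector.
    """
    # Step 1: keep the nr_candidates phrases most similar to the document.
    ranked = sorted(candidates,
                    key=lambda w: cosine(doc_vec, candidates[w]),
                    reverse=True)
    pool = ranked[:nr_candidates]

    # Step 2: among all top_n-sized combinations of the pool, choose the one
    # whose pairwise similarities add up to the smallest value.
    def pairwise_sum(combo):
        return sum(cosine(candidates[a], candidates[b])
                   for a, b in itertools.combinations(combo, 2))

    return list(min(itertools.combinations(pool, top_n), key=pairwise_sum))


# Hypothetical 2-D embeddings, chosen by hand for illustration only.
doc_vec = [1.0, 1.0]
candidates = {
    "abc": [1.0, 0.9],
    "abc kids": [1.0, 0.8],
    "preschool": [0.2, 1.0],
    "camera": [-1.0, 0.1],
}
selected = max_sum_select(doc_vec, candidates, top_n=2, nr_candidates=3)
print(selected)
```

In this toy setup 'camera' never enters the pool because it is not similar to the document, and the near-duplicates 'abc' and 'abc kids' are not chosen together. Note that the number of combinations grows quickly with `nr_candidates`, which is one reason to keep that value modest.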

In [20]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(1,1),stop_words='english',
                        use_maxsum=True, nr_candidates=20, top_n=5)

CPU times: user 1.54 s, sys: 305 ms, total: 1.85 s
Wall time: 906 ms


[('media', 0.2074),
 ('preschool', 0.1554),
 ('entertain', 0.1979),
 ('australia', 0.1947),
 ('doherty', 0.4351)]

In [21]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(2,2),stop_words='english',
                        use_maxsum=True, nr_candidates=20, top_n=5)

CPU times: user 2.16 s, sys: 308 ms, total: 2.47 s
Wall time: 1.44 s


[('libbie doherty', 0.227),
 ('productions australia', 0.5099),
 ('channels abc', 0.6575),
 ('comedy preschool', 0.5476),
 ('children content', 0.4949)]

## Maximal Marginal Relevance

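To diversify the results, we can also use Maximal Marginal Relevance (MMR), which is likewise based on cosine similarity, to select keywords/keyphrases. The `diversity` parameter controls how strongly candidates that resemble already-selected keywords are penalized. The results with **High Diversity** (`diversity=0.7`):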

In [19]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(1,1),stop_words='english',
                        use_mmr=True, diversity=0.7)

CPU times: user 767 ms, sys: 21.3 ms, total: 789 ms
Wall time: 735 ms


[('abc', 0.5578),
 ('youth', 0.3056),
 ('risks', 0.1195),
 ('range', 0.0917),
 ('carries', 0.0405)]

In [22]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(2,2),stop_words='english',
                        use_mmr=True, diversity=0.7)

CPU times: user 1.35 s, sys: 38.1 ms, total: 1.39 s
Wall time: 1.29 s


[('abc children', 0.6575),
 ('vod platform', 0.2108),
 ('australian producer', 0.3831),
 ('life long', 0.1256),
 ('trust risk', 0.1554)]

Let's see the results with **Low Diversity**:

In [18]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(1,1),stop_words='english',
                        use_mmr=True, diversity=0.2)

CPU times: user 777 ms, sys: 22.9 ms, total: 800 ms
Wall time: 744 ms


[('abc', 0.5578),
 ('doherty', 0.4512),
 ('children', 0.3986),
 ('australian', 0.4351),
 ('entertain', 0.3991)]

In [23]:
%%time
kw_model_bumper.extract_keywords(doc, keyphrase_ngram_range=(2,2),stop_words='english',
                        use_mmr=True, diversity=0.2)

CPU times: user 1.34 s, sys: 36.1 ms, total: 1.37 s
Wall time: 1.27 s


[('abc children', 0.6575),
 ('doherty abc', 0.5979),
 ('abc australia', 0.595),
 ('abc kids', 0.651),
 ('australian children', 0.5533)]
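
The effect of `diversity` in the MMR runs above can be illustrated with a small greedy sketch. It is written under assumptions: the function name `mmr_select` and the 2-D toy vectors are invented for the example, and the scoring rule used here, `(1 - diversity) * relevance - diversity * redundancy`, is the usual MMR formulation rather than a copy of KeyBERT's internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(doc_vec, candidates, top_n, diversity):
    """Greedy MMR: trade relevance to the document against redundancy."""
    relevance = {w: cosine(doc_vec, vec) for w, vec in candidates.items()}

    # Start from the single most relevant candidate.
    selected = [max(relevance, key=relevance.get)]
    remaining = [w for w in candidates if w not in selected]

    while remaining and len(selected) < top_n:
        def mmr_score(w):
            # Redundancy: similarity to the closest already-selected keyword.
            redundancy = max(cosine(candidates[w], candidates[s])
                             for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2-D embeddings, chosen by hand for illustration only.
doc_vec = [1.0, 1.0]
candidates = {
    "abc": [1.0, 0.9],
    "abc kids": [1.0, 0.8],
    "preschool": [0.2, 1.0],
}
low = mmr_select(doc_vec, candidates, top_n=2, diversity=0.2)
high = mmr_select(doc_vec, candidates, top_n=2, diversity=0.7)
print("low diversity: ", low)
print("high diversity:", high)
```

With low diversity the second pick is the near-duplicate 'abc kids'; with high diversity the redundancy penalty outweighs its relevance and 'preschool' is picked instead. The same trade-off explains why the high-diversity runs above include low-scoring words such as 'range' and 'carries'.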

# Set Up KeyBERT to extract keywords from "looking for" database

## Create csv file from txt

In [12]:
# Read the comma-separated text file into a dataframe
dataframe1 = pd.read_csv("data/v2.txt")

# Save the dataframe as a csv file, without the row index
dataframe1.to_csv('Kidscreen_looking_for.csv', index=False)

In [13]:
dataframe1.head()

| | name | looking_for | team | demographic | how_to_pitch | contact | commission | recent_acquisitions | Africa | Asia Pacific | ... | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABC Australia | ABC Australia is looking to entertain and insp... | Libbie DohertyHead of ABC's Children's Conten... | ABC Kids (2- to 6-years-old)\rABC ME (6 to 12s... | Producers with fully completed projects should... | Children’s development and co-production manag... | Not Available | Kiri and Lou | 0 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | Discovery Kids | Flavio Medeiros, director of programming and a... | Not Available | 4 to 8 | Pitches with a show description can be emailed... | Flavio_Medeiros@discovery.com | Not Available | Boonie BearsEsme & RoySuper Dinosaur | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Disney Channel | Disney is on the lookout for content that fits... | Elizabeth Waybright TaylorDirector of Develop... | 6 to 11 years old, with a skew towards girls | Disney TVA does not accept unsolicited materia... | Not Available | Not Available | Miraculous: Tales of Ladybug and Cat Noir | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Corus | Corus’ kids networks each have their own ident... | Jennifer AbramsVP of Programming and Multiplat... | YTV: Kids 6 to 12, and co-viewing;\rTeletoon: ... | For all pitches including original content, ac... | scriptedoriginals@corusent.com | Not Available | Not Available | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | De Agostini Editore | Like many other broadcasters, Massimo Bruno, t... | Massimo BrunoHead of TV channels | DeA Jr: preschool with a focus on family co-vi... | Producers looking to pitch any of De Agostini’... | Property development department: property.digi... | MagikiNew School | Boy Girl Dog Cat Mouse CheeseOggy and the Cock... | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [14]:
# Create dataframe with certain features
keyword_df = dataframe1[['name','looking_for','team','demographic','recent_acquisitions']]
keyword_df.head()

| | name | looking_for | team | demographic | recent_acquisitions |
|---|---|---|---|---|---|
| 0 | ABC Australia | ABC Australia is looking to entertain and insp... | Libbie DohertyHead of ABC's Children's Conten... | ABC Kids (2- to 6-years-old)\rABC ME (6 to 12s... | Kiri and Lou |
| 1 | Discovery Kids | Flavio Medeiros, director of programming and a... | Not Available | 4 to 8 | Boonie BearsEsme & RoySuper Dinosaur |
| 2 | Disney Channel | Disney is on the lookout for content that fits... | Elizabeth Waybright TaylorDirector of Develop... | 6 to 11 years old, with a skew towards girls | Miraculous: Tales of Ladybug and Cat Noir |
| 3 | Corus | Corus’ kids networks each have their own ident... | Jennifer AbramsVP of Programming and Multiplat... | YTV: Kids 6 to 12, and co-viewing;\rTeletoon: ... | Not Available |
| 4 | De Agostini Editore | Like many other broadcasters, Massimo Bruno, t... | Massimo BrunoHead of TV channels | DeA Jr: preschool with a focus on family co-vi... | Boy Girl Dog Cat Mouse CheeseOggy and the Cock... |


## Feed csv file into KeyBERT

TO DO:
1. Set up extraction for 
    - `looking_for`
    - `team`
    - `demographic`
    - `how_to_pitch`
2. Create new dataframe from keywords used
3. ?
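
Steps 1 and 2 of the list above could be sketched roughly as follows. This is only an outline under assumptions: the helper name `extract_column_keywords`, the output field names, and the sample rows are hypothetical, and a trivial stand-in extractor replaces the KeyBERT model so the snippet runs without downloading anything.

```python
def extract_column_keywords(rows, columns, extract_fn, missing="Not Available"):
    """Run extract_fn on each requested text column of each row.

    rows       -- list of dicts, e.g. keyword_df.to_dict("records")
    columns    -- names of the text columns to process
    extract_fn -- callable taking a string and returning [(keyword, score), ...],
                  for example a KeyBERT model's extract_keywords method
    """
    records = []
    for row in rows:
        for col in columns:
            text = row.get(col)
            # Skip empty cells and the "Not Available" placeholder used in the data.
            if not isinstance(text, str) or not text.strip() or text.strip() == missing:
                continue
            for keyword, score in extract_fn(text):
                records.append({"name": row["name"], "source_column": col,
                                "keyword": keyword, "score": score})
    return records


def toy_extractor(text):
    """Stand-in for kw_model.extract_keywords: two longest words, dummy score."""
    words = sorted(set(text.lower().split()), key=lambda w: (-len(w), w))
    return [(w, 1.0) for w in words[:2]]


# Hypothetical rows, invented for illustration only.
rows = [
    {"name": "Broadcaster A", "looking_for": "inclusive preschool content",
     "team": "Not Available"},
    {"name": "Broadcaster B", "looking_for": "animated comedy",
     "team": "programming director"},
]
keyword_rows = extract_column_keywords(rows, ["looking_for", "team"], toy_extractor)
for record in keyword_rows:
    print(record)
```

With the real model, `extract_fn` could be something like `lambda t: kw_model.extract_keywords(t, use_mmr=True, diversity=0.7)`, and `pd.DataFrame(keyword_rows)` would turn the result into the new dataframe mentioned in step 2. Keeping one row per keyword (a "long" layout) is a design choice that makes later filtering by column or score straightforward.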

In [15]:
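from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

kw_model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = kw_model.extract_keywords(doc)
# Extend the built-in English stop words with domain-specific terms.
# (Iterating over the string 'english' would only yield single letters.)
custom_stop_words = list(ENGLISH_STOP_WORDS) + ['nickelodeon', '2021', '2019', 'wildbrain']
# Note: diversity only takes effect when use_mmr=True is also passed.
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2),
                          stop_words=custom_stop_words,
                          top_n=10, diversity=0.7)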

[('australian children', 0.43),
 ('rich australian', 0.4257),
 ('abc kids', 0.3829),
 ('australian producer', 0.3804),
 ('comedy preschool', 0.3724),
 ('million australian', 0.3569),
 ('abc children', 0.3485),
 ('productions australia', 0.3435),
 ('australia needs', 0.3423),
 ('abc australia', 0.3385)]

In [16]:
# extract_keywords expects a string or a list of strings, not a DataFrame
keywords = kw_model.extract_keywords(keyword_df['looking_for'].tolist())

In [17]:
%%time
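# Pass the text column as a list of strings rather than the whole DataFrame
kw_model_bumper.extract_keywords(keyword_df['looking_for'].tolist(), keyphrase_ngram_range=(1,1),
                                 stop_words='english', use_mmr=True, diversity=0.7)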

IndentationError: unexpected indent (<unknown>, line 2)