## Comparing Series Release Strategies

### Introduction

In this project I try to answer the question: which release strategy produces more positive discussions, releasing all episodes at once or releasing one episode every week? I approach this question by looking at the most distinctive adjectives used in discussions of series on Reddit.

Dataset used in this mini project:
**discussions_corrected.csv**: 50,000 Reddit comments on episode threads of popular television series between 2011 and 2020.

The discussions dataset contains posts about series that release a whole season at once and posts about series that release one episode every week. Splitting the data by release strategy therefore makes the two groups of posts comparable. In the posts, viewers give their opinions on a series or on individual episodes, and they usually do so with adjectives such as 'good' or 'bad' and 'great' or 'terrible'. Such words indicate whether a post is positive or negative towards the series.

The adjectives that most set the two strategies apart can therefore hint at how each strategy is received. If the highly distinctive adjectives of one strategy contain significantly more positive words than those of the other, these positive words are used much more often to describe series that follow that strategy. This would indicate that the strategy yields more positive discussions.

This project started as a school assignment for my Master's degree at Utrecht University. Some of the code in this notebook is therefore adapted from the following lab manual: https://jveerbeek.gitlab.io/dm-manual/.

### Import Modules

In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline
import spacy 
from collections import Counter
from scipy.stats import chi2_contingency

### Importing and Lemmatizing

First, I import the discussions dataset containing the posts about the different shows. Then, I process the posts with spaCy and split them into two subsets, one per release strategy, so that the strategies can be compared. Finally, I lemmatize both subsets with spaCy, keeping only the adjectives. I lemmatize rather than stem because the spaCy pipeline also provides the POS tags that I use to filter the adjectives.

In [3]:
df = pd.read_csv('discussions_corrected.csv')
nlp = spacy.load("en_core_web_sm")
texts = df.post
texts = [text.lower() for text in texts]
processed_texts = [text for text in nlp.pipe(texts, 
                                             disable=["ner",
                                                      "parser"])]
df['processed_texts'] = processed_texts

processed_lin = df[df.type == 'linear'].processed_texts
processed_net = df[df.type == 'netflix'].processed_texts

lemmatized_lin = [[token.lemma_ for token in text if not token.is_punct and token.pos_ == 'ADJ'] for text in processed_lin]
lemmatized_net = [[token.lemma_ for token in text if not token.is_punct and token.pos_ == 'ADJ'] for text in processed_net]

### Calculating Most Distinctive Words

Next, I flatten the lemmatized subsets of the discussions dataset and calculate the most distinctive adjectives. The function returns a log-likelihood ratio for each word, which indicates how distinctive that word is for one corpus compared to the other.

In [4]:
def distinctive_words(target_corpus, reference_corpus):
    """Return a DataFrame with a signed log-likelihood ratio and p-value per word.

    Both corpora must be flat lists of tokens. The ratio is positive for words
    that are relatively more frequent in target_corpus and negative otherwise.
    """
    counts_c1 = Counter(target_corpus)
    counts_c2 = Counter(reference_corpus)
    vocabulary = set(counts_c1) | set(counts_c2)
    freq_c1_total = sum(counts_c1.values())
    freq_c2_total = sum(counts_c2.values())
    results = []
    for word in vocabulary:
        freq_c1 = counts_c1[word]
        freq_c2 = counts_c2[word]
        freq_c1_other = freq_c1_total - freq_c1
        freq_c2_other = freq_c2_total - freq_c2
        # 2x2 contingency table: this word vs. all other words, per corpus
        llr, p_value, _, _ = chi2_contingency([[freq_c1, freq_c2],
                                               [freq_c1_other, freq_c2_other]],
                                              lambda_='log-likelihood')
        # Flip the sign when the word is relatively more frequent in the reference corpus
        if freq_c2 / freq_c2_other > freq_c1 / freq_c1_other:
            llr = -llr
        result = {'word': word,
                  'llr': llr,
                  'p_value': p_value}
        results.append(result)
    results_df = pd.DataFrame(results)
    return results_df

flatten = lambda t: [item for sublist in t for item in sublist]
results_df = distinctive_words(flatten(lemmatized_lin), flatten(lemmatized_net))
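
To make the statistic less of a black box, the log-likelihood ratio (G statistic) for a single word can also be computed by hand. The sketch below is not the code used in this notebook: `g_statistic` is a hypothetical helper and the counts in `toy_table` are made up for illustration. Note also that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default (`correction=True`), so its value should come out slightly lower than this unadjusted figure.

```python
import math

def g_statistic(table):
    """Unadjusted G (log-likelihood ratio) statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if word use were independent of corpus
            expected = row_totals[i] * col_totals[j] / grand_total
            if observed > 0:  # an empty cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2 * g

# Made-up counts: a word occurs 30 times among 1,000 adjectives in corpus 1
# and 10 times among 1,000 adjectives in corpus 2.
toy_table = [[30, 10],
             [970, 990]]
g = g_statistic(toy_table)

# Same sign convention as distinctive_words: negative when the word is
# relatively more frequent in the second (reference) corpus.
signed = g if 30 / 1000 >= 10 / 1000 else -g
print(round(g, 2), signed > 0)
```

A word that is equally frequent in both corpora, for example `[[10, 10], [90, 90]]`, gives a statistic of zero, which is why values far from zero mark distinctive words.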

### Interpreting the Result

Subsequently, I sort the results by log-likelihood ratio (LLR), in descending order for the first corpus and in ascending order for the second, to show the most distinctive words for each strategy. To treat both strategies equally, I apply the same cutoff on either side: an LLR >= 15 for words that are distinctive for the first corpus (linear) and an LLR <= -15 for words that are distinctive for the second corpus (Netflix).
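
For orientation, the cutoff of 15 can be translated into a p-value. The sketch below assumes the statistic follows a chi-squared distribution with one degree of freedom, which is the reference distribution for a 2x2 table. `chi2_sf_1dof` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import math

def chi2_sf_1dof(x):
    """Survival function P(X > x) of a chi-squared variable with 1 degree of freedom."""
    # With 1 degree of freedom, X is the square of a standard normal variable,
    # so the tail probability reduces to the complementary error function.
    return math.erfc(math.sqrt(x / 2))

threshold = 15  # the LLR cutoff used in this notebook
p_at_threshold = chi2_sf_1dof(threshold)
print(f"p-value at LLR = {threshold}: {p_at_threshold:.2e}")
```

Under this assumption the cutoff corresponds to a p-value of roughly 1 in 10,000, so every word that passes the filter is distinctive well beyond the conventional 0.05 level.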

In [5]:
results_df[results_df.llr >= 15].sort_values('llr', ascending=False)

| | word | llr | p_value |
|---|---|---|---|
| 1972 | dead | 95.675851 | 1.353211e-22 |
| 2681 | twin | 84.18229 | 4.511944e-20 |
| 1943 | next | 72.757673 | 1.465872e-17 |
| 5118 | stark | 65.121098 | 7.043391e-16 |
| 2730 | faceless | 54.357545 | 1.671351e-13 |
| 4918 | last | 44.153454 | 3.036156e-11 |
| 3015 | blue | 43.063951 | 5.297948e-11 |
| 3815 | valyrian | 41.645536 | 1.094149e-10 |
| 2553 | red | 40.339194 | 2.134829e-10 |
| 1425 | gus | 39.740224 | 2.900891e-10 |


In [6]:
results_df[results_df.llr <= -15].sort_values('llr', ascending=True)

| | word | llr | p_value |
|---|---|---|---|
| 3407 | black | -203.430374 | 3.726307e-46 |
| 2717 | different | -105.128149 | 1.144869e-24 |
| 3648 | real | -73.962969 | 7.95963e-18 |
| 1402 | young | -69.461163 | 7.793407e-17 |
| 536 | nosedive | -57.311212 | 3.720284e-14 |
| 1247 | digital | -50.699373 | 1.07654e-12 |
| 647 | virtual | -47.584936 | 5.267168e-12 |
| 4471 | serial | -46.116226 | 1.114417e-11 |
| 2583 | russian | -41.021682 | 1.505497e-10 |
| 3033 | netflix | -34.86811 | 3.528125e-09 |


The most distinctive adjectives of the first corpus, which contains the shows that release their episodes linearly, include the positive adjectives 'awesome', 'epic', 'good' and 'badass' (these rank below the ten rows shown above). The other corpus contains no positive adjectives of this kind. Thus, four positive adjectives are distinctive for posts about series with the linear release strategy, as opposed to zero for the other strategy. Moreover, even very general words like 'good' and 'awesome' set the linear strategy apart from the other in the data. These words are used much more often in posts about shows with the linear strategy, which indicates that, in this data, the linear strategy is the better option for generating positive posts.
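
The count of positive adjectives above was made by reading the lists. A minimal sketch of how the same count could be made reproducible is given below. `POSITIVE_ADJECTIVES` is a hypothetical hand-made lexicon and `positive_words` a hypothetical helper, both assumptions for illustration. The word lists are taken from the outputs and the text of this notebook and are not the full filtered results.

```python
# Hypothetical mini-lexicon of positive adjectives (an assumption, not a real resource)
POSITIVE_ADJECTIVES = {"awesome", "epic", "good", "badass", "great", "amazing"}

def positive_words(words, lexicon=POSITIVE_ADJECTIVES):
    """Return the words that appear in the positive lexicon, preserving order."""
    return [word for word in words if word in lexicon]

# Adjectives named in this notebook for each strategy (partial lists)
linear_words = ["dead", "twin", "next", "stark", "faceless", "last",
                "awesome", "epic", "good", "badass"]
netflix_words = ["black", "different", "real", "young", "nosedive",
                 "digital", "virtual", "serial", "russian", "netflix"]

print(positive_words(linear_words))
print(positive_words(netflix_words))
```

On the real data, the same helper could be applied to `results_df[results_df.llr >= 15].word` and `results_df[results_df.llr <= -15].word`. The outcome depends entirely on which words the lexicon contains, so the choice of lexicon should be reported alongside the counts.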