# Analysis of Plots of the movies and Remarks made by DoD

Using [Scatter Text](https://github.com/JasonKessler/scattertext).

To comprehend the influence of politics through DoD on the production of movies and TV shows we study the terms used in the remarks and plots of the films and TV shows. More specifically, we look at the differences in terms present in the corpus for the movies that have been provided assistance and those that were rejected. Additionally, we will separate movies using the keywords that represent the major events like the world war, Vietnam, Iraq, Korean wars, and 9/11 (September 11) that have political prominence for the U.S. military. On this corpus subset, we shall look at the similar differences in the vocabulary.

In [1]:
# import libraries
import spacy
import scattertext as st
import pandas as pd
import pytextrank
import numpy as np

In [2]:
# Path to store the plots generated by scatter text
ST_PLOTS_PATH = "./../scatter_text_plots/"

Read the final aggregated movie data with additional information, we will only use the columns of Title, status, Media Type, Remarks and plot.

In [3]:
col_list = ['Title', 'Status', 'Media Type', 'Remarks', 'plot']
movie_data = pd.read_csv("../data/with_additional_data/military_hollywood_with_additional_data.csv", usecols = col_list)
movie_data.head()

Unnamed: 0,Title,Status,Media Type,Remarks,plot
0,"""1968""",OTH,FILM,THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM...,
1,"1,000 MEN AND A BABY",APP,TV,VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN...,A baby in a foreign land is adopted by the men...
2,1ST FORCE,OTH,FILM,INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR...,
3,24,APP,TV,APPROVED FILMING FOR ONE DAY WITH TWO MARINE C...,Jack and Tony clash as they wait for the time ...
4,3RD DEGREE,APP,TV,PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX...,Scott Weston is a private investigator who is ...


Select the movies and films that have been approved or denied for assistance by the DoD.

In [4]:
# Take a subset of approved and denied movies
array = ['APP', 'DEN']
dod_subset = movie_data.loc[movie_data['Status'].isin(array)]
# lower casing the textual fields
dod_subset["Remarks"] = dod_subset["Remarks"].str.lower()
dod_subset["plot"] = dod_subset["plot"].str.lower()
dod_subset

Unnamed: 0,Title,Status,Media Type,Remarks,plot
1,"1,000 MEN AND A BABY",APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...
3,24,APP,TV,approved filming for one day with two marine c...,jack and tony clash as they wait for the time ...
4,3RD DEGREE,APP,TV,personnel appeared on this game show at the ex...,scott weston is a private investigator who is ...
5,50/50,DEN,FILM,never was officially submitted to dod. it was ...,adam is a 27 year old writer of radio programs...
9,A MIDNIGHT CLEAR,DEN,FILM,declined assistance (request for ww ii facilit...,"set in 1944 france, in the ardennes forest reg..."
...,...,...,...,...,...
850,WITHOUT GLORY,APP,FILM,approved by the department since it was based ...,
853,X-15,APP,FILM,airforce and nasa provided full cooperation on...,at the height of the cold war during the 1960s...
854,"YEAR IN THE LIFE, A",DEN,TV,the project was denied assistance.,"joe gardner, a child of the depression, is a s..."
855,"YOUNG LIONS, THE",APP,FILM,pentagon and state department went through lon...,the destiny of three soldiers during world war...


Loading the NLP library to perform the processing pipeline

In [5]:
nlp = spacy.load('en_core_web_trf')

## Part 1: Term Analysis of plots and remarks

In the first part, we shall perform basic analysis of the text in the remarks and plots column separately of only approved and denied films, leaving out the films who status has ambiguty.

We define a function term_freq, that will process the data frame on mentioned columns and returns the word frequency and a corpus in Scatter Text compatible format to help in plotting and other operations.

In [6]:
def term_freq(data_frame, cat_col, text_col, cat_1, cat_2, nlp,
              remove_stop_words=False):
    """
    The function process the data frame on mentioned columns and
    returns the word frequency and a corpus in Scatter Text compatible
    format to help in plotting and other operations.

    :param data_frame (pd.Dataframe): The data frame on which the corpus needs to be created
    :param cat_col (str): The column in the dataframe on which the dataframe is divided.
    :param text_col (str): The column in the dataframe on which the analysis is performed.
    :param cat_1 (str): The first category in the cat_col (out of two categories).
    :param cat_2 (str): The second category in the cat_col (out of two categories).
    :param nlp (spacy.nlp): The nlp pipeline to apply on the text field.
    :param remove_stop_words (bool): Indicates to remove stop words of not (Default is False)

    :return df_cat1 (pd.Dataframe): The dataframe with terms majorly in
    category 1, with their frequency in each category along with f-scores
    in each category.
    :return df_cat2 (pd.Dataframe): The dataframe with terms majorly in
    category 2, with their frequency in each category along with f-scores
    in each category.
    :return corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    """

    if remove_stop_words:
        # We remove stop words by using the method "remove_terms" and
        # provide the argument for default stop words when building the corpus

        corpus = (st.CorpusFromPandas(data_frame,
                                      category_col=cat_col,
                                      text_col=text_col,
                                      nlp=nlp)
                  .build()
                  .remove_terms(nlp.Defaults.stop_words, ignore_absences=True)
                  )
    else:
        corpus = (st.CorpusFromPandas(data_frame,
                                      category_col=cat_col,
                                      text_col=text_col,
                                      nlp=nlp)
                  .build()
                  )

    # obtain the dataframe with term frequencies
    df = corpus.get_term_freq_df()
    # seperate the dataframe for each category and
    # round the f-score to two decimal points
    df[cat_1] = corpus.get_scaled_f_scores(cat_1)
    df[cat_2] = corpus.get_scaled_f_scores(cat_2)
    df[cat_1] = round(df[cat_1], 2)
    df[cat_2] = round(df[cat_2], 2)

    # sort the terms by frequency in decreasing order
    df_cat1 = df.sort_values(by=cat_1 + ' freq',
                             ascending=False).reset_index()
    df_cat2 = df.sort_values(by=cat_2 + ' freq',
                             ascending=False).reset_index()

    return df_cat1, df_cat2, corpus

We define a function plot_html, that will plot the processed scatter text corpus on mentioned columns and saves the file to a HTML page.

In [7]:
def plot_html(corpus, cat, name_of_category_1, name_of_category_2,
              html_file_name):
    """
    The function that will plot the processed scatter text corpus on mentioned
    columns and saves the file to a HTML page..

    :param corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    :param cat (str): Name of the category as in the caterogy column to be plotted on y-axis.
    :param name_of_category_1 (str): Name of the category to appear on the y-axis.
    :param name_of_category_2 (str): Name of the category to appear on the x-axis.
    """

    html = st.produce_scattertext_explorer(
        corpus,
        category=cat,
        category_name=name_of_category_1,
        not_category_name=name_of_category_2,
        metadata=corpus.get_df()['Title'],
        use_full_doc=True,)
    open(html_file_name, 'wb').write(html.encode('utf-8'))

### Analysis on the Remarks of DoD without removing Stopwords

In [8]:
remarks_app_w_stopw, remarks_den_w_stopw, st_remarks_corpus_w_stopw = term_freq(dod_subset,
                                "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Remarks of DoD without removing Stopwords

TBD

### Analysis on the Remarks of DoD after removing the Stopwords

In [9]:
remarks_app_wo_stopw, remarks_den_wo_stopw, st_remarks_corpus_wo_stopw = term_freq(dod_subset,
                                    "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Remarks of DoD after removing the Stopwords

TBD

### Analysis on the Movie Plots without removing the Stopwords

In [10]:
plots_app_w_stopw, plots_den_w_stopw, st_plots_corpus_w_stopw = term_freq(dod_subset,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Movie Plots without removing the Stopwords

TBD

### Analysis on the Movie Plots after removing the Stopwords

In [11]:
plots_app_wo_stopw, plots_den_wo_stopw, st_plots_corpus_wo_stopw = term_freq(dod_subset,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Movie Plots after removing the Stopwords

TBD

## Part 2: Phrases Analysis of plots and remarks using PyRankText

PyTextRank is an implementation of a modified version of the TextRank algorithm. It extracts a scored list of the most prominent phrases in a document.

In the second part, we shall perform similar analysis to Part 1 expecct that we shall use phrases instead of terms.

Now we add the textrank to the nlp pipeline to rank and select the phrases

We define a function hrasep_freq, that will process the data frame on mentioned columns and returns a corpus in Scatter Text compatible format to help in plotting and other operations.

In [12]:
nlp.add_pipe("textrank", last=True)

<pytextrank.base.BaseTextRank at 0x1e0199d7e20>

In [13]:
def phrase_freq(data_frame, cat_col, text_col, nlp):
    """
    The function process the data frame on mentioned columns and a corpus in
    Scatter Text compatible format to help in plotting and other operations.

    :param data_frame (pd.Dataframe): The data frame on which the corpus needs to be created
    :param cat_col (str): The column in the dataframe on which the dataframe is divided.
    :param text_col (str): The column in the dataframe on which the analysis is performed.
    :param nlp (spacy.nlp): The nlp pipeline to apply on the text field.

    :return corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    """

    data_frame_copy = data_frame.copy()
    # compatibility setting
    data_frame_copy[text_col] = data_frame_copy[text_col].astype(str)
    # Applying the nlp pipeline on the text
    data_frame_copy[text_col] = data_frame_copy[text_col].apply(nlp)
    # building the corpus
    corpus = st.CorpusFromParsedDocuments(
            data_frame_copy,
            category_col=cat_col,
            parsed_col=text_col,
            feats_from_spacy_doc=st.PyTextRankPhrases()
        ).build()

    return corpus

In [14]:
def plot_html_pyranktext(corpus, cat, name_of_category_1, name_of_category_2,
                         html_file_name):
    """
    The function that will plot the processed scatter text corpus on mentioned
    columns and saves the file to a HTML page..

    :param corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    :param cat (str): Name of the category as in the caterogy column to be plotted on y-axis.
    :param name_of_category_1 (str): Name of the category to appear on the y-axis.
    :param name_of_category_2 (str): Name of the category to appear on the x-axis.
    """

    phrase_category_scores = corpus.get_metadata_freq_df('')
    # As the aggregate TextRank scores aren’t easily interpretable, we’ll
    # display the per-category rank of each phrase when clicked in the metadata.
    term_ranks = np.argsort(np.argsort(-phrase_category_scores, axis=0), axis=0) + 1
    metadata_descriptions = {
        term: '<br/>' + '<br/>'.join(
            '<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
            for cat in corpus.get_categories())
        for term in corpus.get_metadata()
    }

    # We will define term score using the maximum category-specific score,
    # this will give us the most prominent phrases in each category,
    # regardless of the prominence in the other category.
    category_specific_prominence = phrase_category_scores.apply(
        lambda r: r.APP if r.APP > r.DEN else -r.DEN, axis=1)

    html = st.produce_scattertext_explorer(
        corpus,
        category=cat,
        category_name=name_of_category_1,
        not_category_name=name_of_category_2,
        transform=st.dense_rank,
        metadata=corpus.get_df()['Title'],
        scores=category_specific_prominence,
        sort_by_dist=False,
        use_non_text_features=True,
        topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
        topic_model_preview_size=0,
        metadata_descriptions=metadata_descriptions,
        use_full_doc=True
        )

    open(html_file_name, 'wb').write(html.encode('utf-8'))

### Analysis on the Movie Remarks using PyTextRank

In [15]:
remakrs_phrase_corpus = phrase_freq(dod_subset, "Status", "Remarks", nlp)

pyranktext_remarks_filename = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_PyRankText_remarks.html"

plot_html_pyranktext(remakrs_phrase_corpus, "APP", "Approved", "Denined", html_file_name=pyranktext_remarks_filename)

#### Interpretation of the analysis on the Movie Remarks using PyTextRank

TBD

### Analysis on the Movie Plots using PyTextRank

In [16]:
plots_phrase_corpus = phrase_freq(dod_subset, "Status", "plot", nlp)

pyranktext_plots_filename = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_PyRankText_plot.html"

plot_html_pyranktext(plots_phrase_corpus, "APP", "Approved", "Denined", html_file_name=pyranktext_plots_filename)

  vec_ss = (vec_ss - vec_ss.min()) * 1. / (vec_ss.max() - vec_ss.min())


#### Interpretation of the analysis on the Movie plots using PyTextRank

TBD

## Part 3: Term Analysis of plots and remarks of movies belonging to political significance.

Here we will seperate the movies that have political significance to the US Army. We will seperate these movies if the following keywords are present in the plot or remarks of the movies. The reason for choosing remarks is that the plot information from IMDB is limited sometimes as it is created by users. However, the remarks mention about the background of these movies.

|    Event    |                                                           Key words                                                           |
|:-----------:|:-----------------------------------------------------------------------------------------------------------------------------:|
|  World War  | "world war", "nazi", "pearl harbour", "hawaii", "hiroshima", "nagasaki", "atomic bomb", "hitler", "d-day", "germany", "japan" |
|  WTO Attcks |                                 "9/11", "september 11", "world trade center", "september 2001"                                |
| Vietnam War |                                      "vietnam", "saigon", "hanoi", "laos", "vietnam war"                                      |
|   Gulf War  |                                   "gulf", "iraq", "baghdad", "saddam", "kuwait", "gulf war"                                   |
|  Afghan War |                             "afghan", "taliban", "afghanistan", "al qaeda", "osama", "afghan war"                             |
|  Korean War |                                     "inchon", "north korea", "korean war", "forgotten war"                                    |

As in the previous parts, we shall only consider approved and denied films for the analysis

In [17]:
pol_sig_keywords = ["world war", "nazi", "pearl harbour", "hawaii",
                    "hiroshima", "nagasaki", "atomic bomb", "hitler",
                    "d-day", "germany", "japan", "9/11", "september 11",
                    "world trade center", "september 2001" "vietnam",
                    "saigon", "hanoi", "laos", "vietnam war", "gulf",
                    "iraq", "baghdad", "saddam", "kuwait", "gulf war",
                    "afghan", "taliban", "afghanistan", "al qaeda",
                    "osama", "afghan war", "inchon", "north korea",
                    "korean war", "forgotten war"]

In [18]:
dod_subset["pol_sig"] = dod_subset["plot"].str.contains("|".join(pol_sig_keywords)) | dod_subset["Remarks"].str.contains("|".join(pol_sig_keywords))

In [19]:
dod_pol_sig = dod_subset.loc[dod_subset["pol_sig"] == True]
dod_pol_sig

Unnamed: 0,Title,Status,Media Type,Remarks,plot,pol_sig
1,"1,000 MEN AND A BABY",APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...,True
9,A MIDNIGHT CLEAR,DEN,FILM,declined assistance (request for ww ii facilit...,"set in 1944 france, in the ardennes forest reg...",True
11,ABOVE AND BEYOND,APP,FILM,story of paul tibbetts and the atomic bomb mis...,"the story of colonel paul tibbets, the pilot o...",True
13,ACE OF ACES,APP,FILM,army air corps provided planes to stage aerial...,a sculptor who doesn't want to have any part o...,True
14,ACTION IN THE NORTH ATLANTIC,APP,FILM,navy and merchant marines provided use of ship...,lieutenant joe rossi is 1st officer on a liber...,True
...,...,...,...,...,...,...
840,WHISKEY TANGO FOXTROT,APP,FILM,usaf support to this project took place at kir...,"2003. after careful consideration, kim baker, ...",True
846,WINDTALKERS,APP,FILM,dod approved filming at various sites at marin...,during world war ii when the americans needed ...,True
847,WING AND A PRAYER,APP,FILM,fictionalized story of battle of midway. navy ...,an aircraft carrier is sent on a decoy mission...,True
849,WINGS OF EAGLES,APP,FILM,"story of early navy aviator, spig wead. after ...",u.s. navy pilot frank 'spig' wead is a fun-lov...,True


We have 209 films and TV shows that have political significance. Now we repat the same analysis of terms and phrases in these movies.
Redefining the NLP pipeline as to remove the textrank pipeline.

In [20]:
nlp = spacy.load('en_core_web_trf')

### Analysis on the Remarks of DoD without removing Stopwords

In [21]:
remarks_app_w_stopw_pol_sig, remarks_den_w_stopw_pol_sig, st_remarks_corpus_w_stopw_pol_sig = term_freq(dod_pol_sig,
                                "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw_pol_sig, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Remarks of DoD without removing Stopwords

TBD

### Analysis on the Remarks of DoD after removing the Stopwords

In [22]:
remarks_app_wo_stopw_pol_sig, remarks_den_wo_stopw_pol_sig, st_remarks_corpus_wo_stopw_pol_sig = term_freq(dod_pol_sig,
                                    "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw_pol_sig, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Remarks of DoD after removing the Stopwords

TBD

### Analysis on the Movie Plots without removing the Stopwords

In [23]:
plots_app_w_stopw_pol_sig, plots_den_w_stopw_pol_sig, st_plots_corpus_w_stopw_pol_sig = term_freq(dod_pol_sig,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw_pol_sig, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Movie Plots without removing the Stopwords

TBD

### Analysis on the Movie Plots after removing the Stopwords

In [24]:
plots_app_wo_stopw_pol_sig, plots_den_wo_stopw_pol_sig, st_plots_corpus_wo_stopw_pol_sig = term_freq(dod_pol_sig,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw_pol_sig, "APP", "Approved", "Denined", html_file_name=html_file_name)

#### Interpretation of the analysis on the Movie Plots after removing the Stopwords

TBD

## Part 4: Phrases Analysis of plots and remarks using PyRankText for politically significnat movies

As there aren't any phrases with repetations, this analysis is not possible.