# Analysis of Plots of the movies and Remarks made by DoD

Using [Scatter Text](https://github.com/JasonKessler/scattertext).

To comprehend the influence of politics through DoD on the production of movies and TV shows we study the terms used in the remarks and plots of the films and TV shows. More specifically, we look at the differences in terms present in the corpus for the movies that have been provided assistance and those that were rejected. Additionally, we will separate movies using the keywords that represent the major events like the world war, Vietnam, Iraq, Korean wars, and 9/11 (September 11) that have political prominence for the U.S. military. On this corpus subset, we shall look at the similar differences in the vocabulary.

## About Scattertext plot.

It is an interactive HTML plot that helps in finding distinguishing terms in corpora. The x- and y- axes are the dense ranks of the term usage by categories mentioned on the axes. Each dot corresponds to a word or phrase mentioned in the remarks or plots of the selected movies. The closer a dot is to the top of the plot, the more frequently it appeared in the remarks to the category mentioned on the y axis and the further right a dot, the more that word or phrase appeared in the remarks to the category mentioned on the x-axis. Words frequently appeared in both categories.

In [1]:
# import libraries
import spacy
import scattertext as st
import pandas as pd
import pytextrank
import numpy as np
import re

In [2]:
# Path to store the plots generated by scatter text
ST_PLOTS_PATH = "./../plots/scatter_text_plots/"

In [3]:
replacements = {'{': '', '}': '','[': '',']': '','(': '',')': '',  '-': ' ', "-":' ', '~': '',
                '#': '','%': ' percent','--': ' ','”': '', '“': '',
                '"': '','9-17-92': '17SEP1992','&': 'and','+': ' plus','|': '','‘': "'",'’': "'",'’': "'"} #'—': '',,'F/A': 'FA','/': 'or'

def clean_text(data_df, column_name):
    """
    It takes a dataframe and a column name and cleans the given attributes in place.
    data: a dataframe
    column_name: column name written as string in the given data frame 
    """
    data = data_df.copy()
    # url removes
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'(https|http)?:\/(\w|\.|\/|\?|\=|\&|\%)*\b','', str(x), re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'www\.\S+\.com','', x, re.UNICODE))
    
    # keep a general U.S. name in all the text   
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'((THE|the|The) [uU]{1}[sS]{1}\. )|((THE|the|The) [uU]{1}\.)([sS]{1} )', 'THE U.S. ', x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'US CAPITOL', 'U.S. CAPITOL', x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'(IN|TO) US\.', r'\1'+' U.S.', x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'US\. (MILITARY|ARMY|SUBMARINES|STATE)', 'U.S. '+r'\1', x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'OF US\.', 'OF U.S.', x, re.UNICODE))
    
    # convert the dates like 'August 21 — 31,2012' like : 'August 21 - 31,2012' 
    #data[column_name] = data[column_name].apply(lambda x: re.sub(r'(\d+) — (\d+)', r"\1 - \2", x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'\—', '-', x, re.UNICODE))

    # Change all the specific names of military supports' materials to general case
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'\s([a-zA-Z]*)\-\d+', r" \1/", x, re.UNICODE))
    
    # Preprocessing the data
#     data[column_name] = data[column_name].apply(lambda x: re.sub('({})'.format('|'.join(map(re.escape, replacements.keys()))), lambda m: replacements[m.group()], x))

    # convert the dates like 15-NOV-1983 without '-' like : 15NOV1983 
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'(\d+)\-(\w+)\-(\d+)', r"\1\2\3", x, re.UNICODE))
    
    # convert the dates like 15 NOV 1983 without ' ' like : 15NOV1983 
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'(\d+) (\w+) (\d+)', r"\1\2\3", x, re.UNICODE))
    
    # remove '-' from "text-" like MINI-SERIES -> MINI SERIES or MINI -SERIES -> MINI SERIES
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'([a-zA-Z]{1,})\-([a-zA-Z]{1,})', r"\1 \2", x, re.UNICODE))
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'\s\-([a-zA-Z]{1,})', r" \1", x, re.UNICODE))
    
    # remove ' - ' from "text - text" like CRUME - BUM -> CRUME BUM
    data[column_name] = data[column_name].apply(lambda x: re.sub(r'([a-zA-Z]) \- ([a-zA-Z])', r"\1 \2", x, re.UNICODE))

    # lowercase
    data[column_name] = data[column_name].str.lower()
 
    # remove multiple space with only one
    data[column_name]= data[column_name].apply(lambda x: re.sub(r' +', ' ',  x, re.UNICODE))

    return data

Read the final aggregated movie data with additional information, we will only use the columns of Title, status, Media Type, Remarks and plot.

In [4]:
col_list = ['Title', 'IMDB_ID', 'Status', 'Media Type', 'Remarks', 'plot']
movie_data = pd.read_csv("../data/with_additional_data/military_hollywood_with_additional_data.csv", usecols = col_list)
movie_data

Unnamed: 0,Title,IMDB_ID,Status,Media Type,Remarks,plot
0,"""1968""",Never Made,OTH,FILM,THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM...,
1,"1,000 MEN AND A BABY",tt0133231,APP,TV,VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN...,A baby in a foreign land is adopted by the men...
2,1ST FORCE,Never Made,OTH,FILM,INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR...,
3,24,tt0502209,APP,TV,APPROVED FILMING FOR ONE DAY WITH TWO MARINE C...,Jack and Tony clash as they wait for the time ...
4,3RD DEGREE,tt0098469,APP,TV,PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX...,Scott Weston is a private investigator who is ...
...,...,...,...,...,...,...
852,"WONDER YEARS, THE",tt0094582,LIM,TV,THE UNITED STATES AIR FORCE GRANTED STOCK FOOT...,An adult Kevin Arnold reminisces on his teenag...
853,X-15,tt0055627,APP,FILM,AIRFORCE AND NASA PROVIDED FULL COOPERATION ON...,At the height of the Cold War during the 1960s...
854,"YEAR IN THE LIFE, A",tt0092488,DEN,TV,THE PROJECT WAS DENIED ASSISTANCE.,"Joe Gardner, a child of the Depression, is a s..."
855,"YOUNG LIONS, THE",tt0052415,APP,FILM,PENTAGON AND STATE DEPARTMENT WENT THROUGH LON...,The destiny of three soldiers during World War...


We have a total of 857 entries. First, we remove the films that were never made.

In [5]:
made_movies_data = movie_data.loc[movie_data['IMDB_ID'] != "Never Made"]
clean_remarks = clean_text(made_movies_data, "Remarks")
cleaned_made_movies_data = clean_text(clean_remarks, "plot")

First, we would like to see the difference between the films that were explicitly approved and denied. So, select the movies and films that have been approved or denied for assistance by the DoD.

In [7]:
# Take a subset of approved and denied movies
array = ['APP', 'DEN']
dod_app_den_subset = cleaned_made_movies_data.loc[cleaned_made_movies_data['Status'].isin(array)].reset_index(drop=True)
dod_app_den_subset

Unnamed: 0,Title,IMDB_ID,Status,Media Type,Remarks,plot
0,"1,000 MEN AND A BABY",tt0133231,APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...
1,24,tt0502209,APP,TV,approved filming for one day with two marine c...,jack and tony clash as they wait for the time ...
2,3RD DEGREE,tt0098469,APP,TV,personnel appeared on this game show at the ex...,scott weston is a private investigator who is ...
3,50/50,tt1306980,DEN,FILM,never was officially submitted to dod. it was ...,adam is a 27 year old writer of radio programs...
4,A MIDNIGHT CLEAR,tt0102443,DEN,FILM,declined assistance (request for ww ii facilit...,"set in 1944 france, in the ardennes forest reg..."
...,...,...,...,...,...,...
667,WITHOUT GLORY,,APP,FILM,approved by the department since it was based ...,
668,X-15,tt0055627,APP,FILM,airforce and nasa provided full cooperation on...,at the height of the cold war during the 1960s...
669,"YEAR IN THE LIFE, A",tt0092488,DEN,TV,the project was denied assistance.,"joe gardner, a child of the depression, is a s..."
670,"YOUNG LIONS, THE",tt0052415,APP,FILM,pentagon and state department went through lon...,the destiny of three soldiers during world war...


We are left with 672 films that have an explicit status of approved or denied. 

Second, we would like to see the difference between the films that were supported completely and were provided limited support. So, select the movies and films that have been approved or limited for assistance by the DoD.

In [8]:
# Take a subset of approved and limited movies
array = ['APP', 'LIM']
dod_app_lim_subset = cleaned_made_movies_data.loc[cleaned_made_movies_data['Status'].isin(array)].reset_index(drop=True)
dod_app_lim_subset

Unnamed: 0,Title,IMDB_ID,Status,Media Type,Remarks,plot
0,"1,000 MEN AND A BABY",tt0133231,APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...
1,24,tt0502209,APP,TV,approved filming for one day with two marine c...,jack and tony clash as they wait for the time ...
2,3RD DEGREE,tt0098469,APP,TV,personnel appeared on this game show at the ex...,scott weston is a private investigator who is ...
3,A FEW GOOD MEN,tt0104257,LIM,FILM,"inaccurate, negative portravals of marines. pr...","in this dramatic courtroom thriller, lt daniel..."
4,ABOVE AND BEYOND,tt0044324,APP,FILM,story of paul tibbetts and the atomic bomb mis...,"the story of colonel paul tibbets, the pilot o..."
...,...,...,...,...,...,...
615,WOMEN OF VALOR,tt0092236,LIM,TV,approved use of stock footage. film was ouite ...,"col. jessup (susan sarandon), an american mili..."
616,"WONDER YEARS, THE",tt0094582,LIM,TV,the united states air force granted stock foot...,an adult kevin arnold reminisces on his teenag...
617,X-15,tt0055627,APP,FILM,airforce and nasa provided full cooperation on...,at the height of the cold war during the 1960s...
618,"YOUNG LIONS, THE",tt0052415,APP,FILM,pentagon and state department went through lon...,the destiny of three soldiers during world war...


We have 620 films that have a status of approved or limited. 

Loading the NLP library to perform the processing pipeline

In [8]:
made_movies_data[made_movies_data["Title"]=="SUM OF ALL FEARS"].Remarks.iloc[0]

'REQUESTED PERMISSION TO FILM THE NAOC AIRCRAFT IN-FLIGHT DURING A ROUTINE REFUELING BY A KC-10 FOR AERIAL SEQUENCES; REQUESTED AIRFORCE FOR PERMISSION TO FILM B-2 AND F-16 AIRCRAFT WHILE IN-FLIGHT, IN-GROUND TAXI AND TAKE OFF ACTIVITIES AS WELL. IN ADDITION, PRODUCERS REQUESTED AND DOD APPROVED FILMING NEAR MONTREAL, CANADA WITH TWO MARINE CORPS CH-53 AND TWO ARMY CH-60 HELICOPTERS, ARMY DISASTER RELIEF EQUIPMENT, AS WELL AS MILITARY EXTRAS. THERE WAS A SECOND FILMING UNIT IN SAN DIEGO, CA THAT DID FILM ADDITIONAL MILITARY SEQUENCES AS WELL.'

In [9]:
dod_app_den_subset[dod_app_den_subset["Title"]=="SUM OF ALL FEARS"].Remarks.iloc[0]

'requested permission to film the naoc aircraft in flight during a routine refueling by a kc/ for aerial sequences; requested airforce for permission to film b/ and f/ aircraft while in flight, in ground taxi and take off activities as well. in addition, producers requested and dod approved filming near montreal, canada with two marine corps ch/ and two army ch/ helicopters, army disaster relief equipment, as well as military extras. there was a second filming unit in san diego, ca that did film additional military sequences as well.'

In [10]:
nlp = spacy.load('en_core_web_trf')

## Part 1: Term Analysis of plots and remarks

In the first part, we shall perform an analysis of the text in the remarks and plots column separately of only approved and denied films, leaving out the films whose status has ambiguity.

We define a function term_freq, that will process the data frame on mentioned columns and returns the word frequency and a corpus in Scatter Text compatible format to help in plotting and other operations.

In [11]:
def term_freq(data_frame, cat_col, text_col, cat_1, cat_2, nlp,
              remove_stop_words=False):
    """
    The function process the data frame on mentioned columns and
    returns the word frequency and a corpus in Scatter Text compatible
    format to help in plotting and other operations.

    :param data_frame (pd.Dataframe): The data frame on which the corpus needs to be created
    :param cat_col (str): The column in the dataframe on which the dataframe is divided.
    :param text_col (str): The column in the dataframe on which the analysis is performed.
    :param cat_1 (str): The first category in the cat_col (out of two categories).
    :param cat_2 (str): The second category in the cat_col (out of two categories).
    :param nlp (spacy.nlp): The nlp pipeline to apply on the text field.
    :param remove_stop_words (bool): Indicates to remove stop words of not (Default is False)

    :return df_cat1 (pd.Dataframe): The dataframe with terms majorly in
    category 1, with their frequency in each category along with f-scores
    in each category.
    :return df_cat2 (pd.Dataframe): The dataframe with terms majorly in
    category 2, with their frequency in each category along with f-scores
    in each category.
    :return corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    """

    if remove_stop_words:
        # We remove stop words by using the method "remove_terms" and
        # provide the argument for default stop words when building the corpus

        corpus = (st.CorpusFromPandas(data_frame,
                                      category_col=cat_col,
                                      text_col=text_col,
                                      nlp=nlp,
                                      feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True))
                  .build()
                  .remove_terms(nlp.Defaults.stop_words, ignore_absences=True)
                  )
    else:
        corpus = (st.CorpusFromPandas(data_frame,
                                      category_col=cat_col,
                                      text_col=text_col,
                                      nlp=nlp,
                                      feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True))
                  .build()
                  )

    # obtain the dataframe with term frequencies
    df = corpus.get_term_freq_df()
    # seperate the dataframe for each category and
    # round the f-score to two decimal points
    df[cat_1] = corpus.get_scaled_f_scores(cat_1)
    df[cat_2] = corpus.get_scaled_f_scores(cat_2)
    df[cat_1] = round(df[cat_1], 2)
    df[cat_2] = round(df[cat_2], 2)

    # sort the terms by frequency in decreasing order
    df_cat1 = df.sort_values(by=cat_1 + ' freq',
                             ascending=False).reset_index()
    df_cat2 = df.sort_values(by=cat_2 + ' freq',
                             ascending=False).reset_index()

    return df_cat1, df_cat2, corpus

We define a function plot_html, that will plot the processed scatter text corpus on mentioned columns and saves the file to an HTML page.

In [12]:
def plot_html(corpus, cat, name_of_category_1, name_of_category_2,
              html_file_name):
    """
    The function that will plot the processed scatter text corpus on mentioned
    columns and saves the file to a HTML page..

    :param corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    :param cat (str): Name of the category as in the caterogy column to be plotted on y-axis.
    :param name_of_category_1 (str): Name of the category to appear on the y-axis.
    :param name_of_category_2 (str): Name of the category to appear on the x-axis.
    """

    html = st.produce_scattertext_explorer(
        corpus,
        category=cat,
        category_name=name_of_category_1,
        not_category_name=name_of_category_2,
        metadata=corpus.get_df()['Title'],
        use_full_doc=True,)
    open(html_file_name, 'wb').write(html.encode('utf-8'))

### Analysis on the Remarks of DoD without removing Stopwords

#### Approved or Denied 

In [13]:
remarks_app_w_stopw, remarks_den_w_stopw, st_remarks_corpus_w_stopw = term_freq(dod_app_den_subset,
                                "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_denied-Scattertext_remarks_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The top words in the approved category - _approve_ and _approved filming_ are trivial the other words convey some important information. The word "at" is one of the most frequent term, indicating that the films were approved for shooting at some location and _one_ and _day_ tells us that most of the film shooting happened for a short duration of one day. The last key observation is the word _positive_. It indicates that the DoD felt the movie provides a positive image of them and thus approved. Similarly, _deny_, _decline_ and contradiction words appear highly in remarks of denied movies, trivially indicating that they were denied. However, the key terms are _negative_ and _benifit_ and indicating that those films were rejected because the DoD felt that movies portrayed a negative image of them and it did not benefit the DoD. The last key observation is _inaccurate_ showing that another reason for DoD to deny the support is an inaccurate portrayal. Looking at the characteristic words reveal that the crucial terms that separate the two categories are portrayal and depiction. Examining the remaining terms show us that some words such as _mugu, mcas, hueneme, miramar, hangar, pendleton, usmc_ exclusively appear only in the remarks of the approved movies and these are the support provided to the approved movies in terms of locations mostly.

Looking at the overall plot, apart from the words mentioned above we see that _vietnam_ is inclined to the denied side on the top right. Upon looking at the occurrence of the word _vietnam_ in the corpus, we find that those movies that have celebrated the US military, South Vietnam or the war, in general, have been supported and at the same time those that had anti-war sentiments were denied. On the other side, we see that all films that deal with the korean war (bottom left) were approved. Although the USA did not win (or lose) the Korean war, the sentiment is not against the war unlike the case of Vietnam. We also observe that in the bottom left there are the terms _f_, _f/_, _c/_, and _hawk_ (the fighter jets of the US Air Force) which refer to the support type requested/provided by the DoD to the films. Then, we see that _world war_ is very frequent in approved films. This shows that again many of the films that were supported by DoD were based on historical/significant events to the US military. 

The word _change_ appeared in 0.4% and 2.1% in the remarks of the approved and denied films respectively. This word and the corpus associated with these two categories provide us with two important arguments. One the DoD has asked the filmmakers to make the changes and two, they have provided support to those films that have agreed to make the suggested changes to make the script favourable to them. Next, we see the term recruiting in between average and frequent near the denied terms. Looking at the corpora indicates that DoD denied the films which they deemed did not help in recruiting and at the same time it approved films that they deemed would help in recruiting.  

#### Approved or Limited

In [14]:
remarks_app_w_stopw, remarks_den_w_stopw, st_remarks_corpus_w_stopw = term_freq(dod_app_lim_subset,
                                "Status", "Remarks", "APP", "LIM", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_limited-Scattertext_remarks_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

Similar to the previous analysis, the top words in each category are trivial words of the category and the top words in the approved category remain the same as the corpus is unchanged. For the limited category, we see the terms _footage_, _stock footage_ are common suggesting that these films were stated under limited support as they were mostly provided with historical footage. The key term in the limited category is _backgroud_. Along with the footage, the other major support in the limitedly support films was either providing a location to be used as a background or some personnel to appear in the background. Then we have _technical (advisors)_ predominently in limited films suggesting that technical supervison was provided as part of limited assistance.

In both plots above we see that they stop words appear and although it might hinder us from seeing some terms they help us in understanding the contextual meaning. In the next subsection, we repeat the analysis by removing the stopwords.

### Analysis on the Remarks of DoD after removing the Stopwords

#### Approved or Denied

In [15]:
remarks_app_wo_stopw, remarks_den_wo_stopw, st_remarks_corpus_wo_stopw = term_freq(dod_app_den_subset,
                                    "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_denied-Scattertext_remarks_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

After removing stop words we see that army, navy and air force are most commonly occurring in both approved and denied films remarks indicating that the assistance was mostly requested from them. The remaining analysis remains the same as the previous analysis of remarks without removing the stop words. However, we see some new words that were not visible earlier (but they were present). The first one is a _john wayne_ (near average frequent terms on the Approved axis). He is a famous American actor popular through the silent era and golden age of Hollywood. He is one of the top actors that brought heavy box office collections and no film starring him has been rejected for support suggesting that DoD wanted to send their message through people who have a strong foothold.

#### Approved or Limited

In [16]:
remarks_app_wo_stopw, remarks_den_wo_stopw, st_remarks_corpus_wo_stopw = term_freq(dod_app_lim_subset,
                                    "Status", "Remarks", "APP", "LIM", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_limited-Scattertext_remarks_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is the same as the previous analysis of remarks without removing stopwords.

### Analysis on the Movie Plots without removing the Stopwords

#### Approved or Denied

In [17]:
plots_app_w_stopw, plots_den_w_stopw, st_plots_corpus_w_stopw = term_freq(dod_app_den_subset,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_denied-Scattertext_plot_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The top terms in approved films are _world war, japanese_ telling us that mostly these film plots are set up in a war background. This is not informing us much.

#### Approved or Limited

In [18]:
plots_app_w_stopw, plots_den_w_stopw, st_plots_corpus_w_stopw = term_freq(dod_app_lim_subset,
                                "Status", "plot", "APP", "LIM", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_limited-Scattertext_plot_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

We mostly see stopwords in the Top Approved category. This is not informing us much.

### Analysis on the Movie Plots after removing the Stopwords

#### Approved or Denied

In [19]:
plots_app_wo_stopw, plots_den_wo_stopw, st_plots_corpus_wo_stopw = term_freq(dod_app_den_subset,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_denied-Scattertext_plot_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The terms in the plot that have received full production support include _korean (war), pearl (harbour), japanese, german, iraq and world war_. On the other hand, for the movies that were denied support, we see _vietnam_, _nuclear_ and _government_ terms as most frequent. The remarks of these films containing these words of denied - category clearly show mention that they were denied as it was not beneficial to the DoD.

#### Approved or Limited

In [20]:
plots_app_wo_stopw, plots_den_wo_stopw, st_plots_corpus_wo_stopw = term_freq(dod_app_lim_subset,
                                "Status", "plot", "APP", "LIM", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/all_films/screenshots/approved_limited-Scattertext_plot_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

This is not informing us much.


## Part 2: Phrases Analysis of plots and remarks using PyRankText

PyTextRank is an implementation of a modified version of the TextRank algorithm. It extracts a scored list of the most prominent phrases in a document.

In the second part, we shall perform a similar analysis to Part 1 expect that we shall use phrases instead of terms.

Now we add the textrank to the nlp pipeline to rank and select the phrases

We define a function phrase_freq, that will process the data frame on mentioned columns and returns a corpus in Scatter Text compatible format to help in plotting and other operations.

In [21]:
nlp.add_pipe("textrank", last=True)

<pytextrank.base.BaseTextRank at 0x1a109598430>

In [22]:
def phrase_freq(data_frame, cat_col, text_col, nlp):
    """
    The function process the data frame on mentioned columns and a corpus in
    Scatter Text compatible format to help in plotting and other operations.

    :param data_frame (pd.Dataframe): The data frame on which the corpus needs to be created
    :param cat_col (str): The column in the dataframe on which the dataframe is divided.
    :param text_col (str): The column in the dataframe on which the analysis is performed.
    :param nlp (spacy.nlp): The nlp pipeline to apply on the text field.

    :return corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    """

    data_frame_copy = data_frame.copy()
    # compatibility setting
    data_frame_copy[text_col] = data_frame_copy[text_col].astype(str)
    # Applying the nlp pipeline on the text
    data_frame_copy[text_col] = data_frame_copy[text_col].apply(nlp)
    # building the corpus
    corpus = st.CorpusFromParsedDocuments(
            data_frame_copy,
            category_col=cat_col,
            parsed_col=text_col,
            feats_from_spacy_doc=st.PyTextRankPhrases()
        ).build()

    return corpus

In [23]:
def plot_html_pyranktext(corpus, cat1, cat2, name_of_category_1, name_of_category_2,
                         html_file_name):
    """
    The function that will plot the processed scatter text corpus on mentioned
    columns and saves the file to a HTML page..

    :param corpus (st.CorpusDF): The corpus created from the dataframe in Scatter Text.
    :param cat1 (str): Name of the category as in the caterogy column to be plotted on y-axis.
    :param cat2 (str): Name of the category as in the caterogy column to be plotted on x-axis.
    :param name_of_category_1 (str): Name of the category to appear on the y-axis.
    :param name_of_category_2 (str): Name of the category to appear on the x-axis.
    """

    phrase_category_scores = corpus.get_metadata_freq_df('')
    # As the aggregate TextRank scores aren’t easily interpretable, we’ll
    # display the per-category rank of each phrase when clicked in the metadata.
    term_ranks = np.argsort(np.argsort(-phrase_category_scores, axis=0), axis=0) + 1
    metadata_descriptions = {
        term: '<br/>' + '<br/>'.join(
            '<b>%s</b> TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())
            for cat in corpus.get_categories())
        for term in corpus.get_metadata()
    }

    # We will define term score using the maximum category-specific score,
    # this will give us the most prominent phrases in each category,
    # regardless of the prominence in the other category.
    category_specific_prominence = phrase_category_scores.apply(
        lambda r: r[cat1] if r[cat1] > r[cat2] else -r[cat2], axis=1)

    html = st.produce_scattertext_explorer(
        corpus,
        category=cat1,
        category_name=name_of_category_1,
        not_category_name=name_of_category_2,
        transform=st.dense_rank,
        metadata=corpus.get_df()['Title'],
        scores=category_specific_prominence,
        sort_by_dist=False,
        use_non_text_features=True,
        topic_model_term_lists={term: [term] for term in corpus.get_metadata()},
        topic_model_preview_size=0,
        metadata_descriptions=metadata_descriptions,
        use_full_doc=True
        )

    open(html_file_name, 'wb').write(html.encode('utf-8'))

### Analysis on the Remarks using PyTextRank

#### Approved or Denied

In [24]:
remakrs_phrase_corpus = phrase_freq(dod_app_den_subset, "Status", "Remarks", nlp)

pyranktext_remarks_filename = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_PyRankText_remarks.html"

plot_html_pyranktext(remakrs_phrase_corpus, "APP", "DEN", "Approved", "Denied", html_file_name=pyranktext_remarks_filename)

##### Interpretation of the analysis

There are only 6 points on the plots and does not provide any useful information.

#### Approved or Limited

In [25]:
remakrs_phrase_corpus = phrase_freq(dod_app_lim_subset, "Status", "Remarks", nlp)

pyranktext_remarks_filename = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_PyRankText_remarks.html"

plot_html_pyranktext(remakrs_phrase_corpus, "APP", "LIM", "Approved", "Limited", html_file_name=pyranktext_remarks_filename)

##### Interpretation of the analysis

There are only 6 points on the plots and does not provide any useful information.

### Analysis on the Movie Plots using PyTextRank

#### Approved or Denied

In [26]:
plots_phrase_corpus = phrase_freq(dod_app_den_subset, "Status", "plot", nlp)

pyranktext_plots_filename = ST_PLOTS_PATH + "all_films/approved_denied-Scattertext_PyRankText_plot.html"

plot_html_pyranktext(plots_phrase_corpus, "APP", "DEN", "Approved", "Denied", html_file_name=pyranktext_plots_filename)

##### Interpretation of the analysis

There is only 1 point on the plots and does not provide any useful information.

#### Approved or Limited

In [27]:
plots_phrase_corpus = phrase_freq(dod_app_lim_subset, "Status", "plot", nlp)

pyranktext_plots_filename = ST_PLOTS_PATH + "all_films/approved_limited-Scattertext_PyRankText_plot.html"

plot_html_pyranktext(plots_phrase_corpus, "APP", "LIM", "Approved", "Limited", html_file_name=pyranktext_plots_filename)

##### Interpretation of the analysis

There is no point in the plots and does not provide any useful information.

## Part 3: Term Analysis of plots and remarks of movies belonging to political significance.

Here we will separate the movies that have political significance to the US Army. We will separate these movies if the following keywords are present in the plot or remarks of the movies. The reason for choosing remarks is that the plot information from IMDB is limited sometimes as it is created by users. However, the remarks mention the background of these movies.

|    Event    |                                                           Key words                                                           |
|:-----------:|:-----------------------------------------------------------------------------------------------------------------------------:|
|  World War  | "world war", "nazi", "pearl harbour", "hawaii", "hiroshima", "nagasaki", "atomic bomb", "hitler", "d-day", "germany", "japan" |
|  WTO Attcks |                                 "9/11", "september 11", "world trade center", "september 2001"                                |
| Vietnam War |                                      "vietnam", "saigon", "hanoi", "laos", "vietnam war"                                      |
|   Gulf War  |                                   "gulf", "iraq", "baghdad", "saddam", "kuwait", "gulf war"                                   |
|  Afghan War |                             "afghan", "taliban", "afghanistan", "al qaeda", "osama", "afghan war", "laden"                             |
|  Korean War |                                     "inchon", "north korea", "korean war", "forgotten war"                                    |

As in the previous parts, we shall only consider approved and denied films for the analysis

In [10]:
pol_sig_keywords = ["world war", "nazi", "pearl harbour", "hawaii",
                    "hiroshima", "nagasaki", "atomic bomb", "hitler",
                    "d-day", "germany", "japan", "japanese", "9/11", "september 11",
                    "world trade center", "september 2001" "vietnam",
                    "saigon", "hanoi", "laos", "vietnam war", "gulf",
                    "iraq", "baghdad", "saddam", "kuwait", "gulf war",
                    "afghan", "taliban", "afghanistan", "al qaeda",
                    "osama", "afghan war", "laden", "inchon", "north korea",
                    "korean war", "forgotten war", "korea"]

First, select the movies that have political significance among the movies that are explicitly approved or denied.

In [11]:
dod_app_den_subset["pol_sig"] = dod_app_den_subset["plot"].str.contains("|".join(pol_sig_keywords)) | dod_app_den_subset["Remarks"].str.contains("|".join(pol_sig_keywords))
dod_app_den_pol_sig = dod_app_den_subset.loc[dod_app_den_subset["pol_sig"] == True].reset_index(drop=True)
dod_app_den_pol_sig

Unnamed: 0,Title,IMDB_ID,Status,Media Type,Remarks,plot,pol_sig
0,"1,000 MEN AND A BABY",tt0133231,APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...,True
1,A MIDNIGHT CLEAR,tt0102443,DEN,FILM,declined assistance (request for ww ii facilit...,"set in 1944 france, in the ardennes forest reg...",True
2,ABOVE AND BEYOND,tt0044324,APP,FILM,story of paul tibbetts and the atomic bomb mis...,"the story of colonel paul tibbets, the pilot o...",True
3,ACE OF ACES,tt0023737,APP,FILM,army air corps provided planes to stage aerial...,a sculptor who doesn't want to have any part o...,True
4,ACTION IN THE NORTH ATLANTIC,tt0035608,APP,FILM,navy and merchant marines provided use of ship...,lieutenant joe rossi is 1st officer on a liber...,True
...,...,...,...,...,...,...,...
209,WHISKEY TANGO FOXTROT,tt3553442,APP,FILM,usaf support to this project took place at kir...,"2003. after careful consideration, kim baker, ...",True
210,WINDTALKERS,tt0245562,APP,FILM,dod approved filming at various sites at marin...,during world war ii when the americans needed ...,True
211,WING AND A PRAYER,tt0037466,APP,FILM,fictionalized story of battle of midway. navy ...,an aircraft carrier is sent on a decoy mission...,True
212,WINGS OF EAGLES,tt0051198,APP,FILM,"story of early navy aviator, spig wead. after ...",u.s. navy pilot frank 'spig' wead is a fun lov...,True


We have 214 films that have some political significance and have an explicit status of denied or supported.

Second, select the movies that have political significance among the movies that are approved or provided with limited support.

In [12]:
dod_app_lim_subset["pol_sig"] = dod_app_lim_subset["plot"].str.contains("|".join(pol_sig_keywords)) | dod_app_lim_subset["Remarks"].str.contains("|".join(pol_sig_keywords))
dod_app_lim_pol_sig = dod_app_lim_subset.loc[dod_app_lim_subset["pol_sig"] == True].reset_index(drop=True)
dod_app_lim_pol_sig

Unnamed: 0,Title,IMDB_ID,Status,Media Type,Remarks,plot,pol_sig
0,"1,000 MEN AND A BABY",tt0133231,APP,TV,very positive depiction of navy in this korean...,a baby in a foreign land is adopted by the men...,True
1,ABOVE AND BEYOND,tt0044324,APP,FILM,story of paul tibbetts and the atomic bomb mis...,"the story of colonel paul tibbets, the pilot o...",True
2,ACE OF ACES,tt0023737,APP,FILM,army air corps provided planes to stage aerial...,a sculptor who doesn't want to have any part o...,True
3,ACTION IN THE NORTH ATLANTIC,tt0035608,APP,FILM,navy and merchant marines provided use of ship...,lieutenant joe rossi is 1st officer on a liber...,True
4,AERIAL GUNNER,tt0035614,APP,FILM,army air corps provided access to the harlinge...,old rivals are pitted against each other in ba...,True
...,...,...,...,...,...,...,...
211,WINDTALKERS,tt0245562,APP,FILM,dod approved filming at various sites at marin...,during world war ii when the americans needed ...,True
212,WING AND A PRAYER,tt0037466,APP,FILM,fictionalized story of battle of midway. navy ...,an aircraft carrier is sent on a decoy mission...,True
213,WINGS OF EAGLES,tt0051198,APP,FILM,"story of early navy aviator, spig wead. after ...",u.s. navy pilot frank 'spig' wead is a fun lov...,True
214,WOMEN OF VALOR,tt0092236,LIM,TV,approved use of stock footage. film was ouite ...,"col. jessup (susan sarandon), an american mili...",True


We have 216 films and TV shows that have political significance that have a status of full support or limited support. Now we repeat the same analysis of terms and phrases in these movies.
Redefining the NLP pipeline to remove the textrank pipeline.

In [31]:
nlp = spacy.load('en_core_web_trf')

### Analysis on the Remarks of DoD without removing Stopwords

#### Approved or Denied

In [32]:
remarks_app_w_stopw_pol_sig, remarks_den_w_stopw_pol_sig, st_remarks_corpus_w_stopw_pol_sig = term_freq(dod_app_den_pol_sig,
                                "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw_pol_sig, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_denied-Scattertext_remarks_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

Similar to the above analysis on remarks with stop words, the top approved and denied are trivial words. We see that movies based on historical events where the US military either won or has not created strong anti-war sentiments in the US are present in the approved category (_world war ii, korean (war), pearl (harbor), german and japanese_) and conversely those movies that are based on situations that were not favourable to the US like _vietnam_, _anti_ are present in the denied category. When we look at the characteristic terms we find that the crucial factors that are differentiating between these categories are _portrayal, depiction and script_. This clearly indicates that the DoD supports only the films that provide a good(and realistic) image of them.

#### Approved or Limited

In [33]:
remarks_app_w_stopw_pol_sig, remarks_den_w_stopw_pol_sig, st_remarks_corpus_w_stopw_pol_sig = term_freq(dod_app_lim_pol_sig,
                                "Status", "Remarks", "APP", "LIM", nlp, remove_stop_words=False)

# Saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_limited-Scattertext_remarks_with_stopwords.html"

plot_html(st_remarks_corpus_w_stopw_pol_sig, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_limited-Scattertext_remarks_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the previously made analysis where the limited support was mostly based for footage and a biased attitude can be seen towards films that have _vietnam_ backdrop.

### Analysis on the Remarks of DoD after removing the Stopwords

#### Approved or Denied

In [34]:
remarks_app_wo_stopw_pol_sig, remarks_den_wo_stopw_pol_sig, st_remarks_corpus_wo_stopw_pol_sig = term_freq(dod_app_den_pol_sig,
                                    "Status", "Remarks", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw_pol_sig, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_denied-Scattertext_remarks_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the above analysis.

#### Approved or Limited

In [35]:
remarks_app_wo_stopw_pol_sig, remarks_den_wo_stopw_pol_sig, st_remarks_corpus_wo_stopw_pol_sig = term_freq(dod_app_lim_pol_sig,
                                    "Status", "Remarks", "APP", "LIM", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_limited-Scattertext_remarks_without_stopwords.html"

plot_html(st_remarks_corpus_wo_stopw_pol_sig, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_limited-Scattertext_remarks_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is same as previous one

### Analysis on the Movie Plots without removing the Stopwords

#### Approved or Denied

In [36]:
plots_app_w_stopw_pol_sig, plots_den_w_stopw_pol_sig, st_plots_corpus_w_stopw_pol_sig = term_freq(dod_app_den_pol_sig,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw_pol_sig, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_denied-Scattertext_plot_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the previous plot analysis with stopwords.

#### Approved or Limited

In [37]:
plots_app_w_stopw_pol_sig, plots_den_w_stopw_pol_sig, st_plots_corpus_w_stopw_pol_sig = term_freq(dod_app_lim_pol_sig,
                                "Status", "plot", "APP", "LIM", nlp, remove_stop_words=False)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_limited-Scattertext_plot_with_stopwords.html"

plot_html(st_plots_corpus_w_stopw_pol_sig, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_limited-Scattertext_plot_with_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the previous plot analysis with stopwords

### Analysis on the Movie Plots after removing the Stopwords

#### Approved or Denied

In [38]:
plots_app_wo_stopw_pol_sig, plots_den_wo_stopw_pol_sig, st_plots_corpus_wo_stopw_pol_sig = term_freq(dod_app_den_pol_sig,
                                "Status", "plot", "APP", "DEN", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_denied-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw_pol_sig, "APP", "Approved", "Denied", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_denied-Scattertext_plot_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the previous plot analysis without stopwords

#### Approved or Limited

In [39]:
plots_app_wo_stopw_pol_sig, plots_den_wo_stopw_pol_sig, st_plots_corpus_wo_stopw_pol_sig = term_freq(dod_app_lim_pol_sig,
                                "Status", "plot", "APP", "LIM", nlp, remove_stop_words=True)

# saving the graph
html_file_name = ST_PLOTS_PATH + "political_significance/approved_limited-Scattertext_plot_without_stopwords.html"

plot_html(st_plots_corpus_wo_stopw_pol_sig, "APP", "Approved", "Limited", html_file_name=html_file_name)

##### Interpretation of the analysis

<img src="./../scatter_text_plots/political_significance/screenshots/approved_limited-Scattertext_plot_without_stopwords.png" alt="app_den-ST_remarks_with_stopwords" width="1000"/>

The analysis is similar to the previous plot analysis without stopwords

## Part 4: Phrases Analysis of plots and remarks using PyRankText for politically significnat movies

As there aren't any phrases with repetations, this analysis is not possible.