# Analysis of Remarks

Using [Scatter Text](https://github.com/JasonKessler/scattertext).

In [1]:
# import libraries
import spacy
import scattertext as st
import pandas as pd

In [2]:
# Path to store the plots generated by scatter text
ST_PLOTS_PATH = "./../scatter_text_plots/"
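Writing a plot fails if the target folder does not exist yet. The sketch below shows one way to create it beforehand. The helper name `ensure_dir` is hypothetical and not part of the notebook, and a temporary folder stands in for the real plot path.

```python
from pathlib import Path
import tempfile


def ensure_dir(path):
    """Create the directory (and any missing parents) if it is absent."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Demonstrated on a temporary location instead of the real plot folder
with tempfile.TemporaryDirectory() as tmp:
    plots_dir = ensure_dir(Path(tmp) / "scatter_text_plots")
    created = plots_dir.is_dir()
    # Calling it a second time is harmless because of exist_ok=True
    ensure_dir(plots_dir)
```

In the notebook itself the call would presumably be `ensure_dir(ST_PLOTS_PATH)`, placed before the first plot is saved.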

## Part 1
In the first part, we perform a basic analysis of the text in the Remarks column, using only approved and denied films and leaving out the films whose status is ambiguous.

**Read the final aggregated movie data with additional information. We only use the columns Title, Status, Media Type, Remarks and plot.**

In [3]:
col_list = ['Title', 'Status', 'Media Type', 'Remarks', 'plot']
movie_data = pd.read_csv("../data/with_additional_data/military_hollywood_with_additional_data.csv", usecols = col_list)
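As a small illustration of what `usecols` does, the sketch below reads a made-up three-row CSV from memory and keeps only some of its columns. The rows and the `Year` column are invented for the example and do not come from the real data file.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for the real file
csv_text = io.StringIO(
    "Title,Status,Year,Remarks\n"
    "FILM A,APP,1990,APPROVED FILMING\n"
    "FILM B,DEN,1991,DENIED ASSISTANCE\n"
    "FILM C,OTH,1992,NO DECISION\n"
)

# Only the listed columns are loaded; "Year" is skipped while parsing
wanted = ["Title", "Status", "Remarks"]
toy = pd.read_csv(csv_text, usecols=wanted)
print(list(toy.columns))
```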

In [4]:
movie_data.head()

| | Title | Status | Media Type | Remarks | plot |
|---|---|---|---|---|---|
| 0 | "1968" | OTH | FILM | THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM... | |
| 1 | 1,000 MEN AND A BABY | APP | TV | VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN... | A baby in a foreign land is adopted by the men... |
| 2 | 1ST FORCE | OTH | FILM | INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR... | |
| 3 | 24 | APP | TV | APPROVED FILMING FOR ONE DAY WITH TWO MARINE C... | Jack and Tony clash as they wait for the time ... |
| 4 | 3RD DEGREE | APP | TV | PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX... | Scott Weston is a private investigator who is ... |


**Select the films and TV productions that the DoD either approved or denied for assistance.**

In [5]:
# Take a subset of approved and denied movies
array = ['APP', 'DEN']
dod_subset = movie_data.loc[movie_data['Status'].isin(array)]
dod_subset

| | Title | Status | Media Type | Remarks | plot |
|---|---|---|---|---|---|
| 1 | 1,000 MEN AND A BABY | APP | TV | VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN... | A baby in a foreign land is adopted by the men... |
| 3 | 24 | APP | TV | APPROVED FILMING FOR ONE DAY WITH TWO MARINE C... | Jack and Tony clash as they wait for the time ... |
| 4 | 3RD DEGREE | APP | TV | PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX... | Scott Weston is a private investigator who is ... |
| 5 | 50/50 | DEN | FILM | NEVER WAS OFFICIALLY SUBMITTED TO DOD. IT WAS ... | Adam is a 27 year old writer of radio programs... |
| 9 | A MIDNIGHT CLEAR | DEN | FILM | DECLINED ASSISTANCE (REQUEST FOR WW II FACILIT... | Set in 1944 France, in the Ardennes forest reg... |
| ... | ... | ... | ... | ... | ... |
| 850 | WITHOUT GLORY | APP | FILM | APPROVED BY THE DEPARTMENT SINCE IT WAS BASED ... | |
| 853 | X-15 | APP | FILM | AIRFORCE AND NASA PROVIDED FULL COOPERATION ON... | At the height of the Cold War during the 1960s... |
| 854 | YEAR IN THE LIFE, A | DEN | TV | THE PROJECT WAS DENIED ASSISTANCE. | Joe Gardner, a child of the Depression, is a s... |
| 855 | YOUNG LIONS, THE | APP | FILM | PENTAGON AND STATE DEPARTMENT WENT THROUGH LON... | The destiny of three soldiers during World War... |
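The row labels in the output above are not consecutive (1, 3, 4, 5, 9, ...) because the `isin` filter keeps the original index of every surviving row. A minimal sketch on a toy frame shows the same effect. The titles and the `LIM` status are made up for illustration.

```python
import pandas as pd

# Toy frame with invented titles and status codes
toy = pd.DataFrame({
    "Title": ["FILM A", "FILM B", "FILM C", "FILM D"],
    "Status": ["APP", "OTH", "DEN", "LIM"],
})

# Boolean mask: True where Status is one of the wanted values
keep = ["APP", "DEN"]
subset = toy.loc[toy["Status"].isin(keep)]

print(subset)
```

If consecutive labels are preferred, `subset.reset_index(drop=True)` renumbers the rows from zero.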


## Analysis

**Load the spaCy language model that runs the text-processing pipeline.**

In [6]:
nlp = spacy.load('en_core_web_trf')

**We define a function term_freq that processes the data frame on the given columns and returns the term frequencies along with a Scatter Text compatible corpus for plotting and other operations.**

In [7]:
def term_freq(data_frame, cat_col, text_col, cat_1, cat_2,
              remove_stop_words=False):
    """
    Process the data frame on the given columns and return the term
    frequencies along with a Scatter Text compatible corpus for
    plotting and other operations.

    :param data_frame (pd.DataFrame): The data frame from which the corpus is created.
    :param cat_col (str): The column that holds the category of each document.
    :param text_col (str): The column that holds the text to analyse.
    :param cat_1 (str): The first category in cat_col (out of two categories).
    :param cat_2 (str): The second category in cat_col (out of two categories).
    :param remove_stop_words (bool): Whether to remove stop words (default is False).

    :return df_cat1 (pd.DataFrame): All terms sorted by their frequency in
    category 1 (decreasing), with the frequency and scaled f-score of each
    term in both categories.
    :return df_cat2 (pd.DataFrame): All terms sorted by their frequency in
    category 2 (decreasing), with the frequency and scaled f-score of each
    term in both categories.
    :return corpus (st.CorpusDF): The corpus created from the data frame in Scatter Text.
    """

    # Build the corpus once; stop words are removed afterwards if requested
    corpus = (st.CorpusFromPandas(data_frame,
                                  category_col=cat_col,
                                  text_col=text_col,
                                  nlp=nlp)
              .build()
              )

    if remove_stop_words:
        # Remove spaCy's default stop words with "remove_terms";
        # ignore_absences=True skips stop words that are not in the corpus
        corpus = corpus.remove_terms(nlp.Defaults.stop_words,
                                     ignore_absences=True)

    # Obtain the data frame with term frequencies
    df = corpus.get_term_freq_df()

    # Add the scaled f-score of each category,
    # rounded to two decimal places
    df[cat_1] = corpus.get_scaled_f_scores(cat_1).round(2)
    df[cat_2] = corpus.get_scaled_f_scores(cat_2).round(2)

    # Sort the terms by frequency in each category, in decreasing order
    df_cat1 = df.sort_values(by=cat_1 + ' freq',
                             ascending=False).reset_index()
    df_cat2 = df.sort_values(by=cat_2 + ' freq',
                             ascending=False).reset_index()

    return df_cat1, df_cat2, corpus
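To make the shape of the frequency table more concrete, here is a rough sketch of per-category term counting using only the standard library. It is not how Scatter Text or spaCy tokenise text: it simply splits on whitespace and counts single words and two-word sequences. The remarks are invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_counts(docs):
    """Count unigrams and bigrams over a list of texts."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()  # naive whitespace tokenisation
        counts.update(tokens)
        counts.update(ngrams(tokens, 2))
    return counts


# Hypothetical remarks, grouped by status
toy_remarks = {
    "APP": ["approved filming at the base", "navy approved filming"],
    "DEN": ["the project was denied assistance"],
}

freq = {cat: term_counts(docs) for cat, docs in toy_remarks.items()}
print(freq["APP"].most_common(3))
```

Each term ends up with one count per category, which corresponds to the `APP freq` and `DEN freq` columns returned by `term_freq`.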

**We define a function plot_html that plots the processed Scatter Text corpus for the given category and saves the plot to an HTML page.**

In [8]:
def plot_html(corpus, cat, name_of_category_1, name_of_category_2,
              html_file_name):
    """
    Plot the processed Scatter Text corpus for the given category and
    save the plot to an HTML page.

    :param corpus (st.CorpusDF): The corpus created from the data frame in Scatter Text.
    :param cat (str): Name of the category, as in the category column, to be plotted on the y-axis.
    :param name_of_category_1 (str): Name of the category to appear on the y-axis.
    :param name_of_category_2 (str): Name of the category to appear on the x-axis.
    :param html_file_name (str): Path of the HTML file to write the plot to.
    """

    html = st.produce_scattertext_explorer(
                   corpus,
                   category=cat,
                   category_name=name_of_category_1,
                   not_category_name=name_of_category_2)

    # Use a context manager so the file is closed after writing
    with open(html_file_name, 'wb') as html_file:
        html_file.write(html.encode('utf-8'))

### Analysis of the DoD Remarks without Removing Stop Words

In [9]:
movies_app_wo_stopw, movies_den_wo_stopw, st_corpus_wo_stopw = term_freq(dod_subset, "Status", "Remarks", "APP", "DEN", remove_stop_words=False)

In [10]:
movies_app_wo_stopw

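| | term | APP freq | DEN freq | APP | DEN |
|---|---|---|---|---|---|
| 0 | the | 918 | 386 | 0.14 | 0.86 |
| 1 | and | 649 | 147 | 0.92 | 0.08 |
| 2 | of | 610 | 170 | 0.90 | 0.10 |
| 3 | to | 428 | 191 | 0.13 | 0.87 |
| 4 | in | 354 | 96 | 0.90 | 0.10 |
| ... | ... | ... | ... | ... | ... |
| 17583 | abandoning girl | 0 | 1 | 0.44 | 0.56 |
| 17584 | then abandoning | 0 | 1 | 0.44 | 0.56 |
| 17585 | her and | 0 | 1 | 0.44 | 0.56 |
| 17586 | marry her | 0 | 1 | 0.44 | 0.56 |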


In [11]:
movies_den_wo_stopw

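| | term | APP freq | DEN freq | APP | DEN |
|---|---|---|---|---|---|
| 0 | the | 918 | 386 | 0.14 | 0.86 |
| 1 | to | 428 | 191 | 0.13 | 0.87 |
| 2 | of | 610 | 170 | 0.90 | 0.10 |
| 3 | and | 649 | 147 | 0.92 | 0.08 |
| 4 | film | 201 | 110 | 0.11 | 0.89 |
| ... | ... | ... | ... | ... | ... |
| 17583 | approximately thirty | 1 | 0 | 0.50 | 0.50 |
| 17584 | thirty 30 | 1 | 0 | 0.50 | 0.50 |
| 17585 | 30 army | 1 | 0 | 0.50 | 0.50 |
| 17586 | one i | 1 | 0 | 0.50 | 0.50 |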


**Saving the Plot**

In [12]:
html_file_name = ST_PLOTS_PATH + "approved_denied-Scattertext_with_stopwords.html"

plot_html(st_corpus_wo_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)

### Analysis of the DoD Remarks after Removing Stop Words

In [14]:
movies_app_w_stopw, movies_den_w_stopw, st_corpus_w_stopw = term_freq(dod_subset, "Status", "Remarks", "APP", "DEN", remove_stop_words=True)

In [15]:
movies_app_w_stopw

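| | term | APP freq | DEN freq | APP | DEN |
|---|---|---|---|---|---|
| 0 | approved | 254 | 4 | 0.99 | 0.01 |
| 1 | filming | 241 | 6 | 0.99 | 0.01 |
| 2 | film | 201 | 110 | 0.12 | 0.88 |
| 3 | air | 169 | 55 | 0.18 | 0.82 |
| 4 | navy | 168 | 35 | 0.93 | 0.07 |
| ... | ... | ... | ... | ... | ... |
| 17365 | machines | 0 | 1 | 0.42 | 0.58 |
| 17366 | holocaust | 0 | 1 | 0.42 | 0.58 |
| 17367 | a fairly | 0 | 1 | 0.42 | 0.58 |
| 17368 | fairly important | 0 | 1 | 0.42 | 0.58 |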


In [17]:
movies_den_w_stopw

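| | term | APP freq | DEN freq | APP | DEN |
|---|---|---|---|---|---|
| 0 | film | 201 | 110 | 0.12 | 0.88 |
| 1 | the film | 49 | 63 | 0.05 | 0.95 |
| 2 | denied | 1 | 60 | 0.00 | 1.00 |
| 3 | dod | 66 | 56 | 0.08 | 0.92 |
| 4 | air | 169 | 55 | 0.18 | 0.82 |
| ... | ... | ... | ... | ... | ... |
| 17365 | plan approved | 1 | 0 | 0.50 | 0.50 |
| 17366 | approved and | 7 | 0 | 0.84 | 0.16 |
| 17367 | and subject | 1 | 0 | 0.50 | 0.50 |
| 17368 | subject to | 1 | 0 | 0.50 | 0.50 |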
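The table above still lists two-word terms such as "the film" and "subject to", which suggests that only terms exactly matching a stop word were dropped, while bigrams containing one remain. The sketch below imitates that behaviour with a plain dictionary filter. It is an illustration under that assumption, not Scatter Text's implementation. The stop-word set is a tiny hypothetical stand-in for spaCy's list, and the counts are the DEN frequencies shown in the tables.

```python
from collections import Counter

# Hypothetical miniature stop-word list (spaCy's default list is far longer)
STOP_WORDS = {"the", "to", "of", "and"}

# DEN frequencies copied from the tables in this notebook
term_counts = Counter({
    "the": 386,
    "to": 191,
    "film": 110,
    "the film": 63,
    "denied": 60,
})

# Keep every term that is not itself a stop word;
# "the film" survives because the whole term is compared, not its parts
filtered = {term: n for term, n in term_counts.items()
            if term not in STOP_WORDS}

# Rank the remaining terms by frequency, highest first
top_terms = sorted(filtered, key=filtered.get, reverse=True)
print(top_terms)
```

Dropping bigrams that contain a stop word would need an extra check on each word of the term, for example `any(word in STOP_WORDS for word in term.split())`.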


**Saving the Plot**

In [19]:
# Set the output path before plotting, so that the plot with stop words
# saved earlier is not overwritten
html_file_name = ST_PLOTS_PATH + "approved_denied-Scattertext_without_stopwords.html"

plot_html(st_corpus_w_stopw, "APP", "Approved", "Denied", html_file_name=html_file_name)