In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import requests

# Reading the basic dataset

The file _military-hollywood-full_imdbidAdded.csv_ contains one row for each film or TV show that requested assistance from the US DoD. The data has 7 columns.

|Column name | Description |
| :-: | :-: |
|Title|The title of the film or TV show requesting assistance.|
|IMDB_ID|The IMDB ID of the title: "Never Made" for projects that were not produced, empty when no match was found online.|
|Subtitle|An alternate name for the film, or the episode title of the TV show.|
|Status|The military's response to the assistance request.|
|Media Type|The type of media requesting assistance, a Film or TV show.|
|Year|The year that the film or TV episode was released.|
|Remarks|A description of the request from the military's perspective.|

**Note:** This is not an exhaustive list. It contains only the titles that the DoD has chosen to release.

In [2]:
# reading the basic dataset
dod_movies = pd.read_csv("./military-hollywood-full_imdbidAdded.csv")
dod_movies

Unnamed: 0,Title,IMDB_ID,Subtitle,Status,Media Type,Year,Remarks
0,"""1968""",Never Made,,OTH,FILM,,THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM...
1,"1,000 MEN AND A BABY",tt0133231,,APP,TV,1997.0,VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN...
2,1ST FORCE,Never Made,,OTH,FILM,,INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR...
3,24,tt0502209,22,APP,TV,2004.0,APPROVED FILMING FOR ONE DAY WITH TWO MARINE C...
4,3RD DEGREE,tt0098469,,APP,TV,1989.0,PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX...
...,...,...,...,...,...,...,...
852,"WONDER YEARS, THE",tt0094582,ANGEL,LIM,TV,1988.0,THE UNITED STATES AIR FORCE GRANTED STOCK FOOT...
853,X-15,tt0055627,,APP,FILM,1961.0,AIRFORCE AND NASA PROVIDED FULL COOPERATION ON...
854,"YEAR IN THE LIFE, A",tt0092488,ACTS OF FAITH,DEN,TV,1987.0,THE PROJECT WAS DENIED ASSISTANCE.
855,"YOUNG LIONS, THE",tt0052415,,APP,FILM,1958.0,PENTAGON AND STATE DEPARTMENT WENT THROUGH LON...


### Remarks on the basic dataset.

For movies that were never produced, the IMDB ID is "Never Made". For movies we could not find online, the IMDB ID is left empty.

The Year column is currently filled using the data from the official document released by the US [Department of Defense](https://drive.google.com/file/d/1NeDVYu_gvEhtdQVtSFPRIapHDxJx6842/view). Later, we shall update it from the IMDB data.

The Subtitle is NaN in most cases. Where present, it holds the previous name for a film and the episode name for a TV series.

In the Status column, APP means the assistance was approved by the US DoD, and DEN means it was denied. LIM means only limited assistance was provided. OTH means the film either did not request assistance or withdrew its request. RSCH means only research assistance was provided.
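The status codes can be turned into readable labels with a simple lookup. The following is a minimal sketch: `STATUS_LABELS` and the `Status_label` column are hypothetical names, and the small `sample` frame stands in for `dod_movies`.

```python
import pandas as pd

# Hypothetical lookup table; the labels paraphrase the remarks above.
STATUS_LABELS = {
    "APP": "Approved",
    "DEN": "Denied",
    "LIM": "Limited assistance",
    "OTH": "Not requested or withdrawn",
    "RSCH": "Research assistance only",
}

# Small stand-in frame; in the notebook this would be dod_movies.
sample = pd.DataFrame({"Status": ["APP", "DEN", "LIM", "OTH", "APP"]})

# Map each code to its label; codes missing from the table would become NaN.
sample["Status_label"] = sample["Status"].map(STATUS_LABELS)

# Count how many requests fall under each label.
status_counts = sample["Status_label"].value_counts()
print(status_counts)
```

Because unknown codes map to NaN, checking `sample["Status_label"].isna().sum()` is a quick way to spot codes that the lookup table does not cover.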

# Additional Data

Now, we source the following additional data and append it to the above data frame: Year, Release Date, Directors, Plot, Awards, Runtime, IMDB Ratings and Genres. This data is obtained from the [OMDB API](http://www.omdbapi.com/).

The steps involved are:
1. Provide the API key and the parameter that requests the full plot.
2. Check whether the API returns data, since some movies do not exist in OMDB.
3. For the movies that have a response, collect the required data and return it.
4. Replace the old Year column with the new Year column and append the remaining columns.

In [3]:
def get_movie_additional_data(imdb_id, curr_year):
    """
    Obtain the additional data of the movie: Year, Release Date, Directors, Plot, Awards, Runtime, IMDB Ratings and Genres.

    :param imdb_id(str): The IMDB id of the movie
    :param curr_year(str): The year of release of the movie according to the basic dataset.
    :return addl_data(List): The additional data as a list with elements in the following order [year, genre, runtime, director, plot, award, imdb_rating, release_date]
    """
    # obtaining additional data from OMDB API

    params = {'plot': 'full'}
    apiKey = '7d4700e0'  #  OMDB api key here

    try:
        data_URL = 'http://www.omdbapi.com/?i='+imdb_id+'&apikey='+apiKey
        response = requests.get(data_URL, params=params).json()
    except Exception:  # request or JSON decoding failed, or imdb_id is not a string (e.g. NaN)
        response = {}

    year = response.get("Year", curr_year)
    if "–" in str(year):
        # For a TV series where the episode is not identified, OMDB returns the year as a range.
        # In that case, we use the Year provided in the basic dataset.
        year = curr_year
    genre = response.get("Genre")
    runtime = response.get("Runtime")
    director = response.get("Director")
    plot = response.get("Plot")
    award = response.get("Awards")
    imdb_rating = response.get("imdbRating")
    release_date = response.get("Released")

    return [year, genre, runtime, director, plot, award, imdb_rating, release_date]
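The fallback logic can be checked without calling the API by separating the field extraction from the HTTP request. The following is only a sketch: `extract_fields` is a hypothetical helper that mirrors the extraction part of `get_movie_additional_data`, and the sample dictionaries are invented for illustration rather than recorded OMDB responses.

```python
def extract_fields(response, curr_year):
    """Hypothetical helper: pull the fields of interest out of a response dict."""
    year = response.get("Year", curr_year)
    if "–" in str(year):
        # A year range (e.g. a whole TV series): keep the year from the basic dataset.
        year = curr_year
    keys = ["Genre", "Runtime", "Director", "Plot", "Awards", "imdbRating", "Released"]
    # dict.get returns None for any key that is absent.
    return [year] + [response.get(key) for key in keys]


# Invented sample responses (assumptions, not real API output).
found = {"Year": "1997", "Genre": "Drama", "Released": "07 Dec 1997"}
series = {"Year": "2001–2010", "Genre": "Action"}
missing = {"Response": "False", "Error": "Incorrect IMDb ID."}

print(extract_fields(found, 1997.0))
print(extract_fields(series, 2004.0))
print(extract_fields(missing, None))
```

In the first case the year comes from the response, in the second the range is discarded in favour of `curr_year`, and in the third every field except the year is `None`.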

In [4]:
additional_movie_data = dod_movies.apply(lambda row: get_movie_additional_data(row.IMDB_ID, row.Year), axis='columns', result_type='expand')
additional_movie_data.columns = ["Year_omdb", "genre", "runtime", "director", "plot", "award", "imdb_rating", "release_date"]
dod_movies_with_addl_data = pd.concat([dod_movies, additional_movie_data], axis='columns')
dod_movies_with_addl_data.drop(columns='Year', inplace=True)
dod_movies_with_addl_data.rename(columns={"Year_omdb":"Year"}, inplace=True)
dod_movies_with_addl_data

Unnamed: 0,Title,IMDB_ID,Subtitle,Status,Media Type,Remarks,Year,genre,runtime,director,plot,award,imdb_rating,release_date
0,"""1968""",Never Made,,OTH,FILM,THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM...,,,,,,,,
1,"1,000 MEN AND A BABY",tt0133231,,APP,TV,VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN...,1997,Drama,96 min,Marcus Cole,A baby in a foreign land is adopted by the men...,,6.9,07 Dec 1997
2,1ST FORCE,Never Made,,OTH,FILM,INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR...,,,,,,,,
3,24,tt0502209,22,APP,TV,APPROVED FILMING FOR ONE DAY WITH TWO MARINE C...,2004,"Action, Crime, Drama, Thriller",42 min,Frederick King Keller,Jack and Tony clash as they wait for the time ...,,9.0,11 May 2004
4,3RD DEGREE,tt0098469,,APP,TV,PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX...,1989,"Crime, Drama, Thriller",100 min,Roger Spottiswoode,Scott Weston is a private investigator who is ...,,5.7,28 May 1989
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
852,"WONDER YEARS, THE",tt0094582,ANGEL,LIM,TV,THE UNITED STATES AIR FORCE GRANTED STOCK FOOT...,1988.0,"Comedy, Drama, Family, Romance",22 min,,An adult Kevin Arnold reminisces on his teenag...,Won 1 Golden Globe. Another 24 wins & 70 nomin...,8.3,31 Jan 1988
853,X-15,tt0055627,,APP,FILM,AIRFORCE AND NASA PROVIDED FULL COOPERATION ON...,1961,"Drama, History",107 min,Richard Donner,At the height of the Cold War during the 1960s...,,5.8,22 Dec 1961
854,"YEAR IN THE LIFE, A",tt0092488,ACTS OF FAITH,DEN,TV,THE PROJECT WAS DENIED ASSISTANCE.,1987.0,Drama,60 min,,"Joe Gardner, a child of the Depression, is a s...",Won 1 Golden Globe. Another 3 wins & 3 nominat...,8.8,16 Sep 1987
855,"YOUNG LIONS, THE",tt0052415,,APP,FILM,PENTAGON AND STATE DEPARTMENT WENT THROUGH LON...,1958,"Action, Drama, War",167 min,Edward Dmytryk,The destiny of three soldiers during World War...,Nominated for 3 Oscars. Another 1 win & 4 nomi...,7.2,02 Apr 1958


### Remarks on the updated dataset

Some values returned by the OMDB API are the string 'N/A'. We replace them with NaN.

In [5]:
# Exact match only: a regex match would also blank out longer strings that merely contain 'N/A'.
dod_movies_with_addl_data.replace('N/A', np.nan, inplace=True)

The date values (year and release date) are stored as a mix of strings and numbers. We therefore convert them to datetime.

In [6]:
dod_movies_with_addl_data['release_date'] = pd.to_datetime(dod_movies_with_addl_data['release_date'], errors='ignore', format='%Y%m%d')
dod_movies_with_addl_data['Year'] = pd.to_datetime(dod_movies_with_addl_data['Year'], errors='ignore', format='%Y')
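Note that the release dates come back in the form `07 Dec 1997`, which does not match the format `'%Y%m%d'`, and with `errors='ignore'` pandas returns the input unchanged when parsing fails (this option is also deprecated in recent pandas releases). A hedged alternative is sketched below on a small made-up frame. The names `sample`, `release` and `year_num` are illustrative, and keeping the year as a nullable integer instead of a datetime is an assumption about what the later analysis needs.

```python
import pandas as pd

# Made-up sample mirroring the mixed types seen in the table above.
sample = pd.DataFrame({
    "Year": ["1997", 1988.0, None],
    "release_date": ["07 Dec 1997", "31 Jan 1988", None],
})

# '%d %b %Y' matches strings such as '07 Dec 1997' (month names assume an English locale).
# errors='coerce' turns unparseable or missing values into NaT.
release = pd.to_datetime(sample["release_date"], format="%d %b %Y", errors="coerce")

# The Year column mixes strings and floats; convert to numbers first,
# then to a nullable integer dtype so that missing years stay missing.
year_num = pd.to_numeric(sample["Year"], errors="coerce").astype("Int64")

print(release)
print(year_num)
```

With `errors='coerce'`, rows that fail to parse are visible as `NaT` or `<NA>` and can be counted with `.isna().sum()`.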

In [7]:
dod_movies_with_addl_data

Unnamed: 0,Title,IMDB_ID,Subtitle,Status,Media Type,Remarks,Year,genre,runtime,director,plot,award,imdb_rating,release_date
0,"""1968""",Never Made,,OTH,FILM,THE FILM STARTED OUT VERY NEGATIVE FOR THE ARM...,,,,,,,,
1,"1,000 MEN AND A BABY",tt0133231,,APP,TV,VERY POSITIVE DEPICTION OF NAVY IN THIS KOREAN...,1997,Drama,96 min,Marcus Cole,A baby in a foreign land is adopted by the men...,,6.9,07 Dec 1997
2,1ST FORCE,Never Made,,OTH,FILM,INITIALLY DOD AND USMC WERE INCLINED TO SUPPOR...,,,,,,,,
3,24,tt0502209,22,APP,TV,APPROVED FILMING FOR ONE DAY WITH TWO MARINE C...,2004,"Action, Crime, Drama, Thriller",42 min,Frederick King Keller,Jack and Tony clash as they wait for the time ...,,9.0,11 May 2004
4,3RD DEGREE,tt0098469,,APP,TV,PERSONNEL APPEARED ON THIS GAME SHOW AT THE EX...,1989,"Crime, Drama, Thriller",100 min,Roger Spottiswoode,Scott Weston is a private investigator who is ...,,5.7,28 May 1989
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
852,"WONDER YEARS, THE",tt0094582,ANGEL,LIM,TV,THE UNITED STATES AIR FORCE GRANTED STOCK FOOT...,1988.0,"Comedy, Drama, Family, Romance",22 min,,An adult Kevin Arnold reminisces on his teenag...,Won 1 Golden Globe. Another 24 wins & 70 nomin...,8.3,31 Jan 1988
853,X-15,tt0055627,,APP,FILM,AIRFORCE AND NASA PROVIDED FULL COOPERATION ON...,1961,"Drama, History",107 min,Richard Donner,At the height of the Cold War during the 1960s...,,5.8,22 Dec 1961
854,"YEAR IN THE LIFE, A",tt0092488,ACTS OF FAITH,DEN,TV,THE PROJECT WAS DENIED ASSISTANCE.,1987.0,Drama,60 min,,"Joe Gardner, a child of the Depression, is a s...",Won 1 Golden Globe. Another 3 wins & 3 nominat...,8.8,16 Sep 1987
855,"YOUNG LIONS, THE",tt0052415,,APP,FILM,PENTAGON AND STATE DEPARTMENT WENT THROUGH LON...,1958,"Action, Drama, War",167 min,Edward Dmytryk,The destiny of three soldiers during World War...,Nominated for 3 Oscars. Another 1 win & 4 nomi...,7.2,02 Apr 1958


# Saving the Data

Since we need this data frame for further analysis, we store it on the hard drive in two formats: a CSV file, which is human readable, and a binary pickle file, which is quicker to read back.

In [8]:
dod_movies_with_addl_data.to_csv("military_hollywood_with_additional_data.csv", index=False)
dod_movies_with_addl_data.to_pickle("military_hollywood_with_additional_data.pkl")
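One practical difference between the two formats is that a pickle keeps column dtypes, while a CSV stores everything as text. The sketch below illustrates this with a small made-up frame written to a temporary directory. The file names `sample.csv` and `sample.pkl` are placeholders, not the files saved above.

```python
import os
import tempfile

import pandas as pd

# Made-up single-row frame with a real datetime column.
sample = pd.DataFrame({
    "Title": ["X-15"],
    "release_date": pd.to_datetime(["1961-12-22"]),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    pkl_path = os.path.join(tmp, "sample.pkl")

    sample.to_csv(csv_path, index=False)
    sample.to_pickle(pkl_path)

    # Read both files back before the temporary directory is removed.
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes)  # release_date is read back as text
print(from_pkl.dtypes)  # release_date is still a datetime column
```

When reading the CSV, passing `parse_dates=["release_date"]` to `pd.read_csv` restores the datetime column. Pickle files should only be loaded from trusted sources.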