# <center>  $\color{indigo}{\text{Data preparation }}$ </center>

## <center>  $\color{indigo}{\text{Bellevue University. }}$ </center>
## <center>  $\color{indigo}{\text{DSC 540 }}$ </center>
#### <center>  $\color{indigo}{\text{Movie Data Analysis}}$ </center>
### <center>  $\color{indigo}{\text{ Project Milestone - 3. }}$ </center>
### <center>  $\color{indigo}{\text{ SAMUEL ABOYE. }}$ </center>

## **Integration with External API**

#### **Overview**

After cleaning the movie flat-file data from Kaggle, we use the cleaned movie titles to fetch additional information from an external movie-related API. This enriches our dataset with details such as ratings from several sources, cast and crew, and other metadata not originally included in it.

### **API Description**

**Description of data source:** The OMDb API (The Open Movie Database) is a RESTful web service for obtaining movie information. It offers detailed data, including titles, release years, ratings, plot descriptions, and poster images for movies and TV series. To expand the dataset, we use the movie titles extracted from the 'tmdb_5000_credits.csv' file as search queries for the OMDb API. The movie data returned by the API is then saved to a new CSV file, building up a broader collection of movie metadata.

The API is available at [OMDb API](https://www.omdbapi.com/). Its responses include cast details, ratings from multiple sources, and other metadata. It was chosen for its broad coverage and ease of use.

#### **Workflow**

### **Step 1: Preparing the Data**
- **Data Preparation**: We start by ensuring that the movie titles in our dataset are formatted correctly and are suitable for API queries. This involves trimming whitespace and percent-encoding spaces and special characters (such as `#`, `&`, or `?`) so that each title can be placed safely in a URL query string.
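
A minimal sketch of this preparation step, assuming titles are trimmed, internal whitespace is collapsed, and the result is percent-encoded with the standard library's `urllib.parse.quote_plus`. The helper names `prepare_title` and `encode_title` are hypothetical and do not appear in the notebook.

```python
import re
from urllib.parse import quote_plus


def prepare_title(raw_title: str) -> str:
    """Trim a title and collapse runs of whitespace (hypothetical helper)."""
    return re.sub(r"\s+", " ", raw_title).strip()


def encode_title(raw_title: str) -> str:
    """Percent-encode a cleaned title so it is safe inside a URL query string."""
    return quote_plus(prepare_title(raw_title))


# Spaces become '+', and reserved characters such as '#' are escaped.
print(encode_title("  the dark   knight rises "))
print(encode_title("#horror"))
```

When the request is sent with `requests.get(url, params={...})`, the library performs an equivalent encoding, so an explicit helper like this is mainly useful for inspecting or logging the final query.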

### **Step 2: API Requests**
- **Sending Requests**: For each movie title, we send a request to the API. The requests are sent using the Python `requests` library. We handle errors and exceptions to ensure that our script is robust against network issues or data discrepancies.
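
One way to make the request step more tolerant of transient network failures is to retry a failed call a few times with an increasing delay. The sketch below is an illustration under stated assumptions, not the notebook's method: `fetch_with_retries` is a hypothetical wrapper, and it assumes the wrapped fetcher returns `None` on failure, as `get_movie_details` does later in this notebook. A stand-in fetcher is used so that the example runs without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], Optional[dict]],
    title: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch(title) until it returns a dict or the attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(title)
        if result is not None:
            return result
        # Wait before the next attempt, doubling the delay each time.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(title: str) -> Optional[dict]:
    calls["count"] += 1
    return {"Title": title} if calls["count"] >= 3 else None


delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "Avatar", sleep=delays.append)
print(result, delays)
```

With the real function, the call could look like `fetch_with_retries(lambda t: get_movie_details(api_key, t), title)`.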

### **Step 3: Data Integration**
- **Integrating API Data**: The responses from the API are parsed and integrated with our existing DataFrame. We ensure that the additional data aligns with our dataset's structure and fill in missing values where API responses are not available.
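
The integration step can be sketched as a left join on a normalized title key. This is an illustration only: the `merge_key` column is a hypothetical helper column, and the two small DataFrames are stand-ins that reuse a few values shown elsewhere in this notebook. A left join keeps every row of the cleaned dataset, and `indicator=True` marks the titles for which no API record was found.

```python
import pandas as pd

# Tiny stand-ins for the cleaned Milestone 2 data and the API responses.
df_titles = pd.DataFrame({
    "Original_Title": ["avatar", "spectre", "john carter"],
    "Budget": [237000000, 245000000, 260000000],
})
df_api = pd.DataFrame({
    "Title": ["Avatar", "Spectre"],
    "imdbID": ["tt0499549", "tt2379713"],
})

# Build a case-insensitive, whitespace-trimmed key on both sides.
df_titles["merge_key"] = df_titles["Original_Title"].str.strip().str.lower()
df_api["merge_key"] = df_api["Title"].str.strip().str.lower()

# Left join: unmatched titles keep their row and get missing API fields.
merged = df_titles.merge(df_api, on="merge_key", how="left", indicator=True)
print(merged[["Original_Title", "imdbID", "_merge"]])
```

Matching on titles alone can pair different films that share a name, so a stricter key (for example title plus release year) may be preferable where both sources provide it.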

### **Step 4: Verify Data Structure for Integration**
In this step, we'll conduct a basic metadata check to ensure that the newly cleaned dataset is properly structured for merging with the data cleaned in Milestone 2. This involves verifying the dataset's dimensions, column types, and the presence of key identifiers necessary for a successful merge.
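
A basic metadata check of this kind could be sketched as follows. The helper `check_merge_readiness` is hypothetical and the sample frame is illustrative. It reports the dimensions, the column types, and whether the chosen key column exists, together with how many of its values are missing or repeated.

```python
import pandas as pd


def check_merge_readiness(df: pd.DataFrame, key: str) -> dict:
    """Summarise shape, dtypes, and key quality for a DataFrame (hypothetical helper)."""
    has_key = key in df.columns
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "has_key": has_key,
        # Key statistics are only meaningful when the key column exists.
        "missing_keys": int(df[key].isna().sum()) if has_key else None,
        "duplicate_keys": int(df[key].duplicated().sum()) if has_key else None,
    }


sample = pd.DataFrame({
    "Title": ["Avatar", "Spectre", None, "Avatar"],
    "Year": [2009, 2015, 2012, 2009],
})
report = check_merge_readiness(sample, "Title")
print(report)
```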

In [1]:
import json
import requests
import pandas as pd

- Load the cleaned dataset from a CSV file into a DataFrame, `df_titles`. This dataset, prepared in Milestone 2, contains movie titles among other data fields.

In [2]:
df_titles = pd.read_csv('cleaned_data.csv')

Display the first five rows of the DataFrame to inspect the data. This is a quick check that the structure and contents are as expected after the cleaning done in Milestone 2.

In [3]:
df_titles.head(5)

Unnamed: 0,Budget,Genres,Homepage,ID,Keywords,Original_Language,Original_Title,Overview,Popularity,Production_Companies,Production_Countries,Release_Date,Revenue,Runtime,Spoken_Languages,Status,Tagline,Title,Vote_Average,Vote_Count
0,237000000,"Action, Adventure, Fantasy, Science Fiction",http://www.avatarmovie.com/,19995,"culture clash, future, space war, space colony...",en,avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"Ingenious Film Partners, Twentieth Century Fox...","United States of America, United Kingdom",2009-12-10,2787965087,162.0,"English, Español",Released,enter the world of pandora.,Avatar,7.2,11800
1,300000000,"Adventure, Fantasy, Action",http://disney.go.com/disneypictures/pirates/,285,,en,pirates of the caribbean: at world's end,"Captain Barbossa, long believed to be dead, ha...",139.082615,"Walt Disney Pictures, Jerry Bruckheimer Films,...",United States of America,2007-05-19,961000000,169.0,English,Released,"at the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"Action, Adventure, Crime",http://www.sonypictures.com/movies/spectre/,206647,"spy, based on novel, secret agent, sequel, mi6...",en,spectre,A cryptic message from Bond’s past sends him o...,107.376788,"Columbia Pictures, Danjaq, B24","United Kingdom, United States of America",2015-10-26,880674609,148.0,"Français, English, Español, Italiano, Deutsch",Released,a plan no one escapes,Spectre,6.3,4466
3,250000000,"Action, Crime, Drama, Thriller",http://www.thedarkknightrises.com/,49026,"dc comics, crime fighter, terrorist, secret id...",en,the dark knight rises,Following the death of District Attorney Harve...,112.31295,"Legendary Pictures, Warner Bros., DC Entertain...",United States of America,2012-07-16,1084939099,165.0,English,Released,the legend ends,The Dark Knight Rises,7.6,9106
4,260000000,"Action, Adventure, Science Fiction",http://movies.disney.com/john-carter,49529,"based on novel, mars, medallion, space travel,...",en,john carter,"John Carter is a war-weary, former military ca...",43.926995,Walt Disney Pictures,United States of America,2012-03-07,284139100,132.0,English,Released,"lost in our world, found in another.",John Carter,6.1,2124


In [4]:
#OMDb API key
api_key = 'XXXXXXXXXXXXXXXXXXXXXX'

In [5]:
# Define the function to get movie details from OMDb API
def get_movie_details(api_key, title):
    """
    Fetches detailed movie data from the OMDb API using a movie title.

    This function queries the OMDb API to retrieve detailed information about a movie specified by its title. 
    The function requires an API key for OMDb API and the title of the movie as inputs. It returns a dictionary 
    containing various details about the movie if the API call is successful. If the API call fails, it returns None.

    Parameters:
        api_key (str): The API key required to authenticate requests to the OMDb API.
        title (str): The title of the movie for which details are to be fetched.

    Returns:
        dict or None: A dictionary containing movie details if the API call is successful, None otherwise.

    Notes:
        requests.exceptions.RequestException (network problems, HTTP error statuses, etc.) is caught
        inside the function: the error is printed and None is returned, so it does not reach the caller.

    Usage:
        api_key = 'your_api_key_here'
        movie_title = 'Inception'
        movie_details = get_movie_details(api_key, movie_title)
        if movie_details:
            print(movie_details)
        else:
            print("Failed to fetch movie details")
    """
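    # Note: the title is interpolated as-is. Characters such as '#', '&' or '?' are not
    # reliably percent-encoded here, so titles containing them can yield a malformed query.
    url = f"https://www.omdbapi.com/?t={title}&apikey={api_key}"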
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError for bad responses
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None


Iterate over movie titles in the DataFrame 'df_titles' to fetch details from the OMDb API

In [6]:

# Initialize an empty list to store the movie data dictionaries returned by the API
movies_data = []

# Loop through each movie title in the 'Original_Title' column of the DataFrame
for title in df_titles['Original_Title']:
    # Call the get_movie_details function with the API key and the current title
    movie_details = get_movie_details(api_key, title)
    
    # Check if the API call returned data
    if movie_details:
        # If data is returned, append the dictionary to the movies_data list
        movies_data.append(movie_details)
    else:
        # If no data is returned, print an error message for this specific title
        print(f"Data for {title} could not be fetched.")

# After collecting all movie data, convert the list of dictionaries into a DataFrame
df_movies = pd.DataFrame(movies_data)

# Save the newly created DataFrame containing all movie data to a CSV file
df_movies.to_csv('open_movie_data.csv', index=False)

# Print a confirmation message with the number of movies for which data was fetched and saved
print(f"Data for {len(df_movies)} movies fetched and saved to open_movie_data.csv.")

An error occurred: 401 Client Error: Unauthorized for url: https://www.omdbapi.com/?t=what%20the%20#$*!%20do%20we%20(k)now!?&apikey=XXXXXXXX&t
Data for what the #$*! do we (k)now!? could not be fetched.
An error occurred: 401 Client Error: Unauthorized for url: https://www.omdbapi.com/?t=#horror&apikey=XXXXXXXX&t
Data for #horror could not be fetched.
Data for 4801 movies fetched and saved to open_movie_data.csv.
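
Both failed titles in the output above contain a `#` character. A plausible explanation, not confirmed by the API's response, is that `#` starts the URL fragment, so everything after it (including the `apikey` parameter) is never sent as part of the query and the server answers with 401 Unauthorized. The sketch below uses only the standard library to show the difference between naive interpolation and encoded parameters. `DEMO_KEY` is a placeholder, not a real key.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.omdbapi.com/"
API_KEY = "DEMO_KEY"  # placeholder, not a real key
title = "#horror"

# Naive interpolation, as with an f-string: '#' begins the URL fragment.
naive_url = f"{BASE_URL}?t={title}&apikey={API_KEY}"
naive_query = parse_qs(urlsplit(naive_url).query, keep_blank_values=True)

# Encoding the parameters first keeps both of them in the query string.
safe_url = f"{BASE_URL}?{urlencode({'t': title, 'apikey': API_KEY})}"
safe_query = parse_qs(urlsplit(safe_url).query)

print(naive_query)  # the 'apikey' parameter is absent from the query
print(safe_query)   # both 't' and 'apikey' are present
```

With the `requests` library, passing the values through the `params` argument, as in `requests.get(BASE_URL, params={"t": title, "apikey": api_key})`, delegates this encoding to the library.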


### **Ethical Implications of Data Transformation**

When transforming and cleaning this movie dataset, several steps were taken to enhance data quality and usability, including standardizing headers and text formats, parsing JSON fields into readable formats, removing duplicates, and formatting date fields. Such processes are essential to making data analysis tasks more straightforward and ensuring accuracy.

#### **Ethical Considerations**

### **Data Changes**
- **Textual Data and Formats**: We standardized textual data and reformatted dates to ensure uniformity and ease of analysis.
- **JSON Structures**: Parsing JSON structures makes the data more accessible but may simplify the nuances of data representation.
- **Duplicate Removal**: While removing duplicates prevents skewed statistical analyses, it's crucial to ensure that this does not inadvertently remove valid data that appears similar.

### **Legal and Regulatory Guidelines**
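We are not aware of specific legal restrictions on analyzing this publicly available dataset. Even so, the terms of use of the data sources (Kaggle and the OMDb API) should be reviewed before the data is redistributed, and general data handling and privacy principles should always be considered to maintain trust and integrity in data management.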

### **Transformation Risks**
- **Over-Cleaning**: There is a risk that over-cleaning leads to the loss of critical data, particularly when automated processes remove what appear to be duplicates or outliers without manual verification.

### **Assumptions**
- **Duplicate Identifiers**: The assumption that duplicate IDs always indicate duplicate entries may not hold if there were errors in data collection or processing.
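
To test this assumption before removing anything, repeated IDs can be inspected rather than dropped outright. The sketch below is illustrative: the small DataFrame is made up for the example, and its last row is a deliberately conflicting record. It separates exact duplicates, which are safe to drop, from rows that share an ID but differ in content and therefore deserve manual review.

```python
import pandas as pd

# Illustrative data: the last row is a made-up record that reuses an existing ID.
df = pd.DataFrame({
    "ID": [19995, 285, 19995, 206647, 285],
    "Title": [
        "Avatar",
        "Pirates of the Caribbean: At World's End",
        "Avatar",
        "Spectre",
        "Conflicting example record",
    ],
})

# All rows whose ID appears more than once (every occurrence kept for review).
shared_id = df[df.duplicated(subset="ID", keep=False)]

# Exact duplicates: every column matches an earlier row, so dropping loses nothing.
exact = df[df.duplicated(keep="first")]

# IDs that map to more than one distinct title: candidates for manual verification.
conflicting_ids = (
    shared_id.groupby("ID")["Title"].nunique().loc[lambda s: s > 1].index.tolist()
)
print(conflicting_ids)
```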

### **Data Sourcing/Credibility**
- **Source Credibility**: The dataset was sourced from a public movie database, which is presumed to be credible. However, verification against multiple sources is recommended to enhance data reliability.

### **Ethical Acquisition**
- **Public Data**: The dataset is assumed to be ethically sourced, containing publicly available information about movies without any personal data.

### **Mitigation of Risks**
- **Transparency and Verification**: Maintain transparency about the data transformations and provide access to the raw data for verification purposes.
- **Best Practices**: Regularly update data handling practices to comply with emerging best practices and ethical standards.


Load the API Data

In [7]:
# Load the data from the CSV file
df_movies = pd.read_csv('open_movie_data.csv')

In [8]:
# Display the first few rows of the DataFrame to understand its structure
df_movies.head()

Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,Plot,...,imdbVotes,imdbID,Type,DVD,BoxOffice,Production,Website,Response,Error,totalSeasons
0,Avatar,2009,PG-13,18 Dec 2009,162 min,"Action, Adventure, Fantasy",James Cameron,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver",A paraplegic Marine dispatched to the moon Pan...,...,1384939,tt0499549,movie,10 Feb 2016,"$785,221,649",,,True,,
1,Pirates of the Caribbean: At World's End,2007,PG-13,25 May 2007,169 min,"Action, Adventure, Fantasy",Gore Verbinski,"Ted Elliott, Terry Rossio, Stuart Beattie","Johnny Depp, Orlando Bloom, Keira Knightley","Captain Barbossa, Will Turner and Elizabeth Sw...",...,691869,tt0449088,movie,01 Jan 2014,"$309,420,425",,,True,,
2,Spectre,2015,PG-13,06 Nov 2015,148 min,"Action, Adventure, Thriller",Sam Mendes,"John Logan, Neal Purvis, Robert Wade","Daniel Craig, Christoph Waltz, Léa Seydoux",A cryptic message from James Bond's past sends...,...,466200,tt2379713,movie,24 Jul 2016,"$200,074,609",,,True,,
3,The Dark Knight Rises,2012,PG-13,20 Jul 2012,164 min,"Action, Drama, Thriller",Christopher Nolan,"Jonathan Nolan, Christopher Nolan, David S. Goyer","Christian Bale, Tom Hardy, Anne Hathaway","Eight years after the Joker's reign of chaos, ...",...,1824185,tt1345836,movie,07 Jan 2014,"$448,149,584",,,True,,
4,John Carter,2012,PG-13,09 Mar 2012,132 min,"Action, Adventure, Sci-Fi",Andrew Stanton,"Andrew Stanton, Mark Andrews, Michael Chabon","Taylor Kitsch, Lynn Collins, Willem Dafoe","Transported to Barsoom, a Civil War vet discov...",...,286278,tt0401729,movie,01 Jan 2014,"$73,078,100",,,True,,


In [9]:
# Print the information about the DataFrame to see column data types and non-null counts
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4801 entries, 0 to 4800
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         4557 non-null   object 
 1   Year          4557 non-null   object 
 2   Rated         4418 non-null   object 
 3   Released      4523 non-null   object 
 4   Runtime       4519 non-null   object 
 5   Genre         4550 non-null   object 
 6   Director      4494 non-null   object 
 7   Writer        4465 non-null   object 
 8   Actors        4541 non-null   object 
 9   Plot          4514 non-null   object 
 10  Language      4542 non-null   object 
 11  Country       4552 non-null   object 
 12  Awards        3911 non-null   object 
 13  Poster        4517 non-null   object 
 14  Ratings       4557 non-null   object 
 15  Metascore     3914 non-null   float64
 16  imdbRating    4514 non-null   float64
 17  imdbVotes     4519 non-null   object 
 18  imdbID        4557 non-null 

In [10]:
# Check for missing values in each column
df_movies.isnull().sum()

Title            244
Year             244
Rated            383
Released         278
Runtime          282
Genre            251
Director         307
Writer           336
Actors           260
Plot             287
Language         259
Country          249
Awards           890
Poster           284
Ratings          244
Metascore        887
imdbRating       287
imdbVotes        282
imdbID           244
Type             244
DVD              576
BoxOffice        863
Production      4778
Website         4801
Response           0
Error           4557
totalSeasons    4761
dtype: int64
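
The counts above suggest that 244 rows are error payloads rather than movie records: `Title` is missing in 244 rows, and `Error` is populated in 4801 - 4557 = 244 rows. A minimal sketch for separating the two groups is shown below, assuming the `Response` column holds either booleans or the strings "True"/"False". The small DataFrame and its error message are illustrative stand-ins for `open_movie_data.csv`.

```python
import pandas as pd

# Illustrative stand-in for the data loaded from open_movie_data.csv.
df_movies = pd.DataFrame({
    "Title": ["Avatar", None, "Spectre"],
    "imdbID": ["tt0499549", None, "tt2379713"],
    "Response": ["True", "False", "True"],
    "Error": [None, "Movie not found!", None],
})

# Normalise the flag so that both booleans and strings are handled.
found_mask = df_movies["Response"].astype(str).str.strip().str.lower() == "true"

# Successful lookups: the Error column carries no information here.
df_found = df_movies[found_mask].drop(columns=["Error"])

# Failed lookups: keep them aside to review or retry the titles.
df_not_found = df_movies[~found_mask]

print(len(df_found), len(df_not_found))
```

Keeping the failed lookups in a separate frame, instead of discarding them, makes it possible to report which titles could not be enriched.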