## Data Engineering about Movies
In this project we extract movie data from an API and load it into Google BigQuery, a cloud data warehouse. The data describes how well liked each movie is, for example its popularity and vote scores.

## Required Python Libraries
First, we install a few Python libraries that help us run the Extract, Transform, Load (ETL) process. We list the libraries in a file called `requirements.txt`. The libraries are:

- pandas
- numpy
- google-cloud-bigquery
- pyarrow
- requests

To install the libraries above, we run the following pip command:

In [None]:
pip install -r requirements.txt

# Extract Transform Load Overview 

## Extract 
In this project we extract our data from an API.

## Transform
This step is optional. It is where we add columns or change the data types of columns, according to business rules.
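
As a minimal sketch of what such a transform can look like in pandas, the snippet below changes one column's data type and adds one derived column. The sample rows, the `is_well_rated` column, and the 7.0 threshold are assumptions made up for illustration; they are not business rules from this project.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the API response.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "vote_average": ["7.4", "6.1"],  # stored as text on purpose
    "vote_count": [242, 1310],
})

# Change a data type: text -> float, so that arithmetic works on the column
movies["vote_average"] = movies["vote_average"].astype(float)

# Add a column derived from an illustrative (hypothetical) business rule
movies["is_well_rated"] = movies["vote_average"] >= 7.0

print(movies.dtypes)
print(movies)
```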

## Load
This is the step where we load the data into an analytical environment. Several tools can do this; in this project we use Google BigQuery, which provides scalable storage.

## Connecting to the API with Python and getting a response

This is the code that connects to the API. It is written only once here to avoid repeating it.

In [7]:
import os

import requests

# The trending endpoint; query parameters are passed separately via `params`.
url = "https://api.themoviedb.org/3/trending/movie/day"

# Read the API token from an environment variable instead of hard-coding it.
# TMDB_BEARER_TOKEN is a variable name chosen here for illustration.
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.environ['TMDB_BEARER_TOKEN']}",
}

params = {
    "language": "en-US",
    "page": 1,  # Adjust the page number to get more results
}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # Stop early if the API returned an HTTP error

print(response.text)



{"page":1,"results":[{"adult":false,"backdrop_path":"/4MCKNAc6AbWjEsM2h9Xc29owo4z.jpg","id":866398,"title":"The Beekeeper","original_language":"en","original_title":"The Beekeeper","overview":"One man’s campaign for vengeance takes on national stakes after he is revealed to be a former operative of a powerful and clandestine organization known as Beekeepers.","poster_path":"/A7EByudX0eOzlkQ2FIbogzyazm2.jpg","media_type":"movie","genre_ids":[28,53,18],"popularity":647.361,"release_date":"2024-01-10","video":false,"vote_average":7.403,"vote_count":242},{"adult":false,"backdrop_path":"/jXJxMcVoEuXzym3vFnjqDW4ifo6.jpg","id":572802,"title":"Aquaman and the Lost Kingdom","original_language":"en","original_title":"Aquaman and the Lost Kingdom","overview":"Black Manta, still driven by the need to avenge his father's death and wielding the power of the mythic Black Trident, will stop at nothing to take Aquaman down once and for all. To defeat him, Aquaman must turn to his imprisoned brother Orm
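
The request above fetches only page 1. A rough sketch of collecting several pages is shown below. The function name `fetch_trending` and its `get` parameter are hypothetical helpers introduced for this example, and the sketch assumes the response carries a `total_pages` field next to `page` and `results`.

```python
import requests

TRENDING_URL = "https://api.themoviedb.org/3/trending/movie/day"


def fetch_trending(headers, max_pages=3, get=requests.get):
    """Collect the `results` entries from up to `max_pages` pages.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access.
    """
    movies = []
    for page in range(1, max_pages + 1):
        response = get(
            TRENDING_URL,
            headers=headers,
            params={"language": "en-US", "page": page},
            timeout=10,
        )
        response.raise_for_status()  # fail loudly on HTTP errors
        payload = response.json()
        movies.extend(payload.get("results", []))
        # Assumption: the API reports how many pages exist in `total_pages`.
        if page >= payload.get("total_pages", page):
            break
    return movies
```

Calling `fetch_trending(headers, max_pages=2)` with the `headers` dictionary from the cell above would then return a single list of movie dictionaries that can be passed to `pd.DataFrame`.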

## Understanding the data types in the response
Next we check the data types of the response data using pandas. Knowing the type of each column makes later transformations, such as the mathematical operations required by business rules, easier to get right. One has to be sure of the types of data that they are dealing with.

In [10]:
import pandas as pd

json_data = response.json()

# Convert JSON response to pandas DataFrame
df = pd.DataFrame(json_data.get("results", []))

# Display the data types
print(df.dtypes)

# Preview the first few rows
print(df.head())

adult                   bool
backdrop_path         object
id                     int64
title                 object
original_language     object
original_title        object
overview              object
poster_path           object
media_type            object
genre_ids             object
popularity           float64
release_date          object
video                   bool
vote_average         float64
vote_count             int64
dtype: object
   adult                     backdrop_path       id  \
0  False  /4MCKNAc6AbWjEsM2h9Xc29owo4z.jpg   866398   
1  False  /jXJxMcVoEuXzym3vFnjqDW4ifo6.jpg   572802   
2  False  /pWsD91G2R1Da3AKM3ymr3UoIfRb.jpg   933131   
3  False  /yOm993lsJyPmBodlYjgpPwBjXP9.jpg   787699   
4  False  /ehumsuIBbgAe1hg343oszCLrAfI.jpg  1022796   

                          title original_language  \
0                 The Beekeeper                en   
1  Aquaman and the Lost Kingdom                en   
2               Badland Hunters                ko   
3       
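
To illustrate why the data types matter for mathematical operations, here is a small sketch that keeps only the numeric columns and averages them. The first row reuses values from the response above; the second row is made up for the example.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["The Beekeeper", "Hypothetical Movie"],
    "adult": [False, False],
    "popularity": [647.361, 100.0],
    "vote_average": [7.403, 6.0],
    "vote_count": [242, 100],
})

# Keep only the integer and float columns; text and boolean columns are left out
numeric = sample.select_dtypes(include="number")

print(numeric.columns.tolist())
print(numeric.mean())
```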

## Pandas filters

Pandas filters let us keep only the rows needed to satisfy a business rule. For example, here we want the movies that were released in 2023.

In [12]:
# Filter movies released in 2023; na=False skips rows with a missing release date
df_2023 = df[df['release_date'].str.startswith('2023', na=False)]
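
The same rule can also be expressed on parsed dates rather than on text. The sketch below compares both approaches on made-up rows, one of which has no release date; the titles and dates are assumptions for illustration only.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2023-07-19", "2024-01-10", None],
})

# Text-based filter: na=False treats a missing date as "does not match"
by_prefix = sample[sample["release_date"].str.startswith("2023", na=False)]

# Date-based filter: unparseable or missing values become NaT and never equal 2023
years = pd.to_datetime(sample["release_date"], errors="coerce").dt.year
by_year = sample[years == 2023]

print(by_prefix)
print(by_year)
```

Both filters select the same rows here. The date-based form becomes more convenient once the rule involves ranges, such as a span of months.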


## Pandas Export to CSV
We now write the data extracted from the API from pandas to a CSV file, which prepares it for ingestion into BigQuery. In this cell, the movies released in 2023 are stored as a CSV file, ready to be moved to scalable storage such as Google BigQuery.

In [13]:
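import os

cur_path = os.getcwd()

file = 'movies_2023.csv'

# Make sure the output folder exists before writing to it
os.makedirs(os.path.join(cur_path, 'data_files'), exist_ok=True)

file_path = os.path.join(cur_path,'data_files',file)

# Save DataFrame to CSV
df_2023.to_csv(file_path, index=False)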
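
One thing worth checking after an export is how the columns look when the CSV is read back. The sketch below writes a single sample row (values taken from the response above) to a temporary file and reloads it; the file name `sample_movies.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({
    "title": ["The Beekeeper"],
    "genre_ids": [[28, 53, 18]],  # a Python list inside one cell
    "release_date": ["2024-01-10"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_movies.csv"
    sample.to_csv(csv_path, index=False)
    # parse_dates turns the text dates back into datetime values
    restored = pd.read_csv(csv_path, parse_dates=["release_date"])

print(restored.dtypes)
print(repr(restored.loc[0, "genre_ids"]))
```

The list in `genre_ids` comes back as plain text, so a column like this may need separate handling if its individual values matter downstream.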

## Loading Data to Big Query
We now load the data into BigQuery using Python. This is the start of the final step in the ETL process. The goal is to move the movie rating, movie, and watchability data into BigQuery; the cell below loads the `movies_2023.csv` file into the `sample_dataset.movies_2023` table.

In [14]:
from google.cloud import bigquery
import os

# Client for the project that owns the destination table
client = bigquery.Client(project='charming-autumn-407214')
target_table_1 = 'charming-autumn-407214.sample_dataset.movies_2023'

# CSV load settings: skip the header row and let BigQuery infer the schema
job_config = bigquery.LoadJobConfig(
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

# Path of the CSV file written in the previous step
cur_path = os.getcwd()
file = 'movies_2023.csv'
file_path = os.path.join(cur_path, 'data_files', file)

# Upload the file as a load job
with open(file_path, 'rb') as source_file:
    load_job = client.load_table_from_file(
        source_file,
        target_table_1,
        job_config=job_config,
    )

# Wait for the load job to finish
load_job.result()

destination_table = client.get_table(target_table_1)

print(f"You have {destination_table.num_rows} rows in your table ")

You have 12 rows in your table 
