#### Project Assignment: Phase 1 - Data Collection and MongoDB

##### Project Title: Analyzing Electric Vehicle Adoption Trends 
##### Author: Vidya Prabhu

##### Data Source: 
###### The dataset is sourced from the U.S. Department of Energy. This platform provides access to a wide range of publicly available datasets across various domains, ensuring accessibility for research, analysis, and policy-making.
###### Data includes information about electric vehicles in Washington, such as the number of vehicles, their geographical distribution, and the time periods during which the data was collected. 
###### The dataset specifically covers Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered with the Washington State Department of Licensing.

##### Step 1. Importing Libraries

In [410]:
import requests
import json

##### Step 2. Fetching Data from API

In [413]:
# URL of the  Electric Vehicle JSON data source
base_url = "https://data.wa.gov/api/views/f6w7-q2d2/rows.json?accessType=DOWNLOAD"

# Fetching the JSON data from the base URL
response = requests.get(base_url)

# Checking if the fetching request was successful
if response.status_code == 200:
    data = response.json()  # Parsing the JSON response
    print("Data successfully fetched from API.") # Success
else:
    print(f"Failed to retrieve data. HTTP Status Code: {response.status_code}") # failure

Data successfully fetched from API.


##### Step 3. Extracting relevant Data

In [416]:
# Extracting the data records
records = data.get("data", [])[:10000]  # Limiting to 10,000 records for efficiency and data analysis

# Displaying column names from metadata (if available)
columns = data.get("meta", {}).get("view", {}).get("columns", [])
column_names = [col.get("name", f"Column_{i}") for i, col in enumerate(columns)]

#printing length of records and length of column names
print(f"Extracted {len(records)} records with {len(column_names)} columns.")

# Displaying all column names
print("Extracted Column Names:")
print(column_names)

Extracted 10000 records with 28 columns.
Extracted Column Names:
['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at', 'updated_meta', 'meta', 'VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year', 'Make', 'Model', 'Electric Vehicle Type', 'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range', 'Base MSRP', 'Legislative District', 'DOL Vehicle ID', 'Vehicle Location', 'Electric Utility', '2020 Census Tract', 'Counties', 'Congressional Districts', 'WAOFM - GIS - Legislative District Boundary']


##### Step 4. Understanding Data structure, time frame, and size of data

In [419]:
# 1. Data Structure
data_structure = {
    "Type": type(records),  # Type of the data
    "Columns": len(columns),  # Number of columns from metadata
    "Records": len(records),  # Number of records in the dataset
}

# 4. Displaying the Data Attributes
print("Data Structure:")
print(f"  Type: {data_structure['Type']}")
print(f"  Number of Columns: {data_structure['Columns']}")
print(f"  Number of Records: {data_structure['Records']}")

Data Structure:
  Type: <class 'list'>
  Number of Columns: 28
  Number of Records: 10000


In [421]:
# 2. Time Frame
import datetime

# Helper function to convert timestamps to human-readable date-time format using UTC
def convert_to_datetime(timestamp):
    if timestamp != "N/A":
        return datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    return timestamp

# Extracting the minimum 'created_at' and maximum 'updated_at' across all records
start_date = min(record[3] for record in records)  # 'created_at' is at index 3
end_date = max(record[5] for record in records)  # 'updated_at' is at index 5

# Converting timestamps to human-readable format
start_date_formatted = convert_to_datetime(start_date)
end_date_formatted = convert_to_datetime(end_date)

# Creating the time frame dictionary
time_frame = {
    "Start Date": start_date_formatted,
    "End Date": end_date_formatted,
    "Data Range": f"{start_date_formatted} - {end_date_formatted}"
}

# Printing the time frame
print("Time Frame:")
print(f"  Start Date: {time_frame['Start Date']}")
print(f"  End Date: {time_frame['End Date']}")
print(f"  Data Range: {time_frame['Data Range']}")

Time Frame:
  Start Date: 2025-02-13 22:04:35
  End Date: 2025-02-13 22:08:10
  Data Range: 2025-02-13 22:04:35 - 2025-02-13 22:08:10


In [423]:
# 3. Data Size (Size of the dataset in memory)
data_size = sys.getsizeof(records)

#printing data size
print("Data Size:")
print(f"  Size in Memory: {data_size} bytes")

Data Size:
  Size in Memory: 80056 bytes


##### Step 5. Selecting columns and displaying data

In [426]:
# Function for displaying data for the first 10 records
def display_relevant_records(records, column_names):
    if records:
        for i, record in enumerate(records[:10]):  # Looping over the first 10 records
            print(f"Record {i + 1}:") # displaying the record number
            for col in column_names:
                # Getting the index of the column_names
                col_index = column_names.index(col)
                value = record[col_index] if col_index < len(record) else "N/A"  # Using index to fetch value
                print(f"{col}: {value}")
            print("-" * 50)  # Separator between records for better readability
    else:
        print("No data records found.")

# Example usage of the function
display_relevant_records(records, column_names)

Record 1:
sid: row-9exd_xzw7-2hfk
id: 00000000-0000-0000-F8E5-5FABDC1A77AA
position: 0
created_at: 1739484275
created_meta: None
updated_at: 1739484437
updated_meta: None
meta: { }
VIN (1-10): 2T3YL4DV0E
County: King
City: Bellevue
State: WA
Postal Code: 98005
Model Year: 2014
Make: TOYOTA
Model: RAV4
Electric Vehicle Type: Battery Electric Vehicle (BEV)
Clean Alternative Fuel Vehicle (CAFV) Eligibility: Clean Alternative Fuel Vehicle Eligible
Electric Range: 103
Base MSRP: 0
Legislative District: 41
DOL Vehicle ID: 186450183
Vehicle Location: POINT (-122.1621 47.64441)
Electric Utility: PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)
2020 Census Tract: 53033023604
Counties: 3009
Congressional Districts: 9
WAOFM - GIS - Legislative District Boundary: 49
--------------------------------------------------
Record 2:
sid: row-njxk_i4aj-akam
id: 00000000-0000-0000-3D39-92D265CBF77E
position: 0
created_at: 1739484275
created_meta: None
updated_at: 1739484437
updated_meta: None
meta: { }
VIN (1

In [428]:
#Understanding the data

##### Data definition/ Basic characteristics of data:

###### 1. sid: Unique system identifier for the row.
###### 2. id: Unique identifier for the record (UUID format).
###### 3. position: Position of the record in the dataset.
###### 4. created_at: Timestamp (Unix format) representing when the record was created.
###### 5. created_meta: Metadata associated with the creation of the record (if applicable).
###### 6. updated_at: Timestamp (Unix format) representing when the record was last updated.
###### 7. updated_meta: Metadata associated with the last update of the record (if applicable).
###### 8. meta: Additional metadata stored as a JSON object.
###### 9. vin (Vehicle Identification Number): Unique identifier for the vehicle.
###### 10. Owner Name:  Name of the registered owner of the vehicle.
###### 11. City:  City where the vehicle is registered.
###### 12. State: State where the vehicle is registered.
###### 13. ZIP Code: ZIP code of the vehicle's registration location.
###### 14. Model Year: Manufacturing year of the vehicle.
###### 15. Make: Manufacturer of the vehicle (e.g., TOYOTA, TESLA).
###### 16. Model: Model name of the vehicle (e.g., RAV4, Model 3).
###### 17. Vehicle Type:Type of electric vehicle (e.g., Battery Electric Vehicle (BEV), Plug-in Hybrid Electric Vehicle (PHEV)).
###### 18. Clean Alternative Fuel Eligibility: Whether the vehicle qualifies as a clean alternative fuel vehicle.
###### 19. Electric Range (Miles): Maximum distance the vehicle can travel on electric power.
###### 20. Fuel Type Code: Fuel type associated with the vehicle (e.g., "0" for electric vehicles).
###### 21. Electric Utility: Electric utility company servicing the vehicle’s registered location.
###### 22. Geolocation: Coordinates (latitude, longitude) of the vehicle's registered address.
###### 23. Census Tract: Census tract information for demographic analysis.
###### 24. Other Identifiers: Additional codes related to the dataset, such as energy providers, tax codes, etc.

##### Step 6. Validating Data (Checking for Missing or Incorrect Values)

In [433]:
# Initializing lists for validation
missing_values = []
incorrect_values = []

# Iterating through each record and checking for missing/incorrect values
for idx, record in enumerate(records):
        for col in column_names:
            column_name = column_names[col_idx] if col_idx < len(column_names) else f"Column_{col_idx}"

            # Checking for missing values
            if value is None or value == "":
                missing_values.append((idx, column_name))
    
            # Checking for incorrect values (e.g., negative values for rates)
            if isinstance(value, (int, float)) and value < 0:
                incorrect_values.append((idx, column_name, value))

#printing result
print("Data validation completed.")
print(f"Total Missing Values Found: {len(missing_values)}")
print(f"Total Incorrect Values Found: {len(incorrect_values)}")

Data validation completed.
Total Missing Values Found: 0
Total Incorrect Values Found: 0


##### Step 7. Writing JSON data to MongoDB Atlas

In [436]:
from pymongo import MongoClient

# Connecting to MongoDB 
client = MongoClient("mongodb+srv://vidyaprabhu96:Myuniverse20@cluster0.6x4bn.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0") 
db = client["ElectricVehiclesDB"]  # Creating database name
collection = db["ElectricVehiclesDataCollections"]  # Creating collection name

# Converting data into a list of dictionaries. Without converting it to dictionaries format we cannot write the data to mongodb
json_data = [dict(zip(column_names,record)) for record in records]

# Inserting data into MongoDB
collection.insert_many(json_data)

print(f"Successfully inserted {len(json_data)} records into MongoDB.") #success

Successfully inserted 10000 records into MongoDB.


##### Step 8. Validating Data written in MongoDB

In [439]:
# Retrieving and printing a sample record to verify insertion
sample_record = collection.find_one()

if sample_record:
    # Convert ObjectId to string for better readability
    sample_record["_id"] = str(sample_record["_id"])

    # Pretty-print the JSON data
    print("Sample Record from MongoDB:\n")
    print(json.dumps(sample_record, indent=4))  # Indented JSON format
else:
    print("No records found in MongoDB.")

Sample Record from MongoDB:

{
    "_id": "67b36f02e23a723cde1e90de",
    "sid": "row-9exd_xzw7-2hfk",
    "id": "00000000-0000-0000-F8E5-5FABDC1A77AA",
    "position": 0,
    "created_at": 1739484275,
    "created_meta": null,
    "updated_at": 1739484437,
    "updated_meta": null,
    "meta": "{ }",
    "VIN (1-10)": "2T3YL4DV0E",
    "County": "King",
    "City": "Bellevue",
    "State": "WA",
    "Postal Code": "98005",
    "Model Year": "2014",
    "Make": "TOYOTA",
    "Model": "RAV4",
    "Electric Vehicle Type": "Battery Electric Vehicle (BEV)",
    "Clean Alternative Fuel Vehicle (CAFV) Eligibility": "Clean Alternative Fuel Vehicle Eligible",
    "Electric Range": "103",
    "Base MSRP": "0",
    "Legislative District": "41",
    "DOL Vehicle ID": "186450183",
    "Vehicle Location": "POINT (-122.1621 47.64441)",
    "Electric Utility": "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)",
    "2020 Census Tract": "53033023604",
    "Counties": "3009",
    "Congressional Districts"

In [441]:
#counting number of documents loaded in mongodb
doc_count = collection.count_documents({})
print(f"Total records in MongoDB: {doc_count}")

Total records in MongoDB: 10000


##### Step 9. Writing data to a local JSON file

In [444]:
from bson import ObjectId # Importing ObjectId from BSON for handling MongoDB object IDs

# Custom function to handle ObjectId serialization
def objectid_converter(obj):
    if isinstance(obj, ObjectId):
        return str(obj)  # Converting ObjectId to a string for JSON compatibility
    raise TypeError("Type not serializable") # Raising an error if an unsupported type is encountered


try:
    # Opening the JSON file in write mode
    with open("ElectricVehiclesData.json", "w") as json_file:
        # Serializing json_data and handle ObjectId using the custom converter
        json.dump(json_data, json_file, indent=4, default=objectid_converter)  # Using custom converter
    
    print("Data successfully written to Electricvehicledata.json") # Success message
except TypeError as e:
    # Handling serialization errors and printing an error message if writing is unsuccessful
    print(f"Error while writing data to JSON: {e}")

Data successfully written to Electricvehicledata.json
