### Importing Required Libraries
This section imports the libraries the notebook relies on:

- requests: A library used to send HTTP requests to interact with APIs and fetch data from web sources.

- pandas: A powerful library for data manipulation and analysis, commonly used to handle structured datasets.

- sys: Provides access to system-specific parameters and functions, allowing interaction with the Python runtime environment.

- os: A library that facilitates interaction with the operating system, such as reading and writing files or managing directories.

In [1]:
import requests
import pandas as pd
import sys
import os

### Fetching Data from an API and Saving to CSV
In this section, we demonstrate how to retrieve data from a public API, process it using Python, and save it to a CSV file for further analysis.

#### Dataset URL:

We use the NYC Open Data API endpoint to fetch data about motor vehicle collisions in New York City.

#### Request Parameters:

We limit the response to 200,000 records using the "$limit" parameter in the request.
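
A single request for 200,000 rows produces a large response. As a minimal sketch, assuming the endpoint follows the usual Socrata SODA query conventions (`$limit`, `$offset`, `$order`), the same rows could be requested in smaller pages. The helper name `page_params` and the page size of 50,000 are illustrative choices and do not appear in the original notebook.

```python
def page_params(total_rows, page_size):
    """Yield one params dict per page, covering total_rows rows in order."""
    for offset in range(0, total_rows, page_size):
        yield {
            # The last page may be smaller than page_size
            "$limit": min(page_size, total_rows - offset),
            "$offset": offset,
            # A stable sort order keeps pages from overlapping or skipping rows
            "$order": "collision_id",
        }


pages = list(page_params(total_rows=200000, page_size=50000))
print(len(pages))   # number of requests needed
print(pages[-1])    # parameters for the final page
```

Each dict could be passed as `params` to `requests.get()`, with the resulting records concatenated before building the DataFrame.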

#### Steps in the Code:

- Send a GET Request: Use the requests.get() method to fetch the data.
- Check for Success: If the request succeeds (status code 200), parse the JSON response body into Python objects.
- Create a DataFrame: Load the JSON data into a pandas DataFrame for manipulation and analysis.
- Preview Data: Display the first few rows using df.head().
- Save to CSV: Save the DataFrame to a CSV file named API_data.csv in the ../data/ directory.

#### Error Handling:

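If the API returns a status code other than 200, that code is printed for debugging. Network-level failures such as timeouts or connection errors are not caught by this check.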

In [2]:
# URL of the dataset (API endpoint)
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"

# Parameters to limit the response to 200,000 records
params = {
    "$limit": 200000
}

# Send a GET request to the API
response = requests.get(url, params=params)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()  # Parse the JSON body into a list of records
    df = pd.DataFrame(data)  # Create a pandas DataFrame from the records
    print(df.head())  # Display the first few records
    # Make sure the output directory exists, then save the DataFrame to a CSV file
    os.makedirs('../data', exist_ok=True)
    df.to_csv('../data/API_data.csv', index=False, encoding='utf-8')
else:
    # If the request fails, print the error code
    print(f"Error in the request: {response.status_code}")

                crash_date crash_time           on_street_name  \
0  2021-09-11T00:00:00.000       2:39    WHITESTONE EXPRESSWAY   
1  2022-03-26T00:00:00.000      11:45  QUEENSBORO BRIDGE UPPER   
2  2022-06-29T00:00:00.000       6:55       THROGS NECK BRIDGE   
3  2021-09-11T00:00:00.000       9:35                      NaN   
4  2021-12-14T00:00:00.000       8:13          SARATOGA AVENUE   

  off_street_name number_of_persons_injured number_of_persons_killed  \
0       20 AVENUE                         2                        0   
1             NaN                         1                        0   
2             NaN                         0                        0   
3             NaN                         0                        0   
4  DECATUR STREET                         0                        0   

  number_of_pedestrians_injured number_of_pedestrians_killed  \
0                             0                            0   
1                             0           
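
The request cell above only reacts to non-200 status codes. A slightly more defensive variant, sketched here under the assumption that a 30-second timeout is acceptable, also catches connection errors and timeouts. The function name `fetch_json` is hypothetical. `raise_for_status()` and `requests.exceptions.RequestException` are part of the requests library.

```python
import requests


def fetch_json(url, params=None, timeout=30):
    """Return the parsed JSON body, or None if the request fails."""
    try:
        response = requests.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
        return response.json()
    except requests.exceptions.RequestException as exc:
        # Covers HTTP errors, timeouts, and connection failures
        print(f"Error in the request: {exc}")
        return None
```

It could be called as `data = fetch_json(url, params)`, with a `None` check before the DataFrame is built.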

### Data Cleaning and Preprocessing
This section focuses on preparing the dataset for analysis by cleaning and filtering the data. The steps include loading the data, handling missing values, correcting inconsistencies, and removing unnecessary columns.

#### Steps in the Code:

1. Load the Dataset:

- Retrieve the absolute file path for API_data.csv.
- Use pandas to load the CSV file into a DataFrame.

2. Display Settings:

- Configure pandas to display all columns to ensure comprehensive visualization during analysis.

3. Convert Columns to DateTime Format:

- Convert the crash_date and crash_time columns to datetime format using the pd.to_datetime() method.
- Handle invalid parsing by setting such values to NaT (Not a Time).

4. Fix Inconsistent Values:

- Standardize the borough column by removing extra spaces and capitalizing the first letter of each word.

5. Filter and Clean Data:

- Keep only rows with valid crash_date values from the year 2021 or later.
- Remove duplicates based on the collision_id column, which is assumed to uniquely identify each collision.

6. Column Adjustments:

- Simplify the crash_date column to include only the date (drop the time part).
- Remove columns not needed for the analysis: the vehicle type and contributing factor fields for vehicles 3 through 5, and cross_street_name.

7. Handle Missing Values:

- Drop every row that has a missing value in any remaining column, so the cleaned dataset contains no nulls.

8. Add a New Column:

- Insert a city column with the value "New York" for all rows.

9. Summary Information:

- Print the null count per column, the number of duplicated rows, and the final row count as quality checks.

10. Save Cleaned Data:

- Export the cleaned dataset to a new CSV file, API_data_Cleaned.csv, for further analysis.
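
Steps 3 to 5 can be tried on a handful of rows before running them on the full file. The sketch below uses made-up sample values (not taken from the real dataset) to show how `errors='coerce'` turns an unparseable date into NaT, how the borough names are normalized, and how the filter and de-duplication behave.

```python
import pandas as pd

# Hypothetical sample rows for illustration only
sample = pd.DataFrame({
    "collision_id": [1, 2, 2, 3],
    "crash_date": ["2021-09-11T00:00:00.000", "2022-03-26T00:00:00.000",
                   "2022-03-26T00:00:00.000", "not a date"],
    "borough": [" queens", "BROOKLYN ", "BROOKLYN ", "staten island"],
})

# Unparseable dates become NaT instead of raising an error
sample["crash_date"] = pd.to_datetime(sample["crash_date"], errors="coerce")

# Trim whitespace and title-case the borough names
sample["borough"] = sample["borough"].str.strip().str.title()

# Keep valid dates from 2021 onward, then one row per collision_id
mask = sample["crash_date"].notna() & (sample["crash_date"].dt.year >= 2021)
cleaned = sample[mask].drop_duplicates(subset="collision_id").copy()

print(cleaned)
```

The row with the invalid date is removed by the `notna()` filter, and the repeated `collision_id` is kept only once.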




In [3]:
# Get the absolute path of the file
file_path = os.path.abspath('../data/API_data.csv')

# Load the CSV file using pandas
data = pd.read_csv(file_path)

# Set pandas to display all columns
pd.set_option('display.max_columns', None)

# Convert `crash_date` and `crash_time` to datetime format
# Invalid values are coerced to NaT (Not a Time) instead of raising an error
data['crash_date'] = pd.to_datetime(data['crash_date'], errors='coerce')
# With only '%H:%M' supplied, pandas fills in the placeholder date 1900-01-01
data['crash_time'] = pd.to_datetime(data['crash_time'], format='%H:%M', errors='coerce')

# Fix inconsistent values in the `borough` column (strip whitespace, title-case)
data['borough'] = data['borough'].str.strip().str.title()

# Keep only rows with a valid crash date from 2021 or later
# .copy() makes the result an independent DataFrame rather than a slice,
# which keeps the later column assignments unambiguous
data = data[data['crash_date'].notna() & (data['crash_date'].dt.year >= 2021)].copy()

# Remove duplicates based on the `collision_id` column (assumed unique per collision)
data = data.drop_duplicates(subset='collision_id')

# Convert `crash_date` to just the date (drop the time part)
data['crash_date'] = data['crash_date'].dt.date

# Drop unnecessary columns
data = data.drop(['vehicle_type_code_5', 'contributing_factor_vehicle_5',
                  'vehicle_type_code_4', 'contributing_factor_vehicle_4',
                  'vehicle_type_code_3', 'contributing_factor_vehicle_3',
                  'cross_street_name'], axis=1)

print("FILTERED AND CLEANED DATA: \n")

# Drop rows with any missing values
data = data.dropna()

# Add a `city` column with the value "New York"
data['city'] = "New York"

# Print summary information about null values and duplicates
print(f"The total of Null data is: \n{data.isnull().sum()}\n")
print(f"The total of duplicated data is: {data.duplicated().sum()}\n")
print(f"Data: {data.shape[0]} rows\n")

# Save the cleaned data to a new CSV file
data.to_csv('../data/API_data_Cleaned.csv', index=False, encoding='utf-8')
print("File Cleaned Successfully")

FILTERED AND CLEANED DATA: 

The total of Null data is: 
crash_date                       0
crash_time                       0
on_street_name                   0
off_street_name                  0
number_of_persons_injured        0
number_of_persons_killed         0
number_of_pedestrians_injured    0
number_of_pedestrians_killed     0
number_of_cyclist_injured        0
number_of_cyclist_killed         0
number_of_motorist_injured       0
number_of_motorist_killed        0
contributing_factor_vehicle_1    0
contributing_factor_vehicle_2    0
collision_id                     0
vehicle_type_code1               0
vehicle_type_code2               0
borough                          0
zip_code                         0
latitude                         0
longitude                        0
location                         0
city                             0
dtype: int64

The total of duplicated data is: 0

Data: 46438 rows

File Cleaned Successfully
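
One side effect of parsing crash_time with `format='%H:%M'` is that pandas attaches the placeholder date 1900-01-01 to every value, and that date is written to the cleaned CSV along with the time. If only the time of day is wanted, `.dt.time` can strip the date part. The sketch below uses hypothetical sample values, and the last one is deliberately invalid.

```python
import pandas as pd

# Hypothetical sample values; "25:99" is not a valid time
times = pd.Series(["2:39", "11:45", "25:99"])

# Same call as in the cleaning cell: invalid values become NaT
parsed = pd.to_datetime(times, format="%H:%M", errors="coerce")
print(parsed)

# Keep only the time of day (datetime.time objects, NaT where parsing failed)
time_only = parsed.dt.time
print(time_only)
```

Whether to keep the full timestamp or only the time of day depends on how crash_time is used downstream. The original cleaning cell keeps the full timestamp.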
