Data Architecture and Business Intelligence - End Term Report

Introduction

This repository contains the final project for the Design Data Architecture course at Northeastern University. The project focuses on advanced data architecture and business intelligence, specifically on data profiling, ETL processes, and data validation using various datasets from Austin, Chicago, and NYC.

Project Overview

The objective of this project is to design and implement a robust data architecture that supports business intelligence activities. The project involves:

  1. Data Profiling: Analyzing the structure, content, and quality of datasets.
  2. Data Staging: Preparing and cleaning data for ETL processes.
  3. ETL Processes: Extracting, transforming, and loading data using Talend.
  4. Data Validation: Ensuring data accuracy and consistency using MySQL.
  5. Dimensional Modeling: Creating star schema models for efficient data analysis.
  6. Reporting and Analysis: Using transformed data for in-depth analysis and reporting.

Data Profiling

Austin Crash Dataset

  • Total Observations: 147,750
  • Variables: 54
  • Missing Data: 1,725,084 cells (21.6%)
  • Memory Usage: 60.9 MiB
  • Variable Types: 17 numeric, 11 boolean, 2 datetime, 7 text, 15 categorical, 2 unsupported

Key Insights:

  • The Austin dataset contains no duplicate rows, ensuring data uniqueness.
  • Significant missing data poses challenges for data analysis but can be managed through imputation or exclusion.
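
The statistics above are typical of a ydata-profiling report. A minimal sketch of how such a profile could be generated, assuming the raw extract is a local CSV (the file name is a placeholder):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Austin crash extract (file name is a placeholder).
df = pd.read_csv("austin_crashes.csv", low_memory=False)

# The report covers observations, variables, missing cells,
# memory usage, and inferred variable types.
profile = ProfileReport(df, title="Austin Crash Dataset Profile")
profile.to_file("austin_profile.html")

# Quick duplicate check mirroring the "no duplicate rows" insight.
print(f"Duplicate rows: {df.duplicated().sum()}")
```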

Chicago Crash Dataset

  • Total Observations: 817,723
  • Variables: 48
  • Missing Data: 8,268,003 cells (21.1%)
  • Memory Usage: 299.5 MiB
  • Variable Types: 3 text, 9 boolean, 2 datetime, 15 numeric, 19 categorical

Key Insights:

  • The dataset contains no duplicate rows, ensuring data uniqueness.
  • Missing data requires imputation or exclusion strategies, as sketched below.
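
A minimal pandas sketch of those two strategies; the file name and the 50% threshold are illustrative choices, not project settings:

```python
import pandas as pd

df = pd.read_csv("chicago_crashes.csv", low_memory=False)

# Exclusion: drop columns that are mostly empty.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Imputation: fill remaining gaps per column type.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("UNKNOWN")
```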

NYC Crash Dataset

  • Total Observations: 2,075,427
  • Variables: 29
  • Missing Data: 17,761,578 cells (29.5%)
  • Memory Usage: 459.2 MiB
  • Variable Types: DateTime, Categorical, Numeric, Text

Key Insights:

  • Highest percentage of missing data among the three, necessitating extensive preprocessing to ensure data quality.
  • Profiling alerts include high correlation between various contributing factors and imbalance in certain fields; a quick check is sketched below.
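
Those alerts can be reproduced outside the profiler. A sketch, with the file name and both thresholds as assumptions:

```python
import pandas as pd

df = pd.read_csv("nyc_crashes.csv", low_memory=False)

# High-correlation alert: numeric pairs with |r| > 0.9.
corr = df.select_dtypes("number").corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 2))
         for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.9]
print("Highly correlated pairs:", pairs)

# Imbalance alert: categorical fields dominated by one value.
for col in df.select_dtypes("object"):
    top = df[col].value_counts(normalize=True)
    if not top.empty and top.iloc[0] > 0.95:
        print(f"Imbalanced: {col} ({top.iloc[0]:.0%} '{top.index[0]}')")
```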

ETL Processes

Talend for ETL Jobs

Austin Crash Dataset

  • Transformation Component (tMap): Applied business logic, combined data from various sources, and ensured data type consistency.
  • Output Component (tLoadinitialstage_Austin_staging): Loaded 147,750 rows into a staging table.
  • Logging Component (tLogRow): Handled rejected rows, ensuring data quality.

Chicago Crash Dataset

  • Transformation Component (tMap_1): Applied business rules, changed data types, and combined data from different sources.
  • Output Component (tLoadinitialstage_Chicago_staging): Loaded 817,723 rows into a staging table in 143.7 seconds.
  • Logging Component (tLogRow_2): No rows rejected, indicating successful data transformation.

NYC Crash Dataset

  • Transformation Component (tMap_1): Applied business standards, combined fields from various sources, and ensured proper formatting.
  • Output Component (loading_initialnycstaging): Loaded 2,075,427 rows into a staging table.
  • Logging Component (tLogRow_2): No rows rejected, indicating smooth data transformation.
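
All three Talend jobs share the same shape: a tMap transformation, a bulk load into a staging table, and reject logging. For readers without Talend, a rough Python equivalent of that pattern; the connection string, file, table, and column names are placeholders, not the project's actual configuration:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder MySQL connection string.
engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

df = pd.read_csv("nyc_crashes.csv", low_memory=False)

# tMap equivalent: enforce types and business rules (column name assumed).
df["crash_date"] = pd.to_datetime(df["crash_date"], errors="coerce")

# tLogRow equivalent: capture rows that fail the rules.
rejects = df[df["crash_date"].isna()]
rejects.to_csv("nyc_rejects.csv", index=False)

# Staging-load equivalent (table name assumed).
df.dropna(subset=["crash_date"]).to_sql(
    "nyc_staging", engine, if_exists="append", index=False, chunksize=10_000
)
```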

Data Validation

SQL Query Validations

  • Conducted thorough SQL query validations for each staging dataset to ensure data accuracy and consistency before loading into target tables.
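
A sketch of the kinds of checks such validations typically involve — row counts, null key fields, and duplicate natural keys — run against an assumed staging table (all table and column names here are illustrative):

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

# Illustrative validation queries against a staging table.
checks = {
    "row_count": "SELECT COUNT(*) FROM nyc_staging",
    "null_crash_dates":
        "SELECT COUNT(*) FROM nyc_staging WHERE crash_date IS NULL",
    "duplicate_collision_ids": """
        SELECT COUNT(*) FROM (
            SELECT collision_id FROM nyc_staging
            GROUP BY collision_id HAVING COUNT(*) > 1
        ) d""",
}

with engine.connect() as conn:
    for name, sql in checks.items():
        print(name, conn.execute(text(sql)).scalar())
```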

Dimensional Modeling for Traffic Accident Analysis

Fact and Dimension Tables

Our team deployed this dimensional data model to improve traffic accident reporting for a city traffic management system. The interconnected set of dimension and fact tables that makes up the database provides a detailed perspective on each occurrence.

Dimension Tables

  • dim_date: Provides date context for each accident, including fields for day, week, month, quarter, and year.
  • dim_location: Contains street names and coordinates for spatial analysis of accidents.
  • dim_time: Offers detailed time information to complement the date context.
  • dim_source: Lists the sources of the data, providing traceability.
  • dim_contribute: Captures contributing factors to accidents, employing a Slowly Changing Dimension Type 2 technique to maintain historical changes (see the sketch after this list).
  • dim_vehicle: Details the types of vehicles involved in accidents for vehicle-related analysis.
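
A minimal sketch of the SCD Type 2 logic behind dim_contribute: when a factor's description changes, the current row is expired and a new version is inserted. Column names such as factor_code, effective_date, end_date, and is_current are assumptions, not the project's exact schema:

```python
from datetime import date
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

def upsert_factor(factor_code: str, description: str) -> None:
    """SCD Type 2: close the current version if the description
    changed, then insert the new version."""
    with engine.begin() as conn:
        current = conn.execute(text(
            "SELECT description FROM dim_contribute "
            "WHERE factor_code = :c AND is_current = 1"),
            {"c": factor_code}).scalar()
        if current == description:
            return  # no change, nothing to version
        if current is not None:
            conn.execute(text(
                "UPDATE dim_contribute SET end_date = :d, is_current = 0 "
                "WHERE factor_code = :c AND is_current = 1"),
                {"d": date.today(), "c": factor_code})
        conn.execute(text(
            "INSERT INTO dim_contribute "
            "(factor_code, description, effective_date, is_current) "
            "VALUES (:c, :desc, :d, 1)"),
            {"c": factor_code, "desc": description, "d": date.today()})
```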

Fact Tables

  • fact_accidents: Central table logging accident metrics and associating them with time, location, and vehicle dimensions.
  • fact_contributing_factors: Links accidents with their contributing factors, acknowledging that multiple factors can contribute to a single accident.
  • fact_units_involved: Provides data on ongoing investigations of unresolved incidents, supporting analysis of these cases.

This comprehensive model supports extensive exploratory analysis, predictive modeling, and other statistical investigations, making it an invaluable tool for understanding traffic collision trends and developing safety improvement plans.
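
To make the star layout concrete, a trimmed sketch of fact_accidents keyed to two of the dimensions; all column and key names are illustrative, not the project's exact DDL:

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

# Assumes dim_date and dim_location already exist with these surrogate keys.
ddl = """
CREATE TABLE IF NOT EXISTS fact_accidents (
    accident_key   BIGINT AUTO_INCREMENT PRIMARY KEY,
    date_key       INT NOT NULL,
    location_key   INT NOT NULL,
    injury_count   INT,
    fatality_count INT,
    FOREIGN KEY (date_key)     REFERENCES dim_date (date_key),
    FOREIGN KEY (location_key) REFERENCES dim_location (location_key)
)
"""
with engine.begin() as conn:
    conn.execute(text(ddl))
```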

Visualizations

Key Visual Insights

  1. Accident Occurrence:
    • Visualized the total number of accidents in NYC, Austin, and Chicago.
    • Highlighted the top three areas in each city with the highest accident rates.
  2. Injury Analysis:
    • Showcased the number of accidents resulting in injuries, both overall and by city.
  3. Pedestrian Involvement:
    • Illustrated how often pedestrians are involved in accidents, comparing overall and city-specific data.
  4. Temporal Patterns:
    • Analyzed when most accidents happen, providing a seasonality report and a time-based analysis of accidents.
    • Examined time of day, day of the week, and weekday versus weekend occurrence (a sketch of this analysis follows the list).
  5. Motorist Injury and Fatality Analysis:
    • Detailed the number of motorists injured or killed in accidents, both overall and by city.
    • Identified the top five areas in each city with the most fatal accidents.
  6. Factor Analysis:
    • Explored the most common factors involved in accidents, emphasizing key contributing factors and their correlations.
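
As an example of the temporal analysis in item 4, a pandas sketch that buckets accidents by month, hour, and day of week; the file and column names are hypothetical, and any of the staged datasets would work:

```python
import pandas as pd

df = pd.read_csv("nyc_crashes.csv", low_memory=False,
                 parse_dates=["crash_datetime"])

ts = df["crash_datetime"]
print(ts.dt.month.value_counts().sort_index())   # seasonality by month
print(ts.dt.hour.value_counts().sort_index())    # time of day
print(ts.dt.day_name().value_counts())           # day of week
print(f"Weekend share: {ts.dt.dayofweek.isin([5, 6]).mean():.1%}")
```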

Visualization Samples

Sample charts for each analysis are included in the repository:

  • Accident Occurrence by City

  • Injury Analysis

  • Pedestrian Involvement

  • Temporal Patterns

  • Motorist Injury and Fatality Analysis

  • Factor Analysis

These visualizations provide a comprehensive view of traffic accidents, facilitating deeper insights and informed decision-making.

Conclusion

The project successfully demonstrates the application of advanced data architecture techniques to prepare, transform, and analyze large datasets for business intelligence purposes. The comprehensive profiling, meticulous ETL processes, and robust data validation ensure high-quality data for meaningful analysis and reporting.

For detailed steps, scripts, and additional information, please refer to the provided documentation and scripts within this repository.


By applying sophisticated techniques and industry standards, this report captures the entire data lifecycle from raw form to analytical insights, reflecting the robust and meticulous approach adopted throughout the project.

