Data Architecture and Business Intelligence - End Term Report

Introduction

This repository contains the final project for the Design Data Architecture course at Northeastern University. The project focuses on advanced data architecture and business intelligence, specifically on data profiling, ETL processes, and data validation using various datasets from Austin, Chicago, and NYC.

Project Overview

The objective of this project is to design and implement a robust data architecture that supports business intelligence activities. The project involves:

  1. Data Profiling: Analyzing the structure, content, and quality of datasets.
  2. Data Staging: Preparing and cleaning data for ETL processes.
  3. ETL Processes: Extracting, transforming, and loading data using Talend.
  4. Data Validation: Ensuring data accuracy and consistency using MySQL.
  5. Dimensional Modeling: Creating star schema models for efficient data analysis.
  6. Reporting and Analysis: Using transformed data for in-depth analysis and reporting.

Data Profiling

Austin Crash Dataset

  • Total Observations: 147,750
  • Variables: 54
  • Missing Data: 1,725,084 cells (21.6%)
  • Memory Usage: 60.9 MiB
  • Variable Types: 17 numeric, 11 boolean, 2 datetime, 7 text, 15 categorical, 2 unsupported

Key Insights:

  • The Austin dataset contains no duplicate rows, ensuring data uniqueness.
  • Significant missing data poses challenges for data analysis but can be managed through imputation or exclusion.
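
The statistics above are typical of a ydata-profiling report. A minimal sketch of how such a profile could be generated, assuming the raw extract is a local CSV (the file name is a placeholder):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Austin crash extract (file name is a placeholder).
df = pd.read_csv("austin_crashes.csv", low_memory=False)

# The report covers observations, variables, missing cells,
# memory usage, and inferred variable types.
profile = ProfileReport(df, title="Austin Crash Dataset Profile")
profile.to_file("austin_profile.html")

# Quick duplicate check mirroring the "no duplicate rows" insight.
print(f"Duplicate rows: {df.duplicated().sum()}")
```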

Chicago Crash Dataset

  • Total Observations: 817,723
  • Variables: 48
  • Missing Data: 8,268,003 cells (21.1%)
  • Memory Usage: 299.5 MiB
  • Variable Types: 3 text, 9 boolean, 2 datetime, 15 numeric, 19 categorical

Key Insights:

  • The dataset contains no duplicate rows, ensuring data uniqueness.
  • Missing data requires imputation or exclusion strategies, as sketched below.
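
A minimal pandas sketch of those two strategies; the file name and the 50% threshold are illustrative choices, not project settings:

```python
import pandas as pd

df = pd.read_csv("chicago_crashes.csv", low_memory=False)

# Exclusion: drop columns that are mostly empty.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Imputation: fill remaining gaps per column type.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("UNKNOWN")
```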

NYC Crash Dataset

  • Total Observations: 2,075,427
  • Variables: 29
  • Missing Data: 17,761,578 cells (29.5%)
  • Memory Usage: 459.2 MiB
  • Variable Types: DateTime, Categorical, Numeric, Text

Key Insights:

  • Highest percentage of missing data among the three, necessitating extensive preprocessing to ensure data quality.
  • Profiling alerts include high correlation between various contributing factors and imbalance in certain fields; a quick check is sketched below.
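
Those alerts can be reproduced outside the profiler. A sketch, with the file name and both thresholds as assumptions:

```python
import pandas as pd

df = pd.read_csv("nyc_crashes.csv", low_memory=False)

# High-correlation alert: numeric pairs with |r| > 0.9.
corr = df.select_dtypes("number").corr().abs()
pairs = [(a, b, round(corr.loc[a, b], 2))
         for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.9]
print("Highly correlated pairs:", pairs)

# Imbalance alert: categorical fields dominated by one value.
for col in df.select_dtypes("object"):
    top = df[col].value_counts(normalize=True)
    if not top.empty and top.iloc[0] > 0.95:
        print(f"Imbalanced: {col} ({top.iloc[0]:.0%} '{top.index[0]}')")
```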

ETL Processes

Talend for ETL Jobs

Austin Crash Dataset

  • Transformation Component (tMap): Applied business logic, combined data from various sources, and ensured data type consistency.
  • Output Component (tLoadinitialstage_Austin_staging): Loaded 147,750 rows into a staging table.
  • Logging Component (tLogRow): Handled rejected rows, ensuring data quality.

Chicago Crash Dataset

  • Transformation Component (tMap_1): Applied business rules, changed data types, and combined data from different sources.
  • Output Component (tLoadinitialstage_Chicago_staging): Loaded 817,723 rows into a staging table in 143.7 seconds.
  • Logging Component (tLogRow_2): No rows rejected, indicating successful data transformation.

NYC Crash Dataset

  • Transformation Component (tMap_1): Applied business standards, combined fields from various sources, and ensured proper formatting.
  • Output Component (loading_initialnycstaging): Loaded 2,075,427 rows into a staging table.
  • Logging Component (tLogRow_2): No rows rejected, indicating smooth data transformation.
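
All three Talend jobs share the same shape: a tMap transformation, a bulk load into a staging table, and reject logging. For readers without Talend, a rough Python equivalent of that pattern; the connection string, file, table, and column names are placeholders, not the project's actual configuration:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder MySQL connection string.
engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

df = pd.read_csv("nyc_crashes.csv", low_memory=False)

# tMap equivalent: enforce types and business rules (column name assumed).
df["crash_date"] = pd.to_datetime(df["crash_date"], errors="coerce")

# tLogRow equivalent: capture rows that fail the rules.
rejects = df[df["crash_date"].isna()]
rejects.to_csv("nyc_rejects.csv", index=False)

# Staging-load equivalent (table name assumed).
df.dropna(subset=["crash_date"]).to_sql(
    "nyc_staging", engine, if_exists="append", index=False, chunksize=10_000
)
```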

Data Validation

SQL Query Validations

  • Conducted thorough SQL query validations for each staging dataset to ensure data accuracy and consistency before loading into target tables.
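
A sketch of the kinds of checks such validations typically involve — row counts, null key fields, and duplicate natural keys — run against an assumed staging table (all table and column names here are illustrative):

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

# Illustrative validation queries against a staging table.
checks = {
    "row_count": "SELECT COUNT(*) FROM nyc_staging",
    "null_crash_dates":
        "SELECT COUNT(*) FROM nyc_staging WHERE crash_date IS NULL",
    "duplicate_collision_ids": """
        SELECT COUNT(*) FROM (
            SELECT collision_id FROM nyc_staging
            GROUP BY collision_id HAVING COUNT(*) > 1
        ) d""",
}

with engine.connect() as conn:
    for name, sql in checks.items():
        print(name, conn.execute(text(sql)).scalar())
```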

Dimensional Modeling for Traffic Accident Analysis

Fact and Dimension Tables

Our team deployed this dimensional data model to improve traffic accident reporting for a city traffic management system. The interconnected set of dimension and fact tables that makes up the database provides a detailed perspective on each occurrence.

Dimension Tables

  • dim_date: Provides date context for each accident, including fields for day, week, month, quarter, and year.
  • dim_location: Contains street names and coordinates for spatial analysis of accidents.
  • dim_time: Offers detailed time information to complement the date context.
  • dim_source: Lists the sources of the data, providing traceability.
  • dim_contribute: Captures contributing factors to accidents, employing a Slowly Changing Dimension Type 2 technique to maintain historical changes (see the sketch after this list).
  • dim_vehicle: Details the types of vehicles involved in accidents for vehicle-related analysis.
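
A minimal sketch of the SCD Type 2 logic behind dim_contribute: when a factor's description changes, the current row is expired and a new version is inserted. Column names such as factor_code, effective_date, end_date, and is_current are assumptions, not the project's exact schema:

```python
from datetime import date
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

def upsert_factor(factor_code: str, description: str) -> None:
    """SCD Type 2: close the current version if the description
    changed, then insert the new version."""
    with engine.begin() as conn:
        current = conn.execute(text(
            "SELECT description FROM dim_contribute "
            "WHERE factor_code = :c AND is_current = 1"),
            {"c": factor_code}).scalar()
        if current == description:
            return  # no change, nothing to version
        if current is not None:
            conn.execute(text(
                "UPDATE dim_contribute SET end_date = :d, is_current = 0 "
                "WHERE factor_code = :c AND is_current = 1"),
                {"d": date.today(), "c": factor_code})
        conn.execute(text(
            "INSERT INTO dim_contribute "
            "(factor_code, description, effective_date, is_current) "
            "VALUES (:c, :desc, :d, 1)"),
            {"c": factor_code, "desc": description, "d": date.today()})
```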

Fact Tables

  • fact_accidents: Central table logging accident metrics and associating them with time, location, and vehicle dimensions.
  • fact_contributing_factors: Links accidents with their contributing factors, acknowledging that multiple factors can contribute to a single accident.
  • fact_units_involved: Provides data on ongoing investigations of unresolved incidents, supporting analysis of these cases.

This comprehensive model supports extensive exploratory analysis, predictive modeling, and other statistical investigations, making it an invaluable tool for understanding traffic collision trends and developing safety improvement plans.
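
To make the star layout concrete, a trimmed sketch of fact_accidents keyed to two of the dimensions; all column and key names are illustrative, not the project's exact DDL:

```python
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/crash_dw")

# Assumes dim_date and dim_location already exist with these surrogate keys.
ddl = """
CREATE TABLE IF NOT EXISTS fact_accidents (
    accident_key   BIGINT AUTO_INCREMENT PRIMARY KEY,
    date_key       INT NOT NULL,
    location_key   INT NOT NULL,
    injury_count   INT,
    fatality_count INT,
    FOREIGN KEY (date_key)     REFERENCES dim_date (date_key),
    FOREIGN KEY (location_key) REFERENCES dim_location (location_key)
)
"""
with engine.begin() as conn:
    conn.execute(text(ddl))
```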

Visualizations

Key Visual Insights

  1. Accident Occurrence:
    • Visualized the total number of accidents in NYC, Austin, and Chicago.
    • Highlighted the top three areas in each city with the highest accident rates.
  2. Injury Analysis:
    • Showcased the number of accidents resulting in injuries, both overall and by city.
  3. Pedestrian Involvement:
    • Illustrated how often pedestrians are involved in accidents, comparing overall and city-specific data.
  4. Temporal Patterns:
    • Analyzed when most accidents happen, providing a seasonality report and a time-based analysis of accidents.
    • Examined time of day, day of the week, and weekday versus weekend occurrence (a sketch of this analysis follows the list).
  5. Motorist Injury and Fatality Analysis:
    • Detailed the number of motorists injured or killed in accidents, both overall and by city.
    • Identified the top five areas in each city with the most fatal accidents.
  6. Factor Analysis:
    • Explored the most common factors involved in accidents, emphasizing key contributing factors and their correlations.
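
As an example of the temporal analysis in item 4, a pandas sketch that buckets accidents by month, hour, and day of week; the file and column names are hypothetical, and any of the staged datasets would work:

```python
import pandas as pd

df = pd.read_csv("nyc_crashes.csv", low_memory=False,
                 parse_dates=["crash_datetime"])

ts = df["crash_datetime"]
print(ts.dt.month.value_counts().sort_index())   # seasonality by month
print(ts.dt.hour.value_counts().sort_index())    # time of day
print(ts.dt.day_name().value_counts())           # day of week
print(f"Weekend share: {ts.dt.dayofweek.isin([5, 6]).mean():.1%}")
```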

Visualization Samples

Sample charts for each analysis are included in the repository:

  • Accident Occurrence by City

  • Injury Analysis

  • Pedestrian Involvement

  • Temporal Patterns

  • Motorist Injury and Fatality Analysis

  • Factor Analysis

These visualizations provide a comprehensive view of traffic accidents, facilitating deeper insights and informed decision-making.

Conclusion

The project successfully demonstrates the application of advanced data architecture techniques to prepare, transform, and analyze large datasets for business intelligence purposes. The comprehensive profiling, meticulous ETL processes, and robust data validation ensure high-quality data for meaningful analysis and reporting.

For detailed steps, scripts, and additional information, please refer to the provided documentation and scripts within this repository.


By applying sophisticated techniques and industry standards, this report captures the entire data lifecycle from raw form to analytical insights, reflecting the robust and meticulous approach adopted throughout the project.

