YouTube Data Analysis Project By Shashank

Short Description

This project leverages AWS cloud services to ingest, process, and analyze YouTube trending video data. It implements a scalable ETL pipeline and data lake architecture, culminating in interactive dashboards for insights into video popularity trends across different regions.

Overview

This project aims to securely manage, streamline, and analyze structured and semi-structured YouTube video data, organized by video category and trending metrics. It uses several AWS services to build a scalable, cloud-based solution for data ingestion, processing, and analysis.

Project Goals

Data Ingestion: Build a mechanism to ingest data from different sources.
ETL System: Transform raw data into the proper format for analysis.
Data Lake: Create a centralized repository to store data from multiple sources.
Scalability: Ensure the system scales as the size of our data increases.
Cloud Infrastructure: Utilize AWS services to process vast amounts of data.
Reporting: Build a dashboard to visualize data and answer key questions.

AWS Services Used

Amazon S3: Object storage service for scalable data storage.
AWS IAM: Identity and Access Management for secure access control.
Amazon QuickSight: Scalable, serverless business intelligence (BI) service for data visualization.
AWS Glue: Serverless data integration service for data discovery, preparation, and combination.
AWS Lambda: Serverless computing service for running code without managing servers.
Amazon Athena: Interactive query service for analyzing data stored in Amazon S3 using SQL.

Dataset

We're using a Kaggle dataset containing statistics on daily popular YouTube videos over several months. The dataset includes:

Up to 200 trending videos listed per day for each region.
Data for each region in separate files.
Video information such as title, channel title, publication time, tags, views, likes, dislikes, description, and comment count.
A category_id field whose values vary by region; the category for a given video is looked up in the JSON file associated with that region.

Dataset source: YouTube Trending Video Dataset
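
The category lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in this repository: it assumes the category JSON has an "items" list whose entries carry an "id" and a "snippet.title" (the shape returned by the YouTube API for video categories), and the helper names load_category_lookup and attach_category are hypothetical.

```python
import json


def load_category_lookup(category_json_text):
    """Build a {category_id: category_title} map from a category JSON document.

    Assumes the document has an "items" list with "id" and "snippet.title".
    """
    payload = json.loads(category_json_text)
    return {item["id"]: item["snippet"]["title"] for item in payload.get("items", [])}


def attach_category(rows, lookup):
    """Return copies of the video rows with a category_name field added."""
    enriched = []
    for row in rows:
        # IDs are strings in the JSON but may be integers in the CSV data.
        category_id = str(row["category_id"])
        enriched.append({**row, "category_name": lookup.get(category_id, "Unknown")})
    return enriched


# Tiny made-up sample standing in for one region's files.
sample_json = '{"items": [{"id": "10", "snippet": {"title": "Music"}}]}'
sample_rows = [
    {"video_id": "abc", "category_id": 10},
    {"video_id": "xyz", "category_id": 99},
]

lookup = load_category_lookup(sample_json)
enriched_rows = attach_category(sample_rows, lookup)
```

Because category IDs vary by region, each region's video file should be joined only against that region's own JSON file.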

Project Structure

Clone the repository to your local machine.
Set up your AWS credentials and configure the necessary IAM roles and permissions.
Ingest data from the Kaggle dataset into Amazon S3.
Use AWS Glue to transform and load data into the data lake.
Query the data with Amazon Athena and visualize results with Amazon QuickSight.
Deploy your Lambda functions to automate the process and ensure scalability.
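
As one way to picture the automation step, the sketch below shows a Lambda handler that reads the bucket and object key from an S3 event notification. It is a hedged illustration rather than the contents of lambda_function.py: the bucket name, the object key, and the helper extract_s3_objects are assumptions, and the actual transformation is left as a placeholder comment.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events, so decode them first.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: list the uploaded objects that triggered this invocation."""
    objects = extract_s3_objects(event)
    # Placeholder: a real handler would read each object, clean it,
    # and write the result to the processed area of the data lake.
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in objects]}


# Hypothetical event with a made-up bucket and key, for local testing only.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "raw/region%3Dus/US_category_id.json"},
            }
        }
    ]
}

result = lambda_handler(sample_event, None)
```

Keeping the event parsing in its own function makes the handler easy to test locally without any AWS resources.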

Repository Files

Amazon Quicksight Dashboard - Shashank Pandey.mp4
README.md
architecture.jpeg
data-architecture-diagram.mermaid
lambda_function.py
pyspark_code.py
s3_cli_command.sh
youtube-data-analysis-readme.md
