Data cleaning, pre-processing, and analytics on health care data using Spark and Python.

tejasjbansal/HELTHCARE-SYSTEM

HEALTHCARE SYSTEM BIG DATA ANALYTICS

Requirement

A health care insurance company is struggling to grow its revenue and to understand its customers. It wants to use the Big Data ecosystem to analyze competitor data received from a variety of sources, namely web scraping and third-party providers. This analysis will help the company track customer behavior and health conditions so that it can tailor insurance policy offers and calculate royalties for customers who have bought policies in the past, which in turn should increase revenue.

The goal of the project

The goal of the project is to build data pipelines for the health care insurance company. By analyzing customer behavior, the pipelines help the company shape business strategies that increase revenue, and send offers and royalties to the appropriate customers.

Major Components

  • Apache Spark
  • Apache Hadoop

Environment

  • Linux (Ubuntu 18.04)
  • Hadoop 2.7.2
  • Spark 2.0.2
  • Sqoop 1.4.7
  • Python 3

STEPS:

DATASET CREATION

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. Data sets can also consist of a collection of documents or files.
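To make the row-and-column picture concrete, here is a minimal sketch that reads a small tabular data set with Python's standard csv module. The column names and values are invented for illustration and are not taken from the project's data.

```python
import csv
import io

# Hypothetical sample data: column names and values are illustrative only.
RAW_CSV = """patient_id,age,disease
1,34,diabetes
2,51,asthma
3,47,diabetes
"""


def read_dataset(text):
    """Parse CSV text into a list of records, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_dataset(RAW_CSV)
columns = list(records[0].keys())  # each column is one variable
print(columns)
print(len(records), "records")
```

Each dict is one record, and each key is one variable of the data set.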

DATA CLEANING

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset.
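The three most common cleaning operations named above (fixing formatting, dropping incomplete rows, and removing duplicates) can be sketched in plain Python. This is an illustration under assumed field names, not the project's actual cleaning code, and the right steps will differ per data set.

```python
def clean_records(records, required_fields):
    """Drop incomplete rows, normalize text fields, and remove duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        # Normalize formatting: trim whitespace and lowercase string values.
        normalized = {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Remove incomplete rows: a required field is missing or empty.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Remove duplicates: compare rows by their full normalized content.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


# Hypothetical input rows; field names are assumptions for this sketch.
raw = [
    {"patient_id": "1", "disease": " Diabetes "},
    {"patient_id": "1", "disease": "diabetes"},  # duplicate after normalization
    {"patient_id": "2", "disease": ""},          # incomplete
    {"patient_id": "3", "disease": "Asthma"},
]
cleaned = clean_records(raw, required_fields=["patient_id", "disease"])
print(cleaned)
```

Normalizing before deduplicating matters: the first two rows only become identical once whitespace and case are removed.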

DATABASE CREATION

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications associated with them, are referred to as a database system, often shortened to just database.

LOADING DATA TO DATABASE

Data loading refers to the "load" component of ETL. After data is retrieved and combined from multiple sources (extracted), cleaned, and formatted (transformed), it is then loaded into a storage system, such as a cloud data warehouse, or relational database.
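As a rough sketch of the load step, the example below creates a table and bulk-inserts already-cleaned rows. It uses an in-memory SQLite database from Python's standard library as a stand-in for whatever relational database the project uses; the table name, columns, and rows are assumptions made for illustration.

```python
import sqlite3


def load_patients(connection, rows):
    """Create the target table if needed and bulk-insert the cleaned rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS patients ("
        "patient_id INTEGER PRIMARY KEY, name TEXT NOT NULL, disease TEXT)"
    )
    # Parameterized inserts avoid building SQL by string concatenation.
    connection.executemany(
        "INSERT INTO patients (patient_id, name, disease) VALUES (?, ?, ?)",
        rows,
    )
    connection.commit()


# In-memory SQLite stands in for the project's relational database.
connection = sqlite3.connect(":memory:")
load_patients(
    connection,
    [(1, "patient a", "diabetes"), (2, "patient b", "asthma")],  # hypothetical rows
)
row_count = connection.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
print(row_count)
```

Counting rows after the load is a cheap sanity check that every transformed record reached the target table.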

DATA TRANSFER TO HDFS USING SQOOP

Apache Sqoop is a command-line application for transferring data between relational databases and Hadoop. In this step, Sqoop moves the tables loaded into the database into HDFS.
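A Sqoop import is driven entirely by command-line arguments, so one way to script it from Python is to assemble the argument list and hand it to a subprocess. The sketch below only builds and prints the command. The JDBC URL, user name, password file, table, and HDFS directory are hypothetical placeholders, not values from this project.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table, target_dir, mappers=1):
    """Return the argument list for a `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the password, not the password itself
        "--table", table,                   # source table to import
        "--target-dir", target_dir,         # HDFS directory that receives the data
        "--num-mappers", str(mappers),      # number of parallel map tasks
    ]


# All values below are placeholders chosen for this sketch.
command = build_sqoop_import(
    jdbc_url="jdbc:mysql://localhost:3306/healthcare",
    username="etl_user",
    password_file="/user/etl/.db_password",
    table="patients",
    target_dir="/user/etl/healthcare/patients",
)
print(" ".join(shlex.quote(argument) for argument in command))
```

On a machine where Sqoop is installed, the list could be passed to `subprocess.run(command, check=True)` to perform the transfer.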

HIVE

Apache Hive is data warehouse software for analyzing large data sets. The data itself is stored in the Hadoop Distributed File System (HDFS); Hive gives that data a table structure and lets it be queried with a SQL-like language (HiveQL) to surface patterns and trends.
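One common way to expose files already sitting in HDFS to Hive is an external table, which maps a directory to a table without moving the data. The helper below is a small sketch that generates such a HiveQL statement; the table name, column list, and HDFS path are assumptions for illustration, and the statement would be run through a Hive client rather than by this script.

```python
def external_table_ddl(table, columns, hdfs_location, delimiter=","):
    """Build a HiveQL statement that maps a delimited HDFS directory to a table."""
    column_sql = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_location}'"
    )


# Hypothetical table definition and HDFS path.
ddl = external_table_ddl(
    table="patients",
    columns=[("patient_id", "INT"), ("name", "STRING"), ("disease", "STRING")],
    hdfs_location="/user/etl/healthcare/patients",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table definition and leaves the files in HDFS untouched.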

Spark SQL

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. According to the Spark project, it can run unmodified Hive queries on existing deployments and data up to 100x faster.
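The two roles described above, DataFrames and SQL queries, can be combined: rows become a DataFrame, the DataFrame is registered as a temporary view, and the view is queried with SQL. The following is a minimal PySpark sketch of that pattern. The `claims` view, its columns, and the sample rows are hypothetical, and `main()` needs a local PySpark installation to run.

```python
# Hypothetical query: average claim amount per disease.
AVG_CLAIM_SQL = """
SELECT disease, AVG(claim_amount) AS avg_claim
FROM claims
GROUP BY disease
ORDER BY avg_claim DESC
"""


def average_claim_by_disease(spark, rows):
    """Register rows as a temporary view and aggregate them with Spark SQL."""
    df = spark.createDataFrame(rows, ["patient_id", "disease", "claim_amount"])
    df.createOrReplaceTempView("claims")
    return spark.sql(AVG_CLAIM_SQL)


def main():
    # Imported here so the module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("claims-demo").getOrCreate()
    )
    rows = [(1, "diabetes", 1200.0), (2, "asthma", 300.0), (3, "diabetes", 800.0)]
    average_claim_by_disease(spark, rows).show()
    spark.stop()
```

Calling `main()` starts a single-threaded local session, which is enough to try the query before running it on a cluster.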

DATA VISUALIZATION

Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.
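As a dependency-free illustration of turning numbers into a chart, the sketch below renders a horizontal bar chart as text. It is a stand-in for a real plotting library, not the project's visualization code, and the counts are invented.

```python
def text_bar_chart(counts, width=20):
    """Render a horizontal bar chart as text, scaled to the largest value."""
    largest = max(counts.values())  # assumes at least one positive value
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # Bar length is proportional to the value; the largest gets `width` marks.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical counts of customers per condition.
chart = text_bar_chart({"diabetes": 40, "asthma": 20, "flu": 10})
print(chart)
```

Even in this crude form, the relative sizes are easier to compare at a glance than the raw numbers.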

Contributors


  • Tejash 💻
  • Abhay 💻
  • Arjit 💻
  • Prity 💻
  • Mohit 💻
  • Vikalp 💻
  • Utkarsh 💻
  • Karan 💻
  • Piyush 💻
  • Sumedh 💻
  • Shivam 💻
  • Aayush 💻
  • Pardeep 💻
  • Rutwick 💻
  • Madhu 💻
  • Khushboo 💻
  • Yuvraj 💻
  • Harshvardhan 💻
  • Aditya 💻
  • Ujjwal 💻

License

This repository is licensed under the Apache License 2.0. See License for more details.
