varunu28/AADHAR-Dataset-Analysis

AADHAR-Dataset-Analysis

Data analysis of AADHAR dataset using Apache Spark

Technologies Used

  • Spark
  • Scala
  • Spark SQL
  • Linux Shell Scripting

Initial Data Cleaning

  • Removing the header row containing the column names (done using Scala)
  • Replacing NULL values with 0 (done using UNIX sed)
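
The two cleaning steps above can be sketched in plain Scala as follows. This is only an illustration, not the repository's actual script: the object and method names are hypothetical, the comma delimiter is an assumption, and NULLs are assumed to appear either as empty fields or as the literal text NULL. The README states that the NULL step was done with sed; it is shown in Scala here only to keep the sketch in one language.

```scala
// Hypothetical helper; names, delimiter and NULL representation are assumptions.
object CleanLines {
  // Drop the first line, which holds the column names.
  def dropHeader(lines: Seq[String]): Seq[String] = lines.drop(1)

  // Replace empty or literal "NULL" fields with "0".
  // split(",", -1) keeps trailing empty fields instead of discarding them.
  def fillNulls(line: String): String =
    line.split(",", -1)
      .map { f =>
        val t = f.trim
        if (t.isEmpty || t.equalsIgnoreCase("NULL")) "0" else f
      }
      .mkString(",")

  // Apply both steps in order.
  def clean(lines: Seq[String]): Seq[String] = dropHeader(lines).map(fillNulls)
}
```

The same functions could be applied to an RDD of lines, although dropping the header from a distributed file needs extra care because `drop(1)` only makes sense on an ordered local collection.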

Creating a DataFrame

The DataFrame used for the analysis is created from a case class whose fields correspond to the column names in the input data.
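
A minimal sketch of this step, assuming cleaned comma-separated lines. The case class `Enrolment` and its five fields (`agency`, `state`, `district`, `gender`, `generated`) are hypothetical: the README does not list the real columns, so the real case class will differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical schema; the real dataset's columns are not listed in the README.
case class Enrolment(agency: String, state: String, district: String,
                     gender: String, generated: Int)

object BuildFrame {
  // Parse one cleaned CSV line into the case class.
  def parse(line: String): Enrolment = {
    val f = line.split(",", -1)
    Enrolment(f(0), f(1), f(2), f(3), f(4).trim.toInt)
  }

  // Turn the parsed lines into a DataFrame whose columns are named after the fields.
  def toFrame(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    lines.map(parse).toDF()
  }
}
```

With a case class, Spark derives the column names and types from the fields, which is why the fields are expected to mirror the input columns.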

Questions Answered about data

Count for number of participants and count for each gender

  • Number of Male Participants = 102037
  • Number of Female Participants = 120225
  • Total Number of Participants = 222281
  • Number of records with unspecified gender (T) = 19
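
Counts like these can be obtained with a single group-by. The sketch below assumes a string column named `gender` (a hypothetical name) and assumes that every row counts as one participant; if the data carries a per-row count column, a sum over that column would be needed instead.

```scala
import org.apache.spark.sql.DataFrame

object GenderCount {
  // Number of rows per distinct value of the (assumed) "gender" column.
  def countByGender(df: DataFrame): Map[String, Long] =
    df.groupBy("gender")
      .count()
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap

  // Total across all gender values.
  def total(counts: Map[String, Long]): Long = counts.values.sum
}
```

The figures reported above are consistent with this kind of breakdown: 102037 + 120225 + 19 = 222281.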

Number of identities (Aadhaar) generated by each Enrollment Agency, top 3

  • CSC SPV : 85088
  • Rajcomp Info Services Ltd : 16356
  • Mahaonline Limited : 7749
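
A top-N ranking of this kind can be sketched with the DataFrame API. The column names `agency` and `generated` are assumptions made for illustration, as is the choice to sum a per-row count rather than count rows.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{desc, sum}

object TopAgencies {
  // Sum the (assumed) "generated" column per agency and keep the n largest totals.
  def topN(df: DataFrame, n: Int): Array[(String, Long)] =
    df.groupBy("agency")
      .agg(sum("generated").as("total"))
      .orderBy(desc("total"))
      .limit(n)
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
}
```

Calling `TopAgencies.topN(df, 3)` would return the three agencies with the largest totals, in descending order.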

Top 10 districts with maximum identities generated for both Male and Female

  • East Champaran : 3700
  • Jaipur : 3144
  • West Champaran : 2619
  • East Khasi Hills : 2481
  • Siwan : 2402
  • Muzaffarpur : 2250
  • Bharatpur : 1999
  • Agra : 1865
  • Ahmedabad : 1851
  • Shrawasti : 1810

Bottom 10 districts with minimum identities generated for both Male and Female

  • Serchhip : 0
  • Yanam : 1
  • Nicobar : 1
  • North Sikkim : 1
  • Dibang Valley : 1
  • Anjaw : 1
  • Tirap : 2
  • Mokokchung : 2
  • North Cachar Hills : 2
  • Narayanpur : 3

Comparing the top 10 and bottom 10 districts, one pattern stands out: well-known districts are comparatively easy to bring under Aadhaar enrolment, but work still needs to be done in remote areas.
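
Since Spark SQL is among the technologies listed, the top and bottom district rankings can be sketched as one parameterised query. The view name `enrolments` and the columns `district` and `generated` are hypothetical, and the secondary sort on district name is an added assumption to make ties deterministic.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DistrictRanking {
  // Rank districts by total identities; ascending = true gives the bottom n.
  def ranked(spark: SparkSession, df: DataFrame, n: Int,
             ascending: Boolean): Array[(String, Long)] = {
    df.createOrReplaceTempView("enrolments")
    val dir = if (ascending) "ASC" else "DESC"
    spark.sql(
      s"""SELECT district, SUM(generated) AS total
         |FROM enrolments
         |GROUP BY district
         |ORDER BY total $dir, district ASC
         |LIMIT $n""".stripMargin)
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
  }
}
```

Using one query for both directions keeps the top and bottom lists comparable, since they share the same grouping and aggregation.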

Top 3 States by number of identities generated for both Male and Female

  • Uttar Pradesh : 50254
  • Bihar : 29842
  • Rajasthan : 20744

Bottom 3 States by number of identities generated for both Male and Female

  • Lakshadweep : 14
  • Dadra and Nagar Haveli : 27
  • Daman and Diu : 45

Top 3 States by number of identities generated for Female

  • Uttar Pradesh : 26063
  • Bihar : 15353
  • Rajasthan : 11404

Bottom 3 States by number of identities generated for Female

  • Lakshadweep : 6
  • Others : 17
  • Dadra and Nagar Haveli : 21

Top 3 States by number of identities generated for Male

  • Uttar Pradesh : 24191
  • Bihar : 14489
  • Rajasthan : 9340

Bottom 3 States by number of identities generated for Male

  • Dadra and Nagar Haveli : 6
  • Lakshadweep : 8
  • Daman and Diu : 17

The gender-wise distribution follows the same trend as the overall distribution: Uttar Pradesh, Bihar and Rajasthan lead for both Male and Female.
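
One way to compare the per-gender state totals side by side is a pivot, sketched below. This is an alternative to filtering by gender and grouping separately, not necessarily how the repository computes its numbers; the column names `state`, `gender` and `generated` are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

object StateByGender {
  // One row per state, one column per gender value, each cell holding the summed total.
  // States with no rows for a given gender get 0 instead of null.
  def table(df: DataFrame): DataFrame =
    df.groupBy("state")
      .pivot("gender")
      .agg(sum("generated"))
      .na.fill(0)
}
```

Sorting the resulting table by any gender column then yields the per-gender top and bottom states from a single aggregation.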
