Data analysis of AADHAR dataset using Apache Spark
- Spark
- Scala
- Spark SQL
- Linux Shell Scripting
- Removed the header row containing the column names (done in Scala)
- Replaced NULL values with 0 (done with UNIX sed)
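The header-removal step could look roughly like the following. This is a minimal sketch, not the project's actual code: the object name `DropHeader`, the sample lines, and the column names in them are hypothetical, and it assumes the header is the first line of the input. The NULL replacement happens outside Spark with sed and is not shown here.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DropHeader {
  // Drops every line equal to the first line, which is assumed to be the header.
  def dropHeader(lines: RDD[String]): RDD[String] = {
    val header = lines.first()
    lines.filter(_ != header)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drop-header-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample; the real input would come from sc.textFile(path).
    val raw = spark.sparkContext.parallelize(Seq(
      "state,district,gender,generated",
      "Bihar,Siwan,M,1",
      "Rajasthan,Jaipur,F,0"
    ))

    dropHeader(raw).collect().foreach(println)
    spark.stop()
  }
}
```

Comparing against the header string, rather than skipping by position, keeps the filter correct regardless of how the file is split into partitions.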
The DataFrame used for the analysis is created from a case class whose fields correspond to the column names in the input data.
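A case-class-backed DataFrame might be built as sketched below. The field names and types in `Enrolment` are assumptions for illustration, since the actual column names are not listed here, and the naive comma split assumes no quoted fields in the data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical schema: field names and types are assumptions, not the real columns.
final case class Enrolment(state: String, district: String, gender: String, generated: Int)

object EnrolmentFrame {
  // Parses one cleaned CSV line (header removed, NULLs already replaced by 0).
  def parse(line: String): Enrolment = {
    val f = line.split(",", -1).map(_.trim)
    Enrolment(f(0), f(1), f(2), f(3).toInt)
  }

  // Column names of the resulting DataFrame are taken from the case class fields.
  def build(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    lines.map(parse).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("case-class-frame-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = build(spark, Seq("Bihar,Siwan,M,1", "Rajasthan,Jaipur,F,0"))
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```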
- Number of Male Participants = 102037
- Number of Female Participants = 120225
- Total Number of Participants = 222281
- Number of records with unspecified gender (T) = 19
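The three gender figures add up to the reported total (102037 + 120225 + 19 = 222281). Counts of this kind could be produced with a Spark SQL query along the following lines. This is only a sketch: the view name `enrolments`, the column name `gender`, and the sample rows are hypothetical, and it assumes each figure is a count of records per gender code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GenderCounts {
  // Counts records per gender code with Spark SQL.
  def byGender(spark: SparkSession, df: DataFrame): Map[String, Long] = {
    df.createOrReplaceTempView("enrolments") // hypothetical view name
    spark.sql("SELECT gender, COUNT(*) AS n FROM enrolments GROUP BY gender")
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gender-counts-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; "T" stands for the unspecified gender code.
    val df = Seq(("Bihar", "M"), ("Bihar", "F"), ("Rajasthan", "F"), ("Rajasthan", "T"))
      .toDF("state", "gender")

    val counts = byGender(spark, df)
    counts.toSeq.sortBy(_._1).foreach { case (g, n) => println(s"$g: $n") }
    println(s"total: ${counts.values.sum}")
    spark.stop()
  }
}
```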
- CSC SPV : 85088
- Rajcomp Info Services Ltd : 16356
- Mahaonline Limited : 7749
- East Champaran : 3700
- Jaipur : 3144
- West Champaran : 2619
- East Khasi Hills : 2481
- Siwan : 2402
- Muzaffarpur : 2250
- Bharatpur : 1999
- Agra : 1865
- Ahmedabad : 1851
- Shrawasti : 1810
- Serchhip : 0
- Yanam : 1
- Nicobar : 1
- North Sikkim : 1
- Dibang Valley : 1
- Anjaw : 1
- Tirap : 2
- Mokokchung : 2
- North Cachar Hills : 2
- Narayanpur : 3
Comparing the top 10 and bottom 10 districts shows that well-known districts are comparatively easy to bring under Aadhar issuance, while work still needs to be done in remote areas.
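A top-N and bottom-N district ranking can be sketched as an aggregation followed by an ordered limit. The column names `district` and `generated` and the sample rows are hypothetical. The sketch assumes the per-district figure is a sum over a numeric column rather than a record count, which would be consistent with a district showing 0, but that is an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object DistrictRanking {
  // Sums the (hypothetical) `generated` column per district.
  def totals(df: DataFrame): DataFrame =
    df.groupBy("district").agg(sum("generated").as("total"))

  private def take(sorted: DataFrame, n: Int): Seq[(String, Long)] =
    sorted.limit(n).collect().map(r => (r.getString(0), r.getLong(1))).toSeq

  // Highest totals first; ties broken by district name for a stable order.
  def top(df: DataFrame, n: Int): Seq[(String, Long)] =
    take(totals(df).orderBy(col("total").desc, col("district")), n)

  // Lowest totals first.
  def bottom(df: DataFrame, n: Int): Seq[(String, Long)] =
    take(totals(df).orderBy(col("total").asc, col("district")), n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("district-ranking-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, not figures from the dataset.
    val df = Seq(("Jaipur", 5), ("Jaipur", 3), ("Siwan", 4), ("Yanam", 1), ("Serchhip", 0))
      .toDF("district", "generated")

    println(top(df, 2))
    println(bottom(df, 2))
    spark.stop()
  }
}
```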
- Uttar Pradesh : 50254
- Bihar : 29842
- Rajasthan : 20744
- Lakshadweep : 14
- Dadra and Nagar Haveli : 27
- Daman and Diu : 45
- Uttar Pradesh : 26063
- Bihar : 15353
- Rajasthan : 11404
- Lakshadweep : 6
- Others : 17
- Dadra and Nagar Haveli : 21
- Uttar Pradesh : 24191
- Bihar : 14489
- Rajasthan : 9340
- Dadra and Nagar Haveli : 6
- Lakshadweep : 8
- Daman and Diu : 17
The gender-wise distributions follow the same trend as the overall distribution: Uttar Pradesh, Bihar, and Rajasthan lead in each list.
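One way to put the per-state figures for each gender side by side is a pivot on the gender column, sketched below. The column names `state`, `gender`, and `generated`, the gender codes `M` and `F`, and the sample rows are all assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object StateByGender {
  // One row per state with a column per gender code; missing combinations become 0.
  def pivoted(df: DataFrame): DataFrame =
    df.groupBy("state")
      .pivot("gender", Seq("M", "F")) // assumed gender codes
      .sum("generated")
      .na.fill(0L)
      .orderBy(col("state"))

  // Collects the pivoted frame as (state, male total, female total) tuples.
  def rows(df: DataFrame): Seq[(String, Long, Long)] =
    pivoted(df).collect().map(r => (r.getString(0), r.getLong(1), r.getLong(2))).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("state-by-gender-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, not figures from the dataset.
    val df = Seq(("Bihar", "M", 2), ("Bihar", "F", 3), ("Bihar", "M", 1), ("Goa", "F", 4))
      .toDF("state", "gender", "generated")

    pivoted(df).show()
    spark.stop()
  }
}
```

Listing the pivot values explicitly avoids an extra pass over the data to discover them and leaves out the unspecified gender code.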