# **Introduction**



This notebook analyzes an enrolment dataset from Aadhaar, the unique identity number issued to residents of India.


**The dataset contains the following columns:**

Registrar

Enrolment Agency

State

District

Sub District

Pin Code

Gender

Age

Aadhaar generated

Enrolment Rejected

Residents providing email

Residents providing mobile number

# **AIM**

Create a dataframe with the total Aadhaar generated for each state

Create a dataframe with the total Aadhaar generated by each enrolment agency

Create dataframes with the top 10 districts by Aadhaar generated, separately for males and females

Create a dataframe with the 10 states that generated the fewest Aadhaar

Find the age with the most declined enrolments (records where no Aadhaar was generated)

In [None]:
# Install PySpark into the notebook's environment
%pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Initialize PySpark
from pyspark.sql import SparkSession

# SparkSession is the single entry point for DataFrame and SQL work;
# it creates the underlying SparkContext, so no separate SparkConf or SQLContext is needed
spark = SparkSession.builder.appName("sample_app").getOrCreate()
sc = spark.sparkContext  # kept for any code that still expects `sc`



In [None]:
import pyspark.sql.functions as F

# Read the CSV with a header row and let Spark infer the column types
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/content/UIDAI-ENR-DETAIL-20170308.csv")
)

In [None]:
df.show()

+--------------------+--------------------+-------------+----------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|           Registrar|    Enrolment Agency|        State|  District|Sub District|Pin Code|Gender|Age|Aadhaar generated|Enrolment Rejected|Residents providing email|Residents providing mobile number|
+--------------------+--------------------+-------------+----------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|      Allahabad Bank|A-Onerealtors Pvt...|Uttar Pradesh| Allahabad|        Meja|  212303|     F|  7|                1|                 0|                        0|                                1|
|      Allahabad Bank|Asha Security Gua...|Uttar Pradesh| Sonbhadra| Robertsganj|  231213|     M|  8|                1|                 0|                        0|                                0|
|    

In [None]:
aad = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns]) #Replacing space with _
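
The renaming step above can be checked without Spark. The sketch below is plain Python, not the notebook's own code: `normalize_column` is a hypothetical helper, and unlike `col.replace(' ', '_')` it also trims the name and collapses runs of whitespace into a single underscore. The header names are copied from the `df.show()` output.

```python
import re


def normalize_column(name: str) -> str:
    """Trim the name and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", name.strip())


# Header names as they appear in the df.show() output.
raw_columns = [
    "Registrar", "Enrolment Agency", "State", "District", "Sub District",
    "Pin Code", "Gender", "Age", "Aadhaar generated", "Enrolment Rejected",
    "Residents providing email", "Residents providing mobile number",
]

clean_columns = [normalize_column(c) for c in raw_columns]

# Renaming must not merge two different columns into one name.
assert len(set(clean_columns)) == len(clean_columns)
print(clean_columns)
```

If the names come out as expected, a list like `clean_columns` could be applied in one call with `df.toDF(*clean_columns)` as an alternative to the `select` with aliases.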

In [None]:
aad.show()

+--------------------+--------------------+-------------+----------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|           Registrar|    Enrolment_Agency|        State|  District|Sub_District|Pin_Code|Gender|Age|Aadhaar_generated|Enrolment_Rejected|Residents_providing_email|Residents_providing_mobile_number|
+--------------------+--------------------+-------------+----------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|      Allahabad Bank|A-Onerealtors Pvt...|Uttar Pradesh| Allahabad|        Meja|  212303|     F|  7|                1|                 0|                        0|                                1|
|      Allahabad Bank|Asha Security Gua...|Uttar Pradesh| Sonbhadra| Robertsganj|  231213|     M|  8|                1|                 0|                        0|                                0|
|    

In [None]:
aad.printSchema() # schema of dataframe

root
 |-- Registrar: string (nullable = true)
 |-- Enrolment_Agency: string (nullable = true)
 |-- State: string (nullable = true)
 |-- District: string (nullable = true)
 |-- Sub_District: string (nullable = true)
 |-- Pin_Code: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Aadhaar_generated: integer (nullable = true)
 |-- Enrolment_Rejected: integer (nullable = true)
 |-- Residents_providing_email: integer (nullable = true)
 |-- Residents_providing_mobile_number: integer (nullable = true)



# Create a dataframe with the total Aadhaar generated for each state

In [None]:
from pyspark.sql.functions import sum, col, desc, asc, count  # note: this sum shadows Python's built-in sum
aad_state = aad.groupBy("State")  # group rows by state

df1 = aad_state.agg(sum("Aadhaar_generated").alias("sum_aadhar"))  # total Aadhaar generated per state

df2 = df1.sort(desc("sum_aadhar"))  # sort by total, largest first

In [None]:
df2.show(20, truncate=False)  # show the top 20 states

+--------------+----------+
|State         |sum_aadhar|
+--------------+----------+
|Bihar         |162607    |
|West Bengal   |119901    |
|Uttar Pradesh |103767    |
|Madhya Pradesh|53276     |
|Rajasthan     |39570     |
|Gujarat       |34844     |
|Tamil Nadu    |32485     |
|Maharashtra   |26085     |
|Karnataka     |19764     |
|Odisha        |18182     |
|Kerala        |15143     |
|Uttarakhand   |13227     |
|Jharkhand     |9868      |
|Delhi         |8426      |
|Haryana       |6804      |
|Chhattisgarh  |6604      |
|Punjab        |6506      |
|Mizoram       |6279      |
|Andhra Pradesh|5798      |
|Telangana     |5018      |
+--------------+----------+
only showing top 20 rows
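
As a sanity check on what `groupBy`, `agg(sum(...))` and `sort(desc(...))` compute, the same steps can be sketched in plain Python. The rows are made up for illustration and are not taken from the UIDAI file; `total_by_key` and `sort_desc` are hypothetical helper names.

```python
from collections import defaultdict

# Made-up rows (state, aadhaar_generated); not taken from the UIDAI file.
rows = [
    ("Bihar", 3), ("Kerala", 1), ("Bihar", 2),
    ("Goa", 4), ("Kerala", 1),
]


def total_by_key(pairs):
    """Sum the value for each key, like groupBy(key).agg(sum(value))."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def sort_desc(totals):
    """Order (key, total) pairs from largest to smallest total."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


state_totals = sort_desc(total_by_key(rows))
print(state_totals)
```

The same pattern applies to any other grouping column, for example the enrolment agency.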



# Create a dataframe with the total Aadhaar generated by each enrolment agency

In [None]:
aadh_erl = aad.groupBy("Enrolment_Agency")  # group rows by enrolment agency

df3 = aadh_erl.agg(sum("Aadhaar_generated").alias("sum_aadhar"))  # total Aadhaar generated per agency

df4 = df3.sort(desc("sum_aadhar"))  # sort by total, largest first
df4.show(15, truncate=False)  # show the top 15 agencies

+----------------------------------------+----------+
|Enrolment_Agency                        |sum_aadhar|
+----------------------------------------+----------+
|CSC SPV                                 |173192    |
|Wipro Ltd                               |39619     |
|SREI INFRASTRUCTURE FINANCES L          |26497     |
|SRM Education And Social Welfare Society|26253     |
|Computer LAB                            |21823     |
|Rajcomp Info Services Ltd               |20163     |
|MPOnline Limited                        |17020     |
|AKSH OPTIFIBRE LIMITED                  |16624     |
|Nielsen  India  Private Limited         |15993     |
|TAMILNADU ARASU CABLE TV CORPORATION LTD|15981     |
|Akshaya                                 |14562     |
|CMS Computers Ltd                       |13126     |
|IAP COMPANY Pvt. Ltd                    |10644     |
|VEETECHNOLOGIES PVT. LTD                |9922      |
|NPS Technologies Pvt. Ltd               |9692      |
+----------------------------------------+----------+

# Create dataframes with the top 10 districts by Aadhaar generated for males and for females


In [None]:
male=aad.filter("Gender == 'M'")  #filtering by gender male

dfm=male.groupBy("District")  #grouping the filtered data by district

dfm1=dfm.agg(sum("Aadhaar_generated").alias("sum_aadhar")) #sum of all aadhar generated from the grouped data

dfm2=dfm1.sort(desc("sum_aadhar")) #sorting data

print('Top 10 districts with maximum Aadhaar generated for males:')
dfm2.show(10, truncate=False)

Top 10 districts with maximum Aadhaar generated for males:
+-----------------+----------+
|District         |sum_aadhar|
+-----------------+----------+
|Bhagalpur        |11007     |
|South 24 Parganas|7825      |
|Katihar          |6968      |
|Murshidabad      |6808      |
|Samastipur       |6195      |
|Patna            |6191      |
|Barddhaman       |6077      |
|Gaya             |5959      |
|Munger           |5781      |
|Nadia            |5509      |
+-----------------+----------+
only showing top 10 rows



In [None]:
female=aad.filter("Gender == 'F'") #filtering by gender female

fe=female.groupBy("District") #grouping the filtered data by district

fe1=fe.agg(sum("Aadhaar_generated").alias("sum_aadhar")) #sum of all aadhar generated from the grouped data

fe2=fe1.sort(desc("sum_aadhar")) #sorting data


print('Top 10 districts with maximum Aadhaar generated for females:')

fe2.show(10, truncate=False)

Top 10 districts with maximum Aadhaar generated for females:
+-----------------+----------+
|District         |sum_aadhar|
+-----------------+----------+
|Barddhaman       |9744      |
|South 24 Parganas|8382      |
|North 24 Parganas|6108      |
|Gaya             |4796      |
|Jalpaiguri       |4428      |
|Paschim Medinipur|3965      |
|Howrah           |3516      |
|Bhagalpur        |3472      |
|Budaun           |2905      |
|Banka            |2882      |
+-----------------+----------+
only showing top 10 rows
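
The male and female rankings repeat the same filter, group, sum and sort steps. One possible alternative, sketched here in plain Python on made-up rows, totals by (gender, district) in a single pass and then takes the top N per gender. `top_districts_by_gender` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict
import heapq

# Made-up rows (gender, district, aadhaar_generated); for illustration only.
rows = [
    ("M", "Patna", 5), ("F", "Patna", 2), ("M", "Gaya", 3),
    ("F", "Gaya", 4), ("M", "Patna", 1), ("F", "Nadia", 6),
]


def top_districts_by_gender(rows, n):
    """Return {gender: [(district, total), ...]} with the n largest totals each."""
    totals = defaultdict(lambda: defaultdict(int))
    for gender, district, generated in rows:
        totals[gender][district] += generated
    return {
        gender: heapq.nlargest(n, per_district.items(), key=lambda kv: kv[1])
        for gender, per_district in totals.items()
    }


top = top_districts_by_gender(rows, n=2)
print(top)
```

In Spark, a comparable single pass would group by both `Gender` and `District` before ranking within each gender.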



# Create a dataframe with the 10 states that generated the fewest Aadhaar

In [None]:
le_state=aad.groupBy("State") #grouping data state wise

le=le_state.agg(sum("Aadhaar_generated").alias("sum_aadhar")) #sum of all aadhar generated from the grouped data

le1=le.sort(asc("sum_aadhar")) #sorting in ascending order

print('10 states with the fewest Aadhaar generated:')
le1.show(10, truncate=False)

10 states with the fewest Aadhaar generated:
+---------------------------+----------+
|State                      |sum_aadhar|
+---------------------------+----------+
|Lakshadweep                |4         |
|Andaman and Nicobar Islands|5         |
|Others                     |12        |
|Sikkim                     |50        |
|Puducherry                 |83        |
|Daman and Diu              |105       |
|Dadra and Nagar Haveli     |140       |
|Chandigarh                 |259       |
|Meghalaya                  |277       |
|Nagaland                   |545       |
+---------------------------+----------+
only showing top 10 rows
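
The result above includes an `Others` row, which is a catch-all label and not a state. If such labels should be left out of the ranking, a sketch along these lines could be used. The totals are made up, `least_states` is a hypothetical helper, and treating `Others` as the only placeholder is an assumption.

```python
import heapq

# Made-up state totals; "Others" stands in for a catch-all label.
state_totals = {
    "Sikkim": 50, "Others": 12, "Goa": 700, "Lakshadweep": 4, "Bihar": 9000,
}


def least_states(totals, n, exclude=("Others",)):
    """Return the n smallest (state, total) pairs, skipping excluded labels."""
    named = {state: total for state, total in totals.items() if state not in exclude}
    return heapq.nsmallest(n, named.items(), key=lambda kv: kv[1])


least = least_states(state_totals, n=2)
print(least)
```

Passing `exclude=()` keeps every label and reproduces the plain ascending sort.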



# For which age were the most Aadhaar enrolments declined?

In [None]:
age_df = aad.filter(col("Aadhaar_generated") == 0)  # keep only rows where no Aadhaar was generated

agdf1 = age_df.groupBy("Age")  # group the filtered rows by age

agdf2 = agdf1.agg(count("Aadhaar_generated").alias("count_aadhar"))  # count such rows per age

age_df_m = agdf2.sort(desc("count_aadhar"))  # sort by count, largest first

print('Ages with the most records where no Aadhaar was generated:')

age_df_m.show(15, truncate=False)  # show the top 15 ages

Ages with the most records where no Aadhaar was generated:
+---+------------+
|Age|count_aadhar|
+---+------------+
|4  |1729        |
|3  |1492        |
|2  |1389        |
|1  |1294        |
|0  |1087        |
|5  |863         |
|6  |794         |
|7  |724         |
|8  |612         |
|9  |529         |
|10 |500         |
|11 |403         |
|12 |344         |
|13 |298         |
|18 |283         |
+---+------------+
only showing top 15 rows
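
The cell above measures "declined" as the number of rows with `Aadhaar_generated == 0`. Another reading would sum the `Enrolment_Rejected` column per age, and the two measures need not agree. The plain-Python sketch below contrasts them on made-up rows; both function names are hypothetical, and it makes no claim about which definition suits this dataset better.

```python
from collections import Counter

# Made-up rows (age, aadhaar_generated, enrolment_rejected); for illustration only.
rows = [
    (4, 0, 1), (4, 0, 2), (3, 0, 1),
    (4, 1, 0), (3, 1, 3), (18, 0, 0),
]


def zero_generation_rows_by_age(rows):
    """Count rows per age where no Aadhaar was generated (the notebook's measure)."""
    return Counter(age for age, generated, _ in rows if generated == 0)


def rejections_by_age(rows):
    """Sum the rejected-enrolment column per age (an alternative measure)."""
    totals = Counter()
    for age, _, rejected in rows:
        totals[age] += rejected
    return totals


zero_rows = zero_generation_rows_by_age(rows)
rejected = rejections_by_age(rows)
print(zero_rows.most_common(1), rejected.most_common(1))
```

On these toy rows the first measure ranks age 4 highest while the second ranks age 3 highest, which is why the chosen definition is worth stating alongside the result.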



# **CONCLUSION**

1. CSC SPV generated the most Aadhaar of any enrolment agency, with a total of 173192.

2. Bhagalpur is the district with the most Aadhaar generated for males (11007), followed by South 24 Parganas (7825).

3. Barddhaman is the district with the most Aadhaar generated for females (9744), followed by South 24 Parganas (8382).

4. Lakshadweep has the fewest Aadhaar generated of any state or territory (4), followed by Andaman and Nicobar Islands (5).

5. Age 4 has the most records where no Aadhaar was generated (1729), followed by age 3 (1492).