# PySpark


## Installing the required PySpark library

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=ba37a4a93d8a71518b4a7562329f752015a8bab594259b3bff7c35c51917ccb8
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [None]:
# Initializing PySpark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)

# getOrCreate() reuses the SparkContext created above
spark = SparkSession.builder.appName('Aadhaar data Analysis').getOrCreate()

# SQLContext is deprecated since Spark 3.0; the rest of the notebook uses `spark`
sqlContext = SQLContext(sc)



### Reading the CSV file into a DataFrame

In [None]:
aadhar_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/content/UIDAI-ENR-DETAIL-20170308.csv")
)

# Print the first 5 rows
aadhar_df.show(5)


+--------------+--------------------+-------------+---------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|     Registrar|    Enrolment Agency|        State| District|Sub District|Pin Code|Gender|Age|Aadhaar generated|Enrolment Rejected|Residents providing email|Residents providing mobile number|
+--------------+--------------------+-------------+---------+------------+--------+------+---+-----------------+------------------+-------------------------+---------------------------------+
|Allahabad Bank|A-Onerealtors Pvt...|Uttar Pradesh|Allahabad|        Meja|  212303|     F|  7|                1|                 0|                        0|                                1|
|Allahabad Bank|Asha Security Gua...|Uttar Pradesh|Sonbhadra| Robertsganj|  231213|     M|  8|                1|                 0|                        0|                                0|
|Allahabad Bank|   SGS INDIA PVT LTD|Utt

### 1. Create a dataframe with the total Aadhaar generated for each **state**

In [None]:
# Importing functions from pyspark.sql
import pyspark.sql.functions as f

# Group by state and sum the Aadhaar generated
count_by_state = (
    aadhar_df.groupby("State")
    .sum("Aadhaar generated")
    .withColumnRenamed("sum(Aadhaar generated)", "Total Aadhaar generated")
)
# Sort on the renamed column, which is the one present in count_by_state
count_by_state.orderBy(f.desc("Total Aadhaar generated")).show()

+--------------+-----------------------+
|         State|Total Aadhaar generated|
+--------------+-----------------------+
|         Bihar|                 162607|
|   West Bengal|                 119901|
| Uttar Pradesh|                 103767|
|Madhya Pradesh|                  53276|
|     Rajasthan|                  39570|
|       Gujarat|                  34844|
|    Tamil Nadu|                  32485|
|   Maharashtra|                  26085|
|     Karnataka|                  19764|
|        Odisha|                  18182|
|        Kerala|                  15143|
|   Uttarakhand|                  13227|
|     Jharkhand|                   9868|
|         Delhi|                   8426|
|       Haryana|                   6804|
|  Chhattisgarh|                   6604|
|        Punjab|                   6506|
|       Mizoram|                   6279|
|Andhra Pradesh|                   5798|
|     Telangana|                   5018|
+--------------+-----------------------+
only showing top

**The table above shows the total Aadhaar generated in each state, with Bihar highest at 162,607.**


### 2. Create a dataframe with the total Aadhaar generated by each enrolment agency

In [None]:
# Group by Enrolment Agency and sum the Aadhaar generated
count_by_Enrolment_Agency = (
    aadhar_df.groupby("Enrolment Agency")
    .sum("Aadhaar generated")
    .withColumnRenamed("sum(Aadhaar generated)", "Total Aadhaar generated")
)
# Sort on the renamed column and show the top 10 agencies
count_by_Enrolment_Agency.orderBy(f.desc("Total Aadhaar generated")).show(10)

+--------------------+-----------------------+
|    Enrolment Agency|Total Aadhaar generated|
+--------------------+-----------------------+
|             CSC SPV|                 173192|
|           Wipro Ltd|                  39619|
|SREI INFRASTRUCTU...|                  26497|
|SRM Education And...|                  26253|
|        Computer LAB|                  21823|
|Rajcomp Info Serv...|                  20163|
|    MPOnline Limited|                  17020|
|AKSH OPTIFIBRE LI...|                  16624|
|Nielsen  India  P...|                  15993|
|TAMILNADU ARASU C...|                  15981|
+--------------------+-----------------------+
only showing top 10 rows



The ***CSC SPV*** enrolment agency generated the highest number of Aadhaar (173,192).

### 3. Create a dataframe with the top 10 districts by Aadhaar generated for both Male and Female

In [None]:
# Group by District and Gender and sum the Aadhaar generated
count_district_gender = (
    aadhar_df.groupby(["District", "Gender"])
    .sum("Aadhaar generated")
    .withColumnRenamed("sum(Aadhaar generated)", "Total Aadhaar generated")
)
# Sort on the renamed column and show the top 10 rows
count_district_gender.orderBy(f.desc("Total Aadhaar generated")).show(10)

+-----------------+------+-----------------------+
|         District|Gender|Total Aadhaar generated|
+-----------------+------+-----------------------+
|        Bhagalpur|     M|                  11007|
|       Barddhaman|     F|                   9744|
|South 24 Parganas|     F|                   8382|
|South 24 Parganas|     M|                   7825|
|          Katihar|     M|                   6968|
|      Murshidabad|     M|                   6808|
|       Samastipur|     M|                   6195|
|            Patna|     M|                   6191|
|North 24 Parganas|     F|                   6108|
|       Barddhaman|     M|                   6077|
+-----------------+------+-----------------------+
only showing top 10 rows



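The table above lists the 10 district and gender combinations with the most Aadhaar generated. Because the ranking is taken over both genders together, Barddhaman and South 24 Parganas each appear twice (once per gender).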

### 4. Create a dataframe with the total Aadhaar generated for the 10 states with the fewest

In [None]:
# Group by State and show the states with the fewest Aadhaar generated
count_leastState = (
    aadhar_df.groupby("State")
    .sum("Aadhaar generated")
    .withColumnRenamed("sum(Aadhaar generated)", "Total Aadhaar generated")
)
# Sort ascending on the renamed column and show the bottom 10
count_leastState.orderBy(f.asc("Total Aadhaar generated")).show(10)

+--------------------+-----------------------+
|               State|Total Aadhaar generated|
+--------------------+-----------------------+
|         Lakshadweep|                      4|
|Andaman and Nicob...|                      5|
|              Others|                     12|
|              Sikkim|                     50|
|          Puducherry|                     83|
|       Daman and Diu|                    105|
|Dadra and Nagar H...|                    140|
|          Chandigarh|                    259|
|           Meghalaya|                    277|
|            Nagaland|                    545|
+--------------------+-----------------------+
only showing top 10 rows



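These are the 10 entries in the State column with the fewest Aadhaar generated (the list includes union territories and an `Others` category). Lakshadweep is lowest, with 4.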


### 5. For which age were the most Aadhaar enrolments rejected?

In [None]:
# Grouping by Age
count_by_age = aadhar_df.groupby("Age").sum("Enrolment Rejected").orderBy(f.desc("sum(Enrolment Rejected)"))
count_by_age.withColumnRenamed("sum(Enrolment Rejected)","Total Enrolment Rejected").show(15)

+---+------------------------+
|Age|Total Enrolment Rejected|
+---+------------------------+
|  4|                    5673|
|  3|                    3842|
|  2|                    3372|
|  1|                    3333|
|  0|                    3219|
|  5|                    2208|
|  6|                    1931|
|  7|                    1572|
|  8|                    1357|
|  9|                     980|
| 10|                     920|
| 11|                     604|
| 12|                     560|
| 13|                     406|
| 18|                     384|
+---+------------------------+
only showing top 15 rows



Enrolments for applicants aged 4 were rejected most often, with 5,673 rejections.