# Spark EDA


- Analyze the startup dataset (`Listofstartups.csv`) using Spark and answer the following questions:
> 1. Find which `sector` has the most startups.
> 2. Split `Location of company` into two columns, `state` and `city`. If the state is not present, keep it as null.
> 3. If `Location of company` contains the value `DIAT,Pune`, set `state` to `Maharashtra` and `city` to `DIAT Pune`.
> 4. If `Location of company` contains the value `Ulhasnagar`, set `state` to `Maharashtra` and `city` to `Ulhasnagar`.
> 5. Find which state has the maximum number of startups.
> 6. Find all the startups from `Maharashtra`.
> 7. How many startups were formed in the `Healthcare` sector?
> 8. Display all startups from `Pune` and `Nashik`.
> 9. Sort the cities in `Maharashtra` in descending order of the count of startups.
> 10. How many startups are in South India, that is, the states `Karnataka`, `Tamilnadu`, `Telangana`, `Andhra Pradesh`?
> 11. How many startups are in `Gujarat`?
> 12. How many startups are in North India, that is, states other than `Karnataka`, `Tamilnadu`, `Telangana`, `Andhra Pradesh` and `Maharashtra`?
> 13. What is the percentage of startup initiatives from South India and Maharashtra?
> 14. What is the percentage contribution of startups from Maharashtra?
> 15. What is the percentage contribution of startups from Gujarat?
> 16. Replace null values in `state` with `Unknown`.


In [400]:
# Entry point to Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

#sc = spark.sparkContext

## 1. Find which sector has the most startups

In [401]:

file_path = "Listofstartups.csv"

startup = spark.read.csv(file_path, header=True, inferSchema=True)

startup.createOrReplaceTempView("startup")

# Count startups per sector, largest sector first.
query = '''
SELECT Sector, COUNT(Name_of_startup) AS num_startups_vs_sector
FROM startup
GROUP BY Sector
ORDER BY COUNT(Name_of_startup) DESC
'''
startups_query = spark.sql(query)

startups_query.show(100)



+--------------------+----------------------+
|              Sector|num_startups_vs_sector|
+--------------------+----------------------+
|          Healthcare|                    34|
|            Agritech|                     5|
|Education Technology|                     5|
|     ICT Electronics|                     5|
|                 IOT|                     4|
|      Digital Health|                     4|
|                 IoT|                     4|
|             EduTech|                     3|
|              EdTech|                     3|
| Digital Health Tech|                     3|
|          Healthtech|                     3|
|            Fit-Tech|                     2|
|           Education|                     2|
|              Energy|                     2|
|                  EV|                     2|
|Information Techn...|                     2|
|             Fintech|                     2|
|        Clean Energy|                     2|
|Industrial Automa...|            
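
The output above lists several spellings of what look like the same sector (`IOT` and `IoT`, `EduTech` and `EdTech`), so the per-sector counts are split across variants. The following is a minimal, Spark-free sketch of how label normalisation would change the totals. The counts are copied from the rows printed above; the `ALIASES` table and the `normalize_sector` helper are hypothetical, and which spellings should be merged is a judgment call rather than something the dataset defines.

```python
from collections import Counter

# Counts copied from the printed output above (only the rows used in this sketch).
sector_counts = {
    "Healthcare": 34,
    "Education Technology": 5,
    "IOT": 4,
    "IoT": 4,
    "EduTech": 3,
    "EdTech": 3,
}

# Hypothetical alias table: an assumption about which labels mean the same sector.
ALIASES = {
    "edutech": "edtech",
    "education technology": "edtech",
}


def normalize_sector(label):
    """Collapse whitespace, ignore case, then apply the alias table."""
    key = " ".join(label.split()).casefold()
    return ALIASES.get(key, key)


# Re-aggregate the counts under the normalised labels.
merged = Counter()
for label, count in sector_counts.items():
    merged[normalize_sector(label)] += count

print(merged.most_common())
```

With these rows, `Healthcare` stays the largest sector either way; the merge only changes the ordering further down the list.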

## 2. Split the Location of company into 2 columns, state and city. If state is not present then keep it as null

In [402]:
import pyspark.sql.functions as F
from pyspark.sql.functions import split, col, when, size, trim, lower, initcap

startup = startup.withColumn('splits', F.split(startup["Location of company"], ','))
startup.show(10)

+--------------------+--------------------+--------------------+------------------+--------------------+
|   Incubation_Center|     Name_of_startup| Location of company|            Sector|              splits|
+--------------------+--------------------+--------------------+------------------+--------------------+
|      ABES Ghaziabad|            Suryansh|           New Delhi|            EdTech|         [New Delhi]|
|AIC Banasthali Vi...|Thinkpods Educati...| Satara, Maharashtra|           Ed Tech|[Satara,  Maharas...|
|AIC Banasthali Vi...|Inventiway Soluti...| Mumbai, Maharashtra|           HR Tech|[Mumbai,  Maharas...|
|AIC Banasthali Vi...|C2M Internet Indi...|Lucknow, Uttar Pr...|       Retail Tech|[Lucknow,  Uttar ...|
|AIC Pinnacle Entr...|            Wastinno|   Pune, Maharashtra|       agriculture|[Pune,  Maharashtra]|
|AIC Pinnacle Entr...|Diabetico - Rise ...|   Pune, Maharashtra|        Healthcare|[Pune,  Maharashtra]|
|AIC Pinnacle Entr...|  3DGuru Innovations|   Pune, Mah

In [403]:
# City is the first element of the split; trim whitespace and normalise capitalisation.
startup = startup.withColumn('city', initcap(lower(trim(col('splits').getItem(0)))))
# State is set only when the location has exactly two parts; otherwise when() leaves it null.
startup = startup.withColumn('state', when(size('splits') == 2, initcap(lower(trim(col('splits').getItem(1))))))
startup = startup.drop('splits', 'Location of company')
startup.show(10)

+--------------------+--------------------+------------------+---------+-------------+
|   Incubation_Center|     Name_of_startup|            Sector|     city|        state|
+--------------------+--------------------+------------------+---------+-------------+
|      ABES Ghaziabad|            Suryansh|            EdTech|New Delhi|         null|
|AIC Banasthali Vi...|Thinkpods Educati...|           Ed Tech|   Satara|  Maharashtra|
|AIC Banasthali Vi...|Inventiway Soluti...|           HR Tech|   Mumbai|  Maharashtra|
|AIC Banasthali Vi...|C2M Internet Indi...|       Retail Tech|  Lucknow|Uttar Pradesh|
|AIC Pinnacle Entr...|            Wastinno|       agriculture|     Pune|  Maharashtra|
|AIC Pinnacle Entr...|Diabetico - Rise ...|        Healthcare|     Pune|  Maharashtra|
|AIC Pinnacle Entr...|  3DGuru Innovations|           EduTech|     Pune|  Maharashtra|
|AIC Pinnacle Entr...|     Gupte Education|  Ed Tech, Defence|     Pune|  Maharashtra|
|AIC Pinnacle Entr...|Eldew Digital Pvt...|
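
To sanity-check the split rule on a few hand-written values without starting Spark, the same logic can be mirrored in plain Python. This is only a sketch: `split_location` and `initcap_like` are hypothetical helpers, and `initcap_like` approximates Spark's `initcap` by capitalising each space-separated word.

```python
def initcap_like(text):
    """Approximate Spark's initcap: capitalise each space-separated word."""
    return " ".join(word.capitalize() for word in text.split(" "))


def split_location(location):
    """Return (city, state); state is None unless there are exactly two parts."""
    if location is None:
        return (None, None)
    parts = [part.strip() for part in location.split(",")]
    city = initcap_like(parts[0])
    state = initcap_like(parts[1]) if len(parts) == 2 else None
    return (city, state)


for raw in ["Pune, Maharashtra", "New Delhi", "DIAT,Pune", "Lucknow, Uttar Pradesh"]:
    print(raw, "->", split_location(raw))
```

Note that `DIAT,Pune` comes out as city `Diat` with state `Pune`, which is why the next step has to patch that row by hand.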

## 3. If Location of company column has a data DIAT,Pune then set state as Maharashtra and city as DIAT Pune

In [404]:
# Note: contains('Pune') matches every Pune row, not only the `DIAT,Pune` entry,
# so all Pune startups are relabelled as 'DIAT Pune' in the output.
is_diat_or_pune = col('city').contains('Diat') | col('city').contains('Pune')
startup = startup.withColumn('state', when(is_diat_or_pune, 'Maharashtra').otherwise(startup.state))
startup = startup.withColumn('city', when(is_diat_or_pune, 'DIAT Pune').otherwise(startup.city))
startup.show(200)

+--------------------+--------------------+--------------------+--------------+--------------+
|   Incubation_Center|     Name_of_startup|              Sector|          city|         state|
+--------------------+--------------------+--------------------+--------------+--------------+
|      ABES Ghaziabad|            Suryansh|              EdTech|     New Delhi|          null|
|AIC Banasthali Vi...|Thinkpods Educati...|             Ed Tech|        Satara|   Maharashtra|
|AIC Banasthali Vi...|Inventiway Soluti...|             HR Tech|        Mumbai|   Maharashtra|
|AIC Banasthali Vi...|C2M Internet Indi...|         Retail Tech|       Lucknow| Uttar Pradesh|
|AIC Pinnacle Entr...|            Wastinno|         agriculture|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|Diabetico - Rise ...|          Healthcare|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|  3DGuru Innovations|             EduTech|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|     Gupte Education|    Ed 
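
In the output above, ordinary `Pune` rows are also shown as `DIAT Pune`, because the condition matches any city containing `Pune`. If only the single `DIAT,Pune` entry should be relabelled, a narrower rule could test for the exact pair that the split in step 2 produces for that value (city `Diat`, state `Pune`). The sketch below shows that rule in plain Python; `apply_diat_rule` is a hypothetical helper, and in Spark the equivalent condition would presumably be `(col('city') == 'Diat') & (col('state') == 'Pune')`.

```python
def apply_diat_rule(city, state):
    """Relabel only the row produced by splitting 'DIAT,Pune'; leave others untouched."""
    if city == "Diat" and state == "Pune":
        return ("DIAT Pune", "Maharashtra")
    return (city, state)


# Hand-made rows, not taken from the dataset.
rows = [("Diat", "Pune"), ("Pune", "Maharashtra"), ("Mumbai", "Maharashtra")]
for city, state in rows:
    print((city, state), "->", apply_diat_rule(city, state))
```

Adopting this rule would change the later outputs in this notebook, which were produced with the broader match.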

## 4. If Location of company column has a data Ulhasnagar then set state as Maharashtra and city as Ulhasnagar

In [405]:
startup = startup.withColumn('state', when(startup.city.contains('Ulhasnagar'),'Maharashtra').otherwise(startup.state))
startup.filter(startup.city.contains('Ulhasnagar')).show(10)

+--------------------+--------------------+---------+----------+-----------+
|   Incubation_Center|     Name_of_startup|   Sector|      city|      state|
+--------------------+--------------------+---------+----------+-----------+
|Society for Innov...|Develop Train Mai...|CleanTech|Ulhasnagar|Maharashtra|
+--------------------+--------------------+---------+----------+-----------+



## 5. Find which State has the max number of startups

In [406]:
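# If step 16 has already been run, turn the 'Unknown' placeholder back into null
# so that rows without a state are excluded from the ranking.
startup = startup.withColumn("state", when(col("state") == "Unknown", None).otherwise(col("state")))
db5 = startup.filter(startup.state.isNotNull()).groupby(startup.state).count().orderBy("count", ascending=False)
db5.show(1)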

+---------+-----+
|    state|count|
+---------+-----+
|Karnataka|   35|
+---------+-----+
only showing top 1 row



## 6. Find all the startups from Maharashtra

In [407]:

db6=startup.filter(startup.state=='Maharashtra')
db6.show(200)

+--------------------+--------------------+--------------------+----------+-----------+
|   Incubation_Center|     Name_of_startup|              Sector|      city|      state|
+--------------------+--------------------+--------------------+----------+-----------+
|AIC Banasthali Vi...|Thinkpods Educati...|             Ed Tech|    Satara|Maharashtra|
|AIC Banasthali Vi...|Inventiway Soluti...|             HR Tech|    Mumbai|Maharashtra|
|AIC Pinnacle Entr...|            Wastinno|         agriculture| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Diabetico - Rise ...|          Healthcare| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|  3DGuru Innovations|             EduTech| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|     Gupte Education|    Ed Tech, Defence| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Eldew Digital Pvt...|  IT, Virtual Events| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Secumatic Technol...|             Defense| DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Catalystgr

## 7. How many startups were formed in `Healthcare` sector

In [408]:
# Lists the matching rows; calling .count() on the same filter returns the number directly.
startup.filter(startup.Sector.contains('Healthcare')).show(100)

+--------------------+--------------------+----------+-----------+--------------+
|   Incubation_Center|     Name_of_startup|    Sector|       city|         state|
+--------------------+--------------------+----------+-----------+--------------+
|AIC Pinnacle Entr...|Diabetico - Rise ...|Healthcare|  DIAT Pune|   Maharashtra|
|           AIC@36Inc|           Jivandeep|Healthcare|     Raipur|          null|
|Bio-incubator at ...|Predible Health P...|Healthcare|  Bengaluru|     Karnataka|
|Bio-incubator at ...|    ARQ Solution LLP|Healthcare|     Mumbai|   Maharashtra|
|Centre for Innova...|Rekindle Automati...|Healthcare|    Chennai|     Tamilnadu|
|Centre for Innova...|MedCuore Medical ...|Healthcare|    Chennai|     Tamilnadu|
|Chitkara innovati...|Hackspace Securit...|Healthcare|      Delhi|         Delhi|
|    CIIE Initiatives|             Kidaura|Healthcare|     Nashik|   Maharashtra|
|    CIIE Initiatives|      Pacify Medical|Healthcare|     Mumbai|   Maharashtra|
|    CIIE Initia

In [409]:
type(startup)

pyspark.sql.dataframe.DataFrame

## 8. Display all startups from Pune and Nashik

In [410]:
startup.filter(startup.city.contains('Pune')|startup.city.contains('Nashik')).show(30)

+--------------------+--------------------+--------------------+---------+-----------+
|   Incubation_Center|     Name_of_startup|              Sector|     city|      state|
+--------------------+--------------------+--------------------+---------+-----------+
|AIC Pinnacle Entr...|            Wastinno|         agriculture|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Diabetico - Rise ...|          Healthcare|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|  3DGuru Innovations|             EduTech|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|     Gupte Education|    Ed Tech, Defence|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Eldew Digital Pvt...|  IT, Virtual Events|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Secumatic Technol...|             Defense|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Catalystgreen Pri...|          E-Mobility|DIAT Pune|Maharashtra|
|AIC Pinnacle Entr...|Dynateq Consultin...|Industril Automation|DIAT Pune|Maharashtra|
|    CIIE Initiatives|             Kidaura|

## 9. Sort the cities in Maharashtra in descending order of the count of startups

In [411]:
startup.filter(startup.state.contains('Maharashtra')).groupby(startup.city).count().orderBy("count", ascending=False).show()

+----------+-----+
|      city|count|
+----------+-----+
| DIAT Pune|   14|
|    Mumbai|    9|
|    Nashik|    5|
|    Satara|    2|
|    Nagpur|    2|
|     Thane|    1|
|Ulhasnagar|    1|
+----------+-----+



## 10. How many startups are in South India. That is states Karnataka, Tamilnadu, Telangana, Andhra Pradesh

In [412]:
south_startups=startup.filter(startup.state.contains('Karnataka')|startup.state.contains('Tamilnadu')
               |startup.state.contains('Telangana')|startup.state.contains('Andhra Pradesh')).count()
print('south startups count = ',south_startups)

south startups count =  76


## 11. How many startups are in Gujarat

In [413]:
guj=startup.filter(startup.state.contains('Gujarat')).count()
print('Gujarat startups count = ',guj)

Gujarat startups count =  7


## 12. How many startups are in North India. That is states other than Karnataka, Tamilnadu, Telangana, Andhra Pradesh and Maharashtra

In [414]:
north_startup=startup.filter(~(startup.state.contains('Karnataka')|startup.state.contains('Tamilnadu')
               |startup.state.contains('Telangana')|startup.state.contains('Andhra Pradesh')
               |startup.state.contains('Maharashtra'))).count()
startup_total=startup.count()
print('north startups count = ',north_startup)

north startups count =  63


In [415]:
startup_total=startup.count()
startup_total

241

## 13. What is the percentage of startup initiative from South India and Maharashtra

In [426]:
mands=startup.filter((startup.state.contains('Karnataka')|startup.state.contains('Tamilnadu')
               |startup.state.contains('Telangana')|startup.state.contains('Andhra Pradesh')
               |startup.state.contains('Maharashtra'))).count()

print('South India and Maharashtra startups ratio to startups = ',100*mands/startup_total)

South India and Maharashtra startups ratio to startups =  45.643153526970956
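
The printed figures do not add up to the total: 45.64% of 241 is 110 startups in South India and Maharashtra, step 12 reported 63 for North India, and that leaves 68 rows counted in neither group. A likely explanation is SQL's three-valued logic: for a null `state`, `contains` evaluates to null, negating null still gives null, and `filter` keeps only rows where the condition is true. The sketch below imitates that behaviour in plain Python on a hand-made list; the helper names are hypothetical and the sample is not the real dataset.

```python
def sql_contains(value, needle):
    """Mimic a null-aware contains: the result is unknown (None) for a null value."""
    return None if value is None else needle in value


def sql_or(*terms):
    """Three-valued OR: True wins; otherwise None if any term is unknown."""
    if any(term is True for term in terms):
        return True
    return None if any(term is None for term in terms) else False


def sql_not(term):
    """Three-valued NOT: unknown stays unknown."""
    return None if term is None else not term


SOUTH_AND_MAHA = ["Karnataka", "Tamilnadu", "Telangana", "Andhra Pradesh", "Maharashtra"]


def is_south_or_maha(state):
    return sql_or(*(sql_contains(state, name) for name in SOUTH_AND_MAHA))


# Hand-made sample; None stands for a null state.
states = ["Karnataka", "Gujarat", None, "Maharashtra", None, "Delhi"]

# A filter keeps a row only when the predicate is exactly True.
south_maha = [s for s in states if is_south_or_maha(s) is True]
north = [s for s in states if sql_not(is_south_or_maha(s)) is True]
unmatched = len(states) - len(south_maha) - len(north)

print(len(south_maha), len(north), unmatched)
```

If rows without a state should count as "other than" the listed states, the null case has to be handled explicitly, for example by including an `isNull()` check in the North India filter.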


## 14. What is the percentage contribution of startup from Maharashtra

In [425]:
startup_maha=startup.filter(startup.state.contains('Maharashtra')).count()
print('Maharashtra startups to startups percentage = ',100*startup_maha/startup_total)

Maharashtra startups to startups percentage =  14.107883817427386


## 15. What is the percentage contribution of startup from Gujarat

In [424]:
startup_gur=startup.filter(startup.state.contains('Gujarat')).count()
print('Gujarat startups to startups percentage = ',100*startup_gur/startup_total)

Gujarat startups to startups percentage =  2.904564315352697
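
Steps 13 to 15 repeat the same arithmetic, `100 * part / total`, with different counts. A small helper keeps the formula in one place and rounds the result for display. This is a sketch only: `percentage` is a hypothetical name, the Gujarat count (7) and the total (241) are printed earlier in the notebook, and the other two counts (34 and 110) are back-calculated from the printed percentages.

```python
def percentage(part, total, ndigits=2):
    """Return part as a percentage of total, rounded to ndigits decimals."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return round(100 * part / total, ndigits)


total = 241  # startup_total printed in the notebook
print("South India and Maharashtra:", percentage(110, total))
print("Maharashtra:", percentage(34, total))
print("Gujarat:", percentage(7, total))
```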


## 16. Replace null values in `state` with `Unknown`

In [419]:
startup = startup.withColumn('state', when(startup.state.isNull(),'Unknown').otherwise(startup.state))
startup.show(200)

+--------------------+--------------------+--------------------+--------------+--------------+
|   Incubation_Center|     Name_of_startup|              Sector|          city|         state|
+--------------------+--------------------+--------------------+--------------+--------------+
|      ABES Ghaziabad|            Suryansh|              EdTech|     New Delhi|       Unknown|
|AIC Banasthali Vi...|Thinkpods Educati...|             Ed Tech|        Satara|   Maharashtra|
|AIC Banasthali Vi...|Inventiway Soluti...|             HR Tech|        Mumbai|   Maharashtra|
|AIC Banasthali Vi...|C2M Internet Indi...|         Retail Tech|       Lucknow| Uttar Pradesh|
|AIC Pinnacle Entr...|            Wastinno|         agriculture|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|Diabetico - Rise ...|          Healthcare|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|  3DGuru Innovations|             EduTech|     DIAT Pune|   Maharashtra|
|AIC Pinnacle Entr...|     Gupte Education|    Ed 

In [420]:
startup.filter(startup.state=='Unknown').show(200)

+--------------------+--------------------+--------------------+--------------------+-------+
|   Incubation_Center|     Name_of_startup|              Sector|                city|  state|
+--------------------+--------------------+--------------------+--------------------+-------+
|      ABES Ghaziabad|            Suryansh|              EdTech|           New Delhi|Unknown|
|           AIC@36Inc|               TECHB|3d printer and cn...|              Bhilai|Unknown|
|           AIC@36Inc|Acculegal Service...|Finance, Legal , ...|              Raipur|Unknown|
|           AIC@36Inc|Bastar Se Bazar T...|       Agri-business|        Uttar Bastar|Unknown|
|           AIC@36Inc|          Coshal Art|          Handicraft|              Raipur|Unknown|
|           AIC@36Inc|           Jivandeep|          Healthcare|              Raipur|Unknown|
|           AIC@36Inc|  Binomial Analytics|IT and Technology...|              Raipur|Unknown|
|           AIC@36Inc|              Rawfit|Heathcare & Welln