
In [1]:
!pip install pyspark



In [2]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('FilterOps').getOrCreate()

In [3]:
# header=True takes the column names from the first row of the file
df = spark.read.csv('/content/drive/MyDrive/Abc/total-alcohol-consumption-per-capita-litres-of-pure-alcohol.csv', header=True)
df.show(4)

+-----------+----+----+----------------------------------------------------------------------------------------------------+
|     Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+-----------+----+----+----------------------------------------------------------------------------------------------------+
|Afghanistan| AFG|2000|                                                                                               0.003|
|Afghanistan| AFG|2001|                                                                                               0.003|
|Afghanistan| AFG|2002|                                                                                               0.007|
|Afghanistan| AFG|2003|                                                                                               0.016|
+-----------+----+----+----------------------------------------------------------------------------------------------------+
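
Cell 3 passes only `header=True`, so Spark keeps every column as a string (`inferSchema` is off by default), which is why later cells call `.cast('int')` on `Year`. The plain-Python sketch below is not Spark code; it uses the standard `csv` module and two rows copied from outputs shown in this notebook to illustrate how comparing text can differ from comparing numbers. The short `Consumption` header and the variable names are assumptions made for the illustration.

```python
import csv
import io

# Two rows taken from outputs in this notebook; 'Consumption' is a shortened header
sample = "Entity,Code,Year,Consumption\nCanada,CAN,2003,9.5\nIreland,IRL,2019,11.7\n"
rows = list(csv.DictReader(io.StringIO(sample)))

# Every field arrives as text, much like a CSV read without schema inference
all_text = all(isinstance(value, str) for row in rows for value in row.values())

# Text comparison goes character by character, so '9.5' sorts above '11.7'
largest_as_text = max(rows, key=lambda r: r['Consumption'])['Entity']

# Converting to a number first gives the intended ordering
largest_as_number = max(rows, key=lambda r: float(r['Consumption']))['Entity']

print(all_text, largest_as_text, largest_as_number)
```

If typed columns are wanted from the start, passing `inferSchema=True` alongside `header=True` is one option, at the cost of an extra pass over the file.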


In [4]:
df.count()

4185

**Filtering Data**

    Using the Like Operator

In [5]:
df.filter(df.Entity.like('%In%')).show(5)

+------+----+----+----------------------------------------------------------------------------------------------------+
|Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+------+----+----+----------------------------------------------------------------------------------------------------+
| India| IND|2000|                                                                                                1.97|
| India| IND|2001|                                                                                                1.97|
| India| IND|2002|                                                                                                2.11|
| India| IND|2003|                                                                                                2.21|
| India| IND|2004|                                                                                                2.29|
+------+----+----+----------------------------------------------------------------------------------------------------+
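
As a rough mental model of the pattern in cell 5, `%` stands for any run of characters (including none), `_` stands for exactly one character, and the match is case-sensitive. The sketch below is plain Python rather than Spark code: `sql_like` is a hypothetical helper that translates a pattern into a regular expression, and it ignores escape characters, so treat it as an approximation only.

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % matches any run of characters, _ matches one."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    # fullmatch: the pattern must describe the entire value, as LIKE does
    return re.fullmatch(''.join(parts), value, flags=re.DOTALL) is not None

names = ['India', 'Indonesia', 'Finland', 'Argentina']
like_matches = [n for n in names if sql_like(n, '%In%')]
print(like_matches)  # ['India', 'Indonesia']
```

`Finland` and `Argentina` contain a lowercase `in` but no `In`, so they are excluded here.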

    Using Multiple Conditions

In [6]:
# Combine conditions with | (or) and & (and); keep each comparison in its own parentheses
df.filter((df.Entity.like('%In%') | df.Entity.like('%Can%')) & (df.Year == 2015)).show(10)

+---------+----+----+----------------------------------------------------------------------------------------------------+
|   Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+---------+----+----+----------------------------------------------------------------------------------------------------+
|   Canada| CAN|2015|                                                                                                9.92|
|    India| IND|2015|                                                                                                4.96|
|Indonesia| IDN|2015|                                                                                                0.11|
+---------+----+----+----------------------------------------------------------------------------------------------------+
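
The parentheses around `df.Year == 2015` in cell 6 matter because Python gives `&` and `|` higher precedence than `==` and `!=`. The sketch below shows that rule with plain Python values rather than Spark columns; `is_match` is a hypothetical stand-in for a condition such as `df.Entity.like('%In%')`.

```python
year = 2015
is_match = True  # stands in for a condition such as Entity.like('%In%')

# & binds more tightly than ==, so this is read as (is_match & year) == 2015
without_parens = is_match & year == 2015

# Parentheses force the comparison to run first
with_parens = is_match & (year == 2015)

print(without_parens, with_parens)  # False True
```

With Spark columns the same parsing applies, so an unparenthesized comparison typically either fails or filters on something other than what was intended.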



In [7]:
df.filter((df.Entity.like('%In%') | df.Entity.like('%Can%')) & (df.Year != 2015)).show(10)

+------+----+----+----------------------------------------------------------------------------------------------------+
|Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+------+----+----+----------------------------------------------------------------------------------------------------+
|Canada| CAN|2000|                                                                                                9.38|
|Canada| CAN|2001|                                                                                                9.38|
|Canada| CAN|2002|                                                                                                9.45|
|Canada| CAN|2003|                                                                                                 9.5|
|Canada| CAN|2004|                                                                                                 9.6|
|Canada| CAN|2005|                      

    StartsWith, EndsWith, and Contains

In [8]:
df.filter(df.Entity.startswith('In') & (df.Year.cast('int') >= 2016)).show()

+---------+----+----+----------------------------------------------------------------------------------------------------+
|   Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+---------+----+----+----------------------------------------------------------------------------------------------------+
|    India| IND|2016|                                                                                                4.87|
|    India| IND|2017|                                                                                                4.87|
|    India| IND|2018|                                                                                                4.92|
|    India| IND|2019|                                                                                                4.92|
|    India| IND|2020|                                                                                                 4.1|
|Indonesia| IDN|

In [9]:
df.filter(df.Entity.endswith('nd') & (df.Year.cast('int') >= 2019)).show()

+-----------+----+----+----------------------------------------------------------------------------------------------------+
|     Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+-----------+----+----+----------------------------------------------------------------------------------------------------+
|    Finland| FIN|2019|                                                                                                9.16|
|    Finland| FIN|2020|                                                                                                9.08|
|    Iceland| ISL|2019|                                                                                                8.07|
|    Iceland| ISL|2020|                                                                                                7.94|
|    Ireland| IRL|2019|                                                                                                11.7|


In [10]:
# contains() is case-sensitive, so 'in' matches 'Argentina' but not 'India'
matches = df.filter(df.Entity.contains('in') & (df.Year.cast('int') > 2019))
matches.show()
print('Number of records:')
matches.count()

+--------------------+----+----+----------------------------------------------------------------------------------------------------+
|              Entity|Code|Year|Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)|
+--------------------+----+----+----------------------------------------------------------------------------------------------------+
|           Argentina| ARG|2020|                                                                                                8.05|
|             Bahrain| BHR|2020|                                                                                                1.25|
|               Benin| BEN|2020|                                                                                                8.83|
|Bosnia and Herzeg...| BIH|2020|                                                                                                5.87|
|        Burkina Faso| BFA|2020|                              

31
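
The three column methods used in cells 8 to 10 behave much like their plain-Python string counterparts, including case sensitivity. The sketch below uses ordinary Python strings, not Spark columns, with a hand-picked list of entity names that appear in the outputs of this notebook; the variable names are assumptions for illustration.

```python
names = ['India', 'Indonesia', 'Finland', 'Argentina', 'Ireland']

starts_with_In = [n for n in names if n.startswith('In')]
ends_with_nd = [n for n in names if n.endswith('nd')]
contains_in = [n for n in names if 'in' in n]

# Lower-casing first makes the substring test case-insensitive
contains_in_any_case = [n for n in names if 'in' in n.lower()]

print(starts_with_In)        # ['India', 'Indonesia']
print(ends_with_nd)          # ['Finland', 'Ireland']
print(contains_in)           # ['Finland', 'Argentina']
print(contains_in_any_case)  # ['India', 'Indonesia', 'Finland', 'Argentina']
```

In PySpark, a comparable case-insensitive filter could be written by wrapping the column in `pyspark.sql.functions.lower` before calling `contains`.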

**Adding, Renaming, and Dropping Columns in Spark**

In [11]:
from pyspark.sql.functions import lit  # only lit() is used below; a wildcard import would shadow Python built-ins such as sum and max

In [12]:
# Give the long measurement column a short name
df1 = df.withColumnRenamed('Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)', 'Consumption')
df1.show(5)

+-----------+----+----+-----------+
|     Entity|Code|Year|Consumption|
+-----------+----+----+-----------+
|Afghanistan| AFG|2000|      0.003|
|Afghanistan| AFG|2001|      0.003|
|Afghanistan| AFG|2002|      0.007|
|Afghanistan| AFG|2003|      0.016|
|Afghanistan| AFG|2004|      0.021|
+-----------+----+----+-----------+
only showing top 5 rows



In [13]:
df1 = df1.withColumn('consumption age', lit(15))  # lit() turns a Python value into a constant column
df1.show(5)

+-----------+----+----+-----------+---------------+
|     Entity|Code|Year|Consumption|consumption age|
+-----------+----+----+-----------+---------------+
|Afghanistan| AFG|2000|      0.003|             15|
|Afghanistan| AFG|2001|      0.003|             15|
|Afghanistan| AFG|2002|      0.007|             15|
|Afghanistan| AFG|2003|      0.016|             15|
|Afghanistan| AFG|2004|      0.021|             15|
+-----------+----+----+-----------+---------------+
only showing top 5 rows



In [14]:
df1 = df1.drop('consumption age')  # drop() returns a new DataFrame without the named column
df1.show(5)

+-----------+----+----+-----------+
|     Entity|Code|Year|Consumption|
+-----------+----+----+-----------+
|Afghanistan| AFG|2000|      0.003|
|Afghanistan| AFG|2001|      0.003|
|Afghanistan| AFG|2002|      0.007|
|Afghanistan| AFG|2003|      0.016|
|Afghanistan| AFG|2004|      0.021|
+-----------+----+----+-----------+
only showing top 5 rows



**End Of Code**