#Install Java / Spark / Hadoop

##Apache Spark with Google Colab


Setting up Spark 2.4.7 with all of its dependencies on Google Colab.

* Installing Java in the Google Colaboratory
* Setting up Spark 2.4.7  in the Google Colaboratory
* A test example




## Setting up Spark 2.4.7 in the Google Colaboratory

This notebook contains the instructions for running PySpark on Google Colab.

We will install the following tools:

* Java 8
* Spark 2.4.7
* Hadoop 2.7
* [Findspark](https://github.com/minrk/findspark)


> Make sure the Spark version you are downloading is available at the target link.
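
Download mirrors tend to drop older releases, so it can help to build the URL in one place and check it before calling `wget`. The sketch below is only an illustration: the helper names `spark_tarball_url` and `url_is_available` are hypothetical, and using `archive.apache.org` as a fallback location is an assumption that should be verified for the version you need.

```python
import urllib.request
import urllib.error

def spark_tarball_url(base, spark_version="2.4.7", hadoop_version="2.7"):
    """Build the URL of a Spark binary tarball under a given base directory."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"
    return f"{base.rstrip('/')}/spark-{spark_version}/{name}"

def url_is_available(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

# The mirror used in this notebook, plus an assumed fallback location.
mirror_url = spark_tarball_url("http://apache.osuosl.org/spark")
archive_url = spark_tarball_url("https://archive.apache.org/dist/spark")
print(mirror_url)
```

In a Colab cell, `url_is_available(mirror_url)` could be called first, switching to `archive_url` if it returns `False`.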



In [None]:
import time

Start=time.time()
# Download and install tools 

# Install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download and Install Spark
!wget  -q http://apache.osuosl.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz

# Install findspark
!pip install -q findspark

print(f"\nIt took {(time.time()-Start)} seconds to install all dependencies for spark to run on Google Colab. \n")



It took 11.867131471633911 seconds to install all dependencies for spark to run on Google Colab. 



In [None]:
# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"
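
If either variable points to a directory that does not exist (for example because the download failed), findspark and Spark will fail later with less obvious errors. A minimal sketch of an early check follows; the helper name `missing_home_dirs` is hypothetical, and a temporary directory stands in for a real install location.

```python
import os
import tempfile

def missing_home_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or point to a non-existent directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Demonstration with a temporary directory standing in for a real install location.
with tempfile.TemporaryDirectory() as existing_dir:
    demo_env = {
        "JAVA_HOME": existing_dir,
        "SPARK_HOME": os.path.join(existing_dir, "no-such-dir"),
    }
    problems = missing_home_dirs(demo_env)

print(problems)  # ['SPARK_HOME']
```

In the notebook itself, `missing_home_dirs(os.environ)` should return an empty list before `findspark.init()` is called.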

## Spark Installation test
Let's test the Spark installation in our Google Colab environment.

In [None]:
import findspark
findspark.init()
findspark.find()

'/content/spark-2.4.7-bin-hadoop2.7'

In [None]:
import findspark
import numpy as np
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Test the spark 
NameList = ['Ahmad', 'Salem', 'Noor', 'Heba']

NumberOfSamples=int(1000)

df = spark.createDataFrame([{"Column1": np.random.randint(1,100), 
                             "Column2":np.random.randint(24,35), 
                             "Column3":np.random.random(),
                             "Name":str(np.random.choice(NameList)),
                             }
                              for i in range(NumberOfSamples)])

df.show(3, False)

#spark.stop()



+-------+-------+------------------+-----+
|Column1|Column2|Column3           |Name |
+-------+-------+------------------+-----+
|91     |34     |0.1263113060031058|Heba |
|94     |34     |0.9957919952267628|Salem|
|22     |31     |0.9227371872529571|Ahmad|
+-------+-------+------------------+-----+
only showing top 3 rows



In [None]:
df.printSchema()

root
 |-- Column1: long (nullable = true)
 |-- Column2: long (nullable = true)
 |-- Column3: double (nullable = true)
 |-- Name: string (nullable = true)



In [None]:
# Check the pyspark version
import pyspark
print(pyspark.__version__)

2.4.7


## Conclusions

In this notebook, we learned how to:

* Install Spark 2.4.7 in Google Colab
* Run some Spark methods at no cost


#Creating a DataFrame

##Creating a DataFrame from a Python List with an Explicit Schema

```python
SparkSession.createDataFrame(data, schema=None)
```

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Test the spark 
NameList = ['Ahmad', 'Salem', 'Noor', 'Heba']

NumberOfSamples=int(1000)

data=[{"Column1": np.random.randint(1,100), 
                             "Column2":np.random.randint(24,35), 
                             "Column3":np.random.random(),
                             "Name":str(np.random.choice(NameList)),
                             }
                              for i in range(NumberOfSamples)]


from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType
schema = StructType([
    StructField("Column1", IntegerType(), True),
    StructField("Column2", IntegerType(), True),
    StructField("Column3", FloatType(), True),
    StructField("Name", StringType(), True),
    StructField("Address", StringType(), True),
])

df = spark.createDataFrame(data,schema=schema)

df.show(3, False)

#spark.stop()

+-------+-------+-----------+-----+-------+
|Column1|Column2|Column3    |Name |Address|
+-------+-------+-----------+-----+-------+
|85     |30     |0.820353   |Heba |null   |
|47     |34     |0.61169714 |Heba |null   |
|39     |31     |0.033866506|Ahmad|null   |
+-------+-------+-----------+-----+-------+
only showing top 3 rows



##Check ```dir(pyspark.sql.types)``` to see which other data types are available


In [None]:
dir(pyspark.sql.types)

['ArrayType',
 'AtomicType',
 'BinaryType',
 'BooleanType',
 'ByteType',
 'CloudPickleSerializer',
 'DataType',
 'DataTypeSingleton',
 'DateConverter',
 'DateType',
 'DatetimeConverter',
 'DecimalType',
 'DoubleType',
 'FloatType',
 'FractionalType',
 'IntegerType',
 'IntegralType',
 'JavaClass',
 'LongType',
 'MapType',
 'NullType',
 'NumericType',
 'Row',
 'ShortType',
 'SparkContext',
 'StringType',
 'StructField',
 'StructType',
 'TimestampType',
 'UserDefinedType',
 '_FIXED_DECIMAL',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_acceptable_types',
 '_all_atomic_types',
 '_all_complex_types',
 '_array_signed_int_typecode_ctype_mappings',
 '_array_type_mappings',
 '_array_unsigned_int_typecode_ctype_mappings',
 '_atomic_types',
 '_check_dataframe_convert_date',
 '_check_dataframe_localize_timestamps',
 '_check_series_convert_date',
 '_check_series_convert_timestamps_internal',
 '_check_series_convert_

##Creating a DataFrame from an RDD

In [None]:
myrdd = spark.sparkContext.parallelize([('Jeff', 48),('Kellie', 45)])
DF=spark.createDataFrame(myrdd)
DF.show()


+------+---+
|    _1| _2|
+------+---+
|  Jeff| 48|
|Kellie| 45|
+------+---+



In [None]:
DF=DF.select(col("_1").alias("name"),col("_2").alias("age"))
DF.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [None]:
DF.columns

['name', 'age']

##Download Covid-19 Dataset 

In [None]:
import pandas as pd 
URL='https://drive.google.com/uc?export=download&id=1tAEeGKNkPvp-NTKFz2pi3FOa_e4K5YnM'
pd.read_csv(URL,parse_dates=['Date'])[['Date', 'Country_Region', 'Difference', 'Case_Type', 'Cases', 'Lat','Long']].to_csv('covid19.csv',index=False)

In [None]:
!head -n 25 covid19.csv

Date,Country_Region,Difference,Case_Type,Cases,Lat,Long
2020-03-09,India,0,Deaths,0,21.0,78.0
2020-03-08,India,0,Deaths,0,21.0,78.0
2020-03-07,India,0,Deaths,0,21.0,78.0
2020-03-06,India,0,Deaths,0,21.0,78.0
2020-03-05,India,0,Deaths,0,21.0,78.0
2020-03-04,India,0,Deaths,0,21.0,78.0
2020-03-03,India,0,Deaths,0,21.0,78.0
2020-03-23,India,3,Deaths,10,21.0,78.0
2020-03-22,India,3,Deaths,7,21.0,78.0
2020-03-21,India,-1,Deaths,4,21.0,78.0
2020-03-20,India,1,Deaths,5,21.0,78.0
2020-03-02,India,0,Deaths,0,21.0,78.0
2020-03-19,India,1,Deaths,4,21.0,78.0
2020-03-18,India,0,Deaths,3,21.0,78.0
2020-03-17,India,1,Deaths,3,21.0,78.0
2020-03-16,India,0,Deaths,2,21.0,78.0
2020-03-15,India,0,Deaths,2,21.0,78.0
2020-03-14,India,0,Deaths,2,21.0,78.0
2020-03-13,India,1,Deaths,2,21.0,78.0
2020-03-12,India,0,Deaths,1,21.0,78.0
2020-03-11,India,1,Deaths,1,21.0,78.0
2020-03-10,India,0,Deaths,0,21.0,78.0
2020-03-01,India,0,Deaths,0,21.0,78.0
2020-02-09,India,0,Deaths,0,21.0,78.0


In [None]:
!cat covid19.csv | wc -l

35503


###Read Data from CSV 

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructField, IntegerType, FloatType, StringType


spark = SparkSession.builder.master("local[*]").getOrCreate()


df = spark.read.csv('covid19.csv')

df.show(3)

#spark.stop()

+----------+--------------+----------+---------+-----+----+----+
|       _c0|           _c1|       _c2|      _c3|  _c4| _c5| _c6|
+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



In [None]:
df = spark.read.csv('covid19.csv',header=True)
df.show(3)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



In [None]:
df.count()

35502

In [None]:
df.describe()

DataFrame[summary: string, Date: string, Country_Region: string, Difference: string, Case_Type: string, Cases: string, Lat: string, Long: string]

In [None]:
df.dtypes

[('Date', 'string'),
 ('Country_Region', 'string'),
 ('Difference', 'string'),
 ('Case_Type', 'string'),
 ('Cases', 'string'),
 ('Lat', 'string'),
 ('Long', 'string')]

In [None]:
df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Country_Region: string (nullable = true)
 |-- Difference: string (nullable = true)
 |-- Case_Type: string (nullable = true)
 |-- Cases: string (nullable = true)
 |-- Lat: string (nullable = true)
 |-- Long: string (nullable = true)



In [None]:
!head -n 2 covid19.csv

Date,Country_Region,Difference,Case_Type,Cases,Lat,Long
2020-03-09,India,0,Deaths,0,21.0,78.0


###Define Schema 

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType, DateType


spark = SparkSession.builder.master("local[*]").getOrCreate()


Myschema = StructType([
    StructField('Date', DateType(), True),
    StructField('Country_Region', StringType(), True),
    StructField('Difference', IntegerType(), True),
    StructField('Case_Type', StringType(), True),
    StructField('Cases', IntegerType(), True),
    StructField('Lat', FloatType(), True),
    StructField('Long', FloatType(), True),
])

df = spark.read.csv('covid19.csv',header=True,schema=Myschema)
df.show(3, False)

+----------+--------------+----------+---------+-----+----+----+
|Date      |Country_Region|Difference|Case_Type|Cases|Lat |Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|India         |0         |Deaths   |0    |21.0|78.0|
|2020-03-08|India         |0         |Deaths   |0    |21.0|78.0|
|2020-03-07|India         |0         |Deaths   |0    |21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



In [None]:
df.dtypes

[('Date', 'date'),
 ('Country_Region', 'string'),
 ('Difference', 'int'),
 ('Case_Type', 'string'),
 ('Cases', 'int'),
 ('Lat', 'float'),
 ('Long', 'float')]

###Read from a Directory That Contains Many CSV Files

In [None]:
!mkdir manycsvs

In [None]:
import pandas as pd 
URL='https://drive.google.com/uc?export=download&id=1tAEeGKNkPvp-NTKFz2pi3FOa_e4K5YnM'
DF_Pandas=pd.read_csv(URL,parse_dates=['Date'])[['Date', 'Country_Region', 'Difference', 'Case_Type', 'Cases', 'Lat','Long']]#.to_csv('covid19.csv',index=False)

for i in range(100):
  DF_Pandas.sample(frac=0.05).to_csv(f'manycsvs/covid19_{i}.csv',index=False)



In [None]:
!ls manycsvs

covid19_0.csv	covid19_28.csv	covid19_46.csv	covid19_64.csv	covid19_82.csv
covid19_10.csv	covid19_29.csv	covid19_47.csv	covid19_65.csv	covid19_83.csv
covid19_11.csv	covid19_2.csv	covid19_48.csv	covid19_66.csv	covid19_84.csv
covid19_12.csv	covid19_30.csv	covid19_49.csv	covid19_67.csv	covid19_85.csv
covid19_13.csv	covid19_31.csv	covid19_4.csv	covid19_68.csv	covid19_86.csv
covid19_14.csv	covid19_32.csv	covid19_50.csv	covid19_69.csv	covid19_87.csv
covid19_15.csv	covid19_33.csv	covid19_51.csv	covid19_6.csv	covid19_88.csv
covid19_16.csv	covid19_34.csv	covid19_52.csv	covid19_70.csv	covid19_89.csv
covid19_17.csv	covid19_35.csv	covid19_53.csv	covid19_71.csv	covid19_8.csv
covid19_18.csv	covid19_36.csv	covid19_54.csv	covid19_72.csv	covid19_90.csv
covid19_19.csv	covid19_37.csv	covid19_55.csv	covid19_73.csv	covid19_91.csv
covid19_1.csv	covid19_38.csv	covid19_56.csv	covid19_74.csv	covid19_92.csv
covid19_20.csv	covid19_39.csv	covid19_57.csv	covid19_75.csv	covid19_93.csv
covid19_21.csv	covid19_3.csv	co

###Consume All Files in One PySpark DataFrame

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType, DateType


spark = SparkSession.builder.master("local[*]").getOrCreate()


Myschema = StructType([
    StructField('Date', DateType(), True),
    StructField('Country_Region', StringType(), True),
    StructField('Difference', IntegerType(), True),
    StructField('Case_Type', StringType(), True),
    StructField('Cases', IntegerType(), True),
    StructField('Lat', FloatType(), True),
    StructField('Long', FloatType(), True),
])

df = spark.read.csv('manycsvs',header=True,schema=Myschema)
df.show(3, False)

+----------+--------------+----------+---------+-----+--------+----------+
|Date      |Country_Region|Difference|Case_Type|Cases|Lat     |Long      |
+----------+--------------+----------+---------+-----+--------+----------+
|2020-03-10|Holy See      |0         |Deaths   |0    |41.9029 |12.4534   |
|2020-02-14|US            |0         |Confirmed|0    |39.46567|-105.47238|
|2020-03-07|US            |0         |Deaths   |0    |39.14685|-76.8196  |
+----------+--------------+----------+---------+-----+--------+----------+
only showing top 3 rows



In [None]:
df.count()

177500
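
The count can be checked by hand. The arithmetic below assumes that pandas' `sample(frac=...)` rounds `frac * len(df)` to the nearest integer, and it reuses the numbers already shown in this notebook.

```python
rows_in_source = 35502      # df.count() of covid19.csv earlier in the notebook
number_of_files = 100       # range(100) in the loop that wrote manycsvs/
sample_fraction = 0.05      # frac passed to DataFrame.sample

# Each file holds a 5% sample, rounded to a whole number of rows.
rows_per_file = round(sample_fraction * rows_in_source)
expected_total = number_of_files * rows_per_file
print(rows_per_file, expected_total)  # 1775 177500
```

Because each file is sampled independently, the combined DataFrame contains many repeated rows, which is why it is larger than the original file.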

###Basic Spark DataFrame Operations 


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType, DateType


spark = SparkSession.builder.master("local[*]").getOrCreate()


Myschema = StructType([
    StructField('Date', DateType(), True),
    StructField('Country_Region', StringType(), True),
    StructField('Difference', IntegerType(), True),
    StructField('Case_Type', StringType(), True),
    StructField('Cases', IntegerType(), True),
    StructField('Lat', FloatType(), True),
    StructField('Long', FloatType(), True),
])

df = spark.read.csv('covid19.csv',header=True,schema=Myschema)
df.show(3, False)

+----------+--------------+----------+---------+-----+----+----+
|Date      |Country_Region|Difference|Case_Type|Cases|Lat |Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|India         |0         |Deaths   |0    |21.0|78.0|
|2020-03-08|India         |0         |Deaths   |0    |21.0|78.0|
|2020-03-07|India         |0         |Deaths   |0    |21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



###show(), collect(), take()

Unlike collect() or take(n), show() only prints the rows and returns None, so its output cannot be captured in a variable.




In [None]:
df.show(n=4, truncate=True)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-06|         India|         0|   Deaths|    0|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 4 rows



In [None]:
x=df.show(n=4, truncate=True)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-06|         India|         0|   Deaths|    0|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 4 rows



In [None]:
print(x)

None


In [None]:
df.take(3)

[Row(Date=datetime.date(2020, 3, 9), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 8), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 7), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0)]

In [None]:
x=df.take(3)

In [None]:
x

[Row(Date=datetime.date(2020, 3, 9), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 8), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 7), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0)]

In [None]:
x=df.collect()
x[1:4]

[Row(Date=datetime.date(2020, 3, 8), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 7), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0),
 Row(Date=datetime.date(2020, 3, 6), Country_Region='India', Difference=0, Case_Type='Deaths', Cases=0, Lat=21.0, Long=78.0)]

### select(*cols)
 




In [None]:
df.columns

['Date', 'Country_Region', 'Difference', 'Case_Type', 'Cases', 'Lat', 'Long']

In [None]:
df.select('Date').show(4)

+----------+
|      Date|
+----------+
|2020-03-09|
|2020-03-08|
|2020-03-07|
|2020-03-06|
+----------+
only showing top 4 rows



In [None]:
df.select('Date','Lat','Long').show(4)

+----------+----+----+
|      Date| Lat|Long|
+----------+----+----+
|2020-03-09|21.0|78.0|
|2020-03-08|21.0|78.0|
|2020-03-07|21.0|78.0|
|2020-03-06|21.0|78.0|
+----------+----+----+
only showing top 4 rows



###drop(col)

In [None]:
df.drop('Date').show(3)

+--------------+----------+---------+-----+----+----+
|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+--------------+----------+---------+-----+----+----+
|         India|         0|   Deaths|    0|21.0|78.0|
|         India|         0|   Deaths|    0|21.0|78.0|
|         India|         0|   Deaths|    0|21.0|78.0|
+--------------+----------+---------+-----+----+----+
only showing top 3 rows



In [None]:
df.drop('Date','Difference').show(3)

+--------------+---------+-----+----+----+
|Country_Region|Case_Type|Cases| Lat|Long|
+--------------+---------+-----+----+----+
|         India|   Deaths|    0|21.0|78.0|
|         India|   Deaths|    0|21.0|78.0|
|         India|   Deaths|    0|21.0|78.0|
+--------------+---------+-----+----+----+
only showing top 3 rows



In [None]:
df1=df.drop('Date','Difference')
df1.show(3)

+--------------+---------+-----+----+----+
|Country_Region|Case_Type|Cases| Lat|Long|
+--------------+---------+-----+----+----+
|         India|   Deaths|    0|21.0|78.0|
|         India|   Deaths|    0|21.0|78.0|
|         India|   Deaths|    0|21.0|78.0|
+--------------+---------+-----+----+----+
only showing top 3 rows



###filter

In [None]:
df.filter(col('Country_Region')=='Jordan').show(3)

+----------+--------------+----------+---------+-----+-----+-----+
|      Date|Country_Region|Difference|Case_Type|Cases|  Lat| Long|
+----------+--------------+----------+---------+-----+-----+-----+
|2020-03-09|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-08|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-07|        Jordan|         0|   Deaths|    0|31.24|36.51|
+----------+--------------+----------+---------+-----+-----+-----+
only showing top 3 rows



In [None]:
df.filter(df['Country_Region']=='Jordan').show(3)

+----------+--------------+----------+---------+-----+-----+-----+
|      Date|Country_Region|Difference|Case_Type|Cases|  Lat| Long|
+----------+--------------+----------+---------+-----+-----+-----+
|2020-03-09|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-08|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-07|        Jordan|         0|   Deaths|    0|31.24|36.51|
+----------+--------------+----------+---------+-----+-----+-----+
only showing top 3 rows



In [None]:
df.filter(df.Country_Region=='Jordan').show(3)

+----------+--------------+----------+---------+-----+-----+-----+
|      Date|Country_Region|Difference|Case_Type|Cases|  Lat| Long|
+----------+--------------+----------+---------+-----+-----+-----+
|2020-03-09|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-08|        Jordan|         0|   Deaths|    0|31.24|36.51|
|2020-03-07|        Jordan|         0|   Deaths|    0|31.24|36.51|
+----------+--------------+----------+---------+-----+-----+-----+
only showing top 3 rows



In [None]:
df.filter(col('Difference')>20).show(3)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-04|         India|        23|Confirmed|   28|21.0|78.0|
|2020-03-23|         India|       103|Confirmed|  499|21.0|78.0|
|2020-03-22|         India|        66|Confirmed|  396|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



In [None]:
cond= (col('Difference')>20) & (col('Country_Region')=='Jordan')
df.filter(cond).show(3)

+----------+--------------+----------+---------+-----+-----+-----+
|      Date|Country_Region|Difference|Case_Type|Cases|  Lat| Long|
+----------+--------------+----------+---------+-----+-----+-----+
|2020-03-22|        Jordan|        27|Confirmed|  112|31.24|36.51|
+----------+--------------+----------+---------+-----+-----+-----+



In [None]:
cond= (col('Difference')>10) & ((col('Country_Region')=='Jordan') | (col('Country_Region')=='Canada'))
df.filter(cond).show(3)

+----------+--------------+----------+---------+-----+-------+---------+
|      Date|Country_Region|Difference|Case_Type|Cases|    Lat|     Long|
+----------+--------------+----------+---------+-----+-------+---------+
|2020-03-23|        Canada|        14|Confirmed|   66|52.9399|-106.4509|
|2020-03-22|        Canada|        26|Confirmed|   52|52.9399|-106.4509|
|2020-03-23|        Canada|       409|Confirmed|  628|52.9399| -73.5491|
+----------+--------------+----------+---------+-----+-------+---------+
only showing top 3 rows



In [None]:
cond= (col('Difference')>10) & ((col('Country_Region')=='Jordan') | (col('Country_Region')=='Canada'))
df.filter(cond).select('Country_Region','Difference','Date').show(40)

+--------------+----------+----------+
|Country_Region|Difference|      Date|
+--------------+----------+----------+
|        Canada|        14|2020-03-23|
|        Canada|        26|2020-03-22|
|        Canada|       409|2020-03-23|
|        Canada|        38|2020-03-22|
|        Canada|        42|2020-03-21|
|        Canada|        18|2020-03-20|
|        Canada|        27|2020-03-19|
|        Canada|        20|2020-03-18|
|        Canada|        24|2020-03-17|
|        Canada|        26|2020-03-16|
|        Canada|        78|2020-03-23|
|        Canada|        48|2020-03-22|
|        Canada|        69|2020-03-21|
|        Canada|        51|2020-03-20|
|        Canada|        36|2020-03-19|
|        Canada|        36|2020-03-18|
|        Canada|        73|2020-03-16|
|        Canada|        25|2020-03-15|
|        Canada|        32|2020-03-13|
|        Canada|        13|2020-03-23|
|        Canada|        15|2020-03-23|
|        Canada|        48|2020-03-23|
|        Canada|       15

> The where() method is an alias for filter(), and the two can be used interchangeably.



In [None]:
df.where(col('Difference')>20).show(3)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-04|         India|        23|Confirmed|   28|21.0|78.0|
|2020-03-23|         India|       103|Confirmed|  499|21.0|78.0|
|2020-03-22|         India|        66|Confirmed|  396|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 3 rows



### distinct()

In [None]:
df.select('Country_Region').distinct().show()

+--------------+
|Country_Region|
+--------------+
|          Chad|
|        Russia|
|      Paraguay|
|       Senegal|
|        Sweden|
|    Cabo Verde|
|        Guyana|
|       Eritrea|
|   Philippines|
|      Djibouti|
|     Singapore|
|      Malaysia|
|          Fiji|
|        Turkey|
|          Iraq|
|       Germany|
|      Cambodia|
|   Afghanistan|
|      Maldives|
|        Jordan|
+--------------+
only showing top 20 rows



In [None]:
df.select('Country_Region').distinct().count()

168

### drop_duplicates()

In [None]:
df.select('Country_Region').drop_duplicates().count()

168

### dropna()

In [None]:
df.select('Country_Region').dropna().count()

35502

In [None]:
df.dropna().count()

35136

In [None]:
df.sample(fraction=0.1,seed=3).toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3546 entries, 0 to 3545
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            3546 non-null   object 
 1   Country_Region  3546 non-null   object 
 2   Difference      3546 non-null   int32  
 3   Case_Type       3546 non-null   object 
 4   Cases           3546 non-null   int32  
 5   Lat             3519 non-null   float32
 6   Long            3519 non-null   float32
dtypes: float32(2), int32(2), object(3)
memory usage: 138.6+ KB


### map() and flatMap()

In [None]:
rdd=df.rdd.map(lambda x: x.Date)
rdd.take(3)

[datetime.date(2020, 3, 9),
 datetime.date(2020, 3, 8),
 datetime.date(2020, 3, 7)]

In [None]:
df.rdd.flatMap?

In [None]:
rdd=df.rdd.flatMap(lambda x: (x))
rdd.take(20)

[datetime.date(2020, 3, 9),
 'India',
 0,
 'Deaths',
 0,
 21.0,
 78.0,
 datetime.date(2020, 3, 8),
 'India',
 0,
 'Deaths',
 0,
 21.0,
 78.0,
 datetime.date(2020, 3, 7),
 'India',
 0,
 'Deaths',
 0,
 21.0]
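
The difference between the two calls can be reproduced in plain Python, without Spark. This is only an analogy: tuples stand in for `Row` objects, and `itertools.chain` plays the role of the flattening step.

```python
from itertools import chain

# Stand-ins for two Row objects; a Row behaves like a tuple when iterated.
rows = [("2020-03-09", "India", 0), ("2020-03-08", "India", 0)]

# map: one output element per input element.
mapped = list(map(lambda row: row[0], rows))

# flatMap: the function returns an iterable, and the results are concatenated.
flat_mapped = list(chain.from_iterable(map(lambda row: row, rows)))

print(mapped)       # ['2020-03-09', '2020-03-08']
print(flat_mapped)  # ['2020-03-09', 'India', 0, '2020-03-08', 'India', 0]
```

This is why `flatMap(lambda x: (x))` above returns individual field values: each row is itself an iterable, so its fields are spliced into one flat sequence.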

In [None]:
df.show(10)

+----------+--------------+----------+---------+-----+----+----+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|
+----------+--------------+----------+---------+-----+----+----+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-06|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-05|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-04|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-03|         India|         0|   Deaths|    0|21.0|78.0|
|2020-03-23|         India|         3|   Deaths|   10|21.0|78.0|
|2020-03-22|         India|         3|   Deaths|    7|21.0|78.0|
|2020-03-21|         India|        -1|   Deaths|    4|21.0|78.0|
+----------+--------------+----------+---------+-----+----+----+
only showing top 10 rows



In [None]:
df.rdd.map(lambda x:x.Date).take(10)

[datetime.date(2020, 3, 9),
 datetime.date(2020, 3, 8),
 datetime.date(2020, 3, 7),
 datetime.date(2020, 3, 6),
 datetime.date(2020, 3, 5),
 datetime.date(2020, 3, 4),
 datetime.date(2020, 3, 3),
 datetime.date(2020, 3, 23),
 datetime.date(2020, 3, 22),
 datetime.date(2020, 3, 21)]

###DataFrame Built-in Functions

In [None]:
import pyspark.sql.functions as F


In [None]:
#dir(F)

In [None]:
df_new = df.withColumn("Scaled_Cases", 2*F.col("Cases"))
df_new.show(10)

+----------+--------------+----------+---------+-----+----+----+------------+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|Scaled_Cases|
+----------+--------------+----------+---------+-----+----+----+------------+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-06|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-05|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-04|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-03|         India|         0|   Deaths|    0|21.0|78.0|           0|
|2020-03-23|         India|         3|   Deaths|   10|21.0|78.0|          20|
|2020-03-22|         India|         3|   Deaths|    7|21.0|78.0|          14|
|2020-03-21|         India|        -1|   Deaths|    4|21.0|78.0|

In [None]:
df_new = df.withColumn("Exp_Cases", F.exp("Cases"))
df_new.show(10)

+----------+--------------+----------+---------+-----+----+----+------------------+
|      Date|Country_Region|Difference|Case_Type|Cases| Lat|Long|         Exp_Cases|
+----------+--------------+----------+---------+-----+----+----+------------------+
|2020-03-09|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-08|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-07|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-06|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-05|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-04|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-03|         India|         0|   Deaths|    0|21.0|78.0|               1.0|
|2020-03-23|         India|         3|   Deaths|   10|21.0|78.0|22026.465794806718|
|2020-03-22|         India|         3|   Deaths|    7|21.0|78.0|1096.6331584