This post draws parallels between Python and PySpark. If you are already adept in Python, then the move
to PySpark is relatively easy. Here I try to show how similar many of the key constructs are. It is important
to note, though, that Python and PySpark code execute very differently. In Python, each statement is evaluated
immediately. PySpark, on the other hand, distinguishes between transformations (operations performed on dataframes etc.) and actions.
PySpark uses lazy evaluation, i.e. the transformations on the dataframe do not happen immediately. Spark
queues the transformations and evaluates them only when an action such as show(), count() or collect() is performed.
Regardless, it is important to notice the essential similarities between coding in Python (with pandas) and in its
'big brother' PySpark, which between them handle everything from small to large datasets.
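
To make the difference between transformations and actions concrete, here is a minimal plain-Python sketch of lazy evaluation. It is not how Spark is implemented; `LazyPipeline` and `double` are hypothetical names invented for this illustration, and the numbers are made up. The transformations only queue work, and nothing runs until an action is called.

```python
class LazyPipeline:
    """Toy stand-in for a Spark dataframe: queues transformations, runs them on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # queued transformations, not yet executed

    # Transformations: return a new pipeline and do no work
    def filter(self, predicate):
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def map(self, func):
        return LazyPipeline(self._rows, self._steps + (("map", func),))

    def _run(self):
        # Chain the queued steps over the rows using the built-in filter/map
        data = iter(self._rows)
        for kind, fn in self._steps:
            data = filter(fn, data) if kind == "filter" else map(fn, data)
        return data

    # Actions: these trigger the actual evaluation
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []

def double(x):
    calls.append(x)  # record when the function is really executed
    return 2 * x

runs = [15, 59, 8, 41, 100]  # made-up values
pipeline = LazyPipeline(runs).filter(lambda r: r > 40).map(double)
calls_before_action = len(calls)  # nothing has been evaluated yet
result = pipeline.collect()       # the action triggers evaluation
print(calls_before_action, result)
```

In this toy version `double` is not called while the pipeline is being built; it runs only inside `collect()`. A pandas statement, by contrast, does its work as soon as the line executes.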

## 1a. Read CSV file - Python
This is a common operation that will be required whether one uses Python or PySpark. To see how to load a CSV file into Databricks, see this video: [Upload Flat File to Databricks Table](https://www.youtube.com/watch?v=H5LxjaJgpSk)

In [3]:
import pandas as pd
import os
os.getcwd()
#Read CSV file
tendulkar = pd.read_csv("/dbfs/FileStore/tables/tendulkar.csv", header='infer')
#Check the shape of the dataframe
tendulkar.shape



## 1b. Read CSV file - PySpark - Option 1

In [5]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext 
sql=SQLContext.getOrCreate(SparkContext.getOrCreate())
tendulkar1= (sql
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))
tendulkar1.count()

## 1c. Read CSV file - PySpark - Option 2

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV').getOrCreate()
tendulkar2 = spark.read.format('csv').option('header','true').load('/FileStore/tables/tendulkar.csv')
tendulkar2.count()

## 2a. Data frame Shape - Python
Determine the shape of the dataframe.

In [9]:
tendulkar.shape

## 2b. Data frame Shape - PySpark
A PySpark dataframe has no shape attribute, so we have to compute the row count and the number of columns separately.

In [11]:
tendulkar1.count()
len(tendulkar1.columns)
def dfShape(df):
  return(df.count(),len(df.columns))

dfShape(tendulkar1)

## 3a. Data frame columns - Python

In [13]:
tendulkar.columns

## 4a. Rename columns - Python

In [15]:
# Assign a flat list of names; a nested list would create a MultiIndex instead
tendulkar.columns=['Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate']
tendulkar.columns
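
Assigning to `columns` replaces every name by position. pandas also offers `DataFrame.rename(columns=...)`, which renames by name and so reads more like PySpark's `withColumnRenamed`. A small sketch, assuming a tiny made-up frame (`sample` and its values are invented for illustration; only the column names mirror the ones in this post):

```python
import pandas as pd

# Tiny made-up frame; only the column names mirror the ones used in this post
sample = pd.DataFrame({"Runs": [15, 59], "Mins": [28, 254], "BF": [24, 172], "4s": [2, 4]})

# Rename selected columns by name; columns not listed keep their names
renamed = sample.rename(columns={"Mins": "Minutes", "BF": "BallsFaced", "4s": "Fours"})
print(list(renamed.columns))
```

`rename` returns a new dataframe and leaves `sample` unchanged, much as `withColumnRenamed` returns a new PySpark dataframe.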

## 4b. Rename columns - PySpark

In [17]:
# Rename only the columns whose names change; the others are left as they are
tendulkar1 = tendulkar1.withColumnRenamed("Mins", "Minutes")\
                       .withColumnRenamed("BF", "BallsFaced")\
                       .withColumnRenamed("4s", "Fours")\
                       .withColumnRenamed("6s", "Sixes")\
                       .withColumnRenamed("SR", "StrikeRate")


## 5a. Dtypes - Python

In [19]:
tendulkar.dtypes

## 5b. Dtypes - PySpark

In [21]:
tendulkar1.dtypes

## 5c. printSchema

In [23]:
tendulkar1.printSchema()

## 6a. Select columns - Python

In [25]:
import pandas as pd
import os
os.getcwd()
#Read CSV file
tendulkar = pd.read_csv("/dbfs/FileStore/tables/tendulkar.csv", header='infer')
df=tendulkar[['Runs','Mins','BF']]
df.head(10)

## 6b. Select columns - PySpark

In [27]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext 
sql=SQLContext.getOrCreate(SparkContext.getOrCreate())
tendulkar1= (sql
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))
df1=tendulkar1.select('Runs','Mins','BF')
df1.show(10)

## 7a. Filter rows by criteria - Python

In [29]:
b = tendulkar['Runs'] >50
df = tendulkar[b]
df.head(10)

## 7b. Filter rows by criteria - PySpark

In [31]:
df1=tendulkar1.filter(tendulkar1['Runs']>50)
df1.show(10)

## 8a. Display unique contents of a column - Python

In [33]:
tendulkar = pd.read_csv("/dbfs/FileStore/tables/tendulkar.csv", header='infer')
tendulkar['Runs'].unique()

## 8b. Display unique contents of a column - PySpark

In [35]:
from pyspark.sql.functions import *
sqlContext=SQLContext.getOrCreate(SparkContext.getOrCreate())
tendulkar1= (sqlContext
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))
# distinct() is available on the dataframe itself, so there is no need to drop down to the RDD
tendulkar1.select('Runs').distinct().collect()

## 9a. Aggregate mean, min, max - Python

In [37]:


import pandas as pd
import os
os.getcwd()
tendulkar = pd.read_csv("/dbfs/FileStore/tables/tendulkar.csv", header='infer')
tendulkar.shape
# Remove rows which have DNB
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
tendulkar.shape

# Remove rows which have TDNB
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
tendulkar.shape

# Remove rows where BF is '-'
c= tendulkar.BF != "-"
tendulkar=tendulkar[c]
# Remove the '*' character from Runs (literal match, not a regex)
tendulkar.Runs= tendulkar.Runs.str.replace("*","",regex=False)

#tendulkar.shape
type(tendulkar['Runs'].iloc[0])
tendulkar['Runs']=pd.to_numeric(tendulkar['Runs'])
tendulkar['BF']=pd.to_numeric(tendulkar['BF'])

# Group by ground and compute mean, min and max
df=tendulkar[['Runs','BF','Ground']].groupby('Ground').agg(['mean','min','max'])
df.head(10)

## 9b. Aggregate mean, min, max - PySpark

In [39]:
from pyspark.sql.functions import *
tendulkar1= (sqlContext
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))
# Filter rows which have DNB
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'DNB')
print(dfShape(tendulkar1))

# Filter rows which have TDNB
tendulkar1= tendulkar1.where(tendulkar1['Runs'] != 'TDNB')
print(dfShape(tendulkar1))

# Replace * with "" and cast Runs to integer so that min and max compare numbers, not strings
tendulkar1 = tendulkar1.withColumn('Runs', regexp_replace('Runs', '[*]', '').cast('int'))
tendulkar1.select('Runs').distinct().collect()

from pyspark.sql import functions as F

#Group by ground and compute mean, min and max
df=tendulkar1.groupBy('Ground').agg(F.mean('Runs'),F.min('Runs'),F.max('Runs'))
df.show()
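
A caveat for the aggregation step in both versions: a column containing entries such as DNB or 100* is read as text, and min and max on text compare character by character rather than by value. The column therefore needs to be numeric before aggregating. A plain-Python sketch of the difference (the values are made up for illustration):

```python
runs_as_text = ["9", "100", "45"]  # made-up values held as strings

# String comparison goes character by character, so "9" sorts after "100"
text_max = max(runs_as_text)
text_min = min(runs_as_text)

# Converting to integers first gives the numeric answer
numeric_max = max(int(r) for r in runs_as_text)
numeric_min = min(int(r) for r in runs_as_text)

print(text_max, text_min, numeric_max, numeric_min)
```

The same ordering rules apply to a string column in a dataframe, which is why the conversion to a numeric type comes before the group by.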

## 10. Using SQL with PySpark

In [41]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Read CSV').getOrCreate()
tendulkar1.createOrReplaceTempView("tendulkarTbl")
df = spark.sql("SELECT * FROM tendulkarTbl")
df.show(3)
df1 = spark.sql("SELECT * FROM tendulkarTbl where Ground='Karachi'")
df1.show()

## 11a. Apply lambda - Python

In [43]:
import pandas as pd
import numpy as np
import os
os.getcwd()
tendulkar = pd.read_csv("/dbfs/FileStore/tables/tendulkar.csv", header='infer')
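# convert_objects() has been removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN
tendulkar['4s']=pd.to_numeric(tendulkar['4s'], errors='coerce')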
a=tendulkar['4s'].apply(lambda x: x*4)
a
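
For simple arithmetic like this, pandas can also operate on the whole column at once, without a lambda. A small sketch, assuming a made-up series (`fours` and its values are invented for illustration, with "-" standing in for a missing entry):

```python
import pandas as pd

fours = pd.Series(["2", "4", "-", "10"])       # made-up values; "-" marks a missing entry
fours = pd.to_numeric(fours, errors="coerce")  # "-" becomes NaN

via_apply = fours.apply(lambda x: x * 4)  # element by element, as in the cell above
via_vector = fours * 4                    # the same arithmetic on the whole column

print(via_apply.equals(via_vector))
```

Both give the same result; `apply` with a lambda earns its keep when the per-element logic cannot be written as a column expression.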

## 11b. Apply lambda - PySpark

In [45]:
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
tendulkar1= (sqlContext
         .read.format("com.databricks.spark.csv")
         .options(delimiter=',', header='true', inferschema='true')
         .load("/FileStore/tables/tendulkar.csv"))

tendulkar1 = tendulkar1.withColumn("4s", tendulkar1["4s"].cast(IntegerType()))

# Drop rows where 4s is null so the UDF only sees integers
tendulkar2= tendulkar1.where(col("4s").isNotNull())
# Wrap the lambda in a UDF that returns an integer
fours = udf(lambda x: 4*x, IntegerType())
users = tendulkar2.withColumn("4s",fours(tendulkar2["4s"]))
users.show(5)

## 12. Conclusion
The above post shows some of the most important equivalent constructs in Python and PySpark. I will be adding more constructs soon.
Watch this space.