In [1]:
import pyspark.pandas as ps
from pyspark.sql import SparkSession
# Import only what is used; a wildcard import would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import col



In [2]:
spark = SparkSession.builder.getOrCreate()


24/04/05 21:26:22 WARN Utils: Your hostname, Vamsees-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.87 instead (on interface en0)
24/04/05 21:26:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/05 21:26:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Accessing the diabetes prediction dataset CSV file using the spark.read.csv method.

In [None]:
diabetesPrediction = spark.read.csv('diabetes_prediction_dataset.csv', header=True)
diabetesPrediction

### As you can see in the output above, calling the dataframe by name does not show a preview of its rows. This is because PySpark uses lazy evaluation: transformations only build up a plan, and computation runs only when an action method is called on the Spark dataframe. Two of the most commonly used action methods are .show() and .collect().

In [None]:
diabetesPrediction.show()

### As you can see in the output above, the .show() method triggered computation on the Spark dataframe and displays the first 20 rows.

### To print the schema of a Spark dataframe, use the .printSchema() method.

In [None]:
diabetesPrediction.printSchema()

### To get summary statistics for a Spark dataframe, call .describe() to create a summary statistics dataframe, then .show() to trigger the computation.

In [None]:
diabetesPrediction.describe().show()

### The .show() output of the summary statistics dataframe is hard to read. To display it like a regular pandas dataframe, use .pandas_api(), which converts the Spark dataframe to a pandas-on-Spark dataframe. This type of dataframe behaves much like an everyday pandas dataframe, except that it is distributed, whereas a pandas dataframe lives on a single machine.

In [None]:
diabetesPrediction.describe().pandas_api()

### To group by a column's values and apply an aggregate function to each group, use .groupBy("columnName").agg({"columnName":"aggregationFunction"}).

### Here, we want to calculate the average age for each gender. We group by the gender column and apply the mean aggregation function to the age column.

In [None]:
genderAvgAge = diabetesPrediction.groupBy(diabetesPrediction.gender).agg({'age':'mean'})
genderAvgAge.show()

### The average age of a female is 42.5, while the average age of a male is 29.5.

### Getting the columns of a PySpark dataframe is similar to getting the columns of a pandas dataframe. Use the .columns property, which returns a list of column names.

In [None]:
diabetesPrediction.columns

### To drop duplicates, we can use the .dropDuplicates() method, similar to drop_duplicates() in pandas.

### To count the null values in a column of a Spark dataframe, use the .isNull() method inside a .filter() call and then call .count() on the result.

In [None]:
diabetesPrediction.filter(col("age").isNull()).count()


### This shows that there are zero null values in the age column. 

### We can use the .withColumn() method to add new columns to the dataframe.

In [None]:
diabetesPrediction.withColumn('age2',diabetesPrediction.age**2).show()

### As you can see, the age2 column is added to the end of the dataframe.

### To filter rows by conditions on their column values, we can use either the .filter() or the .where() method; the two are equivalent.

In [None]:
diabetesPrediction.filter((diabetesPrediction.age > 25) & (diabetesPrediction.age < 35)).show()


In [None]:
diabetesPrediction.where((diabetesPrediction.age > 25) & (diabetesPrediction.age < 35)).show()

### Here, only rows with an age greater than 25 and less than 35 are shown.

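### We can stack the rows of two dataframes by using the union() method. Note that union() matches columns by position, not by name, so when the column order differs, as it does below, unionByName() is the safer choice.

In [None]:
df1 = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", "name"])
# df2 lists its columns in a different order, so match them by name rather than by position.
union_df = df1.unionByName(df2)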

### We can also use SQL queries to operate on Spark dataframes.

### We can select all rows from a DataFrame by using a placeholder like {table1} in the query and passing the DataFrame using keyword arguments, such as table1=diabetesPrediction.

In [None]:
spark.sql("SELECT * FROM {table1}", table1=diabetesPrediction).show()

### The following query displays the age column.

In [18]:
spark.sql("SELECT {table1}.age FROM {table1}", table1=diabetesPrediction).show()

+----+
| age|
+----+
|80.0|
|54.0|
|28.0|
|36.0|
|76.0|
|20.0|
|44.0|
|79.0|
|42.0|
|32.0|
|53.0|
|54.0|
|78.0|
|67.0|
|76.0|
|78.0|
|15.0|
|42.0|
|42.0|
|37.0|
+----+
only showing top 20 rows



### The following query also displays the age column. The difference from the previous query is that it demonstrates how a table alias can be used in PySpark SQL.

In [19]:
spark.sql("SELECT t1.age FROM {table1} t1", table1=diabetesPrediction).show()

+----+
| age|
+----+
|80.0|
|54.0|
|28.0|
|36.0|
|76.0|
|20.0|
|44.0|
|79.0|
|42.0|
|32.0|
|53.0|
|54.0|
|78.0|
|67.0|
|76.0|
|78.0|
|15.0|
|42.0|
|42.0|
|37.0|
+----+
only showing top 20 rows



In [None]:
ageFilter = 25
spark.sql("SELECT t1.age FROM {table1} t1 WHERE t1.age > {ageFilter}", table1=diabetesPrediction, ageFilter=ageFilter).show()