In [1]:
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [2]:
spark = SparkSession.builder.getOrCreate()


### Accessing the diabetes prediction dataset CSV file using the spark.read.csv method.

In [3]:
diabetesPrediction = spark.read.csv('diabetes_prediction_dataset.csv', header=True)
diabetesPrediction

### As you can see in the output above, calling the dataframe by itself does not show a preview of its rows. This is because PySpark uses lazy evaluation: only action methods on the Spark dataframe trigger the computation. Two of the most commonly used action methods are .show() and .collect().

In [4]:
diabetesPrediction.show()

### As you can see in the above output, the .show() method triggered computation on the Spark dataframe and displays its first 20 rows.

### To print the schema of a Spark dataframe, we should use the .printSchema() method.

In [5]:
diabetesPrediction.printSchema()

### To get the summary statistics out of a spark dataframe, we should call .describe() to create a summary statistics dataframe and .show() to trigger the computation. 

In [6]:
diabetesPrediction.describe().show()

### The output of .show() on the summary statistics dataframe is visually messy. If we want it to render like a regular pandas dataframe, we should use .pandas_api(). This method converts the Spark dataframe to a pandas-on-Spark dataframe, which is very similar to the everyday pandas dataframe. However, the former is distributed, while the latter lives on a single machine.

In [7]:
diabetesPrediction.describe().pandas_api()

### To group by a column's values and apply an aggregate function to get a measure, we should use .groupBy("columnName").agg({"columnName":"aggregationFunction"}).

### Here, we want to calculate the average age for each gender. We should group by the gender column and apply the mean aggregation function to the age column.

In [8]:
genderAvgAge = diabetesPrediction.groupBy(diabetesPrediction.gender).agg({'age':'mean'})
genderAvgAge.show()

### The average age of females is 42.5, while the average age of males is 29.5.

### Getting the columns of a PySpark dataframe is similar to getting the columns of a pandas dataframe. We should use the .columns property, which returns a list of column names.

In [9]:
diabetesPrediction.columns

### To drop duplicate rows, we can use the .dropDuplicates() method, similar to the pandas method.

### To count the null values in a column of the Spark dataframe, we should use the .isNull() method inside a .filter() call and then call .count() on the result.

In [None]:
diabetesPrediction.filter(col("age").isNull()).count()
