In [1]:
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import *



In [2]:
spark = SparkSession.builder.getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/22 15:33:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Accessing the diabetes prediction dataset CSV file using the spark.read.csv method.

In [6]:
diabetesPrediction = spark.read.csv('diabetes_prediction_dataset.csv', header=True)
diabetesPrediction

DataFrame[gender: string, age: string, hypertension: string, heart_disease: string, smoking_history: string, bmi: string, HbA1c_level: string, blood_glucose_level: string, diabetes: string]

### As the output above shows, evaluating the DataFrame on its own prints only the column names and types, not a preview of the data. This is because PySpark uses lazy evaluation: transformations only build an execution plan, and computation is triggered only by action methods. Two of the most commonly used actions are .show() and .collect().

In [7]:
diabetesPrediction.show()

+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|gender| age|hypertension|heart_disease|smoking_history|  bmi|HbA1c_level|blood_glucose_level|diabetes|
+------+----+------------+-------------+---------------+-----+-----------+-------------------+--------+
|Female|80.0|           0|            1|          never|25.19|        6.6|                140|       0|
|Female|54.0|           0|            0|        No Info|27.32|        6.6|                 80|       0|
|  Male|28.0|           0|            0|          never|27.32|        5.7|                158|       0|
|Female|36.0|           0|            0|        current|23.45|        5.0|                155|       0|
|  Male|76.0|           1|            1|        current|20.14|        4.8|                155|       0|
|Female|20.0|           0|            0|          never|27.32|        6.6|                 85|       0|
|Female|44.0|           0|            0|          never|19.31|  

### As the output above shows, the .show() method triggered computation on the Spark DataFrame and displays its first 20 rows, which is the default.

### To print the schema of a Spark DataFrame, use the .printSchema() method.

In [8]:
diabetesPrediction.printSchema()

root
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- hypertension: string (nullable = true)
 |-- heart_disease: string (nullable = true)
 |-- smoking_history: string (nullable = true)
 |-- bmi: string (nullable = true)
 |-- HbA1c_level: string (nullable = true)
 |-- blood_glucose_level: string (nullable = true)
 |-- diabetes: string (nullable = true)



### To get summary statistics for a Spark DataFrame, call .describe() to create a summary statistics DataFrame, then .show() to trigger the computation.

In [10]:
diabetesPrediction.describe().show()

+-------+------+-----------------+------------------+------------------+---------------+-----------------+------------------+-------------------+-------------------+
|summary|gender|              age|      hypertension|     heart_disease|smoking_history|              bmi|       HbA1c_level|blood_glucose_level|           diabetes|
+-------+------+-----------------+------------------+------------------+---------------+-----------------+------------------+-------------------+-------------------+
|  count|100000|           100000|            100000|            100000|         100000|           100000|            100000|             100000|             100000|
|   mean|  NULL|41.88585600000013|           0.07485|           0.03942|           NULL|27.32076709999422|5.5275069999983275|          138.05806|              0.085|
| stddev|  NULL|22.51683987161704|0.2631504702289171|0.1945930169980986|           NULL|6.636783416648357|1.0706720918835468|  40.70813604870383|0.27888308976661896|
|   

                                                                                

### The .show() output of the summary statistics DataFrame is visually messy. If you want it to look like a basic pandas DataFrame, convert it with .pandas_api().

In [13]:
diabetesPrediction.describe().pandas_api()

                                                                                

|   | summary | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---------|--------|-----|--------------|---------------|-----------------|-----|-------------|---------------------|----------|
| 0 | count | 100000 | 100000.0 | 100000.0 | 100000.0 | 100000 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| 1 | mean |  | 41.88585600000013 | 0.07485 | 0.03942 |  | 27.32076709999422 | 5.527506999998328 | 138.05806 | 0.085 |
| 2 | stddev |  | 22.51683987161704 | 0.2631504702289171 | 0.1945930169980986 |  | 6.636783416648357 | 1.0706720918835468 | 40.70813604870383 | 0.2788830897666189 |
| 3 | min | Female | 0.08 | 0.0 | 0.0 | No Info | 10.01 | 3.5 | 100.0 | 0.0 |
| 4 | max | Other | 9.0 | 1.0 | 1.0 | not current | 95.69 | 9.0 | 90.0 | 1.0 |
