### Different Ways to Read Data into PySpark

PySpark can read data from many sources, including file formats such as CSV, JSON, Parquet, and ORC, and relational databases such as SQL Server over JDBC.


**1. Reading CSV File:**

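PySpark provides spark.read.csv() to load data from a CSV file. In the call below, header="true" uses the first row as column names, and inferSchema="true" asks Spark to infer each column's type from the data.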

In [0]:
csv_df = spark.read.csv("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee_data.csv", header="true", inferSchema="true")

csv_df.show()

+-----------+------------+----------+------+------------+---------+
|employee_id|        name|department|salary|joining_date| location|
+-----------+------------+----------+------+------------+---------+
|        101|Rohan Sharma|        IT| 75000|  2020-05-12|Bangalore|
|        102|  Priya Iyer|        HR| 65000|  2019-08-25|    Delhi|
|        103|Rajesh Kumar|   Finance| 80000|  2021-03-15|   Mumbai|
|        104| Sneha Patil|        IT| 78000|  2018-07-30|     Pune|
|        105| Amit Sharma| Marketing| 72000|  2022-01-10|Hyderabad|
|        106|  Ananya Das|        HR| 67000|  2017-11-20|  Kolkata|
|        107|Vikram Singh|   Finance| 85000|  2023-06-05|  Chennai|
|        108| Rohit Verma|        IT| 76000|  2020-09-18|Bangalore|
|        109| Arjun Mehta| Marketing| 73000|  2019-12-11|    Delhi|
|        110| Rohish Zade|   Finance| 81000|  2016-04-22|   Mumbai|
+-----------+------------+----------+------+------------+---------+
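inferSchema makes Spark scan the data to guess column types. When the layout is known in advance, the schema can be passed explicitly instead. The following is a minimal sketch under assumptions: the column names come from the output above, the column types (in particular DATE for joining_date) are guesses about this file, and `read_employee_csv` is a hypothetical helper name.

```python
# Column names copied from the show() output above; the types are
# assumptions about this file, not something the source confirms.
EMPLOYEE_COLUMNS = [
    ("employee_id", "INT"),
    ("name", "STRING"),
    ("department", "STRING"),
    ("salary", "INT"),
    ("joining_date", "DATE"),
    ("location", "STRING"),
]

# spark.read.csv() accepts a DDL-formatted string as its schema argument.
EMPLOYEE_SCHEMA = ", ".join(f"{name} {dtype}" for name, dtype in EMPLOYEE_COLUMNS)


def read_employee_csv(spark, path):
    """Read the CSV with a fixed schema instead of inferSchema.

    `spark` is an existing SparkSession, as in the notebook cells above.
    """
    return spark.read.csv(path, header=True, schema=EMPLOYEE_SCHEMA)
```

Calling `read_employee_csv(spark, "<path to employee_data.csv>").printSchema()` would then show the declared types rather than inferred ones.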



In [0]:
# csv_df.write.format("parquet").save("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/parquet_file")

**2. Reading JSON File:**

To read data from a JSON file, use spark.read.json(). By default it expects line-delimited JSON, with one complete object per line.

In [0]:
json_df = spark.read.json("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/line_delimited_json.json")
json_df.show()

+---+--------+------+
|age|    name|salary|
+---+--------+------+
| 20|  Manish| 20000|
| 25|  Nikita| 21000|
| 16|  Pritam| 22000|
| 35|Prantosh| 25000|
| 67|  Vikash| 40000|
+---+--------+------+
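The file read above is line-delimited. If a JSON file instead holds one pretty-printed object or a top-level array spanning several lines, the reader's multiLine option is needed. A small sketch, where `read_json` and its `multi_line` parameter are hypothetical names introduced only for illustration:

```python
def read_json(spark, path, multi_line=False):
    """Read JSON Lines by default, or whole-document JSON when multi_line=True.

    `spark` is an existing SparkSession, as in the notebook cells above.
    """
    reader = spark.read
    if multi_line:
        # multiLine tells Spark that a single record may span several lines,
        # e.g. a pretty-printed object or a top-level array of objects.
        reader = reader.option("multiLine", True)
    return reader.json(path)
```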



In [0]:
# json_df.write.format("orc").save("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/orc_file")

**3. Reading Parquet File:**

Parquet is a columnar file format, and PySpark provides read.parquet() to load it.

In [0]:
parquet_df = spark.read.parquet("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/data.parquet")
parquet_df.show()

+-----------+------------+----------+------+------------+---------+
|employee_id|        name|department|salary|joining_date| location|
+-----------+------------+----------+------+------------+---------+
|        101|Rohan Sharma|        IT| 75000|  2020-05-12|Bangalore|
|        102|  Priya Iyer|        HR| 65000|  2019-08-25|    Delhi|
|        103|Rajesh Kumar|   Finance| 80000|  2021-03-15|   Mumbai|
|        104| Sneha Patil|        IT| 78000|  2018-07-30|     Pune|
|        105| Amit Sharma| Marketing| 72000|  2022-01-10|Hyderabad|
|        106|  Ananya Das|        HR| 67000|  2017-11-20|  Kolkata|
|        107|Vikram Singh|   Finance| 85000|  2023-06-05|  Chennai|
|        108| Rohit Verma|        IT| 76000|  2020-09-18|Bangalore|
|        109| Arjun Mehta| Marketing| 73000|  2019-12-11|    Delhi|
|        110| Rohish Zade|   Finance| 81000|  2016-04-22|   Mumbai|
+-----------+------------+----------+------+------------+---------+
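Because Parquet is columnar, selecting only the columns you need lets Spark avoid reading the others from disk. A minimal sketch of that pattern; `read_parquet_columns` is a hypothetical helper, and the column names in the usage note are taken from the output above:

```python
def read_parquet_columns(spark, path, columns):
    """Load a Parquet dataset and keep only the requested columns.

    `spark` is an existing SparkSession, as in the notebook cells above.
    """
    # Parquet stores data column by column, so Spark can skip the
    # unselected columns when it scans the files.
    return spark.read.parquet(path).select(*columns)
```

For example, `read_parquet_columns(spark, "<path to data.parquet>", ["name", "salary"]).show()` would display just those two columns.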



**4. Reading ORC File:**

ORC (Optimized Row Columnar) is a columnar file format commonly used with Hive for efficient storage and processing of large datasets in Hadoop.
You can read ORC files using spark.read.orc().

In [0]:
orc_df = spark.read.orc("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/orc_data.orc")
orc_df.show()

+---+--------+------+
|age|    name|salary|
+---+--------+------+
| 20|  Manish| 20000|
| 25|  Nikita| 21000|
| 16|  Pritam| 22000|
| 35|Prantosh| 25000|
| 67|  Vikash| 40000|
+---+--------+------+
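The shorthand readers read.csv(), read.json(), read.parquet(), and read.orc() can also be written through the generic format(...).load(...) interface, which is handy when the format is only known at runtime. A sketch, with `read_with_format` as a hypothetical helper name:

```python
def read_with_format(spark, fmt, path, **options):
    """Read `path` using the data source named by `fmt`.

    `fmt` is a built-in source name such as "csv", "json", "parquet" or "orc".
    Extra keyword arguments are passed through as reader options.
    `spark` is an existing SparkSession, as in the notebook cells above.
    """
    return spark.read.format(fmt).options(**options).load(path)
```

With this helper, `read_with_format(spark, "orc", "<path to orc_data.orc>")` is equivalent to calling `spark.read.orc(...)` on the same path, and `read_with_format(spark, "csv", "<path>", header=True)` mirrors the CSV example.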



**5. Reading Data from SQL Server (JDBC):**

You can also read data from relational databases such as SQL Server using the JDBC connector. The matching JDBC driver must be available on the cluster's classpath.


In [0]:
# Values in angle brackets are placeholders and must be replaced with real ones.
sqlserver_df = spark.read.format("jdbc") \
                    .option("url", "jdbc:sqlserver://<user>:1433;databaseName=rohish_zade") \
                    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
                    .option("dbtable", "rohish_zade.exams") \
                    .option("user", "<user>") \
                    .option("password", "<password>") \
                    .load()
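Hard-coding connection details in a notebook is easy to get wrong and risks leaking credentials. One possible approach, sketched under assumptions, is to build the JDBC options in a small function and pull secrets from environment variables. The helper names and the `SQLSERVER_*` variable names below are hypothetical, not part of PySpark or the original notebook.

```python
import os


def sqlserver_jdbc_options(host, database, table, user, password, port=1433):
    """Build the option dict for Spark's JDBC reader against SQL Server."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_sqlserver_table(spark, database, table):
    """Read one table, taking host and credentials from the environment.

    The SQLSERVER_* variable names are an assumption made for this sketch.
    `spark` is an existing SparkSession, as in the notebook cells above.
    """
    opts = sqlserver_jdbc_options(
        host=os.environ["SQLSERVER_HOST"],
        database=database,
        table=table,
        user=os.environ["SQLSERVER_USER"],
        password=os.environ["SQLSERVER_PASSWORD"],
    )
    return spark.read.format("jdbc").options(**opts).load()
```

With those variables set, `read_sqlserver_table(spark, "rohish_zade", "rohish_zade.exams")` would issue the same read as the cell above.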