<a href="https://colab.research.google.com/github/srimantapal205/30_DaysPySpark/blob/main/Day_1_Creating_DataFrames_in_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Creating DataFrame Manually with Hardcoded Values

In [27]:
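# Optional: install a specific Python version (not needed on a default Colab runtime)
#!sudo apt-get install python3.7
# Install PySpark into the notebook environment
!pip install pyspark
from pyspark.sql import SparkSession
import pandas as pd
# Reuse the active Spark session if one exists, otherwise start a new one
spark = SparkSession.builder.getOrCreate()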


In [28]:
# Sample data: one tuple per row
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "David"), (5, "Eve")]
column = ["Id", "Name"]
# Create DataFrame
df = spark.createDataFrame(data, column)
# Show DataFrame
df.show()

+---+-------+
| Id|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
|  5|    Eve|
+---+-------+



## 2. Creating DataFrame from Pandas

In [29]:
# Sample pandas DataFrame (reuses data and column from section 1)
pandas_df = pd.DataFrame(data, columns=column)

# Convert to PySpark DataFrame
df_from_Pandas = spark.createDataFrame(pandas_df)
df_from_Pandas.show()

+---+-------+
| Id|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
|  5|    Eve|
+---+-------+



## 3. Creating DataFrame from a List of Dictionaries

In [30]:
# Convert each row tuple into a dictionary keyed by column name
dataDict = [dict(zip(column, values)) for values in data]
print(dataDict)



[{'Id': 1, 'Name': 'Alice'}, {'Id': 2, 'Name': 'Bob'}, {'Id': 3, 'Name': 'Charlie'}, {'Id': 4, 'Name': 'David'}, {'Id': 5, 'Name': 'Eve'}]


In [31]:
df_from_dict = spark.createDataFrame(dataDict)
df_from_dict.show()

+---+-------+
| Id|   Name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
|  5|    Eve|
+---+-------+



## 4. Creating an Empty DataFrame

In [32]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define schema: StructField(name, type, nullable)
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True)
])

# Create an empty DataFrame that carries only the schema
empty_df = spark.createDataFrame([], schema=schema)
empty_df.show()


+---+----+
| ID|Name|
+---+----+
+---+----+



In [33]:
from google.colab import drive

# Mount Google Drive with a longer timeout
drive.mount('/content/drive', force_remount=True, timeout_ms=300000)  # Increased timeout to 5 minutes

Mounted at /content/drive


In [7]:
# Alternative mount without a forced remount:
# from google.colab import drive
# drive.mount('/content/drive')

# Dataset paths used in section 5:
# /content/drive/MyDrive/Colab Notebooks/dataSet/ProductData.csv
# /content/drive/MyDrive/Colab Notebooks/dataSet/ProductData.json
# /content/drive/MyDrive/Colab Notebooks/dataSet/mtcars.parquet

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 5. Creating DataFrame from Structured Data (CSV, JSON, Parquet)

In [48]:
# Read a CSV file into a DataFrame: the first row is the header, column types are inferred
df_csv = spark.read.csv(path="/content/drive/MyDrive/Colab Notebooks/dataSet/ProductData.csv", header=True, inferSchema=True)
# Show the first 10 rows
df_csv.show(10)



+---------+--------------------+--------------------+--------+--------------------+-----------------+------------+-----------+-----------------+------------+-----------+-----------+--------+--------------------+--------+---------------+--------------+
|  p_brand|              p_cate|             p_image|  p_mall|              p_name| p_number_reviews|     p_price|p_rate1star|      p_rate2star| p_rate3star|p_rate4star|p_rate5star|p_rating|              s_name|s_rating|s_response_rate| s_ship_ontime|
+---------+--------------------+--------------------+--------+--------------------+-----------------+------------+-----------+-----------------+------------+-----------+-----------+--------+--------------------+--------+---------------+--------------+
|     Dell|Máy vi tính & Laptop|//vn-live-05.slat...|    Mall|Laptop Dell Inspi...|Không có đánh giá|31.490.000 ₫|          0|                0|           0|          0|          0|     0.0|DELL Official Ret...|     88%|           100%|        

In [49]:
# Read the JSON file as raw text to inspect its layout.
# By default, spark.read.json() expects one JSON object per line, so a
# pretty-printed file spread over many lines is not parsed as intended.
df_raw = spark.read.text("/content/drive/MyDrive/Colab Notebooks/dataSet/ProductData.json")
df_raw.show(truncate=False)

# Earlier attempt with the default line-delimited reader:
# df_json = spark.read.json("/content/drive/MyDrive/Colab Notebooks/dataSet/ProductData.json")
# df_json.show()


+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                               |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[                                                                                                                                                                                                   |
| {                                                                                                                                                                                                  |
|   "

In [46]:
# Reading Parquet file into DataFrame
df_parquet = spark.read.parquet("/content/drive/MyDrive/Colab Notebooks/dataSet/mtcars.parquet")
df_parquet.show()

+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|              model| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|          Mazda RX4|21.0|  6|160.0|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|      Mazda RX4 Wag|21.0|  6|160.0|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|         Datsun 710|22.8|  4|108.0| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|     Hornet 4 Drive|21.4|  6|258.0|110|3.08|3.215|19.44|  1|  0|   3|   1|
|  Hornet Sportabout|18.7|  8|360.0|175|3.15| 3.44|17.02|  0|  0|   3|   2|
|            Valiant|18.1|  6|225.0|105|2.76| 3.46|20.22|  1|  0|   3|   1|
|         Duster 360|14.3|  8|360.0|245|3.21| 3.57|15.84|  0|  0|   3|   4|
|          Merc 240D|24.4|  4|146.7| 62|3.69| 3.19| 20.0|  1|  0|   4|   2|
|           Merc 230|22.8|  4|140.8| 95|3.92| 3.15| 22.9|  1|  0|   4|   2|
|           Merc 280|19.2|  6|167.6|123|3.92| 3.44| 18.3|  1|  0|   4|   4|
|          M