#### data: walmart_ecommerce_product_details.json
#### questions: walmart questions.docx
#### notebook file: walmart_task_solutions.ipynb
#### python file: walmart_task_solutions.py

First, create a Spark application, read the JSON file into a DataFrame, and explore the data.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("walmart").getOrCreate()

In [3]:
# read the walmart_ecommerce_product_details.json file into the dataframe df_walmart

df_walmart = spark.read.json("walmart_ecommerce_product_details.json")

                                                                                

In [4]:
#view the data in the dataframe

# df_walmart.show()

In [5]:
#removing corrupt record column

df_walmart = df_walmart.drop('_corrupt_record')

In [6]:
#view the schema of the dataframe

df_walmart.printSchema()

root
 |-- Available: string (nullable = true)
 |-- Brand: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Crawl Timestamp: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Gtin: string (nullable = true)
 |-- Item Number: string (nullable = true)
 |-- List Price: string (nullable = true)
 |-- Package Size: string (nullable = true)
 |-- Postal Code: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- Product Url: string (nullable = true)
 |-- Sale Price: string (nullable = true)
 |-- Uniq Id: string (nullable = true)



Clean the data: assign appropriate data types to the columns and rename the columns to remove spaces.

In [7]:
#every column was read as a string, so convert each one to a suitable data type using the cast method

from pyspark.sql.types import StringType, FloatType, IntegerType, BooleanType, TimestampType
from pyspark.sql.functions import col

df_walmart = df_walmart\
        .withColumn('Available', col('Available').cast(BooleanType()))\
        .withColumn('Brand', col('Brand').cast(StringType()))\
        .withColumn('Category', col('Category').cast(StringType()))\
        .withColumn('Crawl Timestamp', col('Crawl Timestamp').cast(TimestampType()))\
        .withColumn('Description', col('Description').cast(StringType()))\
        .withColumn('Gtin', col('Gtin').cast(IntegerType()))\
        .withColumn('Item Number', col('Item Number').cast(IntegerType()))\
        .withColumn('List Price', col('List Price').cast(FloatType()))\
        .withColumn('Package Size', col('Package Size').cast(StringType()))\
        .withColumn('Postal Code', col('Postal Code').cast(StringType()))\
        .withColumn('Product Name', col('Product Name').cast(StringType()))\
        .withColumn('Product Url', col('Product Url').cast(StringType()))\
        .withColumn('Sale Price', col('Sale Price').cast(FloatType()))\
        .withColumn('Uniq Id', col('Uniq Id').cast(StringType()))
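
One caveat about the `Gtin` cast: GTINs can be up to 14 digits long, while `IntegerType` is a 32-bit signed integer. In Spark's default (non-ANSI) mode, a string that does not fit the target type is expected to become null after the cast, so values may be lost silently. The plain-Python sketch below shows the range problem with a made-up 14-digit value, not one taken from the dataset. Under that assumption, `LongType` or leaving the column as a string would be safer choices.

```python
# Limits of Spark's IntegerType (32-bit) and LongType (64-bit), both signed.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Made-up 14-digit GTIN-like value, used only for illustration.
sample_gtin = "00012345678905"

as_number = int(sample_gtin)

# Does the numeric value fit each integer type?
fits_integer_type = as_number <= INT32_MAX
fits_long_type = as_number <= INT64_MAX

# A numeric type also discards leading zeros, which a string keeps.
keeps_leading_zeros = str(as_number) == sample_gtin

print(fits_integer_type, fits_long_type, keeps_leading_zeros)
```

Because a GTIN is an identifier rather than a quantity, keeping it as a string avoids both the overflow and the loss of leading zeros.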

In [8]:
#rename columns with spaces in their names

df_walmart = df_walmart\
            .withColumnRenamed('Crawl Timestamp', 'Crawl_Timestamp')\
            .withColumnRenamed('Item Number', 'Item_Number')\
            .withColumnRenamed('List Price', 'List_Price')\
            .withColumnRenamed('Package Size', 'Package_Size')\
            .withColumnRenamed('Postal Code', 'Postal_Code')\
            .withColumnRenamed('Product Name', 'Product_Name')\
            .withColumnRenamed('Product Url', 'Product_Url')\
            .withColumnRenamed('Sale Price', 'Sale_Price')\
            .withColumnRenamed('Uniq Id', 'Uniq_Id')

In [9]:
#view the schema 

df_walmart.printSchema()

root
 |-- Available: boolean (nullable = true)
 |-- Brand: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Crawl_Timestamp: timestamp (nullable = true)
 |-- Description: string (nullable = true)
 |-- Gtin: integer (nullable = true)
 |-- Item_Number: integer (nullable = true)
 |-- List_Price: float (nullable = true)
 |-- Package_Size: string (nullable = true)
 |-- Postal_Code: string (nullable = true)
 |-- Product_Name: string (nullable = true)
 |-- Product_Url: string (nullable = true)
 |-- Sale_Price: float (nullable = true)
 |-- Uniq_Id: string (nullable = true)



In [10]:
#create a separate dataframe for storing product-related data only (as most questions are related to products)

df_product = df_walmart.select(
    ["Product_Name", "Category", "Brand", "Description",
     "List_Price", "Sale_Price", "Available", "Product_Url"])

# df_product.show()
df_product.printSchema()

root
 |-- Product_Name: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Brand: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- List_Price: float (nullable = true)
 |-- Sale_Price: float (nullable = true)
 |-- Available: boolean (nullable = true)
 |-- Product_Url: string (nullable = true)



In [11]:
df_product = df_product.dropDuplicates()

# df_product.orderBy(df_product.Product_Name).show()

In [12]:
#create temporary SQL views for df_walmart and df_product

df_walmart.createOrReplaceTempView("walmart_table")
df_product.createOrReplaceTempView("product_table")

# Questions and Solutions

###       1. Get the Brand along with products associated with it.

In [13]:
#sqlway

brand_products = spark.sql("""
                            SELECT Brand, Product_Name
                            FROM product_table 
                            """)
brand_products.show()



+-------------------+--------------------+
|              Brand|        Product_Name|
+-------------------+--------------------+
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|                Ace|Celebrations Repl...|
|      Box Packaging|Box Packaging Whi...|
|          Cybrtrayd|Dimetrodon Lolly ...|
|   Rosalind Wheeler|Rosalind Wheeler ...|
|         Relief Pak|Digital electric ...|
|   Hawaii Pharm LLC|Foxnut (Euryale F...|
|    CHOSEN SUPPLIES|Replacement for A...|
|      Envelopes.com|A6 Invitation Env...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|      Envelopes.com|6 1/4 x 6 1/4 Gat...|
|      Envelopes.com|6 1/4 x 6 1/4 Gat...|
|          Areo Home|Areo Home Wall Bu...|
|       SheaMoisture|3 Pack - Shea Moi...|
+----------

                                                                                

In [14]:
#dataframe way
#using df_product and window function

import pyspark.sql.functions as f
from pyspark.sql import Window
import pyspark.sql.types as t

In [15]:
windowSpec=Window.partitionBy("Brand")

In [16]:
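# collect every product name of a brand into a "Products" array that is repeated on each row of that brand
brand_products = df_product.withColumn("Products", f.collect_list(f.col("Product_Name")).over(windowSpec))
# only Brand and Product_Name are displayed here; add 'Products' to the select to see the collected list
brand_products.select('Brand', 'Product_Name').show()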

+-------------------+--------------------+
|              Brand|        Product_Name|
+-------------------+--------------------+
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|                Ace|Celebrations Repl...|
|      Box Packaging|Box Packaging Whi...|
|          Cybrtrayd|Dimetrodon Lolly ...|
|   Rosalind Wheeler|Rosalind Wheeler ...|
|         Relief Pak|Digital electric ...|
|   Hawaii Pharm LLC|Foxnut (Euryale F...|
|    CHOSEN SUPPLIES|Replacement for A...|
|      Envelopes.com|A6 Invitation Env...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|Business Essentials|Business Essentia...|
|      Envelopes.com|6 1/4 x 6 1/4 Gat...|
|      Envelopes.com|6 1/4 x 6 1/4 Gat...|
|          Areo Home|Areo Home Wall Bu...|
|       SheaMoisture|3 Pack - Shea Moi...|
+----------

###         2. List all the product names whose list price is greater than the sale price

In [17]:
product_names = spark.sql("""
                          SELECT Product_Name, List_Price, Sale_Price, (List_Price - Sale_Price) as Difference
                          FROM product_table
                          WHERE List_Price > Sale_Price
                            """)

product_names.show()

+--------------------+----------+----------+----------+
|        Product_Name|List_Price|Sale_Price|Difference|
+--------------------+----------+----------+----------+
|Celebrations Repl...|      4.59|       3.9|0.69000006|
|Areo Home Wall Bu...|     81.99|     36.96|     45.03|
|Rayne Mirrors Ame...|    279.99|    259.98|  20.00998|
|Color Reinforced ...|      54.7|     41.98| 12.720001|
|Rayne Mirrors Ame...|    309.99|    289.98|  20.00998|
|Women's Breeze Wa...|     63.99|     44.08|     19.91|
|Metal Faux Rhines...|     15.99|      7.57|      8.42|
|Kinsman Enterpris...|     51.35|      40.0| 11.349998|
|Eros ATT882930 Kn...|    478.63|    354.44|    124.19|
|UBesGoo Large Dog...|     19.99|     13.59| 6.3999996|
|The Maple Guild O...|     52.66|     45.79|  6.869999|
|Bone Shaped Ameri...|     17.37|     13.66|  3.710001|
|Rayne Mirrors Ame...|    770.69|    570.78| 199.90997|
|MAC Gareth Pugh C...|      44.0|     24.99|     19.01|
|UBesGoo Large Dog...|     19.99|     10.59|    

In [18]:
#dfway
from pyspark.sql.functions import col

product_names_df = df_product.filter(col('List_Price') > col('Sale_Price')).select(
    col('Product_Name').alias("Product Names"),
    col('List_Price').alias("List Price"),
    col('Sale_Price').alias("Sale Price"),
    (col('List_Price') - col('Sale_Price')).alias('Difference in Price'),
)

product_names_df.show()

+--------------------+----------+----------+-------------------+
|       Product Names|List Price|Sale Price|Difference in Price|
+--------------------+----------+----------+-------------------+
|Celebrations Repl...|      4.59|       3.9|         0.69000006|
|Areo Home Wall Bu...|     81.99|     36.96|              45.03|
|Rayne Mirrors Ame...|    279.99|    259.98|           20.00998|
|Color Reinforced ...|      54.7|     41.98|          12.720001|
|Rayne Mirrors Ame...|    309.99|    289.98|           20.00998|
|Women's Breeze Wa...|     63.99|     44.08|              19.91|
|Metal Faux Rhines...|     15.99|      7.57|               8.42|
|Kinsman Enterpris...|     51.35|      40.0|          11.349998|
|Eros ATT882930 Kn...|    478.63|    354.44|             124.19|
|UBesGoo Large Dog...|     19.99|     13.59|          6.3999996|
|The Maple Guild O...|     52.66|     45.79|           6.869999|
|Bone Shaped Ameri...|     17.37|     13.66|           3.710001|
|Rayne Mirrors Ame...|   
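
Differences such as `0.69000006` in the tables above come from storing prices as 32-bit floats (`FloatType`), which cannot represent most decimal fractions exactly. The plain-Python sketch below reproduces the effect for the first row (4.59 and 3.9) and compares it with exact decimal arithmetic. The helper `to_float32` is a hypothetical name introduced for illustration. In Spark, casting the price columns to `DecimalType` would be one way to get exact cents, assuming the source values have a fixed number of decimal places.

```python
import struct
from decimal import Decimal


def to_float32(value: float) -> float:
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Prices from the first output row, stored as 32-bit floats.
list_price = to_float32(4.59)
sale_price = to_float32(3.9)

# The subtraction result is rounded to 32 bits again.
float_difference = to_float32(list_price - sale_price)

# Exact decimal arithmetic on the same prices.
decimal_difference = Decimal("4.59") - Decimal("3.90")

print(float_difference)
print(decimal_difference)
```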

###       3. Count the number of product names whose list price is greater than the sale price

In [19]:
#sqlway

product_names.createOrReplaceTempView("product_names_tbl")

number_of_products = spark.sql("""
                                SELECT COUNT(Product_Name) as Number_of_Products
                                FROM product_names_tbl                                
                                """)

number_of_products.show()

+------------------+
|Number_of_Products|
+------------------+
|              1145|
+------------------+



In [20]:
#dfway

product_names_df.count()

1145

###       4. List all the products that belong to the “women” category.

In [21]:
#sqlway

women_products = spark.sql("""
                            SELECT Product_Name, Category
                            FROM product_table
                            WHERE Category LIKE "%Women%" 
                            OR Category LIKE "%women%"
                        """)
women_products.show()

+--------------------+--------------------+
|        Product_Name|            Category|
+--------------------+--------------------+
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Classique 766 Pos...|Clothing|Women|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Handmade Glass Pe...|Jewelry|Womens Je...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|The Wonder Years ...|Clothing|Women|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Ames Walker AW St...|Clothing|W

In [22]:
#dfway

df_product.where(col('Category').contains('Women') | col('Category').contains('women'))\
                .select('Product_Name', 'Category').distinct().show()             

+--------------------+--------------------+
|        Product_Name|            Category|
+--------------------+--------------------+
|Classique 766 Pos...|Clothing|Women|Wo...|
|Owl shaped keycha...|Clothing|Bags & A...|
|Secret Key Hider,...|Clothing|Bags & A...|
|  Stanford Key Chain|Clothing|Bags & A...|
|The Wonder Years ...|Clothing|Women|Wo...|
|Handmade Glass Pe...|Jewelry|Womens Je...|
|Keychain - F-16 F...|Clothing|Bags & A...|
|Diabetic Bamboo Q...|Clothing|Women|Wo...|
|Women's Breeze Wa...|Clothing|Shoes|Wo...|
|Ames Walker AW St...|Clothing|Women|Wo...|
|Celtic Knot Weave...|Clothing|Bags & A...|
|Monarch M-initial...|Clothing|Bags & A...|
|Ramones Plastic K...|Clothing|Bags & A...|
|Instant Gift Phot...|Clothing|Bags & A...|
|Retro Owl Time Ge...|Clothing|Bags & A...|
|KEY-BAK SUPER48 X...|Clothing|Bags & A...|
|Keychain - Soccer...|Clothing|Bags & A...|
|Crew Diabetic Soc...|Clothing|Women|Wo...|
|Gemstone Globe Br...|Clothing|Bags & A...|
|Smart Blonde KC-8...|Clothing|B

###      5. List the products which are not available.

In [23]:
# SQL way
unavailable_products = spark.sql("""
                            SELECT DISTINCT Product_Name, Available AS Availability
                            FROM product_table
                            WHERE Available = FALSE
                        """)

unavailable_products.show()

+--------------------+------------+
|        Product_Name|Availability|
+--------------------+------------+
|Hudson Baby Boy a...|       false|
|MoYo Natural Labs...|       false|
|Chergui Eau De Pa...|       false|
|PB3655R Red 10 In...|       false|
|"Cure For All Dou...|       false|
|Gillette Fusion P...|       false|
|Ebe Women Reading...|       false|
|Revlon Photo Read...|       false|
|MATRIX Total Resu...|       false|
|Solid Dog Polo by...|       false|
|Travel Cosmetic M...|       false|
|Makeup Brushes Pr...|       false|
|Frosted Clear Gla...|       false|
|Toujours Moi Set-...|       false|
|Dimmable 8W MR16 ...|       false|
|NARS Night Series...|       false|
|Holiday Time 8 Ct...|       false|
|DuraGlobe Monthly...|       false|
|Business Essentia...|       false|
|Ebe Prescription ...|       false|
+--------------------+------------+
only showing top 20 rows



In [24]:
#dfway

df_product.filter(df_product['Available'] == False).show()

+--------------------+--------------------+-------------------+--------------------+----------+----------+---------+--------------------+
|        Product_Name|            Category|              Brand|         Description|List_Price|Sale_Price|Available|         Product_Url|
+--------------------+--------------------+-------------------+--------------------+----------+----------+---------+--------------------+
|Business Essentia...|Office|Shipping &...|Business Essentials|Use Business Esse...|     31.16|     31.16|    false|https://www.walma...|
|Business Essentia...|Office|Shipping &...|Business Essentials|Use Business Esse...|     51.23|     51.23|    false|https://www.walma...|
|Business Essentia...|Office|Shipping &...|Business Essentials|Use Business Esse...|     55.63|     55.63|    false|https://www.walma...|
|Rosalind Wheeler ...|Office|Boards & E...|   Rosalind Wheeler|Features: -Made i...|    240.31|    240.31|    false|https://www.walma...|
|6 1/4 x 6 1/4 Gat...|Office|Envel

In [25]:
# df way, showing only the product name and availability

df_product.select(["Product_Name", "Available"]).where(df_product.Available == False).show()

+--------------------+---------+
|        Product_Name|Available|
+--------------------+---------+
|Business Essentia...|    false|
|Business Essentia...|    false|
|Business Essentia...|    false|
|Rosalind Wheeler ...|    false|
|6 1/4 x 6 1/4 Gat...|    false|
|Areo Home Wall Bu...|    false|
|STC211 30 Inch x ...|    false|
|Business Essentia...|    false|
|Business Essentia...|    false|
|Eye Buy Express K...|    false|
|Business Essentia...|    false|
|Business Essentia...|    false|
|Heepo 6 Pcs Smoot...|    false|
|3 Pack - CHI Volu...|    false|
|(3 Pack) CHINA GL...|    false|
|Business Essentia...|    false|
|Oliga Calura Perm...|    false|
|Double Wear Stay ...|    false|
|Poland Spring Nat...|    false|
|Business Essentia...|    false|
+--------------------+---------+
only showing top 20 rows



###         6. Count the number of products which are available.

In [26]:
#sqlway
#counting all products that are available

products_count = spark.sql("""
                            SELECT COUNT(Product_Name)
                            FROM product_table
                            WHERE Available = TRUE""")

products_count.show()

+-------------------+
|count(Product_Name)|
+-------------------+
|               4990|
+-------------------+



In [27]:
#dfway

df_product.filter(df_product['Available']== True).count()

4990

###         7. List the products that are made of nylon.

In [28]:
#sqlway

products = spark.sql("""
                        SELECT DISTINCT Product_Name, Description 
                        FROM product_table 
                        WHERE Description LIKE "%Nylon%" 
                        OR Description LIKE "%nylon%" 
                    """)
products.show()
products.count()

+--------------------+--------------------+
|        Product_Name|         Description|
+--------------------+--------------------+
|Country Brook Des...|Country Brook Des...|
|Plain Nylon Dog C...|Plain Nylon Dog C...|
|Country Brook Des...|Made by hand in t...|
|Ames Walker AW St...|There's a reason ...|
|Pre-Vent II Serie...|Part #36810500 - ...|
|Unique Bargains A...|Description: It i...|
|Country Brook Pet...|Half check collar...|
|Doco DCA201-18XL ...|Puffy air mesh st...|
|Doco DCROPE2072-0...|Reflective rope l...|
|comfortable Flex-...|High quality Magn...|
|Sigvaris Advance ...|Sigvaris Advance ...|
|Pug Life Puppy Ca...|A versatile toile...|
|Extra large Profe...|Extra large Profe...|
|Coastal Pet Nylon...|This lead consist...|
|L'Oreal Paris Inf...|This is the Long-...|
|Country Brook Des...|Country Brook Des...|
|Coastal Pet Produ...|Coastal Pet Produ...|
|Anti-Embolism Sto...||Blue Jay Brand *...|
|Unique Bargains P...|Package Content: ...|
|DJO ProCare Elbow...|DJO ProCar

77

In [29]:
#dfway

df_product.where(col('Description').contains('Nylon') | col('Description').contains('nylon'))\
                .select('Product_Name', 'Description').distinct().show()

#count the matches to confirm the result agrees with the SQL query
df_product.where(col('Description').contains('Nylon') | col('Description').contains('nylon'))\
                .select('Product_Name', 'Description').distinct().count()

+--------------------+--------------------+
|        Product_Name|         Description|
+--------------------+--------------------+
|Country Brook Des...|Country Brook Des...|
|Plain Nylon Dog C...|Plain Nylon Dog C...|
|Country Brook Des...|Made by hand in t...|
|Ames Walker AW St...|There's a reason ...|
|Pre-Vent II Serie...|Part #36810500 - ...|
|Unique Bargains A...|Description: It i...|
|Country Brook Pet...|Half check collar...|
|Doco DCA201-18XL ...|Puffy air mesh st...|
|Doco DCROPE2072-0...|Reflective rope l...|
|comfortable Flex-...|High quality Magn...|
|Sigvaris Advance ...|Sigvaris Advance ...|
|Pug Life Puppy Ca...|A versatile toile...|
|Extra large Profe...|Extra large Profe...|
|Coastal Pet Nylon...|This lead consist...|
|L'Oreal Paris Inf...|This is the Long-...|
|Country Brook Des...|Country Brook Des...|
|Coastal Pet Produ...|Coastal Pet Produ...|
|Anti-Embolism Sto...||Blue Jay Brand *...|
|Unique Bargains P...|Package Content: ...|
|DJO ProCare Elbow...|DJO ProCar

77