# Retail Data Analysis 

### Analyze the Retail Purchase Data set 

#### This mini project calculates the following KPIs:
#### 1. Calculate sales breakdown by product category across all of the stores.
#### 2. Calculate sales breakdown by store across all of the stores. Assume there is one store per city.


## 1. Calculate sales breakdown by product category across all of the stores.
#### There are 18 product categories, and we need the total sale value for each of them.

In [1]:
# Import the libraries needed to run Spark
import findspark
findspark.init()
import pyspark
import random
import re,string
import pandas as pd

In [2]:
#Running spark local 
from pyspark import  SparkContext
sc = SparkContext( 'local', 'pyspark')

In [3]:
#Loading the data 
#this is our base RDD
data = sc.textFile("/Users/ravishaggarwal/Desktop/scala_proj/retail/Retail_Sample_Data_Set.csv")

In [4]:
# Check the first 5 records of the data
# Each record contains the fields: Date, Time, City, Product-Category, Sale-Value, Payment-Mode
data.take(5)

["2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex",
 "2012-01-01\t09:00\tFort Worth\tWomen's Clothing\t153.57\tVisa",
 '2012-01-01\t09:00\tSan Diego\tMusic\t66.08\tCash',
 '2012-01-01\t09:00\tPittsburgh\tPet Supplies\t493.51\tDiscover',
 "2012-01-01\t09:00\tOmaha\tChildren's Clothing\t235.63\tMasterCard"]

In [5]:
#Transformation
#tab-splitting the data 
first_transformed_RDD = data.map(lambda x:x.split("\t"))
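
The split above assumes that every line holds six tab-separated fields, and a later step converts the fifth one with `float(...)`, which raises on a malformed row. Below is a minimal sketch of a more defensive parser, assuming that bad rows should be skipped rather than fail the job. `parse_line` is a hypothetical helper, not part of the original notebook, and it runs on plain strings so it can be tried without Spark.

```python
def parse_line(line):
    """Parse one tab-separated record into a tuple, or return None if it is malformed.

    Expected fields: Date, Time, City, Product-Category, Sale-Value, Payment-Mode.
    """
    fields = line.split("\t")
    if len(fields) != 6:
        return None  # wrong number of columns
    try:
        sale_value = float(fields[4])
    except ValueError:
        return None  # Sale-Value is not numeric
    date, time, city, category, _, payment_mode = fields
    return (date, time, city, category, sale_value, payment_mode)


# Sample rows: the first is taken from the data preview, the other two are made up.
good = parse_line("2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex")
short = parse_line("2012-01-01\t09:00\tSan Jose")
non_numeric = parse_line("2012-01-01\t09:00\tSan Jose\tMusic\tN/A\tCash")
```

With Spark, the same function could be applied as `data.map(parse_line).filter(lambda r: r is not None)`.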


In [6]:
# Second transformation
# Product-Category is the 4th column (index 3) and Sale-Value is the 5th column (index 4)
# Build (Product-Category, Sale-Value) pairs, converting Sale-Value to float


second_transformed_RDD = first_transformed_RDD.map(lambda x: (x[3],float(x[4])))

# Note: withColumn() and cast() belong to the DataFrame API and cannot be called on this RDD.

In [7]:
# Show the first 5 (Product-Category, Sale-Value) pairs
second_transformed_RDD.take(5)

[("Men's Clothing", 214.05),
 ("Women's Clothing", 153.57),
 ('Music', 66.08),
 ('Pet Supplies', 493.51),
 ("Children's Clothing", 235.63)]

In [8]:
# Apply reduceByKey
# Key is the Product-Category, value is the Sale-Value
# reduceByKey combines all values that share a key using the given function,
# so here it adds up every Sale-Value for each Product-Category
second_transformed_RDD.reduceByKey(lambda x,y: x+y).take(20)


[("Men's Clothing", 4030.8899999999994),
 ("Women's Clothing", 3736.869999999999),
 ('Music', 2396.4),
 ('Pet Supplies', 2660.83),
 ("Children's Clothing", 2778.21),
 ('Cameras', 2591.27),
 ('Consumer Electronics', 2963.59),
 ('Toys', 3188.18),
 ('Video Games', 2573.3799999999997),
 ('DVDs', 2831.0),
 ('Garden', 1882.25),
 ('Baby', 2034.23),
 ('Books', 3492.7999999999997),
 ('Crafts', 3258.09),
 ('Sporting Goods', 1952.89),
 ('CDs', 2644.51),
 ('Computers', 2102.6600000000003),
 ('Health and Beauty', 2467.32)]
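
Some of the totals above show floating-point noise (for example `4030.8899999999994`), and the categories appear in no particular order. The sketch below is one possible way to tidy the result for reporting: it rounds each total to two decimals and ranks the categories by total sales. It works on a plain Python list copied from the output above; the names `category_totals` and `ranked` are illustrative only.

```python
# Totals copied from the reduceByKey output above.
category_totals = [
    ("Men's Clothing", 4030.8899999999994),
    ("Women's Clothing", 3736.869999999999),
    ("Music", 2396.4),
    ("Pet Supplies", 2660.83),
    ("Children's Clothing", 2778.21),
    ("Cameras", 2591.27),
    ("Consumer Electronics", 2963.59),
    ("Toys", 3188.18),
    ("Video Games", 2573.3799999999997),
    ("DVDs", 2831.0),
    ("Garden", 1882.25),
    ("Baby", 2034.23),
    ("Books", 3492.7999999999997),
    ("Crafts", 3258.09),
    ("Sporting Goods", 1952.89),
    ("CDs", 2644.51),
    ("Computers", 2102.6600000000003),
    ("Health and Beauty", 2467.32),
]

# Round to cents for display, then sort from highest to lowest total.
rounded = [(category, round(total, 2)) for category, total in category_totals]
ranked = sorted(rounded, key=lambda pair: pair[1], reverse=True)

for category, total in ranked:
    print(f"{category:<22}{total:>10.2f}")
```

On the RDD itself, a similar ordering could be obtained with `sortBy(lambda kv: kv[1], ascending=False)` before calling `take` or `collect`.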

## 2. Calculate sales breakdown by store across all of the stores. Assume there is one store per city

#### In this KPI we find the total sales of each store, assuming that there is exactly one store per city.

In [15]:
# Third transformation
# City is the 3rd column (index 2) and Sale-Value is the 5th column (index 4)
# Build (City, Sale-Value) pairs, converting Sale-Value to float
# The city stands in for the store because we assume one store per city


third_transformed_RDD = first_transformed_RDD.map(lambda x: (x[2],float(x[4])))



In [16]:
# Show the first 30 (City, Sale-Value) pairs
third_transformed_RDD.take(30)

[('San Jose', 214.05),
 ('Fort Worth', 153.57),
 ('San Diego', 66.08),
 ('Pittsburgh', 493.51),
 ('Omaha', 235.63),
 ('Stockton', 247.18),
 ('Austin', 379.6),
 ('New York', 296.8),
 ('Corpus Christi', 25.38),
 ('Fort Worth', 213.88),
 ('Las Vegas', 53.26),
 ('Newark', 39.75),
 ('Austin', 469.63),
 ('Greensboro', 290.82),
 ('San Francisco', 260.65),
 ('Lincoln', 136.9),
 ('Buffalo', 483.82),
 ('San Jose', 215.82),
 ('Boston', 418.94),
 ('Houston', 309.16),
 ('Las Vegas', 93.39),
 ('Virginia Beach', 376.11),
 ('Riverside', 252.88),
 ('Tulsa', 205.06),
 ('Reno', 88.25),
 ('Chicago', 31.08),
 ('Fort Wayne', 370.55),
 ('San Bernardino', 170.2),
 ('Madison', 16.78),
 ('Austin', 327.75)]

In [17]:
# Apply reduceByKey
# Key is the City, value is the Sale-Value
# reduceByKey adds up every Sale-Value for each City
# Note: take(20) returns only the first 20 cities; collect() would return all of them
third_transformed_RDD.reduceByKey(lambda x,y: x+y).take(20)



[('San Jose', 429.87),
 ('Fort Worth', 1128.1399999999999),
 ('San Diego', 448.92),
 ('Pittsburgh', 1271.35),
 ('Omaha', 1811.89),
 ('Stockton', 247.18),
 ('Austin', 1787.88),
 ('New York', 468.90999999999997),
 ('Corpus Christi', 25.38),
 ('Las Vegas', 146.65),
 ('Newark', 39.75),
 ('Greensboro', 749.73),
 ('San Francisco', 260.65),
 ('Lincoln', 712.77),
 ('Buffalo', 483.82),
 ('Boston', 1114.54),
 ('Houston', 1101.95),
 ('Virginia Beach', 647.6700000000001),
 ('Riverside', 1106.01),
 ('Tulsa', 431.95)]
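
To make the aggregation step concrete, here is a small pure-Python analogue of what `reduceByKey(lambda x, y: x + y)` computes: values that share a key are folded together with the given function. This is only a sketch of the semantics, not how Spark executes it across partitions. `reduce_by_key` is a hypothetical helper, and the pairs are a few rows taken from the (City, Sale-Value) preview above.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func, like a local reduceByKey."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = func(totals[key], value)
        else:
            totals[key] = value
    return totals


# A few (City, Sale-Value) pairs copied from the preview above.
pairs = [
    ("San Jose", 214.05),
    ("Fort Worth", 153.57),
    ("San Diego", 66.08),
    ("Fort Worth", 213.88),
    ("San Jose", 215.82),
]

city_totals = reduce_by_key(pairs, lambda x, y: x + y)

# Round for display to hide floating-point noise.
for city, total in city_totals.items():
    print(city, round(total, 2))
```

The two San Jose rows add up to the 429.87 reported for San Jose in the Spark output, while Fort Worth is higher in the full data because more of its rows appear beyond the preview.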