<h1 align="center">Ecommerce Analysis</h1>

## HDFS

### Creating the `ecommerce` directory in HDFS
```bash
hdfs dfs -mkdir ecommerce
```

### Adding `ecommerce.csv` to HDFS
```bash
hdfs dfs -put ecommerce.csv ecommerce/
```

## PySpark

### Importing packages
```python
from pyspark.sql import SparkSession
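# Wildcard imports keep the later snippets short, but note that
# pyspark.sql.functions shadows Python built-ins such as sum and round.
from pyspark.sql.types import *
from pyspark.sql.functions import *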
import happybase
```

### Creating SparkSession

```python
spark = SparkSession.builder.master("local[5]").appName("VarunSpark").enableHiveSupport().getOrCreate()
```
![image.png](attachment:409ee4de-6dbb-4265-b344-9f180e6d83e6.png)

### Reading the CSV file

```python
ecommerce = spark.read.csv("ecommerce/ecommerce.csv", header=True, sep=',', inferSchema=True)
```

### Casting `InvoiceDate` from `string` to `date`

```python
ecommerce = ecommerce.withColumn("InvoiceDate", ecommerce.InvoiceDate.cast(DateType()))
```

### Adding a `total_amount` column to the `ecommerce` DataFrame

```python
ecommerce = ecommerce.withColumn("total_amount", ecommerce.Quantity*ecommerce.UnitPrice)
```
### Checking for null values
```python
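# show() prints the table itself and returns None, so no print() is needed.
# A column whose count is lower than the others contains nulls.
ecommerce.describe().show()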
```

![image.png](attachment:1c4e127f-f910-41c4-a230-4eb47e233ce3.png)


```python
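# Count nulls per column: isNull() gives a boolean, cast to int, then summed.
# sum here is pyspark.sql.functions.sum, not the Python built-in.
null_counts = ecommerce.select([sum(col(c).isNull().cast('int')).alias(c) for c in ecommerce.columns])
null_counts.show()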
```
![image.png](attachment:391ecef6-b65c-4410-996b-22f5ecc3043f.png)

### Dropping rows that contain null values
```python
ecommerce = ecommerce.dropna()
```
### Printing the schema of the `ecommerce` DataFrame
```python
# printSchema() prints directly and returns None.
ecommerce.printSchema()
```
![image.png](attachment:518f54d8-6dcc-46bb-8fff-f77ff99bf137.png)

### Printing the first 10 rows of `ecommerce`
```python
ecommerce.show(10)
```
![image.png](attachment:25e3f589-7ad6-4815-9b12-1a297a741616.png)

### Creating Hive external table
```sql
CREATE EXTERNAL TABLE `varunmdb.ecom_transactions`(
  InvoiceNo integer,
  StockCode string,
  Description string,
  Quantity integer,
  UnitPrice double,
  InvoiceDate date,
  CustomerID integer,
  total_amount DOUBLE)
partitioned by (Country string)
stored as PARQUET
LOCATION 'hdfs:///user/varunm15t38hedu/ecommerce/parquet';
```

![image.png](attachment:da231c4b-e61c-42c5-939f-3aafa1a54f53.png)

### Writing Parquet files to `ecommerce/parquet`, partitioned by the `Country` column
```python
ecommerce.write.mode('overwrite').partitionBy('Country').parquet("hdfs:///user/varunm15t38hedu/ecommerce/parquet")
```

### Checking that the files are partitioned correctly
![image.png](attachment:0dc1b53d-b9ce-43e1-9d85-0c86663cede2.png)

### Repairing partitions
```python
spark.sql("msck repair table varunmdb.ecom_transactions")
```
![image.png](attachment:a1aa1c42-c0c2-4ca3-bdf2-422a10c076de.png)

### Reading the table to check that it works
```python
df = spark.read.table("varunmdb.ecom_transactions")
```

### Printing 10 rows of the table from Hive
```python
spark.sql("select * from varunmdb.ecom_transactions limit 10").show()
```

![image.png](attachment:a21be3e5-5085-4b07-a37c-c3affa3c5d66.png)

## HBase

### Creating the HBase table in the HBase shell

```bash
hbase shell
create 'varuntcs:ecom_txn', 'info'
```

### Connecting to the HBase table
```python
connection = happybase.Connection('master')
table = connection.table('varuntcs:ecom_txn')
```

### Inserting the first 10 rows into HBase
```python
for row in ecommerce.limit(10).collect():
    row_key = f"{row.InvoiceNo}_{row.StockCode}"
    table.put(row_key.encode(), {
        b'info:InvoiceNo': str(row.InvoiceNo).encode(),
        b'info:StockCode': str(row.StockCode).encode(),
        b'info:Quantity': str(row.Quantity).encode(),
        b'info:UnitPrice': str(row.UnitPrice).encode(),
        b'info:CustomerID': str(row.CustomerID).encode(),
        b'info:InvoiceDate': str(row.InvoiceDate).encode(),
        b'info:total_amount': str(row.total_amount).encode(),
        b'info:Country': str(row.Country).encode()
    })
```
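
The row key `InvoiceNo_StockCode` is unique only if a stock code appears at most once per invoice. A `put` to a key that already exists replaces the visible cell values instead of adding a row, so repeated pairs would silently collapse. The sketch below imitates this with a plain dictionary standing in for the HBase table; `make_row_key`, `fake_table`, and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; the values are invented.
Row = namedtuple("Row", ["InvoiceNo", "StockCode", "Quantity"])


def make_row_key(row):
    """Build the same key as the insert loop: b'<InvoiceNo>_<StockCode>'."""
    return f"{row.InvoiceNo}_{row.StockCode}".encode()


rows = [
    Row(536365, "85123A", 6),
    Row(536365, "71053", 6),
    Row(536365, "85123A", 2),  # same invoice and stock code as the first row
]

# A dict imitates the table: writing to an existing key replaces the value.
fake_table = {}
for row in rows:
    fake_table[make_row_key(row)] = {b"info:Quantity": str(row.Quantity).encode()}

print(len(rows), "rows written,", len(fake_table), "distinct keys stored")
print(fake_table[b"536365_85123A"])
```

If the data does contain such repeats, one option (an assumption, not something the notebook does) is to aggregate them before writing or to append a sequence number to the key.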

### Printing a sample row
```python
# `row` is the last row left over from the insert loop above.
sample_key = f"{row.InvoiceNo}_{row.StockCode}".encode()
print(table.row(sample_key))
```
![image.png](attachment:9a23c6ec-0c32-4e83-af7e-16010ee80909.png)

### Counting rows in the HBase shell
![image.png](attachment:83e3ec8f-180b-49da-8423-3697701f8f7a.png)

### Checking the data stored in HBase
![image.png](attachment:97589937-fdc4-49ac-86e9-5164d324943e.png)

## Analysis
### 1. Total sales per country
```python
# Counts line items (rows) per country, most frequent first.
ecommerce.groupBy("Country").count().orderBy(col("count").desc()).show()
```
![image.png](attachment:5e5305e4-e1a4-4752-8077-12d5ea742ea5.png)
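
The query above counts line items per country; it does not sum revenue. The two rankings can differ, as the plain-Python sketch below shows on a few made-up rows (not taken from the dataset). In PySpark, a revenue ranking would presumably aggregate `total_amount` with `sum` instead of calling `count()`.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (Country, Quantity, UnitPrice). Values are invented.
rows = [
    ("France", 1, 100.0),
    ("Germany", 2, 5.0),
    ("Germany", 3, 5.0),
    ("Germany", 1, 5.0),
]

# Number of line items per country (what groupBy("Country").count() measures).
line_items = Counter(country for country, _, _ in rows)

# Revenue per country: sum of Quantity * UnitPrice, i.e. total_amount.
revenue = defaultdict(float)
for country, quantity, unit_price in rows:
    revenue[country] += quantity * unit_price

by_count = sorted(line_items, key=line_items.get, reverse=True)
by_revenue = sorted(revenue, key=revenue.get, reverse=True)

print(by_count)    # ranking by number of line items
print(by_revenue)  # ranking by revenue
```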

### 2. Monthly sales trend
```python
# Build a "year / month" label, sum total_amount per label, newest label first.
monthly_sales = (
    ecommerce
    .select(
        concat(year(col("InvoiceDate")), lit(" / "), month(col("InvoiceDate"))).alias("month"),
        col("total_amount"),
    )
    .groupBy("month")
    .agg(round(sum(col("total_amount")), 2).alias("total_amount"))
    .orderBy(col("month").desc())
)
monthly_sales.show()
```
![image.png](attachment:03a26d5d-53d3-41b2-8447-a8176b4ecbf9.png)
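
Because the `month` label is a concatenated string, `orderBy` compares it as text, so single-digit months sort after months 10 to 12 of the same year. The plain-Python sketch below reproduces the effect on made-up (year, month) pairs and shows zero-padding as one possible fix. In Spark, `date_format(col("InvoiceDate"), "yyyy / MM")` should produce the padded form; that is an assumption about the desired label, not the notebook's method.

```python
# Hypothetical (year, month) pairs; not taken from the dataset.
periods = [(2011, 9), (2011, 12), (2011, 10), (2010, 12)]

# Label built the same way as concat(year, " / ", month): no zero-padding.
unpadded = [f"{y} / {m}" for y, m in periods]
# Zero-padded label, so text order matches calendar order.
padded = [f"{y} / {m:02d}" for y, m in periods]

unpadded_desc = sorted(unpadded, reverse=True)
padded_desc = sorted(padded, reverse=True)

print(unpadded_desc)  # September lands ahead of December
print(padded_desc)    # calendar order, newest first
```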

### 3. Top 10 most sold products
```python
# Total units sold per product description, top 10 by quantity.
(
    ecommerce
    .select(["Description", "Quantity"])
    .groupBy("Description")
    .agg(sum(col("Quantity")).alias("Quantity"))
    .orderBy(col("Quantity").desc())
    .show(10)
)
```
![image.png](attachment:04762b06-2f01-481c-9070-cf221c7e9f5d.png)

### 4. Total revenue
```python
ecommerce.select(round(sum("total_amount"), 2).alias("total_amount")).show()
```
![image.png](attachment:a943e74e-c416-4595-a67b-004d03e8673c.png)

### 5. Average basket size
```python
# Average Quantity per line item within each invoice.
ecommerce.groupBy("InvoiceNo").agg(avg(col("Quantity")).alias("Quantity")).show()
```
![image.png](attachment:e397a1ed-a05a-4bd0-b53b-53935b8284f4.png)
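
The query above averages `Quantity` over the line items of each invoice. If "basket size" is instead taken to mean the number of units per invoice, averaged over all invoices (one common definition, assumed here), the calculation is a sum per invoice followed by a single mean. A plain-Python sketch on made-up line items:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (InvoiceNo, Quantity) line items; not taken from the dataset.
line_items = [(1001, 2), (1001, 4), (1002, 10), (1003, 1), (1003, 1), (1003, 1)]

# Group the quantities by invoice.
quantities = defaultdict(list)
for invoice_no, quantity in line_items:
    quantities[invoice_no].append(quantity)

# Per-invoice mean of Quantity (what the groupBy/avg query computes).
avg_per_line = {inv: mean(qs) for inv, qs in quantities.items()}

# Units per invoice, then one average across all invoices.
units_per_invoice = {inv: sum(qs) for inv, qs in quantities.items()}
avg_basket_size = mean(list(units_per_invoice.values()))

print(avg_per_line)
print(units_per_invoice)
print(avg_basket_size)
```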


<h1 align="center">Thank You</h1>