# Cassandra Data Lake

### Cassandra Architecture

A typical Cassandra cluster is organized as follows:

![Cassandra structure](https://www.red-gate.com/simple-talk/wp-content/uploads/2019/01/cassandra-cluster-node-l-n-keyspace-1-column-f.png)

At this stage the cluster has a single node with one keyspace, `henry`, which contains the following tables:

**reviews**:
  * gmap_id: Google Maps unique id of the store.
  * state: USA state where the store is located.
  * user_id: Client identification.
  * name: User's name.
  * time: Date and time at which the review was posted.
  * rating: Integer score (1 to 5).
  * text: The text posted by the client/user.
  * resp: The reply to the user's comment, if any, stored as a dictionary serialized to text [*text datatype*]. Each reply stores its 'time' and its 'text'.

**stores**:
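  * gmap_id: Google Maps unique id of the store.
  * name: Store's name.
  * address: Store's address.
  * category: Set of categories associated with the store.
  * description: Free-text description of the store (may be null).
  * hours: Opening hours per weekday, stored as text.
  * latitude: Latitude of the store.
  * longitude: Longitude of the store.
  * price: Price information for the store ('None' when missing).
  * misc: Additional attributes of the store, such as accessibility and service options.
  * url: Google Maps URL of the store.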

### Imports

In [1]:
!pip install cassandra-driver

Collecting cassandra-driver
  Downloading cassandra_driver-3.28.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.1 MB)
Collecting geomet<0.3,>=0.1 (from cassandra-driver)
  Downloading geomet-0.2.1.post1-py3-none-any.whl (18 kB)
Installing collected packages: geomet, cassandra-driver
Successfully installed cassandra-driver-3.28.0 geomet-0.2.1.post1


In [2]:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement, dict_factory

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Connecting to Cluster

In [6]:
cluster = Cluster(['186.87.6.161'], port=9042, protocol_version=5)  # Server IP and Cassandra's default port, 9042 (passed as an int)
session = cluster.connect('henry')
session.row_factory = dict_factory



### Creating ETL for Stores Table

In [40]:
for row in session.execute("SELECT * FROM stores  LIMIT 5"):
  print(row)

{'gmap_id': '0x89c260cbcf2848c7:0x9b12a43601b1da0d', 'name': 'Chandrabali Home Improvement', 'address': 'Chandrabali Home Improvement, 133-20 95th Ave, Queens, NY 11419', 'category': SortedSet(['General contractor']), 'description': None, 'hours': "[['Monday', '7AM–5PM'], ['Tuesday', '7AM–5PM'], ['Wednesday', '7AM–5PM'], ['Thursday', '7AM–5PM'], ['Friday', '7AM–5PM'], ['Saturday', '8AM–1PM'], ['Sunday', 'Closed']]", 'latitude': 40.695640563964844, 'longitude': -73.81490325927734, 'misc': 'None', 'price': 'None', 'url': 'https://www.google.com/maps/place//data=!4m2!3m1!1s0x89c260cbcf2848c7:0x9b12a43601b1da0d?authuser=-1&hl=en&gl=us'}
{'gmap_id': '0x88f67d6459172897:0xb6897aae90060770', 'name': 'Morgan County Detention Center', 'address': 'Morgan County Detention Center, 1380 Monticello Rd, Madison, GA 30650', 'category': SortedSet(['Government office']), 'description': None, 'hours': "[['Wednesday', 'Open 24 hours'], ['Thursday', 'Open 24 hours'], ['Friday', 'Open 24 hours'], ['Saturday

In [47]:
categories_searched = ['Restaurant',
                       'restaurant',
                       'Bar',
                       'bar',
                       'Deli',
                       'Grocery',
                       'Coffee',
                       'Bakery',
                       'Sandwich']

dictionary = {'gmap_id':[],
              'name': [],
              'address': [],
              'latitude': [],
              'longitude': [],
              'category':[],
              'misc':[]}


for category_searched in categories_searched:
  # Construct the query
  query = f"""
        SELECT gmap_id,name,address,latitude,longitude,category, misc
        FROM stores
        WHERE category CONTAINS '{category_searched}'
        """
  statement = SimpleStatement(query, fetch_size = 5000)

  answer = session.execute(statement, timeout=None)


  for row in answer:
    dictionary['gmap_id'].append(row['gmap_id'])
    dictionary['name'].append(row['name'])
    dictionary['address'].append(row['address'])
    dictionary['latitude'].append(row['latitude'])
    dictionary['longitude'].append(row['longitude'])
    dictionary['category'].append(row['category'])
    dictionary['misc'].append(row['misc'])
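
The loop above splices each category into the CQL text with an f-string. A hedged alternative is to pass the value as a bound parameter instead. The sketch below only builds the statement text and the parameter tuple; `build_category_query` is a hypothetical helper that does not appear in the notebook. The DataStax Python driver accepts `%s` placeholders in non-prepared statements, so the pair could then be handed to `session.execute(query, params)`.

```python
def build_category_query(category):
    """Return (cql, params) for one category lookup (hypothetical helper)."""
    cql = (
        "SELECT gmap_id, name, address, latitude, longitude, category, misc "
        "FROM stores WHERE category CONTAINS %s"
    )
    # The value travels separately from the query text, so no manual quoting is needed
    return cql, (category,)


query, params = build_category_query("Restaurant")
print(query)
print(params)
```

Keeping the value out of the query string avoids quoting problems if a category name ever contains an apostrophe.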




In [53]:
# Create a DataFrame from the dictionary
df_stores = pd.DataFrame(dictionary)

df_stores.head()

Unnamed: 0,gmap_id,name,address,latitude,longitude,category,misc
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,Shell,"Shell, 1510 Giant Dr, Blue Earth, MN 56013",43.657063,-94.097725,"(ATM, Car wash, Convenience store, Gas station...",{'Accessibility': ['Wheelchair accessible entr...
1,0x80c8bf86a8b9a3f1:0xacfe57cd1919e249,Ember's Grill + Spirits,"Ember's Grill + Spirits, 740 S Rampart Blvd, L...",36.16338,-115.289413,"(Bar, Restaurant)","{'Service options': ['Delivery'], 'Highlights'..."
2,0x808506614c768671:0xde51b62058db3bce,La Toque,"La Toque, 1314 McKinstry St, Napa, CA 94559",38.304173,-122.283989,"(French restaurant, Restaurant)","{'Service options': ['Outdoor seating', 'Dine-..."
3,0x8842ecec5931b86b:0x703640198411596d,Godfather's Pizza,"Godfather's Pizza, 233 Lexington Rd, Lancaster...",37.62571,-84.580132,"(Buffet restaurant, Delivery Restaurant, Itali...","{'Service options': ['Delivery'], 'Popular for..."
4,0x87e030ec559cac7f:0x8f04ab05959695e6,Subway,"Subway, 330 N 1st St, Cuba, IL 61427",40.496239,-90.198875,"(Caterer, Fast food restaurant, Restaurant, Sa...","{'Service options': ['Curbside pickup', 'Takeo..."


In [50]:
df_stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157444 entries, 0 to 157443
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   gmap_id    157444 non-null  object 
 1   name       157444 non-null  object 
 2   address    156787 non-null  object 
 3   latitude   157444 non-null  float64
 4   longitude  157444 non-null  float64
 5   category   157444 non-null  object 
 6   misc       157444 non-null  object 
dtypes: float64(2), object(5)
memory usage: 8.4+ MB
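
Because one query is issued per searched category, a store tagged with several of them (for example Ember's Grill + Spirits, listed under both Bar and Restaurant) is collected once per matching query. If one row per store is wanted, a minimal sketch on toy data, assuming `gmap_id` uniquely identifies a store, could look like this:

```python
import pandas as pd

# Toy rows standing in for the collected results; store "b" matched two searches
rows = pd.DataFrame({
    "gmap_id": ["a", "b", "b", "c"],
    "name": ["Store A", "Store B", "Store B", "Store C"],
})

# Keep the first occurrence of each store and rebuild a clean 0..n-1 index
deduped = rows.drop_duplicates(subset="gmap_id").reset_index(drop=True)

print(len(rows), "rows before,", len(deduped), "rows after")
```

Restricting `subset` to `gmap_id` also avoids hashing the `category` column, whose values are sets.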


### Normalize Categories

#### Category df

The first step is to create a DataFrame holding every unique category.

In [61]:
# Create a new DataFrame with the unique categories:
# turn each category set into a list, explode to one category per row, keep the unique values
df_category = pd.DataFrame({'category': df_stores['category'].apply(list).explode().unique()})

df_category.head()

Unnamed: 0,category
0,ATM
1,Car wash
2,Convenience store
3,Gas station
4,Restaurant
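
To make the explode-then-unique step easier to follow, here is a small sketch on made-up data. The toy sets stand in for the `SortedSet` values returned by the driver, and `sorted` is used instead of `list` only so that the toy output has a predictable order.

```python
import pandas as pd

# Each row holds the set of categories of one (made-up) store
toy = pd.DataFrame({"category": [{"Bar", "Restaurant"}, {"Restaurant"}, {"Bakery"}]})

# 1) set -> list, 2) one category per row, 3) drop repeated categories
unique_categories = toy["category"].apply(sorted).explode().unique()

toy_category = pd.DataFrame({"category": unique_categories})
print(toy_category)
```

The position of each category in `toy_category` is what later serves as its id.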


#### Get Categories id

The second step is to attach to each store the ids of its categories.

In [63]:
# Create a dictionary to map each unique category to its index
category_index_map = {category: index for index, category in enumerate(df_category['category'])}

# Create a new column 'category_id' in the original DataFrame
df_stores['category_id'] = df_stores['category'].apply(lambda x: [category_index_map[category] for category in x])

df_stores.head()

Unnamed: 0,gmap_id,name,address,latitude,longitude,category,misc,category_id
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,Shell,"Shell, 1510 Giant Dr, Blue Earth, MN 56013",43.657063,-94.097725,"(ATM, Car wash, Convenience store, Gas station...",{'Accessibility': ['Wheelchair accessible entr...,"[0, 1, 2, 3, 4]"
1,0x80c8bf86a8b9a3f1:0xacfe57cd1919e249,Ember's Grill + Spirits,"Ember's Grill + Spirits, 740 S Rampart Blvd, L...",36.16338,-115.289413,"(Bar, Restaurant)","{'Service options': ['Delivery'], 'Highlights'...","[5, 4]"
2,0x808506614c768671:0xde51b62058db3bce,La Toque,"La Toque, 1314 McKinstry St, Napa, CA 94559",38.304173,-122.283989,"(French restaurant, Restaurant)","{'Service options': ['Outdoor seating', 'Dine-...","[6, 4]"
3,0x8842ecec5931b86b:0x703640198411596d,Godfather's Pizza,"Godfather's Pizza, 233 Lexington Rd, Lancaster...",37.62571,-84.580132,"(Buffet restaurant, Delivery Restaurant, Itali...","{'Service options': ['Delivery'], 'Popular for...","[7, 8, 9, 10, 4]"
4,0x87e030ec559cac7f:0x8f04ab05959695e6,Subway,"Subway, 330 N 1st St, Cuba, IL 61427",40.496239,-90.198875,"(Caterer, Fast food restaurant, Restaurant, Sa...","{'Service options': ['Curbside pickup', 'Takeo...","[11, 12, 4, 13, 14]"
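
The mapping step can be checked in isolation with plain Python. This is a minimal sketch on made-up category names; it also shows that the ids can be translated back to names, since each id is simply the row position in the category table.

```python
# Made-up category table; the list position plays the role of the DataFrame index
categories = ["ATM", "Bar", "Restaurant"]

# name -> id lookup, built the same way as category_index_map
category_index_map = {name: idx for idx, name in enumerate(categories)}

# Categories of one made-up store, translated to ids and back again
store_categories = ["Bar", "Restaurant"]
ids = [category_index_map[name] for name in store_categories]
names_again = [categories[i] for i in ids]

print(ids)
print(names_again)
```

A category missing from the map would raise `KeyError` here, which is a useful signal that the category table is out of date.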


#### Create Pivot table

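The third step is to build the table that represents the N:M relation between Store and Category, with one row per (gmap_id, category_id) pair. Strictly speaking, this is a junction (bridge) table rather than a pivot table.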

In [64]:
# Explode the 'category_id' column to create separate rows for each category_id
exploded_df = df_stores.explode('category_id')

# Create a new DataFrame with the desired columns
stores_categories = exploded_df[['gmap_id', 'category_id']]

stores_categories.head()

Unnamed: 0,gmap_id,category_id
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,0
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,1
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,2
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,3
0,0x87f38b6db3bf5557:0xb793ff26fbedb7fe,4
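
In the output above every row of the first store keeps index 0, because `explode` repeats the original index, and `category_id` ends up as an object column. A hedged variant on toy data that resets the index and casts the ids to integers might look like this (the toy values are made up):

```python
import pandas as pd

# Two made-up stores with their lists of category ids
stores = pd.DataFrame({
    "gmap_id": ["a", "b"],
    "category_id": [[0, 1], [1]],
})

stores_categories = (
    stores.explode("category_id")[["gmap_id", "category_id"]]
    .astype({"category_id": "int64"})  # explode leaves an object column
    .reset_index(drop=True)            # replace the repeated index with 0..n-1
)

print(stores_categories)
```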


#### Drop non-normalized columns

The last step is to drop the columns that the previous steps have already normalized.

In [65]:
# Drop the 'category' and 'category_id' columns from the original DataFrame
df_stores = df_stores.drop(['category', 'category_id'], axis=1)

df_stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157444 entries, 0 to 157443
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   gmap_id    157444 non-null  object 
 1   name       157444 non-null  object 
 2   address    156787 non-null  object 
 3   latitude   157444 non-null  float64
 4   longitude  157444 non-null  float64
 5   misc       157444 non-null  object 
dtypes: float64(2), object(4)
memory usage: 7.2+ MB


### Normalize Misc Column

The only information of interest in this column is whether the store offers a delivery service.
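
One possible way to extract that flag is sketched below. It assumes that `misc` is either a dictionary or its text representation (for example `"{'Service options': ['Delivery']}"`, or the string `'None'` when nothing is recorded), as the sample rows earlier suggest. `has_delivery` is a hypothetical helper, not part of the notebook, and the test strings are made up.

```python
import ast


def has_delivery(misc):
    """Return True if 'Delivery' is listed under 'Service options' (hypothetical helper)."""
    if isinstance(misc, str):
        try:
            # Safely parse the text representation; 'None' parses to None
            misc = ast.literal_eval(misc)
        except (ValueError, SyntaxError):
            return False
    if not isinstance(misc, dict):
        return False
    return "Delivery" in (misc.get("Service options") or [])


print(has_delivery("{'Service options': ['Delivery']}"))
print(has_delivery("{'Service options': ['Dine-in']}"))
print(has_delivery("None"))
```

Applied to the stores data, this might be used as `df_stores['misc'].apply(has_delivery)`, after which the original `misc` column could be dropped.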

### Creating ETL for Reviews Table

In [25]:
for row in session.execute("SELECT * FROM reviews  LIMIT 5"):
  print(row)

{'gmap_id': '0x872b04470d741527:0x7f1db0b8d8a08f24', 'state': 'Arizona', 'user_id': 1.183714955019923e+20, 'time': datetime.datetime(2021, 2, 22, 18, 13, 43, 61000), 'name': 'Margaret Chavez', 'rating': 5, 'resp': 'None', 'text': 'Staff was very helpful and kind.'}
{'gmap_id': '0x872b04470d741527:0x7f1db0b8d8a08f24', 'state': 'Arizona', 'user_id': 1.1816620348694698e+20, 'time': datetime.datetime(2018, 2, 1, 16, 31, 24, 414000), 'name': 'Peter USA', 'rating': 4, 'resp': "{'time': 1517589819669, 'text': 'Thanks for the review, Peter Balogh!'}", 'text': 'For your eyes only'}
{'gmap_id': '0x872b04470d741527:0x7f1db0b8d8a08f24', 'state': 'Arizona', 'user_id': 1.1764944181798529e+20, 'time': datetime.datetime(2020, 2, 27, 23, 28, 9, 731000), 'name': 'S V', 'rating': 5, 'resp': "{'time': 1582901005609, 'text': 'We truly appreciate you taking the time to give us such a fabulous rating shiva!'}", 'text': None}
{'gmap_id': '0x872b04470d741527:0x7f1db0b8d8a08f24', 'state': 'Arizona', 'user_id': 

In [35]:
state_searched = 'Florida'
query = f"""
        SELECT gmap_id,user_id,time,rating,text
        FROM reviews
        WHERE state = '{state_searched}'
        LIMIT 50
        """

statement = SimpleStatement(query, fetch_size = 5000)

dictionary = {'gmap_id':[],
              'user_id': [],
              'time': [],
              'rating': [],
              'text': []}



answer = session.execute(statement, timeout=None)

for row in answer:
  dictionary['gmap_id'].append(row['gmap_id'])
  dictionary['user_id'].append(row['user_id'])
  dictionary['time'].append(row['time'])
  dictionary['rating'].append(row['rating'])
  dictionary['text'].append(row['text'])
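
Since `session.row_factory = dict_factory` makes every row a plain dictionary, the column-by-column appends can be avoided: pandas builds a DataFrame directly from a list of dictionaries. The sketch below uses made-up rows in place of the query result; with a live session the list would presumably come from `list(answer)`.

```python
import pandas as pd

# Made-up rows shaped like the dictionaries produced by dict_factory
rows = [
    {"gmap_id": "a", "rating": 5, "text": "Great service"},
    {"gmap_id": "a", "rating": 4, "text": None},
]

# The columns argument fixes the column order
df = pd.DataFrame(rows, columns=["gmap_id", "rating", "text"])

print(df)
```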


In [39]:
# Expanding the code to get only categories of interest

dictionary = {'gmap_id':[],
              'user_id': [],
              'time': [],
              'rating': [],
              'text': []}

state_searched = 'Florida'
categories_searched = ['Restaurant',
                       'restaurant',
                       'Bar',
                       'bar',
                       'Deli',
                       'Grocery',
                       'Coffee',
                       'Bakery',
                       'Sandwich']


for category_searched in categories_searched:
  # Construct the query
  query = f"""
      SELECT gmap_id, user_id, time, rating, text
      FROM reviews
      WHERE state = '{state_searched}'
            AND
            category CONTAINS '{category_searched}'
      ;
      """
  statement = SimpleStatement(query, fetch_size = 5000)

  answer = session.execute(statement, timeout=None)

  for row in answer:
    dictionary['gmap_id'].append(row['gmap_id'])
    dictionary['user_id'].append(row['user_id'])
    dictionary['time'].append(row['time'])
    dictionary['rating'].append(row['rating'])
    dictionary['text'].append(row['text'])

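InvalidRequest: ignored

The request is rejected because `reviews` has no `category` column: as the rows printed above show, categories are stored only in `stores`, so this filter cannot be expressed against `reviews` alone.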
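
The `reviews` rows printed earlier carry no `category` field, so one hedged way to keep only the reviews of the stores of interest is to filter on the client side by `gmap_id`, using the stores already selected by category. The sketch below uses toy frames; `df_stores_toy` and `df_reviews_toy` are made-up stand-ins for `df_stores` and a reviews DataFrame.

```python
import pandas as pd

# Stores already selected by category (toy stand-in for df_stores)
df_stores_toy = pd.DataFrame({"gmap_id": ["a", "b"]})

# Reviews fetched by state only (toy stand-in); "x" is not a store of interest
df_reviews_toy = pd.DataFrame({
    "gmap_id": ["a", "x", "b", "a"],
    "rating": [5, 1, 4, 3],
})

# Keep the reviews whose store appears in the selected stores
mask = df_reviews_toy["gmap_id"].isin(df_stores_toy["gmap_id"])
filtered = df_reviews_toy[mask].reset_index(drop=True)

print(filtered)
```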

In [37]:
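# categories_clause is not defined in the cells shown here. Note that CQL's WHERE
# clause only combines conditions with AND, so an OR chain like the one printed
# below cannot be sent to Cassandra as-is.
print(categories_clause)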

category CONTAINS 'Restaurant' OR category CONTAINS 'Bar' OR category CONTAINS 'Deli' OR category CONTAINS 'Grocery' OR category CONTAINS 'Coffee' OR category CONTAINS 'Bakery' OR category CONTAINS 'Sandwich'


In [28]:
# Create a DataFrame from the dictionary
df_Florida = pd.DataFrame(dictionary)

df_Florida.head()

Unnamed: 0,gmap_id,user_id,time,rating,text
0,0x88dd38d385b5eb65:0xb4c00e4025ae82f3,1.161696e+20,2018-03-22 03:13:11.693,1,The reason I rated this business a one is due ...
1,0x88dd38d385b5eb65:0xb4c00e4025ae82f3,1.156656e+20,2017-06-12 20:16:07.334,5,
2,0x88dd38d385b5eb65:0xb4c00e4025ae82f3,1.151872e+20,2019-11-04 15:35:22.941,5,I don't like going to the dentist or anything ...
3,0x88dd38d385b5eb65:0xb4c00e4025ae82f3,1.141875e+20,2018-06-12 19:58:18.204,5,
4,0x88dd38d385b5eb65:0xb4c00e4025ae82f3,1.110949e+20,2017-05-23 20:28:31.360,5,I had to get a root canal a few years back out...


In [29]:
df_Florida.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   gmap_id  5000 non-null   object        
 1   user_id  5000 non-null   float64       
 2   time     5000 non-null   datetime64[ns]
 3   rating   5000 non-null   int64         
 4   text     2932 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 195.4+ KB


Checking the length of the longest value in the 'text' column:

In [31]:
# Find the length of each string in the 'text' column
text_lengths = df_Florida['text'].str.len()

# Find the index of the longest string
max_length_index = text_lengths.idxmax()

# Get the longest string
longest_string = df_Florida.loc[max_length_index, 'text']

# Print the length and the longest string
print("index:", max_length_index)
print("Length of the longest string:", text_lengths[max_length_index])
print("Longest string:", longest_string)

index: 1444
Length of the longest string: 2843.0
Longest string: I have been a member of Massage Envy for ~2 years.  In that time, I've enjoyed monthly or sometimes twice monthly sessions.

Today, I was sick and in the interest of not wanting to spread my illness, I called at the opening of business, to let them know I was sick.  I was told that I would be charged 50% of the standard rate for the cancellation.  The young lady was pleasant and asked if she could run the credit card on file for the fee.  I replied that I was mixed in my mind as to the charge, and that I would sort the charge when I wasn't sick.  I told her that I may be a bit agitated due to the illness, and wanted to have a clear head prior to addressing the charge.

Five minutes later, I received a notification from American Express that my card had been charged the cancellation fee.  I called back and asked if they had charged my card for the cancellation fee.  She replied that she did not but her manager had charged 

In [33]:
df_Florida.iloc[1444]

gmap_id                0x88e78f1b97d09887:0x667ae9efc37cbcc2
user_id                              109643352298380853248.0
time                              2014-11-20 16:36:57.333000
rating                                                     1
text       I have been a member of Massage Envy for ~2 ye...
Name: 1444, dtype: object
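
The longest-review lookup above reports the length as `2843.0` rather than `2843`. With an object column, `str.len()` returns `NaN` for missing texts, which forces a float result, while `idxmax()` skips those `NaN` values. A small sketch on made-up data illustrates both points:

```python
import pandas as pd

# Made-up reviews; the middle one has no text, like many rows in df_Florida
text = pd.Series(["short", None, "a much longer review"], dtype="object")

lengths = text.str.len()   # NaN for the missing text, so the dtype is float64
idx = lengths.idxmax()     # NaN entries are skipped when looking for the maximum

print("index:", idx)
print("length:", int(lengths[idx]))  # cast back to int for display
```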