# Big Data Modeling and Management Assignment - Homework 1

# Submission

GROUP NUMBER: **XXXXXX** - please add your group number into the file name

GROUP MEMBERS:

|STUDENT NAME|STUDENT NUMBER|
|---|---|
|Gaspar Pereira|20230984|
|Íris Moreira|XXXXXX|
|Jude|XXXXXX|
|Rita Wang|XXXXXX|

## 🍺 The Beer project  🍺 

As shown in class, graph databases are a natural way of navigating related information. For this first project we will use a graph database to analyse beers and breweries!

The project datasets are based on [kaggle](https://www.kaggle.com/ehallmar/beers-breweries-and-beer-reviews), released by Evan Hallmark. 

### Problem description

Imagine you are working in the Data Management department of an Analytics company.
Explore the database via the Python Neo4j connector and/or the graphical tool on the Neo4j webpage. Answer the questions while adjusting the database to meet the needs of your colleagues.
Please record and keep track of your database changes, and submit the file with all cells run and with the output shown.

### Questions

1. Explore the database: get familiar with the current schema, elements and other important database parameters. [1 point]
2. Adjust the database and explain the reasoning behind each change: e.g. clean errors, remove redundancies, adjust the schema as necessary. Visualize the final version of the database schema. [4 points]
3. The Analytics department requires the following information for the biweekly reporting: [5 points]
    1. How many reviews does the beer with the most reviews have?
    2. Which three users wrote the most reviews about beers?
    3. Find all beers that are described with the following words: 'fruit', 'complex', 'nutty', 'dark'.
    4. Which three breweries produce the largest variety of beer styles?
    5. Which country produces the most beer styles?
4. The Market Analysis department in your company accesses and updates the trends data on a daily basis. Given that, consider how you need to optimize the database and its performance so that the following queries are efficient. Measure performance to communicate your improvements using PROFILE before the final query. Answer the following: [4 points]
    1. Using the ABV score, find the five strongest beers and display their ABV score and the corresponding brewery. Keep in mind that the strongest known beer is Snake Venom, and deal with the erroneous entries in the database.
    2. Using the answer from question 2, find the top 5 distinct beer styles with the highest average score of smell + feel that were reviewed by the third most productive user. Keep in mind that cleaning the database earlier should ensure correct results.
5. Answer **two out of four** of the following questions using Graph Algorithms (gds): [NB: make sure to clear the graph before using it again] For the quarterly report, the Analytics department requires the following information. [6 points]
    1. Which two countries are most similar when it comes to their top five most produced beer styles?
    2. Which beer is the most popular when considering the number of users who reviewed it?
    3. Users are connected together by their reviews of beers. Taking into consideration the "smell" score they assign as a weight, how many communities are formed from these relationships? How many users are in the three largest communities?
    4. Which user is the most influential when it comes to reviews of distinct beers by style?
 

## Loading the Database

In [2]:
from neo4j import GraphDatabase
from pprint import pprint

In [3]:
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="test"

In [4]:
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

In [5]:
def execute_read(driver, query):    
    with driver.session(database="neo4j") as session:
        result = session.execute_read(lambda tx, query: list(tx.run(query)), query)
    return result

In [6]:
def execute_write(driver, query):
    with driver.session(database="neo4j") as session:
        # Write transactions allow the driver to handle retries and transient errors
        result = session.execute_write(lambda tx, query: list(tx.run(query)), query)
    return result
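
> The two helpers above only accept a finished query string. A minimal sketch of a combined variant that also forwards query parameters is shown below. `run_query` is a hypothetical name, not part of the neo4j driver; it only assumes that `driver.session(...)`, `session.execute_read` / `session.execute_write` and `tx.run(query, parameters)` behave as in the cells above.

```python
def run_query(driver, query, parameters=None, write=False, database="neo4j"):
    """Run one Cypher query in a managed transaction and return its records.

    Hypothetical helper for illustration; not part of the neo4j driver.
    """
    def work(tx):
        # tx.run takes the query text plus a dict of query parameters
        return list(tx.run(query, parameters or {}))

    with driver.session(database=database) as session:
        if write:
            # Managed write transactions let the driver retry transient errors
            return session.execute_write(work)
        return session.execute_read(work)
```

> With such a helper, a value could be passed as `run_query(driver, "MATCH (c:COUNTRIES) WHERE c.name = $name RETURN c", {"name": "Namibia"})`. Labels and property names cannot be supplied as parameters this way, so loops over labels would still rely on f-strings.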

# Question 1: Exploratory Data Analysis

> Let's start by counting the number of nodes in our database: 

In [46]:
query = """
        MATCH () RETURN count(*)
    """
result = execute_read(driver, query)
pprint(result)

[<Record count(*)=3215489>]


> We also check the labels in our database:

In [7]:
query = """
        CALL db.labels();
    """

result = execute_read(driver, query)

pprint(result)

[<Record label='COUNTRIES'>,
 <Record label='CITIES'>,
 <Record label='BREWERIES'>,
 <Record label='BEERS'>,
 <Record label='REVIEWS'>,
 <Record label='STYLE'>,
 <Record label='USER'>]


In [8]:
labels = ['COUNTRIES','CITIES','BREWERIES','BEERS','REVIEWS','STYLE','USER']

> Check the number of nodes for each label

In [None]:
for label in labels:
    # Each query returns a single aggregated row, so no ORDER BY is needed
    query = f"""
        MATCH (n:{label})
        RETURN count(n) AS count
    """
    
    result = execute_read(driver, query)
    print(f"Label: {label}, Count: {result[0]['count']}")


Label: COUNTRIES, Count: 200
Label: CITIES, Count: 11665
Label: BREWERIES, Count: 50347
Label: BEERS, Count: 358873
Label: REVIEWS, Count: 2538063
Label: STYLE, Count: 113
Label: USER, Count: 106645
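
> The per-label counts can be cross-checked against the total node count from the first query. The sketch below only reuses numbers printed above. The interpretation of the gap is an assumption: it may correspond to nodes not counted under any of the seven labels, or simply to cells that were run at different times.

```python
# Counts copied from the outputs above
label_counts = {
    "COUNTRIES": 200,
    "CITIES": 11_665,
    "BREWERIES": 50_347,
    "BEERS": 358_873,
    "REVIEWS": 2_538_063,
    "STYLE": 113,
    "USER": 106_645,
}
total_nodes = 3_215_489  # result of MATCH () RETURN count(*)

labelled_total = sum(label_counts.values())
unaccounted = total_nodes - labelled_total

print(f"Sum over labels: {labelled_total}")   # 3065906
print(f"Not covered by labels: {unaccounted}")  # 149583
```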


> Relationships:

In [10]:
query = """
        CALL db.relationshipTypes();
    """

result = execute_read(driver, query)

pprint(result)

[<Record relationshipType='REVIEWED'>,
 <Record relationshipType='BREWED'>,
 <Record relationshipType='IN'>,
 <Record relationshipType='HAS_STYLE'>,
 <Record relationshipType='POSTED'>]


![image.png](img/schema_original.png)

In [11]:
relationships = ['REVIEWED','BREWED','IN','HAS_STYLE','POSTED']

In [12]:
for relationship in relationships:
    query = f"""
        MATCH ()-[r:{relationship}]->()
        RETURN COUNT(r) AS count
    """
    
    result = execute_read(driver, query)
    print(f"Relationship: {relationship}, Count: {result[0]['count']}")


Relationship: REVIEWED, Count: 2537991
Relationship: BREWED, Count: 358873
Relationship: IN, Count: 62424
Relationship: HAS_STYLE, Count: 358873
Relationship: POSTED, Count: 2538044


> What labels are connected by each relationship type?

> Let's see if there are any relationships that don't make sense:

In [15]:
query = """
        MATCH (a)-[r]->(b)
        RETURN DISTINCT labels(a) AS Start, type(r) AS Relationship, labels(b) AS End, COUNT(r)
        ORDER BY Relationship;


    """

result = execute_read(driver, query)

pprint(result)

[<Record Start=['BREWERIES'] Relationship='BREWED' End=['BEERS'] COUNT(r)=358873>,
 <Record Start=['BEERS'] Relationship='HAS_STYLE' End=['STYLE'] COUNT(r)=358873>,
 <Record Start=['CITIES'] Relationship='IN' End=['COUNTRIES'] COUNT(r)=12077>,
 <Record Start=['BREWERIES'] Relationship='IN' End=['CITIES'] COUNT(r)=50347>,
 <Record Start=['REVIEWS'] Relationship='POSTED' End=['USER'] COUNT(r)=2538044>,
 <Record Start=['BEERS'] Relationship='REVIEWED' End=['REVIEWS'] COUNT(r)=2537991>]


> It is not clear that the current directions make sense: <br>
`(REVIEWS)-[:POSTED]->(USER)` and <br>
`(BEERS)-[:REVIEWED]->(REVIEWS)`.<br>
We should look into reversing the direction of these relationships (a user posts a review, and a review is about a beer).
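
> The relationship counts above can also be checked for internal consistency with a little arithmetic. The sketch below uses only the printed numbers. Reading the differences as "reviews without a user" or "reviews without a beer" assumes that each review has at most one `POSTED` and one `REVIEWED` relationship, which has not been verified here.

```python
# Totals per relationship type and per (start, end) pattern, copied from the outputs above
rel_totals = {"REVIEWED": 2_537_991, "BREWED": 358_873, "IN": 62_424,
              "HAS_STYLE": 358_873, "POSTED": 2_538_044}
in_by_pattern = {("CITIES", "COUNTRIES"): 12_077, ("BREWERIES", "CITIES"): 50_347}
review_nodes = 2_538_063
city_nodes = 11_665

# The two IN patterns should add up to the total number of IN relationships
in_patterns_match = sum(in_by_pattern.values()) == rel_totals["IN"]

# Lower bounds, under the one-relationship-per-review assumption
reviews_without_user = review_nodes - rel_totals["POSTED"]
reviews_without_beer = review_nodes - rel_totals["REVIEWED"]

# More CITIES->COUNTRIES edges than cities means some cities link to several countries
extra_city_country_edges = in_by_pattern[("CITIES", "COUNTRIES")] - city_nodes

print(in_patterns_match, reviews_without_user, reviews_without_beer, extra_city_country_edges)
```

> The surplus of `CITIES`-to-`COUNTRIES` relationships suggests that some city names are shared across countries, which may be worth keeping in mind for the country-level questions.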

> Schema visualization:

In [19]:
query = """
        CALL db.schema.visualization()
    """
result = execute_read(driver, query)
pprint(result)


[<Record nodes=[<Node element_id='-5' labels=frozenset({'REVIEWS'}) properties={'name': 'REVIEWS', 'indexes': ['id'], 'constraints': []}>, <Node element_id='-4' labels=frozenset({'BEERS'}) properties={'name': 'BEERS', 'indexes': ['id'], 'constraints': []}>, <Node element_id='-1' labels=frozenset({'COUNTRIES'}) properties={'name': 'COUNTRIES', 'indexes': ['name'], 'constraints': []}>, <Node element_id='-3' labels=frozenset({'BREWERIES'}) properties={'name': 'BREWERIES', 'indexes': ['id'], 'constraints': []}>, <Node element_id='-6' labels=frozenset({'STYLE'}) properties={'name': 'STYLE', 'indexes': ['name'], 'constraints': []}>, <Node element_id='-2' labels=frozenset({'CITIES'}) properties={'name': 'CITIES', 'indexes': ['name'], 'constraints': []}>, <Node element_id='-7' labels=frozenset({'USER'}) properties={'name': 'USER', 'indexes': ['name'], 'constraints': []}>] relationships=[<Relationship element_id='-1' nodes=(<Node element_id='-4' labels=frozenset({'BEERS'}) properties={'name': '

> The schema visualization shows some relationships that did not appear when we counted relationships by start and end label:

> - (CITIES)--(CITIES)
> - (BREWERIES)--(COUNTRIES)

> The next two queries count them directly.

In [None]:
query = """
        MATCH (ci:CITIES)-[r]-(ci2:CITIES)
        RETURN count(r)
"""
result = execute_read(driver, query)
pprint(result)

In [None]:
query = """
        MATCH (b:BREWERIES)-[r]-(c:COUNTRIES)
        RETURN count(r)
"""
result = execute_read(driver, query)
pprint(result)

> Properties:

In [13]:
query = """
        CALL db.propertyKeys();
    """

result = execute_read(driver, query)

pprint(result)

[<Record propertyKey='name'>,
 <Record propertyKey='types'>,
 <Record propertyKey='notes'>,
 <Record propertyKey='state'>,
 <Record propertyKey='id'>,
 <Record propertyKey='abv'>,
 <Record propertyKey='retired'>,
 <Record propertyKey='availability'>,
 <Record propertyKey='brewery_id'>,
 <Record propertyKey='date'>,
 <Record propertyKey='score'>,
 <Record propertyKey='taste'>,
 <Record propertyKey='feel'>,
 <Record propertyKey='overall'>,
 <Record propertyKey='beer_id'>,
 <Record propertyKey='text'>,
 <Record propertyKey='smell'>,
 <Record propertyKey='look'>]


In [14]:
properties = ['name','types','notes','state','id','abv','retired','availability',\
              'brewery_id','date','score','taste','feel','overall','beer_id',\
                'text','smell','look']

> What properties does each label have?

In [16]:
query = """
        CALL db.schema.nodeTypeProperties()
        YIELD nodeType, propertyName
        RETURN nodeType AS Label, COLLECT(propertyName) AS Properties
        ORDER BY Label;


    """

result = execute_read(driver, query)

pprint(result)

[<Record Label=':`BEERS`' Properties=['name', 'notes', 'state', 'id', 'abv', 'retired', 'availability', 'brewery_id']>,
 <Record Label=':`BREWERIES`' Properties=['name', 'types', 'notes', 'state', 'id']>,
 <Record Label=':`CITIES`' Properties=['name']>,
 <Record Label=':`COUNTRIES`' Properties=['name']>,
 <Record Label=':`REVIEWS`' Properties=['id', 'date', 'score', 'taste', 'feel', 'overall', 'beer_id', 'text', 'smell', 'look']>,
 <Record Label=':`STYLE`' Properties=['name']>,
 <Record Label=':`USER`' Properties=['name']>]


# Question 2: Preprocessing

## 2.1 Isolated Nodes

> Number of initial nodes:

In [17]:
query = """
        MATCH (n)
        RETURN COUNT(n) AS TotalNodeCount
"""
result = execute_read(driver, query)
pprint(result)

[<Record TotalNodeCount=3065906>]


> Number of nodes not connected: 

In [18]:
query = """
        MATCH (n)
        WHERE NOT (n)-[]-()
        RETURN labels(n) AS NodeLabel, COUNT(n) AS IsolatedNodeCount
        ORDER BY IsolatedNodeCount DESC
"""
result = execute_read(driver, query)
pprint(result)



[]


> Delete those nodes to save space:

In [47]:
query = """
        MATCH (n)
        WHERE NOT (n)-[]-()
        DELETE n
        RETURN count(n)
"""
result = execute_write(driver, query)
pprint(result)



[<Record count(n)=149583>]


> After the deletion, the total should drop by the 149,583 removed nodes, from the 3,215,489 counted in Question 1 to 3,065,906:

In [None]:
query = """
        MATCH (n)
        RETURN COUNT(n) AS TotalNodeCount
"""
result = execute_read(driver, query)
pprint(result)

## 2.2 Checking values consistency

In [13]:
node_properties = {
    "BEERS": ["notes", "name", "state", "id", "retired", "availability", "brewery_id"],
    "BREWERIES": ["notes", "types", "id", "name", "state"],
    "CITIES": ["name"],
    "COUNTRIES": ["name"],
    "REVIEWS": ["text", "smell", "look", "taste", "feel", "overall", "beer_id", "id", "date", "score"],
    "STYLE": ["name"],
    "USER": ["name"]
}

In [None]:
for label, props in node_properties.items():
    for prop in props:  # `prop` avoids shadowing the built-in `property`
        # Ten most frequent values of each property
        query = f"""
            MATCH (n:{label})
            RETURN n.{prop} AS {prop}, count(n) AS Count
            ORDER BY Count DESC
            LIMIT 10
        """
        result = execute_read(driver, query)
        print(f"Label: {label}, Property: {prop}")
        pprint(result)
        print("\n\n")

Label: BEERS, Property: notes
[<Record notes='No notes at this time.' Count=309078>,
 <Record notes='nan' Count=46>,
 <Record notes='Single-Hop IPA' Count=26>,
 <Record notes='Brewed at De Proefbrouwerij.' Count=24>,
 <Record notes=' No notes at this time.' Count=23>,
 <Record notes='30 IBU' Count=22>,
 <Record notes='Permutation is our experimental series of small batch offerings, showcasing the unique visions and innovative concepts developed by our brewing and cellar crew.' Count=19>,
 <Record notes='20 IBU' Count=19>,
 <Record notes='70 IBU' Count=19>,
 <Record notes='The Intervals series is a platform that allows our brewers to experience and study the ingredients that we use in brewing. As we develop new flavors and experience those nuances we can share that with others. From single hop varieties to alternate grains we want you to learn with us! Experiment, learn, repeat!' Count=18>]



Label: BEERS, Property: name
[<Record name='Oktoberfest' Count=755>,
 <Record name='IPA' Count

> **1.** Some values carry leading or trailing whitespace (e.g. `' No notes at this time.'`). We first check which properties are affected.

In [None]:
for node, properties in node_properties.items():
    for prop in properties:
        print(f"Checking outer white spaces for: {prop} in {node}")
        # Raw f-string: \s reaches Cypher unchanged and Python does not
        # warn about an invalid escape sequence
        query = rf"""
                MATCH (n:{node})
                WHERE n.{prop} =~ "^\s.*|.*\s$"
                RETURN count(n) AS NodeCount
        """
        result = execute_read(driver, query)
        pprint(result)
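
> To make the pattern in the check above concrete, here is a small Python illustration. It assumes that Cypher's `=~` has to match the whole string (so `re.fullmatch` is the closest analogue) and that `.` does not match line breaks by default. `has_edge_whitespace` and the sample strings are made up for this sketch.

```python
import re

EDGE_WS = r"^\s.*|.*\s$"             # pattern used in the Cypher query above
EDGE_WS_DOTALL = r"(?s)^\s.*|.*\s$"  # variant in which '.' also matches newlines


def has_edge_whitespace(value, pattern=EDGE_WS):
    """Return True if the whole string matches the edge-whitespace pattern."""
    return re.fullmatch(pattern, value) is not None


print(has_edge_whitespace(" No notes at this time."))   # True: leading space
print(has_edge_whitespace("No notes at this time."))    # False: clean value
print(has_edge_whitespace("line one\nline two "))       # False: '.' stops at the newline
print(has_edge_whitespace("line one\nline two ", EDGE_WS_DOTALL))  # True
```

> If multi-line `notes` or `text` values exist, the original pattern may therefore miss some of them; the `(?s)` variant is one possible way to include them.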

> We strip the leading and trailing whitespace from BEERS (`notes`, `availability`) and BREWERIES (`notes`).

In [48]:
white_space_nodes_prop = {
    "BEERS": ["notes", "availability"],
    "BREWERIES": ["notes"]
}

for node, properties in white_space_nodes_prop.items():
    for prop in properties:
        print(f"Deleting outer white spaces for: {prop} in {node}")
        # Raw f-string: \s reaches Cypher unchanged
        query = rf"""
                MATCH (n:{node})
                WHERE n.{prop} =~ "^\s.*|.*\s$"
                SET n.{prop} = TRIM(n.{prop})
                RETURN count(n) AS NodeCount
        """
        result = execute_write(driver, query)
        pprint(result)


Deleting outer white spaces for: notes in BEERS
[<Record NodeCount=2287>]
Deleting outer white spaces for: availability in BEERS
[<Record NodeCount=248467>]
Deleting outer white spaces for: notes in BREWERIES
[<Record NodeCount=315>]


> **2.** Remove '\xa0\xa0' (two non-breaking spaces) from the review text, for consistency.

In [None]:
# query = """
#         MATCH (r:REVIEWS)
#         WHERE r.text=~ '\xa0\xa0.*'
#         RETURN COUNT(r) AS Count
# """
# result = execute_read(driver, query)
# pprint(result)

[<Record Count=2536660>]


In [25]:
# query = """
#         MATCH (r:REVIEWS)
#         SET r.text = REPLACE(r.text, '\xa0\xa0', '')
#         RETURN COUNT(r) AS Count
# """
# result = execute_write(driver, query)
# pprint(result)

## 2.3 Missing values

In [24]:
nodes = ["BEERS", "BREWERIES", "CITIES", "COUNTRIES", "REVIEWS", "STYLE", "USER"]

for node in nodes: 
    query = f"""
            MATCH (n:{node})
            UNWIND keys(n) AS key
            // Keep n in scope so the property value can be looked up by key
            WITH n, key
            WHERE n[key] IS NULL OR n[key] = 'nan'
            RETURN key, count(*) AS EmptyValuesCount
            ORDER BY key
    """
    result = execute_read(driver, query)
    print(f"Label: {node}")
    pprint(result)
    print("\n")

Label: BEERS
[<Record key='abv' EmptyValuesCount=38797>,
 <Record key='state' EmptyValuesCount=60726>]


Label: BREWERIES
[<Record key='state' EmptyValuesCount=11271>]


Label: CITIES
[<Record key='name' EmptyValuesCount=1>]


Label: COUNTRIES
[<Record key='name' EmptyValuesCount=1>]


Label: REVIEWS
[<Record key='feel' EmptyValuesCount=1060483>,
 <Record key='look' EmptyValuesCount=1060483>,
 <Record key='overall' EmptyValuesCount=1060483>,
 <Record key='smell' EmptyValuesCount=1060483>,
 <Record key='taste' EmptyValuesCount=1060483>]


Label: STYLE
[<Record key='name' EmptyValuesCount=1>]


Label: USER
[<Record key='name' EmptyValuesCount=1>]




> Convert nan to NULL

> BEERS: missing values in abv and state, but we'll leave it as it is. 

> BREWERIES: missing values in state, but we'll leave it as it is. There are missing states, because those beers 

> COUNTRIES: Two named cities are connected to the nan country: Windhoek and Swakopmund, both located in Namibia. So we've decided to rename the country to Namibia. A nan city is also connected to it.

> CITIES: the nan city is connected to many different BREWERIES and COUNTRIES; we leave it as it is.

> STYLE: the nan style has only one beer related to it: American Three Threads. We'll leave it as it is.

> USER: one user has no name, but has reviews.

> REVIEWS: some of the individual ratings are NULL, but the `score` property has a value.

> The query below lists the cities attached to the "nan" country; the two named cities belong to Namibia, so we rename the country to Namibia.

In [52]:
query = """
            MATCH (c:COUNTRIES)<--(ci:CITIES)
            WHERE c.name = 'nan'
            RETURN ci.name
    """
result = execute_read(driver, query)
pprint(result)


[<Record ci.name='nan'>,
 <Record ci.name='Swakopmund'>,
 <Record ci.name='Windhoek'>]


In [53]:
query = f"""
        MATCH (c:COUNTRIES)
        WHERE c.name IS NULL or c.name = 'nan'
        SET c.name = 'Namibia'
        RETURN c
    """
result = execute_write(driver, query)
print(result)

[<Record c=<Node element_id='133' labels=frozenset({'COUNTRIES'}) properties={'name': 'Namibia'}>>]


In [54]:
query = f"""
        MATCH (c:COUNTRIES)
        WHERE c.name='Namibia'
        RETURN c
    """
result = execute_read(driver, query)
pprint(result)

[<Record c=<Node element_id='133' labels=frozenset({'COUNTRIES'}) properties={'name': 'Namibia'}>>]


> Set all "nan" strings to NULL to save space (a string takes more space, and setting a property to NULL removes it from the node).

In [None]:
nodes = ["BEERS", "BREWERIES", "CITIES", "COUNTRIES", "REVIEWS", "STYLE", "USER"]

nan_values_dict = {
    "BEERS": ["abv", "notes", "state"],
    "BREWERIES": ["notes", "state"],
    "CITIES": ["name"],
    "REVIEWS": ["smell", "look", "taste", "feel", "overall"],
    "STYLE": ["name"],
    "USER": ["name"]
}

for node, properties in nan_values_dict.items():
    for prop in properties:
        print(f"Correcting missing {prop} in {node}")
        query = f"""
                MATCH (n:{node})
                WHERE n.{prop} = 'nan'
                SET n.{prop} = NULL
                RETURN count(*) AS CorrectValues
        """
        result=execute_write(driver, query)
        pprint(result)

Correcting missing abv in BEERS
[<Record CorrectValues=38797>]
Correcting missing notes in BEERS
[<Record CorrectValues=46>]
Correcting missing state in BEERS
[<Record CorrectValues=60726>]
Correcting missing notes in BREWERIES
[<Record CorrectValues=85>]
Correcting missing state in BREWERIES
[<Record CorrectValues=11271>]
Correcting missing name in CITIES
[<Record CorrectValues=1>]
Correcting missing smell in REVIEWS
[<Record CorrectValues=1060483>]
Correcting missing look in REVIEWS
[<Record CorrectValues=1060483>]
Correcting missing taste in REVIEWS
[<Record CorrectValues=1060483>]
Correcting missing feel in REVIEWS
[<Record CorrectValues=1060483>]
Correcting missing overall in REVIEWS
[<Record CorrectValues=1060483>]
Correcting missing name in STYLE
[<Record CorrectValues=1>]
Correcting missing name in USER
[<Record CorrectValues=1>]


> Change "No notes at this time." to NULL for space optimization

In [57]:
query = """
    MATCH (n:BEERS)
    WHERE n.notes = "No notes at this time."
    SET n.notes = NULL
    RETURN count(*) AS CorrectValues
    """
result = execute_write(driver, query)
pprint(result)

[<Record CorrectValues=309101>]


## 2.4 Duplicate values

> Checking for duplicate values in identifier properties like "id" and "name"

In [32]:
unique_properties = ["id", "name"]

for node, properties in node_properties.items():
    for prop in unique_properties:
        if prop in properties: 
            print(f"Checking for duplicated values for '{prop}' in '{node}'")

            query = f"""
                    MATCH (n:{node})
                    WITH TRIM(n.{prop}) AS {prop}, count(n) AS count
                    WHERE count > 1
                    RETURN {prop}, count
                    ORDER BY count DESC
                    LIMIT 5
            """
            result = execute_read(driver, query)
            pprint(result)
            print("\n")
    print("-------------------------------")

Checking for duplicated values for 'id' in 'BEERS'
[]


Checking for duplicated values for 'name' in 'BEERS'
[<Record name='Oktoberfest' count=755>,
 <Record name='IPA' count=633>,
 <Record name='Pale Ale' count=620>,
 <Record name='Hefeweizen' count=477>,
 <Record name='Oatmeal Stout' count=443>]


-------------------------------
Checking for duplicated values for 'id' in 'BREWERIES'
[]


Checking for duplicated values for 'name' in 'BREWERIES'
[<Record name='Whole Foods Market' count=162>,
 <Record name='Total Wine & More' count=147>,
 <Record name='Cost Plus World Market' count=118>,
 <Record name='Mellow Mushroom' count=114>,
 <Record name="Trader Joe's" count=88>]


-------------------------------
Checking for duplicated values for 'name' in 'CITIES'
[]


-------------------------------
Checking for duplicated values for 'name' in 'COUNTRIES'
[]


-------------------------------
Checking for duplicated values for 'id' in 'REVIEWS'
[<Record id=None count=19>]


--------------------

In [None]:
query = """
    MATCH (r:REVIEWS)
    WITH r.id AS reviews_id, COUNT(r) AS count
    WHERE count > 1
    RETURN reviews_id, count
    ORDER BY count DESC;
"""

result = execute_read(driver, query)

for record in result:
    print(f"Reviews ID: {record['reviews_id']}, Count: {record['count']}")


Reviews ID: None, Count: 19


> There are 19 REVIEWS nodes without an `id`; they have no other properties either.

In [None]:
#There are 19 REVIEWS with no id property and have no info associated
query = """
    MATCH (n:REVIEWS)
    WHERE n.id IS NULL
    RETURN n
"""
result = execute_read(driver, query)
pprint(result)


[<Record n=<Node element_id='921375' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='921921' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='922467' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='923013' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='923559' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='924105' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='924651' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='925197' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='925743' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='926289' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='926835' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='927381' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='927927' labels=frozenset() properties={}>>,
 <Record n=<Node element_id='928473' labels=frozens

In [None]:
to_delete=[921375,921921,922467,923013,923559,924105,924651,925197]

In [29]:
# Look up one of these nodes by its internal id.
# id(n) returns an integer, so compare against a number rather than the string '931203'
query = """
    MATCH (n:REVIEWS)
    WHERE id(n) = 931203
    RETURN n
"""
result = execute_read(driver, query)
pprint(result)


In [58]:
#This will be removed
query = """
    MATCH (n:REVIEWS)
    WHERE n.id IS NULL
    DELETE n
    RETURN count(n)
"""
result = execute_write(driver, query)
pprint(result)

[<Record count(n)=19>]


In [9]:
query = """
    MATCH (b:BREWERIES)
    WITH b.id AS brewery_id, COUNT(b) AS count
    WHERE count > 1
    RETURN brewery_id, count
    ORDER BY count DESC;
"""

result = execute_read(driver, query)

for record in result:
    print(f"Brewery ID: {record['brewery_id']}, Count: {record['count']}")
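
> Once the `id` values are confirmed to be unique, a uniqueness constraint could keep them that way. The sketch below only builds the Cypher text. `unique_constraint_query` and the generated constraint name are hypothetical, and the `FOR ... REQUIRE` syntax assumes a recent Neo4j version (older versions use a different form). An existing plain index on the same property may have to be dropped before the constraint can be created.

```python
def unique_constraint_query(label, prop):
    """Build a Cypher statement that enforces uniqueness of `prop` on `label`.

    Hypothetical helper; the constraint name is derived from label and property.
    """
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


for label in ["BEERS", "BREWERIES", "REVIEWS"]:
    print(unique_constraint_query(label, "id"))
```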


## 2.x Data types correction

> First we check for non-numeric values in the numerical properties

In [7]:
integer_properties = {
    "BEERS": ["abv"],
    "REVIEWS": ["smell", "look", "taste", "feel", "overall", "score"],
}

In [None]:


for node, properties in integer_properties.items():
    for prop in properties:
        print(f"Checking for non int values in '{prop}' in '{node}'")
        query = f"""
                MATCH (n:{node})
                WHERE NOT n.{prop} IS NULL and NOT n.{prop} =~ '^[0-9.]+$'
                RETURN n.{prop} AS {prop}
                LIMIT 5
        """
        result = execute_read(driver, query)
        pprint(result)
        print("\n\n")
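
> The pattern `^[0-9.]+$` used in the check above is a coarse filter rather than a full number parser. The Python sketch below illustrates what it accepts and rejects; the helper names and sample values are made up for illustration, and Python's `float` stands in for the later conversion.

```python
import re

NUMERIC = r"^[0-9.]+$"  # same pattern as in the Cypher check


def passes_check(value):
    """True if the value consists only of digits and dots."""
    return re.fullmatch(NUMERIC, value) is not None


def parses_as_float(value):
    """True if Python can convert the value to a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


for sample in ["4.5", "12", "1.2.3", "..", "-1", "1e3"]:
    print(f"{sample!r}: passes_check={passes_check(sample)}, "
          f"parses_as_float={parses_as_float(sample)}")
```

> Values such as `'1.2.3'` pass the pattern without being valid numbers, while `'-1'` is a valid number that the pattern reports. An empty result from the check is therefore a good sign, but not a strict guarantee that every value converts cleanly.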


> Convert string values to float for the properties with numerical values

> We run the conversion for BEERS first. REVIEWS has too many nodes, so we convert it in batches afterwards.

In [8]:

integer_properties = {
    "BEERS": ["abv"],
    #"REVIEWS": ["smell", "look", "taste", "feel", "overall", "score"],
}

In [None]:
for node, properties in integer_properties.items():
    for prop in properties:
        print(f"Typecasting '{prop}' to float in '{node}'")
        query = f"""
                MATCH (n:{node})
                WHERE n.{prop} IS NOT NULL 
                SET n.{prop} = toFloat(n.{prop})
                RETURN count(n) AS NodeCount
        """
        result = execute_write(driver, query)
        pprint(result)


Typecasting 'abv' to float in 'BEERS'
[<Record NodeCount=320076>]


In [8]:
integer_properties = {
    #"BEERS": ["abv"],
    "REVIEWS": ["smell", "look", "taste", "feel", "overall", "score"],
}


In [None]:
query = """
        MATCH (n:REVIEWS)
        WHERE n.smell IS NOT NULL 
        WITH n SKIP 1 LIMIT 1
        RETURN id(n) as ID
"""
result = execute_read(driver, query)
pprint(result)


[<Record ID=421086>,
 <Record ID=421087>,
 <Record ID=421088>,
 <Record ID=421089>,
 <Record ID=421091>,
 <Record ID=421097>,
 <Record ID=421099>,
 <Record ID=421103>,
 <Record ID=421113>,
 <Record ID=421117>,
 <Record ID=421126>,
 <Record ID=421130>,
 <Record ID=421131>,
 <Record ID=421138>,
 <Record ID=421142>,
 <Record ID=421151>,
 <Record ID=421157>,
 <Record ID=421166>,
 <Record ID=421171>,
 <Record ID=421172>,
 <Record ID=421173>,
 <Record ID=421175>,
 <Record ID=421177>,
 <Record ID=421178>,
 <Record ID=421179>,
 <Record ID=421181>,
 <Record ID=421182>,
 <Record ID=421183>,
 <Record ID=421184>,
 <Record ID=421186>]


In [18]:
batch_size = 100_000

for node, properties in integer_properties.items():
    for prop in properties:
        print(f"Typecasting '{prop}' to float in '{node}'")

        processed_count = 0

        while True:
            query = f"""
                MATCH (n:{node})
                WHERE n.{prop} IS NOT NULL
                WITH n ORDER BY ID(n) SKIP {processed_count} LIMIT {batch_size}
                SET n.{prop} = toFloat(n.{prop})
                RETURN count(n) AS NodeCount
            """
            
            result = execute_write(driver, query)
            
            if not result or result[0]['NodeCount'] == 0:
                break  
            
            processed_count += result[0]['NodeCount']

            print(f"Processed {processed_count} nodes in '{node}' for property '{prop}'")

Typecasting 'smell' to float in 'REVIEWS'
Processed 100000 nodes in 'REVIEWS' for property 'smell'
Processed 200000 nodes in 'REVIEWS' for property 'smell'
Processed 300000 nodes in 'REVIEWS' for property 'smell'
Processed 400000 nodes in 'REVIEWS' for property 'smell'
Processed 500000 nodes in 'REVIEWS' for property 'smell'


Checking for non int values in 'abv' in 'BEERS'
[]



Checking for non int values in 'smell' in 'REVIEWS'
[]



Checking for non int values in 'look' in 'REVIEWS'
[]



Checking for non int values in 'taste' in 'REVIEWS'
[]



Checking for non int values in 'feel' in 'REVIEWS'
[]



Checking for non int values in 'overall' in 'REVIEWS'
[]



Checking for non int values in 'score' in 'REVIEWS'
[]





In [14]:
# Inspect the first values of each property in sort order to spot odd entries

for node, properties in node_properties.items():
    for prop in properties:
        print(f"Checking for ordered values in '{prop}' in '{node}'")
        query = f"""
                MATCH (n:{node})
                WHERE NOT n.{prop} IS NULL
                RETURN n.{prop} AS {prop}
                ORDER BY n.{prop} ASC
                LIMIT 10
        """
        result = execute_read(driver, query)
        pprint(result)
        print("\n\n")

Checking for ordered values in 'notes' in 'BEERS'
[<Record notes='!Poblamo! is the latest in our farm bounty lineup, utilizing freshly fire roasted Poblano chiles straight from the field. Infused into a base of a malty and balanced amber ale, the roasted notes of the chiles dominate the aroma of this beer. The depth of maltiness and rich fruity flavors provide a solid canvas to present the bold flavor profile of the chiles. While having a lively tingle on the tongue, the heat level remains subdued and pleasant the whole glass through.'>,
 <Record notes='"**Chef Collaboration Series** Brewed with the crew from Longman & Eagle. Golden Belgian Double IPA with a fruity aroma and hint of pine. Dry-hopped with Citra and Simcoe hops."'>,
 <Record notes='"**Coffee Collaboration Series** A rotating series of local coffee companies. Our first \'blend\' is from Crop to Cup Coffee Co. Dark, rich, and roasted flavors from JuJu espresso beans."'>,
 <Record notes='"**Coffee Collaboration Series** A r

> Convert dates to datetime

## 2.5 Invert relationship

> Instead of actually reversing the direction, we will rename the relationship types.
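
> Cypher has no in-place rename for a relationship type, so one way to change the name is to create a copy with the new type and delete the original. The sketch below only builds the query text. `rename_relationship_query` is a hypothetical helper, and the new type name in the example (`POSTED_BY`) is an illustrative assumption, not a decision taken in this notebook.

```python
def rename_relationship_query(old_type, new_type):
    """Build Cypher that replaces every `old_type` relationship with `new_type`.

    The direction is kept; swapping a and b in the CREATE clause would reverse it.
    """
    return f"""
        MATCH (a)-[r_old:{old_type}]->(b)
        CREATE (a)-[r_new:{new_type}]->(b)
        SET r_new = properties(r_old)
        DELETE r_old
        RETURN count(r_new) AS Renamed
    """


print(rename_relationship_query("POSTED", "POSTED_BY"))
```

> With roughly 2.5 million `POSTED` and `REVIEWED` relationships, running this in a single transaction may be heavy, so it would probably need to be batched.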

> Check whether the `state` of each brewery matches the `state` of its beers.

> Are beers connected to only one brewery? If a beer had more than one, would it make sense for `brewery_id` to exist? It could perhaps be improved, though, given the way we search.