### Problem Statement

The data comes from the [NYC Open Data website](https://opendata.cityofnewyork.us/) and describes park events in the New York City area.

The data provided here consists of two collections: **events** and **neighbourhoods**.

Documents in the **events** collection have the following fields:

- `event_id` - Unique event id

- `title` - Name of the event

- `start_date_time` - The start date and time of the event

- `end_date_time` - The end date and time of the event

- `snippet` - A brief description of the event

- `cost_free` - Indicates whether an event is free (1) or paid (0)

- `must_see` - Indicates whether the event should be featured on the Parks website with a "Must See" banner: 1 if the event is featured, 0 if it is not

- `location_name` - Name of the location where the event takes place

- `location` - GeoJSON point holding the longitude and latitude of the event location


Documents in the **neighbourhoods** collection have the following fields:

- `properties` - Embedded document containing information related to the neighbourhood

>- `ntacode` - Neighbourhood code
>- `ntaname` - Neighbourhood name
>- `boro_code` - Code of the borough in which the neighbourhood falls
>- `boro_name` - Name of the borough in which the neighbourhood falls

- `geometry` - GeoJSON object containing the coordinates of the neighbourhood boundary



----

*The data for the **events** collection was originally taken from https://data.cityofnewyork.us/browse?Data-Collection_Data-Collection=NYC+Parks+Events&sortBy=most_accessed&utf8=%E2%9C%93*

*The data for the **neighbourhoods** collection was originally taken from https://data.cityofnewyork.us/City-Government/Neighborhood-Tabulation-Areas-NTA-/cpf4-rkhq*


----

### Connecting to MongoDB


----

In [16]:
# Importing the required libraries
import pymongo
import pprint as pp

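# Keep pprint from sorting dictionary keys: shadow the `sorted` name
# that the pprint module looks up with a function that returns its input unchanged
pp.sorted = lambda x, key=None: x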

In [17]:
# Credentials redacted: substitute your own username and password
client = pymongo.MongoClient("mongodb+srv://<username>:<password>@vikas1234.3rgy0.mongodb.net/?retryWrites=true&w=majority&appName=Vikas1234")

---
### Importing data

----

In [3]:
# # Restore database
# !mongorestore /home/avadmin/Desktop/Mongo/Content/Indexing/Assignment/Data/indexing_assignment

In [18]:
# Database
#db = client['indexing_assignment']
db = client["indexing"]
events_collection = db["events"]
neighbourhoods_collection = db["neighbourhoods"]

In [6]:
import os 
import bson
os.chdir(r"C:\Users\vicky\Downloads\Assignment-210629-132632\Assignment\Data\indexing_assignment\indexing_assignment")

In [19]:
os.listdir()

['events.bson',
 'events.metadata.json',
 'neighbourhoods.bson',
 'neighbourhoods.metadata.json']

In [21]:
with open('events.bson', "rb") as bson_file:
    for doc in bson.decode_file_iter(bson_file):
        events_collection.update_one({'_id': doc['_id']}, {'$set': doc}, upsert=True)

In [27]:
with open('neighbourhoods.bson', "rb") as bson_file:
    for doc in bson.decode_file_iter(bson_file):
        neighbourhoods_collection.update_one({'_id': doc['_id']}, {'$set': doc}, upsert=True)

In [28]:
# List collections
db.list_collection_names()

['events', 'neighbourhoods']

In [29]:
# Sample document
pp.pprint(
    db.events.find_one()
)

{'_id': ObjectId('60d9cb7310d0be7a77638579'),
 'cost_free': 0,
 'end_date_time': datetime.datetime(2018, 10, 21, 12, 30),
 'event_id': 173635,
 'location': {'type': 'Point',
              'coordinates': [-73.973614931107, 40.769109102536]},
 'location_name': 'Dairy Visitor Center & Gift Shop',
 'must_see': 0,
 'snippet': 'Some of New York’s most iconic sights are found in Central Park, '
            'including the fountain at Bethesda Terrace and Bow Bridge. Join '
            'Central Park Conservancy guides for an insider’s look.',
 'start_date_time': datetime.datetime(2018, 10, 21, 11, 0),
 'title': 'Central Park Tour: Iconic Views of Central Park'}


In [30]:
# Sample document
pp.pprint(
    db.neighbourhoods.find_one()
)

{'_id': ObjectId('60d9d8036fa8d9e558634f2c'),
 'geometry': {'type': 'MultiPolygon',
              'coordinates': [[[[-73.97604935657381, 40.631275905646774],
                                [-73.97716511994669, 40.63074665412933],
                                [-73.97699848928193, 40.629871496125375],
                                [-73.9768496430902, 40.6290885814784],
                                [-73.97669604371914, 40.628354564208756],
                                [-73.97657775689153, 40.62757318681896],
                                [-73.9765146210018, 40.627294490493874],
                                [-73.97644970441577, 40.627008255472994],
                                [-73.97623453682755, 40.625976350730234],
                                [-73.97726150032737, 40.6258527728136],
                                [-73.97719665645002, 40.62510197855896],
                                [-73.97710959292857, 40.624948259691514],
                                [-73.

----
### Assignment Questions


Note - View all queries before attempting the questions. Use proper indexing to answer the questions.
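
One way to confirm that a query actually uses an index is to inspect its explain plan. The helper below is a minimal sketch, not part of the assignment: `winning_indexes` is a hypothetical name, and the sample dictionary is hand-written to mirror the common shape of an explain result rather than copied from a real run.

```python
def winning_indexes(explain_output):
    """Return the index names referenced by the winning plan of an explain() result."""
    names = []

    def walk(node):
        # Recurse through nested stages and collect every "indexName" value.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "indexName" and isinstance(value, str):
                    if value not in names:
                        names.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(explain_output.get("queryPlanner", {}).get("winningPlan", {}))
    return names


# Hand-written sample in the shape explain() commonly returns (not real output).
sample_explain = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "must_see_1"},
        },
        "rejectedPlans": [],
    }
}

print(winning_indexes(sample_explain))  # ['must_see_1']
```

Against a live collection, the same helper could be applied to `events_collection.find({"must_see": 1}).explain()`. An empty list suggests a collection scan.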

----

### Q1

How many events were `must see events`?

In [31]:
# Enter your code here
must_see_count = events_collection.count_documents({"must_see": 1})
print("Number of must-see events:", must_see_count)


Number of must-see events: 4360


In [32]:
events_collection.create_index([("must_see", pymongo.ASCENDING)])

'must_see_1'

### Q2

How `many events` were must see as well as `cost free`?

In [33]:
# Enter your code here
must_see_free_count = events_collection.count_documents({"must_see": 1, "cost_free": 1})
print("Number of must-see and cost-free events:", must_see_free_count)

Number of must-see and cost-free events: 3643


In [34]:
events_collection.create_index([("must_see", pymongo.ASCENDING), ("cost_free", pymongo.ASCENDING)])

'must_see_1_cost_free_1'

### Q3

How many `must see and cost free events` were held after `2018-01-01`?

In [35]:
# Enter your code here
from datetime import datetime

# Define the date filter
date_filter = datetime(2018, 1, 1)

# Query for must-see and cost-free events after 2018-01-01
filtered_count = events_collection.count_documents({
    "must_see": 1,
    "cost_free": 1,
    "start_date_time": {"$gte": date_filter}
})

print("Number of must-see and cost-free events after 2018-01-01:", filtered_count)

Number of must-see and cost-free events after 2018-01-01: 597


In [36]:
events_collection.create_index([("must_see", pymongo.ASCENDING), 
                               ("cost_free", pymongo.ASCENDING), 
                               ("start_date_time", pymongo.ASCENDING)])

'must_see_1_cost_free_1_start_date_time_1'

### Q4

How many indexes did you use to answer the above queries? List the index keys for each index used.

In [11]:
# Answer
# Total number of indexes used: 3
# Q1: single-field index on must_see -> {must_see: 1}
# Q2: compound index on must_see and cost_free -> {must_see: 1, cost_free: 1}
# Q3: compound index on must_see, cost_free and start_date_time -> {must_see: 1, cost_free: 1, start_date_time: 1}
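
The index names returned by `create_index` in the cells above follow directly from the index keys: each `field_direction` pair is joined with underscores. The sketch below reproduces that naming rule in plain Python so that names and keys can be matched up. `default_index_name` is a hypothetical helper, not a PyMongo function.

```python
def default_index_name(keys):
    """Build the default index name from a list of (field, direction_or_type) pairs."""
    return "_".join(f"{field}_{kind}" for field, kind in keys)


# pymongo.ASCENDING is 1 and pymongo.TEXT is the string "text".
print(default_index_name([("must_see", 1)]))
print(default_index_name([("must_see", 1), ("cost_free", 1)]))
print(default_index_name([("must_see", 1), ("cost_free", 1), ("start_date_time", 1)]))
print(default_index_name([("title", "text")]))
```

To list the keys of the indexes that actually exist, `events_collection.index_information()` returns a dictionary keyed by index name.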

### Q5

What was the combined size of all the indexes created for the above queries?

In [12]:
# Answer
Total Combined Index Size: 24576 bytes
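
A combined size like this can be derived from the `indexSizes` field of the `collStats` command output, which maps each index name to its size in bytes. The sketch below only shows the summation: `combined_index_size` is a hypothetical helper, and the byte values in `sample_stats` are made up for illustration, not measured on this dataset.

```python
def combined_index_size(coll_stats, index_names):
    """Sum the sizes (in bytes) of the named indexes from a collStats result."""
    sizes = coll_stats["indexSizes"]
    return sum(sizes[name] for name in index_names)


# Illustrative values only; real sizes depend on the data and storage engine.
sample_stats = {
    "indexSizes": {
        "_id_": 20480,
        "must_see_1": 4096,
        "must_see_1_cost_free_1": 8192,
        "must_see_1_cost_free_1_start_date_time_1": 16384,
    }
}

query_indexes = [
    "must_see_1",
    "must_see_1_cost_free_1",
    "must_see_1_cost_free_1_start_date_time_1",
]
print(combined_index_size(sample_stats, query_indexes))  # 28672
```

With a live connection, the statistics would come from `db.command("collstats", "events")`. The default `_id_` index is left out of the sum here because it was not created for the queries.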

### Q6

How many events have the exact term `"Central Park" but not the term "Tour"` in the `title` field? 

***Hint - You will need to create a text index here.***

In [38]:
events_collection.create_index([("title", pymongo.TEXT)])

'title_text'

In [39]:
event_count = events_collection.count_documents({
    "$text": {"$search": "\"Central Park\" -Tour"}
})

print("Number of events with 'Central Park' but not 'Tour':", event_count)


Number of events with 'Central Park' but not 'Tour': 462
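
The `$search` string above combines two pieces of text-search syntax: double quotes mark an exact phrase, and a leading hyphen excludes a term. As a small sketch of how such strings can be assembled without hand-escaping quotes, the hypothetical helper `build_text_search` below builds the same string from its parts.

```python
def build_text_search(phrases=(), excluded=()):
    """Assemble a $text search string from exact phrases and excluded terms."""
    parts = [f'"{phrase}"' for phrase in phrases]   # exact phrases are double-quoted
    parts += [f"-{term}" for term in excluded]      # excluded terms get a leading hyphen
    return " ".join(parts)


search = build_text_search(phrases=["Central Park"], excluded=["Tour"])
text_query = {"$text": {"$search": search}}
print(search)  # "Central Park" -Tour
```

The resulting `text_query` can be passed to `events_collection.count_documents` in place of the literal filter. Text search is case-insensitive by default.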


### Q7

How many events were held in `Williamsburg` neighbourhood of `Brooklyn` borough?

***Hint - Create geospatial index for this query. Use the `neighbourhoods` collection for geometry of the neighbourhood. Query on the `ntaname` and `boro_name` fields.***

In [40]:
# Enter your code here
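# The GeoJSON point is stored directly in `location` (see the sample document),
# so the 2dsphere index goes on that field
events_collection.create_index([("location", pymongo.GEOSPHERE)])

'location_2dsphere'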

In [41]:
williamsburg_geo = neighbourhoods_collection.find_one({
    "properties.ntaname": "Williamsburg",
    "properties.boro_name": "Brooklyn"
}, {"geometry": 1})

In [42]:
if williamsburg_geo:
    geometry = williamsburg_geo["geometry"]

    # Find events within Williamsburg's boundary.
    # The GeoJSON point is stored directly in `location` (see the sample
    # document above); events have no `location.geometry` field, so a
    # filter on that path matches nothing.
    williamsburg_events_count = events_collection.count_documents({
        "location": {
            "$geoWithin": {
                "$geometry": geometry
            }
        }
    })

    print("Number of events in Williamsburg, Brooklyn:", williamsburg_events_count)
else:
    print("Williamsburg neighborhood not found in the database.")


In [43]:
# Geospatial index on event locations (the GeoJSON point lives in `location`)
events_collection.create_index([("location", pymongo.GEOSPHERE)])

# Index for quick lookup of neighborhood geometry
neighbourhoods_collection.create_index([("properties.ntaname", pymongo.ASCENDING), 
                                        ("properties.boro_name", pymongo.ASCENDING)])

'properties.ntaname_1_properties.boro_name_1'

### Q8

Name the title of the `paid and must see events` that are located maximum `500 meters` from the `Brooklyn Museum (coordinates = [-73.9636, 40.6712])` after `2018-06-06`.

In [44]:
# Enter your code here
events_collection.create_index([("location", pymongo.GEOSPHERE)])

'location_2dsphere'

In [45]:
from datetime import datetime

# Brooklyn Museum Coordinates
brooklyn_museum_coords = [-73.9636, 40.6712]

# Convert meters to radians (for geospatial query)
max_distance_meters = 500
earth_radius_meters = 6371000  # Earth's radius in meters
max_distance_radians = max_distance_meters / earth_radius_meters

# Define date filter
start_date = datetime(2018, 6, 6)

# Query events matching criteria
query = {
    "location": {  # GeoJSON point field, as shown in the sample document
        "$geoWithin": {
            "$centerSphere": [brooklyn_museum_coords, max_distance_radians]
        }
    },
    "cost_free": 0,  # Paid events only
    "must_see": 1,   # Must-see events
    "start_date_time": {"$gte": start_date}  # Events after 2018-06-06
}

# Fetch matching event titles
matching_events = events_collection.find(query, {"title": 1, "_id": 0})

# Print results
print("Paid & Must-See Events Near Brooklyn Museum (After 2018-06-06):")
for event in matching_events:
    print(event["title"])

Paid & Must-See Events Near Brooklyn Museum (After 2018-06-06):
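
As an alternative to converting meters to radians for `$centerSphere`, the `$nearSphere` operator accepts `$maxDistance` in meters when given a GeoJSON point. The sketch below only builds the filter document. `near_sphere_filter` is a hypothetical helper, and the field name `location` is an assumption based on the sample event document shown earlier.

```python
from datetime import datetime


def near_sphere_filter(lon, lat, max_meters):
    """Build a $nearSphere clause around a GeoJSON point, with the radius in meters."""
    return {
        "$nearSphere": {
            "$geometry": {"type": "Point", "coordinates": [lon, lat]},
            "$maxDistance": max_meters,
        }
    }


near_query = {
    "location": near_sphere_filter(-73.9636, 40.6712, 500),  # Brooklyn Museum
    "cost_free": 0,                                          # paid events
    "must_see": 1,                                           # must-see events
    "start_date_time": {"$gte": datetime(2018, 6, 6)},
}
```

Such a filter would be used with `events_collection.find(near_query, {"title": 1, "_id": 0})`. `$nearSphere` requires a geospatial index on the queried field and returns results ordered by distance. It is not accepted by `count_documents`, so `$geoWithin` remains the choice when only a count is needed.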


In [46]:
# Geospatial index for location-based queries (the GeoJSON point lives in `location`)
events_collection.create_index([("location", pymongo.GEOSPHERE)])

# Index for faster event filtering
events_collection.create_index([
    ("cost_free", pymongo.ASCENDING), 
    ("must_see", pymongo.ASCENDING), 
    ("start_date_time", pymongo.ASCENDING)
])

'cost_free_1_must_see_1_start_date_time_1'