## Loading the required libraries

The data comes from the Kaggle website. It was originally in comma-separated values (CSV) format; I converted the file to JSON using the website "https://csvjson.com/csv2json".

Data Source: https://www.kaggle.com/datasets/sujaykapadnis/cats-vs-dogs
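The conversion above was done with the website, but the same step can also be sketched locally with Python's standard library. The snippet below is only an illustration under stated assumptions: the helper names `coerce` and `csv_to_records` are hypothetical, and the sample text is a three-column excerpt rather than the full Kaggle file.

```python
import csv
import io
import json


def coerce(value):
    """Convert a CSV string to int or float when possible, else keep it unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except (ValueError, TypeError):
            continue
    return value


def csv_to_records(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: coerce(val) for key, val in row.items()} for row in reader]


# Hypothetical excerpt; the real file has more columns and rows.
sample_csv = "state,n_households,percent_pet_households\nAlabama,1828,59.5\n"
records = csv_to_records(sample_csv)
print(json.dumps(records, indent=4))
```

`csv.DictReader` returns every value as a string, which is why the sketch converts numeric-looking values before dumping them to JSON.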

Code Explanation:
1. import json: This line imports the json module, which provides functions for reading and writing JSON data in Python.

2. with open("C:/Users/saivi/bd-f23/bd-f23/catsvsdogsjson.json", "r") as json_file:: This line uses a context manager (the with statement) to open the JSON file at the specified path in read mode ("r"). The as keyword binds the opened file object to the name json_file, and the file is closed automatically when the block ends.

3. cvd=json.load(json_file): Inside the with block, this line parses the contents of json_file with json.load() and stores the result in the cvd variable. Because the top-level value in the file is a JSON array, cvd is a Python list of dictionaries, one per record.
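To make the third step concrete, here is a minimal sketch of what json.load() returns when the top-level JSON value is an array. It uses an in-memory `io.StringIO` object as a stand-in for the file on disk, and the one-record sample is an assumption for illustration.

```python
import io
import json

# Stand-in for the file on disk: a JSON array holding one object.
json_text = '[{"state": "Alabama", "n_households": 1828}]'

with io.StringIO(json_text) as json_file:
    data = json.load(json_file)

print(type(data).__name__)     # type of the top-level value
print(type(data[0]).__name__)  # type of each item
print(data[0]["state"])        # fields are accessed by key
```

Each item can then be handled like any other dictionary, which is what makes the list suitable for inserting into MongoDB as a batch of documents.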

In [1]:
import json

with open("C:/Users/saivi/bd-f23/bd-f23/catsvsdogsjson.json", "r") as json_file:
    cvd=json.load(json_file)


In [2]:
print(json.dumps(cvd, indent=4))

[
    {
        "": 1,
        "state": "Alabama",
        "n_households": 1828,
        "percent_pet_households": 59.5,
        "n_pet_households": 1088,
        "percent_dog_owners": 44.1,
        "n_dog_households": 807,
        "avg_dogs_per_household": 1.7,
        "dog_population": 1410,
        "percent_cat_owners": 27.4,
        "n_cat_households": 501,
        "avg_cats_per_household": 2.5,
        "cat_population": 1252
    },
    {
        "": 2,
        "state": "Arizona",
        "n_households": 2515,
        "percent_pet_households": 59.5,
        "n_pet_households": 1497,
        "percent_dog_owners": 40.1,
        "n_dog_households": 1008,
        "avg_dogs_per_household": 1.8,
        "dog_population": 1798,
        "percent_cat_owners": 29.6,
        "n_cat_households": 743,
        "avg_cats_per_household": 1.9,
        "cat_population": 1438
    },
    {
        "": 3,
        "state": "Arkansas",
        "n_households": 1148,
        "percent_pet_households": 62.4,

## Connecting to MongoDB


1. import pymongo: This line imports the pymongo module, the Python driver for MongoDB, which lets Python code interact with MongoDB databases.

2. import credential_ak: This line imports a local module named credential_ak that holds my MongoDB credentials (username and password). Keeping sensitive information in a separate module keeps it out of the main code.

3. connection_string = f"mongodb+srv://{credential_ak.username}:{credential_ak.password}@cluster1.cs30gni.mongodb.net/?retryWrites=true&w=majority": This line builds the connection string by inserting the username and password from the credential_ak module into the URI. The connection string is used to connect to the MongoDB cluster hosted at cluster1.cs30gni.mongodb.net. The options at the end are common for MongoDB Atlas clusters: retryWrites=true lets the driver retry certain write operations after a transient error, and w=majority sets the write concern so that a write is acknowledged only after a majority of replica set members have applied it.
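One detail worth sketching: the PyMongo documentation asks for the username and password in a connection URI to be percent-escaped, which matters when a password contains characters such as "@" or "/". The helper `build_connection_string` and the credential and host values below are hypothetical placeholders, not taken from the notebook.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host):
    """Return a mongodb+srv URI with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


# Placeholder values for illustration only; not real credentials.
uri = build_connection_string("demo_user", "p@ss/word", "cluster1.example.mongodb.net")
print(uri)
```

If the credentials contain only letters and digits, escaping leaves them unchanged, so the f-string used in the notebook produces the same result.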

In [3]:
import pymongo
import credential_ak

connection_string = f"mongodb+srv://{credential_ak.username}:{credential_ak.password}@cluster1.cs30gni.mongodb.net/?retryWrites=true&w=majority"

## Load the Cats vs Dogs data into MongoDB


This cell reuses the pymongo module and the connection_string defined in the previous cell, so nothing needs to be imported again.

client = pymongo.MongoClient(connection_string): Here, I create a MongoClient instance from the connection_string. This client is my connection to the MongoDB cluster.

db = client['Assignment_MongoDB_3']: I access the database named 'Assignment_MongoDB_3' on the cluster. MongoDB creates the database when data is first written to it.

collection = db['Cats vs Dogs Data']: I access the collection named 'Cats vs Dogs Data' within the database. Collections are similar to tables in relational databases.

cvd_json = collection.insert_many(cvd): This line inserts every dictionary in the cvd list as a separate document in the 'Cats vs Dogs Data' collection using the insert_many method. The returned result object, stored in cvd_json, records the _id values assigned to the inserted documents.

In [4]:
client=pymongo.MongoClient(connection_string)
db=client['Assignment_MongoDB_3']
collection=db['Cats vs Dogs Data']
cvd_json=collection.insert_many(cvd)
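The records carry a field whose name is an empty string (""), presumably the unnamed index column of the original CSV. The notebook inserts the records unchanged; as an optional clean-up before calling insert_many, the field could be renamed. The sketch below assumes a hypothetical helper `rename_blank_key` and a hypothetical replacement name `row_id`.

```python
def rename_blank_key(records, new_name="row_id"):
    """Return copies of the records with the "" key renamed to new_name."""
    cleaned = []
    for record in records:
        copy = dict(record)  # shallow copy so the input list is left untouched
        if "" in copy:
            copy[new_name] = copy.pop("")
        cleaned.append(copy)
    return cleaned


# Two-field excerpt of the records shown earlier.
sample = [{"": 1, "state": "Alabama"}, {"": 2, "state": "Arizona"}]
cleaned = rename_blank_key(sample)
print(cleaned)
```

A named field is easier to reference in later queries and projections than an empty-string key.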

## Querying the data

I import json and ObjectId from bson (the ObjectId import is not strictly needed here, because str() works on the _id value without it). I define a query that matches documents whose "state" field equals "Alabama" and pass it to the collection's find method, which returns a cursor over the matching documents. For each document, I convert its _id from an ObjectId to a string, because json.dumps cannot serialize ObjectId values directly, then format the document as indented JSON and print it.

In [6]:
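import json
from bson import ObjectId  # type of the "_id" field; not used directly below

# Match documents whose "state" field equals "Alabama"
query={"state":"Alabama"}
doc=collection.find(query)  # returns a cursor, not a single document
for record in doc:
    # ObjectId is not JSON serializable, so convert it to a string first
    record["_id"] = str(record["_id"])
    json_data=json.dumps(record, indent=4)
    print(json_data)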

{
    "_id": "6511be7c30e613232b62647c",
    "": 1,
    "state": "Alabama",
    "n_households": 1828,
    "percent_pet_households": 59.5,
    "n_pet_households": 1088,
    "percent_dog_owners": 44.1,
    "n_dog_households": 807,
    "avg_dogs_per_household": 1.7,
    "dog_population": 1410,
    "percent_cat_owners": 27.4,
    "n_cat_households": 501,
    "avg_cats_per_household": 2.5,
    "cat_population": 1252
}
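An alternative to converting _id by hand is the default parameter of json.dumps, which is called for any value the encoder cannot serialize. To keep the sketch runnable without a database, it uses a hypothetical stand-in class, `FakeObjectId`, in place of bson.ObjectId; with real query results the same `default=str` argument should apply.

```python
import json


class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId so the sketch runs without PyMongo."""

    def __init__(self, hex_string):
        self.hex_string = hex_string

    def __str__(self):
        return self.hex_string


record = {"_id": FakeObjectId("6511be7c30e613232b62647c"), "state": "Alabama"}

# default=str is called for any value json cannot serialize natively.
json_data = json.dumps(record, indent=4, default=str)
print(json_data)
```

This leaves the original record untouched, whereas the loop in the notebook overwrites the _id field in place.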


## Aggregating the Data

This code defines a two-stage aggregation pipeline in MongoDB. The $group stage uses "_id": None to place every document in a single group and computes the average of "n_households," "dog_population," and "cat_population" with $avg. The $project stage then drops the _id field and keeps the three averages. The code executes the pipeline and prints the result as indented JSON. If the pipeline returns no documents, it prints "No data found."

In [8]:
# Define the aggregation pipeline
pipeline = [
    {
        "$group": {
            "_id": None,
            "avg_n_households": {"$avg": "$n_households"},
            "avg_dog_population": {"$avg": "$dog_population"},
            "avg_cat_population": {"$avg": "$cat_population"}
        }
    },
    {
        "$project": {
            "_id": 0,
            "avg_n_households": 1,
            "avg_dog_population": 1,
            "avg_cat_population": 1
        }
    }
]

# Execute the aggregation pipeline
result = list(collection.aggregate(pipeline))

# Print the result
if result:
    print(json.dumps(result[0], indent=4))
else:
    print("No data found.")


{
    "avg_n_households": 2403.8979591836733,
    "avg_dog_population": 1414.1632653061224,
    "avg_cat_population": 1492.795918367347
}
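As a rough cross-check of what $avg computes, the same averages can be calculated in plain Python over a list of records. The sketch below uses only the two complete records visible in the earlier printed output (Alabama and Arizona), so its numbers differ from the full-collection averages shown above.

```python
from statistics import mean

# The two complete records visible in the earlier printed output.
records = [
    {"state": "Alabama", "n_households": 1828, "dog_population": 1410, "cat_population": 1252},
    {"state": "Arizona", "n_households": 2515, "dog_population": 1798, "cat_population": 1438},
]

fields = ["n_households", "dog_population", "cat_population"]

# One average per field, named the same way as in the pipeline output.
averages = {f"avg_{field}": mean(r[field] for r in records) for field in fields}
print(averages)
```

Running the same calculation over the full cvd list would be one way to confirm that the pipeline result matches the source data.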


## Saving the result to a file

This code writes the aggregation result (stored in the result list) to a JSON file at the specified file path, indented for readability. It uses a with block to ensure that the file is properly closed after writing the data, even if an error occurs.

In [9]:
# result is already a list, so it can be dumped directly
with open("C:/Users/saivi/bd-f23/bd-f23/catsvsdogs_agg.json", "w") as write_data:
    json.dump(result, write_data, indent=5)
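A quick way to confirm that a file like this was written correctly is to read it back and compare it with the in-memory result. The sketch below is an illustration only: it writes to a temporary directory rather than the notebook's path, and its `result` list is a one-field excerpt of the aggregation output.

```python
import json
import os
import tempfile

# One-field excerpt of the aggregation output, for illustration.
result = [{"avg_n_households": 2403.8979591836733}]

# Write to a temporary directory instead of the notebook's Windows path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "catsvsdogs_agg.json")
    with open(path, "w") as out_file:
        json.dump(result, out_file, indent=4)
    with open(path, "r") as in_file:
        loaded = json.load(in_file)

# True when the data read back equals the data written.
print(loaded == result)
```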