# Loading the Iris JSON File

There are many ways to obtain the data; two convenient ones are:
1. Download a JSON file from a Kaggle dataset. Here I have used the Iris dataset, available at the link below:
https://www.kaggle.com/datasets/rtatman/iris-dataset-json-version?resource=download

2. Synthesize the data. We can create our own data with tools like ChatGPT, which reduces the manual effort.

For this project I used method 1.


Code explanation:
1. Import the json module to work with JSON data.
2. Open the "iris.json" file in read mode.
3. Load the JSON data from the file into a Python variable named iris.

Together, these steps read the JSON data from the file and store it in a Python variable for further use.

In [2]:
import json

# Open the file in read mode and parse its contents into Python objects
with open("C:/Users/saivi/bd-f23/bd-f23/W05/iris.json", "r") as json_file:
    iris = json.load(json_file)
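A quick sanity check after loading can catch a wrong path or an unexpected file layout early. The sketch below is not part of the original notebook: the helper name `load_records` is hypothetical, and it assumes the file holds a JSON array of objects, which is the shape the printed output in the next section suggests.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a JSON file and check that it holds a list of objects.

    `load_records` is a hypothetical helper, not part of the original notebook.
    """
    with Path(path).open("r", encoding="utf-8") as json_file:
        data = json.load(json_file)
    # Fail early if the file is not a JSON array of objects
    if not isinstance(data, list) or not all(isinstance(row, dict) for row in data):
        raise ValueError(f"{path} does not contain a JSON array of objects")
    return data
```

Under these assumptions, `iris = load_records("iris.json")` would return the same list as the cell above, provided the notebook runs in the folder that contains the file.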


## Printing the Contents of the Data

The Iris dataset consists of measurements of three species of Iris flowers: Setosa, Versicolor, and Virginica. Each species is represented by 50 samples, for a total of 150 samples in the dataset. Four features (attributes) are measured for each sample, all in centimeters:

Sepal Length, Sepal Width, Petal Length, Petal Width
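The sample counts described above are easy to confirm once the data is loaded. The following is a minimal sketch; `species_counts` is a hypothetical helper, and the three records in `sample` are only an illustrative subset in the same shape as the dataset.

```python
from collections import Counter


def species_counts(records):
    """Count how many records each species has."""
    return Counter(record["species"] for record in records)


# Small illustrative sample in the same shape as the dataset
sample = [
    {"sepalLength": 5.1, "sepalWidth": 3.5, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"},
    {"sepalLength": 4.9, "sepalWidth": 3.0, "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"},
    {"sepalLength": 6.3, "sepalWidth": 3.3, "petalLength": 6.0, "petalWidth": 2.5, "species": "virginica"},
]
print(species_counts(sample))
```

On the full dataset, `species_counts(iris)` should report 50 for each of the three species if the description holds.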

In [4]:
print(json.dumps(iris, indent=4))

[
    {
        "sepalLength": 5.1,
        "sepalWidth": 3.5,
        "petalLength": 1.4,
        "petalWidth": 0.2,
        "species": "setosa"
    },
    {
        "sepalLength": 4.9,
        "sepalWidth": 3.0,
        "petalLength": 1.4,
        "petalWidth": 0.2,
        "species": "setosa"
    },
    {
        "sepalLength": 4.7,
        "sepalWidth": 3.2,
        "petalLength": 1.3,
        "petalWidth": 0.2,
        "species": "setosa"
    },
    {
        "sepalLength": 4.6,
        "sepalWidth": 3.1,
        "petalLength": 1.5,
        "petalWidth": 0.2,
        "species": "setosa"
    },
    {
        "sepalLength": 5.0,
        "sepalWidth": 3.6,
        "petalLength": 1.4,
        "petalWidth": 0.2,
        "species": "setosa"
    },
    {
        "sepalLength": 5.4,
        "sepalWidth": 3.9,
        "petalLength": 1.7,
        "petalWidth": 0.4,
        "species": "setosa"
    },
    {
        "sepalLength": 4.6,
        "sepalWidth": 3.4,
        "petalLength": 1.4,
   

## Load the iris data into MongoDB

Connect to MongoDB with credentials.

### Step 1: Import Required Modules
Import the pymongo module for MongoDB interaction.
Import a custom module (credentials.py) that stores your MongoDB username and password.
### Step 2: Create a Connection String
Construct a connection string that includes:
1. The protocol for connecting to MongoDB Atlas (mongodb+srv).
2. Your MongoDB username and password, obtained from the credentials module.
3. The hostname of your MongoDB Atlas cluster.
4. Connection options that enable retryable writes (retryWrites=true) and majority write concern (w=majority).
### Step 3: Create a MongoDB Client
Create a MongoDB client using the connection string. The client will be used to interact with the MongoDB server.
### Step 4: Access the Database and Collection
Access a specific MongoDB database (e.g., 'MongoDB_Assignment').
Access a specific collection (similar to a table) within the database (e.g., 'Iris_Data'). These objects (db and collection) are used for database and collection operations, respectively.
### Step 5: Insert Data into the Collection
Insert data into the 'Iris_Data' collection. Use the insert_many() method to insert multiple documents from a variable (e.g., iris) into the collection. The method returns an object containing information about the inserted documents, including their ObjectIds.

In [6]:
import pymongo
import credentials

connection_string = f"mongodb+srv://{credentials.username}:{credentials.password}@cluster1.2mfunqg.mongodb.net/?retryWrites=true&w=majority"
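If the username or password contains characters such as `@`, `:` or `/`, they need to be percent-encoded before going into the connection string. Here is a minimal sketch of that step. It assumes `credentials.py` simply defines two string variables named `username` and `password`; the notebook does not show that file, so this layout is a guess. The helper `build_connection_string` and the host and credential values are placeholders for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host):
    """Build a mongodb+srv URI with percent-encoded credentials.

    Hypothetical helper; the options mirror those used in the notebook.
    """
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/?retryWrites=true&w=majority"


# Placeholder values for illustration only
uri = build_connection_string("demo_user", "p@ss:word", "cluster1.example.mongodb.net")
print(uri)
```

Keeping the credentials in a separate module that is excluded from version control, as the notebook does, keeps them out of the shared code.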

Load the iris data into the database.

In [8]:
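# Create the client, then select the database and the collection
client = pymongo.MongoClient(connection_string)
db = client['MongoDB_Assignment']
collection = db['Iris_Data']
# insert_many() returns a result object that holds the inserted ObjectIds
iris_json = collection.insert_many(iris)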
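One thing to watch with this step is that loading the file again and re-running the insert would add a second copy of every document. A hedged sketch of a guard follows; `insert_if_empty` is a hypothetical helper, and it relies only on the `count_documents()` and `insert_many()` methods of a PyMongo collection.

```python
def insert_if_empty(collection, documents):
    """Insert documents only when the collection has none yet.

    Returns the number of documents inserted (0 if the insert was skipped).
    `collection` is expected to be a PyMongo collection object.
    """
    # An empty filter counts every document in the collection
    if collection.count_documents({}) > 0:
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With this in place, `insert_if_empty(collection, iris)` could stand in for the direct `insert_many()` call; a return value of 150 would indicate that the whole dataset was loaded.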

## Querying the Data

The code below retrieves the documents in collection that match a query, in this case documents whose "species" field is "virginica".

PyMongo returns each matching document as a Python dictionary, so I iterate over the results with a for loop. The one obstacle to printing them as JSON is the "_id" field: it holds a special type called ObjectId, which is not directly serializable to JSON. To work around this, I convert it to its string representation with str(record["_id"]). I then use json.dumps() from the json module to serialize each dictionary as a formatted JSON string, which prints the document in a human-readable form with proper indentation.
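Converting the "_id" field by hand is one option. Another, sketched here as an alternative rather than as the notebook's method, is to pass `default=str` to `json.dumps()`, so that any value the json module cannot serialize is converted with `str()`. The helper name `to_json` is hypothetical.

```python
import json


def to_json(record):
    """Serialize a document, falling back to str() for non-JSON types."""
    return json.dumps(record, indent=4, default=str)
```

With this helper the loop body reduces to `print(to_json(record))`, and the record itself is left unmodified. The trade-off is that `default=str` applies to every unsupported type, not only ObjectId, so unexpected values are stringified silently.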

In [17]:
import json

# Find all documents whose species is "virginica"
query = {"species": "virginica"}
doc = collection.find(query)
# Iterate over the cursor; each document is already a Python dictionary
for record in doc:
    # Convert the ObjectId to a string so that json.dumps() can serialize it
    record["_id"] = str(record["_id"])
    # Serialize the dictionary as indented JSON
    json_str = json.dumps(record, indent=4)
    print(json_str)


{
    "_id": "650f7ede5f98fbdd16d25c99",
    "sepalLength": 6.3,
    "sepalWidth": 3.3,
    "petalLength": 6.0,
    "petalWidth": 2.5,
    "species": "virginica"
}
{
    "_id": "650f7ede5f98fbdd16d25c9a",
    "sepalLength": 5.8,
    "sepalWidth": 2.7,
    "petalLength": 5.1,
    "petalWidth": 1.9,
    "species": "virginica"
}
{
    "_id": "650f7ede5f98fbdd16d25c9b",
    "sepalLength": 7.1,
    "sepalWidth": 3.0,
    "petalLength": 5.9,
    "petalWidth": 2.1,
    "species": "virginica"
}
{
    "_id": "650f7ede5f98fbdd16d25c9c",
    "sepalLength": 6.3,
    "sepalWidth": 2.9,
    "petalLength": 5.6,
    "petalWidth": 1.8,
    "species": "virginica"
}
{
    "_id": "650f7ede5f98fbdd16d25c9d",
    "sepalLength": 6.5,
    "sepalWidth": 3.0,
    "petalLength": 5.8,
    "petalWidth": 2.2,
    "species": "virginica"
}
{
    "_id": "650f7ede5f98fbdd16d25c9e",
    "sepalLength": 7.6,
    "sepalWidth": 3.0,
    "petalLength": 6.6,
    "petalWidth": 2.1,
    "species": "virginica"
}
{
    "_id": "65

## Aggregation Query

The code below runs an aggregation pipeline on the collection to calculate the average of each measurement, grouped by the "species" field. The pipeline has three stages:

1. The $match stage keeps only documents whose "species" field is not None, which filters out documents where the field is null or missing.

2. The $group stage groups the remaining documents by "species" and uses the $avg operator to average the four fields sepalLength, sepalWidth, petalLength, and petalWidth. The result is one document per species containing the four averages.

3. The $sort stage sorts the grouped documents by "_id" in ascending order, which here is the alphabetical order of the species names.

aggregate() returns a cursor, so I convert it to a list with list(averages) to make the results easy to reuse or display.

In [12]:
averages = collection.aggregate([
    # Keep only documents that have a species value
    {"$match": {"species": {"$ne": None}}},
    # Group by species and average each measurement
    {
        "$group": {
            "_id": "$species",
            "Avg sepal length": {"$avg": "$sepalLength"},
            "Avg sepal width": {"$avg": "$sepalWidth"},
            "Avg petal length": {"$avg": "$petalLength"},
            "Avg petal width": {"$avg": "$petalWidth"},
        }
    },
    # Sort alphabetically by species name
    {"$sort": {"_id": 1}},
])

result=list(averages)

In [13]:
for record in result:
    print(json.dumps(record, indent=5))

{
     "_id": "setosa",
     "Avg sepal length": 5.006,
     "Avg sepal width": 3.428,
     "Avg petal length": 1.462,
     "Avg petal width": 0.24600000000000002
}
{
     "_id": "versicolor",
     "Avg sepal length": 5.936,
     "Avg sepal width": 2.77,
     "Avg petal length": 4.26,
     "Avg petal width": 1.3259999999999998
}
{
     "_id": "virginica",
     "Avg sepal length": 6.587999999999999,
     "Avg sepal width": 2.9739999999999998,
     "Avg petal length": 5.5520000000000005,
     "Avg petal width": 2.026
}
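To cross-check the pipeline, the same grouping and averaging can be reproduced in plain Python on the loaded list. This is a sketch for comparison only, not how MongoDB computes the result; `average_by_species` and `FIELDS` are hypothetical names, and the optional rounding is an addition that hides floating-point noise such as 0.24600000000000002.

```python
from collections import defaultdict

# Measurement fields present in each record
FIELDS = ("sepalLength", "sepalWidth", "petalLength", "petalWidth")


def average_by_species(records, ndigits=3):
    """Group records by species and average each measurement."""
    grouped = defaultdict(list)
    for record in records:
        # Mirrors the $match stage: skip records without a species
        if record.get("species") is not None:
            grouped[record["species"]].append(record)

    averages = {}
    # Mirrors the $sort stage: species names in alphabetical order
    for species in sorted(grouped):
        rows = grouped[species]
        # Mirrors the $group stage: one average per field
        averages[species] = {
            field: round(sum(row[field] for row in rows) / len(rows), ndigits)
            for field in FIELDS
        }
    return averages
```

Calling `average_by_species(iris)` on the full dataset should give values that agree with the aggregation output to three decimal places.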


## Saving the Results to a JSON File

In the code below, I write the result of the aggregation to a JSON file named "iris_agg.json". Step by step:

1. write_data = open("C:/Users/saivi/bd-f23/bd-f23/W05/iris_agg.json", "w"): opens "iris_agg.json" in write mode and assigns the file object to the variable write_data. The file is created if it does not exist and overwritten if it does.

2. write_data.write(json.dumps(list(result), indent=5)): json.dumps() serializes result (the aggregated data) into a JSON-formatted string, with indent=5 adding indentation for readability, and write() writes that string to the open file. Since result is already a list, the list() call simply makes a copy.

3. write_data.close(): closes the file so that all data is flushed to disk and the file handle is released.

After these steps, "iris_agg.json" contains the aggregated data as structured JSON.

In [15]:
write_data=open("C:/Users/saivi/bd-f23/bd-f23/W05/iris_agg.json", "w")
write_data.write(json.dumps(list(result), indent=5))
write_data.close()
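The open, write, and close steps can also be expressed with a `with` block, which closes the file even if an error occurs during writing. The sketch below shows that variant; `save_json` is a hypothetical helper, and it uses `json.dump()` to write straight to the file instead of building the string first.

```python
import json


def save_json(data, path, indent=4):
    """Write data to path as indented JSON, closing the file automatically."""
    with open(path, "w", encoding="utf-8") as out_file:
        json.dump(data, out_file, indent=indent)
```

For example, `save_json(result, "iris_agg.json")` would write the aggregation result to a file in the current working directory.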