# Scratchpad

Over here I will start playing around with stuff and see what sticks/works!

Here are some of the steps I took to get to the point where I was able to access the database.
1. Downloaded the database `dump.zip` file into the `data` folder
2. Installed mongodb via brew (see instructions [here](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-os-x/?msockid=302ca59c4ee16b1d2d8fb14e4fe96a19))
3. Started the MongoDB server: `brew services start mongodb-community@8.0`
4. Checked that MongoDB was running: `brew services list`
5. Listed all the databases: entered the Mongo shell with `mongosh`, then ran `show dbs`
6. Navigated to the `data` folder (where the extracted `dump` folder lives) and opened a terminal there
7. Restored the dump: `mongorestore --uri="mongodb://localhost:27017" --gzip`
8. Back in `mongosh`, switched to the restored database: `use gfibot`
9. Listed its collections: `show collections`
10. Moved over to this Jupyter notebook. Note that I had to install the `pymongo` library beforehand.
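
The `show dbs` / `show collections` checks from the steps above can also be done from Python. Here is a minimal sketch of that, assuming the restored database is called `gfibot` as in step 8. The helper name `list_restored_collections` is made up for this notebook; the only pymongo calls it relies on are `list_database_names()` and `list_collection_names()`.

```python
def list_restored_collections(client, db_name="gfibot"):
    """Mirror `show dbs` and `show collections` from Python.

    `client` is expected to be a pymongo.MongoClient (anything exposing the
    same two methods used below works too).
    """
    db_names = client.list_database_names()  # like `show dbs`
    if db_name not in db_names:
        raise LookupError(f"{db_name!r} not found; databases present: {db_names}")
    # like `use <db_name>` followed by `show collections`
    return sorted(client[db_name].list_collection_names())


# In the notebook (needs a running server and pymongo installed):
# from pymongo import MongoClient
# print(list_restored_collections(MongoClient("localhost", 27017)))
```

If the restore in step 7 did not work, this raises a `LookupError` that lists the databases that are actually there, which is a bit quicker than hopping back into `mongosh`.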


In [1]:
# imports
import polars as pl
import pandas as pd
from pymongo import MongoClient

Here we will connect to MongoDB and access the database and its collections.

In [2]:
# connect to MongoDB
client = MongoClient('localhost', 27017)

# access the database and collections
db = client['gfibot']
dataset_collection = db['dataset']
resolved_issue_collection = db['resolved_issue']

In [3]:
# fetch a sample document from the 'dataset' collection
sample_doc = dataset_collection.find_one()
print("Sample document from 'dataset' collection:", sample_doc)

# count the number of documents in each collection
print("Number of documents in 'dataset' collection:", dataset_collection.count_documents({}))
print("Number of documents in 'resolved_issue' collection:", resolved_issue_collection.count_documents({}))

Sample document from 'dataset' collection: {'_id': ObjectId('62a45eefa962f9390b35f92c'), 'owner': 'OpenMined', 'name': 'PySyft', 'number': 19, 'created_at': datetime.datetime(2017, 8, 9, 20, 11, 35), 'closed_at': datetime.datetime(2017, 8, 11, 1, 22, 53), 'before': datetime.datetime(2017, 8, 9, 20, 11, 35), 'resolver_commit_num': 1, 'title': 'Implement Base Tensor Object', 'body': 'In this ticket, we want to create a new basic type called a "Tensor". A Tensor is a nested list of numbers with an arbitrary number of dimensions. With one dimension, it\'s a list of numbers known as a "vector". With two dimensions, it\'s a list of lists of numbers known as a "matrix". In this ticket, we want to build our implementation of Tensor with inspiration from PyTorch\'s  [Tensor]() object ([Tensor Docs]()). \r\n\r\nSo, in this ticket, you should build a basic tensor class. You should be able to pass in as an argument an arbitrary shape (number of rows, columns, etc.) when you create it. Furthermore,
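
Printing a whole document like this gets unreadable fast because of the long `body` field. A small helper that shortens long values before printing might help; this is only a sketch, and both the name `preview_document` and the 60-character cutoff are arbitrary choices of mine, not anything from the dataset or from pymongo.

```python
def preview_document(doc, max_len=60):
    """Return a shortened copy of `doc` that is easier to read when printed."""
    preview = {}
    for key, value in doc.items():
        if isinstance(value, str) and len(value) > max_len:
            # keep only the start of long strings such as `body`
            preview[key] = value[:max_len] + "..."
        elif isinstance(value, list):
            preview[key] = f"<list of {len(value)} items>"
        elif isinstance(value, dict):
            preview[key] = f"<dict with {len(value)} keys>"
        else:
            # numbers, datetimes, ObjectIds, short strings: leave untouched
            preview[key] = value
    return preview


# In the notebook:
# for key, value in preview_document(sample_doc).items():
#     print(f"{key}: {value}")
```

The original document is not modified, so `sample_doc` can still be inspected in full afterwards.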

After that newbie attempt at exploring the dataset, let's load it into a DataFrame. Let's try polars!

In [9]:
# fetch documents (limit for large datasets)
dataset_docs = list(dataset_collection.find().limit(1000))

# convert to pandas df
pandas_df = pd.DataFrame(dataset_docs)

# convert to polars df
polars_df = pl.DataFrame(dataset_docs)

# print the first few rows
print(polars_df.head(5))
# print(pandas_df.head(5))

shape: (5, 36)
┌────────────┬────────────┬───────────┬────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ _id        ┆ owner      ┆ name      ┆ number ┆ … ┆ comments  ┆ events    ┆ comment_u ┆ event_use │
│ ---        ┆ ---        ┆ ---       ┆ ---    ┆   ┆ ---       ┆ ---       ┆ sers      ┆ rs        │
│ object     ┆ str        ┆ str       ┆ i64    ┆   ┆ list[str] ┆ list[str] ┆ ---       ┆ ---       │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ list[stru ┆ list[stru │
│            ┆            ┆           ┆        ┆   ┆           ┆           ┆ ct[14]]   ┆ ct[14]]   │
╞════════════╪════════════╪═══════════╪════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 62a45eefa9 ┆ OpenMined  ┆ PySyft    ┆ 19     ┆ … ┆ []        ┆ []        ┆ []        ┆ []        │
│ 62f9390b35 ┆            ┆           ┆        ┆   ┆           ┆           ┆           ┆           │
│ f92c       ┆            ┆           ┆        ┆   ┆           ┆           ┆
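
One thing that stands out in the table above: `_id` comes through with the generic `object` dtype, because polars has no native type for MongoDB's `ObjectId`. A possible workaround is to turn the ids into plain strings before building the DataFrame. This is a minimal sketch under that assumption; `stringify_ids` is a hypothetical helper name, and it relies only on the fact that `str()` of an `ObjectId` gives its 24-character hex form.

```python
def stringify_ids(docs):
    """Return copies of `docs` with `_id` converted to a plain string.

    The input documents are left untouched; each output dict is a shallow copy.
    """
    return [{**doc, "_id": str(doc["_id"])} for doc in docs]


# In the notebook (assumes `dataset_docs` and `pl` from the cells above):
# polars_df = pl.DataFrame(stringify_ids(dataset_docs))
# print(polars_df.schema["_id"])
```

I have not checked whether anything later in the notebook needs the real `ObjectId` values; if it does, the original `dataset_docs` list still holds them.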

Let's do some exploration...

In [13]:
# print the columns and the data types
print("number of columns in df:", len(polars_df.columns))

for col, dtype in zip(polars_df.columns, polars_df.dtypes):
    print(f"{col}: {dtype}")


number of columns in df: 36
_id: Object
owner: String
name: String
number: Int64
created_at: Datetime(time_unit='us', time_zone=None)
closed_at: Datetime(time_unit='us', time_zone=None)
before: Datetime(time_unit='us', time_zone=None)
resolver_commit_num: Int64
title: String
body: String
len_title: Int64
len_body: Int64
n_code_snips: Int64
n_urls: Int64
n_imgs: Int64
coleman_liau_index: Float64
flesch_reading_ease: Float64
flesch_kincaid_grade: Float64
automated_readability_index: Float64
labels: List(String)
label_category: Struct({'bug': Int64, 'feature': Int64, 'test': Int64, 'build': Int64, 'doc': Int64, 'coding': Int64, 'enhance': Int64, 'gfi': Int64, 'medium': Int64, 'major': Int64, 'triaged': Int64, 'untriaged': Int64})
reporter_feat: Struct({'name': String, 'n_commits': Int64, 'n_issues': Int64, 'n_pulls': Int64, 'resolver_commits': List(Int64), 'n_repos': Int64, 'n_commits_all': Int64, 'n_issues_all': Int64, 'n_pulls_all': Int64, 'n_reviews_all': Int64, 'max_stars_commit': Int