# Pull Requests Data Collection

In this notebook, we collect the raw pull request (PR) data from a GitHub repository of your choice and save it as a JSON file in an S3 bucket. To do this, we use the `srcopsmetrics` tool developed by the Thoth team.

In [1]:
import os

from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(), override=True)

True

## Set Environment Variables

Before you start, create a `.env` file at the root of the repository with the required credentials and paths. Refer to the workshop section **"Set your Environment Variables"** for details on how to set it up, and see this example [file](env_example_workshop) for the structure.
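It can help to confirm that the variables were actually picked up before running the collection. The snippet below is a minimal sketch of such a check. The helper `missing_env_vars` and the `REQUIRED_VARS` tuple are hypothetical names introduced here for illustration, and the list covers only the three variables this notebook reads; your `.env` file may define more.

```python
import os

# Variable names taken from this notebook; extend the tuple if your .env defines more.
REQUIRED_VARS = ("GITHUB_ORG", "GITHUB_REPO", "CEPH_BUCKET_PREFIX")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing or empty environment variables: {', '.join(missing)}")
```

If the check prints any names, revisit the `.env` file before continuing.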

In [2]:
# get the org/repo from env vars
ORG = os.getenv("GITHUB_ORG")
REPO = os.getenv("GITHUB_REPO")
CEPH_BUCKET_PREFIX = os.getenv("CEPH_BUCKET_PREFIX")

print(f"{ORG}/{REPO}")
print(
    f"Downloaded file is being stored at {CEPH_BUCKET_PREFIX}/srcopsmetrics/bot_knowledge/{ORG}/{REPO}/PullRequest.json"
)

/
Downloaded file is being stored at oct12/srcopsmetrics/bot_knowledge///PullRequest.json
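The output above shows what happens when `GITHUB_ORG` and `GITHUB_REPO` are empty: the path collapses to `bot_knowledge///PullRequest.json`. As a rough sketch, the path can be built in one place and rejected early when the organization or repository is missing. The function `knowledge_key` is a hypothetical helper, not part of `srcopsmetrics`; it only mirrors the path layout printed in the cell above.

```python
def knowledge_key(prefix, org, repo, entity="PullRequest"):
    """Build the storage path printed above (hypothetical helper).

    Raises ValueError if org or repo is empty, which would otherwise
    produce a path such as 'bot_knowledge///PullRequest.json'.
    """
    if not org or not repo:
        raise ValueError("Both org and repo must be set (check GITHUB_ORG and GITHUB_REPO).")
    return f"{prefix}/srcopsmetrics/bot_knowledge/{org}/{repo}/{entity}.json"


# Example with placeholder values; substitute your own org and repo.
print(knowledge_key("oct12", "example-org", "example-repo"))
```

Failing fast here avoids launching the collection with an invalid `org/repo` pair.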


In [3]:
# run collection on the org/repo specified
!python -m srcopsmetrics.cli --create-knowledge --repository $ORG/$REPO --entities PullRequest

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/srcopsmetrics/cli.py", line 217, in <module>
    cli(auto_envvar_prefix="MI")
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.9.14/x64/lib/python3.9/site-packages

# Conclusion

By running this notebook, we have collected the GitHub PR data and stored it in our S3 bucket. It is now ready for the cleaning and feature engineering steps of the ML workflow.

# Next Step

In the next [notebook](./02_feature_engineering.ipynb), we will engineer features from the raw PR data that can be used to train an ML model.