Here, we'll have a look at [GitHub Public Repository Metadata](https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars?resource=download) ([Dropbox link to parquet file](https://www.dropbox.com/s/kokiypcm2ylx4an/github-repos.parquet?dl=0)).

The `GithubReposData` class is your entry point to the GitHub repo data.
You can parametrize instances of `GithubReposData`, but if you don't, it'll work anyway with its own defaults.
You can specify the parameters when making a `GithubReposData` instance, or set environment variables. If you do neither, it'll ask you (via user input) every time you make an instance whether you want to set these parameters.
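
The resolution order described above (explicit argument first, then environment variable, then a built-in default) can be sketched as follows. This only illustrates the pattern and is not the `GithubReposData` implementation: the helper `resolve_param`, the environment variable name `GITHUB_REPOS_CACHE_DIR`, and the default path are all hypothetical, and the interactive prompt is left out.

```python
import os


def resolve_param(explicit_value, env_var, default):
    """Return the explicit value if given, else the environment variable, else the default."""
    if explicit_value is not None:
        return explicit_value
    return os.environ.get(env_var, default)


# Hypothetical names, for illustration only
ENV_VAR = "GITHUB_REPOS_CACHE_DIR"
DEFAULT = "~/.cache/github_repos"

os.environ.pop(ENV_VAR, None)  # make sure the variable is unset
from_default = resolve_param(None, ENV_VAR, DEFAULT)

os.environ[ENV_VAR] = "/tmp/from_env"
from_env = resolve_param(None, ENV_VAR, DEFAULT)

# An explicit argument wins over the environment variable
from_arg = resolve_param("/tmp/from_arg", ENV_VAR, DEFAULT)
```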


In [4]:
from imbed_data_prep.github_repos import *

d = GithubReposData()

In [5]:
planar_vects = d.planar_embeddings  # 14m15s to compute from scratch

In [6]:
data_for_cosmo = d.data_with_planar_embeddings()

Making dataframe from parquet bytes
Loading bytes of raw data from /Users/thorwhalen/Dropbox/_odata/figiri/github-repos.parquet
... Done
Dropping duplicate nameWithOwner (github stub)...
... Dropped 209524 duplicates
... Done


In [6]:
# d.cache['github_repo_for_cosmos.parquet'] = data_for_cosmo

In [19]:
# Get a numpy array of segment vectors.
# d.segment_vectors is a series whose elements are numpy arrays.
# We want a two-dimensional array with as many rows as the series has elements
# and as many columns as the length of each numpy array.

import numpy as np

X = np.vstack(d.segment_vectors.tolist())
X.shape

(3065063, 256)
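
To see what the stacking step does on a small scale, here is a minimal sketch with a toy series standing in for `d.segment_vectors` (the name `toy_vectors` and its values are made up for illustration). Note that `np.vstack` needs all the arrays to have the same length.

```python
import numpy as np
import pandas as pd

# Toy stand-in for d.segment_vectors: a series whose elements are equal-length arrays
toy_vectors = pd.Series([np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])])

# One row per series element, one column per vector component
toy_X = np.vstack(toy_vectors.tolist())
```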

In [15]:
from sklearn.cluster import KMeans
d.embeddings_matrix.shape

(3065063, 256)

In [9]:
t = d.cluster_indices

Computing cluster indices for num of clusters: (5, 8, 13, 21, 34)
... Done
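
How `cluster_indices` is computed is not shown here. A minimal sketch, assuming it fits one k-means model per requested number of clusters (the `KMeans` import above hints at this but does not confirm it). The helper name `cluster_indices_sketch`, the column naming, and the random toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_indices_sketch(X, n_clusters=(5, 8, 13, 21, 34), random_state=0):
    """Return a dataframe with one column of cluster labels per requested cluster count."""
    columns = {}
    for k in n_clusters:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        columns[f"cluster_{k:02d}"] = model.fit_predict(X)
    return pd.DataFrame(columns)


# Random toy embeddings, far smaller than the real (3065063, 256) matrix
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(200, 16))
toy_clusters = cluster_indices_sketch(toy_embeddings)
```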


# Data prep misc

In [None]:
from imbed_data_prep.github_repos import *

d = GithubReposData()

## Can we take the GitHub stub (nameWithOwner) as an id?

In [18]:
assert len(d.raw_data.nameWithOwner.dropna()) == len(d.raw_data), (
    "We assumed that there's no missing data in the 'nameWithOwner' column"
)

Making dataframe from parquet bytes
Loading bytes of raw data from /Users/thorwhalen/Dropbox/_odata/figiri/github-repos.parquet


In [24]:
assert d.raw_data.nameWithOwner.nunique() == len(d.raw_data), (
    "We assumed that the 'nameWithOwner' were unique"
)

AssertionError: We assumed that the 'nameWithOwner' were unique

### Let's look at the duplicates

In [25]:
from collections import Counter

Counter(d.raw_data.nameWithOwner).most_common(10)

[('plapadoo/ledger-jira-sync', 4),
 ('wasp-platform/wasp', 4),
 ('dimdenGD/OldTweetDeck', 3),
 ('TravisWThompson1/Makefile_Example_CUDA_CPP_To_Executable', 3),
 ('Roave/EmailTemplates', 3),
 ('grigio/meteor-weuse', 3),
 ('laewahn/CutCAD', 3),
 ('cockroachdb/cdc-sink', 3),
 ('AP-Atul/Torpido', 3),
 ('althonos/pyhmmer', 3)]

In [26]:
d.raw_data[d.raw_data.nameWithOwner == "plapadoo/ledger-jira-sync"]

Unnamed: 0,owner,name,stars,forks,watchers,isFork,isArchived,languages,languageCount,topics,...,createdAt,pushedAt,defaultBranchCommitCount,license,assignableUserCount,codeOfConduct,forkingAllowed,nameWithOwner,parent,embedding
2117813,plapadoo,ledger-jira-sync,6,0,6,False,False,"Python: 8602, Nix: 1530",2,"ledger: 15, python: 277441, accounting: 67, bo...",...,2018-05-19T08:00:12Z,2019-03-05T17:55:25Z,9.0,"BSD 3-Clause ""New"" or ""Revised"" License",3,,True,plapadoo/ledger-jira-sync,,"[-0.04487917, 0.025739525, 0.106529795, -0.135..."
2117814,plapadoo,ledger-jira-sync,6,0,6,False,False,"Python: 8602, Nix: 1530",2,"ledger: 15, python: 277441, accounting: 67, bo...",...,2018-05-19T08:00:12Z,2019-03-05T17:55:25Z,9.0,"BSD 3-Clause ""New"" or ""Revised"" License",3,,True,plapadoo/ledger-jira-sync,,"[-0.04487917, 0.025739525, 0.106529795, -0.135..."
2117815,plapadoo,ledger-jira-sync,6,0,6,False,False,"Python: 8602, Nix: 1530",2,"ledger: 15, python: 277441, accounting: 67, bo...",...,2018-05-19T08:00:12Z,2019-03-05T17:55:25Z,9.0,"BSD 3-Clause ""New"" or ""Revised"" License",3,,True,plapadoo/ledger-jira-sync,,"[-0.04487917, 0.025739525, 0.106529795, -0.135..."
2117816,plapadoo,ledger-jira-sync,5,0,6,False,False,"Python: 8602, Nix: 1530",2,"ledger: 15, python: 277500, accounting: 67, bo...",...,2018-05-19T08:00:12Z,2019-03-05T17:55:25Z,9.0,"BSD 3-Clause ""New"" or ""Revised"" License",3,,True,plapadoo/ledger-jira-sync,,"[-0.04487917, 0.025739525, 0.106529795, -0.135..."


It looks like the metadata is the same (or nearly so: the last row differs only in its star and topic counts), so the non-unique entries are probably duplicates. Let's drop any rows that are exact duplicates (ignoring the embeddings, which are numpy arrays, so not hashable, and could have slight variations).

In [35]:
# drop duplicates, ignoring the embedding column (which is not hashable)
t = d.raw_data.drop_duplicates(subset=d.raw_data.columns.difference(["embedding"]))
assert d.raw_data.shape[1] == t.shape[1], "Assumed no columns were dropped"
print(f"{len(d.raw_data) - len(t)} rows were dropped")
d.raw_data.shape, t.shape

207147 rows were dropped


((3274587, 26), (3067440, 26))

In [36]:
# Still some duplicate nameWithOwner, with different stars etc.
# So we'll just drop duplicate nameWithOwner
assert t.nameWithOwner.nunique() == len(t), (
    "We assumed that the 'nameWithOwner' were unique"
)

AssertionError: We assumed that the 'nameWithOwner' were unique

In [39]:
Counter(t.nameWithOwner).most_common(10)

[('dimdenGD/OldTweetDeck', 3),
 ('weaigc/gradio-chatbot', 3),
 ('MoyuruAizawa/Cropify', 3),
 ('Lakr233/BBackupp', 3),
 ('daboigbae/react-native-template', 3),
 ('esp-rs/esp-mbedtls', 3),
 ('Nahid1911/space-travelers-proj-react', 3),
 ('Rel1cx/eslint-plugin-react-ts', 3),
 ('redknotmiaoyuqiao/EyerH264Decoder', 2),
 ('ShuifaHe/STM32', 2)]

In [40]:
d.raw_data[d.raw_data.nameWithOwner == 'dimdenGD/OldTweetDeck']

Unnamed: 0,owner,name,stars,forks,watchers,isFork,isArchived,languages,languageCount,topics,...,createdAt,pushedAt,defaultBranchCommitCount,license,assignableUserCount,codeOfConduct,forkingAllowed,nameWithOwner,parent,embedding
8747,dimdenGD,OldTweetDeck,129,5,10,False,False,JavaScript: 5093,1,,...,2023-08-17T00:31:05Z,2023-08-18T19:55:14Z,14.0,MIT License,1,,True,dimdenGD/OldTweetDeck,,"[-0.093423545, -0.07952896, -0.038683187, -0.0..."
8748,dimdenGD,OldTweetDeck,97,4,8,False,False,JavaScript: 5093,1,,...,2023-08-17T00:31:05Z,2023-08-18T16:44:39Z,13.0,MIT License,1,,True,dimdenGD/OldTweetDeck,,"[-0.093423545, -0.07952896, -0.038683187, -0.0..."
8749,dimdenGD,OldTweetDeck,128,5,10,False,False,JavaScript: 5093,1,,...,2023-08-17T00:31:05Z,2023-08-18T19:55:14Z,14.0,MIT License,1,,True,dimdenGD/OldTweetDeck,,"[-0.093423545, -0.07952896, -0.038683187, -0.0..."


## Different ways to get an array out of a dataframe

In [3]:
import pandas as pd
import numpy as np
t = pd.DataFrame(np.random.randint(0, 10000, (10000, 2)), columns=['a', 'b'])

In [12]:
%timeit tt = t.a.to_numpy()  # 3.38 µs ± 10.1 ns

3.38 µs ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [13]:
%timeit tt = np.array(t.a.to_list())  # 351 µs ± 2.92 µs

351 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
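
`%timeit` is an IPython magic, so it only works in a notebook or IPython shell. Outside of one, the same comparison could be sketched with the standard library's `timeit` module. The sketch below also checks that both approaches give the same values; the number of repetitions is arbitrary, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

t = pd.DataFrame(np.random.randint(0, 10000, (10000, 2)), columns=["a", "b"])

# Both approaches produce the same values
direct = t.a.to_numpy()
via_list = np.array(t.a.to_list())
same_values = np.array_equal(direct, via_list)

# Total seconds for 100 calls of each approach
seconds_direct = timeit.timeit(lambda: t.a.to_numpy(), number=100)
seconds_via_list = timeit.timeit(lambda: np.array(t.a.to_list()), number=100)
```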


# Scrap: GitHub Repos Embeddings

In [1]:
from imbed_data_prep.github_repos import _DFLT_RAW_DATA_FILEPATH, _DFLT_CACHE_DIR
from imbed_data_prep.github_repos import *

In [1]:
log_flag_attr = 'verbose'
args = ()
getattr(next(iter(args), None), log_flag_attr, False)

False

In [4]:
from lkj import log_calls

class MyClass:
    def __init__(self, verbose=False):
        self.verbose = verbose

    @log_calls(log_flag_attr='verbose')
    def foo(self):
        print("Executing foo")

# Example usage
obj = MyClass(verbose=True)
obj.foo()  # This will log


Calling foo...
Executing foo
.... Done calling foo
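
The `getattr(next(iter(args), None), log_flag_attr, False)` expression tried earlier reads a flag attribute from the first positional argument (the instance, for a method call) and falls back to `False` when there are no arguments or no such attribute. A minimal sketch of how a decorator could use it is below. This is not `lkj.log_calls` itself: `log_calls_sketch`, `Quiet`, and `Loud` are hypothetical names, and the message format simply mimics the output above.

```python
from functools import wraps


def log_calls_sketch(log_flag_attr="verbose"):
    """Decorator factory: log a method call only if the instance's flag attribute is truthy."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # For a method call, the first positional argument is the instance (self)
            should_log = getattr(next(iter(args), None), log_flag_attr, False)
            if should_log:
                print(f"Calling {func.__name__}...")
            result = func(*args, **kwargs)
            if should_log:
                print(f".... Done calling {func.__name__}")
            return result

        return wrapper

    return decorator


class Quiet:
    verbose = False

    @log_calls_sketch(log_flag_attr="verbose")
    def foo(self):
        return "foo result"


class Loud(Quiet):
    verbose = True


Quiet().foo()  # flag is False, so nothing is logged
Loud().foo()  # flag is True, so the call is logged
```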
