# Extracting and processing the dataset

We have now downloaded the correctly included dependency lists of all packages into the folder dependencies_complete_dataset. The next step is to load these dependencies into a list of transactions.

First we import libraries needed for reading the files.

In [45]:
from os import listdir
from os.path import isfile, join

We declare the path to the folder where the downloaded dependencies are stored.

In [46]:
folder_path = '../dependencies_extracted_from_specs/'

Then we make a list of all the files present in the dependencies_extracted_from_specs folder.

In [47]:
files = [f for f in listdir(folder_path) if isfile(join(folder_path, f))]

The entries list is declared. This list will contain all of the transactions (itemsets).

In [48]:
entries = []

Now we want to process every file as a transaction of items and append it to the entries list. One transaction can be thought of as one entry.

However, there is one piece of information we purposely omit. A module in a dependency list can be written in the following form:

<dependency_name> <version>

The <version> part is often written with a comparison symbol such as <=, >= or =. For example:

foo_module <= 2.5.0

To keep the trained model simple, we purposely drop this version information from the dataset by taking .split()[0] of each line, which results in fewer unique dependencies. If you want to keep this information, set the simplicity variable to False.
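
As a small illustration of the name-only simplification described above, the sketch below applies .split()[0] to a few made-up dependency lines. The sample lines and the helper name strip_version are assumptions for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines in the "<dependency_name> <version>" form.
sample_lines = ["foo_module <= 2.5.0", "bar_module", "baz_module = 1.0"]


def strip_version(dependency):
    """Return only the dependency name (the first whitespace-separated token)."""
    return dependency.split()[0]


names_only = [strip_version(line) for line in sample_lines]
print(names_only)  # ['foo_module', 'bar_module', 'baz_module']
```

Note that a blank line would make .split() return an empty list, so indexing with [0] would raise an IndexError. The sketch assumes every line is non-empty.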

In [49]:
simplicity = True

In [50]:
for file_path in files:
    # Each file is read line by line and becomes one transaction.
    with open(join(folder_path, file_path)) as file:
        cur_entry = [line.rstrip('\n') for line in file]
    if simplicity:
        # Keep only the dependency name and drop the version constraint.
        cur_entry = [dependency.split()[0] for dependency in cur_entry]
    entries.append(cur_entry)

Now that the entries list is filled with transactions, we can look at some basic information about it.

In [51]:
print('number of transactions: ', len(entries))

number of transactions:  21210


In [52]:
print('number of unique dependencies used: ', len(set(x for y in entries for x in y)))

number of unique dependencies used:  17926


In [53]:
print('the length of all transactions when summed together: ', sum(len(y) for y in entries))

the length of all transactions when summed together:  175264
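
From the figures above, a transaction holds roughly 175264 / 21210 ≈ 8.3 dependencies on average. The sketch below shows one possible way to gather the same three numbers, plus the average and the most frequent item, in a single pass with collections.Counter. The toy_entries list is a made-up stand-in for the real entries list.

```python
from collections import Counter

# Hypothetical toy transactions standing in for the real `entries` list.
toy_entries = [["a", "b"], ["b", "c", "d"], ["b"]]

# Count how many times each item occurs across all transactions.
item_counts = Counter(item for transaction in toy_entries for item in transaction)

total_items = sum(item_counts.values())          # summed length of all transactions
average_length = total_items / len(toy_entries)  # mean items per transaction

print(len(toy_entries))            # 3
print(len(item_counts))            # 4
print(total_items)                 # 6
print(average_length)              # 2.0
print(item_counts.most_common(1))  # [('b', 3)]
```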


# Saving the extracted dataset

Now we want to save this preprocessed list of transactions to a file, so that the processing does not have to be repeated every time we use the dataset.

In [54]:
import csv

In [55]:
with open('extracted_dependencies_list_of_lists.csv', 'w', newline='') as dataset_file:
    wr = csv.writer(dataset_file)
    wr.writerows(entries)
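
To use the saved dataset later, the file can be read back into a list of lists. The following is a minimal sketch of such a round trip, assuming a small made-up toy_entries list and a temporary file so that the real dataset file is not touched. Rows may have different lengths, and csv.reader returns each row as a list of strings.

```python
import csv
import os
import tempfile

# Hypothetical toy transactions of different lengths.
toy_entries = [["foo_module", "bar_module"], ["baz_module"]]

# Write to a temporary file so the sketch does not overwrite the real dataset.
csv_path = os.path.join(tempfile.mkdtemp(), "toy_transactions.csv")
with open(csv_path, "w", newline="") as dataset_file:
    csv.writer(dataset_file).writerows(toy_entries)

# csv.reader yields one list of strings per row in the file.
with open(csv_path, newline="") as dataset_file:
    loaded_entries = [row for row in csv.reader(dataset_file)]

print(loaded_entries == toy_entries)  # True
```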

This completes the preparation of the dataset. We can now look at the different approaches to recommendation systems.