# 1. Data fetching and cleaning
In this notebook we are going to fetch the dataset for the Git track of the Hércules Challenge and perform an initial exploration of it. This dataset consists of 50 GitHub repositories.

## Setup
As always, we will begin the notebook by starting the logging system and importing some constants defined in the "\_\_init\_\_.py" file:

In [1]:
%run __init__.py

INFO:root:Starting logger


We will also define an auxiliary function to print the empty columns of the dataframes:

In [2]:
def print_empty_cols(df):
    for col in df.columns:
        print(col)
        print('-' * len(col))
        res = df[df[col] == ''].index
        print(f"{len(res)} repositories have no value for column {col}")
        print(res)
        print('\n')


Finally, we will import the bokeh library to show the charts in the notebook, and the BokehHistogram class from the _herc\_common_ library to plot our results:

In [3]:
from bokeh.io import output_notebook

output_notebook()



In [4]:
from herc_common import BokehHistogram

hist = BokehHistogram(color_fill="mediumslateblue", color_hover="slateblue", bins=25)

## Getting the repository URLs


The URL of every repository in the Git dataset is stored in the '_data/repo\_urls.txt_' file. First of all, we will load all the URLs from that file into a list:

In [5]:
import os  # harmless if __init__.py already imported it

REPO_URLS_FILE = 'repo_urls.txt'

with open(os.path.join(DATA_DIR, REPO_URLS_FILE), 'r') as f:
    repo_urls = [line.rstrip('\n') for line in f]

len(repo_urls)

50

In [6]:
repo_urls[0]

'https://github.com/cmungall/LIRICAL/'

## Parsing the data

Now that the URLs of every repository have been loaded, we can start calling the [GitHub API](https://developer.github.com/v3/) to obtain information about each repo. Since the API limits unauthenticated clients to 60 requests per hour, and we will be making about 200 requests to fetch all the information, we need a personal access token to call the API. More information about what a personal access token is, and how to create one, can be found at the [following link](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token). In the next cell we will ask for the token:

In [7]:
import getpass


try:
    from secret import GITHUB_TOKEN
except ModuleNotFoundError:
    GITHUB_TOKEN = getpass.getpass("Enter your personal access token to access the GitHub API: ")
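
To confirm that the token is being picked up, and to see how much of the hourly quota remains, the API's `/rate_limit` endpoint can be queried. The following is a minimal sketch that uses only the standard library; the helper names `build_auth_headers` and `fetch_rate_limit` are hypothetical and are not part of the _src_ package:

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_auth_headers(token=None):
    """Build request headers, adding the token only when one is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


def fetch_rate_limit(token=None):
    """Return the 'core' rate-limit info (limit, remaining, reset) as a dict."""
    request = urllib.request.Request(RATE_LIMIT_URL,
                                     headers=build_auth_headers(token))
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["resources"]["core"]


# Example (needs network access, so it is left commented out):
# core = fetch_rate_limit(GITHUB_TOKEN)
# print(f"{core['remaining']} of {core['limit']} requests left this hour")
```

Calling `fetch_rate_limit()` without a token should report the much lower unauthenticated quota, which makes it easy to tell whether the token was accepted.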


We will now import a series of classes and functions that will be used to fetch information about a given repo and convert it to an instance of the _GitHubRepoData_ class. More information about these functions and classes can be found in the _src_ package:

In [8]:
from src import GitHubIssue, GitHubRepoData, parse_repo_url

Finally, we will create a list of GitHubRepoData instances with information about every repository in the dataset:

In [None]:
from tqdm import tqdm

git_dataset = []
pbar = tqdm(repo_urls)
for url in pbar:
    pbar.set_description(f"Processing repository: {url}")
    git_dataset.append(parse_repo_url(url, GITHUB_TOKEN))


Processing repository: https://github.com/GullyAPCBurns/ProvToolbox:  88%|████████▊ | 44/50 [01:51<00:13,  2.27s/it]                      
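
The internals of `parse_repo_url` live in the _src_ package and are not shown here. As an illustration of the first step such a function needs, the sketch below extracts the owner and repository name from a URL and builds the corresponding REST endpoint. The helpers `split_owner_and_repo` and `api_endpoint` are hypothetical names, and the assumption that the real implementation works this way is ours:

```python
from urllib.parse import urlparse


def split_owner_and_repo(repo_url):
    """Extract (owner, repo) from a URL such as https://github.com/owner/repo/."""
    # Drop empty segments produced by leading or trailing slashes
    path_parts = [part for part in urlparse(repo_url).path.split("/") if part]
    if len(path_parts) < 2:
        raise ValueError(f"Not a repository URL: {repo_url}")
    return path_parts[0], path_parts[1]


def api_endpoint(repo_url):
    """Build the GitHub REST API endpoint for the repository metadata."""
    owner, repo = split_owner_and_repo(repo_url)
    return f"https://api.github.com/repos/{owner}/{repo}"


print(api_endpoint("https://github.com/cmungall/LIRICAL/"))
# prints: https://api.github.com/repos/cmungall/LIRICAL
```

Appending `/readme`, `/languages` or `/issues` to that endpoint gives the other per-repository resources of the API, which would be consistent with roughly four requests per repository.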

## Creating a dataframe

The instances created before provide a _to\_dict_ method that converts each instance to a Python dict. A list of these dicts can be used to easily create a pandas DataFrame. This DataFrame will be used from now on to explore and interact with the dataset:

In [None]:
import pandas as pd

df = pd.DataFrame([repo.to_dict() for repo in git_dataset])
df.head()

## Data cleaning and feature engineering

First of all, we will take an initial look at the values in the dataset:

In [None]:
df.loc[:, df.columns != 'gh_id'].describe()

As we can see above, although all the repository names are unique, the other columns have some repeated values. Those repeated values could be empty or null values, so we are going to check if that is the case:

In [None]:
df[df.isnull().any(axis=1)]

We can see from the output above that there are 7 repositories which do not have a description. We are going to replace those null values with an empty string:

In [None]:
df.fillna(value="", inplace=True)

Now, we are going to see how many repositories have an empty value in each column:

In [None]:
print_empty_cols(df)

Most of the repositories (45 out of 50) don't have any issues, and 3 of them don't have a readme.

Finally, we are going to join both the description and the readme of each repository into a new column, and remove all extra spaces from that column:

In [None]:
import re

def clean(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r'\s+', ' ', text).strip()

df['full_text'] = df["description"] + ". " + df["readme_text"]
df['full_text_cleaned'] = df['full_text'].apply(clean)
df['full_text_cleaned'].loc[0][:500]
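
To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same regular expression to a made-up description and readme (the sample strings are invented for illustration and do not come from the dataset):

```python
import re


def clean(text):
    # \s matches spaces, tabs and newlines; each run becomes a single space
    return re.sub(r"\s+", " ", text).strip()


description = "A sample   description"
readme_text = "# Title\n\nSome   readme\ttext.\n"

full_text = description + ". " + readme_text
print(clean(full_text))
# prints: A sample description. # Title Some readme text.
```

Note that when the description is empty the joined text starts with `. `, since the separator is added unconditionally.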

## Initial exploration

To finish with the contents of this notebook, we will make an initial exploration of the dataset.

### Text length

We are going to add a new column to the DataFrame with the length in number of characters of each repo's full text:

In [None]:
df['num_chars_text'] = df['full_text_cleaned'].str.len()
df['num_chars_text'].describe()

We can see that the average length of the readme + description is about 2471 characters, and the maximum length is 20382 characters. However, 75% of the repositories have fewer than 3027 characters.

We are going to plot this distribution and save it to disk:

In [None]:
GIT_HIST_COLUMN = "num_chars_text"
GIT_HIST_TITLE = "Readme + Description length distribution"
GIT_HIST_XLABEL = "Readme and description length (# of characters)"
GIT_HIST_YLABEL = "Number of repositories"

hist.load_plot(df, GIT_HIST_COLUMN, GIT_HIST_TITLE,
          GIT_HIST_XLABEL, GIT_HIST_YLABEL, True)

In [None]:
hist.save_plot(os.path.join(RESULTS_DIR, '1_Repo_text_length.svg'))

### Languages used

We are also going to explore the most used programming languages across the repositories.

We will begin by creating an auxiliary function that draws a horizontal bar chart with the given data:

In [None]:
from bokeh.io import show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category20b_20
from bokeh.plotting import figure

def plot_horizontal_bar_chart(x_data, y_data, title, tooltip_x, tooltip_y,
                              sort=True, color_palette=Category20b_20):
    sorted_y_data = sorted(y_data, key=lambda x: x_data[y_data.index(x)]) if sort else y_data
    # Trim the palette to the number of bars so all columns have the same length
    source = ColumnDataSource(data=dict(y_data=y_data, x_data=x_data,
                                        color=list(color_palette)[:len(y_data)]))
    p = figure(y_range=sorted_y_data, x_range=(0, max(x_data) * 1.1), plot_height=750, title=title,
               toolbar_location='right')
    p.hbar(y='y_data', right='x_data', height=0.7, color='color', legend_field="y_data",
           fill_alpha=0.75, hover_fill_alpha=1.0, source=source)
    p.ygrid.grid_line_color = None
    p.legend.orientation = "vertical"
    p.legend.location = "bottom_right"
    p.add_tools(HoverTool(tooltips=[(tooltip_y, "@y_data"), (tooltip_x, "@x_data")],
                          point_policy="follow_mouse"))

    show(p, notebook_handle=True)



Now, we can create a new DataFrame with one column per programming language used in the dataset, holding the number of bytes written in that language for each repo:

In [None]:
languages_df = pd.DataFrame(df['languages'].values.tolist()).fillna(0)
languages_df.head()
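
The cell above relies on each entry of the `languages` column being a dict that maps a language name to a byte count (an assumption based on how the column is used here). The following sketch, with made-up byte counts for three hypothetical repositories, shows how a list of such dicts becomes a table and why the missing values need to be filled with zeros:

```python
import pandas as pd

# Made-up byte counts for three hypothetical repositories
languages_per_repo = [
    {"Python": 1200, "Shell": 50},
    {"Java": 3000},
    {},  # a repository with no detected languages
]

# Each dict becomes a row and each distinct key a column; a language
# missing from a repository shows up as NaN, which fillna turns into 0
languages_df = pd.DataFrame(languages_per_repo).fillna(0)
print(languages_df)
```

Without the `fillna(0)` call, later steps that look for the largest value per repository would have to deal with NaN entries.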

By making use of the previously defined function and the new DataFrame, we can plot the 15 languages with the most bytes across all repositories:

In [None]:
from bokeh.palettes import Category20_15

NUM_LANGUAGES = 15

# Total bytes per language across all repositories, largest first
languages_sum = languages_df.sum().sort_values(ascending=False)

plot_horizontal_bar_chart(list(languages_sum)[:NUM_LANGUAGES],
                          list(languages_sum.keys())[:NUM_LANGUAGES],
                          "Languages with the most number of bytes",
                          "Number of bytes", "Language",
                          color_palette=Category20_15)

We can see that most of the code, measured in bytes, is written in Jupyter Notebooks, JavaScript and Java.

Although the number of bytes is an interesting measure, some languages naturally produce larger files (Jupyter Notebooks, for instance, store cell outputs alongside the code). In the following cell we are going to select the most prominent language of each repository and plot the most used languages in the dataset:

In [None]:
from bokeh.palettes import Category20_10

most_used_languages = languages_df.idxmax(axis=1).value_counts()[:10]
plot_horizontal_bar_chart(list(most_used_languages),
                          list(most_used_languages.keys()),
                          "Top 10 most used languages",
                          "Number of repositories", "Language",
                          color_palette=Category20_10)
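
The `idxmax` and `value_counts` combination above picks, for each repository, the language with the largest byte count and then counts how often each language wins. One caveat worth checking: for a row that contains only zeros, `idxmax` returns the first column, so a repository without any detected language would be counted towards that column. The sketch below reproduces the same idea in plain Python on made-up data and skips such repositories explicitly (whether the dataset contains any is not verified here):

```python
from collections import Counter

# Made-up byte counts for four hypothetical repositories
languages_per_repo = [
    {"Python": 1200, "Shell": 50},
    {"Java": 3000, "Python": 100},
    {"Python": 10},
    {},  # no detected languages
]

dominant = Counter(
    max(languages, key=languages.get)  # language with the most bytes
    for languages in languages_per_repo
    if languages  # skip repositories with no detected languages
)
print(dominant.most_common())
# prints: [('Python', 2), ('Java', 1)]
```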

With this new measure the ranking has changed a bit. Both JavaScript and Jupyter Notebooks remain in the top 5, but they have fallen behind Java and Python.

## Saving the dataframe

Finally, we are going to serialize the dataframe so we can load it later on in the following notebooks:

In [None]:
GIT_DF_FILE_PATH = os.path.join(DATA_DIR, 'git_dataframe.pkl')

df.to_pickle(GIT_DF_FILE_PATH)
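
A later notebook can restore the DataFrame with `pd.read_pickle`. The round trip is sketched below on a small made-up DataFrame written to a temporary directory, so that it does not touch the real data file; the column values are invented for illustration:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the real DataFrame
example_df = pd.DataFrame({
    "name": ["repo-a", "repo-b"],
    "num_chars_text": [120, 3400],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_dataframe.pkl")
    example_df.to_pickle(path)          # serialize to disk
    restored_df = pd.read_pickle(path)  # load it back

print(restored_df.equals(example_df))
# prints: True
```

Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources such as files produced by this notebook.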