This notebook pre-processes the dataset collected from the GitHub public dataset on Google BigQuery.

For details about the source of the original data, refer to the `data/` directory at the root of this repository. The data used in this notebook was collected from the [GitHub Google BigQuery Public Dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code).

<hr>

At a high level, the input from the `data/*.csv` files is merged and saved into a `dataset.json` file for later use in:
1. classification model
2. text retrieval

both of which form part of the project web UI.


In [1]:
import pandas as pd
import numpy as np
import random
import json
import regex as re

from IPython.display import display, Markdown

from sklearn.utils import shuffle
from dockerfile_parse import DockerfileParser

In [2]:
random.seed(10)
np.random.seed(10)

In [3]:
sh_df = pd.read_csv('../data/sample_contents_sh.csv')
bat_df = pd.read_csv('../data/sample_contents_bat.csv')
df_df = pd.read_csv('../data/sample_contents_dockerfile.csv')

In [4]:
sh_df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path
0,c9cf4a93718cce2c5bcfd5caf5ecbd8f0c1ae4a6,221,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,SpisTresci/SpisTresci,compose/django/cron.sh
1,c9aaa28acde02cba18658aa4f2eb335fb4cf20b7,265,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",chaohu/Daily-Learning,Verilog/lab2/lab2_1/lab1_2_2/lab1_2_2.sim/sim_...
2,2f1c35188379847bdb4d907196bb7f2dd7a515f7,376,#!/bin/bash\n#--------------------------------...,BeeeOn/android,tests/monkey/kill-test.sh
3,04fe6cd6b11a81f12a91bd6effd318d2483a17d9,89,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,oscm/shell,mq/rabbitmq/enable.rabbitmq_management.sh
4,b46ba78936c1ec41fbf4aed15a876d858432db88,1905,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,Kitware/HPCCloud,server/taskflows/hpccloud/taskflow/pvw.sh


In [5]:
bat_df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path
0,2d4c92d2e5c15fdf704c19e77be0634faa4fd3ff,146,echo set building enviroment\nset VS_PATH=C:\P...,frankee/cetty2,contribute/build-boost-bjam.bat
1,e8708b6e3ed44a777d6c2145b4280dcd40b38988,699,"@echo off\n\ncall ""%~dp0_settings.bat""\nprompt...",Jumpscale/jumpscale_core8,tools/windows/js_install.bat
2,cbd036f347f54ff02d7e99b8e6e3dbbe3628a195,78,\ncall clean_api.bat\ncall clean_dyn.bat\ncall...,xsmart/opencvr,3rdparty/firebird/examples/build_win32/clean_a...
3,d171450bade60170933c4ebcd4120d0dec4640b9,1818,"@echo off\n\nREM # Copyright (c) 2014-2016, In...",01org/pyMIC,examples/double_it_lowlevel/make.bat
4,7b47aec9be5afe9ab8bf553eeb71155435fe4db1,6459,@ECHO OFF\n\nREM Command file for Sphinx docum...,dnnsoftware/Docs,main/make.bat


In [6]:
df_df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path
0,f4abc38e22fc9bd551b5ae4c14c382309b506d13,153,FROM daunnc/geodocker-accumulo:latest\n\nMAINT...,geotrellis/geodocker-cluster,extras/accumulo-gis/Dockerfile
1,c6562467d9c7e80093cb8f0b743dabfe1cf2cbac,612,FROM andreptb/oracle-java:6-alpine\n\nMAINTAIN...,andreptb/Dockerfiles,maven/alpine/jdk-6/Dockerfile
2,88fe8ed4b008c4761f9ba9dd9b16c14961e822c9,2089,[#ftl]\n#\n# Copyright 2014-2015 by Cloudsoft...,brooklyncentral/clocker,docker/src/main/resources/clocker/docker/entit...
3,401a1578848bc4e8c2ddec31958db1f06180df67,993,#\n# -----------------------------------------...,pintostack/core,provisioning/roles/ansible-java/Dockerfile
4,8d013899004583b41184848569b1d8b81f1c66cc,725,FROM ubuntu:vivid\nMAINTAINER Joel Martin <git...,kanaka/mal,cpp/Dockerfile


In [7]:
print(f'Dataset contains {len(sh_df)} shell scripts, {len(bat_df)} batch files and {len(df_df)} Dockerfiles.')
print(f'Thus, our corpus contains a total of {len(sh_df) + len(bat_df) + len(df_df)} documents.')

Dataset contains 15626 shell scripts, 2204 batch files and 1413 Dockerfiles.
Thus, our corpus contains a total of 19243 documents.


In [8]:
display(Markdown('## Let\'s look at a random .sh file:<br><br>\n\n```sh\n' + sh_df.loc[1020]['content'] + '\n```'))

## Let's look at a random .sh file:<br><br>

```sh
echo "Testing Code Coverage"

gocov test > coverage/go-tmdb.json
gocov-html coverage/go-tmdb.json > coverage/coverage_report.html
```

In [9]:
display(Markdown('## Also, look at a random .bat file:<br><br>\n\n```sh\n' + bat_df.loc[1]['content'] + '\n```'))

## Also, look at a random .bat file:<br><br>

```sh
@echo off

call "%~dp0_settings.bat"
prompt $LQ$G$S$P$G
rem echo %1%
cd %~dp0

rem to get jumpscale code
git clone https://github.com/Jumpscale/jumpscale_core8.git
cd jumpscale_core8
git pull origin @ys:@ys
git checkout @ys
cd ..

mkdir Lib\site-packages\JumpScale
junction  Lib\site-packages\JumpScale\baselib jumpscale_core8\lib\JumpScale\baselib 
junction  Lib\site-packages\JumpScale\core jumpscale_core8\lib\JumpScale\core
junction  Lib\site-packages\JumpScale\grid jumpscale_core8\lib\JumpScale\grid
copy jumpscale_core8\install\InstallTools.py Lib\site-packages\JumpScale\InstallTools.py
copy jumpscale_core8\lib\JumpScale\__init__.py Lib\site-packages\JumpScale\__init__.py

mkdir hrd\system
```

In [10]:
display(Markdown('## And what does a typical Dockerfile look like? Let\'s see: <br><br>\n\n```dockerfile\n' + df_df.loc[67]['content'] + '```'))

## And what does a typical Dockerfile look like? Let's see: <br><br>

```dockerfile
FROM alpine:3.4

RUN apk add --no-cache \
		ca-certificates \
		curl \
		openssl

ENV DOCKER_VERSION 1.12.0-rc2
ENV DOCKER_URL http://experimental.docker.com.s3.amazonaws.com/builds/Linux/x86_64/docker-1.12.0-rc2.tgz
ENV DOCKER_SHA256 fa4d7737b80fd5ac38940f5e582047e9032ab71148bcba2c059ed7b2e0d28545

RUN set -x \
	&& curl -fSL "${DOCKER_URL}" -o docker.tgz \
	&& echo "${DOCKER_SHA256} *docker.tgz" | sha256sum -c - \
	&& tar -xzvf docker.tgz \
	&& mv docker/* /usr/local/bin/ \
	&& rmdir docker \
	&& rm docker.tgz \
	&& docker -v

COPY docker-entrypoint.sh /usr/local/bin/

ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["sh"]
```

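Parse the contents of a Dockerfile and produce sh-like contents: `FROM` becomes an `export BASE_IMAGE=...` line, `RUN` commands are kept as they are, `ENV` becomes `export`, `COPY` becomes `cp -r`, and `CMD`/`ENTRYPOINT` become plain command lines. All other instructions are dropped.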

In [11]:
def get_sh_content_from_dockerfile(content):
    """Translate the supported Dockerfile instructions into sh-like lines."""
    dfp = DockerfileParser()
    dfp.content = content

    sh = ""
    for line in dfp.structure:
        instruction, value = line["instruction"], line["value"]
        if instruction == "FROM":
            sh += "export BASE_IMAGE=" + value + "\n"
        elif instruction == "RUN":
            sh += value + "\n"
        elif instruction == "ENV":
            # Assumes the `ENV KEY value` form with a space-free value.
            sh += "export " + "=".join(value.split(" ")) + "\n"
        elif instruction == "COPY":
            sh += "cp -r " + value + "\n"
        elif instruction in ("CMD", "ENTRYPOINT"):
            try:
                # Exec form, e.g. ["sh"]: join the JSON array into one command.
                sh += " ".join(json.loads(value)) + "\n"
            except (ValueError, TypeError):
                # Shell form: keep the command string as written.
                sh += value + "\n"
        # All other instructions (e.g. MAINTAINER) are ignored.
    return sh
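
The `ENV` branch above joins its argument on every space, which works for the `ENV KEY value` lines in this example but would mangle a value that contains spaces. Dockerfiles also allow the `ENV KEY=value ...` form. A minimal sketch of a more tolerant conversion follows; `env_to_export` is a hypothetical helper that is not part of this notebook, and the use of `shlex.quote` to quote the value is an assumption about what a reasonable shell rendering looks like.

```python
import shlex


def env_to_export(value: str) -> str:
    """Convert the argument string of a Dockerfile ENV instruction to an export line."""
    value = value.strip()
    key, _, rest = value.partition(" ")
    if "=" in key:
        # `ENV KEY=value [KEY2=value2 ...]` form: already shell assignment syntax.
        return "export " + value
    # `ENV KEY value` form: everything after the first space is the value.
    return f"export {key}={shlex.quote(rest.strip())}"


print(env_to_export("DOCKER_VERSION 1.12.0-rc2"))
print(env_to_export("GREETING hello world"))
print(env_to_export("A=1 B=2"))
```

Quoting only kicks in when the value needs it, so lines such as `ENV DOCKER_VERSION 1.12.0-rc2` would still render the same way as in the original function.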

In [12]:
dfc = df_df.loc[67]['content']

parsh = get_sh_content_from_dockerfile(dfc)

display(Markdown(f"""
**Input**:

```dockerfile
{dfc}
```

<hr>

**Output**:
```sh
{parsh}
```

"""))


**Input**:

```dockerfile
FROM alpine:3.4

RUN apk add --no-cache \
		ca-certificates \
		curl \
		openssl

ENV DOCKER_VERSION 1.12.0-rc2
ENV DOCKER_URL http://experimental.docker.com.s3.amazonaws.com/builds/Linux/x86_64/docker-1.12.0-rc2.tgz
ENV DOCKER_SHA256 fa4d7737b80fd5ac38940f5e582047e9032ab71148bcba2c059ed7b2e0d28545

RUN set -x \
	&& curl -fSL "${DOCKER_URL}" -o docker.tgz \
	&& echo "${DOCKER_SHA256} *docker.tgz" | sha256sum -c - \
	&& tar -xzvf docker.tgz \
	&& mv docker/* /usr/local/bin/ \
	&& rmdir docker \
	&& rm docker.tgz \
	&& docker -v

COPY docker-entrypoint.sh /usr/local/bin/

ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["sh"]

```

<hr>

**Output**:
```sh
export BASE_IMAGE=alpine:3.4
apk add --no-cache 		ca-certificates 		curl 		openssl
export DOCKER_VERSION=1.12.0-rc2
export DOCKER_URL=http://experimental.docker.com.s3.amazonaws.com/builds/Linux/x86_64/docker-1.12.0-rc2.tgz
export DOCKER_SHA256=fa4d7737b80fd5ac38940f5e582047e9032ab71148bcba2c059ed7b2e0d28545
set -x 	&& curl -fSL "${DOCKER_URL}" -o docker.tgz 	&& echo "${DOCKER_SHA256} *docker.tgz" | sha256sum -c - 	&& tar -xzvf docker.tgz 	&& mv docker/* /usr/local/bin/ 	&& rmdir docker 	&& rm docker.tgz 	&& docker -v
cp -r docker-entrypoint.sh /usr/local/bin/
docker-entrypoint.sh
sh

```
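
In the output above, the exec-form arrays of `ENTRYPOINT` and `CMD` are flattened by joining their elements with single spaces. That is fine for one-word commands such as `["sh"]`, but an argument containing spaces would lose its boundaries. The sketch below shows one possible alternative using `shlex.join` from the standard library (Python 3.8+); `command_to_sh` is a hypothetical name used only for illustration.

```python
import json
import shlex


def command_to_sh(value: str) -> str:
    """Render a CMD/ENTRYPOINT argument as a single shell command line."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        # Shell form: the value is already a command line.
        return value
    if isinstance(parsed, list) and all(isinstance(p, str) for p in parsed):
        # Exec form: quote each argument so word boundaries survive.
        return shlex.join(parsed)
    # Valid JSON that is not a list of strings: leave it untouched.
    return value


print(command_to_sh('["docker-entrypoint.sh"]'))
print(command_to_sh('["sh", "-c", "echo hi"]'))
print(command_to_sh('nginx -g "daemon off;"'))
```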



Keep the original Dockerfile source in a new `original_content` column and overwrite the `content` column with the parsed, sh-like contents.

In [13]:
df_df["original_content"] = df_df["content"]
df_df["content"] = df_df["original_content"].apply(get_sh_content_from_dockerfile)
df_df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content
0,f4abc38e22fc9bd551b5ae4c14c382309b506d13,153,export BASE_IMAGE=daunnc/geodocker-accumulo:la...,geotrellis/geodocker-cluster,extras/accumulo-gis/Dockerfile,FROM daunnc/geodocker-accumulo:latest\n\nMAINT...
1,c6562467d9c7e80093cb8f0b743dabfe1cf2cbac,612,export BASE_IMAGE=andreptb/oracle-java:6-alpin...,andreptb/Dockerfiles,maven/alpine/jdk-6/Dockerfile,FROM andreptb/oracle-java:6-alpine\n\nMAINTAIN...
2,88fe8ed4b008c4761f9ba9dd9b16c14961e822c9,2089,export BASE_IMAGE=${fullyQualifiedImageName}\n...,brooklyncentral/clocker,docker/src/main/resources/clocker/docker/entit...,[#ftl]\n#\n# Copyright 2014-2015 by Cloudsoft...
3,401a1578848bc4e8c2ddec31958db1f06180df67,993,export BASE_IMAGE=ansibleshipyard/ansible-base...,pintostack/core,provisioning/roles/ansible-java/Dockerfile,#\n# -----------------------------------------...
4,8d013899004583b41184848569b1d8b81f1c66cc,725,export BASE_IMAGE=ubuntu:vivid\napt-get -y upd...,kanaka/mal,cpp/Dockerfile,FROM ubuntu:vivid\nMAINTAINER Joel Martin <git...


Combine the contents of the three sources (.sh, .bat and Dockerfile) into a single dataframe, tagging each row with its source category.

In [14]:
sh_df["original_content"] = ""
bat_df["original_content"] = ""

sh_df["source_category"] = '.sh'
df_df["source_category"] = 'Dockerfile'
bat_df["source_category"] = '.bat'

In [15]:
df = pd.concat([sh_df, df_df, bat_df])
df.reset_index(drop=True, inplace=True)

# NOTE: sklearn.utils.shuffle returns a shuffled copy and leaves df untouched.
# The result is discarded here, so df keeps its concatenation order.
shuffle(df)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content,source_category
0,c9cf4a93718cce2c5bcfd5caf5ecbd8f0c1ae4a6,221,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,SpisTresci/SpisTresci,compose/django/cron.sh,,.sh
1,c9aaa28acde02cba18658aa4f2eb335fb4cf20b7,265,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",chaohu/Daily-Learning,Verilog/lab2/lab2_1/lab1_2_2/lab1_2_2.sim/sim_...,,.sh
2,2f1c35188379847bdb4d907196bb7f2dd7a515f7,376,#!/bin/bash\n#--------------------------------...,BeeeOn/android,tests/monkey/kill-test.sh,,.sh
3,04fe6cd6b11a81f12a91bd6effd318d2483a17d9,89,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,oscm/shell,mq/rabbitmq/enable.rabbitmq_management.sh,,.sh
4,b46ba78936c1ec41fbf4aed15a876d858432db88,1905,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,Kitware/HPCCloud,server/taskflows/hpccloud/taskflow/pvw.sh,,.sh


In [16]:
df.tail()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content,source_category
19238,ed92597d721406874e3b31165bca2b42621673dd,5100,@ECHO OFF\n\nREM Command file for Sphinx docum...,eduardocereto/pyboleto,docs/make.bat,,.bat
19239,313c605c2d014ebd8e785133988c5b839c6bbe3f,4118,@ECHO OFF\n\nREM Command file for Sphinx docum...,divio/django-filer,docs/make.bat,,.bat
19240,32d9f8f20fcd3ac355cbc4283cce8b7eea1a4979,7250,@ECHO OFF\r\n\r\nREM Command file for Sphinx d...,robhowley/nhlscrapi,docs/make.bat,,.bat
19241,30f747c6acd36a32290c2cc36b582dd9a73f58f5,6472,@ECHO OFF\n\nREM Command file for Sphinx docum...,Tanganelli/CoAPthon,docs/make.bat,,.bat
19242,22dc12f666f76c954ffe4f0aeddcb442a894aacc,5111,@ECHO OFF\n\nREM Command file for Sphinx docum...,buttinsky/buttinsky,docs/make.bat,,.bat
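
The head and tail above are still in concatenation order (`.sh` rows first, `.bat` rows last) because `shuffle` from `sklearn.utils` returns a shuffled copy rather than modifying its argument. If a shuffled corpus is wanted, one option is the pandas-only sketch below. The toy frame is made up for illustration, and `random_state=10` is an assumption that simply mirrors the seed set at the top of the notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (illustrative values only).
toy = pd.DataFrame({
    "content": ["echo a", "echo b", "FROM x", "@echo off"],
    "source_category": [".sh", ".sh", "Dockerfile", ".bat"],
})

# sample(frac=1) returns every row in random order; reset_index(drop=True)
# discards the old positions so the index runs 0..n-1 again.
shuffled = toy.sample(frac=1, random_state=10).reset_index(drop=True)

print(shuffled["source_category"].tolist())
```

Assigning the result back (here to `shuffled`) is the step that the original cell leaves out.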


In [17]:
len(df)

19243

Save the merged dataset to a JSON file so it can be re-used later.

In [18]:
df.to_json('./web/data/dataset.json')

Load the saved JSON dataset from file (just to verify contents).

In [19]:
df_again = pd.read_json('./web/data/dataset.json')
df_again.head()

Unnamed: 0,id,size,content,sample_repo_name,sample_path,original_content,source_category
0,c9cf4a93718cce2c5bcfd5caf5ecbd8f0c1ae4a6,221,#!/bin/bash\nset -e\n\nenv | sed 's/^/export /...,SpisTresci/SpisTresci,compose/django/cron.sh,,.sh
1,c9aaa28acde02cba18658aa4f2eb335fb4cf20b7,265,"#!/bin/sh -f\nxv_path=""/home/huchao/vivado/Viv...",chaohu/Daily-Learning,Verilog/lab2/lab2_1/lab1_2_2/lab1_2_2.sim/sim_...,,.sh
2,2f1c35188379847bdb4d907196bb7f2dd7a515f7,376,#!/bin/bash\n#--------------------------------...,BeeeOn/android,tests/monkey/kill-test.sh,,.sh
3,04fe6cd6b11a81f12a91bd6effd318d2483a17d9,89,#!/bin/bash\nrabbitmq-plugins enable rabbitmq_...,oscm/shell,mq/rabbitmq/enable.rabbitmq_management.sh,,.sh
4,b46ba78936c1ec41fbf4aed15a876d858432db88,1905,{% if cluster.type == 'ec2' -%}\n#$ -q all.q@@...,Kitware/HPCCloud,server/taskflows/hpccloud/taskflow/pvw.sh,,.sh
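
When called without an `orient` argument, `DataFrame.to_json` writes a column-oriented object: each column name maps to an object keyed by row index. A consumer that does not use pandas can therefore read the file with the standard `json` module. The sketch below counts documents per source category on a small in-memory sample in that layout; `count_by_category` and the sample values are hypothetical and only illustrate the structure.

```python
import json
from collections import Counter


def count_by_category(columns: dict) -> Counter:
    """Count rows per source category in a column-oriented dataset."""
    return Counter(columns["source_category"].values())


# Tiny sample in the same layout as dataset.json (made-up rows).
sample = json.loads("""
{
  "content": {"0": "echo hi", "1": "FROM alpine", "2": "@echo off"},
  "source_category": {"0": ".sh", "1": "Dockerfile", "2": ".bat"}
}
""")

counts = count_by_category(sample)
print(dict(counts))
```

Run against the real file (loaded with `json.load`), the counts should match the per-source totals printed earlier in the notebook, assuming no rows were dropped in between.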
