# Big Data Analysis Project - Group Project FS2020
## University of Zurich
## 5th of June 2020

Jara Fuhrer, 15-702-889

Claudio Brasser,

Severin Siffert, 14-720-536

Andrea Giambonini, 10-726-842

Elizabeth Oladipo, 17-722-414


## "The goal of the project is to apply the data science pipeline"

![title](doingDS.png)

# Data Collection / Acquisition

Our dataset comes from http://www.kave.cc/, where 15'000 hours of Visual Studio interaction data, comprising almost 11 million events, were collected from volunteers.
The data schema at http://www.kave.cc/feedbag/event-generation shows what kind of data is collected.

Within this project, we focus on three events:
- build event: actions like build, build all, or clean
- edit event: changes made by the developer, like renaming
- test run event: which tests were run, when, and with what result


# Data Exploration

#### Who constructed data set, when, why?
The KaVE Project originally was a German research program. Over the past 10 years, it evolved into a platform for research around recommendation systems for software engineering. Generally, its researchers are interested in questions such as how humans influence software engineering and how certain tools can support humans in performing their tasks better or more efficiently.
Examples are intelligent code completion, interaction trackers, and evaluation tools.

The KaVE project collects and provides these data sets so that we can better understand what software engineers do, what they interact with, and where their problems lie. With this data, we can look for relations between the working behaviour of developers (edit events, timing, test runs and their results) and their efficiency, performance, productivity, and habits.

Below, the three data sets are explained.

#### What do we want to learn from this data?
From personal experience, we know how important it is to build your code and run tests frequently. Only then are you able to link what you've done (i.e. the edit events) to the outcome (i.e. build and test run events).

The goal of this data analysis is therefore to examine how the probability of unit tests or project builds succeeding relates to how much time has passed since the last build or test run.


#### Hypotheses
TODO --> how did we get to this hypos?
Our hypotheses are:

1) the longer a developer waits to build his code, the higher the probability that the build will fail

2) the longer a developer waits to run some tests, the higher the probability for test failures

3) the more edit events a developer executes, the lower the probability that the build will succeed

4) the more edit events a developer executes, the lower the probability for test success
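
One way to make hypotheses like these checkable is to bin the waiting time and compare failure rates per bin. The following is a minimal sketch on invented numbers; the column names `time` and `success`, the toy values, and the bin edges are assumptions for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy data (invented for illustration): seconds waited and build outcome.
toy = pd.DataFrame({
    "time": [5, 20, 45, 90, 200, 400, 900, 2000],
    "success": [True, True, True, False, True, False, False, False],
})

# Bin edges in seconds are an arbitrary choice for this sketch.
bins = [0, 60, 600, 3600]
toy["time_bin"] = pd.cut(toy["time"], bins=bins)

# Failure rate per bin = 1 - mean of the boolean success flag.
failure_rate = 1 - toy.groupby("time_bin", observed=True)["success"].mean()
print(failure_rate)
```

If a hypothesis holds, the failure rate should tend to rise from the shortest to the longest bin; the same pattern applies to edit counts instead of seconds.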



#### Final Data Analysis Questions:
1) linking time since last (successful?) build to probability of build succeeding -- SEVERIN

2) linking time since last tests run to probability of tests passing -- ELIZABETH

3) linking number of edit events since last successful build to probability of build succeeding -- ANDREA

4) linking number of edit events since last passing tests to probability of unit tests passing -- JARA


In [1]:
import pandas as pd
import numpy as np
import math
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)
from itertools import islice
from sklearn import datasets, linear_model
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score




In [3]:
df_edit = pd.read_csv("../data/df_edit.csv")
df_test = pd.read_csv("../data/df_test.csv")
df_build = pd.read_csv("../data/df_build.csv")

## TODO JARA: Short description of the data frames & their columns

### Edit Events

In [4]:
df_edit.iloc[:,1:3]

| | sessionID | timestamp |
|---|---|---|
| 0 | 0 | 2016-10-04 14:36:01 |
| 1 | 0 | 2016-10-04 14:36:07 |
| 2 | 0 | 2016-10-04 14:36:49 |
| 3 | 0 | 2016-10-04 14:36:53 |
| 4 | 1 | 2016-10-04 14:37:03 |
| ... | ... | ... |
| 497454 | ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f | 2016-04-20 15:13:48 |
| 497455 | ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f | 2016-04-20 15:13:52 |
| 497456 | ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f | 2016-04-20 15:13:58 |
| 497457 | ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f | 2016-04-20 15:14:03 |


The first column is the index of the dataframe.

The second column includes the sessionID.

The third column is the timestamp of the edit event.

In total, 497'459 edit events have been recorded across 2876 unique sessionIDs.

In [5]:
df_edit["timestamp"].describe() 
 

count                  497459
unique                 488049
top       2016-09-04 23:28:12
freq                        4
Name: timestamp, dtype: object

In [5]:
df_edit["sessionID"].describe()

count                                   497459
unique                                    2876
top       8d0ea603-57cd-4b1f-b3cf-ce39ec9203c7
freq                                     17006
Name: sessionID, dtype: object
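
The `dtype: object` in the timestamp summary indicates that the timestamps are stored as strings. Before computing time differences they need to be parsed; the sketch below does this on a small stand-in frame (named `sample` here, a hypothetical name) using the timestamp format visible in the table above.

```python
import pandas as pd

# Small stand-in for df_edit, using timestamps in the format shown above.
sample = pd.DataFrame({
    "sessionID": ["0", "0", "1"],
    "timestamp": [
        "2016-10-04 14:36:01",
        "2016-10-04 14:36:07",
        "2016-10-04 14:37:03",
    ],
})

# Parse the strings into datetime64 values with an explicit format.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Differences between parsed timestamps are Timedelta objects.
gap = sample["timestamp"].iloc[1] - sample["timestamp"].iloc[0]
print(sample["timestamp"].dtype)
print(gap.total_seconds())
```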

### Test Events

In [6]:
df_test.iloc[:,1:5]

| | sessionID | timestamp | totalTests | testsPassed |
|---|---|---|---|---|
| 0 | 006eb9aa-33f1-4e9e-8e74-7c978b58ee4a | 2016-05-03 09:32:16 | 33 | 33 |
| 1 | 03c83bf2-8938-4a8f-9f58-d52bf3b2eccd | 2016-05-10 17:21:18 | 1 | 0 |
| 2 | 03c83bf2-8938-4a8f-9f58-d52bf3b2eccd | 2016-05-10 17:21:54 | 26 | 26 |
| 3 | 03c83bf2-8938-4a8f-9f58-d52bf3b2eccd | 2016-05-10 17:28:26 | 1 | 1 |
| 4 | 0504fbd1-cce2-4431-b4e2-edc63eea1c6d | 2016-07-13 20:24:46 | 21 | 21 |
| ... | ... | ... | ... | ... |
| 3821 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 14:57:23 | 1 | 0 |
| 3822 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:04:16 | 1 | 0 |
| 3823 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:05:03 | 1 | 0 |
| 3824 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:23:52 | 1 | 1 |


In [7]:
df_test.describe()

| | Unnamed: 0 | totalTests | testsPassed |
|---|---|---|---|
| count | 3826.0 | 3826.0 | 3826.0 |
| mean | 1912.5 | 88.780972 | 76.577627 |
| std | 1104.615393 | 430.78577 | 392.217908 |
| min | 0.0 | -1.0 | -1.0 |
| 25% | 956.25 | 1.0 | 0.0 |
| 50% | 1912.5 | 3.0 | 1.0 |
| 75% | 2868.75 | 21.0 | 15.0 |
| max | 3825.0 | 6618.0 | 6090.0 |


The first column is the index of the dataframe.

The second column includes the sessionID.

The third column is the timestamp of the test run event.

The fourth column includes the total number of tests run at this specific time.

And the fifth column states how many tests actually passed.
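
The summary statistics above show a minimum of -1 for both count columns. The sketch below assumes (this is not confirmed by the source) that -1 marks test runs without a usable count, drops those rows along with runs of zero tests, and derives a pass ratio and an all-passed flag. The column names `passRatio` and `allPassed` are hypothetical, and the rows are toy values.

```python
import pandas as pd

# Toy rows in the shape of df_test; the -1 and 0 rows are invented edge cases.
tests = pd.DataFrame({
    "totalTests": [33, 1, 26, -1, 0],
    "testsPassed": [33, 0, 26, -1, 0],
})

# Keep only runs with at least one test, which also avoids division by zero.
valid = tests[tests["totalTests"] > 0].copy()

# Fraction of tests that passed, and whether the whole run was green.
valid["passRatio"] = valid["testsPassed"] / valid["totalTests"]
valid["allPassed"] = valid["testsPassed"] == valid["totalTests"]
print(valid)
```

A binary target such as `allPassed` fits the classifiers used later, while `passRatio` would suit a regression-style analysis.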

### Build Events

In [8]:
df_build.iloc[:,1:4]

| | sessionID | timestamp | buildSuccessful |
|---|---|---|---|
| 0 | 0 | 2016-10-04 14:35:55 | False |
| 1 | 0 | 2016-10-04 14:36:07 | False |
| 2 | 0 | 2016-10-04 14:36:50 | False |
| 3 | 0 | 2016-10-04 14:36:53 | False |
| 4 | 1 | 2016-10-04 14:37:03 | True |
| ... | ... | ... | ... |
| 14952 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:04:17 | True |
| 14953 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:05:05 | True |
| 14954 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:23:53 | True |
| 14955 | ffc444d0-8382-4e52-9f04-3c42601ec739 | 2016-06-14 15:24:13 | True |


In [9]:
df_build.describe()

| | Unnamed: 0 |
|---|---|
| count | 14957.0 |
| mean | 7478.0 |
| std | 4317.858323 |
| min | 0.0 |
| 25% | 3739.0 |
| 50% | 7478.0 |
| 75% | 11217.0 |
| max | 14956.0 |


# Data Preprocessing / Cleaning

The data was originally available as individual JSON events, with great support for parsing with Java or C#. Since we have to work with Python, we used Java to convert the relevant information into CSV without cleaning the data first.
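
The conversion itself was done in Java and is not shown here. Purely to illustrate the shape of such a step, the sketch below flattens JSON records into CSV rows in Python. The field names and records are hypothetical placeholders chosen to match the CSV columns used in this notebook; the real KaVE event schema is richer and uses different names.

```python
import csv
import io
import json

# Hypothetical event records; not the actual KaVE JSON layout.
raw_events = [
    '{"sessionID": "a1", "timestamp": "2016-05-03 09:32:16", "buildSuccessful": true}',
    '{"sessionID": "a1", "timestamp": "2016-05-03 09:40:02", "buildSuccessful": false}',
]

fields = ["sessionID", "timestamp", "buildSuccessful"]
buffer = io.StringIO()  # stands in for an output file
writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
writer.writeheader()
for line in raw_events:
    event = json.loads(line)
    # Keep only the fields of interest; missing keys become empty cells.
    writer.writerow({key: event.get(key) for key in fields})

csv_text = buffer.getvalue()
print(csv_text)
```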

<i> --> TODO Claudio </i>

# Model / Algorithm Building

- chapter 6
- we should agree on some basics
- fit model with optimization methods?
- linear vs non-linear, blackbox vs descriptive, first principle vs data driven, stochastic vs deterministic, flat vs hierarchical
- model evaluation: training data set, validation data set, test data set
- classifiers, value prediction, absolute / relative / squared error, baseline models?
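
The training / validation / test distinction in the notes above can be sketched with two calls to scikit-learn's `train_test_split`. The data here is synthetic and the 60/20/20 proportions are an assumption for illustration, not a decision taken in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature column and a boolean outcome (80 True, 20 False).
x = np.arange(100).reshape(-1, 1)
y = np.array([True] * 80 + [False] * 20)

# Hold out 20% as the test set, keeping the class ratio (stratify).
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

# Split the remaining 80% into 60% training and 20% validation overall.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)
print(len(x_train), len(x_val), len(x_test))
```

The validation set would be used for choosing between models or hyperparameters, and the test set only once for the final estimate.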

In [10]:
grouped = df_edit.groupby(['sessionID'])
sessionStarts = grouped.agg({'timestamp':np.min}).to_dict()
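
For orientation, this aggregation followed by `to_dict()` yields a nested dictionary keyed first by column name and then by session ID, which is why later lookups have the form `sessionStarts['timestamp'][sessionID]`. A small sketch on toy rows (the frame `edits` and its values are illustrative only, and the string `"min"` is used in place of `np.min`):

```python
import pandas as pd

# Toy stand-in for df_edit with string timestamps, as in the CSV.
edits = pd.DataFrame({
    "sessionID": ["0", "0", "1"],
    "timestamp": [
        "2016-10-04 14:36:01",
        "2016-10-04 14:36:07",
        "2016-10-04 14:37:03",
    ],
})

# Earliest timestamp per session, converted to {column: {sessionID: value}}.
starts = edits.groupby("sessionID").agg({"timestamp": "min"}).to_dict()
print(starts)
```

Because the timestamps are zero-padded strings in year-month-day order, the string minimum coincides with the chronological minimum.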

In [11]:
##Predicting Build success by time since last build

In [12]:
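# Build events are ordered by session ID and timestamp.
previous = {'sessionID': 'nonexistent-atsirtsakitaiea'}  # sentinel that matches no real session
time_to_build = []
for _, event in df_build.iterrows():
    try:
        # Compare by value; `is` tests object identity and is unreliable for strings.
        if previous['sessionID'] == event['sessionID']:
            begin = previous['timestamp']
        else:
            # First build of a session: measure from the session's first edit event.
            begin = sessionStarts['timestamp'][event['sessionID']]
        end = event['timestamp']
        # Note: .seconds drops the days component, so a build that precedes the
        # first edit by 6 s is recorded as 86394 rather than -6.
        duration = pd.Timedelta(pd.to_datetime(end) - pd.to_datetime(begin)).seconds
        time_to_build.append((duration, event['buildSuccessful']))
    except Exception:
        # Skip events that cannot be processed (e.g. a session without edit events).
        pass
    previous = event
time_to_build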

[(86394, False),
 (12, False),
 (43, False),
 (3, False),
 (0, True),
 (146, False),
 (520, True),
 (3331, True),
 (2154, True),
 (572, True),
 (1795, True),
 (216, True),
 (50, True),
 (1154, True),
 (1683, True),
 (1493, True),
 (188, True),
 (451, True),
 (204, True),
 (114, True),
 (310, True),
 (146, True),
 (51, True),
 (265, False),
 (137, False),
 (116, False),
 (119, False),
 (52, True),
 (176, True),
 (94, True),
 (42, True),
 (194, True),
 (170, True),
 (38, True),
 (90, False),
 (37, True),
 (90, True),
 (196, True),
 (93, True),
 (82, True),
 (138, True),
 (405, True),
 (70, True),
 (412, True),
 (289, True),
 (26, True),
 (2711, True),
 (544, False),
 (50, False),
 (100, True),
 (19, True),
 (114, False),
 (24, False),
 (27, False),
 (57, False),
 (61, False),
 (8, False),
 (8, False),
 (74, False),
 (301, True),
 (81287, True),
 (9707, False),
 (20, True),
 (30, True),
 (79399, True),
 (42, True),
 (15, True),
 (10, True),
 (35, True),
 (11, True),
 (4056, True),
 (524, 
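
The same kind of quantity can be sketched without an explicit loop. This is an alternative formulation, not the code that produced the output above: it uses `groupby(...).diff()` for the gap to the previous build within a session and `total_seconds()` so that negative or multi-day gaps are not wrapped into the 0 to 86399 range. Unlike the loop, the first build of each session gets NaN instead of being measured from the first edit event. The rows are the first five build events shown earlier; `secondsSincePrev` is a hypothetical column name.

```python
import pandas as pd

# First five build events from the table shown earlier.
builds = pd.DataFrame({
    "sessionID": ["0", "0", "0", "0", "1"],
    "timestamp": pd.to_datetime([
        "2016-10-04 14:35:55", "2016-10-04 14:36:07",
        "2016-10-04 14:36:50", "2016-10-04 14:36:53",
        "2016-10-04 14:37:03",
    ]),
    "buildSuccessful": [False, False, False, False, True],
})

# Sort so that diff() compares each build with its predecessor in the session.
builds = builds.sort_values(["sessionID", "timestamp"])
builds["secondsSincePrev"] = (
    builds.groupby("sessionID")["timestamp"].diff().dt.total_seconds()
)
print(builds[["sessionID", "secondsSincePrev", "buildSuccessful"]])
```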

In [18]:
def analyze_classifiers(x, y):
    """Print the mean 5-fold cross-validated accuracy of three classifiers."""
    classifiers = [('linear', SGDClassifier()),
                   ('logistic', LogisticRegression()),
                   ('knn', KNeighborsClassifier(3))]
    for name, model in classifiers:
        accuracies = []
        # Folds are shuffled without a fixed seed, so results vary slightly between runs.
        kf = KFold(n_splits=5, shuffle=True)
        for train_index, test_index in kf.split(x):
            # kf.split yields positional indices, so index both x and y with iloc.
            x_train, x_test = x.iloc[train_index], x.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            model.fit(x_train, y_train)
            accuracies.append(accuracy_score(y_test, model.predict(x_test)))
        print('accuracy of', name, 'is', np.mean(accuracies))
    

In [19]:
frame = pd.DataFrame(time_to_build, columns=['time','success'])
x = frame[['time']]
y = frame['success']

analyze_classifiers(x,y)

accuracy of linear is 0.7199703458200688
accuracy of logistic is 0.8617758738869898
accuracy of knn is 0.830747694588783
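
Accuracy figures like these are easier to interpret next to a trivial baseline, because a model that always predicts the majority class can already score high when most builds succeed. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels; the 85/15 class split is invented for illustration and is not a measured property of the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 85 successful and 15 failed builds.
x = np.arange(100).reshape(-1, 1)
y = np.array([True] * 85 + [False] * 15)

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")

# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(baseline, x, y, cv=5, scoring="accuracy")
baseline_accuracy = scores.mean()
print(baseline_accuracy)
```

On the real data, `frame['success'].mean()` would give the corresponding majority-class rate to compare against the three accuracies.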


# Data Visualization

- we should agree on some basic design / plots such that we have a consistent viz