# Big Data Analysis Project - Group Project FS2020
## University of Zurich
## 5th of June 2020

Jara Fuhrer, 15-702-889

Claudio Brasser,

Severin Siffert, 14-720-536

Andrea Giambonini, 10-726-842

Elizabeth Oladipo, 17-722-414


## "The goal of the project is to apply the data science pipeline"

![title](doingDS.png)

# Data Collection / Acquisition

Our dataset comes from http://www.kave.cc/, where volunteers contributed 15'000 hours of Visual Studio interaction data, recorded as almost 11 million events.
To see what kind of data is collected, you can look at the data schema here: http://www.kave.cc/feedbag/event-generation.

Within this project, we focus on three events:
- build event: actions like build, build all, or clean
- edit event: changes made by the developer, like renaming
- test run event: which tests were run, when, and with what result


# Data Exploration

#### Who constructed the data set, when, and why?
The KaVE Project originally was a German research program. Over the past 10 years, it evolved into a platform for research around recommendation systems for software engineering. Generally, they're interested in questions like how humans influence software engineering or how certain tools can support humans to perform their tasks better or more efficiently. 
Examples are intelligent code completion, interaction trackers or evaluation tools.

The KaVE project collects and provides these data sets so that we can better understand what software engineers do, what they interact with, and where their problems lie. With this data, we can try to find relations between the working behaviour of developers (edit events, time, test runs and their results) and their efficiency, performance, productivity, habits, and so on.

Below, the three data sets we extracted from the gigantic pile of possibilities are explained.

#### What do we want to learn from this data?
From personal experience, we know how important it is to frequently build your code and run some tests. Only then are you able to link what you've done (i.e. what you've written, the edit events) to the outcome (i.e. hopefully working code, the build and test run events). 

* often repeated advice: compile early, compile often
* intuition: longer time between tests/builds = more chance to screw up
* Is there a way to show that empirically?

The goal of this data analysis is therefore to examine how the probability of unit tests or project builds succeeding relates to the time that has passed since the last build or test run.


#### Hypotheses
TODO --> how did we get to this hypos?

* we want to test the advice 'compile early, compile often' and the closely related 'test a lot' empirically
* found relevant events in the data: edit (code modified), build and test events

Our hypotheses are:

1) the longer a developer waits to build his code, the higher the probability that the build will fail

2) the longer a developer waits to run some tests, the higher the probability for test failures

3) the more edit events a developer executes, the lower the probability that the build will succeed

4) the more edit events a developer executes, the lower the probability for test success



#### Final Data Analysis Questions:
1) linking time since last (successful?) build to probability of build succeeding -- SEVERIN

2) linking time since last tests run to probability of tests passing -- ELIZABETH

3) linking number of edit events since last successful build to probability of build succeeding -- ANDREA

4) linking number of edit events since last passing tests to probability of unit tests passing -- JARA


In [1]:
import pandas as pd
import numpy as np
import math
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, FormatStrFormatter,
                               AutoMinorLocator)
from itertools import islice
from sklearn import datasets, linear_model
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score




In [2]:
df_edit = pd.read_csv("../data/df_edit.csv")
editEvents = df_edit
df_test = pd.read_csv("../data/df_test.csv")
testEvents = df_test
df_build = pd.read_csv("../data/df_build.csv")
buildEvents = df_build

## TODO JARA: Short description of the data frames & their columns

* first, a walkthrough about what data we actually extracted from the huge pile of events
* three categories: edit, build, test

### Edit Events

* edit event = code was modified
 * typing
 * copy/paste
 * auto generating getter/setter
 * other refactoring
* sessionID see data cleaning

In [3]:
df_edit.iloc[:,1:3]

Unnamed: 0,sessionID,timestamp
0,0,2016-10-04 14:36:01
1,0,2016-10-04 14:36:07
2,0,2016-10-04 14:36:49
3,0,2016-10-04 14:36:53
4,1,2016-10-04 14:37:03
...,...,...
497454,ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f,2016-04-20 15:13:48
497455,ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f,2016-04-20 15:13:52
497456,ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f,2016-04-20 15:13:58
497457,ffcbdaa4-e264-45a3-bddb-6f2f0afeac2f,2016-04-20 15:14:03


The first column is the index of the dataframe.

The second column includes the sessionID.

The third column is the timestamp of the edit event.

In total, 497'459 edit events were recorded across 2876 unique sessionIDs.

In [4]:
df_edit["timestamp"].describe() 
 

count                  497459
unique                 488049
top       2016-09-04 23:02:24
freq                        4
Name: timestamp, dtype: object

In [5]:
df_edit["sessionID"].describe()

count                                   497459
unique                                    2876
top       8d0ea603-57cd-4b1f-b3cf-ce39ec9203c7
freq                                     17006
Name: sessionID, dtype: object

### Test Events

In [6]:
df_test.iloc[:,1:5]

Unnamed: 0,sessionID,timestamp,totalTests,testsPassed
0,006eb9aa-33f1-4e9e-8e74-7c978b58ee4a,2016-05-03 09:32:16,33,33
1,03c83bf2-8938-4a8f-9f58-d52bf3b2eccd,2016-05-10 17:21:18,1,0
2,03c83bf2-8938-4a8f-9f58-d52bf3b2eccd,2016-05-10 17:21:54,26,26
3,03c83bf2-8938-4a8f-9f58-d52bf3b2eccd,2016-05-10 17:28:26,1,1
4,0504fbd1-cce2-4431-b4e2-edc63eea1c6d,2016-07-13 20:24:46,21,21
...,...,...,...,...
3821,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 14:57:23,1,0
3822,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:04:16,1,0
3823,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:05:03,1,0
3824,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:23:52,1,1


In [7]:
df_test.describe()

Unnamed: 0.1,Unnamed: 0,totalTests,testsPassed
count,3826.0,3826.0,3826.0
mean,1912.5,88.780972,76.577627
std,1104.615393,430.78577,392.217908
min,0.0,-1.0,-1.0
25%,956.25,1.0,0.0
50%,1912.5,3.0,1.0
75%,2868.75,21.0,15.0
max,3825.0,6618.0,6090.0


The first column is the index of the dataframe.

The second column includes the sessionID.

The third column is the timestamp of the test event.

The fourth column includes the total number of tests run at this specific time.

And the fifth column states how many tests actually passed.
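The summary above shows a minimum of -1 in both `totalTests` and `testsPassed`. A minimal sketch of how a pass ratio per test run could be derived while skipping such rows, assuming that non-positive totals carry no usable information (this reading of -1 is our assumption; the toy frame and the `passRatio` column are invented for illustration):

```python
import pandas as pd

# Toy stand-in for df_test; the values are invented for illustration.
toy_tests = pd.DataFrame({
    "totalTests": [33, 1, 26, -1, 0],
    "testsPassed": [33, 0, 26, -1, 0],
})

# Assumption: rows with a non-positive total have no meaningful pass ratio.
valid = toy_tests[toy_tests["totalTests"] > 0].copy()
valid["passRatio"] = valid["testsPassed"] / valid["totalTests"]
print(valid)
```

Filtering first avoids a division by zero and keeps placeholder rows from distorting the ratio.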

### Build Events

In [8]:
df_build.iloc[:,1:4]

Unnamed: 0,sessionID,timestamp,buildSuccessful
0,0,2016-10-04 14:35:55,False
1,0,2016-10-04 14:36:07,False
2,0,2016-10-04 14:36:50,False
3,0,2016-10-04 14:36:53,False
4,1,2016-10-04 14:37:03,True
...,...,...,...
14952,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:04:17,True
14953,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:05:05,True
14954,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:23:53,True
14955,ffc444d0-8382-4e52-9f04-3c42601ec739,2016-06-14 15:24:13,True


In [9]:
df_build.describe()

Unnamed: 0.1,Unnamed: 0
count,14957.0
mean,7478.0
std,4317.858323
min,0.0
25%,3739.0
50%,7478.0
75%,11217.0
max,14956.0


# Data Preprocessing / Cleaning

The data was originally available as individual JSON events, with great support for parsing with Java or C#. Since we have to work with Python, we used Java to convert the relevant information into CSV without cleaning the data first.

* remove wrong data
* parse timestamps
* split into sessions

<i> --> TODO Claudio </i>
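The three cleaning steps listed above could look roughly like this in pandas. This is only a sketch on invented data: which rows count as "wrong" and the 30-minute session gap (`SESSION_GAP`) are assumptions made for illustration, not the rules actually used in this project.

```python
import pandas as pd

# Toy raw events; column names mirror the CSVs, values are invented.
raw = pd.DataFrame({
    "timestamp": ["2016-10-04 14:36:01", "2016-10-04 14:36:07",
                  "not-a-date", "2016-10-04 16:00:00"],
    "totalTests": [3, -1, 2, 5],
})

# 1) Parse timestamps; unparseable values become NaT.
raw["timestamp"] = pd.to_datetime(raw["timestamp"], errors="coerce")

# 2) Remove wrong data (assumed here: missing timestamps, negative counts).
clean = raw.dropna(subset=["timestamp"])
clean = clean[clean["totalTests"] >= 0]
clean = clean.sort_values("timestamp").reset_index(drop=True)

# 3) Split into sessions: a new session starts after a gap larger than
#    a hypothetical threshold (30 minutes is an assumption).
SESSION_GAP = pd.Timedelta(minutes=30)
gap = clean["timestamp"].diff()
clean["sessionID"] = (gap > SESSION_GAP).cumsum()
print(clean)
```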

# Model / Algorithm Building

- chapter 6
- we should agree on some basics
- fit model with optimization methods?
- linear vs non-linear, blackbox vs descriptive, first principle vs data driven, stochastic vs deterministic, flat vs hierarchical
- model evaluation: training data set, validation data set, test data set
- classifiers, value prediction, absolute / relative / squared error, baseline models?

In [10]:
grouped = df_edit.groupby(['sessionID'])
sessionStarts = grouped.agg({'timestamp':np.min}).to_dict()

## Predicting Build success

* first investigation: how likely is a build to succeed
* hypothesis: longer time or more edits between builds = less probability of build succeeding

### Predicting Build success by time since last build

* first up: prepare data
 * starting points for sessions
 * time since last build

In [11]:
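# build events are ordered by session id and timestamp
previous = {'sessionID': 'nonexistent-atsirtsakitaiea'}
time_to_build = []
for _, event in df_build.iterrows():
    try:
        # compare by value: `is` checks object identity and is unreliable for strings
        if previous['sessionID'] == event['sessionID']:
            begin = previous['timestamp']
        else:
            begin = sessionStarts['timestamp'][event['sessionID']]
        end = event['timestamp']
        # NOTE: .seconds drops whole days and wraps negative durations
        # (a duration of -6 s is reported as 86394); .total_seconds() avoids both
        duration = pd.Timedelta(pd.to_datetime(end)-pd.to_datetime(begin)).seconds
        time_to_build.append((duration,event['buildSuccessful']))
    except Exception:
        # e.g. no session start known for this session: skip the event
        pass
    previous = event
time_to_build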

[(86394, False),
 (12, False),
 (43, False),
 (3, False),
 (0, True),
 (146, False),
 (520, True),
 (3331, True),
 (2154, True),
 (572, True),
 (1795, True),
 (216, True),
 (50, True),
 (1154, True),
 (1683, True),
 (1493, True),
 (188, True),
 (451, True),
 (204, True),
 (114, True),
 (310, True),
 (146, True),
 (51, True),
 (265, False),
 (137, False),
 (116, False),
 (119, False),
 (52, True),
 (176, True),
 (94, True),
 (42, True),
 (194, True),
 (170, True),
 (38, True),
 (90, False),
 (37, True),
 (90, True),
 (196, True),
 (93, True),
 (82, True),
 (138, True),
 (405, True),
 (70, True),
 (412, True),
 (289, True),
 (26, True),
 (2711, True),
 (544, False),
 (50, False),
 (100, True),
 (19, True),
 (114, False),
 (24, False),
 (27, False),
 (57, False),
 (61, False),
 (8, False),
 (8, False),
 (74, False),
 (301, True),
 (81287, True),
 (9707, False),
 (20, True),
 (30, True),
 (79399, True),
 (42, True),
 (15, True),
 (10, True),
 (35, True),
 (11, True),
 (4056, True),
 (524, 

In [12]:
def analyze_classifiers(x,y):
    classifiers = [('linear',SGDClassifier()),
                   ('logistic',LogisticRegression()),
                   ('knn',KNeighborsClassifier(3))]
    for name,model in classifiers:
        accuracies = []
        kf = KFold(n_splits=5,shuffle=True)
        for train_index, test_index in kf.split(x):
            x_train, x_test = x.iloc[train_index], x.iloc[test_index]
            y_train, y_test = y[train_index], y[test_index]
            fit = model.fit(x_train, y_train)
            accuracies.append(accuracy_score(y_test, model.predict(x_test), normalize=True))
        print('accuracy of',name, 'is',np.mean(accuracies))
    

In [13]:
frame = pd.DataFrame(time_to_build, columns=['time','success'])
x = frame[['time']]
y = frame['success']

analyze_classifiers(x,y)

accuracy of linear is 0.8617757397051802
accuracy of logistic is 0.8617762764324187
accuracy of knn is 0.829744931561686
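Because most builds succeed, an accuracy figure is easier to judge next to a trivial baseline. A minimal sketch, assuming a baseline that always predicts the most frequent class (`majority_baseline_accuracy` and the toy labels are illustrative and not part of the project code):

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    labels = np.asarray(labels, dtype=bool)
    share_true = labels.mean()
    return float(max(share_true, 1.0 - share_true))

# Invented labels: 8 successful builds, 2 failed ones.
toy_success = [True] * 8 + [False] * 2
print(majority_baseline_accuracy(toy_success))
```

Applied to `frame['success']`, this would show how much of the reported accuracy a classifier gains over simply predicting "success" every time.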


### Predicting build success by number of edit events

* Step 1: For each sessionID:
 - (a) get the timestamp of a successful build
 - (b) get the timestamp of the next build after (a)
* Step 2: For each sessionID:
 - count the number of edits between the two timestamps from Step 1
 - if no build event occurs after the last successful build, don't count
* Step 3: Create a new DataFrame with the following variables:
 - sessionID, timestampSuccessBuild, timestampNextBuild, #editsUntilNextBuild and nextBuildResult
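Counting the edits between two timestamps is the step that gets repeated most often. A minimal sketch of a faster variant, assuming the edit timestamps of one session are already sorted and stored as a NumPy `datetime64` array (`count_edits_between` and the sample values are hypothetical):

```python
import numpy as np

def count_edits_between(edit_times, start, end):
    """Count edits with start <= t <= end; edit_times must be sorted ascending."""
    lo = np.searchsorted(edit_times, start, side="left")
    hi = np.searchsorted(edit_times, end, side="right")
    return int(hi - lo)

# Invented edit timestamps of one session, already sorted.
edit_times = np.array([
    "2016-10-04T14:36:01", "2016-10-04T14:36:07",
    "2016-10-04T14:36:49", "2016-10-04T14:36:53",
], dtype="datetime64[s]")

last_success = np.datetime64("2016-10-04T14:36:05")
next_build = np.datetime64("2016-10-04T14:36:50")
print(count_edits_between(edit_times, last_success, next_build))
```

The binary search replaces a full boolean filter over all edit events per pair of builds, so each count costs logarithmic instead of linear time in the number of edits.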



In [14]:
def get_all_build(sessionID):
    all_build = buildEvents[buildEvents["sessionID"] == sessionID]
    return all_build.sort_values(["timestamp"]).values

In [15]:
def get_successful_build(sessionID):
    all_successful_build = buildEvents[(buildEvents["sessionID"] == sessionID) & (buildEvents["buildSuccessful"] == True)] 
    return all_successful_build.sort_values(["timestamp"]).values

In [16]:
def get_nr_edits_between_build(sessionID,timeLastSuccessfullBuild, timeNextBuild):
    totalEdit = editEvents[editEvents["sessionID"] == sessionID]
    totalEdit = totalEdit.sort_values(["timestamp"])
    editBetweenSuccessBuildAndBuild = totalEdit[(totalEdit["timestamp"] <= timeNextBuild) & (totalEdit["timestamp"] >= timeLastSuccessfullBuild)]
    return editBetweenSuccessBuildAndBuild["timestamp"].values.size

In [17]:
def edits_from_pass_to_next_build(sessionID):
    result = []

    # get all successful build of session 
    allSuccessfulBuild = get_successful_build(sessionID)
    allBuild = get_all_build(sessionID)
    numOfSuccessfulBuild = np.size(allSuccessfulBuild,0)
    numOfBuild = np.size(allBuild,0)
    # iterate over each passed build
    for s in range(numOfSuccessfulBuild):
        # if build follows, count edits
        if allBuild[-1][2]>allSuccessfulBuild[s][2]:
            for b in range(numOfBuild):
                if allBuild[b][2]>allSuccessfulBuild[s][2]:
                    break
            timeLastSuccessfulBuild=allSuccessfulBuild[s][2]
            timeNextBuild=allBuild[b][2]
            resultNextBuild=allBuild[b][3]
            nrOfEdit=get_nr_edits_between_build(sessionID,timeLastSuccessfulBuild, timeNextBuild)
            sessionID = allBuild[b][1]
            time_passed = pd.Timedelta(datetime.strptime(timeNextBuild, '%Y-%m-%d %H:%M:%S')-datetime.strptime(timeLastSuccessfulBuild, '%Y-%m-%d %H:%M:%S')).seconds
            
            result.append([sessionID, timeLastSuccessfulBuild, timeNextBuild, time_passed, nrOfEdit, resultNextBuild])
            
    return result

#### Iterate over all sessions and create DataFrame

In [18]:
final_result=[]
allSessionID=buildEvents.sessionID.unique()
print(f'time before loop: {datetime.now(tz=None)} \n')
# get edits_from_pass_to_next_build for all sessionID
for sessionID in allSessionID:
    Observations = edits_from_pass_to_next_build(sessionID)
    NrOfObservations = len(Observations)
    # take only sessionID with at least 1 successful build and at least 1 build after that  
    if NrOfObservations > 0:
        for Obs in Observations:
            final_result.append(Obs)
print(f'time after loop: {datetime.now(tz=None)} \n')

time before loop: 2020-05-29 14:51:26.656752 



ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\Sev\anaconda3\envs\pai2020\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-2dc4b385240f>", line 6, in <module>
    Observations = edits_from_pass_to_next_build(sessionID)
  File "<ipython-input-17-19f07ffdffad>", line 19, in edits_from_pass_to_next_build
    nrOfEdit=get_nr_edits_between_build(sessionID,timeLastSuccessfulBuild, timeNextBuild)
  File "<ipython-input-16-afc728217508>", line 2, in get_nr_edits_between_build
    totalEdit = editEvents[editEvents["sessionID"] == sessionID]
  File "C:\Users\Sev\anaconda3\envs\pai2020\lib\site-packages\pandas\core\frame.py", line 2791, in __getitem__
    return self._getitem_bool_array(key)
  File "C:\Users\Sev\anaconda3\envs\pai2020\lib\site-packages\pandas\core\frame.py", line 2844, in _getitem_bool_array
    indexer = key.nonzero()[0]
KeyboardInterrupt

During handling of the above 

KeyboardInterrupt: 

In [None]:
#create DataFrame
column_labels = ['sessionID','timestampSuccessBuild','timestampNextBuild','time_passed', '#editsUntilNextBuild', 'nextBuildResult']
df = pd.DataFrame(final_result, columns=column_labels)
df

#### Create model

In [None]:
from sklearn import linear_model
from sklearn import model_selection
from sklearn import metrics
import statsmodels.tools.tools as sm
import statsmodels.api as sm1
import seaborn as sns
import imblearn
from imblearn.over_sampling import SMOTE

In [None]:
df.isnull().any()

In [None]:
sns.countplot(x="nextBuildResult",data=df)
plt.show()
count_failedBuild = len(df[df["nextBuildResult"]==False])
count_passedBuild = len(df[df["nextBuildResult"]==True])
pct_failedBuild = round(count_failedBuild/(count_failedBuild+count_passedBuild)*100,1)
print(f'only {pct_failedBuild}% of the build-events fail')

In [None]:
df.groupby("nextBuildResult").agg({"#editsUntilNextBuild":["count","mean","max"]})

* Our classes (successful vs. failed build events) are strongly imbalanced.
* The average number of edits between the last successful build and the next build event is more than twice as high for failed builds as for successful builds. This seems to support our hypothesis.

Let's do some more exploration


In [None]:
sns.boxplot(x=df["nextBuildResult"], y=df["#editsUntilNextBuild"])



There might be outliers in the dataset. We might want to keep only observations with at most 200 (or 75?) edits between build events.
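To compare candidate cutoffs before committing to one, it can help to look at how many observations each of them keeps. A small sketch on invented edit counts (`retained_share` is a hypothetical helper; the percentile is shown only as one possible data-driven alternative):

```python
import numpy as np

def retained_share(edit_counts, cutoff):
    """Share of observations with 0 < edits < cutoff."""
    edit_counts = np.asarray(edit_counts)
    return float(np.mean((edit_counts > 0) & (edit_counts < cutoff)))

# Invented edit counts with a long right tail.
toy_edits = [1, 2, 3, 5, 8, 13, 21, 60, 150, 900]
for cutoff in (75, 200):
    print(cutoff, retained_share(toy_edits, cutoff))

# A percentile offers an alternative to a hand-picked cutoff.
print(np.percentile(toy_edits, 95))
```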


In [None]:
df1=df.loc[(df["#editsUntilNextBuild"]>0) & (df["#editsUntilNextBuild"]<200)]
df1.groupby("nextBuildResult").agg({"#editsUntilNextBuild":["count","mean","max"]})

In [None]:
df2=df.loc[(df["#editsUntilNextBuild"]>0) & (df["#editsUntilNextBuild"]<75)]
df2.groupby("nextBuildResult").agg({"#editsUntilNextBuild":["count","mean","max"]})

In [None]:
sns.boxplot(x=df1["nextBuildResult"], y=df1["#editsUntilNextBuild"])

In [None]:
sns.boxplot(x=df2["nextBuildResult"], y=df2["#editsUntilNextBuild"])

In [None]:
# NOTE: df3 is not defined anywhere above, so this cell raises a NameError as written
sns.boxplot(x=df3["nextBuildResult"], y=df3["#editsUntilNextBuild"])

In [None]:
print(f'{round(len(df2)/len(df)*100,1)}% of all observations have fewer than 75 edits between build events')

In [None]:
plt.scatter(df["#editsUntilNextBuild"],df["nextBuildResult"])
plt.xlabel("Edits from successful build until next build")
plt.ylabel("next build success")
plt.xlim(0,100)

Logistic Model:

Goal: predict the probability of the categorical dependent variable (i.e. success or failure of the build event). The logistic regression predicts P(Y=success) as a function of X (nr. of edits).
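For a single predictor, the logistic model can be written as follows, where $\beta_0$ is the intercept and $\beta_1$ the coefficient of the number of edits $x$ (the symbol names are our notation, not taken from the model output):

$$
P(Y = \text{success} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\qquad\Longleftrightarrow\qquad
\log\frac{P(Y = \text{success} \mid x)}{1 - P(Y = \text{success} \mid x)} = \beta_0 + \beta_1 x
$$

A negative $\beta_1$ therefore means that each additional edit lowers the log-odds of a successful build by $|\beta_1|$.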

In [None]:
Y = df1["nextBuildResult"]
X = df1[["#editsUntilNextBuild",'time_passed']]
X1 = sm.add_constant(X)
logit_model = sm1.Logit(Y, X1)
result_logit_model = logit_model.fit()
print(result_logit_model.summary())

The coefficient of 'time_passed' is not statistically significant (at the 90% confidence level), so we remove it.

In [None]:
Y = df1["nextBuildResult"]
X = df1["#editsUntilNextBuild"]
X1 = sm.add_constant(X)
logit_model = sm1.Logit(Y, X1)
result_logit_model = logit_model.fit()
print(result_logit_model.summary())

As we saw in the data exploration part, our dataset is strongly imbalanced. Let's try to balance it!

In [None]:
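# named `smote` to avoid shadowing the standard-library module `os`
smote = SMOTE(random_state=0)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X1, Y, test_size=0.1,random_state=0)

# fit_resample replaces fit_sample, which was removed from imbalanced-learn
os_data_X, os_data_Y = smote.fit_resample(X_train,Y_train)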
os_data_X = pd.DataFrame(data=os_data_X, columns=['constant','#editsUntilNextBuild'])
os_data_Y = pd.DataFrame(data=os_data_Y, columns=['nextBuildResult'])

print("length of oversampled data is", len(os_data_X))
print("Number of failed Build in oversampled data", len(os_data_Y[os_data_Y['nextBuildResult']==False]))
print("Number of successful Build", len(os_data_Y[os_data_Y['nextBuildResult']==True]))
print("Proportion of failed Build in oversampled data is",len(os_data_Y[os_data_Y['nextBuildResult']==False])/len(os_data_X))
print("Proportion of successful Build in oversampled data is",len(os_data_Y[os_data_Y['nextBuildResult']==True])/len(os_data_X))

logit_model=sm1.Logit(os_data_Y,os_data_X)
result_logit=logit_model.fit()
print(result_logit.summary2())

In [None]:
#Y = df1["nextBuildResult"]
#X = df1[["#editsUntilNextBuild",'time_passed']].values
#X_train, X_test, y_train, y_test = model_selection.train_test_split(os_data_X.values, os_data_Y, test_size=0.1,random_state=0)
#X_test=sm.add_constant(X_test)
logit_model=linear_model.LogisticRegression()
# ravel() turns the (n, 1) column into the 1-D label array scikit-learn expects
logit_model.fit(os_data_X, os_data_Y.values.ravel())

print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logit_model.score(X_test,Y_test)))

In [None]:
Y_pred = logit_model.predict(X_test)
confusion_matrix = metrics.confusion_matrix(Y_test,Y_pred)
print(confusion_matrix)
print(metrics.classification_report(Y_test,Y_pred))

The coefficient for the nr. of edits is statistically significant, and the number of edits seems to have a negative effect on the probability of a successful build. However, the pseudo R-squared of this logistic regression is very small. This leads us to think that there might be other (omitted) variables that have a much greater influence on the probability of successful build events.
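To make the pseudo R-squared more tangible, the sketch below computes McFadden's variant by hand, assuming this is the measure reported in the summary (statsmodels exposes it as `prsquared` on Logit results). The helper names and the toy probabilities are invented for illustration.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_r2(y, p_model):
    """1 - LL(model) / LL(null), where the null model predicts the mean of y."""
    y = np.asarray(y, dtype=float)
    p_null = np.full_like(y, y.mean())
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)

y_toy = [1, 1, 1, 0]
print(mcfadden_r2(y_toy, [0.75, 0.75, 0.75, 0.75]))  # no better than the null model
print(mcfadden_r2(y_toy, [0.9, 0.9, 0.9, 0.2]))      # clearly better than the null model
```

A value close to zero means the model explains little beyond the overall success rate, which fits the interpretation that important variables are missing.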

## Successful Tests percentage

### Time since last tests run

In [None]:
# gets all passed tests for sessionID
def get_passed_tests(sessionID):
    allPassedTests = testEvents[(testEvents["sessionID"] == sessionID) 
                        & (testEvents["testsPassed"] > 0)]
    return np.asarray(allPassedTests)

# counts the numbers of edits between a test and the next test
# for our purpose, be aware to only call it with passed tests
def get_nr_edits_between_tests(sessionID, timeFirstTest , timeNextTest):
    result = editEvents[(editEvents["timestamp"] < timeNextTest)
                                 & (editEvents["timestamp"] > timeFirstTest) 
                                 & (editEvents["sessionID"] == sessionID)]
    result = np.asarray(result)
    return len(result)

In [None]:
# counts nr of edits from passed test until next test event
# returns a list
def edits_from_pass_to_next_test(sessionID):
    result = []
    resultRow = []
    # get all passed tests of the session 
    allPassedTests = get_passed_tests(sessionID)

    numOfRows = np.size(allPassedTests, 0)
    # iterate over each passed test
    for p in range(numOfRows):
        timePass = allPassedTests[p][2]

        # no following test
        if p == numOfRows-1:
            hasNextTest = False
        else:
            hasNextTest = True
            timeTestNext = allPassedTests[p+1][2]
            resultNext = allPassedTests[p+1][4]

        # for each passed test, get nr of edits until next test
        if hasNextTest:
            nrOfEdits = get_nr_edits_between_tests(sessionID,timePass, timeTestNext)
            print(f'time of pass to append: {timePass}')
            print(f'nr of edits: {nrOfEdits}')
            print(f'result next: {resultNext} \n')
            result.append([sessionID, timePass, timeTestNext, nrOfEdits, resultNext])
            
    return result

In [None]:
# creates a DF with 8 columns
def create_df(resultArray):
    print(f'Length result: {len(resultArray)} \n\n')

    # Create DataFrame
    column_labels = ['sessionID','timePass','timeNext', 'editsUntilNextTest', 'ratio P/T','totalTestsNext', 'testPassedNext' ,'booleanNextTest']
    df = pd.DataFrame(resultArray, columns=column_labels)
    return df

#### Iterate over sessions and create DF

In [None]:
# Creates a DF and returns a list that contains among other things
# the nr of edits from a passing test until the next test event and whether this next test has been successful
def get_edits_tests_list():
    sessionWithPassedTests=0
    sessionWithoutPassedTests=0
    sessionWithPassedTestsButNoNextTest=0
    result=[]
    print(f'time before loop: {datetime.now(tz=None)} \n')
    for index, row in testEvents.iterrows():
        if(row["testsPassed"] > 0):
            sessionWithPassedTests+=1
            #has next test
            if((index+1 < len(testEvents)) and (testEvents.iloc[index+1].sessionID == row["sessionID"])):
                item=[]
                timestampNext=testEvents.iloc[index+1].timestamp
                totalTestsNext=testEvents.iloc[index+1].totalTests
                passedTestNext=testEvents.iloc[index+1].testsPassed
                nrOfEdits=get_nr_edits_between_tests(row["sessionID"], row["timestamp"], timestampNext)
                if(passedTestNext == 0):
                    resultTestNextBoolean = 0
                else:
                    resultTestNextBoolean = 1
                if(totalTestsNext != 0):
                    ratioPT = (passedTestNext/totalTestsNext)
                else:
                    ratioPT = 0
                item.append(row["sessionID"])
                item.append(row["timestamp"])
                item.append(timestampNext)
                item.append(nrOfEdits)
                item.append(float(ratioPT))
                item.append(int(totalTestsNext))
                item.append(int(passedTestNext))
                item.append(int(resultTestNextBoolean))
                result.append(item)
            else:
                sessionWithPassedTestsButNoNextTest+=1
        else:
            sessionWithoutPassedTests+=1
    print(f'with passed: {sessionWithPassedTests}')
    print(f'with passed test but no next test: {sessionWithPassedTestsButNoNextTest}')
    print(f'without passed test: {sessionWithoutPassedTests}')
    print(f'total: {sessionWithoutPassedTests+sessionWithPassedTests}')
    print(f'time after loop and df: {datetime.now(tz=None)} \n')
    
    print(result[0])
    print(result[1])
    print(result[2])

    
    return result

In [None]:
edits_tests= get_edits_tests_list()

In [None]:
create_df(edits_tests)

In [None]:
#calculate avg 
print(len(edits_tests))
edits_tests_array = np.asarray(edits_tests, dtype='O')
# columns: 3 = nr of edits until next test, 7 = next test passed (0/1)
nrEventsTest = edits_tests_array[:, [3, 7]]
nrOfEvents = edits_tests_array[:,3]
testPass = edits_tests_array[:,4]

#sort ascending by nr of edits
nrEventsTestSorted = np.asarray(sorted(nrEventsTest, key=lambda entry: entry[0]))

#roughly 200 observations per chunk
numChunks = math.ceil(len(nrOfEvents)/200)
print(f'number of chunks: {numChunks}')

chunks = np.array_split(nrEventsTestSorted, numChunks)
print(f'first chunk len: {len(chunks[0])}')
print(f'last chunk len: {len(chunks[-1])}')

#failed next tests per chunk
fails = [np.count_nonzero(chunk[:,1] == 0) for chunk in chunks]
print(fails)

#passed next tests per chunk
passList = [np.count_nonzero(chunk[:,1] == 1) for chunk in chunks]
print(passList)

In [None]:
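# dtype='O' keeps the numeric columns numeric instead of casting every value to a string
edits_tests_array = np.asarray(edits_tests, dtype='O')

nrOfEvents = edits_tests_array[:,3]
testPass = edits_tests_array[:,4]
start = min(nrOfEvents)
end = max(nrOfEvents)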


#plot 1
fig, ax = plt.subplots()
ax.scatter(nrOfEvents, testPass)
plt.xlabel('nr of events from passed test until next test')
plt.ylabel('nr of passed tests')
# Make a plot with major ticks that are multiples of 20 and minor ticks that
# are multiples of 5.  Label major ticks with '%d' formatting but don't label
# minor ticks.
ax.xaxis.set_major_locator(MultipleLocator(20))
ax.xaxis.set_major_formatter(FormatStrFormatter('%d'))
# For the minor ticks, use no labels; default NullFormatter.
ax.xaxis.set_minor_locator(MultipleLocator(5))
#yaxis
ax.yaxis.set_major_locator(MultipleLocator(20))
ax.yaxis.set_major_formatter(FormatStrFormatter('%d'))
ax.yaxis.set_minor_locator(MultipleLocator(10))
plt.show()


#plot 3
df = create_df(edits_tests)
df.groupby(['editsUntilNextTest','booleanNextTest']).size().unstack().plot(kind='bar',stacked=True)

#### Build model regression

In [None]:
df_edit_tests = create_df(edits_tests)
grouped = df_edit_tests.groupby(['sessionID'])
sessionStarts = grouped.agg({'timePass':np.min}).to_dict()

In [None]:
from sklearn.model_selection import train_test_split

edits_tests_array = np.array(edits_tests, dtype="O")
edits_ratio_array = edits_tests_array[:,3:5]
edits_ratio_list = edits_ratio_array.tolist()
x = edits_tests_array[:,3]
y = edits_tests_array[:,4]
print(type(x[2]))
print(type(y[2]))

frame = pd.DataFrame(edits_ratio_list, columns=['edits','success'])
print(frame.head(20))

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [None]:
def analyze_classifiers(x,y):
    classifiers = [('linear',SGDClassifier()),
                   ('logistic',LogisticRegression()),
                   ('knn',KNeighborsClassifier(3))]
    for name,model in classifiers:
        accuracies = []
        kf = KFold(n_splits=5,shuffle=True)
        for train_index, test_index in kf.split(x):
            x_train, x_test = x.iloc[train_index], x.iloc[test_index]
            y_train, y_test = y[train_index], y[test_index]
            fit = model.fit(x_train, y_train)
            accuracies.append(accuracy_score(y_test, model.predict(x_test), normalize=True))
        print('accuracy of',name, 'is',np.mean(accuracies))

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);

# Data Visualization

- we should agree on some basic design / plots such that we have a consistent viz