# Elog Tagging

The goal is to tag elog entries with the correct tag. To do this we need to:
* Scrape the entry data from the elog
* Get the corresponding tag for each entry
* Run the entries through an NLP algorithm to train a tagger (NOTE: can't really do this right now because our existing tagging is unreliable, so we'd be training on bad data)

In [74]:
import pandas as pd
import numpy as np
import requests
import time
from datetime import datetime
from sqlalchemy import create_engine

In [87]:
def get_data(s,e):
    '''
    --- Imports data from Elog and stores it in a workable format ---
    INPUT
        s: start time as unix timestamp
        e: end time as unix time stamp
    RETURN
        df: dataframe of uncleaned data between selected time range
    '''
    
    # api-endpoint 
    URL = "https://mccelog.slac.stanford.edu/elog/dev/mgibbs/dev_elog_display_json.php"

    PARAMS = {'logbook': 'MCC', 'start': s, 'end': e} 

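    # sending get request and saving the response as response object 
    r = requests.get(url = URL, params = PARAMS) 
    # fail loudly on an HTTP error instead of trying to parse an error page as json
    r.raise_for_status()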

    # extracting data in json format 
    data = r.json()

    # Turning list of json objects into dataframe
    df = pd.DataFrame.from_records(data)

    return df

In [99]:
# Just checking that things work as expected
s = datetime(2008, 1, 11, 0, 0).timestamp()
e = datetime(2009, 1, 11, 0, 0).timestamp()
df = get_data(s,e)
print(df.shape)
df.head()

(24284, 14)


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,children,parent,attachments,superseded_by,supersedes,highPriority,tag
0,270417,"MCC Shift Change: Owl Shift, Sunday, 11-Jan-2009",.250 nC 13.6 GeV 10 Hz e- to main dump. Undula...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1053, 'username': 'spw', 'firstna...",1231660800,"Owl Shift, Sun, 11-Jan-09",,,,,,,
1,270419,SWING SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t <TR><TD><...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660799,"Swing Shift, Sat, 10-Jan-09",,,,,,,
2,270415,* RE: Frisch 6x6 misbehaving,Disabled BSY/LTU energy part of Frisch feedbac...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660530,"Swing Shift, Sat, 10-Jan-09",[270428],270413.0,,,,,
3,270412,Instructions for resetting BSOBTH02,Go to the large blue box on the <u>North</u> h...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1160, 'username': 'jab', 'firstna...",1231660060,"Swing Shift, Sat, 10-Jan-09",,,,,,,
4,270413,Frisch 6x6 misbehaving,LTU energy BPM DL1 oscillating about 2mm. Pag...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231659900,"Swing Shift, Sat, 10-Jan-09",[270415],,,,,,


### We can now load the data into a dataframe, but it still contains a lot of useless columns. Let's get rid of them

* `logbook` (always MCC)
* `author`, `eventTime`, `shift`, `parent`, `children`, `attachments`, `supersedes`, `highPriority` (irrelevant)

This leaves the following columns: `title`, `text`, `elogid`, `tag`, and `superseded_by`
* `superseded_by` is useful because any row where it is not NaN can be dropped. When an entry is superseded, the elog keeps both the original and the replacement, so we effectively have duplicate entries and only want to keep one copy (the correct one). So we drop the original entries, i.e. the rows where `superseded_by` is not NaN, and then delete this column

Finally we'd be left with: `title`, `text`, `elogid`, `tag` <br>
<b> Questions </b> 
* The NLP algorithm should really only be working on the `title` of the entries, right? It's great if there are more keywords in the body, but the title should be enough to tag the location (in my head). If this is true then there's no need for `text`
* Do I really need `elogid` for anything...? If I keep both `title` and `text` I definitely do; if not, I see no need
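
To make the `superseded_by` reasoning concrete, here is a minimal sketch on a made-up three-row frame. The ids, titles, and tags are invented for illustration and are not real elog entries. Entry 1 was superseded by entry 2, so only entries 2 and 3 should survive.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: entry 1 was edited and replaced by entry 2
toy = pd.DataFrame({
    'elogid': [1, 2, 3],
    'title': ['Reboot IOC', 'Reboot IOC (corrected)', 'BPM oscillating'],
    'tag': ['LCLS', 'LCLS', 'LCLS'],
    'superseded_by': [2.0, np.nan, np.nan],
})

# Keep only rows that were never superseded, then drop the helper column
kept = toy[toy['superseded_by'].isnull()].drop(columns=['superseded_by'])
kept = kept.reset_index(drop=True)
print(kept)
```

The same two steps (filter on `isnull()`, then drop the column) are what the cleaning function needs to do on the real data.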

In [100]:
def clean_data(df):
    '''
    --- Cleans data frame ---
    INPUT
        df: dataframe (not cleaned)
    RETURN
        df: dataframe (cleaned)
    '''
    # Dropping rows without any tags (these rows are useless for us)
    df = df[df['tag'].notnull()]
    
    # Dropping useless columns (keep only the important ones, in their original order)
    important_cols = {'title', 'text', 'elogid', 'tag', 'superseded_by'}
    df = df[[col for col in df.columns if col in important_cols]]

    # Dropping all rows where superseded_by is not null to essentially drop duplicates. Then drop the superseded_by column
    df = df[df['superseded_by'].isnull()]
    df = df.drop(['superseded_by'], axis=1)
    
    # Reset the index
    df = df.reset_index(drop=True)
    
    return df

In [101]:
# Just checking that things work as expected
df = clean_data(df)
print(df.shape)
df.head()

(49, 4)


Unnamed: 0,elogid,title,text,tag
0,265530,Restart LCLS Magnet ChannelWatcher,I've restarted the lcls magnet channel watcher...,LCLS
1,259842,BYKIK pulse width change,Tony Beukers and I chagned the BYKIK pulse wid...,LCLS
2,252459,* Re: SW: Reboot BC1 Bunch Length IOCs-,Greg Dallt from the Klystron Group is working ...,LCLS
3,252453,SW: Reboot BC1 Bunch Length IOCs-,Rebooted Bunch Length Monitor EPICS IOC in li2...,LCLS
4,252399,Fallout from 120Hz Testing: BCS: Gun SBI (20-5...,"Hello,\n\nAfter the 120Hz testing, after the c...",LCLS


In [98]:
# Checking to see the number of tags present
df.tag.value_counts()

LCLS    49
Name: tag, dtype: int64

<b> Now let's save the data in a way that we can easily access </b>

In [80]:
# Function to save the data into a SQLite database
def save_data(df, database_filename):
    '''Writes df to <database_filename>.db, in a table that is also named database_filename'''
    engine = create_engine('sqlite:///'+database_filename+'.db')
    df.to_sql(database_filename, engine, index=False)
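
Reading the table back is a quick way to confirm that a save worked. The sketch below is an illustration under a few assumptions: it uses Python's built-in `sqlite3` with an in-memory database rather than the `create_engine` file above, purely so it runs without leaving a file behind, and the table name `elog_data` is just an example. `if_exists='replace'` is also an addition here. By default `to_sql` raises a `ValueError` if the table already exists, which matters when the save is rerun.

```python
import sqlite3
import pandas as pd

# Two rows copied from the cleaned output above, just to have something to save
sample = pd.DataFrame({
    'elogid': [265530, 259842],
    'title': ['Restart LCLS Magnet ChannelWatcher', 'BYKIK pulse width change'],
    'tag': ['LCLS', 'LCLS'],
})

# In-memory database (an assumption for this sketch; the notebook writes a .db file)
conn = sqlite3.connect(':memory:')

# 'replace' overwrites an existing table instead of raising ValueError
sample.to_sql('elog_data', conn, index=False, if_exists='replace')

# Read everything back into a dataframe
loaded = pd.read_sql_query('SELECT * FROM elog_data', conn)
conn.close()
print(loaded)
```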

### Important changes that still need to be made:
* What time frame do we need to capture all of the needed data?
> Looks like you want to capture data up till 2011. Perhaps the most efficient way to do this would be to fetch the data by month or by year, process each chunk individually, and then combine the chunks into one giant dataframe. You would likely have to add a function that does this and call it from your `main()` function
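
One way the month-by-month fetch could look, as a rough sketch. `month_windows` and `get_data_chunked` are hypothetical helper names, and `fetch` is assumed to behave like `get_data`: two unix timestamps in, a dataframe with an `elogid` column out. The `drop_duplicates` step is there because an entry whose timestamp falls exactly on a window boundary can come back in both neighbouring windows.

```python
from datetime import datetime
import pandas as pd

def month_windows(start, end):
    '''Yield (window_start, window_end) datetime pairs, at most one calendar month each.'''
    cur = start
    while cur < end:
        # First day of the month after cur
        if cur.month == 12:
            nxt = datetime(cur.year + 1, 1, 1)
        else:
            nxt = datetime(cur.year, cur.month + 1, 1)
        window_end = min(nxt, end)
        yield cur, window_end
        cur = window_end

def get_data_chunked(fetch, start, end):
    '''
    Fetch [start, end] one month at a time and combine the results.
    Assumes start < end and that every chunk has an elogid column.
    '''
    frames = [fetch(s.timestamp(), e.timestamp()) for s, e in month_windows(start, end)]
    combined = pd.concat(frames, sort=False)
    # Boundary entries can show up in two neighbouring windows, so keep one copy
    combined = combined.drop_duplicates(subset='elogid', keep='first')
    return combined.reset_index(drop=True)

# Hypothetical usage with the notebook's own function:
# df_all = get_data_chunked(get_data, datetime(2007, 1, 1), datetime(2012, 1, 1))
```

Passing the fetch function in as an argument keeps the chunking logic testable without hitting the elog server.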

In [125]:
def main():
    '''
    Will go through all the necessary steps to extract the data from the elog, clean it, and save the data
    in an SQL database
    '''
    s = datetime(2009, 1, 11, 0, 0).timestamp()
    e = datetime(2010, 1, 11, 0, 0).timestamp()
    df = get_data(s,e)
    df = clean_data(df)
    save_data(df,'elog_data')

In [126]:
# Running this will save the data that we want to collect
main()

In [135]:
s = datetime(2009, 1, 11, 0, 0).timestamp()
e = datetime(2010, 1, 11, 0, 0).timestamp()
df1 = get_data(s,e)
df1.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,parent,children,superseded_by,supersedes,attachments,highPriority,tag
0,372644,"Hollosi out, MCC un - populated.",,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1062, 'username': 'cfh', 'firstna...",1262999006,"Swing Shift, Fri, 08-Jan-10",,,,,,,
1,372634,"Done, Re: Sec. CID and 0 / 1 being searched.",,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1241, 'username': 'skalsi', 'firs...",1262995291,"Swing Shift, Fri, 08-Jan-10",372618.0,,,,,,
2,372626,"Blackwell reports Sec. 26 ACC., WG. water pump...",,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1062, 'username': 'cfh', 'firstna...",1262993310,"Day Shift, Fri, 08-Jan-10",,,,,,,
3,372619,Summary of refrigerated baffle brine pump inci...,This is a repeat of the incident from last Aug...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1163, 'username': 'cyterski', 'fi...",1262991853,"Day Shift, Fri, 08-Jan-10",,,,,,,
4,372618,Sec. CID and 0 / 1 being searched.,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1062, 'username': 'cfh', 'firstna...",1262990863,"Day Shift, Fri, 08-Jan-10",,[372634],,,,,


In [134]:
s = datetime(2010, 1, 11, 0, 0).timestamp()
e = datetime(2011, 1, 11, 0, 0).timestamp()
df2 = get_data(s,e)
df2.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,superseded_by,children,attachments,supersedes,parent,tag,highPriority
0,461672,"MCC Shift Change: Owl Shift, Tuesday, 11-Jan-2011",LCLS-MD using Hutch 1,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1051, 'username': 'hvs', 'firstna...",1294732800,"Owl Shift, Tue, 11-Jan-11",,,,,,,
1,461667,SWING SHIFT SUMMARY,"<table CellPadding=""5"" BORDER=1>\n<tr>\n<th>Co...","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1301, 'username': 'mgibbs', 'firs...",1294732799,"Swing Shift, Mon, 10-Jan-11",,,,,,,
2,461659,50PR3 'Not Out' causing MPS 10 Hz rate limit,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1068, 'username': 'sommer', 'firs...",1294732767,"Swing Shift, Mon, 10-Jan-11",461660.0,,,,,,
3,461658,50PR3 'Not Out' causing MPS 10 Hz rate limit,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1068, 'username': 'sommer', 'firs...",1294732755,"Swing Shift, Mon, 10-Jan-11",,[461661],,,,,
4,461654,PEM Ranged L0A,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1301, 'username': 'mgibbs', 'firs...",1294730677,"Swing Shift, Mon, 10-Jan-11",,,,,,,


In [140]:
s = datetime(2011, 1, 11, 0, 0).timestamp()
e = datetime(2011, 12, 31, 0, 0).timestamp()
df3 = get_data(s,e)
df3.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,superseded_by,supersedes,tag,attachments,highPriority,children,parent
0,376580,DAY SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t \n\t\t <...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1301, 'username': 'mgibbs', 'firs...",1325289599,"Day Shift, Fri, 30-Dec-11",376582.0,,,,,,
1,376581,DAY SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t \n\t\t <...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1301, 'username': 'mgibbs', 'firs...",1325289599,"Day Shift, Fri, 30-Dec-11",376583.0,,,,,,
2,425296,DAY SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t \n\t\t <...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1325289599,"Day Shift, Fri, 30-Dec-11",425304.0,,,,,,
3,557827,Injector vault to C/A at Ted Martinez's request,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1004, 'username': 'schuh', 'first...",1324414645,"Day Shift, Tue, 20-Dec-11",557828.0,,,,,,
4,557828,Injector vault to C/A at Ted Martinez's request,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1004, 'username': 'schuh', 'first...",1324414645,"Day Shift, Tue, 20-Dec-11",557831.0,557827.0,LCLS,,,,


In [141]:
s = datetime(2008, 1, 1, 0, 0).timestamp()
e = datetime(2009, 1, 11, 0, 0).timestamp()
df4 = get_data(s,e)
df4.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,children,parent,attachments,superseded_by,supersedes,highPriority,tag
0,270417,"MCC Shift Change: Owl Shift, Sunday, 11-Jan-2009",.250 nC 13.6 GeV 10 Hz e- to main dump. Undula...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1053, 'username': 'spw', 'firstna...",1231660800,"Owl Shift, Sun, 11-Jan-09",,,,,,,
1,270419,SWING SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t <TR><TD><...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660799,"Swing Shift, Sat, 10-Jan-09",,,,,,,
2,270415,* RE: Frisch 6x6 misbehaving,Disabled BSY/LTU energy part of Frisch feedbac...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660530,"Swing Shift, Sat, 10-Jan-09",[270428],270413.0,,,,,
3,270412,Instructions for resetting BSOBTH02,Go to the large blue box on the <u>North</u> h...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1160, 'username': 'jab', 'firstna...",1231660060,"Swing Shift, Sat, 10-Jan-09",,,,,,,
4,270413,Frisch 6x6 misbehaving,LTU energy BPM DL1 oscillating about 2mm. Pag...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231659900,"Swing Shift, Sat, 10-Jan-09",[270415],,,,,,


In [206]:
s = datetime(2007, 4, 1, 0, 0).timestamp()
e = datetime(2007, 5, 1, 0, 0).timestamp()
df5 = get_data(s,e)
df5.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,supersedes,parent,attachments,superseded_by,children,highPriority,tag
0,143932,"MCC Shift Change: Owl Shift, Tuesday, 01-May-2007","BaBar, ROW","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1050, 'username': 'gmilanov', 'fi...",1178002800,"Owl Shift, Tue, 01-May-07",,,,,,,
1,143922,Swing Shift Summary,Swing Shift</font></h2>\n\t<table>\n\t\t<tbody...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1178002799,"Swing Shift, Mon, 30-Apr-07",143921.0,,,,,,
2,143919,"LUM: 7973, HER: 1601, LER: 2501, SPLUM: 3.44",\n\n Luminosity of 7973 x10^30/cm^2s and Spec...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1197, 'username': 'acc_status', '...",1178002241,"Swing Shift, Mon, 30-Apr-07",,,,,,,
3,143917,LCLS injector vault to controlled access.,Successful after PEM breakered off BX01/02.\n,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1177999425,"Swing Shift, Mon, 30-Apr-07",,,,,,,
4,143916,* RE: C. Rivetta performing LER grow/damp meas...,He's done for the night. He has been having tr...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1177999279,"Swing Shift, Mon, 30-Apr-07",,143883.0,,,,,


In [148]:
df3.tag.value_counts()

LCLS     6430
FACET    1765
Name: tag, dtype: int64

In [160]:
df5.tag.value_counts()

LCLS    2
Name: tag, dtype: int64

In [207]:
s = datetime(2007, 5, 1, 0, 0).timestamp()
e = datetime(2007, 6, 1, 0, 0).timestamp()
df6 = get_data(s,e)
df6.head()

Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,supersedes,superseded_by,parent,attachments,children,highPriority,tag
0,150499,"MCC Shift Change: Owl Shift, Friday, 01-Jun-2007","BaBar, LCLS","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1050, 'username': 'gmilanov', 'fi...",1180681200,"Owl Shift, Fri, 01-Jun-07",,,,,,,
1,150498,Swing Shift Summary,\t<table>\n\t\t<tbody>\n\t\t\t<tr>\n <td ...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1180681142,"Swing Shift, Thu, 31-May-07",,,,,,,
2,150496,"LUM: 8246, HER: 1501, LER: 2451, SPLUM: 3.88",\n\n Luminosity of 8246 x10^30/cm^2s and Spec...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1197, 'username': 'acc_status', '...",1180680954,"Swing Shift, Thu, 31-May-07",,,,,,,
3,150488,Openned RSWCF #2899 for E+ Stopper #1,Contacted ADSO and PPS group (J. Fitch). \n\nI...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1180679415,"Swing Shift, Thu, 31-May-07",150472.0,150490.0,,,,,
4,150491,"LUM: 8200, HER: 1501, LER: 2451, SPLUM: 3.86",\n\n Luminosity of 8200 x10^30/cm^2s and Spec...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1197, 'username': 'acc_status', '...",1180677313,"Swing Shift, Thu, 31-May-07",,,,,,,


In [210]:
df55 = df5.drop_duplicates(subset ="elogid", keep = 'first')
df66 = df6.drop_duplicates(subset ="elogid", keep = 'first')

In [212]:
result = pd.concat([df55, df66], sort = False)
print(df55.shape[0])
print(df66.shape[0])
print(result.shape[0])
result.head()

3949
3060
7009


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,supersedes,parent,attachments,superseded_by,children,highPriority,tag
0,143932,"MCC Shift Change: Owl Shift, Tuesday, 01-May-2007","BaBar, ROW","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1050, 'username': 'gmilanov', 'fi...",1178002800,"Owl Shift, Tue, 01-May-07",,,,,,,
1,143922,Swing Shift Summary,Swing Shift</font></h2>\n\t<table>\n\t\t<tbody...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1178002799,"Swing Shift, Mon, 30-Apr-07",143921.0,,,,,,
2,143919,"LUM: 7973, HER: 1601, LER: 2501, SPLUM: 3.44",\n\n Luminosity of 7973 x10^30/cm^2s and Spec...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1197, 'username': 'acc_status', '...",1178002241,"Swing Shift, Mon, 30-Apr-07",,,,,,,
3,143917,LCLS injector vault to controlled access.,Successful after PEM breakered off BX01/02.\n,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1177999425,"Swing Shift, Mon, 30-Apr-07",,,,,,,
4,143916,* RE: C. Rivetta performing LER grow/damp meas...,He's done for the night. He has been having tr...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1048, 'username': 'sonya', 'first...",1177999279,"Swing Shift, Mon, 30-Apr-07",,143883.0,,,,,


In [213]:
# elogids that appear more than once after combining the two months
elog_id_list = result[result.elogid.duplicated()].elogid.tolist()
print(len(elog_id_list))
result[result.elogid.isin(elog_id_list)]

1


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,supersedes,parent,attachments,superseded_by,children,highPriority,tag
0,143932,"MCC Shift Change: Owl Shift, Tuesday, 01-May-2007","BaBar, ROW","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1050, 'username': 'gmilanov', 'fi...",1178002800,"Owl Shift, Tue, 01-May-07",,,,,,,
3065,143932,"MCC Shift Change: Owl Shift, Tuesday, 01-May-2007","BaBar, ROW","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1050, 'username': 'gmilanov', 'fi...",1178002800,"Owl Shift, Tue, 01-May-07",,,,,,,


In [238]:
s = datetime(2007, 6, 1, 0, 0).timestamp()
e = datetime(2007, 6, 1, 23, 59).timestamp()
df7 = get_data(s,e)
print(df7.shape[0])
df7.head()

101


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,attachments,superseded_by,supersedes,parent,children,highPriority
0,150627,"At HER I=1675, PR02 VGH 7045 is trending up",,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1180763245,"Swing Shift, Fri, 01-Jun-07","[{'attachmentid': 63998, 'url': 'https://mccel...",,,,,
1,150625,BXG is tripped,Joe Frisch wrote a matlab script to standardiz...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1207, 'username': 'sanzone', 'fir...",1180762341,"Swing Shift, Fri, 01-Jun-07",,,,,,
2,150623,"LUM: 8956, HER: 1656, LER: 2502, SPLUM: 3.75",\n\n Luminosity of 8956 x10^30/cm^2s and Spec...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1197, 'username': 'acc_status', '...",1180761319,"Swing Shift, Fri, 01-Jun-07",,,,,,
3,150622,Lumi Compare -- Now to 1st Two weeks of Apr07,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1180760658,"Swing Shift, Fri, 01-Jun-07","[{'attachmentid': 63997, 'url': 'https://mccel...",,,,,
4,150620,Moved LER IP Y Pos and L up from 8.6 --> 8.8,moved LER IP Y Pos -8um to see if it talks to ...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1057, 'username': 'mboyes', 'firs...",1180759342,"Swing Shift, Fri, 01-Jun-07","[{'attachmentid': 63995, 'url': 'https://mccel...",150621.0,,,,


In [239]:
df7['shift'].value_counts()

Swing Shift, Fri, 01-Jun-07    53
Day Shift, Fri, 01-Jun-07      34
Owl Shift, Fri, 01-Jun-07      14
Name: shift, dtype: int64