# Elog Tagging

The goal is to tag elog entries with the correct tag. To do this we need to:
* Scrape the entry data from the elog
* Get the corresponding tag for each entry
* Run the entries through an NLP algorithm to train a tagger (NOTE: we can't really do this right now because our existing tagging is unreliable, so we'd be training on bad data)

In [74]:
import pandas as pd
import numpy as np
import requests
import time
from datetime import datetime
from sqlalchemy import create_engine

In [87]:
def get_data(s,e):
    '''
    --- Imports data from Elog and stores it in a workable format ---
    INPUT
        s: start time as unix timestamp
        e: end time as unix time stamp
    RETURN
        df: dataframe of uncleaned data between selected time range
    '''
    
    # api-endpoint 
    URL = "https://mccelog.slac.stanford.edu/elog/dev/mgibbs/dev_elog_display_json.php"

    PARAMS = {'logbook': 'MCC', 'start': s, 'end': e} 

    # Send the GET request; the 60 s timeout is an arbitrary choice
    r = requests.get(url=URL, params=PARAMS, timeout=60)

    # Fail early on HTTP errors instead of trying to parse an error page
    r.raise_for_status()

    # extracting data in json format
    data = r.json()

    # Turning list of json objects into dataframe
    df = pd.DataFrame.from_records(data)

    return df
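
One detail about the `s` and `e` arguments: calling `.timestamp()` on a naive `datetime` interprets it in the local time zone of the machine running the notebook, so the same cell can return a different window on a different machine. Below is a minimal sketch that builds the timestamps from explicit UTC datetimes. Whether the elog API expects UTC or local time is an assumption that has not been checked here, and `to_unix` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_unix(year, month, day, hour=0, minute=0):
    """Return an integer unix timestamp for a UTC wall-clock time (hypothetical helper)."""
    dt = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(dt.timestamp())

# Same calendar dates as the check cell, but pinned to UTC
s = to_unix(2008, 1, 11)
e = to_unix(2009, 1, 11)
print(s, e)
```

If the API turns out to work in local time, the same pattern applies with a different `tzinfo`.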

In [99]:
# Just checking that things work as expected
s = datetime(2008, 1, 11, 0, 0).timestamp()
e = datetime(2009, 1, 11, 0, 0).timestamp()
df = get_data(s,e)
print(df.shape)
df.head()

(24284, 14)


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,children,parent,attachments,superseded_by,supersedes,highPriority,tag
0,270417,"MCC Shift Change: Owl Shift, Sunday, 11-Jan-2009",.250 nC 13.6 GeV 10 Hz e- to main dump. Undula...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1053, 'username': 'spw', 'firstna...",1231660800,"Owl Shift, Sun, 11-Jan-09",,,,,,,
1,270419,SWING SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t <TR><TD><...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660799,"Swing Shift, Sat, 10-Jan-09",,,,,,,
2,270415,* RE: Frisch 6x6 misbehaving,Disabled BSY/LTU energy part of Frisch feedbac...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660530,"Swing Shift, Sat, 10-Jan-09",[270428],270413.0,,,,,
3,270412,Instructions for resetting BSOBTH02,Go to the large blue box on the <u>North</u> h...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1160, 'username': 'jab', 'firstna...",1231660060,"Swing Shift, Sat, 10-Jan-09",,,,,,,
4,270413,Frisch 6x6 misbehaving,LTU energy BPM DL1 oscillating about 2mm. Pag...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231659900,"Swing Shift, Sat, 10-Jan-09",[270415],,,,,,


### Now we have a method to store the data in a data frame, but there is still a lot of useless data here. Let's get rid of the useless columns

* `logbook` (always MCC)
* `author`, `eventTime`, `shift`, `parent`, `children`, `attachments`, `supersedes`, `highPriority` (irrelevant)

This leaves the following columns: `title`, `text`, `elogid`, `tag`, and `superseded_by`
* `superseded_by` is useful because we can drop any row where it is not NaN. When an entry is superseded, the log holds two versions of it, and we only want to keep one copy (the correct one). So we drop the original entries, i.e. the rows where `superseded_by` is not NaN, and then delete the column.

Finally we'd be left with: `title`, `text`, `elogid`, `tag` <br>
<b> Questions </b> 
* The NLP algorithm should really only need to work on the `title` of the entries, right? It's great if there are more keywords in the body, but the title should be enough to tag the location (in my head). If this is true, then there's no need for `text`
* Do I really need `elogid` for anything? If I keep both title and text, I definitely need it; if not, I see no need
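
The filtering rule described above (keep only tagged rows, then drop any row that has been superseded) can be sketched on a toy frame. The rows below are made up for illustration and are not real elog entries; only the column names match the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the elog frame; the values are invented for illustration
toy = pd.DataFrame({
    'elogid': [1, 2, 3, 4],
    'title': ['BPM issue', 'BPM issue (old)', 'Shift summary', 'Magnet trip'],
    'tag': ['LCLS', 'LCLS', None, 'LCLS'],
    'superseded_by': [np.nan, 1.0, np.nan, np.nan],
})

# Keep tagged rows whose superseded_by is empty, i.e. the current version of each entry
keep = toy['tag'].notnull() & toy['superseded_by'].isnull()
cleaned = toy.loc[keep, ['elogid', 'title', 'tag']].reset_index(drop=True)
print(cleaned)
```

Entry 2 is dropped because it was superseded, and entry 3 is dropped because it has no tag.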

In [100]:
def clean_data(df):
    '''
    --- Cleans data frame ---
    INPUT
        df: dataframe (not cleaned)
    RETURN
        df: dataframe (cleaned)
    '''
    # Dropping rows without any tags (these rows are useless for us)
    df = df[df['tag'].notnull()]

    # Dropping useless columns (keeps the original column order)
    important_cols = {'title', 'text', 'elogid', 'tag', 'superseded_by'}
    df = df.drop(columns=[col for col in df.columns if col not in important_cols])

    # Dropping all rows where superseded_by is not null to essentially drop duplicates. Then drop the superseded_by column
    df = df[df['superseded_by'].isnull()]
    df = df.drop(columns=['superseded_by'])
    
    # Reset the index
    df = df.reset_index(drop=True)
    
    return df

In [101]:
# Just checking that things work as expected
df = clean_data(df)
print(df.shape)
df.head()

(49, 4)


Unnamed: 0,elogid,title,text,tag
0,265530,Restart LCLS Magnet ChannelWatcher,I've restarted the lcls magnet channel watcher...,LCLS
1,259842,BYKIK pulse width change,Tony Beukers and I chagned the BYKIK pulse wid...,LCLS
2,252459,* Re: SW: Reboot BC1 Bunch Length IOCs-,Greg Dallt from the Klystron Group is working ...,LCLS
3,252453,SW: Reboot BC1 Bunch Length IOCs-,Rebooted Bunch Length Monitor EPICS IOC in li2...,LCLS
4,252399,Fallout from 120Hz Testing: BCS: Gun SBI (20-5...,"Hello,\n\nAfter the 120Hz testing, after the c...",LCLS


In [98]:
# Checking to see the number of tags present
df.tag.value_counts()

LCLS    49
Name: tag, dtype: int64

<b> Now let's save the data in a way that we can easily access </b>

In [80]:
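# Function to save the data into an SQLite database
def save_data(df, database_filename):
    '''Saves df to <database_filename>.db in a table of the same name'''
    engine = create_engine('sqlite:///' + database_filename + '.db')
    # if_exists='replace' allows re-runs; the default ('fail') raises if the table exists
    df.to_sql(database_filename, engine, index=False, if_exists='replace')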
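
To show what "easily access" could look like later, here is a minimal round-trip sketch: write a frame to SQLite and read it back into pandas. It uses an in-memory database through the standard-library `sqlite3` module instead of the SQLAlchemy engine above, purely so the sketch leaves no file behind. The rows are made up, and the table name `elog_data` is assumed to match the name passed to `save_data`.

```python
import sqlite3
import pandas as pd

# Made-up rows standing in for the cleaned elog data
toy = pd.DataFrame({
    'elogid': [1, 4],
    'title': ['BPM issue', 'Magnet trip'],
    'tag': ['LCLS', 'LCLS'],
})

# In-memory SQLite database; nothing is written to disk
conn = sqlite3.connect(':memory:')
toy.to_sql('elog_data', conn, index=False)

# Read the table back into a dataframe
loaded = pd.read_sql_query('SELECT elogid, title, tag FROM elog_data', conn)
conn.close()
print(loaded)
```

For the real file, the same query should work against a connection or engine pointing at `elog_data.db`.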

In [89]:
def main():
    '''
    Will go through all the necessary steps to extract the data from the elog, clean it, and save the data
    in an SQL database
    '''
    s = datetime(2008, 1, 11, 0, 0).timestamp()
    e = datetime(2009, 1, 11, 0, 0).timestamp()
    df = get_data(s,e)
    df = clean_data(df)
    save_data(df,'elog_data')

In [102]:
# Running this will save the data that we want to collect
main()

### Important changes that still need to be made:
* What time frame is long enough to capture all the needed data?
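
One possible way to explore this question is to fetch the log in fixed windows and compare how many tagged entries each window yields. The sketch below only generates the window boundaries. `yearly_windows` is a hypothetical helper, and the choice of one-year windows and UTC boundaries is an assumption, not something the elog API requires.

```python
from datetime import datetime, timezone

def yearly_windows(first_year, last_year):
    """Yield (start, end) unix timestamps, one pair per calendar year (hypothetical helper)."""
    for year in range(first_year, last_year + 1):
        start = datetime(year, 1, 1, tzinfo=timezone.utc)
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        yield int(start.timestamp()), int(end.timestamp())

windows = list(yearly_windows(2008, 2010))
for s, e in windows:
    print(s, e)
```

Each pair could then be passed to `get_data(s, e)` and the resulting frames combined with `pd.concat`. Since the end of one window equals the start of the next, an entry falling exactly on a boundary might come back twice if the API treats both ends as inclusive (not verified), so dropping duplicates on `elogid` afterwards would be a reasonable safeguard.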