# Elog Tagging

The goal is to tag elog entries with the correct tag. To do this we need to:
* Scrape the entry data off the elog
* Get the corresponding tag for each entry
* Run the entries through an NLP algorithm to train a tagger (NOTE: we can't really do this right now, because most of our existing tagging is unreliable, so we'd be training on bad data)

In [3]:
import pandas as pd
import numpy as np
import requests
import time
from datetime import datetime

In [36]:
# API endpoint
URL = "https://mccelog.slac.stanford.edu/elog/dev/mgibbs/dev_elog_display_json.php"

# Query parameters: for now, a simple two-month window as Unix timestamps.
# Note: timestamp() on a naive datetime uses the machine's local timezone.
s = datetime(2015, 9, 11, 0, 0).timestamp()
e = datetime(2015, 11, 11, 0, 0).timestamp()
PARAMS = {'logbook': 'MCC', 'start': s, 'end': e}

# Send the GET request and fail loudly on an HTTP error
r = requests.get(url=URL, params=PARAMS)
r.raise_for_status()

# Parse the JSON body (a list of entry objects)
data = r.json()

# Turn the list of JSON objects into a DataFrame
df = pd.DataFrame.from_records(data)
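
One thing to watch in the query window above: `datetime.timestamp()` on a naive datetime uses the local timezone of whatever machine runs the notebook, so the same cell can request slightly different windows on different machines. (The first row's `eventTime` of 1447228800 is midnight on 11-Nov-2015 at UTC-8, which suggests the original run was in Pacific time.) Below is a minimal sketch of a helper that pins the timezone explicitly. The name `epoch_window` is hypothetical, and casting to whole seconds is an assumption, since nothing here says whether the endpoint wants integers or floats.

```python
from datetime import datetime, timezone


def epoch_window(start, end, tz=timezone.utc):
    """Return (start, end) as integer Unix timestamps.

    Naive datetimes are interpreted in `tz` instead of the machine's
    local timezone. `epoch_window` is a hypothetical helper, not part
    of the elog API.
    """
    def to_epoch(dt):
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=tz)
        # Assumption: the endpoint accepts whole seconds
        return int(dt.timestamp())

    s, e = to_epoch(start), to_epoch(end)
    if s >= e:
        raise ValueError("start must be before end")
    return s, e


s, e = epoch_window(datetime(2015, 9, 11), datetime(2015, 11, 11))
PARAMS = {'logbook': 'MCC', 'start': s, 'end': e}
```

Passing a different `tz` (for example a fixed UTC-8 offset) would reproduce a Pacific-time window, though a fixed offset ignores daylight saving.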

In [37]:
print(df.shape)
df.head()

(5976, 14)


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,parent,tag,children,superseded_by,supersedes,attachments,highPriority
0,813183,"MCC Shift Change: Owl Shift, Wednesday, 11-Nov...",LCLS MD\nLCLS PAMM,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1051, 'username': 'hvs', 'firstna...",1447228800,"Owl Shift, Wed, 11-Nov-15",,,,,,,
1,813179,SWING SHIFT SUMMARY,"<table CellPadding=""5"" BORDER=1>\n<tr>\n<th>Co...","{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1429, 'username': 'lafky', 'first...",1447228799,"Swing Shift, Tue, 10-Nov-15",,,,,,,
2,813178,* Re: BBA gui no longer able to collect data. ...,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1429, 'username': 'lafky', 'first...",1447228146,"Swing Shift, Tue, 10-Nov-15",813177.0,LCLS,,,,,
3,813177,* Re: BBA gui no longer able to collect data. ...,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1429, 'username': 'lafky', 'first...",1447227754,"Swing Shift, Tue, 10-Nov-15",813175.0,LCLS,[813178],,,,
4,813174,BBA gui no longer able to collect data. Cam,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1429, 'username': 'lafky', 'first...",1447227377,"Swing Shift, Tue, 10-Nov-15",,LCLS,,813175.0,,,


### Now we have the data stored in a DataFrame, but there is still a lot of useless data here. Let's get rid of the useless columns

* `logbook` (always MCC)
* `author`, `eventTime`, `shift`, `parent`, `children`, `attachments`, `supersedes`, `highPriority` (irrelevant)

This leaves the following columns: `title`, `text`, `elogid`, `tag`, and `superseded_by`
* `superseded_by` is useful because we can drop any row where it is not NaN. The reasoning is that a superseded entry is basically a duplicate of the entry that replaced it, and we only want to keep one copy (the correct one). So we drop the original entries, i.e. the rows where `superseded_by` is not NaN, and then delete this column

Finally we'd be left with: `title`, `text`, `elogid`, `tag` <br>
<b>Questions</b>
* The NLP algorithm should really only be working on the `title` of the entries, right? It's great if there are more keywords in the body, but the title should be enough to tag the location (in my head). If this is true, then there's no need for `text`
* Do I really need `elogid` for anything? If I keep both `title` and `text` I definitely do; if not, I see no need
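
One cheap way to probe the first question is a title-only baseline: if even a crude word-count vote over titles recovers tags reasonably well, the body text is probably not needed. The sketch below is only an illustration under assumptions, not a chosen algorithm: `tokenize`, `fit_word_votes`, and `predict_tag` are hypothetical helpers, the training pairs are toy data, and the `OTHER` tag is made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title, strip a leading '* Re:' reply marker, split into words."""
    title = re.sub(r"^\s*\*?\s*re:\s*", "", title, flags=re.IGNORECASE)
    return re.findall(r"[a-z0-9]+", title.lower())


def fit_word_votes(titles, tags):
    """For every tag, count how often each word appears in its titles."""
    counts = defaultdict(Counter)
    for title, tag in zip(titles, tags):
        counts[tag].update(tokenize(title))
    return counts


def predict_tag(counts, title):
    """Return the tag whose training titles share the most word occurrences.

    Returns None when no word of the title was seen during training.
    """
    words = tokenize(title)
    scores = {tag: sum(c[w] for w in words) for tag, c in counts.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


# Toy data: one title taken from the output above, one invented for contrast
counts = fit_word_votes(
    ["BBA gui no longer able to collect data", "Klystron tripped"],
    ["LCLS", "OTHER"],
)
guess = predict_tag(counts, "* Re: BBA gui no longer able to collect data.")
```

Stripping the `* Re:` prefix means replies vote with the same words as their parent entry, which seems reasonable here since replies in the sample carry the same tag as the original.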

In [34]:
# Drop every column that is not in the set we care about
important_cols = {'title', 'text', 'elogid', 'tag', 'superseded_by'}
df = df.drop(columns=[col for col in df.columns if col not in important_cols])

# Drop all rows where superseded_by is not null (the outdated originals),
# which removes the duplicates. Then drop the superseded_by column itself.
df = df[df['superseded_by'].isnull()]
df = df.drop(columns=['superseded_by'])

In [35]:
print(df.shape)
df.head()

(4825, 4)


Unnamed: 0,elogid,title,text,tag
0,813183,"MCC Shift Change: Owl Shift, Wednesday, 11-Nov...",LCLS MD\nLCLS PAMM,
1,813179,SWING SHIFT SUMMARY,"<table CellPadding=""5"" BORDER=1>\n<tr>\n<th>Co...",
2,813178,* Re: BBA gui no longer able to collect data. ...,,LCLS
3,813177,* Re: BBA gui no longer able to collect data. ...,,LCLS
5,813175,BBA gui no longer able to collect data. Can't...,,LCLS
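
Given the worry about tag quality in the note at the top, it may be worth checking how many of the remaining rows carry a tag at all before training on them. A minimal sketch, using a toy frame that mirrors only the five rows shown above (on the real data the same two lines would run against `df`):

```python
import pandas as pd

# Toy frame mirroring the elogid/tag values of the five rows displayed above
toy = pd.DataFrame({
    "elogid": [813183, 813179, 813178, 813177, 813175],
    "tag": [None, None, "LCLS", "LCLS", "LCLS"],
})

# Fraction of rows that have any tag
tagged_fraction = toy["tag"].notna().mean()

# Rows per tag, with untagged rows counted as their own group
tag_counts = toy["tag"].value_counts(dropna=False)

print(tagged_fraction)
print(tag_counts)
```

This only measures whether a tag is present, not whether it is correct, so it gives an upper bound on how much usable training data there is.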


In [38]:
# For now, let's work on creating this algorithm using just the title