# Elog Tagging

The goal is to tag elog entries with the correct tag. To do this we need to:
* Scrape the entry data from the elog
* Get the corresponding tag for each entry
* Run the entries through an NLP algorithm to train a tagger (NOTE: We can't really do this right now because our existing tagging is unreliable, so we'd be training on bad data)

In [74]:
import pandas as pd
import numpy as np
import requests
import time
from datetime import datetime
from sqlalchemy import create_engine

In [257]:
def get_data(s,e):
    '''
    --- Imports data from Elog and stores it in a workable format ---
    INPUT
        s: start time as unix timestamp
        e: end time as unix timestamp
    RETURN
        df: dataframe of uncleaned data between selected time range
    '''
    
    # api-endpoint 
    URL = "https://mccelog.slac.stanford.edu/elog/dev/mgibbs/dev_elog_display_json.php"

    PARAMS = {'logbook': 'MCC', 'start': s, 'end': e} 

    # sending get request and saving the response as response object 
    r = requests.get(url = URL, params = PARAMS) 

    # stopping early with a clear error if the request failed
    r.raise_for_status()

    # extracting data in json format 
    data = r.json()

    # Turning list of json objects into dataframe
    df = pd.DataFrame.from_records(data)

    return df
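
A note on the `s` and `e` arguments: `datetime(...).timestamp()` on a naive datetime is interpreted in the local timezone of the machine running the notebook, and it returns a float. The sketch below makes both choices explicit. The helper name `to_unix` and the UTC default are assumptions for illustration; whether the elog API expects UTC or local time, or accepts fractional seconds, is not confirmed here. For reference, the `eventTime` of 1231660800 on the first row of the sample output is 2009-01-11 08:00 UTC, which is midnight Pacific Standard Time.

```python
from datetime import datetime, timezone

def to_unix(year, month, day, tz=timezone.utc):
    # Hypothetical helper: builds a timezone-aware datetime and returns
    # whole seconds since the epoch, independent of the machine's local timezone
    return int(datetime(year, month, day, tzinfo=tz).timestamp())

s = to_unix(2008, 1, 11)
e = to_unix(2009, 1, 11)
print(s, e)
```

If the API turns out to work in local (Pacific) time, passing a matching `tz` would keep the windows aligned with shift boundaries.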

In [99]:
# Just checking that things work as expected
s = datetime(2008, 1, 11, 0, 0).timestamp()
e = datetime(2009, 1, 11, 0, 0).timestamp()
df = get_data(s,e)
print(df.shape)
df.head()

(24284, 14)


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,children,parent,attachments,superseded_by,supersedes,highPriority,tag
0,270417,"MCC Shift Change: Owl Shift, Sunday, 11-Jan-2009",.250 nC 13.6 GeV 10 Hz e- to main dump. Undula...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1053, 'username': 'spw', 'firstna...",1231660800,"Owl Shift, Sun, 11-Jan-09",,,,,,,
1,270419,SWING SHIFT SUMMARY,<table CellPadding=5 BORDER=1>\n\t\t <TR><TD><...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660799,"Swing Shift, Sat, 10-Jan-09",,,,,,,
2,270415,* RE: Frisch 6x6 misbehaving,Disabled BSY/LTU energy part of Frisch feedbac...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231660530,"Swing Shift, Sat, 10-Jan-09",[270428],270413.0,,,,,
3,270412,Instructions for resetting BSOBTH02,Go to the large blue box on the <u>North</u> h...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1160, 'username': 'jab', 'firstna...",1231660060,"Swing Shift, Sat, 10-Jan-09",,,,,,,
4,270413,Frisch 6x6 misbehaving,LTU energy BPM DL1 oscillating about 2mm. Pag...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1161, 'username': 'jwarren', 'fir...",1231659900,"Swing Shift, Sat, 10-Jan-09",[270415],,,,,,


### Now we have a method to store the data in a dataframe, but there is still a lot of useless data here. Let's get rid of the useless columns

* logbook (all MCC)
* author, eventTime, shift, parent, children, attachments, supersedes, highPriority (irrelevant) 

This leaves the following columns: `title`, `text`, `elogid`, `tag`, and `superseded_by`
* `superseded_by` is useful because we can drop any row where it is not NaN. Superseded entries are basically duplicates, and we only want to keep one copy (the correct one). So we drop the original entries, i.e. the rows where `superseded_by` is not NaN, and then delete this column

Finally we'd be left with: `title`, `text`, `elogid`, `tag` <br>
<b> Questions </b> 
* The NLP algorithm should really only need the `title` of the entries, right? It's great if there are more keywords in the body, but the title should be enough to tag the location (in my head). If this is true then there's no need for `text`
* Do I really need `elogid` for anything...? If I keep both title and text I definitely do; if not, I see no need
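
To make the title-vs-body question easier to test later, here is a minimal sketch that builds the input string either from the title alone or from the title plus the body. The sample rows show that `text` can contain HTML (`<table ...>`, `<u>`, `<b>`), so the body is stripped of tags first. The names `strip_html` and `make_feature` are hypothetical helpers, not part of the notebook.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    # Collects only the text nodes, discarding tags such as <table> or <u>
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    # Missing values (e.g. NaN from pandas) are treated as empty text
    if not isinstance(raw, str):
        return ""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()
    # Pieces are joined with a space so "a<br>b" does not become "ab",
    # then runs of whitespace are collapsed
    return " ".join(" ".join(parser.parts).split())

def make_feature(title, text, use_text=True):
    # Hypothetical helper: title only, or title plus cleaned body
    title = strip_html(title)
    if not use_text:
        return title
    return (title + " " + strip_html(text)).strip()

print(make_feature("Instructions for resetting BSOBTH02",
                   "Go to the large blue box on the <u>North</u> side"))
```

Toggling `use_text` would let the same training run be compared with and without the body.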

In [307]:
def clean_data(df):
    '''
    --- Cleans data frame ---
    INPUT
        df: dataframe (not cleaned)
    RETURN
        df: dataframe (cleaned)
    '''
    # Dropping rows without any tags (these rows are useless for us)
    df = df[df['tag'].notnull()]
    
    # Dropping useless columns (keeping the original column order)
    important_cols = {'title', 'text', 'elogid', 'tag', 'superseded_by'}
    df = df[[col for col in df.columns if col in important_cols]]

    # Dropping all rows where superseded_by is not null to essentially drop duplicates. Then drop the superseded_by column
    df = df[df['superseded_by'].isnull()]
    df = df.drop(columns=['superseded_by'])
    
    # Reset the index
    df = df.reset_index(drop=True)
    
    return df

In [101]:
# Just checking that things work as expected
df = clean_data(df)
print(df.shape)
df.head()

(49, 4)


Unnamed: 0,elogid,title,text,tag
0,265530,Restart LCLS Magnet ChannelWatcher,I've restarted the lcls magnet channel watcher...,LCLS
1,259842,BYKIK pulse width change,Tony Beukers and I chagned the BYKIK pulse wid...,LCLS
2,252459,* Re: SW: Reboot BC1 Bunch Length IOCs-,Greg Dallt from the Klystron Group is working ...,LCLS
3,252453,SW: Reboot BC1 Bunch Length IOCs-,Rebooted Bunch Length Monitor EPICS IOC in li2...,LCLS
4,252399,Fallout from 120Hz Testing: BCS: Gun SBI (20-5...,"Hello,\n\nAfter the 120Hz testing, after the c...",LCLS


In [98]:
# Checking to see the number of tags present
df.tag.value_counts()

LCLS    49
Name: tag, dtype: int64

<b> Now let's save the data in a way that we can easily access </b>

In [80]:
# Function to save the data into sql database
def save_data(df, database_filename):
    engine = create_engine('sqlite:///'+database_filename+'.db')
    df.to_sql(database_filename, engine, index=False)
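
One thing to watch for: `DataFrame.to_sql` defaults to `if_exists='fail'`, so calling `save_data` a second time with the same file and table name raises a `ValueError`. The sketch below shows a save-and-reload round trip on a toy dataframe with `if_exists='replace'`. It swaps in the standard-library `sqlite3` connection and an in-memory database purely to keep the example self-contained; the table name `elog_data` and the toy rows are just for illustration.

```python
import sqlite3
import pandas as pd

toy = pd.DataFrame({
    "elogid": [265530, 259842],
    "title": ["Restart LCLS Magnet ChannelWatcher", "BYKIK pulse width change"],
    "tag": ["LCLS", "LCLS"],
})

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")

# if_exists='replace' overwrites the table instead of raising on a rerun,
# so writing twice is safe
toy.to_sql("elog_data", conn, index=False, if_exists="replace")
toy.to_sql("elog_data", conn, index=False, if_exists="replace")

# Reading the table back into a dataframe
loaded = pd.read_sql("SELECT * FROM elog_data", conn)
conn.close()
print(loaded.shape)
```

Whether `'replace'` or `'append'` is the right choice depends on whether each run should overwrite or extend the saved data.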

### Important changes that still need to be made:
* What time frame is a good time frame to capture all the needed data?
> Looks like you want to capture data up until 2011. Perhaps the most efficient way to do this would be to pull the data by month or by year, process each chunk individually, and then join the chunks into one giant dataframe. You would likely have to add a function that does this and call it from your main() function

In [267]:
def main1():
    '''
    Will go through all the necessary steps to extract the data from the elog, clean it, and save the data
    in an SQL database
    '''
    s = datetime(2009, 1, 11, 0, 0).timestamp()
    e = datetime(2010, 1, 11, 0, 0).timestamp()
    df = get_data(s,e)
    df = clean_data(df)
    save_data(df,'elog_data')

In [126]:
# Running this will save the data that we want to collect
main1()

### Due to what was found in the cells below, we realize that there are duplicates
We'll need to rewrite our `clean_data` function to incorporate a few things:
* This function should drop duplicates

Also will need to write another function that does the following (call it `join_data_2011`):
* Uses `get_data` and collects data in one month intervals
* Cleans these individual months using the new `clean_data` function
* Joins the months together
* Drops duplicates if there are any overlaps
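
The one-month intervals in the plan above could be generated with something like the sketch below. `month_windows` is a hypothetical helper, and the April 2007 start is only an example value. Each window's end is the next window's start, so an entry sitting exactly on a boundary could show up in two months if the API treats both ends as inclusive (an assumption, not verified), which is one more reason to drop duplicates after joining.

```python
from datetime import datetime

def month_windows(start_year, start_month, end_year, end_month):
    # Hypothetical helper: yields (start, end) Unix timestamps, one pair per
    # calendar month, from the start month through the end month inclusive
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        # Roll over to January of the next year after December
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        s = datetime(year, month, 1).timestamp()
        e = datetime(next_year, next_month, 1).timestamp()
        yield s, e
        year, month = next_year, next_month

windows = list(month_windows(2007, 4, 2011, 12))
print(len(windows))
```

Each `(s, e)` pair can be passed straight to `get_data(s, e)`.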

In [309]:
# Creating dummy dataframe
s = datetime(2011, 4, 1, 0, 0).timestamp()
e = datetime(2011, 8, 1, 0, 0).timestamp()
df1 = get_data(s,e)
print('Number of entries in this dataframe: ' + str(df1.shape[0]))

# Printing out duplicates, just so that we have the visual proof
print('The duplicates for this four month period of time are shown below. Must be something wrong with the query')
bad_ids = df1[df1.elogid.duplicated()].elogid.tolist()
print('Number of entries in this duplicates dataframe: ' + str(df1[df1['elogid'].isin(bad_ids)].shape[0]))
df1[df1['elogid'].isin(bad_ids)]

Number of entries in this dataframe: 8663
The duplicates for this four month period of time are shown below. Must be something wrong with the query
Number of entries in this duplicates dataframe: 128


Unnamed: 0,elogid,title,text,logbook,author,eventTime,shift,tag,parent,children,superseded_by,supersedes,attachments,highPriority
105,515398,FACET Summary:,* Recover beam to FACET dump & scav ext. l...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1301, 'username': 'mgibbs', 'firs...",1312009104,"Swing Shift, Fri, 29-Jul-11",FACET,,,,515397.0,,
106,515398,FACET Summary:,* Recover beam to FACET dump & scav ext. l...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1062, 'username': 'cfh', 'firstna...",1312009104,"Swing Shift, Fri, 29-Jul-11",FACET,,,,515397.0,,
587,514588,MD summary,1) Cathode cleaning\n2) CQ/SQ01 scans on IN20 ...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1242, 'username': 'cmelton', 'fir...",1311717711,"Day Shift, Tue, 26-Jul-11",LCLS,,,514591.0,514587.0,"[{'attachmentid': 250859, 'url': 'https://mcce...",
588,514588,MD summary,1) Cathode cleaning\n2) CQ/SQ01 scans on IN20 ...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1374, 'username': 'aegger', 'firs...",1311717711,"Day Shift, Tue, 26-Jul-11",LCLS,,,514591.0,514587.0,"[{'attachmentid': 250859, 'url': 'https://mcce...",
1193,511531,FACET summary,<b>Program</b>\n1) OTR size v. sext\n2) etax/e...,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1242, 'username': 'cmelton', 'fir...",1311087473,"Owl Shift, Tue, 19-Jul-11",FACET,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8142,484955,* Re: Vacuum degraded in LI25,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1051, 'username': 'hvs', 'firstna...",1305161247,"Swing Shift, Wed, 11-May-11",,484935.0,,484956.0,,"[{'attachmentid': 234939, 'url': 'https://mcce...",
8143,484956,* Re: Vacuum degraded in LI25,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1003, 'username': 'stanek', 'firs...",1305161247,"Swing Shift, Wed, 11-May-11",,484935.0,,,484955.0,"[{'attachmentid': 234940, 'url': 'https://mcce...",
8144,484956,* Re: Vacuum degraded in LI25,,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1051, 'username': 'hvs', 'firstna...",1305161247,"Swing Shift, Wed, 11-May-11",,484935.0,,,484955.0,"[{'attachmentid': 234940, 'url': 'https://mcce...",
8265,483960,* Re: switched to 60Hz BGRP for K Kim.,Switched back to 120 Hz for the moment.,"{'logbookid': 122, 'name': 'MCC'}","{'authorid': 1265, 'username': 'alsberg', 'fir...",1304537663,"Day Shift, Wed, 04-May-11",,483959.0,,,,,
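
In the rows shown above, the repeated `elogid`s have identical titles and timestamps but different `author` values, which hints that the query may return one row per author for multi-author entries. That is only a guess from the visible rows. A minimal sketch for checking which columns actually differ within each duplicated id is below; `differing_columns` is a hypothetical helper and the toy frame only imitates the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "elogid": [515398, 515398, 511531],
    "title": ["FACET Summary:", "FACET Summary:", "FACET summary"],
    "author": [{"username": "mgibbs"}, {"username": "cfh"}, {"username": "cmelton"}],
})

def differing_columns(df, key="elogid"):
    # Dict-valued cells (like `author`) are not hashable, so compare string forms
    as_text = df.astype(str)
    # keep=False marks every copy of a repeated id, not just the later ones
    dupes = as_text[as_text[key].duplicated(keep=False)]
    # Number of distinct values per column within each duplicated id
    counts = dupes.groupby(key).nunique()
    return [col for col in counts.columns if (counts[col] > 1).any()]

print(differing_columns(toy))
```

Running the same function on `df1` would show whether `author` is the only column that varies across the duplicates.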


In [310]:
print('Number of total tags:\n' + str(df1['tag'].value_counts())+ '\n')
print('Number of tags on duplicates:\n' + str(df1[df1['elogid'].isin(bad_ids) == True].tag.value_counts()))

Number of total tags:
LCLS     698
FACET    331
Name: tag, dtype: int64

Number of tags on duplicates:
FACET    12
LCLS      2
Name: tag, dtype: int64


In [311]:
# Apparently there are some entries with more than 2 duplicates
df1[df1['elogid'].isin(bad_ids) == True].elogid.value_counts()[:5]

488157    6
496626    6
488162    6
488164    4
488518    2
Name: elogid, dtype: int64

In [312]:
# Dropping the duplicates
df11 = df1.drop_duplicates(subset ="elogid", keep = 'first')
print('After dropping the duplicates, the total length is now: ' +str(df11.shape[0]))
df11[df11['elogid'].isin(bad_ids) == True].shape[0]

After dropping the duplicates, the total length is now: 8592


57

In [313]:
# Will use later to make sure functions are working correctly
df11 = clean_data(df11)
print(df11.shape[0])
df11['tag'].value_counts()

802


LCLS     554
FACET    248
Name: tag, dtype: int64

### Writing second versions of functions down below to make things clearer

In [314]:
def clean_data(df):
    '''
    --- Cleans data frame ---
    INPUT
        df: dataframe (not cleaned)
    RETURN
        df: dataframe (cleaned)
    '''
    # Dropping rows without any tags (these rows are useless for us)
    df = df[df['tag'].notnull()]
    
    # Dropping useless columns (keeping the original column order)
    important_cols = {'title', 'text', 'elogid', 'tag', 'superseded_by'}
    df = df[[col for col in df.columns if col in important_cols]]

    # Dropping all rows where superseded_by is not null to essentially drop duplicates. Then drop the superseded_by column
    df = df[df['superseded_by'].isnull()]
    df = df.drop(columns=['superseded_by'])

    # Dropping rows that repeat an elogid (keeping the first copy)
    df = df.drop_duplicates(subset ="elogid", keep = 'first')
    
    # Reset the index
    df = df.reset_index(drop=True)
    
    return df

In [302]:
# Checking to see if new clean function works
s = datetime(2011, 4, 1, 0, 0).timestamp()
e = datetime(2011, 8, 1, 0, 0).timestamp()
df2 = get_data(s,e)
df2 = clean_data(df2)

In [303]:
print(df2.shape[0])
df2.tag.value_counts()

802


LCLS     554
FACET    248
Name: tag, dtype: int64

### Now create a `join_data_2011` function that will use `get_data` and `clean_data` to acquire the data up until the end of 2011

In [331]:
def join_data_2011():
    '''
    --- Collects and cleans the elog data one month at a time ---
    RETURN
        df: dataframe of cleaned data from April 2007 through December 2011
    '''
    frames = []
    for year in range(2007, 2012):
        for month in range(1, 13):
            # Skipping everything before April 2007
            if year == 2007 and month < 4:
                continue
            s = datetime(year, month, 1, 0, 0).timestamp()
            # December rolls over into January of the next year
            if month == 12:
                e = datetime(year + 1, 1, 1, 0, 0).timestamp()
            else:
                e = datetime(year, month + 1, 1, 0, 0).timestamp()
            frames.append(clean_data(get_data(s, e)))

    # Joining the months together and dropping any overlap between them
    df = pd.concat(frames, ignore_index=True)
    df = df.drop_duplicates(subset="elogid", keep='first').reset_index(drop=True)
    return df

In [333]:
s = datetime(2008, 1, 1, 0, 0).timestamp()
e = datetime(2008, 2, 1, 0, 0).timestamp()
df_temp1 = get_data(s,e)
df_temp1 = clean_data(df_temp1)


s = datetime(2008, 2, 1, 0, 0).timestamp()
e = datetime(2008, 3, 1, 0, 0).timestamp()
df_temp2 = get_data(s,e)
df_temp2 = clean_data(df_temp2)

result = pd.concat([df_temp1,df_temp2])

In [340]:
result.elogid.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
Name: elogid, dtype: bool
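
Two things stand out in the output above: the index restarts at 0 for the second month, because `pd.concat` keeps each frame's own index by default, and scanning a long boolean column by eye won't scale. A minimal sketch of a more compact check, using made-up toy ids rather than real elog data:

```python
import pandas as pd

# Toy stand-ins for two cleaned months; id 3 appears in both on purpose
jan = pd.DataFrame({"elogid": [1, 2, 3], "tag": ["LCLS", "LCLS", "FACET"]})
feb = pd.DataFrame({"elogid": [3, 4], "tag": ["FACET", "LCLS"]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index
result = pd.concat([jan, feb], ignore_index=True)

# A single boolean answers "is there any overlap between the months?"
has_overlap = result["elogid"].duplicated().any()

# Keep the first copy of each id and renumber the rows again
deduped = result.drop_duplicates(subset="elogid", keep="first").reset_index(drop=True)

print(has_overlap, len(result), len(deduped))
```

For the two real months joined above, every value printed is `False`, so `.any()` would report no overlap there.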