# Pensieve: Chrome History Exploration Tool

This project develops an application that helps users browse their Chrome browsing history more efficiently. It is built on Elasticsearch and Chrome's existing history database.

The user will need to load their History file manually into the Colab notebook environment. **If this is not possible, the project evaluator may use the History SQLite file that I have attached as part of my submission.** This file has not been included on GitHub due to privacy concerns.

**For Mac users, the file will be found in:**
* **~/Library/Application\ Support/Google/Chrome/Default/History**

The profile folder may be named 'Profile 1' instead of 'Default'.

Follow these steps to reach the file.

1. Open the 'Finder' application and click on 'Go' in the upper toolbar.
2. Hold down the 'Option' key on the keyboard to make hidden entries visible. You should see a 'Library' entry appear in the menu.
3. In Library, navigate to Application Support, then Google, then Chrome, then Default or Profile 1.
4. Copy the 'History' file to the desktop, keeping the name 'History'.
5. Upload the file into the Colab notebook's 'Files' panel using the file-shaped icon on the left.


**For Windows users, the file will be found in one of the following:**
* **C:\Users\<username>\AppData\Local\Google\Chrome\User Data\Default**
* **C:\Users\<username>\AppData\Local\Google\Chrome\User Data\Default\Cache**


Note that the code only works with Chrome browsing history. Other browsers are not currently supported.
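As a rough sketch of the locations listed above, the candidate paths can be built programmatically. The helper name `candidate_history_paths` and the home directory `/Users/alice` are hypothetical, and the sketch assumes the file is named `History` and sits directly inside the profile folder on both platforms.

```python
from pathlib import PurePosixPath, PureWindowsPath


def candidate_history_paths(system, home, profiles=("Default", "Profile 1")):
    """Return likely locations of Chrome's History file (hypothetical helper).

    `system` follows the values returned by platform.system():
    "Darwin" for macOS and "Windows" for Windows.
    """
    if system == "Darwin":
        base = PurePosixPath(home) / "Library" / "Application Support" / "Google" / "Chrome"
    elif system == "Windows":
        base = PureWindowsPath(home) / "AppData" / "Local" / "Google" / "Chrome" / "User Data"
    else:
        raise ValueError(f"Unsupported system: {system}")
    # One candidate per profile folder name mentioned in the instructions.
    return [str(base / profile / "History") for profile in profiles]


# Example with a made-up home directory.
for path in candidate_history_paths("Darwin", "/Users/alice"):
    print(path)
```

Only the paths are computed here; the file still has to be copied and uploaded by hand as described in the steps.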

Note: If you are getting a 'malformed' error, please try again in 10 seconds...
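The manual retry suggested in the note can also be automated. The following is a minimal sketch, assuming the error is temporary (for example, because the upload has not finished yet); `read_table` is a hypothetical helper and is not part of the notebook.

```python
import sqlite3
import time


def read_table(db_path, table, retries=3, delay=10):
    """Read every row of `table`, retrying when SQLite reports an error.

    Hypothetical helper: waits `delay` seconds between attempts and
    re-raises the last error if all attempts fail.
    """
    last_error = None
    for attempt in range(retries):
        conn = None
        try:
            conn = sqlite3.connect(db_path)
            conn.row_factory = sqlite3.Row
            # Table names cannot be bound as SQL parameters,
            # so only pass trusted, hard-coded names here.
            return [dict(row) for row in conn.execute(f"SELECT * FROM {table}")]
        except sqlite3.DatabaseError as error:
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay)
        finally:
            if conn is not None:
                conn.close()
    raise last_error
```

A call such as `read_table("History", "urls")` would then wait 10 seconds between attempts, matching the interval suggested in the note.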



In [3]:
import sqlite3

# Start with no connection so that the finally block is safe even if connect() fails.
conn = None

try:
    # Making a connection between the sqlite3 database and the Python program
    conn = sqlite3.connect("History")

    # This is the important part: setting row_factory to sqlite3.Row lets us
    # convert each row into a dictionary keyed by column name.
    conn.row_factory = sqlite3.Row

    # Creating a cursor object using the connection object
    c = conn.cursor()

    # Executing our SQL query to load urls
    c.execute('SELECT * FROM urls')
    urls = [dict(row) for row in c.fetchall()]

    # Executing our SQL query to load searches
    c.execute('SELECT * FROM keyword_search_terms')
    searches = [dict(row) for row in c.fetchall()]

    # Executing our SQL query to load the visits data
    c.execute('SELECT * FROM visits')
    visits = [dict(row) for row in c.fetchall()]

except sqlite3.Error as error:
    print("Failed to execute the above query", error)

finally:
    # If the connection is open, close it and confirm to the user.
    if conn:
        conn.close()
        print("The sqlite connection is successfully closed, you may proceed...")

The sqlite connection is successfully closed, you may proceed...
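To illustrate what the `row_factory` setting changes, here is a small self-contained sketch using an in-memory database. The table layout and the sample row are made up for the example and only loosely mirror Chrome's `urls` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER, url TEXT, title TEXT, visit_count INTEGER)")
conn.execute("INSERT INTO urls VALUES (1, 'https://example.com', 'Example', 3)")

# Without a row factory, each row is a plain tuple.
plain_rows = conn.execute("SELECT * FROM urls").fetchall()
print(plain_rows)  # [(1, 'https://example.com', 'Example', 3)]

# With sqlite3.Row, each row also knows its column names,
# so dict(row) produces a dictionary keyed by column name.
conn.row_factory = sqlite3.Row
dict_rows = [dict(row) for row in conn.execute("SELECT * FROM urls")]
print(dict_rows)
# [{'id': 1, 'url': 'https://example.com', 'title': 'Example', 'visit_count': 3}]

conn.close()
```

Dictionaries are convenient here because they can later be passed to Elasticsearch as documents with little extra work.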


The next piece of code installs the relevant dependencies. It has been adapted from a UC Berkeley School of Information Colab notebook provided for an assignment in this class. The link is provided here:
https://colab.research.google.com/drive/1B15lS5j-CkzZf3xXLQS9ORfEYhSLsl7n#scrollTo=DPPCGPqaBb0_

In [4]:

# This code installs relevant dependencies for this project.
# NOTE: this cell will take about 30 seconds to run

!pip install elasticsearch==7.17
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.10.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting elasticsearch==7.17
  Downloading elasticsearch-7.17.0-py2.py3-none-any.whl (385 kB)
Installing collected packages: elasticsearch
Successfully installed elasticsearch-7.17.0
--2022-12-14 03:03:53--  https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
Resolving artifacts.elastic.co (artifacts.elastic.co)... 34.120.127.130, 2600:1901:0:1d7::
Connecting to artifacts.elastic.co (artifacts.elastic.co)|34.120.127.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318801277 (304M) [application/x-gzip]
Saving to: ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’


2022-12-14 03:04:11 (16.9 MB/s) - ‘elasticsearch-7.10.1-linux-x86_64.tar.gz’ saved [318801277/318801277]



The next step is to set up our Elasticsearch client and the index that we will search. We create a 'searches' index and specify which fields of the data must be indexed for searching and ranking.

Apart from the text fields, such as the URL title and the Google search query, we also index the date and time at which each URL was last accessed. All of these fields are used to search and rank our results.
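Because the index below is configured with BM25 similarity, a small worked example of the ranking function may help. This is a sketch of the textbook Okapi BM25 formula in plain Python, not Elasticsearch's actual implementation, which may differ in constant factors. The values `k1=1.2` and `b=0.75` are Elasticsearch's documented defaults; the function name and the toy corpus are illustrative.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalisation: longer-than-average documents are penalised.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Toy corpus of tokenised titles (illustrative only).
corpus = [
    "mac miller circles".split(),
    "best epub readers mac".split(),
    "pip not working".split(),
]
scores = [bm25_score(["mac"], doc, corpus) for doc in corpus]
print(scores)
```

In this toy corpus the shorter title containing "mac" scores higher than the longer one, and the title without the term scores zero.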



In [20]:
# Let's set up our ElasticSearch instance on our linux system. 
# NOTE: this will take ~1 minute to run
import os
from subprocess import Popen, PIPE, STDOUT

server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30

# NOTE: This code is used to set up the index that will be searched through. 
# The output should look like: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'searches'}

# Let's set up the infrastructure for our elasticsearch index
from elasticsearch import Elasticsearch, helpers

# This code connects to the ElasticSearch instance we started in the previous cell.
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# We specify that we would like to use BM25 similarity and list the fields of the data that we would like to index.
es.indices.create(
    index="searches",
    settings={
        "similarity": {
            "default": {"type": "BM25"}
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "index": True},
            "term": {"type": "text", "index": True},
            "normalized_term": {"type": "text", "index": True},
            "fixed_visit_time": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            }
        }
    })

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'searches'}

In [6]:
# The below function converts the time values stored in the Chrome database.
# Chrome uses WebKit time, which measures microseconds from 1st Jan, 1601 (!)

import datetime
import re
def date_from_webkit(webkit_timestamp):
    epoch_start = datetime.datetime(1601,1,1)
    delta = datetime.timedelta(microseconds=int(webkit_timestamp))
    final = epoch_start + delta
    return str(final)[:19]

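# Function to return only the domain name using a regex.
# Note: re.findall returns a list, so the result looks like ['google.com'].

def return_domain(url_link):
  # A raw string avoids invalid-escape warnings for sequences such as \/ and \.
  domain_name = re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)', url_link)
  return domain_name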
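As a quick check of the two helpers, the sketch below repeats them in self-contained form and applies them to sample inputs. The constants `WEBKIT_EPOCH` and `UNIX_EPOCH` and the sample URL are introduced here for illustration only.

```python
import datetime
import re

WEBKIT_EPOCH = datetime.datetime(1601, 1, 1)
UNIX_EPOCH = datetime.datetime(1970, 1, 1)


def date_from_webkit(webkit_timestamp):
    # Add the microsecond count to the WebKit epoch and keep 'YYYY-MM-DD HH:MM:SS'.
    final = WEBKIT_EPOCH + datetime.timedelta(microseconds=int(webkit_timestamp))
    return str(final)[:19]


def return_domain(url_link):
    # Same regex as in the notebook; findall returns a list of matches.
    return re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)', url_link)


# Microseconds between the WebKit epoch (1601) and the Unix epoch (1970).
offset_us = int((UNIX_EPOCH - WEBKIT_EPOCH).total_seconds()) * 1_000_000
print(offset_us)                    # 11644473600000000
print(date_from_webkit(offset_us))  # 1970-01-01 00:00:00

print(return_domain("https://www.google.com/search?q=mac"))  # ['google.com']
```

The offset of 11,644,473,600 seconds is also what one would subtract to turn a WebKit timestamp into a Unix timestamp.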


In [26]:
'''
The below are all functions to be used to index the information correctly and display the results. 
Function based on: https://bit.ly/3wu0S4E and the assignment document. 
'''

from IPython.display import display, HTML

def table(category, query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    html += "<h3>[%s] %s</h3><table><thead><tr><th>Score</th><th>URL Title</th><th>Domain Name</th><th>Last Visit</th></tr></thead>" % (category, query)
    for score,title,domain,lastvisit in rows:
        html += "<tr><td>%.4f</td><td>%s</td><td>%s</td><td>%s</td></tr>" % (score, title,domain,lastvisit)
    html += "</table>"

    display(HTML(html))

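def search(query, limit):
  # Match the query against the Google search term (boosted 3x) and the page title,
  # keeping only pages last visited on or after 2022-01-21.
  es_query = {
      "bool": {
          "must": [
              {"multi_match": {"query": query, "fields": ["term^3", "title"]}}
          ],
          "filter": {
              "range": {"fixed_visit_time": {"gte": "2022-01-21 00:00:00"}}
          }
      }
  }

  results = []
  # size=limit caps the number of hits returned; passing query= and size= as
  # keyword arguments avoids the deprecated 'body' parameter.
  for result in es.search(index="searches", query=es_query, size=limit)["hits"]["hits"]:
    source = result["_source"]
    # Scores are capped at 18 and scaled into the range [0, 1] for display.
    results.append((min(result["_score"], 18) / 18, source["title"], source["domain"], source["fixed_visit_time"]))

  return results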
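The query logic above can be separated from the Elasticsearch call so that it is easy to inspect and adjust. The following is a minimal sketch; `build_query` and `normalise_score` are hypothetical helpers, with defaults taken from the values hard-coded in `search`.

```python
def build_query(query, since="2022-01-21 00:00:00", term_boost=3):
    """Build the bool query: a boosted multi_match plus a date-range filter."""
    return {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": query,
                        # 'term^3' weights the Google search term three times
                        # as heavily as the page title.
                        "fields": [f"term^{term_boost}", "title"],
                    }
                }
            ],
            # Filters restrict the result set without affecting the score.
            "filter": {"range": {"fixed_visit_time": {"gte": since}}},
        }
    }


def normalise_score(score, cap=18):
    """Cap the raw score and scale it into the range [0, 1]."""
    return min(score, cap) / cap


print(build_query("mac"))
print(normalise_score(9))   # 0.5
print(normalise_score(40))  # 1.0
```

Keeping the cut-off date and the boost as parameters makes it possible to try other settings without editing the query dictionary by hand.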




In [25]:
'''
The next bit of code prepares our information to be inserted into the index 
using the bulk helper provided by the Elasticsearch Python client. The data itself
is already in dictionaries, so we only modify some fields (tagging, etc.) before pushing. 
'''

rows = 0

# Creating a shallow copy of urls to insert into the index: the list is new,
# but the row dictionaries are shared with urls and are updated in place.
buffer = urls.copy()

# These are lists of different tags that may arise for our urls. 
# They are defined here but are not applied in this cell.
social_media = ['twitter.com', 'facebook.com', 'instagram.com', 'web.whatsapp.com']
music_platforms = ['open.spotify.com','music.apple']
movies_tv = ['netflix.com', 'primevideo.com', 'hulu.com', 'showtime.com', 'tv.apple']
news_reading = ['cnn', 'guardian', 'newyorkpost']
research_reading = ['arxiv', 'jstor', 'elsevier']


for url in buffer:

  # Here, we add an id and index to the data before pushing it to Elasticsearch. 
  es_data = {"_id": rows, "_index": "searches"}
  url.update(es_data)

  # Attach the Google search term (if any) recorded for this url.
  # The loop variable is not named 'search' so that it does not overwrite the search() function.
  for search_row in searches:
    if url["id"] == search_row["url_id"]:
      url.update(search_row)
  
  url["domain"] = return_domain(url["url"])

  url["fixed_visit_time"] = date_from_webkit(url["last_visit_time"])

  rows+=1

# Pushing the data into Elasticsearch. 
helpers.bulk(es, buffer)
buffer = []

# The number of rows is the number of distinct urls in the history. 
print("Total urls inserted: {}".format(rows))


# The below will confirm the information we have about the index we have created. 
es.indices.get_mapping(index="searches")

Total urls inserted: 16108


{'searches': {'mappings': {'properties': {'domain': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'fixed_visit_time': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
    'hidden': {'type': 'long'},
    'id': {'type': 'long'},
    'keyword_id': {'type': 'long'},
    'last_visit_time': {'type': 'long'},
    'normalized_term': {'type': 'text'},
    'term': {'type': 'text'},
    'title': {'type': 'text'},
    'typed_count': {'type': 'long'},
    'url': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'url_id': {'type': 'long'},
    'visit_count': {'type': 'long'}}}}}
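The preparation step compares every url against every search term. A dictionary lookup keyed on `url_id` gives the same result with a single pass over each list. The sketch below is one possible alternative, not the notebook's method; `build_actions` is a hypothetical helper, the sample rows are made up, and the `domain` and `fixed_visit_time` fields are left out for brevity.

```python
def build_actions(urls, searches, index="searches"):
    """Yield one bulk action per url, merged with its search term if present."""
    # Later rows overwrite earlier ones for the same url_id,
    # which mirrors repeated calls to dict.update().
    searches_by_url = {row["url_id"]: row for row in searches}
    for row_number, url in enumerate(urls):
        doc = dict(url)  # copy, so the original row is left untouched
        doc.update(searches_by_url.get(url["id"], {}))
        doc["_id"] = row_number
        doc["_index"] = index
        yield doc


# Made-up sample rows, loosely following the urls and keyword_search_terms tables.
sample_urls = [
    {"id": 10, "url": "https://www.google.com/search?q=mac", "title": "mac - Google Search"},
    {"id": 11, "url": "https://example.com", "title": "Example"},
]
sample_searches = [{"url_id": 10, "term": "mac", "normalized_term": "mac"}]

actions = list(build_actions(sample_urls, sample_searches))
print(actions[0]["term"], actions[0]["_id"], actions[0]["_index"])
```

Since `helpers.bulk` accepts any iterable of actions, the generator could be passed to it directly, for example `helpers.bulk(es, build_actions(urls, searches))`.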

In [27]:
'''
This code will run the query on the basis of a user input. 
'''
query_user = input('Enter your query here: ')
table("Elasticsearch", query_user, search(query_user, 10))

Enter your query here: mac




| Score | URL Title | Domain Name | Last Visit |
|---|---|---|---|
| 0.8104 | mac miller circles - Google Search | ['google.com'] | 2022-09-18 21:09:38 |
| 0.7291 | foxit reader download mac - Google Search | ['google.com'] | 2022-08-11 18:51:12 |
| 0.7291 | best epub readers mac - Google Search | ['google.com'] | 2022-08-20 05:17:23 |
| 0.7291 | pip not working mac - Google Search | ['google.com'] | 2022-08-20 06:06:31 |
| 0.7291 | drawsvg cairo error mac - Google Search | ['google.com'] | 2022-08-25 09:01:01 |
| 0.7291 | best epub reader mac - Google Search | ['google.com'] | 2022-09-18 17:28:34 |
| 0.7291 | mac miller kurt vonnegut - Google Search | ['google.com'] | 2022-09-22 23:14:17 |
| 0.7291 | what is mac "rsession" - Google Search | ['google.com'] | 2022-11-03 22:15:53 |
| 0.6627 | best free pdf readers mac - Google Search | ['google.com'] | 2022-08-11 18:50:45 |
| 0.6627 | splashtop display won't open mac - Google Search | ['google.com'] | 2022-08-11 19:17:57 |
