<left><font size=3>March 2, 2022 / [Marisol Hernandez](https://www.linkedin.com/in/marisol-y-hernandez/)</font></left>
# <left><font size=6> *University of the Pacific*<br>Theses/Dissertations Co-occurrence Matrix</font></left>  
<left><font size=3>Building a co-occurrence matrix using [d3.js](https://d3js.org) to analyze overlapping topics in dissertations.</font></left>

---

## Table of Contents

[I. Objective](#objective)  
[II. Web Scraping](#scraping)  
[III. Combine Data and Sample](#data)  
[IV. Topic Co-occurrences](#pairwise)  
[V. Preparing the Data](#json)  
[VI. Export to JSON](#export)
    

## Objective
Sometimes we think we have found the holy grail: that one piece of literature that perfectly supports our research. But as we skim the paper, we find ourselves begging for more. What else is out there? There has to be other related work. A great place to start when looking for related literature is the list of resources cited in the bibliography. I said "great," but is it really? 

There has to be **another way**, a more efficient, appealing, user-friendly solution.

Introducing the co-occurrence matrix. I went to work and developed a matrix diagram that visualizes overlapping topics in a sample of dissertations from my university. Each colored cell represents a pair of published works and their similarity; the darker the color, the greater the similarity.

In doing so, I had two main jobs: **get the data** and **build the d3 visualization**.

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import os
import urllib.request
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
from random import sample

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy

#nltk.download('punkt')

## Web Scraping
When a student completes their thesis or dissertation, their work is published on this site: https://scholarlycommons.pacific.edu/uop_etds/. With this site, I had to do the following:

1. Find the article listing of each thesis/dissertation.
2. Retrieve the link to the thesis/dissertation overview.
3. From that page, retrieve the title, author, group (department), and keywords.

First, using **requests** and **BeautifulSoup** I retrieve the contents of the URL. Additionally, I initialize several lists where I will store the data items.

In [4]:
titles = []
authors = []
groups = []
keywords = []

The loop below does all of the scraping. The numbered subsections after it walk through each step on a single article.

In [5]:
base_url = 'https://scholarlycommons.pacific.edu/uop_etds/'

for i in range(1, 8):
    # the first listing page is index.html; later pages are index.2.html, index.3.html, ...
    extension = 'index.html' if i == 1 else 'index.' + str(i) + '.html'
    url = base_url + extension
    
    # retrieve contents of url
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    # search for article listing
    for string in soup.select("[class='article-listing']"):
        string = BeautifulSoup(str(string), "html.parser")

        # retrieve article overview
        articleListing = string.find('a').get('href')
        response = requests.get(articleListing)
        articleSoup = BeautifulSoup(response.text, "html.parser") 

        # retrieve title
        try:
            title = articleSoup.find("meta", property="og:title")['content']
            titles.append(title)
        except (TypeError, AttributeError, KeyError):  # tag or attribute missing
            titles.append(np.nan)

        # retrieve authors
        try:
            author = articleSoup.find("meta", property="article:author")['content']
            authors.append(author)
        except (TypeError, AttributeError, KeyError):  # tag or attribute missing
            authors.append(np.nan)

        # retrieve group
        try:
            group = articleSoup.find('div', {'class':'element', 'id':'department'}).find('p').get_text()
            groups.append(group)
        except (TypeError, AttributeError, KeyError):  # tag or attribute missing
            groups.append(np.nan)

        # retrieve keywords
        try:
            keyword = articleSoup.find("meta", attrs={"name":"keywords"})['content']
            keywords.append(keyword)
        except (TypeError, AttributeError, KeyError):  # tag or attribute missing
            keywords.append(np.nan)
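
The pagination pattern used in the loop above (`index.html` for the first page, `index.N.html` for later pages) can be isolated into a small helper. This is a minimal sketch and not part of the original notebook: `listing_urls` is a hypothetical name, and the page count of 7 simply mirrors `range(1, 8)`.

```python
BASE_URL = "https://scholarlycommons.pacific.edu/uop_etds/"


def listing_urls(base_url, n_pages):
    """Return the listing-page URLs for pages 1..n_pages (hypothetical helper)."""
    urls = []
    for page in range(1, n_pages + 1):
        # Page 1 has no number in its file name; later pages do.
        filename = "index.html" if page == 1 else f"index.{page}.html"
        urls.append(base_url + filename)
    return urls


pages = listing_urls(BASE_URL, 7)
print(pages[0])
print(pages[-1])
```

Building the URLs up front keeps the scraping loop focused on parsing, and makes it easy to check the page list before sending any requests.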

### 1. Find the article listing of each thesis/dissertation.
This is the first article listing. As you can see, it is a fragment of HTML. Note that I use the `prettify()` method only to make the printout easier to read; it is not used in my code above.

In [6]:
string = soup.select("[class='article-listing']")[0]
string = BeautifulSoup(str(string), "html.parser")
print(string.prettify())

<p class="article-listing">
 <strong>
  Thesis - Pacific Access Restricted:
 </strong>
 <a href="https://scholarlycommons.pacific.edu/uop_etds/273">
  Human Cytochrome P450 3A4 Over-Expressing IEC-18 and MDCK Cell Lines as an In-Vitro Model to Assess Gut Permeability and the Enzyme Metabolism
 </a>
 , Swathi Vangala
</p>


### 2. Retrieve the link to the thesis/dissertation overview.
Above you may notice a URL. This is the link to the thesis/dissertation overview. I retrieve it from the `href` attribute of the anchor `<a>` tag. An anchor has two parts: the `href` attribute, which holds the URL, and the clickable text between the opening and closing tags, called the "anchor text."

In [7]:
articleListing = string.find('a').get('href')
articleListing

'https://scholarlycommons.pacific.edu/uop_etds/273'
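
To make the difference between the `href` attribute and the anchor text concrete, here is a small dependency-free sketch. It deliberately swaps in the standard library's `html.parser` instead of BeautifulSoup, and the `AnchorCollector` class and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


snippet = (
    '<p class="article-listing"><strong>Thesis:</strong> '
    '<a href="https://scholarlycommons.pacific.edu/uop_etds/273">Example title</a>'
    ', Example Author</p>'
)

collector = AnchorCollector()
collector.feed(snippet)
collector.close()
print(collector.links)
```

The URL comes from the tag's attributes, while the anchor text arrives separately as character data between `<a>` and `</a>`.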

As before, we can use **requests** and **BeautifulSoup** to retrieve the contents of the overview URL.

In [8]:
response = requests.get(articleListing)
articleSoup = BeautifulSoup(response.text, "html.parser") 
# First 500 characters
print(articleSoup.prettify()[0:500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- inj yui3-seed: -->
  <script src="//cdnjs.cloudflare.com/ajax/libs/yui/3.6.0/yui/yui-min.js" type="text/javascript">
  </script>
  <script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js" type="text/javascript">
  </script>
  <!-- Adobe Analytics SiteCatalyst -->
  <script src="https://assets.adobedtm.com/376c5346e33126fdb6b2dbac81e307cbacfd7935/satelliteLib-fac053ad0cbd6e703a1df9a51f69fde523024cef.js" type="text/javascript">
  </s


### 3. From that page, retrieve the title, author, group (department), and keywords.
We can find the title, author, group (department), and keywords on this page. Below, I retrieve the title and author.

In [9]:
# retrieve title
title = articleSoup.find("meta", property="og:title")['content']
print(title)

Human Cytochrome P450 3A4 Over-Expressing IEC-18 and MDCK Cell Lines as an In-Vitro Model to Assess Gut Permeability and the Enzyme Metabolism


In [10]:
# retrieve author
author = articleSoup.find("meta", property="article:author")['content']
print(author)

Swathi Vangala


Similarly, I retrieve the group (department).

In [11]:
# retrieve group
group = articleSoup.find('div', {'class':'element', 'id':'department'}).find('p').get_text()
print(group)

Pharmaceutical and Chemical Sciences


Lastly, I retrieve the keywords.

In [12]:
# retrieve keywords
keyword = articleSoup.find("meta", attrs={"name":"keywords"})['content']
print(keyword)

Pharmacy sciences, Health and environmental sciences


## Combine Data And Sample
Using the lists, I combine all the data into one dataframe. I then select 73 works from the education- and music-related departments and add my professor's dissertation, for a total sample of 74 theses/dissertations.

In [30]:
data = pd.DataFrame({'original title':titles, 'author':authors, 'group':groups, 'keywords':keywords})
data['title'] = data['original title'] + ' by ' + data['author']

# keep the three columns we need and remove rows with NA's
data = data[['title', 'group', 'keywords']].dropna()

In [31]:
data['group'].unique().tolist()

['Learning, Leadership and Change',
 'Pharmaceutical and Chemical Sciences',
 'Psychology',
 'Engineering',
 'Music Therapy',
 'Educational and School Psychology',
 'Department of Endodontics',
 'Communication',
 'Curriculum and Instruction',
 'Department of Orthodontics',
 'Educational Administration and Leadership',
 'Biological Sciences',
 'Speech-Language Pathology',
 'Food Studies',
 'Sport Sciences',
 'Education',
 'International Studies',
 'Music Education',
 'Benerd School of Education',
 'Intercultural Relations',
 'Health, Exercise, and Sport Sciences',
 'Engineering Science',
 'School Psychology',
 'Chemistry',
 'Graduate School',
 'Dentistry']

In [37]:
data1 = data[data['group']=='Learning, Leadership and Change'].head(19)
data2 = data[data['group']=='Educational and School Psychology'].head(19)
data3 = data[data['group']=='Educational Administration and Leadership'].head(18)
data4 = data[data['group']=='Education']
data5 = data[data['group']=='Music Education']
data6 = data[data['group']=='Benerd School of Education']
data7 = data[data['group']=='Music Therapy']

sample = pd.concat([data1, data2, data3, data4, data5, data6, data7])

# Add Dana's dissertation
dana = {'title':'How much do you care about education? Exploring fluctuations of public interest in education issues among top national priorities in the U.S. by Dana Nehoran',
       'group':'Learning, Leadership and Change',
       'keywords':'Education, Information science, Political science, Education, Mass Media, Natural Language Processing, Polls, Public Opinion Research, Topic Modeling'
       }

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
sample = pd.concat([sample, pd.DataFrame([dana])], ignore_index=True)
sample.head()

| | title | group | keywords |
|---|---|---|---|
| 0 | Hostile Takeover: The Effects of Work Stress b... | Learning, Leadership and Change | Educational leadership, female principals, job... |
| 1 | All IN PIX YPAR: A YOUTH PARTICIPATORY ACTION ... | Learning, Leadership and Change | Disability studies, Education policy, Secondar... |
| 2 | Intrinsic motivation is not enough: Exploring ... | Learning, Leadership and Change | Higher education, career advancement, faculty ... |
| 3 | WHERE AM I?: THE ABSENCE OF THE BLACK MALE FRO... | Learning, Leadership and Change | African American studies, Black males, Executi... |
| 4 | EXPLORING THE IDENTIFICATION OF AMERICAN INDIA... | Learning, Leadership and Change | American Indian, Autism, Indigenous Methodolog... |
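
The department filters above repeat one pattern seven times. As a sketch only, the same selection can be driven by a dictionary of per-department caps. The `caps` dictionary, the toy `records` frame, and all of its values are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the scraped dataframe (hypothetical values).
records = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "group": ["Education", "Education", "Education",
              "Music Therapy", "Psychology", "Music Therapy"],
})

# Maximum number of rows to keep per department; None means "keep all".
caps = {"Education": 2, "Music Therapy": None}

parts = []
for group_name, cap in caps.items():
    rows = records[records["group"] == group_name]
    parts.append(rows if cap is None else rows.head(cap))

# Departments not listed in caps (here, Psychology) are left out entirely.
subset = pd.concat(parts, ignore_index=True)
print(subset)
```

Keeping the caps in one place makes it easier to see, and to change, how many works each department contributes.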


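## Topic Co-occurrences

For every pair of works in the sample, I count the keywords they share (ignoring case) and record which keywords those are.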

In [223]:
# store sample items into lists
titles = list(sample['title'])
groups = list(sample['group'])
corpus = list(sample['keywords'])

In [224]:
keywordsList = []
countLists = []

for i in range(0, len(corpus)):
    count = []
    commonKeys = []
    currentKeys = list(set(corpus[i].lower().split(', ')))
    
    for j in range(0, len(corpus)):
        nextKeys = list(set(corpus[j].lower().split(', ')))

        intersection = [key for key in currentKeys if key in nextKeys]
        noCommonKeys = len(intersection)
    
        count.append(noCommonKeys)
        commonKeys.append(intersection)
    
    countLists.append(count)
    keywordsList.append(commonKeys)

In [225]:
# Convert to array
arr = np.array(countLists)
np.fill_diagonal(arr, 0) # fill the diagonal with 0

arr2 = np.array(keywordsList, dtype=object)
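
The counting step can also be written with set intersections. The following is a minimal pure-Python sketch on a made-up three-document corpus. `keyword_set` and `cooccurrence` are hypothetical names, and two choices differ from the cells above and are assumptions of this sketch: keywords are stripped of surrounding whitespace, and the shared-keyword lists are sorted and left empty on the diagonal.

```python
def keyword_set(keyword_string):
    """Split a comma-separated keyword string into a lowercase set."""
    return {key.strip().lower() for key in keyword_string.split(",") if key.strip()}


def cooccurrence(corpus):
    """Return (counts, shared), two n-by-n nested lists.

    counts[i][j] is the number of keywords documents i and j share
    (0 on the diagonal); shared[i][j] is the sorted list of those keywords.
    """
    key_sets = [keyword_set(entry) for entry in corpus]
    n = len(key_sets)
    counts = [[0] * n for _ in range(n)]
    shared = [[[] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a document is not compared with itself
            common = sorted(key_sets[i] & key_sets[j])
            counts[i][j] = len(common)
            shared[i][j] = common
    return counts, shared


toy_corpus = [
    "Education, Music, Leadership",
    "music, Psychology",
    "Leadership, Education, Policy",
]
counts, shared = cooccurrence(toy_corpus)
for row in counts:
    print(row)
```

Converting each keyword string to a set once, rather than inside the inner loop, avoids repeating the same split for every pair.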

## Preparing the Data

Here, we prepare the data for the JSON file. In this first cell, I build the nodes, one per thesis/dissertation, each with the following format:

`{'group':group, 'index':index, 'name':name}`

In [226]:
group = ''
index = 0
name = ''

allNodes = []

for i in range(0, len(corpus)):
    group = groups[i]
    index = i
    name = titles[i]
    
    node = {'group':group, 'index':index, 'name':name}
    allNodes.append(node)

In [227]:
# First 3 nodes
allNodes[0:3]

[{'group': 'Learning, Leadership and Change',
  'index': 0,
  'name': 'Hostile Takeover: The Effects of Work Stress by Monica D. Barletta'},
 {'group': 'Learning, Leadership and Change',
  'index': 1,
  'name': 'All IN PIX YPAR: A YOUTH PARTICIPATORY ACTION RESEARCH STUDY OF STUDENTS WITH SIGNIFICANT DISABILITIES IN HIGH SCHOOL by Jessica L. Jennings'},
 {'group': 'Learning, Leadership and Change',
  'index': 2,
  'name': 'Intrinsic motivation is not enough: Exploring the decision to pursue promotion to full professor by Margaret Roberts'}]
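
The node-building loop can be condensed with `enumerate` and `zip`. This is a sketch on made-up titles and groups; `build_nodes` is a hypothetical helper, not a function from the notebook.

```python
def build_nodes(titles, groups):
    """Return one {'group', 'index', 'name'} dict per work, in input order."""
    if len(titles) != len(groups):
        raise ValueError("titles and groups must have the same length")
    return [
        {"group": group, "index": index, "name": title}
        for index, (title, group) in enumerate(zip(titles, groups))
    ]


toy_titles = ["Example A by Author One", "Example B by Author Two"]
toy_groups = ["Education", "Music Therapy"]
nodes = build_nodes(toy_titles, toy_groups)
print(nodes)
```

The length check guards against the two lists drifting out of step, which `zip` alone would silently hide by truncating to the shorter list.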

Next, I build the links. Each link stores the number of keywords that two works share, along with the shared keywords themselves. Links have the following format:

`{'source':source, 'target':target, 'value':value, 'keywords':keys}`

In [228]:
source = 0
target = 0
value = 0

allLinks = []

for i in range(0, len(corpus)):
    for j in range(0, len(corpus)):
        source = i
        target = j
        value = int(arr[i][j])
        keys = arr2[i][j]
    
        link = {'source':source, 'target':target, 'value':value, 'keywords':keys}
        allLinks.append(link)        

In [229]:
# First 3 links
allLinks[:3]

[{'source': 0,
  'target': 0,
  'value': 0,
  'keywords': ['work-family border theory',
   'female principals',
   'work stress',
   'work-family conflict',
   'spillover',
   'educational leadership',
   'job satisfaction']},
 {'source': 0, 'target': 1, 'value': 0, 'keywords': []},
 {'source': 0, 'target': 2, 'value': 0, 'keywords': []}]
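
The nested loop over all (source, target) pairs can be expressed with `itertools.product`. This is a sketch under stated assumptions: `build_links` is a hypothetical helper, and the 2-by-2 inputs are toy values standing in for `arr` and `arr2`.

```python
from itertools import product


def build_links(counts, shared):
    """Return one link dict per ordered (source, target) pair."""
    n = len(counts)
    return [
        {
            "source": i,
            "target": j,
            "value": int(counts[i][j]),      # plain int, so it is JSON-serializable
            "keywords": list(shared[i][j]),  # keywords the two works have in common
        }
        for i, j in product(range(n), repeat=2)
    ]


toy_counts = [[0, 1], [1, 0]]
toy_shared = [[[], ["music"]], [["music"], []]]
links = build_links(toy_counts, toy_shared)
print(links[1])
```

`product(range(n), repeat=2)` yields the pairs in the same row-by-row order as the two nested `for` loops.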

## Export to JSON
Lastly, we export the data to a JSON file. This file is what the d3 visualization reads.

In [232]:
import json

data = {'nodes':allNodes, 'links':allLinks}

with open("/Users/marisolhernandez/Desktop/SKAEL/Co-occurrence Matrix/data/data.json", "w") as outfile:
    json.dump(data, outfile)
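
Before handing the file to the visualization, it can be worth reloading it and checking that every link points at an existing node. The sketch below does this round trip on a made-up graph written to a temporary directory; the `graph` contents and the `links_in_range` check are illustrative assumptions, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

# Toy graph in the same nodes/links shape as above (hypothetical values).
graph = {
    "nodes": [
        {"group": "Education", "index": 0, "name": "Example A"},
        {"group": "Music Therapy", "index": 1, "name": "Example B"},
    ],
    "links": [
        {"source": 0, "target": 1, "value": 1, "keywords": ["music"]},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data.json"
    with open(out_path, "w", encoding="utf-8") as outfile:
        json.dump(graph, outfile)
    with open(out_path, encoding="utf-8") as infile:
        reloaded = json.load(infile)

# Every link must reference node positions that actually exist.
node_count = len(reloaded["nodes"])
links_in_range = all(
    0 <= link["source"] < node_count and 0 <= link["target"] < node_count
    for link in reloaded["links"]
)
print(reloaded == graph, links_in_range)
```

The `int(...)` cast in the link-building cell matters here, because `json.dump` cannot serialize NumPy integer types.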