# Analysis on LSE Professors' Research Interests

## Table of Contents

## Introduction

### Our Motivation
As undergraduate students at LSE taking modules across multiple departments, such as Statistics, Mathematics, Economics, Finance and the Data Science Institute, we observed that many topics in different modules intersect. This inspired us to think about how different departments at LSE could integrate their knowledge and interests by collaborating on research. Our goal is to encourage professors to take an interdisciplinary approach to their research, which would let them draw on a variety of methods and perspectives to gain more comprehensive insight into complex problems.

To quantify this, we chose a set of variables that let us analyse the problem on a standardised and comparable basis. Going through LSE's website, we noticed that every professor in each department has their own webpage. We picked eight departments that we judged to be similar in subject matter, scraped the chosen variables from these webpages, and put them in a dataframe for analysis.

### Research Questions 
The main aim of the project is to produce recommendations for LSE professors on potential research collaborations.

**Q1: Which departments have the highest potential for collaboration?**
- Based on the number of topics shared between them. 
- Based on the number of professors that share those interests. 

**Q2: Which topics have the most diverse backgrounds of professors interested in them?**
- Count, for each topic, the number of unique departments with professors interested in it, and rank the topics by that count.

**Q3: Descriptives**
- Which topics are the most popular across all departments (top 10)?
- Which topics are the most popular within each department (top 5)?


### Data Description
From the [LSE Departments and Institutes website](https://info.lse.ac.uk/staff/departments-and-institutes), we chose the following departments to work with:
- Statistics
- Mathematics
- Finance
- Accounting
- Management
- Economics
- Data Science Institute
- Methodology

The data used in our project is taken directly from webpages of the form https://www.lse.ac.uk/{department}/people/{professor_name}, where the department and professor name differ for each page.
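As a small illustration of this URL pattern, the sketch below builds a profile URL and recovers the department segment from one, which can be handy for tagging each scraped row with its department. The helper names and the example slugs are hypothetical; the real slugs are whatever appears in the links on the LSE website.

```python
from urllib.parse import urlparse

BASE_URL = "https://www.lse.ac.uk"


def profile_url(department: str, professor_name: str) -> str:
    """Build a URL following the {department}/people/{professor_name} pattern."""
    return f"{BASE_URL}/{department}/people/{professor_name}"


def department_from_url(url: str) -> str:
    """Return the first path segment, which holds the department in this pattern."""
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    return path_parts[0]


# Hypothetical slugs, used only to show the shape of the URL
url = profile_url("accounting", "example-professor")
print(url)
print(department_from_url(url))
```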

## Data Acquisition

- For the first four departments in the list above, we used the **Selenium** library to automate a web browser as follows:

Starting from the [LSE Departments and Institutes website](https://info.lse.ac.uk/staff/departments-and-institutes), the script clicks the department and then opens the "People" section, which leads to the page with the URL https://www.lse.ac.uk/{department_name}/People. It then clicks the dropdown menu button called "Academic Faculty" (or a differently worded equivalent) and stores the links to all the professors' webpages, taken from the link texts, in a list.



Run the following code to see the navigation steps above for the Department of Accounting:

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the web driver
driver = webdriver.Chrome()
driver.get("https://info.lse.ac.uk/Staff/Departments-and-Institutes")

# Find and click the department
department = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Department of Accounting')))
driver.execute_script("arguments[0].scrollIntoView();", department)
department.click()

# Find and click "People"
people = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.LINK_TEXT, 'People')))
people.click()
people_url = driver.current_url

# Find and click "Academic Faculty"
academic_faculty = WebDriverWait(driver, 15).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Academic Faculty')))
driver.execute_script("arguments[0].scrollIntoView();", academic_faculty)
academic_faculty.click()

- For the last four departments, a shortcut is taken using only **Beautiful Soup**:

From the webpage https://www.lse.ac.uk/{department_name}/People, the div of the "Academic Faculty" section is located and, again, the URL of each professor's webpage is put in a list.
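The shortcut described above can be sketched roughly as follows. The HTML snippet is a hypothetical, simplified stand-in for a "People" page, and the helper name `faculty_links` is our own; the real pages may use different tags and class names, so the selectors would need adjusting per department.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a department "People" page
SAMPLE_HTML = """
<div class="accordion">
  <h2>Academic Faculty</h2>
  <ul>
    <li><a href="/statistics/people/example-one">Example One</a></li>
    <li><a href="/statistics/people/example-two">Example Two</a></li>
  </ul>
</div>
<div class="accordion">
  <h2>Professional Services Staff</h2>
  <ul><li><a href="/statistics/people/example-three">Example Three</a></li></ul>
</div>
"""


def faculty_links(html: str, base_url: str = "https://www.lse.ac.uk") -> list[str]:
    """Collect absolute profile URLs from the div headed "Academic Faculty"."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the section is introduced by an h2/h3 heading with this text
    heading = soup.find(
        lambda tag: tag.name in ("h2", "h3") and "Academic Faculty" in tag.get_text()
    )
    if heading is None:
        return []
    section = heading.find_parent("div")
    if section is None:
        return []
    return [urljoin(base_url, a["href"]) for a in section.find_all("a", href=True)]


links = faculty_links(SAMPLE_HTML)
print(links)
```

On a live page, the `html` argument would be the downloaded page source rather than the sample string.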

- The remaining steps, for all departments, use the **Beautiful Soup** library:

Finally, each link in that list is visited one by one in a loop and the following variables are scraped:

- professor names
- professor titles / prefixes
- languages they speak
- modules they are teaching
- key expertise

The scraped data is first stored in Python dictionaries and then converted to pandas DataFrames. Finally, each DataFrame is written to a separate **CSV file** and stored in the "data" folder inside the "ST115_Project" folder. The code for these steps can be found in the **"Data Acquisition"** notebook.
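A minimal sketch of the dictionary-to-DataFrame-to-CSV step is shown here. The records and the file name are hypothetical placeholders rather than scraped values, and the actual code lives in the "Data Acquisition" notebook. Columns named `Unnamed: 0` usually appear when a DataFrame's row index is written to CSV and then read back, so the sketch passes `index=False`, which is one way to avoid them.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical records standing in for one department's scraped data
records = {
    "Professor Name": ["Example One", "Example Two"],
    "Professor Prefix": ["Dr", "Professor"],
    "Key Expertise": ["Topic A, Topic B", None],  # None marks a missing value
    "Languages": ["English", "English, French"],
    "Title": ["Assistant Professor", "Professor"],
    "Modules": [["XX100 Example Module"], []],
}

df = pd.DataFrame(records)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example_department.csv"
    # index=False keeps the row index out of the file, so reading it back
    # does not add an extra "Unnamed: 0" column
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.columns.tolist())
```

Note that list-valued cells such as "Modules" are written as their string representation, so they come back as strings when the CSV is reloaded.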

Example for the Department of Accounting:

In [5]:
import pandas as pd
df = pd.read_csv('accounting.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Professor Name | Professor Prefix | Key Expertise | Languages | Title | Modules |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Dr Per Ahblom | Dr | Accounting & Finance Social Studies, Capital M... | English, Swedish | Assistant Professor of Accounting | ['AC200 Accounting Theory and Practice (Curren... |
| 1 | 1 | Alnoor Bhimani | Professor | Accounting, Management Accounting, Tech Entrep... | English | Professor of Management Accounting | ['AC490 Management Accounting, Decisions and C... |
| 2 | 2 | Dr Jose Carabias Palmei | Dr | Accounting, Financial Statements Analysis, Ass... | English | Assistant Professor of Accounting | ['AC330 Financial Accounting, Analysis and Val... |
| 3 | 3 | Dr Stefano Cascin | Dr | Disclosure Regulation, Business Groups, Credit... | English, Italian | Associate Professor of Accounting | ['AC332 Financial Statement, Analysis and Valu... |
| 4 | 4 | Dr Maria Correia | Dr | Credit Markets, Default Prediction, Business G... | English | Associate Professor of Accounting | ['AC416 Topics in Financial Reporting'] |


## Data Preparation

## Data Analysis
### Exploratory Data Analysis (EDA)

### Q1: Which departments have the highest potential for collaboration?
**a) Based on the number of topics shared between them.**
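One way to compute this measure is to compare the set of topics of every pair of departments. The sketch below uses hypothetical topic sets, not the scraped data, and the function name `shared_topic_counts` is our own.

```python
from itertools import combinations

# Hypothetical topic sets per department (not the scraped data)
topics_by_department = {
    "Statistics": {"time series", "machine learning", "bayesian inference"},
    "Mathematics": {"game theory", "machine learning"},
    "Economics": {"game theory", "time series", "machine learning"},
}


def shared_topic_counts(topics_by_department):
    """Count the topics in common for every pair of departments, highest first."""
    counts = {
        (a, b): len(topics_by_department[a] & topics_by_department[b])
        for a, b in combinations(sorted(topics_by_department), 2)
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


for pair, n_shared in shared_topic_counts(topics_by_department):
    print(pair, n_shared)
```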

**b) Based on the number of professors that share those interests.**
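This measure can be read in more than one way. The sketch below assumes one possible reading: for each pair of departments, count the professors in either department who list at least one topic that both departments share. The rows are hypothetical and the function name `professors_sharing` is our own.

```python
from itertools import combinations

# Hypothetical (professor, department, topics) rows, not the scraped data
professors = [
    ("Prof A", "Statistics", {"time series", "machine learning"}),
    ("Prof B", "Statistics", {"bayesian inference"}),
    ("Prof C", "Economics", {"time series", "game theory"}),
    ("Prof D", "Economics", {"machine learning"}),
    ("Prof E", "Mathematics", {"game theory"}),
]


def professors_sharing(professors):
    """For each department pair, count professors who hold a topic both departments share."""
    # Union of topics per department
    topics = {}
    for _, dept, prof_topics in professors:
        topics.setdefault(dept, set()).update(prof_topics)

    result = {}
    for a, b in combinations(sorted(topics), 2):
        shared = topics[a] & topics[b]
        result[(a, b)] = sum(
            1
            for _, dept, prof_topics in professors
            if dept in (a, b) and prof_topics & shared
        )
    return result


print(professors_sharing(professors))
```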

### Q2: Which topics have the most diverse backgrounds of professors interested in them?
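A simple way to approach this question is to collect, for each topic, the set of departments in which it appears and rank topics by the size of that set. The sketch below works on hypothetical (department, topic) rows, one per professor and topic; the function name `departments_per_topic` is our own.

```python
from collections import defaultdict

# Hypothetical (department, topic) rows, one per professor and topic
rows = [
    ("Statistics", "machine learning"),
    ("Statistics", "machine learning"),  # a second professor in the same department
    ("Mathematics", "machine learning"),
    ("Economics", "machine learning"),
    ("Economics", "game theory"),
    ("Mathematics", "game theory"),
    ("Finance", "asset pricing"),
]


def departments_per_topic(rows):
    """Rank topics by the number of unique departments in which they appear."""
    departments = defaultdict(set)
    for department, topic in rows:
        departments[topic].add(department)
    # Sort by count (descending), then alphabetically to break ties
    return sorted(
        ((topic, len(depts)) for topic, depts in departments.items()),
        key=lambda item: (-item[1], item[0]),
    )


print(departments_per_topic(rows))
```

Using sets means that several professors from the same department count only once for a topic.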

### Q3: Descriptives

**Which topics are the most popular across all departments (top 10)?**

**Which topics are the most popular within each department (top 5)?**
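Both descriptive questions come down to frequency counts. The sketch below uses `collections.Counter` on hypothetical (department, topic) rows, one per professor and topic; the function names are our own, and the defaults of 10 and 5 mirror the two questions.

```python
from collections import Counter, defaultdict

# Hypothetical (department, topic) rows, one per professor and topic
rows = [
    ("Statistics", "machine learning"),
    ("Statistics", "machine learning"),
    ("Statistics", "time series"),
    ("Economics", "game theory"),
    ("Economics", "machine learning"),
    ("Economics", "game theory"),
]


def top_topics(rows, n=10):
    """Most frequent topics across all departments."""
    return Counter(topic for _, topic in rows).most_common(n)


def top_topics_by_department(rows, n=5):
    """Most frequent topics within each department."""
    counters = defaultdict(Counter)
    for department, topic in rows:
        counters[department][topic] += 1
    return {dept: counter.most_common(n) for dept, counter in counters.items()}


print(top_topics(rows))
print(top_topics_by_department(rows))
```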

## Conclusion

### Limitations and Future Opportunities
First of all, as mentioned, we only picked 8 departments to work with, while there are 29 departments and institutes in total. If our process were applied to all of them, a much larger number of shared interests could be found and a deeper analysis could be conducted.

To measure shared research interests, we checked for exact string matches in the “key expertise” variable. However, this is not the only way to do it. An alternative would be to apply Natural Language Processing methods to the variable. For example, tokenisation, stemming and lemmatisation could be used to convert the words into a more standardised form. Furthermore, more advanced techniques such as word embeddings could be applied. This method uses neural networks to capture the semantic relationships between words, so that texts can be grouped according to those relationships.
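As a first step in that direction, the sketch below shows plain text normalisation only (lowercasing, punctuation removal and whitespace cleanup), not stemming, lemmatisation or embeddings. It assumes that a "key expertise" cell is a comma-separated string; the function names and example strings are hypothetical.

```python
import re


def normalise(topic: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9& ]+", " ", topic.lower())
    return " ".join(cleaned.split())


def split_expertise(cell: str) -> set[str]:
    """Split a comma-separated "key expertise" cell into normalised topics."""
    return {topic for part in cell.split(",") if (topic := normalise(part))}


# Two hypothetical cells that differ only in case, hyphenation and spacing
a = split_expertise("Machine Learning, Time-Series")
b = split_expertise("machine learning ,  time series")
print(a & b)
```

With exact string matching these two cells would share no topics, whereas after normalisation both topics match.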

In some departments, a large number of key expertise values are missing because they were not available on the professors’ webpages. This may have made the answers to our research questions partly inaccurate. To improve on this, the “modules” variable could also be taken into account in the analysis. The relationship between a professor’s key expertise and the modules they teach could be useful, assuming that professors are interested in and knowledgeable about the topics they teach. Another variable that would help identify professors’ research interests is the titles of their publications.

Overall, although this project produced useful results, it has significant potential for further development.

### References