## US CS Faculty Dataset
- [CS Faculty Composition and Hiring Trends (Blog)](https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends)
- [2200 Computer Science Professors in 50 top US Graduate Programs](https://cs.brown.edu/people/apapouts/faculty_dataset.html)
- [CS Professors (Data Explorer)](https://drafty.cs.brown.edu/csprofessors?src=csopendata)
- [Drafty Project](https://drafty.cs.brown.edu/)


Use Beautiful Soup to scrape CS faculty info.

In [1]:
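import pandas as pd
import requests
from bs4 import BeautifulSoup

# project helpers used below: SCHOOL_DICT, BROWSER_HEADERS, COLUMNS, DEPT_MAP,
# map_school_dept, is_job_title
from scrap_cs_faculty import *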

In [2]:
SCHOOL = "MIT-CS"
URL = SCHOOL_DICT[SCHOOL]["url"]  # e.g. "https://www.eecs.mit.edu/role/faculty-cs/"
print(URL)

https://www.eecs.mit.edu/role/faculty-cs/


In [3]:
page = requests.get(URL, headers=BROWSER_HEADERS)

In [4]:
soup = BeautifulSoup(page.content, "html.parser")

## Find Elements by HTML Class Name

In [5]:
cs_persons = soup.find_all("div", class_="people-entry small-12 medium-6 large-4 larger-3 cell")
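
The call above passes the full class string, which Beautiful Soup compares against the exact value of the `class` attribute, so a card whose classes appear in a different order would be missed. The sketch below, on hypothetical markup (not the real page), contrasts that with matching on the single class `people-entry`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two cards list the same classes in a different order.
html = """
<div class="people-entry small-12 medium-6 cell"><h5><a href="/people/a/">Person A</a></h5></div>
<div class="people-entry cell small-12 medium-6"><h5><a href="/people/b/">Person B</a></h5></div>
<div class="sidebar cell"><h5>Not a person</h5></div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Full class string: matches only when the attribute is exactly this string.
exact = soup_demo.find_all("div", class_="people-entry small-12 medium-6 cell")

# Single class: matches any div whose class list contains "people-entry".
by_class = soup_demo.find_all("div", class_="people-entry")

print(len(exact), len(by_class))
print([div.find("a").text for div in by_class])
```

Matching on one distinctive class is usually less brittle when a site adjusts its layout classes, at the cost of possibly matching more elements than intended.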

In [6]:
len(cs_persons) , cs_persons[0]

(78,
 <div class="people-entry small-12 medium-6 large-4 larger-3 cell">
 <a aria-hidden="true" class="people-index-image" href="https://www.eecs.mit.edu/people/hal-abelson/" rel="bookmark" tabindex="-1" title="Hal Abelson">
 <img alt="Hal Abelson" class="attachment-people-thumb size-people-thumb wp-post-image" decoding="async" height="228" loading="lazy" sizes="(max-width: 228px) 100vw, 228px" src="https://www.eecs.mit.edu/wp-content/uploads/2021/06/halab.png" srcset="https://www.eecs.mit.edu/wp-content/uploads/2021/06/halab.png 228w, https://www.eecs.mit.edu/wp-content/uploads/2021/06/halab-125x125.png 125w" width="228"/> </a>
 <h5><a href="https://www.eecs.mit.edu/people/hal-abelson/" rel="bookmark">Hal Abelson</a></h5>
 <p>Class of 1992 Professor , [CS and AI+D]</p>
 <ul>
 <li><a href="mailto:hal@mit.edu">hal@mit.edu</a></li>
 <li>(617) 253-5856</li>
 <li>Office: 32-G516</li>
 </ul>
 <div class="people-research">
 <p><a href="?fwp_research=ai-and-society">AI and Society</a></p><p><

### Extract Text From HTML Elements

You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:
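
As a small illustration, the sketch below applies `.text` to a trimmed-down copy of the card printed earlier. The HTML string is a shortened stand-in for one `people-entry` element, not the live page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one people-entry card from the listing page.
html = """
<div class="people-entry">
  <h5><a href="https://www.eecs.mit.edu/people/hal-abelson/">Hal Abelson</a></h5>
  <p>Class of 1992 Professor , [CS and AI+D]</p>
  <ul>
    <li><a href="mailto:hal@mit.edu">hal@mit.edu</a></li>
    <li>Office: 32-G516</li>
  </ul>
</div>
"""
card = BeautifulSoup(html, "html.parser")

link = card.find("h5").find("a")
name = link.text.strip()            # text inside the <a> tag
profile_url = link["href"]          # attribute access, not text
title_line = card.find("p").text.strip()
items = [li.text.strip() for li in card.find_all("li")]

print(name, profile_url)
print(title_line)
print(items)
```

Note that `.text` on the `<li>` holding the mail link returns the visible address, while the `mailto:` prefix lives only in the `href` attribute.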

In [7]:
DEBUG = False  # set to True to process only the first entry and print its fields

school, dept = map_school_dept(SCHOOL)
data = []
all_research_dict = {}
for n, person in enumerate(cs_persons):
    try:
        data_dict = {"school": school, "department": dept} # default
        if DEBUG and n > 0: break  # debug
        
        # get name/url
        x = person.find("h5")
        name_x = x.find("a")
        data_dict['name'] = name_x.text.strip()
        data_dict['url'] = name_x["href"]
        
        # get img
        x = person.find("img")
        if x:
            data_dict['img_url'] = x["src"]
            if "http" not in data_dict['img_url']:
                y = person.find("a")
                data_dict['img_url'] = y["href"]
                if not data_dict['img_url']:
                    data_dict['img_url'] = data_dict['url']

        research_area_dict = {}
        for x in person.find_all("p"):
            if (is_job_title(x.text.lower())) and "[" in x.text:
                # get title and dept
                x_tmp = x.text.split("[")
                if len(x_tmp) > 0:
                    title = x_tmp[0].strip()
                    data_dict['job_title'] = title[:-1] if title.endswith(",") else title
                if len(x_tmp) > 1:
                    # split on " and " (with spaces) so codes that merely contain "and" stay intact
                    depts = [i.strip() for i in x_tmp[1].replace("]", "").split(" and ") if i.strip()]
                    if DEBUG: print(f"depts = {depts}")
                    depts = [DEPT_MAP.get(d, "") for d in depts]
                    data_dict['department'] = "; ".join(depts)
            else:
                # get research
                y = x.find("a")
                if y:
                    research_name = y.text.strip()
                    if research_name:
                        research_area_dict[research_name] = f"{URL}{y['href']}"
                        if research_name not in all_research_dict:
                            all_research_dict[research_name] = research_area_dict[research_name]
        research_names = list(research_area_dict)
        if research_names:
            data_dict['research_area'] = ", ".join(research_names)
                
        # get email, phone, office
        for x in person.find_all("li"):
            if "Office" in x.text.strip():
                data_dict['office_address'] = x.text.strip()
            else:
                y = x.find("a")
                if y and "@" in y.text:
                    data_dict['email'] = y.text.strip()
                else:
                    data_dict['phone'] = x.text.strip()
                
        if DEBUG:
            print(f"n={n}\t=============")
            print(f"name= {data_dict.get('name','')}")
            print(f"job_title= {data_dict.get('job_title','')}")
            print(f"phone= {data_dict.get('phone','')}")
            print(f"office= {data_dict.get('office_address','')}")
            print(f"email= {data_dict.get('email','')}")
            print(f"url= {data_dict.get('url','')}")
            print(f"img_url= {data_dict.get('img_url','')}")
            print(f"research_area= {data_dict.get('research_area','')}")
            print(f"department= {data_dict.get('department','')}")
        
        if data_dict:
            row_data = []
            for c in COLUMNS:
                cell = data_dict.get(c,"")
                row_data.append(cell)
            data.append(row_data)
    except Exception as e:
        print(f"[Error] {str(e)}\n{person.prettify()}")
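
The title and department parsing in the loop can be pulled out into a small function so it can be checked on its own. This is a minimal sketch under assumptions: `split_title_and_depts` and `DEPT_MAP_EXAMPLE` are hypothetical names, and the two map entries are inferred from the printed card and table rather than copied from the real `DEPT_MAP` in `scrap_cs_faculty`.

```python
# Hypothetical mapping for illustration; the real DEPT_MAP is defined in scrap_cs_faculty.
DEPT_MAP_EXAMPLE = {"CS": "Computer Science", "AI+D": "AI & Decision-making"}


def split_title_and_depts(text, dept_map=DEPT_MAP_EXAMPLE):
    """Split 'Title , [A and B]' into (title, 'Dept A; Dept B')."""
    head, _, tail = text.partition("[")
    # Drop surrounding whitespace and a trailing comma from the title part.
    title = head.strip().rstrip(",").strip()
    codes = [c.strip() for c in tail.replace("]", "").split(" and ") if c.strip()]
    # Skip codes missing from the map so no empty entries end up in the joined string.
    names = [dept_map[c] for c in codes if c in dept_map]
    return title, "; ".join(names)


title, depts = split_title_and_depts("Class of 1992 Professor , [CS and AI+D]")
print(title)
print(depts)

# A line without brackets yields the title and an empty department string.
plain_title, plain_depts = split_title_and_depts("Associate Professor")
```

Skipping unmapped codes is a design choice of this sketch; it avoids values with a dangling separator such as `Computer Science;`.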

In [8]:
len(data)

78

In [9]:
df = pd.DataFrame(data, columns=COLUMNS)

In [10]:
print(f"Number of faculty members at {SCHOOL}: {df.shape[0]}")

Number of faculty members at MIT-CS: 78


In [11]:
df

| | name | job_title | phd_univ | phd_year | research_area | research_concentration | research_focus | url | img_url | phone | email | cell_phone | office_address | department | school |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hal Abelson | Class of 1992 Professor | | | AI and Society, Artificial Intelligence + Deci... | | | https://www.eecs.mit.edu/people/hal-abelson/ | https://www.eecs.mit.edu/wp-content/uploads/20... | (617) 253-5856 | hal@mit.edu | | Office: 32-G516 | Computer Science; AI & Decision-making | Massachusetts Institute Technology |
| 1 | Anant Agarwal | CEO, edX; Professor of EECS; | | | Computer Architecture, Multicore Processors & ... | | | https://www.eecs.mit.edu/people/anant-agarwal/ | https://www.eecs.mit.edu/people/anant-agarwal/ | (617) 253-1448 | agarwal@mit.edu | | Office: NE55-900 | Computer Science; Electrical Engineering | Massachusetts Institute Technology |
| 2 | Pulkit Agrawal | Steven and Renee Finn Career Development Profe... | | | Artificial Intelligence + Machine Learning, Gr... | | | https://www.eecs.mit.edu/people/pulkit-agrawal/ | https://www.eecs.mit.edu/people/pulkit-agrawal/ | (617) 253-5851 | pulkitag@mit.edu | | Office: 32-342 | AI & Decision-making; Computer Science | Massachusetts Institute Technology |
| 3 | Mohammad Alizadeh | Associate Professor | | | Multicore Processors & Cloud Computing, Securi... | | | https://www.eecs.mit.edu/people/mohammad-aliza... | https://www.eecs.mit.edu/people/mohammad-aliza... | (617) 253-6042 | alizadeh@mit.edu | | Office: 32-G920 | Computer Science; | Massachusetts Institute Technology |
| 4 | Saman Amarasinghe | Professor of CS and Engineering | | | Artificial Intelligence + Machine Learning, Co... | | | https://www.eecs.mit.edu/people/saman-amarasin... | https://www.eecs.mit.edu/people/saman-amarasin... | (617) 253-8879 | saman@csail.mit.edu | | Office: 38-427 | Computer Science | Massachusetts Institute Technology |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73 | Ryan Williams | Professor of EECS | | | Theory of Computation | | | https://www.eecs.mit.edu/people/ryan-williams/ | https://www.eecs.mit.edu/people/ryan-williams/ | 617-253-5851 | rrw@mit.edu | | Office: 32-G638 | Computer Science; AI & Decision-making | Massachusetts Institute Technology |
| 74 | Virginia Vassilevska Williams | Professor | | | Theory of Computation | | | https://www.eecs.mit.edu/people/virginia-willi... | https://www.eecs.mit.edu/wp-content/uploads/20... | 617-253-5851 | virgi@mit.edu | | Office: 32-G640 | Computer Science; AI & Decision-making | Massachusetts Institute Technology |
| 75 | Mengjia Yan | Homer A. Burnell Career Development Professor,... | | | Computer Architecture, Security and Cryptography | | | https://www.eecs.mit.edu/people/mengjia-yan/ | https://www.eecs.mit.edu/people/mengjia-yan/ | 617-258-0719 | mengjiay@mit.edu | | Office: 32G-840 | Computer Science | Massachusetts Institute Technology |
| 76 | Nickolai Zeldovich | Professor of EECS | | | Programming Languages and Software Engineering... | | | https://www.eecs.mit.edu/people/nickolai-zeldo... | https://www.eecs.mit.edu/people/nickolai-zeldo... | (617) 253-6005 | nickolai@csail.mit.edu | | Office: 32-G994 | Computer Science | Massachusetts Institute Technology |


In [12]:
# prepare research group dataframe
cols = ["research_group", "url"]
data = []
for name, url in all_research_dict.items():
    data.append([name, url])
    print(f"{name}:\t {url}")

AI and Society:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=ai-and-society
Artificial Intelligence + Decision making:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=artificial-intelligence-decision-making
Artificial Intelligence + Machine Learning:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=artificial-intelligence-machine-learning
Computer Architecture:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=computer-architecture
Multicore Processors & Cloud Computing:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=multicore-processors-cloud-computing
Programming Languages and Software Engineering:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=programming-languages-and-software-engineering
Graphics and Vision:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=graphics-and-vision
Human-Computer Interaction:	 https://www.eecs.mit.edu/role/faculty-cs/?fwp_research=human-computer-interaction
Security and Cryptography:	 https://www.

In [13]:
df_research = pd.DataFrame(data, columns=cols)

In [14]:
df_research

| | research_group | url |
|---|---|---|
| 0 | AI and Society | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 1 | Artificial Intelligence + Decision making | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 2 | Artificial Intelligence + Machine Learning | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 3 | Computer Architecture | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 4 | Multicore Processors & Cloud Computing | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 5 | Programming Languages and Software Engineering | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 6 | Graphics and Vision | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 7 | Human-Computer Interaction | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 8 | Security and Cryptography | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
| 9 | Systems and Networking | https://www.eecs.mit.edu/role/faculty-cs/?fwp_... |
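
The research URLs above come from concatenating the listing URL with each relative `href` (`f"{URL}{y['href']}"`), which works here because every `href` is a bare query string. A more general alternative, sketched below with a hypothetical helper `absolute_url`, is `urllib.parse.urljoin`, which also leaves already-absolute links unchanged:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.eecs.mit.edu/role/faculty-cs/"  # listing page URL printed earlier


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the listing page; absolute hrefs pass through."""
    return urljoin(base, href)


research_url = absolute_url("?fwp_research=ai-and-society")
profile_url = absolute_url("https://www.eecs.mit.edu/people/hal-abelson/")
print(research_url)
print(profile_url)
```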


In [15]:
df.to_csv(f"faculty-{SCHOOL}.csv", index=False)

In [16]:
# requires the xlsxwriter package
file_xlsx = f"faculty-{SCHOOL}.xlsx"
# The context manager saves and closes the workbook on exit;
# ExcelWriter.save() was removed in pandas 2.0.
with pd.ExcelWriter(file_xlsx, engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="CS Faculty-MIT", index=False)
    df_research.to_excel(writer, sheet_name="Research Groups", index=False)