# Scraper

The main task of this part is to scrape basic information about alumni. It uses the [API](https://www.mediawiki.org/wiki/API:Main_page) provided by Wikipedia to speed up the scraping process. We only scrape the introduction section of each wiki page.

## Libraries

We only need the `json` and `requests` libraries in this part. The only new library you need to install is [tqdm](https://pypi.python.org/pypi/tqdm), an easy-to-use progress meter for Python scripts. Since the number of alumni is relatively large, tqdm lets us follow the progress and the estimated running time. You can install it using `pip`:

    $ pip install --upgrade tqdm
    
After the installation finishes, make sure the following imports work:

In [3]:
import json
import requests
from tqdm import tqdm

## Categories
The main source of the alumni lists is the [category](https://en.wikipedia.org/wiki/Help:Category) pages of Wikipedia, for example the list of [Carnegie Mellon University alumni](https://en.wikipedia.org/wiki/Category:Carnegie_Mellon_University_alumni). Below is a dictionary with university names as keys and category names as values. We can manually add any university we want to the dictionary.

In [4]:
universities = dict()
universities['CMU'] = "Category:Carnegie_Mellon_University_alumni"
universities['Stanford'] = "Category:Stanford_University_alumni"
universities['Harvard'] = "Category:Harvard_University_alumni"
universities['Yale'] = "Category:Yale_University_alumni"
universities['UCLA'] = "Category:University_of_California,_Los_Angeles_alumni"
universities['MIT'] = "Category:Massachusetts_Institute_of_Technology_alumni"
universities['Pitt'] = "Category:University_of_Pittsburgh_alumni"

## Collect Titles Recursively

Using [API:Categorymembers](https://www.mediawiki.org/wiki/API:Categorymembers), we can collect the titles of all alumni in a category page. With the parameter `format=json`, the response looks like the following (here with `cmlimit=2`):
```python
{
    "batchcomplete": "",
    "continue": {
        "cmcontinue": "page|29374306043f51394d0453454303063f51394d0453454304293743011e01dcc2dcc2dcc3dcbedc06|5648968",
        "continue": "-||"
    },
    "query": {
        "categorymembers": [
            {
                "pageid": 14941891,
                "ns": 0,
                "title": "Gregory Abowd"
            },
            {
                "pageid": 18446374,
                "ns": 0,
                "title": "Linda Addison (poet)"
            }
        ]
    }
}
```
Since one request returns at most 500 titles, we need to use the continuation mechanism. If members remain after a request, the returned JSON contains a `continue` field. By passing the `cmcontinue` value of that field as the `cmcontinue` parameter of the next request, we can collect all the members recursively. The function `get_members` does that.

In [5]:
def get_members(cat_name, cont=None):
    """
    Return the list of members in given category

    Args:
        cat_name (string): name of category in form as 'Category:Carnegie_Mellon_University_alumni'
        cont (string): value of the cmcontinue parameter, used internally by the recursion
    Returns:
        ret_val (list): list of the titles of all members of this category
    """
    
    url = 'https://en.wikipedia.org/w/api.php'
    params = {"action" : 'query', 'format' : 'json', 'list' : 'categorymembers', 'cmlimit' : 500, 'cmprop' : 'title'}
    params['cmtitle'] = cat_name
    if cont is not None:
        params['cmcontinue'] = cont
    response = requests.get(url, params=params)
    ret_val = []
    data = json.loads(response.text)
    if ('continue' in data):
        ret_val = get_members(cat_name, cont=data['continue']['cmcontinue'])
    ret_val += [x['title'] for x in data['query']['categorymembers']]
    return ret_val
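
The same continuation logic can also be written as a loop. The sketch below is an illustrative alternative, not the function used in the rest of this part: `get_members_iter` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any callable that sends the parameters to the API and returns the decoded JSON. It merges the whole `continue` object into the next request, which also carries `cmcontinue`.

```python
def get_members_iter(cat_name, fetch):
    """Collect all member titles of a category with a loop instead of recursion.

    fetch is a hypothetical callable: it takes the request parameters (dict)
    and returns the decoded JSON response (dict).
    """
    params = {'action': 'query', 'format': 'json', 'list': 'categorymembers',
              'cmlimit': 500, 'cmprop': 'title', 'cmtitle': cat_name}
    titles = []
    while True:
        data = fetch(params)
        titles += [x['title'] for x in data['query']['categorymembers']]
        if 'continue' not in data:
            return titles
        # Copy every continuation value (including cmcontinue) into the next request
        params = {**params, **data['continue']}

# Possible usage with requests (performs network calls):
#   fetch = lambda p: requests.get('https://en.wikipedia.org/w/api.php', params=p).json()
#   titles = get_members_iter(universities['CMU'], fetch)
```

Unlike `get_members`, which appends the current page after the pages returned by the recursive call, this version keeps the titles in the order the API returns them.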

## Collect Subcategories Recursively

Notice that a category page also contains subcategories. Using the same API with `cmtype=subcat`, we can collect the direct subcategories in a similar way:

In [6]:
def get_all_subcategories(cat_name, cont=None):
    """
    Return the list of all subcategories

    Args:
        cat_name (string): name of category in form as 'Category:Carnegie_Mellon_University_alumni'
        cont (string): value of the cmcontinue parameter, used internally by the recursion
    Returns:
        ret_val (list): list of all direct subcategories of this category
    """
        
    url = 'https://en.wikipedia.org/w/api.php'
    params = {"action" : 'query', 'format' : 'json', 'list' : 'categorymembers', 'cmlimit' : 500, 'cmtype' : 'subcat'}
    params['cmtitle'] = cat_name
    if cont is not None:
        params['cmcontinue'] = cont
    response = requests.get(url, params=params)
    ret_val = []
    data = json.loads(response.text)
    if ('continue' in data):
        ret_val = get_all_subcategories(cat_name, cont=data['continue']['cmcontinue'])
    ret_val += [x['title'] for x in data['query']['categorymembers']]
    return ret_val
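
`get_all_subcategories` returns only the subcategories listed directly on the given category page. If nested subcategories are also wanted, one possible extension is a breadth-first walk with a depth limit. This is a sketch under assumptions: `walk_subcategories`, `list_subcats`, and `max_depth` are hypothetical names, and `list_subcats` stands for any function that returns the direct subcategories of a category (for example `get_all_subcategories`).

```python
from collections import deque


def walk_subcategories(cat_name, list_subcats, max_depth=2):
    """Breadth-first walk over nested subcategories, down to max_depth levels."""
    seen = {cat_name}
    found = []
    queue = deque([(cat_name, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand categories at the depth limit
        for sub in list_subcats(current):
            if sub not in seen:  # a category can be reachable through several parents
                seen.add(sub)
                found.append(sub)
                queue.append((sub, depth + 1))
    return found

# Possible usage (performs network calls):
#   walk_subcategories(universities['CMU'], get_all_subcategories, max_depth=2)
```

The `seen` set guards against visiting a category twice, and the depth limit keeps the number of requests bounded.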

## Collect All Titles

Combining the two functions above, we can collect all titles in a category page, including the titles in its subcategory pages.

In [7]:
def get_all_members(cat_name):
    """
    Return the list of all members

    Args:
        cat_name (string): name of category in form as 'Category:Carnegie_Mellon_University_alumni'
    Returns:
        ret_val (list): list of all members
    """
    subcategories = get_all_subcategories(cat_name)
    ret_val = []
    for subcat in subcategories:
        ret_val += get_members(subcat)
    ret_val += get_members(cat_name)
    return list(filter(lambda a: not a.startswith("Category:"), ret_val))
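
Because the lists from the subcategories and from the main category are simply concatenated, a person who appears in more than one of them shows up several times in the result. A small helper can drop the repeats before any pages are requested. This is a minimal sketch, and `unique_titles` is a hypothetical name that is not part of the functions above.

```python
def unique_titles(titles):
    """Drop repeated titles while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(titles))

# Possible usage:
#   titles = unique_titles(get_all_members(universities['CMU']))
```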

## Extract Introduction From a Wiki Page

Now that we have the list of titles contained in a category page, the next step is to fetch the introduction of every wiki page. We can do that using [API: query/prop=extracts](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts). With the request 
``` python
'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&exintro=&explaintext=&titles=Gregory_Abowd|Ross%20Cohen'
```
the returned JSON looks like:
```python
{
    "batchcomplete": "",
    "query": {
        "normalized": [
            {
                "from": "Gregory_Abowd",
                "to": "Gregory Abowd"
            }
        ],
        "pages": {
            "14941891": {
                "pageid": 14941891,
                "ns": 0,
                "title": "Gregory Abowd",
                "extract": "Gregory Dominic Abowd is a computer scientist best known for his work in ubiquitous computing, software engineering, and technologies for autism. He is the J.Z. Liang Professor in the School of Interactive Computing at the Georgia Institute of Technology, where he joined the faculty in 1994."
            },
            "34107971": {
                "pageid": 34107971,
                "ns": 0,
                "title": "Ross Cohen",
                "extract": "Ross Cohen is the cofounder of BitTorrent Inc along with brother Bram Cohen, where among other things he was involved in the Codeville project. He attended Carnegie Mellon University and Stuyvesant High School. He was forced out of the company in 2006."
            }
        }
    }
}
```
Notice that we can join titles with `|` to fetch several pages in one request, which greatly improves efficiency. The following function `get_info` does this.

In [8]:
def get_info(titles):
    """
    Return a dictionary of {title : intro}
    
    Args:
        titles (string): one or more titles joined with '|'
    Returns:
        ret_val: a dictionary of {title : intro}
    """
    url = 'https://en.wikipedia.org/w/api.php'
    params = {"action" : 'query', 'format' : 'json', 'prop' : 'extracts', 'exintro' : '', 'explaintext' : ''}
    params['titles'] = titles
    response = requests.get(url, params=params)
    data = list(json.loads(response.text)['query']['pages'].values())
    ret_val = dict()
    for d in data:
        # Pages without an 'extract' key (e.g. missing pages) or with an empty intro are skipped
        if d.get('extract'):
            ret_val[d['title']] = d['extract']
    return ret_val
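
An alternative to limiting a request by the length of the joined string is to group a fixed number of titles per request. The sketch below is illustrative only: `chunk_titles` is a hypothetical helper, and the default of 10 titles is an arbitrary conservative choice, since the API documents its own per-request limits, which should be checked on the help page linked above.

```python
def chunk_titles(titles, size=10):
    """Yield '|'-joined groups of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield '|'.join(titles[start:start + size])

# Possible usage (performs network calls):
#   intros = {}
#   for joined in chunk_titles(title_list):
#       intros.update(get_info(joined))
```

Slicing never drops the last, shorter group, so every title ends up in exactly one request.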

## Collect All Data

Finally, we combine all the functions into a single function that collects the information of all alumni. The data is stored as JSON in the following format:
```python
{
    'CMU' : {
        'Kathleen_Carley' : "Kathleen M. Carley is an American social scientist ..."
        ...
    }
    ...
    'University_Name':{
        'Person_Name' : "Text"
    }
}
```

In [9]:
def get_all_info(title_list):
    """
    Return a dictionary of {title : intro}. This function joins titles into
    requests of roughly 200 characters to improve efficiency.
    
    Args:
        title_list (list of string): list of all titles
    Returns:
        data: a dictionary of {title : intro}
    """
    data = dict()
    partial_titles = ""
    for title in tqdm(title_list):
        if len(partial_titles + title) < 200:
            partial_titles += title + '|'
        else:
            partial_titles += title
            info = get_info(partial_titles)
            data = {**data, **info}
            partial_titles = ""
    # Fetch the titles left over after the last full batch (drop the trailing '|')
    if partial_titles:
        data = {**data, **get_info(partial_titles.rstrip('|'))}
    return data


def get_all_data(universities, append=False):
    """
    dump a data dictionary into data.json file
    
    Args:
        universities(dict) : {university name : alumni list category name}
        append(bool) :  if append is true, the data would be added into existing data dictionary
                        if append is false, the old data would be replaced
    Returns:
        
    """
    data = dict()
    if (append):
        with open('data.json', 'r') as f:
            data = json.load(f)
    for key, value in universities.items():
        print(key)
        data[key] = get_all_info(get_all_members(value))
    with open('data.json', 'w') as fp:
        json.dump(data, fp)

In [10]:
get_all_data(universities)

CMU


100%|██████████| 802/802 [00:09<00:00, 88.03it/s]


Stanford


100%|██████████| 3870/3870 [00:46<00:00, 82.68it/s]


Harvard


100%|██████████| 19706/19706 [04:11<00:00, 78.37it/s]


Yale


100%|██████████| 7414/7414 [01:36<00:00, 76.65it/s]


UCLA


100%|██████████| 2727/2727 [00:31<00:00, 86.30it/s]


MIT


100%|██████████| 3001/3001 [00:36<00:00, 81.59it/s]


Pitt


100%|██████████| 796/796 [00:09<00:00, 84.16it/s]
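
The progress bars count the titles that were requested, not the introductions that were stored, because pages with an empty introduction are skipped. To check what actually ended up in the file, the dump can be loaded back and counted. This is a minimal sketch: `count_alumni` is a hypothetical helper, and it assumes the `data.json` layout described above.

```python
import json


def count_alumni(path='data.json'):
    """Return {university: number of stored introductions} for a dumped data file."""
    with open(path, 'r') as f:
        data = json.load(f)
    return {name: len(people) for name, people in data.items()}

# Possible usage after get_all_data(universities) has written data.json:
#   print(count_alumni())
```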


## Scrape Academic Disciplines

We also need some "ground truth" for the different [academic disciplines](https://en.wikipedia.org/wiki/Outline_of_academic_disciplines) in the later analysis. We need to import `BeautifulSoup` for this part.

In [11]:
from bs4 import BeautifulSoup

Inspecting the HTML of this page shows that the `h3` tags hold the discipline names we want, and that the link to each discipline sits right after its `h3` tag. Thus we can use `BeautifulSoup` to find them.

In [13]:
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): address of the page to download

    Returns:
        status_code (integer): HTTP status code of the response
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    response = requests.get(url)
    return (response.status_code, response.text)

# Scrape academic disciplines
outline_url = "https://en.wikipedia.org/wiki/Outline_of_academic_disciplines"
code, html = retrieve_html(outline_url)
soup = BeautifulSoup(html,'html.parser')
body = soup.find('div',{'id':'bodyContent'})
disciplines = dict()
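for h3 in body.find_all('h3'):
    # The heading text appears to end with "[edit]" (6 characters), which is cut off here
    dis_name = h3.text[:-6]
    # Take the second sibling after the heading, which holds the links
    note = h3.next_sibling.next_sibling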
    disciplines[dis_name] = []
    if (dis_name == 'Human geography'):
        disciplines[dis_name].append('/wiki/Human_geography')
        continue
    for link in note.find_all('a', href=True):
        url = link.attrs['href']
        if (url.find('Outline')==-1):
            disciplines[dis_name].append(url)
print(disciplines)

{'Arts': ['/wiki/The_arts'], 'History': ['/wiki/History'], 'Languages and literature': ['/wiki/Language', '/wiki/Literature'], 'Philosophy': ['/wiki/Philosophy'], 'Theology': ['/wiki/Theology'], 'Anthropology': ['/wiki/Anthropology'], 'Economics': ['/wiki/Economics'], 'Human geography': ['/wiki/Human_geography'], 'Law': ['/wiki/Law'], 'Political science': ['/wiki/Politics', '/wiki/Political_science'], 'Psychology': ['/wiki/Psychology', '/wiki/List_of_psychology_disciplines'], 'Sociology': ['/wiki/Sociology'], 'Biology': ['/wiki/List_of_life_sciences'], 'Chemistry': ['/wiki/Chemistry'], 'Earth sciences': ['/wiki/Earth_science'], 'Space sciences': [], 'Physics': ['/wiki/Physics'], 'Computer Science': ['/wiki/Computer_science'], 'Mathematics': ['/wiki/Mathematics'], 'Statistics': ['/wiki/Statistics'], 'Engineering and technology': ['/wiki/Engineering'], 'Medicine and health': ['/wiki/Medicine', '/wiki/Healthcare_science']}
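
The values collected above are relative links, while the API functions from the earlier sections work with page titles. A small conversion could look like the sketch below. `link_to_title` is a hypothetical helper, and it assumes every link has the form `/wiki/<Page_name>` without a `#fragment`.

```python
from urllib.parse import unquote


def link_to_title(href):
    """Turn a link such as '/wiki/Computer_science' into the title 'Computer science'."""
    name = href.split('/wiki/', 1)[1]
    # Decode percent-escapes and replace underscores with spaces
    return unquote(name).replace('_', ' ')

# Possible usage:
#   titles = [link_to_title(link) for link in disciplines['Psychology']]
```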


The `disciplines.json` file is formatted as follows:
```python
{
    'Discipline Name' : ['Doc1', 'Doc2', ...]
    'Arts' : ['The arts refers to the theory and physical expression of creativity found in human societies and cultures...']
    ...
}
```
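
This part does not show how `disciplines.json` is produced from the `disciplines` dictionary. One plausible way, sketched here under assumptions, is to fetch the introduction of every linked page and store the texts per discipline. `build_discipline_docs` and `save_disciplines` are hypothetical names, and `fetch_intros` is assumed to behave like `get_info`: it takes a `|`-joined title string and returns `{title : intro}`.

```python
import json
from urllib.parse import unquote


def build_discipline_docs(disciplines, fetch_intros):
    """Map each discipline name to the list of introductions of its linked pages."""
    docs = {}
    for name, links in disciplines.items():
        # '/wiki/The_arts' -> 'The_arts'
        titles = [unquote(link.split('/wiki/', 1)[1]) for link in links]
        intros = fetch_intros('|'.join(titles)) if titles else {}
        docs[name] = list(intros.values())
    return docs


def save_disciplines(docs, path='disciplines.json'):
    """Write the {discipline: [doc, ...]} dictionary to a JSON file."""
    with open(path, 'w') as fp:
        json.dump(docs, fp)

# Possible usage (performs network calls):
#   save_disciplines(build_discipline_docs(disciplines, get_info))
```

Disciplines without any link, such as `'Space sciences'` in the output above, end up with an empty list.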