# PROJECT 4: Semantic Search

## The Task
The objective of this assignment is to engineer a novel Wikipedia search engine using what you've learned about data collection, infrastructure, and natural language processing.

The task has two **required sections:**
- Data collection
- Search algorithm development

And one **optional section:**
- Predictive modeling

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part 1 -- Collection (required)

We want you to query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:

* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The raw page text and its category information should be written to a collection on a Mongo server running on a dedicated AWS instance.
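A minimal sketch of how one page might be stored, assuming a pymongo `Collection` object is passed in. The helper names `make_page_document` and `upsert_page` are hypothetical, and the field names simply mirror the columns that appear in the notebook output further down; the assignment itself does not prescribe a schema.

```python
def make_page_document(page_id, title, content, category, parent_category):
    """Build the document stored for one Wikipedia page (assumed schema)."""
    return {
        "page_id": page_id,
        "title": title,
        "content": content,
        "category": category,
        "parent_category": parent_category,
    }


def upsert_page(collection, document):
    """Insert the page, or overwrite it if its page_id is already stored.

    `collection` is expected to behave like a pymongo Collection, e.g.
    pymongo.MongoClient(host, port)["some_db"]["some_collection"].
    """
    collection.update_one(
        {"page_id": document["page_id"]},  # match on Wikipedia's page id
        {"$set": document},                # replace the stored fields
        upsert=True,                       # create the document if missing
    )
```

Upserting on `page_id` keeps a re-run of the download from creating duplicate documents for the same page.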

We want your code to be modular enough that it can query any valid Wikipedia category. You are encouraged to exploit this modularity to pull additional Wikipedia categories beyond ML and Business Software. As always, the more data the better.

**Note:** Both "Machine Learning" and "Business Software" contain a hierarchy of nested sub-categories. Make sure that you pull every single page within each parent category, not just those directly beneath it. Take time to explore Wikipedia's organizational structure. It is up to you whether to model this hierarchy anywhere within Mongo; otherwise, flatten it by recording only the parent category associated with each page.
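The traversal of nested sub-categories could be sketched as below, using the MediaWiki `categorymembers` query. The function names, the `User-Agent` string, and the injectable `fetch` argument are assumptions made for illustration. This sketch records the immediate sub-category in which each page was first found; recording only the top-level parent instead would be a small change.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
ARTICLE_NS = 0    # MediaWiki namespace number for ordinary articles
CATEGORY_NS = 14  # MediaWiki namespace number for category pages


def fetch_members(category):
    """Return every direct member of a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "500",
        "format": "json",
    }
    members = []
    while True:
        request = Request(API_URL + "?" + urlencode(params),
                          headers={"User-Agent": "wiki-search-coursework/0.1"})
        with urlopen(request) as response:
            data = json.load(response)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        params.update(data["continue"])  # carries cmcontinue for the next batch


def collect_pages(category, nesting_level, fetch=fetch_members, seen=None):
    """Map page id -> (title, category it was first found in).

    Descends at most `nesting_level` levels below `category`; `seen`
    guards against categories that reference each other in a cycle.
    """
    seen = set() if seen is None else seen
    seen.add(category)
    pages = {}
    for member in fetch(category):
        if member["ns"] == CATEGORY_NS:
            sub = member["title"].split(":", 1)[1]  # strip the "Category:" prefix
            if nesting_level > 0 and sub not in seen:
                nested = collect_pages(sub, nesting_level - 1, fetch, seen)
                for page_id, info in nested.items():
                    pages.setdefault(page_id, info)
        elif member["ns"] == ARTICLE_NS:
            pages.setdefault(member["pageid"], (member["title"], category))
    return pages
```

Passing `fetch` as an argument makes it possible to exercise the traversal logic against a small in-memory category tree without calling the live API.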

**optional**  
Make it so that your code can be run via a Python script, e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY#
```
This docker command starts a disposable scipy-notebook container for one-time use and runs your script, `download.py`, where `#SOME_CATEGORY#` is the Wikipedia category to download. Read about passing arguments to Python scripts here: https://docs.python.org/3/library/sys.html.

**optional**  
Make it so that your code can query nested sub-categories, e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY# #NESTING_LEVEL#
```
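Reading the two command-line arguments might look like the sketch below. The function name `parse_args` and the default nesting level of 0 are assumptions, not requirements of the assignment.

```python
import sys


def parse_args(argv=None):
    """Return (category, nesting_level) from a sys.argv-style list.

    argv[0] is the script name, argv[1] the category, and the optional
    argv[2] the nesting level (assumed to default to 0 when omitted).
    """
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        raise SystemExit("usage: python download.py CATEGORY [NESTING_LEVEL]")
    category = argv[1]
    nesting_level = int(argv[2]) if len(argv) > 2 else 0
    return category, nesting_level
```

Inside `download.py`, calling `parse_args()` with no arguments would read `sys.argv` directly. A category containing spaces, such as `"Business software"`, has to be quoted on the command line so that it arrives as a single argument.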

### Part 2 -- Search (required)

Use Latent Semantic Analysis to search your pages. Given a search query, find the 5 articles most related to it. SVD and cosine similarity are a good place to start.
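One way to approach this, sketched with scikit-learn: build a TF-IDF matrix, reduce it with truncated SVD (the LSA step), project the query into the same space, and rank pages by cosine similarity. The class name `LsaSearch`, the toy corpus, and `n_components=2` are illustrative assumptions; a real corpus would likely need far more components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class LsaSearch:
    """Rank documents against a query in a reduced LSA space."""

    def __init__(self, titles, texts, n_components=2):
        self.titles = list(titles)
        self.vectorizer = TfidfVectorizer(stop_words="english")
        self.svd = TruncatedSVD(n_components=n_components, random_state=0)
        tfidf = self.vectorizer.fit_transform(texts)
        # Each row is one document expressed in the latent topic space.
        self.doc_vectors = self.svd.fit_transform(tfidf)

    def search(self, query, top_k=5):
        """Return up to top_k (title, cosine similarity) pairs, best first."""
        query_vector = self.svd.transform(self.vectorizer.transform([query]))
        scores = cosine_similarity(query_vector, self.doc_vectors)[0]
        ranked = scores.argsort()[::-1][:top_k]
        return [(self.titles[i], float(scores[i])) for i in ranked]


# Toy corpus standing in for the page texts stored in Mongo.
titles = ["Neural network", "Decision tree", "Accounting software", "Payroll system"]
texts = [
    "A neural network is a machine learning model trained with gradient descent.",
    "A decision tree is a machine learning model used for classification.",
    "Accounting software records invoices and payroll transactions for a business.",
    "A payroll system handles business payroll and employee invoices.",
]
engine = LsaSearch(titles, texts, n_components=2)
results = engine.search("machine learning model", top_k=2)
```

Because the query is projected with the same fitted vectorizer and SVD as the documents, the index only needs to be built once and can be reused for every search.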

**optional**  
Make it so that your code can be run via a Python script, e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python search.py #SOME_TERM#
```

### Part 3 -- Predictive Model (optional)

In this part, we want you to build a predictive model from the data you've just indexed. Specifically, when a new article from Wikipedia comes along, we would like to be able to predict what category the article should fall into. We expect a runnable training script that fits a model.

Make it so that your code can be run via a Python script, e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python train.py
```
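A training script could be as small as the sketch below, which assumes the page texts and their category labels have already been loaded from Mongo into two lists. TF-IDF features with logistic regression are only one reasonable baseline, and the function name `train_category_model` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_category_model(texts, categories):
    """Fit a TF-IDF + logistic regression pipeline on page texts."""
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, categories)
    return model


# Toy data standing in for the documents stored in Mongo.
texts = [
    "A neural network is a machine learning model trained with gradient descent.",
    "A decision tree is a machine learning model used for classification.",
    "Accounting software records invoices and payroll transactions for a business.",
    "A payroll system handles business payroll and employee invoices.",
]
categories = ["Machine learning", "Machine learning",
              "Business software", "Business software"]
model = train_category_model(texts, categories)
```

Keeping the vectorizer and classifier in one pipeline means the fitted object accepts raw text directly, which simplifies a later prediction script.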

Finally, you should be able to pass the URL of a Wikipedia page and get back a prediction of the best category for that page, along with the probability that it is the correct category.

Make it so that your code can be run via a Python script, e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python predict.py #URL#
```
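Two pieces of the prediction script are sketched here: turning the URL into a page title, and picking the most probable category. Both helper names are hypothetical, and `model` is assumed to be any fitted classifier that exposes `predict_proba` and `classes_`, as scikit-learn classifiers and pipelines do. Fetching the page text for that title is left out.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Extract the page title from a URL like https://en.wikipedia.org/wiki/Some_page."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a Wikipedia article URL: " + url)
    # Undo percent-encoding and turn underscores back into spaces.
    return unquote(path[len(prefix):]).replace("_", " ")


def best_category(model, text):
    """Return (category, probability) for the most likely class of `text`."""
    probabilities = model.predict_proba([text])[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return model.classes_[best], float(probabilities[best])
```

The probability returned is the model's own estimate for its top class, so it is only as trustworthy as the classifier's calibration.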

## Infrastructure

We recommend that you run a MongoDB server on a dedicated t2.micro instance. Feel free to run your Jupyter environment either on another instance or locally.




In [1]:
import re
import requests
import pandas as pd
import numpy as np
import pymongo

from string import punctuation
from bs4 import BeautifulSoup

client = pymongo.MongoClient('35.163.182.105', 27016)

In [20]:
%run download.py

In [23]:
client.database_names()

['admin', 'business_software_wiki_db', 'local', 'test', 'wiki_content_db']

### Business Software Category

In [4]:
busi_soft_mongo = store_wiki_contents_in_mongo('business_software_wiki_db')

In [5]:
%time busi_soft_mongo.get_all_page_contents_from_category('Business software', overwrite=True, nesting_level=2)

Finished updating "business_software_wiki_db" with all page content from category "ASP Accounting Systems". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Abstract management software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Account aggregation providers". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Accounting software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Accounting software for Linux". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Administrative software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from cat

Finished updating "business_software_wiki_db" with all page content from category "Electronic health record software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Electronic health records". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Electronic trading platforms". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Electronic trading systems". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Email client software for Linux". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Email clients". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content fr

Finished updating "business_software_wiki_db" with all page content from category "Human resource management software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "IBM WebSphere". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Industry-specific XML-based standards". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Instant messaging clients". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Internet search engines". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Java enterprise platform". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content fr

Finished updating "business_software_wiki_db" with all page content from category "Publication management software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Publishing software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Recommender systems". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Records management". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Recruitment software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Repetitive strain injury software". | Parent Category: "Microfinance software"
Finished updating "business_software_wiki_db" with all page content from category "Rep

In [6]:
client.database_names()

['admin',
 'business_software_wiki_db',
 'local',
 'test',
 'test_wiki_db',
 'wiki_content_db']

In [11]:
cmdm = collection_merger_df_maker()

In [12]:
bs_df = cmdm.merge_collections('business_software_wiki_db')

In [15]:
bs_df.head()

| | _id | category | content | page_id | parent_category | title |
|---|---|---|---|---|---|---|
| 0 | 5a1217ba830fdb7f6a5fe1ef | Zoo Tycoon | Zoo Tycoon 2 African Adventure is the second e... | 6453530 | Microfinance software | Zoo Tycoon 2: African Adventure |
| 1 | 5a1217ba830fdb7f6a5fe1f0 | Zoo Tycoon | Blue Fang Games often shortened to Blue Fang w... | 9069467 | Microfinance software | Blue Fang Games |
| 2 | 5a1217bb830fdb7f6a5fe1f1 | Zoo Tycoon | Zoo Tycoon 2 Dino Danger Pack is a bonus pack ... | 6106017 | Microfinance software | Zoo Tycoon 2: Dino Danger Pack |
| 3 | 5a1217bb830fdb7f6a5fe1f2 | Zoo Tycoon | Zoo Tycoon 2 Endangered Species is the first o... | 6454658 | Microfinance software | Zoo Tycoon 2: Endangered Species |
| 4 | 5a1217bc830fdb7f6a5fe1f3 | Zoo Tycoon | Zoo Tycoon 2 Extinct Animals is a video game e... | 9000696 | Microfinance software | Zoo Tycoon 2: Extinct Animals |


In [13]:
bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8569 entries, 0 to 8568
Data columns (total 6 columns):
_id                8569 non-null object
category           8569 non-null object
content            8569 non-null object
page_id            8569 non-null int64
parent_category    8569 non-null object
title              8569 non-null object
dtypes: int64(1), object(5)
memory usage: 401.8+ KB


In [14]:
lengths = []
for x in client.business_software_wiki_db.collection_names():
    lengths.append(len(cmdm.make_collection_df('business_software_wiki_db', x)))
    
sum(lengths)

8569

### Machine Learning Category

In [24]:
ml_mongo = store_wiki_contents_in_mongo('machine_learning_wiki_db')
%time ml_mongo.get_all_page_contents_from_category('machine learning', nesting_level=3)

Finished updating "machine_learning_wiki_db" with all page content from category "machine learning". 
               Remain Categories: 0
There are 48 subcategories that need to be added to machine_learning_wiki_db.
Finished updating "machine_learning_wiki_db" with all page content from category "Applied machine learning". 
               Remain Categories: 47
Finished updating "machine_learning_wiki_db" with all page content from category "Artificial immune systems". 
               Remain Categories: 46
Finished updating "machine_learning_wiki_db" with all page content from category "Artificial intelligence conferences". 
               Remain Categories: 45
Finished updating "machine_learning_wiki_db" with all page content from category "Artificial neural networks". 
               Remain Categories: 44
Finished updating "machine_learning_wiki_db" with all page content from category "Bayesian networks". 
               Remain Categories: 43
Finished updating "machine_learning_wiki_d

In [291]:
len(wiki_content_db.collection_names())

49

In [None]:
wiki_content_db.collection_names()

In [281]:
client.database_names()

['admin', 'local', 'test', 'wiki_content_db']

In [315]:
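for x in sorted(wiki_content_db.collection_names()):
    # Count the pages in this collection whose content is empty (query runs once).
    n_empty = len(list(wiki_content_db[x].find({'content': ''})))
    if n_empty >= 1:
        # x[:-24] drops the 24-character suffix from the collection name.
        print('{:42}'.format(x[:-24]), n_empty)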

Artificial_neural_networks                 2
Classification_algorithms                  1
Cluster_analysis                           1
Data_mining_and_machine_learning_software  1
Decision_trees                             1
Dimension_reduction                        1
Evolutionary_algorithms                    4
Genetic_algorithms                         6
Graphical_models                           2
Kernel_methods_for_machine_learning        1
Latent_variable_models                     2
Markov_models                              2
Statistical_natural_language_processing    1
Structured_prediction                      1
machine_learning                           17


In [None]:
### saving this for later, code is to rename a db

# client.admin.command('copydb',
#                      fromdb='source_db_name',
#                      todb='target_db_name')

# client.drop_database('<DBNAME>')
