# PROJECT 4: Semantic Search

## The Task
The objective of this assignment is to engineer a novel Wikipedia search engine using what you've learned about data collection, infrastructure, and natural language processing.

The task has two **required sections:**
- Data collection
- Search algorithm development

And one **optional section:**
- Predictive modeling

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part 1 -- Collection (required)

We want you to query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:

* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)
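
As one possible starting point, the sketch below pulls the plain text of a single page through the MediaWiki `action=query` endpoint with `prop=extracts`. The function names, the `User-Agent` string, and the use of `urllib` rather than `requests` are assumptions made for illustration, not requirements of the assignment.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier: Wikimedia asks API clients to send a descriptive one.
USER_AGENT = "semantic-search-coursework/0.1 (student project)"


def build_extract_url(page_id):
    """Build the query URL for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # ask for plain text instead of HTML
        "pageids": page_id,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_extract(payload):
    """Pull (page_id, title, text) out of a decoded API response."""
    page = next(iter(payload["query"]["pages"].values()))
    return page["pageid"], page["title"], page.get("extract", "")


def fetch_page_text(page_id):
    """Download and parse one page (performs a network request)."""
    request = Request(build_extract_url(page_id),
                      headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return parse_extract(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline; `fetch_page_text` is the only function that needs a connection.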

The raw page text and its category information should be written to a collection on a Mongo server running on a dedicated AWS instance.
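
A minimal sketch of the write step follows, assuming one Mongo document per page with the fields `page_id`, `title`, `content`, `category`, and `parent_category`. The helper name `store_pages` is hypothetical; the only PyMongo call it relies on is `Collection.update_one` with `upsert=True`.

```python
def store_pages(collection, pages, category, parent_category):
    """Upsert one document per page so a re-run does not create duplicates.

    `collection` is assumed to be a pymongo Collection (anything exposing
    `update_one` works); `pages` is an iterable of (page_id, title, text).
    """
    stored = 0
    for page_id, title, text in pages:
        collection.update_one(
            {"page_id": page_id},  # match on the Wikipedia page id
            {"$set": {
                "title": title,
                "content": text,
                "category": category,
                "parent_category": parent_category,
            }},
            upsert=True,  # insert when no document matches
        )
        stored += 1
    return stored

# With a real server (host, database, and collection names are placeholders):
# from pymongo import MongoClient
# collection = MongoClient("HOST", 27017)["wiki_content_db"]["pages"]
```

Upserting on `page_id` is one design choice among several; dropping and recreating the collection on each run would also satisfy the requirement.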

We want your code to be modular enough that any valid category from Wikipedia can be queried by your code. You are encouraged to exploit this modularity to pull additional Wikipedia categories beyond ML and Business Software. As always, the more data the better.

**Note:** Both "Machine Learning" and "Business Software" contain a hierarchy of nested sub-categories. Make sure that you pull every single page within each parent category, not just the pages directly beneath it. Take time to explore Wikipedia's organization structure. It is up to you whether to model this hierarchy anywhere within Mongo; otherwise, flatten it by recording only the parent category associated with each page.
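
One way to cover the nested sub-categories is a breadth-first walk over the MediaWiki `list=categorymembers` results, sketched below. The names `fetch_members` and `walk_category`, the `max_depth` semantics (0 means the root category only), and the `User-Agent` string are assumptions for illustration. The `fetch` argument is injectable so the traversal can be exercised without network access.

```python
import json
from collections import deque
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "semantic-search-coursework/0.1 (student project)"  # hypothetical


def fetch_members(category):
    """Return (pages, subcategories) listed directly under one category."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": "Category:" + category, "cmlimit": 500,
              "format": "json"}
    pages, subcategories = [], []
    while True:
        request = Request(API_URL + "?" + urlencode(params),
                          headers={"User-Agent": USER_AGENT})
        with urlopen(request) as response:
            data = json.load(response)
        for member in data["query"]["categorymembers"]:
            if member["ns"] == 14:    # namespace 14: a sub-category
                subcategories.append(member["title"].split(":", 1)[1])
            elif member["ns"] == 0:   # namespace 0: an article
                pages.append((member["pageid"], member["title"]))
        if "continue" not in data:    # no further batches of results
            return pages, subcategories
        params.update(data["continue"])


def walk_category(root, fetch=fetch_members, max_depth=None):
    """Collect every page under `root`, flattened to the parent category."""
    seen = {root}   # guards against cycles in the category graph
    found = {}      # page_id -> record; the first category seen wins
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        pages, subcategories = fetch(category)
        for page_id, title in pages:
            found.setdefault(page_id, {"page_id": page_id, "title": title,
                                       "category": category,
                                       "parent_category": root})
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend below the requested depth
        for sub in subcategories:
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return list(found.values())
```

The `seen` set matters because Wikipedia categories can reference one another, and a page listed under several sub-categories is recorded only once.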

**optional**  
Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY#
```
This docker command starts a disposable scipy-notebook container for one-time use to run your script, `download.py`, where `#SOME_CATEGORY#` is the Wikipedia category to be downloaded. Read about passing arguments to Python scripts here: https://docs.python.org/3/library/sys.html.

**optional**  
Make it so that your code can query nested sub-categories e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY# #NESTING_LEVEL#
```
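
The argument handling for the two commands above might look like the following sketch, which reads `sys.argv` directly. The function names and the choice to treat a missing nesting level as "no limit" (`None`) are assumptions.

```python
import sys


def parse_cli(argv):
    """Return (category, nesting_level) from a sys.argv-style list."""
    if len(argv) < 2:
        raise SystemExit("usage: download.py CATEGORY [NESTING_LEVEL]")
    category = argv[1]
    # The nesting level is optional; None is taken to mean "no limit".
    nesting_level = int(argv[2]) if len(argv) > 2 else None
    return category, nesting_level


def main(argv=None):
    """Entry point; download.py would end with a call to main()."""
    category, nesting_level = parse_cli(sys.argv if argv is None else argv)
    print("category:", category, "nesting level:", nesting_level)
    return category, nesting_level
```

A category name containing a space has to be quoted on the command line, for example `python download.py "Business software" 1`, otherwise the shell splits it into two arguments.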

### Part 2 -- Search (required)

Use Latent Semantic Analysis to search your pages. Given a search query, find the top 5 articles most related to it. SVD and cosine similarity are a good place to start.
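
The pipeline can be sketched in plain NumPy: weight a term-document matrix with TF-IDF, keep the leading singular vectors, project both documents and query into that reduced space, and rank by cosine similarity. The class name `LsaIndex`, the tokenizer, and the smoothed IDF formula are assumptions chosen for brevity, not a prescribed design.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LsaIndex:
    def __init__(self, titles, texts, n_components=100):
        self.titles = list(titles)
        docs = [Counter(tokenize(text)) for text in texts]
        terms = sorted(set().union(*docs))
        self.vocab = {term: i for i, term in enumerate(terms)}
        counts = np.zeros((len(docs), len(self.vocab)))
        for row, doc in enumerate(docs):
            for term, n in doc.items():
                counts[row, self.vocab[term]] = n
        # Smoothed inverse document frequency (one common variant).
        doc_freq = (counts > 0).sum(axis=0)
        self.idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1.0
        tfidf = counts * self.idf
        # Truncated SVD: keep the k leading right singular vectors.
        _, singular_values, vt = np.linalg.svd(tfidf, full_matrices=False)
        k = min(n_components, len(singular_values))
        self.components = vt[:k]                        # (k, n_terms)
        self.doc_vectors = tfidf @ self.components.T    # (n_docs, k)

    def search(self, query, top_n=5):
        """Return up to top_n (title, cosine similarity) pairs, best first."""
        q = np.zeros(len(self.vocab))
        for term, n in Counter(tokenize(query)).items():
            if term in self.vocab:  # unseen query terms are ignored
                q[self.vocab[term]] = n
        q_vec = (q * self.idf) @ self.components.T
        norms = np.linalg.norm(self.doc_vectors, axis=1) * np.linalg.norm(q_vec)
        sims = np.divide(self.doc_vectors @ q_vec, norms,
                         out=np.zeros(len(self.titles)), where=norms > 0)
        order = np.argsort(-sims)[:top_n]
        return [(self.titles[i], float(sims[i])) for i in order]
```

A full dense SVD is only practical for small corpora. For a few thousand Wikipedia pages, scikit-learn's `TfidfVectorizer` followed by `TruncatedSVD` performs the same steps on sparse matrices.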

**optional**  
Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python search.py #SOME_TERM#
```

### Part 3 -- Predictive Model (optional)

In this part, we want you to build a predictive model from the data you've just indexed. Specifically, when a new article from Wikipedia comes along, we would like to be able to predict what category the article should fall into. We expect a training script of some sort that is runnable and will estimate a model.

Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python train.py
```

Finally, you should be able to pass the URL of a Wikipedia page and it will generate a prediction for the best category for that page, along with a probability of that being the correct category.

Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python predict.py #URL#
```
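
The assignment does not name a model, so the sketch below uses a multinomial naive Bayes classifier with add-one smoothing purely as an illustration of "best category plus probability". The class name and tokenizer are hypothetical, and downloading the page text for a given URL is left out.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesCategorizer:
    def fit(self, texts, categories):
        """Count terms per category; train.py would call this."""
        self.term_counts = defaultdict(Counter)
        self.doc_counts = Counter(categories)
        for text, category in zip(texts, categories):
            self.term_counts[category].update(tokenize(text))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict_proba(self, text):
        """Return {category: probability} for one document."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.term_counts[category]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            log_scores[category] = score
        # Normalize in log space so long documents do not underflow.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def predict(self, text):
        """Return (best category, probability); predict.py would call this."""
        probs = self.predict_proba(text)
        best = max(probs, key=probs.get)
        return best, probs[best]
```

Naive Bayes probabilities tend to be overconfident on long documents, so treat the reported number as a ranking score rather than a calibrated estimate.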

## Infrastructure

We recommend that you run a MongoDB server on a dedicated t2.micro instance. Feel free to run your Jupyter environment either on another instance or locally.




In [9]:
import re
import requests
import pandas as pd
import numpy as np
import pymongo

from string import punctuation
from bs4 import BeautifulSoup

client = pymongo.MongoClient('35.163.182.105', 27016)

In [97]:
%run download.py

In [96]:
client.database_names()

['admin',
 'business_software_wiki_db',
 'local',
 'test',
 'test_wiki_db',
 'wiki_content_db']

In [15]:
client.business_software_wiki_db.collection_names()

[]

### Business Software Category

In [17]:
busi_soft_mongo = store_wiki_contents_in_mongo('business_software_wiki_db')

In [18]:
%time busi_soft_mongo.get_all_page_contents_from_category('Business software', overwrite=True)

CPU times: user 57.7 s, sys: 4.71 s, total: 1min 2s
Wall time: 16min 56s


In [108]:
client.database_names()

['admin',
 'business_software_wiki_db',
 'local',
 'test',
 'test_wiki_db',
 'wiki_content_db']

In [107]:
class collection_merger_df_maker():
    """Load MongoDB collections into pandas DataFrames."""

    def __init__(self):
        # Per-instance result, so repeated merges do not pile onto shared state.
        self.my_df = pd.DataFrame()

    def make_collection_df(self, database, collection):
        # One DataFrame row per document in the collection.
        return pd.DataFrame(list(client[database][collection].find()))

    def merge_collections(self, database):
        # Build every frame first, then concatenate once rather than once per loop.
        frames = [self.make_collection_df(database, col)
                  for col in client[database].collection_names()]
        self.my_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
        return self.my_df

In [None]:
cmdm = collection_merger_df_maker()

In [81]:
bs_df = cmdm.merge_collections('business_software_wiki_db')

In [93]:
bs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1875 entries, 0 to 1874
Data columns (total 6 columns):
_id                1875 non-null object
category           1875 non-null object
content            1875 non-null object
page_id            1875 non-null int64
parent_category    1875 non-null object
title              1875 non-null object
dtypes: int64(1), object(5)
memory usage: 88.0+ KB


In [94]:
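# Count the documents in each collection so the total can be compared with bs_df.
lengths = []
for x in client.business_software_wiki_db.collection_names():
    lengths.append(len(cmdm.make_collection_df('business_software_wiki_db', x)))
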
sum(lengths)

1892

In [None]:
test_mongo = store_wiki_contents_in_mongo('test_wiki_db')
%time test_mongo.get_all_page_contents_from_category('Business software', overwrite=True, nesting_level=1)

In [109]:
len(pd.DataFrame(list(client.test_wiki_db.Business_software_wiki_content_collection.find())))

827

### Machine Learning Category

In [9]:
wiki_content_db = client.wiki_content_db
wiki_coll_ref = wiki_content_db.machine_learning_wiki_content_collection

In [274]:
# %time get_all_page_contents_from_category('Machine learning', overwrite=True)

CPU times: user 50.8 s, sys: 6.54 s, total: 57.3 s
Wall time: 17min 10s


In [291]:
len(wiki_content_db.collection_names())

49

In [None]:
wiki_content_db.collection_names()

In [11]:
pd.DataFrame(list(wiki_content_db.machine_learning_wiki_content_collection.find()))

|   | _id | category | content | page_id | parent_category | title |
|---|---|---|---|---|---|---|
| 0 | 5a0d36fb830fdb7bcde23811 | machine learning | Data exploration is an approach similar to ini... | 43385931 | none | Data exploration |
| 1 | 5a0d36fb830fdb7bcde23812 | machine learning | These datasets are used for machinelearning re... | 49082762 | none | List of datasets for machine learning research |
| 2 | 5a0d36fc830fdb7bcde23813 | machine learning | Machine learning is a field of computer scienc... | 233488 | none | Machine learning |
| 3 | 5a0d36fd830fdb7bcde23814 | machine learning | The following outline is provided as an overvi... | 53587467 | none | Outline of machine learning |
| 4 | 5a0d36fd830fdb7bcde23815 | machine learning | The accuracy paradox for predictive analytics ... | 3771060 | none | Accuracy paradox |
| 5 | 5a0d36fe830fdb7bcde23816 | machine learning | Action model learning sometimes abbreviated ac... | 43808044 | none | Action model learning |
| 6 | 5a0d36fe830fdb7bcde23817 | machine learning | Active learning is a special case of semisuper... | 28801798 | none | Active learning (machine learning) |
| 7 | 5a0d36ff830fdb7bcde23818 | machine learning | Adversarial machine learning is a research fie... | 45049676 | none | Adversarial machine learning |
| 8 | 5a0d3700830fdb7bcde23819 | machine learning | AIVA Artificial Intelligence Virtual Artist is... | 52642349 | none | AIVA |
| 9 | 5a0d3700830fdb7bcde2381a | machine learning | AIXI ai̯k͡siː is a theoretical mathematical fo... | 30511763 | none | AIXI |


In [281]:
client.database_names()

['admin', 'local', 'test', 'wiki_content_db']

In [315]:
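for x in sorted(wiki_content_db.collection_names()):
    # Query once per collection for documents whose content is empty.
    n_empty = len(list(wiki_content_db[x].find({'content':''})))
    if n_empty >= 1:
        # x[:-24] drops the trailing '_wiki_content_collection' suffix.
        print('{:42}'.format(x[:-24]), n_empty)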

Artificial_neural_networks                 2
Classification_algorithms                  1
Cluster_analysis                           1
Data_mining_and_machine_learning_software  1
Decision_trees                             1
Dimension_reduction                        1
Evolutionary_algorithms                    4
Genetic_algorithms                         6
Graphical_models                           2
Kernel_methods_for_machine_learning        1
Latent_variable_models                     2
Markov_models                              2
Statistical_natural_language_processing    1
Structured_prediction                      1
machine_learning                           17


In [None]:
### saving this for later, code is to rename a db

# client.admin.command('copydb',
#                      fromdb='source_db_name',
#                      todb='target_db_name')

# client.drop_database('<DBNAME>')
