# Project 4: Wikipedia Semantic Search

*Author: William Buck*

### download.py, search.py, and predict.py

The three .py files run in this notebook do the following:
1. Download the page text, page IDs, and titles of all pages in a given Wikipedia category.
    - This information is stored in a MongoDB instance running on AWS. The client connection is set in all three scripts as ```MongoClient('35.163.182.105', 27016)```
    - The data is organized in mongo as follows:
        - Each downloaded category has its own database in mongo.
        - Each collection is a subcategory of the originally downloaded category.
        - Each document in a collection is a page that falls under that category in Wikipedia.
1. Search for any word or phrase in the contents of the downloaded Wikipedia pages.
    - When search.py is first run, it merges the contents of all the mongo databases so that they can be searched together.
1. Predict the category of a page from the Wikipedia page title.
    - predict.py analyzes the page content associated with each category. When a page title is passed to the ```predict``` method, it uses the Wikipedia API to fetch the page text for that title, then predicts the category of the page based on that content.
    - The downloaded data must be stored in a pandas DataFrame before category predictions can be made.
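
For reference, the page-text lookup described in step 3 could be issued against the public MediaWiki API roughly as follows. This is a minimal sketch, not the code in predict.py: the helper name `build_extract_url` is hypothetical, and the sketch assumes the `extracts` property with `explaintext` is used to get plain text. It only builds the request URL, so it runs without network access.

```python
from urllib.parse import urlencode

# English Wikipedia's MediaWiki API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a MediaWiki API URL that requests the plain text of one page.

    Hypothetical helper for illustration; predict.py may query the API differently.
    """
    params = {
        "action": "query",
        "prop": "extracts",   # page text via the TextExtracts extension
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_extract_url("Gore Creek Trail"))
```

Fetching the URL with any HTTP client returns JSON whose `query.pages` entries carry the page ID, title, and extracted text.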



In [2]:
%run download.py

### Instantiating the store_wiki_contents_in_mongo object with the names of the databases I will create then update.

Instantiating ```store_wiki_contents_in_mongo``` sets the database name in mongoDB. Spaces in the string passed as the name are replaced with ```_```, and the suffix ```_wiki_db``` is appended so that all of the Wikipedia databases can be easily recognized.
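
The naming rule can be sketched in a couple of lines. The function name `to_wiki_db_name` is hypothetical and only mirrors the description above; the real logic lives inside download.py.

```python
def to_wiki_db_name(name):
    """Turn a human-readable name into a mongo database name.

    Hypothetical helper: replaces spaces with underscores and
    appends the '_wiki_db' suffix, as described in the text.
    """
    return name.replace(' ', '_') + '_wiki_db'


print(to_wiki_db_name('business software'))  # business_software_wiki_db
```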

In [22]:
busi_soft_mongo = store_wiki_contents_in_mongo('business software')
mach_learn_mongo = store_wiki_contents_in_mongo('machine learning')
wild_mongo = store_wiki_contents_in_mongo('wilderness')
bicycles_mongo = store_wiki_contents_in_mongo('bicycles')
evolutionary_phenom_mongo = store_wiki_contents_in_mongo('evolutionary phenomena')

In [3]:
# see database names here
client.database_names()

['admin',
 'bicycles_wiki_db',
 'business_software_wiki_db',
 'business_software_wiki_db2',
 'evolutionary_phenomena_wiki_db',
 'local',
 'machine_learning_wiki_db',
 'machine_learning_wiki_db2',
 'wilderness_wiki_db']

### Getting all of the page contents for the 5 categories passed as parameters to the get_all_page_contents_from_category method.

A nesting level must be set because Wikipedia's category graph can nest without end. The default is ```0```, which means that no subcategories are included unless a higher level is specified. A naive recursive function cannot collect every subcategory, because a child category often lists its own parent as a subcategory, which creates a loop.
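
To illustrate how a nesting level bounds the traversal, here is a small sketch over an in-memory toy graph. It is not the code in download.py: the function `collect_subcategories`, the `toy_graph` data, and the use of a visited set to guard against loops are all assumptions made for the example.

```python
from collections import deque


def collect_subcategories(graph, root, nesting_level=0):
    """Return the categories reachable from root within nesting_level hops.

    graph maps a category name to a list of its subcategory names.
    A visited set keeps a parent/child loop from being walked twice.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth == nesting_level:
            continue  # depth limit reached: do not descend further
        for child in graph.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Toy data: "cycling" lists its own parent as a subcategory, forming a loop.
toy_graph = {
    "bicycles": ["cycling", "bicycle parts"],
    "cycling": ["bicycles"],
    "bicycle parts": ["wheels"],
}

print(sorted(collect_subcategories(toy_graph, "bicycles", nesting_level=1)))
```

With `nesting_level=0` only the root category is returned, which matches the default behavior described above. The depth limit matters even with loop protection, because the number of categories can grow very quickly at each level.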

In [None]:
# Category: Business Software
# Search can take over an hour.
busi_soft_mongo.get_all_page_contents_from_category('business software', nesting_level=3)

In [None]:
# Category: Machine Learning
# nesting_level=3 is enough to get all pages from all subcategories.
# Search usually takes 15 minutes.
mach_learn_mongo.get_all_page_contents_from_category('machine learning', nesting_level=3)

In [10]:
# Category: Wilderness
wild_mongo.get_all_page_contents_from_category('wilderness', nesting_level=5)

Finished updating "wilderness_wiki_db" with all page content from category "wilderness".

There are 72 subcategories that need to be added to wilderness_wiki_db.

71.70.69.68.67.66.65.64.63.62.61.60.59.58.57.56.55.54.53.52.51.50.49.48.47.46.45.44.43.42.41.40.39.38.37.36.35.34.33.32.31.30.29.28.27.26.25.24.23.22.21.20.19.18.17.16.15.14.13.12.11.10.9.8.7.6.5.4.3.2.1.0.

In [14]:
# Category: Bicycles
bicycles_mongo.get_all_page_contents_from_category('bicycles', nesting_level=4)

Finished updating "bicycles_wiki_db" with all page content from category "bicycles".

There are 66 subcategories that need to be added to bicycles_wiki_db.

65.64.63.62.61.60.59.58.57.56.55.54.53.52.51.50.49.48.47.46.45.44.43.42.41.40.39.38.37.36.35.34.33.32.31.30.29.28.27.26.25.24.23.22.21.20.19.18.17.16.15.14.13.12.11.10.9.8.7.6.5.4.3.2.1.0.

In [41]:
# Category: Evolutionarily significant biological phenomena
# Used this category because it has a large number of subcategories and I wanted to see how the nesting_level 
# scaled and affected performance

# 7 nested levels deep was so large that it had not run in 4+ hours.
# 5 nested levels deep is 3166 subcategories
# 3 nested levels deep is 361 subcategories
# 2 nested levels deep is 113 subcategories

evolutionary_phenom_mongo.get_all_page_contents_from_category(
    'evolutionarily significant biological phenomena',
    overwrite=True,
    nesting_level=2)

About to delete database "evolutionary_phenomena_wiki_db". 
Do you want to continue [y/n]? n


'Quitting without dropping database.'

## Searching all mongo databases related to wiki search

When search.py is run, all of the content is prepared by combining all the collections in all databases into one repository for searching.

In [6]:
%run search.py

Merging 5 mongo databases for Wikipedia search.

67 Categories in bicycles_wiki_db.

1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.64.65.66.67.



315 Categories in business_software_wiki_db.

1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.64.65.66.67.68.69.70.71.72.73.74.75.76.77.78.79.80.81.82.83.84.85.86.87.88.89.90.91.92.93.94.95.96.97.98.99.100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116.117.118.119.120.121.122.123.124.125.126.127.128.129.130.131.132.133.134.135.136.137.138.139.140.141.142.143.144.145.146.147.148.149.150.151.152.153.154.155.156.157.158.159.160.161.162.163.164.165.166.167.168.169.170.171.172.173.174.175.176.177.178.179.180.181.182.183.184.185.186.187.188.189.190.191.192.193.194.195.1

#### Searching
Running a few ad hoc searches. The results are fairly accurate when I search for phrases taken from Wikipedia pages.
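
The results below are ranked by a `cosine_sim` score. As a rough illustration of how such a ranking can work, here is a pure-Python sketch that scores toy documents against a query using TF-IDF weights and cosine similarity. The TF-IDF weighting, the function names, and `toy_docs` are assumptions for the example; search.py may vectorize the text differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simple)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return ({title: {term: weight}}, idf) for a {title: text} mapping."""
    tokenized = {title: tokenize(text) for title, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized.values() for t in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = {
        title: {t: count * idf[t] for t, count in Counter(tokens).items()}
        for title, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_search(query, docs, top_n=5):
    """Rank document titles by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus are ignored.
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(title, cosine(q, vec)) for title, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up snippets, not real page content.
toy_docs = {
    "The Bicycle Wheel": "a bicycle wheel has spokes and a rim",
    "Random neural network": "a neural network model with random spiking signals",
    "Symbiosis": "a close interaction between two organisms",
}

results = toy_search("neural network", toy_docs)
print(results[0][0])
```

A purely term-based score like this rewards shared words, which may help explain why a query containing "fiber" can surface a page such as Google Fiber.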

In [8]:
# Steve Fredette was the founder of Toast, Inc.

%time search('Steve Fredette')

CPU times: user 46.1 s, sys: 11.6 s, total: 57.7 s
Wall time: 34.9 s


| title | cosine_sim |
| --- | --- |
| Toast, Inc. | 0.189185 |
| NeXTMail | 0.098027 |
| Excellence (software) | 0.068203 |
| Google Kythe | 0.063162 |
| PaperClip | 0.057718 |


In [32]:
# Added a bunch of random words related to bicycles to see how the search would react.

search('singletrack mountain wheel carbon fiber')

| title | cosine_sim |
| --- | --- |
| Google Fiber | 0.288891 |
| The Bicycle Wheel | 0.235388 |
| Big wheel (tricycle) | 0.202178 |
| Copenhagen Wheel | 0.186923 |
| Carbon-based life | 0.184708 |


In [33]:
search('cell organism evolution nature neural network')

| title | cosine_sim |
| --- | --- |
| Random neural network | 0.353688 |
| NeuroSolutions | 0.328894 |
| Neural network software | 0.327341 |
| Category:Neural network software | 0.326809 |
| Computational neurogenetic modeling | 0.317748 |


### Instantiating df_maker_merge_all_mongo_content to merge the contents of all Wikipedia-related mongo databases, then pickling the DataFrame

The resulting DataFrame is used by the predict.py script.

In [12]:
dmmamc = df_maker_merge_all_mongo_content()

In [13]:
# getting a list of the databases I want to merge
wiki_db_list = [x for x in client.database_names() if x.endswith('_wiki_db')]
wiki_db_list

['bicycles_wiki_db',
 'business_software_wiki_db',
 'evolutionary_phenomena_wiki_db',
 'machine_learning_wiki_db',
 'wilderness_wiki_db']

In [14]:
# pickling the DataFrame of the merged databases
final_df = dmmamc.merge_databases(wiki_db_list)
pd.to_pickle(final_df, 'Data/merged_wiki_databases_df.p')

In [15]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19411 entries, 0 to 19410
Data columns (total 5 columns):
_id         19411 non-null object
category    19411 non-null object
content     19411 non-null object
page_id     19411 non-null int64
title       19411 non-null object
dtypes: int64(1), object(4)
memory usage: 758.3+ KB


### Predict category from page title

The 5 items returned are the 5 categories most closely related to the page, based on the page's content.
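
One way to turn per-page similarity scores into a ranked list of categories is sketched below. This is an assumption about how such a ranking could be built, not the method in predict.py: the function `top_categories` is hypothetical, and the scores are made-up toy values.

```python
def top_categories(scored_pages, k=5):
    """Return the k categories with the highest best-page similarity.

    scored_pages is an iterable of (category, similarity) pairs,
    one pair per downloaded page.
    """
    best = {}
    for category, score in scored_pages:
        # Keep only the highest score seen for each category.
        if score > best.get(category, float("-inf")):
            best[category] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]


# Toy scores for illustration only.
scored = [
    ("Deep learning", 0.41),
    ("Unsupervised learning", 0.22),
    ("Deep learning", 0.35),
    ("Ecosystems", 0.03),
]

print(top_categories(scored, k=2))
```

Taking the best page per category is only one choice; averaging the scores within each category would favor categories whose pages are consistently similar rather than those with one strong match.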

In [11]:
%run predict.py

In [16]:
%time predict('machine learning')

CPU times: user 40.1 s, sys: 11.8 s, total: 51.9 s
Wall time: 31.4 s


['Deep learning',
 'Artificial neural networks',
 'Neural network software',
 'Unsupervised learning',
 'Data mining and machine learning software']

In [17]:
predict('Gore Creek Trail')

['Eagles Nest Wilderness',
 'Holy Cross Wilderness',
 'Eagle Cap Wilderness',
 'Wilderness Areas of Colorado',
 'Wilderness Areas of Virginia']

In [18]:
predict('symbiosis')

['Symbiosis',
 'Mutualism (biology)',
 'Parasitism',
 'Ecosystems',
 'Superorganisms']