# PROJECT 4: Semantic Search

## The Task
The objective of this assignment is to engineer a novel Wikipedia search engine using what you've learned about data collection, infrastructure, and natural language processing.

The task has two **required sections:**
- Data collection
- Search algorithm development

And one **optional section:** 
  - Predictive modeling

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part 1 -- Collection (required)

We want you to query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:

* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The raw page text and its category information should be written to a collection on a Mongo server running on a dedicated AWS instance.

We want your code to be modular enough that it can query any valid Wikipedia category. You are encouraged to exploit this modularity to pull additional Wikipedia categories beyond ML and Business Software. As always, the more data the better.

**Note:** Both "Machine Learning" and "Business Software" contain a hierarchy of nested sub-categories. Make sure that you pull every single page within each parent category, not just those directly beneath it. Take time to explore Wikipedia's organization structure. It is up to you whether to model this hierarchy anywhere within Mongo; otherwise, flatten it by recording only the parent category associated with each page.
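
The nested traversal described in the note can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes the MediaWiki `categorymembers` query, where namespace 14 marks a sub-category and namespace 0 an article. The names `fetch_members`, `crawl_category`, `max_depth`, and the User-Agent string are placeholders chosen for this example. It collects titles only (fetching page text is a separate step) and flattens the hierarchy by recording the top-level parent category.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_members(category, cmcontinue=None):
    """Fetch one batch of members of `category` from the MediaWiki API."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,  # e.g. "Category:Machine learning"
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent value is a placeholder for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "semantic-search-project/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def crawl_category(category, max_depth=2, fetch=fetch_members):
    """Return {page_title: top_level_category} for a category and its sub-categories."""
    pages = {}
    seen = set()            # guards against cycles in the category graph
    stack = [(category, 0)]
    while stack:
        current, depth = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        cmcontinue = None
        while True:
            data = fetch(current, cmcontinue)
            for member in data["query"]["categorymembers"]:
                if member["ns"] == 14:      # sub-category
                    if depth < max_depth:
                        stack.append((member["title"], depth + 1))
                elif member["ns"] == 0:     # ordinary article
                    pages.setdefault(member["title"], category)
            # Follow API continuation until the listing is exhausted.
            cmcontinue = data.get("continue", {}).get("cmcontinue")
            if not cmcontinue:
                break
    return pages
```

Passing `fetch` as a parameter keeps the traversal testable without network access; calling `crawl_category("Category:Machine learning")` would hit the live API.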

**optional**  
Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY#
```
This docker command starts a disposable scipy-notebook container for one-time use to run your script, `download.py`, where `#SOME_CATEGORY#` is the Wikipedia category to be downloaded. Read about passing arguments to Python scripts here: https://docs.python.org/3/library/sys.html.

**optional**  
Make it so that your code can query nested sub-categories e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python download.py #SOME_CATEGORY# #NESTING_LEVEL#
```
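
One way the two command-line arguments above might be read is sketched below using `sys.argv`. The function names and the default nesting level of 0 (only pages directly in the category) are assumptions made for this example, not part of the assignment.

```python
import sys


def parse_args(argv):
    """Return (category, nesting_level) from a list shaped like sys.argv."""
    if len(argv) < 2:
        raise SystemExit("usage: python download.py CATEGORY [NESTING_LEVEL]")
    category = argv[1]
    # Assumed default: no descent into sub-categories when the level is omitted.
    nesting_level = int(argv[2]) if len(argv) > 2 else 0
    return category, nesting_level


def main(argv=None):
    category, nesting_level = parse_args(sys.argv if argv is None else argv)
    print(f"Would download {category!r} down to nesting level {nesting_level}")
    return category, nesting_level
```

In a real `download.py`, `main()` would be called under an `if __name__ == "__main__":` guard so the module can still be imported from a notebook.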

### Part 2 -- Search (required)

Use Latent Semantic Analysis to search your pages. Given a search query, find the 5 articles most closely related to it. SVD and cosine similarity are a good place to start.
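
A minimal sketch of that pipeline with scikit-learn follows: TF-IDF vectors, a truncated SVD to obtain the latent space, and cosine similarity for ranking. The toy corpus, the `search` function, and `n_components=2` are assumptions for illustration only; a real collection would use many more components. In this sketch the SVD is fit once on the documents and each query is projected into the same space with `transform`, so nothing is refit per query.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the pages stored in Mongo (text is made up).
pages = {
    "Random forest": "a random forest is an ensemble of decision trees for classification",
    "Decision tree": "a decision tree is a model for classification and regression",
    "Spreadsheet": "a spreadsheet is business software for tabular data and accounting",
    "Accounting software": "accounting software records business transactions and data",
}
titles = list(pages)

# Step 1: TF-IDF document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(pages.values())

# Step 2: project documents into a low-dimensional latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(tfidf)


def search(query, top_n=5):
    """Return up to `top_n` (title, cosine similarity) pairs, best match first."""
    # Step 3: encode the query with the same vocabulary and the same SVD.
    query_vector = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```

Calling `search("decision tree classification", top_n=2)` ranks the two machine-learning pages ahead of the two business-software pages in this toy corpus.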

**optional**  
Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python search.py #SOME_TERM#
```

### Part 3 -- Predictive Model (optional)

In this part, we want you to build a predictive model from the data you've just indexed. Specifically, when a new article from Wikipedia comes along, we would like to be able to predict which category the article should fall into. We expect a runnable training script that fits a model.

Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python train.py
```

Finally, you should be able to pass the URL of a Wikipedia page and have the script generate a prediction of the best category for that page, along with the probability that this category is correct.

Make it so that your code can be run via a python script e.g.

```bash
$ docker run --rm -v $(pwd):/home/jovyan jupyter/scipy-notebook python predict.py #URL#
```
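
The train-then-predict flow could look roughly like the sketch below. It assumes a TF-IDF plus logistic regression pipeline, which is one reasonable choice rather than a required one, and it uses a made-up four-document training set. `predict_category` is a hypothetical helper that takes page text; downloading the text for a given URL is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the pages and categories stored in Mongo.
texts = [
    "a random forest is an ensemble of decision trees for classification",
    "a decision tree is a model for classification and regression",
    "a spreadsheet is business software for tabular data and accounting",
    "accounting software records business transactions and data",
]
labels = ["Machine learning", "Machine learning", "Business software", "Business software"]

# What train.py might do: fit one pipeline from raw text to category.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(texts, labels)


def predict_category(text):
    """Return (most likely category, its predicted probability) for page text."""
    probabilities = model.predict_proba([text])[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])
```

A `predict.py` along these lines would load the fitted model from disk (for example with `joblib`), fetch the page text for the URL, and print the result of `predict_category`.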

## Infrastructure

We recommend that you run a MongoDB server on a dedicated t2.micro instance. Feel free to run your Jupyter environment either on another instance or locally.
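
Writing pages to that server might be sketched as below. `save_pages` is a hypothetical helper that expects an object behaving like a PyMongo `Collection`; it upserts on title and category so that re-running a download does not create duplicate documents. The host, port, and collection name in the usage comment are placeholders.

```python
def save_pages(collection, pages):
    """Upsert page documents into a MongoDB collection.

    `collection` is assumed to behave like a pymongo Collection;
    `pages` is an iterable of dicts with 'title', 'category' and 'content'.
    """
    written = 0
    for page in pages:
        collection.update_one(
            {"title": page["title"], "category": page["category"]},  # match key
            {"$set": page},                                          # fields to store
            upsert=True,                                             # insert if absent
        )
        written += 1
    return written


# Assumed usage against the remote server (all names here are placeholders):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://<instance-public-ip>:27017")
#   save_pages(client["wiki_content_db"]["machine_learning"], pages)
```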




In [17]:
%run download.py

In [22]:
cmdf_dbmerge = collection_merger_df_maker()
db_merge_df = cmdf_dbmerge.merge_db_dfs(databases_list=['business_software_wiki_db', 'machine_learning_wiki_db'])

In [41]:
client

MongoClient(host=['35.163.182.105:27016'], document_class=dict, tz_aware=False, connect=True)

In [18]:
client.list_database_names()  # database_names() was removed in PyMongo 4.0

['admin',
 'business_software_wiki_db',
 'local',
 'machine_learning_wiki_db',
 'test',
 'wiki_content_db']

In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
cmdm_ml = collection_merger_df_maker()
ml_df = cmdm_ml.merge_collections('wiki_content_db')
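
`collection_merger_df_maker` is defined in `download.py` and is not shown here. As a rough idea of what a `merge_collections`-style helper could do, the sketch below concatenates every collection of a database into one DataFrame. It is a guess at the behaviour, not the actual implementation, and the added `collection` column is an assumption of this example.

```python
import pandas as pd


def merge_collections(db):
    """Concatenate every collection in `db` into a single DataFrame.

    `db` is assumed to behave like a pymongo Database: it exposes
    list_collection_names() and db[name].find().
    """
    frames = []
    for name in db.list_collection_names():
        docs = list(db[name].find())
        if docs:
            # Remember which collection each row came from.
            frames.append(pd.DataFrame(docs).assign(collection=name))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```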

In [8]:
# This cell shows that the content of every duplicated title is identical. Will need to figure out how to deal with
# the dups (if I need to).
# df.drop_duplicates('title') will do the trick, or test_df.drop_duplicates(['title', 'category'])['title']

all_content_equal_bool_list = []

for title in ml_df['title'].unique():
    
    if ml_df['title'].value_counts()[title] > 1:

        mask = ml_df['title'] == title
        previous_x = ''

        for x in ml_df[mask]['content']:
            if previous_x != '':
                all_content_equal_bool_list.append(x != previous_x)
            previous_x = x

sum(all_content_equal_bool_list)

0
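
The same check can be written more compactly with a pandas `groupby`. This is an alternative sketch, not the cell that produced the output above: it assumes a DataFrame with `title` and `content` columns, and `titles_with_conflicting_content` is a name invented for this example.

```python
import pandas as pd


def titles_with_conflicting_content(df):
    """Return the titles whose duplicate rows disagree on 'content'."""
    # Count the distinct content values seen for each title.
    distinct = df.groupby("title")["content"].nunique()
    return sorted(distinct[distinct > 1].index)
```

An empty list corresponds to the `0` above: every duplicated title carries identical content, so dropping duplicates by title loses nothing.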

In [39]:
no_dups_df_ml = ml_df.copy()

In [11]:
no_dups_df_ml = no_dups_df_ml.drop_duplicates(['title', 'category'])['title']

Deeplearning4j                                  6
OpenNN                                          6
ND4J (software)                                 5
BigDL                                           5
Caffe (software)                                5
Hierarchical temporal memory                    4
Brown clustering                                4
Gene expression programming                     4
Random forest                                   4
Self-organizing map                             4
Probabilistic latent semantic analysis          4
Promoter based genetic algorithm                3
Keras                                           3
Kernel principal component analysis             3
TensorFlow                                      3
Logic learning machine                          3
HyperNEAT                                       3
Wolfram Language                                3
Radial basis function network                   3
Boosting (machine learning)                     3


In [12]:
# Making sure that at least one row per title survives dropping duplicates.
len(ml_df.drop_duplicates('title')['title'].value_counts()) == len(ml_df['title'].value_counts())

True

### Label Encode 'Title'

In [13]:
# le = LabelEncoder()
# ml_df['title_num'] = le.fit_transform(ml_df['title'])

### Prepare Document Term Matrix

In [14]:
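import pandas as pd  # not imported in the cells shown above

tfidf_vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
document_term_matrix_sps = tfidf_vectorizer.fit_transform(ml_df['content'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
ml_document_term_matrix_df = pd.DataFrame(document_term_matrix_sps.toarray(),
                                          index=ml_df.title,
                                          columns=tfidf_vectorizer.get_feature_names_out())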

In [15]:
len(ml_document_term_matrix_df.columns)

50882

In [16]:
search_phrase = ['machine learn dataframe american']

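# Encode the search phrase with the vocabulary fitted on the pages.
search_terms_encoded = tfidf_vectorizer.transform(search_phrase)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
search_terms_encoded_df = pd.DataFrame(search_terms_encoded.toarray(),
                                       index=search_phrase,
                                       columns=tfidf_vectorizer.get_feature_names_out())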

In [17]:
search_terms_encoded_df

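| | 00 | 000 | 000001 | 00001 | 0001 | 000198 | 000198ttt01584tft000198ttt0288ttf01584tft00tfffrac | 0001l | 00025 | 00043702 | ... | ﬁltered | ﬁnd | ﬁnding | ﬁnds | ﬁnitelength | ﬁxedsize | ﬂexibility | ﬂock | ﬂocking | ﬂow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| machine learn dataframe american | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |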


### Append the search term to the document term matrix

In [18]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
dtm_with_search_term = pd.concat([ml_document_term_matrix_df, search_terms_encoded_df])

### Compute SVD of Augmented Document Term Matrix

In [19]:
n_components = 50
SVD = TruncatedSVD(n_components)
component_names = ["component_"+str(i+1) for i in range(n_components)]

In [20]:
svd_matrix = SVD.fit_transform(dtm_with_search_term)

In [32]:
svd_df = pd.DataFrame(svd_matrix, 
                      index=dtm_with_search_term.index, 
                      columns=component_names)

In [33]:
svd_df

| | component_1 | component_2 | component_3 | component_4 | component_5 | component_6 | component_7 | component_8 | component_9 | component_10 | ... | component_41 | component_42 | component_43 | component_44 | component_45 | component_46 | component_47 | component_48 | component_49 | component_50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Causality | 0.164693 | 0.036934 | 0.076514 | 0.009385 | -0.120179 | 0.019930 | -0.052851 | 0.027406 | -0.002924 | -0.048473 | ... | 0.108638 | 0.068301 | -0.153553 | 0.005066 | -0.027743 | 0.001345 | 0.085961 | 0.135339 | -0.098554 | -0.004739 |
| Causal inference | 0.123511 | 0.039332 | 0.067196 | 0.034242 | -0.110463 | -0.021258 | -0.081039 | 0.046297 | -0.044909 | -0.109925 | ... | 0.090618 | 0.108369 | -0.155068 | -0.006480 | -0.030430 | 0.002048 | 0.038005 | 0.116140 | -0.109640 | -0.042583 |
| Confounding | 0.127567 | 0.054380 | 0.094587 | 0.017964 | -0.118616 | 0.007569 | -0.019658 | 0.042353 | 0.016351 | -0.018895 | ... | 0.050939 | 0.037546 | -0.066926 | 0.013114 | -0.024777 | -0.003934 | 0.061507 | 0.065443 | -0.032142 | 0.002641 |
| Correlation does not imply causation | 0.097924 | 0.066298 | 0.094109 | 0.006272 | -0.119453 | 0.018481 | -0.021297 | 0.034444 | 0.002838 | -0.039116 | ... | 0.062334 | 0.053849 | -0.110930 | 0.002877 | -0.030912 | 0.005840 | 0.082900 | 0.096363 | -0.072256 | 0.012790 |
| Covariation model | 0.050899 | 0.041446 | 0.052962 | 0.007252 | -0.056772 | 0.000216 | -0.016820 | 0.012845 | -0.013992 | -0.046511 | ... | 0.028074 | 0.021839 | -0.058199 | 0.014487 | 0.002841 | 0.011358 | 0.044132 | 0.030966 | -0.055678 | -0.016239 |
| Difference in differences | 0.258262 | -0.096853 | 0.020293 | 0.027804 | -0.028526 | 0.010413 | -0.067765 | 0.012627 | 0.020052 | -0.056596 | ... | 0.052397 | 0.043461 | -0.058067 | 0.008423 | -0.039078 | -0.008003 | 0.074422 | 0.071833 | -0.030391 | -0.009487 |
| Event correlation | 0.064189 | 0.076570 | 0.011626 | 0.000964 | -0.034424 | -0.005353 | -0.015214 | 0.023495 | 0.022222 | -0.021623 | ... | 0.043175 | -0.009032 | -0.033034 | 0.007491 | 0.011832 | 0.003748 | 0.004150 | 0.013611 | -0.010707 | 0.002031 |
| Experiment | 0.124822 | 0.094932 | 0.121178 | 0.015828 | -0.120033 | 0.001701 | -0.012501 | 0.034900 | -0.018991 | -0.054533 | ... | 0.071262 | 0.015211 | -0.071157 | 0.023952 | -0.016385 | 0.005087 | 0.084573 | 0.041287 | -0.014237 | 0.020374 |
| External validity | 0.102960 | 0.069374 | 0.117224 | 0.005901 | -0.068476 | 0.003026 | -0.023650 | 0.042890 | -0.012799 | -0.052198 | ... | 0.010502 | 0.026826 | -0.060570 | 0.055857 | -0.051099 | 0.012910 | 0.100176 | 0.036581 | -0.034942 | -0.017557 |
| Field experiment | 0.069743 | 0.077629 | 0.062900 | 0.015252 | -0.044646 | -0.011490 | -0.051694 | 0.031218 | -0.017562 | -0.075952 | ... | 0.035024 | 0.021899 | -0.062863 | 0.041660 | -0.013890 | 0.025256 | 0.061730 | 0.004282 | 0.022638 | 0.000520 |


### Identify the Vector for our Search Term

In [34]:
search_term_svd_vector = svd_df.loc[search_terms_encoded_df.index]
search_term_svd_vector

| | component_1 | component_2 | component_3 | component_4 | component_5 | component_6 | component_7 | component_8 | component_9 | component_10 | ... | component_41 | component_42 | component_43 | component_44 | component_45 | component_46 | component_47 | component_48 | component_49 | component_50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| machine learn dataframe american | 0.019804 | 0.026754 | -0.001458 | 0.017989 | -0.009401 | -0.025227 | -0.001261 | -0.008731 | -0.040857 | -0.021613 | ... | -0.001893 | -0.002105 | -0.005874 | 0.00816 | -0.006339 | 0.002136 | -0.012866 | 0.005375 | -0.008234 | -0.012728 |


In [35]:
svd_df['cosine_sim'] = cosine_similarity(svd_df, search_term_svd_vector)

In [60]:
# Showing the top 5 pages for the search. Row 0 is the search phrase itself, so the slice starts at 1.
svd_df[['cosine_sim']].sort_values('cosine_sim', ascending=False)[1:6]

| | cosine_sim |
|---|---|
| Machine Learning (journal) | 0.867978 |
| Category:Machine learning researchers | 0.824893 |
| Journal of Machine Learning Research | 0.823606 |
| Portal:Machine learning/Selected biography | 0.774281 |
| Ofer Dekel (researcher) | 0.773873 |


In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity


def search_py(mongo_database_name, phrase):
    """Return the titles of the 5 pages most similar to `phrase` under LSA."""

    # Merge all collections in the Mongo database into one DataFrame.
    # collection_merger_df_maker comes from download.py (loaded with %run above).
    cmdm = collection_merger_df_maker()
    df = cmdm.merge_collections(mongo_database_name)

    # Build the TF-IDF document-term matrix, indexed by page title.
    tfidf_vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
    document_term_matrix_sps = tfidf_vectorizer.fit_transform(df['content'])
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
    feature_names = tfidf_vectorizer.get_feature_names_out()
    document_term_matrix_df = pd.DataFrame(document_term_matrix_sps.toarray(),
                                           index=df.title,
                                           columns=feature_names)

    # Encode the search phrase with the same vocabulary and append it as an extra row.
    search_phrase = [phrase]
    search_terms_encoded = tfidf_vectorizer.transform(search_phrase)
    search_terms_encoded_df = pd.DataFrame(search_terms_encoded.toarray(),
                                           index=search_phrase,
                                           columns=feature_names)
    # pd.concat replaces DataFrame.append, removed in pandas 2.0.
    dtm_with_search_term = pd.concat([document_term_matrix_df, search_terms_encoded_df])

    # Reduce the augmented matrix to 50 latent components.
    n_components = 50
    SVD = TruncatedSVD(n_components)
    component_names = ["component_" + str(i + 1) for i in range(n_components)]
    svd_matrix = SVD.fit_transform(dtm_with_search_term)
    svd_df = pd.DataFrame(svd_matrix,
                          index=dtm_with_search_term.index,
                          columns=component_names)

    # Score every row against the search phrase's latent vector.
    search_term_svd_vector = svd_df.loc[search_terms_encoded_df.index]
    svd_df['cosine_sim'] = cosine_similarity(svd_df, search_term_svd_vector).ravel()

    # Row 0 is the search phrase itself (similarity 1), so return rows 1 through 5.
    return list(svd_df[['cosine_sim']].sort_values('cosine_sim', ascending=False)[1:6].index)

In [12]:
client.list_database_names()  # database_names() was removed in PyMongo 4.0

['admin',
 'business_software_wiki_db',
 'local',
 'machine_learning_wiki_db',
 'test',
 'wiki_content_db']

In [14]:
search_py('machine_learning_wiki_db','machine learn dataframe american')

['Machine Learning (journal)',
 'Category:Machine learning researchers',
 'Portal:Machine learning/Selected biography',
 'Journal of Machine Learning Research',
 'Ofer Dekel (researcher)']