# Web Classification - Exploration  


## Get websites to classify
Import the required libraries and connect to the database.

In [10]:
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt

In [2]:
# Connect to SQLite database
con = sqlite3.connect("../data/top_100_shallow_1.sqlite")

cur = con.cursor()

Query the `CrawlHistory` table for the top 100 sites visited (see the **DB_explore.ipynb** notebook for details on this table).

In [3]:
websites = pd.read_sql_query("SELECT DISTINCT arguments FROM CrawlHistory;", con)

websites.rename(columns={'arguments':'url'}, inplace=True)

In [4]:
websites['url'][:10]

0      http://google.com
1     http://youtube.com
2    http://facebook.com
3         http://msn.com
4        http://yelp.com
5        http://bing.com
6     http://twitter.com
7      http://amazon.com
8    http://buzzfeed.com
9     http://answers.com
Name: url, dtype: object
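
As a side note, the query pattern above can be tried without the crawl database. The sketch below builds a small in-memory stand-in and runs the same `SELECT DISTINCT` through pandas. Only the table name `CrawlHistory` and the column `arguments` come from this notebook; the other columns, the sample rows, and the names `con_demo` and `demo` are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the crawl database (hypothetical schema and rows).
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE CrawlHistory (crawl_id INTEGER, command TEXT, arguments TEXT)"
)
con_demo.executemany(
    "INSERT INTO CrawlHistory VALUES (?, ?, ?)",
    [
        (1, "GET", "http://google.com"),
        (1, "GET", "http://youtube.com"),
        (2, "GET", "http://google.com"),  # repeated visit, removed by DISTINCT
    ],
)
con_demo.commit()

# Same query as above; rename() returns a new frame instead of mutating in place.
demo = pd.read_sql_query("SELECT DISTINCT arguments FROM CrawlHistory;", con_demo)
demo = demo.rename(columns={"arguments": "url"})
con_demo.close()
```

`DISTINCT` collapses the repeated visit, so `demo` holds one row per URL. SQL does not guarantee row order without an `ORDER BY`, so sort before comparing results.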

## Get Classifications  

For this we use `WebClassifier`, a simple class we created that drives Selenium WebDriver to query `fortiguard.com` for the classification of a website.

In [5]:
from WebClassifier import WebClassifier

Let us start with a simple use case with a "normal" browser (Firefox).

In [6]:
wc = WebClassifier(invisible=False)
wc.get_classification('reddit.com')

'Newsgroups and Message Boards'

Firefox opened up and we got our classification. Let's try with a list of 10 websites...

In [7]:
wc.get_classifications(websites['url'][:10], verbose=True)


Classifying 10 URLs. Fetching ...
   - PROGRESS: ####################  DONE


['Search Engines and Portals',
 'Streaming Media',
 'Social Networking',
 'Search Engines and Portals',
 'Reference',
 'Search Engines and Portals',
 'Social Networking',
 'Shopping',
 'Entertainment',
 'Reference']

Pretty fast!  

However, since we want to classify the full list of 100 sites, there is no need to watch Firefox do everything. Let's use a headless browser ([PhantomJS](http://phantomjs.org/)) for this one...

In [8]:
wc2 = WebClassifier()

websites['class'] = wc2.get_classifications(websites['url'], verbose=True)
wc2.close()


Classifying 100 URLs. Fetching ...
   - PROGRESS: ####################  DONE
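
A batch like this depends on a remote site answering for every URL. Assuming a single lookup can raise an exception (this notebook does not confirm how `WebClassifier` reports failures), a minimal sketch of a fault-tolerant loop could look like the following. The names `classify_all` and `fake_classify` are hypothetical, and the stub stands in for the real lookup so that the example runs offline.

```python
from typing import Callable, Iterable, List, Optional


def classify_all(
    urls: Iterable[str],
    classify: Callable[[str], str],
    retries: int = 1,
) -> List[Optional[str]]:
    """Classify each URL, recording None when every attempt fails."""
    results: List[Optional[str]] = []
    for url in urls:
        label: Optional[str] = None
        for _ in range(retries + 1):
            try:
                label = classify(url)
                break  # success, stop retrying this URL
            except Exception:
                continue  # try again, or fall through with label = None
        results.append(label)
    return results


# Hypothetical stub in place of a real lookup; unknown URLs raise an error.
def fake_classify(url: str) -> str:
    known = {
        "http://google.com": "Search Engines and Portals",
        "http://amazon.com": "Shopping",
    }
    if url not in known:
        raise LookupError(url)
    return known[url]


labels = classify_all(
    ["http://google.com", "http://unknown.example", "http://amazon.com"],
    fake_classify,
)
```

Because the result list keeps one entry per input URL, it can still be assigned to a DataFrame column, with `None` marking the sites that need a second pass. With the real class, one could presumably pass `wc2.get_classification` as the `classify` argument.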


Got it! Let's look at how the sites are distributed across classes.

In [14]:
websites['class'].value_counts()

News and Media                   18
Entertainment                    18
Reference                         9
Search Engines and Portals        8
Information Technology            7
Shopping                          5
Personal Websites and Blogs       5
Finance and Banking               3
Health and Wellness               3
Business                          3
Newsgroups and Message Boards     3
Streaming Media                   2
File Sharing and Storage          2
Arts and Culture                  2
Sports                            2
Adult Materials                   2
Social Networking                 2
Travel                            1
Job Search                        1
Auction                           1
Education                         1
Web Hosting                       1
Political Organizations           1
Name: class, dtype: int64
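
Raw counts can also be read as shares. The sketch below uses a small toy Series (not the crawl data) to show `value_counts(normalize=True)` and the cumulative share of the largest classes. The names `toy`, `shares`, and `top_share` are made up for this example.

```python
import pandas as pd

# Toy labels standing in for websites['class'] (illustrative values only).
toy = pd.Series(
    ["News and Media", "Entertainment", "News and Media", "Reference"],
    name="class",
)

counts = toy.value_counts()                # absolute counts, largest first
shares = toy.value_counts(normalize=True)  # same order, as fractions of the total

# Fraction of all sites covered by the two most frequent classes.
top_share = shares.iloc[:2].sum()
```

Applied to the table above, the same calculation shows that the two largest classes, News and Media and Entertainment, account for 36 of the 100 sites.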

## Get third-party cookie coverage