# Introduction

This presentation aims to equip group members with the tools and knowledge needed to take on the upcoming Quant Quest challenge.

## Some Tools
This section lists some Python libraries that might be helpful:
1. Numerical analysis
  * [numpy](http://www.numpy.org/) - Linear algebra, matrix and vector manipulation
  * [pandas](http://pandas.pydata.org/) - Data analysis, data manipulation
2. Machine learning
  * [scikit-learn](http://scikit-learn.org/stable/) - General machine learning. Supports both basic and advanced algorithms, but only runs on the CPU.
  * [theano](http://deeplearning.net/software/theano/) - Deep learning framework.
  * [tensorflow](https://www.tensorflow.org/) - Another deep learning framework.
3. Natural language processing
  * [nltk](http://www.nltk.org/) - General NLP
  * [gensim](https://radimrehurek.com/gensim/) - Topic modeling
4. Utilities
  * [beautiful soup](https://www.crummy.com/software/BeautifulSoup/) - Parsing and navigating HTML/XML documents
  * [urllib2](https://docs.python.org/2/library/urllib2.html) - Opening URLs, lightweight scraping.
  * [wikipedia](https://wikipedia.readthedocs.io/en/latest/quickstart.html) - Fetching content from Wikipedia
  
## Download
You can get most of these libraries from the [Anaconda distribution](https://www.continuum.io/downloads) or from the links above.

# Machine Learning Pipeline
1. Obtain data
  * Either from scraping, downloading, or other means.
2. Preprocess data
  * Remove unwanted data.
  * Filter out noise.
  * Partition the data into a *training set*, a *validation set*, and a *test set*.
  * Scale, shift, and normalize.
3. Find a good representation
  * The purpose of this step is to find a representation that captures the structure of the data better than the raw input does.
  * In NLP, a good representation can be *word count*, or *tf-idf*.
  * Dimensionality reduction.
4. Training the classifier/regressor
  * People often use [k-fold cross-validation](https://www.cs.cmu.edu/~schneide/tut5/node42.html).
  * *Training* is often done using gradient descent.
  * Hyper-parameters tuning.
5. Testing
  * Accuracy, false-positive rate, false-negative rate, F1 score, etc.
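
The k-fold cross-validation mentioned in step 4 can be sketched in plain Python. This is only an illustration of the idea, not a replacement for a library routine; the helper name `k_fold_indices` and its `seed` parameter are made up for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Every other fold is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print("train: %d samples, validation: %d samples" % (len(train_idx), len(val_idx)))
```

Each sample lands in the validation fold exactly once, so every data point is used for both training and validation across the k rounds.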

## Obtain data
This section introduces basic tools for downloading a text corpus from Wikipedia articles. We will download text from the Wikipedia article of each company in the S&P 500.

In [11]:
import urllib2
import string
import time
import os
from bs4 import BeautifulSoup, NavigableString
import wikipedia as wk


In [12]:

def initOpener():
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    return opener
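
The opener above replaces the default `User-agent` header (which identifies the client as `Python-urllib`) with a browser-like one, presumably because some servers reject the default. `urllib2` exists only in Python 2; if you are on Python 3, a roughly equivalent opener could look like the sketch below. The name `init_opener` is just a renamed stand-in for `initOpener`, and no request is sent here.

```python
import urllib.request

def init_opener():
    # Python 3 counterpart of urllib2.build_opener()
    opener = urllib.request.build_opener()
    # Replace the default 'Python-urllib/x.y' User-agent header
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    return opener

opener = init_opener()
print(opener.addheaders)
```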

The function below returns a dictionary whose keys are *stock tickers* and whose values are the relative links to the corresponding Wikipedia articles. The article *titles* extracted from these links are then used for scraping.

In [13]:
def getSP500Dictionary():
    stockTickerUrl = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

    usableStockTickerURL = initOpener().open(stockTickerUrl).read()

    stockTickerSoup = BeautifulSoup(usableStockTickerURL, 'html.parser')

    stockTickerTable = stockTickerSoup.find('table')

    stockTickerRows = stockTickerTable.find_all('tr')

    SP500companies = {}

    stockBaseURL = 'https://en.wikipedia.org'

    for stockTickerRow in stockTickerRows:
        stockTickerColumns = stockTickerRow.find_all('td')
        counter = 1
        for element in stockTickerColumns:
            # Stock Ticker
            if (counter % 8) == 1:
                stockTicker = element.get_text().strip().encode('utf-8', 'ignore')
                counter = counter + 1
            # Corresponding link to wiki page
            elif (counter % 8 == 2):
                SP500companies[stockTicker] = element.find('a', {'href': True}).get('href')
                counter = counter + 1

    return SP500companies

The cell below uses the *wikipedia* package to load the summary paragraph of each company's Wikipedia article.

In [18]:
import codecs
import wikipedia as wk
import sys
import json

SP500dict = getSP500Dictionary()
err = []
data = []
comp_name = []
for k, v in SP500dict.iteritems():
    # k: ticker, v: relative link to the company's Wikipedia page
    v_str = str(v)
    pageId = v_str.split('/')[-1]
    pageId = pageId.replace('_',' ')
    try:
        data.append(wk.summary(pageId).encode('utf-8'))
        comp_name.append(pageId.encode('utf-8'))
    except Exception:
        # Record the pages that could not be loaded so they can be inspected later
        err.append((k,v))
# Dump the data into json file for later use
with open('data.json', 'w') as outfile:
    json.dump((data, comp_name), outfile)
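
Page titles derived by splitting the link keep any percent-encoding that the link contains, which is why names such as `O%27Reilly Automotive` show up later in this notebook. A minimal sketch of decoding them, assuming Python 3 (where the function lives in `urllib.parse`; in Python 2 it is `urllib.unquote`). The helper name `href_to_title` and the example links are illustrative, not taken from the scraped table.

```python
from urllib.parse import unquote

def href_to_title(href):
    """Turn a relative link such as '/wiki/Genuine_Parts' into a page title.

    Hypothetical helper for illustration only.
    """
    page_id = href.split('/')[-1]          # keep the last path component
    page_id = unquote(page_id)             # decode %27 -> ', %26 -> &, etc.
    return page_id.replace('_', ' ')       # underscores stand for spaces

print(href_to_title('/wiki/O%27Reilly_Automotive'))
print(href_to_title('/wiki/McCormick_%26_Co.'))
```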

In [1]:
import json

with open('data.json') as json_data:
    data_ = json.load(json_data)
data = data_[0]
comp_name = data_[1]
# print 2 companies
print data[10]
print '-----'
print data[11]

BorgWarner Inc. is an American worldwide automotive industry components and parts supplier. It is primarily known for its powertrain products, which include manual and automatic transmissions and transmission components, such as electro-hydraulic control components, transmission control units, friction materials, and one-way clutches, turbochargers, engine valve timing system components, along with four-wheel drive system components.
The company has 60 manufacturing facilities across 18 countries, including the U.S., Canada, Europe, and Asia. It provides drivetrain components to all three U.S. automakers, as well as a variety of European and Asian original equipment manufacturer (OEM) customers. BorgWarner has diversified into several automotive-related markets (1999), including ignition interlock technology (ACS Corporation est.1976) for preventing impaired operation of vehicles.
Historically, BorgWarner was also known for its ownership of the Norge appliance company (washers and drye

## Preprocessing and Feature Representation
Vectorize the documents into a matrix of word occurrences. While counting, filter out stop words.

In [2]:
# Import the method
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the vectorizer with the stop_words option, which eliminates
# common words like 'the', 'a', etc.
count_vect = CountVectorizer(stop_words='english')
# fit_transform method applies the vectorizer on the data set
X_train_counts = count_vect.fit_transform(data)
# The resulting matrix is 498 by 7940. Each row is a document (a Wikipedia summary)
# and each column holds the occurrence count of one word.
print X_train_counts.shape

(498, 7940)
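
To make the contents of the count matrix concrete, here is a tiny pure-Python sketch of the same idea. The two documents, the three-word stop list, and the whitespace tokenizer are all simplifications chosen for this example; `CountVectorizer` uses its own tokenizer and a much larger English stop-word list.

```python
from collections import Counter

docs = [
    "the company makes auto parts",
    "the company sells auto parts and auto accessories",
]
stop_words = {"the", "and", "a"}  # illustrative stop list only

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

# Vocabulary: sorted unique words, one matrix column per word.
vocab = sorted(set(w for doc in tokenized for w in doc))

# Count matrix: one row per document, one column per vocabulary word.
counts = []
for doc in tokenized:
    c = Counter(doc)
    counts.append([c[w] for w in vocab])

print(vocab)
print(counts)
```

The second row contains a 2 in the `auto` column because that word appears twice in the second document.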


$tf(t,d)$ is the number of times term $t$ appears in document $d$.

$df(d,t)$ is the number of documents that contain term $t$.

$idf(t)=\log \frac{1+n_d}{1+df(d,t)} + 1$, where $n_d$ is the total number of documents.

$tfidf(t,d)=tf(t,d)\times idf(t)$

In the scikit-learn implementation, each document's final tf-idf vector is normalized to unit L2 norm.

Tf-idf gives a useful numerical representation of each document. With this representation, we can apply numerical analysis techniques to the data.
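
The formulas above can be checked by hand on a toy count matrix. The sketch below follows them term by term, using the natural logarithm and L2 normalization; the 2-by-3 matrix is made up for this example.

```python
import math

# Toy count matrix: 2 documents (rows) x 3 terms (columns); entries are tf(t,d).
counts = [
    [1, 1, 0],
    [2, 0, 1],
]
n_d = len(counts)
n_terms = len(counts[0])

# df(d,t): number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# idf(t) = log((1 + n_d) / (1 + df(d,t))) + 1
idf = [math.log((1.0 + n_d) / (1.0 + df_j)) + 1.0 for df_j in df]

# tfidf(t,d) = tf(t,d) * idf(t), then each row is scaled to unit L2 norm.
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted] if norm > 0 else weighted)

print(df)
print(idf)
print(tfidf)
```

The first term appears in both documents, so its idf is exactly 1; the other two terms appear in only one document each and receive a larger weight.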

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer()
X_train_tf = tf_transformer.fit_transform(X_train_counts)
print X_train_tf.shape
print 

(498, 7940)


## Clustering
K-means partitions the dataset into K clusters, each represented by its centroid.

In [33]:
from sklearn.cluster import KMeans
# n_clusters is the number of clusters. It strongly affects the quality of the grouping, so experiment with it.
classifier = KMeans(n_clusters = 90, n_jobs=-1)
classifier.fit(X_train_tf)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=90, n_init=10,
    n_jobs=-1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
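
For intuition about what `fit` does, here is a bare-bones sketch of the standard K-means iteration (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation: it uses random initial centroids rather than `k-means++`, a fixed number of iterations, and a made-up 2-D dataset. The function names are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels). Illustration only."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / float(len(members)) for col in zip(*members)]
    return centroids, labels

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this well-separated toy data, the first three points end up sharing one label and the last three the other; which group receives label 0 depends on the random initialization.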

In [34]:
print (classifier.labels_)

[42 30 60 40 41 39 14 51 52 68 29 53 10  3 79 42 49 14 72 50 58 26 68 65 26
 49 71  4 68 69 76 37 70 40 79 50 41 48 26 28 23 70 82 76 73  8 56 37 85 40
 80 37  4 11 52 54 48 32 48  0 28 83 11 73 30 15 11 42 62 61 59 31 73 30  7
 19 83 42 43 48 35 52 62 14 76 26 75 28 36 79 39 52  4 14 36 49 15 41  2 15
 83 60 39 12 53 65 17 35 16 88 55 26 16  3 17 49 60  8  8 18 89 75 12 12 42
 12  2 40  8  8 40 17 24 44 89 24  0  8 23 35 73  8 68 34 89 80 12 62  3  8
 87 89 20 81 58 65 18 81  6 86 33 73 26 32 65 14 81 81 53 30 19 18  8 22 31
 78 20 79 59 61 75 12 46 52 81 77 67 70 38  9 79 51 12 36 71 35  3 54  1 89
 11 26  5 88 47  2 87 75  7 80 51 25 79 32 88 28 26 46 67  6 79 40 28 43 33
 66 85 58 14 10  3  8 16 55 69  7 68 30 14 22  4  9 67 19  8 70 15 38 43 63
 60 65 26 17 68  2 73 19 27  8 82 88  8 11 15 31 67 16 40 44 13 79 79 12 14
 41 24 44  9 39 71 72  7  8 68 81  9 51 70 83 52 69 18  8 63 60 81 79 18 35
 85 40 76 49 60 55 35 12  4 79 22 24 37 64 23 26 62 33  8  3 37 74 83 16 74
  5 13 30 76

In [35]:
import numpy as np
group1 = np.where(classifier.labels_==9)[0]
print [comp_name[x] for x in group1]
print "____"
print [comp_name[x] for x in np.where(classifier.labels_==10)[0]]

[u'O%27Reilly Automotive', u'Advance Auto Parts', u'Delphi Automotive', u'Genuine Parts', u'LKQ Corporation', u'AutoZone Inc', u'AutoNation Inc']
____
[u'Under Armour', u'Xilinx Inc', u'McCormick %26 Co.', u'Under Armour']
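
Rather than querying one label at a time with `np.where`, it can be handy to collect every cluster in a single pass. A small sketch, with a hypothetical helper `group_by_label` and toy stand-ins for `comp_name` and `classifier.labels_`:

```python
from collections import defaultdict

def group_by_label(names, labels):
    """Map each cluster label to the list of names assigned to it.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)

# Toy stand-ins for comp_name and classifier.labels_
names = ['AutoZone Inc', 'Xilinx Inc', 'Genuine Parts', 'Under Armour']
labels = [9, 10, 9, 10]

groups = group_by_label(names, labels)
print(groups[9])
print(groups[10])
```

With the real data, the call would be `group_by_label(comp_name, classifier.labels_)`.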
